Love this, even if can't use it atm (not got the h/w - only 96gb on M2 Max). I g...

embedding-shape · 2026-05-15T11:36:08 1778844968

> even if can't use it atm (not got the h/w - only 96gb on M2 Max).

Not sure if it works differently on macOS, but with CUDA + DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf I can fit it within 96GB of VRAM, context included, so in theory you should be able to as well, unless macOS reserves several GB of RAM/VRAM for the OS/display by default.

ljosifov · 2026-05-15T14:27:03 1778855223

On 96GB I can give up to about 88GB to the GPU with sysctl iogpu.wired_limit_mb=88000 without any ill effects. When pushed higher I tend to notice e.g. graphics driver errors, YouTube pages not working, and other semi-random glitches. So I could just about fit the ~80 GB DS4-flash quants, leaving some extra for the KV caches. Will try. I'm curious how DS4 degrades as context depth grows, i.e. how fast tok/s drops. E.g. the lowest 2-bit quant of MiniMax-M2.6 runs, but starts at low tok/s and degrades fast with context depth.
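
A rough sketch of that budget arithmetic, using only the figures quoted above (88000 MB wired limit, ~80 GB of weights). The function name is hypothetical, and it assumes 1 GB = 1000 MB for simplicity; real headroom also depends on compute buffers and whatever else is using the GPU.

```python
def gpu_headroom_gb(wired_limit_mb: int, model_gb: float) -> float:
    """GB left under the GPU wired limit after loading the model weights.

    Assumption: 1 GB = 1000 MB, and nothing else is counted against the limit.
    """
    return wired_limit_mb / 1000 - model_gb


# Figures from the comment: iogpu.wired_limit_mb=88000, ~80 GB DS4-flash quant.
headroom = gpu_headroom_gb(88000, 80.0)
print(f"Left for KV cache and buffers: {headroom:.1f} GB")
```

With those numbers the headroom comes out at about 8 GB, which is why a quant closer to 70 GB would be noticeably more comfortable.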

The biggest models I can comfortably run are about 1/2 the DS4F size, like gpt-oss-120b. Lately I was toying with Ling-2.6-flash. Got the agents to adapt existing Metal kernels in llama.cpp, and it did run (model https://huggingface.co/ljupco/Ling-2.6-flash-GGUF, branch https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flas...). It's 104B-A7.4B, and for the M2 Max 7.4B active is about the most it can take while still producing 40 tok/s. The hybrid arch also allows for graceful degradation: still close to 30 tok/s at 64K context depth.

Too bad L2.6F, while the best I have, is not that much better in agentic benchmarks than my current incumbent local LLM (nemotron-cascade-2). Got inspired by DS4 to start an l26f branch (WIP https://github.com/ljubomirj/l26f). :-) Trying to squeeze the most from L2.6F. There should be low-hanging fruit in good integration of the agent and the inference engine. On input: considering the huge difference between cached and non-cached tokens. On output: considering that the NN gives us the complete set of logits for the whole 200K+ token vocabulary.
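
To make the two ideas concrete, here is a toy sketch, not the l26f code. On the input side, an agent that knows which tokens are already in the KV cache only needs to prefill the suffix that differs. On the output side, having the full logits lets the engine restrict the choice to tokens the agent can actually accept. Both function names, the token IDs, and the logits are made up for illustration.

```python
import math


def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the new prompt shares with the cached one."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def constrained_pick(logits, allowed_ids):
    """Pick the highest-logit token among allowed_ids.

    Returns (token_id, probability), with the probability renormalised
    over the allowed set only (softmax restricted to allowed_ids).
    """
    allowed = list(allowed_ids)
    m = max(logits[i] for i in allowed)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in allowed}
    z = sum(exps.values())
    best = max(allowed, key=lambda i: logits[i])
    return best, exps[best] / z


# Input side: only the non-shared suffix needs a fresh prefill.
cached = [1, 2, 3, 4]
prompt = [1, 2, 3, 9, 10]
reuse = shared_prefix_len(cached, prompt)
print(f"reuse {reuse} cached tokens, prefill {len(prompt) - reuse} new ones")

# Output side: token 1 has the top logit overall, but only 0 and 2 are allowed.
token, prob = constrained_pick([0.0, 5.0, 1.0, 4.0], allowed_ids=[0, 2])
print(token, round(prob, 3))
```

The point of the second function is that a sampler which only sees the top token would have picked something the agent must reject, while the full logits make the best acceptable token available in the same step.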

zozbot234 · 2026-05-15T09:39:09 1778837949

It should work with 96GB, especially with a limited context. But the M2 Max is a bit slower, yes.

antirez · 2026-05-15T14:09:25 1778854165

It works on your computer I believe. There are a few positive reports.

ljosifov · 2026-05-15T20:57:15 1778878635

Thanks for the DS4, will give it a try. Was hoping maybe I could re-quantise and shave a few GB... MiniMax-M2.7 Unsloth's UD-IQ2_XXS is down to 65GB; it runs, albeit too slowly to be usable by an agent at context depth. I'm curious whether DS4F being economical with the KV caches translates into keeping up with context. Was hoping the 80GB 2-bit quants might come down to 70GB... that would be more comfortable to run.