I was getting dangerously close to my weekly Claude Code limit last night so I h...

pixelesque · 2026-05-20T17:16:47 1779297407

Out of interest, what machine and model are you running it on?

I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.

What sort of speed should I be expecting?

I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.

Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)

I'm not expecting it to be instant, but what I'm currently seeing is not really usable.

gcr · 2026-05-20T17:31:25 1779298285

There are two flavors of Qwen 3.6:

- A 27B "dense" model

- A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.

For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.

The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.

flockonus · 2026-05-20T20:11:23 1779307883

For coding tasks 27B is reported to be much more effective, altho you can probably only run 4b or 5b quants @ this memory.

Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.

milch · 2026-05-21T16:19:03 1779380343

I played around with local LLMs on my M4 Max 64GB this weekend and this is exactly what I found. I put Opus 4.7 "head to head" on the same task as Qwen 3.6 and a few other local models. The 35B did not perform well IME - it needed a lot of handholding and even then the final result did not work until a few more tweaks, while Claude one shot the task. The 27B was much better and also one shot the task, but took about ~55min as opposed to about ~15min for Claude. The 27B is probably something that I could happily run for many use cases if I had some faster hardware... the main problem there seems to be that at larger context sizes, prompt decoding can take several minutes.

gcr · 2026-05-23T14:13:57 1779545637

This matches my experience too. The little a3b model is quite capable for its size class, as is the 27B model, but it’s still an order of magnitude less effective than Claude on the “effectiveness / time” curve

joelanman · 2026-05-21T07:29:34 1779348574

Using omlx on the M1 max I get about 15tps from 27b

gcr · 2026-05-21T12:29:44 1779366584

interesting! I might give omlx another chance, thank you

pixelesque · 2026-05-20T18:24:50 1779301490

Thank you - I'll give that a go!

julianlam · 2026-05-20T18:08:30 1779300510

May I ask why the M instead of XL?

Obviously bigger != better but I don't know what the differences are.

DiabloD3 · 2026-05-20T21:13:52 1779311632

These are dynamic quants, and they're basically just an indication of how far away from the desired quant it is allowed to go to achieve the goal. Generally, unsloth's toolchain moves quants up, rarely down.

* _0 and _1 do not use K quant and scales 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.

* K quants do something similar, but now splits blocks into subblocks inside a superblock where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using less bits.

* K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.

* Dynamic quant on top of this scales entire layers, full of blocks, according to how much they effect various measurements (such as KLD and perplexity).

That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M only is meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8bit in their tensor units (but it doesn't).

Unless you're specifically trying to cure a problem, stick with K_XL.

srcrip · 2026-05-20T23:01:08 1779318068

You seem to understand this stuff pretty well, any recommendations on resources (blogs, YouTube channels, whatever) for software engineers that want to keep up with this stuff on this kind of level?

A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.

gcr · 2026-05-21T12:23:13 1779366193

Unsloth’s guides on getting various models running are great starting-off points for the “practicioner’s side” of things. Note that they include settings for llama-cpp, ollama, and other runtimes in addition to their own “unsloth studio” (their product seems like overkill imo)

If you’re curious about what a particular switch does, clone the llama-cpp repository to your computer and try asking your favorite pet rock prompts like “This is llama-cpp. Can you look at what the -ctk parameter does and explain to me?” Giving Claude/codex/whatever access to the actual code goes a long way, but it is just one opinion.

If you’d like to learn how transformer-based language modeling works in detail, I suggest starting with chapter 0 or 1 of https://arena-chapter0-fundamentals.streamlit.app/ depending on your skill level, then use that to work your way to reading research papers.

Graduate students who study these topics are generally as annoyed by the “get rich quick” style of advertising as you are, so the deeper you go toward academic research the quieter those voices tend to get, mercifully. That said, this is balanced by the unfortunate fact that top labs have strong posturing signals they try to send, so it can be hard to see which preprints actually have good ideas, which are trying to promote their group’s tech instead of doing science out of curiosity, and which have authors who’ve innocently deluded themselves into overfitting their own pet projects. Read widely but adversarially, test everything but hold fast to the good stuff, etc etc

rao-v · 2026-05-20T23:37:56 1779320276

Hey some of us are on hardware (gfx906 based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that inference significantly faster with Q_0 and Q_1 quants

DiabloD3 · 2026-05-21T13:06:42 1779368802

Vega... unfortunately kinda sucks.

Its not amazing at compute (yet is a member of the GCN family, which I have been a fan of since its inception) and ended up being too expensive for perf/$ and perf/watt.

The only thing it did was make Nvidia rush Series 10 out the door and make it too good. Nvidia has been unable to live up to the gen-to-gen uplift Series 10 did, all because AMD made Nvidia blink.

Basically, you're 2 gens too early. CDNA2/gfx90a is the minimum you need to get any meaningful performance out of inference, or maybe CDNA1/gfx908 if you really don't need to quantize at all.

BTW, I did suggest this elsewhere in this HN story, but have you tried just disabling KV quant entirely? That is a huge speed uplift for compute-poor users.

Also, llama.cpp's support for gfx906 is probably never going to as good as it is for other cards, and good ROCm support for cards before they rebooted the driver/stack team is probably never going to materialize. I don't see the point in hanging onto them.

Like, if I was in your place, replacing it with even a 9060xt, with half the RAM, would be a step up. They go for $450. People have been building dedicated inference machines with these and they've been amazing, just throwing in 3 or 4 in, and scaling VRAM to meet needs.

rao-v · 2026-05-29T06:10:50 1780035050

I'd have to try the KV cache trick but folks get pretty competitive speeds with the current 31B/27B dense models e.g. https://www.reddit.com/r/LocalLLaMA/comments/1tc9j6u/mi50s_q...

gcr · 2026-05-21T12:14:46 1779365686

If your hardware fits K_M but not K_XL, should you prefer going down to a lower quantization’s XL or sticking to the higher quant’s Q_M?

DiabloD3 · 2026-05-21T12:33:25 1779366805

The correct answer should be "try it!"

But as models are starting to pack more information into less bits, some weights are just going to end up becoming super important and very sensitive to quant. So, I'd just move down a Q size, and continue with K_XL. Like, I'm betting Q3_K_XL will beat Q4_K_M on any given model in real world testing, even though its ~20% smaller, but perform worse on benchmaxxing.

The only exception I could think of is quantizing small models, like, my testing on Gemma E2B/E4B and Qwen 3.5 9B, quantizing at all was super noticeable... they can't spread the error across more weights.

Good news (at least for me), 24GB of VRAM is enough to store either of those in BF16 and then a ton of room for F16/F16 KV cache.

khimaros · 2026-05-21T07:02:09 1779346929

MTP recommended

gcr · 2026-05-21T12:11:57 1779365517

on my M1 Max, MTP consistently lowers my performance! I’ve tried both llama-cpp’s recently landed MTP support (cloned and built Tuesday) as well as one of the other forks a few weeks ago. Suspect nobody’s done a comparison on hardware like mine.

DiabloD3 · 2026-05-20T20:32:13 1779309133

I recommend sticking with the dense models for both Qwen and Gemma.

On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.

By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.

Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.

https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/

Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.

I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.

Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.

booty · 2026-05-20T19:31:25 1779305485

    I tried the qwen3.6-27b Q6_k GUFF in llama.cpp 
    and LM Studio on my M2 MacBook Pro 32GB machine 
    last week, and I barely get a token a second with either.

The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.

(not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)

Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.

mark_l_watson · 2026-05-20T19:51:04 1779306664

You are using Q6 6 bit quantization; on my 32G MacMini I use Q4 and it is faster but when I use it with OpenCode, I set up a task and go outside to walk for ten minutes. Smart, capable, and slow. Still, I love using local models.

EDIT: I run with context wired at 64K

mft_ · 2026-05-20T17:44:46 1779299086

The 27B model is dense, so is relatively slow. The 35B-A3B model is marginally weaker but being MoE is much faster - like ~4-8x faster in basic benchmarks on my M1 Max.

For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:

Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).

Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.

stebalien · 2026-05-20T22:05:40 1779314740

Have you tried enabling MTP? Those numbers are similar to what I was getting on my Strix Halo box, but configuring/enabling MTP doubled the TG speed of the 27B model (18-20 t/s now).

mft_ · 2026-05-21T19:25:52 1779391552

Thanks - I’m in the process. I’ve tried briefly, but so far it appears marginally slower. (Noting that llama-bench doesn’t support MTP yet so you’re reduced to running different prompts and eyeballing the log.)

So I’m assuming I’ve done something wrong along the way, but I’ve not had time yet to explore it.

pixelesque · 2026-05-20T18:25:01 1779301501

Thanks for the info.

Figs · 2026-05-20T17:30:02 1779298202

27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.

pixelesque · 2026-05-20T18:25:11 1779301511

Thanks!

satvikpendem · 2026-05-20T19:36:43 1779305803

Check out Unsloth Studio it provides MTP support now which 2x the token generation speed with no loss of accuracy: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

127 · 2026-05-20T22:10:27 1779315027

I get 150t/s peak, 120t/s avg with Qwen3.6 27B Q4 with a 4090 on Linux. Now that MTP has landed into llama.cpp.

KronisLV · 2026-05-20T17:33:22 1779298402

> qwen3.6-27b Q6_k

That's the dense model, you probably want a mixture-of-experts (MoE) one.

Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

pixelesque · 2026-05-20T18:25:22 1779301522

Thanks!

dzr0001 · 2026-05-20T19:42:10 1779306130

My token throughput is much better using vLLM-mlx on my M2 ultra than llama.cpp. It might be worth a shot to give it a try.

electroglyph · 2026-05-20T23:09:37 1779318577

you should be using dflash with that model, look it up

plufz · 2026-05-20T15:33:18 1779291198

Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/

briga · 2026-05-20T15:45:17 1779291917

I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.

SeriousM · 2026-05-20T18:41:25 1779302485

This is the repo: https://huggingface.co/pbhappliedsystems/qwen3.6-27B-gguf-Q4...

gcr · 2026-05-20T17:19:31 1779297571

here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.

  /Users/gcr/llama.cpp/build/bin/llama-server
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
      --no-mmproj-offload
      --fit on
      -c 65536 # edit to taste
      --reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.

I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.

For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.

You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).

Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc

Take backups and then go have fun. Hope this helps.

irishcoffee · 2026-05-21T02:24:44 1779330284

I have a 5070TI (16gb VRAM) with 32GB system ram and a 16 core AMD cpu. I am considering buying a second used videocard, probably the same model, but not for months yet. This hardware setup is new-for-me in that a buddy gave me most of it and I bought the TI card.

Are there any resources to help me figure out how to best optimize my runtime paramaters for a given model, based on a given task, similar to what you've shown?

I've been a little... irritated? that hooking vscode up to my company LLM subscription seems so much more out-of-the-box capiable than what I can get to work. My assumption at the moment is that I need to create a lot of... I think they're called harnesses? agents? workflows? integrations? (not sure) by hand. Is that accurate?

Right now I have ollama running an nvidia nano model and I can poke it with a stick over a web interface I installed. It works, initial token response is slow, after that it seems fine enough.

I can't seem to get a good handle on how much context I've used, when context usage starts to degrade response accuracy, or in general how to mirror the results I get (not in terms of accuracy or speed, just features) from the company github copilot + vscode integration.

I was also trying to get a plugin called qodeassist working via qtcreator, mixed results there as well.

I've been keeping up with this space since the jump, never paid for a sub, work gave me a sub a handful of weeks ago, so the actual useage is all new to me.

I can't say I'm super impressed with any of it relative to the hype, but I found it neat to be able to point vscode at a c++ codebase and say "enable wextra, build the code, tell me if there is any low-hanging fruit I can clean up" and get a useful response.

I also asked my local model to turn a picture of my dog into a picture of an otter, got a blank picture back, which the thinking bit told me it would do. The whole thing was actually kind of funny. "I am allowed to edit pictures, I can't edit pictures, I am allowed to edit pictures, I'll tell the user I did and send a blank picture back because I can't edit pictures, but I am allowed to."

srcrip · 2026-05-20T23:12:02 1779318722

Can you elaborate more on the differences in running ollama or lmstudio? Do they actually slow down the speed of the inference and if so why? Or is it just a preference thing?

gcr · 2026-05-21T14:53:58 1779375238

Ollama and LM-Studio are fine. Their main advantage is that they have a nice way to browse models -- LMStudio from huggingface and Ollama from their own curated list. Both are great ways of getting started. Pick LM-Studio if you'd like a nice GUI frontend to mlx-lm or llama-cpp; pick ollama if you'd like a nice command line interface and don't need non-default parameters.

LM-Studio doesn't support certain parameter combinations. For instance, LM-Studio supports KV quantization....but if you're using the MLX backend, you can't set the context length when KV quantization is used? Why? Running a model with certain settings requires keeping a little SAT solver going in your head. I found that overwhelming, so I just stopped using it.

The Ollama devs want to offer a central curated experience, but I perceive their approach as "playing fast and loose." They've re-implemented unique code for every model they support in their own Go runtime, so certain parameter choices aren't supported. On my hardware, their MLX backend just doesn't work at all without segfaulting the server process for example. It doesn't smack as vibe coded the way oMLX does, but it also doesn't smack as professional or battle-tested.

Ultimately, just dropping down to llama-cpp's GGUF model support and asking for default settings has provided faster inference speeds than anything I've been able to benchmark with them, but everything's within 10% of each other anyway so it's not a huge deal for me.

srcrip · 2026-05-21T23:55:51 1779407751

Thank you, that makes a lot of sense

plufz · 2026-05-21T01:23:49 1779326629

Thanks a million!

ecshafer · 2026-05-20T16:41:16 1779295276

Qwen3.6 with claude code works great. I get a lot better results with that than opencode and qwen3.6. Claude Code is a great harness, and good harness/tool integration makes a big difference. You just have a settings.json with your ollama setup and the qwen model and you can use it.

growt · 2026-05-20T19:34:35 1779305675

Where and how do you run that? I tried it but somehow I always ran out of context or generation was incredibly slow (mbp m4 pro 48gb).

leonidasv · 2026-05-20T15:32:17 1779291137

Qwen Max are usually closed, unfortunately.

mostafab · 2026-05-20T21:49:08 1779313748

That's a signal of being SOTA.

wuliwong · 2026-05-20T19:18:21 1779304701

Do you have a feel for how it Qwen 3.6 compares to Sonnet 4.6? B/C in reality, that's what we use a lot. If we just use Opus 4.7 for everything code related, we'd have a monthly bill 10-20 times higher than using Sonnet where we can.

nl · 2026-05-21T01:26:58 1779326818

I think you could well be surprised by the Sonnet vs Opus bill (assuming you are paying via the API)

In my experience Sonnet bills can be higher than Opus because it churns a lot more trying to get things right.

Example from my fairly simple but agentic benchmark:

Opus 4.7, 25/25, 81c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...

Opus 4.6, 24/25, 61c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...

Sonnet 4.6: 24/25, 41c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...

I only tested the free OpenRouter version of Qwen 3.6 Plus, and it scored 23/25: https://sql-benchmark.nicklothian.com/?highlight=qwen_qwen3....

This doesn't quite show Opus cheaper, but it isn't the 10-20 times more either. Harder tasks close the gap even further.

briga · 2026-05-20T20:43:07 1779309787

I would say if Sonnet is a senior engineer, then Qwen3.6 (the 27b model) is probably closer to a junior engineer. Still capable of getting stuff done, just needs more guidance and makes mistakes more often.

Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.

Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.

kolinko · 2026-05-20T18:00:41 1779300041

As Opus maximalist ;) I was very surprised by the quality if Qwen3.6-27B - trying to figure out how to get it going on RTX 90k now to offload some lighter tasks :)

aembleton · 2026-05-20T20:44:04 1779309844

> Today we introduce Qwen3.7-Max, our latest proprietary model

This is not an open model

chr15m · 2026-05-20T23:58:24 1779321504

This new version is not something you'll be able to run locally. It's a "cloud" model and likely too beefy if they do release the weights.

wouldbecouldbe · 2026-05-20T17:21:03 1779297663

This one doesnt seem to be open source though sadly. Using chinese servers is a step to far for me personally

gcr · 2026-05-20T17:33:59 1779298439

Look for an open release from the Qwen team in the coming weeks. They like to showcase their proprietary models first, which score higher on benchmarks anyway due to model size.

ttoinou · 2026-05-20T19:20:50 1779304850

Which agentic coding tool and how do you make sure you have prefix consistency ?

par · 2026-05-20T17:42:02 1779298922

Do you have an opinion on OpenCode vs Aider?

briga · 2026-05-20T20:44:45 1779309885

I haven't tried Aider yet but perhaps I will. Another one that seems to be getting traction is Pi Coding Agent.

sunaookami · 2026-05-20T20:51:51 1779310311

Aider is still around? That is pre-tool-calling era stuff. Better compare against Pi.

par · 2026-05-20T23:35:13 1779320113

I just started running coding agents locally. So you recommend Pi over opencode? (And obviously aider is out?)

sunaookami · 2026-05-21T11:54:17 1779364457

Haven't tried OpenCode too much but I found it great. It's more batteries included so I would recommend it over Pi if you don't want to write extensions yourself or use community-provided ones (like webfetch and websearch).

anderber · 2026-05-21T03:51:59 1779335519

I personally found better results with Opencode. But Pi is really nice too.