*I am not sure how many people will run AI models locally. It still seems like a...

chpatrick · 2026-06-06T13:38:21 1780753101

I think it's niche now because getting the hardware to run it is expensive and the quantized models don't work as well. If those improve then it would be a no-brainer to pay once for the hardware instead of a fortune for API calls.
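A rough sketch of that pay-once-versus-API trade-off. Every number here (hardware price, monthly token volume, API price) is a made-up assumption for illustration, not a figure from this thread, and the function name is hypothetical:

```python
def breakeven_months(hardware_cost, tokens_per_month, api_price_per_million):
    """Months until a one-off hardware purchase costs less than API usage.

    Ignores electricity, resale value, and any quality gap between
    the local model and the hosted one.
    """
    monthly_api_cost = tokens_per_month / 1_000_000 * api_price_per_million
    if monthly_api_cost <= 0:
        raise ValueError("API spend must be positive")
    return hardware_cost / monthly_api_cost


# Hypothetical inputs: $2000 machine, 50M tokens/month, $10 per million tokens.
months = breakeven_months(2000, 50_000_000, 10.0)
print(f"break-even after {months:.1f} months")
```

The result scales linearly with hardware cost and inversely with usage, so light users may never reach break-even.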

dofm · 2026-06-06T14:05:37 1780754737

I am not really convinced that four-bit quantisation is that bad; almost certainly six will be enough. But Google are claiming that their QAT tech in Gemma, which they are surely using or testing in Gemini, preserves nearly source-model quality while reducing footprint.

The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
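For a sense of why a 26B model at four bits fits on that class of machine, here is a back-of-the-envelope sketch of the weight footprint. It counts weights only; KV cache, activations, and per-block quantisation scales are left out, so real usage will be higher. The helper name is made up for illustration:

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


# 26B parameters at 16-bit, 6-bit, and 4-bit precision.
for bits in (16, 6, 4):
    print(f"26B at {bits}-bit: ~{weight_footprint_gb(26, bits):.1f} GB")
```

This gives roughly 52 GB, 19.5 GB, and 13 GB respectively.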

Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.

The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.

jqpabc123 · 2026-06-06T13:54:54 1780754094

AI vendors are attempting to offer the whole apple. And they are spending huge sums of money in the process.

But most businesses don't really care about most of the apple --- they only need their special bite out of it.

For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.

dofm · 2026-06-06T14:17:23 1780755443

I think it is likely to appeal to video and photo editors who want to use AI tools (the press release has a quote from Blackmagic Design, as well as from Adobe, who I think have no stomach for their own cloud AI).

But I don’t know about specialised: this could run quite large models with MoE.

dgellow · 2026-06-06T13:55:22 1780754122

Performance of local models is pretty bad compared to what AI vendors offer; token generation is just too slow to be that useful. And you need to allocate GBs of memory, something that will stay very expensive to buy for a long time.

Running local models will stay niche for a while, unless we see breakthroughs.

jqpabc123 · 2026-06-06T14:01:15 1780754475

Dumb idea --- how about if we limit local models to specific domains --- medicine for example.

Most doctors don't care much about engineering or accounting or software development or 10000 other things that big vendor models address.

This area is yet to be really explored. Nvidia aims to provide the hardware to do so.

CamperBob2 · 2026-06-06T17:59:50 1780768790

That's a fairly obvious idea, not dumb at all, but unfortunately it doesn't seem to pan out. Trying to specialize an LLM in one area harms its 'cognition' in all areas. For instance, if you train a coding model without all the Shakespeare and soap operas and Wikipedia and pirated Stephen King books and ancient Roman history and whatever, you end up with a worse coding model.

I'm not sure anyone really understands why.

jqpabc123 · 2026-06-07T12:27:26 1780835246

https://www.ibm.com/think/topics/domain-specific-llm

CamperBob2 · 2026-06-07T16:57:49 1780851469

The article is not backed up by reality. Why would anyone use anything but a domain-specific LLM, if they actually worked?

The author is probably confusing RAG with pretraining. You can RAG on PubMed but you can't arrive at a competitive model by pretraining solely on it.
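To make the distinction concrete, here is a toy sketch of the retrieval half of RAG: the model's weights are untouched, and domain text is simply looked up and pasted into the prompt. The word-overlap scoring, the three-line corpus, and the function names are all illustrative assumptions; a real system would typically use embeddings and an actual index:

```python
def tokenize(text):
    """Lowercase whitespace tokenisation into a set of words."""
    return set(text.lower().split())


def retrieve(query, corpus, k=1):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=1):
    """Prepend retrieved text to the question; no training is involved."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical stand-in for a domain corpus such as PubMed abstracts.
corpus = [
    "aspirin inhibits platelet aggregation",
    "metformin lowers blood glucose in type 2 diabetes",
    "the roman senate met in the curia",
]
prompt = build_prompt("how does metformin affect blood glucose", corpus)
print(prompt)
```

The general-purpose model still does the reasoning; the domain corpus only supplies context at query time.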