A 4090 has 24GB of VRAM allowing you to run a 22B model entirely in memory at FP...

iberator · 2025-11-05T09:03:37 1762333417

That's out of touch for 90% of developers worldwide

brazukadev · 2025-11-05T10:14:03 1762337643

Today. But what about in 5 years? Would you bet we will be paying hundreds of billions to OpenAI yearly or buying consumer GPUs? I know what I will be doing.

Dilettante_ · 2025-11-05T10:35:15 1762338915

But the progress goes both ways: In five years, you would still want to use whatever is running on the cloud supercenters. Just like today you could run gpt-2 locally as a coding agent, but we want the 100x-as-powerful shiny thing.

mcny · 2025-11-05T10:46:57 1762339617

That would be great if that was the case but my understanding is that the progress is plateauing. I don't know how much of this is anthorpic / Google / openAI holding itself back to save money and how much is the state of the art improvement slowing down though. I can imagine there could be a 64 GB GPU in five years as absurd as it feels to type that today.

simonw · 2025-11-05T11:22:20 1762341740

What gives you the impression the progress is plateauing?

I'm finding the difference just between Sonnet 4 and Sonnet 4.5 to be meaningful in terms of the complexity of tasks I'm willing to use them for.

lelanthran · 2025-11-06T16:47:40 1762447660

> I'm finding the difference just between Sonnet 4 and Sonnet 4.5 to be meaningful in terms of the complexity of tasks I'm willing to use them for.

That doesn't mean "not plateauing".

It's better, certainly, but the difference between SOTA now and SOTA 6 months ago is a fraction of the difference between SOTA 6 months ago and the difference 18 months ago.

It doesn't mean that the models aren't getting better, it means that the improvement in each generation is smaller than the the improvement in the previous generation.

simonw · 2025-11-06T17:40:21 1762450821

18 months ago to 6 months ago was indeed a busy period - both multimodal image input and reasoning models were rare at the start of that time period and common by the end of it.

Comparing a 12 month period to a 6 month period feels unfair to me though. I think we will have a much fuller picture by the end of the year - I have high expectations for the next wave of Chinese models and for Gemini 3.

lelanthran · 2025-11-06T19:50:15 1762458615

> Comparing a 12 month period to a 6 month period feels unfair to me though.

Okay. Let me clarify then.

The difference between SOTA now and SOTA 6 months ago is a fraction of the difference between SOTA 6 months ago and SOTA 12 months ago.

That still "plateauing". The performance of the models, should you take the time to chart them, is clearly asymptotic and we're in the flattening out phase now.

I also observe that all the models are converging on roughly the same performance, which makes me think that we are approaching some maxima with the current approach.

sebastiennight · 2025-11-05T15:10:47 1762355447

> a 64 GB GPU in five years

Is there a digit missing? I don't understand why this existing in 5 years is absurd

mcny · 2025-11-06T09:58:01 1762423081

I meant for me it feels absurd today but it will likely happen in five years.

brazukadev · 2025-11-05T11:39:35 1762342775

Not really, for many cases I'm happy using Qwen3-8B in my computer and would be very happy if I could run Qwen3-Coder-30B-A3B.

infecto · 2025-11-05T14:15:56 1762352156

Paying for compute in the cloud. That’s what I am betting on. Multiple providers, different data center players. There may be healthy margins for them but I would bet it’s always going to be relatively cheaper for me to pay for the compute rather than manage it myself.

lelanthran · 2025-11-06T16:51:26 1762447886

> There may be healthy margins for them but I would bet it’s always going to be relatively cheaper for me to pay for the compute rather than manage it myself.

Depends almost completely on usage. No one is renting out hardware 24x7 and making a loss on it.

If you only have sporadic use then renting is better. If you're running it almost all the time of purchasing it outright is better.

infecto · 2025-11-06T17:44:43 1762451083

Sure but we were talking about gaming rigs to run models locally. You are describing some extreme edge folks that are keeping 24/7 work on gaming rigs in your home.

lelanthran · 2025-11-06T19:53:38 1762458818

> Sure but we were talking about gaming rigs to run models locally. You are describing some extreme edge folks that are keeping 24/7 work on gaming rigs in your home.

In that scenario the case is even weaker for the rented-hardware model - if you're going to have a gaming rig, you're only paying a little bit more on top for a GPU with more RAM, not the full cost of the rig.

The comparison then is the extra cost of using a 24GB GPU over a standard gaming rig GPU (12GB? 8GB?) versus the cost of renting the GPU whenever you need it.

infecto · 2025-11-06T20:50:53 1762462253

Honestly not sure what you are talking about.

I could either spend $20 a month for my cursor license.

Or

Spend $2k+ upfront to build a machine to run models locally. Pay for the electricity cost and time to set both the machine and software up.

lelanthran · 2025-11-06T22:56:46 1762469806

> Spend $2k+ upfront to build a machine to run models locally.

You said this was in the context of a gaming rig. You're not spending an extra $2k on your gaming rig to run models locally.

If you're building a dedicated LLM machine OR you're using less compute than you are paying the provider for, then, yup - $20/m is cheaper.

When you start using the model more, or if you're already building a gaming rig, then it's going to be cheaper to self-host.

infecto · 2025-11-07T13:58:48 1762523928

I think the plot flew way over your head. We were comparing costs and for my reading, saying gaming rig is more about consumer grade hardware and not so much an assumption that you already have one. After all I assume we would be buying a 5090 for the vram and at current market price that’s $3k alone. You would probably end up spending at least $20 in electricity and cooling every month of you are running it near 24/7.

So again, the economics don’t really make sense except in specific edge cases or for folks that don’t want to pay vendors. Also please don’t use italics, I don’t know why but every time you see them used it’s always a silly comment.

alfiedotwtf · 2025-11-05T18:34:15 1762367655

Woah, woah, woah. I thought in 5 years time we would all be out of a job lol

jen729w · 2025-11-05T08:44:33 1762332273

Honestly though how many people reading this do you think have that setup vs. 85% of us being on a MBx?

> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.

Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.

Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.

radicalbyte · 2025-11-05T11:05:45 1762340745

> Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.

The good old days of having to do crazy nutty things to get Elite II: Frontier, Magic Carpet, Worms, Xcom: UFO Enemy Unknown, Syndicate et cetera to actually run on my PC :-)

alfiedotwtf · 2025-11-05T18:35:39 1762367739

That crazy Burt thing these days, is quitting Chrome because it’s consuming 90% ram

reaslonik · 2025-11-05T10:30:28 1762338628

>I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.

As long as it's within terms and conditions of whatever agreement you made for that $20. I can run queries on my own inference setup from remote locations too

Foobar8568 · 2025-11-05T08:41:33 1762332093

Yes but they are really less performant than claude code or codex. I really cried with the 20-25GB models ( 30b Qwen, Devstral etc). They really don't hold a candle, I didn't think the gap was this large or maybe Claude code and GPT performs much better than I imagined.

reaslonik · 2025-11-05T10:37:34 1762339054

You need to leave much more room for context if you want to do useful work besides entertainment. Luckily there are _several_ PCIe slots on a motherboard. New Nvidia cards at retail(or above) are not the only choice for building a cluster; I threw a pile of Intel Battlemage cards on it and got away with ~30% of the nvidia cost for same capacity (setup was _not_ easy in early 2025 though).

You can gain a lot of performance by using optimal quantization techniques for your setup(ix, awq etc), different llamacpp builds do different between each other and very different compared to something like vLLM

cmclaughlin · 2025-11-06T14:07:40 1762438060

I also expect local LLMs to catch up to the cloud providers.

I spent last weekend experimenting with Ollama and LM studio. I was impressed at how good Qwen3-Coder is. Not as good as Claude, but close - maybe even better in some ways.

As I understand it, the latest Macs are good for local LLMs due to their unified memory. 32GB of RAM in one of the newer M-series seems to be the "sweet spot" for price versus performance.

ashirviskas · 2025-11-05T08:58:30 1762333110

How much context do you get with 2GB of leftover VRAM on Nvidia GPU?

electroglyph · 2025-11-05T10:43:53 1762339433

you need a couple RTX 6000 pros to come close to matching cloud capability