Unified memory is only a feature because NVidia so aggressively uses VRAM for ma...

Salgat · 2026-06-06T19:21:09 1780773669

To this day I do not get why Intel doesn't just offer massive memory options for their cards. Just charge what it costs to add the extra memory, no upcharge, and they will never be able to keep up with demand. Cheap VRAM is enough to justify a lot of open source investment into challenging CUDA.

zozbot234 · 2026-06-06T19:36:53 1780774613

> To this day I do not get why Intel doesn't just offer massive memory options for their cards.

They seem to? Intel Arc is the cheapest option by far for a discrete card with 32GB VRAM.

Auracle · 2026-06-06T20:21:28 1780777288

That’s not massive, though. Make it 96GB at $2,000 (ok, probably impossible right now, but they could have before the surge in prices) and you’ll see developers work really hard to make AI tooling work for their cards, CUDA be damned. The same goes for AMD.

It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.

zozbot234 · 2026-06-06T20:36:40 1780778200

If you buy three 32GB GPUs, that's 96GB total at a very reasonable price. An AI model splits easily by layers, so running on multiple GPUs is quite feasible.
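A minimal sketch of what "splits by layers" can mean in practice, assuming contiguous blocks of layers are assigned to each card. The function name `split_layers` and the 64-layer model are hypothetical and do not come from any particular framework.

```python
def split_layers(num_layers: int, num_gpus: int) -> list[range]:
    """Assign contiguous blocks of layers to GPUs as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    assignments = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        assignments.append(range(start, start + count))
        start += count
    return assignments


# Hypothetical 64-layer model spread over three cards.
for gpu, layers in enumerate(split_layers(64, 3)):
    print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1} ({len(layers)} layers)")
```

With a split like this, only the activations at each block boundary have to move between cards, which is why the approach tolerates a slow interconnect.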

schubidubiduba · 2026-06-07T15:57:37 1780847857

Doesn't split as easily on an Intel GPU as on an NVIDIA GPU though, in terms of software support. Sure, it's probably not too difficult if you know what you're doing, but I'm not sure how big that market would be.

to11mtm · 2026-06-06T20:31:36 1780777896

They took longer than everyone expected and then shortly after release they made announcements that made people worry that Intel might kill the project the way they tend to kill GPU projects.

(I still kinda want to get one tho.)

htrp · 2026-06-06T21:34:39 1780781679

Missed a zero here.

Needs 320 GB VRAM

ActorNightly · 2026-06-06T21:41:43 1780782103

Memory is just one part. AMD has had offerings competitive to NVIDIA for quite some time, but nobody uses AMD cards.

The biggest advantage with NVIDIA is CUDA.

overfeed · 2026-06-07T00:53:41 1780793621

> but nobody uses AMD cards

AMD is selling every MI card it makes, and the market wants more of them.

ActorNightly · 2026-06-07T20:19:57 1780863597

They are only selling because Nvidia is hard to get, and something is better than nothing.

dahart · 2026-06-06T18:26:52 1780770412

I have so many questions… Since Apple already sells unified memory systems, what is the market opportunity you envision? Do you see Nvidia and Apple as competitors, and how? (And I’m not suggesting they’re not, necessarily, but I want to hear where you’re coming from, and they do have very different markets.) Hasn’t Apple used storage size (RAM & disk) for market segmentation for decades? And how does a machine with 128GB unified mem not potentially cut into some people’s reasons for wanting a 96GB GPU?

jmyeet · 2026-06-06T18:56:10 1780772170

Apple offers relatively affordable options for a high-memory workstation that uses unified memory. They previously offered 256/512GB Mac Studios (both discontinued). Because of this they can keep larger models in memory.

BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:

1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and

2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.

According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.
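A rough sketch of why bandwidth "directly caps tokens/sec", assuming decode reads every weight once per generated token and nothing else matters (no KV cache traffic, no compute limit). The bandwidth figures are the ones quoted above; the 27B model at one byte per parameter is an illustrative assumption, and `decode_rate_ceiling` is a hypothetical helper.

```python
def decode_rate_ceiling(bandwidth_gb_s: float,
                        params_billion: float,
                        bytes_per_param: float) -> float:
    """Upper bound on decode tokens/sec if all weights are read once per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / bytes_per_token


# Bandwidth numbers as quoted in the comment; model size and precision are assumed.
for name, bandwidth in [("5090", 1800.0), ("best current Mac", 900.0)]:
    rate = decode_rate_ceiling(bandwidth, params_billion=27.0, bytes_per_param=1.0)
    print(f"{name}: at most ~{rate:.0f} tokens/sec")
```

Real throughput lands below this ceiling, but the ratio between two machines tends to track the ratio of their bandwidths when decode is memory-bound.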

The Mac Studio last came out in March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP8 support.

You can chain Mac Studios together into a cluster with TB5 too.

But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.

JohnBooty · 2026-06-06T18:56:08 1780772168

I'm not the person you're replying to, but I wholeheartedly agree with them...

Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.

Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.

Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.

Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.

By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.

The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.

    Since Apple already sells unified memory systems, what 
    is the market opportunity you envision?

nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales.

(And, let's face it. Even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them)

So.

Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.

They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.

The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.

zozbot234 · 2026-06-06T20:05:11 1780776311

They use different technology for their VRAM though. Apple, AMD Strix and NVidia DGX/RTX Spark use LPDDR, whereas discrete cards will be either GDDR or HBM. That directly impacts the memory bandwidth figures. As for compute available, Apple and AMD still have very good figures there for what's essentially a general-purpose iGPU that ships as part of the stock system, rather than a special-purpose piece of dedicated hardware.

robotresearcher · 2026-06-06T23:25:18 1780788318

The M5 has 16 dedicated ‘Neural Engine’ cores and a ‘Neural accelerator’ in each of its conventional GPU cores. It’s been pretty special-purpose juiced for inference.

zozbot234 · 2026-06-06T23:39:35 1780789175

When it comes to the very largest models the ANE seems to be only marginally useful for prefill. The M5 Neural Accelerators (NAX) help a lot but at a real cost wrt. power and thermals.

robotresearcher · 2026-06-06T23:42:29 1780789349

Yep, but Apple products don’t spend most of their time running huge models. They are running lots of little ones all the time, using hardware designed for that.

zozbot234 · 2026-06-06T23:49:18 1780789758

It seems that you're agreeing with what I wrote above. They ship a general-purpose stock system and tailor their compute offering towards that. Accelerating 'lots of little models' fits naturally into what they offer, in a way that a more compute-intensive design might not.

robotresearcher · 2026-06-07T00:48:51 1780793331

Yep, I misunderstood your point. Thanks for your patience. In my defense, the 'general purpose system' has a lot of model-inference-specific hardware. But not LLM-specific hardware.

If there's an M5 Ultra it'll be interesting to see what they've optimized it for.

MBCook · 2026-06-06T20:19:05 1780777145

There’s something else. Memory size.

Even if a Mac isn’t the fastest in raw numbers, it may be faster if it can load the whole model in its RAM (went up to 512 GB before shortages) than a couple of 32 GB cards could, with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.

My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at anywhere near that price point.

Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.

JohnBooty · 2026-06-06T22:48:43 1780786123

You're correct about some things but mostly wrong.

Yes, a Mac with 128GB+ will let you load some pretty big models.

However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.

https://www.reddit.com/r/oMLX/comments/1swztoh/m5_max_128gb_...

And that's a 27B model. So yes, an M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - a 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.
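A back-of-the-envelope check on the "can probably fit 120B" claim, assuming weights dominate memory use. The quantization widths are assumptions (the comment does not say which precision it has in mind), and the headroom figure ignores the OS, the KV cache, and runtime overhead.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight size in GB: parameter count times bytes per parameter."""
    return params_billion * bits_per_param / 8


MEMORY_GB = 128  # the M5 Max configuration discussed above

for bits in (16, 8, 4):
    need = weight_footprint_gb(120, bits)
    verdict = "fits" if need < MEMORY_GB else "does not fit"
    print(f"120B at {bits}-bit: {need:.0f} GB of weights, {verdict} in {MEMORY_GB} GB")
```

Under these assumptions only the more aggressive quantizations leave meaningful room for context, which matches the "with room left over" wording only at around 4 bits per weight.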

So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.

The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.

   My understanding is this is the advantage that’s pushing huge Mac Studio demand.

This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too.

zozbot234 · 2026-06-06T23:07:06 1780787226

12 tok/s can absolutely be "usable output" depending on what you're doing. I agree though that the 27B dense model often feels slow due to an overall weakness of memory throughput on that particular platform. Most real-world 120B models though will be MoE-based with only a small fraction of active parameters, and these run quite well. Also, dense models can benefit from batching, which is at least marginally viable with Qwen if you stick to shorter contexts and smaller batches.
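A rough sketch of the MoE and batching points, assuming decode is memory-bound, only the active parameters are read per step, and a batch shares one pass over the weights. The 614 GB/s figure is the M5 Max number quoted earlier in the thread; the 10B active parameter count, 4-bit weights, and the helper name `aggregate_rate_ceiling` are illustrative assumptions. KV cache traffic is ignored, and it matters a lot at long contexts.

```python
def aggregate_rate_ceiling(bandwidth_gb_s: float,
                           active_params_billion: float,
                           bytes_per_param: float,
                           batch_size: int = 1) -> float:
    """Upper bound on total tokens/sec across a batch sharing one weight read per step."""
    bytes_per_step = active_params_billion * 1e9 * bytes_per_param
    steps_per_sec = (bandwidth_gb_s * 1e9) / bytes_per_step
    return steps_per_sec * batch_size


BANDWIDTH = 614.0  # GB/s, as quoted for the M5 Max

dense = aggregate_rate_ceiling(BANDWIDTH, 27.0, 0.5)
moe = aggregate_rate_ceiling(BANDWIDTH, 10.0, 0.5)  # hypothetical 120B MoE, 10B active
batched = aggregate_rate_ceiling(BANDWIDTH, 27.0, 0.5, batch_size=4)

print(f"dense 27B, batch 1: ~{dense:.0f} tok/s ceiling")
print(f"MoE with 10B active, batch 1: ~{moe:.0f} tok/s ceiling")
print(f"dense 27B, batch 4: ~{batched:.0f} tok/s aggregate ceiling")
```

The batching gain is only a ceiling: once the batch is large enough, compute rather than bandwidth becomes the limit.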

zozbot234 · 2026-06-06T17:33:20 1780767200

Even low-VRAM cards are actually very useful for running the comparatively smaller dense layers in large local MoE models. This only requires transferring very small amounts of data across the PCIe bus (similar to pipeline parallelism) so it fits nicely around the existing bottlenecks on that hardware.
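A rough estimate of how small that PCIe traffic can be, assuming one hidden-state vector crosses the bus in each direction per layer per token. The hidden size, layer count, 16-bit activations, and the 32 GB/s link speed are all illustrative assumptions, and per-transfer latency is ignored even though it may dominate in practice.

```python
def activation_bytes_per_token(hidden_size: int,
                               num_layers: int,
                               bytes_per_value: int = 2,
                               crossings_per_layer: int = 2) -> int:
    """Bytes moved over the bus per token if only hidden states cross it."""
    return hidden_size * bytes_per_value * crossings_per_layer * num_layers


PCIE_BYTES_PER_SEC = 32e9  # assumed link throughput, not a measured figure

per_token = activation_bytes_per_token(hidden_size=4096, num_layers=48)
transfer_us = per_token / PCIE_BYTES_PER_SEC * 1e6

print(f"{per_token / 1024:.0f} KiB per token, ~{transfer_us:.1f} microseconds on the link")
```

Compared with the gigabytes of expert weights that stay put in system RAM, traffic on this scale is negligible.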

Melatonic · 2026-06-07T19:45:07 1780861507

It's also ECC RAM, but to be fair, yes, quite overpriced. The RTX Pro line is basically what the Titan line used to be, but way, way more expensive.

woodson · 2026-06-06T18:14:24 1780769664

> 5090 ($2k MSRP but realistically $3-3.5k)

These days, more like >$4.1K (at least in the US).

simonebrunozzi · 2026-06-06T20:58:58 1780779538

What should Apple do, in your view, to "embrace" it?

Nevermark · 2026-06-06T23:22:10 1780788130

Mx Extreme = 2 x Mx Ultra = more cores. (Opportunity: processor chiplets could be designed to integrate in higher quantities.)

Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.

Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.

Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RDMA bandwidth.