Unified memory is only a feature because NVidia so aggressively uses VRAM for market segmentation.
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.
To this day I do not get why Intel doesn't just offer massive memory options for their cards. Just charge what it costs to add the extra memory, no upcharge, and they will never be able to keep up with demand. Cheap VRAM is enough to justify a lot of open source investment into challenging CUDA.
That’s not massive, though. Make it 96GB at $2,000 (ok, probably impossible right now, but they could have before the surge in prices) and you’ll see developers work really hard to make AI tooling work for their cards, CUDA be damned. The same goes for AMD.
It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.
If you buy three 32GB GPUs, that's 96GB total at a very reasonable price. An AI model splits easily by layers, so running on multiple GPUs is quite feasible.
Doesn't split as easily on an Intel GPU as ona NVIDA GPU though, regarding software support. Sure, it's probably not too difficult if you know what you're doing, but not sure how big that market would be.
They took longer than everyone expected and then shortly after release they made announcements that made people worry that Intel might kill the project the way they tend to kill GPU projects.
I have so many questions… Since Apple already sells unified memory systems, what is the market opportunity you envision? Do you see Nvidia and Apple as competitors, and how? (And I’m not suggesting they’re not, necessarily, but I want to hear where you’re coming from, and they do have very different markets.) Hasn’t Apple used storage size (RAM & disk) for market segmentation for decades? And how does a machine with 128GB unified mem not potentially cut into some people’s reasons for wanting a 96GB GPU?
Apple offers relatively affordable options for a high-memory workstation that uses unified memory. They previously offered 256/512GB Mac Studios (both discontinued). Because of this they can keep larger models in memory.
BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:
1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and
2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.
According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.
The Mac Studio last came out March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP4 support.
You can chain Mac Studios together into a cluster with TB5 too.
But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.
I'm not the person you're replying to, but I wholeheartedly agree with them...
Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.
Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.
Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.
Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.
By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.
The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.
Since Apple already sells unified memory systems, what
is the market opportunity you envision?
nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales.
(And, let's face it. Even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them)
So.
Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.
They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.
The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.
They use different technology for their VRAM though. Apple, AMD Strix and NVidia DGX/RTX Spark use LPDDR, whereas discrete cards will be either GDDR or HBM. That directly impacts the memory bandwidth figures. As for compute available, Apple and AMD still have very good figures there for what's essentially a general-purpose iGPU that ships as part of the stock system, rather than a special-purpose piece of dedicated hardware.
The M5 has 16 dedicated ‘Neural Engine’ cores and a ‘Neural accelerator’ in each of its conventional GPU cores. It’s been pretty special-purpose juiced for inference.
When it comes to the very largest models the ANE seems to be only marginally useful for prefill. The M5 Neural Accelerators (NAX) help a lot but at a real cost wrt. power and thermals.
Yep, but Apple products don’t spend most of their time running huge models. They are running lots of little ones all the time, using hardware designed for that.
It seems that you're agreeing with what I wrote above. They ship a general-purpose stock system and tailor their compute offering towards that. Accelerating 'lots of little models' fits naturally into what they offer, in a way that a more compute-intensive design might not.
Yep, I misunderstood your point. Thanks for your patience. In my defense, the 'general purpose system' has a lot of model-inference-specific hardware. But not LLM-specific hardware.
If there's an M5 Ultra it'll be interesting to see what they've optimized it for.
Even if a Mac isn’t the fastest in raw numbers it may be faster if it can load the whole model in its ram (went up to 512 GB before shortages) than a couple 32 GB cards could with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.
My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at price points anywhere near.
Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.
You're correct about some things but mostly wrong.
Yes, a Mac with 128GB+ will let you load some pretty big models.
However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.
And that's a 27B model. So yes, a M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.
So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.
The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.
My understanding is this is the advantage that’s pushing huge Mac Studio demand.
This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too.
12 tok/s can absolutely be "usable output" depending on what you're doing. I agree though that the 27B dense model often feels slow due to an overall weakness of memory throughput on that particular platform. Most real-world 120B models though will be MoE-based with only a small fraction of active parameters, and these run quite well. Also, dense models can benefit from batching, which is at least marginally viable with Qwen if you stick to shorter contexts and smaller batches.
Even low-VRAM cards are actually very useful for running the comparatively smaller dense layers in large local MoE models. This only requires transfering very small amounts of data across the PCIe bus (similar to pipeline parallelism) so it fits nicely around the existing bottlenecks on that hardware.
Mx Extreme = 2 x Mx Ultra = more cores. (Opportunity: processor chiplets could be designed to integrate in higher quantities.)
Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.
Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.
Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RMDA bandwidth.
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.