
I can already run Vicuna (LLaMA) 7B on my 2020 14" PC laptop at ~3.5 tokens/sec, and more speed can definitely be squeezed out.

Most future laptops and phones will ship with NPUs next to the CPU silicon. Once they get enabled in software, that means a 16GB machine can run a 13B model, or a 7B model with room for other heavy apps.
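To put rough numbers on the 16GB claim, here is a back-of-the-envelope sketch of the weight footprint at a few precisions. The function name and the list of bit widths are my own choices for illustration, and this counts weights only: the KV cache, activations, and the OS all need room on top.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed precisions: fp16, 8-bit, and 4-bit quantization.
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    for bits in (16, 8, 4):
        gb = weight_footprint_gb(n_params, bits)
        print(f"{name} @ {bits}-bit: {gb:.1f} GB")
```

By this estimate a 13B model only fits comfortably in 16GB when quantized (about 6.5 GB at 4-bit versus 26 GB at fp16), which is presumably what the claim assumes.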

As for the benefits of batching and centralization, that is true, but it's somewhat countered by the high cost of server accelerators and the high profit margins of cloud services.



It's not just the compute; you need fast memory too.

And 7B and 13B are nowhere near enough to get you GPT-3.5 level of performance, which is where it becomes actually interesting.

We'll get there eventually but I don't think it's right around the corner or anything like that.


Setting the M series aside, the AMD 7000 laptops already have reasonably fast memory. Faster than some old GPUs.

And that trend is accelerating. The latest rumor is that Intel is bringing back the eDRAM cache next (which means it was in planning long before the generative AI craze), and more stacked/on-package memory is just around the corner.


While 7000U laptops have yet to be benchmarked, dual-channel DDR5/quad-channel LPDDR5 systems top out at about 60GB/s. (The base M2 by comparison is at 100GB/s, with the base M1 closer to 68GB/s, and that doubles at each step through Pro, Max, and Ultra, up to 800GB/s.) As a point of reference, top-end consumer GPUs like the RTX 4090 are at about 1000GB/s.

My understanding is that things like V-Cache and eDRAM have limited benefits for dense transformers, since they need to cycle through all or most of the parameters for every token generated.
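A rough way to see why bandwidth dominates: if every generated token has to read all the weights once, memory bandwidth divided by model size gives a ceiling on tokens/sec. The sketch below assumes batch size 1, a 7B model at 4-bit quantization (my assumption, not something measured), and ignores compute, the KV cache, and cache hits entirely. The bandwidth figures are the approximate ones quoted above.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, n_params: float,
                       bits_per_weight: float) -> float:
    """Upper bound on decode speed when each token reads every weight once."""
    model_gb = n_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / model_gb


# Approximate bandwidths from this thread, in GB/s.
systems = {
    "dual-channel DDR5 laptop": 60,
    "base M2": 100,
    "RTX 4090": 1000,
}

for name, bw in systems.items():
    # Assumed workload: 7B parameters at 4 bits per weight (3.5 GB).
    print(f"{name}: <= {max_tokens_per_sec(bw, 7e9, 4):.1f} tokens/sec")
```

On these assumptions a 60GB/s laptop tops out around 17 tokens/sec for a 4-bit 7B model, and a few tens of MB of extra cache does not change that much when the weights are several GB.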



