I can already run Vicuna(llama) 7B on my 2020, 14" PC laptop at ~3.5 tokens/sec,...

int_19h · on April 20, 2023

It's not just the compute, you need fast memory too.

And 7B and 13B are nowhere near enough to get you GPT-3.5 level of performance, which is where it becomes actually interesting.

We'll get there eventually but I don't think it's right around the corner or anything like that.

brucethemoose2 · on April 20, 2023

Setting the M series aside, the AMD 7000 laptops already have reasonably fast memory. Faster than some old GPUs.

And that trend is accelerating. The latest rumor is that Intel is bringing back the eDRAM cache next (which means it was in planning long before the generative AI craze), and more stacked/on-package memory is just around the corner.

lhl · on April 20, 2023

While 7000U laptops have yet to be benchmarked, dual-channel DDR5/quad-channel LPDDR5 systems top out at about 60GB/s. (The M1/M2 by comparison is at about 100GB/s, and that doubles at each step up through Pro, Max, and Ultra, to 800GB/s.) As a point of reference, top-end consumer GPUs like the RTX 4090 are at about 1000GB/s.

My understanding is that things like V-Cache and eDRAM have limited benefit for dense transformers, as they need to cycle through all or most of the parameters for every token they generate, and the weights are far too large to fit in any cache.
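
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes decoding is purely memory-bandwidth bound, that every weight is read exactly once per generated token, and that the model is quantized to 4 bits per weight (my assumption, not something stated above). It ignores the KV cache, compute limits, and real-world efficiency, so treat the results as ceilings rather than predictions. The function name and the device labels are just illustrative.

```python
def max_tokens_per_sec(params_billion, bits_per_weight, bandwidth_gb_s):
    """Ceiling on decode speed if every weight is read once per token."""
    # Billions of parameters times bytes per parameter gives model size in GB.
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Approximate bandwidth figures quoted in this thread, in GB/s.
BANDWIDTH_GB_S = {
    "dual-channel DDR5 laptop": 60,
    "M1/M2 (base)": 100,
    "M-series Ultra": 800,
    "RTX 4090": 1000,
}

if __name__ == "__main__":
    for name, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_sec(7, 4, bandwidth)
        print(f"{name}: ~{ceiling:.0f} tokens/sec ceiling (7B, 4-bit)")
```

A 4-bit 7B model is about 3.5GB, so a 60GB/s laptop tops out around 17 tokens/sec under these assumptions, and a larger model scales that ceiling down proportionally. A cache of a few tens or hundreds of MB doesn't change the picture much, since the full set of weights still has to stream in from main memory for each token.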