
Vega... unfortunately kinda sucks.

It's not amazing at compute (even though it's a member of the GCN family, which I have been a fan of since its inception), and it ended up too expensive on both perf/$ and perf/watt.

The only thing it did was make Nvidia rush Series 10 out the door and make it too good. Nvidia hasn't managed to match the gen-to-gen uplift Series 10 delivered since, all because AMD made Nvidia blink.

Basically, you're 2 gens too early. CDNA2/gfx90a is the minimum you need to get any meaningful performance out of inference, or maybe CDNA1/gfx908 if you really don't need to quantize at all.

BTW, I did suggest this elsewhere in this HN story, but have you tried just disabling KV quant entirely? That is a huge speed uplift for compute-poor users.
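If it helps, here's a rough sketch of what I mean by "disabling KV quant": strip the cache-type flags from your launch command so llama.cpp uses its default cache type. This assumes the flags are `-ctk`/`--cache-type-k` and `-ctv`/`--cache-type-v` and that leaving them off gives you an f16 cache, which is how I understand current builds to work; check `--help` on yours. The helper name `without_kv_quant` and the example command are made up for illustration.

```python
def without_kv_quant(args):
    """Return a copy of a llama.cpp argument list with KV cache type flags removed.

    Assumption: the flags are passed as separate tokens ("-ctk q8_0"),
    not in "--cache-type-k=q8_0" form. Hypothetical helper, not part of llama.cpp.
    """
    kv_flags = {"-ctk", "--cache-type-k", "-ctv", "--cache-type-v"}
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This token is the value of a KV cache flag we just dropped.
            skip_next = False
            continue
        if arg in kv_flags:
            skip_next = True
            continue
        cleaned.append(arg)
    return cleaned


# Example launch command (model path and other options are placeholders).
cmd = ["llama-server", "-m", "model.gguf", "-ctk", "q8_0", "-ctv", "q8_0", "-ngl", "99"]
print(" ".join(without_kv_quant(cmd)))
```

Run the same prompt before and after and compare tokens/sec; the trade-off is that an unquantized cache takes more VRAM at the same context length.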

Also, llama.cpp's support for gfx906 is probably never going to be as good as it is for other cards, and good ROCm support for cards that predate AMD rebooting the driver/stack team is probably never going to materialize. I don't see the point in hanging onto them.

Like, if I was in your place, replacing it with even a 9060 XT, with half the RAM, would be a step up. They go for $450. People have been building dedicated inference machines with these and they've been amazing, just throwing 3 or 4 in and scaling VRAM to meet needs.
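Back-of-the-envelope version of that scaling, as a quick sketch. The $450 price is the figure above; the 16 GB per card is my assumption (the 16 GB variant), and the function name is just for illustration. Real usable memory will be a bit lower, since each card carries its own overhead and the model has to be split across them.

```python
def multi_gpu_budget(cards, vram_gb_per_card, price_per_card):
    """Return (total VRAM in GB, total cost in dollars) for a multi-card build."""
    return cards * vram_gb_per_card, cards * price_per_card


# Assumed: 16 GB per card; $450 per card as quoted above.
for n in (3, 4):
    vram, cost = multi_gpu_budget(n, 16, 450)
    print(f"{n} cards: {vram} GB VRAM for about ${cost}")
```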