This sounds great! TurboQuant does KV cache compression using quantization via rotations, and ParoQuant [1] does weight compression using quantization via rotations! So we can get 4-bit weights that match bf16 precision, and the KV cache goes down to 3 bits per key. This brings larger models and long contexts into the range of "possibly runnable" on beefy consumer hardware.
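Some back-of-envelope memory math for why this matters. The model shape below is a hypothetical 70B-parameter transformer with grouped-query attention; every number is an illustrative assumption (not any specific model's config), and the whole KV cache is priced at 3 bits for simplicity even though only keys are quoted at 3 bits above:

```python
params = 70e9          # weight count (assumed)
layers = 80            # transformer layers (assumed)
kv_heads = 8           # grouped-query KV heads (assumed)
head_dim = 128         # per-head dimension (assumed)
context = 128_000      # tokens of context (assumed)

def gb(n_bits):
    # bits -> gigabytes
    return n_bits / 8 / 1e9

weights_bf16 = gb(params * 16)   # 140 GB
weights_4bit = gb(params * 4)    # 35 GB

# KV cache stores one key and one value vector per layer per token
kv_elems = 2 * layers * kv_heads * head_dim * context
kv_16bit = gb(kv_elems * 16)
kv_3bit = gb(kv_elems * 3)

print(f"weights: {weights_bf16:.0f} GB bf16 -> {weights_4bit:.0f} GB 4-bit")
print(f"KV cache: {kv_16bit:.0f} GB 16-bit -> {kv_3bit:.1f} GB 3-bit")
```

Under these assumptions the weights alone drop from well over any consumer GPU's memory to something a high-end workstation could hold, and the long-context KV cache shrinks by a bit more than 5x.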
> Yet, the majority of new apps and services that I see are all AI ecosystem stuff.
The same was true of all this computer science stuff too. We built parsers, compilers, calculators, FTP and HTTP, all cool stuff that just built up our own ecosystem. Look how that turned out.
An ecosystem has to hit a critical mass of sophistication before it breaks out to the mainstream. It's not going to take very long for AI.
> The semantics imposed on the bit strings does not exist anywhere in the arithmetic operations,
Correct, the semantics actually lives in the network of relations between the nodes. That has been one of the major lessons of LLMs, and it validates the systems reply to Searle's Chinese Room.
Interesting idea, but I hope people just start switching to ParoQuant and eliminate basically all quantization errors relative to fp16/bf16 even going down to 4-bits:
> The human genome contains around 1.5GB of information and DeepSeek v3 weighs in at around 800GB, so it's a bit apples-to-oranges.
The apples-to-apples comparison is between the human genome and the code behind a particular LLM. The genome defines the structure that learns and thinks, just as the code does for the LLM.
[1] https://github.com/z-lab/paroquant
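A toy illustration of why rotations help quantization, which is the shared intuition here even though the papers' actual constructions differ. A minimal sketch, assuming only the general principle: an orthogonal rotation spreads an outlier's energy across all coordinates, so a uniform quantization grid isn't dominated by one extreme value:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(n):
    # Random orthogonal matrix via QR of a Gaussian matrix
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def fake_quantize(x, bits=4):
    # Symmetric round-to-nearest quantization to a uniform grid
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

n = 256
x = rng.normal(scale=0.1, size=n)
x[0] = 5.0  # one large outlier, typical of LLM weights/activations

R = random_rotation(n)
err_plain = np.linalg.norm(fake_quantize(x) - x)
# Rotate, quantize, rotate back: the outlier's energy is now spread
# across all coordinates, so the grid's range is used efficiently
err_rot = np.linalg.norm(R.T @ fake_quantize(R @ x) - x)
print(err_plain, err_rot)  # rotation should cut the error substantially
```

This is just round-to-nearest with a random rotation, not the optimized rotations either paper actually learns or constructs, but it shows the mechanism: without the rotation, the single outlier sets the scale and the small values all collapse to zero.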