You have to run the entire model once for every token. To generate the first token you have to process the entire preceding context once. On the second token you don't need to recompute the context, you can just take what you have calculated in the previous pass. For attention you need to take the current token and run a pairwise computation against all the cached values so that you know which tokens relate to the current token.
This is unavoidable by the way. Any method to avoid quadratic attention will necessarily have to degrade accuracy, because less than quadratic means you will have to look at "less than all the tokens" in a given pass. You're bound to miss at least some of them. When you consider how simple the classical attention mechanism is, you realize that there is not much you can do. Any preattention pass is necessarily going to be more complicated than your quadratic attention mechanism.
What they do is just get rid of it entirely after w layers. This saves memory consumption, which allows you to fit more context in the same amount of VRAM. In the paper they decided to run more batches in parallel, but they could have also advertised a massive increase in context length. 128k context needs 16GB on my machine for llama3 7b. Their approach would allow at least 784k context in the same 16GB.
The KV cache is just another tensor to be used with matmuls. Unlike the model weights which are fixed, the KV cache is uniquely constructed for every input. Think of it as the model growing new weights to represent the new knowledge it learns about the user's input at inference time because not everything can be baked into the pretrained model.
You want to store your KV cache in the same processor that does the rest of your matmuls.