make3 on May 20, 2025 | on: Google AI Ultra

Maybe you could do something like speculative decoding, where you decode with a smaller model until the large model disagrees too much at checkpoints, but use the context-free cache in place of the smaller LLM from the original method. You could also make it multi-level: fixed context-free cache, then small model, then large model.
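
A rough sketch of what that cascade might look like, in case it helps make the idea concrete. Everything here is a toy assumption: `cache_proposer`, `small_propose`, `large_next`, and `REFERENCE` are made-up stand-ins, the "cache" is keyed only on the last token, and verification is simplified to exact-match greedy (the large model's token wins at the first disagreement) rather than a "disagrees too much" threshold. A real system would verify the whole draft in one batched forward pass of the large model instead of calling it token by token.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Token = str
Proposer = Callable[[Sequence[Token], int], List[Token]]

# Toy "large model": deterministically continues a fixed reference sentence.
REFERENCE: List[Token] = "the cat sat on the mat and the cat slept".split()


def large_next(prefix: Sequence[Token]) -> Token:
    """Hypothetical large model: greedy next token given the full prefix."""
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else "<eos>"


def small_propose(prefix: Sequence[Token], k: int) -> List[Token]:
    """Hypothetical small model: mostly right, but always says 'dog' for 'cat'."""
    start = len(prefix)
    return ["dog" if t == "cat" else t for t in REFERENCE[start:start + k]]


def cache_proposer(cache: Dict[Token, List[Token]]) -> Proposer:
    """Level 0: fixed lookup keyed only on the last token (no wider context)."""
    def propose(prefix: Sequence[Token], k: int) -> List[Token]:
        if not prefix:
            return []
        return cache.get(prefix[-1], [])[:k]
    return propose


def speculative_decode(
    prompt: Sequence[Token],
    target_next: Callable[[Sequence[Token]], Token],
    proposers: Sequence[Proposer],
    max_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Draft with the cheapest proposer that has a guess; verify with the target."""
    out = list(prompt)
    accepted = 0  # tokens that came from a draft and survived verification
    while len(out) - len(prompt) < max_new:
        limit = min(k, max_new - (len(out) - len(prompt)))
        draft: List[Token] = []
        for propose in proposers:  # ordered cheapest first: cache, then small model
            draft = propose(out, limit)[:limit]
            if draft:
                break
        n_ok = 0
        for tok in draft:
            if target_next(out) != tok:
                break  # first disagreement ends the accepted run
            out.append(tok)
            n_ok += 1
        accepted += n_ok
        if n_ok < len(draft) or not draft:
            # Rejected (or no draft at all): fall back to the large model's own token.
            out.append(target_next(out))
    return out, accepted


cache = {"the": ["cat", "sat"], "on": ["the", "mat"]}
out, accepted = speculative_decode(
    ["the"], large_next, [cache_proposer(cache), small_propose], max_new=9
)
print(" ".join(out), "| draft tokens accepted:", accepted)
```

The property worth keeping from the original method is that the output matches what the large model alone would have produced greedily; the cache and the small model only change how many large-model steps are spent producing tokens versus confirming them.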

ethbr1 on May 21, 2025

Something like higher-dimensional Huffman compression for queries?