maybe you could do something like speculative decoding, where a cheap drafter decodes ahead and you keep its tokens until the large model disagrees too much at a checkpoint, but use the context-free cache in place of the smaller LLM from the original method. you could also make it multi-level: fixed context-free cache, then small model, then large model
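rough sketch of what I mean, as a toy rather than a real implementation. Everything here is made up for illustration: the "context-free cache" is assumed to be a plain lookup from the last token to a guessed next token, `target` stands in for the large model, and "disagrees too much" is simplified to "the first drafted token that differs from the large model's greedy choice". The multi-level part is read as a fallback chain (cache first, then small model), which is only one way to do it.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Token = str
# A drafter returns a guessed next token, or None if it has no guess.
Drafter = Callable[[List[Token]], Optional[Token]]


def cache_drafter(cache: Dict[Token, Token]) -> Drafter:
    """Hypothetical context-free drafter: looks only at the last token."""
    def draft(context: List[Token]) -> Optional[Token]:
        return cache.get(context[-1]) if context else None
    return draft


def first_guess(drafters: Sequence[Drafter], context: List[Token]) -> Optional[Token]:
    """Ask drafters in order (cheapest first) and return the first guess."""
    for drafter in drafters:
        guess = drafter(context)
        if guess is not None:
            return guess
    return None


def speculative_decode(
    prompt: List[Token],
    drafters: Sequence[Drafter],
    target: Callable[[List[Token]], Token],
    max_new: int = 8,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, accepted draft tokens)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    accepted = 0
    while len(out) < limit:
        # Draft up to k tokens without touching the large model.
        draft: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            guess = first_guess(drafters, out + draft)
            if guess is None:
                break
            draft.append(guess)
        # Checkpoint: the large model checks the draft token by token.
        # (A real system would score all drafted positions in one batched pass.)
        agreed = True
        for tok in draft:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # disagreement: keep the large model's token
                agreed = False
                break
            out.append(tok)
            accepted += 1
        # Whole draft accepted (or nothing drafted): large model adds one token.
        if agreed and len(out) < limit:
            out.append(target(out))
    return out, accepted


# Toy setup: the "large model" just recites a fixed sentence in a loop.
SENTENCE = ["the", "cat", "sat", "on", "the", "mat"]

def target(context: List[Token]) -> Token:
    return SENTENCE[len(context) % len(SENTENCE)]

# The cache is wrong after the second "the" (it guesses "cat", not "mat").
cache = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def small_model(context: List[Token]) -> Optional[Token]:
    # Hypothetical small model that only knows what follows "mat".
    return "the" if context and context[-1] == "mat" else None

tokens, accepted = speculative_decode(["the"], [cache_drafter(cache)], target, max_new=5, k=5)
print(tokens, accepted)

tokens2, accepted2 = speculative_decode(
    ["the"], [cache_drafter(cache), small_model], target, max_new=11, k=5
)
print(tokens2, accepted2)
```

since every emitted token is either confirmed or replaced by `target`, the output matches what greedy decoding with the large model alone would give. The cache only changes how many tokens get accepted per checkpoint.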