
Maybe you can do something like speculative decoding, where you decode with a smaller model until the large model disagrees too much at checkpoints, but use the context-free cache in place of the smaller LLM from the original method. You could also do it multi-level: fixed context-free cache, then small model, then large model.
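A rough sketch of what that could look like, under a lot of assumptions: the "context-free cache" is modeled here as a plain dict from the last token to a guessed next token, the large model is a hypothetical `target_next` callable that returns its greedy next token, and "disagrees too much" is simplified to "disagrees at all" (strict greedy verification). All names below are made up for illustration. A real system would verify the whole proposal in one batched forward pass rather than one call per token.

```python
from typing import Callable, Dict, List

Token = str


def draft_from_cache(cache: Dict[Token, Token], context: List[Token], k: int) -> List[Token]:
    """Propose up to k tokens by chaining lookups keyed on the last token only."""
    proposal: List[Token] = []
    if not context:
        return proposal
    last = context[-1]
    for _ in range(k):
        nxt = cache.get(last)
        if nxt is None:  # cache miss: stop drafting early
            break
        proposal.append(nxt)
        last = nxt
    return proposal


def speculative_decode(
    target_next: Callable[[List[Token]], Token],
    cache: Dict[Token, Token],
    prompt: List[Token],
    max_new: int,
    k: int = 4,
) -> List[Token]:
    """Greedy decoding where the cache drafts and the target model verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        proposal = draft_from_cache(cache, out, k)
        accepted: List[Token] = []
        # Checkpoint: keep the longest drafted prefix the target agrees with.
        for tok in proposal:
            expected = target_next(out + accepted)
            if expected != tok:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
        else:
            # Nothing was rejected (or nothing was drafted): take one target token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new]


def greedy_decode(
    target_next: Callable[[List[Token]], Token], prompt: List[Token], max_new: int
) -> List[Token]:
    """Baseline: ask the target model for every token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


# Toy stand-ins: the "large model" just cycles through a fixed phrase.
PHRASE = ["the", "cat", "sat", "on", "the", "mat", "."]


def toy_target(tokens: List[Token]) -> Token:
    return PHRASE[len(tokens) % len(PHRASE)]


# The cache ignores context, so it is wrong for the second "the" (guesses "cat").
CACHE = {"the": "cat", "cat": "sat", "sat": "on", "on": "the", "mat": ".", ".": "the"}
```

With strict verification the output is identical to plain greedy decoding with the large model, so the cache can only affect speed, not results. The multi-level version would presumably put the small model between the two: fall back to it on a cache miss or rejection, and only go to the large model at checkpoints.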


Something like higher-dimensional Huffman compression for queries?



