Curious how the semantic caching layer works. Are you embedding requests on the gateway side and doing a vector similarity lookup before proxying? And if so, how do you handle cache invalidation when the underlying model changes or gets updated?
Hey, contributor here. That's right, GoModel embeds requests and does a vector similarity lookup before proxying. As for cache invalidation, there is no "purging" involved: the model is part of the namespace (params_hash includes the LLM model, path, guardrails hash, etc.), so a changed model simply looks up a different namespace. TTL takes care of cleaning up the old entries later.
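To make the namespacing idea concrete, here's a rough sketch in Python. It is not GoModel's actual code: `params_hash` as a function, `toy_embed`, `SemanticCache`, the 0.9 similarity threshold, and the field names are all assumptions made up for illustration, and the embedding is a hashed bag of words standing in for a real embedding model.

```python
import hashlib
import json
import math
import time
from collections import defaultdict


def params_hash(model, path, guardrails_hash):
    """Hypothetical namespace key: changing the model yields a new namespace."""
    payload = json.dumps(
        {"model": model, "path": path, "guardrails": guardrails_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def toy_embed(text, dims=64):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class SemanticCache:
    def __init__(self, ttl_seconds=3600.0, threshold=0.9, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.threshold = threshold
        self.clock = clock
        # namespace -> list of (embedding, response, expires_at)
        self.entries = defaultdict(list)

    def put(self, namespace, prompt, response):
        self.entries[namespace].append(
            (toy_embed(prompt), response, self.clock() + self.ttl)
        )

    def get(self, namespace, prompt):
        now = self.clock()
        # Expired entries are dropped lazily on lookup; nothing is purged eagerly.
        live = [e for e in self.entries[namespace] if e[2] > now]
        self.entries[namespace] = live
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response, _ in live:
            # Dot product equals cosine similarity because vectors are unit length.
            score = sum(a * b for a, b in zip(query, embedding))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

With this shape, a response cached under one model's namespace is never returned for a request that names a different model, because the two requests hash to different namespaces. The stale entries just sit there until their TTL passes and a later lookup in that namespace drops them.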
This blog post doesn't address how GPU "threads" can be mapped to Rust SIMD/SPMD "lanes" yet, though it hints at that. I assume that this is planned to be a topic for a future blog post.
I'd like to understand how the total number of "warps" to be launched on the GPU is determined. Is it fixed at shader launch, or can warps be created and destroyed on demand? If it's fixed, these are more like CPU-side "virtual processors" (in OS terminology) than true OS "threads".
It depends on your use case. I wouldn't use it for a JS-heavy site. But if you have simple static content, it's probably enough. It's worth testing it out as a standalone app before integrating it as a library.
It doesn't crash as often as it used to a few years ago. JS-heavy sites might not work, and there are layout issues too. Internet gatekeepers like Cloudflare Turnstile don't work either.
Crashes happen for reasons besides memory safety. Web engines are crazy complicated pieces of software, and crashes could happen for any number of reasons. Also, I would be shocked if this was written using purely safe Rust.
Rust fixed memory safety but left build-time trust wide open. What’s the realistic path to fixing this: sandboxed builds by default, stricter provenance (Sigstore-style), or something else?