We’re also building a billion-scale pipeline for indexing embeddings. Like the author, most of our pain has been scaling: if you only had to handle millions, this whole pipeline would be ~100 LoC. But billions? Our system is at 20k LoC and growing.
The biggest surprise to me here is using Weaviate at the scale of billions. My understanding was that this requires tremendous amounts of memory (on the order of a TB of RAM), which is prohibitively expensive ($10-50k/mo for that much memory).
Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
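To make the disk-resident idea concrete, here's a toy sketch (not Lance's actual API or index structure, and brute force rather than ANN): the vectors live in a file and are scanned through `np.memmap`, so resident memory stays small regardless of dataset size. File name and sizes are made up.

```python
import numpy as np

# Write a toy vector store to disk (hypothetical file, small sizes for demo).
dim, n = 64, 10_000
rng = np.random.default_rng(0)
vecs = rng.standard_normal((n, dim)).astype(np.float32)
vecs.tofile("vectors.f32")

# Search reads from disk-backed pages instead of holding all vectors in RAM.
store = np.memmap("vectors.f32", dtype=np.float32, shape=(n, dim))
query = rng.standard_normal(dim).astype(np.float32)

# Brute-force nearest neighbours by L2 distance.
dists = np.linalg.norm(store - query, axis=1)
top5 = np.argsort(dists)[:5]
```

Lance layers a real ANN index on top of this kind of on-disk layout, so it avoids the full scan; the point here is only that the working set doesn't have to fit in memory.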
Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it's imperative to be able to recover from a failure halfway through.
RE: Weaviate: yeah, we've needed large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on our customers' requirements. (On Weaviate we explored using product quantization.)
What kind of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?
Disk retrieval is definitely slower. In-memory retrieval is typically ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of, 50ms of latency is good enough. The best part is that the cost is driven by disk, not RAM: instead of $50k/month for ~1 TB of RAM you're talking about $1k/mo for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. Saving ~$49k/mo for an extra 50ms of latency is a pretty easy tradeoff.
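The back-of-the-envelope above, spelled out (prices are the rough figures from this thread, not real quotes):

```python
# Rough monthly cost for ~1 TB of capacity, per the thread's figures.
ram_cost_per_month = 50_000   # ~1 TB of RAM on managed instances
nvme_cost_per_month = 1_000   # ~1 TB of fast NVMe on a fast link

ratio = ram_cost_per_month // nvme_cost_per_month      # 50x cheaper
savings = ram_cost_per_month - nvme_cost_per_month     # $49k/mo saved

# What you pay for it: roughly the gap between in-memory and disk latency.
extra_latency_ms = 50 - 1
```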
We've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. We've also been using Elasticsearch dense vector fields, which also seem to scale well. It's pricey, of course, but we already have it in our infra, so it works well.
I've run a few tests on pg, and retrieving 100 random indices from a billion-scale table (no vectors, just a vanilla table with an int64 primary key) easily took 700ms on beefy GCP instances. And that was without a vector index.
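For anyone wanting to reproduce the shape of that test, here's a sketch using sqlite (stdlib) as a stand-in for Postgres; the table name, row count, and timings are illustrative only and won't match a billion-row GCP instance. The access pattern is the same: 100 point lookups by integer primary key.

```python
import random
import sqlite3
import time

# Stand-in table with an integer primary key (far smaller than 1B rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(100_000)),
)

# Fetch 100 random rows by primary key and time it.
ids = random.sample(range(100_000), 100)
start = time.perf_counter()
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT id, payload FROM items WHERE id IN ({placeholders})", ids
).fetchall()
elapsed_ms = (time.perf_counter() - start) * 1000
```

Against real Postgres you'd run the analogous `SELECT ... WHERE id = ANY(...)` and compare plans with `EXPLAIN ANALYZE`; 700ms for 100 indexed point lookups would normally suggest cold cache, network round-trips, or a missing index.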
Entirely possible my take was too cursory; would love to know what latencies you're getting, bryan0!
> 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances.
Is there a write-up of the analysis? Something seems very wrong with that taking 700ms.
We have lookup-latency requirements on the Elastic side. On pgvector it's currently a staging and aggregation database, so lookup latency isn't so important. Our requirement right now is that we need to be able to embed and ingest ~100M vectors/day, which we can achieve without any problems.
For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.
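For illustration, a pre-filtered pgvector query might look like the sketch below; the table and column names (`docs`, `tenant_id`, `embedding`) are made up, but `<->` is pgvector's L2 distance operator, and the `WHERE` clause lets a btree index cut the candidate set before the ANN search runs.

```python
# Hypothetical pgvector query string, for use with a driver like psycopg.
prefiltered_knn = """
SELECT id
FROM docs
WHERE tenant_id = %(tenant)s            -- cheap btree pre-filter
ORDER BY embedding <-> %(query_vec)s    -- pgvector L2 distance
LIMIT 10;
"""
```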