We’re also building billion-scale pipeline for indexing embeddings. Like the aut...

ddematheu · on Oct 9, 2023

Co-author of article here.

Yeah a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, a failure half way through it is imperative to be able to recover.

RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate which has been a drawback from a cost perspective, but that from a performance perspective delivers on the requirements of our customers. (on Weaviate we explored using product quantization. )

What type of performance have you gotten with Lance both on ingestion and retieval? Is disk retrieval fast enough?

juxtaposicion · on Oct 9, 2023

Disk retrieval is definitely slower. In-memory retrieval typically can be ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of 50ms of latency is good enough. The best part is that the cost is driven by disk not ram, which means instead of $50k/month for ~TB of RAM you're talking about $1k/mo for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. $50k/mo for an extra 50ms latency is a pretty clear easy tradeoff.

bryan0 · on Oct 9, 2023

we've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. we've also been using elastic search dense vector fields which also seems to scale well, but of course its pricey but we already have it in our infra so works well.

ddematheu · on Oct 9, 2023

What type of latency requirements are you dealing with? (i.e. look up time, ingestion time)

Were you using postgres already or migrated data into it?

juxtaposicion · on Oct 9, 2023

I'd love to know the answer here too!

I've ran a few tests on pg and retrieving 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances. And that was without a vector index.

Entirely possibly my take was too cursory, would love to know what latencies you're getting bryan0!

losteric · on Oct 10, 2023

> 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances.

Is there a write up of the analysis? Something seems very wrong with that taking 700ms

bryan0 · on Oct 10, 2023

we have look up latency requirements on the elastic side. on pgvector it is currently a staging and aggregation database so lookup latency not so important. Our requirement right now is that we need to be able to embed and ingest ~100M vectors / day. This we can achieve without any problems now.

For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.

yes, we use postgres pretty extensively already.

omneity · on Oct 10, 2023

What size are your embeddings?

bryan0 · on Oct 10, 2023

384 dims. we're using: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...

esafak · on Oct 10, 2023

What kind of retrieval performance are you observing with Lance?

juxtaposicion · on Oct 10, 2023

For a "small" dataset of 50M and 0.5TB in size with 20 results get around 50-100ms.