
Co-author of the article here.

You are right, retrieval accuracy is important as well. From an accuracy perspective, have you found any tools useful for validating retrieval accuracy?

In our current architecture, all the pieces of the RAG ingestion pipeline are modifiable, so loading, chunking, and embedding can each be improved independently.

As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) for testing different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.



Did you consider pre-processing each chunk separately to generate useful information - summary, title, topics - that would enrich embeddings and aid retrieval? Embeddings only capture surface form: "third letter of the second word" won't match the embedding for the letter "t". Information has surface and depth. We get at depth through chain-of-thought, but that requires first digesting the raw text with an LLM.
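
To sketch the idea: generate metadata per chunk and prepend it to the text before embedding, so the enriched form carries "depth" the raw surface form lacks. This is a hypothetical illustration; `call_llm` is a stub standing in for whatever model endpoint you actually use.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here and
    # return generated metadata for the passage.
    return "summary: ...\ntitle: ...\ntopics: ..."

def enrich_chunk(chunk: str) -> str:
    prompt = (
        "Summarize the passage, give it a title, and list its topics.\n\n"
        f"Passage:\n{chunk}"
    )
    metadata = call_llm(prompt)
    # Prepend generated metadata so it is embedded alongside the raw text.
    return f"{metadata}\n\n{chunk}"

enriched = enrich_chunk("The cat sat on the mat.")
```

The embedding model then sees both the explicit metadata and the original chunk in one vector.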

Even LLMs are dumb during training but smart during inference. So to make training examples more useful, we need to first "study" them with a model, making the implicit explicit, before training. That lets training benefit from inference-stage smarts.

Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.

A RAG system should have an ingestion LLM step for retrieval augmentation, and probably hierarchical summarisation up to a decent level. That adds insight to the system by processing the raw documents into a more useful form.
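
Hierarchical summarisation can be sketched as repeatedly grouping summaries and re-summarising until one root remains. The `summarize` function here is a placeholder (it just truncates); a real pipeline would put an LLM call there.

```python
def summarize(text: str, limit: int = 200) -> str:
    # Placeholder for an LLM summarisation call.
    return text[:limit]

def hierarchical_summary(chunks: list[str], fanout: int = 4) -> str:
    # Merge `fanout` neighbouring summaries at a time, level by level,
    # until a single root summary covers the whole document.
    level = chunks
    while len(level) > 1:
        level = [
            summarize(" ".join(level[i:i + fanout]))
            for i in range(0, len(level), fanout)
        ]
    return level[0]
```

At query time you can retrieve at whichever level of the tree matches the question's granularity.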


Not at scale. Currently we do some metadata extraction, but it's pretty simple. LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them; summarizing every document before ingestion could cost thousands of dollars at that volume.

We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but from a scale perspective. For example, if we have 1 million docs but we know they are generally similar in format / template, we can bypass analyzing them one by one with an LLM and instead help create scripts to extract the right info.
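
The template-exploiting idea might look like this: once you know the docs share a format, a single extraction script (perhaps drafted once from a few samples) replaces millions of per-document LLM calls. The invoice fields here are a made-up example, not Neum's actual schema.

```python
import re

# Hypothetical template: every doc looks like "Invoice #<id> ... Total: $<amount>".
INVOICE_RE = re.compile(r"Invoice #(?P<id>\d+)\s+Total: \$(?P<total>[\d.]+)")

def extract(doc: str) -> dict:
    # One cheap regex pass per document instead of one LLM call per document.
    m = INVOICE_RE.search(doc)
    return m.groupdict() if m else {}

extract("Invoice #123 Total: $45.00")  # {'id': '123', 'total': '45.00'}
```

The extracted fields can then be stored as metadata alongside the embeddings.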

We think there are clever approaches like this that can help improve RAG while still being scalable.


Do you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.


> From an accuracy perspective, have you found any tools useful for validating retrieval accuracy?

You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
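
Two of those metrics are simple enough to implement directly; a minimal sketch, assuming each query comes with a ranked list of retrieved doc ids and a set of relevant ids:

```python
def mrr(ranked_lists, relevant_sets):
    # Mean Reciprocal Rank: 1 / rank of the first relevant hit, averaged.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k):
    # Fraction of relevant docs that appear in the top k, averaged over queries.
    total = sum(
        len(set(ranked[:k]) & relevant) / len(relevant)
        for ranked, relevant in zip(ranked_lists, relevant_sets)
    )
    return total / len(ranked_lists)

# One query whose only relevant doc is retrieved at rank 2:
mrr([["d3", "d1"]], [{"d1"}])           # 0.5
recall_at_k([["d3", "d1"]], [{"d1"}], 2)  # 1.0
```

nDCG is fiddlier (graded relevance, log discounting), so for that one a library like scikit-learn's `ndcg_score` is probably the safer bet.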

Plus, if you’re going to spend $$$ embedding tons of docs, you’ll want to compare against a “dumb” baseline like BM25.
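
A BM25 baseline is cheap to stand up; here's a from-scratch sketch of Okapi BM25 scoring (k1=1.5, b=0.75 are conventional defaults), though in practice a library or Elasticsearch would do this for you:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Score every document in the corpus against one tokenized query.
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()  # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

docs = [["rag", "pipeline"], ["cats", "sleep"]]
bm25_scores(["rag"], docs)  # first doc scores > 0, second scores 0
```

If your embedding retriever can't beat this on your own eval set, the $$$ isn't buying much.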



