
Co-author of the article here.

You are right, retrieval accuracy is important as well. From an accuracy perspective, have you found any tools useful for validating retrieval accuracy?

In our current architecture, all the pieces of the RAG ingestion pipeline are modifiable, so loading, chunking, and embedding can each be improved independently.

As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) for testing different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.



Did you consider pre-processing each chunk separately to generate useful information - summary, title, topics - that would enrich embeddings and aid retrieval? Embeddings only capture surface form: "third letter of the second word" won't match the embedding for the letter "t". Information has surface and depth. We get at depth through chain-of-thought, but that requires first digesting the raw text with an LLM.
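
To sketch the idea: generate metadata per chunk and prepend it to the text before embedding, so the enriched form carries "depth" the raw surface form lacks. This is a hypothetical illustration; `call_llm` is a stub standing in for whatever model endpoint you actually use.

```python
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here and
    # return generated metadata for the passage.
    return "summary: ...\ntitle: ...\ntopics: ..."

def enrich_chunk(chunk: str) -> str:
    prompt = (
        "Summarize the passage, give it a title, and list its topics.\n\n"
        f"Passage:\n{chunk}"
    )
    metadata = call_llm(prompt)
    # Prepend generated metadata so it is embedded alongside the raw text.
    return f"{metadata}\n\n{chunk}"

enriched = enrich_chunk("The cat sat on the mat.")
```

The embedding model then sees both the explicit metadata and the original chunk in one vector.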

Even LLMs are dumb during training but smart during inference. So to make training examples more useful, we need to first "study" them with a model, making the implicit explicit, before training. That lets training benefit from inference-stage smarts.

Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.

A RAG system should have an ingestion LLM step for retrieval augmentation, and probably hierarchical summarisation up to a decent level. That adds insight to the system by processing the raw documents into a more useful form.
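
Hierarchical summarisation can be sketched as repeatedly grouping summaries and re-summarising until one root remains. The `summarize` function here is a placeholder (it just truncates); a real pipeline would put an LLM call there.

```python
def summarize(text: str, limit: int = 200) -> str:
    # Placeholder for an LLM summarisation call.
    return text[:limit]

def hierarchical_summary(chunks: list[str], fanout: int = 4) -> str:
    # Merge `fanout` neighbouring summaries at a time, level by level,
    # until a single root summary covers the whole document.
    level = chunks
    while len(level) > 1:
        level = [
            summarize(" ".join(level[i:i + fanout]))
            for i in range(0, len(level), fanout)
        ]
    return level[0]
```

At query time you can retrieve at whichever level of the tree matches the question's granularity.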


Not at scale. Currently we do some metadata extraction, but it's pretty simple. LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them; summarizing every document before ingestion could cost thousands of dollars at that volume.

We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but from a scale perspective. For example, if we have 1 million docs but we know they are generally similar in format / template, we can bypass analyzing them one by one with an LLM and instead help create scripts to extract the right info.
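
The template-exploiting idea might look like this: once you know the docs share a format, a single extraction script (perhaps drafted once from a few samples) replaces millions of per-document LLM calls. The invoice fields here are a made-up example, not Neum's actual schema.

```python
import re

# Hypothetical template: every doc looks like "Invoice #<id> ... Total: $<amount>".
INVOICE_RE = re.compile(r"Invoice #(?P<id>\d+)\s+Total: \$(?P<total>[\d.]+)")

def extract(doc: str) -> dict:
    # One cheap regex pass per document instead of one LLM call per document.
    m = INVOICE_RE.search(doc)
    return m.groupdict() if m else {}

extract("Invoice #123 Total: $45.00")  # {'id': '123', 'total': '45.00'}
```

The extracted fields can then be stored as metadata alongside the embeddings.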

We think there are clever approaches like this that can help improve RAG while still being scalable.


Do you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.


> From an accuracy perspective, have you found any tools useful for validating retrieval accuracy?

You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
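
Two of those metrics are simple enough to implement directly; a minimal sketch, assuming each query comes with a ranked list of retrieved doc ids and a set of relevant ids:

```python
def mrr(ranked_lists, relevant_sets):
    # Mean Reciprocal Rank: 1 / rank of the first relevant hit, averaged.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k):
    # Fraction of relevant docs that appear in the top k, averaged over queries.
    total = sum(
        len(set(ranked[:k]) & relevant) / len(relevant)
        for ranked, relevant in zip(ranked_lists, relevant_sets)
    )
    return total / len(ranked_lists)

# One query whose only relevant doc is retrieved at rank 2:
mrr([["d3", "d1"]], [{"d1"}])           # 0.5
recall_at_k([["d3", "d1"]], [{"d1"}], 2)  # 1.0
```

nDCG is fiddlier (graded relevance, log discounting), so for that one a library like scikit-learn's `ndcg_score` is probably the safer bet.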

Plus, if you’re going to spend $$$ embedding tons of docs, you’ll want to compare against a “dumb” baseline like BM25.
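
A BM25 baseline is cheap to stand up; here's a from-scratch sketch of Okapi BM25 scoring (k1=1.5, b=0.75 are conventional defaults), though in practice a library or Elasticsearch would do this for you:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Score every document in the corpus against one tokenized query.
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter()  # document frequency of each term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

docs = [["rag", "pipeline"], ["cats", "sleep"]]
bm25_scores(["rag"], docs)  # first doc scores > 0, second scores 0
```

If your embedding retriever can't beat this on your own eval set, the $$$ isn't buying much.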



