> We start by parsing documents into chunks. A sensible default is to chunk docu...

gwern · on June 13, 2023

You don't do a sliding window? That seems like the logical way to maintain context but allow look up by 'chunks'. Embed it, say, 3 paragraphs at a time, advancing 1 paragraph per embedding.

chaxor · on June 14, 2023

This is only a good idea if you are *specifically not* using OpenAI.

If you use local models then it's a fantastic idea.

screye · on June 13, 2023

If you're concatenating after chunking , then the overlapping windows add quite a lot of repetition. Also, if it cuts off mid-json / mid-structured output then overlapping windows once again cause issues.

Define a custom recursive text splitter in langchain, and do chunking heuristically. It works a lot better.

That being said, it is useful to maintain some global and local context. But, I wouldn't use overlapping windows.

rahimnathwani · on June 14, 2023

In place of simply concatenating after chunking, a more effective approach might be to retrieve and return the corresponding segments from the original documents that are relevant to the context. For instance, if we're dealing with short pieces of text such as Hacker News comments, it's fairly straightforward. Any partial match can prompt the return of the entire comment as it is.

When working with more extensive documents, the process gets a bit more intricate. In this case, your embedding database might need to hold more information per entry. Ideally, for each document, the database should store identifiers like the document ID, the starting token number, and the ending token number. This way, even if a document appears more than once among the top results from a query, it's possible to piece together the full relevant excerpt accurately.

gwern · on June 14, 2023

I don't think the repetition is a problem. He's using a local model for human-assisted writing with pre-generated embeddings - he can use essentially an arbitrary number of embedding calls, as long as it's more useful for the human. So it's just a question of whether that improves the quality or not. (Not that the cost would be more than a rounding error to embed your typical personal wiki with something like the OA API, especially since they just dropped the prices of embeddings again.)

SmooL · on June 13, 2023

I've thought about doing this as well, but I haven't tried it yet. Are there any resources/blogs/information on various strategies on how to best chunk & embed arbitrary text?

busseio · on June 13, 2023

I’ve been experimenting with sliding window chunking using SRT files. They’re the subtitle format for television and have 1 to _n_ sequence numbers for each chunk, along with time stamps for when the chunk should appear on the screen. Traditionally it’s two lines of text per chunk but you can make chunks of other line counts and sizes. Much of my work with this has been with SRT files that are transcriptions exported from Otter.ai; GPT-3.5 & 4 natively understand the SRT format and the concepts of the sequence numbers and time stamps, so you can refer to them or ask for confirmation of them in a prompt.

crucialfelix · on June 13, 2023

The unstructured package works well to partition text, markdown, html, even pdf on structural boundaries like paragraphs, h, hr etc

https://unstructured-io.github.io/unstructured/bricks.html#p...