Chunking
Chunkers turn loaded documents into the small pieces of text that get embedded and stored. Chunk size matters: too small and you lose context, too big and you blow the LLM's context window when several chunks are retrieved.
IChunkingStrategy contract
```csharp
public interface IChunkingStrategy
{
    IList<TextChunk> Chunk(DocumentPage page);
}

public sealed class TextChunk
{
    public string Text { get; }
    public int ChunkIndex { get; }
    public IDictionary<string, string> Metadata { get; }
}
```
RecursiveTextChunker (default)
```csharp
public RecursiveTextChunker(int chunkSize = 512, int overlap = 64)
```
Splits recursively at natural boundaries: paragraphs (`\n\n`), then single newlines, then sentence terminators, then spaces, and finally individual characters. This produces semantically coherent chunks. Use it by default.
```csharp
using LogicGrid.Rag.Chunking;

var chunker = new RecursiveTextChunker(chunkSize: 800, overlap: 80);
var pipeline = new RagPipeline(embedder, store, chunker);
```
FixedSizeChunker
```csharp
public FixedSizeChunker(int chunkSize = 512, int overlap = 64)
```
Splits by character count with overlap. Predictable. Use when:
- Documents are uniform (logs, CSV rows, code).
- You need every chunk to be exactly the same size for downstream bookkeeping.
```csharp
var chunker = new FixedSizeChunker(chunkSize: 1000, overlap: 200);
var pipeline = new RagPipeline(embedder, store, chunker);
```
Choosing a chunk size
Two competing constraints:
| Smaller chunks | Bigger chunks |
|---|---|
| Higher precision — top-K hits are tightly scoped | Higher recall — full context preserved |
| More chunks → more embedding calls (cost, latency) | Fewer embedding calls |
| Fits more diverse hits into the LLM context | Risks hitting context window with K=5 |
Sensible starting points:
- Documentation and articles: chunkSize=512, overlap=64.
- Dense technical reference: chunkSize=300, overlap=40.
- Long narrative prose: chunkSize=1000, overlap=150.
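These presets trade off embedding-call volume directly. For a splitter that advances by chunkSize - overlap characters per step, the chunk count is roughly ceil((length - overlap) / (chunkSize - overlap)). A back-of-envelope sketch (EstimateChunks is a hypothetical helper for illustration, not part of the library):

```csharp
using System;

// Estimated chunk count, assuming the splitter advances by
// chunkSize - overlap each step (ceiling division).
static int EstimateChunks(int docLength, int chunkSize, int overlap)
{
    int step = chunkSize - overlap;
    return (docLength - overlap + step - 1) / step;
}

// A 100,000-character document under the presets above:
Console.WriteLine(EstimateChunks(100_000, 512, 64));   // 224 embedding calls
Console.WriteLine(EstimateChunks(100_000, 300, 40));   // 385
Console.WriteLine(EstimateChunks(100_000, 1000, 150)); // 118
```

The dense-reference preset costs roughly three times as many embedding calls as the long-prose preset for the same document.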
Overlap
Overlap copies the last N characters of one chunk to the start of the next, so an answer that straddles the boundary still appears verbatim in at least one chunk. 10-20% of chunkSize is a reasonable rule of thumb.
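A quick way to see the effect with plain string slicing (illustration only, not library code):

```csharp
using System;

string text = "Retrieval quality depends on how documents are split into chunks.";
int size = 30, overlap = 6;

// Print overlapping windows: each one starts size - overlap
// characters after the previous one.
for (int start = 0; ; start += size - overlap)
{
    int len = Math.Min(size, text.Length - start);
    Console.WriteLine(text.Substring(start, len));
    if (start + len >= text.Length) break;
}
```

Each window begins with the last 6 characters of the previous one, so a short phrase that crosses a cut still appears whole in the next window.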
Custom chunkers
Implement IChunkingStrategy. Useful when:
- You want hierarchical chunking (sections → paragraphs).
- You want to dedupe near-identical chunks before embedding.
```csharp
public sealed class CodeChunker : IChunkingStrategy
{
    public IList<TextChunk> Chunk(DocumentPage page)
    {
        var chunks = new List<TextChunk>();
        // Naive split on blank lines; a real implementation would split
        // on function/class declarations instead.
        var blocks = page.Text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < blocks.Length; i++)
        {
            // Assumes TextChunk exposes a (text, index, metadata) constructor.
            var metadata = new Dictionary<string, string> { ["source"] = "code" };
            chunks.Add(new TextChunk(blocks[i], i, metadata));
        }
        return chunks;
    }
}
```

```csharp
var pipeline = new RagPipeline(embedder, store, new CodeChunker());
```
When chunking goes wrong
- Too-small chunks lose context. Searches return tightly scoped hits that don't carry enough surrounding meaning to answer.
- Too-large chunks lose precision. A 5K-character chunk represents too much in one vector and dilutes similarity.
- No overlap on boundary answers. With overlap set to 0, a definition that straddles a chunk boundary appears complete in no chunk, so it can never be retrieved verbatim. Always set a non-zero overlap.
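The last failure mode is easy to reproduce with raw string windows (a sketch of the mechanism; the chunkers above prevent it the same way):

```csharp
using System;

// A phrase planted exactly across the 100-character boundary.
string text = new string('x', 95) + " overlap matters " + new string('y', 95);
int size = 100;

// No overlap: the phrase is cut across the two windows.
string a = text.Substring(0, size);
string b = text.Substring(size);
Console.WriteLine(a.Contains("overlap matters")); // False
Console.WriteLine(b.Contains("overlap matters")); // False

// A 20-character overlap: the second window restarts early
// enough to contain the whole phrase.
string c = text.Substring(size - 20);
Console.WriteLine(c.Contains("overlap matters")); // True
```

Neither zero-overlap window can match a query for the phrase verbatim; the overlapped window can.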