Chunking
Chunkers turn loaded documents into the small pieces of text that get embedded and stored. Chunk size matters: too small and you lose context, too big and you blow the LLM's context window when several chunks are retrieved.
IChunkingStrategy contract
```csharp
public interface IChunkingStrategy
{
    IList<TextChunk> Chunk(DocumentPage page);
}

public sealed class TextChunk
{
    public string Text { get; }
    public int ChunkIndex { get; }
    public IDictionary<string, string> Metadata { get; }
}
```
RecursiveTextChunker (default)
```csharp
public RecursiveTextChunker(int chunkSize = 512, int overlap = 64)
```
Splits recursively at natural boundaries: paragraphs (`\n\n`), then single newlines, then sentence terminators, then spaces, and finally individual characters. This produces semantically coherent chunks. Use it by default.
```csharp
using LogicGrid.Rag.Chunking;

var chunker = new RecursiveTextChunker(chunkSize: 800, overlap: 80);
var pipeline = new RagPipeline(embedder, store, chunker);
```
FixedSizeChunker
```csharp
public FixedSizeChunker(int chunkSize = 512, int overlap = 64)
```
Splits by character count with overlap. Predictable. Use when:
- Documents are uniform (logs, CSV rows, code).
- You need every chunk to be exactly the same size for downstream bookkeeping.
```csharp
var chunker = new FixedSizeChunker(chunkSize: 1000, overlap: 200);
var pipeline = new RagPipeline(embedder, store, chunker);
```
Choosing a chunk size
Two competing constraints:
| Smaller chunks | Bigger chunks |
|---|---|
| Higher precision — top-K hits are tightly scoped | Higher recall — full context preserved |
| More chunks → more embedding calls (cost, latency) | Fewer embedding calls |
| Fits more diverse hits into the LLM context | Risks hitting context window with K=5 |
Sensible starting points:
- Documentation and articles: chunkSize=512, overlap=64.
- Dense technical reference: chunkSize=300, overlap=40.
- Long narrative prose: chunkSize=1000, overlap=150.
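These presets trade off embedding-call volume directly. For a splitter that advances by chunkSize - overlap characters per step, the chunk count is roughly ceil((length - overlap) / (chunkSize - overlap)). A back-of-envelope sketch (EstimateChunks is a hypothetical helper for illustration, not part of the library):

```csharp
using System;

// Estimated chunk count, assuming the splitter advances by
// chunkSize - overlap each step (ceiling division).
static int EstimateChunks(int docLength, int chunkSize, int overlap)
{
    int step = chunkSize - overlap;
    return (docLength - overlap + step - 1) / step;
}

// A 100,000-character document under the presets above:
Console.WriteLine(EstimateChunks(100_000, 512, 64));   // 224 embedding calls
Console.WriteLine(EstimateChunks(100_000, 300, 40));   // 385
Console.WriteLine(EstimateChunks(100_000, 1000, 150)); // 118
```

The dense-reference preset costs roughly three times as many embedding calls as the long-prose preset for the same document.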
Overlap
Overlap copies the last N characters of one chunk to the start of the next, so an answer that straddles the boundary still appears verbatim in at least one chunk. 10-20% of chunkSize is a reasonable rule of thumb.
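A quick way to see the effect with plain string slicing (illustration only, not library code):

```csharp
using System;

string text = "Retrieval quality depends on how documents are split into chunks.";
int size = 30, overlap = 6;

// Print overlapping windows: each one starts size - overlap
// characters after the previous one.
for (int start = 0; ; start += size - overlap)
{
    int len = Math.Min(size, text.Length - start);
    Console.WriteLine(text.Substring(start, len));
    if (start + len >= text.Length) break;
}
```

Each window begins with the last 6 characters of the previous one, so a short phrase that crosses a cut still appears whole in the next window.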
Custom chunkers
Implement IChunkingStrategy. Useful when:
- You want hierarchical chunking (sections → paragraphs).
- You want to dedupe near-identical chunks before embedding.
```csharp
public sealed class CodeChunker : IChunkingStrategy
{
    public IList<TextChunk> Chunk(DocumentPage page)
    {
        var chunks = new List<TextChunk>();
        // Naive split on blank lines; a real implementation would split
        // on function/class declarations instead.
        var blocks = page.Text.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < blocks.Length; i++)
        {
            // Assumes TextChunk exposes a (text, index, metadata) constructor.
            var metadata = new Dictionary<string, string> { ["source"] = "code" };
            chunks.Add(new TextChunk(blocks[i], i, metadata));
        }
        return chunks;
    }
}
```

```csharp
var pipeline = new RagPipeline(embedder, store, new CodeChunker());
```
When chunking goes wrong
- Too-small chunks lose context. Searches return tightly scoped hits that don't carry enough surrounding meaning to answer.
- Too-large chunks lose precision. A 5K-character chunk represents too much in one vector and dilutes similarity.
- No overlap on boundary answers. With overlap set to 0, a definition that straddles a chunk boundary appears complete in no chunk, so it can never be retrieved verbatim. Always set a non-zero overlap.
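The last failure mode is easy to reproduce with raw string windows (a sketch of the mechanism; the chunkers above prevent it the same way):

```csharp
using System;

// A phrase planted exactly across the 100-character boundary.
string text = new string('x', 95) + " overlap matters " + new string('y', 95);
int size = 100;

// No overlap: the phrase is cut across the two windows.
string a = text.Substring(0, size);
string b = text.Substring(size);
Console.WriteLine(a.Contains("overlap matters")); // False
Console.WriteLine(b.Contains("overlap matters")); // False

// A 20-character overlap: the second window restarts early
// enough to contain the whole phrase.
string c = text.Substring(size - 20);
Console.WriteLine(c.Contains("overlap matters")); // True
```

Neither zero-overlap window can match a query for the phrase verbatim; the overlapped window can.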