Chunking

Chunkers turn loaded documents into the small pieces of text that get embedded and stored. Chunk size matters: too small and you lose context, too big and you blow the LLM's context window when several chunks are retrieved.

IChunkingStrategy contract

public interface IChunkingStrategy
{
    IList<TextChunk> Chunk(DocumentPage page);
}

public sealed class TextChunk
{
    public string Text { get; }
    public int ChunkIndex { get; }
    public IDictionary<string, string> Metadata { get; }
}
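As a minimal illustration of the contract, here is a pass-through strategy that emits the whole page as a single chunk. Note the assumption: the contract above shows only getters, so this sketch assumes TextChunk has a (text, index, metadata) constructor.

```csharp
// Sketch only. Assumes TextChunk(text, index, metadata) exists,
// which the contract above does not show.
public sealed class WholePageChunker : IChunkingStrategy
{
    public IList<TextChunk> Chunk(DocumentPage page)
    {
        return new List<TextChunk>
        {
            new TextChunk(page.Text, 0, new Dictionary<string, string>())
        };
    }
}
```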

RecursiveTextChunker (default)

public RecursiveTextChunker(int chunkSize = 512, int overlap = 64)

Splits recursively at natural boundaries — paragraphs (\n\n), then single newlines, then sentence terminators, then spaces, then characters. Produces semantically coherent chunks. Use this by default.

using LogicGrid.Rag.Chunking;

var chunker = new RecursiveTextChunker(chunkSize: 800, overlap: 80);
var pipeline = new RagPipeline(embedder, store, chunker);

FixedSizeChunker

public FixedSizeChunker(int chunkSize = 512, int overlap = 64)

Splits by character count with overlap. Predictable. Use when:

  • Documents are uniform (logs, CSV rows, code).
  • You need every chunk to be exactly the same size for downstream bookkeeping.

var chunker = new FixedSizeChunker(chunkSize: 1000, overlap: 200);
var pipeline = new RagPipeline(embedder, store, chunker);

Choosing a chunk size

Two competing constraints:

Smaller chunks                                     | Bigger chunks
Higher precision — top-K hits are tightly scoped   | Higher recall — full context preserved
More chunks → more embedding calls (cost, latency) | Fewer embedding calls
Fits more diverse hits into the LLM context        | Risks overflowing the context window even at K=5

A sensible starting point for documentation/articles: chunkSize=512, overlap=64. For dense technical reference: chunkSize=300, overlap=40. For long narrative prose: chunkSize=1000, overlap=150.
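Those starting points translate directly into constructor arguments (the values below are the suggestions above, not library defaults beyond the 512/64 already shown):

```csharp
// Documentation / articles: the suggested starting point.
var docsChunker = new RecursiveTextChunker(chunkSize: 512, overlap: 64);

// Dense technical reference: smaller, tightly scoped chunks.
var refChunker = new RecursiveTextChunker(chunkSize: 300, overlap: 40);

// Long narrative prose: bigger chunks, proportionally larger overlap.
var proseChunker = new RecursiveTextChunker(chunkSize: 1000, overlap: 150);
```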

Overlap

Overlap copies the last N characters of one chunk to the start of the next so an answer that straddles the boundary still appears in at least one chunk verbatim. 10–20% of chunkSize is a reasonable rule of thumb.
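The mechanics can be sketched as a sliding window that advances by chunkSize - overlap each step (a hypothetical stand-in for the library's fixed-size logic, not its actual implementation):

```csharp
using System;
using System.Collections.Generic;

static class OverlapDemo
{
    // Each window starts (chunkSize - overlap) characters after the previous
    // one, so the last `overlap` characters of chunk N reappear at the start
    // of chunk N+1. Requires overlap < chunkSize.
    public static List<string> Split(string text, int chunkSize, int overlap)
    {
        var chunks = new List<string>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.Length; start += step)
        {
            int len = Math.Min(chunkSize, text.Length - start);
            chunks.Add(text.Substring(start, len));
            if (start + len >= text.Length) break; // avoid a pure-overlap tail
        }
        return chunks;
    }
}
```

Splitting "abcdefghij" with chunkSize: 4, overlap: 1 yields "abcd", "defg", "ghij" — each boundary character appears in two chunks.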

Custom chunkers

Implement IChunkingStrategy. Useful when:

  • You want hierarchical chunking (sections → paragraphs).
  • You want to dedupe near-identical chunks before embedding.

public sealed class CodeChunker : IChunkingStrategy
{
    // Split before each top-level declaration keyword. Assumes TextChunk has a
    // (text, index, metadata) constructor, which the contract above does not show.
    private static readonly Regex Declaration =
        new Regex(@"(?=^\s*(?:public|internal|private|protected)\b)", RegexOptions.Multiline);

    public IList<TextChunk> Chunk(DocumentPage page)
    {
        var chunks = new List<TextChunk>();
        foreach (var block in Declaration.Split(page.Text))
        {
            if (string.IsNullOrWhiteSpace(block)) continue;
            chunks.Add(new TextChunk(block.Trim(), chunks.Count,
                new Dictionary<string, string> { ["kind"] = "code" }));
        }
        return chunks;
    }
}

var pipeline = new RagPipeline(embedder, store, new CodeChunker());

When chunking goes wrong

  • Too-small chunks lose context. Searches return tightly scoped hits that don't carry enough surrounding meaning to answer.
  • Too-large chunks lose precision. A 5K-character chunk represents too much in one vector and dilutes similarity.
  • No overlap. A definition that straddles a chunk boundary can appear complete in no chunk at all, so retrieval never surfaces it verbatim. Always set a non-zero overlap.