Document loaders

Document loaders turn a file (or a stream) into a list of DocumentPage objects. Each page is a chunk of text plus metadata that the chunker breaks down further.

LogicGrid ships with loaders for the most common formats, and the pipeline picks the right one automatically based on file extension.
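For example, ingesting a Markdown file routes through PlainTextLoader with no extra configuration (the pipeline variable here is constructed as shown later on this page):

// The .md extension resolves to PlainTextLoader automatically.
await pipeline.IngestAsync("./docs/architecture.md");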

Built-in loaders

Loader           Extensions                    Notes
PlainTextLoader  .txt, .md, .csv, .tsv, .log   Read as UTF-8. One page per file.
HtmlLoader       .html, .htm                   Strips tags, keeps inner text. One page per file.

Support for other formats (e.g. PDF, DOCX) is on the roadmap.

Loading explicitly

If you want to inspect or transform documents before they hit the pipeline:

using LogicGrid.Rag.Documents;

var loader = new PlainTextLoader();
var pages = await loader.LoadAsync("./docs/architecture.md");

foreach (var page in pages)
{
    Console.WriteLine($"Page {page.PageNumber}, {page.Text.Length} chars");
    Console.WriteLine($"Metadata: {string.Join(", ", page.Metadata.Select(kv => $"{kv.Key}={kv.Value}"))}");
}

DocumentPage has:

public sealed class DocumentPage
{
    public string Text { get; }
    public int PageNumber { get; }
    public IDictionary<string, string> Metadata { get; }
}

Implementing a custom loader

Implement IDocumentLoader:

public interface IDocumentLoader
{
    /// <summary>True if this loader handles the given file extension.</summary>
    bool CanLoad(string path);

    /// <summary>Load the file into one or more pages.</summary>
    Task<IList<DocumentPage>> LoadAsync(
        string path, CancellationToken ct = default);
}

Example — a PDF loader using PdfPig:

public sealed class PdfLoader : IDocumentLoader
{
    public bool CanLoad(string path) =>
        string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase);

    public Task<IList<DocumentPage>> LoadAsync(
        string path, CancellationToken ct = default)
    {
        var pages = new List<DocumentPage>();
        // PdfPig reads synchronously, so return a completed task rather
        // than marking the method async with nothing to await.
        using var document = UglyToad.PdfPig.PdfDocument.Open(path);
        for (int i = 0; i < document.NumberOfPages; i++)
        {
            ct.ThrowIfCancellationRequested();
            var page = document.GetPage(i + 1); // PdfPig pages are 1-based.
            pages.Add(new DocumentPage(
                text: page.Text,
                pageNumber: i + 1,
                metadata: new Dictionary<string, string>
                {
                    ["source"] = path,
                    ["page"] = (i + 1).ToString(),
                }));
        }
        return Task.FromResult<IList<DocumentPage>>(pages);
    }
}

Register it with the pipeline:

var pipeline = new RagPipeline(embedder, store)
.AddLoader(new PdfLoader());

await pipeline.IngestAsync("./docs/manual.pdf");

AddLoader returns the pipeline for chaining, so you can add several in one expression.
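
For example, registering several loaders in one expression (DocxLoader here is hypothetical, standing in for any custom IDocumentLoader you have written):

var pipeline = new RagPipeline(embedder, store)
    .AddLoader(new PdfLoader())
    .AddLoader(new DocxLoader()); // hypothetical second custom loader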

Ingesting a folder

Walk the filesystem and call IngestAsync on each file:

// Match extensions case-insensitively; a plain EndsWith(".md") would
// miss files like README.MD.
var extensions = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".md", ".txt", ".html", ".htm"
};

foreach (var path in Directory.GetFiles("./docs", "*", SearchOption.AllDirectories))
{
    if (extensions.Contains(Path.GetExtension(path)))
    {
        await pipeline.IngestAsync(path);
    }
}

For large corpora, chunk the work and let RagPipelineOptions.MaxConcurrentIngest control parallelism; see Pipeline for options.
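
A minimal sketch of chunked ingestion, reusing the extensions set from above and .NET 6's Enumerable.Chunk. It assumes the options object is supplied when the pipeline is constructed (see the Pipeline page), so MaxConcurrentIngest can throttle the in-flight work:

var files = Directory.GetFiles("./docs", "*", SearchOption.AllDirectories)
    .Where(p => extensions.Contains(Path.GetExtension(p)));

// Issue ingests in batches; the pipeline's MaxConcurrentIngest setting
// caps how many actually run at once.
foreach (var batch in files.Chunk(64))
{
    await Task.WhenAll(batch.Select(p => pipeline.IngestAsync(p)));
}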

Tips

  • Strip boilerplate before ingesting. A nav bar repeated on every HTML page becomes the most common chunk and dilutes search quality, so pre-process if needed.
  • Tag chunks with rich metadata: source file, section, version, permission scope. Use the metadata to filter search results; many vector stores support metadata predicates (see the sketch after this list).
  • Re-ingestion is safe. The pipeline derives chunk IDs from content, so running ingest twice on the same file is a no-op.
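
A sketch of metadata filtering at query time. The SearchAsync signature and filter shape here are hypothetical; check your pipeline and vector store's actual query API:

// Hypothetical: restrict retrieval to chunks from a given source file.
var results = await pipeline.SearchAsync(
    "How does the ingest pipeline work?",
    filter: new Dictionary<string, string> { ["source"] = "./docs/architecture.md" });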