Document loaders
Document loaders turn a file (or a stream) into a list of
`DocumentPage` objects. Each page is a chunk of text plus metadata,
which the chunker breaks down further.
LogicGrid ships with loaders for the most common formats, and the pipeline automatically picks the right one based on the file extension.
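For example, given a configured pipeline (see "Register it with the pipeline" below for construction):

```csharp
// Dispatch is by extension; no loader has to be named explicitly.
await pipeline.IngestAsync("./docs/readme.md");   // routed to PlainTextLoader
await pipeline.IngestAsync("./docs/index.html");  // routed to HtmlLoader
```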
Built-in loaders
| Loader | Extensions | Notes |
|---|---|---|
| `PlainTextLoader` | `.txt`, `.md`, `.csv`, `.tsv`, `.log` | Read as UTF-8. One page per file. |
| `HtmlLoader` | `.html`, `.htm` | Strips tags, keeps inner text. One page per file. |
Support for other formats (e.g. PDF, DOCX) is on the roadmap.
Loading explicitly
If you want to inspect or transform documents before they hit the pipeline:
```csharp
using System;
using System.Linq;
using LogicGrid.Rag.Documents;

var loader = new PlainTextLoader();
var pages = await loader.LoadAsync("./docs/architecture.md");

foreach (var page in pages)
{
    Console.WriteLine($"Page {page.PageNumber}, {page.Text.Length} chars");
    Console.WriteLine($"Metadata: {string.Join(", ", page.Metadata.Select(kv => $"{kv.Key}={kv.Value}"))}");
}
```
`DocumentPage` has:

```csharp
public sealed class DocumentPage
{
    public string Text { get; }
    public int PageNumber { get; }
    public IDictionary<string, string> Metadata { get; }
}
```
Implementing a custom loader
Implement `IDocumentLoader`:

```csharp
public interface IDocumentLoader
{
    /// <summary>True if this loader handles the given file extension.</summary>
    bool CanLoad(string path);

    /// <summary>Load the file into one or more pages.</summary>
    Task<IList<DocumentPage>> LoadAsync(
        string path, CancellationToken ct = default);
}
```
Example: a PDF loader using PdfPig:
```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public sealed class PdfLoader : IDocumentLoader
{
    public bool CanLoad(string path) =>
        string.Equals(Path.GetExtension(path), ".pdf", StringComparison.OrdinalIgnoreCase);

    public Task<IList<DocumentPage>> LoadAsync(
        string path, CancellationToken ct = default)
    {
        var pages = new List<DocumentPage>();
        using var document = UglyToad.PdfPig.PdfDocument.Open(path);

        // PdfPig's page numbers are 1-based.
        for (int i = 1; i <= document.NumberOfPages; i++)
        {
            ct.ThrowIfCancellationRequested();
            var page = document.GetPage(i);
            pages.Add(new DocumentPage(
                text: page.Text,
                pageNumber: i,
                metadata: new Dictionary<string, string>
                {
                    ["source"] = path,
                    ["page"] = i.ToString(),
                }));
        }

        // PdfPig's API is synchronous, so return a completed task.
        return Task.FromResult<IList<DocumentPage>>(pages);
    }
}
```
Register it with the pipeline:
```csharp
var pipeline = new RagPipeline(embedder, store)
    .AddLoader(new PdfLoader());

await pipeline.IngestAsync("./docs/manual.pdf");
```
`AddLoader` returns the pipeline for chaining, so you can add several loaders in one expression.
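For instance (`DocxLoader` is a hypothetical second custom loader, not something LogicGrid ships):

```csharp
var pipeline = new RagPipeline(embedder, store)
    .AddLoader(new PdfLoader())
    .AddLoader(new DocxLoader()); // hypothetical; any IDocumentLoader works
```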
Ingesting a folder
Walk the filesystem and call IngestAsync on each file:
```csharp
var extensions = new HashSet<string> { ".md", ".txt", ".html", ".htm" };

foreach (var path in Directory.GetFiles("./docs", "*.*", SearchOption.AllDirectories))
{
    // Compare extensions case-insensitively so README.MD is picked up too.
    if (extensions.Contains(Path.GetExtension(path).ToLowerInvariant()))
    {
        await pipeline.IngestAsync(path);
    }
}
```
For large corpora, chunk the work and let `RagPipelineOptions.MaxConcurrentIngest`
control parallelism; see Pipeline for the available options.
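A sketch of one way to batch the work, reusing the `extensions` set from above (the batch size of 64 is arbitrary, and `Enumerable.Chunk` requires .NET 6+):

```csharp
var files = Directory
    .EnumerateFiles("./docs", "*.*", SearchOption.AllDirectories)
    .Where(p => extensions.Contains(Path.GetExtension(p).ToLowerInvariant()));

foreach (var batch in files.Chunk(64))
{
    // MaxConcurrentIngest caps parallelism inside the pipeline; batching
    // here just bounds how many tasks the caller has in flight at once.
    await Task.WhenAll(batch.Select(p => pipeline.IngestAsync(p)));
}
```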
Tips
- Strip boilerplate before ingesting. A nav bar repeated on every HTML page becomes the most common chunk and dilutes search quality, so pre-process if needed; one possible approach is sketched after this list.
- Tag chunks with rich metadata. Source file, section, version, permission scope. Use the metadata to filter search results — many vector stores let you query with metadata predicates.
- Re-ingestion is fine. The pipeline derives chunk IDs from content; running ingest twice on the same file is a no-op.
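For the boilerplate tip, here is a minimal frequency-based filter. It is a sketch, not a LogicGrid feature: lines that appear in more than 80% of documents are treated as boilerplate, and the cleaned text goes through temp `.txt` files so the existing `IngestAsync(path)` API can pick it up.

```csharp
// Count, per line of text, how many documents it appears in.
var loader = new HtmlLoader();
var docs = new List<(string Path, string[] Lines)>();
foreach (var path in Directory.GetFiles("./site", "*.html"))
{
    var pages = await loader.LoadAsync(path);
    docs.Add((path, pages.SelectMany(p => p.Text.Split('\n')).ToArray()));
}

var lineCounts = new Dictionary<string, int>();
foreach (var (_, lines) in docs)
    foreach (var line in lines.Distinct())
        lineCounts[line] = lineCounts.GetValueOrDefault(line) + 1;

// Drop lines present in more than 80% of documents, then ingest the rest.
double threshold = docs.Count * 0.8;
foreach (var (_, lines) in docs)
{
    var cleaned = string.Join("\n", lines.Where(l => lineCounts[l] <= threshold));
    var tmp = Path.ChangeExtension(Path.GetTempFileName(), ".txt");
    await File.WriteAllTextAsync(tmp, cleaned);
    await pipeline.IngestAsync(tmp);
}
```

In practice you would likely fold this into a custom `IDocumentLoader` instead, so the original file path survives in the page metadata.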