RAG over conversation history
When a conversation grows past the model's context window — or past what your budget tolerates — a sliding window starts dropping information that the agent later needs. The standard fix is to stop sending the whole history every turn and instead retrieve only the parts relevant to the current question.
This is the same RAG flow you'd use for documents, applied to your own message log:
- As the conversation progresses, embed and store each turn (or each pair of turns) in a vector store.
- On a new user question, embed the question and retrieve the top-K most relevant past turns.
- Inject those turns into the agent's prompt instead of the full history (or alongside a small recency window).
Wiring it up
You can either:
- Use the standard `RagPipeline` and ingest each conversation turn as a tiny document (cheap and works today), or
- Override `AgentBase<T>.RenderSystemPromptAsync` and inject the retrieved turns into a `{{history}}` slot via `PromptTemplate`, so every call automatically pulls the right turns.
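The override route might look like the sketch below. The exact shapes of `AgentBase<T>`, `RenderSystemPromptAsync`, and `PromptTemplate.Render` are assumptions here (the signatures, the `Render` argument, and the constructor are illustrative), so check them against your version of the library before copying.

```csharp
using System.Linq;
using System.Threading.Tasks;
using LogicGrid.Rag;

// Hypothetical sketch: the real AgentBase<T> and PromptTemplate
// signatures may differ from what is assumed here.
public class RecallingAgent : AgentBase<string>
{
    private readonly RagPipeline _pipeline;

    // Assumed: PromptTemplate takes a template string with {{slot}} markers.
    private readonly PromptTemplate _template = new PromptTemplate(
        "You are a helpful assistant.\n\nRelevant earlier turns:\n{{history}}");

    public RecallingAgent(RagPipeline pipeline) => _pipeline = pipeline;

    // Assumed signature: called once per turn with the incoming user message.
    protected override async Task<string> RenderSystemPromptAsync(string userMessage)
    {
        // Pull only the past turns relevant to this message.
        var relevant = await _pipeline.SearchAsync(query: userMessage, topK: 5);
        var history = string.Join("\n", relevant.Select(r => r.Document.Text));

        // Assumed: Render fills the {{history}} slot from the given values.
        return _template.Render(new { history });
    }
}
```

Because the retrieval happens inside the prompt-rendering hook, every call to the agent pulls the right turns without the caller having to manage history at all.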
A minimal pattern that works today:
```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using LogicGrid.Memory.Embeddings;
using LogicGrid.Memory.VectorStores;
using LogicGrid.Rag;

var embedder = new OllamaEmbeddingClient("nomic-embed-text");
var store = new InMemoryVectorStore();
var pipeline = new RagPipeline(embedder, store);

// After each turn, store it (the id can encode the role, timestamp, etc.):
async Task RememberAsync(string role, string content, DateTime when)
{
    await pipeline.IngestTextAsync(
        text: $"[{role} @ {when:O}] {content}",
        sourceId: $"{role}-{when.Ticks}");
}

// On the next user question, retrieve the relevant past turns:
var relevant = await pipeline.SearchAsync(
    query: userQuestion,
    topK: 5);

var historySlot = string.Join("\n", relevant.Select(r => r.Document.Text));
```
Pair this with Sliding window — keep the last few turns verbatim for short-term continuity, and bring in older relevant turns via retrieval.
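Concretely, combining the two can be as simple as keeping the last few turns verbatim and prepending the retrieved older ones, skipping any that already sit in the recency window. A self-contained sketch (the names `recentTurns`, `retrievedTurns`, and the separator line are illustrative, not part of the library):

```csharp
using System.Collections.Generic;
using System.Linq;

// Merge a short verbatim recency window with older retrieved turns.
// `recentTurns` is the sliding window; `retrievedTurns` is the top-K
// result of the vector search (e.g. pipeline.SearchAsync).
static string BuildHistorySlot(
    IReadOnlyList<string> recentTurns,    // most recent turns, in order
    IReadOnlyList<string> retrievedTurns, // top-K relevant older turns
    int maxRecent = 4)
{
    var recent = recentTurns.TakeLast(maxRecent).ToList();

    // Drop retrieved turns that are already in the recency window,
    // so the prompt never repeats the same turn twice.
    var older = retrievedTurns.Where(t => !recent.Contains(t));

    return string.Join("\n",
        older.Append("--- recent turns ---").Concat(recent));
}
```

Older relevant turns go first and the verbatim window goes last, so the freshest context sits closest to the user's question in the prompt.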
Why this is better than just a longer window
- Cost — you pay for the few retrieved turns, not the entire log.
- Quality — long contexts dilute attention; a handful of retrieved, relevant turns is easier for the model to use than the same turns buried in the middle of a long prompt.
- Bounded prompt size — the prompt stays the same shape no matter how long the conversation runs.