How to Use Ollama with C# — A Production-Ready Guide
If you want to run an LLM from a C# application without sending data to OpenAI or Anthropic, Ollama is the easiest path. It runs llama3, mistral, qwen, deepseek, and dozens of other models on your laptop or server, and exposes a simple HTTP API.
This guide walks through using Ollama from C# end-to-end — installation, basic chat, streaming, embeddings, tool calling, and the production gotchas you only learn after you ship.
Why Ollama from C#?
Three real reasons teams choose this path:
- Data privacy. Your prompts and responses never leave your network. Critical for regulated industries.
- Cost. No per-token billing. A single GPU server can serve thousands of requests per day.
- Latency. No network round-trip to OpenAI's data centers. For interactive applications, time to first token matters, and a warm local model avoids that overhead.
The trade-off: smaller models, more infrastructure to manage, and a steeper learning curve around prompt engineering for less-capable models. For many use cases — summarization, classification, extraction, simple chatbots — that trade-off is worth it.
Step 1: Install Ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS / Windows
# Download the installer from https://ollama.ai/download
Pull a model:
ollama pull llama3.2
Verify it works:
ollama run llama3.2
> Hello
Hello! How can I help you today?
By default, Ollama listens on http://localhost:11434.
Step 2: Talk to it from C# — the hard way
Ollama exposes an HTTP API. You can hit it directly:
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };
var request = new
{
    model = "llama3.2",
    prompt = "What is the capital of France?",
    stream = false
};
var response = await http.PostAsJsonAsync("/api/generate", request);
// Fail fast on 4xx/5xx instead of trying to parse an error body.
response.EnsureSuccessStatusCode();
var json = await response.Content.ReadAsStringAsync();
using var doc = JsonDocument.Parse(json);
// With stream = false the full completion arrives in the "response" property.
var answer = doc.RootElement.GetProperty("response").GetString();
Console.WriteLine(answer);
This works, but you're now in the business of:
- Hand-rolling timeouts and HTTP retries for cold starts (the first request can take ~30s while the model loads, and a busy server can reject requests with 503)
- Streaming responses (each token comes back as a separate JSON line)
- Tool calling (Ollama's tool format is similar to OpenAI's, but not identical)
- Embedding endpoints (a different API shape)
- Switching providers (you'll rewrite all of this when you add OpenAI)
For a one-off script that's fine. For anything that ships, you want a library.
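To make the retry point concrete, here is a minimal sketch of exponential backoff around the raw HTTP call. The helper names (BackoffDelay, IsTransient, PostWithRetryAsync) and the choice of which status codes count as transient are assumptions for illustration, not part of Ollama or any library.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Quick look at the delay schedule for the third attempt.
Console.WriteLine(BackoffDelay(2).TotalSeconds);

// Exponential backoff: 1s, 2s, 4s, ... capped at 30s.
static TimeSpan BackoffDelay(int attempt) =>
    TimeSpan.FromSeconds(Math.Min(30, Math.Pow(2, attempt)));

// Assumption: only "busy" style responses are worth retrying.
static bool IsTransient(HttpStatusCode status) =>
    status == HttpStatusCode.ServiceUnavailable ||
    status == HttpStatusCode.RequestTimeout ||
    status == HttpStatusCode.TooManyRequests;

// Posts a JSON payload, retrying transient failures up to maxAttempts times.
static async Task<HttpResponseMessage> PostWithRetryAsync(
    HttpClient http, string path, object payload, int maxAttempts = 4)
{
    for (var attempt = 0; ; attempt++)
    {
        try
        {
            var response = await http.PostAsJsonAsync(path, payload);
            if (!IsTransient(response.StatusCode) || attempt == maxAttempts - 1)
                return response;
            response.Dispose();
        }
        catch (HttpRequestException) when (attempt < maxAttempts - 1)
        {
            // Connection refused or reset: the server may still be starting.
        }
        await Task.Delay(BackoffDelay(attempt));
    }
}
```

A call would look like `await PostWithRetryAsync(http, "/api/generate", request)`. Client-side timeouts surface as TaskCanceledException and are deliberately not retried in this sketch.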
Step 3: The easy way
LogicGrid provides a unified C# API across Ollama, OpenAI, Anthropic, Gemini, and any OpenAI-compatible endpoint. The Ollama integration is first-class, not an afterthought.
dotnet add package LogicGrid.Core
using LogicGrid.Core.Agents;
using LogicGrid.Core.Llm;
var llm = LlmClientBase.Ollama("llama3.2");
IAgent agent = new Agent<string>(
    name: "Helper",
    description: "A helpful assistant.",
    systemPrompt: "Answer the user concisely.",
    llm: llm);
var result = await agent.RunAsync(
    "What is the capital of France?", new AgentContext("ask"));
Console.WriteLine(result);
That's the entire program. Two using statements, four logical lines.
Working with full responses
RunAsync returns the complete LLM response once it's ready. For chat UIs that want tokens to appear progressively, you can drop down to the LlmClientBase directly and use the underlying chat-completion APIs the framework exposes — see the provider docs for the streaming patterns each provider supports.
For most non-interactive workloads (summarization, classification, batch jobs) RunAsync is what you want.
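If you do need progressive tokens and are willing to go around the framework, one option is to stream from Ollama's HTTP API directly. The sketch below assumes the /api/generate endpoint with stream set to true, which returns one JSON object per line carrying a "response" fragment and a "done" flag. ParseChunk and StreamAsync are hypothetical helper names, not LogicGrid APIs.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;
using System.Threading.Tasks;

// Offline demo of the per-line parsing step.
var demoChunk = ParseChunk("{\"response\":\"Par\",\"done\":false}");
Console.WriteLine(demoChunk.Text);

// Pulls the text fragment and the "done" flag out of one streamed JSON line.
static (string Text, bool Done) ParseChunk(string jsonLine)
{
    using var doc = JsonDocument.Parse(jsonLine);
    var root = doc.RootElement;
    var text = root.TryGetProperty("response", out var r) ? r.GetString() ?? "" : "";
    var done = root.TryGetProperty("done", out var d) && d.ValueKind == JsonValueKind.True;
    return (text, done);
}

// Streams a completion and invokes onToken for each fragment as it arrives.
static async Task StreamAsync(
    HttpClient http, string model, string prompt, Action<string> onToken)
{
    using var request = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
    {
        Content = JsonContent.Create(new { model, prompt, stream = true })
    };
    // ResponseHeadersRead stops HttpClient from buffering the whole body first.
    using var response = await http.SendAsync(
        request, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();

    using var reader = new StreamReader(await response.Content.ReadAsStreamAsync());
    while (await reader.ReadLineAsync() is { } line)
    {
        if (line.Length == 0) continue;
        var (text, done) = ParseChunk(line);
        onToken(text);
        if (done) break;
    }
}
```

With an HttpClient pointed at http://localhost:11434, `await StreamAsync(http, "llama3.2", "Why is the sky blue?", Console.Write)` would print fragments as they arrive.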
Embeddings
Ollama can generate embeddings using models like nomic-embed-text or mxbai-embed-large. These are useful for semantic search, RAG, clustering — anything where you need a vector representation of text.
ollama pull nomic-embed-text
using LogicGrid.Memory.Embeddings;
var embedder = new OllamaEmbeddingClient("nomic-embed-text");
float[] vector = await embedder.EmbedAsync(
    "The quick brown fox jumps over the lazy dog.");
Console.WriteLine($"Vector length: {vector.Length}");
// 768 for nomic-embed-text
The same embedder plugs into LogicGrid's RAG pipeline:
var pipeline = new RagPipeline(embedder, new InMemoryVectorStore());
await pipeline.IngestAsync("./docs/manual.txt");
var hits = await pipeline.SearchAsync("how do I configure SSL?", topK: 3);
See the Build a RAG pipeline in C# guide for the full pattern.
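To show what semantic search does with these vectors, here is a small sketch of cosine similarity, a common way to compare two embeddings. It is a standalone helper written for illustration; whether LogicGrid's vector store uses this exact metric is not something the text above states.

```csharp
using System;

// Identical directions score 1.
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 1f, 0f }));

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
```

Ranking documents by this score against the query vector is the core of a top-k search. Both vectors need to come from the same embedding model for the comparison to be meaningful.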
Tool calling
Ollama-hosted models can use tools — functions you define in C# that the LLM decides to call when relevant. This is how you build agents that can hit your database, query an API, or read a file.
using System.ComponentModel;
using System.Threading;
using System.Threading.Tasks;
using LogicGrid.Core.Agents;
using LogicGrid.Core.Llm;
using LogicGrid.Core.Tools;
// In a single file, top-level statements must come before type declarations.
var llm = LlmClientBase.Ollama("llama3.2");
IAgent agent = new Agent<string>(
    name: "WeatherBot",
    description: "Tells the user the weather.",
    systemPrompt: "Use tools when asked about weather.",
    llm: llm,
    tools: new ToolBase[] { new WeatherTool() });
Console.WriteLine(await agent.RunAsync(
    "What's the weather in Berlin?", new AgentContext()));
// Typed arguments the model fills in when it calls the tool.
public class WeatherArgs
{
    [Description("Name of the city to look up weather for.")]
    public string City { get; set; } = string.Empty;
}
public sealed class WeatherTool : ToolBase<WeatherArgs>
{
    public override string Name => "get_weather";
    public override string Description =>
        "Returns the current weather for a city.";
    public override Task<string> ExecuteAsync(
        WeatherArgs args, CancellationToken ct = default)
    {
        // Call your real weather API here
        return Task.FromResult(
            $"Weather in {args.City}: 18°C, partly cloudy.");
    }
}
LogicGrid auto-generates the JSON schema for each tool from the typed args class — you don't hand-write schema definitions.
Gotcha: not every Ollama model supports tool calling well. Stick to recent models from Meta (llama3.2, llama3.3) or Mistral (mistral-nemo) for reliable function calling. Older or smaller models will hallucinate tool arguments.
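Because arguments can be hallucinated, it helps to validate them before executing anything. The sketch below works against the raw Ollama /api/chat response shape, where tool calls appear under message.tool_calls with an arguments object. The "city" parameter name, the 100-character limit, and the TryReadWeatherCall helper are assumptions for illustration; the key LogicGrid derives from WeatherArgs may differ.

```csharp
using System;
using System.Text.Json;

// Offline demo with a hand-written response body.
var sampleReply =
    "{\"message\":{\"role\":\"assistant\",\"content\":\"\",\"tool_calls\":[" +
    "{\"function\":{\"name\":\"get_weather\",\"arguments\":{\"city\":\"Berlin\"}}}]}}";
Console.WriteLine(TryReadWeatherCall(sampleReply, out var demoCity) ? demoCity : "rejected");

// Returns true only if the reply has a get_weather call with a usable string "city".
static bool TryReadWeatherCall(string chatResponseJson, out string city)
{
    city = "";
    using var doc = JsonDocument.Parse(chatResponseJson);
    var root = doc.RootElement;
    if (root.ValueKind != JsonValueKind.Object ||
        !root.TryGetProperty("message", out var message) ||
        message.ValueKind != JsonValueKind.Object ||
        !message.TryGetProperty("tool_calls", out var calls) ||
        calls.ValueKind != JsonValueKind.Array)
        return false;

    foreach (var call in calls.EnumerateArray())
    {
        if (call.ValueKind != JsonValueKind.Object ||
            !call.TryGetProperty("function", out var fn) ||
            fn.ValueKind != JsonValueKind.Object) continue;
        if (!fn.TryGetProperty("name", out var name) ||
            name.ValueKind != JsonValueKind.String ||
            name.GetString() != "get_weather") continue;
        if (!fn.TryGetProperty("arguments", out var args) ||
            args.ValueKind != JsonValueKind.Object) continue;
        if (!args.TryGetProperty("city", out var cityArg) ||
            cityArg.ValueKind != JsonValueKind.String) continue;

        // Reject empty or implausibly long values instead of passing them on.
        var value = (cityArg.GetString() ?? "").Trim();
        if (value.Length > 0 && value.Length <= 100)
        {
            city = value;
            return true;
        }
    }
    return false;
}
```

The same idea applies inside a tool's ExecuteAsync: check the typed arguments and return an error string the model can react to rather than calling a downstream API with bad input.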
Production gotchas nobody tells you
1. The first request after a cold start is slow
Ollama keeps a model in memory after use (for five minutes by default, configurable with keep_alive), but the first request after a cold start or an unload has to load the model from disk. That can take 30+ seconds for a 7B model. Send a tiny warm-up request when your service starts:
await llm.CompleteAsync("ping", maxTokens: 1);
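If you would rather warm up over the raw HTTP API, a sketch along these lines may work: Ollama's /api/generate loads a model when the request names it but carries no prompt, and the keep_alive field controls how long it stays resident. BuildWarmUpPayload, WarmUpAsync, and the "30m" value are illustrative choices, not LogicGrid APIs.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Show the payload that would be sent.
Console.WriteLine(BuildWarmUpPayload("llama3.2", "30m"));

// Builds a load-only request: no prompt, so the model is loaded without generating text.
static string BuildWarmUpPayload(string model, string keepAlive) =>
    JsonSerializer.Serialize(new { model, keep_alive = keepAlive });

// Sends the warm-up request; call once at service startup.
static async Task WarmUpAsync(HttpClient http, string model, string keepAlive = "30m")
{
    using var content = new StringContent(
        BuildWarmUpPayload(model, keepAlive), Encoding.UTF8, "application/json");
    using var response = await http.PostAsync("/api/generate", content);
    response.EnsureSuccessStatusCode();
}
```

A longer keep_alive trades GPU memory for fewer cold starts, so pick a value that matches your traffic pattern.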
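2. Concurrent requests on a single GPU help less than you'd expect
Ollama processes only a limited number of requests in parallel per loaded model (set by the OLLAMA_NUM_PARALLEL environment variable); anything beyond that waits in a queue. Sending 10 concurrent requests doesn't make them go faster, because parallel slots share the same GPU memory and compute. If you need throughput, run multiple Ollama instances behind a load balancer or use vLLM (which supports continuous batching).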
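Since extra concurrency mostly just queues, one option is to cap in-flight requests on the client so callers get predictable latency instead of piling up on the server. The sketch below uses SemaphoreSlim for that; RunThrottledAsync is a hypothetical helper, and the right limit depends on your hardware and server settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Demo with a stand-in for the LLM call.
var doubled = await RunThrottledAsync(new[] { 1, 2, 3 }, 2, n => Task.FromResult(n * 2));
Console.WriteLine(string.Join(", ", doubled));

// Runs work items with at most maxParallel in flight; results keep input order.
static async Task<TResult[]> RunThrottledAsync<TInput, TResult>(
    IEnumerable<TInput> inputs, int maxParallel, Func<TInput, Task<TResult>> work)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = inputs.Select(async input =>
    {
        await gate.WaitAsync();
        try { return await work(input); }
        finally { gate.Release(); }
    }).ToArray();
    return await Task.WhenAll(tasks);
}
```

In a real service the `work` delegate would wrap the agent or HTTP call, and maxParallel would be tuned to match what the Ollama instance can actually process at once.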
3. Context windows lie
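llama3.2 advertises 128k context. In practice quality degrades sharply past 8k–16k. Ollama also applies a much smaller default context length (the num_ctx option) unless you raise it, so long prompts can be truncated without an error. Don't shovel huge documents in raw; chunk and retrieve.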
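As a rough illustration of chunking, the sketch below splits text into overlapping windows of words. Word counts are only a proxy for tokens, and ChunkByWords along with the sizes used are assumptions for the example; a production pipeline would more likely split on sentences or headings and count real tokens.

```csharp
using System;
using System.Collections.Generic;

foreach (var piece in ChunkByWords("one two three four five", 3, 1))
    Console.WriteLine(piece);

// Splits text into chunks of at most chunkWords words.
// Consecutive chunks share `overlap` words so context is not cut mid-thought.
static List<string> ChunkByWords(string text, int chunkWords, int overlap)
{
    if (chunkWords <= 0)
        throw new ArgumentOutOfRangeException(nameof(chunkWords));
    if (overlap < 0 || overlap >= chunkWords)
        throw new ArgumentOutOfRangeException(nameof(overlap));

    var words = text.Split(
        new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    var chunks = new List<string>();
    var step = chunkWords - overlap;
    for (var start = 0; start < words.Length; start += step)
    {
        var count = Math.Min(chunkWords, words.Length - start);
        chunks.Add(string.Join(" ", words, start, count));
        if (start + count >= words.Length) break; // last window reached the end
    }
    return chunks;
}
```

Each chunk can then be embedded and stored, and only the top few retrieved chunks go into the prompt.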
4. Model versions move
ollama pull llama3.2 pulls whatever the default latest tag points to, which can change. Pin an explicit, fully qualified tag in production:
ollama pull llama3.2:3b-instruct-q4_K_M
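To go one step further and check what is actually installed, the local API's GET /api/tags endpoint lists each model with a digest. The sketch below parses that response so a service could compare the digest against an expected value at startup. FindDigest is a hypothetical helper, and the digest in the demo is a made-up placeholder.

```csharp
using System;
using System.Text.Json;

// Offline demo with a trimmed-down response body and a placeholder digest.
var sampleTags = "{\"models\":[{\"name\":\"llama3.2:latest\",\"digest\":\"abc123\"}]}";
Console.WriteLine(FindDigest(sampleTags, "llama3.2:latest"));

// Looks up a model's digest in the JSON returned by GET /api/tags.
// Returns null when the model is not installed or the shape is unexpected.
static string? FindDigest(string tagsJson, string modelName)
{
    using var doc = JsonDocument.Parse(tagsJson);
    if (doc.RootElement.ValueKind != JsonValueKind.Object ||
        !doc.RootElement.TryGetProperty("models", out var models) ||
        models.ValueKind != JsonValueKind.Array)
        return null;

    foreach (var m in models.EnumerateArray())
    {
        if (m.ValueKind == JsonValueKind.Object &&
            m.TryGetProperty("name", out var name) &&
            name.ValueKind == JsonValueKind.String &&
            name.GetString() == modelName &&
            m.TryGetProperty("digest", out var digest) &&
            digest.ValueKind == JsonValueKind.String)
            return digest.GetString();
    }
    return null;
}
```

Logging or asserting on this value at startup makes it visible when a re-pull has silently changed the weights behind a tag.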
5. JSON mode is unreliable on small models
Tell llama3.2:3b to return JSON and you'll get JSON 95% of the time. The other 5% will be markdown-wrapped JSON, JSON with trailing commentary, or garbage. Always validate and parse defensively.
LogicGrid's typed agents handle this — you specify Agent<T> with a structured output type and the framework retries with corrective prompts on parse failure.
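For code that talks to the model without that safety net, a defensive parse might look like the sketch below. It cuts the reply down to the span between the first opening brace and the last closing brace, then checks that the span parses as a JSON object. TryExtractJsonObject is a hypothetical helper and the heuristic is intentionally simple: it gives up (returns false) rather than guessing, so the caller can retry.

```csharp
using System;
using System.Text.Json;

Console.WriteLine(
    TryExtractJsonObject("Sure! {\"city\":\"Paris\"} Hope that helps.", out var cleaned)
        ? cleaned
        : "no JSON found");

// Tries to recover a JSON object from a reply that may be wrapped in
// markdown fences or surrounded by commentary.
static bool TryExtractJsonObject(string reply, out string json)
{
    json = "";
    var start = reply.IndexOf('{');
    var end = reply.LastIndexOf('}');
    if (start < 0 || end <= start) return false;

    var candidate = reply.Substring(start, end - start + 1);
    try
    {
        using var doc = JsonDocument.Parse(candidate);
        if (doc.RootElement.ValueKind != JsonValueKind.Object) return false;
        json = candidate;
        return true;
    }
    catch (JsonException)
    {
        // Not valid JSON after trimming; let the caller decide whether to retry.
        return false;
    }
}
```

After extraction, deserialize into your own type and validate required fields before trusting the values.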
When Ollama is not the right answer
Ollama is great for development, on-prem deployment, and small-scale production. It is not great when:
- You need batched throughput (use vLLM with LlmClientBase.OpenAI pointed at the vLLM endpoint)
- You need top-tier model quality (Claude or GPT-4 still outperform anything you can run locally for complex reasoning)
- You're embedding millions of documents (use a hosted embedding API or TEI for serious throughput)
LogicGrid handles all of these — same code, different LlmClientBase.* factory call.
Next steps
- Quickstart — build your first agent
- Provider setup — Ollama, OpenAI, Anthropic, Gemini, vLLM, TEI
- Build a RAG pipeline — full RAG walkthrough using Ollama embeddings
If you're evaluating frameworks, the Semantic Kernel alternative post covers the trade-offs in more depth.
