How to Use Ollama with C# — A Production-Ready Guide

Name: LogicGrid
Author: LogicGrid

March 26, 2026 · 7 min read

If you want to run an LLM from a C# application without sending data to OpenAI or Anthropic, Ollama is the easiest path. It runs llama3, mistral, qwen, deepseek, and dozens of other models on your laptop or server, and exposes a simple HTTP API.

This guide walks through using Ollama from C# end-to-end — installation, basic chat, streaming, embeddings, tool calling, and the production gotchas you only learn after you ship.

Why Ollama from C#?

Three real reasons teams choose this path:

Data privacy. Your prompts and responses never leave your network. Critical for regulated industries.
Cost. No per-token billing. A single GPU server can serve thousands of requests per day.
Latency. No round-trip to a remote provider's data center. For interactive applications, removing network time from first-token latency matters.

The trade-off: smaller models, more infrastructure to manage, and a steeper learning curve around prompt engineering for less-capable models. For many use cases — summarization, classification, extraction, simple chatbots — that trade-off is worth it.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

Pull a model:

ollama pull llama3.2

Verify it works:

ollama run llama3.2
> Hello
Hello! How can I help you today?

By default, Ollama listens on http://localhost:11434.
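
Before wiring anything else up, it can help to confirm from C# that the server is reachable and the model is installed. The sketch below assumes Ollama's GET /api/tags endpoint, which lists local models as a models array with a name field per entry (for example llama3.2:latest). OllamaProbe and its method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

public static class OllamaProbe
{
    // Extracts model names from the JSON body returned by GET /api/tags.
    // Assumed shape: { "models": [ { "name": "llama3.2:latest" }, ... ] }
    public static List<string> ParseModelNames(string tagsJson)
    {
        var names = new List<string>();
        using var doc = JsonDocument.Parse(tagsJson);

        if (doc.RootElement.TryGetProperty("models", out var models) &&
            models.ValueKind == JsonValueKind.Array)
        {
            foreach (var model in models.EnumerateArray())
            {
                if (model.TryGetProperty("name", out var name) &&
                    name.ValueKind == JsonValueKind.String)
                {
                    names.Add(name.GetString()!);
                }
            }
        }
        return names;
    }

    // Returns true when the server answers and lists the given model,
    // either by exact name or as "model:tag".
    public static async Task<bool> HasModelAsync(HttpClient http, string model)
    {
        var json = await http.GetStringAsync("/api/tags");
        return ParseModelNames(json).Exists(n =>
            n.Equals(model, StringComparison.OrdinalIgnoreCase) ||
            n.StartsWith(model + ":", StringComparison.OrdinalIgnoreCase));
    }
}
```

Calling OllamaProbe.HasModelAsync(http, "llama3.2") at startup, with http pointed at http://localhost:11434, turns a missing model into a clear error rather than a failed first request.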

Step 2: Talk to it from C# — the hard way

Ollama exposes an HTTP API. You can hit it directly:

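using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Text.Json;

// Reuse a single HttpClient for the lifetime of the app.
var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

// stream = false asks for one JSON object instead of a stream of lines.
var request = new
{
    model = "llama3.2",
    prompt = "What is the capital of France?",
    stream = false
};

using var response = await http.PostAsJsonAsync("/api/generate", request);
response.EnsureSuccessStatusCode(); // fail fast instead of parsing an error body

var json = await response.Content.ReadAsStringAsync();

// The generated text is in the "response" property.
using var doc = JsonDocument.Parse(json);
var answer = doc.RootElement.GetProperty("response").GetString();
Console.WriteLine(answer);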

This works, but you're now in the business of:

Hand-rolling timeouts and retries (the first request can block for 30+ seconds while the model loads, and an overloaded server answers with 503)
Streaming responses (each chunk comes back as a separate JSON line)
Tool calling (Ollama's tool format is similar to OpenAI's, but not identical)
Embedding endpoints (a different API shape)
Switching providers (you'll rewrite all of this when you add OpenAI)
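
To make the streaming item concrete: with stream set to true, /api/generate returns newline-delimited JSON, where each line carries a response fragment and the final line has done set to true. The sketch below parses that format from any TextReader. OllamaStream is a hypothetical helper name, and the synchronous ReadLine keeps the example short; a real client would likely read asynchronously.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public static class OllamaStream
{
    // Reads newline-delimited JSON and yields each "response" fragment
    // until a line reports "done": true or the reader is exhausted.
    public static IEnumerable<string> ReadFragments(TextReader reader)
    {
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            using var doc = JsonDocument.Parse(line);
            var root = doc.RootElement;

            if (root.TryGetProperty("response", out var piece) &&
                piece.ValueKind == JsonValueKind.String)
            {
                yield return piece.GetString()!;
            }

            if (root.TryGetProperty("done", out var done) &&
                done.ValueKind == JsonValueKind.True)
            {
                yield break;
            }
        }
    }
}
```

With HttpClient, the reader could be a StreamReader over response.Content.ReadAsStreamAsync(), with the request sent through SendAsync and HttpCompletionOption.ResponseHeadersRead so the body is not buffered before you start reading.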

For a one-off script that's fine. For anything that ships, you want a library.

Step 3: The easy way

LogicGrid provides a unified C# API across Ollama, OpenAI, Anthropic, Gemini, and any OpenAI-compatible endpoint. The Ollama integration is first-class, not an afterthought.

dotnet add package LogicGrid.Core

using LogicGrid.Core.Agents;
using LogicGrid.Core.Llm;

var llm = LlmClientBase.Ollama("llama3.2");

IAgent agent = new Agent<string>(
    name: "Helper",
    description: "A helpful assistant.",
    systemPrompt: "Answer the user concisely.",
    llm: llm);

var result = await agent.RunAsync(
    "What is the capital of France?", new AgentContext("ask"));

Console.WriteLine(result);

That's the entire program: two using directives and four statements.

Working with full responses

RunAsync returns the complete LLM response once generation finishes. For chat UIs that render tokens progressively, drop down to LlmClientBase and use the chat-completion APIs it exposes; the provider docs describe the streaming patterns each provider supports.

For most non-interactive workloads (summarization, classification, batch jobs) RunAsync is what you want.

Embeddings

Ollama can generate embeddings using models like nomic-embed-text or mxbai-embed-large. These are useful for semantic search, RAG, clustering — anything where you need a vector representation of text.

ollama pull nomic-embed-text

using LogicGrid.Memory.Embeddings;

var embedder = new OllamaEmbeddingClient("nomic-embed-text");

float[] vector = await embedder.EmbedAsync(
    "The quick brown fox jumps over the lazy dog.");

Console.WriteLine($"Vector length: {vector.Length}");
// 768 for nomic-embed-text

The same embedder plugs into LogicGrid's RAG pipeline:

var pipeline = new RagPipeline(embedder, new InMemoryVectorStore());
await pipeline.IngestAsync("./docs/manual.txt");
var hits = await pipeline.SearchAsync("how do I configure SSL?", topK: 3);

See the Build a RAG pipeline in C# guide for the full pattern.
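
Semantic search over embedding vectors usually comes down to ranking by a similarity measure, most often cosine similarity. The sketch below shows that calculation on plain float arrays. It is an illustration of the idea, not LogicGrid's implementation, and VectorMath is a made-up name.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    // Returns a value in [-1, 1]; higher means more similar direction.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // A zero vector has no direction; treat it as "not similar".
        if (normA == 0 || normB == 0) return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Given a query vector and a set of document vectors from the same embedding model, sorting documents by this score in descending order gives a basic top-k search.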

Tool calling

Ollama-hosted models can use tools — functions you define in C# that the LLM decides to call when relevant. This is how you build agents that can hit your database, query an API, or read a file.

using System.ComponentModel;
using LogicGrid.Core.Agents;
using LogicGrid.Core.Llm;
using LogicGrid.Core.Tools;

public class WeatherArgs
{
    [Description("Name of the city to look up weather for.")]
    public string City { get; set; } = string.Empty;
}

public sealed class WeatherTool : ToolBase<WeatherArgs>
{
    public override string Name => "get_weather";
    public override string Description =>
        "Returns the current weather for a city.";

    public override Task<string> ExecuteAsync(
        WeatherArgs args, CancellationToken ct = default)
    {
        // Call your real weather API here
        return Task.FromResult(
            $"Weather in {args.City}: 18°C, partly cloudy.");
    }
}

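// Note: if everything lives in one file, C# requires these top-level
// statements to appear before the class declarations above.
var llm = LlmClientBase.Ollama("llama3.2");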

IAgent agent = new Agent<string>(
    name: "WeatherBot",
    description: "Tells the user the weather.",
    systemPrompt: "Use tools when asked about weather.",
    llm: llm,
    tools: new ToolBase[] { new WeatherTool() });

Console.WriteLine(await agent.RunAsync(
    "What's the weather in Berlin?", new AgentContext()));

LogicGrid auto-generates the JSON schema for each tool from the typed args class — you don't hand-write schema definitions.
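
For intuition, schema generation of this kind can be done with reflection over the args class. The sketch below is a simplified illustration and not LogicGrid's actual generator: ToolSchemaSketch is a hypothetical name, only a handful of primitive types are mapped, and every property is marked required.

```csharp
using System;
using System.ComponentModel;
using System.Reflection;
using System.Text.Json.Nodes;

public static class ToolSchemaSketch
{
    // Builds a minimal JSON-Schema-style object from a type's public
    // instance properties, using [Description] for the description text.
    public static JsonObject FromType(Type argsType)
    {
        var properties = new JsonObject();
        var required = new JsonArray();

        foreach (var prop in argsType.GetProperties(
                     BindingFlags.Public | BindingFlags.Instance))
        {
            var schema = new JsonObject
            {
                ["type"] = JsonTypeName(prop.PropertyType)
            };

            var description =
                prop.GetCustomAttribute<DescriptionAttribute>()?.Description;
            if (!string.IsNullOrEmpty(description))
                schema["description"] = description;

            properties[prop.Name] = schema;
            required.Add(JsonValue.Create(prop.Name));
        }

        return new JsonObject
        {
            ["type"] = "object",
            ["properties"] = properties,
            ["required"] = required
        };
    }

    // Maps a few common CLR types to JSON Schema type names.
    private static string JsonTypeName(Type type)
    {
        if (type == typeof(string)) return "string";
        if (type == typeof(bool)) return "boolean";
        if (type == typeof(int) || type == typeof(long)) return "integer";
        if (type == typeof(float) || type == typeof(double) ||
            type == typeof(decimal)) return "number";
        return "object";
    }
}
```

Calling ToolSchemaSketch.FromType(typeof(WeatherArgs)).ToJsonString() on the args class above would produce an object schema with a single string property named City carrying its description.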

Gotcha: not every Ollama model supports tool calling well. Stick to recent models from Meta (llama3.2, llama3.3) or Mistral (mistral-nemo) for reliable function calling. Older or smaller models will hallucinate tool arguments.

Production gotchas nobody tells you

1. The first request after a cold start is slow

Ollama keeps a model in memory after first use (for about five minutes of idle time by default, configurable with keep_alive), but the first request has to load the model from disk, which can take 30+ seconds for a 7B model. Send a tiny warm-up request when your service starts:

await llm.CompleteAsync("ping", maxTokens: 1);
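
If you talk to Ollama over raw HTTP instead, a similar warm-up can be sketched as follows. It assumes two behaviors of /api/generate: a request with a model but no prompt only loads the model, and keep_alive accepts a duration string such as "30m" that controls how long the model stays resident. OllamaWarmup and its members are hypothetical names.

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading;
using System.Threading.Tasks;

public static class OllamaWarmup
{
    // No prompt: the server should load the model and return immediately.
    // keep_alive asks it to stay loaded for the given idle duration.
    public static object BuildPayload(string model, string keepAlive) =>
        new { model, keep_alive = keepAlive };

    public static async Task WarmAsync(
        HttpClient http,
        string model,
        string keepAlive = "30m",
        CancellationToken ct = default)
    {
        using var response = await http.PostAsJsonAsync(
            "/api/generate", BuildPayload(model, keepAlive), ct);
        response.EnsureSuccessStatusCode();
    }
}
```

HttpClient's default timeout is 100 seconds, so for large models it may be worth raising http.Timeout before calling WarmAsync.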

2. Concurrent requests on a single GPU don't help

Ollama processes only a limited number of requests per model at once (governed by the OLLAMA_NUM_PARALLEL setting), and those requests share the same GPU. Sending 10 concurrent requests doesn't make them finish faster; most of them queue. If you need throughput, run multiple Ollama instances behind a load balancer or use vLLM (which supports continuous batching).
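
Since extra concurrency mostly turns into queueing, it can be cleaner to cap in-flight requests on the client side, so that timeouts and cancellation behave predictably. A minimal sketch using SemaphoreSlim follows. RequestThrottle is a hypothetical helper, and the right limit depends on your hardware and server settings.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class RequestThrottle : IDisposable
{
    private readonly SemaphoreSlim _slots;

    public RequestThrottle(int maxConcurrent) =>
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    // Runs the work once a slot is free; callers beyond the limit wait
    // here instead of piling up on the server.
    public async Task<T> RunAsync<T>(
        Func<Task<T>> work, CancellationToken ct = default)
    {
        await _slots.WaitAsync(ct);
        try
        {
            return await work();
        }
        finally
        {
            _slots.Release();
        }
    }

    public void Dispose() => _slots.Dispose();
}
```

A shared instance such as new RequestThrottle(2) can wrap each LLM call, for example throttle.RunAsync(() => agent.RunAsync(prompt, context)).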

3. Context windows lie

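llama3.2 advertises a 128k context window. In practice quality degrades sharply past 8k–16k tokens. Ollama also runs with a much smaller default context length unless you raise num_ctx, and input beyond that limit is truncated. Don't shovel huge documents in raw; chunk and retrieve.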
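
A minimal sketch of the chunking step, assuming whitespace-separated words as a rough stand-in for tokens (real token counts depend on the model's tokenizer). TextChunker is a made-up helper; consecutive chunks overlap so that a sentence cut at a boundary still appears whole in one of them.

```csharp
using System;
using System.Collections.Generic;

public static class TextChunker
{
    // Splits text into chunks of at most maxWords words. Each chunk
    // starts (maxWords - overlap) words after the previous one.
    public static List<string> Chunk(string text, int maxWords, int overlap)
    {
        if (maxWords <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxWords));
        if (overlap < 0 || overlap >= maxWords)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        // A null separator array splits on any whitespace.
        var words = text.Split(
            (char[]?)null, StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        int step = maxWords - overlap;

        for (int start = 0; start < words.Length; start += step)
        {
            int count = Math.Min(maxWords, words.Length - start);
            chunks.Add(string.Join(" ", words, start, count));

            // Stop once a chunk has reached the end of the text.
            if (start + count >= words.Length) break;
        }
        return chunks;
    }
}
```

Each chunk can then be embedded and stored separately, so that only the few most relevant chunks go into the prompt.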

4. Model versions move

ollama pull llama3.2 pulls the latest tag, which can change. Pin a fully qualified tag (size, variant, and quantization) in production:

ollama pull llama3.2:3b-instruct-q4_K_M

5. JSON mode is unreliable on small models

Tell llama3.2:3b to return JSON and you'll get JSON 95% of the time. The other 5% will be markdown-wrapped JSON, JSON with trailing commentary, or garbage. Always validate and parse defensively.

LogicGrid's typed agents handle this — you specify Agent<T> with a structured output type and the framework retries with corrective prompts on parse failure.
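
If you parse model output yourself, a defensive extraction step might look like the sketch below. It handles the failure modes listed above by taking the span from the first opening brace to a closing brace and shrinking it until it parses. JsonRescue is a hypothetical helper, and this is a heuristic, not a substitute for validating the result against your expected schema.

```csharp
using System.Text.Json;

public static class JsonRescue
{
    // Tries to pull one JSON object out of text that may be wrapped in
    // markdown fences or followed by commentary.
    public static bool TryExtractObject(string raw, out JsonElement result)
    {
        result = default;

        int start = raw.IndexOf('{');
        if (start < 0) return false;

        // Begin at the last '}' and walk backwards until a span parses.
        for (int end = raw.LastIndexOf('}');
             end > start;
             end = raw.LastIndexOf('}', end - 1))
        {
            try
            {
                using var doc = JsonDocument.Parse(
                    raw.Substring(start, end - start + 1));

                if (doc.RootElement.ValueKind == JsonValueKind.Object)
                {
                    // Clone so the element outlives the disposed document.
                    result = doc.RootElement.Clone();
                    return true;
                }
            }
            catch (JsonException)
            {
                // Not valid JSON yet; try a shorter span.
            }
        }
        return false;
    }
}
```

When TryExtractObject returns false, the caller can retry the request or fall back to a default instead of crashing on malformed output.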

When Ollama is not the right answer

Ollama is great for development, on-prem deployment, and small-scale production. It is not great when:

You need batched throughput (use vLLM with LlmClientBase.OpenAI pointed at the vLLM endpoint)
You need top-tier model quality (Claude or GPT-4 still outperform anything you can run locally for complex reasoning)
You're embedding millions of documents (use a hosted embedding API or TEI for serious throughput)

LogicGrid handles all of these — same code, different LlmClientBase.* factory call.

Next steps

Quickstart — build your first agent
Provider setup — Ollama, OpenAI, Anthropic, Gemini, vLLM, TEI
Build a RAG pipeline — full RAG walkthrough using Ollama embeddings

If you're evaluating frameworks, the Semantic Kernel alternative post covers the trade-offs in more depth.
