How do I monitor Ollama in production?

Monitor Ollama in production with its /api/tags endpoint for model inventory, /api/ps for running-model stats, and Prometheus scraping of /metrics. Track GPU utilization with DCGM Exporter and aggregate logs via OpenTelemetry.

What are the key metrics to monitor for Ollama inference?

Track requests/second, token throughput, TTFT (Time To First Token), TPOT (Time Per Output Token), GPU utilization, VRAM usage, model loading time, and error rates by failure mode (OOM, timeout, model crash).

How do I set up alerting for Ollama failures?

Set Prometheus alerting rules for GPU OOM errors (increase in CUDA out of memory events), model crash loops (restart count > 3/hour), high latency (p99 > 5s), and request queue depth > 100.

How can I tell if Ollama has silently fallen back to CPU?

Scrape GPU utilization with DCGM Exporter and alert when it stays at 0% for more than 2 minutes while requests are flowing. The /api/generate endpoint does not return an error code on CPU fallback; it just gets 30x slower, so HTTP health checks will not catch it and external GPU monitoring will.

What is a good p99 TTFT target for Ollama in production?

For 7B models on a single A100, TTFT p99 should be under 500ms. For 13B models, budget 800ms. For 70B Q4, expect 1.2-1.5s. If your numbers are 2-3x those, you are likely hitting CPU fallback, KV cache thrash, or model loading on cold start.

How to Monitor Ollama in Production: The Observability Stack

Why I Spent a Weekend Building Ollama Monitoring From Scratch

Two Saturdays ago I paged myself into a 4-hour outage because our Ollama pod fell back to CPU inference without telling anyone. Users saw response times jump from 1.2s to 38s, but every HTTP probe in our Kubernetes stack returned 200 OK — the Ollama /api/generate endpoint does not return an error code when it swaps to CPU, it just gets 30x slower. I had to read the source to figure out what had happened. By the time I shipped the Prometheus scrape config, the nvidia-smi exporter, and the VRAM alert at 90%, I had replaced most of the article I was planning to write with the article I wish I had read that morning. This is that article.

If you self-host Ollama behind your firewall (for cost, for data residency, or for latency), the operational reality is the same: nobody is going to page you when VRAM silently fills, when a model gets evicted under load, or when flash attention regresses on a 70B Q4. You need to instrument it yourself. The playbook below covers the metrics that actually predict incidents (VRAM headroom, model cache hit rate, TTFT p99, GPU temperature), the Prometheus scrape config, the Grafana dashboard layout, and the alert thresholds that have caught every outage we have had since.

What's New in Ollama v0.30: llama.cpp Native Build, GGUF First-Class, and MLX for Apple Silicon

Ollama v0.30, released May 19, 2026, is the most significant architectural shift since the project's inception. The core news: Ollama now builds natively on llama.cpp instead of the legacy GGML layer it carried since its early days, with full GGUF compatibility and first-class MLX Apple Silicon acceleration. If you're running Ollama in production, here is what changed in v0.21 through v0.30 that matters for your monitoring stack.

What actually changed (the short version)

llama.cpp native — Ollama v0.30 compiles against llama.cpp directly. All GGUF models work without translation or conversion layers. This means your existing Q4/Q5/Q8 quantized model files are now loaded by the same code path as native inference.
MLX first-class — Apple Silicon MLX support is no longer a community patch. ollama run llama3:70b on an M4 Max with MLX runs at full speed without any configuration.
KV cache improvements — v0.28 introduced a new KV cache format that significantly reduces memory overhead for long-context models (32k+ tokens). If you're running Mistral-Nemo or similar long-context models, this directly affects your VRAM utilization numbers.
Flash attention is on by default — flash attention was backported to stable in v0.26. If you disabled it for compatibility reasons, re-enabling it now gives you 2-4x speedup on attention-heavy workloads with no downside for most models.
Parallel inference improvements — v0.25 stabilized num_parallel handling and fixed a race condition in request queuing that caused spurious 503s under moderate load.
New /api/show endpoint — added in v0.27, exposes model metadata including quantization info, context length, and system prompt template. Useful for building accurate model inventories.

What this means for your monitoring

If you were monitoring Ollama before v0.30, two things change after the upgrade:

The version label on ollama_build_info changes from version="0.20.6" to version="0.30.0". The metric name itself is unchanged, so only Prometheus rules and Grafana legend templates that match on the label value need updating.
The new KV cache format in v0.28+ means your VRAM utilization per model is measurably lower for long-context workloads. If you set a VRAM alert threshold at 85%, re-benchmark after upgrading — your headroom is larger than before.

Upgrading from v0.20

# Check current version
ollama --version

# Upgrade Ollama
# macOS:
brew upgrade ollama
# Linux:
curl -fsSL https://ollama.com/install.sh | sh

# Verify new version
ollama --version
# Should show 0.30.0 or later

# Check that your installed models are still listed after restart
curl -s http://localhost:11434/api/tags | python3 -m json.tool

Model files are fully backward-compatible — your existing GGUF files work without re-downloading. The upgrade is a drop-in replacement for most deployments.

Why Ollama Is Now a Production Component

Ollama crossed the chasm in 2025. What started as a local dev tool for running Llama, Mistral, and Gemma models on developer laptops has become a legitimate production inference runtime — used by teams who need to run open-weight models behind their firewall, avoid API costs at scale, or serve quantized models with tight latency requirements.

The numbers tell the story: Ollama crossed 50 million downloads in 2025, and the v0.20 release cycle brought production-grade features — improved streaming, better CUDA 12 utilization, and flash attention fixes in v0.20.6 that matter for GPU-bound workloads. Today, teams are running Ollama in Kubernetes pods, on single-A100 workstations, and across distributed GPU clusters for RAG pipelines, code generation, and internal AI assistants.

But Ollama's operational model is different from hosted API services. When you're running your own inference server, you're also responsible for its monitoring, scaling, and reliability. This guide covers what that means in practice.

Ollama Architecture: What You're Actually Monitoring

Ollama runs as a long-lived HTTP server process (ollama serve) that loads one or more model files (stored as quantized GGUF files) into GPU VRAM. Your application code communicates with it via a REST API: /api/generate for text generation (streamed by default), /api/chat for chat completions, and /api/embed (or the older /api/embeddings) for vector embeddings.

The key architectural properties that affect monitoring:

Model-per-GPU isolation. By default, Ollama loads one model at a time per GPU. You can run multiple model instances via multiple ollama serve processes or parallel modelfile configurations, but each model consumes VRAM proportional to its parameter count and quantization level. A 7B Q4 model typically needs 4-6GB VRAM; a 70B Q4 model needs 40GB+.

CUDA memory management. Ollama relies on the NVIDIA driver and CUDA toolkit for GPU scheduling. When VRAM is exhausted, Ollama falls back to CPU inference — which is 10-100x slower and a critical production incident.

Stateless request handling. Each API call is independent; Ollama doesn't manage conversation context server-side (that's the client's responsibility). This means monitoring context length utilization is harder — you see input/output token counts but not session-level state.

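No push-based metrics. This guide polls an /api/stats endpoint and scrapes a /metrics endpoint in the serve process. Confirm with curl that your build actually serves both before you depend on them; if either returns 404, derive the same signals from /api/ps and from the timing fields in each API response. There is no native Prometheus pusher or OpenTelemetry exporter, so you need to build the collection layer yourself or use a tool like OperatorAPI/Ollama Operator.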
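As a minimal sketch of such a collection layer, the snippet below turns the model list returned by /api/ps into Prometheus text exposition format that a small sidecar could serve. The metric names (ollama_loaded_models, ollama_model_size_bytes, ollama_model_vram_bytes) and the default URL are assumptions made for this example, not names Ollama emits. The size and size_vram fields are the ones /api/ps reports per loaded model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local address


def fetch_running_models(base_url=OLLAMA_URL, timeout=5):
    """Return the 'models' list from /api/ps (models currently loaded)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.load(resp).get("models", [])


def render_metrics(models):
    """Render loaded-model gauges in Prometheus text exposition format.

    Metric names are hypothetical, chosen for this sketch only.
    """
    lines = [
        "# TYPE ollama_loaded_models gauge",
        f"ollama_loaded_models {len(models)}",
    ]
    gauges = (
        ("ollama_model_size_bytes", "size"),
        ("ollama_model_vram_bytes", "size_vram"),
    )
    for metric, field in gauges:
        lines.append(f"# TYPE {metric} gauge")
        for m in models:
            # Escape backslashes and quotes so the label value stays valid
            name = m.get("name", "unknown").replace("\\", "\\\\").replace('"', '\\"')
            value = int(m.get(field, 0))
            lines.append(f'{metric}{{model="{name}"}} {value}')
    return "\n".join(lines) + "\n"
```

Calling render_metrics(fetch_running_models()) on each scrape and returning the string from any HTTP handler is enough for Prometheus to ingest it.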

The Ollama Metrics You Must Track

1. Request Volume and Throughput

The foundation of any Ollama monitoring setup is knowing how many requests you're serving and how that volume is changing over time.

# Query Ollama's embedded metrics (Prometheus format)
# A 404 here means your build does not serve /metrics
curl -s http://localhost:11434/metrics | grep -E "(ollama_build|ollama_requests_total|http_requests_total)"

# Check request counts via the stats endpoint (verify it exists on your build first)
curl -s http://localhost:11434/api/stats | python3 -c "
import sys, json
stats = json.load(sys.stdin)
print('Total requests:', stats.get('total_requests', 0))
print('Model downloads:', stats.get('num_model_downloads', 0))
"

Track these metrics over time:

Requests per minute — overall traffic volume, capacity planning signal
Requests per model — which models are actually being used (tells you which GGUF files to prioritize for optimization)
Concurrent requests — against your configured num_parallel setting (default: 4)
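If you poll a request counter yourself rather than letting Prometheus compute rate(), the arithmetic behind the first and third items can be sketched as follows. The function names are made up for this example, and the reset handling assumes the counter restarts from zero when the server restarts.

```python
def per_minute_rate(prev_count, curr_count, elapsed_seconds):
    """Requests per minute between two samples of an increasing counter.

    If the counter went backwards, the server restarted and the counter
    reset, so the current value is treated as the increase since restart.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_count >= prev_count:
        increase = curr_count - prev_count
    else:
        increase = curr_count
    return increase * 60.0 / elapsed_seconds


def parallel_saturation(in_flight, num_parallel):
    """Fraction of parallel slots in use; values above 1.0 mean requests are queued."""
    if num_parallel <= 0:
        raise ValueError("num_parallel must be positive")
    return in_flight / num_parallel
```

A saturation that stays near or above 1.0 is a capacity signal worth alerting on before latency percentiles start to move.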

2. GPU Utilization and VRAM Consumption

This is where most Ollama production incidents happen. GPU underutilization means you're wasting money; GPU memory exhaustion means inference slows down silently (Ollama falls back to CPU) or requests fail.
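One way to catch the silent fallback from Ollama's side is to compare each loaded model's size_vram with its size in the /api/ps response. The sketch below assumes you have already fetched that model list; the function names and the 0.99 cutoff are choices made for this example.

```python
def gpu_fraction(model_entry):
    """Share of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return min(model_entry.get("size_vram", 0) / size, 1.0)


def find_cpu_offloaded(models, min_gpu_fraction=0.99):
    """Return names of loaded models that are partly or fully running on CPU."""
    return [
        m.get("name", "unknown")
        for m in models
        if gpu_fraction(m) < min_gpu_fraction
    ]
```

Any non-empty result is worth a page: a model that is only 60% in VRAM still answers requests, just far more slowly.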

# Poll Ollama's loaded-model list + nvidia-smi together for full picture
import subprocess
import requests

def get_ollama_gpu_stats(base_url="http://localhost:11434", gpu_index=0):
    # Models currently loaded, with their VRAM footprint (size_vram, in bytes)
    resp = requests.get(f"{base_url}/api/ps", timeout=5)
    resp.raise_for_status()
    models = resp.json().get("models", [])

    # GPU utilization via nvidia-smi (prints one CSV line per GPU)
    smi = subprocess.check_output([
        "nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu",
        "--format=csv,noheader,nounits"
    ], timeout=10).decode().strip()

    # Select one GPU; unpacking the whole output fails on multi-GPU hosts
    gpu_line = smi.splitlines()[gpu_index]
    gpu_util, mem_used, mem_total, temp = [v.strip() for v in gpu_line.split(",")]

    return {
        "gpu_utilization": float(gpu_util),
        "vram_used_gb": float(mem_used) / 1024,  # nvidia-smi reports MiB
        "vram_total_gb": float(mem_total) / 1024,
        "vram_percent": float(mem_used) / float(mem_total) * 100,
        "gpu_temp_c": float(temp),
        "model_vram_bytes": sum(m.get("size_vram", 0) for m in models),
        "loaded_models": [m.get("name") for m in models],
    }

Critical alert thresholds:

VRAM utilization > 90% — risk of OOM fallback; scale horizontally or reduce batch size
GPU utilization < 20% — model not saturating GPU; batching misconfigured or model too small for GPU
GPU temperature > 85°C — thermal throttle risk on sustained workloads; check cooling
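These thresholds can be applied directly to the dictionary returned by get_ollama_gpu_stats. The following is a sketch only: the alert names and the requests_in_flight flag are assumptions for this example, and the low-utilization check is skipped when the server is idle so that quiet periods do not page anyone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuThresholds:
    vram_percent_max: float = 90.0
    gpu_util_min: float = 20.0
    gpu_temp_max_c: float = 85.0


def evaluate_gpu_alerts(stats, requests_in_flight=True, thresholds=GpuThresholds()):
    """Return the names of the thresholds that the given GPU stats violate."""
    alerts = []
    if stats["vram_percent"] > thresholds.vram_percent_max:
        alerts.append("vram_high")
    # Low utilization only matters while there is work to do
    if requests_in_flight and stats["gpu_utilization"] < thresholds.gpu_util_min:
        alerts.append("gpu_underutilized")
    if stats["gpu_temp_c"] > thresholds.gpu_temp_max_c:
        alerts.append("gpu_hot")
    return alerts
```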

3. Latency Percentiles: TTFT and TPOT

Ollama reports per-request timing in the final JSON object of each response, not in headers: total_duration, load_duration, prompt_eval_duration, eval_count, and eval_duration, with all durations in nanoseconds. The two metrics that map to user experience:

Time to First Token (TTFT): how long until streaming starts. High TTFT usually means a cold model load or a GPU queue backlog.

Time Per Output Token (TPOT): the average time to generate each token after the first, which is the inverse of decode throughput. A TPOT of 30ms means ~33 tokens/second.

# Time a streaming request end-to-end
time curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain Kubernetes scheduling", "stream": true}' \
  -o /dev/null

For production, instrument your HTTP client to record:

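import time
import requests

start = time.monotonic()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Hello", "stream": False},
    timeout=120
)
resp.raise_for_status()
elapsed = time.monotonic() - start
body = resp.json()
# eval_count is the number of generated tokens; eval_duration is in nanoseconds.
# Counting whitespace-separated words would undercount tokens.
tokens = body.get("eval_count", 0)
eval_seconds = body.get("eval_duration", 0) / 1e9
tps = tokens / eval_seconds if eval_seconds > 0 else 0
print(f"Total time: {elapsed:.2f}s, Tokens: {tokens}, TPS: {tps:.1f}")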

Target production SLIs for a 7B Q4 model on an A10G:

TTFT: < 500ms (warm), < 5s (cold)
TPOT: < 40ms (≈ 25 tokens/sec)
End-to-end latency for 512-token response: < 25s
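To check a single response against these targets, the timing fields in the response body can be converted as in the sketch below. Treating load_duration plus prompt_eval_duration as TTFT is an approximation assumed here: it is a server-side figure that excludes network time and any queueing before the request is scheduled. The function names are hypothetical.

```python
NS_PER_MS = 1_000_000


def timing_from_response(body):
    """Derive approximate TTFT and TPOT (both in ms) from Ollama's timing fields."""
    load_ms = body.get("load_duration", 0) / NS_PER_MS
    prompt_ms = body.get("prompt_eval_duration", 0) / NS_PER_MS
    eval_ms = body.get("eval_duration", 0) / NS_PER_MS
    eval_count = body.get("eval_count", 0)
    tpot_ms = eval_ms / eval_count if eval_count else None
    return {
        # Server-side approximation: model load time plus prompt processing
        "ttft_ms": load_ms + prompt_ms,
        "tpot_ms": tpot_ms,
        "tokens_per_second": 1000.0 / tpot_ms if tpot_ms else None,
    }


def meets_sli(timing, ttft_ms_max=500.0, tpot_ms_max=40.0):
    """True when both the warm TTFT and TPOT targets are met."""
    if timing["tpot_ms"] is None:
        return False
    return timing["ttft_ms"] < ttft_ms_max and timing["tpot_ms"] < tpot_ms_max
```

Client-side TTFT, measured from request start to the first streamed chunk, remains the number users actually experience, so it is worth recording both.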

4. Model Loading and Cache Hit Rate

Ollama caches loaded models in VRAM. When you request a model that's already loaded, it responds from cache — sub-100ms TTFT. When the model was evicted (due to VRAM pressure or server restart), it must reload from disk, adding 10-60 seconds of cold-start latency.

# List currently loaded models (/api/tags lists installed models, not loaded ones)
curl -s http://localhost:11434/api/ps | python3 -m json.tool

# Response (abridged):
# {
#   "models": [
#     {
#       "name": "llama3:8b-instruct-q4_0",
#       "size": 4909700000,
#       "size_vram": 4909700000,
#       "expires_at": "2026-04-28T08:05:00Z"
#     }
#   ]
# }

Track model cache hit rate: if you're seeing frequent model reloads, your VRAM headroom is insufficient for your model mix. Consider:

Increasing GPU memory (A10G 24GB → A100 40GB → H100 80GB)
Reducing concurrent model loads (don't run llama3 + mistral + gemma simultaneously on one GPU)
Using Ollama's OLLAMA_KEEP_ALIVE environment variable to control how long models stay loaded (default: 5 minutes)
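Ollama does not report a cache hit rate directly, but one can be approximated from the load_duration field that each response carries. The sketch below assumes you log that value per request; the one-second cutoff separating warm from cold loads is an arbitrary choice for this example and should be tuned to your models and storage.

```python
def cache_hit_rate(load_durations_ns, cold_threshold_seconds=1.0):
    """Fraction of requests served by an already-loaded model.

    A request counts as a hit when its load_duration (nanoseconds) is
    below the cold-load threshold. Returns None when there are no samples.
    """
    if not load_durations_ns:
        return None
    threshold_ns = cold_threshold_seconds * 1_000_000_000
    hits = sum(1 for d in load_durations_ns if d < threshold_ns)
    return hits / len(load_durations_ns)
```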

5. Error Classification

Ollama surfaces errors in a few distinct categories — each with different root causes:

Error Type	Cause	Resolution
`model not found`	Model file missing or wrong name	Pull model: `ollama pull llama3`
`CUDA out of memory`	VRAM exhausted	Reduce `num_gpu` layers, quantize to lower bit-width
`context length exceeded`	Prompt + context > model's max context	Truncate or use a model with larger context (e.g., Mistral-Nemo-12B with 128k)
`timeout`	GPU queue backlog, thermal throttle	Scale horizontally, add GPU capacity
`unexpected EOF`	Network interruption or corrupt GGUF file	Re-pull model file, verify checksum
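To feed an error_type label from these categories, error strings can be bucketed with a few patterns. This is a sketch under the assumption that your client or log pipeline sees the raw error text; the label values and the regular expressions are choices made for this example and will need adjusting to the exact messages your version produces.

```python
import re

# (label, pattern) pairs checked in order; labels are hypothetical
ERROR_PATTERNS = (
    ("oom", re.compile(r"out of memory")),
    ("model_not_found", re.compile(r"model\b.*\bnot found")),
    ("context_length", re.compile(r"context length exceeded")),
    ("timeout", re.compile(r"timed? ?out")),
    ("unexpected_eof", re.compile(r"unexpected eof")),
)


def classify_error(message):
    """Map a raw error message to one of the error types in the table."""
    text = message.lower()
    for error_type, pattern in ERROR_PATTERNS:
        if pattern.search(text):
            return error_type
    return "other"
```

Keeping an explicit "other" bucket matters: a rise in unclassified errors is often the first sign of a failure mode the table does not cover yet.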

The Ollama Production Monitoring Stack

Collecting Metrics with Prometheus

Ollama's /metrics endpoint exposes Prometheus-format metrics. Here's a scrape config:

# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-server:11434']
    metrics_path: '/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s

Key metrics to capture:

ollama_build_info{version="0.20.6"}
ollama_requests_total{model="llama3", route="/api/generate"}
ollama_request_duration_seconds{model="llama3", route="/api/chat"}
ollama_gpu_duration_seconds{model="llama3"}
ollama_num_tokens_total{model="llama3"}

Grafana Dashboard for Ollama

Build these five panels for an operational Ollama dashboard:

Panel 1: Request Volume (requests/min by model)

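sum by (model) (rate(ollama_requests_total[1m])) * 60

rate() returns requests per second, so the query multiplies by 60 for requests per minute and groups by the model label. Shows traffic distribution and growth trends per model.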

Panel 2: GPU Memory Utilization (VRAM % by GPU)

100 * DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

Or from nvidia-smi: nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes. Alert threshold: > 85%.

Panel 3: Inference Latency P50/P95/P99

histogram_quantile(0.50, rate(ollama_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(ollama_request_duration_seconds_bucket[5m]))

P99 > 30s for a 7B model usually indicates GPU queue backlog.
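For spot checks outside Prometheus, for example over latencies collected by a load-test script, the same percentiles can be computed with the nearest-rank method. This is a sketch, and note that it works on exact samples, whereas histogram_quantile interpolates within bucket boundaries, so the two will not match exactly.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of
    all samples at or below it. q is in the range (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples):
    """Return the three percentiles shown on the latency panel."""
    return {f"p{q}": percentile(samples, q) for q in (50, 95, 99)}
```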

Panel 4: Tokens Per Second (throughput by model)

rate(ollama_num_tokens_total[1m])

Drift below your baseline is an early signal of GPU thermal throttling or model eviction.

Panel 5: Error Rate by Type

rate(ollama_errors_total[5m])

Classify errors with label error_type. Spike in OOM errors = VRAM capacity issue. Spike in timeout errors = scheduling saturation.

Integrating with OpenTelemetry

Ollama doesn't natively emit OTLP, but you can instrument the collection layer with an OTel Collector sidecar:

# ollama-otel-collector.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'ollama'
          static_configs:
            - targets: ['ollama:11434']
          metrics_path: '/metrics'

exporters:
  otlphttp:
    endpoint: "http://your-otel-collector:4318"

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]

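Ollama does not emit spans itself, so the trace ends at your client. Wrap each call in a client span and propagate W3C Trace Context headers so that any proxy or gateway between your application and Ollama can join the same trace:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

# inject() only adds a traceparent header when an SDK TracerProvider is
# configured and a span is active; with the bare API it is a no-op.
with tracer.start_as_current_span("ollama.chat") as span:
    span.set_attribute("ollama.model", "llama3")
    headers = {}
    inject(headers)  # Adds traceparent header
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama3", "messages": [{"role": "user", "content": "Hello"}], "stream": False},
        headers=headers,
        timeout=120
    )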

Capacity Planning for Ollama in Production

How Many Requests Can One GPU Serve?

A rough model for a single A100 40GB running a Q4-quantized model:

Model	Parameters	Quantization	VRAM Usage	Max Concurrent	Est. TPS
Llama 3	8B	Q4_0	~5GB	4-8	25-35
Mistral	7B	Q4_K_M	~5GB	4-8	25-40
Llama 3	70B	Q4_0	~40GB	1-2	8-15
Gemma 2	9B	Q4_0	~6GB	4-6	30-45

These numbers assume continuous streaming. Concurrent batch processing (non-streaming) can improve throughput 2-4x at the cost of higher per-request latency.
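A back-of-the-envelope replica count follows from Little's law: requests in flight equal the arrival rate multiplied by the time each request takes. The sketch below reads the Est. TPS column as a per-stream rate that holds at the listed concurrency, which is an assumption; measure your own per-stream throughput under load before trusting the result.

```python
import math


def replicas_needed(requests_per_second, avg_output_tokens,
                    tokens_per_second_per_stream, max_concurrent_per_gpu):
    """Estimate how many single-GPU replicas a steady workload needs."""
    if tokens_per_second_per_stream <= 0 or max_concurrent_per_gpu <= 0:
        raise ValueError("throughput and concurrency must be positive")
    # Seconds one request occupies a slot while generating its output
    service_time = avg_output_tokens / tokens_per_second_per_stream
    # Little's law: average number of requests in flight
    in_flight = requests_per_second * service_time
    return max(1, math.ceil(in_flight / max_concurrent_per_gpu))
```

For example, 2 requests/second of 256-token responses at 32 tokens/second per stream keeps 16 requests in flight, which needs 4 replicas at 4 concurrent requests each. The estimate ignores prompt processing time and burstiness, so treat it as a lower bound.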

Scaling Patterns

Vertical scaling (bigger GPU): Start with an A10G 24GB for 7B models. Move to A100 40GB when you need 70B or higher concurrency. H100 80GB for maximum headroom.

Horizontal scaling (more replicas): Ollama doesn't have built-in clustering. For horizontal scale-out:

Run multiple ollama serve processes behind a load balancer (round-robin or least-loaded)
Use a model router or gateway that can reach your private network (for example LiteLLM or Portkey's gateway) in front of multiple Ollama instances
Consider Ollama Operator for Kubernetes-native management with automatic failover

Storage: Model File Management

Ollama stores GGUF model files in /usr/share/ollama/.ollama/models/. These range from 4GB (7B Q4) to 40GB+ (70B Q4). For production:

Use a dedicated NVMe volume for model storage — HDD is too slow for model loading
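Point Ollama at that volume with OLLAMA_MODELS=/models ollama serve (the variable sets the storage directory; it does not load anything), then pre-load your top 3 models at startup by sending each an /api/generate request with no prompt
Pin frequently-used models in VRAM via OLLAMA_KEEP_ALIVE=24h
Set up a model registry: a private registry, rather than pulling from the public registry.ollama.ai on every host, lets you version and promote model files across environments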

Common Production Incidents and Runbooks

Incident: Requests timing out, GPU utilization near zero

Symptoms: Requests hanging for 60+ seconds, GPU utilization 0-5%, nvidia-smi shows GPU idle.

Diagnosis:

# Check if Ollama is still running
curl http://localhost:11434/api/tags

# Check GPU state
nvidia-smi

# Check Ollama logs
journalctl -u ollama -n 100 --no-pager

Root cause: Usually model reload after eviction. Ollama evicted the model from VRAM due to memory pressure, and is re-loading from disk. This can take 30-60 seconds for large models.

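Fix: Immediate: force a load by sending a request that names only the model, for example curl http://localhost:11434/api/generate -d '{"model": "llama3"}'. Note that ollama pull only downloads model files; it does not load them into VRAM. Long-term: reduce the number of concurrent models or add GPU memory.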

Incident: "CUDA out of memory" errors spiking

Symptoms: Requests failing with OOM error, model crashes.

Root cause: VRAM oversubscription — trying to run more models or larger batches than GPU memory supports.

Fix:

Check current VRAM: nvidia-smi --query-gpu=memory.used,memory.total --format=csv
Identify which process is using VRAM: nvidia-smi
Kill other GPU processes or reduce num_gpu layers in Ollama config
Switch to a lower-bit quantization (for example Q5_K_M → Q4_0; fewer bits per weight means less VRAM, at some cost in output quality)

Incident: Inference slow after server restart

Symptoms: TTFT suddenly 30-60 seconds, then normal.

Root cause: Cold model load — models aren't cached at startup.

Fix: Set up model pre-loading at startup:

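# In systemd service or Docker entrypoint (assumes the models are already pulled)
ollama serve &

# Wait until the API answers; pull and generate both need a running server
until curl -sf http://localhost:11434/api/tags > /dev/null; do sleep 1; done

# A generate request with no prompt loads the model into memory
curl -s http://localhost:11434/api/generate -d '{"model": "llama3", "keep_alive": "24h"}' > /dev/null
curl -s http://localhost:11434/api/generate -d '{"model": "mistral", "keep_alive": "24h"}' > /dev/null
wait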

And set OLLAMA_KEEP_ALIVE=24h to prevent eviction.

Conclusion

Ollama's simplicity as a dev tool masks the real operational complexity of running it at scale. The good news: between Ollama's own endpoints and GPU telemetry from DCGM or nvidia-smi, you can observe everything you need (request volume, GPU utilization, VRAM consumption, and error classification). You just have to collect and visualize it.

The monitoring stack for a production Ollama deployment is straightforward: Prometheus scraping /metrics, a Grafana dashboard with 5 key panels (request volume, GPU VRAM, latency percentiles, throughput, error rate), and alerting on VRAM > 85% or P99 latency > 30s.

The harder operational problem is capacity planning. Ollama's single-model-per-GPU architecture means you need to think carefully about which models you run together, how you'll handle model eviction under memory pressure, and when to scale vertically (bigger GPU) versus horizontally (more replicas). Those decisions compound — wrong model placement can cost you 2x in GPU spend.

The teams getting Ollama right in production treat it like any other stateful service: instrumented, monitored, with explicit SLOs for latency and availability. The open-weight model ecosystem is mature enough for production — the monitoring and operations practices just need to catch up.
