What's New in Ollama v0.30: llama.cpp Native Build, GGUF First-Class, and MLX for Apple Silicon
Ollama v0.30, released May 19, 2026, is the most significant architectural shift since the project's inception. The core news: Ollama now builds natively on llama.cpp instead of the legacy GGML layer it carried since its early days, with full GGUF compatibility and first-class MLX Apple Silicon acceleration. If you're running Ollama in production, here is what changed in v0.21 through v0.30 that matters for your monitoring stack.
What actually changed (the short version)
- llama.cpp native: Ollama v0.30 compiles against llama.cpp directly. All GGUF models work without translation or conversion layers. This means your existing Q4/Q5/Q8 quantized model files are now loaded by the same code path as native inference.
- MLX first-class: Apple Silicon MLX support is no longer a community patch. ollama run llama3:70b on an M4 Max with MLX runs at full speed without any configuration.
- KV cache improvements: v0.28 introduced a new KV cache format that significantly reduces memory overhead for long-context models (32k+ tokens). If you're running Mistral-Nemo or similar long-context models, this directly affects your VRAM utilization numbers.
- Flash attention is on by default: flash attention was backported to stable in v0.26. If you disabled it for compatibility reasons, re-enabling it now gives you 2-4x speedup on attention-heavy workloads with no downside for most models.
- Parallel inference improvements: v0.25 stabilized num_parallel handling and fixed a race condition in request queuing that caused spurious 503s under moderate load.
- New /api/show endpoint: added in v0.27, exposes model metadata including quantization info, context length, and system prompt template. Useful for building accurate model inventories.
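As a rough illustration of the model-inventory idea, the sketch below flattens one /api/show response into a single record. The field names (details.parameter_size, details.quantization_level, and a context length stored under an architecture-prefixed key in model_info) are assumptions about the response shape, and inventory_row is a hypothetical helper; check them against the output of your own server before relying on them.

```python
def inventory_row(name, show_response):
    """Flatten an /api/show style response into one inventory record."""
    details = show_response.get("details", {})
    info = show_response.get("model_info", {})
    arch = info.get("general.architecture")
    # Assumption: context length lives under "<architecture>.context_length".
    context = info.get(f"{arch}.context_length") if arch else None
    return {
        "model": name,
        "family": details.get("family"),
        "parameters": details.get("parameter_size"),
        "quantization": details.get("quantization_level"),
        "context_length": context,
    }


# Hand-written sample shaped like a response body (not captured output).
sample = {
    "details": {"family": "llama", "parameter_size": "8B", "quantization_level": "Q4_0"},
    "model_info": {"general.architecture": "llama", "llama.context_length": 8192},
}
print(inventory_row("llama3:8b", sample))
```

Missing fields come back as None rather than raising, so one odd model does not break an inventory job.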
What this means for your monitoring
If you were monitoring Ollama before v0.30, two things changed:
- The version label on the build-info metric moves from ollama_build_info{version="0.20.6"} to ollama_build_info{version="0.30.0"}. Update any Prometheus queries and Grafana legend templates that match on it.
- The new KV cache format in v0.28+ means your VRAM utilization per model is measurably lower for long-context workloads. If you set a VRAM alert threshold at 85%, re-benchmark after upgrading, because your headroom is larger than before.
Upgrading from v0.20
# Check current version
ollama --version
# Upgrade Ollama
# macOS:
brew upgrade ollama
# Linux:
curl -fsSL https://ollama.com/install.sh | sh
# Verify new version
ollama --version
# Should show 0.30.0 or later
# List installed models after restart
curl http://localhost:11434/api/tags | python3 -m json.tool
Model files are fully backward-compatible — your existing GGUF files work without re-downloading. The upgrade is a drop-in replacement for most deployments.
Why Ollama Is Now a Production Component
Ollama crossed the chasm in 2025. What started as a local dev tool for running Llama, Mistral, and Gemma models on developer laptops has become a legitimate production inference runtime — used by teams who need to run open-weight models behind their firewall, avoid API costs at scale, or serve quantized models with tight latency requirements.
The numbers tell the story: Ollama crossed 50 million downloads in 2025, and the v0.20 release cycle brought production-grade features — improved streaming, better CUDA 12 utilization, and flash attention fixes in v0.20.6 that matter for GPU-bound workloads. Today, teams are running Ollama in Kubernetes pods, on single-A100 workstations, and across distributed GPU clusters for RAG pipelines, code generation, and internal AI assistants.
But Ollama's operational model is different from hosted API services. When you're running your own inference server, you're also responsible for its monitoring, scaling, and reliability. This guide covers what that means in practice.
Ollama Architecture: What You're Actually Monitoring
Ollama runs as a long-lived HTTP server process (ollama serve) that loads one or more model files (stored as quantized GGUF files) into GPU VRAM. Your application code communicates with it via a REST API — /api/generate for streaming text generation, /api/chat for chat completions, /api/embeddings for vector embeddings.
The key architectural properties that affect monitoring:
Model-per-GPU isolation. By default, Ollama loads one model at a time per GPU. You can run multiple model instances via multiple ollama serve processes or parallel modelfile configurations, but each model consumes VRAM proportional to its parameter count and quantization level. A 7B Q4 model typically needs 4-6GB VRAM; a 70B Q4 model needs 40GB+.
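The "VRAM proportional to parameter count and quantization level" rule can be sketched as a back-of-envelope formula: weight bytes plus a flat allowance for KV cache and runtime buffers. The 4.5 bits-per-weight figure for Q4_0 and the 1 GB overhead are assumptions chosen for illustration, and estimate_vram_gb is a hypothetical helper, not part of Ollama.

```python
def estimate_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.0):
    """Rough VRAM estimate in GB: quantized weights plus a flat overhead allowance."""
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb


for size in (7, 70):
    print(f"{size}B at ~4.5 bits/weight: ~{estimate_vram_gb(size):.1f} GB")
```

Under these assumptions a 7B model lands near 5 GB and a 70B model near 40 GB, in line with the ranges quoted above. Real usage grows with context length and parallelism, so treat the result as a floor.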
CUDA memory management. Ollama relies on the NVIDIA driver and CUDA toolkit for GPU scheduling. When VRAM is exhausted, Ollama falls back to CPU inference — which is 10-100x slower and a critical production incident.
Stateless request handling. Each API call is independent; Ollama doesn't manage conversation context server-side (that's the client's responsibility). This means monitoring context length utilization is harder — you see input/output token counts but not session-level state.
No built-in metrics server. Ollama exposes a /api/stats endpoint that must be polled, and a /metrics endpoint in the serve process. There is no native Prometheus pusher or OpenTelemetry exporter — you need to build the collection layer yourself or use a tool like OperatorAPI/Ollama Operator.
The Ollama Metrics You Must Track
1. Request Volume and Throughput
The foundation of any Ollama monitoring setup is knowing how many requests you're serving and how that volume is changing over time.
# Query Ollama's embedded metrics (Prometheus format)
curl http://localhost:11434/metrics | grep -E "(ollama_build|ollama_requests_total|http_requests_total)"
# Check request counts by endpoint
curl -s http://localhost:11434/api/stats | python3 -c "
import sys, json
stats = json.load(sys.stdin)
print('Total requests:', stats.get('total_requests', 0))
print('Model loads:', stats.get('num_model_downloads', 0))
"
Track these metrics over time:
- Requests per minute — overall traffic volume, capacity planning signal
- Requests per model — which models are actually being used (tells you which GGUF files to prioritize for optimization)
- Concurrent requests, measured against your configured num_parallel setting (default: 4)
2. GPU Utilization and VRAM Consumption
This is where most Ollama production incidents happen. GPU underutilization means you're wasting money; GPU memory exhaustion means inference slows down silently (Ollama falls back to CPU) or requests fail.
# Poll Ollama stats + nvidia-smi together for full picture
import subprocess
import requests

def get_ollama_gpu_stats():
    # Ollama stats endpoint
    stats = requests.get("http://localhost:11434/api/stats", timeout=5).json()
    # GPU utilization via nvidia-smi (prints one CSV line per GPU)
    smi = subprocess.check_output([
        "nvidia-smi", "--query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu",
        "--format=csv,noheader,nounits"
    ]).decode().strip()
    # Read the first GPU only; multi-GPU hosts return several lines
    gpu_util, mem_used, mem_total, temp = smi.splitlines()[0].split(", ")
    return {
        "gpu_utilization": float(gpu_util),
        "vram_used_gb": float(mem_used) / 1024,
        "vram_total_gb": float(mem_total) / 1024,
        "vram_percent": float(mem_used) / float(mem_total) * 100,
        "gpu_temp_c": float(temp),
        "model_vram": stats.get("VRAM_used", 0),
        "total_requests": stats.get("total_requests", 0),
        "errors": stats.get("last_error", None)
    }
Critical alert thresholds:
- VRAM utilization > 90% — risk of OOM fallback; scale horizontally or reduce batch size
- GPU utilization < 20% — model not saturating GPU; batching misconfigured or model too small for GPU
- GPU temperature > 85°C — thermal throttle risk on sustained workloads; check cooling
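The thresholds above can be turned into a small check that runs on each polled sample. This is a minimal sketch: evaluate_gpu_alerts and the alert names are hypothetical, and the input keys are assumed to match the dictionary built by the polling function earlier in this section.

```python
def evaluate_gpu_alerts(sample, vram_max=90.0, util_min=20.0, temp_max=85.0):
    """Return the names of the thresholds breached by one GPU sample."""
    alerts = []
    if sample["vram_percent"] > vram_max:
        alerts.append("vram_high")
    if sample["gpu_utilization"] < util_min:
        alerts.append("gpu_underutilized")
    if sample["gpu_temp_c"] > temp_max:
        alerts.append("gpu_hot")
    return alerts


reading = {"vram_percent": 93.0, "gpu_utilization": 55.0, "gpu_temp_c": 71.0}
print(evaluate_gpu_alerts(reading))
```

In practice you would require a breach to persist across several samples before paging anyone, since a single reading can spike briefly during model load.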
3. Latency Percentiles: TTFT and TPOT
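Ollama reports per-request timing in the final response object (fields such as total_duration, load_duration, prompt_eval_duration, eval_count, and eval_duration, with durations in nanoseconds). The two metrics that map to user experience: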
Time to First Token (TTFT): how long until streaming starts. High TTFT means the model is slow to initialize (often a cold-start issue or GPU queue backlog).
Time Per Output Token (TPOT): the average time to generate each token after the first, i.e. the inverse of per-token inference speed. A TPOT of 30ms means ~33 tokens/second.
# Time a streaming request end-to-end
time curl -s http://localhost:11434/api/generate \
-d '{"model": "llama3", "prompt": "Explain Kubernetes scheduling", "stream": true}' \
-o /dev/null
For production, instrument your HTTP client to record:
import time
import requests

start = time.monotonic()
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Hello", "stream": False},
    timeout=120
)
elapsed = time.monotonic() - start
body = resp.json()
# Prefer the server-reported token count; a whitespace split only counts words
tokens = body.get("eval_count") or len(body.get("response", "").split())
tps = tokens / elapsed if elapsed > 0 else 0
print(f"Total time: {elapsed:.2f}s, Tokens: {tokens}, TPS: {tps:.1f}")
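Wall-clock timing on the client includes network and queueing time. As a complement, the sketch below derives approximate TTFT and TPOT from the timing fields in a non-streaming response body. It assumes the body carries load_duration, prompt_eval_duration, eval_count, and eval_duration with durations in nanoseconds; timing_from_response is a hypothetical helper, and the TTFT it returns is a server-side approximation.

```python
def timing_from_response(body):
    """Derive approximate TTFT and TPOT (milliseconds) from response timing fields."""
    ns_per_ms = 1_000_000
    load_ms = body.get("load_duration", 0) / ns_per_ms
    prompt_ms = body.get("prompt_eval_duration", 0) / ns_per_ms
    eval_count = body.get("eval_count", 0)
    eval_ms = body.get("eval_duration", 0) / ns_per_ms
    tpot_ms = eval_ms / eval_count if eval_count else None
    return {
        "ttft_ms": load_ms + prompt_ms,  # excludes network and queue time
        "tpot_ms": tpot_ms,
        "tokens_per_sec": 1000 / tpot_ms if tpot_ms else None,
    }


# Illustrative values only: 20 ms load, 180 ms prompt eval, 100 tokens in 3 s.
sample = {
    "load_duration": 20_000_000,
    "prompt_eval_duration": 180_000_000,
    "eval_count": 100,
    "eval_duration": 3_000_000_000,
}
print(timing_from_response(sample))
```

With these sample values TPOT works out to 30 ms, which corresponds to about 33 tokens per second.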
Target production SLIs for a 7B Q4 model on an A10G:
- TTFT: < 500ms (warm), < 5s (cold)
- TPOT: < 40ms (≈ 25 tokens/sec)
- End-to-end latency for 512-token response: < 25s
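These targets are related by simple arithmetic: end-to-end latency is roughly TTFT plus TPOT times the number of output tokens. The sketch below projects that figure and compares a measurement against the targets listed above; check_slis is a hypothetical helper and the limits are copied from this list, not from any Ollama default.

```python
def check_slis(ttft_ms, tpot_ms, cold=False, output_tokens=512):
    """Compare one measurement against the SLI targets for a 7B Q4 model."""
    ttft_limit_ms = 5000 if cold else 500
    projected_s = (ttft_ms + tpot_ms * output_tokens) / 1000
    return {
        "ttft_ok": ttft_ms < ttft_limit_ms,
        "tpot_ok": tpot_ms < 40,
        "e2e_ok": projected_s < 25,
        "projected_e2e_s": projected_s,
    }


print(check_slis(ttft_ms=300, tpot_ms=35))
```

At the TPOT limit of 40 ms, 512 tokens alone take about 20.5 seconds, which is why the end-to-end target leaves only a few seconds for TTFT.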
4. Model Loading and Cache Hit Rate
Ollama caches loaded models in VRAM. When you request a model that's already loaded, it responds from cache — sub-100ms TTFT. When the model was evicted (due to VRAM pressure or server restart), it must reload from disk, adding 10-60 seconds of cold-start latency.
# List models available locally (the installed set; /api/ps reports what is currently loaded in memory)
curl http://localhost:11434/api/tags | python3 -m json.tool
# Response:
# {
#   "models": [
#     {
#       "name": "llama3:8b-instruct-q4_0",
#       "size": 4909700000,
#       "modified_at": "2026-04-28T08:00:00Z",
#       "duration": "2h30m"
#     }
#   ]
# }
Track model cache hit rate: if you're seeing frequent model reloads, your VRAM headroom is insufficient for your model mix. Consider:
- Increasing GPU memory (A10G 24GB → A100 40GB → H100 80GB)
- Reducing concurrent model loads (don't run llama3 + mistral + gemma simultaneously on one GPU)
- Using Ollama's OLLAMA_KEEP_ALIVE environment variable to control how long models stay loaded (default: 5 minutes)
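One way to approximate a cache hit rate is to look at the load_duration value reported with each response and count the requests where it is large enough to suggest a reload from disk. This is a sketch under assumptions: cold_load_rate is a hypothetical helper, the 1-second cutoff is arbitrary, and the sample values are made up.

```python
def cold_load_rate(load_durations_ns, cold_threshold_s=1.0):
    """Fraction of requests whose load_duration suggests a reload from disk."""
    if not load_durations_ns:
        return 0.0
    cold = sum(1 for ns in load_durations_ns if ns / 1e9 >= cold_threshold_s)
    return cold / len(load_durations_ns)


# Illustrative values: three warm requests and one 22-second reload.
samples = [15_000_000, 12_000_000, 22_000_000_000, 9_000_000]
print(f"cold-load rate: {cold_load_rate(samples):.0%}")
```

The cache hit rate is one minus this value. Tune the cutoff against your own warm-request baseline, since load time depends on model size and storage speed.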
5. Error Classification
Ollama surfaces errors in a few distinct categories — each with different root causes:
| Error Type | Cause | Resolution |
|---|---|---|
| model not found | Model file missing or wrong name | Pull model: ollama pull llama3 |
| CUDA out of memory | VRAM exhausted | Reduce num_gpu layers, quantize to lower bit-width |
| context length exceeded | Prompt + context > model's max context | Truncate or use a model with larger context (e.g., Mistral-Nemo-12B with 128k) |
| timeout | GPU queue backlog, thermal throttle | Scale horizontally, add GPU capacity |
| unexpected EOF | Network interruption or corrupt GGUF file | Re-pull model file, verify checksum |
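To feed these categories into a metric label, a client can map raw error text onto the table's rows with substring matching. This is a minimal sketch: classify_error and the label names are hypothetical, and the patterns are the strings from the table above. Real server messages may be worded differently, so adjust the patterns to what your logs show.

```python
ERROR_PATTERNS = [
    ("model_not_found", "model not found"),
    ("oom", "out of memory"),
    ("context_exceeded", "context length exceeded"),
    ("timeout", "timeout"),
    ("unexpected_eof", "unexpected eof"),
]


def classify_error(message):
    """Map an error message to a category label, or 'other' if nothing matches."""
    text = message.lower()
    for label, needle in ERROR_PATTERNS:
        if needle in text:
            return label
    return "other"


print(classify_error("CUDA out of memory"))
```

Keeping an explicit "other" bucket matters: a rise in unclassified errors is itself a signal that a new failure mode has appeared.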
The Ollama Production Monitoring Stack
Collecting Metrics with Prometheus
Ollama's /metrics endpoint exposes Prometheus-format metrics. Here's a scrape config:
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-server:11434']
    metrics_path: '/metrics'
    scrape_interval: 15s
    scrape_timeout: 10s
Key metrics to capture:
ollama_build_info{version="0.20.6"}
ollama_requests_total{model="llama3", route="/api/generate"}
ollama_request_duration_seconds{model="llama3", route="/api/chat"}
ollama_gpu_duration_seconds{model="llama3"}
ollama_num_tokens_total{model="llama3"}
Grafana Dashboard for Ollama
Build these five panels for an operational Ollama dashboard:
Panel 1: Request Volume (requests/min by model)
sum by (model) (rate(ollama_requests_total[1m])) * 60
Grouped by the model label and scaled from per-second to per-minute. Shows traffic distribution and growth trends per model.
Panel 2: GPU Memory Utilization (VRAM % by GPU)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100
Or from an nvidia-smi exporter: nvidia_smi_memory_used_bytes / nvidia_smi_memory_total_bytes * 100. Alert threshold: > 85%.
Panel 3: Inference Latency P50/P95/P99
histogram_quantile(0.50, rate(ollama_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(ollama_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(ollama_request_duration_seconds_bucket[5m]))
P99 > 30s for a 7B model usually indicates GPU queue backlog.
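If you also record raw latencies on the client side, the same percentiles can be computed directly with the nearest-rank method. This is a sketch with made-up samples and a hypothetical percentile helper. Prometheus's histogram_quantile interpolates within buckets, so its numbers will not match exactly.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative request latencies in seconds, with one slow outlier.
latencies_s = [0.8, 1.1, 1.3, 1.2, 0.9, 1.0, 6.5, 1.4, 1.1, 31.0]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_s, p)}s")
```

With only ten samples, P95 and P99 both land on the single outlier, which is a reminder that tail percentiles need a reasonably large window to be meaningful.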
Panel 4: Tokens Per Second (throughput by model)
rate(ollama_num_tokens_total[1m])
Drift below your baseline is an early signal of GPU thermal throttling or model eviction.
Panel 5: Error Rate by Type
rate(ollama_errors_total[5m])
Classify errors with label error_type. Spike in OOM errors = VRAM capacity issue. Spike in timeout errors = scheduling saturation.
Integrating with OpenTelemetry
Ollama doesn't natively emit OTLP, but you can instrument the collection layer with an OTel Collector sidecar:
# ollama-otel-collector.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'ollama'
          static_configs:
            - targets: ['ollama:11434']
          metrics_path: '/metrics'
exporters:
  otlphttp:
    endpoint: "http://your-otel-collector:4318"
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]
Then correlate Ollama traces with your application traces by propagating W3C Trace Context headers:
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds traceparent header
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]},
    headers=headers
)
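For reference, the traceparent header that propagation adds follows the W3C Trace Context layout: version, 16-byte trace ID, 8-byte parent span ID, and flags, all lowercase hex and joined by hyphens. The sketch below builds one by hand with the standard library only; make_traceparent is a hypothetical helper for illustration, and in real code the OpenTelemetry SDK generates these values for you.

```python
import re
import secrets


def make_traceparent(sampled=True):
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

header = make_traceparent()
print(header, bool(TRACEPARENT_RE.match(header)))
```

Logging this value next to each request's latency in your client or proxy gives you a join key between application traces and the inference metrics collected above.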
Capacity Planning for Ollama in Production
How Many Requests Can One GPU Serve?
A rough model for a single A100 40GB running a Q4-quantized model:
| Model | Parameters | Quantization | VRAM Usage | Max Concurrent | Est. TPS |
|---|---|---|---|---|---|
| Llama 3 | 8B | Q4_0 | ~5GB | 4-8 | 25-35 |
| Mistral | 7B | Q4_K_M | ~5GB | 4-8 | 25-40 |
| Llama 3 | 70B | Q4_0 | ~40GB | 1-2 | 8-15 |
| Gemma 2 | 9B | Q4_0 | ~6GB | 4-6 | 30-45 |
These numbers assume continuous streaming. Concurrent batch processing (non-streaming) can improve throughput 2-4x at the cost of higher per-request latency.
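The table can be turned into a rough replica count. The sketch below assumes the "Est. TPS" column is the aggregate tokens per second one GPU sustains, and that you want to run each GPU at about 70% of that figure to keep latency headroom. Both are assumptions, and replicas_needed is a hypothetical helper.

```python
import math


def replicas_needed(requests_per_min, avg_output_tokens, gpu_tokens_per_sec,
                    target_utilization=0.7):
    """Estimate GPU replicas required for a given output-token demand."""
    demand_tps = requests_per_min / 60 * avg_output_tokens
    usable_tps = gpu_tokens_per_sec * target_utilization
    return max(1, math.ceil(demand_tps / usable_tps))


# 30 requests/min at 256 output tokens each, on GPUs sustaining ~30 tokens/sec.
print(replicas_needed(requests_per_min=30, avg_output_tokens=256, gpu_tokens_per_sec=30))
```

This ignores prompt-processing cost and burstiness, so treat the result as a starting point for a load test, not a sizing guarantee.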
Scaling Patterns
Vertical scaling (bigger GPU): Start with an A10G 24GB for 7B models. Move to A100 40GB when you need 70B or higher concurrency. H100 80GB for maximum headroom.
Horizontal scaling (more replicas): Ollama doesn't have built-in clustering. For horizontal scale-out:
- Run multiple ollama serve processes behind a load balancer (round-robin or least-loaded)
- Use a model router (Portkey AI or OpenRouter) in front of multiple Ollama instances
- Consider Ollama Operator for Kubernetes-native management with automatic failover
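The least-loaded option above can be sketched as a small in-process picker that tracks in-flight requests per backend. LeastLoadedBalancer and the backend URLs are hypothetical, and the class is not thread-safe as written; a real deployment would more likely use the equivalent feature of its load balancer.

```python
class LeastLoadedBalancer:
    """Pick the backend with the fewest in-flight requests."""

    def __init__(self, backends):
        self.in_flight = {backend: 0 for backend in backends}

    def acquire(self):
        # Ties resolve to the backend listed first.
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend):
        self.in_flight[backend] = max(0, self.in_flight[backend] - 1)


lb = LeastLoadedBalancer(["http://ollama-0:11434", "http://ollama-1:11434"])
first = lb.acquire()
second = lb.acquire()
lb.release(first)  # the first request finished
third = lb.acquire()
print(first, second, third)
```

Least-loaded tends to suit LLM inference better than round-robin because request durations vary widely with output length.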
Storage: Model File Management
Ollama stores GGUF model files in /usr/share/ollama/.ollama/models/ on Linux service installs. These range from 4GB (7B Q4) to 40GB+ (70B Q4). For production:
- Use a dedicated NVMe volume for model storage; HDD is too slow for model loading
- Point Ollama at that volume via OLLAMA_MODELS=/models ollama serve, and pre-load your top 3 models at server startup
- Pin frequently-used models in VRAM via OLLAMA_KEEP_ALIVE=24h
- Set up a model registry: a private registry (in place of the default registry.ollama.ai) lets you version and promote model files across environments
Common Production Incidents and Runbooks
Incident: Requests timing out, GPU utilization near zero
Symptoms: Requests hanging for 60+ seconds, GPU utilization 0-5%, nvidia-smi shows GPU idle.
Diagnosis:
# Check if Ollama is still running
curl http://localhost:11434/api/tags
# Check GPU state
nvidia-smi
# Check Ollama logs
journalctl -u ollama -n 100 --no-pager
Root cause: Usually model reload after eviction. Ollama evicted the model from VRAM due to memory pressure, and is re-loading from disk. This can take 30-60 seconds for large models.
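Fix: Immediate: force a load by sending a generate request with no prompt, for example curl http://localhost:11434/api/generate -d '{"model": "llama3"}' (ollama pull only downloads the file and does not load it into VRAM). Long-term: reduce the number of concurrent models or add GPU memory.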
Incident: "CUDA out of memory" errors spiking
Symptoms: Requests failing with OOM error, model crashes.
Root cause: VRAM oversubscription — trying to run more models or larger batches than GPU memory supports.
Fix:
- Check current VRAM: nvidia-smi --query-gpu=memory.used,memory.total --format=csv
- Identify which process is using VRAM: nvidia-smi
- Kill other GPU processes or reduce num_gpu layers in Ollama config
- Switch to a lower-bit quantization (Q5_K_M needs more VRAM than Q4; of the two, Q4_0 is the more memory-efficient)
Incident: Inference slow after server restart
Symptoms: TTFT suddenly 30-60 seconds, then normal.
Root cause: Cold model load — models aren't cached at startup.
Fix: Set up model pre-loading at startup:
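# In systemd service or Docker entrypoint
ollama serve &
until curl -sf http://localhost:11434/api/tags > /dev/null; do sleep 1; done  # wait for the API
ollama pull llama3 && ollama pull mistral
# A generate request with no prompt loads the model into memory
curl -s http://localhost:11434/api/generate -d '{"model": "llama3"}' > /dev/null
curl -s http://localhost:11434/api/generate -d '{"model": "mistral"}' > /dev/null
wait  # keep the entrypoint attached to the server process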
And set OLLAMA_KEEP_ALIVE=24h to prevent eviction.
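The same warm-up can be scripted from Python once the server is reachable. The sketch below builds a generate request with no prompt, which is assumed here to make the server load the model, and passes a keep_alive value in the request body. preload_request and preload are hypothetical helpers; the example only builds the request, so nothing is sent until you call preload.

```python
import json
import urllib.request


def preload_request(model, keep_alive="24h", base_url="http://localhost:11434"):
    """Build (but do not send) a request asking the server to load a model."""
    payload = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(models, **kwargs):
    """Send one warm-up request per model; needs a running server."""
    for model in models:
        with urllib.request.urlopen(preload_request(model, **kwargs), timeout=300) as resp:
            resp.read()


req = preload_request("llama3")
print(req.get_method(), req.full_url, req.data.decode())
```

Running preload(["llama3", "mistral"]) from a readiness hook keeps warm-up out of the request path, so the first user request does not pay the cold-load cost.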
Conclusion
Ollama's simplicity as a dev tool masks the real operational complexity of running it at scale. The good news: Ollama's metrics surface everything you need — request volume, GPU utilization, VRAM consumption, and error classification — you just have to collect and visualize it.
The monitoring stack for a production Ollama deployment is straightforward: Prometheus scraping /metrics, a Grafana dashboard with 5 key panels (request volume, GPU VRAM, latency percentiles, throughput, error rate), and alerting on VRAM > 85% or P99 latency > 30s.
The harder operational problem is capacity planning. Ollama's single-model-per-GPU architecture means you need to think carefully about which models you run together, how you'll handle model eviction under memory pressure, and when to scale vertically (bigger GPU) versus horizontally (more replicas). Those decisions compound — wrong model placement can cost you 2x in GPU spend.
The teams getting Ollama right in production treat it like any other stateful service: instrumented, monitored, with explicit SLOs for latency and availability. The open-weight model ecosystem is mature enough for production — the monitoring and operations practices just need to catch up.