LiteLLM Production Monitoring 2026: Gateway + Cost Tracking

What is New in LiteLLM 1.87.0

LiteLLM 1.87.0 continues the cosign image signing initiative introduced in 1.85.0 with the same cryptographic verification approach: every release is signed with a key pinned to a specific Git commit hash. Verify your running containers with:

cosign verify \
  --key https://raw.githubusercontent.com/BerriAI/litellm/0112e53046018d726492c814b3644b7d376029d0/cosign.pub \
  ghcr.io/berriai/litellm:v1.87.0-dev.1

This release carries forward all the monitoring improvements from 1.86.0 — the litellm_budget_resets_total metric, per-provider status in the relay proxy /health endpoint, and corrected cost attribution for batched cached responses. Upgrade via pip install litellm==1.87.0 or pull the new Docker image tag ghcr.io/berriai/litellm:v1.87.0. The relay proxy configuration is backwards-compatible — no migration steps required.

What LiteLLM Does in Your Stack

LiteLLM has become the standard API gateway layer for teams running multiple LLM providers. Instead of managing separate SDK integrations for OpenAI, Anthropic, Azure, vLLM, Ollama, and a dozen other providers, teams standardize on LiteLLM's single interface — and get unified logging, cost tracking, and rate limiting as a byproduct.

But that ubiquity creates a new monitoring challenge: LiteLLM is now a critical path component. When it goes down, every AI feature in your stack goes down. And because LiteLLM sits at the aggregation point for all model calls, the metrics it emits are the most complete view of your AI infrastructure health you can get.

This guide covers what to monitor in a LiteLLM production deployment — from the core proxy metrics to spend tracking, latency percentiles, and error classification across providers.

LiteLLM runs as a proxy server between your application code and your LLM providers. Your code calls litellm.completion() or sends requests to the proxy endpoint, and LiteLLM handles:

Provider routing — sending requests to OpenAI, Anthropic, Azure, or self-hosted endpoints based on model name or cost
Unified interface — same request/response format regardless of which provider backs it
Cost normalization — converting provider-specific pricing into a unified cost ledger
Rate limiting — enforcing per-model, per-team, and per-API-key rate limits
Retries and fallbacks — automatically retrying on provider errors and falling back to cheaper models when limits are hit

LiteLLM also ships a relay proxy (litellm --port 8000 --detailed_settings) that runs as a dedicated service with its own metrics endpoint and separate configuration from the SDK usage pattern.

Key Metrics to Monitor

1. Request Volume and Throughput

Track total requests and tokens processed per minute. LiteLLM exposes these via its /metrics endpoint in Prometheus format:

# HELP litellm_requests_total Total number of calls to litellm
litellm_requests_total{model="gpt-4o", team="backend"} 4821
litellm_requests_total{model="claude-3-5-sonnet", team="frontend"} 1204

Slice by model and team labels to understand which parts of your product are driving volume. A sudden spike in team="backend" requests while team="frontend" stays flat tells you something specific changed in your backend workflows, not your user-facing features.

Track rate(litellm_requests_total[5m]) to get requests per second.
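
The label slicing described above can be tried offline against a saved copy of the /metrics output. The following is a minimal sketch using only the standard library; the helper name sum_by_label and the third sample line are hypothetical, and the parser deliberately ignores edge cases such as escaped quotes or braces inside label values.

```python
import re
from collections import defaultdict

# One sample line looks like: name{label="value", ...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(metrics_text, metric_name, label):
    """Sum the samples of one metric, grouped by the value of a single label."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)


# The first two samples come from the text above; the third is made up.
SAMPLE = """# HELP litellm_requests_total Total number of calls to litellm
litellm_requests_total{model="gpt-4o", team="backend"} 4821
litellm_requests_total{model="claude-3-5-sonnet", team="frontend"} 1204
litellm_requests_total{model="gpt-4o-mini", team="backend"} 179
"""

by_team = sum_by_label(SAMPLE, "litellm_requests_total", "team")
print(by_team)
```

In a live setup Prometheus does this aggregation for you with sum by (team); the sketch is only meant for sanity-checking a scrape by hand.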

2. Token Throughput

Track both input and output tokens:

litellm_tokens_used_total{model="gpt-4o", token_type="input"} 2847392
litellm_tokens_used_total{model="gpt-4o", token_type="output"} 1847293

Use these to calculate cost. For OpenAI models, input tokens are charged at $2.50-15.00 per million tokens and output at $10-60 per million, depending on the model. Calculate your per-request cost, with input_rate and output_rate expressed in USD per million tokens:

cost_per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

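Track token throughput in Grafana with a PromQL query like sum by (model) (rate(litellm_tokens_used_total[1h])). That query returns tokens per second for each model, not dollars; to chart spend, also group by token_type and multiply each series by the matching per-token rate.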
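
As a worked instance of the per-request formula above, here is a small sketch. The token counts and the $2.50 / $10.00 rates are illustrative assumptions, not quoted prices for any particular model.

```python
def cost_per_request(input_tokens, output_tokens, input_rate, output_rate):
    """Return the USD cost of one request; rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical request: 1,200 input tokens and 300 output tokens,
# priced at an assumed $2.50 (input) and $10.00 (output) per million tokens.
example_cost = cost_per_request(1200, 300, input_rate=2.50, output_rate=10.00)
print(f"${example_cost:.4f}")
```

Both halves contribute $0.003 here: output tokens cost four times as much per token in this example, so 300 of them weigh as much as 1,200 input tokens.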

3. Latency Percentiles

LiteLLM latency breaks down into three distinct measurements:

Time to first token (TTFT) — dominated by provider API latency and your model choice
Time per output token (TPOT) — determined by provider throughput
End-to-end latency — from your application sending the request to receiving the full response

Track all three via the litellm latency histogram. Separate p50 from p95 and p99. A healthy p50 with a spiking p99 means you have tail latency outliers — often a specific model or provider behaving badly under load.
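
To make the "healthy p50, spiking p99" pattern concrete, the sketch below computes nearest-rank percentiles over raw samples. The latency values are invented for illustration, and the method is an assumption for this example: Prometheus estimates quantiles from histogram buckets rather than from raw samples, so its numbers will differ slightly.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds: 97 fast requests, 3 slow outliers.
latencies = [1.2] * 97 + [28.0, 31.0, 45.0]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(p50, p95, p99)
```

With this data the p50 and p95 both sit at 1.2 seconds while the p99 lands on one of the outliers, which is why an average or a median alone would hide the problem.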

4. Error Rate by Provider

LiteLLM surfaces errors with provider-specific labels:

litellm_errors_total{error_type="rate_limit_error", provider="openai"}
litellm_errors_total{error_type="authentication_error", provider="anthropic"}
litellm_errors_total{error_type="timeout", provider="azure"}

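Track the error rate as a fraction of total requests (0.05 corresponds to 5%). The division below only matches series if both counters carry a provider label: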

sum(rate(litellm_errors_total[5m])) by (provider) /
sum(rate(litellm_requests_total[5m])) by (provider)

A rising rate_limit_error rate on OpenAI is a leading indicator you need to either switch to a fallback model or negotiate higher limits. Don't wait for the error rate to hit 100% — alert at 5%.
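
The same ratio and 5% threshold can be sketched outside Prometheus, for example in a script that post-processes query results. The function names and the per-second rates below are hypothetical.

```python
def error_rates(errors_per_sec, requests_per_sec):
    """Divide each provider's error rate by its request rate.

    Providers with no traffic are skipped to avoid dividing by zero.
    """
    rates = {}
    for provider, req_rate in requests_per_sec.items():
        if req_rate > 0:
            rates[provider] = errors_per_sec.get(provider, 0.0) / req_rate
    return rates


def providers_over_threshold(rates, threshold=0.05):
    """Return the providers whose error fraction exceeds the threshold."""
    return sorted(p for p, r in rates.items() if r > threshold)


# Hypothetical 5-minute rates (events per second) per provider.
requests = {"openai": 40.0, "anthropic": 10.0, "azure": 0.0}
errors = {"openai": 3.0, "anthropic": 0.2}

rates = error_rates(errors, requests)
alerting = providers_over_threshold(rates)
print(alerting)
```

Skipping idle providers mirrors what PromQL does in practice: a zero denominator yields NaN, which never satisfies a greater-than comparison.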

5. Cost Tracking

LiteLLM has built-in spend tracking that logs every request with its calculated cost. The spend logs table (if you're using the database adapter) includes:

field	description
`model`	e.g. `gpt-4o`, `claude-3-5-sonnet-20241022`
`total_cost`	Calculated cost in USD
`total_tokens`	Input + output tokens
`response_ms`	Response time in milliseconds
`user`	Team or API key identifier
`metadata`	Custom metadata passed at call time

Query this to build a cost dashboard. Track spend by user (team) to allocate AI costs back to product teams. This is essential for FinOps reporting and for detecting runaway experiments.
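
A minimal sketch of the per-team rollup, assuming the spend log rows have already been exported as dictionaries with the fields listed above. The rows, the budgets, and the helper names are all hypothetical; the actual table schema and export path depend on your deployment.

```python
from collections import defaultdict


def spend_by_user(rows):
    """Sum total_cost (USD) per user across spend log rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["total_cost"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return users whose spend exceeds their budget; users without a budget never trip."""
    return sorted(
        user for user, spent in totals.items()
        if spent > budgets.get(user, float("inf"))
    )


# Hypothetical rows using the field names from the table above.
rows = [
    {"model": "gpt-4o", "total_cost": 0.50, "total_tokens": 1500, "response_ms": 820, "user": "data-team"},
    {"model": "gpt-4o", "total_cost": 1.50, "total_tokens": 4200, "response_ms": 2310, "user": "data-team"},
    {"model": "claude-3-5-sonnet-20241022", "total_cost": 0.25, "total_tokens": 900, "response_ms": 640, "user": "search-team"},
]

totals = spend_by_user(rows)
flagged = over_budget(totals, budgets={"data-team": 1.00, "search-team": 1.00})
print(totals, flagged)
```

Run on a daily export, a check like this surfaces a runaway experiment as soon as one team's total crosses its assumed budget.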

6. Cache Hit Rate

LiteLLM supports semantic caching via Redis. When enabled, semantically similar requests return cached responses instead of calling the provider:

litellm_cache_hits_total{model="gpt-4o"}
litellm_cache_misses_total{model="gpt-4o"}

Cache hit rate directly reduces your cost:

sum(rate(litellm_cache_hits_total[5m]))
/ (sum(rate(litellm_cache_hits_total[5m])) + sum(rate(litellm_cache_misses_total[5m])))

A low cache hit rate (below 30%) might mean your cache TTL is too short or your request patterns are too diverse. Tune with the caching_budget parameter at call time.
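
The hit-rate ratio can be sketched as follows. The counts are invented, and the "low" / "moderate" / "high" labels are just an illustrative way of applying a 30% floor and a 70% upper threshold.

```python
def cache_hit_rate(hits, misses):
    """Fraction of cache lookups served from cache; 0.0 when there was no traffic."""
    total = hits + misses
    return hits / total if total else 0.0


def classify_hit_rate(rate, low=0.30, high=0.70):
    """Bucket a hit rate against a low and a high threshold."""
    if rate < low:
        return "low"
    if rate < high:
        return "moderate"
    return "high"


# Hypothetical counter increases over one dashboard window.
rate = cache_hit_rate(hits=240, misses=760)
status = classify_hit_rate(rate)
print(f"{rate:.0%} -> {status}")
```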

7. Fallback Success Rate

When LiteLLM falls back from a primary to a secondary model, track whether the fallback succeeds:

litellm_fallbacks_total{from_model="gpt-4o", to_model="gpt-4o-mini", success="true"}
litellm_fallbacks_total{from_model="gpt-4o", to_model="gpt-4o-mini", success="false"}

A fallback that fails 100% of the time is worse than no fallback — it's adding latency with no reliability benefit. A 50% success rate on gpt-4o-mini might be acceptable for non-critical features but unacceptable for customer-facing chat.
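
One way to spot fallbacks that never succeed is to compute a success fraction per route. This is a sketch under assumed inputs: the counts are hypothetical, and the second route does not appear in the samples above.

```python
def fallback_success_rates(counts):
    """Map (from_model, to_model) to its success fraction, or None if it never fired."""
    rates = {}
    for route, outcome in counts.items():
        total = outcome["true"] + outcome["false"]
        rates[route] = outcome["true"] / total if total else None
    return rates


def dead_fallbacks(rates):
    """Return routes whose fallback fired but never succeeded."""
    return sorted(route for route, rate in rates.items() if rate == 0)


# Hypothetical increases of litellm_fallbacks_total, keyed by the success label.
counts = {
    ("gpt-4o", "gpt-4o-mini"): {"true": 45, "false": 45},
    ("claude-3-5-sonnet", "gpt-4o"): {"true": 0, "false": 12},
}

rates = fallback_success_rates(counts)
dead = dead_fallbacks(rates)
print(rates, dead)
```

Routes that never fired are reported as None rather than 0 so that an unused fallback is not mistaken for a broken one.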

Setting Up Prometheus Monitoring

LiteLLM's proxy exposes a Prometheus-compatible /metrics endpoint. Add it to your Prometheus scrape config:

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-proxy:4000']
    metrics_path: '/metrics'
    scrape_interval: 15s

The /metrics endpoint includes all the metrics above plus several more: litellm_requests_remaining (rate limit headroom), litellm_current_load (concurrent requests), and model-specific pricing labels.

Grafana Dashboard

Import the official LiteLLM Grafana dashboard (ID: 19253) or build your own. Key panels:

Request volume — requests per second by model, colored by provider
Token throughput — input and output tokens per hour
Cost accumulator — cumulative daily spend by model
Latency percentiles — p50/p95/p99 over time, faceted by model
Error rate — errors per 100 requests by provider
Cache efficiency — hit rate as a gauge with 30% and 70% thresholds
Fallback heatmap — when fallbacks fire and whether they succeed

Alerting Rules

At minimum, alert on:

groups:
  - name: litellm_alerts
    rules:
      - alert: LitellmHighErrorRate
        expr: |
          sum(rate(litellm_errors_total[5m])) by (provider)
          / sum(rate(litellm_requests_total[5m])) by (provider)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM error rate above 5% for provider {{ $labels.provider }}"

      - alert: LitellmRateLimitHeadroom
        expr: litellm_requests_remaining / litellm_max_parallel_requests < 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM rate limit headroom below 20%"

      - alert: LitellmHighP99Latency
        expr: histogram_quantile(0.99, rate(litellm_request_duration_bucket[5m])) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM p99 latency above 30s"

Cost Allocation by Team

Pass user and metadata at call time to get per-team cost attribution:

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "analyze this data"}],
    user="data-team",
    metadata={
        "feature": "analytics-dashboard",
        "environment": "production",
        "request_id": "req_abc123"
    }
)

These fields appear in spend logs and allow Grafana to partition costs by team, feature, or environment. For chargeback reporting, group by user to get a per-team AI spend report.
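
Partitioning by a metadata key works the same way as grouping by user. The sketch below assumes spend log rows exported as dictionaries whose metadata field holds the dictionary passed at call time; the rows and the "unlabeled" bucket name are hypothetical.

```python
from collections import defaultdict


def spend_by_metadata(rows, key):
    """Sum total_cost (USD) per value of one metadata key.

    Rows without that key (or without metadata at all) land in "unlabeled".
    """
    totals = defaultdict(float)
    for row in rows:
        metadata = row.get("metadata") or {}
        totals[metadata.get(key, "unlabeled")] += row["total_cost"]
    return dict(totals)


# Hypothetical spend log rows; only the fields used here are shown.
rows = [
    {"user": "data-team", "total_cost": 0.75,
     "metadata": {"feature": "analytics-dashboard", "environment": "production"}},
    {"user": "data-team", "total_cost": 0.25,
     "metadata": {"feature": "analytics-dashboard", "environment": "staging"}},
    {"user": "support-team", "total_cost": 0.50,
     "metadata": {"feature": "chat", "environment": "production"}},
    {"user": "support-team", "total_cost": 0.125, "metadata": None},
]

by_feature = spend_by_metadata(rows, "feature")
by_environment = spend_by_metadata(rows, "environment")
print(by_feature, by_environment)
```

Keeping an explicit "unlabeled" bucket makes it visible how much spend comes from calls that did not pass metadata, which is usually the first gap to close in a chargeback report.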

LiteLLM vs Running Provider SDKs Directly

The monitoring argument for LiteLLM is strong: instead of five different logging interfaces from five different providers, you get one metrics endpoint with unified cost and latency tracking. The tradeoff is added infrastructure — LiteLLM itself becomes a dependency — and a small latency overhead (typically 5-15ms for the proxy routing layer).

For teams running two or more providers, the unified observability and cost normalization justify the overhead. For single-provider teams, the operational complexity might outweigh the benefits until you scale.

Conclusion

LiteLLM's value as an API gateway is matched by its value as a monitoring aggregation point. The /metrics endpoint gives you request volume, token throughput, cost tracking, error classification, cache efficiency, and fallback behavior — everything you need to operate a multi-provider LLM stack with confidence.

The key panels to have in Grafana: latency percentiles by model, cost accumulator by team, error rate by provider, and fallback success rate. Alert when error rate exceeds 5%, when p99 latency crosses 30 seconds, or when rate limit headroom drops below 20%.

Combined with LiteLLM's built-in spend logs, you get the full picture: how much you're spending, where, and why — which is the foundation of any serious LLM FinOps practice.