What is New in LiteLLM 1.87.0
LiteLLM 1.87.0 continues the cosign image signing initiative introduced in 1.85.0 with the same cryptographic verification approach: every release is signed with a key pinned to a specific Git commit hash. Verify your running containers with:
cosign verify \
  --key https://raw.githubusercontent.com/BerriAI/litellm/0112e53046018d726492c814b3644b7d376029d0/cosign.pub \
  ghcr.io/berriai/litellm:v1.87.0-dev.1
This release carries forward all the monitoring improvements from 1.86.0 — the litellm_budget_resets_total metric, per-provider status in the relay proxy /health endpoint, and corrected cost attribution for batched cached responses. Upgrade via pip install litellm==1.87.0 or pull the new Docker image tag ghcr.io/berriai/litellm:v1.87.0. The relay proxy configuration is backwards-compatible — no migration steps required.
What LiteLLM Does in Your Stack
LiteLLM has become the standard API gateway layer for teams running multiple LLM providers. Instead of managing separate SDK integrations for OpenAI, Anthropic, Azure, vLLM, Ollama, and a dozen other providers, teams standardize on LiteLLM's single interface — and get unified logging, cost tracking, and rate limiting as a byproduct.
But that ubiquity creates a new monitoring challenge: LiteLLM is now a critical path component. When it goes down, every AI feature in your stack goes down. And because LiteLLM sits at the aggregation point for all model calls, the metrics it emits are the most complete view of your AI infrastructure health you can get.
This guide covers what to monitor in a LiteLLM production deployment — from the core proxy metrics to spend tracking, latency percentiles, and error classification across providers.
LiteLLM runs as a proxy server between your application code and your LLM providers. Your code calls litellm.completion() or sends requests to the proxy endpoint, and LiteLLM handles:
- Provider routing — sending requests to OpenAI, Anthropic, Azure, or self-hosted endpoints based on model name or cost
- Unified interface — same request/response format regardless of which provider backs it
- Cost normalization — converting provider-specific pricing into a unified cost ledger
- Rate limiting — enforcing per-model, per-team, and per-API-key rate limits
- Retries and fallbacks — automatically retrying on provider errors and falling back to cheaper models when limits are hit
LiteLLM also ships a relay proxy (litellm --port 8000 --detailed_settings) that runs as a dedicated service with its own metrics endpoint and separate configuration from the SDK usage pattern.
Key Metrics to Monitor
1. Request Volume and Throughput
Track total requests and tokens processed per minute. LiteLLM exposes these via its /metrics endpoint in Prometheus format:
# HELP litellm_requests_total Total number of calls to litellm
litellm_requests_total{model="gpt-4o", team="backend"} 4821
litellm_requests_total{model="claude-3-5-sonnet", team="frontend"} 1204
Slice by model and team labels to understand which parts of your product are driving volume. A sudden spike in team="backend" requests while team="frontend" stays flat tells you something specific changed in your backend workflows, not your user-facing features.
Track rate(litellm_requests_total[5m]) to get requests per second.
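To make the arithmetic behind rate() concrete, here is a minimal Python sketch that computes the per-second increase of a counter across a few scrapes. The function name and sample values are illustrative assumptions, and the real PromQL rate() also extrapolates to the window boundaries, so treat this as an approximation rather than a reimplementation.

```python
def per_second_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, counter_value) samples.

    A drop in the counter (for example after a proxy restart) is treated as a
    reset: the post-reset value counts as the increase since the reset.
    Unlike PromQL rate(), no extrapolation to window boundaries is done.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of litellm_requests_total, 15 seconds apart.
scrapes = [(0, 4821), (15, 4881), (30, 4971)]
print(per_second_rate(scrapes))  # 150 requests over 30 seconds
```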
2. Token Throughput
Both input and output tokens:
litellm_tokens_used_total{model="gpt-4o", token_type="input"} 2847392
litellm_tokens_used_total{model="gpt-4o", token_type="output"} 1847293
Use these to calculate cost. For OpenAI models, input tokens are charged at roughly $2.50-15.00 per million tokens and output tokens at $10-60 per million, depending on the model. With both rates expressed in USD per million tokens, the cost of a single request is:
cost_per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
Track this in Grafana with a PromQL query like sum by (model) (rate(litellm_tokens_used_total[1h])).
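As a worked example of the cost formula above, the following Python sketch prices a single request. The rates used here ($2.50 input, $10.00 output per million tokens) are assumptions picked from the ranges quoted earlier; substitute the actual rates for your model.

```python
def cost_per_request(input_tokens, output_tokens, input_rate, output_rate):
    """Return the USD cost of one request; rates are USD per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Assumed rates for illustration only.
cost = cost_per_request(1_200, 300, input_rate=2.50, output_rate=10.00)
print(f"${cost:.4f}")
```

Note that output tokens dominate here even though there are four times fewer of them, which is why tracking the two token types separately matters.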
3. Latency Percentiles
LiteLLM latency has three distinct components:
- Time to first token (TTFT) — dominated by provider API latency and your model choice
- Time per output token (TPOT) — determined by provider throughput
- End-to-end latency — from your application sending the request to receiving the full response
Track all three via the litellm latency histogram. Separate p50 from p95 and p99. A healthy p50 with a spiking p99 means you have tail latency outliers — often a specific model or provider behaving badly under load.
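The difference between a healthy p50 and a spiking p99 is easy to see with raw samples. The sketch below uses a simple nearest-rank percentile over hypothetical latencies; it is not how Prometheus derives quantiles from histogram buckets, only an illustration of why the percentiles must be read separately.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in seconds: 98 fast requests, 2 slow outliers.
latencies = [0.8] * 98 + [25.0, 40.0]
summary = {p: percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

Here p50 and p95 both look fine while p99 exposes the outliers, which is the pattern described above.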
4. Error Rate by Provider
LiteLLM surfaces errors with provider-specific labels:
litellm_errors_total{error_type="rate_limit_error", provider="openai"}
litellm_errors_total{error_type="authentication_error", provider="anthropic"}
litellm_errors_total{error_type="timeout", provider="azure"}
Track error rate as a fraction of total requests (multiply by 100 for a percentage):
sum(rate(litellm_errors_total[5m])) by (provider) /
sum(rate(litellm_requests_total[5m])) by (provider)
A rising rate_limit_error rate on OpenAI is a leading indicator you need to either switch to a fallback model or negotiate higher limits. Don't wait for the error rate to hit 100% — alert at 5%.
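The same 5% check can be sketched in Python for use in an ad hoc script or a test. The function name and the input dictionaries are hypothetical; they stand in for the per-provider increases of the error and request counters over the same window.

```python
def providers_over_threshold(errors, requests, threshold=0.05):
    """Return {provider: error_ratio} for providers whose ratio exceeds threshold.

    `errors` and `requests` map a provider name to the increase of the
    corresponding counter over the same time window.
    """
    flagged = {}
    for provider, total in requests.items():
        if total <= 0:
            continue  # no traffic, so there is no meaningful ratio
        ratio = errors.get(provider, 0) / total
        if ratio > threshold:
            flagged[provider] = ratio
    return flagged


# Hypothetical 5-minute increases per provider.
errors = {"openai": 42, "anthropic": 3}
requests = {"openai": 600, "anthropic": 400, "azure": 0}
flagged = providers_over_threshold(errors, requests)
print(flagged)
```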
5. Cost Tracking
LiteLLM has built-in spend tracking that logs every request with its calculated cost. The spend logs table (if you're using the database adapter) includes:
| field | description |
|---|---|
| model | e.g. gpt-4o, claude-3-5-sonnet-20241022 |
| total_cost | Calculated cost in USD |
| total_tokens | Input + output tokens |
| response_ms | Response time in milliseconds |
| user | Team or API key identifier |
| metadata | Custom metadata passed at call time |
Query this to build a cost dashboard. Track spend by user (team) to allocate AI costs back to product teams. This is essential for FinOps reporting and for detecting runaway experiments.
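As a minimal sketch of that aggregation, the Python below sums total_cost per user over a few spend-log rows. The rows are hypothetical dictionaries using the field names from the table; how you load them (SQL query, export, API) depends on your deployment and is not shown.

```python
from collections import defaultdict


def spend_by_user(rows):
    """Sum total_cost per user across spend-log rows, highest spender first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += row["total_cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows; real rows would come from your spend logs table.
rows = [
    {"model": "gpt-4o", "total_cost": 0.25, "user": "data-team"},
    {"model": "gpt-4o", "total_cost": 0.5, "user": "backend"},
    {"model": "claude-3-5-sonnet-20241022", "total_cost": 0.125, "user": "data-team"},
]
print(spend_by_user(rows))
```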
6. Cache Hit Rate
LiteLLM supports semantic caching via Redis. When enabled, semantically similar requests return cached responses instead of calling the provider:
litellm_cache_hits_total{model="gpt-4o"}
litellm_cache_misses_total{model="gpt-4o"}
Cache hit rate directly reduces your cost:
litellm_cache_hits_total / (litellm_cache_hits_total + litellm_cache_misses_total)
A low cache hit rate (below 30%) might mean your cache TTL is too short or your request patterns are too diverse. Tune with the caching_budget parameter at call time.
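The hit-rate formula above, written as a small Python helper with a guard for the case where no lookups have happened yet. The inputs are assumed to be counter increases over the same window, not lifetime totals.

```python
def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


# Hypothetical counter increases over one hour.
hit_rate = cache_hit_rate(hits=180, misses=420)
print(f"{hit_rate:.0%}")
```

In PromQL, the equivalent is to wrap each counter in rate() over a window before dividing, so that the ratio reflects recent traffic rather than everything since the proxy started.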
7. Fallback Success Rate
When LiteLLM falls back from a primary to a secondary model, track whether the fallback succeeds:
litellm_fallbacks_total{from_model="gpt-4o", to_model="gpt-4o-mini", success="true"}
litellm_fallbacks_total{from_model="gpt-4o", to_model="gpt-4o-mini", success="false"}
A fallback that fails 100% of the time is worse than no fallback — it's adding latency with no reliability benefit. A 50% success rate on gpt-4o-mini might be acceptable for non-critical features but unacceptable for customer-facing chat.
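To turn the two labeled counters into a single success rate per fallback pair, something like the following Python sketch works. The input shape, a dictionary keyed by the three label values, is an assumption made for illustration.

```python
def fallback_success_rates(counts):
    """Map (from_model, to_model) to the fraction of fallbacks that succeeded.

    `counts` maps (from_model, to_model, success) to a counter increase,
    where success is the string "true" or "false" as in the labels above.
    """
    totals, successes = {}, {}
    for (src, dst, success), n in counts.items():
        pair = (src, dst)
        totals[pair] = totals.get(pair, 0) + n
        if success == "true":
            successes[pair] = successes.get(pair, 0) + n
    return {pair: successes.get(pair, 0) / total
            for pair, total in totals.items() if total}


# Hypothetical increases over the last hour.
counts = {
    ("gpt-4o", "gpt-4o-mini", "true"): 45,
    ("gpt-4o", "gpt-4o-mini", "false"): 15,
}
print(fallback_success_rates(counts))
```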
Setting Up Prometheus Monitoring
The LiteLLM proxy exposes a Prometheus-compatible /metrics endpoint. Add it to your Prometheus scrape config, adjusting the target host and port to match where your proxy listens:
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-proxy:4000']
    metrics_path: '/metrics'
    scrape_interval: 15s
The /metrics endpoint includes all the metrics above plus several more: litellm_requests_remaining (rate limit headroom), litellm_current_load (concurrent requests), and model-specific pricing labels.
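For quick checks without a Prometheus server, the text returned by /metrics can be parsed directly. The sketch below is a deliberately small parser for the sample lines shown in this guide; it skips comments and ignores timestamps and escaped label values, so for anything beyond spot checks a full exposition-format parser is the safer choice.

```python
import re

# Matches lines such as: name{label="value", other="x"} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse a subset of the Prometheus text format into (name, labels, value) tuples.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. Lines that do
    not match the simple pattern above, such as ones with timestamps, are ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


sample_text = '''# HELP litellm_requests_total Total number of calls to litellm
litellm_requests_total{model="gpt-4o", team="backend"} 4821
litellm_requests_total{model="claude-3-5-sonnet", team="frontend"} 1204
'''
parsed = parse_metrics(sample_text)
print(parsed)
```

In practice the input text would be fetched from your proxy's /metrics URL, for example with urllib.request.urlopen.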
Grafana Dashboard
Import the official LiteLLM Grafana dashboard (ID: 19253) or build your own. Key panels:
- Request volume — requests per second by model, colored by provider
- Token throughput — input and output tokens per hour
- Cost accumulator — cumulative daily spend by model
- Latency percentiles — p50/p95/p99 over time, faceted by model
- Error rate — errors per 100 requests by provider
- Cache efficiency — hit rate as a gauge with 30% and 70% thresholds
- Fallback heatmap — when fallbacks fire and whether they succeed
Alerting Rules
At minimum, alert on:
groups:
  - name: litellm_alerts
    rules:
      - alert: LitellmHighErrorRate
        expr: |
          sum(rate(litellm_errors_total[5m])) by (provider)
          / sum(rate(litellm_requests_total[5m])) by (provider)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM error rate above 5% for provider {{ $labels.provider }}"
      - alert: LitellmRateLimitHeadroom
        expr: litellm_requests_remaining / litellm_max_parallel_requests < 0.2
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM rate limit headroom below 20%"
      - alert: LitellmHighP99Latency
        expr: histogram_quantile(0.99, rate(litellm_request_duration_bucket[5m])) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LiteLLM p99 latency above 30s"
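The p99 rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts rather than from raw samples. The Python sketch below illustrates the linear-interpolation idea under simplifying assumptions: it omits the edge cases of the real PromQL function, and the bucket boundaries and counts are hypothetical, not LiteLLM's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    The bucket with the largest upper bound is expected to be +Inf. Within the
    bucket that contains the target rank, observations are assumed to be
    spread evenly, so the result is interpolated linearly.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative buckets: upper bound in seconds, requests at or below it.
buckets = [(1, 900), (5, 960), (30, 985), (60, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, buckets)
print(round(p99, 2), p99 > 30)
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the buckets are spaced around the 30-second threshold.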
Cost Allocation by Team
Pass user and metadata at call time to get per-team cost attribution:
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "analyze this data"}],
    user="data-team",  # team identifier used for cost attribution
    metadata={
        "feature": "analytics-dashboard",
        "environment": "production",
        "request_id": "req_abc123",
    },
)
These fields appear in spend logs and allow Grafana to partition costs by team, feature, or environment. For chargeback reporting, group by user to get a per-team AI spend report.
LiteLLM vs Running Provider SDKs Directly
The monitoring argument for LiteLLM is strong: instead of five different logging interfaces from five different providers, you get one metrics endpoint with unified cost and latency tracking. The tradeoff is added infrastructure — LiteLLM itself becomes a dependency — and a small latency overhead (typically 5-15ms for the proxy routing layer).
For teams running two or more providers, the unified observability and cost normalization justify the overhead. For single-provider teams, the operational complexity might outweigh the benefits until you scale.
Conclusion
LiteLLM's value as an API gateway is matched by its value as a monitoring aggregation point. The /metrics endpoint gives you request volume, token throughput, cost tracking, error classification, cache efficiency, and fallback behavior — everything you need to operate a multi-provider LLM stack with confidence.
The key panels to have in Grafana: latency percentiles by model, cost accumulator by team, error rate by provider, and fallback success rate. Alert when error rate exceeds 5%, when p99 latency crosses 30 seconds, or when rate limit headroom drops below 20%.
Combined with LiteLLM's built-in spend logs, you get the full picture: how much you're spending, where, and why — which is the foundation of any serious LLM FinOps practice.