vLLM has become the de facto open-source inference engine for serving large language models in production. Its PagedAttention mechanism delivers 2-5x throughput improvements over naive Hugging Face Transformers by managing the KV cache memory more efficiently. But that performance gain comes with new operational complexity: GPU memory pressure, KV cache hit rates, prefill/decoding phase imbalances, and speculative decoding behavior all require specialized monitoring that traditional DevOps tooling misses.
This guide walks you through building a production monitoring stack for vLLM: which metrics matter, which open-source tools to use, and how to wire everything together.
Why vLLM Needs Its Own Monitoring Stack
Traditional API services expose request rate, error rate, and latency percentiles. vLLM is different. The core performance gains come from memory management decisions made inside the inference loop, and those decisions can change throughput by 2-10x depending on your workload.
The key things you need to monitor that standard HTTP observability will not catch:
- GPU memory utilization: vLLM pre-allocates a KV cache based on gpu_memory_utilization. If this is set too high, you will get OOM kills. Too low, and you are wasting expensive GPU memory.
- KV cache hit rate: vLLM's cache is your primary throughput lever. A low cache hit rate means you are recomputing tokens instead of serving from memory.
- Prefill vs decode throughput imbalance: If prefill is bottlenecking, you need different optimizations than if decode is the constraint.
- Number of ongoing sequences: vLLM's block manager tracks active sequences. Understanding batch composition helps you tune max_num_seqs.
If you are running vLLM through Ray Serve (via vllm.entrypoint.api or serve run), you also have the Ray dashboard and Ray metrics to contend with.
The Metrics That Actually Matter
Throughput Metrics
- vllm:num_generation_tokens_total: Total tokens generated. Monitor rate.
- vllm:num_prefill_tokens_total: Total prefill tokens processed. Ratio to decode tracks workload shape.
- vllm:scheduler_running_steps: Number of active generation steps. Shows actual parallelism.
Memory Metrics
- vllm:gpu_cache_usage: KV cache memory used over allocated. 85-95% healthy, above 98% OOM risk.
- vllm:gpu_cache_usage_utilization: Normalized 0-1 scale. Primary tuning knob.
- vllm:num_mixed_chunked_prefill: Prefill batches mixed with decode. Indicates memory pressure.
Latency Metrics
- vllm:e2e_request_latency_seconds: Wall clock time from request to last token. p99 under 10s for most apps.
- vllm:time_to_first_token_seconds: TTFT, critical for streaming UX. p99 under 1 second.
- vllm:time_per_output_token_seconds: TPOT, inter-token latency. p99 under 100ms.
Setting Up Prometheus + Grafana for vLLM
Step 1: Enable vLLM Metrics
vLLM exposes metrics via a Prometheus endpoint at /metrics when you start the server. Make sure you are running with --enable-metrics (enabled by default in recent versions):
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --enable-metrics

The server will expose metrics at http://your-host:8000/metrics.
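To confirm the endpoint is serving data before wiring up Prometheus, a small script can fetch and parse it. The sketch below assumes the standard Prometheus text exposition format and uses only the standard library. The helper names (parse_metrics, fetch_metrics), the localhost URL, and the sample payload are illustrative assumptions, and the metric names are the ones used in this guide, which may differ in your vLLM version.

```python
import urllib.request


def parse_metrics(text, prefix="vllm:"):
    """Parse Prometheus text exposition lines into {series: value}.

    Simplified: assumes no spaces inside label values and no timestamps.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        # Keep only series that start with the requested prefix.
        if not line.startswith(prefix):
            continue
        series, _, value = line.rpartition(" ")
        try:
            metrics[series] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5):
    """Fetch and parse the metrics endpoint (the URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample payload, for illustration only.
SAMPLE = """\
# HELP vllm:gpu_cache_usage_utilization KV cache usage
# TYPE vllm:gpu_cache_usage_utilization gauge
vllm:gpu_cache_usage_utilization{model_name="demo"} 0.42
vllm:num_generation_tokens_total{model_name="demo"} 1234.0
process_cpu_seconds_total 12.5
"""

parsed = parse_metrics(SAMPLE)
for series, value in parsed.items():
    print(series, value)
```

Against a live server, calling fetch_metrics() and printing the keys is a quick way to see which metric names your version actually exports.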
Step 2: Set Up Prometheus to Scrape vLLM
Add a scrape_config to your prometheus.yml:
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: /metrics
    scrape_interval: 10s

If you are running behind a reverse proxy or on Kubernetes, adjust the target accordingly.
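Once Prometheus has reloaded its configuration, one way to check that the scrape works is to query the built-in up series through the Prometheus HTTP API. This is a minimal sketch that assumes Prometheus listens on localhost:9090 and that the job is named vllm as above; build_query_url and run_query are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base="http://localhost:9090"):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def run_query(expr, base="http://localhost:9090", timeout=5):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(expr, base), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# The URL is built offline; run_query() needs a reachable Prometheus server.
url = build_query_url('up{job="vllm"}')
print(url)
```

A result value of 1 for the vllm job means the last scrape succeeded; 0 means Prometheus could not reach the target.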
Step 3: Grafana Dashboard
Here is a starting set of panels for a Grafana dashboard. Build them in a new dashboard, or encode them as dashboard JSON and import via + Import:
GPU Cache Utilization Gauge
- Metric: vllm:gpu_cache_usage_utilization
- Type: Gauge
- Thresholds: green (0-0.85), yellow (0.85-0.95), red (0.95-1.0)
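If you prefer to keep panels in version control, the gauge can be generated as JSON. The sketch below builds a dict shaped like a Grafana gauge panel with the thresholds listed above. The field layout (fieldConfig, thresholds.steps, targets) reflects my understanding of recent Grafana dashboard JSON and is an assumption to verify against an exported panel from your own Grafana version; gauge_panel is a hypothetical helper.

```python
import json


def gauge_panel(title, expr, green_to=0.85, yellow_to=0.95):
    """Return a dict shaped like a Grafana gauge panel (schema assumed)."""
    return {
        "type": "gauge",
        "title": title,
        "targets": [{"expr": expr, "refId": "A"}],
        "fieldConfig": {
            "defaults": {
                "min": 0,
                "max": 1,
                "thresholds": {
                    "mode": "absolute",
                    # The first step has no lower bound (null in JSON).
                    "steps": [
                        {"color": "green", "value": None},
                        {"color": "yellow", "value": green_to},
                        {"color": "red", "value": yellow_to},
                    ],
                },
            }
        },
    }


panel = gauge_panel("GPU Cache Utilization", "vllm:gpu_cache_usage_utilization")
panel_json = json.dumps(panel, indent=2)
print(panel_json)
```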
Tokens Generated per Second
- Metric: rate(vllm:num_generation_tokens_total[5m])
- Type: Time series graph
- Shows actual throughput in tokens per second (multiply by 60 for tokens per minute)
Request Latency Percentiles (p50/p95/p99)
- Metric: histogram_quantile(0.50, rate(vllm:e2e_request_latency_seconds_bucket[5m]))
- Same for p95 and p99. Overlay all three on one graph.
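It helps to know how these percentiles are derived, because they are estimates from bucket counts rather than exact values. The sketch below reimplements the linear interpolation that Prometheus applies to classic cumulative histograms, in simplified form. The bucket boundaries and counts are made-up illustration values, not vLLM's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Simplified version of the interpolation used for classic Prometheus
    histograms; expects 0 <= q <= 1 and a final +Inf bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # The quantile lies past the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts: 10 requests <= 0.5s, 60 <= 1s, and so on.
BUCKETS = [(0.5, 10), (1.0, 60), (2.5, 90), (math.inf, 100)]

p50 = histogram_quantile(0.50, BUCKETS)
p99 = histogram_quantile(0.99, BUCKETS)
print(p50, p99)
```

In this example the p99 is capped at 2.5 because the slowest 10% of requests all fall into the +Inf bucket. If your p99 panel sits flat at a bucket boundary, the histogram's largest finite bucket is probably too small for your latencies.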
Time to First Token (TTFT) p99
- Metric: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))
- Alert threshold: over 1 second for most models is noticeable UX degradation.
Scheduler Running Steps
- Metric: vllm:scheduler_running_steps
- Shows actual GPU parallelism. If this frequently drops to 1-2 while the queue is non-empty, you have a scheduling bottleneck.
Ray Dashboard: If You Are Running via Ray Serve
If you are using serve run or vllm.entrypoint.api with Ray, you get the Ray dashboard for free:
ray start --head
# Access at http://localhost:8265
ray metrics --dashboard-address=localhost:8265

Ray metrics include:
- ray_num_task_executions: task throughput
- ray_resource_usage: CPU/GPU/memory per actor
- ray_get_task_latency: end-to-end task latency
Combined with vLLM native metrics, this gives you a full picture from request arrival to token delivery.
Common Issues and Detection Patterns
Issue 1: OOM Kills from Overallocated KV Cache
Symptom: vLLM container gets OOM-killed. nvidia-smi shows GPU memory at 99% right before crash.
Detection: Alert when vllm:gpu_cache_usage_utilization exceeds 0.95 for more than 5 minutes.
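The "for more than 5 minutes" condition matters: a single spike should not page anyone. The sketch below shows the sustained-threshold logic in plain Python, roughly what a Prometheus alert rule with a for: 5m clause evaluates. The function name breaches_for and the sample data are illustrative assumptions.

```python
def breaches_for(samples, threshold=0.95, hold_seconds=300):
    """Return True if values stay above threshold for at least hold_seconds.

    samples: iterable of (unix_timestamp, value) pairs in time order.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the timer
    return False


# Hypothetical 10-second scrapes over five minutes.
steady = [(t, 0.97) for t in range(0, 301, 10)]
dipped = [(t, 0.90 if t == 150 else 0.97) for t in range(0, 301, 10)]

print(breaches_for(steady))  # sustained breach
print(breaches_for(dipped))  # one dip resets the timer
```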
Fix: Reduce --gpu-memory-utilization in steps of 0.05-0.10. If you are below 0.80 and still hitting OOM, reduce --max-model-len.
Issue 2: Thrashing Due to Too Many Concurrent Sequences
Symptom: Throughput drops suddenly. Scheduler running steps fluctuates wildly. Latency spikes.
Detection: If vllm:scheduler_running_steps frequently drops to 1-2 while the queue is non-empty, you have a scheduling bottleneck.
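One way to turn "frequently" into a number is to measure the fraction of scrapes where requests were queued but few steps were running. This is a sketch under assumptions: the queue depth series is not named in this guide (vLLM's vllm:num_requests_waiting gauge is one candidate), and starved_fraction and the sample pairs are hypothetical.

```python
def starved_fraction(samples, max_running=2):
    """Fraction of samples with a non-empty queue but few running steps.

    samples: list of (running_steps, queue_depth) pairs from the same scrape.
    """
    if not samples:
        return 0.0
    starved = sum(
        1 for running, queued in samples
        if queued > 0 and running <= max_running
    )
    return starved / len(samples)


# Hypothetical scrapes: (running steps, queued requests).
observed = [(8, 0), (2, 5), (1, 3), (7, 4)]
print(starved_fraction(observed))
```

A low running count with an empty queue is just an idle server, which is why the queue condition is part of the check.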
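Fix: Lower --max-num-seqs (the default has been 256 in many vLLM releases, so check your version). Fewer concurrent sequences leave more KV cache blocks per sequence, which reduces preemption and recomputation.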
Issue 3: Low Cache Hit Rate
Symptom: Tokens per second is lower than expected for your hardware.
Detection: Recent vLLM versions expose prefix cache metrics. Watch vllm:kv_cache_prefix_hit_rate if your version provides it.
Fix: Increase average prompt similarity (batch similar requests together), or serve with a larger gpu_memory_utilization to fit more sequences in cache.
Issue 4: Prefill Bottleneck
Symptom: TTFT is high but TPOT is normal. The model is slow to start generating.
Detection: Compare num_prefill_tokens_total rate against num_generation_tokens_total rate. If prefill rate is much higher than decode rate, you are prefill-bound.
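The comparison boils down to two counter rates and their ratio. The sketch below computes it from two consecutive readings of each counter, with a simple reset guard similar in spirit to what rate() does in Prometheus. The function names and the readings are illustrative assumptions.

```python
def counter_rate(prev, curr, seconds):
    """Per-second rate between two counter readings, tolerant of resets."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    delta = curr - prev
    if delta < 0:
        # Counter reset (for example a server restart): count from zero.
        delta = curr
    return delta / seconds


def prefill_decode_ratio(prefill_prev, prefill_curr, decode_prev, decode_curr, seconds):
    """Ratio of prefill token rate to decode token rate over one interval."""
    prefill = counter_rate(prefill_prev, prefill_curr, seconds)
    decode = counter_rate(decode_prev, decode_curr, seconds)
    return float("inf") if decode == 0 else prefill / decode


# Hypothetical readings taken 60 seconds apart.
ratio = prefill_decode_ratio(100_000, 160_000, 50_000, 56_000, seconds=60)
print(ratio)  # 1000 prefill tokens/s against 100 decode tokens/s
```

What counts as "much higher" depends on your prompt and completion lengths, so track the ratio over time and alert on shifts from your own baseline.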
Fix: Continuous batching is already the vLLM default. For very long prompts, consider prefix caching or chunked prefill, which splits a long prompt into smaller chunks that are scheduled alongside decode steps.
Cost Optimization: Getting More Tokens Per Dollar
vLLM efficiency directly translates to cloud spend. Here is how to optimize:
1. Tune gpu_memory_utilization with profiling
Profile your specific model to find the optimal setting. Reduce in 5% steps until OOM-free with headroom.
2. Use fp8 quantization for inference
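Native fp8 compute requires Hopper (H100) or Ada Lovelace GPUs. On Ampere GPUs such as the A100, vLLM can run fp8 checkpoints only as weight-only quantization. Delivers approximately 40% memory reduction with minimal accuracy loss.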
3. Enable speculative decoding for lower latency workloads
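A smaller draft model proposes several tokens that the target model then verifies in a single forward pass. This cuts per-request latency at low concurrency, but the extra draft work can reduce throughput in heavily batched scenarios.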
4. Monitor tokens per GPU-second as your primary cost metric
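Tokens generated per hour divided by (number of GPUs times GPU cost per hour) gives you tokens per dollar. For cost per million tokens generated, invert it: (number of GPUs times GPU cost per hour) divided by tokens generated per hour, times 1,000,000.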
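The arithmetic is easy to get upside down, so here it is as code. The sketch computes both tokens per dollar and its inverse, cost per million tokens. The function names and the example figures (2 GPUs at 3 dollars per GPU-hour producing 6 million tokens per hour) are hypothetical.

```python
def tokens_per_dollar(tokens_per_hour, num_gpus, gpu_cost_per_hour):
    """Tokens generated for each dollar of GPU spend."""
    hourly_cost = num_gpus * gpu_cost_per_hour
    if hourly_cost <= 0:
        raise ValueError("hourly cost must be positive")
    return tokens_per_hour / hourly_cost


def cost_per_million_tokens(tokens_per_hour, num_gpus, gpu_cost_per_hour):
    """Dollars spent to generate one million tokens."""
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    hourly_cost = num_gpus * gpu_cost_per_hour
    return hourly_cost * 1_000_000 / tokens_per_hour


# Hypothetical deployment: 2 GPUs, $3 per GPU-hour, 6M tokens per hour.
per_dollar = tokens_per_dollar(6_000_000, num_gpus=2, gpu_cost_per_hour=3.0)
per_million = cost_per_million_tokens(6_000_000, num_gpus=2, gpu_cost_per_hour=3.0)
print(per_dollar, per_million)
```

Feeding tokens_per_hour from rate(vllm:num_generation_tokens_total[1h]) * 3600 ties the cost figure directly to the throughput panel.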
OpenTelemetry Integration
For enterprise environments, vLLM supports OpenTelemetry traces:
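from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from vllm import LLM

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Model name reused from the server example above; substitute your own.
llm = LLM(model="meta-llama/Llama-3-8B-Instruct")
prompts = ["Summarize the benefits of a paged KV cache."]

with tracer.start_as_current_span("vllm_generate") as span:
    outputs = llm.generate(prompts)
    # Token count of the first completion for the first prompt.
    span.set_attribute("num_tokens", len(outputs[0].outputs[0].token_ids))

To export to Jaeger, Datadog, or any OTLP-compatible backend, add a span processor with an OTLP exporter to the TracerProvider.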
The Monitoring Stack at a Glance
The architecture that works:
- vLLM Server: exposes a /metrics endpoint in native Prometheus format for GPU utilization, cache hit rate, latency histograms
- Prometheus: scrapes vLLM every 10s, stores time series, fires alerts on OOM risk and latency spikes
- Grafana: dashboards for KV cache utilization, throughput, latency percentiles, and token efficiency
Next Steps
If you are running vLLM in production today and not monitoring these metrics, start with two panels:
- GPU cache utilization: catch OOM risk before it kills your service
- Request latency p99: track user experience degradation
From there, add throughput efficiency metrics to correlate GPU spend with business value.