vLLM has become the de facto open-source inference engine for serving large language models in production. Its PagedAttention mechanism delivers 2-5x throughput improvements over naive Hugging Face Transformers by managing the KV cache memory more efficiently. But that performance gain comes with new operational complexity: GPU memory pressure, KV cache hit rates, prefill/decoding phase imbalances, and speculative decoding behavior all require specialized monitoring that traditional DevOps tooling misses.

This guide walks you through building a production monitoring stack for vLLM: which metrics matter, which open-source tools to use, and how to wire everything together.

Why vLLM Needs Its Own Monitoring Stack

Traditional API services expose request rate, error rate, and latency percentiles. vLLM is different. Its core performance gains come from memory management decisions made inside the inference loop, and those decisions can change throughput by 2-10x depending on your workload.

The key things you need to monitor that standard HTTP observability will not catch:

  • GPU memory utilization: vLLM pre-allocates its KV cache based on gpu_memory_utilization. Set it too high and you risk OOM kills; set it too low and you waste expensive GPU memory.
  • KV cache hit rate: vLLM's prefix cache is your primary throughput lever. A low hit rate means you are recomputing prompt tokens instead of reusing cached KV blocks.
  • Prefill vs decode throughput imbalance: A prefill bottleneck calls for different optimizations than a decode bottleneck.
  • Number of ongoing sequences: vLLM's block manager tracks active sequences. Understanding batch composition helps you tune max_num_seqs.
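
To see why gpu_memory_utilization matters so much, it helps to estimate how many tokens the pre-allocated KV cache can hold. The sketch below is a back-of-the-envelope calculation, not vLLM's actual memory profiler: the function names are hypothetical, the layer and head counts are the published Llama-3-8B configuration, and the 15 GiB weight size and 3 GiB overhead are assumed round numbers.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_mem_gib: float, gpu_memory_utilization: float,
                      weights_gib: float, overhead_gib: float,
                      bytes_per_token: int) -> int:
    """Rough token capacity of the KV cache after weights and overhead are subtracted."""
    budget_gib = gpu_mem_gib * gpu_memory_utilization - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024**3 // bytes_per_token)


# Llama-3-8B: 32 layers, 8 KV heads (grouped-query attention), head dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Assumed inputs: 80 GiB GPU, ~15 GiB of weights, ~3 GiB of activation/runtime overhead.
capacity = max_cached_tokens(80, 0.85, weights_gib=15, overhead_gib=3,
                             bytes_per_token=per_token)
print(per_token, capacity)
```

Under these assumptions each token costs 128 KiB of cache, so every extra 0.05 of gpu_memory_utilization on an 80 GiB card buys room for roughly 32,000 more cached tokens.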

If you are running vLLM through Ray Serve (for example, launched with serve run), you also have the Ray dashboard and Ray metrics to contend with.

The Metrics That Actually Matter

Throughput Metrics

Metric names have changed between vLLM releases, so confirm each one against your own /metrics output.

  • vllm:generation_tokens_total: Total tokens generated. Monitor its rate.
  • vllm:prompt_tokens_total: Total prefill (prompt) tokens processed. Its ratio to generated tokens tracks workload shape.
  • vllm:num_requests_running: Number of requests currently being executed. Shows actual parallelism.

Memory Metrics

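  • vllm:gpu_cache_usage_perc: Fraction of the allocated KV cache in use, on a 0-1 scale despite the name (exposed as vllm:kv_cache_usage_perc in newer releases). 0.85-0.95 under load is healthy; sustained values above 0.98 mean the cache is saturated. This is the primary signal for tuning gpu_memory_utilization.
  • vllm:num_preemptions_total: Requests evicted from the running batch because the cache ran out of blocks. A rising rate indicates memory pressure.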

Latency Metrics

  • vllm:e2e_request_latency_seconds: Wall clock time from request arrival to last token. Target p99 under 10 seconds for most apps.
  • vllm:time_to_first_token_seconds: TTFT, critical for streaming UX. Target p99 under 1 second.
  • vllm:time_per_output_token_seconds: TPOT, the inter-token latency. Target p99 under 100 ms.
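
The three latency metrics are related by simple arithmetic on token arrival times. The following sketch computes them client-side from timestamps; the function name and the sample timings are illustrative, and vLLM's own histograms are measured server-side, so real values will differ slightly.

```python
def latency_breakdown(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, mean TPOT, and end-to-end latency from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    e2e = token_times[-1] - request_start
    # TPOT averages the gaps between consecutive tokens, so it excludes the first token.
    gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}


# Request sent at t=0; four tokens stream back.
result = latency_breakdown(0.0, [0.5, 0.75, 1.0, 1.25])
print(result)
```

A request can meet its end-to-end target while still feeling slow if most of that time is spent before the first token, which is why TTFT and TPOT are tracked separately.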

Setting Up Prometheus + Grafana for vLLM

Step 1: Enable vLLM Metrics

vLLM exposes metrics in Prometheus format at /metrics on the same port as the OpenAI-compatible API server. Metrics are enabled by default in recent versions, so no extra flag is needed:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192

The server will expose metrics at http://your-host:8000/metrics.
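
Before building dashboards, it is worth confirming which metric names your vLLM version actually exports. This is a minimal sketch of a parser for the Prometheus text format using only the standard library; the function names are hypothetical and the sample payload is made up for illustration. In practice you would feed it the response body from http://your-host:8000/metrics.

```python
def parse_samples(text: str) -> list[tuple[str, str, float]]:
    """Parse Prometheus text exposition lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE comments
            continue
        if "{" in line:
            name, rest = line.split("{", 1)
            labels, _, tail = rest.rpartition("}")
        else:
            name, _, tail = line.partition(" ")
            labels = ""
        # The value is the first field after the name/labels; a timestamp may follow.
        samples.append((name, labels, float(tail.split()[0])))
    return samples


def vllm_metric_names(text: str) -> list[str]:
    """Sorted, de-duplicated names of all samples in the vllm: namespace."""
    return sorted({name for name, _, _ in parse_samples(text) if name.startswith("vllm:")})


SAMPLE = """\
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.5",model_name="demo"} 12.0
vllm:time_to_first_token_seconds_bucket{le="+Inf",model_name="demo"} 15.0
vllm:time_to_first_token_seconds_count{model_name="demo"} 15.0
process_cpu_seconds_total 3.2
"""
print(vllm_metric_names(SAMPLE))
```

Running the name listing once per vLLM upgrade is a cheap way to catch renamed metrics before a dashboard panel silently goes empty.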

Step 2: Set Up Prometheus to Scrape vLLM

Add a scrape_config to your prometheus.yml:

scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8000']

If you are running behind a reverse proxy or on Kubernetes, adjust the target accordingly.

Step 3: Grafana Dashboard

Here are starting panels for a Grafana dashboard. Create each one in a new dashboard with the query and settings listed:

GPU Cache Utilization Gauge

  • Metric: vllm:gpu_cache_usage_perc (vllm:kv_cache_usage_perc in newer releases)
  • Type: Gauge
  • Thresholds: green (0-0.85), yellow (0.85-0.95), red (0.95-1.0)

Tokens Generated per Second

  • Metric: rate(vllm:generation_tokens_total[5m])
  • Type: Time series graph
  • Shows actual throughput in tokens per second, averaged over 5 minutes. Multiply by 60 for tokens per minute.
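
For intuition about what rate() computes over a counter, here is a simplified sketch: it sums the increases between consecutive samples, treats any drop as a counter reset (for example after a vLLM restart), and divides by the elapsed time. The function name is hypothetical and the samples are invented; Prometheus's real implementation additionally extrapolates to the window boundaries.

```python
def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over (timestamp_seconds, counter_value) samples, oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a lower value means it restarted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


steady = [(0, 1000), (10, 1600), (20, 2200)]
restarted = [(0, 1000), (10, 1600), (20, 200)]  # server restarted between the last two scrapes
print(counter_rate(steady), counter_rate(restarted))
```

The reset handling is why you should always graph rate() of a token counter rather than the raw counter value: restarts show up as a dip in throughput instead of a cliff to zero.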

Request Latency Percentiles (p50/p95/p99)

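  • Metric: histogram_quantile(0.50, sum by (le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
  • Repeat with 0.95 and 0.99, and overlay all three on one graph. The sum by (le) aggregates buckets across replicas and label sets before the quantile is computed.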
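
The percentile queries rely on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. A rough Python equivalent, assuming cumulative bucket counts sorted by upper bound and ending with a +Inf bucket, looks like this. It is a sketch for intuition rather than a faithful port of Prometheus's implementation, and the bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if not buckets or not 0 <= q <= 1:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bounds.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# 50 requests finished within 0.5 s, 90 within 1 s, all 100 within 2 s.
latency_buckets = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.50, latency_buckets)
p99 = histogram_quantile(0.99, latency_buckets)
print(p50, p99)
```

Because the estimate interpolates within a bucket, its accuracy is bounded by the bucket width, so a reported p99 is only as precise as the histogram's bucket boundaries allow.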

Time to First Token (TTFT) p99

  • Metric: histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
  • Alert threshold: 1 second. Beyond that, most users notice the delay before streaming starts.

Scheduler Running Steps

  • Metric: vllm:num_requests_running, plotted alongside vllm:num_requests_waiting
  • Shows actual GPU parallelism. If running requests frequently drop to 1-2 while the waiting queue is non-empty, you have a scheduling bottleneck.

Ray Dashboard: If You Are Running via Ray Serve

If you deploy with Ray Serve (for example, via serve run), you get the Ray dashboard for free:

ray start --head --metrics-export-port=8080
# Dashboard at http://localhost:8265
# Prometheus-format metrics are exported on port 8080 (an arbitrary choice)

Ray metrics include the following (exact names vary by Ray version, so confirm them against the metrics endpoint):

  • ray_num_task_executions: task throughput
  • ray_resource_usage: CPU/GPU/memory per actor
  • ray_get_task_latency: end-to-end task latency

Combined with vLLM native metrics, this gives you a full picture from request arrival to token delivery.

Common Issues and Detection Patterns

Issue 1: OOM Kills from Overallocated KV Cache

Symptom: vLLM container gets OOM-killed. nvidia-smi shows GPU memory at 99% right before crash.

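Detection: Alert when vllm:gpu_cache_usage_perc exceeds 0.95 for more than 5 minutes. Because the KV cache is pre-allocated, this metric shows cache saturation rather than total GPU memory, so also watch nvidia-smi for memory climbing toward 99%.

Fix: Reduce --gpu-memory-utilization in steps of 0.05-0.10. If you are below 0.80 and still hitting OOM, reduce --max-model-len.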
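
The "for more than 5 minutes" condition is what keeps a brief spike from paging anyone. In Prometheus this is normally expressed with an alert rule's for: clause; the sketch below reproduces the same logic in Python purely for illustration, with a hypothetical function name and made-up samples.

```python
def sustained_breach(samples: list[tuple[float, float]], threshold: float,
                     hold_seconds: float) -> bool:
    """True if the value has stayed above threshold for at least hold_seconds up to the latest sample."""
    breach_start = None
    last_ts = None
    for ts, value in samples:  # samples are (seconds, value), oldest first
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any dip below the threshold resets the timer
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= hold_seconds


# One sample per minute. In `steady`, usage crosses 0.95 at t=60 and stays there.
steady = [(0, 0.90), (60, 0.96), (120, 0.97), (180, 0.97), (240, 0.98), (300, 0.99), (360, 0.99)]
spiky = [(0, 0.96), (60, 0.97), (120, 0.90), (180, 0.97), (240, 0.98), (300, 0.99), (360, 0.99)]
print(sustained_breach(steady, 0.95, 300), sustained_breach(spiky, 0.95, 300))
```

The second series spends just as much time above the threshold overall, but the dip at t=120 restarts the clock, so it does not fire.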

Issue 2: Thrashing Due to Too Many Concurrent Sequences

Symptom: Throughput drops suddenly. Scheduler running steps fluctuates wildly. Latency spikes.

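Detection: Watch for vllm:num_requests_running swinging sharply while vllm:num_requests_waiting stays above zero, together with a rising rate of vllm:num_preemptions_total.

Fix: Lower --max-num-seqs (check the default for your version; it has commonly been 256). Fewer concurrent sequences with enough KV cache each often outperform an oversubscribed batch that keeps getting preempted.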

Issue 3: Low Cache Hit Rate

Symptom: Tokens per second is lower than expected for your hardware.

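Detection: Recent vLLM versions export prefix cache metrics when prefix caching is enabled. Depending on the version, look for a hit-rate gauge such as vllm:gpu_prefix_cache_hit_rate, or for a pair of hit and query counters from which you can compute the rate yourself.

Fix: Increase prefix sharing across requests (for example, batch similar requests together and keep shared system prompts identical), or raise gpu_memory_utilization so that more blocks stay cached.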

Issue 4: Prefill Bottleneck

Symptom: TTFT is high but TPOT is normal. The model is slow to start generating.

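Detection: Compare the rate of vllm:prompt_tokens_total against the rate of vllm:generation_tokens_total. If prompt tokens dominate while TTFT p99 climbs, you are prefill-bound.

Fix: Continuous batching is already on by default in vLLM. For very long prompts, enable prefix caching (--enable-prefix-caching) so shared prefixes are not recomputed, or chunked prefill (--enable-chunked-prefill) so long prompts are processed in smaller pieces alongside decode steps.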
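
The symptom described here, high TTFT with normal TPOT, can be turned into a simple triage rule. This sketch uses the p99 targets suggested earlier in this guide (1 second for TTFT, 100 ms for TPOT) as default thresholds; the function name and the returned labels are hypothetical, and real workloads will need their own thresholds.

```python
def classify_bottleneck(ttft_p99: float, tpot_p99: float,
                        ttft_target: float = 1.0, tpot_target: float = 0.1) -> str:
    """Label the likely bottleneck from p99 TTFT and TPOT, both in seconds."""
    ttft_high = ttft_p99 > ttft_target
    tpot_high = tpot_p99 > tpot_target
    if ttft_high and tpot_high:
        return "overloaded"      # both phases are slow: likely saturated overall
    if ttft_high:
        return "prefill-bound"   # slow to start, normal once generating
    if tpot_high:
        return "decode-bound"    # starts promptly, slow per token
    return "healthy"


print(classify_bottleneck(2.4, 0.04))
print(classify_bottleneck(0.3, 0.25))
```

Splitting the diagnosis this way matters because the remedies diverge: prefill problems respond to caching and chunking, while decode problems respond to batch size and memory tuning.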

Cost Optimization: Getting More Tokens Per Dollar

vLLM efficiency directly translates to cloud spend. Here is how to optimize:

1. Tune gpu_memory_utilization with profiling

Profile your specific model to find the optimal setting. Start high and reduce in steps of 0.05 until the server runs OOM-free with headroom under peak load.

2. Use fp8 quantization for inference

Native FP8 compute requires Hopper- or Ada-generation GPUs such as the H100; the A100 has no FP8 tensor cores. Delivers approximately 40% memory reduction with minimal accuracy loss.

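3. Enable speculative decoding for latency-sensitive workloads

Uses a smaller draft model to propose several tokens that the main model then verifies in a single pass. This lowers per-request latency at small batch sizes, at the cost of extra compute that can reduce throughput under heavy batching.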
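
Whether speculative decoding pays off depends mostly on how often the main model accepts the draft model's tokens. Under the common simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens produced per verification pass with k draft tokens is (1 - alpha^(k+1)) / (1 - alpha). The sketch below evaluates that formula; the function name and the sample acceptance rates are illustrative, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per main-model pass with k draft tokens and acceptance rate alpha."""
    if not 0 <= alpha <= 1 or k < 0:
        raise ValueError("alpha must be in [0, 1] and k must be non-negative")
    if alpha == 1:
        return float(k + 1)  # every draft token accepted, plus one token from the main model
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8):
    print(alpha, round(expected_tokens_per_step(alpha, k=4), 4))
```

The gap between the two acceptance rates shows why it is worth monitoring draft acceptance before committing to a draft model: a poorly matched draft adds compute for little gain.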

4. Monitor tokens per GPU-second as your primary cost metric

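Tokens per GPU-second is tokens generated per second divided by the number of GPUs. To turn that into money, divide tokens generated per hour by (number of GPUs times GPU cost per hour) to get tokens per dollar, then divide 1,000,000 by that figure to get cost per million tokens generated.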
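
As a worked example of this calculation, the sketch below converts a measured token rate into tokens per dollar and cost per million tokens. The function names are hypothetical, and the throughput (2,500 tokens per second) and price (4 dollars per GPU-hour) are made-up inputs.

```python
def tokens_per_dollar(tokens_per_hour: float, num_gpus: int, gpu_cost_per_hour: float) -> float:
    """Tokens generated for each dollar of GPU spend."""
    return tokens_per_hour / (num_gpus * gpu_cost_per_hour)


def cost_per_million_tokens(tokens_per_hour: float, num_gpus: int,
                            gpu_cost_per_hour: float) -> float:
    """Dollars spent to generate one million tokens."""
    return 1_000_000 / tokens_per_dollar(tokens_per_hour, num_gpus, gpu_cost_per_hour)


tokens_per_hour = 2500 * 3600  # 2,500 tokens/s sustained for an hour
tpd = tokens_per_dollar(tokens_per_hour, num_gpus=1, gpu_cost_per_hour=4.0)
cpm = cost_per_million_tokens(tokens_per_hour, num_gpus=1, gpu_cost_per_hour=4.0)
print(tpd, round(cpm, 4))
```

Note that the two figures move in opposite directions: tuning that raises tokens per dollar lowers cost per million tokens by the same factor.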

OpenTelemetry Integration

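For enterprise environments, you can trace vLLM calls with OpenTelemetry. The API server can export traces itself via --otlp-traces-endpoint; the snippet below instead wraps an offline generate call in a manual span:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from vllm import LLM

# Attach an exporter (console here; swap in an OTLP exporter for production).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
prompts = ["Explain PagedAttention in one sentence."]

with tracer.start_as_current_span("vllm_generate") as span:
    outputs = llm.generate(prompts)
    # Record how many tokens were generated for the first prompt.
    span.set_attribute("num_tokens", len(outputs[0].outputs[0].token_ids))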

Export to Jaeger, Datadog, or any OTLP-compatible backend.

The Monitoring Stack at a Glance

The architecture that works:

  • vLLM Server: exposes /metrics endpoint with native Prometheus format for GPU utilization, cache hit rate, latency histograms
  • Prometheus: scrapes vLLM every 10s, stores time series, fires alerts on OOM risk and latency spikes
  • Grafana: dashboards for KV cache utilization, throughput, latency percentiles, and token efficiency

Next Steps

If you are running vLLM in production today and not monitoring these metrics, start with two panels:

  1. GPU cache utilization: catch OOM risk before it kills your service
  2. Request latency p99: track user experience degradation

From there, add throughput efficiency metrics to correlate GPU spend with business value.

Further reading:

  • The Complete LLM Observability Guide for Production Teams
  • The Open-Source LLM Monitoring Stack in 2026
  • Kubernetes GPU Scheduling for ML Workloads