The Fundamental Distinction: Data Plane vs Visualization Plane

Before diving into configurations, understand the architectural split that governs everything else.

Prometheus is a time-series database with a built-in scraper and rule engine. It owns the data plane: ingestion, storage, querying, and evaluation of recording and alerting rules. Samples land in the in-memory head of the TSDB (Time Series Database), protected by a write-ahead log (WAL) for crash recovery, and are persisted to disk as 2-hour blocks. PromQL queries are served from both the live head and the historical blocks.

Grafana is a visualization and correlation platform that queries external data sources. It owns the control plane and presentation layer: dashboards, alert rule management, multi-source correlation, and team workflows. Grafana does not store your metrics — it queries them from Prometheus (or Loki, Tempo, Elasticsearch, or any of its 100+ data source plugins) and renders them as interactive dashboards.

This distinction matters practically: you cannot run Grafana without a data source, and Prometheus's native dashboarding is intentionally primitive. The stack only works well when you respect this boundary.


When to Use Prometheus Alone

Prometheus's strength is the query. If your workflow fits the pull-based scrape model and you live in terminals and config files, Prometheus alone handles most production monitoring needs without adding operational surface area.

Use Prometheus alone when:

- You are debugging a specific service in development or CI
- You need to validate that your application is exposing the right metrics before wiring up visualization
- Your team is small (1-3 engineers) and you do not need multi-source correlation dashboards yet
- You are running automated test suites that query Prometheus via the HTTP API and assert on metric values
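
The last case, a test suite that asserts on metric values, might look like the sketch below. It assumes the standard instant-query endpoint (/api/v1/query) and its documented JSON response shape; the base URL, the helper names, and the sample response body are hypothetical and only there so the example runs without a live server.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for Prometheus's instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def scalar_from_response(body: str) -> float:
    """Extract the single sample value from an instant-query response.

    Raises ValueError unless the query succeeded and returned exactly one series.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected 1 series, got {len(result)}")
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


# Hypothetical response body, shaped like an instant-vector result.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"job": "api"}, "value": [1709300000, "0.002"]}],
    },
})

error_ratio = scalar_from_response(SAMPLE_BODY)
assert error_ratio < 0.01, "error ratio exceeds the 1% test threshold"
```

In a real suite the body would come from an HTTP GET against instant_query_url(...); keeping the parsing separate from the network call makes the assertion logic testable on its own.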

Prometheus Architecture Under the Hood

Understanding the internal components clarifies why certain configurations behave the way they do.

The Write-Ahead Log (WAL): Every ingested sample is appended to the WAL as it enters the in-memory head block. If Prometheus crashes, it replays the WAL on restart, so samples that had not yet been persisted to a block are not lost. The WAL lives in the wal directory under the data directory (set by --storage.tsdb.path) as 128MB segment files. Prometheus keeps at least three segments, and busy servers keep more so that at least two hours of raw data are covered.

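TSDB Block Storage: Prometheus persists data in 2-hour blocks. Each block is a directory containing a meta.json, an index file, a tombstones file, and a chunks subdirectory whose segment files (up to 512MB each by default) hold the samples for many series. Background compaction later merges older blocks into larger ones. Once a block ages past the retention period, Prometheus deletes it, so anything you need for longer must already have been shipped to long-term storage (Thanos, Mimir).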
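
To make the 2-hour layout concrete, here is a minimal sketch that computes which block window a sample falls into. It assumes block boundaries are aligned to multiples of the block range since the Unix epoch; the function name is hypothetical and this is an illustration, not Prometheus's own code.

```python
BLOCK_RANGE_MS = 2 * 60 * 60 * 1000  # default 2-hour block range, in milliseconds


def block_window(timestamp_ms: int, block_range_ms: int = BLOCK_RANGE_MS) -> tuple[int, int]:
    """Return the [start, end) window of the aligned block containing timestamp_ms."""
    start = timestamp_ms - (timestamp_ms % block_range_ms)
    return start, start + block_range_ms


# A sample scraped at 2024-03-01T13:33:20Z lands in the 12:00-14:00 UTC block.
start, end = block_window(1_709_300_000_000)
```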

Scrape Pools and relabel_configs: The scrape pipeline is highly configurable. A typical scrape_configs section:

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'prod-us-east-1'
    environment: 'production'

scrape_configs:
  # Kubernetes service discovery
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
        action: replace

  # Pod monitoring via pod annotation
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape='true'
      - action: keep
        source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: 'true'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        regex: '(.+)'
        target_label: __metrics_path__
        replacement: '${1}'
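      # Combine the pod IP from __address__ with the annotated port.
      # Source values are joined with ';' before the regex is applied.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        target_label: __address__
        replacement: '${1}:${2}'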

  # Federation: pull selected series from another Prometheus server
  - job_name: 'federation'
    honor_labels: true   # keep the job/instance labels of the federated series
    static_configs:
      - targets: ['cluster-prometheus.monitoring.svc:9090']
    metrics_path: /federate
    params:
      'match[]':
        - '{job="ai-inference"}'
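
How a replace relabel step transforms a target is easier to see in a small simulation. The sketch below is an approximation in Python, not Prometheus's implementation: it assumes source label values are joined with ";" and that the regex must match the whole joined string, and it uses Python's re module in place of RE2. The function name and the sample label values are hypothetical.

```python
import re


def relabel_replace(labels: dict[str, str], source_labels: list[str], regex: str,
                    target_label: str, replacement: str,
                    separator: str = ";") -> dict[str, str]:
    """Apply one `action: replace` relabel step and return a new label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    if match is None:
        return dict(labels)  # no match: labels pass through unchanged
    # Translate Prometheus-style ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


# Node job: rewrite the kubelet port to the node-exporter port.
node = relabel_replace(
    {"__address__": "10.0.0.5:10250"},
    ["__address__"], r"(.*):10250", "__address__", "${1}:9100",
)

# Pod job: combine the pod IP with the port taken from the annotation.
pod = relabel_replace(
    {"__address__": "10.1.2.3:8080",
     "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102"},
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)", "__address__", "${1}:${2}",
)
```

The pod case shows why the address rewrite needs two source labels: a regex with two capture groups has nothing to fill the second group with if only the port annotation is read.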

Recording Rules: Pre-Compute Expensive Queries

Recording rules are the performance tool most teams underuse. If you have a PromQL expression that takes more than 500ms to execute on your largest dashboards, pre-compute it as a recording rule.

# PrometheusRule CRD (prometheus-operator)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-inference-rules
  namespace: monitoring
  labels:
    app: prometheus
    role: alert-rules
spec:
  groups:
    - name: ai-inference.recording
      interval: 30s
      rules:
        # p95 of GPU utilization over 5m. DCGM_FI_DEV_GPU_UTIL is a gauge
        # (percent, 0-100), so use quantile_over_time, not histogram_quantile.
        - record: gpu:utilization:p95:5m
          expr: |
            quantile_over_time(0.95, DCGM_FI_DEV_GPU_UTIL[5m])
        # Token throughput pre-aggregated per model
        - record: vllm:tokens_per_second:rate5m
          expr: |
            sum by (model) (rate(vllm_num_tokens_total[5m]))
        # 30-day error ratio, the input to error budget calculations for the SLO
        - record: slo:error_rate:30d
          expr: |
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[30d]))
              /
              sum(rate(http_requests_total[30d]))
            )

    - name: ai-inference.alerts
      interval: 30s
      rules:
        - alert: HighGPUUtilization
          expr: gpu:utilization:p95:5m > 90
          for: 10m
          labels:
            severity: warning
            team: ml-platform
          annotations:
            summary: "GPU utilization p95 above 90% for 10 minutes"
            description: "Model {{ $labels.model }} GPU {{ $labels.gpu }} at {{ $value }}%"

        - alert: KVCacheHitRateLow
          expr: vllm:kv_cache_hit_rate < 0.7
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "KV cache hit rate dropped below 70%"
            description: "Indicates suboptimal batch sizing or context length configuration"
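
The for: clause in the alerts above is what keeps a brief spike from paging anyone. A minimal sketch of that behaviour, assuming the usual semantics (an alert is pending while its expression has been continuously true for less than the for duration, and any false evaluation resets it); the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    active_since: Optional[float] = None  # time the expression first became true
    state: str = "inactive"               # inactive | pending | firing


def evaluate(alert: AlertState, expr_true: bool, now: float, for_seconds: float) -> AlertState:
    """Advance one evaluation cycle of an alerting rule with a `for` clause."""
    if not expr_true:
        return AlertState()  # any false evaluation resets the timer
    since = alert.active_since if alert.active_since is not None else now
    state = "firing" if now - since >= for_seconds else "pending"
    return AlertState(active_since=since, state=state)


def simulate(values: list[float], threshold: float, for_seconds: float, interval: float) -> list[str]:
    """Evaluate `value > threshold` once per interval and record each state."""
    alert, states = AlertState(), []
    for step, value in enumerate(values):
        alert = evaluate(alert, value > threshold, step * interval, for_seconds)
        states.append(alert.state)
    return states


# 10 minutes above 90% at a 30s interval, then a drop back to 50%.
states = simulate([95.0] * 21 + [50.0], threshold=90.0, for_seconds=600.0, interval=30.0)
```

With these inputs the alert stays pending for the first twenty evaluations, fires on the evaluation at the 10-minute mark, and returns to inactive as soon as utilization drops.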

Prometheus Alertmanager Routing

Prometheus generates alerts; Alertmanager handles routing, grouping, deduplication, and silencing. Prometheus itself sends no notifications, and an Alertmanager with a single catch-all route sends every raw alert to the same channel. With proper routing:

# alertmanager.yml (ConfigMap mounted into Alertmanager StatefulSet)
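global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'   # placeholder sender address
  # Required by slack_configs that do not set their own api_url (placeholder value)
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/PATH'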

route:
  receiver: 'team-ops'
  group_by: ['alertname', 'cluster', 'namespace']
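  group_wait: 30s          # Wait before the first notification so initial alerts can batch
  group_interval: 5m       # Wait before notifying about new alerts added to an existing group
  repeat_interval: 4h      # Wait before re-sending a notification for alerts that are still firing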
  routes:
    # Critical alerts page immediately
    - match:
        severity: critical
      receiver: 'pagerduty'
      group_wait: 10s
      continue: true
    # AI inference routing. Listed before the generic warning route so that
    # ml-platform warnings are not captured by it first.
    - match:
        team: ml-platform
      receiver: 'ml-oncall'
      routes:
        - match:
            alertname: 'HighGPUUtilization'
          receiver: 'ml-platform-slack'
    - match:
        severity: warning
      receiver: 'slack-monitoring'

# inhibit_rules is a top-level key, not part of a route.
# Suppress GPU temperature warnings while the same GPU is already in critical throttling.
inhibit_rules:
  - source_match:
      alertname: 'GPUCriticalThrottling'
    target_match:
      alertname: 'GPUTemperatureWarning'
    equal: ['gpu', 'model']

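receivers:
  - name: 'team-ops'
    slack_configs:
      - channel: '#ops-critical'
        send_resolved: true
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        severity: critical
  # Every receiver named in the route tree must be defined, or the config fails to load
  - name: 'slack-monitoring'
    slack_configs:
      - channel: '#monitoring'      # placeholder channel name
        send_resolved: true
  - name: 'ml-platform-slack'
    slack_configs:
      - channel: '#ml-platform'     # placeholder channel name
        send_resolved: true
  - name: 'ml-oncall'
    webhook_configs:
      - url: 'http://ml-oncall-service.monitoring.svc:9095/webhook'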
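
Route ordering and the continue flag are the parts of this file that most often surprise people, so a small model of the matching walk can help. The sketch below assumes the usual behaviour: depth-first traversal, the first matching child wins unless it sets continue, and a node handles the alert itself when none of its children match. The Route class and the simplified tree (equality matchers only, same receiver names as above) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    continue_: bool = False
    routes: list["Route"] = field(default_factory=list)

    def matches(self, labels: dict[str, str]) -> bool:
        return all(labels.get(key) == value for key, value in self.match.items())


def resolve(route: Route, labels: dict[str, str]) -> list[str]:
    """Return the receivers an alert reaches, walking the tree depth-first."""
    if not route.matches(labels):
        return []
    receivers: list[str] = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        if found and not child.continue_:
            break  # first matching child wins unless it sets continue: true
    # If no child matched, the current node handles the alert itself.
    return receivers or [route.receiver]


tree = Route("team-ops", routes=[
    Route("pagerduty", {"severity": "critical"}, continue_=True),
    Route("ml-oncall", {"team": "ml-platform"}, routes=[
        Route("ml-platform-slack", {"alertname": "HighGPUUtilization"}),
    ]),
    Route("slack-monitoring", {"severity": "warning"}),
])

critical_gpu = resolve(tree, {"severity": "critical", "team": "ml-platform",
                              "alertname": "HighGPUUtilization"})
plain_warning = resolve(tree, {"severity": "warning"})
unlabelled = resolve(tree, {})
```

Swapping the last two child routes in this tree would send ml-platform warnings to slack-monitoring instead, which is why specific routes belong above generic ones.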

When to Use Grafana Alone

Grafana alone — without a data source actively feeding it — is just an empty dashboard builder. But in one specific scenario, you might deploy Grafana first: when your infrastructure is already instrumented with other backends (CloudWatch, Elasticsearch, PostgreSQL) and you need a unified visualization layer before committing to the full Prometheus+Grafana stack.

More realistically, "using Grafana alone" means deploying Grafana Cloud with its hosted Prometheus backend (Grafana Cloud Metrics), which removes the need to operate your own Prometheus infrastructure entirely.

Use Grafana (Cloud or self-hosted) when:

- You need to correlate metrics, logs, and traces in a single view
- You need SLO burn-rate tracking with multi-window alerting
- Part of your team does not write PromQL, and Prometheus's native UI is a barrier for them
- You want pre-built dashboard templates for Kubernetes, Prometheus, vLLM, Ray, or other common stacks
- You want to manage alert rules, contact points (OpsGenie, Microsoft Teams, custom webhooks), and complex routing logic from a UI rather than in Alertmanager YAML

Grafana Explore: The Debugging Tool Built for Observability

Grafana Explore is the most underutilized feature for practitioners who spend time debugging production issues. Rather than navigating pre-built dashboards, Explore lets you query any data source ad hoc (Prometheus via PromQL, Loki via LogQL, Tempo via TraceQL) and move between the results of one source and another.

When you click on a metric in Explore, Grafana can automatically query Loki for logs from the same time window and same label set. When you click on a log line with a trace_id field, Grafana jumps to that trace in Tempo. This correlated workflow is what makes Grafana essential for production debugging rather than just dashboarding.

Grafana Provisioning: GitOps-Friendly Dashboard Management

In production environments, dashboards should be version-controlled and deployed like any other infrastructure. Grafana supports three provisioning methods:

File-based provisioning (static JSON dashboards committed to Git):

# /etc/grafana/provisioning/dashboards/dashboards.yaml (provider configuration)
apiVersion: 1
providers:
  - name: 'Observability Dashboards'
    orgId: 1
    folder: 'Kubernetes'
    type: file
    disableDeletion: false
    # Rescan the path for changed JSON files every 30 seconds
    updateIntervalSeconds: 30
    # Allow provisioned dashboards to be edited from the UI
    allowUiUpdates: true
    options:
      # One directory per provider, so the two providers do not load the same files
      path: /var/lib/grafana/dashboards/kubernetes
  - name: 'AI Inference'
    orgId: 1
    folder: 'ML Platform'
    type: file
    options:
      path: /var/lib/grafana/dashboards/ai-inference
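
Because file-provisioned dashboards live in Git, a pre-commit or CI check can catch problems before Grafana loads them. The sketch below is one such check, under the assumption that each dashboard JSON file should carry its own unique uid; the function name and the throwaway demo files are hypothetical.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def find_uid_problems(dashboard_dir: Path) -> list[str]:
    """Report dashboard files with a missing uid and uids shared by several files."""
    files_by_uid: dict[str, list[str]] = defaultdict(list)
    problems: list[str] = []
    for path in sorted(dashboard_dir.rglob("*.json")):
        dashboard = json.loads(path.read_text(encoding="utf-8"))
        uid = dashboard.get("uid")
        if not uid:
            problems.append(f"{path.name}: missing uid")
        else:
            files_by_uid[uid].append(path.name)
    for uid, names in files_by_uid.items():
        if len(names) > 1:
            problems.append(f"uid {uid!r} reused by: {', '.join(names)}")
    return problems


# Demo with throwaway files: two share a uid, one has none.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.json").write_text(json.dumps({"uid": "vllm-production", "title": "A"}))
    (root / "b.json").write_text(json.dumps({"uid": "vllm-production", "title": "B"}))
    (root / "c.json").write_text(json.dumps({"title": "C"}))
    demo_problems = find_uid_problems(root)
```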

API-based provisioning (programmatic dashboard creation):

# Create a dashboard via Grafana API
curl -X POST \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "dashboard": {
      "title": "vLLM Production Overview",
      "uid": "vllm-production",
      "timezone": "browser",
      "panels": [
        {
          "title": "Token Throughput",
          "type": "timeseries",
          "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "rate(vllm_num_tokens_total[2m])",
              "legendFormat": "{{model}} - {{type}}"
            }
          ]
        },
        {
          "title": "KV Cache Hit Rate",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "vllm:kv_cache_hit_rate",
              "legendFormat": "{{model}}"
            }
          ],
          "fieldConfig": {
            "defaults": {
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"color": "red", "value": null},
                  {"color": "yellow", "value": 0.7},
                  {"color": "green", "value": 0.9}
                ]
              },
              "unit": "percentunit",
              "min": 0,
              "max": 1
            }
          }
        }
      ]
    },
    "folderId": 12,
    "overwrite": true
  }' \
  https://your-grafana-instance.com/api/dashboards/db
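
The same call is often easier to maintain from a script than from an inline curl payload. A minimal sketch using only the Python standard library, assuming the same endpoint and payload fields as the curl command above; the function name is hypothetical, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_dashboard_request(base_url: str, token: str, dashboard: dict,
                            folder_id: int, overwrite: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates or updates a dashboard."""
    body = json.dumps({
        "dashboard": dashboard,
        "folderId": folder_id,
        "overwrite": overwrite,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/dashboards/db",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_dashboard_request(
    "https://your-grafana-instance.com",
    token="YOUR_TOKEN",  # placeholder; read it from the environment in practice
    dashboard={"title": "vLLM Production Overview", "uid": "vllm-production", "panels": []},
    folder_id=12,
)
# To send it: urllib.request.urlopen(request)
```

Keeping the dashboard as a Python dict (or loading it from the JSON file in Git) avoids the quoting problems of embedding a large JSON document in a shell string.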

Kubernetes Operator provisioning (Grafana Operator CRD):

# GrafanaDashboard CRD — managed by the Grafana Operator
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: vllm-dashboard
  namespace: monitoring
spec:
  instanceSelector:
    matchLabels:
      dashboards: production
  json: >
    {
      "title": "vLLM Production",
      "panels": [
        {
          "title": "GPU Utilization",
          "type": "timeseries",
          "datasource": "Prometheus",
          "targets": [
            {
              "expr": "avg_over_time(DCGM_FI_DEV_GPU_UTIL[2m])",
              "legendFormat": "GPU {{gpu}}"
            }
          ]
        }
      ]
    }

When to Use Both Together: The Production Stack

For any production environment with more than two engineers, the correct architecture is Prometheus and Grafana deployed together. The decision is not which one to choose — it is how much operational responsibility you want to take on.

The Modern Observability Stack in 2026

┌─────────────────────────────────────────────────────────────┐
│                    Grafana (Visualization + Alerting)        │
│         Dashboards · Explore · Alert Rules · Auth (SSO)      │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│   Prometheus  │    │    Loki       │    │    Tempo      │
│   (Metrics)   │    │   (Logs)      │    │   (Traces)    │
│  TSDB + WAL    │    │  LogQL        │    │  TraceQL      │
└───────────────┘    └───────────────┘    └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              ▼
              ┌───────────────────────────────┐
              │  Mimir / Thanos (Long-term    │
              │  metrics storage + horizontal  │
              │  scalability)                  │
              └───────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
    ┌─────────────────┐             ┌─────────────────┐
    │ Grafana Alloy   │             │  Prometheus     │
    │ (Collection     │             │  (Native scrape│
    │  agent: logs,   │             │   + Alloy       │
    │   metrics,      │             │   remote_write) │
    │   traces)       │             │                 │
    └─────────────────┘             └─────────────────┘

Grafana Alloy: The Unified Collection Agent

Grafana Alloy is Grafana Labs' successor to Grafana Agent and Promtail. It unifies metrics, logs, and traces collection into a single binary configured in the Alloy configuration syntax (a declarative language formerly called River). For teams running mixed infrastructure:

// config.alloy: single binary, unified telemetry collection
prometheus.scrape "default" {
  scrape_interval = "15s"
  targets = [
    { "__address__" = "vllm-service.monitoring.svc:9100" },
    { "__address__" = "ray-dashboard.monitoring.svc:52365" },
  ]
  // forward_to takes a list of receivers
  forward_to = [prometheus.remote_write.mimir.receiver]
}

// Remote write to Mimir (Grafana's hosted Prometheus backend)
prometheus.remote_write "mimir" {
  endpoint {
    // Placeholder endpoint; use the remote write URL shown for your stack
    url = "https://mimir-stackpulsar.grafana.net/prometheus/api/v1/write"
    // Inline credentials are shown for brevity; load the password from a
    // file or environment variable in production
    basic_auth {
      username = "123456"
      password = "YOUR_GRAFANA_CLOUD_API_KEY"
    }
    // Per-endpoint send queue tuning
    queue_config {
      capacity = 2000
    }
  }
}

// Discover log files (local.file_match expands the globs) and forward them to Loki
local.file_match "application_logs" {
  path_targets = [
    { "__path__" = "/var/log/pods/*/vllm/*.log", "job" = "vllm" },
    { "__path__" = "/var/log/pods/*/ray/*.log", "job" = "ray" },
  ]
}

loki.source.file "application_logs" {
  targets    = local.file_match.application_logs.targets
  forward_to = [loki.write.loki.receiver]
}

loki.write "loki" {
  endpoint {
    url = "https://logs-prod-012.grafana.net/loki/api/v1/push"
    basic_auth {
      username = "789012"
      password = "YOUR_LOKI_API_KEY"
    }
  }
}

// Collect traces and forward to Tempo
otelcol.receiver.otlp "tempo" {
  grpc { endpoint = "0.0.0.0:4317" }
  http { endpoint = "0.0.0.0:4318" }
  output {
    traces = [otelcol.processor.batch.tempo.input]
  }
}

otelcol.processor.batch "tempo" {
  output { traces = [otelcol.exporter.otlphttp.tempo.input] }
}

otelcol.exporter.otlphttp "tempo" {
  client { endpoint = "https://tempo-prod-012.grafana.net:443" }
}

Grafana Loki: Log Aggregation That Works With Prometheus

Loki is not a full-text search engine — it indexes log stream labels, not log content. This is a deliberate design trade-off that makes Loki dramatically cheaper than Elasticsearch for high-volume environments. You query Loki with LogQL, which has label matchers that feel similar to PromQL:

# LogQL: retrieve vLLM error logs that mention OOM. The time range (for
# example the last 5 minutes) is set on the query, not inside the expression.
{service="vllm", namespace="production"}
  |= "OOM"
  | json
  | level="error"
  | line_format `{{.timestamp}} [{{.level}}] {{.message}} GPU={{.gpu_id}}`

# Correlate: get logs from the window around an alert. Pass the window as the
# query's start and end (here 1709300000000000000 to 1709300600000000000, in ns).
{service="vllm"}
  | json
  | __error__=""
  | level!="debug"

# Parse structured JSON logs and filter on the extracted latency
{service="vllm"}
  | json
  | latency_seconds > 0.5
  | line_format `{{.timestamp}} SLOW REQUEST: {{.endpoint}} took {{.latency_seconds}}s`
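
When correlating logs with an alert from a script, the time window is passed as request parameters rather than written into the LogQL expression. The sketch below builds such a request URL, assuming Loki's query_range endpoint with start and end given as Unix nanoseconds; the base URL, function names, and the five-minute padding are hypothetical choices.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlencode, urlsplit


def to_unix_ns(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Unix nanoseconds (whole seconds only)."""
    return int(moment.timestamp()) * 1_000_000_000


def loki_query_range_url(base_url: str, logql: str, alert_time: datetime,
                         before: timedelta = timedelta(minutes=5),
                         after: timedelta = timedelta(minutes=5),
                         limit: int = 1000) -> str:
    """Build a query_range URL covering a window around alert_time."""
    params = {
        "query": logql,
        "start": to_unix_ns(alert_time - before),
        "end": to_unix_ns(alert_time + after),
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


alert_time = datetime.fromtimestamp(1_709_300_000, tz=timezone.utc)
url = loki_query_range_url(
    "http://loki.monitoring.svc:3100",  # hypothetical in-cluster address
    '{service="vllm"} | json | level!="debug"',
    alert_time,
)
query = parse_qs(urlsplit(url).query)
```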

Grafana Tempo: Distributed Tracing Without the Operational Overhead

Tempo is Grafana Labs' trace storage backend. It accepts OpenTelemetry Protocol (OTLP) traces, stores them in object storage (S3, GCS, Azure Blob), and exposes them via the TraceQL query language. The key advantage over Jaeger is operational simplicity: Tempo's only required dependency is object storage (there is no Cassandra or Elasticsearch cluster to run), and it integrates natively with Grafana's existing infrastructure.

# TraceQL: find slow inference spans that returned server errors.
# All conditions go inside the braces and are combined with &&.
{ resource.service.name = "vllm-inference" && span.http.status_code >= 500 && duration > 5s }

# Correlate: find spans with a high token count that exceeded a fixed latency threshold
{ .tracer_name = "ai-inference" && span.vllm.tokens.total > 4096 && span.llm.latency.total > 10 }

Mimir: Horizontal Scale for Prometheus

Grafana Mimir is the open-source, horizontally scalable fork of Cortex. It replaces a single Prometheus instance with a distributed system that handles ingestion, storage, and querying across multiple nodes. For teams with more than 500,000 active metric series, Mimir eliminates the single-node bottleneck.

# Helm values for kube-prometheus-stack: ship samples to Mimir via remote write
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://mimir.monitoring.svc:8080/api/v1/push
        bearerTokenFile: /var/run/secrets/mimir-auth/token
        # The Prometheus CRD uses camelCase field names
        queueConfig:
          capacity: 100000
          maxShards: 50
          maxSamplesPerSend: 2000
          minShards: 10
    # Log the queries Prometheus executes (useful for finding expensive ones)
    queryLogFile: /var/log/prometheus/query.log
    # Local retention only. Long-term retention (for example 2 years) is
    # configured in Mimir itself, not here.
    retention: 24h

# Helm values for the separate mimir-distributed chart (relevant for 1M+ series)
distributor:
  replicas: 3
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
ingester:
  replicas: 4
  persistentVolume:
    size: 50Gi
    storageClass: ssd
store_gateway:
  replicas: 3
query_scheduler:
  replicas: 2

Migration Playbook: Moving Between Stack Configurations

Moving from Self-Hosted Prometheus to Grafana Cloud

If you are outgrowing your self-managed Prometheus and want to migrate to Grafana Cloud Metrics without downtime:

Step 1: Provision a Grafana Cloud Metrics instance and note the Remote Write endpoint and credentials.

Step 2: Add a remote_write target to your existing Prometheus config. The local TSDB keeps receiving every scraped sample regardless of remote_write, so leave it in place for now:

remote_write:
  # Grafana Cloud as an additional destination; local storage is unaffected
  - url: https://prometheus-us-central1.grafana.net/api/prom/push
    basic_auth:
      username: 'YOUR_INSTANCE_ID'   # placeholder; shown with the endpoint in Step 1
      password_file: /etc/prometheus/grafana-cloud-token
    queue_config:
      capacity: 5000
      max_shards: 20

Step 3: Validate that metrics are appearing in Grafana Cloud using the Explore page. Run targeted PromQL queries for your critical series.

Step 4: Update dashboard data sources in Grafana to point to the Cloud Prometheus instance.

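Step 5: Once the data has been validated over 24-48 hours, retire the local TSDB as the source of truth, for example by shortening its retention so Prometheus only buffers recent data while Grafana Cloud holds the history.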

Moving from Legacy scrape_configs to Prometheus Operator

If you are still managing Prometheus via raw prometheus.yml files and want to adopt the prometheus-operator for GitOps-friendly management:

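Step 1: Install the prometheus-operator via the kube-prometheus-stack Helm chart:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# The release name ("prometheus") is what the chart's default selectors match
# against the release label on ServiceMonitors
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheusOperator.enabled=true \
  --set prometheusOperator.admissionWebhooks.patch.enabled=true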

Step 2: Convert your existing scrape_configs into ServiceMonitor or PodMonitor CRDs. The semantics map directly:

# Old scrape_configs equivalent via ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: monitoring
  labels:
    release: prometheus  # Must match the Helm release name; the chart's default selector keys on this label
spec:
  jobLabel: my-job
  selector:
    matchLabels:
      app: my-service
  namespaceSelector:
    matchNames:
      - production
  endpoints:
    - port: web
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
      relabelings:
        # CRD fields are camelCase, unlike relabel_configs in prometheus.yml
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: node
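
One detail that trips up conversions: relabel rules in prometheus.yml use snake_case keys, while the ServiceMonitor CRD expects camelCase (sourceLabels, targetLabel). A small helper can do the renaming when converting many rules; this is a sketch with hypothetical function names that only renames keys and leaves values untouched.

```python
def to_camel(key: str) -> str:
    """Convert a snake_case key such as source_labels to camelCase."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_crd_relabeling(rule: dict) -> dict:
    """Rename the keys of one relabel_configs entry for use under `relabelings`."""
    return {to_camel(key): value for key, value in rule.items()}


legacy_rule = {
    "source_labels": ["__meta_kubernetes_pod_node_name"],
    "target_label": "node",
    "action": "replace",
}
crd_rule = to_crd_relabeling(legacy_rule)
```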

Step 3: Create PrometheusRule CRDs for all existing alerting and recording rules. Delete the old prometheus.yml configmap once all rules are migrated.


Cost Comparison: Prometheus vs Grafana at Scale

| Configuration | Self-Hosted Cost | Managed (Grafana Cloud) | Notes |
| --- | --- | --- | --- |
| Prometheus only (≤50K series) | Free software (1-2 VMs: ~$40-80/mo) | Grafana Cloud Metrics: $25/mo + $0.30/10K series | Free tier available for <10K series |
| Prometheus + Grafana OSS (≤100K series) | $80-150/mo (3 VMs) | Grafana Cloud Pro: $75/mo + consumption | OSS requires ops FTE (~0.25 FTE) |
| Mimir + Loki + Tempo + Grafana (500K series) | $400-800/mo (5-8 VMs + object storage) | Grafana Cloud Advanced: $159/mo + $0.20/10K metrics + $0.10/GB logs | 1-2 FTE for ops at this scale |
| Multi-cluster federation (2M+ series) | $1,500-3,000/mo (dedicated ops team) | Not recommended for cloud | Self-hosted Thanos or Mimir federation required |
| Datadog equivalent | N/A | $1,500-3,000/mo at equivalent scale | Datadog includes APM, security, and log management |
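
The managed prices in the table follow a base-fee-plus-consumption shape, which is simple to model when comparing options. The sketch below takes the first row's figures at face value ($25/mo plus $0.30 per 10K series) and ignores free-tier allowances and any included usage; the function name is hypothetical and the numbers are the table's, not verified pricing.

```python
def managed_monthly_cost(active_series: int, base_fee: float, price_per_10k: float) -> float:
    """Base fee plus a per-10K-series consumption charge, in dollars per month."""
    return base_fee + price_per_10k * (active_series / 10_000)


# First table row: 50K series at $25/mo + $0.30 per 10K series.
cost_50k = round(managed_monthly_cost(50_000, base_fee=25.0, price_per_10k=0.30), 2)
```

At 50K series that comes to $26.50 a month, below the $40-80 range the same row gives for self-hosting on 1-2 VMs.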

Team Size Recommendations: Which Stack Matches Your Headcount?

| Team Size | Recommended Stack | Infrastructure Cost | Ops Overhead | Best For |
| --- | --- | --- | --- | --- |
| Solo / Side project | Grafana Cloud Free Tier | Free | Near zero | Side projects, hobby experiments |
| 1-3 engineers (startup) | Grafana Cloud Pro + Hosted Metrics/Logs | $75-150/month | Minimal | Startups shipping fast, no dedicated ops |
| 3-10 engineers | Self-hosted kube-prometheus-stack + Loki + Tempo | $150-400/month | 0.25-0.5 FTE | Product teams with Kubernetes already |
| 10-50 engineers | Mimir + Loki + Tempo + Grafana Enterprise + Alloy | $500-1,200/month | 0.5-1 FTE | Engineering orgs with dedicated platform team |
| 50+ engineers (enterprise) | Multi-cluster Thanos/Mimir federation + Loki + Tempo + Grafana Enterprise + SSO | $2,000-5,000/month | 1-2 FTE | Large orgs, multi-cloud, compliance requirements |

Grafana vs Prometheus Decision Matrix

Use this table as a quick reference when making architectural decisions:

| Decision Point | Choose Prometheus | Choose Grafana | Choose Both |
| --- | --- | --- | --- |
| Where to write alerts | Prometheus + Alertmanager | Grafana Alerting (simpler UI) | Prometheus evaluates; Grafana manages routing |
| Debugging in the terminal | ✅ promtool, PromQL | ❌ requires browser | Prometheus for query; Grafana for visualization |
| Correlating logs with metrics | ❌ requires Loki separately | ✅ Grafana Explore | Loki + Explore in Grafana over Prometheus data |
| Long-term storage (1yr+) | ❌ limited retention | ❌ not a data store | Mimir or Thanos as Prometheus backend |
| Pre-built dashboard templates | ❌ none | ✅ grafana.com/dashboards | Grafana with Prometheus data source |
| GitOps-configured monitoring | ✅ prometheus-operator CRDs | ✅ Grafana Operator CRDs | Both operators work together |
| Multi-source correlation (metrics + traces + logs) | ❌ not supported | ✅ single pane of glass | Grafana queries Prometheus for metrics |
| SLI/SLO tracking | ⚠️ via recording rules | ✅ native SLO panel + burn-rate alerts | Prometheus for metric computation; Grafana for visualization |
| Multi-tenant monitoring | ⚠️ federation required | ✅ orgs + folder-based RBAC | Prometheus federation + Grafana multi-tenancy |
| GPU/inference monitoring | ✅ DCGM Exporter + custom metrics | ✅ pre-built AI dashboards | Prometheus scrapes GPU metrics; Grafana visualizes |

The Newcomers: What Changed in 2025-2026

Prometheus 3.0 (released November 2024) and the 3.x releases that followed brought several production-relevant changes. Remote Write 2.0 improves compression and reliability for high-volume metric streams, which matters for AI inference workloads generating 50K+ samples per second; support was introduced as experimental, so check its status in the version you run. Native OTLP ingestion means OpenTelemetry SDKs and collectors can push metrics directly to Prometheus without an adapter layer, collapsing a common source of operational complexity (traces and logs still go to backends such as Tempo and Loki). Exemplars, which attach trace IDs to individual samples (typically histogram observations), link Prometheus metrics to distributed traces and enable correlation workflows that previously required custom instrumentation; they predate 3.0 but underpin the metrics-to-traces workflows described here.

Grafana 12 (2025) introduced Git Sync, which brings dashboard version control with Git-like semantics: rollbacks and PR-based dashboard changes through the Grafana IaC ecosystem. The Grafana Assistant (AI-powered) can generate PromQL queries from natural language, explain what a metric represents, and suggest correlated signals during incident investigation. Grafana Cloud's pre-built AI dashboards now ship with templates for vLLM, Ray, and Hugging Face Transformers, covering token throughput, KV cache statistics, batch scheduling metrics, and GPU memory pressure out of the box.


Essential PromQL Patterns for Production

These are the queries that separate teams who can debug their systems from teams who cannot:

# Error rate with 5-minute window — the most important SLO signal
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# p99 latency: histogram_quantile needs the _bucket series, aggregated by le
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Latency SLO: proportion of requests served in under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# Capacity planning — predict when you hit memory limits
predict_linear(container_memory_working_set_bytes{container!=""}[1h], 4 * 3600)

# Kubernetes pod restarts — alerting on restarts within 1 hour
increase(kube_pod_container_status_restarts_total[1h]) > 3

# GPU utilization via DCGM Exporter (a gauge in percent, so average it rather than rate it)
avg_over_time(DCGM_FI_DEV_GPU_UTIL[2m]) > 90

# vLLM-specific: batch queue saturation
vllm:scheduler_pending_tokens / vllm:scheduler_total_tokens > 0.8
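
Of the patterns above, predict_linear is the one whose behaviour is least obvious from its name. Conceptually it fits a straight line through the samples in the range by least squares and extrapolates from the evaluation time. The sketch below reproduces that idea in plain Python as an approximation for intuition, not Prometheus's exact implementation; the sample data is hypothetical.

```python
def predict_linear(samples: list[tuple[float, float]], eval_time: float,
                   seconds_ahead: float) -> float:
    """Least-squares fit over (timestamp, value) samples, extrapolated from eval_time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    # Shift timestamps so the intercept is the fitted value at eval_time.
    xs = [timestamp - eval_time for timestamp, _ in samples]
    ys = [value for _, value in samples]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * seconds_ahead


# Memory growing by 1 unit per second, sampled every 60s; predict 4 hours ahead.
samples = [(0.0, 100.0), (60.0, 160.0), (120.0, 220.0)]
predicted = predict_linear(samples, eval_time=120.0, seconds_ahead=4 * 3600)
```

Because the fit is linear, a short range over a noisy or bursty series produces unstable predictions; widening the range (1h in the query above) trades responsiveness for stability.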


The Bottom Line

Prometheus and Grafana are not competitors; they are the data plane and presentation layer of the same observability system. Use Prometheus when you need to collect, store, and query metrics, especially in terminal or API-driven workflows. Use Grafana when you need visualization, multi-source correlation, SLO tracking, or a shared operational view for people who do not write PromQL. Deploy both together for any production environment that matters.

For teams running AI inference infrastructure: the combination of Prometheus (scraping vLLM, Ray, and DCGM metrics), Grafana Loki (application logs), Grafana Tempo (distributed traces), and Grafana (dashboards and alerting) covers the full observability surface — and Grafana Cloud can host all of it if you want to eliminate operational overhead.

The real decision is not Prometheus versus Grafana. It is how much operational complexity your team can absorb versus how much money you want to pay for managed infrastructure. For most teams in 2026, the right answer is: start with Grafana Cloud Pro (managed Prometheus + Grafana), migrate to self-hosted Mimir + Grafana when you hit 500K+ series, and use the savings to fund platform engineering headcount.
