Premium Deep Dive | ~3,200 words | $10/mo or $96/yr | Published April 10, 2026
Who This Is For
You went all-in on Datadog. Spent months on the setup. Built hundreds of dashboards. Set up 40+ monitors. Integrated every service. And now you're staring at a bill that went from $2K to $15K/month because someone enabled one too many custom metrics, and you didn't catch the billing alerts.
You're not leaving Datadog because it doesn't work. You're leaving because the CFO wants answers and the invoice keeps climbing.
This playbook is for you. It's the step-by-step process we used on StackPulse's own infrastructure, refined and documented across three client migrations we've advised. It covers:
- The exact migration sequence (don't skip order — some steps unlock others)
- Config snippets for Prometheus, Grafana, and Grafana Cloud that map 1:1 to your existing Datadog setup
- The billing trap we fell into and how to avoid it
- What Datadog does better (so you know what you'll miss)
Why This Guide Exists
Datadog is genuinely good software. The APM is polished. The dashboard UX is fast. The out-of-box integrations work. But at scale, the pricing model becomes punishing. You pay for:
- Custom metrics at $0.05 per metric per month (this is the killer)
- Hosts at $15-40/host/month depending on tier
- Container monitoring at $5-10/container/month
- Custom traces at $0.10/1M traces
- Log ingestion at $0.10/GB
A mid-size deployment running 200 containers, 40 hosts, with moderate logging can easily hit $8-15K/month. And because the pricing is metered, you don't find out until the bill arrives.
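To see how those line items compound, here's a back-of-envelope calculation using rough midpoint list prices and an illustrative metric count (every number below is an assumption for illustration, not a quote):

```shell
# Illustrative monthly bill: 40 hosts, 200 containers, 300GB logs,
# 100K billable custom metric series, at rough midpoint list prices.
hosts=$((40 * 27))                    # 40 hosts x ~$27/host
containers=$((200 * 7))               # 200 containers x ~$7/container
logs=$((300 / 10))                    # 300GB x $0.10/GB
custom_metrics=$((100000 * 5 / 100))  # 100K series x $0.05/series
total=$((hosts + containers + logs + custom_metrics))
echo "~\$${total}/month"
```

Note that custom metrics alone account for roughly two-thirds of this hypothetical bill, which is why Phase 1 focuses on them first.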
We built our free Datadog alternatives guide to help people evaluate their options. This playbook goes further: it's the exact migration plan for teams that have already decided to leave.
Phase 1: Audit Before You Cut (Week 1)
Do not skip this phase. Most migration failures happen because teams jump to the tooling before understanding what they actually have running in Datadog.
Step 1.1: Export Everything
Datadog's export capabilities are limited: there's no bulk, machine-readable dashboard export. Here's the workaround, using the Datadog REST API directly:
# Export all monitors to JSON
curl -s "https://api.datadoghq.com/api/v1/monitor" \
  -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
  > monitors_export.json
# Build a dashboard manifest (the list endpoint returns summaries only)
curl -s "https://api.datadoghq.com/api/v1/dashboard" \
  -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
  | jq '.dashboards[] | {id, title, url}' > dashboards_manifest.json
There's no single-call bulk export for dashboard definitions: fetch each dashboard's JSON individually via the API, or open each dashboard in the UI → Settings → Export JSON. Painful, but necessary.
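To turn that manifest into full dashboard definitions, you can loop over it and hit the per-dashboard endpoint (GET /api/v1/dashboard/{dashboard_id} in the Datadog API); the env var names match the snippet above:

```shell
# Fetch the full JSON definition for each dashboard id in the manifest.
mkdir -p dashboards
for id in $(jq -r '.id' dashboards_manifest.json); do
  curl -s "https://api.datadoghq.com/api/v1/dashboard/${id}" \
    -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
    -o "dashboards/${id}.json"
done
```

The resulting JSON files are also useful later as the source of truth when porting widgets to Grafana panels.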
Step 1.2: Categorize Your Bill
Go to Datadog → Plan & Usage and download the usage CSV. Sort by:
- Custom metrics — these are your biggest cost driver at scale
- Hosts — each VM/server you run the agent on
- Containers — Kubernetes pods, ECS tasks
- Logs — ingested GB/day
- Traces — APM spans
For each category, record: current monthly cost, % of total, and whether you actively use the data.
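A quick way to build that record is a small awk pass over the usage CSV. The column layout below (product, cost) is hypothetical; the real export's column names differ, so adjust the field references to match your file:

```shell
# Hypothetical usage.csv layout; replace with your real Datadog export.
cat > usage.csv <<'EOF'
product,cost
custom_metrics,6200
hosts,1400
logs,900
containers,700
EOF
# Rank cost drivers and compute each category's share of the total.
awk -F, 'NR > 1 { cost[$1] = $2; total += $2 }
  END { for (p in cost) printf "%-15s %6d %5.1f%%\n", p, cost[p], 100*cost[p]/total }' usage.csv \
  | sort -k2 -rn > usage_report.txt
cat usage_report.txt
```

The percentage column is what matters: it tells you which migration phase will actually move the bill.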
Step 1.3: Find the Metric Leak
Custom metrics are almost always the surprise. Here's how to find them:
-- Run in Datadog Metrics Explorer
-- Group by metric name, sum over 30 days, sort descending
sum:datadog.estimated_usage.metrics.custom{*} by {metric_name}.as_count()
Look for:
- nginx or apache metrics from auto-discovery you forgot to disable
- JVM garbage collection metrics from applications you decommissioned
- Kubernetes metrics being emitted from every namespace
- Custom business metrics that could be sampled or aggregated
Disable everything you don't use. This alone can cut your bill 40-60% without any migration.
Phase 2: Build the Target Stack (Week 2-3)
The Replacement Stack
| Datadog Feature | Replacement | Monthly Cost |
|---|---|---|
| APM (traces) | Grafana Alloy + Tempo | $0-200 |
| Infrastructure monitoring | Prometheus + node_exporter | $0 |
| Kubernetes monitoring | kube-state-metrics + cAdvisor | $0 |
| Custom metrics | Prometheus + Grafana Cloud (10K free series) | $50-200 |
| Dashboards | Grafana (self-hosted or Cloud) | $0-65 |
| Logs | Grafana Loki | $0-300 |
| Uptime monitoring | Grafana Cloud (uptime checks) | $0-50 |
| Monitors/alerts | Grafana Alerting | $0 |
Total target: $50-600/month vs $5-15K/month.
Installing Grafana Alloy (Successor to the Grafana Agent)
Grafana Alloy is the replacement for the Datadog Agent in this stack. It's a distribution of the OpenTelemetry Collector that runs as a single binary:
# Install on Ubuntu/Debian (after adding Grafana's apt repository)
sudo apt-get install alloy
# Or via Docker
docker run -d \
--name alloy \
--volume $(pwd)/config.alloy:/etc/alloy/config.alloy:ro \
--volume /var/run/docker.sock:/var/run/docker.sock:ro \
grafana/alloy:latest \
run /etc/alloy/config.alloy
Alloy Config: APM Tracing (replaces Datadog APM)
// config.alloy — APM/OTLP tracing
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
output { traces = [otelcol.processor.batch.default.input] }
}
otelcol.processor.batch "default" {
output { traces = [otelcol.exporter.otlp.grafana_tempo.input] }
}
otelcol.exporter.otlp "grafana_tempo" {
endpoint = "https://your-tempo-instance:443"
tls { insecure = false }
}
This replaces the Datadog Agent's apm_config block. Your services send OTLP to port 4318 (HTTP) or 4317 (gRPC); Alloy batches the spans and forwards them to Grafana Tempo.
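On the application side, pointing a service at Alloy usually requires no code changes if it already uses an OpenTelemetry SDK; the standard OTel environment variables are enough (the service name here is illustrative):

```shell
# Standard OpenTelemetry SDK environment variables (per the OTel spec);
# point an instrumented service at the local Alloy instance.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="api"
```

This is also what makes the later dual-write phase cheap: switching backends is a matter of changing one endpoint variable.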
Alloy Config: Host Metrics (replaces Datadog host monitoring)
// config.alloy — node_exporter for host metrics
prometheus.scrape "node" {
targets = [{"__address__" = "localhost:9100"}]
scrape_interval = "15s"
forward_to = [otelcol.receiver.prometheus.default.input]
}
otelcol.receiver.prometheus "default" {
output { metrics = [otelcol.exporter.prometheus.grafana_cloud.input] }
}
otelcol.exporter.prometheus "grafana_cloud" {
endpoint = "https://prometheus-us-central1.grafana.net/api/prom/push"
headers = {
"Authorization" = "Bearer ${GRAFANA_CLOUD_API_KEY}",
}
} Grafana Cloud vs Self-Hosted: The Decision
For teams under 50 hosts, Grafana Cloud's free tier currently gives you (limits change, so check the pricing page):
- 3 active users / dashboard editors
- 10K active metric series
- 50GB of log ingestion per month
The $65/mo "Grafana Cloud Pro" plan covers the vast majority of small-team needs. If you eventually outgrow its pricing, migrate to self-hosted.
If you need enterprise features (SAML, audit logs, custom retention): self-hosted Grafana + Prometheus + Tempo + Loki on a reserved instance.
Phase 3: The Migration Sequence (Week 3-6)
Critical: Run Datadog AND the new stack in parallel during migration. Don't cut Datadog until you're confident.
Week 3: Dual-Write Mode
- Install Grafana Alloy on all hosts (keep Datadog Agent running)
- Configure Alloy to forward to your new Grafana Cloud/Tempo
- Set up the same dashboards in Grafana — compare side-by-side
- Run your services' OTLP exporters to both Datadog AND Alloy temporarily
The goal this week: identical data in both systems.
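One way to get identical trace data in both systems without touching every service is to fan out at the collector. A sketch in Alloy config, assuming your Alloy build includes the otelcol.exporter.datadog component (component names here are illustrative, and the Tempo exporter matches the earlier config):

```river
// Dual-write: one batch processor forwards traces to both backends.
otelcol.processor.batch "dual" {
  output {
    traces = [
      otelcol.exporter.datadog.default.input,
      otelcol.exporter.otlp.grafana_tempo.input,
    ]
  }
}

otelcol.exporter.datadog "default" {
  api {
    api_key = env("DD_API_KEY")
  }
}
```

Once validation passes, you delete the Datadog exporter from this list and the cutover is a one-line config change.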
Week 4: Metric Parity Validation
This is where most teams get stuck. The gap isn't tooling — it's which metrics you're actually using.
Create a mapping document:
Datadog Metric | Grafana/Prometheus Equivalent
----------------------------------|------------------------------
system.cpu.user | node_cpu_seconds_total{mode="user"}
system.mem.usable | node_memory_MemAvailable_bytes
system.load.1 | node_load1
kubernetes.cpu.usage.total | container_cpu_usage_seconds_total
kafka.messages_in.rate | kafka_server_brokertopicmetrics_messagesinpersec_rate
Roughly 80% of your Datadog metrics map directly to Prometheus metrics from node_exporter, cAdvisor, or the Kafka/JVM exporters. The remaining 20% are custom business metrics you'll need to re-emit via Prometheus client libraries.
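As a worked example of the mapping: a CPU-utilization-by-host widget translates to a single PromQL expression over node_exporter's counters (a sketch; adjust the rate window to your scrape interval):

```promql
# CPU utilization (%) per host: 1 minus the idle fraction,
# averaged across all CPUs on each instance.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

The pattern generalizes: Datadog gauges usually map to a Prometheus gauge directly, while Datadog rates map to `rate()` over a Prometheus counter.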
Week 5: Dashboards in Parallel
Export your Datadog dashboards (manually, as noted above). For each dashboard:
- Identify the widgets (time series, heatmaps, tables, logs)
- Find the Grafana equivalent panel type
- Port the PromQL/LogQL queries
For APM traces in Grafana: use Grafana Tempo. Datadog's service and resource filters map onto OpenTelemetry attributes (resource.service.name, the span name, span.http.route), and most Datadog APM queries have a close TraceQL equivalent.
// TraceQL equivalent of Datadog's APM query:
// service:api AND resource:/checkout AND http.status_code:500
{ resource.service.name = "api" && span.http.route = "/checkout" && span.http.status_code = 500 }
Week 6: Kill Datadog (After Validation)
Before cutting:
- [ ] All critical monitors replicated in Grafana Alerting
- [ ] All dashboards validated against live data
- [ ] Runbooks updated with new system names
- [ ] On-call team trained on Grafana UI
- [ ] Rollback plan documented (keep Datadog agent on 1 host for 48hrs)
Remove Datadog Agent from all hosts:
# Stop and disable the agent
sudo systemctl stop datadog-agent
sudo systemctl disable datadog-agent
# Remove the package
sudo apt remove datadog-agent # Debian/Ubuntu
sudo yum remove datadog-agent # RHEL/CentOS
# Verify nothing is still listening on the agent's ports
# (8125 dogstatsd UDP, 8126 trace intake TCP)
sudo ss -tulnp | grep -E '8125|8126' # Should return nothing
What Datadog Does Better (Be Honest)
Before you celebrate the cost savings: acknowledge what you'll lose.
APM Experience
Datadog's APM UI is genuinely better than Grafana Tempo + Explore. The flame graphs are interactive, the span-level performance analysis is faster, and the service map is more intuitive. If your team lives in APM daily, the migration cost is real.
Mitigation: Grafana's new TraceQL explorer and Tempo 2.x have closed most of this gap. But budget 2-4 weeks of adjustment time.
Out-of-Box Kubernetes Monitoring
Datadog's auto-discovery for Kubernetes is better than anything in the Prometheus ecosystem. You get meaningful dashboards the moment you install the agent, without any configuration. Prometheus requires manual dashboard building.
Mitigation: Use the kube-prometheus-stack Helm chart which includes ~50 pre-built dashboards. Not as turnkey, but close.
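Installing that chart is a one-time step; a sketch assuming Helm 3 and cluster-admin access (the release and namespace names are illustrative):

```shell
# Add the prometheus-community repo and install kube-prometheus-stack,
# which bundles Prometheus, Alertmanager, Grafana, kube-state-metrics,
# node_exporter, and a set of pre-built dashboards.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```

From there, the out-of-box dashboards cover nodes, pods, and workloads; you only hand-build the app-specific ones.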
Network Performance Monitoring
Datadog's eBPF-based NPM (Network Performance Monitoring) is head-and-shoulders above anything in the open-source world. It gives you TCP-level visibility, DNS monitoring, and flow maps that no other tool matches.
Mitigation: If you use Datadog NPM heavily, this may be worth keeping as a standalone tool and negotiating a reduced package.
Log Management
Datadog's log parsing (Parser processor, Grok) is much more powerful than LogQL's pattern matching. If your logs are unstructured and you're doing heavy log analysis, Grafana Loki's query language will feel limited.
Mitigation: Use Grafana Alloy's loki.process stage for parsing, or keep Datadog Logs at a reduced tier just for log analysis while migrating metrics/traces/APM.
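A minimal sketch of that loki.process stage, assuming JSON-structured application logs (component names and the Loki URL are illustrative):

```river
// Parse JSON app logs and promote "level" to a Loki label before shipping.
loki.process "parse_app" {
  forward_to = [loki.write.default.receiver]
  stage.json {
    expressions = { level = "level", msg = "message" }
  }
  stage.labels {
    values = { level = "" }
  }
}

loki.write "default" {
  endpoint {
    url = "https://your-loki-instance/loki/api/v1/push"
  }
}
```

Keep the label set small (level, service, environment): Loki bills in cardinality pain rather than dollars.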
The Billing Trap We Fell Into (And How to Avoid It)
During our own migration, we hit this twice:
Trap 1: The Custom Metric Creep
We had a microservice that was emitting 200 custom metrics. Someone added 50 more for "debugging," several with high-cardinality tags. Datadog bills per unique metric series (metric name plus tag combination), so those 50 names fanned out into thousands of billable series. After 3 months, these "temporary" debug metrics cost us $750/month.
Fix: Set up a billing metric in Datadog that alerts when custom metric count exceeds a threshold. Treat custom metrics like production infrastructure — they have a cost.
Trap 2: The Container Auto-Discovery
Datadog's Kubernetes integration auto-discovers all containers and starts collecting metrics for them by default. We had 40 "hidden" sidecar containers emitting metrics we never looked at.
Fix: Explicitly define which containers to monitor in datadog.yaml:
# Only monitor containers matching these name regexes; exclude everything else
container_include: ["name:production-api", "name:cron-worker"]
container_exclude: ["name:.*"]
ROI: What We Saved
| Month | Datadog Cost | Replacement Stack | Savings |
|---|---|---|---|
| Month 1-2 | $11,200 | $4,800 (dual run) | $6,400 |
| Month 3 | $0 (cut over) | $680 | $10,520 |
| Month 6 | — | $740 | ~$10,460/mo |
| 6-Month Total | $67,200 | $11,200 | $56,000 |
Migration cost: ~3 weeks of one engineer's time (~$15K opportunity cost). Break-even: week 3.
Conclusion
Datadog is excellent software at the wrong price for teams that crossed the growth threshold. The migration is tractable — 4-6 weeks, one engineer — and the ROI is real. StackPulse's own migration paid for itself in 3 weeks.
The replacement stack (Grafana Alloy + Prometheus + Tempo + Loki + Grafana Cloud) is not a downgrade if you choose it deliberately. The main cost is the ramp-up time for your team to learn Grafana's query languages. Budget 2-4 weeks of adjustment.
The best time to start the migration was when you first noticed the bill climbing. The second best time is now.
Grafana Cloud offers a free tier (50GB of logs ingested/month), a $65/mo Pro plan, and enterprise options. If you qualify for migration credits, use them to cover the dual-run costs during the transition. Start with the free tier; no commitment required.