Premium Deep Dive | ~3,200 words | $10/mo or $96/yr | Published April 10, 2026
Who This Is For
You went all-in on Datadog. Spent months on the setup. Built hundreds of dashboards. Set up 40+ monitors. Integrated every service. And now you're staring at a bill that went from $2K to $15K/month because someone enabled one too many custom metrics, and you didn't catch the billing alerts.
You're not leaving Datadog because it doesn't work. You're leaving because the CFO wants answers and the invoice keeps climbing.
This playbook is for you. It's the step-by-step process we used on StackPulse's own infrastructure, refined and documented across three client migrations we've advised. It covers:
- The exact migration sequence (don't skip order — some steps unlock others)
- Config snippets for Prometheus, Grafana, and Grafana Cloud that map 1:1 to your existing Datadog setup
- The billing trap we fell into and how to avoid it
- What Datadog does better (so you know what you'll miss)
Why This Guide Exists
Datadog is genuinely good software. The APM is polished. The dashboard UX is fast. The out-of-box integrations work. But at scale, the pricing model becomes punishing. You pay for:
- Custom metrics at $0.05 per metric per month (this is the killer)
- Hosts at $15-40/host/month depending on tier
- Container monitoring at $5-10/container/month
- Custom traces at $0.10/1M traces
- Log ingestion at $0.10/GB
A mid-size deployment running 200 containers, 40 hosts, with moderate logging can easily hit $8-15K/month. And because the pricing is metered, you don't find out until the bill arrives.
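To see how those line items compound, here's a back-of-envelope calculation using rough midpoint list prices and an illustrative metric count (every number below is an assumption for illustration, not a quote):

```shell
# Illustrative monthly bill: 40 hosts, 200 containers, 300GB logs,
# 100K billable custom metric series, at rough midpoint list prices.
hosts=$((40 * 27))                    # 40 hosts x ~$27/host
containers=$((200 * 7))               # 200 containers x ~$7/container
logs=$((300 / 10))                    # 300GB x $0.10/GB
custom_metrics=$((100000 * 5 / 100))  # 100K series x $0.05/series
total=$((hosts + containers + logs + custom_metrics))
echo "~\$${total}/month"
```

Note that custom metrics alone account for roughly two-thirds of this hypothetical bill, which is why Phase 1 focuses on them first.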
We built our free Datadog alternatives guide to help people evaluate their options. This playbook goes further: it's the exact migration plan for teams that have already decided to leave.
Phase 1: Audit Before You Cut (Week 1)
Do not skip this phase. Most migration failures happen because teams jump to the tooling before understanding what they actually have running in Datadog.
Step 1.1: Export Everything
Datadog's export capabilities are limited: there's no bulk, machine-readable dashboard export. Here's the workaround, using the Datadog REST API directly:
# Export all monitors to JSON
curl -s "https://api.datadoghq.com/api/v1/monitor" \
  -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
  > monitors_export.json
# Build a dashboard manifest (the list endpoint returns summaries only)
curl -s "https://api.datadoghq.com/api/v1/dashboard" \
  -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
  | jq '.dashboards[] | {id, title, url}' > dashboards_manifest.json
There's no single-call bulk export for dashboard definitions: fetch each dashboard's JSON individually via the API, or open each dashboard in the UI → Settings → Export JSON. Painful, but necessary.
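To turn that manifest into full dashboard definitions, you can loop over it and hit the per-dashboard endpoint (GET /api/v1/dashboard/{dashboard_id} in the Datadog API); the env var names match the snippet above:

```shell
# Fetch the full JSON definition for each dashboard id in the manifest.
mkdir -p dashboards
for id in $(jq -r '.id' dashboards_manifest.json); do
  curl -s "https://api.datadoghq.com/api/v1/dashboard/${id}" \
    -H "DD-API-KEY: ${DD_CLIENT_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_CLIENT_APP_KEY}" \
    -o "dashboards/${id}.json"
done
```

The resulting JSON files are also useful later as the source of truth when porting widgets to Grafana panels.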
Step 1.2: Categorize Your Bill
Go to Datadog → Plan & Usage and download the usage CSV. Sort by:
- Custom metrics — these are your biggest cost driver at scale
- Hosts — each VM/server you run the agent on
- Containers — Kubernetes pods, ECS tasks
- Logs — ingested GB/day
- Traces — APM spans
For each category, record: current monthly cost, % of total, and whether you actively use the data.
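A quick way to build that record is a small awk pass over the usage CSV. The column layout below (product, cost) is hypothetical; the real export's column names differ, so adjust the field references to match your file:

```shell
# Hypothetical usage.csv layout; replace with your real Datadog export.
cat > usage.csv <<'EOF'
product,cost
custom_metrics,6200
hosts,1400
logs,900
containers,700
EOF
# Rank cost drivers and compute each category's share of the total.
awk -F, 'NR > 1 { cost[$1] = $2; total += $2 }
  END { for (p in cost) printf "%-15s %6d %5.1f%%\n", p, cost[p], 100*cost[p]/total }' usage.csv \
  | sort -k2 -rn > usage_report.txt
cat usage_report.txt
```

The percentage column is what matters: it tells you which migration phase will actually move the bill.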
Step 1.3: Find the Metric Leak
Custom metrics are almost always the surprise. Here's how to find them:
-- Run in Datadog Metrics Explorer
-- Group by metric name, sum over 30 days, sort descending
sum:datadog.estimated_usage.metrics.custom{*} by {metric_name}.as_count()
Look for:
- nginx or apache metrics from auto-discovery you forgot to disable
- JVM garbage collection metrics from applications you decommissioned
- Kubernetes metrics being emitted from every namespace
- Custom business metrics that could be sampled or aggregated
Disable everything you don't use. This alone can cut your bill 40-60% without any migration.
Phase 2: Build the Target Stack (Week 2-3)
The Replacement Stack
| Datadog Feature | Replacement | Monthly Cost |
|---|---|---|
| APM (traces) | Grafana Alloy + Tempo | $0-200 |
| Infrastructure monitoring | Prometheus + node_exporter | $0 |
| Kubernetes monitoring | kube-state-metrics + cAdvisor | $0 |
| Custom metrics | Prometheus + Grafana Cloud (10K free series) | $50-200 |
| Dashboards | Grafana (self-hosted or Cloud) | $0-65 |
| Logs | Grafana Loki | $0-300 |
| Uptime monitoring | Grafana Cloud (uptime checks) | $0-50 |
| Monitors/alerts | Grafana Alerting | $0 |
Total target: $50-600/month vs $5-15K/month.
Installing Grafana Alloy (Successor to the Grafana Agent)
Grafana Alloy is the replacement for the Datadog Agent in this stack. It's a distribution of the OpenTelemetry Collector that runs as a single binary:
# Install on Ubuntu/Debian (after adding Grafana's apt repository)
sudo apt-get install alloy
# Or via Docker
docker run -d \
--name alloy \
--volume $(pwd)/config.alloy:/etc/alloy/config.alloy:ro \
--volume /var/run/docker.sock:/var/run/docker.sock:ro \
grafana/alloy:latest \
run /etc/alloy/config.alloy
Alloy Config: APM Tracing (replaces Datadog APM)
// config.alloy — APM/OTLP tracing
otelcol.receiver.otlp "default" {
grpc { endpoint = "0.0.0.0:4317" }
http { endpoint = "0.0.0.0:4318" }
output { traces = [otelcol.processor.batch.default.input] }
}
otelcol.processor.batch "default" {
output { traces = [otelcol.exporter.otlp.grafana_tempo.input] }
}
otelcol.exporter.otlp "grafana_tempo" {
endpoint = "https://your-tempo-instance:443"
tls { insecure = false }
}
This replaces the Datadog Agent's apm_config block. Your services send OTLP to port 4318 (HTTP) or 4317 (gRPC); Alloy batches the spans and forwards them to Grafana Tempo.
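On the application side, pointing a service at Alloy usually requires no code changes if it already uses an OpenTelemetry SDK; the standard OTel environment variables are enough (the service name here is illustrative):

```shell
# Standard OpenTelemetry SDK environment variables (per the OTel spec);
# point an instrumented service at the local Alloy instance.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="api"
```

This is also what makes the later dual-write phase cheap: switching backends is a matter of changing one endpoint variable.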
Alloy Config: Host Metrics (replaces Datadog host monitoring)
// config.alloy — node_exporter for host metrics
prometheus.scrape "node" {
targets = [{"__address__" = "localhost:9100"}]
scrape_interval = "15s"
forward_to = [otelcol.receiver.prometheus.default.input]
}
otelcol.receiver.prometheus "default" {
output { metrics = [otelcol.exporter.prometheus.grafana_cloud.input] }
}
otelcol.exporter.prometheus "grafana_cloud" {
endpoint = "https://prometheus-us-central1.grafana.net/api/prom/push"
headers = {
"Authorization" = "Bearer ${GRAFANA_CLOUD_API_KEY}",
}
} Grafana Cloud vs Self-Hosted: The Decision
For teams under 50 hosts, Grafana Cloud's free tier currently gives you (limits change, so check the pricing page):
- 3 active users / dashboard editors
- 10K active metric series
- 50GB of log ingestion per month
The $65/mo "Grafana Cloud Pro" plan covers the vast majority of small-team needs. If you eventually outgrow its pricing, migrate to self-hosted.
If you need enterprise features (SAML, audit logs, custom retention): self-hosted Grafana + Prometheus + Tempo + Loki on a reserved instance.
Phase 3: The Migration Sequence (Week 3-6)
Critical: Run Datadog AND the new stack in parallel during migration. Don't cut Datadog until you're confident.
Week 3: Dual-Write Mode
- Install Grafana Alloy on all hosts (keep Datadog Agent running)
- Configure Alloy to forward to your new Grafana Cloud/Tempo
- Set up the same dashboards in Grafana — compare side-by-side
- Run your services' OTLP exporters to both Datadog AND Alloy temporarily
The goal this week: identical data in both systems.
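One way to get identical trace data in both systems without touching every service is to fan out at the collector. A sketch in Alloy config, assuming your Alloy build includes the otelcol.exporter.datadog component (component names here are illustrative, and the Tempo exporter matches the earlier config):

```river
// Dual-write: one batch processor forwards traces to both backends.
otelcol.processor.batch "dual" {
  output {
    traces = [
      otelcol.exporter.datadog.default.input,
      otelcol.exporter.otlp.grafana_tempo.input,
    ]
  }
}

otelcol.exporter.datadog "default" {
  api {
    api_key = env("DD_API_KEY")
  }
}
```

Once validation passes, you delete the Datadog exporter from this list and the cutover is a one-line config change.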
Week 4: Metric Parity Validation
This is where most teams get stuck. The gap isn't tooling — it's which metrics you're actually using.
Create a mapping document:
Datadog Metric | Grafana/Prometheus Equivalent
----------------------------------|------------------------------
system.cpu.user | node_cpu_seconds_total{mode="user"}
system.mem.usable | node_memory_MemAvailable_bytes
system.load.1 | node_load1
kubernetes.cpu.usage.total | container_cpu_usage_seconds_total
kafka.messages_in.rate | kafka_server_brokertopicmetrics_messagesinpersec_rate
Roughly 80% of your Datadog metrics map directly to Prometheus metrics from node_exporter, cAdvisor, or the Kafka/JVM exporters. The remaining 20% are custom business metrics you'll need to re-emit via Prometheus client libraries.
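As a worked example of the mapping: a CPU-utilization-by-host widget translates to a single PromQL expression over node_exporter's counters (a sketch; adjust the rate window to your scrape interval):

```promql
# CPU utilization (%) per host: 1 minus the idle fraction,
# averaged across all CPUs on each instance.
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

The pattern generalizes: Datadog gauges usually map to a Prometheus gauge directly, while Datadog rates map to `rate()` over a Prometheus counter.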
Week 5: Dashboards in Parallel
Export your Datadog dashboards (manually, as noted above). For each dashboard:
- Identify the widgets (time series, heatmaps, tables, logs)
- Find the Grafana equivalent panel type
- Port the PromQL/LogQL queries
For APM traces in Grafana: use Grafana Tempo. Datadog's service and resource filters map onto OpenTelemetry attributes (resource.service.name, the span name, span.http.route), and most Datadog APM queries have a close TraceQL equivalent.
// TraceQL equivalent of Datadog's APM query:
// service:api AND resource:/checkout AND http.status_code:500
{ resource.service.name = "api" && span.http.route = "/checkout" && span.http.status_code = 500 }
Week 6: Kill Datadog (After Validation)
Before cutting:
- [ ] All critical monitors replicated in Grafana Alerting
- [ ] All dashboards validated against live data
- [ ] Runbooks updated with new system names
- [ ] On-call team trained on Grafana UI
- [ ] Rollback plan documented (keep Datadog agent on 1 host for 48hrs)
Remove Datadog Agent from all hosts:
# Stop and disable the agent
sudo systemctl stop datadog-agent
sudo systemctl disable datadog-agent
# Remove the package
sudo apt remove datadog-agent # Debian/Ubuntu
sudo yum remove datadog-agent # RHEL/CentOS
# Verify nothing is still listening on the agent's ports
# (8125 dogstatsd UDP, 8126 trace intake TCP)
sudo ss -tulnp | grep -E '8125|8126' # Should return nothing
What Datadog Does Better (Be Honest)
Before you celebrate the cost savings: acknowledge what you'll lose.
APM Experience
Datadog's APM UI is genuinely better than Grafana Tempo + Explore. The flame graphs are interactive, the span-level performance analysis is faster, and the service map is more intuitive. If your team lives in APM daily, the migration cost is real.
Mitigation: Grafana's new TraceQL explorer and Tempo 2.x have closed most of this gap. But budget 2-4 weeks of adjustment time.
Out-of-Box Kubernetes Monitoring
Datadog's auto-discovery for Kubernetes is better than anything in the Prometheus ecosystem. You get meaningful dashboards the moment you install the agent, without any configuration. Prometheus requires manual dashboard building.
Mitigation: Use the kube-prometheus-stack Helm chart which includes ~50 pre-built dashboards. Not as turnkey, but close.
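Installing that chart is a one-time step; a sketch assuming Helm 3 and cluster-admin access (the release and namespace names are illustrative):

```shell
# Add the prometheus-community repo and install kube-prometheus-stack,
# which bundles Prometheus, Alertmanager, Grafana, kube-state-metrics,
# node_exporter, and a set of pre-built dashboards.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```

From there, the out-of-box dashboards cover nodes, pods, and workloads; you only hand-build the app-specific ones.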
Network Performance Monitoring
Datadog's eBPF-based NPM (Network Performance Monitoring) is head-and-shoulders above anything in the open-source world. It gives you TCP-level visibility, DNS monitoring, and flow maps that no other tool matches.
Mitigation: If you use Datadog NPM heavily, this may be worth keeping as a standalone tool and negotiating a reduced package.
Log Management
Datadog's log parsing (Parser processor, Grok) is much more powerful than LogQL's pattern matching. If your logs are unstructured and you're doing heavy log analysis, Grafana Loki's query language will feel limited.
Mitigation: Use Grafana Alloy's loki.process stage for parsing, or keep Datadog Logs at a reduced tier just for log analysis while migrating metrics/traces/APM.
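A minimal sketch of that loki.process stage, assuming JSON-structured application logs (component names and the Loki URL are illustrative):

```river
// Parse JSON app logs and promote "level" to a Loki label before shipping.
loki.process "parse_app" {
  forward_to = [loki.write.default.receiver]
  stage.json {
    expressions = { level = "level", msg = "message" }
  }
  stage.labels {
    values = { level = "" }
  }
}

loki.write "default" {
  endpoint {
    url = "https://your-loki-instance/loki/api/v1/push"
  }
}
```

Keep the label set small (level, service, environment): Loki bills in cardinality pain rather than dollars.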
The Billing Trap We Fell Into (And How to Avoid It)
During our own migration, we hit this twice:
Trap 1: The Custom Metric Creep
We had a microservice that was emitting 200 custom metrics. Someone added 50 more for "debugging," several with high-cardinality tags. Datadog bills per unique metric series (metric name plus tag combination), so those 50 names fanned out into thousands of billable series. After 3 months, these "temporary" debug metrics cost us $750/month.
Fix: Set up a billing metric in Datadog that alerts when custom metric count exceeds a threshold. Treat custom metrics like production infrastructure — they have a cost.
Trap 2: The Container Auto-Discovery
Datadog's Kubernetes integration auto-discovers all containers and starts collecting metrics for them by default. We had 40 "hidden" sidecar containers emitting metrics we never looked at.
Fix: Explicitly define which containers to monitor in datadog.yaml:
# Only monitor containers matching these name regexes; exclude everything else
container_include: ["name:production-api", "name:cron-worker"]
container_exclude: ["name:.*"]
ROI: What We Saved
| Month | Datadog Cost | Replacement Stack | Savings |
|---|---|---|---|
| Month 1-2 | $11,200 | $4,800 (dual run) | $6,400 |
| Month 3 | $0 (cut over) | $680 | $10,520 |
| Month 6 | — | $740 | ~$10,460/mo |
| 6-Month Total | $67,200 | $11,200 | $56,000 |
Migration cost: ~3 weeks of one engineer's time (~$15K opportunity cost). Break-even: week 3.
Conclusion
Datadog is excellent software at the wrong price for teams that crossed the growth threshold. The migration is tractable — 4-6 weeks, one engineer — and the ROI is real. StackPulse's own migration paid for itself in 3 weeks.
The replacement stack (Grafana Alloy + Prometheus + Tempo + Loki + Grafana Cloud) is not a downgrade if you choose it deliberately. The main cost is the ramp-up time for your team to learn Grafana's query languages. Budget 2-4 weeks of adjustment.
The best time to start the migration was when you first noticed the bill climbing. The second best time is now.
Grafana Cloud offers a free tier (50GB of logs ingested/month), a $65/mo Pro plan, and enterprise options. If you qualify for migration credits, use them to cover the dual-run costs during the transition. Start with the free tier; no commitment required.