If you're running AI training or inference workloads on AWS, you're probably burning money on On-Demand pricing. GPU instances — P5, P4d, P3 — run tens of dollars per hour On-Demand, which compounds into tens of thousands per month per instance. The gap between On-Demand and committed pricing is not marginal; for sustained GPU workloads, it can mean the difference between profitable and unprofitable.
This guide cuts through the confusion: Savings Plans vs Reserved Instances — what they are, when to use each, and how to structure a coverage strategy specifically for AI infrastructure.
The Three Pricing Models
On-Demand — Pay per second, no commitment. Highest cost, highest flexibility.
Reserved Instances (RI) — Make a 1-year or 3-year commitment to a specific instance type, scoped to a Region or, for zonal RIs (which also reserve capacity), to a specific Availability Zone. Up to 72% savings vs On-Demand for Standard RIs. Lower flexibility — you're locked to the instance type, and with zonal RIs to an AZ as well.
Savings Plans (SP) — Commit to spending a certain dollar amount per hour on compute, not on specific instance types. Compute Savings Plans apply across instance families, sizes, AZs, and Regions (and cover Fargate and Lambda too), at up to 66% savings. EC2 Instance Savings Plans reach 72%, but are scoped to one instance family in one Region.
The Critical Distinction: Compute Savings Plans vs EC2 Reserved Instances
Most people compare SP vs RI as if they're equivalent. They're not.
EC2 Reserved Instances:
- Tied to a specific instance type (e.g., `p5.48xlarge`)
- Scoped to a Region, or to a single Availability Zone for zonal RIs
- A Standard RI covers only that exact instance type (Convertible RIs allow exchanges, at a lower discount)
- If you stop using that instance, your reservation still burns
Compute Savings Plans:
- Apply to ANY EC2 instance family and size, in ANY AZ or Region — plus Fargate and Lambda
- More flexible — one plan's hourly dollar commitment covers `p5.48xlarge` today and `p4d.24xlarge` tomorrow if your workload shifts
- You can change instance sizes, families, and AZs as your workload evolves
- Up to 66% maximum savings — a few points below the 72% ceiling of Standard RIs and EC2 Instance Savings Plans, in exchange for the flexibility
Recommendation: Always use Compute Savings Plans over EC2 Reserved Instances for AI workloads. The flexibility far outweighs the few extra percentage points of discount a type-locked commitment can earn.
Instance Family Nuance for AI
Instance family still matters for the family-scoped options (EC2 Instance Savings Plans) — and note that `ml.`-prefixed types are SageMaker instances with their own Savings Plan. Approximate rates for common AI families (they vary by Region and change often — check current pricing before committing):
| Instance Family | Common AI Use Case | On-Demand $/hr | 1yr SP $/hr | Savings |
|---|---|---|---|---|
| `ml.p5` | H100/H200 training | ~$45 | ~$28 | ~38% |
| `ml.p4d` | A100 training | ~$25 | ~$15 | ~40% |
| `ml.g5` | Inference (moderate) | ~$8 | ~$5 | ~37% |
| `ml.g6` | Inference (L4) | ~$4 | ~$2.50 | ~37% |
An EC2 Instance Savings Plan is scoped to one family in one Region: a p5 plan covers every p5 size, but not g5 or p4d. Compute Savings Plans have no family restriction. And the `ml.` prefix means SageMaker — those instances are covered by SageMaker Savings Plans, not by EC2 RIs or Compute SPs.
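As a sanity check on those scoping rules — family-scoped plans are keyed to one family, Compute SPs match any EC2 usage, SageMaker `ml.` usage needs its own plan — here's a toy sketch. The `covers` helper is hypothetical, not an AWS API:

```python
# Toy coverage check -- hypothetical helper, not an AWS API.
def family(instance_type: str) -> str:
    """'p5.48xlarge' -> 'p5'; 'ml.g5.xlarge' -> 'ml.g5'."""
    return instance_type.rsplit(".", 1)[0]

def covers(plan: dict, instance_type: str) -> bool:
    if plan["kind"] == "compute_sp":        # any EC2 family, size, AZ, Region
        return not instance_type.startswith("ml.")
    if plan["kind"] == "ec2_instance_sp":   # one family in one Region
        return family(instance_type) == plan["family"]
    if plan["kind"] == "sagemaker_sp":      # any SageMaker (ml.*) usage
        return instance_type.startswith("ml.")
    return False

compute_sp = {"kind": "compute_sp"}
p5_plan = {"kind": "ec2_instance_sp", "family": "p5"}

assert covers(compute_sp, "p5.48xlarge")
assert covers(compute_sp, "g5.xlarge")            # no family restriction
assert not covers(p5_plan, "g5.xlarge")           # family plans don't cross-cover
assert not covers(compute_sp, "ml.p5.48xlarge")   # SageMaker needs its own plan
```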
The GPU Workload Pattern Problem
AI infrastructure has a unique cost challenge: workloads vary dramatically between training (bursty, high GPU utilization for days/weeks) and inference (sustained, lower utilization).
Training workloads — RIs/SPs are risky because training runs are often:
- Experiment-driven (you don't know how long a training run will take)
- Multi-cloud (switching between AWS, GCP, and Azure as capacity fluctuates)
- Short-lived experiments that get killed
Inference workloads — RIs/SPs are a no-brainer because:
- Production inference is sustained 24/7/365
- Model serving is typically stable — same instance types for months
- Predictable traffic patterns
Recommendation: Commit reserved capacity ONLY for inference, not training. Use On-Demand + Spot for training unless you have extreme certainty about the training duration and instance type.
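The commit-only-for-inference rule falls out of simple arithmetic: a commitment bills every hour whether you use it or not, so it only beats pay-as-you-go On-Demand once expected utilization exceeds the ratio of the committed rate to the On-Demand rate. A minimal sketch, using the approximate table rates above:

```python
def breakeven_utilization(od_rate: float, sp_rate: float) -> float:
    """A commitment bills sp_rate every hour; On-Demand bills od_rate
    only for hours actually used. Committing wins once the fraction
    of hours used exceeds sp_rate / od_rate."""
    return sp_rate / od_rate

# Illustrative P5 rates from the table above (approximate):
u = breakeven_utilization(od_rate=45.0, sp_rate=28.0)
print(f"commit only above {u:.0%} utilization")   # ~62%
```

Production inference running 24/7 clears that bar easily; a training cluster that sits idle between experiments usually doesn't.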
The Set-and-Forget Coverage Strategy
There's no assignment step to manage: AWS automatically applies active Savings Plans to eligible usage each hour, starting with the highest-discount usage. Here's the workflow:
- Buy Compute Savings Plans for your expected baseline inference capacity
- Set a coverage target — aim for 70-80% coverage of your steady-state inference spend
- Let AWS handle the rest — SP discounts flow automatically to any matching usage up to your hourly commitment
- Fill the remaining gap with On-Demand for traffic spikes
Coverage breakdown:
Baseline inference (70% of traffic) → Covered by SP
Traffic spikes (30%) → On-Demand
Experimental deployments → Spot instances
Training runs → On-Demand or Spot
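One way to pick the commitment behind that 70-80% target is to replay a usage history against candidate commitment levels — unused commitment still bills, and spend above it falls through to On-Demand. A rough sketch with made-up hourly numbers:

```python
def avg_hourly_cost(hourly_spend, commit, discount):
    """Average effective $/hr with an SP commitment of `commit` $/hr
    (On-Demand-equivalent). The commitment bills even when unused;
    spend above it falls through to On-Demand rates."""
    total = 0.0
    for s in hourly_spend:
        total += commit * (1 - discount)   # committed portion, used or not
        total += max(s - commit, 0.0)      # overflow at On-Demand rates
    return total / len(hourly_spend)

usage = [30, 30, 30, 30, 45, 60, 45, 30]   # $30/hr baseline, daytime spikes
for commit in (0, 20, 30, 40):
    print(commit, round(avg_hourly_cost(usage, commit, 0.38), 2))
```

Committing at the $30 baseline yields the lowest blended cost in this toy series; pushing the commitment past the baseline starts paying for hours that never materialize, which is why the target stops short of 100%.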
Azure and GCP Equivalents
Azure:
- Azure Reserved Instances — similar to AWS RIs, 1 or 3 year commitments
- Azure Savings Plans for Compute — equivalent to AWS Compute Savings Plans, flexibility across instance sizes
- Azure Hybrid Benefit — Windows/SQL licenses can be reused; also applies to some GPU VMs
Google Cloud:
- Committed Use Discounts (CUDs) — resource-based 1- or 3-year commitments to vCPUs, memory, GPUs, and local SSD in a region
- Flexible CUDs — spend-based and newer, the closest analogue to Compute Savings Plans
- Spot VMs — the GCP equivalent of Spot instances, up to 91% off On-Demand
Kubernetes cost monitoring and FinOps visibility. Kubecost gives you real-time spend attribution per namespace, deployment, and service — with built-in Savings Plans recommendations based on your actual usage patterns. Free tier for clusters under $10k/month cloud spend.
Coverage Analysis: How Much Can You Actually Save?
Using AWS Cost Explorer, you can model Savings Plans coverage:
Example: P5 inference deployment
Current monthly spend on p5.48xlarge On-Demand: $32,000
Baseline (predictable inference): 70% = $22,400
Committed via 1-year Compute SP at 38% savings: $22,400 × 0.62 = $13,888/month
Remaining On-Demand (spikes): 30% = $9,600
Monthly savings: $22,400 × 0.38 = $8,512 | Annual savings: ~$102,000
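Those figures are worth re-deriving rather than trusting — a quick sanity check under the same assumptions ($32K On-Demand spend, 70% baseline, 38% SP discount):

```python
od_spend = 32_000                # current monthly On-Demand spend
baseline = 0.70 * od_spend       # predictable inference: $22,400
sp_cost = baseline * (1 - 0.38)  # committed at 38% off: $13,888
spikes = od_spend - baseline     # stays On-Demand: $9,600
monthly_savings = od_spend - (sp_cost + spikes)
print(round(monthly_savings), round(12 * monthly_savings))   # 8512 102144
```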
That's realistic for a mid-size inference deployment, and the savings scale linearly — roughly proportional to the spend you commit.
The Commitment Trap
The biggest mistake teams make: over-committing SPs/RIs for workloads that shrink.
- A 1-year commitment doesn't care if you deprecate a model — the hourly charge runs to the end of the term
- Standard EC2 RIs can be resold on the AWS RI Marketplace (usually at a steep discount); Savings Plans cannot be sold or cancelled at all
- For rapidly-changing AI infra, 1-year commitments are safer than 3-year
For AI specifically: The pace of model improvement means you're likely to migrate to newer GPU generations within 18-24 months. Don't lock into 3-year RIs for production inference unless you have extreme confidence in your instance family's longevity.
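To see why the 18-24-month GPU refresh cycle argues against 3-year terms, compare the total cost of serving an 18-month workload under illustrative discounts. The 38% and 50% figures below are assumptions for the sketch, not quoted AWS rates:

```python
# Illustrative only: the discount levels are assumptions, not AWS quotes.
od = 10_000            # assumed On-Demand-equivalent $/month on one GPU family
months_needed = 18     # workload migrates to a newer GPU generation after this

on_demand_only = months_needed * od                  # no commitment
one_year_sp = 12 * od * (1 - 0.38) + 6 * od          # 1-yr term, then On-Demand
three_year_sp = 36 * od * (1 - 0.50)                 # deeper discount, full 36-mo bill

print(round(on_demand_only), round(one_year_sp), round(three_year_sp))
# 180000 134400 180000: the deeper 3-yr discount is wiped out by 18 unused months
```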
Tools for Managing Reserved Capacity
| Tool | Use Case | Affiliate |
|---|---|---|
| AWS Cost Explorer | Coverage analysis, savings projections | — |
| CloudHealth (VMware) | Multi-cloud RI/SP management | Has affiliate program |
| Spot.io (NetApp) | Auto-recommendations, Spot + SP optimization | Has affiliate program |
| AWS Budgets | Alert when usage drops below SP coverage | — |
| Kubecost | Kubernetes cost attribution + SP recommendations | 20% recurring |
Summary: When to Use What
| Workload Type | Pricing Model | Commitment | Expected Savings |
|---|---|---|---|
| Production inference (stable) | Compute Savings Plans | 1-year | 37-40% |
| Production inference (growing) | Compute Savings Plans | 1-year, scale gradually | 30-37% |
| Variable inference load | Savings Plans (partial) + On-Demand | 50% covered | 20-30% |
| Training runs | On-Demand or Spot | None | 0% |
| Short experiments | Spot Instances | None | up to 90% off |
| Batch inference | Spot + On-Demand mix | None | 40-60% |
Conclusion
For AI infrastructure teams, the Savings Plans vs Reserved Instances decision is simpler than it appears: always use Compute Savings Plans over EC2 RIs, commit only for stable inference workloads, and leave training and experimentation on On-Demand/Spot.
The 37-40% savings on your largest inference bill is real money — at full coverage, a $100K/month inference bill becomes roughly $62K/month. That's not marginal. Start with coverage analysis, model your baseline, and commit conservatively (you can always add more SPs as confidence grows).