Introduction

If you're running AI training or inference workloads on AWS, you're probably burning money on On-Demand pricing. GPU instances — P5, P4d, P3 — cost tens of dollars per hour per instance On-Demand, which means thousands per hour for a training cluster. The gap between On-Demand and committed pricing is not marginal; for sustained GPU workloads, it can be the difference between profitable and unprofitable.

This guide cuts through the confusion: Savings Plans vs Reserved Instances — what they are, when to use each, and how to structure a coverage strategy specifically for AI infrastructure.

The Three Pricing Models

On-Demand — Pay per second, no commitment. Highest cost, highest flexibility.

Reserved Instances (RI) — Make a 1-year or 3-year commitment to a specific instance type in a region (or in a specific Availability Zone, for zonal RIs). Up to 72% savings vs On-Demand. Lower flexibility — you're locked to the instance type, and to the AZ if the RI is zonal.

Savings Plans (SP) — Make a commitment to spend a certain dollar amount per hour on compute (not on specific instance types). More flexibility than RIs — Compute Savings Plans apply across instance families, sizes, AZs, and regions. Up to 66% savings for Compute Savings Plans; the family-scoped EC2 Instance Savings Plans reach up to 72%.

The Critical Distinction: Compute Savings Plans vs EC2 Reserved Instances

Most people compare SP vs RI as if they're equivalent. They're not.

EC2 Reserved Instances:

  • Tied to a specific instance type (e.g., p5.48xlarge)
  • Scoped to a region, or to a specific Availability Zone for zonal RIs
  • Can only cover that exact instance type (Standard RIs; Convertible RIs allow exchanges)
  • If you stop using that instance, your reservation still burns

Compute Savings Plans:

  • Apply to ANY EC2 instance family, size, and AZ, in ANY region (and to Fargate and Lambda usage as well)
  • More flexible — the same plan that covers p5.48xlarge training today can cover p4d.24xlarge or g5 inference instances tomorrow
  • You can change instance types, sizes, and AZs as your workload evolves
  • Slightly lower theoretical maximum savings (up to 66%, vs 72% for the most restrictive commitments)

Recommendation: Default to Compute Savings Plans over EC2 Reserved Instances for AI workloads. The flexibility far outweighs the few extra points of maximum discount that RIs or EC2 Instance Savings Plans offer.

Instance Family Nuance for AI

AWS prices GPU capacity by instance family, and EC2 Instance Savings Plans are scoped per family. Here's what matters (figures are illustrative; actual prices vary by region and change over time):

Instance family   Common AI use case    On-Demand $/hr   1-yr SP $/hr   Savings
p5                H100/H200 training    ~$45             ~$28           ~38%
p4d               A100 training         ~$25             ~$15           ~40%
g5                Inference (A10G)      ~$8              ~$5            ~37%
g6                Inference (L4)        ~$4              ~$2.50         ~37%

Scoping matters here: an EC2 Instance Savings Plan (or RI) bought for the p5 family covers all P5 sizes but nothing else — it will NOT cover g5 inference instances. A Compute Savings Plan covers all of these families, which is exactly why it suits AI teams whose fleets shift between training and inference hardware.
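As a toy illustration of these scoping rules, here is a hypothetical helper (my own names, not an AWS API), assuming Compute SPs cross families and EC2 Instance SPs do not:

```python
def instance_family(instance_type: str) -> str:
    """Extract the family prefix: 'p5.48xlarge' -> 'p5'."""
    return instance_type.split(".")[0]

def plan_covers(plan_type: str, plan_family: str, instance_type: str) -> bool:
    """Can this plan discount this instance? (Simplified scoping model.)"""
    if plan_type == "compute":
        return True                             # Compute SPs are not family-scoped
    if plan_type == "ec2_instance":
        return instance_family(instance_type) == plan_family
    raise ValueError(f"unknown plan type: {plan_type!r}")

print(plan_covers("ec2_instance", "p5", "p5.48xlarge"))   # same family
print(plan_covers("ec2_instance", "p5", "g5.xlarge"))     # different family
print(plan_covers("compute", "p5", "g5.xlarge"))          # Compute SP covers it
```

In practice the billing engine does this matching for you; the sketch just makes the asymmetry between the two plan types concrete.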

The GPU Workload Pattern Problem

AI infrastructure has a unique cost challenge: workloads vary dramatically between training (bursty, high GPU utilization for days/weeks) and inference (sustained, lower utilization).

Training workloads — RIs/SPs are risky because training runs are often:

  • Experiment-driven (you don't know how long a training run will take)
  • Multi-cloud (switching between AWS, GCP, and Azure as capacity fluctuates)
  • Short-lived experiments that get killed

Inference workloads — RIs/SPs are a no-brainer because:

  • Production inference is sustained 24/7/365
  • Model serving is typically stable — same instance types for months
  • Predictable traffic patterns

Recommendation: Commit reserved capacity ONLY for inference, not training. Use On-Demand + Spot for training unless you have extreme certainty about the training duration and instance type.

The Automatic-Application Strategy

There is no feature named "Auto-Refit" to configure: Savings Plans apply automatically. Each hour, AWS matches your commitment against eligible running usage, starting with the usage that earns the highest discount percentage. Here's the workflow:

  1. Buy Compute Savings Plans for your expected baseline inference capacity
  2. Set a coverage target — aim for 70-80% coverage of your steady-state inference spend
  3. Let automatic application do the rest — AWS applies your SP coverage each hour to any eligible usage, up to your commitment
  4. Fill the remaining gap with On-Demand for traffic spikes

The resulting split:

  • Baseline inference (70% of traffic) → covered by SP
  • Traffic spikes (30%) → On-Demand
  • Experimental deployments → Spot Instances
  • Training runs → On-Demand or Spot
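The hour-by-hour mechanics can be sketched as a small model (my own simplification, not an AWS API): the plan charges its commitment every hour regardless of usage, absorbs On-Demand-equivalent usage worth commitment ÷ (1 − discount), and anything beyond that bills at On-Demand rates.

```python
def hourly_cost(od_equivalent: float, commitment: float, discount: float) -> float:
    """Hourly bill under a Savings Plan (simplified model).

    od_equivalent -- what this hour's usage would cost at On-Demand rates
    commitment    -- committed $/hr, charged even if usage falls below it
    discount      -- fractional SP discount vs On-Demand (e.g. 0.38)
    """
    absorbed = commitment / (1.0 - discount)   # On-Demand value the plan can cover
    overflow = max(0.0, od_equivalent - absorbed)
    return commitment + overflow               # overflow bills at On-Demand rates

# One GPU instance at an illustrative ~$45/hr, commitment sized to cover it:
print(hourly_cost(45.0, 27.9, 0.38))   # fully covered hour
print(hourly_cost(0.0, 27.9, 0.38))    # idle hour: you still pay the commitment
print(hourly_cost(90.0, 27.9, 0.38))   # spike hour: second instance is On-Demand
```

The idle-hour case is the one that bites: the commitment is a floor on your bill, which is why the guide recommends sizing it to baseline inference only.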

Azure and GCP Equivalents

Azure:

  • Azure Reserved Instances — similar to AWS RIs, 1 or 3 year commitments
  • Azure Savings Plans for Compute — equivalent to AWS Compute Savings Plans, flexibility across instance sizes
  • Azure Hybrid Benefit — Windows/SQL licenses can be reused; also applies to some GPU VMs

Google Cloud:

  • Committed Use Discounts (CUDs) — resource-based commitments to vCPUs, memory, GPUs, and local SSD in a region, roughly analogous to RIs
  • Flexible CUDs — newer, spend-based commitments that behave more like AWS Compute Savings Plans
  • Spot VMs — the GCP equivalent of Spot Instances, up to 91% off On-Demand

Coverage Analysis: How Much Can You Actually Save?

Using AWS Cost Explorer, you can model Savings Plans coverage:

Current monthly spend on p5.48xlarge On-Demand (illustrative): $32,000
Baseline (predictable inference): 70% = $22,400
Committed via 1-year Compute SP at 38% savings: $22,400 × 0.62 = $13,888/month
Remaining On-Demand (spikes): $32,000 − $22,400 = $9,600, which at ~$45/hr buys ~213 hours of spike capacity

Monthly savings: $22,400 − $13,888 = $8,512
Annual savings: ~$102,000
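The same model fits in a few lines of Python, so you can swap in your own numbers (all rates here are illustrative):

```python
monthly_od = 32_000.0     # current On-Demand spend on GPU capacity
baseline_share = 0.70     # predictable inference fraction
sp_discount = 0.38        # illustrative 1-year Compute SP discount
od_rate = 45.0            # illustrative On-Demand $/hr

baseline = monthly_od * baseline_share           # spend worth committing to
committed_cost = baseline * (1 - sp_discount)    # what the SP bills for it
spike_spend = monthly_od - baseline              # left On-Demand for spikes
spike_hours = spike_spend / od_rate              # spike capacity that buys
monthly_savings = baseline - committed_cost

print(f"committed: ${committed_cost:,.0f}/mo, spikes: ~{spike_hours:.0f} hrs, "
      f"saving ${monthly_savings:,.0f}/mo (${monthly_savings * 12:,.0f}/yr)")
```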

That's realistic for a mid-size inference deployment, and the savings scale linearly — larger deployments save proportionally more on covered spend.

The Commitment Trap

The biggest mistake teams make: over-committing SPs/RIs for workloads that shrink.

  • A 1-year commitment doesn't care if you deprecate a model — you pay it either way
  • You CAN sell unused Standard RIs on the EC2 RI Marketplace (at 10-30% of original value, depending on remaining term) — but Savings Plans cannot be resold at all
  • For rapidly changing AI infra, 1-year commitments are safer than 3-year

For AI specifically: The pace of model improvement means you're likely to migrate to newer GPU generations within 18-24 months. Don't lock into 3-year RIs for production inference unless you have extreme confidence in your instance family's longevity.
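One way to reason about this risk: a commitment at discount d beats On-Demand only while you keep using more than (1 − d) of the committed capacity. A small sketch (function names are mine, not an AWS API):

```python
def breakeven_utilization(discount: float) -> float:
    """Minimum fraction of committed capacity you must keep using
    for the commitment to beat pure On-Demand pricing."""
    return 1.0 - discount

def commitment_beats_on_demand(discount: float, expected_utilization: float) -> bool:
    """True when paying the commitment is cheaper than paying On-Demand
    for the usage you actually expect to keep running."""
    return expected_utilization > breakeven_utilization(discount)

# At a 38% discount you break even at 62% sustained utilization:
print(breakeven_utilization(0.38))
```

At a 38% discount, a plan still pays off even if a third of the committed capacity eventually goes idle; below ~62% utilization, On-Demand would have been cheaper — which is why conservative sizing beats aggressive coverage.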

Tools for Managing Reserved Capacity

  • AWS Cost Explorer — coverage analysis, savings projections
  • CloudHealth (VMware) — multi-cloud RI/SP management
  • Spot.io (NetApp) — automated recommendations, Spot + SP optimization
  • AWS Budgets — alerts when usage drops below SP coverage
  • Kubecost — Kubernetes cost attribution + SP recommendations

Summary: When to Use What

Workload type                     Pricing model                         Commitment                Expected savings
Production inference (stable)     Compute Savings Plans                 1-year                    37-40%
Production inference (growing)    Compute Savings Plans                 1-year, scale gradually   30-37%
Variable inference load           Savings Plans (partial) + On-Demand   ~50% covered              20-30%
Training runs                     On-Demand or Spot                     None                      0%
Short experiments                 Spot Instances                        None                      60-91% off
Batch inference                   Spot + On-Demand mix                  None                      40-60%
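For tooling or runbooks, the table above can be encoded as a simple lookup (keys and labels are my own shorthand):

```python
# Shorthand encoding of the workload -> pricing-model table above.
RECOMMENDATIONS = {
    "stable_inference":   ("Compute Savings Plans", "1-year"),
    "growing_inference":  ("Compute Savings Plans", "1-year, scale gradually"),
    "variable_inference": ("Savings Plans (partial) + On-Demand", "~50% covered"),
    "training":           ("On-Demand or Spot", "none"),
    "experiment":         ("Spot Instances", "none"),
    "batch_inference":    ("Spot + On-Demand mix", "none"),
}

def recommend(workload: str) -> tuple:
    """Return (pricing model, commitment) for a workload type."""
    return RECOMMENDATIONS[workload]

print(recommend("training"))
```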

Conclusion

For AI infrastructure teams, the Savings Plans vs Reserved Instances decision is simpler than it appears: always use Compute Savings Plans over EC2 RIs, commit only for stable inference workloads, and leave training and experimentation on On-Demand/Spot.

The 37-40% savings on your largest inference bill is real money — at scale, a $100K/month inference bill becomes roughly $62K/month with full SP coverage at a ~38% discount. That's not marginal. Start with coverage analysis, model your baseline, and commit conservatively (you can always add more SPs as confidence grows).

Recommended Tool: Kubecost

Kubecost provides real-time GPU cost attribution, Savings Plans recommendations, and namespace-level spend visibility for Kubernetes-based AI infrastructure. Free tier available.