Why Standard Kubernetes Autoscaling Breaks for AI Workloads
Horizontal Pod Autoscaler (HPA) scales based on CPU and memory — metrics that work for web servers and APIs. AI inference workloads don't behave like web servers. A vLLM pod serving 1,000 tokens/second might show 40% CPU utilization while its GPU is completely saturated; the CPU sits mostly idle because the GPU queue is full. Scale that pod on CPU and the autoscaler either fails to react while the GPU backlog grows, or adds more GPU-waiting pods that accomplish nothing except increasing scheduling overhead.
The same problem applies to Vertical Pod Autoscaler (VPA) for AI. VPA recommends resource changes based on historical usage — useless when inference demand is bursty and unpredictable, driven by user traffic patterns that have nothing to do with current node metrics.
AI workloads need event-driven autoscaling: scaling based on the actual demand signals that matter — queue depth, token throughput, concurrent requests, or external events like a marketing campaign launching. This is where KEDA changes the equation.
What KEDA Brings to AI Workloads
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 50+ built-in scalers that respond to external metrics instead of just CPU and memory. For AI inference, the most relevant scalers are:
- Prometheus — scale on any Prometheus metric: GPU utilization, KV cache hit rate, request queue depth, token throughput
- RabbitMQ / Apache Kafka — scale based on queue depth for async inference pipelines
- AWS CloudWatch — scale based on Bedrock or SageMaker metrics
- Azure Monitor — scale based on Azure OpenAI metrics
- Datadog — scale on any Datadog query for teams using Datadog APM
- CPU / Memory — still useful for non-GPU pods in the inference stack (tokenizers, preprocessors)
KEDA runs as an operator in your cluster and installs a set of Custom Resource Definitions (CRDs). You define a ScaledObject that connects your inference Deployment to a scaler, and KEDA creates and manages the underlying HPA for you.
Installing KEDA
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```

Or via ArgoCD if you manage your cluster with GitOps:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
spec:
  project: default
  source:
    chart: keda
    repoURL: https://kedacore.github.io/charts
    targetRevision: 2.15.0
    helm:
      values: |
        resources:
          limits:
            cpu: "300m"
            memory: "128Mi"
          requests:
            cpu: "100m"
            memory: "64Mi"
  destination:
    server: https://kubernetes.default.svc
    namespace: keda
```

Scaling AI Inference with KEDA + Prometheus
The most powerful combination for AI inference is KEDA with the Prometheus scaler. This lets you scale based on any metric your inference server exposes — GPU utilization, request queue depth, token throughput, or custom business metrics.
Example: Scaling vLLM on Request Queue Depth
vLLM exposes a metric called vllm:scheduler_pending_tokens — the number of tokens waiting in the scheduling queue. When this exceeds a threshold, it means the GPU is backlogged and you need more replicas. Here's how to wire it up:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference      # Your inference Deployment
  pollingInterval: 15         # Check every 15 seconds
  cooldownPeriod: 300         # Wait 5 minutes before scaling down
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_scheduler_pending_tokens
        query: |
          sum(vllm:scheduler_pending_tokens{model="$MODEL_NAME"})
        threshold: "8192"     # Scale up when 8K+ tokens are queued
```

The key configuration decisions:
- pollingInterval: 15 — Aggressive enough to catch bursty AI traffic without causing thrashing. For latency-sensitive APIs, you can go as low as 5 seconds.
- cooldownPeriod: 300 — AI inference is GPU-bound and startup is slow (30-90 seconds for a vLLM pod). A 5-minute cooldown prevents scale-down during momentary dips. Tune down to 120 seconds if your inference pods start faster.
- threshold: 8192 — KEDA treats this as a per-replica target: with the summed query above, the generated HPA aims for roughly one replica per 8,192 pending tokens (desired replicas ≈ total pending tokens / 8192, rounded up). Start conservative and adjust based on observed p99 latency. If latency spikes before the scaler reacts, lower it. If you're scaling up too aggressively, raise it.
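Two optional ScaledObject fields complement these settings: fallback pins the replica count if the metric source starts failing, and advanced.horizontalPodAutoscalerConfig passes a scale-down behavior through to the HPA that KEDA generates, capping how quickly replicas are removed. A sketch with illustrative values (the fallback replica count of 4 is an assumption, not a recommendation):

```yaml
spec:
  fallback:
    failureThreshold: 3        # after 3 consecutive failed metric reads...
    replicas: 4                # ...hold the workload at 4 replicas
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
            - type: Percent
              value: 50        # remove at most 50% of replicas per minute
              periodSeconds: 60
```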
Example: Scaling on GPU Utilization for Mixed Workloads
For mixed inference batches where GPU utilization is the real bottleneck:
```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: gpu_utilization
      query: |
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[2m]))
      threshold: "75"         # Scale up when average GPU utilization > 75%
```

Note: GPU metrics require the NVIDIA DCGM Exporter running in your cluster, with Prometheus scraping its metrics endpoint. See our GPU monitoring guide for setup instructions.
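Queue depth and GPU utilization don't have to be either/or: a ScaledObject accepts multiple triggers, and the resulting HPA scales to whichever one demands the most replicas. A sketch combining the two signals above, reusing the same illustrative addresses and thresholds:

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: vllm_scheduler_pending_tokens
      query: sum(vllm:scheduler_pending_tokens{model="$MODEL_NAME"})
      threshold: "8192"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: gpu_utilization
      query: avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[2m]))
      threshold: "75"
```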
KEDA vs HPA vs VPA: When to Use Each
For AI workloads, these three autoscaling mechanisms serve different purposes:
| Scaler | What it scales | Best for | AI relevance |
|---|---|---|---|
| HPA (CPU/Memory) | Pod replicas | Web APIs, CPU-bound workers | Low — GPU bottleneck not visible in CPU metrics |
| HPA (KEDA + Prometheus) | Pod replicas | AI inference, queue-driven workloads | High — scale on token queue, GPU util, request rate |
| VPA | Pod resource requests | Stateful workloads, databases | Low — AI workloads need fast scaling, not resource right-sizing |
| Karpenter | Node count + type | Any workload needing nodes | High — provision GPU nodes on demand |
| Cluster Autoscaler | Node pool size | Managed K8s (EKS/GKE/AKS) | Medium — simpler than Karpenter, less flexible |
The recommendation for AI inference in 2026: Use KEDA for pod-level scaling (fast, event-driven), and Karpenter for node-level provisioning (dynamic, launching right-sized nodes directly from pending pod requirements rather than resizing fixed node groups).
Karpenter: Dynamic GPU Node Provisioning
Traditional Cluster Autoscaler works at the node pool level — you pre-define node groups and it adds/removes nodes from those groups. This is limiting when your AI workloads need different GPU types at different times (an A100 for large batch inference, a T4 for small real-time requests).
Karpenter, developed by AWS and now a CNCF project, provisions the right node type for each pending pod — no pre-defined node groups. When a GPU pod can't schedule because no suitable node exists, Karpenter launches the cheapest available instance that satisfies the pod's resource requests.
Karpenter Provisioner for GPU Nodes
```yaml
# Karpenter v1alpha5 Provisioner (later Karpenter releases replace this with a NodePool)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  # Only provision GPU nodes, spot or on-demand
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: karpenter.k8s.aws/instance-gpu-count   # AWS-provider well-known label
      operator: In
      values: ["1", "2", "4"]                     # 1-, 2-, or 4-GPU nodes
  limits:
    resources:
      nvidia.com/gpu: "8"      # Max 8 GPUs in cluster at once
      cpu: "64"
      memory: "256Gi"
  providerRef:
    name: default
  # TTL — recycle nodes after 24h to capture price changes
  ttlSecondsUntilExpired: 86400
  weight: 100
```

The key decisions for AI workloads:
- Spot + On-Demand mixing — AI training jobs (fault-tolerant) use spot instances. Inference APIs (latency-sensitive) use on-demand. Use node taints to route workloads appropriately (see the sketch after this list).
- GPU count limits — Set an upper bound on total GPUs to prevent runaway costs during an attack or misconfiguration. The GPU limit in the provisioner above (8 GPUs) acts as a circuit breaker.
- ttlSecondsUntilExpired: 86400 — Forces node recycling every 24 hours. This matters for spot instances because relaunching lets you pick up cheaper capacity as prices shift, and fresh nodes come up on the latest AMI and GPU driver versions.
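The taint half of that routing can live on the Provisioner itself, so every GPU node Karpenter launches comes up tainted and only pods that explicitly tolerate it land there. A minimal sketch against the same v1alpha5 API (a dedicated spot-only provisioner is one possible layout, not the only one); the matching Deployment-side toleration is shown in the next section:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot-provisioner   # hypothetical spot-only GPU provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  # Every node this provisioner launches carries the GPU taint,
  # so only workloads with a matching toleration are scheduled onto it
  taints:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  providerRef:
    name: default
```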
Tainting GPU Nodes for Appropriate Workloads
```yaml
# In your inference Deployment
spec:
  template:
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      # Only schedule on spot GPU nodes for batch inference
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "64Gi"
              cpu: "8"
```

Putting It Together: The AI Autoscaling Stack
A production AI inference cluster in 2026 needs three layers of elasticity:
- Pod-level (KEDA) — Fast response to inference demand signals: queue depth, token throughput, concurrent requests. Scale within seconds.
- Node-level (Karpenter) — Provision the right GPU node type when pod scaling exhausts current capacity. Responds in 30-60 seconds.
- Cluster-level (cloud quotas) — The FinOps guardrail. Set a maximum GPU count per cloud account to prevent runaway spend. This is a policy control, not an autoscaler.
Here's how the three layers interact for a traffic spike:
- Traffic spike → KEDA scales inference pods from 2 → 6 replicas (15-30 seconds)
- Existing GPU nodes are saturated → pending pods appear → Karpenter provisions a new GPU node (30-60 seconds)
- If GPU count approaches the limit → the Provisioner's GPU limit or your cloud quota blocks further node provisioning, protecting against runaway costs
- Traffic normalizes → KEDA scales pods down after cooldown (5 minutes) → Karpenter terminates idle nodes after 24h TTL
The FinOps Perspective: Autoscaling Without Overspending
Autoscaling without cost controls is a liability. Every AI team has a story of a pod that scaled to 50 replicas on a weekend and ran up $4,000 in cloud costs before anyone noticed. The fix isn't to disable autoscaling — it's to add the right guardrails.
1. Set Namespace-Level GPU Quotas
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: inference
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # Max 8 GPUs requested in the inference namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: inference-limits
  namespace: inference
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "4"        # A single container can't monopolize the quota
```

2. Use Spot for Batch Inference, On-Demand for Real-Time APIs
Separate your inference workloads by SLA:
- Real-time APIs (p99 < 500ms) — Run on on-demand or reserved instances. You need guaranteed capacity and fast scaling.
- Batch inference / async processing — Run on spot instances with checkpointing. 70-90% cost savings with fault-tolerant code.
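For the batch side, the separation is mostly a scheduling concern. A minimal sketch of an async inference Job pinned to spot GPU capacity; the job name, image, and container name are placeholders, and it assumes the workload checkpoints its own progress so a spot interruption only costs a retry:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-batch-inference     # hypothetical batch job
  namespace: inference
spec:
  backoffLimit: 3                   # a spot interruption costs a retry, not the run
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        karpenter.sh/capacity-type: spot
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: batch-runner
          image: registry.example.com/batch-inference:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"
```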
3. Monitor Your Autoscaler Efficiency
Track these metrics to ensure your autoscaling isn't creating waste:
| Metric | What it tells you | Target |
|---|---|---|
| keda_scaler_metrics_value | What signal is triggering scale | Stable, not spiking wildly |
| keda_pod_scale_count | Actual replica count over time | Gradual changes, not oscillation |
| karpenter_nodes_terminated | Node churn rate | < 20% daily churn — too high means waste |
| GPU utilization avg | Are you overprovisioning GPUs? | > 60% for inference, > 80% for training |
| Monthly GPU cost | Total spend on GPU nodes | Within ±10% of forecast |
For GPU cost monitoring, see our guide to GPU monitoring for AI inference and Kubernetes cost optimization for the full FinOps stack.
Common Failure Modes
KEDA Not Scaling Down After Traffic Drops
If replicas stay high after traffic normalizes, check cooldownPeriod — 300 seconds is conservative but prevents thrashing. Also verify that your Prometheus query returns a value when traffic is low: a query that returns no series at all behaves differently from one that returns zero.
Karpenter Not Terminating Idle Nodes
Karpenter waits until all pods are evicted from a node before terminating it. If you have long-running inference requests, nodes can appear "idle" but be waiting for requests to complete. Set ttlSecondsAfterEmpty: 60 to terminate nodes faster when they go empty.
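On the v1alpha5 Provisioner shown earlier, that field sits at the top level of the spec; a minimal excerpt:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  ttlSecondsAfterEmpty: 60       # reclaim a node 60 seconds after its last pod leaves
  ttlSecondsUntilExpired: 86400  # keep the 24h recycling from the earlier example
```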
Scale-Up Too Slow for Latency-Sensitive APIs
For APIs requiring <200ms latency, KEDA's 15-second polling interval is too slow. Pre-scale your minimum replicas to handle normal peak load, and use KEDA only for overflow. Set minReplicaCount: 5 (or whatever your normal peak is) rather than 1.
Summary
AI inference on Kubernetes requires event-driven autoscaling — not CPU-based HPA. The winning combination in 2026 is:
- KEDA + Prometheus for pod-level scaling on actual AI metrics (queue depth, token throughput, GPU utilization)
- Karpenter for dynamic GPU node provisioning that responds to actual demand, not just pending pods
- Namespace GPU quotas as a hard cost guardrail against runaway scaling
- Spot/On-Demand separation to capture 70% savings on batch workloads without impacting API latency
The result is an inference cluster that scales as fast as your users need it to, procures exactly the GPU capacity required, and stops before it empties your cloud budget.
Kubecost provides real-time visibility into your Karpenter and KEDA spend, with GPU cost attribution per namespace and workload, budget alerts, and right-sizing recommendations. A free tier is available.