Horizontal Pod Autoscaler (HPA) scales based on CPU and memory — metrics that work for web servers and APIs. AI inference workloads don't behave like web servers. A vLLM pod serving 1,000 tokens/second might show 40% CPU utilization while its GPU is completely saturated. CPU is idle because the GPU queue is full. Scale that pod by following CPU and you add more GPU-waiting pods that accomplish nothing except increasing scheduling overhead.
The same problem applies to Vertical Pod Autoscaler (VPA) for AI. VPA recommends resource changes based on historical usage — useless when inference demand is bursty and unpredictable, driven by user traffic patterns that have nothing to do with current node metrics.
AI workloads need event-driven autoscaling: scaling based on the actual demand signals that matter — queue depth, token throughput, concurrent requests, or external events like a marketing campaign launching. This is where KEDA changes the equation.
What KEDA Brings to AI Workloads
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 50+ built-in scalers that respond to external metrics instead of just CPU and memory. For AI inference, the most relevant scalers are:
- Prometheus — scale on any Prometheus metric: GPU utilization, KV cache hit rate, request queue depth, token throughput
- RabbitMQ / Apache Kafka — scale based on queue depth for async inference pipelines
- AWS CloudWatch — scale based on Bedrock or SageMaker metrics
- Azure Monitor — scale based on Azure OpenAI metrics
- Datadog — scale on any Datadog query for teams using Datadog APM
- CPU / Memory — still useful for non-GPU pods in the inference stack (tokenizers, preprocessors)
KEDA works as a Kubernetes Custom Resource Definition (CRD). You install it as a Deployment, define a ScaledObject that connects your inference Deployment to a scaler, and KEDA automatically manages the HPA for you.
Installing KEDA
```shell
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```

Or via ArgoCD if you manage the cluster with GitOps:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
spec:
  project: default
  source:
    chart: keda
    repoURL: https://kedacore.github.io/charts
    targetRevision: 2.15.0
    helm:
      values: |
        resources:
          limits:
            cpu: "300m"
            memory: "128Mi"
          requests:
            cpu: "100m"
            memory: "64Mi"
  destination:
    server: https://kubernetes.default.svc
    namespace: keda
```
Scaling AI Inference with KEDA + Prometheus
The most powerful combination for AI inference is KEDA with the Prometheus scaler. This lets you scale based on any metric your inference server exposes — GPU utilization, request queue depth, token throughput, or custom business metrics.
Example: Scaling vLLM on Request Queue Depth
vLLM exposes a metric called vllm_scheduler_pending_tokens — the number of tokens waiting in the scheduling queue. When this exceeds a threshold, it means the GPU is backlogged and you need more replicas. Here's how to wire it up:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm_scheduler_pending_tokens
        query: |
          sum(vllm_scheduler_pending_tokens{model="$MODEL_NAME"})
        threshold: "8192"
```

The key configuration decisions:
- pollingInterval: 15 — Aggressive enough to catch bursty AI traffic without causing thrashing. For latency-sensitive APIs, you can go as low as 5 seconds.
- cooldownPeriod: 300 — AI inference is GPU-bound and startup is slow (30-90 seconds for a vLLM pod). A 5-minute cooldown prevents scale-down during momentary dips.
- threshold: 8192 — This is the sum of pending tokens across all replicas. Start conservative and adjust based on observed p99 latency. If latency spikes before hitting this threshold, lower it.
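To build intuition for how the threshold drives replica counts: KEDA hands the metric to an HPA, and with KEDA's default AverageValue metric type the standard HPA rule `ceil(currentReplicas * currentValue / target)` reduces to `ceil(total / threshold)`, clamped to the replica bounds. A minimal sketch (function name and backlog numbers are illustrative):

```python
import math

def desired_replicas(total_metric: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    # With AverageValue metrics, HPA's ceil(current * value / target)
    # simplifies to ceil(total / threshold), clamped to the replica bounds.
    desired = math.ceil(total_metric / threshold)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(40_000, 8_192))   # → 5  (40k pending tokens)
print(desired_replicas(2_000, 8_192))    # → 1  (scale-in floor)
print(desired_replicas(200_000, 8_192))  # → 10 (capped at maxReplicaCount)
```

This is also why the threshold is worth tuning against p99 latency: halving it roughly doubles the replica count KEDA requests for the same backlog.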
Example: Scaling on GPU Utilization for Mixed Workloads
For mixed inference batches where GPU utilization is the real bottleneck:
```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: gpu_utilization
      query: |
        avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[2m]))
      threshold: "75"
```

Note: DCGM_FI_DEV_GPU_UTIL is a gauge that already reports utilization as a percentage (0-100), so average it over the window rather than taking a `rate()`. GPU metrics require the DCGM Exporter running in your cluster with Prometheus scraping GPU metrics. See our GPU monitoring guide for setup instructions.
KEDA vs HPA vs VPA: When to Use Each
For AI workloads, these three autoscaling mechanisms serve different purposes:
| Scaler | What it scales | Best for | AI relevance |
|---|---|---|---|
| HPA (CPU/Memory) | Pod replicas | Web APIs, CPU-bound workers | Low — GPU bottleneck not visible in CPU metrics |
| HPA (KEDA + Prometheus) | Pod replicas | AI inference, queue-driven workloads | High — scale on token queue, GPU util, request rate |
| VPA | Pod resource requests | Stateful workloads, databases | Low — AI workloads need fast scaling, not resource right-sizing |
| Karpenter | Node count + type | Any workload needing nodes | High — provision GPU nodes on demand |
| Cluster Autoscaler | Node pool size | Managed K8s (EKS/GKE/AKS) | Medium — simpler than Karpenter, less flexible |
The recommendation for AI inference in 2026: Use KEDA for pod-level scaling (fast, event-driven), and Karpenter for node-level provisioning (dynamic, responsive to actual node pressure rather than just pending pods).
Karpenter: Dynamic GPU Node Provisioning
Traditional Cluster Autoscaler works at the node pool level — you pre-define node groups and it adds/removes nodes from those groups. This is limiting when your AI workloads need different GPU types at different times (an A100 for large batch inference, a T4 for small real-time requests).
Karpenter, developed by AWS and now a CNCF project, provisions exactly the right node type for each pending pod — no node pools, no pre-configuration. When a GPU pod can't schedule because no suitable node exists, Karpenter launches the cheapest available node that satisfies the pod's resource requirements and scheduling constraints.
Karpenter Provisioner for GPU Nodes
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot, on-demand]
    - key: karpenter.k8s.aws/instance-gpu-count
      operator: In
      values: ["1", "2", "4"]
  limits:
    resources:
      nvidia.com/gpu: "8"
      cpu: "64"
      memory: "256Gi"
  providerRef:
    name: default
  ttlSecondsUntilExpired: 86400
  weight: 100
```

The key decisions for AI workloads:
- Spot + On-Demand mixing — AI training jobs (fault-tolerant) use spot instances. Inference APIs (latency-sensitive) use on-demand. Use node taints to route appropriately.
- GPU count limits — Set an upper bound on total GPUs to prevent runaway costs during an attack or misconfiguration. The GPU limit above (8 GPUs) acts as a circuit breaker.
- ttlSecondsUntilExpired: 86400 — Forces node recycling every 24 hours. This captures lower spot prices on new instances and ensures you're running latest driver versions.
Tainting GPU Nodes for Appropriate Workloads
```yaml
# In your inference Deployment
spec:
  template:
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "64Gi"
              cpu: "8"
```
Putting It Together: The AI Autoscaling Stack
A production AI inference cluster in 2026 needs three layers of elasticity:
- Pod-level (KEDA) — Fast response to inference demand signals: queue depth, token throughput, concurrent requests. Scale within seconds.
- Node-level (Karpenter) — Provision the right GPU node type when pod scaling exhausts current capacity. Responds in 30-60 seconds.
- Cluster-level (cloud and namespace quotas) — The FinOps guardrail: set a maximum GPU count per namespace and per cloud account to prevent runaway spend.
Here's how the three layers interact for a traffic spike:
- Traffic spike → KEDA scales inference pods from 2 → 6 replicas (15-30 seconds)
- Existing GPU nodes are saturated → pending pods appear → Karpenter provisions a new GPU node (30-60 seconds)
- If GPU count approaches the limit → Karpenter's provisioner limits, namespace quotas, or cloud account quotas prevent further node provisioning, protecting against runaway costs
- Traffic normalizes → KEDA scales pods down after cooldown (5 minutes) → Karpenter terminates idle nodes after 24h TTL
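Summing the stage latencies above gives a rough time-to-capacity budget for a burst that needs a brand-new GPU node — a back-of-the-envelope sketch using the ranges quoted in this article (your image-pull and model-load times will differ):

```python
# Illustrative latency budget: each stage is (best, worst) in seconds,
# taken from the ranges quoted in the text above.
stages = {
    "KEDA detects spike + scales pods": (15, 30),
    "Karpenter provisions GPU node": (30, 60),
    "vLLM pod pulls image + loads model": (30, 90),
}

best = sum(lo for lo, _ in stages.values())
worst = sum(hi for _, hi in stages.values())
print(f"new capacity serving traffic in {best}-{worst}s")  # → 75-180s
```

That 1-3 minute window is exactly why the pre-scaling advice later in this article matters for latency-sensitive APIs: autoscaling absorbs sustained growth, not instantaneous spikes.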
The FinOps Perspective: Autoscaling Without Overspending
Autoscaling without cost controls is a liability. Every AI team has a story of a pod that scaled to 50 replicas on a weekend and ran up $4,000 in cloud costs before anyone noticed. The fix isn't to disable autoscaling — it's to add the right guardrails.
1. Set Namespace-Level GPU Quotas
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: inference
spec:
  hard:
    # Extended resources are quota'd via the requests. prefix
    requests.nvidia.com/gpu: "8"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: inference-limits
  namespace: inference
spec:
  limits:
    - type: Container
      max:
        nvidia.com/gpu: "4"
```

2. Use Spot for Batch Inference, On-Demand for Real-Time APIs
Separate your inference workloads by SLA:
- Real-time APIs (p99 < 500ms) — Run on on-demand or reserved instances. You need guaranteed capacity and fast scaling.
- Batch inference / async processing — Run on spot instances with checkpointing. 70-90% cost savings with fault-tolerant code.
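The arithmetic behind the split is straightforward. A quick sketch with assumed hourly rates and workload hours (check your own cloud's pricing — these numbers are purely illustrative):

```python
# Hypothetical hourly rates for a single-GPU A100 node (assumed, not quoted prices)
ON_DEMAND_RATE = 4.10   # $/hr
SPOT_RATE = 1.23        # $/hr, ~70% discount

batch_gpu_hours = 500       # fault-tolerant batch inference → spot
realtime_gpu_hours = 300    # latency-sensitive API serving → on-demand

all_on_demand = (batch_gpu_hours + realtime_gpu_hours) * ON_DEMAND_RATE
mixed = batch_gpu_hours * SPOT_RATE + realtime_gpu_hours * ON_DEMAND_RATE
savings = 1 - mixed / all_on_demand
print(f"${all_on_demand:.0f} all on-demand vs ${mixed:.0f} mixed ({savings:.0%} saved)")
```

Even with only the batch half on spot, the blended bill drops by roughly 40% in this example — without touching the SLA of the real-time tier.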
3. Monitor Your Autoscaler Efficiency
Track these metrics to ensure your autoscaling isn't creating waste:
| Metric | What it tells you | Target |
|---|---|---|
| keda_scaler_metrics_value | What signal is triggering scale | Stable, not spiking wildly |
| keda_pod_scale_count | Actual replica count over time | Gradual changes, not oscillation |
| karpenter_nodes_terminated | Node churn rate | < 20% daily churn — too high means waste |
| GPU utilization avg | Are you overprovisioning GPUs? | > 60% for inference, > 80% for training |
| Monthly GPU cost | Total spend on GPU nodes | < forecast ± 10% |
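The node-churn check from the table can be computed directly from the counters. A minimal sketch (helper name and fleet numbers are illustrative; feed it your `karpenter_nodes_terminated` delta and average node count):

```python
def churn_rate(nodes_terminated_24h: int, avg_node_count: int) -> float:
    """Daily node churn as a fraction of the fleet; > 0.20 suggests
    nodes are cycling faster than workloads can amortize startup cost."""
    return nodes_terminated_24h / max(avg_node_count, 1)

# Example fleet: 6 nodes recycled in 24h out of ~20 running on average
rate = churn_rate(6, 20)
print(f"daily churn: {rate:.0%}", "— investigate" if rate > 0.20 else "— ok")
```

High churn usually points at a TTL that is too aggressive or a scaler oscillating between thresholds.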
For GPU cost monitoring, see our guide to GPU monitoring for AI inference and Kubernetes cost optimization for the full FinOps stack.
Common Failure Modes
KEDA Not Scaling Down After Traffic Drops
If replicas stay high after traffic normalizes, check cooldownPeriod — 300 seconds is conservative but prevents thrashing. Also verify your Prometheus query is returning values when traffic is low.
Karpenter Not Terminating Idle Nodes
Karpenter waits until all pods are evicted from a node before terminating it. If you have long-running inference requests, nodes can appear "idle" but be waiting for requests to complete. Set ttlSecondsAfterEmpty: 60 to terminate nodes faster when they go empty.
Scale-Up Too Slow for Latency-Sensitive APIs
For APIs requiring <200ms latency, KEDA's 15-second polling interval is too slow. Pre-scale your minimum replicas to handle normal peak load, and use KEDA only for overflow. Set minReplicaCount: 5 rather than 1.
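Sizing that baseline is a capacity calculation, not a guess. A small helper (function name and throughput figures are illustrative) that derives `minReplicaCount` from measured peak traffic and per-replica throughput, with headroom:

```python
import math

def prescale_min_replicas(peak_rps: float, per_replica_rps: float,
                          headroom: float = 0.2) -> int:
    """Baseline replica count sized for normal peak load plus headroom,
    so KEDA only has to handle overflow above it."""
    return math.ceil(peak_rps * (1 + headroom) / per_replica_rps)

# e.g. 40 req/s at normal peak, ~10 req/s sustained per vLLM replica:
print(prescale_min_replicas(40, 10))  # → 5
```

Set the result as `minReplicaCount` in the ScaledObject and let event-driven scaling cover only the traffic above normal peak.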
Summary
AI inference on Kubernetes requires event-driven autoscaling — not CPU-based HPA. The winning combination in 2026 is:
- KEDA + Prometheus for pod-level scaling on actual AI metrics (queue depth, token throughput, GPU utilization)
- Karpenter for dynamic GPU node provisioning that responds to actual demand, not just pending pods
- Namespace GPU quotas as a hard cost guardrail against runaway scaling
- Spot/On-Demand separation to capture 70% savings on batch workloads without impacting API latency
The result is an inference cluster that scales as fast as your users need it to, procures exactly the GPU capacity required, and stops before it empties your cloud budget.