The Horizontal Pod Autoscaler (HPA) scales on CPU and memory by default — metrics that work for web servers and APIs. AI inference workloads don't behave like web servers. A vLLM pod serving 1,000 tokens/second might show 40% CPU utilization while its GPU is completely saturated; the CPU sits idle because the GPU queue is full. Scale that pod on CPU and you add more GPU-starved replicas that accomplish nothing except increasing scheduling overhead.

The same problem applies to Vertical Pod Autoscaler (VPA) for AI. VPA recommends resource changes based on historical usage — useless when inference demand is bursty and unpredictable, driven by user traffic patterns that have nothing to do with current node metrics.

AI workloads need event-driven autoscaling: scaling based on the actual demand signals that matter — queue depth, token throughput, concurrent requests, or external events like a marketing campaign launching. This is where KEDA changes the equation.

What KEDA Brings to AI Workloads

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 50+ built-in scalers that respond to external metrics instead of just CPU and memory. For AI inference, the most relevant scalers are:

  • Prometheus — scale on any Prometheus metric: GPU utilization, KV cache hit rate, request queue depth, token throughput
  • RabbitMQ / Apache Kafka — scale based on queue depth for async inference pipelines
  • AWS CloudWatch — scale based on Bedrock or SageMaker metrics
  • Azure Monitor — scale based on Azure OpenAI metrics
  • Datadog — scale on any Datadog query for teams using Datadog APM
  • CPU / Memory — still useful for non-GPU pods in the inference stack (tokenizers, preprocessors)
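For the queue-driven case, a trigger block might look like the sketch below. This is illustrative only — the broker address, topic, and consumer group are placeholder names, not values from any real deployment:

```yaml
# Hedged sketch: KEDA Kafka trigger for an async inference pipeline.
# bootstrapServers, consumerGroup, and topic are assumed placeholder values.
triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka.queueing.svc:9092  # assumed broker address
    consumerGroup: inference-workers           # assumed consumer group
    topic: inference-requests                  # assumed topic name
    lagThreshold: "100"  # add a replica per ~100 unconsumed messages
```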

KEDA works as a Kubernetes Custom Resource Definition (CRD). You install it as a Deployment, define a ScaledObject that connects your inference Deployment to a scaler, and KEDA automatically manages the HPA for you.

Installing KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Or via ArgoCD if you manage GitOps:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
spec:
  project: default
  source:
    chart: keda
    repoURL: https://kedacore.github.io/charts
    targetRevision: 2.15.0
    helm:
      values: |
        resources:
          limits:
            cpu: "300m"
            memory: "128Mi"
          requests:
            cpu: "100m"
            memory: "64Mi"
  destination:
    server: https://kubernetes.default.svc
    namespace: keda

Tool Spotlight: Kubecost

Real-time Kubernetes cost monitoring with GPU attribution per namespace and workload. Kubecost tracks your Karpenter and KEDA spend live, alerts on budget overruns, and shows right-sizing recommendations for GPU nodes — essential for AI infrastructure FinOps.

Scaling AI Inference with KEDA + Prometheus

The most powerful combination for AI inference is KEDA with the Prometheus scaler. This lets you scale based on any metric your inference server exposes — GPU utilization, request queue depth, token throughput, or custom business metrics.

Example: Scaling vLLM on Request Queue Depth

vLLM exposes a metric called vllm_scheduler_pending_tokens — the number of tokens waiting in the scheduling queue. When this exceeds a threshold, it means the GPU is backlogged and you need more replicas. Here's how to wire it up:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: vllm_scheduler_pending_tokens
      query: |
        sum(vllm_scheduler_pending_tokens{model="$MODEL_NAME"})
      threshold: "8192"

The key configuration decisions:

  • pollingInterval: 15 — Aggressive enough to catch bursty AI traffic without causing thrashing. For latency-sensitive APIs, you can go as low as 5 seconds.
  • cooldownPeriod: 300 — AI inference is GPU-bound and startup is slow (30-90 seconds for a vLLM pod). A 5-minute cooldown prevents scale-down during momentary dips.
  • threshold: 8192 — This is the sum of pending tokens across all replicas. Start conservative and adjust based on observed p99 latency. If latency spikes before hitting this threshold, lower it.
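If scale-down oscillation persists even with a long cooldownPeriod, the ScaledObject's advanced section lets you pass scale-down behavior rules through to the HPA that KEDA manages. A sketch with illustrative values:

```yaml
# Optional fine-grained scale-down control, passed to the underlying HPA.
# Values here are illustrative starting points, not recommendations.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300  # look back 5 min before scaling down
          policies:
          - type: Pods
            value: 1           # remove at most one replica...
            periodSeconds: 60  # ...per minute
```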

Example: Scaling on GPU Utilization for Mixed Workloads

For mixed inference batches where GPU utilization is the real bottleneck:

  triggers:
  - type: prometheus
    metadata:
      metricName: gpu_utilization
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL)
      threshold: "75"

Note: GPU metrics require the DCGM Exporter running in your cluster with Prometheus scraping GPU metrics. See our GPU monitoring guide for setup instructions.
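If you run the Prometheus Operator, a ServiceMonitor is one way to get the exporter scraped. A minimal sketch — it assumes the exporter's Service is labeled app=dcgm-exporter, lives in a gpu-operator namespace, and names its metrics port metrics; adjust all three to match your install:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter        # assumed label on the exporter Service
  namespaceSelector:
    matchNames: [gpu-operator]  # assumed namespace for the exporter
  endpoints:
  - port: metrics               # assumed metrics port name
    interval: 15s
```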


KEDA vs HPA vs VPA: When to Use Each

For AI workloads, these three autoscaling mechanisms serve different purposes:

| Scaler | What it scales | Best for | AI relevance |
|---|---|---|---|
| HPA (CPU/Memory) | Pod replicas | Web APIs, CPU-bound workers | Low — GPU bottleneck not visible in CPU metrics |
| HPA (KEDA + Prometheus) | Pod replicas | AI inference, queue-driven workloads | High — scale on token queue, GPU util, request rate |
| VPA | Pod resource requests | Stateful workloads, databases | Low — AI workloads need fast scaling, not resource right-sizing |
| Karpenter | Node count + type | Any workload needing nodes | High — provision GPU nodes on demand |
| Cluster Autoscaler | Node pool size | Managed K8s (EKS/GKE/AKS) | Medium — simpler than Karpenter, less flexible |

The recommendation for AI inference in 2026: Use KEDA for pod-level scaling (fast, event-driven), and Karpenter for node-level provisioning (dynamic, launching the right node type directly from pending pod requirements rather than resizing pre-defined node groups).

Karpenter: Dynamic GPU Node Provisioning

Traditional Cluster Autoscaler works at the node pool level — you pre-define node groups and it adds/removes nodes from those groups. This is limiting when your AI workloads need different GPU types at different times (an A100 for large batch inference, a T4 for small real-time requests).

Karpenter, developed by AWS and now CNCF, provisions exactly the right node type for each pending pod — no node pools, no pre-configuration. When a GPU pod can't schedule because no suitable node exists, Karpenter launches the cheapest available node that satisfies the pod's resources.

Karpenter Provisioner for GPU Nodes

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot, on-demand]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [g, p]   # GPU instance families
  limits:
    resources:
      nvidia.com/gpu: "8"
      cpu: "64"
      memory: "256Gi"
  providerRef:
    name: default
  ttlSecondsUntilExpired: 86400
  weight: 100

The key decisions for AI workloads:

  • Spot + On-Demand mixing — AI training jobs (fault-tolerant) use spot instances. Inference APIs (latency-sensitive) use on-demand. Use node taints to route appropriately.
  • GPU count limits — Set an upper bound on total GPUs to prevent runaway costs during an attack or misconfiguration. The GPU limit above (8 GPUs) acts as a circuit breaker.
  • ttlSecondsUntilExpired: 86400 — Forces node recycling every 24 hours. This captures lower spot prices on new instances and ensures you're running latest driver versions.

Tainting GPU Nodes for Appropriate Workloads

# In your Inference Deployment
spec:
  template:
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot  # use on-demand for latency-sensitive APIs
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "64Gi"
            cpu: "8"
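The toleration above only takes effect if the GPU nodes actually carry the matching taint. On the Karpenter side, the taint can be declared in the Provisioner spec — a sketch:

```yaml
# Taint GPU nodes so that only pods which explicitly tolerate
# nvidia.com/gpu (like the Deployment above) can schedule onto them.
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```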

Tool Spotlight: CoreWeave

Auto-scaling GPU infrastructure for AI inference — CoreWeave Kubernetes Engine (CKE) provisions NVIDIA H100, A100, and L40S nodes on demand with built-in KEDA compatibility. No node pool management, no reserved capacity commitments. Pay per second for what you use.

Putting It Together: The AI Autoscaling Stack

A production AI inference cluster in 2026 needs three layers of elasticity:

  1. Pod-level (KEDA) — Fast response to inference demand signals: queue depth, token throughput, concurrent requests. Scale within seconds.
  2. Node-level (Karpenter) — Provision the right GPU node type when pod scaling exhausts current capacity. Responds in 30-60 seconds.
  3. Cluster-level (cloud quotas) — Monthly FinOps guardrail. Set a maximum GPU count per cloud account to prevent runaway spend.

Here's how the three layers interact for a traffic spike:

  1. Traffic spike → KEDA scales inference pods from 2 → 6 replicas (15-30 seconds)
  2. Existing GPU nodes are saturated → pending pods appear → Karpenter provisions a new GPU node (30-60 seconds)
  3. If GPU count approaches the limit → the Provisioner's GPU limit and cloud quotas prevent further node provisioning, protecting against runaway costs
  4. Traffic normalizes → KEDA scales pods down after cooldown (5 minutes) → Karpenter terminates idle nodes after 24h TTL

The FinOps Perspective: Autoscaling Without Overspending

Autoscaling without cost controls is a liability. Every AI team has a story of a pod that scaled to 50 replicas on a weekend and ran up $4,000 in cloud costs before anyone noticed. The fix isn't to disable autoscaling — it's to add the right guardrails.

1. Set Namespace-Level GPU Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: inference
spec:
  hard:
    nvidia.com/gpu: "8"
    requests.nvidia.com/gpu: "8"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: inference-limits
  namespace: inference
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "4"

2. Use Spot for Batch Inference, On-Demand for Real-Time APIs

Separate your inference workloads by SLA:

  • Real-time APIs (p99 < 500ms) — Run on on-demand or reserved instances. You need guaranteed capacity and fast scaling.
  • Batch inference / async processing — Run on spot instances with checkpointing. 70-90% cost savings with fault-tolerant code.
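In practice, the split can be expressed with a nodeSelector on the batch side. A sketch of a batch inference Job routed to spot capacity — the Job name and image are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference  # placeholder name
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot  # run on cheaper, interruptible nodes
      restartPolicy: OnFailure            # retry from checkpoint after spot reclaim
      containers:
      - name: worker
        image: registry.example.com/batch-inference:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: "1"
```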

3. Monitor Your Autoscaler Efficiency

Track these metrics to ensure your autoscaling isn't creating waste:

| Metric | What it tells you | Target |
|---|---|---|
| keda_scaler_metrics_value | What signal is triggering scale | Stable, not spiking wildly |
| keda_pod_scale_count | Actual replica count over time | Gradual changes, not oscillation |
| karpenter_nodes_terminated | Node churn rate | < 20% daily churn — too high means waste |
| GPU utilization avg | Are you overprovisioning GPUs? | > 60% for inference, > 80% for training |
| Monthly GPU cost | Total spend on GPU nodes | < forecast ± 10% |
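These targets can be turned into alerts. A hedged PrometheusRule sketch for the GPU-utilization floor, assuming DCGM metrics are available as DCGM_FI_DEV_GPU_UTIL and the Prometheus Operator is installed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-underutilization
  namespace: monitoring
spec:
  groups:
  - name: gpu-finops
    rules:
    - alert: GPUUnderutilized
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 60  # below the 60% inference target
      for: 2h  # sustained underutilization, not a momentary dip
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilization below 60% for 2h — check for overprovisioning"
```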

For GPU cost monitoring, see our guide to GPU monitoring for AI inference and Kubernetes cost optimization for the full FinOps stack.

Common Failure Modes

KEDA Not Scaling Down After Traffic Drops

If replicas stay high after traffic normalizes, check cooldownPeriod — 300 seconds is conservative but prevents thrashing. Also verify your Prometheus query is returning values when traffic is low.

Karpenter Not Terminating Idle Nodes

Karpenter waits until all pods are evicted from a node before terminating it. If you have long-running inference requests, nodes can appear "idle" but be waiting for requests to complete. Set ttlSecondsAfterEmpty: 60 to terminate nodes faster when they go empty.
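In the v1alpha5 Provisioner API, that field sits at the top level of the spec, next to the expiry TTL:

```yaml
spec:
  ttlSecondsAfterEmpty: 60       # reclaim a node 60s after its last pod leaves
  ttlSecondsUntilExpired: 86400  # recycle all nodes every 24h regardless
```

In the newer NodePool API this behavior moved under spec.disruption (consolidationPolicy: WhenEmpty with a consolidateAfter duration).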

Scale-Up Too Slow for Latency-Sensitive APIs

For APIs requiring <200ms latency, KEDA's 15-second polling interval is too slow. Pre-scale your minimum replicas to handle normal peak load, and use KEDA only for overflow. Set minReplicaCount: 5 rather than 1.
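Applied to the vllm-inference-scaler ScaledObject above, the overflow-only pattern is a small change to three fields:

```yaml
spec:
  pollingInterval: 5   # tighter polling for latency-sensitive overflow
  minReplicaCount: 5   # pre-scaled floor covers normal peak load
  maxReplicaCount: 10  # KEDA only handles bursts above the floor
```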


Summary

AI inference on Kubernetes requires event-driven autoscaling — not CPU-based HPA. The winning combination in 2026 is:

  • KEDA + Prometheus for pod-level scaling on actual AI metrics (queue depth, token throughput, GPU utilization)
  • Karpenter for dynamic GPU node provisioning, launching the right node type directly from pending pod requirements instead of resizing pre-defined node groups
  • Namespace GPU quotas as a hard cost guardrail against runaway scaling
  • Spot/On-Demand separation to capture 70% savings on batch workloads without impacting API latency

The result is an inference cluster that scales as fast as your users need it to, procures exactly the GPU capacity required, and stops before it empties your cloud budget.

Tool Spotlight: Lambda Labs

Preemptible GPU instances with Kubernetes support — Lambda Labs offers H100, A100, and RTX 4090 nodes with automatic spot interruption handling and Kubernetes device plugin support. Alternative to Karpenter for teams wanting fully managed GPU nodes without cloud-specific integrations.