The Horizontal Pod Autoscaler (HPA) scales on CPU and memory by default — metrics that work for web servers and APIs. AI inference workloads don't behave like web servers. A vLLM pod serving 1,000 tokens/second might show 40% CPU utilization while its GPU is completely saturated; the CPU sits idle because the GPU queue is full. Scale that pod on CPU and you add more GPU-starved replicas that accomplish nothing except increasing scheduling overhead.

The same problem applies to Vertical Pod Autoscaler (VPA) for AI. VPA recommends resource changes based on historical usage — useless when inference demand is bursty and unpredictable, driven by user traffic patterns that have nothing to do with current node metrics.

AI workloads need event-driven autoscaling: scaling based on the actual demand signals that matter — queue depth, token throughput, concurrent requests, or external events like a marketing campaign launching. This is where KEDA changes the equation.

What KEDA Brings to AI Workloads

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 50+ built-in scalers that respond to external metrics instead of just CPU and memory. For AI inference, the most relevant scalers are:

  • Prometheus — scale on any Prometheus metric: GPU utilization, KV cache hit rate, request queue depth, token throughput
  • RabbitMQ / Apache Kafka — scale based on queue depth for async inference pipelines
  • AWS CloudWatch — scale based on Bedrock or SageMaker metrics
  • Azure Monitor — scale based on Azure OpenAI metrics
  • Datadog — scale on any Datadog query for teams using Datadog APM
  • CPU / Memory — still useful for non-GPU pods in the inference stack (tokenizers, preprocessors)
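For the queue-driven case, a trigger block might look like the sketch below. This is illustrative only — the broker address, topic, and consumer group are placeholder names, not values from any real deployment:

```yaml
# Hedged sketch: KEDA Kafka trigger for an async inference pipeline.
# bootstrapServers, consumerGroup, and topic are assumed placeholder values.
triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka.queueing.svc:9092  # assumed broker address
    consumerGroup: inference-workers           # assumed consumer group
    topic: inference-requests                  # assumed topic name
    lagThreshold: "100"  # add a replica per ~100 unconsumed messages
```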

KEDA works as a Kubernetes Custom Resource Definition (CRD). You install it as a Deployment, define a ScaledObject that connects your inference Deployment to a scaler, and KEDA automatically manages the HPA for you.

Installing KEDA

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Or via ArgoCD if you manage GitOps:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
spec:
  project: default
  source:
    chart: keda
    repoURL: https://kedacore.github.io/charts
    targetRevision: 2.15.0
    helm:
      values: |
        resources:
          limits:
            cpu: "300m"
            memory: "128Mi"
          requests:
            cpu: "100m"
            memory: "64Mi"
  destination:
    server: https://kubernetes.default.svc
    namespace: keda

Tool Spotlight: Kubecost

Real-time Kubernetes cost monitoring with GPU attribution per namespace and workload. Kubecost tracks your Karpenter and KEDA spend live, alerts on budget overruns, and shows right-sizing recommendations for GPU nodes — essential for AI infrastructure FinOps.

Scaling AI Inference with KEDA + Prometheus

The most powerful combination for AI inference is KEDA with the Prometheus scaler. This lets you scale based on any metric your inference server exposes — GPU utilization, request queue depth, token throughput, or custom business metrics.

Example: Scaling vLLM on Request Queue Depth

vLLM exposes a metric called vllm_scheduler_pending_tokens — the number of tokens waiting in the scheduling queue. When this exceeds a threshold, it means the GPU is backlogged and you need more replicas. Here's how to wire it up:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-inference-scaler
  namespace: inference
spec:
  scaleTargetRef:
    name: vllm-inference
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: vllm_scheduler_pending_tokens
      query: |
        sum(vllm_scheduler_pending_tokens{model="$MODEL_NAME"})
      threshold: "8192"

The key configuration decisions:

  • pollingInterval: 15 — Aggressive enough to catch bursty AI traffic without causing thrashing. For latency-sensitive APIs, you can go as low as 5 seconds.
  • cooldownPeriod: 300 — AI inference is GPU-bound and startup is slow (30-90 seconds for a vLLM pod). A 5-minute cooldown prevents scale-down during momentary dips.
  • threshold: 8192 — This is the sum of pending tokens across all replicas. Start conservative and adjust based on observed p99 latency. If latency spikes before hitting this threshold, lower it.
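If scale-down oscillation persists even with a long cooldownPeriod, the ScaledObject's advanced section lets you pass scale-down behavior rules through to the HPA that KEDA manages. A sketch with illustrative values:

```yaml
# Optional fine-grained scale-down control, passed to the underlying HPA.
# Values here are illustrative starting points, not recommendations.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300  # look back 5 min before scaling down
          policies:
          - type: Pods
            value: 1           # remove at most one replica...
            periodSeconds: 60  # ...per minute
```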

Example: Scaling on GPU Utilization for Mixed Workloads

For mixed inference batches where GPU utilization is the real bottleneck:

  triggers:
  - type: prometheus
    metadata:
      metricName: gpu_utilization
      query: |
        avg(DCGM_FI_DEV_GPU_UTIL)
      threshold: "75"

Note: GPU metrics require the DCGM Exporter running in your cluster with Prometheus scraping GPU metrics. See our GPU monitoring guide for setup instructions.
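If you run the Prometheus Operator, a ServiceMonitor is one way to get the exporter scraped. A minimal sketch — it assumes the exporter's Service is labeled app=dcgm-exporter, lives in a gpu-operator namespace, and names its metrics port metrics; adjust all three to match your install:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter        # assumed label on the exporter Service
  namespaceSelector:
    matchNames: [gpu-operator]  # assumed namespace for the exporter
  endpoints:
  - port: metrics               # assumed metrics port name
    interval: 15s
```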


KEDA vs HPA vs VPA: When to Use Each

For AI workloads, these three autoscaling mechanisms serve different purposes:

| Scaler | What it scales | Best for | AI relevance |
|---|---|---|---|
| HPA (CPU/Memory) | Pod replicas | Web APIs, CPU-bound workers | Low — GPU bottleneck not visible in CPU metrics |
| HPA (KEDA + Prometheus) | Pod replicas | AI inference, queue-driven workloads | High — scale on token queue, GPU util, request rate |
| VPA | Pod resource requests | Stateful workloads, databases | Low — AI workloads need fast scaling, not resource right-sizing |
| Karpenter | Node count + type | Any workload needing nodes | High — provision GPU nodes on demand |
| Cluster Autoscaler | Node pool size | Managed K8s (EKS/GKE/AKS) | Medium — simpler than Karpenter, less flexible |

The recommendation for AI inference in 2026: Use KEDA for pod-level scaling (fast, event-driven), and Karpenter for node-level provisioning (dynamic, launching the right node type directly from pending pod requirements rather than resizing pre-defined node groups).

Karpenter: Dynamic GPU Node Provisioning

Traditional Cluster Autoscaler works at the node pool level — you pre-define node groups and it adds/removes nodes from those groups. This is limiting when your AI workloads need different GPU types at different times (an A100 for large batch inference, a T4 for small real-time requests).

Karpenter, developed by AWS and now CNCF, provisions exactly the right node type for each pending pod — no node pools, no pre-configuration. When a GPU pod can't schedule because no suitable node exists, Karpenter launches the cheapest available node that satisfies the pod's resources.

Karpenter Provisioner for GPU Nodes

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-provisioner
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: [spot, on-demand]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: [g, p]   # GPU instance families
  limits:
    resources:
      nvidia.com/gpu: "8"
      cpu: "64"
      memory: "256Gi"
  providerRef:
    name: default
  ttlSecondsUntilExpired: 86400
  weight: 100

The key decisions for AI workloads:

  • Spot + On-Demand mixing — AI training jobs (fault-tolerant) use spot instances. Inference APIs (latency-sensitive) use on-demand. Use node taints to route appropriately.
  • GPU count limits — Set an upper bound on total GPUs to prevent runaway costs during an attack or misconfiguration. The GPU limit above (8 GPUs) acts as a circuit breaker.
  • ttlSecondsUntilExpired: 86400 — Forces node recycling every 24 hours. This captures lower spot prices on new instances and ensures you're running latest driver versions.

Tainting GPU Nodes for Appropriate Workloads

# In your Inference Deployment
spec:
  template:
    spec:
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        karpenter.sh/capacity-type: spot  # use on-demand for latency-sensitive APIs
      containers:
      - name: vllm
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "64Gi"
            cpu: "8"
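The toleration above only takes effect if the GPU nodes actually carry the matching taint. On the Karpenter side, the taint can be declared in the Provisioner spec — a sketch:

```yaml
# Taint GPU nodes so that only pods which explicitly tolerate
# nvidia.com/gpu (like the Deployment above) can schedule onto them.
spec:
  taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```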

Tool Spotlight: CoreWeave

Auto-scaling GPU infrastructure for AI inference — CoreWeave Kubernetes Engine (CKE) provisions NVIDIA H100, A100, and L40S nodes on demand with built-in KEDA compatibility. No node pool management, no reserved capacity commitments. Pay per second for what you use.

Putting It Together: The AI Autoscaling Stack

A production AI inference cluster in 2026 needs three layers of elasticity:

  1. Pod-level (KEDA) — Fast response to inference demand signals: queue depth, token throughput, concurrent requests. Scale within seconds.
  2. Node-level (Karpenter) — Provision the right GPU node type when pod scaling exhausts current capacity. Responds in 30-60 seconds.
  3. Cluster-level (cloud quotas) — Monthly FinOps guardrail. Set a maximum GPU count per cloud account to prevent runaway spend.

Here's how the three layers interact for a traffic spike:

  1. Traffic spike → KEDA scales inference pods from 2 → 6 replicas (15-30 seconds)
  2. Existing GPU nodes are saturated → pending pods appear → Karpenter provisions a new GPU node (30-60 seconds)
  3. If GPU count approaches the limit → the Provisioner's GPU limit and cloud quotas prevent further node provisioning, protecting against runaway costs
  4. Traffic normalizes → KEDA scales pods down after cooldown (5 minutes) → Karpenter terminates idle nodes after 24h TTL

The FinOps Perspective: Autoscaling Without Overspending

Autoscaling without cost controls is a liability. Every AI team has a story of a pod that scaled to 50 replicas on a weekend and ran up $4,000 in cloud costs before anyone noticed. The fix isn't to disable autoscaling — it's to add the right guardrails.

1. Set Namespace-Level GPU Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: inference-gpu-quota
  namespace: inference
spec:
  hard:
    nvidia.com/gpu: "8"
    requests.nvidia.com/gpu: "8"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: inference-limits
  namespace: inference
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "4"

2. Use Spot for Batch Inference, On-Demand for Real-Time APIs

Separate your inference workloads by SLA:

  • Real-time APIs (p99 < 500ms) — Run on on-demand or reserved instances. You need guaranteed capacity and fast scaling.
  • Batch inference / async processing — Run on spot instances with checkpointing. 70-90% cost savings with fault-tolerant code.
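In practice, the split can be expressed with a nodeSelector on the batch side. A sketch of a batch inference Job routed to spot capacity — the Job name and image are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference  # placeholder name
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot  # run on cheaper, interruptible nodes
      restartPolicy: OnFailure            # retry from checkpoint after spot reclaim
      containers:
      - name: worker
        image: registry.example.com/batch-inference:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: "1"
```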

3. Monitor Your Autoscaler Efficiency

Track these metrics to ensure your autoscaling isn't creating waste:

| Metric | What it tells you | Target |
|---|---|---|
| keda_scaler_metrics_value | What signal is triggering scale | Stable, not spiking wildly |
| keda_pod_scale_count | Actual replica count over time | Gradual changes, not oscillation |
| karpenter_nodes_terminated | Node churn rate | < 20% daily churn — too high means waste |
| GPU utilization avg | Are you overprovisioning GPUs? | > 60% for inference, > 80% for training |
| Monthly GPU cost | Total spend on GPU nodes | < forecast ± 10% |
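These targets can be turned into alerts. A hedged PrometheusRule sketch for the GPU-utilization floor, assuming DCGM metrics are available as DCGM_FI_DEV_GPU_UTIL and the Prometheus Operator is installed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-underutilization
  namespace: monitoring
spec:
  groups:
  - name: gpu-finops
    rules:
    - alert: GPUUnderutilized
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 60  # below the 60% inference target
      for: 2h  # sustained underutilization, not a momentary dip
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilization below 60% for 2h — check for overprovisioning"
```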

For GPU cost monitoring, see our guide to GPU monitoring for AI inference and Kubernetes cost optimization for the full FinOps stack.

Common Failure Modes

KEDA Not Scaling Down After Traffic Drops

If replicas stay high after traffic normalizes, check cooldownPeriod — 300 seconds is conservative but prevents thrashing. Also verify your Prometheus query is returning values when traffic is low.

Karpenter Not Terminating Idle Nodes

Karpenter waits until all pods are evicted from a node before terminating it. If you have long-running inference requests, nodes can appear "idle" but be waiting for requests to complete. Set ttlSecondsAfterEmpty: 60 to terminate nodes faster when they go empty.
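In the v1alpha5 Provisioner API, that field sits at the top level of the spec, next to the expiry TTL:

```yaml
spec:
  ttlSecondsAfterEmpty: 60       # reclaim a node 60s after its last pod leaves
  ttlSecondsUntilExpired: 86400  # recycle all nodes every 24h regardless
```

In the newer NodePool API this behavior moved under spec.disruption (consolidationPolicy: WhenEmpty with a consolidateAfter duration).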

Scale-Up Too Slow for Latency-Sensitive APIs

For APIs requiring <200ms latency, KEDA's 15-second polling interval is too slow. Pre-scale your minimum replicas to handle normal peak load, and use KEDA only for overflow. Set minReplicaCount: 5 rather than 1.
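Applied to the vllm-inference-scaler ScaledObject above, the overflow-only pattern is a small change to three fields:

```yaml
spec:
  pollingInterval: 5   # tighter polling for latency-sensitive overflow
  minReplicaCount: 5   # pre-scaled floor covers normal peak load
  maxReplicaCount: 10  # KEDA only handles bursts above the floor
```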


Summary

AI inference on Kubernetes requires event-driven autoscaling — not CPU-based HPA. The winning combination in 2026 is:

  • KEDA + Prometheus for pod-level scaling on actual AI metrics (queue depth, token throughput, GPU utilization)
  • Karpenter for dynamic GPU node provisioning, launching the right node type directly from pending pod requirements instead of resizing pre-defined node groups
  • Namespace GPU quotas as a hard cost guardrail against runaway scaling
  • Spot/On-Demand separation to capture 70% savings on batch workloads without impacting API latency

The result is an inference cluster that scales as fast as your users need it to, procures exactly the GPU capacity required, and stops before it empties your cloud budget.

Tool Spotlight: Lambda Labs

Preemptible GPU instances with Kubernetes support — Lambda Labs offers H100, A100, and RTX 4090 nodes with automatic spot interruption handling and Kubernetes device plugin support. Alternative to Karpenter for teams wanting fully managed GPU nodes without cloud-specific integrations.