eBPF for AI Networking: GPU Workload Visibility

Last quarter I was debugging a multi-node vLLM deployment where inference p99 latency was 30 percent higher than the single-node baseline, and the application logs were useless. The LLM call returned 200. The model server reported normal. Prometheus showed healthy network throughput. The only signal that something was wrong was a vague, repeating pattern in the GPU event log: NVLink stalls that came and went without explanation.

When I finally attached an eBPF program to the kernel's TCP and netfilter hooks, the problem was obvious in about ten minutes. NCCL allreduce traffic between pods was being routed through a Cilium-managed overlay that was re-encrypting packets on every hop, and the encryption context was being rebuilt every few seconds because the underlying WireGuard tunnel was exhausting its key rotation cycle. The fix was a Cilium config flag, but the diagnosis was only possible because eBPF let me see what was happening to the packets after they left userspace.

This is the gap I want to talk about. AI workloads are the worst possible fit for traditional black-box monitoring. They are bursty, distributed across many nodes, GPU-bound, and they depend on fast kernel-bypass paths (RDMA, NVLink, GPUDirect) that no userspace agent can see. eBPF, which runs sandboxed programs in the kernel itself, is the observability layer that comes closest to seeing the whole path: it cannot see the bypassed data plane either, but it can see the kernel activity that sets up and surrounds it. This article is the guide I wish I had: how eBPF applies specifically to AI networking, what tools exist, and how to wire them up in a Kubernetes cluster running inference at scale.

Why AI Networking Is Different

AI workloads break every assumption that traditional network observability tools are built on. Three reasons matter.

GPU-direct paths bypass the kernel. NCCL collectives running over NVLink, GPUDirect RDMA, or RoCE use kernel-bypass paths. A standard packet capture or a userspace agent like the Datadog tracer sees nothing, because the data never reaches the Linux network stack. The performance gains (50 to 200 percent speedup for distributed training) come from skipping exactly the kernel path that monitoring tools have traditionally depended on.

Latency is dominated by the long tail, not the median. A 1000-token inference call can take 180ms at p50 while p99 reaches 1.4 seconds. The difference is almost never the LLM forward pass. It is scheduler delays, kernel queue contention, encryption context rebuilds, or silent packet retransmits on the cluster network. You cannot see any of these with HTTP-level tracing.
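To make the median-versus-tail gap concrete, here is a minimal sketch using made-up latency samples (the numbers are illustrative, not measurements). It shows how a handful of slow requests leaves p50 untouched while setting p99.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 97 fast requests and 3 that hit
# a retransmit or a scheduler stall somewhere below the HTTP layer.
latencies_ms = [180] * 97 + [1400, 1450, 1500]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50}ms p99={p99}ms")
```

Three slow requests out of a hundred are enough to define the p99, which is why averages and medians on a dashboard can look healthy while users see stalls.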

Traffic patterns are bursty and unpredictable. A single prompt with a 200k-token context can trigger a 90MB allgather. A training step with 64 GPUs exchanges gradient updates in tight synchronization windows where any single slow packet blocks every other GPU. The shape of the traffic changes constantly based on model architecture, batch composition, and KV cache state.

You need to see the kernel to debug any of this. That is what eBPF gives you.

What eBPF Sees That Userspace Agents Cannot

For the uninitiated, eBPF (extended Berkeley Packet Filter) is a Linux kernel feature that lets you run sandboxed programs at hook points throughout the kernel — system calls, network events, kernel function entries and exits, security contexts — without recompiling the kernel or loading modules. Every eBPF program passes through a static verifier before it runs, so the kernel guarantees it will not crash, run in an infinite loop, or access invalid memory.

For AI networking specifically, eBPF gives you three things nothing else can match.

1. Kernel-bypass visibility (sort of). eBPF cannot directly see RoCE or GPUDirect traffic because those paths never reach the kernel. But eBPF CAN see the syscall and control-plane traffic that orchestrates those paths — the RDMA connection setup, the CUDA driver ioctls, the NCCL rendezvous messages. In practice, the failure modes for distributed training almost always show up in the orchestration traffic, not the data plane itself. eBPF sees the orchestration, the scheduling events, the memory registrations, the QP (queue pair) state transitions.

2. Cilium and Hubble for pod-level network observability. Cilium is an eBPF-based CNI that is widely used in production AI clusters. It implements the entire Kubernetes networking layer in eBPF: kube-proxy replacement, network policy, service load balancing, encryption. Hubble is its observability layer. Together, they give you per-pod, per-flow, per-policy traffic visibility that is hard to get any other way. Hubble can tell you that pod vllm-worker-3 sent 14,000 packets to vllm-worker-7 over the past minute, 23 of which were dropped by NetworkPolicy deny-cross-az.

3. Socket and TCP-level tracing with low overhead. Tools like bpftrace, Pixie, and Beyla (originally from Grafana Labs, since donated to OpenTelemetry) attach eBPF probes to socket operations, TCP state machines, and the scheduler to give you per-flow latency distributions, retransmit rates, and congestion window behavior. Overhead is often under 1 percent for lightweight probes, and usually far less than sidecar or userspace agents.
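As an illustration of the per-flow visibility described in point 2, the sketch below counts dropped flows from Hubble's JSON output. It assumes one JSON object per line, as printed by `hubble observe --output json`, and the field names used here (`flow.verdict`, `flow.drop_reason_desc`, `flow.source.pod_name`, `flow.destination.pod_name`) are my assumption about that format; check them against your Hubble version. The sample records are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample of exported flow records. The field names are an
# assumption about the export format, not a guaranteed schema.
SAMPLE = """\
{"flow": {"verdict": "FORWARDED", "source": {"pod_name": "vllm-worker-3"}, "destination": {"pod_name": "vllm-worker-7"}}}
{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED", "source": {"pod_name": "vllm-worker-3"}, "destination": {"pod_name": "vllm-worker-7"}}}
{"flow": {"verdict": "DROPPED", "drop_reason_desc": "POLICY_DENIED", "source": {"pod_name": "vllm-worker-3"}, "destination": {"pod_name": "vllm-worker-7"}}}
"""

def count_drops(lines):
    """Count dropped flows per (source pod, destination pod, drop reason)."""
    drops = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        flow = json.loads(line).get("flow", {})
        if flow.get("verdict") != "DROPPED":
            continue
        key = (
            flow.get("source", {}).get("pod_name", "unknown"),
            flow.get("destination", {}).get("pod_name", "unknown"),
            flow.get("drop_reason_desc", "unknown"),
        )
        drops[key] += 1
    return drops

drops = count_drops(SAMPLE.splitlines())
for (src, dst, reason), n in drops.most_common():
    print(f"{src} -> {dst}: {n} dropped ({reason})")
```

In practice you would feed this from a file or a pipe rather than an inline string; the grouping key is the part worth adapting to the question you are asking.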

The Core Tooling Stack

For an AI infrastructure team in 2026, the practical eBPF+AI networking stack is four tools. I run all four on every cluster I touch.

Cilium 1.19+ for the CNI

Cilium 1.19, released in 2026, is the most production-hardened eBPF CNI for Kubernetes. The recent releases added features that matter directly for AI networking:

Encryption strict mode for both IPsec and WireGuard, which forces all inter-pod traffic to be encrypted. For AI clusters running on shared bare metal or multi-tenant cloud, this is the only way to meet compliance requirements without sacrificing kernel-bypass performance for the data plane itself.
Accelerated IPsec with BPF host routing — a 2026 change that makes IPsec lookups an order of magnitude faster. Previously, the IPsec route lookup itself was a bottleneck for high-throughput inference traffic.
Hubble filtered flows with encryption status, network policy attribution, and per-pod flow logs. Hubble v1.Events now tag drops with the specific NetworkPolicy that caused them — when an inference pod is mysteriously silent, you can immediately see which policy decision blocked its traffic.
Trace IP options — embed a custom IP option in a packet to follow it through every hop in the cluster. Indispensable for debugging "where did this request disappear to" questions in multi-tenant clusters.

Hubble for the observability layer

Hubble is the observability face of Cilium. It exposes flows, metrics, and events over a standard API that you can wire into Prometheus, Grafana, and any OpenTelemetry-compatible backend. For an AI cluster, the Hubble metrics I rely on most:

hubble_flows_processed_total: total flows observed. Sanity check that Hubble is actually capturing traffic from the pods you care about.
hubble_dns_queries_total and hubble_dns_responses_total: DNS query and response counts for service discovery. These are counters, not latency histograms, so watch for a widening gap between queries and responses or a query rate inflated by retries. AI workloads are particularly sensitive to DNS; a 50ms DNS resolution on every retry adds up.
hubble_drop_total{reason="..."}: drops by reason. The most common AI-cluster drops are Policy denied (your NetworkPolicy is too tight) and CT map full (connection tracking is exhausted).
hubble_port_distribution_total: which ports are seeing traffic. Useful for spotting pods that are unexpectedly chatty.
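All of these are monotonically increasing counters, so the number you alert on is the rate of change between two scrapes rather than the raw value. The sketch below is a simplified version of that calculation, assuming two hypothetical scrapes of hubble_drop_total keyed by the reason label; the label values and counts are made up for illustration.

```python
def drop_rates(previous, current, interval_s):
    """Per-second increase of each counter between two scrapes.

    A counter that went backwards is treated as a reset (the exporter
    restarted), so only the post-reset value is counted.
    """
    rates = {}
    for reason, value in current.items():
        before = previous.get(reason, 0)
        delta = value - before if value >= before else value
        rates[reason] = delta / interval_s
    return rates

# Hypothetical hubble_drop_total samples, keyed by reason label, 60 s apart.
scrape_t0 = {"Policy denied": 1200, "CT map full": 40}
scrape_t1 = {"Policy denied": 1500, "CT map full": 40}

rates = drop_rates(scrape_t0, scrape_t1, interval_s=60)
print(rates)
```

Prometheus does this for you with its rate function; the point of the sketch is that a large absolute counter value means nothing on its own, while a non-zero rate during an incident does.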

bpftrace for ad-hoc kernel investigation

When Cilium and Hubble do not give you enough — usually because the question is "what is the kernel doing to this traffic right now" — bpftrace is the Swiss army knife. It is a high-level tracing language for eBPF, with one-liners that answer most questions. For AI networking, my go-tos:

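# Trace TCP retransmits per connection (run as root; retransmits fire in kernel context, so pid is not reliable here)
bpftrace -e 'tracepoint:tcp:tcp_retransmit_skb { printf("retransmit saddr=%s sport=%d daddr=%s dport=%d\n", ntop(args->saddr), args->sport, ntop(args->daddr), args->dport); }'

# Trace TLS handshake latency (adjust the libssl path for your image)
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_do_handshake { @start[tid] = nsecs; } uretprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_do_handshake /@start[tid]/ { printf("ssl handshake pid=%d duration_ns=%lld\n", pid, nsecs - @start[tid]); delete(@start[tid]); }'

# Trace which process is calling connect() on which IPv4 destination (needs kernel BTF; map the pid to a pod via its cgroup)
bpftrace -e 'tracepoint:syscalls:sys_enter_connect { $sa = (struct sockaddr_in *)args->uservaddr; if ($sa->sin_family == 2) { $p = $sa->sin_port; printf("connect pid=%d comm=%s fd=%d dst=%s:%d\n", pid, comm, args->fd, ntop($sa->sin_addr.s_addr), (($p >> 8) | ($p << 8)) & 0xffff); } }'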

These are not for production dashboards — they are for the 30 minutes of "I have no idea why this is slow" debugging that every distributed inference deployment eventually hits.
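Once a one-liner has been running for a few minutes, the raw output is too long to read by eye. A minimal sketch for summarizing it, assuming you saved trace lines that use a key=value layout; the sample lines and port numbers below are hypothetical.

```python
from collections import Counter

def parse_fields(line):
    """Split a 'name key=value key=value' trace line into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

# Hypothetical captured output; only the key=value layout is assumed.
captured = [
    "retransmit sport=41022 dport=29500",
    "retransmit sport=41022 dport=29500",
    "retransmit sport=52344 dport=8000",
]

by_dport = Counter(parse_fields(line).get("dport", "?") for line in captured)
print(by_dport.most_common())
```

Grouping by destination port is usually enough to tell whether the retransmits belong to the collective traffic or to the HTTP front end.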

OpenTelemetry eBPF auto-instrumentation (Pixie / Beyla)

For the application side of AI networking, eBPF-based auto-instrumentation tools like Pixie and Beyla hook the kernel's socket layer and common runtime libraries underneath the application (Python for most AI workloads) and emit telemetry, including OpenTelemetry traces, without any code changes. They capture HTTP, gRPC, and database calls out of the box. More importantly, the same uprobe technique can be pointed at the CUDA driver library and at NCCL to capture CUDA driver calls and the Python-to-NCCL transitions, though expect to write or enable those probes yourself. This is how you get an end-to-end picture: HTTP request arrives at the inference gateway, the gateway calls vLLM, vLLM calls NCCL allreduce, NCCL exchanges memory over RDMA, results come back. Each leg of that journey is a different observability surface, and eBPF reaches all of them except the RDMA data plane itself.

What to Monitor for AI Workloads Specifically

Beyond the standard Cilium/Hubble metrics, there are signals I have learned to alert on for AI clusters that the general networking guidance misses.

RDMA connection churn. A healthy distributed training job establishes a fixed number of RDMA queue pairs and keeps them stable for the duration of the run. If you see QPs being torn down and rebuilt at high frequency, something is killing and restarting workers, or the NCCL init is failing. Use rdma resource show qp (from iproute2) on the host for the current QP count and ibstat for port state, plus eBPF probes on kernel functions such as rdma_resolve_route and ib_modify_qp.
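A minimal sketch of the churn check, assuming you already have timestamps for queue pair create and destroy events from whatever probe you attached. The event list, window size, and threshold are all hypothetical values chosen for illustration.

```python
from collections import Counter

def churn_windows(event_times_s, window_s=10, threshold=5):
    """Return start times of windows containing more than `threshold` QP events."""
    per_window = Counter(int(t // window_s) * window_s for t in event_times_s)
    return sorted(start for start, n in per_window.items() if n > threshold)

# Hypothetical timestamps (seconds since job start) of queue pair
# create/destroy events: a small setup burst, then silence, then a
# suspicious burst around t=120.
events = [0.1, 0.2, 0.3, 0.4, 120.0, 120.5, 121.0, 122.0, 123.5, 125.0, 127.0]

print(churn_windows(events))
```

A real job creates many queue pairs during initialization, so in practice you would either raise the threshold or ignore the first window after job start.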

GPU scheduler latency. Linux's CPU scheduler (CFS, or EEVDF on kernel 6.6 and newer) was not designed with GPU workloads in mind. A 50ms scheduling delay before the host thread that launches a CUDA kernel gets CPU time is invisible to GPU-level metrics but catastrophic for tail latency. The sched:sched_wakeup and sched:sched_switch tracepoints (with sched:sched_stat_runtime for per-task runtime), attached through a BPF_PROG_TYPE_TRACING program, let you measure the gap between when that thread became runnable and when the scheduler actually ran it. The nvidia-cuda-exporter from Pixie has prebuilt probes for this.
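One common way to measure this delay is to pair each wakeup of a thread with the next time that thread is switched onto a CPU. The sketch below shows only the pairing logic, run offline over a hypothetical list of events; the tuple layout and the thread IDs are assumptions for illustration, not the output format of any particular tool.

```python
def runqueue_latencies(events):
    """Pair each wakeup with the next switch-in of the same thread.

    `events` is an iterable of (timestamp_ns, kind, tid) tuples, where kind is
    "wakeup" (thread became runnable) or "switch_in" (thread got a CPU).
    Returns a list of (tid, latency_ns) in order of switch-in time.
    """
    woken_at = {}
    latencies = []
    for ts, kind, tid in sorted(events):
        if kind == "wakeup":
            # Keep the earliest wakeup if several arrive before the switch.
            woken_at.setdefault(tid, ts)
        elif kind == "switch_in" and tid in woken_at:
            latencies.append((tid, ts - woken_at.pop(tid)))
    return latencies

# Hypothetical events for two threads; tid 4242 waits 50 ms for a CPU.
events = [
    (1_000_000, "wakeup", 4242),
    (1_200_000, "wakeup", 4243),
    (1_250_000, "switch_in", 4243),
    (51_000_000, "switch_in", 4242),
]

for tid, latency_ns in runqueue_latencies(events):
    print(f"tid={tid} waited {latency_ns / 1e6:.2f} ms")
```

An in-kernel version keeps the same map of wakeup timestamps keyed by thread ID and emits a histogram instead of a list, which keeps the overhead low on busy nodes.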

NCCL allreduce phase time. NCCL exposes a profiling hook, but the most useful signal is the wall-clock time between the start of an allreduce and the time all ranks have arrived. If that time spikes without the network being saturated, the problem is almost always on the GPU side (a slow rank that is doing other work) and not the network. The nccl-tests benchmark plus an eBPF probe on the NCCL rendezvous messages gives you the diagnostic.
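The signal described here reduces to a small calculation: the skew between the first and last rank to arrive at the collective, plus the identity of the late rank. A minimal sketch, with hypothetical arrival times for an 8-rank job:

```python
def arrival_skew(arrivals_ms):
    """Return (skew_ms, slowest_rank) for one collective.

    `arrivals_ms` maps rank -> time at which the rank entered the collective.
    """
    slowest_rank = max(arrivals_ms, key=arrivals_ms.get)
    skew = arrivals_ms[slowest_rank] - min(arrivals_ms.values())
    return skew, slowest_rank

# Hypothetical arrival times for an 8-rank allreduce; rank 5 is late.
arrivals = {0: 10.0, 1: 10.2, 2: 10.1, 3: 10.3, 4: 10.0, 5: 42.5, 6: 10.2, 7: 10.1}

skew_ms, rank = arrival_skew(arrivals)
print(f"skew={skew_ms:.1f}ms slowest_rank={rank}")
```

If the same rank is the slowest across many consecutive collectives, look at that node's GPU and host first; if the slowest rank moves around, the network is a more likely suspect.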

Per-pod encryption CPU cost. WireGuard and IPsec both consume CPU. For high-throughput inference on smaller instances (8 to 16 vCPU), encryption can become the bottleneck. Cilium's cilium_bpf_map_ops_total plus the host's softirq CPU time tell you when encryption is eating your latency budget. Both are indirect signals, so confirm with a CPU profile before blaming encryption.
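The softirq half of that signal can be read from two samples of the aggregate cpu line in /proc/stat. The sketch below uses hypothetical sample lines; note that softirq time covers all network receive processing, not only encryption, so treat a high share as a prompt to profile rather than as proof.

```python
def softirq_share(stat_before, stat_after):
    """Fraction of CPU time spent in softirq between two /proc/stat 'cpu' lines."""
    def fields(line):
        # Layout: cpu user nice system idle iowait irq softirq steal guest guest_nice
        return [int(x) for x in line.split()[1:]]

    before, after = fields(stat_before), fields(stat_after)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas[:8])  # guest time is already counted inside user/nice
    return deltas[6] / total if total else 0.0

# Hypothetical samples of the aggregate "cpu" line taken a few seconds apart.
t0 = "cpu 1000 0 500 8000 100 0 400 0 0 0"
t1 = "cpu 1300 0 650 8600 100 0 750 0 0 0"

share = softirq_share(t0, t1)
print(f"softirq share: {share:.1%}")
```

On a node, read the first line of /proc/stat twice with a short sleep in between and pass both strings to the function.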

Building the Dashboard

The minimum viable eBPF-AI-networking dashboard has four panels, in priority order:

Hubble drop reasons by namespace — bar chart. If any inference namespace is showing non-zero drops, something is broken.
TCP retransmits per pod — heatmap. The pods that retransmit the most are the pods that will hit p99 latency problems first.
RDMA queue pair count — single-stat. Should be steady during a training run. Spikes mean workers are dying.
Encryption CPU time — graph. Tracks WireGuard/IPsec cost over time. Spikes often correlate with encryption context rebuilds (the bug I opened with).

Wire this to Prometheus via the Cilium Helm chart's Hubble metrics settings (hubble.metrics.enabled), and you have kernel-level network observability comparable to what large operators build in-house, without sending a single byte to a SaaS vendor.

Limitations and Gotchas

eBPF is not magic. Three things bite teams who adopt it for AI networking.

You still cannot see RoCE/RoCEv2 data plane traffic. The packets themselves do not traverse the kernel. You can see the setup, the teardown, and the orchestration, but not the data. For full visibility you need a hardware tap or a switch with mirroring. eBPF is the right tool for "why is the control plane slow," but not for "what is on the wire."

Verifier limits block some programs. The kernel verifier is conservative. Complex eBPF programs (more than about 1M instructions explored by the verifier) will be rejected. Pixie and Cilium work around this with program splitting, typically chaining the pieces with tail calls. CO-RE (Compile Once, Run Everywhere) solves a different problem, portability across kernel versions, and does not relax the verifier. Custom one-off programs in bpftrace usually fit, but if you are trying to ship a complex custom probe you will hit this.

Kernel version matters. eBPF features accumulate quickly. Cilium 1.19 needs Linux 5.10 or newer for full functionality. If you are pinned to an older LTS (Ubuntu 20.04 ships 5.4, for example), you will lose features. The default EKS AMIs and GKE node images in 2026 are fine. Self-managed clusters on older distros will need attention.
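A quick way to audit a fleet is to compare each node's kernel release string against the minimum you need. A minimal sketch, using the 5.10 floor cited above and a few example release strings; the helper names are hypothetical.

```python
import re

MIN_KERNEL = (5, 10)  # minimum cited above; raise it for newer features

def kernel_version(release):
    """Extract (major, minor) from a release string such as '5.4.0-150-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))

def is_supported(release, minimum=MIN_KERNEL):
    """True if the release is at or above the minimum (major, minor)."""
    return kernel_version(release) >= minimum

# Example release strings in the format printed by `uname -r`.
for release in ("5.4.0-150-generic", "5.15.0-1051-aws", "6.8.0-45-generic"):
    print(release, "ok" if is_supported(release) else "too old")
```

On a node you can pass the output of `uname -r`, or `platform.release()` from the Python standard library, to `is_supported`. Comparing tuples avoids the classic string-comparison bug where "5.4" sorts after "5.10".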

Where to Start

If you have no eBPF on your cluster today, the order of operations that has worked for me:

Install Cilium 1.19+ as the CNI on a new cluster. Keep kube-proxy running initially so you can compare, then switch to Cilium's kube-proxy replacement. Watch your existing latency benchmarks: teams often report a 5 to 15 percent improvement just from replacing iptables with eBPF, before turning on any observability.
Enable Hubble with the default metrics export. Wire the four dashboard panels above. Spend a week just observing.
Install Pixie or Beyla for application-side auto-instrumentation. You will get HTTP/gRPC traces for your inference gateways without changing a line of code.
Keep bpftrace installed on a bastion. The day you have a mystery, it will save you hours.

The hardest part is not the tooling — it is accepting that AI networking needs a fundamentally different observability layer than web services. eBPF is that layer. It is the only thing that can see the kernel, the scheduler, and the orchestration traffic that actually determines whether your distributed inference job runs in 200ms or 2 seconds.
