Arize Phoenix 15.4.0: Open Source LLM Observability

What's New in Arize Phoenix 15.4.0

Arize Phoenix 15.4.0, released 2026-05-05, lands a set of features that turn Phoenix from "another tracing tool" into a credible platform for production agentic RAG. The headline changes, pulled from the GitHub release notes:

Agent set_time_range tool with hardened context injection — The PXI agent can now adjust the time window on a metric chart via a tool call, with context-injection hardening that prevents the tool from being redirected by untrusted prompt content. In practice this means the agent can navigate a dashboard on the user's behalf without becoming a prompt-injection relay.
ToolPart styles and subcomponents — A new ToolPart UI primitive gives consistent rendering for tool calls in the trace view, with subcomponents for input, output, error state, and streaming. This is the building block for everything else in the agent UI.
Filter-based DELETE endpoints for span, trace, and session annotations — You can now purge annotations that match a filter expression. This sounds boring until you have 40 million spans in Postgres and need to scrub a tenant's data for a GDPR deletion request.
Token counts in span/trace/session REST payloads — Token attribution is no longer a separate API call. Every span response includes the input, output, and total token count, which makes building cost dashboards on top of Phoenix meaningfully simpler.
Simplified trace/span status icons and status badge in panel views — Visual cleanup. The old six-state status icon (running, completed, failed, cancelled, queued, partial) collapsed to a cleaner four-state indicator with a panel-level status badge.
Vendor passthrough tools — You can register a tool that delegates to an upstream vendor SDK (Anthropic tool use, OpenAI function calling, Gemini function calling) without Phoenix redefining the tool schema. The vendor's native tool definition is passed through unmodified.
Bug fix — Selecting traces within the session view now correctly preserves the active filter, instead of silently resetting to all traces on selection.

No breaking changes. Upgrade via pip install --upgrade arize-phoenix or pull the new Docker image. The release is committed as b2dfdd0 on main.

Why Phoenix, Specifically

I have run Phoenix in production on a self-hosted RAG stack since late 2025. The reason it stayed when we tried six other observability tools is the embedding drift layer. Every other tool gives you request traces — input tokens, output tokens, latency, model name, status code. Phoenix gives you that, plus a continuous measurement of how the embedding distribution of your queries and retrieved documents is shifting over time.

This is the metric that actually matters for RAG quality. A model can be returning HTTP 200 with the same latency, same token count, and same nominal cost for weeks while the embedding distance between your live queries and the indexed corpus drifts by 15%. When that drift crosses a threshold, your retrieval quality collapses and your answers start degrading — but no traditional metric will fire an alert.

Phoenix surfaces this with a single dashboard panel: a time series of the average cosine distance between query embeddings and the centroid of the indexed corpus, sliced by retrieval index version. When the line slopes up, you know your corpus has drifted before your users do.
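To make that panel concrete, here is a minimal sketch of the underlying number: the mean cosine distance between a batch of query embeddings and the corpus centroid. This is not Phoenix's implementation. The function names and the toy two-dimensional vectors are assumptions for illustration, and the sketch assumes non-zero vectors of equal length.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_corpus(queries: Sequence[Vector], corpus: Sequence[Vector]) -> float:
    """Average cosine distance from each query to the corpus centroid."""
    c = centroid(corpus)
    return sum(cosine_distance(q, c) for q in queries) / len(queries)


# Toy data (hypothetical): the corpus points mostly along the first axis.
corpus = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
on_topic = [[1.0, 0.1], [0.9, 0.0]]
off_topic = [[0.1, 1.0], [0.0, 0.9]]

print(mean_distance_to_corpus(on_topic, corpus))
print(mean_distance_to_corpus(off_topic, corpus))
```

Computed per time bucket and per index version, the second value pulling away from the first is the upward slope the panel shows.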

Architecture: Where Phoenix Fits

Phoenix runs as a single Docker container with a Postgres backend. The instrumentation layer is the OpenTelemetry SDK plus the OpenInference semantic conventions. Your application code wraps each LLM call in a @tracer decorator, invokes the Phoenix SDK directly, or (as in the block below) registers an auto-instrumentor for its framework:

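from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Create an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(
    project_name="rag-prod",  # project the traces are grouped under
    endpoint="http://phoenix-collector:6006/v1/traces",  # OTLP over HTTP
)
# Patch LangChain so its chains, retrievers, and LLM calls emit spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)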

That single block wires OpenTelemetry traces from your LangChain pipelines into Phoenix's collector. The collector normalizes the spans, extracts embedding vectors where present, and writes everything to Postgres. The Phoenix UI is a Next.js app that reads from the same Postgres instance — no separate analytics store required for deployments under ~10 million spans.

For larger deployments, Phoenix supports an S3 or GCS export tier for spans older than 30 days, with the hot path staying in Postgres. This is the configuration I run for the production cluster — Postgres for the last 30 days, S3 cold storage for everything older, and a single Phoenix UI that queries both transparently.
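The hot/cold split is easiest to reason about as a routing rule on the query's time range. The sketch below is an assumption about how such a rule could look, not Phoenix's internal logic: tiers_for_range and the tier labels are hypothetical names, and only the 30-day boundary comes from the setup described above.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=30)  # spans newer than this stay in Postgres


def tiers_for_range(start: datetime, end: datetime, now: datetime) -> list[str]:
    """Return which storage tiers a [start, end] query has to read."""
    if start > end:
        raise ValueError("start must not be after end")
    cutoff = now - HOT_RETENTION
    tiers = []
    if start < cutoff:
        tiers.append("cold")  # object-storage export (S3 or GCS)
    if end >= cutoff:
        tiers.append("hot")   # Postgres
    return tiers


now = datetime(2026, 5, 5, tzinfo=timezone.utc)
print(tiers_for_range(now - timedelta(days=7), now, now))
print(tiers_for_range(now - timedelta(days=60), now, now))
```

A range that straddles the cutoff touches both tiers, which is the case where query latency is worth watching.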

The Drift Detection Layer in Practice

Drift detection in Phoenix is not a black box. The platform computes four core metrics on the embedding space:

Euclidean distance from corpus centroid — How far, on average, are query embeddings from the centroid of the indexed corpus? Rising values mean the corpus no longer represents the query distribution.
Cosine similarity distribution — The full histogram of similarity scores between queries and their top-k retrieved documents. Watching the p10 and p90 over time tells you whether the retrieval quality is degrading asymmetrically.
Cluster entropy — How concentrated are queries within a few dominant clusters? Spikes in entropy mean new query patterns are emerging that the corpus does not cover well.
Retrieval precision drift — For spans that include ground-truth relevance labels, the precision at k over time. This requires you to log relevance signals in your application, but it is the single most actionable drift metric.
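For readers who want to sanity-check these numbers outside the UI, three of the metrics are short enough to sketch by hand. The functions below are illustrative stand-ins, not Phoenix's code: the names, the input shapes (a list of cluster labels, a ranked list of document IDs, a list of similarity scores), and the sample IDs are all assumptions.

```python
import math
from collections import Counter
from statistics import quantiles


def cluster_entropy(cluster_ids: list[str]) -> float:
    """Shannon entropy in bits of the query-to-cluster assignment."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def p10_p90(similarities: list[float]) -> tuple[float, float]:
    """10th and 90th percentile of similarity scores (needs >= 2 values)."""
    deciles = quantiles(similarities, n=10)  # nine cut points
    return deciles[0], deciles[-1]


# All queries in one cluster: zero entropy. Evenly spread over four: two bits.
print(cluster_entropy(["billing"] * 4))
print(cluster_entropy(["billing", "auth", "search", "export"]))
print(precision_at_k(["d1", "d2", "d3", "d4"], {"d1", "d3"}, k=4))
```

Tracking each value per time window, rather than as a single global number, is what turns these from summary statistics into drift signals.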

The metric I actually alert on is the first one — Euclidean distance from corpus centroid. I run this as a Grafana panel that fires a PagerDuty alert when the 1-hour moving average crosses 1.5 standard deviations above the trailing 30-day baseline. The threshold is loose on purpose: I want to know about drift early, not at the point where retrieval has visibly broken.
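The alert rule itself is a one-line comparison. Here is a minimal sketch of it, assuming the baseline arrives as a list of distance samples from the trailing 30 days and the current window as the samples from the last hour. The function name and the sample values are hypothetical, and in practice the same expression would live in the Grafana alert query rather than in Python.

```python
from statistics import mean, stdev


def drift_alert(baseline: list[float], last_hour: list[float], sigmas: float = 1.5) -> bool:
    """True when the last hour's mean exceeds baseline mean + sigmas * stdev."""
    threshold = mean(baseline) + sigmas * stdev(baseline)
    return mean(last_hour) > threshold


# Hypothetical distance samples standing in for the trailing 30-day baseline.
baseline = [0.40, 0.42, 0.38, 0.41, 0.39, 0.40]

print(drift_alert(baseline, [0.40, 0.41]))  # within normal variation
print(drift_alert(baseline, [0.50, 0.52]))  # well above the baseline
```

Lowering sigmas makes the alert fire earlier at the cost of more false positives, which is the trade-off behind the deliberately loose 1.5 setting.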

RAG Pipeline Analysis

Beyond drift, Phoenix gives you a per-trace view of every retrieval, every re-rank, and every LLM call in your RAG pipeline. The trace view breaks down into stages, each with its own latency and token attribution:

Query embedding — Time and tokens for the embedding model call.
Vector retrieval — Latency for the vector database query, plus the IDs of the top-k retrieved documents.
Re-ranking (if present) — Latency and the re-ranked order.
Context assembly — Trivial latency, but useful to verify the prompt template is producing the expected structure.
LLM call — The main inference step, with full prompt, completion, and token counts.
Post-processing — Any parsing, validation, or transformation applied to the LLM output.

This is the view you want when debugging a specific user complaint. Pull up their session, walk through the stages, and identify which one is misbehaving. I have lost count of how many "the model is suddenly terrible" tickets turned out to be a vector index that returned the wrong document because of a stale snapshot.
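The same walk-through can be done programmatically once the spans for a session have been exported. The sketch below assumes each span has already been reduced to a plain dict with a stage label and a latency; the keys "stage" and "latency_ms" and the sample values are hypothetical, not Phoenix's payload schema.

```python
from collections import defaultdict


def stage_latencies(spans: list[dict]) -> dict[str, float]:
    """Sum latency per pipeline stage across all spans in one trace."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span["stage"]] += span["latency_ms"]
    return dict(totals)


def slowest_stage(spans: list[dict]) -> str:
    """Name of the stage that accounts for the most latency."""
    totals = stage_latencies(spans)
    return max(totals, key=totals.get)


# One hypothetical trace, one span per stage listed above.
trace = [
    {"stage": "query_embedding", "latency_ms": 18.0},
    {"stage": "vector_retrieval", "latency_ms": 42.0},
    {"stage": "rerank", "latency_ms": 65.0},
    {"stage": "context_assembly", "latency_ms": 1.5},
    {"stage": "llm_call", "latency_ms": 910.0},
    {"stage": "post_processing", "latency_ms": 4.0},
]

print(slowest_stage(trace))
```

Latency only catches slow stages. A stage that is fast but wrong, such as the stale-snapshot case, still needs the retrieved document IDs inspected by hand.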

Self-Hosting Phoenix

The full Phoenix stack runs in three containers: the Phoenix server, Postgres, and (optionally) a separate collector. For a small team with low traffic, the default single-container Phoenix works fine. For production, the multi-container setup with a dedicated collector is the right call — it isolates the trace ingestion path from the UI and query path.

Postgres sizing depends on span volume. At ~5 million spans per day with 30-day retention, you are looking at 50-80 GB of indexed data with the default schema. The Phoenix team has documented compression configurations that bring this down by ~60% if storage cost matters, but I have not needed them at our current volume.
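The arithmetic behind those figures is worth writing out, because it gives a per-span number you can scale to your own volume. This is a back-of-the-envelope sketch that takes the figures above at face value and assumes 1 GB = 10^9 bytes. The variable names are illustrative.

```python
SPANS_PER_DAY = 5_000_000
RETENTION_DAYS = 30
SIZE_GB_RANGE = (50, 80)     # observed size with the default schema
COMPRESSION_SAVING = 0.60    # the ~60% reduction quoted above

spans_retained = SPANS_PER_DAY * RETENTION_DAYS  # 150 million spans


def bytes_per_span(size_gb: float) -> float:
    """Average on-disk bytes per retained span, indexes included."""
    return size_gb * 1e9 / spans_retained


per_span = tuple(round(bytes_per_span(gb)) for gb in SIZE_GB_RANGE)
compressed_gb = tuple(round(gb * (1 - COMPRESSION_SAVING), 1) for gb in SIZE_GB_RANGE)

print(per_span)       # (333, 533) bytes per span
print(compressed_gb)  # (20.0, 32.0) GB after compression
```

At roughly 330 to 530 bytes per span, doubling either the daily volume or the retention window doubles the footprint.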

Backup strategy is straightforward: standard Postgres logical backups daily, point-in-time recovery enabled, span data older than 30 days exported to S3 and not backed up locally. The S3 export is the source of truth for the historical view, and you can rebuild Postgres from S3 + a fresh span stream if you need to.

When Phoenix Is the Wrong Choice

I will be specific here because most vendor reviews are not. Phoenix is the wrong choice if:

You do not run RAG. The embedding drift layer is the core differentiator, and it is mostly noise if you are not retrieving documents. For pure text-to-text LLM applications, the platform comparison collapses to "another OpenTelemetry-compatible tracing tool," and you can pick based on which UI you prefer.
You need evaluation CI/CD. Phoenix has evaluation, but it is not a regression-gating tool. If your primary need is "block deployments when eval scores regress," Braintrust is the better fit.
You need native guardrails. Phoenix has zero guardrail features. No PII detection, no prompt injection prevention, no output constraints. If you need that, you are pairing Phoenix with a separate guardrail layer (Guardrails AI or NeMo Guardrails), and you should factor that integration cost into your decision.
You need cost attribution at the user level. The new token counts in REST payloads (15.4.0) make cost dashboards easier to build, but the platform does not ship user-level attribution. You build that on top.
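Building user-level attribution on top is mostly a group-by. The sketch below assumes you have already fetched spans and flattened each one to a dict carrying a user identifier and token counts. The keys "user_id", "input_tokens", and "output_tokens" and the per-token prices are hypothetical placeholders, not Phoenix's payload schema or any vendor's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens; substitute real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def cost_by_user(spans: list[dict]) -> dict[str, float]:
    """Total estimated USD cost per user across the given spans."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        cost = (
            span["input_tokens"] * PRICE_PER_MTOK["input"]
            + span["output_tokens"] * PRICE_PER_MTOK["output"]
        ) / 1_000_000
        totals[span["user_id"]] += cost
    return dict(totals)


spans = [
    {"user_id": "u1", "input_tokens": 1000, "output_tokens": 200},
    {"user_id": "u2", "input_tokens": 500, "output_tokens": 100},
    {"user_id": "u1", "input_tokens": 2000, "output_tokens": 400},
]

for user, usd in sorted(cost_by_user(spans).items()):
    print(user, round(usd, 4))
```

The hard part in practice is not the sum but getting a reliable user identifier onto every span at instrumentation time.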

For everything else in the LLM observability space — RAG quality monitoring, embedding drift, agent tracing with the new ToolPart components — Phoenix 15.4.0 is the strongest open-source option as of mid-2026.

Production Checklist

If you are deploying Phoenix to a production cluster, the minimum viable setup is:

Run the multi-container deployment with a dedicated collector
Postgres 15+ with WAL archiving enabled for point-in-time recovery
30-day hot retention in Postgres, S3 export for older spans
Grafana panels for the four drift metrics with alerting at 1.5σ
OpenTelemetry SDK instrumentation in every LLM call site
Filter-based DELETE endpoints wired into your GDPR data deletion pipeline (the 15.4.0 feature pays for itself the first time you need it)
Daily Postgres logical backup to a separate bucket with 90-day retention

That setup has held up on a 5-million-spans-per-day production workload for six months without intervention. The new 15.4.0 features — particularly the token counts in REST payloads and the agent set_time_range tool — make the platform meaningfully more capable for production agentic RAG without adding operational complexity.
