
AI Agent Observability: The Signals That Turn Autonomous Behavior From a Black Box Into a Debuggable System

Application observability (metrics, logs, traces) misses the signals that matter for AI agents: which tools the agent called, which sub-agent it delegated to, which policy decision permitted or denied each step, how many tokens each decision cost. Autonomous behavior without per-step observability is an unauditable black box. This covers the signals a production agent has to emit, the OpenTelemetry semantic conventions taking shape, and where the AI request boundary fits in the telemetry pipeline.

By Parminder Singh · Founder & CEO, DeepInspect Inc.
Platform & Architecture · ai-agent-observability, ai-agent-security, opentelemetry, ai-security, agentic-ai, traces

An AI agent that runs autonomously across multiple tool calls, retrieves context from vector stores, delegates to sub-agents, and hands the final result back to the calling application shows up in most application observability stacks as a single trace with a single span: the entry-point request and its final response. Everything the agent did in between (tool selection, retrieval hits, policy evaluations, token consumption per step) collapses into an opaque black box. When a customer support agent gives a refund to the wrong account, the incident review team has one signal to work with: the fact that a refund happened. The signals that would answer "why" are missing from the telemetry pipeline. I want to walk through the observability signals AI agents have to emit, the OpenTelemetry semantic conventions taking shape, and where the AI request boundary sits in the pipeline.

Application observability treats the agent as one call. Agent observability treats each internal step as a call.

The signals that matter

Six signal categories separate a debuggable agent from a black box.

Tool call traces. Every tool the agent selects and every tool call it issues has to land in the trace with the tool name, arguments, and result. When a coding agent runs git commit with the wrong scope or a support agent calls the issue_refund tool with a customer ID it derived incorrectly, the trace shows the argument the agent chose, not just the effect.

Delegation traces. In multi-agent workflows, the parent agent hands sub-tasks to child agents. The trace has to connect the parent's decision to delegate with the child's execution, so the incident review can follow the delegation chain end-to-end. The ai agent lateral movement piece covers the attack surface delegation opens.

Policy decisions. Every policy evaluation at the AI request boundary produces a signal: which policy version applied, which identity claim it evaluated against, which decision (allow, deny, transform) it returned. The policy-as-code piece covers the artifact side; observability covers the runtime side.

Retrieval hits. For agents backed by RAG, the retrieval step selects context that shapes the model's output. The trace has to include the retrieval query, the documents returned, and their similarity scores. When the agent hallucinates from retrieved content, the retrieval trace shows what the model actually saw.

Token consumption per step. Cost attribution runs on per-step token counts, not per-request totals. An agent that runs 40 tool calls in a session spends tokens on every model call that selects a tool and reads its result; the session aggregate hides which step is expensive. The ai cost attribution per team piece covers the record design.

Content classification tags. When the request or response payload triggers a classifier (PII, PHI, prompt injection heuristics, jailbreak patterns), the classification lands in the trace so the incident review can filter for high-risk sessions.
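To make the categories concrete, here is a minimal sketch of one session modeled as plain span records, using only the Python standard library rather than the OpenTelemetry SDK. The `Span` class, the span names, and every `app.*` attribute are hypothetical names chosen for illustration; `gen_ai.tool.name` follows the GenAI naming style and should be checked against the current spec before use.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a span; not the OpenTelemetry SDK type."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: dict = field(default_factory=dict)


def tokens_by_step(spans):
    """Sum input plus output tokens per span name, skipping spans with no token usage."""
    totals = defaultdict(int)
    for s in spans:
        used = (s.attributes.get("gen_ai.usage.input_tokens", 0)
                + s.attributes.get("gen_ai.usage.output_tokens", 0))
        if used:
            totals[s.name] += used
    return dict(totals)


def delegation_chain(spans, span_id):
    """Walk parent links from a span back to the root and return names root-first."""
    by_id = {s.span_id: s for s in spans}
    chain = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id)
    return list(reversed(chain))


session = [
    Span("t1", "s1", None, "invoke_agent support", {"gen_ai.agent.id": "support"}),
    Span("t1", "s2", "s1", "chat plan",
         {"gen_ai.usage.input_tokens": 1200, "gen_ai.usage.output_tokens": 80}),
    Span("t1", "s3", "s1", "policy evaluate",
         {"app.policy.version": "v12", "app.policy.decision": "allow"}),
    Span("t1", "s4", "s1", "execute_tool issue_refund",
         {"gen_ai.tool.name": "issue_refund",
          "app.tool.arguments": '{"customer_id": "C-1002"}'}),
    Span("t1", "s5", "s1", "invoke_agent billing", {"gen_ai.agent.id": "billing"}),
    Span("t1", "s6", "s5", "chat summarize",
         {"gen_ai.usage.input_tokens": 300, "gen_ai.usage.output_tokens": 40}),
]

step_tokens = tokens_by_step(session)
chain = delegation_chain(session, "s6")
```

With records shaped like this, the refund incident becomes answerable: the `issue_refund` span carries the customer ID the agent chose, and the parent links show which agent made the call.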

OpenTelemetry semantic conventions for GenAI

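The OpenTelemetry GenAI working group has been building semantic conventions for AI observability since 2024. The conventions define span attributes for LLM calls (gen_ai.system, which newer revisions rename to gen_ai.provider.name, plus gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens) and are extending to agent-level attributes (gen_ai.operation.name, which distinguishes operations such as a tool execution, a chat completion, and an embedding call, and gen_ai.agent.id). The conventions are still evolving, so pin the version you instrument against.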

The conventions matter because they make traces portable across observability vendors. A trace emitted with OTel GenAI attributes flows into Datadog, Grafana, Honeycomb, or a custom SIEM without vendor-specific transformation. The ai request logging format piece covers the mapping between the AI audit log format and the OTel semantic conventions.

The gap in the current conventions is on the policy and identity side. OTel GenAI conventions cover what the model did. They do not, yet, cover which identity the request was bound to or which policy decision applied. That gap is where AI-security-specific instrumentation extends the standard set.
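One way to fill that gap is to layer organization-specific attributes on top of the standard ones under a separate namespace. The sketch below is a minimal illustration with plain dictionaries. The `app.identity.subject`, `app.policy.version`, and `app.policy.decision` keys, the model name, and the helper name are all assumptions, not part of any standard.

```python
# Standard GenAI attributes named in the conventions; values are placeholders.
llm_call_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 80,
}

ALLOWED_DECISIONS = {"allow", "deny", "transform"}


def with_security_attributes(base, subject, policy_version, decision):
    """Return a copy of `base` extended with hypothetical identity and policy attributes."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown policy decision: {decision!r}")
    extended = dict(base)  # copy so the emitter's original attributes stay untouched
    extended.update({
        # Custom keys live under `app.` so they cannot collide with `gen_ai.*`.
        "app.identity.subject": subject,
        "app.policy.version": policy_version,
        "app.policy.decision": decision,
    })
    return extended


span_attributes = with_security_attributes(
    llm_call_attributes, "user:alice", "v12", "deny"
)
```

Keeping the custom keys out of the `gen_ai.*` namespace means the span stays valid if the standard later adds its own identity or policy attributes.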

The pipeline

The production pipeline for AI agent observability runs in three stages.

Stage one is emission. Each component in the agent stack (the framework runtime, the tool implementations, the retrieval layer, the policy engine, the AI gateway) emits OTel spans with GenAI attributes. The agent framework's built-in instrumentation covers most of the model-call spans; the policy engine and gateway emit the identity and policy spans that the framework does not.

Stage two is collection and enrichment. An OTel collector receives the spans, enriches them with environment metadata (deployment tag, region, tenant), and routes them to the observability backend. Enrichment at this stage lets the operations team query by tenant or by deployment version without every emitter having to know its own context.
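The enrichment step can be pictured as a merge of environment metadata into each span's attributes. This is a sketch of the idea in plain Python, not collector configuration; the `app.deployment.tag`, `app.region`, and `app.tenant` keys and the choice to let emitter-set values win on collision are assumptions.

```python
# Hypothetical environment metadata known to the collector, not to each emitter.
ENVIRONMENT = {
    "app.deployment.tag": "2024-11-rc3",
    "app.region": "eu-west",
    "app.tenant": "acme",
}


def enrich(span_attributes, environment):
    """Return a new dict with environment metadata added.

    Keys already set by the emitter take precedence, so enrichment only
    fills gaps and never overwrites what a component reported about itself.
    """
    return {**environment, **span_attributes}


def enrich_batch(batch, environment):
    """Apply `enrich` to every span attribute dict in a batch."""
    return [enrich(attrs, environment) for attrs in batch]


batch = [
    {"gen_ai.request.model": "example-model"},
    {"gen_ai.tool.name": "issue_refund", "app.region": "us-east"},
]
enriched = enrich_batch(batch, ENVIRONMENT)
```

Whether the collector or the emitter should win on a collision is a design choice; this sketch fills gaps only, which keeps an emitter's explicit value intact.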

Stage three is analysis. The observability backend runs the queries the operations team needs (per-tenant token spend, per-tool error rate, per-policy denial rate) and the security team needs (sessions with content classifier triggers, delegation chains that crossed trust boundaries, requests whose policy evaluation errored). The ai gateway observability piece covers the operational side.
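The operational queries named above reduce to grouped sums and ratios over span attributes. A real backend would express them in its own query language; the sketch below runs them over a list of dictionaries to show the shape of the computation. The `app.*` keys and the boolean `error` field are hypothetical.

```python
from collections import defaultdict


def token_spend_by_tenant(spans):
    """Total input plus output tokens per tenant."""
    spend = defaultdict(int)
    for s in spans:
        spend[s["app.tenant"]] += (s.get("gen_ai.usage.input_tokens", 0)
                                   + s.get("gen_ai.usage.output_tokens", 0))
    return dict(spend)


def rate_by(spans, group_key, is_hit):
    """Fraction of spans per group for which `is_hit` is true.

    Spans that lack `group_key` are ignored, so the denominator only
    counts spans of the relevant kind.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for s in spans:
        if group_key not in s:
            continue
        totals[s[group_key]] += 1
        if is_hit(s):
            hits[s[group_key]] += 1
    return {group: hits[group] / totals[group] for group in totals}


spans = [
    {"app.tenant": "acme", "gen_ai.usage.input_tokens": 1000,
     "gen_ai.usage.output_tokens": 200},
    {"app.tenant": "acme", "gen_ai.tool.name": "issue_refund", "error": True},
    {"app.tenant": "acme", "gen_ai.tool.name": "issue_refund", "error": False},
    {"app.tenant": "globex", "gen_ai.usage.input_tokens": 500,
     "gen_ai.usage.output_tokens": 50},
    {"app.tenant": "globex", "app.policy.version": "v12", "app.policy.decision": "deny"},
    {"app.tenant": "acme", "app.policy.version": "v12", "app.policy.decision": "allow"},
]

spend = token_spend_by_tenant(spans)
tool_error_rate = rate_by(spans, "gen_ai.tool.name", lambda s: s.get("error", False))
denial_rate = rate_by(spans, "app.policy.version",
                      lambda s: s.get("app.policy.decision") == "deny")
```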

Where the AI request boundary fits

The agent framework alone cannot produce the identity and policy signals. The framework knows the model call happened; it does not know whether the calling identity was authorized, which policy version applied, or which classification the payload carried. Those signals originate at the AI request boundary where identity binding and policy evaluation happen.

Two integration patterns dominate. In the first, the gateway emits spans directly to the observability pipeline and correlates them to the agent framework's spans by trace ID. In the second, the agent framework calls into the gateway's SDK, which emits both the framework span and the gateway span with attribute continuity. Both work; the choice depends on which team owns the framework and which team owns the gateway.
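Correlation by trace ID usually rides on the W3C Trace Context `traceparent` header, whose version `00` layout is `version-traceid-parentid-flags`. The sketch below shows how a gateway could keep the caller's trace ID while minting its own span ID. The function names are hypothetical, and it handles only the version `00` layout.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags) or raise ValueError if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("traceparent carries an all-zero ID")
    return trace_id, parent_id, flags


def child_traceparent(header):
    """Build the header a gateway forwards: same trace ID and flags, new span ID."""
    trace_id, _caller_span_id, flags = parse_traceparent(header)
    gateway_span_id = secrets.token_hex(8)  # 8 random bytes, 16 hex characters
    return f"00-{trace_id}-{gateway_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Because the trace ID survives each hop, the framework's spans and the gateway's spans land in the same trace without either side knowing the other's internals.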

DeepInspect

Emitting those identity and policy signals is what DeepInspect does. DeepInspect sits at the AI request boundary as an external enforcement layer. Every request emits OpenTelemetry spans with GenAI semantic attributes, extended with identity claim attributes, policy version attributes, and content classification attributes. Spans flow into the customer's existing OTel collector and land in whichever observability backend the operations team already runs.

Trace IDs propagate through the agent framework, the gateway, and the LLM provider's response, so a single incident review query returns the full session from user entry point to final tool call. The audit log (which is the durable, tamper-evident record) references the same trace ID for cross-referencing.

Book a technical deep dive at deepinspect.ai.

Frequently asked questions

How is AI agent observability different from LLM observability?

LLM observability covers the model call: prompt, response, token counts, latency. Agent observability covers everything around the model call: tool selection, tool arguments, sub-agent delegation, retrieval, policy evaluation, identity binding. A production agent produces dozens of spans per user request. An LLM call produces one.

Does OpenTelemetry cover this?

Partially. The OTel GenAI semantic conventions cover LLM calls and are extending to agents. Identity binding and policy decisions are not yet part of the standard set. Production deployments extend the OTel attribute set with organization-specific attributes for the gaps.

How much telemetry data does an agent produce?

An agent that runs 40 tool calls in a session produces 40 tool call spans plus surrounding spans for retrieval, policy, and delegation. A high-throughput agent deployment can produce hundreds of gigabytes of trace data per day. Sampling strategies (head sampling for cost-sensitive traces, tail sampling for error and high-latency traces) apply the same way they apply for application traces.
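The difference between the two sampling strategies is when the decision is made. A head decision uses only the trace ID, before any span finishes; a tail decision sees the completed trace. The sketch below illustrates both under assumed field names (`trace_id`, `error`, `duration_ms`) and is not how any particular collector implements them.

```python
import hashlib


def head_sample(trace_id, ratio):
    """Deterministic keep/drop decision from the trace ID alone.

    Hashing maps the ID to a bucket in [0, 1), so every component that
    sees the same trace ID reaches the same decision.
    """
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32
    return bucket < ratio


def tail_sample(spans, latency_threshold_ms, baseline_ratio):
    """Keep every trace with an error or a slow span; sample the rest at a baseline ratio."""
    if any(s.get("error") for s in spans):
        return True
    if any(s.get("duration_ms", 0) > latency_threshold_ms for s in spans):
        return True
    return head_sample(spans[0]["trace_id"], baseline_ratio)


errored = [{"trace_id": "t1", "duration_ms": 120, "error": True}]
slow = [{"trace_id": "t2", "duration_ms": 9000}]
healthy = [{"trace_id": "t3", "duration_ms": 150}]

keep_errored = tail_sample(errored, latency_threshold_ms=5000, baseline_ratio=0.0)
keep_slow = tail_sample(slow, latency_threshold_ms=5000, baseline_ratio=0.0)
keep_healthy = tail_sample(healthy, latency_threshold_ms=5000, baseline_ratio=0.0)
```

For agents, a security team may want to add a third keep rule for traces that contain a policy denial or a classifier trigger, since those are the sessions an incident review is most likely to need.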

Where should the observability data live?

Traces belong in the observability backend the operations team already runs (Datadog, Grafana Tempo, Honeycomb, or self-hosted Jaeger). The tamper-evident audit log belongs in an append-only store separate from the observability backend, because the audit log has to survive scenarios where the observability backend is unavailable or is compromised. The ai audit log immutability piece covers the storage-layer contract.
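One common way to make an append-only log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is an assumption about how such a store could work, not a description of any specific product's storage contract; the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, record):
    """Hash the previous entry's hash together with a canonical JSON form of the record."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_record(log, record):
    """Append a record, chaining it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": _digest(prev_hash, record)})


def verify(log):
    """Recompute the chain from the start; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_record(audit_log, {"trace_id": "t1", "policy_version": "v12", "decision": "deny"})
append_record(audit_log, {"trace_id": "t2", "policy_version": "v12", "decision": "allow"})
intact = verify(audit_log)

# Simulate tampering: rewrite a past decision without recomputing the chain.
audit_log[0]["record"]["decision"] = "allow"
after_tamper = verify(audit_log)
```

A chain like this detects edits but does not prevent them, which is why the store also needs to be append-only and separate from the observability backend.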

How do we correlate observability data with audit logs?

Trace ID propagation. Every span in the observability pipeline and every record in the audit log carries the same trace ID for a given user request. Incident review starts in either place (a suspicious span in the trace, an unusual policy decision in the audit log) and pivots to the other by trace ID.
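The pivot itself is a lookup keyed on trace ID in both stores. As a minimal sketch, assuming both spans and audit records are dictionaries that carry a `trace_id` field (the other field names are hypothetical):

```python
from collections import defaultdict


def index_by_trace(records):
    """Group records by their trace ID."""
    index = defaultdict(list)
    for record in records:
        index[record["trace_id"]].append(record)
    return index


def pivot(trace_id, span_index, audit_index):
    """Return everything both stores hold for one user request."""
    return {
        "spans": span_index.get(trace_id, []),
        "audit": audit_index.get(trace_id, []),
    }


spans = [
    {"trace_id": "t1", "name": "execute_tool issue_refund"},
    {"trace_id": "t1", "name": "chat plan"},
    {"trace_id": "t2", "name": "chat plan"},
]
audit_records = [
    {"trace_id": "t1", "decision": "allow", "policy_version": "v12"},
    {"trace_id": "t2", "decision": "deny", "policy_version": "v12"},
]

span_index = index_by_trace(spans)
audit_index = index_by_trace(audit_records)

# Start from an unusual audit decision and pull the matching spans.
denied = [r for r in audit_records if r["decision"] == "deny"]
review = pivot(denied[0]["trace_id"], span_index, audit_index)
```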

What is the latency cost of emitting these signals?

Sub-millisecond per span with a local OTel collector and async export. The dominant latency in the AI request path is the model call itself (hundreds of milliseconds to seconds). Observability adds under 1% to that total. The ai gateway latency piece covers the measurement methodology.