The first production incident usually does not look like a science-fiction failure. It looks like a meeting.

Someone shares a screenshot of an agent response. Someone else pastes the prompt they think was used. A third person asks whether the model changed last week. Engineering wants the tool-call history. Security wants to know whether a guardrail fired. Product wants to know how many users were affected. Finance wants to know why the run cost so much. Governance wants evidence that the system behaved inside its approved use case.

If the only artifact is a prompt replay, everyone is guessing with confidence.

That is why the next serious layer in agent infrastructure is not another dashboard. It is a telemetry contract: a pre-launch agreement about what every agent run must emit, how that evidence is shaped, who can inspect it, how long it lives, and how it flows into debugging, evaluation, safety review, and business reporting.

Prompt replays are helpful. They are not a control plane.

Agents break the old logging habit

Traditional application logs are usually built around relatively stable code paths. A request arrives, a service calls a database, maybe a queue receives a job, and the team traces latency or errors across known components. The workflow can still be complex, but the shape is usually defined by software engineers before runtime.

Agents bend that assumption. The OpenTelemetry agent-observability work describes agents as systems that combine LLM capabilities, tools connected to the external world, high-level reasoning, and goal-oriented behavior. The important part is that the model can influence the process: which tool to call, what to retrieve, when to hand off, when to stop, and sometimes how to recover from a failed step.

That makes the run itself part of the evidence.

For an ordinary service, “HTTP 500” may be enough to start debugging. For an agent, the useful question is more specific: what did the agent believe the task was, what context did it see, which tools were called, what arguments were sent, what came back, what guardrails fired, and what final answer or action reached the user?

Without that structure, teams end up replaying prompts as if the prompt were the system. It is not. The system is the prompt plus retrieved context, tool schemas, permissions, model version, policy layer, orchestration code, memory, handoffs, latency, cost, and hidden failures along the way.

What a telemetry contract should define

A telemetry contract is not just “turn tracing on.” It is the minimum evidence schema required for an agent to be production-grade.

At the run level, every workflow needs a durable identity: a trace ID, workflow name, user or tenant boundary where appropriate, environment, model/provider metadata, deployment version, and a safe way to group related runs in the same conversation or business process.
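As a rough illustration, the run-level fields above could be carried in a single record that is created once and attached to everything the run emits. This is a minimal sketch, not a standard schema: the class name `RunIdentity`, every field name, and the helper `missing_identity_fields` are assumptions made for this example.

```python
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class RunIdentity:
    """Run-level identity fields named in the text; all names are illustrative."""
    workflow_name: str
    environment: str                  # e.g. "prod" or "staging"
    deployment_version: str
    model_provider: str
    model_name: str
    tenant_id: Optional[str] = None   # only where a tenant boundary is appropriate
    group_id: Optional[str] = None    # conversation or business-process grouping
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def missing_identity_fields(identity: RunIdentity) -> list[str]:
    """Return the required fields that are empty, so a run can be rejected early."""
    required = ("trace_id", "workflow_name", "environment",
                "deployment_version", "model_provider", "model_name")
    record = asdict(identity)
    return [name for name in required if not record[name]]


run = RunIdentity(
    workflow_name="ticket_triage",
    environment="prod",
    deployment_version="v1",
    model_provider="example-provider",   # placeholder value
    model_name="example-model",          # placeholder value
    group_id="conversation-123",
)
```

Making the record immutable is a deliberate choice in this sketch: an identity that can be edited mid-run is harder to trust as evidence.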

At the step level, the contract needs spans for model generations, tool calls, retrieval, guardrails, handoffs, human approvals, retries, and failures. OpenAI’s Agents SDK tracing documentation uses this traces-and-spans framing directly: a trace represents an end-to-end workflow, while spans represent operations with timing and parent relationships. Its default tracing covers agent runs, LLM generations, function tool calls, guardrails, handoffs, and custom spans. That is the right mental model even if your stack uses different tooling.
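The traces-and-spans model can be sketched in a few lines of standard-library Python. The code below is not the OpenAI Agents SDK or the OpenTelemetry API; `Trace`, `Span`, and the `SPAN_KINDS` taxonomy are hypothetical names used only to show how timing, parent relationships, and failures might be recorded per step.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional

# Illustrative taxonomy drawn from the step types listed in the text.
SPAN_KINDS = {"agent", "generation", "tool_call", "retrieval", "guardrail",
              "handoff", "approval", "retry", "failure"}


@dataclass
class Span:
    trace_id: str
    kind: str
    name: str
    parent_id: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    started_at: float = field(default_factory=time.time)
    ended_at: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    error: Optional[str] = None


class Trace:
    """One end-to-end workflow; spans nest according to the `with` structure."""

    def __init__(self, workflow_name: str) -> None:
        self.trace_id = uuid.uuid4().hex
        self.workflow_name = workflow_name
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, kind: str, name: str, **attributes) -> Iterator[Span]:
        if kind not in SPAN_KINDS:
            raise ValueError(f"unknown span kind: {kind}")
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, kind, name, parent_id,
                       attributes=dict(attributes))
        self.spans.append(current)
        self._stack.append(current)
        try:
            yield current
        except Exception as exc:
            current.error = repr(exc)   # record the failure, then re-raise
            raise
        finally:
            current.ended_at = time.time()
            self._stack.pop()


trace = Trace("ticket_triage")
with trace.span("agent", "triage_agent"):
    with trace.span("tool_call", "lookup_order", arguments={"order_id": "A-1"}) as tool:
        tool.attributes["result_status"] = "ok"
```

Because tool arguments are passed as span attributes at call time, they are captured whether or not the call later fails.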

At the metric level, the contract should capture latency, token usage, cost, tool error rates, retry counts, guardrail trigger rates, escalation rates, and user-visible completion status. If the agent is tied to a business workflow, the contract should also record domain outcomes: order created, ticket resolved, claim routed, note drafted, code review completed, or human takeover required.
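One way to make the metric level concrete is a per-run record plus a small rollup. The sketch below assumes a hypothetical `RunMetrics` shape and a `summarize` helper; the field names and the choice of rates are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Per-run metrics named in the text; field names are assumptions."""
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    tool_calls: int
    tool_errors: int
    retries: int
    guardrail_triggers: int
    escalated: bool
    completed: bool
    domain_outcome: str   # e.g. "ticket_resolved" or "human_takeover"


def summarize(runs: list[RunMetrics]) -> dict[str, float]:
    """Roll per-run metrics up into the rates a contract might report."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    total_tool_calls = sum(r.tool_calls for r in runs)
    total_tool_errors = sum(r.tool_errors for r in runs)
    return {
        "runs": n,
        "mean_latency_ms": sum(r.latency_ms for r in runs) / n,
        "total_cost_usd": sum(r.cost_usd for r in runs),
        # Errors per tool call; defined as 0.0 when no tools were called.
        "tool_error_rate": total_tool_errors / total_tool_calls if total_tool_calls else 0.0,
        # Share of runs in which at least one guardrail fired.
        "guardrail_trigger_rate": sum(1 for r in runs if r.guardrail_triggers) / n,
        "escalation_rate": sum(1 for r in runs if r.escalated) / n,
        "completion_rate": sum(1 for r in runs if r.completed) / n,
    }
```

Keeping `domain_outcome` on the same record as cost and latency is what lets a later report relate spend to business results.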

At the policy level, the contract must define what cannot be logged. This matters as much as what is logged. Sensitive inputs, regulated data, secrets, and customer-specific context need redaction rules, retention limits, and access controls. OpenAI’s own tracing docs note that tracing is unavailable for organizations operating under Zero Data Retention policies. That caveat is a useful reminder: observability cannot be designed separately from data governance.
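The policy level can be enforced in code before anything reaches a log. The following is a minimal sketch under stated assumptions: the `FIELD_POLICY` table, its action names, and the single email pattern are hypothetical, and a real deployment would need rules that match its own data and regulatory obligations.

```python
import hashlib
import re

# Deliberately simple pattern for illustration; not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical per-field policy: keep, hash, scrub, or drop.
FIELD_POLICY = {
    "tool_name": "keep",
    "customer_id": "hash",
    "user_message": "scrub",
    "api_key": "drop",
}


def apply_policy(attributes: dict, policy: dict = FIELD_POLICY, salt: str = "") -> dict:
    """Return a copy of `attributes` that is safe to emit under `policy`.

    Fields without an explicit rule are dropped, so new fields are not
    logged by accident.
    """
    safe = {}
    for key, value in attributes.items():
        action = policy.get(key, "drop")
        if action == "keep":
            safe[key] = value
        elif action == "hash":
            digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
            safe[key] = digest[:16]
        elif action == "scrub":
            safe[key] = EMAIL_RE.sub("[EMAIL]", str(value))
        # "drop" and any unrecognized action: omit the field entirely.
    return safe
```

The default-deny behavior is the main design choice here. Note also that hashing a low-entropy identifier without a secret salt can be reversed by brute force, so the `salt` argument should not be left empty in practice.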

Standards are becoming the escape hatch from vendor silos

The agent observability market is moving quickly, and that creates a predictable problem: every framework and vendor wants to show a beautiful trace viewer. Beautiful trace viewers are useful. But if each tool emits a different shape of evidence, the enterprise ends up with a pile of observability islands.

That is why OpenTelemetry’s GenAI semantic-conventions work matters. The dedicated repository describes conventions for spans, metrics, and events for GenAI clients, MCP, and provider-specific integrations such as OpenAI. The OpenTelemetry blog makes the larger point plainly: because GenAI observability and evaluation tools come from many vendors, common telemetry standards reduce lock-in and make observability data more interoperable.

The practical takeaway is not that one standard has solved everything. The practical takeaway is that teams should avoid designing agent telemetry as a private screenshot format. If traces may later need to flow from a framework into a security tool, an evaluation platform, a data warehouse, a compliance archive, and an incident-review process, the schema should be boring, explicit, and portable.

A good telemetry contract lets a team switch model providers, orchestration frameworks, or observability vendors without losing the ability to compare runs over time.
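A small adapter layer is one way to keep runs comparable across tools. In this sketch, `framework_a` and `framework_b` are invented payload shapes, not real products, and the neutral field names are assumptions rather than the OpenTelemetry GenAI conventions; a team adopting those conventions would map to their attribute names instead.

```python
from typing import Callable


def from_framework_a(raw: dict) -> dict:
    """Map a hypothetical camelCase payload onto the neutral schema."""
    return {
        "trace_id": raw["traceId"],
        "model": raw["llm"]["model"],
        "input_tokens": raw["llm"]["promptTokens"],
        "output_tokens": raw["llm"]["completionTokens"],
        "latency_ms": float(raw["durationMs"]),
    }


def from_framework_b(raw: dict) -> dict:
    """Map a hypothetical snake_case payload that reports seconds, not ms."""
    return {
        "trace_id": raw["run_id"],
        "model": raw["model_name"],
        "input_tokens": raw["usage"]["in"],
        "output_tokens": raw["usage"]["out"],
        "latency_ms": raw["elapsed_s"] * 1000.0,
    }


ADAPTERS: dict[str, Callable[[dict], dict]] = {
    "framework_a": from_framework_a,
    "framework_b": from_framework_b,
}


def normalize(source: str, raw: dict) -> dict:
    """Convert a source-specific record into one shape that downstream tools share."""
    record = ADAPTERS[source](raw)
    record["source"] = source
    return record
```

With this shape, switching frameworks means writing one new adapter while warehouse queries, evaluations, and incident reviews keep reading the same fields.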

Observability is now part of agent engineering

The industry data points in the same direction. LangChain’s State of Agent Engineering report, based on 1,340 professionals, says 57.3% of respondents have agents in production and another 30.4% are actively developing agents with concrete deployment plans. The same report identifies quality as a leading production blocker and says 89% of organizations have implemented some form of agent observability, with 62% using detailed tracing to inspect individual agent steps and tool calls.

That does not mean most teams are done. It means observability has moved from “nice to have” into the deployment checklist.

The next maturity jump is making telemetry actionable. A trace that only helps a developer inspect one weird run is useful. A trace that can be joined to eval results, release versions, user feedback, cost reports, guardrail outcomes, and incident tickets is operational infrastructure.
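Joining is mostly a matter of sharing a key. The sketch below assumes that traces, eval results, and user feedback all carry the same `trace_id`; the function name `join_evidence` and the `score` and `rating` fields are hypothetical.

```python
from collections import defaultdict


def join_evidence(traces: list[dict], evals: list[dict],
                  feedback: list[dict]) -> tuple[list[dict], list[str]]:
    """Attach eval scores and feedback ratings to each trace by trace_id.

    Returns the joined records plus the trace IDs that appear in evals or
    feedback but have no matching trace, which usually signals a telemetry gap.
    """
    scores_by_trace = defaultdict(list)
    for item in evals:
        scores_by_trace[item["trace_id"]].append(item["score"])

    ratings_by_trace = defaultdict(list)
    for item in feedback:
        ratings_by_trace[item["trace_id"]].append(item["rating"])

    joined = [
        {
            **t,
            "eval_scores": scores_by_trace.get(t["trace_id"], []),
            "feedback": ratings_by_trace.get(t["trace_id"], []),
        }
        for t in traces
    ]
    known = {t["trace_id"] for t in traces}
    orphans = sorted((set(scores_by_trace) | set(ratings_by_trace)) - known)
    return joined, orphans
```

Reporting orphans explicitly is a useful habit: an eval result that cannot be traced back to a run is evidence that the contract was not honored somewhere upstream.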

This is where governance becomes concrete. NIST’s AI Risk Management Framework emphasizes incorporating trustworthiness considerations into the design, development, use, and evaluation of AI systems. For agent systems, those considerations cannot live only in policy documents. They need evidence. Telemetry is how a team proves what happened, detects drift, evaluates changes, and decides whether a workflow is still inside its approved operating envelope.

A practical starting checklist

Before launching an agent, ask for the telemetry contract in writing.

Define the run identity. Every run should have a trace ID, workflow name, environment, deployment version, model/provider metadata, and safe tenant or session grouping.

Define the span taxonomy. Decide how the system records model calls, retrieval, tool use, guardrails, approvals, handoffs, retries, exceptions, and final outputs. Do not wait for an incident to discover that tool arguments were never captured.

Define the sensitive-data boundary. Decide what is redacted, hashed, summarized, retained, or never collected. Observability that creates a new privacy problem is not production-ready.

Define evaluation linkage, export, and retention. A production run should connect to offline tests, online evals, user feedback, human review, release gates, data-warehouse exports, audit archives, and deletion rules.

Define operational questions. If the agent fails tomorrow, can the team answer what changed, what the agent saw, which tool it called, what guardrail fired, who approved the action, what it cost, and how many users were affected?

If not, the agent is not observable. It is just narratable.

The difference matters. Narratives explain one run after the fact. Telemetry contracts make thousands of runs governable while they are happening.
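The operational questions from the checklist can also be checked mechanically before launch. This is a sketch under assumptions: the mapping from each question to the fields that answer it (`QUESTION_FIELDS`) is hypothetical and would need to match whatever schema a team actually emits.

```python
# Each operational question maps to the run-record fields needed to answer it.
QUESTION_FIELDS = {
    "what changed": ["deployment_version", "model_name"],
    "what the agent saw": ["retrieved_context_ids"],
    "which tool it called": ["tool_calls"],
    "what guardrail fired": ["guardrail_events"],
    "who approved the action": ["approvals"],
    "what it cost": ["cost_usd"],
    "how many users were affected": ["tenant_id"],
}


def unanswerable_questions(run_record: dict) -> list[str]:
    """Return the questions this run record cannot answer.

    A field counts as present if the key exists, even when its value is
    empty: an empty guardrail_events list still answers "none fired".
    """
    return [
        question
        for question, fields in QUESTION_FIELDS.items()
        if any(name not in run_record for name in fields)
    ]
```

Run against a sample of staging traces, a check like this turns "is the agent observable?" into a list of specific missing fields rather than a matter of opinion.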

The companies that get this right will not be the ones with the prettiest prompt replay. They will be the ones that treat every agent run as structured operational evidence from the first day it reaches production.
