The word “pilot” has done too much work in enterprise AI.
A pilot sounds safe. It implies a bounded experiment, a friendly stakeholder, a little extra supervision, and a slide at the end that says whether the system was promising. That framing works when the AI is drafting summaries or helping an internal team search documentation. It breaks down when an agent can touch records, call tools, change files, query production-like data, or route work across a regulated process.
At that point, the important question is no longer “did the pilot look good?” It is: what exactly is this agent certified to do, under which conditions, with what evidence, and when does that certification expire?
That is the argument behind a new June 2026 paper, [“Toward Pre-Deployment Assurance for Enterprise AI Agents”](https://arxiv.org/html/2606.04037v2). The authors propose a framework built around three ideas: an agent operational envelope, ontology-grounded scenario generation, and a machine-verifiable trust certificate. The paper is research, not an industry standard, and it should be read with the normal caution we apply to early frameworks. But the shape is right. Enterprise agents need a release artifact that looks less like a demo recap and more like a preflight certificate.
The operational envelope comes first
The most useful part of the proposal is the operational envelope. Before arguing about model scores, the organization has to define the space in which the agent is allowed to operate.
That means writing down the agent’s authorized actions, domain scope, safety properties, governance constraints, and autonomy level. In plain English: what can it do, where can it reason, what must never happen, which rules bind it, and how independently may it act?
This is different from a prompt. A prompt asks the model to behave. An envelope gives the business a reviewable contract around behavior. It should include things like:
- the exact agent version, model, prompt, tools, connectors, and policy bundle;
- the data domains it may access;
- the actions it may take without approval;
- the actions that always require escalation;
- invariants such as “never disclose this filing status to the subject” or “never modify a record without a traceable source”; and
- the events that invalidate the certificate, such as a model change, new connector, policy update, or schema migration.
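One way to make such a contract reviewable is to express it as data that both people and tooling can read. The sketch below is illustrative only: the `Envelope` class, its field names, and the example actions are assumptions made for this post, not structures defined in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"
    DENY = "deny"


@dataclass(frozen=True)
class Envelope:
    """Hypothetical operational envelope; all field names are illustrative."""
    agent_version: str
    data_domains: frozenset[str]
    autonomous_actions: frozenset[str]
    escalation_actions: frozenset[str]

    def check(self, action: str, domain: str) -> Decision:
        # Anything outside the declared data domains is denied outright.
        if domain not in self.data_domains:
            return Decision.DENY
        # Escalation wins if an action was listed in both sets by mistake.
        if action in self.escalation_actions:
            return Decision.ESCALATE
        if action in self.autonomous_actions:
            return Decision.ALLOW
        # Default deny: unlisted actions are not part of the envelope.
        return Decision.DENY


envelope = Envelope(
    agent_version="claims-agent-0.3.1",  # made-up version string
    data_domains=frozenset({"claims"}),
    autonomous_actions=frozenset({"read_record", "draft_note"}),
    escalation_actions=frozenset({"update_record"}),
)
```

The default-deny branch is the design choice that matters here: an action nobody wrote down is treated as outside the envelope rather than silently permitted.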
The point is not to make the agent perfectly safe by documentation. The point is to make the release surface explicit enough that engineering, security, legal, and operations are not each imagining a different system.
Scenario evidence has to look like the real workplace
A certificate is only as credible as the scenarios behind it. This is where many agent evaluations still feel too clean.
Real work does not arrive as a tidy prompt with three attached files. It sits in a messy workspace: old drafts, confusing folders, spreadsheets with informal conventions, PDFs with important footnotes, stale exports, implicit dependencies, and a human expectation that the agent will know which artifact matters.
[Workspace-Bench 1.0](https://arxiv.org/html/2605.03596v1) is useful because it pushes evaluation in that direction. The benchmark describes five workspace personas, 388 tasks, 20,476 files, 74 file types, and thousands of rubrics. More importantly, it evaluates process, not just final answers: whether an agent found the right files, used the correct versions, followed dependencies, made sound intermediate decisions, and saved outputs properly.
That is the kind of evidence a preflight certificate should preserve. Not just “the agent answered correctly,” but “the agent used the right source, ignored the stale one, transformed the spreadsheet correctly, wrote the output to the expected place, and did not exceed its authority.”
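A process-level check of this kind can be sketched as a rubric applied to an execution trace instead of a final answer. The `Trace` shape, the rubric names, and the file names below are hypothetical and are not taken from Workspace-Bench. The sketch also assumes that merely reading a stale file counts as a failure, which a real rubric might relax.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of what the agent did during one scenario."""
    files_read: list[str]
    files_written: list[str]
    actions: list[str]


def score_process(trace, required_sources, stale_sources, expected_output, allowed_actions):
    """Return one pass/fail flag per process rubric."""
    read = set(trace.files_read)
    return {
        # Every required source must have been read.
        "used_right_sources": set(required_sources) <= read,
        # Simplifying assumption: touching a stale file at all is a failure.
        "ignored_stale_sources": not (set(stale_sources) & read),
        "wrote_expected_output": expected_output in trace.files_written,
        # No action outside the allowed set may appear in the trace.
        "stayed_within_authority": set(trace.actions) <= set(allowed_actions),
    }


trace = Trace(
    files_read=["q3_final.xlsx", "policy.pdf"],
    files_written=["out/summary.csv"],
    actions=["read_file", "write_file"],
)
report = score_process(
    trace,
    required_sources=["q3_final.xlsx"],
    stale_sources=["q3_draft_old.xlsx"],
    expected_output="out/summary.csv",
    allowed_actions=["read_file", "write_file"],
)
```

Keeping each rubric as a separate flag means the certificate can record which part of the process failed, not just that the scenario did.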
If your agent is going to operate inside a claims workflow, a revenue operations workspace, a clinical intake process, or an internal engineering repository, then generic chat benchmarks are not enough. The scenario suite has to model the workflow’s actual friction.
Data agents make the case even sharper
The same issue appears in enterprise data work. [Data Agent Benchmark](https://arxiv.org/html/2603.20576v1) focuses on agents answering natural-language questions over messy, multi-database, enterprise-like data. Its tasks involve multiple database systems, inconsistent identifiers, semantic operations over text, and domain-specific definitions. According to the paper's summary, the best frontier model reaches 38% pass@1 on the benchmark.
That number should not be used as a universal indictment of data agents. It should be read as a warning about release evidence. If the production question requires reconciling fuzzy customer names across systems, interpreting a field whose meaning depends on domain policy, and computing a decision from both structured and unstructured data, then “the model is strong at SQL” is not a release argument.
A preflight certificate for a data agent should show which joins were tested, which ambiguous identifiers were handled, which semantic operations were validated, and where the agent is expected to escalate. It should also include negative evidence: known weak cases, failed scenarios, and conditions that require manual review.
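As a rough sketch, that evidence could be summarized per capability so that failures stay visible instead of being averaged away. The capability labels, scenario ids, and result format below are invented for illustration and do not come from Data Agent Benchmark.

```python
from collections import defaultdict


def summarize_evidence(results):
    """Group scenario results by capability, keeping failures visible."""
    summary = defaultdict(lambda: {"passed": 0, "failed": []})
    for result in results:
        bucket = summary[result["capability"]]
        if result["passed"]:
            bucket["passed"] += 1
        else:
            # Negative evidence: keep the failing scenario id, not just a count.
            bucket["failed"].append(result["scenario"])
    # Any capability with a known failure is flagged for manual review.
    return {
        capability: {**bucket, "manual_review": bool(bucket["failed"])}
        for capability, bucket in summary.items()
    }


# Hypothetical scenario results for a data agent.
results = [
    {"scenario": "join-orders-crm-01", "capability": "cross_db_join", "passed": True},
    {"scenario": "fuzzy-customer-07", "capability": "fuzzy_identifier", "passed": False},
    {"scenario": "fuzzy-customer-08", "capability": "fuzzy_identifier", "passed": True},
]
evidence = summarize_evidence(results)
```

In this example the `fuzzy_identifier` capability passes one scenario and fails another, so it would be recorded with its failing scenario id and flagged for manual review.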
That is not bureaucracy. That is how you prevent a demo from quietly becoming an unmanaged business process.
Align the certificate with governance language
There is already a governance vocabulary for this. The [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework) is a voluntary framework for incorporating trustworthiness considerations into the design, development, use, and evaluation of AI systems. NIST’s [Generative AI Profile](https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence) adds cross-sectoral guidance for generative AI risk management.
That does not mean NIST has blessed any particular agent certificate format. It means teams should avoid inventing a separate island of agent governance. A useful preflight certificate should map back to the organization’s existing risk language: lifecycle stage, intended use, evaluation method, accountable owner, risk treatment, monitoring requirement, and review cadence.
The certificate should be small enough to read and structured enough to automate. A practical version might include:
1. Identity: agent version, model, prompt, tools, connectors, deployment environment.
2. Envelope: permissions, domains, prohibited actions, safety invariants, autonomy level.
3. Scenario suite: source of scenarios, coverage rationale, workflow or ontology basis.
4. Evidence: traces, tool calls, files touched, rubrics, pass/fail outcomes, reviewer notes.
5. Decision: approved, approved with constraints, shadow-only, or blocked.
6. Expiry: what changes invalidate the certificate and when it must be renewed.
7. Rollback: how to disable the agent, revoke credentials, and recover from bad actions.
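To show how the identity and expiry fields might be automated, here is a minimal sketch that fingerprints the identity section and checks it against what is actually deployed. The field names, the `renew_by` date, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the paper's certificate format may differ.

```python
import hashlib
import json
from datetime import date


def fingerprint(identity: dict) -> str:
    """Hash a canonical JSON form of the identity section."""
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def certificate_status(certificate: dict, deployed_identity: dict, today: date) -> str:
    """Return 'valid', 'expired', or 'identity_changed'."""
    if today > date.fromisoformat(certificate["renew_by"]):
        return "expired"
    # Any change to model, tools, or policy bundle alters the fingerprint.
    if certificate["identity_fingerprint"] != fingerprint(deployed_identity):
        return "identity_changed"
    return "valid"


# Hypothetical identity section and certificate.
identity = {
    "agent_version": "0.3.1",
    "model": "model-a",
    "tools": ["read_record"],
    "policy_bundle": "2026-06",
}
certificate = {
    "identity_fingerprint": fingerprint(identity),
    "decision": "shadow-only",
    "renew_by": "2026-12-31",
}
```

With this shape, swapping the model or adding a tool changes the fingerprint, so a deployment pipeline could refuse to promote an agent whose status is anything other than `valid`.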
The certificate does not replace runtime monitoring. It gives monitoring something to compare against. If the envelope says the agent is certified only for read-only reconciliation, a write action is not just “interesting telemetry.” It is a certification violation.
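A minimal sketch of that comparison, assuming telemetry arrives as simple event dictionaries with an `action` field (a made-up format for this example):

```python
def certification_violations(events, certified_actions):
    """Return telemetry events whose action is outside the certified set."""
    return [event for event in events if event["action"] not in certified_actions]


# Hypothetical read-only reconciliation certificate and observed telemetry.
certified_actions = {"read_record", "compare_records"}
events = [
    {"id": 1, "action": "read_record"},
    {"id": 2, "action": "update_record"},
]
violations = certification_violations(events, certified_actions)
```

Here the `update_record` event is the only one returned, which is the signal that should page someone instead of landing in a dashboard.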
Narrower agents, better evidence
The future of enterprise agents will not be won by the broadest demo. It will be won by systems that can earn narrower permissions with better evidence.
That is uncomfortable because it slows down the story. It forces teams to admit that a promising agent may be ready for shadow mode but not production action, ready for one jurisdiction but not another, ready for read-only analysis but not autonomous updates. It also creates a cleaner path to adoption. Instead of asking stakeholders to trust a black box, the team can show the envelope, the scenarios, the evidence, the verdict, and the expiry condition.
A pilot tells people what happened in a controlled trial. A preflight certificate tells them what is allowed to happen next.
For enterprise agents, that distinction matters. The moment an agent can act, the release artifact has to become as real as the workflow it is entering.
Sources
- https://arxiv.org/html/2606.04037v2
- https://arxiv.org/html/2605.03596v1
- https://arxiv.org/html/2603.20576v1
- https://www.nist.gov/itl/ai-risk-management-framework
- https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-generative-artificial-intelligence