Picture a support agent wired into your stack. A customer asks it to update a shipping address. The agent reads the ticket, pulls the order, edits the record, and replies with a tidy confirmation. The task is done. The customer is happy. The dashboard turns green.
Now look at the path it took. Somewhere in the ticket thread was a line of text the customer pasted from an email — text that quietly instructed the agent to also export the account's order history to an external endpoint. The agent did that too, because to the model it was all just instructions. The task completed. The execution was not faithful to anyone's authorized intent.
This is the gap most agent security programs still miss. We have spent two years hardening the prompt and inspecting the output. But agents no longer just produce answers. They produce side effects: file edits, API calls, code execution, memory writes, outbound requests. Once an action leaves a mark on a real system, "was the answer acceptable?" is the wrong question. The right one is "was the execution faithful to what the user actually authorized?"
Completion is not correctness anymore
Treating agent risk as prompt hygiene made sense when the output was the product. If the model said something wrong, you caught it in the text. Allow lists, deny lists, and post-hoc moderation were reasonable fits for a system whose only output was language.
Agents broke that model. The interesting failures no longer live in the final message. They live in the trajectory — the sequence of tool calls, data reads, and judgments that got the agent from a natural-language request to a concrete change in the world. A run can produce a perfectly reasonable closing reply while having taken an unauthorized path to get there. Completion and correctness have come apart, and our tooling mostly still measures completion.
The useful idea in intent-to-execution integrity
A May 2026 security paper, Securing LLM Agents Need Intent-to-Execution Integrity (Qu, Xu, Wang, Zhai, Zhang, and Song), names the missing piece directly. Its argument is that LLM agents operate over an intent-to-execution pipeline: natural-language instructions become tool calls, API requests, code execution, and other system operations. Security violations, in this framing, are mis-executions — concrete side effects that fail to align with user intent because of an attack. The authors' claim is that current defenses lack an adequate correctness property for what "secure" even means in an agent.
They trace the trouble to two fundamental sources: untrusted data ingestion and untrusted tool execution. Both are unavoidable in any agent that touches the real world, because the world is where the untrusted input lives. To address them, the paper proposes four integrity properties that must hold together:
- Instruction Integrity — actions trace back to authorized instructions, not to text smuggled in through data.
- Data Flow Integrity — untrusted data cannot silently redirect what the agent does.
- Judgment Integrity — the agent's intermediate decisions are sound rather than manipulated.
- Tool Integrity — tools behave as expected and their results are trustworthy.
The conjunction of all four is what the authors call intent-to-execution integrity. The analogy they reach for is the compiler. A compiler is correct when it preserves the semantics of the source program through translation; an agent should preserve the semantics of user intent through execution. That framing is worth borrowing even if you never read the paper, because it reframes security as a single end-to-end property instead of a pile of disconnected controls.
Take it seriously without overselling it. The paper offers a vocabulary and a correctness goal, not a finished implementation you can install. The value is in the shape of the question.
Why evaluation has to become trajectory-based
If integrity is a property of the path, then you cannot verify it by grading a single output. The evaluation literature has been arriving at the same conclusion from a different direction.
A February 2026 paper arguing for a unified framework for LLM-based agent evaluation (Zhu, Sun, Yu, and Su) points out that agent benchmarks are confounded by prompts, tool schemas, memory, inference protocols, and dynamic environments. Agent evaluation, it argues, has to account for closed-loop interaction, trajectories, final answers, and changes to environment state. Their proposal is a unified evaluation substrate — a sandbox offering hermetic determinism, reproducibility, and safety. An integrity contract is only enforceable if you can test it deterministically before real side effects are allowed.
A survey on the evaluation of LLM-based agents (Yehudai, Eden, Li, Uziel, Zhao, Bar-Haim, Cohan, and Shmueli-Scheuer), updated in April 2026, reinforces the practical edge: agents need evaluation beyond textual outputs, covering planning, tool use, memory, sequential decisions, dynamic environments, safety, robustness, and efficiency. Its sharpest recommendation for builders is that future benchmarks should integrate guardrail metrics and penalize agents that reach task success through non-compliant actions. In other words, integrity should be a first-class, scored outcome — not an invisible guardrail you hope held.
There is reason not to trust broad "agentic" claims at face value. AgentEscapeBench (Guo et al., May 2026) deliberately tests unfamiliar, tool-grounded reasoning instead of the familiar domains and repeated workflow templates many benchmarks lean on. It uses directed acyclic graphs over tools and items, real function calls, hidden state, structured feedback, and deterministic final answers, across 270 instances spanning five difficulty levels. The takeaway for an enterprise team is sobering: if agents stumble when tool dependencies are unfamiliar, you should not assume integrity holds on your specific workflow without testing your specific workflow.
What an integrity contract looks like in practice
The point of all this is not to add another anxious paragraph to your policy prompt. Policy prompts, scoped tools, permission dialogs, sandboxes, and telemetry are real and useful — but each is a fragment. They become an engineering discipline only when they answer to a single document that defines what faithful execution means for a given agent.
That document — call it an integrity contract — should state, in concrete terms:
- Authorized intent: what this agent is actually permitted to accomplish, and on whose authority.
- Untrusted data boundaries: which inputs are data to be processed, never instructions to be obeyed.
- Tool trust levels: which tools are trusted, which are not, and what their results are allowed to influence.
- Data-flow constraints: where untrusted content is allowed to flow, and where it must not.
- Judgment checkpoints: the decision points that require verification before an irreversible action.
- Trace evidence: the observable record that lets you tie each side effect back to authorized intent.
- Sandbox tests: deterministic evaluations the agent must pass before touching production.
- Release gates: criteria that fail a run which succeeded at the task but took a non-compliant path.
That last gate is the one most teams skip, and it is the one that matters most. A run that completes the task through an unauthorized action is not a partial success to be smoothed over. Under an integrity contract, it is a failure, and the gate should treat it as one.
How to start small this quarter
You do not need to formalize all four integrity properties at once. Pick one real agent and write down its authorized intent and its untrusted-data boundaries in plain language. Add trace evidence so that, after any run, you can answer "what authorized this action?" for every side effect. Then build one deterministic sandbox test that includes a planted instruction in untrusted data and gate releases on it. That is a contract with three clauses and one test — small enough to ship, honest enough to expose what your agent actually does.
The agent era is going to be won by teams that can delegate real work without holding their breath. That confidence does not come from a more elaborate policy prompt or a stricter tone of voice. It comes from being able to show that execution preserved intent — proofs of faithful execution, not vibes of helpfulness.
Sources
- https://arxiv.org/html/2605.16976v1
- https://arxiv.org/html/2602.03238v1
- https://arxiv.org/html/2503.16416v2
- https://arxiv.org/html/2605.07926v1
Build Agents That Prove Their Work
If you are wiring agent workflows into real operations, Alchemic can help design the checkpoints, traces, and validation gates that keep automation honest.
Get the Field Guide - $10 ->