The default safety move for an AI agent is still too often a longer prompt.

We add another instruction: do not send sensitive data. Ask before taking destructive action. Stay inside the project. Follow the policy. Never deploy without approval. That is a reasonable starting point for a chat assistant. It is not a sufficient control for a system that can read files, call APIs, update tickets, query databases, send messages, or run shell commands.

Once an agent can act, the question changes. The problem is no longer only whether the model says the right thing. It is whether the system is allowed to do the thing the model proposed.

That distinction is becoming more important because agent autonomy is no longer theoretical. Anthropic's 2026 analysis of real Claude Code and API usage describes autonomy as a property of the whole deployment: the model, the user, and the product interface together shape how much independence the agent actually exercises. In the same report, the longest Claude Code sessions grew substantially over a few months, and experienced users were more likely to use auto-approval while also interrupting runs more often. That is the shape of practical oversight: less clicking through every step, more intervention when the system crosses a meaningful boundary.

The missing layer is an execution firewall.

An execution firewall is a narrow control plane between agent reasoning and real-world action. The agent can think, plan, and propose. The firewall receives proposed actions as structured requests, checks them against policy and context, and only then allows a tool call, command, API request, data write, deployment, payment, or external send to happen.

In other words: the model does not get to be both the lawyer and the locksmith.

A recent preprint, "Parallax: Why AI Agents That Think Must Never Act," argues for this separation explicitly. Its core proposal is cognitive-executive separation: the component that reasons about actions should not be the same component that executes them, and an independent validator should sit between the two. You do not have to adopt the paper's full architecture to accept the practical lesson. Prompt guardrails live inside the same reasoning substrate that can be confused by malicious instructions, messy context, or tool output. Execution policy should live outside that substrate.

There is also a reliability argument, not only a security argument. The paper "Benchmarking LLM Tool-Use in the Wild" introduces WildToolBench, a benchmark built around messy multi-turn user behavior: compositional requests, implicit intent scattered across dialogue, and instruction transitions where the assistant must decide whether to answer, clarify, or use tools. Across 57 models, the paper reports that no model exceeded 15% session accuracy. The exact benchmark will not map perfectly to every enterprise workflow, but the signal is hard to ignore: realistic tool use breaks in ordinary, human ways.

That matters because production failures rarely announce themselves as attacks. They look like ambiguous requests, stale context, mistaken assumptions, half-correct tool arguments, over-broad permissions, and an agent confidently doing the wrong helpful thing.

A good execution firewall treats every proposed action as an untrusted request. It does not need to be complicated at first. It needs to be explicit.

Start with typed action manifests. Instead of giving an agent a general-purpose browser, shell, or database handle, require it to submit a structured action: send_email, create_ticket, read_customer_record, run_test_suite, deploy_preview, update_status_page. Each action should have a schema, required fields, allowed destinations, scope limits, and a declared reversibility class.
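As a rough illustration, a manifest can be a small table of action specs plus a validator that runs before anything executes. This is a minimal sketch, not a prescribed design: the `ActionSpec` fields, the `MANIFEST` contents, the `example.com` allowlist, and the `validate_request` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class ActionSpec:
    """Manifest entry: what a well-formed request for this action looks like."""
    name: str
    required_fields: frozenset
    allowed_destinations: frozenset  # empty set: the action has no destination
    reversibility: Reversibility


# Hypothetical manifest. Action names follow the article; values are illustrative.
MANIFEST = {
    "send_email": ActionSpec(
        name="send_email",
        required_fields=frozenset({"to", "subject", "body"}),
        allowed_destinations=frozenset({"example.com"}),
        reversibility=Reversibility.IRREVERSIBLE,
    ),
    "create_ticket": ActionSpec(
        name="create_ticket",
        required_fields=frozenset({"title", "description"}),
        allowed_destinations=frozenset(),
        reversibility=Reversibility.REVERSIBLE,
    ),
}


def validate_request(action: str, params: dict) -> list:
    """Return a list of problems; an empty list means the request is well-formed."""
    spec = MANIFEST.get(action)
    if spec is None:
        return [f"unknown action: {action}"]
    problems = []
    missing = spec.required_fields - set(params)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if spec.allowed_destinations and "to" in params:
        # Compare only the domain part of the address against the allowlist.
        domain = str(params["to"]).rpartition("@")[2]
        if domain not in spec.allowed_destinations:
            problems.append(f"destination not allowed: {domain}")
    return problems
```

Anything the manifest does not name is rejected outright, which is the point: the agent cannot invent a new capability by describing it well.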

Then validate identity and scope. Which user, tenant, patient, customer, repository, project, or environment is this action for? Does the current session have authority for that scope? Is the agent trying to cross from a sandbox to production, from one customer to another, or from internal notes to an external channel? These checks should be deterministic wherever possible. The agent can explain why it wants access; the firewall decides whether access exists.
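A deterministic scope check can be very small. The sketch below assumes a session carries a tenant and a set of permitted environments; `Session`, `ScopedAction`, and `check_scope` are hypothetical names, and a real system would likely check more dimensions (repository, project, patient, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    user: str
    tenant: str
    environments: frozenset  # environments this session may act in


@dataclass(frozen=True)
class ScopedAction:
    name: str
    tenant: str
    environment: str


def check_scope(session: Session, action: ScopedAction) -> tuple:
    """Deterministic check returning (allowed, reason). No model is consulted."""
    if action.tenant != session.tenant:
        return (False, "cross-tenant access")
    if action.environment not in session.environments:
        return (False, f"no authority for environment {action.environment!r}")
    return (True, "in scope")
```

Note that the function never sees the agent's justification. Whatever the agent argues, the answer depends only on the session and the requested scope.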

Next, check data flow. The dangerous pattern is not just tool access. It is private data plus untrusted content plus an exfiltration path. If an agent read a vendor PDF, a web page, an email, or a retrieved document that may contain untrusted instructions, the firewall should become more skeptical about subsequent external sends, code execution, or privileged reads. The right response may be to strip content, require a human review, or force a safer tool.
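One way to approximate this is to track what a session has touched and escalate external actions once private data and untrusted content have both been seen. The sketch below is deliberately coarse and rests on assumptions: `FlowContext`, `record_read`, `check_data_flow`, and the `EXTERNAL_ACTIONS` set are hypothetical, and real taint tracking would be finer-grained than two booleans.

```python
from dataclasses import dataclass

# Hypothetical action names; which actions count as "external" is a policy choice.
EXTERNAL_ACTIONS = frozenset({"send_email", "post_webhook"})


@dataclass
class FlowContext:
    """What the session has touched so far."""
    read_private_data: bool = False
    saw_untrusted_content: bool = False


def record_read(ctx: FlowContext, source_kind: str) -> None:
    """Mark the session after a read: 'private', 'untrusted', or anything else."""
    if source_kind == "private":
        ctx.read_private_data = True
    elif source_kind == "untrusted":
        ctx.saw_untrusted_content = True


def check_data_flow(ctx: FlowContext, action: str) -> str:
    """Return 'allow' or 'escalate' for a proposed action."""
    if action not in EXTERNAL_ACTIONS:
        return "allow"
    # Private data + untrusted content + an exfiltration path: ask a human.
    if ctx.read_private_data and ctx.saw_untrusted_content:
        return "escalate"
    return "allow"
```

The flags only ever move from false to true within a session, so the firewall gets stricter as the context gets messier and never quietly relaxes.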

Reversibility should be first-class. Reading a public document, drafting a message, creating a preview branch, or opening a ticket is different from deleting records, charging money, changing access controls, emailing a client list, or deploying to production. The firewall should know the difference. Low-risk reversible actions can be auto-approved. Irreversible or high-blast-radius actions should require a dry run, a diff, a rollback plan, or explicit human approval.
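In code, this can start as a lookup from action to risk class and a router that fails closed. The sketch assumes only two classes and uses human approval as the single gate for the risky class; the `RISK_CLASS` table and `route` function are hypothetical, and a dry run or diff requirement could be added the same way.

```python
from enum import Enum


class Risk(Enum):
    LOW_REVERSIBLE = "low_reversible"
    HIGH_OR_IRREVERSIBLE = "high_or_irreversible"


# Hypothetical classification; action names are illustrative.
RISK_CLASS = {
    "draft_message": Risk.LOW_REVERSIBLE,
    "create_preview_branch": Risk.LOW_REVERSIBLE,
    "open_ticket": Risk.LOW_REVERSIBLE,
    "delete_records": Risk.HIGH_OR_IRREVERSIBLE,
    "deploy_production": Risk.HIGH_OR_IRREVERSIBLE,
}


def route(action: str, human_approved: bool = False) -> str:
    """Decide how a proposed action proceeds based on its risk class."""
    # Unclassified actions fall into the stricter class by default.
    risk = RISK_CLASS.get(action, Risk.HIGH_OR_IRREVERSIBLE)
    if risk is Risk.LOW_REVERSIBLE:
        return "auto_approve"
    return "approve" if human_approved else "needs_human_approval"
```

Defaulting unknown actions to the stricter class means a newly added tool needs an explicit classification before it can run unattended.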

Budget and rate limits belong in the same layer. A runaway agent should not be able to call an expensive API indefinitely, create hundreds of tickets, page an on-call rotation in a loop, or retry a failing action until it becomes an incident. Limits are not just cost controls; they are behavioral circuit breakers.
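A sliding-window counter is one simple way to build such a circuit breaker. The sketch below is an assumption-laden minimal version: the `ActionBudget` class and its parameters are hypothetical, it is not thread-safe, and the clock is injectable only so the behavior can be tested without waiting.

```python
import time
from collections import deque


class ActionBudget:
    """Allow at most max_calls per sliding window of window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._calls = deque()  # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        """Record a call and return True if budget remains, else return False."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True
```

A firewall might keep one budget per action type, so that a retry loop on a failing deploy is stopped without blocking unrelated reads.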

Finally, log the decision, not just the action. The audit trail should show what the agent proposed, which policy evaluated it, what context was used, whether the request was allowed, denied, modified, or escalated, and what actually happened. NIST's AI Risk Management Framework emphasizes trustworthiness across design, development, use, and evaluation. For agents, execution decisions are where those abstract risk practices become concrete evidence.
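A decision record can be a single structured line per evaluated request. The field names below mirror the list in the paragraph above, but the `DecisionRecord` shape, the `make_record` helper, and the JSON-lines format are assumptions for illustration, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

VALID_DECISIONS = frozenset({"allowed", "denied", "modified", "escalated"})


@dataclass(frozen=True)
class DecisionRecord:
    timestamp: str
    proposed_action: dict  # what the agent asked for
    policy_id: str         # which policy evaluated it
    context: dict          # what context the policy used
    decision: str          # allowed, denied, modified, or escalated
    outcome: str           # what actually happened


def make_record(proposed_action: dict, policy_id: str, context: dict,
                decision: str, outcome: str) -> str:
    """Serialize one firewall decision as a JSON line for a durable trace."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    record = DecisionRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        proposed_action=proposed_action,
        policy_id=policy_id,
        context=context,
        decision=decision,
        outcome=outcome,
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Denied requests are written too. A trace that only contains executed actions cannot show what the firewall prevented.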

This is also a product design issue. If every action prompts the user, teams will eventually turn approvals off. If no action prompts the user, the system will eventually surprise someone. The useful middle ground is a risk-aware approval interface: show humans the small number of decisions that matter, with enough context to intervene quickly.

For a team building agents this week, the implementation path is straightforward.

List the actions your agent can take. Remove the broad tools you cannot defend. Convert the remaining tools into typed requests. Add policy checks for scope, destination, data class, environment, cost, reversibility, and approval threshold. Run risky actions in dry-run mode first. Store the proposed action and the firewall decision in a durable trace. Review denied and escalated actions as seriously as you review model evals, because they are evals of the operating system around the model.
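Wired together, those steps amount to an ordered list of checks in front of the executor, with every verdict written to a trace. The sketch below assumes requests are plain dictionaries and each check returns a reason string or `None`; the check functions, the `approved` and `estimated_cost_usd` fields, and the 50 dollar threshold are hypothetical.

```python
from typing import Callable, Optional

# A check returns a verdict string to stop the request, or None to pass it on.
Check = Callable[[dict], Optional[str]]


def check_environment(req: dict) -> Optional[str]:
    if req.get("environment") == "production" and not req.get("approved"):
        return "escalate: production requires approval"
    return None


def check_cost(req: dict) -> Optional[str]:
    # Hypothetical threshold; a real limit would come from policy config.
    if req.get("estimated_cost_usd", 0) > 50:
        return "deny: over cost threshold"
    return None


def evaluate(req: dict, checks: list, trace: list) -> str:
    """Run checks in order; the first verdict wins. Always record the decision."""
    decision = "allow"
    for check in checks:
        verdict = check(req)
        if verdict is not None:
            decision = verdict
            break
    trace.append({"request": dict(req), "decision": decision})
    return decision
```

Because the checks are ordinary functions, they can be unit-tested and reviewed like any other code, which is what makes the boundary governable rather than aspirational.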

This pattern will not make agents perfectly safe. It will make them more governable. It turns safety from a paragraph in a prompt into a system boundary that can be tested, monitored, audited, and improved.

The practical standard is simple: trust agents with proposals before trusting them with consequences. Let the model reason. Let the execution firewall decide what is allowed to become real.

