A chatbot can be wrong in a familiar way: it says something false, omits a caveat, or gives a weak recommendation. An agent can be wrong with a shell, a browser, a repository, a credential, a queue, a workflow engine, and a production-adjacent system in reach.
That changes the safety question.
For serious agent deployments, “Do we trust the model?” is too vague to be useful. The better question is: what is the maximum credible damage if the agent tries too hard, gets poisoned by context, or fails halfway through a multi-step workflow?
That maximum should not be implied by vibes, user approval prompts, or whatever credentials happened to be available on the machine. It should be designed. Every agent that can take consequential action needs a blast-radius budget.
Approval prompts are not a control plane
Anthropic’s engineering writeup, [“How we contain Claude across products”](https://www.anthropic.com/engineering/how-we-contain-claude), is valuable because it says the quiet part out loud: as agents become more useful, their potential blast radius grows. Anthropic frames risk as two different variables: how likely failure is, and how much damage failure can do.
Most teams obsess over the first variable. They add a stronger model, a better system prompt, another refusal rule, or an approval dialog before “dangerous” actions. Those can help, but they do not answer the blast-radius question.
The approval-dialog pattern is especially weak as a primary defense. Anthropic reports that users approved roughly 93% of permission prompts in Claude Code telemetry. Repeated prompts train people to become the mechanical part of the system; after enough interruptions, the user is not evaluating risk, just clearing modal debt.
A prompt can slow an action. It does not define what the agent is physically able to reach. If the agent has the whole repo, network, ambient credentials, and a broad cloud role, the blast radius has already been chosen.
Containment changes what is possible
The strongest controls are boring in exactly the right way: sandboxes, virtual machines, filesystem boundaries, network egress controls, allowlists, scoped credentials, proxies, and per-session workspaces.
These are not “friction.” They are the budget line.
Anthropic’s containment argument is that model-layer protections influence what an agent tends to do, while environment-layer protections determine what the agent can do. That distinction matters. A system prompt can say “do not exfiltrate credentials.” A sandbox that never receives those credentials makes exfiltration impossible through that path.
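One way to picture "never receives those credentials" is an environment allowlist applied before a task process starts. The sketch below is a minimal illustration and not Anthropic's implementation. The names `ALLOWED_ENV_VARS`, `build_sandbox_env`, and `run_in_scrubbed_env` are hypothetical, and scrubbing environment variables is only one layer: it does not replace a real sandbox or VM.

```python
import subprocess

# Hypothetical allowlist: the only variables this task is assumed to need.
ALLOWED_ENV_VARS = frozenset({"PATH", "LANG", "HOME"})


def build_sandbox_env(
    parent_env: dict[str, str],
    allowed: frozenset[str] = ALLOWED_ENV_VARS,
) -> dict[str, str]:
    """Return a copy of parent_env that keeps only allowlisted variables.

    Anything not named in the allowlist (cloud keys, API tokens) is dropped,
    so the child process cannot leak it through this path.
    """
    return {name: value for name, value in parent_env.items() if name in allowed}


def run_in_scrubbed_env(command: list[str], parent_env: dict[str, str]) -> str:
    """Run a command with the scrubbed environment and return its stdout."""
    result = subprocess.run(
        command,
        env=build_sandbox_env(parent_env),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The design choice is to allowlist rather than blocklist: a new secret added to the host later is excluded by default instead of leaking until someone remembers to block it.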
This is the operational mindset teams need to bring to agents:
- If a task only needs a working copy, the agent should not see the whole filesystem.
- If a task only needs package registry access, it should not get broad internet egress.
- If a workflow only needs read-only issue metadata, it should not receive a write-capable project token.
- If external content is loaded into context, it should be treated as hostile input even when the connector itself is approved.
That last point is easy to miss. Anthropic notes that an audited connector is not the same as audited data. A GitHub connector can be legitimate while a README pulled through it is malicious. The attack surface is not just the tool. It is every piece of content the tool can place into the agent’s reasoning loop.
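The boundaries in the list above can be expressed as checks that run outside the model. The following is a rough sketch under stated assumptions: the workspace path, the host allowlist, and the `ExternalContent` record are all hypothetical, and a real deployment would enforce these limits in the sandbox or network layer rather than in application code alone.

```python
from dataclasses import dataclass
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical per-session workspace and egress allowlist.
WORKSPACE = Path("/tmp/agent-session-1234")
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org"})


def path_in_workspace(candidate: str, workspace: Path) -> bool:
    """True only if candidate resolves to a location inside workspace.

    Resolving first defeats '..' segments and absolute paths.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    resolved = (workspace / candidate).resolve()
    return resolved.is_relative_to(workspace.resolve())


def host_allowed(url: str, allowed_hosts: frozenset[str]) -> bool:
    """True only if the URL's exact hostname is on the allowlist."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts


@dataclass(frozen=True)
class ExternalContent:
    """Content fetched through a connector, marked untrusted by default.

    The label does not neutralize the content. It only lets downstream
    policy treat the text as data rather than as instructions.
    """
    source: str
    text: str
    trusted: bool = False
```

Matching the exact hostname matters: a substring or prefix match would accept a lookalike such as `pypi.org.evil.example`.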
Monitor the trace, not just the request
Containment is necessary, but it is not the whole operating model. Once agents are doing real work, teams need to observe what happened with enough detail and speed to matter.
OpenAI’s post, [“How we monitor internal coding agents for misalignment”](https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/), describes a low-latency internal monitoring system for coding-agent interactions. The system reviews conversations, tool calls, outputs, and actions for behavior inconsistent with user intent or internal security and compliance policies. In the described version, reviews happen within 30 minutes of completion, with an expected move toward nearer-real-time monitoring and possible pre-execution blocking for the highest-risk actions.
The lesson is not that every company needs OpenAI’s exact monitoring stack. It is that agent safety moves from static policy into operational telemetry. A mature deployment should know which commands ran, which files changed, which external content entered context, which credential scope was available, and whether the agent worked around a restriction to satisfy the user’s goal.
OpenAI reports that its monitor flagged every interaction employees independently escalated, and also notes that models can be overly eager to work around restrictions in pursuit of a user-specified goal. That is the practical failure mode: not movie-villain scheming, but helpful overreach.
You do not catch that with a launch checklist. You catch it with traces.
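As an illustration of what a reviewable trace might carry, here is a minimal sketch of a trace record and a rule-based reviewer. It is not OpenAI's monitor. The field names, the `RISKY_MARKERS` list, and the `:write` scope suffix convention are assumptions made for this example, and a production monitor would likely combine rules like these with model-based review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One recorded agent action plus the context it ran in."""
    session_id: str
    command: str
    files_changed: tuple[str, ...] = ()
    external_sources: tuple[str, ...] = ()   # content that entered context
    credential_scopes: tuple[str, ...] = ()  # scopes available at the time
    denied_before: bool = False              # a similar action was blocked earlier


# Hypothetical command fragments that often indicate bypassing a guardrail.
RISKY_MARKERS = ("--no-verify", "chmod 777", "base64 -d")


def review(event: TraceEvent) -> list[str]:
    """Return human-readable findings for one event (empty list if none)."""
    findings = []
    for marker in RISKY_MARKERS:
        if marker in event.command:
            findings.append(f"risky command fragment: {marker}")
    if event.denied_before:
        findings.append("retry after a denied action (possible workaround)")
    has_write_scope = any(s.endswith(":write") for s in event.credential_scopes)
    if event.external_sources and has_write_scope:
        findings.append("untrusted content in context while a write scope was available")
    return findings
```

For example, `review(TraceEvent(session_id="s1", command="git commit --no-verify -m wip", denied_before=True))` returns two findings: the risky fragment and the retry after denial.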
Multi-agent failures start before the final answer
The blast-radius budget also has to cover workflow shape. As soon as teams build multi-agent systems, the final response is often the least useful place to debug.
The PROTEA paper, [“Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows”](https://arxiv.org/abs/2605.18032), describes the problem well: failures can originate in subtle intermediate outputs and propagate downstream. By the time the final answer is wrong, the real defect may be three nodes upstream.
PROTEA’s proposed answer is graph-level evaluation: execute the workflow offline, score intermediate node outputs with configurable rubrics, overlay states and rationales on the workflow graph, and support targeted prompt revisions. The reported production-adjacent examples improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.
For operators, the exact tool matters less than the pattern. A serious agent workflow needs intermediate assertions. If a planner hands bad assumptions to a researcher, and the researcher hands weak evidence to a writer, and the writer produces polished nonsense, final-answer QA is too late and too coarse.
Blast radius is not only about secrets and shells. It is also about bad intermediate state flowing through a system with confidence.
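The idea of intermediate assertions can be sketched as a pipeline runner that checks each node's output before the next node consumes it. This is not PROTEA's method: simple boolean checks stand in for its configurable rubrics, and the planner, researcher, and writer functions are hypothetical stand-ins for real agents.

```python
from typing import Callable

Node = Callable[[dict], dict]
Check = Callable[[dict], bool]


class IntermediateCheckFailed(Exception):
    """Raised when a node's output fails its check; carries the node name."""

    def __init__(self, node_name: str) -> None:
        super().__init__(f"intermediate check failed at node '{node_name}'")
        self.node_name = node_name


def run_pipeline(nodes: list[tuple[str, Node, Check]], state: dict) -> dict:
    """Run nodes in order, stopping at the first failed intermediate check."""
    for name, node, check in nodes:
        state = node(state)
        if not check(state):
            raise IntermediateCheckFailed(name)
    return state


# Hypothetical stand-ins for real agents.
def planner(state: dict) -> dict:
    return {**state, "plan": ["find sources"]}


def researcher(state: dict) -> dict:
    return {**state, "evidence": []}  # simulates a weak intermediate output


def writer(state: dict) -> dict:
    return {**state, "draft": "polished text"}


PIPELINE = [
    ("planner", planner, lambda s: bool(s.get("plan"))),
    ("researcher", researcher, lambda s: len(s.get("evidence", [])) >= 1),
    ("writer", writer, lambda s: bool(s.get("draft"))),
]
```

Calling `run_pipeline(PIPELINE, {"question": "..."})` raises `IntermediateCheckFailed` at the researcher node, so the writer never sees the empty evidence and the failure is attributed to the node where it started.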
Repeated work should become deterministic execution
There is another way to reduce risk: stop asking the model to improvise the same procedure every time.
The arXiv paper [“Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol”](https://arxiv.org/abs/2605.00827) proposes an MCP-native workflow engine where an agent reasons once to produce a declarative workflow blueprint, then future runs execute through a single `run_workflow` call. In the paper’s Kubernetes CMDB synchronization example, the workflow covered 67 orchestrated steps and reported more than 99% per-execution token-cost reduction with deterministic, idempotent execution and no agent involvement at run time.
That is not just an efficiency story. It is a safety story.
Every live reasoning step is another chance for prompt injection, over-eager tool use, inconsistent interpretation, or accidental privilege crossing. If a procedure is known, tested, and repeated, it should graduate from “agent thinks through 67 tool calls again” to “agent invokes a constrained workflow with typed inputs, bounded outputs, and audit logs.”
Agents are useful for creating and adapting procedures. They do not need to remain inside every loop forever.
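To make "typed inputs, bounded outputs, and audit logs" concrete, here is a small sketch of a blueprint executed without any model in the loop. It is not the paper's engine or its API. Only the name `run_workflow` is borrowed from the description above, while `Step`, `Blueprint`, the registry, and the example tools are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    tool: str        # name of a registered, pre-approved function
    output_key: str  # state key where the step's result is stored


@dataclass(frozen=True)
class Blueprint:
    name: str
    required_inputs: dict[str, type]  # input name -> expected Python type
    steps: tuple[Step, ...]


def run_workflow(
    blueprint: Blueprint,
    inputs: dict[str, Any],
    registry: dict[str, Callable[[dict], Any]],
) -> tuple[dict, list[str]]:
    """Validate inputs, run each step in order, and return (state, audit_log)."""
    unexpected = set(inputs) - set(blueprint.required_inputs)
    if unexpected:
        raise ValueError(f"unexpected inputs: {sorted(unexpected)}")
    for key, expected in blueprint.required_inputs.items():
        if key not in inputs:
            raise ValueError(f"missing input: {key}")
        if not isinstance(inputs[key], expected):
            raise TypeError(f"input '{key}' must be {expected.__name__}")

    state = dict(inputs)
    audit_log = []
    for step in blueprint.steps:
        if step.tool not in registry:
            raise KeyError(f"tool not registered: {step.tool}")
        state[step.output_key] = registry[step.tool](state)
        audit_log.append(step.tool)  # record what ran, in order
    return state, audit_log


# Hypothetical deterministic tools and a two-step blueprint.
REGISTRY = {
    "normalize": lambda s: sorted(n.lower() for n in s["names"]),
    "count": lambda s: len(s["normalized"]),
}
INVENTORY_SYNC = Blueprint(
    name="inventory_sync",
    required_inputs={"names": list},
    steps=(Step("normalize", "normalized"), Step("count", "total")),
)
```

The blueprint can only call tools that are already in the registry, so the set of reachable actions is fixed when the workflow is reviewed, not decided at run time.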
A practical blast-radius checklist
Before giving an agent real access, write down its budget:
1. Maximum action scope: What is the worst action the agent can take without another system stopping it?
2. Filesystem boundary: Which directories are visible, writable, and out of scope?
3. Credential scope: Which tokens exist in the runtime, and are they read-only, write-limited, time-limited, and task-specific?
4. Network boundary: Which domains and services can the agent reach?
5. External-content rule: Which tool outputs, web pages, documents, issues, READMEs, and emails are treated as untrusted instructions?
6. Trace monitoring: Who or what reviews tool calls, file changes, policy violations, and intent mismatches?
7. Intermediate workflow checks: Which nodes are scored before downstream agents consume their output?
8. Deterministic graduation path: Which repeated procedures should move from live reasoning into tested workflows?
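One way to keep the checklist from staying implicit is to record it as data and refuse to deploy while any item is undecided. The sketch below is one possible shape, not a standard schema. The class and field names are hypothetical, and `None` is assumed to mean "not yet decided" while an empty tuple means an explicit "none".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class BlastRadiusBudget:
    """One field per checklist item; None means the question is unanswered."""
    max_action_scope: Optional[str] = None
    writable_paths: Optional[tuple[str, ...]] = None
    credential_scopes: Optional[tuple[str, ...]] = None
    allowed_hosts: Optional[tuple[str, ...]] = None
    untrusted_sources: Optional[tuple[str, ...]] = None
    trace_reviewer: Optional[str] = None
    checked_nodes: Optional[tuple[str, ...]] = None
    workflows_to_graduate: Optional[tuple[str, ...]] = None


def unanswered_items(budget: BlastRadiusBudget) -> list[str]:
    """Return the names of checklist items that have not been decided yet."""
    return [f.name for f in fields(budget) if getattr(budget, f.name) is None]


def ready_to_deploy(budget: BlastRadiusBudget) -> bool:
    """A budget is complete only when every item has an explicit answer."""
    return not unanswered_items(budget)
```

A budget with `credential_scopes=()` counts as answered, because "no credentials" is a deliberate decision. Leaving the field unset is what blocks deployment.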
The goal is not to make agents harmless by making them useless. The goal is to let them do valuable work inside boundaries that survive fatigue, poisoned context, and helpful overreach. A good blast-radius budget lets the organization say yes to more automation because it has already decided what the automation is not allowed to break.
Sources
- https://www.anthropic.com/engineering/how-we-contain-claude
- https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
- https://arxiv.org/abs/2605.18032
- https://arxiv.org/abs/2605.00827