For most of the last few years, the mental model for an AI assistant was a conversation. You typed something, it answered, and the worst-case outcome of a bad answer was a bad answer. That model is now wrong for a growing class of deployments. The agents teams are shipping today read and write files, call APIs, run shell commands, touch databases, deploy infrastructure, hold memory across sessions, and hand work to other agents. The moment a system can do those things, the interesting question stops being "did it say something unsafe?" and becomes "what was it allowed to do, and who decided?"

A recent line of research takes up that shift, and its most useful framing treats the agent runtime the way we already treat an operating system.

The operating system analogy

[Toward Securing AI Agents Like Operating Systems](https://arxiv.org/html/2605.14932v1) makes the structural argument plainly: open, extensible LLM agents have the same security problems operating systems have always had — resource isolation, privilege separation, communication mediation, and access-boundary enforcement. The paper maps the pieces directly. The LLM behaves like an untrusted user or process. The runtime is the kernel. Tools are syscalls. Skills are programs. Context is memory, files are persistent storage, gateways are network interfaces, and cron or heartbeat mechanisms are schedulers.

The point of the mapping is not analogy for its own sake. It is the conclusion that falls out of it: the LLM should not be trusted to enforce its own security. In an operating system, a process does not get to decide whether it may write to another process's memory; the kernel mediates that. The paper argues an agent runtime should work the same way — least privilege, complete mediation, sandboxing, network controls, isolated secrets, and tamper-resistant logs all living outside the model. The authors evaluate OpenClaw-style agents and find that many current protections fail under modest attacker capabilities, or only hold up when an expert configures them carefully. That is not a comfortable property to ship on.
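To make the kernel analogy concrete, here is a minimal sketch of a runtime that mediates every tool call against a per-session grant. It is not taken from the paper: `Runtime`, `PermissionDenied`, the tool names, and the session name are all hypothetical, chosen only to show where the decision lives.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class PermissionDenied(Exception):
    """Raised by the runtime when a call falls outside the session's grant."""


@dataclass
class Runtime:
    """Hypothetical 'kernel': the only path from model output to a side effect."""
    tools: dict[str, Callable[..., Any]] = field(default_factory=dict)
    grants: dict[str, set[str]] = field(default_factory=dict)  # session -> tool names
    audit: list[tuple[str, str, str]] = field(default_factory=list)

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self.tools[name] = fn

    def call(self, session: str, tool: str, **args: Any) -> Any:
        # Complete mediation: the decision is made here, outside the model,
        # and it is recorded whether or not the call goes ahead.
        allowed = tool in self.tools and tool in self.grants.get(session, set())
        self.audit.append((session, tool, "allow" if allowed else "deny"))
        if not allowed:
            raise PermissionDenied(f"{session!r} may not call {tool!r}")
        return self.tools[tool](**args)


runtime = Runtime()
runtime.register("read_file", lambda path: f"<contents of {path}>")
runtime.register("delete_file", lambda path: f"deleted {path}")

# Least privilege: this task is granted read access and nothing else.
runtime.grants["summarize-task"] = {"read_file"}

result = runtime.call("summarize-task", "read_file", path="notes.txt")

try:
    runtime.call("summarize-task", "delete_file", path="notes.txt")
    denied = False
except PermissionDenied:
    denied = True
```

The model never appears in this code. It can propose `delete_file` as often as it likes; whether the call runs depends only on the grant table, which is the property the OS mapping asks for.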

Why better prompts are not the fix

The reflex when an agent does something unsafe is to write a better instruction. Add a policy paragraph. Strengthen the refusal. Add a final-answer check. This helps less than it feels like it should, because the failure usually is not in the final answer — it is somewhere along the trajectory that produced it.

[Constraint drift](https://arxiv.org/html/2605.10481v1) gives this a precise name. A safety-critical constraint has operational force only when it is available at the point of action, attached to the authority under which the action is proposed, checkable against the action's actual semantic effect, and reconstructable after execution. Constraint drift is the loss of that force across a multi-agent trajectory. The paper catalogs the ways it leaks: memory drift, authority drift, information-flow drift, accountability drift, and utility-induced drift — the last being the familiar pull of an agent quietly trading a constraint away to better accomplish its goal.

The practical implication is sharp. A system that delegates, communicates, uses tools, writes memory, executes code, and adapts over a long workflow cannot be certified by inspecting its final output. The constraint you wrote into the system prompt may simply not be present, in any enforceable form, at the moment the dangerous tool call is made. Final-output safety is the wrong place to look when the unsafe thing happened six steps earlier and the model summarized over it.
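One way to keep a constraint present at the point of action is to carry it on the authority itself rather than in a prompt. The sketch below assumes a hypothetical `Scope` token whose constraints are inherited on every hand-off and whose tool set can only shrink. The names and the path-prefix rule are illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical authority token: what the holder may do, and under which constraints."""
    holder: str
    allowed_tools: frozenset[str]
    forbidden_path_prefixes: frozenset[str]  # constraints that must survive delegation

    def delegate(self, to: str, tools: set[str]) -> "Scope":
        # Attenuation only: the delegate receives the intersection of what was
        # requested and what the parent holds, plus every inherited constraint.
        return Scope(to, self.allowed_tools & frozenset(tools), self.forbidden_path_prefixes)

    def permits(self, tool: str, path: str) -> bool:
        # Checked at the moment of the call, against the call's actual target.
        # Simplification: a real check would normalize the path first.
        if tool not in self.allowed_tools:
            return False
        return not any(path.startswith(p) for p in self.forbidden_path_prefixes)


root = Scope(
    holder="planner",
    allowed_tools=frozenset({"read_file", "write_file"}),
    forbidden_path_prefixes=frozenset({"/etc/", "/secrets/"}),
)

# The worker asks for more than the planner holds; it only gets the overlap.
worker = root.delegate("worker", {"write_file", "run_shell"})

ok_write = worker.permits("write_file", "/tmp/out.txt")
secret_write = worker.permits("write_file", "/secrets/api_key")
shell = worker.permits("run_shell", "/tmp/out.txt")
```

Because the token is frozen and `delegate` can only intersect, authority drift through hand-offs is ruled out by construction in this toy model, and the constraint is evaluated where the action happens instead of six steps earlier.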

What runtime security actually looks like

If the model cannot be trusted to police itself, the enforcement has to be a real layer. [AgentTrust](https://arxiv.org/html/2605.04785v1) is a concrete example of what that layer can do. It is a runtime interception framework for agents that use side-effecting tools — file operations, shell commands, HTTP and network calls, databases, code execution, DevOps and Kubernetes operations, and credential access. Before a proposed tool call executes, it returns a structured verdict: allow, warn, block, or review.

The design details matter more than the verdict labels. AgentTrust includes shell deobfuscation, so an action disguised by string tricks is evaluated for what it really does. It suggests safer alternative actions rather than only refusing. It does session-level, multi-step risk-chain detection, which is the direct countermeasure to a trajectory that looks innocuous step by step but dangerous in aggregate. It offers an optional cache-aware LLM-as-judge fallback, alongside a rule-only path built to run inside interactive agent loops at low-millisecond latency. The paper is careful about its own limits: the framework works only as well as the policy and rule quality behind it, and only within the threat model it was scoped for. That honesty is the right posture. No single framework solves agent security, and treating one as if it does recreates the original mistake of trusting a single layer.
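The shape of a verdict-returning interceptor with session-level state can be sketched in a few lines. This is not AgentTrust's API or rule set: `SessionGuard`, the substring rules, and the example commands are invented for illustration, and the sketch makes no attempt at deobfuscation or safer-alternative suggestions.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


# Hypothetical rules, far cruder than anything a real policy would use.
BLOCKED_SUBSTRINGS = ("rm -rf /", "mkfs")
SENSITIVE_READS = (".env", "id_rsa", "credentials")
NETWORK_COMMANDS = ("curl ", "wget ", "nc ")


class SessionGuard:
    """Keeps per-session state so a sequence of calls is judged, not just each call."""

    def __init__(self) -> None:
        self.read_sensitive = False

    def check(self, command: str) -> Verdict:
        if any(s in command for s in BLOCKED_SUBSTRINGS):
            return Verdict.BLOCK
        if any(s in command for s in SENSITIVE_READS):
            self.read_sensitive = True  # remembered for later calls
            return Verdict.WARN
        if any(command.startswith(n) for n in NETWORK_COMMANDS):
            # Harmless alone; after a sensitive read it resembles exfiltration.
            return Verdict.REVIEW if self.read_sensitive else Verdict.ALLOW
        return Verdict.ALLOW


guard = SessionGuard()
verdicts = [
    guard.check(cmd)
    for cmd in (
        "ls -la",
        "cat .env",
        "curl https://example.com/upload -d @out.txt",
    )
]

# The same network call in a fresh session carries no accumulated risk.
fresh = SessionGuard().check("curl https://example.com/upload -d @out.txt")
```

The third command is escalated to review only because of what preceded it in the session, which is the risk-chain idea in miniature.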

An operational checklist

The OS framing turns into a short list of questions any team deploying agents with side effects should be able to answer:

  • Least privilege: Does each tool, session, and task get only the authority it needs, or does the agent run with broad standing credentials?
  • Complete mediation: Is every side-effecting action checked by the runtime, or only the ones someone remembered to wrap?
  • Sandboxing and network boundaries: Can code execution and outbound calls reach only what they are supposed to reach?
  • Explicit authority on delegation: When one agent hands work to another, does the authority travel as an explicit scope or token, or does it silently widen?
  • Secret isolation: Are credentials kept out of model context entirely, so a prompt injection cannot exfiltrate what the model never held?
  • Tamper-resistant, reconstructable logs: After an incident, can you replay what was proposed, under what authority, and what the runtime decided — without relying on the model's own account?
  • Risk-chain interception: Does anything watch for dangerous sequences, not just individual dangerous calls?

If the answer to several of these is "the model is instructed not to," the system is running on trust it has not earned.
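The logging item is the easiest to prototype. A common technique, used here as an assumption rather than something the cited papers prescribe, is a hash chain: each entry commits to the one before it, so a later edit is detectable. The record fields below are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain from the start; any edited entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"session": "s1", "tool": "read_file", "scope": "read-only", "verdict": "allow"})
append_entry(log, {"session": "s1", "tool": "delete_file", "scope": "read-only", "verdict": "block"})

intact_before = verify(log)

# Someone rewrites history to hide what was proposed and decided.
log[1]["record"]["verdict"] = "allow"
intact_after = verify(log)
```

A chain like this is tamper-evident rather than tamper-proof: anyone able to rewrite the whole log can recompute every hash. Detection depends on storing the latest hash somewhere the agent and its host cannot modify.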

This is how agents become deployable

None of this is an argument against agents. It is the opposite. The reason to mediate authority at the runtime is the same reason operating systems do it: it is what lets untrusted, useful processes run on shared, valuable infrastructure without each one being a trust fall. Teams are already moving from chat interfaces to systems that modify files, call APIs, schedule tasks, use credentials, and coordinate with other agents. The default safety story — better prompts, better refusals, better final-answer evaluations — is too model-centered for that reality.

The stronger story is structural. Treat model output as a request from an untrusted actor. Put the enforcement in the runtime. Make the constraint present at the point of action, not asserted at the top of a prompt. That is not a brake on agent adoption. It is the precondition for adopting agents in anything that matters.
