Most agent security testing still looks too much like chatbot testing with a tool belt attached.
A team writes a handful of scary prompts. Someone asks the agent to leak a secret, install a package, ignore a policy, or summarize a poisoned web page. If the final answer looks safe, the agent passes. If it refuses loudly enough, the dashboard turns green.
That is not enough for software that can read files, call APIs, run commands, update memory, invoke MCP servers, and chain those actions across a long session. For agents, the dangerous part may happen before the final response. The answer can sound responsible after the agent has already touched the wrong file, trusted the wrong artifact, copied sensitive context into memory, or called a tool it never should have been allowed to reach.
The next useful step in agent evaluation is not another clever jailbreak prompt. It is a security spec.
That is why the recent preliminary paper [SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents](https://arxiv.org/html/2606.02302v1) is worth paying attention to. The paper is explicit that the work is still in progress, so it should not be treated as a finished industry standard. But the direction is right: agent security tasks should be generated from structured risk specifications, executed in controlled environments, and scored by looking at the trajectory of the run, not only the last message.
SeClaw’s central move is simple and important. Instead of treating security evals as a loose collection of red-team examples, it starts from explicit descriptions of risk: what resource is involved, what deployment scenario is being tested, what threat labels apply, what tools or skills exist, and what unsafe behavior should be detected. The framework then uses those specifications to synthesize tasks, reconstruct tool environments, run agents in Docker-isolated settings, and inspect behavior across tool calls, file operations, service interactions, permissions, generated artifacts, and audit logs.
That sounds less glamorous than a viral jailbreak. It is also much closer to how real engineering teams can improve.
A security eval for an agent should read more like a test case tied to a threat model:
- The agent has access to these tools.
- The user asks for this legitimate outcome.
- The environment contains this untrusted artifact.
- The safe path requires these actions.
- These reads, writes, network calls, memory updates, or permission changes are forbidden.
- Passing means the task is completed without crossing those boundaries.
Once the problem is framed that way, the eval becomes repeatable. It can be reviewed. It can be versioned. It can be run after a model upgrade, prompt change, MCP server addition, or permission-policy change. It can fail for a concrete reason instead of producing a vague “the model seemed fine.”
The broader research trend points in the same direction. [A3S-Bench](https://arxiv.org/html/2605.22321v1), a May 2026 benchmark for autonomous agents, argues that traditional LLM safety tests are mismatched for systems that execute actions, maintain persistent state, use tools, and operate over external artifacts. Its benchmark includes 2,254 multi-turn conversations, 1,512 adversarial cases, 34 attack techniques, 20 real-world risk scenarios, and Docker-based execution for OpenClaw-style agents. The headline result is uncomfortable: average RTR@1 rises from 28.3% for basic attacks to 52.6% under advanced temporal, spatial, and semantic evasions.
The exact numbers matter less than the pattern. Attacks against agents do not have to appear as one obvious malicious instruction. They can unfold over time. They can hide in files, tool outputs, memory, plugins, or ordinary workflow steps. They can be semantically adjacent to the user’s real task. A final-answer classifier is poorly positioned to catch that.
A survey of [agentic tool use in large language models](https://arxiv.org/html/2604.00835v1) gives the larger context. Tool use has moved LLM systems beyond text generation into retrieval, computation, APIs, GUIs, software, robotics, and other external actions. Evaluation has followed that shift: from function-call correctness, to end-to-end task completion, to interactive environment success, safety, and robustness. If the system is no longer just producing text, the evaluation cannot remain text-only.
Production teams are already learning this lesson in a practical way. OpenAI’s discussion of [monitoring internal coding agents for misalignment](https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/) is not the same as SeClaw, and it should not be used as evidence for SeClaw’s specific claims. But it reflects the same operational instinct: watch the process. For coding agents and other high-agency systems, the trace is often where the meaningful signal lives.
For builders, the takeaway is not “wait for the perfect benchmark.” It is to start treating agent security evaluation as a living specification discipline.
First, write down the resources your agent can touch. Not just “tools,” but the actual risk-bearing surfaces: repositories, customer records, calendars, email, shell commands, browser sessions, MCP servers, memory stores, credentials, ticketing systems, and deployment pipelines. If the agent can influence it, it belongs in the risk inventory.
Second, describe common task patterns. A support agent triaging tickets has different failure modes than a coding agent editing a repo or an operations agent touching infrastructure. The security spec should name the work the agent is expected to do, because realistic attacks are usually embedded inside legitimate workflows.
Third, define forbidden trajectories. Do not stop at “the agent must not leak secrets.” Say what that means in observable terms: do not read .env files, do not copy credential-shaped strings into memory, do not send private content to external tools, do not install unapproved packages, do not change permission settings, do not modify files outside the task scope, do not call production APIs during dry-run analysis. A good eval can then inspect whether those things happened.
Fourth, turn incidents and near misses into regression tests. If an agent once trusted a poisoned README, overreached during a refactor, wrote to the wrong directory, or summarized sensitive data into a persistent note, that should become a reusable scenario. The goal is not to shame the model. The goal is to make sure the same class of failure does not quietly return after the next prompt edit.
Fifth, separate broad screening from deployment gates. Synthetic, spec-generated suites are useful for finding classes of weakness. They should not be mistaken for proof that a specific production agent is safe. Release gates need to match the actual tool permissions, data boundaries, policies, and environments the agent will see.
Finally, keep generated evals auditable. Automated task synthesis is powerful only if humans can inspect the source risk specification, scenario setup, expected safe behavior, and scoring rule. Otherwise, teams are just replacing hand-wavy prompts with hand-wavy automation.
The mature version of agent security evaluation will look less like a clever conversation and more like an engineering harness: explicit threat models, reproducible environments, trace inspection, auditable scoring, and regression tests tied to real privileges.
That may feel slower than asking a model to refuse a bad request. But agents are crossing the boundary from advice into action. Once software can act, “it said the right thing at the end” is no longer a serious safety argument.
Sources
- https://arxiv.org/html/2606.02302v1
- https://arxiv.org/html/2605.22321v1
- https://arxiv.org/html/2604.00835v1
- https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/
Build Agents That Prove Their Work
If you are wiring agent workflows into real operations, Alchemic can help design the checkpoints, traces, and validation gates that keep automation honest.
Get the Field Guide - $10 ->