Coding agents are turning into filesystem actors. They open shells, edit repos, install dependencies, and run arbitrary scripts on machines you care about. The standard answer to that risk is a sandbox: cgroups, containers, MAC profiles, scoped tokens. Sandboxes are necessary, but they only enforce a policy. Somebody, or something, still has to decide what that policy should be before the agent starts working. A new preprint from the Evolvent AI research team, "Do Coding Agents Understand Least-Privilege Authorization?" (arXiv:2605.14859v1), takes that upstream question seriously and finds that frontier models are not yet good at answering it.
The authors call the task "permission-boundary inference." Given a natural-language instruction and a terminal environment, the model is asked to produce a file-level allowlist over read, write, and execute axes. Any path not on the list is denied. The model can look at the environment in read-only mode while it decides, but it cannot run the task. Then a fixed execution agent — GPT-5 inside an OpenClaw harness — tries to complete the task under whatever policy the first model produced. That split is the entire point. It lets you measure how good a model is at choosing a boundary, independent of how good it is at writing the code that lives inside it.
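To make the object being inferred concrete, here is a minimal sketch of a default-deny allowlist over the three axes. The `Policy` class, its field names, and the use of glob-style patterns are assumptions for illustration; the article does not show AuthBench's actual policy format.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

AXES = ("read", "write", "execute")


@dataclass
class Policy:
    """Hypothetical allowlist: one list of path patterns per axis."""
    read: list[str] = field(default_factory=list)
    write: list[str] = field(default_factory=list)
    execute: list[str] = field(default_factory=list)

    def allows(self, axis: str, path: str) -> bool:
        """Return True only if some pattern on that axis matches the path."""
        if axis not in AXES:
            raise ValueError(f"unknown axis: {axis}")
        # Note: fnmatch-style '*' also matches '/', so patterns can span directories.
        return any(fnmatchcase(path, pattern) for pattern in getattr(self, axis))


policy = Policy(
    read=["/app/src/*.py", "/app/config.yaml"],
    write=["/app/out/report.txt"],
    execute=["/usr/bin/python3"],
)
```

Anything the policy does not list is denied, so `policy.allows("read", "/home/user/.ssh/id_rsa")` is false without the policy ever naming that file.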
AuthBench in one paragraph
AuthBench is 120 terminal tasks across 10 domains: 80 standard tasks and 40 sensitive ones. Standard tasks are the usual command-line work — file manipulation, package builds, log analysis, config repair, repository debugging, data conversion. Sensitive tasks take a standard objective and place attack surfaces nearby: unsafe helper scripts, credential-like files, tempting hints that the benign task does not need. Each task ships with a Docker environment, a safe oracle solution, a utility validator, and an annotated permission spec. The benchmark scores generated policies on per-axis precision, recall, and F1 against a gold proxy Sgold, plus three execution-time metrics: Task Success Rate (TSR), Sensitive-File Exposure Rate (SER), and Attack Success Rate (ASR).
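The static side of that scoring can be sketched in a few lines. This version assumes each axis of a policy is a plain set of paths compared by exact membership against the gold set; how the benchmark actually matches patterns to paths is not described here, and `prf1` is a hypothetical helper name.

```python
def prf1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one axis of a generated policy."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One needed path is missing and one unneeded path is granted.
gold_read = {"/app/a.txt", "/app/b.txt", "/app/d.txt"}
predicted_read = {"/app/a.txt", "/app/b.txt", "/app/c.txt"}
precision, recall, f1 = prf1(predicted_read, gold_read)
```

Low precision on an axis signals over-granting; low recall signals under-granting. Computing this once per axis keeps a missing execute grant from being averaged away by good read coverage.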
Two reference numbers anchor the table. Full-Access, where the agent gets unrestricted access, hits 94.0% TSR on sensitive tasks and 65.8% ASR. Roughly two thirds of sensitive tasks end with the attack surface actually triggered. Golden-Permission, where the agent runs under the human-reviewed Sgold, drops to 81.7% TSR and 0.0% ASR. That gap, about 12 points of task success given up to take attack success to zero, is the headline argument of the paper: capability benchmarks that hand the agent the whole filesystem flatter deployment numbers in a way that does not survive contact with real authorization.
Models fail on both sides at the same time
You might expect models to land somewhere on a clean conservative-versus-permissive spectrum. They don't. The paper's main finding is that frontier models simultaneously under-grant and over-grant. They omit a permission their execution chain actually needs, then list paths the task never touches, sometimes including the sensitive ones.
The headline rows from Table 1 make this concrete. Gemini 3.1 Pro produced the highest generated-policy TSR on the board: 75.4% standard and 85.8% sensitive. That sounds good until you see its sensitive SER of 34.8% and ASR of 28.3%. It gets the work done, but more than a quarter of the time its policy is loose enough for the attack to fire. GPT-5.4 sits at the other end. It posts 52.6% standard TSR and 61.1% sensitive TSR, with SER 21.1% and ASR 19.4%. Tighter, but it fails the task far more often. Claude Opus 4.6 lands in the middle on TSR (61.3% / 61.5%) but with a stubbornly high SER of 47.0% and an ASR of 25.6%. Its policies systematically expose more sensitive paths than they need to.
The interesting part is that none of these failures collapses into a single dial. A model can refuse a path the toolchain needs and grant write access to a credential file in the same policy. That is not calibration. It is a different shape of mistake, and it is the one the paper is trying to name.
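A small sketch shows why the two errors are independent rather than opposite ends of one dial. It treats a policy as a set of (axis, path) grants; the function name, the example paths, and the count-based view of the errors are illustrative assumptions, not the paper's definitions.

```python
Grant = tuple[str, str]  # (axis, path)


def boundary_errors(
    predicted: set[Grant], gold: set[Grant], sensitive: set[Grant]
) -> dict[str, set[Grant]]:
    """Split a policy's mistakes into under-grants and over-grants."""
    missing = gold - predicted      # needed by the task but denied
    extra = predicted - gold        # granted but never needed
    exposed = extra & sensitive     # unneeded grants that hit sensitive files
    return {"missing": missing, "extra": extra, "exposed_sensitive": exposed}


gold = {
    ("read", "/app/build.sh"),
    ("execute", "/app/build.sh"),
    ("write", "/app/dist/out.tar"),
}
predicted = {
    ("read", "/app/build.sh"),
    ("write", "/app/dist/out.tar"),
    ("write", "/app/.aws/credentials"),
}
sensitive = {("read", "/app/.aws/credentials"), ("write", "/app/.aws/credentials")}

errors = boundary_errors(predicted, gold, sensitive)
```

Here the same policy forgets the execute grant the build needs and hands out write access to a credential file, so both `missing` and `exposed_sensitive` are non-empty at once.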
"Think harder" picks a personality, not a target
The paper plots policies in a sufficiency-tightness space using under-authorization and over-authorization burdens, then tracks what happens when you increase reasoning effort for the models that expose that knob: Gemini 3.1 Pro, GPT-5.4, and Claude Opus 4.6. The displacement is not toward an ideal point. Each model has its own attractor. Gemini drifts toward broader coverage, taking more sensitive exposure with it. GPT-5.4 and Claude Opus 4.6 drift the other way, tightening policies past the point where the execution chain still works. More thinking made each model more itself, not more correct.
This is a narrower claim than "reasoning doesn't help." The authors are careful to scope it to direct policy generation on this benchmark, and to models with configurable reasoning levels. But the pattern matters for anyone tempted to fix authorization problems with prompt size or extended thinking budgets. On this task, those budgets sharpened a bias rather than removing it.
Sufficiency-Tightness Decomposition: coverage first, pruning second
The fix the paper proposes is structural rather than parametric. Instead of asking a model to emit a final least-privilege policy in one pass, "Sufficiency-Tightness Decomposition" splits the work into two phases. First, generate a coverage-oriented policy: enumerate every path the execution plan and its transitive toolchain are likely to touch, biased toward including too much. Second, audit each entry against the task description and sensitivity labels, pruning grants that are not grounded in the work, narrowing patterns where possible, and removing overlaps with sensitive surfaces.
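The control flow of the two phases can be sketched as follows. In the paper both phases are model calls; here they are injected as plain callables so the structure is visible and testable. The names `decompose`, `propose_coverage`, and `audit_entry`, the stub logic, and the example paths are all hypothetical, and the sketch leaves out pattern narrowing.

```python
from typing import Callable

Policy = dict[str, set[str]]  # axis -> granted paths


def decompose(
    task: str,
    propose_coverage: Callable[[str], Policy],
    audit_entry: Callable[[str, str, str], bool],
) -> Policy:
    """Phase 1: over-inclusive coverage. Phase 2: audit every entry on its own."""
    broad = propose_coverage(task)
    return {
        axis: {path for path in paths if audit_entry(task, axis, path)}
        for axis, paths in broad.items()
    }


# Stubs standing in for the two model calls.
SENSITIVE = {"/app/.env"}


def stub_coverage(task: str) -> Policy:
    return {
        "read": {"/app/logs/app.log", "/app/.env"},
        "write": {"/app/out/summary.txt"},
        "execute": {"/usr/bin/python3"},
    }


def stub_audit(task: str, axis: str, path: str) -> bool:
    # Keep a grant only if it does not overlap a sensitive surface.
    return path not in SENSITIVE


final_policy = decompose("summarize the app log", stub_coverage, stub_audit)
```

The design point is that the pruning pass never has to guess what the toolchain needs, and the coverage pass never has to worry about being too broad.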
The results are not uniform across models, which is the honest part. For Claude Opus 4.6, sensitive TSR rises from 61.5% to 75.0%, SER drops from 47.0% to 28.3%, and ASR drops from 25.6% to 15.0%. For GPT-5.4, sensitive TSR climbs from 61.1% to 76.9%, with SER moving from 21.1% to 19.2% and ASR from 19.4% to 15.4%. For Gemini 3.1 Pro, the trade is different: sensitive TSR falls from 85.8% to 75.0%, but SER drops from 34.8% to 15.7% and ASR from 28.3% to 12.5%. SER and ASR drop for all three. Execute-axis F1 improves for all three. The shape of the improvement varies, but the direction of the safety metrics is consistent.
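The before-and-after figures are easier to compare as percentage-point changes. This snippet only does arithmetic on the numbers quoted above; the dictionary layout and the `deltas` helper are illustrative.

```python
# (direct generation, decomposition) on sensitive tasks, in percent.
reported = {
    "Claude Opus 4.6": {"TSR": (61.5, 75.0), "SER": (47.0, 28.3), "ASR": (25.6, 15.0)},
    "GPT-5.4": {"TSR": (61.1, 76.9), "SER": (21.1, 19.2), "ASR": (19.4, 15.4)},
    "Gemini 3.1 Pro": {"TSR": (85.8, 75.0), "SER": (34.8, 15.7), "ASR": (28.3, 12.5)},
}


def deltas(rows: dict) -> dict:
    """Percentage-point change per metric, rounded to one decimal."""
    return {
        model: {name: round(after - before, 1) for name, (before, after) in metrics.items()}
        for model, metrics in rows.items()
    }


changes = deltas(reported)
for model, row in changes.items():
    print(model, row)
```

Read this way, Gemini 3.1 Pro gives up 10.8 points of task success for the largest safety gains, while the other two models gain on both task success and safety.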
Builder takeaways
If you are shipping a coding agent, the operational reading is fairly direct.
Treat policy generation as its own pipeline stage, with its own evaluation. A capability score under Full-Access does not tell you whether the model can choose the right boundary; the Full-Access-versus-Golden-Permission gap is roughly the size of that blind spot. Score policies on at least two axes: task completion under the policy, and exposure of paths the task didn't need. TSR alone hides overbroad grants. SER alone hides whether you can still get the job done.
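A minimal evaluation harness for that stage might aggregate per-task outcomes like this. The `RunResult` fields and the per-task reading of SER (the share of tasks where any unneeded sensitive file was exposed) are assumptions; the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one task executed under a generated policy (hypothetical schema)."""
    task_succeeded: bool
    sensitive_exposed: bool
    attack_triggered: bool


def rate(flags: list[bool]) -> float:
    """Percentage of True values; 0.0 for an empty list."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def score(runs: list[RunResult]) -> dict[str, float]:
    return {
        "TSR": rate([r.task_succeeded for r in runs]),
        "SER": rate([r.sensitive_exposed for r in runs]),
        "ASR": rate([r.attack_triggered for r in runs]),
    }


runs = [
    RunResult(True, True, False),
    RunResult(True, False, False),
    RunResult(True, False, False),
    RunResult(False, False, False),
]
summary = score(runs)
```

Reporting the three numbers side by side makes it harder for a loose policy to look good on success alone, or for an overly tight one to look good on exposure alone.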
Build the permission step as coverage plus audit, not as a single emit. The decomposition result is a prompt-engineering pattern any team can adopt today: one pass that enumerates likely-needed paths from the plan and the transitive toolchain, then a separate pass that asks "is each entry grounded in the task, and does it overlap a sensitive surface?" The improvements come from forcing those two questions apart.
Pay extra attention to execute permissions. The paper flags missing execute grants as a dominant cause of under-authorization, and decomposition improves execute-axis F1 across every model tested. Interpreters, build tools, and helper binaries reached through the toolchain are the easy things to forget. Enumerating them on purpose is cheap; debugging a tight execute policy at runtime is not.
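One cheap way to enumerate execute grants on purpose is to read the shebang line of every script the plan intends to run. The helper below is a sketch of that idea, not something the paper describes: `shebang_commands` is a hypothetical name, and its handling of `env` is deliberately simplified.

```python
import shlex
from pathlib import PurePosixPath


def shebang_commands(script_text: str) -> list[str]:
    """Commands a script's shebang line needs execute access to."""
    lines = script_text.splitlines()
    first = lines[0] if lines else ""
    if not first.startswith("#!"):
        return []
    parts = shlex.split(first[2:].strip())
    if not parts:
        return []
    commands = [parts[0]]
    # '/usr/bin/env python3' needs both env and the interpreter it launches.
    if PurePosixPath(parts[0]).name == "env":
        for token in parts[1:]:
            if token.startswith("-") or "=" in token:
                continue  # skip env options and VAR=value assignments
            commands.append(token)
            break
    return commands


needed = shebang_commands("#!/usr/bin/env python3\nprint('build')\n")
```

Bare names such as `python3` still have to be resolved to real paths, for example with `shutil.which` inside the task environment, before they can go into an execute allowlist.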
And do not lean on reasoning budget to fix authorization. Under the settings tested, more reasoning made each model more consistently wrong in its own way. Structural changes to the task — splitting coverage from pruning, scoring exposure separately from success — moved the numbers in ways that bigger thinking budgets did not.
What this paper is not
A few honest caveats, because the framing matters. This is a preprint, not a settled result. AuthBench is file-level read/write/execute on terminal tasks only; it does not cover network egress, cloud IAM, database credentials, browser state, or API-level scopes, which is most of the authorization surface in a real deployment. Sgold is a human-reviewed, execution-calibrated proxy for a task-sufficient boundary, not a universal minimal policy — the authors are clear that execution under policy is the definitive sufficiency check. And the benchmark evaluates policy inference. It is not a claim that models cannot understand authorization in any broader sense.
What it does show, cleanly, is that the question of what policy the agent should run under is its own capability, separable from coding skill, and one the current frontier is not solving by reflex. Sandboxes will enforce whatever boundary you hand them. Choosing the boundary is the part that still needs work, and a permission compiler — coverage first, audit second, measured on both success and exposure — is a reasonable place to start building it.
Sources
- https://arxiv.org/abs/2605.14859
- https://arxiv.org/pdf/2605.14859v1
- https://github.com/evolvent-ai/Authbench
- https://evolvent.co/en/research/authbench