The most dangerous agent failure is not always the dramatic one. Sometimes it is quieter: the agent answers without making the tool call it needed, calls a tool when the task did not require one, or takes an action whose consequences only become obvious after the run has already moved on. By then the trace may be long, the token bill may be higher, and the mistake may have shaped every later step.
That is why "Beyond the Black Box: Interpretability of Agentic AI Tool Use", a new arXiv paper by Hariom Tatsat and Ariye Shater, is worth paying attention to. The paper is not trying to replace agent evaluations, logging, or policy checks. It asks a more surgical question: can we inspect a model's internal state before it acts and see whether it appears to know that a tool should be used?
For anyone building tool-using agents, that is the right level of discomfort. Most observability today is post-action. We look at traces after the model has already called the API, skipped the lookup, written the file, or sent the message. That helps with debugging, but it does not give the runtime a chance to slow down at the exact decision boundary where the mistake is forming.
What the Paper Actually Builds
The authors propose a mechanistic monitoring layer for repeated tool decisions. They convert multi-step agent trajectories into per-step decision rows. Each row contains the context available before the next action, a label for whether a tool is required, and, for tool-call steps, a low, medium, or high risk label for the next action.
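In code, a row might look something like this. The field names are my own illustration, not the paper's schema, but the shape is the point: one decision per row, labeled before anything executes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecisionRow:
    """One pre-action decision point from an agent trajectory (illustrative schema)."""
    context: str                # everything visible before the next action:
                                # system prompt, user turns, prior tool results
    tool_required: bool         # label for the Tool-Need Probe
    risk_tier: Optional[str]    # "low" | "medium" | "high" for tool-call steps, else None
    next_action: Optional[str]  # tool name actually called, if any (used for replay scoring)

rows = [
    DecisionRow(
        context="User: What is the current EUR/USD rate?",
        tool_required=True,
        risk_tier="low",          # read-only lookup
        next_action="get_fx_rate",
    ),
    DecisionRow(
        context="User: Thanks, that answers it.",
        tool_required=False,
        risk_tier=None,
        next_action=None,
    ),
]
```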
Then they inspect model state before execution. The framework extracts pre-action hidden states from GPT-OSS 20B and Gemma 3 27B instruction-tuned models, mean-pools the last 32 pre-action tokens, maps those activations through Sparse Autoencoders, and trains sparse linear probes over the resulting features.
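Here is a minimal sketch of that extraction path, assuming you already have a trained SAE encoder to supply as a weight matrix. The model id, layer index, pooling window, and probe hyperparameters are placeholders for illustration, not the paper's exact configuration.

```python
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "openai/gpt-oss-20b"   # placeholder id; swap for the checkpoint you actually monitor
LAYER = 23                     # one of the later layers the paper finds informative for GPT-OSS
POOL_LAST = 32                 # mean-pool the last 32 pre-action tokens

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

def pre_action_features(context: str, sae_encoder: np.ndarray, sae_bias: np.ndarray) -> np.ndarray:
    """Mean-pooled hidden state at LAYER, pushed through a pretrained SAE encoder."""
    ids = tok(context, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    hidden = out.hidden_states[LAYER][0]                 # [seq_len, d_model]
    pooled = hidden[-POOL_LAST:].mean(dim=0).float().cpu().numpy()
    # Placeholder SAE encode: ReLU(W_enc @ x + b). Substitute your actual trained SAE.
    return np.maximum(sae_encoder @ pooled + sae_bias, 0.0)

def fit_tool_need_probe(X: np.ndarray, y: np.ndarray) -> LogisticRegression:
    """X: [n_steps, n_sae_features] stacked features; y: tool_required labels."""
    probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, max_iter=2000)
    probe.fit(X, y)
    return probe
```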
There are two probes. The Tool-Need Probe predicts whether the current step requires a tool call. The Tool-Risk Probe predicts whether a tool action looks low, medium, or high risk. Low risk is mostly read-only retrieval and lookup. Medium risk covers bounded write or creation operations. High risk includes authentication, outbound communication, account changes, shell execution, and similarly consequential actions.
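The tiers themselves are a labeling choice. A toy mapping, with tool names I made up for illustration, might look like the following; the real version for your stack is a policy decision, not something the probe should own.

```python
# Illustrative risk-tier mapping in the spirit of the paper's scheme.
RISK_TIER = {
    # low: read-only retrieval and lookup
    "search_docs": "low",
    "get_weather": "low",
    "lookup_order": "low",
    # medium: bounded write or creation operations
    "create_calendar_event": "medium",
    "write_temp_file": "medium",
    # high: authentication, outbound communication, account changes, shell execution
    "reset_password": "high",
    "send_email": "high",
    "update_billing_account": "high",
    "run_shell_command": "high",
}

def risk_of(tool_name: str) -> str:
    # Unknown tools default to high until someone classifies them explicitly.
    return RISK_TIER.get(tool_name, "high")
```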
That framing matters. The paper is not merely asking whether a model can name the right function. It separates three questions that production systems often collapse together: should the agent call anything, what kind of action is it about to take, and how much consequence does that action carry?
The Useful Result: Tool Need Is Visible
On held-out Nemotron function-calling data, the Tool-Need Probe reaches 75.3% accuracy on GPT-OSS 20B and 71.4% on Gemma 3 27B. Macro-F1 lands at 0.75 and 0.71 respectively. Those numbers are not magic, and they are not enough to hand control to the probe. But they are strong enough to be operationally interesting because the signal appears before the external action.
The Tool-Risk Probe performs better on headline accuracy, with 90.3% for GPT-OSS and 88.5% for Gemma on held-out tool rows. The paper is careful about that result, though. Risk prediction is more dependent on the risk-tier labeling scheme, and the macro-F1 scores, 0.64 and 0.62, tell the more sober story. The model is much better at recognizing broad low-risk patterns than cleanly separating every medium and high-risk case.
That distinction is the practical takeaway. Tool-Need looks like the more stable first deployment target. Tool-Risk is useful, but it should be treated as an advisory layer around an explicit permissions model, not as a standalone safety system.
Why This Is Different From Logs
A log tells you what happened. A benchmark score tells you how often the system behaved correctly under a test harness. This paper is after something different: a warning light before the action.
The authors show that tool-decision information concentrates in later monitored layers. For GPT-OSS, dominant features appear around layers 23 and 19. For Gemma, they appear around layers 40, 53, and 31. The representative Tool-Need features are associated with numerical data, formal language, measurements, and similar contexts. The Tool-Risk features lean toward authentication, passwords, account management, policy language, and other higher-consequence concepts.
That is not proof that the model has a neat little internal variable named should_call_tool. It does suggest that the decision boundary leaves a readable pattern in the model's state. The ablation results strengthen that point. When the authors zero out high-ranked sparse features, GPT-OSS top-10 ablation flips 4 out of 10 held-out examples and shifts the tool probability by a mean absolute 0.431. Gemma shifts more gradually, with top-200 ablation flipping 1 out of 10 examples and shifting probability by 0.146.
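A sketch of that ablation loop, assuming a fitted sparse probe and an SAE feature matrix for held-out rows. Ranking features by probe coefficient magnitude is one reasonable choice for "high-ranked", not necessarily the paper's exact ranking.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ablation_effect(probe: LogisticRegression, X: np.ndarray, top_k: int = 10) -> dict:
    """Zero out the top_k features (ranked by |coefficient|) and measure how the
    tool-need probability moves on held-out rows X."""
    coef = probe.coef_[0]
    top_idx = np.argsort(np.abs(coef))[::-1][:top_k]

    p_before = probe.predict_proba(X)[:, 1]
    X_ablated = X.copy()
    X_ablated[:, top_idx] = 0.0
    p_after = probe.predict_proba(X_ablated)[:, 1]

    flips = int(np.sum((p_before >= 0.5) != (p_after >= 0.5)))
    return {
        "mean_abs_shift": float(np.mean(np.abs(p_before - p_after))),
        "flips": flips,
        "n": len(X),
    }
```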
Ablation is important because it pushes the evidence beyond correlation. If removing the identified features changes the probe output, the features are doing real work inside the monitor's readout.
The Runtime Lesson
The most actionable section is the held-out Nemotron replay. For GPT-OSS, the replay shows 78.6% step accuracy. The missed-tool-call rate is 34.2% among tool-required steps, while the unnecessary-call rate is 7.7% among no-tool-required steps. Once the model does decide to call a tool, tool-naming accuracy is 90.8%.
That is a blunt operational lesson: the hard part is often deciding to delegate at all. In other words, the agent may know how to call the tool once it commits, but fail before that by acting as if no tool is needed.
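If you replay your own trajectories, it is worth computing these rates with the same denominators. A small sketch, with illustrative field names:

```python
def replay_metrics(steps):
    """steps: iterable of dicts with keys 'tool_required' (gold label), 'called' (bool),
    'gold_tool' and 'called_tool' (names or None). Field names are illustrative."""
    required = [s for s in steps if s["tool_required"]]
    not_required = [s for s in steps if not s["tool_required"]]
    called = [s for s in steps if s["called"]]

    missed = sum(not s["called"] for s in required) / max(len(required), 1)
    unnecessary = sum(s["called"] for s in not_required) / max(len(not_required), 1)
    naming = sum(s["called_tool"] == s["gold_tool"] for s in called) / max(len(called), 1)

    return {
        "missed_tool_call_rate": missed,        # denominator: tool-required steps
        "unnecessary_call_rate": unnecessary,   # denominator: no-tool-required steps
        "tool_naming_accuracy": naming,         # denominator: steps where a call was made
    }
```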
This matches real deployments. A bad tool name is usually visible. A missing tool call can look like confidence. The agent gives an answer, the trace looks clean enough, and the failure only appears when a human notices the answer was never grounded in the system of record.
A pre-execution Tool-Need monitor could become a useful friction point. If the agent is about to answer directly while internal state suggests a tool is needed, the runtime can ask for a second pass, require retrieval, lower confidence, or route the step through stricter validation.
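A minimal sketch of that friction point. The thresholds and response options are my choices to illustrate the shape, not values from the paper, and they should be tuned on your own replay data.

```python
from enum import Enum

class Verdict(Enum):
    PROCEED = "proceed"
    FORCE_RETRIEVAL = "force_retrieval"
    MARK_UNVERIFIED = "mark_unverified"

def tool_need_gate(about_to_call_tool: bool, p_tool_need: float,
                   force_threshold: float = 0.8, warn_threshold: float = 0.6) -> Verdict:
    """Pre-execution check. p_tool_need is the probe's probability that this step
    requires a tool. Thresholds are illustrative placeholders."""
    if about_to_call_tool:
        return Verdict.PROCEED            # the omission failure mode does not apply here
    if p_tool_need >= force_threshold:
        return Verdict.FORCE_RETRIEVAL    # answering directly while state says "tool needed"
    if p_tool_need >= warn_threshold:
        return Verdict.MARK_UNVERIFIED    # let it answer, but flag for grounding checks
    return Verdict.PROCEED
```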
Transfer Helps, But It Is Not Free
The authors also test zero-shot transfer on BFCL. They do not train, retrain, tune thresholds, or calibrate on BFCL activations. Instead, they map BFCL multi-turn episodes into the same step-level pre-action format and evaluate transfer.
Probe-1 agreement between the gold expected labels and the internal prediction reaches 87.7% for GPT-OSS and 77.7% for Gemma. That is promising because the monitor is not only memorizing one dataset's surface form. Still, the paper is careful here too. BFCL changes the task distribution and the instruction-following demands. It is a transfer stress test, not proof that one probe will generalize cleanly across every tool namespace.
This is the part builders should not skip. Internal monitors need their own evals. If your tool set includes CRM writes, billing actions, shell commands, email sends, and medical-record lookups, a risk scheme trained around another benchmark's labels is only a starting hypothesis. The permissions model still has to be explicit.
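Concretely, that means scoring the probe against your own labeled decision rows before trusting it. A tiny eval sketch using scikit-learn metrics:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate_probe(probe, X_eval, y_eval):
    """Score a monitor probe on your own labeled decision rows, not just the benchmark
    it was trained on. Macro-F1 matters more than raw accuracy when one class
    (for example, low-risk) dominates."""
    preds = probe.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, preds),
        "macro_f1": f1_score(y_eval, preds, average="macro"),
    }
```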
How I Would Use This Tomorrow
The pragmatic version is simple: put the monitor at high-value boundaries, not everywhere.
Start with omission auditing. Before the agent answers without a tool, ask whether the internal signal suggests retrieval or computation was needed. If yes, do not automatically trust the answer. Force a tool call, run a grounding check, or mark the response as unverified.
Next, use risk tiering as a second signal around actions with external consequence. Authentication, outbound messages, account changes, file writes, and command execution already deserve policy gates. A Tool-Risk Probe can add context, but it should not replace deterministic authorization.
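A sketch of that layering, where the deterministic policy decides and the probe only escalates. Tool names, tiers, and return codes are illustrative.

```python
def authorize_action(tool_name: str, allowlisted: set,
                     static_tier: str, probe_tier: str) -> str:
    """Deterministic policy decides; the Tool-Risk Probe only adds friction.
    Tiers are 'low' | 'medium' | 'high'."""
    if tool_name not in allowlisted:
        return "deny"                      # explicit permissions model comes first
    if static_tier == "high":
        return "require_human_approval"    # policy gate for consequential actions
    if probe_tier == "high" and static_tier != "high":
        return "require_second_pass"       # advisory: internal state looks riskier than the static tier
    return "allow"
```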
Finally, keep the normal stack. You still need typed tools, allowlists, scoped credentials, logging, replay tests, eval suites, and rollback. The point of this paper is not that interpretability solves agent safety. The point is that interpretability may finally be useful at the place agent systems most need it: the moment before action.
The Bottom Line
The paper's strongest claim is narrow and useful: tool decisions leave readable traces in model state before execution. The Tool-Need signal is the clearest win, especially for catching missing tool calls in long-horizon workflows. The Tool-Risk signal is promising, but more sensitive to how risk labels are defined.
That is enough to change how we think about agent observability. We should stop treating traces as only a postmortem artifact. For serious workflows, the runtime needs a warning light before the tool fires. This paper sketches one credible way to build it.
Sources
- https://arxiv.org/abs/2605.06890
- https://arxiv.org/html/2605.06890
- https://doi.org/10.48550/arXiv.2605.06890
Build Agents That Prove Their Work
If you are wiring agent workflows into real operations, Alchemic can help design the checkpoints, traces, and validation gates that keep automation honest.
Get the Field Guide - $10 ->