AI agents are being deployed into high-stakes environments faster than the infrastructure to evaluate them has matured. Teams ship agents, run evaluations, and collect millions of interactions—then face the same open-ended question: what do we actually do with all of this? The answer, in most organizations, is improvised. One team writes a Python script to grep for error keywords. Another manually reads a hundred conversations and calls it a qualitative review. A third relies on an LLM to summarize a sample and hopes for the best. These approaches are understandable, but they are not rigorous, and they are not comparable across teams or over time.
A new preprint (arXiv:2604.09563) from leading AI safety institutions attempts to change that. The paper proposes a seven-step pipeline methodology for analyzing AI system interaction logs—a structured, reproducible framework that takes researchers from the moment they define a research question all the way through statistical analysis of findings. Rather than treating log analysis as a post-hoc afterthought, the methodology positions it as a first-class scientific discipline with defined steps, validation gates, and reporting standards.
What the Pipeline Does
The framework is built around seven sequential stages. Each step has a defined purpose, explicit deliverables, and a clear connection to the steps that precede and follow it.
The pipeline begins with defining the purpose: articulating the primary research question (and any secondary questions) and building a precise mental model of the log context—task design, model or agent setup, environment configuration, and the prompts being evaluated. This step is foundational. A vague question produces a vague analysis. A clear question with a defined scope is what makes everything downstream tractable and interpretable.
The second stage is preparing the database of logs. Raw interaction logs are rarely analysis-ready. This step covers organizing logs into a structured database, filtering out incomplete runs, removing personally identifiable information, standardizing formats, and enriching records with metadata. The paper recommends the Inspect AI library for teams working in this space—its structured log schema and built-in evaluation primitives reduce boilerplate significantly.
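As a rough illustration of what this preparation stage involves, the sketch below filters incomplete runs, redacts one class of PII, and enriches records with metadata. The record shape, the `[EMAIL]` placeholder, and the `prepare` function are all hypothetical; real Inspect AI logs carry a much richer schema, and production PII removal needs far more than an email regex.

```python
import re

# Hypothetical minimal record shape; real logs carry a richer schema.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def prepare(records):
    """Filter out incomplete runs, redact emails, and enrich with metadata."""
    prepared = []
    for rec in records:
        if not rec.get("messages"):  # drop incomplete runs
            continue
        # Standardize and redact: one illustrative PII pattern only.
        messages = [EMAIL_RE.sub("[EMAIL]", m) for m in rec["messages"]]
        prepared.append({
            "id": rec["id"],
            "messages": messages,
            "n_turns": len(messages),  # enrichment metadata for later filtering
        })
    return prepared
```

The point is less the specific transformations than the discipline: preparation is its own step with its own outputs, not something interleaved with analysis.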
Exploring the logs comes next. Before any automated analysis, the methodology calls for manual transcript review using strategic sampling: selecting conversations by score category, length, and error type, and deliberately including near-threshold cases. Automated methods—summary statistics, string matching, and LLM-based exploration—complement this review. The goal is building genuine intuition about the data before committing to signal definitions.
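A minimal sketch of strategic sampling, assuming transcripts carry a `score` and `n_turns` field (both names hypothetical): stratify by score category and length bucket, draw from each stratum, then add near-threshold cases rather than sampling uniformly.

```python
import random

def strategic_sample(transcripts, per_stratum=5, threshold=0.5, band=0.1, seed=0):
    """Group transcripts into (score category, length bucket) strata and
    sample each, plus near-threshold cases, instead of sampling uniformly."""
    rng = random.Random(seed)
    strata = {}
    for t in transcripts:
        key = (t["score"] >= threshold, t["n_turns"] > 20)  # cutoffs are illustrative
        strata.setdefault(key, []).append(t)
    sample = []
    for group in strata.values():
        sample += rng.sample(group, min(per_stratum, len(group)))
    # Near-threshold cases are where score definitions are most informative.
    near = [t for t in transcripts if abs(t["score"] - threshold) <= band]
    sample += rng.sample(near, min(per_stratum, len(near)))
    return sample
```

Even this crude version guarantees coverage of short failures, long successes, and borderline cases that uniform sampling routinely misses.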
From that exploration, the researcher refines the research question, converting broad questions into concrete signals defined at three granularities: the message level (specific patterns within individual turns), the transcript level (agent behaviors across an entire conversation), and the population level (systematic differences across conditions or model variants). Choosing the right granularity for a signal is a design decision with consequences for every subsequent step.
Developing the scanner means building automated pattern detectors that operationalize the defined signals. The methodology guides critical decisions: how to scope chunks, what type of scoring to use, and how to handle ambiguous cases. For scoring, the paper draws on prior work noting that pairwise scoring is subject to positional bias, while pointwise scoring faces calibration challenges. The recommended approach uses confidence scores and multiple independent judges for hard cases, especially when signals are subtle or context-dependent.
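One plausible way to combine confidence scores with multiple independent judges, in the spirit the paper describes: weight each judge's vote by its confidence, flag only when agreement is strong, and route low-agreement cases to human review. The thresholds and the `aggregate_judges` function are illustrative assumptions, not the paper's algorithm.

```python
def aggregate_judges(judgments, min_confidence=0.6):
    """Combine independent judge verdicts, each a (flag, confidence) pair
    with confidence in [0, 1]. Flag only when confidence-weighted votes
    clearly favor one side; otherwise escalate rather than guess."""
    yes = sum(conf for flag, conf in judgments if flag)
    no = sum(conf for flag, conf in judgments if not flag)
    total = yes + no
    if total == 0 or max(yes, no) / total < min_confidence:
        return "needs_human_review"  # ambiguous: human oversight, not a coin flip
    return "flagged" if yes > no else "clear"
```

The escalation path matters as much as the vote: subtle signals are exactly where single-judge pointwise scores are least calibrated.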
The sixth stage is validating the scanner. This step is easy to skip in practice and critical in principle. The validation process runs the scanner, then samples outputs stratified by the scanner's decisions, applies ground-truth labels, and computes standard metrics. Crucially, the sampling must include non-detections—cases where the scanner did not flag a signal. If you only evaluate flagged cases, your metrics will be artificially inflated and you will have no visibility into false negatives.
The pipeline concludes with running the analysis and interpreting results. This means systematic statistical analysis across transcripts, messages, and conditions—with confidence intervals, effect sizes, and explicit acknowledgment of what the data does and does not support.
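For the confidence intervals this stage calls for, one standard choice (my suggestion, not the paper's prescription) is the Wilson score interval for rates such as "fraction of transcripts where the signal fired," which behaves better than the normal approximation at small counts:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a proportion, e.g. the rate at which
    a signal fires in one condition."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)
```

Reporting "5/100 transcripts flagged (95% CI roughly 2%–11%)" is a very different claim from a bare "5%," and the interval is what makes cross-condition comparisons honest.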
Why It Matters
The gap the paper fills is real. Knowledge about log analysis for AI agents has lived in scattered blog posts, internal evaluation reports, and institutional memory—not in systematic methodologies that teams can adopt and compare. When evaluation approaches are ad-hoc, findings are hard to reproduce, teams cannot build on each other's work, and the field struggles to converge on shared standards.
The paper does not claim to solve this overnight. It is a methodology paper, not a benchmark study. It does not argue that one scanning technique is definitively superior to another. What it offers is a standard process—the kind that software engineering adopted decades ago when testing moved from ad-hoc scripts to formal test suites with defined coverage criteria and reporting conventions.
There is also an underappreciated practical problem the paper highlights: underrepresented categories. In real conversational data, rare but critical cases—like mental health crises—can account for roughly five percent of interactions. Teams that evaluate only on distributions of positive cases, or that sample uniformly at random, will systematically miss these cases. Deliberate oversampling during the exploration and validation phases is not a luxury; it is a requirement for responsible evaluation.
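The simplest form of deliberate oversampling is to keep every rare-category case and fill the remainder of the review budget from the common cases. The `is_rare` predicate and budget below are placeholders for whatever category definition a team actually uses:

```python
import random

def oversample_rare(items, is_rare, n_common=50, seed=0):
    """Keep all rare cases, then top up with a random draw of common ones.
    Uniform sampling at a ~5% base rate would mostly miss the rare category."""
    rng = random.Random(seed)
    rare = [x for x in items if is_rare(x)]
    common = [x for x in items if not is_rare(x)]
    return rare + rng.sample(common, min(n_common, len(common)))
```

The resulting sample is no longer representative of the base rate—which is the point during exploration and validation, and something to correct for (by reweighting) in any population-level estimates.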
And there is a security dimension that is easy to overlook. The paper references the documented AgentHarm incident, in which error messages exposed the evaluation path—including the string /benchmark/—directly to the model being evaluated. This context leakage was discovered only through post-hoc log analysis. The implication is uncomfortable: evaluation infrastructure itself can be visible to the agent, and unless teams are systematically auditing logs for such leakage, they may be drawing wrong conclusions from evaluations they believe are airtight.
Key Numbers and Findings
The paper is careful to present itself as a methodology contribution, not a definitive empirical result. Even so, it includes findings worth noting.
On scanner performance, LLM-based scanners show what the paper describes as "promising performance for some types of signals"—explicit reward-hacking cues are more tractable, while more implicit or subtle signals remain harder to detect reliably. The phrase "with human oversight" appears more than once in this context, and the methodology embeds that principle into the validation step rather than treating it as optional.
On sampling: the paper's framework emphasizes that validation must include non-detections—a point that sounds obvious but is frequently violated in practice. The costs of missing a failure mode are asymmetric, particularly in production deployments.
Practitioner Takeaways
For teams building or evaluating AI agents today, the pipeline's immediate value is structural. It gives teams a process they can adopt without waiting for the field to converge on a gold-standard benchmark. The Inspect Scout library referenced in the paper is open source and designed to support this workflow natively.
The sampling strategies section deserves particular attention for teams doing qualitative evaluation. The principle of strategic sampling—selecting cases by score category, length, and error type, and deliberately including near-threshold cases—is not expensive to implement and substantially improves coverage over uniform random sampling. For teams working with rare categories, deliberate oversampling is a forcing function for actually finding those cases.
The scanner validation step is the one most frequently skipped in practice and the one the paper most firmly insists upon. Running a scanner and reporting its flags without validating against ground truth is a confidence trick you are playing on yourself. The validation phase is where the methodology earns its rigor.
Our Take
This paper represents a genuine step forward for the field. The pipeline is not revolutionary in the sense of introducing fundamentally new techniques—most of the individual steps are things experienced evaluation teams have done informally. What changes is making it a coherent, documented methodology that can be taught, adopted, and compared across teams.
For practitioners, the immediate recommendation is pragmatic: adopt the framework for your next evaluation cycle, lean especially hard on the validation step, and treat the sampling strategies not as academic suggestions but as operational necessities. For the field at large, the paper's greater contribution may be raising the floor—establishing that log analysis deserves the same methodological discipline as other branches of empirical computer science. That is a standard worth converging on.
Reference: arXiv:2604.09563
Want to learn more about AI evaluation?
Explore our field guide for practical AI system evaluation techniques and tooling.
Get the Field Guide — $10 →