Your AI agent is live. It's handling thousands of conversations a day. Some go brilliantly. Some go sideways. And you have absolutely no idea which is which — unless you read every single one.

That's the state of agentic AI in production right now. We can build agents that reason, plan, and use tools across multi-step interactions. We can deploy them at scale. But improving them after deployment? That's still mostly "a developer reading transcripts at 2 AM and hoping they spot the pattern."

A new paper from DigitalOcean's research team proposes an elegant fix: don't try to judge every trajectory. Just triage them.

The Paper

"Signals: Trajectory Sampling and Triage for Agentic Interactions" by Shuguang Chen, Adil Hafeez, and Salman Paracha (arXiv:2604.00356, April 2026). Published out of DigitalOcean Holdings, this work tackles one of the most practical problems in production agent systems — figuring out which agent runs are actually worth looking at.

Why This Matters

Here's the ugly truth about deployed agents: they generate enormous amounts of behavioral data, and almost none of it gets used for improvement. The feedback loop is broken.

You have two bad options today:

  1. Manual review — doesn't scale. At all. A single agent might produce thousands of trajectories per day, each containing dozens of tool calls, reasoning steps, and user messages.
  2. LLM-as-a-judge on everything — technically possible, but running an auxiliary model over every single trajectory is cost-prohibitive when you're already paying for the agent itself.

The result? Most production agents improve through vibes-based debugging: a developer notices something weird, tweaks a prompt, deploys, and prays. There's no structured pipeline connecting what agents do in production to how they get better.

Signals addresses this by asking a different question entirely: instead of "how good was this trajectory?", ask "is this trajectory worth a human looking at?"

How It Works

The core idea is beautifully simple. The authors define a taxonomy of lightweight signals — behavioral markers that can be computed from trajectory data without any model calls — and use them to triage which trajectories deserve human review.

The Signal Taxonomy

The taxonomy spans three categories across two axes (data layer × downstream utility):

Interaction Signals (discourse layer, learning-oriented):

  • Misalignment — the user and agent aren't on the same page. Rephrasing, corrections, restated constraints. Not about who's "wrong," just that shared understanding hasn't been established.
  • Stagnation — the conversation keeps going but nothing's happening. Near-duplicate responses, circular explanations, linguistic degeneration. The agent is talking without progressing.
  • Disengagement — the user has checked out. "Talk to a human," explicit negative stances, session abandonment. This is the terminal state.
  • Satisfaction — it worked. Gratitude, success confirmations, clean closings. These get sampled as exemplars, not just dismissed.

Execution Signals (runtime layer, learning-oriented):

  • Failure — tool calls that don't advance the task. Empty results, no-op actions, inappropriate tool choices.
  • Loop — the agent is stuck in a retry cycle. Same call with same inputs, oscillating between strategies, progressive parameter drift without progress.

Environment Signals (runtime layer, diagnosis-only):

  • Exhaustion — context overflows, rate limits, API outages. These are the system's fault, not the agent's, so they're explicitly excluded from training signal.

What's clever here is the deliberate separation of concerns. Interaction and execution signals are marked as learning-oriented — they're candidates for preference data construction. Environment signals are diagnosis-only — useful for ops, but they'd introduce spurious correlations if used to train the agent. An agent shouldn't learn to behave differently just because an API was down.
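
To make the separation of concerns concrete, here's a minimal sketch of how the taxonomy could be encoded. The structure (two layers, a learning-oriented flag, exhaustion excluded from training) comes from the paper; the class and field names are my own, not the authors':

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    DISCOURSE = "discourse"   # interaction signals
    RUNTIME = "runtime"       # execution and environment signals

@dataclass(frozen=True)
class SignalType:
    name: str
    layer: Layer
    learning_oriented: bool   # False => diagnosis-only, excluded from training data

SIGNAL_TAXONOMY = [
    SignalType("misalignment", Layer.DISCOURSE, True),
    SignalType("stagnation", Layer.DISCOURSE, True),
    SignalType("disengagement", Layer.DISCOURSE, True),
    SignalType("satisfaction", Layer.DISCOURSE, True),
    SignalType("failure", Layer.RUNTIME, True),
    SignalType("loop", Layer.RUNTIME, True),
    SignalType("exhaustion", Layer.RUNTIME, False),  # system's fault, not the agent's
]

def training_candidates(signals: list[SignalType]) -> list[SignalType]:
    """Keep only learning-oriented signals for preference-data construction."""
    return [s for s in signals if s.learning_oriented]
```

The flag is the whole point: an exhaustion event still gets logged for ops, but it never flows into the training pipeline.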

Detection Without Model Calls

All signal detection is deterministic: phrase-level matching for interaction patterns, sequence analysis for execution patterns, and structured log parsing for environment conditions. No LLM calls. No embeddings. No classifiers. This matters because it means you can run signal computation on every trajectory in real time without adding meaningful cost or latency to your system.

The interaction detectors use lightweight normalization and typo-tolerant phrase matching over user turns, with local similarity checks to catch rephrasing even when explicit markers are absent. Stagnation detection uses discourse heuristics — near-duplicate phrasing within a speaker role, prolonged interactions relative to a baseline.
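
A rough sketch of what typo-tolerant phrase matching plus a local similarity check might look like, assuming windowed fuzzy matching over normalized user turns. The marker phrases and the thresholds here are hypothetical; the paper does not publish its lexicons:

```python
import difflib

# Hypothetical marker phrases, not the paper's actual lexicon.
MISALIGNMENT_MARKERS = ["that's not what i meant", "no, i asked for", "let me rephrase"]

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def fuzzy_contains(turn: str, marker: str, threshold: float = 0.85) -> bool:
    """Typo-tolerant match: slide a marker-sized word window over the turn."""
    marker_words = _normalize(marker).split()
    turn_words = _normalize(turn).split()
    n = len(marker_words)
    target = " ".join(marker_words)
    for i in range(len(turn_words) - n + 1):
        window = " ".join(turn_words[i:i + n])
        if difflib.SequenceMatcher(None, window, target).ratio() >= threshold:
            return True
    return False

def detect_misalignment(user_turns: list[str]) -> bool:
    # Explicit markers, or consecutive user turns that closely restate
    # each other (rephrasing without an explicit marker).
    if any(fuzzy_contains(t, m) for t in user_turns for m in MISALIGNMENT_MARKERS):
        return True
    for a, b in zip(user_turns, user_turns[1:]):
        if difflib.SequenceMatcher(None, _normalize(a), _normalize(b)).ratio() > 0.7:
            return True
    return False
```

Everything here is string operations over turns you already have in memory, which is what makes per-trajectory, real-time computation cheap.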

Execution failure detection classifies non-advancing tool outcomes from structured observations. Loop detection runs sequence analysis over invocation streams: repeated calls with identical inputs, repeated calls with systematically varying inputs, and repeated multi-tool cycles.
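
Loop detection over an invocation stream can be sketched as a pass over (tool, arguments) pairs. The two checks below (identical repeated calls, repeated multi-tool cycles) mirror the patterns the paper names; the thresholds are illustrative, and progressive parameter drift would need an additional per-argument comparison not shown here:

```python
from collections import Counter

def detect_loop(calls: list[tuple[str, frozenset]], max_repeats: int = 2,
                cycle_len: int = 2) -> bool:
    """Flag retry loops in a stream of (tool_name, frozen_args) invocations."""
    # 1. Same call with identical inputs, repeated too many times.
    if any(n > max_repeats for n in Counter(calls).values()):
        return True
    # 2. Repeated multi-tool cycles: e.g. oscillating A, B, A, B.
    names = [name for name, _ in calls]
    for k in range(2, cycle_len + 1):
        for i in range(len(names) - 2 * k + 1):
            if names[i:i + k] == names[i + k:i + 2 * k] and len(set(names[i:i + k])) > 1:
                return True
    return False
```

Because the input is structured tool-call logs rather than free text, this check is deterministic and cheap enough to run inline on every trajectory.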

The Sampling Framework

Here's where it comes together. The signals feed into a composite triage score that doesn't try to rate trajectory quality — it rates trajectory informativeness. The framework then runs parallel sampling streams for failures and exemplars, producing a balanced set of trajectories that a human reviewer can actually act on.

The distinction matters. Quality scoring is inherently domain-specific and assumption-laden (a terse response might be perfect for an expert but terrible for a novice). Informativeness is more universal: "would a developer learn something useful from reading this?"
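
A minimal sketch of the triage idea, assuming a per-signal weight table and two parallel streams. The weights and stream logic are my own illustration of the framework described above, not the paper's actual scoring function:

```python
# Hypothetical weights; the paper describes a composite score but
# these numbers are illustrative, not taken from it.
SIGNAL_WEIGHTS = {
    "misalignment": 0.8, "stagnation": 0.6, "disengagement": 1.0,
    "failure": 0.9, "loop": 0.9, "satisfaction": 0.5,
}

def triage_score(signals: set[str]) -> float:
    """Informativeness, not quality: any strong signal makes it worth reading."""
    return max((SIGNAL_WEIGHTS.get(s, 0.0) for s in signals), default=0.0)

def sample_for_review(trajectories: list[dict], k_failures: int = 50,
                      k_exemplars: int = 50) -> tuple[list[dict], list[dict]]:
    """Parallel streams: top-scored problem trajectories plus satisfied exemplars."""
    scored = sorted(trajectories, key=lambda t: triage_score(t["signals"]), reverse=True)
    failures = [t for t in scored if t["signals"] - {"satisfaction"}][:k_failures]
    exemplars = [t for t in scored if "satisfaction" in t["signals"]][:k_exemplars]
    return failures, exemplars
```

Note what the score never consults: task reward, response quality, or domain rules. It only asks whether something attention-worthy happened.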

The Results

The authors validated on τ-bench, a benchmark for tool-augmented agent evaluation that simulates multi-turn conversations across airline and retail domains. They compared three sampling strategies, each drawing 100 trajectories:

  • Random sampling: 54% informativeness rate
  • Heuristic filtering (conversations with 10+ user messages): 74% informativeness rate
  • Signal-based sampling: 82% informativeness rate

That's a 1.52× efficiency gain over random sampling — meaning every annotation dollar goes 52% further. The difference between signal and random sampling is highly significant (p < 0.001).

But the more interesting finding is what happens when you stratify by task reward. Heuristic sampling achieves its 74% rate by cherry-picking failures — 70% of its sample consists of failed trajectories. Signal sampling draws a more balanced mix (52% failed) and still comes out ahead.

The Counterfactual Test

To prove signals aren't just oversampling obvious failures, the authors ran a counterfactual standardization: re-weighting each strategy to match random sampling's reward distribution (63% success, 37% fail). Under this adjustment:

  • Signal sampling: 77.6% standardized informativeness
  • Heuristic sampling: 62.7% (drops 11.3 points once failure bias is removed)
  • Random sampling: 54.0% (unchanged — it's the reference)
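
The re-weighting itself is simple arithmetic: hold each strategy's per-stratum informativeness fixed and recombine under the reference reward mix. The per-stratum rates below are hypothetical (the paper reports only the standardized totals), chosen so the example lands near the reported 77.6%:

```python
def standardized_informativeness(inf_success: float, inf_fail: float,
                                 p_success: float = 0.63, p_fail: float = 0.37) -> float:
    """Re-weight per-stratum informativeness to a reference reward distribution."""
    return p_success * inf_success + p_fail * inf_fail

# Hypothetical per-stratum rates for signal sampling:
# 72% informative among successes, 87% among failures.
signal_std = standardized_informativeness(0.72, 0.87)  # ~0.776
```

Any strategy that scores well only because it oversamples failures loses that advantage under this adjustment, which is exactly what happens to the heuristic baseline.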

Signal sampling provides genuine per-trajectory informativeness gains. It's not just fishing for failures; it finds informative patterns in successful trajectories too — policy violations, inefficient tool use, near-misses that didn't prevent task completion but still represent improvement opportunities.

Where It Shines Most

The advantage is particularly pronounced in the retail domain, which features more complex multi-step tasks: signal sampling achieves 78% informativeness compared to 66% for heuristic and 35% for random. In the simpler airline domain, all strategies perform well (86–96%), leaving less room for differentiation.

The category distribution across informative trajectories is stable regardless of sampling method: action/tool-use issues account for 57–60% and conversation issues for 38–43%. Signals don't bias the type of issue surfaced — they just surface more of them.

What This Means for Practitioners

If you're running agents in production, this paper gives you a practical blueprint for the part everyone hand-waves: the feedback loop.

The immediate takeaway: You don't need to evaluate every trajectory. You need to evaluate the right trajectories. A lightweight signal layer — deterministic rules over discourse and execution patterns — can get you an 82% hit rate on informativeness with zero model-call overhead.

The architectural implication: Signals should be a first-class component of your agent infrastructure, computed in real time and attached as structured metadata to every trajectory. They're not a post-hoc analysis tool; they're sampling infrastructure that sits upstream of everything else — human review, preference data construction, automated evaluation.
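
Concretely, "signals as structured metadata" could be as simple as a JSON record emitted alongside each trajectory log. Field names here are my own, not a schema from the paper; adapt them to whatever your logging stack expects:

```python
import json
import time

def attach_signal_metadata(trajectory_id: str, signals: set[str]) -> str:
    """Emit detected signals as a structured record next to the trajectory log."""
    record = {
        "trajectory_id": trajectory_id,
        "computed_at": time.time(),
        "signals": sorted(signals),
        # Diagnosis-only signals (e.g. exhaustion) stay visible for ops
        # but are separated out so training pipelines can ignore them.
        "learning_oriented": sorted(signals - {"exhaustion"}),
    }
    return json.dumps(record)
```

Because the record is computed at write time, every downstream consumer (review queues, preference-pair builders, dashboards) can filter on it without reprocessing transcripts.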

The training pipeline angle: The authors explicitly frame this as a path toward preference data construction for DPO/RLHF. Select informative trajectories via signals, generate counterfactual continuations (what the agent should have done), and you've got preference pairs for training. They leave the end-to-end pipeline to future work, but the architecture is clear.

What's still missing: The signal taxonomy is intentionally coarse-grained. It won't catch semantically incorrect but behaviorally smooth trajectories — the agent that confidently delivers wrong information in a perfectly natural conversation. The authors acknowledge this and suggest signals work best alongside domain-specific validators or outcome verification. Also, the evaluation uses LLM-simulated users (τ-bench), so disengagement and satisfaction patterns may differ from real-world traffic.

Our Take

This is the kind of paper we love at Alchemic: practically useful, architecturally clean, and honest about its limitations. The agent community has spent years getting better at building agents and almost no time on the post-deployment improvement loop. Signals doesn't solve the whole problem, but it solves the right first problem — figuring out what to look at.

The "no model calls" constraint is the real insight. In a world where everyone reaches for another LLM to evaluate the first LLM, using deterministic rules over structured trajectory data is refreshingly pragmatic. It's cheaper, faster, more predictable, and scales linearly.

For anyone building with OpenClaw or similar agent frameworks, the signal taxonomy maps directly to observable infrastructure: tool call logs give you execution signals, conversation history gives you interaction signals, and system metrics give you environment signals. You could implement the core framework in a weekend.

The 82% informativeness rate with 1.52× efficiency gain isn't going to solve all your agent problems. But it will make sure you're spending your debugging time on trajectories that actually teach you something — and that's a better foundation than most production agent systems have today.

Want Your AI Agents to Actually Improve?

Learn how to build, deploy, and optimize AI agents with OpenClaw — from setup to production monitoring.

Get the Field Guide — $10 →

Paper: "Signals: Trajectory Sampling and Triage for Agentic Interactions" — Shuguang Chen, Adil Hafeez, Salman Paracha (DigitalOcean Holdings). arXiv:2604.00356, April 2026.