The next enterprise agent request will not always sound like “finish this task.” More often, it will sound like: watch this inbox, monitor this case queue, wait for this vendor update, tell me when a customer crosses a threshold, or act when a deployment signal changes.

That is a different reliability problem.

A normal task agent can be judged by whether it reaches an answer. A long-running monitoring agent has to do something more awkward: preserve attention without burning it, notice the right environmental change, avoid acting on stale evidence, and understand that doing nothing may be the correct outcome. If we treat that as an infinite loop with a larger timeout, we will get expensive refresh bots with good prose.

The production primitive should be a watch contract.

Monitoring is not just longer task completion

Microsoft Research’s SentinelBench is useful because it names a class of work that many teams are already trying to automate. The benchmark focuses on long-running monitoring agents: systems that watch changing environments, wait, and act only when external events make progress possible. Its technical report describes 100 tasks across 10 synthetic web environments, with scripted time-evolving events and metrics that include task success, reaction time, resource usage, and cost: https://arxiv.org/html/2606.05342v2

The Microsoft Research article puts the core behavior plainly: for these tasks, the correct pattern is to “watch, wait, and act only when the environment changes on its own”: https://www.microsoft.com/en-us/research/articles/sentinelbench-a-benchmark-for-long-running-monitoring-agents/

That sounds simple until you operationalize it.

A human can watch a mailbox lightly while doing other work. A naïve agent may repeatedly reload the page, summarize the same state, spend tokens proving nothing has happened, and eventually either give up or take a premature action because the loop needs to produce something. Monitoring tasks punish that impulse. No amount of refreshing makes an email arrive faster. No amount of reasoning makes a calendar invite exist before it is sent.

The agent has to know when not to think.

The watch contract

A watch contract is the small operational agreement that should exist before an agent is allowed to run for minutes, hours, or days. It tells the system what to observe, when to wake up, what counts as evidence, what it may do, and when silence is success.

A practical watch contract should include:

  • Objective: the business reason for the watch, written in plain language.
  • Monitored source: the system of record, not a vague “look around.”
  • Wake condition: the exact predicate that makes action possible.
  • Observation cadence: polling interval, event subscription, backoff, and maximum duration.
  • Evidence requirement: what data must be captured before the agent can claim the condition is true.
  • Freshness window: how recent the evidence must be.
  • Allowed actions: the narrow set of actions the agent may take after the wake condition is met.
  • Idempotency rule: how the system prevents duplicate sends, duplicate tickets, duplicate orders, or repeated approvals.
  • No-op criteria: what a correct “nothing happened” result looks like.
  • Escalation rule: when ambiguity, timeout, conflicting evidence, or permission boundaries require a human.
  • Audit events: the state transitions that must be logged for replay and review.

Without those fields, a long-running agent is not autonomous. It is underspecified.
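One way to make the checklist concrete is to write it down as a typed record that a scheduler can validate before the watch starts. The sketch below is illustrative only: `WatchContract`, its field names, `evidence_is_usable`, and the example values are all assumptions made for this article, not a standard schema or a published API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class WatchContract:
    """Hypothetical schema; field names mirror the checklist, not a standard."""
    objective: str                                       # business reason, plain language
    source: str                                          # system of record
    wake_condition: Callable[[Mapping[str, Any]], bool]  # exact predicate
    poll_interval: timedelta                             # observation cadence
    max_duration: timedelta                              # when the watch expires
    required_evidence: tuple[str, ...]                   # fields that must be captured
    freshness_window: timedelta                          # maximum age of evidence
    allowed_actions: frozenset[str]                      # narrow action set
    idempotency_key: str                                 # guards against duplicates
    noop_criteria: str                                   # what "nothing happened" means
    escalation_contact: str                              # who handles ambiguity or timeout
    audit_events: tuple[str, ...]                        # transitions that must be logged

    def evidence_is_usable(self, observation: Mapping[str, Any], now: datetime) -> bool:
        """True only if every required field is present and the observation is fresh."""
        if any(name not in observation for name in self.required_evidence):
            return False
        observed_at = observation.get("observed_at")
        if observed_at is None:
            return False
        return now - observed_at <= self.freshness_window


# Illustrative values only; none of these names refer to a real system.
example_contract = WatchContract(
    objective="Alert the support lead when a watched account files a high-severity ticket",
    source="support-ticket-queue",
    wake_condition=lambda ticket: ticket.get("severity") == "high",
    poll_interval=timedelta(minutes=5),
    max_duration=timedelta(hours=8),
    required_evidence=("ticket_id", "severity"),
    freshness_window=timedelta(minutes=10),
    allowed_actions=frozenset({"draft_escalation_summary"}),
    idempotency_key="watch-42:escalation",
    noop_criteria="No high-severity ticket from a watched account before expiry",
    escalation_contact="support-lead",
    audit_events=("created", "checked", "evaluated", "acted", "completed"),
)
```

The freshness check sits on the contract itself so that stale evidence is rejected by plain code before any model is asked to interpret it.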

Separate the watcher from the actor

The safest design is usually not an always-on model loop. It is a deterministic watcher wrapped around narrow model calls.

Let a scheduler, queue, or state machine handle sleep, wake, backoff, deduplication, timeouts, and retries. Use the model when interpretation is needed: classifying a new message, comparing a changed record against the user’s intent, drafting a response, or deciding whether evidence is ambiguous enough to escalate.

That division matches Anthropic’s guidance in “Building Effective AI Agents”: start with the simplest viable system, prefer simple composable patterns, and add agentic complexity only when it improves performance: https://www.anthropic.com/research/building-effective-agents

The watcher should be boring infrastructure. The actor can be intelligent. Mixing both into one self-directed loop makes it harder to reason about cost, timing, and responsibility.

For example, “watch for a high-priority support escalation” should not mean a model rereads the entire queue every minute. The watcher can subscribe to new ticket events, filter by account and severity, and pass only candidate changes to the model. The model can then inspect the ticket text, identify missing context, and draft the escalation summary. If the ticket is not actually relevant, the correct result is a logged no-op.
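A minimal sketch of that split, assuming tickets arrive as dictionaries with `id`, `account`, `severity`, and `text` keys. The function names and the `classify` callable are hypothetical; `classify` stands in for whatever model call a team actually uses.

```python
from typing import Any, Callable, Iterable, Mapping

Ticket = Mapping[str, Any]


def is_candidate(ticket: Ticket, watched_accounts: set[str]) -> bool:
    """Deterministic pre-filter: cheap, testable, and involves no model call."""
    return (
        ticket.get("account") in watched_accounts
        and ticket.get("severity") == "high"
    )


def run_watcher(
    events: Iterable[Ticket],
    watched_accounts: set[str],
    classify: Callable[[Ticket], bool],  # assumed wrapper around a model call
    audit_log: list[str],
) -> list[Ticket]:
    """Pass only candidate tickets to the model; log every other path as a no-op."""
    escalations: list[Ticket] = []
    for ticket in events:
        if not is_candidate(ticket, watched_accounts):
            audit_log.append(f"no-op: {ticket['id']} filtered before model call")
            continue
        if classify(ticket):  # the only place a model would be invoked
            escalations.append(ticket)
            audit_log.append(f"escalate: {ticket['id']}")
        else:
            audit_log.append(f"no-op: {ticket['id']} judged not relevant")
    return escalations
```

Because the model is injected as a plain callable, the watcher can be tested with a stub, and the number of model calls per batch is bounded by the number of candidates rather than by the size of the queue.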

Treat no-op as a first-class outcome

This is the part many demos hide.

In monitoring work, success often means the agent waited and did not act. A stock threshold was never crossed. A customer did not reply. A security alert did not reach the severity boundary. A build never produced the release artifact. If the only visible success path is “the agent did something,” teams will accidentally reward premature action.

A watch contract should make no-op outcomes explicit:

  • “No matching event occurred before 5:00 PM.”
  • “Three candidate events arrived; none met the evidence threshold.”
  • “The condition may be true, but source data conflicts; escalated to human review.”
  • “The action already happened under idempotency key X; no duplicate action taken.”

That language matters because it turns waiting into an auditable state, not an absence of work.
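In code, that can be as simple as giving every ending its own named value. The sketch below is an assumption about how a team might model it: `Outcome`, `WatchResult`, and `commit_once` are hypothetical names, and the in-memory `completed` set stands in for what would need to be a durable store with an atomic check-and-set in production.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ACTED = "acted"
    NO_EVENT = "no_matching_event"
    BELOW_THRESHOLD = "candidates_below_evidence_threshold"
    ESCALATED = "conflicting_evidence_escalated"
    DUPLICATE_SKIPPED = "already_done_under_idempotency_key"


@dataclass(frozen=True)
class WatchResult:
    outcome: Outcome
    detail: str


def commit_once(
    idempotency_key: str,
    completed: set[str],          # stand-in for a durable store
    action: Callable[[], None],
) -> WatchResult:
    """Run the action at most once per key; a repeat is a logged no-op, not an error."""
    if idempotency_key in completed:
        return WatchResult(
            Outcome.DUPLICATE_SKIPPED,
            f"The action already happened under idempotency key {idempotency_key}; "
            "no duplicate action taken.",
        )
    action()
    completed.add(idempotency_key)
    return WatchResult(Outcome.ACTED, f"Action committed under {idempotency_key}.")
```

A watch that expires quietly would return something like `WatchResult(Outcome.NO_EVENT, "No matching event occurred before 5:00 PM.")`, which is as reviewable as an action.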

Observability is part of the product

Long-running agents also need traces that explain time.

OpenTelemetry’s agent observability discussion frames telemetry as more than troubleshooting. For non-deterministic agent systems, traces and logs can become feedback for evaluation and continuous improvement: https://opentelemetry.io/blog/2025/ai-agent-observability/

For watch-style agents, the trace should show the lifecycle:

1. Watch created.
2. Source checked or event received.
3. Predicate evaluated.
4. Evidence captured.
5. Model invoked, if needed.
6. Action proposed.
7. Permission checked.
8. Action committed or skipped.
9. Watch completed, renewed, timed out, or escalated.

That structure lets an operator ask useful questions later. Did the agent miss the event? Did it see the event but reject it? Did it act late because polling was too sparse? Did it waste tokens during idle time? Did it duplicate an action? Did a human override the result?

Without that timeline, a long-running agent becomes a black box with a sleep function.
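A minimal sketch of such a timeline, using only the Python standard library. This is not the OpenTelemetry API; `Stage`, `WatchTrace`, and `reaction_time` are hypothetical names, and in a real deployment each record would more likely be emitted as a span or span event by whatever tracing library the team already runs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Any, Optional


class Stage(Enum):
    WATCH_CREATED = 1
    EVENT_RECEIVED = 2
    PREDICATE_EVALUATED = 3
    EVIDENCE_CAPTURED = 4
    MODEL_INVOKED = 5
    ACTION_PROPOSED = 6
    PERMISSION_CHECKED = 7
    ACTION_COMMITTED = 8
    WATCH_COMPLETED = 9


@dataclass
class WatchTrace:
    watch_id: str
    events: list[tuple[datetime, Stage, dict[str, Any]]] = field(default_factory=list)

    def record(self, stage: Stage, at: datetime, **attrs: Any) -> None:
        """Append one timestamped state transition."""
        self.events.append((at, stage, attrs))

    def first(self, stage: Stage) -> Optional[datetime]:
        """Timestamp of the first occurrence of a stage, or None if it never happened."""
        return next((at for at, s, _ in self.events if s is stage), None)

    def reaction_time(self) -> Optional[timedelta]:
        """Delay between first seeing the event and committing the action."""
        seen = self.first(Stage.EVENT_RECEIVED)
        done = self.first(Stage.ACTION_COMMITTED)
        if seen is None or done is None:
            return None  # the watch never acted, which may be a correct no-op
        return done - seen

    def model_calls(self) -> int:
        """How many times a model was invoked, a rough proxy for idle-time spend."""
        return sum(1 for _, s, _ in self.events if s is Stage.MODEL_INVOKED)
```

With timestamps on every transition, "did it act late?" and "did it spend tokens while idle?" become queries over the trace instead of guesses.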

Governance starts at the trigger

The NIST AI Risk Management Framework is not a watch-agent design spec, but its lifecycle framing is relevant: trustworthy AI has to be considered during design, development, use, and evaluation, not bolted on after deployment: https://www.nist.gov/itl/ai-risk-management-framework

For monitoring agents, the trigger is a governance boundary. A trigger decides when the system moves from observation to action. That means it deserves the same scrutiny as a permission check or approval workflow.

If the watched event can affect money, patient operations, customer commitments, access rights, or production systems, the contract should say who owns the predicate, how it is tested, how drift is detected, and what happens when the source system changes shape.
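One small piece of that can be automated: refusing to fire when the source record no longer has the shape the predicate was written for. The sketch below assumes an invoice-style record with `invoice_id`, `amount`, and `currency` fields; the field names, `evaluate_trigger`, and the three return values are hypothetical choices for illustration.

```python
from typing import Any, Mapping

# Assumed shape of the source record the predicate was written against.
EXPECTED_SHAPE: dict[str, tuple[type, ...]] = {
    "invoice_id": (str,),
    "amount": (int, float),
    "currency": (str,),
}


def shape_problems(record: Mapping[str, Any]) -> list[str]:
    """List the fields that are missing or have an unexpected type."""
    problems = []
    for name, expected in EXPECTED_SHAPE.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(name)
    return problems


def evaluate_trigger(record: Mapping[str, Any], threshold: float) -> str:
    """Return 'fire', 'hold', or 'escalate'. A record of unknown shape never fires."""
    if shape_problems(record):
        return "escalate"  # source drifted; a human who owns the predicate decides
    return "fire" if record["amount"] >= threshold else "hold"
```

The design choice is that drift routes to escalation rather than to a best-effort guess, so a renamed or retyped field cannot silently turn into an action.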

The dangerous failure is not only “the agent did the wrong thing.” It is also “the agent did the right-looking thing for the wrong trigger.”

The operational takeaway

Long-running agents should feel less like tireless interns and more like well-instrumented sentries.

Most of the time, they sleep. When they wake, they should know exactly why. When they act, the evidence should be reconstructable. When they do nothing, that should be a valid, logged outcome. When the contract runs out, they should stop or escalate instead of improvising.

The companies that get this right will not be the ones with the busiest agents. They will be the ones whose agents can wait without wasting, act without duplicating, and explain the difference between patience and failure.
