Paper: “OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards” — Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, et al. (arXiv:2603.19191, March 2026)
Your AI agent clicks through a phone app, completes a task, and declares victory. But did it actually succeed? That's the question OS-Themis sets out to answer — and its solution is surprisingly elegant: don't trust a single judge. Build a jury.
The Problem Nobody Talks About
Reinforcement learning (RL) for GUI agents has a dirty secret: the reward function is almost always terrible. You can build the most sophisticated policy network in the world, but if your reward signal tells the agent “good job” when it accidentally opens the wrong app, your training is doomed.
Existing approaches fall into two camps. Rule-based evaluators are precise but brittle — they break the moment your UI layout changes. LLM-as-a-Judge methods are flexible but sloppy — a single vision-language model looking at a screenshot and saying “yeah, looks done” is about as reliable as a distracted intern reviewing a pull request.
The core issue is what the authors call evidence dilution. When a GUI agent takes 50 steps to complete a task, most of those steps are routine — scrolling, navigating menus, waiting for loads. The handful of critical moments where the agent actually accomplishes (or fails) the objective get buried in noise. A single judge model trying to process the entire trajectory gets overwhelmed.
What OS-Themis Actually Does
OS-Themis decomposes the evaluation problem into a multi-agent pipeline with four specialized roles, organized into two modules:
Milestone Verification Module
The Selector watches the full trajectory and picks out the milestone moments — the steps that actually matter for determining success or failure. Instead of drowning the evaluation in 50 screenshots, it extracts maybe 3-5 critical frames. Each milestone gets an explicit “assignment goal” — a clear criterion for what success looks like at that step.
The Verifier takes each milestone and its assignment goal, then independently checks whether that specific step was executed correctly. This is forensic-level analysis on isolated evidence, not a vague gut check on the whole trajectory.
Verdict Calibration Module
The Reviewer operates as a Critic (they tested Advisor mode too — more on that later). It audits the entire evidence chain produced by the verification module, looking for overlooked flaws, gaps in logic, or evidence that the verification might have missed. Think of it as adversarial quality assurance on the evaluation itself.
The Judge makes the final call. Armed with the verified milestones and the reviewer's critique, it produces a binary success/failure verdict. Critically, the Judge can determine that a task succeeded even if some intermediate steps were imperfect — because in real GUI interaction, partial failures don't always prevent overall success.
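The four roles compose into a simple pipeline. Here is a minimal runnable sketch of that flow; the role names come from the paper, but the function signatures, data types, and toy stand-ins are our assumptions — a real deployment would have each role prompt a VLM with the relevant screenshots.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Milestone:
    step_index: int
    screenshot: str       # path or encoded frame from the trajectory
    assignment_goal: str  # explicit success criterion for this step

@dataclass
class Verification:
    milestone: Milestone
    passed: bool
    rationale: str

def evaluate_trajectory(
    task: str,
    trajectory: List[str],
    select: Callable[[str, List[str]], List[Milestone]],
    verify: Callable[[Milestone], Verification],
    review: Callable[[str, List[Verification]], str],
    judge: Callable[[str, List[Verification], str], bool],
) -> bool:
    """Selector -> Verifier -> Reviewer -> Judge, as described above."""
    milestones = select(task, trajectory)     # extract the critical frames
    checks = [verify(m) for m in milestones]  # independent per-milestone checks
    critique = review(task, checks)           # adversarial audit of the evidence
    return judge(task, checks, critique)      # final binary verdict

# Toy stand-ins so the skeleton runs end to end; real roles would each
# call a VLM with the screenshot and its assignment goal.
select = lambda task, traj: [
    Milestone(i, s, f"frame {i} shows the task outcome")
    for i, s in enumerate(traj) if "confirm" in s
]
verify = lambda m: Verification(m, "confirm" in m.screenshot, "keyword check")
review = lambda task, checks: "no gaps found"
judge = lambda task, checks, critique: bool(checks) and all(c.passed for c in checks)

verdict = evaluate_trajectory(
    "send a message", ["home_screen", "chat_open", "confirm_sent"],
    select, verify, review, judge,
)
```

Note that the Judge sees both the per-milestone checks and the Reviewer's critique, which is what lets it pass a trajectory whose intermediate steps were imperfect.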
The Benchmark: OmniGUIRewardBench
The authors didn't just build a framework — they built the evaluation infrastructure to prove it works. OmniGUIRewardBench (OGRBench) is a cross-platform benchmark spanning five environments:
- AndroidWorld — mobile device tasks
- OSWorld — desktop OS tasks
- WindowsAgentArena — Windows-specific workflows
- macOSArena — macOS interactions
- WebArena-Lite-v2 — web browser tasks
The benchmark includes 1,409 trajectories (700 positive, 709 negative) generated by diverse agents including Qwen3-VL (4B to 235B), UI-TARS variants, ScaleCUA, and Claude Sonnet 4.5. This isn't a synthetic toy dataset — it's real agent behavior across real platforms.
The Numbers That Matter
OGRBench Results
Against the two established baselines (DigiRL and ZeroGUI), OS-Themis dominated across every metric:
- vs. DigiRL: +18.8% accuracy, +29.6% precision, +16.9% recall, +26.2% F1
- vs. ZeroGUI: +7.7% accuracy, +5.1% precision, +13.0% recall, +13.4% F1
These aren't marginal gains. An 18.8% accuracy improvement over DigiRL means the difference between a reward function that sort-of-works and one that actually trains useful agents.
Online RL Training
Here's where it gets practical. Using OS-Themis as the reward function for online RL training on AndroidWorld:
- Qwen3-VL-4B: 6.0% absolute improvement over baseline (beating ZeroGUI by 5.2%, SEAgent by 3.5%)
- Qwen3-VL-8B: 7.1% absolute improvement (beating ZeroGUI by 3.0%, SEAgent by 4.7%)
The bigger model benefited more from OS-Themis — a promising sign that the framework scales with model capability rather than hitting diminishing returns.
Scaling Results
When they scaled to 1,024 training tasks with Qwen3-VL-4B, OS-Themis delivered a 10.3% improvement over baseline on AndroidWorld, reaching 55.6% task success rate. The performance curve showed consistent gains as task count increased, with no signs of plateauing.
Self-Evolution via Data Filtering
Perhaps the most exciting result: OS-Themis as a data filter for self-training. Starting from 15,110 raw trajectories collected through autonomous exploration, they used OS-Themis to discard judged failures and ran SFT on the surviving data:
- Qwen3-VL-4B: +6.9% over baseline after SFT on filtered data
- Qwen3-VL-8B: +5.0% over baseline
Training on unfiltered data actually degraded performance — confirming that data quality matters far more than data quantity, and that OS-Themis can reliably separate signal from noise.
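The collect → filter → SFT loop above can be sketched in a few lines. Everything here is a hypothetical stand-in — `run_agent`, `themis_verdict`, and `finetune` represent the exploration policy, the OS-Themis critic, and your SFT trainer, none of which the paper specifies as an API.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]  # simplified: a trajectory as a list of step records

def self_evolve(
    tasks: List[str],
    run_agent: Callable[[str], Trajectory],             # autonomous exploration
    themis_verdict: Callable[[str, Trajectory], bool],  # critic used as a data filter
    finetune: Callable[[List[Tuple[str, Trajectory]]], None],  # SFT step
) -> int:
    """One round of the collect -> filter -> SFT loop; returns kept count."""
    raw = [(t, run_agent(t)) for t in tasks]
    # Keep only trajectories the critic judges successful; per the paper,
    # training on the unfiltered pool actually degraded performance.
    kept = [(t, traj) for t, traj in raw if themis_verdict(t, traj)]
    finetune(kept)
    return len(kept)

# Toy demonstration with deterministic stand-ins:
agent = lambda task: ["step", "done"] if task == "easy" else ["step"]
critic = lambda task, traj: "done" in traj
trainer = lambda data: None
kept_count = self_evolve(["easy", "hard", "hard"], agent, critic, trainer)
```

The loop closes on itself: each SFT round produces a stronger agent for the next round of exploration, with the critic's precision deciding how much noise leaks into training.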
The Ablation Study (The Interesting Part)
Each component earns its keep:
- Remove Selector: -4.7% accuracy, -13.1% precision. Without milestone extraction, the verifier drowns in irrelevant evidence.
- Remove Verifier: -6.1% accuracy, -15.6% precision. Skipping verification introduces systematic bias.
- Remove Judge: Accuracy collapses to 52.5%. Precision stays at 89.7%, but recall crashes to 5%. The system becomes absurdly conservative because imperfect intermediate steps ≠ task failure.
- Remove Reviewer: Accuracy holds, but precision drops. The Critic role catches edge cases the verification pipeline misses.
The assignment goal ablation is particularly telling — without explicit success criteria for each milestone, the Verifier becomes lenient and false positives spike.
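The "Remove Judge" numbers are internally consistent, which is a good sign the ablation is real. Assuming the ablation runs on the full OGRBench split (700 positive, 709 negative), the reported precision and recall imply the reported accuracy almost exactly:

```python
# Sanity-check the "Remove Judge" ablation against OGRBench's class balance.
pos, neg = 700, 709          # positive / negative trajectories in the benchmark
recall, precision = 0.05, 0.897

tp = recall * pos                      # true positives found
fp = tp * (1 - precision) / precision  # false positives implied by precision
tn = neg - fp                          # remaining negatives are correct rejections
accuracy = (tp + tn) / (pos + neg)     # ~0.525, matching the reported 52.5%
```

An evaluator that almost never says "success" (recall 5%) still gets half the benchmark right simply because half the trajectories are failures — which is exactly why accuracy alone is a misleading metric for reward models.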
Reviewer Role: Advisor vs. Critic
They tested two reviewer personas. The Advisor gives constructive suggestions; the Critic hunts for flaws. The Advisor produced more balanced metrics, but they chose the Critic for production because RL training needs high-precision rewards. A false positive in your reward function is far more damaging than a false negative — it teaches the agent to repeat failures.
Why This Matters for Practitioners
1. The Precision-Recall Tradeoff Is Real (And They Formalize It)
The paper includes a formal analysis (Appendix A) showing that in policy-gradient RL, once recall is adequate, reducing false positives yields larger training improvements than increasing true positive detection. This isn't just intuition — they prove that the “preference margin” (ρ - α) governs the effective gradient signal, and precision directly controls false-positive contamination of updates.
Translation for builders: When choosing a reward function for RL-trained agents, optimize for precision first. A conservative evaluator that occasionally misses successes will train a better agent than a generous one that frequently rewards failures.
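A toy calculation illustrates why. Suppose a noisy binary critic has true-positive rate `recall` and false-positive rate `fpr`, and that early in training only a fraction `p_success` of sampled trajectories actually succeed. This is our simplification of the intuition, not the paper's Appendix A derivation:

```python
def effective_signal(recall: float, fpr: float, p_success: float) -> float:
    # Expected critic reward credited to genuinely successful behavior,
    # minus reward leaked to failures (false-positive contamination).
    return p_success * recall - (1 - p_success) * fpr

base = effective_signal(recall=0.7, fpr=0.2, p_success=0.2)  # ~ -0.02
better_recall = effective_signal(0.8, 0.2, 0.2)              # ~  0.00
better_fpr = effective_signal(0.7, 0.1, 0.2)                 # ~ +0.06
```

With these toy numbers, raising recall from 0.7 to 0.8 moves the signal by +0.02, while halving the false-positive rate moves it by +0.08 — four times as much, because failures outnumber successes four to one early in training.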
2. Multi-Agent Evaluation Scales Better Than Bigger Models
The scaling experiments reveal that upgrading the Judge and Verifier components to larger models yields more improvement than scaling the Selector or Reviewer. This suggests a targeted scaling strategy: invest compute in the decision-making components, not the evidence-gathering ones.
3. Self-Evolving Agents Are Becoming Practical
The autonomous data collection → OS-Themis filtering → SFT loop is a complete self-improvement pipeline. The agent explores, the critic filters, and training improves the agent — which then explores better. We're getting close to agents that genuinely improve themselves without human-labeled data.
4. Platform Generalization Is Solvable
OGRBench spans five different platforms, and OS-Themis performs consistently across all of them. This suggests that GUI evaluation doesn't need platform-specific engineering — a well-designed general framework can handle the diversity of real-world interfaces.
Limitations (The Honest Part)
The authors acknowledge several constraints worth noting:
- Infrastructure bottleneck: Scaling online RL is currently limited by their ability to provision and coordinate containerized Android environments. The method works, but the plumbing needs work.
- Reward granularity: They're using binary (success/failure) rewards. Milestone-level intermediate rewards could improve sample efficiency but aren't fully explored yet.
- Semantic reward hacking: Because the evaluator is a VLM (not a rule-based system), agents could theoretically learn to produce screenshots that look successful to the critic without actually completing the task. The multi-agent design mitigates but doesn't eliminate this risk.
- VLM bias inheritance: The critic framework inherits whatever biases exist in the underlying vision-language models, potentially penalizing non-standard UI layouts or accessibility interfaces.
Our Take
OS-Themis represents a meaningful step toward solving one of the hardest problems in agent development: reliable automated evaluation. The insight that evaluation should be decomposed into specialized roles — extraction, verification, critique, and judgment — mirrors how human organizations handle quality assurance. No single person reviews a complex deliverable; you have specialists checking different aspects and a final decision-maker synthesizing the evidence.
The self-evolution results are particularly noteworthy. We're seeing the emergence of a pattern where agents can autonomously collect experience, have that experience filtered by a reliable critic, and train themselves to improve. The bottleneck is shifting from “can we train agents?” to “can we evaluate agents?” — and OS-Themis pushes that frontier forward.
For anyone building agentic systems that interact with GUIs — whether that's mobile testing, desktop automation, or web scraping — the architecture here is directly applicable. You don't need OS-Themis specifically; the design pattern of multi-agent evaluation with milestone decomposition can be implemented with any reasonably capable VLM.
The paper is the strongest evidence yet that the future of GUI agents isn't just better action models — it's better critics.
Paper: arXiv:2603.19191 — “OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards”
Authors: Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, et al.