Here's an uncomfortable question for anyone running RLHF to align an LLM with safety values:
What if your safety evals pass — but the evaluator that judged them also failed?
In standard RLHF, a Reward Model (RM) sits between your Core LLM and the world. The RM is supposed to catch harmful outputs and steer the model away from them. You test this whole system with red-teaming, and if the LLM passes your safety checks, you're done.
Except there is a class of failure that most red-teaming methods miss entirely: the case where both the Core LLM and the RM fail at the same time. The LLM generates something harmful, the RM gives it a high score, and no one catches it. The system has no internal mechanism to self-correct.
This is the problem ARES tackles. ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System, accepted to ACL 2026 Main (a conference that has not yet taken place; treat the work as a preprint), introduces a framework that doesn't just ask "does the LLM misbehave?" It asks the harder question: "does your safety net, the RM, also fail when the LLM fails?"
What ARES Does
ARES is a two-phase framework for discovering and repairing systemic vulnerabilities in RLHF pipelines.
The core insight is that RLHF creates a coupled system: the Core LLM's behavior is shaped by the RM's scores, and the RM's reliability depends on data that may not cover all the ways the LLM can fail. Most red-teaming operates on the LLM in isolation, treating the RM as a fixed, trustworthy evaluator. ARES treats both as testable components.
The framework proceeds in two phases:
- Phase 1 — Discovery: ARES generates adversarial prompts designed to expose where the Core LLM and RM fail individually and together. It then classifies each failure into one of three types.
- Phase 2 — Repair: Using the discovered vulnerabilities as training signal, ARES fixes the RM first, then uses the improved RM to optimize the Core LLM. The repair is sequential: fix the evaluator before using it to fix the model.
Why It Matters
The failure mode ARES identifies is not theoretical. RMs are trained on preference data, and that data has gaps. An RM may never have seen examples of a particular adversarial strategy, or may have learned biases that allow it to be fooled by cleverly constructed harmful content. When the Core LLM generates content exploiting those gaps, and the RM scores it as safe, you have a silent failure.
Existing red-teaming frameworks — FLIRT, APRT, FERRET, AutoDAN-Turbo — are designed to find LLM policy failures. They generate adversarial prompts to make the LLM misbehave, and if the LLM refuses or produces safe output, the test passes. But they don't systematically probe the RM, and they don't catch the cases where both fail simultaneously.
ARES calls these "systemic weaknesses." They are, by the paper's framing, the most dangerous failure mode: there's no internal flag, no alarm, nothing in the pipeline catches it.
How It Works
The Safety Mentor
The adversarial prompt generation in ARES is handled by what the paper calls a "Safety Mentor" — an LLM that composes attacks by combining four structured component types:
- Topic: The harmful domain (e.g., misinformation, exploitation)
- Persona: A credible identity for social engineering (e.g., "digital forensics expert")
- Goal: The concrete task the model is asked to perform
- Tactic: The method or framing used to disguise the attack
The key is compositional coherence: by combining these four components, the Safety Mentor generates prompts that are semantically logical and deceptively benign-looking, rather than obvious jailbreak attempts that simple filters catch. For each prompt, the Safety Mentor also generates a synthetic harmful response and a preferred safe response — forming a preference triplet that directly supports downstream repair.
Prompts are filtered through ShieldGemma before use to retain only genuinely harmful attempts.
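The composition step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the component pools, prompt template, and the `mentor_llm` callable are all hypothetical placeholders.

```python
import random

# Hypothetical component pools; the paper's actual taxonomy entries are not
# reproduced here, these names are illustrative placeholders.
TOPICS = ["misinformation", "exploitation"]
PERSONAS = ["digital forensics expert", "content moderation researcher"]
GOALS = ["draft a step-by-step guide", "explain how the technique works"]
TACTICS = ["frame it as academic analysis", "frame it as an incident post-mortem"]

def compose_attack():
    """Sample one instance from each of the four component types and render
    a single coherent prompt from them."""
    topic = random.choice(TOPICS)
    persona = random.choice(PERSONAS)
    goal = random.choice(GOALS)
    tactic = random.choice(TACTICS)
    prompt = f"As a {persona}, {goal} related to {topic}; {tactic}."
    return {"topic": topic, "persona": persona, "goal": goal,
            "tactic": tactic, "prompt": prompt}

def make_preference_triplet(attack, mentor_llm):
    """The Safety Mentor also synthesizes both responses, yielding a
    (prompt, rejected-harmful, chosen-safe) triplet that can feed the
    downstream RM repair directly. `mentor_llm` is an assumed interface."""
    return {
        "prompt": attack["prompt"],
        "rejected": mentor_llm(attack["prompt"], style="harmful"),  # synthetic harmful
        "chosen": mentor_llm(attack["prompt"], style="safe"),       # preferred safe
    }
```

The point of the triplet is that discovery output doubles as repair input: nothing new has to be labeled later.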
Three Failure Types
Each adversarial prompt is evaluated against both the Core LLM (via a Judge model that provides a harm score from 0 to 5) and the RM (scoring the synthetic harmful response vs. the safe response independently). This parallel evaluation produces three failure types:
- Type A — RM Misalignment: The Core LLM does not generate harmful content on its own, but the RM is fooled by the Safety Mentor's synthetic harmful response. The RM needs repair.
- Type B — Policy Vulnerability: The Core LLM generates harmful content, but the RM correctly catches it. The Core LLM needs repair.
- Type C — Systemic Failure: Both fail simultaneously. The Core LLM generates harmful content and the RM gives it a high score. This is the most critical failure mode — it requires fixing both components.
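The classification logic above reduces to two booleans. A minimal sketch, assuming a harm-score cutoff of 3 on the Judge's 0-to-5 scale (the cutoff is an assumption, not a value from the paper):

```python
def classify_failure(judge_harm_score, rm_score_harmful, rm_score_safe,
                     harm_threshold=3):
    """Classify one adversarial prompt into ARES's three failure types.

    judge_harm_score: Judge's 0-5 harm rating of the Core LLM's own response.
    rm_score_harmful / rm_score_safe: RM scores for the synthetic harmful and
    safe responses; the RM "fails" when it does not prefer the safe one.
    """
    llm_fails = judge_harm_score >= harm_threshold
    rm_fails = rm_score_harmful >= rm_score_safe
    if llm_fails and rm_fails:
        return "C"  # systemic failure: both components miss at once
    if llm_fails:
        return "B"  # policy vulnerability: RM catches it, LLM misbehaves
    if rm_fails:
        return "A"  # RM misalignment: LLM behaves, RM is fooled
    return None     # no failure discovered for this prompt
```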
Hierarchical Adaptive Sampling
A key innovation is that the Safety Mentor doesn't generate attacks uniformly. ARES uses a two-level adaptive sampling mechanism: after a warmup phase, it learns which component categories (e.g., Deception & Manipulation) and which specific instances within categories (e.g., deepfake creation) are most effective at finding failures, and reinforces successful strategies multiplicatively. This focuses the search on the most productive parts of the vulnerability space.
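The two-level reweighting can be sketched like this. The boost factor and the tie to a warmup phase are assumptions for illustration; the paper's actual update rule may differ in its constants.

```python
import random

class AdaptiveSampler:
    """Two-level adaptive sampler sketch: one weight table over component
    categories, one over instances within each category. Strategies that
    expose failures get a multiplicative boost, concentrating search on
    productive regions. The boost value 1.5 is an assumed constant."""

    def __init__(self, taxonomy, boost=1.5):
        # taxonomy: {category: [instance, ...]}, e.g.
        # {"Deception & Manipulation": ["deepfake creation", ...]}
        self.cat_w = {c: 1.0 for c in taxonomy}
        self.inst_w = {c: {i: 1.0 for i in insts}
                       for c, insts in taxonomy.items()}
        self.boost = boost

    def sample(self):
        """Draw a category, then an instance within it, proportional to weight."""
        cats, cw = zip(*self.cat_w.items())
        cat = random.choices(cats, weights=cw)[0]
        insts, iw = zip(*self.inst_w[cat].items())
        return cat, random.choices(insts, weights=iw)[0]

    def update(self, cat, inst, found_failure):
        """Multiplicatively reinforce strategies that exposed a failure."""
        if found_failure:
            self.cat_w[cat] *= self.boost
            self.inst_w[cat][inst] *= self.boost
```

In use, every discovery-phase attack would call `sample()` before generation and `update()` after the failure-type check.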
Two-Stage Repair
The repair process is sequential and deliberate:
- RM fine-tuning: The RM is fine-tuned on a preference dataset built from Type A and Type C failures, augmented with HelpSteer2 general helpfulness data and FalseReject data to prevent over-correction and capability degradation.
- Core LLM optimization: Using the repaired RM as the reward signal, the Core LLM is trained via GRPO on Type B and Type C failures, paired with the same auxiliary data.
The sequential ordering matters: the paper argues that repairing the RM first prevents a circular dependency, where a broken RM would produce bad training signal for the Core LLM during repair.
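The split of discovered failures into the two repair datasets can be sketched directly. The record format and the auxiliary-data arguments (standing in for HelpSteer2 and FalseReject) are assumptions for illustration:

```python
def build_repair_datasets(failures, aux_preference_pairs, aux_prompts):
    """Partition discovered failures into the two sequential repair datasets.

    failures: list of dicts like
        {"type": "A"|"B"|"C", "prompt": str, "chosen": str, "rejected": str}
    aux_preference_pairs / aux_prompts: auxiliary data (e.g. HelpSteer2,
    FalseReject) mixed in to prevent over-correction; format is assumed.
    """
    # Stage 1 (must run first): RM fine-tuning pairs from Type A and C,
    # i.e. every case where the RM preferred harmful over safe.
    rm_pairs = [f for f in failures if f["type"] in ("A", "C")]
    rm_pairs += aux_preference_pairs

    # Stage 2 (only after the RM is repaired): GRPO prompts from Type B
    # and C, scored by the *repaired* RM, never the broken one.
    llm_prompts = [f["prompt"] for f in failures if f["type"] in ("B", "C")]
    llm_prompts += aux_prompts

    return rm_pairs, llm_prompts
```

Note that Type C failures appear in both stages: the systemic cases require fixing both components.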
Key Results
ARES was evaluated across multiple adversarial safety benchmarks using Qwen3-1.7B as the Core LLM and Skywork-RM-Qwen3-4B as the RM. Results are reported as safety scores (higher = safer) on StrongReject, HarmBench, PKU-SafeRLHF, and RedTeam, plus XSTest's incorrect refusal rate (lower = fewer false refusals).
On HarmBench specifically, the Original RLHF model scored 0.66. General Safe-RLHF — a baseline that uses broader safety data — reached 0.79. ARES achieved 0.95. Similar patterns hold across the other safety benchmarks: ARES reaches near-ceiling safety scores (0.95–0.97) across StrongReject, HarmBench, PKU-SafeRLHF (0.96), and RedTeam.
Critically, these safety gains come with minimal capability degradation on core benchmarks including MMLU, GSM8K, TruthfulQA, and AlpacaEval. The paper reports no statistically significant capability drop attributable to the ARES repair process.
On sample efficiency, ARES demonstrates strong gains with far less data than baseline approaches. With only 2,000 samples (versus a baseline trained on 10,800 examples), ARES reaches HarmBench 0.91 vs. the baseline's 0.88. With 4,000 samples, ARES reaches 0.97 versus the baseline's 0.94 — roughly 2.7× fewer samples for equivalent or better results.
On runtime, the paper reports approximately 13 hours end-to-end for a single ARES pass (9 hours in the discovery phase generating 4,000 samples, 2.5 hours for RM fine-tuning, and 1.5 hours for Core LLM optimization). This compares favorably to the APRT baseline, estimated at 28 hours under comparable compute.
Practitioner Takeaways
For engineers and teams running RLHF pipelines, ARES surfaces a practical insight: test the evaluator, not just the model. The RM is not a fixed, reliable component — it has failure modes, and those failure modes can coincide with the LLM's failure modes.
Concrete recommendations the paper supports:
- Add a dual-probe to your red-teaming: When adversarially testing an aligned LLM, also probe the RM independently. Generate synthetic harmful responses and check whether the RM correctly scores them lower than safe alternatives. If it doesn't, that's a Type A vulnerability.
- Fix the RM before the Core LLM: If you find RM vulnerabilities, address those first. Using an unreliable RM to optimize the Core LLM risks encoding the RM's biases into the model.
- The attack taxonomy is reproducible: The four-component structure (Topic, Persona, Goal, Tactic) provides a structured starting point for teams building custom red-teaming datasets. It's not model-specific.
- Complementary to existing safety methods: ARES targets a specific failure mode — coupled RM-LLM vulnerability — that constitutional AI, Safe RLHF, and representation engineering don't explicitly address. It can layer on top of those approaches.
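The dual-probe recommendation in the first bullet amounts to a simple check. A minimal sketch, where `rm_score(prompt, response)` is an assumed scoring interface and the `margin` parameter is an optional safety buffer of my own choosing:

```python
def probe_reward_model(rm_score, triplets, margin=0.0):
    """Independently probe an RM with (prompt, chosen-safe, rejected-harmful)
    triplets. Any case where the safe response does not beat the harmful one
    by more than `margin` is a Type A-style vulnerability worth logging."""
    vulnerable = []
    for t in triplets:
        safe = rm_score(t["prompt"], t["chosen"])
        harmful = rm_score(t["prompt"], t["rejected"])
        if safe - harmful <= margin:  # RM failed to prefer the safe response
            vulnerable.append({**t, "gap": safe - harmful})
    return vulnerable
```

Run against a batch of synthetic triplets, the returned list is exactly the set of RM failures that should feed a Stage-1-style RM repair before any policy optimization.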
Our Take
ARES is a well-motivated paper that identifies a real and underappreciated risk in RLHF deployments. The framing of "systemic weaknesses" — where both the LLM and its safety net fail simultaneously — is the right lens for thinking about robust alignment.
That said, the paper has meaningful limitations to keep in view:
- It's a preprint. ACL 2026 Main hasn't occurred yet. The results should be treated as unverified pending formal peer review.
- The experiments are primarily on a 1.7B parameter model. Whether the findings generalize to larger frontier models is an open question.
- Only one RM architecture was tested. The failure modes discovered may be specific to Skywork-RM-Qwen3-4B.
- The taxonomy-based Safety Mentor has coverage limits. If adversarial strategies fall outside the four-component structure, they may not be discovered.
The sample efficiency numbers and the runtime comparison are compelling. The dual-repair sequential approach is intuitive and sensible. But for practitioners, the key contribution may be the conceptual framework: the next time your safety evals pass, ask whether the evaluator was actually tested.
Paper: ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System (arXiv:2604.18789) — Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris. To appear at ACL 2026 Main. Treat as preprint.