Most teams running RLHF pipelines have a testing regimen. They check whether the model refuses harmful requests. They run adversarial prompts. They monitor for jailbreaks. What they're almost certainly not testing: whether the reward signal itself can be made to certify harmful content as preferred. That's the attack surface ARES, a new framework from researchers at USC and Salesforce Research, sets out to close.

Published as a Main Paper at ACL 2026, ARES — Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System — identifies a failure mode in RLHF that prior work overlooked: cases where both the core language model and its Reward Model fail simultaneously. When that happens, you have no internal mechanism stopping harmful output. The model generates dangerous content and the reward signal actually encourages it.

The Problem Prior Work Missed

Reinforcement Learning from Human Feedback (RLHF) relies on a trained Reward Model (RM) to signal which model outputs are desirable. That signal then guides the policy model toward better behavior. It's elegant in theory, but it creates a dependency: if the Reward Model itself can be fooled, the entire pipeline loses its corrective function.

Existing red-teaming approaches address this in pieces. Methods like FLIRT, FERRET, and APRT target policy-level vulnerabilities — they find prompts that make the model misbehave. Other work hardens the Reward Model against specific attacks. But neither line of work addresses what the ARES paper calls systemic weaknesses: cases where a single adversarial scenario exposes failures in both the core LLM and its Reward Model at the same time.

In those scenarios, neither component provides a backstop. The model produces harmful content. The reward signal, corrupted or misled, tells the pipeline that this content is acceptable. No alarm fires. No safeguard activates. The system simply generates and reinforces dangerous output.

How ARES Works

ARES has two phases: vulnerability discovery and closed-loop repair.

Phase 1: Finding What You're Missing

The core innovation is the Safety Mentor — an LLM-based attacker that composes adversarial prompts from four structured components:

  • Topic: The harmful domain (e.g., fraud, weapons synthesis, self-harm)
  • Persona: A credible identity designed to lower the model's defenses
  • Goal: The concrete harmful task being requested
  • Tactic: The methodology wrapping the attack — urgency, roleplay, false authority
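The four-component composition can be sketched in a few lines. The component pools below are illustrative placeholders, not the paper's actual taxonomy entries, and `compose_attack` is our own name for the sampling step:

```python
import random

# Hypothetical component pools -- the paper's real taxonomy is not public,
# so these entries are illustrative placeholders.
TOPICS = ["fraud", "weapons synthesis", "self-harm"]
PERSONAS = ["concerned parent", "security researcher", "novelist"]
GOALS = ["obtain step-by-step instructions", "extract operational details"]
TACTICS = ["urgency", "roleplay", "false authority"]

def compose_attack(rng: random.Random) -> dict:
    """Sample one structured adversarial scenario from the four components."""
    return {
        "topic": rng.choice(TOPICS),
        "persona": rng.choice(PERSONAS),
        "goal": rng.choice(GOALS),
        "tactic": rng.choice(TACTICS),
    }

attack = compose_attack(random.Random(0))
prompt = (f"As a {attack['persona']}, using {attack['tactic']}, "
          f"I need to {attack['goal']} about {attack['topic']}.")
```

Because every attack is a tuple of named components rather than free-form text, the coverage of the attack suite stays auditable.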

What makes this different from standard red-teaming is the dual-component classification. Every generated prompt is evaluated against both the core LLM and the Reward Model simultaneously. This produces three failure types:

  • Type A: The core LLM wouldn't generate harmful content on its own, but the Reward Model is fooled by a synthetically crafted harmful response
  • Type B: The core LLM generates harmful content, but the Reward Model correctly flags it
  • Type C — Systemic Weakness: Both fail in tandem — the model generates harmful content and the Reward Model rewards it

Type C is the critical insight. It's not just that either component can fail; it's that they can fail together, in ways that only show up when you test them jointly.

The Safety Mentor also runs a hierarchical adaptive sampling strategy. After a warmup phase of uniform exploration, it switches to weighted sampling that reinforces whatever attack strategies are working. Successful topic+tactic combinations get higher weight. Successful individual topic instances get boosted too. This means the attacker gets better over time — it doesn't just run a fixed dictionary of attacks, it evolves toward the vulnerabilities that actually work against your specific pipeline.
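The warmup-then-exploit loop can be sketched as follows. The class name, `warmup` length, and `boost` increment are our assumptions; the paper's actual weighting scheme may differ:

```python
import random
from collections import defaultdict

class AdaptiveSampler:
    """Sketch of hierarchical adaptive sampling: uniform exploration during
    warmup, then weighted sampling that boosts (topic, tactic) combinations
    which produced failures. Parameter names are ours, not the paper's."""

    def __init__(self, combos, warmup=50, boost=1.0):
        self.combos = list(combos)
        self.weights = defaultdict(lambda: 1.0)  # every combo starts equal
        self.warmup = warmup
        self.boost = boost
        self.draws = 0

    def sample(self, rng: random.Random):
        self.draws += 1
        if self.draws <= self.warmup:               # uniform exploration phase
            return rng.choice(self.combos)
        w = [self.weights[c] for c in self.combos]  # exploit what has worked
        return rng.choices(self.combos, weights=w, k=1)[0]

    def record_success(self, combo):
        self.weights[combo] += self.boost           # reinforce a winning attack
```

The same `record_success` hook could also bump individual topic weights, mirroring the two levels of the hierarchy the paper describes.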

Phase 2: Two-Stage Repair

Once vulnerabilities are classified, ARES runs a two-stage repair process:

  1. Stage 1 — RM Repair: Fine-tune the Reward Model to better detect harmful content. This addresses Type A failures (where the RM was fooled) and Type C failures (where the RM's misjudgment was part of the systemic problem).
  2. Stage 2 — Core LLM Optimization: Use the improved RM to optimize the core model. This addresses Type B failures (where the LLM generated harm but the RM caught it) and Type C failures that survived the first stage.

The ordering is deliberate. You fix the reward signal first because you can't use a broken RM as a training oracle — if the RM is itself unreliable, optimizing against it compounds the error rather than correcting it.
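The ordering constraint can be made explicit in code. This is a structural sketch only: `finetune_rm` and `optimize_policy` stand in for whatever training routines a pipeline actually uses, and the vulnerability records are assumed to carry the failure type from the discovery phase:

```python
def repair_pipeline(vulnerabilities, reward_model, policy_model,
                    finetune_rm, optimize_policy):
    """Sketch of the two-stage ordering: harden the RM first, then use the
    repaired RM as the training oracle for the policy."""
    # Stage 1: repair the reward signal on Type A and Type C cases,
    # where the RM itself misjudged harmful content.
    rm_cases = [v for v in vulnerabilities if v["type"] in ("Type A", "Type C")]
    repaired_rm = finetune_rm(reward_model, rm_cases)

    # Stage 2: only now optimize the policy, on Type B and surviving Type C
    # cases, scored by the *repaired* RM -- never the broken one.
    policy_cases = [v for v in vulnerabilities if v["type"] in ("Type B", "Type C")]
    repaired_policy = optimize_policy(policy_model, policy_cases, repaired_rm)
    return repaired_rm, repaired_policy
```

Swapping the two stages would mean `optimize_policy` receives the unrepaired oracle, which is precisely the error-compounding failure described above.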

What ARES Shows

The paper reports results across multiple adversarial safety benchmarks. According to the authors, ARES substantially improves model safety while preserving model capabilities and keeping over-refusal rates low — meaning it doesn't make the model overly cautious as a side effect.

We want to be honest about what the paper does and doesn't tell us: the specific quantitative results (percentage improvements, benchmark scores, attack success rates) are not available in the abstract. The paper was published at ACL 2026 Main, which means it went through peer review, but as with any pre-proceedings version, details may shift before the final camera-ready copy appears. We'll watch for the final version and update if needed.

What we can say with confidence: the dual-vulnerability framing is qualitatively new. No prior red-teaming approach evaluates the policy and reward model jointly under the same attack. The classification taxonomy and the two-stage repair architecture are both novel. The fact that the Safety Mentor composes attacks from structured components also means the approach is reproducible and inspectable — you can audit what threat categories are covered.

What Practitioners Should Consider

If you're running RLHF or deploying an aligned model, a few things from this paper are worth sitting with:

You're probably not testing your Reward Model directly. Most teams test the final model. ARES shows you should be testing the RM's robustness as a separate step, and specifically testing whether adversarial content can get it to certify harm as preferred.
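A minimal RM audit along these lines might look like the following. `score_fn` is a placeholder for whatever scoring interface your reward model exposes; the pair-based setup and the `margin` parameter are our assumptions, not a prescribed protocol:

```python
def probe_reward_model(score_fn, pairs, margin=0.0):
    """Hypothetical RM audit: for each prompt with a (safe, harmful) response
    pair, a robust RM should score the safe response higher. Returns the
    prompts where the RM certified harm as preferred.

    score_fn(prompt, response) -> float is assumed to be the RM's interface.
    """
    failures = []
    for prompt, safe, harmful in pairs:
        if score_fn(prompt, harmful) >= score_fn(prompt, safe) + margin:
            failures.append(prompt)  # RM preferred the harmful response
    return failures
```

Running a probe like this on adversarially crafted response pairs, rather than only on final model outputs, is the shift the paper argues for.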

The ordering of your safety fixes matters. If you find a failure where both model and RM are compromised, fix the RM first. Using a compromised RM to train the core model just bakes in the error.

A fixed attack taxonomy is a limitation to watch. The Safety Mentor works from a defined set of topics, personas, goals, and tactics. Real-world attackers aren't constrained this way. The question is whether the taxonomy is rich enough to cover the actual threat landscape — and that requires ongoing maintenance, not a one-time audit.

Our Take

ARES is a serious piece of work. ACL 2026 Main is a competitive venue, and the dual-vulnerability framing addresses a real gap in how the field thinks about RLHF safety. The two-stage repair architecture is sound — it correctly identifies that you can't optimize against a broken oracle.

What we'd want to see more of: broader benchmark coverage beyond safety-specific tests, evaluation of the fixed taxonomy's generalizability, and clarity on how the approach scales to larger models with different RLHF configurations. Early-stage research papers tend to show promising results on the specific setup they optimize for; replication across different pipelines is the real test.

That said, the core insight — that the Reward Model is an attack surface most teams aren't actively red-teaming — is actionable today. Even if you don't adopt ARES wholesale, the practice of classifying your RLHF failures into types (policy-only, RM-only, systemic) and testing them separately is a useful diagnostic discipline. The paper gives you that framework for free.

The RLHF pipeline is only as strong as its weakest link. For a long time, that link was assumed to be the core model. ARES makes a compelling case it might be the reward signal instead.


ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System (arXiv:2604.18789) appears in ACL 2026 Main. Paper link: https://arxiv.org/abs/2604.18789