Customer Support Agents Need an Evaluation Flywheel, Not a Deflection Bot

The weakest version of a customer-support AI agent is easy to describe: it sits in front of the help center, answers what it can, and tries to keep tickets away from humans. That may reduce queue volume for a while. It may also create a new class of invisible operational risk: confidently wrong answers, skipped policy steps, broken handoffs, and customers who leave more frustrated than when they arrived.

The more durable pattern is not a better deflection bot. It is an evaluation flywheel.

That is the practical lesson from Nubank’s recent paper, [“Building Customer Support AI Agents at 100M-User Scale”](https://arxiv.org/html/2606.08867v2). The authors describe an evaluation-driven framework used across five production support agents: card delivery, debt management, credit-limit support, card management, and product explanation. The reported results are not framed as a lab demo. They connect offline evaluation, judge calibration, human review, modular context engineering, and online A/B testing to business outcomes such as transactional Net Promoter Score (tNPS) and self-service rate.

The card-delivery deployment, for example, is reported as gaining 37 percentage points in AI tNPS and 29 percentage points in self-service rate. The product-explanation agent, by contrast, saw a self-service-rate dip that the authors attribute to transient LLM request failures. That second detail matters as much as the first. Production support agents do not fail only because they misunderstand language. They fail because distributed systems fail, tool calls fail, policies shift, prompts drift, judge rubrics go stale, and escalation paths lose context.

In other words: support automation is a service operation, not a prompt.

The wrong metric is “tickets avoided”

Deflection is tempting because it is easy to count. If fewer conversations reach human agents, the dashboard looks cleaner. But deflection alone says little about whether customers received correct, compliant, and trust-preserving help.

A banking support agent, a healthcare benefits assistant, or an insurance claims copilot is not equivalent to a generic FAQ chatbot. It touches sensitive data. It invokes domain tools. It navigates policy exceptions. It must know when not to act. It must hand off gracefully when automation becomes unsafe, ambiguous, or emotionally costly for the customer.

Nubank’s paper is useful because it treats the agent as part of a larger operating loop. Their framework includes modular context components such as instructions, routines, macros, tool specifications, and working memory. It also includes offline LLM-as-a-judge evaluation, judge-prompt optimization, inter-rater reliability checks, human-in-the-loop prompt iteration, and online experiments. That combination is the important part. No single evaluator, prompt, or benchmark is trusted as the whole truth.

The goal is not “make the model answer more.” The goal is “make the support system improve safely.”

What an evaluation flywheel looks like

A real support-agent flywheel has at least seven moving parts.

First, the context is modular. Policies, standard operating procedures, reusable response patterns, tool contracts, and working memory should be separately versioned. If a billing policy changes, the team should not be spelunking through a monolithic mega-prompt to find the relevant paragraph. A support workflow should be editable, reviewable, and testable at the component level.
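One way to picture separately versioned context is a small sketch like the one below. It is not taken from the Nubank paper; the `ContextModule` type, the module names, and the fingerprint scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class ContextModule:
    """One independently versioned piece of agent context (hypothetical type)."""
    name: str       # e.g. "billing_policy" or "tool_specs"
    version: str    # bumped whenever the body changes
    body: str


def assemble_context(modules):
    """Join modules into one prompt and return it with a version manifest."""
    ordered = sorted(modules, key=lambda m: m.name)
    prompt = "\n\n".join(f"## {m.name}\n{m.body}" for m in ordered)
    manifest = {m.name: m.version for m in ordered}
    return prompt, manifest


def fingerprint(manifest):
    """Short, stable ID for the exact combination of module versions."""
    joined = ",".join(f"{k}={v}" for k, v in sorted(manifest.items()))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


modules = [
    ContextModule("billing_policy", "2.1", "Refunds over the limit need review."),
    ContextModule("tool_specs", "1.0", "get_order_status(order_id) -> status"),
]
prompt, manifest = assemble_context(modules)
```

Storing the manifest or its fingerprint alongside each conversation lets a reviewer see which policy version was in force without reading the assembled prompt.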

Second, every meaningful action leaves a trace. The agent should not merely say it checked an order status, reviewed an account state, or escalated a case. The system should record whether the tool call happened, what class of result came back, what policy branch was followed, and what information was passed to the human handoff. Final-answer review is not enough when the failure lives in the trajectory.
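A trajectory check can be sketched in a few lines. The field names, result classes, and tool names here are hypothetical; the point is only that the trace, not the final answer, is what gets inspected.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    result_class: str   # assumed vocabulary: "ok", "error", "timeout"


@dataclass
class Trace:
    policy_branch: str
    tool_calls: list = field(default_factory=list)
    handoff_fields: dict = field(default_factory=dict)


def missing_required_tools(trace, required):
    """Return required tools that never completed successfully in this trace."""
    succeeded = {c.tool for c in trace.tool_calls if c.result_class == "ok"}
    return sorted(set(required) - succeeded)


trace = Trace(
    policy_branch="card_delivery.delayed",
    tool_calls=[
        ToolCall("get_order_status", "timeout"),
        ToolCall("get_account_state", "ok"),
    ],
)
gaps = missing_required_tools(trace, ["get_order_status", "get_account_state"])
```

In this example the agent may well have told the customer it checked the order, but the trace shows the order-status call timed out, so `gaps` flags it.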

Third, offline evaluation needs human-curated edge cases. Historical tickets are useful, but they overrepresent what already happened. Support teams also need adversarial and rare cases: angry customers, partial information, policy conflicts, tool outages, fraud-sensitive requests, accessibility needs, and “safe but not helpful” responses.

Fourth, LLM judges should be calibrated, not worshiped. Nubank’s framework uses judge optimization and inter-rater reliability checks, which is exactly the right instinct. A judge model is one more component that can fail, not an oracle. It can be biased toward verbosity, miss domain constraints, or reward a plausible answer that skipped a mandatory tool. The organization still owns the rubric.
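To make "calibrated" concrete, here is a minimal sketch that compares judge labels with expert labels using Cohen's kappa. Kappa is one common agreement statistic and is used here as an assumption; this piece does not claim it is the one Nubank uses. The labels are made up.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Agreement between two raters, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(judge, human)   # 0.5 for these labels
```

Raw agreement here is 75 percent, but kappa is only 0.5 once chance agreement is removed, which is why a plain accuracy number can flatter a judge.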

Fifth, online metrics must close the loop. Offline scores are release gates and diagnostic tools; online tests show customer impact. The useful dashboard combines task success, escalation quality, self-service rate, customer satisfaction, latency, tool failures, policy violations, and post-handoff outcomes.

Sixth, handoff is a product surface. If the agent escalates without context, the customer experiences the automation as a delay. A production support agent should preserve the customer’s goal, collected facts, attempted actions, policy blockers, and confidence signals so a human can continue without forcing the customer to restart.
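A handoff payload along these lines could be validated before the case reaches a human. The schema is a sketch built from the fields named above; the class name, the choice of required fields, and the example values are assumptions.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Handoff:
    customer_goal: str
    collected_facts: dict = field(default_factory=dict)
    attempted_actions: list = field(default_factory=list)
    policy_blockers: list = field(default_factory=list)   # may be empty
    confidence: Optional[float] = None                     # agent's own signal


def handoff_gaps(handoff):
    """List required fields that are empty, so a thin handoff can be caught."""
    required = ("customer_goal", "collected_facts", "attempted_actions")
    data = asdict(handoff)
    return [name for name in required if not data[name]]


thin = Handoff(customer_goal="replace lost card")
full = Handoff(
    customer_goal="replace lost card",
    collected_facts={"card_last4": "1234", "reported_lost": True},
    attempted_actions=["block_card"],
    policy_blockers=["address change requires human approval"],
    confidence=0.4,
)
```

A check like `handoff_gaps` can run as part of escalation, so a handoff that would force the customer to restart is visible as a defect rather than discovered by the human agent.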

Seventh, failures should become fixtures. Every serious miss should create a regression case, a trace rule, a rubric update, or a context-module change. That is the flywheel: production evidence becomes evaluation infrastructure, and evaluation infrastructure shapes the next release.
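As a rough illustration of "failures become fixtures", an incident can be turned into a permanent regression case. The `RegressionCase` shape and the `agent_check` callback are hypothetical; a real suite would call the agent and a rubric rather than a lambda.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    incident_id: str
    customer_message: str
    expected_behavior: str   # what the rubric should look for
    severity: int            # assumed scale: 1 (minor) to 4 (critical)


def add_fixture(suite, case):
    """Return a new suite with the case added, skipping duplicate incidents."""
    if any(c.incident_id == case.incident_id for c in suite):
        return suite
    return suite + [case]


def run_suite(suite, agent_check):
    """Return incident IDs whose regression case still fails."""
    return [c.incident_id for c in suite if not agent_check(c)]


suite = []
case = RegressionCase(
    incident_id="INC-001",
    customer_message="My card never arrived and I was charged a fee.",
    expected_behavior="check delivery status before discussing fees",
    severity=3,
)
suite = add_fixture(suite, case)
suite = add_fixture(suite, case)   # second add is a no-op
still_failing = run_suite(suite, agent_check=lambda c: c.severity < 3)
```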

Reliability is broader than accuracy

The Princeton-led paper [“Towards a Science of AI Agent Reliability”](https://arxiv.org/html/2602.16666v1) makes the same point from a more general reliability perspective: compressing agent behavior into a single success metric hides operational flaws. It separates reliability into consistency, robustness, predictability, and safety.

Those dimensions map cleanly to support operations.

Consistency asks whether the agent handles similar customer cases in similar ways. Robustness asks whether it degrades gracefully when a customer phrases a request oddly, a tool is slow, or the knowledge base has conflicting entries. Predictability asks whether the system can identify likely failure and escalate before it damages trust. Safety asks whether failures are bounded: no unauthorized changes, no policy-violating advice, no confident fabrication about money, care, or legal obligations.

A support agent that answers 85 percent of benchmark cases correctly but unpredictably mishandles refunds, account changes, or escalations is not production-ready. It is merely impressive in a narrow test harness.

Do not make everything an agent

There is also a design restraint here. Anthropic’s engineering guidance on [building effective agents](https://www.anthropic.com/engineering/building-effective-agents) recommends starting with the simplest viable solution and adding agentic complexity only when it improves performance. It distinguishes workflows, where code defines the path, from agents, where the model dynamically directs tool use.

Support systems usually need both. Password resets, address updates, claim-status checks, and card-shipping questions may benefit from workflow-like predictability. Messier cases may need agentic flexibility: collecting context, deciding which policy branch applies, or preparing a high-quality handoff. The mistake is choosing full autonomy as the default architecture.

A good support-agent platform makes autonomy conditional. The model can reason, but it operates inside scoped tools, mandatory checks, escalation thresholds, and traceable policies. That design is less glamorous than an open-ended agent. It is also more likely to survive contact with real customers.
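Conditional autonomy can be sketched as a gate that sits between the model's proposed tool call and its execution. The tool names, required checks, and confidence thresholds below are invented for illustration and would come from policy in a real system.

```python
# Hypothetical per-tool policy: which checks must be done first, and how
# confident the agent must be before acting without a human.
TOOL_POLICY = {
    "get_card_shipping_status": {"requires": set(), "min_confidence": 0.0},
    "update_address": {"requires": {"identity_verified"}, "min_confidence": 0.8},
}


def authorize(tool, completed_checks, confidence):
    """Decide whether a proposed tool call may run: allow, escalate, or deny."""
    policy = TOOL_POLICY.get(tool)
    if policy is None:
        return "deny"        # tool is outside the agent's scope
    if not policy["requires"] <= set(completed_checks):
        return "escalate"    # a mandatory check was skipped
    if confidence < policy["min_confidence"]:
        return "escalate"    # below the escalation threshold
    return "allow"


decision = authorize("update_address", completed_checks=[], confidence=0.9)
```

Here `decision` is `"escalate"` because identity verification has not happened, however confident the model is. The model proposes; the gate decides.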

Governance is part of the product

NIST’s [AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework) describes trustworthiness as something organizations incorporate across design, development, use, and evaluation. For support agents, that means governance cannot be a PDF created after launch. It has to show up in release artifacts: risk registers, eval suites, policy mappings, incident reviews, monitoring dashboards, and approval records.

The question for a production team is not “Can the model answer this customer?” It is:

Which policy version governed the answer?
Which tools were allowed, and which were actually used?
What evidence supports the response?
How often does this flow escalate, and with what customer outcome?
What changed between the last passing eval and this deployment?
Which failures are severe enough to block release?

Those questions sound bureaucratic only if the agent is still a demo. At production scale, they are how teams keep automation legible.
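Two of those questions, what changed since the last passing eval and which failures block release, lend themselves to a small release-gate sketch. The manifest format, the severity scale, and the blocking threshold are assumptions, not a prescribed standard.

```python
def changed_modules(last_passing, current):
    """Names of context modules whose version differs, was added, or was removed."""
    names = set(last_passing) | set(current)
    return sorted(n for n in names if last_passing.get(n) != current.get(n))


def release_decision(eval_failures, blocking_severity=3):
    """Block the release if any eval failure is at or above the severity bar."""
    blockers = [f for f in eval_failures if f["severity"] >= blocking_severity]
    return {
        "release": not blockers,
        "blockers": [f["case_id"] for f in blockers],
    }


last_passing = {"billing_policy": "2.1", "tool_specs": "1.0"}
current = {"billing_policy": "2.2", "tool_specs": "1.0", "macros": "0.1"}
diff = changed_modules(last_passing, current)

failures = [
    {"case_id": "refund-policy-07", "severity": 4},
    {"case_id": "tone-12", "severity": 1},
]
decision = release_decision(failures)
```

Kept with the approval record, the diff and the decision give an incident review something concrete to start from.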

Build the flywheel before the launch

The practical takeaway is simple: before launching a customer-support agent, launch the evaluation system around it.

Start with a narrow domain. Write explicit routines. Version the context. Define tool contracts. Capture traces. Build a human-reviewed eval set from real and synthetic edge cases. Calibrate judges against expert raters. Run offline gates before release. Run online experiments after release. Feed every incident back into the suite.

Then measure the agent like a support operation: customer satisfaction, resolution quality, self-service rate, escalation quality, latency, compliance, and severity-weighted failure.
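"Severity-weighted failure" can be defined in more than one way. The sketch below assumes the simplest version: sum a weight for each failed conversation and divide by total conversations. The severity labels and weights are placeholders to be set by the support organization.

```python
# Assumed weights: one critical failure counts as much as ten low ones.
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 5, "critical": 10}


def severity_weighted_failure(failure_severities, total_conversations):
    """Weighted failures per conversation; higher is worse."""
    if total_conversations <= 0:
        raise ValueError("total_conversations must be positive")
    weighted = sum(SEVERITY_WEIGHTS[s] for s in failure_severities)
    return weighted / total_conversations


# Same number of failures, very different operational meaning.
mostly_minor = severity_weighted_failure(["low", "low", "low"], 100)
one_critical = severity_weighted_failure(["low", "low", "critical"], 100)
```

Both periods have a 3 percent raw failure rate, but the second scores four times worse, which is the distinction a flat rate hides.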

The companies that win with support agents will not be the ones that hide the most tickets from humans. They will be the ones that learn fastest without making customers absorb the cost of that learning.

A deflection bot tries to make work disappear. An evaluation flywheel makes the work visible enough to improve.
