Many LLM benchmarks reward recall, retrieval, or tightly scoped problem-solving. Professional work is usually different: it is open-ended, multi-step, and judged on quality rather than a single exact answer.
A new preprint from ByteDance Seed, XpertBench, is aimed at that gap. According to the paper, XpertBench contains 1,346 curated tasks across 80 categories, assessed with detailed weighted rubrics rather than simple pass/fail scoring. The authors report that even leading models top out around 66%, with a mean near 55% across the models they evaluate.
That does not mean existing benchmarks are useless. It does suggest they miss something important about expert-style work.
What XpertBench is trying to measure
The paper presents XpertBench as a benchmark for authentic professional and expert-oriented workflows rather than closed-form exam questions. In the abstract, the authors describe coverage across finance, healthcare, legal services, education, and dual-track research spanning STEM and humanities. In the full paper, the dataset is broken out across seven domains: Finance, Law, Education, Engineering and Applied Sciences, Humanities and Social Sciences, Computer Science, and Healthcare.
The tasks are derived from over 1,000 expert submissions, then filtered and curated into the final benchmark. The paper says experts went through qualification steps and that tasks and rubrics were reviewed by additional domain experts.
Each task is scored with a rubric that usually contains 15 to 40 weighted checkpoints. That design matters because it lets the benchmark reward partial quality and capture multiple dimensions of performance, including instruction following, factual correctness, logical coherence, domain expertise, and compliance.
Why this matters: rubric-based scoring is closer to how many expert workflows are actually judged in practice. It rewards quality across multiple dimensions instead of collapsing everything into a single exact-match answer.
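To make that scoring model concrete, here is a minimal sketch of how weighted-checkpoint aggregation could work. The Checkpoint fields, dimension names, and the toy rubric are illustrative assumptions for this post, not the paper's published schema.

```python
# Minimal sketch of weighted-checkpoint rubric scoring (illustrative schema,
# not the paper's actual data model).
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str   # what the response must do to earn credit
    dimension: str     # e.g. instruction following, factual correctness
    weight: float      # relative importance within the rubric
    score: float       # judged fraction of credit earned, in [0, 1]

def rubric_score(checkpoints: list[Checkpoint]) -> tuple[float, dict[str, float]]:
    """Aggregate checkpoint judgments into an overall score and per-dimension scores."""
    total_weight = sum(c.weight for c in checkpoints)
    overall = sum(c.weight * c.score for c in checkpoints) / total_weight

    by_dimension: dict[str, float] = {}
    for dim in {c.dimension for c in checkpoints}:
        dim_cps = [c for c in checkpoints if c.dimension == dim]
        dim_weight = sum(c.weight for c in dim_cps)
        by_dimension[dim] = sum(c.weight * c.score for c in dim_cps) / dim_weight
    return overall, by_dimension

# Toy rubric with three checkpoints (real tasks reportedly use 15 to 40).
rubric = [
    Checkpoint("Cites the governing statute", "domain expertise", 3.0, 1.0),
    Checkpoint("Follows the requested memo format", "instruction following", 1.0, 0.5),
    Checkpoint("No factual errors in the timeline", "factual correctness", 2.0, 0.0),
]
overall, per_dimension = rubric_score(rubric)
print(f"overall: {overall:.2%}", per_dimension)
```

The point of the weights is that missing a minor formatting checkpoint costs far less than missing a core domain-expertise one, which is how partial quality gets rewarded.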
The evaluation setup
The paper introduces ShotJudge, an LLM-judge setup intended to make rubric-based evaluation more scalable while staying anchored to expert assessments. In the authors' implementation, GPT-5 is used as the anchor baseline model and Gemini 2.5 Pro as the primary judge.
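The paper's ShotJudge pipeline is more involved than this, but a rough sketch of rubric-based LLM judging looks something like the following. The prompt template, the JSON verdict format, and the generic judge callable are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of per-checkpoint LLM judging; NOT the paper's ShotJudge
# implementation. The prompt, verdict format, and `judge` callable are assumed.
import json
from typing import Callable

JUDGE_PROMPT = """You are grading one rubric checkpoint for a professional task.

Task: {task}
Candidate response: {response}
Checkpoint: {checkpoint}

Return JSON exactly as: {{"credit": <number between 0 and 1>, "reason": "<one sentence>"}}"""

def judge_checkpoint(judge: Callable[[str], str], task: str, response: str, checkpoint: str) -> float:
    """Ask a judge model (any prompt -> text callable) for partial credit on one checkpoint."""
    raw = judge(JUDGE_PROMPT.format(task=task, response=response, checkpoint=checkpoint))
    verdict = json.loads(raw)
    return min(1.0, max(0.0, float(verdict["credit"])))

def judge_task(judge: Callable[[str], str], task: str, response: str,
               checkpoints: list[tuple[str, float]]) -> float:
    """Grade each (checkpoint_text, weight) pair and return the weighted overall score."""
    total_weight = sum(weight for _, weight in checkpoints)
    weighted = sum(weight * judge_checkpoint(judge, task, response, text)
                   for text, weight in checkpoints)
    return weighted / total_weight

# Stub judge that grants half credit everywhere, just to show the flow.
stub_judge = lambda prompt: '{"credit": 0.5, "reason": "stub"}'
print(judge_task(stub_judge, "Draft a compliance memo", "...",
                 [("Cites the statute", 3.0), ("Correct format", 1.0)]))
```

In practice the judge would be a real model call, and the paper anchors those judgments against expert assessments rather than trusting the judge blindly.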
The paper also reports a curated XpertBench-Gold subset of 245 tasks used for empirical evaluation. On that subset, the top reported overall score is 66.20% for Claude-Opus-4.6-thinking, followed by 64.78% for GPT-5.4-high and 64.51% for Doubao-2.0-pro.
The authors further report domain-specific divergence rather than a single universally dominant model. For example, GPT-5.4-high leads Finance at 84.65%, while Claude-Opus-4.6-thinking leads Law at 65.54%, Humanities at 83.02%, and Engineering and Applied Sciences at 49.58%.
Why this benchmark is interesting
Two things stand out.
First, it treats professional evaluation as a rubric-scoring problem. Weighted checkpoints are a better fit for open-ended outputs than exact-match grading.
Second, it emphasizes domain variation. The paper's results support a practical point: model performance depends heavily on the type of work being asked for. A strong result in one domain does not guarantee equivalent performance in another.
For teams building AI copilots, that is a useful reminder that benchmark choice should match workflow reality.
What to be careful about
This is still a preprint on arXiv, not a peer-reviewed journal paper.
It is also important to keep the scope of the claims clear. The reported scores come from the paper's rubric-based evaluation pipeline, including the XpertBench-Gold subset and the ShotJudge method. Those numbers are meaningful, but they are not the same thing as a direct deployment readiness test for every real-world workflow.
The most defensible takeaway is narrower: the paper presents evidence that expert-oriented, open-ended tasks remain difficult for current models, and that performance varies significantly by domain.
Bottom line
If you care about whether an LLM can handle professional-grade work, XpertBench is worth watching. It pushes evaluation closer to what expert tasks actually look like: open-ended prompts, detailed rubrics, and domain-specific judgment.
And the headline result is hard to ignore: in the paper's reported evaluation, even the top model reaches 66.20%, not anything close to full reliability.
That is not the final word on expert AI performance. But it is a useful corrective to the idea that strong general benchmark scores automatically translate into expert-level capability.
Sources
- Liu, Ma, Ma, Peng et al. “Expert Level Tasks with Rubrics-Based Evaluation.” arXiv:2604.02368
- Full HTML preprint