The most seductive enterprise AI demo is still the simplest one: open a chat box, ask a business question, and watch the system query company data like a patient analyst with perfect memory.
That demo is useful. It is also dangerously incomplete.
The hard part of a data agent is not the sentence the user types. It is everything behind the sentence: which tables count as authoritative, how two systems should be joined when their identifiers do not match, whether a number hidden in prose should be extracted or ignored, how much compute an exploratory query may burn, and what the system should do when the answer is plausible but not proven.
Recent benchmarks are making that gap much harder to ignore. The Data Agent Benchmark (DAB) from the UC Berkeley EPIC Data Lab and Hasura PromptQL evaluates agents on realistic enterprise-style questions across 54 queries, 12 datasets, 9 domains, and four database systems: PostgreSQL, MongoDB, SQLite, and DuckDB. Its public leaderboard shows top submissions clustered around the low 60s in Pass@1, while the original paper reports a best frontier-model baseline of 38% Pass@1 and less than 69% even with 50 attempts.
That is not a small UI problem. It is a systems problem.
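For readers less familiar with the metric, Pass@k is the chance that at least one of k attempts is correct. The sketch below uses the unbiased estimator that is common in code-generation evaluation. That choice is an assumption for illustration; the sources here do not say how DAB computes its scores, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n sampled attempts, c of which were correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), the probability that a
    random size-k subset of the n attempts contains at least one correct one.
    """
    if n - c < k:
        # Fewer than k wrong attempts exist, so every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 19 correct answers out of 50 attempts.
print(pass_at_k(n=50, c=19, k=1))
print(pass_at_k(n=50, c=19, k=5))
```

The gap between the k=1 and k=5 values is the point: an agent that only succeeds after many retries is not one a business user can rely on for a single answer.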
The failure is not “SQL generation”
Most teams hear “ask questions over data” and map it to text-to-SQL. That is too narrow for real business work.
The DAB paper points to four properties that make enterprise data-agent work different from clean benchmark translation. The agent may need to integrate multiple databases. It may need to reconcile ill-formatted join keys, such as IDs with prefixes, spaces, aliases, or domain-specific mappings. It may need to extract values from unstructured text. It may need private domain knowledge that is obvious to a sales ops lead or clinical operations manager but invisible to a general model.
Those are not edge cases. They are the normal texture of enterprise data.
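The join-key problem is easy to picture in code. Below is a minimal sketch of key reconciliation, assuming a hand-maintained prefix list and alias table. `KNOWN_PREFIXES`, `ALIASES`, `normalize_key`, and `join_rows` are hypothetical names invented for this example, and the rules are deliberately simplistic.

```python
import re

# Hypothetical conventions; a real deployment would load these from the
# team's own documentation of how each system formats its identifiers.
KNOWN_PREFIXES = ("CUST-", "CUST_", "ID:")
ALIASES = {"ACME-HQ": "1042"}  # hypothetical domain-specific mapping


def normalize_key(raw: str) -> str:
    """Reduce a messy identifier to a canonical form for joining."""
    key = re.sub(r"\s+", "", raw).upper()   # drop all whitespace, unify case
    key = ALIASES.get(key, key)             # resolve known aliases first
    for prefix in KNOWN_PREFIXES:
        if key.startswith(prefix):
            key = key[len(prefix):]
            break
    return key.lstrip("0") or "0"           # "001042" and "1042" should match


def join_rows(left, right, left_key, right_key):
    """Join two lists of dicts on normalized keys and report the misses."""
    index = {}
    for row in right:
        index.setdefault(normalize_key(row[right_key]), []).append(row)
    matched, unmatched = [], []
    for row in left:
        hits = index.get(normalize_key(row[left_key]))
        if hits:
            matched.extend((row, hit) for hit in hits)
        else:
            unmatched.append(row)  # surfaced rather than silently dropped
    return matched, unmatched
```

Returning the unmatched rows matters as much as the join itself: a silent inner join over mismatched keys is exactly how a plausible but wrong number gets produced.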
BEAVER, an enterprise text-to-SQL benchmark built from private data warehouses, tells the same story from another angle. Private warehouses have cryptic schemas, implicit joins, sparse documentation, long-tail domain values, and complex analytical queries. In the reported results, state-of-the-art agentic systems using GPT-5.2 reach only 10.8% accuracy on BEAVER, rising to 30.1% when given oracle annotations for critical subtasks.
The lesson is blunt: if an agent cannot reliably identify the right tables, infer the right joins, map business language to private schema elements, and decompose the analytical intent, the chat interface only makes the failure easier to ask for.
Wrong answers can be expensive before they are visible
Data agents also differ from ordinary chatbots because mistakes have execution cost.
The Text-to-“Big SQL” paper argues that evaluation should include more than whether generated SQL matches a reference answer. In large data systems, the agent’s reasoning time, tool orchestration, query runtime, execution cost, partial correctness, and scale all matter. A query that is “almost right” can still scan too much data, omit a required column, return the wrong row set, or force another expensive round trip.
This is where many conversational analytics pilots become fragile. The user sees an answer. The infrastructure sees a chain of schema inspections, failed attempts, retries, exploratory queries, and maybe a final result that looks neat but is not sufficiently verified.
For a dashboard, we usually demand definitions: what metric is this, where does it come from, how fresh is it, who owns it, and what changed since last month? A data agent deserves at least the same discipline. In fact, it deserves more, because it can assemble a new path through the data every time it is asked.
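Execution cost can be capped mechanically rather than left to the model's judgment. The sketch below uses SQLite, one of the four systems named earlier, and its progress handler from Python's standard `sqlite3` module to abort a query that runs past a step budget. `run_with_budget`, `BudgetExceeded`, and the tick sizes are hypothetical choices for illustration; a warehouse would rely on its own timeout or quota mechanisms instead.

```python
import sqlite3


class BudgetExceeded(Exception):
    """Raised when a query uses more work than its budget allows."""


def run_with_budget(conn, sql, max_ticks, tick_size=1000):
    """Run sql, aborting after max_ticks * tick_size SQLite VM instructions."""
    ticks = 0

    def on_tick():
        nonlocal ticks
        ticks += 1
        # A non-zero return value tells SQLite to interrupt the statement.
        return 1 if ticks > max_ticks else 0

    conn.set_progress_handler(on_tick, tick_size)
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.DatabaseError as exc:
        if ticks > max_ticks:
            raise BudgetExceeded(f"aborted after {ticks} ticks: {sql!r}") from exc
        raise  # a genuine SQL error, not a budget problem
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again
    return rows, ticks
```

An agent loop could catch `BudgetExceeded` and ask the user for a narrower scope instead of retrying the same expensive scan.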
The missing object is a query contract
A useful data agent should not merely accept a natural-language question. It should operate under a query contract.
A query contract is the operational agreement that governs how the agent may turn intent into data work. It does not need to be heavyweight at first, but it should make the hidden assumptions explicit.
At minimum, that contract should include:
- Source scope: which databases, schemas, tables, APIs, and documents the agent is allowed to touch for a class of questions.
- Semantic mappings: how business terms map to private schema elements, metric definitions, entity IDs, and known aliases.
- Join policy: which joins are approved, which require evidence, and which should trigger escalation because the key relationship is ambiguous.
- Transformation evidence: the supporting row, excerpt, rule, or classifier rationale that must be preserved whenever the agent extracts values from unstructured text.
- Cost and latency budgets: how much exploration is acceptable before the system asks for narrower scope or routes to an analyst workflow.
- Verification path: whether the answer was checked by execution equivalence, independent query generation, deterministic tests, sampling, reconciliation against known dashboards, or human review.
- Refusal and escalation rules: when the agent must say “I cannot answer this reliably from the available data.”
- Audit trail: the plan, queries, intermediate assumptions, tool calls, and validation checks that connect the final answer to the work that produced it.
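A contract like this can start as a small data structure plus a check that runs before any query does. The sketch below covers only three of the items above (source scope, join policy, and a query budget), and every name in it (`QueryContract`, `QueryPlan`, `review_plan`, the table names) is hypothetical. It is one possible shape, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueryContract:
    question_family: str
    allowed_tables: frozenset[str]              # source scope
    approved_joins: frozenset[tuple[str, str]]  # join policy, as sorted pairs
    term_mappings: dict[str, str] = field(default_factory=dict)  # semantics
    max_queries: int = 5                        # crude exploration budget


@dataclass
class QueryPlan:
    tables: set[str]
    joins: set[tuple[str, str]]
    planned_queries: int


def review_plan(contract: QueryContract, plan: QueryPlan) -> tuple[str, list[str]]:
    """Return ("allow" | "escalate" | "refuse", reasons) for a proposed plan."""
    outside = plan.tables - contract.allowed_tables
    if outside:
        # Out-of-scope sources are a hard stop, not a judgment call.
        return "refuse", [f"tables outside source scope: {sorted(outside)}"]

    reasons = []
    unapproved = {tuple(sorted(j)) for j in plan.joins} - contract.approved_joins
    if unapproved:
        reasons.append(f"joins needing evidence: {sorted(unapproved)}")
    if plan.planned_queries > contract.max_queries:
        reasons.append(f"plan needs {plan.planned_queries} queries, "
                       f"budget is {contract.max_queries}")
    return ("escalate" if reasons else "allow"), reasons


contract = QueryContract(
    question_family="revenue variance",
    allowed_tables=frozenset({"invoices", "accounts"}),
    approved_joins=frozenset({("accounts", "invoices")}),
    term_mappings={"revenue": "invoices.amount where status = 'paid'"},
)
plan = QueryPlan(tables={"invoices", "accounts"},
                 joins={("invoices", "accounts")}, planned_queries=2)
print(review_plan(contract, plan))
```

The three-way result mirrors the refusal and escalation rules: some violations stop the agent outright, while others hand the plan to a person with the reasons attached.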
The contract changes the product question. Instead of “Can the model answer questions over our data?” the better question becomes: “For which question families can we define sources, semantics, joins, transformations, budgets, and verification strongly enough to let an agent operate?”
That is a much more useful adoption boundary.
Clean evaluation beats clever orchestration
There is another uncomfortable lesson in the recent literature: more agent machinery is not always the fix.
ReViSQL argues that human-level performance on some text-to-SQL settings depends heavily on clean verified data and verifiable rewards, not just more elaborate pipelines. Its authors report severe annotation noise in BIRD training samples: 61.1% of sampled instances contained at least one annotation error, and the gold SQL had to be corrected in 52.1%. If the reward signal is wrong, an agent can learn to reproduce the benchmark’s mistakes rather than the user’s intent.
That matters for enterprise teams building internal data agents. A gold set of business questions is not automatically a gold set. If the reference SQL is stale, ambiguous, or quietly wrong, the evaluation harness becomes another source of false confidence.
A query contract should therefore be paired with a curated evaluation set. Start narrow. Pick a few high-value question families: revenue variance, account risk, open support burden, claims backlog, inventory exceptions, trial conversion. For each one, define the approved sources, joins, metric definitions, edge cases, and expected verification checks. Then test the agent against those cases repeatedly, including adversarial variants and stale-data scenarios.
The goal is not to make the agent sound fluent. The goal is to make its data work inspectable.
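A curated evaluation set of this kind can begin as a list of cases plus an execution-equivalence check. The sketch below runs candidate SQL and reference SQL against the same in-memory SQLite database and compares the returned rows. The schema, the cases, and `toy_agent` (a lookup table standing in for a real agent) are all hypothetical.

```python
import sqlite3
from collections import Counter


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (account TEXT, amount INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO invoices VALUES (?, ?, ?)",
        [("acme", 100, "paid"), ("acme", 50, "void"), ("globex", 70, "paid")],
    )
    return conn


CASES = [
    {"id": "rev-by-account", "question": "revenue by account",
     "reference_sql": "SELECT account, SUM(amount) FROM invoices "
                      "WHERE status = 'paid' GROUP BY account"},
    {"id": "paid-count", "question": "how many paid invoices",
     "reference_sql": "SELECT COUNT(*) FROM invoices WHERE status = 'paid'"},
]


def toy_agent(question):
    # Stand-in for a real agent; the first answer forgets the status filter.
    return {
        "revenue by account": "SELECT account, SUM(amount) FROM invoices GROUP BY account",
        "how many paid invoices": "SELECT COUNT(*) FROM invoices WHERE status != 'void'",
    }[question]


def results_match(conn, candidate_sql, reference_sql):
    """Execution equivalence: same multiset of rows, ignoring row order."""
    try:
        got = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that does not run counts as a failure
    expected = conn.execute(reference_sql).fetchall()
    return Counter(got) == Counter(expected)


def evaluate(conn, cases, agent):
    return {c["id"]: results_match(conn, agent(c["question"]), c["reference_sql"])
            for c in cases}


print(evaluate(build_demo_db(), CASES, toy_agent))
```

Note that the second case passes only because this snapshot has no other statuses; the candidate's `status != 'void'` filter would diverge from the reference on data with, say, pending invoices. That is the argument for adversarial variants and stale-data scenarios: matching results on one database prove less than they appear to.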
The chat box can stay. It just cannot be the system.
Natural language is still a powerful interface. It lets operators ask follow-up questions, explore anomalies, and reduce the translation burden between domain experts and data teams.
But the enterprise value does not come from pretending the chat box is magic. It comes from putting a governed data system behind it.
The next generation of useful data agents will look less like open-ended analysts and more like constrained operators inside well-defined lanes. They will know which sources are authoritative. They will expose uncertain joins. They will preserve extraction evidence. They will budget expensive exploration. They will refuse when the data does not support the question. And when they answer, they will leave enough trace behind for another human or system to check the work.
That may sound less glamorous than “ask your data anything.”
It is also the difference between a demo and a tool an enterprise can trust.
Sources
- https://ucbepic.github.io/DataAgentBench/
- https://arxiv.org/html/2603.20576v1
- https://arxiv.org/html/2409.02038v3
- https://arxiv.org/html/2602.21480v1
- https://arxiv.org/html/2603.20004v2