Ask an LLM to answer a question about a table, and it sounds like the problem is solved. The model reads the table, reasons over it, and gives you an answer. Simple.
But multi-turn table reasoning — where a model has to track what it knows, ask follow-up questions, and refine its answer across multiple turns — has a hidden failure mode that nobody talks about much: representation drift.
Here's what happens. A table is a 2D object. When you feed it to an LLM, you have to serialize it into text — JSON, Markdown, or LaTeX. That serialization destroys the spatial topology. Cells that are adjacent in a 2D grid become distant tokens in a 1D sequence. The model loses the ability to perceive row and column relationships correctly.
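The adjacency loss is easy to demonstrate. A minimal Python sketch (the table contents are made up for illustration) measures how far apart two vertically adjacent cells land once a table is serialized to JSON:

```python
import json

# A tiny 2x3 table: two cells that are vertically adjacent in the grid
# end up far apart once the table becomes a 1D character stream.
table = [
    {"product": "apple", "price": 3, "stock": 120},
    {"product": "pear",  "price": 4, "stock": 80},
]

text = json.dumps(table)

# "price": 3 (row 0) and "price": 4 (row 1) are neighbors in 2D,
# but separated by every remaining token of row 0 in the serialization.
i, j = text.index('"price": 3'), text.index('"price": 4')
print(j - i)  # distance in characters between vertically adjacent cells
```

Two prices that sit one cell apart in the grid are separated by every remaining token of the first row; in a wide table, that gap grows with every extra column.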
And here's the part that makes it worse: the errors accumulate. In a single-turn setting, a representation error might cause one wrong answer. In a multi-turn setting, where each turn builds on the previous one, the errors compound. After a few turns, the model's belief state about what's actually in the table diverges from reality — and it doesn't know it's happening.
And the existing fixes are expensive. Grounding methods that try to correct belief-state drift add significant compute and latency, making them impractical for real-world deployment.
A new paper from researchers at UCLA, McGill, HKUST, and Université de Montréal introduces a framework that solves both problems simultaneously — without any training.
The Two Bottlenecks
TABQWORLD identifies two distinct bottlenecks in multi-turn table reasoning:
Representation bottleneck. Fixed text serialization systematically distorts 2D table topology. The model misjudges which cells are adjacent, which row a value belongs to, and which column a header governs. This isn't a model capability problem. It's a fundamental mismatch between how tables encode information and how text represents it.
Estimation bottleneck. When models try to estimate the full table state — for example, "what does the table look like after sorting by column X?" — they hallucinate. The researchers tested this directly: after correctly sorting a table, GPT-5.4 was asked to estimate the resulting state. It generated 1 for all numerical values — the wrong answer — despite having correctly executed the sort operation. The belief state diverged silently.
TABQWORLD: Two Mechanisms, One Goal
The framework has two complementary components that jointly optimize table reasoning:
1. Action-Conditioned Multimodal Selection Policy
Instead of fixing the table representation, TABQWORLD makes the representation a choice. At each reasoning step, the model predicts a triplet:
- αₜ — the executable operation (sort, filter, aggregate...)
- rₜ — step-level feedback
- mₜ — the modality for the next observation (text or image)
The key insight: the same operation can be visually obvious or textually obvious depending on the table structure. A spatial relationship like "all rows where column A is greater than column B" is trivial to see in a rendered image but requires careful text parsing to execute. A cell value lookup is trivial in text but hard to see in an image of a large table.
TABQWORLD lets the model decide which representation is more reliable for the current task, dynamically switching between them rather than committing to one format for the entire session.
The policy is training-free. It uses the model's own reasoning to choose modality, without any fine-tuning or auxiliary training signal. The framework works with existing models off the shelf.
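A toy sketch of what such a per-step triplet might look like in code. The dataclass, the operation names, and the size-based heuristic are illustrative assumptions, not the paper's actual policy:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str    # a_t: the executable operation
    feedback: str  # r_t: step-level feedback
    modality: str  # m_t: "text" or "image" for the next observation

def choose_modality(action: str, n_rows: int, n_cols: int) -> str:
    """Toy heuristic: spatial operations over large tables favor an
    image observation; point lookups and small tables favor text."""
    spatial = action in {"sort", "filter", "compare_columns"}
    return "image" if spatial and n_rows * n_cols > 50 else "text"

step = Step(
    action="sort",
    feedback="sorted by price, ascending",
    modality=choose_modality("sort", n_rows=20, n_cols=6),
)
print(step.modality)  # -> image, under this toy heuristic
```

The point is structural: modality is an output of each reasoning step, not a fixed property of the session.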
2. Metadata-Guided Trajectory Optimization
Full table state estimation is infeasible — you can't dump the full table state into the context at every turn without blowing through token limits and accumulating massive latency.
TABQWORLD instead extracts a lightweight metadata summary of the table state at each step: dimensions, data types, key values, row/column counts. This is the low-dimensional projection the model uses for belief tracking.
The key trick: the model predicts what it expects the metadata to look like after an action — "after sorting by price, I expect 5 rows, prices in ascending order, and 3 unique product names." It then compares this prediction against the realized metadata. A mismatch tells the system that something went wrong in execution — without needing to inspect the full table.
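The predict-then-compare loop can be sketched in a few lines. The metadata fields below are assumptions for illustration; the paper's exact schema may differ:

```python
def metadata(rows):
    """Lightweight summary used for belief tracking (assumed fields)."""
    prices = [r["price"] for r in rows]
    return {
        "n_rows": len(rows),
        "sorted_by_price": prices == sorted(prices),
        "n_products": len({r["product"] for r in rows}),
    }

rows = [
    {"product": "pear",  "price": 4},
    {"product": "apple", "price": 3},
    {"product": "fig",   "price": 7},
]

# Prediction: "after sorting by price I expect 3 rows, ascending
# prices, and 3 unique product names."
expected = {"n_rows": 3, "sorted_by_price": True, "n_products": 3}

rows = sorted(rows, key=lambda r: r["price"])  # execute the action
realized = metadata(rows)

# A mismatch would flag a silent execution error without ever
# inspecting the full table.
print(realized == expected)  # -> True
```

Comparing two small dictionaries is cheap enough to run every turn, which is exactly what full-state verification is not.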
Low-risk actions are then compressed: instead of executing sort → observe → verify across three separate turns, TABQWORLD compresses sequences of low-uncertainty actions into single turns, cutting conversation length and latency.
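A minimal sketch of that compression step, with an uncertainty score per action and a threshold chosen purely for illustration:

```python
def compress(actions, threshold=0.2):
    """Batch consecutive low-uncertainty actions into a single turn;
    each high-uncertainty action gets its own turn for verification."""
    turns, batch = [], []
    for name, uncertainty in actions:
        if uncertainty < threshold:
            batch.append(name)          # safe to defer verification
        else:
            if batch:
                turns.append(batch)
                batch = []
            turns.append([name])        # risky action: verify alone
    if batch:
        turns.append(batch)
    return turns

actions = [("sort", 0.05), ("filter", 0.1),
           ("aggregate", 0.6), ("lookup", 0.05)]
print(compress(actions))  # -> [['sort', 'filter'], ['aggregate'], ['lookup']]
```

Four actions become three turns here; longer runs of confident operations compress further.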
The Numbers
The researchers evaluated TABQWORLD across 7 benchmarks — WikiTableQuestions, TABMWP, TAT-QA, HiTab, FeTaQA, TabFact, and InfoTabs — against 6 baseline groups spanning text-only, image-only, multimodal, adaptive routing, and trained trajectory-optimizing approaches.
| Approach | WTQ Accuracy | Key Tradeoff |
|---|---|---|
| JSON serialization (text) | 72.2% | High token cost, moderate accuracy |
| Image rendering (visual) | 81.8% | Low token cost, high accuracy |
| TABQWORLD (dynamic) | 86.7% | Lowest latency, highest accuracy |
Key results:
- 4.87% accuracy improvement over the best baseline on WikiTableQuestions
- 5.42% accuracy gain and 33.35% inference latency reduction over static (fixed-modality) settings
- Outperformed trained agents including DeepSeek-R1-distilled models with process reward modeling — without any training
The visual grounding advantage is stark in the pilot study. On WikiTableQuestions, rendering the table as an image (rather than serializing as JSON, LaTeX, or Markdown) improved accuracy from 72.2% to 81.8% with Qwen-3-8B-VL — nearly 10 percentage points from a formatting change alone.
Why This Matters
TABQWORLD is a reminder that representation is reasoning. The format you use to encode information determines what the model can perceive, what it gets wrong, and how errors propagate through multi-step reasoning.
Structured data is everywhere: spreadsheets, financial reports, scientific tables, database outputs. And we've been shoehorning all of it into text formats because that's what LLMs expect. But the moment you serialize a table as text, you're throwing away the spatial structure that makes tables useful in the first place.
The training-free design is also notable. Every competitive approach in this space has required fine-tuning, process reward models, or tabular grounding networks trained at significant compute cost. TABQWORLD achieves better results by letting the model choose its own representation strategy — a framework design choice rather than a training intervention.
The compression mechanism is equally important for practical deployment. Cutting multi-turn interactions from 5–9 turns down to 2–3 turns doesn't just reduce latency — it reduces the surface area for error accumulation. Shorter trajectories are more reliable trajectories.
Bottom Line
TABQWORLD turns a familiar observation into a working system: how you present a table to a model determines what it can perceive, what it gets wrong, and how those errors propagate across turns.
By dynamically switching between visual and textual representations and using lightweight metadata summaries to track belief state, TABQWORLD achieves state-of-the-art table reasoning while simultaneously reducing inference cost by a third. That's a rare combination — better and cheaper.
For practitioners building on structured data: this paper is a strong signal that your table formatting choices are more consequential than your model choices. And for the field broadly: the next leap in reasoning capability might come not from bigger models, but from smarter information presentation.
Paper: arXiv:2604.03393 — UCLA, McGill, HKUST, Université de Montréal