Most reasoning systems built on large language models still treat parallelism as a tournament. Sample N independent chains of thought, score them, vote, and ship the winner. Self-consistency, best-of-N, and majority voting all live inside that frame: threads explore in isolation and only meet at the end.

A new arXiv preprint, LACE: Lattice Attention for Cross-thread Exploration (Li, Zhang, Liu, and Mao; arXiv:2604.15529, April 2026), questions that frame. The authors generalize causal attention from a 1-D sequence into a 2-D lattice, allowing concurrent reasoning threads to attend across each other while they generate. The result, on the benchmarks the authors report, is a measurable gap over voting baselines on hard math, with under 11% additional parameters.

This is a preprint, not a deployed system, and the strongest evidence is concentrated on reasoning-heavy benchmarks. But the architectural idea — coordinated exploration instead of independent sampling — is worth examining on its own merits.

What it does

LACE turns parallel reasoning from N independent trials into a single coordinated process. Instead of running N separate decoder passes and reconciling their outputs, LACE generates T threads simultaneously, with a cross-thread attention path stitched into the transformer so that any token in one thread can, at certain layers, attend to tokens in the other threads.

Concretely, LACE adds three things on top of a standard decoder:

  1. A lattice attention mechanism that augments standard scaled dot-product attention with a cross-thread path and a learned gate that fuses the two.
  2. A synthetic multi-thread training pipeline to address the absence of natural training data where reasoning chains explicitly reference each other.
  3. A multi-thread reinforcement learning stage, which the authors call Lattice GRPO, that rewards both correctness and diversity across threads.

The headline framing from the authors is that LACE reframes parallel reasoning as coordinated exploration rather than independent sampling followed by selection. That is a useful way to read the contribution: not a new decoding trick, but an architectural and training intervention designed to let threads correct and complement each other mid-generation.

Why it matters

Inference-time compute is one of the few axes practitioners can still push without retraining a base model. The standard playbook — sample more, vote, rerank — has known ceilings. Independent samples drawn from the same model tend to fail in correlated ways: they miss the same edge cases, take the same wrong turn at the same step, and pile up redundantly on the same wrong answer. Voting only helps when errors are roughly independent.
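For reference, the orchestration-level baseline is easy to state in code. The sketch below is ours, not the paper's; `sample_answer` is a hypothetical stand-in for one independent decoding pass.

```python
from collections import Counter
import random

def majority_vote(sample_answer, n=8):
    """Self-consistency baseline: draw n independent answers, keep the mode.

    sample_answer is a zero-arg callable standing in for one decoding
    pass of a model (a hypothetical hook, not an API from the paper).
    """
    answers = [sample_answer() for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# When samples fail in a correlated way, the mode can be the shared
# wrong answer, and voting confidently ships the mistake.
random.seed(0)
correlated = lambda: random.choices(["wrong", "right"], weights=[0.7, 0.3])[0]
print(majority_vote(correlated, n=9))
```

The failure mode is visible in the toy sampler: when most draws share the same wrong turn, a larger N only makes the vote more confident, not more correct.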

LACE is interesting because it attacks that limitation at the architecture level rather than at the orchestration level. The architecture is intended to let partial work in one thread inform the others during generation, reducing the odds that every path collapses onto the same mistake. That is a different bet than throwing more independent samples at a problem and hoping the majority is right.

For practitioners building reasoning agents, code generation pipelines, or planning systems, the LACE result suggests a path that does not require giving up parallelism but does require rethinking how parallel work is structured. Whether that path generalizes outside competition math is the open question — and the authors are reasonably restrained about it.

How it works

The architectural changes are concentrated in a few specific components.

Lattice attention. Standard self-attention over a single thread is preserved. On top of its output, LACE adds a cross-thread attention operation that lets each token attend across the other threads. The two outputs are combined through a learned gate of the form G = σ(Linear([H_down; A_p])), where H_down is a downsampled representation and A_p is the cross-thread attention output. The gate gives the model dynamic, thread-aware control over how much to lean on cross-thread information at each token.
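A minimal NumPy sketch of that gate, treating the self-attention and cross-thread outputs as given tensors. The shapes, the omitted bias, and the residual combination are our assumptions; the paper's formula specifies only the gate itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lattice_gate(h_self, a_cross, w_down, w_gate):
    """Gated fusion G = sigma(Linear([H_down; A_p])), as a plain-NumPy sketch.

    h_self:  (threads, seq, d) per-thread self-attention output
    a_cross: (threads, seq, d) cross-thread attention output A_p
    w_down:  (d, d_down) downsampling projection producing H_down
    w_gate:  (d_down + d, d) gate weights (bias omitted for brevity)
    The residual combination h_self + G * a_cross is our assumption; a
    gate value near zero lets the model ignore the cross-thread path.
    """
    h_down = h_self @ w_down
    g = sigmoid(np.concatenate([h_down, a_cross], axis=-1) @ w_gate)
    return h_self + g * a_cross

rng = np.random.default_rng(0)
d, d_down = 64, 32
h = rng.standard_normal((4, 16, d))      # 4 threads, 16 tokens each
a = rng.standard_normal((4, 16, d))
out = lattice_gate(h, a,
                   rng.standard_normal((d, d_down)) * 0.1,
                   rng.standard_normal((d_down + d, d)) * 0.1)
print(out.shape)  # (4, 16, 64)
```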

Efficiency strategies. To keep the parameter and compute overhead under 11%, the authors apply three reductions: cross-thread attention operates on a downsampled representation, lattice layers are inserted only in a band of middle-to-late layers rather than the full stack, and the gating mechanism allows the model to ignore the cross-thread path when it is unhelpful. Positional encoding uses 3D RoPE to disambiguate token position within a thread from thread identity across threads.
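The 3D RoPE idea can be sketched by rotating separate slices of each activation with separate coordinates, so within-thread position and thread identity get distinct rotary channels. The dimension split below is illustrative, not the authors' exact layout.

```python
import numpy as np

def rope_3d(x, coords, base=10000.0):
    """Rotary position embedding over three coordinate axes.

    x:      (..., d) activations, with d divisible by 6
    coords: (..., 3) integer positions, e.g. (position_in_thread, thread_id, 0)
    Each axis gets d // 3 dimensions, rotated in half-dim pairs; this
    split is our assumption, not the paper's exact 3D RoPE layout.
    """
    d = x.shape[-1]
    per_axis = d // 3
    pieces = []
    for axis in range(3):
        xa = x[..., axis * per_axis:(axis + 1) * per_axis]
        half = per_axis // 2
        freqs = base ** (-np.arange(half) / half)
        ang = coords[..., axis:axis + 1] * freqs          # (..., half) angles
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = xa[..., :half], xa[..., half:]
        pieces.append(np.concatenate([x1 * cos - x2 * sin,
                                      x1 * sin + x2 * cos], axis=-1))
    return np.concatenate(pieces, axis=-1)

x = np.random.default_rng(1).standard_normal((4, 16, 48))  # threads, tokens, d
pos = np.stack(np.broadcast_arrays(
    np.arange(16)[None, :],            # position within thread
    np.arange(4)[:, None],             # thread identity
    np.zeros((4, 16), dtype=int)), axis=-1)
print(rope_3d(x, pos).shape)  # (4, 16, 48)
```

Two tokens at the same within-thread position but in different threads receive different rotations, which is the disambiguation the paper is after.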

Training. LACE is trained in three stages. Continuous pre-training adapts the base model to multi-thread inputs. Supervised fine-tuning uses random thread shuffling to prevent the model from memorizing thread order. The reinforcement learning stage, Lattice GRPO, extends GRPO to the multi-thread setting and combines an accuracy reward (based on symbolic verification of final answers, plus self-selection tags such as [[best]], [[success]], and [[fail]] that the model emits) with a diversity reward defined as the average cosine dissimilarity between thread embeddings.
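The diversity term is simple enough to sketch directly: average pairwise cosine dissimilarity across thread embeddings. The normalization details are our assumption; the paper may weight or combine the terms differently.

```python
import numpy as np

def diversity_reward(embeddings: np.ndarray) -> float:
    """Average pairwise cosine dissimilarity across thread embeddings.

    embeddings: (T, d) array, one embedding per reasoning thread.
    Mirrors the diversity term described for Lattice GRPO; the total
    reward would add this (scaled) to the accuracy reward.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                    # (T, T) cosine similarities
    t = len(embeddings)
    off_diag = sims[~np.eye(t, dtype=bool)]     # drop self-similarity
    return float(np.mean(1.0 - off_diag))

# Identical threads earn zero diversity reward; orthogonal threads earn 1.
same = np.tile(np.array([[1.0, 0.0]]), (4, 1))
print(diversity_reward(same))                  # 0.0
print(diversity_reward(np.eye(3)))             # 1.0
```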

Synthetic data. The model is trained largely on synthetic multi-thread reasoning traces. Problems are filtered to a success rate window of S ∈ (0, 0.5] to keep difficulty in a useful band. Iterative sampling with solution-history rejection produces threads that are logically distinct rather than near-duplicates. Long traces above 8000 tokens are compressed below 3000 to fit context budgets.
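The rejection recipe can be sketched in a few lines; the similarity predicate and threshold below are illustrative stand-ins, since the paper's exact filter is not reproduced here.

```python
def sample_distinct_threads(sample_solution, too_similar, k=4, max_tries=40):
    """Iterative sampling with solution-history rejection.

    sample_solution: zero-arg callable producing one reasoning trace
        (a hypothetical stand-in for a model call).
    too_similar: predicate comparing a candidate against an accepted trace.
    Keep a candidate only if it is not too similar to anything accepted.
    """
    accepted = []
    for _ in range(max_tries):
        cand = sample_solution()
        if not any(too_similar(cand, prev) for prev in accepted):
            accepted.append(cand)
        if len(accepted) == k:
            break
    return accepted

def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a toy similarity measure."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

pool = iter(["use algebra", "use algebra again", "draw a diagram"])
threads = sample_distinct_threads(lambda: next(pool),
                                  lambda a, b: jaccard(a, b) > 0.5, k=2)
print(threads)  # ['use algebra', 'draw a diagram']
```

The second candidate is rejected for overlapping too heavily with the first, which is exactly the near-duplicate behavior the pipeline is designed to suppress.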

The base models in the paper are Qwen3-1.7B and Qwen3-4B.

Key results

The most-cited numbers in the paper come from the Qwen3-4B model trained with LACE SFT plus Lattice GRPO, compared against self-consistency voting:

  • AIME 24: 20.0
  • AIME 25: 16.7 vs. 13.3 (+3.4)
  • LiveBench: 33.0 vs. 28.0 (+5.0)

On Qwen3-1.7B, LACE reaches 13.3 on AIME 25, a +3.3 gap over voting. The "over 7 points" figure that appears in the abstract refers specifically to AIME 24, not a universal uplift; on the other reported benchmarks the margin is smaller but still positive.

Format adherence — the model emitting required tags and structure — is reported at 100% on AIME and 85.5% on LiveBench at the 4B scale. Comparisons include standard SFT+RL, an "Isolated Parallel" ablation that uses the same data and format but no lattice layers, self-consistency voting, Parallel-R1, and Native Parallel Reasoner. The Isolated Parallel ablation is the most informative point of comparison: it isolates the contribution of the lattice attention path from the contribution of the data and format.

The picture is less clean outside math. On the TextWorldCookAgent benchmark, LACE achieves the highest Best and Win Rate scores but a Mean Score below single-thread GRPO. The paper acknowledges that broader agent and open-ended evaluation is left for future work, and several ablations are reported on training subsets rather than full evaluation suites.

Practitioner takeaways

A few things in this paper transfer even if you never adopt the full architecture:

  • Synthetic multi-thread data via solution-history rejection. The recipe for generating logically distinct reasoning chains — sample, reject anything too similar to prior solutions, repeat — is a useful pattern for any pipeline that wants diverse traces rather than near-duplicates.
  • Diversity rewards as a mode-collapse guardrail. The embedding-cosine diversity term in Lattice GRPO is a small change to a standard RL objective, and the underlying motivation (correlated failures across samples kill the value of voting) applies to any best-of-N system.
  • Self-selection tags. Having the model emit [[best]], [[success]], and [[fail]] markers gives downstream pipelines a cheap signal for cherry-picking without requiring an external verifier on every problem.
  • Pick the workload carefully. The largest gains are on tasks with multiple valid solution paths and verifiable answers, especially competition math. Workloads with a single forced answer path are unlikely to benefit as much from cross-thread coordination.
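The self-selection tags are cheap to exploit downstream. A minimal parser and thread picker, with tag placement conventions assumed rather than taken from the paper:

```python
import re

TAGS = ("best", "success", "fail")

def extract_tags(output: str) -> dict:
    """Count [[best]]/[[success]]/[[fail]] markers in one thread's output."""
    found = re.findall(r"\[\[(best|success|fail)\]\]", output)
    return {tag: found.count(tag) for tag in TAGS}

def pick_thread(outputs):
    """Prefer a [[best]]-tagged thread, else any [[success]], else the first.

    A hypothetical selection policy; the paper does not prescribe how
    downstream consumers should rank the tags.
    """
    for want in ("best", "success"):
        for out in outputs:
            if extract_tags(out)[want]:
                return out
    return outputs[0]

threads = ["x = 2 [[fail]]", "x = 3 [[success]]", "x = 3 [[best]]"]
print(pick_thread(threads))  # x = 3 [[best]]
```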

The architectural change itself is harder to drop into an existing stack. Cross-thread attention requires retraining and is not a decoding-time wrapper.

Our take

LACE is a clean architectural argument: independent sampling is leaving signal on the table, and a small, gated cross-thread attention path can recover some of it. The paper supports that argument well on the benchmarks it chooses, and the Isolated Parallel ablation does the right thing by separating the data and format contribution from the lattice mechanism itself.

We would temper the framing in three ways. First, this is a preprint, and the strongest evidence is on competition-math benchmarks where solutions are verifiable and the search space rewards diverse exploration. Second, the agent results show a real trade-off: peak performance improves while mean score on at least one agent benchmark falls below a single-thread baseline, which matters for any production system that cares about reliability rather than best case. Third, the synthetic data pipeline and per-model difficulty calibration are nontrivial engineering, and the paper does not show how robust the approach is when those choices are perturbed.

That said, the high-level direction is the right one to track. Inference-time scaling that relies purely on more independent samples is a saturating strategy. Architectural coordination among parallel threads is a more interesting bet, and LACE is a credible early data point that the bet can pay off where it matters most: hard problems with room for genuinely different solution paths.

Build Better Reasoning Systems

If you are evaluating inference-time scaling, agent orchestration, or reasoning-model architecture choices, our field guide helps teams separate headline claims from production realities.

Get the Field Guide — $10 →