Most reinforcement learning gains in language models have shown up more clearly in math and code than in broad general reasoning. A new UCLA preprint asks whether the bottleneck is less about inventing a new RL method or spending more compute, and more about choosing the right training data.
The paper, "SuperNova: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions," proposes a data-curation framework for reinforcement learning with verifiable rewards (RLVR) that targets general reasoning skills such as logic, causal inference, temporal ordering, and commonsense reasoning: the kinds of capabilities that math-focused RL recipes have largely failed to move.
What SuperNova actually does
SuperNova is not a new RL algorithm. It is a recipe for selecting, mixing, and formatting training data so that standard RLVR pipelines can improve broad reasoning instead of just narrow quantitative skills. The key insight is that existing human-annotated instruction-tuning datasets, specifically Super-NaturalInstructions, a collection the paper reports contains around 1,600 tasks, already encode latent reasoning structure. The authors reformat a candidate pool of 83 tasks from that collection into answer-verifiable questions suitable for RLVR, then systematically study which tasks help, which hurt, and how to combine them.
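The core move, turning an instruction-tuning example into something an RLVR verifier can score, can be sketched in a few lines. The field names and the exact-match reward below are illustrative assumptions, not the paper's actual schema or reward function:

```python
# Hypothetical sketch: reformatting an instruction-tuning example into a
# verifiable QA pair for RLVR. Super-NaturalInstructions has its own
# schema; these names are illustrative only.

def to_verifiable_qa(task_definition, instance_input, gold_output):
    prompt = f"{task_definition}\n\nInput: {instance_input}\nAnswer:"
    return {"question": prompt, "gold": gold_output.strip()}

def verify(model_answer, gold):
    # Exact-match reward; real RLVR verifiers often normalize more
    # aggressively (casing, whitespace, answer extraction).
    return 1.0 if model_answer.strip().lower() == gold.strip().lower() else 0.0

qa = to_verifiable_qa(
    "Decide whether the second event could have caused the first.",
    "Event A: the street is wet. Event B: it rained overnight.",
    "Yes",
)
print(verify("yes", qa["gold"]))  # 1.0
```

The verifier is what makes the data usable for RLVR at all: the reward signal must be computable from the model's answer alone, without a human in the loop.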
The pipeline has three stages. First, task selection: reformat instruction-tuning examples into verifiable QA pairs and filter out examples that are too easy or too hard for the target model. Second, task ranking: run controlled RL experiments on individual candidate tasks, measuring each task's impact on downstream general reasoning performance. Third, task mixing: compare strategies for combining top tasks into a final training set.
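The first stage's difficulty filter can be sketched as a band-pass on baseline solve rates. The thresholds and task names here are illustrative assumptions, not values from the paper:

```python
# Stage 1 sketch: keep tasks whose baseline solve rate sits in a
# "learnable" band -- not already solved, not hopeless. The 0.1/0.9
# thresholds are illustrative assumptions.

def select_tasks(task_solve_rates, lo=0.1, hi=0.9):
    """task_solve_rates: dict mapping task name -> baseline pass rate."""
    return sorted(t for t, r in task_solve_rates.items() if lo <= r <= hi)

rates = {
    "temporal_ordering": 0.45,  # learnable: kept
    "copy_the_input": 0.99,     # too easy: no gradient signal, dropped
    "expert_proofs": 0.01,      # too hard: reward almost never fires, dropped
}
print(select_tasks(rates))  # ['temporal_ordering']
```

The intuition is standard for RLVR: tasks the model always solves or never solves produce near-zero advantage and therefore little learning signal.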
Why this matters
The practical message from the paper is pointed: if you are trying to make a smaller open model reason better, your choice of RL training data may be as important as, or more important than, changes in algorithm or additional compute.
This runs counter to the default playbook in post-training, where teams often focus on scaling synthetic math and code data for RL. SuperNova argues that general reasoning requires general training signal, and that human-curated instruction datasets provide a better starting point than synthetic web-derived QA for that purpose.
For teams building on open models without frontier-scale resources, that reframing is significant. It suggests a concrete path to improving reasoning capabilities that does not require generating massive new synthetic datasets or training at scales only a handful of labs can afford.
How the data curation works
The authors report running over 100 controlled RL experiments, primarily on Qwen3-0.6B with 250 RL steps per experiment, to isolate the effects of different data choices. They validate against BBEH-mini, a 460-example benchmark spanning 23 reasoning subtasks.
The first finding is that task selection matters enormously. According to the paper, individual source tasks range from degrading baseline performance by 9 percentage points to improving it by 39 percentage points on BBEH-mini pass@8. That spread is striking: it means a naive choice of training data can actively make the model worse at general reasoning, while a well-chosen task can produce large gains from the same RL recipe.
The second finding concerns mixing strategy. The paper compares macro mixing (selecting the globally top-ranked tasks) against micro mixing (selecting the best tasks per reasoning subtask). Micro mixing with the top 2 tasks per subtask reached 22.8% pass@8 on BBEH-mini, outperforming the macro approach. The implication is that different reasoning skills benefit from different source data, and a one-size-fits-all task ranking leaves performance on the table.
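The two strategies are easy to contrast in code. The per-subtask deltas below are invented for illustration; the paper ranks real tasks by measured RL outcomes:

```python
# Sketch of macro vs. micro mixing. scores[task][subtask] is the measured
# change on that reasoning subtask after RL on `task` alone (values here
# are made up for illustration).

def macro_mix(scores, k):
    """Globally top-k tasks by average delta across all subtasks."""
    avg = {t: sum(d.values()) / len(d) for t, d in scores.items()}
    return sorted(avg, key=avg.get, reverse=True)[:k]

def micro_mix(scores, k):
    """Union of the top-k tasks for each subtask separately."""
    subtasks = next(iter(scores.values()))
    chosen = set()
    for s in subtasks:
        chosen.update(sorted(scores, key=lambda t: scores[t][s], reverse=True)[:k])
    return sorted(chosen)

scores = {
    "logic_grid":   {"logic": 30.0, "causal": 2.0},
    "cause_effect": {"logic": 1.0,  "causal": 25.0},
    "web_trivia":   {"logic": 5.0,  "causal": 5.0},
}
print(macro_mix(scores, 1))  # ['logic_grid']
print(micro_mix(scores, 1))  # ['cause_effect', 'logic_grid']
```

Note how the macro pick misses the task that is mediocre on average but best for causal reasoning; micro mixing keeps specialists for each skill.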
The third finding is a useful negative result: synthetic interventions designed to increase reasoning difficulty, such as adding long-context dependencies or go-against-prior transformations, did not beat the base curated mixture. The best intervention reached 22.6% pass@8, still below the micro-mixed baseline. This may reflect the specific interventions tested or the fixed compute budget, but it reinforces the paper's central claim that careful selection of existing human-annotated data is hard to beat with synthetic augmentation.
Key framing: SuperNova is a preprint-reported data-curation framework for RLVR, not a new reinforcement learning algorithm.
Key results
The final SuperNova models are Qwen3 at 0.6B, 1.7B, and 4B parameters, trained with GRPO for 5,000 RL steps on the curated mixture. On the held-out BBEH-test set, the paper reports that SuperNova-4B achieves a 29.4% relative gain on pass@1 and a 42.9% relative gain on pass@8 over the baseline.
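For context on what pass@1 and pass@8 measure: a standard way to estimate pass@k from n samples per problem is the unbiased estimator popularized by the HumanEval evaluation. Whether SuperNova uses this exact estimator is an assumption, which is one reason the pass@k methodology deserves independent verification:

```python
from math import comb

# Unbiased pass@k estimator (Chen et al., HumanEval): the probability that
# at least one of k samples drawn from n is correct, given that c of the
# n samples were correct.

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(8, 1, 8))   # 1.0 -- one success out of 8 guarantees pass@8
print(pass_at_k(10, 5, 1))  # 0.5
```

Because pass@8 rewards any one of several attempts succeeding, gains there can reflect increased answer diversity as much as improved single-shot accuracy, which matters when comparing the pass@1 and pass@8 relative gains above.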
The paper also claims SuperNova-4B beats the larger Qwen3-8B by 8.2 percentage points on pass@8 on general reasoning tasks, though that comparison should be treated cautiously until evaluation settings are independently verified. On out-of-distribution benchmarks, the paper reports gains on BBH, Zebralogic, MMLU-Pro, and MATH500. A notable claim: SuperNova-4B outperforms Qwen3-4B by 21 percentage points on Zebralogic.
As a cross-model generalization check, the authors report that training LLaMA 3.2-3B-Instruct on the SuperNova data yields a 15.8 percentage-point improvement over baseline on BBEH-mini, suggesting the curated dataset transfers across model families.
These are preprint-reported numbers. Until evaluation settings, prompts, decoding budgets, and pass@k methodology are independently verified, they should be treated as claims rather than established results.
Practitioner takeaways
Three lessons stand out for teams doing post-training work on open models.
First, treat task selection as a first-class optimization problem. The enormous spread between best and worst source tasks suggests that casual data curation is leaving large gains unrealized, or worse, actively degrading model capabilities.
Second, prefer skill-aware data mixtures over global rankings. The micro-mixing result implies that reasoning is not one skill but many, and that different reasoning subtasks respond to different training signals.
Third, look at existing human-annotated instruction datasets before investing in new synthetic data pipelines. The SuperNova recipe is built almost entirely on data that already existed in Super-NaturalInstructions. The contribution is in how it is selected and formatted, not in generating new content.
Our take
SuperNova makes a clean, useful argument: data curation is an underexplored lever for RLVR, especially for general reasoning. The experimental methodology (controlled single-task ablations, systematic mixing comparisons, and cross-model transfer checks) makes the paper more useful than many high-level data-curation arguments, and the negative result on synthetic interventions adds credibility.
That said, this is a preprint, and the results are heavily benchmark-driven. BBEH is a reasonable proxy for general reasoning, but it is still a proxy. Whether these gains translate to real-world reasoning in agentic or production settings is an open question the paper does not address. The curation experiments are also conducted at small scale, mostly 0.6B parameters and 250 RL steps, so how the recipe's advantages hold as models and compute budgets grow remains to be seen.
The reproducibility picture is also worth watching. Because the gains depend so heavily on specific task selection and formatting choices, the practical value for other teams will depend on how robust the recipe is to different model families, different instruction datasets, and different evaluation benchmarks. The cross-model result with LLaMA is encouraging but limited.
Still, the paper's core claim, that better RL data can matter more than simply adding more RL data, is practical, testable, and worth attention. For practitioners working with smaller open models, SuperNova offers a concrete framework worth experimenting with, and a reminder that the most impactful post-training improvements may come from the data pipeline, not the training loop.
Sources
- Suvarna, A., Phan, K., Beikzadeh, M., Bansal, H., & Gabriel, S. (2026). "SuperNova: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions." arXiv preprint, arXiv:2604.08477. University of California, Los Angeles.
- Code and data: github.com/asuvarna31/supernova