Most teams building reasoning models follow the same recipe: supervised fine-tuning first, reinforcement learning second. The assumption is that SFT gives the model a solid foundation, and RL sharpens it. A new preprint from researchers at Zhejiang University challenges that assumption directly — arguing that standard SFT can reduce downstream RL compatibility and generalization in this setting.
The paper, "GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification," proposes a method that rethinks the supervised stage of post-training. If its claims hold up under broader evaluation, GFT could change how teams structure the gap between imitation learning and reinforcement learning.
Paper status: this is a preprint, not peer-reviewed research. The results are promising but should be treated as preliminary evidence, especially outside the math-reasoning setup tested in the paper.
What SFT actually does to your policy
The core insight of the paper is a reframing. The authors show that standard SFT can be interpreted as a degenerate special case of policy gradient optimization — one where the reward function is a sparse binary indicator (1 if the output exactly matches the reference, 0 otherwise) and the importance weight is the raw inverse probability 1/πθ(y|x).
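Under this reading, the SFT gradient drops out of the generic policy gradient estimator once you plug in those two choices. A sketch of the equivalence, using the quantities named above:

```latex
% SFT as a degenerate policy gradient:
% reward R(y) = 1[y = y*]  (exact match with the reference)
% importance weight = 1 / \pi_\theta(y \mid x)
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[
      \frac{\mathbb{1}[y = y^{*}]}{\pi_\theta(y \mid x)}
      \,\nabla_\theta \log \pi_\theta(y \mid x)
    \right]
  = \nabla_\theta \log \pi_\theta(y^{*} \mid x)
```

The right-hand side is exactly the (negated) cross-entropy gradient on the single reference y*, which is why the indicator reward and the unbounded 1/πθ weight are baked into ordinary SFT whether you notice them or not.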
This is not just a theoretical curiosity. In the authors' analysis, it exposes two failure modes.
Single-path dependency. SFT trains the model to reproduce one specific trajectory per input. The policy entropy collapses around that single path, which means the model loses the ability to explore alternative reasoning chains. When RL kicks in later and needs a diverse policy to learn from, the model has already narrowed its search space.
Gradient explosion. The inverse-probability weighting in SFT is unbounded. When the model assigns low probability to a token in the reference sequence — which happens frequently early in training or on difficult examples — the gradient can spike dramatically. This creates unstable optimization dynamics that compound as training progresses.
Taken together, the authors argue, these problems produce a narrower policy that overfits to the demonstrated paths and transfers less effectively into downstream RL.
How GFT works
GFT replaces standard SFT with two mechanisms designed to fix these failure modes while remaining compatible with downstream RL.
Group Advantage Learning (GAL)
Instead of training on one reference answer per query, GFT constructs a group of K responses for each input. In the paper's primary setup, K = 8, with the group composed of expert demonstrations, teacher-distilled responses, and the model's own sampled outputs.
The method then computes a normalized advantage for each response within the group — essentially scoring how good each trajectory is relative to the others in that batch. This is analogous to what algorithms like GRPO do during RL, but applied at the supervised stage.
The practical effect is a shift in training signal: instead of saying "copy this exact answer," it says "this answer is better than those other answers, and here's by how much." The model learns relative quality rather than rote sequences, which preserves policy diversity and makes the transition to RL smoother.
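As a rough illustration (our own sketch, not the paper's implementation; the function name and epsilon term are ours), group-relative advantages can be computed by standardizing rewards within each group of K responses:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Score each response relative to its group (GRPO-style normalization).

    rewards: one scalar reward per response in a group of K responses.
    Returns zero-mean advantages; positive means better than the group average.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: two correct and two incorrect responses in a group of K = 4
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses above the group mean receive positive advantages and get pushed up; the rest get pushed down. That relative signal is what replaces the "copy this exact answer" objective.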
Dynamic Coefficient Rectification (DCR)
DCR addresses the gradient explosion problem. It introduces a threshold τ that clips the inverse-probability weight for low-probability tokens. When the model assigns a token probability below τ, the gradient contribution gets bounded rather than allowed to blow up.
The authors report that τ ≈ 0.7 provides the best tradeoff between training stability and learning efficiency. Below that, gradients become noisy; above it, the clipping is too aggressive and slows convergence.
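A minimal sketch of the clipping idea, under our reading of the paper (the exact rectification rule may differ; `rectified_coefficient` is a name we chose):

```python
def rectified_coefficient(p, tau=0.7):
    """Bound the inverse-probability weight 1/p for low-probability tokens.

    For p >= tau the weight is the usual 1/p; for p < tau it is capped
    at 1/tau, so rare reference tokens cannot blow up the gradient.
    """
    return 1.0 / max(p, tau)

high_p = rectified_coefficient(0.9)    # ordinary token: weight is 1/0.9
low_p = rectified_coefficient(0.001)   # rare token: capped at 1/0.7, not 1000
```

The cap turns the worst-case gradient contribution from unbounded to a constant, which is the stability property standard SFT lacks.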
The combination means GFT's gradient update looks like a policy gradient step with stabilized coefficients and group-relative advantages — structurally much closer to what RL algorithms expect than standard cross-entropy SFT.
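Schematically, the combined per-token objective has the following shape. This is our paraphrase of the structure described above, not the paper's exact loss:

```python
def gft_token_loss(logp, p, advantage, tau=0.7):
    """One token's contribution to a GFT-style loss (schematic).

    logp:      log-probability the policy assigns to this token.
    p:         the corresponding probability, exp(logp).
    advantage: group-relative advantage of the trajectory this token belongs to.

    The coefficient is the DCR-clipped inverse probability; the advantage
    replaces SFT's implicit all-or-nothing exact-match reward.
    """
    coef = 1.0 / max(p, tau)          # DCR: bounded importance weight
    return -coef * advantage * logp   # negative because we minimize the loss
```

Setting the advantage to a constant 1 and removing the clip recovers the degenerate SFT case from the paper's analysis, which is what makes this a structural bridge toward RL-style updates.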
Key results
The experiments focus on math reasoning benchmarks using models ranging from 1.5B to 8B parameters, trained on the NuminaMath CoT dataset. The results are worth examining in detail, with the caveat that this is preprint work on a specific task domain.
Coverage metrics tell the clearest story. On Pass@128 averaged across SAT Math, Minerva, and TabMWP, GFT reaches 62.16 compared to 56.32 for distillation and 49.87 for GRPO, starting from a base of 24.52. At Pass@256, the pattern holds: GFT at 61.91 versus distillation at 56.11 and GRPO at 49.16. These coverage metrics matter because they measure the model's ability to find correct solutions across many attempts — a proxy for policy diversity and reasoning breadth.
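For reference, Pass@K is usually computed with the unbiased estimator from the code-generation literature; we assume the paper follows that convention. Given n samples per problem, c of them correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k samples
    drawn without replacement from n attempts is correct, given c correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 1 correct answer out of 4 attempts, evaluated at k = 2
p = pass_at_k(4, 1, 2)
```

At large k the metric rewards any policy that keeps correct solutions reachable somewhere in its distribution, which is why it serves as a diversity proxy here.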
Data efficiency is notable. GFT and GRPO both use a 10k-example subset with 8 trajectories per query (80k total trajectories), while single-trajectory baselines like standard SFT use 100k individual samples. GFT outperforms these baselines despite seeing fewer unique problems, suggesting the grouped training signal carries more information per example.
Catastrophic forgetting is where the argument gets sharpest. On LLaMA-3.2-3B-Instruct, standard SFT degrades MAWPS by 4.09 points, SVAMP by 7.63 points, and MMLU-STEM by 5.98 points. GRPO is gentler but still drops performance on two of three benchmarks. GFT shows minimal degradation: -0.27 on MAWPS, -1.71 on SVAMP, and actually gains 2.86 points on MMLU-STEM. If these numbers replicate, the forgetting reduction alone could justify the approach for teams that need to preserve general capabilities while adding domain-specific reasoning.
Ablation highlights
The paper includes ablation studies on group composition. In one reported setup, the best-performing ratio is 2 expert demonstrations to 6 self-generated samples, suggesting that a small amount of gold-standard data mixed with the model's own outputs can outperform heavier reliance on either source alone in this math-reasoning setup.
What the paper doesn't show
Several important caveats deserve attention.
Scale is limited. All experiments use models up to 8B parameters. The authors explicitly acknowledge that validation on 70B+ models is future work. Whether GFT's advantages persist, grow, or vanish at larger scales is an open question — and for many production teams, the answer at 70B matters more than the answer at 7B.
The evaluation domain is narrow. Math reasoning with objective correctness criteria is a best-case scenario for this kind of approach. The benchmarks have clear right-and-wrong answers, making advantage computation straightforward. How GFT behaves on open-ended instruction following, creative tasks, or multi-turn dialogue — where "better" is fuzzier — remains untested.
This is a preprint. The work has not been peer-reviewed, so the results remain preliminary evidence rather than established fact.
Group construction adds overhead. Building response groups requires generating multiple completions per query and potentially running teacher models for distillation. The paper describes this as "marginal," but the actual cost depends heavily on your infrastructure and the models involved.
Practitioner takeaways
For teams studying reasoning-model post-training, the paper suggests several hypotheses worth testing.
- Audit your SFT stage for entropy collapse. Even if you do not adopt GFT, the paper's analysis of SFT failure modes is worth internalizing. If your SFT stage is producing policies that RL struggles to improve, entropy collapse and gradient instability are plausible culprits.
- Consider grouped training data even without the full method. The principle of training on response groups with relative advantages rather than single reference trajectories is applicable beyond GFT's specific implementation.
- Use the reported group composition as a starting point. In one Qwen2.5-Math-1.5B ablation, a 2:6 demo-to-self-sample ratio performed best. If you experiment with grouped training, that result is a reasonable point of comparison, at least in the math domain studied here.
- Watch the handoff to RL. GFT's most interesting claim is not raw benchmark performance, but that the resulting policy is more RL-compatible. The real test is whether your downstream RL stage converges faster, explores better, and produces stronger final models.
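On the first point, entropy collapse is cheap to monitor. A minimal sketch, assuming you can extract per-position next-token distributions from your checkpoints (the helper below is ours, not from the paper):

```python
from math import log

def mean_token_entropy(distributions):
    """Average Shannon entropy (in nats) over next-token distributions.

    distributions: per-position probability vectors sampled from the model
    on held-out prompts. A steady slide toward zero across SFT checkpoints
    suggests the policy is collapsing onto single trajectories.
    """
    def entropy(probs):
        return -sum(p * log(p) for p in probs if p > 0)
    return sum(entropy(d) for d in distributions) / len(distributions)

uniform = mean_token_entropy([[0.25, 0.25, 0.25, 0.25]])  # high entropy
peaked = mean_token_entropy([[0.97, 0.01, 0.01, 0.01]])   # near-collapsed
```

Tracking this curve across checkpoints, alongside gradient norms, gives you a direct read on both failure modes the paper identifies.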
Our take
The post-training pipeline for reasoning models has largely calcified around SFT-then-RL, and most optimization effort goes into the RL stage. This paper makes a compelling case that we should look harder at what SFT is actually doing to the policy before RL ever touches it.
The theoretical reframing is the strongest contribution. Viewing SFT as broken policy gradient — with sparse reward and unbounded weights — gives practitioners a concrete vocabulary for diagnosing problems they may already be seeing in practice. The fix, group advantages plus gradient clipping, is mechanically simple, which is a point in its favor.
We are cautious about the benchmark-specific results until they are validated at larger scale and on broader task distributions. But the direction is right: the boundary between supervised and reinforcement learning in post-training is more permeable than current recipes assume, and methods that treat it as a spectrum rather than a hard cutoff are likely to keep gaining traction. GFT is one credible proposal for what that spectrum looks like.