Preprint notice: This post covers arXiv:2604.21950v1, submitted on April 23, 2026. It has not been peer-reviewed. The results are worth attention, but they should still be read as provisional evidence.

What the paper studies

Small language models in the 1 to 3 billion parameter range are now good enough to be interesting and cheap enough to be practical. You can run them locally, deploy them on lighter hardware, and use them in places where a bigger proprietary model would be too expensive or too awkward. The catch is that they still struggle on harder code-generation tasks. That creates a natural design question for builders: if one small model is not enough, can a better pipeline architecture recover the gap?

This paper takes that question seriously and tests it in a way that is narrower, but more useful, than most agent-architecture papers. The authors compare several ways of composing small models for Python code generation, then ask whether a more sophisticated pipeline topology actually matters once you already have a strong feedback signal from code execution.

The setup is tightly scoped. The paper evaluates on HumanEval, which has 164 Python problems, and sanitized MBPP, which has 427. The search pool consists of three general-purpose models: gemma3:1B, qwen2.5:1.5B, and llama3.2:3B. As an external baseline, the authors also test qwen2.5-coder:3B, but importantly, that code-specialized model is not part of the evolutionary search space.

The pipeline family itself is constrained. These are linear pipelines with a generator, executor, optional analyzer, and refiner, plus one to three refinement stages and early stopping when tests pass. The authors run everything locally on a single 128GB M4 Max MacBook Pro. That detail matters because it keeps the result grounded in a realistic small-model workflow rather than a giant cluster search that no ordinary team could reproduce.
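
To make that shape concrete, here is a minimal sketch of the loop as this post describes it, not the paper's actual code. The call_model function stands in for whatever local inference binding you use (an Ollama or llama.cpp client, for example), and the prompts are placeholders.

```python
import subprocess
import sys
import tempfile

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a local inference call (e.g. an Ollama or llama.cpp binding)."""
    raise NotImplementedError

def run_tests(code: str, tests: str, timeout: int = 10):
    """Run candidate code plus its tests in a subprocess; return (passed, stderr trace)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "TimeoutExpired"

def generate_execute_refine(problem: str, tests: str, generator: str,
                            refiner: str, max_refinements: int = 3) -> str:
    """Linear pipeline: generate once, then refine on failure, up to max_refinements."""
    code = call_model(generator, f"Write a Python solution.\n\n{problem}")
    passed, trace = run_tests(code, tests)
    for _ in range(max_refinements):
        if passed:  # early stopping: stop revising once the tests pass
            break
        code = call_model(
            refiner,
            f"This code fails its tests.\nProblem:\n{problem}\n"
            f"Code:\n{code}\nError trace:\n{trace}\nReturn only the corrected code."
        )
        passed, trace = run_tests(code, tests)
    return code
```

Running candidates in a subprocess rather than in-process keeps a crashing or looping solution from taking the pipeline down with it, and the timeout doubles as a cheap guard against infinite loops.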

What the results actually say

The big finding is not subtle. llama3.2:3B running solo scores 76.6±3.7 on HumanEval and 255.8±5.5 on MBPP; given the benchmark sizes, these read as counts of problems solved out of 164 and 427, not pass-rate percentages. Put that same model into a simple self-refinement loop with execution feedback, and the numbers rise to 94.0±2.7 on HumanEval and 286.2±2.9 on MBPP. The paper describes that jump as more than 4σ in its own effect-size shorthand. That should not be read as a formal hypothesis-test claim, but it does capture the scale of the difference.

Just as interesting, cross-model refinement nearly matches same-model self-refinement. A qwen2.5:1.5B generator paired with a llama3.2:3B refiner reaches 93.6±3.0 on HumanEval and 287.0±2.0 on MBPP. In practice, that means the exact topology matters less than many teams might expect once the refinement stage is strong and the execution signal is useful.

The paper's NEAT-inspired search over pipeline variants finds a champion that reaches 98.2±3.4 on HumanEval. That is directionally better than the manual best, but the paper does not present it as a clearly significant leap. The important part is what the search kept rediscovering: a generate, execute, refine loop with early stopping. Not a baroque multi-agent graph. Not some exotic topology twist. A simple loop with a grounded feedback signal.

The code-specialized baseline sharpens the point. qwen2.5-coder:3B with self-refinement reaches 139.6±2.5 on HumanEval and 307.2±4.3 on MBPP, beating every general-purpose pipeline configuration the paper tested. So the result is not that architecture no longer matters. It is that, in this specific 1 to 3B regime, execution feedback matters more than additional topology complexity, and model specialization still matters a lot.

The error analysis makes the mechanism clearer. Refinement helps most with runtime-visible failures like NameError and SyntaxError. It does much less for logic errors that only show up as failed assertions. That is an important reality check. The system is not suddenly reasoning its way through deep correctness issues. It is mostly getting better at using an interpreter and traceback as a repair signal.
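
One way to see why the split happens is that the last line of a Python traceback names the exception class. A NameError or SyntaxError hands the refiner a concrete, local target; an AssertionError from a failed test only says the output was wrong. A rough triage sketch, with categories that are illustrative rather than the paper's taxonomy:

```python
# Classify a captured stderr trace by how actionable it is for a refiner.
RUNTIME_VISIBLE = ("NameError", "SyntaxError", "TypeError", "IndexError", "AttributeError")

def classify_failure(stderr: str) -> str:
    last = stderr.strip().splitlines()[-1] if stderr.strip() else ""
    if any(last.startswith(err) for err in RUNTIME_VISIBLE):
        return "runtime_error"   # names the offending symbol or line; refinement helps most here
    if last.startswith("AssertionError"):
        return "logic_error"     # tests failed, but the trace says little about why
    return "other"
```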

There is also one methodological warning that deserves to travel with the result. The paper reports that single-run fitness inflated results by 5 to 7 percent during evolutionary search. In plain English, if you score a pipeline once and trust the outcome, you are likely to promote lucky genomes rather than genuinely better ones. That is a useful lesson even outside this paper.
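
The obvious mitigation is to score each candidate pipeline over several independent runs and select on the aggregate rather than a single pass. A minimal sketch, assuming a hypothetical score_pipeline function that returns the number of problems solved in one run:

```python
from statistics import mean, stdev

def robust_fitness(score_pipeline, pipeline, problems, n_runs: int = 3):
    """Average fitness over several independent runs so selection cannot
    promote a genome on one lucky pass. `score_pipeline(pipeline, problems)`
    is a placeholder API, assumed to return problems solved in a single run.
    """
    scores = [score_pipeline(pipeline, problems) for _ in range(n_runs)]
    return mean(scores), (stdev(scores) if n_runs > 1 else 0.0)
```

Three runs is an arbitrary budget here; the point is only that selection should never see a single sample.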

What builders should do with this

The practical takeaway is straightforward. If you are building a small-model code pipeline and you do not yet have a clean execution-feedback loop, stop worrying about more elaborate orchestration and add that first. Running the generated code, capturing the failure trace, and giving the model one or more bounded repair passes is where the paper found the real gain.
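
The one part of that loop worth a note of care is how you hand the failure back. Raw stderr can be long; a common and cheap choice, not something the paper prescribes, is to keep just the tail of the traceback so the refiner sees the error type and the offending line:

```python
def repair_context(stderr: str, max_lines: int = 15) -> str:
    """Trim a captured traceback to its tail before putting it in the repair prompt.
    The cutoff is an arbitrary illustrative choice.
    """
    lines = stderr.strip().splitlines()
    return "\n".join(lines[-max_lines:])
```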

This is good news because it is cheaper than the alternative. A refinement loop with execution feedback is easier to build, easier to debug, and easier to explain than an architecture full of specialized agent roles. It also maps cleanly onto how code actually fails. Small models do not always need a second planner or a meta-controller. Often they just need to see that they used an undefined variable or returned the wrong shape.

The paper also suggests where to spend effort inside the loop. Within the tested model pool, refiner quality mattered more than generator identity. That means if you are doing cost-aware routing, you can plausibly use a smaller or cheaper model to draft an answer and reserve the stronger model for repair. The cross-model result does not prove that this will always be cost-optimal, but it makes the strategy credible enough to test.
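
If you want to test that routing idea, the configuration itself is trivial to express. The pairings below use the post's model names, and the scores in the comments are the paper's reported HumanEval counts, for orientation only; this snippet does not measure anything.

```python
# Candidate generator/refiner pairings to A/B test in your own harness.
ROUTING_CANDIDATES = [
    {"generator": "llama3.2:3B",      "refiner": "llama3.2:3B"},       # same-model:  94.0 ± 2.7
    {"generator": "qwen2.5:1.5B",     "refiner": "llama3.2:3B"},       # cross-model: 93.6 ± 3.0
    {"generator": "qwen2.5-coder:3B", "refiner": "qwen2.5-coder:3B"},  # specialized: 139.6 ± 2.5
]
```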

Early stopping is another non-optional detail. The authors report that without it, every additional iteration was net-negative. That lines up with what many practitioners have already seen in the wild. If a model keeps revising after the problem is solved, it often talks itself into a worse answer. The lesson here is not just to add loops. It is to add loops with a clear stop condition.

There is also a more strategic point hiding underneath the benchmarks. A lot of agent design energy goes into role assignment, planner hierarchies, and topology diagrams. This paper is a reminder that those abstractions are often downstream of a simpler question: what reliable signal does the model get after it acts? If the answer is "none," or "just more language," then topology may not save you. If the answer is "a grounded pass-fail trace with concrete error context," then even a boring loop can carry surprising weight.

Limits and caveats

The scope here is constrained in ways that matter. These are Python benchmarks with execution oracles. That makes the setting unusually friendly to refinement because the model receives a direct, machine-readable signal when it fails. You should not assume the same gain transfers cleanly to open-ended code tasks, multi-file repository work, or domains where there is no executable correctness check.

The model scale matters too. Everything here lives in the 1 to 3B regime. The paper does not establish that the same tradeoff holds at 7B, 30B, or frontier scale. It also does not test graph-shaped agent topologies, long-horizon task loops, or richer tool environments. So if someone tries to generalize this into a universal claim that topology never matters, they are saying more than the paper does.

The benchmarks are also familiar and limited. HumanEval and MBPP are useful, but they are not the same thing as production code generation inside real systems with dependencies, hidden tests, style constraints, and messy runtime environments. The result is still valuable. It is just narrower than the headline version many people will want to repeat.

And, again, this is a preprint. The paper is interesting because it reports a negative result honestly: more topology did not buy much once execution feedback was in place. That honesty is part of why it is worth reading. But it is still one paper, on one setup, and it should be treated that way.

Verdict

Bottom line: in this paper's constrained 1 to 3B Python code-generation setting, execution feedback is the first lever and topology is the second. The result does not say pipeline design is irrelevant. It says that if your small-model coding stack lacks a strong execution-and-repair loop, you are probably optimizing the wrong layer first.

That is a healthy result for the field. It pushes against the instinct to solve every model weakness with a more elaborate agent diagram. Sometimes the better answer is simpler: let the model try, let the environment answer back, and make the repair loop reliable before you make the architecture fancy.

For teams working with local code models, that is not just an academic insight. It is a practical sequencing rule. Build the feedback channel first. Measure what it fixes. Then decide whether the extra topology is still earning its keep.

Sources

arXiv:2604.21950v1 (preprint, submitted April 23, 2026).