Simple self-distillation improves code generation

https://arxiv.org/abs/2604.01193

Embarrassingly Simple Self-Distillation Improves Code Generation

Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful diversity where exploration matters. Taken together, SSD offers a complementary post-training direction for improving LLM code generation.

Really fascinating how this works; it's basically context-aware decoding. From the paper:

> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.

In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).

What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
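To make the fork/lock tension concrete, here is a rough Python sketch. It is not from the paper: the two next-token distributions are made-up numbers, and `apply_temperature` / `top_p_truncate` are just the usual textbook temperature and nucleus (top-p) operations, assumed here as stand-ins for the "temperature and truncation configurations" the abstract mentions. It only illustrates why one global setting has to compromise between the two kinds of position.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]

def top_p_truncate(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def entropy(probs):
    """Shannon entropy in bits; zero-probability tokens contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions (illustrative numbers only).
lock = [0.90, 0.04, 0.03, 0.02, 0.01]  # one right answer plus a distractor tail
fork = [0.30, 0.28, 0.22, 0.15, 0.05]  # several genuinely plausible continuations

for temperature in (0.5, 1.0, 1.5):
    for name, dist in (("lock", lock), ("fork", fork)):
        shaped = top_p_truncate(apply_temperature(dist, temperature), top_p=0.95)
        print(f"T={temperature} {name}: top={max(shaped):.2f} "
              f"entropy={entropy(shaped):.2f} bits")
```

Lowering the temperature trims the distractor tail at the lock position but also flattens out the choice at the fork; raising it does the reverse. That is the compromise the quoted passage is describing.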

I love that we're still learning the emergent properties of LLMs!

Seems like this is true not just for code but for all generated content? For code it's more well-defined, but the fork/lock mechanism applies to a lot more problem domains.

That would seem intuitively true; it certainly applies to written language, where a clause could go off in another direction, but at other positions the correct grammar/syntax is unambiguous.

Thinking out loud: if we think of a lock as happening in a narrative, then there can be points where "everything you know is wrong", which essentially lets you go back into a sort of fork mode and work towards another lock.

Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that locking can be more diffuse, not as settled.

I think this is similar to what Anthropic has been doing for the last few Opus releases, which is interleaved thinking: CoT reasoning in the middle of a message. But the two operate at different layers.

> I love that we're still learning the emergent properties of LLMs!

TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though a researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:

> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)

So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.

(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)

Another example of the mindf@#$ these systems are: I was fine-tuning a small model to take data fields and make a sentence out of them. I was running into mode collapse (basically when the AI simplifies too much and always outputs the same thing).

I got unstuck by randomizing the field order for each row at training time?!? Now I'm thinking I should do the same at inference time...
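For anyone wanting to try the same trick, a minimal sketch of per-row field shuffling might look like this. The record, the `key: value` serialization, and the `row_to_prompt` helper are all hypothetical; the comment above doesn't say how the fields were actually formatted.

```python
import random

def row_to_prompt(row, rng):
    """Serialize a record with its fields in a freshly shuffled order."""
    items = list(row.items())
    rng.shuffle(items)
    return "; ".join(f"{key}: {value}" for key, value in items)

rng = random.Random(0)  # seeded only so the sketch is reproducible
row = {"name": "Ada", "city": "London", "role": "engineer"}  # made-up record

# Each call can produce a different field order for the same row.
for _ in range(3):
    print(row_to_prompt(row, rng))
```

The idea, presumably, is that the model can no longer lean on a fixed position for each field and has to read the field names instead.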

wow that's fascinating

the irony of modern software engineering: we spent decades perfecting deterministic algorithms, and now we're basically just shaking a black box and hoping the magic rocks align.
It's a little disturbing, but also very fun to just discover by probing, building and breaking.

apparently you can straight up duplicate/add/rearrange layers without changing any of the weights and get better results as well - https://dnhkng.github.io/posts/rys/
LLM Neuroanatomy: How I Topped the LLM Leaderboard Without Changing a Single Weight

David Noel Ng

This is crazy, thank you for the link!

I've always thought that it is kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.

I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.

I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.

[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

[1] https://developers.redhat.com/articles/2025/06/03/structured...
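The "skip the model when the grammar allows only one token" idea above can be sketched in a few lines. This is a toy, under loud assumptions: the grammar is a hand-written state machine (not llama.cpp's GBNF format), and `pick_token` is a hypothetical stand-in for a real model call with masked sampling.

```python
# Toy grammar as a state machine: state -> {allowed token: next state}.
# Hypothetical stand-in for a real grammar engine.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"name"': "colon", '"id"': "colon"},
    "colon": {":": "value"},
    "value": {"1": "close", "2": "close", "3": "close"},
    "close": {"}": "end"},
}

def constrained_decode(pick_token):
    """Generate one token sequence, calling pick_token only at real choices."""
    state, tokens, model_calls = "start", [], 0
    while state != "end":
        allowed = GRAMMAR[state]
        if len(allowed) == 1:
            # Forced by the grammar: insert the token without a model call.
            token = next(iter(allowed))
        else:
            # Stand-in for running the LLM and sampling among allowed tokens.
            token = pick_token(tokens, sorted(allowed))
            model_calls += 1
        tokens.append(token)
        state = allowed[token]
    return tokens, model_calls

def first_choice(prefix, candidates):
    """Hypothetical 'model' that always picks the first allowed token."""
    return candidates[0]

tokens, calls = constrained_decode(first_choice)
print("".join(tokens), f"({calls} model calls for {len(tokens)} tokens)")
```

In this toy the forced tokens are the "lock" positions in the most extreme sense. Real grammars over real tokenizers rarely narrow things down to exactly one token, which matches the point that this only helps in edge cases.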

Give coding agents access to intellisense and syntax highlighting.

Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.

In general these agents support LSPs, which is often as much information as your IDE will give you. They are also not required to output syntactically correct code token by token when running agentically, because the loop is:

1. code

2. syntax check / build / format / lint (details language dependent)

3. test

and they can hop between 1 and 2 however many times they want.

> Give coding agents access to intellisense and syntax highlighting.

i once asked an LLM if it could ingest code from an interactive session more easily if it were in appropriately-typed markdown fences and it said absolutely yes, and that the syntax highlighting fed to it that way helps it immensely. i was downright shocked that syntax highlighting was anything more than noise for them.

Yeah, I was also thinking about it A LOT.

We kinda have a little bit of it with some coding harnesses giving the model access to an LSP, but I think that we can insert this knowledge at a lower level if we find a clever way to utilize it during sampling.

I think that there is a lot of low hanging fruit in this area.

And in general, I think that people try too often to use LLMs for problems that can be easily solved by computationally cheaper and, more importantly, deterministic tools.

For example, back in the day when LLM-assisted coding had just become a thing, people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.

Well, I, an experienced human programmer, probably would also be making syntax mistakes and inventing non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.

Thankfully, my IDE would autocomplete real syntax and actually existing library methods for me and immediately give me feedback if I make a mistake anyway. And all of it is achieved using reliable deterministic code without the inherent issues of statistical models.

I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.

Doing a tool call for autocomplete is not going to make coding agents faster.

I do think there is some merit in a tool that dumps all namespaces and reachable symbols so the agent can do its own autocomplete without a round-trip.
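As a rough idea of what such a dump could look like for Python, here is a sketch built on the standard `importlib` and `inspect` modules. The `dump_symbols` helper and its one-line-per-symbol output format are assumptions for illustration, not something an existing agent harness is known to ship.

```python
import importlib
import inspect

def dump_symbols(module_name):
    """List public callables in a module, with signatures where available."""
    module = importlib.import_module(module_name)
    lines = []
    for name, obj in sorted(vars(module).items()):
        if name.startswith("_") or not callable(obj):
            continue
        try:
            signature = str(inspect.signature(obj))
        except (TypeError, ValueError):
            signature = "(...)"  # some builtins expose no signature
        lines.append(f"{module_name}.{name}{signature}")
    return lines

# Dump one standard-library module as an example.
for line in dump_symbols("json"):
    print(line)
```

Pasting output like this into the context up front would let the agent check that a method exists without a tool call per lookup, at the cost of some extra tokens.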

Could we not get the same with EAFT? Maybe that's what it's doing, but it's definitely not the first to think "let's lock in high probability solutions".

In Nemotron the high-perplexity solutions are selected for RL, in VLM training a few people are looking at the entropy distributions of the training set, etc.