I'm fundamentally a tool builder, and LLM coding agents work one million times better if you give them good tools. I wrote a thing about this:

https://john.regehr.org/writing/zero_dof_programming.html


@regehr If we build tools that actually give us zero degrees of freedom, surely there are more efficient and reliable ways to use them than LLMs?

Given that, as you note, zero DOF is only aspirational, I would love to see more work along the lines of the Termite project for synthesizing device drivers. Version 1 took the provided constraints on the behavior of both the device and the OS, did a bunch of computation, and tried to spit out C source without human intervention. Termite 2 took the same inputs, but gave developers an IDE that would auto-complete large chunks when there were no valid alternatives, then prompt the programmer for the few decisions that were left. I think there are lessons I'd like to see more people learn there.

@jamey "If we build tools that actually give us zero degrees of freedom, surely there are more efficient and reliable ways to use them than LLMs?"

the distinction is between recognizing a good solution and creating a good solution. the former is much easier!
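(a toy illustration of that asymmetry, using subset-sum as a stand-in problem: checking a proposed certificate is linear, while finding one by brute force is exponential in the worst case)

```python
from itertools import combinations

def check(nums, target, subset):
    # Recognizing: verify a proposed solution in linear time.
    return all(x in nums for x in subset) and sum(subset) == target

def solve(nums, target):
    # Creating: brute-force search over all subsets, exponential in len(nums).
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 9, 8, 4, 5, 7]
solution = solve(nums, 15)
print(solution, check(nums, 15, solution))
```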

@regehr sure, I understand that distinction. like verifying an NP-hard solution versus generating one, though I know you can tell me all about how these program verification tasks are themselves often NP-hard. I just think it would be a shame to drop the existing research on program synthesis in favor of something that generates vaguely guided random text in a long feedback loop. I mean, if we really reach zero DOF, I think an existing coverage-guided fuzzer ought to give better results faster than an LLM.
@jamey I think we all hope to avoid relying on these big corporate plagiarism machines that are out of our control, so I want you to be right!
@regehr I was trying to avoid phrasing it that way, but yes, very much that 😂
@jamey @regehr I share your concern in this respect, but I don't think fuzzers and LLM-assisted search can be made equivalent. At the margin, many existing codebases are already heavily fuzzed and the surfaced issues fixed. What remains are "weird" issues that violate intended invariants in ways that don't generate a fuzzer-visible crash. Use an LLM to add instrumentation, so the fuzzer can fuzz, and the combination finds new bugs.
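(a contrived sketch of what I mean, with hypothetical function names: the buggy transfer below violates a conservation invariant without crashing, so a fuzzer sees nothing; an added assertion of the intended invariant turns the violation into a crash the fuzzer can surface)

```python
# Hypothetical buggy code: the invariant violation is silent -- no
# crash, so coverage-guided fuzzing alone never flags it.
def transfer(accounts, src, dst, amount):
    if accounts.get(src, 0) >= amount:
        accounts[src] -= amount
    # Bug: dst is credited even when the debit above didn't happen.
    accounts[dst] = accounts.get(dst, 0) + amount

# Instrumentation of the kind an LLM could be asked to add: assert
# the intended invariant (total money is conserved), so violating it
# becomes a fuzzer-visible crash.
def transfer_instrumented(accounts, src, dst, amount):
    before = sum(accounts.values())
    transfer(accounts, src, dst, amount)
    assert sum(accounts.values()) == before, "conservation violated"
```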
@jamey @regehr I suspect it's possible to construct ethically trained models capable of doing this work, though no such thing exists today.
@mirth @regehr I meant a different thing. If you have a hypothetical tool that tells you whether your source code meets your requirements, run a fuzzer on that tool: it may eventually find an input program that satisfies your requirements. This is not that different from using an LLM in a feedback loop with such a tool, except a coverage-guided fuzzer can see inside the tool to make potentially better guesses about what result you want. I claim random fuzzing should be "smarter" than an LLM for that purpose.
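(to make the loop concrete, a minimal sketch: `meets_spec` stands in for the hypothetical requirements checker, and a blind mutator plays the fuzzer; a real coverage-guided fuzzer would additionally use feedback from inside the checker to bias its guesses)

```python
import random

def meets_spec(expr):
    # Stand-in requirements checker: the "spec" is that the candidate
    # program computes x * 2 on every tested input.
    try:
        f = eval(f"lambda x: {expr}")
        return all(f(x) == x * 2 for x in range(-10, 10))
    except Exception:
        return False

ATOMS = ["x", "1", "2"]
OPS = ["+", "-", "*"]

def mutate(expr):
    # Blind mutation: either wrap the candidate or replace it outright.
    a, b, op = random.choice(ATOMS), random.choice(ATOMS), random.choice(OPS)
    if random.random() < 0.5:
        return f"({expr}) {op} {a}"
    return f"{a} {op} {b}"

random.seed(0)
candidate = "x"
for _ in range(10_000):
    if meets_spec(candidate):
        break
    candidate = mutate(candidate)
print(candidate)  # some expression equivalent to x * 2
```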

@jamey @regehr Oh I see. Do you follow "classical" AI? The combinatorics of program synthesis are (I think) even harder than, say, chess or Go, but maybe we'll see some of the same patterns (heuristic search), at least until some possible future where quantum computers reduce the need to prune the tree.

On a different tack, random program search guided by spec coverage has a similar "mouth feel" of layer-hopping cleverness to the RPython JIT-from-interpreter extraction. Hmm.

@mirth @regehr I do have some experience with combinatorial search, mostly when I was a student a couple decades ago, and I love it! In a way I think it's kind of disappointing that SMT solvers are so good, because they make it not worth learning search techniques for most problems unless you get extra deep into a problem, and that takes away some of the joy of studying search.

@mirth @jamey so here's our paper (the one referenced in the post) using randomized synthesis. this is as good as we can do, so far. we might be able to do better with more work, but I don't know. but regardless, the difference between the LLM and randomized synthesis is night and day, it's not even close, and I strongly doubt we can close this gap.

https://users.cs.utah.edu/~regehr/papers/popl26.pdf

@mirth @jamey @regehr Also, crashes that are hard to find by coverage-guided fuzzing (e.g. CVE-2023-4863), but easy to detect when they happen -- as crashes are.
@robryk @regehr @jamey Yes, some of those I think LLMs can find by inspection without much tooling. There's also the automation of hint extraction from existing CVEs to guide where to look (like "read all the CVEs for image loading libraries and summarize the common exploitable bugs, then build harnesses for each"). I'm no professional, but in just a little experimentation it proved far easier to find bug candidates than is healthy for the world.