I'm fundamentally a tool builder, and LLM coding agents work a million times better if you give them good tools. I wrote a thing about this:

https://john.regehr.org/writing/zero_dof_programming.html


I do enjoy using these things in ways that don't plagiarize and don't inflict slop on people. for example, when debugging an LLVM pass, this kind of thing can actually work. like, this actually found a miscompilation bug just now, and may find more if I ask it to keep going. it's an absolutely fresh LLVM pass, there is no chance that the LLM was trained on anything it's seeing here (much less on bug fixes to it).
(the just-found miscompile wasn't caught by Csmith in ~8 hours of testing, either)
@regehr Hehe, your detailed sanity-guards described in plain English made me chuckle :)
@regehr what about the plagiarization, at unimaginable scale, that went into the training? "I'm *fundamentally* a tool builder, so I disregard ethics"?
@bikubi I also drive a car and fly on airplanes, which I know to be bad. do you make any compromises, or is everything you do purely good?
@bikubi oh I also eat meat, though quite seldom
@regehr great defense, i do bad sometimes, so i do every new bad that comes up. let's give that new right wing party a fair shake, i heard it's convenient and exciting
@bikubi thanks for your input, internet rando!
@bikubi I appreciate your part in making mastodon feel like twitter used to feel ❤️

@regehr Part of it is looking at patterns from previous passes and at patterns for creating LLVM IR, and then just matching the two up.

So it is plagiarizing but in a more interesting way.

Plus running a secondary program to do the acceptance criteria. Alive2 is the anti-slop part here really.

I have done something similar with some passes before where I was able to match a known pattern inside the pass and then create a testcase out of that.

Basically what LLMs are doing in this case is limiting what kinds of test cases to produce. It uses pattern matching on what the code could do to limit its search. This is why it found something that Csmith could not in a reasonable time frame.
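That generate-and-check loop can be sketched roughly as below. This is a toy sketch, not anyone's actual setup: `propose_testcases` and `oracle` are hypothetical stand-ins for the LLM and for an acceptance checker like Alive2, which this sketch does not actually invoke.

```python
from typing import Callable, Iterable

def search_for_miscompiles(
    propose_testcases: Callable[[], Iterable[str]],
    oracle: Callable[[str], bool],
    budget: int = 100,
) -> list[str]:
    """Generate-and-check loop: a proposer (e.g. an LLM prompted with the
    pass under test) emits candidate test cases, and an acceptance oracle
    (e.g. Alive2 comparing IR before and after the pass) rejects any
    candidate the pass miscompiles. Returns the rejected candidates."""
    flagged = []
    for i, candidate in enumerate(propose_testcases()):
        if i >= budget:
            break
        if not oracle(candidate):  # oracle says the transformation is wrong
            flagged.append(candidate)
    return flagged

# Stub stand-ins so the sketch runs without an LLM or Alive2 installed:
def fake_proposer():
    yield from ["add i32 %x, 0", "mul i32 %x, 2", "sub i32 %x, %x"]

def fake_oracle(ir: str) -> bool:
    return "mul" not in ir  # pretend the pass mishandles multiplies

print(search_for_miscompiles(fake_proposer, fake_oracle))
# → ['mul i32 %x, 2']
```

The point of the separation is that the proposer can be arbitrarily unreliable; only candidates that survive the oracle count for anything.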

Now this might be one of the only uses of LLMs that can prove genuinely useful, BUT only because it is not about creating something that will ever be used outside of a test environment. It might actually reduce the amount of energy used overall.

BUT it is NOT even close to any use case that is being pushed by anyone.

Note that in this use case the plagiarization would, in my mind, be fair use. Basically, testing any and all programs on the compiler is fair use.

@pinskia I think I agree with all this. still trying to figure it out.

@regehr If we build tools that actually give us zero degrees of freedom, surely there are more efficient and reliable ways to use them than LLMs?

Given that, as you note, zero DOF is only aspirational, I would love to see more work along the lines of the Termite project for synthesizing device drivers. Version 1 took the provided constraints on the behavior of both the device and the OS, did a bunch of computation, and tried to spit out C source without human intervention. Termite 2 took the same inputs, but gave developers an IDE that would auto-complete large chunks when there were no valid alternatives, then prompt the programmer for the few decisions that were left. I think there are lessons I'd like to see more people learn there.

@jamey "If we build tools that actually give us zero degrees of freedom, surely there are more efficient and reliable ways to use them than LLMs?"

the distinction is between recognizing a good solution and creating a good solution. the former is much easier!

@regehr sure, I understand that distinction. like verifying an NP-Hard solution versus generating one, though I know you can tell me all about how these program verification tasks are themselves often NP-Hard. I just think it would be a shame to drop the existing research on program synthesis in favor of something that generates vaguely-guided random text in a long feedback loop. I mean if we really reach zero DOF, I think an existing coverage-guided fuzzer ought to give better results faster than an LLM
@jamey I think we all hope to avoid relying on these big corporate plagiarism machines that are out of our control, so I want you to be right!
@regehr I was trying to avoid phrasing it that way, but yes, very much that 😂
@jamey @regehr I share your concern in this respect, but I don't think fuzzers and LLM-assisted search can be made equivalent. At the margin, many existing codebases are already heavily fuzzed and the surfaced issues fixed. What remains are "weird" issues that violate intended invariants in ways that don't generate a fuzzer-visible crash. Use an LLM to add instrumentation so the fuzzer can fuzz, and the combination finds new bugs.
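As a toy illustration of that combination (everything here is hypothetical, not from any real codebase): an invariant that is silently violated is invisible to a crash-detecting fuzzer until an assertion makes it loud.

```python
def transfer(accounts: dict, src: str, dst: str, amount: int) -> None:
    """Buggy transfer: an overdraft silently drives a balance negative."""
    accounts[src] -= amount
    accounts[dst] += amount
    # The kind of instrumentation an LLM might add: assert the intended
    # invariant ("no negative balances") so that a violation becomes a
    # crash a coverage-guided fuzzer can actually observe.
    assert all(v >= 0 for v in accounts.values()), "negative balance"

# Without the assert this corrupts state quietly; with it, it crashes:
accounts = {"a": 10, "b": 0}
try:
    transfer(accounts, "a", "b", 25)
    crashed = False
except AssertionError:
    crashed = True
print(crashed)
# → True
```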
@jamey @regehr I suspect it's possible to construct ethically trained models capable of doing this work, though no such thing exists today.
@mirth @regehr I meant a different thing. If you have a hypothetical tool that tells you whether your source code meets your requirements, run a fuzzer on that tool: it may eventually find an input program that satisfies your requirements. This is not that different from using an LLM in a feedback loop with such a tool, except a coverage-guided fuzzer can see inside the tool to make potentially better guesses about what result you want. I claim random fuzzing should be "smarter" than an LLM for that purpose.
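That baseline loop might look like the sketch below, with toy stand-ins: `meets_spec` plays the hypothetical requirements-checking tool, and nothing here models coverage feedback, which is exactly what this blind version lacks.

```python
import random

def fuzz_against_spec(seed, meets_spec, mutate, tries=10_000, rng=None):
    """Blind random search in a feedback loop with a spec-checking oracle:
    keep mutating the candidate program until the oracle accepts it.
    A coverage-guided fuzzer improves on this by observing which paths
    inside the oracle each candidate exercises and keeping mutants that
    reach new paths as fresh seeds."""
    rng = rng or random.Random(0)
    candidate = seed
    for _ in range(tries):
        if meets_spec(candidate):
            return candidate
        candidate = mutate(candidate, rng)
    return None

# Toy stand-ins (not a real spec checker): accept "programs" ending in "k".
found = fuzz_against_spec(
    seed="",
    meets_spec=lambda p: p.endswith("k"),
    mutate=lambda p, rng: p + rng.choice("ok"),
)
assert found is not None and found.endswith("k")
```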

@jamey @regehr Oh I see. Do you follow "classical" AI? The combinatorics of program synthesis are (I think) even harder than say chess or go, but maybe we'll see some of the same patterns (heuristic search) at least until some possible future where quantum computers reduce the need to prune the tree.

On a different tack, random program search guided by spec coverage has a similar "mouth feel" of layer-hopping cleverness to the RPython JIT-from-interpreter extraction. Hmm.

@mirth @regehr I do have some experience with combinatorial search, mostly when I was a student a couple decades ago, and I love it! In a way I think it's kind of disappointing that SMT solvers are so good, because they make it not worth learning search techniques for most problems unless you get extra deep into a problem, and that takes away some of the joy of studying search.

@mirth @jamey so here's our paper (the one referenced in the post) using randomized synthesis. this is as good as we can do, so far. we might be able to do better with more work, but I don't know. but regardless, the difference between the LLM and randomized synthesis is night and day, it's not even close, and I strongly doubt we can close this gap.

https://users.cs.utah.edu/~regehr/papers/popl26.pdf

@mirth @jamey @regehr Also, crashes that are hard to find by coverage-guided fuzzing (e.g. CVE-2023-4863), but easy to detect when they happen -- as crashes are.
@robryk @regehr @jamey Yes, some of those I think LLMs can find by inspection without much tooling. There's also the automation of hint extraction from existing CVEs to guide where to look (like "read all the CVEs for image loading libraries and summarize the common exploitable bugs, then build harnesses for each"). I'm no professional, but in just a little experimentation it proved easier to find bug candidates than is healthy for the world.
@regehr I had a similar thought; I was thinking that writing constraints in something like JQF might make it low-friction to write oracles that encode knowledge about your problem
@regehr of course if you game that out in the long run nobody's writing prompts anymore and they're writing specs in some programming language instead
@bob ... and I think that's ok!
@regehr oh it would absolutely be an improvement over writing prompts in natural language
@regehr meanwhile i'm reconsidering the kinds of tools i'm building. if i don't add types to my project it will result in fewer slop submissions => maybe i should do that?
@regehr to be clear, the answer i have so far is "definitely no" but it's not a question i ever anticipated seriously contemplating
@whitequark it's nice to think that! maybe so. but I'd be wary about this if I actually wanted types, I mean let's make things good for the humans first
@regehr yea like i said it's not a serious thought yet
@whitequark there have to be solutions to this, maintainers should not have to deal with this bullshit
@regehr it seems like people who work in biology departments could have told us that this was coming: LLMs are to random mutation as oracles are to natural selection (like how a fast predator effectively selects relatively-faster prey for breeding).

@regehr also it seems like we're having kind of a monkey's-paw kind of moment: we've always wanted better languages/tools for expressing & automatically checking whether a program satisfies some formal specification, and now more motive for that improvement is coming from... LLMs.

it's like hoping for more countries to go all-in on solar panel deployment, and then they finally do, but the motive comes from a war in iran started by an insane american ruling class.

@JamesWidman yes I've been thinking about the monkey's paw often recently
@regehr Thanks for writing this! It feels close to how I started thinking of LLMs as "Kahneman system 1" thinkers (stochastic, intuitive, ...) . So you need "Kahneman system 2" tools (deterministic, analytical, controlled, ...) to control, verify or shape their output. It sounds like "executable oracles" are exactly one such kind of "Kahneman system 2" tool...
@kbeyls I like this way of thinking about it. isn't it hilarious how we have 75 years of SF showing us coldly logical AI and then when we get them they're touchy-feely and lazy?
@regehr I guess if the AI weren't touchy-feely and lazy, we wouldn't call it AI as it wouldn't be human-like enough ;)

@regehr I'm really curious what alternatives fit:

"By this point, most of us who have experimented with ________ have noticed that the current generation of these can sometimes do good work, at superhuman speed, when given some kinds of highly constrained tasks."

Fill in the blank:

1. Using child labor
2. Supervising graduate students
3. ???

@jawnsy I haven't yet tried out child labor!
@regehr wow, a refreshing look at LLMs that isn't simply marketing hype! Really appreciate it. Interesting to see the kind of work and strict scoping this takes. Also seems like IMO LLMs still really don't outweigh the negatives & cost
@kworker regarding "don't outweigh the negatives and cost" -- entirely possible. I avoided these things as long as I could, and I would still be avoiding them if I had a different job
@regehr yeah I hear ya. I won't repeat a laundry list that I think many people are familiar with, I just think LLMs have turned out to be very negative for society
@kworker I don’t even think we’ve seen the impact of the negatives yet
@regehr The concluding sentence is exactly what I'm excited about. Software will almost certainly get better with better tooling, AI agents or not, but now there's going to be a ton of money chasing better SE tools (I hope).
@sree @regehr definitely agree with this, accessibility of computing for the sake of LLMs is still accessibility of computing
@regehr In a way this idea of zero DoF is like the proverbial sculpture in a block of marble. You must carve away the space of tool-accepted code until the remainder has the shape you want. Very different from manual coding which often feels like assembling the thing by gluing grains of sand together one by one, or normal templates and scaffolding which have a sort of rigid and fixed shape.
@mirth that is a good analogy
@regehr article ok but. i don't understand why these metric-providers need to be called "oracles". it's the same pretension as the use of the word "story" in ticket systems. oracles deliver divine prophecies. what these things do is underwhelming in comparison. it irritates me greatly.
@lritter this is the standard term from software testing
@regehr @lritter to be fair, I think the original usage goes back to things like non-deterministic Turing machines, where an Oracle whispers to you the exact right sequence of non-deterministic choices to make to get the result you want, and in that case (certainly in the presence of things like uncomputability etc.) some form of divine intervention (or at least divine wisdom) might legitimately be required.
@regehr @lritter it's just like with anything, you start up building a religion to give you divine input to find a counterexample to the Goldbach Conjecture or whatever, but once the infrastructure exists and works you stop asking the Oracle questions like "what is the meaning of life" and start asking it questions like "is it sunny outside? I can't be bothered to pull back the curtains"
@regehr @lritter as an aside: in Arcane (the show), the arcane (the concept) is strongly implied to have agency, rather than being purely a mechanical facet of how that world works, and it starts "acting out"/getting pissy when treated as a mere utility to be tapped on a whim. IMO that's a genuinely inspired world-building idea and a cool fix for this kind of thing
@regehr @lritter basically, Rule of Cool enforced on penalty of magic apocalypse by Vengeful God
@rygorous @lritter the specific software testing usage seems to be from 1978