won't say I'm totally proud of myself here, but once I saw that the Claude C compiler was super buggy according to YARPGen and Csmith, I had a hard time preventing myself from doing something about it
Interesting, but nothing surprising, to me at least.
The problem we are seeing in LLVM, and from what I understand in other open source projects as well, is not that they can't provide patches that "work".
They can't seem to effectively address code review. They will fix some things correctly, some incorrectly, and on some comments they often go off and do completely wrong things.
Another issue is that they have a hard time following existing idioms in the code base. They often produce solutions that don't conform to the current idiom, and they seem unable to make the correction based on review feedback.
Yet another issue is shallow fixes that make a crash go away but don't actually address the real root problem: "it works", but it is wrong all the same.
Combined w/ the fact that the average patch changes only a small number of lines, the kind of hand-holding required results in a large net-negative return on value.
I think these flaws are inherent in the model and not really fixable long-term in the current LLM-based tooling. We need models that can actually "reason" and "understand", and LLMs can't do that.
This is what we get w/ statistical inference; it is indeed impressive, but not sufficient.
This is for all intents and purposes a huge experiment and Open Source gets a front row seat but we can't get off the ride nor are we supported or compensated for what is essentially a large burden on our resources.
@regehr "Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid."
I can't totally prove it either, but when these tools are run regularly on the trunk of the compilers, they have found bugs earlier than a full distro build would have. At least for GCC.
For GCC, the runtime fuzzer testing has usually found bugs that were introduced in the last week or so, which makes it easier for a developer/reviewer to just fix them while that part of the code is still fresh (usually).
(Note: getting the same insight for LLVM might be harder, since its bug/regression tracking lags behind GCC's; that, and the folks running the runtime fuzzers might not report bugs upstream but only downstream, from what I can tell.)
@zwarich @joe it's table 1 from this paper, right?
https://www.cs.swarthmore.edu/~bylvisa1/cs97/f13/Papers/DifferentialTestingForSoftware.pdf

Attached: 1 image state of a typical C compiler, 1998
@zwarich @joe @regehr Alan Snyder's portable C compiler in 1976 (not the better known pcc by Steve Johnson descended from it) was, I believe, the first to have local struct names.
https://archive.org/details/snyder_c_differences_1978-04-04/page/n1/mode/1up
@regehr one worry I would have about this is whether using a reducer is "reasonable". I know it's "needed" for humans to analyse the problems, but I don't know if the claude compiler has a compositional enough structure that big programs hit the same problems small programs do (I know clang etc *do* have this structure, but that's because they were written by humans). This feels like the opposite of the "small model theory" we get in program synthesis for working out which candidates are likely to be most general.
I guess that's easy to verify by checking that all the pre-shrinking cases are fixed by the fixes to the reduced bugs, and maybe against all the intermediate shrunk programs too.
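That re-check could be sketched as a small differential-testing harness: compile every original (pre-reduction) program with both a reference compiler and the compiler under test, run both binaries, and compare their output. The compiler names (`gcc`, `ccc`) and directory layout below are illustrative assumptions, not anything from the thread.

```python
import glob
import subprocess

def outputs_match(cmd_a, cmd_b):
    """Run two commands and compare their stdout byte-for-byte
    (the core check in differential testing)."""
    a = subprocess.run(cmd_a, capture_output=True, check=True).stdout
    b = subprocess.run(cmd_b, capture_output=True, check=True).stdout
    return a == b

def recheck_originals(case_dir, ref_cc="gcc", test_cc="ccc"):
    """After fixing the bugs found via reduced test cases, re-run every
    pre-reduction program through both compilers and report the sources
    whose outputs still diverge. ref_cc/test_cc are hypothetical names."""
    still_failing = []
    for src in sorted(glob.glob(f"{case_dir}/*.c")):
        subprocess.run([ref_cc, "-O2", "-o", "ref_bin", src], check=True)
        subprocess.run([test_cc, "-O2", "-o", "test_bin", src], check=True)
        if not outputs_match(["./ref_bin"], ["./test_bin"]):
            still_failing.append(src)
    return still_failing
```

An empty result from `recheck_originals` would support the claim that fixing the reduced bugs also fixed the big programs they were shrunk from; the same loop could be pointed at the intermediate shrunk programs as well.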
@regehr this just... yeah
"CCC isn’t even a useful prototype."
yeahh,,,