won't say I'm totally proud of myself here, but once I saw that the Claude C compiler was super buggy according to YARPGen and Csmith, I had a hard time preventing myself from doing something about it
Interesting, but nothing surprising, to me at least.
The problem we are seeing in LLVM, and from what I understand in other open source projects as well, is not that they can't provide patches that "work".
They can't seem to effectively address code review. They will fix some things correctly, some incorrectly, and on some comments they often go off and do completely wrong things.
Another issue is that they have a hard time following existing idioms in the code base. They often produce solutions that don't conform to the current idiom, and they seem unable to make the correction based on review feedback.
Yet another issue is shallow fixes that make a crash go away but don't actually address the real root problem: "it works", but it is wrong all the same.
Combined w/ the fact that the average patch changes only a small number of lines, the kind of hand-holding required results in a large net-negative return on value.
I think these flaws are inherent in the model and not really fixable long-term in the current LLM-based tooling. We need models that can actually "reason" and "understand", and LLMs can't do that.
This is what we get w/ statistical inference; it is indeed impressive, but not sufficient.
This is for all intents and purposes a huge experiment and Open Source gets a front row seat but we can't get off the ride nor are we supported or compensated for what is essentially a large burden on our resources.
@regehr "Although I can’t prove it, I like to think that these tools (and others like them) have helped the production compilers that developers use every day become more robust and solid."
I can't totally prove it either, but when these tools are run regularly on the trunk of the compilers, they have found bugs earlier than a full distro build would have. At least for GCC.
For GCC, the runtime fuzzer testing has usually found bugs that were introduced in the last week or so, which makes it easier for a developer/reviewer to just fix them while that part of the code is still fresh (usually).
(Note: getting the same insight for LLVM might be harder, since its bug/regression tracking lags behind GCC's; that, and the folks running the runtime fuzzers might not report bugs upstream but only downstream, from what I can tell.)
@zwarich @joe it's table 1 from this paper, right?
https://www.cs.swarthmore.edu/~bylvisa1/cs97/f13/Papers/DifferentialTestingForSoftware.pdf

Attached: 1 image state of a typical C compiler, 1998
@zwarich @joe @regehr Alan Snyder's portable C compiler in 1976 (not the better known pcc by Steve Johnson descended from it) was, I believe, the first to have local struct names.
https://archive.org/details/snyder_c_differences_1978-04-04/page/n1/mode/1up
@regehr one worry I would have about this is whether using a reducer is "reasonable". I know it's "needed" for humans to analyse the problems, but I don't know if the claude compiler has a compositional enough structure that big programs hit the same problems small programs do (I know clang etc *do* have this structure, but that's because they were written by humans). This feels like the opposite of the "small model theory" we get in program synthesis for working out which candidates are likely to be most general.
I guess that's easy to verify by checking that all the pre-shrinking cases are fixed by the fixes to the reduced bugs, and maybe against all the intermediate shrunk programs too.
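That re-check could be sketched as a small differential-testing harness: compile every original (pre-reduction) program with both a reference compiler and the compiler under test, run both binaries, and compare their output. The compiler names (`gcc`, `ccc`) and directory layout below are illustrative assumptions, not anything from the thread.

```python
import glob
import subprocess

def outputs_match(cmd_a, cmd_b):
    """Run two commands and compare their stdout byte-for-byte
    (the core check in differential testing)."""
    a = subprocess.run(cmd_a, capture_output=True, check=True).stdout
    b = subprocess.run(cmd_b, capture_output=True, check=True).stdout
    return a == b

def recheck_originals(case_dir, ref_cc="gcc", test_cc="ccc"):
    """After fixing the bugs found via reduced test cases, re-run every
    pre-reduction program through both compilers and report the sources
    whose outputs still diverge. ref_cc/test_cc are hypothetical names."""
    still_failing = []
    for src in sorted(glob.glob(f"{case_dir}/*.c")):
        subprocess.run([ref_cc, "-O2", "-o", "ref_bin", src], check=True)
        subprocess.run([test_cc, "-O2", "-o", "test_bin", src], check=True)
        if not outputs_match(["./ref_bin"], ["./test_bin"]):
            still_failing.append(src)
    return still_failing
```

An empty result from `recheck_originals` would support the claim that fixing the reduced bugs also fixed the big programs they were shrunk from; the same loop could be pointed at the intermediate shrunk programs as well.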
@regehr this just... yeah
"CCC isn’t even a useful prototype."
yeahh,,,