Mastodawn

I can't stop thinking about the LLM-generated compiler that passes all the unit tests but emits inner loops that benchmark over 150,000x slower than a gcc debug build. I couldn't possibly have intentionally come up with such a funny demonstration of the point of genuine expertise https://harshanu.space/en/tech/ccc-vs-gcc/

CCC vs GCC

A Guide to comparing Claude Code Compiler with GCC

Harshanu

Erin 💽✨

@0xabad1dea It’s so diabolically bad I don’t know how you do it. We’re not talking about gcc -O3 here, which does some truly herculean things, we’re talking about GCC with basically every optimization disabled. I don’t understand how the generated code wouldn’t run within a finite constant factor of gcc here, you just have to spit out the dumbest possible assembly for a given input source.

You just know there’s some absolutely horrific workarounds going in here because it’s apocalyptically bad in utterly incomprehensible ways.

Erin 💽✨ Feb 12

@0xabad1dea …the more i ruminate on it the more i think digging into the output (which is rather difficult given the poor quality and lack of debugging symbols) would find that it’s done something like sometimes implementing multiplication iteratively or something. it’s really astoundingly bad.

big awoo notation Feb 12

@erincandescent @0xabad1dea looking at the disassembly makes me think it has invented 6502-64 /j

John Regehr Feb 12

@erincandescent @0xabad1dea making it all even funnier, there’s a full set of optimization passes in the implementation

Erin 💽✨ Feb 12

@regehr @0xabad1dea i know! there’s presumably a whole Source -> AST -> SSA -> Multiple optimization passes -> Assembly pipeline going on here! what on earth is it even doing in there that the output is this embarrassingly bad?!

The output would be quite frankly embarrassing for a single pass source -> assembly/machine code translator (which you can do for a half reasonable subset of C in 2kB of C code, see e.g. OTCC) but there’s an entire optimization pipeline in there?!

John Regehr Feb 12

@erincandescent @0xabad1dea I took a very quick look at the code for some of the passes and they're at least superficially plausible. I think one would have to actually run the compiler to see what they're doing. perhaps working together to produce that amazingly slow code, like maybe each pass adds a bunch of copies and the stupid AI forgot copy propagation. something like that feels likely.
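For readers who have not met copy propagation, here is a minimal sketch of what such a pass does on one straight-line block. The tuple-based IR, the function names, and the temporaries are all hypothetical and chosen for illustration; nothing here is taken from the compiler under discussion, and whether it actually lacks this pass is only the guess made in the post above.

```python
# Hypothetical three-address IR: (dest, op, args). Illustration only.

def copy_propagate(block):
    """Forward copy propagation over one straight-line basic block."""
    copies = {}  # name -> the name it is currently known to be a copy of
    out = []
    for dest, op, args in block:
        # Rewrite each operand through the copies we know about.
        args = tuple(copies.get(a, a) for a in args)
        # dest is redefined here, so facts that mention it are stale.
        copies = {k: v for k, v in copies.items() if k != dest and v != dest}
        if op == "copy" and args[0] != dest:
            copies[dest] = args[0]
        out.append((dest, op, args))
    return out


def eliminate_dead(block, live_out):
    """Drop instructions whose result is never used (backward scan)."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(block):
        if dest in live:
            kept.append((dest, op, args))
            live.discard(dest)
            live.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept


# A block where earlier passes have piled up redundant copies.
block = [
    ("t1", "copy", ("a",)),
    ("t2", "copy", ("t1",)),
    ("t3", "copy", ("b",)),
    ("r", "add", ("t2", "t3")),
]

cleaned = eliminate_dead(copy_propagate(block), live_out={"r"})
print(cleaned)
```

Propagation on its own only rewrites the operands of the add; the copies disappear once a dead-code pass runs afterwards, which is why a pipeline that keeps adding copies and never propagates them can bloat quickly.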

abadidea Feb 12

@regehr @erincandescent the blogger's assessment is that the main issue in the SQL loop is it was shuttling every single variable read/write through one single register, because once there are more variables than registers it doesn't know what else to do.
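To make the "everything goes through one register" strategy concrete, here is a rough sketch of a code generator that keeps every variable in a stack slot and uses a single scratch register. The statement format, the slot layout, and the emitted x86-64-style pseudo-assembly are assumptions for illustration, not the actual output of the compiler being discussed.

```python
def spill_everything(stmts):
    """Emit pseudo-assembly where every variable lives in a stack slot.

    stmts is a list of (dest, op, lhs, rhs) tuples, a hypothetical format.
    Every read is a load into %rax and every write is a store from %rax.
    """
    slots = {}

    def slot(name):
        # Assign the next 8-byte slot below %rbp on first use.
        if name not in slots:
            slots[name] = -8 * (len(slots) + 1)
        return f"{slots[name]}(%rbp)"

    asm = []
    for dest, op, lhs, rhs in stmts:
        asm.append(f"mov {slot(lhs)}, %rax")   # load left operand
        asm.append(f"{op} {slot(rhs)}, %rax")  # combine with right operand from memory
        asm.append(f"mov %rax, {slot(dest)}")  # store the result back
    return asm


stmts = [("t", "add", "a", "b"), ("u", "imul", "t", "c")]
asm = spill_everything(stmts)
print("\n".join(asm))
```

The point of the sketch is the instruction count: three instructions per statement no matter how many variables exist, so this strategy by itself costs a constant factor and not several orders of magnitude.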

John Regehr Feb 12

@0xabad1dea @erincandescent well that's technically a register allocator

Erin 💽✨ Feb 12

@regehr @0xabad1dea and it’s a bad one, but it’s a 10x-slower kind of bad at worst. and i’d say that only really because all of the mov big_offset(%rbp), %reg loads and the stores back are probably huge and giving the instruction decoder indigestion.

Jason Orendorff Feb 12

@erincandescent @0xabad1dea @regehr I'd love to know if this is really the problem. The blog post itself shows signs of having been AI-generated, and it contains a whole section "Why Subqueries Are 158,000x Slower" that makes no sense to me

David Chisnall (*Now with 50% more sarcasm!*) Feb 12

@erincandescent @0xabad1dea

And we're talking about the kind of things that tcc can compile. TCC was originally an entry into the International Obfuscated C Code Contest, as a C compiler that fitted on one screen and could compile itself (the back end bit is in QEMU as the Tiny Code Generator, which QEMU uses for JITing small fragments of emulated code).

The full version is bigger, but still very small. And it can compile SQLite.

It's pretty naïve. It doesn't do anything more than peephole optimisation. In the worst case performance is usually around 25% of GCC (occasionally worse for vectorised hot loops), for some things it's closer to 90%.

TCC is not designed for generating fast code; it was designed to be simple and to generate code quickly (they did a demo about 20 years ago with tcc embedded in GRUB, compiling the Linux kernel and then booting it; it took 30s to compile the kernel in an x86 emulator on a 1.25GHz PowerPC host). So if you're generating slower code than TCC, that's really embarrassing.