Weird FUD from Phoronix about LTO builds: Mesa is apparently turning off LTO because of "impossible to reproduce bugs that only happen in LTO builds". I'd like to get other people's take, but I'm assuming what's actually happening is that, as with -O1 vs -O3, optimizing across translation units is exposing latent UB bugs (or, much less likely, latent compiler bugs not directly related to LTO), and that the LTO pass itself isn't buggy.
Incidentally, this has always been one of the socio-technical problems I've had with LTO in large C and C++ code bases. All the scariest bugs in code bases for those languages are UB-related and might only show up at higher optimization levels. And because of slow build times, most devs understandably don't want to do all their local rebuilds at -O3, never mind -O3 plus LTO (even thin LTO), so you're not actually iterating and dogfooding the thing you're shipping if you ship LTO.
@pervognsen yeah mesa has a bunch of UB bugs, phoronix is just repeating stuff others have said which isn't very accurate. A few people have been slowly improving the situation in mesa, though.
@dotstdy @pervognsen Does any non-trivial C/C++ source code even exist, that does not contain (latent) UB?
@soc @pervognsen no :) though there's a relatively limited number of issues which can actually manifest problems, and once you get rid of the majority of those staying on top of things isn't as difficult.
@pervognsen which of course more or less follows the standard trajectory for bringing up old codebases on modern compilers in general, long painful tail of hard to isolate issues caused by latent UB. I'm currently performing a "where in the world did this ud2 come from" ritual myself... :')
@dotstdy UD2, the nemesis of us all.
@pervognsen I wish reverse debugging worked for video games. It reminds me of the old James Mickens rant "THERE'S NO HARDWARE ARCHITECTURE WHICH IS ALIGNED ON 7" https://www.usenix.org/system/files/1311_05-08_mickens.pdf
@dotstdy It's not a true UD2 miscompile unless half your code base was also optimized out in the process.
@pervognsen I can't tell if half the codebase is gone because there's only ud2 left! A lone instruction, jumped to from an unknown location. Debug info claims this was once the location of a thriving destructor, but I cannot see any evidence of such a thing. Has it gone, was it ever there?
@pervognsen actually, typing this out makes me wonder. If a function gets collapsed to a ud2, and you have comdat folding enabled, does the linker then fold all those functions into the same location?
@dotstdy @pervognsen “my name is Undefinedias, King of Kings…”
With the right adjustment in your mindset, any time you find that a bug with your code only materializes with different compilers or compiler versions or optimization flags, you're given a gift that lets you root out UB bugs in your code. Or, much more rarely, you get to root out compiler bugs, which is less fun. But having been in the situation of trying to ship things with latent UB like this, I understand the frustration. So don't ship LTO for now, but you (probably) still have serious bugs.
If you're a real psychopath you can try to fuzz your code for UB by compiling and auto-testing it with many different permutations of compilers and compiler configurations like that.
@pervognsen also if you have a reproducible issue it's probably worth reaching for `-fwrapv` and `-fno-strict-aliasing` first, since that might help you narrow it down more than reaching for the hammer that is disabling LTO.
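For readers who haven't run into it, here is a minimal sketch of the kind of signed-overflow bug that `-fwrapv` papers over. It is an illustration only, not code from Mesa; the names `add_overflows` and `check_add_overflows` are made up for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Broken idiom (UB, shown only as a comment, never compiled here):
 *     if (a + b < a) { ...handle overflow... }
 * Signed overflow is undefined, so without -fwrapv an optimizer may
 * assume a + b never wraps and simplify the test (for example to b < 0),
 * which silently removes the overflow check. */

/* Well-defined alternative: compare against the limits before adding. */
bool add_overflows(int a, int b) {
    if (b > 0) return a > INT_MAX - b;  /* INT_MAX - b cannot overflow here */
    if (b < 0) return a < INT_MIN - b;  /* INT_MIN - b cannot overflow here */
    return false;                       /* adding zero never overflows */
}

/* Hypothetical self-check for the sketch above. */
void check_add_overflows(void) {
    assert(add_overflows(INT_MAX, 1));
    assert(add_overflows(INT_MIN, -1));
    assert(!add_overflows(INT_MAX, 0));
    assert(!add_overflows(1, 2));
}
```

If a crash goes away when building with `-fwrapv`, code shaped like the commented-out idiom is a reasonable first thing to look for.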
@dotstdy Yeah, I think -fno-strict-aliasing in particular is close enough to mandatory for a lot of code bases since that's the single biggest deviation between C as actually used in practice vs the specification. So it deserves a separate category in that it's acceptable to just require it and not consider it UB in practice, so it's not really in the same category as anything else. Although -fwrapv would indeed be a close second.
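As a rough illustration of the aliasing pattern being discussed, here is a sketch of type punning done in a way that stays defined even with strict aliasing enabled. It assumes a 32-bit IEEE 754 `float`; the names `float_bits` and `check_float_bits` are invented for this example and are not from any of the code bases mentioned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "sketch assumes a 32-bit float");

/* Common but undefined version (shown only as a comment):
 *     uint32_t bits = *(uint32_t *)&f;
 * This reads a float object through a uint32_t lvalue, which violates the
 * strict aliasing rules unless the code is built with -fno-strict-aliasing. */

/* Well-defined version: copy the object representation instead.
 * Compilers typically turn this memcpy into a single register move. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical self-check; the constants assume IEEE 754 binary32. */
void check_float_bits(void) {
    assert(float_bits(0.0f) == 0x00000000u);
    assert(float_bits(1.0f) == 0x3f800000u);
}
```

The cast version tends to work until inlining (or LTO) lets the optimizer see both the store and the punned load at once, which is one way aliasing bugs end up looking like "LTO-only" bugs.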
@pervognsen @dotstdy
If I interpret https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/meson.build?ref_type=heads correctly, they indeed do not set -fno-strict-aliasing - bold.
@dotstdy @pervognsen fwiw it's no help when the compiler is rustc. For that one you do need the deranged fuzzing mentioned above 🫠
@crystalmoon @pervognsen at least there's still no tbaa!
@dotstdy @pervognsen but you have non-deduplicable Comdat sections 😭
@pervognsen
AFAIK Mesa tried for a long time to figure out the underlying issue, and this "disable LTO" thing is the last resort because they weren't able to.
@pervognsen I was thinking the same. I’ve hit hard-to-debug crashes in release builds that turned out to be UB problems triggered by LTO, but I’ve never hit a problem that boiled down to just LTO.
@rburchell Yeah, and I've worked on projects we couldn't ship initially with the -O3 equivalent without selectively disabling optimizations. It's a completely rational expedient, but usually (unless it's something definitive and categorical like -fno-strict-aliasing, where it's intentional) it should make you feel extremely uncomfortable, since there might be a way the same root cause could trigger without LTO after making innocuous code changes or changing compiler versions.
@pervognsen Yes, my experience is also that one of the main reasons people cannot use LTO is because it surfaces latent UB bugs.
@pervognsen Except for Emscripten (which has its own kind of LTO bugs), almost all bugs that I have seen exposed by LTO were actually UB in the first place (for GCC/Clang/MSVC).
-O3 tends to expose similar bugs, however -O3 also tends to expose actual compiler bugs, and I have seen a fair share of those. I suspect they would also affect LTO, but for me, -O3 has mostly been enough to trigger the actual compiler bugs.
I think I have seen 1 actual MSVC bug that only triggered with LTCG/LTO.

@pervognsen Yeah, the only time I have actually seen compiler bugs with LTO was when we were LTOing a mixture of v8-M Main and Base in the RP2350 ROM. Even that was ICEs instead of miscompiles.

That said I am currently working with a large embedded firmware codebase that has issues under LTO, and it's pretty annoying to track down given there is no way of running it off-target, so things like ubsan are not an option.

LTO also makes the disassembly way more annoying to read, which unfortunately I care about quite a bit if I'm down in the weeds doing hardware debug.

@pervognsen at one point Mesa had strict aliasing turned off, but they turned it back on in recent years, so that might still be an issue.

A good example of an aliasing issue and LTO: I fixed a strict aliasing issue in the GCC sources in 2011 (PR48981) which was exposed by LTO. It just happens that SPEC 2017 chose a GCC that was older than that, so every once in a while (like last year) we get a report of GCC in SPEC failing with LTO when using strict aliasing, but only with a new enough GCC.
It is just funny how I fixed it almost 15 years ago now too.

@pervognsen The chance of this being latent UB increases quite a bit the closer you get to the hardware. People who work on drivers (either kernel or userspace) really like to pretend that C is just a macro assembler.