Recent discussion about the perils of doors in gamedev reminded me of a bug caused by a door in a game you may have heard of called "Half Life 2". Are you sitting comfortably? Then I shall begin.
Once upon a time, I worked at Valve on virtual reality. This was in 2013 around the time the Oculus DK1 emerged, and Joe Ludwig and I decided that the best way to figure out how VR would work in a real game context was to port a real game to it and see what happened.

We picked Team Fortress 2 as the game - the reason why is a whole different story I won't go into here. TF2 used the Source 1 engine, and as it happened two Valve games also using that same version of the engine were Half Life 2 and Portal 1. So as a side-effect they also got to work in VR.

Well, Portal 1 "worked" - but all the tricks with perspective when you went through a portal were of course a nauseating disaster - it was pretty unplayable.

But HL2 did actually work pretty well. Joe spent a lot of time making the boat sequences work in a reasonable way.

There's a sequence of stacking boxes near the start that is somewhat infuriating in the original - the stack keeps falling over - but in VR it's really easy to place them well.

Also, whacking the manhacks with your crowbar goes from being a panicked flailing in flatscreen, to being an elegant one-swing home-run hit in VR.

Luckily, there was some other excuse to reissue HL2 anyway (see https://developer.valvesoftware.com/wiki/Source_2013) and the VR version worked pretty well, so we put the VR support on a command-line, labelled it "beta", rebuilt the whole of HL2 and prepared to ship it.

Of course we'd played a bunch of HL2 by this point, testing that all the VR stuff worked. But we'd just skipped to the relevant chapters - we never actually played through from the start. And I hadn't played it through in a while, so I thought I'd do that in VR, start to finish. If I discovered anything that still didn't work, I could at least document it in the release notes.
So I started it up, selected new game, played the intro section. It's a fairly well-known section - you arrive at the train station with a message from Breen, a guard makes you pick up a can, and then you have to go into a room and... uh... I got stuck. I wasn't dead, I just couldn't go anywhere. I was stuck in a corridor with a guard, and nowhere to go. Bizarre.

What is meant to happen is a guard (spoiler alert - it's actually Barney in disguise) bangs on a door, the door opens, he says "get in", and then the game waits for you to enter the room before the script proceeds.

But in this case the door sort of rattled, but didn't open, and then locked shut again. So you can't get in the room, and the gate closed behind you, so you can't go do anything else. The guard waits forever, pointing at the locked door, and you're stuck.

I checked a video online, wondering if my memory was faulty - nope, the door's meant to open automatically, and you walk in. https://www.youtube.com/watch?v=y_3vMUOayyc&t=215s (at 3:40). But... now it doesn't!
Oh dear. We can't ship this. I get some other folks, including some folks who worked on HL2 originally, and yep - it's broken. And it's broken when you're not in VR either - so it's not something Joe and I broke. But nobody knows why - none of the relevant code has changed.
Someone even goes back in the source history and compiles the original game as it shipped - nope, that original version is also broken. How can this possibly be? At this point people are freaking out - this isn't a normal bug - it appears to have traveled backwards in time and infected the original!
After about a day remembering how to use the debugging and replay tools, someone smart (sorry, I don't remember who) figured out what was going wrong.
If you watch the video, when the door unlocks and then opens, there's a second guard standing inside the room to the left of the opening door. That guard is actually standing very slightly too close - the very corner of his bounding box intersects the door's path as it opens. So what's happening is the door starts to open, slightly nudges into the guard's toe, bounces back, closes, and then automatically locks. And because there's no script to deal with this and re-open the door, you're stuck.
Once we'd figured this out, the fix was simple - move the guard back about a millimeter. Easy. But it took a lot of work to find because people had to dust off old memories of how the debugging tools worked, etc.
OK cool now we can ship the game phew. But why did this EVER work? The guard's toe was in the way in the original version as well. As I say, we went back in time and compiled the original as-shipped source code - and the bug happened there as well. It's always been there. Why didn't the door slam closed again? How did this ever ship in the first place?
So this kicked off an even longer bug-hunt. The answer was (as with so many of my stories) good old floating point. Half Life 2 was originally shipped in 2004, and although the SSE instruction set existed, it wasn't yet ubiquitous, so most of HL2 was compiled to use the older 8087 or x87 maths instruction set. That has a wacky grab-bag of precisions - some things are 32-bit, some are 64-bit, some are 80-bit, and exactly which precision you get in which bits of code is somewhat arcane.
But ten years later in 2013, SSE had been standard in all x86 CPUs for a while - the OS depended on it being there, so you could rely on it. So of course by default the compilers use it - in fact you have to go out of your way to make them emit the old (slightly slower) x87 code. SSE uses a much more well-defined precision of either 32 or 64 bit according to what the code asks for - it's much more predictable.
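To make the precision difference concrete, here is a minimal Python sketch. It is only an illustration and has nothing to do with the Source engine code. Python floats are 64-bit, so the hypothetical helper `to_f32` simulates storing a value at 32 bits by round-tripping it through `struct`. The 80-bit x87 format has no direct Python equivalent, so it is not shown.

```python
import struct

def to_f32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

value_64 = 0.1           # stored with a 53-bit significand
value_32 = to_f32(0.1)   # stored with a 24-bit significand

# The two stored values are very close, but they are not the same number.
print(value_64 == value_32)       # False
print(abs(value_64 - value_32))   # roughly 1.5e-9
```

The difference is tiny, which is the point: it only matters when something downstream sits right on a threshold.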

So problem solved, right? 80 bits of precision means the collision didn't happen, but in 32 bits of precision it does, and that's your problem, more bits better, QED, right? Well not quite.

The guard's toe overlaps in both cases - a few millimeters is still significantly larger than ANY of the possible precisions. In both the SSE and x87 versions, the door hits the guard's toe. So far, both agree.

This collision is actually properly modelled - a big innovation of HL2 was the extensive use of a real physics engine. The door and the guard are both physical objects, both have momentum, they impart an impulse on each other, and although the door hinge is frictionless, the guard's boots have some amount of friction with the floor.
On both versions, the door has just enough momentum to rotate the guard very slightly. The guard's friction on the floor is not quite enough to oppose this, and he rotates a tiny fraction of a degree. On the x87 version, this tiny rotation is enough to move his toe out of the way, the collision is resolved, and the door continues to swing open. All is well.
But on the SSE version, a whole bunch of tiny precisions are very slightly different, and a combination of the friction on the floor and the mass of the objects means the guard still rotates from the collision, but now he rotates very slightly less far.
So on the next frame of simulation, his toe is still in the way of the door. The door isn't allowed to just pass through his toe, so it does the only other option - it bounces back. I think by default it's set to do so completely elastically, so the door bounces back with exactly the speed it came in at, slams shut, and locks again. And you're stuck.
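The way a tiny rounding difference flips a yes/no outcome can be sketched with a toy model. This is not the HL2 physics: `guard_rotation`, `CLEARANCE` and the numbers are all made up for illustration. The guard's rotation is accumulated in ten small nudges, once rounding to 64 bits and once to 32 bits, and the door only opens if the total reaches the clearance threshold.

```python
import struct

def to_f32(x: float) -> float:
    """Round to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def guard_rotation(steps, nudge, round_fn):
    """Accumulate `steps` small rotations, rounding after every addition."""
    angle = round_fn(0.0)
    nudge = round_fn(nudge)
    for _ in range(steps):
        angle = round_fn(angle + nudge)
    return angle

CLEARANCE = 1.0  # hypothetical rotation needed for the toe to clear the door

def door_outcome(angle):
    if angle >= CLEARANCE:
        return "door swings open"
    return "door bounces back and locks"

angle_64 = guard_rotation(10, 0.1, float)    # 0.9999999999999999
angle_32 = guard_rotation(10, 0.1, to_f32)   # about 1.0000001

print(door_outcome(angle_64))   # door bounces back and locks
print(door_outcome(angle_32))   # door swings open
```

In this toy the lower precision happens to be the one that clears the threshold, which is the opposite way round from the story. Which side "wins" is arbitrary; the only real fix is not to sit that close to the threshold.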
And that's why the bug went "back in time" - because yes it's the old code, but we were using a newer compiler with new default settings. In the original build, the compiler defaulted to x87, but in the newer compilers the default is SSE. It's not that one is "better" - the fundamental bug is that the guard was too close to the door, and that had always been there. But in the original the problem "self corrected" and so was never spotted, whereas in the newer compile it became a showstopper.
And there you have it. The two biggest bug-farms in gamedev - doors and floating point - contrived to make a simple NPC placement bug into quite the time-travelling palaver. /end
@TomF I was sitting comfortably and I had coffee. This was a great story indeed. When you mentioned the floating point, I was like, wow, that guard must have been positioned a TINY bit wrong. Imprecisions in physics however... fascinating. Thanks for sharing!
@wildrikku @TomF Should have emulated it using quantum physics, so the game forked into two, one where the door quantum-tunnelled through the guard's toe and life went on
@wildrikku @TomF yeah I've written 2D physics engines in C in past and experienced firsthand the non-intuitive weirdness that can manifest when a tiny flaw amplifies then propagates downstream
@wildrikku @TomF The moment he went "recompiled the old code" I was like "and did you use the old compiler too?" ... 😅
Well done though, spot on analysis, enjoyed the read.
@TomF interesting read, thanks! Gotta love those bugs that make you question your fundamental assumptions and/or sanity. And I did play TF2 on my DK1 but got a lot of motion sickness I think due to the hardware and using an underpowered laptop GPU. But I didn't know HL2 VR was developed.
@TomF you found one of those elusive situations where the compiler broke it. Great story

@knutaf @TomF As I read about the bug "Going back in time", it did make me think about Reflections on Trusting Trust.

Only this time it definitely wasn't malicious - it was the old compiler unintentionally being too helpful that did it in.

@knutaf @TomF Although thinking on it more, this feels like it justifies my preference for integration testing over unit testing; because from a unit of physics, all of these rules did look like they worked as intended in both versions - it took both the door collision physics *and* the guard rotating physics to combine into a series of floating point precision issues that caused the door to lock when it shouldn't have.
@AT1ST @knutaf Yup - this is why I think 99% of unit testing is a total waste of time and effort. Integration/holistic testing catches far more bugs, and is also easier to support (coz it's you know - game save/load!)
@TomF @AT1ST @knutaf agreed. I've been building a new Golang backend system and I care much more about what happens when running for real and under prod-like conditions, end to end, than I do about any ivory tower isolated assertion about what one particular LOC did

@TomF @AT1ST @knutaf Do you mean that unit testing is ineffective in game development, or ineffective generally?

I find unit testing very useful when writing library code, but I have no experience with game development.

@bobulous @TomF @AT1ST I've gotten a lot of value from unit tests in libraries where I can inject conditions that are difficult to induce at the integration layer (drop/reorder specific messages, stuff like that), but it's not a substitute for good integration tests
@bobulous @AT1ST @knutaf Unit tests find very few bugs that are not ALSO found by integration testing, and adding unit tests to things that are not well-defined "units" (which is 80% of game code) is a lot of infrastructure to write, and that distorts clarity and slows down development. It's just not worth that overhead for the few (and simple) bugs that it finds that an integration test did not.

@TomF @AT1ST @knutaf
Unit tests are valuable in languages without static type checking.

Essentially, your unit test suite becomes your second compiler/linter, because JavaScript/Python/PHP/Lua doesn't give it to you for free like C++/Haskell/Java/Rust.

They're also good as documentation: "here's how to call it and what it's expected to do!"

@StompyRobot @AT1ST @knutaf Your language is only free of types if your time is worth nothing :-)
@StompyRobot @TomF @knutaf I guess I prefer integration tests for documentation as a "Here's how things work together and where they go, and if you extend/change this thing, here's where it gets picked up.".

@AT1ST @TomF @knutaf integration tests say "when the user presses A on the controller, the character jumps" but unit tests say "attempting to jump when not in contact with the floor does not change velocity."

Unit tests can also test a number of use cases of some unit that may not be currently in use by the full system, but are meaningful for a complete library/component.

It's also much easier to unit test error handling, than integration test the same...
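A unit test of the kind described above might look like the following minimal sketch. The `Character` class, its fields and the impulse value are hypothetical, invented purely to have something to test.

```python
from dataclasses import dataclass

@dataclass
class Character:
    """Hypothetical character with just enough state to test jumping."""
    on_floor: bool = True
    vertical_velocity: float = 0.0

    def jump(self, impulse: float = 5.0) -> None:
        # Only a grounded character may jump.
        if self.on_floor:
            self.vertical_velocity += impulse
            self.on_floor = False

def test_jump_in_air_does_not_change_velocity():
    falling = Character(on_floor=False, vertical_velocity=-2.0)
    falling.jump()
    assert falling.vertical_velocity == -2.0

def test_jump_from_floor_adds_impulse():
    standing = Character()
    standing.jump()
    assert standing.vertical_velocity == 5.0
    assert standing.on_floor is False

test_jump_in_air_does_not_change_velocity()
test_jump_from_floor_adds_impulse()
```

Everything runs in process with no controller, clock or renderer involved, which is what makes it a unit test under the definition used in this sub-thread.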

@AT1ST @TomF @knutaf that being said. There's not a global definition of the separation of the kinds of test.

I tend to think of "unit tests" as "tests that run entirely in process/RAM and have controlled for any possible external system like file system, network, clock, etc."

Integration tests then use the actual uncontrollables, and thus cannot be guaranteed to be 100% free of flakes.

And flaky tests are the absolute worst!

@StompyRobot @TomF @knutaf Unit testing error handling I can understand, but I would hedge that integration testing can test "When attempting to jump after jumping while in contact with the ground, you cannot jump again once leaving the ground", which involves verifying that the "Jumpability" state changes while jumping.

While integration testing can be flaky because of that, it is more likely to catch unexpected double jumping capabilities (Or expected double jumping with limitations that are not Kirby-like.).

@StompyRobot @TomF @knutaf (For an example of where you might see unexpected double jumping capabilities, this comes to mind [ https://youtu.be/gUTPINL7UQs?si=BBTdI6Iv2FciYzA6 ].

It's used in world record speedruns like this [ https://youtu.be/lBffcbr7dmI?si=oyyZ8AVgdMVBZQdN ] at 29:18, also using the Tingle Tuner.)


@TomF @AT1ST @knutaf I don't know about easier to support.

My strategy lately has been to use unit tests for weird edge cases (the ones we know of at the time) and to aim for very good coverage, exercising every branch.

And then, an end to end test covering the happy path. I guess I tend to skip the integration testing. I feel integration testing is harder to test (like end to end), but doesn't provide the real full value because it is so mocked already.

But all of this is based on me not working in video games, so not the same complexity and push for performance (which might break some clean separation of concerns, maybe?)

Also, maybe we don't have the same definition of unit test, integration test, end to end test (which makes it a real pain to talk about with fellow developers)

@dolanor @TomF @knutaf Perhaps it would be better for me to define what I am thinking of as "Integration Test", because I suspect it's pretty close to the "End to End testing" - most of my experience with it has been Selenium pointed at a website under test in a staging/dev/production environment, and then running the commands to do either a sanity test or a test for things (Recently, I did try and do an accessibility testing version in Selenium, although that...can be difficult to codify as I understand. Well, that and SauceLabs RemoteWebDriver connection stuff.). Other than that, it is usually me doing manual tests of an application in a test environment, ensuring that inputs get the expected outputs.

That is, automating the testing of a thing with as much of the app wired up so that the tooling only tests the same endpoints as the user might. As much as possible, avoiding mocking dependencies, and just having everything connected and using the things like they would in production.

Maybe that's more E2E?

@AT1ST @TomF @knutaf yes, that for me is complete e2e

@dolanor @TomF @knutaf I think for me, the difference with E2E testing is that while it's essentially integration testing, it would also involve testing the backend storage mechanism instead of just "Can the user get the system to a state that makes it clear that they broke it in their UI?".

At least, for an automation level, it would need to have an aspect of white-box testing, rather than just black-box testing.

@dolanor @TomF @knutaf (To bring it closer to the Half-Life 2 example - a Unit Test catches the floating point precision on the door opening and closing actions, an Integration Test catches that the door opens and ends up being pushed back into the doorframe and locking, and an End To End Test catches both, but also that the guard rotated and the floating point precision issues there.)

@knutaf

Sounds more like the old compiler accidentally unbroke it and the new compiler didn't.

It's a map bug. The guard is in the way of the door. Or it's a game engine bug, in that doors opening by script don't move unseen NPCs out of the way. But it's not a compiler bug, at any rate.

@TomF

@argv_minus_one @knutaf @TomF

You're the first one I see here talking about a compiler bug. One has said that "the compiler broke it", but that doesn't mean the compiler has a bug.

@TomF you explained this VERY well for a non-computer-toucher like me, thank you
@TomF But why did the bug not appear when the game was recompiled for Linux around 2015 ?

@vincent

@TomF

He starts the timeline of this story in 2013. My guess is that, by the time the 2015 linux release was done, the issue had already been found and sorted

@Dio9sys @vincent @TomF yeah, likely as #Valve updates their stuff all the time...
@vincent @TomF a different compiler (gcc rather than MS Visual C++) is used, which means the specific floating point operations used are a bit different. Even nowadays gcc on Linux still defaults to using x87 floating point when building 32-bit x86 code unless you explicitly request to target a newer CPU.
@vincent @TomF because Linus wrote a strongly worded email on the kernel dev list and it decided to not make itself known.

@TomF This is amazing. Thanks Tom!

More! More!

@TomF thank you for this story.

I work in embedded with bleeding-edge toolchains targeting unreleased hardware. Despite C being older than I am, it remains staggering to me how much flux still happens in the generated code from compiler changes. Our team has learned long ago to freeze the toolchain alongside the code…

@TomF feels like a "never compare floats with equal signs"-error.
I would have asked myself "If the error didn't happen in the old executable, but happens when I use the old source, what did change?" Well, the compiler and the computer probably did change, if it's that far apart in time. As I'm not a game developer (yet) I probably wouldn't have been able to analyse the problem as you did, and would have just implemented the 'obvious' solution without knowing "why?"

@zerodime Doesn't sound like an eq comparison here, but rather a gt or lt inequality (used in checking for collisions/potential overlaps).

@TomF

@TomF I had a similar bug: checking equality between two dot products of the same vector pair (when computing closest rotation) failed because the first was moved from the FPU to the CPU (truncated to 32 bits), while the second stayed in the FPU, and the comparison ran in the FPU again (80-bit vs 32-bit). Switching to SSE months later made the bug disappear.

@davidcanadasmazo @TomF a floating point equals comparison should never be done without specifying an epsilon that specifies how close the values must be to be considered equal. Also both absolute and relative limits might be needed. Did you add an epsilon in this case? Or did you just let it go once the bug disappeared?

I have seen these bugs in design rule checkers for PCB layout software. In that case it happens even for a greater than comparison.
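A minimal sketch of the dot-product situation and of a tolerance-based comparison. The vectors, the `to_f32` helper and the tolerances are assumptions chosen for illustration; `math.isclose` is the standard-library function that takes both a relative and an absolute limit.

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round to the nearest 32-bit float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

u = (0.1, 0.2, 0.3)
v = (0.4, 0.5, 0.6)

kept_wide = dot(u, v)         # stays at full precision
spilled = to_f32(dot(u, v))   # same value after being truncated to 32 bits

print(kept_wide == spilled)   # False: exact equality fails
# Relative limit for ordinary values, absolute limit for values near zero.
print(math.isclose(kept_wide, spilled, rel_tol=1e-6, abs_tol=1e-9))   # True
```

The right tolerances depend on the scale of the data, so the values above should not be taken as a recommendation.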

@poleguy @TomF We had an epsilon acceptable for the scale. We solved it by debugging at the assembly level, reasoning through the problem and forcing the comparison to happen on the CPU. Later, our engine changed its math code to SSE and we couldn't repro the bug anymore when we tried.
@davidcanadasmazo @poleguy @TomF
BTW: Similar fun bugs like the ones when moving from x87 to SSE can be had when moving from SSE (or x86_64 which implies SSE2) to an extension level that supports FMA

@davidcanadasmazo @poleguy @TomF
With FMA I ran into the problem that a simple cross product didn't work as intended: when both vectors had the exact same numbers (or even were identical) the cross product is supposed to be (0, 0, 0), but with FMA it isn't. Suddenly `x1*y2 - y1*x2` is calculated with one regular multiplication and FMA for the rest of the term, and FMA uses higher internal precision, so even with identical values the result is != 0

Solution in GCC: `-ffp-contract=off`
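The effect can be reproduced without FMA hardware. In this sketch the `fma` function is an emulation written for illustration: it uses exact rational arithmetic from `fractions` and rounds once at the end, which is what a fused multiply-add does. Which of the two products a real compiler fuses is an assumption here.

```python
from fractions import Fraction

def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: a*b + c computed exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def cross_z_plain(x1, y1, x2, y2):
    # Both products are rounded to double before the subtraction.
    return x1 * y2 - y1 * x2

def cross_z_contracted(x1, y1, x2, y2):
    # One product is rounded as usual, the other is fused into the subtraction.
    return fma(x1, y2, -(y1 * x2))

x, y = 0.1, 0.3
print(cross_z_plain(x, y, x, y))        # 0.0
print(cross_z_contracted(x, y, x, y))   # tiny but non-zero
```

The non-zero result is exactly the rounding error of the one product that was rounded, which the fused operation no longer cancels out.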

@Doomed_Daniel @davidcanadasmazo @TomF Calling "-ffp-contract=off" a "solution" for GCC is an interesting choice. It is a solution if the problem is defined as the answer doesn't come out quite at zero as expected.

But I would argue that the floating point equals comparison is the problem. So the proper solution is to define a proper epsilon and do a less than comparison.

Am I wrong to think equals comparisons are never a great idea?

@poleguy @davidcanadasmazo @TomF
That's true generally, though especially around 0 using an epsilon is a bit tricky if you have checks that use < 0 or similar (like "on which side of a line is a point" or "is point in triangle") - the rounding error might change the sign.

And of course usually you run into such problems with code you didn't even write yourself that worked for years and suddenly stops working which makes decisions other than "try to restore previous behavior" harder
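One common way to handle sign tests near zero is a three-way answer with a tolerance band, sketched below. The function name, the return values and the `eps` default are assumptions for illustration. An absolute `eps` like this only makes sense for coordinates of roughly unit scale.

```python
def side_of_line(a, b, p, eps=1e-12):
    """Classify point p relative to the directed line a -> b, with a tolerance band."""
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    if cross > eps:
        return "left"
    if cross < -eps:
        return "right"
    # Too close to call: a rounding error could have flipped the sign.
    return "on the line (within tolerance)"

a, b = (0.0, 0.0), (1.0, 1.0)
print(side_of_line(a, b, (0.0, 1.0)))   # left
print(side_of_line(a, b, (1.0, 0.0)))   # right
print(side_of_line(a, b, (0.5, 0.5)))   # on the line (within tolerance)
```

The caller still has to decide what "too close to call" means for its own purposes, which is the tricky part the post describes.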

@poleguy @davidcanadasmazo @TomF
And generally even floating point is supposed to behave deterministically to some degree, IIRC the compiler is generally not even allowed to use associativity rules on floating point math (when conforming to IEEE-754, i.e. not using -fassociative-math which is part of -ffast-math or -funsafe-math-optimizations).
Unfortunately IEEE-754 has an exception for those contractions that (in GCC/Clang) can be disabled with the aforementioned compiler flag
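Why regrouping is off limits by default is easy to show. In this small sketch the three values are chosen so that the two groupings give visibly different answers in 64-bit floating point.

```python
a, b, c = 1e16, -1e16, 1.0

left_to_right = (a + b) + c   # a and b cancel exactly first, then 1.0 is added
regrouped = a + (b + c)       # the 1.0 is lost when added to -1e16

print(left_to_right)   # 1.0
print(regrouped)       # 0.0
```

A compiler that reassociated this sum would silently change the result, which is why it needs an explicit opt-in flag.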
@poleguy @Doomed_Daniel @davidcanadasmazo Yes - any time you use equality on floats you need to be incredibly careful about what you mean by that.
@TomF Such an awesome story.
@TomF that was a great story, thanks for sharing it.