OK, OK, ok, story time.

Way back when (early 90s), when Omni was consulting for McCaw Cellular (or AT&T Wireless, not sure which it was at the time), we were working on NeXTSTEP apps for sales, customer care, and such for cell phones, nationwide. We'd occasionally get crash reports, and I don't even remember how those got back to us in the days before automated collection and reporting, but eventually we were able to reproduce the crash.

Back then NeXT was using gcc as the system compiler and it turns out that the `new[]` C++ operator would allocate room for the stuff you asked for, plus an extra word at the front of the block, where it would store the count (and then give you the shifted address). Except at some point that changed because it was silly and that redundant count was removed. Except that *also* `delete[]` still took the pointer given and loaded the word *before* it to load the count (and then did nothing with it). Given enough hours, you'd eventually have `delete[]` looking off into a previous unallocated page get a stern talking to from the MMU.
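
For anyone who hasn't seen the trick, here is a minimal sketch of the "count in the word before the pointer" layout described above. The helper names (`alloc_counted`, `stored_count`, `free_counted`) are made up for illustration and are not gcc's actual runtime; the sketch only shows why reading the word before a pointer is safe when the allocator put it there, and not otherwise.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical helpers, not gcc's real implementation of new[]/delete[].
// One size_t-sized header slot sits in front of the caller's elements.
constexpr std::size_t kHeader = sizeof(std::size_t);

// Allocate room for n ints plus the header, store n in the header,
// and hand back the shifted address.
int *alloc_counted(std::size_t n) {
    void *raw = std::malloc(kHeader + n * sizeof(int));
    if (raw == nullptr) return nullptr;
    *static_cast<std::size_t *>(raw) = n;
    return reinterpret_cast<int *>(static_cast<char *>(raw) + kHeader);
}

// Read the word *before* p. This is only valid when p came from
// alloc_counted; on any other pointer it reads memory the caller does not own.
std::size_t stored_count(const int *p) {
    const char *raw = reinterpret_cast<const char *>(p) - kHeader;
    return *reinterpret_cast<const std::size_t *>(raw);
}

// Undo the shift before giving the block back to the allocator.
void free_counted(int *p) {
    if (p != nullptr) std::free(reinterpret_cast<char *>(p) - kHeader);
}
```

The bug in the story is what you get when the allocating side stops writing the header but the freeing side keeps reading it: the read lands on whatever happens to precede the block.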

Having discovered this, and not having a way to patch the compiler or system libraries, I instead wrote a Perl script to process the assembly output of the compiler, find instances of this, and fix them. I hand-verified that each fix was correct for as long as the hack was needed, and every compiled file went through this until we got new tools that fixed the problem for real.

Duct tape and baling wire, y'all.

@tjw stashing stuff at negative offsets from the pointer is such a good trick
@Catfish_Man back then it wasn't an issue, but these days my skin crawls thinking about memory alignment issues and maybe that's a sign I'd be OK as a frameworks person, but instead I'll just thank y'all. 😂
@tjw @Catfish_Man Alignment can be so fun. For example, did you know that ARM64’s ldp/stp instructions, to load or store 128 bits at a time, are atomic? Well, unless it’s misaligned and crosses a cache line boundary, then it still works except it’s no longer atomic.
@mikeash @Catfish_Man me to ISA folks, “thanks?” 😬
@tjw @mikeash @Catfish_Man I did know that. Don’t ask me how I found out but it made me sad. (Misaligned ring buffer placement after a field was added to a header struct and it misaligned the ring segments)
@mikeash @tjw @Catfish_Man Think of the transistors they’d have saved by simply making that crash.
@mikeash @tjw @Catfish_Man IIRC atomicity of LDP/STP is a relatively recent development in AArch64, but it has been guaranteed by Apple's CPUs for longer, perhaps all the way back to the first 64-bit iPhone.
@gparker @mikeash @tjw @Catfish_Man “recent”. It’s 8.3 or 8.4 I don’t remember
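
A rough sketch of the two alignment hazards in this sub-thread: a 16-byte access that straddles a cache line, and a header struct whose size stops being a multiple of 16 after a field is added. Everything here is an assumption for illustration: the 64-byte line size (real CPUs differ), and the `RingHeader`, `Segment`, and `Ring` layouts, which are invented and not anyone's actual ring buffer. It only checks the geometry; whether a given access is atomic is up to the hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Check your target; line sizes vary.
constexpr std::uintptr_t kCacheLine = 64;

// True if an access of `size` bytes starting at `addr` spans two cache lines.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 &&
           (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// Hypothetical 16-byte ring segment, the kind of thing a paired
// load/store would touch in one go.
struct Segment {
    alignas(16) std::uint64_t words[2];
};

// Hypothetical header. With three 32-bit fields its size is 12, so code
// that places segments at `base + sizeof(RingHeader)` would misalign them.
struct RingHeader {
    std::uint32_t head;
    std::uint32_t tail;
    std::uint32_t added_later;  // the sort of new field that shifts what follows
};

// Letting the compiler lay out the ring keeps the segments aligned,
// because alignas(16) on Segment forces padding after the header.
struct Ring {
    RingHeader header;
    Segment segments[8];
};

// Compile-time guards that would flag the misalignment at build time.
static_assert(alignof(Segment) == 16, "segments must be 16-byte aligned");
static_assert(offsetof(Ring, segments) % 16 == 0,
              "segments must start on a 16-byte boundary");
```

For example, `crosses_cache_line(56, 16)` is true (bytes 56 through 71 span two 64-byte lines), while `crosses_cache_line(48, 16)` is false.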

@tjw

Alignment was a thing then, too. NeXTSTEP originally ran on m68k, whose alignment-related behavior is the same as x86: unaligned access is allowed but slow.

@Catfish_Man

@tjw It’s never the compiler’s fault, until it is. Technically the standard library in this case, but that is only semantics.

@tjw shades of an issue in Rogue Wave (as shipped with Borland C++) where they’d use placement new/delete in some operations, but neglected to mark the memory pool free after placement delete. Our video server was using an affected function hundreds of times per second, spiraling usage.

We had to patch every new BC++ library for like a year before they shipped the fix (which we repro’ed, reported with suggested fix, etc)

@tjw I was on one of the product development teams there, and I don't think I ever knew this!

But we still had gcc issues. Binaries were 32-bit and the set of libraries was so big that we bumped up against 4GB binary sizes and the linker crashed. So you could only get away with one or two libraries with debugging symbols at any one time. Try to track down a bug and guess wrong about where it was, and you'd get to choose a different library with symbols, rebuild for an hour, and try again.

@gregtitus oh yeah! Also trying to remember if it was there or just at Omni that we got the maximum number of dylibs raised. 🤣

@tjw On the day Opteron was supposed to tape out, a colleague discovered a logic bug. After some analysis, we figured out we could fix it by disconnecting a wire from one gate and attaching it to another. But running through our design flow would take days, and the ripple effect of changing connectivity could cause more problems. So I loaded the chip mask into VIM and modified the polygons directly, then we taped it out.

Don’t remember for sure, but I don’t think we told management 🙂

@cmaier @tjw Now that's what I call proper low-level debugging 👍

@cmaier @tjw here is a hw story for you. A chip comes back. It won't come out of reset. They find it was because someone messed up and had a connection done that was only supposed to be there while doing full chip testing, and it was to be connected to ground rather than to always on.
The few chips that came back they were able to salvage by zapping the connection to the right location.

I don't have the full details as I heard this 2nd hand and I am just a software guy. This was also 8 years ago so this was still on a newish process, I think 45nm.

@cmaier @tjw “…and this, children, is why we use vim instead of vi.”

Do you remember what the bug would’ve affected?

@AnachronistJohn @tjw nope. I think it was the load/store unit though.
@cmaier Amazing. It would have taken me years to sleep again.
@tjw Nah. We were cowboys. There were only a couple dozen people total working on that chip.
@cmaier cool! What format was the mask data file?
@tomf It was a long time ago, but I believe it would have been .def, since I was just changing the wires and not anything below M1. Probably once I edited it we’d have launched LVS just to be safe, and that would have re-generated the gds. Our whole flow was text files back then, with little binary databases that tools would create on their own as-needed.

@tjw

What a twist ending!

I was sure it would end like most campfire horror stories: "and that Perl script is still running production code to this day!" Bwahahaha!

@tjw

I might've worked for the same outfit in 1997. I was hired from a job at Goddard Space Flight Center to help form a new unit doing system security consulting (I wasn't particularly well versed in the subject, but was keen).

By the time my partner and I had driven across the country, the company had been sold, and the security gig was dropped. So, I got farmed out to AT&T Wireless to do odd coding tasks.

The outfit was called Platinum Tech, and was a NeXTSTEP shop.

@lolcat I was supposed to be on a six week project and ended up there seven years, so yeah we were probably there around the same time!
@tjw reminds me of the Evil Mangler that the Glasgow Haskell Compiler used to have https://archives.haskell.org/code.haskell.org/ghc-scp/ghc/docs/comm/the-beast/mangler.html

@tjw iPhone almost performed a similar hack workaround for some bad compiler code generation. iPhone OS 2.0 introduced the App Store and the new ObjC ABI. An important new feature was non-fragile ivars: you could add instance variables to a class without needing to recompile all of its subclasses. But there was a bug in the compiler's codegen in one kind of generated accessor method, making the ivar access fragile again.

Of course buggy non-fragile ivars work fine until they break later. We didn't discover the problem until after iPhone OS 2.0 shipped. To fix the problem we almost shipped iPhone OS 2.x with an exciting runtime introspection that could identify the broken accessor methods and dynamically replace them. Sadly this heroic engineering solution was dropped in favor of teamwork: we instead scanned all App Store binaries for the bad codegen and simply asked the affected developers to recompile with an updated compiler.

@tjw "Given enough hours, you'd eventually have `delete[]` looking off into a previous unallocated page get a stern talking to from the MMU."

Even worse, the memory allocator itself likely used the word before the allocation for its own metadata, so it would have been readable in even more cases. From what I know of early Mac OS X malloc, I bet the only way that word could possibly be unreadable on NeXT would be if the allocation were also very large, at which point malloc switched to allocating pages directly from the kernel and storing its own metadata out of line.

@tjw I think this is what people in the biz call "real engineering"