As it happens, we still use CVS in our operating system project (there are reasons for doing this, but migration to git would indeed make sense).

While working on our project, we occasionally have to do a full checkout of the whole codebase, which is several gigabytes. Over time, this operation has gotten very, very, very slow - I mean "2+ hours to perform a checkout" slow.

This was getting quite ridiculous. Even though it's CVS, it shouldn't crawl like this. A quick build of CVS with debug symbols and sampling of the "cvs server" process with Linux perf showed something peculiar: the code was spending the majority of its time inside a single function, get_memnode().

So what is this get_memnode() function? Turns out this is a support function from Gnulib that enables page-aligned memory allocations. (NOTE: I have no clue why CVS thinks doing page-aligned allocations is beneficial here - but here we are.)

The code in question has support for three different backend allocators:
1. mmap
2. posix_memalign
3. malloc

Sounds nice, except that both 1 and 3 use a linked list to track the allocations. The get_memnode() function is called at deallocation time to look up the bookkeeping data to pass to the backend deallocation function (the mapping size for mmap, the original pointer for plain malloc). The node search looks like this:

for (c = *p_next; c != NULL; p_next = &c->next, c = c->next)
  if (c->aligned_ptr == aligned_ptr)
    break;

The get_memnode() function is called from pagealign_free():

#if HAVE_MMAP
  if (munmap (aligned_ptr, get_memnode (aligned_ptr)) < 0)
    error (EXIT_FAILURE, errno, "Failed to unmap memory");
#elif HAVE_POSIX_MEMALIGN
  free (aligned_ptr);
#else
  free (get_memnode (aligned_ptr));
#endif

This is an O(n) operation. CVS evidently makes a huge number of small allocations, so most of the CPU time ends up in get_memnode(), searching the list for the node to remove.

Why should we care? This is "just CVS" after all. Well, Gnulib is used in a lot of projects, not just CVS. While pagealign_alloc() is likely not the most used functionality, it can still end up hurting performance in many places.

The obvious easy fix is to prefer the posix_memalign method over the other options (I quickly did this for my personal CVS build with a tactical #undef HAVE_MMAP). Even better, the list code should be replaced with something more sensible. In fact, there is no need to store the original pointer in a list at all: a better solution is to allocate enough extra memory and store the original pointer just before the calculated aligned pointer. The original pointer can then be fetched from a negative offset of the pointer passed to pagealign_free(), making deallocation O(1).
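A minimal sketch of that negative-offset scheme (function names are mine, not Gnulib's): over-allocate, round up to the alignment, and stash the original malloc pointer just below the aligned block, so the free path needs no list walk at all.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch of the O(1) scheme (names hypothetical, not the Gnulib API):
   over-allocate, align upwards, and store the original pointer just
   below the aligned address. */
static void *
pagealign_alloc_o1 (size_t alignment, size_t size)
{
  /* Extra room for the alignment slack plus the saved pointer. */
  void *raw = malloc (size + alignment + sizeof (void *));
  if (raw == NULL)
    return NULL;
  uintptr_t aligned = ((uintptr_t) raw + sizeof (void *) + alignment - 1)
                      & ~(uintptr_t) (alignment - 1);
  ((void **) aligned)[-1] = raw;   /* original pointer at negative offset */
  return (void *) aligned;
}

static void
pagealign_free_o1 (void *aligned_ptr)
{
  if (aligned_ptr)
    free (((void **) aligned_ptr)[-1]);   /* O(1), no list search */
}
```

Note this trades a pointer-sized header (plus alignment slack) per allocation for constant-time frees, which is exactly the right trade when the allocation count is large.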

I tried to report this to the Gnulib project, but I have trouble reaching gnu.org services currently. I'll be sure to do that once things recover.

#opensource #development #bugstories

So what was the bug I was looking for?

We recently switched off_t from 32-bit to 64-bit. Once this change was made, the "ar" tool suddenly started generating corrupt archives when objects were deleted from the archive.

It took quite a long time to find out that this snippet in the open function was the root cause:

if (!error && (mode & O_TRUNC))
  {
    syscall(SYS_ftruncate, fd, 0);

Needless to say, ftruncate was getting a very large length argument passed to it, resulting in the underlying filesystem outright refusing to perform any action -> O_TRUNC would not truncate existing files -> ar generated a corrupt archive whenever the file shrank.

It's obvious now, but believe me it wasn't easy to find it.

#bugstories

500 Mile Email - Absurd Software Bug Stories

I found and fixed yet another use-after-free bug in pdksh. This bug is likely about 30 years old.

- c_eval (shell "eval" command) creates a new "struct source" "s" from the current env ATEMP: https://github.com/Orc/pdksh/blob/131021a130337e3313195c9d321c6aa40b4e291d/c_sh.c#L429

- c_eval passes the created source to "shell" function:
https://github.com/Orc/pdksh/blob/131021a130337e3313195c9d321c6aa40b4e291d/c_sh.c#L459

- "shell" calls "compile" function: https://github.com/Orc/pdksh/blob/131021a130337e3313195c9d321c6aa40b4e291d/main.c#L593

- "compile" function assigns the global "source" pointer to the passed source: https://github.com/Orc/pdksh/blob/131021a130337e3313195c9d321c6aa40b4e291d/syn.c#L790

At some point (I didn't bother tracing exactly where and in what circumstances - certain parts of the gmp-6.3.0 configure script trigger it, at least on our platform) the ATEMP area associated with the "env" gets released via a call to "reclaim", but the global "source" pointer continues to point to the already-released memory. This is a classic use-after-free condition, leading to all kinds of havoc and eventually a crash.

The fix is easy enough: Save and restore the original source pointer in c_eval:

--- pdksh-5.2.14/c_sh.c 1999-07-13 19:54:44.000000000 +0300
+++ pdksh-5.2.14-fixed/c_sh.c 2025-05-20 23:08:41.162128448 +0300
@@ -423,6 +423,8 @@
 	char **wp;
 {
 	register struct source *s;
+	struct source *sold;
+	int ret;
 
 	if (ksh_getopt(wp, &builtin_opt, null) == '?')
 		return 1;
@@ -456,7 +458,10 @@
 		exstat = subst_exstat;
 	}
 
-	return shell(s, FALSE);
+	sold = source;
+	ret = shell(s, FALSE);
+	source = sold;
+	return ret;
 }
 
 int

#bugstories #development #morphos


Just spotted a sneaky bug in file 5.46: due to forgetting to update the byteswap logic in a recent commit that changed the magic file header fields, the code is unable to load magic files compiled on platforms of the opposite endianness. https://bugs.astron.com/view.php?id=656

Due to the nature of this bug there are no warnings; the magic file just appears to have no content. The cause was the 16-bit "flag" field not being swapped, leaving the code unable to find any rules via "if ((m->flag & mode) != mode)", as the mode bit being checked was off by 8 bits.

1. incorrect byteswaps: https://github.com/file/file/blob/f77a1092e1862c2295a21077c9e28c2614a0eede/src/apprentice.c#L3673

2. comparison failing due to the missing byteswap: https://github.com/file/file/blob/f77a1092e1862c2295a21077c9e28c2614a0eede/src/apprentice.c#L1177
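To see why the check fails silently, consider this sketch (the constants are illustrative, not file's actual values): a 16-bit flag read without byteswapping has its bytes exchanged, so the tested mode bit sits 8 positions away and "(flag & mode) != mode" holds for every rule.

```c
#include <stdint.h>

/* Illustrative sketch (constants hypothetical): reading a 16-bit flag
   field without byteswapping moves each bit 8 positions, so the mode
   test rejects every rule. */
static uint16_t
bswap16 (uint16_t v)
{
  return (uint16_t) ((v << 8) | (v >> 8));
}

/* Mirrors the failing check: a rule is kept only if all mode bits
   are present in its flag field. */
static int
rule_matches (uint16_t flag, uint16_t mode)
{
  return (flag & mode) == mode;
}
```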

#bugstories


The code originally made a copy of a struct before modifying the copy. The original was then used afterwards. I entirely missed the later use and that it was *critical* that the original struct was used as is. So I passed a subtly modified struct to the later processing, which, in combination with a second bug I had introduced some time earlier, caused all kinds of havoc.

There was *another* bug I had also introduced, which funnily had similar effects. That bug was added months earlier, and it affected only older OS versions. I typically run only the bleeding-edge version during development (though I had tested the change with older versions, too). Unfortunately, the issue was random, as it depended on stack contents to trigger, and thus went unnoticed until the additional scrutiny of this intense debugging session.

The combination of these factors made this highly frustrating to debug, as any kind of A-B testing fails when you have multiple or random issues.

#bugstories

You might be interested to learn how I tracked this bug down. In short, I was quite lucky.

I was working on adding HTTP/3 support to a WebKit-based browser. Suddenly, the browser build began to crash at exit. Closer inspection revealed that the destructor called by __do_global_dtors_aux was for some reason calling free() with a random pointer. Being the kind of guy who really likes inspecting assembly code just for kicks, rather than debugging actively I turned to just objdumping the binary. What I saw was this:

...
390f630: 48 00 00 04 b 390f634 <xmlDestructor+0x2c>
390f634: 3d 80 00 00 lis r12,0
390f636: R_PPC_ADDR16_HA .text+0x4d8fc8
390f638: 39 8c 00 00 addi r12,r12,0
390f63a: R_PPC_ADDR16_LO .text+0x4d8fc8
390f63c: 7d 89 03 a6 mtctr r12
390f640: 4e 80 04 20 bctr

...
004d8fc8 <xmlFreeRMutex>:
...

This is the libxml2 destructor, and it should be calling xmlCleanupParser:

static void
ATTRIBUTE_DESTRUCTOR
xmlDestructor(void) {
    /*
     * Calling custom deallocation functions in a destructor can cause
     * problems, for example with Nokogiri.
     */
    if (xmlFree == free)
        xmlCleanupParser();
}

So what was going on? Why would the code branch to the very next instruction and then do a bctr-based jump to another location? Initially I was a bit baffled, as this looked like rather weird code for a compiler to generate. When I inspected libxml2.a itself, sure enough, it did not contain such code, just a simple branch to xmlCleanupParser. At this stage I could determine that this code had to have been created by the linker (ld). Also, the branch was going backwards by about a 52 MB offset. This helped me realise that this was a so-called "branch relaxation trampoline", and it allowed me to home in on the potential location of the bug.

I could also see that the offset of the target xmlCleanupParser function within the .text segment of the compiled object was:

0000007c <xmlCleanupParser.part.0>:

In the actual binary, that was at offset:

004d8f4c <xmlCleanupParser.part.0>:

Now, if you remember, we were jumping to offset 0x4d8fc8 (xmlFreeRMutex) instead. These addresses differed by exactly 0x7c bytes.

At this point, it was quite trivial to locate the part of our code that didn't correctly disregard the relocation "addend", resulting in the offset being applied twice.
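A sketch of the arithmetic (the section base is hypothetical; the other numbers are the ones observed above): the resolved symbol address already includes the symbol's 0x7c offset within the object's .text, so applying the addend on top lands exactly 0x7c past the real target.

```c
#include <stdint.h>

/* Sketch of the relocation bug (layout hypothetical): the resolved
   symbol address already contains the symbol's offset inside the
   object's .text, so adding the reloc addend on top applies the
   offset twice. */
static uint32_t
trampoline_target_buggy (uint32_t resolved_sym_addr, uint32_t addend)
{
  return resolved_sym_addr + addend;   /* addend applied a second time */
}

static uint32_t
trampoline_target_fixed (uint32_t resolved_sym_addr)
{
  return resolved_sym_addr;            /* addend already folded in */
}
```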

I was more than lucky in that the bug affected functionality (a destructor) that was hit 100% of the time, rather than one of the gazillion other code paths in the browser. I also later noticed that a total of nine such branches had been affected. The other eight were not hit normally (or maybe never!).

#debugging #bugstories #bugs #development #coding

I noticed a bug in our binutils port that had been generating semi-randomly broken branch relaxation trampolines for decades.

Why did it take so long to notice this code generation bug? The branch has to reach farther than ±32 MB for branch relaxation trampoline generation to kick in. Even then, not all branches were affected: the relocation type mattered, and the target had to be in a link library, in a specific kind of segment, and not at its beginning. Finally, for the bug to actually come into play, the branch had to be taken, too.

What did it do wrong, then? It added the offset of the target function within the link library's .text segment twice. So instead of jumping to the intended function, it jumped somewhere random after it. Funnily, the jump might end up hitting code that did not crash but did something unintended. For all you non-developer peeps: that is very, very bad.

Fun features of this bug: since whether the trampoline was generated or not depended on the order of the object code and on where the affected call was made from, the bug would pop in and out of existence with even the smallest changes to the code or link libraries. If you know a thing about debugging, this is not very conducive to locating bugs.

Needless to say, this one took a lot of head scratching to finally figure out. The fix was a change to a single line.

This bug was old enough to get into bars and drink alcohol legally.

EDIT: Oops, this was supposed to be a reply to this post: https://infosec.exchange/@harrysintonen/114387494121632632 - so added some context.

#bugstories #bugs #development #coding


Finally, my first post is done.

https://jkad.de/post/0

#embedded #bugstories


Bug story time.

We run many CI servers. All report metrics on queue depth and blocked cause. We alert if the queue stays too big for too long. This had been reliable for years.

On a Sunday a user paged. Nothing had built in 20 hours. The queue had > 1500 builds and no alerts had gone off!

What?!

A recent release introduced a StackOverflowError when checking certain builds for their blocked cause. Builds weren't getting scheduled. Worse, the same error broke metrics reporting, so no alert ever fired.

#BugStories #CICD #SoftwareEngineering