I should post about the latest #retrocomputing project I started.

Problem: I'd like an open-source, self-hosting C compiler on 8086, that supports the large memory model, overlays, and enough C89 to build Lua.

This seems to not exist! K&R is much more common in this size category. Around the time of C89, many compilers bloated to the point of requiring a 386 or better host, though they could still target 8086. The 8086 holdouts were, in general, commercial products that never got a source release.

One notable exception was DeSmet C http://www.desmet-c.com. It seems to have started life as a commercial PC fork of Bell Labs PCC, a small and sturdy K&R compiler. (Edit: I'm no longer sure of this: DeSmet had a shareware release as "PCC", but this stood for Personal C Compiler, while Bell's stood for Portable C Compiler. So the similarity in naming was probably just a coincidence.) DeSmet 3.1 added "draft ANSI C" support, but this is incomplete and riddled with code-gen bugs. This version later found itself on GitHub as OpenDC https://github.com/the-grue/OpenDC.

Aside from all the bugs, this is a pretty cool package: its dis/assembler, debugger, text editor, and some other utilities were also open sourced, and it runs on an 8088 with 256K RAM and two 360K floppies.

The OpenDC person did a good job packaging things up into an easily buildable form, and fixing syntax errors that probably came from running the sources through a different compiler version than expected, so... yes, it does indeed build and self-host... and I've done this on my Book 8088.

So now I will try to fix the bugs and add the missing C89 features. There are many, many of both... gulp.

The #DeSmetC compiler codebase is the hairiest code I've ever hacked on. K&R style, many global variables, short cryptic names, spooky action at a distance, the shotgun-surgery pattern for type handling splatted everywhere, oh baby.

For all that, I managed to fix the codegen bug from the Github issues on the ~second day of working on the compiler... that's the beauty of a small codebase.

My fork is here: https://gitlab.cs.washington.edu/fidelp/open_desmet_c
1 bug down, 999 to go...

#retrocomputing

Ooh, the-grue, the current maintainer of OpenDC #DesmetC, took my code-gen patch, how awesome! Usually when I pick up an old codebase like this, the maintainer is long gone.

So @linear if you end up wanting to submit patches, that's the place: https://github.com/the-grue/OpenDC
My fork will remain just an unofficial fork.

GitHub - the-grue/OpenDC: DeSmet C - Open source and completely built with latest toolchain


Been writing regression tests for arithmetic, which caught another #DesmetC code-gen bug that I was able to fix. https://github.com/the-grue/OpenDC/issues/5

Previously, illegal asm instructions were being generated, as shown.

#retrocomputing

Hmm, I found yet more weirdness with signed chars in #DesmetC. If you do:

signed char i, j;
int k;
// ...
k = i + j;

k's upper bits become a sign-extended version of i, not a sign-extended version of the result.

Much of this pain seems to trace back to a quirk (a deviation from the C standard) documented in the manual: math on char types produces a char result, not an int result. Perhaps to save a few instructions? Anyway, however this was implemented, it seems to work fine for char but not for signed char.

Okay, #DesmetC sign-extension in (signed char -> int) promotion now works in my branch. (signed char -> long) does not work yet; it neglects to sign-extend, acting more-or-less like (unsigned char -> long). That's next to fix.

Thankfully, the codebase is small enough that it's not too hard to find the logic responsible for any given codegen decision.

Also, I came up with the trick of having the assembler backend emit comments into the output asm file. This lets me do something like printf debugging to check which codegen cases are being hit and annotate the assembly they're generating.

Naturally this is all test-driven: I'm accumulating regression tests for the broken codegen I've been fixing, and usually the way I find codegen bugs is by writing new tests expecting the mathematically correct answer, and watching them immediately fail.

#retrocomputing

#DesmetC (signed char -> long) promotion is now working on my branch: got it on the first try, which hopefully means I'm internalizing the codebase. Now I will start on tests for mixed-sign arithmetic.

#retrocomputing

It's been a pretty productive night in the ol' #DesmetC codebase. Regression tests finally checked in, all the mixed-size integer addition/subtraction involving signed chars I could think of is exercised and passing, nice.

Then I try i8 * i8 -> int and it instantly breaks, not so nice 🫠.
Oh well, that gives me something to fix tomorrow.

Also, I haven't ventured into floating point conversion land yet, either. I'm sure that'll have plenty of dragons when used with signed char.

All this is making me appreciate the wisdom of BCPL and B, which have just a single word-sized type -- or #Forth, which takes that and adds char, as a treat.

#retrocomputing

Oh gosh

MOV CL,BYTE [BP-2]
XCHG CX,AX
CBW
XCHG CX,AX
XCHG CX,AX
CBW
XCHG CX,AX
IMUL CX

When CL absolutely, positively, needs to be sign extended before multiplication 

The mul-div codegen path in #DesmetC is sorta a nightmare because so much is reused between multiplication, division, and mod, and because there are some dodgy special-cases in here from 1990 that demonstrably do the wrong thing. Proceeding slowly with machete and torch, laying down test cases as I go.
Heh, well, can't imagine why the assembler doesn't like that instruction.
#DesmetC mul/div/mod i8 bugs seem more or less vanquished, now moving on to i8 comparison, which I probably should have done first, as it's also turning out to be broken.... -1 > 1, don't you know?

It is satisfying to watch the test count creep steadily upward. I like to leave off just after writing a failing test, to give me a clear "next task" when I resume work.

#DesmetC #retrocomputing

Huh, weird: just noticed that in #DesmetC I can keep redeclaring a local variable with the same name and it works. My test suite was doing this by accident, with seemingly no problems.

int n = f1();
printf("%d\n", n);
int n = f2();
printf("%d\n", n);

This isn't even a C99 compiler, so declaring a variable anywhere other than at the start of a block should be illegal anyway.

#retrocomputing

$ cat cmp.c
#include <stdio.h>
int main()
{
    int i = -1;
    unsigned j = 1;
    printf("%d\n", i < j);

    return 0;
}

$ gcc cmp.c

$ ./a.out
0

Hmm, at some point I probably knew this is how the comparison would turn out in C. I was expecting it to coerce both operands to the smallest common type that could represent both of them (which would have been long), then compare. Instead, we get two's-complement funny business.

when u smack the compiler so hard it turns into dwarf fortress
i deserved it though, that 13 byte program utilized 65489% of my available system resources

OK, finally I'm reasonably confident that the integer comparisons in #DesmetC are working correctly, after beating them into submission against a test-case-generator, with tcc on my Linux machine serving as the test oracle.

#retrocomputing

Cool, those same test cases pass on OpenWatcom as well, so I'm fairly confident the comparison behavior is correct.

After all my testing, how many bugs can #DesmetC math expressions still have? Well... the next problem is that the result type doesn't always match the behavior required in C89 §3.2.1.5 Usual arithmetic conversions. Like if you do int + unsigned, there are circumstances where the result might be int, not unsigned as would be standard.

My tests weren't catching this because they were all structured like,

int i;
unsigned j, k;
k = i + j;
// now test the value of k against expectations

where the assignment coerced the result to a particular type. This meant tests weren't checking the result's "natural" type, and sometimes that was wrong.

Latest #DesmetC "explorations with machete and torch": the compiler source has numbered constants for each supported C datatype. Normally you'd use an enum for this sort of thing, but this codebase uses #defines. The constants were numbered in a strange order, and I wanted to re-sort them into the order of the "usual arithmetic conversions" to simplify some logic.

This broke code-gen, emitting illegal instructions. Several hours later, I found out why: the values CCHAR=1 and CINT=2 were used directly in hex arithmetic that determines which x86 opcode to emit, so renumbering them produced absurd instructions. After correcting that, we're back to self-hosting OK.

I'm hoping this will make it possible to retire a bunch of one-off type promotion logic scattered around the compiler, in favor of a few central functions closely mapping to the C89 standard.

#retrocomputing #SoftwareArchaeology

realized my now-thousands of #DesmetC binop expression evaluation test cases are still all just variable [op] variable cases, and don't exercise variable [op] literal or literal [op] variable at all yet. I know there will be more bugs there, because the compiler does constant folding as an entirely separate code path.

On the plus side I seem to have corrected lots of #DesmetC mul/div/mod bugs in one fell swoop by rewriting the relevant part of its codegen, lmao. So many type hacks, gone.

On the minus side, I think now I might be finding some bugs in #OpenWatcom C, which was supposed to be my infallible test oracle, dammit ;D

POP QUIZ: On a 16-bit compiler, what answer would you expect for:

int i;
unsigned int j;
i = -32768; j = 36;
printf("%d\n", (i / j));
  • -911 (30%)
  • -910 (20%)
  • 910 (40%)
  • divide overflow exception (10%)

@psf I think all of them are valid results. If I'm not mistaken, the assignment to i is undefined behaviour.

I think the best way to handle this would be to set the user's computer on fire, but I'm not sure if a C compiler can do that.

@soulsource If undefined behavior can set my PC on fire, at this rate I'm sure I'll experience that soon.

@psf quoting C99 6.3.1.8 Usual arithmetic conversions:

"if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type."

X and unsigned X have the same conversion rank (cf. 6.3.1.1). so the answer is that -32768 is converted to unsigned int. assuming a two's complement machine, that's just 32768.

FWIW this aligns with the residue class ring ℤ/2ⁿℤ which *both* int and unsigned int represent on a two's complement system.

always remember, int is not ℤ and unsigned int is not ℕ.

@pesco I interpret the standard the same way you do, yielding a result of 910.

If you are #OpenWatcom, though, the answer is divide overflow!

x86 div instructions take a dividend that is twice as wide as the quotient or remainder. In 16 bit mode the relevant ones are,

  • DIV: u32 / u16 -> u16
  • IDIV: i32 / i16 -> i16

What OpenWatcom does is:

  • sign extend -32768 (0x8000) to 0xffff8000
  • DIV 0xffff8000 / 36
  • die horribly on divide overflow

If it wanted a signed result it should have used IDIV. If it wanted an unsigned result it should have zero-extended the operand instead of sign-extending it. Looks like a codegen bug.

I should probably check if the latest version of OpenWatcom still does this, and if so, see about reporting it. It isn't the only weird result I noticed in OpenWatcom's math, either.

#DesmetC now has 10912 tests of its arithmetic and logical operations. This approach continues to turn up bugs, which I then fix.

#retrocomputing

I have discovered that the >> operator in #DesmetC always performs a logical right shift, never an arithmetic right shift, even when the operand type is signed. This seems to be undocumented.

The first round of test generation for #DesmetC is complete, yielding over 30,000 test cases for integer comparison, integer arithmetic, and constant folding. Numerous bugs dating to the late 1980s were found and corrected. This is now out for review as pull request 9.

#retrocomputing #softwarearchaeology

Someone asked me in the pull request comments how I'd generated thousands of test cases for #DesmetC, and I figured the #retrocomputing heads here in the fediverse might be interested in the answer, so I'm linking it here too: https://github.com/the-grue/OpenDC/pull/9#issuecomment-3866218702

I didn't put a lot of thought into the approach, and I'm sure it could be made much more elegant, but damn if it didn't flush out a lot of bugs.

Fix and regression-test integer arithmetic and comparison by PeterFidelman · Pull Request #9 · the-grue/OpenDC

Math operations now promote to int, unsigned, or long, in the C89 standards-compliant way ("usual arithmetic conversions"). Most notably, this removes the DeSmet-ism by which char additi...

when u smack the compiler so hard it turns into dwarf fortress, part 2
re: corrupted compiler crash: Turns out that's what happens when you mess up an incremental build and use a drastically mismatched compiler frontend vs. backend.
Modifying #DesmetC's compiler front-end to support unsigned long constants isn't going to be easy, given the sheer volume of places that assume all values fit in a long. I think I might squeak by, because constants carry some amount of pseudo-type information and an unsigned long doesn't take more space than a long, but it's still a bigger change than anything I've done so far.

I hadn't looked much at the compiler front-end before now. Much of it is a handwritten recursive descent parser, which makes for a maze of twisty little functions with names like heir19 and heir19c. Perhaps due to its impenetrability, it harbors many peculiarities. Here's one I just found by happenstance.

unsigned x = 'AB';
printf("%x\n", x);

This prints 4142. You might be surprised to learn that tcc and gcc do exactly the same thing. Per C89,

The value of an integer character constant containing more than one character, or containing a character or escape sequence not represented in the basic execution character set, is implementation-defined.

const int n = 5;
n = 6;
printf("%d\n", n);

6

Oh... oh dear...

#DesmetC #retrocomputing

It appears I've managed to get #DesmetC to support the 32-bit unsigned long type, as well as the U and LU suffixes on integer constants.

This changeset introduces a potential bootstrapping problem, because the brand-new unsigned long type is used in the same changeset that introduces it. I've got working binaries so it doesn't matter, but I probably should split this into two commits so as not to break the bootstrappability chain from the first open source release.

There are still some type issues, e.g., the preprocessor still treats all numbers as longs and doesn't understand the new type, and getting it to do so would not be entirely straightforward. But I bet this is good enough for most programs anyway...

#retrocomputing

@psf every time i read an update on your progress with this i think "digs up ancient c compiler, proceeds to fix tons of bugs. what a legend".
@pesco Thanks, it's very fun and I'm learning a lot!
Fuck I broke the bootstrap fuckity fuck. I should have saved off a copy of the working .EXEs... "trusting trust", my ass. I'm sure I'll be able to figure it out, less sure that I have enough working brain-cells to do that right now.
Phew, managed to work out a series of steps for building this successfully. This is currently more manual than I'd like, so what I'll wind up doing is making a parallel branch that breaks apart the commits so that each one can be built with the compiler built from the previous commit.

This was one of the more hair-raising changesets to prepare, but now unsigned long is available in my fork of #DesmetC and I've opened a pull request to the main project: https://github.com/the-grue/OpenDC/pull/12

We're up to 39,277 test cases now, which have been very handy in catching stupid bugs and bootstrapping-failures I introduced.

Since I started work on the project,

  • The compiler front-end has grown from 54784 to 57856 bytes.
  • The compiler back-end has grown from 51712 to 52736 bytes.
  • The C library has grown from 56365 to 57027 bytes.

Adding 2 data types in 4758 bytes isn't bad, I think? #retrocomputing

Add unsigned-long data type by PeterFidelman · Pull Request #12 · the-grue/OpenDC

This adds support for unsigned 32-bit integers (unsigned long). I think this adds the last primitive data type we needed for C89 support, although of course there still are other areas in which we...

btw the #ELKS maintainer has also been hacking on #DesmetC and has his own fork (though still closely related to the main DOS repo) that is able to compile an increasingly large amount of stuff natively from within ELKS. It's great seeing ongoing interest in on-target 8086 development, as opposed to just cross builds.
Commits · ghaerr/dcc

DeSmet C Compiler toolchain v3.10h ported to ELKS. Contribute to ghaerr/dcc development by creating an account on GitHub.


As for why I have been hacking on #DesmetC... look how well it fits on my #HP200LX! 300KB for a compiler is pretty sick (and will be even cooler if I can keep it near this small while supporting full C89); compile times are quite fast (for hello world, at least): overall it is a very pleasant tool to use on these small machines.

#retrocomputing

@psf With space at such a premium that the size of the executable matters that much, I wonder if UPX could further help?

@nazokiyoubinbou That's an interesting thought! Earlier when I crammed #LuaLang onto this machine, I used UPX to compress LUA.EXE, and while it did save space, the decompression time at startup got very tiresome... taking 10-15 seconds to launch the REPL instead of ~3 seconds. So I abandoned UPX for that project and just ate the disk space cost. With a CPU this slow, trading off CPU for disk space is not necessarily a win.

I wonder if the startup time penalty would be less painful for these smaller binaries, or if it would feel less onerous for a compiler than an interpreter.

psf (@[email protected])

Attached: 1 image After adding the missing size check, #LuaLang behavior is much more benign. A 64K segment size limit on table sizes isn't ideal, but it beats a hard crash, and it's a stable jumping-off point for further modifications. #retrocomputing #msdos #i8086 #v20
