I should post about the latest #retrocomputing project I started.

Problem: I'd like an open-source, self-hosting C compiler on 8086, that supports the large memory model, overlays, and enough C89 to build Lua.

This seems to not exist! K&R is much more common in this size category. Around the time of C89, many compilers bloated to the point of requiring a 386 or better host, though they could still target 8086. The 8086 holdouts were, in general, commercial products that never got a source release.

One notable exception was DeSmet C http://www.desmet-c.com. It seems to have started life as a commercial PC fork of Bell Labs PCC, a small and sturdy K&R compiler. (Edit: I'm no longer sure of this: DeSmet had a shareware release as "PCC", but this stood for Personal C Compiler, while Bell's stood for Portable C Compiler. So the similarity in naming was probably just a coincidence.) DeSmet 3.1 added "draft ANSI C" support, but this is incomplete, and riddled with code-gen bugs. This version later found itself on Github as OpenDC https://github.com/the-grue/OpenDC.

Aside from all the bugs, this is a pretty cool package: its dis/assembler, debugger, text editor, and some other utilities were also open sourced, and it runs on an 8088 with 256K RAM and two 360K floppies.

The OpenDC person did a good job packaging things up into an easily buildable form, and fixing syntax errors that probably came from running the sources through a different compiler version than expected, so... yes, it does indeed build and self-host... and I've done this on my Book 8088.

So now I will try to fix the bugs and add the missing C89 features. There are many, many of both... gulp.

The #DeSmetC compiler codebase is the hairiest code I've ever hacked on. K&R style, many global variables, short cryptic names, spooky action at a distance, the shotgun-surgery pattern for type handling splattered everywhere, oh baby.

For all that, I managed to fix the codegen bug from the Github issues on the ~second day of working on the compiler... that's the beauty of a small codebase.

My fork is here: https://gitlab.cs.washington.edu/fidelp/open_desmet_c
1 bug down, 999 to go...

#retrocomputing

Ooh, the-grue, the current maintainer of OpenDC #DesmetC, took my code-gen patch, how awesome! Usually when I pick up an old codebase like this, the maintainer is long gone.

So @linear if you end up wanting to submit patches, that's the place: https://github.com/the-grue/OpenDC
My fork will remain just an unofficial fork.

Been writing regression tests for arithmetic, which caught another #DesmetC code-gen bug that I was able to fix. https://github.com/the-grue/OpenDC/issues/5

Previously, illegal asm instructions were being generated for these cases.

#retrocomputing

Hmm, I found yet more weirdness with signed chars in #DesmetC. If you do:

signed char i, j;
int k;
// ...
k = i + j;

k's upper bits become a sign-extended version of i, not a sign-extended version of the result.

Much of this pain seems to trace back to a quirk (a deviation from the C standard) documented in the manual: math on char types produces a char result, not an int result. Perhaps to save a few instructions? Anyway, however this was implemented seems to work fine for char but not for signed char.

Okay, #DesmetC sign-extension in (signed char -> int) promotion now works in my branch. (signed char -> long) does not work yet; it neglects to sign-extend, acting more-or-less like (unsigned char -> long). That's next to fix.

Thankfully, the codebase is small enough that it's not too hard to find the logic responsible for any given codegen decision.

Also, I came up with the trick of having the assembler backend emit comments into the output asm file. This lets me do something like printf debugging to check which codegen cases are being hit and annotate the assembly they're generating.

Naturally this is all test-driven: I'm accumulating regression tests for the broken codegen I've been fixing, and usually the way I find codegen bugs is by writing new tests expecting the mathematically correct answer, and watching them immediately fail.

#retrocomputing

#DesmetC (signed char -> long) promotion is now working on my branch: got it on the first try, which hopefully means I'm internalizing the codebase. Now I will start on tests for mixed-sign arithmetic.

#retrocomputing

It's been a pretty productive night in the ol' #DesmetC codebase. Regression tests finally checked in, all the mixed-size integer addition/subtraction involving signed chars I could think of is exercised and passing, nice.

Then I try i8 * i8 -> int and it instantly breaks, not so nice 🫠.
Oh well, that gives me something to fix tomorrow.

Also, I haven't ventured into floating point conversion land yet, either. I'm sure that'll have plenty of dragons when used with signed char.

All this is making me appreciate the wisdom of BCPL and B, which have just a single word-sized type -- or #Forth, which takes that and adds char, as a treat.

#retrocomputing

Oh gosh

MOV CL,BYTE [BP-2]
XCHG CX,AX
CBW
XCHG CX,AX
XCHG CX,AX
CBW
XCHG CX,AX
IMUL CX

When CL absolutely, positively, needs to be sign extended before multiplication 

The mul-div codegen path in #DesmetC is sort of a nightmare because so much is reused between multiplication, division, and mod, and because there are some dodgy special-cases in here from 1990 that demonstrably do the wrong thing. Proceeding slowly with machete and torch, laying down test cases as I go.
Heh, well, can't imagine why the assembler doesn't like that instruction.
#DesmetC mul/div/mod i8 bugs seem more or less vanquished, now moving on to i8 comparison, which I probably should have done first, as it's also turning out to be broken.... -1 > 1, don't you know?

It is satisfying to watch the test count creep steadily upward. I like to leave off just after writing a failing test, to give me a clear "next task" when I resume work.

#DesmetC #retrocomputing

Huh weird, just noticed that in #DesmetC I can just keep redeclaring a local variable by the same name and it works. My test suite was doing this by accident, with seemingly no problems.

int n = f1();
printf("%d\n", n);
int n = f2();
printf("%d\n", n);

This isn't even a C99 compiler, so declaring a variable anywhere other than the start of a block should be illegal besides.

#retrocomputing

$ cat cmp.c
#include <stdio.h>
int main()
{
    int i = -1;
    unsigned j = 1;
    printf("%d\n", i < j);

    return 0;
}

$ gcc cmp.c

$ ./a.out
0

Hmm, at some point I probably knew this is how the comparison would turn out in C. I was expecting it to coerce both operands to the smallest common type that could represent both of them (which would have been long), then compare. Instead, we get two's-complement funny business.

when u smack the compiler so hard it turns into dwarf fortress
i deserved it though, that 13 byte program utilized 65489% of my available system resources

OK, finally I'm reasonably confident that the integer comparisons in #DesmetC are working correctly, after beating them into submission against a test-case-generator, with tcc on my Linux machine serving as the test oracle.

#retrocomputing

Cool, those same test cases pass on OpenWatcom as well, so I'm fairly confident the comparison behavior is correct.

After all my testing, how many bugs can #DesmetC math expressions still have? Well... the next problem is that the result type doesn't always match the behavior required in C89 §3.2.1.5 Usual arithmetic conversions. Like if you do int + unsigned, there are circumstances where the result might be int, not unsigned as would be standard.

My tests weren't catching this because they were all structured like,

int i;
unsigned j, k;
k = i + j;
// now test the value of k against expectations

where the assignment coerced the result to a particular type. This meant tests weren't checking the result's "natural" type, and sometimes that was wrong.

Latest #DesmetC "explorations with machete and torch": the compiler source has numbered constants for each supported C datatype. Normally you'd use an enum for this sort of thing, but this codebase uses #defines. The constants were numbered in a strange order, and I wanted to re-sort them into the order of the "usual arithmetic conversions" to simplify some logic. That broke code-gen: several hours later, I found that CCHAR=1 and CINT=2 were being used directly in hex arithmetic that determines which x86 opcode to emit, so renumbering them produced absurd illegal instructions. With that corrected, we're back to self-hosting OK.
I'm hoping this will make it possible to retire a bunch of one-off type promotion logic scattered around the compiler, in favor of a few central functions closely mapping to the C89 standard.

#retrocomputing #SoftwareArchaeology

realized my now-thousands of #DesmetC binop expression evaluation test cases are still all just variable [op] variable cases, and don't exercise variable [op] literal or literal [op] variable at all yet. I know there will be more bugs there, because the compiler does constant folding as an entirely separate code path.

On the plus side I seem to have corrected lots of #DesmetC mul/div/mod bugs in one fell swoop by rewriting the relevant part of its codegen, lmao. So many type hacks, gone.

On the minus side, I think now I might be finding some bugs in #OpenWatcom C, which was supposed to be my infallible test oracle, dammit ;D

POP QUIZ: On a 16-bit compiler, what answer would you expect for:

int i;
unsigned int j;
i = -32768; j = 36;
printf("%d\n", (i / j));
Poll results:
-911: 30%
-910: 20%
910: 40%
divide overflow exception: 10%

@psf I think all of them are valid results. If I'm not mistaken, the assignment to i is undefined behaviour.

I think the best way to handle this would be to set the user's computer on fire, but I'm not sure if a C compiler can do that.

@soulsource If undefined behavior can set my PC on fire, at this rate I'm sure I'll experience that soon.