Okay, I've got a page (128 words) of #PDP8 code that will generate a truncated Soundex hash of any ASCII or 6-bit TEXT string.

Since I have only 12 bits to store it, and the consonant classifiers are from 1-6, that takes 3 bits per digit. I'm using 5 bits to store the initial, so I get two digits plus a final bit just to distinguish if there was a third digit or not.
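
To make the packing concrete, here's a Python sketch of how I think of the layout (the function and field names are mine, not anything in the PAL source):

```python
# Hypothetical model of the 12-bit layout: [initial:5][digit1:3][digit2:3][more:1]
def pack_hash(initial, d1, d2, had_third):
    """Pack a truncated Soundex code into one 12-bit word."""
    word = (initial & 0o37) << 7    # 5-bit initial letter (A=1 ... Z=26)
    word |= (d1 & 0o7) << 4         # first consonant class, 1-6 (0 = none)
    word |= (d2 & 0o7) << 1         # second consonant class
    word |= 1 if had_third else 0   # flag: was there a truncated third digit?
    return word

# e.g. LOCK: initial L (12), one class-2 consonant, nothing after that
print(oct(pack_hash(12, 2, 0, False)))
```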

I coded this up in awk first to test it, with a giant `switch` statement (only in gawk, alas) mapping letters to digits. But in PAL assembler for the PDP-8, I decided instead to make a table like so:

```
ALPHBT, 2170; /CBA@
2173; /GFED
2277; /KJIH
7554; /ONML
2621; /SRQP
7173; /WVUT
0272; /[ZYX
0000; /_^]\
```

Basically the characters in this section of the character set are there in reverse order. The top three bits of the 5-bit char values indicate which word of this table to load in, and the last two indicate how many times to shift right by 3 before masking off the 3 least-significant bits. So the letter D is `04`, which means we get word `1` and shift it `0` times before pulling out the octal digit `3` (which is correct!)
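
In Python terms (my reconstruction, using the octal values from the table above), the lookup is just:

```python
# A Python model of the ALPHBT table trick (not the PAL code itself):
ALPHBT = [0o2170, 0o2173, 0o2277, 0o7554, 0o2621, 0o7173, 0o0272, 0o0000]

def classify(ch):
    """Map a letter to its 3-bit Soundex class via the packed table."""
    five = ord(ch.upper()) & 0o37   # 5-bit character code (@=0, A=1, ...)
    word = ALPHBT[five >> 2]        # top 3 bits pick the table word
    shift = 3 * (five & 0o3)        # bottom 2 bits pick the octal digit
    return (word >> shift) & 0o7

print(classify('D'))  # word 1, shifted 0 times -> 3
```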

Most of the code is tests for various bit patterns to see if we need to keep shifting or if we care (is it even a letter? Is it a vowel? Is it the same digit we saw last time? `H` and `W` are special cases...)

The 128 words include all temporary variables and the pretty-printer. I'm still deciding if I like that `1` in the last place, or if I should show it as a `+` or something. I think it's misleading for folks who actually know Soundex.

I'm also relatively confident that there are optimisations I could make on the test logic. So much of it is "store, reload, load a comparator, add, skip on condition, etc etc" that there's bound to be room for a few dirty tricks. If I could, I'd fit a routine to tokenise an entire string of words into an array of hashes, and print the set.

(And yes, I could make that awk version more portable by leaning on regexes, I suppose)

Well now: I've finally had time to come back to this project and fix it up, now that I'm done with my postgraduate degree.

I changed the encoding, because Soundex can't distinguish between `SOUTHEAST` and `SOUTHWEST`. So I now use the most significant bit of the 12-bit hash to mean "this is Soundex, not raw text". When it's raw text, you get one or two characters directly in the 12-bit word.

So this means that `S`, `SE`, `SW`, etc are all distinguished inexpensively. I've done analysis on the dictionaries for Hibernated, Trinity, Moonmist, and the Cloak of Darkness demo, and the collision rate is (I think) acceptable given the other factors that can disambiguate verbs and objects in the game grammar.

So, why am I doing this? What good could possibly come from a "soundex without enough digits" scheme?

Well, coding on the PDP-8 (or 12 or LINC) has taught me that the algorithms we think of as being for "small machines" are all relative. People nowadays will think absolutely nothing of a process casually allocating more memory than any computer I owned in the 80s or 90s. And so you'll find people touting efficient low-resource algorithms, bragging that they make do with a scant few kilobytes of memory for their index tables or trees. Such frugality!

Well, the PDP-8 most commonly came with 8k of core memory. A dictionary of every word the player might type would eat 6k of that, so it just won't fit. The LINC only had 2k of core, and spent most of its time swapping to its random-access tapes!

So a couple years ago I found a paper describing a text compression algorithm for which I'm confident I can write a decompression algorithm in a page (128 machine words) of code, and had the idea to use a modified Soundex encoding to replace the input dictionary.

So I'll have a pass that tokenises all input into a string of these hashes, and the code for an object will include those single-machine-word hashes in a list of "these are the words you can use to refer to me". And grammar objects will help disambiguate verbs that collide (such as `L200`, which maps to both `LOCK` and `LOOK`, but nobody types `LOCK AT DOOR` or `LOOK DOOR WITH KEY`, so that's fine).

And one advantage is that this will make the confusions a little more understandable. Yeah, "lock" and "look" can be hard for humans to hear the difference in a noisy room, so it can be part of the charm that the game is a little fuzzy, less picky about how you spell words, but also has trouble telling `BLOOD` from `BLADE` without more information.

Incidentally, I would absolutely love to have a PiDP8 to demonstrate all this on, once I get it working. If anyone here has one that they either never got working or no longer use, hit me up. I'm happy to supply my own raspi for it.
I'm finding the challenge of "package this one function/functionality in one 128-word page" really fun to meet. It makes the code relocatable (so I could load it in as an overlay into any page in memory), and ensures that I'm not stepping on any "globals" in the zero page. I still need to take care with the core frame registers from time to time, but I think I already have a regimen for that anyway.

I had a busy week, so haven't been able to sit down and bang out any code for the ASCII text compression system, but that is definitely next. I'm basing it on https://doi.org/10.1093/comjnl/24.4.324 (EDIT: the author has put this up at http://www.jackpike.co.uk/36.Text%20compression%20using%20a%204%20bit%20coding%20scheme.pdf), but tuned for my 12-bit words.

One feature I worked out last night while drifting off to sleep is that if I make 0000 the "grab two nybbles as ASCII" symbol, it will natively handle a string of unpacked ASCII with only a little computation overhead (peanuts compared to waiting for the teletype ready signal!)
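
Sketched in Python (this is my model of the nybble stream, not the PDP-8 code; the `table` argument stands in for whatever the symbols 1-15 map to):

```python
# Symbol 0000 escapes to raw ASCII: the next two nybbles are the high and
# low halves of a 7-bit character.
def decode(nybbles, table):
    """Decode a nybble stream; table maps values 1-15 to characters."""
    out, it = [], iter(nybbles)
    for n in it:
        if n == 0:
            hi, lo = next(it), next(it)    # raw ASCII: grab two more nybbles
            out.append(chr((hi << 4) | lo))
        else:
            out.append(table[n])
    return ''.join(out)

# With every character escaped, plain unpacked ASCII "just works":
print(decode([0, 0x4, 0x8, 0, 0x6, 0x9], {}))  # 0x48='H', 0x69='i'
```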

I'm debating keeping "Conbak" as the name for this. It's more pronounceable than most of the Spells of Quendor, and never appeared in any games (but it does have a bonus reference to 12 in the lore). I was thinking maybe "Constructing Bitty Adventures Kit" as a backronym if I do.

I'm using Vince Slyngstad's pal assembler, as at some point I hope to optimise code for @tastytronic's PDP-12, but my target minimum platform is a PDP-8/i with 8k of core and at least one DECTape drive. No EAE necessary (although it was an *extremely* common option for 12s).

And believe me, I would LOVE to have an 8/e, even without the EAE! The `BSW` instruction is SO USEFUL for keeping code size down. But the 12 doesn't have it, so I'm being cautious.

I'll have to look up which models had the `MQ` instructions available even without EAE, as that makes a handy two-word stack and gives you a machine-native `OR` instruction even without all the `MUL` circuitry hooked up.

My one complaint is that I can't get SIMH to build the pdp8 simulator on my Alpine system, even with gcompat installed. I'll need to bash about and figure out what includes aren't working.

(or, you know...someone could sell me their used PiDP8)

Working on the 4-bit variable-length text encoding scheme now. I'm debating whether to have abbreviation lookups separate from the raw ASCII format, now.

One nice thing about having it separate is that as it stands you can put raw ASCII values into 12-bit words with leading zeroes in the high-order bits, and it Just Works. Like, this will handle ASCII data as well as its own encoding.

But the original paper had the dictionary lookups tied to the 205 values outside of the "printable" ASCII set (96 chars, minus the ones already covered by the scheme, ignoring case etc etc). This is kind of tidy and gives us an extra symbol in the shortest 4-bit form (a `u` in my layout). It would also mean that I'd have all of `ETAOIN SHRDLU` in the shortest symbol space, which...appeals to my sense of history.

But this would definitely break the "ASCII Just Works™" behaviour. Not sure whether I need that just yet.

Also I'd probably only have an abbreviation dictionary that's about 28 cells long, if I did this. Better than 16, worse than 205...the paper shows an inflection point somewhere between 0 and 50, so maybe that's an excellent place to land my trade-offs?

I wouldn't really use the raw ASCII feature, anyway. I think I'm talking myself into "abbreviations live in the raw ASCII part" so I can add `u` to the one-nybble set, a comma to the two-nybble set, and let all of these symbols become abbreviation lookups in the three-nybble set.

oh yeah, also as you can see in the screenshot above, I've been calling nybbles "groats". Since this is a 12-bit system, I have been using the word "shilling" for a 12-bit value (shillings were worth 12 pennies in the https://en.wikipedia.org/wiki/%C2%A3sd system).

So I've also matched up sub-word pieces with names for old pre-decimalisation coins. A "tanner" is old slang for a sixpence (and `TANNER` is six letters, which fits in the PAL symbol table), so that's six bits. An octal digit takes 3 bits, so that's thruppence, and a four-penny coin was called a "groat".

Yes, this is absurd and confusing, but I'm coding PDP-8 assembly so this is a drop in the bucket.


One detail of PDP-8 teletype stuff is that it does weird things with the high bit. On an ASR33, the paper tape uses the high bit for even parity (it comes on only to make sure there are an even number of holes on a row), but the PDP-8 side of things requires the high bit high for you to send a byte of ASCII to the teletype.

It's a bit weird, but it's something you kind of get used to. Like, an `E` is `0305` in octal, when `ascii(7)` clearly lists it as `105` octal.
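
In other words (a one-liner of mine, not anything from the actual code):

```python
# The console wants the high bit set on outgoing ASCII characters:
E = 0o105              # 'E' as ascii(7) lists it
print(oct(E | 0o200))  # 0o305, the form the PDP-8 teletype interface expects
```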

It's *so* much easier for me to test one bit than to find sub-ranges of a value space on this thing, so I'm thinking that this high bit will determine whether an 8-bit value is raw ASCII or a dictionary index. That gives me 128 possible abbreviation entries, which feels excessive given the other constraints, so I'm no longer worried about this side of the encoding scheme.

That said, I'm more convinced than ever that I'll need to compress the expansion strings as well, and that means I will need to implement a stack. I've done that before, and have ideas on how to implement a tiny one with over/underflow detection.

I am reminded that the typical number of abbreviations in an Inform6/Z-machine game is 96, so 128 possible codes is plenty.

OS/8 has 256-word blocks on whatever backing store (TU DECtape, RX floppy, RK hard disk, etc), and the PDP-8 has an opcode format that contains 7-bit addresses in 128-word pages (but with a bit to switch to the shared "zero page", which is handy for global variables). So you basically load in two pages of core minimum, and that's my unit size for overlays or other swap strategies.

In my encoding scheme, it now takes a full 12-bit word to launch into an abbreviation, so it would be absurd to use this for any value equal to or smaller than that. So I can take this 7-bit value I'm given, shift it left once to address pairs of 12-bit words, and add it to the base address for the expansion dictionary.
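
The address arithmetic is trivial (a sketch with a made-up base address):

```python
# A 7-bit abbreviation index addresses pairs of 12-bit words in the
# expansion dictionary.
DICT_BASE = 0o5000      # hypothetical base address of the dictionary

def abbrev_addr(index):
    """Each entry is a word pair, so shift left once and add the base."""
    return DICT_BASE + (index << 1)

print(oct(abbrev_addr(0o12)))   # entry 10 decimal -> base + 20 words (octal 24)
```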

This means that my three groats can expand to two raw ASCII values (such as `!"`), three uncommon letters (such as `.py`), six common letters (such as `delete`), or a mix of the above.

But if I `NUL`-terminate these abbreviations, I can let them spill over into another pair, letting me compress down longer strings. It's common in Z-machine games to have a few abbreviations for extremely common long strings like `You are standing in \0` (23 groats in my scheme, or 25 if you don't handle capitalisation automatically, taking up 8 or 9 machine words), and I'm sure something like the zabbrev tool could help find a few of these in any corpus. But most abbreviations I expect to be in the 4 or fewer category, so just leaving gaps (or re-using substrings) could help a lot here.

For an example of a reasonable abbreviation set, PunyInform provides a default set that at least compresses the standard library's output for you:

https://github.com/johanberntsson/PunyInform/blob/master/lib/globals.h#L14-L79

You can see that there is a long tail of sorts, and only the first dozen or so are longer than 10 characters. The short ones mainly benefit from being difficult characters in the Z-machine's own awkward text compression system.

This is not the best example of a suitable abbreviation set for a finished game, but it gives a sense of what sort of substrings end up in an abbreviation list like this.


As far as capital letters go, it's reasonable to assume that the output device is an ASR-33 or equivalent at this point, and that letter case isn't necessarily supported. That makes it more efficient to try an automated set of rules for when to capitalise and when to use the lower case:

1. The first letter of an in-game response should be capitalised.
2. The first letter following the string `. ` should be capitalised.
3. The first letter following a newline should be capitalised.
4. The string ` i ` should capitalise the "I", as this is a special case in English.
5. If the previous two characters were upper case letters (either via the above rules or as a raw ASCII character), we capitalise the next one.

So if I started a sentence, paragraph, or response with `bEhold my aubergine!` (using raw ASCII to get the capital `E`), that would be printed as `BEHOLD my aubergine!` on terminals that support lower case.
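
Here's how I model those five rules in Python (a stand-in for the eventual PDP-8 routine, so treat it as a spec sketch rather than the implementation):

```python
def capitalise(text):
    """Apply the five capitalisation rules to a lower-case stream."""
    out, cap_next = [], True            # rule 1: start of a response
    for i, ch in enumerate(text):
        if ch.isalpha():
            # rule 4: a lone "i" between spaces (or at either end)
            lone_i = (ch == 'i'
                      and (i == 0 or text[i - 1] == ' ')
                      and (i + 1 == len(text) or text[i + 1] == ' '))
            # rule 5: two uppercase letters in a row pull the next one up
            run = len(out) >= 2 and out[-1].isupper() and out[-2].isupper()
            ch = ch.upper() if (cap_next or lone_i or run) else ch
            cap_next = False
        out.append(ch)
        # rules 2 and 3: capitalise after ". " or after a newline
        if ch == '\n' or (ch == ' ' and len(out) >= 2 and out[-2] == '.'):
            cap_next = True
    return ''.join(out)

print(capitalise('bEhold my aubergine!'))  # the raw-ASCII E cascades via rule 5
```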

Yes, this would hurt games written in German. Soz.

OK, since I will probably need a stack at some point anyway, here's one with overflow/underflow detection. Right now it just halts on boundary faults, but I had a version that used multiple return vectors when testing.

This uses 19 words of core for the code and variables, and then uses the rest of core to the top (`7777` octal, or address 4095 decimal). So you get like 109 words if you put it on the top page, and you can add units of 128 words by moving it down lower and lower in the address space.

Heh, I just learned that I can do `ZBLOCK -.` and it will reserve all of the core to the top of memory, which helped me sort out some of my fencepost errors on the size of things.

It was 20 words for the implementation, not 19, meaning I have 108 words left in the page for the stack itself. If I start at `*7400` instead of `*7600` I'd get a stack of 236 words, and `*7200` would get me 364, etc...

Naturally, I couldn't write code that *uses* this stack without adding helpful convenience functions like `DUP`, `SWAP`, `OVER`, etc... and I even wrote an `UNDER` to make implementing some of the others less inefficient.

I have `INC` which increments the top of the stack (skipping an instruction on return just like `ISZ`) and a `DEC` which decrements it (with no other side-effects). I'll probably end up adding other standard operators ere long, but YAGNI...
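
As a toy model of the whole thing (Python standing in for the PDP-8 page; a halt on a boundary fault becomes an exception here):

```python
class Stack:
    """Bounds-checked stack of 12-bit words, sized to what's left on a page."""
    def __init__(self, size=108):
        self.cells, self.size = [], size

    def push(self, v):
        if len(self.cells) >= self.size:
            raise RuntimeError('HLT: stack overflow')
        self.cells.append(v & 0o7777)        # keep values to 12 bits

    def pop(self):
        if not self.cells:
            raise RuntimeError('HLT: stack underflow')
        return self.cells.pop()

    def dup(self):  self.push(self.cells[-1])
    def swap(self): a, b = self.pop(), self.pop(); self.push(a); self.push(b)
    def over(self): self.push(self.cells[-2])
    def inc(self):  self.push(self.pop() + 1)   # like ISZ on the top cell
    def dec(self):  self.push(self.pop() - 1)

s = Stack()
s.push(1); s.push(2); s.over(); s.inc()
print(s.cells)
```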

tfw you have to print out your `.lst` files and your SIMH CPU log and ponder them in a long soaking bath...

Well now, that's not bad at all! The text at the top hits a comma and freaks out, but it gets the first 128 bits (would have been 120, but I capitalised the initial `S` as a test) decoded into 24 characters. That's about 5 bits per character, which is what I was aiming for with this scheme.

Now I have to figure out why the literal comma didn't come out right. And then I have to implement abbreviations and debug their strange recursiveness.

Oh! It was an encoding error! The decoder works just fine!!

There's a full stop on the end that got eaten by the early `HLT`. I keep having to remind myself that the CPU and core on this thing run instructions in a few µs, while the teletype prints one character every 100ms, tops. So a lot of my debugging frustration is logs getting flooded with loops waiting for I/O to be ready, or a breakpoint halting the system before the output can finish printing.

Okay, so this is 240 bits to encode 39 characters, which is about 6 bits each. I'm wasting 8 bits on the `S` and this example string is kind of pathological in using all the uncommon letters, so I'm fine with that as a rough upper-bound: it's about what the internal `TEXT` or `SIXBIT` character sets do, but I have access to all of 7-bit ASCII when I really need it.

I'll have to fix my encoding tools, which are just python scripts on my laptop, to set the DEC TTY high bit on that comma... not sure why it didn't do that.

The output is the same, but I got the code to work with an abbreviation!

Next up: nested abbreviations!

I'm reasonably confident those will work fine, as this is recursive and I've already exercised all my test cases there.

So, looking at my code, I have:
* 1 page for entry code/main loop stuff
* 1 page for Soundex
* 1 page for some pasted TTY routines I got from DECUS (`DEC-8I-RZPA-D` (1971) for the curious)
* 1 page for text decompression
* + 1-2 pages for the abbreviation dictionary (could be more or less as needed)
* 1 page for stack operations plus a small stack area
* + arbitrary stack pages. I can fill top of core with this whenever I want.

So, rounding up a little, let's say that's 8 pages out of 32 used, not counting the zero page, which I haven't touched yet. The zero page will be more about the API of far calls and data structure pointers, when it's relevant. I may have to weave it into whatever overlay system I find myself relying on.

So what's actually on the docket for the next thing to write is the grammar system. I'll write grammar routines that match phrase patterns of varying levels of specificity, and print out stored messages in response.

In my head I'm kind of chewing my nails and wondering if I'll need to write a six-bit "bytecode" system for game object code. It feels like it would help, and it's something I've written before on this platform and I already have a page of nice stack operations to lean on as it is. Heck, the notes for it are in the same notebook as this 12-bit IF project so it almost feels inevitable. I just don't want to put the cart before the horse, here.

Oh yeah, I should say 32 pages are in one field. I'm aiming at OS/8 which requires two core fields minimum, and that complicates indirect memory accesses in ways I'll need to tidy up at some point. Basically one field is for code, and another is for data and occasionally swapping in OS/8 code itself.
Reader, recursive abbreviations worked perfectly.

Been looking over my `TANNER` six-bit "bytecode" language implementation, which was inspired by STABLE (and from there, Mouse). It's got some places to improve on it based on what I've learned writing what I have so far in the Soundex/Compression code.

But what surprised me the most is how few `JMS` calls are in there. It makes a kind of sense: each token is just vectored into some code in a single main loop, so it basically just jumps back to the top of that loop after a token has been interpreted and run.

The one thing I could really do to optimise it is to re-balance the page layout of the code so that I'm not relying so heavily on the zero page for far pointers. I suspect I could break it up into a couple of pages.

I also suspect it would end up a higher-level system, dealing in the game data structures instead of raw memory. This would probably inflate the code slightly, but mostly to call out to functions for things like getting the location of an object, or iterating over the objects in scope, or checking/setting object flags.

I'm not yet entirely sold on this approach, but it's fun to think over.

The next thing I've been tackling has been the verb/grammar code. The goal here is to represent the logic/data structures from this section of the #PunyInform library: https://github.com/johanberntsson/PunyInform/blob/master/lib/grammar.h#L7-L225

It looks like a combinatorial explosion of possible prepositions in a large space of possible sequences, along with placeholders for object search conditionals. It feels like the kind of thing that needs a lot of core just to map it all out. Sure, I have my input words compressed to single 12-bit integers, but there's a lot of repetition in the source there and it seems like a mess to represent.

I made a list of all the tokens in this thing, and came up with 32 values (including a null). Well, that's neat! I could decide that if the most significant bit of a tanner is `0`, it's a lookup into a table of 12-bit values. I have some space left in my compression scheme for ideas like "object in scope" or "object in inventory" (which can gobble multiple words), so I can handle common cases like `INVENTORY`, or `GET LAMP`, or `TURN ON LAMP`. I map the `TURN` verb into a pair of indices for `ON` and *noun*, pointing out to our equivalent of the Puny `SwitchOn` function in the #PDP8 game code.

Okay, but what about all that `'in'/'into'/'inside'/'on'/'onto'` stuff? Well, if the tanner's most significant bit is `1`, then I treat it as a sort of pointer into the second half of my grammar index. So that might look like this:

37. `IN`, `INTO`
38. `INSIDE`, *37*
39. `ON`, `ONTO`
40. *38*, *39*

So I might have `GET` mapping a grammar word of [*40*,*noun*] to the `ENTER` procedure, so that `GET INSIDE BICYCLE` and `GET ONTO BICYCLE` will both call the same code. The grammars use 3 words per mapping, and make use of a table of 64 words. This feels more comfortable for the type of system I'm writing here.

Yes, this is just #LISP cons cells, and I love that.
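
Since it really is just cons cells, the whole resolution step models neatly in a few lines of Python (indices from the example above; the word strings stand in for my 12-bit hashes):

```python
# Entries 37 and up are cons-like pairs; each half is either a word token
# or a pointer to another pair.
PAIRS = {
    37: ('IN', 'INTO'),
    38: ('INSIDE', 37),
    39: ('ON', 'ONTO'),
    40: (38, 39),
}

def expand(entry, acc=None):
    """Flatten a pair index into the set of word tokens it can match."""
    acc = set() if acc is None else acc
    for half in PAIRS.get(entry, ()):
        if isinstance(half, int):
            expand(half, acc)     # a pointer: recurse into another pair
        else:
            acc.add(half)         # a word token
    return acc

print(sorted(expand(40)))   # every preposition that routes GET to ENTER
```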

Now I just need to write all this...

I love the approaches for each of the problems I've come up with so far:

1. Soundex
2. Linotype
3. LISP

I hope this level of techno-whimsy continues!