Okay, I've got a page (128 words) of #PDP8 code that will generate a truncated Soundex hash of any ASCII or 6-bit TEXT string.

Since I have only 12 bits to store it, and the consonant classifiers are from 1-6, that takes 3 bits per digit. I'm using 5 bits to store the initial, so I get two digits plus a final bit just to distinguish if there was a third digit or not.
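A rough Python sketch of that bit budget (not the PAL code). The field order, with the initial in the top 5 bits and the "there was more" flag in the lowest bit, is an assumption; the post only gives the widths. The names `pack_hash` and `unpack_hash` are made up for illustration.

```python
def pack_hash(initial, d1, d2, more):
    """Pack initial (5 bits), two digits (3 bits each) and a flag (1 bit).

    Assumed layout, high to low: IIIII DDD DDD F. That is 5+3+3+1 = 12 bits.
    """
    code = ord(initial.upper()) & 0o37        # A=01 ... Z=32 (octal)
    return (code << 7) | ((d1 & 0o7) << 4) | ((d2 & 0o7) << 1) | (1 if more else 0)


def unpack_hash(word):
    """Inverse of pack_hash under the same assumed layout."""
    initial = chr(((word >> 7) & 0o37) | 0o100)   # back to an ASCII capital
    return initial, (word >> 4) & 0o7, (word >> 1) & 0o7, bool(word & 1)


h = pack_hash("S", 3, 2, True)
print(f"{h:04o}", unpack_hash(h))
```

Every value fits under `0o10000`, so one hash is exactly one machine word.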

I coded this up in awk first to test it, with a giant `switch` statement (only in gawk, alas) mapping letters to digits. But in PAL assembler for the PDP-8, I decided instead to make a table like so:

```
ALPHBT, 2170; /CBA@
2173; /GFED
2277; /KJIH
7554; /ONML
2621; /SRQP
7173; /WVUT
0272; /[ZYX
0000; /_^]\
```

Each word packs the digits for four consecutive characters in reverse order, so the lowest character sits in the least-significant three bits. The top three bits of the 5-bit char value indicate which word of this table to load, and the last two indicate how many times to shift right by 3 before masking off the 3 least-significant bits. So the letter D is `04`, which means we get word `1` and shift it `0` times before pulling out the octal digit `3` (which is correct!)
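The same lookup can be sketched in Python to check the table by hand. The octal constants are copied from the listing above; the function name `classify` is an invention for this example. Reading the table, `7` seems to mark the letters that produce no digit (vowels plus `H`, `W`, `Y`) and `0` marks non-letters, but that interpretation is an assumption.

```python
# Table copied from the PAL listing: four 3-bit digits per 12-bit word.
ALPHBT = [0o2170, 0o2173, 0o2277, 0o7554, 0o2621, 0o7173, 0o0272, 0o0000]


def classify(ch):
    """Return the 3-bit class for a character, using the table lookup."""
    code = ord(ch) & 0o37            # low 5 bits: @=00, A=01 ... Z=32 (octal)
    word = ALPHBT[code >> 2]         # top three bits pick the table word
    shift = 3 * (code & 0o3)         # low two bits pick the octal digit
    return (word >> shift) & 0o7


for letter in "DARLW":
    print(letter, classify(letter))
```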

Most of the code is tests for various bit patterns to see if we need to keep shifting or if we care (is it even a letter? Is it a vowel? Is it the same digit we saw last time? `H` and `W` are special cases...)
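Those tests can be sketched in Python as follows. This follows the usual American Soundex rules (vowels separate repeated digits, `H` and `W` do not) and truncates to two digits plus a "there was a third digit" flag. The PAL code may order or handle these checks differently, and `truncated_soundex` is a made-up name.

```python
# Standard Soundex consonant classes 1-6; everything else yields no digit.
CODES = {c: d for d, group in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}


def truncated_soundex(text):
    """Return (initial, digit1, digit2, more) or None if there are no letters."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    if not letters:
        return None
    initial = letters[0]
    last = CODES.get(initial, 0)     # the initial's class counts for repeats
    digits = []
    for c in letters[1:]:
        if c in "HW":
            continue                 # special case: transparent, keeps `last`
        d = CODES.get(c, 0)
        if d == 0:
            last = 0                 # a vowel lets the same digit repeat
            continue
        if d != last:
            digits.append(d)
        last = d
    more = len(digits) > 2           # the final bit: was there a third digit?
    d1, d2 = (digits + [0, 0])[:2]
    return initial, d1, d2, more


print(truncated_soundex("SOUTHEAST"), truncated_soundex("LOCK"))
```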

The 128 words include all temporary variables and the pretty-printer. I'm still deciding if I like that `1` in the last place, or if I should show it as a `+` or something. I think it's misleading for folks who actually know Soundex.

I'm also relatively confident that there are optimisations I could make on the test logic. So much of it is "store, reload, load a comparator, add, skip on condition, etc etc" that there's bound to be room for a few dirty tricks. If I could, I'd fit a routine to tokenise an entire string of words into an array of hashes, and print the set.

(And yes, I could make that awk version more portable by leaning on regexes, I suppose)

Well now, I've finally had time to come back to this project and fix it up now that I'm done with my postgraduate degree.

I changed the encoding, because Soundex can't distinguish between `SOUTHEAST` and `SOUTHWEST`. So I now use the most significant bit of the 12-bit hash to mean "this is Soundex, not raw text". When it's raw text, you get one or two characters directly in the 12-bit word.
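A minimal Python sketch of the flag test and the raw form. The raw packing here (two 6-bit codes, first letter in the high half, as in PDP-8 TEXT strings) is an assumption, as are the names `encode_raw` and `decode`. It does work out neatly, since letters are `01` to `32` octal and so leave the top bit clear. The layout of the Soundex payload isn't spelled out in the post, so it is left opaque.

```python
SOUNDEX_FLAG = 0o4000   # most significant bit of a 12-bit word


def encode_raw(word):
    """Pack one or two letters as 6-bit codes (assumed layout)."""
    if not (1 <= len(word) <= 2 and word.isascii() and word.isalpha()):
        raise ValueError("raw form holds one or two ASCII letters")
    codes = [ord(c) & 0o37 for c in word.upper()]
    if len(codes) == 1:
        codes.append(0)              # empty second slot
    return (codes[0] << 6) | codes[1]


def decode(h):
    """Split a hash into ('soundex', payload) or ('raw', text)."""
    if h & SOUNDEX_FLAG:
        return ("soundex", h & 0o3777)      # payload layout not specified
    chars = [(h >> 6) & 0o77, h & 0o77]
    return ("raw", "".join(chr(c | 0o100) for c in chars if c))


for w in ("S", "SE", "SW"):
    print(w, f"{encode_raw(w):04o}", decode(encode_raw(w)))
```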

So this means that `S`, `SE`, `SW`, etc are all distinguished inexpensively. I've done analysis on the dictionaries for Hibernated, Trinity, Moonmist, and the Cloak of Darkness demo, and the collision rate is (I think) acceptable given the other factors that can disambiguate verbs and objects in the game grammar.

So, why am I doing this? What good could possibly come from a "soundex without enough digits" scheme?

Well, coding on the PDP-8 (or 12 or LINC) has taught me that the algorithms we think of as for "small machines" are all relative. People nowadays will think absolutely nothing of a process casually allocating more memory than any computer I owned in the 80s or 90s. And so you will find people touting efficient low-resource algorithms that will brag that they make do with a scant few kilobytes of memory for the index tables or trees they use. Such frugality!

Well, the PDP-8 most commonly came with 8k of core memory. Spending 6k of that on a dictionary of possible words the player could type in just won't work. The LINC only had 2k of core, and spent most of its time swapping to the random-access tapes it had!

So a couple years ago I found a paper describing a text compression algorithm for which I'm confident I can write a decompression algorithm in a page (128 machine words) of code, and had the idea to use a modified Soundex encoding to replace the input dictionary.

So I'll have a pass that tokenises all input into a string of these hashes, and the code for an object will include those single-machine-word hashes in a list of "these are the words you can use to refer to me". And grammar objects will help disambiguate verbs that collide (such as `L200`, which maps to both `LOCK` and `LOOK`, but nobody types `LOCK AT DOOR` or `LOOK DOOR WITH KEY`, so that's fine).
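Here is a hedged Python sketch of how that could hang together. Hashes are shown as text like `L200` rather than packed words, `toy_hash` is a stand-in using plain Soundex rules, and the `OBJECTS` and `VERBS` tables and the grammar pattern format are all invented for illustration.

```python
CODES = {c: str(d) for d, group in enumerate(
    ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1) for c in group}


def toy_hash(word):
    """Stand-in hash: initial plus Soundex digits, padded to 4 characters."""
    word = word.upper()
    out, last = word[0], CODES.get(word[0], "")
    for c in word[1:]:
        if c in "HW":
            continue
        d = CODES.get(c, "")
        if d and d != last:
            out += d
        last = d
    return (out + "000")[:4]


# Each object lists the hashes of the words that may refer to it.
OBJECTS = {"door": [toy_hash("DOOR")], "key": [toy_hash("KEY")]}

# Colliding verbs share one hash; the grammar pattern tells them apart.
VERBS = {"L200": [("LOOK", ["AT", "noun"]), ("LOCK", ["noun", "WITH", "noun"])]}


def find_object(tok):
    for name, hashes in OBJECTS.items():
        if tok in hashes:
            return name
    return None


def parse(line):
    tokens = [toy_hash(w) for w in line.split()]
    if not tokens:
        return None
    rest = tokens[1:]
    for verb, pattern in VERBS.get(tokens[0], []):
        if len(pattern) != len(rest):
            continue
        found = []
        for slot, tok in zip(pattern, rest):
            if slot == "noun":
                name = find_object(tok)
                if name is None:
                    break
                found.append(name)
            elif toy_hash(slot) != tok:
                break
        else:
            return verb, found       # every slot matched
    return None


print(parse("LOOK AT DOOR"), parse("LOCK DOOR WITH KEY"))
```

A misspelling like `LOKC AT DOOR` hashes to the same tokens and still resolves to `LOOK`, which is the fuzziness being traded for space.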

And one advantage is that this will make the confusions a little more understandable. Yeah, "lock" and "look" can be hard for humans to tell apart in a noisy room, so it can be part of the charm that the game is a little fuzzy, less picky about how you spell words, but also has trouble telling `BLOOD` from `BLADE` without more information.

Incidentally, I would absolutely love to have a PiDP8 to demonstrate all this on, once I get it working. If anyone here has one that they either never got working or no longer use, hit me up. I'm happy to supply my own raspi for it.

I'm finding the challenge of "package this one function/functionality in one 128-word page" really fun to meet. It makes the code relocatable (so I could load it in as an overlay into any page in memory), and ensures that I'm not stepping on any "globals" in the zero page. I still need to take care with the core frame registers from time to time, but I think I already have a regimen for that anyway.

I had a busy week, so haven't been able to sit down and bang out any code for the ASCII text compression system, but that is definitely next. I'm basing it on https://doi.org/10.1093/comjnl/24.4.324 (EDIT: the author has put this up at http://www.jackpike.co.uk/36.Text%20compression%20using%20a%204%20bit%20coding%20scheme.pdf), but tuned for my 12-bit words.

One feature I worked out last night while drifting off to sleep is that if I make 0000 the "grab two nybbles as ASCII" symbol, it will natively handle a string of unpacked ASCII with only a little computation overhead (peanuts compared to waiting for the teletype ready signal!)
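To see why that works: a 12-bit word holding one unpacked 8-bit character has `0000` in its top nybble, so the decoder reads "escape, high nybble, low nybble" and gets the character back. A small Python sketch of just that escape follows. What the other fifteen codes mean depends on the tuned table, which isn't given here, so they are emitted as placeholders; `expand` is a made-up name.

```python
def nybbles(words):
    """Yield the three 4-bit nybbles of each 12-bit word, high to low."""
    for w in words:
        yield (w >> 8) & 0xF
        yield (w >> 4) & 0xF
        yield w & 0xF


def expand(words):
    """Decode only the 0000 escape; other codes become <n> placeholders."""
    out = []
    stream = nybbles(words)
    for n in stream:
        if n == 0:
            hi = next(stream, None)
            lo = next(stream, None)
            if hi is None or lo is None:
                break                      # truncated escape at end of input
            out.append(chr((hi << 4) | lo))
        else:
            out.append(f"<{n}>")           # real table lookup would go here
    return "".join(out)


# Two words of plain unpacked ASCII decode as themselves.
print(expand([0o110, 0o151]))
```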

@smolwaffle Oh that's fantastic! I'll have to mail him and ask some questions, then.
@smolwaffle http://www.jackpike.co.uk/cave.html ← Ah, I suspect we are simpatico regarding my goals in this!

@spacehobo
Is the code for Adventure II available? Might be an interesting read if so!
@smolwaffle The Atari game? The name was inspired by Colossal Cave Adventure, but it's purely graphical, isn't it?

@spacehobo Hmm, it seems to be a text adventure AFAIK. The one which Dr. Pike refers to on his website (on the page you linked re: Colossal Cave) is the one I meant. Looks like it's been recreated, with some archeology. https://github.com/Quuxplusone/Advent/tree/master/LUPI0440

Based on the main README of that repo, there's not much in the way of engine changes from the original Adventure.

@smolwaffle Ah, sorry, I had missed the sentence where he called his game "Adventure II". It's kind of a crowded field, but everyone in the disco era was making sequels to ADVENT (to use its TOPS-10 six-letterism). Even the early editions of Zork from 1977 say "Welcome to Adventure"!