So, for a little project I needed to compress a bit of data, and @cancel and I made up a spec that should be possible to implemented on small systems without too much headaches.

I wrote a bit of documentation for it, and I was wondering if anyone wanted to put the docs to the test and see if they could write a toy implementation in a language of their choice. It'd help me see if there's things missing in the docs.

http://wiki.xxiivv.com/site/ulz_format

So, think of it as a little puzzle.

XXIIVV — ulz format

Thanks to everyone who answered my puzzle and wrote experimental implementations of our little LZ scheme! You've helped us improved the documentation and see how portable of an algorithm it is across multiple languages!
@neauoire @cancel trying my hand at a C# version of this!
@renaudbedard @cancel Keep me posted :>
@neauoire @cancel A couple of things that are tripping me :
- CPY uses a "negative offset plus 1", but it's not clear if it's "-offset + 1" or "-(offset + 1)"
- It's implied that the dictionary pointer is always pointing to the end of the dictionary buffer after the last copy/append operation, right?

@renaudbedard @cancel

Example: an offset of 0 means go back by 1 bytes into the history.

And the pointer is the at the end of the dictionary buffer yep.

I'll add a bit about the negative increment to the offset to the docs! thanks for pointing it out

@neauoire @cancel also I know HTML tables are a nightmare, but if you could make the LIT length box appear to take all 7 bits instead of 6 like CPY1... 
@neauoire @cancel wait a minute... if the length of a CPY is greater than its negative offset, what's the expected behavior?
@cancel @neauoire that part was definitely unclear to me, but if I loop over the offset until length is hit, it does decode the test case! 🎉
@renaudbedard @neauoire nice :D here's my original version of the document from like 9 months ago https://gist.github.com/randrew/b43c011ccb44cb8cc6fa9d009ac58a7f the one you read was written by devine after reading mine, so the game of telephone appears to work
uxn lz explanation.md

GitHub Gist: instantly share code, notes, and snippets.

Gist

@renaudbedard @cancel I didn't even implement the loop around myself 🤦‍♀️

Well done! the documentation and telephone implementation works! :D

@neauoire @renaudbedard the fun part is you don't actually need to explicitly implement looping over the dictionary when the length is longer than the dictionary. you just plow forward and "overflow" into the data you already wrote and keep re-appending it.
@neauoire @renaudbedard (ok, you do need to explicitly loop over the dictionary if you're doing a streaming/resumable implementation, since you have to store the dictionary separately from the output instead of just treating it as the last 256 bytes)
@renaudbedard @cancel mhm I thought a fixed that, could you reload?

@neauoire @cancel here is a naive Rust implementation:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&code=fn+decode%28data%3A+%26%5Bu8%5D%29+-%3E+String+%7B%0A++++let+mut+output+%3D+vec%21%5B%5D%3B%0A++++let+mut+iter+%3D+data.iter%28%29%3B%0A++++while+let+Some%28byte%29+%3D+iter.next%28%29+%7B%0A++++++++match+byte+%26+0x80+%7B%0A++++++++++++%2F%2F+LIT%0A++++++++++++0x00+%3D%3E+%280..%28byte+%2B+1%29%29.for_each%28%7C_%7C+output.push%28*iter.next%28%29.unwrap%28%29%29%29%2C%0A++++++++++++%2F%2F+CPY%0A++++++++++++_+%3D%3E+%7B%0A++++++++++++++++let+%28length%2C+offset%29+%3D+match+byte+%26+0x40+%7B%0A++++++++++++++++++++0x00+%3D%3E+%28%0A++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%2B+4%2C%0A++++++++++++++++++++++++usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%2C%0A++++++++++++++++++++%29%2C%0A++++++++++++++++++++_+%3D%3E+%7B%0A++++++++++++++++++++++++let+length%3A+u16+%3D%0A++++++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%3C%3C+8+%7C+u16%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29%3B%0A++++++++++++++++++++++++%28length+%2B+4%2C+usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%29%0A++++++++++++++++++++%7D%0A++++++++++++++++%7D%3B%0A++++++++++++++++%280..length%29.for_each%28%7C_%7C+output.push%28output%5Boutput.len%28%29+-+offset%5D%29%29%0A++++++++++++%7D%0A++++++++%7D%0A++++%7D%0A++++std%3A%3Astr%3A%3Afrom_utf8%28%26output%29.unwrap%28%29.to_string%28%29%0A%7D%0A%0Aconst+ENCODED_DATA%3A+%26%5Bu8%5D+%3D+%26%5B%0A++++40%2C+66%2C+108%2C+117%2C+101%2C+32%2C+108%2C+105%2C+107%2C+101%2C+32%2C+109%2C+121%2C+32%2C+99%2C+111%2C+114%2C+118%2C+101%2C+116%2C%0A++++116%2C+101%2C+32%2C+105%2C+116%2C+115%2C+32%2C+105%2C+110%2C+32%2C+97%2C+110%2C+100%2C+32%2C+111%2C+117%2C+116%2C+115%2C+105%2C+100%2C%0A++++101%2C+10%2C+129%2C+40%2C+35%2C+97%2C+114%2C+101%2C+32%2C+116%2C+104%2C+101%2C+32%2C+119%2C+111%2C+114%2C+100%2C+115%2C+32%2C+73%2C+32%2C%0A++++115%2C+97%2C+121%2C+10%2C+65%2C+110%2C+100%2C+32%2C+119%2C+104%2C+97%2C+116%2C+32%2C+73%2C+32%2C+116%2C+104%2C+105%2C+110%2C+107%2C%0A++++138%2C+41%2C+9%2C+102%2C+101%2C+101%2C+108%2C+105%2C+110%2C+103%2C+115%2C+10%2C+84%2C+128%2C+34%2C+6%2C+108%2C+105%2C+118%2C+101%2C+32%2C%0A++++105%2C+110%2C+128%2C+80%2C+23%2C+32%2C+109%2C+101%2C+10%2C+73%2C+39%2C+109%2C+32%2C+98%2C+108%2C+117%2C+101%2C+10%2C+68%2C+97%2C+32%2C%0A++++98%2C+97%2C+32%2C+100%2C+101%2C+101%2C+32%2C+100%2C+130%2C+9%2C+0%2C+105%2C+181%2C+18%2C%0A%5D%3B%0A%0Aconst+DECODED_DATA%3A+%26str+%3D+%22Blue+like+my+corvette+its+in+and+outside%0ABlue+are+the+words+I+say%0AAnd+what+I+think%0ABlue+are+the+feelings%0AThat+live+inside+me%0AI%27m+blue%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%22%3B%0A%0Afn+main%28%29+%7B%0A++++assert_eq%21%28decode%28ENCODED_DATA%29%2C+DECODED_DATA%29%3B%0A%7D%0A

Rust Playground

A browser interface to the Rust compiler to experiment with the language

@neauoire @cancel I ran into only one issue, and it was a rust-specific thing about u8 to u16 conversion. I would apply some operation to u8 values instead of to their u16 representation, losing bits in the process.

@neauoire @cancel a bit more cleaned up:

https://play.rust-lang.org/?version=stable&mode=release&edition=2021&code=fn+decode%28data%3A+%26%5Bu8%5D%29+-%3E+String+%7B%0A++++let+mut+output+%3D+vec%21%5B%5D%3B%0A++++let+mut+iter+%3D+data.iter%28%29%3B%0A++++while+let+Some%28byte%29+%3D+iter.next%28%29+%7B%0A++++++++match+byte+%26+0x80+%7B%0A++++++++++++%2F%2F+LIT%0A++++++++++++0x00+%3D%3E+%280..%28byte+%2B+1%29%29.for_each%28%7C_%7C+output.push%28*iter.next%28%29.unwrap%28%29%29%29%2C%0A++++++++++++%2F%2F+CPY%0A++++++++++++_+%3D%3E+%7B%0A++++++++++++++++let+%28length%2C+offset%29+%3D+match+byte+%26+0x40+%7B%0A++++++++++++++++++++0x00+%3D%3E+%28%0A++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%2B+4%2C%0A++++++++++++++++++++++++usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%2C%0A++++++++++++++++++++%29%2C%0A++++++++++++++++++++_+%3D%3E+%7B%0A++++++++++++++++++++++++let+length%3A+u16+%3D%0A++++++++++++++++++++++++++++u16%3A%3Afrom%28byte+%26+0x3f%29+%3C%3C+8+%7C+u16%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29%3B%0A++++++++++++++++++++++++%28length+%2B+4%2C+usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%29%0A++++++++++++++++++++%7D%0A++++++++++++++++%7D%3B%0A++++++++++++++++%280..length%29.for_each%28%7C_%7C+output.push%28output%5Boutput.len%28%29+-+offset%5D%29%29%0A++++++++++++%7D%0A++++++++%7D%0A++++%7D%0A++++std%3A%3Astr%3A%3Afrom_utf8%28%26output%29.unwrap%28%29.to_string%28%29%0A%7D%0A%0Aconst+ENCODED_DATA%3A+%26%5Bu8%5D+%3D+%26%5B%0A++++40%2C+66%2C+108%2C+117%2C+101%2C+32%2C+108%2C+105%2C+107%2C+101%2C+32%2C+109%2C+121%2C+32%2C+99%2C+111%2C+114%2C+118%2C+101%2C+116%2C%0A++++116%2C+101%2C+32%2C+105%2C+116%2C+115%2C+32%2C+105%2C+110%2C+32%2C+97%2C+110%2C+100%2C+32%2C+111%2C+117%2C+116%2C+115%2C+105%2C+100%2C%0A++++101%2C+10%2C+129%2C+40%2C+35%2C+97%2C+114%2C+101%2C+32%2C+116%2C+104%2C+101%2C+32%2C+119%2C+111%2C+114%2C+100%2C+115%2C+32%2C+73%2C+32%2C%0A++++115%2C+97%2C+121%2C+10%2C+65%2C+110%2C+100%2C+32%2C+119%2C+104%2C+97%2C+116%2C+32%2C+73%2C+32%2C+116%2C+104%2C+105%2C+110%2C+107%2C%0A++++138%2C+41%2C+9%2C+102%2C+101%2C+101%2C+108%2C+105%2C+110%2C+103%2C+115%2C+10%2C+84%2C+128%2C+34%2C+6%2C+108%2C+105%2C+118%2C+101%2C+32%2C%0A++++105%2C+110%2C+128%2C+80%2C+23%2C+32%2C+109%2C+101%2C+10%2C+73%2C+39%2C+109%2C+32%2C+98%2C+108%2C+117%2C+101%2C+10%2C+68%2C+97%2C+32%2C%0A++++98%2C+97%2C+32%2C+100%2C+101%2C+101%2C+32%2C+100%2C+130%2C+9%2C+0%2C+105%2C+181%2C+18%2C%0A%5D%3B%0A%0Aconst+DECODED_DATA%3A+%26str+%3D+%22Blue+like+my+corvette+its+in+and+outside%0ABlue+are+the+words+I+say%0AAnd+what+I+think%0ABlue+are+the+feelings%0AThat+live+inside+me%0AI%27m+blue%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%22%3B%0A%0Afn+main%28%29+%7B%0A++++assert_eq%21%28decode%28ENCODED_DATA%29%2C+DECODED_DATA%29%3B%0A%7D%0A

Rust Playground

A browser interface to the Rust compiler to experiment with the language

@merlin @cancel do you remember which variable gave you a unit size problem? Was it the offset?

@neauoire @cancel I just looked a bit more into it, and it was way dumber than this: I was doing (byte & 0x3f + 4) for CPY1 length. Order of operation was executing the addition first, hence the issue I was getting. Didn't have to do with u8-to-u16 conversion.

BTW I realize now that I didn't properly copy the link to the "cleaned-up" version, I'll send it again later when I get home.

@merlin @cancel please do :) So, how long would you say it took you to implement, roughly?
@neauoire @cancel around an hour
@neauoire @cancel here is the cleaner version:
https://play.rust-lang.org/?version=stable&mode=release&edition=2021&code=fn+decode%28data%3A+%26%5Bu8%5D%29+-%3E+String+%7B%0A++++let+mut+output+%3D+vec%21%5B%5D%3B%0A++++let+mut+iter+%3D+data.iter%28%29%3B%0A++++while+let+Some%28byte%29+%3D+iter.next%28%29+%7B%0A++++++++match+byte+%26+0x80+%7B%0A++++++++++++%2F%2F+LIT%0A++++++++++++0x00+%3D%3E+%280..%28byte+%2B+1%29%29.for_each%28%7C_%7C+output.push%28*iter.next%28%29.unwrap%28%29%29%29%2C%0A++++++++++++%2F%2F+CPY%0A++++++++++++_+%3D%3E+%7B%0A++++++++++++++++let+length%3A+u16+%3D+match+byte+%26+0x40+%7B%0A++++++++++++++++++++0x00+%3D%3E+u16%3A%3Afrom%28byte+%26+0x3f%29%2C%0A++++++++++++++++++++_+%3D%3E+u16%3A%3Afrom%28byte+%26+0x3f%29+%3C%3C+8+%7C+u16%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29%2C%0A++++++++++++++++%7D+%2B+4%3B%0A++++++++++++++++let+offset+%3D+usize%3A%3Afrom%28*iter.next%28%29.unwrap%28%29%29+%2B+1%3B%0A++++++++++++++++%280..length%29.for_each%28%7C_%7C+output.push%28output%5Boutput.len%28%29+-+offset%5D%29%29%0A++++++++++++%7D%0A++++++++%7D%0A++++%7D%0A++++std%3A%3Astr%3A%3Afrom_utf8%28%26output%29.unwrap%28%29.to_string%28%29%0A%7D%0A%0Aconst+ENCODED_DATA%3A+%26%5Bu8%5D+%3D+%26%5B%0A++++40%2C+66%2C+108%2C+117%2C+101%2C+32%2C+108%2C+105%2C+107%2C+101%2C+32%2C+109%2C+121%2C+32%2C+99%2C+111%2C+114%2C+118%2C+101%2C+116%2C%0A++++116%2C+101%2C+32%2C+105%2C+116%2C+115%2C+32%2C+105%2C+110%2C+32%2C+97%2C+110%2C+100%2C+32%2C+111%2C+117%2C+116%2C+115%2C+105%2C+100%2C%0A++++101%2C+10%2C+129%2C+40%2C+35%2C+97%2C+114%2C+101%2C+32%2C+116%2C+104%2C+101%2C+32%2C+119%2C+111%2C+114%2C+100%2C+115%2C+32%2C+73%2C+32%2C%0A++++115%2C+97%2C+121%2C+10%2C+65%2C+110%2C+100%2C+32%2C+119%2C+104%2C+97%2C+116%2C+32%2C+73%2C+32%2C+116%2C+104%2C+105%2C+110%2C+107%2C%0A++++138%2C+41%2C+9%2C+102%2C+101%2C+101%2C+108%2C+105%2C+110%2C+103%2C+115%2C+10%2C+84%2C+128%2C+34%2C+6%2C+108%2C+105%2C+118%2C+101%2C+32%2C%0A++++105%2C+110%2C+128%2C+80%2C+23%2C+32%2C+109%2C+101%2C+10%2C+73%2C+39%2C+109%2C+32%2C+98%2C+108%2C+117%2C+101%2C+10%2C+68%2C+97%2C+32%2C%0A++++98%2C+97%2C+32%2C+100%2C+101%2C+101%2C+32%2C+100%2C+130%2C+9%2C+0%2C+105%2C+181%2C+18%2C%0A%5D%3B%0A%0Aconst+DECODED_DATA%3A+%26str+%3D+%22Blue+like+my+corvette+its+in+and+outside%0ABlue+are+the+words+I+say%0AAnd+what+I+think%0ABlue+are+the+feelings%0AThat+live+inside+me%0AI%27m+blue%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%0ADa+ba+dee+da+ba+di%22%3B%0A%0Afn+main%28%29+%7B%0A++++assert_eq%21%28decode%28ENCODED_DATA%29%2C+DECODED_DATA%29%3B%0A%7D%0A
Rust Playground

A browser interface to the Rust compiler to experiment with the language

@neauoire This is my interpretation, is it correct?

0 LIT:7 <up to 2^7-1 bytes which are not commands>
10 CPY1:6 < copy up to 2^6-1 bytes from offset; offset is a byte >
11 CPY2:14 < copy up to 2^14-1 bytes from offset; offset is a byte >

@cancel

@neauoire @cancel I've made a little implementation in Go : https://github.com/max22-/ulz-go
it was a little bit difficult to understand that you can copy data even if the length goes past the end of the output buffer (i did an equivalent of memcpy, and it didn't work ^^)
GitHub - max22-/ulz-go: An implementation of ulz (de)compression in go

An implementation of ulz (de)compression in go. Contribute to max22-/ulz-go development by creating an account on GitHub.

GitHub
@maxime_andre @cancel ah! that trips many people, how would you explain this so others aren't tripped by it?
@neauoire @cancel 🤔 maybe a little drawing ?
@neauoire @cancel Nice! But the encoded data doesn't use CPY2 does it? I expected the example to guide me towards a full implementation, so I was surprised to see the full text after only LIT and CPY1! :)
@nicolagi @neauoire Use my reference implementation to create whatever tests you want. It has the encoder. https://github.com/randrew/uxn32/blob/master/uxn_lz.c
uxn32/uxn_lz.c at master · randrew/uxn32

Uxn emulator for Windows and Wine. Contribute to randrew/uxn32 development by creating an account on GitHub.

GitHub
@nicolagi @cancel ah you're right, I chose a segment that's too small. I'll fix the example :) thanks for pointing that out.
@neauoire No need to update the example! My message was a way of sharing my enthusiasm for working through your puzzle. Just showing appreciation and checking I didn't miss something; not intending to criticize or create work... sorry if it came across differently!
@nicolagi nono it's all good :) I've since added a longer example to the repo(as to not clog the documentatio) It's good, I wanted to have a way to benchmark CPY2 as well!

@neauoire @cancel

Have you considered including 'magic-bytes' with your ULZ file format?

...

The usage of it would be —

If someone doesn't know what the file-name is, they could still determine the type of the file.

...

It could be something as simple as the first by 7 bytes of the file format being:

55 4C 5A 2F 31 0D 0A

Which, if interpreted as ASCII or Unicode UTF-8 would be:

"ULZ/1\r\n"

...

Here are magic bytes for other file formats:

https://en.wikipedia.org/wiki/List_of_file_signatures

.

List of file signatures - Wikipedia

@reiver @cancel it will be used as parts of other formats, which might have identifiers yes :)
@neauoire @cancel really enjoyed this! when a video of it popped up on YouTube I read the wiki
@gingerbeardman @neauoire there's a youtube video?!
@cancel @gingerbeardman you've seen it, the little video where you can watch the compression data :)