Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence between bytes and graphemes relates to Bluesky's "unusual" post limit rules.

[Image: Unicode demo]

Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: you can fit 3000 bytes OR 300 graphemes, whichever limit you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17 (!) bytes, thanks to stringing together an emoji, a skin tone modifier, a ZWJ, a gender sign, and a variation selector.
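The 17 bytes are easy to verify by walking the string codepoint by codepoint (a Node sketch; iterating a JS string with for…of yields one codepoint at a time):

```javascript
// Where do 🤷🏾‍♀️'s 17 UTF-8 bytes come from? Walk it codepoint by codepoint.
const enc = new TextEncoder();
let total = 0;
for (const cp of "🤷🏾‍♀️") {
  const n = enc.encode(cp).length;
  total += n;
  console.log("U+" + cp.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"), n, "bytes");
}
console.log("total:", total);
// U+1F937 (person shrugging) and U+1F3FE (skin tone) take 4 bytes each;
// U+200D (ZWJ), U+2640 (female sign), and U+FE0F (variation selector) take 3 each.
```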

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte length to grapheme length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean... oh, but this one's *complicated*, so I'll have to explain in the next post.

@mcc Hindi's got to be pretty good. "नमस्ते" (namaste) clocks in at 18 bytes for 3 gcs; of that, the "-ste" alone (स्ते) is 12 bytes for 1 gc. "पास्त्रामी" (pastrami - I'm just thinking of random words here) is 30 bytes for still just 3 gcs. I'm sure someone who actually knows Hindi could do much better here!
@mcc (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)
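The Devanagari numbers above are easy to check mechanically (another Node sketch; a caveat is that whether स्ते counts as 1 grapheme or 2 depends on which Unicode version your runtime's segmenter implements, since Unicode 15.1 added a rule joining consonant-virama-consonant sequences into one cluster):

```javascript
// Byte-to-grapheme ratios for some Devanagari strings.
const seg = new Intl.Segmenter("hi", { granularity: "grapheme" });
const utf8Bytes = (s) => new TextEncoder().encode(s).length;
const graphemeCount = (s) => [...seg.segment(s)].length;

for (const s of ["नमस्ते", "स्ते", "पास्त्रामी"]) {
  const b = utf8Bytes(s), g = graphemeCount(s);
  console.log(s, "=", b, "bytes /", g, "graphemes =", (b / g).toFixed(1), "bytes per grapheme");
}
```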
@mal3aby @mcc But then, Hindi uses Devanagari, which is (rightfully) in the BMP. I guess the same strategy would work even better with the ancient and/or minority Indic scripts from the SMP: they share essentially the same encoding model, since the scripts are related, complete with the grapheme clusters, but their codepoints sit in the U+11xxx range and so need 4 bytes per codepoint in UTF-8.
@fgrosshans @mcc Ooh, good point! I'd twigged that the other Indic scripts would also do well, having the same structure, but it hadn't occurred to me that they have a bytes advantage too. (Even as I was sitting there thinking Arabic had an unfair disadvantage, with its characters taking only 2 UTF-8 bytes instead of 3!)