Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence between bytes and graphemes relates to Bluesky's "unusual" post limit rules.

[Image: Unicode demo]

Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: you can fit 3000 bytes OR 300 graphemes, whichever limit you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17 (!) bytes, thanks to stringing together an emoji, a skin tone modifier, a ZWJ, a gender sign, and a variation selector.
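The 17 bytes are easy to verify by walking the string codepoint by codepoint (a Node sketch; iterating a JS string with for…of yields one codepoint at a time):

```javascript
// Where do 🤷🏾‍♀️'s 17 UTF-8 bytes come from? Walk it codepoint by codepoint.
const enc = new TextEncoder();
let total = 0;
for (const cp of "🤷🏾‍♀️") {
  const n = enc.encode(cp).length;
  total += n;
  console.log("U+" + cp.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"), n, "bytes");
}
console.log("total:", total);
// U+1F937 (person shrugging) and U+1F3FE (skin tone) take 4 bytes each;
// U+200D (ZWJ), U+2640 (female sign), and U+FE0F (variation selector) take 3 each.
```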

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte length to grapheme length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean... oh, but this one's *complicated*, so I'll have to explain in the next post.

@mcc Hindi's got to be pretty good. "नमस्ते" (namaste) clocks in at 18 bytes for 3 gcs; of that, the "-ste" alone (स्ते) is 12 bytes for 1 gc. "पास्त्रामी" (pastrami - I'm just thinking of random words here) is 30 bytes for still just 3 gcs. I'm sure someone who actually knows Hindi could do much better here!
@mcc (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)
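The Devanagari numbers above are easy to check mechanically (another Node sketch; a caveat is that whether स्ते counts as 1 grapheme or 2 depends on which Unicode version your runtime's segmenter implements, since Unicode 15.1 added a rule joining consonant-virama-consonant sequences into one cluster):

```javascript
// Byte-to-grapheme ratios for some Devanagari strings.
const seg = new Intl.Segmenter("hi", { granularity: "grapheme" });
const utf8Bytes = (s) => new TextEncoder().encode(s).length;
const graphemeCount = (s) => [...seg.segment(s)].length;

for (const s of ["नमस्ते", "स्ते", "पास्त्रामी"]) {
  const b = utf8Bytes(s), g = graphemeCount(s);
  console.log(s, "=", b, "bytes /", g, "graphemes =", (b / g).toFixed(1), "bytes per grapheme");
}
```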
@mal3aby @mcc But then, Hindi uses Devanagari, which is (rightfully) in the BMP. I guess the same strategy would work even better with the ancient and/or minority Indic scripts from the SMP: they share essentially the same encoding model, since the scripts are related, complete with the grapheme clusters, but their codepoints sit in the U+11xxx range and so need 4 bytes per codepoint in UTF-8.
@fgrosshans @mcc Ooh, good point! I'd twigged that the other Indic scripts would also do well, having the same structure, but it hadn't occurred to me that they have a bytes advantage too. (Even as I was sitting there thinking Arabic had an unfair disadvantage, with its characters taking only 2 UTF-8 bytes instead of 3!)