Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 recording)
- UTF-16 (irrelevant except in JS, where it's relevant)
- Codepoints (unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence of the two relates to Bluesky's "unusual" post limit rules.

Unicode demo

Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: You can fit 3000 bytes OR 300 graphemes, whichever is less. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17! bytes, thanks to stringing an emoji, a skintone, a gender, and two ZWJs.

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest— has *largest* ratio of byte-length to grapheme-length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic korean— oh, but this one's *complicated*, so I'll have to explain in the next post—

@mcc wait how is this the second time today archaic Korean scripts are relevant
@operand Was the other one a post I wrote
@mcc yeah the one about pre-1700 Korean characters not being allowed in AP handles!
@operand Both these questions came out of one rust project!