Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 code units (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence among these measures relates to Bluesky's "unusual" post limit rules.
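For concreteness, here's how three of the four measures come apart in plain Python (a quick sketch; grapheme-cluster counting needs UAX #29 segmentation, which isn't in the standard library, so it's left out here):

```python
# "café" written with a combining acute accent instead of the
# precomposed é, so the length measures all disagree.
s = "cafe\u0301"

n_bytes      = len(s.encode("utf-8"))           # 6 bytes
n_utf16      = len(s.encode("utf-16-le")) // 2  # 5 code units (what JS's .length counts)
n_codepoints = len(s)                           # 5 (Python strings are codepoint sequences)

print(n_bytes, n_utf16, n_codepoints)  # 6 5 5 (and 4 grapheme clusters on screen)
```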


Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: You can fit 3000 bytes OR 300 graphemes, whichever limit you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17 (!) bytes, thanks to stringing together a base emoji, a skin tone modifier, a ZWJ, a gender sign, and a variation selector.
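You can verify the 17-byte figure yourself; a quick Python sketch, spelling the shrug out codepoint by codepoint:

```python
# 🤷🏾‍♀️ = person shrugging + skin tone + ZWJ + female sign + variation selector
shrug = "\U0001F937\U0001F3FE\u200D\u2640\uFE0F"

for cp in shrug:
    print(f"U+{ord(cp):04X}: {len(cp.encode('utf-8'))} bytes")

total = len(shrug.encode("utf-8"))
print(total)  # 4 + 4 + 3 + 3 + 3 = 17 bytes, in 5 codepoints, for 1 grapheme
```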

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest— has the *largest* ratio of byte length to grapheme length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean— oh, but this one's *complicated*, so I'll have to explain in the next post—

So, in Korean Hangul, syllabic characters are formed by arranging 67 "jamo" in a grid. Unicode decided to represent this by working out every legal combination of jamo (11,172 of them) and assigning each one a codepoint. But this creates a problem:
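The precomposed block is laid out algorithmically, so "every legal combination" reduces to a single arithmetic formula (this is the standard Unicode Hangul syllable composition: 19 leading consonants × 21 vowels × 28 trailing slots, counting "no trailing consonant", gives the 11,172 syllables starting at U+AC00):

```python
def hangul_syllable(lead: int, vowel: int, tail: int = 0) -> str:
    """Compose a precomposed Hangul syllable from modern jamo indices."""
    # lead: 0-18, vowel: 0-20, tail: 0-27 (0 means no trailing consonant)
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

# 한 = ㅎ (lead 18) + ㅏ (vowel 0) + ㄴ (tail 4)
print(hangul_syllable(18, 0, 4))  # 한 (U+D55C)
print(hangul_syllable(0, 0))      # 가 (U+AC00), the first syllable in the block
```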

https://en.wikipedia.org/wiki/Obsolete_Hangul_jamo

Adoption of Hangul came in fits and starts. There are thirty-ish jamo that are attested in pre-1945 waves of Hangul usage but didn't survive into modern use. Unicode didn't want to waste codepoints on these. So they did something hacky. (1/3)


To support encoding of older Korean texts, the Unicode body has a concept of "conjoining" jamo:

https://www.rfc-editor.org/rfc/rfc5892#section-2.9

This leads to something I was posting about last night— these "conjoining" jamo are discouraged in some circumstances because they mean there can be multiple different ways of encoding a single visible character:

https://mastodon.social/@mcc/116100748088621894

I suspect conjoining jamo give you a very high byte-to-grapheme ratio. But this raises a question: Are these texts "natural"? (2/3)
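A rough way to see the ratio (Python sketch; `unicodedata` is stdlib, and I'm assuming the platform counts extended grapheme clusters, under which a conjoining-jamo sequence is still one cluster):

```python
import unicodedata

composed   = "한"                  # U+D55C, one precomposed syllable
conjoining = "\u1112\u1161\u11AB"  # ᄒ + ᅡ + ᆫ as three conjoining jamo

# Canonically the same character: NFC folds the jamo back into the syllable.
assert unicodedata.normalize("NFC", conjoining) == composed

print(len(composed.encode("utf-8")))    # 3 bytes
print(len(conjoining.encode("utf-8")))  # 9 bytes for the same single grapheme
```

Note that three modern-style jamo at 3 UTF-8 bytes apiece only gets you to 9 bytes per grapheme, just shy of the 10:1 needed to hit the byte limit first.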


The point of this challenge is "could someone not intentionally trying to beat this challenge, beat this challenge?".

To my knowledge, nobody writing modern Korean would use the conjoining hangul. And the archaic jamo are *pretty* archaic; I doubt they'd get used in real speech. So our candidates for the challenge become:

- Actual pre-1700 texts;
- The Jeju language, spoken by 5,000 people as of 2014, which uses Hangul with the otherwise-lost ㆍ jamo. I don't know if this jamo conjoins.

(3/3)

Update: Once again my limited knowledge of Indian-subcontinent languages has bitten me!

@mal3aby points out that Hindi, because of how ligatures work, produces an extreme UTF-8-byte-to-grapheme ratio in totally natural text:

https://mastodon.smears.org/@mal3aby/116105194257298934

Unfortunately the highest-ratio word they thought of, पास्त्रामी ("Pastrami") *hits* a ratio of 10:1. To break Bluesky, we need to *exceed* 10:1. Hindi so far AFAIK does *best* at this challenge, but can it break Bluesky? Any Hindi speakers care to help?
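The arithmetic behind that 10:1 figure, for anyone who wants to check (Python sketch; the grapheme count of 3 is the visual segmentation पा / स्त्रा / मी, which Devanagari-aware grapheme segmentation should report, though exact cluster rules vary by Unicode version and library):

```python
word = "पास्त्रामी"  # "pastrami" in Devanagari

print(len(word))                  # 10 codepoints
print(len(word.encode("utf-8")))  # 30 bytes (every codepoint here is 3 bytes in UTF-8)
# Rendered as 3 visual units (पा स्त्रा मी), that's roughly 10 bytes per grapheme.
```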


@[email protected] (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)


Further updates—

- My speculations about Vietnamese do not seem to have been borne out. Unicode seems to have granted the standard Vietnamese diacritic combinations their own codepoints. https://en.wikipedia.org/wiki/Vietnamese_language_and_computers

- Thai's diacritics do produce a higher codepoint-to-grapheme ratio than Vietnamese, but it can't keep up with Hindi. https://bsky.app/profile/did:plc:7qtwfjtfw4xkr6ny7ckqxa7j/post/3mfd6xuzat22x

- This doesn't help you but a 2023 investigation concluded the longest-by-byte Unicode grapheme is 👨🏻‍❤️‍💋‍👨🏻, at 35 bytes. https://machs.space/posts/whats-the-max-valid-length-of-an-emoji/
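The Vietnamese point is easy to check with stdlib normalization (a sketch; ệ stands in for the general pattern):

```python
import unicodedata

precomposed = "\u1EC7"         # ệ has its own codepoint, U+1EC7
decomposed  = "e\u0323\u0302"  # e + combining dot below + combining circumflex

# NFC folds the combining marks into the single precomposed codepoint.
assert unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed.encode("utf-8")))  # 3 bytes
print(len(decomposed.encode("utf-8")))   # 5 bytes, what it would cost without precomposition
```

So ordinary Vietnamese tops out around 3 bytes per grapheme, nowhere near 10:1.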


Further further updates:

There IS a way you can wind up with decomposed jamo in modern Korean text. And it's Apple's fault?! :O

https://mastodon.social/@mwh/116106825790571446

@mcc are you familiar with Unicode normalization? There are specific ways to do this for characters that can be represented as either a single codepoint or multiple conjoining codepoints. Most infuriating of all, some of these are baked into filesystems.
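To make that concrete, here's NFD pulling a modern syllable apart into conjoining jamo (Python sketch; HFS+-era Apple filesystems famously store filenames in a variant of NFD, which is how decomposed jamo can sneak into modern Korean text):

```python
import unicodedata

s = "각"  # one precomposed syllable (U+AC01), 3 bytes in UTF-8

nfd = unicodedata.normalize("NFD", s)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+1100', 'U+1161', 'U+11A8']
print(len(nfd.encode("utf-8")))          # 9 bytes, still one visible character
```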

https://www.unicode.org/reports/tr15/


@vikxin I'm familiar, yes. I think to follow the spirit of the challenge the text doesn't necessarily need to be normalized, but it really should be something conceivably produced by a real-world IME, not a contrived combination of codepoints specifically designed to beat the challenge (e.g. my concluding that writing modern Hangul with the conjoining jamo is illegitimate).
@mcc I cannot help (Arabic generally has a ratio of 2:1) but I am obsessed with this thread
@mcc To be clear that's not so much "the best word I can think of", so much as "literally the first word that came into my head with a consonant cluster and a non-a vowel". As in, I wasn't trying very hard, so my hopes that someone with actual knowledge can do better are high 🙂
@mal3aby based on this reply I removed the word "could"
@mcc Thinking about it, I'm pretty sure other Indian-subcontinent writing systems are structurally pretty similar, so may also be good candidates!

@mcc disclaimer is I'm extremely not educated in this: I wonder if Classical Manchu is a candidate. I know less than 0 Manchurian, and I went to a Manchurian primary school. I am ashamed of my existence. (Unlike my lack of Mongolian/Māori knowledge or piss-poor English, this is solidly NOT my fault.)

Alternatively, this is me trolling. The "biang" in biang biang noodles. I hope the Wikipedia page will explain why it's not in Unicode, AND why we are collectively trolling here. https://en.wikipedia.org/wiki/Biangbiang_noodles


@BigShellEvent 🤯 I studied Mandarin for two years so hanzi was the bane of my existence but/so I don't know if I'm relieved or disappointed they never tried to teach us that one. It certainly would have put all the others into perspective....


@BigShellEvent @mcc But it is in Unicode: 𰻞𰻞麵 (That may appear as a placeholder on some devices given the obscurity, but it shows up correctly on my phone). But it still only takes up 4 bytes per character, so no dice there.

@becomethewaifu @BigShellEvent yeah one thing is that in general the more "famous" something is, the more likely it is to have its own codepoint.

However, characters like this might help if one were trying to design a block of hanzi or hanzi-derivative text that hits as many four-byte characters as possible (the more common Chinese characters, like 字, are down in the three-byte range).
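For reference, the byte boundaries (Python sketch): UTF-8 spends 3 bytes on codepoints from U+0800 through U+FFFF and 4 bytes above that, so common BMP hanzi cost 3 while the supplementary CJK extension blocks cost 4.

```python
# A common BMP hanzi vs. biáng from a supplementary CJK extension block.
for ch in ["字", "𰻞"]:
    print(f"U+{ord(ch):X}: {len(ch.encode('utf-8'))} bytes in UTF-8")
```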