Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 code units (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence between these measures relates to Bluesky's "unusual" post limit rules.
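
If you want to reproduce those counts outside the visualizer, here's a minimal Rust sketch. Using the unicode-segmentation crate for the grapheme count is an assumption on my part (it's the same crate that comes up later in this thread); everything else is in std:

```rust
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "🤷🏾‍♀️";
    println!("UTF-8 bytes:       {}", s.len());                   // 17
    println!("UTF-16 code units: {}", s.encode_utf16().count());  // 7
    println!("codepoints:        {}", s.chars().count());         // 5
    println!("graphemes:         {}", s.graphemes(true).count()); // 1
}
```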

Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: you can fit 3000 bytes OR 300 graphemes, whichever you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17 (!) bytes, thanks to stringing together an emoji, a skin tone modifier, a ZWJ, a gender sign, and a variation selector.
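
For the curious, here's where those 17 bytes go; a quick sketch, with the byte widths being just standard UTF-8 arithmetic:

```rust
fn main() {
    let parts = [
        ('\u{1F937}', "person shrugging"),
        ('\u{1F3FE}', "skin tone modifier"),
        ('\u{200D}', "zero-width joiner"),
        ('\u{2640}', "female sign"),
        ('\u{FE0F}', "variation selector-16"),
    ];
    let total: usize = parts.iter().map(|(c, _)| c.len_utf8()).sum();
    for (c, name) in parts {
        println!("U+{:05X} {:21} {} bytes", c as u32, name, c.len_utf8());
    }
    println!("total: {total} bytes"); // 4 + 4 + 3 + 3 + 3 = 17
}
```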

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte-length to grapheme-length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean… oh, but this one's *complicated*, so I'll have to explain in the next post.

So, in Korean Hangul, syllabic characters are formed by arranging 67 "jamo" in a grid. Unicode decided to represent this by working out every legal combination of jamo and assigning each one a codepoint. But this creates a problem:

https://en.wikipedia.org/wiki/Obsolete_Hangul_jamo

Adoption of Hangul came in fits and starts. There are thirty-ish jamo that are attested in pre-1945 waves of Hangul but didn't survive into modern use. Unicode didn't want to waste codepoints on every combination involving these. So they did something hacky. (1/3)
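
(Side note for implementers: the precomposed-syllable scheme is pure arithmetic. A minimal sketch of the standard composition formula, using the usual jamo indices:)

```rust
// Hangul syllable composition: 19 leading consonants × 21 vowels ×
// 28 trailing choices (including "none") = 11,172 precomposed codepoints,
// starting at U+AC00.
fn compose(leading: u32, vowel: u32, trailing: u32) -> Option<char> {
    char::from_u32(0xAC00 + (leading * 21 + vowel) * 28 + trailing)
}

fn main() {
    // ᄒ (leading 18) + ᅡ (vowel 0) + ᆫ (trailing 4) = 한 (U+D55C)
    assert_eq!(compose(18, 0, 4), Some('한'));
}
```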

To support encoding of older Korean texts, the Unicode standard has a concept of "conjoining" jamo:

https://www.rfc-editor.org/rfc/rfc5892#section-2.9

This leads to something I was posting about last night— these "conjoining" jamo are discouraged in some circumstances because they mean there can be multiple different ways of encoding a single visible character:

https://mastodon.social/@mcc/116100748088621894

I suspect conjoined jamo give you a very high byte-grapheme ratio. But this raises a question: Are these texts "natural"? (2/3)
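
(A minimal sketch of that multiple-encodings problem, assuming the unicode-normalization and unicode-segmentation crates:)

```rust
use unicode_normalization::UnicodeNormalization;
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let precomposed = "한";                               // one codepoint, U+D55C
    let conjoining: String = precomposed.nfd().collect(); // U+1112 U+1161 U+11AB
    assert_ne!(precomposed, conjoining);                  // different encodings...
    assert_eq!(conjoining.graphemes(true).count(), 1);    // ...same visible character
    assert_eq!(precomposed.len(), 3);                     // 3 UTF-8 bytes
    assert_eq!(conjoining.len(), 9);                      // 9 UTF-8 bytes
}
```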

The point of this challenge is "could someone not intentionally trying to beat this challenge, beat this challenge?".

To my knowledge, nobody writing modern Korean would use the conjoining jamo. And the archaic jamo are *pretty* archaic; I doubt they'd get used in real speech. So our candidates for the challenge become:

- Actual pre-1700 texts;
- The Jeju language, spoken by 5,000 people as of 2014, which uses Hangul with the otherwise-lost ㆍ jamo. I don't know if this jamo conjoins.

(3/3)

Update: Once again my limited knowledge of Indian-subcontinent languages has bitten me!

@mal3aby points out that Hindi, because of how ligatures work, produces an extreme UTF-8-byte-to-grapheme ratio in totally natural text:

https://mastodon.smears.org/@mal3aby/116105194257298934

Unfortunately the highest-ratio word they thought of, पास्त्रामी ("Pastrami"), *hits* a ratio of exactly 10:1. To break Bluesky, we need to *exceed* 10:1. Hindi so far AFAIK does *best* at this challenge, but can it break Bluesky? Any Hindi speakers care to help?
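
(If you want to test candidate words yourself, here's the minimal ratio check, again assuming the unicode-segmentation crate:)

```rust
use unicode_segmentation::UnicodeSegmentation;

/// UTF-8 bytes per extended grapheme cluster; the number to beat is 10.
fn byte_grapheme_ratio(s: &str) -> f64 {
    s.len() as f64 / s.graphemes(true).count() as f64
}

fn main() {
    // 30 bytes over 3 grapheme clusters (per Unicode 15.1+ segmentation) = 10.0
    println!("{}", byte_grapheme_ratio("पास्त्रामी"));
}
```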

Further updates—

- My speculations about Vietnamese do not seem to have been borne out. Unicode seems to have granted the standard Vietnamese diacritic combinations their own codepoints. https://en.wikipedia.org/wiki/Vietnamese_language_and_computers

- Thai's diacritics do produce a higher codepoint-to-grapheme ratio than Vietnamese, but it can't keep up with Hindi. https://bsky.app/profile/did:plc:7qtwfjtfw4xkr6ny7ckqxa7j/post/3mfd6xuzat22x

- This doesn't help you, but a 2023 investigation concluded the longest-by-byte Unicode grapheme is 👨🏻‍❤️‍💋‍👨🏻, at 35 bytes. https://machs.space/posts/whats-the-max-valid-length-of-an-emoji/

Further further updates:

There IS a way you can wind up with decomposed jamo in modern Korean text. And it's Apple's fault?! :O

https://mastodon.social/@mwh/116106825790571446

@mcc are you familiar with Unicode normalization? There are specific ways to do this for characters that can be represented as either a single codepoint or multiple conjoining codepoints. Most infuriating of all, some of these are baked into filesystems.

https://www.unicode.org/reports/tr15/

@vikxin I'm familiar, yes. I think to follow the spirit of the challenge the text doesn't necessarily need to be normalized, but it really should be something conceivably produced by a real-world IME, not a contrived combination of codepoints specifically designed to beat the challenge (e.g. my concluding that writing modern hangul with the conjoining jamo is illegitimate).
@mcc I cannot help (Arabic generally has a ratio of 2:1) but I am obsessed with this thread
@mcc To be clear that's not so much "the best word I can think of", so much as "literally the first word that came into my head with a consonant cluster and a non-a vowel". As in, I wasn't trying very hard, so my hopes that someone with actual knowledge can do better are high 🙂
@mal3aby based on this reply I removed the word "could"
@mcc Thinking about it, I'm pretty sure other Indian-subcontinent writing systems are structurally pretty similar, so may also be good candidates!

@mcc disclaimer is I'm extremely not educated in this: I wonder if Classic Manchurian is a candidate. I know less than 0 Manchurian. I went to a Manchurian primary school. And I am ashamed of my existence. (Unlike my lack of Mongolian/Māori knowledge or piss poor English, this is solidly NOT my fault).

Alternatively, this is me trolling. The "Biang" in biang biang noodles. I hope the Wikipedia page will explain why it's not in unicode, AND why we are collectively trolling here. https://en.wikipedia.org/wiki/Biangbiang_noodles

@BigShellEvent 🤯 I studied Mandarin for two years so hanzi was the bane of my existence but/so I don't know if I'm relieved or disappointed they never tried to teach us that one. It certainly would have put all the others into perspective....

@BigShellEvent @mcc But it is in Unicode: 𰻞𰻞麵 (That may appear as a placeholder on some devices given the obscurity, but it shows up correctly on my phone). But it still only takes up 4 bytes per character, so no dice there.

@becomethewaifu @BigShellEvent yeah one thing is that in general the more "famous" something is, the more likely it is to have its own codepoint.

however, characters like this might help if one were going to try to design a block of hanzi or hanzi-derivative text while hitting as many four-byte characters as possible (the more common Chinese characters, like 字, are down in the three-byte range)
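
(A quick check of those byte widths:)

```rust
fn main() {
    // 字 sits in the BMP (3 UTF-8 bytes); 𰻞 is in CJK Extension G,
    // above U+FFFF, so it takes 4.
    for c in ['字', '𰻞'] {
        println!("{} U+{:X}: {} bytes", c, c as u32, c.len_utf8());
    }
}
```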

@mcc A plausible route is “create a file with this name on macOS, then copy the name out”, which will normalise to NFD on APFS. When you copy the name out afterwards you get the fully decomposed text regardless of what you put in originally. I just checked creating a directory in Finder by pasting in 한국 (2 codepoints), and it copies out as 한국 (6 codepoints, though only 18 bytes UTF-8). That should work for any precomposed characters and copying a path seems “reasonable” to do naturally.
@mwh Okay that is… really surprising!!!
@mwh @mcc it's not just APFS, Apple's filename normalization used the decomposed form in HFS+ too, though I don't know about NFD vs NFKD.
@nicolas17 Yes, HFS+ enforced something close to NFD. APFS is normalisation-insensitive but -preserving and doesn’t change the given name, but won’t allow two canonically-equivalent names to exist at once. Finder is actually doing the normalisation work here — if you use the POSIX APIs to make a file (e.g. at the terminal) you can create NFC or unnormalised names, and then copying those out of Finder will give you the name unchanged. It is an odd set of choices.
@mwh I assume POSIX APIs still won't let you do like, invalid UTF-8?
@nicolas17 It seems like no. The error is “Illegal byte sequence”.
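
(A minimal sketch of what that Finder round-trip does to the encoding, with the unicode-normalization crate standing in for the filesystem's NFD step:)

```rust
use unicode_normalization::UnicodeNormalization;

fn main() {
    let typed = "한국";                         // as typed: 2 precomposed codepoints
    let copied: String = typed.nfd().collect(); // as copied back out: decomposed
    assert_eq!(typed.chars().count(), 2);
    assert_eq!(copied.chars().count(), 6);      // 6 conjoining jamo
    assert_eq!(copied.len(), 18);               // 6 codepoints × 3 UTF-8 bytes each
}
```
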
@mcc very tempted to refer to this hypothetical high byte-grapheme ratio Hangul as “jumbo jamo”
@mcc wait how is this the second time today archaic Korean scripts are relevant
@operand Was the other one a post I wrote
@mcc yeah the one about pre-1700 Korean characters not being allowed in AP handles!
@operand Both these questions came out of one rust project!
@mcc Hindi's got to be pretty good. "नमस्ते" (namaste) clocks in at 18 bytes for 3 gcs; of that the "-ste" alone (स्ते) is 12 bytes for 1gc. "पास्त्रामी" (pastrami - I'm just thinking of random words here) is 30 bytes for still 3 gcs. I'm sure someone who actually knows Hindi could do much better here!
@mcc (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)
@mal3aby @mcc But then, Hindi uses Devanagari, which is (rightfully) in the BMP. I guess the same strategy would work better with the ancient and/or minority Indic scripts from the SMP, which have essentially the same encoding model (since the scripts are related), with grapheme clusters, but codepoints in the range U+11xyz, which need 4 bytes/codepoint in UTF-8
@fgrosshans @mcc Ooh, good point! I'd twigged that the other Indic scripts would also do well, having the same structure, but it hadn't occurred to me that they have a bytes advantage, too. (Even as I was sitting there thinking Arabic had an unfair disadvantage with its characters taking only 2 UTF8 bytes instead of 3!)

@mal3aby @mcc For Hindi (and presumably other Indic language/script combinations) one should not forget clusters commonly used in informal contexts, even if they're not officially correct.

The Unicode proposal "Text Rendering, Input, Search and Processing in Indian Languages" (https://www.unicode.org/L2/L2026/26062-indian-language-feedback.pdf) states:

«Forms like क्यााा in which a vowel sound is exaggerated by repeating the vowel sign multiple times, which is popular in Hindi novels, magazines, as well as on social media»

I now wonder if further exaggeration like क्यााााााााा would seem natural? Aaaaaaaaaaaaaargh! I will probably never know

@fgrosshans @mal3aby @mcc I absolutely love that the answer to this is turning out to be @scream
@mal3aby @mcc Looking through Hindi wordlists, I see स्क्रू ("screw"), with an impressive 18 bytes of UTF-8 to encode a single grapheme.
@mal3aby @mcc The same list (compiled from opensubtitles.org) also has तु्म्ह ("your") at 24 (edit: no, 18) bytes, but Wiktionary lists that as "Old Hindi", so not sure that counts, even if it may have appeared in a Hindi subtitle at some point. https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/hi/hi_full.txt

@kwi @mal3aby i think it counts because a thing a real user might plausibly do is transcribe an antiquated Hindi text onto a modern computer, and plausibly that Hindi text might contain many instances of the word तु्म्ह.

Crossing the 20:1 boundary would actually be very significant, because it would mean we could spam the word, space-separated, many times and pass the 3000-byte boundary! Not a *good* text but not gibberish & closer than we've got yet. However, my own tool puts तु्म्ह at only 19 bytes…?
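
(Back-of-the-envelope for that 20:1 claim: each separating space costs 1 byte and 1 grapheme, so the word itself has to carry more than 19 bytes per grapheme:)

```rust
fn main() {
    // Hypothetical spam: n copies of a single-grapheme word of `word_bytes`
    // UTF-8 bytes, each followed by a 1-byte, 1-grapheme space.
    let (word_bytes, n) = (24, 130);
    let graphemes = n * 2;            // 260: under the 300-grapheme limit
    let bytes = n * (word_bytes + 1); // 3250: over the 3000-byte limit
    println!("{graphemes} graphemes, {bytes} bytes");
}
```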

@mcc @mal3aby Oops, turns out my terminal is struggling with rendering these and cut the word off when I copy-pasted it. The word from the word list is actually "तु्म्हीं", which isn't listed in any online Hindi dictionary I can find.
@kwi @mal3aby Okay NOW we are getting somewhere!!!

@kwi @mal3aby Oh and terrifyingly, while Rust unicode-segmentation counts तु्म्हीं as one grapheme, the Bluesky web client as installed on blacksky.community counts it as two. No idea whether I have just found a bug in Rust unicode-segmentation, the Bluesky client app, or the Bluesky server software/specification! But if it's anything other than the client app I'm actually in trouble! Crud!

EDIT: IT TURNED OUT BLACKSKY IS SEGMENTING UNICODE DIFFERENTLY FROM BLUESKY?!

@mcc I think your post is missing some words so I am not sure what the other grapheme count is; but the relevant rule changed a couple of years ago, so this may be a mismatch in grapheme cluster segmentation versions. See rule GB9c in UAX #29 as of Unicode 15.1: https://www.unicode.org/reports/tr29/tr29-42.html#GB9c.

(Assuming I ran the various segmentation algorithms in my head correctly—a daring assumption, I have a cold—if the count is 3, this is a version mismatch; if the count is 4, it is EGC vs. LGC.)

@eggrobin It looks like what happened is the Bluesky JS frontend was using a busted segmenter until recently: https://github.com/bluesky-social/social-app/pull/9526
@eggrobin Incidentally your mental unicode arithmetic is good; indeed EGC is 1 and LGC is 4. I do not know why Bluesky Social App 116 was giving 2, I assume it was just wrong.

@mcc Yeah two is just wrong for all versions of Unicode for that string.

But then to your earlier question, the actual string seems weird (a virama on a vowel?). The Old Hindi Wiktionary entry mentioned above doesn’t have the first virama, and thus is two (modern) EGCs.

@eggrobin do you think this is a plausible actual archaic form, or could it be a typo in the list we got it from?

@mcc @eggrobin It's definitely a typo. A virama on a vowel is meaningless, the fact that you can do it at all is a feature of Unicode, not of the script.

In Hindi visible viramas tend to be a pedagogical tool or something you use when you don't want to figure out how to write out a cluster. Unicode uses the virama character as an architectural tool for representing conjuncts most of the time.

There's no actual way to write this string. Unicode just lets you construct it.

@manishearth @eggrobin Alright. Well that makes me a lot less worried about the fact I was getting different segmentation results from different segmenters (if natural-but-rare text chokes my code I have a problem, but if *malformed* text chokes the code that's… less bad…)
@kwi @mal3aby @mcc that doesn't work, it has a halant (◌्, indicating no vowel) after the vowel diacritic (◌ु). Searching "तु्म्ह Wiktionary" brings up तुम्ह https://en.wiktionary.org/wiki/%E0%A4%A4%E0%A5%81%E0%A4%AE%E0%A5%8D%E0%A4%B9
@kwi @mal3aby @mcc try deleting characters one by one in that word
@kwi @mal3aby @mcc well स्क्रू me
@mcc How is that remotely a challenge? Just paste any CJK character exactly 100 times and you'll hit the limit.
You don't get more than 3 bytes per character in any normal language.
@lynne To hit the limit you need *ten* bytes per character.
Oh, 3000, not 300 bytes. Not unless you get into zalgo text or you fill in spacer/padding chars.
@mcc TIL Cherokee characters are all 3 bytes. Can't remotely hit the limit but new knowledge.
@mcc i'm not sure that these count as a single grapheme (and nothing on my computer supports this part of the standard, so i can't easily check), but it looks like you can combine a whole bunch of ancient Egyptian characters into a single glyph with the Egyptian hieroglyph format controls: https://en.wikipedia.org/wiki/Egyptian_Hieroglyph_Format_Controls. This page, https://mjn.host.cs.st-andrews.ac.uk/egyptian/res/js/testsuite_uni.html, has examples of nine-codepoint stretches that appear to render as a single glyph, which I think works out to 36 bytes in UTF-8?