Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 code units (irrelevant except in JS, where they're very relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence of these counts relates to Bluesky's "unusual" post limit rules.


Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: You can fit 3000 bytes OR 300 graphemes, whichever is less. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17 (!) bytes, thanks to stringing together an emoji, a skin-tone modifier, a ZWJ, a gender sign, and a variation selector.
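If you want to see how those four counts fall out of a string yourself, here's a quick sketch. Caveat: Python has no built-in grapheme segmentation, so the `counts` function below uses a drastically simplified approximation of the UAX #29 rules, just enough to handle the kinds of sequences in this thread; real code should use a proper segmenter (e.g. the third-party `regex` module's `\X`):

```python
import unicodedata

ZWJ = "\u200d"

def extends_cluster(prev: str, ch: str) -> bool:
    """Does ch continue the current grapheme cluster? (Toy UAX #29 subset.)"""
    cp = ord(ch)
    return (
        unicodedata.combining(ch) != 0   # combining marks (diacritics)
        or ch == ZWJ or prev == ZWJ      # ZWJ glues emoji sequences together
        or 0xFE00 <= cp <= 0xFE0F        # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF      # emoji skin-tone modifiers
        or 0x1160 <= cp <= 0x11FF        # conjoining Hangul vowel/final jamo
    )

def counts(s: str) -> dict:
    graphemes = 0
    prev = ""
    for ch in s:
        if not (prev and extends_cluster(prev, ch)):
            graphemes += 1
        prev = ch
    return {
        "bytes": len(s.encode("utf-8")),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "codepoints": len(s),
        "graphemes": graphemes,
    }

# The shrug from above: 17 bytes, 7 UTF-16 units, 5 codepoints, 1 grapheme
print(counts("\U0001F937\U0001F3FE\u200d\u2640\ufe0f"))
```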

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte length to grapheme length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean; oh, but this one's *complicated*, so I'll have to explain in the next post.

So, in Korean Hangul, syllabic characters are formed by arranging 67 "jamo" into a grid-shaped syllable block. Unicode decided to represent this by working out every legal modern combination of jamo (11,172 of them) and assigning each one its own codepoint. But this creates a problem:

https://en.wikipedia.org/wiki/Obsolete_Hangul_jamo

Adoption of Hangul came in fits and starts. There are thirty-ish jamo that are attested in pre-1945 Hangul writing but didn't survive into modern use. Unicode didn't want to waste codepoints on every syllable combination involving these. So they did something hacky. (1/3)


To support encoding of older Korean texts, the Unicode body has a concept of "conjoining" jamo:

https://www.rfc-editor.org/rfc/rfc5892#section-2.9

This leads to something I was posting about last night— these "conjoining" jamo are discouraged in some circumstances because they mean there can be multiple different ways of encoding a single visible character:

https://mastodon.social/@mcc/116100748088621894

I suspect conjoined jamo give you a very high byte-grapheme ratio. But this raises a question: Are these texts "natural"? (2/3)
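To make the ratio concrete, here's a stdlib-only Python sketch of the two ways to encode the modern syllable 한: one precomposed codepoint, or three conjoining jamo. Either way it's a single grapheme, but the byte counts differ 3:1, and the archaic syllables exist *only* in the conjoining form:

```python
import unicodedata

pre = "\ud55c"                           # 한 as one precomposed codepoint (U+D55C)
dec = unicodedata.normalize("NFD", pre)  # the same syllable as three conjoining jamo

print([hex(ord(c)) for c in dec])        # ['0x1112', '0x1161', '0x11ab']
print(len(pre.encode("utf-8")))          # 3 bytes
print(len(dec.encode("utf-8")))          # 9 bytes, still 1 grapheme either way
print(unicodedata.normalize("NFC", dec) == pre)  # True: they round-trip
```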


The point of this challenge is "could someone not intentionally trying to beat this challenge, beat this challenge?".

To my knowledge, nobody writing modern Korean would use the conjoining jamo. And the archaic jamo are *pretty* archaic; I doubt they'd get used in real writing. So our candidates for the challenge become:

- Actual pre-1700 texts;
- The Jeju language, spoken by 5,000 people as of 2014, which uses Hangul with the otherwise-lost ㆍ jamo. I don't know if this jamo conjoins.

(3/3)

Update: Once again my limited knowledge of Indian-subcontinent languages has bitten me!

@mal3aby points out that Hindi, because of how its ligatures work, produces an extreme UTF-8-byte-to-grapheme ratio in totally natural text:

https://mastodon.smears.org/@mal3aby/116105194257298934

Unfortunately the highest-ratio word they thought of, पास्त्रामी ("Pastrami"), *hits* a ratio of 10:1. To break Bluesky, we need to *exceed* 10:1. Hindi so far AFAIK does *best* at this challenge, but can it break Bluesky? Any Hindi speakers care to help?

mal3aby (@[email protected])

@[email protected] (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)


Further updates—

- My speculations about Vietnamese do not seem to have been borne out. Unicode seems to have granted the standard Vietnamese letter-plus-diacritic combinations their own precomposed codepoints. https://en.wikipedia.org/wiki/Vietnamese_language_and_computers

- Thai's diacritics do produce a higher codepoint-to-grapheme ratio than Vietnamese, but it can't keep up with Hindi. https://bsky.app/profile/did:plc:7qtwfjtfw4xkr6ny7ckqxa7j/post/3mfd6xuzat22x

- This doesn't help you, but a 2023 investigation concluded the longest-by-byte valid emoji grapheme is 👨🏻‍❤️‍💋‍👨🏻, at 35 bytes. https://machs.space/posts/whats-the-max-valid-length-of-an-emoji/


Further further updates:

There IS a way you can wind up with decomposed jamo in modern Korean text. And it's Apple's fault?! :O

https://mastodon.social/@mwh/116106825790571446

@mcc are you familiar with Unicode normalization? There are specific ways to do this for characters that can be represented as either a single codepoint or multiple conjoining codepoints. Most infuriating of all, some of these are baked into filesystems.

https://www.unicode.org/reports/tr15/
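A minimal demonstration of the two encodings the normalization forms pick between, using é rather than Hangul (same trap, Latin script; Apple's HFS+ filesystem famously stored filenames in a close variant of NFD, which is presumably the filesystem gripe above):

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed codepoint (NFC)
decomposed = "e\u0301"  # e + combining acute accent (NFD)

print(composed == decomposed)                                # False: different codepoints
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3
```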


@vikxin I'm familiar, yes. I think to follow the spirit of the challenge the text doesn't necessarily need to be normalized, but it really should be something conceivably produced by a real-world IME, and not a contrived combination of codepoints specifically designed to beat the challenge (e.g., my concluding that writing modern Hangul with the conjoining jamo is illegitimate).