Mastodawn

Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)
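
A minimal Python sketch of the four measures listed above. The first three counts use only the standard library. The grapheme count is a deliberately crude approximation, not the full segmentation rules of Unicode Standard Annex #29, so it will miscount things like conjoining Hangul jamo and flag emoji. The names `rough_grapheme_count` and `lengths` are made up for this example and are not taken from the visualizer.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner

def is_extender(ch):
    """Rough test: does this code point attach to the previous one?"""
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks, variation selectors
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                # emoji skin-tone modifiers
        or ch == ZWJ
    )

def rough_grapheme_count(text):
    """Approximate grapheme count. ASSUMPTION: not a full UAX #29 implementation."""
    count = 0
    prev = ""
    for ch in text:
        # A code point starts a new cluster unless it extends the previous
        # one or directly follows a ZWJ.
        if not (is_extender(ch) or prev == ZWJ):
            count += 1
        prev = ch
    return count

def lengths(text):
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "codepoints": len(text),
        "graphemes_rough": rough_grapheme_count(text),
    }

# Person shrugging + skin tone + ZWJ + female sign + variation selector
SHRUG = "\U0001F937\U0001F3FE\u200D\u2640\uFE0F"
print(lengths(SHRUG))
```

For anything beyond a demo, a library that implements the real segmentation rules would be the safer choice.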

And how the divergence between the byte count and the grapheme count relates to Bluesky's "unusual" post limit rules.

mcc Feb 20

Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: You can fit 3000 bytes OR 300 graphemes, whichever is less. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but 17! bytes, thanks to stringing together an emoji, a skintone, a ZWJ, a gender sign, and a variation selector.
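
The arithmetic behind that claim can be sketched in a few lines of Python. The two limits are the ones stated in the post. The helper `binding_limit` is a made-up name, and the model (a post built from identical graphemes of a fixed byte size) is a simplifying assumption.

```python
BYTE_LIMIT = 3000       # as stated in the post
GRAPHEME_LIMIT = 300    # as stated in the post

# Average bytes per grapheme above which the byte limit is reached first
BREAK_EVEN = BYTE_LIMIT / GRAPHEME_LIMIT

def binding_limit(bytes_per_grapheme):
    """Which limit stops a post made of identical graphemes, and how many fit."""
    fit_by_bytes = BYTE_LIMIT // bytes_per_grapheme
    if fit_by_bytes < GRAPHEME_LIMIT:
        return ("bytes", fit_by_bytes)
    return ("graphemes", GRAPHEME_LIMIT)

print(BREAK_EVEN)          # 10.0
print(binding_limit(17))   # the 17-byte emoji: ('bytes', 176)
print(binding_limit(3))    # a 3-byte character: ('graphemes', 300)
```

So under this model, text has to average more than 10 bytes per grapheme before the byte limit becomes the one that matters.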

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

mcc Feb 20

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte-length to grapheme-length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean. Oh, but this one's *complicated*, so I'll have to explain in the next post.
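
To get a feel for the first two candidates, here is a small Python sketch that measures UTF-8 bytes for single sample characters. The samples are hypothetical picks for illustration, not taken from the thread, and each is assumed to render as one grapheme: a Vietnamese letter with two diacritics (in both normalization forms) and one CJK Extension B ideograph.

```python
import unicodedata

VIET_NFC = "\u1ec7"  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
SAMPLES = {
    "Vietnamese, precomposed (NFC)": VIET_NFC,
    "Vietnamese, decomposed (NFD)": unicodedata.normalize("NFD", VIET_NFC),
    "CJK Extension B ideograph": "\U00020000",
}

# Each sample is ASSUMED to be exactly one grapheme, so the byte length
# is also the bytes-per-grapheme ratio for that sample.
BYTES_PER_GRAPHEME = {name: len(s.encode("utf-8")) for name, s in SAMPLES.items()}

for name, n in BYTES_PER_GRAPHEME.items():
    print(f"{name}: {n} bytes")
```

These particular samples come out at 3, 5, and 4 bytes, so none of them on its own reaches the 10 bytes per grapheme (3000 / 300) needed to hit the byte limit first. Real running text would of course mix in spaces and shorter characters.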

mcc Feb 20

So, in Korean Hangul, syllabic characters are formed by arranging 67 "jamo" in a grid. Unicode decided to represent this by working out every legal combination of jamo and assigning each one a codepoint. But this creates a problem:

https://en.wikipedia.org/wiki/Obsolete_Hangul_jamo

Adoption of Hangul came in fits and starts. There are thirty-ish jamo that are attested in pre-1945 waves of Hangul but didn't survive into modern use. Unicode didn't want to waste codepoints on these. So they did something hacky. (1/3)
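
The "every legal combination gets a codepoint" scheme for modern Hangul can be sketched as arithmetic. The counts and base value below follow the Unicode Standard's Hangul syllable block (19 leading consonants, 21 vowels, 27 optional trailing consonants, starting at U+AC00). The function name `compose_syllable` and its index-based interface are made up for this example.

```python
L_COUNT = 19   # leading consonants
V_COUNT = 21   # vowels
T_COUNT = 28   # 27 trailing consonants, plus "no trailing consonant"
S_BASE = 0xAC00

# 19 + 21 + 27 positional jamo, matching the 67 mentioned in the post
JAMO_TOTAL = L_COUNT + V_COUNT + (T_COUNT - 1)

# Number of precomposed modern syllables
SYLLABLE_TOTAL = L_COUNT * V_COUNT * T_COUNT

def compose_syllable(l_index, v_index, t_index=0):
    """Map zero-based jamo indices to the precomposed syllable character."""
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT and 0 <= t_index < T_COUNT):
        raise ValueError("jamo index out of range for modern Hangul")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

print(JAMO_TOTAL, SYLLABLE_TOTAL)      # 67 11172
print(hex(ord(compose_syllable(18, 0, 4))))  # 0xd55c
```

Obsolete jamo have no index in this table, which is why they cannot be expressed as one of these precomposed codepoints.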

mcc Feb 20

To support encoding of older Korean texts, the Unicode body has a concept of "conjoining" jamo:

https://www.rfc-editor.org/rfc/rfc5892#section-2.9

This leads to something I was posting about last night— these "conjoining" jamo are discouraged in some circumstances because they mean there can be multiple different ways of encoding a single visible character:

https://mastodon.social/@mcc/116100748088621894

I suspect conjoined jamo give you a very high byte-grapheme ratio. But this raises a question: Are these texts "natural"? (2/3)
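
A short Python sketch of the "multiple encodings of one visible character" issue, using only the standard library's `unicodedata` module. The modern syllable and the archaic pair are sample choices made for this illustration, not examples from the thread.

```python
import unicodedata

def utf8_len(s):
    return len(s.encode("utf-8"))

# One modern syllable, encoded two ways
PRECOMPOSED = "\ud55c"             # single precomposed code point
CONJOINING = "\u1112\u1161\u11ab"  # leading + vowel + trailing conjoining jamo

# Different code point sequences, but canonically equivalent
DIFFERENT_RAW = PRECOMPOSED != CONJOINING
SAME_AFTER_NFC = unicodedata.normalize("NFC", CONJOINING) == PRECOMPOSED

# An archaic leading consonant (U+1140) plus a vowel has no precomposed
# form, so normalization leaves it as conjoining jamo.
ARCHAIC = "\u1140\u1161"
ARCHAIC_NFC = unicodedata.normalize("NFC", ARCHAIC)

print(utf8_len(PRECOMPOSED), utf8_len(CONJOINING))  # 3 9
print(DIFFERENT_RAW, SAME_AFTER_NFC)                # True True
print(ARCHAIC_NFC == ARCHAIC, utf8_len(ARCHAIC))    # True 6
```

Each conjoining jamo costs 3 bytes in UTF-8, so the byte count per visible syllable grows with the number of jamo it is built from.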

Rachel Stantz

@mcc very tempted to refer to this hypothetical high byte-grapheme ratio Hangul as “jumbo jamo”