Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 code units (irrelevant except in JS, where it's relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence between these measures relates to Bluesky's "unusual" post limit rules.


Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: you can fit 3000 bytes OR 300 graphemes, whichever limit you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but a whopping 17 bytes, thanks to stringing together an emoji base, a skin-tone modifier, a ZWJ, a gender sign, and a variation selector.

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte length to grapheme length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean... oh, but this one's *complicated*, so I'll have to explain in the next post.

So, in Korean Hangul, syllable blocks are formed by arranging two or three of 67 "jamo" (letters) into a square block. Unicode decided to represent this by working out every legal combination of jamo (11,172 of them) and assigning each one a codepoint. But this creates a problem:

https://en.wikipedia.org/wiki/Obsolete_Hangul_jamo

Adoption of Hangul came in fits and starts. There are thirty-ish jamo that are attested in pre-1945 Hangul writing but didn't survive into modern use. Unicode didn't want to waste codepoints on every combination involving these. So they did something hacky. (1/3)


To support encoding of older Korean texts, the Unicode Consortium has a concept of "conjoining" jamo (codepoints for the individual letters, which render stacked into syllable blocks):

https://www.rfc-editor.org/rfc/rfc5892#section-2.9

This leads to something I was posting about last night— these "conjoining" jamo are discouraged in some circumstances because they mean there can be multiple different ways of encoding a single visible character:

https://mastodon.social/@mcc/116100748088621894

I suspect conjoined jamo give you a very high byte-grapheme ratio. But this raises a question: Are these texts "natural"? (2/3)


The point of this challenge is "could someone not intentionally trying to beat this challenge, beat this challenge?".

To my knowledge, nobody writing modern Korean would use the conjoining jamo. And the archaic jamo are *pretty* archaic; I doubt they'd get used in real text. So our candidates for the challenge become:

- Actual pre-1700 texts;
- The Jeju language, spoken by 5,000 people as of 2014, which uses Hangul with the otherwise-lost ㆍ (arae-a) jamo. I don't know if this jamo conjoins.

(3/3)

@mcc A plausible route is “create a file with this name on macOS, then copy the name out”, which will normalise to NFD on APFS. When you copy the name out afterwards you get the fully decomposed text regardless of what you put in originally. I just checked creating a directory in Finder by pasting in 한국 (2 codepoints), and it copies out as 한국 (6 codepoints, though only 18 bytes UTF-8). That should work for any precomposed characters and copying a path seems “reasonable” to do naturally.
@mwh Okay that is… really surprising!!!