Got nerdsniped by a request from @thisismissem.social and made a little visualizer tool demonstrating the various ways you can represent "how long is this string?" in Unicode:

https://data.runhello.com/bs-limits/

- Bytes (in the standard UTF-8 encoding)
- UTF-16 code units (irrelevant except in JS, where they're very relevant)
- Codepoints (Unicode characters)
- Grapheme clusters (the visual "characters" you see on screen)

And how the divergence of these counts relates to Bluesky's "unusual" post limit rules.
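For reference, three of these four counts can be computed with the Python standard library alone (grapheme clusters need a segmentation library, so they're left out here); this is just a sketch, not the visualizer's actual code:

```python
# Three of the four length measures, stdlib only. Grapheme clusters
# require a segmentation library (e.g. ICU bindings), so they're omitted.
def counts(s: str) -> dict:
    return {
        "utf8_bytes": len(s.encode("utf-8")),            # bytes in UTF-8
        "utf16_units": len(s.encode("utf-16-le")) // 2,  # UTF-16 code units
        "codepoints": len(s),                            # Unicode codepoints
    }

print(counts("héllo"))  # {'utf8_bytes': 6, 'utf16_units': 5, 'codepoints': 5}
```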


Based on this visualizer I issue you a challenge.

Bluesky's post limits work like this: you can fit 3000 bytes OR 300 graphemes, whichever limit you hit first. I thought at first that hitting the byte limit would be nigh impossible, but it turns out to be pretty easy with emoji: 🤷🏾‍♀️ is 1 grapheme but a full 17 bytes, thanks to stringing together a base emoji, a skin tone modifier, a ZWJ, a gender sign, and a variation selector.
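To see where those 17 bytes come from, here's a quick stdlib-only breakdown of the cluster, codepoint by codepoint:

```python
import unicodedata

# The 🤷🏾‍♀️ cluster, spelled out codepoint by codepoint.
shrug = "\U0001F937\U0001F3FE\u200D\u2640\uFE0F"
for cp in shrug:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)} "
          f"-> {len(cp.encode('utf-8'))} bytes")

print(len(shrug.encode("utf-8")))  # 17 bytes total, 5 codepoints, 1 grapheme
```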

My challenge: Can you hit the 3000 byte limit, without hitting the 300 grapheme limit, using only *natural human text*, any language?

In other words, can you top out Bluesky's byte limit by writing in a human language, not relying on emoji or proto-emoji like ☪? And if not, what human text comes closest, i.e. has the *largest* ratio of byte-length to grapheme-length?

I'm guessing the leading candidates would be:

- Vietnamese, as far as I know the diacritic-est language on earth;
- Chinese, assuming you can stick only to four-byte characters;
- Archaic Korean— oh, but this one's *complicated*, so I'll have to explain in the next post—

@mcc Hindi's got to be pretty good. "नमस्ते" (namaste) clocks in at 18 bytes for 3 gcs; of that, the "-ste" alone (स्ते) is 12 bytes for 1 gc. "पास्त्रामी" (pastrami - I'm just thinking of random words here) is 30 bytes for still 3 gcs. I'm sure someone who actually knows Hindi could do much better here!
@mcc (It will do well because, very roughly, consonants that don't have a vowel in between end up being a ligature, and by default consonants have a built-in "-a" vowel that takes an extra codepoint to remove - and then non-a vowels are also handled as ligatures. So you end up with complex ligatures that take a lot of bytes to construct.)
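That construction is visible directly in the codepoints; a stdlib-only sketch of how the स्ते cluster is assembled:

```python
import unicodedata

# The single on-screen "character" स्ते is built from four codepoints:
# SA + virama (removes the inherent -a) + TA + vowel sign E.
ste = "\u0938\u094D\u0924\u0947"
for cp in ste:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)}")

print(len(ste.encode("utf-8")))  # 12 bytes for one grapheme cluster
```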
@mal3aby @mcc But then, Hindi uses Devanagari, which is (rightfully) in the BMP. I guess the same strategy would work even better with the ancient and/or minority Indic scripts from the SMP: since the scripts are related, they have essentially the same encoding model, with grapheme clusters, but their codepoints fall in the U+11xxx range and need 4 bytes per codepoint in UTF-8.
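The byte advantage is easy to check: any codepoint at or above U+10000 takes 4 bytes in UTF-8, versus 3 for Devanagari's block. A quick sketch (U+11183 is picked from the Sharada block as an arbitrary SMP Indic example):

```python
# Bytes per codepoint: Devanagari (BMP) vs an SMP Indic script.
devanagari_ka = "\u0915"     # DEVANAGARI LETTER KA, U+0915
sharada_char = "\U00011183"  # a letter in the Sharada block (SMP)

print(len(devanagari_ka.encode("utf-8")))  # 3
print(len(sharada_char.encode("utf-8")))   # 4
```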
@fgrosshans @mcc Ooh, good point! I'd twigged that the other Indic scripts would also do well, having the same structure, but it hadn't occurred to me that they have a bytes advantage, too. (Even as I was sitting there thinking Arabic had an unfair disadvantage with its characters taking only 2 UTF8 bytes instead of 3!)

@mal3aby @mcc For Hindi (and presumably other Indic language/script combinations) one should not forget clusters commonly used in informal contexts, even if they're not officially correct.

The Unicode proposal "Text Rendering, Input, Search and Processing in Indian Languages" (https://www.unicode.org/L2/L2026/26062-indian-language-feedback.pdf) states:

«Forms like क्यााा in which a vowel sound is exaggerated by repeating the vowel sign multiple times, which is popular in Hindi novels, magazines, as well as on social media»

I now wonder if further exaggeration like क्यााााााााा would seem natural? Aaaaaaaaaaaaaargh! I will probably never know

@fgrosshans @mal3aby @mcc I absolutely love that the answer to this is turning out to be @scream
@mal3aby @mcc Looking through Hindi wordlists, I see स्क्रू ("screw"), with an impressive 18 bytes of UTF-8 to encode a single grapheme.
@mal3aby @mcc The same list (compiled from opensubtitles.org) also has तु्म्ह ("your") at 24 (edit: no, 18) bytes, but Wiktionary lists that as "Old Hindi", so not sure that counts, even if it may have appeared in a Hindi subtitle at some point. https://github.com/hermitdave/FrequencyWords/blob/master/content/2018/hi/hi_full.txt

@kwi @mal3aby I think it counts, because a thing a real user might plausibly do is transcribe an antiquated Hindi text onto a modern computer, and plausibly that Hindi text might contain many instances of the word तु्म्ह.

Crossing the 20:1 boundary would actually be very significant, because it would mean we could spam the word, space-separated, many times and pass the 3000-byte limit! Not a *good* text, but not gibberish, and closer than we've gotten yet. However, my own tool puts तु्म्ह at only 19 bytes…?
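The arithmetic behind that 20:1 threshold, with a hypothetical 20-byte, 1-grapheme word: each repetition plus a separating space costs word-bytes + 1 bytes but only 2 graphemes, so the bytes-per-2-graphemes must exceed 10 to exhaust 3000 bytes before 300 graphemes.

```python
# Hypothetical 1-grapheme word of 20 UTF-8 bytes, repeated space-separated.
word_bytes = 20
unit_bytes = word_bytes + 1  # word + one ASCII space
unit_graphemes = 1 + 1       # word counts as 1 grapheme, the space as 1

# Repetitions needed to exceed Bluesky's 3000-byte limit:
n = 3000 // unit_bytes + 1
print(n, n * unit_bytes, n * unit_graphemes)  # 143 reps -> 3003 bytes, 286 graphemes
```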

@mcc @mal3aby Oops, turns out my terminal is struggling with rendering these and cut the word off when I copy-pasted it. The word from the word list is actually "तु्म्हीं", which isn't listed in any online Hindi dictionary I can find.
@kwi @mal3aby Okay NOW we are getting somewhere!!!

@kwi @mal3aby Oh, and terrifyingly: while Rust's unicode-segmentation counts तु्म्हीं as one grapheme, the Bluesky web client as installed on blacksky.community counts it as two. No idea whether I have just found a bug in Rust unicode-segmentation, the Bluesky client app, or the Bluesky server software/specification! But if it's anything other than the client app, I'm actually in trouble! Crud!

EDIT: IT TURNED OUT BLACKSKY IS SEGMENTING UNICODE DIFFERENTLY FROM BLUESKY?!

@mcc I think your post is missing some words, so I am not sure what the other grapheme count is; but the relevant rule changed a couple of years ago, so this may be a mismatch in versions of grapheme cluster segmentation. See rule GB9c in UAX #29 for Unicode 15.1: https://www.unicode.org/reports/tr29/tr29-42.html#GB9c.

(Assuming I ran the various segmentation algorithms in my head correctly—a daring assumption, I have a cold—if the count is 3, this is a version mismatch; if the count is 4, it is EGC vs. LGC.)


@eggrobin It looks like what happened is the Bluesky JS frontend was using a busted segmenter until recently: https://github.com/bluesky-social/social-app/pull/9526
Replace `graphemer` with `unicode-segmenter` by mozzius · Pull Request #9526 · bluesky-social/social-app

@eggrobin Incidentally, your mental Unicode arithmetic is good; indeed EGC is 1 and LGC is 4. I do not know why Bluesky Social App 116 was giving 2, I assume it was just wrong.

@mcc Yeah two is just wrong for all versions of Unicode for that string.

But then to your earlier question, the actual string seems weird (a virama on a vowel?). The Old Hindi Wiktionary entry mentioned above doesn’t have the first virama, and thus is two (modern) EGCs.

@eggrobin do you think this is a plausible actual archaic form, or could it be a typo in the list we got it from?

@mcc @eggrobin It's definitely a typo. A virama on a vowel is meaningless, the fact that you can do it at all is a feature of Unicode, not of the script.

In Hindi visible viramas tend to be a pedagogical tool or something you use when you don't want to figure out how to write out a cluster. Unicode uses the virama character as an architectural tool for representing conjuncts most of the time.

There's no actual way to write this string. Unicode just lets you construct it.
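The malformation is visible in a plain codepoint dump (stdlib-only sketch): the first virama immediately follows a vowel sign, which no real Devanagari spelling does.

```python
import unicodedata

# The word-list string तु्म्हीं, codepoint by codepoint.
w = "\u0924\u0941\u094D\u092E\u094D\u0939\u0940\u0902"
for cp in w:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)}")
# Note the DEVANAGARI SIGN VIRAMA right after DEVANAGARI VOWEL SIGN U:
# that's the typo Unicode lets you encode but the script can't express.
```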

@manishearth @eggrobin Alright. Well that makes me a lot less worried about the fact I was getting different segmentation results from different segmenters (if natural-but-rare text chokes my code I have a problem, but if *malformed* text chokes the code that's… less bad…)
@kwi @mal3aby @mcc That doesn't work; it has a halant (् - indicates no vowel) after the vowel diacritic (ु). Searching "तु्म्ह Wiktionary" brings up तुम्ह: https://en.wiktionary.org/wiki/%E0%A4%A4%E0%A5%81%E0%A4%AE%E0%A5%8D%E0%A4%B9
@kwi @mal3aby @mcc try deleting characters one by one in that word
@kwi @mal3aby @mcc well स्क्रू me