Moriel Schottlender Jul 30, 2023

This is something that I've been fascinated by for a while now. Let's see if I can lay it all out --

💡Unexpected Challenge Breakdown💡

🌟char- and word-counters!🌟

It sounds straightforward but has a ton of surprising challenges

- it’s inconsistent between #languages and scripts
- it’s inconsistent between products online (!)
- it's hard for users to understand
- it’s completely thrown off by emoji 🙊🙈🙉

… We thought a lot about this at #Wikipedia …

#accessibility #a11y

(thread 🧵1/)


Moriel Schottlender Jul 30, 2023

🧵2/

“Byte” limits can cause an #equity problem:

🔤For Latin script, each char is more or less a byte (ASCII)
🔣For (many) non-Latin scripts, each char can be 2 or 3 bytes.

Arabic and Hebrew, for example, use 2 bytes per character in UTF-8. Each diacritic mark adds another 2 bytes.

Limiting chars based on storage (like a varchar limit) may mean non-Latin-script users get half as much (or less!) effective length as English/Latin users.

(see https://en.wikipedia.org/wiki/UTF-8#Encoding)
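
To make the gap concrete, here is a minimal sketch that compares the number of code points in a string with the number of bytes it takes in UTF-8. The helper names (`utf8ByteLength`, `codePointCount`), the sample words, and the 8-byte limit are assumptions made up for illustration; `TextEncoder` is the standard UTF-8 encoder in browsers and Node.js.

```javascript
// Sketch only: helper names, samples, and the 8-byte limit are illustrative.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Number of bytes the text occupies when stored as UTF-8.
function utf8ByteLength(text) {
  return encoder.encode(text).length;
}

// Number of Unicode code points (string iteration goes code point by code point).
function codePointCount(text) {
  return Array.from(text).length;
}

// Escapes are used so the exact code points are visible in the source.
const byteSamples = {
  latin: "peace",                      // 5 ASCII letters
  hebrew: "\u05E9\u05DC\u05D5\u05DD",  // "shalom", 4 letters
  arabic: "\u0633\u0644\u0627\u0645",  // "salaam", 4 letters
};

const BYTE_LIMIT = 8; // hypothetical storage limit, in bytes

for (const [script, text] of Object.entries(byteSamples)) {
  const bytes = utf8ByteLength(text);
  console.log(script, codePointCount(text), "code points,", bytes, "bytes,",
    bytes <= BYTE_LIMIT ? "fits" : "too long");
}
```

With this made-up 8-byte limit, the Latin sample still has room for three more letters, while the Hebrew and Arabic samples are already full at four letters.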


Moriel Schottlender Jul 30, 2023

🧵3/

Diacritics tend to be even more unpredictable when counting (and limiting) characters.

On Mastodon, Twitter, HTML’s `maxlength` spec and #javascript `().length`:
- This (latin) diacritic ž is 1 character
- This (hebrew) diacritic שָׁ is 3 characters
- This (arabic) diacritic ـتـ is 3 characters

Users who write Arabic will effectively get a lower limit than users who write in Latin… which is a problem.

(see
https://en.wikipedia.org/wiki/Diacritic, https://en.wikipedia.org/wiki/Arabic_diacritics and https://en.wikipedia.org/wiki/Hebrew_diacritics)
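
The counts above can be reproduced with a short sketch. The sample strings are my own picks (written as escapes so the code points are explicit), not necessarily the exact strings from the post, and `lengthReport` is a hypothetical helper name. It also shows that Unicode NFC normalization can fold a Latin letter plus combining mark into one precomposed character, while the Hebrew and Arabic samples stay at three.

```javascript
// Sketch only: sample strings and helper name are illustrative assumptions.
const diacriticSamples = [
  ["z with caron, precomposed", "\u017E"],
  ["z with caron, decomposed", "z\u030C"],
  ["Hebrew shin + qamats + shin dot", "\u05E9\u05B8\u05C1"],
  ["Arabic teh + fatha + shadda", "\u062A\u064E\u0651"],
];

// Reports what `.length` sees before and after NFC normalization.
function lengthReport(text) {
  return {
    length: text.length,
    nfcLength: text.normalize("NFC").length,
  };
}

for (const [label, text] of diacriticSamples) {
  console.log(label, lengthReport(text));
}
```

Normalizing before counting narrows the gap for Latin text but does not remove it for scripts whose marks have no precomposed forms.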


Moriel Schottlender Jul 30, 2023

🧵4/

Chinese is another example for char- and word-counting challenges.

For one, it’s harder to distinguish what a character is because characters are made from overlapping strokes that do not conform to traditional character and word-boundaries.

Professional translators often need to *estimate* the word-count of translation documents (to get paid fairly).

So how do you even get your software to count characters?

(see https://davidsmithtranslation.com/articles/how-to-count-chinese-characters/ for ex)
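
One possible approach in software, sketched here under assumptions: count Han characters directly, and estimate words with the built-in `Intl.Segmenter`. The helper name `countChinese` and the sample phrase are made up for illustration. Word segmentation for Chinese is dictionary-based and depends on the runtime's ICU data, so the word count is an estimate and may differ between environments.

```javascript
// Sketch only: `countChinese` and the sample phrase are illustrative assumptions.
const wordSegmenter = new Intl.Segmenter("zh", { granularity: "word" });

function countChinese(text) {
  const segments = [...wordSegmenter.segment(text)];
  return {
    // Code points that belong to the Han script.
    characters: [...text].filter((ch) => /\p{Script=Han}/u.test(ch)).length,
    // Segments the runtime considers word-like (skips punctuation and spaces).
    words: segments.filter((segment) => segment.isWordLike).length,
  };
}

// "\u4F60\u597D\u4E16\u754C" is the four-character phrase for "hello world".
console.log(countChinese("\u4F60\u597D\u4E16\u754C"));
```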


Moriel Schottlender Jul 30, 2023

🧵5/

There’s a lot more to say about this, but I want to quickly touch on emoji.

Emoji completely break character counters in really fun ways.

A basic emoji 🙂 is counted almost everywhere as two characters. (but not Mastodon!)

But there are also "combined" emojis; those that look like 1 emoji but are actually several connected with a zero-width joiner (ZWJ) char, like this one: 👪

Mastodon considers it 1 char.
MDN and twitter consider it 2 chars.
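
A small sketch of where the "two characters" for a basic emoji comes from in JavaScript: `.length` counts UTF-16 code units, while iterating the string counts code points. The helper names are hypothetical.

```javascript
// Sketch only: helper names are illustrative assumptions.
const smiley = "\u{1F642}"; // SLIGHTLY SMILING FACE

// What `.length` reports: UTF-16 code units.
function codeUnitCount(text) {
  return text.length;
}

// Iterating a string yields whole code points.
function codePointCount(text) {
  return [...text].length;
}

console.log(codeUnitCount(smiley));  // 2
console.log(codePointCount(smiley)); // 1

// The two code units, in hex: d83d de42
console.log(smiley.charCodeAt(0).toString(16), smiley.charCodeAt(1).toString(16));
```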

But wait, there’s more…


Moriel Schottlender Jul 30, 2023

🧵6/

👩 is 1 emoji
👩‍👩‍👧 is 3 emojis joined into 1
👩‍👩‍👧‍👦 is 4 emojis joined into 1
Connected w/ a zero-width joiner (ZWJ) char

Mastodon considers each of the above 1 character.
Twitter considers each of the above 2 characters.

MDN and #javascript:
👩: 2 chars
👩‍👩‍👧: 8 chars
👩‍👩‍👧‍👦: 11 chars
…It seems to be counting the bytes, not the characters.

(Correction: JS counts UTF-16 code units, not bytes; the totals happen to look like byte counts, but each code unit is actually 2 bytes; see: https://en.wikipedia.org/wiki/UTF-16#Description)

… Joy!

(see https://r12a.github.io/uniview/)
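
The three ways of counting can be put side by side in a short sketch. `countAll` is a hypothetical helper; `Intl.Segmenter` is the standard API for splitting text into user-perceived characters (grapheme clusters). This is not a claim about how Mastodon or Twitter implement their counters, only a way to reproduce the kinds of numbers quoted above.

```javascript
// Sketch only: `countAll` is an illustrative helper, not any product's real counter.
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function countAll(text) {
  return {
    codeUnits: text.length,                                   // what `.length` reports
    codePoints: [...text].length,                             // Unicode code points
    graphemes: [...graphemeSegmenter.segment(text)].length,   // user-perceived characters
  };
}

const woman = "\u{1F469}";
const familyOfThree = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}";
const familyOfFour = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(countAll(woman));         // { codeUnits: 2, codePoints: 1, graphemes: 1 }
console.log(countAll(familyOfThree)); // { codeUnits: 8, codePoints: 5, graphemes: 1 }
console.log(countAll(familyOfFour));  // { codeUnits: 11, codePoints: 7, graphemes: 1 }
```

Counting graphemes is the option that matches what a person sees on screen, at the cost of depending on the Unicode data shipped with the runtime.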


Simon

@moriel JavaScript counts UTF-16 code units, not bytes. And 👩 is probably represented by a surrogate pair of two UTF-16 code units, like most emoji. And each of these code units uses two bytes. So 👩 uses 4 bytes in JavaScript.


Simon Jul 30, 2023

@moriel And sorry for the mansplaining. 😅 It is a topic that's dear to my heart as well, though I did not know the depths of it and many of these aspects as you described them!


Moriel Schottlender Jul 31, 2023

@Shepard it might use UTF-16, but checking the ().length on each gives me
- '👩'.length = 2
- '👩‍👩‍👧'.length = 8
- '👩‍👩‍👧‍👦'.length = 11

... Which looks like it's byte counts.

Breaking 👩‍👩‍👧‍👦 down, it makes sense, too; the emoji is made of

- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F467 GIRL = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F466 BOY = 2b

Which is overall = 11bytes.

What am I missing?

(edit - see https://r12a.github.io/uniview/)


Simon Jul 31, 2023

@moriel 👩 is in the Supplementary Multilingual Plane. UTF-16 can only encode code points in the Basic Multilingual Plane as a single unit. Everything beyond that it splits up into two "surrogate pair" units. To my knowledge, what ().length counts are these units.

The actual number of bytes this code point uses in RAM would be more than 2. Probably 4 but I'm not sure.

See https://en.wikipedia.org/wiki/UTF-16#Description
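
The surrogate-pair split described here can be worked through by hand. The sketch below follows the standard UTF-16 rule for code points above U+FFFF; `toSurrogatePair` is a hypothetical helper name.

```javascript
// Sketch only: `toSurrogatePair` is an illustrative helper name.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits go in the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go in the low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F469); // WOMAN
console.log(high.toString(16), low.toString(16));            // d83d dc69
console.log(String.fromCharCode(high, low) === "\u{1F469}"); // true
```

Two code units of 2 bytes each means the single code point takes 4 bytes in UTF-16.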


Moriel Schottlender Jul 31, 2023

@Shepard Okay, I see what you mean (I think)

This would mean that the other combination emoji are counting pairs + separator(byte?) + pairs +... etc which is still 8 or 11 "places"/counts.

... but it's not actual bytes, even though it effectively would count the same if it was bytes.

That's a fair point! Thanks!
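
As a sketch of the tally discussed in this exchange, the snippet below lists each code point of the four-person family emoji with its UTF-16 code units and its actual byte sizes. `breakdown` and `total` are hypothetical helpers. The code-unit column adds up to the 11 that `().length` reports, while the real byte totals are larger.

```javascript
// Sketch only: `breakdown` and `total` are illustrative helper names.
const utf8 = new TextEncoder();

function breakdown(text) {
  // Array.from iterates by code point, so each `ch` is one whole code point.
  return Array.from(text, (ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    codeUnits: ch.length,            // 2 for the emoji, 1 for the joiner
    utf16Bytes: ch.length * 2,       // each UTF-16 code unit is 2 bytes
    utf8Bytes: utf8.encode(ch).length,
  }));
}

const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const rows = breakdown(family);
const total = (key) => rows.reduce((sum, row) => sum + row[key], 0);

console.table(rows);
console.log(total("codeUnits"), total("utf16Bytes"), total("utf8Bytes")); // 11 22 25
```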


Moriel Schottlender Jul 31, 2023

@Shepard I added a correction to the post, though don't have enough char count space (IRONICALLY) to fully explain it :)

thanks!