Mastodawn

Moriel Schottlender Jul 30, 2023

This is something I've been fascinated by for a while now. Let's see if I can lay it all out:

💡Unexpected Challenge Breakdown💡

🌟char- and word-counters!🌟

Counting sounds straightforward, but it has a ton of surprising challenges:

- it’s inconsistent between #languages and scripts
- it’s inconsistent between products online (!)
- it's hard for users to understand
- it’s completely thrown off by emoji 🙊🙈🙉

… We thought a lot about this at #Wikipedia …

#accessibility #a11y

(thread 🧵1/)

Moriel Schottlender Jul 30, 2023

🧵2/

“Byte” limits can cause an #equity problem:

🔤 For Latin script, each char is more or less a byte (ASCII).
🔣 For (many) non-Latin scripts, each char can be 2 or 3 bytes in UTF-8.

Arabic and Hebrew, for example, use 2 bytes per character in UTF-8. Diacritics are separate characters on top of that, so they cost even more.

Limiting chars based on storage (like a varchar limit counted in bytes) may mean non-Latin-script users get half as much (or less!) effective length as English/Latin users.

(see https://en.wikipedia.org/wiki/UTF-8#Encoding)
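
To make the byte gap concrete, here is a minimal sketch that measures UTF-8 size with the standard `TextEncoder`. The helper names `utf8ByteLength` and `fitsByteLimit` and the sample words are my own assumptions for illustration, not anything a particular product uses.

```javascript
// Hypothetical helper (not from any product): UTF-8 byte length of a string.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Hypothetical helper: would this text fit a storage limit measured in bytes?
function fitsByteLimit(text, maxBytes) {
  return utf8ByteLength(text) <= maxBytes;
}

const latin = "text";                      // 4 Latin letters
const hebrew = "\u05E9\u05DC\u05D5\u05DD"; // שלום, 4 Hebrew letters
const arabic = "\u0633\u0644\u0627\u0645"; // سلام, 4 Arabic letters

console.log(utf8ByteLength(latin));  // 4
console.log(utf8ByteLength(hebrew)); // 8
console.log(utf8ByteLength(arabic)); // 8

// The same 4-byte limit fits the Latin word but not the Hebrew one.
console.log(fitsByteLimit(latin, 4), fitsByteLimit(hebrew, 4)); // true false
```

Same number of letters, twice the bytes, so a byte-based limit cuts the Hebrew and Arabic writers off at half the length.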

Moriel Schottlender Jul 30, 2023

🧵3/

Diacritics tend to be even more unpredictable when counting (and limiting) characters.

On Mastodon, Twitter, HTML’s `maxlength` spec and #javascript `().length`:
- This (Latin) diacritic ž is 1 character
- This (Hebrew) diacritic שָׁ is 3 characters
- This (Arabic) diacritic ـتـ is 3 characters

Users who write Arabic or Hebrew will effectively get a lower limit than users who write in Latin script… which is a problem.

(see
https://en.wikipedia.org/wiki/Diacritic, https://en.wikipedia.org/wiki/Arabic_diacritics and https://en.wikipedia.org/wiki/Hebrew_diacritics)
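
Here is a rough sketch of that gap, comparing `.length` with a count of user-perceived characters (grapheme clusters) from the standard `Intl.Segmenter`. The helper name `countGraphemes` is hypothetical, the strings are written as escapes so the exact code points are visible, and the Arabic sample (a letter plus one vowel mark) is my own pick rather than the one quoted above.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

const zPrecomposed = "\u017E";       // ž as one precomposed code point
const zCombining = "z\u030C";        // z + combining caron, looks the same
const shin = "\u05E9\u05B8\u05C1";   // Hebrew shin + qamats + shin dot
const tehFatha = "\u062A\u064E";     // Arabic teh + fatha (assumed example)

console.log(zPrecomposed.length, countGraphemes(zPrecomposed)); // 1 1
console.log(zCombining.length, countGraphemes(zCombining));     // 2 1
console.log(shin.length, countGraphemes(shin));                 // 3 1
console.log(tehFatha.length, countGraphemes(tehFatha));         // 2 1
```

Note that even Latin ž can cost 2 if it arrives in decomposed form; counting graphemes gives 1 for all four.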

Moriel Schottlender Jul 30, 2023

🧵4/

Chinese is another example of char- and word-counting challenges.

For one, it’s harder to say what counts as a “word”: Chinese is written without spaces between words, so the character and word boundaries that counters rely on in Latin-script text don't apply.

Professional translators often need to *estimate* the word-count of translation documents (to get paid fairly).

So how do you even get your software to count characters?

(see https://davidsmithtranslation.com/articles/how-to-count-chinese-characters/ for example)
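
One rough way software could approach this, sketched under my own assumptions (it is not the method from the linked article): count only the characters whose Unicode script is Han, skipping punctuation, spaces and Latin text. The function name `countHanCharacters` is hypothetical.

```javascript
// Hypothetical helper: count Han (Chinese) characters, ignoring everything else.
// \p{Script=Han} is a Unicode property escape and needs the "u" flag.
function countHanCharacters(text) {
  const matches = text.match(/\p{Script=Han}/gu);
  return matches === null ? 0 : matches.length;
}

console.log(countHanCharacters("你好，世界!")); // 4 (the comma and "!" are skipped)
console.log(countHanCharacters("hello"));       // 0
```

This only gives a character count; turning it into a word-count estimate still needs a conversion ratio that a human has to choose.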

Moriel Schottlender Jul 30, 2023

🧵5/

There’s a lot more to say about this, but I want to quickly touch on emoji.

Emoji completely break character counters in really fun ways.

A basic emoji 🙂 is counted almost everywhere as two characters. (But not on Mastodon!)

But there are also "combined" emojis: those that look like 1 emoji but are actually several emojis connected with a zero-width joiner (ZWJ) char, like this one: 👪

Mastodon considers it 1 char.
MDN and Twitter consider it 2 chars.

But wait, there’s more…

Moriel Schottlender Jul 30, 2023

🧵6/

👩 is 1 emoji
👩‍👩‍👧 is 3 emojis joined into 1
👩‍👩‍👧‍👦 is 4 emojis joined into 1
The joined ones are connected w/ zero-width joiner (ZWJ) chars

Mastodon considers each of the above 1 character.
Twitter considers each of the above 2 characters.

MDN and #javascript:
👩: 2 chars
👩‍👩‍👧: 8 chars
👩‍👩‍👧‍👦: 11 chars
…It seems to be counting the bytes, not the characters.

(Correction: JS counts UTF-16 code units, not bytes. Each code unit is 16 bits, and a character outside the Basic Multilingual Plane, like most emoji, takes two of them; see: https://en.wikipedia.org/wiki/UTF-16#Description)

… Joy!

(see https://r12a.github.io/uniview/)
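
The numbers in this post can be reproduced with a small sketch that measures the same string four ways. The `measure` helper is hypothetical, and the family emojis are built from escapes so the ZWJ (U+200D) is explicit; the grapheme count assumes a runtime whose `Intl.Segmenter` knows emoji ZWJ sequences.

```javascript
const ZWJ = "\u200D";       // zero-width joiner
const woman = "\u{1F469}";  // 👩
const girl = "\u{1F467}";   // 👧
const boy = "\u{1F466}";    // 👦

const family3 = [woman, woman, girl].join(ZWJ);      // 👩‍👩‍👧
const family4 = [woman, woman, girl, boy].join(ZWJ); // 👩‍👩‍👧‍👦

// Hypothetical helper: four different answers to "how long is this string?"
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16Units: text.length,                           // what .length reports
    codePoints: [...text].length,                      // string iteration goes by code point
    graphemes: [...segmenter.segment(text)].length,    // user-perceived characters
    utf8Bytes: new TextEncoder().encode(text).length,  // storage size in UTF-8
  };
}

console.log(measure(woman));   // utf16Units 2,  codePoints 1, graphemes 1, utf8Bytes 4
console.log(measure(family3)); // utf16Units 8,  codePoints 5, graphemes 1, utf8Bytes 18
console.log(measure(family4)); // utf16Units 11, codePoints 7, graphemes 1, utf8Bytes 25
```

The 2 / 8 / 11 column is `.length`; the grapheme column is probably closest to what a person would call "1 character".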

Ténno Seremél’

@moriel
> The length data property of a String value contains the length of the string in UTF-16 code units.