This is something that I've been fascinated by for a while now. Let's see if I can lay it all out --

💡Unexpected Challenge Breakdown💡

🌟char- and word-counters!🌟

It sounds straightforward but has a ton of surprising challenges:

- it’s inconsistent between #languages and scripts
- it’s inconsistent between products online (!)
- it's hard for users to understand
- it’s completely thrown off by emoji 🙊🙈🙉

… We thought a lot about this at #Wikipedia

#accessibility #a11y

(thread 🧵1/)

🧵2/

“Byte” limits can cause an #equity problem:

🔤For Latin script, each char is more or less one byte in UTF-8 (the ASCII range)
🔣For (many) non-Latin scripts, each char can take 2 or 3 bytes.

Arabic and Hebrew letters, for example, take 2 bytes each in UTF-8. Each diacritic is its own code point, so it adds another 2 bytes on top.

Limiting chars based on storage (like a varchar limit) may mean non-Latin-script users get half as much effective length (or less!) as English/Latin-script users.

(see https://en.wikipedia.org/wiki/UTF-8#Encoding)
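
To make the byte math concrete, here is a minimal sketch that measures UTF-8 size with the standard `TextEncoder`. The helper names `utf8Bytes` and `charsThatFit` and the 255-byte limit are assumptions for illustration only, not taken from any particular product or database.

```javascript
// Sketch: how many UTF-8 bytes does a string need?
// utf8Bytes and charsThatFit are hypothetical helper names.
const utf8Bytes = (text) => new TextEncoder().encode(text).length;

// How many copies of `char` fit into a byte-limited field?
const charsThatFit = (char, byteLimit) =>
  Math.floor(byteLimit / utf8Bytes(char));

const samples = {
  latin: "a",          // U+0061
  hebrew: "\u05D0",    // alef
  arabic: "\u0628",    // beh
  chinese: "\u4E2D",   // "middle"
  emoji: "\u{1F648}",  // see-no-evil monkey
};

// 255 is an assumed byte limit, picked only for the example.
for (const [script, char] of Object.entries(samples)) {
  console.log(script, utf8Bytes(char), charsThatFit(char, 255));
}
// latin 1 255
// hebrew 2 127
// arabic 2 127
// chinese 3 85
// emoji 4 63
```

Same field, same limit, but the Hebrew or Arabic writer gets about half the room and the Chinese writer about a third.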


🧵3/

Diacritics tend to be even more unpredictable when counting (and limiting) characters.

On Mastodon, Twitter, HTML’s `maxlength` spec and #javascript `.length`:
- This (Latin) diacritic ž is 1 character
- This (Hebrew) diacritic שָׁ is 3 characters
- This (Arabic) diacritic ـتـ is 3 characters

Users who write Arabic will effectively get a lower limit than users who write in Latin… which is a problem.

(see https://en.wikipedia.org/wiki/Diacritic, https://en.wikipedia.org/wiki/Arabic_diacritics and https://en.wikipedia.org/wiki/Hebrew_diacritics)
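
A rough sketch of why the counts differ: the same text can be measured in UTF-16 code units (what `.length` returns), in Unicode code points, or in user-perceived characters (grapheme clusters). The helper names below are hypothetical, and `Intl.Segmenter` is assumed to be available (recent browsers and Node versions); this is one possible way to count, not how any of the products above actually do it.

```javascript
// Three ways to "count characters" in the same string.
const codeUnits = (text) => text.length;        // UTF-16 code units
const codePoints = (text) => [...text].length;  // Unicode code points
const graphemes = (text) => {
  // Assumes Intl.Segmenter exists in this runtime.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
};

const samples = {
  precomposed: "\u017E",          // z with caron as one code point
  decomposed: "z\u030C",          // z + combining caron
  hebrew: "\u05E9\u05B8\u05C1",   // shin + qamats + shin dot
  emoji: "\u{1F648}",             // see-no-evil monkey
};

for (const [name, text] of Object.entries(samples)) {
  console.log(name, codeUnits(text), codePoints(text), graphemes(text));
}
// precomposed 1 1 1
// decomposed 2 2 1
// hebrew 3 3 1
// emoji 2 1 1
```

Counting grapheme clusters is closer to what a reader sees as "one character", so a limit based on it would treat the Hebrew example and ž the same. Normalizing to NFC first also helps for Latin (the decomposed z collapses to one code point), though not every script has precomposed forms to collapse into.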


🧵4/

Chinese is another example of char- and word-counting challenges.

For one, it’s harder to distinguish what a character is, because characters are made from overlapping strokes that do not conform to the character and word boundaries that Latin-script tools expect.

Professional translators often need to *estimate* the word-count of translation documents (to get paid fairly).

So how do you even get your software to count characters?

(see https://davidsmithtranslation.com/articles/how-to-count-chinese-characters/ for example)
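
As a rough illustration of how software might approach this, the sketch below compares a naive whitespace word count with a count of Han characters and with dictionary-based word segmentation. All three helper names are hypothetical, the sample sentence is my own, and the `Intl.Segmenter` result depends on the runtime's locale data, so treat it as an estimate rather than a definitive word count.

```javascript
// Naive approach: split on whitespace (works poorly for Chinese).
const whitespaceWords = (text) =>
  text.trim().split(/\s+/).filter(Boolean).length;

// Count Han characters only, ignoring punctuation and Latin text.
const hanCharacters = (text) =>
  (text.match(/\p{Script=Han}/gu) ?? []).length;

// Locale-aware word segmentation; the result depends on the
// runtime's dictionary data, so it is only an estimate.
const segmentedWords = (text, locale = "zh") => {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return [...segmenter.segment(text)].filter((s) => s.isWordLike).length;
};

const sample = "我喜欢翻译。"; // "I like translating."

console.log(whitespaceWords(sample)); // 1 (no spaces, so one "word")
console.log(hanCharacters(sample));   // 5
console.log(segmentedWords(sample));  // varies with the runtime's data
```

The whitespace count is clearly useless here, which is part of why character counts (or an agreed character-to-word ratio) tend to be the practical unit for Chinese text.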

@moriel I guess there's also an extra layer when considering how information density varies among different scripts, with some of them requiring long strings of characters to convey (almost) the same information as a couple of characters in other scripts.