This is something that I've been fascinated by for a while now. Let's see if I can lay it all out --

💡Unexpected Challenge Breakdown💡

🌟char- and word-counters!🌟

It sounds straightforward but has a ton of surprising challenges

- it’s inconsistent between #languages and scripts
- it’s inconsistent between products online (!)
- it's hard for users to understand
- it’s completely thrown off by emoji 🙊🙈🙉

… We thought a lot about this at #Wikipedia

#accessibility #a11y

(thread 🧵1/)

🧵2/

“Byte” limits can cause an #equity problem:

🔤For Latin script, each char is more or less a byte in UTF-8 (ASCII)
🔣For (many) non-Latin scripts, each char can be 2 or 3 bytes.

Arabic and Hebrew, for example, use 2 bytes per letter in UTF-8, and each diacritic adds 2 more.

Limiting chars based on storage (like a varchar limit) may mean non-Latin-script users get half as much (or less!) effective length as English/Latin users.

(see https://en.wikipedia.org/wiki/UTF-8#Encoding)
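
To make the gap concrete, here is a rough sketch that counts UTF-8 bytes with the standard `TextEncoder` API. The sample words and the 255-byte limit are my own illustrative assumptions, not taken from any particular product.

```javascript
// Count how many bytes a string takes once encoded as UTF-8.
const encoder = new TextEncoder();
const utf8Bytes = (text) => encoder.encode(text).length;

// Sample words (illustrative). Escapes keep the code points unambiguous.
const latinWord = "hello";                      // 5 ASCII letters
const hebrewWord = "\u05E9\u05DC\u05D5\u05DD";  // "shalom", 4 letters
const arabicWord = "\u0633\u0644\u0627\u0645";  // "salam", 4 letters

console.log(utf8Bytes(latinWord));  // 5 (1 byte per letter)
console.log(utf8Bytes(hebrewWord)); // 8 (2 bytes per letter)
console.log(utf8Bytes(arabicWord)); // 8 (2 bytes per letter)

// With a hypothetical 255-byte field, how many letters fit?
const LIMIT_BYTES = 255; // assumption for illustration only
const lettersThatFit = (bytesPerLetter) => Math.floor(LIMIT_BYTES / bytesPerLetter);

console.log(lettersThatFit(1)); // 255
console.log(lettersThatFit(2)); // 127
```

Same limit on paper, roughly half the room in practice.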

🧵3/

Diacritics tend to be even more unpredictable when counting (and limiting) characters.

On Mastodon, Twitter, HTML’s `maxlength` spec and #javascript `().length`:
- This (latin) diacritic ž is 1 character
- This (hebrew) diacritic שָׁ is 3 characters
- This (arabic) diacritic ـتـ is 3 characters

Users who write in Arabic or Hebrew script will effectively get a lower limit than users who write in Latin script… which is a problem.

(see https://en.wikipedia.org/wiki/Diacritic, https://en.wikipedia.org/wiki/Arabic_diacritics and https://en.wikipedia.org/wiki/Hebrew_diacritics)
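
A small sketch of where those numbers come from, assuming a counter built on JavaScript's `.length`. The code points below are my reading of the three examples above, written as escapes so nothing gets lost in copy-paste.

```javascript
// `.length` counts UTF-16 code units, so every combining mark adds to the total.
const precomposed = "\u017E";             // z with caron as a single code point
const decomposed = "z\u030C";             // z + COMBINING CARON, renders the same
const hebrewSample = "\u05E9\u05B8\u05C1"; // shin + qamats + shin dot
const arabicSample = "\u0640\u062A\u0640"; // tatweel + teh + tatweel

console.log(precomposed.length);                 // 1
console.log(decomposed.length);                  // 2
console.log(decomposed.normalize("NFC").length); // 1 (NFC recombines it)
console.log(hebrewSample.length);                // 3
console.log(arabicSample.length);                // 3
```

Note that even the Latin example depends on how the text was typed: the decomposed form counts as 2 until it is normalized.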

🧵4/

Chinese is another example of char- and word-counting challenges.

For one, written Chinese doesn't put spaces between words, so it's harder to tell where one word ends and the next begins; its characters don't conform to the character and word boundaries that counters built for Latin script expect.

Professional translators often need to *estimate* the word-count of translation documents (to get paid fairly).

So how do you even get your software to count characters?

(see https://davidsmithtranslation.com/articles/how-to-count-chinese-characters/ for ex)
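
One possible way to get software to try, sketched with the standard `Intl.Segmenter` API. The sample sentence is my own, and the word count it produces depends on the dictionary data shipped with the runtime, so treat the result as an estimate rather than a ground truth.

```javascript
// A sample sentence (illustrative): "I like studying Chinese".
const sentence = "我喜欢学习中文";

// Splitting on whitespace sees the whole sentence as a single "word".
const whitespaceWords = sentence.split(/\s+/).filter(Boolean).length;

// Intl.Segmenter uses dictionary-based word breaking for Chinese.
const wordSegmenter = new Intl.Segmenter("zh", { granularity: "word" });
const segmentedWords = [...wordSegmenter.segment(sentence)]
  .filter((part) => part.isWordLike) // skip punctuation and spaces
  .map((part) => part.segment);

// Counting code points gives the character count translators often bill by.
const characterCount = [...sentence].length;

console.log(whitespaceWords);       // 1
console.log(segmentedWords.length); // runtime-dependent estimate
console.log(characterCount);        // 7
```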

🧵5/

There’s a lot more to say about this, but I want to quickly touch on emoji.

Emoji completely break character counters in really fun ways.

A basic emoji 🙂 is counted almost everywhere as two characters. (but not Mastodon!)

But there are also "combined" emoji: those that look like 1 emoji but are actually several joined together with a zero-width joiner (ZWJ) character, like this one: 👪

Mastodon considers it 1 char.
JavaScript (per MDN) and Twitter consider it 2 chars.

But wait, there’s more…

🧵6/

👩 is 1 emoji
👩‍👩‍👧 is 3 emojis joined into 1
👩‍👩‍👧‍👦 is 4 emojis joined into 1
Connected w/ a zero-width-joiner char

Mastodon considers each of the above 1 character.
Twitter considers each of the above 2 characters.

MDN and #javascript:
👩: 2 chars
👩‍👩‍👧: 8 chars
👩‍👩‍👧‍👦: 11 chars
…It seems to be counting the bytes, not the characters.

(Correction: JS counts UTF-16 code units, not bytes; the counts happen to look like byte counts here, but they are different things; see: https://en.wikipedia.org/wiki/UTF-16#Description)

… Joy!

(see https://r12a.github.io/uniview/)
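
The JavaScript numbers above can be reproduced with a short sketch. `countAll` is a hypothetical helper of mine; it puts four ways of counting side by side, using the standard `TextEncoder` and `Intl.Segmenter` APIs. The UTF-8 byte counts come out different from `.length`, which is why `.length` is not really a byte count.

```javascript
const encoder = new TextEncoder();
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: four different answers to "how long is this?"
function countAll(text) {
  return {
    codeUnits: text.length,                                 // what `.length` reports
    codePoints: [...text].length,                           // Unicode code points
    graphemes: [...graphemeSegmenter.segment(text)].length, // what a person sees
    utf8Bytes: encoder.encode(text).length,                 // storage size in UTF-8
  };
}

const ZWJ = "\u200D"; // ZERO WIDTH JOINER
const woman = "\u{1F469}";
const girl = "\u{1F467}";
const boy = "\u{1F466}";
const familyOfThree = [woman, woman, girl].join(ZWJ);
const familyOfFour = [woman, woman, girl, boy].join(ZWJ);

console.log(countAll(woman));
// { codeUnits: 2, codePoints: 1, graphemes: 1, utf8Bytes: 4 }
console.log(countAll(familyOfThree));
// { codeUnits: 8, codePoints: 5, graphemes: 1, utf8Bytes: 18 }
console.log(countAll(familyOfFour));
// { codeUnits: 11, codePoints: 7, graphemes: 1, utf8Bytes: 25 }
```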

🧵7/

So how do we deal with that at #Wikipedia?

The biggest thing is that WE CARE about this problem.
This might sound self-serving, but hear me out:

#Wikipedia’s mission is to enable anyone in the world to participate in the sum of all knowledge.

We support 400+ languages with billions of visits a month and an average of 345 edits per minute…

❗Other products’ “edge cases” are, very often, our use cases.❗

🧵8/

One relatively recent thing we did (in 2018) was splitting our `truncate` method into two others:

- `truncateForDatabase` - which counts bytes
- `truncateForVisual` - which counts characters

Technical contributors are encouraged to use `truncateForVisual` whenever possible to promote equity, and to use `truncateForDatabase` only when we absolutely have to cut information by byte size.

(see https://phabricator.wikimedia.org/T197492)
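
To illustrate the difference between the two kinds of truncation, here is a minimal JavaScript sketch. It is not MediaWiki's code: `truncateToBytes` and `truncateToGraphemes` are hypothetical names, and counting grapheme clusters for the "visual" case is my own assumption about what "characters" should mean.

```javascript
const encoder = new TextEncoder();
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Byte-based: keep whole code points until the UTF-8 byte budget runs out.
function truncateToBytes(text, maxBytes) {
  let result = "";
  let usedBytes = 0;
  for (const codePoint of text) { // for...of walks code points, not code units
    const size = encoder.encode(codePoint).length;
    if (usedBytes + size > maxBytes) break;
    result += codePoint;
    usedBytes += size;
  }
  return result;
}

// Visual: keep the first N user-perceived characters, whatever they cost in bytes.
function truncateToGraphemes(text, maxGraphemes) {
  let result = "";
  let kept = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    if (kept >= maxGraphemes) break;
    result += segment;
    kept += 1;
  }
  return result;
}

const shalom = "\u05E9\u05DC\u05D5\u05DD"; // 4 Hebrew letters, 8 bytes in UTF-8

console.log(truncateToBytes("hello", 3).length);     // 3 letters survive
console.log(truncateToBytes(shalom, 3).length);      // only 1 letter survives
console.log(truncateToGraphemes("hello", 3).length); // 3
console.log(truncateToGraphemes(shalom, 3).length);  // 3
```

With the same limit of 3, the byte-based version leaves the Hebrew writer a third of what the Latin writer gets; the visual version treats both the same.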

🧵9/

Another issue is user-facing. Char counters are inconsistent – so they can be very confusing to users, especially those writing in non-Latin scripts.

Most users don’t understand what a “byte” is when they type into an input.

Showing users a counter that may “jump” as they type is unhelpful and distracting.
So, we try not to.

VisualEditor doesn’t show the byte counter until it has to – when the text is approaching the limit. This means that most users at least don’t get distracted by this problem.
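
The general idea can be sketched in a few lines. This is not VisualEditor's code: `counterToShow`, the 500-byte limit and the 99-byte threshold are all hypothetical values picked for illustration.

```javascript
const encoder = new TextEncoder();

// Returns the remaining byte budget to display, or null while the
// counter should stay hidden because the user is nowhere near the limit.
function counterToShow(text, limitBytes, showWithinBytes) {
  const remaining = limitBytes - encoder.encode(text).length;
  return remaining <= showWithinBytes ? remaining : null;
}

console.log(counterToShow("hello", 500, 99));         // null (hidden)
console.log(counterToShow("x".repeat(450), 500, 99)); // 50 (now visible)
console.log(counterToShow("x".repeat(510), 500, 99)); // -10 (over the limit)
```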

🧵10/

On top of all of that, we’re also expanding our understanding of the impact of character and word counters when it comes to multilingual products.

The Wikimedia research team is doing some super cool research into multilingual readability scores that touches on how to figure out word counts in different languages.

(see https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research/Background_Research )

🧵11/

So there you have it – Character and Word counters are a totally underestimated challenge in computers and on the web.

If you care about #inclusive products, you should really understand the impact of the choices you make counting those pesky characters…

/fin

@moriel Links are handled interestingly in Mastodon (and elsewhere). I've been meaning to look at it a bit more and possibly report a bug. If you type a URL character by character, you can see a jump like you described when you get to the TLD - IIRC, it stops counting the characters in the URL at a certain point, but when you publish, it counts it in a different way, so it can fail, while still showing that you're under the limit.
@moriel Does this extend to arbitrary chains of emoji that are zero-width-joined (this could mean shenanigans are upon us) or only the ones in the standard?
My client (tusky) seems to also be counting bytes, 👩‍👩‍👧‍👧 subtracts 11 from remaining character count.

@lambdatotoro it does, but you have also stumbled onto another point here - the inconsistency between platforms!

I posted my thread from the web. On mastodon web, this emoji 👩‍👩‍👧‍👧 only reduced my char count by 1.

I'm answering you from Tusky, on Android, where 👩‍👩‍👧‍👧 is reducing the limit counter by 11.

Mastodon web counts this as a character.
Tusky on mobile counts this as 11 bytes.

I mean... They're both kinda right...

@moriel JavaScript counts UTF-16 code units, not bytes. And 👩 is probably represented by a surrogate pair, i.e. two UTF-16 code units, like most emoji. And each of these code units uses two bytes. So 👩 uses 4 bytes in JavaScript.
@moriel And sorry for the mansplaining. 😅 It is a topic that's dear to my heart as well, though I did not know the depths of it and many of these aspects as you described them!

@Shepard it might use UTF-16, but checking the ().length on each gives me
- '👩'.length = 2
- '👩‍👩‍👧'.length = 8
- '👩‍👩‍👧‍👦'.length = 11

... Which looks like it's byte counts.

Breaking 👩‍👩‍👧‍👦 down, it makes sense, too; the emoji is made of

- ‎1F469 WOMAN = 2b
- ‎200D ZERO WIDTH JOINER = 1b
- ‎1F469 WOMAN = 2b
- ‎200D ZERO WIDTH JOINER = 1b
- 1F467 GIRL = 2b
- 200D ZERO WIDTH JOINER = 1b
- ‎1F466 BOY = 2b

Which is overall = 11bytes.

What am I missing?

(edit - see https://r12a.github.io/uniview/)

@moriel 👩 is in the Supplementary Multilingual Plane. UTF-16 can only encode code points in the Basic Multilingual Plane as a single unit. Everything beyond that it splits up into two "surrogate pair" units. To my knowledge, what ().length counts are these units.

The actual number of bytes this code point uses in RAM would be more than 2. Probably 4 but I'm not sure.

See https://en.wikipedia.org/wiki/UTF-16#Description
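
For anyone who wants to check the surrogate-pair arithmetic, here is a small sketch. `toSurrogatePair` is a hypothetical helper that follows the UTF-16 encoding rule for code points above U+FFFF.

```javascript
// Split a code point above U+FFFF into its two UTF-16 code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const woman = "\u{1F469}";
const [high, low] = toSurrogatePair(0x1f469);

console.log(high.toString(16), low.toString(16)); // d83d dc69
console.log(woman.charCodeAt(0) === high);        // true
console.log(woman.charCodeAt(1) === low);         // true
console.log(woman.codePointAt(0).toString(16));   // 1f469
console.log(woman.length);                        // 2 code units
console.log(woman.length * 2);                    // 4 bytes when stored as UTF-16
```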

@Shepard Okay, I see what you mean (I think)

This would mean that the other combination emoji are counting pairs + separator(byte?) + pairs +... etc which is still 8 or 11 "places"/counts.

... but it's not actual bytes, even though it effectively would count the same if it was bytes.

That's a fair point! Thanks!

@Shepard I added a correction to the post, though don't have enough char count space (IRONICALLY) to fully explain it :)

thanks!

@moriel
> The length data property of a String value contains the length of the string in UTF-16 code units.
@moriel I guess there's also an extra layer when considering how information density varies among different scripts, with some of them requiring long strings of characters to convey (almost) the same information as a couple of characters in other scripts.