This is something that I've been fascinated by for a while now. Let's see if I can lay it all out --

💡Unexpected Challenge Breakdown💡

🌟char- and word-counters!🌟

It sounds straightforward but has a ton of surprising challenges:

- it's inconsistent between #languages and scripts
- it's inconsistent between products online (!)
- it's hard for users to understand
- it's completely thrown off by emoji 🙊🙈🙉

… We thought a lot about this at #Wikipedia …

#accessibility #a11y

(thread 🧵1/)

🧵2/

“Byte” limits can cause an #equity problem:

🔤 For Latin script, each char is more or less a byte (ASCII)
🔣 For (many) non-Latin scripts, each char can be 2 or 3 bytes.

Arabic and Hebrew, for example, use 2 bytes per character in UTF-8. Diacritics may use more.

Limiting chars based on storage (like a varchar limit) may mean non-Latin-script users get half as much (or less!) effective length as English/Latin users.

(see https://en.wikipedia.org/wiki/UTF-8#Encoding)
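A rough sketch of the byte-limit problem, assuming the text is stored as UTF-8. The sample words and the `truncateToBytes` helper are my own illustrations, not something any of the products mentioned is known to use:

```javascript
// Count the UTF-8 bytes a string occupies (TextEncoder always encodes as UTF-8).
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Hypothetical helper: keep as many whole code points as fit in a byte budget.
function truncateToBytes(text, maxBytes) {
  let result = "";
  let used = 0;
  for (const ch of text) { // for...of walks the string code point by code point
    const size = utf8ByteLength(ch);
    if (used + size > maxBytes) break;
    result += ch;
    used += size;
  }
  return result;
}

// Illustrative samples, written as escapes so the encoding is unambiguous.
const samples = {
  latin: "hello",                     // 5 letters
  hebrew: "\u05E9\u05DC\u05D5\u05DD", // 4 letters (shalom)
  arabic: "\u0633\u0644\u0627\u0645", // 4 letters (salaam)
  chinese: "\u4F60\u597D",            // 2 characters (ni hao)
};

for (const [script, word] of Object.entries(samples)) {
  console.log(script, [...word].length, "code points,", utf8ByteLength(word), "bytes");
}
```

With a 5-byte budget, `truncateToBytes` keeps all 5 Latin letters but only 2 of the 4 Hebrew letters, which is the "half as much" effect in miniature.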


🧡3/

Diacritics tend to be even more unpredictable when counting (and limiting) characters.

On Mastodon, Twitter, HTML's `maxlength` spec and #javascript `().length`:
- This (Latin) diacritic ž is 1 character
- This (Hebrew) diacritic שָׁ is 3 characters
- This (Arabic) diacritic ـتـ is 3 characters

Users who write Arabic will effectively get a lower limit than users who write in Latin… which is a problem.

(see https://en.wikipedia.org/wiki/Diacritic, https://en.wikipedia.org/wiki/Arabic_diacritics and https://en.wikipedia.org/wiki/Hebrew_diacritics)
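A small sketch of why the counts above differ, assuming the counter works on Unicode code points. The escapes spell out the three samples; `countCodePoints` is just an illustrative name:

```javascript
// Count Unicode code points (spreading a string iterates by code point).
function countCodePoints(text) {
  return [...text].length;
}

const precomposed = "\u017E";        // z-with-caron stored as one code point
const decomposed = "z\u030C";        // z + COMBINING CARON, renders the same
const hebrew = "\u05E9\u05B8\u05C1"; // shin + qamats + shin dot
const arabic = "\u0640\u062A\u0640"; // tatweel + teh + tatweel

console.log(countCodePoints(precomposed));                 // 1
console.log(countCodePoints(decomposed));                  // 2
console.log(countCodePoints(decomposed.normalize("NFC"))); // 1
console.log(countCodePoints(hebrew.normalize("NFC")));     // 3
console.log(countCodePoints(arabic));                      // 3
```

Normalizing to NFC folds the Latin pair into a single code point, but the Hebrew sample stays at 3 because NFC does not compose it into one code point, so normalizing before counting does not level the field.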


🧡4/

Chinese is another example of char- and word-counting challenges.

For one, it's harder to distinguish what a character or a word is: characters are made from overlapping strokes, and the text does not follow the character and word boundaries that Latin-script counters assume (there are no spaces between words, for example).

Professional translators often need to *estimate* the word-count of translation documents (to get paid fairly).

So how do you even get your software to count characters?

(see https://davidsmithtranslation.com/articles/how-to-count-chinese-characters/ for example)


🧡5/

There’s a lot more to say about this, but I want to quickly touch on emoji.

Emoji completely break character counters in really fun ways.

A basic emoji 🙂 is counted almost everywhere as two characters. (But not on Mastodon!)

But there are also "combined" emojis: those that look like 1 emoji but are actually several connected with a zero-width-joiner char, like this one: 👪

Mastodon considers it 1 char.
MDN and Twitter consider it 2 chars.

But wait, there's more…

🧡6/

πŸ‘©β€is 1 emoji
πŸ‘©β€πŸ‘©β€πŸ‘§ is 3 emojis into 1
πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦is 4 emojis into 1
Connected w/ a zero-width-join char

Mastodon considers all the above 1 characters.
Twitter considers all the above as 2 characters.

MDN and #javascript:
πŸ‘©: 2 chars
πŸ‘©β€πŸ‘©β€πŸ‘§: 8 chars
πŸ‘©β€πŸ‘©β€πŸ‘§β€πŸ‘¦: 11 chars
…It seems to be counting the bytes, not the characters.

(Correction: JS counts UTF-16 code units; here they happen to look like byte counts, but they are different; see: https://en.wikipedia.org/wiki/UTF-16#Description)
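For comparison, a sketch of three ways to count the same emoji, assuming a runtime that ships `Intl.Segmenter` (recent browsers and Node versions do). Which of these each product actually uses is my guess from the numbers above, not something confirmed; `countAll` is an illustrative name:

```javascript
// Return three different "lengths" for the same string.
function countAll(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: text.length,                        // UTF-16 code units, what .length reports
    codePoints: [...text].length,                  // Unicode code points
    graphemes: [...segmenter.segment(text)].length // user-perceived characters
  };
}

const ZWJ = "\u200D"; // ZERO WIDTH JOINER
const woman = "\u{1F469}";
const family3 = ["\u{1F469}", "\u{1F469}", "\u{1F467}"].join(ZWJ);
const family4 = ["\u{1F469}", "\u{1F469}", "\u{1F467}", "\u{1F466}"].join(ZWJ);

console.log(countAll(woman));   // codeUnits 2, codePoints 1, graphemes 1
console.log(countAll(family3)); // codeUnits 8, codePoints 5, graphemes 1
console.log(countAll(family4)); // codeUnits 11, codePoints 7, graphemes 1
```

Counting graphemes gives 1 for all three, which matches what Mastodon shows; `.length` gives the 2, 8 and 11 listed above.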

… Joy!

(see https://r12a.github.io/uniview/)


@moriel JavaScript counts UTF-16 characters, not bytes. And 👩 is probably represented by a surrogate pair of two UTF-16 characters, like most emoji. And each of these characters uses two bytes. So 👩 uses 4 bytes in JavaScript.
@moriel And sorry for the mansplaining. 😅 It is a topic that's dear to my heart as well, though I did not know the depths of it and many of these aspects as you described them!

@Shepard it might use UTF-16, but checking the ().length on each gives me
- '👩'.length = 2
- '👩‍👩‍👧'.length = 8
- '👩‍👩‍👧‍👦'.length = 11

... Which looks like it's byte counts.

Breaking 👩‍👩‍👧‍👦 down, it makes sense, too; the emoji is made of

- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F469 WOMAN = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F467 GIRL = 2b
- 200D ZERO WIDTH JOINER = 1b
- 1F466 BOY = 2b

Which is overall = 11 bytes.

What am I missing?

(edit - see https://r12a.github.io/uniview/)


@moriel 👩 is in the Supplementary Multilingual Plane. UTF-16 can only encode code points in the Basic Multilingual Plane as a single unit. Everything beyond that it splits up into two "surrogate pair" units. To my knowledge, what ().length counts are these units.

The actual number of bytes this code point uses in RAM would be more than 2. Probably 4 but I'm not sure.

See https://en.wikipedia.org/wiki/UTF-16#Description
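A minimal sketch of the surrogate-pair split described above, using the standard UTF-16 arithmetic. `toSurrogatePair` is a hypothetical helper name, and the byte figure assumes plain UTF-16 encoding; how an engine really lays strings out in RAM is an implementation detail:

```javascript
// Split a code point above the Basic Multilingual Plane into its two UTF-16 units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point fits in one unit or is out of range");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const woman = "\u{1F469}";
const [high, low] = toSurrogatePair(0x1F469);

console.log(high.toString(16), low.toString(16)); // d83d dc69
console.log(woman.charCodeAt(0) === high);         // true
console.log(woman.charCodeAt(1) === low);          // true

// ZERO WIDTH JOINER (U+200D) sits inside the BMP, so it is a single unit.
console.log("\u200D".length); // 1

// Each UTF-16 code unit is 2 bytes when encoded as UTF-16.
console.log(woman.length * 2); // 4
```

So the family emoji's 11 is 4 pairs (8 units) plus 3 joiners (3 units), and as UTF-16 that would be 22 bytes rather than 11.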


@Shepard Okay, I see what you mean (I think)

This would mean that the other combination emoji are counting pairs + separator(byte?) + pairs +... etc which is still 8 or 11 "places"/counts.

... but it's not actual bytes, even though it effectively would count the same if it was bytes.

That's a fair point! Thanks!

@Shepard I added a correction to the post, though don't have enough char count space (IRONICALLY) to fully explain it :)

thanks!