If you drag an emoji family with a string length of 11 into an input with maxlength=10, one of the children will disappear.

Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.
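The mismatch is easy to reproduce in a JavaScript console. A minimal sketch (the ZWJ family sequence is spelled out with escapes so its structure is visible):

```javascript
// The "family" emoji 👨‍👩‍👧‍👦 is four person emoji joined by zero-width joiners (ZWJ).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

// String.prototype.length counts UTF-16 code units:
// each person emoji is a surrogate pair (2 units), plus 3 ZWJs = 4*2 + 3 = 11
console.log(family.length); // 11
```

So by the count that `.length` reports, the full family cannot fit into a maxlength=10 field.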

I filed a WebKit bug: https://bugs.webkit.org/show_bug.cgi?id=252900

252900 – HTML maxlength attribute treats emoji of string length 11 as length 1

@simevidas dang not unicode aware :')
@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

@jkt @simevidas

Following up with that, as I was thinking of some examples of what I mean...

Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6?

Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?
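Those counts can be sketched in JavaScript (Node.js assumed, for `Buffer`):

```javascript
// 漢字: 2 code points, 2 UTF-16 code units, 6 UTF-8 bytes
const kanji = "漢字";
console.log([...kanji].length);                // 2 (code points)
console.log(kanji.length);                     // 2 (code units; both chars are in the BMP)
console.log(Buffer.byteLength(kanji, "utf8")); // 6 (bytes)

// "Góða nótt": 9 code points, 12 UTF-8 bytes (ó and ð take 2 bytes each)
const phrase = "G\u00F3\u00F0a n\u00F3tt";
console.log(phrase.length);                     // 9
console.log(Buffer.byteLength(phrase, "utf8")); // 12
```

By the spec's code-unit count this phrase passes maxlength=10, even though a byte-based check would reject it.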

@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.
@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers and might cut UTF-8 sequences apart (assuming UTF-8 is the default encoding)

With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.
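The same hazard exists one layer up: truncating at a raw UTF-16 code-unit boundary can split a surrogate pair. A sketch, using a hypothetical naive truncation to maxlength=10:

```javascript
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // 👨‍👩‍👧‍👦

const cut = family.slice(0, 10); // naive cut at 10 code units

// The boy (U+1F466) occupied code units 9-10 as the surrogate pair D83D DC66,
// so the cut string now ends in an unpaired high surrogate:
console.log(cut.charCodeAt(9).toString(16)); // "d83d"
```

(On newer runtimes, `cut.isWellFormed()` would report `false` here.)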

@johannes @ramsey @jkt @simevidas I think most OSes and language stdlibs/runtimes use something other than utf8 internally. NSString on Apple platforms is UTF-16, and str in Python 3 is actually a custom variant of UTF-32!
@daisy @johannes @ramsey @simevidas the HTML spec is UTF-16, surely Safari is correct here?
@daisy @johannes @ramsey @simevidas ah the spec says the length is based on utf16 length so Safari is wrong as @simevidas stated initially
@jkt @daisy @ramsey @simevidas interesting choice to require a specific encoding. Probably that was the time where one assume utf-16 would be the encoding all operating systems etc. would use and then tying to storage buffers etc. makes somewhat sense. Especially also pre-Emoji ...
@johannes @daisy @ramsey @simevidas yeah utf16 was picked as a base in HTML and browsers before all of this.
@ramsey @jkt @simevidas Neither is number of bytes. It's number of characters vs number of grapheme clusters.
@dalias @ramsey @jkt @simevidas safari uses the number of UCS2 words which is arguably the least useful metric to use, although i guess in the context of javascript the easiest. the family emoji is 1 grapheme cluster, 7 codepoints, and 25 bytes long
@charlotte @jkt @simevidas @ramsey I guess you mean UTF-16 since # of UCS2 words is just number of characters with constraint that you can only use BMP.
@dalias @charlotte @simevidas @ramsey yeah the spec says to use UTF-16 code units as the length.
@jkt @dalias @simevidas @ramsey Yeah the issue is that most software that uses UTF-16 (including Safari) treats it as a fixed-width encoding, even though 1 codepoint can be one or two 16-bit words wide. Which is why I mentioned UCS2, because it actually is a fixed-width encoding
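The three counts mentioned above can be checked in Node.js (a sketch; `Intl.Segmenter` requires Node 16+ or a modern browser):

```javascript
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // 👨‍👩‍👧‍👦

// 1 grapheme cluster
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...segmenter.segment(family)].length); // 1

// 7 code points: 4 people + 3 zero-width joiners
console.log([...family].length); // 7

// 25 UTF-8 bytes: 4 × 4-byte emoji + 3 × 3-byte ZWJ
console.log(Buffer.byteLength(family, "utf8")); // 25
```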

@ramsey @jkt @simevidas Safari may be Unicode-aware, but is it HTML-aware? The specification is clear on this point: "A string’s length is the number of code units it contains."

See https://infra.spec.whatwg.org/#string-length


@simevidas

Kinda wondering what the rules are: code points? Bytes? What if the page is UTF-32 or ASCII? (Hopefully that insanity is gone)

@DevWouter @simevidas As I understand the spec, it’s “code units”, ie, 2-byte UTF-16 units, for historical or compatibility reasons probably. Wouldn’t make sense IMO if you started in a modern “codepoint” world. https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-maxlength

@ujay68 @simevidas

Thanks to your link I did some digging and came to the same conclusion. It even says that JavaScript strings are UTF-16. And a quick check in JavaScript on both Firefox and Safari shows that the JS implementation is the same in both.

Kinda weird that the HTML5 spec suggests UTF-8. (Also, Mastodon counts 👩‍👩‍👧‍👧 as a single character)

@DevWouter @simevidas Yes, JavaScript strings have been UTF-16 since the beginning of time. I think that’s where many of the compatibility issues come from. The Go language, eg, has a more modern approach combining UTF-8 byte sequences and codepoints for characters (“runes”).

@DevWouter @simevidas From an end-user point of view, the only concept that would make sense as a measure of length IMO is what Unicode calls a “grapheme cluster”, ie, a sequence of code points that displays or prints as ONE visible symbol, ONE (possibly complex composite) emoji or ONE (possibly multiply accented) character.
@ujay68 @DevWouter I guess, this could be based on text caret steps (when the user presses the Arrow Left/Right keys to move the caret).

@DevWouter @simevidas unfortunately, the WHATWG Infra Standard defines “length” as the number of UTF-16 code units. https://infra.spec.whatwg.org/#string-length

So Safari’s behavior is technically wrong.


@chucker @DevWouter However, the spec defines maxlength both as a “length” and a “number of characters”, and “characters” is defined as code points, not code units. In this case the “length” is 11 and the “number of characters” is 7; the spec is malformed.
@jens @DevWouter so there is hope yet!
@chucker I feel quite confident that any correction will be towards the UTF-16 interpretation, for “compatibility”
@jens @chucker Yeah, the maxlength attribute was defined a long time ago. Browsers will not risk changing it now and breaking a bunch of websites in the process. However, a new attribute (maxchars or similar) could be proposed.

@jens @chucker @DevWouter Speaking of spec: I wanted to look up how maxlength is defined and got rewarded with this example:

The following extract shows how a messaging client's text entry could be arbitrarily restricted to a fixed number of characters, thus forcing any conversation through this medium to be terse and discouraging intelligent discourse.

<label>What are you doing? <input name=status maxlength=140></label>

@simevidas While I find Safari’s behaviour more relatable for end users (how is one supposed to know that an emoji is not a single character?), the spec says that maxlength is to be measured in 16-bit “code units” (sigh): https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#attr-fe-maxlength Even if you tried to measure in Unicode “code points”, that wouldn’t be intuitive for anyone who’s not a Unicode expert. AFAIK, the birdsite counts every emoji as a fixed number of characters (2?), independent of its technical representation.

@ujay68 Yep, Twitter renders emoji as custom SVG images, and they take up two spaces, regardless of emoji string length.
@simevidas Hah! I guess https://bugs.webkit.org/show_bug.cgi?id=93196 (from 2012) can be closed now then. A relevant spec issue seems to be https://github.com/whatwg/html/issues/7861
93196 – Password fields display two replacement characters for a single supplementary Unicode symbol

@mathias I wonder if some websites consider this a strong password 😂

On strong passwords, @simevidas:

Sometimes I wonder if I should use accented characters when using pass phrases in my native French. But I often see systems breaking them (e.g. summer, été, becomes Ã©tÃ©)

As for your example, with the Unicode family members in <p>...</p>: you probably parsed e.textContent and treated it as a product (e.g. the first Unicode member is number 1, etc.): <p>1*2*3*4</p>

Because I notice that it's the boy, the last one, who disappears. I imagine if you put another member at the end, it'd be that one instead.

@simevidas from a user's perspective Safari is the only one doing it right.
@samueljohn @simevidas Yeah, this really should be a bug report against Chromium and friends.

@simevidas you could add a reference to https://infra.spec.whatwg.org/#string-length that specifies that the length of a string is the number of UTF-16 code units.

(Alas, I personally would prefer graphemes to be the length – disappearing children or other family members tend to surprise users)
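A grapheme-based limit is already expressible in userland via Intl.Segmenter. A sketch with a hypothetical helper (the name and API are made up for illustration):

```javascript
// Hypothetical helper: enforce a limit in grapheme clusters rather than
// UTF-16 code units, so no family member is ever cut in half.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

function truncateToGraphemes(text, maxGraphemes) {
  const graphemes = [...segmenter.segment(text)].map((s) => s.segment);
  return graphemes.slice(0, maxGraphemes).join("");
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // 👨‍👩‍👧‍👦
// Keeps the whole family as one "character", plus one more:
console.log(truncateToGraphemes(family + "abc", 2) === family + "a"); // true
```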


@simevidas the term of the day is "Extended Grapheme Cluster"!
@simevidas swear I didn't read this first! 😬

@simevidas Ugh, don't you just LOVE when browser makers go off on their own and refuse to adhere to convention…

…thus making it the web devs' problem. 🙄

@simevidas @rami Kinda hoping here that they will not »fix« the »bug«, looking from a user’s perspective.
@simevidas This was a very sad story.
@simevidas 🤯 I thought this was a joke when I first saw it. Oh my.
@simevidas emoji family? Wtf? I thought skin colors/genders were the maximum of what Unicode can do.

@wilmhit, see also https://www.unicode.org/emoji/charts/emoji-zwj-sequences.html

Unicode going from static code points to a DSL is one of my least favorite modern developments, not just for emoji but also Zalgo text. I do have extra distaste for emoji, though, because they are so tiny and I have to look them up every time for the meaning. IMHO accessibility standards should require title text in user agents.

Cc: @simevidas

@cnx @simevidas @wilmhit I've been pushing for the ability to expand an emoji when long-pressing on it. IMO this should be a common accessibility feature of all apps that display emoji. If not long press, then something else de facto standard.

#UX #a11y #accessibility

@jamesmarshall @cnx @simevidas @wilmhit do you mean expand = gets bigger or shows alt text or shows individual characters?
@suethepooh @cnx @simevidas @wilmhit I mean the emoji get bigger, but showing alt text should be possible too.
@jamesmarshall @cnx @simevidas @wilmhit without my readers I can’t make out any but the most basic shapes and I hate wearing my readers