If you drag an emoji family with a string size of 11 into an input with maxlength=10, one of the children will disappear.

Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.
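A quick way to see where the 11 comes from (a sketch in JavaScript; the family emoji is one example of a multi-code-point emoji):

```javascript
// The family emoji is four emoji joined by zero-width joiners (ZWJ):
// U+1F468 ZWJ U+1F469 ZWJ U+1F467 ZWJ U+1F466
const family = "👨‍👩‍👧‍👦";

// String.prototype.length counts UTF-16 code units:
// 4 surrogate pairs (2 each) + 3 ZWJs = 11
console.log(family.length); // 11
```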

I filed a WebKit bug: https://bugs.webkit.org/show_bug.cgi?id=252900

252900 – HTML maxlength attribute treats emoji of string length 11 as length 1

@simevidas dang not unicode aware :')
@jkt @simevidas In this case, Safari is the one that’s Unicode aware. The other browsers are treating maxlength as the number of bytes rather than the number of characters. 🙂

@jkt @simevidas

Following up on that, as I was thinking of some examples of what I mean...

Take kanji, for example. 漢字 is 2 characters, but it's 6 bytes, so is the length 2 or 6?

Or the phrase "Góða nótt" in Icelandic. It's 9 characters (counting the space in the middle), but it's 12 bytes. So, should this fail the maxlength check, if the maxlength is 10?
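The character/byte split in those two examples can be checked in Node (a sketch; `.length` counts UTF-16 code units, and `Buffer.byteLength` gives the UTF-8 byte count):

```javascript
// UTF-16 code units vs. UTF-8 bytes for the examples above
console.log("漢字".length);                          // 2 code units
console.log(Buffer.byteLength("漢字", "utf8"));      // 6 bytes

console.log("Góða nótt".length);                     // 9 code units
console.log(Buffer.byteLength("Góða nótt", "utf8")); // 12 bytes
```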

@ramsey @jkt @simevidas bytes assume an encoding. Codepoints vs. grapheme clusters is the distinction in experience, I guess.
@johannes @jkt @simevidas I thought it would be the other way around. The same grouping of bytes could represent different codepoints, based on the encoding.

@ramsey @jkt @simevidas yes, but working on bytes means that the encoding has to be carried through the different layers and might cut utf-8 sequences apart (assuming utf-8 being the default encoding)

With either codepoints or grapheme clusters you at least get some valid (while not always sensible) result.
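The three counts (code units, code points, grapheme clusters) can all diverge on the same string; a sketch using Intl.Segmenter, which is available in modern browsers and Node 16+:

```javascript
const family = "👨‍👩‍👧‍👦"; // man + ZWJ + woman + ZWJ + girl + ZWJ + boy

// UTF-16 code units
console.log(family.length); // 11

// Unicode code points — string iteration is code-point aware
console.log([...family].length); // 7

// Grapheme clusters — what a user perceives as one "character"
const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
console.log([...seg.segment(family)].length); // 1
```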

@johannes @ramsey @jkt @simevidas I think most OSes and language stdlibs/runtimes use something other than utf8 internally. NSString on Apple platforms is UTF-16, and str in Python 3 is actually a custom variant of UTF-32!
@daisy @johannes @ramsey @simevidas the HTML spec is UTF-16, surely safari is correct here?
@daisy @johannes @ramsey @simevidas ah the spec says the length is based on utf16 length so Safari is wrong as @simevidas stated initially
@jkt @daisy @ramsey @simevidas interesting choice to require a specific encoding. Probably that was the time when one assumed UTF-16 would be the encoding all operating systems etc. would use, and then tying it to storage buffers etc. makes some sense. Especially also pre-emoji ...
@johannes @daisy @ramsey @simevidas yeah utf16 was picked as a base in HTML and browsers before all of this.