Consider a Mastodon/Fediverse handle, like @[email protected] . What kinds of restrictions are there on "username"? Can I assume any valid unicode could go in there?

Somebody linked me RFC 7565, which links to RFC 7564, and if that's the place to look, this appears to be the list of disallowed characters in a Fediverse username. I'm cracking up because it's *mostly* stuff you'd expect, except the very first category of banned characters, specifically, is "pre-1700 Korean characters".

The fediverse is welcome to all. EXCEPT KOREAN TIME TRAVELERS. Did you just wake up from being frozen in ice during the Joseon dynasty? The IETF is targeting you PERSONALLY

@mcc I get it, but the exclusion of "Q" property characters is an interesting and odd one.
@xgranade @mcc Reminds me of how CIRA decided that anyone who buys a .ca domain automatically gets reserved all accented character variations: https://www.cira.ca/en/ca-domains/register-your-ca/domains-french-accented-characters/
@mcc well, darn, I guess I don't comply with the PRECIS IdentifierClass profile
@mcc Why did it end up like that?
@thatdawnperson I thiiiiink that the way they fit the antique Korean jamo into Unicode requires a really awkward hack that they just don't want these systems to have to deal with
@thatdawnperson But seeing them lead with that just makes it seem oddly vindictive
@mcc ...is there any reasoning given for this?? and for the latter two, those seem weird too
-F
@mcc @Hearth @xgranade I'm guessing Q and R are disallowed to mitigate homoglyph attacks. Maybe Old Hangul too, which presumably contains some homoglyphs with modern Hangul.
@alilly @Hearth @xgranade ohhhh wait that would make so much sense :O with the old jamo

@mcc @alilly @xgranade that makes sense! homoglyph attacks are still possible with e.g. replacing latin o with greek ο or cyrillic о, though?

...unless that's what section Q is talking about, i don't know exactly what it means
-F
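
To make the lookalike point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It is an illustration, not what any Fediverse server or RFC profile actually runs; the variable names are made up. It shows that the three "o" letters are distinct code points and that Unicode normalization does not fold them together.

```python
import unicodedata

# Latin o, Greek omicron, Cyrillic o: three letters that look alike.
lookalikes = ["o", "\u03bf", "\u043e"]

for ch in lookalikes:
    # Print the code point and its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Even compatibility normalization (NFKC) keeps them distinct,
# because they are different letters from different scripts.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
print(len(normalized))  # 3
```

So normalization alone does nothing about cross-script lookalikes; catching them needs a separate policy.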

@mcc @Hearth @xgranade Yeah but that's much harder to do anything about, unless you want to ban modern speakers of languages written using Cyrillic from using names in their native language, which… don't do that.
@alilly @mcc @xgranade yeah, i guess the difference with the hangul thing is that it's a safe assumption no one is using those characters to write their names in modern times, which is not the case for greek or cyrillic
-F

@Hearth The "Q" section is mostly about accented latin alphabet characters.

For example, "á" can be represented as either the single code point U+00E1, or as a pair of code points U+0061 U+0301. The second version is the code point for the letter "a" followed by "COMBINING ACUTE ACCENT" to add the accent to the previous code point.

Since they render identically (not just similarly), you probably don't want both sequences to be valid in names humans are meant to distinguish.
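
A quick way to see this, sketched in Python with the standard `unicodedata` module. The helper name `same_after_nfc` is hypothetical, and this is only an illustration of canonical equivalence, not the full PRECIS processing that the RFCs describe.

```python
import unicodedata

precomposed = "\u00e1"   # "á" as the single code point U+00E1
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ, even though they render identically.
print(precomposed == decomposed)                # False
print(len(precomposed), len(decomposed))        # 1 2

# After normalization they compare equal.
print(same_after_nfc(precomposed, decomposed))  # True
```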

@jamesh @Hearth it makes sense to say unicode forms should be normalised. One form for identical characters. Something like rfc7613
@Hearth @xgranade @mcc … Damn, that might be a valid argument in favor of Han unification. How dare things I already made up my mind on have nuance I didn't consider?
@alilly @mcc @Hearth @xgranade ... just saw this boosted out of context and was very confused why @q and I would be disallowed from something
@mcc this was a subplot in Analog surely
@chrisamaphone so remember, part of the revanchist ideology in Analogue involved enforcement of writing in Hanja

@mcc Oh! Yeah. It's because they don't have a well-defined canonical composition order, unlike modern Jamo, which do.

A weird bit of trivia: there is no composition for hanzi/kanji/hanja/chữ Hán characters (what many call "Chinese characters"). You can't just build one in Unicode. If you could, they'd also be in this list, for the same reason that Old Hangul Jamo are disallowed (which were only added because scholars needed them).

@Elizafox @mcc I regret to inform you, https://en.wikipedia.org/wiki/Chinese_character_description_languages#Ideographic_Description_Sequences
though afaik no implementation actually renders these sequences composed

@rcombs @Elizafox I AM NOW VERY EXCITED ABOUT USING THESE COMBINERS ON EMOJI, EVEN IF NOBODY CAN RENDER IT
@mcc @rcombs Jamo are canonicalised to a glyph according to a formula. There’s no such thing for the Chinese character composition characters. Unfortunately.
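
For the curious, the formula for modern jamo can be sketched in a few lines of Python. The constants are the ones given for Hangul syllable composition in the Unicode Standard; the function name `compose_syllable` is made up for illustration, and the old-jamo example (U+119E, the "arae-a" vowel) is just one sample of a sequence that has no precomposed form.

```python
import unicodedata

# Constants for Hangul syllable composition from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(lead, vowel, trail=None):
    """Arithmetically compose modern jamo into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# Modern jamo: the formula agrees with NFC normalization.
modern = "\u1112\u1161\u11ab"  # lead HIEUH, vowel A, trail NIEUN
print(compose_syllable(*modern) == unicodedata.normalize("NFC", modern))  # True

# Old jamo: NFC leaves the sequence as separate code points.
old = "\u1100\u119e"  # lead KIYEOK followed by the old vowel ARAEA
print(unicodedata.normalize("NFC", old) == old)  # True
```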

@mcc It doesn't come through in the RFC, but afaict it's more like "Hangul is too harmonic for our feeble algorithms to handle." Without reasonably interoperable "does <this> equal <that>?" algorithms, IDNA would be unreliable...

To quote selectively from https://www.alvestrand.no/pipermail/idna-update/2008-February/001117.html

"<...>the fact that Hangul is designed so well structured on so many levels (feature, phoneme, syllable) is actually the very reason for why there are so many (fundamentally, not only superficially) different proposals for encodings, [...]. Encoding designers all saw the beauty, but the differed on which level to consider most important. All the other, not-so-well-thought-through scripts give the encoders much less options to work (and mess) with."

@mcc I choose to interpret this as a personal slight to the self-proclaimed crown prince of the Joseon dynasty (who totally deserves it after what he did to Freenode)

https://en.wikipedia.org/wiki/Andrew_Lee_(entrepreneur)

@mcc what will they have done to deserve this??
@mcc I feel like someone waking up today from the Joseon dynasty has much more immediate problems to worry about than the Fediverse.