Three small announcements:
1. RFC 9839, a guide to which Unicode characters you should never use: https://www.rfc-editor.org/rfc/rfc9839.html
2. Blog piece with background and context, “RFC 9839 and Bad Unicode”: https://www.tbray.org/ongoing/When/202x/2025/08/14/RFC9839
3. A little Go library that implements 9839’s exclusion subsets: https://github.com/timbray/RFC9839

#Unicode

RFC 9839: Unicode Character Repertoire Subsets

This document discusses subsets of the Unicode character repertoire for use in protocols and data formats and specifies three subsets recommended for use in IETF specifications.
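The RFC's most permissive-but-safe subset, "Unicode Assignables", works by excluding a few well-defined ranges. As a rough illustration (an independent sketch of those exclusion rules, not the actual timbray/RFC9839 API), a checker might look like:

```go
package main

import "fmt"

// assignable reports whether r falls in the spirit of RFC 9839's
// "Unicode Assignables" subset: legacy C0/C1 controls (except
// \t, \n, \r), surrogates, and noncharacters are excluded.
// Sketch only; consult the RFC for the normative definition.
func assignable(r rune) bool {
	switch {
	case r == '\t' || r == '\n' || r == '\r':
		return true
	case r < 0x20: // remaining C0 controls
		return false
	case r >= 0x7F && r <= 0x9F: // DEL and the C1 controls
		return false
	case r >= 0xD800 && r <= 0xDFFF: // surrogates
		return false
	case r >= 0xFDD0 && r <= 0xFDEF: // noncharacter block
		return false
	case r&0xFFFE == 0xFFFE: // U+xxFFFE / U+xxFFFF in every plane
		return false
	case r > 0x10FFFF: // beyond Unicode entirely
		return false
	}
	return true
}

func main() {
	fmt.Println(assignable('A'), assignable('\n'), assignable(0xFFFE), assignable(0x07))
	// → true true false false
}
```

Note that private-use code points pass this check, which is relevant to the discussion below.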

@timbray oh, that lib sounds juicy ❤️
@andrewg eh, I bashed it out one afternoon, nothing special.
@timbray i haven’t looked at it yet, but the idea is spot on. 👍🏻
@timbray thanks for getting this one to RFC status 🙏🏻
@timbray Thanks to you and @paulehoffman for the perseverance in getting that through the process.

@timbray would be curious as to the rationale for the choice of the "problematic" terminology, as that adjective is famously considered to be so vague as to constitute a sort of "red flag" when deployed in discussions of online propriety. the precise distinction of "never useful text" and "can lead to misbehavior" seems like a useful one, although i'd argue that private-use characters should be included precisely because they can sometimes be valid, so are more likely to show up.

were any alternatives considered for terminology to designate such invalid text characters? "non-assignable" would seem to be much more specific with respect to the "unicode assignables" subset defined in the rfc document.

@hipsterelectron @timbray Nah, "problematic" is used correctly here: "constituting or presenting a problem or difficulty."

It does not mean improper or rude. The private-use ones do not cause problems. You might get a square, but then you are not party to the agreement!

@sayrer @timbray agreed on private-use characters since i would not want generic software to exclude them
@sayrer @timbray the precedent of "noncharacter" from unicode would seem to motivate something closer to the "non-assignable" terminology. if the "problematic" classification is not intended to be referenced from other RFCs, i don't see a problem with it
@hipsterelectron @timbray Yeah, you're just supposed to say "we're using Unicode Assignables" ... " Specifications can refer to these subsets..." https://www.rfc-editor.org/rfc/rfc9839.html#name-subsets

@hipsterelectron @timbray Yeah, I'm really unclear on what makes the C0 codes (other than U+0000) "problematic".

I mean, application/json-seq uses ASCII record separators, and I think in general it would be good if *more* data formats used proper separator characters rather than comma, space, tab, tilde, and so on.
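For reference, json-seq's framing (RFC 7464) puts an ASCII Record Separator before each JSON text and a newline after it, so no escaping is ever needed. A minimal round-trip sketch (assumed helper names, not any published library):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rs is the ASCII Record Separator (0x1E) used by application/json-seq.
const rs = "\x1e"

// writeSeq frames each JSON text as RS + text + LF.
func writeSeq(records []string) string {
	var b strings.Builder
	for _, r := range records {
		b.WriteString(rs)
		b.WriteString(r)
		b.WriteByte('\n')
	}
	return b.String()
}

// readSeq splits a json-seq stream back into records. Because RS
// never appears inside a JSON text, a damaged record cannot
// corrupt the framing of the records that follow it.
func readSeq(stream string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		out = append(out, strings.TrimPrefix(sc.Text(), rs))
	}
	return out
}

func main() {
	s := writeSeq([]string{`{"a":1}`, `{"b":2}`})
	fmt.Println(readSeq(s)[0]) // → {"a":1}
}
```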

@mathew @hipsterelectron I look at https://www.unicode.org/charts/PDF/U0000.pdf and aside from \n, \r, and \t, there is a distinct smell of last century. If we were designing json-seq now, people would be asking why not just \n for a separator?
@timbray @mathew @hipsterelectron doesn't matter because it's a tree structure. the real cost is hunting for the end of strings.

@timbray @hipsterelectron For json-seq specifically, it means you can safely continue parsing a stream even if you encounter invalid UTF-8 in one of the records.

In general escaping rules are a source of error and lack of interoperability (see: CSV), and the less you need to escape data the better. ASCII record separators occur in user data approximately never, unlike comma, newline and tab.


@mathew @timbray i introduced the alarm bell as a separator for array entries in a portable posix shell script to @spack precisely because it was unlikely to occur in valid compiler command line arguments. command line arguments are of course already not unicode, and the contents of our array polyfill are never printed anywhere, so i didn't think it would serve to undermine this RFC's designation. however, if we updated our logic to use an ASCII record separator instead, we could consider printing out the array contents directly for debugging. given the myriad quoting and escaping we're already performing in our cc wrapper script, needing to escape the record separator is out of the question.
@[email protected] @[email protected] @[email protected] I now have silly thoughts of an ASCII Master teaching their student to XON, XOFF.

@timbray haha, in the spirit of all comments on that document, some suggestions for the blog post:

HTJ is at least in ECMA-48 2nd Edition (1979) [page 6], and I strongly suspect it is in the first edition (1976), but that edition is not online.

I would not pluralize "emoji", as it is already plural. But the dictionary says "emojis" is OK. However, "kanjis" is not.

https://ecma-international.org/wp-content/uploads/ECMA-48_2nd_edition_august_1979.pdf

@timbray Thanks for the reminder and pointer to the exact RFC (which I figured probably had already been written, but hadn't dug up yet). I just filed an issue for #MoQ to make sure we don't cause anyone undue pain with character encoding issues.
@timbray Wait, I just realized that you co-authored and just now published this RFC! Thank you!! This is very important work!
@timbray Thank you to you and @paulehoffman for doing this work, as painful and annoying as I am sure it was. (Having also gone the IS route for an RFC myself!) I like your approach with 9839 of not locking people into a version of Unicode. Thanks for writing very readable text.

@timbray @GeekAndDad Congrats on getting the RFC published and avoiding the use of the dreaded “SHOULD” word.

I was involved with a few of the HTTP related RFCs back when I worked on Apple’s WebDAV file system and I know how much work goes into those documents.

@timbray Great work! You say the library is well tested; are there test vectors available somewhere? I would like to see support for identifying Bad Unicode as you specify added to https://www.gnu.org/software/libunistring/ and if I were to start working on that, I would like to see good test cases.
libunistring - GNU Project - Free Software Foundation (FSF)

@jas check out the unit tests in the repo, unichars_test.go - probably not in the right format for you, but the coverage is pretty exhaustive.

@timbray This is great stuff!

I'm happy to report that XScreenSaver's UTF-8 parser already seems to replace the problematic codes you've identified with 0xFFFD. At least according to my self-tests: https://github.com/Mathiasb17/xscreensaver/blob/82348d51320b89281407e1cb0e18ce0022189144/utils/utf8wc.c#L564

The one exception being \000 simply truncating the text, because, well, it's C.
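The replace-with-U+FFFD approach described above is easy to express in Go with `strings.Map`. A sketch of that idea (my own exclusion ranges following RFC 9839's spirit, not the actual XScreenSaver logic):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize replaces problematic code points (legacy controls other
// than \t, \n, \r, plus noncharacters) with U+FFFD REPLACEMENT
// CHARACTER. Invalid UTF-8 bytes already come out of the rune
// iteration as U+FFFD, so they are handled for free.
func sanitize(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '\t' || r == '\n' || r == '\r':
			return r
		case r < 0x20 || (r >= 0x7F && r <= 0x9F):
			return '\uFFFD' // legacy C0/C1 controls
		case r >= 0xFDD0 && r <= 0xFDEF, r&0xFFFE == 0xFFFE:
			return '\uFFFD' // noncharacters
		}
		return r
	}, s)
}

func main() {
	fmt.Printf("%q\n", sanitize("ok\x07text")) // → "ok\ufffdtext"
}
```

Unlike the C case mentioned above, NUL needs no special treatment here: Go strings are length-delimited, so `\000` just maps to U+FFFD like any other control.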

@jwz U+FFFD REPLACEMENT CHARACTER �
@timbray @jwz I � unicode
@ghewgill @timbray @jwz ‽ U+203D or go home. #interrobang
@dole @ghewgill @timbray
\u200D ZWJ is, of course, just a JWZ from the Terran Empire.
@timbray kinda surprised not to see a mention of the byte order mark, is that simply no longer a problem these days?
@LapTop006 Yeah, doesn't seem to cause problems any more, not sure why. BTW a reason why U+FFFE is a noncharacter.