Arthur @krtab Carcano's "Rust, Unicode gotchas, and the eight different string types" is on stage. He learned to stop worrying and trust UTF-8. #RustLang #RustInParis
Unicode is a way to encode letters. It's a descendant of many encodings, the most well known being ASCII. Is it a 15th standard, like in the infamous xkcd? No, in this case it actually won, Arthur says. #RustInParis
Unicode has interesting consequences: in some languages, case-change rules depend on word boundaries, and the locale can influence the casing. Segmentation into graphemes, words, and sentences can be complex. In Rust, the standard library only offers non-locale-aware operations. #RustInParis
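To make the casing point concrete, here is a minimal sketch using only `str::to_uppercase` and `str::to_lowercase` from the standard library. The helper names `upper` and `lower` are made up for this example, and the sample words are my own, not from the talk.

```rust
/// Uppercase using only the locale-independent Unicode mappings in std.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// Lowercase using only the locale-independent Unicode mappings in std.
fn lower(s: &str) -> String {
    s.to_lowercase()
}

fn main() {
    // One codepoint can become two: German sharp s (U+00DF) uppercases to "SS".
    assert_eq!(upper("stra\u{df}e"), "STRASSE");

    // Word-boundary rule: a capital sigma (U+03A3) at the end of a word
    // lowercases to final sigma (U+03C2), elsewhere to U+03C3.
    assert_eq!(
        lower("\u{39f}\u{394}\u{39f}\u{3a3}"),
        "\u{3bf}\u{3b4}\u{3bf}\u{3c2}"
    );
    assert_eq!(lower("\u{3a3}\u{39f}"), "\u{3c3}\u{3bf}");

    // No locale: a Turkish reader would expect dotted capital I (U+0130),
    // but std always gives plain "I".
    assert_eq!(upper("i"), "I");
}
```

Locale-specific casing such as Turkish would need an external library; std has no locale parameter for these methods.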
Normalization is a way to determine equivalences between codepoint combinations. When comparing, Rust only compares the raw codepoints. Ditto for ordering: Rust only orders by numerical value. #RustInParis
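A small sketch of what raw comparison means in practice, using only std. The helpers `raw_eq` and `raw_sort` are hypothetical names for this example, and the sample words are mine.

```rust
/// Equality as std sees it: same bytes, hence same codepoint sequence.
fn raw_eq(a: &str, b: &str) -> bool {
    a == b
}

/// Sort with the default `Ord` for `&str`, which compares UTF-8 bytes
/// (the same order as comparing codepoint values).
fn raw_sort(mut words: Vec<&str>) -> Vec<&str> {
    words.sort();
    words
}

fn main() {
    let precomposed = "\u{e9}"; // e with acute accent as a single codepoint
    let decomposed = "e\u{301}"; // e followed by a combining acute accent
    // Both render the same, but they are different codepoint sequences.
    assert!(!raw_eq(precomposed, decomposed));
    assert_eq!(precomposed.chars().count(), 1);
    assert_eq!(decomposed.chars().count(), 2);

    // Uppercase ASCII sorts before lowercase, and U+00E9 sorts after "z".
    let sorted = raw_sort(vec!["zebra", "\u{e9}clair", "apple", "Zulu"]);
    assert_eq!(sorted, ["Zulu", "apple", "zebra", "\u{e9}clair"]);
}
```

To treat the two spellings as equal, both sides would have to be normalized first, which std does not do.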
The UCS is split into 17 planes, each with room for 2^16 codepoints. The Basic Multilingual Plane (BMP) is the most used. Knowing this helps in understanding how the encodings work. UTF-32 is a wasteful but simple encoding, where each codepoint takes 32 bits. #RustInParis
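The plane arithmetic can be sketched in a few lines. The `plane` helper is a made-up name for this example; everything else is std.

```rust
/// Plane number of a codepoint: the bits above the low 16.
fn plane(c: char) -> u32 {
    (c as u32) >> 16
}

fn main() {
    assert_eq!(plane('A'), 0); // BMP
    assert_eq!(plane('\u{20ac}'), 0); // euro sign, still in the BMP
    assert_eq!(plane('\u{1f980}'), 1); // crab emoji, plane 1
    assert_eq!(plane(char::MAX), 16); // U+10FFFF, in the last of 17 planes

    // A Rust `char` is a 32-bit value, so a `[char]` is effectively UTF-32.
    assert_eq!(std::mem::size_of::<char>(), 4);

    // UTF-8, by contrast, takes 1 to 4 bytes depending on the codepoint.
    assert_eq!('A'.len_utf8(), 1);
    assert_eq!('\u{e9}'.len_utf8(), 2);
    assert_eq!('\u{20ac}'.len_utf8(), 3);
    assert_eq!('\u{1f980}'.len_utf8(), 4);
}
```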
UTF-16 is the "original sin", Arthur says. It relies on a hack in the BMP: reserved codepoints (surrogates) are used in pairs of 16-bit code units to encode codepoints from the other planes. #RustInParis
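Here is a rough sketch of the surrogate mechanism, computed by hand and checked against std's `encode_utf16`. The function name `surrogate_pair` is an assumption of this example, not something from the talk.

```rust
/// Encode a codepoint above U+FFFF as a UTF-16 surrogate pair by hand.
/// Returns None for BMP codepoints, which fit in a single code unit.
fn surrogate_pair(c: char) -> Option<(u16, u16)> {
    let cp = c as u32;
    if cp < 0x1_0000 {
        return None;
    }
    let v = cp - 0x1_0000; // at most 20 bits remain
    let high = 0xD800 + (v >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    Some((high, low))
}

fn main() {
    let crab = '\u{1f980}';
    assert_eq!(surrogate_pair(crab), Some((0xD83E, 0xDD80)));
    assert_eq!(surrogate_pair('A'), None);

    // std agrees with the manual computation.
    let units: Vec<u16> = "\u{1f980}".encode_utf16().collect();
    assert_eq!(units, [0xD83E, 0xDD80]);

    // A lone surrogate is not valid UTF-16, so decoding fails.
    assert!(String::from_utf16(&[0xD83E]).is_err());
}
```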
Rust has 8 string types, Arthur says (or even more), in four categories, each category having an owned and a borrowed version. For example, String and str for UTF-8 strings. Others include CString, OsString, and PathBuf. #RustInParis
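For reference, a short sketch of the four owned/borrowed pairs as they appear in std (`String`/`str`, `CString`/`CStr`, `OsString`/`OsStr`, `PathBuf`/`Path`). The sample values are mine.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};
use std::path::{Path, PathBuf};

fn main() {
    // UTF-8: String (owned) and str (borrowed).
    let owned: String = String::from("h\u{e9}llo");
    let borrowed: &str = &owned;
    assert_eq!(borrowed.len(), 6); // length in bytes, not characters

    // C strings: NUL-terminated, no interior NUL allowed.
    let c_owned: CString = CString::new("hello").expect("no interior NUL");
    let c_borrowed: &CStr = c_owned.as_c_str();
    assert_eq!(c_borrowed.to_bytes_with_nul().len(), 6);
    assert!(CString::new("a\0b").is_err());

    // OS strings: whatever the platform uses, not necessarily valid Unicode.
    let os_owned: OsString = OsString::from("file.txt");
    let os_borrowed: &OsStr = os_owned.as_os_str();
    assert_eq!(os_borrowed.to_str(), Some("file.txt"));

    // Paths: a wrapper over OS strings with path-aware methods.
    let p_owned: PathBuf = PathBuf::from("dir").join("file.txt");
    let p_borrowed: &Path = p_owned.as_path();
    assert_eq!(p_borrowed.file_name(), Some(os_borrowed));
}
```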
WTF-8 was specified by Simon @simon, a previous speaker at #RustInParis. It is a superset of UTF-8 that can also represent unpaired surrogates, and it's used for conversions to and from UTF-16.
Unicode can be fun in some cases, but it can also create issues or break software, for example when an RTL codepoint is inserted in a user-controlled field. #RustInParis
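One possible defensive measure, sketched under my own assumptions (this is not something shown in the talk): detect or strip bidirectional control characters from user-supplied text. The names `is_bidi_control` and `strip_bidi_controls` are hypothetical, and the file name is an invented example.

```rust
/// Returns true if `c` is one of the Unicode bidirectional control
/// characters (marks, embeddings, overrides, isolates).
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}'                     // ARABIC LETTER MARK
            | '\u{200E}' | '\u{200F}'  // LRM, RLM
            | '\u{202A}'..='\u{202E}'  // LRE, RLE, PDF, LRO, RLO
            | '\u{2066}'..='\u{2069}'  // LRI, RLI, FSI, PDI
    )
}

/// Hypothetical sanitizer: drop bidi controls from user-supplied text.
fn strip_bidi_controls(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_control(*c)).collect()
}

fn main() {
    // U+202E (RIGHT-TO-LEFT OVERRIDE) hidden inside a file name.
    let name = "report\u{202e}fdp.exe";
    assert!(name.chars().any(is_bidi_control));
    assert_eq!(strip_bidi_controls(name), "reportfdp.exe");
    assert!(!"plain.txt".chars().any(is_bidi_control));
}
```

Stripping these characters everywhere can harm legitimate right-to-left text, so whether to reject, escape, or keep them depends on the field.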