Mastodawn

Arthur @krtab Carcano's "Rust, Unicode gotchas, and the eight different string types" is on stage. He learned to stop worrying and trust UTF-8.
#RustLang #RustInParis

Anisse 6d ago

Unicode is a way to encode letters. It descends from many earlier encodings, the best known being ASCII. Is it a 15th standard, like in the infamous xkcd? No, in this case it actually won, Arthur says.
#RustInParis

Anisse 6d ago

Unicode is a catalog of symbols/characters, the UCS (Universal Coded Character Set), plus a set of encodings for them (UTF-8, UTF-16 or UTF-32).
#RustInParis

Anisse 6d ago

Unicode has interesting consequences: in some languages, case-change rules depend on word boundaries, and the locale can influence casing. Segmentation into graphemes, words and sentences can be complex. In Rust, the standard library only does non-locale-aware segmentation.
#RustInParis
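
To make the casing point concrete, here is a minimal sketch using only the standard library. The helper names `lower` and `upper` are made up for illustration. It shows one context-dependent rule that std does apply (the Greek final sigma) and one locale-dependent rule it cannot apply (Turkish dotted i), since `to_uppercase` takes no locale.

```rust
// Case mapping with the standard library only: no locale is involved.

/// Hypothetical helper: lowercases with `str::to_lowercase`.
fn lower(s: &str) -> String {
    s.to_lowercase()
}

/// Hypothetical helper: uppercases with `str::to_uppercase`.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    // Greek capital sigma (U+03A3) lowercases to final sigma (U+03C2) at the
    // end of a word, and to regular sigma (U+03C3) elsewhere.
    let odos_upper = "\u{039F}\u{0394}\u{039F}\u{03A3}";
    println!("{} -> {}", odos_upper, lower(odos_upper));

    // One character can map to several: German sharp s uppercases to "SS".
    println!("{} -> {}", "\u{00DF}", upper("\u{00DF}"));

    // A Turkish reader would expect a dotted capital I (U+0130) here,
    // but std always applies the locale-independent mapping.
    println!("i -> {}", upper("i"));
}
```

Locale-aware casing and grapheme, word or sentence segmentation would need an external crate; which one to pick is outside what the talk notes cover.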

Anisse 6d ago

Normalization is a way to determine equivalences between codepoint combinations. When comparing, Rust only compares the raw codepoints. Ditto for ordering: Rust only orders by numerical value.
#RustInParis
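
A small sketch of what raw comparison means in practice, using only std. The helpers `same_text` and `sorted` are hypothetical names. Two strings that render identically compare as different when one is precomposed and the other uses a combining accent, and sorting follows numerical values rather than any alphabet.

```rust
/// Hypothetical helper: plain `==` on `&str`, with no normalization step.
fn same_text(a: &str, b: &str) -> bool {
    a == b
}

/// Hypothetical helper: sorts with the default `Ord` for `&str`.
fn sorted<'a>(mut words: Vec<&'a str>) -> Vec<&'a str> {
    words.sort();
    words
}

fn main() {
    let precomposed = "\u{00E9}"; // e with acute accent as a single codepoint
    let decomposed = "e\u{0301}"; // "e" followed by a combining acute accent

    // Both display the same, but they are different codepoint sequences.
    println!("equal? {}", same_text(precomposed, decomposed));
    println!(
        "codepoints: {} vs {}",
        precomposed.chars().count(),
        decomposed.chars().count()
    );

    // Uppercase ASCII sorts before lowercase ASCII, and accented letters
    // sort after both, because only numerical values are compared.
    println!("{:?}", sorted(vec!["\u{00E9}t\u{00E9}", "zone", "Zoo"]));
}
```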

Anisse 6d ago

The UCS is split into 17 planes, each with room for 2^16 codepoints. The Basic Multilingual Plane (BMP) is the most used. Knowing this helps in understanding how the encodings work. UTF-32 is a wasteful but simple encoding, where each codepoint takes 32 bits.
#RustInParis
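
As a rough illustration of planes and of the UTF-32 cost, the sketch below derives the plane from the bits above the low 16 and counts 4 bytes per codepoint. The function names `plane` and `utf32_size` are assumptions made for this example, not something shown in the talk.

```rust
/// Hypothetical helper: the plane is the codepoint value shifted right by 16,
/// since each plane holds 2^16 codepoints.
fn plane(c: char) -> u32 {
    (c as u32) >> 16
}

/// Hypothetical helper: size in bytes if the text were stored as UTF-32,
/// where every codepoint takes 32 bits.
fn utf32_size(s: &str) -> usize {
    s.chars().count() * 4
}

fn main() {
    let smiley = '\u{1F600}'; // an emoji outside the BMP
    println!("U+{:04X} is in plane {}", smiley as u32, plane(smiley));
    println!("'A' is in plane {}", plane('A'));
    println!("last codepoint is in plane {}", plane(char::MAX));

    let text = "a\u{1F600}";
    println!("UTF-8: {} bytes, UTF-32: {} bytes", text.len(), utf32_size(text));
}
```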

Anisse 6d ago

UTF-16 is the "original sin", Arthur says. It has a hack in the BMP where specific codepoints (the surrogates) are reserved to encode codepoints from other planes as a pair of 16-bit code units.
#RustInParis
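
The surrogate mechanism can be sketched by hand and checked against the standard library. `to_surrogates` is a hypothetical helper written for this example; `str::encode_utf16` is the real std API it is compared with.

```rust
/// Hypothetical helper: splits a codepoint outside the BMP into a
/// (high, low) surrogate pair. Returns `None` for BMP codepoints,
/// which fit in a single 16-bit code unit.
fn to_surrogates(c: char) -> Option<(u16, u16)> {
    let cp = c as u32;
    if cp < 0x10000 {
        return None;
    }
    let v = cp - 0x10000; // 20 bits remain after removing the BMP offset
    let high = 0xD800 + (v >> 10) as u16; // top 10 bits
    let low = 0xDC00 + (v & 0x3FF) as u16; // bottom 10 bits
    Some((high, low))
}

fn main() {
    let smiley = '\u{1F600}';
    let (high, low) = to_surrogates(smiley).expect("outside the BMP");
    println!("U+{:04X} -> {:04X} {:04X}", smiley as u32, high, low);

    // Cross-check with the standard library encoder.
    let units: Vec<u16> = "\u{1F600}".encode_utf16().collect();
    assert_eq!(units, vec![high, low]);
}
```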

Anisse 6d ago

UTF-8 is much better designed, and just uses bit prefixes.
#RustInParis
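
The bit prefixes can be read directly off the first byte of a sequence. The sketch below is a simplification: `utf8_len_from_lead` is a hypothetical helper that only looks at the prefix and does not reject the few lead bytes that are invalid for other reasons (overlong forms or values past U+10FFFF).

```rust
/// Hypothetical helper: sequence length implied by a lead byte's prefix.
///   0xxxxxxx -> 1 byte     110xxxxx -> 2 bytes
///   1110xxxx -> 3 bytes    11110xxx -> 4 bytes
/// A 10xxxxxx byte is a continuation byte, so it yields `None`.
fn utf8_len_from_lead(byte: u8) -> Option<usize> {
    match byte.leading_ones() {
        0 => Some(1),
        2 => Some(2),
        3 => Some(3),
        4 => Some(4),
        _ => None,
    }
}

fn main() {
    for s in ["a", "\u{00E9}", "\u{20AC}", "\u{1F600}"] {
        let bytes = s.as_bytes();
        // The length announced by the first byte matches the real length.
        assert_eq!(utf8_len_from_lead(bytes[0]), Some(bytes.len()));
        println!("{:?}: lead byte {:08b}, {} byte(s)", s, bytes[0], bytes.len());
    }
}
```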

Anisse 6d ago

Rust has 8 string types, Arthur says (or even more), in four categories, each category having an owned and a non-owned (borrowed) version. For example String and str for UTF-8 strings. Others include CString, OsString, Path.
#RustInParis
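
For reference, here is one way to lay out the four owned/borrowed pairs from the standard library side by side. The helpers `c_len` and `file_ext` are hypothetical and only exist to show a typical conversion for each family.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: byte length as a C string (including the trailing
/// NUL), or `None` if the input contains an interior NUL byte.
fn c_len(s: &str) -> Option<usize> {
    CString::new(s).ok().map(|c| c.as_bytes_with_nul().len())
}

/// Hypothetical helper: file extension as a `String`, if it is valid UTF-8.
fn file_ext(path: &str) -> Option<String> {
    Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(String::from)
}

fn main() {
    // UTF-8 text: String (owned) and str (borrowed).
    let owned: String = String::from("h\u{00E9}llo");
    let borrowed: &str = &owned;

    // NUL-terminated C strings: CString (owned) and CStr (borrowed).
    let c_owned: CString = CString::new("hello").expect("no interior NUL");
    let c_borrowed: &CStr = c_owned.as_c_str();

    // Platform-native strings: OsString (owned) and OsStr (borrowed).
    let os_owned: OsString = OsString::from("file.txt");
    let os_borrowed: &OsStr = os_owned.as_os_str();

    // Filesystem paths: PathBuf (owned) and Path (borrowed).
    let path_owned: PathBuf = PathBuf::from("/tmp/file.txt");
    let path_borrowed: &Path = path_owned.as_path();

    println!("{borrowed} {c_borrowed:?} {os_borrowed:?} {path_borrowed:?}");
    println!("{:?} {:?}", c_len("hello"), file_ext("/tmp/file.txt"));
}
```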

Anisse

WTF-8 was specified by Simon @simon, a previous speaker at #RustInParis. It can represent unpaired surrogates, and is a superset of UTF-8 used for UTF-16 conversions.
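
The problem WTF-8 addresses can be seen from the std side: potentially ill-formed UTF-16 with a lone surrogate has no valid UTF-8 form, so a strict conversion to `String` fails and a lossy one substitutes U+FFFD. This sketch does not implement WTF-8 itself; `decode_strict` and `decode_lossy` are hypothetical wrappers around real std functions.

```rust
/// Hypothetical wrapper: strict UTF-16 decoding, `None` on unpaired surrogates.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical wrapper: lossy decoding, replacing unpaired surrogates
/// with U+FFFD REPLACEMENT CHARACTER.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

fn main() {
    let well_formed = [0x0068, 0x0069]; // "hi"
    let lone_high = [0x0068, 0xD83D]; // "h" then a high surrogate with no partner

    println!("{:?}", decode_strict(&well_formed));
    println!("{:?}", decode_strict(&lone_high));
    println!("{:?}", decode_lossy(&lone_high));

    // A surrogate is not a valid `char` either, so it cannot live in a `String`.
    assert!(char::from_u32(0xD83D).is_none());
}
```

The lossy path throws information away, which is the gap a superset encoding such as WTF-8 is meant to close.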

Anisse 6d ago

Unicode can be fun in some cases, but can also create issues or break software, for example when an RTL codepoint is inserted in a user-controlled field. #RustInParis
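
One possible mitigation, sketched under the assumption that the field should never contain explicit direction controls: detect or strip the bidirectional control characters before displaying the value. `is_bidi_control` and `strip_bidi_controls` are hypothetical helpers, and blindly stripping may be wrong for fields that legitimately hold mixed-direction text.

```rust
/// Hypothetical helper: true for characters with the Unicode
/// Bidi_Control property (marks, embeddings, overrides, isolates).
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}'
            | '\u{200E}'
            | '\u{200F}'
            | '\u{202A}'..='\u{202E}'
            | '\u{2066}'..='\u{2069}'
    )
}

/// Hypothetical helper: removes every bidi control character from the input.
fn strip_bidi_controls(input: &str) -> String {
    input.chars().filter(|c| !is_bidi_control(*c)).collect()
}

fn main() {
    // U+202E (right-to-left override) makes the tail of this file name
    // render reversed, so it can look like it ends in "exe.pdf".
    let name = "invoice\u{202E}fdp.exe";
    println!("has bidi control: {}", name.chars().any(is_bidi_control));
    println!("sanitized: {}", strip_bidi_controls(name));
}
```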