Whoa. UTF-8 is older now than ASCII was when UTF-8 was invented.
@tek and it still sucks

@vathpela @tek

Nah. It stopped sucking when Unicode became variable-width even in a 32-bit encoding. Or at least it's no longer valid to single it out for sucking, since there's no longer anything that doesn't.

@djl
wait what? I thought Unicode itself allocated 31 bits for code points??

@vathpela @tek

@AVincentInSpace @vathpela Unicode currently only reserves code points 0..0x10FFFF (https://en.wikipedia.org/wiki/Unicode_block), so all existing CPs fit in 21 bits.

But @djl was saying we don't have to waste 32 bits to encode the most common code points since UTF-8 came along. This chat here only uses 8 bits per CP/character.
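A quick sketch in Python, just to illustrate the numbers above (the exact byte counts for the non-ASCII examples are mine, not from the thread): the highest reserved code point fits in 21 bits, and UTF-8 spends only as many bytes as a character needs.

```python
# Highest code point Unicode currently reserves:
max_cp = 0x10FFFF
print(max_cp.bit_length())        # 21 bits is enough for every code point

# UTF-8 is variable-width: common characters stay small.
print(len("A".encode("utf-8")))   # ASCII: 1 byte
print(len("é".encode("utf-8")))   # Latin supplement: 2 bytes
print(len("日".encode("utf-8")))  # CJK: 3 bytes
print(len("😀".encode("utf-8")))  # outside the BMP: 4 bytes
```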


@tek @AVincentInSpace @vathpela

Another issue here is that one of the worst sins in programming is _premature optimization_.

As someone whose serious programming experience was all before 1990, my intuitions are way off for modern processors.

I'm dealing with 500 MB of Japanese text on disk, and reading it into Python and searching it is zippy quick on a PC.

So, IMHO, the world would work fine if the folks at Unicode defined a 32-bit fixed-width encoding, and we just used that.

@djl @AVincentInSpace @vathpela That would make all text storage 4x larger than ASCII, even for Latin text, making cache-bound operations on strings much, much slower. It also means there’s zero compatibility between “old” and “new” text (instead of being able to manipulate ASCII with UTF-8 functions). Also, big endian or little endian?
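A rough Python illustration of these three objections (the sample string is mine): the 4x size blowup for ASCII text, UTF-8's byte-level ASCII compatibility, and the endianness choice UTF-32 forces.

```python
ascii_text = "hello, world"

utf8 = ascii_text.encode("utf-8")
utf32 = ascii_text.encode("utf-32-le")       # fixed 4 bytes per code point

print(len(utf8), len(utf32))                 # 12 vs 48: 4x larger for ASCII text

# UTF-8 bytes for ASCII text are the ASCII bytes, so old tools keep working:
print(utf8 == ascii_text.encode("ascii"))    # True

# UTF-32 also makes you pick a byte order (or prepend a BOM):
print(ascii_text.encode("utf-32-be")[:4])    # b'\x00\x00\x00h'
print(ascii_text.encode("utf-32-le")[:4])    # b'h\x00\x00\x00'
```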