@vathpela @tek Given how much worse the alternatives are, and how impossible it would have been to get people to move off of encodings, I'm glad UTF-8 exists.
Don't get me wrong, I'm quite aware of the issues with UTF-8, but I (choose to) believe that if it wasn't for UTF-8 we'd still be drowning in ASCII, and it would be impossible to tell the English-only speaking minority that supporting letters other than what was used to write inscriptions in ancient Rome might actually be useful.
@vathpela IMHO, redundancy and/or checksums should be implemented at a different layer, not in the text encoding itself
Like, there are many, many ways to keep bits from corrupting, each applicable in different cases
And baking one particular scheme into the text encoding itself is... meh
Same for compression btw. For some texts (CJK in particular) UTF-8 is sub-optimal, but even basic deflate makes it compact enough
TL;DR: UTF-8 is not perfect, but having one encoding for every text outweighs the drawbacks
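A quick sketch of the deflate point above, using Python's zlib (the sample string and repetition factor are illustrative assumptions, not from the thread):

```python
import zlib

# Illustrative Japanese sample: each character costs 3 bytes in UTF-8
text = "吾輩は猫である。名前はまだ無い。" * 100

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # 2 bytes per BMP code point
deflated = zlib.compress(utf8)

print(len(utf8), len(utf16), len(deflated))
# UTF-8 is 1.5x the size of UTF-16 here, but deflate shrinks it far below both
```

Repetitive text compresses unusually well, so the exact ratio isn't representative; the direction of the comparison is the point.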
@AVincentInSpace @vathpela Unicode currently only reserves code points 0..0x10FFFF (https://en.wikipedia.org/wiki/Unicode_block), so all existing CPs fit in 21 bits.
But @djl was saying we don't have to waste 32 bits to encode the most common code points since UTF-8 came along. This chat here only uses 8 bits per CP/character.
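For anyone who wants to poke at the byte counts, they're easy to check from Python (the sample characters are purely illustrative):

```python
# UTF-8 spends 1 byte on ASCII and up to 4 bytes on the highest code points
for ch in ["A", "é", "中", "😀"]:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")

# Every assigned code point fits under the 0x10FFFF ceiling (21 bits)
assert ord("😀") <= 0x10FFFF
```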
@tek @AVincentInSpace @vathpela
Well, I wasn't actually saying that, but at least in the Anglophone universe, since UTF-8 is just ASCII with an escape bit, it does that. Nicely.
My irritation, and rant, is that if I want to actually do some programming with strings (that use both Japanese and English), I really really really don't want to muck with a variable-width encoding.
But, as I understand it, even UTF-32 is a variable-width encoding _if I program for it correctly_.
Sigh.
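To illustrate the variable-width annoyance above (a minimal Python sketch; the sample string is my own):

```python
s = "abc日本語"

# Python indexes strings by code point, so this works regardless of encoding
assert s[3] == "日"

# But the UTF-8 bytes underneath don't line up with code point indices:
b = s.encode("utf-8")
print(len(s), len(b))  # 6 code points, 12 bytes
# b[3] is only the *first* byte of "日"; slicing bytes at code point
# positions lands mid-character, which is exactly the bookkeeping pain
```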
@djl @AVincentInSpace @vathpela Ah, got it.
I don't think that's the case for UTF-32, though. By definition, I think every code point is encoded as exactly 4 bytes.
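That fixed-width property is easy to demonstrate in Python (using the little-endian variant so no BOM is prepended):

```python
# In UTF-32 every code point is exactly 4 bytes, from ASCII to emoji
for ch in ["A", "中", "😀"]:
    assert len(ch.encode("utf-32-le")) == 4
# So the byte offset of code point i is simply 4 * i: no escapes, no surrogates
```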
@tek @AVincentInSpace @vathpela
Another issue here is that one of the worst sins in programming is _premature optimization_.
As someone whose serious programming experience was all before 1990, my intuitions are way off for modern processors.
I'm dealing with 500 MB of Japanese text on disk, and reading them into Python and searching them is zippy quick on a PC.
So, IMHO, the world would work fine if the folks at Unicode defined a 32-bit fixed-width encoding, and we just used that.
@AVincentInSpace @vathpela @tek
My understanding is that UTF-32 uses escape characters to get to rarely used characters, and thus is a variable-width encoding.
@tek @djl @vathpela Okay, but that's still multiple codepoints. The ASCII sequence "ffi" can be displayed as a single glyph depending on your font. So what?
(Okay, technically that's a ligature, which is distinct from a glyph: a glyph is treated as a single character for the purpose of selecting text, and a ligature is not. But unless you are implementing a textbox where the user can select text from scratch, you do not need to care about the difference, and unless you are developing a UI toolkit you almost certainly do not need to care about anything besides codepoints.)
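Tangentially, Unicode also has the inverse of the "ffi" example: a single code point that is itself the ligature. A quick Python sketch:

```python
import unicodedata

lig = "\ufb03"  # "ﬃ", LATIN SMALL LIGATURE FFI: one code point
print(len(lig))                            # 1 code point
print(unicodedata.normalize("NFKC", lig))  # "ffi": three code points
```

Compatibility normalization (NFKC) folds the ligature code point back into the three ASCII letters.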
@tek Every now and then the Cambridge CST exam papers include a question like "explain why even experienced programmers sometimes have problems with character codes".
You could write pretty well anything you liked.
Originally what was expected was an essay about things like escape sequences on Flexowriter tapes; in my day it was about conversion between EBCDIC and ASCII; these days it might be about obscure characters in URLs.
@tek if they keep adding emoji we'll have to invent UTF-9!!!
(That's a joke)