Whoa. UTF-8 is older now than ASCII was when UTF-8 was invented.
@tek and it still sucks

@vathpela @tek

Nah. It stopped sucking when Unicode became variable-width even in a 32-bit encoding. Or at least it's no longer valid to correctly point out that it sucks, since there's now nothing that doesn't.

@djl
wait what? I thought Unicode itself allocated 31 bits for codepoints??

@vathpela @tek

@AVincentInSpace @vathpela Unicode currently only reserves code points 0..0x10FFFF (https://en.wikipedia.org/wiki/Unicode_block), so all existing CPs fit in 21 bits.

But @djl was saying we don't have to waste 32 bits to encode the most common code points since UTF-8 came along. This chat here only uses 8 bits per CP/character.
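
Quick Python sketch of those widths (the characters are picked just for illustration):

    for ch in "A", "\u00e9", "\u3042", "\U0001F600":   # A, é, あ, 😀
        print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} UTF-8 byte(s)")
    # U+0041 -> 1, U+00E9 -> 2, U+3042 -> 3, U+1F600 -> 4
    # 4 UTF-8 bytes are enough for all 21 bits of code point space.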

@tek @AVincentInSpace @vathpela

Well, I wasn't actually saying that, but at least in the Anglophone universe, since UTF-8 is just ASCII with an escape bit, it does that. Nicely.
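
Concretely (a throwaway Python sketch): pure ASCII round-trips byte-for-byte through UTF-8:

    ascii_bytes = "Hello, world".encode("ascii")
    utf8_bytes = "Hello, world".encode("utf-8")
    assert ascii_bytes == utf8_bytes   # identical bytes on the wire
    # Bytes >= 0x80 start or continue a multi-byte sequence,
    # so plain ASCII text and ASCII tooling are untouched.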

My irritation, and rant, is that if I want to actually do some programming with strings (that use both Japanese and English), I really really really don't want to muck with a variable-width encoding.

But, as I understand it, even UTF-32 is a variable-width encoding _if I program for it correctly_.
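
For instance (a quick Python sketch): a combining accent makes one on-screen character out of two code points, regardless of how wide the encoding is:

    s = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT, renders as one 'é'
    print(len(s))   # 2 code points, even if each is stored in 32 bits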

Sigh.

@djl @AVincentInSpace @vathpela Ah, got it.

I don't think that's the case for UTF-32, though. By definition, I think every code point is encoded as exactly 4 bytes.
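
A quick Python check of that claim, using the BOM-free utf-32-le codec so each code point shows up on its own:

    for ch in "A", "\u3042", "\U0001F600":        # ASCII, BMP, astral
        assert len(ch.encode("utf-32-le")) == 4   # fixed width, always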

@tek @AVincentInSpace @vathpela

Another issue here is that one of the worst sins in programming is _premature optimization_.

As someone whose serious programming experience was all before 1990, my intuitions are way off for modern processors.

I'm dealing with 500 MB of Japanese text on disk, and reading it into Python and searching it is zippy quick on a PC.

So, IMHO, the world would work fine if the folks at Unicode defined a 32-bit fixed-width encoding, and we just used that.

@djl @tek @vathpela There is one. It's called UCS-4 and Python uses it internally.

EDIT: Actually, that's not true. Python 3.3 and later, following the adoption of PEP 393, decides on a per-string-object basis whether to use Latin-1, UCS-2, or UCS-4, for a maximally compact fixed-width representation.
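
A rough Python sketch of what PEP 393 buys (exact sizes vary by build; the per-character ratios are the point):

    import sys
    latin  = "a" * 1000           # widest char fits Latin-1: 1 byte/char
    bmp    = "\u3042" * 1000      # widest char fits UCS-2:   2 bytes/char
    astral = "\U0001F600" * 1000  # needs UCS-4:              4 bytes/char
    for s in (latin, bmp, astral):
        print(sys.getsizeof(s))   # roughly 1 KB, 2 KB, 4 KB plus overhead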

@djl @AVincentInSpace @vathpela That would make all text storage 4x larger than ASCII, even for Latin text, making cache-bound operations on strings much, much slower. It also means there's zero compatibility between "old" and "new" text (instead of being able to manipulate ASCII with UTF-8 functions). Also, big endian or little endian?
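
A throwaway Python illustration of both costs (the BOM bytes shown assume a little-endian machine):

    text = "Hello"
    print(len(text.encode("utf-8")))      # 5 bytes
    print(len(text.encode("utf-32-le")))  # 20 bytes: the 4x blowup
    # And endianness is a real question: the plain 'utf-32' codec
    # prepends a byte-order mark so readers can tell LE from BE.
    print(text.encode("utf-32")[:4])      # b'\xff\xfe\x00\x00' on LE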

@AVincentInSpace @vathpela @tek

My understanding is that UTF-32 uses escape characters to get to rarely used characters, and thus is a variable-width encoding.

@djl @AVincentInSpace @vathpela Not quite. One UTF-32 code point == one 32-bit int. However, you can combine multiple code points to make a single glyph. Lots of emoji work that way, for example. But each code point is still exactly 32 bits.
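
Sketch in Python (this particular emoji is just an example):

    # WOMAN (U+1F469) + ZWJ (U+200D) + PERSONAL COMPUTER (U+1F4BB)
    glyph = "\U0001F469\u200D\U0001F4BB"   # renders as one emoji: 👩‍💻
    print(len(glyph))                      # 3 code points
    print(len(glyph.encode("utf-32-le")))  # 12 bytes: 3 x 4, each 32 bits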

@tek @djl @vathpela Okay, but that's still multiple codepoints. The ASCII sequence "ffi" can be displayed as a single glyph depending on your font. So what?

(Okay, technically that's a ligature, which is distinct from a glyph in that a glyph is to be treated as a single character for the purpose of selecting text and a ligature is not. But unless you are implementing a textbox where the user can select text from scratch, you do not need to care about the difference, and unless you are developing a UI toolkit you almost certainly do not need to care about anything besides codepoints.)
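
You can even poke at that in Python: U+FB03 is a precomposed "ffi" ligature code point, and compatibility normalization (NFKC) expands it back to three ASCII letters.

    import unicodedata
    lig = "\uFB03"                              # ﬃ, one code point
    print(len(lig))                             # 1
    print(unicodedata.normalize("NFKC", lig))   # 'ffi', three code points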