@vathpela @tek Given how much worse the alternatives are, and how impossible it would have been to get people to move off of encodings, I'm glad UTF-8 exists.
Don't get me wrong, I'm quite aware of the issues with UTF-8, but I (choose to) believe that if it wasn't for UTF-8 we'd still be drowning in ASCII, and it would be impossible to tell the English-only speaking minority that supporting letters other than what was used to write inscriptions in ancient Rome might actually be useful.
@vathpela IMHO, redundancy and/or checksums should be implemented at a different layer, not in the text encoding itself
Like, there are many, many ways to keep bits from corrupting, each applicable in different cases
And baking one particular scheme into the text encoding itself is... meh
Same for compression btw. For some texts (CJK in particular) UTF-8 is sub-optimal, but even basic deflate makes it compact enough
TL;DR: UTF-8 is not perfect, but having one encoding for every text outweighs the drawbacks
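A quick sketch of the deflate point above, using Python's zlib (the sample string and repetition factor are illustrative assumptions, not from the thread):

```python
import zlib

# Illustrative Japanese sample: each character costs 3 bytes in UTF-8
text = "吾輩は猫である。名前はまだ無い。" * 100

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # 2 bytes per BMP code point
deflated = zlib.compress(utf8)

print(len(utf8), len(utf16), len(deflated))
# UTF-8 is 1.5x the size of UTF-16 here, but deflate shrinks it far below both
```

Repetitive text compresses unusually well, so the exact ratio isn't representative; the direction of the comparison is the point.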
@AVincentInSpace @vathpela Unicode currently only reserves code points 0..0x10FFFF (https://en.wikipedia.org/wiki/Unicode_block), so all existing CPs fit in 21 bits.
But @djl was saying we don't have to waste 32 bits to encode the most common code points since UTF-8 came along. This chat here only uses 8 bits per CP/character.
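For anyone who wants to poke at the byte counts, they're easy to check from Python (the sample characters are purely illustrative):

```python
# UTF-8 spends 1 byte on ASCII and up to 4 bytes on the highest code points
for ch in ["A", "é", "中", "😀"]:
    print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")

# Every assigned code point fits under the 0x10FFFF ceiling (21 bits)
assert ord("😀") <= 0x10FFFF
```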
@tek @AVincentInSpace @vathpela
Well, I wasn't actually saying that, but at least in the Anglophone universe, since UTF-8 is just ASCII with an escape bit, it does that. Nicely.
My irritation, and rant, is that if I want to actually do some programming with strings (that use both Japanese and English), I really really really don't want to muck with a variable-width encoding.
But, as I understand it, even UTF-32 is a variable-width encoding _if I program for it correctly_.
Sigh.
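To illustrate the variable-width annoyance above (a minimal Python sketch; the sample string is my own):

```python
s = "abc日本語"

# Python indexes strings by code point, so this works regardless of encoding
assert s[3] == "日"

# But the UTF-8 bytes underneath don't line up with code point indices:
b = s.encode("utf-8")
print(len(s), len(b))  # 6 code points, 12 bytes
# b[3] is only the *first* byte of "日"; slicing bytes at code point
# positions lands mid-character, which is exactly the bookkeeping pain
```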
@djl @AVincentInSpace @vathpela Ah, got it.
I don't think that's the case for UTF-32, though. By definition, I think every code point is encoded as exactly 4 bytes.
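That fixed-width property is easy to demonstrate in Python (using the little-endian variant so no BOM is prepended):

```python
# In UTF-32 every code point is exactly 4 bytes, from ASCII to emoji
for ch in ["A", "中", "😀"]:
    assert len(ch.encode("utf-32-le")) == 4
# So the byte offset of code point i is simply 4 * i: no escapes, no surrogates
```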
@tek @AVincentInSpace @vathpela
Another issue here is that one of the worst sins in programming is _premature optimization_.
As someone whose serious programming experience was all before 1990, my intuitions are way off for modern processors.
I'm dealing with 500 MB of Japanese text on disk, and reading them into Python and searching them is zippy quick on a PC.
So, IMHO, the world would work fine if the folks at Unicode defined a 32-bit fixed-width encoding, and we just used that.
@AVincentInSpace @vathpela @tek
My understanding is that UTF-32 uses escape characters to get to rarely used characters, and thus is a variable-width encoding.
@tek @djl @vathpela Okay, but that's still multiple codepoints. The ASCII sequence "ffi" can be displayed as a single glyph depending on your font. So what?
(Okay, technically that's a ligature, which is distinct from a glyph: a glyph is treated as a single character for the purpose of selecting text, and a ligature is not. But unless you are implementing a textbox where the user can select text from scratch, you do not need to care about the difference, and unless you are developing a UI toolkit you almost certainly do not need to care about anything besides codepoints.)
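Tangentially, Unicode also has the inverse of the "ffi" example: a single code point that is itself the ligature. A quick Python sketch:

```python
import unicodedata

lig = "\ufb03"  # "ﬃ", LATIN SMALL LIGATURE FFI: one code point
print(len(lig))                            # 1 code point
print(unicodedata.normalize("NFKC", lig))  # "ffi": three code points
```

Compatibility normalization (NFKC) folds the ligature code point back into the three ASCII letters.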
@tek Every now and then the Cambridge CST exam papers include a question like "explain why even experienced programmers sometimes have problems with character codes".
You could write pretty well anything you liked.
Originally what was expected was an essay about things like escape sequences on Flexowriter tapes; in my day it was about conversion between EBCDIC and ASCII; these days it might be about obscure characters in URLs.
@tek if they keep adding emoji we'll have to invent UTF-9!!!
(That's a joke)