Mastodawn

Tekniquelly correct Mar 7

Whoa. UTF-8 is older now than ASCII was when UTF-8 was invented.

Farce Majeure Mar 7

@tek and it still sucks

Tekniquelly correct Mar 7

@vathpela Awww, I like UTF-8! I can pretend it's ASCII most of the time.

Farce Majeure

@tek I have complaints about recoverability on a mildly corrupted bitstream, but it's much too late in the evening to articulate this well.

Farce Majeure Mar 7

@tek (don't get me wrong, I have to use UCS-2 often enough to know real pain...)

Elias Mårtenson Mar 7

@vathpela @tek Given how much worse the alternatives are, and how impossible it would have been to get people to move off of encodings, I'm glad UTF-8 exists.

Don't get me wrong, I'm quite aware of the issues with UTF-8, but I (choose to) believe that if it weren't for UTF-8 we'd still be drowning in ASCII, and it would be impossible to tell the English-only speaking minority that supporting letters other than those used to write inscriptions in ancient Rome might actually be useful.

Tekniquelly correct Mar 7

@loke @vathpela I agree. The whole "all ASCII strings are the same series of bits as in UTF-8" was a stroke of brilliance. None of that BOM idiocy, i.e. "we'll define everything but leave the endianness up to the implementer", either.
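
A minimal sketch of the compatibility point above, using only Python's built-in codecs. The sample string is an arbitrary illustration; UTF-16 is used here only as an example of an encoding whose byte order is left open.

```python
# Any pure-ASCII string encodes to the same bytes under ASCII and UTF-8.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# UTF-16 has two possible byte orders that produce different byte
# sequences; this is the ambiguity a byte order mark exists to resolve.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
orders_differ = utf16_le != utf16_be

print(same_bytes, orders_differ)
```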

Enno Rehling Mar 7

@tek @loke @vathpela there is a BOM defined for UTF-8, as pointless as that may seem, and it's screwing up that whole beautiful ASCII compatibility whenever someone uses it.

Elias Mårtenson Mar 8

@enno @tek @vathpela I'd go as far as saying it's actively harmful. There are exactly zero cases when it's useful, and it will actively mess things up in most cases.

But, of course, Windows applications tend to add them at times.
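
To make the BOM complaint concrete, here is a small sketch using Python's `utf-8-sig` codec, which writes and strips the UTF-8 BOM. The CSV-like payload is a made-up example, not something from the thread.

```python
import codecs

# Hypothetical payload standing in for a file written by a BOM-adding tool.
payload = "id,name\n1,\u00c5sa\n"

# "utf-8-sig" writes the three-byte UTF-8 BOM (EF BB BF) before the text.
with_bom = payload.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# A plain UTF-8 decode keeps the BOM as U+FEFF, so the first field
# is no longer exactly "id".
naive = with_bom.decode("utf-8")
first_field_naive = naive.split(",")[0]

# Decoding with "utf-8-sig" drops a leading BOM if one is present.
clean = with_bom.decode("utf-8-sig")
first_field_clean = clean.split(",")[0]

print(repr(first_field_naive), repr(first_field_clean))
```

The file also no longer starts with an ASCII byte, which is what breaks tools that expect plain ASCII at the top of a file.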

mxk Mar 7

@vathpela @tek I would argue that in modern times this really shouldn't be an issue to be concerned about. It's not like telnet and plain serial connections are still the most central communication protocols. And if your storage is causing bit flips you have other issues than readable plain text.

Magnus Ahltorp Mar 7

@mxk @vathpela @tek I don’t know any way to run telnet over a non-checksummed connection.

Glen T, heated, not stirred Mar 10

@ahltorp @mxk @vathpela @tek You could in theory run UTF-8 in syslog over non-checksummed UDP packets.

But in practice DNS operations folk apply the clue-by-four to kernel programmers who turn UDP checksums off, as that allows corrupted DNS answers, which are then cached.

Farce Majeure Mar 10

@glent @ahltorp @mxk @tek do y'all just not believe people still have to deal with actual UARTs, or what?

mxk Mar 10

@vathpela @glent @ahltorp @tek I do work with actual UARTs, but only for debugging purposes as a fallback when ssh fails.
That doesn't stop me from considering using UTF-8 a net benefit.

Farce Majeure Mar 12

@mxk @glent @ahltorp @tek I agree, but I also think it could and should have been improved.

Magnus Ahltorp Mar 10

@vathpela @glent @mxk But even if it’s raw UART with no layer in between, it’s no more of a problem than with ASCII or ISO 8859, if you don’t count the larger surface area of a wide character, which is sort of unavoidable.
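
A rough sketch of what a single flipped bit does to a UTF-8 stream, assuming a decoder that substitutes U+FFFD for bad sequences (Python's `errors="replace"`). The helper `flip_bit` and the sample text are hypothetical, chosen only for illustration.

```python
def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted in the byte at index."""
    out = bytearray(data)
    out[index] ^= 1 << bit
    return bytes(out)

# "naïve café" with the accented letters written as explicit code points.
original = "na\u00efve caf\u00e9".encode("utf-8")

# Hit an ASCII byte: 'v' (0x76) becomes 'w' (0x77), just as it would
# in a plain ASCII stream.
ascii_hit = flip_bit(original, 4, 0).decode("utf-8", errors="replace")

# Hit the lead byte of the two-byte character (0xC3 becomes 0xE3): it now
# announces a three-byte sequence that never completes, so the decoder
# substitutes U+FFFD and carries on with the following bytes.
lead_hit = flip_bit(original, 2, 5).decode("utf-8", errors="replace")

print(ascii_hit)
print(lead_hit)
```

In both cases the damage stays local to the character that was hit; the multi-byte character simply offers more bits to hit.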

Farce Majeure Mar 12

@ahltorp @glent @mxk we could have made the whole situation better, but we didn't.

Мя Mar 7

@vathpela IMHO, redundancy and/or checksums should be implemented on a different layer, not in the text encoding

Like, there are many, many ways to keep bits from corrupting, which are applicable in different cases
And forcing one particular scheme into the text encoding itself is...meh

Same for compression btw. For some texts (CJK in particular) UTF-8 is sub-optimal, but even basic deflate makes it compact enough

TL;DR: UTF-8 is not perfect, but having one encoding for every text outweighs that

@tek
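
The size trade-off for CJK text can be checked with a quick sketch like the one below. The sample sentence and the repeat count are arbitrary assumptions, and repeating one sentence flatters the compressor, so real text will compress less well than this.

```python
import zlib

# Hypothetical sample: a short Japanese sentence, repeated to give some bulk.
sample = "吾輩は猫である。名前はまだ無い。" * 50

# Each of these characters takes 3 bytes in UTF-8 but only 2 in UTF-16.
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")

# zlib's default settings produce a deflate stream with a small wrapper.
deflated = zlib.compress(utf8)

print(len(utf8), len(utf16), len(deflated))
```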

Tekniquelly correct Mar 7

@mo @vathpela Also, UTF-8 is trivially easy to synchronize. If you delete a byte out of the middle of a file, at most you’ll lose the one affected character (well, code point). The ones before and after it will be fine. That’s not true of some other Unicode encodings, like double width ones where everything after would be out of sync.
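
The resynchronization claim can be tried out with a short sketch. UTF-16 is assumed here as the example of a "double width" encoding, and the sample text and deleted byte position are arbitrary.

```python
# "naïve café" with the accented letters written as explicit code points.
text = "na\u00efve caf\u00e9"

# Delete the second byte of the two-byte sequence for U+00EF from the
# UTF-8 stream.
utf8 = bytearray(text.encode("utf-8"))
del utf8[3]
recovered_utf8 = bytes(utf8).decode("utf-8", errors="replace")

# Delete a single byte at the same offset from the UTF-16 stream.
utf16 = bytearray(text.encode("utf-16-le"))
del utf16[3]
recovered_utf16 = bytes(utf16).decode("utf-16-le", errors="replace")

print(recovered_utf8)   # one character lost, the rest intact
print(recovered_utf16)  # every code unit after the deletion is misaligned
```

In the UTF-8 case the decoder sees a lead byte followed by something that is not a continuation byte, reports that one sequence as damaged, and picks up again at the next byte. In the UTF-16 case every later pair of bytes straddles two original code units.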

root42 Mar 7

@tek This! UTF-8 is a great encoding. Unicode can be a mess at times though. :)

Mans R Mar 7

@mo @vathpela @tek Variable length encoding adds a little complexity at the input and output stages, but I think the benefits outweigh that, especially the 8-bit compatibility that allows a lot of software to work (at least to some extent) unmodified.
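
A small sketch of both halves of that trade-off: the variable length per code point, and byte-oriented code that keeps working because it only looks for ASCII delimiters. The sample characters and the path are made-up illustrations.

```python
# Bytes used by UTF-8 for code points of increasing size:
# U+0041, U+00E9, U+732B, U+1F600.
samples = ["A", "\u00e9", "\u732b", "\U0001f600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every byte of a multi-byte sequence has its high bit set, so none of
# them can be mistaken for an ASCII delimiter such as b"/".
path = "música/日本語/notes.txt".encode("utf-8")
parts = [p.decode("utf-8") for p in path.split(b"/")]

print(lengths)
print(parts)
```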