Mastodawn

That's it, time to bury UTF-8

Hello, my name is Dmitry Karlovsky and I am... a serial killer of established standards. Today I tracked down UTF-8 and dealt it critical damage. Now I will tell you how I outplayed and destroyed it with a new text encoding standard: Unicode Compact Format. No, God! Please, No, NO!

https://habr.com/ru/articles/983042/

#utf8 #utf16 #utf32 #ucs2 #ucs4 #scsu #bocu1 #utfc #ucf #$mol


Habr

The Eclectic Light Company [Unofficial] Jan 3

Text, strings and Unicode

https://fed.brid.gy/r/https://eclecticlight.co/2026/01/03/text-strings-and-unicode/

christophwarner Feb 22, 2025

Started down this rabbit hole of language. Pirahã is interesting, ultimate recursion, few numbers, few concepts of time. Very difficult to learn outside the community culture it exists in… #languages #utf32 #humanities

https://youtube.com/watch?v=DQnyh_1kqy8&si=SSWKLKUWPotjFQh4

Pirahã: The Amazonian Tribe That Challenges Everything We Know About Language | SLICE

YouTube

Felix Palmen

Sep 13, 2024

I finally started work to support translated texts in #Xmoji! After finally deciding to come up with my own #l10n tooling... This way I can use my #utf32 string data structure natively, but it will be a LOT of work until completed 😫

https://github.com/Zirias/xmoji/pull/8

Meanwhile, enjoy #FreeBSD and #NetBSD packages of Xmoji 0.7 ... and hey, anyone wants to do some packaging for other systems, e.g. some #Linux dists? 😏🙈

#X11 #emoji #keyboard

localization: Implement feature to translate texts by Zirias · Pull Request #8 · Zirias/xmoji

This will add tooling to translate texts in Xmoji and should finally handle both texts used directly in the UI and translated emoji names from Unicode.

GitHub

Felix Palmen

Aug 20, 2024

This is now kind of a dev microblog concerning #Xmoji. I'm kind of stalled, now that version 0.7 seems reasonably stable and portable (I see there's a #nix pkg, unfortunately outdated, and a #NetBSD #pkgsrc port, I will deliver #FreeBSD *soon*).

It misses a few convenience features my previous #qXmoji had: save/restore the window size, optionally enforce a single instance, offer a #tray icon. I'll add all of that, seems straightforward. For the tray icon I'll only implement the old #X11 spec based on #Xembed, and if some desktop environment insists on only supporting the newer standard based on #dbus, well, screw that. Too much complexity, sorry.

The real issue is #localization (#l10n), specifically "just" translations. I still have no good concept for that. With #Qt, it was a no-brainer to also use Qt's mechanism. Without a toolkit, obvious choices would be either #POSIX message catalogs or #GNU #gettext. The latter is much more convenient, but pulls in extra deps (with #GPL/#LGPL foo). Both have in common that they only operate on char*, i.e. 8-bit encodings. I have many of my texts stored as char32_t (#Unicode UCS-4 or #UTF32, difference doesn't matter much here). I could redesign that to base everything on #UTF8, but I'm a bit reluctant ... why add more runtime conversions?

I seriously think about coming up with my own tooling. But then, how far should I jump? Should I really try to parse my own source (using LLVM's #libclang for example)? Or should I hardcode tables with identifiers for all translatable texts?

I'll sleep on that a few more nights I guess....

#X11 #emoji #keyboard
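
One of the options weighed in the post above, hardcoded tables with identifiers for all translatable texts, could look roughly like the sketch below. It is written in Python for brevity and is not Xmoji's actual tooling: the names `TextId`, `TABLES`, `utf32` and `translate`, and the sample strings, are all hypothetical. Texts are stored as tuples of code points, standing in for `char32_t` arrays, so a lookup needs no runtime conversion.

```python
from enum import IntEnum


class TextId(IntEnum):
    """Hypothetical identifiers for translatable texts (not taken from Xmoji)."""
    QUIT = 0
    SEARCH = 1


def utf32(text: str) -> tuple[int, ...]:
    """Store a text as a tuple of code points, a stand-in for a char32_t array."""
    return tuple(ord(ch) for ch in text)


# One table per locale, indexed by TextId. In a real tool these tables
# would be generated at build time; here they are written by hand.
TABLES = {
    "en": (utf32("Quit"), utf32("Search")),
    "de": (utf32("Beenden"), utf32("Suche")),
}


def translate(text_id: TextId, locale: str) -> tuple[int, ...]:
    """Look up a text by identifier, falling back to English for unknown locales."""
    table = TABLES.get(locale, TABLES["en"])
    return table[text_id]
```

The trade-off is the one the post hints at: identifiers must be maintained by hand, but nothing has to parse the source code to find translatable strings.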


aww-yawn Jul 18, 2024

LibreOffice Writer is rendering the character correctly

#weather #icon #symbol #LibreOffice #LibreOfficeWriter #typography #utf #utf16 #utf32

silas Jun 19, 2024

Really good and interesting article about the problem of calculating the length of strings.

#unicode #utf8 #utf16 #utf32 #programming

https://hsivonen.fi/string-length/

It’s not wrong that "🤦🏼‍♂️".length == 7
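
The number 7 in that title is the count of UTF-16 code units. A small Python sketch, assuming the emoji is the five-code-point sequence spelled out in the comment, shows how the "length" depends on the unit being counted:

```python
# The facepalm emoji from the title, written as explicit code points:
# U+1F926 (face palm), U+1F3FC (skin tone modifier), U+200D (zero width
# joiner), U+2642 (male sign), U+FE0F (variation selector-16).
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

utf8_bytes = len(s.encode("utf-8"))             # UTF-8 code units are bytes
utf16_units = len(s.encode("utf-16-le")) // 2   # 2 bytes per UTF-16 code unit
utf32_units = len(s.encode("utf-32-le")) // 4   # 4 bytes per UTF-32 code unit
code_points = len(s)                            # Python's str counts code points

print(utf8_bytes, utf16_units, utf32_units, code_points)  # 17 7 5 5
```

None of these counts equals 1, the number of glyphs a reader sees; that would require grapheme cluster segmentation, which Python's standard library does not provide.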


Felix Palmen

May 29, 2024

Next thing is working on a #TextBox #widget. This will require, among other things, translating between pixel coordinates and string positions (for selections and drawing a cursor at the correct position).

Of course, #harfbuzz operates on #Unicode codepoints, stored as 32-bit unsigned integers. So, yesterday, I started implementing a string class holding the value in both #utf8 and #utf32 and converting between these.

Now I realized this wasn't all too helpful, so I'll start over. What I want is some immutable string class using char32_t internally and offering functions to create mutated clones...

#xcb #x11 #programming
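
The design described in the post above (an immutable string that keeps code points internally and offers functions returning mutated clones) can be illustrated with a short Python sketch. This is not the author's implementation: the class name `UStr` and its methods `inserted`, `removed` and `to_utf8` are hypothetical, and a tuple of integers stands in for a `char32_t` buffer.

```python
class UStr:
    """Immutable string stored as a tuple of code points (hypothetical class)."""

    __slots__ = ("_cps",)

    def __init__(self, text: str = ""):
        # Bypass our own __setattr__ once, during construction.
        object.__setattr__(self, "_cps", tuple(ord(c) for c in text))

    def __setattr__(self, name, value):
        raise AttributeError("UStr is immutable")

    @classmethod
    def _from_cps(cls, cps):
        # Internal constructor that takes code points directly.
        new = cls.__new__(cls)
        object.__setattr__(new, "_cps", tuple(cps))
        return new

    def __len__(self):
        return len(self._cps)  # length in code points

    def __str__(self):
        return "".join(map(chr, self._cps))

    def inserted(self, pos: int, text: str) -> "UStr":
        """Return a clone with `text` inserted before code point index `pos`."""
        extra = tuple(ord(c) for c in text)
        return self._from_cps(self._cps[:pos] + extra + self._cps[pos:])

    def removed(self, start: int, end: int) -> "UStr":
        """Return a clone without the code points in [start, end)."""
        return self._from_cps(self._cps[:start] + self._cps[end:])

    def to_utf8(self) -> bytes:
        """Convert to UTF-8 only on demand, e.g. for an API that needs bytes."""
        return str(self).encode("utf-8")
```

Because every position is a code point index, a cursor or selection offset maps directly onto the values a shaper working on 32-bit code points expects, and the original string is never modified by an edit.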

Habr Jan 23, 2024

The Incredibles vs. Unicode: Elastigirl and her opponent, the flexible UTF-8

Character encoding is about how the characters we write our messages with look in binary code. There are many encodings in the world, but the most popular are these. ASCII: the very first encoding in the world, created in America. In fact, it is thanks to ASCII that 8 bits equal 1 byte. UTF-8, UTF-16 and UTF-32: these encodings were created by the Unicode organization. Put simply, they are the same thing as ASCII but more capacious, which means they take up more memory. It would all be an easy topic, but there is one catch: UTF-8 has what I consider a brilliant feature. It can "stretch", that is, adapt to a large number of characters.

https://habr.com/ru/articles/788230/

#unicode #utf8 #utf16 #utf32 #ascii #ram #byte #css #encode #computer_science
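
The "stretching" mentioned in the excerpt is UTF-8's variable width: a code point takes 1 to 4 bytes depending on its value. The sketch below is a minimal hand-written encoder in Python for illustration only (the function name `utf8_encode` is made up here; in practice `str.encode("utf-8")` does the same job).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (1 to 4 bytes)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:          # up to 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:         # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # up to 16 bits: three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),            # up to 21 bits: four bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Latin, Cyrillic, the euro sign and an emoji need 1, 2, 3 and 4 bytes.
for ch in ["A", "Я", "€", "😀"]:
    print(f"U+{ord(ch):04X}", len(utf8_encode(ord(ch))), "byte(s)")
```

ASCII text therefore costs one byte per character in UTF-8, while the same text in UTF-32 always costs four.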


Random-access memory (RAM): RAM consists of a large set of cells. One cell is one bit. A bit is the smallest unit of information storage in a RAM cell. A bit has a boolean...

Habr

Hapaxia Oct 1, 2023

Today is a Unicode day.

#unicode #utf8 #utf16 #utf32