Julia Evans Apr 3, 2023

is there a site like https://float.exposed for utf-8? Like where you paste in a UTF-8 string and see how it's broken up into Unicode code points?

Julia Evans Apr 3, 2023

https://www.fontspace.com/unicode/analyzer looks like a great site for breaking down Unicode text (h/t @alice)

https://www.babelstone.co.uk/Unicode/whatisit.html is similar but a bit less pretty

Adam Katz Apr 3, 2023

@b0rk @alice
That Font Space page is great! Too bad it doesn't do a byte count or per-byte breakdown (UTF-8).

Rev. GothAlice Apr 3, 2023

@adamhotep If you can demonstrate a way to reliably get binary input on a web page, I'd gladly throw together a single page tool to do the breakdown.

Unfortunately… I don't think it's that simple. There's the web page encoding, the browser's choice of content encoding for submitted data, and the fact that all text access in JS is in terms of Unicode characters, not the underlying binary data.

🤔 Could run through and generate the UTF-8 manually, just to highlight…

Dang. Now I've got a new night project.

@b0rk
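
For anyone curious what "generate the UTF-8 manually" could look like, here is a minimal sketch in Python rather than JS. The names `utf8_bytes` and `breakdown` are made up for illustration and are not from any tool mentioned in this thread; the bit layout follows the standard UTF-8 scheme of 1 to 4 bytes per code point.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no str.encode)."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


def breakdown(text: str) -> list[tuple[str, str]]:
    """Pair each code point (U+XXXX) with its UTF-8 bytes as spaced hex."""
    return [(f"U+{ord(ch):04X}", utf8_bytes(ord(ch)).hex(" ")) for ch in text]


for code_point, hex_bytes in breakdown("a€🌏"):
    print(code_point, hex_bytes)
```

The same shifts and masks should carry over to a browser tool, though a JS version would need to iterate by code point rather than by UTF-16 code unit.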

Adam Katz

@alice
I was just thinking of a simple byte count and hex dump (with options for \xe2\x82\xac vs € vs &euro; vs &#x20ac;, etc). Consuming binary would be cumbersome, and while you could do it via base64, I don't really see the point.
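
A rough sketch of that byte count and hex dump idea, assuming Python and only the standard library. The function name `describe` and the dictionary keys are hypothetical, chosen here for illustration; `html.entities.codepoint2name` is the stdlib table of named HTML entities.

```python
import html.entities


def describe(text: str) -> dict:
    """Show a string's UTF-8 byte count plus a few escaped spellings of it."""
    data = text.encode("utf-8")
    names = html.entities.codepoint2name  # e.g. 0x20AC -> "euro"
    return {
        "byte_count": len(data),
        # Backslash-x escapes, one per UTF-8 byte.
        "hex_escaped": "".join(f"\\x{b:02x}" for b in data),
        # The text as-is, with no secondary escaping.
        "literal": text,
        # Named HTML entities where one exists, otherwise the raw character.
        "named_entities": "".join(
            f"&{names[ord(ch)]};" if ord(ch) in names else ch for ch in text
        ),
        # Hex numeric character references for everything outside ASCII.
        "numeric_entities": "".join(
            f"&#x{ord(ch):x};" if ord(ch) > 0x7F else ch for ch in text
        ),
    }


for label, value in describe("€").items():
    print(f"{label}: {value}")
```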

Rev. GothAlice Apr 3, 2023

@adamhotep Weirdly, in most places (practically everywhere) I never bother to encode. My HTML files explicitly declare their UTF-8 encoding, so… why?

" " ← non-breaking space, for example. ⌥␣ on a macOS keyboard. Even my CSS icons have all largely switched to name tables, letting you use "user" as the actual named glyph… "character". (Might actually be a ligature? There's absolutely a proper name table in the font, though.)

Most compact would be without secondary encoding/escaping if possible.

Rev. GothAlice Apr 3, 2023

@adamhotep As an example of how far this can go, some programming languages offer _extensible_ encoding/decoding:

def 📢(✉️): print(✉️)

📢("✋ 🌏")

Yes; that's valid Python. It even prints out "hello world". Looks like a joke. Is not joke. https://pypi.org/project/emoji-encoding/

Adam Katz Apr 3, 2023

@alice
I work in email, so there's lots of ASCII and quoted-printable encoding, sometimes frivolously so as a form of obfuscation.
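
To illustrate what quoted-printable does to text, here is a small sketch using Python's standard `quopri` module. The helper names `to_quoted_printable` and `from_quoted_printable` are invented for this example. The last line shows the "frivolous" case, where plain ASCII letters are written as `=XX` escapes even though they never needed encoding.

```python
import quopri


def to_quoted_printable(text: str) -> str:
    """UTF-8 encode the text, then quoted-printable encode the bytes."""
    return quopri.encodestring(text.encode("utf-8")).decode("ascii")


def from_quoted_printable(qp_text: str) -> str:
    """Undo quoted-printable, then decode the resulting bytes as UTF-8."""
    return quopri.decodestring(qp_text.encode("ascii")).decode("utf-8")


print(to_quoted_printable("€"))
# Plain ASCII hidden behind =XX escapes still decodes to ordinary text.
print(from_quoted_printable("=48=65=6C=6C=6F"))
```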