https://www.fontspace.com/unicode/analyzer looks like a great site for breaking down Unicode text (h/t @alice)
https://www.babelstone.co.uk/Unicode/whatisit.html is similar but a bit less pretty
@adamhotep If you can demonstrate a way to reliably get binary input on a web page, I'd gladly throw together a single-page tool to do the breakdown.
Unfortunately… I don't think it's that simple. The page's encoding and the browser's choice of encoding for submitted data both get in the way, and all text access in JS works on Unicode characters (strictly, UTF-16 code units), not the underlying binary data.
🤔 Could run through and generate the UTF-8 manually, just to highlight…
Dang. Now I've got a new night project.
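For what "generate the UTF-8 manually" could look like, here is a minimal sketch in Python rather than JS. The function names `utf8_encode_code_point` and `breakdown` are invented for this illustration and are not taken from any tool mentioned in the thread.

```python
def utf8_encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def breakdown(text):
    """Return (character, 'U+XXXX', hex bytes) for each code point in text."""
    rows = []
    for ch in text:
        raw = utf8_encode_code_point(ord(ch))
        rows.append((ch, f"U+{ord(ch):04X}", " ".join(f"{b:02x}" for b in raw)))
    return rows


for row in breakdown("a€🌏"):
    print(*row)
```

The same bit-shifting carries over to JS almost line for line, since it only needs the numeric code point of each character.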
/xe2/x82/xac vs &euro; vs &#8364; vs &#x20ac;, etc). Consuming binary would be cumbersome, and while you could do it via base64, I don't really see the point.
@adamhotep Weirdly, in most places (practically everywhere) I never bother to encode. My HTML files explicitly declare their UTF-8 encoding, so… why?
" " ← non-breaking space, for example. ⌥␣ on a macOS keyboard. Even my CSS icons have all largely switched to name tables, letting you use "user" as the actual named glyph… "character". (Might actually be a ligature? There's absolutely a proper name table in the font, though.)
The most compact form would be one without any secondary encoding/escaping, if possible.
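To put rough numbers on the compactness point, here is a small sketch comparing a few ways of writing the euro sign. The set of forms compared is my own choice for illustration. It uses only the standard library.

```python
import base64

EURO = "\u20ac"  # the euro sign

# Different ways the same character might travel through a page or a form.
forms = {
    "raw UTF-8": EURO.encode("utf-8"),
    "named entity": b"&euro;",
    "decimal entity": f"&#{ord(EURO)};".encode("ascii"),
    "hex entity": f"&#x{ord(EURO):x};".encode("ascii"),
    "base64 of UTF-8": base64.b64encode(EURO.encode("utf-8")),
}

sizes = {name: len(data) for name, data in forms.items()}

for name, data in forms.items():
    print(f"{name:16} {sizes[name]} bytes  {data!r}")
```

For this character the raw UTF-8 form is the smallest at 3 bytes; every escaped form is longer.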
@adamhotep As an example of how far this can go, some programming languages offer _extensible_ encoding/decoding:
def 📢(✉️): print(✉️)
📢("✋ 🌏")
Yes; that's valid Python. It even prints out "hello world". Looks like a joke. Is not joke. https://pypi.org/project/emoji-encoding/
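How the linked package works internally is not shown in this thread. As a rough illustration of the general mechanism only, Python's standard `codecs.register` accepts a search function that returns a codec by name. The codec name `shout_demo` and the helper functions below are made up for this sketch.

```python
import codecs

CODEC_NAME = "shout_demo"  # hypothetical codec name, for illustration only


def _encode(text, errors="strict"):
    # Encoding is plain UTF-8; returns (bytes, number of characters consumed).
    return text.encode("utf-8", errors), len(text)


def _decode(data, errors="strict"):
    # Decoding is UTF-8 plus upper-casing, to make the custom step visible.
    data = bytes(data)
    return data.decode("utf-8", errors).upper(), len(data)


def _search(name):
    # The registry calls this with a normalized codec name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

decoded = b"hello world".decode(CODEC_NAME)
print(decoded)
```

Once registered, the name works anywhere a codec name is accepted, such as `bytes.decode` and `str.encode`.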
@b0rk UTF-8 specific… that's harder to recall.
https://www.fontspace.com/unicode/analyzer is one option for general analysis; it hand-waves the encoding part a bit, though.
It's a… shockingly simple variable-width integer encoding, so even an online hex editor with the right "template" (or such) applied to it could theoretically work.
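To show how small the variable-width scheme is, here is a sketch of a hand-written decoder. The name `utf8_decode` is hypothetical, and the checks (stray or missing continuation bytes, overlong forms, surrogates, values past U+10FFFF) follow the standard UTF-8 rules rather than any particular tool's behaviour.

```python
def utf8_decode(data):
    """Yield (code_point, raw_bytes) pairs; raise ValueError on malformed input."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte gives the sequence length and the first payload bits.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0
        elif 0xC0 <= lead < 0xE0:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead < 0xF0:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead < 0xF8:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        chunk = data[i:i + length]
        if len(chunk) < length:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for b in chunk[1:]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point {cp:#x} at offset {i}")
        yield cp, chunk
        i += length


for cp, raw in utf8_decode("a€🌏".encode("utf-8")):
    print(f"U+{cp:04X}", raw.hex(" "))
```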
@b0rk not a website, but I can recommend this CLI tool
GitHub - lunasorcery/utf8info: Reads UTF-8 on stdin and prints out the raw Unicode codepoints. Useful for seeing exactly what a string consists of.
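For anyone without that tool to hand, a rough Python approximation of the same idea is sketched below. It is not utf8info's implementation or output format; the `describe` and `dump` helpers are invented for this illustration.

```python
import io
import unicodedata


def describe(text):
    """Return one line per code point: U+XXXX, UTF-8 bytes in hex, character name."""
    lines = []
    for ch in text:
        name = unicodedata.name(ch, "<unnamed>")
        raw = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        lines.append(f"U+{ord(ch):04X}  {raw:<11}  {name}")
    return lines


def dump(stream):
    """Read UTF-8 bytes from a binary stream and print the breakdown."""
    text = stream.read().decode("utf-8", errors="replace")
    for line in describe(text):
        print(line)


# Demo on an in-memory stream; pass sys.stdin.buffer instead to read from a pipe.
dump(io.BytesIO("a€".encode("utf-8")))
```

Malformed input is shown as U+FFFD here because of `errors="replace"`; a stricter tool would report the offending bytes instead.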
@b0rk Another option I haven't seen in your replies:
@b0rk https://codepoints.net/analyze (by me) might be helpful.
This part of the site is brand-new, might still have some bugs, and I’m planning on expanding it.
The one thing I try to ensure is that a glyph gets rendered for as many code points as possible, so that people can see what a character looks like instead of just tofu.
@b0rk easy enough to create one. I had to mess with this a few years back. There were a whole mess of corner cases:
https://landley.net/notes-2017.html#01-09-2017
https://landley.net/notes-2017.html#29-08-2017
http://lists.landley.net/pipermail/toybox-landley.net/2017-September/025230.html
Parser and test code:
https://github.com/landley/toybox/blob/master/lib/lib.c#L372
https://github.com/landley/toybox/blob/master/toys/example/demo_utf8towc.c