is there a site like https://float.exposed for utf-8? Like where you paste in a UTF-8 string and see how it's broken up into Unicode code points?
Float Exposed

Floating point format explorer – binary representations of common floating point formats.

https://www.fontspace.com/unicode/analyzer looks like a great site for breaking down Unicode text (h/t @alice)

https://www.babelstone.co.uk/Unicode/whatisit.html is similar but a bit less pretty

Unicode Text Analyzer | FontSpace

Find out the real characters in a string of text. Great for finding hidden or similar Unicode codepoints!

fontspace
@b0rk @alice
Perhaps also try https://r12a.github.io/uniview/?charlist=%E1%B0%A3%E1%B0%A6%E1%B0%A3%E1%B0%A4%E1%B0%A7%E1%B0%B3%E1%B0%B6%E1%B0%80%E1%B0%A6 . The benefit is that there are images for all characters in Unicode except Han/Tangut/Korean, and you can also do much more analysis on the characters. hth
UniViewSVG 15

@ri You made that page? I’ve been using (versions of) that page for years! Thank you! @b0rk @alice
@b0rk @alice I have one that is specialized for generating programming-language representations. http://acme.com/unicode/decode.html
De-Unicode

@b0rk @alice
That FontSpace page is great! Too bad it doesn't show a UTF-8 byte count or a per-byte breakdown.

@adamhotep If you can demonstrate a way to reliably get binary input on a web page, I'd gladly throw together a single page tool to do the breakdown.

Unfortunately… I don't think it's that simple. There's the web page's encoding, the browser's choice of encoding for submitted data, and the fact that all text access in JS works on already-decoded characters (UTF-16 code units), not the underlying binary data.

🤔 Could run through and generate the UTF-8 manually, just to highlight…

Dang. Now I've got a new night project.
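
For anyone following along, "generating the UTF-8 manually" can be sketched roughly like this. It is only an illustration, not anyone's actual tool: `utf8_encode` is a made-up helper name, and the script assumes Python 3.8+ (for `bytes.hex` with a separator). It checks itself against Python's built-in encoder.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "a€🌏":
    manual = utf8_encode(ord(ch))
    # Cross-check the hand-rolled bytes against the built-in encoder.
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {manual.hex(' ')}")
```

Because the bytes are built per code point, a page could keep the grouping and highlight each sequence next to the character it came from.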

@b0rk

@alice
I was just thinking of a simple byte count and hex dump (with options for \xe2\x82\xac vs € vs &euro; vs &#x20ac;, etc). Consuming binary would be cumbersome, and while you could do it via base64, I don't really see the point.
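
A rough sketch of what such a byte count plus escape options might look like, assuming Python 3.8+. The function name `escape_forms` and the dictionary keys are invented for illustration; only `html.entities.codepoint2name` and the string/bytes methods are standard library.

```python
from html.entities import codepoint2name


def escape_forms(text: str) -> dict:
    """Return a UTF-8 byte count, a hex dump, and a few escaped spellings of text."""
    data = text.encode("utf-8")
    return {
        "byte_count": len(data),
        "hex_dump": data.hex(" "),
        # One \xNN escape per UTF-8 byte.
        "byte_escapes": "".join(f"\\x{b:02x}" for b in data),
        # Named HTML entity where one exists, otherwise the character itself.
        "named_entity": "".join(
            f"&{codepoint2name[ord(c)]};" if ord(c) in codepoint2name else c
            for c in text
        ),
        # Numeric character references address code points, not bytes.
        "decimal_entity": "".join(f"&#{ord(c)};" for c in text),
        "hex_entity": "".join(f"&#x{ord(c):x};" for c in text),
    }


for key, value in escape_forms("€").items():
    print(f"{key:>15}: {value}")
```

Note that the byte escapes describe the three UTF-8 bytes, while the numeric entities describe the single code point U+20AC, which is why the numbers differ.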

@adamhotep Weirdly, in most places (practically everywhere) I never bother to encode. My HTML files explicitly declare their UTF-8 encoding, so… why?

" " ← non-breaking space, for example. ⌥␣ on a macOS keyboard. Even my CSS icons have all largely switched to name tables, letting you use "user" as the actual named glyph… "character". (Might actually be a ligature? There's absolutely a proper name table in the font, though.)

The most compact form is the one without any secondary encoding/escaping, where that's possible.

@adamhotep As an example of how far this can go, some programming languages offer _extensible_ encoding/decoding:

def 📢(✉️): print(✉️)

📢("✋ 🌏")

Yes; that's valid Python. It even prints out "hello world". Looks like a joke. Is not joke. https://pypi.org/project/emoji-encoding/

emoji-encoding

Module providing Emoji encoding for Python

PyPI
@alice
I work in email, so there's lots of ASCII and quoted-printable encoding, sometimes frivolously so as a form of obfuscation.
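
As a small illustration of the quoted-printable point, here is a sketch using Python's standard `quopri` module; the sample string is arbitrary. Each non-ASCII UTF-8 byte becomes an `=XX` triplet, so the byte structure stays visible in the encoded text.

```python
import quopri

# Arbitrary sample text containing two non-ASCII characters.
raw = "café €".encode("utf-8")

# Quoted-printable keeps printable ASCII as-is and escapes other bytes as =XX.
encoded = quopri.encodestring(raw)
print(encoded)

# Decoding restores the original UTF-8 bytes.
decoded = quopri.decodestring(encoded)
assert decoded == raw
print(decoded.hex(" "))
```
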
@alice @adamhotep someone pointed me to this https://mothereff.in/utf-8 but i'd really like it to highlight the utf-8 bytes and explain what code point each byte sequence corresponds to
UTF-8 encoder/decoder

An online, on-the-fly UTF-8 encoder/decoder.
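
The kind of breakdown asked for here (bytes grouped by the code point they encode) can be sketched in a few lines of Python. This is an assumption-laden illustration, not how any of the linked sites work: `breakdown` is a hypothetical name, it needs Python 3.8+, and it will raise on lone surrogates, which cannot be encoded as UTF-8.

```python
import unicodedata


def breakdown(text: str) -> list:
    """Return (utf8_hex, code_point, name) for each code point in text."""
    rows = []
    for ch in text:
        rows.append((
            ch.encode("utf-8").hex(" "),          # the bytes for this code point
            f"U+{ord(ch):04X}",                   # conventional U+XXXX notation
            unicodedata.name(ch, "<unnamed>"),    # fallback for unnamed code points
        ))
    return rows


for raw, cp, name in breakdown("é✋"):
    print(f"{raw:<12} {cp:<8} {name}")
```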

@b0rk not that I'm aware - I go over it in https://fasterthanli.me/articles/working-with-strings-in-rust but none of it is interactive. I'm getting more into interactive viz lately but you'll probably beat me to this one! (we should really have a coordinated index at some point)
Working with strings in Rust

There's a question that always comes up when people pick up the Rust programming language: why are there two string types? Why is there String , and &str ? My Declarati...

fasterthanli.me
@fasterthanlime what interactive visualizations have you been working on?

@b0rk UTF-8 specific… that's harder to recall.

https://www.fontspace.com/unicode/analyzer is one option for general use, though it hand-waves the encoding part a bit.

It's a… shockingly simple variable-width integer encoding, so even an online hex editor with the right "template" (or such) applied to it could theoretically work.
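
To show how simple the variable-width scheme is, here is a simplified decoder sketch in Python. The lead byte's high bits give the sequence length, and each continuation byte (`10xxxxxx`) contributes six more bits. `utf8_decode` is a hypothetical name, and the sketch deliberately skips checks a real decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def utf8_decode(data: bytes):
    """Yield (code_point, byte_sequence) pairs from UTF-8 bytes (simplified)."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: single byte (ASCII)
            length, cp = 1, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
            length, cp = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        seq = data[i:i + length]
        if len(seq) != length:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in seq[1:]:
            if b >> 6 != 0b10:        # continuation bytes must be 10xxxxxx
                raise ValueError(f"invalid continuation byte {b:#04x}")
            cp = (cp << 6) | (b & 0x3F)
        yield cp, seq
        i += length


for cp, seq in utf8_decode("a€🌏".encode("utf-8")):
    print(f"{seq.hex(' '):<12} U+{cp:04X}")
```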

@b0rk not a website, but I can recommend this CLI tool

https://github.com/lunasorcery/utf8info

GitHub - lunasorcery/utf8info: Reads UTF-8 on stdin and prints out the raw Unicode codepoints. Useful for seeing exactly what a string consists of.

GitHub
@b0rk https://r12a.github.io/app-analysestring/ by @ri! (Includes Unicode properties & links to the UCD)
Analyse string tool

Tool to analyse what characters are in a string and list information about them.

CyberChef

The Cyber Swiss Army Knife - a web app for encryption, encoding, compression and data analysis

@b0rk Another option I haven't seen in your replies:

https://apps.timwhitlock.info/unicode/inspect

Unicode character inspector

Examine Unicode characters in UTF-8 encoded strings

apps.timwhitlock.info
@b0rk Others have mentioned @ri's tools, but the full list is at http://r12a.github.io/applist and includes several tools for inspecting, breaking down, converting, and finding Unicode characters. The code converter has long been one of my favorite tools on the internet. Fairly utilitarian, but very functional!
r12a >> apps

Small web apps written in html and javascript.

@b0rk https://codepoints.net/analyze (by me) might be helpful.
This part of the site is brand-new, might still have some bugs, and I’m planning on expanding it.

The one thing I try to ensure is that a glyph is rendered for as many code points as possible, so that people can see what a character looks like instead of just tofu.

Analyze – Codepoints

Look under the hood of a string of text and find out what it is made of.

Unicode Visualizer

A tool for working with Unicode

@b0rk I would love that!