I built this proof of concept of a tool called https://text.makeup. It is meant to be a friendly Unicode explainer – meant not just for Unicode nerds, but nerds of any kind. Useful for debugging, but also learning.

You can go there now to play (much more fun on desktop!), but I also recorded a 5-minute video that explains it further.

I am curious: Does this feel like fun? Is it worth building out for real? What would you like to see in it if so?

I am a bit overwhelmed with the idea of finishing it and maintaining it since Unicode is so vast and complex. But maybe this could be open sourced, for example?

I have an About page (https://text.makeup/about) that talks more about what’s in there and what *could* be in there. (I have already learned *a lot* from building it, so there’s no skin off my back.)

Thanks for any thoughts you might have!

@mwichary This is excellent! It's like a "debugger" for text, peeling back layers of mystery that are usually awkward to access. I love the idea of open-sourcing it; that could make it self-hostable, which would be great.
@80columns Sorry, what does self-hostable mean?
@mwichary Self-hostable means it's possible to run one's own instance of the system, either on a different public server, or on a private server on a local network. That offers various benefits, including data privacy and the ability to run local modifications to the software (including the development of new revisions).

@mwichary it looks like it's already better for the purpose of exploring text than my previous go-to, Babelstone's venerable "what unicode character is this". https://babelstone.co.uk/Unicode/whatisit.html

(and it has a more memorable URL)

@mwichary This is spectacular, Marcin! (Only just checking it out today, as I’m way behind on my timeline this week.) I think that open sourcing might be a good solution, especially if the Unicode community finds it useful.

@mwichary This is very cool, and the kind of thing I am very interested in - so much so that I snagged the unicode.exposed domain a while back (a la https://float.exposed/) and then didn't get far enough in that project to be worth putting up there 

If you'd like to take over that domain for this project, feel free to message me! I'd also be interested in collaborating a bit, but as indicated by my half-finished attempt, that may be overambitious. Here's where I left off on my own explorations; you made much more progress than I did, and obviously took a different tack: https://mahogany-waiting-treatment.glitch.me/

My initial interest is in breaking down the mystery of "how do computers represent things", particularly around text encoding, but I also had ambitions parallel to yours - particularly the Observations panel is a brilliant little way to highlight some of the info I wanted to capture. Thanks for this!

@mwichary Also, apologies for the delay - I had meant to reply when you originally posted this but got distracted, and then was looking through my domains today and was reminded to circle back.
@mwichary the amount of info it already displays is great, like I didn't expect it to say anything about :) and certainly not about \n
@mwichary I use https://unicode.fyi and this really feels like a supercharged version of that for normal people

@noiob Yeah, there are quite a few tools that feel exactly like this, but I wanted to try a friendlier approach.

The "Information" and "Observation" boxes do not change when I type something in the "Text" box.

The Javascript console starts out with:

TypeError: s is undefined

and as I type I get a bunch of:

Uncaught TypeError: Intl.Segmenter is not a constructor
    stringSplit https://text.makeup/main.js?v8:965
    senseQuotedPrintableEscape https://text.makeup/main.js?v8:479
    hasQuotedPrintableEscapes https://text.makeup/main.js?v8:912
    detectStringType https://text.makeup/main.js?v8:931
    processInputIntoOutput https://text.makeup/main.js?v8:2201
    onInput https://text.makeup/main.js?v8:4825
    main https://text.makeup/main.js?v8:6473
    main https://text.makeup/main.js?v8:6473
    <anonymous> https://text.makeup/:123
main.js:965:28
    stringSplit https://text.makeup/main.js?v8:965
    senseQuotedPrintableEscape https://text.makeup/main.js?v8:479
    hasQuotedPrintableEscapes https://text.makeup/main.js?v8:912
    detectStringType https://text.makeup/main.js?v8:931
    processInputIntoOutput https://text.makeup/main.js?v8:2201
    onInput https://text.makeup/main.js?v8:4825
    main https://text.makeup/main.js?v8:6473
    (Async: EventListener.handleEvent)
    main https://text.makeup/main.js?v8:6473
    <anonymous> https://text.makeup/:123

errors.

This is in Firefox 114.0 on Debian GNU/Linux 12.

@asjo Thanks for the report! It seems Intl.Segmenter was only added in Firefox 125, released in April of this year.

I should throw a more gentle, user-facing error on this…

@asjo Another person reported this, so I updated https://text.makeup with a fallback grapheme segmenter – it should work on your Firefox now! Let me know what you think.
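
For readers curious what such a fallback might look like, here is a minimal sketch. It is not the code text.makeup actually uses; the names `splitGraphemes`, `fallbackGraphemes`, and `EXTENDS_PREVIOUS` are made up for illustration. It feature-detects `Intl.Segmenter` and otherwise falls back to a deliberately crude rule: glue combining marks, variation selectors, and zero-width-joiner sequences onto the preceding code point.

```javascript
// Hypothetical sketch; NOT the fallback segmenter shipped in text.makeup.

// Code points that should attach to the previous cluster in the crude fallback:
// combining marks (which include variation selectors) and the zero-width joiner.
const EXTENDS_PREVIOUS = /^[\p{M}\u200D]$/u;

// Crude approximation of grapheme clusters for browsers without Intl.Segmenter.
// It does not implement the full Unicode segmentation rules (for example,
// skin-tone modifiers and regional-indicator pairs are not handled).
function fallbackGraphemes(text) {
  const clusters = [];
  for (const ch of text) { // iterates by code point, not by UTF-16 unit
    const last = clusters.length - 1;
    const joinsPrevious =
      last >= 0 &&
      (EXTENDS_PREVIOUS.test(ch) || clusters[last].endsWith("\u200D"));
    if (joinsPrevious) {
      clusters[last] += ch;
    } else {
      clusters.push(ch);
    }
  }
  return clusters;
}

// Use the built-in segmenter when it exists, otherwise the crude fallback.
function splitGraphemes(text) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(segmenter.segment(text), (part) => part.segment);
  }
  return fallbackGraphemes(text);
}

console.log(splitGraphemes("e\u0301a").length); // 2: "é" (e + combining acute) and "a"
```

Checking `typeof Intl.Segmenter === "function"` before calling `new` avoids the "is not a constructor" error reported above, at the cost of less accurate splitting in older browsers.
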

It works for me now in Firefox - cheers!
@mwichary the thing I most want to see is a per-codepoint index into the history of the writing systems. The Unicode proposals contain a remarkable amount of scholarship on writing systems, mostly buried in hard to find PDFs. The excitement about was very popular but articles about things like Linear-B or Deseret are also fascinating and hard to access.
@nelson Thanks! Alas I’ll have to admit this is not that interesting to me (yet).
@nelson That link is very cool, though, and should definitely be included.
@nelson @mwichary I really want this too! The closest I have found to this is SIL's ScriptSource "Unicode Status" pages, for example: https://www.scriptsource.org/cms/scripts/page.php?item_id=entry_detail&uid=ur8y3qj6yk
@simoncozens @mwichary oooh this is new to me and very cool! thank you.
@nelson @mwichary I have long wanted Unicode charts that contain a "how did this glyph wind up in Unicode?" explainer. Even being able to track things back to "it was part of WingDings" is useful information!
@tedmielczarek @nelson I would actually love that part, too! I was always wondering how hard it is to research that from meeting files etc.
@mwichary @nelson the documents are basically all there, from when I've dug into this at times, but it's an entirely manual process to track them down. This doesn't account for symbols that came in as part of a bulk import from elsewhere, like the mysterious angzarr ⍼.

@mwichary @nelson For new additions in the upcoming version of Unicode, the Pipeline page is great, as it links to sources: https://www.unicode.org/alloc/Pipeline.html

The versioned charts page for each release version is nice as it highlights additions/changes, but doesn't link sources: https://www.unicode.org/charts/PDF/Unicode-16.0/

@mwichary @nelson What I don't know (and I guess I should ask someone actually involved in the Unicode standards work) is whether they maintain historical copies of that Pipeline page or whether fetching the Wayback Machine copy is all we've got. e.g. https://web.archive.org/web/20240526110216/https://www.unicode.org/alloc/Pipeline.html shows the additions for Unicode 16 with references, like you can follow the trail of references for "Graphic shapes for legacy computing" all the way to: https://www.unicode.org/L2/L2021/21235-terminals-supplement.pdf

@tedmielczarek @mwichary That pipeline page is nice! I recall seeing similar references in past communications going back as far as 20 years. Maybe in the details of Unicode releases, or in mailing list archives.
It is still a manual process to assemble it all. Would be a really useful scholarly project but a lot of work. (And I hear ya Marcin, if it doesn't interest you...)
@nelson @mwichary it's amazing how many fractal dimensions of rabbit holes there are to fall down just within the domain of Unicode!
@tedmielczarek @nelson It’d be super cool to automate it so you can plug in a codepoint and it would just spit out all the results… unless that exists already.

@mwichary @tedmielczarek I asked an expert about this back in 2015 and got this reply:

I have long wanted to create an index of all UTC/WG2 documents, which would go part way to achieving what you want, but I have not had the time to do so unfortunately. For most characters the paper trail is quite straight forward, but for some characters the history is convoluted, and you have to delve into the minutes and resolutions of UTC and WG2 meetings to determine what happened.

@nelson @tedmielczarek I’m interested in a) good stories, and b) stuff that helps understand other things (“how we got there”).

For example, very curious what made “symbols for legacy computing” make it in this year in particular!

@tedmielczarek @nelson Yeah, so far in my tool I did focus on stories that help you understand particular issues rather than stories of glyphs in general. But I’d be interested in digging deeper.

@mwichary Maybe it's just me, but when I see "This sequence can be encoded in a more complex way" my first thought is "I must find the _most_ complex way" and then comes the supervillain laugh. Maybe replace "Just so you know" with "<abbr title='Just so you know. This is not a dare. This is not a challenge. OMG do not do this'>Just so you know</abbr>"

Or maybe add that meme "Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should"

@lahosken Haha. I didn’t plan to have a “more complex” thing initially, but I ended up needing it to balance the “simpler” one.
@mwichary I didn't know that more complex or specific emoji are created with the combination of more generic ones + other characters + modifiers 🤩 This IS educational and fun to play with.
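
As a small illustration of that composition, the sketch below lists the code points inside an emoji sequence and splits it on the zero-width joiner (U+200D). The helper names `describeCodePoints` and `splitOnZwj` are hypothetical and are not taken from text.makeup.

```javascript
// Hypothetical helpers for illustration; not taken from text.makeup.

const ZWJ = "\u200D"; // zero-width joiner, the "glue" in many emoji sequences

// List every code point in a string in the usual U+XXXX notation.
function describeCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Split a ZWJ sequence into the simpler emoji it is built from.
function splitOnZwj(text) {
  return text.split(ZWJ);
}

// "Woman astronaut" is encoded as WOMAN + ZWJ + ROCKET.
const astronaut = "\u{1F469}\u200D\u{1F680}";
console.log(describeCodePoints(astronaut)); // ["U+1F469", "U+200D", "U+1F680"]
console.log(splitOnZwj(astronaut).length);  // 2

// A skin-tone variant is a base emoji followed by a modifier, with no joiner.
const wave = "\u{1F44B}\u{1F3FD}";
console.log(describeCodePoints(wave)); // ["U+1F44B", "U+1F3FD"]
```
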
@mwichary It might be very useful, one of those tools you rarely use but should always remember about.
@mwichary this is going straight into my favorites bar
@mwichary Wow, this is very cool Marcin! As a font- and, in general, typesetting-curious person, this is a very nice tool.

@mwichary @kayserifserif This looks *great*! Fun for sure, and i’m betting also very useful.

Can you (or have you already somewhere?) talk about how much of this is manual versus automated? Like, I know things like the composition of the flags and handshake can be deconstructed programmatically, but are things like the “fake font” manual? Are you creating those all yourself?

@a @kayserifserif It’s a huge spectrum. Some things are almost all automated (like flags), but other things (like, indeed, fake fonts) are very manual. There is also a secondary spectrum of “manual based on manually compiled data” or “automatic based on manually compiled data” and so on. The About page has some details on data sources, if you are interested.

Some fully automatic things like confusables and compiled stuff will require slight manual assists/additions, too.
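
To show why flags sit at the automated end of that spectrum, here is a minimal sketch of deconstructing one. A flag emoji is a pair of regional indicator symbols (U+1F1E6 to U+1F1FF), each standing for a letter A to Z. The function name `flagToRegionCode` is hypothetical, and this is not the tool's own code.

```javascript
// Hypothetical sketch; not the code used by text.makeup.

const REGIONAL_INDICATOR_A = 0x1f1e6; // REGIONAL INDICATOR SYMBOL LETTER A
const REGIONAL_INDICATOR_Z = 0x1f1ff; // REGIONAL INDICATOR SYMBOL LETTER Z

// Turn a two-symbol flag emoji into its two-letter region code,
// or return null if the input is not a pair of regional indicators.
function flagToRegionCode(flag) {
  const codePoints = Array.from(flag, (ch) => ch.codePointAt(0));
  const allIndicators = codePoints.every(
    (cp) => cp >= REGIONAL_INDICATOR_A && cp <= REGIONAL_INDICATOR_Z
  );
  if (codePoints.length !== 2 || !allIndicators) {
    return null;
  }
  return codePoints
    .map((cp) => String.fromCharCode(65 + cp - REGIONAL_INDICATOR_A)) // 65 is "A"
    .join("");
}

console.log(flagToRegionCode("\u{1F1F5}\u{1F1F1}")); // "PL"
console.log(flagToRegionCode("PL"));                 // null
```

Mapping the resulting code to a country name still needs a lookup table, which is where the compiled data comes in.
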

@a @kayserifserif Lmk if this could be explained better!

@mwichary this is super cool!

I think many people would benefit from something like a trail map for exploring the data

@JoeGermuska Say more?
@mwichary sorry, i was on my phone. now that i look at it on desktop, the examples are probably close enough to what I was thinking of

@mwichary also: plug-ins? although I'm not sure how that works with a webapp

I just listened to @simon on a podcast yesterday so I have plug-ins on the brain

@mwichary I enjoyed looking at the way it breaks down sequences in Devanagari.

@DunwichType Thanks! Is it even remotely useful? I haven’t tested at all.

If you have any example URLs or strings handy that you tested, that’d be useful for my orientation!

@mwichary Your page looks great and it's fun to play with. I tried a syntax that I've seen from time to time but that I never understood: the escaping of domain names with umlauts. And to my surprise, text.makeup knows about it, but it's not fully implemented it seems (no further details and the highlighting of characters and URL parts doesn't line up). Shall I add an issue on github?
@compfu Oh, I can tell from the screenshot that the JavaScript broke somewhere along the way – possibly it’s been processed properly by new URL() in JavaScript, but the tool itself is unaware. I completely forgot about this one and it’s a great bug report. No GitHub necessary!
@compfu I’m also curious to learn more about it now.
@compfu BTW which browser are you using? Weirdly enough it doesn’t do that for me.
@mwichary it's Vivaldi. As soon as I enter the "ü" this error shows up in the console:

@compfu Thanks! I figured out the original issue – turns out I wasn’t doing exactly the steps you were doing, and it’s not Vivaldi-specific after all.

It would be fun to support and explain Punycode, but it will be a bit more work. In the meantime, I at least stopped international domains from breaking, if you reload the tool. Gave you credit at https://text.makeup/about/ – thanks again!
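
One plausible reason the highlighting did not line up, sketched below: the standard `URL` constructor serializes an internationalized hostname to its ASCII (Punycode) form, so the parsed host is a different and longer string than what was typed. The helper name `parsedHostname` and the `.example` domain are made up for illustration; this is not the tool's own code.

```javascript
// Hypothetical helper; not taken from text.makeup.

// Return the hostname as the URL parser serializes it.
function parsedHostname(input) {
  return new URL(input).hostname;
}

const typed = "https://m\u00FCnchen.example/"; // "münchen" with a real "ü"
console.log(parsedHostname(typed)); // "xn--mnchen-3ya.example"

// The typed host has 15 characters and the parsed one has 22, so character
// offsets computed against one string do not map directly onto the other.
```
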

@compfu I added support for Punycode – thanks again for suggesting that. It was actually really hard because that encoding can be non-linear… it’s not bug-free, but it should work much better now: https://text.makeup/#example-punycode-in-domain-name
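
To make the "non-linear" part concrete, here is a minimal decoder sketch that follows the algorithm in RFC 3492. It is not the implementation in text.makeup, the function names are hypothetical, and it leaves out the overflow checks of the reference implementation. The key point is that each run of digits after the last hyphen encodes both which code point to insert and where in the output to insert it, so the encoded characters do not map one-to-one onto positions in the decoded text.

```javascript
// Minimal Punycode decoder sketch after RFC 3492; NOT the code used by
// text.makeup. Overflow checks from the reference implementation are omitted.

const BASE = 36, T_MIN = 1, T_MAX = 26, SKEW = 38, DAMP = 700;

// Map a Punycode digit character to its value (a-z = 0..25, 0-9 = 26..35).
function digitValue(ch) {
  const c = ch.charCodeAt(0);
  if (c >= 0x30 && c <= 0x39) return c - 0x30 + 26;
  if (c >= 0x41 && c <= 0x5a) return c - 0x41;
  if (c >= 0x61 && c <= 0x7a) return c - 0x61;
  throw new RangeError("Not a Punycode digit: " + ch);
}

// Bias adaptation function from RFC 3492, section 6.1.
function adaptBias(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - T_MIN) * T_MAX) / 2)) {
    delta = Math.floor(delta / (BASE - T_MIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - T_MIN + 1) * delta) / (delta + SKEW));
}

// Decode one label WITHOUT its "xn--" prefix, e.g. "mnchen-3ya".
function punycodeDecode(label) {
  const lastDash = label.lastIndexOf("-");
  // Everything before the last hyphen is copied through as plain ASCII.
  const output = lastDash > 0
    ? Array.from(label.slice(0, lastDash), (ch) => ch.codePointAt(0))
    : [];
  let pos = lastDash > 0 ? lastDash + 1 : 0;
  let n = 128, bias = 72, i = 0;

  while (pos < label.length) {
    const oldI = i;
    let w = 1;
    // Read one variable-length integer (a "delta").
    for (let k = BASE; ; k += BASE) {
      if (pos >= label.length) throw new RangeError("Truncated input");
      const digit = digitValue(label[pos++]);
      i += digit * w;
      const t = k <= bias ? T_MIN : k >= bias + T_MAX ? T_MAX : k - bias;
      if (digit < t) break;
      w *= BASE - t;
    }
    const len = output.length + 1;
    bias = adaptBias(i - oldI, len, oldI === 0);
    n += Math.floor(i / len); // which code point to insert
    i %= len;                 // where to insert it
    output.splice(i, 0, n);
    i++;
  }
  return String.fromCodePoint(...output);
}

console.log(punycodeDecode("mnchen-3ya")); // "münchen"
```

In the `mnchen-3ya` example, the three characters `3ya` decode to a single number that says "insert U+00FC at index 1", which is why highlighting the encoded and decoded forms side by side takes extra bookkeeping.
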

@mwichary wow, thanks for remembering my post from waaaay back :)
@mwichary very cool! I shoved the trans flag at it and it very nicely explained why it's a weird combination of flags and symbols
@foone It would be cool to support LGBTQ symbols specifically and put them up as an independent example! https://en.wikipedia.org/wiki/LGBTQ_symbols has a nice list. Added as a potential to-do item!

@mwichary This is a really good idea! 🙂