Mastodawn

Alex Feb 19, 2025

Today I learned that there is a specific #unicode "record separator" symbol, formally known as "U+001E Information Separator Two".

https://codepoints.net/U+001E

It is meant to be used to indicate a separation between two units of information. An example of where this could be used is in a separated-value file, e.g. a CSV, but using this symbol instead of a comma.

This is interesting because there are vanishingly few instances where the record separator symbol would appear in most contexts, but many instances where a comma appears. Using this symbol instead of a comma (or a semi-colon, or an exclamation point, or any one of the usual separators) could make some data hygiene scenarios much more straightforward.
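To make the idea concrete, here is a minimal sketch rather than any established format. Going by the ASCII names, the unit separator (U+001F) is the closer stand-in for the comma, with the record separator (U+001E) standing in for the line break. The helper names `encode` and `decode` are hypothetical, and the sketch assumes no field ever contains either separator.

```python
# Assumption: US (U+001F) separates fields and RS (U+001E) separates records,
# following their ASCII names. encode/decode are hypothetical helper names.
US = "\x1f"  # unit separator, a.k.a. Information Separator One
RS = "\x1e"  # record separator, a.k.a. Information Separator Two


def encode(rows):
    """Join rows of string fields into one separator-delimited string."""
    for row in rows:
        for field in row:
            # Without escaping, a separator inside a field would corrupt the data.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def decode(text):
    """Split a separator-delimited string back into rows of fields."""
    if text == "":
        return []
    return [record.split(US) for record in text.split(RS)]


rows = [
    ["name", "note"],
    ["Ada", "likes commas, semicolons; and\nnewlines"],
]
assert decode(encode(rows)) == rows
```

Commas, semicolons, and newlines pass through untouched, but the `ValueError` branch shows the catch: the scheme only works as long as the separators themselves never show up in the data.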

U+001E INFORMATION SEPARATOR TWO*: ␞ – Unicode

␞, codepoint U+001E INFORMATION SEPARATOR TWO* in Unicode, is located in the block “Basic Latin”. It belongs to the Common script and is a Control.

Codepoints.net

mhoye Feb 18, 2025

@phrawzty We had this in ASCII and nobody knew about it!

james Feb 18, 2025

@mhoye @phrawzty And yet somehow gettext used \u0004 END OF TRANSMISSION as a separator! (Which, because I learned about it from gettext, I also have)
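For anyone curious, a rough sketch of the gettext convention mentioned here: in compiled catalogs, a message that has a context is keyed as the context, then U+0004, then the message id. The function names `make_key` and `split_key` are made up for illustration and are not part of any gettext API.

```python
# Assumption: keys follow the "context + U+0004 + msgid" convention.
# make_key and split_key are hypothetical helpers, not gettext functions.
EOT = "\x04"  # END OF TRANSMISSION, used here as the context separator


def make_key(context, msgid):
    """Build a lookup key for a message that has a context."""
    return f"{context}{EOT}{msgid}"


def split_key(key):
    """Return (context, msgid); context is None when the key has no separator."""
    context, sep, msgid = key.partition(EOT)
    if not sep:
        return None, key
    return context, msgid


key = make_key("menu", "Open")
assert split_key(key) == ("menu", "Open")
```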

Simon Sapin Feb 18, 2025

@Jamessocol @mhoye @phrawzty if you’re designing a new format, please please use well-defined escaping or length-prefixing instead of trying to find a less common delimiter and hoping for the best
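A minimal sketch of what length-prefixing can look like, assuming a made-up layout where each field is written as its UTF-8 byte length in decimal, a colon, and then the bytes. The names `encode_fields` and `decode_fields` are hypothetical, and this is one possible layout, not a standard one.

```python
# Assumption: each field is stored as b"<byte length>:<utf-8 bytes>".
# encode_fields and decode_fields are hypothetical helper names.

def encode_fields(fields):
    """Serialize a list of strings with a decimal length prefix per field."""
    out = bytearray()
    for field in fields:
        data = field.encode("utf-8")
        out += str(len(data)).encode("ascii") + b":" + data
    return bytes(out)


def decode_fields(blob):
    """Parse the length-prefixed layout back into a list of strings."""
    fields = []
    pos = 0
    while pos < len(blob):
        colon = blob.index(b":", pos)   # raises ValueError if the prefix is missing
        length = int(blob[pos:colon])   # raises ValueError if not a number
        if length < 0:
            raise ValueError("negative field length")
        start = colon + 1
        end = start + length
        if end > len(blob):
            raise ValueError("truncated field")
        fields.append(blob[start:end].decode("utf-8"))
        pos = end
    return fields


fields = ["a,b", "x\x1ey", "", "12:34"]
assert decode_fields(encode_fields(fields)) == fields
```

Because the reader always knows how many bytes to consume, a field can hold commas, colons, or even U+001E itself without any special handling.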

Dan 🌈 Feb 18, 2025

@mhoye
REJECT MODERNITY
EMBRACE TRADITION(AL ASCII)

🔏 Matthias Wiesmann Feb 18, 2025

@phrawzty This would be way better than using CSV which is a disaster format. In particular, these characters are not Unicode interchange valid, so they can never appear in the text fields inside the table…

https://wiesmann.codiferes.net/wordpress/archives/19862

More CSV Evil

good,⃣evil My post on CSV parsing got quite some attention, with various systems parsing them quite differently, one google+ posting by Kristian …

Thias の blog

Terence Eden Feb 19, 2025

@phrawzty I've built a (toy) format which uses it.
https://shkspr.mobi/blog/2017/03/kyli-because-it-is-superior-to-json/

KYLI - because it is superior to JSON

This is a (silly) attempt to fix some of the shortcomings of JSON. Hence it is named after the goddess of music. It uses C0 Control Characters Here is an example: ␜ ␁ This is a KYLI document ␂ ␝ GroupName ␞ data ␟ value ␛ Comments are supported too! They can be multilined easily. ␙ I've used Unicode Control Pictures so you can see what's happening.…

Terence Eden’s Blog

Dan 🌈 Feb 19, 2025

@Edent Ooh that's fun! I like how you're using all the symbols there. I get that it's all a bit tongue-in-cheek, but your "Why this is better" list actually is better :D

tajpulo Feb 19, 2025

@Edent
@phrawzty
Yeah, here is some repository which put even more effort into this: https://github.com/SixArm/usv (2022, so your blogpost predates it)

The issue, obviously, is that the symbols are non-printable and non-tech people do not know how to deal with them.

GitHub - SixArm/usv: Unicode Separated Values (USV) data markup for units, records, groups, files, streaming, and more.

GitHub

Terence Eden Feb 19, 2025

@tajpulo @phrawzty
Are non-techies going to be looking at JSON?

Dan 🌈 Feb 19, 2025

@Edent
Yes. 😅
@tajpulo

Kornelis Feb 19, 2025

@phrawzty I’m a lay person, but I still want to code sometimes. I like separators that are always visible. Perhaps that is naive. Anyway, I can see RS in vim as ^^, but I’m not sure that uniquely identifies RS. When I test RS in the fonts on my system, half display it as white space, the other half tell me RS does not exist. Not sure what that means.

Dan 🌈 Feb 19, 2025

@kornelis Font support is a whole thing, for sure. As is whatever underlying structure is interpreting, storing, and rendering the symbols. For example, when I try to paste the symbol into the web interface of mastodon (i.e. right now), it pastes as a unicode box, but the stored and/or rendered result just omits it entirely.

fun!

Andreu Casablanca 🐀 Feb 19, 2025

@phrawzty What's the difference between "Information Separator One" and "Information Separator Two"? (besides them having different code points, of course)

EDIT: Oh, I see there are at least four of them! 🤔

EDIT 2: Now I see that they map directly to the ASCII control characters group. I find it strange that the Unicode database does not keep any reference to their original names in the ASCII context.

Dan 🌈 Feb 19, 2025

@castarco The group is all meant to be used together! This comment from @Edent has a link to a toy implementation that makes the usages of the group more clear. https://mastodon.social/@Edent/114029501886306300