Mastodawn

My friends drew my attention to this paper, which was written by computer scientists so pure, so theoretical, so far above the sins of the empirical plane, that they ran a benchmark comparing common serialization formats in their *mind palace* and came to a conclusion about which is faster https://arxiv.org/abs/2505.13478

An Extensive Study on Text Serialization Formats and Methods

Text serialization is a fundamental concept in modern computing, enabling the conversion of complex data structures into a format that can be easily stored, transmitted, and reconstructed. This paper provides an extensive overview of text serialization, exploring its importance, prevalent formats, underlying methods, and comparative performance characteristics. We dive into the advantages and disadvantages of various text-based serialization formats, including JSON, XML, YAML, and CSV, examining their structure, readability, verbosity, and suitability for different applications. The paper also discusses the common methods involved in the serialization and deserialization processes, such as parsing techniques and the role of schemas. To illustrate the practical implications of choosing a serialization format, we present hypothetical performance results in the form of tables, comparing formats based on metrics like serialization deserialization speed and resulting data size. The discussion analyzes these results, highlighting the trade offs involved in selecting a text serialization format for specific use cases. This work aims to provide a comprehensive resource for understanding and applying text serialization in various computational domains.

arXiv.org

Chip Unicorn May 21, 2025

@0xabad1dea looks at the paper

Wow. No one has ever, uhmmm, measured the speed difference between four of the most common text formats before. We definitely need to use a hypothetical computer for that!

David Chisnall (*Now with 50% more sarcasm!*) May 21, 2025

@Chip_Unicorn @0xabad1dea But that’s a comparison of implementations which, as they point out, brings confounding factors. Running experiments involves a lot of different sources of error. In contrast, if you simply make up your results, there is only one source of error. The latter is obviously a superior approach arising from Lemma 1.4: 1 < many.

Alessandra Sierra May 21, 2025

@0xabad1dea "To illustrate the practical implications of choosing a serialization format, we present hypothetical performance results in the form of tables …"

because, of course, *hypothetical* performance measurements are more reliable when you put them in tables 😂

Asta [AMP] May 22, 2025

@lambdasierra @0xabad1dea ah damn, you beat me to it 😂

Strider Uwe 🇺🇦🇨🇦🇲🇽 May 21, 2025

@0xabad1dea Wow. 21 authors too. Some resume padding maybe…

䷰ Xīn Jīn Mèng 新金梦 May 21, 2025

I didn't know we'd still see, "I have only proved it correct, not tried it" in 2025.

Eric McCorkle May 21, 2025

.....This is at the level of what I'd expect from someone's project paper for a course in a CS-adjacent associate's degree program. It is several tiers below a research paper.

Ludwig Vielfrass May 21, 2025

@0xabad1dea Now I’m wondering how many of the tech rants I’ve written over the years I can reformat in LaTeX and call “An Extensive Study”. Academia, here I come!

Trammell Hudson May 21, 2025

@0xabad1dea wait you can just do that?

Pxl Phile May 22, 2025

@th @0xabad1dea of course and it is a viable scientific method¹.

_______
Sources:

Jayne May 21, 2025

@0xabad1dea in their defense, computer science is less frustrating when you throw out the computers

Daniel Carosone May 21, 2025

@dotjayne you can still make this an opportunity for science! Mostly physics. Mass, acceleration under gravity, parabolic trajectories, even wind resistance if you measure carefully!

Dan Lyke May 21, 2025

@dotjayne @0xabad1dea I'm old enough that I went to college when they were just separating the CS department from the math department, so... yeah, they tried that once.

@danlyke
Were the remaining mathematicians happier afterwards?

@dotjayne @0xabad1dea

Dan Lyke May 23, 2025

@Landa yes. I think everybody benefitted.

@dotjayne @0xabad1dea

Ben Evans May 24, 2025

@Landa @danlyke @dotjayne @0xabad1dea Steady on. Imagining mathematicians happy sounds like a bit of a stretch to me.

Chris Bohn May 22, 2025

@dotjayne @0xabad1dea Something something astronomy and telescopes.

@dotjayne @0xabad1dea cience

Crissa Kentavr May 22, 2025

@0xabad1dea
This seems like the first step in creating a hypothesis, tho... Like saying 'this is x number of steps and that is y number of steps so we can compare them before dealing with the size of each step in each implementation...'

abadidea May 22, 2025

@Crissa sure, but… then where’s the rest of the paper…

Crissa Kentavr May 22, 2025

@0xabad1dea
Well, they drank the bottle instead of launching a ship with it.

CyberFrog May 22, 2025

@0xabad1dea absolutely wild LOL

The hypothetical results presented in the previous section, while not based on a real benchmark, align with the generally understood characteristics of the evaluated text serialization formats. Analyzing these hypothetical results allows us to discuss the typical trade-offs involved in choosing a format for a given application.

niconiconi May 22, 2025

@0xabad1dea It almost feels as if the paper and its benchmarks were LLM-generated, and the authors just did a s/benchmark/hypothetical benchmark/g.

Rob Jellinghaus May 22, 2025

@0xabad1dea "The simulated benchmark was run on a hypothetical standard computing environment."

I think they misunderstand the meaning of "run" here. Nothing was ever run. THEY NEVER RAN ANYTHING.

They really need the word "imagined" in there somewhere.

Lindsey Kuper May 22, 2025

@0xabad1dea honestly, this looks like some people's largely LLM-generated undergrad course project

Jeffy 🏳️‍⚧️🏳️‍🌈🇺🇦 ❤️🇱🇺 May 25, 2025

@0xabad1dea “sins of the empirical plane” feels sloppily redundant.