Leon P Smith

@leon_p_smith@ioc.exchange
Communications engineer and mathematician. Longtime functional programming and Haskell enthusiast, occasional Schemer. Inventor of corecursive queues, postgresql-simple, an aggregate theory of concrete mathematics, and self-documenting cryptography. Currently aspiring to become an Epistemic Frame Engineer.
Pronouns: he/him
There’s a lot of very clear thinking in this paper! It hits close to home for me because it relates strongly to the challenge of predicting real-world #robotics performance from benchmarks, or how benchmarks can be misleading. https://arxiv.org/abs/2506.21521
Potemkin Understanding in Large Language Models

Large language models (LLMs) are regularly evaluated using benchmark datasets. But what justifies making inferences about an LLM's capabilities based on its answers to a curated set of questions? This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs -- such as AP exams -- are also those used to test people. However, this raises an implication: these benchmarks are only valid tests if LLMs misunderstand concepts in ways that mirror human misunderstandings. Otherwise, success on benchmarks only demonstrates potemkin understanding: the illusion of understanding driven by answers irreconcilable with how any human would interpret a concept. We present two procedures for quantifying the existence of potemkins: one using a specially designed benchmark in three domains, the other using a general procedure that provides a lower-bound on their prevalence. We find that potemkins are ubiquitous across models, tasks, and domains. We also find that these failures reflect not just incorrect understanding, but deeper internal incoherence in concept representations.


Great! A bunch of us here wanted it. Now it exists. 👍

It's a "dark archive" of the arXiv - a non-public backup to save the data in case of attack by hackers or the US government. The arXiv, I hope you know, is the biggest source of modern math and physics papers.

Who got the job done? The TIB: the Technische Informationsbibliothek, run by the Leibniz Information Centre for Science and Technology, in Hannover, Germany.

They write:

"The TIB has now set up a so-called dark archive for the arXiv content in order to be able to make the backed-up data accessible if the data stored in the USA is lost. The archive functions as a silent reserve: the complete copy of the content is stored decentrally at the TIB, but is not publicly accessible. This means that the data stock – almost 10 terabytes – is protected against potential outages and can be activated in an emergency.

The TIB is currently working on processes to keep the archive up to date: new submissions and updated versions must be backed up regularly in order to preserve the state of research as completely as possible.

“Building a Dark Archive is an expression of our longstanding commitment for a reliable, international academic provision, and as a partner of arXiv. Even though the Dark Archive today only works in the background, it is a key element in safeguarding digital research contents in the long term, because in case of a crisis, we could open the archive,” explains Dr Irina Sens, Deputy Director of the TIB."

We should call it the darXiv.

More details here:

https://blog.tib.eu/2025/05/14/protecting-science-tib-builds-dark-archive-for-arxiv/

Protecting Science: TIB builds Dark Archive for arXiv - TIB-Blog

Research and science are international; it is not for nothing that we speak of international specialist communities. Although a service such as arXiv is operated by an institution based in the USA, namely Cornell University, it is used by researchers worldwide. Part of arXiv’s funding has also been internationalised since 2010 with the introduction of arXiv membership. The TIB finances the German contribution together with the Helmholtz Association of German Research Centres (HGF) and the Max Planck Society (MPG). The TIB has now set up a so-called dark archive for the arXiv content in order to make the backed-up data accessible in the event that the data located in the USA is lost.


@inthehands But like, that's presumably fewer than 2.2 billion URLs to try, so it would be doable, at least if you do it slowly with a distributed indexer.

On the other hand, this feels like a fairly low-effort redirector site, so the chance might be good that you could just crawl it over the course of a few days from a small number of computers. Though if I were ethically compromised enough to build such a site, I'd probably try to identify scrapers and replace active URLs with redirects.

And... assuming they are sending a unique code upon every SMS text, you shouldn't have to go looking for long before you find something interesting. I'd guess you wouldn't need more than a few hundred URLs, tops.

Of course, they could recycle their URLs and replace them with sketchy redirects after some period of time, say a week or three, which means the number of active URLs could be much closer to their recent spam activity. In that case, I'd guess you might have to try a few thousand URLs before you find one that is interesting.
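The guesses above come down to one piece of arithmetic: if a fraction p of the code space is live, a random probe hits with probability p, so the chance of at least one hit in n probes is 1 - (1 - p)^n, and the average wait is 1/p probes. Here is a minimal Python sketch, assuming uniformly random probes treated as independent (drawn with replacement, which is close enough when the probes are a tiny slice of the space). Reading ~2.2 billion as 36^6 (six lowercase alphanumeric characters) is my assumption, and the active-code counts in the usage lines are hypothetical illustrations, not facts about this site.

```python
import math

def probes_needed(active: int, space: int, confidence: float) -> int:
    """Smallest n such that n random probes hit at least one active code
    with probability >= confidence (probes treated as independent)."""
    if not 0 < active <= space:
        raise ValueError("need 0 < active <= space")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    p = active / space  # chance that a single random probe lands on a live code
    if p == 1:
        return 1  # every code is live, so the first probe always hits
    # Solve 1 - (1 - p)**n >= confidence for n; log1p keeps small p accurate.
    return math.ceil(math.log1p(-confidence) / math.log1p(-p))

def expected_probes(active: int, space: int) -> float:
    """Mean number of probes until the first hit (geometric distribution)."""
    return space / active

# Hypothetical: six lowercase alphanumeric characters give 36**6 possible codes.
SPACE = 36 ** 6

if __name__ == "__main__":
    # Hypothetical live-code counts, e.g. with and without URL recycling.
    for active in (10_000_000, 1_000_000):
        print(active, expected_probes(active, SPACE), probes_needed(active, SPACE, 0.5))
```

The point of the sketch is that the answer scales with 1/p: cutting the number of live codes by a factor of ten multiplies the expected number of probes by ten, which is the same shift as going from "a few hundred" to "a few thousand" URLs.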

@inthehands Yeah, I don't know how saturated their code space might be, which is what you'd need in order to estimate how many URLs you'd have to try to have a reasonable chance of finding one that does something other than try to fake the non-existence of the domain.

Waaayyy more effort than what I'm willing to try, though.

@inthehands I'm getting 302-redirected to not-found.domain.

I'm guessing if you put in the actual links they sent you, you'd get redirected elsewhere. Sketchy as hell.

@jnl Honestly, an electric teakettle should be the most efficient way of heating water. A microwave isn't bad, though.

Personally, I don't much care, though given a choice I'd probably use an electric kettle.
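The efficiency question comes down to Q = m·c·ΔT: the energy the water itself needs is fixed, so what differs between a kettle and a microwave is the fraction of the electricity drawn that actually ends up in the water. Here is a minimal Python sketch, assuming c ≈ 4186 J/(kg·K) for liquid water; the efficiency values you pass in are placeholders, not measurements of any particular appliance.

```python
WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate value for liquid water
JOULES_PER_KWH = 3.6e6

def heat_energy_joules(mass_kg: float, start_c: float, end_c: float) -> float:
    """Q = m * c * dT: energy that has to end up in the water."""
    return mass_kg * WATER_SPECIFIC_HEAT * (end_c - start_c)

def electricity_kwh(mass_kg: float, start_c: float, end_c: float,
                    efficiency: float) -> float:
    """Electrical energy drawn, if only `efficiency` of it reaches the water."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return heat_energy_joules(mass_kg, start_c, end_c) / efficiency / JOULES_PER_KWH

if __name__ == "__main__":
    # Hypothetical efficiencies, only to show how the comparison works.
    for name, eff in (("appliance A", 0.8), ("appliance B", 0.5)):
        print(name, round(electricity_kwh(1.0, 20.0, 100.0, eff), 4))
```

For 1 kg of water from 20 °C to 100 °C the water needs 334,880 J, about 0.093 kWh, whatever heats it; the appliance with the higher real-world efficiency draws proportionally less than that divided by its efficiency.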

"Zohran Mamdani’s victory proves it: The ‘gotcha’ mode of fighting antisemitism has to go. …

When we reduce understanding of antisemitism to buzzwords — and say that we expect certain answers to certain questions and, if we don’t hear them, that means the candidate is an antisemite who has no place holding office — we confuse the definition of antisemitism. And we do nothing to actually, tangibly advance Jewish safety."

~ Emily Tamkin

#Mamdani #Jews #diversity #antisemitism
/9

You know, back in my day, we had static analysis tooling that would give you exactly this kind of feedback, except it was correct. Now we have shit which only looks at the vibes of the source text and does no semantic analysis whatsoever, so of course it's just fucking wrong.

Sent a pull request to Audacity fixing a crash bug I'd been running into frequently. The cause was an out-of-bounds memmove. Classic C++ areas.

Anyway, I got a fucking Copilot review on my PR which left two comments, both completely wrong, one of which suggested I reintroduce the out-of-bounds memory access. I'm furious!

Kinda hit me this morning how AI is an assault on gifting economies: Reddit, Wikipedia, GitHub, AO3 (even books/art, although those are more tangled with money-making) are all gifting economies that run on the idea that we all benefit by sharing. People freely give because it makes life better.
1/n