Poren Chiang

@RSChiang@g0v.social
88 Followers
198 Following
365 Posts
FLOSS developer and digital law researcher.
📡 digital policy · electronic voting · information law
📌 Information law · translation · free software
🌏 https://poren.tw
🐙 https://github.com/RSChiang
Please help us reach #ios developers. We need their feedback on our survey https://fsfe.org/news/2025/news-20250618-01.en.html in order to keep #Apple accountable under the #DMA in a developer-friendly way. #DeviceNeutrality
DMA: tell us how gatekeepers are handling your interoperability requests - FSFE

Free Software developers: your voice is needed! The Free Software Foundation Europe has launched the Digital Markets Act Interoperability Survey to gather ...

FSFE - Free Software Foundation Europe

I'd like to quietly share an acoustic-guitar singer I really like, @chanwenju! The light, upbeat melodies are soothing whether it's a weekend morning or the middle of the night.

《快樂旅社》
• iTunes Store: https://music.apple.com/tw/album/%E5%BF%AB%E6%A8%82%E6%97%85%E7%A4%BE/1822804717
• Spotify: https://open.spotify.com/album/01wtvbyk51x128s26faatl

#乳齒象友的深夜歌單 (Mastodon friends' late-night playlist)

《快樂旅社》 by 詹雯如 on Apple Music

Album · 2025 · 9 songs

Apple Music - Web Player

📢 We've sat down with our artist @dopatwo and created a sticker pack for @signalapp. Now you can send cute elephants to your friends, and promote the #fediverse at the same time. We ❤️ Signal, too!

https://signal.art/addstickers/#pack_id=43a9c3e16e24b2f182e2d3e03a7e1338&pack_key=87a129905fbe7371568eef6485f93a81b7569a963bf711063bf804123a075083

The Third Edition of Portable Network Graphics (PNG) today became a W3C Standard.

It adds support for High Dynamic Range (HDR) and brings Animated PNG into the official standard.

https://www.w3.org/TR/png-3/

Portable Network Graphics (PNG) Specification (Third Edition)

This document describes PNG (Portable Network Graphics), an extensible file format for the lossless, portable, well-compressed storage of static and animated raster images. PNG provides a patent-free replacement for GIF and can also replace many common uses of TIFF. Indexed-color, greyscale, and truecolor images are supported, plus an optional alpha channel. Sample depths range from 1 to 16 bits.
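As a small illustration of the format described above, here is a minimal sketch (assuming a local placeholder file named example.png) that checks the PNG signature and reads the width, height, bit depth, and colour type from the IHDR chunk; this chunk layout long predates the third edition and is unchanged by it.

```python
import struct

# Minimal sketch: verify the PNG signature and read basic image properties
# from the IHDR chunk. "example.png" is just a placeholder filename.

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

COLOR_TYPES = {
    0: "greyscale",
    2: "truecolour",
    3: "indexed-colour",
    4: "greyscale with alpha",
    6: "truecolour with alpha",
}

with open("example.png", "rb") as f:
    if f.read(8) != PNG_SIGNATURE:
        raise ValueError("not a PNG file")

    # Each chunk: 4-byte big-endian length, 4-byte type, data, 4-byte CRC.
    length, chunk_type = struct.unpack(">I4s", f.read(8))
    if chunk_type != b"IHDR":
        raise ValueError("first chunk must be IHDR")

    width, height, bit_depth, color_type = struct.unpack(">IIBB", f.read(10))
    print(f"{width}x{height}, {bit_depth}-bit, {COLOR_TYPES.get(color_type)}")
```

The new HDR and animation features are carried in additional chunks after IHDR, so a basic reader like this sketch still works on files that use them.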

If you're one of my academic publishing folks: sadly, it's true. Due to a funder's unexpected decision to pull support, we've had to make the incredibly difficult decision to wind down PubPub over the next 18 months and regroup to figure out how to best serve our mission.

I'm sure I'll have more to say, but today I'm feeling both gratified and saddened by the overwhelmingly supportive responses to our announcement.

I urge you to learn from our mistakes: https://www.knowledgefutures.org/updates/2025-06-update/

Not Enough: Open Infrastructure Funding and the Future of Knowledge Futures

Knowledge Futures' mission is to make information useful.

I rarely subtoot, but when I do, it's just to say: if an open source project that your commercial project depends on breaks something in your software stack, causing you trouble, no matter how much, that's your problem and your problem alone.

"The software is provided as is" is a part of OSS licenses for a reason, and unless we have a contract that says otherwise, I'm not part of your bloody "supply chain".

Oh the humanity…
My #COSCUP submission got accepted! Please join me this year to look at how to hold vendors liable for their negligence! (・ω・)ノ (hey)
Study: Meta’s Llama 3.1 can recall 42 percent of the first Harry Potter book
The research could have big implications for generative AI copyright lawsuits.
https://arstechnica.com/ai/2025/06/study-metas-llama-3-1-can-recall-42-percent-of-the-first-harry-potter-book/?utm_brand=arstechnica&utm_social-type=owned&utm_source=mastodon&utm_medium=social

“Potemkin Understanding in Large Language Models”

A detailed analysis of the incoherent application of concepts by LLMs, showing how benchmarks that reliably establish domain competence in humans can be passed by LLMs lacking similar competence.

H/T @acowley

Link: https://arxiv.org/abs/2506.21521

@gregeganSF @acowley
Maybe this could lead to better exam questions?
@Steveg58 better exam questions for LLMs? Or for humans?
In the former case, who cares? In the latter, what would LLM error modes have to do with human error modes?
@kevingranade
To create exam questions that require the sitter to demonstrate understanding. The current ones obviously don't. Any exam question that allows someone with no actual knowledge to pass is obviously flawed.

@Steveg58 @kevingranade

The paper makes the case that when humans misunderstand things, they usually do so in a very different way than LLMs — and that exams that are genuinely pretty good at assessing humans’ competence can nonetheless be terrible at assessing LLMs.

This doesn’t mean all current exams are perfect at assessing humans, but it’s a big [and unjustified] leap to go from “an LLM can ace this exam while having no competence at applying the relevant concepts” to “a human who aces this exam might understand as little as the worst LLM that could also ace this exam”.

We might have to change exams to catch out humans who are cheating with real-time assistance from concealed LLMs, but that’s a different matter.

@gregeganSF @acowley

This. So much this. I had the following exchange with Claude a couple of months ago (lightly condensed and paraphrased):

Me: draw the structure of naphthalene, please

Claude: (produces something with the right shape and general connectivity of the atoms, but one carbon atom has five bonds, a classic oopsie in Intro Organic Chem)

Me: one of the carbon atoms has the wrong number of chemical bonds. Please fix it.

Claude: I’ve redrawn it. (Its effusive self-directed praise about how perceptive I am and how it successfully redid the drawing has been deleted.)

(a human who understands that carbon is limited to 4 bonds, and usually has exactly 4 bonds in neutral molecules, would find the error and fix it easily. Claude did…not. Its revised structure had *four* carbon atoms with five bonds.)

Me: your structure has four carbon atoms with the wrong number of bonds.

Claude: I’ve redrawn it. (Again, much obsequiousness deleted.)

(Claude has drawn the same exact structure as the previous one, except a couple of the bonds have been shifted by a couple of pixels.)

That was Claude. I tried the same thing on Gemini and it was immeasurably worse, culminating in a structure that had carbon bound to three other carbons *and three hydrogens* and that wasn’t even the weirdest thing about it.

@gregeganSF @acowley Ah this ratio of 19/20 accuracy explains a lot!

This kind of accuracy/inaccuracy ratio is actually really difficult for humans to understand.

It is accurate enough of the time to throw a facade over its non-understanding, but not accurate enough to trust in the long run.

It passes just under our instinctive thresholds for validating a person's understanding of a concept. We let it pass, because a human who gets 19 out of 20 right "gets it" and would improve, whereas an #ai doesn't.

#llm

@gregeganSF @acowley reminds me of Feynman's anecdote about physics students in Brazil who could perfectly recite textbook statements about the polarisation of light, but were incapable of applying them to a very simple problem.

@gregeganSF @acowley
It's great to have this work, to show how (depending on the viewpoint) models are able to solve some problems even though they don't understand, or some benchmarks are too easy to solve by pattern matching and don't measure understanding.

It would be even nicer if they had used best practices when invoking the models, such as chain-of-thought prompting, especially given that some of the problems require deliberate consideration when people do them. Some of the problems also touch on known issues with language understanding, such as counting syllables, but most of the tasks are free of these.
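For readers who haven't seen the technique, a minimal sketch of a chain-of-thought prompt next to a direct prompt; query_model is a hypothetical stand-in for whatever client call you use, and the syllable question is just an illustrative task of the kind mentioned above.

```python
# Sketch only: `query_model` is a hypothetical placeholder, not any vendor's API.
def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM client of choice.")

question = "Which word has more syllables: 'banana' or 'grape'?"

# Direct prompt: the model must produce the answer in one step.
direct_prompt = question + "\nAnswer with a single word."

# Chain-of-thought prompt: ask for intermediate reasoning first, which tends
# to help on tasks that people also solve by deliberate, step-by-step thought.
cot_prompt = (
    question
    + "\nFirst, break each word into syllables and count them step by step."
    + "\nThen give your final answer on its own line."
)

# answer = query_model(cot_prompt)
```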

@gregeganSF @acowley

"incoherent application of concepts"

Reminder that no concepts are involved in a random (Markov) walk through word space. Shannon 1948.

From Pogo: "We could eat this picture of a chicken, if we had a picture of some salt."
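To make the claim concrete, here is a toy sketch of the kind of word-space random walk being invoked: a word-level Markov chain in the spirit of Shannon 1948, using the Pogo quote above as its entire (purely illustrative) corpus.

```python
import random
from collections import defaultdict

# Toy word-level Markov chain: next-word choices are driven purely by
# co-occurrence counts in the corpus, with no representation of concepts.

corpus = "we could eat this picture of a chicken if we had a picture of some salt".split()

# Record, for each word, the words observed to follow it.
successors = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    successors[current_word].append(next_word)

def random_walk(start: str, steps: int = 8) -> str:
    """Generate text by repeatedly sampling a successor of the current word."""
    word, output = start, [start]
    for _ in range(steps):
        choices = successors.get(word)
        if not choices:
            break
        word = random.choice(choices)
        output.append(word)
    return " ".join(output)

print(random_walk("we"))
```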

@glc @gregeganSF @acowley But that is not the word space they're walking...

@dpwiz @gregeganSF @acowley

I suppose you must be referring to Pogo, which is not, for the present purposes, even a word space (or: not fruitfully treated as such).

@glc @gregeganSF @acowley no, the LLMs aren't operating in **word**-space.

@dpwiz @gregeganSF @acowley

Are you trying to distinguish tokens and words?

Or do you have a point? If so, what is it?

@glc @gregeganSF @acowley No, bytes/tokens/words/whatever is irrelevant. The important part that's wrong in the "word-space" model is that it misses the context. The "language" part is a red herring. What's really going on is a tangle of suspended code that's getting executed step by step. And yes there are concepts, entities, and all that stuff in there.

@dpwiz

I'd say there is syntax without semantics (in the traditional sense of formal logic, that is).

You have some other view evidently.
That much is now clear.

I don't see much difference from Markov and Shannon, apart from some compression tricks which are needed to get a working system.

@glc Perhaps. I just hope this is not another "X is/has/... Y" claim.
What's your favorite or most important consequence of this distinction?

@dpwiz

That no concepts are involved, and the numerous corollaries of that, I suppose. At least, that's what I find myself harping on now and then.

I have no strong interest in the details, though considerable interest in watching this play out.

Someone like Cosma Shalizi actually gets into the weeds a bit more:
http://bactra.org/notebooks/nn-attention-and-transformers.html

You'll probably find much to agree with and much to disagree with there. And at adequate length.

"Attention", "Transformers", in Neural Network "Large Language Models"

@glc > I find this literature irritating and opaque.

That's a promising start! (8

@gregeganSF @acowley in short: an LLM stores data. It doesn't UNDERSTAND that data.

No surprise here.
@gregeganSF @acowley I don't know why so many scientists seem to suck at using LLMs, but Claude Sonnet 3.7 had no problem correctly answering these 3 questions on the first try.
@dgavin @gregeganSF @acowley Well many of us prefer to think for ourselves. After all, that's our job.
@ariaflame @gregeganSF @acowley That’s not the point. I prefer that too. But there are too many scientific studies that just dont stand up to peer review in the LLM field.
@dgavin @gregeganSF @acowley Well let's put it this way. If scientists suck at using it, what do you think the average person is going to be like trying to use it and interpret its output?
@ariaflame @gregeganSF @acowley I started using computers in the '80s, and to this day I am still trying to teach scientists at our university to use Word. Most people suck at using computers in general, and we still use them more than any other tool humanity has ever invented. We will have to tame AI, because people will use it no matter what.
@dgavin @gregeganSF @acowley At least until the bubble collapses because it isn't magic and won't give the investors the financial rewards they are hoping for. And I started using computers then too. I am fully aware of GIGO and LLMs just magnify this.
@dgavin @gregeganSF @acowley And it's not as if the output of LLMs is consistent. So your chances of getting accurate information from it are unknown.
@ariaflame That has been known for years. That's why LLMs are not a tool for scientific writing. You don't use a hammer to drive in a screw; that doesn't mean a hammer is a useless tool. LLMs are extremely good at other things humans suck at, such as making sense of large amounts of data.
@dgavin No, they don't make sense of anything. They are stochastic parrots predicting the most likely next word. They don't think. They don't analyse. Now, neural nets trained on specific tasks, there is some hope there, but LLMs are not an analysis tool, and I would be extremely dubious of any data summaries they 'produced'.