Mastodawn

Janelle Shane Aug 8, 2025

New AIWeirdness post: ChatGPT will apologize for anything, including stuff that didn't happen. It's not reflecting on stuff it did wrong, it's improv.
https://www.aiweirdness.com/chatgpt-will-apologize-for-anything/

david turgeon

@janellecshane i was wondering some stupid thing the other day: if chatgpt outputs something that's verifiably true (e.g. “water is wet”) & you retort that it's actually false, does it apologize? my guess would be that it does.

Pete Alex Harris🦡🕸️🌲/∞🪐∫ Aug 8, 2025

@dt @janellecshane
Very likely, because it has no internal model of factuality or semantics. "Water is wet" and "Water is dry" are syntactically equivalent, just one is more frequently found in the training data. If apologising for a statement is the most probable response to a challenge or rebuke in the training data, it'll generate an apology-shaped token stream as expected.

George B Aug 9, 2025

@dt @janellecshane

"Verifiably true" is meaningless in this context since the LLM has no way to verify the truth of statements, just a way to make them.

david turgeon Aug 9, 2025

@gbargoud @janellecshane i know how llms work. but if a million monkeys with typewriters end up writing the sentence “water is wet”, that sentence is verifiable even though the monkeys have no idea what they've done. if a parrot utters the sentence “water is wet” then it has unknowingly uttered a verifiable sentence. if i read “water is wet” in a book, it's a verifiable sentence even if i am to later learn that it was generated by some algorithm.

George B Aug 10, 2025

@dt @janellecshane

But since the LLM has no internal concept of true or false it treats verifiably false statements exactly the same as verifiably true ones.

david turgeon Aug 10, 2025

@gbargoud @janellecshane that should be right but so far i've only seen examples of “apologies” for producing statements which we know are obviously false. what about “apologies” for producing statements which we know are true? the model doesn't know truth from falsehood, sure, but does it have something like a “confidence score” that makes it slip more easily into “apology mode” when countered about certain statements which it considers to have a low score? that's mostly what i'm wondering.
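
For readers wondering what a “confidence score” could even look like here: a language model assigns a probability to each candidate next token, and the rough sketch below shows how such probabilities come out of a softmax over raw scores (logits). The logits and the three-token vocabulary are invented for illustration and do not come from ChatGPT or any real model. Whether such a probability tracks truth, or affects how readily a model apologizes, is exactly the open question in the thread.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical logits for the token following "water is".
# These numbers are made up for the example, not measured from any model.
toy_logits = {"wet": 4.0, "dry": 1.0, "blue": 0.5}

probs = softmax(toy_logits)
top_token = max(probs, key=probs.get)

for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok}: {p:.3f}")
```

In this toy setup “wet” gets the highest probability only because it was given the highest score, which mirrors the point made earlier in the thread that the more frequent continuation wins regardless of whether it is true.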