Just fantastic technology all around. Absolutely no worry where this is all going to go.

Is it just me? Am I using this wrong or am I asking questions that are too hard?

Here’s an example of a hallucination that happened while explaining away another hallucination I called it out on. I rarely have experiences other than these.

Useful things I have learned from people so far:
- try to use newer models and “thinking mode” (I just did the latter on Gemini, although it feels so very slow)
- you can try to prompt engineer to demand more truthfulness by cross checking (perhaps this is what thinking mode does?)
- using for recall is better than using to learn things
- obscure historical research in general is not going to feel great
- being as specific as you can helps
- be aware of biases perpetuated by AI and counter them
@mwichary research modes or thinking modes are better in many cases, but it's funny because they're just an LLM running on top of an LLM. It's the LLM refining and adding to the prompt before arriving at the answer. Like everything in this field, it doesn't resolve the core problems inherent to LLMs, it just tries to brute-force them into something resembling accuracy and truth.
@mwichary It’s not just you.
@beep Like, I have been around the block enough to learn about Gell-Mann Amnesia.

@beep @mwichary DEFINITELY not just you.

The problem is that it's amazing 20% of the time, which is all people can talk about. It gives the veneer of a revolution. But it's just OK about 60% of the time, and batshit crazy the remaining 20%. Those aren't good odds.

@scottjenson @beep Yeah, I was just thinking about that!!! You don’t expect Sturgeon’s Law to invade software like this, though. Google felt immune to it.

@scottjenson @beep I am getting a sense that it’s okay to ask for mainstream/obvious things. Anything that’s a bit more obscure almost never gets me anywhere useful.

I am sometimes finding it a good alt to Google for locating specific pages I know exist! (Which ironically is “you had one job” of Google.)

@mwichary @scottjenson @beep “Jeanine Raskin (or ‘Jeanine Raskin’)”

What?

@ramsey @mwichary @scottjenson @beep

The other one is probably in some other script such as Arabic or Cyrillic or Greek or Devanagari script, and transliterated to Latin script here.

The neural network doesn't handle words at all, only nodes. And the program has (correctly) connected the two nodes as one real-life entity. Джанинє Рашкін and Janine Rashkin are two different text strings, and therefore not the same concept unless linked as the same.

Funny all the same :)
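That string-level point is easy to demonstrate. A minimal Python sketch (the alias table is hypothetical, just to illustrate that the link between the two renderings has to be recorded explicitly somewhere):

```python
# Two transliterations of the same name are different text strings,
# so nothing equates them unless something explicitly links them.
cyrillic = "Джанинє Рашкін"
latin = "Janine Rashkin"
print(cyrillic == latin)  # False: distinct strings

# A hypothetical alias table is what makes them one real-life entity:
ALIASES = {
    cyrillic: "entity:janine-rashkin",
    latin: "entity:janine-rashkin",
}
print(ALIASES[cyrillic] == ALIASES[latin])  # True once explicitly linked
```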

@mwichary @scottjenson @beep the main thing I use it for is "what's the word for..." when I can describe the thing precisely. sort of a reverse wikipedia. (most recently: "that style of poetry which is similar to a villanelle but repeats entire lines rather than just the rhyming words.") works because I recognize the answer when I see it. I've also probably put too much time into prompt customization to get it to speak plainly and quit simpering.
@relsqui @mwichary @scottjenson @beep So like a reverse dictionary, like this: https://reversedictionary.org/

@WhiteCatTamer @mwichary @scottjenson @beep in concept, yes! a dictionary definition string search doesn't work as well for my given example though
@scottjenson @beep @mwichary
It’s calculating the most likely next token (a chunk of a few characters), picking the highest-percentage option to complete the combination so far.
That’s why a specific name like “Jeanine Raskin” has a near-infinite number of possibilities; few are real, and only one is the answer.
E.g. Jan (10%) / Jen (20%) / Jean (40%) → -ine (65%), then Rask (70%) / Rusk (50%) / Risk (25%) → -in (20%).
For humans, words are cognitively symbolic. For this tech, words are a statistical calculation over its vector corpus.
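That step-by-step guessing can be sketched as a toy greedy decoding loop. The probability tables below are made up and unnormalized, purely to mirror the percentages in the example (real tokenizers would not segment the name this way):

```python
# Toy sketch of greedy next-token selection: at each step, emit
# whichever candidate token has the highest probability given the
# text so far. All probabilities here are illustrative, not real.
NEXT_TOKEN = {
    "": {"Jan": 0.10, "Jen": 0.20, "Jean": 0.40},
    "Jean": {"ine ": 0.65, "ne ": 0.20},
    "Jeanine ": {"Rask": 0.70, "Rusk": 0.50, "Risk": 0.25},
    "Jeanine Rask": {"in": 0.20, "ins": 0.10},
}

def greedy_complete(text: str) -> str:
    """Append the most probable candidate until no table entry matches."""
    while text in NEXT_TOKEN:
        candidates = NEXT_TOKEN[text]
        text += max(candidates, key=candidates.get)
    return text

print(greedy_complete(""))  # Jeanine Raskin
```

Nothing in the loop checks whether the assembled string names a real person; a plausible-looking sequence of high-probability tokens is the entire output criterion.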
@scottjenson
@beep @mwichary so, you're claiming it has already surpassed the mythical "Natural Intelligence"??

@mwichary

Hallucinations remain common, and the strongest mitigation is a combination of search use & "extended thinking". Even then, models may over-privilege low quality sources.

May I ask which version of ChatGPT this is from (paid or free; Thinking mode enabled or not)? Here's the result with 5.2 + Thinking mode, in case it's useful for comparison:

https://chatgpt.com/share/694cb369-b3d4-800b-8abc-e29d565910d8

@eloquence @mwichary ah yes the obligatory "you're just holding it wrong" response
@aburka @eloquence I am interested in learning this, though. I didn’t perceive this response as blaming me in any way.
@mwichary @eloquence not blaming you and maybe I misinterpreted but "you just have to try the latest model man" is a very common refrain from AI boosters
@aburka @eloquence And yet, examples shared by others so far show that the more recent models *are* better.
@eloquence Thanks! It is useful. This was on free and I sometimes wonder how that affects things.
@eloquence Sorry, is “search use” going to Google and/or verifying by other means?

@mwichary

OpenAI operates its own crawlers and also licenses search results (Bing, as part of their longstanding relationship with MSFT); it's undisclosed what the exact "mix" is that comprises responses.

@mwichary

In terms of free vs. paid, the free plan is heavily restricted. In the response I shared it spent 75 seconds in "thinking" mode. For more comprehensive reports they have a "Deep research" feature that can run for 5-10 minutes.

That increase does tend to improve the quality of responses, better attribution of claims to sources, etc. It does not obviate the need to verify, of course.

@eloquence @mwichary yes, they "operate crawlers" the same way the Death Star provides gentle illumination 😅
@mwichary @eloquence Here is what Gemini Pro gave me for your query. These are all legit but a bit broad. I followed up with "focus on women doing HCI research in the last 50 years" and got a much better list.
@scottjenson Yeah, Grace Hopper might be a bit of a stretch, I think…
@mwichary I'm not sure what point I'm trying to make. These systems WILL get better. My worry is that we'll just go from 20% batshit crazy to 10%. It's an improvement but....

@scottjenson @mwichary the problem is, whatever the percentage, if you have to know the field to detect which answers are wrong (or how they're missing key context or steps, etc.), then most users of such systems will either assume that the wrong answers are right

Or will eventually distrust all the answers even the real/correct ones and be unable to figure out what the right answers are

This is already happening as online search degrades and content spreads so evaluating reliable info is hard

@scottjenson @mwichary I know that for topics I know at an expert level, AI answers (as found in places like Google’s AI-generated summaries that you can’t easily escape when using Google search, but also in FB’s AI-generated content about stuff posted to Meta properties, etc.) are almost always deeply flawed and contain mistakes and hallucinations.

But when it is something I’m less expert on, it’s far harder even for me (with decades of search expertise) to find the accurate info now

@mwichary you're actually second-guessing the nonsense it generates, which most people aren't bothering to do
@mwichary fwiw, "calling out" hallucinations is useless. It's just going to generate a typical-looking apology for hallucinating and then hallucinate something else. This is another cognitive trap of the chatbot metaphor -- you can't catch an LLM in a lie like Susan Calvin short-circuiting a positronic brain, because there's nothing in there.
@aburka Yeah, but it gives me a bit of satisfaction. 😂
@mwichary can't argue with that
@mwichary It has no concept of fact or fiction, reality or hallucination; you asked it to write a story about women UI designers, and it did. How you frame your question sets the tone for the answer
@Kroc …?
@Kroc You went from a few platitudes I clearly understand to… blaming me in a weird way? Not a great move.
@mwichary What are you expecting from this technology? A valid answer? Seems reasonable, but that's not how this technology works. It is literally just guessing the next word based on a large pool of other words. By including words like interface design, you are steering the "conversation" toward a different set of expected words, but that's all it is, random word guessing.

How it's able to produce a correct answer at all is frankly astonishing, but you are talking to a brick wall that relies upon you to already know the truth to sort fact from fiction, because it doesn't know one from the other. There is no algorithm for truth.
@[email protected] You were warned. Now you’re blocked.
@mwichary this is 100% typical of my attempts to use chatgpt. as soon as the need for concrete factual information comes up in the interaction (especially stuff that is actually difficult to research!) the model generates plausible-looking answers that turn out to be false, then generates apology-oid text that itself doubles down on plausible-looking answers, which itself contains still more falsehoods
@aparrish Yeah… I can see it potentially being useful as an accelerant to get to some information (although the ethical and ecological concerns remain as well), but not nearly as groundbreaking as people believe this is…
@mwichary even aside from the ethical and ecological concerns, i worry that the generated text is often, like, worse than wrong, in that it might predispose you to pursue certain lines of research that favor (broadly conservative) pre-existing ideas about what you're researching... as in this case, where the generated text kinda implies that among the most important ways that women contribute to UX is by being mothers and wives of famous men
@aparrish That’s a great point!
@aparrish @mwichary anchoring bias! I think about this whenever people say "I'm just using it to generate a starting point, it's fine" (it is not fine)
@mwichary Claude Opus 4.5 (probably overkill but my default) gave Lillian Gilbreth, Susan Kare, Lucy Suchman, Muriel Cooper, Joy Mountford, Brenda Laurel, and Elizabeth Sanders. I'm not familiar with all of them, but its explanations of them seemed to be in line with what I looked up, at least at a high level. I had "thinking" enabled in an incognito chat (no previous context), used your prompt verbatim, and it opted to answer without performing web searches. Will thread its full response.

@mwichary

Several remarkable women have shaped the fields of interface design, human factors, and ergonomics:

**Lillian Gilbreth (1878–1972)** is often called the "mother of modern management." She was among the first working female engineers with a PhD and pioneered time-and-motion studies alongside her husband Frank. After his death, she focused on applying ergonomic principles to domestic spaces—designing the "work triangle" kitchen layout, foot-pedal trash cans, and refrigerator door shelves. She consulted for GE, Johnson & Johnson, and other companies on making products more usable.

**Susan Kare (b. 1954)** created the original Macintosh icons, fonts (Chicago, Geneva, Monaco), and interface elements in the 1980s—the happy Mac, the command symbol, the trash can, the paint bucket. She essentially invented the visual language of the graphical user interface and later designed icons for Microsoft, Facebook, and Pinterest.

**Lucy Suchman (b. 1951)** worked at Xerox PARC and wrote *Plans and Situated Actions* (1987), which fundamentally challenged assumptions about how people interact with machines. Her ethnographic approach influenced how designers think about context and real-world use rather than idealized user behavior.

**Muriel Cooper (1925–1994)** founded the Visible Language Workshop at MIT Media Lab and pioneered digital typography and information design. Her work on dynamic, three-dimensional information spaces anticipated many modern interface concepts.

**Joy Mountford** led Apple's Human Interface Group in the late 1980s and early 1990s, shaping the direction of Mac interface design during a formative period.

**Brenda Laurel (b. 1950)** wrote *Computers as Theatre* (1991), applying dramatic theory to interaction design, and was a pioneer in virtual reality research and designing technology for girls.

**Elizabeth Sanders** developed participatory design and co-design methods that brought end users into the design process as active contributors rather than passive subjects.

Would you like me to go deeper on any of these figures or explore others in related areas?

@aschmitz This is better than what I got, I think. I don’t think Lucy Suchman came up for me.
@aschmitz I used ChatGPT and it gave me a lot more names and categories, but alas at least one was made up.

@mwichary If I ask for ten more and very brief descriptions, I do get more:

~~~
**Yvonne Rogers** – cognitive scientist whose work on ubiquitous computing and external cognition shaped how we think about technology in everyday environments

**Wendy Mackay** – HCI researcher at Inria who pioneered work on augmented reality, paper interfaces, and participatory design methods

**Bonnie John** – developed GOMS modeling techniques for predicting human performance with interfaces, long associated with Carnegie Mellon

**Jodi Forlizzi** – Carnegie Mellon professor known for service design and social robotics research

**Gillian Crampton Smith** – founded the influential interaction design program at Royal College of Art and later the Interaction Design Institute Ivrea

**Terry Winograd** – actually, let me replace this one

**Pattie Maes** – MIT Media Lab professor who pioneered software agents and wearable computing interfaces

**Abigail Sellen** – Microsoft Research scientist known for work on the psychology of reading, paper vs. screens, and memory in HCI

**Irene Au** – led user experience and design at Google (2006–2012), scaling UX practice across the company

**Karen Holtzblatt** – co-created Contextual Design, an influential user-centered design methodology

**Loren Terveen** – actually, let me replace

**Sara Kiesler** – Carnegie Mellon social psychologist whose research on human-robot interaction and online communication was foundational
~~~

Upon further questioning, it indicates it replaced Terry and Loren because they're men. I'm actually surprised to see that happen, I've never seen it "backtrack" in non-thinking-preamble text. Claude is pretty good at following directions, I suspect I could ask for a dozen from each category too. But I'm far out of my depth at knowing off the top of my head whether it's right or not, and not in a position to research to confirm them right now.

@aschmitz Yeah, this backtracking is strange and interesting and bad in its own way! Also, I worked for Irene back in the day.
@mwichary this is what happens every time i've turned to an LLM as a last resort. it tells me something incorrect, i ask about it, then it responds as though *i* made the mistake
@mwichary ChatGPT has gotten a lot worse recently, with 5.1 and now 5.2. it is making shit up a lot more than it used to. I'm using Google Gemini more lately and it is a lot better than it used to be.
@nelson Should try that more!
@mwichary Aggressively specific prompting can help a bit, like: “Can you give me three specific examples of women who made significant accomplishments in interface design, human factors, or ergonomics work? I need you to check to make sure they really exist, and cite evidence of their excellence.” ChatGPT (free) gave me Susan Kare, Susan M. Dray, and S. Joy Mountford. Claude (free) gave Enid Mumford, Dray, and Lucy Suchman. Real, at least. But history research questions are not their strength.
@mwichary I find this kind of use case more interesting, essentially a way to get some editing suggestions when stuck on a draft or wanting a different perspective, although not as good as a real person (of course): https://www.techdirt.com/2024/04/29/how-i-use-ai-to-help-with-techdirt-and-no-its-not-writing-articles/
@brittag The last sentence might be key; I wonder if I want that disproportionately more often than others?
@mwichary I despise the whole first person responses. It’s such deceptive bullshit.