Just fantastic technology all around. Absolutely no worry where this is all going to go.

Is it just me? Am I using this wrong or am I asking questions that are too hard?

Here’s an example of a hallucination that happened while explaining away another hallucination I called it out on. I rarely have experiences other than these.

@mwichary

Hallucinations remain common, and the strongest mitigation is a combination of search use & "extended thinking". Even then, models may over-privilege low quality sources.

May I ask which version of ChatGPT this is from (paid or free; Thinking mode enabled or not)? Here's the result with 5.2 + Thinking mode, in case it's useful for comparison:

https://chatgpt.com/share/694cb369-b3d4-800b-8abc-e29d565910d8

@eloquence @mwichary ah yes the obligatory "you're just holding it wrong" response
@aburka @eloquence I am interested in learning this, though. I didn’t perceive this response as blaming me in any way.
@mwichary @eloquence not blaming you and maybe I misinterpreted but "you just have to try the latest model man" is a very common refrain from AI boosters
@aburka @eloquence And yet, examples shared by others so far show that the more recent models *are* better.
@eloquence Thanks! It is useful. This was on free and I sometimes wonder how that affects things.
@eloquence Sorry, is “search use” going to Google and/or verifying by other means?

@mwichary

OpenAI operates its own crawlers and also licenses search results (Bing, as part of their longstanding relationship with MSFT); it's undisclosed what the exact "mix" is that comprises responses.

@mwichary

In terms of free vs. paid, the free plan is heavily restricted. In the response I shared it spent 75 seconds in "thinking" mode. For more comprehensive reports they have a "Deep research" feature that can run for 5-10 minutes.

That increase does tend to improve the quality of responses, better attribution of claims to sources, etc. It does not obviate the need to verify, of course.

@eloquence @mwichary yes, they "operate crawlers" the same way the Death Star provides gentle illumination 😅
@mwichary @eloquence Here is what Gemini Pro gave me for your query. These are all legit but a bit broad. I followed up with "focus on women doing HCI research in the last 50 years" and got a much better list.
@scottjenson Yeah, Grace Hopper might be a bit of a stretch, I think…
@mwichary I'm not sure what point I'm trying to make. These systems WILL get better. My worry is that we'll just go from 20% batshit crazy to 10%. It's an improvement but....

@scottjenson @mwichary the problem is, whatever the percentage, if you have to know the field to detect which answers are wrong (or how they are missing key context or steps, etc.), then most users of such systems will either assume that the wrong answers are right

Or will eventually distrust all the answers even the real/correct ones and be unable to figure out what the right answers are

This is already happening as online search degrades and low-quality content spreads, making reliable info hard to evaluate

@scottjenson @mwichary I know that for topics I know at an expert level, AI’s answers (as found in places like Google’s AI-generated summaries, which you can’t easily escape when using Google search, but also in places like FB’s AI-generated content about stuff posted to Meta properties) are almost always deeply flawed and contain mistakes and hallucinations.

But when it is something I’m less expert on, it’s far harder even for me (with decades of search expertise) to find the accurate info now