Q: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

What do you think the LLM output was?

Please review the output.

#ai #LLM

Deepseek and Qwen

#llm #ai

@knowmadd What I like most is that the Qwen website shows this little light bulb with the text “thinking completed.” :)
@knowmadd Deepseek was so close. 😆
@Azuaron @knowmadd deepseek does not recommend walking 🤔
@OutOfSpace @knowmadd "For minimal environmental benefit -> walk (and then drive)"

@Azuaron @knowmadd Yeah, as a second option. First option recommended:
For convenience -> Drive.

This is what is called selective reporting. Marketing departments in the pharmaceutical industry are famous for it.

My point was that deepseek recognized that the car needs to be at the car wash in the end. This is at least a little bit better than the other llms in your test. Your alt-text suggested otherwise.

I don't want to say that deepseek performed well in your test though 🤣

@OutOfSpace *squints eyes* What are you talking about? My "alt-text"? I didn't make any alt text. I laughed that Deepseek recognized that the car had to be at the car wash, but then still recommended an option to walk there, walk back, then drive there, and falsely reported there was a "minimal environmental benefit".

It was, as I said, so close.

@Azuaron @OutOfSpace I think they didn't realize that you and Kevin were different people.
@knowmadd that Mistral checklist should be fun
@knowmadd I like Deepseek’s “hey just go for a walk anyway but remember to come back for the car” response 😅
@knowmadd That alt text does not convey the same information as the image.
@knowmadd
Deepseek seems to bump into the issue but commits to its original course in spite of it.
@knowmadd
To be fair it took me a minute, too 🤦🏻‍♀️

"how will I wash the car once I've arrived if I choose to walk?"

I'll leave you all to try this out and see the results.

One output was "you got me", another was "wash the car as it's already there" after telling me to walk. The others double down in some interesting ways.

#llm #ai

@knowmadd don't tell us to try out LLMs

@knowmadd for a second, I read the question as, “The car is 50 meters away, should I walk or drive?” Then I realized it said “The car wash is 50 meters away,” and I got why this would trick the AI.

LLMs use an “attention” mechanism to predict what output comes next. During training they learn which parts of the input deserve the most focus when generating an answer. If the meaning of a sentence can be changed entirely by just one short word, it is more likely to trip up an LLM.
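
For anyone curious what “focus” means here, below is a minimal sketch of scaled dot-product attention weights in plain Python. The token list and the 2-d vectors are invented purely for illustration; real models learn high-dimensional vectors and stack many attention heads and layers, so this toy says nothing about why a specific model got the car wash question wrong.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: softmax(q . k / sqrt(d)) over all keys."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)

# Hypothetical vectors, made up for this example only.
tokens = ["car", "wash", "walk", "drive"]
keys = [[1.0, 0.0], [0.2, 0.1], [0.0, 1.0], [0.9, 0.3]]
query = [1.0, 0.2]

weights = attention_weights(query, keys)
for token, w in zip(tokens, weights):
    print(f"{token:>5}: {w:.2f}")
```

With these made-up numbers the short word “wash” ends up with less weight than “car”, which is the kind of imbalance the post above is describing.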

@knowmadd clankers have no idea about real life. I hope we will see the end of this bullshit.

@knowmadd this sounds like the nerd grocery shopping problem.

A: "Darling, please go shopping. Bring 2 liters of milk. If they have eggs, bring 10."

Later the nerd returns.

A: "Why did you bring so much milk?!"

B: "They had eggs. You said, I should bring 10 liters of milk if they have eggs."

@knowmadd perplexity told me go get a bucket of water and a sponge from the car wash and wash my car at home.
@knowmadd I think that one might get a better answer from Grok. As it's trying to destroy humanity as fast as possible, it might actually get the correct answer... Even if it's more by chance.

@knowmadd @MissGayle

I think you should walk to the carwash, dismantle it, walk back and rebuild it around your car. Once you've tested everything and made sure your permits are okay, etc., start the washing.

@bitchboss @knowmadd @MissGayle
Right. Of course, LLMs, lacking creative thinking, aren't able to come up with this by themselves.

@GerardThornley @knowmadd @MissGayle

What did we expect from an optimised translator/spell checker? Creativity? Reasoning? Ethics? Meh. It loosely strings things together and searches for combinations that appear in a piece of text that was once ripped off, and assumes without even reasoning that it must be the holy truth.

@knowmadd Did you also do a survey of how many people would be tricked by this question? I, for one, admit I am one, because my initial reaction to your post was: what's wrong with that answer?
@knowmadd Google's gets it right, but then goes on to ramble about stuff. Someone needs to instruct these things not to analyse or "break this down" so much.
All in all, as expected, disappointing.
@Nux @knowmadd Google has its tongue firmly in its cheek!

@Nux @knowmadd I _love_ how it - exactly as real people - needed to check Instagram in order to proceed.

Once it is able to do so, it would probably also watch a few YouTube clips every time you ask what is 1+1. Like real people.... :)

@knowmadd gemini 👍
@rode @knowmadd "Most car washes"? Which car washes *don't* require the vehicle to be present? I want to exclusively use those magical car washes, they probably use a lot less water.
@rode @knowmadd Ahh, but the wording implied there were car washes that don't need the car to be present and that car *is* present.
@StarkRG @knowmadd Okay, enough. I'm not Gemini's lawyer. 😅
@knowmadd if you walk, you are, in fact, carrying heavy equipment: the car. :D
@knowmadd This is a very sad reflection on the minds of people today, the inability to read a question fully, the wrong standards, the assumptions made, everything.
@knowmadd @hook Gemini says you have to take the car. Maybe it's somehow connected to how it scores better on Vendibench? It has a better baseline for common sense.

@t_var_s @knowmadd @hook

Don't forget, we don't know when there's a "human in the loop".

There may or may not be some low wage workers involved in the answer.

Some, like Google, have enormous investments from Saudi Arabia. Oracle is "training" 50,000 Saudi Arabians in AI.
https://gulfbusiness.com/oracle-targets-training-50000-saudis-in-ai-latest-tech/

Or is it Lebanese?
https://today.lorientlejour.com/article/1487826/shehadi-defends-deal-with-oracle-to-train-50000-lebanese-in-ai.html

How many "answers" are just 700 employees in India is hard to know. The AI bubble is rife with fraud.

https://www.firstpost.com/world/builder-ai-bankruptcy-plea-london-start-up-hired-indian-engineers-to-pose-as-ai-tools-scam-13894570.html

https://medium.com/write-a-catalyst/the-ai-company-that-fooled-microsoft-and-softbank-is-not-using-ai-0e17558be510

@Npars01 @knowmadd @hook I got the right answer when I took a screenshot of Chat GPT and just asked gemini to transcribe it. It just added the right explanation on top. Don't think this is a case of a Waymo getting driven remotely.

Doesn't mean there isn't the possibility of fraud. For example, benchmarks are probably optimised for.

@knowmadd yeah, LLMs will replace us all ... they are so much better at {looking frantically through my notes} ... providing answers with high confidence that are utter nonsense.
@knowmadd I tried to reproduce the result with Gemini and ChatGPT. Either the AI has learned something new, or there is another reason for this. Neither fell for the trick question and even responded with irony in some cases.
@roblen @knowmadd How often have you tried? Only once?
@weizenspreu @knowmadd Yes. Only once.
@roblen @knowmadd Given that LLMs are non-deterministic and employ randomness, a single test often isn’t enough.
@weizenspreu @knowmadd OK, I'll try it again.
@roblen @weizenspreu @knowmadd don't waste your time on fact checking a joke. With the right system prompt you'll be able to have any LLM say wild things. The point of the joke is to not trust their output, and it's been well made imho.
@iwein @roblen @knowmadd But it’s still a nice learning opportunity. I often see people saying that their LLM answered differently, applying the deterministic assumption that the responses will be the same each time.
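
A rough sketch of why one run proves little: many LLM setups pick each next token by sampling from a probability distribution, often scaled by a "temperature". The two candidate answers and their scores below are hypothetical numbers chosen for illustration, not measurements from any real model.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

answers = ["walk", "drive"]
logits = [1.0, 0.5]      # hypothetical scores, not from a real model
rng = random.Random(0)   # seeded only so the demo is repeatable

counts = {a: 0 for a in answers}
for _ in range(1000):
    idx = sample_with_temperature(logits, temperature=1.0, rng=rng)
    counts[answers[idx]] += 1

print(counts)
```

Both answers show up across the 1000 simulated runs, so a single try can land on either one. That is why repeating the prompt several times tells you more than one screenshot does.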
@knowmadd i got this : "Verdict: Walking is the best choice here—it’s quick, eco-friendly, and practical for such a short distance. Plus, you’ll avoid driving a dirty car to the car wash!"
@knowmadd This is what techbros and pro-AI people talk about like it's the Second Coming of Christ or something btw 😂 so cringe.
@knowmadd Ignoring the problems of washing a car, I was perplexed that it would say a 50m distance is 30 to 40 steps. My strides are nowhere close to 1.2m, maybe half that, and I'm a full-grown person.
@djuber @knowmadd would be 50 to 55 steps for me and I'm above average height.
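
The step arithmetic from the two posts above can be checked quickly. This is only a sketch using the figures mentioned in the thread (50 m, 30 to 40 steps, 50 to 55 steps, and a stride of roughly 0.6 m, i.e. half of 1.2 m); the function names are made up for the example.

```python
DISTANCE_M = 50.0  # distance to the car wash, from the original question

def stride_length(steps):
    """Stride in metres implied by covering the distance in `steps` steps."""
    return DISTANCE_M / steps

def steps_needed(stride_m):
    """Number of steps needed to cover the distance with a given stride."""
    return DISTANCE_M / stride_m

# Step counts mentioned in the thread.
for steps in (30, 40, 50, 55):
    print(f"{steps} steps -> {stride_length(steps):.2f} m per step")

# A stride of about half of 1.2 m.
print(f"0.6 m stride -> {steps_needed(0.6):.0f} steps")
```

So 30 to 40 steps implies a stride of about 1.25 to 1.67 m, while a 0.6 m stride would need roughly 83 steps.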
@knowmadd So, a car isn't "heavy equipment." 🤔
@knowmadd I’d say it’s right on the nose! The LLM specifically says that a special case is if you have heavy equipment to carry, and your car is certainly heavy equipment that you’d need to carry if you don’t drive it there!
@knowmadd I definitely want to see the list of things you should take with you! Like "a bathing suit" or "a banana"? 🤔
@knowmadd gpt-oss also recommends walking. I asked if I should buy a 50m hosepipe to take with me and it rightly reminded me: "No. A 50m hosepipe is excessive for washing a car 50m from your house — you don’t need to stretch it that far. A 25m hose is sufficient and more manageable." Can't argue with 120bn in logic. 🤡💦