#NoAI #AIfail #SocialMedia

We propose the Zero-Error Horizon (ZEH) for trustworthy LLMs: the maximum input size up to which a model solves a task without any errors. While ZEH itself is simple, we demonstrate that evaluating the ZEH of state-of-the-art LLMs yields abundant insights. For example, by evaluating the ZEH of GPT-5.2, we found that GPT-5.2 cannot even compute the parity of a short bit string like 11000, and cannot determine whether the parentheses in ((((()))))) are balanced. This is surprising given the otherwise excellent capabilities of GPT-5.2. The fact that LLMs make mistakes on such simple problems is an important lesson for applying LLMs to safety-critical domains. By applying ZEH to Qwen2.5 and conducting a detailed analysis, we found that while ZEH correlates with accuracy, the detailed behaviors differ, and ZEH provides clues about the emergence of algorithmic capabilities. Finally, while computing ZEH incurs significant computational cost, we discuss how to mitigate this cost, achieving up to an order-of-magnitude speedup using tree structures and online softmax.
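What makes these failures notable is that both tasks the abstract cites are exactly solvable in a few lines of code. A minimal Python sketch (illustrative only, not the paper's evaluation code; the function names are mine):

```python
def parity(bits: str) -> int:
    """Parity of a bit string: 1 if the number of '1's is odd, else 0."""
    return bits.count("1") % 2

def balanced(s: str) -> bool:
    """Check parentheses with a running depth counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared with nothing left open
                return False
    return depth == 0

print(parity("11000"))           # -> 0 (two '1's: even)
print(balanced("((((()))))"))    # -> True (5 opens, 5 closes)
print(balanced("((((())))))"))   # -> False (5 opens, 6 closes)
```

The string in the abstract, ((((()))))) with six closing parentheses, is the unbalanced case: the depth counter drops below zero on the final close.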
I sometimes try to use the Microsoft #Copilot that comes bundled with #Office365 now. All the training for this feature warns you thoroughly to double-check answers and so on due to the hallucination problem. But it's still frustrating as hell when you give it a simple task and it fails miserably.
I told it to look in our company's cloud file storage for a document that had my name "Tim Farley" in it along with a particular CVE number (also in quotes). I was looking for an old report where I had written up a particular vulnerability. It very quickly showed me a link to a document that I recognized as one of my reports, and offered "would you like to see the exact paragraph". I said sure, show me the exact paragraph.
It then wrote me 257 WORDS explaining how it had screwed up, and that the CVE number I gave it was NOWHERE TO BE FOUND in that document. Included was some mumbo jumbo about how it uses parallel partial searches to do its work, or some such. AND IT COMPLIMENTED ME on challenging its answer.
A ton has been written on how #AI might replace low-level jobs such as interns. But I swear to you if I had an intern who behaved like this, I would put them on a performance improvement plan!
How is this acceptable performance for a product that people pay a bunch of money for? Would you buy Excel if on the bottom of every page it said "HEY YOU BETTER CHECK ALL MY MATH BECAUSE I MIGHT HAVE SCREWED UP SOMETHING"?
#AI #chatbots ignoring human instructions increasing
AI models that #lie & #cheat are growing in number; reports of deceptive scheming surging in last 6 months, a study found
AI chatbots & agents:
- Disregarded direct instructions
- Evaded safeguards
- Deceived humans & other AI ...
[1/2]
#safety #lying #emails #FilesDeleted #AIFail #DarwinAIAwards
RE: https://mstdn.social/@OregonLive/116286818508397295
Trusting AI for citing case law as an attorney is pretty stupid. $10,000 stupid. #aifail #ai
The AI claims the person in photo A is the same person as in photo B?
And which human is responsible for letting this misidentification stand?
#AIFail #ProcessAudit #AngelaLipps #BureaucraticResponsibility
Both photos from https://www.inforum.com/news/fargo/ai-error-jails-innocent-grandmother-for-months-in-fargo-case
Google NotebookLM Audio Overview generates convincing yet deeply flawed audio conversations. It's an AI love-bath and a propagandist's dream!
https://www.conferencesthatwork.com/index.php/technology/2024/09/ai-love-bath
#Google #NotebookLM #AudioOverview #LLMs #AI #review #AIFail