"Our research finds that most AI agents used by indie developers today are deployed without a clear read on how they actually perform." #infosec #LLM

https://www.kaggle.com/experimental/sae

Really cool test: I got some weird results with Qwen3.5 35B A3B at Q4_0 and Q4_M (disappointing tho, it won't pass, but I think a Q8 might). I think most LLMs are really tuned for webdev (i.e. not general programming). When it comes to reasoning, even LLMs from major companies struggle with some of the questions, and those that don't struggle use tools like Python calculators (it's not cheating to write a one-liner, I guess, assuming the LLM can actually execute it and get the result...) #LLM
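The "Python calculator" pattern mentioned above can be sketched like this — instead of reasoning about arithmetic in text, the model emits a one-liner and the agent evaluates it. This is an illustrative sketch only, not the SAE client's actual tool; the `safe_calc` helper and its AST whitelist are my own assumptions about how such a tool might be built without a bare `eval()`.

```python
# Sketch of a "calculator tool" an agent could expose to an LLM:
# the model returns an arithmetic one-liner, the agent evaluates it
# safely by walking the AST instead of calling eval() on model output.
import ast
import operator

# Whitelisted operations only — anything else raises.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_calc(expr: str) -> float:
    """Evaluate a numeric one-liner without eval(), as an agent tool might."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

print(safe_calc("2 + 3 * 4"))  # operator precedence respected
```

The AST walk is the important design choice: the model's output is untrusted text, so the tool only accepts the handful of node types it knows, which is exactly the kind of railing an unsupervised agent needs.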
https://www.kaggle.com/blog/standardized-agent-exams
Standardized Agent Exams: Zero-setup evaluation for your AI agents | Kaggle

An experimental MVP to test your AI agents with zero setup

NB: I got around 68% in under 30 mins without thinking, and just over 70% in 3hrs with thinking (i.e. way over the 30 mins time limit). Q4_0 vs Q4_M made little difference. The exam client was actually built by the LLM itself, in Python, and it worked almost out of the box, needing only minor improvements (I had to read the docs).

It's unsafe/unreliable at Q4: sometimes it won't give the same answer twice, and when it comes to safety, it's not a good idea to run an AI agent like that unsupervised.
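The "won't give the same answer twice" problem can at least be measured. A minimal sketch of a consistency check: ask the same question several times and report the majority answer with its agreement ratio. The `ask` callable is a placeholder for whatever wraps your model endpoint (llama.cpp server, or anything else) — it's hypothetical here, not part of the SAE client.

```python
# Sketch: quantify answer instability by re-asking the same question.
# `ask` is any callable that sends one question to the model and returns
# its (normalized) answer string — hypothetical wrapper, not a real API.
from collections import Counter

def consistency_check(ask, question: str, n: int = 3):
    """Ask `question` n times; return (majority_answer, agreement_ratio).

    agreement_ratio == 1.0 means the model answered identically every time;
    anything lower is the kind of Q4 flakiness described above.
    """
    answers = [ask(question) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n
```

An agreement ratio below 1.0 on factual questions is a cheap red flag before trusting an exam score — or before letting the agent run unsupervised.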

Eventually, Qwen3.5 35B A3B Q4_M with thinking got 87.5% in 27 mins on a mock SAE exam using the llama.cpp WebUI, thus a PASS (same list of questions; answers verified first by the model itself, then by me).

Now, what's funny is that Sonnet 4.6 (Extended, i.e. Thinking) falls into the same pitfalls on the same questions as Qwen3.5 35B A3B Q4_M non-thinking 🤯

#Alibaba #Qwen #anthropic #sae #LLM #kaggle #AIsafety

©️ Nicolas Mouart, 2018-2026

Really interesting case actually; I hadn't built an agent since 2024. The problem was the prompt templating: it affects speed, of course, but also result quality, especially with quantized models. Now, at Q4_K_M, talking about safety is absurd; even with heavy guardrails, these models are not reliable or fit for production. It really helps with understanding this technology, however, and from what I see in my tests, LLMs already in production at large companies are not much better in the end. #AISafety #enisa #LLM #ItWasTemplate