NB: I got around 68% in under 30 mins without thinking, and just over 70% in 3 hrs with thinking (i.e. way over the 30-min time limit). Q4_0 vs Q4_M made little difference. The exam client was actually built by the LLM itself, in Python, and it worked out of the box, bar some minor improvements (I had to read the docs).
At Q4 it's unsafe/unreliable: it sometimes won't give the same answer twice, and when it comes to safety, it's not a good idea to run an AI agent like that, supervised or not.
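A quick way to put a number on the "same answer twice" problem is to ask the same question several times and see how often the most common answer comes back. This is only a minimal sketch, not the exam client from this post: `answer_consistency` and the `ask` callable are hypothetical names, and `ask` stands in for whatever actually queries the model.

```python
from collections import Counter
from typing import Callable


def answer_consistency(
    ask: Callable[[str], str], question: str, runs: int = 5
) -> tuple[str, float]:
    """Ask the same question `runs` times.

    Returns the most common answer and the share of runs that gave it.
    `ask` is a hypothetical callable wrapping the real model call.
    """
    if runs < 1:
        raise ValueError("runs must be >= 1")
    # Normalise whitespace and case so "b " and "B" count as the same answer.
    answers = [ask(question).strip().upper() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs


if __name__ == "__main__":
    # Stand-in for a real model: canned replies, one of which disagrees.
    canned = iter(["B", "B", "C", "B", "B"])
    print(answer_consistency(lambda q: next(canned), "Question 1"))
```

A share well below 1.0 on a multiple-choice question is a hint that the answer should not be trusted without a second check.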
Eventually, Qwen3.5 35B A3B Q4_M thinking got 87.5% in 27 mins on a mock SAE exam using the llama.cpp WebUI, thus a PASS (just the same list of questions + verification by the model itself, and then by me).
Now, what's funny is that Sonnet 4.6 (Extended, i.e. Thinking) falls into the same pitfalls on the same questions as Qwen3.5 35B A3B Q4_M non-thinking 🤯
#Alibaba #Qwen #anthropic #sae #LLM #kaggle #AIsafety
©️ Nicolas Mouart, 2018-2026