@CppGuy @orionkidder @grammargirl The error rate also depends on what is in the training data.
That is no doubt part of the problem with Grok, as its training data contains many unreliable statements garnered from X as well as deliberately added falsifications.
If the training data was just Wikipedia you would get more reliable results.
For other AI vendors, adding random chats from Facebook or Instagram, or AI-generated websites, will also lower the accuracy.
Claude Code may be slightly better, for now, because it is just plagiarising code. This won't last, as the code repositories fill up with AI slop and these are flagged as such and excluded by the crawlers.
The other reason is that GenAI doesn't simply reproduce single sources, whatever their accuracy. It acts as a stochastic mixer: if you see an AI-generated legal case reference, some of it may come from one citation and some from another, and the legal inference drawn may be from something entirely different.
Likewise, if you ask GenAI to summarise a document, it may well incorporate words from its training data as well as from the text you supply.