3/

"As expected, the accuracy of the answers increased as the refined models became larger and decreased as the questions got harder [...]
The fraction of wrong answers among those that were either incorrect or avoided rose as the models got bigger, and reached more than 60% for several refined models" [1]

The study "found that all the models would occasionally get even easy questions wrong, meaning there is no ‘#SafeOperatingRegion’ in which a user can have high confidence in the answers".