This https://aclanthology.org/2024.eacl-long.5/ is a very important paper, published at #EACL2024. But the sad truth is that this would have been avoidable if people had followed well-known best practices in science: avoid the hype, use local #llms in defined and controlled states (a minimal sketch follows the citation below). Reminds me of "Googleology is Bad Science" from 2007: https://aclanthology.org/J07-1010/
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs

Simone Balloccu, Patrícia Schmidtová, Mateusz Lango, Ondřej Dušek. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

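A minimal sketch of what "defined and controlled states" can look like in practice, assuming the Hugging Face transformers API (the model name is illustrative, not a recommendation): pin the exact weights and use deterministic decoding, so an evaluation can be re-run exactly.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "mistralai/Mistral-7B-v0.1"  # any locally hosted checkpoint
    REVISION = "main"  # pin to a specific commit hash for real runs

    # Pinning the revision fixes the weights; do_sample=False fixes the output.
    tokenizer = AutoTokenizer.from_pretrained(MODEL, revision=REVISION)
    model = AutoModelForCausalLM.from_pretrained(MODEL, revision=REVISION)

    inputs = tokenizer("The capital of France is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))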
In brief: The problem is that #ChatGPT has seen most of the samples in most of the benchmarks used. I.e., many evaluations involve testing on the training data.
Consequence: We have no idea whether OpenAI models really are as good as their leaderboard positions suggest.
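To make the contamination problem concrete, here is a hypothetical sketch of the kind of check you can only run when the training data is inspectable (which is exactly what closed-source models prevent): flag a benchmark sample if most of its word n-grams appear verbatim in the training corpus. Function names, n, and the threshold are my own choices, not from the paper.

    def ngrams(text, n=8):
        # All verbatim n-word sequences in a text, lowercased.
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def looks_contaminated(sample, training_text, n=8, threshold=0.5):
        """Flag a benchmark sample if a large share of its n-grams
        occurs verbatim in the (inspectable) training data."""
        sample_grams = ngrams(sample, n)
        if not sample_grams:
            return False
        overlap = len(sample_grams & ngrams(training_text, n)) / len(sample_grams)
        return overlap >= threshold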
@nilsreiter
Maybe we do. IMO this paper is strong evidence that OpenAI's models are incapable of any semblance of #reasoning, #math, or #planning. GPT-4 got ~10% accuracy on the easiest class of reasoning tests for humans (using LLMs' native language for representing reasoning: code), and 0% accuracy on medium and hard problems from post-2021 tests, which would have been difficult for OpenAI to incorporate into their training data to squeeze out other memorized solutions: https://arxiv.org/pdf/2312.02143.pdf
@hobs the paper I originally posted wasn't about GPT performance, but about our ability to evaluate it. I think the only sure way to control for this with ChatGPT / GPT-4 is to use each benchmark only once (a toy version of that discipline below).
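A toy illustration of the "each benchmark only once" discipline, with hypothetical names and a local JSON ledger: record which eval sets have ever been sent to a given closed model, and check the ledger before reusing one.

    import json
    import pathlib

    LEDGER = pathlib.Path("benchmark_ledger.json")  # hypothetical bookkeeping file

    def _load():
        return json.loads(LEDGER.read_text()) if LEDGER.exists() else {}

    def already_exposed(benchmark: str, model: str) -> bool:
        # A benchmark sent to a closed API once may end up in its training data.
        return benchmark in _load().get(model, [])

    def record_exposure(benchmark: str, model: str) -> None:
        ledger = _load()
        ledger.setdefault(model, []).append(benchmark)
        LEDGER.write_text(json.dumps(ledger, indent=2))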