Remember seeing something about GPT-4 doing well on standardized tests? It turns out it may have memorized the answers.
https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
#gpt4 #AIHype #ThisIsWhyWeDontTestOnTheTrainingData
GPT-4 and professional benchmarks: the wrong answer to the wrong question

OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots.

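The contamination claim is testable in principle: the standard check is to look for long verbatim n-gram overlaps between benchmark questions and the training corpus. A minimal Python sketch with hypothetical data (this is the generic technique, not OpenAI's actual procedure):

```python
# Generic train/test contamination check (a sketch with hypothetical data,
# not OpenAI's actual methodology): flag a benchmark question if any long
# word-level n-gram from it appears verbatim in the training corpus.

def ngrams(text, n=13):
    """Yield word-level n-grams; ~13-gram overlap is a common contamination signal."""
    words = text.lower().split()
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def looks_contaminated(question, corpus_index, n=13):
    """True if any n-gram of the question was seen verbatim in training data."""
    return any(g in corpus_index for g in ngrams(question, n))

# Hypothetical usage: index the training documents once, then screen each item.
training_docs = ["... full text of training documents ..."]
corpus_index = {g for doc in training_docs for g in ngrams(doc)}
exam_questions = ["... text of each benchmark question ..."]
flagged = [q for q in exam_questions if looks_contaminated(q, corpus_index)]
print(f"{len(flagged)} of {len(exam_questions)} questions overlap with training data")
```

A verbatim check like this misses paraphrased questions, which is one reason contamination is hard to rule out after the fact.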
@janellecshane not surprised in the least, but it's nice to see it confirmed
@janellecshane That’s exactly how Charles Van Doren and all the other scam quiz show “winners” were busted in the late 1950s.
@janellecshane "I was a good Bing" syndrom
@janellecshane interesting, as it's proof that it is a pure LLM with no understanding or actual intelligence: any human who had learned the training data would be able to pass a new exam, or at least make a good attempt.

@janellecshane @KevinMarks

ehhh memorize isn't even the right word. still anthropomorphizing too much, dagnamit

@quinn @janellecshane @KevinMarks talk of storage as computer memory, as in RAM/ROM, is standard
@mapto @janellecshane @KevinMarks when is the last time you said that your computer memorized that file?
@quinn @janellecshane I might have said it memory-mapped a file, but that's a different metaphor.
You could say that the LLM had already read the answers to those questions (again with the existing metaphor of reading data).
@KevinMarks @janellecshane It's just important to remember that these amazing new AIs are still closer to being very elaborate magic 8-balls than they are to being children, or even cats.
@janellecshane this behavior has been documented in other cases; e.g., see here: https://arxiv.org/abs/2202.07206
Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.

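The paper's core measurement is simple: rank test instances by how often their key terms appear in the pretraining corpus, then compare accuracy between the most- and least-frequent deciles. A toy Python sketch of that idea (hypothetical numbers, not the authors' code):

```python
# Toy version of the frequency/accuracy analysis (hypothetical data, not the
# authors' code): compare accuracy on the instances whose terms are most vs.
# least frequent in the pretraining corpus.
from collections import Counter

# Hypothetical inputs: (key term of the test instance, did the model get it right?)
results = [("meter", True), ("kilometer", True), ("furlong", False),
           ("fathom", False), ("mile", True), ("league", False)]
# Hypothetical term counts from the pretraining corpus (the paper used the Pile).
term_freq = Counter(meter=500_000, kilometer=90_000, mile=400_000,
                    furlong=40, fathom=300, league=5_000)

def accuracy(items):
    return sum(ok for _, ok in items) / len(items)

ranked = sorted(results, key=lambda r: term_freq[r[0]])  # rarest terms first
k = max(1, len(ranked) // 10)  # decile size (the paper compares top vs. bottom 10%)
gap = accuracy(ranked[-k:]) - accuracy(ranked[:k])
print(f"accuracy gap, most- vs least-frequent terms: {gap:+.0%}")
```

On real models the paper reports gaps of up to 70 absolute points between the top and bottom deciles, which is hard to square with genuine numerical reasoning.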
@janellecshane not sure I trust benchmarks based on in-house testing
@janellecshane this is how I got an A on my Chemistry A-level 🙃