Remember seeing something about GPT-4 doing well on standardized tests? It turns out it may have memorized the answers.
https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks
#gpt4 #AIHype #ThisIsWhyWeDontTestOnTheTrainingData
GPT-4 and professional benchmarks: the wrong answer to the wrong question

OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots.

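The contamination claim is testable in principle: the standard check is to look for long verbatim n-gram overlaps between benchmark questions and the training corpus. A minimal Python sketch with hypothetical data (this is the generic technique, not OpenAI's actual procedure):

```python
# Generic train/test contamination check (a sketch with hypothetical data,
# not OpenAI's actual methodology): flag a benchmark question if any long
# word-level n-gram from it appears verbatim in the training corpus.

def ngrams(text, n=13):
    """Yield word-level n-grams; ~13-gram overlap is a common contamination signal."""
    words = text.lower().split()
    return (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))

def looks_contaminated(question, corpus_index, n=13):
    """True if any n-gram of the question was seen verbatim in training data."""
    return any(g in corpus_index for g in ngrams(question, n))

# Hypothetical usage: index the training documents once, then screen each item.
training_docs = ["... full text of training documents ..."]
corpus_index = {g for doc in training_docs for g in ngrams(doc)}
exam_questions = ["... text of each benchmark question ..."]
flagged = [q for q in exam_questions if looks_contaminated(q, corpus_index)]
print(f"{len(flagged)} of {len(exam_questions)} questions overlap with training data")
```

A verbatim check like this misses paraphrased questions, which is one reason contamination is hard to rule out after the fact.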
@janellecshane not surprised in the least, but it's nice to see it confirmed
@janellecshane That’s exactly how Charles Van Doren and all the other scam quiz show “winners” were busted in the late 1950s.
@janellecshane "I was a good Bing" syndrom
@janellecshane interesting, as it's proof that it is a pure LLM with no understanding or actual intelligence: any human who had learned the training data would be able to pass a new exam, or at least make a good attempt.

@janellecshane @KevinMarks

ehhh memorize isn't even the right word. still anthropomorphizing too much, dagnamit

@quinn @janellecshane @KevinMarks talk of storage as computer memory, as in RAM/ROM, is standard
@mapto @janellecshane @KevinMarks when is the last time you said that your computer memorized that file?
@quinn @janellecshane I might have said it memory-mapped a file, but that's a different metaphor.
You could say that the LLM had already read the answers to those questions (again with the existing metaphor of reading data).
@KevinMarks @janellecshane It's just important to remember that these amazing new AIs are still closer to being very elaborate magic 8-balls than they are to being children, or even cats.
@janellecshane this behavior has been documented in other cases; e.g., see here: https://arxiv.org/abs/2202.07206
Impact of Pretraining Term Frequencies on Few-Shot Reasoning

Pretrained Language Models (LMs) have demonstrated ability to perform numerical reasoning by extrapolating from a few examples in few-shot settings. However, the extent to which this extrapolation relies on robust reasoning is unclear. In this paper, we investigate how well these models reason with terms that are less frequent in the pretraining data. In particular, we examine the correlations between the model performance on test instances and the frequency of terms from those instances in the pretraining data. We measure the strength of this correlation for a number of GPT-based language models (pretrained on the Pile dataset) on various numerical deduction tasks (e.g., arithmetic and unit conversion). Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data, and we encourage researchers to take the pretraining data into account when interpreting evaluation results.

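The paper's core measurement is simple: rank test instances by how often their key terms appear in the pretraining corpus, then compare accuracy between the most- and least-frequent deciles. A toy Python sketch of that idea (hypothetical numbers, not the authors' code):

```python
# Toy version of the frequency/accuracy analysis (hypothetical data, not the
# authors' code): compare accuracy on the instances whose terms are most vs.
# least frequent in the pretraining corpus.
from collections import Counter

# Hypothetical inputs: (key term of the test instance, did the model get it right?)
results = [("meter", True), ("kilometer", True), ("furlong", False),
           ("fathom", False), ("mile", True), ("league", False)]
# Hypothetical term counts from the pretraining corpus (the paper used the Pile).
term_freq = Counter(meter=500_000, kilometer=90_000, mile=400_000,
                    furlong=40, fathom=300, league=5_000)

def accuracy(items):
    return sum(ok for _, ok in items) / len(items)

ranked = sorted(results, key=lambda r: term_freq[r[0]])  # rarest terms first
k = max(1, len(ranked) // 10)  # decile size (the paper compares top vs. bottom 10%)
gap = accuracy(ranked[-k:]) - accuracy(ranked[:k])
print(f"accuracy gap, most- vs least-frequent terms: {gap:+.0%}")
```

On real models the paper reports gaps of up to 70 absolute points between the top and bottom deciles, which is hard to square with genuine numerical reasoning.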
@janellecshane not sure I trust benchmarks based on in-house testing
@janellecshane this is how I got an A on my Chemistry A-level 🙃