In this paper we explore the evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as its input parameter some element drawn from a large, well-defined population (e.g., count the elements in a list, multiply two $k$-digit numbers). We examine several conditions per task and perform enough trials that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy to both query phrasing and the input-parameter population. We find that seemingly trivial modifications to the task prompt or input population can yield differences far larger than sampling effects can explain. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing to be counted) and object frequency (e.g., accuracy when an element accounts for $\approx$ 50\% of a list differs from accuracy when it accounts for $\approx$ 70\%). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, in which experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions formed through interactions with humans are a very unreliable guide to which input modifications should ``make no difference'' to LLM performance.
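The abstract's claim that observed gaps are "far larger than can be explained by sampling effects" comes down to a significance test on accuracy rates. Here is a minimal sketch, not taken from the paper, of one standard way to run that check: a two-proportion z-test comparing success counts for the same task under two prompt phrasings. The counts below are hypothetical, chosen only to illustrate the calculation.

```python
import math

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test: is the accuracy gap between
    condition A and condition B larger than sampling noise would explain?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pool the two samples to estimate the standard error under the
    # null hypothesis that both conditions share one true accuracy.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical trial counts: one counting task, two prompt phrasings.
z, p = two_proportion_ztest(successes_a=412, n_a=500, successes_b=351, n_b=500)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ~ 4.5: a gap this size is not noise
```

With 500 trials per condition, an accuracy gap of roughly 82\% vs. 70\% yields $z \approx 4.5$, far beyond conventional significance thresholds, which is the sense in which enough trials let differences between prompt phrasings be detected reliably.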
The recent emergence of "damages" as the plural of "damage" - when "damage" already served as its own plural and "damages" used to be reserved solely for legal damages - bugs me a bit. "Not responsible for any damages caused" should be "not responsible for any damage caused."
I'm also not a fan of "learnings" and "trainings", which never used to appear with an 's' at the end.
But for international English, those ships appear to have sailed (and I understand how they happened, linguistically).
OldManShakesFistAtPidgins.gif
Tycho's Meta-Law of Inversion of Sufficient Advancement:
When inverted, any sufficiently advanced "any sufficiently advanced X is indistinguishable from Y" law ... is also true.
Examples:
"Any sufficiently advanced technology is indistinguishable from magic" (Clarke's third law)
vs.
"Any sufficiently advanced magic is indistinguishable from technology"
"Any sufficiently advanced incompetence is indistinguishable from malice" (Grey's Law)
vs.
"Any sufficiently advanced malice is indistinguishable from incompetence"
etc.
Inversion should always reveal a different kind of wisdom - or at least food for thought. :D
(That second example has specific application in the security space - think about it.)
New Longread: Layoffs in Responsible AI teams.
It starts:
Wendy Grossman asks “what about all those AI ethics teams that Silicon Valley companies are disbanding? Just in the last few weeks, these teams have been axed or cut at Microsoft and Twitch...” and I have a theory.
My theory is informed by a conversation that I had with Michael Howard, maybe 20 years ago. I was, at the time, a big proponent of code reviews, and I asked about Microsoft’s practices. He said, “oh, they don’t scale, we don’t do things that don’t scale.” (Or something like that. It was a long time ago.) After I joined the SDL team and we started working together, I saw the tremendous focus that the team had on bugs. (My first day on the job included an all-hands, where I saw GeorgeSt present how many bugs the Secure Windows Initiative had managed through the Vista process.)