Cormac Herley

29 Followers
35 Following
14 Posts
Fraud & abuse, unsupervised learning, passwords. I like clarity. Speaking only for myself.
I increasingly find myself alternating between positivist research modes and asking "first of all, what does this even mean?", and realizing the seeds of the latter were all planted in my epistemology class in college. I suspect the Scottish rationalists now have a bigger influence on me than most computer scientists do.
New paper up: Can we count on LLMs? Can we rely on them? Can they perform basic tasks like counting reliably?
https://arxiv.org/abs/2409.07638v2
Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input a parameter drawn from a large, well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc.). We examine several conditions per task and perform enough trials that statistically significant differences can be detected. This allows us to investigate the sensitivity of task accuracy both to query phrasing and to the input parameter population. We find that seemingly trivial modifications in the task prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query phrasing and list length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for ≈50% of a list differs from when it accounts for ≈70%, etc.). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions formed from interactions with humans are a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
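The methodology in the abstract — many trials per condition, then a check that accuracy differences exceed what sampling noise explains — can be sketched roughly as below. This is not the paper's code; the list generator and the pooled two-proportion z-test are my own illustrative stand-ins, and the trial counts are made up.

```python
import math
import random

def make_list(length, target, filler, target_frac):
    """Build a shuffled list in which `target` makes up ~target_frac of items.

    Returns the list and the true count, against which an LLM's
    answer to "how many <target>s are in this list?" would be scored.
    """
    k = round(length * target_frac)
    items = [target] * k + [filler] * (length - k)
    random.shuffle(items)
    return items, k

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled z-statistic for the difference between two accuracy rates.

    |z| > 1.96 means the gap between conditions A and B is larger than
    sampling effects alone would plausibly produce (alpha = 0.05).
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical results: 50%-frequency lists vs. 70%-frequency lists,
# 500 trials each. The counts here are invented for illustration.
z = two_proportion_z(470, 500, 410, 500)
print(f"z = {z:.2f}, significant: {abs(z) > 1.96}")
```

The point of the test is the paper's headline claim in miniature: if two conditions that "should make no difference" produce a |z| well beyond 1.96 over hundreds of trials, the difference cannot be waved away as noise.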


The recent emergence of "damages" as a plural of damage - when "damage" is a mass noun that already covered the plural sense, and "damages" used to be reserved solely for legal damages - bugs me a bit. "Not responsible for any damages caused" should be "not responsible for any damage caused."

I'm also not a fan of "learnings" and "trainings", which never used to appear with an 's' at the end.

But for international English, those ships appear to have sailed (and I understand how they happened, linguistically).

OldManShakesFistAtPidgins.gif

Tycho's Meta-Law of Inversion of Sufficient Advancement:

When inverted, any sufficiently advanced "any sufficiently advanced X is indistinguishable from Y" law ... is also true.

Examples:

"Any sufficiently advanced technology is indistinguishable from magic" (Clarke's third law)
vs.
"Any sufficiently advanced magic is indistinguishable from technology"

"Any sufficiently advanced incompetence is indistinguishable from malice" (Grey's Law)
vs.
"Any sufficiently advanced malice is indistinguishable from incompetence"

etc.

Inversion should always reveal a different kind of wisdom - or at least food for thought. :D

(That second example has specific application in the security space - think about it.)

Park service really updating for the times.

New Longread: Layoffs in Responsible AI teams.

It starts:
Wendy Grossman asks “what about all those AI ethics teams that Silicon Valley companies are disbanding? Just in the last few weeks, these teams have been axed or cut at Microsoft and Twitch...” and I have a theory.

My theory is informed by a conversation that I had with Michael Howard, maybe 20 years ago. I was, at the time, a big proponent of code reviews, and I asked about Microsoft’s practices. He said, “oh, they don’t scale, we don’t do things that don't scale.” (Or something like that. It was a long time ago.) After I joined the SDL team, and we started working together, I saw the tremendous focus that the team had on bugs. (My first day on the job included an all-hands, and I saw GeorgeSt present how many bugs the Secure Windows Initiative had managed through the Vista process.)

https://shostack.org/blog/responsible-ai-layoffs/

Shostack + Friends Blog > Layoffs in Responsible AI Teams

Some inferences from layoffs in responsible AI teams

"The original question, `Can machines think?' I believe to be too meaningless to deserve discussion."
A.M. Turing (1950)

Jan Mieszkowski (on the bird site, but not here): "Beckett is furious with Giacometti for overdecorating the Christmas tree."