I've seen a number of threads, blog posts, essays, etc., discussing the implications of Large Language Models such as the ChatGPT implementation of GPT-3.5.

The worry is that these systems do a decent job of writing answers to fairly specific prompts, even prompts that bring together multiple elements to form a single question. I've included an example below. If I asked a question like this on an exam, I'd give an answer like this full marks.

#teaching #gpt3 #highereducation

But I'm not at all sure we're sunk just yet.

It seems to me that what is happening is that AI systems are creeping their way up Bloom's Taxonomy (image under CC license from the Vanderbilt University Center for Teaching).

With GPT-3 and the like, they've gone from being good at looking stuff up (level 1, remember) to being able to *fake* understanding (level 2, understand).

An aside: they don't actually *understand* anything, though that's a topic for a separate thread.

The other remarkable thing is the way that these AIs have leap-frogged to the top of Bloom's taxonomy. I think, though my mind could be changed, that it is accurate to say that they are able to create original work (but not, mind you, to understand what they've created).

That seems scary because it gets around the sorts of prompts we might have used for online open-book exams in the past.

Here I ask the AI to write a fable about a group of #shoebills who succumb to #GoodhartsLaw.

IMO, remarkable.

However, I've been asking GPT-3 to answer my old exam questions, and I'm finding that it does very poorly on most of them.

(A large fraction of them involve data graphics in some capacity, and so unfortunately I can't test these.)

Where it seems to be falling short is on analyzing and evaluating arguments.

Here, the AI knows the definition of mathiness (from my book no less) but #McKinsey Consulting fooled it with their silly trust question. It even doubles down and defends the equation!

This question, also about mathiness but from a previous year's exam, nicely illustrates what GPT-3.5 is good at and where it fails.

Here I asked the students to come up with a mathiness equation, and then to call BS on it.

The AI is very good at creating. Its mathiness-style formula for BS detection ability is pitch-perfect.
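(Not the AI's exact output, but to give a flavor of the genre, a mathiness equation is something of roughly this shape:

BS detection ability = (Skepticism × Domain knowledge) / Confirmation bias

Real-sounding variables, a plausible-looking structure, and no way to measure or test any of it.)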

But it totally fails in its critique and misses the core point.

This is a kind of narrow workaround: we often ask students to explain why a given XKCD comic about statistical topics (or, very occasionally, a comic from some other source) is funny.

This is of course notoriously hard for AI, even when it's not merely an LLM trained to babble convincingly.

Here, ChatGPT starts out on the right track but totally blows it.

Another workaround is to frame questions so that they cannot be answered without knowledge of current events. Unlike #Galactica, at least #ChatGPT reports that it doesn't know what's going on.

Given the high compute costs of training these models, this is likely to remain a safe strategy for a while, until effective methods of inexpensively updating previously trained models have been developed.

To my surprise, ChatGPT scored 100% on the multiple choice questions from my tests. I don't understand why.

However, it does quite poorly on the true/false questions. This one was a particular embarrassment. In addition to getting two questions wrong, it first states that only #4 is false, then later adds that #5 is also false.
Remember, it's babbling, not working from some sort of logic model.

(In its defense, the corpus on which it is trained also probably gets #2 wrong most of the time.)

After an evening's thought, I see that these systems pose challenges for writing open-book, open-internet exams.

(I am unreservedly opposed to using any sort of digital spyware, e.g. Proctorio, for reasons that I explain in this Twitter thread: https://twitter.com/CT_Bergstrom/status/1322369355930165248)

My hope is that as we come to understand better what these systems can and can't do, we can learn to write questions that cannot be answered by AI, and that in doing so we will learn to write better questions in general.

I'd love to hear others' thoughts, particularly when supported by experiments on their own test questions, about what it takes to write interesting test questions that encourage students to think and learn, and that at the same time are not easily emulated by an AI.

Maybe in the long run this is a losing battle, but LLMs on their own are very stupid in very specific ways.

I do think that as a community we can figure out how to write questions where their output is clearly not a good answer.

/fin

@ct_bergstrom Fascinating... and you inspired me to run a test on my own assignments. Here's an essay prompt from my most recent midterm (Journalism 101 at NYU).

A priori, I thought there would be a pretty big natural language processing hump to get over to understand what I'm asking for... but if it succeeded on that front, the essay itself is pretty well suited to an AI attack.

@ct_bergstrom To my surprise, the NLP part was not a problem at all. It got right to writing an appropriate essay. After running it a few times, I saw results that got as high as maybe a B-; Cs were more typical. Here's an example of a C/C- essay.

First paragraph is almost reasonable, but it says the same thing over and over rather than exploring that argument.

Second paragraph exhibits a fundamental misunderstanding of the fair report privilege...

@ct_bergstrom
Third paragraph misunderstands "actual malice", which is something that would put one of my students in the doghouse for sure, and reintroduces assertion of fact, which is essentially the same as the opinion section in the first graf.

Fourth paragraph isn't awful, but it is shallow and oversimplified.

Fifth graf is pure waffle. (Interestingly, almost all the runs waffled in a similar way!)

Also... none pointed out that Floyd's dead, short-circuiting the libel claim!

@cgseife This is fascinating and highlights the fact that these are convincing-babble-generators, not systems with underlying knowledge models about factual relations in the world.
@ct_bergstrom @cgseife We've all encountered university students like that. And Silicon Valley is full of them ... 🥴