I've seen a number of threads, blog posts, essays, etc., discussing the implications of Large Language Models such as the ChatGPT implementation of GPT3.5.

The worry is that these systems do a decent job of writing answers to fairly specific prompts, ones that bring together multiple elements into a single question. I've included an example below. If I asked a question like this on an exam, I'd give an answer like this full marks.

#teaching #gpt3 #highereducation

But I'm not at all sure we're sunk just yet.

It seems to me that what is happening is that AI systems are creeping their way up Bloom's Taxonomy (image under CC license from Vanderbilt University Center for Teaching).

With GPT3 and the like, they've gone from being good at looking stuff up (level 1, remember) to being able to *fake* understanding (level 2, understand).

An aside: they don't actually *understand* anything, but that's a topic for a separate thread.

The other remarkable thing is the way that these AIs have leap-frogged to the top of Bloom's taxonomy. I think, though my mind could be changed, that it is accurate to say that they are able to create original work (but not, mind you, understand what they've created).

That seems scary because it gets around the sorts of prompts we might have used for online open book exams in the past.

Here I ask the AI to write a fable about a group of #shoebills who succumb to #GoodhartsLaw.

IMO, remarkable.

However, I've been asking GPT3 to answer my old exam questions and I'm finding that it does very poorly on most of them.

(A large fraction of them involve data graphics in some capacity, and so unfortunately I can't test these.)

Where it seems to be falling short is on analyzing and evaluating arguments.

Here, the AI knows the definition of mathiness (from my book no less) but #McKinsey Consulting fooled it with their silly trust question. It even doubles down and defends the equation!
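For context, the trust equation at issue here is presumably the familiar consulting formula (I believe it originates with Maister, Green, and Galford's The Trusted Advisor, though it circulates widely in consulting materials):

Trust = (Credibility + Reliability + Intimacy) / Self-orientation

It looks quantitative, but none of the four terms is measurable on any defined scale, which is exactly what makes it a textbook case of mathiness.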

This question, also about mathiness but from a previous year's exam, nicely illustrates what GPT3.5 is good at and what it fails at.

Here I asked the students to come up with a mathiness equation, and then to call BS on it.

The AI is very good at creating. Its mathiness-style formula for BS detection ability is pitch-perfect.
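To give a flavor of the genre (this is my own hypothetical illustration, not the AI's actual output), a mathiness-style equation of the sort the exercise asks for might look like:

BS detection ability = (Skepticism × Statistical literacy) / Gullibility

The form is superficially precise, but none of the quantities is operationalized, so the equation carries no real content. Spotting that is the "call BS on it" half of the exercise.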

But it totally fails in its critique and misses the core point.

This is a kind of narrow work-around, but we often ask students to explain why a given XKCD comic about statistical topics (or very occasionally a comic from some other source) is funny.

This is of course notoriously hard for AI even when it's not merely a LLM trained to babble convincingly.

Here, ChatGPT starts out on the right track but totally blows it.

Another workaround is to frame questions so that they cannot be answered without knowledge of current events. Unlike #Galactica, at least #ChatGPT reports that it doesn't know what's going on.

Given the high compute costs of training these models, this is likely to be a safe strategy for a while until effective methods of inexpensively updating previously trained models have been developed.

To my surprise, ChatGPT scored 100% on the multiple choice questions from my tests. I don't understand why.

However, it does quite poorly on the true/false questions. This one was a particular embarrassment. In addition to getting two questions wrong, it first states that only #4 is false, then later adds that #5 is also false.
Remember, it's babbling, not working based on some sort of logic model.

(In its defense, the corpus on which it is trained also probably gets #2 wrong most of the time.)

After an evening's thought, I see that these systems pose challenges when writing open-book, open-internet exams.

(I am unreservedly opposed to using any sort of digital spyware —e.g. Proctorio—for reasons that I explain in this twitter thread: https://twitter.com/CT_Bergstrom/status/1322369355930165248)

My hope is that as we come to understand better what these systems can and can't do, we can learn to write questions that cannot be answered by AI, and that in doing so we will learn to write better questions in general.


I'd love to hear others' thoughts—particularly when supported by experiment on their own test questions—about what it takes to write interesting test questions that encourage students to think and learn, and that at the same time are not easily emulated by an AI.

Maybe in the long run this is a losing battle, but LLMs on their own are very stupid in very specific ways.

I do think as a community we can figure out how to write questions where their output is clearly not a good answer.

/fin

Very interesting thread - thanks. I'd welcome others' experiences/investigations! I suspect some of the techniques we've been using to counteract old-fashioned plagiarism - specificity, up-to-date examples required - may still help - but it's clearly a step change
@lilianedwards one of the challenges is the referencing - and I've not seen this being done yet, especially when students are obliged to use up-to-date material
@lilianedwards @ct_bergstrom
On a simple history question I found that it gave contradictory answers with just very slight wording changes. ChatGPT did stay consistent within chats, but not between them.