I've seen a number of threads, blog posts, essays, etc., discussing the implications of Large Language Models such as ChatGPT, which is built on GPT-3.5.

The worry is that these systems do a decent job of writing answers to fairly specific prompts that bring together multiple elements into a single question. I've included an example below. If I asked a question like this on an exam, I'd give an answer like this full marks.

#teaching #gpt3 #highereducation

But I'm not at all sure we're sunk just yet.

It seems to me that what is happening is that AI systems are creeping their way up Bloom's Taxonomy (image under CC license from the Vanderbilt University Center for Teaching).

With GPT-3 and the like, they've gone from being good at looking things up (level 1, remember) to being able to *fake* comprehension (level 2, understand).

An aside: they don't actually *understand* anything, but that's a separate thread.

The other remarkable thing is the way that these AIs have leapfrogged to the top of Bloom's taxonomy. I think, though my mind could be changed, that it is accurate to say that they are able to create original work (but not, mind you, understand what they've created).

That seems scary because it gets around the sorts of prompts we might have used for online open book exams in the past.

Here I ask the AI to write a fable about a group of #shoebills who succumb to #GoodhartsLaw.

IMO, remarkable.

However, I've been asking GPT3 to answer my old exam questions and I'm finding that it does very poorly on most of them.

(A large fraction of them involve data graphics in some capacity, and so unfortunately I can't test these.)

Where it seems to be falling short is on analyzing and evaluating arguments.

Here, the AI knows the definition of mathiness (from my book no less) but #McKinsey Consulting fooled it with their silly trust question. It even doubles down and defends the equation!

This question, also about mathiness but from a previous year's exam, nicely illustrates what GPT3.5 is good at and what it fails at.

Here I asked the students to come up with a mathiness equation, and then to call BS on it.

The AI is very good at creating. Its mathiness-style formula for BS detection ability is pitch-perfect.

But it totally fails in its critique and misses the core point.

This is a kind of narrow work-around, but we often ask students to explain why a given XKCD comic about statistical topics (or very occasionally a comic from some other source) is funny.

This is of course notoriously hard for AI even when it's not merely a LLM trained to babble convincingly.

Here, ChatGPT starts out on the right track but totally blows it.

Another workaround is to frame questions so that they cannot be answered without knowledge of current events. Unlike #Galactica, at least #ChatGPT reports that it doesn't know what's going on.

Given the high compute costs of training these models, this is likely to be a safe strategy for a while until effective methods of inexpensively updating previously trained models have been developed.

To my surprise, ChatGPT scored 100% on the multiple choice questions from my tests. I don't understand why.

However, it does quite poorly on the true/false questions. This one was a particular embarrassment. In addition to getting two questions wrong, it first states that only #4 is false, then later adds that #5 is also false.
Remember, it's babbling, not working based on some sort of logic model.

(In its defense, the corpus on which it is trained also probably gets #2 wrong most of the time.)

After an evening's thought, I see that these systems pose challenges when writing open-book, open-internet exams.

(I am unreservedly opposed to using any sort of digital spyware —e.g. Proctorio—for reasons that I explain in this twitter thread: https://twitter.com/CT_Bergstrom/status/1322369355930165248)

My hope is that as we come to understand better what these systems can and can't do, we can learn to write questions that cannot be answered by AI, and that in doing so we will learn to write better questions in general.


I'd love to hear others' thoughts—particularly when supported by experiments on their own test questions—about what it takes to write interesting test questions that encourage students to think and learn and that at the same time are not easily emulated by an AI.

Maybe in the long run this is a losing battle, but LLMs on their own are very stupid in very specific ways.

I do think as a community we can figure out how to write questions where their output is clearly not a good answer.

/fin

Very interesting thread - thanks. I'd welcome others' experiences/investigations! I suspect some of the techniques we've been using to counteract old-fashioned plagiarism - specificity, up-to-date examples required - may still help - but it's clearly a step change
@lilianedwards one of the challenges is the referencing - and I've not seen this being done yet. Esp when students are obliged to use up-to-date material
@Jkjacobus hmmmm I will go and see if it can do it! But again easy to add manually..
@lilianedwards @ct_bergstrom
On a simple history question I found that it gave contradictory answers with just very slight wording changes. ChatGPT did stay consistent within chats, but not between.
@ct_bergstrom as a retired prof I’m relieved I don’t have to deal with this 😊 but have been thinking about it since your posts. Have tested it on my current subject (urban family planning in Africa). Answers would get around 80%. Maybe obvious but ask exam responses to maximise use of discussions and references from our classes?? Without specifying what they are, naturally.
@ct_bergstrom erm... sounds like one almighty captcha 😊
@ct_bergstrom In the long run, yes, this is likely a losing battle. Isn't the whole idea with AI to emulate natural intelligence? (Of course with language models, this applies much more narrowly.) You don't want the students answering your questions to consult with human experts either, do you?
@ct_bergstrom what does AI recommend as an answer to this challenge if you ask it? ‘I’m sorry Dave, I’m afraid I can’t let you do that ‘?
@ct_bergstrom it will be most instructive for students to use AI to get to the surface level (takes 10 seconds now) then dig in themselves to reach the deeper problem solving layers - some kind of machine-aided didactic-model of education
@ct_bergstrom Fascinating thread. I agree that #ChatGPT has remarkable strengths, as well as glaring weaknesses. But we shouldn't fall into the trap of viewing everything the AI does well as something that's not worth teaching to humans; and anything worth teaching should be assessed if possible. So the answer to these problems some of the time will be denying the use of the AI in an invigilated environment.
@ct_bergstrom This same question, but in a different context popped up in my Bird feed, but phrased with respect to intrinsic motivation. The gripe there was the lack of intrinsic motivation and learning for tests, rather than interest. I think much of resolving this issue lies there, in any different context this will just become an arms race I fear.

@ct_bergstrom

General comment, not specific to GPT but about testing.

Students should be wanting to learn, not to cheat and lie.

@SpaceLifeForm @ct_bergstrom Interestingly, my daughter told me that the landscape architecture program at her university uses AI in some way (I forget which use).

But the students are quite motivated to complete their work, especially their studio work, in as original a way as possible, at times with costumes. I guess that, for some of those students, AI will come in later, in professional, often cookie-cutter applications, as seen in our boring urban and suburban environments.

Historians will call it the Copy And Paste Civilization.

@SpaceLifeForm @ct_bergstrom Yes. Which means they need a personal investment in the process, rather than seeing it as a hoop to jump through for a credential. The questions need to be meaningful to them, not just the teacher. (Grading also tends to be more interesting and rewarding when this happens, btw.)
@shawrd773
Suppose a well-known private school secretly offered its MS program students a choice: they could pay tuition and take the classes, or, for the same price, they could just have the sheepskin printed. What fraction of today's students would choose to take the classes vs. print the degree?
@SpaceLifeForm @ct_bergstrom

@dlakelan @shawrd773 @ct_bergstrom

Too many. But I do not want to have to work with them later and then find out they really don't know the basics of their discipline.

It's not fair to the coworkers.

@SpaceLifeForm @shawrd773 @ct_bergstrom

Here's the thing though... how many of the people who take the classes because you can't just pay for the sheepskin, but really just want the sheepskin, actually know anything once they've graduated?

@SpaceLifeForm
@ct_bergstrom on the other hand, educators (myself included) sometimes overrely on basic knowledge prompts and questions/ quizzes and perhaps we owe more creativity (in asking and evaluation) to our students

@ct_bergstrom We use these:

You’re in a desert walking along in the sand when all of the sudden you look down, and you see a tortoise, it’s crawling toward you. You reach down, you flip the tortoise over on its back. The tortoise lays on its back, its belly baking in the hot sun, beating its legs trying to turn itself over, but it can’t, not without your help. But you’re not helping. Why is that?

Describe in single words, only the good things that come into your mind about your mother.

@ct_bergstrom Much of the recent writing and discussion of large language models I have encountered asserts that educators must prepare for an "AI arms race" with students. In this thread many seem to be struggling with questions that seem more on the right track like: How do we design learning environments in which students feel free to focus on growing as writers and learners? AI might have a place in that kind of environment as a legitimate tool.
@ct_bergstrom Interesting thread! I tried #ChatGPT on some of my sophomore cell biology exam questions, and here are two examples where it failed. It is interesting that students often give the same answer to the first question as ChatGPT gave! I think that questions incorporating high-level implicit knowledge would confuse the system. But I am not sure why it failed the second question (it seems to have the required implicit knowledge).
@veraksa These are very nice examples. Thank you!
@veraksa @ct_bergstrom (I'm not a biologist, nor biology student even) - is the answer "three nucleotides", since then you will typically lose exactly one amino acid, instead of shifting the entire thing over from there on out?
@veraksa @ct_bergstrom whoohoo! My old-man brain hasn't completely forgotten my run through of Compeau and Pevznern from a few years ago :)
@ct_bergstrom I've tested it quite extensively today and it answers most of our standard exam questions pretty well, even provides literature references where required. So far, the best cases where I could win in this game are the questions that deal with drawing or analysis of figures
@ct_bergstrom I find this an interesting question partly because I've been fighting a related battle for a while: students in my Latin classes trying to use Google Translate for their answers. But it's easier there, I think, because there's a much clearer division between "this student is cheating" and "this student is expressing themself awkwardly" than there is in long-form English, where the ability to explain an idea is a separate skill from actual understanding of the material.
@ct_bergstrom I ran old upper-division psychology essay questions through and GPT absolutely crushed them. All would have been A responses. One of the many reasons I left academia was realizing this was coming and it is as amazing and terrifying as I expected.

@ct_bergstrom it may be that exams as knowledge recall/synthesis measurements are busted beyond repair. Are they necessary?

Perhaps longer term project work creates more opportunities for showing progress in writing/thinking over time.

@ct_bergstrom This sort of thing isn't WHY I've always preferred essays and projects to exams, but it certainly does give me a bunch of ammo in arguments...

@ct_bergstrom

For now at least, large language models only take text as input, so perhaps test questions that require interpretation of an image would help? It would have to be something not easily described as text. Perhaps charts and graphs?

@skybrian As I mentioned in the thread, I could only run a very small subset of my questions through the system for this exact reason.
@ct_bergstrom For the time being, it means that my questions have to be more technical (mathy) but not coding questions, because it does really well at prompts like "Write an R function that does a two-sample bootstrap test" or anything like that.
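(To see why that kind of prompt is so easy for the model, note that a two-sample bootstrap test is only a few lines of boilerplate in any language. A minimal sketch in Python rather than R; the function name, the pooled-resampling approach, and the test of a difference in means are my own illustrative choices, not the exam's:)

```python
import numpy as np

def bootstrap_two_sample_test(x, y, n_boot=10_000, seed=0):
    """Two-sample bootstrap test for a difference in means.

    Resamples under the null hypothesis by pooling the two samples,
    then compares the observed mean difference to the bootstrap
    distribution of mean differences.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Draw both resamples from the pooled data (the null world)
        bx = rng.choice(pooled, size=len(x), replace=True)
        by = rng.choice(pooled, size=len(y), replace=True)
        diffs[i] = bx.mean() - by.mean()
    # Two-sided p-value: fraction of null diffs at least as extreme
    p = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p
```

(The mechanical nature of the recipe — pool, resample, tally extremes — is exactly why such prompts no longer discriminate between students.)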

@ct_bergstrom Fascinating... and you inspired me to run a test on my own assignments. Here's an essay prompt from my most recent midterm (Journalism 101 at NYU).

A priori, I thought there would be a pretty big natural language processing hump to get over to understand what I'm asking for... but if it succeeded on that front, the essay itself is pretty well suited to an AI attack.

@ct_bergstrom To my surprise, the NLP part was not a problem at all. It got right to writing an appropriate essay. After running it a few times, I saw results that got as high as maybe a B-, Cs were more typical. Here's an example of a C/C- essay.

First paragraph is almost reasonable -- but it says the same thing over and over rather than exploring that argument.

Second paragraph exhibits a fundamental misunderstanding of the fair report privilege...

@ct_bergstrom
Third paragraph misunderstands "actual malice", which is something that would put one of my students in the doghouse for sure, and reintroduces assertion of fact which is essentially the same as the opinion section in the first graf.

Fourth paragraph isn't awful, but it is shallow and oversimplified.

Fifth graf is pure waffle. (Interestingly, almost all the runs waffled in a similar way!)

Also... none pointed out that Floyd's dead, short-circuiting the libel claim!

@ct_bergstrom The AI essays all felt like they were feeling around a decent answer but couldn't cut to the heart of the issue.

On the plus side, there's an unexpected bonus for us professors in the age of AI (attached).

@cgseife This is fascinating and highlights the fact that these are convincing-babble-generators, not systems with underlying knowledge models about factual relations in the world.
@ct_bergstrom @cgseife We've all encountered university students like that. And Silicon Valley is full of them ... 🥴

@ct_bergstrom I sort of wonder if you could be successful asking students to analyze AI-generated text.

"The ChatGPT AI was asked to explain why the following xkcd cartoon is funny.

[cartoon text]

The AI responded with this answer: [answer]

Discuss the ways the AI is correct or incorrect."

@ct_bergstrom Even if it does work, I'm not sure how long that would last. But I wonder if adding a layer of abstraction or meta to the question would confuse the AI or make it more likely to output something completely wrong.

@ct_bergstrom interesting thread, thank you.

I wonder if one possible approach might be to avoid the "arms race" scenario entirely... Rather than trying to figure out the types of questions that such models do poorly at, universities might instead de-emphasise assessment quantity and prioritise assessment quality.

Eg fewer written exams and marking, instead focus on face-to-face spoken assessments (ie viva voce which aren't at all common in many disciplines!). Another approach might be to do away with grading entirely, but I doubt academia would be so radical.

@ct_bergstrom and just a small follow-up - ChatGPT does seem to struggle with providing real references unless properly prompted, and even then I don't believe it is using the referencing appropriately within text, basically incorrectly sourcing passages to references

@ct_bergstrom perhaps one path forward in a pedagogical context is to try to write questions where an LLM can be productively used as a tool, but which require a human mind to create a solution out of the LLM's reply.

Certainly challenging for a teacher, but the students *will* face a world where they need to be able to use LLMs, and to harness that power to serve their creativity and productivity.

@ct_bergstrom I read a Tweet at one point that a student was so stressed in her exam that she was holding back tears. The software then failed her for cheating! Quite heartbreaking.
@ct_bergstrom Thanks for putting this out there. I'd love to hear what others have to say, because ChatGPT surely will be a challenge for designing effective take-home exams.
I presume at this point, giving data graphs or data images for students to interpret will still be on the menu until visual recognition is built into AI.
Otherwise, we'll need to resort to in-person oral examinations or grade only in-person participations.

@ct_bergstrom If you feed it the same prompt multiple times, how similar are the answers? Scope for a plagiarism detector to work with its output?

What happens if you ask it to critique its own answer? Could you frame questions as "Here's how an AI bot answered the question; what did it get right, what did it get wrong, what did it miss?"

@ct_bergstrom I love these experiments! Did you try giving it the questions one at a time? That might make it more similar to the training data.
@ct_bergstrom Yeah, they actually say that the model's training data cuts off in 2021, so anything in 2022 it shouldn't know about.
There ARE ways to add new training data, but making sure new info doesn't mess up the AI/be overly biased one way or another is decently hard unless you're looking to make a very specialized model that only knows one author's style or something.
@ct_bergstrom Popehat over on the other site and a pile of lawyers tested it on law and then went off the rails. 😂
@ct_bergstrom @thesiswhisperer @jasondowns aaahh I need to feed it my exam questions but they tend to be very specific…
@deboraha @ct_bergstrom @jasondowns I guess that's how we'll have to start testing things? Maybe Turnitin will have detection tools? Seems like a hard nut to crack.