# ChatGPT can pass exams that a majority of students can't

I think this says more about how our education relies on non-introspective reproduction of information than it does about the future of computing

like, it just shows that our education systems are incompatible with actual humans when a computer is better at it?

@bram i'd like to see some proof for that.
@bram no, clearly this means that we have to ban AI now and crack down on cheating
(sarcasm, ofc)
@bram no, it shows your education system produces robots trained to repeat learned things like parrots instead of understanding what they’re doing
@bram tbh not quite

there's a bunch of evidence that it was trained on tests that people then tried it on, which it of course can regurgitate

I'd bet that if you got a new test that wasn't on the internet before and tried ChatGPT on it it would fail miserably

it has very little actual intelligence, and its knowledge is quite lossy as well.

@bram It's a demonstration that the researchers who made that claim are dumb. The answers to the tests are probably in the large amount of text fed to ChatGPT. It's just regurgitating that.

The researchers tried to account for that, but the procedure they used is dumb: they searched for exact substring matches of the questions and excluded those questions from the test. Any question they didn't find that way is considered "not in the training data", but many of them probably are there, paraphrased.
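To illustrate the weakness of that procedure, here is a minimal sketch (hypothetical code, not the researchers' actual pipeline): an exact substring check declares a paraphrased question "clean", while even a crude token-overlap heuristic flags it.

```python
# Hypothetical illustration: exact-substring decontamination vs. a crude
# token-overlap check. The corpus and question are made-up examples.

TRAINING_TEXT = "Q: Which organ produces insulin? The pancreas secretes insulin."

def is_contaminated_exact(question: str, corpus: str) -> bool:
    # The procedure described above: only an exact substring match counts.
    return question in corpus

def is_contaminated_fuzzy(question: str, corpus: str, threshold: float = 0.5) -> bool:
    # Flag the question if enough of its tokens also appear in the corpus.
    q_tokens = set(question.lower().replace("?", "").split())
    c_tokens = set(corpus.lower().replace("?", "").split())
    overlap = len(q_tokens & c_tokens) / len(q_tokens)
    return overlap >= threshold

exam_question = "What organ produces insulin?"  # paraphrase of the corpus line

print(is_contaminated_exact(exam_question, TRAINING_TEXT))  # False: slips through
print(is_contaminated_fuzzy(exam_question, TRAINING_TEXT))  # True: overlap caught it
```

Real decontamination efforts use n-gram overlap rather than whole-question substrings, but the failure mode is the same: anything reworded enough evades the filter.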

@bram (Supporting evidence: someone tried replicating the Leetcode scores reported, and they found that ChatGPT does well on questions that were posted before the training data cutoff date, and terribly on questions posted after. Something similar is probably happening with the other tests; the Uniform Bar Exam, for example, probably has many semi-hidden cheat sheets on the internet, and those are likely in the training data.)

@gray17 @bram

I use GPT-4 daily to help me with my programming projects, also for moral support, planning, etc.

It's not perfect, but it's better than a human co-worker at almost everything, and in terms of value for money there's just no comparison.

I estimate that it's better than 9 out of 10 developers in terms of one-shot code quality (vs. their refined code); its breadth of knowledge is better than 100% of developers'; and it's 100 times less expensive than the cheapest junior developer.

@sswam @gray17 @bram I hope you're not using it with commercial code...

@bram It's very dangerous that the rate of improvement of LLMs is so unequal across areas; it fools people. ChatGPT has no model of most concepts.

The couple of emergent properties that appear in it, combined with the insane amount of data it can regurgitate, are enough to wow people and make them think it's a genius. On many actual reasoning tasks, though, it's more at the level of a little human child, and absolutely terrible at some, but its use of language and emotional mimicry is expert.

@bram It's a bit like a child with an encyclopedia strapped to its head: it can talk about most things like someone who heard a vague description of them, but it has never seen them.

There are some things where I feel it has something approaching understanding, because it actually has the thing in its dataset; those are abstract things like a Linux shell and simple snippets of code. It's good enough not only to write about them but to simulate them, indicating some kind of model.

@bram Seriously. If an exam only requires parroting knowledge with no critical thought, then it isn’t an exam. It’s just a memory test.

@XanIndigo @bram
Actually, it's an exam for becoming a manager.

"We have problem [x]."
"Have you tried [perfectly obvious thing that might solve x]?"

@bram GPT-4 scores in the 99th percentile on IQ tests. By the measures we have, it’s truly smarter than you or me, and it “knows” vastly more than we do.

I wish it were a matter of mismeasure. I would not be anxious then.

@bram That is a very succinct observation. I had to train myself to do well on standardized tests. I had to ask myself what answer they were looking for when I could make a case for more than one of their choices.
@bram if that’s true (source?), there’s something massively wrong with the subject syllabus or the exam if a majority of students can’t pass it
@bram Last time I checked, students taking tests weren't allowed to access high-speed compute and vast amounts (TB? PB?) of RAM.
@VisualStuart @bram they do. It’s called a brain……

@rbonini @bram Eh, the human brain is estimated at perhaps 10 TB, maybe 100 TB, much of which is occupied with important stuff like where you parked the car and why Marylou walked out on you at your senior prom 20 years ago, as opposed to taking tests and problem solving.

I just recalibrated my own brain in terms of flops (floating point operations per second) and came up with the same value as I had last year: zero.

@bram Exam scores are evidence that a human has studied a subject enough to presumably have a deep understanding of it, but they don’t directly measure the depth of that understanding. They show that a human is well-read in the topic and can do some simple reasoning, and for humans this is a proxy for understanding. Given the huge amount of text LLMs train on, and their propensity for memorization, it’s not clear these tests are as good a proxy for LLM understanding.
@bram I wouldn’t necessarily agree with that. A good assessment strategy contains a blend of appropriate methodologies. Traditional exams, which are just one type of assessment, are usually designed to test specific things that are best examined under proctored conditions and without access to all human knowledge.
@bram alternatively: exams are poor benchmarks for LLM capabilities, as they are designed to discriminate between humans based on skills that humans tend to suck at.
@bram When everything is a multiple choice test where 2 of 4 answers are clearly BS and a 3rd is obvs BS if you've been in class, you aren't testing understanding or comprehension. Standardized tests don't test learning, they test attendance.
@bram
@vikxin
Also: it typically can't pass exam questions from after its training cutoff
@bram ChatGPT GPT-4 can also do most things in the textual realm better than most humans. It's not just better than 90% of students at passing exams, it's better than 90% of humans at just about everything within its domain. Going forward, I wouldn't hire a junior developer, copywriter, or anyone really, except as charity / patronage, or for a truly exceptional talent. It can't do my electrical work yet, so tradies are safe for another six months or so.
@bram the problem is, if computers are a better fit for our economic model, that won’t matter as an argument to the people holding the strings. I mean, it will at some point, but not until the last bag is being held and the economy is such a closed loop that nothing more can be extracted (i.e. there are no consumers left because whatever system of machines we potentially end up with is trading whatever passes for money with ~100% efficiency…)
@bram Also, not all educational systems are centered on exams. They are widely used, often abused, and mostly misunderstood, but they often do not take center stage in the education plans of a subject. Having said that, some disciplines do put a lot of importance on memory, but an "extended mind" with AI as a reference is all the more interesting there. 🤔

@bram Less snappy, but more truthful would be "Exam boards publish enough mock exams online that a statistical language model trained on free content can reproduce accurate answers better than students can recall them from memory."

It's less sensational and clickbaity, granted, but is actually true.

@bram exams, tech interview questions, etc, are the kinds of things that have huge knowledge bases on the internet with tons of exact answers published all over the place. an LLM can solve them for the same reason a person with a search engine could solve them; if I copy and paste a stackoverflow answer for how to reverse a linked list or something it doesn't mean I understand anything about the solution, it just means I can draw a correlation between a question and a probable answer
@bram Later tonight: a shocking test reveals that possessing the answer sheet to an exam results in higher test scores. More at 7.

@bram

Can it pass the tests without access to its stored data? Humans can pass any test, too, with the answers on their device.

@bram I presume a fair amount of the exams were also in the model's training data and nobody went to check
@bram
ie:
“ChatGPT recently passed the U.S. Medical Licensing Exam, but using it for a real-world medical diagnosis would quickly turn deadly“
https://inflecthealth.medium.com/im-an-er-doctor-here-s-what-i-found-when-i-asked-chatgpt-to-diagnose-my-patients-7829c375a9da

@bram

... unless mindless repetition, simple stimulus/response, and/or production of "more of the same" is indeed what's expected of the "educated" (drilled) humans.

@bram Ah yes, reminds me of failing O level history.

It didn't matter how much you understood or how well you explained what had been going on; you got one mark for each name or date, and I wasn't (and still am not) good at memorising random facts.