After hearing Sebastian Bubeck talk about the #SparksOfAGI paper today, I decided to give #GPT4 another chance.

If it can really reason, it should be able to solve very simple logic puzzles. So I made one up. Sebastian stressed the importance of asking the question right, so I emphasized that this is a logic puzzle and didn't add anything confusing about knights and knaves.

Still, it gets the solution wrong.

Just for fun, here's a #KnightsAndKnaves version of the same puzzle.

It does no better.

(Actual solution: Only a knight can say "at least one of us is a knave", so Jailer 4 is a knight and indeed there is at least one knave. Since Jailer 2 says that Jailer 1 is a knight, Jailers 1 and 2 have to be the same type. If both are knaves, then Jailer 3 must be a knight (otherwise Jailer 1's statement would be true). If both are knights, Jailer 3 must be a knave. So the tiger is behind door 1 or 3. Open door 2.)
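The thread doesn't quote the full puzzle, but the deduction above pins down enough of it to check by brute force. Here's a minimal sketch that enumerates all knight/knave assignments, assuming Jailer 1's statement was "Jailer 3 is a knave" (inferred from the reasoning, not quoted in the thread), Jailer 2's was "Jailer 1 is a knight", and Jailer 4's was "at least one of us is a knave"; Jailer 3's statement is unknown and left unconstrained:

```python
from itertools import product

# Assumed statements (Jailer 1's is a reconstruction, not quoted above):
#   Jailer 1: "Jailer 3 is a knave"
#   Jailer 2: "Jailer 1 is a knight"
#   Jailer 4: "At least one of us is a knave"

def consistent_worlds():
    worlds = []
    for j1, j2, j3, j4 in product([True, False], repeat=4):  # True = knight
        s1 = not j3                         # truth of Jailer 1's claim
        s2 = j1                             # truth of Jailer 2's claim
        s4 = not (j1 and j2 and j3 and j4)  # truth of Jailer 4's claim
        # A knight's statement must be true; a knave's must be false.
        if j1 == s1 and j2 == s2 and j4 == s4:
            worlds.append((j1, j2, j3, j4))
    return worlds

for w in consistent_worlds():
    print(w)
```

Under these assumptions, exactly two worlds survive: Jailers 1 and 2 are always the same type, Jailer 3 is always the opposite type, and Jailer 4 is a knight in both, matching the deduction above.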

OK, to be fair, it's not systematically wrong. I ran the original, simple-language problem six times. Four times it incorrectly told me to open doors 2 and 3. Once it incorrectly told me to open doors 1 and 2. And once it correctly told me to open doors 1 and 3.

Even in this final case the logic itself was flawed.

Also to be fair, it does better than #GPT3.5, which concludes "In either case, you should not choose door 1, since we know that the tiger is not behind that door."
@ct_bergstrom
So it’s stochastically wrong?

@ct_bergstrom
Wait! Isn't the probability of getting a correct answer by random choice 2/3?

So GPT4 is not only wrong sometimes, but it is considerably more wrong than rolling dice 😱​

@ct_bergstrom The inconsistencies between runs are really bothering me. I’ve been playing around with asking Bing to explain jokes to me, and sometimes it gets part of the explanation right and sometimes it totally fails.