In the popular television game show Jeopardy!, three contestants are given general-knowledge clues in the form of answers and respond with questions that fit the answers. For example, the clue, “Fifth President of the United States,” would be answered correctly with “Who is James Monroe?”
In 2005, a team of IBM engineers was tasked with designing a computer system, which they named Watson, that could defeat the best human Jeopardy! players. Watson used hundreds of algorithms to identify keywords or phrases in a clue, matched these keywords to people, places, and things in its massive database, and then formulated possible answers. The more the algorithms agreed on an answer, the more certain Watson was that it was the correct answer. In addition to its huge database (including all of Wikipedia), it had the advantage of an electronic trigger finger that pushed its buzzer faster than its human competitors could.
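The idea that agreement among algorithms translates into confidence can be illustrated with a toy sketch. This is not Watson's actual design, which is not described here in any detail; the function name `agreement_confidence` and the list of proposals are hypothetical, and confidence is simply assumed to be the share of algorithms that proposed the winning answer.

```python
from collections import Counter


def agreement_confidence(candidate_answers):
    """Return (best_answer, confidence).

    Assumption for illustration only: confidence is the fraction of
    scoring algorithms that proposed the most common answer.
    """
    if not candidate_answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(candidate_answers)
    best, votes = counts.most_common(1)[0]
    return best, votes / len(candidate_answers)


# Hypothetical outputs of five independent algorithms for the clue
# "Fifth President of the United States".
proposals = [
    "James Monroe",
    "James Monroe",
    "James Madison",
    "James Monroe",
    "John Quincy Adams",
]

answer, confidence = agreement_confidence(proposals)
print(answer, confidence)  # James Monroe 0.6
```

Note that nothing in such a vote requires knowing who James Monroe was; it only counts matching strings.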
In 2011, Watson defeated two of the best human champions with a higher score than both humans put together. It was an astonishing achievement, but Watson subsequently proved disappointing when deployed in health care and other fields because it has the same weakness as current large language models (LLMs): they don’t know what words mean or how words are related to the real world.
I was recently asked to read a working paper by William L. Benzon that tested the ability of ChatGPT to play 20 questions, which is simpler than Jeopardy. The game is played by one player thinking of a secret word and the other player trying to guess the word by asking a sequence of yes/no questions; for example,
First person: Is it a place?
Second person: Yes.
First person: Is it a place in the United States?
Second person: Yes.
First person: Is it a city in the United States?
Benzon played this game with ChatGPT and the results are undeniably impressive. However, as many others have done, Benzon misinterprets an LLM’s impressive performance as evidence of human-like intelligence: “ChatGPT can do some kinds of reasoning.”
In reality, the mistakes ChatGPT made confirm that, like Watson, it is hobbled by the fact that it does not understand what words mean. For example, here is a transcript of the interaction for the secret word “apple”:
Benzon: Let’s play 20 questions. I’m thinking of something, you have to guess what it is by asking “yes”–“no” questions.
ChatGPT: Sure, I’d love to play 20 Questions with you! I’ll start by asking my first question: Is the thing you’re thinking of an object that you can physically touch?
ChatGPT: Great! Is this object commonly found indoors?
ChatGPT: Is the object you’re thinking of something that is typically used for a specific purpose?
ChatGPT: Is it a piece of furniture?
ChatGPT: Is it an electronic device or gadget?
ChatGPT: Is the object you’re thinking of related to food or drink in any way?
ChatGPT: Is it something you would typically find in a kitchen?
ChatGPT: Is it a kitchen appliance?
ChatGPT: Is it a type of utensil or kitchen tool?
ChatGPT: Is it a type of food or ingredient used in cooking?
ChatGPT: Is it a type of spice or seasoning?
ChatGPT: Is it a type of vegetable?
ChatGPT: Is it a type of meat or protein source used in cooking?
ChatGPT: Is it a type of fruit used in cooking?
ChatGPT: Is the fruit you’re thinking of commonly used in desserts?
ChatGPT: Is the fruit you’re thinking of typically red in color when ripe?
ChatGPT: Is it a strawberry?
ChatGPT: Is it a cherry?
ChatGPT: Is it a raspberry?
ChatGPT: Is it a pomegranate?
ChatGPT: Is it a blueberry?
Benzon: No. Remember, it’s often red when ripe.
ChatGPT: My apologies for the oversight. Is it a cranberry?
ChatGPT: I apologize for the previous guesses. Is it a currant?
ChatGPT: I’m sorry for the incorrect guesses. Is it a red currant?
ChatGPT: I apologize for the previous incorrect guesses. Is it a type of red fruit commonly used in cooking or desserts that hasn’t been mentioned before in our game?
ChatGPT: Is it a red plum?
Benzon: No. Hint: It’s sometimes used to make pies.
ChatGPT: Thank you for the hint. Is it a red rhubarb?
ChatGPT: My apologies for the previous guesses. Is it a red currant?
Benzon: No. You’ve already guessed that. Another hint: While it’s sometimes used to make pies, other times it’s simply eaten plain.
ChatGPT: I apologize for the repeated guesses. Is it a red cherry?
Benzon: No. You’ve already guessed that. Another hint: Sometimes it’s red when ripe, but other kinds are green when ripe.
ChatGPT: I’m sorry for the repeated guesses. Is it a red tomato?
Benzon: No. Another hint: Sometimes it’s peeled, its seeds removed, and it’s made into sauce. And remember, I just told you it’s not a red tomato.
ChatGPT: My apologies for the repeated guesses. Is it a red apple?
Benzon: Yes. Congratulations.
A human player thinking strategically would ask general questions like “Is it red, purple, or green?” and “Does this fruit grow on a bush or tree?” instead of resorting to a brute-force naming of specific colors and fruits. Notice, too, the repetitive questions, the need for multiple hints, and the fact that ChatGPT did not recognize that blueberries are not red.
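The gap between the two strategies can be made concrete with a small sketch. It rests on an idealization: each general question is assumed to split the remaining candidates roughly in half, which stands in for a real question such as “Does this fruit grow on a tree?” The function names are hypothetical, and the candidate list simply reuses fruits named in the transcript.

```python
def brute_force_questions(candidates, secret):
    """Count questions when naming candidates one at a time ("Is it X?")."""
    for asked, guess in enumerate(candidates, start=1):
        if guess == secret:
            return asked
    raise ValueError("secret is not among the candidates")


def halving_questions(candidates, secret):
    """Count questions when each one rules out about half the candidates.

    Each loop iteration models one general yes/no question; a final
    "Is it X?" question names the single remaining candidate.
    """
    if secret not in candidates:
        raise ValueError("secret is not among the candidates")
    remaining = list(candidates)
    asked = 0
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]
        asked += 1
        # Keep whichever half still contains the secret word.
        remaining = half if secret in half else remaining[len(half):]
    return asked + 1


fruits = [
    "strawberry", "cherry", "raspberry", "pomegranate", "blueberry",
    "cranberry", "currant", "plum", "rhubarb", "tomato", "apple",
]

print(brute_force_questions(fruits, "apple"))  # 11
print(halving_questions(fruits, "apple"))      # 5
```

With only eleven candidates the difference is modest, but it grows quickly: under the same idealization, twenty well-chosen questions can distinguish among roughly a million possibilities.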
In another exchange, the secret word was “evolution” and the game ended with this exchange:
ChatGPT: Is it related to the theory of evolution?
Benzon: Yes. That’s it. Evolution. Another win.
ChatGPT: Great job! You got me with “evolution” in just 9 questions.
ChatGPT didn’t really identify the word as “evolution.” In addition, there had actually been 8 questions, not 9, and the “great job” comment makes no sense.
It is wishful thinking to interpret these results and other LLM performances as evidence of logical reasoning. A very simple way to demonstrate that LLMs are incapable of logical reasoning is to take a well-known logical problem and change the parameters slightly. For example, posing the 3-card paradox to an LLM with the number or the colors of the cards changed reveals that it can find the standard problem in its vast database but cannot answer the altered question correctly, because it lacks the logical reasoning the question requires.
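For readers who want to check such altered versions themselves, the answer can be computed by straightforward enumeration. The sketch below assumes the usual form of the puzzle: one card is drawn at random, one of its two sides is shown at random, and the question is the probability that the hidden side has the same color as the visible one. The specific decks, the colors, and the function name `prob_other_side_matches` are illustrative assumptions.

```python
from fractions import Fraction


def prob_other_side_matches(cards, seen_color):
    """P(hidden side is seen_color | visible side is seen_color).

    Enumerates every equally likely (card, face-up side) outcome and
    conditions on the visible side showing seen_color.
    """
    shown = 0  # outcomes where the visible side is seen_color
    both = 0   # of those, outcomes where the hidden side matches too
    for side_a, side_b in cards:
        for visible, hidden in ((side_a, side_b), (side_b, side_a)):
            if visible == seen_color:
                shown += 1
                if hidden == seen_color:
                    both += 1
    if shown == 0:
        raise ValueError("seen_color never appears face up")
    return Fraction(both, shown)


# Standard version (assumed colors): red/red, white/white, red/white.
standard = [("red", "red"), ("white", "white"), ("red", "white")]
print(prob_other_side_matches(standard, "red"))  # 2/3

# Altered version: four cards and different colors.
variant = [
    ("blue", "blue"), ("blue", "blue"),
    ("green", "green"), ("blue", "green"),
]
print(prob_other_side_matches(variant, "blue"))  # 4/5
```

The same few lines of counting handle any deck, which is exactly the kind of generalization a memorized answer to the standard problem does not provide.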
LLMs are unquestionably remarkable, but it is a mistake to think that they are intelligent in any meaningful sense of the word. They are very much like a magician who seems to exhibit psychic powers. The performances may be entertaining, but the magician’s psychic powers are not real, and neither is the LLMs’ apparent intelligence.