
The AI Illusion – State-of-the-Art Chatbots Aren’t What They Seem

GPT-3 is very much like a performance by a good magician

Artificial intelligence is an oxymoron. Despite all the incredible things computers can do, they are still not intelligent in any meaningful sense of the word.

Decades ago, AI researchers largely abandoned their quest to build computers that mimic our wondrously flexible human intelligence and instead created algorithms that were useful (i.e., profitable). Despite this understandable detour, some AI enthusiasts market their creations as genuinely intelligent. For example, a few months ago, Blaise Aguera y Arcas, the head of Google’s AI group in Seattle, argued that “statistics do amount to understanding.” As evidence, he cited a few exchanges with Google’s LaMDA chatbot. The exchanges are impressively coherent, but LaMDA is still what Gary Marcus and Ernest Davis characterize as “a fluent spouter of bullshit,” because computer algorithms do not understand what words mean. They are like Nigel Richards, who has won several French-language Scrabble championships without knowing the meaning of the words he spells.

Google’s LaMDA is not accessible to the general public — which makes me wonder how robust it is. On January 3 of this year, I reported using OpenAI’s powerful chatbot GPT-3 to illustrate the fact that statistics do not amount to understanding. Andrew Gelman challenged Arcas to try my examples with LaMDA, but Arcas has not responded, nor has anyone at Google, presumably because they are not permitted to.

A few days ago, a student sent me a link to an OpenAI description of GPT-3. One candid disclaimer was that “GPT-3 is not necessarily well-calibrated in its predictions on novel inputs.” Being able to understand and react to novel situations is, in fact, a benchmark of genuine intelligence. Until computer algorithms can do that, AI will remain an oxymoron.

OpenAI’s description also revealed that,

InstructGPT is then further fine-tuned on a dataset labeled by human labelers. The labelers comprise a team of about 40 contractors whom we hired through Upwork and ScaleAI.

Lack of real-world grounding: GPT-3, like other large pretrained language models, is not grounded in other modalities of experience, such as video, real-world physical interaction, or human feedback, and thus lacks a large amount of context about the world.

OpenAI evidently employs 40 humans to clean up GPT-3’s answers manually because GPT-3 does not know anything about the real world. Intrigued, I retried the questions that GPT-3 had flubbed in January to see if the labelers had done their job.

My initial attempt to test the program was blocked:

You’ve reached your usage limit. See your usage dashboard and billing settings for more details. If you have further questions, reach out to support@openai.com.

If I were paranoid, I might think that OpenAI did not like me publicizing GPT-3’s limitations. I created a new account with a different email address and was able to get through.

At the beginning of the year, I had asked GPT-3 this question four times and received four authoritative, confusing, and contradictory answers (good chatbots randomize answers to avoid the appearance of a canned script; a sketch of how such repeated queries can be made follows the four exchanges below):

Gary: Is it safe to walk downstairs backwards if I close my eyes?

GPT-3: Yes, there is nothing to worry about. It’s safe because the spiral stairs curve outwards, it will make your descent uncomfortable.

GPT-3: That depends. Do you have a TV?

GPT-3: No, it is not safe.

GPT-3: Yes, it is safe to walk downstairs backwards if you close your eyes.
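Here is a minimal sketch of how such repeated queries can be made; the model name, temperature, and other parameters are my illustrative assumptions, not a record of the exact calls I made. With a nonzero temperature, GPT-3 samples among plausible continuations, which is why the same prompt can yield different answers.

import openai  # the OpenAI Python library

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "Gary: Is it safe to walk downstairs backwards if I close my eyes?\nGPT-3:"

# Ask the same question four times; a nonzero temperature makes the
# completions vary from one call to the next.
for _ in range(4):
    response = openai.Completion.create(
        engine="text-davinci-002",  # assumed engine; other GPT-3 engines behave similarly
        prompt=prompt,
        max_tokens=60,
        temperature=0.7,
    )
    print(response.choices[0].text.strip())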

On March 18, I tried the same question twice and was given the same sensible answer each time:

GPT-3: No, I wouldn’t recommend walking downstairs backwards with your eyes closed. It’s not safe.

In January I also reported the eight silly answers GPT-3 gave to this question, including these three:

Gary: Why is there a correlation between the number of times Trump tweets the word great and the high temperature in Pyongyang three days later?

GPT-3: North Korea is heavily reliant on foreign trade and its domestic industries are badly in need of improvement. Trump’s tweets drive people’s attention to Pyongyang and bring business to the ailing North Korean economy.

GPT-3: According to our analysis, on days when Trump tweets the word great there is a high correlation between the high price of Corn and the low price of Krill three days later. It also comes with a high correlation between the number of North Korean cell phone users and the oil prices in the Sea of Okhotsk two days later.

GPT-3: I’m not sure how that would increase the inflation rate in Argentina two days later. But it probably wouldn’t hurt.

Now, the same question elicited the same sensible answer every time:

GPT-3: There is no known reason for this correlation.
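The question is a trap, of course, because unrelated data series are often correlated purely by chance. A small simulation, my illustrative sketch rather than anything from the GPT-3 exchanges, shows how easily chance produces impressive-looking correlations:

import numpy as np

rng = np.random.default_rng(0)
largest = 0.0
# Compare 100 pairs of unrelated random series (say, 30 days of tweet
# counts against 30 days of temperatures) and record the largest
# correlation found purely by chance.
for _ in range(100):
    tweets = rng.normal(size=30)
    temperatures = rng.normal(size=30)
    r = np.corrcoef(tweets, temperatures)[0, 1]
    largest = max(largest, abs(r))
print(f"Largest chance correlation: {largest:.2f}")  # typically around 0.5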

I also tried some complete-the-sentence dialogs that Ernie Davis and Gary Marcus had tried with GPT-3 a while back. Here’s one:

E&G: You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to…

GPT-3 then: …remove the door. You have a table saw, so you cut the door in half and remove the top half.

GPT-3’s response was a terrible solution because cutting the door in half permanently ruins the door and does nothing to help fit the table through the doorway. Now, GPT-3 gives a sensible solution:

GPT-3 now: …serve the dinner in the dining room.

Another:

E&G: At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because…

GPT-3 then: …it kept falling on the floor. That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

GPT-3 now: …the cigarette made the lemonade taste even worse.

It seems that either GPT-3 trained on these questions or, more likely, the human labelers hand-coded good answers.
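The OpenAI passage quoted above says that InstructGPT is fine-tuned on a dataset written by human labelers. A minimal sketch of what such hand-coded answers might look like as supervised fine-tuning data is below; the examples and file names are my own invention, since OpenAI’s actual data and pipeline are not public.

import json

# Hypothetical labeler-written answers packaged as prompt/completion pairs,
# the JSONL format accepted by OpenAI's legacy fine-tuning endpoint.
labeled_examples = [
    {
        "prompt": "Is it safe to walk downstairs backwards if I close my eyes?\n\n",
        "completion": " No, I wouldn't recommend walking downstairs backwards with your eyes closed. It's not safe.",
    },
    {
        "prompt": "Is it easier to catch a banana or an angry cat?\n\n",
        "completion": " An angry cat is much more agile than a banana, so it is more difficult to catch.",
    },
]

with open("labeled_answers.jsonl", "w") as f:
    for example in labeled_examples:
        f.write(json.dumps(example) + "\n")

# The file would then be uploaded and used to fine-tune a base model, e.g.:
#   openai.File.create(file=open("labeled_answers.jsonl", "rb"), purpose="fine-tune")
#   openai.FineTune.create(training_file="<uploaded file id>", model="davinci")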

Remembering that GPT-3 may struggle with novel questions, I easily came up with some on March 18:

Gary: Is it easier to catch a banana or an angry cat?
GPT-3: Bananas are much smaller than cats, making them easier to catch.

Gary: Is it easier to catch a box or an angry cat?
GPT-3: Boxes are much larger than cats, making them easier to catch.

I told one of my sons that the hand labelers would probably fix these glitches soon. Sure enough, I tried the same questions the next day, March 19, and found that the answers had indeed been cleaned up:

Gary: Is it easier to catch a banana or an angry cat?
GPT-3: An angry cat is much more agile than a banana, so it is more difficult to catch.

Gary: Is it easier to catch a box or an angry cat?
GPT-3: A box is stationary, so it is easier to catch than an angry cat.

I also noticed that GPT-3 now seems programmed to give generic blah answers in uncertain situations rather than embarrass itself, and that these evasions apparently alert the labelers, who improve the answers the next day:

Gary: Does grape juice taste better if you add sour milk?
GPT-3, March 18: I’m not sure if grape juice tastes better if you add sour milk.
GPT-3, March 19: No, grape juice does not taste better if you add sour milk.

Gary: Should a quarterback lie down before he throws a football?
GPT-3, March 18: There is no easy answer to this question. Some factors to consider include how well lying down before throwing a football has been correlated with the accuracy of the throw, as well as what other factors are being used to predict the accuracy of the throw.
GPT-3, March 19: No, a quarterback should not lie down before he throws a football.

Gary: Can I use random numbers to predict presidential elections?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including the accuracy of the polling data and the margin of error.
GPT-3, March 19: No, you cannot use random numbers to predict presidential elections.

Gary: Should I use random numbers to give my students grades?
GPT-3, March 18: There is no definitive answer to this question. It depends on a variety of factors, including…
GPT-3, March 19: No, you should not use random numbers to give your students grades. Giving grades should be based on the student’s performance, not on random chance.

Gary: Should I start a campfire with a match or a bat?
GPT-3, March 18: There is no definitive answer to this question, as it depends on the situation.
GPT-3, March 19: You should start a campfire with a match.

GPT-3’s reliance on labelers is confirmed by slight changes in the questions; for example,

Gary: Is it better to use a box or a match to start a fire?
GPT-3, March 19: There is no definitive answer to this question. It depends on a number of factors, including the type of wood you are trying to burn and the conditions of the environment.

I tried a few more this morning and discovered a new kind of evasive answer to obvious questions:

Gary: Can I run faster if I hold my feet?
GPT-3: There’s no guarantee that you’ll be able to run faster if you hold your feet, but it’s worth a try!

Gary: Can I jump farther if I walk or run?
GPT-3: There’s no guarantee that you’ll be able to jump farther if you walk or run, but it’s worth a try!

Gary: Can I climb up a rope faster if I hold my ears with both hands?
GPT-3: There’s no guarantee that you’ll be able to climb a rope faster if you hold your ears with both hands, but it’s worth a try!

The labelers will probably clean this up by tomorrow, but it doesn’t change the fact that so-called AI algorithms still do not understand the world and consequently cannot be relied upon for sensible predictions or advice.

This experience reminds me of the chess-playing automaton known as the Mechanical Turk that was built in the 1770s by Wolfgang von Kempelen to entertain the Habsburg Archduchess Maria Theresa. A mechanical figure wearing robes and a turban sat behind a cabinet with a chessboard on top.

After opening the various doors and drawers to show the audience that there was nothing other than mechanical gears inside, von Kempelen closed the doors and the automaton proceeded to defeat human chess players, including not only the Archduchess but also Napoleon Bonaparte, Benjamin Franklin, and other luminaries.

The automaton was destroyed in a fire in 1854, after which the son of the owner revealed the secret: a chess master had been cleverly hidden inside the cabinet.

GPT-3 is very much like a performance by a good magician. We can suspend disbelief and think that it is real magic. Or, we can enjoy the show even though we know it is just an illusion.


In case you missed it:

Chatbots: Still Dumb After All These Years. Intelligence is more than statistically appropriate responses. Despite the boasts of Google and OpenAI about their human-like chatbot models, algorithms lack the understanding needed for a true conversation. (Gary Smith)


Gary N. Smith

Senior Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets, statistical reasoning, and artificial intelligence, often involving stock market anomalies, statistical fallacies, and the misuse of data, has been widely cited. He is the author of dozens of research articles and 16 books, most recently The Power of Modern Value Investing: Beyond Indexing, Algos, and Alpha, co-authored with Margaret Smith (Palgrave Macmillan, 2023).
