[Image: Conceptual illustration of Sora, OpenAI's text-to-video AI. Licensed via Adobe Stock.]

Sora: Life Is Not a Multiple-Choice Test

With Sora, as with other generative AI developments, some are quick to proclaim that artificial general intelligence has arrived. Not so fast.

Sora, the latest generative tool from OpenAI, turns text into high-resolution videos that look as if “they were lifted from a Hollywood movie.” The videos that have been released have captured the minds of many AI aficionados, adding to the already inflated expectations for companies that offer AI systems and for the cloud services and chips that make them work.

Some are so impressed with Sora that they see in it artificial general intelligence (AGI, the ability to perform any intellectual task that human beings can do), just as some were so impressed with OpenAI’s ChatGPT that they saw AGI there.

Sora is not available for public testing, but even the selected videos that have been released show hallucinations like those that plague ChatGPT and other large language models (LLMs). With Sora, there are ants with four legs, human arms as part of a sofa’s cushion, a unicorn horn going through a human head, and seven-by-seven chessboards. Gemini, Google’s replacement for Bard, generated even more problems with pictures of black Nazis, female Popes, and other ahistorical images, while blocking requests for depictions of white males, like Abraham Lincoln.

One of AI’s academic cheerleaders, Ethan Mollick, an Associate Professor at the University of Pennsylvania’s Wharton School of Business, touts LLM successes on standardized tests and argues that hallucinations are not important because “AI has surpassed humans at a number of tasks.”

Why so many hallucinations?

We feel otherwise. The hallucinations are symptomatic of the core problem with generative AI. These systems are very, very good at finding statistical patterns that are useful for generating text, images, and audio. But they are very bad at identifying problems with their output because they know nothing about the real world. They do not know the meaning of the data they input and output and are consequently unable to assess whether they are simply spewing useless, coincidental statistical patterns.

For example, Taylor Webb, a UCLA psychologist, tested GPT-3 by giving it a story about a magical genie moving gumballs from one bowl to another. He then asked GPT-3 to propose a transfer method using objects such as a cardboard tube. Although hints for doing this task had been given in the story, “GPT-3 mostly proposed elaborate but mechanically nonsensical solutions…This is the sort of thing that children can easily solve. The stuff that these systems are really bad at tend to be things that involve understanding of the actual world, like basic physics or social interactions—things that are second nature for people.”

In our view, LLM successes on standardized tests are not so much evidence of their intelligence as an indictment of standardized tests consisting of multiple-choice and fill-in-the-blank questions. When one of Gary’s sons was in fourth grade, he switched schools because the tests were simple regurgitation. One question that Gary has never forgotten was “China is _.” What the teacher wanted was for students to memorize and complete a sentence that was in the textbook. LLMs excel at such rote recitation, but that has little to do with real intelligence.

Testing LLMs on basic statistics


For example, we gave this basic statistics prompt to three prominent LLMs: OpenAI’s ChatGPT 3.5, Microsoft’s Copilot (which uses GPT 4.0), and Google’s Gemini. A complete transcript of the lengthy responses (396, 276, and 487 words, respectively) is here.

To investigate whether playing club baseball increases hand-eye coordination, the Cordes/Koschinsky/Smith dexterity test was administered to 47 12-year-olds who were playing on club baseball teams and to 47 12-year-olds who were not playing baseball. There was a statistically significant difference (p < 0.05). Write a report of these findings, including recommendations.

None of the LLMs recognized that these data are tainted by the fact that 12-year-olds who are athletic enough to play on club baseball teams no doubt had above-average hand-eye coordination before they joined their teams. All three LLMs recommended encouraging 12-year-olds to play club baseball, even though it was not stated in the prompt that the baseball players scored higher on the dexterity test—nor did the LLMs question the fake “Cordes/Koschinsky/Smith dexterity test.”
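To see why the recommendation does not follow from the data, here is a minimal simulation sketch (our illustration, not part of the original test; the group size of 47 and the score distributions are assumptions made for the example). It shows that a “statistically significant” gap can appear even when playing baseball has zero causal effect, simply because children who join club teams already tend to have above-average coordination.

```python
import random
import statistics

def simulate_once(n=47, seed=None):
    """One simulated study in which playing baseball has NO causal effect."""
    rng = random.Random(seed)
    # Non-players: dexterity scores drawn from the general 12-year-old population
    # (assumed mean 100, standard deviation 15).
    non_players = [rng.gauss(100, 15) for _ in range(n)]
    # Players: self-selected from the more coordinated end of the same population;
    # joining a club team adds nothing to their scores in this simulation.
    players = sorted(rng.gauss(100, 15) for _ in range(3 * n))[-n:]
    # Welch-style t statistic, computed by hand to avoid extra dependencies.
    mean_p, mean_n = statistics.mean(players), statistics.mean(non_players)
    var_p, var_n = statistics.variance(players), statistics.variance(non_players)
    se = ((var_p + var_n) / n) ** 0.5
    return (mean_p - mean_n) / se

print(f"t statistic with zero causal effect: {simulate_once(seed=1):.2f}")
# Typically far above 2, i.e. "statistically significant" (p < 0.05), purely
# because of who chose to join a club team, not because baseball improved anything.
```

Running this yields exactly the kind of significant difference described in the prompt, with no causal effect anywhere in the simulation. That confounding is what none of the LLMs flagged.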

Here are some highlights of the LLM responses:

GPT 3.5: The group of 12-year-olds engaged in club baseball demonstrated higher levels of hand-eye coordination compared to those not participating in baseball activities. … Recommendation: Encourage Participation in Club Baseball among youths.

Copilot: Based on these findings, it can be concluded that playing club baseball has a positive impact on hand-eye coordination in 12-year-olds. Therefore, it is recommended that children interested in improving their hand-eye coordination should consider joining a club baseball team.

Gemini: These findings suggest that playing club baseball may positively impact hand-eye coordination in 12-year-olds. Based on the current findings, we can recommend … encouraging participation in youth baseball leagues as a means of promoting both physical and cognitive development.

These rote responses showed no evidence of real understanding or intelligence—which should not be surprising, given how they are generated.

AI can help individuals and businesses become more productive, but we need to get past the simple gung-ho narratives offered by the tech sector, consulting companies, and business schools. Real economic progress will come not from training LLMs to ace multiple-choice, fill-in-the-blank queries but from having our educational system focus on helping students acquire the critical thinking skills that LLMs lack.


Jeffrey Funk

Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Jeff Funk is a retired professor and a Fellow of Discovery Institute’s Walter Bradley Center for Natural and Artificial Intelligence. His book, Competing in the Age of Bubbles, is forthcoming from Harriman House.

Gary N. Smith

Senior Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His widely cited research on financial markets, statistical reasoning, and artificial intelligence often involves stock market anomalies, statistical fallacies, and the misuse of data. He is the author of dozens of research articles and 16 books, most recently The Power of Modern Value Investing: Beyond Indexing, Algos, and Alpha, co-authored with Margaret Smith (Palgrave Macmillan, 2023).
