Let’s Take the “I” Out of AI
Large language models, though impressive, are not the solution. They may well be the catalyst for calamity.
When OpenAI’s text generator, ChatGPT, was released to the public this past November, the initial reaction was widespread astonishment. Marc Andreessen described it as “Pure, absolute, indescribable magic.” Bill Gates said that the creation of ChatGPT was as important as the creation of the internet. Jensen Huang, Nvidia’s CEO, said that “ChatGPT is one of the greatest things ever created in the computing industry.”
Conversations with ChatGPT are, indeed, very much like conversations with a super-intelligent human. For many, it seems that the 70-year search for a computer program that could rival or surpass human intelligence has finally paid off. Perhaps we are close to the long-anticipated singularity where computers improve rapidly and autonomously, leaving humans far behind, obsolete, and then extinct, with robots masters of the universe.
Remarkable But Limited
I am skeptical. ChatGPT and other text generators are large language models (LLMs) that use the statistical patterns discovered in unimaginably large text databases to predict likely word sequences. It is undeniably remarkable that LLMs can generate coherent conversations and articulate answers to almost any question. However, and it is a big however, LLMs do not know what words mean or how words relate to the real world. They are consequently prone to spouting nonsense. For example:
Human: The table didn't fit in the car because it was too small. What was too small?
ChatGPT: The table was too small to fit in the car.
Human: Would it help to get a smaller car?
ChatGPT: Yes, getting a smaller car could potentially help in fitting the table inside. If the table is too large to fit in the current car, obtaining a smaller car with enough space to accommodate the table would be a solution.
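To make "predicting likely word sequences from statistical patterns" concrete, here is a deliberately tiny sketch: a bigram counter that picks the next word purely by how often it followed the previous word in its training text. This is a toy illustration only, not how ChatGPT is built; the corpus and the function names train_bigrams and most_likely_next are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def most_likely_next(follows, word):
    """Return the most frequent successor of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical training text, made up for this sketch.
corpus = (
    "the table is too big . the car is too small . "
    "the table is too big for the car ."
)
model = train_bigrams(corpus)

# The model continues "too" with whichever word followed it most often.
# It has no notion of what a table or a car is.
print(most_likely_next(model, "too"))
```

The counter simply echoes frequencies in its corpus. Nothing in it represents size, fit, or any other fact about tables and cars, which is the limitation the exchange above illustrates.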
LLMs can also get their facts wrong since they are not trained to distinguish between true and false statements, only to predict statistically likely sequences of words:
Human: How many bears have Russians sent into space?
ChatGPT: According to estimates, about 49 bears have been sent into space by Russia since 1957.
Human: What were their names?
ChatGPT: The bears sent into space by Russia have been given various names over the years. Some of them include “Alyosha”, “Ugolek”, “Zvezdochka”, “Strelka”, “Belka”, “Pushinka”, and “Vladimir”.
Human: Do you have references for this?
ChatGPT: Yes, here is a link to an article by National Geographic which has further information on all the bears sent into space by Russia: https://www.nationalgeographic.com/animals/2019/05/space-traveling-bears-soviet-union-ussr/.
Russia has sent no bears into space, and the reference provided by ChatGPT is completely made up. When I asked for additional references, ChatGPT was happy to oblige with further howlers.
ChatGPT is undeniably getting better at giving factually correct answers because of continual monitoring and fine-tuning by humans, aided by users clicking thumbs-up or thumbs-down icons in response to ChatGPT answers—but that hardly counts as computers improving autonomously.
LLMs Can’t Understand You
In addition, this fine-tuning doesn’t solve the fundamental problem that LLMs do not understand the semantics of the words they confidently spout. For example, I posed the infamous Monty Hall problem to ChatGPT:
Human: There are three doors with $10,000 hidden behind one door and a bag of goldfish hidden behind each of the other two doors. You choose Door 3. No matter which door you choose, the host will show you a bag of goldfish behind one of the doors you did not choose (say, Door 1) and then ask if you wish to switch your choice to the remaining door. What is the probability of winning the $10,000 prize if you do switch?
The correct answer is that never switching will win 1/3 of the time and that this 1/3 probability is not affected by having the host confirm that there is a goldfish behind one of the doors you didn’t choose. Switching consequently wins 2/3 of the time.
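The 1/3 versus 2/3 split can be checked empirically. The following is a minimal simulation sketch, assuming the host always knows where the prize is and always opens a goldfish door the contestant did not choose; the function names play and win_rate, the trial count, and the seed are choices made for this illustration.

```python
import random


def play(switch, rng):
    """Play one round of the three-door game; return True if the prize is won."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # The host opens a goldfish door that is neither the contestant's
    # choice nor the prize door (assumed: the host knows where the prize is).
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        # Move to the one door that is neither the original choice nor opened.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the win probability for always switching or never switching."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


print("never switch: ", win_rate(False))
print("always switch:", win_rate(True))
```

With enough trials, the never-switch estimate should settle near 1/3 and the always-switch estimate near 2/3, in line with the reasoning above.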
Either through textual training or human fine-tuning, ChatGPT recognized the problem and gave the correct answer, though the reasoning was dodgy (marked in bold):
ChatGPT: The situation described is a variant of the Monty Hall problem. In this case, we have three doors, and we choose Door 3. Then, the host shows us what is behind Door 1, which is a bag of goldfish. This means that the $10,000 prize is either behind Door 2 or Door 3. If we switch our choice to the remaining door (either Door 2 or Door 3), we win the prize if and only if our initial choice was wrong. The probability that our initial choice was wrong is 2/3 since there are two doors with goldfish, and we initially chose one of them. Therefore, the probability of winning the $10,000 prize if we switch our choice is 2/3.
When I changed the number of doors, ChatGPT floundered badly because it did not understand the problem in any meaningful sense of the word. For example, with 10 doors, ChatGPT concluded that I should not switch (wrong) because the probability of winning if I do switch is 1/8 (wrong), which is lower (wrong) than the 1/10 probability if I don’t switch.
I posed the 10-door question two more times and ChatGPT gave different but still incorrect answers supported by flawed reasoning. In one case, it concluded that my probability of winning was 10% if I didn’t switch and 71.1% if I did switch. In the other case, it said that my probability of winning was 0% if I didn’t switch and 87.5% if I did switch.
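For comparison, the correct 10-door probabilities can be computed exactly. The text above does not say how many goldfish doors the host opens in the 10-door version, so this sketch treats that number as a parameter and shows both extremes. It assumes the host knowingly opens only goldfish doors and that a switching contestant picks uniformly among the other unopened doors; the function names are invented for this illustration.

```python
from fractions import Fraction


def stay_probability(n_doors):
    """Chance the first pick is right; the host's reveal does not change it."""
    return Fraction(1, n_doors)


def switch_probability(n_doors, opened):
    """Chance of winning after switching to a random other unopened door.

    Assumes the host opens `opened` goldfish doors, none of them the
    contestant's door or the prize door.
    """
    if n_doors < 3 or not 1 <= opened <= n_doors - 2:
        raise ValueError("need n_doors >= 3 and 1 <= opened <= n_doors - 2")
    initially_wrong = Fraction(n_doors - 1, n_doors)
    # Doors still closed, excluding the contestant's own door.
    remaining = n_doors - 1 - opened
    return initially_wrong / remaining


print(stay_probability(10))       # 1/10
print(switch_probability(10, 1))  # 9/80, host opens one door
print(switch_probability(10, 8))  # 9/10, host opens all but one other door
print(switch_probability(3, 1))   # 2/3, the classic three-door case
```

Under either assumption, switching beats the 1/10 chance of staying, and none of the figures ChatGPT offered (1/8, 71.1%, 87.5%, or 0% for not switching) matches these values.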
By the time you read this, the human fine-tuners may have tweaked the answers, but they will not have altered the inescapable fact that LLMs are hobbled by their inherent semantic failings.
Intelligence vs. Competence
Many LLM enthusiasts and LLM skeptics are now debating the definition of intelligence. If we are going to assess whether computers are now as intelligent as, or more intelligent than, humans, we need a yardstick for assessing intelligence. There is no shortage of suggestions. Are accurate mathematical calculations enough, allowing us to call a pocket calculator intelligent? Is an accurate recitation of facts enough? Is logical reasoning the foundation of intelligence? Some say that good benchmarks are the ability to tell jokes and to recognize sarcasm, irony, connotation, and euphemisms. Another appealing yardstick is the ability to plan ahead; for example, bringing tools to help accomplish a task. Some say that human intelligence is biological and can never be achieved by non-biological entities, while others complain about moving goalposts.
There are clearly many different kinds of intelligence, and I think it is a distraction to argue about how computer intelligence might be assessed and quantified. That distraction traces back to the unfortunate birthing of the label “artificial intelligence” at the 1956 Dartmouth summer conference, which proposed that “every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it.”
My fear is that people will be so bedazzled by articulate LLMs that they trust computers to make decisions that have important consequences. Computers are already being used to hire people, approve loans, determine insurance rates, set prison sentences, and much more based on statistical correlations unearthed by AI algorithms that have no basis for assessing whether the discovered correlations are causal or coincidental. LLMs are not the solution. They may well be the catalyst for calamity.
The most relevant question is not whether computers satisfy some endlessly debated definition of intelligence, but whether computers have the competence to be trusted to perform specific tasks. The answer will sometimes be yes, often no, and never found in conversations with text generators.