
Why Chatbots (LLMs) Flunk Routine Grade 9 Math Tests

Lack of true understanding is the Achilles heel of Large Language Models (LLMs). Have a look at the excruciating results.

An alert reader recently emailed me a “basic high school problem” that he had been using to test Google’s touted Gemini program. The problem came from last year’s University of Waterloo online exam for 9th graders:

In a survey, 100 students were asked if they like lentils and were also asked if they like chickpeas. A total of 68 students like lentils. A total of 53 like chickpeas. A total of 6 like neither lentils nor chickpeas. How many of the 100 students like both lentils and chickpeas?

The solution can be obtained by using a two-way table. Here is the given information:

|                       | Like Lentils | Do Not Like Lentils | Total |
|-----------------------|--------------|---------------------|-------|
| Like Chickpeas        |              |                     | 53    |
| Do Not Like Chickpeas |              | 6                   |       |
| Total                 | 68           |                     | 100   |

Filling in the missing cells shows that 27 students like both lentils and chickpeas:

|                       | Like Lentils | Do Not Like Lentils | Total |
|-----------------------|--------------|---------------------|-------|
| Like Chickpeas        | 27           | 26                  | 53    |
| Do Not Like Chickpeas | 41           | 6                   | 47    |
| Total                 | 68           | 32                  | 100   |
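
The same answer also follows from the inclusion–exclusion principle: the 94 students who like at least one of the two foods must equal the lentil-likers plus the chickpea-likers, minus those counted twice. Here is a minimal Python sketch of that arithmetic (the variable names are mine, not part of the problem):

```python
total = 100      # students surveyed
lentils = 68     # like lentils
chickpeas = 53   # like chickpeas
neither = 6      # like neither

at_least_one = total - neither             # 100 - 6 = 94 like at least one
both = lentils + chickpeas - at_least_one  # 68 + 53 - 94 = 27 counted twice
print(both)                                # 27
```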

This is a straightforward type of problem that large language models (LLMs) will have encountered many times in their training. Indeed, LLMs might have seen this exact problem. Yet, this reader gave the problem to Gemini four times and got the answers 34, 32, and 15 (twice).

It is revealing that Gemini not only failed to give the correct answer but gave different incorrect answers each time. This is not like a conversation about the Arsenal football team, in which an LLM might give varying, yet still statistically appropriate, answers. There is one correct answer and no need for multiple responses. Gemini was floundering because it simply did not know how to answer the problem.

Trying again…

I wondered if OpenAI’s ChatGPT 3.5 and Microsoft’s Copilot would struggle too. I first gave the problem to Gemini again and got yet another incorrect answer, 10.5! (To Gemini’s credit, it did say that fractional people are not possible.) Here is Gemini’s reasoning:

Out of the 68 students who like lentils, 53 – x also like chickpeas, where x is the number of students who like both lentils and chickpeas.

Similarly, out of the 53 students who like chickpeas, 68 – x also like lentils.

Adding the counts of students who like lentils and chickpeas, we get 68 – x + 53 – x + x = 100, which combines to 121 – 2x = 100.

Solving for x, we get x = 10.5, which is impossible as the number of students liking both cannot be a decimal.

Therefore, there seems to be an inconsistency in the given information. Please double-check the numbers provided for the students who like lentils, chickpeas, and neither.

The first two sentences are nonsense, and the math in the third sentence is wrong: 68 – x + 53 – x + x combines to 121 – x, not 121 – 2x, and the equation ignores the 6 students who like neither food. If Gemini understood what the numbers mean, it would know that the numbers given in the problem are fine and do not require a fractional answer. It would then check how it arrived at its answer.
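
For contrast, the bookkeeping Gemini was groping toward is simple once the 100 students are split into four non-overlapping groups. A sketch of the correct setup (my reconstruction, not Gemini’s output):

```python
# With x = students who like both, every student is counted exactly once:
#   (68 - x) + (53 - x) + x + 6 = 100,  i.e.,  127 - x = 100
x = 68 + 53 + 6 - 100
print(x)  # 27
```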

How did the other chatbots do?

ChatGPT 3.5 gave the incorrect answer of –46, though it recognized that this answer was impossible. Like Gemini, if it understood the given numbers, it would know that a negative answer cannot be right and would check its work.
 
Let’s denote the number of students who like both lentils and chickpeas as x. We know that:
Total number of students who like lentils (A) is 68.
Total number of students who like chickpeas (B) is 53.
Total number of students who like neither lentils nor chickpeas (A∩B) is 6.
The formula for the union of two sets is given by:
A∪B=A+B−A∩B
In this case, the union of students who like lentils and chickpeas (A∪B) represents the total number of students who like either lentils or chickpeas or both.
So, we can write:
A∪B=68+53−6
Now, we know that:
A∪B=A+B−A∩B
Substituting the values we have:
68+53−6=68+53−x
Solving for x (the number of students who like both lentils and chickpeas):
x=68+53−6−68−53
x=75−121
x=−46
However, the number of students who like both lentils and chickpeas cannot be negative. This suggests that there might be an issue with the given information or there might be an error in the data provided. Please double-check the information provided or verify if there is any additional context or constraints that might affect the solution 
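
ChatGPT had the right formula but the wrong substitution: the 6 students who like neither food are not A∩B, and the union A∪B is everyone except those 6, that is, 100 − 6 = 94 (its subsequent arithmetic is garbled besides). With that correction (my fix, not ChatGPT’s), its own method yields the answer:

```python
union = 100 - 6      # A∪B: all students except the 6 who like neither
x = 68 + 53 - union  # A∩B = A + B - A∪B
print(x)             # 27
```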
 
Copilot gave yet another incorrect answer, 21, because it mistakenly used the total number of students (100) where the number who like lentils, chickpeas, or both (94) belongs:
 
Let’s break down the information given:
     Total students surveyed: 100
     Students who like lentils: 68
     Students who like chickpeas: 53
     Students who like neither lentils nor chickpeas: 6
We want to find out how many students like both lentils and chickpeas. Let’s denote this number as (x).
We can use the principle of inclusion and exclusion. The total number of students who like either lentils or chickpeas (or both) is the sum of those who like lentils and those who like chickpeas, minus the overlap (students who like both):
Total = Lentils + Chickpeas – Both
Substituting the given values:
100 = 68 + 53 – x 
Solving for (x):
x = 68 + 53 – 100 = 21
Therefore, 21 students like both lentils and chickpeas. 
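
None of the chatbots’ answers survives the check a grade 9 student could make: put the candidate answer back into the two-way table and confirm that every cell is a non-negative whole number and that the cells sum to 100. A minimal sketch of such a check (the function is my illustration):

```python
def consistent(both, total=100, lentils=68, chickpeas=53, neither=6):
    """Does a proposed 'both' count satisfy the survey's constraints?"""
    cells = [both, lentils - both, chickpeas - both, neither]
    return all(c >= 0 and c == int(c) for c in cells) and sum(cells) == total

for answer in (34, 32, 15, 10.5, -46, 21, 27):
    print(answer, consistent(answer))  # only 27 prints True
```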
 
In all three cases, the LLMs are clearly relying on procedures they have seen but cannot implement correctly because they don’t understand, in any meaningful sense, what words mean and, therefore, do not know how the given numbers should be used in these procedures.
 
This has been and continues to be the Achilles heel of LLMs. It is increasingly recognized that the solution is not to make the models bigger. On January 16, 2024, Yann LeCun, Chief AI Scientist at Meta, said, “The future of AI is not generative. It will not get us to where we need to go.” The next day, Sam Altman, OpenAI’s CEO, said that further progress will not come from making the models bigger: “I think what it means to train these models is going to change a lot in the next few years.”
 
Instead of training LLMs on ever larger databases, the focus should be on discovering a way for AI systems to understand the data they input and output.


Gary N. Smith

Senior Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets, statistical reasoning, and artificial intelligence, which often involves stock market anomalies, statistical fallacies, and the misuse of data, has been widely cited. He is the author of dozens of research articles and 16 books, most recently The Power of Modern Value Investing: Beyond Indexing, Algos, and Alpha, co-authored with Margaret Smith (Palgrave Macmillan, 2023).
