Why Chatbots (LLMs) Flunk Routine Grade 9 Math Tests
Lack of true understanding is the Achilles heel of Large Language Models (LLMs). Have a look at the excruciating results.

An alert reader recently emailed me a “basic high school problem” that he had been using to test Google’s touted Gemini program. The problem came from last year’s University of Waterloo online exam for 9th graders:
In a survey, 100 students were asked if they like lentils and were also asked if they like chickpeas. A total of 68 students like lentils. A total of 53 like chickpeas. A total of 6 like neither lentils nor chickpeas. How many of the 100 students like both lentils and chickpeas?
The solution can be obtained by using a two-way table. Here is the given information:
|  | Like Lentils | Do Not Like Lentils | Total |
| --- | --- | --- | --- |
| Like Chickpeas |  |  | 53 |
| Do Not Like Chickpeas |  | 6 |  |
| Total | 68 |  | 100 |
Filling in the missing cells: 100 – 68 = 32 students do not like lentils, so 32 – 6 = 26 like chickpeas but not lentils, and 53 – 26 = 27 students like both lentils and chickpeas:
|  | Like Lentils | Do Not Like Lentils | Total |
| --- | --- | --- | --- |
| Like Chickpeas | 27 | 26 | 53 |
| Do Not Like Chickpeas | 41 | 6 | 47 |
| Total | 68 | 32 | 100 |
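For readers who want to verify the arithmetic, here is a minimal Python sketch (my own check, not part of the Waterloo problem) that derives the same cells using inclusion–exclusion:

```python
# Survey counts from the problem statement.
total = 100
lentils = 68
chickpeas = 53
neither = 6

# 100 - 6 = 94 students like at least one of the two.
at_least_one = total - neither

# Inclusion-exclusion: |L| + |C| - |L and C| = |L or C|.
both = lentils + chickpeas - at_least_one   # 27

lentils_only = lentils - both               # 41
chickpeas_only = chickpeas - both           # 26

print(both, lentils_only, chickpeas_only)   # 27 41 26

# The four disjoint groups account for all 100 students.
assert both + lentils_only + chickpeas_only + neither == total
```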
This is a straightforward type of problem that large language models (LLMs) will have encountered many times in their training. Indeed, LLMs might have seen this exact problem. Yet, this reader gave the problem to Gemini four times and got the answers 34, 32, and 15 (twice).
It is revealing that Gemini not only failed to give the correct answer but gave different incorrect answers each time. This is not like a conversation about the Arsenal football team, in which an LLM might give varying, yet still statistically appropriate, answers. There is one correct answer and no reason for the responses to vary. Gemini was floundering because it simply did not know how to answer the problem.
Trying again…
I wondered if OpenAI’s ChatGPT 3.5 and Microsoft’s Copilot would struggle too. I first gave the problem to Gemini again and was given yet another incorrect answer, 10.5 (to Gemini’s credit, it did at least note that fractional people are not possible):
Out of the 68 students who like lentils, 53 – x also like chickpeas, where x is the number of students who like both lentils and chickpeas.
Similarly, out of the 53 students who like chickpeas, 68 – x also like lentils.
Adding the counts of students who like lentils and chickpeas, we get 68 – x + 53 – x + x = 100, which combines to 121 – 2x = 100.
Solving for x, we get x = 10.5, which is impossible as the number of students liking both cannot be a decimal.
Therefore, there seems to be an inconsistency in the given information. Please double-check the numbers provided for the students who like lentils, chickpeas, and neither.
The first two sentences are nonsense, and the math in the third sentence is wrong: 68 – x + 53 – x + x simplifies to 121 – x, not 121 – 2x, and the sum should equal the 100 – 6 = 94 students who like at least one of the two, which gives x = 27. If Gemini understood what the numbers mean, it would know that the numbers given in the problem are fine and should not require a fractional answer. It would then check how it arrived at its answer.
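To make the error concrete, here is a short sympy sketch (again my own check, not the reader’s or Gemini’s) solving both Gemini’s equation and the corrected one:

```python
from sympy import Eq, solve, symbols

x = symbols("x")  # number of students who like both

# Gemini's equation as written: even taken at face value,
# 68 - x + 53 - x + x simplifies to 121 - x, so setting it
# equal to 100 gives x = 21, not 10.5.
print(solve(Eq(68 - x + 53 - x + x, 100), x))  # [21]

# Corrected equation: the three disjoint groups (lentils only,
# chickpeas only, both) must sum to 100 - 6 = 94, the number
# of students who like at least one of the two.
print(solve(Eq(68 - x + 53 - x + x, 94), x))   # [27]
```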
How did the other chatbots do?