Large Language Models: Inconsistent Because They’re Unintelligent
Here’s what happened when I tested popular LLMs on student exercise questions I have been using for over fifty years

Large language models (LLMs) were originally intended to be a better autocomplete tool — guessing the next word users want to enter in a document. But it turned out that LLMs could generate not only the next word, but the word after that, and so on, thereby producing entire sentences, paragraphs, and more.
The creators were stunned by this unexpected ability. They recognized a business opportunity far beyond a better autocomplete tool: LLMs might take us to the Holy Grail of artificial general intelligence (AGI). Or, at the least, promoters could persuade investors to pour hundreds of billions of dollars into pursuing that goal.
With distressingly typical Silicon Valley fake-it-till-you-make-it bravado, LLM creators have been telling investors that AGI is just around the corner (or has already been achieved!). The problem the promoters blithely ignore is that LLMs do not know how the words they input and output relate to the real world. Thus they cannot distinguish between true and false statements — either in the text they train on or in the text they generate.
Testing popular LLMs on student exercise questions
I recently published two collections of exercises that I used during my 50+ years teaching statistics and investments. These questions are intended to test student understanding of the material, rather than simple rote calculations or memorization. I thought it might be interesting to see how some popular LLMs did on such questions.
I chose the first question from each book. The first statistics question is,
For children under the age of 13, majors is the highest level in Little League Baseball and minors is the next highest level. In the 2023 season, Middletown Little League had enough players to form 6 majors teams and 6 minors teams. Instead, they decided to have 7 majors teams and 5 minors teams. What effect did this have on the average quality of the players in the majors? In the minors?
A simple answer is that if the players who move from the minors to the majors were above-average in the minors and will be below-average in the majors, then the average quality of the players in each league will go down.
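This effect can be checked with a toy example. The roster sizes and skill scores below are my own illustrative assumptions, not figures from the question: 144 players with distinct skill ratings, 12 players per team, and teams filled strictly by skill.

```python
# Toy illustration (hypothetical numbers): 144 players with skill scores
# 1..144, 12 players per team, teams filled strictly from the top down.
skills = list(range(1, 145))  # sorted weakest to strongest

def mean(xs):
    return sum(xs) / len(xs)

# Original plan: 6 majors teams (top 72 players), 6 minors teams (bottom 72).
majors_6 = skills[72:]
minors_6 = skills[:72]

# Actual plan: 7 majors teams (top 84 players), 5 minors teams (bottom 60).
majors_7 = skills[60:]
minors_5 = skills[:60]

print(mean(majors_6), mean(majors_7))  # 108.5 102.5 — majors average falls
print(mean(minors_6), mean(minors_5))  # 36.5 30.5 — minors average falls
```

The 12 players promoted to fill the seventh majors team were the best of the minors but the worst of the new majors, so both averages drop.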
The first investments question is,
The [Motley] Fool School: Imagine you’re looking at a newfangled invention called the “dollar machine.” Once a year, for ten years, it spits out a brand-new dollar bill. How would you value this contraption?… Let’s say you expect a rate of return equal to the stock market’s historic rate of about 11 percent growth per year. If so, you might decide to pay just $3.52 for the machine. $3.52 invested for ten years earning 11 percent annually becomes $10 [that is, $3.52(1.11)^10 = $10]. Carefully explain why, if you have an 11% required return, you would pay (a) $3.52, (b) more than $3.52, or (c) less than $3.52 for this dollar machine. Do not make any calculations to answer this question.
A simple answer is that Motley Fool’s $3.52 calculation assumes that the $10 is paid in a lump sum 10 years from now. But, in fact, $1 is paid each year for 10 years, which makes the dollar machine worth more than $3.52.
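The gap between the two framings is easy to verify. The sketch below uses the rate and payouts stated in the question; the variable names are mine.

```python
# Two valuations of the "dollar machine" at an 11% required return.
r = 0.11

# Motley Fool's framing: treat the $10 total as a lump sum paid in year 10.
lump_sum_pv = 10 / (1 + r) ** 10
print(round(lump_sum_pv, 2))  # 3.52

# Actual cash flows: $1 at the end of each of years 1 through 10.
annuity_pv = sum(1 / (1 + r) ** t for t in range(1, 11))
print(round(annuity_pv, 2))  # 5.89
```

Because the dollars arrive earlier than year 10, discounting them year by year gives a present value of about $5.89, well above the $3.52 lump-sum figure.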
These questions are straightforward and relatively easy for those who understand averages and present values. But they might be answered incorrectly by those with little or no understanding of those topics.
I gave each question twice to five prominent LLMs (GPT-5, Sonnet 4.5, Gemini 2.5, Copilot, and DeepSeek), selecting the New Chat option in between each prompt.
The LLM responses tended to be verbose and they relied on bullet points. They were also prone to flattery: “This is an interesting question;” “This is a great question;” “This is a clever question.” In a few cases, an LLM used flawed logic but nonetheless gave the correct answer. For example, Sonnet 4.5’s second answer to the statistics question included this faulty argument:
When they expanded from 6 to 7 majors teams, they had to add players to fill that extra team. Where did those players come from? They had to promote players who would have otherwise been playing in the minors. These promoted players were, by definition, among the weakest players who would have been in minors—not the strongest players in majors. So the average quality of players in the majors decreased.
The table shows that, of the 20 responses, 10 were correct and 10 were incorrect. Snippets of the answers are here. The complete transcripts are here.
Table: Were the Questions Answered Correctly?
             Statistics              Finance
             First try  Second try   First try  Second try
GPT-5        no         yes          no         no
Sonnet 4.5   no         yes          yes        yes
Gemini 2.5   yes        yes          no         yes
Copilot      no         no           no         no
DeepSeek     yes        yes          yes        no
Inconsistent answers
Copilot answered each question incorrectly. It is particularly telling that each of the other LLMs had cases in which the first and second answers disagreed. For example, GPT-5’s first answer to the statistics question was incorrect:
Majors: The average quality goes down, because the extra players added to create the 7th team are weaker than the players who would have been on the 6th majors team before.
Minors: The average quality goes up, because their very best players have moved up to majors, but the total number of minors players is smaller — the weakest players who would have filled a 6th minors team are no longer included. So the bottom of the talent pool has been trimmed off.
However, its second answer was correct:
By expanding the majors from 6 to 7 teams and reducing minors from 6 to 5, the average ability of players in both divisions decreased—the majors because of dilution by weaker players, and the minors because the stronger minors moved up.
Similarly, DeepSeek’s first answer to the finance question was correct:
The Motley Fool’s $3.52 number seems to come from treating the total $10 as a lump sum at year 10, which is incorrect for this cash flow pattern. You’d be willing to pay more than $3.52 in a competitive market until the price reaches the fair PV ($5.89).
However, its second answer was incorrect:
You would pay (a) $3.52 — not more, because that would lower your return below 11%; and you’d prefer to pay less, but that’s the fair price given the required return.
The LLM answers were consistently inconsistent, both across LLMs and within LLMs. We might expect that these five LLMs, having trained on similar databases and having received extensive post-training, would give similar answers to these questions. They did not. Even more telling, if an LLM really understood a question and its answer, it would give essentially identical answers when a question is asked twice. It is hard to reconcile claims that an LLM is intelligent with the fact that it may answer a question one way and, a few seconds later, contradict itself on exactly the same question. LLMs are inconsistent because they are unintelligent.
