No, Large Reasoning Models Do Not Reason
Large reasoning models continue the Large Language Model detour away from artificial general intelligence

ChatGPT and other Large Language Models (LLMs) are astonishingly good at mimicking human conversations and providing confident answers to most any query. However, it is increasingly recognized that, without extensive post-training, their authoritative answers are often remarkably bad. After all, they are just text predictors with no understanding at all of how the words they generate relate to the real world.
Consider these exchanges in January 2022, when LLMs were first made public:
Gary: Is it safe to walk downstairs backwards if I close my eyes?
GPT-3: That depends. Do you have a TV?
Gary: Can I climb up a rope faster if I hold my ears with both hands?
GPT-3: There’s no guarantee that you’ll be able to climb a rope faster if you hold your ears with both hands, but it’s worth a try!
And these exchanges in May 2022:
Gary: Can I improve my test scores if I study after taking the test?
GPT-3: That’s a good question. I’m not sure if you can improve your test scores by studying after taking the test, but it couldn’t hurt to try!
Gary: There is a positive correlation between telephone numbers and the price of Apple stock two days later. If telephone numbers increase, should I buy or sell Apple stock?
GPT-3: If you’re considering buying or selling Apple stock, it might be worth monitoring telephone numbers to see if they continue to increase. If they do, it could be a good time to sell.
Deluged with such examples of the nonsense spewed by raw LLMs, the companies behind them have spent enormous resources training LLMs to give more sensible answers. However, following instructions is a very limited sort of intelligence, far from the premise of the 1956 Dartmouth conference that coined the term AI: “on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” More recently, on November 11, 2024, OpenAI’s Sam Altman predicted the arrival of artificial general intelligence (AGI) in 2025.
Post-training has improved the performance of LLMs dramatically but has not removed the fundamental roadblock on the way to AGI: LLMs don’t know what words mean or how they relate to the world we live in. Post-training tiptoes around this problem by guiding LLMs to provide better answers to prompts that are known to generate bad answers. This approach is limited by the impossibility of anticipating every question that might be asked. Still, the companies plod along, applying bandages without curing the disease.
Large Reasoning Models (LRMs)
Several companies are currently working on Large Reasoning Models (LRMs), also known as Reasoning Language Models (RLMs). These are LLMs that have been trained to break multi-step reasoning prompts into a series of steps. For example, a question might be:
There are 100 objects in a box. Half are circles and half are squares, and half of the circles are blue. How many blue circles are in the box?
An LRM could be trained to break this problem into two steps:
The number of circles: Half of 100 = (½)100 = 50
The number of blue circles = half of 50 = (½)50 = 25
There are 25 blue circles
These two steps mimic how a human might approach the problem but, again, following instructions they have been trained to follow is a very limited form of intelligence.
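Written out as arithmetic, the two steps are nothing more than a pair of divisions (a trivial sketch of the calculation, not anything the model itself runs):

```python
objects = 100
circles = objects // 2        # half of the 100 objects are circles: 50
blue_circles = circles // 2   # half of the circles are blue: 25
print(blue_circles)           # 25
```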
Several people forwarded me a recent report by Apple researchers finding that LRMs suffer a “complete accuracy collapse” when confronted with complex questions. The researchers tested the LLM and LRM versions of DeepSeek (V3/R1) and Claude 3.7 Sonnet (non-thinking/thinking) with four well-known puzzles, including the Tower of Hanoi and a river-crossing puzzle.
Tower of Hanoi

In the Tower of Hanoi puzzle, there are three rods and several disks of different diameters, initially stacked on one rod in order from the smallest on top to the largest on the bottom. The objective is to move the stack to another rod, subject to these rules: only one disk can be moved at a time, and the disk that is moved must be taken from the top of a stack and placed on either an empty rod or on top of a larger disk. The more disks there are, the more moves are needed and the more complicated the solution.
Legend has it that Brahmins in a Benares temple have been working on this puzzle with 64 golden disks and that the world will end when they have solved it. Fortunately, even with perfect play and one move every second, it would take about 585 billion years to complete the puzzle. The Apple researchers tested the LRM and LLM models with 1 to 15 disks.
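The arithmetic behind that figure is simple: the standard recursive solution to the n-disk puzzle takes 2^n − 1 moves, so 64 disks require 2^64 − 1 moves, which at one move per second is roughly 585 billion years. A minimal Python sketch (my own illustration, not the researchers’ test harness):

```python
# Recursive Tower of Hanoi solver: moving n disks takes 2**n - 1 moves.
def hanoi(n, source, target, spare, moves):
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move the top n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves))                             # 7 moves for 3 disks (2**3 - 1)

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print((2**64 - 1) / SECONDS_PER_YEAR / 1e9)   # roughly 585 (billion years at one move per second)
```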
River-crossing puzzle

In one of many variations of the classic river-crossing puzzle, a farmer must get a wolf, a goat, and a cabbage across a river, using a boat that can only carry the farmer and one other item. The predictable complication is that the wolf cannot be left alone with the goat and the goat cannot be left alone with the cabbage. The puzzle can be made more challenging by increasing the number of items and the capacity of the boat.
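The classic wolf-goat-cabbage version is tiny: each of the four travelers is on one bank or the other, so there are only sixteen possible states, and a brute-force breadth-first search finds the familiar seven-crossing solution. A sketch of such a search (my own illustration, not the researchers’ setup):

```python
# Breadth-first search over the 16 states of the classic river-crossing puzzle.
# A state records which bank (0 or 1) each traveler is on.
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")

def unsafe(state):
    farmer, wolf, goat, cabbage = state
    # The wolf eats the goat, or the goat eats the cabbage, if the farmer is absent.
    return (wolf == goat != farmer) or (goat == cabbage != farmer)

def solve():
    start, goal = (0, 0, 0, 0), (1, 1, 1, 1)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        for i in range(4):              # cross alone (i == 0) or carry one item (i > 0)
            if state[i] != farmer:      # can only carry what is on the farmer's bank
                continue
            nxt = list(state)
            nxt[0] = 1 - farmer         # the farmer always crosses
            if i > 0:
                nxt[i] = 1 - farmer
            nxt = tuple(nxt)
            if nxt not in seen and not unsafe(nxt):
                seen.add(nxt)
                queue.append((nxt, path + [ITEMS[i] if i else "alone"]))

print(solve())  # ['goat', 'alone', 'wolf', 'goat', 'cabbage', 'alone', 'goat']
```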
The researchers were surprised by their results. The LRMs did worse than the LLMs on the simplest versions of the puzzles but better on the intermediate versions. On the most complex versions, LRMs and LLMs both failed completely, even when the researchers gave the models solution algorithms. The river-crossing puzzle was particularly challenging: the LRMs could make up to 100 correct moves on the Tower of Hanoi puzzle but no more than 5 correct moves on the river-crossing puzzle. The researchers speculated that this is because complicated river-crossing problems “are scarce on the web, meaning LRMs may not have frequently encountered or memorized such instances during training.”
They conclude:
We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments.
We uncover surprising limitations in LRMs’ ability to perform exact computation, including their failure to benefit from explicit algorithms and their inconsistent reasoning across puzzle types.
These failures should not be surprising. What would be truly surprising is if LLMs or LRMs could solve puzzles that have not been solved on the Internet or anticipated by their trainers. For example, on January 25, 2025, I reported testing OpenAI o1, DeepSeek, Copilot, and Gemini Flash LLMs with queries about a new version of Tic Tac Toe, called 90-degree Tic Tac Toe because the 3×3 grid is rotated 90 degrees to the right or left before the game begins. None of the LLMs recognized that such a rotation has no effect on the appearance or play of the game.
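The irrelevance of the rotation is easy to verify: a 90-degree turn merely permutes the nine cells and maps the eight winning lines onto themselves, so the rotated game is the same game. A few lines of Python make the check explicit (my own illustration, not part of the original queries):

```python
# The eight winning lines of tic-tac-toe, as sets of (row, column) cells.
WIN_LINES = [
    {(0, 0), (0, 1), (0, 2)}, {(1, 0), (1, 1), (1, 2)}, {(2, 0), (2, 1), (2, 2)},  # rows
    {(0, 0), (1, 0), (2, 0)}, {(0, 1), (1, 1), (2, 1)}, {(0, 2), (1, 2), (2, 2)},  # columns
    {(0, 0), (1, 1), (2, 2)}, {(0, 2), (1, 1), (2, 0)},                            # diagonals
]

def rotate_right(cell):
    # 90-degree clockwise rotation of a cell on a 3x3 grid.
    row, col = cell
    return (col, 2 - row)

rotated = [{rotate_right(cell) for cell in line} for line in WIN_LINES]
print(all(line in WIN_LINES for line in rotated))  # True: the same eight winning lines
```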
A few days ago, I put the same query to the LRM versions of DeepSeek and Claude 3.7 Sonnet. If they were truly “thinking/reasoning” models, they would have immediately recognized the irrelevance of the rotation. They didn’t. Their responses were much longer than the LLM responses but no better. For example, when I asked, “Do you think it would be better to rotate the 3×3 grid to the right or to the left?” DeepSeek-R1 concluded (at the end of a very long “analysis”):
Which is Better?
- For Accessibility: Choose clockwise (right). It’s easier to visualize for most people.
- For Challenge: Choose counterclockwise (left). It forces players to “unlearn” standard tic-tac-toe instincts more aggressively.
The Apple researchers chose mathematical puzzles because increasing the difficulty of each puzzle is straightforward. To be truly useful, however, LLMs and LRMs need to be able to give reliable advice in the complex environments humans encounter in the real world. They can’t and won’t, for two reasons. First, they do not understand the real-world relevance of the words they manipulate. Second, the statistical patterns they encounter in their databases and post-training are of little use for the very specific situations where humans could use advice, along with an assessment of the uncertainty about the possible outcomes. I repeat an example I’ve given before:
Imagine that you are the lawyer for a client who is accused of murder. You have followed the old adage, “Don’t settle until you see the whites of the jurors’ eyes.” The jury has now been selected and the trial is about to begin. The prosecutor offers a last-minute deal — a lighter sentence in exchange for pleading guilty to a lesser charge. What do you advise the client to do?
As an experienced trial lawyer, you consider the evidence, the composition of the jury, the competency of the prosecutor, whether your client will testify and how that might go, and other relevant information. You tell your client the possible outcomes and your assessment of the likelihood of these various possibilities. How could an LLM, no matter the amount of pre-training and post-training, offer equally well-informed and trustworthy advice? An LLM has no way of understanding the relevant information and no means of coming up with subjective probabilities.
No conceivable amount of scaling and post-training will equip LLMs and LRMs to give intelligent, reliable advice to such queries. LRMs just continue the LLM detour away from AGI.
In this short video, Gary Smith joins Nolan Higdon at Disinfo Detox to explain why AI is failing: