Researchers: Chatbots just don’t DO formal reasoning
Earlier this month, AI analyst Gary Marcus pointed to a new preprint by six Apple AI researchers who determined that chatbots (large language models, or LLMs) do not do formal reasoning. The researchers summarized their findings in an X thread, concluding:
Overall, we found no evidence of formal reasoning in language models including open-source models like #Llama, #Phi, #Gemma, and #Mistral and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%! We can scale data, parameters, and compute—or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in ‘better pattern-matchers,’ not necessarily ‘better reasoners.’
Here’s the kind of problem they were talking about: a grade-school math word problem in which swapping a name or tacking on an irrelevant detail (say, that a few of the items were a bit smaller than average) is enough to change the model’s answer, even though the arithmetic is untouched.
Marcus comments, “There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.”
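To make the perturbation concrete, here is a minimal Python sketch loosely in the spirit of the paper’s GSM-Symbolic and GSM-NoOp benchmarks. The template, names, numbers, and the ask_model placeholder are illustrative assumptions, not the researchers’ actual code; the point is only that renaming entities or appending an irrelevant clause leaves the correct answer unchanged, so any change in a model’s answer signals pattern matching rather than formal reasoning.

```python
import random

# Hypothetical sketch (not the Apple team's code): build variants of one
# grade-school word problem by swapping names/numbers and optionally
# appending an irrelevant clause, then compare against the fixed ground truth.

NAMES = ["Oliver", "Sophie", "Liam", "Mia"]
FRUITS = ["kiwis", "apples", "plums", "pears"]

TEMPLATE = (
    "{name} picks {fri} {fruit} on Friday, {sat} {fruit} on Saturday, "
    "and twice Friday's amount on Sunday.{noop} "
    "How many {fruit} does {name} have?"
)

# Irrelevant detail in the style of the paper's "GSM-NoOp" perturbation:
# it mentions a number but has no bearing on the count.
NOOP = " Five of them were a bit smaller than average."


def make_variant(rng, add_noop):
    """Return (problem_text, correct_answer) for one random instantiation."""
    fri, sat = rng.randint(20, 60), rng.randint(20, 60)
    text = TEMPLATE.format(
        name=rng.choice(NAMES),
        fruit=rng.choice(FRUITS),
        fri=fri,
        sat=sat,
        noop=NOOP if add_noop else "",
    )
    return text, fri + sat + 2 * fri  # the irrelevant clause never changes this


def ask_model(problem):
    """Placeholder for a call to whatever LLM is being tested."""
    raise NotImplementedError("plug in an API call or local model here")


if __name__ == "__main__":
    rng = random.Random(0)
    for add_noop in (False, True):
        text, answer = make_variant(rng, add_noop)
        print(f"[noop={add_noop}] {text}\n  ground truth: {answer}\n")
        # A formally reliable reasoner would return `answer` for every variant;
        # the study reports accuracy drops when names change or the
        # irrelevant clause is added.
```

Since the appended clause never touches the arithmetic, a system that reasoned formally would be indifferent to it; the roughly 10% swings from mere name changes reported in the thread above are the signature of pattern matching instead.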
The difficulty for programmers is that this isn’t a glitch; it’s a broad, systemic problem:
The refuge of the LLM fan is always to write off any individual error. The patterns we see here, in the new Apple study, and the other recent work on math and planning (which fits with many previous studies), and even the anecdotal data on chess, are too broad and systematic for that.
The inability of standard neural network architectures to reliably extrapolate — and reason formally — has been the central theme of my own work back to 1998 and 2001, and has been a theme in all of my challenges to deep learning, going back to 2012, and LLMs in 2019.
Gary Marcus, “LLMs don’t do formal reasoning – and that is a HUGE problem,” October 11, 2024
He notes that he foresaw this issue in The Algebraic Mind (MIT Press, 2001).
If the problem is as broad as that, it may be that some kinds of thinking just can’t be mechanized. Stay tuned.