White chatbot robot leading robots group on dark bluish reddish background 3D rendering
Image Credit: sdecoret - Adobe Stock

Hype despite ongoing failure to build human-like AI = “scandal”


That’s how AI analyst Gary Marcus sees the latest round of failures:

Deep learning is indeed finally hitting a wall, in the sense of reaching a point of diminishing results. That’s been clear for months. One of the clearest signs of this is the saga of the just-released Llama 4, the latest failed billion (?) dollar attempt by one of the majors to create what we might call GPT-5 level AI. OpenAI failed at this (calling their best result GPT-4.5, and recently announcing a further delay on GPT-5); Grok failed at this (Grok 3 is no GPT 5). Google has failed at reaching “GPT-5” level, Anthropic has, too. Several others have also taken shots on goal; none have succeeded.

According to media reports, Llama 4 was delayed, in part, because despite the massive capital invested, it failed to meet expectations. But that’s not the scandal. That delay and failure to meet expectations is what I have been predicting for years, since the first day of this Substack, and it is what has happened to everyone else. (Some, like Nadella, have been candid about it). Meta did an experiment, and the experiment didn’t work; that’s science. The idea that you could predict a model’s performance entirely according to its size and the size of its data just turns out to be wrong, and Meta is the latest victim, the latest to waste massive sums on a mistaken hypothesis about scaling data and compute.

“Deep Learning, Deep Scandal,” April 7, 2025
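
For readers unfamiliar with the scaling hypothesis Marcus refers to, the sketch below illustrates one widely cited version of it: the Chinchilla fit (Hoffmann et al., 2022), under which training loss is supposed to fall predictably as parameter count and training-data size grow. The functional form and constants are that paper’s published estimates, used here purely as an illustrative assumption; none of this comes from Marcus’s article.

```python
# A minimal sketch of the scaling hypothesis (not Marcus's code): the claim that
# a model's training loss -- taken as a proxy for capability -- is predictable
# from parameter count N and training-token count D alone.
# Functional form and constants follow the published Chinchilla fit
# (Hoffmann et al., 2022) and are illustrative assumptions only.

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss under a Chinchilla-style scaling law."""
    E, A, B = 1.69, 406.4, 410.7      # irreducible loss and fitted coefficients
    alpha, beta = 0.34, 0.28          # fitted exponents for params and tokens
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling parameters and data tenfold at each step: the predicted gain shrinks
# every time -- the "diminishing results" pattern the article describes.
for n, d in [(7e9, 1.4e12), (70e9, 1.4e13), (700e9, 1.4e14)]:
    print(f"N={n:.0e} params, D={d:.0e} tokens -> loss {predicted_loss(n, d):.2f}")
```

Run as written, the predicted loss falls only from roughly 2.0 to 1.8 across two orders of magnitude of scale: ever-larger outlays buying ever-smaller predicted gains, even if the hypothesis held.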

The problem, he says, is that Big Tech won’t admit the failure to build computers that think like humans (artificial general intelligence or AGI):

The reality, reported or otherwise, is that large language models are no longer living up to expectations, and their purveyors appear to be making dodgy choices to keep that fact from becoming obvious.

“Deep Scandal”

Straws in the wind

Marcus and computer science prof Ernest Davis noted last week that, shorn of programmer support, large language models (LLMs) performed very poorly on the questions from the recent USA Math Olympiad (March 19–20):

Hours after it was completed, so there could be virtually no chance of data leakage, a team of scientists gave the problems to some of the top large language models, whose mathematical and reasoning abilities have been loudly proclaimed: o3-Mini, o1-Pro, DeepSeek R1, QwQ-32B, Gemini-2.0-Flash-Thinking-Exp, and Claude-3.7-Sonnet-Thinking. The proofs output by all these models were evaluated by experts. The results were dismal: None of the AIs scored higher than 5% overall.

“Reports of LLMs mastering math have been greatly exaggerated,” April 5, 2025

Technology consultant Jeffrey Funk has been warning in recent months that the AI bubble is about to pop. The fundamental problem is that LLMs don’t boost productivity.

In short, even if LLMs could think like humans, they might not think like productive humans. No wonder Big Tech is slow to admit the problem.

