A Sloppy “AI Scientist” Could Make the Science Crisis Much Worse

A research team claims to have developed the AI Scientist that “generates novel research ideas, writes code, executes experiments ...” Really?

Tyler Cowen, Professor of Economics at George Mason University and co-author of a very popular economics blog, recently wrote a post titled “Okie-dokie, solve for the equilibrium.” It was a tempting teaser of a title, and we took the bait.

It turns out that Cowen’s post addresses a recent paper with the grandiose title “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery,” co-authored by six respected computer science researchers.

It is worth quoting at length from the abstract:

This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation…. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper…. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world’s most challenging problems.

A prank?

When we first read the abstract, we suspected that the paper had been written by a large language model (LLM) or was a prank, along the lines of the infamous Sokal hoax. In addition to the LLM-like language and the surely inflated claims, we had other reasons for skepticism:

  • Why are they making their code publicly available? Why not first generate dozens of papers that harvest Nobel prizes in physics, chemistry, physiology or medicine, economics, and literature (the Peace prize might be a stretch) — or at least magnify their resumes 10-fold?
  • Why would legitimate researchers advertise the $15 cost per paper, as if they were inviting paper mills to use their algorithm to generate an essentially unlimited number of bogus papers?

Apparently not, but…

We contacted one of the authors, who assured us that the paper was neither a prank nor written by an LLM. As we read through the paper, however, our skepticism about its intent evolved into skepticism about the claimed prowess of the AI Scientist.

First, the authors claim that “The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer.” A self-assessment is hardly definitive.

In addition, we have known for a long time that many conferences accept garbage. Several years ago, three MIT computer science graduate students created a prank program they called SCIgen. It uses randomly selected words to generate bogus computer-science papers, complete with realistic graphs of random numbers. They made their intended point when one of their obviously bogus papers was quickly accepted by the World Multiconference on Systemics, Cybernetics and Informatics (WMSCI).

Cyril Labbé and Guillaume Cabanac soon found 243 bogus papers written entirely or in part by SCIgen that had been published by a total of 19 publishers, all reputable and all claiming that they publish only papers that pass rigorous peer review. One of the embarrassed publishers, Springer, subsequently announced that it was teaming with Labbé to develop a tool, SciDetect, that would identify nonsense papers. The obvious question is why such a tool is needed. Is the peer-review system so broken, and the peer-review process so cursory, that reviewers cannot recognize nonsense when they read it? In this environment, the bar the AI Scientist must clear to produce a “good” paper is pretty low.

LLMs are clearly much better than SCIgen at generating persuasive prose — but that is not necessarily a good thing. Ironically, 18 months ago, a Cowen post claimed that British science pioneer Francis Bacon (1561–1626) was a critic of the printing press. It turned out that this false assertion, buttressed by bogus references and quotations (“the multiplication of books is a burden of the world”), had been generated by ChatGPT.

Second, the AI Scientist authors admit that they encountered occasional problems that are well known to people who have tested LLMs, including hallucinations about experimental details and the misrepresentation of results (for example, interpreting negative results as positive). Such sloppy science might well escape detection by casual reviewers, but that is a vice, not a virtue.

Several takeaways

  • The evident goal of the AI Scientist is not to generate plausible, important hypotheses that might be tested rigorously by real researchers but to generate papers that might squeak through peer review and be published. What society needs is research that improves our lives, not papers that inflate resumes.
  • The way out of the replication crisis in science is not the creation of tools that make it easier to generate bogus papers but incentives for honest researchers to do serious reviews and for journals worldwide to impose lifetime bans on authors who are caught using fake data or LLM-generated prose.
  • It continues to astonish us how easily people can be persuaded that LLMs can do our thinking for us.
  • It is distressing that so many energetic, talented people continue to work so mightily to turn the LLM sow’s ear into a silk purse.

Jeffrey Funk

Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Jeffrey Funk is the winner of the NTT DoCoMo Mobile Science Award and the author of six books, most recently Unicorns, Hype and Bubbles: A Guide to Spotting, Avoiding and Exploiting Investment Bubbles in Tech.

Gary N. Smith

Senior Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on stock market anomalies, statistical fallacies, the misuse of data, and the limitations of AI has been widely cited. He is the author of more than 100 research papers and 18 books, most recently Standard Deviations: The truth about flawed statistics, AI and big data (Duckworth, 2024).
