Two years ago, I wrote about how peer review has become an example of Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” Once scientific accomplishments came to be gauged by the publication of peer-reviewed research papers, peer review ceased to be a good measure of scientific accomplishments. The situation has not improved.
One consequence of the pressure to publish is the temptation researchers have to p-hack or HARK. P-hacking occurs when a researcher tortures the data in order to support a desired conclusion. For example, a researcher might look at subsets of the data, discard inconvenient data, or try different model specifications until the desired results are obtained and deemed statistically significant—and therefore publishable. HARKing (Hypothesizing After the Results are Known) occurs when a researcher looks for statistical patterns in a set of data without any well-defined purpose in mind beyond trying to find a pattern that is statistically significant—and therefore publishable. P-hacking and HARKing both lead to the publication of dodgy results that are exposed as dodgy when they are tested with fresh data. This failure to replicate undermines the credibility of published research (and the value of publications in assessing scientific accomplishments).
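The mechanics of p-hacking can be made concrete with a short simulation. The sketch below is illustrative only; the sample sizes, the choice of ten outcomes per study, and all names are my assumptions, not from any study discussed here. It draws purely random data with no true effect, then "tests" ten outcomes per study and declares success if any one reaches p < 0.05:

```python
# Simulation: how p-hacking inflates false positives.
# There is NO real effect anywhere in this data -- every "significant"
# result below is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_honest_test():
    """One pre-registered comparison: two random groups, one t-test."""
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    return stats.ttest_ind(a, b).pvalue < 0.05

def one_phacked_study(n_outcomes=10):
    """Try many outcomes and keep the 'best' one -- this is the hack."""
    return any(one_honest_test() for _ in range(n_outcomes))

n = 2000
honest_rate = sum(one_honest_test() for _ in range(n)) / n
hacked_rate = sum(one_phacked_study() for _ in range(n)) / n
print(f"honest false-positive rate:   {honest_rate:.2f}")  # ~0.05
print(f"p-hacked false-positive rate: {hacked_rate:.2f}")  # ~0.40
```

With ten chances at the 5% threshold, roughly 1 - 0.95^10, or about 40%, of null studies yield a "significant" finding, which is why results produced this way so often fail to replicate on fresh data.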
From Data Torturing to Complete Fabrication
Even worse than p-hacking and HARKing is complete fabrication. Why torture data or rummage through large databases when you can simply make stuff up? An extreme example is SCIgen, a program created by three MIT graduate students that strings randomly chosen words into grammatical nonsense papers. Hundreds of papers written entirely or in part by SCIgen have been published in reputable journals that claim to publish only papers that pass rigorous peer review.
More sophisticated cons are the “editing services” (a.k.a. “paper mills”) that some researchers use to buy publishable papers or co-authorship on publishable papers. These fake papers are not built from randomly generated words, but they may be entirely fabricated or else plagiarized, in whole or in part, from other papers. It has been estimated that thousands of such papers have been published; it is known that hundreds have been retracted after being identified by research-integrity sleuths.
Many of the paper mills are located in China, as are many of the customers. Chinese doctors must be the lead author on at least two published articles in order to be promoted to deputy chief physician and the lead author on three articles to be promoted to chief physician. As Goodhart’s law predicts, the credibility of papers with Chinese authors has plummeted.
Russia and Iran have also been identified as popular paper-mill sites. One Russian website openly sells authorship slots for fees ranging from $750 to $5,000.
Just this month, a team of respected German researchers estimated that up to 24% of the medical research papers and 34% of the neuroscience papers published in 2020 were either plagiarized (copying data or results from other papers) or completely fabricated.
Now we have ChatGPT and other large language models (LLMs), which can generate fake research papers far more efficiently, and of much higher quality, than SCIgen, paper mills, or DIY scammers can. LLM-generated papers are well-written, confidently argued, and follow journal guidelines for structure and form. Many have been published. Some authors are trying to make a point and state in the conclusion that everything before the conclusion was LLM-generated. More worrisome are papers that hide the fact that they were written by an LLM.
Bad actors currently use a “firehose of falsehoods” on social media to undermine citizens’ faith in their government and social institutions. LLMs can turn these firehoses into tsunamis. Bad actors may start attacking science by flooding journals with nonsense papers that undermine our faith in journal articles. The peer-review system is already shaky and in danger of collapsing completely.
The replication crisis has laid bare the dangers of p-hacking and HARKing but it will take time to stop these practices. For years, researchers p-hacked away, oblivious to the pitfalls. Joseph Simmons, Leif Nelson, and Uri Simonsohn, who have persuasively documented the dangers of p-hacking, wrote that,
We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants so as to achieve significance. Everyone knew it was wrong, but they thought it was wrong the way it’s wrong to jaywalk. We [now know that it] was wrong the way it’s wrong to rob a bank.
Andrew Gelman has described the old way as the “find-statistical-significance-any-way-you-can-and-declare-victory paradigm,” and written that, “I can see that to people … who’d adapted to the earlier lay of the land, these [reforms] can feel catastrophic.”
I personally know statistics professors who continue to tell their students that p-hacking (though they don’t call it that) is a legitimate research strategy. Despite the incessant reminders that correlation is not causation, they evidently believe that a correlation obtained by any means necessary is sufficient.
HARKing will be even more difficult to tamp down because it is advertised as productive data mining and considered by many to be a new-and-improved statistical procedure. In the opening lines of a foreword for a book on how to data mine, a computer science professor wrote, without evident irony,
“If you torture the data long enough, Nature will confess,” said 1991 Nobel-winning economist Ronald Coase. The statement is still true. However, achieving this lofty goal is not easy. First, “long enough” may, in practice, be “too long” in many applications and thus unacceptable. Second, to get “confession” from large data sets one needs to use state-of-the-art “torturing” tools. Third, Nature is very stubborn — not yielding easily or unwilling to reveal its secrets at all.
Coase intended his comment not as a lofty goal to be achieved with state-of-the-art data-torturing tools, but as a biting criticism of the practice of ransacking data in search of patterns—because patterns unearthed by data mining and data torturing are usually useless.
Fraudulent papers differ from p-hacking and HARKing in that the authors know they are sinning. LLM-generated papers will not be caught by plagiarism detectors because LLMs are not plagiarizing. Journals can, however, require authors to submit all non-confidential data, which can be scrutinized by reviewers before publication and by dedicated fraud detectors after publication. Publishing a dishonest paper should be a permanently risky act because the fraud will be out there, waiting month after month, year after year, to be discovered.
Peer Review is Worth Saving
Careful peer review and fraud detection can be encouraged by meaningful compensation for jobs well done. I am not joking when I suggest that professional associations might pay bounties to people who discover scientific dishonesty in reputable journals. The payoff is not just that fraudulent papers will be discovered but that those who are tempted to cheat will know that they may be hunted down.
There are currently lists of predatory journals that will publish anything for a fee. Even better would be lists compiled by professional organizations of journals that have been certified as paying reputable researchers to review papers thoroughly. Something similar could be done with researchers. The website retractionwatch.com maintains a list of published papers that have been retracted. There should also be a list of authors who attempt to publish dishonest papers but are caught by journal editors during the review process.
Professional organizations might also award “research licenses,” comparable to the medical and legal certification required of doctors and lawyers. Applicants would have to prove that they are real people, perhaps through vetted academic degrees and employment histories, and pass a cursory test about scientific honesty. Anyone caught submitting or publishing fraudulent papers or engaging in other scientific misconduct would lose their research license and not be allowed to publish in certified journals. Employment contracts for researchers—academic or nonacademic—could contain a “death penalty” clause stating that the loss of a research license is grounds for dismissal.
Peer review is worth saving. Science is built on the dissemination of useful research, and it is not practical for individuals to assess on their own every claim made by every researcher. Peer review can be an effective and efficient screen. However, to be effective, peer review must be trustworthy. Research licenses, journal certification, peer-review compensation, and fraud-detection bounties might help.