^{Gary Smith
December 21, 2020

8

Artificial Intelligence, Machine Learning}

Torturing Data Can Destroy a Career: The Case of Brian Wansink

_{Wansink wasn’t alone. A surprising number of studies published in highly respected peer-reviewed journals are complete nonsense and could not be replicated with fresh data}

Until a few years ago, Brian Wansink was a Professor of Marketing at Cornell and the Director of the Cornell Food and Brand Lab. He authored (or co-authored) more than 200 peer-reviewed papers and wrote two popular books, Mindless Eating and Slim by Design, which have been translated into more than 25 languages.

In one of his most famous studies, 54 volunteers were served tomato soup. Half were served from normal bowls and half from “bottomless bowls” which had hidden tubes that imperceptibly refilled the bowls. Those with the bottomless bowls ate, on average, 73 percent more soup but they did not report feeling any fuller than the people who ate from normal bowls. Eating is evidently not about filling a stomach, but about emptying a bowl.

Wansink was awarded an Ig Nobel Prize in 2007 for his bottomless-bowl study. Another Ig Nobel that year went to a study reporting that rats cannot always tell the difference between someone speaking Japanese backwards and someone speaking Dutch backwards.

But many people took the bottomless-bowl study seriously, including Wansink. If eating is about emptying bowls, then serving food in smaller bowls might be an effective dieting plan. In other studies, he reported that people eat more when they use bigger plates, when the snacks are put in larger bowls at Super Bowl parties, and when popcorn, even stale popcorn, is put in larger cartons.

In 2016 the trouble started. In a blog post titled, “The Grad Student Who Never Said No,” Wansink wrote that a PhD student who came to work in his lab was given data collected at an all-you-can-eat Italian buffet:

When she arrived, I gave her a data set of a self-funded, failed study which had null results. I said, “This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.” I had three ideas for potential Plan B, C, & D directions (since Plan A had failed)…
Every day she came back with puzzling new results, and every day we would scratch our heads, ask “Why,” and come up with another way to reanalyze the data with yet another set of plausible hypotheses. Eventually we started discovering solutions.

Email correspondence surfaced in which Wansink advised the graduate student to separate the diners into “males, females, lunch goers, dinner goers, people sitting alone, people eating with groups of 2, people eating in groups of 2+, people who order alcohol, people who order soft drinks, people who sit close to buffet, people who sit far away, and so on.” Then she could look at different ways in which these subgroups might differ: “# pieces of pizza, # trips, fill level of plate, did they get dessert, did they order a drink, and so on.” Wansink concluded that she should “Work hard, squeeze some blood out of this rock.” She responded, “I will try to dig out the data in the way you described.”

Wansink and the student were data mining—that is, rummaging through data looking for patterns. By never saying no, the grad student got four papers (now known as the “pizza papers”) published with Wansink as a co-author. The most famous one reported that men eat 93 percent more pizza when they dine with women. This is the kind of data mining that Nobel laureate Ronald Coase described with the cynical comment, “If you torture data long enough, it will confess.”
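To see why this kind of slicing almost guarantees a "finding," here is a minimal Python sketch. It is not the lab's actual analysis. It assumes a made-up data set of 200 diners, 10 yes/no groupings, and 5 outcome measures, all of which are pure random noise, and it uses a simple normal-approximation test. Every name and number in it is hypothetical.

```python
import math
import random
import statistics


def two_sided_p(a, b):
    """Approximate two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def mine_null_data(n_diners=200, n_splits=10, n_outcomes=5, alpha=0.05, seed=0):
    """Compare every subgroup split on every outcome in data that contain no real effect."""
    rng = random.Random(seed)
    # Hypothetical groupings (e.g. lunch vs. dinner): random yes/no labels.
    splits = [[rng.random() < 0.5 for _ in range(n_diners)] for _ in range(n_splits)]
    # Hypothetical outcomes (e.g. slices eaten): noise unrelated to any grouping.
    outcomes = [[rng.gauss(0, 1) for _ in range(n_diners)] for _ in range(n_outcomes)]
    hits = 0
    for split in splits:
        for outcome in outcomes:
            yes = [v for v, flag in zip(outcome, split) if flag]
            no = [v for v, flag in zip(outcome, split) if not flag]
            if two_sided_p(yes, no) < alpha:
                hits += 1  # a "publishable" result that is really just noise
    return hits, n_splits * n_outcomes


if __name__ == "__main__":
    hits, total = mine_null_data()
    print(f"{hits} of {total} comparisons look significant in pure noise")
    print(f"chance of at least one false alarm: {family_wise_error(total):.2f}")
```

With 50 comparisons and a 5 percent threshold, the chance of at least one false alarm is about 92 percent if the tests are independent. A lab that keeps slicing will rarely come back empty-handed, whether or not anything real is in the data.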

A Cornell student who had worked as an intern in Wansink’s lab said, “I remember him saying it so clearly: ‘Just keep messing with the data until you find something.’” She was so uncomfortable with this directive that she left the lab before her internship ended.

Critics began assembling a Wansink Dossier that listed errors, inconsistencies, and dubious practices in his studies. Wansink responded that, “There was no fraud, no intentional misreporting, no plagiarism, or no misappropriation.”

Wansink’s biggest mistake seems to have been a misplaced faith in data mining, a belief that there is nothing wrong with trying to “squeeze some blood out of this rock.” In September 2018, a Cornell faculty committee investigating Wansink concluded that he had “committed academic misconduct,” and Wansink resigned.

Wansink is hardly alone in data mining, though he is one of the few who have been punished for their misdeeds. Here are some examples: Asian-Americans are prone to heart attacks on the fourth day of the month (British Medical Journal). Chinese-Americans are unusually vulnerable to diseases of the zang and fu organs associated with their birth year (The Lancet). Jews can postpone death until after the celebration of Yom Kippur (American Sociological Review). The stock market does well if certain teams win the Super Bowl (Journal of Finance). HIV patients can be aided by distant healers sending positive thoughts from thousands of miles away (Western Journal of Medicine). People whose names have positive initials, such as ACE or VIP, live longer than do people with negative initials, such as PIG or DIE (Journal of Psychosomatic Research). These studies (and many, many more) were published in highly respected peer-reviewed journals, yet all are complete nonsense and none could be replicated with fresh data.

Yet Daryl Bem, a prominent social psychologist, has encouraged researchers to torture their data:

Examine [the data] from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don’t like, or trials, observers, or interviewers who gave you anomalous results, place them aside temporarily and see if any coherent patterns emerge. Go on a fishing expedition for something, anything, interesting.

Using the fishing-expedition approach, Bem was able to discover evidence for some truly incredible claims, such as “retroactive recall”: People are more likely to remember words during a recall test if they study the words after they take the test. As might be expected, other researchers could not replicate Bem’s reported results.

Which brings us to big data and big computers. The more data we have, the more opportunities there are to torture the data in search of provocative oddities. Separate the data by sex. By race. By age. By time period. Discard data that weaken the case you are trying to make.
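The arithmetic behind this can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the variables are independent random numbers, there are 30 observations each, and the function names are invented for this sketch. It shows how quickly the number of possible pairings grows and how the strongest coincidental correlation grows with it.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def n_pairs(n_vars):
    """Number of distinct variable pairs that could be compared."""
    return n_vars * (n_vars - 1) // 2


def strongest_spurious_correlation(n_vars, n_obs=30, seed=1):
    """Largest |correlation| among unrelated random variables (hypothetical data)."""
    rng = random.Random(seed)
    # Each variable is independent noise, so any correlation found is coincidence.
    variables = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(x, y)) for x, y in itertools.combinations(variables, 2))


if __name__ == "__main__":
    for k in (5, 20, 50):
        best = strongest_spurious_correlation(k)
        print(f"{k:3d} variables, {n_pairs(k):5d} pairs, strongest |r| = {best:.2f}")
```

Because the larger runs reuse the same random variables as the smaller ones and add more, the strongest correlation can only stay the same or rise as variables are added. None of it means anything.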

For example, the fourth-day heart-attack study omitted data for heart diseases that contradicted the conclusion. So did the study relating diseases to birth year. The death-postponement study looked at deaths, among people who were not necessarily Jewish, before just one of the many Jewish celebrations that might have been analyzed. The Super Bowl study counted some AFC teams as NFC teams. The HIV study looked at multiple measures of health. The positive/negative initials study used an unpersuasive list of initials and an incorrect grouping of data. The researchers tortured the data. They didn’t say no, because something can always be squeezed out of the driest stone.

This is one reason why artificial intelligence (AI) is often brittle. AI algorithms can torture data faster than any human and can find an essentially unlimited number of idiosyncrasies that humans might overlook. Computers don’t get tired and they never say no.
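One way to picture this brittleness is to let a program pick the best of many meaningless predictors and then check that pick against fresh data. The Python sketch below is a toy illustration, not any particular AI system: the target and all 500 predictors are assumed to be random noise, the data are split in half, and the function name `mine_then_replicate` is made up for this example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mine_then_replicate(n_predictors=500, n_obs=40, seed=2):
    """Pick the noise predictor that best fits the first half, then test it on the second."""
    rng = random.Random(seed)
    half = n_obs // 2
    # Hypothetical target and predictors: all independent random numbers.
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    predictors = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_predictors)]
    # "Torture" step: search every predictor for the best in-sample fit.
    best = max(predictors, key=lambda p: abs(pearson(p[:half], target[:half])))
    in_sample = pearson(best[:half], target[:half])
    # Replication step: the same predictor, scored on data it was not selected on.
    fresh = pearson(best[half:], target[half:])
    return in_sample, fresh


if __name__ == "__main__":
    in_sample, fresh = mine_then_replicate()
    print(f"correlation in the mined half: {in_sample:+.2f}")
    print(f"correlation in the fresh half: {fresh:+.2f}")
```

The search has no way of knowing that its winner is meaningless. Since the selected pattern is pure coincidence, there is no reason to expect it to hold up in the fresh half, which is the same fate the published studies above met when others tried to replicate them.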

Alas, computers don’t have a BS detector, because they don’t understand what the data mean. They are very much like Nigel Richards, who won the French-language Scrabble World Championship twice without knowing the meaning of any of the words he spelled. Computers can torture data for patterns, but they do not understand and cannot assess what they find. We can. Whenever we hear a provocative claim, we should flip on our BS detector and consider the possibility that the data have been bullied, tormented, and tortured.

By the way, did you hear about the study that found that on days when Trump tweets the word “with” frequently, there is generally a drop in the price of tea in China four days later?
