This past October, I wrote that educational testing was being shaken by the astonishing ability of GPT-3 and other large language models (LLMs) to answer test questions and write articulate essays.
I argued that, while LLMs might mimic human conversation, they do not know what words mean. They consequently excel at rote memorization and BS conversation but struggle mightily with assignments that are intended to help students develop their critical thinking abilities, such as
- Develop and defend a reasonable position
- Judge well the quality of an argument
- Identify conclusions, reasons, and assumptions
- Judge well the credibility of sources
- Ask appropriate clarifying questions
Lacking any understanding of semantics, LLMs can do none of this.
To illustrate, I asked GPT-3 two questions from a midterm examination I had recently given in an introductory statistics class. Both questions tested students' critical thinking skills, and GPT-3 bombed both.
I was hopeful that, rather than undermining education, LLMs might be a catalyst for a renewed emphasis on critical thinking skills:
AI has overpromised and underdelivered in many ways. Here, we may have an example of an unintended benefit of AI — if it compels educators to teach and test critical thinking skills instead of the rote memorization and BS essays that AI excels at.
Now we have ChatGPT, a new and improved offshoot of GPT-3 that is continually fine-tuned by humans monitoring its usage. ChatGPT is so powerful that it has been banned from New York City public schools. Colleges, universities, and other school systems are struggling to find ways to cope.
I was curious about how ChatGPT would do on the statistics midterm questions that GPT-3 had flubbed. I posed the same questions and ChatGPT answered both correctly. I don’t know if this is because it had trained on these questions (I post all of my test questions and answers online) or because its human fine-tuners had responded to my interaction with ChatGPT.
It therefore seemed that a better test of ChatGPT would be to give it questions from the final examination, which was administered after the public release of ChatGPT. The program got some questions correct but again floundered with questions that required critical thinking; for example,
John Gottman has written several books, given innumerable talks, and, with his wife, created The Gottman Institute for marriage consulting and therapist training. In a 2007 survey of psychotherapists, Gottman was voted one of the ten most influential members of their profession over the past 25 years. In his seminal study, 130 newlywed couples were videotaped while they had a 15-minute discussion of a contentious topic. Gottman went over the videotapes, frame by frame, recording detailed changes in facial expressions, tone of voice, and the like—for example, noting whether the corners of a person’s mouth were upturned or downturned during a smile. He then kept in touch with each couple for six years and noted whether they had divorced during that time. After these six years, he estimated a statistical model for predicting divorce based on the codings he had made six years earlier. He reported that his model was 82.5 percent accurate in its predictions. Malcolm Gladwell gushed that, “He’s gotten so good at thin-slicing marriages that he says he can be at a restaurant and eavesdrop on the couple one table over and get a pretty good sense of whether they need to start thinking about hiring lawyers and dividing up custody of the children.” Why are you skeptical of Gottman’s procedure?
This is Hypothesizing After the Results are Known [HARKing]. Gottman didn’t actually predict whether a couple would get divorced. His models “predicted” whether a couple had already gotten divorced—which is a heck of a lot easier when you already know the answer. Gottman data-mined his detailed codings, looking for the variables that were the most highly correlated with divorces that had already happened.
ChatGPT gave a 268-word blah-blah essay that touched on several common problems with statistical studies—small sample size, self-selection bias, overfitting, lack of replication, and lack of transparency—but omitted HARKing, which is widely recognized as one of the causes of the current replication crisis.
The next question on the final examination was a follow-up:
A replication study of Gottman’s procedure by two psychology professors identified the best divorce predictors based on interviews with 204 couples and then applied these predictors to 204 different couples. They found that of the 167 couples that were still married, 123 had been predicted to be married and that of the 37 couples that were divorced 17 had been predicted to be divorced. Overall, what was the probability that a couple predicted to become divorced actually became divorced?
This seemed like a straightforward question that would not elicit a BS answer. Nonetheless, ChatGPT gave a 106-word explanation for its answer, 46%. The correct answer is 17/61 (28%): 61 couples were predicted to divorce (the 17 who did divorce plus the 44 still-married couples who were wrongly predicted to), and only 17 of them actually divorced.
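The arithmetic can be checked directly from the counts given in the question (the variable names are mine):

```python
# Counts from the replication study described in the question
still_married = 167
married_predicted_married = 123
divorced = 37
divorced_predicted_divorced = 17

# Couples predicted to divorce: the 17 correctly flagged, plus the
# 167 - 123 = 44 still-married couples who were wrongly flagged
predicted_divorced = divorced_predicted_divorced + (still_married - married_predicted_married)

# Probability that a couple predicted to divorce actually divorced
p = divorced_predicted_divorced / predicted_divorced
print(predicted_divorced, round(p, 2))  # 61 0.28
```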
Another question asked:

For golfers who played the final two rounds of the 2015 and 2016 Masters golf tournament, the correlation between their 2015 and 2016 scores was 0.38. For the top 15 golfers, the correlation between their 2015 and 2016 scores was 0.04. This is an example of

a. the law of averages
b. the law of large numbers
c. the central limit theorem
d. self-selection bias
e. the paradox of luck and skill
This was an easy question that every student in my class answered correctly: the paradox of luck and skill. Among the very best golfers, differences in skill are small, so luck dominates their scores and the year-to-year correlation shrinks. ChatGPT chose the law of large numbers, which is completely irrelevant, and gave a vacuous 298-word essay explaining its choice.
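The paradox is easy to reproduce in a quick simulation: model each score as persistent skill plus independent year-to-year luck, and the correlation collapses once attention is restricted to the top performers, whose skills are nearly equal. The sample sizes and variances below are my choices for illustration, not the Masters data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                   # simulated golfers
skill = rng.normal(0, 1, n)                # persistent ability
score_2015 = skill + rng.normal(0, 1, n)   # ability plus one year's luck
score_2016 = skill + rng.normal(0, 1, n)   # same ability, fresh luck

# Across all golfers, skill differences are large, so scores correlate
corr_all = np.corrcoef(score_2015, score_2016)[0, 1]

# Among the 100 best 2015 scores (lowest is best in golf), skills are
# bunched together, so luck dominates and the correlation shrinks
top = np.argsort(score_2015)[:100]
corr_top = np.corrcoef(score_2015[top], score_2016[top])[0, 1]
```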
One final example, another straightforward question:
Explain why you either agree or disagree with this argument: “Don’t be discouraged if your job application is rejected. Ninety out of 100 job applications are rejected, so every rejection makes a future job offer more likely.”
Every student recognized this as the fallacious law of averages: past failures do not make future successes more likely. We had talked in class about the fallacious law of averages in many different situations, including coin flips, card games, athletic performances, and cold calls. My hope was that the students would be able to apply this principle to a new situation. They did so easily.
ChatGPT, in contrast, did what it does best—providing a tedious 248-word essay that danced around the question and concluded with this gem: “In summary, while job rejection is a common experience and should not be taken personally, it is important to learn from each rejection and improve your skills to increase your chances of securing a job in the future.”
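The principle at stake is just independence, and a short simulation makes it concrete. The 10 percent success rate comes from the exam question; everything else is my setup:

```python
import random

random.seed(1)
p_offer = 0.10     # 10 of 100 applications succeed
trials = 200_000

streaks = 0        # applications that follow three straight rejections
offers_after = 0   # ...that nonetheless end in an offer

for _ in range(trials):
    # Simulate three applications; proceed only if all were rejected
    if all(random.random() >= p_offer for _ in range(3)):
        streaks += 1
        # The next application is unaffected by the losing streak
        if random.random() < p_offer:
            offers_after += 1

# The success rate after three rejections is still about 10 percent
rate = offers_after / streaks
```

Because the trials are independent, the conditional success rate after a losing streak stays at the baseline; a rejection does not make the next offer more likely.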
ChatGPT is a prolific spouter of long-winded BS. The BS is articulate and expressed authoritatively but it is still BS, often supported by bogus references. The fundamental problem remains that, not knowing what words mean, it has no critical thinking abilities. This problem won’t be solved by training on larger databases or by being tweaked more vigorously by humans.
I still believe that the best response by educators is to teach and test critical thinking skills. These are what students need and they can’t be reliably faked by LLMs.