^{Gary Smith

August 19, 2020

7

Artificial Intelligence, Education}

Don’t Blame AI for the British A-Level Test Scandal

_{When 39 percent of the final grades assigned during COVID-19 were lower than teacher predictions, it was headline news. But what happened?}


Many years ago, when I was a young assistant professor of economics, I had to endure a minor hazing ritual—serving for one year on the admissions committee for the PhD program. As a newbie, I was particularly impressed by a glowing letter of recommendation that began, “This is the best student I have had in 30 years.” The applicant’s test scores were not off-the-charts but the letter was number 1.

A dean who chaired the admissions committee year after year advised me to calm down because this professor wrote a recommendation that celebrated “the best student I have had in 30 years” every year. The committee had a chuckle at my expense.

I’ve now been teaching for nearly 50 years and I know firsthand the teachers’ temptation to praise students generously. We want our students to succeed and we are happy to help.

Admissions committees inevitably take this puffery into account. For the professor who, year after year, identified one of his students as the best student in 30 years, we discounted the claim because of the recommender’s reputation and the strong but not exceptional standardized test scores. But we did take into account the professor’s judgment that this student was the best applicant from his school during that year. We deflated the level of the praise but we paid attention to the rank ordering.

These memories came back during the recent hullabaloo related to British A-Level test grades, which are used in the United Kingdom to determine university admissions. Because of the COVID-19 pandemic, the scheduled summer 2020 tests were canceled. Instead, the government’s Office of Qualifications and Examinations Regulation (Ofqual) was given the thankless task of estimating the grades (A, B, C…) that would have been assigned for more than 700,000 subject tests that 275,000 students signed up for but did not take.

Ofqual collected two kinds of data from the students’ teachers:

A prediction of “the grade that student would have been most likely to achieve if teaching and learning had continued and students had taken their exams as planned.”
A rank-ordering of the students who were predicted to receive the same grade on a particular subject test.

The Ofqual team relied on plenty of research that supported my personal experience; for example, teachers are typically twice as likely to be too generous as to be too stingy. Specifically, teacher grade expectations are accurate about half the time, too optimistic one-third of the time, and too pessimistic one-sixth of the time.

If Ofqual had simply assigned each student the grades that the teachers had reported as their expectations, the percentage of tests receiving the highest possible grade (A*) would have increased from 7.7 percent in 2019 to 13.9 percent in 2020. The percentage receiving A* or A grades would have increased from 25.2 percent to 37.7 percent and the percentage receiving grades of B or higher would have increased from 51.1 percent to 65 percent.

Interviews with teachers also revealed that almost all had submitted predictions of how their students would have done on a “good day.” The Ofqual team could have let it go at that, perhaps attaching disclaimers warning that the grades had been inflated by teacher generosity.

Instead, they made the politically dangerous decision to reduce many grades below teacher expectations in order to achieve a grade distribution comparable to previous years. The most obvious way to do this is to rely on the teachers’ rank orderings. An extreme case would be where generous teachers predicted a grade one level too high for every student but made a perfect assessment of the rank order. Reducing every grade by one level would be a perfect solution.

In practice, various studies have concluded that the correlation between predicted and actual grades is not perfect (1.0) but more like 0.8, which still suggests that the rank order assessments provide useful information for adjusting grades.

The Ofqual team did a detailed statistical analysis of a dozen different adjustment methods and eventually settled on a system of adjusting the scores on each subject test at each school up or down (usually down) so that the average score on the subject test would be comparable to previous years and also reflect the rank-ordering teachers had sent them. This meant, for example, that if a student was ranked in the 50th percentile among students taking a particular subject test at a school and 50th percentile students in the past had received B grades, this student would be given a B grade, even if the teacher had reported an expectation of an A grade or a C grade. Because teachers are typically more generous than stingy, grades were more likely to be adjusted downward than upward.
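The percentile idea in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration of the principle, not Ofqual's actual model: the function name `assign_by_rank`, the grade list, and the midpoint rule for turning a rank into a percentile position are all assumptions made for the example.

```python
from typing import Dict, List

# Grade labels from best to worst (an assumed list for this sketch).
GRADES = ["A*", "A", "B", "C", "D", "E", "U"]


def assign_by_rank(ranked_students: List[str],
                   historical_share: Dict[str, float]) -> Dict[str, str]:
    """Give each student the grade that students at the same rank
    position earned in past years at the same school and subject.

    ranked_students  -- teacher's rank order, best student first
    historical_share -- fraction of past students receiving each grade
    """
    n = len(ranked_students)
    result = {}
    for i, student in enumerate(ranked_students):
        # ASSUMPTION: use the midpoint of the student's slot as the
        # position from the top (0.0 = best, 1.0 = worst).
        position = (i + 0.5) / n
        cumulative = 0.0
        grade = GRADES[-1]  # fallback if the shares sum to less than 1
        for g in GRADES:
            cumulative += historical_share.get(g, 0.0)
            if position < cumulative:
                grade = g
                break
        result[student] = grade
    return result


# Hypothetical school: in past years 20% A, 40% B, 40% C, and no A*.
past = {"A": 0.2, "B": 0.4, "C": 0.4}
print(assign_by_rank(["s1", "s2", "s3", "s4", "s5"], past))
```

Notice that the teachers' predicted grades never enter the function; only the rank order and the school's history matter. A grade with a historical share of zero can never be awarded, however strong the top-ranked student is.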

Ofqual also made a few tweaks to account for situations where average scores from previous years might be misleading. If the sample was small, then tying current scores to previous scores would be perilous. So, with 5 or fewer students, no adjustment was made—the teacher prediction was used as the final grade. With more than 15 students, the teacher predictions were ignored. In between, with 6 to 15 students, the final scores were a combination of historical scores and teacher predictions.
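The three size bands can be written as a weight on the teacher prediction. The thresholds of 5 and 15 come from the description above; the linear taper between them, the blending of grade shares, and the names `teacher_weight` and `blend_shares` are assumptions made for this sketch, since the exact combination rule is not spelled out here.

```python
from typing import Dict


def teacher_weight(n_students: int, small: int = 5, large: int = 15) -> float:
    """Weight given to teacher predictions for a group of n_students."""
    if n_students <= small:
        return 1.0   # 5 or fewer: teacher prediction used as is
    if n_students > large:
        return 0.0   # more than 15: teacher predictions ignored
    # ASSUMPTION: a straight-line taper for 6 to 15 students.
    return (large + 1 - n_students) / (large + 1 - small)


def blend_shares(teacher: Dict[str, float],
                 historical: Dict[str, float],
                 n_students: int) -> Dict[str, float]:
    """Mix teacher-predicted and historical grade shares by group size."""
    w = teacher_weight(n_students)
    grades = set(teacher) | set(historical)
    return {g: w * teacher.get(g, 0.0) + (1 - w) * historical.get(g, 0.0)
            for g in grades}


for n in (3, 6, 10, 15, 30):
    print(n, round(teacher_weight(n), 2))
```

Under these assumptions a class of 3 keeps the full teacher prediction, a class of 10 gets a little over half weight on it, and a class of 30 gets none.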

Despite Ofqual’s best efforts, there were problems. First, the anchoring of the current grade distribution to the historical grade distribution made it very hard for high-achievers at low-scoring schools to get good grades. An extreme example would be a school where no one had previously received an A* grade on a particular subject test. It would not be possible for a current student to get an A* grade, no matter how talented the student was.

Second, the heavier weighting of teacher predictions for smaller sample sizes meant that students in small samples got the full benefit of teacher generosity. This weighting disproportionately benefited students at elite schools who took tests in elite subjects such as classical Greek and the history of art. For ordinary students at ordinary schools who took tests in ordinary subjects, the teacher predictions were ignored.

Overall, scores went up slightly (propelled in part by the reliance on teacher predictions for the small samples). But what got the headlines was that 39 percent of the final grades were lower than teacher predictions. The outcry was not a surprise, nor was the argument that teachers know their students better than artificial intelligence (AI) algorithms.

I have written three books warning of the dangers of AI algorithms, but these grade reductions were not an example of AI run amok. Ofqual intended to adjust grades downward to account for teacher generosity and make 2020 grades comparable to 2019 grades, and it tried very hard to find a fair and reasonable way to do so.

AI algorithms are quite different. They typically search for patterns that will achieve specific goals, like matching photographs or playing board games. When the rules and objectives are clear and the task can be repeated a very large number of times, the successes may be astonishing, including the conquest of human experts at backgammon, checkers, chess, and Go.

When the rules and goals are ambiguous or in flux, however, AI algorithms can flop disastrously. Change a few pixels in a photograph or change the dimensions of a Go board, and AI algorithms calibrated on different data do poorly. AI algorithms that screen job applicants, price car insurance, approve loan applications, or determine prison sentences on the basis of Facebook posts, Twitter likes, website visits, smartphone usage, and the like should not be trusted.

I don’t know if Ofqual could have designed a better way for adjusting for teacher generosity, but I do know that an AI data-mining algorithm would almost surely have done worse, possibly much worse.
