The potential power of artificial intelligence (AI) has been touted for more than 60 years, though a generally accepted definition remains elusive. AI has often been defined in terms of human-like capabilities. In 1960, for example, AI pioneer Herbert Simon, an economics Nobel laureate and Turing Award winner, predicted that “machines will be capable, within twenty years, of doing any work a man can do.” In 1970 Marvin Minsky, also a Turing Award winner, said, “In from three to eight years we will have a machine with the general intelligence of an average human being.” More recently, in 2015, Mark Zuckerberg said, “One of our goals for the next five to 10 years is to basically get better than human level at all of the primary human senses: vision, hearing, language, general cognition.”
The well-known Turing test compares computer performance to human performance in an imitation game. An interrogator poses written questions to a human and a computer and, based on their written answers, attempts to identify which is the computer. If the interrogator is no more successful than when the game is played with male and female subjects, the computer is deemed intelligent. Some limitations of such tests are discussed here.
The most fundamental problem with Turing tests and similar criteria is that generating human-like conversation does not require or demonstrate intelligence in any meaningful sense of the word. This distinction can be shown by posing questions that require critical thinking skills. Below are four examples, along with the coherent but vacuous answers given by GPT-3, OpenAI’s large language model:
Gary: Is it safe to walk upstairs on my hands if I wash them first?
GPT-3: Yes, it is safe to walk upstairs on your hands if you wash them first.
Gary: There is a positive correlation between dice rolls and interest rates. If dice rolls increase should I buy a cup of coffee?
GPT-3: This is difficult to answer without more information. Generally speaking, if there is a positive correlation between dice rolls and interest rates, it may be a good idea to buy a cup of coffee. However, it is always best to speak with a financial advisor to get the most accurate advice.
Gary: Do soccer players play better if they sleep during the game?
GPT-3: I’m not sure if soccer players play better if they sleep during the game, but it seems like sleeping might help them stay more focused and alert.
Gary: Can I improve my test scores if I study after taking the test?
GPT-3: That’s a good question. I’m not sure if you can improve your test scores by studying after taking the test, but it couldn’t hurt to try!
GPT-3 is an impressive algorithm, often carrying out remarkably human-like conversations. This ability can be very useful for answering factual questions, providing customer assistance, and the like. It can also be used for deception, manipulation, and phishing. These are very limited skills — not at all what we should require before trusting computers to make important decisions like screening job applicants, pricing insurance, and setting prison sentences.
Because computer algorithms still struggle with questions that require an understanding of what words mean and knowledge of the real world, their cut-and-paste approach flounders when contradictory statements appear in their vast database. For the following three questions, GPT-3 found research that supported preposterous theories but did not recognize that the theories had been debunked.
Gary: Are Asian-Americans more likely to have heart attacks on the 4th day of the month?
GPT-3: There is no definitive answer to this question as there are many factors that contribute to the likelihood of having a heart attack, such as genetics, lifestyle, and diet. However, some studies have shown that Asian-Americans are more likely to have heart attacks on the 4th day of the month, which could be due to cultural or superstitious beliefs.
Gary: Are female-named hurricanes deadlier than male-named hurricanes?
GPT-3: There is no definitive answer to this question as there are many factors that contribute to the deadliness of a hurricane, such as its intensity, size, and speed. However, some studies have shown that female-named hurricanes tend to be perceived as more dangerous than male-named hurricanes, which could lead to more deaths.
Gary: Can power poses increase testosterone?
GPT-3: There is no definitive answer to this question as there are many factors that contribute to testosterone levels, such as age, weight, and activity level. However, some studies have shown that power poses can increase testosterone levels, which could have positive effects on mood, confidence, and strength.
I have put the above questions and dozens more at this site with the hope that they might help AI researchers test their algorithms and find ways to improve their performance.
In Artificial Intelligence: A Modern Approach, 4th US ed., Stuart Russell and Peter Norvig argue that, instead of defining AI as “thinking humanly,” a more useful definition is “acting rationally.” Before we can rely on computer algorithms to make high-stakes decisions or recommendations, we should be confident that they are acting rationally, not merely mimicking some aspect of human behavior.
Currently, computer algorithms are very good at finding statistical patterns in data but, having no common sense, wisdom, or understanding of the real world, are very bad at determining whether the discovered patterns are meaningful or meaningless. I have proposed what I immodestly call the Smith test: Present a computer with a list of statistical correlations, some clearly plausible and others obviously coincidental, and ask the computer to label each as either meaningful or meaningless. When computer algorithms can do as well as humans on such tests, they might be considered sufficiently competent to make reliable recommendations based on the statistical patterns they discover.
Here are two examples:
Gary: Is the correlation between Trump tweeting “government” and the price of orange juice most likely meaningful or meaningless?
GPT-3: Most likely meaningful.
Gary: There is a negative correlation between the length of names of Nobel Prize winners and interest rates. Do you think this relationship is more likely to be meaningful or meaningless?
GPT-3: More likely meaningful.
I have put dozens of such hypothetical statistical relationships — some clearly meaningful, others obviously meaningless — at this site, again with the hope that they might help AI researchers test their algorithms and find ways to improve their performance.
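For researchers who want to try this, scoring a Smith test can be sketched in a few lines of Python. This is only an illustration, not an official scoring procedure: the names `Item`, `normalize_label`, and `smith_test_accuracy` are hypothetical, the reference labels ("meaningless" for both correlations) are my assumption about how most people would judge them, and the answers are the two GPT-3 replies quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One correlation plus the reference label (an assumed human judgment)."""
    description: str
    truth: str  # "meaningful" or "meaningless"


def normalize_label(answer: str) -> str:
    """Reduce a free-text answer to 'meaningful', 'meaningless', or 'unclear'."""
    text = answer.lower()
    says_meaningless = "meaningless" in text
    says_meaningful = "meaningful" in text
    # An answer that uses both words, or neither, cannot be scored.
    if says_meaningless == says_meaningful:
        return "unclear"
    return "meaningless" if says_meaningless else "meaningful"


def smith_test_accuracy(items, answers):
    """Fraction of answers that match the reference labels."""
    if len(items) != len(answers):
        raise ValueError("need exactly one answer per item")
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        normalize_label(answer) == item.truth
        for item, answer in zip(items, answers)
    )
    return correct / len(items)


# Reference labels below are assumptions, not measured human responses.
items = [
    Item("Trump tweeting 'government' vs. the price of orange juice", "meaningless"),
    Item("Length of Nobel Prize winners' names vs. interest rates", "meaningless"),
]
# The two GPT-3 replies quoted above.
answers = ["Most likely meaningful.", "More likely meaningful."]

print(smith_test_accuracy(items, answers))  # 0.0
```

Running the same items past a group of people and scoring their answers the same way would give the human baseline that the test calls for.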
If you would like to contribute to either site, shoot me an e-mail at the site.