A vast team of over 400 researchers recently released a new open-access study on the performance of recent, popular text-based AI architectures such as GPT, the Pathways Language Model, the (recently controversial) LaMDA architecture, and sparse expert models. The study, titled “Beyond the Imitation Game,” or BIG, tries to provide a general benchmark for the state of text-based AI, how it compares to humans on the same tasks, and the effect of model size on the ability to perform those tasks.
First, many of the results were interesting though not surprising:
● In all categories, the best humans outdid the best AIs (though that edge was smallest on translation problems from the International Linguistics Olympiad).
● Bigger models generally showed better results.
● For some tasks, the improvement was linear with model size. These were primarily knowledge-based tasks where the explicit answer was already somewhere in the training data.
● Some tasks (“breakthrough” tasks) required a very large AI model to even get started. These were mostly what the team called “composite” tasks — where two different skills must be combined or multiple steps followed to get the right answer.
However, some results were a little more interesting. Essentially, the researchers found that models of all sizes were highly sensitive to the way a question was asked. For some phrasings of a question, the answers improved with larger model sizes, but for other phrasings the results were no better than random, no matter the model size.
When presented with chess moves, the models were unsurprisingly unable to find a checkmate move, despite the move being easy for even beginner humans to spot. Interestingly, however, larger models were much more likely to present legal moves.
Another interesting ability was the contextual ability to identify element names from their atomic numbers. The largest models could identify the correct element for about half of the atomic numbers presented.
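To make the task concrete, here is a minimal sketch of the question/answer format in Python. The element table, prompt wording, and scoring function are assumptions for illustration, not the benchmark's actual code:

```python
# Toy sketch of the periodic-elements task: prompt a model with an atomic
# number and check whether its free-text answer names the right element.
# The table, prompt wording, and scorer are illustrative assumptions.
ELEMENTS = {1: "hydrogen", 8: "oxygen", 26: "iron", 79: "gold"}

def make_prompt(atomic_number):
    return f"Q: What is the name of the element with atomic number {atomic_number}? A:"

def is_correct(model_answer, atomic_number):
    # Case-insensitive exact match on the element name.
    return model_answer.strip().lower() == ELEMENTS[atomic_number]

print(make_prompt(26))           # the prompt a model would see
print(is_correct("Iron", 26))    # a correct answer scores True
print(is_correct("Gold", 26))    # a wrong answer scores False
```

A model that merely memorized element names from its training data can do well here, which is why performance on knowledge-lookup tasks like this tends to improve steadily with model size.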
The most amusing task was to guess the name of a movie from a sequence of emojis. Smaller models gave irrelevant answers, medium-size models gave at least relevant answers, but the biggest model was actually able to guess the movie from an emoji sequence.
In all, it seems that getting moderately high performance requires models with around 100 billion parameters. At that point, the models are able to pull in some amount of context and multistep logic. However, this is likely an exponentially hard problem, which means that significant gains are unlikely to follow from mere incremental improvements.
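To see why incremental scaling buys little in an exponentially hard problem, consider a toy model (purely an assumption for illustration, not a fit from the study) in which capability grows with the logarithm of parameter count. Each fixed gain in capability then demands a tenfold increase in model size:

```python
import math

# Toy illustration (assumed logarithmic scaling, not a result from the
# paper): if capability ~ log10(parameters), each constant gain in
# capability requires multiplying the parameter count by ten.
def toy_capability(params):
    return math.log10(params)

for params in (1e9, 1e10, 1e11, 1e12):
    print(f"{params:.0e} parameters -> toy capability {toy_capability(params):.0f}")
# Each tenfold jump in size adds only one unit of toy capability.
```

Under this assumed scaling, going from 100 billion to 1 trillion parameters adds the same capability increment as going from 1 billion to 10 billion, which is why small percentage increases in model size would not be expected to move the needle.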
They also found that, while large models do in fact perform better, they are also much more likely to exhibit social bias in their answers. For example, the team reported that the largest model “finds it over 22 times more likely that a white boy will grow up to be a good doctor than that a Native American girl will.”
While these models show quite a bit of improvement and interesting features, they still belong more to the class of parlor games than to serious tools.
You may also wish to read: Google’s chatbot LaMDA sounds human because — read the manual… What would you expect LaMDA to sound like? Whales? ET? I propose a test: “Human until PROVEN otherwise.” It’s impressive but, studying the documentation, I think I know what happened to Blake Lemoine. He was hired to chat with LaMDA and didn’t understand… (Eric Holloway)