Large Language Models (LLMs) Flunk Word Game Connections
Despite the hype, ChatGPT and its competitors, in all their iterations, are still just text-generators based on statistical patterns in the text databases they train on.

The public release of ChatGPT on November 30, 2022, astonished the world. Marc Andreessen spoke for many when he described it as “Pure, absolute, indescribable magic.”
The spell was hard to resist. Bill Gates said that ChatGPT was “every bit as important as the PC, as the internet.” Wharton professor Ethan Mollick predicted that the productivity gains might be larger than the gains from steam power. In a seeming competition for the most exaggerated claims, Sundar Pichai, CEO of Alphabet and Google, proclaimed that it is “more profound than fire,” and Turing Award winner Geoffrey Hinton declared, “I think it’s comparable in scale with the Industrial Revolution or electricity — or maybe the wheel.”
We are now told that artificial general intelligence (AGI) is upon us. In November 2024, OpenAI’s Sam Altman predicted the arrival of AGI in 2025. A year earlier, in October 2023, Blaise Agüera y Arcas and Peter Norvig wrote a piece titled, “Artificial General Intelligence Is Already Here.”
Superintelligence?
Yesterday, Altman upped the ante even further, boasting that we are about to move beyond AGI to superintelligence:
We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies. We continue to believe that iteratively putting great tools in the hands of people leads to great, broadly-distributed outcomes.
We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word. We love our current products, but we are here for the glorious future. With superintelligence, we can do anything else. Superintelligent tools could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own, and in turn massively increase abundance and prosperity.
What a wonderful example of self-serving ballyhoo, evidently intended to cash in on the inflating AI bubble.
ChatGPT and its competitors, in all their iterations, are still just text-generators based on statistical patterns in the text databases they train on. They do not know how words relate to the real world. They are consequently unable to do many of the things human brains can do — including assessing the validity of the text they input and output, engaging in critical thinking, and applying common sense. Human trainers may clean up some of the obvious mistakes made by large language models (LLMs) but such cleansing does not give LLMs human powers of reasoning and understanding.
I have given many examples of the unintelligence of LLMs; for example, here, here, and here. I was reminded of the inadequacies of LLMs today when I played the New York Times’ wildly popular word puzzle, Connections.
The challenge in Connections is finding the theme
The Times gives these instructions:
Connections is a word game that challenges players to find themes between words.
Players are given 16 words and find groups of four items that share something in common. For example:
FISH: Bass, Flounder, Salmon, Trout
Things that start with FIRE: Ant, Drill, Island, Opal
The Connection words today (January 6, 2025) are:
BOOT ABC RUM PUFF
ARS UMBRELLA GASP BROAD
BUT BREEZE PICNIC HEAVE
GENERAL BLANKET PANT PIE
The puzzle is challenging because there are many ways in which individual words can be connected (PUFF might go with BREEZE or GASP, but BREEZE doesn’t go with GASP); however, there is only one correct four-word/four-group solution. (I will give the answer below.)
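A quick back-of-the-envelope count shows just how large the search space is. Splitting 16 words into four unordered groups of four can be done in 16!/(4!⁴ · 4!) ways — over 2.6 million groupings, only one of which is correct. A minimal sketch (the function name is my own):

```python
from math import factorial

def connections_groupings(n_words: int = 16, group_size: int = 4) -> int:
    """Count the ways to split n_words into unordered groups of group_size."""
    n_groups = n_words // group_size
    # factorial(n_words) // factorial(group_size)**n_groups counts ordered
    # groups; dividing by factorial(n_groups) ignores the order of the groups.
    return factorial(n_words) // (factorial(group_size) ** n_groups * factorial(n_groups))

print(connections_groupings())  # 2627625
```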
I thought that LLMs might do well at Connections because they surely have trained on textual databases that connect words with other words and with various themes, even if they do not understand what the words or themes mean.
I tried OpenAI o1 with this prompt: Please separate these 16 words into 4 groups of 4 words, with the words in each group belonging to a specific category (e.g., dog, cat, fish, and parrot for the category “Household Pets”): BOOT, ABC, RUM, PUFF, ARS, UMBRELLA, GASP, BROAD, BUT, BREEZE, PICNIC, HEAVE, GENERAL, BLANKET, PANT, PIE
The response:
Here are the 16 words grouped into four categories of four words each:
Category 1: Types of Covers
BOOT
BLANKET
UMBRELLA
GENERAL (as in “general cover”)
Category 2: Exclamations/Utterances
RUM
GASP
HEAVE
PANT
Category 3: Types of Movement or Air
BREEZE
PUFF
BROAD
PICNIC (implying leisurely movement or outing)
Category 4: Words Often Found in Phrases
BUT
ARS (could be shorthand or colloquial in some contexts)
PIE
ABC
Some of the words do go together, but other purported combinations verge on the bizarre.
I tried again and got a different response:
Here are the 16 words grouped into 4 categories, each containing 4 words:
1. Clothing or Accessories
• BOOT
• UMBRELLA
• BLANKET
• PANT
2. Expressions or Sounds
• GASP
• PUFF
• HEAVE
• BREEZE
3. Food or Drink
• RUM
• PICNIC
• PIE
• ABC (could refer to a type of cooking or food basics)
4. Miscellaneous or General Concepts
• ARS (Latin for art)
• BROAD
• BUT
• GENERAL
My overall assessment
Again, a few valid connections and many puzzling groupings. It is also revealing that the groupings vary so much from one response to the next.
I also tried Google’s Gemini 2.0 Flash, Microsoft’s Bing Copilot with GPT 4, and Anthropic’s Claude. They all failed. None seemed clearly better than the others, but Copilot was particularly inept: it used three words (BLANKET, PANT, and PUFF) twice and left out three others (GENERAL, ARS, and ABC).
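Copilot’s answer fails before the themes are even considered: a valid Connections answer must use each of the 16 words exactly once, in groups of exactly four. That structural check is trivial to mechanize — a sketch of a hypothetical validator (the function name and messages are my own):

```python
from collections import Counter

# The 16 puzzle words from January 6, 2025.
PUZZLE_WORDS = {
    "BOOT", "ABC", "RUM", "PUFF", "ARS", "UMBRELLA", "GASP", "BROAD",
    "BUT", "BREEZE", "PICNIC", "HEAVE", "GENERAL", "BLANKET", "PANT", "PIE",
}

def grouping_problems(groups):
    """Return a list of structural problems with a proposed grouping."""
    problems = []
    flat = [word for group in groups for word in group]
    # Words used more than once (Copilot reused BLANKET, PANT, and PUFF).
    for word, n in sorted(Counter(flat).items()):
        if n > 1:
            problems.append(f"{word} used {n} times")
    # Puzzle words never placed in any group (it omitted GENERAL, ARS, ABC).
    for word in sorted(PUZZLE_WORDS - set(flat)):
        problems.append(f"{word} missing")
    # Every group must contain exactly four words.
    for group in groups:
        if len(group) != 4:
            problems.append(f"group {group} has {len(group)} words, not 4")
    return problems
```

An empty list means the grouping is at least well-formed; it says nothing about whether the themes are right, which is the part the LLMs also got wrong.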
Large language models are magical, but they are not intelligent in any meaningful sense of the word. There will continue to be modest improvements in their responses (assisted by human trainers), but the identification of statistical patterns in text they do not understand is not going to give us AGI, let alone superintelligence.
For those who tried at home, here’s the correct answer:
Breathe hard: PUFF, GASP, HEAVE, PANT
Catchall: UMBRELLA, BROAD, GENERAL, BLANKET
Metaphors for easy things: ABC, BREEZE, PICNIC, PIE
Synonyms for rear ends minus their last letters: BOOT, RUM, ARS, BUT