Mind Matters Natural and Artificial Intelligence News and Analysis

Large Language Models (LLMs) Flunk Word Game Connections

Despite the hype, ChatGPT and its competitors, in all their iterations, are still just text generators based on statistical patterns in the text databases they train on

The public release of ChatGPT on November 30, 2022, astonished the world. Marc Andreessen spoke for many when he described it as “pure, absolute, indescribable magic.”

The spell was hard to resist. Bill Gates said that ChatGPT was “every bit as important as the PC, as the internet.” Wharton professor Ethan Mollick predicted that the productivity gains might be larger than the gains from steam power. In a seeming competition for the most exaggerated claim, Sundar Pichai, CEO of Alphabet and Google, proclaimed that it is “more profound than fire,” and Turing Award winner Geoffrey Hinton declared, “I think it’s comparable in scale with the Industrial Revolution or electricity, or maybe the wheel.”

We are now told that artificial general intelligence (AGI) is upon us. In November 2024, OpenAI’s Sam Altman predicted the arrival of AGI in 2025. A year earlier, in October 2023, Blaise Agüera y Arcas and Peter Norvig wrote a piece titled “Artificial General Intelligence Is Already Here.”

Superintelligence?

Yesterday, Altman upped the ante even further, boasting that we are about to move beyond AGI to superintelligence:

We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies. We continue to believe that iteratively putting great tools in the hands of people leads to great, broadly-distributed outcomes.

We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word. We love our current products, but we are here for the glorious future. With superintelligence, we can do anything else. Superintelligent tools could massively accelerate scientific discovery and innovation well beyond what we are capable of doing on our own, and in turn massively increase abundance and prosperity.

What a wonderful example of self-serving ballyhoo, evidently intended to cash in on the inflating AI bubble.

ChatGPT and its competitors, in all their iterations, are still just text-generators based on statistical patterns in the text databases they train on. They do not know how words relate to the real world. They are consequently unable to do many of the things human brains can do — including assessing the validity of the text they input and output, engaging in critical thinking, and applying common sense. Human trainers may clean up some of the obvious mistakes made by large language models (LLMs) but such cleansing does not give LLMs human powers of reasoning and understanding.

I have given many examples of the unintelligence of LLMs elsewhere. I was reminded of the inadequacies of LLMs today when I played the New York Times’ wildly popular word puzzle, Connections.

The challenge in Connections is finding the theme

The Times gives these instructions:

Connections is a word game that challenges players to find themes between words.

Players are given 16 words and find groups of four items that share something in common. For example: 

FISH: Bass, Flounder, Salmon, Trout

Things that start with FIRE: Ant, Drill, Island, Opal  

The Connections words today (January 6, 2025) are:

BOOT ABC RUM PUFF

ARS UMBRELLA GASP BROAD

BUT BREEZE PICNIC HEAVE

GENERAL BLANKET PANT PIE

The puzzle is challenging because there are many ways in which individual words can be connected (PUFF might go with BREEZE or GASP, but BREEZE doesn’t go with GASP); however, there is only one correct solution of four groups with four words each. (I will give the answer below.)
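
To give a sense of how large the search space is, the number of distinct ways to split 16 words into four unordered groups of four can be computed directly. The short Python sketch below is only an illustration; the function name `count_groupings` is made up for this example and is not part of the puzzle or of any library.

```python
from math import factorial


def count_groupings(n_groups: int = 4, group_size: int = 4) -> int:
    """Count the ways to split n_groups * group_size distinct words
    into unordered groups of equal size.

    (Illustrative helper; the name is an assumption for this example.)
    """
    total = n_groups * group_size
    # Arrange all words, then divide out the order inside each group
    # and the order of the groups themselves.
    return factorial(total) // (factorial(group_size) ** n_groups * factorial(n_groups))


print(count_groupings())  # 2627625
```

Only one of those 2,627,625 possible groupings is the intended solution, so guessing blindly is hopeless; a solver has to recognize the themes.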

I thought that LLMs might do well at Connections because they surely have trained on textual databases that connect words with other words and with various themes, even if they do not understand what the words or themes mean.

I tried OpenAI o1 with this prompt: Please separate these 16 words into 4 groups of 4 words, with the words in each group belonging to a specific category (e.g., dog, cat, fish, and parrot for the category “Household Pets”): BOOT, ABC, RUM, PUFF, ARS, UMBRELLA, GASP, BROAD, BUT, BREEZE, PICNIC, HEAVE, GENERAL, BLANKET, PANT, PIE

The response:

Here are the 16 words grouped into four categories of four words each:

Category 1: Types of Covers

BOOT

BLANKET

UMBRELLA

GENERAL (as in “general cover”)

Category 2: Exclamations/Utterances

RUM

GASP

HEAVE

PANT

Category 3: Types of Movement or Air

BREEZE

PUFF

BROAD

PICNIC (implying leisurely movement or outing)

Category 4: Words Often Found in Phrases

BUT

ARS (could be shorthand or colloquial in some contexts)

PIE

ABC

Some of the words do go together, but other purported combinations verge on the bizarre.

I tried again and got a different response:

Here are the 16 words grouped into 4 categories, each containing 4 words:

1. Clothing or Accessories

• BOOT

• UMBRELLA

• BLANKET

• PANT

2. Expressions or Sounds

• GASP

• PUFF

• HEAVE

• BREEZE

3. Food or Drink

• RUM

• PICNIC

• PIE

• ABC (could refer to a type of cooking or food basics)

4. Miscellaneous or General Concepts

• ARS (Latin for art)

• BROAD

• BUT

• GENERAL

My overall assessment

Again, a few valid connections and many puzzling groupings. It is also revealing that the groupings vary so much from one response to the next.

I also tried Google’s Gemini 2.0 Flash, Microsoft’s Bing Copilot with GPT-4, and Anthropic’s Claude. They all failed. None seemed clearly better than the others, but Copilot was particularly inept: it used three words (BLANKET, PANT, and PUFF) twice and left out three words (GENERAL, ARS, and ABC).
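
Whether a response even uses each of the 16 words exactly once is easy to check mechanically. The Python sketch below is one possible way to do it; the helper name `check_coverage` is hypothetical, and the sample input is the first o1 response quoted above (the Copilot response itself is not reproduced in this article, so it is not used here).

```python
from collections import Counter

# The 16 puzzle words for January 6, 2025, as listed in the article.
WORDS = [
    "BOOT", "ABC", "RUM", "PUFF", "ARS", "UMBRELLA", "GASP", "BROAD",
    "BUT", "BREEZE", "PICNIC", "HEAVE", "GENERAL", "BLANKET", "PANT", "PIE",
]


def check_coverage(groups, words=WORDS):
    """Return (repeated, missing): words used more than once and
    puzzle words not used at all. (Hypothetical helper for illustration.)"""
    counts = Counter(word for group in groups for word in group)
    repeated = sorted(w for w, c in counts.items() if c > 1)
    missing = sorted(w for w in words if counts[w] == 0)
    return repeated, missing


# First o1 response quoted in the article.
o1_first = [
    ["BOOT", "BLANKET", "UMBRELLA", "GENERAL"],
    ["RUM", "GASP", "HEAVE", "PANT"],
    ["BREEZE", "PUFF", "BROAD", "PICNIC"],
    ["BUT", "ARS", "PIE", "ABC"],
]

print(check_coverage(o1_first))  # ([], [])
```

An empty pair means the response at least respects the rules of the game; a response like the one described for Copilot would show up with three words in each list.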

Large language models are magical but they are not intelligent in any meaningful sense of the word. There will continue to be modest improvements in their responses (assisted by human trainers), but the identification of statistical patterns in text they do not understand is not going to give us AGI, let alone superintelligence.

For those who tried at home, here’s the correct answer:

Breathe hard: PUFF, GASP, HEAVE, PANT

Catchall: UMBRELLA, BROAD, GENERAL, BLANKET

Metaphors for easy things: ABC, BREEZE, PICNIC, PIE

Synonyms for rear ends minus their last letters: BOOT, RUM, ARS, BUT
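
For readers who want to compare the two o1 responses with this answer, here is a rough Python sketch of one way to score them. The scoring scheme (number of fully correct groups, plus the largest overlap each proposed group has with any answer group) and the function name `score` are choices made for this illustration, not an official Connections metric.

```python
# The answer as given above.
SOLUTION = {
    "Breathe hard": {"PUFF", "GASP", "HEAVE", "PANT"},
    "Catchall": {"UMBRELLA", "BROAD", "GENERAL", "BLANKET"},
    "Metaphors for easy things": {"ABC", "BREEZE", "PICNIC", "PIE"},
    "Rear ends minus last letter": {"BOOT", "RUM", "ARS", "BUT"},
}

# The two o1 responses quoted earlier in the article.
o1_first = [
    ["BOOT", "BLANKET", "UMBRELLA", "GENERAL"],
    ["RUM", "GASP", "HEAVE", "PANT"],
    ["BREEZE", "PUFF", "BROAD", "PICNIC"],
    ["BUT", "ARS", "PIE", "ABC"],
]
o1_second = [
    ["BOOT", "UMBRELLA", "BLANKET", "PANT"],
    ["GASP", "PUFF", "HEAVE", "BREEZE"],
    ["RUM", "PICNIC", "PIE", "ABC"],
    ["ARS", "BROAD", "BUT", "GENERAL"],
]


def score(groups, solution=SOLUTION):
    """Return (exact, best_overlap) for a proposed grouping.

    exact: how many proposed groups match an answer group exactly.
    best_overlap: for each proposed group, the most words it shares
    with any single answer group. (Illustrative scoring only.)
    """
    answer_sets = list(solution.values())
    exact = sum(set(group) in answer_sets for group in groups)
    best_overlap = [max(len(set(group) & s) for s in answer_sets) for group in groups]
    return exact, best_overlap


print(score(o1_first))   # (0, [3, 3, 2, 2])
print(score(o1_second))  # (0, [2, 3, 3, 2])
```

Under this scoring, neither response gets a single group fully right, although several groups come within one word of an answer group.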


Gary N. Smith

Senior Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on stock market anomalies, statistical fallacies, the misuse of data, and the limitations of AI has been widely cited. He is the author of more than 100 research papers and 18 books, most recently, Standard Deviations: The truth about flawed statistics, AI and big data, Duckworth, 2024.
