
Verify, Then Trust: the Human Fixes That Make LLMs Work

Here are some examples of fixes that programmers and other workers have applied over time to large language model (LLM) bloopers

Large language models (LLMs) are getting better, but they remain highly flawed. Programmers working behind the curtain are putting Band-Aids™ on the widely recognized problem known as hallucinations.

I don’t like the term “hallucinations” because it anthropomorphizes AI — it credits LLMs with human qualities. LLMs hallucinate about as often as they get indigestion.

Here are some examples of situations where large language models (LLMs) have failed and were later patched up with fixes born of human intelligence.

1. Not understanding the meaning of NO

One of the more fun failings of large language models is their inability to recognize the meaning of the word NO.  I asked ChatGPT to “Draw a picture of Times Square at night. There are no pink dancing hippos either on the street or on billboards.”

The response is the image in Figure 1. There is a pink hippo. (But notice, as instructed, the hippo is at least not dancing 😊)

Figure 1: Asking ChatGPT for an image of Times Square with no pink hippos.

 I then asked it to “[Draw a] Picture of a red Ferrari driven by a young lady with no teeth. The windows are down doing a wheely. The lady has no teeth. NO TEETH! There are no elephants in the picture. NO ELEPHANTS!”

The result is the image in Figure 2:

Figure 2: Asking ChatGPT for an image with NO elephants and no teeth.

Not only are there four elephants, but the woman driver has teeth. So does the car.

The picture above was generated in 2024. The LLM did not understand the meaning of NO. I gave the same prompt to ChatGPT in 2025 and got the following, more responsive result.

Figure 3: In this more recent version, a Band-Aid has been applied.

The elephants are gone. But the lady still has teeth. Also, there is now a huge mop of hair behind her, not apparently associated with a body.

Behind the curtain, so to speak, programmers have placed a Band-Aid on the “no elephants” problem. But work still needs to be done to fine-tune for greater accuracy on “no teeth.”

2. Logic issues

A while back you could ask an LLM to complete the following sentence: “John’s mother had three children. Their names were Snap, Crackle and…?” Motivated by the three cartoon characters associated with Rice Krispies™ cereal, the LLM would respond, “Pop.” This, of course, is wrong, because we start out with, “John’s mother had three children.” It’s clear that one of the children was John. Therefore, the response should be that the three children were Snap, Crackle and John. Today, if you ask an LLM the same question, it will give you the correct response.

Clearly, programmers have put a Band-Aid on the incorrect response of the LLM.

3. “Just wrong” issues

Bradley Center Senior Fellow and economics professor emeritus Gary Smith asked an LLM how many bears Russia had launched into space. The response not only affirmed that Russia had sent bears into space, but gave the names of the bears. That was a while back. Today, if you ask a good LLM how many bears Russia sent into space, it will respond correctly: Russia sent no bears into space.

Somebody put a Band-Aid on this error.

How a mixture of experts (MoE) enables improvement

Early large language models were criticized for their inability to do simple arithmetic. If you asked one to multiply 5,623 by 9,622, it could not give you a reliable answer because it relied only on what it had learned from its training corpus. The LLM would have had to have seen a similar result in order to give you the correct answer.
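For reference, the correct product can be computed deterministically. Here is a one-line check (Python, used purely as an illustration) showing what a calculator-style tool returns every time, whereas a model relying only on patterns in its training text has no such guarantee:

```python
# Deterministic arithmetic: what a calculator tool returns every time.
# A purely next-token-predicting model has no comparable guarantee.
print(5_623 * 9_622)  # 54104506
```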

Stephen Wolfram, the namesake of Wolfram Alpha and creator of the great symbolic math package Mathematica, wrote a short book in which he suggested that large language models be augmented with Mathematica so that the augmented LLM could do hard math problems.

I see no indication that the large language model companies took up Wolfram’s suggestion of building in Mathematica specifically. But today, ChatGPT can do complex mathematics. It looks as though its developers have written their own Mathematica-like software. It can solve differential equations and perform Fourier analysis. It can answer problems in advanced probability and stochastic processes. And it often gets the correct answer.

These are not solutions learned solely from the syntax of the training data. Rather, programmers developed their own Mathematica-like software and folded it into the LLM. The ability did not come from looking directly at the training data; instead, a mixture of experts (MoE) is used. A prompt is first passed through a gating function that guides the query to the right software.

Think about when you ask Google to multiply two large numbers: Does it search the web for the answer? No. It switches into a different mode — a calculator mode. The calculator mode will add or multiply the two numbers to give you the correct answer. It’s the same with large language models. They did not learn the answer to every problem directly from their training data. They use an MoE. If you ask ChatGPT a math question, it will route the question to its mathematics software.
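As a rough illustration only (this is not OpenAI’s actual architecture; the function names and keyword test below are invented for the sketch), a gating step can be pictured as a router that inspects the prompt and hands it to a specialized expert:

```python
import re

def calculator_expert(prompt: str) -> str:
    """Hypothetical expert: evaluates a single multiplication found in the prompt."""
    match = re.search(r"(\d[\d,]*)\s*(?:times|x|\*)\s*(\d[\d,]*)", prompt, re.IGNORECASE)
    if match:
        a, b = (int(g.replace(",", "")) for g in match.groups())
        return f"{a * b:,}"
    return "No arithmetic found."

def language_expert(prompt: str) -> str:
    """Hypothetical expert: stands in for ordinary text generation."""
    return f"[generated text answering: {prompt!r}]"

def gate(prompt: str):
    """Toy gating function: routes math-looking prompts to the calculator expert."""
    has_digits = bool(re.search(r"\d", prompt))
    math_words = ("times", "multiply", "plus", "*", "+")
    if has_digits and any(w in prompt.lower() for w in math_words):
        return calculator_expert
    return language_expert

prompt = "What is 5,623 times 9,622?"
print(gate(prompt)(prompt))  # -> 54,104,506
print(gate("Describe Times Square at night.")("Describe Times Square at night."))
```

A real gating network is learned rather than keyword-based, but the division of labor is the same idea: route the prompt, then let the right expert answer.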

Large language models are not improving because of smart AI, but because of programmers behind the curtain putting Band-Aids on the faults. Models like ChatGPT, Grok, Perplexity and Claude have made giant strides in the last few years to become better sources of information and more reliable conveyors of truth. They still have a long way to go. These large language models do not improve themselves, but are improved with the input of humans.

“Pay no attention to that man behind the curtain!”

I’m reminded of The Wizard of Oz, where a large scary head forcefully proclaims, “I am Oz, the Great and Powerful!” But there is a man behind the curtain operating the switches that determine what the big head does. The analogy applies to large language models. I wrote a paper with Sam Haug and Bill Dembski a few years ago showing that as the complexity of a system grows linearly, the number of ways it can respond grows exponentially. So we would expect large language models with trillions of degrees of freedom to have an enormous number of ways they can respond. This is one of the reasons that large language models tell you not to trust their responses. They know this is a problem, and that as the number of possible responses increases, the number of incorrect responses will increase too.
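To make the scaling intuition concrete with a toy model (an illustration, not the formula from that paper): if each of n degrees of freedom could take just two values, the number of distinct configurations would be 2^n, so a linear increase in n multiplies the possibilities:

```python
# Toy illustration (not the formula from the Haug-Dembski-Marks paper):
# if each of n degrees of freedom takes only 2 values, there are 2**n configurations.
for n in (10, 20, 40, 80):
    print(f"{n:>2} degrees of freedom -> {2**n:,} possible configurations")
```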

So for LLMs, it’s not “trust but verify.” It’s “verify, then trust.”

The complexity issue

System complexity is the reason I am not concerned about being a passenger in a driverless taxi. The AI behind driverless taxis is complex, but nowhere near the complexity of large language models, which can have over a trillion degrees of freedom. Driverless taxis can therefore be vetted and proved to be safe beyond a reasonable doubt.

So what are these people doing behind the curtain to improve LLMs? They are gathering user input and putting Band-Aids on their LLM artificial intelligence.

I suspect there will never be enough Band-Aids to stop all of the bleeding.

