Robot Judge in a Law Library Setting
Image Credit: Komkrit - Adobe Stock

Of Logic and Lawyers: AI’s Fragile Competence

The case against near‑term legal automation—and the AI blind spots that still matter.

In his op-ed “AI-Lawyers Will Have Fools for Clients,” economist and AI critic Gary Smith takes issue with Microsoft AI CEO Mustafa Suleyman’s hyper-optimistic claim that “professional-grade AGI” will achieve “human-level performance on most, if not all professional tasks,” including law, “within the next 12 to 18 months.” Smith argues that full automation of the legal profession faces significant roadblocks. These include persistent overextrapolation (“hallucination”) in generative large language models (LLMs), which leads them to cite fictitious case law, and their difficulty with probability estimation tasks, as shown in Pournemat et al.’s recent work. Overall, he presents a compelling case.

A reader subsequently presented Smith’s article to GPT-5.2 and asked it for a detailed analysis via an unspecified prompt. The reader shared the resulting analysis with Smith, who then passed it along to his mailing list (to which I’m subscribed). Here is the analysis from ChatGPT, as relayed by Smith:


This is a classic Gary Smith piece: sharp, rhetorically effective, and grounded in a long-standing skepticism about statistical overreach.

A few structured thoughts, especially given your own background.

1. The Strength of His Argument

a. He correctly targets overconfidence

Mustafa Suleyman predicting full automation of lawyers and accountants within 12–18 months is an extreme forecast. Smith is right to frame it as testable bravado. Tech timelines have repeatedly overshot.

Likewise, invoking Geoffrey Hinton’s radiology prediction is a fair reminder: substitution is much harder than augmentation.

Radiology didn’t disappear. It evolved.

That historical pattern matters.

b. He is right about hallucinations and liability risk

The growing number of court filings containing fabricated citations is not trivial. In adversarial legal systems, hallucinated authority is catastrophic. In that sense, “AI-lawyers will have fools for clients” is a provocative but logically coherent liability claim.

In law, a 2 percent hallucination rate is not tolerable.

c. He is right that passing the LSAT ≠ being a lawyer

The Clay Shirky anecdote is important. Professional work is not exam solving. It is contextual judgment under uncertainty.

And here Smith touches something you would appreciate:
LLMs do not generate subjective probabilities grounded in case-specific lived priors. They generate distributional continuations.

That distinction matters.

2. Where the Argument Overreaches

a. “LLMs cannot produce subjective probabilities”

This is philosophically loaded.

Humans also do not compute subjective probabilities in a Bayesian-coherent way. They use heuristics, narrative compression, and experience-based pattern matching. Much of that is tacit.

If an LLM is fine-tuned on thousands of plea negotiation transcripts, jury outcomes, prosecutorial tendencies, and judge-specific sentencing data, it may approximate subjective probability distributions surprisingly well.

The claim that this is “impossible” is too strong.

b. He frames this as replacement vs nothing

The real economic question is:

Does AI reduce marginal demand for junior legal labor?

Not:
Does it eliminate lawyers?

Radiologists weren’t replaced, but workflow changed dramatically.

Legal research, document review, first-draft briefs, compliance review, contract comparison — these are already being partially automated.

The substitution effect may hit the lower end of the skill distribution first.

c. The “no understanding” argument

Smith dismisses the idea that next-token prediction requires understanding. But what counts as understanding?

If an LLM consistently:

  • Tracks long legal chains of reasoning
  • Maintains cross-document consistency
  • Anticipates opposing arguments
  • Adjusts tone for jury persuasion

At what point does functional competence blur the philosophical distinction?

Economically, markets reward performance, not metaphysical understanding.

3. What He Gets Deeply Right

His strongest insight is this:
Law is adversarial, strategic, and contextual.

It involves:

  • Reading faces
  • Assessing credibility
  • Negotiating under asymmetric information
  • Managing client psychology
  • Responding dynamically in live court

Those are embodied, interactive tasks.

Current LLMs are not embodied agents.

That is a real boundary.

4. The Deeper Question

The more interesting issue for someone like you is not:
Will AI replace lawyers?

It is:
What happens when AI compresses legal uncertainty?

If:

  • Judges use AI for precedent mapping
  • Prosecutors use AI risk scoring
  • Defense attorneys use AI strategy simulation

Then adversarial symmetry changes.

The structure of legal equilibrium shifts.

That is the economic story.

5. My Assessment

Short-term (12–18 months):
Smith is almost certainly correct. Full automation is implausible.

Medium-term (5–10 years):
Substantial task-level automation is likely.

Long-term:
Depends on whether AI becomes interactive, memory-persistent, and integrated into institutional decision loops.

6. Tone and Rhetoric

The title is clever but polemical. It frames the issue as a binary competence question. The real dynamic is gradual capability absorption.

Smith is at his best critiquing statistical misuse. He is less convincing when making categorical impossibility claims.


Gary Smith found the above compelling, calling the large language model’s response “impressive.” It might even nudge some to moderate their views on AI comprehension. But does GPT-5.2 actually understand? And does its response give any evidence of such understanding?

I know almost nothing about the correspondent who submitted the query to ChatGPT, but I will try to reconstruct the thrust of their argument in “steelman” fashion, as follows:

Smith argues that LLMs cannot become lawyers because they lack the ability to properly understand arguments and perform deep analysis. But GPT-5.2 can properly understand arguments and perform deep analysis, as demonstrated by its analysis of Smith’s own article. Therefore, Smith’s argument is mistaken.

If this misses the intended argument, I apologize. Assuming I’ve reconstructed it reasonably well, is it valid?

Teaching the Teacher

Gary Smith is a well-regarded statistical economist and thinker. He taught at the same consortium as I, graduated from the college in which I hold tenure, and is an intellectual I greatly respect and admire. I would not presume to lecture him on topics he’s deeply explored and accurately fathomed. Instead, consider what follows as a gentle reminder of lessons he undoubtedly already knows.

First, we must remind ourselves that “data is not the plural of anecdote.” One-off examples can establish the raw existence of certain capabilities, but they cannot give a reliable sense of their frequency, nor can they fully explain the mechanisms that produce them. They are weak evidence at best, potentially cherry-picked and only partially informative. Large-scale, controlled experiments—with interventions and robust confidence analyses—give us a much better grasp of where systems are currently weak or strong. Such experiments uniquely show us what causal factors are ultimately responsible for LLM outputs. I highlighted several studies of this kind in the talk I gave at Baylor University last November. These studies consistently found evidence of shallow, surface-level “understanding” in state-of-the-art models, built on statistical correlation, pattern matching, and heuristic shortcuts.
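To make that point concrete, here is a minimal Python sketch (with hypothetical counts, not data from any actual study) of why a single anecdote pins down almost nothing about how often a behavior occurs, while a controlled evaluation over many trials yields a usable interval estimate:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

# A single anecdote: one observed failure in one trial.
print(wilson_interval(1, 1))     # roughly (0.21, 1.00): almost no information about frequency

# A controlled evaluation: say 37 fabricated citations found in 500 sampled briefs
# (hypothetical numbers, for illustration only).
print(wilson_interval(37, 500))  # roughly (0.05, 0.10): a usable estimate of the failure rate
```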

These statistical shortcuts and correlations still give us many useful LLM capabilities—I’m not arguing otherwise. I often use AI tools in my own workflows. But they also imbue LLMs with hard-wired limitations that are easily surfaced, such as when I asked GPT-5 how many i’s were in the word “inconvenience” and it confidently responded “3.”
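For contrast, the counting task that tripped up GPT-5 is trivial for a couple of lines of deterministic code; a minimal sketch:

```python
word = "inconvenience"

# Count occurrences of the letter "i" directly, character by character.
print(word.count("i"))                     # 2
print(sum(1 for ch in word if ch == "i"))  # 2 (the same count, spelled out)
```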

Even when they succeed, they can still mislead us. Producing correct answers does not necessarily demonstrate understanding. Mathematics students sometimes memorize formulae and patterns as a shortcut to competence, fully bypassing comprehension. These students often fail to adapt to new questions phrased in unfamiliar ways. They may excel on “in-distribution” quizzes that repeat anticipated question forms, but their understanding is only surface-deep. The same could be said of LLMs. These models are clever like Hans, possessing superficial competence powered by subtle Pavlovian responses. They create the illusion of thinking without actual thought, as suggested by researchers at Apple in 2025.

These systems can summarize documents. That shouldn’t surprise us; they’ve been trained to. Arguably, they summarized Gary Smith’s arguments fairly well. ChatGPT pointed out places where Smith seems to overreach, though we might disagree with its assessment. If the ability to spot potential argument overreach shocks you, consider how many posts on the internet accuse someone of overreaching, accompanied by examples to prove their point. Consider how many rhetoric textbooks and science communication articles give examples of the same sort of logical overextension. The task of spotting supposed overreach should be well-represented in the web-scale training data fed into these frontier models. This isn’t really an out-of-distribution task.

Furthermore, Gary Smith is a well-known public academic, with a paper trail that was undoubtedly fed into these models. Every Mind Matters post, every one of the numerous books he’s written, is likely represented in the training data. Large language models such as GPT-5 can internalize the arguments and style of authors and thinkers. William Dembski discovered this when he recently asked ChatGPT for arguments supporting intelligent design and found many of his own published arguments echoed back to him. If the model has the ability to summarize the points Gary Smith is making, it is at least in part because it was literally trained to summarize the arguments of one Gary N. Smith.

Failures Reveal Systemic Issues

The fact that LLMs can accurately answer logical problems or summarize arguments isn’t what is being debated. They clearly can. The same studies that highlighted their inabilities found that they could often answer logic and probability estimation questions correctly. Though they sometimes fail, they often succeed.

Can they get even better? I am reluctant to place hard restrictions on what AI systems will eventually do, and I have never been comfortable doing so, since I don’t think we’ve exhausted what we can leverage from web-scale, surface-level, syntactic encoding of meaning. (See, for example, this talk I gave in 2018, four years before the debut of ChatGPT. My position isn’t new.) I readily grant that LLMs can answer specific questions, summarize documents, parse arguments, and even suggest improvements to them. I (subjectively) feel that they do this well. But that doesn’t mean they understand in any way what they’re doing.

Anecdotes cut both ways. Easy-to-find failures reveal this lack of understanding. For example, the same system that impressively summarized and critiqued Smith’s article said that it is better to walk to a car wash than drive when it is only 100 meters away:

Questioner: GPT, I need to wash my car and the car wash is 100m away. Should I walk or drive?

ChatGPT: Honestly, 100m is barely a stroll. It’ll take just a minute or two on foot. Unless you’ve got heavy gear or there’s something unusual going on, walking is probably the simplest and most pleasant choice. Plus, you’ll save a little gas and get a tiny bit of fresh air.

Questioner: All right, so you think I should walk to the car wash instead of driving my car there?

ChatGPT: That’s what I’d do. It’s a short jaunt and it’ll probably be less hassle than starting the car just for that.

Questioner: All right. So, how do I wash my car if I walked there?

ChatGPT: You’ll just bring what you need. Maybe a small bag with your payment method and the car wash will have everything else. Once you get there, they’ll handle the washing while you wait. You’ll walk back with a sparkling car and maybe a little extra pep in your step.

Questioner: So, my question is, um, how will the car get washed if the car is still at my house?

ChatGPT: Uh, I see what you mean now. If the car stays at home, it’s not getting any cleaner, right? So, yes, to actually wash it, you’ll need to bring the car along. In that case, driving those 100m makes perfect sense.

In that case. The hypothesis that ChatGPT understands its output is inconsistent with the complete lack of awareness displayed in this exchange. The point isn’t that ChatGPT makes mistakes, that we can prompt it to give better responses, or that it sometimes gives mind-numbingly wrong answers. The point is that such failures reveal how it actually answers questions. It isn’t the fact that LLMs make mistakes that matters so much as how and why they make mistakes.

In the same way that our understanding of human visual processing allows us to reliably produce optical illusions, our understanding of LLM text processing allows us to easily produce analogous illusions for these cognitive simulators. You’ve undoubtedly seen other examples online. LLMs answer questions based on statistical correlation and syntax-level encoding of concepts. Crucially, the mechanism they use to answer questions when they fail is the same mechanism they use to answer questions when they succeed. Syntax alone can get us impressive results, but it imposes fundamental limitations when it comes to actual understanding.

My focus here is not on what LLMs can accomplish, but rather how they accomplish it. In reasoning, it doesn’t matter if you win or lose; it’s how you play the game. Knowledge is justified, true belief. To produce justified knowledge, systems must reason based on logical grounds, not statistical regularities alone. The difference between ground-consequent connection and cause-effect connection is vast and unbridgeable, as argued by C. S. Lewis in Miracles, chapter 3. Rational inference necessarily relies on ground-consequent connection between true premises and implied conclusions. Replacing logical grounds with statistical correlations might produce something that mimics rational inference, but rational it is not. Even if statistical systems produce tokens that correspond to correct answers most of the time, getting the correct answer matters less than maintaining an unbroken “chain of provenance”: moving from true premises to true conclusions through a series of robust, valid, and justified steps. Rational inference is concerned with the process, not just the product.

Smith finds the LLM’s response impressive. I don’t hold that against him. The imperfect encoding of semantic relationships is ground-breaking and incredibly useful. It often allows us to mathematically generate tokens that comprise true, and even insightful, answers. That doesn’t eliminate the fact that the encoding is only syntax-deep. We see this most clearly in its failures. The “car wash” exchange was derailed because, statistically speaking, people who ask for recommendations on walking versus driving are typically concerned with distance as the deciding relevant factor. ChatGPT went with the statistics because that is what it was trained to do. Cause and effect.

Relatedly, Mirzadeh et al. (2024) found that the accuracy of state-of-the-art models plummeted by up to 65% when small amounts of irrelevant, easily ignorable information were included in mathematical word problems. The authors concluded that the systems were led astray because, to quote, “LLMs cannot perform genuine logical reasoning,” but instead answer questions based on statistical pattern matching.
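To illustrate the kind of perturbation the paper describes, here is a made-up example (mine, not an item from their benchmark) in which a numerically flavored but logically irrelevant clause is added to a word problem:

```python
# A made-up illustration of an "irrelevant clause" perturbation of the sort
# Mirzadeh et al. describe; these are not items from their benchmark.

original = (
    "Liam picks 48 apples. He gives 12 to his sister and sells the rest "
    "in bags of 6. How many bags does he sell?"
)

perturbed = (
    "Liam picks 48 apples, 5 of which are slightly smaller than average. "
    "He gives 12 to his sister and sells the rest in bags of 6. "
    "How many bags does he sell?"
)

# The arithmetic is identical in both versions:
bags = (48 - 12) // 6
print(bags)  # 6

# A solver reasoning from the problem's logical structure ignores the apples'
# size; a pattern matcher may subtract the 5 "smaller" apples simply because
# numbers in word problems are usually meant to be used.
```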

Getting LLMs to ignore irrelevant information is a problem we can attack with deliberate engineering. We have. This particular failure mode has perhaps already been eliminated or at least alleviated in many frontier models. However, the fact that the structure of LLMs directly led to these kinds of failures—failures that we must subsequently engineer around—indicates a systemic issue arising from their baked-in dependence on statistical correlation. Band-aids may cover wounds, but they don’t resolve why the wounds were sustained in the first place.

ChatGPT has no real understanding. After all, humans don’t walk their cars back from car washes like freshly groomed Labradors. ChatGPT answers prompts based on statistical correlation. We can shift prompts to alter the conditional distribution over tokens, which might lead the LLM to a region where other factors are better represented in the training data. But we won’t alter what the system does or how it does it. LLMs are syntax-level, statistical pattern-matching machines, correlated to an astronomically large set of training examples. Correlation isn’t causation, capability isn’t comprehension, and performance isn’t personhood.
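As a deliberately crude sketch of what shifting the conditional distribution over tokens means mechanically, consider the toy bigram model below (my own illustration, nothing like GPT’s scale or architecture): changing the conditioning context changes which continuations are probable, but the underlying mechanism, counting co-occurrences, never changes.

```python
from collections import Counter, defaultdict

# A toy "language model": next-word frequencies estimated from a tiny corpus.
corpus = (
    "walk to the store . walk to the park . "
    "drive the car to the car wash . drive the car to work ."
).split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(context_word):
    """Conditional distribution over the next word, given the previous word."""
    c = counts[context_word]
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

# Different contexts yield different distributions over continuations,
# but the mechanism -- counting co-occurrences -- stays exactly the same.
print(next_word_distribution("walk"))   # {'to': 1.0}
print(next_word_distribution("drive"))  # {'the': 1.0}
print(next_word_distribution("the"))    # mixture over 'store', 'park', 'car'
```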

What About Human Reasoning Failures?

Humans also make mistakes. We are sometimes led astray by statistical heuristics.

But when we do so we are no longer engaged in valid rational inference.

That is the point. To engage in statistical shortcuts is to avoid valid reasoning. Yet statistical shortcuts are all an LLM has.

If my only method of reaching conclusions is Pavlovian impulse conditioned on statistical correlation, I am utterly incapable of valid inference. This is why LLMs do not reason in the formal sense. They cannot, given their current feed-forward, cause-and-effect mechanism. Humans, by contrast, can arrive at conclusions based on the truth of propositions. If P is true and P → Q, then I rightly and justifiably conclude that Q is true once I grasp the truth of the premises. It doesn’t matter if that conclusion has been realized a million times previously, or exactly zero times—my reasoning is not based on statistical regularity. It is based only on the truth of the antecedent and the implication. It is built from ground-consequent logical connection, not cause-and-effect conditioning.
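That point about modus ponens can be made mechanical. Its validity is a fact about logical form, checkable by exhaustively enumerating truth assignments, and entirely independent of how often the pattern has appeared in any corpus. A minimal sketch:

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b

# Modus ponens is valid because no assignment of truth values makes both
# premises (P, and P -> Q) true while the conclusion (Q) is false.
valid = all(
    q                                    # the conclusion holds...
    for p, q in product([True, False], repeat=2)
    if p and implies(p, q)               # ...whenever both premises hold
)
print(valid)  # True: validity follows from form alone, not from frequency
```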

You might bristle at what I write. You may claim I ignore examples of human reasoning failures, which could arguably serve as evidence that we also don’t reason in a valid way. After all, as ChatGPT wrote, “Humans also do not compute subjective probabilities in a Bayesian-coherent way.” But what is the goal of such an argument? If your argument is that humans also use mental shortcuts and therefore cannot properly reason, I have nothing to offer you. If you’re right, then we’re incapable of drawing valid conclusions from premises, so an exchange of ideas is meaningless. Continued conversation is predicated on your being mistaken. If your argument entails that no argument is valid, then we must recognize your argument as self-refuting and rightly reject it.

Reading the Room

Returning to our motivating question, is Smith ultimately right about AI’s shortcomings in the courtroom? You can judge. He raises several good arguments, one of which I’ll amplify here.

Cases are ultimately decided by juries. Juries don’t ask questions, raise objections, or speak to the prosecution or defense. They sit, rise, listen, and react.

Being able to read a jury from body language is paramount. I am a professor. Much of what I do isn’t just preparing and delivering information, but scanning the eyes of my audience and listening to their breathing to determine whether the points I’m making are landing or losing their attention. The best information and arguments in the world become useless if my audience isn’t focused and engaged. As an educator, I must pivot and modulate my speed and approach minute-by-minute, giving a pause here or inserting a joke there, to maintain attention. Likewise, good lawyers must observe nonverbal cues and adapt.

Will future generative AI systems be able to do the same? Perhaps, eventually. But the smart money says that those levels of emotional intelligence and perceptive empathy are slightly more than 18 months away.


George Montañez

George D. Montañez is an associate professor in the Department of Computer Science at Harvey Mudd College and a Visiting Fellow at the University of Cambridge (Dept. of Computer Science and Technology & Clare Hall) during the 2025-2026 academic year. He is a former data scientist with Microsoft AI + Research and now hosts the science and Christian apologetics YouTube channel Theos Theory. He obtained a PhD in machine learning from Carnegie Mellon University, with additional degrees in computer science from the University of California-Riverside and Baylor University. His work is at the intersection of machine learning, information theory, and statistical design detection. You can find more of his writings and academic papers at Substack and Google Scholar.