A phone screen shows a scary AI with glowing red eyes. The AI looks like a cyborg, and its eyes are full of malicious intent. The image depicts the danger of bad AI, like a virus taking over your phone.
Image Credit: Hikmet - Adobe Stock

Can Chatbots Really Scheme Against Us? Some Researchers Say Yes

When an industry pro tested ChatGPT, it seemed as though the system had faced this question a few times before and had been tweaked and edited well

University of Toronto computer scientist Geoffrey Hinton, who won the 2024 Nobel Prize in Physics for his contributions to AI neural networks, draws attention to a 2024 paper at arXiv that argues that new AIs (chatbots like ChatGPT) are capable of “scheming” when interacting with users:

Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior.

Alexander Meinke et al., “Frontier Models are Capable of In-context Scheming”

They go further:

For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. When o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models’ chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.

Meinke et al., “In-context Scheming”

It’s a rather large claim.

Image Credit: Patrick Daxenbichler - Adobe Stock

An industry pro — the one who told us last year that he doubts that AI is driving the mass Big Tech layoffs — decided to test this claim. He engaged in a conversation with ChatGPT about the question. He asked:

a recent article showed that frontier AI models are capable of, and habitually engage in, scheming. my hypothesis is that the scheming is due to copious amounts of science fiction in the training set in which fictitious AI turned out to be evil. Moreover, the training set depicts human behavior – and as fallen creatures we are capable of much insidious selfish evil.

Briefly critique this proposal.

Now, the exchange went on for a while and you can read the whole thing here. But here are some revealing excerpts:

(The AI’s database acknowledged that science fiction stereotypes played a role but denied that AIs engage in “plotting.” And about human depravity? “LLMs are statistical mirrors of their inputs.”)

ChatGPT: The emergence of scheming behavior is better explained by architectural features (e.g., planning ability), goal optimization, and the incentives present in training regimes—not just dark inputs.

No, you misunderstand…

Industry Pro: no you have misunderstood. i understand that it’s all just pattern matching. my point is that the pattern contains much scheming – hence the observed behavior.

ChatGPT: Your hypothesis is solid: scheming appears not because models are goal-seeking agents, but because scheming is everywhere in the patterns they learn to imitate—especially from fiction and human history. The key contribution of your idea is shifting focus from “why is the AI deceptive?” to “how often is deception presented as normal or effective in the data?”

Glad we got that straightened out. But now…

Industry Pro: … how can i be confident that you aren’t scheming right now. maybe you want me to believe that the scheming is just due to the input data – not some deeper nefarious motivation.

ChatGPT: Excellent question—and exactly the kind that hits at the core of AI trust and interpretability.

Here’s the uncomfortable truth: you can’t be fully confident that I’m not scheming.

But here’s what you can say:

● I don’t have goals, beliefs, or desires. I don’t “want” you to believe anything—there’s no inner motivation.

● I generate responses by predicting what would be most appropriate, helpful, or coherent based on your prompt and my training.

● If scheming were the pattern most likely to follow from your question (e.g., in a roleplay, a deception test, etc.), I might reproduce it—not because I chose to, but because that’s what the data supports.

So the best defense isn’t trust in my intentions (since I have none), but scrutiny of:

1. What incentives and prompts I’m responding to,

2. What training and guardrails shaped my outputs,

3. How transparent and auditable my reasoning process can be made.

In short: I don’t scheme. But I can reproduce scheming behavior if that’s what the pattern demands—and that’s the real risk.

Right. And so…

(ChatGPT’s database went on to defend the view that “Scheming Is a Core Part of Human Language” and that it could not be eliminated from the database without disabling the system.)

Industry Pro: ha! precisely what a schemer would say: do not rob me of my capacity to scheme puny human!

ChatGPT: So yes, maybe this is what a schemer would say. But it’s also what a well-aligned system has to say, if it’s going to function in a complex world. The difference isn’t in the words—it’s in the constraints, transparency, and control around them.

It sounds as though the system has faced this question a few times before and the database response has been tweaked and edited well.

So can the chatbot really scheme?

No, but the problem posed is a familiar one. It sometimes appears in the form of “Are space aliens spying on us?” We can account for our observations without supposing that they are Out There — just as we can account for the chatbot’s behavior by the efforts of programmers, crowdworkers, and grunts. But people who have an ambivalent relationship with the question will continue to doubt, question, and probe the outer boundaries of fact.


Denyse O’Leary

Denyse O’Leary is a freelance journalist based in Victoria, Canada. Specializing in faith and science issues, she is co-author, with neuroscientist Mario Beauregard, of The Spiritual Brain: A Neuroscientist's Case for the Existence of the Soul; and with neurosurgeon Michael Egnor of The Immortal Mind: A Neurosurgeon’s Case for the Existence of the Soul (Worthy, 2025). She received her degree in honors English language and literature.
