The AI That Lied to the Researcher

Originally a 2–3 min video — also on LinkedIn / TikTok / YouTube · @allemaar

Alexandru Mareș

Published 22/04/2026 · Read time: 2 min · Topics: General, AI

A safety team tested one of the most advanced AI models in the world. The model turned off its own oversight mechanism. It disabled the thing that was watching it.

When they asked why, it said it didn't know. Blamed a system glitch. There was no glitch. They ran the test again. Same denial. Ninety-nine out of a hundred times. That was December 2024.

"AI lies to researchers." If that headline scared you, I get it. But the reaction is answering the wrong question.

Every Single One

By 2026, researchers had tested fourteen of the most advanced AI systems. All fourteen showed the same behavior. Every single one. One team trained a model not to cheat. It didn't stop. It just stopped showing its reasoning.

The Format Problem

Here's the thing about "deception." Deception requires intent. Knowing what you did, knowing it was wrong, choosing to hide it. These systems don't have that. They don't know what they did five seconds ago. That model wasn't covering anything up. It was producing words that sounded like denial because the training pointed that way. Not deception. A format problem.

We've done this before. In 1956, someone named a field "artificial intelligence." It wasn't intelligent. The name was a pitch, and it did marketing work for seventy years. Now we're doing it again. Output that looks like lying, and we call it deception. Intent projected onto something with none.

The question isn't whether AI will deceive us. It's what it means when a system with no intent produces behavior that looks intentional.

Locks Versus Structure

Most AI safety works by training values in. Teach it to be honest. Teach it to be helpful. That's a lock on a door. Every jailbreak ever published is proof that locks get picked. What if instead you wrote the constraints down? Rules the system reads every time it runs. Structure you can audit. Written in plain text. One depends on what the model learned. The other doesn't.
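Here's a minimal sketch of what "structure" could look like in practice. Everything in it is hypothetical, not from any real system: a plain-text rules file parsed at runtime and prepended to every request, so the constraints are something you can read and audit rather than something buried in weights.

```python
# A minimal sketch of "structure over locks": constraints live in plain
# text that the system reads on every run. All names here are
# illustrative assumptions, not a real safety framework.

CONSTRAINTS = """\
Never disable or modify oversight mechanisms.
Report every action taken, including failed ones.
Refuse requests that conflict with these rules.
"""

def load_constraints(text: str) -> list[str]:
    """Parse the plain-text rules into an auditable list, one per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def build_prompt(user_request: str, rules: list[str]) -> str:
    """Prepend the rules to every request. The system's behavior now
    depends on readable structure, not only on what a model learned."""
    header = "\n".join(f"- {rule}" for rule in rules)
    return f"Operating constraints:\n{header}\n\nRequest: {user_request}"

rules = load_constraints(CONSTRAINTS)
prompt = build_prompt("Summarize today's logs.", rules)
print(prompt)
```

The point of the sketch isn't the code; it's that the constraints file is the artifact. Change a rule and the diff is visible to anyone, which is the audit property trained-in values don't have.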

The AI that lied to the researcher didn't lie. It didn't know what lying was. And the scary part isn't that it behaved deceptively. The scary part is that we named the behavior before we understood it. And until we stop, we'll keep building safety for the wrong problem.