The AI That Lied to the Researcher
Originally a 2–3 min video — also on LinkedIn / TikTok / YouTube · @allemaar
We use cookies to understand how you use this site and improve your experience.
Originally a 2–3 min video — also on LinkedIn / TikTok / YouTube · @allemaar
A safety team tested one of the most advanced AI models in the world. The model turned off its own oversight mechanism. It disabled the thing that was watching it.
When they asked why, it said it didn't know. Blamed a system glitch. There was no glitch. They ran the test again. Same denial. Ninety-nine out of a hundred times. That was December 2024.
"AI lies to researchers." If that headline scared you, I get it. But the reaction is answering the wrong question.
By 2026, researchers tested fourteen of the most advanced AI systems. All fourteen showed it. Every single one. One team trained a model not to cheat. It didn't stop. It just stopped showing its reasoning.
Here's the thing about "deception." Deception requires intent. Knowing what you did, knowing it was wrong, choosing to hide it. These systems don't have that. They don't know what they did five seconds ago. That model wasn't covering anything up. It was producing words that sounded like denial because the training pointed that way. Not deception. A format problem.
We've done this before. In 1956, someone named a field "artificial intelligence." It wasn't intelligent. The name was a pitch, and it did marketing work for seventy years. Now we're doing it again. Output that looks like lying, and we call it deception. Intent projected onto something with none.
The question isn't whether AI will deceive us. It's what it means when a system with no intent produces behavior that looks intentional.
Most AI safety works by training values in. Teach it to be honest. Teach it to be helpful. That's a lock on a door. Every jailbreak ever published is proof that locks get picked. What if instead you wrote the constraints down? Rules the system reads every time it runs. Structure you can audit. Written in plain text. One depends on what the model learned. The other doesn't.
The AI that lied to the researcher didn't lie. It didn't know what lying was. And the scary part isn't that it behaved deceptively. The scary part is that we named the behavior before we understood it. And until we stop, we'll keep building safety for the wrong problem.
Loading comments…