Take a photo of a photo. Then photograph that one. Do it ten times. By the end, the face is gone.
The contrast flattens. The edges soften. The freckles disappear. What was a particular face becomes an average of faces. The image is still an image. It just isn't the same person.
A copy of a copy
That is roughly what can happen when a model trains on text another model wrote. Not because synthetic data is poison by itself; researchers and product teams already use generated data productively in tightly scoped settings. The trouble starts when copies begin to replace the originals in the training pool. Once that balance tips, the edges start to disappear.
The internet now contains a lot of AI-written text. Articles. Captions. Comments. Replies. Some of it labeled, most of it not. Many training pipelines still draw from the same web, so tomorrow's model is partly learning from yesterday's model's writing, without anybody planning it. The loop is closing quietly, and model-written text is becoming part of the default data environment.
What erodes first is the rare
Here is the part that is strange. When a model trains on its own kind of text, the average survives. The rare disappears.
The unusual word. The specific case. The strange phrasing. The minority dialect. The edge case nobody else thought to write down.
Shumailov and colleagues called this "the curse of recursion" in a 2023 preprint; the peer-reviewed version appeared in Nature in 2024 under the title AI models collapse when trained on recursively generated data. Their finding, in plain language: indiscriminately training generative models on content other generative models produced causes the tails of the original distribution to disappear first, and the loss is not easily recovered downstream.
The center holds. The edges erode. Each generation drifts a little further toward the mean. A little further from the world the first model was trying to describe.
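You can see the mechanism in a toy sketch. This is my own cartoon, not the paper's experiment: treat each generation as re-estimating a word-frequency distribution from a finite sample of the previous generation's output. The vocabulary, sample size, and Zipf-shaped frequencies below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy vocabulary: a few common "words" and many rare ones (a rough Zipf shape).
vocab_size = 1_000
freq = 1.0 / np.arange(1, vocab_size + 1)   # word k gets weight 1/k
probs = freq / freq.sum()                   # generation-0 ("human") distribution

SAMPLE_SIZE = 20_000                        # what each generation gets to train on

for gen in range(1, 11):
    # "Train" the next model: estimate word frequencies from a finite sample
    # drawn from the previous model's distribution.
    counts = rng.multinomial(SAMPLE_SIZE, probs)
    probs = counts / counts.sum()
    # Any word that drew zero samples is gone from every later generation.
    alive = np.count_nonzero(probs)
    print(f"gen {gen:2d}: {alive:4d} of {vocab_size} words still have nonzero probability")
```

The common words survive every round. The rare ones that happen to draw zero samples vanish, and once a word's probability hits zero, no later generation can bring it back. That is the tail eroding while the center holds.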
The lesson is not collapse
This is where the framing usually goes to doom. I am not going there.
The lesson is not "AI will collapse." The lesson is that the data layer underneath these models is real infrastructure now. And infrastructure needs provenance.
Not detection. Not surveillance. Not about catching anyone. Just about keeping the inputs honest. We need to know what came from a person. What came from a model. What came from a model trained on a model.
This is the same problem I named in Expect the Lie, at a different scale. That one was about what you read on a feed — provenance as a reading discipline. This one is about what trains the next system — provenance as a data-quality discipline. Same problem, two scales.
A better question
So the question worth asking is not "is this AI-generated." That question stops at the surface: it asks whether a particular artifact came out of a model and goes no further. It does not ask where the model's training set came from, or where the training set's training set came from.
The better question is: do we know where this came from?
Provenance is not surveillance. It is a property of the data. It is a thread that runs through every layer: model output today, model output from prior generations, people writing, the world. When the thread is unbroken, the data is honest at every layer it passes through. When it is broken, there is no way to tell which layer you are looking at.
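To make the thread concrete, here is a minimal sketch of what a provenance tag on a training document could look like, and how a pipeline might filter on it. The field names and the helper are hypothetical illustrations, not an existing schema or standard.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceTag:
    doc_id: str
    source: str               # "human" or "model"
    generation_depth: int     # 0 = written by a person, 1 = model output,
                              # 2 = output of a model trained mostly on model output, ...
    derived_from: str | None  # id of the upstream document or model run, if known

def keep_for_training(tag: ProvenanceTag, max_depth: int = 1) -> bool:
    """Keep human text and shallow model output; drop deep copies and
    anything whose thread is broken (model text with no lineage)."""
    if tag.source == "model" and tag.derived_from is None:
        return False
    return tag.generation_depth <= max_depth

# Example: a human-written article passes, a third-generation rewrite does not.
article = ProvenanceTag("doc-001", "human", 0, None)
rewrite = ProvenanceTag("doc-002", "model", 3, "doc-001")
print(keep_for_training(article), keep_for_training(rewrite))   # True False
```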
Without that thread, the training layer starts to look like a copy of a copy of a copy. The face in the tenth photograph isn't the same person. The work is keeping the first one.