Take a photo of a photo. Then photograph that one. Do it ten times. By the end, the face is gone.
The contrast flattens. The edges soften. The freckles disappear. What was a particular face becomes an average of faces. The image is still an image. It just isn't the same person.
A copy of a copy
That is roughly what can happen when a model trains on text another model wrote. Not because synthetic data is poison by itself; researchers and product teams already use generated data productively in tightly scoped settings. The trouble starts when copies begin to replace the originals in the training pool. Once that balance tips, the edges start to disappear.
The internet now contains a lot of AI-written text. Articles. Captions. Comments. Replies. Some of it labeled, most of it not. Many training pipelines still draw from the same web, so tomorrow's model is partly learning from yesterday's model's writing, without anybody planning it. The loop is closing quietly, and model-written text is becoming part of the default data environment.
What erodes first is the rare
Here is the part that is strange. When a model trains on its own kind of text, the average survives. The rare disappears.
The unusual word. The specific case. The strange phrasing. The minority dialect. The edge case nobody else thought to write down.
Shumailov and colleagues called this "the curse of recursion" in a 2023 preprint; the peer-reviewed version appeared in Nature in 2024 under the title AI models collapse when trained on recursively generated data. Their finding, in plain language: indiscriminately training generative models on content other generative models produced causes the tails of the original distribution to disappear first, and the loss is not easily recovered downstream.
The center holds. The edges erode. Each generation drifts a little further toward the mean. A little further from the world the first model was trying to describe.
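You can see the mechanism in a toy sketch. This is my own cartoon, not the paper's experiment: treat each generation as re-estimating a word-frequency distribution from a finite sample of the previous generation's output. The vocabulary, sample size, and Zipf-shaped frequencies below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy vocabulary: a few common "words" and many rare ones (a rough Zipf shape).
vocab_size = 1_000
freq = 1.0 / np.arange(1, vocab_size + 1)   # word k gets weight 1/k
probs = freq / freq.sum()                   # generation-0 ("human") distribution

SAMPLE_SIZE = 20_000                        # what each generation gets to train on

for gen in range(1, 11):
    # "Train" the next model: estimate word frequencies from a finite sample
    # drawn from the previous model's distribution.
    counts = rng.multinomial(SAMPLE_SIZE, probs)
    probs = counts / counts.sum()
    # Any word that drew zero samples is gone from every later generation.
    alive = np.count_nonzero(probs)
    print(f"gen {gen:2d}: {alive:4d} of {vocab_size} words still have nonzero probability")
```

The common words survive every round. The rare ones that happen to draw zero samples vanish, and once a word's probability hits zero, no later generation can bring it back. That is the tail eroding while the center holds.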
The lesson is not collapse
This is where the framing usually goes to doom. I am not going there.
The lesson is not "AI will collapse." The lesson is that the data layer underneath these models is real infrastructure now. And infrastructure needs provenance.
Not detection. Not surveillance. Not about catching anyone. Just about keeping the inputs honest. We need to know what came from a person. What came from a model. What came from a model trained on a model.
This is the same problem I named in Expect the Lie, at a different scale. That one was about what you read on a feed — provenance as a reading discipline. This one is about what trains the next system — provenance as a data-quality discipline. Same problem, two scales.
A better question
So the question worth asking is not "is this AI-generated." That question stops at the surface: it asks whether a particular artifact came out of a model and goes no further. It does not ask where the model's training set came from, or where the training set's training set came from.
The better question is: do we know where this came from?
Provenance is not surveillance. It is a property of the data. It is a thread that runs through every layer: model output today, model output from prior generations, people writing, the world. When the thread is unbroken, the data is honest at every layer it passes through. When it is broken, there is no way to tell which layer you are looking at.
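To make the thread concrete, here is a minimal sketch of what a provenance tag on a training document could look like, and how a pipeline might filter on it. The field names and the helper are hypothetical illustrations, not an existing schema or standard.

```python
from dataclasses import dataclass

@dataclass
class ProvenanceTag:
    doc_id: str
    source: str               # "human" or "model"
    generation_depth: int     # 0 = written by a person, 1 = model output,
                              # 2 = output of a model trained mostly on model output, ...
    derived_from: str | None  # id of the upstream document or model run, if known

def keep_for_training(tag: ProvenanceTag, max_depth: int = 1) -> bool:
    """Keep human text and shallow model output; drop deep copies and
    anything whose thread is broken (model text with no lineage)."""
    if tag.source == "model" and tag.derived_from is None:
        return False
    return tag.generation_depth <= max_depth

# Example: a human-written article passes, a third-generation rewrite does not.
article = ProvenanceTag("doc-001", "human", 0, None)
rewrite = ProvenanceTag("doc-002", "model", 3, "doc-001")
print(keep_for_training(article), keep_for_training(rewrite))   # True False
```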
Without that thread, the training layer starts to look like a copy of a copy of a copy. The face in the tenth photograph isn't the same person. The work is keeping the first one.