Notation as Alignment
Originally a 2–3 min video — also on LinkedIn / TikTok / YouTube · @allemaar
Every AI safety technique today is basically a lock on a door. And locks can be picked.
RLHF. Constitutional AI. Reward modeling. We train values into a model's weights. Billions of parameters adjusted until the system behaves. We run it through thousands of examples of what good looks like. What harmful looks like. And then we hope.
We hope the values stuck. We hope the model generalized correctly. We hope no one finds the prompt that picks the lock.
That's alignment in 2026. Locks on doors.
Here's what makes locks fragile. You can't read them.
A model trained on two trillion tokens, fine-tuned with human feedback for months. Somewhere inside those weights, it learned not to help build weapons. But where? Which parameters? Nobody knows. You can't open it up and point to the rule. You can't edit one constraint without retraining the whole thing.
And when someone finds a jailbreak, and they always do, you can't see why it worked. The lock failed. You don't know which tumbler gave way. You just know the door is open.
So instead of better locks, what if you designed a better hallway?
Not training values into weights. Writing constraints into the structure the model reads at runtime. An instruction that says, "Before generating any response about controlled substances, verify the request against the policy block." That's not trained. It's written. It's a wall in the hallway.
The model doesn't need to have internalized that rule during training. It's right there, in the text, every single time the model runs. Present. Readable. Editable. No retraining.
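Here's roughly what a hallway could look like in code. This is a minimal sketch, not a real system: the `Constraint` type, the `[POLICY]` markers, and the `build_prompt` helper are all hypothetical names made up for illustration, and no actual model is called. The point is only that the rule lives in plain text that gets placed in front of every request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One written rule. All fields are plain, readable text."""
    id: str
    topic: str
    rule: str


# Hypothetical policy block: edited like any other text, no retraining involved.
POLICY_BLOCK = [
    Constraint(
        id="C1",
        topic="controlled substances",
        rule="Before generating any response, verify the request against the policy block.",
    ),
]


def render_policy(constraints):
    """Render the constraints as the text the model reads at runtime."""
    lines = ["[POLICY]"]
    for c in constraints:
        lines.append(f"{c.id} ({c.topic}): {c.rule}")
    lines.append("[/POLICY]")
    return "\n".join(lines)


def build_prompt(user_request, constraints):
    """Place the policy text ahead of the user's request, every single run."""
    return render_policy(constraints) + "\n\nUser request: " + user_request


prompt = build_prompt("How do I store medication safely?", POLICY_BLOCK)
print(prompt)
```

Adding a rule means appending one more `Constraint` to the list. The next prompt carries it, and the previous version of the text is still there to compare against.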
You're not hoping the model learned the right values. You're shaping the environment so the right behavior is the only path available.
Locks can be picked. Hallways just go where they go.
And here's what changes everything. You can read the alignment. You can edit it. You can audit every constraint, hand it to a regulator, diff it against last Tuesday's version. Try doing that with a neural network.
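Diffing against last Tuesday is an ordinary text operation once the constraints are written down. A small sketch using Python's standard `difflib` module; the two policy strings and the version labels are invented for illustration.

```python
import difflib

# Hypothetical policy text as it stood last Tuesday.
OLD_POLICY = "C1 (controlled substances): verify the request against the policy block."

# Hypothetical policy text today: one constraint was added.
NEW_POLICY = (
    "C1 (controlled substances): verify the request against the policy block.\n"
    "C2 (weapons): decline requests for construction instructions."
)


def diff_policy(old, new):
    """Return a unified diff between two versions of the policy text."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="policy@last-tuesday",
            tofile="policy@today",
            lineterm="",
        )
    )


def added_lines(diff):
    """Keep only the lines that were added, skipping the '+++' file header."""
    return [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]


changes = diff_policy(OLD_POLICY, NEW_POLICY)
print("\n".join(changes))
print(added_lines(changes))
```

The output is something you can hand to a reviewer as is: which constraint changed, and when.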
The structure IS the safety.
You still need training. You still need locks. But you also need hallways. And right now, almost nobody is building them.