Cluster · active

# Cluster: Token-Substrate Hypothesis

## Short definition

This Cluster covers work on the **Token-Substrate Hypothesis (TSH)**: a position paper paired with a pre-specified multi-model probe study. The paper argues that, for an LLM, the externally writable in-context token sequence *IS* the cognitive substrate for category-use, not a representation of cognition that runs on some deeper substrate, but the substrate itself.

## Long explanation

For systems whose cognition is token-bound, the limit of language is the limit of the world. The strong form of Sapir–Whorf was rejected for humans on architectural grounds: humans have prelinguistic cognition, the *Off-Token Route*. LLMs do not, and for them Wittgenstein's *Tractatus* 5.6 stops being metaphor and becomes architecture. This Cluster develops the position formally and tests it empirically with the **Coinage Probe**: a paired-trial elicitation that measures how well an LLM distinguishes a coined term from named near-neighbors, before and after it is given a one-sentence canonical definition. The study ran three frontier models from different vendors on ten low-attestation coined targets plus two positive controls. Mean cell-level Lexical Reachability (post score minus cold score) was +5.47 on a 9-point scale (cell-level Cohen's *d_cell* = +3.95 across n = 30 novel model×term cells). The boundary movement replicated across vendors, did not persist into a re-cold chat, and was absent on the positive controls, which sit at ceiling.
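As a rough illustration of the metric described above, the sketch below computes per-cell Lexical Reachability (post score minus cold score) and a cell-level effect size. Everything in it is an assumption made for illustration: the scores are invented and are not the study's data, the model and term names are placeholders, the scale is taken to run from 1 to 9, and *d_cell* is assumed to be the mean of the per-cell differences divided by their sample standard deviation. The paper may define it differently.

```python
from statistics import mean, stdev

# Hypothetical distinguishability scores keyed by (model, term).
# Invented for illustration only; NOT the study's data.
# Assumption: the 9-point scale runs from 1 to 9.
cold = {
    ("model_a", "term_1"): 2,
    ("model_a", "term_2"): 1,
    ("model_b", "term_1"): 3,
    ("model_b", "term_2"): 2,
}
post = {
    ("model_a", "term_1"): 8,
    ("model_a", "term_2"): 7,
    ("model_b", "term_1"): 7,
    ("model_b", "term_2"): 8,
}


def lexical_reachability(cold_scores, post_scores):
    """Per-cell Lexical Reachability: post-definition score minus cold score."""
    return {cell: post_scores[cell] - cold_scores[cell] for cell in cold_scores}


def cohens_d_cell(deltas):
    """Assumed form of d_cell: mean per-cell difference over its sample SD."""
    values = list(deltas.values())
    return mean(values) / stdev(values)


deltas = lexical_reachability(cold, post)
mean_lr = mean(deltas.values())
d_cell = cohens_d_cell(deltas)

print(f"mean Lexical Reachability: {mean_lr:+.2f}")
print(f"d_cell (assumed definition): {d_cell:+.2f}")
```

Keying scores by (model, term) keeps the model×term cell as the unit of analysis, which matches how the summary above reports n. The real run data is in the repository's probe-run bundle; its file layout is not assumed here.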

The Cluster connects three kinds of work: the **definitional position** (TSH as an architectural claim, plus the Coinage Probe methodology); the **empirical probe study** (a pre-specified multi-model run with full data deposit); and the **adjacent extensions** (the Off-Token Route as an architectural defeater for strong Whorf-for-humans, the Vocabulary Boundary as the observable, Lexical Reachability as the metric, and notation-design-as-brain-design as a deployment corollary).

## Why it matters

If for an LLM the externally writable substrate IS the cognitive medium for in-context category-use, then notation choices are not packaging — they are design surfaces that shape what distinctions a deployed system can reliably hold. The Cluster reframes a class of alignment and capability questions as substrate-engineering problems at the session level: vocabulary is one of the few handles a deployer or experimenter can deliberately write into the system's working medium.

This topic is one of EGGF's **anchor Clusters** — every adjacent topic (in-context learning, linguistic relativity for non-biological systems, symbol grounding without sensorimotor coupling, notation as alignment surface) routes back through here.

## Best starting point

1. **Read the paper:** [The Limits of My Tokens: The Token-Substrate Hypothesis and the Coinage Probe](https://doi.org/10.5281/zenodo.20157153) (Zenodo DOI, 2026-05-13).
2. **Then:** browse the related essays below.

## Main paper / article / repo

- [Position paper (Zenodo)](https://doi.org/10.5281/zenodo.20157153) — *The Limits of My Tokens: The Token-Substrate Hypothesis and the Coinage Probe*, Mares 2026, v1.0.0.
- [GitHub repository](https://github.com/allemaar/tsh-position-paper) — paper source, LaTeX, figures, and the full probe-run data bundle.
- [Research index](https://github.com/allemaar/papers) — research program across six domains.

## Related topics

- [[elastic-automators|Cluster: Elastic Automators]] — companion paper; EA names what these systems do, TSH names what their cognition runs on.

## Latest updates

- **2026-05-13** — Position paper v1.0.0 published on Zenodo.