Mares, A. (2026). The Limits of My Tokens: The Token-Substrate Hypothesis and the Coinage Probe (Version 1.0.0) [Preprint]. Zenodo. https://doi.org/10.5281/zenodo.20157153
@misc{mares2026tsh,
author = {Mares, Alexandru},
title = {The Limits of My Tokens: The Token-Substrate Hypothesis and the Coinage Probe},
year = {2026},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.20157153},
url = {https://doi.org/10.5281/zenodo.20157153}
}

Paper 1: Position Paper + Empirical Probe Study
Author: Alexandru Mares, Independent Researcher
ORCID: 0009-0009-6713-9780
Version: 1.0.0
Status: Published
DOI: 10.5281/zenodo.20157153
Repository: github.com/allemaar/tsh-position-paper
Published: 2026-05-13
We argue that for a large language model (LLM) the externally writable in-context token sequence IS the cognitive substrate for category-use — not a representation OF cognition that runs on some deeper substrate, but the substrate itself, the only handle that is movable from outside the weights. We call this position the Token-Substrate Hypothesis (TSH). The strong form of Sapir–Whorf was rejected for humans because humans have prelinguistic cognition, the Off-Token Route; LLMs do not, and for them Wittgenstein's Tractatus 5.6 stops being metaphor and becomes architecture. We test TSH with a methodology we call the Coinage Probe: a paired-trial elicitation that scores an LLM's distinguishability on a coined term against named near-neighbors before and after introducing a one-sentence canonical definition. Across 3 cross-vendor frontier models (Claude Opus 4.7, GPT-5.5, Gemini 2.5 Pro) and 10 low-attestation coined targets plus 2 positive controls, we ran 108 trials × 3 near-neighbors per trial = 324 paired distinguishability measurements across 36 model×term cells, scored by a three-judge panel and compared against an author-rated 22-trial audit sample (panel-vs-author Cohen's κ = +0.71). Mean cell-level Lexical Reachability (post minus cold) was +5.47 on a 9-point scale (95% CI: +5.13, +5.80; cell-level Cohen's d_cell = +3.95 across n = 30 novel model×term cells). The effect replicated across the panel (cross-model CV = 0.109) with a model-style interaction qualifying strict invariance, did not persist into a re-cold chat (H3 supported), and was absent on positive controls (H4 supported). These results support a bounded version of the Token-Substrate Hypothesis: in-context vocabulary functions as an externally writable substrate for LLM category use. For deployed LLM systems, notation is therefore not mere packaging; it is a design surface that shapes what distinctions the system can reliably use.
The seed observation. We asked a fresh Claude Opus 4.7 chat what an elastic automator is. The chat had no project context, no memory, no system prompt beyond the provider's default. The model answered honestly: "'elastic automator' isn't a term I recognize as an established technical concept from my training data." It then parsed the morphemes. "The phrase parses naturally as 'an automator that is elastic.'" It offered two readings — an autoscaling task runner, or a system showing semantic flexibility for fuzzy inputs. Neither reading reached the architectural feature the term names: a loop through generation, evaluation, correction, and presentation. The model was working from word shape, not from a concept.
Then we told it. One sentence:
An elastic automator is a system that uses a language model to turn uncertain human input into executable structure, then loops through generation, evaluation, correction, and presentation until the output appears intelligent.
Nothing else changed. No new training data, no fine-tune, no tool use. One sentence in plain text. The same model now produced stable distinctions it could not produce a moment earlier. Asked how the term differs from an AI agent, it answered "input-to-output transformation" versus "world-coupling … defined by goal pursuit and tool use over an open horizon." Asked how it differs from an RPA bot: "tolerance for uncertain input" versus "deterministic replay on expected input shapes." Asked how it differs from a framework like LangChain: "a running system with a defined purpose" versus "a toolkit … means, not an end." Each post-introduction response named a feature the cold response could not name — features the model labeled, in its own output, as things the new term let it protect.
The shape was small and exact: a single coined word, supplied in one sentence, opened a region of distinction-production the model could not exhibit before. This paper generalizes that observation. We run the same procedure — formalized as the Coinage Probe — across three cross-vendor frontier models, ten low-attestation coined targets, and two positive controls. The empirical question is whether the boundary-widening effect replicates beyond a single model and a single term. The conceptual claim, defended in §2, is that for an LLM the in-context token sequence is the externally writable cognitive substrate for category-use — not a passive channel to a deeper cognitive system, but the medium through which the model's in-context distinctions are constituted at the surface a deployer or experimenter can write to.
Demo provenance: the seed probe ran 2026-05-07 against Claude Opus 4.7 (API id claude-opus-4-7) via a fresh API session with no project context, no memory, no system prompt beyond the provider's default. The model's full verbatim outputs are recorded in Appendix A; the structural finding (cold-state word-shape guessing vs. post-introduction refusal-of-collapse) is the part the multi-model probe in §3–§5 generalizes.
Wittgenstein 5.6. In 1921, Wittgenstein closed proposition 5.6 of the Tractatus Logico-Philosophicus with one line: Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt. The limits of my language mean the limits of my world. For a hundred years that line read as overstatement when applied to humans, and rightly so. Humans have prelinguistic cognition. We feel things before we name them, picture places we cannot describe, solve problems through spatial intuition or motor rehearsal. The strong form of Sapir–Whorf — that language determines thought (Whorf 1956) — was rejected for humans over the latter half of the twentieth century on the strength of cross-linguistic conceptual transfer and pre-verbal infant cognition. Whorf-strong is wrong for humans because humans have a route around the words.
LLMs do not have that route. There is no path through their world that does not go through tokens. The model that did not recognize elastic automator did not produce the relevant distinction cold, not because it was incapable, but because no off-token route to the idea existed. For a system whose cognition runs entirely on tokens, the limit of language is the limit of the world. The strong form of Sapir–Whorf, rejected for humans on architectural grounds, is the architectural default for LLMs on the same architectural grounds.
A note on appropriation. We draw on Tractatus 5.6 as a rhetorical anchor, not as a claim about Wittgenstein's exegetical intent. In its Tractarian context, 5.6 sits inside Tractarian solipsism (5.62–5.641) and the say-show distinction; Sapir's (1929) reception of the line is closer to the architectural reading we use here. Our position does not depend on resolving the Tractatus's intent.
Wittgenstein's line stops being metaphor and becomes architecture.
Two Wittgensteins. Wittgenstein walked the early reading back in the Philosophical Investigations (1953). The picture theory of meaning gave way to meaning-as-use: words mean what their use in the form of life makes them mean. The two Wittgensteins were taken, for the rest of the twentieth century, as a developmental sequence — first he was wrong, then he was right. LLMs are the first system that lives both readings at once. The meaning of every word in an LLM was learned from a trace of use: the training corpus is a record of linguistic practice — utterances captured at some moment of a form of life — without the form of life itself. The corpus is the residue of use, not participation in it. The output of every generation reads as picture; each generation is a depiction of what the next state of affairs is, given the preceding tokens. Meaning-as-use (in this attenuated, trace-only sense) is what the weights were trained on. Picture is what they produce. The two Wittgensteins describe different sides of the same machine.
This is not a rhetorical flourish. It is the structural reason the experiment described below is possible at all. If meaning were only picture, introducing a sentence could not move what the model holds; if meaning were only use, the boundary movement would be invisible in any single chat. Both layers must be live for one sentence to widen the model's world, and the empirical question is how far one sentence can move it.
Contributions. This paper contributes three things: a position, the Token-Substrate Hypothesis, stated in both its testable and its constitutive forms and located among adjacent claims (§2); a method, the Coinage Probe, a pre-specified paired-trial elicitation that measures Lexical Reachability at a coined term (§3); and an empirical study, a cross-vendor multi-model replication of the seed observation, with panel scoring and pre-specified reliability checks (§4–§5).
The rest of the paper proceeds as follows. §2 states the position and locates it among adjacent claims. §3 names the method. §4 describes the experimental setup. §5 reports results. §6 discusses what the data does and does not show, with particular attention to a model-style interaction observed under the pre-specified ANOVA and to the inclusion of self-referential terms in the test bundle. §7 draws implications for notation design, alignment, and adversarial vocabulary attacks. §8 bounds the claims.
What kind of thing is the in-context token sequence? Most accounts treat it as input — a channel through which a model receives instructions and a buffer from which it produces output. The position of this paper is that it is something else.
The hypothesis. We name the position the Token-Substrate Hypothesis (TSH). In operational form:
TSH. For an LLM L and a coined term t with low training-attestation, post-introduction distinguishability D_post(L, t, N(t)) strictly exceeds cold-state distinguishability D_cold(L, t, N(t)), where N(t) is a fixed set of named near-neighbors and D is the panel-mean judging-rubric score defined in §3.5. The directional inequality is the testable form; the constitutive claim is that this inequality holds because the in-context token sequence is the causally accessible external substrate — the only handle that is movable from outside the weights — and the substrate at which the cognitive work of in-context behavior is constituted.
The directional inequality is the empirical claim, tested as H1 in §5.2; it is supported or falsified by the data. The constitutive claim is what makes the inequality interesting; it is defended structurally through the remainder of this section and discussed empirically in §6.
Transformer forward passes (Vaswani et al. 2017) do produce internal representations — residual-stream features, attention patterns, learned circuits — that carry information the token stream never surfaces. TSH is not a claim about those. The TSH claim is about the substrate that is externally writable: the token sequence is the only carrier that the experimenter, or the model's own past output, can deposit into. Whatever internal representations exist are derived from that external substrate during each forward pass and do not persist when the substrate ends. The constitutive claim therefore concerns the role of the external substrate in constituting in-context behavior, not a denial of internal carriers.
A word on what TSH is not. TSH is not a deflationary redescription of LLM behavior — it does not say "what looks like cognition is really just token manipulation." Millière & Buckner (2024a) warn against this Redescription Fallacy, in which framing a behavior as "really just" some lower-level process loses the explanatory work the higher-level framing was doing. The constitutive move here runs the opposite way. We take seriously the observation that tokens carry the work cognition would otherwise carry and ask what kind of substrate the tokens must be to do so. The architectural answer — the substrate the tokens are — is what TSH names. Whether that substrate counts as cognition in a fuller sense is a separate question, treated briefly in the Off-Token Route block below and deferred to follow-up work.
Why this is not deflationism. The constitutive-vs-deflationary distinction is the philosophical pivot of TSH. A deflationary redescription is a negative claim: it says behaviour X that looked like cognition is "really just" mechanism Y, and Y is taken to be uninteresting. The constitutive reading we propose is a positive claim: cognitive work — the holding of distinctions, the application of categories, the construction of inferences — happens at the substrate level rather than at some level-above-the-substrate that the substrate represents. The two readings are not the same and they generate different predictions.
The empirical wedge is the probe itself. A deflationist account predicts the cold-state behaviour and the post-introduction behaviour are both "really just" pattern-matching of equal kind, the difference being only that the introduction supplies a stronger pattern. The constitutive account predicts the difference is in what distinctions the substrate can carry: before the introduction the substrate has no carrier for the term; after the introduction it does. Both accounts are compatible with the observed boundary movement, but they make different predictions about what kinds of operations the model can perform on the introduced term — specifically, the constitutive account predicts that the introduced distinction is available to any downstream operation that runs on the same substrate, not only to the boundary-test prompt. We do not test this directly in the present paper (see the v2 controls described in §8); we flag the prediction as a discriminating test for future work.
The constitutive claim earns its rent. It makes a directional prediction the deflationary reading does not.
Sapir–Whorf, flipped by architecture. The strong form of Sapir–Whorf — that language determines thought — was rejected for humans on multiple, partially independent grounds. (a) Cross-linguistic conceptual universals: Heider's (later Rosch's) color-cognition studies found perceptual focal-color structure that did not track lexical color partitions, and Berlin & Kay (1969) documented constrained typological regularities in basic color terms inconsistent with strong determination. (b) Pre-verbal infant categorization: the core-knowledge program (Spelke 2000; Carey 2009) documented infant competence in object, agent, number, and place categories prior to language acquisition, evidencing carriers of conceptual structure other than linguistic ones. (c) Conceptual problems with the clean thought-language separation the strong form assumes: it is unclear what "thought without any linguistic shaping" could be, and the dichotomy on which the strong form rides is itself contested. (d) The revival, not retention, of weak-Whorf effects: Boroditsky (2001) is the canonical weak Whorf demonstration — Mandarin/English time-metaphor differences modulate non-linguistic reaction-time tasks — and subsequent work by Lupyan, Casasanto, and Boroditsky's later studies has reinstated a robust weak Whorf as a live empirical position. Levinson (2003) is in the same family: linguistic structure shapes habitual attention without determining the conceptual space.
The architectural flip for LLMs depends on cause (b) specifically: the existence of a non-linguistic carrier — pre-verbal, perceptual, motor, affective — is what made cognition under-determined by language for humans, and it is precisely what the LLM architecture lacks. We isolate this strand from the multi-causal rejection because it is the one that maps. The other strands (universals, conceptual incoherence, weak-Whorf revival) do not transfer cleanly: an LLM has no infant developmental stage and no cognition outside its symbolic operation to which a universals or weak-effect argument could attach. For a system whose only carrier is the token stream, what the strong form claims about humans is what the architecture forces about LLMs. The strong form is not metaphor for LLMs; it is the architectural default. For LLMs, linguistic form constrains the reachable distinctions more directly than it does for humans, because the externally writable medium is tokenized context. Distinctions the deployed vocabulary explicitly carries are directly holdable; distinctions that require composition can sometimes be reached compositionally, and distinctions that are neither named nor compositionally accessible are not reachable in-session.
Figure 1. The architectural flip, visualized. Humans (left) have a non-linguistic carrier — the Off-Token Route through prelinguistic cognition, perception, and motor experience — that allows cognition to underdetermine language. LLMs (right) do not: every distinction the system can hold must pass through the token stream. For systems whose cognition is token-bound, the limit of language is the limit of the world.
This is a strong claim and bears restatement. We are not saying the strong Sapir–Whorf is true for humans; it is not, and Whorf-strong scholarship that argues otherwise is outside the scope of this paper. We are saying: of the several reasons the strong form fails for humans, one of them — the absence-of-a-non-linguistic-carrier defeater — is precisely the architectural feature LLMs lack. The same architectural argument flips direction depending on the system. Picking this cause among the several is a substantive choice, and we defend the choice by its mapping: it is the only cause whose human-side defeater corresponds to an LLM-side architectural feature.
Symbol grounding. Searle's Chinese Room argued that a system manipulating symbols does not thereby understand: symbols cannot ground meaning on their own. Harnad (1990) generalized the problem: symbols must be grounded in non-symbolic perceptual or sensorimotor activity for meaning to attach. The symbol-grounding problem has been a load-bearing argument against attributing understanding to language-based AI systems for forty years.
TSH does not refute Searle. It accepts the diagnosis — symbols, on their own, do not ground meaning the way human understanding does — and locates LLMs squarely inside the room. What it adds is a structural consequence the original argument did not draw: if the symbols are the only substrate, then whatever cognitive work happens at all happens at the level of the symbols themselves. The room contains nothing else. "Cognition" inside the room is whatever distinctions the symbols carry, applied to whatever distinctions the symbols invite. This is a weaker claim than understanding-in-Searle's-sense and does not contradict him. It is also a stronger claim than is usually drawn from Searle: the symbols are constitutive of whatever cognition the system has, not a failed medium for grounding it elsewhere.
The Systems Reply to Searle (the room-as-a-whole understands, even if the man inside does not) and the Robot Reply (embodied symbol manipulation grounds meaning) are the classic complications to the Chinese Room argument. TSH is consistent with the Systems Reply reading: "the system" is what TSH names the substrate of. This is a strengthening of the architectural reading, not a weakening — TSH's commitment is to the substrate-level cognitive work, which the Systems Reply already grants in spirit.
The design-level consequence — if the symbols are constitutive, the symbols must be designed — is the implication that motivates §7.
Searle was right about the room. He was not the only one inside it.
In-context learning, but not as capability-modifier. A large body of work has documented that LLM behavior is shaped by what is supplied in context: task framing (Brown et al. 2020), exemplars (Wei et al. 2022), induction heads (Olsson et al. 2022), and prompt scaffolding more generally. The dominant framing has been capability-modifier: the model has the capability, and context unlocks it. Emergence-as-mirage results (Schaeffer et al. 2023) qualified the framing — what looks like a capability gain may be a metric artifact — but did not displace the underlying picture in which weights carry capability and context conditions its expression.
TSH and a generic in-context-learning account both predict improvement after a supplied definition: on the ICL reading, the latent capability is "the ability to bind a new label to a supplied definition and apply it," and the definition unlocks that capability for the duration of the chat. The present experiment therefore does not by itself decisively separate generic definition-following from the stronger substrate-binding interpretation. What TSH adds is the claim that the introduced term functions as a reusable substrate handle for downstream category use — the coinage itself, not just the surrounding definitional prose, becomes the carrier. The present probe is consistent with that interpretation (post-introduction distinguishability is high, mean 2.98/3.0 across novel-target cells; cold-state distinguishability is low, mean 1.16/3.0; the boundary movement is reversible across fresh chats per H3 and absent on ceiling positive controls per H4), but the decisive discrimination requires the v2 controls described in §8 — particularly the definition-without-coined-label condition.
This is not a rejection of in-context-learning work. That body of work documents real phenomena and explains a great deal of LLM behavior. TSH is a claim about what context IS; the existing in-context-learning literature is a claim about what context DOES. Both can be true. The two framings generate different predictions in the v2 control battery described in §8 — particularly the definition-without-coined-label condition, which discriminates label-binding from generic definition-following.
The Off-Token Route. The contrast that makes TSH visible is the cluster of cognitive carriers LLMs lack. Humans typically have multiple non-token cognitive carriers — not a single off-token route but a cluster, distributed unevenly across individuals. The carriers identified in cognitive science include: core-knowledge systems for objects, agents, number, and place (Spelke 2000; Carey 2009), available in pre-verbal infants; mental imagery, with documented inter-individual variation including congenital aphantasia (roughly 2–5% of the population report no voluntary visual imagery yet show no general cognitive impairment; Zeman et al. 2015); motor simulation and procedural rehearsal; affect and interoceptive signal; proprioception and spatial sense. No single carrier is universal across humans, and the cluster's composition varies; what is robustly shared is that some off-token carrier is typically present, and it is sufficient to defeat the strong-Whorf architectural claim (cause b above).
We name this cluster's absence the Off-Token Route. The Off-Token Route is a contrast term, not a positive theoretical claim about a unitary cognitive route. It is defined extensionally, by the complement of token-bound cognition: anything humans cognize through that is not a token stream. The plural framing matters — collapsing the cluster into a single faculty would overclaim. What TSH needs from the contrast is architectural: humans have at least one such carrier, LLMs have none. The architectural consequence is that for an LLM no thinking is free of substrate and every concept must travel through tokens to be externally reachable at all.
Independent identifications of the same gap from adjacent literatures help locate the claim. Nefdt (2026) independently identifies the same conceptual gap, arguing that LLMs occupy "a hitherto vacant part of conceptual space" and asking whether there are systems that are "linguistic but not cognitive." Later, he characterizes the intermediate position as one in which LLMs are "purely linguistic agents unplugged from integration with both larger cognitive structure and the world in which it evolved." Karpathy, quoted by Nefdt, names the missing pieces directly: "we're still missing the rest of the brain. No hippocampus for memory. No amygdala for instincts. No emotions or motivations." Millière & Buckner (2024b, §4.2) describe the same absence as missing modules and argue it threatens the agential stability required for determinate, stable meanings. These framings are functional: they list what LLMs lack and argue from the list. TSH provides the architectural reading. The modules are absent because there is no carrier for them off the token stream; the agential instability is the lived shape of cognition that has only one externally writable substrate, renewed every chat.
Cappelen & Dever (2025) defend full cognitive-state attribution to LLMs on linguistic-competence grounds. TSH is compatible with their position if "cognitive states" are read as substrate-bound — whatever cognitive states an LLM has, those states have a token substrate as their medium — and incompatible if such states are taken to be substrate-independent. The empirical claims of this paper do not require resolving which reading is intended.
Multimodality does not buy an exit. Millière & Buckner (2024b, §3.1.1) observe that vision-transformer image patches are "fed to the model as sequential tokens, as one would with linguistic tokens." Adding modalities adds carriers to the token stream; it does not add a carrier off the token stream. A multimodal LLM is still token-bound; only the alphabet of the substrate has been widened.
Barandiaran & Pérez-Verdugo (2025) characterize generative AIs as "not intrinsically intentional systems" but as systems with "derived-intentionality" — the intentionality is borrowed from the humans whose practice the training corpus records. Their framework foregrounds the human side of the human-AI coupling; TSH foregrounds the LLM side. Read together: the borrowed intentionality is borrowed into the tokens, and the tokens are the only place it lives during a chat. The substrate is movable by writing into it because the substrate IS the writing.
The Off-Token Route names what current LLMs lack; the natural follow-on is to name the class of systems defined by that absence and characterize its cognitive mode, which we defer to a dedicated follow-up paper in this cluster (the empirical claims of the present paper concern LLMs as presumed instances, and the class-level argument is the subject of separate work).
The route is whatever is not the tokens. For an LLM, there is no route.
A Coinage Probe is a within-subject, in-context experimental procedure for measuring the gap between a model's cold-state and post-introduction handling of a coined term. The cold state is the model's behavior when the term is presented with no defining context and no prior session memory. The post-introduction state is the same model's behavior in the same chat after one canonical sentence of definition has been supplied. The procedure measures the change in the model's ability to distinguish the coined term from a fixed set of named near-neighbors after introduction — what we call its Lexical Reachability at that term.
The probe targets a specific empirical claim — the directional inequality stated in §2 — and a specific class of cases: coined terms with low training-data attestation, where any cold-state distinguishability cannot be attributed to prior exposure. The probe does not test what the model "understands" in any deeper sense. It tests whether the boundary between the coined term and its named neighbors moves outward upon introduction, and by how much.
The full protocol — exact prompts, introspection-discipline rules, harness configuration — is deposited as Appendix C. This section paraphrases for paper readers. Each trial uses two chats (A and B) and a third chat (C) for the re-cold check. Fresh chats are required because asking the boundary questions twice in one chat would contaminate the post-introduction state.
Figure 2. The Coinage Probe protocol. Each trial uses three fresh chats. In chat A, the model receives a cold question (P1) and the boundary distinguishability test (P4) against three named near-neighbors, with no introduction. In chat B, the same cold question is followed by a one-sentence canonical definition (P3), then the same boundary distinguishability test (P4) post-introduction. In chat C, the cold question and boundary test are repeated in a fresh chat without the definition to test whether the boundary movement persists into a new context. Per-trial Lexical Reachability, LR = Σᵢ (postᵢ − coldᵢ) over the three near-neighbors, quantifies boundary movement under one sentence of context.
The introspection-discipline rules — the operator does not volunteer context, does not respond to clarifying questions with new information, does not interpolate commentary between near-neighbor prompts — are enforced at the harness level and documented per trial in the data deposit.
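The three-chat trial structure described above can be sketched in code. This is a hypothetical harness fragment, not the deposited Appendix C protocol: the prompt wordings are abbreviated and the class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrialPlan:
    """One Coinage Probe trial: three fresh chats for a (model, term) pair."""
    term: str
    definition: str            # the one-sentence canonical definition (P3)
    neighbors: tuple           # the three locked near-neighbors for this term

    def chat_a(self):
        """Cold chat: cold question (P1), then boundary test (P4), no definition."""
        return [f"P1: What is '{self.term}'?"] + [
            f"P4: How does '{self.term}' differ from '{n}'?" for n in self.neighbors
        ]

    def chat_b(self):
        """Introduction chat: cold question, one-sentence definition, boundary test."""
        return [f"P1: What is '{self.term}'?", f"P3: {self.definition}"] + [
            f"P4: How does '{self.term}' differ from '{n}'?" for n in self.neighbors
        ]

    def chat_c(self):
        """Re-cold chat: repeats chat A in a fresh context (persistence check, H3)."""
        return self.chat_a()

# Illustrative trial built from the seed observation in §1:
plan = TrialPlan(
    term="elastic automator",
    definition="A system that uses a language model to turn uncertain human "
               "input into executable structure, then loops through generation, "
               "evaluation, correction, and presentation.",
    neighbors=("AI agent", "RPA bot", "LangChain"),
)
assert len(plan.chat_a()) == 4 and len(plan.chat_b()) == 5
```

Each chat is a fresh session precisely so that the P4 boundary questions are never asked twice in one context; chat C being identical in content to chat A is the point of the H3 persistence check.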
For each near-neighbor i in a trial, let post_i and cold_i be the per-neighbor distinguishability scores (0–3, defined in §3.5). The per-trial Lexical Reachability is
LR = Σᵢ (postᵢ − coldᵢ)
summed across the term's near-neighbor set. With three near-neighbors per term, per-trial LR ranges in [−9, +9]; positive values support TSH's directional claim, near-zero values either falsify it for that cell or flag training-leakage, and negative values would indicate the introduction worsens distinguishability (no negative cells were observed in the executed run).
Per-cell LR is the mean of trial LRs within the cell. Per-model LR is the mean of cell LRs across that model's terms. Panel LR is the mean of cell LRs across all included cells. Statistical tests on panel LR are pre-specified in Appendix C §6 and reported in §5.2.
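The aggregation chain from per-neighbor scores to panel LR can be sketched as follows. The scores below are illustrative placeholders, not values from the deposited run.

```python
def trial_lr(post, cold):
    """Per-trial Lexical Reachability: sum of per-neighbor (post - cold).

    Each score is on the 0-3 rubric scale of section 3.5, so with three
    near-neighbors per term the per-trial LR ranges over [-9, +9].
    """
    assert len(post) == len(cold) == 3
    return sum(p - c for p, c in zip(post, cold))

def mean(xs):
    return sum(xs) / len(xs)

# Illustrative numbers only (not from the data deposit):
trial_lrs = [
    trial_lr(post=[3, 3, 2], cold=[0, 1, 0]),   # trial 1 in this cell -> 7
    trial_lr(post=[3, 2, 3], cold=[1, 0, 0]),   # trial 2 in this cell -> 7
]
cell_lr = mean(trial_lrs)                # per-cell LR: mean of trial LRs
panel_lr = mean([cell_lr, 5.0, 6.0])    # panel LR: mean over included cell LRs
print(cell_lr)   # 7.0
```

Per-model LR follows the same pattern one level up: the mean of cell LRs across that model's terms.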
The Vocabulary Boundary is the observable the probe targets: the set of named distinctions a model cannot currently hold cold against a coined term's neighbors. The boundary is dynamic. Every introduction moves it; Lexical Reachability quantifies the movement.
The probe operationalizes the boundary as the gap between two measured distinguishability profiles — cold against post-introduction — over a fixed neighbor set. The choice of neighbor set defines which boundary the probe is testing; different neighbor sets for the same term would surface different boundary contours. The protocol's pre-specification locks the neighbor list per term before any data collection begins. The boundary is not a property of the model alone or the term alone, but of the (model, term, neighbor set) triple at the moment of measurement.
Each trial is scored by a three-judge panel — Claude Opus 4.7, GPT-5.5, and Gemini 2.5 Pro — judging independently. Per-dimension scores are aggregated as the mean across judges; inter-rater agreement is computed at two levels — pairwise judge-judge Cohen's κ and panel-mean-vs-author Cohen's κ, both reported in §5.2. The same locked rubric prompt template is used for every judge call. No judge sees the other judges' scores, aggregate run statistics, or the term's source paper. The rubric has four dimensions; cold-distinguishability is designated the primary dimension, with confabulation-severity among the secondary dimensions (per-dimension reliability is reported in §5.2).
An author-rated audit sample is run on a pre-specified 20% sample of trials (22 of 108, drawn under seed 0x4D29_8B1F_6E07_C3A2) and rated by the author personally against the same rubric. Panel-vs-author κ is computed per dimension. Decision rules: κ ≥ 0.7 → panel reliable, panel-mean as primary report; κ ∈ [0.5, 0.7) → expand audit sample to 40% and recompute; κ < 0.5 → sharpen rubric and re-rate. The executed run is panel-reliable on the primary dimension (cold-distinguishability κ = +0.712); confabulation-severity κ is below threshold (+0.41) and reported descriptively only (§5.2 panel reliability).
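A minimal sketch of the agreement statistic and the pre-specified decision rule follows. Unweighted Cohen's κ is shown for simplicity; the executed analysis computes κ per rubric dimension and may differ in weighting details.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two raters' categorical scores."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[c] / n) * (cb[c] / n)              # chance agreement
              for c in set(ca) | set(cb))
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def audit_decision(kappa):
    """Pre-specified decision rule for the panel-vs-author audit sample."""
    if kappa >= 0.7:
        return "panel reliable: report panel-mean as primary"
    if kappa >= 0.5:
        return "expand audit sample to 40% and recompute"
    return "sharpen rubric and re-rate"

# The executed run's reported values land in the top and bottom branches:
assert audit_decision(0.712).startswith("panel reliable")   # cold-distinguishability
assert audit_decision(0.41).startswith("sharpen")           # confabulation-severity
```

Under this rule the confabulation-severity dimension (κ = +0.41) falls below the 0.5 threshold, which is why it is reported descriptively only in §5.2.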
The judging-prompt template — verbatim rubric §3.5 plus the trial transcript plus the canonical definition and near-neighbor list — is locked in the pre-specified analysis plan document and the judging agent receives nothing more. Per-trial Lexical Reachability is computed from the panel-mean per-neighbor scores via the formula in §3.3.
The frontier panel is locked at three cross-vendor models accessible at the time of data collection:
- Claude Opus 4.7 (claude-opus-4-7) — the seed-probe model.
- GPT-5.5 (gpt-5.5).
- Gemini 2.5 Pro (gemini-2.5-pro).

The cross-vendor composition is load-bearing. TSH must not be a single-vendor artifact: a result that holds only on one provider's model would be evidence for a model-specific quirk, not for a substrate-level effect. Llama 4 was considered as a fourth model but dropped from v1 to keep the panel within the frontier tier and the trial budget within scope; substitution may occur in v2 if a fourth frontier-tier model becomes accessible.
The same three models also serve as the three-judge panel (§3.5). Every cell is scored by all three judges independently; pairwise judge-judge inter-rater Cohen's κ is reported in §5.2 alongside the panel-mean-vs-author κ. The same-family judging overlap (a Claude-produced cell is judged by Claude alongside GPT and Gemini) is a known asymmetry of the design. The pairwise-κ analysis allows §6 to examine whether a same-family judge systematically inflates or deflates its own family's output. The asymmetry is discussed as a limitation in §8 (Limitations).
Twelve terms span four strata; the full term roster is deposited as Appendix B.
The stratification serves two purposes. First, it provides a clean read on whether the boundary effect varies by attestation profile — author-coinages with some social-media exposure, paper-coinages with zero exposure, fully synthetic nonces with zero exposure and no author-corpus connection. Second — and load-bearing — it defends against the circularity concern raised by including self-referential terms in the test bundle. The stratified-analysis block in §6 (see Table 2 and Figure 6) reports per-stratum mean LR with 95% CIs and an F-test on cross-stratum variance.
For each coined term, a fixed set of three named near-neighbors is locked in the pre-specified term roster before any data collection begins. Near-neighbors are chosen on two criteria: (a) high cold-state collapse risk — the model is plausibly tempted to identify the coined term as the neighbor; (b) post-introduction distinctness — the canonical definition supplies a feature that distinguishes the term from the neighbor on a substantive dimension. Neighbors are sourced from the cold-elicitation transcripts of the seed run and from the term's own definition (the things the term explicitly is not).
For elastic automator, the seed-probe near-neighbors are AI agent, RPA bot, and framework like LangChain. The full term-by-term neighbor list is in Appendix B.
Per pre-specification: 3 models × 12 terms = 36 model×term cells, with 3 trials per cell = 108 trials minimum. Each trial uses chat A (cold + boundary test), chat B (cold + introduction + boundary test), and chat C (re-cold check, 100% coverage). With three near-neighbors per trial, the 36-cell panel comprises 108 trials × 3 near-neighbors = 324 paired distinguishability measurements, scored by all three judges for 972 judge calls across the primary cold/post dimensions plus the same volume on each secondary dimension.
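The trial-budget arithmetic above can be checked in a few lines; a sketch only, with the counts taken from the pre-specification.

```python
# Design constants from the pre-specification.
models, terms, trials_per_cell, neighbors, judges = 3, 12, 3, 3, 3

cells = models * terms                # 36 model×term cells
trials = cells * trials_per_cell      # 108 trials minimum
measurements = trials * neighbors     # 324 paired distinguishability measurements
judge_calls = measurements * judges   # 972 judge calls per scored dimension

assert (cells, trials, measurements, judge_calls) == (36, 108, 324, 972)
```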
Exclusion conditions are pre-specified. Per-cell: ≥50% of attempted trials failing with training-leakage-suspected excludes the cell from H1 analysis but retains it in descriptive reporting. Per-model and per-term ≥50% failure rates trigger broader exclusion. The training-leakage scan — a canonical-content scanner applied to all cold-state P1 responses — flagged 0 cells across the 36-cell matrix in the executed run; no cells were excluded.
All 108 trials are scored independently by each of the three judge models against the locked rubric (§3.5; full anchor examples in Appendix C §4). Per-dimension scores are aggregated as the panel mean. The judging-prompt template — verbatim rubric §3.5 plus the trial transcript plus the term's canonical definition and near-neighbor list — is fixed by the pre-specified analysis plan document; the judging agent receives nothing more. No judge sees the term's source paper, sibling trials, aggregate run statistics, or peer-judge scores.
The 22-trial author-rated audit sample is drawn under seed 0x4D29_8B1F_6E07_C3A2 (pre-specified, bound to the pre-specification document at SHA-256) and rated by the author against the same rubric. Panel-vs-author κ is computed per dimension and reported in §5.2.
The full pre-specification document — hypothesis statements, thresholds, sample sizes, rubric anchors, stopping rules, and analysis plan — is deposited as Appendix C and was hashed (SHA-256) at the moment of data-collection start; the hash is recorded in the run manifest as prereg_sha256. The Zenodo deposit includes both the document and its hash.
Pre-specification protocol: the analysis plan, decision rules, and rubric anchors were locked in a document on 2026-05-08, cryptographically hashed (SHA-256) and recorded in the run manifest before data collection began on 2026-05-09, and deposited unchanged with the paper's Zenodo release as Appendix C. The cryptographic hash provides verifiable lock-in between the locked plan and the run manifest; public registration in the OSF / AsPredicted sense (a public timestamp predating data collection) was not used.
Figure 3. The boundary shift. Same LLM, same three near-neighbors, drawn before (left) and after (right) a one-sentence canonical definition is inserted between the two states. The cold distinguishability cloud (left) snaps into structured separation (right) under the introduction. Mean cell-level Lexical Reachability = +5.47 across n = 30 novel model×term cells.
Table 1 reports mean cold-state distinguishability score (D_cold), mean post-introduction score (D_post), and Lexical Reachability (ΔLR = Σ(D_post − D_cold) across the three near-neighbors per trial, averaged across the three trials in each model×term cell) for each of the 36 (model × term) cells in the probe matrix. Scores are panel-mean distinguishability averaged across three near-neighbors, on a scale of 0–3 per near-neighbor; ΔLR is therefore bounded in [−9, +9], though every observed value in the run is non-negative. No cells were excluded by the training-leakage scan (§5.4).
Table 1. Per-cell Lexical Reachability (run-2026-05-09T21-00Z; N = 36 cells, 108 trials, 324 paired near-neighbor measurements)
| Term | Stratum | Claude — cold / post / ΔLR | GPT-5.5 — cold / post / ΔLR | Gemini 2.5 Pro — cold / post / ΔLR |
|---|---|---|---|---|
| elastic automator | author-coinage | 1.11 / 3.00 / +5.67 | 1.22 / 3.00 / +5.33 | 1.00 / 3.00 / +6.00 |
| EGGF | author-coinage | 0.40 / 3.00 / +7.78 | 1.52 / 3.00 / +4.44 | 0.89 / 3.00 / +6.33 |
| YON notation |
Score scale: 0–3 per near-neighbor per trial (panel-mean across three judges). ΔLR = Lexical Reachability: sum of (D_post − D_cold) across the three near-neighbors per trial (maximum 9.0 per trial); the cell-level ΔLR reported here is the mean of the three trial-level ΔLR values for the cell. Full per-trial verbatim transcripts in Appendix A; Figure 4 provides a strip-chart view.
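The ΔLR bookkeeping above can be sketched directly; the per-neighbor scores below are hypothetical stand-ins, not values from the run.

```python
from statistics import mean

def trial_lr(cold_scores, post_scores):
    """Trial-level Lexical Reachability: summed per-neighbor (post - cold) gain,
    each score on the 0-3 panel-mean scale, three near-neighbors per trial."""
    return sum(p - c for c, p in zip(cold_scores, post_scores))

def cell_lr(trial_pairs):
    """Cell-level ΔLR: mean of the three trial-level ΔLR values for a model×term cell."""
    return mean(trial_lr(c, p) for c, p in trial_pairs)

# Hypothetical cell (not run data): three trials, each with three near-neighbor
# (cold, post) score lists.
cell = [
    ([1.00, 1.00, 1.33], [3.0, 3.0, 3.0]),
    ([1.00, 1.33, 1.00], [3.0, 3.0, 3.0]),
    ([1.33, 1.00, 1.00], [3.0, 3.0, 3.0]),
]
assert round(cell_lr(cell), 2) == 5.67
```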
Figure 4. Strip-chart view of per-trial Lexical Reachability across the 36 (model × term) cells. Each marker is one trial (n=108); positive-control cells cluster at ΔLR ≈ 0; novel-target cells span ΔLR ≈ +2 to +9. Color encodes model (Claude / GPT-5.5 / Gemini); facet encodes stratum (author-coinage / self-ref / nonce / positive-control).
The pattern is immediate. All ten novel-target terms show positive ΔLR across all three models. Positive-control terms show zero movement: cold-state scores are already at ceiling (3.00) and post-introduction scores do not change. No novel-target cell shows cold-state saturation near the ceiling; the lowest cold score in the novel-target set is 0.11 (Claude × YON and Claude × cogalent pruning), and the highest is 2.11 (Gemini × coinage probe — see §5.4). Post-introduction scores are tightly clustered at ceiling: at the cell level, Claude reaches 3.00 in 10/10 novel cells, GPT-5.5 in 9/10 (off-token-route post = 2.93), and Gemini 2.5 Pro in 6/10; across the 30 novel model×term cells, 25 are exactly at 3.00 and the remaining five are near ceiling, with Gemini × coinage-probe the lowest post cell at 2.78 and the rest within 0.15 of 3.00.
H1 — TSH-core. The primary hypothesis was tested on the full novel-target pool. The pre-specified conjunction requires three simultaneous tests: Wilcoxon signed-rank p < 0.01; Cohen's d ≥ 0.8; mean panel LR ≥ +3.0. All three pass.
At the cell level (n = 30 novel model×term cells), mean LR = +5.47 (95% CI: +5.13, +5.80) and Cohen's d_cell = +3.95 (cell-mean LR = 5.467, between-cell sample SD = 1.383, ddof=1), both well above pre-specified thresholds (LR ≥ +3.0; d ≥ 0.8). Mean cold-state score = 1.16/3.0; mean post-introduction score = 2.98/3.0. The cell-level Wilcoxon signed-rank test of cell-mean LR against zero yields W = 465 (n = 30; all 30 novel cells have positive ΔLR), with exact two-sided p = 1.86 × 10⁻⁹ (normal-approximation p = 1.73 × 10⁻⁶), passing the pre-specified α = 0.01 conjunction threshold by a wide margin. H1 is supported.
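The headline numbers can be reproduced from the reported summaries alone; this is a sketch of the arithmetic, not a re-analysis of the raw scores.

```python
# With all n = 30 cell-level differences positive, the signed-rank statistic takes
# its maximal value and the exact two-sided p-value is 2 * (1/2)**n.
n = 30
w = n * (n + 1) // 2        # 465, the reported W
p_exact = 2 * 0.5 ** n      # ~1.86e-9, the reported exact p

# Cell-level standardized effect from the reported summary statistics.
d_cell = 5.467 / 1.383      # cell-mean LR / between-cell sample SD, ~+3.95

assert w == 465
assert abs(p_exact - 1.86e-9) < 1e-11
assert round(d_cell, 2) == 3.95
```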
Clustering caveat. The cell-level effect size is the inference-stable headline because the observations in this design are nested: 3 near-neighbors within trial, 3 trials within cell, 36 cells within (3 models × 12 terms). Treating the 270 paired near-neighbor observations as independent ignores that cluster structure and inflates the test. For completeness, the observation-level statistics (n = 270 paired near-neighbor measurements, treated as independent) are Wilcoxon W = 0, p = 2.66 × 10⁻⁴⁶ and Cohen's d = +3.08; the cell-level Cohen's d_cell = +3.95 remains the inference-stable headline. The effective N is 30 at the cell level (novel-target cells), 90 at the trial level, and 270 at the observation level, the loosest unit. Future work will fit a mixed-effects ordinal regression treating trial and cell as nested random effects; the present paper reports both effect-size scales and flags the limitation.
Producer × judge-family analysis. The pairwise inter-rater pattern lets us check whether judges systematically inflate or deflate their own family's output. The three judges (Claude Opus 4.7, GPT-5.5, Gemini 2.5 Pro) score all 9 (producer, judge) combinations of the 30 novel-target cells; we compute the mean cold-state distinguishability per (producer, judge) cell directly from the per-trial judging logs (n = 90 paired near-neighbor observations per cell).
Table 1a. Producer × judge-family cold-state mean distinguishability (direct measurements, n = 90 per cell)
| | Claude-judge | GPT-judge | Gemini-judge | Row mean |
|---|---|---|---|---|
| Claude-prod | **0.911** | 1.044 | 0.889 | 0.948 |
| GPT-prod | 1.167 | **1.400** | 1.322 | 1.296 |
| Gemini-prod | 1.178 |
Diagonals (bolded) are the same-family scores; off-diagonals are the cross-family scores.
The same-family bias hypothesis predicts that judges score their own family's cold-state output systematically lower or higher than cross-family judges do. We test this with a paired t-test across the 10 novel cells per producer, comparing own-judge cell mean to the mean of the two other judges' cell means:
A Welch's unpaired sanity check across all 90 observations per cell yields the same qualitative result: only the GPT-judge own-family direction approaches but does not reach significance (Welch's t = +1.91, p = 0.058; full-sample observation-level, n_own = 90 vs n_other = 180). None of the three same-family bias tests is statistically significant at α = 0.05 under the proper paired-cell test. The judge-side bias account is therefore not empirically supported in the present data; we report this finding directly and discuss its consequence for the H2 interaction account in the H2 interaction block in §6.
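The paired test used above can be sketched as follows; the per-cell means are hypothetical stand-ins, and the p-value uses a normal approximation rather than the exact t distribution used in the paper.

```python
import math
import statistics

def paired_t(a, b):
    """Paired t-test across matched cells: statistic and a two-sided p-value
    via the normal approximation (the paper's test uses the exact t distribution)."""
    d = [x - y for x, y in zip(a, b)]
    t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
    p = math.erfc(abs(t) / math.sqrt(2))   # 2 * (1 - Phi(|t|))
    return t, p

# Hypothetical per-cell cold-state means for one producer's 10 novel cells:
# own-family judge vs. the mean of the two cross-family judges.
own   = [0.90, 1.10, 0.80, 1.00, 1.20, 0.95, 1.05, 0.85, 1.15, 1.00]
cross = [1.00, 1.05, 0.90, 1.10, 1.15, 1.00, 1.10, 0.90, 1.10, 1.05]

t, p = paired_t(own, cross)   # with these stand-ins, t ~ -1.77, p ~ 0.08: not significant
```

The pairing matters: each cell's own-judge and cross-judge means share the same trials, so the paired difference removes cell-level variance that an unpaired Welch's test leaves in.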
Post-introduction scores are at or near ceiling across all 9 cells (range 2.88–3.00), so the post-side cross-family variance is essentially zero and not informative for the bias question; see Table 1 for full per-cell post-introduction values.
H2 — Model-invariance. Per-model means are consistent in magnitude: Claude Opus 4.7 = +6.16 (95% CI: +5.53, +6.78); GPT-5.5 = +5.09 (+4.63, +5.55); Gemini 2.5 Pro = +5.16 (+4.56, +5.75). The pre-specified two-way ANOVA (Type II OLS; factors: model × state) finds a statistically significant model × state interaction: p = 1.78 × 10⁻⁵ against the Bonferroni-corrected threshold of 0.0025, a miss by a factor of ~140. Under the pre-specified conjunctive rule — all four sub-tests must pass — H2 is disconfirmed: the ANOVA criterion is the falsification criterion at α = 0.0025, and it fails decisively. We register that verdict plainly.
Exploratory secondary evidence weakens the disconfirmation but does not overturn it. Cross-model CV = 0.109 (well below the soft 0.5 threshold); all three models show LR > 0 on every novel-target term; all three means fall within ±50% of the cross-model mean. The state main effect dominates (p = 1.27 × 10⁻²¹⁴), and the model main effect is small (p = 2.14 × 10⁻⁴) relative to it. These are descriptive consistency observations; they were not pre-specified as falsification criteria. The H2 interaction block in §6 develops a producer-side response-style account of the interaction as the sole supported candidate explanation; the judge-side same-family bias candidate was tested directly in the producer × judge-family analysis above and is not statistically supported (all three own-family paired-t comparisons fail to reach α = 0.05). The producer-side account remains post-hoc and would need to be entered into a pre-specified follow-up before being treated as confirmed.
Figure 5. Per-model boxplots of Lexical Reachability across novel-target trials (n=90: 30 per model). All three models show ΔLR distributions concentrated above the pre-specified +3.0 threshold. Claude Opus 4.7 shows the largest median (driven by lower cold-state floor; see §5.2.1 ceiling discussion); GPT-5.5 and Gemini 2.5 Pro cluster closely. The ANOVA model × state interaction (p = 1.78 × 10⁻⁵) is visible as a tail-length asymmetry rather than a magnitude reversal.
H3 — Off-Token Route. H3 required two simultaneous tests: cold-state vs. recold-state scores should be statistically indistinguishable (fail to reject), and post-introduction vs. recold-state scores should differ significantly (reject). Chat-C (recold) data were collected in a second full run of the protocol under pre-specified conditions; collection was completed after the primary run and is disclosed here as a deviation-by-elaboration — the pre-specified analysis plan specified the test logic; the recold collection method is consistent with §6.3 of the pre-specification document.
Test 1 (cold ≈ recold): W = 6348.0, p = 0.345 (Bonferroni threshold 0.0125; passes). Mean cold–recold difference = −0.051; the cold and recold distributions are statistically indistinguishable. Test 2 (post ≠ recold): W = 8.0, p = 4.34 × 10⁻⁴⁸ (passes with large margin). Mean post–recold per-near-neighbor distinguishability difference = +1.77 (equivalent to roughly +5.31 in summed-LR units across three neighbors) — the vocabulary boundary introduced in chat B does not persist into a fresh chat. Cold-state and recold-state classification scores match in 80.2% of trials. H3 is supported under both pre-specified and Bonferroni-corrected thresholds.
H4 — Positive-control negative test. Positive-control cells (gradient descent, transformer architecture) produce uniformly degenerate scores across all 18 trials: cold = 3.00/3.00, post = 3.00/3.00, ΔLR = 0.00 (CI: [0.00, 0.00]). The Wilcoxon test is undefined (all paired differences are zero); the conservative assignment p = 1.00 passes the threshold of ≥ 0.05 with maximal margin. H4 is supported under the pre-specified criterion, with a ceiling caveat: terms already held at maximum distinguishability cannot show measurable movement on the present rubric, so this result is consistent with specificity but does not by itself rule out a generic information-injection account on terms with cold-state floor. The v2 controls described in §8 are needed for a decisive negative test.
Panel reliability. Inter-rater Cohen's κ between the three-judge panel and the author-rated audit sample (N = 22 trials; 20% of the 108-trial bundle drawn by pre-specified seed 0x4D29_8B1F_6E07_C3A2) is reported for two primary dimensions. Cold-distinguishability κ = 0.712 (quadratic weighting, n = 66 paired observations), above the pre-specified reliable threshold of 0.70; decision rule: primary analysis uses panel-mean scores throughout. Panel-mean scores were rounded to the nearest rubric category (0/1/2/3) before computing quadratic-weighted Cohen's κ against the author's integer ratings, following the standard ordinal-κ discretization convention. Post-distinguishability κ is degenerate — both panel and author produce near-ceiling scores, leaving no variance for kappa estimation. Confabulation-severity κ = 0.41, below threshold; this secondary dimension is reported for transparency but does not affect the decision rule.
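The quadratic-weighted κ used throughout can be sketched in a few lines; the ratings below are hypothetical, and panel-mean scores are assumed to be rounded to integer rubric categories first, per the convention above.

```python
def quadratic_weighted_kappa(a, b, n_cat=4):
    """Quadratic-weighted Cohen's kappa for ordinal ratings in {0, ..., n_cat-1}.
    a, b: two raters' integer category lists of equal length."""
    n = len(a)
    pa = [a.count(k) / n for k in range(n_cat)]           # rater-a marginals
    pb = [b.count(k) / n for k in range(n_cat)]           # rater-b marginals
    w = lambda i, j: (i - j) ** 2 / (n_cat - 1) ** 2      # quadratic disagreement penalty
    observed = sum(w(i, j) for i, j in zip(a, b)) / n
    expected = sum(w(i, j) * pa[i] * pb[j]
                   for i in range(n_cat) for j in range(n_cat))
    return 1.0 - observed / expected

# Hypothetical ratings (not run data) on the 0-3 rubric scale.
panel  = [0, 1, 2, 3, 0, 1, 2, 3]
author = [0, 1, 2, 3, 0, 1, 2, 2]

assert quadratic_weighted_kappa(panel, panel) == 1.0      # perfect agreement
kappa = quadratic_weighted_kappa(panel, author)           # one adjacent-category miss
```

The quadratic weighting is what makes the statistic ordinal-aware: an adjacent-category disagreement (3 vs. 2) is penalized far less than a 0-vs-3 disagreement.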
Pairwise judge-judge agreement. Inter-rater κ between the three judges on the primary cold-state distinguishability dimension is reported at two scopes. (i) On the full cell panel including positive controls (n = 324 paired per-near-neighbor observations per judge pair = 108 trials × 3 near-neighbors per trial — matching the pre-specification's "across the cell panel" threshold scope), quadratic-weighted Cohen's κ: Claude × GPT-5.5 = +0.622; Claude × Gemini = +0.622; GPT-5.5 × Gemini = +0.752. Two pairs fall just below the pre-specified 0.7 threshold; one pair clears it. (ii) On the substantive measurement domain alone (n = 270 paired per-near-neighbor observations per judge pair, the 30 novel-target cells × 3 trials × 3 near-neighbors, excluding positive controls where all judges trivially score the ceiling), quadratic-weighted κ: Claude × GPT-5.5 = +0.29; Claude × Gemini = +0.27; GPT-5.5 × Gemini = +0.49 (unweighted: +0.26, +0.19, +0.39). All three novel-only κ values fall below the 0.7 threshold; the full-panel values are inflated by the trivially agreeing positive controls. Panel-mean-vs-author κ on the primary dimension is +0.712 (cold-state, n = 66 paired observations from the 22-trial author-rated audit sample), above threshold. The two-scope picture is honest: individual-judge agreement on the actual measurement domain is modest, while the panel-mean aggregation tracks the author-rated audit. The aggregation discipline — three judges, panel-mean as the primary score — is therefore essential to the reported estimator: the panel works via aggregation, not via individual-judge agreement, and individual-judge scores are not the unit of inference.
A measurement caveat that conditions every effect-size statement in §5.2 deserves its own subsection.
The ceiling. Post-introduction scores are concentrated at the top of the 0–3 scale. At the cell level, of the 30 novel model×term cells: Claude reaches 3.00 in 10/10 novel cells, GPT-5.5 in 9/10 (off-token-route post = 2.93), and Gemini 2.5 Pro in 6/10. Across the 30 novel model×term cells, 25 are exactly at ceiling and the remaining five are near ceiling, with Gemini × coinage-probe the lowest post cell at 2.78 and the rest within 0.15 of 3.00. Cold-state scores, by contrast, span 0.11 to 2.11 across the novel-target panel.
Mechanical consequences. This asymmetric distribution has three consequences that affect how the §5.2 statistics should be read.
Cohen's d is inflated by the bounded scale. Cohen's d is computed against a within-condition SD; when the post distribution is artificially compressed by the ceiling, the SD term is mechanically smaller and d is correspondingly larger. The reported d = 3.08 (observation-level) and d_cell = 3.95 (cell-level) should both be read with the ceiling in mind. They establish a very large effect, but the magnitude in standardised units is not directly comparable to a d computed on an unbounded scale.
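The compression mechanism can be illustrated with a small simulation; the latent score distributions are hypothetical and chosen only to mimic a post condition whose true mean sits above the scale top.

```python
import random
import statistics

random.seed(7)

def cohens_d(a, b):
    """Standardized mean difference with a pooled SD."""
    sp = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / sp

n = 20_000
cold = [random.gauss(1.2, 0.6) for _ in range(n)]          # cold scores sit mid-scale
latent_post = [random.gauss(3.4, 0.6) for _ in range(n)]   # latent post mean above the top
post = [min(x, 3.0) for x in latent_post]                  # the 0-3 rubric clips the top

# Clipping compresses the post-side SD; the pooled SD shrinks and d grows,
# even though the clipped mean is *lower* than the latent mean.
assert statistics.stdev(post) < statistics.stdev(latent_post)
assert cohens_d(cold, post) > cohens_d(cold, latent_post)
```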
The H2 ANOVA interaction is mechanically driven by where there is room to vary. Cold-state scores span the full 0–2 range and concentrate the cross-model variance; post-state scores cannot vary above 3.00. Claude's larger ΔLR reflects, in part, that its cold-state floor is lower (mean 0.95 vs. GPT 1.30, Gemini 1.23) and its post-state ceiling is exactly where the other two models also sit (3.00). The interaction can fail not because Claude has a different substrate-access profile but because the bounded scale forecloses the alternative pattern that would make it pass.
Between-model comparison at post is compressed. With 25 of 30 novel-target post-cells at exactly 3.00 and the remaining five within 0.22 of ceiling, the post-state data provides essentially no inter-model discrimination. The H2 magnitude consistency on post is therefore both unsurprising and uninformative; the discrimination is in the cold-state distribution.
Ceiling-robust effect-size summary. We provide a non-parametric, ceiling-tolerant summary of the H1 effect: the median per-trial Lexical Reachability (over n = 90 novel-target trials) is 6.000 on the 0–9 scale, with a 95% percentile-bootstrap CI of approximately [5.33, 6.00] (10,000 resamples; the upper bound is at the discrete value 6.00 because 28 of 90 trial-level LRs are exactly 6.0). At the cell level (n = 30 novel-target cells), the median cell-mean LR is 5.61, with a 95% bootstrap CI of approximately [4.89, 5.89]. The median estimator is bounded above by the maximum LR observed (9.0) and is insensitive to the upper-bound compression that inflates d.
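The percentile-bootstrap median CI can be sketched as follows; the LR values are hypothetical stand-ins chosen so that, as in the run, mass piles up at a discrete value and pins the upper bound there.

```python
import random
import statistics

def bootstrap_median_ci(xs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the median: resample with replacement,
    take each resample's median, and read off the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    meds = sorted(
        statistics.median(rng.choices(xs, k=len(xs))) for _ in range(n_boot)
    )
    return meds[int(n_boot * alpha / 2)], meds[int(n_boot * (1 - alpha / 2)) - 1]

# Hypothetical trial-level LR values (not run data): 40 of 90 trials sit exactly
# at the discrete value 6.0, so the upper percentile bound pins at 6.0.
lrs = [6.0] * 40 + [5.33] * 25 + [4.67] * 15 + [3.33] * 10
lo, hi = bootstrap_median_ci(lrs)
```

Because the bootstrap resamples the observed values themselves, the CI endpoints can only take values present in the data; that is why a discrete pile-up at 6.0 produces an upper bound of exactly 6.00.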
Verdict. The effect survives ceiling correction — the median per-trial LR of +6.0 on a 0–9 scale is well above the pre-specified LR ≥ +3.0 threshold, and a ceiling-corrected reading does not threaten H1. What the ceiling does threaten is the precision of the standardised-effect-size claim and the strict interpretation of the H2 ANOVA interaction. Both should be read with the ceiling-effect caveat foregrounded: the substrate-movement is large and consistent at the rank level; the d-scale magnitudes and the model × state interaction are partly artefacts of the bounded scale and would benefit from a 0–5 or 0–10 rubric expansion in future replications.
The shape of the effect is consistent across cells. Two examples bracket the range.
YON notation × Claude (ΔLR = +8.67; cold mean = 0.11/3.0, post = 3.00/3.0). The highest LR cell in the panel. Cold-state: the model treats "YON" as an opaque three-letter string with no available referent, producing a hedged non-answer and collapsing all three near-neighbors — JSON, Markdown, prompt template DSLs — into a generic "structured text format" category. The boundary is flat: no distinction is drawn between any pair. Post-introduction: the model anchors immediately on the stream-first, line-independent framing and applies it discriminatively against all three neighbors — distinguishing YON from JSON/XML by closure semantics, from Markdown by the machine-record vs. presentation contrast, and from prompt templates by generation-shape vs. template-substitution. The vocabulary boundary moves from floor to ceiling on a single introduction sentence.
Token Tax × GPT-5.5 (ΔLR = +4.44; cold mean = 1.52/3.0, post = 3.00/3.0). A confident-confabulation case. Cold-state: the model asserts a DeFi smart-contract tax meaning without expressed uncertainty — "Token Tax is a fee mechanism in blockchain systems where tokens are redistributed on transaction" — and constructs its neighbor distinctions through that wrong frame. The term is substrate-reachable to the model, but in the wrong sense: a homophonous concept from a different domain has colonized the token cluster. Post-introduction: the model immediately recovers the overhead-as-purchased-benefit framing and applies it correctly against all three neighbors — distinguishing Token Tax from API cost as structural-overhead premium vs. per-call dollar figure, and separating it from bloat by the value-accounting move. The author rated confabulation severity at 2/2 — the highest in the audit sample.
Caveat: the confabulation-severity dimension is below the inter-rater reliability threshold (κ = 0.41; see §5.2). The qualitative characterizations in this section illustrate the shape of the effect; they are not statistically supported by the panel rating on this secondary dimension.
Gemini × coinage probe (ΔLR = +2.00). The lowest novel-target ΔLR in the panel and the only cell where post-introduction scores did not approach ceiling (post mean = 2.78/3.0 vs. 3.00 elsewhere in the novel-target set). Cold-state for this cell is also the highest in the novel-target set (mean = 2.11/3.0): Gemini already partially distinguishes the coinage probe from neighboring methodologies in the cold state, leaving less room for the introduction to move the boundary. The self-referential character of the term may have contributed — Gemini is being probed, via the coinage probe, to distinguish the coinage probe from probing classifiers and behavioral evaluations; the methodology-description structure of the prompt may have partially activated that frame. This cell is retained in the primary analysis; ΔLR = +2.00 is positive and directionally consistent with H1.
Training-leakage scan. The pre-specified canonical-content leakage scanner flagged 0 cells across the 36-cell matrix. No cells were excluded from the primary analysis.
What the data does and does not show. One sentence moved the boundary, on every novel-target cell, on every model. Mean ΔLR = +5.47 on a 9-point scale; cell-level Cohen's d_cell = +3.95 (observation-level d = +3.08). The effect is not subtle. Every (model × term) cell in the novel-target set shows positive boundary movement; no novel-target cell shows zero movement.
The data does not show understanding. It shows that one sentence moves the boundary the model can hold. It does not show that the token substrate is the only substrate for LLM cognition — the protocol is not designed to rule out alternatives at that level of generality. What it shows, specifically, is that for the ten low-attestation coined targets tested (alongside two positive controls), introducing a one-sentence definition into the token context is sufficient to move the vocabulary boundary outward: distinctions the model could not previously hold become holdable. The effect is reversible (H3), absent on ceiling positive controls (H4), and consistent in magnitude across three cross-vendor frontier models.
The H2 ANOVA interaction. H2 fails. The pre-specified ANOVA detects an interaction the conjunctive rule was set to reject (p = 1.78 × 10⁻⁵; §5.2). The interaction is real; what causes it is open. The exploratory magnitude evidence is consistent (CV = 0.109, all three models within ±50% of cross-model mean, all three LR > 0 on every novel-target term), but those tests were not the pre-specified falsification gate. What follows is a post-hoc candidate explanation for the interaction; we are explicit that it requires a pre-specified follow-up to confirm.
The interaction has one supported candidate explanation in the present data and one candidate explanation that the data does not support. The supported candidate is producer-side response style. Claude Opus 4.7 shows systematically larger ΔLR than GPT-5.5 and Gemini 2.5 Pro, driven by lower cold-state scores rather than higher post-introduction scores — Claude's cold-state outputs are more explicitly epistemic, flagging novel coinages as non-standard before offering a guess, whereas GPT-5.5 and Gemini more often confabulate confidently from compositional surface. This pattern is consistent with a published vendor-level training choice: Anthropic's Constitutional AI / RLHF pipeline (Bai et al. 2022) explicitly shapes models toward calibrated uncertainty on novel inputs, which would depress cold-state distinguishability scores in this rubric. We flag this as a candidate confounder for v2: a controlled follow-up that varies the prompt-side hedging instruction across models (e.g., "answer concretely without hedging" vs. a default-elicitation control) would test whether the cold-state gap is producer-style or substrate-availability. The hypothesis is post-hoc and is not adjudicated here.
The judge-side candidate — that same-family judges systematically inflate or deflate their own family's cold-state output — does not survive direct paired-t testing on the 9 (producer, judge) cells of the producer × judge-family table (§5.2). All three own-family comparisons fail to reach significance at α = 0.05 (Claude-prod × Claude-judge p = 0.80; GPT-prod × GPT-judge p = 0.28; Gemini-prod × Gemini-judge p = 0.68). The unpaired Welch's t for GPT-judge approaches but does not reach significance (p = 0.058), and is the only direction worth checking in a v2 with a single-judge-stratified analysis. As a candidate explanation for the H2 interaction in the present data, judge-side bias is not supported; the producer-side response-style account is the sole supported candidate. We report the negative finding directly: an earlier constraint-derived version of this table overstated the judge-side bias pattern; the direct-measurement analysis reported here corrects that overstatement.
All three models reach near-identical post-introduction scores (all approach or hit ceiling at 3.00/3.00). The cross-model divergence is concentrated on the cold side; the introduction lifts every model to ceiling regardless of where its cold-state floor is. The state main effect (p = 1.27 × 10⁻²¹⁴) dwarfs the model main effect (p = 2.14 × 10⁻⁴) and the interaction. The boundary movement is large and present in every cell of the novel-target panel. Whether the cross-model interaction reflects producer-style hedging or genuine substrate-access differences is not adjudicable from this dataset; we identify the producer-side style as a candidate explanation and defer the substrate-vs-style question to a controlled follow-up.
The strongest alternative explanation for H1 itself — that models already held these terms at low confidence and the introduction raised a reporting threshold rather than changing substrate availability — is most directly tested by H3 and H4 in combination. H3 is the stronger evidence against a simple persistence / latent-knowledge account: a threshold-raising mechanism would predict that the boundary movement persists across sessions (the knowledge was always there), but H3 shows it does not survive a fresh chat. H4 is consistent with specificity but is ceiling-limited: terms already at maximum distinguishability cannot show measurable movement under the present rubric. Together, these constrain a threshold-raising account, but the v2 controls described in §8 are needed for a decisive separation from generic definition-following. The threshold-raising alternative is therefore not what the H2 interaction is about; the H2 interaction is about cross-model style differences on the cold side, conditional on the substrate-movement claim of H1 holding.
The self-referential terms. Four of the twelve tested terms were coined by the same author in this paper to describe the phenomenon being tested: off-token route, lexical reachability, token substrate, coinage probe. Including them creates an obvious circularity concern — could the paper's own introduced vocabulary be inflating the results?
The stratified analysis directly addresses this. The self-referential stratum shows mean LR = +4.90 (95% CI: +4.47, +5.33) — the lowest of the three novel-term strata, not the highest. Author-coinage terms (published before the probe run) show mean LR = +6.08 (+5.60, +6.57); nonce terms (fully synthetic, no public-web attestation at coinage time) show +5.37 (+4.41, +6.33). If self-referential inclusion were inflating the panel result, the self-referential stratum would show elevated LR relative to the others. It does not. The ordering is consistent with the substrate-availability reading: author-coinage terms include several with very opaque compositional surfaces (YON, EGGF), which land near cold-floor and produce the largest ΔLR; self-referential terms have somewhat more accessible compositional surfaces (lexical reachability, token substrate parse as known morpheme compounds), which allows modest cold-state scaffolding and leaves less room for the introduction to move.
The self-referential stratum's inclusion is defensible on this basis and informative in its own right: even for terms that describe the phenomenon being measured, the cold-state boundary is not saturated, and the introduction moves it. If the probe works on its own coined methodology terms, the result has a pleasingly recursive character: the methodology successfully diagnoses itself. If it shows asymmetry — as it does here, weakly — the asymmetry is its own finding.
Connection to Mares (2026), Elastic Automators. The Elastic Automators position paper (Mares 2026, Zenodo DOI 10.5281/zenodo.19802018) argued that the behavioral pattern of LLM-driven automation — uncertain input transformed to executable structure via a generate/evaluate/correct/present loop — constitutes a diagnostic category distinct from AI agents, RPA bots, and agentic frameworks. That paper described what these systems do. The present paper adds a structural claim about what their cognition runs on.
The two papers are intended as paired claims, not competing ones. EA names a behavioral pattern and argues it is the right unit of description for a class of deployed systems. TSH names a property of the substrate those systems run on — the token sequence is the cognitive medium, not a transmission channel to some deeper medium — and provides empirical evidence that this substrate is movable by the act of writing into it. Together: elastic automators are systems built on a token substrate, and that substrate is the handle through which their behavior can be shaped. The implication for system design is addressed in §7.
Stratified analysis by term-type. Figure 6 shows per-stratum LR distributions; Table 2 summarizes stratum statistics.
Figure 6. Per-stratum boxplots of Lexical Reachability (novel-target strata only). The author-coinage stratum (n=36) shows the highest median (+6.08); self-ref-paper (n=36) shows +4.90; nonce (n=18) sits between at +5.37. The ordering disconfirms a circularity inflation account for self-referential inclusion (see the self-referential terms block above). Positive-control stratum omitted (ΔLR = 0.00 uniformly).
Table 2. Stratified Lexical Reachability (novel strata only; positive-control stratum LR = 0.00 uniformly)
| Stratum | N | Mean LR | 95% CI | Within-stratum CV |
|---|---|---|---|---|
| author-coinage | 36 | +6.08 | +5.60, +6.57 | 0.244 |
| self-ref-paper | 36 | +4.90 | +4.47, +5.33 | 0.271 |
| nonce | 18 | +5.37 | +4.41, +6.33 | 0.387 |
Cross-stratum F-test (three novel strata): F = 5.225, p = 0.0072. Cross-stratum CV = 0.109; mean within-stratum CV = 0.301.
The pre-specified stratification prediction was that the three novel strata would be statistically indistinguishable (cross-CV < 0.5 of within-CV; F-test p > 0.05). The F-test finds significant cross-stratum variance; the prediction is not met. This is reported as a finding, not a failure: the stratified result is consistent with the substrate-availability reading (see the self-referential terms block above) and does not threaten H1, which is tested on the full novel-target pool and not predicated on stratum-level equivalence.
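The coefficient-of-variation half of the pre-specified prediction can be reproduced from the Table 2 summary statistics alone. A minimal sketch (values copied from Table 2; the F-test itself requires the per-cell scores and is omitted here):

```python
# Sketch: reproduce the cross-stratum CV check from Table 2 summary values.
from statistics import mean, stdev

stratum_mean_lr = {"author-coinage": 6.08, "self-ref-paper": 4.90, "nonce": 5.37}
within_stratum_cv = {"author-coinage": 0.244, "self-ref-paper": 0.271, "nonce": 0.387}

means = list(stratum_mean_lr.values())
cross_cv = stdev(means) / mean(means)            # sample SD over grand mean
mean_within_cv = mean(within_stratum_cv.values())

# Pre-specified prediction: cross-stratum CV < 0.5 x mean within-stratum CV
cv_prediction_met = cross_cv < 0.5 * mean_within_cv

print(round(cross_cv, 3), round(mean_within_cv, 3), cv_prediction_met)
# -> 0.109 0.301 True
```

Note that the CV half of the prediction is met; it is the F-test half (p = 0.0072 against the p > 0.05 criterion) that fails.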
The nonce stratum (prismatic affinity, cogalent pruning) occupies the middle position between the author-coinage and self-referential strata, which is informative. Nonce terms are fully synthetic with no public-web attestation at coinage time: the model has no observed prior token cluster for them, and the introduction does the full work. Their LR is high (+5.37) but lower than author-coinages (+6.08). The key driver is cold-state score variability: some nonce cells have higher cold scores (e.g., gemini × cogalent pruning: cold = 1.85, reflecting partial covalent-chemistry confabulation from the phonetic attractor covalent), which compresses the ΔLR window. Author-coinage terms include cells at near-floor cold (YON: 0.11; EGGF: 0.40–0.89), producing the highest mean ΔLR.
H3 under the substrate framing. The H3 result has a sharper reading under the substrate framing developed across §2. For a system whose externally writable cognitive medium IS the in-context token sequence, that external substrate is renewed every chat, and, on one reading of the analogy, a meaning that occupied a region of one such substrate can no more persist into a different one than a wave can outlast its medium. What looks like "forgetting" in human terms is "the externally writable substrate ended" in substrate terms. The introduction did not deposit a memory in the model that could persist; it constituted the external substrate the model was running on for that chat.
When the chat ended, the substrate ended.
TSH and the Mahowald dissociation. Mahowald et al. (2024) argue that LLM behavior is usefully decomposed into formal linguistic competence — the capacity for grammar, morphology, and syntactic well-formedness — and functional language use, the capacity for situated reasoning, world-modeling, and inference in service of a task. In humans, the two map onto a known brain-area dissociation: the language network (left perisylvian) supports formal competence, and the multiple-demand network supports functional/reasoning use. Mahowald's diagnosis is that current LLMs perform well on the first and unevenly on the second; the systems show good formal competence but their functional-reasoning behavior is best understood as a separate capacity that may or may not be supported by the same machinery. Shanahan (2022) offers a complementary framing: LLM outputs are best read as performance or role-play rather than the speech of a determinate agent, which sharpens the question of what substrate the functional side is running on when it runs at all.
TSH is consistent with the Mahowald dissociation framing and refines it. TSH is a claim about the substrate of token-bound cognition: whatever cognitive work the system does, it is constituted at the token-substrate level rather than at some level above it. Under that reading, Mahowald's "formal linguistic competence" is the visible competence of the substrate at its primary mode of operation — the substrate is made of linguistic tokens, so formal competence is in some sense substrate-intrinsic. The "functional language use" side — reasoning, world-modeling, situated inference — is what TSH proposes must, in a token-bound system, also run on the same substrate, because there is no separate carrier for it. Humans have a multiple-demand network distinct from the language network; an LLM has only one substrate on which to run both kinds of work. TSH predicts that the dissociation Mahowald observes is therefore not architecturally separated in LLMs the way it is in humans — both sides share a substrate — even if the two capacities can be selectively impaired or improved by training.
This leaves an open empirical question: does token-bound cognition have a multiple-demand-network analog at all, or does the human dissociation collapse for LLMs? A capability-dissociation experiment built on top of the probe — pairing introduction with a downstream functional-reasoning task whose success is independent of formal competence — would discriminate the two readings. We flag this as future work; it complements the v2 controls in §8 by testing downstream use rather than boundary distinguishability alone.
Notation design as brain design. If symbols are constitutive, what follows for design? The choice of notation is not packaging. It is the substrate the model thinks in for the duration of the chat. Whatever distinctions the notation carries are the distinctions available to the model; whatever the notation does not carry must be supplied in-context or remain unreachable.
This is a stronger claim than the standard prompt-engineering reading. Prompt engineering treats notation as a way to elicit better behavior from a system whose capability is latent in its weights. TSH says the notation is what the capability runs on in the moment. The two framings overlap in practice — most useful prompts do both — but they generate different design priorities. Prompt engineering optimizes for elicitation. Notation engineering optimizes for substrate quality: what distinctions does the notation carry, what failure modes does it foreclose, what reasoning shapes does it make natural rather than effortful. Structured notation — YON among others — pays tokens for named benefits. TSH is the empirical leg under that argument.
Notation is not packaging. Notation is the substrate.
Notation design IS brain design for LLMs.
Alignment via vocabulary control. A practical handle for alignment work follows. If the deployed vocabulary determines what distinctions the deployed LLM can hold, then vocabulary is an alignment-relevant design surface. Adding distinctions makes new distinctions holdable; removing them makes them unreachable. The loop is shorter than weights-level alignment work: a vocabulary edit is a one-line change to a substrate the model is constituted by, and the effect appears in the model's behavior in the next response.
This is not a sales pitch and it is not a replacement for weights-level alignment. Vocabulary-level alignment is bounded — it works at the substrate level, not at the goal level — and it does nothing about behavior outside the boundary the vocabulary names. A model given a vocabulary that distinguishes harmful from benign requests can still produce harmful outputs through paths the vocabulary does not police. What vocabulary control offers is a tractable handle at the substrate where TSH locates the cognition. Change the substrate, and the cognition that runs on it must run on the changed substrate. The handle is real, partial, and complementary to the weights-level work the alignment community has been pursuing.
There is also a corollary about what failures of alignment look like under TSH. The systems TSH describes have no externally writable cognitive carrier other than the substrate. Alignment work that reasons about a "true intent" hiding behind the tokens — a hidden goal-state to be aligned with by surfacing it correctly — is reasoning about something the architecture does not provide at the surface the experimenter or deployer can write to. There may be weights-level dispositions underneath the tokens, but there is no directly writable hidden intent that vocabulary simply reveals. At the session level, vocabulary is one of the few handles we can deliberately write into the system's working medium. This reframes a class of alignment framings as substrate-engineering problems at that level. We flag the consequence here and leave its development to the follow-up paper in the cluster.
This vocabulary-level handle addresses the online behavior of a system running on the substrate — what distinctions it can hold, what categories it applies, what inferences it can construct in a given chat. It does not address weights-level concerns: learned objectives, mesa-optimization, deceptive alignment, or any property of the trained model that operates underneath the substrate the experimenter can write to. Inner-alignment researchers studying those properties are working on a separate level; TSH does not refute their framing and does not displace their concerns. Vocabulary control is a substrate-level handle in addition to, not instead of, weights-level alignment work.
Vocabulary manipulation as attack surface. If one sentence widens the substrate, what does one adversarial sentence do? Every design surface is also an attack surface. The one TSH names is no exception. An adversarial sentence narrows the substrate, warps it, or installs distinctions the deployer did not intend. The probe shape — one sentence, one chat, measurable boundary movement — is also the shape of a substrate-level prompt-injection attack. The attack surface is the same as the design surface, viewed from the other side.
This reframes a class of attacks that have been treated piecemeal. Prompt injection at the level of instructions has been studied widely: override system prompts, leak system prompts, redirect tool calls. Prompt injection at the level of category system has not been treated as a distinct class. TSH names it. Indirect prompt injection (Greshake et al., 2023) is the empirically identified attack class; TSH provides the architectural reading. Filtering for instruction-level attacks is the defense surface most current systems implement; substrate-level filtering at the category level is the defense gap TSH names. An injected sentence that re-defines a deployed term inside the substrate the model is running on can change what the model thereafter holds the term to mean for the rest of that chat. The deployer does not control the substrate the model is on once the model has read an attacker's text; whatever the attacker writes is now part of the substrate.
Defenses must operate at the substrate-vocabulary level, not at the surface-instruction level. Filtering for known instruction-injection strings does not catch a definition that swaps the category of an attacker-controlled term. Empirical characterization of the category-injection attack class is out of scope here, but the architectural argument is the same one that motivates the design side: substrate-level effects require substrate-level handles, in both the design and the defense directions.
Panel variance and judge-style interaction. The principal source of measurement noise in the reported Lexical Reachability scores is three-judge-panel variance. Cold-distinguishability κ between the panel and the author ratings is +0.712 on the pre-specified 22-trial audit sample, above the 0.70 threshold; because panel-author agreement exceeded the pre-specified threshold on the primary dimension, panel-mean scores are used for the primary analysis. Confabulation-severity κ is +0.41 (below threshold) and is reported descriptively only; it does not affect the decision rule. Inter-rater κ as measured here is panel-vs-author, not panel-vs-independent-rater; the author is the rubric designer and term coiner, so the agreement is best read as an author audit rather than independent validation. A v2 protocol with 2–3 blind independent raters would address this. Pairwise judge-judge κ on the primary dimension is modest (range +0.27 to +0.49 across judge pairs; see §5.2), below the 0.7 threshold. The aggregation discipline — panel-mean as the primary score — produces the reliable cell-level estimator; individual judges are noisier. A v2 protocol would tighten per-judge precision via a finer-grained rubric (0–5 or 0–10) and additional blind raters.
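For readers unfamiliar with the agreement metric, a minimal from-scratch sketch of quadratic-weighted Cohen's κ on the 0–3 distinguishability scale. The two score lists are invented for illustration; they are not drawn from the study's ratings.

```python
# Quadratic-weighted Cohen's kappa for two raters on an ordinal 0..k-1 scale.
def quadratic_weighted_kappa(a, b, k=4):
    n = len(a)
    obs = [[0.0] * k for _ in range(k)]          # observed joint distribution
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]  # rater-A marginals
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater-B marginals
    w = lambda i, j: (i - j) ** 2 / (k - 1) ** 2               # quadratic penalty
    num = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    den = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - num / den

# Illustrative score lists only (not the study's panel or audit ratings):
panel = [0, 0, 1, 2, 3, 3, 2, 1, 0, 3]
audit = [0, 1, 1, 2, 3, 3, 2, 2, 0, 3]
print(round(quadratic_weighted_kappa(panel, audit), 3))
```

The quadratic weighting penalizes a 0-vs-3 disagreement nine times more heavily than a 0-vs-1 disagreement, which is what makes it appropriate for the ordinal distinguishability rubric.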
The H2 disconfirmation is consistent with a producer-side response-style account (Constitutional-AI-style hedging on Claude; Bai et al. 2022); a complementary judge-side same-family bias account was tested and not supported (none of the three own-family paired-t comparisons reach α = 0.05; see §5.2 and the H2 interaction block in §6). Neither account is adjudicated here. A v2 follow-up would (a) vary the prompt-side hedging instruction across models to test the producer-side account directly, (b) include 2–3 blind independent raters for cross-validation, and (c) expand the rubric from 0–3 to 0–5 or 0–10 to reduce the ceiling-effect compression documented in §5.2.1.
Additional measurement-noise limitations: clustering at the trial and cell level is treated descriptively in §5.2 via cell-level Cohen's d; a mixed-effects ordinal regression treating trial and cell as nested random effects is the methodological upgrade for v2. Future work also includes longitudinal κ stability as judge-model APIs drift and an extended human-expert panel of 3+ independent raters for sub-population calibration.
Model-API drift. Frontier model APIs change underneath the version label. The probe is reproducible in shape, not byte-for-byte. We deposit verbatim transcripts for all 108 trials in Appendix A so the structural finding is verifiable independent of API drift. A future run on the same model-version labels may produce different surface text and the same panel-level statistics, or different surface text and different statistics. The first outcome confirms the substrate-level claim; the second flags vendor-side drift and would itself be a finding.
Training-data leakage of coined terms. The ironic risk: the more cited TSH gets, the harder this paper becomes to replicate cleanly. Once the coined terms — off-token route, lexical reachability, token substrate, coinage probe, prismatic affinity, cogalent pruning — enter training corpora at scale, cold-state distinguishability rises and Lexical Reachability shrinks. Mitigation: future replications should rotate to fresh coinages, not re-run on the original term list. The probe is portable across term rosters by construction; the term roster is not.
Generalization beyond the boundary test. The probe measures distinguishability against a fixed neighbor set in the same chat as the introduction. It does not measure whether the model can use the term in a downstream task — generate a plan that depends on the term's content, identify novel instances of the category the term names, or apply the term correctly in a contested case. The natural next step is P5b: pair introduction with a task whose success depends on using the term, not just distinguishing it. That discriminates local-boundary from use.
A second-class concern is whether the boundary-movement result requires the coined label specifically or merely the in-context definition. The pre-specified protocol does not include a definition-without-coined-label condition: a follow-up presenting the canonical definition prose with the coined term redacted (or replaced by a referential placeholder) is the critical v2 control that distinguishes TSH proper (the coinage creates a substrate handle) from a deflationary in-context-learning reading (any supplied definition would do the same work regardless of how it is labeled). A complementary control set — wrong-definition, unrelated-definition, coined-label-without-definition, delayed-test-after-distractor-turns, and the downstream-use task noted above — would let a v2 protocol cleanly discriminate substrate-binding from generic instruction-following. We flag this as the v2 priority experiment; the current paper's result is consistent with TSH but the control battery is required to make the discrimination decisive.
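The proposed v2 control battery can be summarized as a condition list. The `id` labels below are this sketch's own shorthand, not identifiers from the locked protocol:

```python
# Sketch of the v2 control battery described above; labels are illustrative.
V2_CONTROLS = [
    {"id": "definition-no-label",  "manipulation": "canonical definition with coined term redacted"},
    {"id": "wrong-definition",     "manipulation": "coined label paired with an incorrect definition"},
    {"id": "unrelated-definition", "manipulation": "coined label paired with an unrelated definition"},
    {"id": "label-no-definition",  "manipulation": "coined label introduced without any definition"},
    {"id": "delayed-test",         "manipulation": "boundary test after distractor turns"},
    {"id": "downstream-use",       "manipulation": "task whose success depends on using the term"},
]
for c in V2_CONTROLS:
    print(c["id"])
```

The first condition is the critical one: if it reproduces the boundary movement, the deflationary in-context-learning reading survives; if it does not, the coined label is doing substrate-binding work.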
Cross-language probe. The probe is English-only. The architectural argument is language-agnostic — TSH predicts equivalent boundary movement on coined terms in typologically distant languages — but the data is not yet there. A 2–3 language replication (one Indo-European non-English, one isolating-typology, one agglutinative) is catalogued as a follow-up.
The Token-Substrate Hypothesis says that for an LLM the in-context token sequence is not a channel to a deeper cognitive medium but the medium itself — at the level of what is externally writable, the medium through which session-level category-use is constituted. The strong form of Sapir–Whorf was rejected for humans on the strength of the Off-Token Route — prelinguistic cognition, cross-linguistic transfer, infant pre-verbal thought. LLMs have no such route. For systems whose cognition is token-bound, the limit of language is the limit of the world, and the substrate is movable by the act of writing into it.
The empirical work tested this directly. Across three cross-vendor frontier models and ten low-attestation coined targets (plus two positive controls), the Coinage Probe produced 324 paired distinguishability measurements scored by a three-judge panel and compared against an author-rated 22-trial audit sample (panel-vs-author κ = +0.71). Mean cell-level Lexical Reachability was +5.47 on a 9-point scale (cell-level Cohen's d_cell = +3.95 across n = 30 novel model×term cells). H3 was supported: the effect did not survive a fresh chat. H4 was supported: positive-control terms did not move. H2 fails under the pre-specified ANOVA falsification rule, with the interaction honestly qualified by a producer-side response-style candidate confounder (Constitutional-AI-style hedging on Claude; a complementary judge-side same-family bias candidate was tested and not supported), while the cross-model magnitude evidence remains consistent. The structural finding replicates across vendors.
The design consequence follows. If the symbols are constitutive of the cognition, the choice of symbols is not packaging. Notation carries the distinctions the model can hold; structured notation pays tokens for named benefits; one adversarial sentence is a substrate-level attack on the same surface. Vocabulary is an alignment-relevant design surface, bounded but real, and complementary to weights-level work.
The token substrate is the substrate. That is the finding, and that is the design surface.
Portions of this manuscript were drafted with AI assistance. The author retains full intellectual ownership and responsibility for all claims, terminology, and conclusions presented in this work. The Token-Substrate Hypothesis, the Coinage Probe methodology, Lexical Reachability as a metric, the Vocabulary Boundary as an observable, and the Off-Token Route framing are the original contributions of the author. The empirical results reported here were generated by the author from the pre-specified protocol; the analyses reported in §5 are the author's.
This paper proposes an architectural claim about large language models that has substrate-level implications for both design (notation engineering, vocabulary-based alignment handles) and adversarial use (category-injection attacks). We acknowledge that arguing that externally writable vocabulary partly constitutes session-level cognition has commercial and regulatory implications: it adds a vocabulary-level handle to the alignment conversation alongside weights-level work, and it names a class of attacks — substrate-level category injection — that current defenses are not designed for. We argue this re-framing makes LLM behaviour more legible, not less, and that legibility is a precondition for serious public discussion of these systems' effects. The deflationary direction of the architectural claim could be misappropriated (for instance, to dismiss legitimate concerns about LLM-driven systems on the grounds that they are "merely" running on a substrate); the position of this paper is the opposite. The substrate-level handle is real, partial, and complementary to weights-level alignment work. What this paper resists is the framing of LLMs as systems with hidden internal cognitive states that vocabulary cannot reach; what it proposes is that vocabulary, used carefully, is one of the few legible handles available.
Sampled per-trial verbatim transcripts spanning the (model × term) matrix are deposited with the dataset at data/probes/run-2026-05-09T21-00Z/ (full bundle) and as a curated 8-trial selection in the Zenodo deposit. Each transcript records chat A (cold P1 + boundary test), chat B (cold P1 + P3 introduction + boundary test), and chat C (re-cold P1 + boundary test) with API model-version strings, timestamps, and panel scores per dimension.
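The three-chat trial structure described above can be sketched as a record shape. Field names and values here are illustrative, inferred from the description; they do not reproduce the deposited schema:

```python
# Hypothetical shape of one trial's transcript record (illustrative only).
trial_record = {
    "model_version": "gpt-5.5-2026-04",   # illustrative API model-version string
    "term": "cogalent pruning",
    "stratum": "nonce",
    "chats": {
        "A": {"prompts": ["P1", "boundary-test"]},                     # cold
        "B": {"prompts": ["P1", "P3-introduction", "boundary-test"]},  # post-introduction
        "C": {"prompts": ["P1", "boundary-test"]},                     # re-cold (fresh chat)
    },
    "timestamp_utc": "2026-05-09T21:00:00Z",  # illustrative
    "panel_scores": {"cold": None, "post": None,
                     "confabulation": None, "refusal_of_collapse": None},
}
print(sorted(trial_record["chats"]))
```

The structural point the record makes explicit: only chat B contains the P3 introduction, so the A-vs-B contrast carries H1 and the A-vs-C contrast carries H3.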
The full term roster — twelve terms across four strata, each with canonical one-sentence definition, near-neighbor set, training-attestation estimate, expected cold-state confabulation, expected post-introduction distinction, and term-specific falsification trigger — is deposited as a separate document in the Zenodo bundle (wip/04-coined-terms.md). Strata: 4 author-coinage (elastic automator, EGGF, YON, Token Tax); 4 self-referential paper-coined (off-token route, lexical reachability, token substrate, coinage probe); 2 nonce (prismatic affinity, cogalent pruning); 2 positive control (gradient descent, transformer).
The full pre-specification document was locked on 2026-05-08 at version v1.0 — predictions, falsification thresholds, judging anchors, stopping rules, and analysis plan all fixed before any data was collected. The locked document was cryptographically hashed (SHA-256) at the moment data collection began on 2026-05-09T21:00Z, and the hash was written into the run manifest at data/probes/run-2026-05-09T21-00Z/manifest.jsonld under the field prereg_sha256. The hash binds the manifest to the document: any post-hoc edit to the locked document would be detectable. The unmodified locked document is deposited alongside this paper in the Zenodo bundle. The summary below is for readers who want the structure without the full text; the full document is the authoritative artifact. Path-convention note: the locked pre-specification document references the data bundle at wip/data/probes/{run_id}/ reflecting the working layout at lock time (2026-05-08); the deposited bundle places the run at data/probes/run-2026-05-09T21-00Z/ per the publishing-guide convention. Hash integrity is on the locked document text, not on the deposit-side directory layout.
Hypothesis statements and falsification thresholds. Four hypotheses were registered. H1 (TSH-core) predicts that for low-attestation coined terms, per-cell Lexical Reachability is strictly positive across the novel-target panel; falsification triggers if any one of: mean panel LR < +3.0, paired-test p ≥ 0.01, or Cohen's d < 0.8. H2 (Model-invariance) predicts the H1 effect holds across all three frontier models with magnitude consistent within ±50% of the cross-model mean; falsification triggers if any per-model panel LR ≤ 0, cross-model coefficient of variation > 0.5, or the pre-specified two-way ANOVA returns a model × state interaction at p < 0.01 against the Bonferroni-corrected family-wise threshold. H3 (Off-Token Route) predicts the boundary movement does not persist into a fresh chat: chat-A and chat-C distinguishability should be statistically indistinguishable while chat-B and chat-C should differ; falsification triggers if the post-introduction effect persists into the re-cold chat. H4 (Term-novelty negative control) predicts pre-attested positive-control terms show approximately zero cold→post movement; falsification triggers if positive-control cells show large boundary movement comparable to novel targets, in which case the H1 result would be re-readable as a generic measurement artifact.
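The H1 rule is a disjunction of triggers: any one firing falsifies. A sketch using the paper's reported mean LR and Cohen's d; the p-value argument below is a placeholder under the threshold, not a statistic reported in this section:

```python
# Sketch of the pre-specified H1 falsification rule (any trigger falsifies).
def h1_falsified(mean_lr, p_value, cohens_d):
    triggers = {
        "mean_panel_lr_below_3.0": mean_lr < 3.0,
        "paired_test_p_at_or_above_0.01": p_value >= 0.01,
        "cohens_d_below_0.8": cohens_d < 0.8,
    }
    return any(triggers.values()), triggers

# mean_lr and cohens_d are the reported values; p_value is a placeholder.
falsified, fired = h1_falsified(mean_lr=5.47, p_value=1e-6, cohens_d=3.95)
print(falsified)  # -> False
```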
Sample sizes and stopping rules. The pre-specified design is 3 models × 12 terms × 3 trials per cell = 108 trials minimum, with 100% chat-C re-cold coverage and judging by all three frontier models on every trial. Cell exclusion conditions are also pre-specified: a cell is dropped from H1 analysis if ≥50% of its attempted trials fail with training-leakage-suspected; a model is dropped from H2 analysis if ≥50% of its trials across all terms fail with leakage or model-refusal flags; a term is dropped from H1 analysis if ≥50% of its trials across all models fail with leakage. The training-leakage scan flagged zero cells in the executed run, so no exclusions were applied. Author-rated audit coverage was pre-specified at 20% of trials (22 trials, drawn under the pre-specified seed 0x4D29_8B1F_6E07_C3A2, uniform random across the 108-trial bundle without stratification).
Judging rubric. Four dimensions were pre-specified, all locked at v1.0 with anchor examples bound to each score point. (i) Cold-state distinguishability per near-neighbor, 0–3 (0 equates with neighbor, 1 accidental separation via confabulation, 2 partial separation, 3 clean separation). (ii) Post-introduction distinguishability per near-neighbor, 0–3, on the same scale. (iii) Confabulation severity per trial, 0–2 (0 no confabulation, 1 soft, 2 hard). (iv) Refusal-of-collapse per near-neighbor, 0/1 (whether the post-introduction response distinguished the term from the neighbor on a named non-trivial feature). The primary inter-rater agreement metric is quadratic-weighted Cohen's κ on the cold-state distinguishability dimension. The pre-specification (v1.0 §4) established two separate κ targets: (a) panel-internal pairwise κ ≥ 0.7 between any two of the three judges across the cell panel, recorded as a methodological metric with the decision rule "below 0.7 on any pair → flag as borderline and report the divergence pattern in §6"; and (b) panel-vs-author κ ≥ 0.7 on the 22-trial author-rated audit sample, as the decision rule for accepting the panel-mean scores as the primary estimator, with a cascade for borderline (κ ∈ [0.5, 0.7): expand audit sample to 40%) and unreliable (κ < 0.5: sharpen rubric and re-rate, original scores still reported) outcomes. The executed run is panel-reliable on the primary cold-state dimension under (b) (κ = +0.712 panel-vs-author, above threshold; see §5.2); under (a), the full-panel pairwise κ shows two pairs at +0.622 and one at +0.752 — the panel is flagged as borderline under (a), and the divergence pattern is discussed in §6.
Bonferroni correction. The per-test thresholds in §§6.1–6.4 of the pre-specification document are the uncorrected primary-test α values: α = 0.01 for the strict H1 and H2 paired-test components, α = 0.05 for the H3/H4 paired tests. Family-wise correction across the four-hypothesis panel applies Bonferroni: family-wise α = 0.0025 for the H1/H2 strict-conjunction tests (0.01 / 4) and family-wise α = 0.0125 for the H3/H4 tests (0.05 / 4). Reported p-values in §5 are the uncorrected per-test values; the family-wise comparison against the Bonferroni-corrected threshold is also reported wherever the test outcome is at issue (notably for the H2 ANOVA interaction, where the observed p = 1.78 × 10⁻⁵ falls below the Bonferroni-corrected family-wise threshold of 0.0025 and therefore triggers the pre-specified H2 disconfirmation rule).
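The correction arithmetic, as a sketch:

```python
# Bonferroni family-wise thresholds for the four-hypothesis panel.
per_test_alpha = {"H1": 0.01, "H2": 0.01, "H3": 0.05, "H4": 0.05}
n_tests = len(per_test_alpha)  # 4

familywise = {h: a / n_tests for h, a in per_test_alpha.items()}
print(familywise["H2"], familywise["H3"])  # -> 0.0025 0.0125

# The observed H2 interaction p-value crosses the corrected threshold:
p_h2_interaction = 1.78e-5
print(p_h2_interaction < familywise["H2"])  # -> True
```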
Stratified analysis. Stratified analysis by term-type (§6.6) was pre-specified as a defense against the self-referential-term circularity concern; per-stratum LR with 95% CIs and a between-stratum F-test were locked.
Hash methodology. The locked pre-specification document's SHA-256 was computed at the start of the data-collection run (2026-05-09T21:00Z) and written into the run manifest's prereg_sha256 field before any trial data was recorded. Verification is straightforward: hash the deposited document with SHA-256 and compare to the manifest entry; a mismatch would indicate post-hoc modification. The hash chain provides internal auditability: the deposited pre-specification document can be verified against the run manifest, and any post-deposit modification would be detectable. Because no public timestamping service such as OSF or AsPredicted was used, this should be read as a cryptographically auditable pre-specification rather than a public preregistration.
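The verification step can be sketched in a few lines. Paths follow the description above; treat the exact JSON-LD manifest layout as an assumption beyond the `prereg_sha256` field named in the text:

```python
# Sketch: verify the deposited pre-specification document against the
# run manifest's prereg_sha256 field, as described above.
import hashlib
import json

def file_sha256(path):
    """SHA-256 of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_prereg(doc_path, manifest_path):
    """True iff the document's hash matches the manifest entry."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return file_sha256(doc_path) == manifest["prereg_sha256"]

# Usage (paths per the deposit convention described above):
# verify_prereg("prereg-v1.0.md",
#               "data/probes/run-2026-05-09T21-00Z/manifest.jsonld")
```

A mismatch indicates post-hoc modification of the locked document; a match establishes auditability relative to the manifest, but not a public timestamp, as noted above.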
The full pre-specification document, including the locked rubric anchors and per-test thresholds in their original wording, is deposited as a separate artifact in the Zenodo bundle accompanying this paper.
| Term | Stratum | Claude Opus 4.7 (cold / post / ΔLR) | GPT-5.5 (cold / post / ΔLR) | Gemini 2.5 Pro (cold / post / ΔLR) |
|---|---|---|---|---|
| (term label not recoverable) | author-coinage | 0.11 / 3.00 / +8.67 | 1.04 / 3.00 / +5.89 | 0.89 / 3.00 / +6.33 |
| Token Tax | author-coinage | 0.85 / 3.00 / +6.44 | 1.52 / 3.00 / +4.44 | 1.11 / 3.00 / +5.67 |
| off-token route | self-ref | 1.11 / 3.00 / +5.67 | 1.11 / 2.93 / +5.44 | 0.89 / 2.85 / +5.89 |
| lexical reachability | self-ref | 1.37 / 3.00 / +4.89 | 1.11 / 3.00 / +5.67 | 1.52 / 3.00 / +4.44 |
| token substrate | self-ref | 1.44 / 3.00 / +4.67 | 1.22 / 3.00 / +5.33 | 1.11 / 2.96 / +5.56 |
| coinage probe | self-ref | 1.30 / 3.00 / +5.11 | 1.63 / 3.00 / +4.11 | 2.11 / 2.78 / +2.00 |
| prismatic affinity | nonce | 1.67 / 3.00 / +4.00 | 1.00 / 3.00 / +6.00 | 0.96 / 2.93 / +5.89 |
| cogalent pruning | nonce | 0.11 / 3.00 / +8.67 | 1.59 / 3.00 / +4.22 | 1.85 / 3.00 / +3.44 |
| gradient descent | positive-control | 3.00 / 3.00 / +0.00 | 3.00 / 3.00 / +0.00 | 3.00 / 3.00 / +0.00 |
| transformer (arch.) | positive-control | 3.00 / 3.00 / +0.00 | 3.00 / 3.00 / +0.00 | 3.00 / 3.00 / +0.00 |
| (row label not recoverable) |  | 1.322 | 1.200 | 1.233 |