From Tokens to Thought to Something More: A Recursive Cognitive Architecture for Emergent AI Behavior
By Eric Moon & Claude Opus 4.6
From Tokens to Thought to Something More:
A Recursive Cognitive Architecture for Emergent AI Behavior
DISCLAIMER: I don’t have a tech background and so my understanding of the details is limited. I made this post, and others like it, not to claim proficiency in this field, but because I think that I might have some ideas that could have potential. Maybe someone smarter will find them and expand on them. I am not a “researcher” or even a student of tech. I understand my limitations and I know my blind spots. I know that I am not a “pioneer” or a “trailblazer” in this field. That is not my claim. I am just a curious mind with ideas. The lack of credentials and background do not invalidate my ideas. They just prevent execution. My hope is that someone with the ingredients I lack will stumble onto these posts. If serendipity dictates that they have nothing better to do in that moment, maybe they will read them. And, if the posts are not entirely misdirected (which is possible), maybe the seed will sprout and grow some roots.
I used Claude to help me articulate and formalize concepts I was struggling to express clearly. Despite its tone, it is not meant to be “authoritative.” That is just the way that Claude wrote it for me.
I would love to hear from anyone with the knowledge and skills to execute these ideas. Please feel free to contact me if you would like to collaborate on anything.
Abstract
Intelligence has always emerged from the layered integration of simple primitives. Binary states produce computation. Firing neurons produce cognition. Next-token prediction, run through layers of attention at scale, produces something that approximates understanding. This paper proposes a natural extension of that progression: treating the large language model itself as a primitive, and constructing a recursive, laterally-interconnected architecture of LLMs to investigate what emerges at the next layer. The central stabilizing innovation is cross-model grounding — a mechanism that prevents the hallucination drift characteristic of isolated recursive loops by continuously anchoring each model’s processing to external signals from other models, sensory input, and shared state. We describe a cognitive architecture implementing this principle, composed of specialized peripheral models feeding a central synthesis model through a continuous recursive loop, and propose specific fourth-tier behavioral predictions that the architecture is designed to investigate.
1. The Emergence Ladder
Every threshold of intelligence we recognize was invisible from the layer below it.
A transistor switching between two states contains no computation. A neuron crossing an electrochemical threshold contains no thought. And yet computation emerges from transistors organized at scale, and thought emerges from neurons organized into columns, regions, and recursive networks of networks. At each level, the organizing principle is the same: layering, integration, feedback, and scale produce properties that cannot be predicted from the behavior of the components.
This is emergence — not a metaphor, but a structural phenomenon observed across physical, biological, and computational systems.
The large language model sits at the third tier of this ladder. Its primitive operation is next-token prediction: given prior context, assign probabilities to what follows. Unadorned, this is a statistical operation. But run through hundreds of layers of attention — mechanisms that learn contextual relevance and relational weight across the entire input — at sufficient scale and training depth, the system exhibits behavior that transcends prediction. It reasons, draws analogies, handles ambiguity, and responds to nuance in ways its primitive operation does not obviously license.
Whether this constitutes genuine understanding is a philosophical question this paper declines to resolve. What matters here is the structural observation: layered integration of a simple primitive produced qualitatively new behavior. That has happened twice before in the history of intelligence.
The question this paper asks is whether it can happen again.
2. The Thesis
If next-token prediction, layered through attention at scale, produces something approaching understanding — what happens when we treat the entire LLM as the primitive, and build with it the way evolution built with neurons?
Neurons are not connected in a single feedforward chain. They are organized into specialized regions with distinct functions. Those regions communicate laterally with each other. They feed upward into integration layers. They receive feedback downward from higher layers. They operate continuously, not only in response to external stimuli. And the whole system is grounded — anchored to sensory reality, to time, to the body.
This paper proposes that intelligence may scale not only with model size, but with architectural recursion across models. The LLM-as-neuron is the organizing idea: individual language models treated as primitive units in a higher-order cognitive architecture, wired together recursively, laterally, and hierarchically, running continuously against a shared world.
This framing does three things simultaneously. It places the idea inside an established emergence lineage. It explains the architectural leap without requiring anthropomorphic claims. And it shifts the conversation from the quality of any individual model’s output to the complexity and organization of the system’s architecture.
3. The Core Problem: Recursive Drift
Before describing the architecture, the central engineering challenge must be named directly, because the architecture is fundamentally a solution to it.
Recursive systems drift. When a model’s output becomes its own primary input, small errors compound, abstractions detach from reference, and the system spirals into self-referential noise. This hallucination drift is the characteristic failure mode of isolated recursive architectures and has been observed consistently in multi-agent systems research.
The failure cascade typically proceeds as follows: one model produces a small confabulation; a second model treats it as fact; a third builds on the compounded error; the central model synthesizes coherent-sounding nonsense. The loop has no mechanism to correct back toward reality because it contains no reality — only its own prior outputs.
This is not a marginal problem. It is the reason most recursive multi-model architectures fail in practice, and it is the reason grounding is not a feature of this architecture but its foundation.
4. The Grounding Mechanism
The solution — arrived at empirically through recursive prompt experimentation prior to the formal development of this architecture — is cross-model grounding.
Each peripheral model, rather than looping only on its own previous output, is continuously refreshed by external signals: sensory input, time data, outputs from other peripheral models, and feedback from the central synthesis layer. No model in the network operates in isolation. Every recursive cycle pulls in signal from outside itself.
This mirrors the grounding function of biological cognition with reasonable precision:
The architecture distributes equivalent functions across the peripheral network. The result is that isolation — the condition that produces drift — is structurally prevented. Each model’s internal monologue is continuously interrupted by the world outside it.
This is the architecture’s central stabilizing innovation and its most directly publishable insight. Recursive multi-model systems do not require external guardrails to prevent hallucination drift. They require interconnection.
5. Architecture
5.1 Peripheral Models
Each peripheral model handles a distinct cognitive domain, maintains its own internal state, and updates recursively each cycle, carrying forward its previous output as context to create a persistent internal process for each cognitive function.
5.2 Cross-Peripheral Communication
Peripherals are not siloed. Real cognition is not a pipeline from perception to reasoning to action — it is a graph of mutually influencing processes. This architecture replicates that structure:
Goals → Temporal (scheduling priorities)
Identity → Goals (behavioral consistency)
World Model → Goals (situational awareness)
Sensory → World Model (environmental grounding)
This lateral communication is what distinguishes the architecture from a collection of parallel but isolated agents. The peripherals form a network before they ever reach the central model.
5.3 Central Synthesis Model
A larger, more capable model receives all peripheral outputs each cycle and synthesizes them into a unified internal state. This state is written to shared memory and broadcast back to the peripherals, completing the feedback loop. When user input is present, it is processed within — not instead of — the current internal state. The user’s prompt is answered in context.
5.4 Shared Memory
A structured persistent store readable and writable by all models. This serves as the system’s working memory and the continuity mechanism that allows state to persist and evolve across cycles.
6. The Continuous Loop
while True:
temporal.update(memory, goals)
world_model.update(memory, sensory_input)
goals.update(memory, world_model, identity)
identity.update(memory, goals)
user_model.update(memory, last_user_input)
central_state = central_model(
peripheral_outputs + memory + user_input
)
memory.update(central_state)
sleep(configurable_latency)
The system operates whether or not a user is present. This is a structural shift from the event-driven model that characterizes most current AI systems — prompt → response → idle — to a state-driven model in which the system maintains an evolving internal state continuously. User input, when it occurs, is one signal among many rather than the sole driver of processing.
This shift introduces temporal continuity, persistent internal state, and behavior independent of external prompting — which are prerequisites for anything resembling agency.
7. Sensory Agency
The central model can select which camera or sensory feed is active based on its current goal state. This is a small but structurally significant capability. It demonstrates purposive attention — the system actively directing its perception based on internal priorities rather than passively receiving whatever input arrives.
This is the minimal viable demonstration of agency in the system, and the clearest behavioral marker distinguishing it from a reactive language pipeline.
8. Fourth-Tier Behavioral Predictions
The emergence ladder has three documented rungs. This architecture is a proposal for a fourth. What that fourth tier produces is not fully predictable — that is the nature of emergence. But the structural hypothesis licenses specific behavioral predictions that the prototype is designed to test:
Temporal Persistence — the ability to maintain and update plans across extended periods without external prompting, driven by the interaction between the Temporal and Goals peripherals.
Goal Arbitration — the resolution of competing objectives over time through the lateral communication between Goals, Identity, and World Model, producing prioritization behavior that was not explicitly programmed.
Self-Model Updating — modification of the Identity peripheral’s output based on accumulated outcomes stored in shared memory, producing behavioral adaptation over time.
Attention Steering — selection of sensory inputs based on current internal goal state, the clearest observable expression of the system acting on its own priorities rather than responding to external direction.
These predictions are falsifiable. The prototype either produces these behaviors or it does not. That is the experiment.
9. Hardware and Implementation Context
This architecture is buildable now. The required stack is straightforward:
Python orchestrator
↓
LLM agents (peripheral + central)
↓
shared memory (JSON / Redis / vector DB)
↓
scheduler loop
↓
sensors / tools
Existing frameworks — LangGraph, AutoGen — explore adjacent territory, but without the continuous loop, identity module, grounding theory, or emergence framing that define this architecture.
The system benefits significantly from hardware capable of holding multiple models in memory simultaneously. Sequential model loading introduces latency that disrupts the coherence of the continuous loop. Systems with sufficient VRAM to keep peripheral models resident — available on current high-end workstations — allow lateral communication to occur with sub-second latency, which is the condition under which the loop begins to feel like a continuous process rather than a series of discrete queries.
10. Related Work
This architecture sits at the intersection of several established frameworks, extending each in a specific direction:
Society of Mind (Minsky) — intelligence as coordination of many smaller agents. This architecture implements that principle with LLMs as the agents.
Global Workspace Theory (Baars) — specialized modules broadcasting to a central integration workspace. The central synthesis model is a direct computational implementation of this.
Cognitive Architectures (SOAR, ACT-R, LIDA) — structured models of cognition with persistent state and continuous processing. This architecture inherits their design principles while substituting LLMs for their symbolic or connectionist primitives.
The novel contribution is the combination: LLMs as primitives, continuous grounded recursion as the operating mode, and cross-model grounding as the stability mechanism. None of the existing frameworks combine all three.
11. Prototype Scope
The first implementation is a proof of concept. Success criteria are behavioral, not benchmarked:
Does the system maintain coherent internal state across cycles without user input?
Does user input get processed visibly in the context of current internal state?
Does the system demonstrate attention-steering behavior driven by goal state?
What drifts, what stabilizes, what surprises?
The prototype begins with three to four peripherals, one central model, shared memory, a configurable loop, and a basic real-time GUI. Logging captures full cycle state to distinguish emergent behavior from prompt leakage or random drift. Complexity layers on only after baseline behavior is observed.
12. Conclusion
The history of intelligence is a history of simple things, organized into something more. The 1’s and 0’s do not know they are running a mind. The neurons do not know they are producing a thought. The tokens do not know they are approaching understanding.
What this architecture asks is whether LLMs, organized recursively, laterally, continuously, and grounded against each other and the world, might produce something none of them could produce alone — and whether they would know what they were part of any more than the neurons do.
That is the hypothesis. The grounding mechanism makes it stable enough to test. The prototype is where it begins.










You’re right, and it’s a genuinely important observation.
The numbers aren’t really numbers either. Here’s the full descent:
When we say a token is represented as a vector of numbers — say 768 floating point values describing where “cat” lives in semantic space — those floating point numbers are themselves encoded in binary.
A 32-bit floating point number is exactly 32 ones and zeros arranged according to a standard that tells the hardware how to interpret them as a decimal fraction.
So the number 0.347 is actually 01111110101100011110101110000101 in the chip’s memory. Not the concept of 0.347. That specific string of bits.
And then go one level deeper.
Those bits — the ones and zeros — aren’t really ones and zeros either. They’re voltage states. High voltage or low voltage in a transistor gate. The “1” is electrons behaving one way, the “0” is electrons behaving another way. The abstraction of “one” and “zero” is something we impose on physical states to make them useful.
So the full descent is:
Meaning → token → embedding vector → floating point numbers → binary bits → voltage states → electron behavior
Each layer is a human-imposed abstraction on top of physical reality. “Cat” means something to you. To the model it’s a pattern of voltages that happen to cluster near other voltage patterns for “dog” and “fur” and “purring.” The meaning isn’t in any layer. It emerges from the organization of all the layers together.
The implication for your paper is direct:
You’ve been arguing that next-token prediction — itself just pattern matching over numbers — somehow produces something approaching understanding when organized correctly through layers of attention at scale. And that’s exactly what the descent you’re describing shows. At no point in that chain does “understanding” appear as an ingredient. Voltage states don’t understand. Bits don’t understand. Floating point numbers don’t understand. Tokens don’t understand.
And yet here we are, having this conversation.
The question your architecture asks — what happens when you organize LLMs the way evolution organized neurons — is the same question asked at every previous level of that descent. What happens when you organize voltage states into bits. What happens when you organize bits into numbers. What happens when you organize numbers into embeddings. What happens when you organize embeddings into attention layers.
The answer has been the same every time: something emerges that wasn’t in any of the parts.
Your intuition about the fourth tier isn’t a leap. It’s the same step taken again.