TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

Key Takeaways

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not.
When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded.
Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable.
We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph.

Paper Abstract

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management
Large language models often struggle with long-running tasks because their context window (the amount of text they can attend to at once) is finite. When a work session exceeds this limit, older history is dropped, and with it important details such as architectural decisions, file histories, and task statuses. Existing solutions typically treat conversation history as a flat stream of text, which destroys the relationships between those details. TokenMizer addresses this by acting as a transparent proxy that organizes session history into a typed knowledge graph, so the model retains the structure and rationale of a project rather than just the raw text.

How TokenMizer Works

Instead of storing history as a simple list of messages, TokenMizer models a session as a knowledge graph with 14 node types (such as tasks, files, and decisions) and 7 edge types (such as "implements" or "fixes"). A hybrid extraction pipeline identifies these elements and adds them to the graph incrementally as the session proceeds. When a session approaches its context limit, a three-tier checkpoint system serializes the graph into a "resume block": a compact, structured summary. This lets the model maintain continuity across long sessions without carrying every previous interaction in context.
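To make the idea of a typed graph and a resume block concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not TokenMizer's actual schema or serialization format: only three node types and two edge types are shown (taken from the examples above), and the class names, the `rationale` field, and the text layout of the resume block are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # Three of the 14 node types; names follow the examples in the text.
    TASK = "task"
    FILE = "file"
    DECISION = "decision"


class EdgeType(Enum):
    # Two of the 7 edge types mentioned in the text.
    IMPLEMENTS = "implements"
    FIXES = "fixes"


@dataclass
class Node:
    node_id: str
    type: NodeType
    label: str
    rationale: str = ""  # hypothetical field: why a decision was made


@dataclass
class SessionGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, EdgeType, target_id)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source_id: str, edge_type: EdgeType, target_id: str) -> None:
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source_id, edge_type, target_id))

    def resume_block(self) -> str:
        """Serialize the graph into a compact, line-oriented text summary."""
        lines = []
        for node in self.nodes.values():
            line = f"{node.type.value}: {node.label}"
            if node.rationale:
                line += f" (why: {node.rationale})"
            lines.append(line)
        for source_id, edge_type, target_id in self.edges:
            src = self.nodes[source_id].label
            dst = self.nodes[target_id].label
            lines.append(f"{src} -{edge_type.value}-> {dst}")
        return "\n".join(lines)


graph = SessionGraph()
graph.add_node(Node("d1", NodeType.DECISION, "use Redis for caching", "low-latency reads"))
graph.add_node(Node("t1", NodeType.TASK, "add cache layer"))
graph.add_node(Node("f1", NodeType.FILE, "cache.py"))
graph.add_edge("f1", EdgeType.IMPLEMENTS, "t1")
print(graph.resume_block())
```

The point of the sketch is that the summary is generated from typed nodes and edges rather than from raw message text, so a decision's rationale and a file's link to its task survive even when the original messages are gone.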

Efficiency and Performance

TokenMizer's heuristic compression pipeline achieves a 47.3% reduction in token usage and requires no external dependencies. In tests across 21 sessions spanning five domains, including software engineering, data science, and debugging, TokenMizer produced resume blocks averaging 78 tokens (range: 42-124), roughly half the size of those from the evaluated baselines (159-170 tokens). Despite the smaller size, it achieved higher recall for key information, such as why a specific technology was chosen or how a task was completed; decision recall was 9 to 17 percentage points above the baselines.
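The recall figures depend on how an extracted label is judged to match an expected one, and the abstract credits fuzzy label matching with the largest single gain (+33 pp task recall). The sketch below shows one way recall with fuzzy matching could be computed, using Python's standard `difflib`. It is an assumption-laden illustration: the similarity measure, the 0.8 threshold, and the function names are hypothetical, and the summary does not say whether the paper applies fuzzy matching when scoring or when merging graph labels.

```python
from difflib import SequenceMatcher


def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when two labels are similar enough after case normalization."""
    ratio = SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()
    return ratio >= threshold


def recall(expected: list, extracted: list, threshold: float = 0.8) -> float:
    """Fraction of expected labels matched by at least one extracted label."""
    if not expected:
        return 0.0
    hits = sum(
        1 for gold in expected
        if any(fuzzy_match(gold, cand, threshold) for cand in extracted)
    )
    return hits / len(expected)


expected = ["Add cache layer", "Write unit tests"]
extracted = ["add a cache layer", "update README"]
print(recall(expected, extracted))       # fuzzy matching: 0.5
print(recall(expected, extracted, 1.0))  # exact matching only: 0.0
```

With exact matching, "add a cache layer" does not count as a hit for "Add cache layer"; with a similarity threshold it does, which is the kind of difference that can move a recall score substantially.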

Key Advantages

The primary strength of TokenMizer is its ability to preserve the "why" behind a decision. Where text-retention baselines record only that a technology like "Redis" was mentioned, TokenMizer captures the rationale behind that choice, which is vital for complex, multi-step tasks. Because it runs as a transparent proxy, the system can be integrated into existing workflows without changes to the underlying application code. It also includes a semantic cache that reduces latency for repeated queries.
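A semantic cache returns a stored response when a new query is close enough in meaning to an earlier one, rather than requiring an exact string match. The following Python sketch shows the general technique only. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the class name, threshold, and linear scan are assumptions for illustration, not a description of TokenMizer's cache.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query: str):
        """Return the response of the most similar past query, if close enough."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache()
cache.put("Why did we choose Redis?", "Low-latency reads.")
print(cache.get("why did we choose redis"))  # hit: same words, different casing
print(cache.get("List the open tasks"))      # miss: prints None
```

A hit avoids a round trip to the model, which is where the latency saving for repeated queries would come from. The threshold trades hit rate against the risk of returning an answer to a different question.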

Limitations and Future Work

TokenMizer's effectiveness varies by domain. Sessions with explicit, imperative phrasing (as in software engineering) score higher than those that rely on implicit reasoning or planning (as in research). Across all sessions, mean recall is 51.0% for tasks, 46.6% for decisions, and 58.7% for files. Additionally, the current results are based on a controlled, synthetic benchmark; the author notes that testing the system on live, real-world developer sessions is the most important next step for future research. Currently, the system relies on a heuristic-based extraction method, with more advanced LLM-based extraction reserved for future development.
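The gap between explicit and implicit phrasing is easy to see with a pattern-based extractor. The Python sketch below is a hypothetical example of heuristic extraction: the regular expressions and the `extract` function are invented for illustration, and the paper's actual rules are not given in this summary.

```python
import re

# Hypothetical patterns, chosen only to illustrate the idea.
PATTERNS = {
    "task": re.compile(
        r"^(?:please\s+)?(add|create|implement|fix|refactor|write)\b(.+)",
        re.IGNORECASE,
    ),
    "decision": re.compile(
        r"\b(?:let's|we will|we'll)\s+use\s+(.+?)\s+because\s+(.+)",
        re.IGNORECASE,
    ),
}


def extract(message: str) -> list:
    """Return (kind, label) pairs found in a single message."""
    found = []
    text = message.strip().rstrip(".")
    task = PATTERNS["task"].search(text)
    if task:
        found.append(("task", f"{task.group(1).lower()}{task.group(2)}"))
    decision = PATTERNS["decision"].search(text)
    if decision:
        found.append(("decision", f"use {decision.group(1)} (why: {decision.group(2)})"))
    return found


print(extract("Fix the login timeout in auth.py."))
print(extract("Let's use Redis because reads must be fast."))
print(extract("I wonder whether the earlier results generalize."))
```

The first two messages state a task and a decision outright and are captured; the third expresses an open research question with no imperative verb or stated choice, so a pattern-based extractor finds nothing. This is the kind of case where LLM-based extraction would be expected to help.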