From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

Key Takeaways

  • Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features.
  • Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined.
  • Each SAE feature is modelled as a token co-occurrence graph, and a custom WL-style, frequency-binned graph kernel provides a similarity measure over these graphs.
  • Cluster assignments are stable across graph-construction hyperparameters and random seeds.
Paper Abstract

Sparse autoencoders (SAEs) have become central to mechanistic interpretability, decomposing transformer activations into monosemantic features. Yet existing analyses characterise features almost exclusively through top-activating token lists or decoder weight vectors, leaving the higher-order co-occurrence structure shared across features largely unexamined. We introduce a graph-structured representation in which each SAE feature is modelled as a token co-occurrence graph: nodes are the tokens most frequent near strong activations, and edges connect pairs that co-occur within local context windows. A custom WL-style, frequency-binned graph kernel then provides a similarity measure over this structural space. Applied as a proof of concept to features from a large SAE trained on GPT-2 Small and probed with a synthetic mixed-domain corpus, our clustering recovers heuristic motif families (punctuation-heavy patterns, language and script clusters, and code-like templates) that are not recovered by clustering on decoder cosine similarity. A token-histogram baseline achieves higher overall purity, so the contribution of the graph view is complementary rather than dominant: it surfaces structural relationships that token-frequency and decoder-weight views alone do not capture. Cluster assignments are stable across graph-construction hyperparameters and random seeds.

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
This paper explores a new way to organize and understand the thousands of "features" discovered by sparse autoencoders (SAEs) in large language models. Researchers typically analyze these features through lists of top-activating tokens or decoder weight vectors, and those views largely leave the higher-order co-occurrence structure shared across features unexamined. The authors propose representing each feature as a "co-occurrence graph," where tokens are nodes and edges connect tokens that appear together within local context windows. By applying a graph-based similarity measure, they aim to group features into structural "motif families" (such as punctuation-heavy patterns, language and script clusters, and code-like templates) that the usual views can overlook.

Modeling Features as Graphs

To move beyond flat token lists, the authors turn each SAE feature into a graph. The nodes are the tokens that occur most often near the feature's strong activations, and an edge joins two tokens when they co-occur within a local context window. The result is a structural map of the contexts in which the feature fires. To compare these maps, the team uses a custom, frequency-binned variant of the Weisfeiler-Lehman (WL) graph kernel. WL-style kernels iteratively refine each node's label from the labels of its neighbors and then compare two graphs by the refined labels they have in common, which gives a measure of how similar the graphs are in overall structure.
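The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `top_k`, `n_bins`, and `iterations` parameters, and in particular the choice to use a frequency bin (rather than the token itself) as each node's initial label are assumptions made here to show what "frequency-binned" could mean. The toy contexts are invented.

```python
from collections import Counter
from itertools import combinations
import math


def build_cooccurrence_graph(contexts, top_k=5):
    """Build a token co-occurrence graph for one feature.

    contexts: list of token lists, each a local window around a strong
    activation (hypothetical input format).
    Returns (token frequencies, adjacency dict over the top_k tokens).
    """
    freq = Counter(tok for ctx in contexts for tok in ctx)
    nodes = [tok for tok, _ in freq.most_common(top_k)]
    node_set = set(nodes)
    adj = {tok: set() for tok in nodes}
    for ctx in contexts:
        present = sorted(set(ctx) & node_set)
        # Connect every pair of node tokens that share this window.
        for a, b in combinations(present, 2):
            adj[a].add(b)
            adj[b].add(a)
    return freq, adj


def wl_features(freq, adj, iterations=2, n_bins=3):
    """Count WL labels, starting from frequency bins (an assumption)."""
    if not adj:
        return Counter()
    max_count = max(freq[t] for t in adj)
    # Initial label: which relative-frequency bin the token falls in.
    labels = {
        t: "b%d" % min(n_bins - 1, int(n_bins * freq[t] / max_count))
        for t in adj
    }
    feats = Counter(labels.values())
    for _ in range(iterations):
        # Refine: own label plus the sorted multiset of neighbor labels.
        labels = {
            t: "(%s|%s)" % (labels[t], ",".join(sorted(labels[n] for n in adj[t])))
            for t in adj
        }
        feats.update(labels.values())
    return feats


def wl_kernel(fa, fb):
    """Cosine-normalized dot product of two WL label-count vectors."""
    dot = sum(count * fb[label] for label, count in fa.items())
    norm = math.sqrt(
        sum(c * c for c in fa.values()) * sum(c * c for c in fb.values())
    )
    return dot / norm if norm else 0.0


# Toy windows: two bracket-like features and one chain-like feature.
parens = wl_features(*build_cooccurrence_graph(
    [["(", "x", ")"], ["(", "y", ")"], ["(", "x", ")"]], top_k=4))
squares = wl_features(*build_cooccurrence_graph(
    [["[", "a", "]"], ["[", "b", "]"], ["[", "a", "]"]], top_k=4))
chain = wl_features(*build_cooccurrence_graph(
    [["a", "b"], ["b", "c"], ["c", "d"]], top_k=4))

same_shape = wl_kernel(parens, squares)
different_shape = wl_kernel(parens, chain)
```

Because the initial labels in this sketch depend only on frequency bins, the two bracket-like toy features score as structurally identical even though they share no tokens, while the chain-shaped feature scores lower. That is the property that separates a structural view from a token-histogram view; whether the paper's kernel discards token identity in the same way is not stated in this summary.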

Comparing Structural Motifs

The researchers tested this method, as a proof of concept, on features from a large SAE trained on GPT-2 Small, probed with a synthetic mixed-domain corpus designed to include diverse text types such as code, URLs, and multiple languages. Clustering the features by graph similarity recovered heuristic "motif families": punctuation-heavy patterns, language and script clusters, and code-like templates. These families were not recovered by clustering on decoder cosine similarity, so the structural view surfaces relationships between features that look unrelated when judged only by their decoder weights.
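Once a pairwise similarity matrix over features is available, any clustering method that accepts precomputed similarities can group them. The summary does not say which algorithm the authors used, so the sketch below substitutes plain average-linkage agglomerative clustering; the function name `agglomerative_clusters` and the toy matrix are illustrative assumptions.

```python
def agglomerative_clusters(sim, n_clusters):
    """Average-linkage clustering on a symmetric similarity matrix.

    sim: list of lists, sim[i][j] = kernel similarity of features i and j.
    n_clusters: number of clusters to stop at (must be >= 1).
    Returns one cluster id per feature.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Mean similarity between all cross-cluster pairs.
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                link = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        # Merge the most similar pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(sim)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Invented 4-feature similarity matrix with two obvious groups.
toy_sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
toy_labels = agglomerative_clusters(toy_sim, n_clusters=2)
```

The same routine can be run on a decoder-cosine matrix or a token-histogram similarity matrix, which is one simple way to compare the three views on equal footing.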

Performance and Complementary Insights

The study compares the graph-based approach against two baselines: clustering by decoder cosine similarity and clustering by token-frequency histograms. The token-histogram baseline achieves higher overall purity, so the authors present the graph view as complementary rather than dominant. Its value lies in surfacing structural relationships, such as alphabetic patterns, that the token-frequency and decoder-weight views alone do not capture. The authors also report that cluster assignments are stable across graph-construction hyperparameters and random seeds, which suggests that the structural motifs reflect the features themselves rather than an artifact of the analysis.
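For readers unfamiliar with the two quantities mentioned here, the sketch below shows one standard way to compute them. Purity is the usual definition (the fraction of items that belong to the majority family of their cluster). For stability, the summary does not name the metric the authors used, so the plain Rand index is an assumption, chosen because it ignores how cluster ids are numbered. The family names and assignments are invented toy data.

```python
from collections import Counter
from itertools import combinations


def purity(cluster_ids, family_labels):
    """Fraction of items whose family is the majority family of their cluster."""
    by_cluster = {}
    for cluster_id, family in zip(cluster_ids, family_labels):
        by_cluster.setdefault(cluster_id, Counter())[family] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


def rand_index(run_a, run_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair agrees if both runs put it in the same cluster or both runs
    separate it. Requires at least two items.
    """
    pairs = list(combinations(range(len(run_a)), 2))
    agree = sum(
        (run_a[i] == run_a[j]) == (run_b[i] == run_b[j]) for i, j in pairs
    )
    return agree / len(pairs)


# Toy data: five features, heuristic family names, one cluster assignment.
toy_clusters = [0, 0, 1, 1, 1]
toy_families = ["punct", "punct", "code", "code", "punct"]
toy_purity = purity(toy_clusters, toy_families)

# Two runs (for example, different seeds) that differ only in id numbering.
relabelled = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

A high purity says the clusters align with the hand-assigned families; a Rand index near 1 between runs with different seeds or graph-construction settings says the assignments barely move. The two answer different questions, which is why a method can be stable without being the purest.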

Scope and Future Directions

The authors emphasize that this work is a proof-of-concept focused on organizational structure rather than providing a complete mechanistic explanation of individual features. The current analysis is limited to a single SAE layer in GPT-2 Small and uses a synthetic corpus to ensure a variety of surface motifs are present. Because the study relies on a specific set of hyperparameters and a controlled text environment, the authors note that further research is needed to see how these structural motifs hold up in larger models, different SAE architectures, or across more natural, large-scale datasets.
