From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
This paper explores a new way to organize and understand the thousands of "features" discovered by sparse autoencoders (SAEs) in large language models. Researchers typically analyze these features by inspecting lists of top-activating tokens or weight vectors, an approach that often misses the broader structural relationships between features. The authors propose representing each feature as a "co-occurrence graph," in which tokens are nodes and edges link tokens that tend to appear together in text. By applying a graph-based similarity measure, they aim to group features into structural "motif families" that reveal patterns, such as code templates or specific language scripts, that traditional methods might overlook.
Modeling Features as Graphs
To move beyond simple lists of tokens, the authors transform each SAE feature into a graph. They identify the tokens that appear most frequently when a specific feature is strongly active and treat these as nodes. If two tokens frequently appear together in the same local context window, an edge is drawn between them. The result is a structural map of the token contexts in which the feature fires. To compare these maps, the team uses a custom version of the Weisfeiler-Lehman (WL) graph kernel, a method that iteratively refines each node's label based on the labels of its neighbors and then measures how similar two graphs are by comparing the refined labels they share.
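The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the summary does not specify their "custom" WL variant, so the code uses the standard WL subtree kernel. The function names, the `top_k` and `min_count` thresholds, and the coarse `token_class` initial labels are all assumptions introduced for the example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(contexts, top_k=5, min_count=2):
    """Build an undirected token graph from context windows.

    contexts: list of token lists, each a window in which the feature was
    strongly active. Nodes are the top_k tokens by window frequency; an edge
    joins two nodes that share at least min_count windows.
    Both thresholds are illustrative assumptions.
    """
    token_counts = Counter(tok for window in contexts for tok in set(window))
    nodes = [tok for tok, _ in token_counts.most_common(top_k)]
    node_set = set(nodes)
    pair_counts = Counter()
    for window in contexts:
        present = sorted(node_set & set(window))
        pair_counts.update(combinations(present, 2))
    adjacency = {tok: set() for tok in nodes}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def token_class(tok):
    """Coarse initial node label (an assumption, not taken from the paper)."""
    if tok.isalpha():
        return "alpha"
    if tok.isdigit():
        return "digit"
    return "other"


def wl_label_counts(adjacency, iterations=2):
    """Count node labels over all WL refinement rounds (round 0 included)."""
    labels = {node: token_class(node) for node in adjacency}
    counts = Counter(labels.values())
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbor labels.
        labels = {
            node: (labels[node], tuple(sorted(labels[n] for n in neighbors)))
            for node, neighbors in adjacency.items()
        }
        counts.update(labels.values())
    return counts


def wl_similarity(graph_a, graph_b, iterations=2):
    """Cosine-normalized WL subtree kernel between two graphs."""
    ca = wl_label_counts(graph_a, iterations)
    cb = wl_label_counts(graph_b, iterations)

    def dot(x, y):
        return sum(c * y[label] for label, c in x.items())

    denom = (dot(ca, ca) * dot(cb, cb)) ** 0.5
    return dot(ca, cb) / denom if denom else 0.0


# Toy context windows for a hypothetical "function definition" feature.
contexts = [
    ["def", "foo", "(", ")"],
    ["def", "bar", "(", ")"],
    ["def", "baz", "(", ")"],
]
graph = build_cooccurrence_graph(contexts, top_k=3, min_count=2)
print(graph)
```

Because the refined labels are nested tuples rather than compressed integer IDs, label counts from two different graphs can be compared directly without a shared label dictionary, which keeps the sketch short at the cost of memory on larger graphs.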
Comparing Structural Motifs
The researchers tested this method on a large SAE trained on GPT-2 Small, using a synthetic corpus designed to include diverse text types like code, URLs, and multiple languages. When they clustered these features based on their graph structures, they discovered coherent groups that represent distinct "motif families." These include clusters dominated by punctuation patterns, specific programming syntax, and natural language scripts. This structural organization provides a different perspective than traditional methods, surfacing relationships between features that appear unrelated when viewed only through their individual decoder weights.
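To make the clustering step concrete, the sketch below groups features from a precomputed matrix of pairwise graph similarities. The summary does not say which clustering algorithm the authors used, so average-linkage agglomerative clustering is an assumption here, as are the function name `agglomerative_clusters`, the `threshold` value, and the toy similarity matrix.

```python
def agglomerative_clusters(similarity, threshold=0.5):
    """Average-linkage clustering over a symmetric similarity matrix.

    Repeatedly merges the two clusters with the highest mean pairwise
    similarity and stops once no pair reaches `threshold`.
    Returns clusters as sorted lists of feature indices.
    """
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        total = sum(similarity[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_score < threshold:
            break
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarities between four features' co-occurrence graphs.
similarity = [
    [1.00, 0.90, 0.10, 0.20],
    [0.90, 1.00, 0.15, 0.10],
    [0.10, 0.15, 1.00, 0.80],
    [0.20, 0.10, 0.80, 1.00],
]
print(agglomerative_clusters(similarity, threshold=0.5))
```

Each resulting group plays the role of a candidate "motif family"; in practice the threshold (or a target number of clusters) would need to be tuned against the data.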
Performance and Complementary Insights
The study compares the graph-based approach against two common baselines: clustering by decoder cosine similarity and clustering by simple token-frequency histograms. The results suggest that the graph-based view complements existing techniques rather than replacing them. The token-histogram baseline achieves higher overall purity in grouping features, but the graph-based method excels at capturing specific structural relationships, such as alphabetic patterns, that the other methods miss. The authors note that their graph-based clusters are stable across different random seeds and construction settings, suggesting that the structural motifs are a genuine property of the features rather than an artifact of the analysis.
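The two kinds of evaluation mentioned here, purity and stability, can be illustrated with standard definitions. The summary does not state the exact metrics the authors computed, so the following is a sketch under assumptions: purity is taken as the fraction of features that carry the majority reference label of their cluster, and stability is approximated with the Rand index between two cluster assignments (for example, from two random seeds). All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_purity(cluster_ids, reference_labels):
    """Fraction of items that carry the majority label of their cluster."""
    if not cluster_ids or len(cluster_ids) != len(reference_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, reference_labels):
        members[cid].append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(cluster_ids)


def rand_index(assignment_a, assignment_b):
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two items
    together, or both keep them apart. Cluster IDs need not match.
    """
    if len(assignment_a) != len(assignment_b) or len(assignment_a) < 2:
        raise ValueError("need two assignments of equal length >= 2")
    pairs = list(combinations(range(len(assignment_a)), 2))
    agree = sum(
        (assignment_a[i] == assignment_a[j]) == (assignment_b[i] == assignment_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster assignments and reference motif labels for six features.
clusters_seed_1 = [0, 0, 0, 1, 1, 1]
clusters_seed_2 = [1, 1, 1, 0, 0, 0]
motif_labels = ["code", "code", "url", "url", "url", "code"]

print(cluster_purity(clusters_seed_1, motif_labels))
print(rand_index(clusters_seed_1, clusters_seed_2))
```

Purity rewards many small clusters, so comparisons between methods are most meaningful when the number of clusters is held fixed across them.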
Scope and Future Directions
The authors emphasize that this work is a proof-of-concept focused on organizational structure rather than providing a complete mechanistic explanation of individual features. The current analysis is limited to a single SAE layer in GPT-2 Small and uses a synthetic corpus to ensure a variety of surface motifs are present. Because the study relies on a specific set of hyperparameters and a controlled text environment, the authors note that further research is needed to see how these structural motifs hold up in larger models, different SAE architectures, or across more natural, large-scale datasets.