Patch-Effect Graph Kernels for LLM Interpretability
Mechanistic interpretability aims to understand how transformer models work by identifying the "circuits" (specific sub-networks) responsible for certain behaviors. Researchers typically use "activation patching," a technique that copies internal activations from one run of the model into another to test whether a specific behavior is restored. However, this process generates massive, disorganized datasets that are difficult to compare across tasks. This paper proposes a framework that treats these patching results as graphs, allowing researchers to apply machine learning to compare and analyze the internal structures of language models systematically.
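As a rough illustration of what a single patching measurement looks like, the sketch below uses a toy two-component function in place of a real transformer. The component names (head_a, head_b), the weights, and the normalization by the clean-versus-corrupt gap are all assumptions made for this example and are not taken from the paper.

```python
# Toy stand-in for a model: two hypothetical "components" feed one output.
# Nothing here is GPT-2; the weights and inputs are invented for illustration.

def components(x):
    """Compute each component's activation for a scalar input x."""
    return {"head_a": 2.0 * x, "head_b": x + 1.0}

def output(acts):
    """Combine component activations into a scalar 'logit'."""
    return 3.0 * acts["head_a"] + 0.5 * acts["head_b"]

def patch_effect(clean_x, corrupt_x, name):
    """Fraction of the clean-vs-corrupt output gap that is restored when the
    clean activation of `name` is patched into the corrupted run.
    Assumes the clean and corrupted outputs differ (non-zero gap)."""
    clean_acts = components(clean_x)
    corrupt_acts = components(corrupt_x)
    patched = dict(corrupt_acts)       # start from the corrupted run
    patched[name] = clean_acts[name]   # overwrite one component
    gap = output(clean_acts) - output(corrupt_acts)
    return (output(patched) - output(corrupt_acts)) / gap

effects = {name: patch_effect(1.0, 0.0, name) for name in ("head_a", "head_b")}
print(effects)
```

In this toy, head_a restores most of the gap because its weight on the output is larger. Repeating such a measurement over many components and prompts is what produces the large tables of patch effects described above.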
Turning Interpretability into a Graph Problem
The core idea of this framework is to represent the results of activation patching as "patch-effect graphs." In these graphs, the nodes represent specific components of the model (such as layers, tokens, or attention heads), and the edges represent how strongly pairs of components influence one another. By building one graph for every task or prompt variation, the researchers can apply graph-learning techniques to measure how similar or different the model’s internal circuits are across tasks. This moves the field away from analyzing single prompts in isolation and toward a more scalable, comparative approach.
Methods for Building Circuits
The authors introduce three ways to construct these graphs based on how model components influence one another:
Direct-Influence: Uses causal mediation to measure whether restoring one component changes the effect of another. This is the most accurate method, but also the most computationally expensive.
Partial-Correlation: Removes shared upstream causes to isolate the relationship between two components.
Co-Influence: Measures the correlation of patch-effect profiles across different examples. This is the most efficient method and serves as a primary tool for large-scale analysis.
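The co-influence and partial-correlation constructions can be sketched in a few lines. This is a minimal example under stated assumptions: Pearson correlation as the similarity measure, a single upstream component in the partial-correlation step, and invented component names and patch-effect values. The paper's exact definitions may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def co_influence_graph(profiles):
    """profiles maps a component name to its patch effects, one per example.
    Returns {(a, b): correlation} for every pair with a < b."""
    names = sorted(profiles)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            edges[(a, b)] = pearson(profiles[a], profiles[b])
    return edges

def partial_corr(r_ab, r_az, r_bz):
    """First-order partial correlation of a and b after controlling for one
    shared upstream component z. Assumes |r_az| < 1 and |r_bz| < 1."""
    return (r_ab - r_az * r_bz) / math.sqrt((1 - r_az ** 2) * (1 - r_bz ** 2))

# Hypothetical patch-effect profiles over four examples.
profiles = {
    "head_a": [0.1, 0.2, 0.3, 0.4],
    "head_b": [0.2, 0.4, 0.6, 0.8],  # moves with head_a
    "head_c": [0.3, 0.1, 0.1, 0.3],  # uncorrelated with head_a
}
edges = co_influence_graph(profiles)

# If a and b each correlate 0.8 with z and 0.64 with each other,
# controlling for z leaves essentially nothing.
residual = partial_corr(0.64, 0.8, 0.8)
```

Co-influence needs only the patch effects that were already collected, which fits its description as the cheapest option. Direct-influence would require additional patched forward passes per pair of components and is not sketched here.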
To keep these graphs manageable, the framework uses "top-k" sparsification, which retains only the k most significant connections. It then applies various graph kernels to convert the sparsified graphs into fixed-dimension vectors that standard machine learning models can classify.
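A minimal sketch of this step follows, assuming edges are ranked by absolute weight and using a degree histogram as a stand-in for the explicit feature map of a simple graph kernel. The summary does not say which kernels the authors use, so the descriptor, the function names, and the edge weights below are illustrative only.

```python
def top_k_edges(edges, k):
    """Keep the k edges with the largest absolute weight."""
    ranked = sorted(edges.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])

def degree_histogram(edges, nodes, max_degree):
    """Fixed-length vector: number of nodes at each degree 0..max_degree.
    Degrees above max_degree are counted in the last bin."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    hist = [0] * (max_degree + 1)
    for d in degree.values():
        hist[min(d, max_degree)] += 1
    return hist

def linear_kernel(u, v):
    """Similarity of two graphs as the dot product of their feature vectors."""
    return sum(x * y for x, y in zip(u, v))

# Hypothetical dense graph over four components.
nodes = ["a", "b", "c", "d"]
dense = {("a", "b"): 0.9, ("a", "c"): -0.7, ("b", "c"): 0.2, ("c", "d"): 0.05}

sparse = top_k_edges(dense, k=2)
hist = degree_histogram(sparse, nodes, max_degree=3)
print(sparse, hist, linear_kernel(hist, hist))
```

Because the histogram has the same length for every graph, vectors from different tasks can be stacked into one feature matrix and passed to any off-the-shelf classifier.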
Key Findings and Performance
In tests on GPT-2 Small using the Indirect Object Identification (IOI) task, the researchers found that their graph-based representations captured meaningful structural signals. A major takeaway is that "localized edge-slot features," which focus on specific connections between components, identified task differences better than global descriptors that look only at the overall shape of the graph.
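One way to see why localized features can separate graphs that global descriptors cannot is the sketch below. It assumes an "edge-slot" vector means one fixed position per possible pair of components, holding that edge's weight or zero. This reading, the global descriptor, and the example graphs are assumptions for illustration, not the paper's definitions.

```python
from itertools import combinations

def edge_slot_vector(edges, nodes):
    """One slot per unordered node pair, in a fixed sorted order.
    Edge keys are assumed to be (a, b) tuples with a < b."""
    slots = list(combinations(sorted(nodes), 2))
    return [edges.get(pair, 0.0) for pair in slots]

def global_descriptor(edges):
    """Coarse shape summary: edge count and total absolute weight."""
    return [len(edges), sum(abs(w) for w in edges.values())]

nodes = ["a", "b", "c"]
graph_task_1 = {("a", "b"): 0.8}  # hypothetical circuit for one task
graph_task_2 = {("b", "c"): 0.8}  # same shape, different components

slots_1 = edge_slot_vector(graph_task_1, nodes)
slots_2 = edge_slot_vector(graph_task_2, nodes)
print(global_descriptor(graph_task_1) == global_descriptor(graph_task_2))
print(slots_1, slots_2)
```

The two graphs have identical global descriptors but different edge-slot vectors, so only the localized representation records which components are connected.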
The study also included rigorous controls, such as comparing the graph-based results against "prompt-only" baselines. This revealed that some of the performance reported in previous interpretability studies might be due to surface-level cues in the text rather than the model's internal computation. By separating these surface cues from the actual circuit-level evidence, the framework provides a more reliable way to validate claims about how models produce their outputs.
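The logic of a prompt-only control can be sketched as follows: score the same classifier once on graph features and once on features computed from the prompt text alone, then compare. The leave-one-out nearest-neighbor classifier, the choice of prompt length as the surface feature, and every number below are invented for illustration and do not reproduce the paper's setup or results.

```python
def loo_1nn_accuracy(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    using squared Euclidean distance."""
    correct = 0
    for i, x in enumerate(features):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(features):
            if j == i:
                continue  # never compare an example with itself
            d = sum((a - b) ** 2 for a, b in zip(x, y))
            if d < best_d:
                best_j, best_d = j, d
        correct += int(labels[best_j] == labels[i])
    return correct / len(features)

labels = [0, 0, 1, 1]  # two hypothetical task variants

# Toy graph-derived features (for example, edge-slot weights).
graph_features = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]]

# Toy surface feature: prompt length in characters.
prompt_features = [[100], [102], [200], [103]]

graph_acc = loo_1nn_accuracy(graph_features, labels)
prompt_acc = loo_1nn_accuracy(prompt_features, labels)
lift = graph_acc - prompt_acc  # portion not explained by the surface cue
print(graph_acc, prompt_acc, lift)
```

If the prompt-only score were close to the graph score, the apparent circuit-level signal could be explained by surface cues alone. A clear gap is the kind of evidence the control is designed to surface.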
Considerations for Future Research
While the framework is effective for comparing circuits, the authors stress the distinction between "slice-discriminative" evidence (showing that two tasks differ) and "task-general" causal claims (identifying the universal circuit for a behavior). The current pipeline is designed to compress complex patching data into a format that makes this distinction explicit. As the field moves toward larger models, the method offers a way to maintain interpretability without the quadratic growth in complexity that typically plagues such analyses.