Patch-Effect Graph Kernels for LLM Interpretability

Key Takeaways

  • Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching.
  • However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically.
  • We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components.
  • We introduce three graph-construction methods (direct-influence via causal mediation, partial-correlation, and co-influence) and apply graph kernels to analyze the resulting structures.
Paper Abstract

Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct-influence via causal mediation, partial-correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that candidate edges selected by co-influence (CI) and partial-correlation (PC) correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.

Patch-Effect Graph Kernels for LLM Interpretability
Mechanistic interpretability aims to understand how transformer models work by identifying the "circuits," or specific sub-networks, responsible for certain behaviors. Researchers typically use "activation patching," a technique that replaces internal activations in one forward pass with activations cached from another (for example, from a clean prompt into a corrupted one) and checks whether a specific behavior is restored. Scaled across many prompts and task families, however, this process generates high-dimensional, unstructured datasets that are difficult to compare systematically. This paper proposes a framework that treats these patching results as graphs, allowing researchers to use graph machine learning to compare and analyze the internal structures of language models.
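
To make the patching step concrete, here is a minimal sketch on a toy two-component model rather than a real transformer. The function names (`run_model`, `patch_effect`), the component names, and the weights are all hypothetical. The normalization used (the fraction of the clean-versus-corrupted logit gap that is recovered) is one common choice and is an assumption here, not something taken from the paper.

```python
from typing import Dict, Optional


def run_model(x: float, patch: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Toy stand-in for a forward pass (hypothetical, not GPT-2).

    Any activation named in `patch` is overwritten with the given value
    instead of being computed from the input.
    """
    patch = patch or {}
    acts: Dict[str, float] = {}
    acts["head_a"] = patch.get("head_a", 2.0 * x)
    acts["head_b"] = patch.get("head_b", x + 1.0)
    # Linear readout over the two components.
    acts["logit"] = 3.0 * acts["head_a"] + 0.5 * acts["head_b"]
    return acts


def patch_effect(component: str, clean_x: float, corrupt_x: float) -> float:
    """Fraction of the clean-vs-corrupted logit gap recovered by patching
    the clean activation of `component` into the corrupted run."""
    clean = run_model(clean_x)
    corrupt = run_model(corrupt_x)
    gap = clean["logit"] - corrupt["logit"]
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same output")
    patched = run_model(corrupt_x, patch={component: clean[component]})
    return (patched["logit"] - corrupt["logit"]) / gap


effects = {
    name: patch_effect(name, clean_x=1.0, corrupt_x=0.0)
    for name in ("head_a", "head_b")
}
```

In this linear toy the two effects sum to 1, and `head_a` accounts for most of the recovered gap. In a real model, nonlinear interactions mean per-component effects generally do not add up so neatly.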

Turning Interpretability into a Graph Problem

The core idea of this framework is to represent the results of activation patching as "patch-effect graphs." In these graphs, the nodes represent specific components of the model (such as layers, tokens, or attention heads), and the edges represent estimated influence or statistical relationships between those components' patching effects. By creating a graph for every task or prompt variation, the researchers can apply graph-learning techniques to measure how similar or different the model’s internal circuits are when performing different tasks. This moves the field away from analyzing single prompts in isolation toward a more scalable, comparative approach.
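
As an illustration of the representation, the sketch below stores each patch-effect graph as a mapping from an edge (a pair of component labels) to a weight, and compares graphs with a cosine-normalized linear kernel over edge weights. The component labels, the weights, and the choice of kernel are assumptions made for this example; the summary does not say which graph kernels the paper uses.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]  # edge -> weight; absent edges count as 0


def edge_kernel(g1: Graph, g2: Graph) -> float:
    """Cosine-normalized linear kernel over edge weights.

    Returns 1.0 for graphs with proportional weights on the same edges
    and 0.0 for graphs that share no edges.
    """
    dot = sum(w * g2.get(edge, 0.0) for edge, w in g1.items())
    norm1 = math.sqrt(sum(w * w for w in g1.values()))
    norm2 = math.sqrt(sum(w * w for w in g2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def gram_matrix(graphs: List[Graph]) -> List[List[float]]:
    """Pairwise kernel values, usable as input to a kernel classifier."""
    return [[edge_kernel(a, b) for b in graphs] for a in graphs]


# Hypothetical graphs: two variants of one task and one unrelated task.
task_a = {("L0.H1", "L1.H0"): 0.8, ("L0.H2", "L1.H0"): 0.1}
task_a_variant = {("L0.H1", "L1.H0"): 0.7, ("L0.H2", "L1.H0"): 0.2}
task_b = {("L0.H3", "L1.H2"): 0.9}

gram = gram_matrix([task_a, task_a_variant, task_b])
```

Here the two variants of the same task score close to 1 against each other, while the unrelated task scores 0 against both because it shares no edges with them.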

Methods for Building Circuits

The authors introduce three ways to construct these graphs based on how model components influence one another:

  • Direct-Influence: Uses causal mediation to measure whether restoring one component changes the effect of another. This is the most accurate but also the most computationally expensive method.

  • Partial-Correlation: Controls for shared upstream causes to isolate the relationship between two components.

  • Co-Influence: Measures the correlation of patch-effect profiles across different examples. This is the most efficient method and serves as a primary tool for large-scale analysis.

To ensure these graphs are manageable, the framework uses "top-k" sparsification, which keeps only the most significant connections, and applies various graph kernels to convert these structures into fixed-dimension vectors that can be easily classified by standard machine learning models.
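
The two correlation-based constructions and the sparsification step can be sketched in a few lines. This is a simplified illustration under stated assumptions: co-influence is taken to be the Pearson correlation between per-example patch-effect profiles, partial correlation controls for a single named upstream component using the standard first-order formula, and "most significant" is taken to mean largest absolute weight. The profile values and component names are made up, and the direct-influence construction is omitted because it needs additional patched forward passes.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def co_influence(profiles: Dict[str, List[float]]) -> Dict[Edge, float]:
    """Edge weight = correlation of two components' patch-effect profiles."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


def partial_corr(profiles: Dict[str, List[float]], a: str, b: str, control: str) -> float:
    """First-order partial correlation of a and b given one control."""
    r_ab = pearson(profiles[a], profiles[b])
    r_ac = pearson(profiles[a], profiles[control])
    r_bc = pearson(profiles[b], profiles[control])
    denom_sq = (1 - r_ac ** 2) * (1 - r_bc ** 2)
    if denom_sq <= 0:
        return 0.0
    return (r_ab - r_ac * r_bc) / math.sqrt(denom_sq)


def top_k(edges: Dict[Edge, float], k: int) -> Dict[Edge, float]:
    """Keep the k edges with the largest absolute weight."""
    ranked = sorted(edges.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:k])


# Hypothetical patch effects for four components over five examples.
# head_a and head_b both track `upstream` plus unrelated noise.
profiles = {
    "upstream": [1.0, 2.0, 3.0, 4.0, 5.0],
    "head_a": [1.2, 1.9, 2.8, 3.9, 5.2],
    "head_b": [0.9, 2.2, 3.0, 3.8, 5.1],
    "head_d": [0.5, 0.1, 0.4, 0.2, 0.3],
}

ci = co_influence(profiles)
pc = partial_corr(profiles, "head_a", "head_b", control="upstream")
sparse = top_k(ci, k=2)
```

In this constructed data, `head_a` and `head_b` are strongly correlated only because both follow `upstream`, so their co-influence weight is high while their partial correlation given `upstream` is essentially zero. The top-2 graph keeps just the two edges into `upstream`.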

Key Findings and Performance

When testing this approach on GPT-2 Small using the Indirect Object Identification (IOI) task and related tasks, the researchers found that their graph-based representations preserved discriminative structural signals. A major takeaway is that "localized edge-slot features," which focus on specific connections between components, gave higher classification accuracy than global descriptors that only capture the overall shape of the graph. A screened paired-patching validation further suggests that candidate edges selected by the co-influence and partial-correlation methods correspond to stronger activation-influence effects than random or low-rank candidates.

The study also included rigorous controls, comparing the graph features against "prompt-only" and raw patch-effect baselines. These controls make the scope of the evidence explicit: graph features compress structured patching signal, but surface cues in the prompt text and raw patching tensors are strong baselines that any circuit-level claim should address. By separating what these baselines already explain from the circuit-level evidence, the framework offers a more reliable way to validate claims about a model’s internal computation.
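
One way to see why edge-slot features can carry information that global descriptors miss is the small sketch below. It is an illustration only: the feature definitions (`edge_slot_features`, `global_features`), the node names, and the weights are assumptions for this example and are not the paper's exact descriptors.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, float]


def edge_slot_features(graph: Graph, slots: List[Edge]) -> List[float]:
    """One feature per fixed edge position; missing edges become 0."""
    return [graph.get(edge, 0.0) for edge in slots]


def global_features(graph: Graph, n_nodes: int) -> List[float]:
    """Shape-level summary: edge count, density, mean |weight|, max degree."""
    n_edges = len(graph)
    possible = n_nodes * (n_nodes - 1) / 2
    density = n_edges / possible if possible > 0 else 0.0
    mean_abs = sum(abs(w) for w in graph.values()) / n_edges if n_edges else 0.0
    degree = Counter(node for edge in graph for node in edge)
    max_degree = max(degree.values()) if degree else 0
    return [float(n_edges), density, mean_abs, float(max_degree)]


# Fixed ordering of candidate edges over three hypothetical components.
slots = [("A", "B"), ("A", "C"), ("B", "C")]

# Same weights and same overall shape, but on different edges.
graph_1 = {("A", "B"): 0.9, ("B", "C"): 0.4}
graph_2 = {("A", "C"): 0.9, ("B", "C"): 0.4}

slot_1 = edge_slot_features(graph_1, slots)
slot_2 = edge_slot_features(graph_2, slots)
global_1 = global_features(graph_1, n_nodes=3)
global_2 = global_features(graph_2, n_nodes=3)
```

The two graphs have identical global descriptors but different edge-slot vectors, so a classifier that sees only the global features cannot tell them apart. This shows the mechanism by which localized features could be more discriminative; it is not evidence for the reported result.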

Considerations for Future Research

The authors distinguish between "slice-discriminative" evidence (showing that patching-derived structures differ between task slices) and stronger "task-general" causal-circuit claims (identifying a circuit that holds for a behavior in general). The current pipeline is designed to compress complex patching data into a form that can be compared under controlled baselines, which makes this distinction explicit. As the field moves toward larger models, this kind of compression may help keep patching analyses comparable as the number of components and prompts grows.
