CausaLab: A Scalable Environment for Interactive Ca... | AI Research

Key Takeaways

  • We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents.
  • Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism.
  • The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge.
  • CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth.
Paper Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This gap motivates our exploration of different interaction strategies. Mixed observation–intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% task accuracy and 0.80 all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.

What it covers

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Authors:

  • Junlin Yang (Tsinghua University)
  • Dylan Zhang (University of Illinois Urbana-Champaign)
  • Xiangchen Song (Carnegie Mellon University)
  • Qirun Dai (University of Chicago)
  • Xiao Liu (University of Chicago)
  • Yuen Chen (University of Illinois Urbana-Champaign)
  • Aniket Vashishtha (University of Illinois Urbana-Champaign)
  • Jing Shi (Adobe)
  • Chenhao Tan (University of Chicago)
  • Hao Peng (University of Illinois Urbana-Champaign)

Corresponding author: Dylan Zhang, [email protected]

Code: https://github.com/DylanZSZ/CausaLab
Junlin Yang and Dylan Zhang contributed equally and both serve as project leads. Junlin Yang's work was done at the University of Illinois Urbana-Champaign.

1 Introduction

Causal reasoning is important because scientific, medical, and policy decisions depend on how systems would respond to interventions, not only on observed associations (Pearl, 2009; Pearl and Mackenzie, 2018; Imbens and Rubin, 2015). Yet measuring and making progress in causal reasoning remains challenging, particularly for today's large language models (LLMs). Existing benchmarks generally translate causal graphs, datasets, or narratives into question-answering and classification tasks (Qin et al., 2019; Romanou et al., 2023; Stolfo et al., 2023; Jiang et al., 2024; Vashishtha et al., 2025; Jin et al., 2023a; Wang, 2024; Chen et al., 2024b; Jin et al., 2023b). While useful, they leave open the "causal parrot" concern (Zečević et al., 2023): models can succeed with memorized causal facts or linguistic cues rather than the causal reasoning behaviors needed to discover causal mechanisms (Zheng et al., 2023; Liu et al., 2023).

To illustrate, consider the following thought experiment. Suppose we are interested in studying the causal relationship between temperature and the resonance frequency of a crystal. An LLM agent might appear useful in at least two different ways. (1) It may retrieve from existing sources, such as Wikipedia or its training data, that temperature causes resonance frequency. (2) It may observe paired measurements of temperature and frequency, formulate hypotheses, design experiments, perform interventions, observe the resulting changes, and infer causation from evidence (Pearl, 2009; Hauser and Bühlmann, 2012; Lampinen et al., 2023). While both are valuable in practice, (1) offers little help when the relevant causal knowledge lies beyond the current frontier of human knowledge.
We therefore argue that (2) is especially important, particularly for applications such as scientific discovery, because it enables LLM agents to help advance the frontiers of knowledge in a manner closer to what human scientists would do (Langley, 2019; Dunbar and Fugelsang, 2005; Jansen et al., 2024).

Figure 1: Overview of a CausaLab episode. (1) The environment instantiates a hidden SCM (a causal graph over crystal properties plus structural equations and coefficients) and uses it to generate prior measurement records, a manipulator crystal, and a held-out reactor crystal. All are governed by the same SCM but have different property values. (2) The agent observes measured properties and frequency values in the prior records, then chooses, within a budget, interventions on the manipulator crystal through the property manipulator. (3) At every step the agent emits a DSL hypothesis (graph, structural equation for frequency, coefficients) that we parse against the ground-truth SCM. (4) The agent predicts the held-out frequency of the reactor crystal; we score the final prediction and the trajectory of recovered mechanisms.

We introduce CausaLab (Figure 1), a scalable environment for evaluating LLM agents as interactive causal discoverers, joining a recent line of interactive scientific-agent and causal-discovery benchmarks (Jansen et al., 2024; Havrilla et al., 2025; Chen et al., 2026, 2025; Geng et al., 2025). Each episode is generated by a hidden structural causal model (SCM) (Pearl, 2009): a causal graph together with structural equations that determine crystal properties and frequency. The agent receives prior measurement records, can run budgeted interventions on a manipulator crystal through a property manipulator, and must predict the frequency of a separate reactor crystal governed by the same SCM (Figure 1; § 3). Two design choices distinguish CausaLab from prior causal-reasoning evaluations.
First, the hidden SCM is sampled per episode rather than drawn from public causal corpora, which sidesteps the "causal parrot" concern that scores reflect memorized causal lexicon. Second, a lightweight domain-specific language (DSL; § 4) records the agent's accumulated evidence, current graph and equation hypothesis, planned experiment, and action at each step, so we can score not only the final prediction but also the trajectory-level faithfulness of the recovered mechanism to the ground-truth SCM (§ 5).

Our experiments span closed and open-weight models, multiple model sizes, and thinking versus non-thinking variants, surfacing four findings that prior static benchmarks cannot reach.

(1) Correct predictions often do not reflect correct mechanism discovery. Across matched functional-form, hidden-perturbation, and target-edge controls, endpoint accuracy and mechanism fidelity move separately: agents can find plausible parents while missing quantitative equations, preserve task success while degrading all-edge recovery, or lose accuracy mainly when the target equation itself is perturbed.

(2) Observation-conditioned online intervention best balances prediction and graph recovery. Pure observation can boost endpoint accuracy without recovering structure, and pure intervention is weak before observations narrow the hypothesis space. For GPT-5.2-high on 6-node graphs, pure observation reaches 92% accuracy but only 0.47 all-edge $F_1$, while mixed online observation–intervention reaches 80%/0.80. Offline intervention traces do not replace online experimental choice: injecting "Golden" chains raises GPT-5-mini accuracy to 90% on 4 nodes while lowering all-edge $F_1$.

(3) Model family and scale pay off unevenly across the two axes. GPT-5.2-high has the best endpoint accuracy and lowest directed all-edge SHD at every graph size, but gains are not uniform across graph sizes or metrics.
Open-weight Qwen3.5 can approach GPT-5-mini on some task scores, yet its SHD rises faster as graphs grow; thinking generally lowers Qwen SHD. Even GPT-5.2-high drops to 64% accuracy and directed SHD 4.761 at 7 nodes.

(4) Many failures come from premature commitment, not exhausted budget. Both successful and failed runs leave roughly half the intervention budget unspent, failed runs end with hypotheses inconsistent with their own data, and a single explicit verification step lifts 4-node accuracy from 48% to 60%.

CausaLab therefore separates predictive success from causal understanding, revealing how current LLM agents still struggle to explore unfamiliar environments interactively, test candidate mechanisms, and revise toward the causal regularities that govern them.

2 Background and Related Work

Causal reasoning goes beyond associational prediction by asking how a system would change under interventions and counterfactual alternatives (Pearl, 2009; Pearl and Mackenzie, 2018; Imbens and Rubin, 2015). Structural causal models (SCMs) formalize these assumptions as directed graphs plus structural equations (Pearl, 2009). In CausaLab, each episode's hidden SCM is both the ground truth (§ 3.1) and the evaluation target (§ 3.3), letting us score whether an agent recovers the graph and target equation, not only whether it predicts the reactor value.

Most LLM causal evaluations are static: they ask models to answer textual causal questions, reason over given graphs, classify cause–effect direction, or solve formal causal-inference queries (Kıcıman et al., 2023; Jin et al., 2023a,b; Chen et al., 2024b; Wang, 2024; Chen et al., 2024a). Related work also uses LLMs as causal priors for edge scoring, causal ordering, or query-efficient discovery (Long et al., 2023; Darvariu et al., 2024; Vashishtha et al., 2023; Jiralerspong et al., 2024).
Recent SCM-oriented studies either use LLM metadata reasoning to support graph discovery (Abdulaal et al., 2024) or test coefficient elicitation when the DAG is supplied (Yamaoka et al., 2026). These settings clarify what causal knowledge LLMs can express, but they usually provide the variables, evidence, graph, or query up front. CausaLab instead asks whether an LLM agent can gather evidence, revise a hypothesis, and transfer the learned mechanism to a new instance, all within a scientific-discovery setting that offers no hints about the underlying causal structure.

Interactive environments broaden evaluation beyond one-shot answers, including scientific-discovery worlds, budgeted graph-discovery games, causal games, and non-LLM intervention planners (Jansen et al., 2024; Havrilla et al., 2025; Chen et al., 2026; Gregorini et al., 2025). A basic agent scaffold for such settings is ReAct-style reasoning and acting, where the model interleaves deliberation with executable environment actions (Yao et al., 2023). The closest recent benchmark is Auto-Bench, where LLM agents iteratively query scientific or social-network environments to recover a hidden causal graph (Chen et al., 2025). Work on black-box reverse engineering similarly shows that actively designing queries is not equivalent to receiving another agent's intervention data (Geng et al., 2025).

CausaLab differs from Auto-Bench in its evaluation target. Auto-Bench primarily asks whether an agent can discover a hidden DAG through interaction. CausaLab asks whether the discovered mechanism transfers: after learning from prior measurements and interventions on a manipulator crystal, the agent must predict a held-out reactor crystal generated by the same SCM, while its per-step DSL hypotheses expose the graph, the frequency structural equation, and the coefficients it is committing to.
This makes it possible to separate task utility from structural and quantitative faithfulness, and to audit how an LLM agent revises or fails to revise an explicit SCM over time. This connects two evaluation traditions: explicit SCM recovery from causal discovery and sequential experiment design from agent benchmarks. Because each episode has a known ground-truth SCM and a logged interaction trace, CausaLab can score both final-task utility and the faithfulness of the recovered mechanism.

3 The Construction of CausaLab

This section first defines the episode-level task and what the agent must infer, then specifies the SCM in § 3.1, the observation and intervention protocol in § 3.2, and the evaluation targets in § 3.3. Artifact, licensing, and implementation details are provided in Appendix A.3.

Design principles. The benchmark is designed around three goals. First, can a model infer a causal mechanism that transfers to a new instance, rather than fitting an isolated value pattern? Second, can it choose informative interventions rather than passively consume a fixed dataset? Third, how do these abilities scale with graph size, topology, functional form, intervention budget, and hidden disturbances? The corresponding design choices that realize these goals are shared-mechanism transfer between two crystals, online intervention choice, and synthetically controlled SCM generation with known ground truth.

Task formulation. A CausaLab episode is a transfer problem under a hidden SCM: the causal graph, structural equations, and coefficients are all hidden, and the agent is given only prior measurement records plus a finite budget for interventions (Figure 1). The episode also contains two crystals generated by the same SCM: a manipulator crystal on which the agent may intervene, and a reactor crystal whose frequency is held out. The initial records contain physical properties and resulting frequency values from earlier measurements under the same SCM.
The agent then spends its interaction budget on interventions over controllable non-frequency properties of the manipulator crystal and observes the resulting measurements. After collecting this evidence, the agent predicts the hidden frequency of the reactor crystal. The records, manipulator crystal, and reactor crystal share the same SCM but have different property values, so the agent cannot solve the task by copying an observed frequency; it must infer a mechanism that transfers. The agent is told the property names and functional family but may intervene only on a configured subset $C \subseteq O$ of controllable observable non-frequency variables; variables outside $C$ (including $Y$ and any non-controllable property) are observable but not intervenable. The reactor crystal exposes only its non-frequency variables; per-variable access is summarized in Appendix Table 2. At each step the agent also emits a DSL hypothesis that we parse into a directed graph, frequency equation, and coefficients. Solving an episode therefore requires both a correct reactor prediction and a causal hypothesis that matches the hidden SCM under the metrics of § 3.3.

3.1 Structural Causal Models

Each episode instantiates an SCM $\mathcal{M}=(\mathbf{U},\mathbf{V},F,P(\mathbf{U}))$ (Pearl, 2009). Here $\mathbf{U}$ are exogenous source terms, $\mathbf{V}$ are endogenous variables, $F$ is the set of structural equations, and $P(\mathbf{U})$ is the exogenous distribution. In CausaLab, the endogenous variables are observable properties $O$ plus the target $Y=\texttt{frequency}$. Root variables are endogenous nodes whose values are generated from exogenous source terms, and optional hidden-noise terms are also exogenous. We sample a DAG $G$ over $\mathbf{V}=O\cup\{Y\}$, assign root nodes from their exogenous sources, then compute non-root variables in topological order.
In the linear family,

$$X=b+\sum_{p\in\mathrm{pa}(X)}w_{p}\,p,$$

and in the quadratic family,

$$X=b+\sum_{p\in\mathrm{pa}(X)}\left(u_{p}\,p^{2}+w_{p}\,p\right).$$

The sampled graph, equations, and coefficients, including the base value of frequency, are shared across the prior records, manipulator crystal, and reactor crystal; controllable-property base values differ across these instances. This asymmetry is what forces the agent to infer how variables are connected and then apply that mechanism to the reactor's property values.

Some graph families also include an unobserved exogenous disturbance $H$ that perturbs the system as follows. After every intervention, $H$ is resampled and added as a fixed-weight shift to a designated subset of observable endogenous variables; those shifted values then propagate downstream through $F$. $H$ itself is not in $\mathbf{V}$, is not named to the agent, and cannot be observed or set directly; the agent sees only its downstream effects on the returned variable values. These settings test whether an agent can distinguish a stable causal mechanism from post-intervention noise. Additional distributions and coefficient ranges appear in Appendix A; formal SCM and hidden-disturbance details appear in Appendix A.2.

3.2 Interaction and Outputs

Each episode proceeds through a repeated hypothesis–experiment loop. The agent receives an initial batch of measurement records, including non-frequency properties and the resulting frequency. It may then intervene by setting one controllable non-frequency property on the manipulator crystal; the environment recomputes that crystal's resulting measurement under the hidden SCM and returns it to the agent. The reactor crystal is observed but not intervened on: its non-frequency properties are visible, and its frequency remains hidden until the agent submits a final value.
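To make the structural equations of § 3.1 concrete, the following is a minimal Python sketch, not the released CausaLab implementation. It samples a DAG by allowing edges only from earlier to later nodes and evaluates the linear or quadratic family in topological order. The function names (`sample_dag`, `evaluate_scm`), the edge probability, and all numeric values are illustrative assumptions; for root nodes, the base value stands in for the exogenous source draw.

```python
import random


def sample_dag(nodes, edge_prob, rng):
    """Sample a DAG as a parent map; edges only go from earlier to later nodes."""
    parents = {v: [] for v in nodes}
    for i, child in enumerate(nodes):
        for parent in nodes[:i]:
            if rng.random() < edge_prob:
                parents[child].append(parent)
    return parents


def evaluate_scm(order, parents, base, w, u=None):
    """Evaluate X = b + sum_p (u_p * p**2 + w_p * p) in topological order.

    With u=None the quadratic terms vanish, which gives the linear family.
    For root nodes, base[x] stands in for the exogenous source draw.
    """
    values = {}
    for x in order:
        total = base[x]
        for p in parents[x]:
            quad = u[(p, x)] if u is not None else 0.0
            total += quad * values[p] ** 2 + w[(p, x)] * values[p]
        values[x] = total
    return values


# Hand-written 3-node example (hypothetical numbers, for illustration only).
order = ["radiation", "temperature", "frequency"]
parents = {"radiation": [], "temperature": ["radiation"],
           "frequency": ["radiation", "temperature"]}
base = {"radiation": 2.0, "temperature": 1.0, "frequency": 0.5}
w = {("radiation", "temperature"): 3.0,
     ("radiation", "frequency"): -1.0,
     ("temperature", "frequency"): 2.0}
u = {edge: 0.5 for edge in w}

linear = evaluate_scm(order, parents, base, w)
quadratic = evaluate_scm(order, parents, base, w, u)
print(linear["frequency"], quadratic["frequency"])  # 12.5 59.0

# A randomly sampled parent map over the same variable order.
random_parents = sample_dag(order, edge_prob=0.5, rng=random.Random(0))
```

In this sketch, reusing `w`, `u`, and the frequency base value with different root base values would mimic how the records, manipulator crystal, and reactor crystal share one mechanism while differing in property values.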
Concretely, the loop begins with the initial observation batch and then alternates between interventions and observations: choose an intervention on one controllable manipulator-crystal property → observe the resulting manipulator-crystal measurement → revise the DSL hypothesis and choose the next intervention. For example, after seeing several prior measurement records, an agent may set the manipulator crystal's radiation to a chosen value, see how temperature, conductivity, and frequency change, and then decide whether the evidence supports a direct edge into frequency or an indirect path through another property. This is the interaction that Figure 1 depicts at the task level and Appendix Figure 8 exposes at the trajectory level.

The intervention semantics are shift-style rather than hard $\mathrm{do}(X{=}v)$ (Rothenhäusler et al., 2015), and we specify them here because they determine what the agent's returned observations mean. For a controllable variable $X\in C$, an intervention request with value $v$ replaces the base term in that variable's structural equation for the next environment update:

$$X\leftarrow v+\sum_{p\in\mathrm{pa}(X)}w_{p}\,p$$

in the linear family, and analogously

$$X\leftarrow v+\sum_{p\in\mathrm{pa}(X)}\left(u_{p}\,p^{2}+w_{p}\,p\right)$$

in the quadratic family. Incoming parent contributions are therefore retained; only the intercept/base component is shifted. A hard intervention would instead force $X=v$ and sever incoming causal influence.

At the end of the episode, the agent submits a numeric prediction for the reactor frequency and a final DSL hypothesis specifying causal edges, the proposed structural equation for frequency, and coefficients. The same DSL can be emitted at intermediate steps, giving a trajectory of evolving hypotheses.
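The difference between shift-style and hard interventions can be seen on a small linear example. The sketch below is an illustration under assumed values, not the environment's actual code: the `evaluate` function, its `shift` and `hard` parameters, and the three-variable graph are all hypothetical.

```python
def evaluate(order, parents, base, w, shift=None, hard=None):
    """Evaluate a linear SCM under an optional intervention.

    shift: {X: v} replaces only the base term of X; parents still contribute.
    hard:  {X: v} forces X = v and ignores X's parents, i.e. do(X=v).
    """
    shift, hard = shift or {}, hard or {}
    values = {}
    for x in order:
        if x in hard:
            values[x] = hard[x]  # incoming causal influence is severed
            continue
        b = shift.get(x, base[x])  # shift-style: v takes the place of b
        values[x] = b + sum(w[(p, x)] * values[p] for p in parents[x])
    return values


# Hypothetical linear mechanism: radiation -> temperature -> frequency,
# plus a direct radiation -> frequency edge.
order = ["radiation", "temperature", "frequency"]
parents = {"radiation": [], "temperature": ["radiation"],
           "frequency": ["radiation", "temperature"]}
base = {"radiation": 2.0, "temperature": 1.0, "frequency": 0.5}
w = {("radiation", "temperature"): 3.0,
     ("radiation", "frequency"): -1.0,
     ("temperature", "frequency"): 2.0}

baseline = evaluate(order, parents, base, w)
shifted = evaluate(order, parents, base, w, shift={"temperature": 10.0})
forced = evaluate(order, parents, base, w, hard={"temperature": 10.0})

print(baseline["temperature"], baseline["frequency"])  # 7.0 12.5
print(shifted["temperature"], shifted["frequency"])    # 16.0 30.5
print(forced["temperature"], forced["frequency"])      # 10.0 18.5
```

Under the shift-style request the returned temperature is 16.0 rather than the requested 10.0, because the radiation contribution is retained. An agent that reads the returned value as a hard setting would therefore misestimate the downstream coefficients.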
3.3 Evaluation

We evaluate whether the model both solves the held-out task and recovers the mechanism needed to solve it causally. Task success is frequency accuracy on the reactor crystal. Mechanism recovery compares the parsed DSL hypothesis against the ground-truth SCM: graph precision, recall, and $F_1$ measure recovered causal edges; structural Hamming distance (SHD) counts missing, extra, and reversed directed edges, with lower values indicating closer graph recovery; coefficient $F_1$ measures whether the quantitative frequency mechanism is correct; and root-node identification measures whether the agent distinguishes exogenous/root variables from mediated variables. This separation is essential: an agent may predict the held-out frequency without recovering the SCM, or recover the qualitative graph while missing the coefficients needed for reliable transfer. A correct solution therefore requires three linked behaviors: collect useful observational/interventional evidence, infer a graph and target equation that explain the prior records and manipulator-crystal measurements, and apply that mechanism to the reactor crystal's observed properties.

4 A DSL for Causal Trajectories

At each interaction step $t$, the agent emits a compact DSL record with five fields:

  • Memory $M_t$, the persistent episode notes;
  • Thought $T_t$, a short interpretation of the current evidence;
  • Past data $\mathcal{D}_{\leq t}$, the accumulated observations and intervention outcomes;
  • Hypothesis $H_t$, the current causal claim;
  • Experiment $E_t$, the next planned intervention and its rationale.

Only $H_t$ is used as a scored causal artifact: it states hypothesized edges, the structural equation for frequency, and the associated coefficients. Appendix Figure 8 shows how parsed hypotheses are rendered as candidate graphs and recovery metrics over time. Prompting and repair details appear in Appendix A.5.

Making the hypothesis parsable.
We make $H_t$ a scored object by requiring a fixed schema rather than free-form prose. The schema contains three typed parts: directed edges as (parent, child) pairs over episode variables, a frequency structural equation in the declared functional family, and numeric coefficients for the equation terms. A deterministic parser converts each valid hypothesis into a candidate graph $G_t$ and target mechanism $\hat{f}_t$, producing a trajectory $\{(G_t,\hat{f}_t)\}_{t=1}^{T}$. This lets the benchmark score the mechanism the agent commits to at each step using the same graph, root, and coefficient metrics used for final evaluation, rather than relying only on the final numeric answer.

5 Experiments

We use CausaLab to ask four questions.

  • (RQ1) Does correct prediction imply mechanism recovery?
  • (RQ2) Which interaction regime best balances task accuracy and graph recovery, and can offline intervention traces replace online experimental choice?
  • (RQ3) How do model family, scale, and thinking traces affect prediction and mechanism recovery across graph sizes?
  • (RQ4) Why do agents fail, and what simple check can reduce these failures?

The paired prediction and SCM-recovery targets separate task success from mechanism faithfulness, and DSL traces expose the hypotheses agents commit to.

5.1 Experimental Setup

Setup. The main suite evaluates four models (GPT-5-mini, GPT-5.2-high, Qwen3.5-Thinking, and Qwen3.5-Non-thinking) on CausaLab's 3–7 node graph families, with up to 50 topologies per (graph size, model) cell and one run per task. Observation–intervention scaling experiments use GPT-5-mini and GPT-5.2-high on the 4-node and 6-node suites. Targeted follow-ups use the 4-/6-node suites, primarily with GPT-5-mini. All runs use temperature 0.1 and fixed observation/intervention budgets per graph size (Appendix A).
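The edge-level recovery scores used throughout the experiments (precision, recall, $F_1$, and SHD from § 3.3) can be sketched over sets of (parent, child) pairs, the same edge representation the DSL schema uses. This is a minimal illustration, not the benchmark's scoring code: the function names are hypothetical, and counting a reversed edge once in SHD is an assumption, since the paper's exact convention is not stated here.

```python
def edge_scores(pred_edges, true_edges):
    """Directed-edge precision, recall, and F1 over (parent, child) pairs."""
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def shd(pred_edges, true_edges):
    """Structural Hamming distance: missing + extra + reversed edges.

    Assumption: a reversed edge counts once, not as one missing plus one extra.
    """
    pred, true = set(pred_edges), set(true_edges)
    only_pred, only_true = pred - true, true - pred
    reversed_edges = {(a, b) for (a, b) in only_pred if (b, a) in only_true}
    extra = len(only_pred) - len(reversed_edges)
    missing = len(only_true) - len(reversed_edges)
    return missing + extra + len(reversed_edges)


# Hypothetical ground truth and parsed hypothesis graph.
true_edges = [("radiation", "temperature"),
              ("temperature", "frequency"),
              ("radiation", "frequency")]
pred_edges = [("temperature", "radiation"),     # reversed
              ("temperature", "frequency"),     # correct
              ("conductivity", "frequency")]    # extra

precision, recall, f1 = edge_scores(pred_edges, true_edges)
distance = shd(pred_edges, true_edges)
print(round(f1, 3), distance)  # 0.333 3
```

Here one edge is correct, one is reversed, one is extra, and one true edge is missing, so $F_1$ is 1/3 and SHD is 3. Applying the same functions to each parsed $G_t$ would give a per-step recovery trajectory.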
The reactor crystal's hidden frequency is the target in every episode, so end-task accuracy is the exact prediction rate for that value; mechanism recovery is scored separately with graph, parent, root, edge, and coefficient metrics against the full episode SCM. Except for the explicit observation–intervention scaling suite, all follow-up analyses use the mixed regime with two initial observations; RQ2 motivates this setting as the anchor for subsequent analyses.

Figure 2: Matched 4-node comparison between linear and hard-quadratic mechanisms for GPT-5-mini. Topology is fixed; only the functional form changes. Task accuracy and frequency-weight $F_1$ collapse while all-edge and root-node $F_1$ are preserved or even rise: agents lose the mechanism, not the graph.

5.2 RQ1: Correct Frequency Prediction Does Not Imply Mechanism Recovery

CausaLab pairs each episode with a ground-truth SCM, so we can score the answer and the mechanism separately. Three controls show that these axes split in different ways rather than collapsing to one scalar.

Function form. Holding the 50 four-node topologies fixed but replacing the linear mechanism with a hard-quadratic one cuts GPT-5-mini accuracy from 48% to 24% (Figure 2). The graph is not simply lost: root-node $F_1$ rises (0.559 → 0.829) and edge precision is preserved, but frequency-weight $F_1$ collapses (0.589 → 0.251; Appendix Table 3). The agent can find plausible parents and still fail because it misses the quantitative mechanism.

Hidden perturbations. Off-target hidden noise leaves accuracy near baseline (40–54% versus 48%) but lowers all-edge $F_1$ from 0.79 to 0.61–0.70. When the hidden disturbance can perturb frequency itself, accuracy drops to 26–40% (Appendix Figure 9; Appendix Table 4), showing that some successful predictions came from fitting a local target equation rather than recovering a mechanism robust to hidden target perturbations.

Target outgoing edges.
FreqParent keeps mean edge counts matched but lets frequency have outgoing edges. Accuracy rises on 4- and 6-node graphs because the target has fewer incoming edges to fit, while all-edge recovery falls because global directionality is harder (Appendix Figure 10; Appendix Table 5).

Takeaway: Prediction accuracy is necessary but not sufficient evidence of mechanism recovery.

Figure 3: Prediction-versus-recovery gap across the four scaling families. Each suite is shown as an Obs.-only → Mixed arrow in (task accuracy, all-edge $F_1$) space: mixed regimes consistently shift mass toward higher graph fidelity at comparable or better task accuracy.

5.3 RQ2: Observation-Conditioned Online Intervention Outperforms Pure and Offline Regimes

RQ2 separates two questions: whether agents need observations, interventions, or both; and whether offline intervention data is enough when the agent does not choose the experiments online. Figure 3 summarizes the three online regimes across our four scaling families. For GPT-5-mini, pure observation often gives the strongest end-task accuracy on the easier graphs, but mixed observation-conditioned intervention consistently recovers more faithful graphs on both the 4-node and 6-node families. In the GPT-5.2-high 6-node setting, for example, observation-only has higher accuracy than mixed (92% versus 80%) but much lower graph-recovery $F_1$ (0.47 versus 0.80). Pure intervention is weak on both axes, becoming useful only after observation narrows the hypothesis space. We therefore use mixed online regimes as the anchor for follow-up controls. The full regime scatter appears in Appendix Figure 12; full scaling curves and tables appear in Appendix Figures 13 and 14 and Appendix A.9.

The Golden control then separates offline intervention data from online intervention decisions by giving the agent a bounded low-MEC intervention chain instead of letting it intervene online.
Golden improves task accuracy above the main suite baselines (90% versus 48% on 4-node graphs, 44% versus 24% on 6-node graphs) but drops all-edge $F_1$ on both sizes (Figure 4; Appendix Table 6). High-quality intervention chains therefore behave mostly like stronger observations: they help fit the target equation, but they do not replace the structural signal supplied by the agent's own online intervention loop.

Takeaway: Observation-conditioned online intervention gives the best balance: observations narrow the hypothesis space, while agent-chosen interventions recover more faithful graphs.

Figure 4: Golden-intervention experiments on GPT-5-mini. Baseline → Golden arrows in (task accuracy, all-edge $F_1$) space: injected low-MEC intervention traces improve frequency prediction but hurt all-edge recovery, separating intervention data from intervention choice.

Figure 5: Capability gap (GPT-5.2-high − GPT-5-mini) in percentage points across graph sizes and metrics. Scaling concentrates in accuracy and frequency-weight $F_1$; root-node gains are near zero or negative at 6–7 nodes, showing where larger models still stall.

5.4 RQ3: Model Family and Scale Pay Off Unevenly Across the Two Axes

GPT-5.2-high outperforms GPT-5-mini across graph sizes, but the gains concentrate on mediated structure and quantitative mechanism fitting rather than every metric uniformly. Figure 6 extends the model-family comparison to all 3–7 node main suites, covering the two GPT models and Qwen3.5 with and without thinking traces. GPT-5.2-high is the strongest model overall, with the best endpoint accuracy and lowest directed all-edge SHD at every graph size. Open-weight Qwen3.5 models can be competitive with GPT-5-mini on some task scores, but their SHD rises faster as graph size grows. Thinking generally improves Qwen structure recovery, lowering SHD at four graph sizes and raising all-edge $F_1$ at every measured size.
Across the full 3–7 node sweep, even GPT-5.2-high still drops to 64% accuracy and directed SHD 4.761 at 7 nodes (Figure 6), and the per-metric gap (Figure 5) concentrates in accuracy and frequency-weight $F_1$ while ro
