Key Takeaways

  • Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors.
  • We introduce InvestPhilBench, a multi-layer dynamic benchmark spanning eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8).
  • The v0.6 release comprises 118 primary-source-verified investment principle cards, 25 decision framework cards with explicit topology metadata, and 243 QA questions (197 dev / 46 held-out test).
  • In this release, InvestPhilBench is primarily a benchmark-and-methodology contribution.

Paper Abstract

Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors. We introduce InvestPhilBench, a multi-layer dynamic benchmark spanning eight cognitive tiers, from principle identification (L1) to novel framework extrapolation (L8). The v0.6 release comprises 118 primary-source-verified investment principle cards, 25 decision framework cards with explicit topology metadata, and 243 QA questions (197 dev / 46 held-out test). For reproducible scoring at scale we introduce the Benchmark Automated Scoring Pipeline (BASP) -- five algorithmic metrics (OGRS, KCCS, SAP@k, IVP, CKCA) -- the Failure Mode Detection Protocol (FMDP) with computable rules for six failure modes, and Gate Reconstruction Accuracy (GRA), a per-gate metric for questions with gold reasoning programs. In this release, InvestPhilBench is primarily a benchmark-and-methodology contribution. A four-model sanity wave on the 188-question development split shows a sharp provider-tier split (BASP 0.906 vs. 0.438); these mixed-judge numbers are confounded upper bounds. The central finding: the BASP composite saturates at the frontier (Claude L4 = 0.932) while GRA still exposes a procedural deficit (frontier L4 GRA approx. 0.77, L7 GRA 0.57-0.62) -- composite scoring rewards fluent prose and hides the procedural gap. v0.6 implements a unified judge and true model-in-the-loop retrieval/oracle conditions; the de-confounded multi-model leaderboard and full three-condition run are v1.0 deliverables. On a 100-item expert-annotated gold set the automated BASP composite tracks the human reference at Pearson r = 0.72 (MAE = 0.10), with attribution (SAP@3) the weakest sub-metric and the failure-mode detector running sensitive-but-over-flagging.
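
The agreement figures quoted in the abstract (Pearson r and MAE between the automated composite and a human reference) are standard statistics. The following is a minimal sketch of how such numbers could be computed, assuming one automated score and one human score per item. The score lists are made-up toy values, not data from the paper, and the function names are illustrative.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(x: list[float], y: list[float]) -> float:
    """Average absolute gap between automated and human scores."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Toy values for illustration only (assumption, not from the paper).
automated_scores = [0.9, 0.8, 0.5, 0.4]
human_scores = [0.8, 0.7, 0.6, 0.3]

r = pearson_r(automated_scores, human_scores)
mae = mean_absolute_error(automated_scores, human_scores)
print(f"Pearson r = {r:.2f}, MAE = {mae:.2f}")
```

A high r with a non-trivial MAE would mean the automated score ranks answers much like the human graders do while still being offset from them on individual items, which is why both statistics are worth reporting together.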

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy

Large language models (LLMs) are increasingly used as investment research assistants, yet there has been no standardized way to test whether they understand the specific, step-by-step decision frameworks used by expert investors. This paper introduces InvestPhilBench, a benchmark designed to evaluate whether models can accurately reconstruct and apply these frameworks. Rather than testing general financial knowledge, the benchmark focuses on "procedural reasoning": the ability to follow sequential rules, such as identifying "kill criteria" that should immediately stop an investment analysis.

A New Way to Measure Expert Reasoning

The benchmark organizes investment philosophy into an eight-layer cognitive taxonomy, ranging from principle identification (L1) to novel framework extrapolation (L8). To evaluate performance, the researchers developed the Benchmark Automated Scoring Pipeline (BASP), which uses five algorithmic metrics (OGRS, KCCS, SAP@k, IVP, CKCA) to produce a quantitative score, so that grading is reproducible at scale rather than dependent on case-by-case human judgment. They also created the Failure Mode Detection Protocol (FMDP), a set of computable rules for six failure modes, such as "Temporal Conflation" (mixing up an investor's changing views over time) and "Kill Criterion Omission" (failing to stop an analysis when a rule is violated).
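
To make the idea of a "computable rule" concrete, here is a minimal sketch of how a Kill Criterion Omission check might look. It assumes the model's answer has already been parsed into an ordered trace of (gate, outcome) pairs. The trace format, gate names, and function name are hypothetical illustrations, not the FMDP rules defined in the paper.

```python
def detect_kill_criterion_omission(
    trace: list[tuple[str, str]],
    kill_gates: set[str],
) -> bool:
    """Return True if the analysis continues after a kill gate has failed.

    `trace` is the ordered list of (gate_name, outcome) steps extracted from
    a model answer; `kill_gates` names the gates that must end the analysis
    when their outcome is "fail". Both are assumed formats for illustration.
    """
    for position, (gate, outcome) in enumerate(trace):
        if gate in kill_gates and outcome == "fail":
            # Any step recorded after the failed kill gate means the model
            # kept analysing when it should have stopped.
            return position < len(trace) - 1
    return False


# Hypothetical gate names, chosen only for this example.
KILL_GATES = {"excessive_leverage"}

stopped_trace = [
    ("understandable_business", "pass"),
    ("excessive_leverage", "fail"),
]
continued_trace = stopped_trace + [("valuation", "pass")]

flag_stopped = detect_kill_criterion_omission(stopped_trace, KILL_GATES)
flag_continued = detect_kill_criterion_omission(continued_trace, KILL_GATES)
print(flag_stopped, flag_continued)
```

A rule of this shape is cheap to run over every answer. It is also easy to make too strict, which fits the abstract's remark that the failure-mode detector is sensitive but over-flags.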

The Procedural Gap

A key finding of the study is that composite scoring can be misleading. On the BASP composite, frontier models appear to perform very well, with scores that saturate near the ceiling (Claude L4 = 0.932). However, a more granular metric, Gate Reconstruction Accuracy (GRA), checks whether the model correctly followed each step of an investor's framework on questions that have gold reasoning programs, and it reveals a significant "procedural deficit": frontier GRA is approximately 0.77 at L4 and 0.57-0.62 at L7. In short, models can produce fluent, professional-sounding prose that hides a failure to follow the actual logic of the investment framework.
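
A per-gate metric of this kind can be illustrated with a short sketch. It assumes GRA is the fraction of gates in a gold reasoning program whose outcome the model reproduced. That definition, the gate schema, and the gate names are assumptions made for illustration, and the paper's exact formula may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One step of a gold reasoning program (hypothetical schema)."""
    name: str
    expected: str  # gold outcome, e.g. "pass", "stop", or "skipped"


def gate_reconstruction_accuracy(gold: list[Gate], predicted: dict[str, str]) -> float:
    """Fraction of gold gates whose outcome the model reproduced.

    A gate missing from `predicted` counts as wrong. This is an assumed
    definition for illustration, not the paper's formula.
    """
    if not gold:
        raise ValueError("gold program must contain at least one gate")
    correct = sum(1 for gate in gold if predicted.get(gate.name) == gate.expected)
    return correct / len(gold)


# Hypothetical gold program: the third gate is a kill criterion that fires,
# so the fourth gate should never be evaluated.
gold_program = [
    Gate("circle_of_competence", "pass"),
    Gate("durable_moat", "pass"),
    Gate("leverage_kill_criterion", "stop"),
    Gate("margin_of_safety", "skipped"),
]

# A fluent answer that gets the early gates right but ignores the stop.
model_answer = {
    "circle_of_competence": "pass",
    "durable_moat": "pass",
    "leverage_kill_criterion": "pass",  # missed the kill criterion
    "margin_of_safety": "pass",         # continued past the stop
}

gra = gate_reconstruction_accuracy(gold_program, model_answer)
print(f"GRA = {gra:.2f}")
```

In this toy case the answer could still read as a complete, well-argued analysis, yet half of the gates are wrong. This is the kind of gap a composite score can hide.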

Understanding Model Limitations

The research highlights that investment philosophy is qualitatively different from standard document retrieval or financial calculation. For example, a model might correctly identify a concept like "margin of safety" but apply it incorrectly because it uses a definition from the wrong investor. The study suggests that current models struggle to move from "declarative" knowledge (knowing facts) to "procedural" knowledge (knowing how to apply those facts in a specific sequence). This gap explains why models might succeed at simple tasks but struggle when asked to apply a complex, multi-step framework to a new scenario.

Important Considerations

This release of InvestPhilBench is primarily a contribution to methodology and dataset construction. The empirical results provided in the paper are based on a preliminary "sanity wave" of testing, which the authors note contains some limitations, such as the use of mixed judges for different models. The researchers emphasize that the current composite scores should be viewed as "confounded upper bounds" rather than final rankings. Future versions of the benchmark are expected to include more rigorous, de-confounded leaderboards and additional testing conditions to further refine how we measure the reasoning capabilities of AI in the financial domain.
