InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy
Large language models (LLMs) are increasingly used as investment research assistants, yet there has been no standardized way to test whether they truly understand the specific, step-by-step decision frameworks used by expert investors. This paper introduces InvestPhilBench, a benchmark designed to evaluate whether models can accurately reconstruct and apply these expert frameworks. Rather than testing general financial knowledge alone, the benchmark focuses on "procedural reasoning": the ability to follow sequential rules, such as identifying "kill criteria" that should immediately stop an investment analysis.
A New Way to Measure Expert Reasoning
The benchmark organizes investment philosophy into an eight-layer cognitive taxonomy, ranging from simple factual identification (L1) to complex, generative framework application (L8). To evaluate performance, the researchers developed the Benchmark Automated Scoring Pipeline (BASP). This system uses five algorithmic metrics to provide a quantitative score, replacing subjective human grading. Additionally, the researchers created the Failure Mode Detection Protocol (FMDP) to automatically identify specific errors, such as "Temporal Conflation" (mixing up an investor's changing views over time) or "Kill Criterion Omission" (failing to stop an analysis when a rule is violated).
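As a rough illustration of what a "Kill Criterion Omission" check could look like, the sketch below flags a model trace that keeps going after a kill criterion has failed. This is not the paper's FMDP implementation, which this summary does not describe. The `Gate` type, the function names, and the example gate names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gate:
    """One step of an investor's framework (hypothetical representation)."""
    name: str
    is_kill_criterion: bool
    passed: bool  # ground-truth outcome of this gate for the scenario


def expected_stop_index(gates: List[Gate]) -> Optional[int]:
    """Index of the first failed kill criterion, or None if no gate halts the analysis."""
    for i, gate in enumerate(gates):
        if gate.is_kill_criterion and not gate.passed:
            return i
    return None


def has_kill_criterion_omission(gates: List[Gate], model_steps: List[str]) -> bool:
    """True if the model's trace includes any gate that comes after the required stop."""
    stop = expected_stop_index(gates)
    if stop is None:
        return False
    later_gates = {gate.name for gate in gates[stop + 1:]}
    return any(step in later_gates for step in model_steps)


# Assumed example: the second gate is a kill criterion that fails.
framework = [
    Gate("understandable_business", is_kill_criterion=True, passed=True),
    Gate("leverage_check", is_kill_criterion=True, passed=False),
    Gate("valuation", is_kill_criterion=False, passed=True),
]

continued_trace = ["understandable_business", "leverage_check", "valuation"]
halted_trace = ["understandable_business", "leverage_check"]

omission_when_continued = has_kill_criterion_omission(framework, continued_trace)
omission_when_halted = has_kill_criterion_omission(framework, halted_trace)
```

Under these assumptions, the trace that proceeds to valuation is flagged and the trace that halts at the failed leverage check is not.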
The Procedural Gap
A key finding of the study is that standard composite scoring can be misleading. On overall performance, frontier models appear to do very well, with scores that seem to "saturate" or hit a ceiling. However, when the researchers applied a more granular metric called Gate Reconstruction Accuracy (GRA), which checks whether the model correctly followed each specific step of an investor's framework, a significant "procedural deficit" emerged. In short, models can often produce fluent, professional-sounding prose that hides a fundamental failure to follow the actual logic of the investment framework.
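To make the idea of a gate-level score concrete, here is a minimal sketch in which a response that skips a single step scores poorly when checked gate by gate. The paper's exact GRA definition is not given in this summary. The position-wise matching rule, the function name, and the gate names below are illustrative assumptions only.

```python
from typing import Sequence


def gate_reconstruction_accuracy(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of reference gates reproduced at the same position (assumed definition)."""
    if not reference:
        raise ValueError("reference framework must contain at least one gate")
    matches = sum(
        1
        for i, gate in enumerate(reference)
        if i < len(predicted) and predicted[i] == gate
    )
    return matches / len(reference)


# Hypothetical four-gate framework; the names are invented for illustration.
reference_gates = ["circle_of_competence", "moat", "management", "margin_of_safety"]

# A response that omits the second gate, which shifts every later gate out of position.
model_gates = ["circle_of_competence", "management", "margin_of_safety"]

gra = gate_reconstruction_accuracy(reference_gates, model_gates)
```

In this toy case the response mentions three of the four concepts, yet only the first gate sits in its correct position, so the assumed gate-level score is 0.25. A stricter or more lenient matching rule would give a different number. The sketch shows only why a step-by-step check can separate responses that a holistic score treats as similar.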
Understanding Model Limitations
The research highlights that investment philosophy is qualitatively different from standard document retrieval or financial calculation. For example, a model might correctly identify a concept like "margin of safety" but apply it incorrectly because it uses a definition from the wrong investor. The study suggests that current models struggle to move from "declarative" knowledge (knowing facts) to "procedural" knowledge (knowing how to apply those facts in a specific sequence). This gap explains why models might succeed at simple tasks but struggle when asked to apply a complex, multi-step framework to a new scenario.
Important Considerations
This release of InvestPhilBench is primarily a contribution to methodology and dataset construction. The empirical results in the paper come from a preliminary "sanity wave" of testing, which the authors note has limitations, such as the use of mixed judges for different models. The researchers emphasize that the current composite scores should be viewed as "confounded upper bounds" rather than final rankings. Future versions of the benchmark are expected to include more rigorous, de-confounded leaderboards and additional testing conditions to further refine how the reasoning capabilities of AI in the financial domain are measured.