BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents
Financial research is rarely about finding a single number; it is about building a defensible, auditable argument. When an analyst calculates a figure, they must be able to explain which sources they used, what accounting adjustments they applied, and how they arrived at their conclusion. Existing benchmarks often fail to capture this, focusing instead on simple fact retrieval or final answers. BigFinanceBench addresses this gap by evaluating the entire "derivation" (the step-by-step workflow an agent follows to reach a conclusion) rather than just the final output.
A New Standard for Financial Reasoning
The benchmark consists of 928 open-ended, expert-authored financial research tasks. These questions were created by investment banking and private equity professionals to mimic real-world work. Unlike typical benchmarks that provide a single document to analyze, these tasks require agents to navigate multiple sources, apply accounting judgments, and perform complex calculations. Each task is paired with a point-weighted rubric that breaks the derivation down into independently checkable steps, such as identifying the correct ticker, selecting the right fiscal period, and applying specific accounting adjustments.
Evaluating the Full Workflow
To grade an agent, the benchmark examines the entire "trajectory" of its work, including tool calls, intermediate calculations, and the final answer. Because the rubric is point-weighted, the grader can award partial credit for correct intermediate steps even if the final answer is wrong. This allows researchers to pinpoint where an agent fails: in retrieving the right data, in understanding an accounting definition, or in executing the final math. The result is a much finer-grained view of performance than traditional "right or wrong" grading.
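To make the partial-credit idea concrete, here is a minimal sketch of point-weighted rubric scoring in Python. It is not the benchmark's actual grader: the `RubricItem` type, the function names, the step descriptions, and the point values are all hypothetical, chosen only to illustrate how a trajectory can earn most of the available points while still missing the final answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One independently checkable step (hypothetical structure)."""
    description: str
    points: float
    passed: bool


def rubric_score(items):
    """Return earned points divided by total available points."""
    total = sum(item.points for item in items)
    if total <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(item.points for item in items if item.passed)
    return earned / total


def failed_steps(items):
    """List the descriptions of steps that did not pass, in rubric order."""
    return [item.description for item in items if not item.passed]


# Illustrative rubric: every intermediate step passes, the final figure does not.
items = [
    RubricItem("Identifies the correct ticker", 1, True),
    RubricItem("Selects the right fiscal period", 2, True),
    RubricItem("Applies the accounting adjustment", 3, True),
    RubricItem("Reports the correct final figure", 4, False),
]

score = rubric_score(items)               # 6 of 10 points earned
final_answer_correct = items[-1].passed   # False under binary grading
print(f"rubric score: {score:.1%}; failed: {failed_steps(items)}")
```

Under binary grading this trajectory would score zero, while the rubric credits the three correct intermediate steps and names the one step that failed.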
Key Findings and Model Performance
When testing ten current frontier and open-weight agents, the researchers found significant room for improvement. The best-performing systems achieved a rubric score of only 58.8%, indicating that even the most advanced models struggle with the nuances of professional financial research. The study also revealed that final-answer accuracy is a "lossy" proxy for quality: models often get the right answer for the wrong reasons, or miss the right answer despite performing most of the derivation correctly.
Specialization Across Workflows
The results show that no single model dominates every type of financial task. Different models exhibit "orthogonal strengths," meaning one model might excel at earnings quality analysis while another is better at M&A or valuation tasks. Because these models have different areas of expertise, the researchers found that using a simple router to direct questions to the most capable model for a specific workflow can significantly improve overall performance compared to relying on a single system.
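The routing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' router: the model names, workflow labels, and scores below are invented for the example, and the router simply picks the model with the highest score on each workflow.

```python
def best_model_per_workflow(scores):
    """Map each workflow to the model with the highest score on it.

    `scores` has the shape {model_name: {workflow: rubric_score}}.
    """
    workflows = next(iter(scores.values())).keys()
    return {w: max(scores, key=lambda m: scores[m][w]) for w in workflows}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Made-up numbers, purely illustrative; not results from the benchmark.
scores = {
    "model_a": {"earnings_quality": 0.62, "m_and_a": 0.41, "valuation": 0.50},
    "model_b": {"earnings_quality": 0.45, "m_and_a": 0.60, "valuation": 0.52},
}

routing = best_model_per_workflow(scores)

# Average score when each workflow goes to its routed model.
routed_avg = mean(scores[model][w] for w, model in routing.items())

# Best average achievable by committing to one model for everything.
best_single_avg = max(mean(per_workflow.values()) for per_workflow in scores.values())

print(routing)
print(f"routed: {routed_avg:.3f}, best single model: {best_single_avg:.3f}")
```

Choosing the route on the same scores used to evaluate it makes the gain look optimistic; a fairer version of this sketch would pick the routing table on one split of tasks and measure it on another.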