WorkstreamBench: Evaluating LLM Agents on End-to-En...

Key Takeaways

  • WorkstreamBench is a new evaluation framework designed to test how well AI agents can build complex, professional-grade financial spreadsheets from scratch.
  • LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions.
  • To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch.
  • This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets.
  • Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits.
Paper Abstract

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

WorkstreamBench is a new evaluation framework designed to test how well AI agents can build complex, professional-grade financial spreadsheets from scratch. While previous AI benchmarks have focused on simple tasks like answering a single question or fixing one formula, this research addresses the need for agents that can handle end-to-end financial workflows, such as building multi-sheet financial models for company acquisitions or scenario analysis.

A New Standard for Spreadsheet Quality

In professional finance, a spreadsheet is not just a place to get a final number; it is a collaborative tool that must be readable, auditable, and easy for others to modify. To measure this, the researchers developed a taxonomy that evaluates spreadsheets across three core dimensions:

  • Accuracy: Does the model perform the required calculations and scenario analyses correctly?

  • Formula: Are the formulas robust, interpretable, and free of "hardcoded" values that would break if assumptions changed?

  • Format: Is the presentation professional, readable, and well-structured?
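The three dimensions above can be pictured as a small rubric data structure. The following is a minimal sketch, not the benchmark's actual schema: the criterion names (such as `no_hardcoded_values`), the pass/fail verdicts, and the "fraction of criteria passed" scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    name: str     # hypothetical criterion label, not taken from the paper
    passed: bool  # verdict from a reviewer (human or LLM judge)


@dataclass
class DimensionScore:
    dimension: str  # "Accuracy", "Formula", or "Format"
    criteria: list[Criterion] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of criteria passed; 0.0 when no criteria are recorded."""
        if not self.criteria:
            return 0.0
        return sum(c.passed for c in self.criteria) / len(self.criteria)


def summarize(dimensions: list[DimensionScore]) -> dict[str, float]:
    """Report one score per dimension instead of collapsing to a single number."""
    return {d.dimension: d.score() for d in dimensions}


# Illustrative review of a single spreadsheet (all verdicts are made up).
review = [
    DimensionScore("Accuracy", [
        Criterion("final_value_correct", True),
        Criterion("scenario_outputs_correct", False),
    ]),
    DimensionScore("Formula", [
        Criterion("no_hardcoded_values", False),
        Criterion("references_assumption_cells", True),
    ]),
    DimensionScore("Format", [
        Criterion("consistent_number_formats", True),
    ]),
]

report = summarize(review)
print(report)
```

Keeping the dimensions separate in the report mirrors the point made above: a spreadsheet can reach the right number and still fail on formula robustness or presentation.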

Evaluating AI with an "LLM-as-Judge"

Because these quality standards are nuanced and difficult to verify with simple automated checks, the researchers created an "LLM-as-judge" pipeline. This judge acts as an expert reviewer, analyzing the spreadsheets produced by AI agents against a detailed rubric. The researchers validated this approach by comparing the judge’s feedback against human expert annotations. They found that the judge is highly effective at identifying subtle issues, such as "off-by-one" errors or hardcoded values used in place of dynamic formulas, that traditional automated testing would miss.
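One common way to compare a judge against human annotations is to compute the raw agreement rate alongside Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes per-criterion "pass"/"fail" labels and uses made-up data; the summary does not say which agreement metric the researchers actually used, so this is an illustration of the general idea rather than their validation procedure.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Share of items on which the judge and the human annotator agree."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge) | set(human)
    # Chance agreement if both raters labeled independently at their own base rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six rubric criteria.
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "pass", "pass", "fail", "fail"]

print(agreement_rate(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
```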

Performance of Current AI Agents

The study tested several leading AI agents on tasks ranging from simple exercises to complex financial modeling. The results show that while the Claude family of models currently leads the benchmark, even the most advanced agents struggle to meet professional standards. A common failure point is the tendency to "hardcode" results: the agent provides the correct final number without showing the underlying logic or formulas. This makes the resulting spreadsheets unusable in professional settings, where managers need to trace and verify the steps taken to reach a conclusion.
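To make the hardcoding failure concrete, here is a rough heuristic check, written as a sketch under stated assumptions. It treats a workbook as a plain dictionary from cell address to raw content, assumes the caller knows which cells are supposed to be computed (`output_cells`), and allows the literals 0 and 1 inside formulas. None of this comes from the paper, whose judge is an LLM rather than a rule like this one.

```python
import re

# A numeric literal that is not the row part of a cell reference (A1, $B$2)
# and not the tail of a longer number or a name such as LOG10.
NUMERIC_LITERAL = re.compile(r"(?<![A-Za-z$\d.])\d+(?:\.\d+)?")

# Assumption: 0 and 1 are tolerated inside formulas, e.g. =B2*(1+B1).
ALLOWED_LITERALS = {"0", "1"}


def find_hardcodes(cells: dict[str, str], output_cells: set[str]) -> list[tuple]:
    """Flag typed-in constants where a formula is expected, and literals inside formulas.

    Known limitations of this sketch: digits inside quoted strings and
    whole-row ranges such as 1:3 are also reported as literals.
    """
    findings = []
    for address, content in cells.items():
        if content.startswith("="):
            literals = [m for m in NUMERIC_LITERAL.findall(content)
                        if m not in ALLOWED_LITERALS]
            if literals:
                findings.append((address, "literal_in_formula", literals))
        elif address in output_cells:
            findings.append((address, "constant_in_output_cell", [content]))
    return findings


# Hypothetical sheet: B1 and B2 are assumptions, B3 to B5 should be computed.
cells = {
    "B1": "0.08",         # growth-rate assumption (an input, so a constant is fine)
    "B2": "1000",         # base revenue (input)
    "B3": "=B2*(1+B1)",   # dynamic: follows the assumption cell
    "B4": "=B2*1.08",     # growth rate baked into the formula
    "B5": "1166.4",       # result typed in as a number
}
findings = find_hardcodes(cells, output_cells={"B3", "B4", "B5"})
for finding in findings:
    print(finding)
```

In this example B4 and B5 are flagged while B3 is not: changing the growth rate in B1 would update B3 but silently leave B4 and B5 stale, which is the auditability problem described above.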

The Challenge of Complexity

The research highlights a significant performance gap: as the complexity of the task increases, the quality of the agents' output degrades sharply. While agents are getting better at performing individual calculations, they are not yet capable of reliably producing the multi-sheet, interconnected workbooks required for real-world financial analysis. The findings suggest that current AI agents still lack the structural understanding necessary to handle the high-stakes, collaborative nature of professional financial modeling.
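One rough way to picture task complexity is the length of the longest chain of formulas that feed into one another. The sketch below computes that depth for a single sheet represented as a dictionary of cell contents. It is an illustrative proxy only: the summary does not state how the benchmark defines difficulty, and the helper names here are hypothetical.

```python
import re

# Same-sheet A1-style references only. Range endpoints are followed, but cells
# inside a range are not expanded, and a name like LOG10 would be misread as a
# cell. Both are accepted limitations of this sketch.
CELL_REF = re.compile(r"\$?([A-Z]{1,3})\$?(\d+)")


def references(formula: str) -> set[str]:
    """Cell addresses mentioned in a formula, with any $ anchors stripped."""
    return {col + row for col, row in CELL_REF.findall(formula)}


def chain_depth(cells: dict[str, str]) -> int:
    """Longest chain of dependent formulas; plain input cells have depth 0."""
    depth: dict[str, int] = {}

    def visit(address: str, stack: set[str]) -> int:
        if address in depth:
            return depth[address]
        if address in stack:
            raise ValueError("circular reference at " + address)
        content = cells.get(address, "")
        if not content.startswith("="):
            depth[address] = 0  # constant, blank, or unknown cell
            return 0
        stack.add(address)
        result = 1 + max((visit(ref, stack) for ref in references(content)), default=0)
        stack.discard(address)
        depth[address] = result
        return result

    return max((visit(address, set()) for address in cells), default=0)


# Hypothetical workbook: A4 depends on A3, which depends on A2, which depends on A1.
workbook = {
    "A1": "100",
    "A2": "=A1*2",
    "A3": "=A2+A1",
    "A4": "=A3*A2",
}
print(chain_depth(workbook))
```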
