Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Paper Abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows potential for simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.

Spreadsheet-RL is a new framework designed to transform Large Language Models (LLMs) into specialized agents capable of performing complex, multi-step tasks in Microsoft Excel. While existing AI agents often rely on simple prompting to perform basic spreadsheet operations, they frequently struggle with the intricate, real-world workflows required in professional settings. This research introduces an end-to-end reinforcement learning (RL) approach that trains agents to interact with a real spreadsheet environment, significantly improving their ability to handle professional data tasks.

Building a Specialized Spreadsheet Environment

The core of this framework is the "Spreadsheet Gym," a multi-turn environment that lets an AI agent interact directly with Microsoft Excel through a Python sandbox. Unlike earlier approaches that may rely on simplified or simulated spreadsheet interfaces, this environment supports advanced Excel features such as dynamic array formulas. To guide the agent, the researchers developed a "spreadsheet-native" harness: a tool set of specific, structured commands (such as inspecting ranges, filling formulas, or clearing cells) paired with tool-routing rules, so the agent does not have to rely on generic code alone. This structure helps the agent follow a logical workflow: inspect the data, plan the edit, execute the change, and verify the result.
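
To make the idea of structured tools concrete, here is a minimal sketch of what a tool harness with an inspect, execute, verify loop could look like. It is not the Spreadsheet Gym API: the sheet is a toy in-memory dictionary rather than Excel, and every name (`inspect_range`, `fill_formula`, `clear_range`, `call_tool`, `TOOLS`) is a hypothetical stand-in chosen for illustration.

```python
from typing import Callable, Dict, List

# Toy stand-in for a worksheet (assumption): A1-style address -> value or formula text.
Sheet = Dict[str, object]


def cells_in_range(a1_range: str) -> List[str]:
    """Expand a single-column, single-letter range such as "B1:B3" into addresses."""
    start, end = a1_range.split(":")
    column = start[0]
    return [f"{column}{row}" for row in range(int(start[1:]), int(end[1:]) + 1)]


def inspect_range(sheet: Sheet, a1_range: str) -> Dict[str, object]:
    """Read-only tool: return the current contents of each cell in the range."""
    return {addr: sheet.get(addr) for addr in cells_in_range(a1_range)}


def fill_formula(sheet: Sheet, a1_range: str, template: str) -> None:
    """Write the template into each cell, substituting "{row}" with the row number."""
    for addr in cells_in_range(a1_range):
        sheet[addr] = template.format(row=addr[1:])


def clear_range(sheet: Sheet, a1_range: str) -> None:
    """Remove any contents from the cells in the range."""
    for addr in cells_in_range(a1_range):
        sheet.pop(addr, None)


# Hypothetical routing table: the agent names a tool, the harness dispatches it.
TOOLS: Dict[str, Callable] = {
    "inspect_range": inspect_range,
    "fill_formula": fill_formula,
    "clear_range": clear_range,
}


def call_tool(sheet: Sheet, name: str, **kwargs):
    """Reject unknown tool names instead of executing arbitrary code."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](sheet, **kwargs)


sheet: Sheet = {"A1": 10, "A2": 20, "A3": 30}
before = call_tool(sheet, "inspect_range", a1_range="A1:A3")  # inspect
call_tool(sheet, "fill_formula", a1_range="B1:B3", template="=A{row}*2")  # execute
after = call_tool(sheet, "inspect_range", a1_range="B1:B3")  # verify
```

The point of the routing table is that the agent picks from a small set of named, typed operations, which makes each step easy to check before and after it runs. A real harness would evaluate formulas in Excel; this sketch only stores the formula text.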

Automated Data Collection

Training an effective RL agent requires a large volume of high-quality examples, which are traditionally difficult and expensive to gather. To solve this, the researchers created an automated "Spreadsheet Data Agent." This system scrapes real-world spreadsheet problems and solutions from online forums, then uses powerful coding models to generate the correct "oracle" final spreadsheets. This process creates a scalable pipeline of initial-to-final spreadsheet pairs, allowing the model to learn from realistic, complex scenarios rather than just simple, synthetic exercises.
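
One way to picture the output of such a pipeline is as a list of task records, each holding an instruction, an initial sheet, and an oracle sheet. The sketch below is an illustration only: the `SpreadsheetTask` record, its field names, and the `keep_trainable` filter are assumptions, and the summary does not say how the actual pipeline stores or filters its pairs.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy stand-in for a worksheet (assumption): A1-style address -> cell value.
Sheet = Dict[str, object]


@dataclass(frozen=True)
class SpreadsheetTask:
    """Hypothetical record for one initial-to-final spreadsheet pair."""
    instruction: str  # the forum question, rewritten as a task
    start: Sheet      # spreadsheet state shown to the agent
    oracle: Sheet     # reference final state produced by the data agent

    def changed_cells(self) -> List[str]:
        """Addresses whose contents differ between the start and oracle sheets."""
        addresses = set(self.start) | set(self.oracle)
        return sorted(a for a in addresses if self.start.get(a) != self.oracle.get(a))


def keep_trainable(tasks: List[SpreadsheetTask]) -> List[SpreadsheetTask]:
    """Assumed sanity filter: drop pairs whose oracle changes nothing."""
    return [task for task in tasks if task.changed_cells()]


tasks = [
    SpreadsheetTask(
        instruction="Double each value in column A into column B.",
        start={"A1": 1, "A2": 2},
        oracle={"A1": 1, "A2": 2, "B1": 2, "B2": 4},
    ),
    SpreadsheetTask(instruction="No-op example.", start={"A1": 1}, oracle={"A1": 1}),
]
kept = keep_trainable(tasks)
```

A pair whose oracle equals its starting sheet carries no learning signal for an outcome-based reward, which is why a filter along these lines is a plausible step, though not one confirmed by the source.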

Reinforcement Learning for Better Accuracy

Spreadsheet-RL trains the models with GRPO (Group Relative Policy Optimization), an on-policy reinforcement learning method. Instead of just predicting the next word, the agent is rewarded for the actual outcome of its actions, specifically whether the final spreadsheet it produces matches the correct "oracle" version. With this outcome-based reward, the agent learns to refine its interaction strategy, becoming more efficient and accurate over time. The researchers applied this to the Qwen3 model series and report gains on both general spreadsheet benchmarks and their newly curated "Domain-Spreadsheet" dataset, which covers professional fields like finance, supply chain management, and human resources. For Qwen3-4B-Thinking-2507, Pass@1 rises from 12.0% to 23.4% on SpreadsheetBench and from 8.4% to 17.2% on Domain-Spreadsheet.
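
The two ingredients described above, an outcome reward and a group-relative training signal, can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the binary exact-match reward, the dictionary representation of a sheet, and the function names are assumptions. The advantage formula follows the standard GRPO recipe of normalizing each rollout's reward by the mean and standard deviation of its group; the clipped policy-gradient loss and any KL term are omitted.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Toy stand-in for a worksheet (assumption): A1-style address -> cell value.
Sheet = Dict[str, object]


def outcome_reward(final: Sheet, oracle: Sheet) -> float:
    """Assumed binary reward: 1.0 only if every oracle cell matches, else 0.0."""
    matches = all(final.get(addr) == value for addr, value in oracle.items())
    return 1.0 if matches else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


oracle = {"B1": 20, "B2": 40}
# Four hypothetical rollouts for the same task, each ending in a final sheet.
rollouts = [
    {"B1": 20, "B2": 40},  # correct
    {"B1": 20, "B2": 41},  # wrong value
    {"B1": 20},            # missing cell
    {"B1": 20, "B2": 40},  # correct
]
rewards = [outcome_reward(final, oracle) for final in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that solve the task receive positive advantages and the rest receive negative ones, so the update pushes the policy toward the interaction strategies that actually produced the oracle spreadsheet. When every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient.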

Real-World Impact and Availability

The results demonstrate that RL post-training is a highly effective way to improve an AI’s ability to handle data interfaces. By moving beyond simple prompt engineering, the Spreadsheet-RL framework enables models to perform more reliable, multi-step data manipulation. To support further research, the team is releasing their training data, the Spreadsheet Gym environment, the training pipeline, and the resulting models. This provides an open-source foundation for developers and researchers to build more capable AI agents for everyday professional data work.