The Stanford EDGAR Filings Dataset: Reconstructing...

The Stanford EDGAR Filings Dataset (SEFD) is a project designed to transform the U.S. Securities and Exchange Commission’s (SEC) public archive of corporate filings into a high-quality, structured resource for training large language models (LLMs). The EDGAR database contains millions of financial documents, but their raw format is often messy, inconsistent, and difficult for models to parse. By reconstructing these filings into a clean, layout-faithful format called MultiMarkdown, the researchers have created a token-efficient dataset that preserves the visual and structural cues, such as tables and indentation, that are essential for accurate financial reasoning and document analysis.

Reconstructing Complex Financial Data

The primary challenge in using SEC filings for AI training is that they are designed for human visual consumption rather than machine readability. Filers often use "layout engineering," such as splitting numbers across multiple table cells to force them to align visually in a web browser. Standard text-extraction tools often break these structures, leading to lost meaning or incorrect data. SEFD uses a "visual-first" approach, treating documents as a 2D grid to reassemble fragmented text and tables. By converting these into MultiMarkdown, the dataset maintains the logical relationships between financial figures, headers, and document sections while significantly reducing the number of tokens required to represent the information.
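To make the "layout engineering" problem concrete, here is a minimal Python sketch of how one table row whose figures were split across cells might be reassembled into logical cells. It is a simplified one-dimensional heuristic, not the 2D-grid reconstruction that SEFD itself uses; the function names `merge_row` and `to_markdown_row` and the sets of prefix and suffix fragments are assumptions made for illustration.

```python
# Hypothetical helper names; this is NOT the SEFD pipeline, only an illustration.

PREFIX_FRAGMENTS = {"$", "(", "$("}   # assumed: fragments that belong to the next value
SUFFIX_FRAGMENTS = {")", "%"}         # assumed: fragments that belong to the previous value


def merge_row(cells):
    """Merge layout-only fragments of one table row into logical cells."""
    merged = []
    pending_prefix = ""
    for cell in cells:
        text = cell.strip()
        if not text:
            continue  # empty spacer cell used only for visual alignment
        if text in PREFIX_FRAGMENTS:
            pending_prefix += text  # hold until the value it belongs to arrives
            continue
        if text in SUFFIX_FRAGMENTS and merged:
            merged[-1] += text  # attach to the value just emitted
            continue
        merged.append(pending_prefix + text)
        pending_prefix = ""
    if pending_prefix:
        merged.append(pending_prefix)  # keep a dangling prefix rather than drop it
    return merged


def to_markdown_row(cells):
    """Render logical cells as one pipe-delimited Markdown table row."""
    return "| " + " | ".join(cells) + " |"


raw = ["Net loss", "", "$", "(1,234", ")", "", "12.5", "%"]
print(to_markdown_row(merge_row(raw)))
```

Eight raw cells collapse into three logical ones here, which is also where the token savings come from: the spacer cells and markup that existed only for visual alignment disappear.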

Handling Diverse Filing Formats

Because the EDGAR archive spans decades, it contains a mix of legacy plaintext, HTML, XML, and PDF files. The SEFD pipeline employs specialized strategies for each:

Plaintext: Legacy documents are preserved as-is within code blocks to maintain their original fixed-width alignment.
HTML: The system reconstructs the rendered coordinate system rather than just reading the underlying code, allowing it to fix fragmented headers and tables.
XML: The pipeline maps complex schema-based disclosures into a readable format, capturing data from ownership reports and fund filings.
PDF: Using an OCR system, the researchers convert visual attachments and "glossy" reports into the same MultiMarkdown format used for the rest of the corpus, ensuring consistency across the entire dataset.
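The per-format routing described above could be sketched roughly as follows. This is an illustration only: the extension-to-strategy table, `pick_strategy`, and `wrap_plaintext` are hypothetical names, and only the plaintext strategy (preserving fixed-width text inside a code block) is simple enough to show in full.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to strategy name (not taken from SEFD).
STRATEGIES = {
    ".txt": "plaintext",
    ".htm": "html",
    ".html": "html",
    ".xml": "xml",
    ".pdf": "pdf",
}


def pick_strategy(filename):
    """Return the name of the conversion strategy for a filing document."""
    suffix = PurePosixPath(filename.lower()).suffix
    return STRATEGIES.get(suffix, "unknown")


def wrap_plaintext(text):
    """Wrap legacy fixed-width text in a fenced code block, unchanged."""
    fence = "`" * 3
    # If the text already contains a run of backticks as long as the fence,
    # lengthen the fence so the block cannot be closed early.
    while fence in text:
        fence += "`"
    body = text.rstrip("\n")
    return fence + "\n" + body + "\n" + fence


print(pick_strategy("annual-report.HTM"))
print(wrap_plaintext("ASSETS        1,000\nLIABILITIES     400\n"))
```

Keeping the plaintext body byte-for-byte inside the fence is what preserves the column alignment; any whitespace normalization would destroy it.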

Benchmarking Financial Intelligence

To ensure the dataset is useful for real-world applications, the researchers introduced two benchmarks. The first, EDGAR-Forecast, tests an AI’s ability to predict future financial outcomes by providing it with five years of historical filings and asking it to forecast data from reports released after the model’s knowledge cutoff. The second, EDGAR-OCR, evaluates how well models can transcribe complex financial tables into HTML. Together, these tools demonstrate that SEFD is not just a collection of text, but a robust foundation for training models to perform tasks like compliance review, accounting analysis, and agentic financial reasoning.
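As a rough illustration of the EDGAR-Forecast setup, the sketch below splits a company's filings into a five-year context window ending at a model's knowledge cutoff and a set of later filings to forecast. The `Filing` record, the function names, and the exact window boundaries are assumptions; the benchmark's actual construction may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Filing:
    form: str    # e.g. "10-K"
    filed: date  # filing date


def years_before(d, years):
    """Return the date `years` years before `d`, mapping Feb 29 to Feb 28."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:  # Feb 29 has no counterpart in a non-leap year
        return d.replace(year=d.year - years, day=28)


def split_for_forecast(filings, cutoff, history_years=5):
    """Split filings into (context, targets) around a knowledge cutoff.

    context: filings filed within `history_years` years up to the cutoff.
    targets: filings filed strictly after the cutoff.
    """
    start = years_before(cutoff, history_years)
    ordered = sorted(filings, key=lambda f: f.filed)
    context = [f for f in ordered if start <= f.filed <= cutoff]
    targets = [f for f in ordered if f.filed > cutoff]
    return context, targets


filings = [
    Filing("10-K", date(2015, 3, 1)),
    Filing("10-K", date(2020, 3, 1)),
    Filing("10-K", date(2023, 3, 1)),
    Filing("10-Q", date(2024, 8, 1)),
]
context, targets = split_for_forecast(filings, cutoff=date(2024, 1, 1))
print([f.filed.year for f in context], [f.filed.year for f in targets])
```

Requiring targets to fall strictly after the cutoff is the point of the design: the model cannot have seen the answers during training, so the task measures forecasting rather than recall.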

Key Characteristics

The resulting SEFD-v1 snapshot provides 152 billion tokens of high-quality data, with a larger archive estimated at 550 billion tokens. A notable advantage of this dataset is its minimal overlap with common web-based training corpora, meaning it provides unique, domain-specific knowledge that is not already saturated in existing models. By focusing on "quality over quantity," the researchers aim to provide a more efficient way to train models, allowing for better performance in financial and business contexts without the need for massive, indiscriminate data scaling.
