ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows

Key Takeaways

  • Table processing, including cleaning, transformation, augmentation, and matching, is a foundational yet error-prone stage in real-world data pipelines.
  • Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios.
  • These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.
Paper Abstract

Table processing, including cleaning, transformation, augmentation, and matching, is a foundational yet error-prone stage in real-world data pipelines. While recent LLM-based approaches show promise for automating such tasks, they often struggle in practice due to ambiguous instructions, complex task structures, and the lack of structured feedback, resulting in syntactically correct but semantically flawed code. To address these challenges, we propose ProfiliTable, an autonomous multi-agent framework centered on dynamic profiling, which constructs and iteratively refines a unified execution context through interactive exploration, knowledge-augmented synthesis, and feedback-driven refinement. ProfiliTable integrates (i) a Profiler that performs ReAct-style data exploration to build semantic understanding, (ii) a Generator that retrieves curated operators to synthesize task-aware code, and (iii) an Evaluator-Summarizer loop that injects execution scores and diagnostic insights to enable closed-loop refinement. Extensive experiments on a diverse benchmark covering 18 tabular task types demonstrate that ProfiliTable consistently outperforms strong baselines, particularly in complex multi-step scenarios. These results highlight the critical role of dynamic profiling in reliably translating ambiguous user intents into robust and governance-compliant table transformations.

ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
Table processing—tasks like cleaning, transforming, and matching data—is a critical but notoriously difficult part of data science. While Large Language Models (LLMs) have shown potential for automating these steps, they often struggle with ambiguous instructions and complex data structures, frequently producing code that runs without errors but fails to achieve the intended result. ProfiliTable addresses this by introducing an autonomous, multi-agent framework that treats data profiling not as a one-time step, but as an ongoing, iterative process of discovery and refinement.

A Dynamic Approach to Data Understanding

Unlike traditional methods that rely on static rules or limited data sampling, ProfiliTable uses a "dynamic profiling" paradigm. The system employs a specialized Profiler agent that performs ReAct-style exploration—actively interrogating the data to test hypotheses. By sampling and inspecting actual cell values rather than just relying on column headers, the agent builds a deep semantic understanding of the table. This allows it to resolve ambiguities, such as determining how to standardize a currency column by identifying the specific symbols present in the data.
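The currency example can be made concrete with a small profiling step. The sketch below is illustrative only and is not taken from the paper: the sample cells, the marker pattern, and the function name `profile_currency_markers` are all assumptions. It shows the kind of evidence an exploration step might collect by reading cell values instead of trusting the column header.

```python
import re
from collections import Counter

# Hypothetical sample of raw cell values from a "price" column.
PRICE_CELLS = ["$1,200.50", "€950", "$30", "1 100 USD", "£75.00", None, "$8"]

# Illustrative subset of currency markers: a few symbols and ISO-style codes.
CURRENCY_PATTERN = re.compile(r"[$€£¥]|\b(?:USD|EUR|GBP|JPY)\b")


def profile_currency_markers(cells):
    """Count which currency markers actually occur in the sampled cells."""
    counts = Counter()
    for cell in cells:
        if cell is None:
            counts["<missing>"] += 1
            continue
        match = CURRENCY_PATTERN.search(str(cell))
        # Record the first marker found, or flag the cell as unmarked.
        counts[match.group(0) if match else "<none>"] += 1
    return counts


profile = profile_currency_markers(PRICE_CELLS)
print(profile.most_common())
```

A profile like this tells the downstream code generator that the column mixes several currencies and contains missing values, so a single "strip the dollar sign" rule would silently corrupt some rows.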

The Multi-Agent Workflow

The framework coordinates several specialized agents to manage the entire processing pipeline:

  • Interpreter: Analyzes the user's request to determine if the task is simple or requires a multi-step approach.

  • Decompositer: Breaks complex, multi-step instructions into smaller, manageable subtasks.

  • Generator: Uses Retrieval-Augmented Generation (RAG) to pull pre-validated code templates from a library, ensuring that the generated scripts are based on reliable, domain-specific practices.

  • Evaluator and Summarizer: These agents form a closed-loop feedback system. The Evaluator checks if the code produces the correct output, while the Summarizer inspects the results to provide diagnostic insights. If a task is not completed correctly, this feedback is fed back into the system to guide the next iteration of code generation.
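The closed-loop refinement described in the last bullet can be sketched in a few lines. This is a toy illustration under stated assumptions, not the framework's implementation: `generate`, `evaluate`, `summarize`, and `refine` are hypothetical stand-ins, the generator is a hard-coded stub instead of an LLM with retrieved templates, and the evaluator assumes a reference output is available for scoring.

```python
def generate(task, feedback):
    """Stand-in for the Generator: return a transformation function.

    A real system would call an LLM with retrieved code templates; this stub
    only reacts to one specific diagnostic message.
    """
    if feedback and "whitespace" in feedback:
        return lambda rows: [r.strip().lower() for r in rows]
    return lambda rows: [r.lower() for r in rows]


def evaluate(transform, rows, expected):
    """Stand-in for the Evaluator: run the code and score the output in [0, 1]."""
    try:
        output = transform(rows)
    except Exception as exc:
        return 0.0, [], f"execution failed: {exc}"
    hits = sum(o == e for o, e in zip(output, expected))
    return hits / len(expected), output, None


def summarize(output, expected):
    """Stand-in for the Summarizer: turn mismatches into a diagnostic hint."""
    bad = [o for o, e in zip(output, expected) if o != e]
    if any(o != o.strip() for o in bad):
        return "outputs still contain leading or trailing whitespace"
    return "outputs differ from expected values" if bad else "no issues found"


def refine(task, rows, expected, max_rounds=3):
    """Generate, evaluate, and summarize until the score is perfect or rounds run out."""
    feedback, history, output = None, [], []
    for round_no in range(1, max_rounds + 1):
        transform = generate(task, feedback)
        score, output, error = evaluate(transform, rows, expected)
        feedback = error or summarize(output, expected)
        history.append((round_no, score, feedback))
        if score == 1.0:
            break
    return output, history


rows = [" Alice ", "BOB", "carol  "]
expected = ["alice", "bob", "carol"]
final_output, history = refine("normalize names", rows, expected)
for entry in history:
    print(entry)
```

The first attempt runs without error but leaves padding in two values, which is the "runs but is wrong" case the article describes. The diagnostic from the summarizer is what lets the second attempt succeed, and the round limit keeps the loop from running indefinitely.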

Performance and Reliability

ProfiliTable was tested against a new, comprehensive benchmark covering 18 different types of tabular tasks, including cleaning, transformation, augmentation, and matching. The results show that the framework consistently outperforms existing baselines in terms of correctness, completeness, and execution reliability. Notably, the system achieved a 100% task-wise runnable rate, meaning it reliably produces code that executes successfully, which is a vital requirement for real-world production environments.
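As a rough illustration of what a runnable rate measures, the sketch below assumes it is the fraction of tasks whose generated code executes without raising an exception. That definition, the sample table, the snippets, and the helper names are assumptions for illustration; the paper's exact metric may differ.

```python
def runs_cleanly(source, table):
    """Return True if the generated snippet executes without raising."""
    try:
        # Give each snippet its own copy of the table so runs stay independent.
        exec(source, {"table": [dict(row) for row in table]})
        return True
    except Exception:
        return False


def runnable_rate(snippets_by_task, table):
    """Fraction of tasks whose snippet ran to completion."""
    ok = sum(runs_cleanly(src, table) for src in snippets_by_task.values())
    return ok / len(snippets_by_task)


# Hypothetical table and generated snippets, one per task.
TABLE = [{"name": "Alice", "age": "31"}, {"name": "Bob", "age": "n/a"}]
SNIPPETS = {
    "uppercase_names": "result = [r['name'].upper() for r in table]",
    "cast_age": "result = [int(r['age']) for r in table]",  # raises on 'n/a'
    "drop_age": "result = [{'name': r['name']} for r in table]",
}

rate = runnable_rate(SNIPPETS, TABLE)
print(f"runnable rate: {rate:.2f}")
```

A high runnable rate only says the code executed. It says nothing about whether the output is correct, which is why the article reports correctness and completeness alongside it.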

Key Takeaways for Data Processing

The core innovation of ProfiliTable is its ability to bridge the gap between vague human intent and the rigid requirements of tabular data. By combining interactive data exploration with a feedback-driven refinement loop, the system avoids the "one-shot" failure mode common in many LLM-based tools. As a result, the final output is more likely to be not only syntactically correct but also aligned with what the user actually meant, which makes the approach well suited to complex, real-world data curation.
