ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
Table processing, which covers tasks such as cleaning, transforming, and matching data, is a critical but notoriously difficult part of data science. Large Language Models (LLMs) have shown potential for automating these steps, but they often struggle with ambiguous instructions and complex data structures, and frequently produce code that runs without errors yet fails to achieve the intended result. ProfiliTable addresses this with an autonomous, multi-agent framework that treats data profiling not as a one-time step but as an ongoing, iterative process of discovery and refinement.
A Dynamic Approach to Data Understanding
Unlike traditional methods that rely on static rules or limited data sampling, ProfiliTable uses a "dynamic profiling" paradigm. The system employs a specialized Profiler agent that performs ReAct-style exploration, alternating reasoning steps with actions that interrogate the data to test hypotheses. By sampling and inspecting actual cell values rather than relying on column headers alone, the agent builds a deep semantic understanding of the table. This allows it to resolve ambiguities, such as determining how to standardize a currency column by identifying the specific symbols present in the data.
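The currency example can be made concrete with a small sketch. This is not ProfiliTable's Profiler; it is a minimal illustration, assuming a column held as a plain Python list, of the kind of check such an agent might run: sample the cells and count which currency symbols actually occur. The function name profile_currency_symbols and the sample data are hypothetical.

```python
import random
import unicodedata
from collections import Counter


def profile_currency_symbols(values, sample_size=100, seed=0):
    """Count, per sampled cell, which currency symbols actually appear.

    Hypothetical helper for illustration; not part of ProfiliTable.
    """
    # Keep only non-empty string cells; None and numeric cells are skipped.
    cells = [v for v in values if isinstance(v, str) and v.strip()]
    rng = random.Random(seed)
    sample = cells if len(cells) <= sample_size else rng.sample(cells, sample_size)

    counts = Counter()
    for cell in sample:
        # Unicode category "Sc" covers currency symbols such as $, € and £.
        symbols = {ch for ch in cell if unicodedata.category(ch) == "Sc"}
        if symbols:
            counts.update(symbols)
        else:
            # Cells with no symbol need a separate decision (default currency?).
            counts["<none>"] += 1
    return counts


# Made-up column values, mixing symbols, separators and a missing cell.
prices = ["$1,200.50", "€980", "$15", "£7.99", "1200", None, "€ 45,00"]
profile = profile_currency_symbols(prices)
print(profile.most_common())
```

A profile like this tells the code generator that the column mixes at least three currencies plus unlabeled values, which a header such as "price" alone would not reveal.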
The Multi-Agent Workflow
The framework coordinates several specialized agents to manage the entire processing pipeline:
Interpreter: Analyzes the user's request to determine if the task is simple or requires a multi-step approach.
Decompositer: Breaks complex, multi-step instructions into smaller, manageable subtasks.
Generator: Uses Retrieval-Augmented Generation (RAG) to pull pre-validated code templates from a library, ensuring that the generated scripts are based on reliable, domain-specific practices.
Evaluator and Summarizer: These agents form a closed-loop feedback system. The Evaluator checks whether the code produces the correct output, while the Summarizer inspects the results to provide diagnostic insights. If a task is not completed correctly, these diagnostics are passed back into the system to guide the next iteration of code generation.
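The closed loop between Generator, Evaluator, and Summarizer can be sketched as a plain retry loop. The code below is an assumption-laden illustration, not the framework's implementation: the refine function, its signature, and the toy stand-ins for each agent are all hypothetical, and the real agents are LLM-driven rather than hard-coded.

```python
from typing import Callable, List, Optional, Tuple

Table = List[dict]
Transform = Callable[[Table], Table]


def refine(
    generate: Callable[[Optional[str]], Transform],
    evaluate: Callable[[Table], bool],
    summarize: Callable[[Table], str],
    table: Table,
    max_iters: int = 3,
) -> Tuple[Optional[Table], int]:
    """Generate, run, check, and feed diagnostics into the next attempt."""
    feedback: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        transform = generate(feedback)
        try:
            result = transform(table)
        except Exception as exc:
            # A crash is also feedback for the next generation step.
            feedback = f"execution error: {exc!r}"
            continue
        if evaluate(result):
            return result, attempt
        # The code ran but the output is wrong: summarize what went wrong.
        feedback = summarize(result)
    return None, max_iters


# Toy stand-ins for the agents (hypothetical, hard-coded for the demo).
def toy_generate(feedback: Optional[str]) -> Transform:
    if feedback is None:
        # First attempt only handles "$" and silently leaves other cells as-is.
        return lambda rows: [
            {"price": float(r["price"][1:]) if r["price"].startswith("$") else r["price"]}
            for r in rows
        ]
    # Later attempt, informed by feedback, strips both symbols.
    return lambda rows: [{"price": float(r["price"].lstrip("$€"))} for r in rows]


def toy_evaluate(result: Table) -> bool:
    return all(isinstance(r["price"], float) for r in result)


def toy_summarize(result: Table) -> str:
    bad = [r["price"] for r in result if not isinstance(r["price"], float)]
    return f"unconverted values: {bad}"


table = [{"price": "$10"}, {"price": "€20"}]
result, attempts = refine(toy_generate, toy_evaluate, toy_summarize, table)
print(result, attempts)
```

The first attempt here runs without raising but leaves one cell unconverted, which is the silent failure mode described earlier; only the diagnostic from the summarizing step lets the second attempt correct it.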
Performance and Reliability
ProfiliTable was tested against a new, comprehensive benchmark covering 18 different types of tabular tasks, including cleaning, transformation, augmentation, and matching. The results show that the framework consistently outperforms existing baselines in terms of correctness, completeness, and execution reliability. Notably, the system achieved a 100% task-wise runnable rate, meaning it reliably produces code that executes successfully, which is a vital requirement for real-world production environments.
Key Takeaways for Data Processing
The core innovation of ProfiliTable is its ability to bridge the gap between vague human intent and the rigid requirements of tabular data. By combining interactive data exploration with a feedback-driven refinement loop, the system avoids the "one-shot" failure common in many LLM-based tools. This approach ensures that the final output is not only syntactically correct but also aligned with the specific semantic needs of the user, making it a robust solution for complex, real-world data curation challenges.