Salesforce AI Introduces TACO: A New Family of Multimodal Action Models that Combine Reasoning with Real-World Actions to Solve Complex Visual Tasks

Salesforce AI and the University of Washington have introduced TACO, a new framework for training multimodal action models. TACO uses a large, high-quality synthetic dataset of over 1.8 million examples, generated using GPT-4, that focuses on complex reasoning and action sequences.

The dataset covers 15 tools, including OCR and mathematical solvers, which lets models handle intricate tasks. TACO combines the LLaMA3 language model with the CLIP vision encoder, and it outperforms instruction-tuned baselines by an average of 3.6% across eight benchmarks, with gains of up to 15% on tasks involving OCR and math.
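To make the idea of a reasoning and action sequence concrete, here is a minimal sketch of how such a trace might be represented and executed. It is not TACO's implementation: the `Step` structure, the `run_trace` helper, the tool names, and the canned OCR output are all hypothetical, chosen only for illustration. The steps are also fixed in advance, whereas a trained model would presumably choose each next step after seeing the previous observation.

```python
import ast
import operator
from dataclasses import dataclass

# Hypothetical stand-ins for real tools; none of these names come from TACO.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


def fake_ocr(image_id: str) -> str:
    """Placeholder OCR: returns canned text instead of reading an image."""
    canned = {"receipt.png": "2 items at 4.50 each"}
    return canned.get(image_id, "")


# Registry mapping a tool name to the function that implements it.
TOOLS = {"ocr": fake_ocr, "calculate": calculate}


@dataclass
class Step:
    thought: str                # the reasoning behind the action
    tool: str                   # which tool to call
    argument: str               # the single argument passed to the tool
    observation: object = None  # filled in after the tool runs


def run_trace(steps):
    """Execute each step's tool call in order and record its observation."""
    for step in steps:
        if step.tool not in TOOLS:
            raise KeyError(f"unknown tool: {step.tool}")
        step.observation = TOOLS[step.tool](step.argument)
    return steps


trace = run_trace([
    Step("Read the text on the receipt.", "ocr", "receipt.png"),
    Step("Multiply the quantity by the unit price.", "calculate", "2 * 4.50"),
])

for step in trace:
    print(f"{step.thought} -> {step.tool}({step.argument!r}) = {step.observation!r}")
```

Keeping the thought, the tool call, and the observation together in one record is one simple way to store a multi-step trace as a training example; the format TACO actually uses may differ.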

The framework's emphasis on coherent multi-step reasoning and action integration sets a new standard for multimodal AI performance.