Introducing TACO – Salesforce AI Research’s Family of Multimodal Action Models

Introduction

Imagine taking a photo of a gas station sign displaying various fuel prices and asking an AI system, “How much gas can I buy with $50?” It might sound simple, but for the system, it’s a complex task. It has to identify where the prices are on the sign, extract the numbers using text recognition, and then perform the calculations to provide an answer. Real-world problems are often like this: they involve different types of information across multiple modalities and demand multi-step solutions. 

However, today’s open-source multimodal models struggle to solve realistic, complex problems in a step-by-step manner. This limitation stems from their training, which heavily emphasizes straightforward, single-step problems with brief, direct answers. It’s like trying to teach someone to cook by only showing them how to make toast: it doesn’t prepare them for more complicated recipes.

Worse, most open-source models struggle to articulate their problem-solving process even when prompted to do so. As a result, when these models make mistakes, it’s often difficult to determine which part of the process went wrong. For the question above, for example, was it the text recognition, the reasoning, or the calculation?

To address these challenges, we present TACO, a family of multimodal large action models designed to improve performance on complex questions that require multiple capabilities and demand multi-step solutions. 

To answer such questions, TACO produces chains-of-thought-and-action (CoTA), executes intermediate steps by invoking external tools such as OCR, depth estimation, and a calculator, and then integrates both the thoughts and the action outputs to produce a coherent response (Figure 1).

Figure 1. Example outputs of TACO vs. other multimodal large language models.
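To make this concrete, below is a minimal sketch of how a CoTA-style inference loop could handle the gas-station example above. The tool implementations, the step format, and the model.generate interface are illustrative assumptions for this post, not TACO’s actual API.

# A minimal sketch of a chain-of-thought-and-action loop for the gas-station
# example. The tools (ocr, calculate), the step format, and the model.generate
# interface are illustrative placeholders, not TACO's actual API.

def ocr(image):
    # Hypothetical OCR tool: returns the text regions found in the image.
    return ["Regular $3.39", "Midgrade $3.69", "Premium $3.99"]

def calculate(expression):
    # Hypothetical calculator tool: evaluates a simple arithmetic expression.
    return eval(expression, {"__builtins__": {}})  # for illustration only

TOOLS = {"ocr": ocr, "calculate": calculate}

def answer_with_cota(model, image, question):
    # Alternate between model-generated thoughts/actions and tool outputs
    # until the model emits a final answer.
    history = [{"role": "question", "content": question}]
    while True:
        step = model.generate(image, history)       # assumed model interface
        if step["type"] == "answer":
            return step["content"]
        # The step is an action: record the thought, call the named tool,
        # and feed its output back as an observation for the next round.
        history.append({"role": "thought", "content": step["thought"]})
        observation = TOOLS[step["name"]](**step["arguments"])
        history.append({"role": "observation", "content": observation})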

To enable TACO to generate chains-of-thought-and-action at inference time, we generate synthetic CoTA data and fine-tune open-source multimodal language models on it (Figure 2). 

Figure 2. An overview of the TACO training and inference pipeline.

Synthetic CoTA Data Generation

To train TACO, we create a large dataset of over 1M synthetic CoTA traces generated with a multimodal large language model (e.g., GPT-4o) and Python programs.
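For concreteness, a single CoTA trace might look like the sketch below. The field names and schema are our own illustration for exposition; the released dataset’s exact format may differ.

# An illustrative CoTA trace; the field names and schema here are assumptions,
# not the released dataset's exact format.
cota_trace = {
    "image": "street_scene.jpg",
    "question": "Which is closer to the camera, the bus or the bicycle?",
    "steps": [
        {
            "thought": "I should compare the depths of the two objects.",
            "action": {"name": "depth_estimation",
                       "arguments": {"image": "street_scene.jpg"}},
            "observation": {"bus": 12.4, "bicycle": 5.1},  # meters, made up
        },
    ],
    "answer": "The bicycle is closer to the camera than the bus.",
}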

Figure 3. Model-based (top) and programmatic (bottom) data generation pipelines.

In model-based generation, we take existing image and QA pairs from instruction-tuning datasets as inputs and prompt a multimodal large language model (e.g., GPT-4o) to generate either a chain-of-thought-and-action (CoTA) or a chain-of-thought (CoT) without actions to answer each question. We then verify that each chain parses successfully and leads to the correct final answer; if not, we convert the example into the direct-answer (Direct) format with the ground-truth answer.
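The verification step can be summarized with the short sketch below. The helpers prompt_mllm and parse_chain are placeholders for the actual prompting and parsing code, which this post does not show.

# A sketch of the model-based generation and filtering step. prompt_mllm and
# parse_chain are placeholders, not functions from the TACO codebase.

def build_example(image, question, ground_truth, prompt_mllm, parse_chain):
    # Keep a generated chain only if it parses and ends in the correct answer;
    # otherwise fall back to the Direct (ground-truth answer only) format.
    raw_chain = prompt_mllm(image, question)   # e.g., GPT-4o producing CoTA or CoT
    parsed = parse_chain(raw_chain)            # assume None if the chain is malformed
    if parsed is not None and parsed["final_answer"] == ground_truth:
        return {"image": image, "question": question, "target": raw_chain}
    return {"image": image, "question": question, "target": ground_truth}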

In programmatic generation, we first gather image annotations from human annotators or models, and then use these dense annotations to fill in manually written templates, generating QA pairs and the corresponding CoTA with Python programs.
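As a toy example of this pipeline, the sketch below turns object-detection annotations into a counting question with its matching CoTA. The template, tool name, and annotation format are illustrative, not the actual templates used for TACO.

# A toy programmatic generator: dense annotations (here, object boxes) fill a
# hand-written question template, and the same program emits the matching CoTA.

QUESTION_TEMPLATE = "How many {label}s are in the image?"

def generate_counting_example(image_id, annotations, label):
    # Collect the boxes for the target label, then build the QA pair and CoTA.
    boxes = [a["box"] for a in annotations if a["label"] == label]
    cota = [{
        "thought": f"I should locate every {label} before counting.",
        "action": {"name": "object_detection", "arguments": {"label": label}},
        "observation": boxes,
    }]
    return {
        "image": image_id,
        "question": QUESTION_TEMPLATE.format(label=label),
        "cota": cota,
        "answer": str(len(boxes)),
    }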

Results

We show that fine-tuning with CoTA data enables multimodal language models to reason and take actions on complex visual tasks, significantly boosting their average performance across 8 benchmarks. The accuracy increase is 30-50% compared to few-shot prompting in the CoTA format, and 3-4% compared to fine-tuning with direct answers (Figure 4).

Figure 4. Models’ average performance across all 8 benchmarks when prompted or fine-tuned with the Direct answer or CoTA format.

What’s more, TACO consistently beats baseline models instruction-tuned with only direct answers by significant margins, reaching up to 20% on MMVet, regardless of model backbone and starting checkpoint (Figure 5).

Figure 5. Models’ accuracy on MMVet after fine-tuning with Direct answers only vs. CoTA data.

Conclusion 

We propose a new framework for solving complex visual tasks with multimodal action models and introduce TACO, a new family of such models trained on large-scale synthetic chain-of-thought-and-action (CoTA) data. We demonstrate both quantitatively and qualitatively that TACO achieves gains of up to 4% on average across 8 benchmarks compared to instruction-tuned baselines, and up to 20% on the challenging MMVet benchmark.

With our framework, future work can train new models with different actions for other applications, such as web navigation, or for other domains, such as medical question answering. We also encourage future work to further improve the CoTA data, where the diversity and quality of both thoughts and actions matter.

Acknowledgments

Full Author List: Zixian Ma, Jianguo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
