ProVision | Multimodal Instruction Data Generation

Introduction

The development of multimodal language models (MLMs) such as GPT-4V and the BLIP family [1,2] has enabled many multimodal applications, such as answering complex image-based queries like “How many students are raising their hands in this image?” These models rely heavily on instruction data: datasets that pair visual content with corresponding questions and answers.

However, generating such data is challenging because of the limitations of existing approaches. Manual data collection is expensive and time-consuming, so many efforts instead rely on proprietary models to generate instruction data. These models are costly and computationally intensive, and they are prone to issues such as hallucinations, scalability constraints, and difficulties in ensuring interpretability and factual accuracy.

ProVision

To address the challenges in generating multimodal instruction data, we developed ProVision, a scalable, programmatic framework that employs scene graphs and human-written programs to systematically synthesize vision-centric instruction data. 

We represent each image as a scene graph, with objects and attributes as nodes and edges denoting their relationships. Using Python programs and textual templates, our data generators synthesize instruction data by creating questions and answers from the scene graph.

With these data generators, we can automatically synthesize questions and answers given an image’s scene graph. For example, given an image of a busy street, ProVision can generate questions such as, “What is the relationship between the pedestrian and the car?” or “Which object is closer to the red building, car or pedestrian?” 
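As a minimal sketch of this idea, the Python snippet below fills text templates from a small hand-written scene graph. The dictionary layout, the function names, and the "next to" relation are illustrative assumptions, not ProVision's actual schema or API.

```python
# Hypothetical scene-graph layout (an assumption, not ProVision's real schema):
# "objects" maps an object id to its name and typed attributes;
# "relations" holds (subject id, relation, object id) triples.
scene_graph = {
    "objects": {
        0: {"name": "pedestrian", "attributes": {}},
        1: {"name": "car", "attributes": {}},
        2: {"name": "building", "attributes": {"color": "red"}},
    },
    "relations": [(0, "next to", 1)],
}


def relation_questions(graph):
    """Create one question-answer pair per relation edge from a fixed template."""
    pairs = []
    for subj_id, relation, obj_id in graph["relations"]:
        subj = graph["objects"][subj_id]["name"]
        obj = graph["objects"][obj_id]["name"]
        question = f"What is the relationship between the {subj} and the {obj}?"
        pairs.append((question, relation))
    return pairs


def attribute_questions(graph):
    """Create one question-answer pair per typed attribute of each object."""
    pairs = []
    for obj in graph["objects"].values():
        for attr_type, value in obj["attributes"].items():
            question = f"What is the {attr_type} of the {obj['name']}?"
            pairs.append((question, value))
    return pairs


qa_pairs = relation_questions(scene_graph) + attribute_questions(scene_graph)
for question, answer in qa_pairs:
    print(f"Q: {question}\nA: {answer}")
```

Because every answer is read directly from the graph, each generated pair can be traced back to a specific node or edge.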

Unlike traditional approaches, ProVision ensures interpretability, factual accuracy, and scalability in generating instruction data for MLMs. New data generators can also be added as needed to synthesize novel types of instruction data.
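One way such extensibility could look in Python is a small registry that collects generator functions and runs all of them over a scene graph. This is a sketch under stated assumptions: the `register` decorator, the `GENERATORS` dictionary, the two example generators, and the scene-graph layout are hypothetical and may differ from how ProVision organizes its generators.

```python
from collections import Counter

# Hypothetical registry; ProVision's real extension mechanism may differ.
GENERATORS = {}


def register(name):
    """Decorator that stores a generator function in the registry under `name`."""
    def decorator(fn):
        GENERATORS[name] = fn
        return fn
    return decorator


@register("object_count")
def object_count(graph):
    """Ask how many objects carry each label; the answer is the exact count."""
    counts = Counter(obj["name"] for obj in graph["objects"].values())
    return [
        (f"How many objects labeled '{name}' are in the image?", str(n))
        for name, n in counts.items()
    ]


@register("object_existence")
def object_existence(graph):
    """Ask whether each label present in the graph appears in the image."""
    names = sorted({obj["name"] for obj in graph["objects"].values()})
    return [(f"Is there any {name} in the image?", "yes") for name in names]


def run_all(graph):
    """Apply every registered generator and tag each pair with its source."""
    records = []
    for gen_name, generator in GENERATORS.items():
        for question, answer in generator(graph):
            records.append(
                {"generator": gen_name, "question": question, "answer": answer}
            )
    return records


graph = {
    "objects": {0: {"name": "car"}, 1: {"name": "car"}, 2: {"name": "pedestrian"}},
    "relations": [],
}
records = run_all(graph)
for record in records:
    print(record)
```

Adding a new kind of instruction data then only requires writing one more decorated function; `run_all` stays unchanged.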

To synthesize instruction data for images without associated scene graphs, we use a scene graph generation pipeline composed of many state-of-the-art vision models. With this pipeline, we are able to generate instruction data for any image.

ProVision currently integrates a suite of 24 single-image and 14 multi-image instruction generators to create detailed question-answer pairs about objects, attributes, relations, and more. We use these data generators to synthesize over 10 million instruction data points, which we have made publicly available as the ProVision-10M dataset.
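To illustrate what a multi-image generator might do, the sketch below compares object counts across two scene graphs and asks which image contains more of a given object. The question type, the function names, and the scene-graph layout are assumptions made for illustration; they are not taken from ProVision's 14 multi-image generators.

```python
def count_objects(graph, name):
    """Count the objects in one scene graph whose label equals `name`."""
    return sum(1 for obj in graph["objects"].values() if obj["name"] == name)


def compare_counts(graph_a, graph_b, name):
    """Ask which of two images has more objects labeled `name`.

    Returns None on a tie, so no ambiguous question is emitted.
    """
    first = count_objects(graph_a, name)
    second = count_objects(graph_b, name)
    if first == second:
        return None
    question = (
        f"Which image contains more objects labeled '{name}', "
        "the first or the second?"
    )
    answer = "the first" if first > second else "the second"
    return question, answer


# Two hypothetical scene graphs, one per image.
street_a = {"objects": {0: {"name": "car"}, 1: {"name": "car"}}, "relations": []}
street_b = {"objects": {0: {"name": "car"}, 1: {"name": "bus"}}, "relations": []}

print(compare_counts(street_a, street_b, "car"))
```

Skipping ties is a design choice in this sketch: it keeps every emitted question answerable with a single, unambiguous answer.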

Results

ProVision-10M can enhance the performance of multimodal models during fine-tuning. We incorporate our synthesized single-image and multi-image instruction data into established MLM fine-tuning recipes: LLaVA-1.5 for single-image instruction data and Mantis-SigLIP-8B for multi-image instruction data.

The average performance on 8 benchmarks is shown in the following figure. ProVision data enhances average performance whether it is built from synthesized or manually annotated scene graphs, with the manually annotated ProVision data yielding the highest improvement in both cases.

In addition, we found that adding ProVision data to both the pretraining and fine-tuning stages of xGen-MM-4B (BLIP-3) leads to an average improvement of 1.6% across 11 benchmarks, outperforming both the baseline without our data and the variants that add it to only one of the two stages.

Future Work

As we demonstrate the potential of programmatically synthesized instruction data for training multimodal language models, future work can further improve the system by adding more data generators to include new types of instruction data or enhancing the scene graph generation pipeline for more accurate scene graphs. 

As we include data generators for synthesizing both single-image and multi-image instruction data, future work can extend the pipeline to synthesize video instruction data and more.

Explore More 

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post. Connect with us on social media and our website to get regular updates on this and other research projects.

Acknowledgments

Full Author List: Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu.

References

[1] Li, Junnan, Dongxu Li, Silvio Savarese and Steven C. H. Hoi. “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” International Conference on Machine Learning (2023).

[2] Xue, Le, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S. Ryoo, Shrikant B. Kendre, Jieyu Zhang, Can Qin, Shu Zhen Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong and Ran Xu. “xGen-MM (BLIP-3): A Family of Open Large Multimodal Models.”
