LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
  Framework, and Benchmark

Bai, Lei; Cao, Jianjian; Huang, Xiaoshui; Li, Mukai; Liu, Dingning; Ouyang, Wanli; Shao, Jing; Sheng, Lu; Shi, Zhelun; Wang, Jiong; Wang, Zhiyong; Yin, Zhenfei

Publication date: 18 June 2023

Abstract

Large language models have become a potential pathway toward artificial general intelligence. Recent work on multi-modal large language models (MLLMs) has demonstrated their effectiveness in handling visual modalities. In this work, we extend MLLM research to point clouds and present the LAMM-Dataset and LAMM-Benchmark for 2D image and 3D point cloud understanding. We also establish an extensible framework to facilitate the extension of MLLMs to additional modalities. Our main contribution is three-fold: 1) We present the LAMM-Dataset and LAMM-Benchmark, which cover almost all high-level vision tasks for 2D and 3D vision. Extensive experiments validate the effectiveness of our dataset and benchmark. 2) We detail the methods for constructing instruction-tuning datasets and benchmarks for MLLMs, which will enable future MLLM research to scale up and extend to other domains, tasks, and modalities faster. 3) We provide a preliminary but promising MLLM training framework optimized for extension to new modalities. We also provide baseline models, comprehensive experimental observations, and analysis to accelerate future research. Code and datasets are available at https://github.com/OpenLAMM/LAMM.

Comment: 37 pages, 33 figures. Code available at https://github.com/OpenLAMM/LAMM ; Project page: https://openlamm.github.io

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.06687

Last time updated on 14/06/2023