The integration of visual encoders and large language models (LLMs) has
driven recent progress in multimodal large language models (MLLMs). However,
the scarcity of high-quality instruction-tuning data for vision-language tasks
remains a challenge. Current leading paradigms, such as LLaVA, rely on language-only GPT-4 to generate data. This requires pre-annotated image captions and detection bounding boxes, and the resulting data often fails to capture image details. A practical solution would be to use available MLLMs to generate instruction data for vision-language tasks directly. However, currently accessible MLLMs are not as capable as their LLM counterparts: they tend to produce inadequate responses and hallucinate false information. To address this issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables MLLMs to generate instruction-tuning data and progressively enhance its quality on the fly. Specifically, Visual Instruction Generation (VIG)
guides the vision-language model to generate diverse instruction-tuning data.
To ensure generation quality, Visual Instruction Correction (VIC) adopts an
iterative update mechanism to correct inaccuracies in the data produced by VIG,
effectively reducing the risk of hallucination. Leveraging the diverse,
high-quality data generated by VIGC, we fine-tune mainstream models and validate the data quality through various evaluations. Experimental results demonstrate
that VIGC not only compensates for the shortcomings of language-only data
generation methods, but also effectively improves benchmark performance.
The models, datasets, and code will be made publicly available.
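As an illustration of the generate-then-correct idea described above, here is a minimal Python sketch of a VIG/VIC-style loop. This is not the paper's implementation: the `MLLM` interface, the prompt strings, and parameters such as `n` and `max_rounds` are hypothetical placeholders assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


class MLLM:
    """Hypothetical stand-in for a multimodal LLM; replace `generate`
    with a real model call mapping (image, prompt) -> text."""

    def generate(self, image, prompt: str) -> str:
        raise NotImplementedError


def vig_generate(model: MLLM, image, n: int = 3) -> list[QAPair]:
    """Visual Instruction Generation: draft diverse Q&A pairs for an image."""
    pairs = []
    for i in range(n):
        q = model.generate(image, f"Ask a new question (#{i + 1}) about this image.")
        a = model.generate(image, f"Question: {q}\nAnswer based only on the image.")
        pairs.append(QAPair(q, a))
    return pairs


def vic_correct(model: MLLM, image, pair: QAPair, max_rounds: int = 2) -> QAPair:
    """Visual Instruction Correction: iteratively revise the draft answer
    to drop details not grounded in the image (i.e., hallucinations)."""
    answer = pair.answer
    for _ in range(max_rounds):
        answer = model.generate(
            image,
            f"Question: {pair.question}\nDraft answer: {answer}\n"
            "Rewrite the answer, removing anything not visible in the image.",
        )
    return QAPair(pair.question, answer)


def vigc_pipeline(model: MLLM, images) -> list[QAPair]:
    """Generate, then correct, instruction-tuning data for each image."""
    data = []
    for image in images:
        for pair in vig_generate(model, image):
            data.append(vic_correct(model, image, pair))
    return data
```

The point the sketch captures is that correction runs as a separate, iterated pass over already-generated data, which is what lets the quality of the instruction-tuning set improve on the fly rather than depending on a single generation step.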