SEPT: Towards Scalable and Efficient Visual Pre-Training
Recently, the self-supervised pre-training paradigm has shown great potential
in leveraging large-scale unlabeled data to improve downstream task
performance. However, increasing the scale of unlabeled pre-training data in
real-world scenarios requires prohibitive computational costs and faces the
challenge of uncurated samples. To address these issues, we build a
task-specific self-supervised pre-training framework from a data selection
perspective based on a simple hypothesis that pre-training on the unlabeled
samples with similar distribution to the target task can bring substantial
performance gains. Building on this hypothesis, we propose the first
framework for Scalable and Efficient visual Pre-Training (SEPT) by introducing
a retrieval pipeline for data selection. SEPT first leverages a self-supervised
pre-trained model to extract features for the entire unlabeled dataset in order
to initialize the retrieval pipeline. Then, for a specific target task, SEPT
retrieves, for each target instance, the most similar samples from the unlabeled
dataset based on feature similarity. Finally, SEPT pre-trains
the target model with the selected unlabeled samples in a self-supervised
manner for target data finetuning. By decoupling the scale of pre-training and
available upstream data for a target task, SEPT achieves high scalability of
the upstream dataset and high efficiency of pre-training, resulting in high
model architecture flexibility. Results on various downstream tasks demonstrate
that SEPT can achieve competitive or even better performance compared with
ImageNet pre-training while reducing the number of training samples by one
order of magnitude without resorting to any extra annotations.
Comment: Accepted by AAAI 202
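The retrieval step described in the SEPT abstract can be illustrated with a minimal sketch. It assumes features have already been extracted, uses cosine similarity as the similarity measure, and takes the top-k neighbors per target instance; the function names, the choice of cosine similarity, and the value of k are illustrative assumptions, not details taken from the paper.

```python
import math


def normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def retrieve_subset(pool_feats, target_feats, k):
    """Return sorted indices of pool samples that are among the k most
    similar samples to at least one target instance (hypothetical helper)."""
    selected = set()
    for target in target_feats:
        ranked = sorted(
            range(len(pool_feats)),
            key=lambda i: cosine(target, pool_feats[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy features: four unlabeled pool samples and one target instance.
pool_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
target_feats = [[1.0, 0.05]]
selected = retrieve_subset(pool_feats, target_feats, k=2)
print(selected)
```

The union over target instances is what would then be used as the pre-training set; at real scale, an approximate nearest-neighbor index would replace the exhaustive sort used here.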
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
In the realm of large multi-modal models (LMMs), efficient modality alignment
is crucial yet often constrained by the scarcity of high-quality image-text
data. To address this bottleneck, we introduce the ShareGPT4V dataset, a
pioneering large-scale resource featuring 1.2 million highly descriptive
captions, which surpasses existing datasets in diversity and information
content, covering world knowledge, object properties, spatial relationships,
and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated
set of 100K high-quality captions collected from GPT4-Vision and has been
expanded to 1.2M with a strong caption model trained on this subset. ShareGPT4V
first demonstrates its effectiveness for the Supervised Fine-Tuning (SFT)
phase, by substituting an equivalent quantity of detailed captions in existing
SFT datasets with a subset of our high-quality captions, significantly
enhancing the LMMs like LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME
and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and
2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training
and SFT phases, obtaining ShareGPT4V-7B, a superior LMM based on a simple
architecture that has remarkable performance across a majority of the
multi-modal benchmarks. This project is available at
https://ShareGPT4V.github.io to serve as a pivotal resource for advancing the
LMMs community.
Comment: Project: https://ShareGPT4V.github.i
MiChao-HuaFen 1.0: A Specialized Pre-trained Corpus Dataset for Domain-specific Large Models
With the advancement of deep learning technologies, general-purpose large
models such as GPT-4 have demonstrated exceptional capabilities across various
domains. Nevertheless, there remains a demand for high-quality, domain-specific
outputs in areas like healthcare, law, and finance. This paper first evaluates
the existing large models for specialized domains and discusses their
limitations. To cater to the specific needs of certain domains, we introduce
the "MiChao-HuaFen 1.0" pre-trained corpus dataset, tailored for the news and
governmental sectors. The dataset, sourced from publicly available internet
data from 2022, underwent multiple rounds of cleansing and processing to ensure
high quality and reliable origins, with provisions for consistent and stable
updates. This dataset not only supports the pre-training of large models for
Chinese vertical domains but also aids in propelling deep learning research and
applications in related fields.
Comment: 4 pages, 2 figure
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Despite great advances in both instruction dataset building and benchmarking
for Multimodal Large Language Models (MLLMs), the separation of training and
evaluation makes it hard to further improve current MLLMs under
the guidance of evaluation results at a relatively low human cost. In this
paper, we propose MLLM-DataEngine, a novel closed-loop system that bridges data
generation, model training, and evaluation. Within each loop iteration, the
MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation
results, then generates a suitable incremental dataset for the next training
iteration, enhancing the model capability iteratively. Compared with previous
data collection methods which are separate from the benchmarking, the data
generated by MLLM-DataEngine shows better targeting, quality, and correctness.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts
the ratio of different types of data within each incremental dataset based on
the benchmarking results. For quality, we resort to GPT-4 to generate
high-quality data with each given data type. For correctness, prompt design is
critical for the data generation results. Rather than relying on hand-crafted
prompts, we propose an Interactive Prompt Optimization strategy, which optimizes
the prompt through multi-round interaction between a human and GPT and greatly
improves the correctness of the generated data. Through extensive experiments, we
find that MLLM-DataEngine can boost MLLM capability in a targeted and
automatic manner, with only a little human participation. We hope it can be a
general solution for building future MLLMs. The MLLM-DataEngine has been
open-sourced and is now available at
https://github.com/opendatalab/MLLM-DataEngine.
Comment: Code and models are available at
https://github.com/opendatalab/MLLM-DataEngin
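The idea behind adjusting data-type ratios from benchmark results, as in the Adaptive Bad-case Sampling module described above, can be sketched as follows. This is not the module's actual algorithm: it assumes a simple rule in which each data type receives a share of the incremental dataset proportional to its error rate, and the function name, the data-type names, and the accuracy values are all hypothetical.

```python
def allocate_budget(accuracy_by_type, budget):
    """Split `budget` new samples across data types in proportion to each
    type's error rate (1 - accuracy). Assumed rule, for illustration only."""
    errors = {t: max(0.0, 1.0 - acc) for t, acc in accuracy_by_type.items()}
    total = sum(errors.values())
    if total == 0:
        # No errors anywhere: fall back to a uniform split.
        errors = {t: 1.0 for t in errors}
        total = float(len(errors))

    exact = {t: budget * e / total for t, e in errors.items()}
    counts = {t: int(x) for t, x in exact.items()}

    # Hand out samples lost to rounding, largest fractional part first.
    leftover = budget - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1
    return counts


# Hypothetical per-type benchmark accuracies from one evaluation round.
counts = allocate_budget({"counting": 0.5, "ocr": 0.75, "spatial": 0.75}, budget=100)
print(counts)
```

Weaker data types receive more of the next incremental dataset, which is the targeting behavior the abstract describes; any real weighting scheme may differ.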
OmniCity: Omnipotent City Understanding with Multi-level and Multi-view Images
This paper presents OmniCity, a new dataset for omnipotent city understanding
from multi-level and multi-view images. More precisely, OmniCity contains
multi-view satellite images as well as street-level panorama and mono-view
images, constituting over 100K pixel-wise annotated images that are
well-aligned and collected from 25K geo-locations in New York City. To
alleviate the substantial pixel-wise annotation efforts, we propose an
efficient street-view image annotation pipeline that leverages the existing
label maps of satellite view and the transformation relations between different
views (satellite, panorama, and mono-view). With the new OmniCity dataset, we
provide benchmarks for a variety of tasks including building footprint
extraction, height estimation, and building plane/instance/fine-grained
segmentation. Compared with the existing multi-level and multi-view benchmarks,
OmniCity contains a larger number of images with richer annotation types and
more views, provides more benchmark results of state-of-the-art models, and
introduces a novel task for fine-grained building instance segmentation on
street-level panorama images. Moreover, OmniCity provides new problem settings
for existing tasks, such as cross-view image matching, synthesis, segmentation,
detection, etc., and facilitates the development of new methods for large-scale
city understanding, reconstruction, and simulation. The OmniCity dataset as
well as the benchmarks will be available at
https://city-super.github.io/omnicity
WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models
The rise in popularity of ChatGPT and GPT-4 has significantly accelerated the
development of large models, leading to the creation of numerous impressive
large language models (LLMs) and multimodal large language models (MLLMs). These
cutting-edge models owe their remarkable performance to high-quality data.
However, the details of the training data used in leading paradigms are often
kept confidential. This lack of transparency, coupled with the scarcity of
open-source data, impedes further developments within the community. As a
response, this paper presents "Wan Juan", a large-scale multimodal dataset
composed of both Chinese and English data, collected from a wide range of web
sources. The dataset incorporates text, image-text, and video modalities, with
a total volume exceeding 2TB. It was utilized in the training of InternLM, a
model that demonstrated significant advantages in multi-dimensional evaluations
when compared to models of a similar scale. All data can be accessed at
https://opendatalab.org.cn/WanJuan1.0.
Comment: Technical Repor
VIGC: Visual Instruction Generation and Correction
The integration of visual encoders and large language models (LLMs) has
driven recent progress in multimodal large language models (MLLMs). However,
the scarcity of high-quality instruction-tuning data for vision-language tasks
remains a challenge. The current leading paradigm, exemplified by LLaVA, relies on
language-only GPT-4 to generate data, which requires pre-annotated image
captions and detection bounding boxes and struggles to capture image
details. A practical solution to this problem would be to utilize the available
multimodal large language models (MLLMs) to generate instruction data for
vision-language tasks. However, it's worth noting that the currently accessible
MLLMs are not as powerful as their LLM counterparts, as they tend to produce
inadequate responses and generate false information. To address this
issue, this paper proposes the Visual Instruction
Generation and Correction (VIGC) framework that enables multimodal large
language models to generate instruction-tuning data and progressively enhance
its quality on-the-fly. Specifically, Visual Instruction Generation (VIG)
guides the vision-language model to generate diverse instruction-tuning data.
To ensure generation quality, Visual Instruction Correction (VIC) adopts an
iterative update mechanism to correct any inaccuracies in data produced by VIG,
effectively reducing the risk of hallucination. Leveraging the diverse,
high-quality data generated by VIGC, we finetune mainstream models and validate
data quality based on various evaluations. Experimental results demonstrate
that VIGC not only compensates for the shortcomings of language-only data
generation methods, but also effectively enhances the benchmark performance.
The models, datasets, and code will be made publicly available