Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation
Models (IFMs), which face challenges in transferring to the video domain.
Although VideoMAE has trained a robust ViT from limited data, its low-level
reconstruction poses convergence difficulties and conflicts with high-level
cross-modal alignment. This paper proposes a training-efficient method for
temporal-sensitive VFMs that integrates the benefits of existing methods. To
increase data efficiency, we mask out most of the low-semantics video tokens,
but selectively align the unmasked tokens with the IFM, which serves as the
UnMasked Teacher (UMT). By providing semantic guidance, our method enables
faster convergence and multimodal friendliness. With a progressive pre-training
framework, our model can handle various tasks including scene-related,
temporal-related, and complex video-language understanding. Using only public
sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16
achieves state-of-the-art performance on various video tasks. The code and
models will be released at https://github.com/OpenGVLab/unmasked_teacher.
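The masking-and-alignment idea above can be sketched in a few lines of PyTorch-style code. This is a minimal illustration, assuming a student video ViT student and a frozen image foundation model teacher that both return per-token features; the random token selection and the cosine loss are simplifications, not the paper's exact design.

import torch
import torch.nn.functional as F

def umt_alignment_loss(video_tokens, student, teacher, keep_ratio=0.2):
    """video_tokens: (B, N, D) patch embeddings of a video clip."""
    B, N, D = video_tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    # Keep a small subset of token positions (random here for simplicity;
    # the paper keeps the more semantically informative tokens).
    idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :n_keep]
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, D)
    kept = torch.gather(video_tokens, 1, idx_exp)

    student_feats = student(kept)                     # (B, n_keep, D)
    with torch.no_grad():                             # the IFM teacher stays frozen
        teacher_feats = torch.gather(teacher(video_tokens), 1, idx_exp)

    # Align each unmasked student token with the corresponding teacher token.
    return 1.0 - F.cosine_similarity(student_feats, teacher_feats, dim=-1).mean()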
Harvest Video Foundation Models via Efficient Post-Pretraining
Building video-language foundation models is costly and difficult due to the
redundant nature of video data and the lack of high-quality video-language
datasets. In this paper, we propose an efficient framework to harvest video
foundation models from image ones. Our method is intuitively simple: we randomly
drop input video patches and mask out input text during the
post-pretraining procedure. The patch dropping boosts training efficiency
significantly, and text masking enforces the learning of cross-modal fusion. We
conduct extensive experiments to validate the effectiveness of our method on a
wide range of video-language downstream tasks including various zero-shot
tasks, video question answering, and video-text retrieval. Despite its
simplicity, our method achieves state-of-the-art performances, which are
comparable to some heavily pretrained video foundation models. Our method is
extremely efficient and can be trained in less than one day on 8 GPUs,
requiring only WebVid-10M as pretraining data. We hope our method can serve as
a simple yet strong counterpart for prevalent video foundation models, provide
useful insights when building them, and make large pretrained models more
accessible and sustainable. This is part of the InternVideo project:
https://github.com/OpenGVLab/InternVideo
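The two data operations the abstract relies on, patch dropping and text masking, can be sketched as follows. The tensor shapes, the drop and mask ratios, and the mask_id argument are illustrative assumptions rather than the released implementation.

import torch

def drop_video_patches(patches, drop_ratio=0.5):
    """patches: (B, N, D) patch tokens -> a random subset kept per video."""
    B, N, D = patches.shape
    n_keep = max(1, int(N * (1.0 - drop_ratio)))
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))

def mask_text_tokens(token_ids, mask_id, mask_prob=0.15):
    """token_ids: (B, L) integer ids -> a random subset replaced by the [MASK] id."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    masked_ids = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
    return masked_ids, mask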
You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction
Challenging illumination conditions (low-light, under-exposure and
over-exposure) in the real world not only cause an unpleasant visual appearance
but also degrade computer vision tasks. After the camera captures raw-RGB
data, it renders standard sRGB images with an image signal processor (ISP). By
decomposing the ISP pipeline into local and global image components, we propose a
lightweight, fast Illumination Adaptive Transformer (IAT) to restore a normally
lit sRGB image from either low-light or under-/over-exposure conditions.
Specifically, IAT uses attention queries to represent and adjust the
ISP-related parameters such as colour correction and gamma correction. With only
~90k parameters and a processing time of ~0.004s, our IAT consistently achieves
superior performance over SOTA on the current benchmark low-light enhancement
and exposure correction datasets. Competitive experimental results also
demonstrate that our IAT significantly enhances object detection and semantic
segmentation under various light conditions. Training code and the pretrained
model are available at
https://github.com/cuiziteng/Illumination-Adaptive-Transformer.
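The ISP-style adjustment the abstract describes can be illustrated by the application step alone: a predicted 3x3 colour-correction matrix and a gamma value applied to an image. How IAT's attention queries predict ccm and gamma is not reproduced here; the shapes below are assumptions.

import torch

def apply_isp_params(img, ccm, gamma):
    """img: (B, 3, H, W) sRGB in [0, 1]; ccm: (B, 3, 3); gamma: (B,) per-image exponent."""
    B, C, H, W = img.shape
    flat = img.view(B, C, -1)                          # (B, 3, H*W)
    corrected = torch.bmm(ccm, flat).view(B, C, H, W)  # colour correction
    corrected = corrected.clamp(0.0, 1.0)
    return corrected ** gamma.view(B, 1, 1, 1)         # gamma correction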
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totalling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLMs), thereby
showcasing its efficacy in learning video-language representation at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Learned on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system, advancing video-to-text and
text-to-video generation research. These proposed resources provide a tool for
researchers and practitioners interested in multimodal video understanding and
generation. Data and code:
https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVi
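ViCLIP's training objective can be sketched as a standard CLIP-style InfoNCE loss over paired video and text embeddings. The encoders, the temperature value, and the normalization details below are assumptions, not the exact training recipe.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of matched video clips and captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # true pairs lie on the diagonal
    # Symmetric cross-entropy over the video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))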
MVP: Robust Multi-View Practice for Driving Action Localization
Distracted driving causes thousands of deaths per year, and how to apply
deep-learning methods to prevent these tragedies has become a crucial problem.
In Track 3 of the 6th AI City Challenge, researchers provide a high-quality
video dataset with dense action annotations. Due to the small data scale and
unclear action boundaries, the dataset presents a unique challenge: precisely
localizing all the different actions and classifying their categories. In this
paper, we make good use of the multi-view synchronization among videos, and
conduct robust Multi-View Practice (MVP) for driving action localization. To
avoid overfitting, we fine-tune SlowFast with Kinetics-700 pre-training as the
feature extractor. Then the features of different views are passed to
ActionFormer to generate candidate action proposals. For precisely localizing
all the actions, we design elaborate post-processing, including model voting,
threshold filtering, and duplicate removal. The results show that our MVP is
robust for driving action localization, achieving a 28.49% F1-score on the
Track 3 test set.
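The post-processing named above can be sketched as score thresholding followed by duplicate removal with temporal non-maximum suppression. The proposal format, the thresholds, and the omission of multi-view model voting are simplifications, not the authors' pipeline.

def temporal_iou(a, b):
    """Temporal IoU of two proposals given as (start, end, score, label)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(proposals, score_thr=0.3, iou_thr=0.5):
    """Filter low-score proposals, then drop near-duplicates of the same class."""
    kept = []
    for p in sorted((p for p in proposals if p[2] >= score_thr),
                    key=lambda q: q[2], reverse=True):
        if all(q[3] != p[3] or temporal_iou(p, q) < iou_thr for q in kept):
            kept.append(p)
    return kept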
Predicting mortality of patients with acute kidney injury in the ICU using XGBoost model.
Purpose: The goal of this study is to construct a mortality prediction model using the XGBoost (eXtreme Gradient Boosting) decision tree model for AKI (acute kidney injury) patients in the ICU (intensive care unit), and to compare its performance with that of three other machine learning models.
Methods: We used the eICU Collaborative Research Database (eICU-CRD) for model development and performance comparison. The prediction performance of the XGBoost model was compared with that of three other machine learning models: LR (logistic regression), SVM (support vector machines), and RF (random forest). In the model comparison, the AUROC (area under the receiver operating characteristic curve), accuracy, precision, recall, and F1 score were used to evaluate the predictive performance of each model.
Results: A total of 7548 AKI patients were analyzed in this study. The overall in-hospital mortality of AKI patients was 16.35%. The best-performing algorithm in this study was XGBoost, with the highest AUROC (0.796).
Conclusion: The XGBoost model had clear performance advantages over the other machine learning models. This will be helpful for risk identification and early intervention for AKI patients at risk of death.
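A minimal sketch of the modelling and evaluation pipeline described above: an XGBoost classifier for in-hospital mortality, scored with AUROC, accuracy, precision, recall, and F1. The feature matrix, the train/test split, and the hyperparameters are illustrative assumptions, not the study's protocol.

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             precision_score, recall_score, f1_score)

def evaluate_xgboost(X, y):
    """X: feature matrix for AKI patients; y: in-hospital mortality labels (0/1)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    model = XGBClassifier(n_estimators=300, max_depth=4,
                          learning_rate=0.1, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    pred = (prob >= 0.5).astype(int)
    return {"auroc": roc_auc_score(y_te, prob),
            "accuracy": accuracy_score(y_te, pred),
            "precision": precision_score(y_te, pred),
            "recall": recall_score(y_te, pred),
            "f1": f1_score(y_te, pred)}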
VideoChat: Chat-Centric Video Understanding
In this study, we initiate an exploration into video understanding by
introducing VideoChat, an end-to-end chat-centric video understanding system.
It integrates video foundation models and large language models via a learnable
neural interface, excelling in spatiotemporal reasoning, event localization,
and causal relationship inference. To instruction-tune this system, we
propose a video-centric instruction dataset, composed of thousands of videos
matched with detailed descriptions and conversations. This dataset emphasizes
spatiotemporal reasoning and causal relationships, providing a valuable asset
for training chat-centric video understanding systems. Preliminary qualitative
experiments reveal our system's potential across a broad spectrum of video
applications and set the standard for future research. Access our code and data
at https://github.com/OpenGVLab/Ask-Anything.
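A minimal sketch of what a learnable neural interface between a frozen video encoder and an LLM could look like: video features projected into the LLM's embedding space and prepended to the prompt embeddings. The linear projection, the dimensions, and the prepending strategy are assumptions, not VideoChat's exact architecture.

import torch
import torch.nn as nn

class VideoToLLMInterface(nn.Module):
    """Projects frozen video-encoder features into the LLM embedding space."""
    def __init__(self, video_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats, prompt_embeds):
        """video_feats: (B, T, video_dim); prompt_embeds: (B, L, llm_dim)."""
        video_embeds = self.proj(video_feats)                  # (B, T, llm_dim)
        # The concatenated sequence is fed to the (frozen) LLM for dialogue.
        return torch.cat([video_embeds, prompt_embeds], dim=1)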
InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language
We present an interactive visual framework named InternGPT, or iGPT for
short. The framework integrates chatbots that have planning and reasoning
capabilities, such as ChatGPT, with non-verbal instructions like pointing
movements that enable users to directly manipulate images or videos on the
screen. Pointing movements (including gestures, cursors, etc.) can provide more
flexibility and precision in performing vision-centric tasks that require
fine-grained control, editing, and generation of visual content. The name
InternGPT stands for interaction, nonverbal, and
chatbots. Unlike existing interactive systems that rely on
pure language, the proposed iGPT incorporates pointing instructions and
significantly improves the efficiency of communication between users and
chatbots, as well as the accuracy of chatbots in vision-centric tasks,
especially in complicated visual scenarios involving more than two objects.
Additionally, in iGPT, an auxiliary control mechanism is used
to improve the control capability of the LLM, and a large vision-language model
termed Husky is fine-tuned for high-quality multi-modal dialogue (impressing
ChatGPT-3.5-turbo with 93.89% GPT-4 quality). We hope this work can spark new
ideas and directions for future interactive visual systems. The code is available
at https://github.com/OpenGVLab/InternGPT.
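To illustrate how a pointing instruction can be combined with language, here is a hypothetical sketch: a click position selects a region, and the region plus the text instruction are handed to a vision tool. The segment_at and edit_region callables are invented placeholders for illustration, not iGPT's API.

from dataclasses import dataclass

@dataclass
class PointingInstruction:
    x: float   # click position, normalized to [0, 1]
    y: float
    text: str  # accompanying language instruction, e.g. "remove this object"

def handle_instruction(inst, image, segment_at, edit_region):
    """segment_at(image, x, y) -> mask; edit_region(image, mask, text) -> edited image."""
    mask = segment_at(image, inst.x, inst.y)      # resolve the nonverbal pointing
    return edit_region(image, mask, inst.text)    # apply the language instruction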