Video-and-language understanding has a variety of applications in industry,
such as video question answering, text-video retrieval, and
multi-label classification. Existing video-and-language understanding methods
generally adopt heavy multi-modal encoders and feature fusion modules, which
consume large amounts of GPU memory. In particular, they struggle to handle
the dense video frames and long text that are prevalent in industrial
applications. In this paper, we propose MuLTI, a highly accurate and
memory-efficient video-and-language understanding model that achieves efficient
and effective feature fusion through feature sampling and attention modules.
As a result, MuLTI can handle longer sequences with limited GPU memory. We then
introduce an attention-based adapter into the encoders, which finetunes the
shallow features to improve the model's performance with low GPU memory
consumption. Finally, to further improve the model's performance, we introduce
a new pretraining task named Multiple Choice Modeling to bridge the task gap
between pretraining and downstream tasks and enhance the model's ability to
align the video and the text. Benefiting from the efficient feature fusion
module, the attention-based adapter and the new pretraining task, MuLTI
achieves state-of-the-art performance on multiple datasets. The implementation
and pretrained models will be released.
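
To make the fusion idea concrete, the following is a minimal, hypothetical PyTorch-style sketch of how feature sampling and attention could be combined to fuse dense video features with text features under limited memory. The module name, dimensions, learned-query sampling strategy, and layer choices are illustrative assumptions, not the exact MuLTI architecture.

import torch
import torch.nn as nn

class SampledFusion(nn.Module):
    """Hypothetical sketch: compress a long sequence of video-frame tokens
    into a small set of learned query tokens (feature sampling), then fuse
    the compressed video features with text features via cross-attention.
    All sizes and module choices here are illustrative assumptions."""

    def __init__(self, dim=768, num_queries=32, num_heads=12):
        super().__init__()
        # Learned queries act as a bottleneck: attention pools many frame
        # tokens into `num_queries` tokens, bounding GPU memory use.
        self.video_queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.sample_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention from text tokens to the sampled video tokens.
        self.fuse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T*P, dim) dense frame/patch features
        # text_feats:  (B, L, dim)   text token features
        B = video_feats.size(0)
        queries = self.video_queries.unsqueeze(0).expand(B, -1, -1)
        # Feature sampling: pool dense video tokens into a fixed-size set.
        sampled, _ = self.sample_attn(queries, video_feats, video_feats)
        sampled = self.norm_v(sampled)
        # Fusion: text tokens attend to the sampled video tokens.
        fused, _ = self.fuse_attn(text_feats, sampled, sampled)
        return self.norm_t(fused + text_feats)

Under this sketch, fusion cost and memory scale with the fixed number of query tokens rather than the full frame-token count, which is the motivation for sampling features before fusing them with text.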