Visual-Semantic Learning
Visual-semantic learning is an attractive and challenging research direction aiming to understand complex semantics of heterogeneous data from two domains: visual signals (images and videos) and natural language (captions and questions). It requires memorizing the rich information in a single modality and a joint comprehension of multiple modalities. Artificial intelligence (AI) systems with human-level intelligence are expected to learn like humans, such as efficiently leveraging brain memory for better comprehension, rationally incorporating common-sense knowledge into reasoning, quickly gaining in-depth understanding given a few samples, and analyzing relationships among abundant and informative events. However, these capacities, effortless for humans, remain challenging for machines. To bridge the discrepancy between human-level intelligence and present-day visual-semantic learning, we start from its basic understanding ability by studying the visual question answering (e.g., Image-QA and Video-QA) tasks from the perspectives of memory augmentation and common-sense knowledge incorporation. Furthermore, we extend it to a more challenging situation with limited and partially unlabeled training data (i.e., Few-shot Visual-Semantic Learning) to imitate the fast learning ability of humans. Finally, to further enhance visual-semantic performance in natural videos with numerous spatio-temporal dynamics, we investigate exploiting event-correlated information for a comprehensive understanding of cross-modal semantics.
To study the essential visual-semantic understanding ability of the human brain with memory, we first propose a novel Memory Augmented Deep Recurrent Neural Network (i.e., MA-DRNN) model for Video-QA, which features a new method for encoding videos and questions, and memory augmentation using the emerging Differentiable Neural Computer (i.e., DNC). Specifically, we encode semantic (i.e., questions) information before visual (i.e., videos) information, which leads to better visual-semantic representations. Moreover, we leverage the DNC and its external memory to store and retrieve valuable information in questions and videos and to model the long-term visual-semantic dependency.
In addition to basic understanding, to tackle visual-semantic reasoning that requires external knowledge beyond visible contents (e.g., KB-Image-QA), we propose a novel framework that endows the model with capabilities of answering more general questions and achieves better exploitation of external knowledge through generating Multiple Clues for Reasoning with Memory Neural Networks (i.e., MCR-MemNN). Specifically, a well-defined detector is adopted to predict image-question-related relation phrases, each delivering two complementary clues to retrieve the supporting facts from an external knowledge base (i.e., KB). These facts are encoded into a continuous embedding space using a content-addressable memory. Afterward, mutual interactions between visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information in three modalities (i.e., image, question, and KB). Finally, the optimal answer is predicted by choosing the supporting fact with the highest score.
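The content-addressable lookup described for MCR-MemNN can be illustrated with a minimal sketch. It assumes dot-product similarity followed by a softmax, which is a common choice for memory networks but is not confirmed by the abstract; the function names and the toy 3-dimensional embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read_memory(query, fact_embeddings):
    """Address memory by content: score every stored fact against the query.

    Returns the attention weights and the index of the highest-scoring fact.
    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [dot(query, fact) for fact in fact_embeddings]
    weights = softmax(scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, best


# Hypothetical embeddings of three supporting facts retrieved from a KB.
facts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
# Hypothetical joint image-question representation.
query = [0.1, 0.9, 0.0]

weights, best = read_memory(query, facts)
```

In this toy setting the second fact receives the largest weight, mirroring the final step in which the answer is taken from the supporting fact with the highest score.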
Furthermore, to enable a fast, in-depth understanding given a small number of samples, especially with heterogeneity in the multi-modal scenarios such as image question answering (i.e., Image-QA) and image captioning (i.e., IC), we study the few-shot visual-semantic learning and present the Hierarchical Graph ATtention Network (i.e., HGAT). This two-stage network models the intra- and inter-modal relationships with limited image-text samples. The main contributions of HGAT can be summarized as follows: 1) it sheds light on tackling few-shot multi-modal learning problems, which focuses primarily, but not exclusively, on visual and semantic modalities, through better exploitation of the intra-relationship of each modality and an attention-based co-learning framework between modalities using a hierarchical graph-based architecture; 2) it achieves superior performance on both visual question answering and image captioning in the few-shot setting; 3) it can be easily extended to the semi-supervised setting where image-text samples are partially unlabeled.
Although various attention mechanisms have been utilized to manage contextualized representations by modeling intra- and inter-modal relationships of the two modalities, one limitation of the predominant visual-semantic methods is the lack of reasoning with event correlation, sensing, and analyzing relationships among abundant and informative events contained in the video. To this end, we introduce the dense caption modality as a new auxiliary and distill event-correlated information to infer the correct answer. We propose a novel end-to-end trainable model, Event-Correlated Graph Neural Networks (EC-GNNs), to perform cross-modal reasoning over information from the three modalities (i.e., caption, video, and question). Besides exploiting a new modality, we employ cross-modal reasoning modules to explicitly model inter-modal relationships and aggregate relevant information across different modalities. We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
To evaluate our proposed models, we conduct extensive experiments on the VTW, MSVD-QA, and TGIF-QA datasets for the Video-QA task, the Toronto COCO-QA and Visual Genome-QA datasets for the few-shot Image-QA task, the COCO-FITB dataset for the few-shot IC task, and the FVQA and Visual7W + ConceptNet datasets for the KB-Image-QA task. The experimental results justify these models’ effectiveness and superiority over baseline methods.
eP-ALM: Efficient Perceptual Augmentation of Language Models
Large Language Models (LLMs) have so far impressed the world, with
unprecedented capabilities that emerge in models at large scales. On the vision
side, transformer models (i.e., ViT) are following the same trend, achieving
the best performance on challenging benchmarks. With the abundance of such
unimodal models, a natural question arises: do we also need to follow this
trend to tackle multimodal tasks? In this work, we propose to rather direct
effort to efficient adaptations of existing models, and propose to augment
Language Models with perception. Existing approaches for adapting pretrained
models for vision-language tasks still rely on several key components that
hinder their efficiency. In particular, they still train a large number of
parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP)
trained on huge image-text datasets, and add significant inference overhead. In
addition, most of these approaches have focused on Zero-Shot and In Context
Learning, with little to no effort on direct finetuning. We investigate the
minimal computational effort needed to adapt unimodal models for multimodal
tasks and propose a new challenging setup, alongside different approaches, that
efficiently adapts unimodal pretrained models. We show that by freezing more
than 99% of total parameters, training only one linear projection layer, and
prepending only one trainable token, our approach (dubbed eP-ALM) significantly
outperforms other baselines on VQA and Captioning across Image, Video, and
Audio modalities, following the proposed setup. The code will be available
here: https://github.com/mshukor/eP-ALM.
Comment: New experiments and comparison with SoTA. Code:
https://github.com/mshukor/eP-AL
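The parameter budget described in the eP-ALM abstract (everything frozen except one linear projection and one prepended token) can be sketched as follows. This is a simplified illustration, not the authors' implementation: placing the projected visual feature directly in the input sequence is an assumption, and all names and sizes are hypothetical.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_input(visual_feature, text_embeddings, weight, bias, soft_token):
    """Prepend the trainable token and the projected visual feature to the text.

    Only `weight`, `bias`, and `soft_token` would be trained; the encoders
    that produce `visual_feature` and `text_embeddings` stay frozen.
    """
    projected = linear(visual_feature, weight, bias)
    return [soft_token, projected] + text_embeddings


def trainable_fraction(frozen_params, d_visual, d_text):
    """Share of all parameters that are trained in this setup."""
    trainable = d_visual * d_text + d_text + d_text  # weight + bias + token
    return trainable / (frozen_params + trainable)


# Toy sizes: a 2-dimensional visual feature mapped into a 3-dimensional text space.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
soft_token = [0.0, 0.0, 0.0]
text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

sequence = build_input([1.0, 2.0], text, weight, bias, soft_token)

# Hypothetical scale: one billion frozen parameters, 1024-d vision, 2048-d text.
fraction = trainable_fraction(1_000_000_000, 1024, 2048)
```

With these hypothetical sizes the trained share is roughly 0.2% of the total, which is consistent in spirit with the "more than 99% frozen" figure quoted above.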
Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
This work explores an efficient approach to establish a foundational
video-text model for tasks including open-vocabulary video classification,
text-to-video retrieval, video captioning and video question-answering. We
present VideoCoCa that reuses a pretrained image-text contrastive captioner
(CoCa) model and adapts it to video-text tasks with minimal extra training.
While previous works adapt image-text models with various cross-frame fusion
modules (for example, cross-frame attention layer or perceiver resampler) and
finetune the modified architecture on video-text data, we surprisingly find
that the generative attentional pooling and contrastive attentional pooling
layers in the image-text CoCa design are instantly adaptable to "flattened
frame embeddings", yielding a strong zero-shot transfer baseline for many
video-text tasks. Specifically, the frozen image encoder of a pretrained
image-text CoCa takes each video frame as input and generates token
embeddings for every frame of the video. We flatten the token embeddings of
all frames into one long sequence that serves as a frozen video representation and apply
CoCa's generative attentional pooling and contrastive attentional pooling on
top. All model weights including pooling layers are directly loaded from an
image-text CoCa pretrained model. Without any video or video-text data,
VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art
results on zero-shot video classification on Kinetics 400/600/700, UCF101,
HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT
and ActivityNet Captions. We also explore lightweight finetuning on top of
VideoCoCa, and achieve strong results on video question-answering (iVQA,
MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our
approach establishes a simple and effective video-text baseline for future
research.
Comment: Technical report
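The "flattened frame embeddings" idea can be sketched in a few lines. The pooling below is a single-query, single-head softmax attention with no learned projections, which is a deliberate simplification of CoCa's pooling layers; the function names and toy values are hypothetical.

```python
import math


def flatten_frames(frame_tokens):
    """Turn [frames][tokens][dim] into one [frames * tokens][dim] sequence."""
    return [token for frame in frame_tokens for token in frame]


def attention_pool(query, tokens):
    """Pool a token sequence into one vector with single-query attention."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


# Toy video: 2 frames, 2 tokens per frame, 2-dimensional embeddings.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]

sequence = flatten_frames(video)
# An all-zero query gives uniform weights, so the result is the token mean.
pooled = attention_pool([0.0, 0.0], sequence)
```

Because attention pooling accepts a sequence of any length, the same pooling function applies unchanged whether it receives the tokens of one image or the concatenated tokens of many frames.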
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Effective scaling and a flexible task interface enable large language models
to excel at many tasks. PaLI (Pathways Language and Image model) extends this
approach to the joint modeling of language and vision. PaLI generates text
based on visual and textual inputs, and with this interface performs many
vision, language, and multimodal tasks, in many languages. To train PaLI, we
make use of large pretrained encoder-decoder language models and Vision
Transformers (ViTs). This allows us to capitalize on their existing
capabilities and leverage the substantial cost of training them. We find that
joint scaling of the vision and language components is important. Since
existing Transformers for language are much larger than their vision
counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits
from even larger-capacity vision models. To train PaLI, we create a large
multilingual mix of pretraining tasks, based on a new image-text training set
containing 10B images and texts in over 100 languages. PaLI achieves
state-of-the-art in multiple vision and language tasks (such as captioning,
visual question-answering, scene-text understanding), while retaining a simple,
modular, and scalable design.
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Various adaptation methods, such as LoRA, prompts, and adapters, have been
proposed to enhance the performance of pre-trained vision-language models in
specific domains. The robustness of these adaptation methods against
distribution shifts has not been studied. In this study, we assess the
robustness of 11 widely-used adaptation methods across 4 vision-language
datasets under multimodal corruptions. Concretely, we introduce 7 benchmark
datasets, including 96 visual and 87 textual corruptions, to investigate the
robustness of different adaptation methods, the impact of available adaptation
examples, and the influence of trainable parameter size during adaptation. Our
analysis reveals that: 1) Adaptation methods are more sensitive to text
corruptions than visual corruptions. 2) Full fine-tuning does not consistently
provide the highest robustness; instead, adapters can achieve better robustness
with comparable clean performance. 3) Contrary to expectations, our findings
indicate that increasing the amount of adaptation data and the number of
parameters does not guarantee enhanced robustness; instead it results in even
lower robustness. We
hope this study could benefit future research in the development of robust
multimodal adaptation methods. The benchmark, code, and dataset used in this
study can be accessed at https://adarobustness.github.io.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Video question answering (VideoQA) is a complex task that requires diverse
multi-modal data for training. Manual annotation of questions and answers for
videos, however, is tedious and prohibits scalability. To tackle this problem,
recent methods consider zero-shot settings with no manual annotation of visual
question-answer. In particular, a promising approach adapts frozen
autoregressive language models pretrained on Web-scale text-only data to
multi-modal inputs. In contrast, we here build on frozen bidirectional language
models (BiLM) and show that such an approach provides a stronger and cheaper
alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs
with the frozen BiLM using light trainable modules, (ii) we train such modules
using Web-scraped multi-modal data, and finally (iii) we perform zero-shot
VideoQA inference through masked language modeling, where the masked text is
the answer to a given question. Our proposed approach, FrozenBiLM, outperforms
the state of the art in zero-shot VideoQA by a significant margin on a variety
of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA,
TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in
the few-shot and fully-supervised settings. Our code and models are publicly
available at https://github.com/antoyang/FrozenBiLM.
Comment: NeurIPS 2022 Camera-Ready; Project Webpage:
https://antoyang.github.io/frozenbilm.html; 25 pages; 5 figures
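Answering through masked language modeling, as described in the FrozenBiLM abstract, amounts to reading the model's scores at the masked position. The sketch below assumes those scores are already available as a mapping from words to logits and that the choice is restricted to a fixed answer vocabulary; both the prompt template and the function name are hypothetical.

```python
def answer_from_mask(mask_logits, answer_vocabulary):
    """Pick the candidate answer with the highest logit at the masked position.

    `mask_logits` maps vocabulary words to the scores a masked language model
    would assign at the [MASK] token. Restricting the choice to
    `answer_vocabulary` is an assumption made for this sketch.
    """
    candidates = {a: mask_logits[a] for a in answer_vocabulary if a in mask_logits}
    if not candidates:
        raise ValueError("no candidate answer has a score")
    return max(candidates, key=candidates.get)


# Hypothetical prompt in which the answer is the masked text.
prompt = "Question: what is the man doing? Answer: [MASK]."
# Hypothetical logits at the masked position.
mask_logits = {"cooking": 2.3, "running": 0.4, "the": 5.0, "swimming": -1.1}

answer = answer_from_mask(mask_logits, ["cooking", "running", "swimming"])
```

Restricting the search to candidate answers keeps frequent function words such as "the" from being selected even when they receive the highest raw score.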
Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation
Multi-modal 3D scene understanding has gained considerable attention due to
its wide applications in many areas, such as autonomous driving and
human-computer interaction. Compared to conventional single-modal 3D
understanding, introducing an additional modality not only elevates the
richness and precision of scene interpretation but also ensures a more robust
and resilient understanding. This becomes especially crucial in varied and
challenging environments where solely relying on 3D data might be inadequate.
While there has been a surge in the development of multi-modal 3D methods over
the past three years, especially those integrating multi-camera images (3D+2D) and
textual descriptions (3D+language), a comprehensive and in-depth review is
notably absent. In this article, we present a systematic survey of recent
progress to bridge this gap. We begin by briefly introducing a background that
formally defines various 3D multi-modal tasks and summarizes their inherent
challenges. After that, we present a novel taxonomy that delivers a thorough
categorization of existing methods according to modalities and tasks, exploring
their respective strengths and limitations. Furthermore, comparative results of
recent approaches on several benchmark datasets, together with insightful
analysis, are offered. Finally, we discuss the unresolved issues and provide
several potential avenues for future research.