Multi-task Self-Supervised Visual Learning
We investigate methods for combining multiple self-supervised tasks--i.e.,
supervised tasks where data can be collected without manual labeling--in order
to train a single visual representation. First, we provide an apples-to-apples
comparison of four different self-supervised tasks using the very deep
ResNet-101 architecture. We then combine tasks to jointly train a network. We
also explore lasso regularization to encourage the network to factorize the
information in its representation, and methods for "harmonizing" network inputs
in order to learn a more unified representation. We evaluate all methods on
ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our
results show that deeper networks work better, and that combining tasks--even
via a naive multi-head architecture--always improves performance. Our best
joint network nearly matches the PASCAL performance of a model pre-trained on
ImageNet classification, and matches the ImageNet network on NYU depth
prediction.
Comment: Published at ICCV 201
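The "naive multi-head architecture" mentioned above can be sketched in a few lines: a shared trunk feeds several lightweight per-task heads, and a lasso-style penalty on the head weights encourages each task to read only part of the shared representation. This is a minimal NumPy stand-in under stated assumptions; the one-layer trunk, the dimensions, and the task names are illustrative placeholders, not the paper's ResNet-101 setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk(x, W):
    # Shared "backbone": one linear layer + ReLU stands in for ResNet-101.
    return np.maximum(x @ W, 0.0)

# Hypothetical dimensions: 32-dim input, 16-dim shared representation.
W_shared = rng.normal(size=(32, 16))

# One lightweight head per self-supervised task (names are illustrative).
heads = {
    "rotation": rng.normal(size=(16, 4)),      # predict one of 4 rotations
    "colorization": rng.normal(size=(16, 8)),  # coarse color bins
    "exemplar": rng.normal(size=(16, 10)),     # pseudo-class logits
}

x = rng.normal(size=(5, 32))   # a batch of 5 "images"
h = trunk(x, W_shared)         # single shared representation

# Naive multi-head architecture: every task reads the same features.
outputs = {task: h @ W for task, W in heads.items()}

# Lasso-style penalty that pushes each head toward a sparse subset of the
# shared units (the "factorization" idea from the abstract).
l1 = sum(np.abs(W).sum() for W in heads.values())
```

In training, `l1` would be added (with a small weight) to the sum of the per-task losses; here it is shown only to make the regularization target concrete.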
Audio-Visual Speech Enhancement and Separation by Leveraging Multi-Modal Self-Supervised Embeddings
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be
effective for categorical problems such as automatic speech recognition and
lip-reading. This suggests that useful audio-visual speech representations can
be obtained by utilizing multi-modal self-supervised embeddings. Nevertheless,
it is unclear if such representations can be generalized to solve real-world
multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE)
and audio-visual speech separation (AVSS). In this study, we leveraged the
pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS.
Comparative experimental results demonstrate that our proposed model performs
better than the state-of-the-art AVSE and traditional audio-only SE models. In
summary, our results confirm the effectiveness of our proposed model for the
AVSS task with proper fine-tuning strategies, demonstrating that multi-modal
self-supervised embeddings obtained from AV-HuBERT can be generalized to
audio-visual regression tasks.
Comment: ICASSP AMHAT 202
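The "pre-trained encoder followed by an SE module" pipeline can be made concrete with a small sketch: frozen audio-visual features predict a per-bin mask that is applied to a noisy magnitude spectrogram. Everything here is a simplified NumPy assumption; the concatenation encoder and sigmoid head stand in for AV-HuBERT and the actual SE module, and all sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

def frozen_av_encoder(audio_feats, video_feats):
    # Stand-in for the pre-trained AV-HuBERT: just concatenate per-frame
    # audio and video features (the real model fuses them in a transformer).
    return np.concatenate([audio_feats, video_feats], axis=-1)

def se_head(av_emb, W):
    # Hypothetical SE module: predict a (0, 1) mask per time-frequency bin.
    return 1.0 / (1.0 + np.exp(-(av_emb @ W)))  # sigmoid

T, F = 20, 80                       # frames x frequency bins (illustrative)
audio = rng.normal(size=(T, 64))
video = rng.normal(size=(T, 32))
noisy_mag = np.abs(rng.normal(size=(T, F)))

emb = frozen_av_encoder(audio, video)   # (T, 96), encoder stays frozen
W_se = rng.normal(size=(96, F)) * 0.1   # only the SE head would be trained
mask = se_head(emb, W_se)               # (T, F) mask in (0, 1)
enhanced = mask * noisy_mag             # masked spectrogram estimate
```

Because the mask is bounded in (0, 1), the enhanced magnitude can only attenuate the noisy input, which is the usual masking formulation for speech enhancement.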
Self-Supervised Learning Across Domains
Human adaptability relies crucially on learning and merging knowledge from both supervised and unsupervised tasks: the parents point out a few important concepts, but then the children fill in the gaps on their own. This is particularly effective because supervised learning can never be exhaustive, so learning autonomously allows the discovery of invariances and regularities that help generalization. In this paper we propose to apply a similar approach to the problem of object recognition across domains: our model learns the semantic labels in a supervised fashion, and broadens its understanding of the data by learning from self-supervised signals on the same images. This secondary task helps the network focus on object shapes, learning concepts like spatial orientation and part correlation, while acting as a regularizer for the classification task over multiple visual domains. Extensive experiments confirm our intuition and show that our multi-task method combining supervised and self-supervised knowledge achieves competitive results with respect to more complex domain generalization and adaptation solutions. It also proves its potential in the novel and challenging predictive and partial domain adaptation scenarios.
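The multi-task recipe described here reduces to a weighted sum of two cross-entropy losses on shared features: one for the supervised semantic labels, one for a self-supervised pretext whose labels come for free. The following NumPy sketch is illustrative only; the feature extractor is omitted, and the pretext ("which of 4 transforms was applied") and the weight `alpha` are assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def cross_entropy(logits, labels):
    # Standard softmax cross-entropy, averaged over the batch.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

B, D = 8, 16
feats = rng.normal(size=(B, D))      # shared features for a batch of images

# Primary head: semantic object labels (supervised).
W_cls = rng.normal(size=(D, 5))
y_cls = rng.integers(0, 5, size=B)

# Secondary head: a self-supervised pretext, e.g. predicting which of 4
# jigsaw/rotation transforms was applied -- labels require no annotation.
W_ss = rng.normal(size=(D, 4))
y_ss = rng.integers(0, 4, size=B)

alpha = 0.7                          # pretext weight (a hyperparameter)
loss = cross_entropy(feats @ W_cls, y_cls) \
       + alpha * cross_entropy(feats @ W_ss, y_ss)
```

The pretext term acts as the regularizer the abstract describes: it shapes the shared features without adding any labeling cost.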
Contrastive Audio-Visual Masked Autoencoder
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model
from a single modality to audio-visual multi-modalities. Subsequently, we
propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining
contrastive learning and masked data modeling, two major self-supervised
learning frameworks, to learn a joint and coordinated audio-visual
representation. Our experiments show that the contrastive audio-visual
correspondence learning objective not only enables the model to perform
audio-visual retrieval tasks, but also helps the model learn a better joint
representation. As a result, our fully self-supervised pretrained CAV-MAE
achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the
previous best supervised pretrained model on AudioSet in the audio-visual event
classification task. Code and pretrained models are at
https://github.com/yuangongnd/cav-mae.
Comment: Accepted at ICLR 2023 as a notable top 25% paper
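The combination of the two self-supervised frameworks can be sketched as a contrastive audio-visual matching loss plus a masked-reconstruction loss summed into one objective. This NumPy sketch is a simplified assumption: the embeddings and "patches" are random placeholders, and the temperature and loss weight are illustrative, not the values used by CAV-MAE.

```python
import numpy as np

rng = np.random.default_rng(3)

def info_nce(a, v, tau=0.07):
    # Contrastive audio-visual matching: the i-th audio clip should score
    # highest against the i-th video clip among all pairs in the batch.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = a @ v.T / tau
    z = sims - sims.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()

B, D = 6, 32
audio_emb = rng.normal(size=(B, D))
video_emb = rng.normal(size=(B, D))

# Masked-data-modeling branch: reconstruct the masked patches and score
# the error only where the mask was applied (values are placeholders).
patches = rng.normal(size=(B, 10))
recon = rng.normal(size=(B, 10))
mask = rng.random(size=(B, 10)) < 0.75       # most patches masked out
mae_loss = ((recon - patches)[mask] ** 2).mean()

total = info_nce(audio_emb, video_emb) + 0.5 * mae_loss  # illustrative weight
```

The contrastive term aligns the two modalities (enabling retrieval), while the reconstruction term forces the representation to retain fine-grained content.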
Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension
In this work, we introduce a novel algorithm for solving the textbook
question answering (TQA) task which describes more realistic QA problems
compared to other recent tasks. We mainly focus on two related issues with
analysis of the TQA dataset. First, solving the TQA problems requires
comprehending multi-modal contexts in complicated input data. To tackle this issue
comprehend multi-modal contexts in complicated input data. To tackle this issue
of extracting knowledge features from long text lessons and merging them with
visual features, we establish a context graph from texts and images, and
propose a new module f-GCN based on graph convolutional networks (GCN). Second,
scientific terms are not spread evenly over the chapters, and the subjects are
split in the TQA dataset. To overcome this so-called "out-of-domain" issue, before learning
QA problems, we introduce a novel self-supervised open-set learning process
without any annotations. The experimental results show that our model
significantly outperforms prior state-of-the-art methods. Moreover, ablation
studies validate that both methods of incorporating f-GCN for extracting
knowledge from multi-modal contexts and our newly proposed self-supervised
learning process are effective for TQA problems.
Comment: ACL2019 Camera-ready
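Since f-GCN builds on standard graph convolutional networks, a single GCN layer over a small context graph makes the mechanism concrete: each node averages its neighbourhood (including itself) and applies a learned projection. The tiny graph, feature sizes, and mean-aggregation normalization below are illustrative assumptions, not the paper's f-GCN.

```python
import numpy as np

rng = np.random.default_rng(4)

def gcn_layer(A, H, W):
    # One graph-convolution step: mean over each node's neighbourhood
    # (self-loop included), then a learned projection + ReLU.
    A_hat = A + np.eye(len(A))              # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return np.maximum((A_hat / deg) @ H @ W, 0.0)

# Tiny context graph: 4 nodes (e.g. text terms linked to image regions).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = rng.normal(size=(4, 8))                 # initial node features
W = rng.normal(size=(8, 16))                # layer weights

H1 = gcn_layer(A, H, W)                     # updated node features
```

Stacking such layers lets information from long text lessons and visual features propagate across the context graph before answering.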
Perfect match: Improved cross-modal embeddings for audio-visual synchronisation
This paper proposes a new strategy for learning powerful cross-modal
embeddings for audio-to-video synchronization. Here, we set up the problem as
one of cross-modal retrieval, where the objective is to find the most relevant
audio segment given a short video clip. The method builds on the recent
advances in learning representations from cross-modal self-supervision.
The main contributions of this paper are as follows: (1) we propose a new
learning strategy where the embeddings are learnt via a multi-way matching
problem, as opposed to a binary classification (matching or non-matching)
problem as proposed by recent papers; (2) we demonstrate that the performance of
this method far exceeds the existing baselines on the synchronization task; (3)
we use the learnt embeddings for visual speech recognition in self-supervision,
and show that the performance matches the representations learnt end-to-end in
a fully-supervised manner.
Comment: Preprint. Work in progress
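The key contrast in point (1) above is between a binary match/non-match classifier and a multi-way matching problem: score one video clip against N candidate audio segments and apply softmax cross-entropy over all of them. A minimal NumPy sketch, assuming dot-product similarity and an illustrative temperature (the actual embedding networks are omitted):

```python
import numpy as np

rng = np.random.default_rng(5)

def multiway_loss(video_emb, audio_cands, true_idx, tau=0.1):
    # Score the clip against N candidate audio segments and apply softmax
    # cross-entropy, instead of N independent binary match decisions.
    sims = audio_cands @ video_emb / tau
    z = sims - sims.max()
    log_p = z - np.log(np.exp(z).sum())
    return -log_p[true_idx]

D, N = 64, 5                       # embedding size, number of candidates
video_emb = rng.normal(size=D)
audio_cands = rng.normal(size=(N, D))
audio_cands[2] = video_emb + 0.01 * rng.normal(size=D)  # near-match at index 2

loss = multiway_loss(video_emb, audio_cands, true_idx=2)
pred = int(np.argmax(audio_cands @ video_emb))          # retrieval by similarity
```

Training against all candidates at once gives each update a harder, more informative signal than a single matching/non-matching label, which is the intuition behind the improved synchronization results.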