24,331 research outputs found
CMIR-NET : A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing
We address the problem of cross-modal information retrieval in the domain of
remote sensing. In particular, we are interested in two application scenarios:
i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery,
and ii) multi-label image retrieval between very high resolution (VHR) images
and speech based label annotations. Notice that these multi-modal retrieval
scenarios are more challenging than the traditional uni-modal retrieval
approaches given the inherent differences in distributions between the
modalities. However, with the growing availability of multi-source remote
sensing data and the scarcity of enough semantic annotations, the task of
multi-modal retrieval has recently become extremely important. In this regard,
we propose a novel deep neural network based architecture which is considered
to learn a discriminative shared feature space for all the input modalities,
suitable for semantically coherent information retrieval. Extensive experiments
are carried out on the benchmark large-scale PAN - multi-spectral DSRSID
dataset and the multi-label UC-Merced dataset. Together with the Merced
dataset, we generate a corpus of speech signals corresponding to the labels.
Superior performance with respect to the current state-of-the-art is observed
in all the cases
Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision
The goal of this work is to train discriminative cross-modal embeddings
without access to manually annotated data. Recent advances in self-supervised
learning have shown that effective representations can be learnt from natural
cross-modal synchrony. We build on earlier work to train embeddings that are
more discriminative for uni-modal downstream tasks. To this end, we propose a
novel training strategy that not only optimises metrics across modalities, but
also enforces intra-class feature separation within each of the modalities. The
effectiveness of the method is demonstrated on two downstream tasks: lip
reading using the features trained on audio-visual synchronisation, and speaker
recognition using the features trained for cross-modal biometric matching. The
proposed method outperforms state-of-the-art self-supervised baselines by a
signficant margin.Comment: Under submission as a conference pape
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Futhermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.Comment: 14 pages, 7 figure
- …