CMIR-NET : A Deep Learning Based Model For Cross-Modal Retrieval In Remote Sensing
We address the problem of cross-modal information retrieval in the domain of
remote sensing. In particular, we are interested in two application scenarios:
i) cross-modal retrieval between panchromatic (PAN) and multi-spectral imagery,
and ii) multi-label image retrieval between very high resolution (VHR) images
and speech based label annotations. Notice that these multi-modal retrieval
scenarios are more challenging than the traditional uni-modal retrieval
approaches given the inherent differences in distributions between the
modalities. However, with the growing availability of multi-source remote
sensing data and the scarcity of enough semantic annotations, the task of
multi-modal retrieval has recently become extremely important. In this regard,
we propose a novel deep neural network based architecture designed
to learn a discriminative shared feature space for all the input modalities,
suitable for semantically coherent information retrieval. Extensive experiments
are carried out on the benchmark large-scale PAN - multi-spectral DSRSID
dataset and the multi-label UC-Merced dataset. Together with the Merced
dataset, we generate a corpus of speech signals corresponding to the labels.
Superior performance with respect to the current state-of-the-art is observed
in all cases.
Multi-label modality enhanced attention based self-supervised deep cross-modal hashing
Deep cross-modal hashing (DCMH) has recently achieved superior performance in effective and efficient cross-modal retrieval and has thus drawn increasing attention. Nevertheless, most existing DCMH methods still have two limitations: (1) single labels are usually used to measure the semantic similarity of cross-modal pairwise instances, neglecting that many cross-modal datasets contain abundant semantic information among multi-labels; (2) several DCMH methods use multi-labels to supervise the learning of hash functions, but the multi-label feature space is sparse, which leads to sub-optimal hash function learning. This paper therefore proposes a multi-label modality enhanced attention-based self-supervised deep cross-modal hashing (MMACH) framework. Specifically, a multi-label modality enhanced attention module is designed to integrate the significant features from cross-modal data into the multi-label feature representations, aiming to improve their completeness. Moreover, a multi-label cross-modal triplet loss is defined based on the criterion that the feature representations of cross-modal pairwise instances with more common categories should preserve higher semantic similarity than other instances. To the best of our knowledge, this is the first multi-label cross-modal triplet loss designed for cross-modal retrieval. Extensive experiments on four multi-label cross-modal datasets demonstrate the effectiveness and efficiency of the proposed MMACH, which also outperforms several state-of-the-art methods on the task of cross-modal retrieval. The source code of MMACH is available at https://github.com/SWU-CS-MediaLab/MMACH. (c) 2021 Elsevier B.V. All rights reserved. Computer Systems, Imagery and Medi
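The triplet criterion in the abstract above (pairs sharing more categories should be more similar) can be illustrated with a minimal sketch. This is not the MMACH loss itself, which the abstract does not spell out: the cosine similarity, the hinge form, the margin value, and the function names are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def shared_labels(a, b):
    """Number of categories two multi-hot label vectors have in common."""
    return sum(x * y for x, y in zip(a, b))

def multilabel_triplet_loss(anchor, cand_a, cand_b, margin=0.2):
    """Hinge loss on one triplet (hypothetical form, not the paper's definition).

    Each argument is a (feature, multi_hot_label) pair. The anchor comes from
    one modality and both candidates from the other. The candidate sharing
    more labels with the anchor is treated as the positive.
    """
    (fa, la), (f1, l1), (f2, l2) = anchor, cand_a, cand_b
    s1, s2 = shared_labels(la, l1), shared_labels(la, l2)
    if s1 == s2:
        return 0.0  # no ordering is implied, so the triplet contributes nothing
    pos, neg = (f1, f2) if s1 > s2 else (f2, f1)
    return max(0.0, margin - cosine(fa, pos) + cosine(fa, neg))
```

The loss is zero when the candidate with more shared labels is already more similar to the anchor by at least the margin, and positive otherwise.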
Ranking-based Deep Cross-modal Hashing
Cross-modal hashing has been receiving increasing interest for its low
storage cost and fast query speed in multi-modal data retrieval. However, most
existing hashing methods are based on hand-crafted or raw level features of
objects, which may not be optimally compatible with the coding process.
Besides, these hashing methods are mainly designed to handle simple pairwise
similarity. The complex multilevel ranking semantic structure of instances
associated with multiple labels has not been well explored yet. In this paper,
we propose a ranking-based deep cross-modal hashing approach (RDCMH). RDCMH
first uses the feature and label information of data to derive a
semi-supervised semantic ranking list. Next, to expand the semantic
representation power of hand-crafted features, RDCMH integrates the semantic
ranking information into deep cross-modal hashing and jointly optimizes the
compatible parameters of deep feature representations and of hashing functions.
Experiments on real multi-modal datasets show that RDCMH outperforms other
competitive baselines and achieves state-of-the-art performance in
cross-modal retrieval applications.
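The low storage cost and fast query speed mentioned above come from comparing short binary codes by Hamming distance. The sketch below shows only that retrieval step. The sign-threshold binarization, the function names, and the toy database are assumptions for illustration, not part of RDCMH, which learns its hash functions jointly with deep features.

```python
def to_code(features):
    """Pack a real-valued feature vector into an integer hash code.

    Assumption: one bit per dimension, set when the value is positive.
    """
    code = 0
    for x in features:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query_code, database):
    """Return database keys ordered from nearest to farthest code."""
    return sorted(database, key=lambda name: hamming(query_code, database[name]))

# Hypothetical example: an image query ranked against three text items.
image_query = to_code([0.9, -0.2, 0.4, -0.7])
text_db = {
    "txt_a": to_code([0.5, -0.1, 0.3, -0.9]),
    "txt_b": to_code([0.5, 0.1, 0.3, -0.9]),
    "txt_c": to_code([-1.0, 1.0, -1.0, 1.0]),
}
ranking = rank_by_hamming(image_query, text_db)
```

Each item is stored as a handful of bits, and a query costs one XOR plus a bit count per database entry.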
Improving Music Genre Classification from multi-modal properties of music and genre correlations Perspective
Music genre classification has been widely studied in the past few years for its
various applications in music information retrieval. Previous works tend to
perform unsatisfactorily, since those methods only use audio content or jointly
use audio content and lyrics content inefficiently. In addition, as genres
normally co-occur in a music track, it is desirable to capture and model the
genre correlations to improve the performance of multi-label music genre
classification. To solve these issues, we present a novel multi-modal method
leveraging an audio-lyrics contrastive loss and two symmetric cross-modal
attention modules to align and fuse features from audio and lyrics. Furthermore, based
on the nature of the multi-label classification, a genre correlations
extraction module is presented to capture and model potential genre
correlations. Extensive experiments demonstrate that our proposed method
significantly surpasses other multi-label music genre classification methods
and achieves state-of-the-art results on the Music4All dataset. Comment: Accepted by ICASSP 202
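The abstract above does not give the form of its audio-lyrics contrastive loss. A common choice for aligning two modalities is a symmetric InfoNCE-style loss, sketched here under that assumption. The function names, the temperature value, and the requirement that embeddings be non-zero are all illustrative.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(audio, lyrics, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's definition).

    audio[i] and lyrics[i] are embeddings of the same track. Matching pairs
    are pulled together and all other pairs in the batch are pushed apart.
    """
    a = [normalize(v) for v in audio]
    t = [normalize(v) for v in lyrics]
    n = len(a)
    # logits[i][j]: scaled cosine similarity between audio i and lyrics j
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / n

    columns = [list(c) for c in zip(*logits)]
    # Average the audio-to-lyrics and lyrics-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))
```

The loss is near zero when each audio embedding is closest to its own lyrics embedding and grows when pairs are mismatched.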
Learning Visual Actions Using Multiple Verb-Only Labels
This work introduces verb-only representations for both recognition and
retrieval of visual actions in video. Current methods neglect legitimate
semantic ambiguities between verbs, instead choosing unambiguous subsets of
verbs along with objects to disambiguate the actions. We instead propose
multiple verb-only labels, which we learn through hard or soft assignment as a
regression. This enables learning a much larger vocabulary of verbs, including
contextual overlaps of these verbs. We collect multi-verb annotations for three
action video datasets and evaluate the verb-only labelling representations for
action recognition and cross-modal retrieval (video-to-text and text-to-video).
We demonstrate that multi-label verb-only representations outperform
conventional single verb labels. We also explore other benefits of a multi-verb
representation including cross-dataset retrieval and verb type (manner and
result verb types) retrieval. Comment: Accepted at BMVC 2019. More information can be found at
https://mwray.github.io/MVOL/. Annotations can be found at
https://github.com/mwray/Multi-Verb-Label
Deep Lifelong Cross-modal Hashing
Hashing methods have made significant progress in cross-modal retrieval tasks
with fast query speed and low storage cost. Among them, deep learning-based
hashing achieves better performance on large-scale data due to its excellent
extraction and representation ability for nonlinear heterogeneous features.
However, two main challenges remain: catastrophic forgetting when data with
new categories arrive continuously, and the time-consuming retraining that
non-continuous hashing retrieval needs for each update. To this end, we
propose a novel deep lifelong cross-modal hashing method that achieves
lifelong hashing retrieval instead of repeatedly retraining hash functions when
new data arrive. Specifically, we design a lifelong learning strategy that updates
hash functions by training directly on the incremental data instead of retraining
new hash functions on all the accumulated data, which significantly reduces
training time. Then, we propose a lifelong hashing loss that lets the original hash
codes participate in lifelong learning while remaining invariant, and further
preserves the similarity and dissimilarity among original and incremental hash
codes to maintain performance. Additionally, considering the distribution
heterogeneity when new data arrive continuously, we introduce a multi-label
semantic similarity to supervise hash learning, and a detailed analysis shows
that this similarity improves performance. Experimental results on
benchmark datasets show that the proposed method achieves performance
comparable to recent state-of-the-art cross-modal hashing methods,
yields substantial average improvements of over 20% in retrieval accuracy,
and reduces training time by over 80% when new data arrive continuously.
Exhaustive and Efficient Constraint Propagation: A Semi-Supervised Learning Perspective and Its Applications
This paper presents a novel pairwise constraint propagation approach by
decomposing the challenging constraint propagation problem into a set of
independent semi-supervised learning subproblems which can be solved in
quadratic time using label propagation based on k-nearest neighbor graphs.
Considering that this time cost is proportional to the number of all possible
pairwise constraints, our approach actually provides an efficient solution for
exhaustively propagating pairwise constraints throughout the entire dataset.
The resulting exhaustive set of propagated pairwise constraints is further
used to adjust the similarity matrix for constrained spectral clustering. Beyond
traditional constraint propagation on single-source data, our approach
is also extended to the more challenging constraint propagation on multi-source
data where each pairwise constraint is defined over a pair of data points from
different sources. This multi-source constraint propagation has an important
application to cross-modal multimedia retrieval. Extensive results have shown
the superior performance of our approach. Comment: The short version of this paper appears as an oral paper in ECCV 201
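One of the semi-supervised subproblems described above can be pictured as ordinary label propagation on a k-nearest-neighbor graph. The sketch below is a generic iterative version, not the paper's algorithm. The 1-D toy data, the symmetrized unweighted graph, the update rule, and the reading of seeds as must-link (+1) and cannot-link (-1) constraints relative to one fixed point are all assumptions.

```python
def knn_graph(points, k):
    """Symmetric, unweighted k-NN adjacency matrix for 1-D points."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: abs(points[i] - points[j]))
        for j in others[:k]:
            w[i][j] = w[j][i] = 1.0
    return w

def propagate(w, seeds, alpha=0.8, iters=50):
    """Iterate f <- alpha * P f + (1 - alpha) * y, with P the row-normalized graph.

    seeds maps a point index to its initial score, here +1.0 or -1.0.
    Unlabeled points start at 0.0.
    """
    n = len(w)
    y = [seeds.get(i, 0.0) for i in range(n)]
    f = y[:]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            degree = sum(w[i])
            spread = sum(w[i][j] * f[j] for j in range(n)) / degree if degree else 0.0
            new_f.append(alpha * spread + (1 - alpha) * y[i])
        f = new_f
    return f

# Two well-separated groups; one seed in each group.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
scores = propagate(knn_graph(points, k=2), {0: 1.0, 5: -1.0})
```

After propagation, the unlabeled points in each group take the sign of the seed in their group, which is the behavior a propagated pairwise constraint relies on.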