866 research outputs found
Deep Sketch Hashing: Fast Free-hand Sketch-Based Image Retrieval
Free-hand sketch-based image retrieval (SBIR) is a specific cross-view
retrieval task, in which queries are abstract and ambiguous sketches while the
retrieval database is formed with natural images. Work in this area mainly
focuses on extracting representative and shared features for sketches and
natural images. However, these can neither cope well with the geometric
distortion between sketches and images nor be feasible for large-scale SBIR due
to the heavy continuous-valued distance computation. In this paper, we speed up
SBIR by introducing a novel binary coding method, named Deep Sketch Hashing
(DSH), where a semi-heterogeneous deep architecture is proposed and
incorporated into an end-to-end binary coding framework. Specifically, three
convolutional neural networks are utilized to encode free-hand sketches,
natural images and, especially, the auxiliary sketch-tokens which are adopted
as bridges to mitigate the sketch-image geometric distortion. The learned DSH
codes can effectively capture the cross-view similarities as well as the
intrinsic semantic correlations between different categories. To the best of
our knowledge, DSH is the first hashing work specifically designed for
category-level SBIR with an end-to-end deep architecture. The proposed DSH is
comprehensively evaluated on two large-scale datasets of TU-Berlin Extension
and Sketchy, and the experiments consistently show DSH's superior SBIR
accuracies over several state-of-the-art methods, while achieving significantly
reduced retrieval time and memory footprint.
Comment: This paper will appear as a spotlight paper in CVPR201
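The speed and memory benefit of binary codes mentioned in the abstract above comes from replacing continuous-valued distances with Hamming distances. The following is a minimal sketch of Hamming-distance ranking only, not the DSH method itself: the 8-bit codes and the names `hamming`, `rank_by_hamming`, `sketch_code` and `image_codes` are hypothetical, and real codes would come from a learned hash function.

```python
from typing import List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: List[int]) -> List[int]:
    """Return database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


# Hypothetical 8-bit codes (assumption): one query sketch and four database images.
sketch_code = 0b10110010
image_codes = [0b10110011, 0b01001101, 0b10100010, 0b11110000]

# Indices of the images, nearest first; ties keep their original order.
ranking = rank_by_hamming(sketch_code, image_codes)
print(ranking)
```

Each comparison here is an XOR followed by a bit count, which is why binary codes are cheaper to store and compare than real-valued feature vectors.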
Efficient Discrete Supervised Hashing for Large-scale Cross-modal Retrieval
Supervised cross-modal hashing has gained increasing research interest for
large-scale retrieval tasks owing to its satisfactory performance and
efficiency. However, some challenging issues remain to be studied:
1) most methods fail to preserve the semantic correlations well in
hash codes because of the large heterogeneous gap; 2) most of them relax the
discrete constraint on hash codes, leading to large quantization error and
consequently low performance; 3) most of them suffer from relatively high memory
cost and computational complexity during training, which makes them
unscalable. In this paper, to address the above issues, we propose a supervised
cross-modal hashing method based on matrix factorization dubbed Efficient
Discrete Supervised Hashing (EDSH). Specifically, collective matrix
factorization on heterogeneous features and semantic embedding with class labels
are seamlessly integrated to learn hash codes. Therefore, both the feature-based
similarities and the semantic correlations can be preserved in the hash codes,
which makes the learned hash codes more discriminative. Then an efficient
discrete optimization algorithm is proposed to handle the scalability issue. Instead of
learning hash codes bit by bit, the hash code matrix can be obtained directly,
which is more efficient. Extensive experimental results on three public
real-world datasets demonstrate that EDSH achieves superior performance in
both accuracy and scalability over some existing cross-modal hashing methods.
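The quantization error that the abstract above attributes to relaxing the discrete constraint can be illustrated with a small sketch. It shows the relax-then-threshold approach being criticized, not EDSH's discrete optimization. The vector `relaxed` and the helper names are hypothetical, and mapping zero to +1 is an arbitrary convention chosen here.

```python
def binarize(values):
    """Map each real value to +1 or -1 (zero goes to +1 by convention here)."""
    return [1 if x >= 0 else -1 for x in values]


def quantization_error(values):
    """Squared Euclidean distance between a relaxed code and its binarized version."""
    codes = binarize(values)
    return sum((b - x) ** 2 for b, x in zip(codes, values))


# Hypothetical real-valued solution of a relaxed (continuous) hashing problem.
relaxed = [0.9, -0.1, 0.2, -1.0]

codes = binarize(relaxed)            # thresholded hash code
error = quantization_error(relaxed)  # information lost by thresholding
print(codes, error)
```

Entries far from ±1, such as -0.1 and 0.2 here, contribute most of the error, which is the motivation for optimizing the binary codes directly.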
Semi-supervised Multimodal Hashing
Retrieving nearest neighbors across correlated data in multiple modalities,
such as image-text pairs on Facebook and video-tag pairs on YouTube, has become
a challenging task due to the huge amount of data. Multimodal hashing methods
that embed data into binary codes can boost retrieval speed and reduce
storage requirements. As unsupervised multimodal hashing methods are usually
inferior to supervised ones, while supervised ones require too much
manually labeled data, the method proposed in this paper uses a subset of the
labels to design a semi-supervised multimodal hashing method. It first computes
the transformation matrices for data matrices and label matrix. Then, with
these transformation matrices, fuzzy logic is introduced to estimate a label
matrix for unlabeled data. Finally, it uses the estimated label matrix to learn
hashing functions for data in each modality to generate a unified binary code
matrix. Experiments show that the proposed semi-supervised method with 50%
labels achieves mid-range performance among the compared supervised methods and
achieves performance close to that of the best supervised method with 90%
labels. With only 10% labels, the proposed method can still compete with the
worst of the compared supervised methods.
Audio Content based Geotagging in Multimedia
In this paper we propose methods to extract geographically relevant
information from a multimedia recording using its audio. Our method is primarily
based on the fact that an urban acoustic environment consists of a variety of
sounds. Hence, location information can be inferred from the composition of
sound events/classes present in the audio. More specifically, we adopt matrix
factorization techniques to obtain the semantic content of a recording in terms of
different sound classes. This semantic information is then combined to
identify the location of the recording.
Comment: 5 pages
Supervised cross-modal factor analysis for multiple modal data classification
In this paper we study the problem of learning from multiple modal data for
the purpose of document classification. In this problem, each document is composed
of two different modalities of data, i.e., an image and a text. Cross-modal factor
analysis (CFA) has been proposed to project the two modalities of data to
a shared data space, so that the classification of an image or a text can be
performed directly in this space. A disadvantage of CFA is that it ignores
the supervision information. In this paper, we improve CFA by incorporating the
supervision information to represent and classify both the image and text modalities of
documents. We project both image and text data to a shared data space by factor
analysis, and then train a class label predictor in the shared space to use the
class label information. The factor analysis parameter and the predictor
parameter are learned jointly by optimizing a single objective function. With
this objective function, we minimize both the distance between the projections of
the image and text of the same document and the classification error of the
projections, measured by the hinge loss function. The objective function is optimized
by an alternating optimization strategy in an iterative algorithm. Experiments on
two different multiple modal document data sets show the advantage of the
proposed algorithm over other CFA methods.
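One possible way to write down the joint objective described above, purely as a sketch. The symbols are assumptions introduced here and not the paper's notation: $x_i$ and $t_i$ are the image and text features of document $i$, $y_i \in \{-1, +1\}$ is its class label, $U$ and $V$ are the projection matrices, $(w, b)$ is a linear predictor in the shared space, and $C$ is a trade-off weight. Binary labels are assumed for simplicity, and any constraints or regularizers are omitted because the abstract does not state them.

```latex
\min_{U, V, w, b} \;
\sum_{i=1}^{n} \left\| U^\top x_i - V^\top t_i \right\|_2^2
+ C \sum_{i=1}^{n} \Big[
  \max\!\big(0,\, 1 - y_i (w^\top U^\top x_i + b)\big)
  + \max\!\big(0,\, 1 - y_i (w^\top V^\top t_i + b)\big)
\Big]
```

Under an alternating strategy of the kind the abstract mentions, one would fix $(w, b)$ and update $(U, V)$, then fix $(U, V)$ and update $(w, b)$, and repeat until the objective stops decreasing.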
Unsupervised Multi-modal Hashing for Cross-modal retrieval
With the advantages of low storage cost and high efficiency, hash learning
has received much attention in the domain of Big Data. In this paper, we
propose a novel unsupervised hash learning method to address the open
problem of directly preserving the manifold structure by hashing. To this
end, both the semantic correlation in the textual space and the local
geometric structure in the visual space are explored simultaneously in our
framework. In addition, the ℓ2,1-norm constraint is imposed on the projection
matrices to learn the discriminative hash function for each modality. Extensive
experiments are performed to evaluate the proposed method on three publicly
available datasets, and the experimental results show that our method can
achieve superior performance over the state-of-the-art methods.
Comment: 4 pages, 4 figures
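For reference, the ℓ2,1-norm of a matrix is commonly defined as the sum of the Euclidean norms of its rows, which encourages entire rows of a projection matrix to shrink toward zero. The sketch below only computes that quantity. The matrix `W` and the function name `l21_norm` are hypothetical, and how the paper actually applies the constraint is not stated in the abstract.

```python
import math


def l21_norm(matrix):
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


# Hypothetical 3x2 projection matrix (assumption); the middle row is all zeros.
W = [
    [3.0, 4.0],
    [0.0, 0.0],
    [5.0, 12.0],
]

value = l21_norm(W)  # row norms are 5, 0 and 13
print(value)
```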
Multimodal music information processing and retrieval: survey and future challenges
Towards improving the performance in various music information processing
tasks, recent studies exploit different modalities able to capture diverse
aspects of music. Such modalities include audio recordings, symbolic music
scores, mid-level representations, motion and gestural data, video recordings,
editorial or cultural tags, lyrics and album cover art. This paper critically
reviews the various approaches adopted in Music Information Processing and
Retrieval and highlights how multimodal algorithms can help Music Computing
applications. First, we categorize the related literature based on the
application it addresses. Subsequently, we analyze existing information fusion
approaches, and we conclude with the set of challenges that the Music Information
Retrieval and Sound and Music Computing research communities should focus on in
the coming years.
Dense Multimodal Fusion for Hierarchically Joint Representation
Multiple modalities can provide more valuable information than a single one by
describing the same contents in various ways. Hence, it is highly desirable to
learn an effective joint representation by fusing the features of different
modalities. However, previous methods mainly focus on fusing the shallow
features or high-level representations generated by unimodal deep networks,
which only capture part of the hierarchical correlations across modalities. In
this paper, we propose to densely integrate the representations by greedily
stacking multiple shared layers between different modality-specific networks,
which we call Dense Multimodal Fusion (DMF). The joint representations in
different shared layers can capture the correlations in different levels, and
the connection between shared layers also provides an efficient way to learn
the dependence among hierarchical correlations. These two properties jointly
contribute to the multiple learning paths in DMF, which results in faster
convergence, lower training loss, and better performance. We evaluate our model
on three typical multimodal learning tasks, including audiovisual speech
recognition, cross-modal retrieval, and multimodal classification. The
notable performance in the experiments demonstrates that our model can learn
a more effective joint representation.
Comment: 10 pages, 4 figures
Attribute-Guided Network for Cross-Modal Zero-Shot Hashing
Zero-Shot Hashing aims at learning a hashing model that is trained only on
instances from seen categories but can generalize well to those of unseen
categories. Typically, it is achieved by utilizing a semantic embedding space
to transfer knowledge from the seen domain to the unseen domain. Existing efforts
mainly focus on single-modal retrieval tasks, especially Image-Based Image
Retrieval (IBIR). However, as a highlighted research topic in the field of
hashing, cross-modal retrieval is more common in real-world applications. To
address the Cross-Modal Zero-Shot Hashing (CMZSH) retrieval task, we propose a
novel Attribute-Guided Network (AgNet), which can perform not only IBIR, but
also Text-Based Image Retrieval (TBIR). In particular, AgNet aligns different
modal data into a semantically rich attribute space, which bridges the gap
caused by modality heterogeneity and the zero-shot setting. We also design an
effective strategy that exploits the attributes to guide the generation of hash
codes for image and text within the same network. Extensive experimental
results on three benchmark datasets (AwA, SUN, and ImageNet) demonstrate the
superiority of AgNet on both cross-modal and single-modal zero-shot image
retrieval tasks.
Comment: 9 pages, 8 figures
Multimodal diffusion geometry by joint diagonalization of Laplacians
We construct an extension of diffusion geometry to multiple modalities
through joint approximate diagonalization of Laplacian matrices. This naturally
extends classical data analysis tools based on spectral geometry, such as
diffusion maps and spectral clustering. We provide several synthetic and real
examples of manifold learning, retrieval, and clustering demonstrating that the
joint diffusion geometry frequently better captures the inherent structure of
multi-modal data. We also show that many previous attempts to construct
multimodal spectral clustering can be seen as particular cases of joint
approximate diagonalization of the Laplacians.