Hashing for Multimedia Similarity Modeling and Large-Scale Retrieval
In recent years, the amount of multimedia data such as images, texts, and videos has been growing rapidly on the Internet. Motivated by this trend, this thesis is dedicated to exploiting hashing-based solutions to reveal multimedia data correlations and support intra-media and inter-media similarity search among huge volumes of multimedia data. We start by investigating a hashing-based solution for audio-visual similarity modeling and apply it to the audio-visual sound source localization problem. We show that synchronized signals in the audio and visual modalities demonstrate similar temporal changing patterns in certain feature spaces. We propose to use a permutation-based random hashing technique to capture the temporal order dynamics of audio and visual features by hashing them along the temporal axis into a common Hamming space. In this way, the audio-visual correlation problem is transformed into a similarity search problem in the Hamming space. Our hashing-based audio-visual similarity modeling has shown superior performance in the localization and segmentation of sounding objects in videos. The success of the permutation-based hashing method motivates us to generalize and formally define the supervised ranking-based hashing problem, and to study its application to large-scale image retrieval. Specifically, we propose an effective supervised learning procedure to learn optimized ranking-based hash functions that can be used for large-scale similarity search. Compared with the randomized version, the optimized ranking-based hash codes are much more compact and discriminative. Moreover, the method can be easily extended to kernel space to discover more complex ranking structures that cannot be revealed in linear subspaces. Experiments on large image datasets demonstrate the effectiveness of the proposed method for image retrieval. We further study the ranking-based hashing method for the cross-media similarity search problem.
Specifically, we propose two optimization methods to jointly learn two groups of linear subspaces, one for each media type, so that features' ranking orders in different linear subspaces maximally preserve the cross-media similarities. Additionally, we develop this ranking-based hashing method in the cross-media context into a flexible hashing framework with a more general solution. We have demonstrated through extensive experiments on several real-world datasets that the proposed cross-media hashing method achieves superior cross-media retrieval performance compared with several state-of-the-art algorithms. Lastly, to make better use of the supervisory label information, as well as to further improve the efficiency and accuracy of supervised hashing, we propose a novel multimedia discrete hashing framework that optimizes an instance-wise loss objective, rather than a pairwise loss, using an efficient discrete optimization method. In addition, the proposed method decouples binary code learning and hash function learning into two separate stages, thus making it equally applicable to both single-media and cross-media search. Extensive experiments on both single-media and cross-media retrieval tasks demonstrate the effectiveness of the proposed method.
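The permutation-based random hashing mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's exact construction, so the code below assumes a winner-take-all style scheme: each hash function looks at the first k entries of a random permutation of the feature vector and records which one is largest. The names make_permutations, ranking_hash, and code_similarity are hypothetical and chosen only for illustration. With k = 2 each hash yields one bit, so the codes live in a Hamming space.

```python
import random

def make_permutations(dim, n_hashes, seed=0):
    """Draw n_hashes random permutations of the feature indices 0..dim-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_hashes):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms

def ranking_hash(x, perms, k=2):
    """For each permutation, inspect the first k permuted entries of x and
    record the position (0..k-1) of the largest one. Only the relative
    order of the entries matters, not their magnitudes."""
    codes = []
    for p in perms:
        window = [x[i] for i in p[:k]]
        codes.append(max(range(k), key=lambda j: window[j]))
    return codes

def code_similarity(a, b):
    """Fraction of hash positions on which two codes agree."""
    return sum(1 for u, v in zip(a, b) if u == v) / len(a)

# Two vectors with the same rank order but different scales (made-up data).
perms = make_permutations(dim=6, n_hashes=8, seed=0)
x = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
y = [1.1, 9.0, 3.2, 7.5, 2.0, 5.5]
cx = ranking_hash(x, perms, k=2)
cy = ranking_hash(y, perms, k=2)
print(code_similarity(cx, cy))
```

Because the code depends only on rank order, two signals whose features rise and fall together receive the same code even when their scales differ, which is the property the abstract relies on when comparing audio and visual features.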
Learning effective binary representation with deep hashing technique for large-scale multimedia similarity search
The explosive growth of multimedia data in modern times has inspired research on efficient large-scale multimedia similarity search in existing information retrieval systems. In the past decades, hashing-based nearest neighbor search methods have drawn extensive attention in this research field. By representing the original data with compact hash codes, hashing enables efficient similarity retrieval that requires only bitwise operations to compute the Hamming distance. Moreover, less memory is required to process and store the massive amounts of features for search engines, owing to the compactness of binary codes. These advantages make hashing a competitive option in large-scale visual retrieval tasks. Motivated by previous work, this thesis focuses on learning compact binary representations via hashing techniques for large-scale multimedia similarity search tasks. In particular, several novel frameworks are proposed for popular hashing-based applications such as local binary descriptors for patch-level matching (Chapter 3), video-to-video retrieval (Chapter 4) and cross-modality retrieval (Chapter 5). This thesis starts by addressing the problem of learning a local binary descriptor for better patch/image matching performance. To this end, we propose a novel local descriptor termed Unsupervised Deep Binary Descriptor (UDBD) for patch-level matching tasks, which learns a transformation-invariant binary descriptor by embedding the original visual data and their transformed sets into a common Hamming space. By imposing an ℓ2,1-norm regularizer on the objective function, the learned binary descriptor gains robustness against noise. Moreover, a weak bit scheme is applied to address ambiguous matching in the local binary descriptor, where the best match is determined for each query by comparing a series of weak bits between the query instance and the candidates, thus improving the matching performance.
Furthermore, Unsupervised Deep Video Hashing (UDVH) is proposed to facilitate large-scale video-to-video retrieval. To tackle the imbalanced distribution of the video features, balanced rotation is developed to identify a proper projection matrix such that the information in each dimension is balanced under fixed-bit quantization, thus improving the retrieval performance dramatically with better code quality. To provide comprehensive insights into the proposed rotation, two different video feature learning structures, stacked LSTM units (UDVH-LSTM) and Temporal Segment Network (UDVH-TSN), are presented in Chapter 4. Lastly, we extend the research topic from single-modality to cross-modality retrieval, where Self-Supervised Deep Multimodal Hashing (SSDMH) based on matrix factorization is proposed to learn unified binary codes for different modalities directly, without the need for relaxation. By minimizing a graph regularization loss, it tends to produce discriminative hash codes by preserving the original data structure. Moreover, Binary Gradient Descent (BGD) accelerates the discrete optimization relative to bit-by-bit updates. In addition, an unsupervised version termed Unsupervised Deep Cross-Modal Hashing (UDCMH) is proposed to tackle large-scale cross-modality retrieval when prior knowledge is unavailable.
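The abstract's point that Hamming distance needs only bitwise operations can be sketched as follows. This is a generic illustration, not code from the thesis: the helper names pack_bits, hamming, and search are hypothetical, the 8-bit codes are made-up data, and the linear scan stands in for whatever index a real system would use.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one int (first bit is most significant)."""
    code = 0
    for b in bits:
        code = (code << 1) | (1 if b else 0)
    return code

def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")

def search(query, database, top_k=1):
    """Linear scan: indices of the top_k database codes closest to the query.
    sorted() is stable, so ties keep their database order."""
    order = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    return order[:top_k]

# Made-up 8-bit codes for three database items and one query.
db = [
    pack_bits([1, 0, 1, 1, 0, 0, 1, 0]),
    pack_bits([0, 0, 1, 1, 0, 0, 1, 1]),
    pack_bits([1, 1, 0, 0, 1, 1, 0, 1]),
]
q = pack_bits([1, 0, 1, 1, 0, 0, 1, 1])
print([hamming(q, c) for c in db])   # [1, 1, 6]
print(search(q, db, top_k=2))        # [0, 1]
```

Packing each code into a single integer also shows the memory argument: an 8-bit code occupies one byte instead of eight floating-point values.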
Discrete Multi-modal Hashing with Canonical Views for Robust Mobile Landmark Search
Mobile landmark search (MLS) has recently received increasing attention for its
great practical value. However, it remains unsolved due to two important
challenges. One is high bandwidth consumption of query transmission, and the
other is the huge visual variations of query images sent from mobile devices.
In this paper, we propose a novel hashing scheme, named canonical view based
discrete multi-modal hashing (CV-DMH), to handle these problems via a novel
three-stage learning procedure. First, a submodular function is designed to
measure visual representativeness and redundancy of a view set. With it,
canonical views, which capture key visual appearances of landmarks with limited
redundancy, are efficiently discovered with an iterative mining strategy.
Second, multi-modal sparse coding is applied to transform visual features from
multiple modalities into an intermediate representation. It can robustly and
adaptively characterize visual contents of varied landmark images with certain
canonical views. Finally, compact binary codes are learned on the intermediate
representation within a tailored discrete binary embedding model which
preserves visual relations of images measured with canonical views and removes
the involved noise. In this part, we develop a new augmented Lagrangian
multiplier (ALM) based optimization method to directly solve the discrete
binary codes. The method not only deals explicitly with the discrete constraint, but
also considers the bit-uncorrelated constraint and balance constraint together.
Experiments on real-world landmark datasets demonstrate the superior
performance of CV-DMH over several state-of-the-art methods.
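The balance and bit-uncorrelated constraints named in this abstract can be checked numerically on any set of codes. The sketch below uses the definitions common in the hashing literature, with codes in {-1, +1}: a bit is balanced when its mean over the dataset is zero, and bits are uncorrelated when the normalized Gram matrix of the bit columns is the identity. Whether CV-DMH uses exactly this formulation is an assumption, and bit_balance and bit_correlation are hypothetical helper names.

```python
def bit_balance(codes):
    """Mean of each bit over all codes; 0.0 means the bit is perfectly balanced."""
    n = len(codes)
    r = len(codes[0])
    return [sum(c[j] for c in codes) / n for j in range(r)]

def bit_correlation(codes):
    """Normalized Gram matrix of the bit columns; the identity matrix means
    the bits are pairwise uncorrelated."""
    n = len(codes)
    r = len(codes[0])
    return [[sum(c[i] * c[j] for c in codes) / n for j in range(r)]
            for i in range(r)]

# Made-up 2-bit codes in {-1, +1} that satisfy both constraints exactly.
codes = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
print(bit_balance(codes))       # [0.0, 0.0]
print(bit_correlation(codes))   # [[1.0, 0.0], [0.0, 1.0]]
```

Codes that satisfy both conditions use every bit fully and carry no redundant information across bits, which is why the constraints are commonly imposed during learning.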
Scalable Image Retrieval by Sparse Product Quantization
Fast Approximate Nearest Neighbor (ANN) search for high-dimensional
feature indexing and retrieval is the crux of large-scale image retrieval. A
recent promising technique is Product Quantization, which attempts to index
high-dimensional image features by decomposing the feature space into a
Cartesian product of low dimensional subspaces and quantizing each of them
separately. Despite the promising results reported, its quantization approach
follows the typical hard assignment of traditional quantization methods, which
may result in large quantization errors and thus inferior search performance.
Unlike the existing approaches, in this paper, we propose a novel approach
called Sparse Product Quantization (SPQ) to encode the high-dimensional
feature vectors into sparse representations. We optimize the sparse
representations of the feature vectors by minimizing their quantization errors,
making the resulting representation essentially close to the original data
in practice. Experiments show that the proposed SPQ technique is not only able
to compress data but is also an effective encoding technique. We obtain
state-of-the-art results for ANN search on four public image datasets, and the
promising results of content-based image retrieval further validate the
efficacy of our proposed method. Comment: 12 pages
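For context, the hard-assignment Product Quantization baseline that this abstract contrasts with SPQ can be sketched in a few lines. This is not the SPQ method itself, whose sparse encoding is not specified in the abstract. The codebooks here are tiny made-up examples (in practice they would be learned, typically with k-means on each subspace), and pq_encode and pq_decode are hypothetical names.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pq_encode(x, codebooks):
    """Split x into len(codebooks) equal sub-vectors and assign each one to
    its single nearest centroid (hard assignment)."""
    d = len(x) // len(codebooks)
    code = []
    for m, book in enumerate(codebooks):
        sub = x[m * d:(m + 1) * d]
        code.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return code

def pq_decode(code, codebooks):
    """Reconstruct an approximation of x by concatenating the chosen centroids."""
    out = []
    for idx, book in zip(code, codebooks):
        out.extend(book[idx])
    return out

# Two subspaces of dimension 2, each with two made-up centroids.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
x = [0.9, 0.8, 0.2, 0.9]
code = pq_encode(x, codebooks)
x_hat = pq_decode(code, codebooks)
print(code)                 # [1, 0]
print(x_hat)                # [1.0, 1.0, 0.0, 1.0]
print(sq_dist(x, x_hat))    # quantization error, about 0.10
```

The residual printed on the last line is the quantization error of hard assignment. According to the abstract, SPQ reduces this error by representing each sub-vector sparsely rather than by a single centroid.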
Zero-Shot Hashing via Transferring Supervised Knowledge
Hashing has shown its efficiency and effectiveness in facilitating
large-scale multimedia applications. Supervised knowledge (e.g., semantic labels
or pair-wise relationships) associated with data is capable of significantly
improving the quality of hash codes and hash functions. However, confronted
with the rapid growth of newly-emerging concepts and multimedia data on the
Web, existing supervised hashing approaches may easily suffer from the scarcity
and validity of supervised information due to the expensive cost of manual
labelling. In this paper, we propose a novel hashing scheme, termed
zero-shot hashing (ZSH), which compresses images of "unseen" categories
to binary codes with hash functions learned from limited training data of
"seen" categories. Specifically, we project independent data labels (i.e.,
0/1-form label vectors) into a semantic embedding space, where semantic
relationships among all the labels can be precisely characterized and thus seen
supervised knowledge can be transferred to unseen classes. Moreover, in order
to cope with the semantic shift problem, we rotate the embedded space to more
suitably align the embedded semantics with the low-level visual feature space,
thereby alleviating the influence of the semantic gap. In the meantime, to exert
positive effects on learning high-quality hash functions, we further propose to
preserve the local structural property and discrete nature of binary codes.
Besides, we develop an efficient alternating algorithm to solve the ZSH model.
Extensive experiments conducted on various real-life datasets show the superior
zero-shot image retrieval performance of ZSH as compared to several
state-of-the-art hashing methods. Comment: 11 pages