Variational Deep Semantic Hashing for Text Documents
As the amount of textual data has been rapidly increasing over the past
decade, efficient similarity search methods have become a crucial component of
large-scale information retrieval systems. A popular strategy is to represent
original data samples by compact binary codes through hashing. A spectrum of
machine learning methods have been utilized, but they often lack expressiveness
and flexibility in modeling to learn effective representations. The recent
advances of deep learning in a wide range of applications have demonstrated its
capability to learn robust and powerful feature representations for complex
data. In particular, deep generative models naturally combine the expressiveness
of probabilistic generative models with the high capacity of deep neural
networks, which is very suitable for text modeling. However, little work has
leveraged the recent progress in deep learning for text hashing.
In this paper, we propose a series of novel deep document generative models
for text hashing. The first proposed model is unsupervised while the second one
is supervised by utilizing document labels/tags for hashing. The third model
further considers document-specific factors that affect the generation of
words. The probabilistic generative formulation of the proposed models provides
a principled framework for model extension, uncertainty estimation, simulation,
and interpretability. Based on variational inference and reparameterization,
the proposed models can be interpreted as encoder-decoder deep neural networks
and thus they are capable of learning complex nonlinear distributed
representations of the original documents. We conduct a comprehensive set of
experiments on four public testbeds. The experimental results have demonstrated
the effectiveness of the proposed supervised learning models for text hashing.
Comment: 11 pages, 4 figures
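The encoder-decoder interpretation described in the abstract can be sketched as a minimal forward pass. Everything below is an illustrative assumption rather than the paper's actual architecture: the dimensions, the Gaussian noise standing in for the reparameterization trick, and the median-threshold binarization are all placeholders, and a real model would learn the weights by maximizing a variational lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: vocabulary size and hash-code length.
vocab_size, code_bits = 1000, 32

# Randomly initialized stand-ins for trained encoder/decoder weights.
W_enc = rng.normal(0, 0.1, (code_bits, vocab_size))
W_dec = rng.normal(0, 0.1, (vocab_size, code_bits))

def encode(x, noise_scale=0.0):
    """Map a bag-of-words vector to latent activations; the additive
    Gaussian noise mimics the reparameterization trick at training time."""
    mu = W_enc @ x
    return mu + noise_scale * rng.normal(size=code_bits)

def binarize(z):
    """Threshold latent activations at their median to get a binary code."""
    return (z > np.median(z)).astype(np.uint8)

def decode(code):
    """Reconstruct a word distribution from a hash code (softmax decoder)."""
    logits = W_dec @ code
    e = np.exp(logits - logits.max())
    return e / e.sum()

doc = rng.poisson(0.05, vocab_size).astype(float)  # toy word counts
code = binarize(encode(doc))
probs = decode(code)
print(code[:8], probs.sum())  # leading bits of the code; probs sums to 1
```

At query time only the encode-and-binarize half is needed, which is what makes the learned codes usable for fast similarity search.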
Unsupervised Semantic Hashing with Pairwise Reconstruction
Semantic Hashing is a popular family of methods for efficient similarity
search in large-scale datasets. In Semantic Hashing, documents are encoded as
short binary vectors (i.e., hash codes), such that semantic similarity can be
efficiently computed using the Hamming distance. Recent state-of-the-art
approaches have utilized weak supervision to train better performing hashing
models. Inspired by this, we present Semantic Hashing with Pairwise
Reconstruction (PairRec), which is a discrete variational autoencoder based
hashing model. PairRec first encodes weakly supervised training pairs (a query
document and a semantically similar document) into two hash codes, and then
learns to reconstruct the same query document from both of these hash codes
(i.e., pairwise reconstruction). This pairwise reconstruction enables our model
to encode local neighbourhood structures within the hash code directly through
the decoder. We experimentally compare PairRec to traditional and
state-of-the-art approaches, and obtain significant performance improvements in
the task of document similarity search.
Comment: Accepted at SIGIR'2
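The efficiency claim behind semantic hashing rests on the Hamming distance being computable with a single XOR plus a popcount. A minimal sketch, with two made-up 8-bit codes for a query document and a candidate:

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary codes packed as
    integers: XOR marks the differing bits, popcount counts them."""
    return bin(a ^ b).count("1")

# Illustrative 8-bit hash codes (not from any real model).
code_query = 0b10110010
code_cand = 0b10010110
print(hamming(code_query, code_cand))  # -> 2
```

Because the operation is branch-free bit arithmetic, ranking millions of candidate codes against a query code is far cheaper than comparing dense real-valued embeddings.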
Content-aware Neural Hashing for Cold-start Recommendation
Content-aware recommendation approaches are essential for providing
meaningful recommendations for new (i.e., cold-start) items
in a recommender system. We present a content-aware neural hashing-based
collaborative filtering approach (NeuHash-CF), which generates binary hash
codes for users and items, such that the highly efficient Hamming distance can
be used for estimating user-item relevance. NeuHash-CF is modelled as an
autoencoder architecture, consisting of two joint hashing components for
generating user and item hash codes. Inspired by semantic hashing, the item
hashing component generates a hash code directly from an item's content
information (i.e., it generates cold-start and seen item hash codes in the same
manner). This contrasts with existing state-of-the-art models, which treat the
two item cases separately. User hash codes are generated directly from the user
id through a learned user embedding matrix. We show experimentally that
NeuHash-CF significantly outperforms state-of-the-art baselines by up to 12%
NDCG and 13% MRR in cold-start recommendation settings, and up to 4% in both
NDCG and MRR in standard settings where all items are present while training.
Our approach uses 2-4x shorter hash codes, while obtaining the same or better
performance compared to the state of the art, thus also enabling a notable
storage reduction.
Comment: Accepted to SIGIR 202
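The split the abstract describes — user codes from a learned embedding matrix, item codes from content alone — can be sketched in a few lines. All weights, dimensions, and names here are illustrative assumptions (random stand-ins for the trained NeuHash-CF components); the point is only that a cold-start item is hashed exactly like a seen one.

```python
import numpy as np

rng = np.random.default_rng(1)
bits, n_users, content_dim = 16, 5, 50

# User codes come from a per-user embedding matrix (learned in practice,
# random here).
user_emb = rng.normal(size=(n_users, bits))

# Item codes come from content only, so cold-start items need no special
# handling. W_item stands in for the trained item hashing component.
W_item = rng.normal(size=(bits, content_dim))

def user_code(user_id):
    return (user_emb[user_id] > 0).astype(np.uint8)

def item_code(content_vec):
    return (W_item @ content_vec > 0).astype(np.uint8)

def relevance(u, i):
    """Higher when the Hamming distance between the codes is smaller."""
    return bits - int(np.sum(u != i))

cold_item = rng.normal(size=content_dim)  # an item never seen in training
scores = [relevance(user_code(u), item_code(cold_item)) for u in range(n_users)]
print(scores)
```

Scoring every user against the new item reduces to XOR-style bit comparisons, which is what makes Hamming-space retrieval attractive at scale.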
Deep Generative Models for Semantic Text Hashing
As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack expressiveness and flexibility in modeling to learn effective representations. The recent advances of deep learning in a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing.
Meanwhile, most state-of-the-art semantic hashing approaches require large amounts of hand-labeled training data, which are often expensive and time-consuming to collect. The cost of getting labeled data is the key bottleneck in deploying these hashing methods. Furthermore, most existing text hashing approaches treat each document separately and only learn the hash codes from the content of the documents. However, in reality, documents are related to each other either explicitly, through an observed linkage such as citations, or implicitly, through unobserved connections such as adjacency in the original space. Such document relationships are pervasive in the real world, yet they are largely ignored in prior semantic hashing work.
In this thesis, we propose a series of novel deep document generative models for text hashing to address the aforementioned challenges. Based on the deep generative modeling framework, our models employ deep neural networks to learn complex mappings from the original space to the hash space. We first introduce an unsupervised model for text hashing. We then introduce supervised models that utilize document labels/tags and further consider document-specific factors that affect the generation of words.
To address the lack of labeled data, we employ unsupervised ranking methods such as BM25 to extract weak signals from the training data, and propose two deep generative semantic hashing models that leverage these weak signals for text hashing. Finally, we propose node2hash, an unsupervised deep generative model for semantic text hashing that utilizes graph context. It is designed to incorporate both document content and connection information through a probabilistic formulation, again employing deep neural networks to learn complex mappings from the original space to the hash space.
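The weak-supervision step described above can be sketched with a self-contained BM25 scorer: each unlabeled document is paired with its top-scoring BM25 neighbour, and those pairs serve as weak training signals. The toy corpus, the k1/b values, and the pairing rule below are illustrative assumptions, not the thesis's actual pipeline.

```python
import math
from collections import Counter

# Toy tokenized corpus; in practice this is the unlabeled training set.
docs = [
    "binary hash codes for text retrieval".split(),
    "deep generative models for text hashing".split(),
    "graph context for document hashing".split(),
    "convolutional networks for image classification".split(),
]

k1, b = 1.5, 0.75  # common BM25 defaults
N = len(docs)
avgdl = sum(len(d) for d in docs) / N
df = Counter(t for d in docs for t in set(d))  # document frequencies

def bm25(query, doc):
    """Okapi BM25 score of `doc` for the tokens of `query`."""
    tf = Counter(doc)
    score = 0.0
    for t in query:
        if t not in tf:
            continue
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[t] * (k1 + 1) / norm
    return score

# Weak supervision: pair each document with its top BM25 neighbour.
pairs = []
for qi, q in enumerate(docs):
    best = max((i for i in range(N) if i != qi), key=lambda i: bm25(q, docs[i]))
    pairs.append((qi, best))
print(pairs)
```

The resulting (query, neighbour) pairs require no human labels, which is exactly what makes this kind of weak signal cheap to obtain at scale.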
The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks, and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on various public testbeds. The experimental results demonstrate the effectiveness of the proposed models over competitive baselines.