When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
Finding similar user pairs is a fundamental task in social networks, with
numerous applications in ranking and personalization tasks such as link
prediction and tie strength detection. A common manifestation of user
similarity is based upon network structure: each user is represented by a
vector encoding the user's network connections, and pairwise cosine
similarity among these vectors defines user similarity. The predominant task
for user similarity applications is to discover all similar pairs whose
pairwise cosine similarity exceeds a given threshold $\sigma$. In contrast to
previous work, where $\sigma$ is assumed to be quite close to 1, we focus on
recommendation applications where $\sigma$ is small but still meaningful. The
all-pairs cosine similarity problem is computationally challenging on networks
with billions of edges, and especially so for settings with small $\sigma$. To
the best of our knowledge, there is no practical solution for computing all
user pairs with, say $\sigma > 0.2$, on large social networks, even using the
power of distributed algorithms.
Our work directly addresses this challenge by introducing a new algorithm ---
WHIMP --- that solves this problem efficiently in the MapReduce model. The key
insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for
approximate matrix multiplication with the SimHash random projection techniques
of Charikar. We provide a theoretical analysis of WHIMP, proving that it has
near optimal communication costs while maintaining computation cost comparable
with the state of the art. We also empirically demonstrate WHIMP's scalability
by computing all highly similar pairs on four massive data sets, and show that
it accurately finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of billions
of edges.
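The two ingredients the abstract names can be illustrated concretely. Below is a minimal single-machine sketch, not the paper's MapReduce implementation: wedge sampling in the style of Cohen-Lewis proposes candidate pairs with frequency proportional to their dot products (assuming non-negative vectors, as with adjacency data), and SimHash signatures in the style of Charikar give cheap cosine estimates for filtering. All names, thresholds, and sample counts are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def simhash_signatures(X, n_bits=256):
    """SimHash (Charikar): one bit per random hyperplane, sign of projection."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes) >= 0  # boolean matrix, one signature row per user

def estimated_cosine(sig_a, sig_b):
    """Pr[bits agree] = 1 - theta/pi, so cosine ~ cos(pi * disagreement rate)."""
    return np.cos(np.pi * np.mean(sig_a != sig_b))

def sample_wedges(X, n_samples):
    """Cohen-Lewis wedge sampling for non-negative X: pick a wedge centre
    (a coordinate) with probability proportional to its squared column weight,
    then two endpoints proportional to their weight on that coordinate.
    A pair (i, j) then appears with probability proportional to (X X^T)_ij."""
    col_w = X.sum(axis=0)
    col_p = col_w**2 / np.sum(col_w**2)
    counts = {}
    for _ in range(n_samples):
        c = rng.choice(X.shape[1], p=col_p)
        i, j = rng.choice(X.shape[0], size=2, p=X[:, c] / col_w[c])
        if i != j:
            pair = (min(i, j), max(i, j))
            counts[pair] = counts.get(pair, 0) + 1
    return counts

# Candidates proposed by wedges, then filtered by the cheap SimHash estimate.
X = (rng.random((100, 40)) < 0.15).astype(float)   # toy 0/1 "connection" vectors
sigs = simhash_signatures(X)
for (i, j), c in sample_wedges(X, 50000).items():
    if c >= 5 and estimated_cosine(sigs[i], sigs[j]) >= 0.2:
        print(i, j, round(estimated_cosine(sigs[i], sigs[j]), 3))
```

The division of labour mirrors the abstract: wedge counts concentrate work on pairs with large dot products, while the signatures decide cheaply which surviving pairs clear the cosine threshold.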
Clustering, Hamming Embedding, Generalized LSH and the Max Norm
We study the convex relaxations of clustering and Hamming embedding, focusing
on the asymmetric case (co-clustering and asymmetric Hamming embedding),
understanding their relationship to LSH as studied by Charikar (2002) and to
the max-norm ball, and the differences between their symmetric and asymmetric
versions.
Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing
All-pairs similarity search is a problem where a set of data objects is given
and the task is to find all pairs of objects whose similarity is above a
certain threshold for a given similarity measure of interest. When the number
of points or the dimensionality is high, standard solutions fail to scale
gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and
its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some
extent and provide substantial speedup over traditional index-based
approaches. BayesLSH is used for pruning the candidate space and computing
approximate similarity, whereas BayesLSHLite can only prune the candidates;
similarity must then be computed exactly on the original data. Thus, wherever
the explicit data representation is available and exact similarity computation
is not too expensive, BayesLSHLite can be used to aggressively prune candidates
and provide substantial speedup without losing much quality. However, the loss
in quality is higher in the BayesLSH variant, where the explicit data
representation is not available and only a hash sketch is at hand, so
similarity has to be estimated approximately. In this work we revisit the LSH
problem from a frequentist setting and formulate sequential tests for composite
hypotheses (similarity greater than or less than the threshold) that can be
leveraged by such LSH algorithms for adaptively and aggressively pruning
candidates. We propose a vanilla sequential probability ratio test (SPRT)
approach based on this idea, along with two novel variants, and extend these
variants to the case where approximate similarity must be computed using a
fixed-width sequential confidence interval generation technique.
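To make the sequential-testing idea concrete, here is a minimal sketch of a vanilla Wald SPRT applied to SimHash sketches for cosine similarity. The indifference-zone width `eps` and the error rates `alpha` and `beta` are illustrative parameters; the paper's actual tests may differ in detail.

```python
import math
import random

def match_prob(sim):
    """SimHash collision probability for cosine similarity `sim`:
    Pr[bits agree] = 1 - arccos(sim)/pi (Charikar)."""
    return 1.0 - math.acos(max(-1.0, min(1.0, sim))) / math.pi

def sprt_prune(bit_matches, threshold, eps=0.05, alpha=0.05, beta=0.05):
    """Wald SPRT on a stream of SimHash bit-match outcomes. Tests
    H0: similarity <= threshold - eps  vs  H1: similarity >= threshold + eps.
    Returns 'accept' (likely similar), 'prune', or 'undecided' if the
    stream ends before either boundary is crossed."""
    p0 = match_prob(threshold - eps)       # Bernoulli rate under H0
    p1 = match_prob(threshold + eps)       # Bernoulli rate under H1
    upper = math.log((1 - beta) / alpha)   # cross above: accept H1
    lower = math.log(beta / (1 - alpha))   # cross below: accept H0, prune
    llr = 0.0
    for m in bit_matches:                  # m is True iff the sketches agree on a bit
        llr += math.log(p1 / p0) if m else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept"
        if llr <= lower:
            return "prune"
    return "undecided"

# Toy usage: two 256-bit sketches that agree on ~90% of bits.
rnd = random.Random(0)
sketch_a = [rnd.randrange(2) for _ in range(256)]
sketch_b = [b if rnd.random() < 0.9 else 1 - b for b in sketch_a]
print(sprt_prune((a == b for a, b in zip(sketch_a, sketch_b)), threshold=0.2))
```

Each candidate pair consumes sketch bits only until one boundary is crossed, which is what makes the pruning adaptive: clearly dissimilar pairs are rejected after very few bits.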
Where could we go? Recommendations for groups in location-based social networks
Location-Based Social Networks (LBSNs) enable their users to share with their friends the places they go to and whom they go with. Additionally, they provide users with recommendations for Points of Interest (POIs) they have not visited before. This functionality is of great importance for users of LBSNs, as it allows them to discover interesting places in populous cities that are not easy to explore. For this reason, previous research has focused on providing recommendations to LBSN users. Nevertheless, while most existing work focuses on recommendations for individual users, techniques to provide recommendations to groups of users are scarce. In this paper, we consider the problem of recommending a list of POIs to a group of users in the areas that the group frequents. Our data consist of activity on Swarm, a social networking app by Foursquare, and our results demonstrate that our proposed Geo-Group-Recommender (GGR), a class of hybrid recommender systems that combines group geographical preferences modelled with Kernel Density Estimation, category and location features, and group check-ins, outperforms a large number of other recommender systems. Moreover, we find evidence that user preferences differ both in venue category and in location between individual and group activities. We also show that combining individual recommendations using group aggregation strategies is not as good as building a profile for the group. Our experiments show that GGR outperforms the baselines in terms of precision and recall at different cutoffs.
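As a concrete illustration of the geographical component, the sketch below fits a Kernel Density Estimate to a group's past check-in coordinates and scores candidate POIs by that density, then blends it with a second signal. This is a hedged reconstruction of the general idea, not the paper's GGR implementation; the blending weight and normalisation are hypothetical choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def geo_scores(group_checkins, candidate_pois):
    """Density of the group's past check-ins at each candidate POI.
    Both arguments are arrays of (lat, lon) rows; treating coordinates as
    Euclidean is a simplification that is reasonable at city scale."""
    kde = gaussian_kde(group_checkins.T)      # scipy expects shape (dims, n)
    return kde(candidate_pois.T)

def hybrid_scores(geo, category, weight=0.5):
    """Blend the geographical signal with, e.g., a category-preference score."""
    geo = geo / (geo.max() + 1e-12)           # normalise each signal to [0, 1]
    category = category / (category.max() + 1e-12)
    return weight * geo + (1 - weight) * category

# Toy usage: ten check-ins around one area, three candidate venues.
rng = np.random.default_rng(1)
checkins = np.array([[60.17, 24.94]]) + 0.01 * rng.standard_normal((10, 2))
pois = np.array([[60.17, 24.94], [60.20, 24.90], [61.00, 25.50]])
print(hybrid_scores(geo_scores(checkins, pois), np.array([0.2, 0.9, 0.5])))
```

Building one KDE per group, rather than averaging per-member scores, matches the paper's finding that a dedicated group profile beats aggregating individual recommendations.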
Scalability and Total Recall with Fast CoveringLSH
Locality-sensitive hashing (LSH) has emerged as the dominant algorithmic
technique for similarity search with strong performance guarantees in
high-dimensional spaces. A drawback of traditional LSH schemes is that they may
have \emph{false negatives}, i.e., the recall is less than 100\%. This limits
the applicability of LSH in settings requiring precise performance guarantees.
Building on the recent theoretical "CoveringLSH" construction that eliminates
false negatives, we propose a fast and practical covering LSH scheme for
Hamming space called \emph{Fast CoveringLSH (fcLSH)}. Inheriting the design
benefits of CoveringLSH, our method avoids false negatives and always reports
all near neighbors. Compared to CoveringLSH, we achieve an asymptotic
improvement to the hash function computation time from $O(dL)$ to
$O(d + L)$, where $d$ is the dimensionality of the data and $L$ is
the number of hash tables. Our experiments on synthetic and real-world data
sets demonstrate that \emph{fcLSH} is comparable (and often superior) to
traditional hashing-based approaches for search radius up to 20 in
high-dimensional Hamming space.
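The covering construction this work builds on can be sketched compactly. Below is a minimal version of Pagh's CoveringLSH for Hamming radius r, assuming the standard construction (a random map from coordinates to nonzero (r+1)-bit vectors); the fcLSH speedup itself, which accelerates computing all 2^(r+1) - 1 projections, is not reproduced here.

```python
import random

def build_covering_masks(d, r, seed=0):
    """CoveringLSH (Pagh): 2^(r+1) - 1 coordinate masks built from a random
    map phi: [d] -> {0,1}^(r+1). For two points at Hamming distance <= r, the
    <= r vectors phi(i) on their differing coordinates span a proper subspace
    of GF(2)^(r+1), so some nonzero v is orthogonal to all of them, and the
    mask for that v avoids every differing coordinate: a guaranteed collision,
    hence no false negatives."""
    rnd = random.Random(seed)
    phi = [rnd.randrange(1, 2 ** (r + 1)) for _ in range(d)]
    return [
        [i for i in range(d) if bin(phi[i] & v).count("1") % 2 == 1]
        for v in range(1, 2 ** (r + 1))
    ]

def bucket_key(point, mask):
    """Hash key for one table: the point restricted to the mask's coordinates."""
    return tuple(point[i] for i in mask)

# Two points within Hamming distance r = 2 always share at least one bucket.
masks = build_covering_masks(d=32, r=2)
rnd = random.Random(7)
x = [rnd.randrange(2) for _ in range(32)]
y = list(x); y[3] ^= 1; y[17] ^= 1
assert any(bucket_key(x, m) == bucket_key(y, m) for m in masks)
```

The exponential dependence on r explains why the experiments target moderate search radii: the number of tables is 2^(r+1) - 1, so r = 20 already means millions of projections.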
Zero-Shot Hashing via Transferring Supervised Knowledge
Hashing has shown its efficiency and effectiveness in facilitating
large-scale multimedia applications. Supervised knowledge (e.g., semantic
labels or pairwise relationships) associated with data is capable of
significantly improving the quality of hash codes and hash functions. However,
confronted with the rapid growth of newly emerging concepts and multimedia data
on the Web, existing supervised hashing approaches may easily suffer from the
scarcity and validity of supervised information due to the expensive cost of
manual labelling. In this paper, we propose a novel hashing scheme, termed
\emph{zero-shot hashing} (ZSH), which compresses images of "unseen" categories
to binary codes with hash functions learned from limited training data of
"seen" categories. Specifically, we project independent data labels (i.e.,
0/1-form label vectors) into a semantic embedding space, where semantic
relationships among all the labels can be precisely characterized and the
supervised knowledge of seen classes can thus be transferred to unseen ones.
Moreover, to cope with the semantic shift problem, we rotate the embedding
space to better align the embedded semantics with the low-level visual feature
space, thereby alleviating the influence of the semantic gap. Meanwhile, to
exert positive effects on learning high-quality hash functions, we further
propose to preserve the local structural property and discrete nature of binary
codes. Besides, we develop an efficient alternating algorithm to solve the ZSH
model. Extensive experiments conducted on various real-life datasets show the
superior zero-shot image retrieval performance of ZSH as compared to several
state-of-the-art hashing methods.
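The rotation step described above has a classic closed-form relative: orthogonal Procrustes alignment between the semantic embeddings and the visual feature space. The sketch below shows that step in isolation, under the assumptions that both spaces share a common dimensionality and that label vectors are embedded by averaging their active labels' embeddings; the paper's actual objective is solved inside an alternating scheme together with the hashing losses, so this is an interpretation, not the method itself.

```python
import numpy as np

def embed_labels(Y, label_embeddings):
    """Map 0/1 label vectors into the semantic space by averaging the
    embeddings of each sample's active labels (an illustrative choice)."""
    return (Y @ label_embeddings) / np.maximum(Y.sum(axis=1, keepdims=True), 1)

def align_rotation(S, V):
    """Orthogonal Procrustes: the rotation R minimising ||S R - V||_F over
    orthogonal matrices is R = U W^T, with U, W from the SVD of S^T V."""
    U, _, Wt = np.linalg.svd(S.T @ V)
    return U @ Wt

# Toy usage: 30 samples, 5 possible labels, 8-dimensional common space.
rng = np.random.default_rng(0)
Y = (rng.random((30, 5)) < 0.3).astype(float)   # 0/1-form label vectors
E = rng.standard_normal((5, 8))                 # per-label semantic embeddings
V = rng.standard_normal((30, 8))                # projected visual features
R = align_rotation(embed_labels(Y, E), V)       # rotate semantics toward visuals
```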
Variational Deep Semantic Hashing for Text Documents
As the amount of textual data has been rapidly increasing over the past
decade, efficient similarity search methods have become a crucial component of
large-scale information retrieval systems. A popular strategy is to represent
original data samples by compact binary codes through hashing. A spectrum of
machine learning methods has been utilized, but these methods often lack the
expressiveness and flexibility needed to learn effective representations. The
recent advances of deep learning in a wide range of applications have
demonstrated its capability to learn robust and powerful feature
representations for complex data. In particular, deep generative models
naturally combine the expressiveness of probabilistic generative models with
the high capacity of deep neural networks, which is very suitable for text
modeling. However, little work has leveraged this recent progress in deep
learning for text hashing.
In this paper, we propose a series of novel deep document generative models
for text hashing. The first proposed model is unsupervised while the second one
is supervised by utilizing document labels/tags for hashing. The third model
further considers document-specific factors that affect the generation of
words. The probabilistic generative formulation of the proposed models provides
a principled framework for model extension, uncertainty estimation, simulation,
and interpretability. Based on variational inference and reparameterization,
the proposed models can be interpreted as encoder-decoder deep neural networks
and thus they are capable of learning complex nonlinear distributed
representations of the original documents. We conduct a comprehensive set of
experiments on four public testbeds. The experimental results have demonstrated
the effectiveness of the proposed supervised learning models for text hashing.
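Two mechanics shared by the proposed models can be sketched briefly: the reparameterization step that makes sampling from the Gaussian encoder differentiable, and the conversion of latent representations into binary hash codes. The sketch below uses a single linear encoder layer and median thresholding for binarisation; both are simplifying stand-ins, and the training objective (the variational lower bound) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(X, W_mu, W_logvar):
    """Gaussian encoder: per-document mean and log-variance of the latent
    code (one linear layer here; the paper uses a deeper network)."""
    return X @ W_mu, X @ W_logvar

def reparameterize(mu, logvar):
    """z = mu + sigma * eps: sampling stays differentiable in mu and sigma,
    which is what lets the encoder-decoder be trained end to end."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def to_hash_codes(mu):
    """Binarise latent means at the per-dimension median so each bit is
    balanced across the collection (a common post-hoc choice)."""
    return (mu > np.median(mu, axis=0)).astype(np.uint8)

# Toy usage: 100 documents over a 500-term vocabulary, 16-bit codes.
X = rng.random((100, 500))
W_mu, W_logvar = rng.standard_normal((500, 16)), rng.standard_normal((500, 16))
mu, logvar = encode(X, W_mu, W_logvar)
z = reparameterize(mu, logvar)     # fed to the decoder during training
codes = to_hash_codes(mu)          # 16-bit hash codes used for retrieval
```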
Parameterized Complexity of the k-anonymity Problem
The problem of publishing personal data without giving up privacy is becoming
increasingly important. An interesting formalization that has been recently
proposed is $k$-anonymity. This approach requires that the rows of a table are
partitioned into clusters of size at least $k$ and that all the rows in a
cluster become identical tuples, after the suppression of some entries. The
natural optimization problem, where the goal is to minimize the number of
suppressed entries, is known to be APX-hard even when the record values are
over a binary alphabet and even when the records have length at most 8, in
both cases for small constant values of $k$. In this paper we study how the
complexity of the problem is influenced by different parameters, first showing
that the problem is W[1]-hard when parameterized by the size of the solution
(and the value of $k$). Then we exhibit a fixed-parameter algorithm when the
problem is parameterized by the size of the alphabet and the number of
columns. Finally, we investigate the computational (and approximation)
complexity of the $k$-anonymity problem when restricting the instances to
records of length bounded by 3 with a small constant $k$, and show that such a
restriction is APX-hard.
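The optimization objective is easy to make concrete. In the sketch below, a cluster of rows pays one suppression per row for every column on which its rows disagree, and the (exponential-time) brute force tries every partition into clusters of size at least k. It only pins down the cost function on toy tables; it is not one of the paper's algorithms.

```python
from itertools import combinations

def cluster_cost(rows, table):
    """Entries suppressed when these rows are forced to become identical:
    every row pays for each column on which the cluster disagrees."""
    cols_disagree = sum(
        1 for c in range(len(table[0]))
        if len({table[r][c] for r in rows}) > 1
    )
    return cols_disagree * len(rows)

def k_anonymity_cost(table, k):
    """Brute-force optimum over all partitions into clusters of size >= k.
    Exponential time; only meant to make the objective concrete."""
    def best(remaining):
        if not remaining:
            return 0
        first = min(remaining)
        rest = remaining - {first}
        best_cost = float("inf")
        # try every cluster containing `first`, of size k .. len(remaining)
        for size in range(k - 1, len(remaining)):
            for others in combinations(sorted(rest), size):
                cluster = (first, *others)
                best_cost = min(best_cost,
                                cluster_cost(cluster, table)
                                + best(rest - set(others)))
        return best_cost
    return best(frozenset(range(len(table))))

# Toy usage: pairing rows that share a value suppresses one column per cluster.
table = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]
print(k_anonymity_cost(table, k=2))   # 4: two clusters, 2 suppressions each
```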