220 research outputs found

    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector of the user's network connections, and pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold $\tau$. In contrast to previous work where $\tau$ is assumed to be quite close to 1, we focus on recommendation applications where $\tau$ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small $\tau$. To the best of our knowledge, there is no practical solution for computing all user pairs with, say, $\tau = 0.2$ on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
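
The abstract above names SimHash random projections as one ingredient of WHIMP. The following is a minimal sketch of that single ingredient, not of WHIMP itself: it estimates cosine similarity from the fraction of differing sign bits. The function names, the dense NumPy representation, and the parameter `n_bits` are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def simhash_signatures(vectors, n_bits, seed=0):
    """Return an (n_vectors, n_bits) boolean array of sign bits.

    Each bit records on which side of a random hyperplane a vector falls.
    All names here are illustrative, not taken from the WHIMP paper.
    """
    rng = np.random.default_rng(seed)
    # One Gaussian random hyperplane normal per output bit.
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    return (vectors @ planes) >= 0


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two SimHash signatures.

    For random hyperplanes, P[bits differ] = angle / pi, so the fraction
    of differing bits gives an estimate of the angle between the vectors.
    """
    differing_fraction = np.mean(sig_a != sig_b)
    return float(np.cos(np.pi * differing_fraction))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    a = rng.standard_normal(100)
    b = a + 0.8 * rng.standard_normal(100)
    sigs = simhash_signatures(np.stack([a, b]), n_bits=2048)
    exact = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    print("exact:", exact, "estimated:", estimated_cosine(sigs[0], sigs[1]))
```

A pair would then be kept as a candidate when its estimated similarity exceeds the threshold $\tau$. More bits give a tighter estimate at the cost of longer signatures.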

    Clustering, Hamming Embedding, Generalized LSH and the Max Norm

    We study the convex relaxation of clustering and Hamming embedding, focusing on the asymmetric case (co-clustering and asymmetric Hamming embedding), understanding their relationship to LSH as studied by (Charikar 2002) and to the max-norm ball, and the differences between their symmetric and asymmetric versions. Comment: 17 pages

    Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing

    All pairs similarity search is a problem where a set of data objects is given and the task is to find all pairs of objects that have similarity above a certain threshold for a given similarity measure of interest. When the number of points or the dimensionality is high, standard solutions fail to scale gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some extent and provide substantial speedup over traditional index-based approaches. BayesLSH is used for pruning the candidate space and computing approximate similarity, whereas BayesLSHLite can only prune the candidates, and similarity must then be computed exactly on the original data. Thus, wherever the explicit data representation is available and exact similarity computation is not too expensive, BayesLSHLite can be used to aggressively prune candidates and provide substantial speedup without losing too much quality. However, the loss in quality is higher in the BayesLSH variant, where the explicit data representation is not available, only a hash sketch is available, and similarity has to be estimated approximately. In this work we revisit the LSH problem from a Frequentist setting and formulate sequential tests for composite hypotheses (similarity greater than or less than the threshold) that can be leveraged by such LSH algorithms for adaptively and aggressively pruning candidates. We propose a vanilla sequential probability ratio test (SPRT) approach based on this idea and two novel variants. We extend these variants to the case where approximate similarity needs to be computed, using a fixed-width sequential confidence interval generation technique.
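
To make the sequential-testing idea in the abstract above concrete, here is a minimal sketch of Wald's classical SPRT applied to a stream of hash matches for one candidate pair. It is a simplification under stated assumptions: the two composite hypotheses are replaced by two point hypotheses about the per-hash collision probability (`p0` for "dissimilar", `p1` for "similar"), and the function name, parameters, and return labels are hypothetical rather than taken from the paper.

```python
import math


def sprt_prune(matches, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT on a stream of booleans (True = the two hashes agree).

    H0: collision probability is p0 (pair is below the threshold).
    H1: collision probability is p1 (pair is above the threshold), p1 > p0.
    alpha and beta are the target false-positive and false-negative rates.
    Returns (decision, number_of_hashes_examined).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    step_match = math.log(p1 / p0)
    step_mismatch = math.log((1 - p1) / (1 - p0))

    log_likelihood_ratio = 0.0
    examined = 0
    for examined, match in enumerate(matches, start=1):
        log_likelihood_ratio += step_match if match else step_mismatch
        if log_likelihood_ratio >= upper:
            return "accept_h1", examined
        if log_likelihood_ratio <= lower:
            return "accept_h0", examined
    # The stream ended before either boundary was crossed.
    return "undecided", examined


if __name__ == "__main__":
    # A pair whose hashes mostly disagree is pruned after a few comparisons.
    print(sprt_prune([False, False, True, False, False, False], p0=0.5, p1=0.8))
```

The appeal of a sequential test in this setting is that clearly dissimilar pairs are rejected after only a handful of hash comparisons, while borderline pairs consume more of the sketch.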

    Where could we go? Recommendations for groups in location-based social networks

    openaire: EC/H2020/654024/EU//SoBigData. Location-Based Social Networks (LBSNs) enable their users to share with their friends the places they go to and whom they go with. Additionally, they provide users with recommendations for Points of Interest (POIs) they have not visited before. This functionality is of great importance for users of LBSNs, as it allows them to discover interesting places in populous cities that are not easy to explore. For this reason, previous research has focused on providing recommendations to LBSN users. Nevertheless, while most existing work focuses on recommendations for individual users, techniques to provide recommendations to groups of users are scarce. In this paper, we consider the problem of recommending a list of POIs to a group of users in the areas that the group frequents. Our data consist of activity on Swarm, a social networking app by Foursquare, and our results demonstrate that our proposed Geo-Group-Recommender (GGR), a class of hybrid recommender systems that combine the group geographical preferences using Kernel Density Estimation, category and location features, and group check-ins, outperforms a large number of other recommender systems. Moreover, we find evidence that user preferences differ both in venue category and in location between individual and group activities. We also show that combining individual recommendations using group aggregation strategies is not as good as building a profile for a group. Our experiments show that GGR outperforms the baselines in terms of precision and recall at different cutoffs. Peer reviewed

    Scalability and Total Recall with Fast CoveringLSH

    Locality-sensitive hashing (LSH) has emerged as the dominant algorithmic technique for similarity search with strong performance guarantees in high-dimensional spaces. A drawback of traditional LSH schemes is that they may have false negatives, i.e., the recall is less than 100%. This limits the applicability of LSH in settings requiring precise performance guarantees. Building on the recent theoretical "CoveringLSH" construction that eliminates false negatives, we propose a fast and practical covering LSH scheme for Hamming space called Fast CoveringLSH (fcLSH). Inheriting the design benefits of CoveringLSH, our method avoids false negatives and always reports all near neighbors. Compared to CoveringLSH we achieve an asymptotic improvement to the hash function computation time from $\mathcal{O}(dL)$ to $\mathcal{O}(d + L\log{L})$, where $d$ is the dimensionality of data and $L$ is the number of hash tables. Our experiments on synthetic and real-world data sets demonstrate that fcLSH is comparable (and often superior) to traditional hashing-based approaches for search radius up to 20 in high-dimensional Hamming space. Comment: Short version appears in Proceedings of CIKM 201

    Zero-Shot Hashing via Transferring Supervised Knowledge

    Hashing has shown its efficiency and effectiveness in facilitating large-scale multimedia applications. Supervised knowledge (e.g., semantic labels or pairwise relationships) associated with data is capable of significantly improving the quality of hash codes and hash functions. However, confronted with the rapid growth of newly emerging concepts and multimedia data on the Web, existing supervised hashing approaches may easily suffer from the scarcity and validity of supervised information due to the expensive cost of manual labelling. In this paper, we propose a novel hashing scheme, termed zero-shot hashing (ZSH), which compresses images of "unseen" categories to binary codes with hash functions learned from limited training data of "seen" categories. Specifically, we project independent data labels (i.e., 0/1-form label vectors) into a semantic embedding space, where semantic relationships among all the labels can be precisely characterized and thus seen supervised knowledge can be transferred to unseen classes. Moreover, in order to cope with the semantic shift problem, we rotate the embedded space to more suitably align the embedded semantics with the low-level visual feature space, thereby alleviating the influence of the semantic gap. In the meantime, to exert positive effects on learning high-quality hash functions, we further propose to preserve the local structural property and discrete nature of binary codes. Besides, we develop an efficient alternating algorithm to solve the ZSH model. Extensive experiments conducted on various real-life datasets show the superior zero-shot image retrieval performance of ZSH as compared to several state-of-the-art hashing methods. Comment: 11 pages

    Variational Deep Semantic Hashing for Text Documents

    As the amount of textual data has been rapidly increasing over the past decade, efficient similarity search methods have become a crucial component of large-scale information retrieval systems. A popular strategy is to represent original data samples by compact binary codes through hashing. A spectrum of machine learning methods have been utilized, but they often lack the expressiveness and modeling flexibility needed to learn effective representations. Recent advances in deep learning across a wide range of applications have demonstrated its capability to learn robust and powerful feature representations for complex data. In particular, deep generative models naturally combine the expressiveness of probabilistic generative models with the high capacity of deep neural networks, which is very suitable for text modeling. However, little work has leveraged the recent progress in deep learning for text hashing. In this paper, we propose a series of novel deep document generative models for text hashing. The first proposed model is unsupervised, while the second one is supervised by utilizing document labels/tags for hashing. The third model further considers document-specific factors that affect the generation of words. The probabilistic generative formulation of the proposed models provides a principled framework for model extension, uncertainty estimation, simulation, and interpretability. Based on variational inference and reparameterization, the proposed models can be interpreted as encoder-decoder deep neural networks and thus they are capable of learning complex nonlinear distributed representations of the original documents. We conduct a comprehensive set of experiments on four public testbeds. The experimental results have demonstrated the effectiveness of the proposed supervised learning models for text hashing. Comment: 11 pages, 4 figures

    Parameterized Complexity of the k-anonymity Problem

    The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has been recently proposed is $k$-anonymity. This approach requires that the rows of a table are partitioned into clusters of size at least $k$ and that all the rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the record values are over a binary alphabet and $k=3$, and when the records have length at most 8 and $k=4$. In this paper we study how the complexity of the problem is influenced by different parameters. We follow this direction of research, first showing that the problem is W[1]-hard when parameterized by the size of the solution (and the value $k$). Then we exhibit a fixed-parameter algorithm for the problem parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the $k$-anonymity problem when restricting the instance to records having length bounded by 3 and $k=3$. We show that the problem under this restriction is APX-hard. Comment: 22 pages, 2 figures
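
The objective described in the abstract above (the number of suppressed entries for a given partition of the rows) can be illustrated with a short sketch. It only evaluates the cost of a partition supplied by the caller; it does not search for an optimal one, which is the hard problem the paper studies. The function name and the input representation (rows as equal-length strings, clusters as lists of row indices) are assumptions made for illustration.

```python
def suppression_cost(rows, clusters, k):
    """Count the entries that must be suppressed to make each cluster uniform.

    rows:     list of equal-length sequences (for example strings over {0, 1}).
    clusters: list of lists of row indices forming a partition of the rows.
    k:        minimum allowed cluster size.
    """
    covered = sorted(index for cluster in clusters for index in cluster)
    if covered != list(range(len(rows))):
        raise ValueError("clusters must partition the row indices")

    total = 0
    for cluster in clusters:
        if len(cluster) < k:
            raise ValueError("every cluster needs at least k rows")
        members = [rows[index] for index in cluster]
        # A column must be suppressed in every row of the cluster
        # as soon as two rows of the cluster disagree on it.
        mixed_columns = sum(1 for column in zip(*members) if len(set(column)) > 1)
        total += mixed_columns * len(cluster)
    return total


if __name__ == "__main__":
    table = ["0011", "0010", "0110"]
    # Columns 1 and 3 disagree, so 2 columns x 3 rows are suppressed.
    print(suppression_cost(table, [[0, 1, 2]], k=3))
```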