845 research outputs found

    When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors

    Finding similar user pairs is a fundamental task in social networks, with numerous applications in ranking and personalization tasks such as link prediction and tie strength detection. A common manifestation of user similarity is based upon network structure: each user is represented by a vector that represents the user's network connections, where pairwise cosine similarity among these vectors defines user similarity. The predominant task for user similarity applications is to discover all similar pairs that have a pairwise cosine similarity value larger than a given threshold $\tau$. In contrast to previous work where $\tau$ is assumed to be quite close to 1, we focus on recommendation applications where $\tau$ is small, but still meaningful. The all pairs cosine similarity problem is computationally challenging on networks with billions of edges, and especially so for settings with small $\tau$. To the best of our knowledge, there is no practical solution for computing all user pairs with, say, $\tau = 0.2$ on large social networks, even using the power of distributed algorithms. Our work directly addresses this challenge by introducing a new algorithm, WHIMP, that solves this problem efficiently in the MapReduce model. The key insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for approximate matrix multiplication with the SimHash random projection techniques of Charikar. We provide a theoretical analysis of WHIMP, proving that it has near optimal communication costs while maintaining computation cost comparable with the state of the art. We also empirically demonstrate WHIMP's scalability by computing all highly similar pairs on four massive data sets, and show that it accurately finds high similarity pairs. In particular, we note that WHIMP successfully processes the entire Twitter network, which has tens of billions of edges.
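For readers unfamiliar with the SimHash ingredient mentioned in this abstract, the following is a minimal single-machine sketch of estimating cosine similarity from random-hyperplane signatures. It is not WHIMP and has no MapReduce or wedge-sampling component; the function names, the toy vectors `u` and `v`, and the choice of 2048 bits are all assumptions made for illustration.

```python
import math
import random


def simhash_signature(vec, planes):
    """One bit per random hyperplane: 1 if vec lies on its non-negative side."""
    return [
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    ]


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing signature bits.

    For random hyperplanes, P[bits differ] = angle / pi, so the angle is
    approximated by pi * (Hamming distance / signature length).
    """
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * differing / len(sig_a))


def exact_cosine(u, v):
    """Exact cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy data: two users' 0/1 connection vectors over 8 accounts.
rng = random.Random(0)
dim, num_bits = 8, 2048
planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
u = [1, 1, 0, 1, 0, 0, 1, 0]
v = [1, 0, 0, 1, 1, 0, 1, 0]

estimated = estimate_cosine(simhash_signature(u, planes), simhash_signature(v, planes))
print(f"exact={exact_cosine(u, v):.3f} estimated={estimated:.3f}")
```

More signature bits tighten the estimate at the cost of larger signatures, which is the kind of trade-off a distributed method has to weigh against communication cost.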

    Revisiting Wedge Sampling for Budgeted Maximum Inner Product Search

    Top-k maximum inner product search (MIPS) is a central task in many machine learning applications. This paper extends top-k MIPS with a budgeted setting that asks for the best approximate top-k MIPS given a limit of B computational operations. We investigate recent advanced sampling algorithms, including wedge and diamond sampling, to solve it. Though the design of these sampling schemes naturally supports budgeted top-k MIPS, they suffer from the linear cost of scanning all data points to retrieve top-k results and from performance degradation when handling negative inputs. This paper makes two main contributions. First, we show that diamond sampling is essentially a combination of wedge sampling and basic sampling for top-k MIPS. Our theoretical analysis and empirical evaluation show that wedge is competitive with (and often superior to) diamond on approximating top-k MIPS regarding both efficiency and accuracy. Second, we propose a series of algorithmic engineering techniques to deploy wedge sampling on budgeted top-k MIPS. Our novel deterministic wedge-based algorithm runs significantly faster than the state-of-the-art methods for budgeted and exact top-k MIPS while maintaining the top-5 precision at least 80% on standard recommender system data sets. Comment: ECML-PKDD 202
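As a rough illustration of the basic (randomized) wedge sampling idea this abstract builds on, the sketch below samples a dimension with probability proportional to the query weight times the column sum, then an item within that dimension, so that each item is drawn with probability proportional to its inner product with the query. It assumes non-negative inputs and is not the paper's deterministic, budget-aware algorithm; the function name, parameters, and toy data are hypothetical.

```python
import random
from collections import Counter


def wedge_sample_topk(items, query, num_samples, k, rng):
    """Return up to k candidate item indices via basic wedge sampling.

    Assumes every entry of `items` and `query` is non-negative and that
    at least one item has a positive inner product with the query.
    """
    dim = len(query)
    # Column sums: total mass of each dimension across all items.
    col_sums = [sum(item[j] for item in items) for j in range(dim)]
    # Dimension j is chosen with probability proportional to q_j * col_sum_j.
    dim_weights = [query[j] * col_sums[j] for j in range(dim)]
    sampled_dims = rng.choices(range(dim), weights=dim_weights, k=num_samples)

    counts = Counter()
    for j in sampled_dims:
        # Within dimension j, item i is chosen with probability x_ij / col_sum_j.
        column = [item[j] for item in items]
        i = rng.choices(range(len(items)), weights=column, k=1)[0]
        counts[i] += 1
    # Items hit most often are the candidates with the largest inner products.
    return [i for i, _ in counts.most_common(k)]


# Hypothetical toy data: item 2 has by far the largest inner product.
items = [[1, 0, 0], [0, 1, 0], [5, 5, 5], [0, 0, 1]]
query = [1, 1, 1]
candidates = wedge_sample_topk(items, query, num_samples=500, k=1, rng=random.Random(0))
print(candidates)
```

In practice the candidates would be re-ranked with exact inner products, and the number of samples plays the role of the computational budget.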

    A Bandit Approach to Maximum Inner Product Search

    There has been substantial research on sub-linear time approximate algorithms for Maximum Inner Product Search (MIPS). To achieve fast query time, state-of-the-art techniques require significant preprocessing, which can be a burden when the number of subsequent queries is not sufficiently large to amortize the cost. Furthermore, existing methods do not have the ability to directly control the suboptimality of their approximate results with theoretical guarantees. In this paper, we propose the first approximate algorithm for MIPS that does not require any preprocessing, and allows users to control and bound the suboptimality of the results. We cast MIPS as a Best Arm Identification problem, and introduce a new bandit setting that can fully exploit the special structure of MIPS. Our approach outperforms state-of-the-art methods on both synthetic and real-world datasets. Comment: AAAI 201

    Approximate Top-k Inner Product Join with a Proximity Graph

    This paper addresses the problem of top-k inner product join, which, given two sets of high-dimensional vectors and a result size k, outputs k pairs of vectors that have the largest inner product. This problem has important applications, such as recommendation, information extraction, and finding outlier correlation. Unfortunately, computing the exact answer incurs an expensive cost for large high-dimensional datasets. We therefore consider an approximate solution framework that efficiently retrieves k pairs of vectors with large inner products. To exploit this framework and obtain an accurate answer, we extend a state-of-the-art proximity graph for inner product search. We conduct experiments on real datasets, and the results show that our solution is faster and more accurate than baselines with state-of-the-art techniques. Nakama H., Amagata D., Hara T. Approximate Top-k Inner Product Join with a Proximity Graph. Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021, 4468 (2021); https://doi.org/10.1109/BigData52589.2021.9671858
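To make the problem statement concrete, here is a minimal exact (brute-force) top-k inner product join. It is only the expensive baseline that the abstract says is impractical at scale, not the proximity-graph method; the function name and the toy vectors are assumptions for illustration.

```python
import heapq


def topk_inner_product_join(left, right, k):
    """Return the k (score, left_index, right_index) triples with the largest
    inner products, by scoring every pair: O(|left| * |right| * dim) work."""
    scored = (
        (sum(a * b for a, b in zip(u, v)), i, j)
        for i, u in enumerate(left)
        for j, v in enumerate(right)
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy data: two small sets of 2-dimensional vectors.
left = [[1, 2], [3, 0]]
right = [[0, 1], [2, 3]]
top_pairs = topk_inner_product_join(left, right, k=2)
print(top_pairs)
```

The quadratic number of pairs is what motivates approximate approaches such as the one described above.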

    Versatile Density Functionals for Computational Surface Science


    Estimating the subjective perception of object size and position through brain imaging and psychophysics

    Perception is subjective and context-dependent. Size and position perception are no exceptions. Studies have shown that apparent object size is represented by the retinotopic location of peak response in V1. Such representation is likely supported by a combination of V1 architecture and top-down driven retinotopic reorganisation. Are apparent object size and position encoded via a common mechanism? Using functional magnetic resonance imaging and a model-based reconstruction technique, the first part of this thesis sets out to test if retinotopic encoding of size percepts can be generalised to apparent position representation and whether neural signatures could be used to predict an individual's perceptual experience. Here, I present evidence that static apparent position – induced by a dot-variant Muller-Lyer illusion – is represented retinotopically in V1. However, there is mixed evidence for retinotopic representation of motion-induced position shifts (e.g. curveball illusion) in early visual areas. My findings could be reconciled by assuming dual representation of veridical and percept-based information in early visual areas, which is consistent with the larger framework of predictive coding. The second part of the thesis sets out to compare different psychophysical methods for measuring size perception in the Ebbinghaus illusion. Consistent with the idea that psychophysical methods are not equally susceptible to cognitive factors, my experiments reveal a consistent discrepancy in illusion magnitude estimates between a traditional forced choice (2AFC) task and a novel perceptual matching (PM) task – a variant of a comparison-of-comparisons (CoC) task, a design widely seen as the gold standard in psychophysics. Further investigation reveals the difference was not driven by greater 2AFC susceptibility to cognitive factors, but a tendency for PM to skew illusion magnitude estimates towards the underlying stimulus distribution. I show that this dependency can be largely corrected using adaptive stimulus sampling.