When Hashes Met Wedges: A Distributed Algorithm for Finding High Similarity Vectors
Finding similar user pairs is a fundamental task in social networks, with
numerous applications in ranking and personalization tasks such as link
prediction and tie strength detection. A common manifestation of user
similarity is based upon network structure: each user is represented by a
vector that represents the user's network connections, where pairwise cosine
similarity among these vectors defines user similarity. The predominant task
for user similarity applications is to discover all similar pairs that have a
pairwise cosine similarity value larger than a given threshold τ. In
contrast to previous work where τ is assumed to be quite close to 1, we
focus on recommendation applications where τ is small, but still
meaningful. The all-pairs cosine similarity problem is computationally
challenging on networks with billions of edges, and especially so for settings
with small τ. To the best of our knowledge, there is no practical solution
for computing all user pairs at such small thresholds on large social networks,
even using the power of distributed algorithms.
Our work directly addresses this challenge by introducing a new algorithm ---
WHIMP --- that solves this problem efficiently in the MapReduce model. The key
insight in WHIMP is to combine the "wedge-sampling" approach of Cohen-Lewis for
approximate matrix multiplication with the SimHash random projection techniques
of Charikar. We provide a theoretical analysis of WHIMP, proving that it has
near optimal communication costs while maintaining computation cost comparable
with the state of the art. We also empirically demonstrate WHIMP's scalability
by computing all highly similar pairs on four massive data sets, and show that
it accurately finds high similarity pairs. In particular, we note that WHIMP
successfully processes the entire Twitter network, which has tens of billions
of edges.
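To make the SimHash half of that combination concrete, here is a minimal sketch of Charikar's random-projection scheme, which WHIMP builds on: each vector is hashed to one bit per random hyperplane, and the fraction of disagreeing bits between two signatures estimates the angle, hence the cosine similarity. The helper names (`simhash_signature`, `estimate_cosine`) are illustrative, not from the paper.

```python
import numpy as np

def simhash_signature(vec, planes):
    # One bit per random hyperplane: the sign of the projection.
    return planes @ vec >= 0

def estimate_cosine(sig_a, sig_b):
    # Charikar: P[bits agree] = 1 - angle/pi, so the cosine is
    # recovered from the fraction of disagreeing bits.
    frac_disagree = np.mean(sig_a != sig_b)
    return np.cos(np.pi * frac_disagree)

rng = np.random.default_rng(0)
planes = rng.standard_normal((2048, 3))   # 2048 random hyperplanes in R^3
u = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])             # true cosine with u is ~0.707
est = estimate_cosine(simhash_signature(u, planes),
                      simhash_signature(v, planes))
```

Because signatures are short bit strings, they are cheap to ship between machines, which is the property WHIMP exploits to keep communication near optimal.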
Revisiting Wedge Sampling for Budgeted Maximum Inner Product Search
Top-k maximum inner product search (MIPS) is a central task in many machine
learning applications. This paper extends top-k MIPS with a budgeted setting,
that asks for the best approximate top-k MIPS given a limit of B computational
operations. We investigate recent advanced sampling algorithms, including wedge
and diamond sampling, to solve it. Though the design of these sampling schemes
naturally supports budgeted top-k MIPS, they suffer from the linear cost of
scanning all data points to retrieve top-k results and from performance
degradation when handling negative inputs.
This paper makes two main contributions. First, we show that diamond sampling
is essentially a combination of wedge sampling and basic sampling for
top-k MIPS. Our theoretical analysis and empirical evaluation show that wedge
sampling is competitive with (and often superior to) diamond sampling for
approximating top-k MIPS in both efficiency and accuracy. Second, we propose a
series of
algorithmic engineering techniques to deploy wedge sampling on budgeted top-k
MIPS. Our novel deterministic wedge-based algorithm runs significantly faster
than the state-of-the-art methods for budgeted and exact top-k MIPS while
maintaining top-5 precision of at least 80% on standard recommender system
data sets. Comment: ECML-PKDD 202
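The wedge-sampling idea behind both papers above can be sketched in a few lines for non-negative data: sample a coordinate j with probability proportional to q[j] times the column sum of the data matrix, then a row i proportional to X[i, j]; a row's hit count is then proportional to its inner product with q. This is a minimal illustration of the sampling scheme, not the paper's engineered deterministic variant, and the function name `wedge_topk` is hypothetical.

```python
import numpy as np
from collections import Counter

def wedge_topk(q, X, k, n_samples=20000, seed=0):
    # Wedge sampling for non-negative q and X (a sketch of the idea):
    # pick coordinate j with prob ~ q[j] * colsum[j], then row i with
    # prob X[i, j] / colsum[j]; E[hits of row i] ~ <q, X[i]>.
    rng = np.random.default_rng(seed)
    colsums = X.sum(axis=0)
    pj = q * colsums
    pj = pj / pj.sum()
    hits = Counter()
    for j, cnt in Counter(rng.choice(len(q), size=n_samples, p=pj)).items():
        rows = rng.choice(X.shape[0], size=cnt, p=X[:, j] / colsums[j])
        hits.update(rows.tolist())
    return [i for i, _ in hits.most_common(k)]

X = np.array([[10.0, 0.0],   # <q, row> = 10
              [0.0, 1.0],    # <q, row> = 1
              [1.0, 1.0]])   # <q, row> = 2
q = np.array([1.0, 1.0])
top2 = wedge_topk(q, X, k=2)
```

The budgeted setting corresponds to capping `n_samples` plus the number of candidate rows that are re-scored exactly.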
A Bandit Approach to Maximum Inner Product Search
There has been substantial research on sub-linear time approximate algorithms
for Maximum Inner Product Search (MIPS). To achieve fast query time,
state-of-the-art techniques require significant preprocessing, which can be a
burden when the number of subsequent queries is not sufficiently large to
amortize the cost. Furthermore, existing methods do not have the ability to
directly control the suboptimality of their approximate results with
theoretical guarantees. In this paper, we propose the first approximate
algorithm for MIPS that does not require any preprocessing, and allows users to
control and bound the suboptimality of the results. We cast MIPS as a Best Arm
Identification problem, and introduce a new bandit setting that can fully
exploit the special structure of MIPS. Our approach outperforms
state-of-the-art methods on both synthetic and real-world datasets. Comment: AAAI 201
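One way to picture the bandit reduction described above: treat each candidate vector as an arm, and let a "pull" of arm i sample a single coordinate j uniformly and return d * q[j] * X[i, j], an unbiased estimate of the inner product. The sketch below uses successive halving to eliminate arms; this is an illustrative reduction under those assumptions, not the paper's algorithm, and `bandit_mips` is a hypothetical name.

```python
import numpy as np

def bandit_mips(q, X, pulls_per_round=64, seed=0):
    # Each row of X is an arm; a pull samples one coordinate j uniformly
    # and returns d * q[j] * X[i, j], an unbiased estimate of <q, X[i]>.
    # Successive halving: the better-scoring half survives each round.
    rng = np.random.default_rng(seed)
    d = len(q)
    arms = np.arange(X.shape[0])
    while len(arms) > 1:
        js = rng.integers(0, d, size=(len(arms), pulls_per_round))
        est = d * q[js] * X[arms[:, None], js]
        means = est.mean(axis=1)
        arms = arms[np.argsort(means)[len(arms) // 2:]]  # keep best half
    return int(arms[0])

X = np.array([[5.0, 5.0, 5.0, 5.0],   # inner product 20 -- the best arm
              [1.0, 0.0, 0.0, 0.0],   # 1
              [0.0, 1.0, 0.0, 0.0],   # 1
              [0.0, 0.0, 2.0, 0.0]])  # 2
q = np.ones(4)
best = bandit_mips(q, X)
```

Note that no preprocessing of X is needed, which is the property the paper emphasizes; a full best-arm-identification algorithm would additionally track confidence bounds to certify suboptimality.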
Approximate Top-k Inner Product Join with a Proximity Graph
This paper addresses the problem of top-k inner product join, which, given two sets of high-dimensional vectors and a result size k, outputs the k pairs of vectors that have the largest inner products. This problem has important applications, such as recommendation, information extraction, and finding outlier correlations. Unfortunately, computing the exact answer incurs an expensive cost for large high-dimensional datasets. We therefore consider an approximate solution framework that efficiently retrieves k pairs of vectors with large inner products. To exploit this framework and obtain an accurate answer, we extend a state-of-the-art proximity graph for inner product search. We conduct experiments on real datasets, and the results show that our solution is faster and more accurate than baselines with state-of-the-art techniques.
Nakama, H., Amagata, D., Hara, T. Approximate Top-k Inner Product Join with a Proximity Graph. Proceedings - 2021 IEEE International Conference on Big Data (Big Data 2021), 4468 (2021); https://doi.org/10.1109/BigData52589.2021.9671858
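The core search primitive on a proximity graph can be sketched as a greedy walk: from the current node, move to whichever neighbor has a larger inner product with the query, and stop at a local maximum. This is a minimal, assumed-simple version (single entry point, no beam or candidate queue) rather than the paper's extended graph; `greedy_ip_search` is an illustrative name.

```python
import numpy as np

def greedy_ip_search(query, vectors, graph, start=0):
    # Greedy walk on a proximity graph: move to the neighbor with the
    # largest inner product with the query until no neighbor improves
    # on the current node.
    current = start
    best_ip = float(vectors[current] @ query)
    improved = True
    while improved:
        improved = False
        for nb in graph[current]:
            ip = float(vectors[nb] @ query)
            if ip > best_ip:
                current, best_ip, improved = nb, ip, True
    return current, best_ip

vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]])
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}   # toy adjacency lists
node, ip = greedy_ip_search(np.array([1.0, 1.0]), vectors, graph)
```

For the join problem, such a walk would be issued with each vector of one set as the query against a graph built over the other set, keeping the k best pairs seen overall.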
Scalable algorithms for latent variable models in machine learning
Latent variable modeling (LVM) is a popular approach in many machine learning applications, such as recommender systems and topic modeling, due to its ability to succinctly represent data, even in the presence of several missing entries. Existing learning methods for LVMs, while attractive, are infeasible for the large-scale datasets required in modern big data applications. In addition, such applications often come with various types of side information, such as the text description of items and the social network among users in a recommender system. In this thesis, we present scalable learning algorithms for a wide range of latent variable models such as low-rank matrix factorization and latent Dirichlet allocation. We also develop simple but effective techniques to extend existing LVMs to exploit various types of side information and make better predictions in many machine learning applications such as recommender systems, multi-label learning, and high-dimensional time-series prediction. In addition, we propose a novel approach for the maximum inner product search problem to accelerate the prediction phase of many latent variable models.
Estimating the subjective perception of object size and position through brain imaging and psychophysics
Perception is subjective and context-dependent. Size and position perception are no exceptions. Studies have shown that apparent object size is represented by the retinotopic location of peak response in V1. Such representation is likely supported by a combination of V1 architecture and top-down-driven retinotopic reorganisation. Are apparent object size and position encoded via a common mechanism? Using functional magnetic resonance imaging and a model-based reconstruction technique, the first part of this thesis sets out to test whether retinotopic encoding of size percepts can be generalised to apparent position representation and whether neural signatures could be used to predict an individual's perceptual experience. Here, I present evidence that static apparent position – induced by a dot-variant Müller-Lyer illusion – is represented retinotopically in V1. However, there is mixed evidence for retinotopic representation of motion-induced position shifts (e.g. the curveball illusion) in early visual areas. My findings could be reconciled by assuming dual representation of veridical and percept-based information in early visual areas, which is consistent with the larger framework of predictive coding. The second part of the thesis sets out to compare different psychophysical methods for measuring size perception in the Ebbinghaus illusion. Consistent with the idea that psychophysical methods are not equally susceptible to cognitive factors, my experiments reveal a consistent discrepancy in illusion magnitude estimates between a traditional forced choice (2AFC) task and a novel perceptual matching (PM) task – a variant of the comparison-of-comparisons (CoC) task, a design widely seen as the gold standard in psychophysics. Further investigation reveals the difference was not driven by greater 2AFC susceptibility to cognitive factors, but by a tendency for PM to skew illusion magnitude estimates towards the underlying stimulus distribution.
I show that this dependency can be largely corrected using adaptive stimulus sampling.