
    Algorithm Engineering for High-Dimensional Similarity Search Problems (Invited Talk)

    Similarity search problems in high-dimensional data arise in many areas of computer science such as databases, image analysis, machine learning, and natural language processing. One of the most prominent problems is finding the k nearest neighbors of a data point q ∈ ℝ^d in a large set of data points S ⊆ ℝ^d, under some distance measure such as the Euclidean distance. In contrast to lower-dimensional settings, we do not know of worst-case efficient data structures for such search problems in high-dimensional data, i.e., data structures that are faster than a linear scan through the data set. However, there is a rich body of (often heuristic) approaches that solve nearest neighbor search problems much faster than such a scan on many real-world data sets. Necessarily, the term "solve" means that these approaches give approximate results that are close to the true k nearest neighbors. In this talk, we survey recent approaches to nearest neighbor search and related problems. The talk consists of three parts: (1) What makes nearest neighbor search difficult? (2) How do current state-of-the-art algorithms work? (3) What are recent advances regarding similarity search on GPUs, in distributed settings, or in external memory?
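
    The baseline that these approaches must beat is the exact linear scan mentioned above. As a point of reference, here is a minimal NumPy sketch of exact k-NN by linear scan under Euclidean distance (function and variable names are illustrative, not from the talk):

```python
import numpy as np

def knn_linear_scan(S, q, k):
    """Exact k nearest neighbors of q in S under Euclidean distance,
    found by a full linear scan -- the baseline that approximate
    high-dimensional methods try to beat."""
    dists = np.linalg.norm(S - q, axis=1)   # distance to every point
    idx = np.argpartition(dists, k)[:k]     # indices of k smallest, unordered
    return idx[np.argsort(dists[idx])]      # order those k by distance

# Illustrative example: 10,000 points in 128 dimensions.
rng = np.random.default_rng(0)
S = rng.standard_normal((10_000, 128))
q = rng.standard_normal(128)
print(knn_linear_scan(S, q, k=10))
```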

    How Good Is Multi-Pivot Quicksort?

    Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step k pivots are used to split the input into k+1 segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun's Java 7. In 2014 at ALENEX, Kushagra et al. introduced an even faster algorithm that uses three pivots. This paper studies what possible advantages multi-pivot quicksort might offer in general. The contributions are as follows: Natural comparison-optimal algorithms for multi-pivot quicksort are devised and analyzed. The analysis shows that the benefits of using multiple pivots with respect to the average comparison count are marginal, and these strategies are inferior to simpler strategies such as the well-known median-of-k approach. A substantial part of the partitioning cost is caused by rearranging elements. A rigorous analysis of an algorithm for rearranging elements in the partitioning step is carried out, observing mainly how often array cells are accessed during partitioning. The algorithm behaves best if 3 to 5 pivots are used. Experiments show that this translates into good cache behavior and comes closest to predicting observed running times of multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots from a sample affects sorting cost. The study is theoretical in the sense that although the findings motivate design recommendations for multi-pivot quicksort algorithms that lead to running time improvements over known algorithms in an experimental setting, these improvements are small. Comment: Submitted to a journal, v2: Fixed statement of Gibbs' inequality, v3: Revised version, especially improving on the experiments in Section
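
    To illustrate the basic structure only (not the comparison-optimal classification strategies or the in-place rearrangement algorithm analyzed in the paper), here is a toy Python sketch in which k pivots split the input into k+1 segments that are sorted recursively:

```python
import bisect
import random

def multi_pivot_quicksort(a, k=3):
    """Toy k-pivot quicksort: k pivots split the remaining elements into
    k+1 segments, which are sorted recursively.  Illustration only -- it
    is not in-place and not the comparison-optimal classification or the
    rearrangement algorithm analyzed in the paper."""
    if len(a) <= k:
        return sorted(a)
    pivot_pos = set(random.sample(range(len(a)), k))    # choose k pivots
    pivots = sorted(a[i] for i in pivot_pos)
    rest = [a[i] for i in range(len(a)) if i not in pivot_pos]
    segments = [[] for _ in range(k + 1)]                # k+1 segments
    for x in rest:
        # segment index = number of pivots strictly smaller than x
        segments[bisect.bisect_left(pivots, x)].append(x)
    out = multi_pivot_quicksort(segments[0], k)
    for p, seg in zip(pivots, segments[1:]):
        out.append(p)
        out.extend(multi_pivot_quicksort(seg, k))
    return out

print(multi_pivot_quicksort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))
```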

    Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme

    This paper presents simple variants of the BlockQuicksort algorithm described by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to partition the input. To achieve a robust sorting algorithm that works well on many different input types, the paper introduces a novel two-pivot variant of Lomuto's partitioning scheme. A surprisingly simple twist to the generic two-pivot quicksort approach makes the algorithm robust. The paper provides an analysis of the theoretical properties of the proposed algorithms and compares them to their competitors. The analysis shows that Lomuto-based approaches incur a higher average sorting cost than the Hoare-based approach of BlockQuicksort. Moreover, the analysis is particularly useful to reason about pivot choices that suit the two-pivot approach. An extensive experimental study shows that, despite their worse theoretical behavior, the simpler variants perform as well as the original version of BlockQuicksort. Comment: Accepted at ALENEX 201
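
    For reference, classic single-pivot Lomuto partitioning looks as follows; this one-directional scan is the building block that the paper's block-based, two-pivot variant elaborates on (a minimal sketch, not the paper's implementation):

```python
def lomuto_partition(a, lo, hi):
    """Classic single-pivot Lomuto partition of a[lo..hi] (inclusive).
    A single left-to-right scan grows the "less than pivot" prefix;
    this one-directional structure is what distinguishes Lomuto's
    scheme from Hoare's crossing-pointer technique."""
    pivot = a[hi]                     # last element as pivot
    i = lo                            # boundary of the "< pivot" prefix
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]   # move small element into the prefix
            i += 1
    a[i], a[hi] = a[hi], a[i]         # place pivot at its final position
    return i

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        m = lomuto_partition(a, lo, hi)
        quicksort(a, lo, m - 1)
        quicksort(a, m + 1, hi)

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
quicksort(data)
print(data)
```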

    Reproducibility Companion Paper: Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN

    National Research Foundation (NRF) Singapore under NRF Fellowship Programme

    ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms

    This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating k-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, LaTeX plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of k-NN algorithms. In the short term, this overview allows users to choose the correct k-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to k-NN search yield comparable quality-performance trade-offs. The system is available at http://ann-benchmarks.com . Comment: Full version of the SISAP 2017 conference paper. v2: Updated the abstract to avoid arXiv linking to the wrong URL
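
    The quality measure most commonly reported in this setting is recall against an exact ground truth. A minimal sketch of how such a measure can be computed (illustrative only; not ANN-Benchmarks' actual API):

```python
def recall_at_k(approx_ids, true_ids):
    """Fraction of the true k nearest neighbors that the approximate
    algorithm returned -- the quality that is traded off against
    queries per second."""
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# Hypothetical query: exact 10-NN ground truth vs. an approximate answer.
true_ids   = [12, 7, 43, 3, 99, 21, 56, 8, 71, 30]
approx_ids = [12, 7, 43, 5, 99, 21, 56, 9, 71, 30]
print(recall_at_k(approx_ids, true_ids))   # -> 0.8
```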

    DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search

    Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor search. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN), in which we apply Approximate Nearest Neighbor (ANN) algorithms as a black-box subroutine to compute an unbiased KDE. The idea is to find the points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for KDE evaluation. We show empirically that our implementation outperforms state-of-the-art implementations on all high-dimensional datasets we considered, and matches the performance of RS in cases where the ANN yields no gains in performance. Comment: 24 pages, 1 figure. Submitted for review
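
    A minimal sketch of an estimator built on this idea, assuming a Gaussian kernel and using a brute-force scan as a stand-in for the ANN subroutine (names and parameters are illustrative, not the authors' implementation):

```python
import numpy as np

def gaussian_kernel(q, X, bandwidth=1.0):
    """Gaussian kernel values K(q, x) for every row x of X."""
    d2 = np.sum((X - q) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kde_ann_plus_sampling(X, q, k=50, m=100, bandwidth=1.0, rng=None):
    """Sketch of the DEANN idea: evaluate the k largest contributions
    exactly (found here by a brute-force scan standing in for an ANN
    subroutine) and estimate the remaining mass by uniform random
    sampling, zeroing out samples that were already counted exactly,
    which keeps the estimator unbiased."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    kernels = gaussian_kernel(q, X, bandwidth)
    near = np.argpartition(-kernels, k)[:k]          # stand-in for ANN result
    exact = kernels[near].sum()                      # exact near contribution
    sample = rng.integers(0, n, size=m)              # uniform sample of indices
    mask = ~np.isin(sample, near)                    # skip already-counted points
    rest = (n / m) * (kernels[sample] * mask).sum()  # unbiased remainder estimate
    return (exact + rest) / n

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 16))
q = rng.standard_normal(16)
print(kde_ann_plus_sampling(X, q, rng=rng))
```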