4,731 research outputs found

    Fast kk-NNG construction with GPU-based quick multi-select

    Full text link
    In this paper we describe a new brute force algorithm for building the kk-Nearest Neighbor Graph (kk-NNG). The kk-NNG algorithm has many applications in areas such as machine learning, bio-informatics, and clustering analysis. While there are very efficient algorithms for data of low dimensions, for high dimensional data the brute force search is the best algorithm. There are two main parts to the algorithm: the first part is finding the distances between the input vectors which may be formulated as a matrix multiplication problem. The second is the selection of the kk-NNs for each of the query vectors. For the second part, we describe a novel graphics processing unit (GPU) -based multi-select algorithm based on quick sort. Our optimization makes clever use of warp voting functions available on the latest GPUs along with use-controlled cache. Benchmarks show significant improvement over state-of-the-art implementations of the kk-NN search on GPUs

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    Full text link
    We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for \textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search system for ultra-high dimensional datasets on a single machine, that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging a LSH style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count based estimations, we reduce the computational and parallelization costs of similarity search, while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (n2Dn^2D), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results

    Computing a Minimum-Dilation Spanning Tree is NP-hard

    Get PDF
    In a geometric network G = (S, E), the graph distance between two vertices u, v in S is the length of the shortest path in G connecting u to v. The dilation of G is the maximum factor by which the graph distance of a pair of vertices differs from their Euclidean distance. We show that given a set S of n points with integer coordinates in the plane and a rational dilation delta > 1, it is NP-hard to determine whether a spanning tree of S with dilation at most delta exists

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Full text link
    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where a similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection is available online
    corecore