
    Low-Quality Dimension Reduction and High-Dimensional Approximate Nearest Neighbor

    The approximate nearest neighbor problem (ε-ANN) in Euclidean settings is a fundamental question, which has been addressed by two main approaches. Data-dependent space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality. Locality sensitive hashing, on the other hand, has polynomial dependence on the dimension, sublinear query time with an exponent inversely proportional to (1+ε)^2, and subquadratic space requirement. We generalize the Johnson-Lindenstrauss lemma to define "low-quality" mappings to a Euclidean space of significantly lower dimension, such that they satisfy a requirement weaker than approximately preserving all distances or even preserving the nearest neighbor. This mapping guarantees, with high probability, that an approximate nearest neighbor lies among the k approximate nearest neighbors in the projected space. These can be efficiently retrieved, using only linear storage, by a data structure such as BBD-trees. Our overall algorithm, given n points in dimension d, achieves space usage in O(dn), preprocessing time in O(dn log n), and query time in O(d n^ρ log n), where ρ is proportional to 1 − 1/loglog n, for fixed ε ∈ (0, 1). The dimension reduction is stronger if one assumes that the point set possesses some structure, namely a bounded expansion rate. We implement our method and present experimental results in up to 500 dimensions and 10^6 points, which show that the practical performance is better than predicted by the theoretical analysis. In addition, we compare our approach with E2LSH, which implements LSH.
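    To make the shortlist idea concrete, here is a minimal NumPy sketch (not the authors' C++ implementation): project the points with a Gaussian JL-style matrix, collect the k nearest neighbors of the query in the projected space, and re-rank them by original distance. Brute-force search stands in for the BBD-tree, and the dimensions and value of k below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_quality_ann(points, query, d_proj=20, k=50):
    """Project, shortlist k candidates in the projected space,
    then re-rank the shortlist by true Euclidean distance.
    Brute force stands in for the BBD-tree of the paper."""
    d = points.shape[1]
    # Gaussian JL-style projection, scaled to preserve norms in expectation.
    G = rng.normal(size=(d, d_proj)) / np.sqrt(d_proj)
    P, q = points @ G, query @ G
    # k nearest neighbors in the projected space: the "low-quality" shortlist,
    # which w.h.p. contains an approximate nearest neighbor of the query.
    cand = np.argsort(np.linalg.norm(P - q, axis=1))[:k]
    # Return the shortlist member closest to the query in the original space.
    return cand[np.argmin(np.linalg.norm(points[cand] - query, axis=1))]

X = rng.normal(size=(10_000, 500))   # n = 10^4 points in dimension d = 500
print(low_quality_ann(X, rng.normal(size=500)))
```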

    Low-quality dimension reduction and high-dimensional Approximate Nearest Neighbor

    The approximate nearest neighbor problem in the Euclidean metric has mainly been addressed by two basic methods: tree-based structures that partition the space, which work very well when the dimension is relatively low, and locality sensitive hashing, a randomized method that remains efficient even when the dimension is high. We generalize the Johnson-Lindenstrauss lemma and define low-quality embeddings into a Euclidean space of considerably lower dimension. With these embeddings we reduce the original problem to that of computing k approximate nearest neighbors in very low dimension. Finally, combining these embeddings with an existing data structure for low dimensions, we construct a randomized data structure that is efficient when the dimension is high.

    The approximate nearest neighbor problem (ANN) in Euclidean settings is a fundamental question, which has been addressed by two main approaches: data-dependent space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality; locality sensitive hashing, on the other hand, has polynomial dependence on the dimension and sublinear query time. We generalize the Johnson-Lindenstrauss lemma to define "low-quality" mappings to a Euclidean space of significantly lower dimension, such that they satisfy a requirement weaker than approximately preserving all distances or even preserving the nearest neighbor. This mapping guarantees, with constant probability, that an approximate nearest neighbor lies among the k approximate nearest neighbors in the projected space. This leads to an efficient randomized tree-based data structure that avoids the curse of dimensionality.

    High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests

    We propose a new data structure, the generalized randomized k-d forest, or k-GeRaF, for approximate nearest neighbor searching in high dimensions. In particular, we introduce new randomization techniques to specify a set of independently constructed trees where search is performed simultaneously, hence increasing accuracy. We omit backtracking, and we optimize distance computations, thus accelerating queries. We release public domain software GeRaF and compare it to existing implementations of state-of-the-art methods including BBD-trees, Locality Sensitive Hashing, randomized k-d forests, and product quantization. Experimental results indicate that our method would be the method of choice in dimensions around 1,000, and probably up to 10,000, for pointsets of cardinality up to a few hundred thousand or even one million; this range of inputs is encountered in many critical applications today. For instance, we handle a real dataset of 10^6 images represented in 960 dimensions with a query time of less than 1 sec on average and 90% of responses being true nearest neighbors.
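    The forest idea can be sketched as follows. This is not GeRaF itself (which is C++ and randomizes split coordinates inside each tree); instead, as an assumed stand-in, it builds one scipy cKDTree per random rotation of the data, runs an approximate search in each, and re-ranks the merged candidates. The class name RandomizedForest and the parameters n_trees, per_tree, and eps are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)

class RandomizedForest:
    """Independently randomized trees searched simultaneously;
    merged candidates are re-ranked by true distance."""
    def __init__(self, points, n_trees=4):
        self.points = points
        d = points.shape[1]
        # One random orthogonal rotation per tree (QR of a Gaussian matrix).
        self.rotations = [np.linalg.qr(rng.normal(size=(d, d)))[0]
                          for _ in range(n_trees)]
        self.trees = [cKDTree(points @ R) for R in self.rotations]

    def query(self, q, k=1, per_tree=10):
        cand = set()
        for R, tree in zip(self.rotations, self.trees):
            # eps > 0 makes each search approximate (pruned backtracking),
            # so different rotations can return different candidate sets.
            _, idx = tree.query(q @ R, k=per_tree, eps=2.0)
            cand.update(np.atleast_1d(idx).tolist())
        cand = np.fromiter(cand, dtype=int)
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        return cand[np.argsort(dists)[:k]]

X = rng.normal(size=(5_000, 64))
forest = RandomizedForest(X)
print(forest.query(rng.normal(size=64), k=3))
```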

    Products of Euclidean Metrics and Applications to Proximity Questions among Curves

    The problem of Approximate Nearest Neighbor (ANN) search is fundamental in computer science and has benefited from significant progress in the past couple of decades. However, most work has been devoted to pointsets, whereas complex shapes have not been sufficiently treated. Here, we focus on distance functions between discretized curves in Euclidean space: they appear in a wide range of applications, from road segments and molecular backbones to time series in general dimension. For p-products of Euclidean metrics, for any p ≥ 1, we design simple and efficient data structures for ANN, based on randomized projections, which are of independent interest. They serve to solve proximity problems under a notion of distance between discretized curves, which generalizes both discrete Fréchet and Dynamic Time Warping distances; these are the most popular and practical approaches to comparing such curves. We offer the first data structures and query algorithms for ANN with arbitrarily good approximation factor, at the expense of increased space usage and preprocessing time over existing methods. Query time complexity is comparable to or significantly improved by our algorithms; our approach is especially efficient when the length of the curves is bounded. 2012 ACM Subject Classification: Theory of computation → Data structures design and analysis.
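    For intuition about the two distances this notion generalizes, here is a small dynamic-programming sketch of discrete Fréchet and DTW (the classical O(mn) recurrences, not the paper's ANN data structure); the helper name _curve_dp is illustrative.

```python
import numpy as np

def _curve_dp(P, Q, combine):
    """Shared O(mn) dynamic program over two discretized curves
    (m x d and n x d arrays of ordered vertices)."""
    # All pairwise Euclidean distances between vertices of the two curves.
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
    m, n = D.shape
    F = np.full((m, n), np.inf)
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                F[0, 0] = D[0, 0]
                continue
            prev = min(F[i - 1, j] if i else np.inf,
                       F[i, j - 1] if j else np.inf,
                       F[i - 1, j - 1] if i and j else np.inf)
            F[i, j] = combine(prev, D[i, j])
    return F[-1, -1]

def discrete_frechet(P, Q):
    # Bottleneck version: minimize the longest link over all couplings.
    return _curve_dp(P, Q, max)

def dtw(P, Q):
    # Additive version: minimize the sum of link lengths.
    return _curve_dp(P, Q, lambda a, b: a + b)

A = np.array([[0.0, 0], [1, 0], [2, 0]])
B = np.array([[0.0, 1], [1, 1], [2, 1], [3, 1]])
print(discrete_frechet(A, B), dtw(A, B))
```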

    Implementation of an Algorithm for Proximity Problems on Polygonal Curves in General Dimension (Υλοποίηση Αλγορίθμου Για Προβλήματα Εγγύτητας Πολυγωνικών Καμπυλών Γενικής Διάστασης)

    The ever-increasing amount of information and knowledge that humanity produces, and the rising demand for analyzing it as well and as fast as possible, leads to the need for approximation algorithms for all kinds of data and processes. This thesis studies polygonal curves, that is, sequences of distinct points in a fixed order and of arbitrary length, in general dimension, as objects of processing, and in particular their clustering. The goal was a first implementation of an approximation algorithm using reduction methods and k-d trees. The basic idea is to build a data structure (preprocessing) that answers queries as fast as possible. A query is either a polygonal curve, for which the nearest neighbor is sought (ANN search), or a positive real number representing a radius R, from which a new clustering is produced each time. In practice the algorithm is effective, but it is exponential in space and time with respect to the maximum number of points on any curve, as expected from the theory. The implementation language is C++.
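    A minimal sketch of the described pipeline, under assumptions: padding each curve with its last vertex and flattening it to a fixed-length vector stands in for the thesis' actual reduction, and scipy's cKDTree for its k-d trees. The names preprocess, ann_query, and radius_clustering are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def flatten(curve, max_pts):
    """Illustrative reduction: pad a curve (m x d array of ordered
    vertices) with its last vertex, then flatten to one vector."""
    pad = np.repeat(curve[-1:], max_pts - len(curve), axis=0)
    return np.concatenate([curve, pad]).ravel()

def preprocess(curves):
    max_pts = max(len(c) for c in curves)
    tree = cKDTree(np.stack([flatten(c, max_pts) for c in curves]))
    return tree, max_pts

def ann_query(tree, max_pts, curve):
    # Nearest stored curve under the flattened Euclidean metric.
    return tree.query(flatten(curve, max_pts))[1]

def radius_clustering(tree, curves, max_pts, R):
    # Greedy clustering: each unassigned curve seeds a cluster of all
    # still-unassigned curves within distance R of it.
    seen, clusters = set(), []
    for i, c in enumerate(curves):
        if i in seen:
            continue
        members = [j for j in tree.query_ball_point(flatten(c, max_pts), r=R)
                   if j not in seen]
        seen.update(members)
        clusters.append(members)
    return clusters

curves = [np.random.rand(m, 2) for m in (3, 5, 4, 5)]
tree, max_pts = preprocess(curves)
print(ann_query(tree, max_pts, np.random.rand(4, 2)))
print(radius_clustering(tree, curves, max_pts, R=1.0))
```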

    Randomized Embeddings with Slack and High-Dimensional Approximate Nearest Neighbor

    The approximate nearest neighbor problem (ε-ANN) in high-dimensional Euclidean space has been mainly addressed by Locality Sensitive Hashing (LSH), which has polynomial dependence on the dimension and sublinear query time, but subquadratic space requirement. In this paper, we introduce a new definition of "low-quality" embeddings for metric spaces. It requires that, for some query point q, there exists an approximate nearest neighbor among the pre-images of the k > 1 approximate nearest neighbors in the target space. Focusing on Euclidean spaces, we employ random projections in order to reduce the original problem to one in a space of dimension inversely proportional to k. The k approximate nearest neighbors can be efficiently retrieved by a data structure such as BBD-trees. The same approach is applied to the problem of computing an approximate near neighbor, where we obtain a data structure requiring linear space, and query time in O(d n^ρ), for ρ ≈ 1 − ε^2/log(1/ε). This directly implies a solution for ε-ANN, while achieving a better exponent in the query time than the method based on BBD-trees. Better bounds are obtained in the case of doubling subsets of Euclidean space, by combining our method with r-nets. We implement our method in C++, and present experimental results in dimension up to 500 with 10^6 points, which show that performance is better than predicted by the analysis. In addition, we compare our ANN approach to E2LSH, which implements LSH, and we show that the theoretical advantages of each method are reflected in their actual performance.

    Approximating Spectral Clustering via Sampling: a Review

    Spectral clustering refers to a family of well-known unsupervised learning algorithms. Rather than attempting to cluster points in their native domain, one constructs a (usually sparse) similarity graph and computes the principal eigenvectors of its Laplacian. The eigenvectors are then interpreted as transformed points and fed into a k-means clustering algorithm. As a result of this non-linear transformation, it becomes possible to use a simple centroid-based algorithm to identify non-convex clusters, something that was otherwise impossible. Unfortunately, what makes spectral clustering so successful is also its Achilles heel: forming a graph and computing its dominant eigenvectors can be computationally prohibitive when dealing with more than a few tens of thousands of points. In this chapter, we review the principal research efforts aiming to reduce this computational cost. We focus on methods that come with theoretical control on the clustering performance and incorporate some form of sampling in their operation. Such methods abound in the machine learning, numerical linear algebra, and graph signal processing literature and, amongst others, include Nyström approximation, landmarks, coarsening, coresets, and compressive spectral clustering. We present the approximation guarantees available for each and discuss practical merits and limitations. Surprisingly, despite the breadth of the literature explored, we conclude that there is still a gap between theory and practice: the most scalable methods are only intuitively motivated or loosely controlled, whereas those that come with end-to-end guarantees rely on strong assumptions or offer only a limited gain in computation time.
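    For reference, here is a compact sketch of the baseline pipeline the chapter reviews, before any sampling-based acceleration: a kNN similarity graph, the normalized Laplacian, its bottom eigenvectors, then k-means. The parameters n_neighbors and sigma are illustrative choices, not taken from the chapter.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def spectral_clustering(X, n_clusters, n_neighbors=10, sigma=1.0):
    # 1. Sparse kNN similarity graph with Gaussian edge weights.
    A = kneighbors_graph(X, n_neighbors, mode="distance")
    A.data = np.exp(-A.data ** 2 / (2 * sigma ** 2))
    A = 0.5 * (A + A.T)                      # symmetrize
    # 2. Normalized graph Laplacian.
    L = laplacian(A, normed=True)
    # 3. Eigenvectors of the k smallest eigenvalues embed the points.
    _, U = eigsh(L, k=n_clusters, which="SM")
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12   # row-normalize
    # 4. A simple centroid-based algorithm now finds non-convex clusters.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)

# Two well-separated Gaussian blobs as a toy input.
X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 6])
print(np.bincount(spectral_clustering(X, 2)))
```

    The sampling-based accelerations surveyed in the chapter (Nyström approximation, landmarks, coarsening, coresets, compressive spectral clustering) target steps 1 to 3, which dominate the cost at scale.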