
    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URL, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail on the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset, using brute-force (n^2 D), will require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
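    The two building blocks named in this abstract, minwise hashing and reservoir sampling, can be illustrated with a minimal sketch. This is not the FLASH implementation: it uses classical minwise hashing with independent hash functions rather than the one-pass variant the abstract mentions, and the names `minhash_signature`, `estimate_jaccard`, and `ReservoirBucket` are hypothetical, chosen only for illustration.

```python
import random

PRIME = (1 << 61) - 1  # large prime modulus for the (assumed) universal hash family


def minhash_signature(items, num_hashes=64, seed=0):
    """Classical MinHash signature of a non-empty set of non-negative integer ids.

    Assumption: this is the textbook k-hash-function scheme, not the
    one-pass (one-permutation) scheme referred to in the abstract.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


class ReservoirBucket:
    """Fixed-capacity hash bucket that keeps a uniform sample of inserted ids."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Standard reservoir sampling: the new item replaces a stored one
            # with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item


if __name__ == "__main__":
    a = {1, 5, 9, 20, 33}
    b = {1, 5, 9, 20, 47}
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))

    bucket = ReservoirBucket(capacity=4)
    for doc_id in range(100):
        bucket.add(doc_id)
    print(bucket.items)
```

    Bounding each bucket with a reservoir keeps memory per bucket constant regardless of how many points hash there, which is one plausible reading of why the abstract pairs reservoir sampling with LSH-style indexing.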

    Exact Computation of a Manifold Metric, via Lipschitz Embeddings and Shortest Paths on a Graph

    Data-sensitive metrics adapt distances locally based on the density of data points with the goal of aligning distances and some notion of similarity. In this paper, we give the first exact algorithm for computing a data-sensitive metric called the nearest neighbor metric. In fact, we prove the surprising result that a previously published 3-approximation is an exact algorithm. The nearest neighbor metric can be viewed as a special case of a density-based distance used in machine learning, or it can be seen as an example of a manifold metric. Previous computational research on such metrics despaired of computing exact distances on account of the apparent difficulty of minimizing over all continuous paths between a pair of points. We leverage the exact computation of the nearest neighbor metric to compute sparse spanners and persistent homology. We also explore the behavior of the metric built from point sets drawn from an underlying distribution and consider the more general case of inputs that are finite collections of path-connected compact sets. The main results connect several classical theories such as the conformal change of Riemannian metrics, the theory of positive definite functions of Schoenberg, and screw function theory of Schoenberg and Von Neumann. We develop novel proof techniques based on the combination of screw functions and Lipschitz extensions that may be of independent interest. Comment: 15 pages

    Spectra of "Real-World" Graphs: Beyond the Semi-Circle Law

    Many natural and social systems develop complex networks that are usually modelled as random graphs. The eigenvalue spectrum of these graphs provides information about their structural properties. While the semi-circle law is known to describe the spectral density of uncorrelated random graphs, much less is known about the eigenvalues of real-world graphs, describing such complex systems as the Internet, metabolic pathways, networks of power stations, scientific collaborations or movie actors, which are inherently correlated and usually very sparse. An important limitation in addressing the spectra of these systems is that the numerical determination of the spectra for systems with more than a few thousand nodes is prohibitively time and memory consuming. Making use of recent advances in algorithms for spectral characterization, here we develop new methods to determine the eigenvalues of networks comparable in size to real systems, obtaining several surprising results on the spectra of adjacency matrices corresponding to models of real-world graphs. We find that when the number of links grows as the number of nodes, the spectral density of uncorrelated random graphs does not converge to the semi-circle law. Furthermore, the spectral densities of real-world graphs have specific features depending on the details of the corresponding models. In particular, scale-free graphs develop a triangle-like spectral density with a power law tail, while small-world graphs have a complex spectral density function consisting of several sharp peaks. These and further results indicate that the spectra of correlated graphs represent a practical tool for graph classification and can provide useful insight into the relevant structural properties of real networks. Comment: 14 pages, 9 figures (corrected typos, added references) accepted for Phys. Rev.

    Properties of dense partially random graphs

    We study the properties of random graphs where for each vertex a neighbourhood has been previously defined. The probability of an edge joining two vertices depends on whether the vertices are neighbours or not, as happens in Small World Graphs (SWGs). But we consider the case where the average degree of each node is of the order of the size of the graph (unlike SWGs, which are sparse). This allows us to calculate the mean distance and clustering, which are qualitatively similar (although not in such a dramatic scale range) to the case of SWGs. We also obtain analytically the distribution of eigenvalues of the corresponding adjacency matrices. This distribution is discrete for large eigenvalues and continuous for small eigenvalues. The continuous part of the distribution follows a semicircle law, whose width is proportional to the "disorder" of the graph, whereas the discrete part is simply a rescaling of the spectrum of the substrate. We apply our results to the calculation of the mixing rate and the synchronizability threshold. Comment: 14 pages. To be published in Physical Review

    Sparse geometric graphs with small dilation

    Given a set S of n points in R^D, and an integer k such that 0 <= k < n, we show that a geometric graph with vertex set S, at most n - 1 + k edges, maximum degree five, and dilation O(n / (k+1)) can be computed in time O(n log n). For any k, we also construct planar n-point sets for which any geometric graph with n - 1 + k edges has dilation Omega(n / (k+1)); a slightly weaker statement holds if the points of S are required to be in convex position.
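    For readers unfamiliar with the term, the dilation (stretch factor) of a geometric graph is the largest ratio, over all pairs of vertices, between their shortest-path distance in the graph and their Euclidean distance. The following minimal sketch computes it by brute force; it illustrates the definition only and is not the O(n log n) construction from the paper. The function name `dilation` and the input layout are assumptions made for this example, and the points are assumed to be pairwise distinct.

```python
import math
from itertools import combinations


def dilation(points, edges):
    """Dilation of a geometric graph with straight-line edges.

    points: list of coordinate tuples (pairwise distinct).
    edges:  list of (i, j) index pairs into points.
    Returns math.inf if the graph is disconnected.
    """
    n = len(points)

    def euclid(i, j):
        return math.dist(points[i], points[j])

    # All-pairs shortest paths via Floyd-Warshall, O(n^3).
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j in edges:
        w = euclid(i, j)
        dist[i][j] = dist[j][i] = min(dist[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]
                if through_k < dist[i][j]:
                    dist[i][j] = through_k

    return max(dist[i][j] / euclid(i, j) for i, j in combinations(range(n), 2))


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    path = [(0, 1), (1, 2), (2, 3)]   # n - 1 edges, i.e. k = 0
    cycle = path + [(3, 0)]           # n - 1 + k edges with k = 1
    print(dilation(square, path))
    print(dilation(square, cycle))
```

    On the unit square, the three-edge path forces the two endpoints to travel three units to cover a Euclidean distance of one, while adding a single extra edge closes the cycle and lowers the worst ratio, which mirrors the trade-off between k and dilation stated in the abstract.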

    Graph Connectivity in Noisy Sparse Subspace Clustering

    Subspace clustering is the problem of clustering data points into a union of low-dimensional linear/affine subspaces. It is the mathematical abstraction of many important problems in computer vision, image processing and machine learning. A line of recent work (4, 19, 24, 20) provided strong theoretical guarantees for sparse subspace clustering (4), the state-of-the-art algorithm for subspace clustering, on both noiseless and noisy data sets. It was shown that under mild conditions, with high probability no two points from different subspaces are clustered together. Such a guarantee, however, is not sufficient for the clustering to be correct, due to the notorious "graph connectivity problem" (15). In this paper, we investigate the graph connectivity problem for noisy sparse subspace clustering and show that a simple post-processing procedure is capable of delivering consistent clustering under certain "general position" or "restricted eigenvalue" assumptions. We also show that our condition is almost tight with adversarial noise perturbation by constructing a counter-example. These results provide the first exact clustering guarantee of noisy SSC for subspaces of dimension greater than 3. Comment: 14 pages. To appear in The 19th International Conference on Artificial Intelligence and Statistics, held at Cadiz, Spain in 201