FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for
Similarity search accelerated with HPC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
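The abstract names reservoir sampling as one of its ingredients without giving details. As a minimal sketch of the general idea only, the block below shows how a fixed-capacity hash bucket could keep a uniform random sample of the ids hashed into it, using the textbook Algorithm R. The class name `ReservoirBucket` and its parameters are hypothetical illustrations and are not taken from the FLASH implementation.

```python
import random


class ReservoirBucket:
    """Fixed-capacity bucket holding a uniform random sample of inserted ids.

    Hypothetical illustration of reservoir sampling (Algorithm R); not the
    FLASH data structure itself.
    """

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of ids offered to this bucket
        self.rng = rng or random.Random()

    def add(self, item_id):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Bucket not yet full: always keep the id.
            self.items.append(item_id)
        else:
            # Bucket full: keep the new id with probability capacity / seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item_id


# Usage: stream 1000 ids into a bucket that can hold only 4.
bucket = ReservoirBucket(capacity=4, rng=random.Random(0))
for doc_id in range(1000):
    bucket.add(doc_id)
```

The memory used per bucket stays constant regardless of how many ids collide in it, which is the property that makes this kind of sampling attractive for fixed-size hash tables.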
Exact Computation of a Manifold Metric, via Lipschitz Embeddings and Shortest Paths on a Graph
Data-sensitive metrics adapt distances locally based on the density of data
points with the goal of aligning distances and some notion of similarity. In
this paper, we give the first exact algorithm for computing a data-sensitive
metric called the nearest neighbor metric. In fact, we prove the surprising
result that a previously published -approximation is an exact algorithm.
The nearest neighbor metric can be viewed as a special case of a
density-based distance used in machine learning, or it can be seen as an
example of a manifold metric. Previous computational research on such metrics
despaired of computing exact distances on account of the apparent difficulty of
minimizing over all continuous paths between a pair of points. We leverage the
exact computation of the nearest neighbor metric to compute sparse spanners and
persistent homology. We also explore the behavior of the metric built from
point sets drawn from an underlying distribution and consider the more general
case of inputs that are finite collections of path-connected compact sets.
The main results connect several classical theories such as the conformal
change of Riemannian metrics, the theory of positive definite functions of
Schoenberg, and screw function theory of Schoenberg and Von Neumann. We develop
novel proof techniques based on the combination of screw functions and
Lipschitz extensions that may be of independent interest. Comment: 15 pages
Spectra of "Real-World" Graphs: Beyond the Semi-Circle Law
Many natural and social systems develop complex networks that are usually
modelled as random graphs. The eigenvalue spectrum of these graphs provides
information about their structural properties. While the semi-circle law is
known to describe the spectral density of uncorrelated random graphs, much less
is known about the eigenvalues of real-world graphs, describing such complex
systems as the Internet, metabolic pathways, networks of power stations,
scientific collaborations or movie actors, which are inherently correlated and
usually very sparse. An important limitation in addressing the spectra of these
systems is that the numerical determination of the spectra for systems with
more than a few thousand nodes is prohibitively time and memory consuming.
Making use of recent advances in algorithms for spectral characterization, here
we develop new methods to determine the eigenvalues of networks comparable in
size to real systems, obtaining several surprising results on the spectra of
adjacency matrices corresponding to models of real-world graphs. We find that
when the number of links grows as the number of nodes, the spectral density of
uncorrelated random graphs does not converge to the semi-circle law.
Furthermore, the spectral densities of real-world graphs have specific features
depending on the details of the corresponding models. In particular, scale-free
graphs develop a triangle-like spectral density with a power law tail, while
small-world graphs have a complex spectral density function consisting of
several sharp peaks. These and further results indicate that the spectra of
correlated graphs represent a practical tool for graph classification and can
provide useful insight into the relevant structural properties of real
networks. Comment: 14 pages, 9 figures (corrected typos, added references) accepted for
Phys. Rev.
Properties of dense partially random graphs
We study the properties of random graphs where for each vertex a {\it
neighbourhood} has been previously defined. The probability of an edge joining
two vertices depends on whether the vertices are neighbours or not, as happens
in Small World Graphs (SWGs). But we consider the case where the average degree
of each node is of the order of the size of the graph (unlike SWGs, which are
sparse). This allows us to calculate the mean distance and clustering, which are
qualitatively similar (although not in such a dramatic scale range) to the case
of SWGs. We also obtain analytically the distribution of eigenvalues of the
corresponding adjacency matrices. This distribution is discrete for large
eigenvalues and continuous for small eigenvalues. The continuous part of the
distribution follows a semicircle law, whose width is proportional to the
"disorder" of the graph, whereas the discrete part is simply a rescaling of the
spectrum of the substrate. We apply our results to the calculation of the
mixing rate and the synchronizability threshold. Comment: 14 pages. To be published in Physical Review
Sparse geometric graphs with small dilation
Given a set S of n points in R^D, and an integer k such that 0 <= k < n, we
show that a geometric graph with vertex set S, at most n - 1 + k edges, maximum
degree five, and dilation O(n / (k+1)) can be computed in time O(n log n). For
any k, we also construct planar n-point sets for which any geometric graph with
n-1+k edges has dilation Omega(n/(k+1)); a slightly weaker statement holds if
the points of S are required to be in convex position.
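To make the quantity in this abstract concrete: the dilation (stretch factor) of a geometric graph is the largest ratio, over all pairs of vertices, between their shortest-path distance in the graph and their Euclidean distance. The sketch below computes it by brute force with Floyd-Warshall. It is only an illustration of the definition, not the O(n log n) construction from the paper, and the function name `dilation` is a hypothetical choice.

```python
import math
from itertools import combinations


def dilation(points, edges):
    """Return the dilation of the geometric graph (points, edges).

    Brute-force illustration: O(n^3) all-pairs shortest paths, assuming the
    graph is connected and the points are pairwise distinct.
    """
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
    for u, v in edges:
        w = math.dist(points[u], points[v])  # edge weight = Euclidean length
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return max(
        dist[i][j] / math.dist(points[i], points[j])
        for i, j in combinations(range(n), 2)
    )


# Corners of the unit square, joined first by a path (n - 1 edges),
# then by a cycle (n - 1 + k edges with k = 1).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
path_edges = [(0, 1), (1, 2), (2, 3)]
cycle_edges = path_edges + [(3, 0)]

path_dilation = dilation(square, path_edges)
cycle_dilation = dilation(square, cycle_edges)
```

In this small example, adding one extra edge lowers the dilation from 3 (the two path endpoints are at Euclidean distance 1 but graph distance 3) to the square root of 2 (the diagonal pairs), which mirrors the trade-off between the number of extra edges k and the dilation described in the abstract.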
Graph Connectivity in Noisy Sparse Subspace Clustering
Subspace clustering is the problem of clustering data points into a union of
low-dimensional linear/affine subspaces. It is the mathematical abstraction of
many important problems in computer vision, image processing and machine
learning. A line of recent work (4, 19, 24, 20) provided strong theoretical
guarantees for sparse subspace clustering (4), the state-of-the-art algorithm
for subspace clustering, on both noiseless and noisy data sets. It was shown
that under mild conditions, with high probability no two points from different
subspaces are clustered together. Such a guarantee, however, is not sufficient
for the clustering to be correct, due to the notorious "graph connectivity
problem" (15). In this paper, we investigate the graph connectivity problem for
noisy sparse subspace clustering and show that a simple post-processing
procedure is capable of delivering consistent clustering under certain "general
position" or "restricted eigenvalue" assumptions. We also show that our
condition is almost tight with adversarial noise perturbation by constructing a
counter-example. These results provide the first exact clustering guarantee of
noisy SSC for subspaces of dimension greater than 3. Comment: 14 pages. To appear in The 19th International Conference on
Artificial Intelligence and Statistics, held at Cadiz, Spain in 201