14,333 research outputs found
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results
Recovering the Optimal Solution by Dual Random Projection
Random projection has been widely used in data classification. It maps
high-dimensional data into a low-dimensional subspace in order to reduce the
computational cost in solving the related optimization problem. While previous
studies are focused on analyzing the classification performance of using random
projection, in this work, we consider the recovery problem, i.e., how to
accurately recover the optimal solution to the original optimization problem in
the high-dimensional space based on the solution learned from the subspace
spanned by random projections. We present a simple algorithm, termed Dual
Random Projection, that uses the dual solution of the low-dimensional
optimization problem to recover the optimal solution to the original problem.
Our theoretical analysis shows that with a high probability, the proposed
algorithm is able to accurately recover the optimal solution to the original
problem, provided that the data matrix is of low rank or can be well
approximated by a low rank matrix.Comment: The 26th Annual Conference on Learning Theory (COLT 2013
Sketch-based subspace clustering of hyperspectral images
Sparse subspace clustering (SSC) techniques provide the state-of-the-art in clustering of hyperspectral images (HSIs). However, their computational complexity hinders their applicability to large-scale HSIs. In this paper, we propose a large-scale SSC-based method, which can effectively process large HSIs while also achieving improved clustering accuracy compared to the current SSC methods. We build our approach based on an emerging concept of sketched subspace clustering, which was to our knowledge not explored at all in hyperspectral imaging yet. Moreover, there are only scarce results on any large-scale SSC approaches for HSI. We show that a direct application of sketched SSC does not provide a satisfactory performance on HSIs but it does provide an excellent basis for an effective and elegant method that we build by extending this approach with a spatial prior and deriving the corresponding solver. In particular, a random matrix constructed by the Johnson-Lindenstrauss transform is first used to sketch the self-representation dictionary as a compact dictionary, which significantly reduces the number of sparse coefficients to be solved, thereby reducing the overall complexity. In order to alleviate the effect of noise and within-class spectral variations of HSIs, we employ a total variation constraint on the coefficient matrix, which accounts for the spatial dependencies among the neighbouring pixels. We derive an efficient solver for the resulting optimization problem, and we theoretically prove its convergence property under mild conditions. The experimental results on real HSIs show a notable improvement in comparison with the traditional SSC-based methods and the state-of-the-art methods for clustering of large-scale images
- …