FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging an LSH-style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count-based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
Distribution of sizes of erased loops for loop-erased random walks
We study the distribution of sizes of erased loops for loop-erased random
walks on regular and fractal lattices. We show that for arbitrary graphs the
probability of generating a loop of perimeter is expressible in
terms of the probability of forming a loop of perimeter when a
bond is added to a random spanning tree on the same graph by the simple
relation . On -dimensional hypercubical lattices,
varies as for large , where for , where
z is the fractal dimension of the loop-erased walks on the graph. On
recursively constructed fractals with this relation is modified
to , where is the Hausdorff and
is the spectral dimension of the fractal. Comment: 4 pages, RevTex, 3 figures
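The loop-size statistics described in this abstract can be explored numerically. The following is a minimal sketch, assuming a simple random walk on the square lattice with chronological loop erasure; the function name `loop_erased_walk` and its parameters are illustrative and do not come from the paper.

```python
import random
from collections import Counter

def loop_erased_walk(steps, seed=0):
    """Simple random walk on Z^2 with chronological loop erasure.

    Returns the surviving self-avoiding path and a Counter mapping
    loop perimeter (number of bonds) to how often it was erased.
    """
    rng = random.Random(seed)
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    path = [(0, 0)]
    index = {(0, 0): 0}          # site -> position in the current path
    loop_sizes = Counter()
    for _ in range(steps):
        dx, dy = rng.choice(moves)
        x, y = path[-1]
        site = (x + dx, y + dy)
        if site in index:
            # The walk closed a loop: count its bonds, then erase it.
            k = index[site]
            loop_sizes[len(path) - k] += 1
            for erased in path[k + 1:]:
                del index[erased]
            del path[k + 1:]
        else:
            index[site] = len(path)
            path.append(site)
    return path, loop_sizes
```

A histogram of `loop_sizes` over many long walks gives an empirical estimate of the loop-perimeter distribution. On the square lattice every perimeter is even, with immediate backtracks counted as loops of perimeter 2.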
Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing
All pairs similarity search is a problem where a set of data objects is given
and the task is to find all pairs of objects that have similarity above a
certain threshold for a given similarity measure-of-interest. When the number
of points or dimensionality is high, standard solutions fail to scale
gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and
its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some
extent and provide substantial speedup over traditional index-based
approaches. BayesLSH is used for pruning the candidate space and computation of
approximate similarity, whereas BayesLSHLite can only prune the candidates, but
similarity needs to be computed exactly on the original data. Thus, wherever
the explicit data representation is available and exact similarity computation
is not too expensive, BayesLSHLite can be used to aggressively prune candidates
and provide substantial speedup without losing too much on quality. However,
the loss in quality is higher in the BayesLSH variant, where no explicit data
representation is available, only a hash sketch, and
similarity has to be estimated approximately. In this work we revisit the LSH
problem in a frequentist setting and formulate sequential tests for composite
hypotheses (similarity greater than or less than threshold) that can be
leveraged by such LSH algorithms for adaptively pruning candidates
aggressively. We propose a vanilla sequential probability ratio test (SPRT)
approach based on this idea and two novel variants. We extend these variants to
the case where approximate similarity needs to be computed using a fixed-width
sequential confidence interval generation technique.
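The idea of sequentially testing hash collisions can be illustrated with Wald's classical SPRT. This is a sketch only: it replaces the composite hypotheses of the abstract with two simple hypotheses, a collision rate of `p0` (below threshold) against `p1` (above threshold), and the function name, parameter names, and error rates `alpha` and `beta` are assumptions, not the paper's notation.

```python
import math

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 against H1: p = p1, with p0 < p1.

    `observations` is an iterable of 0/1 hash-collision indicators.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, hit in enumerate(observations, start=1):
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "similar", n
        if llr <= lower:
            return "dissimilar", n
    return "undecided", n
```

The appeal for candidate pruning is that clearly dissimilar pairs are rejected after only a few hash comparisons, while borderline pairs consume more of the sketch.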
You can't see what you can't see: Experimental evidence for how much relevant information may be missed due to Google's Web search personalisation
The influence of Web search personalisation on professional knowledge work is
an understudied area. Here we investigate how public sector officials
self-assess their dependency on the Google Web search engine, whether they are
aware of the potential impact of algorithmic biases on their ability to
retrieve all relevant information, and how much relevant information may
actually be missed due to Web search personalisation. We find that the majority
of participants in our experimental study are neither aware that there is a
potential problem nor do they have a strategy to mitigate the risk of missing
relevant information when performing online searches. Most significantly, we
provide empirical evidence that up to 20% of relevant information may be missed
due to Web search personalisation. This work has significant implications for
Web research by public sector professionals, who should be provided with
training about the potential algorithmic biases that may affect their judgments
and decision making, as well as clear guidelines on how to minimise the risk of
missing relevant information. Comment: paper submitted to the 11th Intl. Conf. on Social Informatics;
revision corrects error in interpretation of parameter Psi/p in RBO resulting
from discrepancy between the documentation of the implementation in R
(https://rdrr.io/bioc/gespeR/man/rbo.html) and the original definition
(https://dl.acm.org/citation.cfm?id=1852106) as per 20/05/201
Set Similarity Search for Skewed Data
Set similarity join, as well as the corresponding indexing problem set
similarity search, are fundamental primitives for managing noisy or uncertain
data. For example, these primitives can be used in data cleaning to identify
different representations of the same object. In many cases one can represent
an object as a sparse 0-1 vector, or equivalently as the set of nonzero entries
in such a vector. A set similarity join can then be used to identify those
pairs that have an exceptionally large dot product (or intersection, when
viewed as sets). We choose to focus on identifying vectors with large Pearson
correlation, but results extend to other similarity measures. In particular, we
consider the indexing problem of identifying correlated vectors in a set S of
vectors sampled from {0,1}^d. Given a query vector y and a parameter alpha in
(0,1), we need to search for an alpha-correlated vector x in a data structure
representing the vectors of S. This kind of similarity search has been
intensely studied in worst-case (non-random data) settings.
Existing theoretically well-founded methods for set similarity search are
often inferior to heuristics that take advantage of skew in the data
distribution, i.e., widely differing frequencies of 1s across the d dimensions.
The main contribution of this paper is to analyze the set similarity problem
under a random data model that reflects the kind of skewed data distributions
seen in practice, allowing theoretical results much stronger than what is
possible in worst-case settings. Our indexing data structure is a recursive,
data-dependent partitioning of vectors inspired by recent advances in set
similarity search. Previous data-dependent methods do not seem to allow us to
exploit skew in item frequencies, so we believe that our work sheds further
light on the power of data dependence.
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
BagMinHash - Minwise Hashing Algorithm for Weighted Sets
Minwise hashing has become a standard tool to calculate signatures which
allow direct estimation of Jaccard similarities. While very efficient
algorithms already exist for the unweighted case, the calculation of signatures
for weighted sets is still a time-consuming task. BagMinHash is a new algorithm
that can be orders of magnitude faster than the current state of the art without
any particular restrictions or assumptions on weights or data dimensionality.
Applied to the special case of unweighted sets, it represents the first
efficient algorithm producing independent signature components. A series of
tests finally verifies the new algorithm and also reveals limitations of other
approaches published in the recent past. Comment: 10 pages, KDD 201
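For orientation, the unweighted case mentioned in this abstract can be sketched with classic MinHash. This is not BagMinHash and does not handle weights; it is a minimal illustration assuming non-empty sets of strings, with salted BLAKE2b standing in for a family of independent hash functions. The function names are illustrative.

```python
import hashlib

def minhash_signature(items, k=128):
    """Classic (unweighted) MinHash signature of a non-empty set of strings.

    Component i is the minimum of a 64-bit salted hash over all items.
    """
    signature = []
    for i in range(k):
        salt = i.to_bytes(8, "little")     # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode("utf-8"), digest_size=8,
                                salt=salt).digest(),
                "big")
            for x in items))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching components, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Each component costs one hash evaluation per item, so a signature costs time proportional to `k` times the set size. Costs of this kind are what faster signature algorithms aim to reduce.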
Minimizing energy below the glass thresholds
Focusing on the optimization version of the random K-satisfiability problem,
the MAX-K-SAT problem, we study the performance of the finite energy version of
the Survey Propagation (SP) algorithm. We show that a simple (linear time)
backtrack decimation strategy is sufficient to reach configurations well below
the lower bound for the dynamic threshold energy and very close to the analytic
prediction for the optimal ground states. A comparative numerical study on one
of the most efficient local search procedures is also given. Comment: 12 pages, submitted to Phys. Rev. E, accepted for publication
Complexity transitions in global algorithms for sparse linear systems over finite fields
We study the computational complexity of a very basic problem, namely that of
finding solutions to a very large set of random linear equations in a finite
Galois Field modulo q. Using tools from statistical mechanics we are able to
identify phase transitions in the structure of the solution space and to
connect them to changes in performance of a global algorithm, namely Gaussian
elimination. Crossing phase boundaries produces a dramatic increase in the memory
and CPU requirements of the algorithm. In turn, this causes the
saturation of the upper bounds for the running time. We illustrate the results
on the specific problem of integer factorization, which is of central interest
for deciphering messages encrypted with the RSA cryptosystem. Comment: 23 pages, 8 figures
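The global algorithm named in this abstract, Gaussian elimination over a finite field, can be sketched as follows. The sketch assumes `q` is prime and uses dense lists rather than the sparse representations the paper is concerned with; the function name `solve_mod_q` is illustrative.

```python
def solve_mod_q(A, b, q):
    """Solve A x = b over GF(q), q prime, by Gauss-Jordan elimination.

    Returns one solution with free variables set to 0, or None if the
    system is inconsistent. Requires Python 3.8+ for pow(x, -1, q).
    """
    rows, cols = len(A), len(A[0])
    # Augmented matrix [A | b], reduced modulo q.
    M = [[a % q for a in row] + [bi % q] for row, bi in zip(A, b)]
    pivots = []
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c]), None)
        if pivot is None:
            continue                        # no pivot: free variable
        M[r], M[pivot] = M[pivot], M[r]
        inv = pow(M[r][c], -1, q)           # modular inverse of the pivot
        M[r] = [(v * inv) % q for v in M[r]]
        for i in range(rows):
            if i != r and M[i][c]:
                f = M[i][c]
                M[i] = [(vi - f * vr) % q for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == rows:
            break
    # A zero row with a nonzero right-hand side means no solution.
    if any(M[i][cols] for i in range(r, rows)):
        return None
    x = [0] * cols
    for i, c in enumerate(pivots):
        x[c] = M[i][cols]
    return x
```

Row operations on a dense matrix hide the fill-in effect that drives memory growth for sparse systems, so this only illustrates the arithmetic, not the complexity transition.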
Minimum spanning trees on random networks
We show that the geometry of minimum spanning trees (MST) on random graphs is
universal. Due to this geometric universality, we are able to characterise the
energy of MST using a scaling distribution () found using uniform
disorder. We show that the MST energy for other disorder distributions is
simply related to . We discuss the relationship to invasion
percolation (IP), to the directed polymer in a random medium (DPRM) and the
implications for the broader issue of universality in disordered systems. Comment: 4 pages, 3 figures
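A minimum spanning tree on a random graph with uniform disorder can be generated with a few lines of Kruskal's algorithm. This is a minimal sketch under stated assumptions: the graph is an Erdős–Rényi graph G(n, p), edge weights are independent uniform variables on [0, 1), and "energy" is taken to mean the total weight of the tree. None of these names or choices are confirmed by the abstract.

```python
import random

def random_graph_mst(n, p, seed=0):
    """Kruskal's algorithm on a G(n, p) graph with uniform(0, 1) weights.

    Returns (tree_edges, energy), where tree_edges is a list of
    (u, v, weight) and energy is the total weight. If the graph is
    disconnected the result is a minimum spanning forest.
    """
    rng = random.Random(seed)
    edges = [(rng.random(), u, v)
             for u in range(n) for v in range(u + 1, n)
             if rng.random() < p]

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):           # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components
            parent[ru] = rv
            tree.append((u, v, w))
    return tree, sum(w for _, _, w in tree)
```

Because Kruskal's algorithm depends only on the ordering of the weights, swapping the uniform distribution for any other continuous one leaves the tree geometry unchanged and alters only the energy, which is consistent with the universality claim above.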