FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (Fast LSH Algorithm for Similarity search accelerated with
HPC), a similarity search system for ultra-high dimensional datasets on a
single machine that does not require similarity computations and is tailored
for high-performance computing platforms. By leveraging an LSH-style
randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URLs, click-through prediction, and social
networks. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset using brute-force search would require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results.
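One of the techniques named above, reservoir sampling, keeps a fixed-size uniform sample of an unbounded stream (for example, the ids landing in one LSH bucket) without ever storing the whole stream. A minimal sketch, with illustrative names and sizes that are not from the paper:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniformly sample k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # replace with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g., sample 8 candidate ids out of 10,000 bucket entries
sample = reservoir_sample(range(10_000), 8)
print(len(sample))  # 8
```

The point for a system like FLASH is that each bucket's memory cost is fixed at k regardless of how many points hash into it, which bounds both space and later candidate-verification work.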
The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the
position of a record within a sorted array, a Hash-Index as a model to map a
key to a position of a record within an unsorted array, and a BitMap-Index as a
model to indicate if a data record exists or not. In this exploratory research
paper, we start from this premise and posit that all existing index structures
can be replaced with other types of models, including deep-learning models,
which we term learned indexes. The key idea is that a model can learn the sort
order or structure of lookup keys and use this signal to effectively predict
the position or existence of records. We theoretically analyze under which
conditions learned indexes outperform traditional index structures and describe
the main challenges in designing learned index structures. Our initial results
show that by using neural nets we are able to outperform cache-optimized
B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over
several real-world data sets. More importantly though, we believe that the idea
of replacing core components of a data management system through learned models
has far reaching implications for future systems designs and that this work
just provides a glimpse of what might be possible.
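The premise above, that an index is a model from key to position, can be sketched with the simplest possible learner: a straight-line fit over a sorted key array, corrected by a bounded local search. The paper uses neural nets; this toy uses a line, and all names are illustrative:

```python
import bisect

def fit_linear(keys):
    """Fit position ~ slope * (key - lo) over a sorted key array,
    and record the worst-case prediction error."""
    n = len(keys)
    lo, hi = keys[0], keys[-1]
    slope = (n - 1) / (hi - lo) if hi != lo else 0.0
    max_err = max(abs(slope * (k - lo) - i) for i, k in enumerate(keys))
    return slope, lo, int(max_err) + 1

def lookup(keys, model, key):
    """Predict the position, then binary-search only the error window."""
    slope, lo, err = model
    guess = int(slope * (key - lo))
    left = max(0, guess - err)
    right = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, left, right)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted(range(0, 1000, 3))
model = fit_linear(keys)
print(lookup(keys, model, 333))  # 111
```

Because the data here are near-uniform, the recorded error bound is tiny and each lookup touches only a constant-size window; the paper's claimed wins come from learning exactly this kind of key distribution with more capable models.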
Distributed top-k aggregation queries at large
Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
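The TPUT framework mentioned above can be illustrated by its first pruning step: each node ships its local top-k list, and the coordinator sums the shipped scores and takes the k-th best partial sum as a threshold for the later, more expensive phases. A minimal sketch with sum as the aggregation function; all names are illustrative, not from the paper:

```python
from collections import defaultdict

def tput_phase1(node_lists, k):
    """Phase 1 of a TPUT-style top-k aggregation.

    node_lists: one dict per node, mapping item -> local score.
    Returns the partial sums over shipped entries and the threshold tau.
    """
    partial = defaultdict(float)
    for scores in node_lists:
        # each node sends only its local top-k entries
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:k]
        for item, s in top:
            partial[item] += s
    # k-th largest partial sum: a lower bound on the true k-th best total
    tau = sorted(partial.values(), reverse=True)[k - 1]
    # Phase 2 would now fetch every item whose local score can still
    # contribute, i.e., those with score >= tau / len(node_lists).
    return dict(partial), tau

nodes = [{"a": 10, "b": 8, "c": 1}, {"a": 9, "c": 7, "b": 2}]
partial, tau = tput_phase1(nodes, 2)
print(partial["a"], tau)  # 19.0 8.0
```

The optimizations the abstract describes (operator trees, adaptive scan depths, source sampling) all aim to tighten or cheapen exactly this kind of threshold-based pruning.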
Multivariate discrimination and the Higgs + W/Z search
A systematic method for optimizing multivariate discriminants is developed
and applied to the important example of a light Higgs boson search at the
Tevatron and the LHC. The Significance Improvement Characteristic (SIC),
defined as the signal efficiency of a cut or multivariate discriminant divided
by the square root of the background efficiency, is shown to be an extremely
powerful visualization tool. SIC curves demonstrate numerical instabilities in
the multivariate discriminants, show convergence as the number of variables is
increased, and display the sensitivity to the optimal cut values. For our
application, we concentrate on Higgs boson production in association with a W
or Z boson with H -> bb and compare to the irreducible standard model
background, Z/W + bb. We explore thousands of experimentally motivated,
physically motivated, and unmotivated single variable discriminants. Along with
the standard kinematic variables, a number of new ones, such as twist, are
described which should have applicability to many processes. We find that some
single variables, such as the pull angle, are weak discriminants, but when
combined with others they provide important marginal improvement. We also find
that multiple Higgs boson-candidate mass measures, such as from mild and
aggressively trimmed jets, when combined may provide additional discriminating
power. Comparing the significance improvement from our variables to those used
in recent CDF and DZero searches, we find that a 10-20% improvement in
significance against Z/W + bb is possible. Our analysis also suggests that the
H + W/Z channel with H -> bb is also viable at the LHC, without requiring a
hard cut on the W/Z transverse momentum.
Comment: 41 pages, 5 tables, 29 figures
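The SIC defined above is directly computable: for a one-dimensional cut it is the signal efficiency divided by the square root of the background efficiency, traced out as the cut is varied. A toy sketch with made-up discriminant values (the samples and threshold are illustrative only):

```python
import math

def efficiencies(signal, background, cut):
    """Fraction of signal and background events passing a cut x > cut."""
    eps_s = sum(x > cut for x in signal) / len(signal)
    eps_b = sum(x > cut for x in background) / len(background)
    return eps_s, eps_b

def sic(eps_s, eps_b):
    """Significance Improvement Characteristic: eps_s / sqrt(eps_b)."""
    return eps_s / math.sqrt(eps_b) if eps_b > 0 else float("inf")

# toy discriminant outputs for signal-like and background-like events
signal = [0.9, 0.8, 0.7, 0.4]
background = [0.5, 0.3, 0.2, 0.1]
eps_s, eps_b = efficiencies(signal, background, 0.45)
print(sic(eps_s, eps_b))  # 0.75 / sqrt(0.25) = 1.5
```

A SIC above 1 means the cut improves the naive s/sqrt(b) significance; scanning the cut and plotting SIC against signal efficiency gives the diagnostic curves the abstract describes.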