Evaluating holistic aggregators efficiently for very large datasets
In data warehousing applications, numerous OLAP queries involve the processing of holistic aggregators such as computing the “top n,” median, quantiles, etc. In this paper, we present a novel approach called dynamic bucketing to efficiently evaluate these aggregators. We partition data into equiwidth buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the input order and distribution of input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones in both runtime and accuracy.
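As a rough illustration of why bucket histograms help with holistic aggregators, the sketch below selects the k-th smallest value with a single level of equiwidth buckets and two passes over the data. It is a simplified assumption-laden sketch, not the paper's dynamic bucketing: it does not split dense buckets into sub-buckets, does not reclaim memory, and does not use structure trees. The function name `bucket_select` and the parameter `num_buckets` are hypothetical.

```python
def bucket_select(values, k, num_buckets=16):
    """Return the k-th smallest element (0-based) of a non-empty list."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo  # all values are equal
    width = (hi - lo) / num_buckets

    def bucket_of(v):
        # Clamp so that v == hi lands in the last bucket.
        return min(int((v - lo) / width), num_buckets - 1)

    # Pass 1: build the histogram of bucket counts.
    counts = [0] * num_buckets
    for v in values:
        counts[bucket_of(v)] += 1

    # Walk the histogram to find the bucket that holds rank k.
    seen = 0
    for b, c in enumerate(counts):
        if seen + c > k:
            break
        seen += c

    # Pass 2: sort only the values that fall in that bucket.
    candidates = sorted(v for v in values if bucket_of(v) == b)
    return candidates[k - seen]
```

Only the histogram and one bucket's contents need to be held in memory at a time. If that bucket is itself very dense, the same idea could be applied recursively to it, which is the situation the abstract's sub-buckets are meant to address.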
Optimal Gossip Algorithms for Exact and Approximate Quantile Computations
This paper gives drastically faster gossip algorithms to compute exact and
approximate quantiles.
Gossip algorithms, which allow each node to contact a uniformly random other
node in each round, have been intensely studied and adopted in many
applications due to their fast convergence and their robustness to failures.
Kempe et al. [FOCS'03] gave gossip algorithms to compute important aggregate
statistics if every node is given a value. In particular, they gave a beautiful
$O(\log n + \log \frac{1}{\epsilon})$ round algorithm to $\epsilon$-approximate
the sum of all values and an $O(\log^2 n)$ round algorithm to compute the exact
$\phi$-quantile, i.e., the $\lceil \phi n \rceil$-th smallest value.
We give a quadratically faster and in fact optimal gossip algorithm for the
exact $\phi$-quantile problem which runs in $O(\log n)$ rounds. We furthermore
show that one can achieve an exponential speedup if one allows for an
$\epsilon$-approximation. We give an $O(\log \log n + \log \frac{1}{\epsilon})$
round gossip algorithm which computes a value of rank between $\phi n$ and
$(\phi + \epsilon) n$ at every node. Our algorithms are extremely simple and very robust: they can
be operated with the same running times even if every transmission fails with
a (potentially different) constant probability. We also give a matching
$\Omega(\log \log n + \log \frac{1}{\epsilon})$ lower bound which shows that
our algorithm is optimal for all values of $\epsilon$.
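To make the gossip model concrete, here is a small simulation in the spirit of the push-sum averaging protocol commonly attributed to Kempe et al. It illustrates the communication pattern described above (each node contacts one uniformly random other node per round) and is not the quantile algorithm of this paper. The function name `push_sum`, its parameters, and the synchronous round structure are assumptions made for this sketch, and it requires at least two nodes.

```python
import random

def push_sum(values, rounds, seed=0):
    """Simulate push-sum gossip; return each node's estimate of the average."""
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]  # running sums
    w = [1.0] * n                   # running weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            # Pick a uniformly random node other than i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Keep half of (s, w) and push the other half to node j.
            half_s, half_w = s[i] / 2, w[i] / 2
            new_s[i] += half_s
            new_w[i] += half_w
            new_s[j] += half_s
            new_w[j] += half_w
        s, w = new_s, new_w
    # Each node's estimate is the ratio of its sum to its weight.
    return [si / wi for si, wi in zip(s, w)]
```

Because every node keeps half of its own weight, no weight ever reaches zero, and the totals of `s` and `w` are conserved across rounds, so the ratios drift toward the global average.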
Streams Going Notts: The tidal debris finder comparison project
While various codes exist to systematically and robustly find haloes and
subhaloes in cosmological simulations (Knebe et al., 2011, Onions et al.,
2012), this is the first work to introduce and rigorously test codes that find
tidal debris (streams and other unbound substructure) in fully cosmological
simulations of structure formation. We use one tracking and three non-tracking
codes to identify substructure (bound and unbound) in a Milky Way type
simulation from the Aquarius suite (Springel et al., 2008) and post-process
their output with a common pipeline to determine the properties of these
substructures in a uniform way. By using output from a fully cosmological
simulation, we also take a step beyond previous studies of tidal debris that
have used simple toy models. We find that both tracking and non-tracking codes
agree well on the identification of subhaloes and, more importantly, the
unbound tidal features associated with them. The distributions of basic
properties of the total substructure distribution (mass, velocity dispersion,
position) are recovered with a scatter of . Using the tracking code as
our reference, we show that the non-tracking codes identify complex tidal
debris with purities of . Analysing the results of the substructure
finders, we find that the general distribution of substructures differs
significantly from the distribution of bound subhaloes. Most importantly,
both bound and unbound substructures together constitute of the
host halo mass, which is a factor of higher than the fraction in
self-bound subhaloes. However, this result is restricted by the remaining
challenge to cleanly define when an unbound structure has become part of the
host halo. Nevertheless, the more general substructure distribution provides a
more complete picture of a halo's accretion history. Comment: 19 pages, 12 figures, accepted for publication in MNRAS
NASA Tech Briefs, October 1988
Topics include: New Product Ideas; NASA TU Services; Electronic Components and Circuits; Electronic Systems; Physical Sciences Materials; Computer Programs; Mechanics; Machinery; Fabrication Technology; Mathematics and Information Sciences; Life Sciences
Bridging the gap between algorithmic and learned index structures
Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and design efficient support for insertion and deletion.
In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to leverage the data distributions and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes such that they can leverage data distribution similar to how learned indexes work. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time it considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes and hence can be reliably used as a default index structure in a database engine.
We also consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream.
Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
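The general idea of pairing a position prediction model with an algorithmic correction step can be sketched as follows. This is a generic, minimal illustration and not the thesis's design: it assumes a non-empty set of numeric keys, fits a single straight line through the smallest and largest key, records the worst prediction error, and then corrects each prediction with a binary search bounded by that error. The class name `LinearModelIndex` and its methods are hypothetical.

```python
import bisect

class LinearModelIndex:
    """Sorted-array index: a linear model predicts a position, search corrects it."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Line through (lo, 0) and (hi, n - 1); flat if all keys are equal.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return an index of key in the sorted array, or None if absent."""
        p = self._predict(key)
        lo = max(p - self.max_err, 0)
        hi = min(p + self.max_err + 1, len(self.keys))
        # Algorithmic correction: binary search only inside the error window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The search window shrinks as the model fits the key distribution better, while the binary search keeps lookups correct even when the model is poor. That is the trade-off between learned and algorithmic components discussed in the abstract, although the correction layer proposed there is reported to cost roughly one memory lookup rather than a bounded search.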