161 research outputs found

Evaluating holistic aggregators efficiently for very large datasets

Author: Fu Lixin
NC DOCKS at The University of North Carolina at Greensboro
Publication date: 01/01/2004

In data warehousing applications, numerous OLAP queries involve the processing of holistic aggregators such as computing the “top n,” median, quantiles, etc. In this paper, we present a novel approach called dynamic bucketing to efficiently evaluate these aggregators. We partition data into equiwidth buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the input order and distribution of input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones not only in runtime but also in accuracy.
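The general idea of selecting a rank through equiwidth buckets can be sketched as follows. This is only an illustration of bucket-based selection, not the paper's dynamic bucketing algorithm or its structure trees: the function name `kth_smallest_bucketed` and the parameter `num_buckets` are hypothetical, and the sketch keeps values in memory where a real system would likely keep only counts and re-scan the data.

```python
from typing import Sequence


def kth_smallest_bucketed(values: Sequence[float], k: int, num_buckets: int = 16) -> float:
    """Return the k-th smallest value (1-indexed) by repeated equiwidth bucketing."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be at least 2")
    if not 1 <= k <= len(values):
        raise ValueError("k must be in 1..len(values)")
    candidates = list(values)
    while True:
        lo, hi = min(candidates), max(candidates)
        # Small or constant candidate sets are resolved directly by sorting.
        if lo == hi or len(candidates) <= num_buckets:
            return sorted(candidates)[k - 1]
        width = (hi - lo) / num_buckets
        buckets = [[] for _ in range(num_buckets)]
        for v in candidates:
            # Clamp the index so that v == hi lands in the last bucket.
            idx = min(int((v - lo) / width), num_buckets - 1)
            buckets[idx].append(v)
        # Walk the histogram until the bucket holding rank k is found,
        # then refine only that (possibly dense) bucket in the next round.
        for bucket in buckets:
            if k <= len(bucket):
                candidates = bucket
                break
            k -= len(bucket)


data = [(i * 37) % 101 for i in range(200)]
median = kth_smallest_bucketed(data, (len(data) + 1) // 2)
```

Only the bucket that contains the requested rank is subdivided further, which loosely mirrors the idea of spending extra memory on dense buckets only.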

The University of North Carolina at Greensboro

Optimal Gossip Algorithms for Exact and Approximate Quantile Computations

Authors: Haeupler Bernhard; Mohapatra Jeet; Su Hsin-Hao
Publication date: 25/11/2017

This paper gives drastically faster gossip algorithms to compute exact and approximate quantiles. Gossip algorithms, which allow each node to contact a uniformly random other node in each round, have been intensely studied and adopted in many applications due to their fast convergence and their robustness to failures. Kempe et al. [FOCS'03] gave gossip algorithms to compute important aggregate statistics if every node is given a value. In particular, they gave a beautiful $O(\log n + \log \frac{1}{\epsilon})$ round algorithm to $\epsilon$-approximate the sum of all values and an $O(\log^2 n)$ round algorithm to compute the exact $\phi$-quantile, i.e., the $\lceil \phi n \rceil$-th smallest value. We give a quadratically faster and in fact optimal gossip algorithm for the exact $\phi$-quantile problem which runs in $O(\log n)$ rounds. We furthermore show that one can achieve an exponential speedup if one allows for an $\epsilon$-approximation. We give an $O(\log \log n + \log \frac{1}{\epsilon})$ round gossip algorithm which computes a value of rank between $\phi n$ and $(\phi+\epsilon)n$ at every node, for any $0 \leq \phi \leq 1$ and $0 < \epsilon < 1$. Our algorithms are extremely simple and very robust: they can be operated with the same running times even if every transmission fails with a, potentially different, constant probability. We also give a matching $\Omega(\log \log n + \log \frac{1}{\epsilon})$ lower bound which shows that our algorithm is optimal for all values of $\epsilon$.
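To make the gossip model concrete, here is a small simulation of the push-sum protocol, the averaging scheme usually attributed to Kempe et al. It is not the quantile algorithm of this paper; it only illustrates how "each node contacts a uniformly random other node in each round" can drive every node toward the same aggregate. The function name `push_sum_average` and the `rounds` and `seed` parameters are assumptions made for this sketch.

```python
import random


def push_sum_average(values, rounds, seed=0):
    """Simulate push-sum gossip; return each node's estimate of the average."""
    rng = random.Random(seed)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two nodes")
    s = [float(v) for v in values]  # per-node running sums
    w = [1.0] * n                   # per-node running weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            # Pick a uniformly random node other than i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Node i keeps half of its (sum, weight) pair and pushes the other half to j.
            for target in (i, j):
                new_s[target] += s[i] / 2
                new_w[target] += w[i] / 2
        s, w = new_s, new_w
    # The ratio sum/weight at each node converges to the global average.
    return [si / wi for si, wi in zip(s, w)]


estimates = push_sum_average(list(range(1, 65)), rounds=100)
```

Multiplying an estimate by the number of nodes gives an approximation of the sum. The totals of `s` and `w` are preserved in every round, which is why the per-node ratios agree on the true average in the limit.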

arXiv.org e-Print Archive

Crossref

Streams Going Notts: The tidal debris finder comparison project

Authors: Ascasibar Yago; Behroozi Peter; Elahi Pascal J.; Han Jiaxin; Knebe Alexander; Lux Hanni; Muldrew Stuart I.; Onions Julian; Pearce Frazer
Publication venue: Oxford University Press (OUP)
Publication date: 10/05/2013

While various codes exist to systematically and robustly find haloes and subhaloes in cosmological simulations (Knebe et al., 2011; Onions et al., 2012), this is the first work to introduce and rigorously test codes that find tidal debris (streams and other unbound substructure) in fully cosmological simulations of structure formation. We use one tracking and three non-tracking codes to identify substructure (bound and unbound) in a Milky Way type simulation from the Aquarius suite (Springel et al., 2008) and post-process their output with a common pipeline to determine the properties of these substructures in a uniform way. By using output from a fully cosmological simulation, we also take a step beyond previous studies of tidal debris that have used simple toy models. We find that both tracking and non-tracking codes agree well on the identification of subhaloes and, more importantly, the unbound tidal features associated with them. The distributions of basic properties of the total substructure distribution (mass, velocity dispersion, position) are recovered with a scatter of $\sim20\%$. Using the tracking code as our reference, we show that the non-tracking codes identify complex tidal debris with purities of $\sim40\%$. Analysing the results of the substructure finders, we find that the general distribution of substructures differs significantly from the distribution of bound subhaloes. Most importantly, both bound and unbound substructures together constitute $\sim18\%$ of the host halo mass, which is a factor of $\sim2$ higher than the fraction in self-bound subhaloes. However, this result is restricted by the remaining challenge to cleanly define when an unbound structure has become part of the host halo. Nevertheless, the more general substructure distribution provides a more complete picture of a halo's accretion history.
Comment: 19 pages, 12 figures, accepted for publication in MNRA

arXiv.org e-Print Archive

Shanghai Astronomical Observatory,Chinese Academy of Sciences

NASA Tech Briefs, October 1988


Topics include: New Product Ideas; NASA TU Services; Electronic Components and Circuits; Electronic Systems; Physical Sciences; Materials; Computer Programs; Mechanics; Machinery; Fabrication Technology; Mathematics and Information Sciences; Life Sciences.

NASA Technical Reports Server

Bridging the gap between algorithmic and learned index structures

Author: Hadian Ali
Publication venue: Computing, Imperial College London
Publication date: 01/07/2022

Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to leverage the data distributions and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes such that they can leverage data distribution similar to how learned indexes work. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time it considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes and hence can be reliably used as a default index structure in a database engine.
Besides, we consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
Open Acces
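The general pattern of pairing a position-prediction model with an algorithmic correction step can be sketched as below. This is a generic learned-index illustration under stated assumptions, not the thesis's design: the class `LinearModelIndex`, its straight-line model through the first and last key, and the `max_err` search window are all hypothetical choices made for the example.

```python
import bisect


class LinearModelIndex:
    """Sorted-array index: a linear model predicts a position, bisect corrects it."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Hypothetical model: a straight line through the first and last key.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Record the worst prediction error so lookups can bound their search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        # Round the model output and clamp it to a valid array position.
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        pos = self._predict(key)
        lo = max(pos - self.max_err, 0)
        hi = min(pos + self.max_err + 1, len(self.keys))
        # Algorithmic correction: binary search only inside the error window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None


index = LinearModelIndex([1, 2, 4, 8, 16, 32, 64, 128, 256, 512])
position = index.lookup(64)
```

Because the search window is derived from the measured worst-case error, every stored key is found even when the model fits the data poorly. A poor fit only widens the window and makes lookups slower.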

Spiral - Imperial College Digital Repository