161 research outputs found

    Evaluating holistic aggregators efficiently for very large datasets

    In data warehousing applications, many OLAP queries involve holistic aggregators such as “top n,” median, and quantiles. In this paper, we present a novel approach called dynamic bucketing to evaluate these aggregators efficiently. We partition data into equiwidth buckets and further partition dense buckets into sub-buckets as needed by allocating and reclaiming memory space. The bucketing process dynamically adapts to the input order and distribution of the input datasets. The histograms of the buckets and sub-buckets are stored in our new data structure called structure trees. A recent selection algorithm based on regular sampling is generalized and its analysis extended. We have also compared our new algorithms with this generalized algorithm and several other recent algorithms. Experimental results show that our new algorithms significantly outperform prior ones in both runtime and accuracy.
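The bucketing idea can be illustrated with a minimal sketch. This is not the paper's dynamic bucketing algorithm or its structure trees, which the abstract does not specify; it is a simplified multi-pass variant that histograms the values into equi-width buckets, keeps only the bucket containing the target rank, and re-buckets that subset. The function name `kth_smallest_bucketed` and the parameter `n_buckets` are hypothetical, and the data is held in memory here rather than streamed.

```python
from typing import List, Sequence


def kth_smallest_bucketed(data: Sequence[float], k: int, n_buckets: int = 16) -> float:
    """Return the k-th smallest value (1-indexed) by repeated equi-width bucketing.

    Illustrative only: a simplified stand-in, not the paper's method.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    candidates: List[float] = list(data)
    while True:
        lo, hi = min(candidates), max(candidates)
        width = (hi - lo) / n_buckets
        # Small or degenerate candidate sets are cheap to sort directly.
        if width == 0 or len(candidates) <= n_buckets:
            return sorted(candidates)[k - 1]

        def bucket_of(x: float) -> int:
            # Clamp so that x == hi lands in the last bucket.
            return min(int((x - lo) / width), n_buckets - 1)

        # Build the histogram of the current candidate range.
        counts = [0] * n_buckets
        for x in candidates:
            counts[bucket_of(x)] += 1

        # Locate the bucket that contains rank k.
        seen = 0
        target = 0
        for b, c in enumerate(counts):
            if seen + c >= k:
                target = b
                break
            seen += c

        # Keep only that bucket and continue with the rank relative to it.
        candidates = [x for x in candidates if bucket_of(x) == target]
        k -= seen
```

For example, `kth_smallest_bucketed(values, (len(values) + 1) // 2)` returns the lower median. Each round discards every value outside one bucket, so dense regions are refined while sparse ones are dropped after a single count.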

    Optimal Gossip Algorithms for Exact and Approximate Quantile Computations

    This paper gives drastically faster gossip algorithms to compute exact and approximate quantiles. Gossip algorithms, which allow each node to contact a uniformly random other node in each round, have been intensely studied and adopted in many applications due to their fast convergence and their robustness to failures. Kempe et al. [FOCS'03] gave gossip algorithms to compute important aggregate statistics if every node is given a value. In particular, they gave a beautiful $O(\log n + \log \frac{1}{\epsilon})$ round algorithm to $\epsilon$-approximate the sum of all values and an $O(\log^2 n)$ round algorithm to compute the exact $\phi$-quantile, i.e., the $\lceil \phi n \rceil$-th smallest value. We give a quadratically faster and in fact optimal gossip algorithm for the exact $\phi$-quantile problem which runs in $O(\log n)$ rounds. We furthermore show that one can achieve an exponential speedup if one allows for an $\epsilon$-approximation. We give an $O(\log \log n + \log \frac{1}{\epsilon})$ round gossip algorithm which computes a value of rank between $\phi n$ and $(\phi+\epsilon)n$ at every node, for any $0 \leq \phi \leq 1$ and $0 < \epsilon < 1$. Our algorithms are extremely simple and very robust: they can be operated with the same running times even if every transmission fails with a, potentially different, constant probability. We also give a matching $\Omega(\log \log n + \log \frac{1}{\epsilon})$ lower bound which shows that our algorithm is optimal for all values of $\epsilon$.
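To make the gossip model concrete, here is a minimal simulation sketch of push-sum averaging, the style of protocol associated with Kempe et al. for aggregate statistics. It is not the quantile algorithm of this paper, which the abstract does not describe in detail. The function name `push_sum_average`, the synchronous-round simulation, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List


def push_sum_average(values: List[float], rounds: int, seed: int = 0) -> List[float]:
    """Simulate push-sum gossip and return each node's estimate of the average.

    Illustrative simulation only; all nodes act in synchronous rounds.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    s = [float(v) for v in values]  # running sums held by each node
    w = [1.0] * n                   # running weights held by each node
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            # Pick a uniformly random node other than i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Keep half of the (sum, weight) pair and push the other half to j.
            half_s, half_w = s[i] / 2.0, w[i] / 2.0
            new_s[i] += half_s
            new_w[i] += half_w
            new_s[j] += half_s
            new_w[j] += half_w
        s, w = new_s, new_w
    # Each node's local estimate is the ratio of its sum to its weight.
    return [si / wi for si, wi in zip(s, w)]
```

Because every node keeps half of its own weight, no weight ever reaches zero, and the totals of `s` and `w` are preserved from round to round (up to floating-point rounding), which is why the per-node ratios approach the true average.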

    Streams Going Notts: The tidal debris finder comparison project

    While various codes exist to systematically and robustly find haloes and subhaloes in cosmological simulations (Knebe et al., 2011; Onions et al., 2012), this is the first work to introduce and rigorously test codes that find tidal debris (streams and other unbound substructure) in fully cosmological simulations of structure formation. We use one tracking and three non-tracking codes to identify substructure (bound and unbound) in a Milky Way type simulation from the Aquarius suite (Springel et al., 2008) and post-process their output with a common pipeline to determine the properties of these substructures in a uniform way. By using output from a fully cosmological simulation, we also take a step beyond previous studies of tidal debris that have used simple toy models. We find that both tracking and non-tracking codes agree well on the identification of subhaloes and, more importantly, the unbound tidal features associated with them. The distributions of basic properties of the total substructure distribution (mass, velocity dispersion, position) are recovered with a scatter of $\sim$20%. Using the tracking code as our reference, we show that the non-tracking codes identify complex tidal debris with purities of $\sim$40%. Analysing the results of the substructure finders, we find that the general distribution of substructures differs significantly from the distribution of bound subhaloes. Most importantly, both bound and unbound substructures together constitute $\sim$18% of the host halo mass, which is a factor of $\sim$2 higher than the fraction in self-bound subhaloes. However, this result is restricted by the remaining challenge to cleanly define when an unbound structure has become part of the host halo. Nevertheless, the more general substructure distribution provides a more complete picture of a halo's accretion history. Comment: 19 pages, 12 figures, accepted for publication in MNRA

    NASA Tech Briefs, October 1988

    Topics include: New Product Ideas; NASA TU Services; Electronic Components and Circuits; Electronic Systems; Physical Sciences; Materials; Computer Programs; Mechanics; Machinery; Fabrication Technology; Mathematics and Information Sciences; Life Sciences.

    Bridging the gap between algorithmic and learned index structures

    Index structures such as B-trees and bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to leverage the data distributions and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes such that they can leverage data distribution similar to how learned indexes work. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time it considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes and hence can be reliably used as a default index structure in a database engine. 
    Besides, we consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup. Open Acces
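The combination of a position-prediction model with an algorithmic correction step can be sketched as follows. This is a generic illustration and not the thesis's index structures: it assumes a single linear model fitted through the smallest and largest keys and corrects the prediction with a binary search bounded by the model's maximum observed error, rather than the roughly-one-lookup enhancement layer described in the abstract. The class name `LinearModelIndex` and its methods are hypothetical.

```python
import bisect
from typing import List, Optional


class LinearModelIndex:
    """Sorted-array index: a linear model guesses the position, and a
    bounded binary search corrects the guess. Illustrative sketch only."""

    def __init__(self, keys: List[float]) -> None:
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Assumed model: a straight line through the first and last key.
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.intercept = -self.slope * lo
        # Largest distance between a predicted and an actual position.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key: float) -> int:
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key: float) -> Optional[int]:
        """Return the index of the first occurrence of key, or None if absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        # Algorithmic correction: search only inside the error window.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

A usage example: `LinearModelIndex([1, 4, 9, 16, 25]).lookup(9)` searches only the window around the predicted slot. The closer the key distribution is to the model, the smaller `max_err` becomes and the fewer comparisons the correction step needs.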