6,442 research outputs found

    Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms

    Get PDF
    Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one di-mensional data distributions by histograms. A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld [ILR12] in PODS 2012: Given samples from a distribution p over {1,..., n}, compute a k-histogram that minimizes the `2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information– theoretically minimal sample size of m = O(1/ε2), runs in sample–linear time O(m), and outputs an O(k) – his-togram whose `2-distance from p is at most O(optk)+, where optk is the minimum `2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size n. We generalize our approach to obtain fast algo-rithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms. 1

    Distributed top-k aggregation queries at large

    Get PDF
    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially for distributed settings, when the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network
    corecore