
    Evaluation of optimization techniques for aggregation

    Aggregations are almost always placed at the top of the operator tree, after all selections and joins in a SQL query. However, they can also be performed before joins, which can make the later joins much cheaper when applied properly. Although several enumeration algorithms that consider eager aggregation have been proposed, the available evaluations are insufficient to guide the adoption of this technique in practice. In particular, no evaluations have been done on real data sets and real queries with estimated cardinalities, so it is not known how eager aggregation performs in the real world. In this thesis, a new estimation method for group-by and join, which combines a traditional estimation method with index-based join sampling, is proposed and evaluated. Two enumeration algorithms that consider eager aggregation are implemented and compared under estimated cardinalities. We find that the new estimation method works well with little overhead and that, under certain conditions, eager aggregation can dramatically accelerate queries.
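
To make the idea of eager aggregation concrete, here is a minimal Python sketch. It is not the thesis's enumeration algorithm or estimation method; the tables `orders` and `customers`, and the helpers `hash_join` and `sum_by`, are hypothetical names chosen for illustration. It assumes a SUM aggregate (which can be split into partial sums) and a join on a key of `customers`, under which aggregating before the join gives the same answer while feeding fewer rows into the join.

```python
from collections import defaultdict

# Hypothetical toy tables, not taken from the thesis.
orders = [(1, 10.0), (1, 5.0), (2, 7.0), (2, 3.0), (3, 1.0), (1, 4.0)]  # (cust_id, amount)
customers = [(1, "EU"), (2, "EU"), (3, "US")]  # (cust_id, region); cust_id is a key


def hash_join(left, right_by_key):
    """Join (key, value) rows against a dict built from the right-hand table."""
    return [(key, value, right_by_key[key]) for key, value in left if key in right_by_key]


def sum_by(rows, key_index, value_index):
    """GROUP BY rows[key_index], SUM(rows[value_index])."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[value_index]
    return dict(totals)


def lazy_plan():
    """Join first, aggregate at the top of the plan."""
    joined = hash_join(orders, dict(customers))
    return sum_by(joined, 2, 1), len(orders)


def eager_plan():
    """Pre-aggregate orders per cust_id, join the partial sums, then finish the aggregation."""
    partial = list(sum_by(orders, 0, 1).items())
    joined = hash_join(partial, dict(customers))
    return sum_by(joined, 2, 1), len(partial)


lazy_result, lazy_join_input = lazy_plan()
eager_result, eager_join_input = eager_plan()
```

Both plans return the same totals per region, but the eager plan probes the join with one row per customer instead of one row per order. Whether that saving outweighs the cost of the extra aggregation depends on the cardinalities involved, which is why the estimation quality discussed in the abstract matters.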

    Moment-based parameter estimation in binomial random intersection graph models

    Binomial random intersection graphs can be used as parsimonious statistical models of large and sparse networks, with one parameter for the average degree and another for transitivity, the tendency of neighbours of a node to be connected. This paper discusses the estimation of these parameters from a single observed instance of the graph, using moment estimators based on observed degrees and frequencies of 2-stars and triangles. The observed data set is assumed to be a subgraph induced by a set of $n_0$ nodes sampled from the full set of $n$ nodes. We prove the consistency of the proposed estimators by showing that the relative estimation error is small with high probability for $n_0 \gg n^{2/3} \gg 1$. As a byproduct, our analysis confirms that the empirical transitivity coefficient of the graph is with high probability close to the theoretical clustering coefficient of the model. Comment: 15 pages, 6 figures
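
The observed quantities named in the abstract (degrees, 2-stars, triangles) can be computed directly from an edge list. The sketch below is only an illustration of those statistics and of the standard empirical transitivity coefficient, 3 × (number of triangles) / (number of 2-stars); it does not reproduce the paper's estimators for the model parameters, and the function name `graph_statistics` is a hypothetical choice.

```python
from itertools import combinations


def graph_statistics(nodes, edges):
    """Return mean degree, 2-star count, triangle count and empirical transitivity."""
    adjacency = {v: set() for v in nodes}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    degrees = [len(neighbours) for neighbours in adjacency.values()]
    # A 2-star is a pair of distinct neighbours of a common centre node.
    two_stars = sum(d * (d - 1) // 2 for d in degrees)
    # Closed 2-stars: each triangle is seen once from each of its three nodes.
    closed = sum(
        1
        for neighbours in adjacency.values()
        for a, b in combinations(neighbours, 2)
        if b in adjacency[a]
    )
    return {
        "mean_degree": sum(degrees) / len(degrees) if degrees else 0.0,
        "two_stars": two_stars,
        "triangles": closed // 3,
        "transitivity": closed / two_stars if two_stars else 0.0,
    }


# A triangle 0-1-2 with one pendant node 3 attached to node 2.
stats = graph_statistics(range(4), [(0, 1), (1, 2), (0, 2), (2, 3)])
```

For this small graph there are five 2-stars and one triangle, so the empirical transitivity is 3/5.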

    Statistical structures for internet-scale data management

    Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. In this paper, we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We then show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental performance evaluation that assesses our contributions in terms of efficiency, accuracy, and scalability.
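
For readers unfamiliar with the histogram types mentioned, the following is a centralized Python sketch of Equi-Width and Equi-Depth histograms. It deliberately ignores the decentralized, peer-to-peer setting that is the paper's actual contribution; the function names and the sample data are assumptions made for illustration.

```python
def equi_width(values, buckets):
    """Split [min, max] into equally wide buckets; return (low, high, count) per bucket."""
    low, high = min(values), max(values)
    width = (high - low) / buckets
    if width == 0:
        width = 1.0  # all values identical: everything falls into bucket 0
    counts = [0] * buckets
    for value in values:
        index = min(int((value - low) / width), buckets - 1)  # max value goes in last bucket
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c) for i, c in enumerate(counts)]


def equi_depth(values, buckets):
    """Return bucket boundaries chosen so each bucket holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    boundaries = [ordered[min(n - 1, (i * n) // buckets)] for i in range(buckets)]
    return boundaries + [ordered[-1]]


# Hypothetical skewed data: eleven small values and one outlier.
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
width_hist = equi_width(data, 4)
depth_bounds = equi_depth(data, 4)
```

On skewed data like this, the Equi-Width histogram puts almost every value into its first bucket, while the Equi-Depth boundaries adapt so that each bucket covers three values.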

    Fully decentralized computation of aggregates over data streams

    In several emerging applications, data is collected in massive streams at several distributed points of observation. A basic and challenging task is to allow every node to monitor a neighbourhood of interest by issuing continuous aggregate queries on the streams observed in its vicinity. This class of algorithms is fully decentralized and diffusive in nature: collecting all data at a few central nodes of the network is infeasible in networks of low-capability devices or in the presence of massive data sets. The main difficulty in designing diffusive algorithms is coping with duplicate detections. These arise both from the observation of the same event at several nodes of the network and from the receipt of the same aggregated information along multiple paths of diffusion. In this paper, we consider fully decentralized algorithms that locally answer continuous aggregate queries on the number of distinct events, the total number of events, and the second frequency moment in the scenario outlined above. The proposed algorithms use sublinear space at every node, either in the worst case or on realistic distributions. We also propose strategies that minimize the communication needed to update the aggregates when new events are observed. We experimentally evaluate the efficiency and accuracy of our algorithms on realistic simulated scenarios.
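
The duplicate problem described above is commonly handled with summaries whose merge operation is insensitive to repetition. The Python sketch below uses a k-minimum-values (KMV) sketch for distinct counting as one such summary. This is a swapped-in, well-known technique used purely for illustration, not the algorithm proposed in the paper; the class name `KMVSketch` and the parameter `k` are assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; space is O(k) regardless of stream length."""

    def __init__(self, k=64):
        self.k = k
        self.values = set()

    @staticmethod
    def _hash(item):
        # Map an event identifier to a pseudo-uniform number in [0, 1).
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def _trim(self):
        if len(self.values) > self.k:
            self.values = set(heapq.nsmallest(self.k, self.values))

    def add(self, item):
        self.values.add(self._hash(item))
        self._trim()

    def merge(self, other):
        # Set union is idempotent, so receiving the same sketch twice changes nothing.
        self.values |= other.values
        self._trim()

    def estimate(self):
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct hashes: count is exact
        return (self.k - 1) / max(self.values)


# Two nodes observe overlapping sets of events (30 each, 50 distinct overall).
node_a, node_b = KMVSketch(), KMVSketch()
for event in range(0, 30):
    node_a.add(event)
for event in range(20, 50):
    node_b.add(event)

node_a.merge(node_b)
first_estimate = node_a.estimate()
node_a.merge(node_b)  # the same information arrives again along another path
second_estimate = node_a.estimate()
```

Because both adding an event and merging a sketch reduce to a set union followed by truncation, an event seen at several nodes, or a sketch received along several diffusion paths, is counted once.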