4 research outputs found

    Random Sampling for Group-By Queries

    Get PDF
    Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after filtering through a predicate. The challenge with group-by queries is that a sampling method cannot focus on optimizing the quality of a single answer (e.g. the mean of selected data), but must simultaneously optimize the quality of a set of answers (one per group).We present CVOPT, a query- and data-driven sampling framework for a set of group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm (e.g. ℓ2 or ℓ∞) of the coefficients of variation (CVs) of different answers, and constructs a stratified sample that provably optimizes the metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can be tuned to cases when partial information about a query workload is known, such as a data warehouse where queries are scheduled to run periodically.Our experimental results show that CVOPT outperforms current state-of-the-art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are 5x smaller than competing approaches, under the same space budget

    Approximate query processing in a data warehouse using random sampling

    Get PDF
    Data analysis consumes a large volume of data on a routine basis.. With the fast increase in both the volume of the data and the complexity of the analytic tasks, data processing becomes more complicated and expensive. The cost efficiency is a key factor in the design and deployment of data warehouse systems. Approximate query processing is a well-known approach to handle massive data among different methods to make big data processing more efficient, in which a small sample is used to answer the query. For many applications, a small error is justifiable for the saving of resources consumed to answer the query, as well as reducing the latency. We focus on the approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective usages of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling that is optimized for population aggregate query. Next, as the query involves, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce the sampling over the pipeline model of queries processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, approximate results are fed into downstream queries, unlike in a single query. Thus, we see both aggregate computations from sampled input and approximate input. We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and calculates the confidence interval for produced approximate results. The key insight of the algorithm calls for enriching the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without having to perform any global rewrites, i.e. no extra query or table is added into the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once and avoids the need for maintaining intermediary aggregation distributions. Our empirical study on public and private datasets shows that our sampling algorithm can have significantly (1.4 to 50.0 times) smaller variance, compared to the Neyman algorithm, for optimal sample for population aggregate queries. Our experimental results for group-by queries show that our sample algorithm outperforms the current state-of-the-art on sample quality and estimation accuracy. The optimal sample yields relative errors that are 5x smaller than competing approaches, under the same budget. The experiments for approximate pipeline processing show the high accuracy of the computed estimation, with an average error as low as 2%, using only a 1% sample. It also shows the usefulness of the confidence interval. At the confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundary from 70.49% to 95.15% of times
    corecore