
    Approximate query processing in a data warehouse using random sampling

    Data analysis routinely consumes large volumes of data. With the rapid growth in both the volume of data and the complexity of analytic tasks, data processing has become more complicated and expensive, making cost efficiency a key factor in the design and deployment of data warehouse systems. Among the methods for making big data processing more efficient, approximate query processing is a well-known approach in which a small sample is used to answer the query. For many applications, a small error is justifiable in exchange for the resources saved in answering the query and the reduced latency. We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling optimized for population aggregate queries. Next, for more involved queries, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries, so we see both aggregate computations over sampled input and approximate input. We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and computes confidence intervals for the approximate results it produces. The key insight of the algorithm is to enrich the output of queries with additional information, which lets it piggyback on the modular structure of the pipeline without performing any global rewrites, i.e., no extra query or table is added to the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once, and avoids maintaining intermediary aggregation distributions. Our empirical study on public and private datasets shows that our sampling algorithm yields significantly (1.4 to 50.0 times) smaller variance than the Neyman algorithm when drawing optimal samples for population aggregate queries. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state of the art in sample quality and estimation accuracy: under the same budget, the optimal sample yields relative errors that are 5x smaller than competing approaches. The experiments for approximate pipeline processing show the high accuracy of the computed estimates, with an average error as low as 2% using only a 1% sample, and demonstrate the usefulness of the confidence interval: at a confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundary 70.49% to 95.15% of the time.
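
    The abstract leans on stratified sampling with Neyman-style allocation and normal-theory confidence intervals. The sketch below illustrates those standard building blocks rather than the thesis's own algorithms; the function names, the synthetic gamma-distributed strata, and the 300-row budget are all illustrative assumptions.

```python
# A minimal sketch, assuming synthetic data: Neyman allocation for a
# stratified sample, an unbiased SUM estimate, and a 95% CI half-width.
import numpy as np

def neyman_allocation(strata, budget):
    """Allocate the sample budget across strata proportionally to N_h * S_h."""
    weights = np.array([len(s) * np.std(s, ddof=1) for s in strata])
    return np.maximum(2, np.round(budget * weights / weights.sum())).astype(int)

def stratified_sum_estimate(strata, alloc, z=1.96, seed=0):
    """Unbiased estimate of the population SUM plus a 95% CI half-width."""
    rng = np.random.default_rng(seed)
    est, var = 0.0, 0.0
    for s, n_h in zip(strata, alloc):
        N_h = len(s)
        sample = rng.choice(s, size=min(n_h, N_h), replace=False)
        est += N_h * sample.mean()
        # Variance of the stratum-total estimator, with finite-population correction.
        var += N_h**2 * (1 - sample.size / N_h) * sample.var(ddof=1) / sample.size
    return est, z * np.sqrt(var)

# Usage: three skewed strata with very different spreads, 300-row budget.
rng = np.random.default_rng(7)
strata = [rng.gamma(2.0, 10.0 * (i + 1), size=10_000) for i in range(3)]
alloc = neyman_allocation(strata, budget=300)
est, half = stratified_sum_estimate(strata, alloc)
true_sum = sum(s.sum() for s in strata)
print(f"estimated SUM = {est:,.0f} +/- {half:,.0f}, true SUM = {true_sum:,.0f}")
```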

    Database Learning: Toward a Database that Becomes Smarter Every Time

    In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query, because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea of learning from past query answers Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.
    Comment: This manuscript is an extended report of the work published in ACM SIGMOD conference 201
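
    Verdict's maximum-entropy machinery is far richer than anything that fits here, but the core intuition, that a cached past answer can shrink the error of a fresh sample-based estimate, can be illustrated with a toy precision-weighted combination. The data, sample sizes, and combination rule below are illustrative assumptions, not Verdict's actual method.

```python
# A toy illustration, assuming both queries target the same underlying
# distribution: combine a cached low-variance past answer with a fresh
# small-sample estimate via inverse-variance (precision) weighting.
import numpy as np

rng = np.random.default_rng(42)
population = rng.lognormal(mean=3.0, sigma=1.0, size=1_000_000)

# Past query: AVG over a large sample -> low-variance answer we cache.
past = rng.choice(population, 50_000)
prior_mean, prior_var = past.mean(), past.var(ddof=1) / past.size

# New query over the same distribution, answered with a tiny sample.
fresh = rng.choice(population, 500)
fresh_mean, fresh_var = fresh.mean(), fresh.var(ddof=1) / fresh.size

# Inverse-variance combination: weight each estimate by its precision.
w = (1 / prior_var) / (1 / prior_var + 1 / fresh_var)
combined = w * prior_mean + (1 - w) * fresh_mean
combined_var = 1 / (1 / prior_var + 1 / fresh_var)

print(f"true mean        : {population.mean():.3f}")
print(f"fresh sample only: {fresh_mean:.3f} (se {np.sqrt(fresh_var):.3f})")
print(f"with past answer : {combined:.3f} (se {np.sqrt(combined_var):.3f})")
```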

    Rapid Sampling for Visualizations with Ordering Guarantees

    Visualizations are frequently used as a means to understand trends and gather insights from datasets, but often take a long time to generate. In this paper, we focus on the problem of rapidly generating approximate visualizations while preserving crucial visual properties of interest to analysts. Our primary focus is on sampling algorithms that preserve the visual property of ordering; our techniques also apply to some other visual properties. For instance, our algorithms can be used to generate an approximate visualization of a bar chart very rapidly, where the comparisons between any two bars are correct. We formally show that our sampling algorithms are generally applicable and provably optimal in theory, in that they do not take more samples than necessary to generate the visualizations with ordering guarantees. They also work well in practice, correctly ordering output groups while taking orders of magnitude fewer samples and much less time than conventional sampling schemes.
    Comment: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No.
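
    A simplified sketch of the general idea (not the paper's exact algorithm): keep drawing per-group samples until the Hoeffding confidence intervals of adjacent bars separate, at which point the displayed ordering is correct with high probability. The group distributions, batch size, and failure probability below are illustrative assumptions.

```python
# A minimal sketch, assuming bounded values in [0, 1]: certify a bar-chart
# ordering by sampling each group until adjacent Hoeffding CIs are disjoint.
import numpy as np

def hoeffding_radius(n, delta, value_range=1.0):
    # Hoeffding CI half-width for the mean of n samples bounded in [0, range].
    return value_range * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def order_groups(groups, delta=0.05, batch=50, max_n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    samples = {g: list(rng.choice(data, batch)) for g, data in groups.items()}
    per_group_delta = delta / len(groups)          # union bound over groups
    while True:
        stats = {g: (np.mean(s), hoeffding_radius(len(s), per_group_delta))
                 for g, s in samples.items()}
        order = sorted(stats, key=lambda g: stats[g][0])
        # The ordering is certified once every adjacent pair of CIs is disjoint.
        gaps = [stats[hi][0] - stats[hi][1] > stats[lo][0] + stats[lo][1]
                for lo, hi in zip(order, order[1:])]
        if all(gaps) or max(len(s) for s in samples.values()) >= max_n:
            return order, {g: len(s) for g, s in samples.items()}
        for (lo, hi), separated in zip(zip(order, order[1:]), gaps):
            if not separated:                      # sample only ambiguous pairs
                samples[lo].extend(rng.choice(groups[lo], batch))
                samples[hi].extend(rng.choice(groups[hi], batch))

rng = np.random.default_rng(1)
groups = {g: rng.beta(2, 5 - i, size=500_000) for i, g in enumerate("ABC")}
order, counts = order_groups(groups)
print("certified ascending order:", order, "| samples used:", counts)
```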

    DROP: Dimensionality Reduction Optimization for Time Series

    Dimensionality reduction is a critical step in scaling machine learning pipelines. Principal component analysis (PCA) is a standard tool for dimensionality reduction, but performing PCA over a full dataset can be prohibitively expensive. As a result, theoretical work has studied the effectiveness of iterative, stochastic PCA methods that operate over data samples. However, stochastic PCA methods either run for a predetermined number of iterations or until the solution converges, frequently sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during dimensionality reduction (DR) via PCA allows stochastic methods to efficiently terminate after operating over small (e.g., 1%) subsamples of input data, reducing whole workload runtime. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5x over Singular-Value-Decomposition-based PCA techniques, and exceeds conventional approaches like FFT and PAA by up to 16x in end-to-end workloads.
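
    The sketch below illustrates the general pattern the abstract describes, under assumptions of my own: stochastic PCA via Oja-style updates on small batches, terminated once a downstream proxy metric (here, variance captured on a holdout set) plateaus rather than at full convergence. It is not DROP itself; the shapes, step size, and tolerance are illustrative.

```python
# A minimal sketch, assuming centered data: Oja-style stochastic PCA that
# stops early when a downstream quality proxy stops improving.
import numpy as np

def stochastic_pca(X, k, batch=64, lr=1e-2, tol=1e-3, max_iters=2000):
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.normal(size=(d, k)))    # random orthonormal start
    holdout = X[rng.choice(n, size=min(n, 512), replace=False)]
    prev_quality = -np.inf
    for it in range(max_iters):
        idx = rng.integers(0, n, size=batch)        # tiny subsample per step
        B = X[idx]
        W += lr * (B.T @ (B @ W)) / batch           # Oja-style gradient step
        W, _ = np.linalg.qr(W)                      # re-orthonormalize
        if it % 25 == 0:
            # Downstream proxy: fraction of holdout variance captured by W.
            proj = holdout @ W
            quality = (proj ** 2).sum() / (holdout ** 2).sum()
            if quality - prev_quality < tol:        # plateau -> stop early
                return W, it
            prev_quality = quality
    return W, max_iters

rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 50)) @ rng.normal(size=(50, 50))
X -= X.mean(axis=0)
W, iters = stochastic_pca(X, k=5)
print(f"stopped after {iters} iterations")
```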

    Unbiased Comparative Evaluation of Ranking Functions

    Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing k systems against a baseline, and ranking k systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half.
    Comment: Under review; 10 page
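
    The Monte Carlo view can be made concrete with a standard Horvitz-Thompson estimator: judge only a sampled subset of documents, and weight each elicited judgment by the inverse of its inclusion probability so that the metric stays unbiased in expectation. The relevance data and sampling distribution below are illustrative assumptions, not the paper's own estimators.

```python
# A minimal sketch, assuming independently sampled judgments: an unbiased
# Horvitz-Thompson estimate of DCG from partially judged rankings.
import numpy as np

def sampled_dcg(ranking_rel, include_prob, rng):
    """Unbiased DCG estimate from Poisson-sampled relevance judgments."""
    est = 0.0
    for rank, (rel, p) in enumerate(zip(ranking_rel, include_prob), start=1):
        if rng.random() < p:                       # this judgment was elicited
            est += (rel / np.log2(rank + 1)) / p   # Horvitz-Thompson weight
    return est

rng = np.random.default_rng(0)
rel = rng.integers(0, 4, size=100)                 # hidden "true" judgments
true_dcg = sum(r / np.log2(i + 2) for i, r in enumerate(rel))
# Judge top ranks with high probability, tail ranks rarely (variance-aware).
probs = np.clip(1.0 / np.sqrt(np.arange(1, 101)), 0.05, 1.0)
estimates = [sampled_dcg(rel, probs, rng) for _ in range(5_000)]
print(f"true DCG {true_dcg:.2f}  mean estimate {np.mean(estimates):.2f} "
      f"(~{probs.mean():.0%} of docs judged per run)")
```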

    Speculative Approximations for Terascale Analytics

    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to stop the execution accurately and in a timely manner. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th fraction of the time.
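
    A minimal sketch of the online-aggregation idea in isolation (not GLADE PF-OLA): estimate each configuration's objective from growing data samples and halt any configuration whose confidence interval is already confidently worse than the best one. The toy per-configuration loss distributions and the normal-theory bounds are assumptions for illustration.

```python
# A simplified sketch, assuming i.i.d. per-example losses: race several
# configurations, halting those whose CI lower bound exceeds the best's
# CI upper bound.
import numpy as np

def race_configs(loss_samplers, batch=200, rounds=50, z=2.58):
    rng = np.random.default_rng(0)
    alive = set(loss_samplers)
    losses = {c: [] for c in loss_samplers}
    for _ in range(rounds):
        for c in alive:
            losses[c].extend(loss_samplers[c](rng, batch))  # more sampled rows
        bounds = {}
        for c in alive:
            s = np.asarray(losses[c])
            half = z * s.std(ddof=1) / np.sqrt(s.size)
            bounds[c] = (s.mean() - half, s.mean() + half)
        best_upper = min(ub for _, ub in bounds.values())
        # Halt configs whose lower bound is already above the best upper bound.
        alive = {c for c in alive if bounds[c][0] <= best_upper}
        if len(alive) == 1:
            break
    return alive, {c: len(s) for c, s in losses.items()}

# Toy objective: each config's per-example loss is noisy around a true mean.
true_loss = {"lr=0.1": 0.30, "lr=0.01": 0.25, "lr=1.0": 0.55, "lr=0.001": 0.40}
samplers = {c: (lambda rng, n, m=m: rng.normal(m, 0.5, n))
            for c, m in true_loss.items()}
alive, spent = race_configs(samplers)
print("survivors:", alive, "| samples drawn per config:", spent)
```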

    A Bandit Approach to Maximum Inner Product Search

    There has been substantial research on sub-linear time approximate algorithms for Maximum Inner Product Search (MIPS). To achieve fast query time, state-of-the-art techniques require significant preprocessing, which can be a burden when the number of subsequent queries is not sufficiently large to amortize the cost. Furthermore, existing methods do not have the ability to directly control the suboptimality of their approximate results with theoretical guarantees. In this paper, we propose the first approximate algorithm for MIPS that does not require any preprocessing, and allows users to control and bound the suboptimality of the results. We cast MIPS as a Best Arm Identification problem, and introduce a new bandit setting that can fully exploit the special structure of MIPS. Our approach outperforms state-of-the-art methods on both synthetic and real-world datasets.
    Comment: AAAI 201
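
    In the spirit of the paper's framing (though not its exact bandit algorithm), the sketch below casts MIPS as best-arm identification: each candidate's inner product with the query is estimated from sampled coordinates, and candidates whose upper confidence bound falls below the best lower bound are eliminated, with the few survivors resolved exactly. The confidence bounds and data are illustrative assumptions.

```python
# A minimal sketch, assuming uniform coordinate sampling and heuristic
# normal-approximation bounds: successive elimination for MIPS.
import numpy as np

def bandit_mips(Q, X, batch=32, z=3.0, max_pulls=None):
    n, d = X.shape
    max_pulls = max_pulls or d
    rng = np.random.default_rng(0)
    alive = np.arange(n)
    est = {i: [] for i in alive}
    pulls = 0
    while len(alive) > 1 and pulls < max_pulls:
        coords = rng.integers(0, d, size=batch)    # sample a few coordinates
        pulls += batch
        for i in alive:
            # d * q_j * x_ij is an unbiased one-coordinate estimate of q.x_i.
            est[i].extend(d * Q[coords] * X[i, coords])
        lo, hi = {}, {}
        for i in alive:
            s = np.asarray(est[i])
            half = z * s.std(ddof=1) / np.sqrt(s.size)
            lo[i], hi[i] = s.mean() - half, s.mean() + half
        best_lo = max(lo.values())
        # Eliminate arms whose UCB falls below the best arm's LCB.
        alive = np.array([i for i in alive if hi[i] >= best_lo])
    # Resolve the remaining candidates with exact inner products.
    return alive[np.argmax(X[alive] @ Q)]

rng = np.random.default_rng(7)
X = rng.normal(size=(2_000, 512))
Q = rng.normal(size=512)
print("bandit pick:", bandit_mips(Q, X), "| exact argmax:", int(np.argmax(X @ Q)))
```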