2,015 research outputs found

    VerdictDB: Universalizing Approximate Query Processing

    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases, an unrealistic expectation given the infancy of AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus can work with all off-the-shelf engines. Operating at the driver level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under the Apache License. Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 2018.
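
    As an illustration of the middleware idea described above, the following sketch (not VerdictDB's actual code) rewrites a SUM query so that it runs against a pre-built uniform sample table and returns enough pieces to compute a scaled-up estimate with a normal-approximation error bound. The sample table name, the 1% sampling ratio, and the use of SQLite as a stand-in engine are all hypothetical.

```python
import math
import sqlite3

SAMPLE_RATIO = 0.01  # assume the sample table holds roughly 1% of the base table's rows

def rewrite_sum(column: str, sample_table: str) -> str:
    # Rewrite "SELECT SUM(col) FROM base_table" into a query over the sample table
    # that returns sum, count, and sum of squares, so the middleware can compute
    # both a scaled-up estimate and a normal-approximation error bound.
    return (f"SELECT SUM({column}), COUNT({column}), SUM({column} * {column}) "
            f"FROM {sample_table}")

def approximate_sum(conn, column, sample_table):
    s, n, ss = conn.execute(rewrite_sum(column, sample_table)).fetchone()
    estimate = s / SAMPLE_RATIO                     # scale the sample sum up to the full table
    row_var = ss / n - (s / n) ** 2                 # per-row variance within the sample
    stderr = math.sqrt(n * row_var) / SAMPLE_RATIO  # std. error of the scaled sum
    return estimate, 1.96 * stderr                  # point estimate and ~95% error bound

# toy demonstration with an in-memory SQLite database standing in for the backend engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_sample (price REAL)")
conn.executemany("INSERT INTO orders_sample VALUES (?)",
                 [(float(i % 50),) for i in range(1000)])
print(approximate_sum(conn, "price", "orders_sample"))
```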

    Knowing when you're wrong: Building fast and reliable approximate query processing systems

    Modern data analytics applications typically process massive amounts of data on clusters of tens, hundreds, or thousands of machines to support near-real-time decisions. The quantity of data and limitations of disk and memory bandwidth often make it infeasible to deliver answers at interactive speeds. However, it has been widely observed that many applications can tolerate some degree of inaccuracy. This is especially true for exploratory queries on data, where users are satisfied with "close-enough" answers if they come quickly. A popular technique for speeding up queries at the cost of accuracy is to execute each query on a sample of the data rather than the whole dataset. To ensure that the returned result is not too inaccurate, past work on approximate query processing has used statistical techniques to estimate "error bars" on returned results. However, existing work in the sampling-based approximate query processing (S-AQP) community has not validated whether these techniques actually generate accurate error bars for real query workloads. In fact, we find that error bar estimation often fails on real-world production workloads. Fortunately, it is possible to quickly and accurately diagnose the failure of error estimation for a query. In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds. Funding and support: National Science Foundation (U.S.) (CISE Expeditions Award CCF-1139158); Lawrence Berkeley National Laboratory (Award 7076018); United States Defense Advanced Research Projects Agency (XData Award FA8750-12-2-0331); Amazon.com (Firm); Google (Firm); SAP Corporation; Thomas and Stacey Siebel Foundation; Apple Computer, Inc.; Cisco Systems, Inc.; Cloudera, Inc.; EMC Corporation; Ericsson, Inc.; Facebook (Firm).
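
    The following toy sketch (not the paper's actual pipeline or diagnostic) illustrates the two ingredients discussed above: a closed-form CLT error bar for a sampled AVG query, and a cheap bootstrap-based check that flags cases where that error bar may be unreliable. The data, tolerance, and bootstrap size are invented for illustration.

```python
import numpy as np

def avg_with_error_bar(sample, confidence_z=1.96):
    # sample-based AVG with a normal-approximation (CLT) half-width
    est = sample.mean()
    stderr = sample.std(ddof=1) / np.sqrt(len(sample))
    return est, confidence_z * stderr

def error_bar_looks_reliable(sample, n_boot=200, tolerance=0.5, rng=None):
    # Compare the analytic standard error against a bootstrap standard error;
    # a large disagreement (heavy tails, very small samples) is a warning sign.
    rng = rng or np.random.default_rng(0)
    analytic = sample.std(ddof=1) / np.sqrt(len(sample))
    boots = [rng.choice(sample, size=len(sample), replace=True).mean()
             for _ in range(n_boot)]
    return abs(np.std(boots) - analytic) <= tolerance * analytic

data = np.random.default_rng(1).exponential(scale=10.0, size=5_000)  # skewed toy data
print(avg_with_error_bar(data), error_bar_looks_reliable(data))
```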

    Early Accurate Results for Advanced Analytics on MapReduce

    Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in MapReduce-oriented systems, although these are intended for 'big data'. Therefore, we proposed and implemented a non-parametric extension of Hadoop which allows the incremental computation of early results for arbitrary workflows, along with reliable online estimates of the degree of accuracy achieved so far in the computation. These estimates are based on a technique called bootstrapping that has been widely employed in statistics and can be applied to arbitrary functions and data distributions. In this paper, we describe our Early Accurate Result Library (EARL) for Hadoop, which was designed to minimize the changes required to the MapReduce framework. Various tests of EARL on Hadoop are presented to characterize the frequent situations where EARL can provide major speed-ups over the current version of Hadoop. Comment: VLDB 2012.
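
    A minimal sketch of the incremental-plus-bootstrap idea described above, written in plain Python rather than as a Hadoop/MapReduce extension: the aggregate is recomputed as more data is consumed, and a bootstrap relative-error estimate decides when the early result is accurate enough. All parameter values and the input stream are invented.

```python
import numpy as np

def early_result(stream, increment=1_000, target_rel_error=0.01, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    seen = np.empty(0)
    rel_error = float("inf")
    for start in range(0, len(stream), increment):
        # consume the next increment of data
        seen = np.concatenate([seen, stream[start:start + increment]])
        # bootstrap the aggregate on the data seen so far to estimate accuracy
        boots = [rng.choice(seen, size=len(seen), replace=True).mean()
                 for _ in range(n_boot)]
        rel_error = np.std(boots) / abs(np.mean(seen))
        if rel_error <= target_rel_error:
            break  # the early result is already accurate enough
    return np.mean(seen), rel_error, len(seen)

stream = np.random.default_rng(42).lognormal(mean=3.0, sigma=1.0, size=200_000)
print(early_result(stream))
```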

    Database Learning: Toward a Database that Becomes Smarter Every Time

    In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query, because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea of learning from past query answers Database Learning. We exploit the principle of maximum entropy to produce answers which, in expectation, are guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems. Comment: This manuscript is an extended report of the work published in the ACM SIGMOD conference, 2017.
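
    The paper's maximum-entropy formulation is not reproduced here; the following heavily simplified sketch only illustrates the underlying intuition that a past approximate answer can sharpen a new sample-based answer to the same quantity, using plain inverse-variance weighting. The numbers are invented.

```python
def combine(past_est, past_var, new_est, new_var):
    # Inverse-variance weighting of two unbiased estimates of the same quantity:
    # the combined variance is always smaller than either input variance.
    w_past, w_new = 1.0 / past_var, 1.0 / new_var
    est = (w_past * past_est + w_new * new_est) / (w_past + w_new)
    var = 1.0 / (w_past + w_new)
    return est, var

# e.g. a past answer with variance 4.0 and a fresh sampled answer with variance 9.0
print(combine(102.0, 4.0, 98.0, 9.0))
```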

    The substantive and practical significance of citation impact differences between institutions: Guidelines for the analysis of percentiles using effect sizes and confidence intervals

    In our chapter we address the statistical analysis of percentiles: how should the citation impact of institutions be compared? In educational and psychological testing, percentiles are already used widely as a standard to evaluate an individual's test scores (intelligence tests, for example) by comparing them with the percentiles of a calibrated sample. Percentiles, or percentile rank classes, are also a very suitable method for bibliometrics to normalize citations of publications in terms of the subject category and the publication year and, unlike mean-based indicators (relative citation rates), percentiles are scarcely affected by skewed distributions of citations. The percentile of a certain publication provides information about the citation impact this publication has achieved in comparison to other similar publications in the same subject category and publication year. Analyses of percentiles, however, have not always been presented in the most effective and meaningful way. New APA guidelines (American Psychological Association, 2010) suggest a lesser emphasis on significance tests and a greater emphasis on the substantive and practical significance of findings. Drawing on work by Cumming (2012), we show how examinations of effect sizes (e.g., Cohen's d statistic) and confidence intervals can lead to a clear understanding of citation impact differences.
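
    A small sketch of the style of analysis recommended above: reporting an effect size (Cohen's d) and a confidence interval for the difference in mean percentile ranks between two institutions, instead of relying only on a significance test. The percentile data below are invented toy values, not real institutional data.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    # standardized mean difference using the pooled standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

def diff_confidence_interval(a, b, level=0.95):
    # t-based confidence interval for the difference in mean percentile ranks
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    dof = len(a) + len(b) - 2
    t = stats.t.ppf(0.5 + level / 2, dof)
    return diff - t * se, diff + t * se

rng = np.random.default_rng(0)
inst_a = rng.uniform(0, 100, size=400)   # percentile ranks of institution A's papers (toy)
inst_b = rng.uniform(0, 100, size=350)   # percentile ranks of institution B's papers (toy)
print(cohens_d(inst_a, inst_b), diff_confidence_interval(inst_a, inst_b))
```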

    The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

    In this paper we retrace the recent history of statistics by analyzing all the papers published in five prestigious statistical journals since 1970, namely: Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, Series B, and Statistical Science. The aim is to construct a kind of "taxonomy" of the statistical papers by organizing and clustering them into main themes. In this sense, being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of statistical research. Since the main statistical research topics naturally arise, evolve, or die over time, we also develop a dynamic clustering strategy, where a group in one time period is allowed to migrate or merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.
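
    An illustrative sketch (not the authors' actual pipeline) of the static building block behind such a taxonomy: representing papers by TF-IDF vectors of their titles and clustering them into themes. A dynamic variant would repeat this per time period and match clusters across consecutive periods. The paper titles below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# placeholder titles standing in for the real corpus of journal papers
papers = [
    "asymptotic properties of maximum likelihood estimators",
    "bayesian nonparametric mixture models for density estimation",
    "bootstrap confidence intervals for regression parameters",
    "markov chain monte carlo for hierarchical bayesian models",
    "variable selection with the lasso in high dimensional regression",
    "false discovery rate control in multiple hypothesis testing",
]

X = TfidfVectorizer(stop_words="english").fit_transform(papers)      # TF-IDF features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # theme labels
for paper, label in zip(papers, labels):
    print(label, paper)
```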

    A Reliability Case Study on Estimating Extremely Small Percentiles of Strength Data for the Continuous Improvement of Medium Density Fiberboard Product Quality

    The objective of this thesis is to better estimate extremely small percentiles of strength distributions for measuring the failure process in continuous improvement initiatives. These percentiles are of great interest for companies, oversight organizations, and consumers concerned with product safety and reliability. The thesis investigates the lower percentiles for the quality of medium density fiberboard (MDF). The international industrial standard for measuring MDF quality is internal bond (IB, a tensile strength test). The results of the thesis indicate that the smaller percentiles are crucial, especially the first percentile and lower ones. The thesis starts by introducing the background, study objectives, and previous work in the area of MDF reliability. It also reviews key components of total quality management (TQM) principles, strategies for reliability data analysis and modeling, information and data quality philosophy, and the data preparation steps used in the research study.

    Like many real-world cases, the internal bond data in material failure analysis do not perfectly follow the normal distribution. There was evidence from the study to suggest that MDF has potentially different failure modes for early failures. Forcing the normality assumption may lead to inaccurate predictions and poor product quality. We introduce a novel forced censoring technique that more closely fits the lower tails of strength distributions, where these smaller percentiles are impacted most. In this thesis, the forced censoring technique is implemented as a software module, using the JMP® Scripting Language (JSL), to expedite data processing, which is key for real-time manufacturing settings. Results show that the Weibull distribution models the data best and provides percentile estimates that are neither too conservative nor too risky. Further analyses are performed to build an accelerated common-shape Weibull model for these two product types using the JMP® Survival and Reliability platform. The use of the JMP® Scripting Language helps to automate the task of fitting an accelerated Weibull model and testing model homogeneity in the shape parameter. At the end of the modeling stage, a package script is written to readily provide field engineers with customized reporting for model visualization, parameter estimation, and percentile forecasting.

    Furthermore, using the powerful tools of SPLIDA and S-PLUS, bootstrap estimates of the small percentiles demonstrate the improved intervals obtained by our forced censoring approach and the fitted model, including the common-shape assumption. Additionally, relatively more advanced Bayesian methods are employed to predict the low percentiles of this particular product type, which has a rather limited number of observations. Model interpretability, cross-validation strategy, result comparisons, and habitual assessment of practical significance are particularly stressed and exercised throughout the thesis. Overall, the approach in the thesis is parsimonious and suitable for real-time manufacturing settings. It follows a consistent strategy in statistical analysis, which leads to more accuracy for product conformance evaluation. Such an approach may also potentially reduce the cost of destructive testing and data management due to reduced frequency of testing. If adopted, the approach may prevent field failures and improve product safety. The philosophy and analytical methods presented in the thesis also apply to other strength distributions and lifetime data.
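
    An illustrative Python sketch of the core calculation described above; the thesis itself works in the JMP Scripting Language and SPLIDA/S-PLUS, so none of this is the thesis's code. It fits a two-parameter Weibull distribution to internal-bond strength data and bootstraps a percentile-method confidence interval for the first percentile. The strength data are simulated placeholders.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# simulated internal-bond (IB) strength measurements standing in for real test data
ib = stats.weibull_min.rvs(c=4.0, scale=120.0, size=300, random_state=rng)

def first_percentile(sample):
    # fit a two-parameter Weibull (location fixed at zero) and return its 1st percentile
    shape, loc, scale = stats.weibull_min.fit(sample, floc=0)
    return stats.weibull_min.ppf(0.01, shape, loc=loc, scale=scale)

point = first_percentile(ib)
boots = [first_percentile(rng.choice(ib, size=len(ib), replace=True))
         for _ in range(200)]                       # refit on bootstrap resamples
lo, hi = np.percentile(boots, [2.5, 97.5])          # percentile-bootstrap 95% interval
print(point, (lo, hi))
```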