595 research outputs found

    VerdictDB: Universalizing Approximate Query Processing

    Full text link
    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus, can work with all off-the-shelf engines. Operating at the driver-level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171×\times speedup (18.45×\times on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache License.Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 201

    A data warehouse environment for storing and analyzing simulation output data

    Get PDF
    Discrete event simulation modelling has been extensively used in modelling complex systems. Although it offers great conceptual-modelling flexibility, it is both computationally expensive and data intensive. There are several examples of simulation models that generate millions of observations to achieve satisfactory point and confidence interval estimations for the model variables. In these cases, it is exceptionally cumbersome to conduct the required output and sensitivity analysis in a spreadsheet or statistical package. In this paper, we highlight the advantages of employing data warehousing techniques for storing and analyzing simulation output data. The proposed data warehouse environment is capable of providing the means for automating the necessary algorithms and procedures for estimating different parameters of the simulation. These include initial transient in steady-state simulations and point and confidence interval estimations. Previously developed models for evaluating patient flow through hospital epartments are used to demonstrate the problem and the proposed solutions

    Novel Selectivity Estimation Strategy for Modern DBMS

    Full text link
    Selectivity estimation is important in query optimization, however accurate estimation is difficult when predicates are complex. Instead of existing database synopses and statistics not helpful for such cases, we introduce a new approach to compute the exact selectivity by running an aggregate query during the optimization phase. Exact selectivity can be achieved without significant overhead for in-memory and GPU-accelerated databases by adding extra query execution calls. We implement a selection push-down extension based on the novel selectivity estimation strategy in the MapD database system. Our approach records constant and less than 30 millisecond overheads in any circumstances while running on GPU. The novel strategy successfully generates better query execution plans which result in performance improvement up to 4.8 times from TPC-H benchmark SF-50 queries and 7.3 times from star schema benchmark SF-80 queries
    • …
    corecore