VerdictDB: Universalizing Approximate Query Processing
Despite 25 years of research in academia, approximate query processing (AQP)
has had little industrial adoption. One of the major causes of this slow
adoption is the reluctance of traditional vendors to make radical changes to
their legacy codebases, and the preoccupation of newer vendors (e.g.,
SQL-on-Hadoop products) with implementing standard features. Additionally, the
few AQP engines that are available are each tied to a specific platform and
require users to completely abandon their existing databases---an unrealistic
expectation given the infancy of the AQP technology. Therefore, we argue that a
universal solution is needed: a database-agnostic approximation engine that
will widen the reach of this emerging technology across various platforms.
Our proposal, called VerdictDB, uses a middleware architecture that requires
no changes to the backend database, and thus, can work with all off-the-shelf
engines. Operating at the driver-level, VerdictDB intercepts analytical queries
issued to the database and rewrites them into another query that, if executed
by any standard relational engine, will yield sufficient information for
computing an approximate answer. VerdictDB uses the returned result set to
compute an approximate answer and error estimates, which are then passed on to
the user or application. However, lack of access to the query execution layer
introduces significant challenges in terms of generality, correctness, and
efficiency. This paper shows how VerdictDB overcomes these challenges and
delivers up to 171× speedup (18.45× on average) for a variety of
existing engines, such as Impala, Spark SQL, and Amazon Redshift, while
incurring less than 2.6% relative error. VerdictDB is open-sourced under the
Apache License.
Comment: Extended technical report of the paper that appeared in Proceedings
of the 2018 International Conference on Management of Data, pp. 1461-1476.
ACM, 201
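The rewrite-then-estimate idea in the abstract above can be illustrated with a minimal sketch. It is not VerdictDB's actual rewriting logic: it assumes a pre-built Bernoulli sample table and a Horvitz-Thompson style error estimate, and the names `orders`, `orders_sample`, `build_demo`, and `approximate_sum` are hypothetical. The point is only that the rewritten query is ordinary SQL, so the scaling and the error estimate can be computed outside the engine.

```python
import math
import random
import sqlite3


def build_demo(conn, n_rows=20000, sample_rate=0.1, seed=0):
    """Create a hypothetical base table and a Bernoulli sample of it."""
    rng = random.Random(seed)
    conn.execute("CREATE TABLE orders (price REAL)")
    conn.execute("CREATE TABLE orders_sample (price REAL)")
    rows = [(rng.uniform(1.0, 100.0),) for _ in range(n_rows)]
    conn.executemany("INSERT INTO orders VALUES (?)", rows)
    # Each row enters the sample independently with probability sample_rate.
    sample = [r for r in rows if rng.random() < sample_rate]
    conn.executemany("INSERT INTO orders_sample VALUES (?)", sample)


def approximate_sum(conn, sample_rate):
    """Answer SELECT SUM(price) FROM orders using only the sample table."""
    # The "rewritten" query is plain SQL that any relational engine can run.
    total, total_sq = conn.execute(
        "SELECT SUM(price), SUM(price * price) FROM orders_sample"
    ).fetchone()
    # Scale the sample sum up by the inverse sampling probability.
    estimate = total / sample_rate
    # Variance estimate for Bernoulli sampling (an assumption of this sketch):
    # Var ~= (1 - p) / p^2 * sum of squared sampled values.
    std_error = math.sqrt((1.0 - sample_rate) * total_sq) / sample_rate
    return estimate, std_error
```

A caller would compare `estimate` against the exact answer only for validation; in normal use, only the sample table is scanned.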
A data warehouse environment for storing and analyzing simulation output data
Discrete event simulation modelling has been extensively used in modelling
complex systems. Although it offers great conceptual-modelling flexibility, it
is both computationally expensive and data intensive. There are several
examples of simulation models that generate millions of observations to achieve
satisfactory point and confidence interval estimations for the model variables.
In these cases, it is exceptionally cumbersome to conduct the required output
and sensitivity analysis in a spreadsheet or statistical package. In this
paper, we highlight the advantages of employing data warehousing techniques for
storing and analyzing simulation output data. The proposed data warehouse
environment is capable of providing the means for automating the necessary
algorithms and procedures for estimating different parameters of the
simulation. These include the initial transient in steady-state simulations and
point and confidence interval estimations. Previously developed models for
evaluating patient flow through hospital departments are used to demonstrate
the problem and the proposed solutions.
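As a rough illustration of the output analysis mentioned above, the sketch below discards an initial transient and computes a point estimate with a confidence interval using batch means. The abstract does not say which procedures the authors automate, so batch means, the fixed `warmup` cutoff, and the normal approximation for the interval are all assumptions of this example; `batch_means_ci` is a hypothetical name.

```python
import statistics


def batch_means_ci(observations, warmup, n_batches=20, confidence=0.95):
    """Return (point, lower, upper) for the steady-state mean.

    observations: output values from one long simulation run
    warmup: number of leading observations treated as initial transient
    """
    # Drop the initial transient so start-up bias does not affect the estimate.
    steady = observations[warmup:]
    batch_size = len(steady) // n_batches
    if batch_size == 0:
        raise ValueError("not enough observations after the warm-up period")
    # Average within each batch; batch means are far less correlated
    # than the raw observations.
    means = [
        statistics.fmean(steady[i * batch_size:(i + 1) * batch_size])
        for i in range(n_batches)
    ]
    point = statistics.fmean(means)
    std_error = statistics.stdev(means) / n_batches ** 0.5
    # Normal quantile used for simplicity; a Student-t quantile would give
    # a slightly wider interval for a small number of batches.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return point, point - z * std_error, point + z * std_error
```

In a warehouse setting, the `observations` list would typically come from a query against the stored simulation output rather than from memory.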
Novel Selectivity Estimation Strategy for Modern DBMS
Selectivity estimation is important in query optimization; however, accurate
estimation is difficult when predicates are complex. Because existing
database synopses and statistics are not helpful in such cases, we introduce a new
approach to compute the exact selectivity by running an aggregate query during
the optimization phase. Exact selectivity can be achieved without significant
overhead for in-memory and GPU-accelerated databases by adding extra query
execution calls. We implement a selection push-down extension based on the
novel selectivity estimation strategy in the MapD database system. Our approach
incurs a constant overhead of less than 30 milliseconds in all circumstances
while running on GPU. The novel strategy successfully generates better query
execution plans, which result in performance improvements of up to 4.8 times on
TPC-H benchmark SF-50 queries and 7.3 times on star schema benchmark SF-80
queries.
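The core idea of the abstract above, obtaining an exact selectivity by running an aggregate query at optimization time, can be sketched as follows. This is not the MapD implementation: it uses SQLite for portability, and `exact_selectivity`, `should_push_down`, and the 0.5 threshold are hypothetical choices made only for illustration.

```python
import sqlite3


def exact_selectivity(conn, table, predicate):
    """Fraction of rows in `table` that satisfy `predicate`.

    `table` and `predicate` are interpolated directly into the SQL text,
    so this sketch assumes they come from a trusted planner, not user input.
    """
    matched, total = conn.execute(
        f"SELECT SUM(CASE WHEN {predicate} THEN 1 ELSE 0 END), COUNT(*) "
        f"FROM {table}"
    ).fetchone()
    if not total:
        return 0.0
    # SUM returns NULL (None) when no row matches the CASE branch at all.
    return (matched or 0) / total


def should_push_down(selectivity, threshold=0.5):
    """Hypothetical rule: push the filter below the join when it removes
    at least half of the rows."""
    return selectivity < threshold
```

The extra aggregate query is only worthwhile when it is cheap relative to the plan it improves, which is the case the abstract argues for in-memory and GPU-accelerated engines.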