VerdictDB: Universalizing Approximate Query Processing
Despite 25 years of research in academia, approximate query processing (AQP)
has had little industrial adoption. One of the major causes of this slow
adoption is the reluctance of traditional vendors to make radical changes to
their legacy codebases, and the preoccupation of newer vendors (e.g.,
SQL-on-Hadoop products) with implementing standard features. Additionally, the
few AQP engines that are available are each tied to a specific platform and
require users to completely abandon their existing databases, an unrealistic
expectation given the infancy of the AQP technology. Therefore, we argue that a
universal solution is needed: a database-agnostic approximation engine that
will widen the reach of this emerging technology across various platforms.
Our proposal, called VerdictDB, uses a middleware architecture that requires
no changes to the backend database, and thus, can work with all off-the-shelf
engines. Operating at the driver-level, VerdictDB intercepts analytical queries
issued to the database and rewrites them into another query that, if executed
by any standard relational engine, will yield sufficient information for
computing an approximate answer. VerdictDB uses the returned result set to
compute an approximate answer and error estimates, which are then passed on to
the user or application. However, lack of access to the query execution layer
introduces significant challenges in terms of generality, correctness, and
efficiency. This paper shows how VerdictDB overcomes these challenges and
delivers up to 171× speedup (18.45× on average) for a variety of
existing engines, such as Impala, Spark SQL, and Amazon Redshift, while
incurring less than 2.6% relative error. VerdictDB is open-sourced under the
Apache License.
Comment: Extended technical report of the paper that appeared in Proceedings
of the 2018 International Conference on Management of Data, pp. 1461-1476.
ACM, 201
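
The middleware idea in the abstract above (rewrite the query, run it on a standard engine, then compute the answer and its error outside the database) can be illustrated with a minimal sketch. This is not VerdictDB's actual rewriting or error-estimation method. It assumes a simple Bernoulli sample stored as an ordinary table and a textbook Horvitz-Thompson estimate for SUM. The function names, the table names, and the use of SQLite as the backend are all hypothetical choices made for illustration.

```python
import math
import random
import sqlite3


def build_sample(conn, table, sample_table, ratio, seed=0):
    """Store a Bernoulli sample of `table` as an ordinary backend table.

    Assumption: rows are sampled on the client with a seeded RNG so the
    example is reproducible; a real system would sample inside the database.
    """
    rng = random.Random(seed)
    cursor = conn.execute(f"SELECT * FROM {table}")
    ncols = len(cursor.description)
    kept = [row for row in cursor.fetchall() if rng.random() < ratio]
    # Copy the schema only (WHERE 0 selects no rows), then load the sample.
    conn.execute(f"CREATE TABLE {sample_table} AS SELECT * FROM {table} WHERE 0")
    placeholders = ",".join("?" * ncols)
    conn.executemany(f"INSERT INTO {sample_table} VALUES ({placeholders})", kept)


def rewrite_sum(sample_table, column):
    """Rewrite SELECT SUM(column) into plain SQL over the sample table."""
    return f"SELECT SUM({column}), SUM({column} * {column}) FROM {sample_table}"


def approximate_sum(conn, sample_table, column, ratio):
    """Run the rewritten query, then derive estimate and error in the middleware."""
    s1, s2 = conn.execute(rewrite_sum(sample_table, column)).fetchone()
    s1, s2 = s1 or 0.0, s2 or 0.0  # SUM over an empty sample returns NULL
    estimate = s1 / ratio  # scale the sample sum up to the full table
    # Standard error of the scaled sum under Bernoulli sampling.
    std_error = math.sqrt((1.0 - ratio) * s2) / ratio
    return estimate, std_error
```

The backend only ever sees ordinary SQL. The scaling and the error estimate happen on the returned result set, which is the division of labor the abstract describes.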
Hillview: A trillion-cell spreadsheet for big data
Hillview is a distributed spreadsheet for browsing very large datasets that
cannot be handled by a single machine. As a spreadsheet, Hillview provides a
high degree of interactivity that permits data analysts to explore information
quickly along many dimensions while switching visualizations on a whim. To
provide the required responsiveness, Hillview introduces visualization
sketches, or vizketches, as a simple idea to produce compact data
visualizations. Vizketches combine algorithmic techniques for data
summarization with computer graphics principles for efficient rendering. While
simple, vizketches are effective at scaling the spreadsheet by parallelizing
computation, reducing communication, providing progressive visualizations, and
offering precise accuracy guarantees. Using Hillview running on eight servers,
we can navigate and visualize datasets of tens of billions of rows and
trillions of cells, much beyond the published capabilities of competing
systems.
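
The abstract above says vizketches parallelize computation and reduce communication. One way to picture this is a mergeable histogram summary. The following is a minimal sketch under stated assumptions, not Hillview's implementation: each worker summarizes its own partition into a fixed number of buckets (in practice bounded by what the screen can display, which is an assumption here), and only those small count arrays are combined. All function names are hypothetical.

```python
from functools import reduce


def histogram_sketch(values, lo, hi, buckets):
    """Summarize one partition as a fixed-size list of bucket counts."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bucket.
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
    return counts


def merge(a, b):
    """Combine two summaries; the result has the same fixed size."""
    return [x + y for x, y in zip(a, b)]


def distributed_histogram(partitions, lo, hi, buckets):
    """Summarize each partition independently, then merge the summaries."""
    sketches = (histogram_sketch(p, lo, hi, buckets) for p in partitions)
    return reduce(merge, sketches, [0] * buckets)
```

Because `merge` is associative and commutative, partial results can be combined in any order as workers finish, which is one way a display could be refined progressively.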