Subset Queries in Relational Databases
In this paper, we motivated the need for relational database systems to
support subset query processing. We defined new operators in relational
algebra, and new constructs in SQL for expressing subset queries. We also
illustrated the applicability of subset queries through different examples
expressed using extended SQL statements and relational algebra expressions. Our
aim is to show the utility of subset queries for next-generation applications.
Comment: 15 pages
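As a rough illustration of what a subset query computes (an answer is a set of relations, not a set of tuples), here is a brute-force sketch in Python; the item data and predicate are invented for the example and are not from the paper:

```python
from itertools import combinations

def subset_query(rows, predicate):
    """Return every non-empty subset of `rows` whose tuples jointly
    satisfy `predicate`. A subset query's answer is a set of relations
    (subsets), not a set of tuples as in ordinary SQL."""
    result = []
    for r in range(1, len(rows) + 1):
        for subset in combinations(rows, r):
            if predicate(subset):
                result.append(subset)
    return result

# Invented example: (name, price) items; find subsets whose total price is 10.
items = [("a", 4), ("b", 6), ("c", 10)]
exact = subset_query(items, lambda s: sum(p for _, p in s) == 10)
print(exact)   # [(('c', 10),), (('a', 4), ('b', 6))]
```

The exponential enumeration is exactly why the paper argues for dedicated operators and engine support rather than client-side evaluation.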
Data Mining-based Materialized View and Index Selection in Data Warehouses
Materialized views and indexes are physical structures for accelerating data
access that are commonly used in data warehouses. However, these data
structures generate some maintenance overhead. They also share the same storage
space. Most existing studies about materialized view and index selection
consider these structures separately. In this paper, we adopt the opposite
stance and couple materialized view and index selection to take view-index
interactions into account and achieve efficient storage space sharing.
Candidate materialized views and indexes are selected through a data mining
process. We also exploit cost models that evaluate the respective benefit of
indexing and view materialization, and help select a relevant configuration of
indexes and materialized views among the candidates. Experimental results show
that our strategy performs better than an independent selection of materialized
views and indexes.
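A minimal sketch of the coupled selection idea, substituting a simple benefit-density greedy for the paper's data-mining-driven process; all candidate names and numbers are illustrative:

```python
# Hypothetical candidate structures: each has an estimated query-time
# benefit and a storage cost (numbers invented for the example).
candidates = [
    {"name": "mv_sales_by_day",  "kind": "view",  "benefit": 900, "size": 300},
    {"name": "idx_sales_date",   "kind": "index", "benefit": 400, "size": 50},
    {"name": "mv_top_customers", "kind": "view",  "benefit": 500, "size": 400},
    {"name": "idx_cust_region",  "kind": "index", "benefit": 150, "size": 30},
]

def select_structures(candidates, budget):
    """Greedy selection by benefit density (benefit per unit of storage),
    letting views and indexes compete for the same storage budget."""
    chosen, used = [], 0
    for c in sorted(candidates, key=lambda c: c["benefit"] / c["size"],
                    reverse=True):
        if used + c["size"] <= budget:
            chosen.append(c["name"])
            used += c["size"]
    return chosen

print(select_structures(candidates, budget=400))
# ['idx_sales_date', 'idx_cust_region', 'mv_sales_by_day']
```

Treating both structure kinds in one budget is the point: a cheap index can crowd out a bulky view when its benefit per byte is higher.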
Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views
Materialized views (MVs), stored pre-computed results, are widely used to
facilitate fast queries on large datasets. When new records arrive at a high
rate, it is infeasible to continuously update (maintain) MVs and a common
solution is to defer maintenance by batching updates together. Between batches
the MVs become increasingly stale, with incorrect, missing, and superfluous rows
leading to increasingly inaccurate query results. We propose Stale View
Cleaning (SVC) which addresses this problem from a data cleaning perspective.
In SVC, we efficiently clean a sample of rows from a stale MV, and use the
clean sample to estimate aggregate query results. While approximate, the
estimated query results reflect the most recent data. As sampling can be
sensitive to long-tailed distributions, we further explore an outlier indexing
technique to give increased accuracy when the data distributions are skewed.
SVC complements existing deferred maintenance approaches by giving accurate and
bounded query answers between maintenance batches. We evaluate our method on a
generated dataset from the TPC-D benchmark and a real video distribution
application. Experiments confirm our theoretical results: (1) cleaning an MV
sample is more efficient than full view maintenance, (2) the estimated results
are more accurate than using the stale MV, and (3) SVC is applicable for a wide
variety of MVs.
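The estimation step can be sketched as follows; this is a simplified stand-in for SVC's estimator (which additionally corrects for missing and superfluous rows and gives confidence bounds), with invented data:

```python
def estimate_sum(stale_mv, clean_row, k=10):
    """Clean every k-th row of a stale MV and scale the cleaned sum up --
    a sketch of sample-based stale-view cleaning."""
    sample = stale_mv[k // 2 :: k]          # deterministic systematic sample
    cleaned = [clean_row(r) for r in sample]
    return (len(stale_mv) / len(sample)) * sum(cleaned)

# Hypothetical stale MV whose values are each off by -1 versus fresh data.
stale = list(range(100))
fresh_sum = sum(v + 1 for v in stale)       # true fresh total: 5050
est = estimate_sum(stale, clean_row=lambda v: v + 1)
print(est)   # 5100.0 -- close to 5050 after cleaning only 10 of 100 rows
```

Only the sampled rows pay the cleaning cost, which is the source of the efficiency gain over full maintenance.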
LevelHeaded: Making Worst-Case Optimal Joins Work in the Common Case
Pipelines combining SQL-style business intelligence (BI) queries and linear
algebra (LA) are becoming increasingly common in industry. As a result, there
is a growing need to unify these workloads in a single framework.
Unfortunately, existing solutions either sacrifice the inherent benefits of
exclusively using a relational database (e.g. logical and physical
independence) or incur orders of magnitude performance gaps compared to
specialized engines (or both). In this work we study applying a new type of
query processing architecture to standard BI and LA benchmarks. To do this we
present a new in-memory query processing engine called LevelHeaded. LevelHeaded
uses worst-case optimal joins as its core execution mechanism for both BI and
LA queries. With LevelHeaded, we show how crucial optimizations for BI and LA
queries can be captured in a worst-case optimal query architecture. Using these
optimizations, LevelHeaded outperforms other relational database engines
(LogicBlox, MonetDB, and HyPer) by orders of magnitude on standard LA
benchmarks, while performing on average within 31% of the best-of-breed BI
(HyPer) and LA (Intel MKL) solutions on their own benchmarks. Our results show
that such a single query processing architecture is capable of delivering
competitive performance on both BI and LA queries.
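The core idea of a worst-case optimal (generic) join, binding one attribute at a time instead of joining two relations at a time, can be sketched for the triangle query; this is a toy illustration of the technique, not LevelHeaded's trie-based implementation:

```python
def triangles(R, S, T):
    """Attribute-at-a-time evaluation of the triangle query
    Q(a,b,c) :- R(a,b), S(b,c), T(a,c)."""
    out = []
    A = {a for a, _ in R} & {a for a, _ in T}           # bindings for a
    for a in sorted(A):
        B = {b for x, b in R if x == a} & {b for b, _ in S}
        for b in sorted(B):
            C = ({c for x, c in S if x == b}
                 & {c for x, c in T if x == a})
            for c in sorted(C):
                out.append((a, b, c))
    return out

# Edges of a small graph stored as binary relations.
E = [(1, 2), (2, 3), (1, 3)]
print(triangles(E, E, E))   # [(1, 2, 3)] -- the single triangle
```

On cyclic queries like this one, intersecting candidate sets per attribute avoids the blow-up that pairwise joins can suffer, which is what "worst-case optimal" refers to.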
Scaling-Up Reasoning and Advanced Analytics on BigData
BigDatalog is an extension of Datalog that achieves performance and
scalability on both Apache Spark and multicore systems to the point that its
graph analytics outperform those written in GraphX. Looking back, we see how
this realizes the ambitious goal pursued by deductive database researchers
beginning forty years ago: combining the rigor and power of logic in
expressing queries and reasoning with the performance and scalability with
which relational databases managed Big Data. This goal led to Datalog, which
is based on Horn clauses like Prolog but employs implementation techniques,
such as semi-naive fixpoint and magic sets, that extend the bottom-up
computation model of relational systems and thus attain the performance and
scalability that relational systems had achieved, as far back as the 1980s,
using data-parallelization on shared-nothing architectures. But this goal proved
difficult to achieve because of major issues at (i) the language level and (ii)
the system level. The paper describes how (i) was addressed by simple rules
under which the fixpoint semantics extends to programs using count, sum and
extrema in recursion, and (ii) was tamed by parallel compilation techniques
that achieve scalability on multicore systems and Apache Spark. This paper is
under consideration for acceptance in Theory and Practice of Logic Programming
(TPLP).
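The semi-naive fixpoint technique mentioned above can be sketched for transitive closure; this is a textbook illustration of the evaluation strategy, not BigDatalog's parallel implementation:

```python
def transitive_closure(edges):
    """Semi-naive bottom-up evaluation of
        tc(x, y) :- edge(x, y).
        tc(x, z) :- tc(x, y), edge(y, z).
    Only facts derived in the previous round (`delta`) are joined with
    `edge`, which is the key idea of semi-naive fixpoint evaluation."""
    tc = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - tc          # keep only genuinely new facts
        tc |= delta
    return tc

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Restricting each round's join to the delta is what makes bottom-up recursion competitive with the set-oriented execution of relational engines.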
Optimization of Imperative Programs in a Relational Database
For decades, RDBMSs have supported declarative SQL as well as imperative
functions and procedures as ways for users to express data processing tasks.
While the evaluation of declarative SQL has received a lot of attention
resulting in highly sophisticated techniques, the evaluation of imperative
programs has remained naive and highly inefficient. Imperative programs offer
several benefits over SQL and hence are often preferred and widely used. But
unfortunately, their abysmal performance discourages, and even prohibits, their
use in many situations. We address this important problem that has hitherto
received little attention.
We present Froid, an extensible framework for optimizing imperative programs
in relational databases. Froid's novel approach automatically transforms entire
User Defined Functions (UDFs) into relational algebraic expressions, and embeds
them into the calling SQL query. This form is now amenable to cost-based
optimization and results in efficient, set-oriented, parallel plans as opposed
to inefficient, iterative, serial execution of UDFs. Froid's approach
additionally brings the benefits of many compiler optimizations to UDFs with no
additional implementation effort. We describe the design of Froid and present
our experimental evaluation that demonstrates performance improvements of up to
multiple orders of magnitude on real workloads.
Comment: Extended version of the paper titled "FROID: Optimization of Imperative Programs in a Relational Database" in PVLDB 11(4), 2017. DOI: 10.1145/3164135.316414
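The transformation can be illustrated by hand with SQLite: a row-by-row scalar UDF versus the same logic inlined as a relational expression. The schema and UDF are invented for the example; Froid performs this rewrite automatically inside the database:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 50.0), (2, 150.0)])

# Imperative scalar UDF, evaluated row by row (opaque to the optimizer).
def discount(amount):
    if amount > 100:
        return amount * 0.9
    return amount

con.create_function("discount", 1, discount)
udf_rows = con.execute("SELECT id, discount(amount) FROM orders").fetchall()

# Froid-style rewrite (done by hand here): the UDF body becomes a single
# relational expression inlined into the calling query, so the optimizer
# can produce a set-oriented, parallelizable plan.
inlined_rows = con.execute(
    "SELECT id, CASE WHEN amount > 100 THEN amount * 0.9 ELSE amount END "
    "FROM orders"
).fetchall()

assert udf_rows == inlined_rows
print(inlined_rows)   # [(1, 50.0), (2, 135.0)]
```

Once the UDF body is relational algebra, it participates in cost-based optimization like any other subquery, which is the source of the reported speedups.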
Cache-based Multi-query Optimization for Data-intensive Scalable Computing Frameworks
In modern large-scale distributed systems, analytics jobs submitted by
various users often share similar work, for example scanning and processing the
same subset of data. Instead of optimizing jobs independently, which may result
in redundant and wasteful processing, multi-query optimization techniques can
be employed to save a considerable amount of cluster resources. In this work,
we introduce a novel method combining in-memory cache primitives and
multi-query optimization, to improve the efficiency of data-intensive, scalable
computing frameworks. By careful selection and exploitation of common
(sub)expressions, while satisfying memory constraints, our method transforms a
batch of queries into a new, more efficient one which avoids unnecessary
recomputations. To find feasible and efficient execution plans, our method uses
a cost-based optimization formulation akin to the multiple-choice knapsack
problem. Extensive experiments on a prototype implementation of our system show
significant benefits of worksharing for both TPC-DS workloads and detailed
micro-benchmarks.
Comment: 12 pages + references, extended version
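A simplified version of the cost-based selection can be sketched as a knapsack over candidate subexpressions; the paper's formulation is a multiple-choice knapsack over alternative plans, and all names and numbers here are illustrative:

```python
def choose_cached_subexprs(subexprs, memory_budget):
    """0/1-knapsack DP over candidate common subexpressions: maximize the
    recomputation cost saved by caching, subject to a memory budget."""
    # dp[m] = (best saving using <= m units of memory, chosen names)
    dp = [(0, [])] * (memory_budget + 1)
    for name, saving, mem in subexprs:
        for m in range(memory_budget, mem - 1, -1):
            cand = (dp[m - mem][0] + saving, dp[m - mem][1] + [name])
            if cand[0] > dp[m][0]:
                dp[m] = cand
    return dp[memory_budget]

# (name, saved recomputation cost, memory footprint) -- invented numbers.
subexprs = [("scan_logs", 60, 3), ("join_users", 100, 4), ("agg_daily", 120, 5)]
print(choose_cached_subexprs(subexprs, memory_budget=7))
# (160, ['scan_logs', 'join_users'])
```

Note that the single most beneficial subexpression (`agg_daily`) is not chosen: under the budget, two cheaper caches together save more, which is why a knapsack-style formulation beats per-item greedy caching.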
Sequences, yet Functions: The Dual Nature of Data-Stream Processing
Data-stream processing has continuously risen in importance as the amount of
available data has been steadily increasing over the last decade. Besides
traditional domains such as data-center monitoring and click analytics, there
is an increasing number of network-enabled production machines that generate
continuous streams of data. Due to their continuous nature, queries on
data-streams can be more complex, and distinctly harder to understand than
database queries. As users have to consider operational details, maintenance
and debugging become challenging. Current approaches model data-streams as
sequences, because this is the way they are physically received. These models
result in an implementation-focused perspective. We explore an alternate way of
modeling data-streams by focusing on time-slicing semantics. This focus results
in a model based on functions, which is better suited for reasoning about query
semantics. By adapting the definitions of relevant concepts in stream
processing to our model, we illustrate the practical usefulness of our
approach. Thereby, we link data-streams and query primitives to concepts in
functional programming and mathematics. Most noteworthy, we prove that
data-streams are monads, and show how to derive monad definitions for current
data-stream models. We provide an abstract, yet practical perspective on
data-stream related subjects based on a sound, consistent query model. Our work
can serve as a solid foundation for future data-stream query-languages.
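The function-based view of a stream admits a monad-like unit/bind structure, which can be sketched as follows; this is an informal illustration of the idea, not the paper's formal construction, and the sensor names are invented:

```python
# A data-stream modeled as a function from time to value, instead of as a
# sequence of arrivals.
def unit(value):
    """Constant stream: the same value at every point in time."""
    return lambda t: value

def bind(stream, f):
    """Feed the value of `stream` at time t into f, which returns a new
    stream; sample that resulting stream at the same time t."""
    return lambda t: f(stream(t))(t)

temperature = lambda t: 20 + t                 # hypothetical reading at time t
offset = lambda v: lambda t: v + 0.1 * t       # time-dependent correction

corrected = bind(temperature, offset)
print(corrected(10))   # (20 + 10) + 0.1 * 10 = 31.0
```

Because `corrected` is itself a function of time, composed queries stay declarative: no buffering, arrival order, or operational detail appears in the definition.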
Explaining Wrong Queries Using Small Examples
For testing the correctness of SQL queries, e.g., evaluating student
submissions in a database course, a standard practice is to execute the query
in question on some test database instance and compare its result with that of
the correct query. Given two queries Q1 and Q2, we say that a database
instance D is a counterexample (for Q1 and Q2) if Q1(D) differs from Q2(D);
such a counterexample can serve as an explanation of why Q1 and Q2 are not
equivalent. While the test database instance may serve as a counterexample,
it may be too large or complex to read and understand where the inequivalence
comes from. Therefore, in this paper, given a known counterexample D for Q1
and Q2, we aim to find the smallest counterexample D' ⊆ D where
Q1(D') ≠ Q2(D'). The problem in
general is NP-hard. We give a suite of algorithms for finding the smallest
counterexample for different classes of queries, some more tractable than
others. We also present an efficient provenance-based algorithm for SPJUD
queries that uses a constraint solver, and extend it to more complex queries
with aggregation, group-by, and nested queries. We perform extensive
experiments indicating the effectiveness and scalability of our solution on
student queries from an undergraduate database course and on queries from the
TPC-H benchmark. We also report a user study from the course where we deployed
our tool to help students with an assignment on relational algebra.
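A greedy shrinking loop conveys the flavor of counterexample minimization; the paper's algorithms are exact and provenance-based, whereas this sketch (with invented queries over a toy instance) only finds a locally minimal counterexample:

```python
def shrink_counterexample(db, q1, q2):
    """Greedily drop tuples while the two queries still disagree; the
    result is a (locally) minimal counterexample D' with q1(D') != q2(D')."""
    assert q1(db) != q2(db), "input must already be a counterexample"
    current = list(db)
    changed = True
    while changed:
        changed = False
        for row in list(current):
            smaller = [r for r in current if r != row]
            if smaller and q1(smaller) != q2(smaller):
                current = smaller
                changed = True
    return current

# Two toy "queries" that differ: q1 counts all rows, q2 only positive ones.
q1 = lambda db: len(db)
q2 = lambda db: sum(1 for v in db if v > 0)
print(shrink_counterexample([3, -1, 7, -2], q1, q2))   # [-2]
```

A one-tuple instance like `[-2]` makes the inequivalence immediately readable, which is the pedagogical point of minimizing the counterexample.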
DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views
Applications ranging from algorithmic trading to scientific data analysis
require realtime analytics based on views over databases that change at very
high rates. Such views have to be kept fresh at low maintenance cost and
latencies. At the same time, these views have to support classical SQL, rather
than window semantics, to enable applications that combine current with aged or
historical data. In this paper, we present viewlet transforms, a recursive
finite differencing technique applied to queries. The viewlet transform
materializes a query and a set of its higher-order deltas as views. These views
support each other's incremental maintenance, leading to a reduced overall view
maintenance cost. The viewlet transform of a query admits efficient evaluation,
the elimination of certain expensive query operations, and aggressive
parallelization. We develop viewlet transforms into a workable query execution
technique, present a heuristic and cost-based optimization framework, and
report on experiments with a prototype dynamic data management system that
combines viewlet transforms with an optimizing compilation technique. The
system supports tens of thousands of complete view refreshes a second for a
wide range of queries.
Comment: VLDB201
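The delta idea can be sketched for the simplest case: an incrementally maintained join-cardinality view whose per-key deltas are themselves materialized. This is an illustration of the principle with invented data, not DBToaster's compiled code:

```python
from collections import defaultdict

class JoinCountView:
    """Incrementally maintained view V = |R JOIN_k S| in the spirit of
    higher-order delta processing: the delta of V with respect to an
    insert into R is itself materialized (the per-key count of S), so
    refreshing V on update is O(1) instead of re-running the join."""
    def __init__(self):
        self.count_r = defaultdict(int)   # auxiliary delta view over R
        self.count_s = defaultdict(int)   # auxiliary delta view over S
        self.value = 0                    # the materialized top-level view

    def insert_r(self, k):
        self.value += self.count_s[k]     # delta_R V = count_s[k]
        self.count_r[k] += 1

    def insert_s(self, k):
        self.value += self.count_r[k]     # delta_S V = count_r[k]
        self.count_s[k] += 1

v = JoinCountView()
for k in ["a", "a", "b"]:
    v.insert_r(k)
for k in ["a", "b", "b"]:
    v.insert_s(k)
print(v.value)   # |R JOIN S| = 2*1 + 1*2 = 4
```

The auxiliary counts are the "higher-order" views: each level's maintenance is supported by the level below it, which is how per-update refresh stays cheap.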