666 research outputs found
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for the new types of data sources, query
languages, and approaches to query processing and optimization.Comment: SIGMOD'1
Multidimensional modeling and analysis of large and complex watercourse data: an OLAP-based solution
International audienceThis paper presents the application of Data Warehouse (DW) and On-Line Analytical Processing (OLAP) technologies to the field of water quality assessment. The European Water Framework Directive (DCE, 2000) underlined the necessity of having operational tools to help in the interpretation of the complex and abundant information regarding running waters and their functioning. Several studies have exemplified the interest in DWs for integrating large volumes of data and in OLAP tools for data exploration and analysis. Based on free software tools, we propose an extensible relational OLAP system for the analysis of physicochemical and hydrobiological watercourse data. This system includes: (i) two data cubes; (ii) an Extract, Transform and Load (ETL) tool for data integration; and (iii) tools for OLAP exploration. Many examples of OLAP analysis (thematic, temporal, spatiotemporal, and multiscale) are provided. We have extended an existing framework with complex aggregate functions that are used to define complex analysis indicators. Additional analysis dimensions are also introduced to allow their calculation and also for purposes of rendering information. Finally, we propose two strategies to address the problem of summarizing heterogeneous measurement units by: (i) transforming source data at the ETL tier, and (ii) introducing an additional analysis dimension at the OLAP server tier
REX: Recursive, Delta-Based Data-Centric Computation
In today's Web and social network environments, query workloads include ad
hoc and OLAP queries, as well as iterative algorithms that analyze data
relationships (e.g., link analysis, clustering, learning). Modern DBMSs support
ad hoc and OLAP queries, but most are not robust enough to scale to large
clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch
tasks across clusters in a fault tolerant way, but have too much overhead to
support ad hoc queries.
Moreover, both classes of platform incur significant overhead in executing
iterative data analysis algorithms. Most such iterative algorithms repeatedly
refine portions of their answers, until some convergence criterion is reached.
However, general cloud platforms typically must reprocess all data in each
step. DBMSs that support recursive SQL are more efficient in that they
propagate only the changes in each step -- but they still accumulate each
iteration's state, even if it is no longer useful. User-defined functions are
also typically harder to write for DBMSs than for cloud platforms.
We seek to unify the strengths of both styles of platforms, with a focus on
supporting iterative computations in which changes, in the form of deltas, are
propagated from iteration to iteration, and state is efficiently updated in an
extensible way. We present a programming model oriented around deltas, describe
how we execute and optimize such programs in our REX runtime system, and
validate that our platform also handles failures gracefully. We experimentally
validate our techniques, and show speedups over the competing methods ranging
from 2.5 to nearly 100 times.Comment: VLDB201
Kaskade: Graph Views for Efficient Graph Analytics
Graphs are an increasingly popular way to model real-world entities and
relationships between them, ranging from social networks to data lineage graphs
and biological datasets. Queries over these large graphs often involve
expensive subgraph traversals and complex analytical computations. These
real-world graphs are often substantially more structured than a generic
vertex-and-edge model would suggest, but this insight has remained mostly
unexplored by existing graph engines for graph query optimization purposes.
Therefore, in this work, we focus on leveraging structural properties of graphs
and queries to automatically derive materialized graph views that can
dramatically speed up query evaluation. We present KASKADE, the first graph
query optimization framework to exploit materialized graph views for query
optimization purposes. KASKADE employs a novel constraint-based view
enumeration technique that mines constraints from query workloads and graph
schemas, and injects them during view enumeration to significantly reduce the
search space of views to be considered. Moreover, it introduces a graph view
size estimator to pick the most beneficial views to materialize given a query
set and to select the best query evaluation plan given a set of materialized
views. We evaluate its performance over real-world graphs, including the
provenance graph that we maintain at Microsoft to enable auditing, service
analytics, and advanced system optimizations. Our results show that KASKADE
substantially reduces the effective graph size and yields significant
performance speedups (up to 50X), in some cases making otherwise intractable
queries possible
- …