Self-organizing strategies for a column-store database
Column-store database systems open new vistas for improved maintenance through self-organization. Individual columns are the focal point, which simplifies balancing conflicting requirements. This work presents two workload-driven self-organizing techniques in a column store: adaptive segmentation and adaptive replication. Adaptive segmentation splits a column into non-overlapping segments based on the actual query load. Likewise, adaptive replication creates segment replicas driven by the workload. The strategies can support different application requirements by trading off reorganization overhead against storage cost. Both techniques can significantly improve system performance, as demonstrated in an evaluation of different scenarios.
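To make the idea of workload-driven segmentation concrete, here is a minimal sketch in Python. It is not the paper's actual algorithm; it simply splits a column at the boundaries of observed query ranges, so every resulting segment is either fully inside or fully outside any query seen so far. Function and variable names are illustrative.

```python
# Workload-driven column segmentation: a minimal sketch (not the paper's
# actual algorithm). Each query touches a half-open row range [lo, hi);
# the column is split at every range endpoint so that each resulting
# segment is either fully covered or fully untouched by any observed query.

def adaptive_segments(column_len, query_ranges):
    """Return non-overlapping segments [start, end) covering the column,
    with boundaries induced by the observed query ranges."""
    cuts = {0, column_len}
    for lo, hi in query_ranges:
        cuts.add(max(0, lo))
        cuts.add(min(column_len, hi))
    bounds = sorted(cuts)
    return list(zip(bounds, bounds[1:]))

# Example: a 100-row column queried on rows [10, 40) and [30, 70)
segments = adaptive_segments(100, [(10, 40), (30, 70)])
# segments -> [(0, 10), (10, 30), (30, 40), (40, 70), (70, 100)]
```

Adaptive replication could then copy only the hot segments, which is the storage-versus-reorganization trade-off the abstract mentions.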
Data Vaults: A Symbiosis between Database Technology and Scientific File Repositories
In this short paper we outline the Data Vault, a database-attached external file repository.
It provides a true symbiosis between a DBMS and existing file-based repositories.
Data is kept in its original format while scalable processing functionality is provided through the DBMS facilities.
In particular, it provides transparent access to all data kept in the repository through an (array-based)
query language using the file-type specific scientific libraries.
The design space for data vaults is characterized by requirements coming from various fields.
We present a reference architecture for their realization in (commercial) DBMSs and a concrete
implementation in MonetDB for remote sensing data, geared towards content-based image retrieval.
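The core idea, keeping data in its original file format while giving transparent query access through the DBMS, can be sketched in a few lines of Python. This is a toy stand-in, not MonetDB's actual Data Vault interface: only metadata is registered up front, and file content is pulled in on first access.

```python
import os
import tempfile

# A toy "data vault": files stay in their original format on disk; only
# metadata is registered up front, and content is loaded lazily, on first
# query. Class and method names are illustrative, not MonetDB's API.

class DataVault:
    def __init__(self):
        self.catalog = {}   # name -> file path (metadata only)
        self.cache = {}     # name -> loaded content

    def register(self, name, path):
        self.catalog[name] = path          # no file I/O happens here

    def query(self, name):
        if name not in self.cache:         # transparent, on-demand load
            with open(self.catalog[name]) as f:
                self.cache[name] = f.read()
        return self.cache[name]

# Example usage with a temporary file standing in for a repository file
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as f:
    f.write("1 2 3")
    path = f.name

vault = DataVault()
vault.register("sensor_a", path)
assert "sensor_a" not in vault.cache       # nothing loaded yet
data = vault.query("sensor_a")             # first access triggers the load
os.unlink(path)
```

In the real system the load step would go through file-type-specific scientific libraries and expose the data to an array-based query language, rather than returning raw text.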
Just-in-time Data Distribution for Analytical Query Processing
Distributed processing commonly requires data spread across machines using
a priori static or hash-based data allocation. In this paper, we explore
an alternative approach that starts from a master node in control of the
complete database and a variable number of worker nodes for delegated
query processing. Data is shipped just-in-time to the worker nodes using
a need-to-know policy and is reused, where possible, in subsequent
queries. A bidding mechanism among the workers yields a schedule with
the most efficient reuse of previously shipped data, minimizing data
transfer costs.
Just-in-time data shipment allows our system to benefit from locally
available idle resources to boost overall performance. The system is
maintenance-free and allocation is fully transparent to users. Our
experiments show that the proposed adaptive distributed architecture is a
viable and flexible alternative for small-scale MapReduce-type
settings.
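The bidding mechanism can be illustrated with a simplified Python sketch (the paper's actual protocol is richer). Each worker remembers the data segments shipped to it so far and bids the amount of missing data a query would force the master to transfer; the cheapest bid wins, maximizing reuse.

```python
# Sketch of a just-in-time bidding scheduler (simplified, illustrative
# names). Each worker caches the segment ids shipped to it so far and
# bids the number of *missing* segments a query would require shipping;
# the lowest bid wins, and the winner's cache grows accordingly.

def run_query(workers, needed_segments):
    """workers: dict of worker name -> set of cached segment ids.
    Returns (winner, transfer_cost) and updates the winner's cache."""
    bids = {w: len(needed_segments - cached) for w, cached in workers.items()}
    winner = min(bids, key=bids.get)        # most reuse == lowest cost
    workers[winner] |= needed_segments      # ship missing data just-in-time
    return winner, bids[winner]

workers = {"w1": {1, 2, 3}, "w2": {4}}
# Query needs segments {2, 3, 5}: w1 is missing only {5}, w2 misses all three
winner, cost = run_query(workers, {2, 3, 5})
# winner == "w1", cost == 1, and w1 now also caches segment 5
```

Repeating this for a query stream naturally concentrates related data on the same workers, which is why later queries become cheaper without any static allocation.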
SciQL, Bridging the Gap between Science and Relational DBMS
Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we propose SciQL (pronounced as ‘cycle’), the first SQL-based query language for scientific applications with both tables and arrays as first-class citizens. It provides a seamless symbiosis of array-, set- and sequence- interpretations. A key innovation is the extension of value-based grouping of SQL:2003 with structural grouping, i.e., fixed-size and unbounded groups based on explicit relationships between element positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. This paper describes the main language features of SciQL and illustrates
it using time-series concepts.
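The distinction between value-based and structural grouping can be mimicked in plain Python: structural groups are defined by element positions rather than values. The fixed-size variant below is a sliding window over the array index, the generalization of window-based processing the abstract refers to (the helper name is illustrative, not SciQL syntax).

```python
# Structural grouping, mimicked in Python: groups are defined by element
# *positions* rather than values. The fixed-size variant is a sliding
# window over the array index, parameterized by window size and stride.

def structural_groups(array, size, stride):
    """Fixed-size groups of `size` consecutive elements, one group every
    `stride` positions -- a generalization of window-based processing."""
    return [array[i:i + size]
            for i in range(0, len(array) - size + 1, stride)]

series = [3, 1, 4, 1, 5, 9, 2, 6]
windows = structural_groups(series, size=3, stride=2)
# windows -> [[3, 1, 4], [4, 1, 5], [5, 9, 2]]
means = [sum(w) / len(w) for w in windows]   # e.g. a moving average
```

A value-based GROUP BY, by contrast, would collect elements with equal values regardless of where they sit in the array.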
SciQL, A query language for science applications
Scientific applications are still poorly served by contemporary
relational database systems.
At best, the system provides a bridge towards an external library using
user-defined functions, explicit import/export facilities or linked-in
Java/C# interpreters.
The time has come to rectify this with SciQL, an SQL-based query language
for science applications with arrays as first-class citizens.
It provides a seamless symbiosis of array-, set-, and sequence-
interpretation using a clear separation of the mathematical object from
its underlying storage representation.
The language extends value-based grouping in SQL with structural
grouping, i.e., fixed-size and unbounded groups based on explicit
relationships between array index attributes.
It leads to a generalization of window-based query processing.
The SciQL architecture benefits from a column store system with an
adaptive storage scheme, including keeping multiple representations
around for reduced impedance mismatch.
This paper focuses on the language features, their architectural
consequences, and extensive examples of the intended use.
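The separation of the mathematical object from its storage representation, with multiple representations kept around, can be illustrated with a small Python sketch. The classes below are hypothetical stand-ins: one logical array is backed by two interchangeable storage schemes behind the same access interface.

```python
# Illustrative only: one logical array, two interchangeable storage
# representations (dense list vs. sparse dict), identical access API.
# This mirrors the abstract's separation of the mathematical object
# from its underlying storage representation.

class DenseStore:
    def __init__(self, values):
        self.values = list(values)          # one slot per cell
    def get(self, i):
        return self.values[i]

class SparseStore:
    def __init__(self, length, nonzero, default=0):
        self.length = length
        self.nonzero = dict(nonzero)        # index -> value, defaults elided
        self.default = default
    def get(self, i):
        return self.nonzero.get(i, self.default)

def as_list(store, length):
    """The 'mathematical object': values by index, storage-agnostic."""
    return [store.get(i) for i in range(length)]

dense = DenseStore([0, 0, 7, 0])
sparse = SparseStore(4, {2: 7})
# Both representations denote the same array [0, 0, 7, 0]
```

A query processor free to choose among such representations per column can reduce the impedance mismatch between array semantics and columnar storage.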
Data Vaults: Database Technology for Scientific File Repositories
Current data-management systems and analysis tools fail to meet scientists’ data-intensive needs. A "data vault" approach lets researchers effectively and efficiently explore and analyze information.
An architecture for recycling intermediates in a column-store
Automatically recycling (intermediate) results is a grand challenge for state-of-the-art databases aiming to improve both query response time and throughput. In the prevailing tuple-at-a-time processing pipeline, tuples are loaded and streamed through the query plan, avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning.
In contrast, the operator-at-a-time execution paradigm produces fully materialized results in each step of the query plan. To avoid resource contention, these intermediates are evicted as soon as possible.
In this paper we study an architecture that harvests the by-products of the operator-at-a-time paradigm in a column store system using a lightweight mechanism, the recycler. The key challenge then becomes selection of the policies to admit intermediates to the resource pool, their retention period, and the eviction strategy when facing resource limitations.
The proposed recycling architecture has been implemented in an open-source system. An experimental analysis against the TPC-H ad-hoc decision support benchmark and a complex, real-world application (SkyServer) demonstrates its effectiveness in terms of self-organizing behavior and its significant performance gains. The results indicate the potential of recycling intermediates and chart a route for further development of database kernels.
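The recycler's three policy questions, admission, retention, and eviction, can be sketched with simple stand-in policies in Python (a cost threshold for admission and LRU eviction under a capacity limit; the paper studies richer policies, and all names here are illustrative).

```python
from collections import OrderedDict

# A sketch of a recycler for materialized intermediates. Policies are
# simplified stand-ins: admission = only intermediates above a cost
# threshold enter the pool; eviction = least-recently-used results are
# dropped when the pool exceeds its capacity.

class Recycler:
    def __init__(self, capacity, min_cost):
        self.capacity = capacity       # max number of pooled intermediates
        self.min_cost = min_cost       # admission threshold
        self.pool = OrderedDict()      # plan fragment -> materialized result

    def lookup(self, fragment):
        if fragment in self.pool:
            self.pool.move_to_end(fragment)   # refresh recency on reuse
            return self.pool[fragment]
        return None                           # must (re)compute

    def admit(self, fragment, result, cost):
        if cost < self.min_cost:              # too cheap to bother caching
            return
        self.pool[fragment] = result
        self.pool.move_to_end(fragment)
        while len(self.pool) > self.capacity:
            self.pool.popitem(last=False)     # evict least recently used

rec = Recycler(capacity=2, min_cost=10)
rec.admit("scan(T)|filter(a>5)", [1, 2, 3], cost=50)
rec.admit("cheap_expr", [9], cost=1)          # rejected by admission policy
hit = rec.lookup("scan(T)|filter(a>5)")       # reused, no recomputation
```

In an operator-at-a-time engine such a pool is cheap to populate, because every operator already materializes its full result as a by-product.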
Lazy ETL in Action: ETL Technology Dates Scientific Data
Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, the initial loading is a time- and resource-consuming operation. It might not be entirely necessary, e.g., if the user is interested in only a subset of the data.
We propose to demonstrate Lazy ETL, a technique to lower costs for initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse. For a query, only the required data items are extracted, transformed, and loaded transparently on-the-fly.
The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing without waiting for a long initial load. The audience fires analytical queries to observe the internal mechanisms and modifications that realize each of the steps: lazy extraction, transformation, and loading.
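Lazy ETL in miniature can be sketched as follows (illustrative only; the record format and names are invented for the example, not the demo's actual seismic schema). Instead of eagerly loading every source record, the warehouse extracts, transforms, and loads a record the first time a query asks for it.

```python
# Lazy ETL in miniature: extract/transform/load happens per record,
# on demand, inside query processing. The raw format below is a made-up
# stand-in for an external seismic repository.

RAW_SOURCE = {
    "ev1": "4.2;2012-01-01",      # magnitude;date in the "raw" format
    "ev2": "5.8;2012-02-14",
}

class LazyWarehouse:
    def __init__(self, source):
        self.source = source
        self.loaded = {}          # only queried records end up here
        self.etl_calls = 0        # how many records were actually processed

    def query(self, key):
        if key not in self.loaded:
            raw = self.source[key]                  # extract
            mag, date = raw.split(";")              # transform
            self.loaded[key] = {"magnitude": float(mag), "date": date}  # load
            self.etl_calls += 1
        return self.loaded[key]

wh = LazyWarehouse(RAW_SOURCE)    # "ready for queries" with zero loading time
event = wh.query("ev1")           # only ev1 is extracted/transformed/loaded
```

The warehouse answers its first query immediately, and any record never queried is never loaded, which is exactly the saving over eager initial loading.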