9 research outputs found

Instant-on scientific data warehouses: Lazy ETL for data-intensive research

Author: Ivanova M.G. (Milena)
Kargin Y. (Yagiz)
Kersten M.L. (Martin)
Manegold S. (Stefan)
Pirk H. (Holger)
Publication date: 01/08/2012

In the dawning era of data-intensive research, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the costs for data loading: Lazy ETL. Data is extracted and loaded transparently on-the-fly only for the required data items. Extensive experiments demonstrate the significant reduction of the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the costs for bootstrapping a scientific data warehouse, our approach also reduces the costs for loading new incoming data.

CWI's Institutional Repository

Data Vaults: a Database Welcome to Scientific File Repositories

Author: Datcu M. (Mihai)
Espinoza Molina D.
Ivanova M.G. (Milena)
Kargin Y. (Yagiz)
Kersten M.L. (Martin)
Manegold S. (Stefan)
Zhang Y. (Ying)
Publication date: 01/01/2013

Efficient management and exploration of high-volume scientific file repositories have become pivotal for advancement in science. We propose to demonstrate the Data Vault, an extension of the database system architecture that transparently opens scientific file repositories for efficient in-database processing and exploration. The Data Vault facilitates science data analysis using high-level declarative languages, such as the traditional SQL and the novel array-oriented SciQL. Data of interest are loaded from the attached repository in a just-in-time manner without the need for up-front data ingestion. The demo is built around concrete implementations of the Data Vault for two scientific use cases: seismic time series and Earth observation images. The seismic Data Vault uses the queries submitted by the audience to illustrate the internals of Data Vault functioning by revealing the mechanisms of dynamic query plan generation and on-demand external data ingestion. The image Data Vault shows an application view from the perspective of data mining researchers.

Crossref

CWI's Institutional Repository

International Migration, Integration and Social Cohesion online publications

Lazy ETL in Action: ETL Technology Dates Scientific Data

Author: Ivanova M.G. (Milena)
Kargin Y. (Yagiz)
Kersten M.L. (Martin)
Manegold S. (Stefan)
Zhang Y. (Ying)
Publication date: 01/08/2013

Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, the initial loading is a time- and resource-consuming operation. It might not be entirely necessary, e.g. if the user is interested in only a subset of the data. We propose to demonstrate Lazy ETL, a technique to lower costs for initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse. For a query, only the required data items are extracted, transformed, and loaded transparently on-the-fly. The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing, without waiting for long initial loading. The audience fires analytical queries to observe the internal mechanisms and modifications that realize each of the steps: lazy extraction, transformation, and loading.
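The lazy loading idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `LazyWarehouse`, the station identifiers, and the in-memory "files" are all hypothetical names invented for illustration, and the real system integrates these steps into a database query processor rather than a Python class. The sketch only shows the core behaviour described above: a source is extracted, transformed, and loaded the first time a query touches it, and never otherwise.

```python
from typing import Callable, Dict, List


class LazyWarehouse:
    """Toy warehouse that loads a source only when a query first needs it.

    Hypothetical illustration; not the system described in the abstract.
    """

    def __init__(self, sources: Dict[str, Callable[[], List[str]]]):
        self._sources = sources                    # source id -> extractor function
        self._loaded: Dict[str, List[float]] = {}  # data already in the warehouse
        self.extractions = 0                       # how many sources were actually read

    def _ensure_loaded(self, source_id: str) -> List[float]:
        if source_id not in self._loaded:
            raw = self._sources[source_id]()       # extract: read the raw records
            values = [float(x) for x in raw]       # transform: parse text to numbers
            self._loaded[source_id] = values       # load: keep for later queries
            self.extractions += 1
        return self._loaded[source_id]

    def query_max(self, source_ids: List[str]) -> float:
        """Return the largest value among the requested sources only."""
        return max(max(self._ensure_loaded(s)) for s in source_ids)


# Three made-up sources standing in for files in an external repository.
sources = {
    "station_a": lambda: ["0.1", "0.7", "0.3"],
    "station_b": lambda: ["1.2", "0.4"],
    "station_c": lambda: ["9.9"],
}

wh = LazyWarehouse(sources)
peak = wh.query_max(["station_a", "station_b"])
```

In this sketch the query touches only `station_a` and `station_b`, so `station_c` is never read, and a repeated query over the same sources reuses the already loaded data instead of extracting it again.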

CWI's Institutional Repository

An automated ETL for online datasets

Author: McCarren Andrew
McCarthy Suzanne
Roantree Mark
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2019

While using online datasets for machine learning is commonplace today, the quality of these datasets affects the performance of prediction algorithms. One method for improving the semantics of new data sources is to map these sources to a common data model or ontology. While semantic and structural heterogeneities must still be resolved, this provides a well-established approach to providing clean datasets, suitable for machine learning and analysis. However, when there is a requirement for close-to-real-time usage of online data, a method for dynamic Extract-Transform-Load of new source data must be developed. In this work, we present a framework for integrating online and enterprise data sources, in close to real time, to provide datasets for machine learning and predictive algorithms. An exhaustive evaluation compares a human-built data transformation process with our system’s machine-generated ETL process, with very favourable results, illustrating the value and impact of an automated approach.

Crossref

Irish Universities

DCU Online Research Access Service

An automated ETL for online datasets

Author: McCarren Andrew
McCarthy Suzanne
Roantree Mark
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2019

DCU Online Research Access Service

An automated ETL for online datasets

Author: McCarren Andrew
McCarthy Suzanne
Roantree Mark
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2019

Crossref

Irish Universities

DCU Online Research Access Service

A method for automated transformation and validation of online datasets

Author: McCarren Andrew
McCarthy Suzanne
Roantree Mark
Publication date: 19/10/2019

Crossref

Irish Universities

DCU Online Research Access Service

Instant-on scientific data warehouses: Lazy ETL for data-intensive research

Author: Ivanova Milena
Kargin Yagiz
Kersten Martin
Manegold Stefan
Pirk Holger
Publication date: 01/01/2013

In the dawn of the data-intensive research era, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the costs for data loading: Lazy ETL. Data is extracted and loaded transparently on-the-fly only for the required data items. Extensive experiments demonstrate the significant reduction of the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the costs for bootstrapping a scientific data warehouse, our approach also reduces the costs for loading new incoming data.

Instant-on scientific data warehouses: Lazy ETL for data-intensive research

Author: Ivanova M.G. (Milena)
Kargin Y. (Yagiz)
Kersten M.L. (Martin)
Manegold S. (Stefan)
Pirk H. (Holger)
Publication date: 01/01/2013

CWI's Institutional Repository