9 research outputs found

Efficient snapshot differential algorithms for data warehousing

Author: Wilburt Juan Labio

Detecting and extracting modifications from information sources is an integral part of data warehousing. For unsophisticated sources, it is often necessary to infer modifications by periodically comparing snapshots of data from the source. Although this snapshot differential problem is closely related to traditional joins, there are significant differences, which lead to simple new algorithms. In particular, we present algorithms that perform compression of records. We also present a window algorithm that works very well if the snapshots are not "very different." The algorithms are studied via analysis and an implementation of two of them; the results illustrate the potential gains achievable with the new algorithms.
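To make the snapshot differential problem concrete, here is a minimal sketch of a naive in-memory baseline. It is not one of the paper's algorithms (the compression-based and window algorithms are not reproduced here). The function name `snapshot_diff` and the representation of a snapshot as a dictionary from key to record value are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Tuple

Change = Tuple[str, Hashable, object]


def snapshot_diff(old: Dict[Hashable, object],
                  new: Dict[Hashable, object]) -> List[Change]:
    """Infer modifications by comparing two snapshots keyed by record key.

    Hypothetical baseline: both snapshots are assumed to fit in memory.
    """
    changes: List[Change] = []
    # Records in the new snapshot are either inserts or updates.
    for key, value in new.items():
        if key not in old:
            changes.append(("insert", key, value))
        elif old[key] != value:
            changes.append(("update", key, value))
    # Records present only in the old snapshot were deleted.
    for key, value in old.items():
        if key not in new:
            changes.append(("delete", key, value))
    return changes


old_snapshot = {1: "alice", 2: "bob", 3: "carol"}
new_snapshot = {1: "alice", 2: "robert", 4: "dave"}
changes = snapshot_diff(old_snapshot, new_snapshot)
print(changes)
```

This baseline holds both snapshots in memory at once, which is the kind of cost that motivates more economical approaches when snapshots are large.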

CiteSeerX

Expiring Data from the Warehouse

Author: Hector Garcia-Molina
Wilburt Juan Labio

Data warehouses are used to collect and analyze data from remote sources. The data collected often originate from transactional information and can become very large. This paper presents a framework for incrementally removing warehouse data (without a need to fully recompute views), offering two choices. One is to expunge data, in which case the result is as if the data had never existed. The second is to expire data, in which case views defined over the data are not necessarily affected. Within the framework, a user or administrator can specify what data to expire or expunge, what auxiliary data is to be kept for facilitating incremental view maintenance, what type of updates are expected from external sources, and how the system should compensate when data is expired or other parameters changed. We present algorithms for the various expiration and compensation actions, and we show how our framework can be implemented on top of a conventional RDBMS.

CiteSeerX

Expiring data in a warehouse

Author: Hector Garcia-Molina
Jun Yang
Wilburt Juan Labio
Publication date: 01/01/1998

Data warehouses collect data into materialized views for analysis. After some time, some of the data may no longer be needed or may not be of interest. In this paper, we handle this by expiring or removing unneeded materialized view tuples. A framework supporting such expiration is presented. Within it, a user or administrator can declaratively request expirations and can specify what type of modifications are expected from external sources. The latter can significantly increase the amount of data that can be expired. We present efficient algorithms for determining what data can be expired (data not needed for maintenance of other views), taking into account the types of updates that may occur.

CiteSeerX

Shrinking the Warehouse Update Window

Author: Hector Garcia-Molina
Ramana Yerneni
Wilburt Juan Labio

Warehouse views need to be updated when source data changes. Due to the constantly increasing size of warehouses and the rapid rates of change, there is increasing pressure to reduce the time taken for updating the warehouse views. In this paper we focus on reducing this "update window" by minimizing the work required to compute and install a batch of updates. Various strategies have been proposed in the literature for updating a single warehouse view. These algorithms typically cannot be extended to come up with good strategies for updating an entire set of views. We develop an efficient algorithm that selects an optimal update strategy for any single warehouse view. Based on this algorithm, we develop an algorithm for selecting strategies to update a set of views. The performance of these algorithms is studied with experiments involving warehouse views based on TPC-D queries.

CiteSeerX

Shrinking the Warehouse Update Window

Author: Hector Garcia-Molina
Ramana Yerneni
Wilburt Juan Labio

Warehouse views need to be updated when source data changes. Due to the constantly increasing size of warehouses and the rapid rates of change, there is increasing pressure to reduce the time taken for updating the warehouse views. In this paper we focus on reducing this "update window" by minimizing the work required to compute and install a batch of updates. Various strategies have been proposed in the literature for updating a single warehouse view. These algorithms typically cannot be extended to come up with good strategies for updating an entire set of views. We develop an efficient algorithm that selects an optimal update strategy for any single warehouse view. Based on this algorithm, we develop an algorithm for selecting strategies to update a set of views. The performance of these algorithms is studied with experiments involving warehouse views based on TPC-D queries.

CiteSeerX

Efficient Resumption of Interrupted Warehouse Loads

Author: Hector Garcia-Molina
Janet L. Wiener
Vlad Gorelik
Wilburt Juan Labio
Publication date: 01/01/2000

Data warehouses collect large quantities of data from distributed sources into a single repository. A typical load to create or maintain a warehouse processes GBs of data, takes hours or even days to execute, and involves many complex and user-defined transformations of the data (e.g., find duplicates, resolve data inconsistencies, and add unique keys). If the load fails, a possible approach is to "redo" the entire load. A better approach is to resume the incomplete load from where it was interrupted. Unfortunately, traditional algorithms for resuming the load either impose unacceptable overhead during normal operation, or rely on the specifics of simple transformations. We develop a resumption algorithm called DR that imposes no overhead and relies only on the basic properties of the transformations. We show that DR can lead to almost a ten-fold reduction in resumption time by performing experiments using commercial software to load TPC-D tables and materialized views.

CiteSeerX

Crossref

Performance Issues in Incremental Warehouse Maintenance

Author: Hector Garcia-Molina
Jennifer Widom
Jun Yang
Wilburt Juan Labio
Yingwei Cui

A well-known challenge in data warehousing is the efficient incremental maintenance of warehouse data in the presence of source data updates. In this paper, we identify several critical data representation and algorithmic choices that must be made when developing the machinery of an incrementally maintained data warehouse. For each decision area, we identify various alternatives and evaluate them through extensive experiments. We show that picking the right alternative can lead to dramatic performance gains, and we propose guidelines for making the right decisions under different scenarios. All of the issues addressed in this paper arose in our development of WHIPS, a prototype data warehousing system supporting incremental maintenance.
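As a rough illustration of what incremental maintenance means, the sketch below applies a batch of source inserts and deletes to a materialized SUM-per-group view instead of recomputing it from scratch. It is a generic textbook-style example, not the WHIPS machinery; the function `apply_deltas`, the per-group row counts, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Delta = Tuple[str, int]  # (group key, amount); hypothetical row shape


def apply_deltas(view: Dict[str, int],
                 counts: Dict[str, int],
                 inserts: Iterable[Delta],
                 deletes: Iterable[Delta]) -> None:
    """Update a SUM-per-group view in place from source deltas.

    `counts` tracks how many source rows feed each group, so a group
    can be dropped from the view once its last row is deleted.
    """
    for group, amount in inserts:
        view[group] = view.get(group, 0) + amount
        counts[group] = counts.get(group, 0) + 1
    for group, amount in deletes:
        view[group] -= amount
        counts[group] -= 1
        if counts[group] == 0:
            # No source rows remain for this group.
            del view[group]
            del counts[group]


view = {"east": 30, "west": 5}
counts = {"east": 2, "west": 1}
apply_deltas(view, counts,
             inserts=[("east", 10), ("north", 7)],
             deletes=[("west", 5)])
print(view, counts)
```

Keeping the row counts alongside the sums is one example of the kind of data representation choice the abstract refers to: without them, the view alone cannot tell an empty group from a group whose amounts sum to zero.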

CiteSeerX

Shrinking the Warehouse Update Window

Author: Agrawal D.
Baralis E.
Hector Garcia-Molina
Huyn P.
Labio W. J.
Quass D.
Ramana Yerneni
Wilburt Juan Labio
Yang J.
Publication venue: Association for Computing Machinery (ACM)

Crossref