    Efficient incremental view maintenance in data warehouses

    Lazy ETL in Action: ETL Technology Dates Scientific Data

    Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of the Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, initial loading is a time- and resource-consuming operation, and it might not be entirely necessary, e.g., if the user is interested in only a subset of the data. We propose to demonstrate Lazy ETL, a technique that lowers the cost of initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse: for a given query, only the required data items are extracted, transformed, and loaded transparently on the fly. The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing without waiting for a long initial load. The audience can fire analytical queries to observe the internal mechanisms and modifications that realize each of the steps: lazy extraction, transformation, and loading.
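
    As a rough illustration of the lazy-loading idea, consider the following Python sketch. It is not the paper's implementation; the class, the repository layout, and the transform helper are all hypothetical. A query loads exactly the items it touches, on first access:

    def transform(raw_records):
        # Hypothetical stand-in for the paper's transformation step.
        return [tuple(r) for r in raw_records]

    class LazyWarehouse:
        def __init__(self, repository):
            self.repository = repository  # external source data, keyed by item id
            self.tables = {}              # items already loaded into the warehouse

        def _load(self, item_id):
            # Extract, transform, and load a single item on first access.
            self.tables[item_id] = transform(self.repository[item_id])

        def query(self, item_ids, predicate):
            # Only the items this query touches are loaded, transparently.
            for item_id in item_ids:
                if item_id not in self.tables:
                    self._load(item_id)
            return [row for i in item_ids for row in self.tables[i] if predicate(row)]

    # E.g., warehouse.query(["station_2"], lambda row: row[1] > 0.5) loads
    # only station_2; all other repository items stay untouched at the source.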

    On Deriving Net Change Information From Change Logs - The DELTALAYER-Algorithm

    The management of change logs is crucial in different areas of information systems, such as data replication, data warehousing, and process management. One barrier that hampers the (intelligent) use of change logs is the possibly large amount of unnecessary and redundant data they contain. In particular, change logs often record changes that actually had no effect on the original data source (e.g., due to subsequently applied, overriding change operations). Such inflated logs typically lead to difficulties with respect to system performance, data quality, or change comparability. To deal with this, we introduce the DeltaLayer algorithm. It takes arbitrary change log information as input and produces a cleaned output that contains only the net change effects, i.e., information about those changes that actually had an effect on the original source. We formally prove the minimality of our algorithm and show how it can be applied in different domains, e.g., the post-processing of differential snapshots in data warehouses or the analysis of conflicting changes in process management systems. Altogether, the ability to purge unnecessary information from change logs provides the basis for more intelligent handling of these logs.
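
    To make the net-change idea concrete, here is a hedged Python sketch (a simplification, not the authors' DeltaLayer formulation). Operations on the same object are folded so that overridden or cancelling changes disappear from the output:

    def net_changes(log):
        # log: iterable of (object_id, op, value), op in {'insert', 'update', 'delete'}.
        net = {}  # object_id -> (op, value): the net effect seen so far
        for oid, op, value in log:
            prev = net.get(oid)
            if prev is None:
                net[oid] = (op, value)
            elif prev[0] == 'insert':
                if op == 'delete':
                    del net[oid]                  # insert then delete: no net effect
                else:
                    net[oid] = ('insert', value)  # insert then update: still a net insert
            elif prev[0] == 'update':
                net[oid] = (op, value)            # a later op overrides the earlier update
            elif prev[0] == 'delete' and op == 'insert':
                net[oid] = ('update', value)      # delete then re-insert: a net update
        return [(oid, op, value) for oid, (op, value) in net.items()]

    # net_changes([(7, 'insert', 'a'), (7, 'update', 'b'),
    #              (9, 'insert', 'x'), (9, 'delete', None)])
    # -> [(7, 'insert', 'b')]; the changes to object 9 cancel out entirely.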

    Container-Managed ETL Applications for Integrating Data in Near Real-Time

    As the analytical capabilities and applications of e-business systems expand, providing real-time access to critical business performance indicators to improve the speed and effectiveness of business operations has become crucial. The monitoring of business activities requires focused yet incremental enterprise application integration (EAI) efforts and a balance between real-time information requirements and historical perspectives. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source systems to the data warehouse in a timely manner. In this paper, we present an architecture for a container-based ETL (extraction, transformation, loading) environment that supports continual near-real-time data integration, with the aim of decreasing the time it takes to make business decisions and minimizing the latency between the cause and effect of a business decision. Instead of using vendor-proprietary ETL solutions, we use an ETL container for managing ETLets (pronounced “et-lets”) for the ETL processing tasks. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near-real-time ETL environment. We have fully implemented the proposed architecture. Furthermore, we compare the ETL container to alternative continuous data integration approaches.
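
    The container pattern can be sketched generically; the interface below is an invented Python illustration, not the paper's J2EE API. The container owns the lifecycle of small pluggable ETL tasks and pushes each captured source event through them continuously, instead of running monolithic batch jobs:

    import queue
    import threading

    class ETLet:
        # Base class for a pluggable ETL processing task.
        def init(self): pass           # lifecycle hook: called once at deployment
        def process(self, event): ...  # handle one captured source event
        def destroy(self): pass        # lifecycle hook: called at undeployment

    class ETLContainer:
        def __init__(self):
            self.etlets, self.events, self.running = [], queue.Queue(), False

        def deploy(self, etlet):
            etlet.init()
            self.etlets.append(etlet)

        def publish(self, event):
            self.events.put(event)     # e.g., a change captured at a source system

        def _run(self):
            while self.running:
                try:
                    event = self.events.get(timeout=0.1)
                except queue.Empty:
                    continue
                for etlet in self.etlets:  # each event flows through all ETL
                    etlet.process(event)   # tasks as it arrives, near real-time

        def start(self):
            self.running = True
            threading.Thread(target=self._run, daemon=True).start()

        def stop(self):
            self.running = False
            for etlet in self.etlets:
                etlet.destroy()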

    HYBRIDJOIN for near-real-time Data Warehousing

    An important component of near-real-time data warehouses is the near-real-time integration layer. One important element in near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms such as Mesh Join (MESHJOIN) can be used. However, MESHJOIN's performance is inversely proportional to the size of the disk-based relation. The Index Nested Loop Join (INLJ) can be set up so that it processes stream input, and it can deal with intermittent update streams, but it has low throughput. This paper introduces a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN), which combines the two approaches. A theoretical result shows that HYBRIDJOIN is asymptotically as fast as the faster of the two algorithms. The authors present performance measurements of the implementation. In experiments using synthetic data based on a Zipfian distribution, HYBRIDJOIN performs significantly better for typical parameters of the Zipfian distribution, and in general performs in accordance with the theoretical model, while the other two algorithms are unacceptably slow under various settings.
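
    In rough Python terms (illustrative names only; the published algorithm manages stream buffers and disk partitions in far more detail), the hybrid idea is: queue incoming stream tuples, use the index to fetch the disk page relevant to the oldest queued tuple, and probe all queued tuples against that page, so each disk access is amortized over many stream tuples:

    from collections import deque

    def hybrid_join(stream, disk_pages, page_index, key):
        # stream:     iterator of stream tuples (dicts)
        # disk_pages: list of pages, each a list of relation tuples (dicts)
        # page_index: join-key value -> page number (assumes every value is indexed)
        # key:        name of the join attribute
        waiting = deque()
        for s in stream:
            waiting.append(s)
            # Fetch the page holding matches for the oldest waiting tuple.
            page = disk_pages[page_index[waiting[0][key]]]
            lookup = {}
            for r in page:
                lookup.setdefault(r[key], []).append(r)
            # Probe every waiting tuple against the in-memory page.
            still_waiting = deque()
            for w in waiting:
                if w[key] in lookup:
                    for r in lookup[w[key]]:
                        yield (w, r)
                else:
                    still_waiting.append(w)  # stays queued for a later page
            waiting = still_waiting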

    Instant-on scientific data warehouses: Lazy ETL for data-intensive research

    In the dawning era of data-intensive research, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. As in classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive, and it may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the cost of data loading: Lazy ETL. Data is extracted and loaded transparently on the fly, and only for the required data items. Extensive experiments demonstrate a significant reduction in the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the cost of bootstrapping a scientific data warehouse, our approach also reduces the cost of loading new incoming data.
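
    A minimal sketch of the instant-on behavior follows (hypothetical names and a deliberately trivial file format, not the paper's system): newly arriving source files are merely registered as metadata, and extraction costs are paid only when a query first touches an item, so the same mechanism also covers new incoming data without a bulk reload:

    class InstantOnCatalog:
        def __init__(self):
            self.pending = {}  # item id -> path of a not-yet-loaded source file
            self.loaded = {}   # item id -> transformed, queryable data

        def register(self, item_id, path):
            # Cheap, metadata-only step when new data arrives.
            self.pending[item_id] = path

        def get(self, item_id):
            # Load on first access; later queries hit the warehouse copy.
            if item_id not in self.loaded:
                self.loaded[item_id] = extract_and_transform(self.pending.pop(item_id))
            return self.loaded[item_id]

    def extract_and_transform(path):
        # Stand-in for file parsing plus transformation: one row per line.
        with open(path) as f:
            return [line.rstrip("\n").split(",") for line in f]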

    An ETL Metadata Model for Data Warehousing

    Metadata is essential for understanding information stored in data warehouses. It helps increase the adoption and usage of data warehouse data by knowledge workers and decision makers. A metadata model is important to the implementation of a data warehouse; the lack of one can lead to quality concerns about the data warehouse. A highly successful data warehouse implementation depends on consistent metadata. This article proposes the adoption of an ETL (extract-transform-load) metadata model for the data warehouse that makes subject area refreshes metadata-driven, loads observation timestamps and other useful parameters, and minimizes the consumption of database system resources. The ETL metadata model provides developers with a set of ETL development tools and delivers a user-friendly batch cycle refresh monitoring tool for the production support team.
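
    As a hedged sketch of what "metadata-driven" refreshes can look like (the schema and function names are illustrative assumptions, not the article's actual model), each subject area's refresh is controlled by rows in a metadata table rather than by hard-coded job logic:

    import sqlite3
    from datetime import datetime, timezone

    def setup(conn):
        conn.execute("""
            CREATE TABLE IF NOT EXISTS etl_metadata (
                subject_area  TEXT PRIMARY KEY,
                enabled       INTEGER NOT NULL,  -- 1 = include in the batch cycle
                last_observed TEXT,              -- observation timestamp of last load
                batch_size    INTEGER NOT NULL   -- tunable to limit resource use
            )""")

    def refresh_cycle(conn, load_subject_area):
        # Run one batch cycle, driven entirely by the metadata table.
        rows = conn.execute(
            "SELECT subject_area, last_observed, batch_size "
            "FROM etl_metadata WHERE enabled = 1").fetchall()
        for subject_area, last_observed, batch_size in rows:
            load_subject_area(subject_area, since=last_observed, limit=batch_size)
            conn.execute(
                "UPDATE etl_metadata SET last_observed = ? WHERE subject_area = ?",
                (datetime.now(timezone.utc).isoformat(), subject_area))
        conn.commit()

    # A monitoring tool can then read etl_metadata directly to show, per
    # subject area, when the last refresh ran and with which parameters.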