    A unified view of data-intensive flows in business intelligence systems: a survey

    Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time, operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of today's research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing the challenges that remain to be addressed and showing how current solutions can be applied to address them.

    AcDWH - A patented method for active data warehousing

    The traditional needs of data warehousing, from monthly, weekly or nightly batch processing, have evolved into near real-time refreshment cycles of the data, called active data warehousing. While traditional data warehousing methods have been used to batch-load large sets of data in the past, the business need for extremely fresh data in the data warehouse has increased. Previous studies have reviewed different aspects of the process, along with different methods for processing data in data warehouses in near real-time fashion. To date, there has been little research on using partitioned staging tables within relational databases, combined with a crafted metadata-driven system and parallelized loading processes, for active data warehousing. This study provides a thorough description and suitability assessment of the patented AcDWH method for active data warehousing. In addition, it provides a review and a summary of existing research on data warehousing from the start of the field in the 1990s to the year 2020. The review focuses on different parts of the data warehousing process and highlights the differences compared to the AcDWH method. Related to AcDWH, the usage of partitioned staging tables within a relational database, in combination with the metadata structures used to manage the system, is discussed in detail. In addition, two real-life applications are disclosed and discussed at a high level. Potential future extensions to the methodology are briefly discussed and summarized. The results indicate that the AcDWH method, using parallelized loading pipelines and partitioned staging tables, can provide enhanced throughput in data warehouse loading processes. This is a clear improvement in the field. Previous studies have not considered using partitioned staging tables in conjunction with loading processes and pipeline parallelization. A review of the existing literature against the AcDWH method, together with a trial-and-error approach, shows that the results and conclusions of this study are genuine. The results confirm that technical-level inventions within data warehousing processes also contribute significantly to the advance of methodologies. Compared to previous studies in the field, this study suggests a simple yet novel method to achieve near real-time capabilities in active data warehousing.
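
    The abstract describes metadata-driven, parallelized loading pipelines writing into partitioned staging tables. The following Python sketch illustrates only that general idea under assumed names (STAGING_PARTITIONS, load_batch, the stg_* tables); it is not the patented AcDWH implementation, and an in-memory staging area stands in for the real relational tables.

    # Hypothetical sketch of metadata-driven, parallelized loading into partitioned
    # staging tables; names are illustrative, not taken from the patent.
    from concurrent.futures import ThreadPoolExecutor
    from collections import defaultdict

    # Metadata that drives the system: each source feed is mapped to its own
    # staging partition so feeds can be loaded independently and in parallel.
    STAGING_PARTITIONS = {
        "orders_feed": "stg_orders_p1",
        "payments_feed": "stg_payments_p1",
        "shipments_feed": "stg_shipments_p1",
    }

    # In-memory stand-in for the partitioned staging area; a real implementation
    # would issue bulk INSERT statements against the relational database.
    staging_area = defaultdict(list)

    def load_batch(feed_name, rows):
        partition = STAGING_PARTITIONS[feed_name]
        staging_area[partition].extend(rows)          # simulate the bulk load
        return feed_name, partition, len(rows)

    incoming = {
        "orders_feed": [{"id": 1}, {"id": 2}],
        "payments_feed": [{"id": 10}],
        "shipments_feed": [{"id": 100}, {"id": 101}],
    }

    # Parallel loading pipelines: one worker per feed/partition.
    with ThreadPoolExecutor(max_workers=3) as pool:
        for feed, partition, count in pool.map(lambda kv: load_batch(*kv), incoming.items()):
            print(f"loaded {count} rows from {feed} into {partition}")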

    Lazy ETL in Action: ETL Technology Dates Scientific Data

    Both scientific data and business data have analytical needs. Analysis takes place after a scientific data warehouse is eagerly filled with all data from external data sources (repositories). This is similar to the initial loading stage of Extract, Transform, and Load (ETL) processes that drive business intelligence. ETL can also help scientific data analysis. However, the initial loading is a time- and resource-consuming operation. It might not be entirely necessary, e.g. if the user is interested in only a subset of the data. We propose to demonstrate Lazy ETL, a technique to lower the cost of initial loading. With it, ETL is integrated into the query processing of the scientific data warehouse. For a query, only the required data items are extracted, transformed, and loaded transparently on the fly. The demo is built around concrete implementations of Lazy ETL for seismic data analysis. The seismic data warehouse is ready for query processing without waiting for long initial loading. The audience fires analytical queries to observe the internal mechanisms and modifications that realize each of the steps: lazy extraction, transformation, and loading.
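
    A minimal sketch of the lazy-loading idea described above: nothing is loaded up front, and a query triggers extraction, transformation and loading of only the items it touches. The function names and the toy seismic record are assumptions for illustration, not the authors' implementation.

    loaded = {}          # items already materialized in the warehouse

    def extract(item_id):
        # Stand-in for reading one record from an external repository file.
        return {"id": item_id, "raw_amplitude": item_id * 0.1}

    def transform(record):
        # Stand-in for the cleaning/conversion step.
        return {"id": record["id"], "amplitude": round(record["raw_amplitude"], 2)}

    def lazy_load(item_id):
        # ETL happens transparently, on demand, at query time.
        if item_id not in loaded:
            loaded[item_id] = transform(extract(item_id))
        return loaded[item_id]

    def query(item_ids):
        # Only the required items are extracted, transformed and loaded.
        return [lazy_load(i) for i in item_ids]

    print(query([3, 7]))       # first access: triggers lazy ETL for items 3 and 7
    print(query([3]))          # second access: served from the already-loaded data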

    A framework for detecting unnecessary industrial data in ETL processes

    Extract, transform and load (ETL) is a critical process used by industrial organisations to shift data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After the ETL processes have been designed, data requirements inevitably change, and much of the data that goes through the ETL process may never be used or needed. This paper therefore proposes a framework for dynamically detecting and predicting unnecessary data and preventing it from slowing down ETL processes, either by removing it entirely or by deprioritizing it. Other advantages of the framework include the ability to prioritise data cleansing tasks and to determine which data should be processed first and placed into fast-access memory. We show existing example algorithms that can be used for each component of the framework, and present some initial testing results as part of our research to determine whether the framework can help to reduce ETL time.
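
    The core idea can be sketched as tracking how often each table or column is actually queried downstream and then skipping or deprioritizing rarely used data in the ETL run. The sketch below is an assumption-laden illustration only; the thresholds, column names and structures are hypothetical and are not the algorithms evaluated in the paper.

    usage_counts = {
        "sales.amount": 1250,
        "sales.customer_id": 980,
        "sales.legacy_flag": 0,
        "sensor.raw_trace": 3,
    }

    DROP_THRESHOLD = 1            # never queried -> candidate for removal
    LOW_PRIORITY_THRESHOLD = 10   # rarely queried -> load last

    def plan_etl(columns):
        drop, low, high = [], [], []
        for col in columns:
            hits = usage_counts.get(col, 0)
            if hits < DROP_THRESHOLD:
                drop.append(col)
            elif hits < LOW_PRIORITY_THRESHOLD:
                low.append(col)
            else:
                high.append(col)
        # High-usage data is processed first (and could go to fast-access memory).
        return {"process_first": high, "process_last": low, "skip": drop}

    print(plan_etl(list(usage_counts)))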

    An ETL Metadata Model for Data Warehousing

    Metadata is essential for understanding information stored in data warehouses. It helps increase levels of adoption and usage of data warehouse data by knowledge workers and decision makers. A metadata model is important to the implementation of a data warehouse; the lack of a metadata model can lead to quality concerns about the data warehouse. A highly successful data warehouse implementation depends on consistent metadata. This article proposes adoption of an ETL (extract-transform-load) metadata model for the data warehouse that makes subject area refreshes metadata-driven, loads observation timestamps and other useful parameters, and minimizes consumption of database system resources. The ETL metadata model provides developers with a set of ETL development tools and delivers a user-friendly batch cycle refresh monitoring tool for the production support team.
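
    As a rough illustration of a metadata-driven refresh, the sketch below keeps a small control table recording, per subject area, the last observation timestamp loaded; the batch cycle reads it to decide what to refresh next. Table and column names (etl_metadata, last_observation_ts) are hypothetical and not the article's model.

    import sqlite3
    from datetime import datetime, timezone

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE etl_metadata (
                       subject_area TEXT PRIMARY KEY,
                       last_observation_ts TEXT,
                       enabled INTEGER)""")
    con.executemany("INSERT INTO etl_metadata VALUES (?, ?, ?)",
                    [("sales", "2020-01-01T00:00:00+00:00", 1),
                     ("inventory", "2020-01-01T00:00:00+00:00", 0)])

    def refresh_cycle():
        # Only enabled subject areas are refreshed; the metadata drives the batch.
        areas = con.execute(
            "SELECT subject_area, last_observation_ts FROM etl_metadata WHERE enabled = 1"
        ).fetchall()
        for area, last_ts in areas:
            print(f"refreshing {area}: loading rows observed after {last_ts}")
            now = datetime.now(timezone.utc).isoformat()
            con.execute("UPDATE etl_metadata SET last_observation_ts = ? WHERE subject_area = ?",
                        (now, area))

    refresh_cycle()   # a monitoring tool could read etl_metadata to track batch cycles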

    A generic procedure for integration testing of ETL procedures

    In order to attain a certain degree of confidence in the quality of the data in the data warehouse, it is necessary to perform a series of tests. There are many components (and aspects) of the data warehouse that can be tested, and in this paper we focus on the ETL procedures. Due to the complexity of the ETL process, ETL procedure tests are usually custom-written and have a very low level of reusability. In this paper we address this issue and work towards establishing a generic procedure for integration testing of certain aspects of ETL procedures. In this approach, ETL procedures are treated as a black box and are tested by comparing their inputs and outputs, i.e. datasets. Datasets from three locations are compared: datasets from the relational source(s), datasets from the staging area, and datasets from the data warehouse. The proposed procedure is generic and can be implemented on any data warehouse employing a dimensional model and having relational database(s) as a source. Our work pertains only to certain aspects of the data quality problems that can be found in DW systems. It provides a basic testing foundation or augments an existing data warehouse system's testing capabilities. We comment on the proposed mechanisms both in terms of full reload and incremental loading.
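
    The black-box comparison idea can be sketched as fetching the same logical dataset from the source, the staging area and the warehouse, reducing each to a comparable row set, and reporting mismatches. The sketch below is an assumed illustration; the sample rows and function names are not from the paper.

    def rowset(rows):
        # Order-independent, duplicate-preserving representation of a dataset.
        return sorted(tuple(r) for r in rows)

    source_rows  = [(1, "A", 100), (2, "B", 200), (3, "C", 300)]
    staging_rows = [(1, "A", 100), (2, "B", 200), (3, "C", 300)]
    dw_rows      = [(1, "A", 100), (2, "B", 200)]          # row 3 lost during loading

    def compare(name_a, a, name_b, b):
        if rowset(a) == rowset(b):
            print(f"{name_a} vs {name_b}: OK ({len(a)} rows)")
        else:
            missing = set(rowset(a)) - set(rowset(b))
            print(f"{name_a} vs {name_b}: MISMATCH, e.g. missing {missing}")

    # Pairwise checks along the ETL path: source -> staging -> data warehouse.
    compare("source", source_rows, "staging", staging_rows)
    compare("staging", staging_rows, "warehouse", dw_rows)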

    An event-based near real-time data integration architecture

    Extract-Transform-Load (ETL) tools feed data from operational databases into data warehouses. Traditionally, these ETL tools use batch processing and operate offline at regular time intervals, for example on a nightly or weekly basis. Naturally, users prefer to have up-to-date data to make their decisions, therefore there is a demand for real-time ETL tools. In this paper we investigate an event-based near real-time ETL layer for transferring and transforming data from the operational database to the data warehouse. One of our main concerns in this paper is master data management in the ETL layer. We present the architecture of a novel, general purpose, event-driven, and near real-time ETL layer that uses a Database Queue (DBQ), works on a push technology principle and directly supports content enrichment. We also observe that the system architecture is consistent with the information architecture of a classical Online Transaction Processing (OLTP) application, allowing us to distinguish between different kinds of data to increase the clarity of the design.
    Keywords: event-based architecture, content enrichment, master data, extract-transform-load, enterprise service bus
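
    A minimal, assumed sketch of the event-based layer described above: change events are pushed onto a queue, a consumer enriches them with master data, and the enriched records are applied to the warehouse in near real time. The in-process queue below merely stands in for a Database Queue (DBQ), and the names are illustrative.

    import queue

    master_data = {"C42": {"customer_name": "Acme Corp", "segment": "Enterprise"}}
    dbq = queue.Queue()
    warehouse = []

    def on_transaction(event):
        # OLTP side: push the change event instead of waiting for a nightly batch.
        dbq.put(event)

    def etl_consumer():
        while not dbq.empty():
            event = dbq.get()
            # Content enrichment: join the event with master data before loading.
            enriched = {**event, **master_data.get(event["customer_id"], {})}
            warehouse.append(enriched)

    on_transaction({"order_id": 7001, "customer_id": "C42", "amount": 125.0})
    etl_consumer()
    print(warehouse)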

    Data warehousing technologies for large-scale and right-time data


    Container-Managed ETL Applications for Integrating Data in Near Real-Time

    As the analytical capabilities and applications of e-business systems expand, providing real-time access to critical business performance indicators to improve the speed and effectiveness of business operations has become crucial. The monitoring of business activities requires focused, yet incremental enterprise application integration (EAI) efforts and balancing information requirements in real time with historical perspectives. The decision-making process in traditional data warehouse environments is often delayed because data cannot be propagated from the source system to the data warehouse in a timely manner. In this paper, we present an architecture for a container-based ETL (extraction, transformation, loading) environment, which supports continual near real-time data integration with the aim of decreasing the time it takes to make business decisions and of minimizing the latency between the cause and effect of a business decision. Instead of using vendor-proprietary ETL solutions, we use an ETL container for managing ETLets (pronounced "et-lets") for the ETL processing tasks. The architecture takes full advantage of existing J2EE (Java 2 Platform, Enterprise Edition) technology and enables the implementation of a distributed, scalable, near real-time ETL environment. We have fully implemented the proposed architecture. Furthermore, we compare the ETL container to alternative continuous data integration approaches.
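
    The paper's environment is J2EE-based; purely as an illustration of the container/ETLet separation it describes, the Python sketch below defines a small ETLet interface and a container that pushes records through registered ETLets. Class and method names are assumptions, not the authors' API.

    from abc import ABC, abstractmethod

    class ETLet(ABC):
        """One self-contained ETL processing task managed by the container."""
        @abstractmethod
        def process(self, record: dict) -> dict: ...

    class CurrencyConversionETLet(ETLet):
        def process(self, record):
            record["amount_eur"] = round(record["amount_usd"] * 0.9, 2)
            return record

    class ETLContainer:
        """Manages ETLet registration and pushes incoming records through them."""
        def __init__(self):
            self.etlets = []
        def register(self, etlet: ETLet):
            self.etlets.append(etlet)
        def on_record(self, record: dict) -> dict:
            for etlet in self.etlets:       # near real-time: per record, not per batch
                record = etlet.process(record)
            return record

    container = ETLContainer()
    container.register(CurrencyConversionETLet())
    print(container.on_record({"order_id": 1, "amount_usd": 100.0}))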