
    A unified view of data-intensive flows in business intelligence systems : a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next-generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that are still to be addressed and how current solutions can be applied to address them. Peer Reviewed. Postprint (author's final draft)

    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research. Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201

    Using Ontologies for the Design of Data Warehouses

    Implementing a data warehouse is a complex task that forces designers to acquire wide knowledge of the domain, which requires a high level of expertise and makes the task prone to failure. Based on our experience, we have identified a set of situations encountered in real-world projects in which we believe that the use of ontologies will improve several aspects of the design of data warehouses. The aim of this article is to describe several shortcomings of current data warehouse design approaches and discuss the benefit of using ontologies to overcome them. This work is a starting point for discussing the convenience of using ontologies in data warehouse design. Comment: 15 pages, 2 figures

    Quarry : digging up the gems of your data treasury

    The design lifecycle of a data warehousing (DW) system is primarily led by the requirements of its end-users and the complexity of the underlying data sources. The process of designing a multidimensional (MD) schema and back-end extract-transform-load (ETL) processes is a long-term and mostly manual task. As enterprises shift to more real-time and "on-the-fly" decision making, business intelligence (BI) systems require automated means for efficiently adapting a physical DW design to frequent changes of business needs. To address this problem, we present Quarry, an end-to-end system for assisting users of various technical skills in managing the incremental design and deployment of MD schemata and ETL processes. Quarry automates the physical design of a DW system from high-level information requirements. Moreover, Quarry provides tools for efficiently accommodating MD schema and ETL process designs to new or changed information needs of its end-users. Finally, Quarry facilitates the deployment of the generated DW design over an extensible list of execution engines. On-site, we will use a variety of examples to show how Quarry reduces the complexity of the DW design lifecycle. Peer Reviewed. Postprint (published version)

    A Domain Specific Model for Generating ETL Workflows from Business Intents

    Extract-Transform-Load (ETL) tools have provided organizations with the ability to build and maintain workflows (consisting of graphs of data transformation tasks) that can process the flood of digital data. Currently, however, the specification of ETL workflows is largely manual, time-intensive, and error-prone. As these workflows become increasingly complex, the users who build and maintain them must retain an increasing amount of knowledge about how to produce solutions to business objectives using their domain's ETL workflow system. A program that can reduce the human time and expertise required to define such workflows, producing accurate ETL solutions with fewer errors, would therefore be valuable. This dissertation presents a means to automate the specification of ETL workflows using a domain-specific modeling language. To provide such a solution, the knowledge relevant to the construction of ETL workflows for the operations and objectives of a given domain is identified and captured. The approach provides a rich model of ETL workflows capable of representing such knowledge. This knowledge representation is leveraged by a domain-specific modeling language which maps declarative statements into workflow requirements. Users are then provided with the ability to assertionally express the intents that describe a desired ETL solution at a high level of abstraction, from which procedural workflows satisfying the intent specification are automatically generated using a planner.

    Data generator for evaluating ETL process quality

    Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate such a demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework, and here we report our experimental findings showing the effectiveness and scalability of our approach. Peer Reviewed. Postprint (author's final draft)

    A Goal and Ontology Based Approach for Generating ETL Process Specifications

    Data warehouse (DW) systems development involves several tasks such as defining requirements, designing DW schemas, and specifying data transformation operations. Indeed, the success of DW systems depends heavily on the proper design of the extracting, transforming, and loading (ETL) processes. However, common design-related problems in the ETL processes, such as defining user requirements and data transformation specifications, are far from being resolved. These problems are due to data heterogeneity in data sources, ambiguity of user requirements, and the complexity of data transformation activities. Current approaches have limitations in reconciling DW requirement semantics when designing the ETL processes. As a result, generating the ETL process specifications takes longer. The semantic framework of DW systems established in this study is used to develop the requirement analysis method for designing the ETL processes (RAMEPs) from the different perspectives of organization, decision-maker, and developer by using goal and ontology approaches. The correctness of the RAMEPs approach was validated by using modified and newly developed compliant tools. RAMEPs was evaluated in three real case studies, i.e., Student Affairs System, Gas Utility System, and Graduate Entrepreneur System. These case studies were used to illustrate how the RAMEPs approach can be implemented for designing and generating the ETL process specifications. Moreover, the RAMEPs approach was reviewed by DW experts to assess the strengths and weaknesses of the method, and the new approach was accepted. The RAMEPs method shows that the ETL process specifications can be derived from the early phases of DW systems development by using the goal-ontology approach.

    Modelling Data Pipelines

    Data is the new currency and key to success. However, collecting high-quality data from multiple distributed sources requires much effort. In addition, there are several other challenges involved in transporting data from its source to the destination. Data pipelines are implemented to increase the overall efficiency of data flow from the source to the destination, since a pipeline is automated and reduces the human involvement that would otherwise be required. Despite existing research on ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform) pipelines, the research on this topic is limited. ETL/ELT pipelines are abstract representations of the end-to-end data pipelines. To utilize the full potential of the data pipeline, we should understand the activities in it and how they are connected in an end-to-end data pipeline. This study gives an overview of how to design a conceptual model of a data pipeline, which can further be used as a language of communication between different data teams. Furthermore, it can be used for automation of monitoring, fault detection, mitigation, and alarming at different steps of the data pipeline.

    Continuous Delivery in Data Warehousing

    Continuous delivery is an activity in the field of continuous software engineering. Data warehousing, on the other hand, lies within information systems research. This dissertation combines these two traditionally separate concerns of continuous delivery and data warehousing. This dissertation’s motivation stems from a practical problem: how to shorten the time from a reporting idea until it is available for users. Data warehousing has traditionally been considered tedious and delicate. In data warehousing, distinct steps take place one after another in a predefined unalterable sequence. Another traditional aspect of data warehousing is bringing everything at once to a production environment, where all the pieces of a data warehouse are in place before production use. If development follows agile iterations, why are the releases in production not following the same iterations? This dissertation introduces how reporting and data warehouse teams can synchronously build business intelligence solutions in increments. Joint working enhances communication between developers, shortens the feedback cycle from an end-user to developers, and makes the feedback more direct. Continuous delivery practices support releasing frequently to a production environment. A two-layer data warehouse architecture separates analytical and transactional processing. Separating different processing targets enables better testing and, thus, continuous delivery. When frequently deploying with continuous delivery practices, automating transformation creation in data warehousing reduces the development time. This dissertation introduces an information model for automating the implementation of transformations, both for getting data into a data warehouse and for getting data out of it. The research evaluation followed the design science guidelines. The research for this dissertation was carried out in collaboration with industry. These ideas have been tested on real projects with promising results, and thus they have been proven to work.