
    Data generator for evaluating ETL process quality

    Obtaining the right set of data for evaluating the fulfillment of different quality factors in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to different privacy constraints, while manually providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. More importantly, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing a plethora of possible test cases. To facilitate this demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over input data, and automatically generates testing datasets. Bijoux is highly modular and configurable to enable end-users to generate datasets for a variety of interesting test scenarios (e.g., evaluating specific parts of an input ETL process design, with different input dataset sizes, different distributions of data, and different operation selectivities). We have developed a running prototype that implements the functionality of our data generation framework, and here we report our experimental findings showing the effectiveness and scalability of our approach.
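    Bijoux's internals are not shown in the abstract, but the core idea of deriving test data from operation constraints can be sketched. The following minimal Python example is an illustration under assumptions (the predicate form `column > threshold`, the selectivity-driven generator, and all names are hypothetical, not Bijoux's actual API): given a filter operation and a target selectivity, it generates rows that satisfy and violate the constraint in the requested proportion.

```python
import random

def generate_filter_dataset(column, threshold, size, selectivity, seed=42):
    """Generate test rows for a hypothetical filter `column > threshold`.
    `selectivity` is the fraction of rows that should pass the filter,
    so the dataset exercises both branches of the operation.
    (Illustrative sketch only, not the actual Bijoux generator.)"""
    rng = random.Random(seed)
    n_pass = round(size * selectivity)
    rows = []
    for i in range(size):
        if i < n_pass:
            value = threshold + rng.uniform(1, 100)   # satisfies the constraint
        else:
            value = threshold - rng.uniform(0, 100)   # violates the constraint
        rows.append({column: round(value, 2)})
    rng.shuffle(rows)
    return rows

# Example: 1,000 rows where 30% pass the predicate `amount > 500`.
data = generate_filter_dataset("amount", 500, 1000, 0.3)
print(sum(1 for r in data if r["amount"] > 500))  # 300
```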

    POIESIS: A tool for quality-aware ETL process redesign

    We present a tool, called POIESIS, for automatic ETL process enhancement. ETL processes are essential data-centric activities in modern business intelligence environments and they need to be examined through a viewpoint that concerns their quality characteristics (e.g., data quality, performance, manageability) in the era of Big Data. POIESIS responds to this need by providing a user-centered environment for quality-aware analysis and redesign of ETL flows. It generates thousands of alternative flows by adding flow patterns to the initial flow, in varying positions and combinations, thus creating alternative design options in a multidimensional space of different quality attributes. Through the demonstration of POIESIS we introduce the tool's capabilities and highlight its efficiency, usability and modifiability, thanks to its polymorphic design. © 2015, Copyright is with the authors.
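    The pattern-insertion step can be made concrete with a small sketch. The hypothetical Python fragment below (the flow representation, the pattern names, and the enumeration itself are illustrative assumptions, not POIESIS's implementation) enumerates alternative flows by inserting a single quality-improving pattern at every position of a linear flow:

```python
from itertools import product

def alternative_flows(flow, patterns):
    """Enumerate variants of a linear flow (a list of operation names)
    by inserting each quality pattern at each possible position.
    (Illustrative sketch of pattern-based flow enumeration.)"""
    for pattern, pos in product(patterns, range(len(flow) + 1)):
        yield flow[:pos] + [pattern] + flow[pos:]

base = ["extract", "join", "aggregate", "load"]
patterns = ["dedup_filter", "checkpoint", "parallelize"]
print(len(list(alternative_flows(base, patterns))))  # 3 patterns x 5 positions = 15
```

    Allowing several patterns per flow multiplies this space combinatorially, which is how thousands of alternative designs can arise from a single initial flow.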

    A unified view of data-intensive flows in business intelligence systems: a survey

    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that are still to be addressed, and showing how current solutions can be applied to address them.

    A Domain Specific Model for Generating ETL Workflows from Business Intents

    Extract-Transform-Load (ETL) tools have provided organizations with the ability to build and maintain workflows (consisting of graphs of data transformation tasks) that can process the flood of digital data. Currently, however, the specification of ETL workflows is largely manual, human time intensive, and error prone. As these workflows become increasingly complex, the users that build and maintain them must retain an increasing amount of knowledge specific to how to produce solutions to business objectives using their domain's ETL workflow system. A program that can reduce the human time and expertise required to define such workflows, producing accurate ETL solutions with fewer errors, would therefore be valuable. This dissertation presents a means to automate the specification of ETL workflows using a domain-specific modeling language. To provide such a solution, the knowledge relevant to the construction of ETL workflows for the operations and objectives of a given domain is identified and captured. The approach provides a rich model of ETL workflows capable of representing such knowledge. This knowledge representation is leveraged by a domain-specific modeling language which maps declarative statements into workflow requirements. Users can then declaratively express the intents that describe a desired ETL solution at a high level of abstraction, from which procedural workflows satisfying the intent specification are automatically generated using a planner.
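    To make the intent-to-workflow idea concrete, here is a minimal hypothetical Python sketch (the operator catalogue, the linear state model, and the chaining logic are illustrative assumptions, not the dissertation's planner): each operator declares the data state it requires and the state it produces, and a simple planner chains operators until the goal state named by the intent is reached.

```python
# Hypothetical operator catalogue: (name, required_state, produced_state).
OPERATORS = [
    ("extract_orders", "source", "raw"),
    ("clean_nulls", "raw", "clean"),
    ("conform_schema", "clean", "conformed"),
    ("load_warehouse", "conformed", "loaded"),
]

def plan(start, goal):
    """Chain operators from `start` to `goal` over a linear state space.
    (Illustrative sketch, not the dissertation's planner.)"""
    state, workflow = start, []
    while state != goal:
        step = next((op for op in OPERATORS if op[1] == state), None)
        if step is None:
            raise ValueError(f"no operator applicable in state {state!r}")
        workflow.append(step[0])
        state = step[2]
    return workflow

# Intent: orders data, cleaned and conformed, loaded into the warehouse.
print(plan("source", "loaded"))
# ['extract_orders', 'clean_nulls', 'conform_schema', 'load_warehouse']
```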

    An approach for testing the extract-transform-load process in data warehouse systems

    Enterprises use data warehouses to accumulate data from multiple sources for data analysis and research. Since organizational decisions are often made based on the data stored in a data warehouse, all its components must be rigorously tested. In this thesis, we first present a comprehensive survey of data warehouse testing approaches, and then develop and evaluate an automated testing approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in data warehousing.
    In the survey we present a classification framework that categorizes the testing and evaluation activities applied to the different components of data warehouses. These approaches include dynamic analysis as well as static evaluation and manual inspections. The classification framework uses information about what is tested, in terms of the data warehouse component being validated, and how it is tested, in terms of the various types of testing and evaluation approaches. We discuss the specific challenges and open problems for each component and propose research directions.
    The ETL process involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex one-to-one, many-to-one, and many-to-many transformations involving sources and targets that use different schemas, databases, and technologies. Since faulty implementations in any of the ETL steps can result in incorrect information in the target data warehouse, ETL processes must be thoroughly validated. We propose automated balancing tests that check for discrepancies between the data in the source databases and that in the target warehouse. Balancing tests ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process.
    First, we categorize and define a set of properties to be checked in balancing tests. We identify various types of discrepancies that may exist between the source and the target data, and formalize three categories of properties, namely completeness, consistency, and syntactic validity, that must be checked during testing. Next, we automatically identify source-to-target mappings from the ETL transformation rules provided in the specifications, covering one-to-one, many-to-one, and many-to-many mappings for the tables, records, and attributes involved in the transformations. Finally, we use these mappings to automatically generate test assertions, one per property, that compare the data in the target data warehouse with the corresponding data in the sources.
    We evaluate our approach on a health data warehouse that uses data sources with different data models running on different platforms. We demonstrate that our approach can find previously undetected real faults in the ETL implementation. We also provide an automatic mutation testing approach to evaluate the fault-finding ability of our balancing tests. Using mutation analysis, we demonstrate that our auto-generated assertions can detect faults in the data inside the target data warehouse when faulty ETL scripts execute on mock source data.
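    A balancing test for the completeness property can be illustrated with a short hypothetical Python sketch (the table mapping, database files, and count-based check are illustrative assumptions, not the thesis's generated assertions): for a one-to-one table mapping, it asserts that the target warehouse holds as many records as the source, i.e., that the ETL run lost no rows.

```python
import sqlite3

def assert_completeness(source_db, target_db, table_mapping):
    """For each (source_table, target_table) pair, assert equal record
    counts, i.e., the ETL process dropped no rows. (Illustrative sketch
    of a count-based completeness check, not the generated assertions.)"""
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    try:
        for source_table, target_table in table_mapping:
            n_src = src.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
            n_tgt = tgt.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
            assert n_src == n_tgt, (
                f"completeness violated: {source_table}={n_src} rows, "
                f"{target_table}={n_tgt} rows")
    finally:
        src.close()
        tgt.close()

# Hypothetical one-to-one mapping derived from the ETL specification.
assert_completeness("source.db", "warehouse.db",
                    [("patients", "dim_patient"), ("visits", "fact_visit")])
```

    Consistency and syntactic-validity checks would compare attribute values and data types across the same mappings rather than record counts.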

    Engineering model transformations with transML

    The final publication is available at Springer via http://dx.doi.org/10.1007%2Fs10270-011-0211-2. Model transformation is one of the pillars of model-driven engineering (MDE). The increasing complexity of systems and modelling languages has dramatically raised the complexity and size of model transformations as well. Even though many transformation languages and tools have been proposed in the last few years, most of them are directed to the implementation phase of transformation development. In this way, even though transformations should be built using sound engineering principles—just like any other kind of software—there is currently a lack of cohesive support for the other phases of transformation development, like requirements, analysis, design and testing. In this paper, we propose a unified family of languages to cover the life cycle of transformation development, enabling the engineering of transformations. Moreover, following an MDE approach, we provide tools to partially automate the progressive refinement of models between the different phases and the generation of code for several transformation implementation languages. This work has been sponsored by the Spanish Ministry of Science and Innovation with project METEORIC (TIN2008-02081), and by the R&D program of the Community of Madrid with project "e-Madrid" (S2009/TIC-1650). Parts of this work were done during the research stays of Esther and Juan at the University of York, with financial support from the Spanish Ministry of Science and Innovation (grant refs. JC2009-00015, PR2009-0019 and PR2008-0185).
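    transML's notations are not reproduced in the abstract, but the kind of artifact they engineer can be illustrated with the classic class-model-to-relational-model example often used in MDE. The Python sketch below is an assumption-laden stand-in (the model encoding and the Class2Table rule are illustrative, not transML syntax):

```python
# Source model: a class with typed attributes (illustrative encoding).
class_model = {
    "name": "Order",
    "attributes": [("id", "int"), ("total", "float"), ("customer", "str")],
}

TYPE_MAP = {"int": "INTEGER", "float": "REAL", "str": "VARCHAR"}

def class_to_table(model):
    """Transformation rule Class2Table: map a class to a relational
    table and each attribute to a typed column. (Illustrative only.)"""
    return {
        "table": model["name"].lower(),
        "columns": [(name, TYPE_MAP[t]) for name, t in model["attributes"]],
    }

print(class_to_table(class_model))
# {'table': 'order', 'columns': [('id', 'INTEGER'), ('total', 'REAL'),
#  ('customer', 'VARCHAR')]}
```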

    SoFIA: Shop Floor Intelligence Awareness

    Project work submitted in partial fulfillment of the requirements for the degree of Master in Software Engineering.
    Nowadays, data is one of the most important assets of a company. The concept of Business Intelligence emerges precisely as a set of techniques for extracting value from this data, transforming it into information. This project was carried out in that context. It makes use of a set of data already collected by the company and processes it to generate information with a positive impact both on the factory operators (by promoting healthy competition among them, so that the production lines they are responsible for have as few stoppages as possible) and on the management and coordination team (by providing information through performance indicators). The information extracted from this data processing is made available through dashboards to several areas of the company, reflecting incidents observed on the production lines that, according to a set of rules defined by the management team, should be highlighted.