251 research outputs found

    intelligent data qality management

    Get PDF
    The primary objective of this work is to develop a concept for a holistic data quality management which is based on formal data quality metrics and a well–defined process model. The extensive use of metadata provides a flexible adaptation to various application domains and a maximum degree of automation

    ETL queues for active data warehousing

    Full text link

    A relational algebra approach to ETL modeling

    Get PDF
    The MAP-i Doctoral Programme in Informatics, of the Universities of Minho, Aveiro and PortoInformation Technology has been one of drivers of the revolution that currently is happening in today’s management decisions in most organizations. The amount of data gathered and processed through the use of computing devices has been growing every day, providing a valuable source of information for decision makers that are managing every type of organization, public or private. Gathering the right amount of data in a centralized and unified repository like a data warehouse is similar to build the foundations for a system that will act has a base to support decision making processes requiring factual information. Nevertheless, the complexity of building such a repository is very challenging, as well as developing all the components of a data warehousing system. One of the most critical components of a data warehousing system is the Extract-Transform-Load component, ETL for short, which is responsible for gathering data from information sources, clean, transform and conform it in order to store it in a data warehouse. Several designing methodologies for the ETL components have been presented in the last few years with very little impact in ETL commercial tools. Basically, this was due to an existing gap between the conceptual design of an ETL system and its correspondent physical implementation. The methodologies proposed ranged from new approaches, with novel notation and diagrams, to the adoption and expansion of current standard modeling notations, like UML or BPMN. However, all these proposals do not contain enough detail to be translated automatically into a specific execution platform. The use of a standard well-known notation like Relational Algebra might bridge the gap between the conceptual design and the physical design of an ETL component, mainly due to its formal approach that is based on a limited set of operators and also due to its functional characteristics like being a procedural language operating over data stored in relational format. The abstraction that Relational Algebra provides over the technological infrastructure might also be an advantage for uncommon execution platforms, like computing grids that provide an exceptional amount of processing power that is very critical for ETL systems. Additionally, partitioning data and task distribution over computing nodes works quite well with a Relational Algebra approach. An extensive research over the use of Relational Algebra in the ETL context was conducted to validate its usage. To complement this, a set of Relational Algebra patterns were also developed to support the most common ETL tasks, like changing data capture, data quality enforcement, data conciliation and integration, slowly changing dimensions and surrogate key pipelining. All these patterns provide a formal approach to the referred ETL tasks by specifying all the operations needed to accomplish them in a series of Relational Algebra operations. To evaluate the feasibility of the work done in this thesis, we used a real ETL application scenario for the extraction of data in two different social networks operational systems, storing hashtag usage information in a specific data mart. The ability to analyze trends in social network usage is a hot topic in today’s media and information coverage. A complete design of the ETL component using the patterns developed previously is also provided, as well as a critical evaluation of its usage.As Tecnologias da Informação têm sido um dos principais catalisadores na revolução que se assiste nas tomadas de decisão na maioria das organizações. A quantidade de dados que são angariados e processados através do uso de dispositivos computacionais tem crescido diariamente, tornando-se uma fonte de informação valiosa para os decisores que gerem todo o tipo de organizações, públicas ou privadas. Concentrar o conjunto ideal de dados num repositório centralizado e unificado, como um data warehouse, é essencial para a construção de um sistema que servirá de suporte aos processos de tomada de decisão que necessitam de factos. No entanto, a complexidade associada à construção deste repositório e de todos os componentes que caracterizam um sistema de data warehousing é extremamente desafiante. Um dos componentes mais críticos de um sistema de data warehousing é a componente de Extração-Transformação- Alimentação (ETL) que lida com a extração de dados das fontes, que limpa, transforma e concilia os dados com vista à sua integração no data warehouse. Nos últimos anos têm sido apresentadas várias metodologias de desenho da componente de ETL, no entanto estas não têm sido adotadas pelas ferramentas comerciais de ETL principalmente devido ao diferencial existente entre o desenho concetual e as plataformas físicas de execução. As metodologias de desenho propostas variam desde propostas que assentam em novas notações e diagramas até às propostas que usam notações standard como a notação UML e BPMN que depois são complementadas com conceitos de ETL. Contudo, estas propostas de modelação concetual não contêm informações detalhadas que permitam uma tradução automática para plataformas de execução. A utilização de uma linguagem standard e reconhecida como a linguagem de Álgebra Relacional pode servir como complemento e colmatar o diferencial existente entre o desenho concetual e o desenho físico da componente de ETL, principalmente devido ao facto de esta linguagem assentar numa abordagem procedimental com um conjunto limitado de operadores que atuam sobre dados armazenados num formato relacional. A abstração providenciada pela Álgebra Relacional relativamente às plataformas de execução pode eventualmente ser uma vantagem tendo em vista a utilização de plataformas menos comuns, como por exemplo grids computacionais. Este tipo de arquiteturas disponibiliza por norma um grande poder computacional o que é essencial para um sistema de ETL. O particionamento e distribuição dos dados e tarefas pelos nodos computacionais conjugam relativamente bem com a abordagem da Álgebra Relacional. No decorrer deste trabalho foi efetuado um estudo extensivo às propriedades da AR num contexto de ETL com vista à avaliação da sua usabilidade. Como complemento, foram desenhados um conjunto de padrões de AR que suportam as atividades mais comuns de ETL como por exemplo changing data capture, data quality enforcement, data conciliation and integration, slowly changing dimensions e surrogate key pipelining. Estes padrões formalizam este conjunto de atividades ETL, especificando numa série de operações de Álgebra Relacional quais os passos necessários à sua execução. Com vista à avaliação da sustentabilidade da proposta presente neste trabalho, foi utilizado um cenário real de ETL em que os dados fontes pertencem a duas redes sociais e os dados armazenados no data mart identificam a utilização de hashtags por parte dos seus utilizadores. De salientar que a deteção de tendências e de assuntos que estão na ordem do dia nas redes sociais é de vital importância para as empresas noticiosas e para as próprias redes sociais. Por fim, é apresentado o desenho completo do sistema de ETL para o cenário escolhido, utilizando os padrões desenvolvidos neste trabalho, avaliando e criticando a sua utilização

    Applying the UML and the Unified Process to the Design of Data Warehouses

    Get PDF
    The design, development and deployment of a data warehouse (DW) is a complex, time consuming and prone to fail task. This is mainly due to the different aspects taking part in a DW architecture such as data sources, processes responsible for Extracting, Transforming and Loading (ETL) data into the DW, the modeling of the DW itself, specifying data marts from the data warehouse or designing end user tools. In the last years, different models, methods and techniques have been proposed to provide partial solutions to cover the different aspects of a data warehouse. Nevertheless, none of these proposals addresses the whole development process of a data warehouse in an integrated and coherent manner providing the same notation for the modeling of the different parts of a DW. In this paper, we propose a data warehouse development method, based on the Unified Modeling Language (UML) and the Unified Process (UP), which addresses the design and development of both the data warehouse back-stage and front-end. We use the extension mechanisms (stereotypes, tagged values and constraints) provided by the UML and we properly extend it in order to accurately model the different parts of a data warehouse (such as the modeling of the data sources, ETL processes or the modeling of the DW itself) by using the same notation. To the best of our knowledge, our proposal provides a seamless method for developing data warehouses. Finally, we apply our approach to a case study to show its benefit.This work has been partially supported by the METASIGN project (TIN2004-OO779) from the Spanish Ministry of Education and Science, by the DADASMECA project (GV05/220) from the Valencia Government, and by the DADS (PBC-05-QI 2-2) project from the Regional Science arid Technology Ministry of CastiIla-La Mancha (Spain)

    Data Warehouse Design: An Investigation of Star Schema

    Get PDF

    A decision support system for IST academic information

    Get PDF
    This article describes the Decision Support System (DSS) for Academic Information being developed at Instituto Superior Técnico, the Engineering School of the Technical University of Lisbon. In Portuguese, this project has been given the acronym SADIA (Sistema de Apoio à Decisão da Informação Académica). This paper focuses on the early phases of the DSS development process, i.e., the business requirements definition and the dimensional modelling. First, we show how the business requirements of the School drive the definition of the DSS dimensional model. Second, we detail the logical dimensional model for a selected business process, the IST Student Admission process. Third, the corresponding physical design decisions are reported. The results obtained from the three phases were successfully validated by business users

    A abordagem POESIA para a integração de dados e serviços na Web semantica

    Get PDF
    Orientador: Claudia Bauzer MedeirosTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: POESIA (Processes for Open-Ended Systems for lnformation Analysis), a abordagem proposta neste trabalho, visa a construção de processos complexos envolvendo integração e análise de dados de diversas fontes, particularmente em aplicações científicas. A abordagem é centrada em dois tipos de mecanismos da Web semântica: workflows científicos, para especificar e compor serviços Web; e ontologias de domínio, para viabilizar a interoperabilidade e o gerenciamento semânticos dos dados e processos. As principais contribuições desta tese são: (i) um arcabouço teórico para a descrição, localização e composição de dados e serviços na Web, com regras para verificar a consistência semântica de composições desses recursos; (ii) métodos baseados em ontologias de domínio para auxiliar a integração de dados e estimar a proveniência de dados em processos cooperativos na Web; (iii) implementação e validação parcial das propostas, em urna aplicação real no domínio de planejamento agrícola, analisando os benefícios e as limitações de eficiência e escalabilidade da tecnologia atual da Web semântica, face a grandes volumes de dadosAbstract: POESIA (Processes for Open-Ended Systems for Information Analysis), the approach proposed in this work, supports the construction of complex processes that involve the integration and analysis of data from several sources, particularly in scientific applications. This approach is centered in two types of semantic Web mechanisms: scientific workflows, to specify and compose Web services; and domain ontologies, to enable semantic interoperability and management of data and processes. The main contributions of this thesis are: (i) a theoretical framework to describe, discover and compose data and services on the Web, inc1uding mIes to check the semantic consistency of resource compositions; (ii) ontology-based methods to help data integration and estimate data provenance in cooperative processes on the Web; (iii) partial implementation and validation of the proposal, in a real application for the domain of agricultural planning, analyzing the benefits and scalability problems of the current semantic Web technology, when faced with large volumes of dataDoutoradoCiência da ComputaçãoDoutor em Ciência da Computaçã

    Testing Data Vault-Based Data Warehouse

    Get PDF
    Data warehouse (DW) projects are undertakings that require integration of disparate sources of data, a well-defined mapping of the source data to the reconciled data, and effective Extract, Transform, and Load (ETL) processes. Owing to the complexity of data warehouse projects, great emphasis must be placed on an agile-based approach with properly developed and executed test plans throughout the various stages of designing, developing, and implementing the data warehouse to mitigate against budget overruns, missed deadlines, low customer satisfaction, and outright project failures. Yet, there are often attempts to test the data warehouse exactly like traditional back-end databases and legacy applications, or to downplay the role of quality assurance (QA) and testing, which only serve to fuel the frustration and mistrust of data warehouse and business intelligence (BI) systems. In spite of this, there are a number of steps that can be taken to ensure DW/BI solutions are successful, highly trusted, and stable. In particular, adopting a Data Vault (DV)-based Enterprise Data Warehouse (EDW) can simplify and enhance various aspects of testing, and curtail delays common in non-DV based DW projects. A major area of focus in this research is raw DV loads from source systems, keeping transformations to a minimum in the ETL process which loads the DV from the source. Certain load errors, classified as permissible errors and enforced by business rules, are kept in the Data Vault until correct values are supplied. Major transformation activities are pushed further downstream to the next ETL process which loads and refreshes the Data Mart (DM) from the Data Vault
    corecore