
    A framework for data cleaning in data warehouses

    Achieving high data quality in data warehouses is a persistent challenge, and data cleaning is a crucial task in meeting it. A range of methods and tools has been developed for this purpose, yet at least two questions remain open: how can the efficiency of data cleaning be improved, and how can its degree of automation be increased? This paper addresses both questions by presenting a novel framework that manages data cleaning in data warehouses by focusing on the use of data quality dimensions and by decoupling the cleaning process into several sub-processes. An initial test run of the processes in the framework demonstrates that the presented approach is efficient and scalable for data cleaning in data warehouses.
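
    As a rough illustration of the decoupling idea (not the paper's actual framework), the following Python sketch keys each cleaning sub-process to the data quality dimension it targets; the dimension names, record fields, and cleaning steps are hypothetical.

```python
# Minimal sketch, assuming hypothetical dimensions and fields; not the paper's framework.
from typing import Callable, Dict, List

Record = Dict[str, object]
CleaningStep = Callable[[List[Record]], List[Record]]

def deduplicate(records: List[Record]) -> List[Record]:
    """Uniqueness dimension: drop exact duplicate records."""
    seen, kept = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

def drop_incomplete(records: List[Record]) -> List[Record]:
    """Completeness dimension: keep records whose required fields are present."""
    required = {"customer_id", "order_date"}  # hypothetical required fields
    return [r for r in records
            if required.issubset(k for k, v in r.items() if v is not None)]

# Each sub-process is registered under the quality dimension it addresses, so the
# sub-processes can be scheduled, measured, and tuned independently of one another.
pipeline: Dict[str, CleaningStep] = {
    "uniqueness": deduplicate,
    "completeness": drop_incomplete,
}

def clean(records: List[Record]) -> List[Record]:
    for dimension, step in pipeline.items():
        records = step(records)
    return records
```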


    Modeling, Annotating, and Querying Geo-Semantic Data Warehouses


    Quarry: A user-centered big data integration platform

    Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data that typically comes from various sources. Doing so inevitably imposes burdensome processes of unifying different data formats and discovering integration paths, all while meeting the specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contributes to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platform accessible to users without deep technical skills (such as statisticians or business analysts) has been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and that provides an intuitive interface to assist end users in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows that integrate the data and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry's functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs). This work is partially supported by the GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R.
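
    To make the hypergraph-metadata idea concrete, here is a minimal Python sketch, not Quarry's actual data model: attributes of source datasets are nodes, and a hyperedge groups attributes that describe the same concept, which encodes a candidate integration path. Source names, attributes, and concepts are hypothetical.

```python
# Minimal sketch of hypergraph-based integration metadata; not Quarry's implementation.
from collections import defaultdict

class MetadataHypergraph:
    """Nodes are (source, attribute) pairs; a hyperedge groups attributes that
    refer to the same real-world concept, i.e. a possible join/integration path."""
    def __init__(self):
        self.nodes = set()                  # (source, attribute) pairs
        self.hyperedges = defaultdict(set)  # concept -> set of (source, attribute)

    def add_source(self, source, attributes):
        for attr in attributes:
            self.nodes.add((source, attr))

    def link(self, concept, *nodes):
        """Declare that several source attributes refer to the same concept."""
        self.hyperedges[concept].update(nodes)

    def integration_paths(self, source_a, source_b):
        """Concepts shared by two sources, i.e. candidate attributes to join on."""
        return [concept for concept, members in self.hyperedges.items()
                if any(s == source_a for s, _ in members)
                and any(s == source_b for s, _ in members)]

# Hypothetical sources loosely inspired by the WHO use case:
g = MetadataHypergraph()
g.add_source("who_cases", ["country_iso", "disease", "year", "cases"])
g.add_source("population", ["iso3", "year", "pop"])
g.link("country", ("who_cases", "country_iso"), ("population", "iso3"))
g.link("year", ("who_cases", "year"), ("population", "year"))
print(g.integration_paths("who_cases", "population"))  # ['country', 'year']
```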

    Data quality problems in TPC-DI based data integration processes

    Many data-driven organisations need to integrate data from multiple, distributed, and heterogeneous sources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. There are various alternative data integration systems, built in-house or provided by vendors, so an organisation needs to compare and benchmark them when choosing one that meets its requirements. Recently, TPC-DI was proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we found typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which can delay or even fail the data integration process. This paper explains the processes of this benchmark and summarises the typical data quality problems identified in the TPC-DI data source. Furthermore, in order to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
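
    As a hedged sketch of one kind of check such guidelines might call for (it is not part of the TPC-DI benchmark or the paper's tooling), the Python snippet below flags attributes whose inferred type is inconsistent across source files before loading; file names and schemas are hypothetical.

```python
# Minimal sketch of a pre-load schema-consistency check; not TPC-DI tooling.
import csv
from collections import defaultdict

def infer_type(value: str) -> str:
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "str"

def schema_of(path: str) -> dict:
    """Infer a column -> type mapping from the first data row of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        row = next(reader)
        return {col: infer_type(val) for col, val in row.items()}

def inconsistent_columns(paths: list) -> dict:
    """Columns whose inferred type differs across the given files."""
    types = defaultdict(set)
    for p in paths:
        for col, t in schema_of(p).items():
            types[col].add((p, t))
    return {col: seen for col, seen in types.items()
            if len({t for _, t in seen}) > 1}

# Hypothetical staging files of a data integration run:
# print(inconsistent_columns(["customers_batch1.csv", "customers_batch2.csv"]))
```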

    A relational algebra approach to ETL modeling

    MAP-i Doctoral Programme in Informatics of the Universities of Minho, Aveiro and Porto.
    Information Technology has been one of the drivers of the revolution currently happening in management decision making in most organizations. The amount of data gathered and processed by computing devices grows every day, providing a valuable source of information for the decision makers who manage every type of organization, public or private. Gathering the right data in a centralized and unified repository such as a data warehouse is akin to building the foundations of a system that will act as a base to support decision-making processes requiring factual information. Nevertheless, building such a repository is very challenging, as is developing all the components of a data warehousing system. One of the most critical components of a data warehousing system is the Extract-Transform-Load (ETL) component, which is responsible for gathering data from information sources and for cleaning, transforming, and conforming it in order to store it in the data warehouse. Several design methodologies for ETL components have been presented in the last few years, with very little impact on commercial ETL tools, essentially because of the gap between the conceptual design of an ETL system and its corresponding physical implementation. The proposed methodologies range from new approaches, with novel notations and diagrams, to the adoption and extension of current standard modeling notations such as UML or BPMN. However, none of these proposals contains enough detail to be translated automatically into a specific execution platform. The use of a standard, well-known notation like Relational Algebra might bridge the gap between the conceptual design and the physical design of an ETL component, mainly due to its formal approach, based on a limited set of operators, and to its functional characteristics, namely being a procedural language operating over data stored in relational format. The abstraction that Relational Algebra provides over the technological infrastructure might also be an advantage for uncommon execution platforms, such as computing grids, which provide an exceptional amount of processing power that is critical for ETL systems. Additionally, partitioning data and distributing tasks over computing nodes work quite well with a Relational Algebra approach. An extensive study of the use of Relational Algebra in the ETL context was conducted to validate its usage. To complement this, a set of Relational Algebra patterns was developed to support the most common ETL tasks, such as change data capture, data quality enforcement, data conciliation and integration, slowly changing dimensions, and surrogate key pipelining. These patterns provide a formal approach to the referred ETL tasks by specifying all the operations needed to accomplish them as a series of Relational Algebra operations. To evaluate the feasibility of the work done in this thesis, we used a real ETL application scenario, extracting data from the operational systems of two different social networks and storing hashtag usage information in a specific data mart; the ability to analyze trends in social network usage is a hot topic in today's media and information coverage. A complete design of the ETL component using the previously developed patterns is also provided, together with a critical evaluation of its usage.
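
    To give a flavour of such a pattern, here is a hedged sketch (not taken from the thesis) of a change data capture step written in relational algebra; the source relation S, the current dimension D, the business key bk, and the non-key attributes attrs are assumed names.

\[
\begin{aligned}
\mathit{New} &= S \mathbin{\triangleright}_{S.\mathit{bk} \,=\, D.\mathit{bk}} D
  && \text{(anti-join: source rows whose business key is not yet in } D\text{)}\\
\mathit{Changed} &= \pi_{S.*}\,\sigma_{S.\mathit{attrs} \,\neq\, D.\mathit{attrs}}\bigl(S \bowtie_{S.\mathit{bk} \,=\, D.\mathit{bk}} D\bigr)
  && \text{(matched rows whose non-key attributes differ)}\\
\Delta &= \mathit{New} \,\cup\, \mathit{Changed}
  && \text{(the delta the downstream ETL steps must insert or update)}
\end{aligned}
\]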

    Data quality evaluation through data quality rules and data provenance.

    The application and exploitation of large amounts of data play an ever-increasing role in today's research, government, and economy. Data understanding and decision making rely heavily on high-quality data; therefore, in many different contexts, it is important to assess the quality of a dataset in order to determine whether it is suitable for a specific purpose. Moreover, as access to and exchange of datasets have become easier and more frequent, and as scientists increasingly use the World Wide Web to share scientific data, there is a growing need to know the provenance of a dataset (i.e., information about the processes and data sources that led to its creation) in order to evaluate its trustworthiness. In this work, data quality rules and data provenance are used to evaluate the quality of datasets. Concerning the first topic, the solution applied consists of identifying types of data constraints that can serve as data quality rules and of developing a software tool to evaluate a dataset against a set of rules expressed in XML. We selected some of the data constraints and dependencies already considered in the data quality field, but we also used order dependencies and existence constraints as quality rules. In addition, we developed algorithms to discover the types of dependencies used in the tool. To deal with the provenance of data, the Open Provenance Model (OPM) was adopted, an experimental query language for querying OPM graphs stored in a relational database was implemented, and an approach to designing OPM graphs was proposed.
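
    As a minimal illustration of two of the rule types mentioned above, the Python sketch below checks an existence constraint and an order dependency over a small dataset; the rule signatures and field names are hypothetical and do not reflect the thesis's actual XML rule format or tool.

```python
# Minimal sketch of an existence constraint and an order dependency as quality rules;
# hypothetical rule format and field names, not the thesis's tool.
from typing import Dict, List

Row = Dict[str, object]

def existence_constraint(rows: List[Row], if_col: str, then_col: str) -> List[int]:
    """If `if_col` is non-null in a row, `then_col` must be non-null too.
    Returns the indexes of violating rows."""
    return [i for i, r in enumerate(rows)
            if r.get(if_col) is not None and r.get(then_col) is None]

def order_dependency(rows: List[Row], left: str, right: str) -> List[int]:
    """Ordering rows by `left` must also order them by `right`
    (e.g. order_date determines ship_date order). Returns violating indexes."""
    ordered = sorted(range(len(rows)), key=lambda i: rows[i][left])
    violations = []
    for prev, cur in zip(ordered, ordered[1:]):
        if rows[cur][right] < rows[prev][right]:
            violations.append(cur)
    return violations

rows = [
    {"order_date": 1, "ship_date": 2, "discount": 0.1, "discount_reason": "promo"},
    {"order_date": 2, "ship_date": 1, "discount": 0.2, "discount_reason": None},
]
print(existence_constraint(rows, "discount", "discount_reason"))  # [1]
print(order_dependency(rows, "order_date", "ship_date"))          # [1]
```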