
    Data Cleaning

    Course material for the webinar “Data Cleaning”, part of a webinar series on research data management (RDM) organized by UiT The Arctic University of Norway. For more information, please visit: site.uit.no/rdmtrainin

    Data Cleaning Methods for Client and Proxy Logs

    In this paper we present our experiences with the cleaning of Web client and proxy usage logs, based on a long-term browsing study with 25 participants. A detailed clickstream log, recorded using a Web intermediary, was combined with a second log of user interface actions, captured by a modified Firefox browser for a subset of the participants. The consolidated data from both records revealed many page requests that were not directly related to user actions. For participants who had no ad-filtering system installed, these artifacts made up one third of all transferred Web pages. Three major causes could be identified: HTML frames and iframes, advertisements, and automatic page reloads. The experience gained during the data cleaning process may help other researchers choose adequate filtering methods for their own data.
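
    As a side note for readers applying similar filters, the three artifact classes the paper identifies map naturally onto simple per-entry log predicates. The sketch below is a hypothetical Python illustration: the entry fields (host, is_frame, is_auto_reload) and the ad-domain list are invented assumptions, not the paper's actual schema or pipeline.

# A minimal, hypothetical sketch of the three filters described above.
# Entry fields and the ad-domain list are illustrative assumptions only.

AD_DOMAINS = {"doubleclick.net", "ads.example.com"}  # placeholder list

def is_artifact(entry):
    """Flag page requests that were likely not triggered by a user action."""
    host = entry["host"]
    if any(host == d or host.endswith("." + d) for d in AD_DOMAINS):
        return True   # advertisement request
    if entry["is_frame"]:
        return True   # HTML frame or iframe sub-request
    if entry["is_auto_reload"]:
        return True   # meta-refresh or scripted page reload
    return False

raw_log = [
    {"host": "news.example.org", "is_frame": False, "is_auto_reload": False},
    {"host": "ads.example.com",  "is_frame": False, "is_auto_reload": False},
    {"host": "news.example.org", "is_frame": True,  "is_auto_reload": False},
]
cleaned = [e for e in raw_log if not is_artifact(e)]  # keeps only the first entry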

    A revival of integrity constraints for data cleaning

    Integrity constraints, a.k.a. data dependencies, are widely used for improving the quality of schemas. Recently, constraints have enjoyed a revival for improving the quality of data. This tutorial aims to provide an overview of recent advances in constraint-based data cleaning.
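
    To make the constraint-based view concrete, the sketch below detects violations of a single functional dependency (FD). The relation, attribute names, and data are invented for illustration and are not drawn from the tutorial itself.

# Hypothetical sketch: detecting violations of the FD zip -> city.
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Group rows by the FD's left-hand side; any group with more than one
    distinct right-hand-side value violates the dependency."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {k: v for k, v in groups.items() if len(v) > 1}

rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},    # violates zip -> city
    {"zip": "10001", "city": "New York"},
]
print(fd_violations(rows, "zip", "city"))  # {'02139': {'Cambridge', 'Boston'}}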

    Humanized data cleaning

    Integrated master's dissertation in Informatics Engineering. Data science has become one of the most important skills a person can have in the modern world, as data takes an increasingly meaningful role in our lives. Access to data science is nevertheless limited, requiring complicated software or programming knowledge, both of which can be challenging and hard to master even for simple tasks. Currently, cleaning data requires a data scientist: the process of data cleaning, which consists of removing or correcting entries of a data set, is mostly performed with programming languages such as Python and R (kag). However, if this barrier were removed, data cleaning could be performed by people who have better knowledge of the data domain but lack the programming background. We studied the solutions currently available on the market and the type of interface each one uses to interact with end users, such as control-flow interfaces, tabular interfaces, and block-based languages. With this in mind, we approached the issue by building a new data science tool, termed Data Cleaning for All (DCA), that attempts to reduce the knowledge necessary to perform data science tasks, in particular data cleaning and curation. By combining Human-Computer Interaction (HCI) concepts, the tool is simple to use, through direct manipulation and transformation previews; it saves users time by eliminating repetitive tasks and automatically computing many of the common analyses data scientists must perform; and it suggests data transformations based on the contents of the data, allowing for a smarter environment.
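
    As a rough illustration of content-based suggestion, the sketch below inspects a column's raw string values and proposes cleaning steps. The rules and the function are invented assumptions in the spirit of DCA's "smarter environment", not the tool's actual suggestion engine.

# Hypothetical sketch of content-based transformation suggestions.

def suggest(column):
    """Inspect a column's raw string values and propose cleaning steps."""
    suggestions = []
    non_empty = [v for v in column if v not in ("", None)]
    if len(non_empty) < len(column):
        suggestions.append("fill or drop missing values")
    if all(v.strip().replace(".", "", 1).isdigit() for v in non_empty):
        suggestions.append("convert text column to numeric")
    if any(v != v.strip() for v in non_empty):
        suggestions.append("trim surrounding whitespace")
    return suggestions

print(suggest([" 3.14", "2", "", "7.5"]))
# ['fill or drop missing values', 'convert text column to numeric',
#  'trim surrounding whitespace']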

    Online Data Cleaning

    Data-centric applications have never been more ubiquitous in our lives, e.g., search engines, route navigation and social media. This has brought along a new age where digital data is at the core of many decisions we make, as individuals, e.g., looking for the most scenic route to plan a road trip, or as professionals, e.g., analysing customers' transactions to predict the best time to restock different products. However, the surge in data generation has also led to creating massive amounts of dirty data, i.e., inaccurate or redundant data. Using dirty data to inform business decisions comes with dire consequences; for instance, an IBM report estimates that dirty data costs the U.S. $3.1 trillion a year. Dirty data is the product of many factors, which include data entry errors and the integration of several data sources. Data integration of multiple sources is especially prone to producing dirty data: while individual sources may not have redundant data, they often carry redundant data across each other. Furthermore, different data sources may obey different business rules (sometimes not even known), which makes it challenging to reconcile the integrated data. Even if the data is clean at the time of the integration, data updates would compromise its quality over time.

    There is a wide spectrum of errors that can be found in the data, e.g., duplicate records, missing values, obsolete data, etc. To address these problems, several data cleaning efforts have been proposed, e.g., record linkage to identify duplicate records, data fusion to fuse duplicate data items into a single representation, and enforcing integrity constraints on the data. However, most existing efforts make two key assumptions: (1) data cleaning is done in one shot; and (2) the data is available in its entirety. These two assumptions do not hold in our age, where data is highly volatile and integrated from several sources. This calls for a paradigm shift in approaching data cleaning: it has to be made iterative, where data comes in chunks and not all at once. Consequently, cleaning the data should not be repeated from scratch whenever the data changes; instead, it should be done only for the data items affected by the updates. Moreover, the repair should be computed efficiently to support applications where cleaning is performed online (e.g., query-time data cleaning).

    In this dissertation, we present several proposals to realize this paradigm for two major types of data errors: duplicates and integrity constraint violations. We first present a framework that supports online record linkage and fusion over Web databases. Our system processes queries posted to Web databases; query results are deduplicated, fused and then stored in a cache for future reference, and the cache is updated iteratively with new query results. This makes it possible to perform record linkage and fusion not only efficiently but also effectively, i.e., the cache contains data items seen in previous queries, which are jointly cleaned with incoming query results. To address integrity constraint violations, we propose a novel way to approach functional dependency repairs, develop a new class of repairs, and demonstrate that it is superior to existing efforts in both runtime and accuracy. We then show how our framework can be easily tuned to work iteratively to support online applications. We implement a proof-of-concept query answering system to demonstrate the iterative capability of our system.
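
    To illustrate the iterative paradigm, the simplified sketch below deduplicates and fuses incoming query results against a cache of previously cleaned records, so cleaning is incremental rather than repeated from scratch. The blocking key and merge policy are illustrative assumptions, not the dissertation's actual framework.

# Hypothetical sketch of online record linkage and fusion over a cache.

def key(record):
    """Blocking/normalization key; real systems use richer similarity joins."""
    return (record["name"].lower().strip(), record["zip"])

class CleanCache:
    def __init__(self):
        self.records = {}

    def ingest(self, query_results):
        """Fuse new results with cached duplicates; only affected items change."""
        for rec in query_results:
            k = key(rec)
            if k in self.records:
                # Fusion: keep non-null values from either duplicate.
                cached = self.records[k]
                for field, value in rec.items():
                    if cached.get(field) in (None, ""):
                        cached[field] = value
            else:
                self.records[k] = dict(rec)

cache = CleanCache()
cache.ingest([{"name": "Ada Lovelace", "zip": "SW1", "phone": None}])
cache.ingest([{"name": "ada lovelace ", "zip": "SW1", "phone": "555-0101"}])
print(list(cache.records.values()))  # one fused record with the phone filled in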