    Profiling relational data: a survey

    Profiling data to determine metadata about a given dataset is an important and frequent activity for IT professionals and researchers, and it is necessary for a variety of use cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.
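    As a rough illustration of the simpler single-column statistics mentioned above (null counts, distinct values, frequent value patterns), the following Python sketch computes them for one column. The helper and the sample data are illustrative assumptions, not part of the survey.

```python
from collections import Counter
import re

def profile_column(values):
    """Compute simple single-column profiling statistics.

    `values` is a plain list of cell values (None marks a null).
    Returns the row count, null count, distinct count, and the most
    frequent character patterns of the non-null values.
    """
    non_null = [v for v in values if v is not None]

    def pattern(v):
        # Map each value to a coarse pattern: digits -> 9, letters -> A.
        s = str(v)
        s = re.sub(r"[0-9]", "9", s)
        s = re.sub(r"[A-Za-z]", "A", s)
        return s

    patterns = Counter(pattern(v) for v in non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(map(str, non_null))),
        "top_patterns": patterns.most_common(3),
    }

# Example: a column of phone-like strings with one missing value.
print(profile_column(["030-1234", "040-5678", None, "030-9999"]))
```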

    How to Juggle Columns: An Entropy-Based Approach for Table Compression

    Many relational databases exhibit complex dependencies between data attributes, caused either by the nature of the underlying data or by explicitly denormalized schemas. In data warehouse scenarios, calculated key figures may be materialized or hierarchy levels may be held within a single dimension table. Such column correlations and the resulting data redundancy may result in additional storage requirements. They may also result in bad query performance if inappropriate independence assumptions are made during query compilation. In this paper, we tackle the specific problem of detecting functional dependencies between columns to improve the compression rate of column-based database systems, which both reduces main memory consumption and improves query performance. Although a huge variety of algorithms has been proposed for detecting column dependencies in databases, we maintain that increased data volumes and recent developments in hardware architectures demand novel algorithms with much lower runtime overhead and a smaller memory footprint. Our novel approach is based on entropy estimations and exploits a combination of sampling and multiple heuristics to make it applicable to a wide range of use cases. We demonstrate the quality of our approach by means of an implementation within the SAP NetWeaver Business Warehouse Accelerator. Our experiments indicate that our approach scales well with the number of columns and produces reliable dependence structure information. This both reduces memory consumption and improves performance for nontrivial queries.
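    The entropy-based criterion underlying such approaches can be stated simply: a functional dependency X → Y holds on a relation exactly when H(X, Y) = H(X), i.e. when the conditional entropy H(Y | X) is zero. The sketch below is a minimal illustration of that criterion on a row sample, not the algorithm implemented in the SAP NetWeaver Business Warehouse Accelerator; the sample size and tolerance are assumed parameters.

```python
import math
import random
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def fd_holds(rows, x_col, y_col, sample_size=1000, tol=1e-9):
    """Heuristically test the functional dependency X -> Y on a row sample.

    X -> Y holds on the sample iff H(X, Y) == H(X), i.e. the conditional
    entropy H(Y | X) is (numerically) zero.
    """
    if len(rows) > sample_size:
        rows = random.sample(rows, sample_size)
    xs = [r[x_col] for r in rows]
    xys = [(r[x_col], r[y_col]) for r in rows]
    return entropy(xys) - entropy(xs) <= tol

# Toy example: city determines country, but not the other way around.
rows = [
    {"city": "Berlin", "country": "DE"},
    {"city": "Hamburg", "country": "DE"},
    {"city": "Paris", "country": "FR"},
    {"city": "Berlin", "country": "DE"},
]
print(fd_holds(rows, "city", "country"))   # True
print(fd_holds(rows, "country", "city"))   # False
```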

    Scalability aspects of data cleaning

    Data cleaning has become one of the important pre-processing steps for many data science, data analytics, and machine learning applications. According to a survey by Gartner, more than 25% of the critical data in the world's top companies is flawed, which can result in economic losses amounting to trillions of dollars a year. Over the past few decades, several algorithms and tools have been developed to clean data. However, many of these solutions find it difficult to scale as the amount of data increases. For example, these solutions often involve a quadratic number of tuple-pair comparisons or the generation of all possible column combinations. Both of these tasks can take days to finish if the dataset has millions of tuples or a few hundred columns, which is usually the case for real-world applications. Data cleaning tasks often face a trade-off between scalability and the quality of the solution: one can achieve scalability by performing fewer computations, but at the cost of a lower-quality solution. Therefore, existing approaches exploit this trade-off when they need to scale to larger datasets, settling for a lower-quality solution. Some approaches have considered re-thinking solutions from scratch to achieve scalability and high quality. However, re-designing these solutions from scratch is a daunting task, as it would involve systematically analyzing the space of possible optimizations and then tuning the physical implementations for a specific computing framework, data size, and resources. Another component of these solutions that becomes critical with increasing data size is how the data is stored and fetched. Smaller datasets mostly fit in memory, so accessing them from a data store is not a bottleneck. For large datasets, however, these solutions need to constantly fetch and write data to a data store. As observed in this dissertation, data cleaning tasks have a lifecycle-driven data access pattern that is not well suited to traditional data stores, making these data stores a bottleneck when cleaning large datasets. In this dissertation, we consider scalability as a first-class citizen for data cleaning tasks and propose that scalable, high-quality solutions can be achieved by adopting the following three principles: 1) rewriting existing algorithms in terms of a new set of primitives that allow efficient implementations on multiple computing frameworks; 2) efficiently involving domain experts' knowledge to reduce computation and improve quality; and 3) using an adaptive data store that can transform the data layout based on the access pattern. We make contributions towards each of these principles. First, we present a set of primitive operations for discovering constraints from the data. These primitives facilitate writing efficient distributed implementations of existing discovery algorithms. Next, we present a framework that involves domain experts for faster clustering selection in data de-duplication. This framework asks a domain expert a bounded number of questions and uses the responses to select the best clustering with high accuracy. Finally, we present an adaptive data store that can change the layout of the data based on the workload's access pattern, thereby speeding up data cleaning tasks.
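    To make the quadratic tuple-pair comparisons mentioned above concrete: naive duplicate detection over n tuples compares all n(n-1)/2 pairs. A common way to cut this down is blocking, where only tuples sharing a cheap key are compared. The sketch below illustrates that generic idea only; it is not one of the dissertation's primitives, nor its expert-in-the-loop framework or adaptive data store.

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, block_key):
    """Yield candidate duplicate pairs, comparing only records that
    share a blocking key instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for bucket in blocks.values():
        yield from combinations(bucket, 2)

# Toy example: block person records on the first letter of the last name.
people = [
    {"name": "Jon Smith"}, {"name": "John Smith"},
    {"name": "Ana Diaz"}, {"name": "Anna Diaz"},
]
pairs = list(blocked_pairs(people, lambda r: r["name"].split()[-1][0].lower()))
print(len(pairs))  # 2 candidate pairs instead of 6 exhaustive comparisons
```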

    Eco-friendly database space saving using proxy attributes

    Rapid data growth and inefficient data storage are two concerning issues that are becoming more and more important in green computing. The decision on which eco-friendly technology to use often relies on the amount of carbon footprint produced. Thus, it would be valuable to avoid inefficient electric power utilization by minimizing the physical data storage needed to hold large data volumes. This paper reports the implementation of proxy attributes to reduce space by optimizing the available database space through attribute substitution. We examine a set of proxies retrieved from public databases with respect to their space-saving and accuracy properties. The results indicate that useful proxies are available that offer space saving while maintaining accuracy. The findings contribute to understanding the practicality of proxies and their potential for database space saving.
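    The abstract does not detail how a proxy attribute substitutes for a stored column, but one plausible reading is that a column which is (approximately) determined by another can be dropped and reconstructed through a small mapping. The following sketch illustrates that reading with hypothetical column names; it is an assumption for illustration, not the authors' implementation, and its accuracy depends on how well the proxy actually determines the substituted attribute.

```python
def build_proxy(rows, source_col, target_col):
    """Build a small lookup that reconstructs `target_col` from `source_col`,
    so the target column need not be stored for every row."""
    mapping = {}
    for row in rows:
        mapping.setdefault(row[source_col], row[target_col])
    return mapping

def reconstruct(rows, source_col, mapping):
    """Recover the dropped column through the proxy mapping."""
    return [mapping[row[source_col]] for row in rows]

# Toy example: postal code acts as a proxy for the city column.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "20095", "city": "Hamburg"},
    {"zip": "10115", "city": "Berlin"},
]
proxy = build_proxy(rows, "zip", "city")
print(reconstruct(rows, "zip", proxy))  # ['Berlin', 'Hamburg', 'Berlin']
# Space saving: only len(proxy) mapping entries are stored instead of one
# city value per row; accuracy depends on how well zip determines city.
```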