4 research outputs found

    Contributions Ă  l’Optimisation de RequĂȘtes Multidimensionnelles

    Analyzing data consists in choosing a subset of the dimensions that describe it in order to extract useful information. However, the "interesting" dimensions are rarely known a priori. The analysis then turns into an exploratory activity in which each pass translates into a query. It therefore becomes essential to propose query optimization solutions that take a global view of the process, rather than optimizing each query independently of the others. We present our contributions within this exploratory approach, focusing on three types of queries: (i) the computation of borders, (ii) so-called OLAP (On-Line Analytical Processing) queries over data cubes, and (iii) skyline-style preference queries.
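    To make the third query type concrete, here is a minimal sketch of a skyline (Pareto-optimal) query in Python, assuming lower values are preferred on every dimension; the hotel data and function names are illustrative and are not taken from the thesis.

        def dominates(p, q):
            """p dominates q if p is at least as good on every dimension and strictly
            better on at least one (here: lower values are preferred)."""
            return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

        def skyline(points):
            """Naive block-nested-loop skyline: keep the tuples no other tuple dominates."""
            return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

        # Illustrative data: hotels described by (price, distance_to_beach), both minimized.
        hotels = [(120, 2.0), (80, 3.5), (200, 0.5), (150, 1.0), (180, 2.5)]
        print(skyline(hotels))  # [(120, 2.0), (80, 3.5), (200, 0.5), (150, 1.0)]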

    Scalability aspects of data cleaning

    Data cleaning has become one of the important pre-processing steps for many data science, data analytics, and machine learning applications. According to a survey by Gartner, more than 25% of the critical data in the world's top companies is flawed, which can result in economic losses amounting to trillions of dollars a year. Over the past few decades, several algorithms and tools have been developed to clean data. However, many of these solutions struggle to scale as the amount of data grows over time. For example, they often involve a quadratic number of tuple-pair comparisons or the generation of all possible column combinations. Both of these tasks can take days to finish if the dataset has millions of tuples or a few hundred columns, which is usually the case for real-world applications. Data cleaning tasks often face a trade-off between scalability and the quality of the solution: one can achieve scalability by performing fewer computations, but at the cost of a lower-quality result. Therefore, existing approaches exploit this trade-off when they need to scale to larger datasets, settling for a lower-quality solution. Some approaches have considered re-thinking solutions from scratch to achieve both scalability and high quality. However, re-designing these solutions from scratch is a daunting task, as it would involve systematically analyzing the space of possible optimizations and then tuning the physical implementations for a specific computing framework, data size, and set of resources. Another component of these solutions that becomes critical with increasing data size is how the data is stored and fetched. For smaller datasets, most of the data fits in memory, so accessing it from a data store is not a bottleneck. For large datasets, however, these solutions need to constantly fetch data from and write it back to a data store. As observed in this dissertation, data cleaning tasks have a lifecycle-driven data access pattern that is not suited to traditional data stores, making these stores a bottleneck when cleaning large datasets. In this dissertation, we treat scalability as a first-class citizen for data cleaning tasks and propose that scalable, high-quality solutions can be achieved by adopting three principles: 1) a primitive-based rewriting of existing algorithms that enables efficient implementations on multiple computing frameworks, 2) efficiently involving a domain expert's knowledge to reduce computation and improve quality, and 3) an adaptive data store that can transform the data layout based on the access pattern. We make contributions towards each of these principles. First, we present a set of primitive operations for discovering constraints from data; these primitives facilitate rewriting existing discovery algorithms as efficient distributed implementations. Next, we present a framework that involves domain experts for faster clustering selection in data de-duplication: it asks a bounded number of queries to a domain expert and uses the responses to select the best clustering with high accuracy. Finally, we present an adaptive data store that changes the layout of the data based on the workload's access pattern, thereby speeding up data cleaning tasks.
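    As an illustration of the quadratic tuple-pair comparisons mentioned above, the sketch below contrasts naive candidate generation for de-duplication with standard blocking. Blocking is a well-known technique used here only for illustration; the record data and function names are hypothetical and not the dissertation's own primitives.

        from itertools import combinations
        from collections import defaultdict

        def all_pairs(records):
            """Naive candidate generation: O(n^2) tuple-pair comparisons."""
            return list(combinations(range(len(records)), 2))

        def blocked_pairs(records, key):
            """Blocking: only compare tuples that share a blocking key (e.g. the same
            zip code), which typically cuts the number of comparisons drastically."""
            blocks = defaultdict(list)
            for i, r in enumerate(records):
                blocks[key(r)].append(i)
            return [pair for ids in blocks.values() for pair in combinations(ids, 2)]

        records = [
            {"name": "Alice Smith", "zip": "10115"},
            {"name": "Alice Smyth", "zip": "10115"},
            {"name": "Bob Jones",   "zip": "80331"},
            {"name": "Bob Jonas",   "zip": "80331"},
            {"name": "Carol King",  "zip": "50667"},
        ]
        print(len(all_pairs(records)))                          # 10 candidate pairs
        print(len(blocked_pairs(records, lambda r: r["zip"])))  # 2 candidate pairs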

    Parallel Mining of Dependencies

    The problem of extracting functional dependencies (FDs) from databases has a long history dating back to the 1990s. Still, efficient solutions that take into account both hardware evolution, namely the advent of multicore machines, and the amount of data to be mined are needed. In this paper we propose a parallel algorithm which, with small modifications, extracts (i) the minimal keys, (ii) the minimal exact FDs, (iii) the minimal approximate FDs, and (iv) the conditional functional dependencies (CFDs) holding in a table. Under some natural conditions, we prove a theoretical speed-up of our solution with respect to a baseline algorithm that follows a depth-first search strategy. Since mining most of these dependencies requires a procedure for computing the number of distinct values (NDV), which is a space-consuming operation, we show how sketching techniques for estimating NDV can be used to reduce both memory consumption and communication overhead when considering distributed data, while guaranteeing a certain quality of the result. Our solution is implemented, and experimental results showing the efficiency and scalability of our proposal are reported; most notably, the theoretical speed-ups are confirmed by the experiments.
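    The paper does not specify which NDV sketch it relies on; as one possibility, the sketch below shows a KMV (k-minimum values) estimator in Python, a standard technique whose summaries can be merged across data partitions without shipping raw values. All names, parameters, and data here are illustrative assumptions, not the paper's implementation.

        import hashlib

        def _h(value):
            """Hash a value to a float in [0, 1) using a 128-bit MD5 digest."""
            digest = hashlib.md5(str(value).encode()).hexdigest()
            return int(digest, 16) / 16**32

        def kmv_sketch(values, k=256):
            """Keep the k smallest distinct normalized hash values of the column."""
            return sorted(set(_h(v) for v in values))[:k]

        def estimate_ndv(sketch, k=256):
            """KMV estimate: if the k-th smallest normalized hash is h_k, NDV ~ (k - 1) / h_k."""
            if len(sketch) < k:           # fewer than k distinct values: the sketch is exact
                return len(sketch)
            return int((k - 1) / sketch[k - 1])

        def merge(s1, s2, k=256):
            """Sketches are mergeable: combine partitions without exchanging raw data."""
            return sorted(set(s1) | set(s2))[:k]

        # Example: estimate the NDV of a column split over two overlapping partitions.
        part1 = [f"user{i}" for i in range(0, 6000)]
        part2 = [f"user{i}" for i in range(4000, 10000)]
        sketch = merge(kmv_sketch(part1), kmv_sketch(part2))
        print(estimate_ndv(sketch))   # close to the true NDV of 10000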