27 research outputs found

    PRESISTANT: Learning based assistant for data pre-processing

    Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim to assist non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.

    Measuring discord among multidimensional data sources

    Data integration is a classical problem in databases, typically decomposed into schema matching, entity matching, and record merging. To solve the latter, it is mostly assumed that ground truth can be determined, either as master data or from user feedback. However, this assumption often fails: first, because the merging processes cannot be accurate enough, and second, because the data gathering processes in the different sources are imperfect and cannot provide high-quality data. Instead of enforcing consistency, we propose to evaluate how concordant or discordant sources are as a measure of trustworthiness (the more discordant the sources are, the less we can trust their data). Thus, we define the discord measurement problem, in which, given a set of uncertain raw observations or aggregate results (such as case/hospitalization/death data relevant to COVID-19) and information on the alignment of different data (for example, cases and deaths), we wish to assess whether the different sources are concordant or, if not, measure how discordant they are. The work of Alberto Abelló has been done under project PID2020-117191RB-I00 funded by MCIN/AEI/10.13039/501100011033. The work of James Cheney was supported by ERC Consolidator Grant Skye (grant number 682315). Peer reviewed. Postprint (published version).

    Improving Data Quality by Leveraging Statistical Relational Learning

    Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate, and missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational learning (SRL). We argue that a formalism, Markov logic, is a natural fit for modeling data quality rules. Our approach enables probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it removes the need to specify the order of rule execution. We describe how data quality rules expressed as formulas in first-order logic directly translate into the predictive model in our SRL framework.
