399 research outputs found

    A Machine Learning Trainable Model to Assess the Accuracy of Probabilistic Record Linkage

    Record linkage (RL) is the process of identifying and linking data that relate to the same physical entity across multiple heterogeneous data sources. Deterministic linkage methods rely on the presence of common, uniquely identifying attributes across all sources, while probabilistic approaches use non-unique attributes and calculate similarity indexes for pairwise comparisons. A key component of record linkage is accuracy assessment: the process of manually verifying and validating matched pairs to further refine linkage parameters and increase overall effectiveness. This process, however, is time-consuming and impractical when applied to large administrative data sources where millions of records must be linked. Additionally, it is potentially biased, as the gold standard used is often the reviewer's intuition. In this paper, we present an approach for assessing and refining the accuracy of probabilistic linkage based on different supervised machine learning methods (decision trees, naïve Bayes, logistic regression, random forest, linear support vector machines and gradient boosted trees). We used data sets extracted from large Brazilian socioeconomic and public health care data sources. These models were evaluated using receiver operating characteristic plots, sensitivity, specificity and positive predictive values obtained from 10-fold cross-validation. Results show that logistic regression outperforms the other classifiers and enables the creation of a generalized, very accurate model to validate linkage results.
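
    As a rough illustration of the evaluation protocol described in this abstract, the sketch below trains a logistic regression classifier on pairwise similarity features and scores it with 10-fold cross-validation using sensitivity, specificity, positive predictive value and ROC AUC. The data and the three similarity columns are synthetic placeholders, since the Brazilian cohorts and the exact linkage features are not reproduced here.

```python
# Hedged sketch: logistic regression over pairwise similarity features,
# evaluated with 10-fold cross-validation. All data below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n_pairs = 2000
# Each row is one candidate record pair; columns are similarity indexes
# (e.g. name, date of birth, municipality), each in [0, 1].
X = rng.random((n_pairs, 3))
y = (X.mean(axis=1) + 0.1 * rng.standard_normal(n_pairs) > 0.6).astype(int)  # 1 = true match

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
sens, spec, ppv, auc = [], [], [], []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]
    pred = (prob >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
    sens.append(tp / (tp + fn))
    spec.append(tn / (tn + fp))
    ppv.append(tp / (tp + fp))
    auc.append(roc_auc_score(y[test_idx], prob))

print(f"sensitivity={np.mean(sens):.3f}  specificity={np.mean(spec):.3f}  "
      f"PPV={np.mean(ppv):.3f}  AUC={np.mean(auc):.3f}")
```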

    Quality and complexity measures for data linkage and deduplication

    Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity. Key words: data or record linkage, data integration and matching, deduplication, data mining pre-processing, quality and complexity measures
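
    The following toy arithmetic, using assumed numbers, illustrates the chapter's point that quality measured over the full space of record pair comparisons can be deceptive: non-matches dominate so heavily that a matcher which rejects every pair still looks near-perfect on accuracy, while its precision and recall on the match class are worthless.

```python
# Toy illustration (made-up numbers): two files of 1,000 records each give
# 1,000,000 candidate pairs, of which only ~1,000 are true matches.
true_matches = 1_000
candidate_pairs = 1_000_000
non_matches = candidate_pairs - true_matches

# A "classifier" that declares every candidate pair a non-match:
tp, fp = 0, 0
tn, fn = non_matches, true_matches

accuracy = (tp + tn) / candidate_pairs            # 0.999 -- looks excellent
recall = tp / (tp + fn)                           # 0.0   -- finds no matches at all
precision = tp / (tp + fp) if (tp + fp) else 0.0  # undefined, treated as 0 here
print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  recall={recall:.3f}")
```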

    Automatic detection of relationships between banking operations using machine learning

    In their daily business, bank branches must register their operations in several systems in order to share information with other branches and to maintain a central repository of records. In this way, information can be analysed and processed according to different requirements: fraud detection, accounting or legal obligations. Within this context, there is increasing use of big data and artificial intelligence techniques to improve customer experience. Our research focuses on detecting matches between bank operation records by means of applied intelligence techniques in a big data environment together with business intelligence analytics. The business analytics function allows relationships to be established and comparisons to be made between variables from the bank's daily business. Finally, the results obtained show that the framework is able to detect relationships between banking operation records, starting from non-homogeneous information and taking into account the large volume of data involved in the process. This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (DeepEMR project, TIN2017-87548-C2-1-R).

    Automatic learning framework for pharmaceutical record matching

    Pharmaceutical manufacturers need to analyse a vast number of products in their daily activities. Often, the same product is registered several times by different systems using different attributes, and these companies require accurate, high-quality information about their products, since these products are drugs. The central hypothesis of this research work is that machine learning can be applied to this domain to efficiently merge different data sources and match the records related to the same product. No human is able to do this in a reasonable way because the number of records to be matched is extremely high. This article presents a framework for pharmaceutical record matching based on machine learning techniques in a big data environment. The proposed framework exploits the well-known rules for matching records from different databases to train machine learning models. The trained models are then evaluated by predicting matches with records that do not follow these known rules. Finally, the production environment is simulated by generating a huge number of combinations of records and predicting the matches. The results show that, despite the good performance obtained on the training datasets, the average accuracy of the best model in the production environment is around 85%. This shows that matches which do not follow the known rules can be predicted and, considering that there is no way for humans to process this amount of data, the results are promising. This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain, through the DeepEMR Project, under Grant TIN2017-87548-C2-1-
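
    A minimal sketch of the training strategy described above, assuming a scikit-learn classifier: known matching rules provide weak labels for candidate record pairs, a model is fitted on those labelled pairs, and it can then score pairs the rules cannot decide. The field names (national_code, atc_code, name, dosage_mg) and the rule itself are hypothetical placeholders, not the authors' actual rules or schema.

```python
# Hedged sketch: rule-labelled pairs train a matcher that later scores
# pairs the rules cannot decide. Fields and the rule are placeholders.
from sklearn.ensemble import RandomForestClassifier

def token_overlap(x, y):
    """Jaccard overlap between the word sets of two strings."""
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / max(len(sx | sy), 1)

def similarity_features(a, b):
    """Per-field similarity of two product records (placeholder fields)."""
    return [
        float(a["atc_code"] == b["atc_code"]),
        token_overlap(a["name"], b["name"]),
        1.0 - abs(a["dosage_mg"] - b["dosage_mg"]) / max(a["dosage_mg"], b["dosage_mg"], 1),
    ]

def rule_label(a, b):
    """Hypothetical known rule: an identical national drug code means a match."""
    if a["national_code"] and a["national_code"] == b["national_code"]:
        return 1
    if a["national_code"] and b["national_code"]:
        return 0
    return None  # the rule cannot decide; left for the model to predict

def train_matcher(candidate_pairs):
    """Fit a classifier only on pairs the known rules can label."""
    X, y = [], []
    for a, b in candidate_pairs:
        label = rule_label(a, b)
        if label is not None:
            X.append(similarity_features(a, b))
            y.append(label)
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
```

    In production, pairs the rule cannot decide would then be scored with the fitted model's predict_proba, mirroring the evaluation-on-unruled-records step described in the abstract.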

    Machine Learning Methods for Brain Image Analysis

    Understanding how the brain functions and quantifying the compound interactions between complex synaptic networks inside the brain remain some of the most challenging problems in neuroscience. Scarcity or overabundance of data, shortage of manpower, and the heterogeneity of data coming from various species all add complexity to an already perplexing problem. Processing such vast amounts of brain data needs to be done automatically, yet with an accuracy close to manual, human-level performance. These automated methods also need to generalize well in order to accommodate data from different species. In addition, novel approaches and techniques are becoming a necessity to reveal the correlations between different data modalities in the brain at the global level. In this dissertation, I mainly focus on two problems: automatic segmentation of brain electron microscopy (EM) images and stacks, and integrative analysis of gene expression and synaptic connectivity in the brain. I propose to use deep learning algorithms for the 2D segmentation of EM images. I designed an automated pipeline with novel insights that was able to achieve state-of-the-art performance on the segmentation of the Drosophila brain. I also propose a novel technique for 3D segmentation of EM image stacks that can be trained end-to-end with no prior knowledge of the data. This technique was evaluated in an ongoing online challenge for 3D segmentation of neurites, where it achieved accuracy close to that of a second human observer. Later, I employed ensemble learning methods to perform the first systematic integrative analysis of the genome and connectome in the mouse brain at both the regional and voxel level. I show that connectivity signals can be predicted from gene expression signatures with extremely high accuracy. Furthermore, I show that only a certain fraction of genes is responsible for this predictive power. Rich functional and cellular analyses of these genes are detailed to validate these findings.
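
    As a hedged sketch of the second problem (integrative genome-connectome analysis), the snippet below fits an ensemble classifier to predict a binary connectivity signal from per-region gene expression and then ranks genes by importance. The data are random placeholders; the dissertation's mouse-brain expression and connectome datasets are not reproduced here.

```python
# Hedged sketch: ensemble model predicting a connectivity signal from gene
# expression signatures, then ranking genes by importance. Synthetic data only.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_regions, n_genes = 200, 500
expression = rng.random((n_regions, n_genes))                  # expression signature per region
connected = (expression[:, :10].sum(axis=1) > 5).astype(int)   # toy connectivity label

model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, expression, connected, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean())

# Which genes carry the predictive signal?
model.fit(expression, connected)
top_genes = np.argsort(model.feature_importances_)[::-1][:20]
print("most predictive gene indices:", top_genes)
```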

    Detection of non-technical losses in smart meter data based on load curve profiling and time series analysis

    The advent and progressive deployment of the so-called Smart Grid has unleashed a profitable portfolio of new possibilities for efficient management of the low-voltage distribution network, supported by the introduction of information and communication technologies to exploit its digitalization. Among these possibilities, this work focuses on the detection of anomalous energy consumption traces: regardless of whether they are due to malfunctioning metering equipment or to fraud, utilities invest considerable effort in detecting such outlying events and addressing them to optimize power distribution and avoid significant revenue losses. In this context, this manuscript introduces a novel algorithmic approach for the identification of consumption outliers in Smart Grids that relies on concepts from probabilistic data mining and time series analysis. A key ingredient of the proposed technique is its ability to accommodate time irregularities (shifts and warps) in the consumption habits of the user by concentrating on the shape of the consumption rather than on its temporal properties. Simulation results over real data from a Spanish utility are presented and discussed, from which it is concluded that the proposed approach excels at detecting the different outlier cases emulated on the aforementioned consumption traces. This work was supported by the Ministerio de Energía y Competitividad under the RETOS program (OSIRIS project, grant ref. RTC-2014-1556-3).
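
    The shift-and-warp tolerance described above can be approximated with a shape-based distance. The sketch below uses dynamic time warping, which is an assumption on my part rather than the manuscript's exact probabilistic model, to flag daily load curves whose shape deviates strongly from a customer's typical profile.

```python
# Hedged sketch: DTW-based, shape-oriented screening of daily load curves.
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def flag_outlier_days(daily_curves, z_threshold=3.0):
    """Flag days whose DTW distance to the median profile is unusually large."""
    reference = np.median(daily_curves, axis=0)          # typical consumption shape
    dists = np.array([dtw_distance(day, reference) for day in daily_curves])
    z = (dists - dists.mean()) / (dists.std() + 1e-9)
    return np.where(z > z_threshold)[0]

# Example: 60 days of hourly readings with one anomalous day injected.
rng = np.random.default_rng(1)
curves = np.tile(np.sin(np.linspace(0, 2 * np.pi, 24)) + 1.5, (60, 1))
curves += 0.1 * rng.standard_normal(curves.shape)
curves[42] *= 0.1                                        # emulate a tampered meter
print(flag_outlier_days(curves))                         # expected: [42]
```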

    Advancing duplicate question detection with deep learning

    Cross-Silo Federated Learning Across Divergent Domains with Iterative Parameter Alignment

    Learning from the collective knowledge of data dispersed across private sources can provide neural networks with enhanced generalization capabilities. Federated learning, a method for collaboratively training a machine learning model across remote clients, achieves this by combining client models via the orchestration of a central server. However, current approaches face two critical limitations: (i) they struggle to converge when client domains are sufficiently different, and (ii) current aggregation techniques produce an identical global model for each client. In this work, we address these issues by reformulating the typical federated learning setup: rather than learning a single global model, we learn N models, each optimized for a common objective. To achieve this, we apply a weighted distance minimization to model parameters shared in a peer-to-peer topology. The resulting framework, Iterative Parameter Alignment, applies naturally to the cross-silo setting and has the following properties: (i) a unique solution for each participant, with the option to globally converge each model in the federation, and (ii) an optional early-stopping mechanism to elicit fairness among peers in collaborative learning settings. These characteristics jointly provide a flexible new framework for iteratively learning from peer models trained on disparate datasets. We find that the technique achieves competitive results on a variety of data partitions compared to state-of-the-art approaches. Further, we show that the method is robust to divergent domains (i.e. disjoint classes across peers) where existing approaches struggle. Comment: Published at IEEE Big Data 202
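
    A minimal numerical sketch of the alignment idea described in the abstract, assuming a toy quadratic local objective: each peer keeps its own parameter vector and, alongside local gradient steps, a weighted distance term pulls it toward the other peers' parameters. The objective, the uniform weighting and the peer-to-peer exchange are simplified placeholders rather than the paper's exact formulation.

```python
# Hedged sketch of peer-to-peer parameter alignment: each peer takes a local
# gradient step on its own (toy) objective plus a step that minimizes a
# weighted distance to the other peers' parameters. Placeholders throughout.
import numpy as np

rng = np.random.default_rng(0)
n_peers, dim = 3, 5
peers = [rng.standard_normal(dim) for _ in range(n_peers)]      # one model per peer
datasets = [rng.standard_normal((100, dim)) for _ in range(n_peers)]
weights = np.full((n_peers, n_peers), 1.0 / (n_peers - 1))      # uniform peer weighting

def local_gradient(w, X):
    """Gradient of a toy ridge-style local objective for one peer."""
    return X.T @ (X @ w) / len(X) + 0.01 * w

def alignment_round(peers, lr=0.1, align_strength=0.5):
    updated = []
    for i, w in enumerate(peers):
        # gradient of the weighted squared distance to every other peer's parameters
        pull = sum(weights[i, j] * (peers[j] - w) for j in range(n_peers) if j != i)
        updated.append(w - lr * local_gradient(w, datasets[i]) + align_strength * lr * pull)
    return updated

for _ in range(50):
    peers = alignment_round(peers)
print("max parameter spread across peers:", np.std(np.stack(peers), axis=0).max())
```

    Each peer ends up with its own solution shaped by its own data, while the alignment term keeps the federation's models from drifting arbitrarily far apart, which is the behaviour the abstract attributes to the cross-silo setting.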