
    A scaling approach to record linkage

    With the increasing availability of large datasets derived from administrative and other sources, there is growing demand for successful linkage of these to provide rich sources of data for further analysis. Variation in the quality of the identifiers used to carry out linkage means that existing approaches often rely on 'probabilistic' models, which rest on a number of assumptions and can make heavy computational demands. In this paper, we suggest a new approach to classifying record pairs in linkage, based upon weights (scores) derived using a scaling algorithm. The proposed method does not rely on training data, is computationally fast, requires only moderate amounts of storage and has intuitive appeal. Copyright © 2017 John Wiley & Sons, Ltd.
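The abstract does not spell out how the scaled weights are turned into linkage decisions. As a point of reference only, the sketch below shows the generic weight-and-threshold classification of record pairs that scaling or probabilistic scores typically feed into; the field names, weights and thresholds are hypothetical and are not taken from the paper.

```python
# Illustrative only: classify candidate record pairs from per-field agreement
# weights and two thresholds. The weights here are hand-picked placeholders;
# the paper derives its scores with a scaling algorithm not reproduced here.

AGREE_WEIGHT = {"surname": 4.0, "dob": 5.0, "postcode": 2.5}        # hypothetical
DISAGREE_WEIGHT = {"surname": -2.0, "dob": -4.0, "postcode": -1.0}  # hypothetical

def pair_score(rec_a: dict, rec_b: dict) -> float:
    """Sum field-level weights: positive on agreement, negative otherwise."""
    score = 0.0
    for field in AGREE_WEIGHT:
        if rec_a.get(field) and rec_a.get(field) == rec_b.get(field):
            score += AGREE_WEIGHT[field]
        else:
            score += DISAGREE_WEIGHT[field]
    return score

def classify(score: float, upper: float = 6.0, lower: float = 0.0) -> str:
    """Three-way decision: link, clerical review, or non-link."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "review"

a = {"surname": "Smith", "dob": "1980-01-02", "postcode": "AB1 2CD"}
b = {"surname": "Smith", "dob": "1980-01-02", "postcode": "XY9 8ZW"}
print(classify(pair_score(a, b)))  # agreement on surname and dob -> "link"
```

In practice the pairwise loop would be restricted to blocked candidate pairs rather than all cross-pairs, which is what keeps the comparison space and storage requirements moderate.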

    Big Data - Supply Chain Management Framework for Forecasting: Data Preprocessing and Machine Learning Techniques

    This article systematically identifies and comparatively analyzes state-of-the-art supply chain (SC) forecasting strategies and technologies. A novel framework is proposed that incorporates Big Data Analytics in SC Management (problem identification, data sources, exploratory data analysis, machine-learning model training, hyperparameter tuning, performance evaluation, and optimization) together with the effects of forecasting on the human workforce, inventory, and the overall SC. The article first discusses which data to collect according to the SC strategy and how to collect them, and then the need for different types of forecasting according to the time period or SC objective. SC KPIs and error-measurement systems are recommended for optimizing the top-performing model. The adverse effects of phantom inventory on forecasting are illustrated, as is the dependence of managerial decisions on SC KPIs for determining model performance parameters and improving operations management, transparency, and planning efficiency. The cyclic connection within the framework introduces preprocessing optimization based on the post-process KPIs, optimizing the overall control process (inventory management, workforce determination, cost, production and capacity planning). The contribution of this research lies in the proposed standard SC process framework, the recommended forecasting data analysis, the analysis of forecasting effects on SC performance, the machine-learning algorithm optimization followed, and in shedding light on future research.
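As a concrete, hedged illustration of the model-training, hyperparameter-tuning and performance-evaluation stages of such a framework, the sketch below fits a gradient-boosted forecaster to synthetic weekly demand with time-ordered cross-validation and a MAPE hold-out KPI. The library choice (scikit-learn), lag features and KPI are assumptions for illustration, not the article's prescribed setup.

```python
# Minimal sketch of the train / tune / evaluate cycle on synthetic weekly
# demand data. Model choice, lag features and the MAPE KPI are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.metrics import mean_absolute_percentage_error

rng = np.random.default_rng(0)
weeks = np.arange(200)
demand = 100 + 10 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 5, size=200)

# Lag features: the previous 4 weeks of demand predict the current week.
X = np.column_stack([np.roll(demand, k) for k in range(1, 5)])[4:]
y = demand[4:]

# Hyperparameter tuning with time-ordered cross-validation (no leakage).
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_percentage_error",
)
search.fit(X[:-20], y[:-20])

# Hold-out evaluation on the most recent 20 weeks (a forecasting-style KPI).
pred = search.predict(X[-20:])
print("MAPE:", mean_absolute_percentage_error(y[-20:], pred))
```

Using TimeSeriesSplit rather than a shuffled split keeps future observations out of the training folds, which matters for any forecasting KPI fed back into the framework's optimization loop.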

    A hierarchical Bayesian approach to record linkage and population size problems

    We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be profitably adopted both in record linkage problems and in capture-recapture setups, where the size of a finite population is the real object of interest. There are at least two important differences between the proposed model-based approach and the current practice in record linkage. First, the statistical model is built up on the actually observed categorical variables and no reduction (to 0-1 comparisons) of the available information takes place. Second, the hierarchical structure of the model allows a two-way propagation of the uncertainty between the parameter estimation step and the matching procedure, so that no plug-in estimates are used and the correct uncertainty is accounted for both in estimating the population size and in performing the record linkage. We illustrate and motivate our proposal through a real data example and simulations. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), DOI: http://dx.doi.org/10.1214/10-AOAS447.
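The abstract's point about two-way propagation of uncertainty can be illustrated with a deliberately simplified toy: instead of plugging a point estimate of the number of matched records into a capture-recapture formula, carry the linkage uncertainty through to the population size. The counts, the Beta-Binomial treatment of the match count and the Lincoln-Petersen estimator below are stand-ins chosen only to show the effect of avoiding plug-in estimates; they are not the paper's hierarchical model.

```python
# Toy illustration of propagating linkage uncertainty into a population size
# estimate. All counts and the Beta prior are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_a, n_b = 500, 450        # records captured in list A and list B
m_obs = 60                 # pairs the linkage step believes are matches
pairs_reviewed = 80        # candidate pairs examined for those decisions

# Instead of plugging in m_obs, treat the true match count as uncertain:
# a posterior over the match rate among candidates, then over the match count.
match_rate = rng.beta(1 + m_obs, 1 + pairs_reviewed - m_obs, size=10_000)
m = rng.binomial(pairs_reviewed, match_rate)
m = np.clip(m, 1, None)                      # avoid division by zero

# Lincoln-Petersen population size for each draw of the match count.
N = n_a * n_b / m
lo, hi = np.percentile(N, [2.5, 97.5])
print(f"Posterior-style interval for N: [{lo:.0f}, {hi:.0f}]")
```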

    Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach

    The value derivable from data has been increasing continuously for some years. Both commercial and non-commercial organisations have realised the immense benefits that could be derived if all data at their disposal could be analysed and used as the basis for decision-making. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to generate value from huge amounts of data, often in non-structured formats and produced extremely frequently. However, the potential value derivable depends on the general level of data governance, and more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy in the Big Data context. This dissertation investigated effective methods to enhance data quality for Big Data. The principal deliverable of this research is a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual (that is, a non-generalizable field), this research study applies the methodological approach to one use case, Electronic Health Records (EHR).
The first main contribution to knowledge of this study is a systematic investigation of which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence, with a special focus on machine learning, could improve the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Based on the experiments carried out, regression and clustering algorithms proved the most suitable for accuracy-related and completeness-related issues respectively. However, the limits of implementing and using machine learning algorithms for detecting data quality issues in Big Data were also revealed and discussed in this research study. It can safely be deduced from this part of the study that the use of machine learning to enhance the detection of data quality issues is a promising area but not yet a panacea that automates the entire process. The third important contribution is a proposed guideline for undertaking data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered as areas of practice which efficient data reparation algorithms must focus upon.
These three contributions form the nucleus of a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed in the proposed methodological approach can, to a large extent, be transposed to other industries and use cases. The proposed data quality methodological approach can be used by practitioners of Big Data quality who follow a data-driven strategy. As opposed to existing Big Data quality frameworks, the proposed methodological approach has the advantage of being more precise and specific: it gives clear and proven methods to undertake the main identified stages of a Big Data quality lifecycle and can therefore be applied by practitioners in the area. This research study provides promising results and deliverables and paves the way for further research. The technologies underpinning Big Data are evolving rapidly, and future research should focus on new representations of Big Data, the real-time streaming aspect, and replicating the research methods used in this study on new technologies to validate the current results.
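To make the second contribution concrete, the sketch below illustrates, under stated assumptions, the two ideas reported above: flagging likely accuracy errors via regression residuals, and characterising completeness problems by clustering missingness patterns. The table, column names, injected errors and thresholds are all invented for illustration and are not the dissertation's actual pipeline.

```python
# Hedged sketch: regression residuals to flag accuracy issues, and clustering
# of missingness indicators to characterise completeness issues.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 1_000
ehr = pd.DataFrame({
    "age": rng.integers(18, 90, n).astype(float),
    "weight_kg": rng.normal(75, 12, n),
    "systolic_bp": rng.normal(120, 12, n),
})
ehr.loc[rng.choice(n, 50, replace=False), "systolic_bp"] *= 10    # injected typos
ehr.loc[rng.choice(n, 200, replace=False), "weight_kg"] = np.nan  # injected gaps

# Accuracy: large residuals from a regression on related fields suggest errors.
complete = ehr.dropna()
model = LinearRegression().fit(complete[["age", "weight_kg"]], complete["systolic_bp"])
resid = complete["systolic_bp"] - model.predict(complete[["age", "weight_kg"]])
suspect = complete[np.abs(resid) > 3 * resid.std()]
print("rows flagged for accuracy review:", len(suspect))

# Completeness: cluster records by which fields are missing to find patterns.
missing_pattern = ehr.isna().astype(int)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(missing_pattern)
print(pd.Series(labels).value_counts())
```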

    Efficient estimation algorithms for large and complex data sets

    The recent world-wide surge in available data allows the investigation of many new and sophisticated questions that were inconceivable just a few years ago. However, two types of data sets often complicate the subsequent analysis: data that is simple in structure but large in size, and data that is small in size but complex in structure. Both kinds of problems arise in biological data. For example, data sets acquired from family studies, where the data can be visualized as pedigrees, are small in size but, because of the dependencies within families, complex in structure. By comparison, next-generation sequencing data, such as data from chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq), is simple in structure but large in size. Even though the available computational power is increasing steadily, it often cannot keep up with the massive amounts of new data being acquired. In these situations, ordinary methods are no longer applicable or scale badly with increasing sample size. The challenge is then to adapt common algorithms to modern data sets. This dissertation considers the challenge of performing inference on modern data sets and approaches the problem in two parts: first using a problem from the field of genetics, and then one from molecular biology.
In the first part, we focus on data of a complex nature. Specifically, we analyze data from a family study on colorectal cancer (CRC). To model familial clusters of increased cancer risk, we assume inheritable but latent variables for a risk factor that increases the hazard rate for the occurrence of CRC. During parameter estimation, the inheritability of this latent variable necessitates a marginalization of the likelihood that is costly in time for large families. We first approached this problem by implementing computational accelerations that reduced the time for an optimization by the Nelder-Mead method to about 10% of a naive implementation. In a next step, we developed an expectation-maximization (EM) algorithm that works on data obtained from pedigrees. To achieve this, we used factor graphs to factorize the likelihood into a product of “local” functions, which enabled us to apply the sum-product algorithm in the E-step, reducing the computational complexity from exponential to linear. Our algorithm thus enables parameter estimation for family studies in a feasible amount of time.
In the second part, we turn to ChIP-Seq data. Previously, practitioners were required to assemble a set of tools based on different statistical assumptions and dedicated to specific applications such as calling protein occupancy peaks or testing for differential occupancy between experimental conditions. To remove these restrictions and create a unified framework for ChIP-Seq analysis, we developed GenoGAM (Genome-wide Generalized Additive Model), which extends generalized additive models to work efficiently on data spread over a long x-axis by reducing the scaling from cubic to linear and by employing a data parallelism strategy. Our software makes the well-established and flexible GAM framework available for a number of genomic applications. Furthermore, the statistical framework allows significance testing for differential occupancy. In conclusion, I show how developing algorithms of lower complexity can open the door to analyses that were previously intractable. On this basis, it is recommended that subsequent research efforts focus on lowering the complexity of existing algorithms and on designing new, lower-complexity algorithms.
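The exponential-to-linear reduction mentioned above comes from message passing rather than brute-force enumeration of the latent variables. The sketch below demonstrates the idea on a simple chain of binary latent variables, where the forward (sum-product) recursion reproduces the brute-force likelihood at linear cost. The transition and emission tables are made up, and real pedigrees give tree-structured factor graphs rather than chains, so this is only an analogy to the dissertation's E-step.

```python
# Summing out a chain of binary latent variables: O(n) message passing versus
# O(2^n) enumeration. All probability tables here are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 12
prior = np.array([0.7, 0.3])                 # P(z_1)
trans = np.array([[0.9, 0.1], [0.2, 0.8]])   # P(z_t | z_{t-1})
emit = rng.uniform(0.1, 0.9, size=(n, 2))    # P(x_t | z_t) for the observed x_t

# Forward recursion: alpha_t(z) = P(x_1..x_t, z_t = z), linear in n.
alpha = prior * emit[0]
for t in range(1, n):
    alpha = (alpha @ trans) * emit[t]
likelihood_fast = alpha.sum()

# Brute force over all 2^n latent configurations (feasible only for small n).
likelihood_slow = 0.0
for bits in range(2 ** n):
    z = [(bits >> t) & 1 for t in range(n)]
    p = prior[z[0]] * emit[0, z[0]]
    for t in range(1, n):
        p *= trans[z[t - 1], z[t]] * emit[t, z[t]]
    likelihood_slow += p

print(np.isclose(likelihood_fast, likelihood_slow))  # True
```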

    An Introduction to Data Linkage

    This guide is designed to give readers a practical introduction to data linkage and is aimed at researchers who would like to gain an understanding of data linkage techniques, either for the creation or the analysis of linked data. It covers data preparation, deterministic and probabilistic linkage methods, and analysis of linked data, with examples relevant to health and other administrative data sources. This guide is relevant for academic researchers in the social and health sciences, and for those who work for government, survey agencies, official statistics, charities or the private sector.
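As a companion to the guide's distinction between deterministic and probabilistic linkage, here is a minimal sketch of the deterministic side: standardising identifiers, building a match key, and joining on it exactly. The column names and cleaning rules are illustrative assumptions, not examples taken from the guide.

```python
# Deterministic linkage sketch: clean identifiers, build a key, exact join.
import pandas as pd

def make_key(df: pd.DataFrame) -> pd.Series:
    """Concatenate cleaned identifiers into a single deterministic match key."""
    surname = df["surname"].str.strip().str.upper()
    dob = pd.to_datetime(df["dob"], dayfirst=True).dt.strftime("%Y-%m-%d")
    postcode = df["postcode"].str.replace(" ", "", regex=False).str.upper()
    return surname + "|" + dob + "|" + postcode

hospital = pd.DataFrame({"surname": ["smith "], "dob": ["02/01/1980"], "postcode": ["ab1 2cd"]})
survey = pd.DataFrame({"surname": ["SMITH"], "dob": ["2/1/1980"], "postcode": ["AB1 2CD"]})

hospital["key"] = make_key(hospital)
survey["key"] = make_key(survey)
linked = hospital.merge(survey, on="key", suffixes=("_hosp", "_survey"))
print(len(linked))  # 1 record pair linked deterministically
```

Probabilistic linkage relaxes the exact-key requirement by scoring partial agreement between identifiers, as in the weight-based sketch earlier in this list.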

    Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record

    The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is only a tertiary priority in the EHR, where the primary priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making, for example which lab tests to issue. Few patients are annotated at research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.
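Part 4's first step, extracting sequences from the EHR, typically means turning an irregular per-patient event log into fixed-length index sequences that a deep learning model can consume. The sketch below shows one common way to do this; the event table, code vocabulary and padding scheme are invented for illustration and may differ from the dissertation's actual encoding.

```python
# Turn an irregular per-patient event log into padded index sequences.
import numpy as np
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2020-01-01", "2020-02-10", "2020-03-05", "2020-01-20", "2020-04-11"]),
    "code": ["E11.9", "I10", "E11.9", "I10", "N18.3"],   # e.g. ICD-10 codes
})

# Build a vocabulary, reserving index 0 for padding.
vocab = {c: i + 1 for i, c in enumerate(sorted(events["code"].unique()))}

max_len = 4
sequences = []
for _, grp in events.sort_values("timestamp").groupby("patient_id"):
    idx = [vocab[c] for c in grp["code"]][:max_len]
    sequences.append(idx + [0] * (max_len - len(idx)))   # right-pad with 0

X = np.array(sequences)
print(X)   # shape (n_patients, max_len), ready for an embedding layer
```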