
    A qualitative assessment of machine learning support for detecting data completeness and accuracy issues to improve data analytics in big data for the healthcare industry

    Tackling data quality issues as part of Big Data can be challenging. For data cleansing activities, manual methods are not efficient given the potentially very large amount of data. This paper aims to qualitatively assess the possibilities for using machine learning to detect data incompleteness and inaccuracy, since a previous study by the authors found these two data quality dimensions to be the most significant. A review of the existing literature concludes that no single machine learning algorithm is best suited to dealing with both incompleteness and inaccuracy of data. Various algorithms are selected from existing studies and applied to a representative big healthcare dataset. The experiments also revealed that implementing machine learning algorithms in this context raises several challenges for Big Data quality activities, related to the amount of data particular algorithms can scale to and to the data-type restrictions some algorithms impose. The study concludes that 1) data imputation works better with linear regression models and 2) clustering models are more efficient at detecting outliers, but fully automated systems may not be realistic in this context, so a certain level of human judgement is still needed.
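
    The two concluding findings lend themselves to a short illustration. Below is a minimal sketch, assuming scikit-learn and a synthetic numeric dataset (nothing here comes from the paper itself): linear-regression-based imputation of missing values, then cluster-distance-based outlier flagging, with the flagged records left for human review rather than corrected automatically.

```python
# Minimal sketch: regression-based imputation plus clustering-based outlier
# detection. scikit-learn and the synthetic data are illustrative assumptions.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.10] = np.nan  # inject 10% missingness

# 1) Impute missing values, modelling each feature as a linear regression
#    on the others.
imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10,
                           random_state=0)
X_imputed = imputer.fit_transform(X)

# 2) Flag outliers as points far from their nearest cluster centre.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_imputed)
dist = kmeans.transform(X_imputed).min(axis=1)
flagged = dist > np.percentile(dist, 99)  # top 1% of distances
print(f"{flagged.sum()} records flagged for human review")
```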

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
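
    As a rough illustration of how two of the five challenges are commonly handled, the sketch below concatenates two synthetic omics blocks ("early integration"), reduces dimensionality with PCA, and counters class imbalance with class weighting. The data, pipeline, and scikit-learn choices are illustrative assumptions, not the review's own methods.

```python
# Sketch: "early integration" of two omics blocks, with PCA against the curse
# of dimensionality and class weighting against imbalance. Synthetic data;
# the pipeline is an illustrative assumption, not the review's code.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
genome = rng.normal(size=(200, 5000))    # many high-dimensional features
proteome = rng.normal(size=(200, 800))   # a second, smaller modality
y = (rng.random(200) < 0.1).astype(int)  # imbalanced labels, ~10% positive

X = np.hstack([genome, proteome])        # naive concatenation-based integration

clf = make_pipeline(
    PCA(n_components=50),                                        # dimensionality
    LogisticRegression(class_weight="balanced", max_iter=1000),  # imbalance
)
clf.fit(X, y)
```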

    Multiple imputation in an international database of social science surveys

    This paper describes an implementation of the method of multiple imputation in the database of surveys in the International Social Survey Programme. Since missing values occur for most variables, with a wide range of patterns, the imputations are carried out in stages, starting with the socio-demographic background variables, which in general have fewer missing values. For blocks of questionnaire items, only their total scores are imputed, making the imputation task manageable without substantial loss of utility of the database and reducing the size of the data files added to the database by the imputation procedure. (Author's abstract)
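
    A minimal sketch of the staged strategy described above, assuming pandas/scikit-learn and hypothetical variable names: the background variables are imputed first, and each item block contributes only its total score.

```python
# Sketch of staged imputation: background variables first, then the block's
# total score rather than every individual item. Column names, missingness
# rates, and the use of IterativeImputer are hypothetical assumptions.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "education": rng.integers(1, 6, n).astype(float),
    "q1": rng.integers(1, 5, n).astype(float),
    "q2": rng.integers(1, 5, n).astype(float),
    "q3": rng.integers(1, 5, n).astype(float),
})
# Inject missingness: little in the background variables, more in the items.
for col, rate in [("age", 0.02), ("education", 0.03),
                  ("q1", 0.15), ("q2", 0.15), ("q3", 0.15)]:
    df.loc[rng.random(n) < rate, col] = np.nan

background = ["age", "education"]
items = ["q1", "q2", "q3"]

# Stage 1: impute the background variables first (few missing values).
df[background] = IterativeImputer(random_state=0).fit_transform(df[background])

# Stage 2: impute only the block's total score, conditioning on the
# completed background variables, rather than every individual item.
df["block_score"] = df[items].sum(axis=1, skipna=False)
cols = background + ["block_score"]
df[cols] = IterativeImputer(random_state=0).fit_transform(df[cols])
```

    A full multiple-imputation run would repeat the second stage several times with different random seeds and pool the results.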

    Potential adjustment methodology for missing data and reporting delay in the HIV Surveillance System, European Union/European Economic Area, 2015

    HIV remains one of the most important public health concerns in the European Union and European Economic Area (EU/EEA). Accurate data are therefore crucial to appropriately direct and evaluate the public health response. The European Centre for Disease Prevention and Control (ECDC) and the World Health Organization Regional Office for Europe (WHO/Europe) have jointly coordinated enhanced HIV/AIDS surveillance in the European Region since 2008. The general objectives of the surveillance system in EU/EEA countries include monitoring of trends over time and across countries. Specific HIV-related objectives include monitoring of testing patterns, late HIV diagnoses, defined by low CD4+ counts (<350 cells/mm3), and mortality, as well as estimating HIV incidence and prevalence stratified by key populations, e.g. transmission category and migrant status [1]. To meet these objectives, the long-term strategy states that the quality of surveillance data needs to improve [2].

    Achieving this in practice poses challenges, especially given the heterogeneous national surveillance systems in the EU/EEA and the known quality limitations of the routinely collected data. Limitations originating from national data collection systems may include under-reporting or duplication of cases, delays in reporting, incompleteness of data, and misclassification. Accounting for some of these limitations (e.g. assessment of under-reporting) requires additional data such as cohort studies or registries, while other issues, such as incompleteness and reporting delay, may be addressed directly within the surveillance datasets.

    Missing data are a well-recognised problem within surveillance systems. When values for some variables are missing and cases with missing values are excluded from analysis, the result may be biased and potentially less precise estimates [3,4]. In principle, whenever there are missing data or reporting delays, the accuracy of epidemiological distributions and trends should be interpreted with caution. Reporting delay, the time from case diagnosis to notification, can cause problems when analysing the most recent years, given that information on some cases or variables may not yet have been collected because of national reporting processes. This phenomenon is common in disease surveillance and also applies to HIV [5-8]. Rough adjustments for reporting delay have been implemented in Europe in the past [8,9], but the existing methodology needs further refinement to cover more countries' data.

    The main purpose of this paper is to explore the issues of missing data and reporting delay in EU/EEA HIV surveillance data. We aim to quantify the extent to which these problems are present and to identify specific data characteristics that are relevant for data adjustments. Taking these characteristics into account, we also propose methods to adjust for missing data and reporting delay based on the literature and existing national practices in EU/EEA countries.
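
    As a rough illustration of a reporting-delay adjustment, the sketch below applies a simple multiplier method: each recent year's observed count is divided by the estimated fraction of cases reported so far. The delay distribution and case counts are made-up numbers; the paper's proposed methodology is more refined than this.

```python
# Sketch: multiplier-type adjustment for reporting delay. The cumulative delay
# distribution F and the case counts are made-up numbers for illustration.
import numpy as np

# Estimated probability that a case is reported within d years of diagnosis,
# learned from historical years assumed to be completely reported.
F = np.array([0.70, 0.90, 0.97, 1.00])  # d = 0, 1, 2, 3+ years

observed = {2012: 4100, 2013: 4000, 2014: 3600, 2015: 2900}
current_year = 2015

for year, n in sorted(observed.items()):
    d = min(current_year - year, len(F) - 1)
    adjusted = n / F[d]  # inflate by the inverse of the reporting probability
    print(f"{year}: observed {n}, delay-adjusted estimate {adjusted:.0f}")
```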

    Improving deep learning performance with missing values via deletion and compensation

    Proceedings of: International Work-Conference on the Interplay between Natural and Artificial Computation (IWINAC 2015). Missing values are one of the most common difficulties in real applications. Many techniques based on machine learning have been proposed in the literature to face this problem. In this work, the great representation capability of stacked denoising auto-encoders is used to obtain a new method for imputing missing values based on two ideas: deletion and compensation. This method improves imputation performance by artificially deleting values in the input features and using them as targets in the training process. Nevertheless, although this deletion is demonstrated to be very effective, it may cause an imbalance between the distributions of the training and test sets. To solve this issue, a compensation mechanism is proposed based on a slight modification of the error function to be optimized. Experiments over several datasets show that deletion and compensation yield improvements not only in imputation but also in classification, in comparison with other classical techniques. The work of A. R. Figueiras-Vidal has been partly supported by Grant Macro-ADOBE (TEC 2015-67719-P, MINECO/FEDER&FSE); the work of J. L. Sancho-Gómez has been partly supported by Grant AES 2017 (PI17/00771, MINECO/FEDER).
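
    A minimal sketch of the deletion idea, assuming PyTorch and a toy architecture: observed entries are artificially deleted and used as reconstruction targets, and a simple loss reweighting stands in for the paper's compensation term (the actual compensation modifies the error function to rebalance the training and test distributions).

```python
# Sketch of deletion and compensation with a denoising auto-encoder.
# Architecture, deletion rate, and the loss reweighting are assumptions;
# the paper's compensation term modifies the error function differently.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)  # toy complete training data

model = nn.Sequential(
    nn.Linear(20, 8), nn.ReLU(),  # encoder
    nn.Linear(8, 20),             # decoder
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    # Deletion: artificially remove 20% of the observed values ...
    mask = (torch.rand_like(X) < 0.2).float()
    x_corrupted = X * (1 - mask)
    x_hat = model(x_corrupted)
    # ... and use the deleted values as targets, up-weighting their errors
    # (a crude stand-in for the compensation mechanism).
    err = (x_hat - X) ** 2
    loss = (err * (1 + 4 * mask)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```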

    Bayesian multilevel latent class models for the multiple imputation of nested categorical data

    With this article, we propose using a Bayesian multilevel latent class (BMLC; or mixture) model for the multiple imputation of nested categorical data. Unlike recently developed methods that can only pick up associations between pairs of variables, the multilevel mixture model we propose is flexible enough to deal automatically with complex interactions in the joint distribution of the variables to be estimated. After formally introducing the model and showing how it can be implemented, we carry out a simulation study and a real-data study to assess its performance and compare it with the commonly used listwise deletion and an available R routine. Results indicate that the BMLC model recovers unbiased parameter estimates of the analysis models considered in our studies and correctly reflects the uncertainty due to missing data, outperforming the competing methods.
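
    Whatever model generates the completed datasets, multiple-imputation estimates are conventionally pooled with Rubin's rules, which is how the uncertainty due to missing data gets reflected in the final inference. A minimal sketch with made-up numbers:

```python
# Sketch: pooling estimates from m imputed datasets with Rubin's rules.
# The per-dataset estimates and variances are made-up numbers.
import numpy as np

estimates = np.array([0.52, 0.49, 0.55, 0.51, 0.48])  # one per imputed dataset
variances = np.array([0.010, 0.012, 0.009, 0.011, 0.010])
m = len(estimates)

q_bar = estimates.mean()       # pooled point estimate
W = variances.mean()           # within-imputation variance
B = estimates.var(ddof=1)      # between-imputation variance
T = W + (1 + 1 / m) * B        # total variance (Rubin, 1987)
print(f"pooled estimate {q_bar:.3f}, standard error {T ** 0.5:.3f}")
```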

    Predicting Pilot Success Using Machine Learning

    The United States Air Force has a pilot shortage. Unfortunately, training an Air Force pilot requires significant time and resources, so diligence and expediency are critical in selecting pilot candidates with a strong possibility of success. This research applies multivariate and statistical machine learning techniques to pilot candidates' pre-qualification test data and undergraduate pilot training results to determine whether particular pre-qualification tests or specific training evaluations do the best job of screening for successful pilot training candidates and distinguished graduates. Flight experience, both during training and otherwise, is found to indicate pilot training completion and performance.
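
    A hypothetical sketch of the screening question, assuming scikit-learn: fit a classifier on candidate test scores and rank which tests carry the most signal. The feature names and data are invented for illustration, not taken from the study.

```python
# Hypothetical sketch: rank which pre-qualification tests best screen for
# training success via feature importances. Test names and data are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
tests = ["AFOQT_pilot", "TBAS", "flight_hours", "academic_score"]
X = rng.normal(size=(500, len(tests)))
# Synthetic "completed training" label, driven mostly by flight hours.
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(size=500) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(tests, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: importance {imp:.2f}")
```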

    Hybrid multiple imputation in a large-scale complex survey
