34 research outputs found

    A Review of Missing Data Handling Techniques for Machine Learning

    Real-world data commonly contain missing values, which adversely affect the performance of most machine learning algorithms applied to such datasets. Missing values are one of the many challenges of real-world data. Since the accuracy and efficiency of machine learning models depend on the quality of the data used, data analysts and researchers need relevant techniques for handling these unavoidable missing values. This paper reviews state-of-the-art practices from the literature for handling missing data in machine learning and lists evaluation metrics used to measure the performance of these techniques. The study sets out these techniques and evaluation metrics in clear terms, accompanied by mathematical equations. It also provides recommendations to consider when working with missing data handling techniques.

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. The early parts of this research propose a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine learning methods, data preprocessing techniques, models' training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, the thesis advances preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. The thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. The experimental results show that predictive models built with the methods guided by the new framework (Octopus) earn domain experts' approval as reliable. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to performances more in line with experts' success criteria than those of traditional imbalanced data resampling techniques.
Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

    Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record

    The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning

    Novel Computationally Intelligent Machine Learning Algorithms for Data Mining and Knowledge Discovery

    This thesis addresses three major issues in data mining regarding feature subset selection in large dimensionality domains, plausible reconstruction of incomplete data in cross-sectional applications, and forecasting univariate time series. For the automated selection of an optimal subset of features in real time, we present an improved hybrid algorithm: SAGA. SAGA combines the ability to avoid being trapped in local minima of Simulated Annealing with the very high convergence rate of the crossover operator of Genetic Algorithms, the strong local search ability of greedy algorithms and the high computational efficiency of generalized regression neural networks (GRNN). For imputing missing values and forecasting univariate time series, we propose a homogeneous neural network ensemble. The proposed ensemble consists of a committee of Generalized Regression Neural Networks (GRNNs) trained on different subsets of features generated by SAGA and the predictions of base classifiers are combined by a fusion rule. This approach makes it possible to discover all important interrelations between the values of the target variable and the input features. The proposed ensemble scheme has two innovative features which make it stand out amongst ensemble learning algorithms: (1) the ensemble makeup is optimized automatically by SAGA; and (2) GRNN is used for both base classifiers and the top level combiner classifier. Because of GRNN, the proposed ensemble is a dynamic weighting scheme. This is in contrast to the existing ensemble approaches which belong to the simple voting and static weighting strategy. The basic idea of the dynamic weighting procedure is to give a higher reliability weight to those scenarios that are similar to the new ones. The simulation results demonstrate the validity of the proposed ensemble model
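The dynamic weighting idea in the abstract above (training cases that resemble the new case get more weight) can be illustrated with a minimal sketch. It assumes the standard GRNN formulation as a Gaussian-kernel weighted average of training targets. The function names, the `sigma` bandwidth, and the `feature_subsets` argument are hypothetical, the subsets are supplied by hand rather than chosen by SAGA, and a plain mean stands in for the GRNN top-level combiner that the thesis describes.

```python
import numpy as np


def grnn_predict(X_train, y_train, X_new, sigma=0.5):
    """Kernel-weighted average of training targets (standard GRNN form)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    # Squared Euclidean distances, shape (n_new, n_train).
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum leaves the normalised weights unchanged
    # but avoids a 0/0 when every kernel value would underflow.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Training cases closer to the new case receive larger weights.
    return (w @ y_train) / w.sum(axis=1)


def committee_predict(X_train, y_train, X_new, feature_subsets, sigma=0.5):
    """Average the predictions of GRNNs trained on different feature subsets.

    Assumption: the mean is a stand-in for the GRNN combiner in the abstract,
    and feature_subsets is a hand-written list of column-index lists.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    preds = [
        grnn_predict(X_train[:, cols], y_train, X_new[:, cols], sigma)
        for cols in feature_subsets
    ]
    return np.mean(preds, axis=0)
```

A call such as `committee_predict(X, y, X_new, [[0], [0, 1]])` would combine one GRNN that sees only the first feature with one that sees the first two. Smaller `sigma` values make each prediction depend more strongly on the nearest training cases.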

    A step towards Advancing Digital Phenotyping In Mental Healthcare

    Smartphones and wrist-wearable devices have infiltrated our lives in recent years. According to published statistics, nearly 84% of the world’s population owns a smartphone, and almost 10% own a wearable device today (2022). These devices continuously generate various data sources from multiple sensors and apps, creating our digital phenotypes. This opens new research opportunities, particularly in mental health care, which has previously relied almost exclusively on self-reports of mental health symptoms. Unobtrusive monitoring using patients’ devices may result in clinically valuable markers that can improve diagnostic processes, tailor treatment choices, provide continuous insights into their condition for actionable outcomes, such as early signs of relapse, and develop new intervention models. However, these data sources must be translated into meaningful, actionable features related to mental health to achieve their full potential. In the mental health field, there is a great need and much to be gained from defining a way to continuously assess the evolution of patients’ mental states, ideally in their everyday environment, to support the monitoring and treatments by health care providers. A smartphone-based approach may be valuable in gathering long-term objective data, aside from the usually used self-ratings, to predict clinical state changes and investigate causal inferences about state changes in patients (e.g., those with affective disorders). Being objective does not imply that passive data collection is also perfect. It has several challenges: some sensors generate vast volumes of data, and others cause significant battery drain. Furthermore, the analysis of raw passive data is complicated, and collecting certain types of data may interfere with the phenotype of interest. Nonetheless, machine learning is predisposed to address these matters and advance psychiatry’s era of personalised medicine. 
This work aimed to advance the research efforts on mobile and wearable sensors for mental health monitoring. We applied supervised and unsupervised machine learning methods to model and understand mental disease evolution based on the digital phenotype of patients and on clinician assessments at follow-up visits, which provide the ground truth. We had to cope with regularly and irregularly sampled, high-dimensional, and heterogeneous time series data susceptible to distortion and missingness. Hence, the developed methods must be robust to these limitations and handle missing data properly. Throughout the various projects presented here, we used probabilistic latent variable models for data imputation and feature extraction, namely mixture models (MM) and hidden Markov models (HMM). These unsupervised models can learn even in the presence of missing data by marginalising over the missing values given the observed ones. Once the generative models are trained on the data set with missing values, they can be used to generate samples for imputation. First, the most probable component/state has to be found for each sample. Then, sampling from the most probable distribution yields valid and robust parameter estimates and explicit imputed values for variables that can be analysed as outcomes or predictors. The imputation process can be repeated several times, creating multiple datasets, thereby accounting for the uncertainty in the imputed values and implicitly augmenting the data. Moreover, these models are robust to moderate deviations of the observed data from the assumed underlying distribution and provide accurate estimates even when missingness is high.
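The imputation steps described above (marginalise over the missing entries, pick the most probable component, sample from it, repeat) can be sketched as follows. This is only an illustration under stated assumptions: the components are taken to be multivariate Gaussians, the mixture parameters `weights`, `means` and `covs` are assumed to have been fitted already, and the function names are hypothetical. The hidden Markov model case is not covered.

```python
import numpy as np


def log_gauss(x, mean, cov):
    """Log density of a multivariate normal evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def impute_row(x, weights, means, covs, rng, n_draws=5):
    """Return n_draws completed copies of x, with NaN entries filled by sampling.

    Assumption: weights, means, covs describe an already fitted Gaussian mixture.
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    out = np.tile(x, (n_draws, 1))
    if not miss.any():
        return out  # nothing to impute
    if obs.any():
        # Step 1: score each component on the observed dimensions only,
        # which marginalises the missing ones out of each Gaussian.
        logp = np.array([
            np.log(w) + log_gauss(x[obs], m[obs], c[np.ix_(obs, obs)])
            for w, m, c in zip(weights, means, covs)
        ])
        k = int(np.argmax(logp))
        m, c = means[k], covs[k]
        # Step 2: Gaussian of the missing entries conditional on the observed.
        c_oo = c[np.ix_(obs, obs)]
        c_mo = c[np.ix_(miss, obs)]
        gain = np.linalg.solve(c_oo, c_mo.T).T
        cond_mean = m[miss] + gain @ (x[obs] - m[obs])
        cond_cov = c[np.ix_(miss, miss)] - gain @ c_mo.T
    else:
        # No observed entries: fall back to the component with largest weight.
        k = int(np.argmax(weights))
        cond_mean, cond_cov = means[k], covs[k]
    cond_cov = 0.5 * (cond_cov + cond_cov.T)  # remove numerical asymmetry
    # Step 3: repeated draws give several imputed versions of the same row.
    out[:, miss] = rng.multivariate_normal(cond_mean, cond_cov, size=n_draws)
    return out
```

Calling `impute_row(row, weights, means, covs, np.random.default_rng(0))` on each incomplete row would yield several completed datasets, and the spread across draws reflects the uncertainty in the imputed values.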
Depending on the properties of the data at hand, we employed feature extraction methods combined with classical machine learning algorithms or deep learning-based techniques for temporal modelling to predict various mental health outcomes of psychiatric outpatients: emotional state, World Health Organisation Disability Assessment Schedule (WHODAS 2.0) functionality scores and Generalised Anxiety Disorder-7 (GAD-7) scores. We mainly focused on one-size-fits-all models, as the labelled sample size per patient was limited; however, in the mood prediction case, it was possible to apply personalised models. Integrating machines and algorithms into the clinical workflow requires interpretability to increase acceptance. Therefore, we also analysed feature importance by computing Shapley additive explanations (SHAP) values. SHAP values provide an overview of the essential features in the machine learning models by assigning each feature a positive or negative weight for its contribution to predicting the target variable. The provided solutions are proofs of concept that require further clinical validation before they can be deployed in the clinical workflow. Still, the results are promising and lay foundations for future research and collaboration among clinicians, patients, and computer scientists. They set the path for advancing future research in technology-based mental healthcare.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. President: David Ramírez García. Secretary: Alfredo Nazábal Rentería. Member: María Luisa Barrigón Estéve

    Personalized Medicine Support System for Chronic Myeloid Leukemia Patients

    Personalized medicine offers the most effective treatment protocols to individual Chronic Myeloid Leukemia (CML) patients. Understanding the molecular biology that causes CML assists in providing efficient treatment. After the identification of the activated tyrosine kinase BCR-ABL1 as the causative lesion in CML, the first-generation tyrosine kinase inhibitor (TKI) imatinib (Glivec®) was developed to inhibit BCR-ABL1 activity and approved as a treatment for CML. Despite the remarkable increase in the survival rate of CML patients treated with imatinib, some patients discontinued imatinib therapy due to intolerance, resistance or progression. These patients may benefit from the use of second-generation TKIs, such as nilotinib (Tasigna®) and dasatinib (Sprycel®). All three of these TKIs are currently approved for use as frontline treatments. Prognostic scores and molecular-based predictive assays are used to personalize the care of CML patients by allocating risk groups and predicting responses to therapy. Although prognostic scores remain in use today, they are often inadequate for three main reasons. Firstly, different prognostic scores may generate conflicting prognoses for the risk index, and it can be difficult to know how to treat patients with conflicting prognoses. Secondly, since prognostic score systems are developed over time, patients can benefit from newly developed systems and information. Finally, the earlier scores use mostly clinically oriented factors instead of those directly related to genetic or molecular indicators. Although the current CML treatment guidelines recommend the use of TKI therapy, a new tool that combines the well-known molecular-based predictive assays to predict molecular response to TKI has not been considered in previous research.
Therefore, the main goal of this research is to improve the ability to manage CML disease in individual CML patients and support CML physicians in TKI therapy treatment selection by correctly allocating patients to risk groups and predicting their molecular response to the selected treatment. To achieve this objective, the research detailed here focuses on developing a prognostic model and a predictive model for use as a personalized medicine support system. The system will be considered a knowledge-based clinical decision support system that includes two models embedded in a decision tree. The main idea is to classify patients into risk groups using the prognostic model; patients identified as part of the high-risk group should be considered for more aggressive imatinib therapy or switched to a second-generation TKI with close monitoring. For patients assigned to the low-risk group, the response to imatinib should be predicted using the predictive model. The outcomes should be evaluated by comparing the results of these models with the actual responses to imatinib in patients from a previous medical trial and in patients admitted to hospitals. Validating such a predictive system could greatly assist clinicians in clinical decision-making geared toward individualized medicine. Our findings suggest that the system provides treatment recommendations that could help improve overall healthcare for CML patients. Study limitations included the impact of diversity on human expertise, changing predictive factors, population and prediction endpoints, the impact of time and patient personal issues. Further intensive research based on the development of a new predictive model and the method for selecting predictive factors and validation can be expanded to other health organizations and to the development of models to predict responses to other TKIs.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201

    Statistical Modelling

    The book collects the proceedings of the 19th International Workshop on Statistical Modelling, held in Florence in July 2004. Statistical modelling is an important cornerstone in many scientific disciplines, and the workshop provided a rich environment for cross-fertilization of ideas from different disciplines. The volume consists of four invited lectures, 48 contributed papers and 47 posters. The contributions are arranged in sessions: Statistical Modelling; Statistical Modelling in Genomics; Semi-parametric Regression Models; Generalized Linear Mixed Models; Correlated Data Modelling; Missing Data, Measurement of Error and Survival Analysis; Spatial Data Modelling and Time Series and Econometrics.

    STATISTICAL METHODS FOR THE ANALYSIS OF LARGE-SCALE GENOMIC AND PROTEOMIC DATA

    Ph.D. (Doctor of Philosophy)

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality