Recovering Loss to Followup Information Using Denoising Autoencoders
Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Currently available methods for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recover loss to followup information. The model is designed to work with high-volume data; results on various simulated and real-life datasets show that it is appropriate under varying dataset and loss to followup conditions and outperforms the state-of-the-art methods by a wide margin (in some scenarios) while preserving the dataset's utility for final analysis.
Comment: Copyright IEEE 2017, IEEE International Conference on Big Data (Big Data).
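The paper's core mechanism lends itself to a short sketch: corrupt each record, train the network to reconstruct the clean version, then read imputed values off the reconstruction. Below is a minimal denoising autoencoder in PyTorch; the layer sizes, Tanh activation, and zero-masking corruption are illustrative assumptions, not the authors' exact overcomplete architecture.

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, n_features: int, hidden: int):
        super().__init__()
        # "Overcomplete": the hidden representation is wider than the input.
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.Tanh())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, opt, x, drop_prob=0.2):
    # Corrupt the input by zeroing random entries (a stand-in for fields
    # lost to followup), then penalize reconstruction error vs. the clean row.
    mask = (torch.rand_like(x) > drop_prob).float()
    loss = nn.functional.mse_loss(model(x * mask), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# At inference, feed a record with its missing fields zero-filled and read
# the reconstructed values off the output.
```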
Deep Learning-Based Approach for Missing Data Imputation
Missing values in datasets are a problem that decreases machine learning performance, and new methods to overcome it are proposed regularly, drawing on statistical, machine learning, evolutionary, and deep learning approaches. Although deep learning is one of the most popular subjects today, there are only limited studies on missing data imputation. Several deep learning techniques have been used to handle missing data; one of them is the autoencoder and its denoising and stacked variants. In this study, the missing values in three different real-world datasets were estimated using the denoising autoencoder (DAE), k-nearest neighbor (kNN), and multivariate imputation by chained equations (MICE) methods. The estimation success of the methods was compared according to the root mean square error (RMSE) criterion. The DAE method was observed to be more successful than the other statistical methods in estimating the missing values for large datasets.
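The study's evaluation protocol is straightforward to sketch: hide a fraction of known entries, impute them, and score each method by RMSE on the held-out truth. Below is a hedged version using scikit-learn's kNN and MICE-style imputers; the dataset, mask rate, and imputer settings are assumptions, and a trained DAE would be scored the same way.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(500, 8))           # toy complete dataset
X_miss = X_true.copy()
mask = rng.random(X_true.shape) < 0.2        # hide 20% of entries
X_miss[mask] = np.nan

for name, imp in [("kNN", KNNImputer(n_neighbors=5)),
                  ("MICE", IterativeImputer(max_iter=10, random_state=0))]:
    X_hat = imp.fit_transform(X_miss)
    # RMSE computed only on the entries that were masked out.
    rmse = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
    print(f"{name}: RMSE = {rmse:.3f}")
```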
Machine Learning and Integrative Analysis of Biomedical Big Data
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Adversarial Learning on Incomplete and Imbalanced Medical Data for Robust Survival Prediction of Liver Transplant Patients
The scarcity of liver transplants necessitates prioritizing patients based on their health condition to minimize deaths on the waiting list. Recently, machine learning methods have gained popularity for automating liver transplant allocation systems, enabling prompt and suitable selection of recipients. Nevertheless, raw medical data often contain complexities, such as missing values and class imbalance, that reduce the reliability of the constructed model. This paper aims to eliminate these challenges and thereby ensure the reliability of the decision-making process. To this end, we first propose a novel deep learning method that simultaneously addresses both challenges and predicts the patients' survival chance. Secondly, we design a hybrid framework with three main modules for missing data imputation, class imbalance learning, and classification, each of which employs multiple advanced techniques for its task. The two approaches are then compared and evaluated on a real clinical case study. The experimental results indicate the robust and superior performance of the proposed deep learning method in terms of F-measure and area under the receiver operating characteristic curve (AUC).
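The three-module framework maps naturally onto a pipeline: impute, rebalance, classify, then score by F-measure and AUC. A hedged sketch follows, with stand-ins rather than the paper's techniques (median imputation, SMOTE for class-imbalance learning, gradient boosting for classification) and synthetic data in place of the clinical case study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

# Imbalanced toy cohort with missingness injected to mimic raw medical data.
rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X[rng.random(X.shape) < 0.1] = np.nan

Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

# Module 1: imputation; module 2: class-imbalance learning (resampling is
# applied only during fit); module 3: classification.
clf = Pipeline([("impute", SimpleImputer(strategy="median")),
                ("balance", SMOTE(random_state=0)),
                ("model", GradientBoostingClassifier(random_state=0))])
clf.fit(Xtr, ytr)
print("F1 :", f1_score(yte, clf.predict(Xte)))
print("AUC:", roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))
```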
Robust One-Shot Singing Voice Conversion
Recent progress in deep generative models has improved the quality of voice conversion in the speech domain. However, high-quality singing voice conversion (SVC) of unseen singers remains challenging due to the wider variety of musical expressions in pitch, loudness, and pronunciation. Moreover, singing voices are often recorded with reverb and accompaniment music, which make SVC even more challenging. In this work, we present a robust one-shot SVC (ROSVC) that performs any-to-any SVC robustly even on such distorted singing voices. To this end, we first propose a one-shot SVC model based on generative adversarial networks that generalizes to unseen singers via partial domain conditioning and learns to accurately recover the target pitch via pitch distribution matching and AdaIN-skip conditioning. We then propose a two-stage training method called Robustify that trains the one-shot SVC model in the first stage on clean data to ensure high-quality conversion, and in the second stage introduces enhancement modules into the model's encoders to improve feature extraction from distorted singing voices. To further improve the voice quality and pitch reconstruction accuracy, we finally propose a hierarchical diffusion model for singing voice neural vocoders. Experimental results show that the proposed method outperforms state-of-the-art one-shot SVC baselines for both seen and unseen singers and significantly improves the robustness against distortions.
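Of the conditioning mechanisms the abstract names, adaptive instance normalization (AdaIN) is the most self-contained to sketch: normalize content features per channel, then re-scale and re-shift them with parameters predicted from a target-singer embedding. The layer below is a generic 1-D AdaIN in PyTorch, not the authors' exact AdaIN-skip wiring; all shapes and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AdaIN1d(nn.Module):
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Project the singer/style embedding to per-channel scale and shift.
        self.affine = nn.Linear(style_dim, channels * 2)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=1)
        return (1 + gamma.unsqueeze(-1)) * self.norm(x) + beta.unsqueeze(-1)

# Usage: condition encoder features on a target-singer embedding.
x = torch.randn(4, 256, 128)      # content features from the encoder
s = torch.randn(4, 64)            # singer embedding
out = AdaIN1d(256, 64)(x, s)      # same shape as x, re-styled per channel
```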
Applications of machine learning to gravitational waves
Gravitational waves, predicted by Albert Einstein in 1916 and first directly observed in 2015, are a powerful window into the universe and its past. Currently, multiple detectors around the globe are in operation. While the technology has matured to a point where detections are common, there are still unsolved problems. Traditional search algorithms are only optimal under assumptions that do not hold in contemporary detectors. In addition, high data rates and latency requirements can be challenging. In this thesis, we use new methods based on recent advances in machine learning to tackle these issues. We develop search algorithms competitive with conventional methods in a realistic setting. In doing so, we cover a mock data challenge that we organized and that served as a framework for obtaining some of these results. Finally, we demonstrate the power of our search algorithms by applying them to data from the second half of LIGO's third observing run. We find that the events targeted by our searches are identified reliably.
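For reference, the "traditional search algorithms" the thesis contrasts its learned searches with are matched filters: correlate the data with a known waveform template and flag peaks in the output, which is optimal only for known signal shapes in stationary Gaussian noise. A toy NumPy sketch, with a made-up chirp template rather than LIGO data:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 4096)
# Toy chirp standing in for a compact-binary waveform template.
template = np.sin(2 * np.pi * (50 + 200 * t) * t) * np.exp(-4 * (1 - t))
# Simulated strain: a shifted, attenuated copy of the template in white noise.
data = 0.5 * np.roll(template, 1000) + rng.normal(size=t.size)

# Correlate against the unit-norm template at every lag; peaks in the
# detection statistic mark candidate event times.
h = template / np.linalg.norm(template)
stat = np.correlate(data, h, mode="same")
print("peak statistic at sample", int(np.argmax(np.abs(stat))))
```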
Electronic Health Record-Derived Phenotyping Models to Improve Genomic Research in Stroke
Stroke is a highly heterogeneous and complex disease and a leading cause of death in the United States. The landscape of risk factors for stroke is vast, and its large genetic burden has yet to be fully discovered. We hypothesize that the small number of stroke variants recovered so far is due to (1) the vast phenotypic heterogeneity of stroke and (2) the binary labeling of stroke genome-wide association study (GWAS) participants as cases or controls. Genome-wide association studies accumulate hundreds of thousands to millions of participants to acquire adequate signal for variant discovery, which requires time-consuming manual curation of cases and controls, often involving large-scale collaborations. Genetic biobanks connected to electronic health records (EHR) can facilitate these studies by using data routinely captured during clinical care, such as billing diagnosis codes. These data, however, do not define adjudicated cases and controls, and many patients fall somewhere in between; there is an opportunity to use machine learning to add nuance to these definitions. We hypothesize that expanding the definition of disease to incorporate correlated diseases and risk factors from EHR data will improve GWAS power. We also hypothesize that granularly subtyping stroke using unsupervised learning methods can provide insight into stroke etiology and heterogeneity.

In Chapter 1, we described the motivation for building upon current phenotyping methods for subtyping and genome-wide association studies to improve GWAS power. In Chapter 2, using patients from Columbia-New York Presbyterian (NYP) Hospital, we built and evaluated machine learning models to identify patients with acute ischemic stroke based on 75 different case-control and classifier combinations. In Chapter 3, we compared two data-driven, unsupervised methods, non-negative matrix factorization (NMF) and Hierarchical Poisson Factorization, for subtyping stroke patients and determined whether any of the subtypes correlate with stroke severity. In Chapter 4, we estimated the heritability of acute ischemic stroke by treating the patient probabilities assigned by the Chapter 2 phenotyping models as a quantitative trait and mapping the probabilities to Columbia-NYP EHR-generated pedigrees. We also applied our machine learning phenotyping method, which we call QTPhenProxy, to venous thromboembolism in Columbia eMERGE Consortium patients and ran a genome-wide association study using the model probabilities as a quantitative trait. Finally, we applied QTPhenProxy to subjects in the UK Biobank for stroke and 14 other diseases and ran genome-wide association studies for each disease.

We found that our machine-learned models performed well in identifying acute ischemic stroke patients in the Columbia-NYP EHR and in the UK Biobank. We also found several NMF-derived subtypes that were significantly correlated with stroke severity. The eMERGE venous thromboembolism cohort GWAS was underpowered and did not recover any known or new variants. Finally, we found that QTPhenProxy improved the power of GWAS for stroke and several of its subtypes in the UK Biobank, recovered known variants, and discovered a new variant that replicates in a previous stroke GWAS. Our results for QTPhenProxy demonstrate the promise of incorporating large but messy data sources, such as the electronic health record, to improve signal in genome-wide association studies.
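The QTPhenProxy idea, as the abstract describes it, can be sketched compactly: fit a case/control classifier on EHR-derived features, then hand each subject's predicted probability to the GWAS as a quantitative trait instead of a binary label. The classifier choice, feature encoding, and synthetic data below are assumptions, not the dissertation's exact models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# ehr_features: (n_subjects, n_codes) matrix of billing-code counts;
# labels: 1 for curated cases, 0 for controls. Subjects in between are
# scored by the model rather than hard-labeled. Toy data for illustration.
rng = np.random.default_rng(0)
ehr_features = rng.poisson(0.3, size=(2000, 100))
labels = rng.integers(0, 2, size=2000)

clf = LogisticRegression(max_iter=1000).fit(ehr_features, labels)
quantitative_trait = clf.predict_proba(ehr_features)[:, 1]
# quantitative_trait replaces the binary phenotype downstream, e.g. exported
# per subject ID for a linear mixed-model association test.
```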