506 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.

    INTEGRATION OF MULTI-PLATFORM HIGH-DIMENSIONAL OMIC DATA

    The development of high-throughput biotechnologies has made data accessible from different platforms, including RNA sequencing, copy number variation, DNA methylation, protein lysate arrays, etc. The high-dimensional omic data derived from different technological platforms have been extensively used to facilitate comprehensive understanding of disease mechanisms and to determine personalized health treatments. Although vital to the progress of clinical research, the high-dimensional multi-platform data impose new challenges for data analysis. Numerous methods have been proposed to integrate multi-platform omic data; however, few have efficiently and simultaneously addressed the problems that arise from high dimensionality and complex correlations. In my dissertation, I propose a statistical framework, the shared informative factor model (SIFORM), that can jointly analyze multi-platform omic data and explore their associations with a disease phenotype. The common disease-associated sample characteristics across different data types can be captured through the shared structure space, while the corresponding weights of genetic variables directly index the strengths of their association with the phenotype. I compare the performance of the proposed method with several popular regularized regression methods and canonical correlation analysis (CCA)-based methods through extensive simulation studies and two lung adenocarcinoma applications. The two lung adenocarcinoma applications jointly explore the associations of mRNA expression and protein expression with smoking status and survival using The Cancer Genome Atlas (TCGA) datasets. The simulation studies demonstrate the superior performance of SIFORM in terms of biomarker detection accuracy. In lung cancer applications, SIFORM identifies many biomarkers that belong to key pathways for lung tumorigenesis. It also discovers potential prognostic biomarkers for lung cancer patients' survival and some biomarkers that reveal different tumorigenesis mechanisms between light smokers and heavy smokers. To improve the prediction accuracy and interpretability of the proposed model, I extend it to PSIFORM by incorporating existing biological pathway information into the current statistical framework. I adopt a network-based regularization to ensure that the neighboring genes in the same pathway tend to be selected (or eliminated) simultaneously. Through simulation studies and a TCGA kidney cancer application, I show that PSIFORM outperforms its competitors in both variable selection and prediction. The statistical framework of PSIFORM also has great potential for incorporating the hierarchical order across the multi-platform omic measurements.
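
    For orientation, one widely used form of network-based regularization is sketched below. This is an assumption for illustration only: the abstract does not state the exact penalty used in PSIFORM. Here \beta_j is the weight of gene j, E is the set of pathway edges, d_j is the degree of gene j in the pathway graph, and \lambda_1, \lambda_2 are tuning parameters (all symbols are hypothetical notation, not taken from the dissertation).

```latex
P(\beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j|
         + \lambda_2 \sum_{(j,k) \in E}
           \left( \frac{\beta_j}{\sqrt{d_j}} - \frac{\beta_k}{\sqrt{d_k}} \right)^2
```

    The first term encourages sparsity, while the second shrinks the degree-scaled weights of neighboring genes toward each other, which is why connected genes tend to enter or leave the model together.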

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Time-Series Embedded Feature Selection Using Deep Learning: Data Mining Electronic Health Records for Novel Biomarkers

    As health information technologies continue to advance, routine collection and digitisation of patient health records in the form of electronic health records present an ideal opportunity for data-mining and exploratory analysis of biomarkers and risk factors indicative of a potentially diverse domain of patient outcomes. Patient records have continually become more widely available through various initiatives enabling open access whilst maintaining critical patient privacy. In spite of such progress, health records remain not widely adopted within the current clinical statistical analysis domain due to challenging issues derived from such “big data”. Deep learning based temporal modelling approaches present an ideal solution to health record challenges through automated self-optimisation of representation learning, able to manageably compose the high-dimensional domain of patient records into data representations able to model complex data associations. Such representations can serve to condense and reduce dimensionality to emphasise feature sparsity and importance through novel embedded feature selection approaches. Accordingly, application towards patient records enables complex modelling and analysis of the full domain of clinical features to select biomarkers of predictive relevance. Firstly, we propose a novel entropy regularised neural network ensemble able to highlight risk factors associated with hospitalisation risk of individuals with dementia. Its application reduced a large domain of unique medical events to a small set of relevant risk factors while maintaining hospitalisation discrimination. Following on, we continue our work on ensemble architecture approaches with a novel cascading LSTM ensemble to predict severe sepsis onset in critically ill patients in an ICU. We demonstrate state-of-the-art performance that outperforms the current related literature. Finally, we propose a novel embedded feature selection application dubbed 1D convolution feature selection using sparsity regularisation. This methodology was evaluated on both the dementia and sepsis prediction objectives to highlight model capability and generalisability. We further report a selection of potential biomarkers for the aforementioned case study objectives, highlighting clinical relevance and potential novelty value for future clinical analysis. Accordingly, we demonstrate the effective capability of embedded feature selection approaches through the application of temporal based deep learning architectures in the discovery of effective biomarkers across a variety of challenging clinical applications.

    Machine learning in healthcare : an investigation into model stability

    Current machine learning algorithms, when directly applied to medical data, often fail to provide a good understanding of prognosis. This study provides three pathways to make predictive models stable and usable for healthcare. When tested on heart failure and diabetes patients from a local hospital, this study demonstrated a 20% improvement over existing methods.

    A small number of abnormal brain connections predicts adult autism spectrum disorder

    Although autism spectrum disorder (ASD) is a serious lifelong condition, its underlying neural mechanism remains unclear. Recently, neuroimaging-based classifiers for ASD and typically developed (TD) individuals were developed to identify the abnormality of functional connections (FCs). Due to over-fitting and interferential effects of varying measurement conditions and demographic distributions, no classifiers have been strictly validated for independent cohorts. Here we overcome these difficulties by developing a novel machine-learning algorithm that identifies a small number of FCs that separate ASD from TD individuals. The classifier achieves high accuracy for a Japanese discovery cohort and demonstrates a remarkable degree of generalization for two independent validation cohorts in the USA and Japan. The developed ASD classifier does not distinguish individuals with major depressive disorder and attention-deficit hyperactivity disorder from their controls but moderately distinguishes patients with schizophrenia from their controls. The results leave open the viable possibility of exploring neuroimaging-based dimensions quantifying the multiple-disorder spectrum. The final version of this article, as published in Nature Communications, can be viewed online at: https://www.nature.com/articles/ncomms1125

    Supervised Sparse Components Analysis with Application to Brain Imaging Data

    We propose a dimension-reduction method using supervised (multi-block) sparse (principal) component analysis. The method is first implemented through basis expansion of spatial brain images, and the scores are then reduced through regularized matrix decomposition to produce simultaneous data-driven selections of related brain regions, supervised by univariate composite scores representing linear combinations of covariates. Two advantages of the proposed method are that it identifies the associations between brain regions at the voxel level and that supervision is helpful for interpretation. The proposed method was applied to a study on Alzheimer’s disease (AD) that involved using multimodal whole-brain magnetic resonance imaging (MRI) and positron emission tomography (PET). For illustrative purposes, we demonstrate cases of both single- and multimodal brain imaging and longitudinal measurements.

    Genomic biomarker discovery in disease progression and therapy response in bladder cancer utilizing machine learning

    Cancer in all its forms of expression is a major cause of death. To identify the genomic reason behind cancer, discovery of biomarkers is needed. In this paper, genomic data of bladder cancer are examined for the purpose of biomarker discovery. Genomic biomarkers are indicators stemming from the study of the genome, either at a very low level based on the genome sequence itself, or more abstractly, such as measuring the level of gene expression for different disease groups. The latter method is pivotal for this work, since the available datasets consist of RNA sequencing data, transformed to gene expression levels, as well as data on a multitude of clinical indicators. Based on this, various methods are utilized in the experiments leading to biomarker discovery: statistical modeling via logistic regression and regularization techniques (elastic-net), clustering, survival analysis through Kaplan–Meier curves, and heatmaps. The experiments have led to the discovery of two gene signatures capable of predicting therapy response and disease progression with considerable accuracy for bladder cancer patients. The signatures correlate well with clinical indicators such as Therapy Response and T-Stage at surgery, and with Disease Progression in a time-to-event manner.
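
    The Kaplan–Meier survival analysis mentioned in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' code: the function name `kaplan_meier` and the toy follow-up times below are hypothetical, and a real analysis would more likely use an established survival-analysis library.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the survival function.

    durations: follow-up time for each patient.
    events: 1 if the event (e.g. disease progression) was observed,
            0 if the patient was censored at that time.
    Returns a list of (time, survival_probability) pairs, one per
    distinct time at which at least one event was observed.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")

    observed = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # events and censorings both leave the risk set

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = observed.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: five patients, two of them censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for time, prob in toy_curve:
    print(f"t={time}: S(t)={prob:.2f}")
```

    Patients censored at a given time are counted as at risk at that time and removed afterwards, which is the usual convention when events and censorings tie.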