
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    One-class models for validation of miRNAs and ERBB2 gene interactions based on sequence features for breast cancer scenarios

    One challenge in miRNA–gene–disease interaction studies is the difficulty of finding labeled data that indicate a positive or negative relationship between miRNAs and genes. One-class classification methods offer a promising path for validating such interactions. We applied two one-class classification methods, Isolation Forest and One-class SVM, to validate miRNA interactions with the ERBB2 gene in breast cancer scenarios, using features extracted via sequence binding. We found that the One-class SVM outperforms the Isolation Forest model, with a sensitivity of 80.49% and a specificity of 86.49%, results that are comparable to previous studies. Additionally, we have demonstrated that combining features extracted with a sequence-based approach (considering miRNA and gene sequence binding characteristics) with one-class models is a feasible method for validating these genetic molecule interactions.
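The sensitivity and specificity figures quoted in the abstract above are standard confusion-matrix ratios. A minimal sketch of how they are computed follows. It assumes predictions coded +1 (interaction accepted) and -1 (rejected), which is the convention commonly used by one-class models; the function name and the toy labels are hypothetical and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded +1 / -1.

    sensitivity = TP / (TP + FN): share of true interactions accepted.
    specificity = TN / (TN + FP): share of non-interactions rejected.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels invented for illustration only (not results from the paper).
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, 1, -1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Because a one-class model is trained on a single class, a held-out set containing both classes is still needed to estimate both ratios.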

    Algorithms for pre-microRNA classification and a GPU program for whole genome comparison

    MicroRNAs (miRNAs) are non-coding RNAs of approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules, or pre-miRNAs, often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpins can be found in genomes. It is a challenge to distinguish real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID comprises three steps. First, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, from which a classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach and its superiority over existing methods. In the second part of this dissertation, a GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal of the research is to identify the commonalities and differences of two genomes from closely related organisms via multiple sequence alignment, using a seed-and-extend technique to choose reliable subsets of exact or near-exact matches, which are called anchors. A rigorous method, Smith-Waterman search, is applied for anchor seeking, but it takes days to months to map millions of bases for mammalian genome sequences. With GPU programming, which is designed to run hundreds of short functions called threads in parallel, a speedup of up to 100X is achieved over similar CPU executions.
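For readers unfamiliar with the Smith-Waterman search named in the abstract above, the following is a minimal single-threaded sketch of the textbook local-alignment recurrence with a linear gap penalty. It is not the dissertation's GPU program and does not include the seed-and-extend step; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                     # start a new alignment
                          H[i - 1][j - 1] + sub,  # match or mismatch
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTACGTTT", "GGACGTGG")
print(score, top, bottom)
```

The nested loop fills a table with one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That quadratic cost is what makes the exact method slow on genome-scale inputs and motivates parallel execution.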

    Algebraic shortcuts for leave-one-out cross-validation in supervised network inference

    Supervised machine learning techniques have traditionally been very successful at reconstructing biological networks, such as protein-ligand interaction, protein-protein interaction and gene regulatory networks. Many supervised techniques for network prediction use linear models on a possibly nonlinear pairwise feature representation of edges. Recently, much emphasis has been placed on the correct evaluation of such supervised models. It is vital to distinguish between using a model to predict new interactions in a given network and using it to predict interactions for a new vertex not present in the original network. This distinction matters because (i) the performance might differ dramatically between the prediction settings and (ii) tuning the model hyperparameters to obtain the best possible model depends on the setting of interest. Specific cross-validation schemes need to be used to assess the performance in such different prediction settings. In this work we discuss a state-of-the-art kernel-based network inference technique called two-step kernel ridge regression. We show that this regression model can be trained efficiently, with a time complexity scaling with the number of vertices rather than the number of edges. Furthermore, this framework leads to a series of cross-validation shortcuts that allow one to rapidly estimate the model performance for any relevant network prediction setting. This allows computational biologists to fully assess the capabilities of their models.
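The idea of an algebraic leave-one-out shortcut can be illustrated with the classical identity for ordinary ridge regression, where each held-out residual equals the full-fit residual divided by one minus the corresponding diagonal entry of the hat matrix. The sketch below uses a single feature and no intercept to stay dependency-free. It is an assumption-laden toy and not the two-step kernel ridge regression discussed in the abstract; the function names and data are hypothetical.

```python
def ridge_loo_shortcut(x, y, lam):
    """Leave-one-out residuals from ONE ridge fit (single feature, no intercept)."""
    s = sum(xi * xi for xi in x) + lam            # x^T x + lambda
    w = sum(xi * yi for xi, yi in zip(x, y)) / s  # ridge weight on all data
    residuals = []
    for xi, yi in zip(x, y):
        h_ii = xi * xi / s                        # diagonal of the hat matrix
        residuals.append((yi - w * xi) / (1.0 - h_ii))
    return residuals


def ridge_loo_naive(x, y, lam):
    """Same residuals obtained by refitting n times, for comparison."""
    residuals = []
    for i in range(len(x)):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        w = sum(a * b for a, b in zip(xs, ys)) / (sum(a * a for a in xs) + lam)
        residuals.append(y[i] - w * x[i])
    return residuals


# Toy data invented for illustration; lam must be positive.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
fast = ridge_loo_shortcut(x, y, lam=0.1)
slow = ridge_loo_naive(x, y, lam=0.1)
print(max(abs(f - s) for f, s in zip(fast, slow)))
```

The shortcut needs one fit instead of one fit per held-out sample, which is the kind of saving that makes exhaustive cross-validation practical.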

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Network-Based Biomarker Discovery: Development of Prognostic Biomarkers for Personalized Medicine by Integrating Data and Prior Knowledge

    Advances in genome science and technology offer a deeper understanding of biology while at the same time improving the practice of medicine. Expression profiling of some diseases, such as cancer, allows for identifying marker genes, which may help to diagnose a disease or predict future disease outcomes. Marker genes (biomarkers) are selected by scoring how well their expression levels can discriminate between different classes of disease or between groups of patients with different clinical outcomes (e.g., therapy response, survival time, etc.). A current challenge is to identify new markers that are directly related to the underlying disease mechanism.

    Medical Image Analytics (Radiomics) with Machine/Deeping Learning for Outcome Modeling in Radiation Oncology

    Image-based quantitative analysis (radiomics) has gained great attention recently. Radiomics has promising potential for application in the clinical practice of radiotherapy and for providing personalized healthcare for cancer patients. However, there are several challenges along the way that this thesis attempts to address. Specifically, this thesis focuses on the investigation of the repeatability and reproducibility of radiomics features, the development of new machine/deep learning models, and the combination of these for robust outcome modeling and its applications in radiotherapy. Radiomics features suffer from robustness issues when applied to outcome modeling problems, especially in head and neck computed tomography (CT) images. These images tend to contain streak artifacts due to patients’ dental implants. To investigate the influence of artifacts on radiomics modeling performance, we first developed an automatic artifact detection algorithm using gradient-based hand-crafted features. We then compared the radiomics models trained on ‘clean’ and ‘contaminated’ datasets. The second project focused on using hand-crafted radiomics features and conventional machine learning methods for the prediction of overall response and progression-free survival for Y90-treated liver cancer patients. By identifying robust features, embedding prior knowledge in the engineered radiomics features, and using bootstrapped LASSO for feature selection, we trained imaging- and dose-based models for the desired clinical endpoints, highlighting the complementary nature of this information in Y90 outcome prediction. Combining hand-crafted and machine-learnt features can take advantage of both expert domain knowledge and advanced data-driven approaches (e.g., deep learning).
    Thus, in the third project, we proposed a new variational autoencoder network framework that modeled radiomics features, clinical factors, and raw CT images for the prediction of intrahepatic recurrence-free and overall survival for hepatocellular carcinoma (HCC) patients. The proposed approach was compared with the widely used Cox proportional hazards model for survival analysis. Our proposed methods achieved significant improvement in prediction as measured by the c-index, highlighting the value of advanced modeling techniques in learning from limited and heterogeneous information in actuarial prediction of outcomes. Advances in stereotactic body radiation therapy (SBRT) have led to excellent local tumor control with limited toxicities for HCC patients, but intrahepatic recurrence remains prevalent. As an extension of the third project, we hope to predict not only the time to intrahepatic recurrence but also the location where the tumor might recur. This will be clinically beneficial for better intervention and for optimizing decision making during radiotherapy treatment planning. To address this challenging task, we first proposed an unsupervised registration neural network to register atlas CT to patient simulation CT and obtain the liver’s Couinaud segments for the entire patient cohort. Second, a new attention convolutional neural network was applied to utilize multimodality images (CT, MR and 3D dose distribution) for the prediction of high-risk segments. The results showed much improved efficiency in obtaining segments compared with conventional registration methods, and the prediction performance showed promising accuracy for anticipating the recurrence location as well. Overall, this thesis contributed new methods and techniques to improve the utilization of radiomics for personalized radiotherapy.
    These contributions included a new algorithm for detecting artifacts, a joint model of dose with image heterogeneity, the combination of hand-crafted features with machine-learnt features for actuarial radiomics modeling, and a novel approach for predicting the location of treatment failure.
    PhD, Applied Physics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163092/1/liswei_1.pd
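The c-index used in the abstract above to compare survival models can be sketched as follows. This is a minimal version of Harrell's concordance index, assuming right-censored data, higher risk scores meaning earlier expected events, and pairs with tied times simply skipped. The function name and the toy values are hypothetical and unrelated to the thesis data.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs whose predicted risks are correctly ordered.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] is truthy) strictly before time j. Tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data invented for illustration: follow-up times, event flags, risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_mixed = concordance_index(times, events, [0.9, 0.1, 0.4, 0.7])
print(c_perfect, c_mixed)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so models are compared by how far above 0.5 they score on held-out patients.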