1,162 research outputs found

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Get PDF
    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    PACS: Prediction and analysis of cancer subtypes from multi-omics data based on a multi-head attention mechanism model

    Full text link
    Due to the high heterogeneity and clinical characteristics of cancer, there are significant differences in multi-omic data and clinical characteristics among different cancer subtypes. Therefore, accurate classification of cancer subtypes can help doctors choose the most appropriate treatment options, improve treatment outcomes, and provide more accurate patient survival predictions. In this study, we propose a supervised multi-head attention mechanism model (SMA) to classify cancer subtypes successfully. The attention mechanism and feature sharing module of the SMA model can successfully learn the global and local feature information of multi-omics data. Second, it enriches the parameters of the model by deeply fusing multi-head attention encoders from Siamese through the fusion module. Validated by extensive experiments, the SMA model achieves the highest accuracy, F1 macroscopic, F1 weighted, and accurate classification of cancer subtypes in simulated, single-cell, and cancer multiomics datasets compared to AE, CNN, and GNN-based models. Therefore, we contribute to future research on multiomics data using our attention-based approach.Comment: Submitted to BIBM202

    Machine Learning Models for Deciphering Regulatory Mechanisms and Morphological Variations in Cancer

    Get PDF
    The exponential growth of multi-omics biological datasets is resulting in an emerging paradigm shift in fundamental biological research. In recent years, imaging and transcriptomics datasets are increasingly incorporated into biological studies, pushing biology further into the domain of data-intensive-sciences. New approaches and tools from statistics, computer science, and data engineering are profoundly influencing biological research. Harnessing this ever-growing deluge of multi-omics biological data requires the development of novel and creative computational approaches. In parallel, fundamental research in data sciences and Artificial Intelligence (AI) has advanced tremendously, allowing the scientific community to generate a massive amount of knowledge from data. Advances in Deep Learning (DL), in particular, are transforming many branches of engineering, science, and technology. Several of these methodologies have already been adapted for harnessing biological datasets; however, there is still a need to further adapt and tailor these techniques to new and emerging technologies. In this dissertation, we present computational algorithms and tools that we have developed to study gene-regulation and cellular morphology in cancer. The models and platforms that we have developed are general and widely applicable to several problems relating to dysregulation of gene expression in diseases. Our pipelines and software packages are disseminated in public repositories for larger scientific community use. This dissertation is organized in three main projects. In the first project, we present Causal Inference Engine (CIE), an integrated platform for the identification and interpretation of active regulators of transcriptional response. The platform offers visualization tools and pathway enrichment analysis to map predicted regulators to Reactome pathways. We provide a parallelized R-package for fast and flexible directional enrichment analysis to run the inference on custom regulatory networks. Next, we designed and developed MODEX, a fully automated text-mining system to extract and annotate causal regulatory interaction between Transcription Factors (TFs) and genes from the biomedical literature. MODEX uses putative TF-gene interactions derived from high-throughput ChIP-Seq or other experiments and seeks to collect evidence and meta-data in the biomedical literature to validate and annotate the interactions. MODEX is a complementary platform to CIE that provides auxiliary information on CIE inferred interactions by mining the literature. In the second project, we present a Convolutional Neural Network (CNN) classifier to perform a pan-cancer analysis of tumor morphology, and predict mutations in key genes. The main challenges were to determine morphological features underlying a genetic status and assess whether these features were common in other cancer types. We trained an Inception-v3 based model to predict TP53 mutation in five cancer types with the highest rate of TP53 mutations. We also performed a cross-classification analysis to assess shared morphological features across multiple cancer types. Further, we applied a similar methodology to classify HER2 status in breast cancer and predict response to treatment in HER2 positive samples. For this study, our training slides were manually annotated by expert pathologists to highlight Regions of Interest (ROIs) associated with HER2+/- tumor microenvironment. Our results indicated that there are strong morphological features associated with each tumor type. Moreover, our predictions highly agree with manual annotations in the test set, indicating the feasibility of our approach in devising an image-based diagnostic tool for HER2 status and treatment response prediction. We have validated our model using samples from an independent cohort, which demonstrates the generalizability of our approach. Finally, in the third project, we present an approach to use spatial transcriptomics data to predict spatially-resolved active gene regulatory mechanisms in tissues. Using spatial transcriptomics, we identified tissue regions with differentially expressed genes and applied our CIE methodology to predict active TFs that can potentially regulate the marker genes in the region. This project bridged the gap between inference of active regulators using molecular data and morphological studies using images. The results demonstrate a significant local pattern in TF activity across the tissue, indicating differential spatial-regulation in tissues. The results suggest that the integrative analysis of spatial transcriptomics data with CIE can capture discriminant features and identify localized TF-target links in the tissue

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Prostate cancer screening research can benefit from network medicine: an emerging awareness

    Get PDF
    Up to date, screening for prostate cancer (PCa) remains one of the most appealing but also a very controversial topics in the urological community. PCa is the second most common cancer in men worldwide and it is universally acknowledged as a complex disease, with a multi-factorial etiology. The pathway of PCa diagnosis has changed dramatically in the last few years, with the multiparametric magnetic resonance (mpMRI) playing a starring role with the introduction of the “MRI Pathway”. In this scenario the basic tenet of network medicine (NM) that sees the disease as perturbation of a network of interconnected molecules and pathways, seems to fit perfectly with the challenges that PCa early detection must face to advance towards a more reliable technique. Integration of tests on body fluids, tissue samples, grading/staging classification, physiological parameters, MR multiparametric imaging and molecular profiling technologies must be integrated in a broader vision of “disease” and its complexity with a focus on early signs. PCa screening research can greatly benefit from NM vision since it provides a sound interpretation of data and a common language, facilitating exchange of ideas between clinicians and data analysts for exploring new research pathways in a rational, highly reliable, and reproducible way

    Specialized Named Entity Recognition For Breast Cancer Subtyping

    Get PDF
    The amount of data and analysis being published and archived in the biomedical research community is more than can feasibly be sifted through manually, which limits the information an individual or small group can synthesize and integrate into their own research. This presents an opportunity for using automated methods, including Natural Language Processing (NLP), to extract important information from text on various topics. Named Entity Recognition (NER), is one way to automate knowledge extraction of raw text. NER is defined as the task of identifying named entities from text using labels such as people, dates, locations, diseases, and proteins. There are several NLP tools that are designed for entity recognition, but rely on large established corpus for training data. Biomedical research has the potential to guide diagnostic and therapeutic decisions, yet the overwhelming density of publications acts as a barrier to getting these results into a clinical setting. An exceptional example of this is the field of breast cancer biology where over 2 million people are diagnosed worldwide every year and billions of dollars are spent on research. Breast cancer biology literature and research relies on a highly specific domain with unique language and vocabulary, and therefore requires specialized NLP tools which can generate biologically meaningful results. This thesis presents a novel annotation tool, that is optimized for quickly creating training data for spaCy pipelines as well as exploring the viability of said data for analyzing papers with automated processing. Custom pipelines trained on these annotations are shown to be able to recognize custom entities at levels comparable to large corpus based recognition

    Combining Molecular, Imaging, and Clinical Data Analysis for Predicting Cancer Prognosis

    Get PDF
    Cancer is one of the most detrimental diseases globally. Accordingly, the prognosis prediction of cancer patients has become a field of interest. In this review, we have gathered 43 stateof- the-art scientific papers published in the last 6 years that built cancer prognosis predictive models using multimodal data. We have defined the multimodality of data as four main types: clinical, anatomopathological, molecular, and medical imaging; and we have expanded on the information that each modality provides. The 43 studies were divided into three categories based on the modelling approach taken, and their characteristics were further discussed together with current issues and future trends. Research in this area has evolved from survival analysis through statistical modelling using mainly clinical and anatomopathological data to the prediction of cancer prognosis through a multi-faceted data-driven approach by the integration of complex, multimodal, and high-dimensional data containing multi-omics and medical imaging information and by applying Machine Learning and, more recently, Deep Learning techniques. This review concludes that cancer prognosis predictive multimodal models are capable of better stratifying patients, which can improve clinical management and contribute to the implementation of personalised medicine as well as provide new and valuable knowledge on cancer biology and its progression

    Survival-Related Clustering of Cancer Patients by Integrating Clinical and Biological Datasets

    Get PDF
    Subtype-based treatments and drug therapies are essential aspects to be considered in cancer patients\u27 clinical trials to provide appropriate personalized therapies. With the advancement of the next-generation sequencing technology, several computational models, integrating genomic and transcriptomic datasets (i.e., multi-omics) in the prediction of subtype-based classification in cancer patients, were emerged. However, integration of the prognostic features from the clinical data, related to survival risks with the multi-omics datasets in the prediction of different subtypes, is limited and an important research area to be explored. In this study, we proposed a data integration pipeline with the prognostic features from the clinical data and multi-omics datasets to predict the survival-risk-based subtypes in Kidney Renal Clear Cell Carcinoma (KIRC) patients from The Cancer Genome Atlas (TCGA) database. Firstly, we applied an unsupervised clustering algorithm on KIRC patients and clustered them into two survival-risk-based subgroups, i.e., subtypes. Then, using the clustering-based subtype labels as class labels for cancer patients, we trained a supervised classification model to determine the class label of un-labeled patients.In our clustering step, we applied multivariate Cox Proportional Hazard (Cox-PH) model to select the survival-related prognostically significant features (p-value \u3c 0.05) from the patients’ multivariate clinical data. Then, we used the Silhouette Coefficient to determine the optimal number (k) of the clusters. In our classification step, we integrated high dimensional multi-omics datasets with three different data modalities (such as gene expression, microRNA expression, and DNA methylation). We utilized a dimension-reduction approach, followed by a univariate Cox-PH for each reduced data modality with patients’ survival status. Then, we selected the survival-related reduced-omics-features in our classification model. In this step, we applied a supervised classification method with 10-fold cross-validation to check our survival-based subtype prediction accuracy. We tested multiple machine learning and deep learning algorithms in different steps of the pipeline for clustering (K-means, K-modes and, Gaussian mixture model), dimension-reduction (Denoising Autoencoder and Principal Component Analysis) and classification (Support Vector Machine and Random Forest) purposes. We proposed an optimized model with the highest survival-specific-subtype classification accuracy as the final model
    corecore