
    Large-Scale and Pan-Cancer Multi-omic Analyses with Machine Learning

    Multi-omic data analysis has been foundational in many fields of molecular biology, including cancer research. Investigation of the relationships between different omic data types reveals patterns that cannot be found in any single data type alone. With recent technological advancements in mass spectrometry (MS), MS-based proteomics has enabled the quantification of thousands of proteins in hundreds of cell lines and human tissue samples. This thesis presents several machine learning-based methods that facilitate the integrative analysis of multi-omic data. First, we reviewed five existing multi-omic data integration methods and benchmarked them on a large-scale multi-omic cancer cell line dataset, evaluating their performance on drug response prediction and cancer type classification. Our results provide recommendations to researchers on selecting the optimal machine learning method for their applications. Second, we generated a pan-cancer proteomic map of 949 cancer cell lines across 40 cancer types and developed DeeProM, a machine learning method for analysing the multi-omic information of these lines. This pan-cancer proteomic map (ProCan-DepMapSanger) is now publicly available and represents a major resource for the scientific community, for biomarker discovery and for the study of fundamental aspects of protein regulation. Third, we focused on publicly available multi-omic datasets of both cancer cell lines and human tissue samples and developed DeePathNet, a Transformer-based deep learning method that integrates human knowledge with machine intelligence. We applied DeePathNet to three evaluation tasks: drug response prediction, cancer type classification and breast cancer subtype classification. Taken together, our analyses and methods enable more accurate cancer diagnosis and prognosis.
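    The benchmarking setup described above can be pictured as a minimal early-fusion baseline: concatenate the omic layers measured on the same cell lines into one feature matrix and fit a regressor for drug response. Everything below is a sketch with synthetic data and illustrative layer sizes, not the thesis's actual pipeline or the DeeProM/DeePathNet methods.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for three omic layers measured on the same cell lines
n_lines = 200
proteomics = rng.normal(size=(n_lines, 50))       # protein abundances
transcriptomics = rng.normal(size=(n_lines, 80))  # gene expression
mutations = rng.integers(0, 2, size=(n_lines, 30)).astype(float)  # binary calls

# Early fusion: concatenate the layers into a single feature matrix
X = np.hstack([proteomics, transcriptomics, mutations])

# Mock drug response driven by a subset of protein features plus noise
y = X[:, :10].sum(axis=1) + rng.normal(scale=0.5, size=n_lines)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print(f"held-out R^2: {r2:.2f}")
```

    More sophisticated integration methods differ mainly in replacing the naive concatenation step with learned joint representations.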

    Deep learning models for modeling cellular transcription systems

    The cellular signal transduction system (CSTS) plays a fundamental role in maintaining cellular homeostasis by detecting changes in the environment and orchestrating responses. Perturbations of the CSTS lead to diseases such as cancer. Almost all CSTSs are involved in regulating the expression of certain genes, leading to signature changes in gene expression. The gene expression profile of a cell is therefore a readout of the state of its CSTS and can be used to infer that state. However, a gene expression profile is a convoluted mixture of the responses to all active signaling pathways in a cell, which makes it difficult to identify the genes associated with an individual pathway. An efficient way of de-convoluting the signals embedded in the gene expression profile is needed. In the first part of the thesis, we applied Pearson correlation analysis to study cellular signals transduced from ceramide species (lipids) to genes. We found significant correlations between specific ceramide species or ceramide groups and gene expression, and showed that various dihydroceramide families regulated distinct subsets of target genes predicted to participate in distinct biological processes. However, signaling pathways are known to be hierarchically structured, and useful information may not be fully captured if only linear models are used to study the CSTS; more complex non-linear models are needed to represent this hierarchical structure. This motivated us to investigate contemporary deep learning models (DLMs). We then applied various deep hierarchical models to learn a distributed representation of the statistical structures embedded in transcriptomic data. The models learn and represent the hierarchical organization of the transcriptomic machinery, and provide an abstract representation of the statistical structure of transcriptomic data with flexibility and different degrees of granularity.
We showed that deep hierarchical models were capable of learning biologically sensible representations of the data (e.g., hidden units in the first hidden layer could represent transcription factors) and of revealing novel insights into the machinery regulating gene expression. We also showed that the models outperformed state-of-the-art methods such as elastic-net linear regression, support vector machines and non-negative matrix factorization.
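    The correlation screen between ceramide species and gene expression described above can be illustrated with a minimal sketch. The sample counts, gene names and the planted association below are entirely hypothetical, not the study's data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Hypothetical measurements: one ceramide species across 60 samples,
# and expression of five candidate genes in the same samples
ceramide = rng.normal(size=60)
genes = {f"gene_{i}": rng.normal(size=60) for i in range(4)}
genes["gene_target"] = 0.8 * ceramide + rng.normal(scale=0.5, size=60)  # planted association

# Pearson correlation (with p-value) between the lipid and each gene
results = {}
for name, expr in genes.items():
    r, p = pearsonr(ceramide, expr)
    results[name] = (r, p)

# The planted target should show the strongest correlation
best = max(results, key=lambda k: abs(results[k][0]))
print(best, results[best])
```

    In practice such a screen runs over thousands of genes, so the p-values would need multiple-testing correction; this linear screen is exactly the limitation that motivates the non-linear hierarchical models discussed above.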

    Network-based methods for biological data integration in precision medicine

    The vast and continuously increasing volume of biomedical data produced during the last decades opens new opportunities for large-scale modelling of disease biology, facilitating a more comprehensive and integrative understanding of its processes. Nevertheless, this type of modelling requires highly efficient computational systems capable of dealing with such data volumes. Computational approaches commonly used in machine learning and data analysis, namely dimensionality reduction and network-based methods, have been developed with the goal of effectively integrating biomedical data. Among these, network-based machine learning stands out due to its major advantage in biomedical interpretability, providing a highly intuitive framework for the integration and modelling of biological processes. This PhD thesis explores the potential of integrating complementary biomedical knowledge with patient-specific data to provide novel computational approaches for biomedical scenarios characterized by data scarcity. The primary focus is on studying how high-order graph analysis (i.e., community detection in multiplex and multilayer networks) may help elucidate the interplay of different types of data in contexts where statistical power is heavily limited by small sample sizes, such as rare diseases and precision oncology. Among the several data integration approaches with the potential to achieve this task, network biology can play a pivotal role in addressing this challenge, given its advantages in molecular interpretability.
Through its insights and methodologies, this thesis shows how network biology, and in particular models based on multilayer networks, brings the vision of precision medicine to these complex scenarios, providing a natural approach for discovering new biomedical relationships that overcomes the difficulties of studying cohorts with limited sample sizes (data-scarce scenarios). Delving into the potential of current artificial intelligence (AI) and network biology applications to address data granularity issues in precision medicine, the thesis presents research works, based on multilayer networks, for the analysis of two rare disease scenarios with specific data granularities, effectively overcoming the classical constraints hindering rare disease and precision oncology research. The first research article presents a personalized medicine study of the molecular determinants of severity in congenital myasthenic syndromes (CMS), a group of rare disorders of the neuromuscular junction (NMJ). The analysis of severity in rare diseases, despite its importance, is typically neglected due to limited data availability. In this study, modelling of biomedical knowledge via multilayer networks made it possible to understand the functional implications of individual mutations in the cohort under study, as well as their relationships with the causal mutations of the disease and the different levels of severity observed. Moreover, the study presents experimental evidence of the role of a previously unsuspected gene in NMJ activity, validating the hypothetical role predicted using the newly introduced methodologies. The second research article focuses on the applicability of multilayer networks for gene prioritization.
Extending the concepts for the analysis of different data granularities first introduced in the previous article, this research provides a methodology based on the persistence of network community structures across a range of modularity resolutions, effectively providing a new framework for gene prioritization and patient stratification. In summary, this PhD thesis presents major advances in the use of multilayer network-based approaches for applying precision medicine to data-scarce scenarios, exploring the potential of integrating extensive available biomedical knowledge with patient-specific data.
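    One way to picture the persistence idea behind this multilayer approach is to keep only edges supported across all layers of a multiplex network and read communities off the result. The layers, gene names and the simple connected-components step below are illustrative stand-ins (stdlib only), not the published community-detection methodology.

```python
from collections import defaultdict, deque

# Hypothetical two-layer multiplex over the same genes: a protein-interaction
# layer and a co-expression layer (edge lists; all names illustrative)
layers = {
    "ppi":   {("A", "B"), ("B", "C"), ("D", "E"), ("C", "A")},
    "coexp": {("A", "B"), ("B", "C"), ("E", "F"), ("A", "C")},
}

# Count in how many layers each undirected edge appears, then keep only
# edges supported by every layer -- a crude "persistence" filter
support = defaultdict(int)
for edges in layers.values():
    for u, v in edges:
        support[frozenset((u, v))] += 1
persistent = [tuple(e) for e, c in support.items() if c == len(layers)]

# Communities = connected components of the persistent edge set (BFS)
adj = defaultdict(set)
for u, v in persistent:
    adj[u].add(v)
    adj[v].add(u)

seen, communities = set(), []
for node in adj:
    if node in seen:
        continue
    comp, queue = set(), deque([node])
    while queue:
        n = queue.popleft()
        if n in comp:
            continue
        comp.add(n)
        queue.extend(adj[n] - comp)
    seen |= comp
    communities.append(comp)

print(communities)
```

    The actual methodology replaces the connected-components step with modularity-based community detection and varies the resolution parameter, retaining communities that persist across resolutions.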

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives, which often gives rise to data represented in entirely different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases there exists some correspondence among certain input dimensions of the different input spaces. In our work we found that a key bottleneck preventing us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop accurate data fusion methods for predictive modeling that reduce or entirely eliminate some of the basic feature engineering steps previously needed when inferring prediction models from disparate data. Our work has a wide range of applications, of which we focus on those from molecular and systems biology: it can help predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further study, mine disease associations, detect drug toxicity and predict cancer patient survival. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans heterogeneous input spaces, multiple types of features and a variety of related prediction tasks. Our algorithms retain the relational structure of a data system during model inference, which turns out to be vital for good data fusion performance in certain applications. Our recent work includes the study of network inference from many potentially non-identical data distributions and its application to cancer genomic data.
We also model epistasis, an important concept from genetics, and propose algorithms to efficiently find the ordering of genes in cellular pathways. A central topic of our thesis is the analysis of large data compendia, as predictions about certain phenomena, such as associations between diseases or the involvement of genes in a certain phenotype, are only possible with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivity of a fused model to each of its constituent data sets.
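    The latent factor models discussed above can be illustrated with a single-relation non-negative matrix factorization; collective variants share latent factors across several such relation matrices simultaneously. The gene-disease matrix below is synthetic, with a planted low rank assumed known, and is not the authors' actual model.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)

# Toy gene-disease association matrix with planted rank-5 structure
W_true = rng.random((20, 5))   # 20 genes, 5 latent factors
H_true = rng.random((5, 15))   # 5 latent factors, 15 diseases
R = W_true @ H_true

# Factorize: rows (genes) and columns (diseases) get non-negative
# latent profiles whose product approximates the observed relation
model = NMF(n_components=5, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(R)
H = model.components_
R_hat = W @ H

rel_err = np.linalg.norm(R - R_hat) / np.linalg.norm(R)
print(f"relative reconstruction error: {rel_err:.3f}")
```

    In a collective setting the gene-side factors `W` would be reused across other gene-centric matrices (e.g., gene-function or gene-drug relations), which is how the relational structure of the data system is retained during inference.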

    Recent Developments in Cancer Systems Biology

    This ebook includes original research articles and reviews that update readers on state-of-the-art systems approaches used not only to discover novel diagnostic and prognostic biomarkers for several cancer types, but also to evaluate methodologies for mapping out important genomic signatures. In addition, therapeutic targets and drug repurposing are emphasized for a variety of cancer types. New and established researchers who wish to learn about cancer systems biology, and why it is possibly the leading front toward a personalized medicine approach, will enjoy reading this book.

    In Silico Methods for Drug Design and Discovery

    Computer-aided drug design (CADD) methodologies play an ever-increasing role in drug discovery and are critical to the cost-effective identification of promising drug candidates. These computational methods help limit the use of animal models in pharmacological research, aid the rational design of novel and safe drug candidates, and support the repositioning of marketed drugs, assisting medicinal chemists and pharmacologists throughout the drug discovery trajectory. Within this field of research, we launched a Research Topic in Frontiers in Chemistry in March 2019 entitled “In silico Methods for Drug Design and Discovery,” which involved two sections of the journal: Medicinal and Pharmaceutical Chemistry and Theoretical and Computational Chemistry. For the reasons mentioned, this Research Topic attracted the attention of scientists and received a large number of submitted manuscripts, of which 27 Original Research articles, five Review articles and two Perspective articles have been published. The Original Research articles cover most topics in CADD, reporting advanced in silico methods in drug discovery, while the Review articles offer a point of view on some computer-driven techniques applied to drug research. Finally, the Perspective articles provide a vision of specific computational approaches with an outlook on the modern era of CADD.