15 research outputs found

    Network-based stratification of tumor mutations.

    Many forms of cancer have multiple subtypes with different causes and clinical outcomes. Somatic tumor genome sequences provide a rich new source of data for uncovering these subtypes but have proven difficult to compare, as two tumors rarely share the same mutations. Here we introduce network-based stratification (NBS), a method to integrate somatic tumor genomes with gene networks. This approach allows for stratification of cancer into informative subtypes by clustering together patients with mutations in similar network regions. We demonstrate NBS in ovarian, uterine and lung cancer cohorts from The Cancer Genome Atlas. For each tissue, NBS identifies subtypes that are predictive of clinical outcomes such as patient survival, response to therapy or tumor histology. We identify network regions characteristic of each subtype and show how mutation-derived subtypes can be used to train an mRNA expression signature, which provides similar information in the absence of DNA sequence.
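The idea of comparing patients by "mutations in similar network regions" can be illustrated with a small sketch. It assumes a random-walk-with-restart style of network smoothing followed by a cosine similarity between patients; the abstract itself does not spell out these steps, and the toy network, the patient names, the `alpha` value and the helper names `propagate` and `cosine` are all made up for illustration.

```python
from math import sqrt

def propagate(mutations, adjacency, alpha=0.7, iters=50):
    """Smooth a binary mutation profile over an undirected gene network.

    mutations: dict gene -> 0/1 (genes absent from the dict count as 0)
    adjacency: dict gene -> set of neighbouring genes
    alpha:     share of the score taken from neighbours at each step (assumed value)
    """
    genes = sorted(adjacency)
    f0 = {g: float(mutations.get(g, 0)) for g in genes}
    f = dict(f0)
    for _ in range(iters):
        nxt = {}
        for g in genes:
            # Each neighbour splits its current score equally among its own neighbours.
            spread = sum(f[n] / len(adjacency[n]) for n in adjacency[g])
            nxt[g] = alpha * spread + (1 - alpha) * f0[g]
        f = nxt
    return f

def cosine(u, v):
    """Cosine similarity between two profiles defined over the same genes."""
    dot = sum(u[g] * v[g] for g in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical network: a chain A-B-C and a separate pair D-E.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}, "D": {"E"}, "E": {"D"}}
# Three hypothetical patients, none of whom share a mutated gene.
patients = {"p1": {"A": 1}, "p2": {"C": 1}, "p3": {"D": 1}}

smoothed = {p: propagate(m, network) for p, m in patients.items()}
sim_12 = cosine(smoothed["p1"], smoothed["p2"])  # same network region
sim_13 = cosine(smoothed["p1"], smoothed["p3"])  # disconnected regions
```

In this toy case the raw profiles of `p1` and `p2` have no gene in common, yet their smoothed profiles overlap because both mutations sit on the same chain, while `p1` and `p3` remain dissimilar. Any clustering method could then be run on the smoothed profiles.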

    Algorithms for complex systems in the life sciences: AI for gene fusion prioritization and multi-omics data integration

    As genomic and biological data continue to grow in volume and complexity, new computer science techniques are needed to analyse these data and extract their main features. This thesis designs and develops bioinformatics methods for complex systems in the life sciences, with the aim of providing informative models of biological processes. It is divided into two main sub-topics. The first concerns machine and deep learning techniques applied to the analysis of aberrant genetic sequences such as gene fusions. The second is the development of statistical and deep learning techniques for integrating heterogeneous biological and clinical data. Regarding the first sub-topic, a gene fusion is a biological event in which two distinct regions of the DNA join to create a new fused gene. Gene fusions matter in medicine because many of them are involved in cancer, and some can even be used as cancer predictors. However, not all of them are oncogenic. The first part of this thesis is devoted to the automated recognition of oncogenic gene fusions, an open and challenging problem in the analysis of cancer development. In this context, an automated model has been developed that recognizes oncogenic gene fusions relying exclusively on the amino acid sequence of the resulting proteins. The main contributions are: (1) the creation of a dedicated database used to train and test the model; (2) the design and implementation of a predictive model based on a Convolutional Neural Network (CNN) followed by a bidirectional Long Short-Term Memory (LSTM) network; (3) an extensive comparative analysis with other reference tools in the literature; (4) the implementation and release of an automated tool for gene fusion prioritization downstream of gene fusion detection tools.
Since the previous approach does not consider post-transcriptional regulation effects, new biological features were added (e.g., microRNA data, gene ontologies, and transcription factors) to improve overall performance, and a new integrated approach based on a multi-layer perceptron (MLP) was designed. Extensive comparisons with other methods from the literature were then carried out. These contributions led to an improved model that outperforms the previous ones and is competitive with state-of-the-art tools. The rationale behind the second sub-topic is the following: owing to the widespread adoption of Next Generation Sequencing (NGS) technologies, a large amount of heterogeneous, complex data on several diseases and on healthy individuals is now available (e.g., RNA-seq, gene expression data, miRNA expression data, methylation sequencing data, and many others). Each of these data types is called an omic, and their integrative study is called multi-omics. In this context, the aim is to integrate multi-omics data involving thousands of features (genes, microRNAs) and to identify which of them are relevant for a specific biological process. From a computational point of view, finding the best strategies for multi-omics analysis and relevant feature identification remains an open challenge. The first chapter dedicated to this second sub-topic focuses on the integrative analysis of gene expression and connectivity data from mouse brains using machine learning techniques. The rationale behind this study is to explore whether the degree of physical connection between brain regions can be evaluated from their gene expression data. Many studies have considered the functional connection of two or more brain areas (which areas are activated in response to a specific stimulus), whereas analyzing physical connections (i.e., axon bundles) starting from gene expression data is still an open problem.
Although this study is scientifically very relevant to understanding how the human brain functions, ethical reasons strongly limit the availability of samples. For this reason, several studies have been carried out on the mouse brain, which is anatomically similar to the human one. The neuronal connection data of mouse brains (obtained with viral tracers) were processed to identify physically connected brain regions and then analyzed together with the gene expression data of these areas. A multi-layer perceptron was applied to classify regions as connected or unconnected, taking gene expression data as input. Furthermore, a second model was created to infer the degree of connection between distinct brain regions. The implemented models successfully performed the binary classification task (connected versus unconnected regions) and distinguished the intensity of the connection as low, medium, or high. A second chapter describes a statistical method to reveal pathology-determining microRNA targets in multi-omic datasets. In this work, two multi-omics datasets are used: a breast cancer dataset and a medulloblastoma dataset. Both are composed of miRNA, mRNA, and proteomics data from the same patients. The main computational contribution is the design and implementation of an algorithm based on conditional probability that infers the impact of miRNA post-transcriptional regulation on target genes by exploiting protein expression values. The developed methodology allowed a more in-depth understanding and identification of target genes. The identified targets also proved to be significantly enriched in three well-known databases (miRDB, TargetScan, and miRTarBase), leading to relevant biological insights. Another chapter deals with the classification of multi-omics samples.
The main approaches in the literature either integrate all the features available for each sample upstream of the classifier (early integration) or create a separate classifier for each omic and subsequently define a set of consensus rules (late integration). In this context, the main contribution is to introduce probability into the consensus by creating a model based on Bayesian and MLP networks, so that the consensus is guided by the class label and its probability. This approach showed that a probabilistic late integration classification is more specific than an early integration approach and can identify samples outside the training domain. Class labels can help to provide new molecular profiles and patient categorizations, but they are not always available. A dedicated chapter therefore addresses the need to cluster samples based on their intrinsic characteristics. In the literature, multi-omic clustering is mainly addressed by graph-based methods or by methods based on multidimensional data reduction. The main contribution to this field is a deep learning model that implements an MLP with a specifically designed loss function. The network represents the input samples in a reduced dimensional space, and the loss is calculated from the intra-cluster and inter-cluster distances at each epoch. This approach achieved performance comparable to that of the most widely cited methods in the literature while avoiding pre-processing steps for feature selection or dimensionality reduction. Moreover, it places no limit on the number of omics to integrate.
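The probabilistic late integration idea in this abstract can be illustrated with a minimal sketch. It is not the thesis's Bayesian/MLP model: it simply assumes that each omic already has its own classifier emitting class probabilities, combines them with a naive product rule, and withholds a label when the combined confidence is low. The function name `late_integration`, the `threshold` value and the example probabilities are hypothetical.

```python
def late_integration(per_omic_probs, threshold=0.6):
    """Combine per-omic class probabilities into one consensus call.

    per_omic_probs: list of dicts, one per omic, mapping class label -> probability
    threshold:      minimum consensus probability required to emit a label (assumed)
    Returns (label or None, consensus probability of the best class).
    """
    classes = sorted(per_omic_probs[0])
    combined = {c: 1.0 for c in classes}
    for probs in per_omic_probs:
        for c in classes:
            # Product rule: treats the omics as independent sources of evidence.
            combined[c] *= probs[c]
    total = sum(combined.values())
    if total == 0:
        return None, 0.0
    posterior = {c: v / total for c, v in combined.items()}
    best = max(posterior, key=posterior.get)
    confidence = posterior[best]
    # Low confidence is used here as a crude flag for "outside the training domain".
    return (best if confidence >= threshold else None), confidence

# Two hypothetical omics (e.g. mRNA and miRNA classifiers) that agree.
agree = [{"tumor": 0.9, "normal": 0.1}, {"tumor": 0.8, "normal": 0.2}]
# The same two omics giving contradictory evidence.
conflict = [{"tumor": 0.9, "normal": 0.1}, {"tumor": 0.1, "normal": 0.9}]

label_agree, conf_agree = late_integration(agree)
label_conflict, conf_conflict = late_integration(conflict)
```

When the omics agree, the consensus is more confident than either classifier alone; when they contradict each other, the consensus probability falls to 0.5 and no label is returned. An early integration approach, by contrast, would concatenate all features before a single classifier and expose no per-omic disagreement of this kind.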

    Visual Analysis for MicroRNA and mRNA Expression Profile Data

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2016. Jinwook Seo (서진욱). MicroRNAs (miRNAs) are short nucleotide sequences that down-regulate their target genes. Various miRNA target prediction algorithms have used sequence complementarity between a miRNA and its targets. More recently, other algorithms have tried to improve sequence-based miRNA target prediction by exploiting miRNA-mRNA expression profile data. Some web-based tools have also been introduced to help researchers predict miRNA targets from miRNA-mRNA expression profile data. However, there is still demand for a miRNA-mRNA visual analysis tool that includes high-quality miRNA target prediction algorithms and more interactive visualizations. We present two techniques for visualizing miRNA-mRNA interactions: Bipartite Treemap and an enhanced node-link diagram. Bipartite Treemap is a new visualization technique for miRNA-mRNA interaction networks that resolves the occlusion problem. The enhanced node-link diagram provides interaction techniques that help users explore a miRNA-mRNA interaction network easily. We designed and implemented miRTarVis, an interactive visual analysis tool that predicts miRNA targets by integrating sequence-based and expression-profile-based miRNA target prediction algorithms and visualizes the resulting miRNA-mRNA interaction network. The miRTarVis interface follows the analysis procedure of load, filter, predict, and visualize. It predicts miRNA targets using Bayesian inference and MINE analyses as well as conventional correlation and mutual information analyses. It visualizes the resulting miRNA-mRNA network in an interactive Bipartite Treemap as well as an enhanced node-link diagram. Using miRTarVis, we analyzed miRNA-mRNA expression profile data from an experiment on asthmatic and non-asthmatic fibroblasts exposed to obese visceral exosomes. In addition, we applied miRTarVis to miRNA-mRNA expression profile data from breast cancer cell lines to show its efficacy.
miRTarVis demonstrated its efficacy by helping users run miRNA target prediction easily and gain insights from miRNA-mRNA expression profile data through its interactive visualizations. Contents: Chapter 1 Introduction (1.1 Background and Motivation; 1.2 Main Contribution; 1.3 Organization of the Dissertation). Chapter 2 miRNA Target Prediction (2.1 MicroRNA Target Prediction Algorithms; 2.1.1 Sequence-Based Target Prediction Algorithms; 2.1.2 miRNA-mRNA Expression Profile Based Target Prediction Algorithms; 2.2 Analysis Tools for Integrated Analysis of miRNA and mRNA). Chapter 3 Bipartite Treemap and Enhanced Node-Link Diagram for miRNA-mRNA Interaction Network (3.1 Visual Representation of Bipartite Treemap; 3.2 Node-Link Diagram with Enhanced Interaction and Various Graph Layouts; 3.3 Interfaces and Interaction Design for Bipartite Treemap and Enhanced Node-Link Diagram; 3.4 Comparison with Other Visualization Techniques for miRNA-mRNA Interaction Network). Chapter 4 miRTarVis (4.1 Design Goals and Rationale; 4.2 Input Data; 4.3 miRNA Target Prediction and Analysis Procedure; 4.4 Visualizations in miRTarVis; 4.5 Implementation). Chapter 5 Case Study (5.1 Analysis of miRNA-mRNA Expression Profile Data from Asthmatic and Non-asthmatic Cells by miRTarVis; 5.2 Analysis of miRNA-mRNA Expression Profile Data using TCGA Breast Cancer Dataset). Chapter 6 Discussion. Chapter 7 Conclusion. Bibliography. Abstract (in Korean).
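The "conventional correlation" analysis mentioned in this abstract can be sketched as follows. This is not miRTarVis code. It assumes that candidate miRNA-gene pairs come from a sequence-based predictor and that a pair is kept only when the two expression profiles are strongly anti-correlated, which matches the statement that miRNAs down-regulate their targets. The names `pearson` and `predict_targets`, the `cutoff` value, and the miRNA and gene identifiers are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def predict_targets(mirna_expr, mrna_expr, candidates, cutoff=-0.8):
    """Keep candidate (miRNA, gene) pairs whose profiles are anti-correlated.

    mirna_expr, mrna_expr: dict name -> list of expression values over the same samples
    candidates:            iterable of (miRNA, gene) pairs, e.g. from a sequence-based tool
    cutoff:                keep pairs with correlation <= cutoff (assumed value)
    """
    hits = []
    for mirna, gene in candidates:
        r = pearson(mirna_expr[mirna], mrna_expr[gene])
        if r <= cutoff:
            hits.append((mirna, gene, round(r, 3)))
    # Most strongly anti-correlated pairs first.
    return sorted(hits, key=lambda h: h[2])

# Hypothetical expression profiles over five samples.
mirna_expr = {"miR-x": [1, 2, 3, 4, 5]}
mrna_expr = {
    "GENE1": [10, 8, 6, 4, 2],  # falls as miR-x rises
    "GENE2": [1, 2, 3, 4, 5],   # rises with miR-x
    "GENE3": [5, 5, 5, 5, 5],   # flat
}
candidates = [("miR-x", "GENE1"), ("miR-x", "GENE2"), ("miR-x", "GENE3")]

hits = predict_targets(mirna_expr, mrna_expr, candidates)
```

Only the `GENE1` pair survives the filter here. The resulting list of pairs is the kind of miRNA-mRNA interaction network that a bipartite view or a node-link diagram would then display.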

    Learning by Fusing Heterogeneous Data

    It has become increasingly common in science and technology to gather data about systems at different levels of granularity or from different perspectives. This often gives rise to data that are represented in totally different input spaces. A basic premise behind the study of learning from heterogeneous data is that in many such cases, there exists some correspondence among certain input dimensions of different input spaces. In our work we found that a key bottleneck that prevents us from better understanding and truly fusing heterogeneous data at large scales is identifying the kind of knowledge that can be transferred between related data views, entities and tasks. We develop interesting and accurate data fusion methods for predictive modeling, which reduce or entirely eliminate some of the basic feature engineering steps that were needed in the past when inferring prediction models from disparate data. In addition, our work has a wide range of applications of which we focus on those from molecular and systems biology: it can help us predict gene functions, forecast pharmacological actions of small chemicals, prioritize genes for further studies, mine disease associations, detect drug toxicity and regress cancer patient survival data. Another important aspect of our research is the study of latent factor models. We aim to design latent models with factorized parameters that simultaneously tackle multiple types of data heterogeneity, where data diversity spans across heterogeneous input spaces, multiple types of features, and a variety of related prediction tasks. Our algorithms are capable of retaining the relational structure of a data system during model inference, which turns out to be vital for good performance of data fusion in certain applications. Our recent work included the study of network inference from many potentially nonidentical data distributions and its application to cancer genomic data. 
We also model epistasis, an important concept in genetics, and propose algorithms that efficiently find the ordering of genes in cellular pathways. A central topic of our thesis is also the analysis of large data compendia, since predictions about certain phenomena, such as associations between diseases and the involvement of genes in a certain phenotype, are only possible when dealing with large amounts of data. Among others, we analyze 30 heterogeneous data sets to assess drug toxicity and over 40 human gene association data collections, the largest number of data sets considered by a collective latent factor model to date. We also make interesting observations about deciding which data should be considered for fusion and develop a generic approach that can estimate the sensitivities between different data sets.

    COMPUTATIONAL MODELING OF GENE REGULATION, GAMETE FORMATION, AND EMBRYO IMPLANTATION

    DNA located in genes is transcribed into RNA which is translated into protein. The regulation of transcription and translation is carried out by several factors including a gene’s primary sequence, cis-regulatory elements (CREs) in non-coding DNA regions, epigenetic marks on the histones which compact DNA, and trans-binding factors (or proteins). The differential expression of a gene is crucial for establishing lineage-specific cell identity and phenotypic variability. Mutation or dysregulation may lead to natural variation within a population or aberrant gene expression and disease; trait-associated variation is known to be enriched in putative CREs, supporting their role in the origins of disease. Understanding the mechanisms by which CREs interact with one another and their cellular environment to regulate transcription may inform knowledge of biological pathways and provide a crucial foundation for developing new treatments. Further, because all DNA is passed to an offspring from their parents, it is important to understand not just the outcomes on expression due to coding and non-coding variation, but also how genetic material is passed to future generations. These dissertation chapters apply modeling approaches to large amounts of genetic and gene expression data in order to 1) better understand how the sequence and epigenetic makeup of CREs impact gene expression within hematopoiesis; 2) scan for selfish genetic elements which are preferentially passed to offspring within human sperm samples; and 3) predict implantation success for euploid embryos given gene expression profiles. Our models within Chapters 2-4 describe the impact of CREs within the blood cell lineage, connecting CREs to putative target genes, and establishing that the hematopoietic CREs were enriched for blood-trait associated genetic variation. Within Chapter 5, we find no compelling evidence of selfish genetic elements within a large sample of human sperm. 
Finally, within Chapter 6, we identify some genes which seem to impact the success of IVF embryo implantation by acting through regulation of translation.

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality.

    Correction to: RNA Bioinformatics.
