131 research outputs found

    Network-based identification of driver pathways in clonal systems

    Highly ethanol-tolerant bacteria for the production of biofuels, bacterial pathogens that are resistant to antibiotics, and cancer cells are examples of phenotypes that are of importance to society and are currently being studied. In order to better understand these phenotypes and their underlying genotype-phenotype relationships, it is now commonplace to investigate DNA and expression profiles using next-generation sequencing (NGS) and microarray techniques. These techniques generate large amounts of omics data, which result in lists of genes that have mutations or expression profiles that potentially contribute to the phenotype. These lists often include a multitude of genes and are troublesome to verify manually, as performing literature studies and wet-lab experiments for a large number of genes is very time- and resource-consuming. Therefore, (computational) methods are required that can narrow these gene lists down by removing the generally abundant false positives and that can ideally provide additional information on the relationships between the selected genes. Other high-throughput techniques such as yeast two-hybrid (Y2H), ChIP-Seq and ChIP-chip, but also a myriad of small-scale experiments and predictive computational methods, have generated a treasure trove of interactomics data over the last decade, most of which is now publicly available. By combining this data into a biological interaction network, which contains all molecular pathways that an organism can utilize and thus is the equivalent of the blueprint of an organism, it is possible to integrate the omics data obtained from experiments with these biological interaction networks. Biological interaction networks are key to the computational methods presented in this thesis, as they enable methods to account for important relations between genes (and gene products). 
Doing so, it is possible not only to identify interesting genes but also to uncover molecular processes important to the phenotype. As the best way to analyze omics data from an interesting phenotype varies widely based on the experimental setup and the available data, multiple methods were developed and applied in the context of this thesis. In a first approach, an existing method (PheNetic) was applied to a consortium of three bacterial species that together are able to efficiently degrade a herbicide, although none of the species is able to do so on its own. For each of the species, expression data (RNA-seq) was generated for the consortium and for the species in isolation. PheNetic identified molecular pathways which were differentially expressed and likely contribute to a cross-feeding mechanism between the species in the consortium. With this proof of concept in hand, PheNetic was adapted to cope with experimental evolution datasets in which, in addition to expression data, genomics data was also available. Two publicly available datasets were analyzed: amikacin resistance in E. coli and coexisting ecotypes in E. coli. The results revealed both well-known and newly found molecular pathways involved in these phenotypes. Experimental evolution sometimes generates datasets consisting of mutator phenotypes, which have high mutation rates. These datasets are hard to analyze due to the large amount of noise (most mutations have no effect on the phenotype). To address this, IAMBEE was developed. IAMBEE is able to analyze genomic datasets from evolution experiments even if they contain mutator phenotypes. IAMBEE was tested using an E. coli evolution experiment in which cells were exposed to increasing concentrations of ethanol. The results were validated in the wet lab. In addition to methods for the analysis of causal mutations and mechanisms in bacteria, a method for the identification of causal molecular pathways in cancer was developed. 
As bacteria and cancerous cells are both clonal, they can be treated similarly in this context. The main differences are the amount of data available (many more samples are available in cancer) and the fact that cancer is a complex and heterogeneous phenotype. Therefore we developed SSA-ME, which makes use of the concept that a causal molecular pathway has at most one mutation in a cancerous cell (mutual exclusivity). However, enforcing this criterion is computationally hard. SSA-ME is designed to cope with this problem and to search for mutually exclusive patterns in relatively large datasets. SSA-ME was tested on cancer data from the TCGA PAN-cancer dataset. From the results we could, in addition to already known molecular pathways and mutated genes, predict the involvement of a few rarely mutated genes. Pages: 246. Status: published.
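    The mutual exclusivity idea mentioned in the abstract above can be illustrated with a small sketch. This is not the SSA-ME algorithm; it is only a naive score for one candidate gene set, assuming a hypothetical input format (a dict mapping sample IDs to sets of mutated genes) and a hypothetical function name, `coverage_and_exclusivity`.

```python
def coverage_and_exclusivity(mutations, gene_set):
    """Score a candidate gene set on a toy mutation table.

    mutations: dict mapping sample ID -> set of mutated gene names
               (hypothetical input format, chosen for illustration).
    gene_set:  iterable of gene names forming the candidate pathway.

    Returns (coverage, exclusivity):
      coverage    = fraction of samples with at least one mutated gene in the set
      exclusivity = fraction of covered samples with exactly one mutated gene in the set
    """
    genes = set(gene_set)
    covered = 0
    exclusive = 0
    for sample_genes in mutations.values():
        hits = len(sample_genes & genes)
        if hits >= 1:
            covered += 1
        if hits == 1:
            exclusive += 1
    n_samples = len(mutations)
    coverage = covered / n_samples if n_samples else 0.0
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


# Toy data: gene names and samples are made up.
toy = {
    "s1": {"A"},
    "s2": {"B"},
    "s3": {"A", "B"},
    "s4": {"C"},
}
cov, excl = coverage_and_exclusivity(toy, ["A", "B"])
print(cov, excl)
```

    A gene set with high coverage and high exclusivity matches the "at most one mutation per sample" pattern. Searching over all gene sets for the best such score is the computationally hard part that the abstract refers to, and this sketch does not attempt it.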

    Exploiting natural selection to study adaptive behavior

    The research presented in this dissertation explores different computational and modeling techniques that, combined with predictions from evolution by natural selection, lead to the analysis of the adaptive behavior of populations under selective pressure. For this thesis three computational methods were developed: EXPLoRA, EVORhA and SSA-ME. EXPLoRA finds genomic regions associated with a trait of interest (QTL) by explicitly modeling the expected linkage disequilibrium of a population of segregants under selection. Data from BSA experiments was analyzed to find genomic loci associated with ethanol tolerance. EVORhA explores the interplay between driving and hitchhiking mutations during evolution to reconstruct the subpopulation structure of clonal bacterial populations based on deep sequencing data. Data from mixed infections and evolution experiments of E. coli was used to reconstruct their population structure. SSA-ME uses mutual exclusivity in cancer to prioritize cancer driver genes. TCGA data of breast cancer tumor samples were analyzed. Status: published.

    De-novo pathway discovery for multi-omics data

    This thesis presents algorithms and software which allow the extraction of biologically meaningful patterns from high-throughput multi-omics data and biomolecular networks. It describes the concept and implementation of an algorithm which allows the extraction of deregulated subnetworks from large directed molecular interaction networks based on node scores derived from omics data. Statistical underpinnings of the algorithm are derived and the algorithm is benchmarked against its closest methodological relative. 
Relying on fractional integer programming, the algorithm and its implementation, DeRegNet, allow many flexible modes of application. I demonstrate the application of the algorithm in the context of the public TCGA (The Cancer Genome Atlas) liver cancer dataset, a study investigating the role of folate one-carbon metabolism in liver cancer, and a study of the phosphoproteomic regulation of the Saccharomyces cerevisiae (baker's yeast) cell cycle. Finally, the general architecture and some implementation details of a web-based API for DeRegNet are presented.

    Statistical approaches to harness high throughput sequencing data in diverse biological systems

    The development of novel statistical approaches to questions specific to biological systems of interest is becoming more valuable as we tackle increasingly complex problems. This thesis explores three distinct biological systems in which high-throughput sequencing data is utilised, varying in research area, organism, number of sequencing platforms and datasets integrated, and structure such as matched samples, showcasing the variety of study designs and thus the need for tailored statistical approaches. First, we characterise allelic imbalance from RNA-Seq data, including stringent filtering criteria and a count-based likelihood ratio test. This work identified genes of particular importance in livestock genomics, such as those related to energy use. Second, we outline a novel methodology to identify highly expressed genes and cells for single-cell RNA-Seq data. We derive a gamma-normal mixture model to identify lowly and highly expressed components, and use this to identify novel markers for olfactory sensory neuron (OSN) maturity across publicly available mouse neuron datasets. In addition, we estimate single-cell networks and find that mature OSN single-cell networks are more centralised than immature OSN single-cell networks. Third, we develop two novel frameworks for relating information from whole-exome DNA-Seq and RNA-Seq data when i) samples are matched and when ii) samples are not necessarily matched between platforms. In the latter case, we relate functional somatic mutation driver gene scores to transcriptional network correlation disturbance using a permutation testing framework, identifying potential candidate genes for targeted therapies. In the former case, we estimate directed mutation-expression networks for each cancer using linear models, providing a useful exploratory tool for identifying novel relationships among genes. 
This thesis demonstrates the importance of tailored statistical approaches for furthering understanding across many biological systems.

    The role of network science in glioblastoma

    Network science has long been recognized as a well-established discipline across many biological domains. In the particular case of cancer genomics, network discovery is challenged by the multitude of available high-dimensional heterogeneous views of data. Glioblastoma (GBM) is an example of such a complex and heterogeneous disease that can be tackled by network science. Identifying the architecture of molecular GBM networks is essential to understanding the information flow and better informing drug development and pre-clinical studies. Here, we review network-based strategies that have been used in the study of GBM, along with the available software implementations for reproducibility and further testing on newly emerging datasets. Promising results have been obtained from both bulk and single-cell GBM data, placing network discovery at the forefront of developing molecularly informed personalized medicine. This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with references CEECINST/00102/2018, CEECIND/00072/2018 and PD/BDE/143154/2019, UIDB/04516/2020, UIDB/00297/2020, UIDB/50021/2020, UIDB/50022/2020, UIDB/50026/2020, UIDP/50026/2020, NORTE-01-0145-FEDER-000013, and NORTE-01-0145-FEDER000023 and projects PTDC/CCI-BIO/4180/2020 and DSAIPA/DS/0026/2019. This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 951970 (OLISSIPO project).

    Bioinformatics approaches for cancer research

    Cancer is the consequence of genetic alterations that influence the behavior of affected cells. While the phenotypic effects of cancer, like infinite proliferation, are common hallmarks of this complex class of diseases, the connections between the genetic alterations and these effects are not always evident. The growth of information generated by experimental high-throughput techniques makes it possible to combine heterogeneous data from different sources to gain new insights into these complex molecular processes. The demand on computational biology to develop tools and methods to facilitate the evaluation of such data has increased accordingly. To this end, we developed new approaches and bioinformatics tools for the analysis of high-throughput data. Additionally, we integrated these new approaches into our comprehensive C++ framework GeneTrail. GeneTrail is a powerful package that combines information retrieval, statistical evaluation of gene sets, result presentation, and data exchange. To make GeneTrail's capabilities available to the research community, we implemented a graphical user interface in PHP and set up a web server that is accessible worldwide. In this thesis, we discuss newly integrated algorithms and extensions of GeneTrail, as well as some comprehensive studies that have been performed with GeneTrail in the context of cancer research. We applied GeneTrail to analyze properties of tumor-associated antigens to elucidate the mechanisms of antigen candidate selection. Furthermore, we performed an extensive analysis of miRNAs and their putative target pathways and networks in cancer. In the field of differential network analysis, we employed a combination of expression values and topological data to identify patterns of deregulated subnetworks and putative key players for the deregulation. 
Signatures of deregulated subnetworks may help to predict the sensitivity of tumor subtypes to therapeutic agents and, hence, may be used in the future to guide the selection of optimal agents. Furthermore, the identified putative key players may represent oncogenes, tumor suppressor genes, or other genes that contribute to crucial changes of regulatory and signaling processes in cancer cells and may serve as potential targets for an individualized tumor therapy. With these applications, we demonstrate the usefulness of our GeneTrail package and hope that our work will contribute to a better understanding of cancer.

    Global analysis of SNPs, proteins and protein-protein interactions: approaches for the prioritisation of candidate disease genes.

    PhD thesis. Understanding the etiology of complex disease remains a challenge in biology. In recent years there has been an explosion in biological data; this study investigates machine learning and network analysis methods as tools to aid candidate disease gene prioritisation, specifically relating to hypertension and cardiovascular disease. This thesis comprises four sets of analyses. Firstly, non-synonymous single nucleotide polymorphisms (nsSNPs) were analysed in terms of sequence- and structure-based properties using a classifier to provide a model for predicting deleterious nsSNPs. The degree of sequence conservation at the nsSNP position was found to be the single best attribute, but other sequence and structural attributes in combination were also useful. Predictions for nsSNPs within Ensembl have been made publicly available. Secondly, the prediction of protein function for proteins with an absence of experimental data or a lack of clear similarity to a sequence of known function was addressed. Protein domain attributes based on physicochemical and predicted structural characteristics of the sequence were used as input to classifiers for predicting membership of large and diverse protein superfamilies from the SCOP database. An enrichment method was investigated that involved adding domains to the training dataset that are currently absent from SCOP. This analysis resulted in improved classifier accuracy: optimised classifiers achieved 66.3% for single-domain proteins and 55.6% when including domains from multi-domain proteins. The domains from superfamilies with low sequence similarity share global sequence properties, enabling applications to be developed which complement profile methods for detecting distant sequence relationships. Thirdly, a topological analysis of the human protein interactome was performed. The results were combined with functional annotation and sequence-based properties to build models for predicting hypertension-associated proteins. 
The study found that predicted hypertension-related proteins are not generally associated with network hubs and do not exhibit high clustering coefficients. Despite this, they tend to be closer and better connected to other hypertension proteins on the interaction network than would be expected by chance. Classifiers that combined PPI network, amino acid sequence and functional properties produced a range of precision and recall scores according to the applied 3 weights. Finally, interactome properties of proteins implicated in cardiovascular disease and cancer were studied. The analysis quantified the influential (central) nature of each protein and defined characteristics of functional modules and pathways in which the disease proteins reside. Such proteins were found to be enriched 2-fold within proteins that are influential (p<0.05) in the interactome. Additionally, they cluster in large, complex, highly connected communities, acting as interfaces between multiple processes more often than expected. An approach to prioritising disease candidates based on this analysis was proposed. Each analysis can provide some new insights into the effort to identify novel disease-related proteins for cardiovascular disease.
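    The two topological quantities named in the abstract above, hub status (node degree) and clustering coefficient, can be computed on a toy undirected network as follows. This is a minimal sketch using the standard definition of the local clustering coefficient; the graph, the node names and the helper names `build_adjacency` and `local_clustering` are illustrative assumptions and are not taken from the thesis.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map (node -> set of neighbours)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Local clustering coefficient: the fraction of pairs of neighbours
    of `node` that are themselves connected. Defined as 0.0 when the
    node has fewer than two neighbours."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = 0
    for i in range(k):
        for j in range(i + 1, k):
            if nbrs[j] in adj[nbrs[i]]:
                links += 1
    return 2.0 * links / (k * (k - 1))


# Toy protein-protein interaction network (made-up node names).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
adj = build_adjacency(edges)

for node in sorted(adj):
    print(node, "degree:", len(adj[node]),
          "clustering:", round(local_clustering(adj, node), 3))
```

    In this toy network, node A has the highest degree (a hub-like node) but a low clustering coefficient, because only one of its three possible neighbour pairs is connected.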

    Mathematical Physics Techniques for Omics Data Integration

    Nowadays, different types of high-throughput technologies allow us to collect information on the molecular components of biological systems. Each of these technologies is designed to simultaneously collect large sets of molecular data of a specific omic kind. In order to draw a more comprehensive view of biological processes, experimental data collected on different layers have to be integrated and analyzed. The complexity of biological systems, the technological limits, the large number of biological variables and the relatively low number of biological samples make integrative analyses a challenge. Hence, the development of methods for omics integration is one of the most relevant problems computational scientists are addressing nowadays. The most representative and promising techniques for the analysis of omics data are presented and broadly divided into categories. In the literature we notice a growing interest in approaches that use graphs for modeling the relationships among omic variables. In particular, we found that algorithms propagating molecular information on networks are being proposed in several applications and are often related to actual physical models. We considered the chemical master equation (CME) framework to model the exchange of information in biological networks as a stochastic process on the network. In this context we defined new algorithms and pipelines for the analysis of omics data. In particular, we propose two network-based methods with applications to both synthetic and prostate adenocarcinoma data. In both applications the molecular alterations are mapped onto the protein-protein interaction network. In the first application we defined a novel methodology for extracting modules of connected genes that present the most significant differential molecular information between two classes of samples. 
In the second application we measure, using a perturbative approach to the CME, the degree to which a distribution of deleterious molecular information on a given network alters the normal trajectories of information flow.

    Relating topology and dynamics in cell signaling networks

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Biological Engineering, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 153-163). Cells are constantly bombarded with stimuli that they must sense, process, and interpret to make decisions. This capability is provided by interconnected signaling pathways. Many of the components and interactions within pathways have been identified, and it is becoming clear that the precise dynamics they generate are necessary for proper system function. However, our understanding of how pathways are interconnected to drive decisions is limited. We must overcome this limitation to develop interventions that can fine-tune a cell decision by modulating specific features of its constituent pathways' dynamics. How can we quantitatively map a whole cell decision process? Answering this question requires addressing challenges at three scales: the detailed biochemistry of protein-protein interactions, the complex, interlocked feedback loops of transcriptionally regulated signaling pathways, and the multiple mechanisms of connection that link distinct pathways together into a full cell decision process. In this thesis, we address challenges at each level. We develop new computational approaches for identifying the interactions driving dynamics in protein-protein networks. Applied to the cyanobacterial clock, these approaches identify two coupled motifs that together provide independent control over oscillation phase and period. Using the p53 pathway as a model transcriptional network, we experimentally isolate and characterize dynamics from a core feedback loop in individual cells. A quantitative model of this signaling network predicts and rationalizes the distinct effects on dynamics of additional feedback loops and small-molecule inhibitors. 
Finally, we demonstrate the feasibility of combining individual pathway models to map a whole cell decision: cell cycle arrest elicited by the mammalian DNA damage response. By coupling modeling and experiments, we used this combined perspective to uncover some new biology. We found that multiple arrest mechanisms must work together in a proper cell cycle arrest, and identified a new role for p21 in preventing G2 arrest, paradoxically through its action on G1 cyclins. This thesis demonstrates that we can quantitatively map the logic of cellular decisions, affording new insight and revealing points of control. By Jared E. Toettcher. Ph.D.