8 research outputs found

    Durmaz et al. FSM

    Poster describing our recent paper: Frequent Subgraph Mining of Personalized Signaling Pathway Networks Groups Patients with Frequently Dysregulated Disease Pathways and Predicts Prognosis. https://www.ncbi.nlm.nih.gov/pubmed/27896993

    From Correlation to Causality: Does Network Information improve Cancer Outcome Prediction?

    Motivation: Disease progression in cancer can vary substantially between patients, yet patients often receive the same treatment. Recently, there has been much work on predicting disease progression and patient outcome variables from gene expression in order to personalize treatment options. A widely used approach is to run high-throughput experiments that search for predictive signature genes, which would allow the clinical outcome of a disease to be identified. Microarray data analysis helps to reveal the underlying biological mechanisms of tumor progression, metastasis, and drug resistance in cancer studies. Although the first diagnostic kits are on the market, open problems remain, such as random gene signatures and noisy expression data. Experimental or computational noise in the data and the limited number of tissue samples collected from patients may further reduce the predictive power and biological interpretability of such signature genes. Moreover, signature genes predicted by different studies generally show poor similarity, even for the same type of cancer. Integrating network information with gene expression data could provide more efficient signatures for outcome prediction in cancer studies. One approach to these problems employs gene-gene relationships and ranks genes using the random surfer model of Google's PageRank algorithm. Unfortunately, the majority of published network-based approaches tested their methods on only a small number of datasets, which calls into question the general applicability of network-based methods for outcome prediction. Methods: In this thesis, I provide a comprehensive and systematic evaluation of a network-based outcome prediction approach, NetRank (a PageRank derivative), applied to several types of gene expression cancer data and four different types of networks. The algorithm identifies a signature gene set for a specific cancer type by incorporating gene network information with given expression data.
To assess the performance of NetRank, I created a benchmark dataset collection comprising 25 cancer outcome prediction datasets from the literature and one in-house dataset. Results: NetRank performs significantly better than classical methods such as fold change or t-test, improving prediction performance by 7% on average. In addition, we approach the accuracy level of the authors' signatures by applying a relatively unbiased but fully automated process for biomarker discovery. Despite an order of magnitude difference in network size, a regulatory network, a protein-protein interaction network, and two predicted networks perform equally well. Signatures as published by the authors and the signatures generated with classical methods do not overlap, not even for the same cancer type, whereas the network-based signatures strongly overlap. I analyze and discuss these overlapping genes in terms of the Hallmarks of Cancer, and in particular single out six transcription factors and seven proteins and discuss their specific role in cancer progression. Furthermore, several tests are conducted for the identification of a Universal Cancer Signature. No Universal Cancer Signature could be identified so far, but a cancer-specific combination of general master regulators with specific cancer genes could be discovered that achieves the best results for all cancer types. Because NetRank offers great value for cancer outcome prediction, first steps toward its secure use in a public cloud are described. Conclusion: Experimental evaluation of network-based methods on a gene expression benchmark dataset suggests that these methods are especially suited for outcome prediction, as they overcome the problems of random gene signatures and noisy expression data.
Through the combination of network information with gene expression data, network-based methods identify highly similar signatures across all cancer types, in contrast to classical methods, which fail to identify common gene sets even for the same cancer type. In general, integrating additional information into gene expression analysis allows the identification of more reliable, accurate, and reproducible biomarkers and provides a deeper understanding of the processes occurring in cancer development and progression.
Table of contents:
1 Definition of Open Problems
2 Introduction: 2.1 Problems in cancer outcome prediction; 2.2 Network-based cancer outcome prediction; 2.3 Universal Cancer Signature
3 Methods: 3.1 NetRank algorithm; 3.2 Preprocessing and filtering of the microarray data; 3.3 Accuracy; 3.4 Signature similarity; 3.5 Classical approaches; 3.6 Random signatures; 3.7 Networks; 3.8 Direct neighbor method; 3.9 Dataset extraction
4 Performance of NetRank: 4.1 Benchmark dataset for evaluation; 4.2 The influence of NetRank parameters; 4.3 Evaluation of NetRank; 4.4 General findings; 4.5 Computational complexity of NetRank; 4.6 Discussion
5 Universal Cancer Signature: 5.1 Signature overlap: a sign for Universal Cancer Signature; 5.2 NetRank genes are highly connected and confirmed in literature; 5.3 Hallmarks of Cancer; 5.4 Testing possible Universal Cancer Signatures; 5.5 Conclusion
6 Cloud-based Biomarker Discovery: 6.1 Introduction to secure Cloud computing; 6.2 Cancer outcome prediction; 6.3 Security analysis; 6.4 Conclusion
7 Contributions and Conclusion
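The abstract describes NetRank only as a PageRank derivative that combines gene network information with expression data, so the following is a minimal sketch of that general idea and not the thesis's actual algorithm. It assumes a personalized-PageRank update in which each gene's prior is a non-negative outcome-association score (for example an absolute correlation between expression and outcome). The function name `rank_genes`, the `damping` parameter, and the toy gene names are all hypothetical.

```python
def rank_genes(edges, scores, damping=0.3, iterations=100):
    """Rank genes with a personalized PageRank-style iteration (illustrative only).

    edges  -- iterable of (gene_a, gene_b) pairs, treated as undirected
    scores -- dict mapping gene -> non-negative outcome-association score
    """
    genes = sorted(scores)
    neighbors = {g: set() for g in genes}
    for a, b in edges:
        if a in neighbors and b in neighbors and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    prior = {g: scores[g] / total for g in genes}

    rank = dict(prior)
    for _ in range(iterations):
        new_rank = {}
        for g in genes:
            # Each neighbor passes on its rank, split evenly over its own links.
            flow = sum(rank[n] / len(neighbors[n]) for n in neighbors[g])
            new_rank[g] = (1 - damping) * prior[g] + damping * flow
        rank = new_rank
    return sorted(genes, key=lambda g: rank[g], reverse=True)


# Toy example (made-up data): TF1 has a weak score of its own but is linked
# to three genes with stronger scores.
edges = [("TF1", "A"), ("TF1", "B"), ("TF1", "C"), ("D", "E")]
scores = {"TF1": 0.2, "A": 0.5, "B": 0.5, "C": 0.5, "D": 0.6, "E": 0.1}
print(rank_genes(edges, scores, damping=0.5))
```

In this toy network the hub gene can overtake genes with a higher individual score, which is the kind of effect a network-based ranking is meant to capture; a score-only ranking such as fold change would ignore the links entirely.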

    Bioinformatics approaches for cancer research

    Cancer is the consequence of genetic alterations that influence the behavior of affected cells. While the phenotypic effects of cancer, such as infinite proliferation, are common hallmarks of this complex class of diseases, the connections between the genetic alterations and these effects are not always evident. The growth of information generated by experimental high-throughput techniques makes it possible to combine heterogeneous data from different sources to gain new insights into these complex molecular processes. The demand on computational biology to develop tools and methods that facilitate the evaluation of such data has increased accordingly. To this end, we developed new approaches and bioinformatics tools for the analysis of high-throughput data. Additionally, we integrated these new approaches into our comprehensive C++ framework GeneTrail. GeneTrail is a powerful package that combines information retrieval, statistical evaluation of gene sets, result presentation, and data exchange. To make GeneTrail's capabilities available to the research community, we implemented a graphical user interface in PHP and set up a web server that is accessible worldwide. In this thesis, we discuss newly integrated algorithms and extensions of GeneTrail, as well as some comprehensive studies that have been performed with GeneTrail in the context of cancer research. We applied GeneTrail to analyze properties of tumor-associated antigens to elucidate the mechanisms of antigen candidate selection. Furthermore, we performed an extensive analysis of miRNAs and their putative target pathways and networks in cancer. In the field of differential network analysis, we employed a combination of expression values and topological data to identify patterns of deregulated subnetworks and putative key players for the deregulation.
Signatures of deregulated subnetworks may help to predict the sensitivity of tumor subtypes to therapeutic agents and, hence, may be used in the future to guide the selection of optimal agents. Furthermore, the identified putative key players may represent oncogenes, tumor suppressor genes, or other genes that contribute to crucial changes of regulatory and signaling processes in cancer cells and may serve as potential targets for an individualized tumor therapy. With these applications, we demonstrate the usefulness of our GeneTrail package and hope that our work will contribute to a better understanding of cancer.

    EMT Network-based Lung Cancer Prognosis Prediction

    Network-based feature selection methods for omics data have been developed in recent years. Their performance gain, however, has been shown to depend on the datasets, networks, and evaluation metrics used. The reproducibility and robustness of biomarkers still need to be improved. In this endeavor, one of the major challenges is the curse of dimensionality. To mitigate this issue, we proposed the Phenotype Relevant Network-based Feature Selection (PRNFS) framework. By employing a much smaller but phenotype-relevant network, we could avoid irrelevant information and select robust molecular signatures. The advantages of PRNFS were demonstrated by applying it to lung cancer prognosis prediction. Specifically, we constructed epithelial mesenchymal transition (EMT) networks and employed them for feature selection. We mapped multiple types of omics data onto them in turn to select single-omics signatures and further integrated these into multi-omics signatures. Then we introduced a multiplex network-based feature selection method to directly select multi-omics signatures. Both single-omics and multi-omics EMT signatures were evaluated on TCGA data as well as on an independent multi-omics dataset. The results showed that EMT signatures achieved a significant performance gain, although EMT networks covered less than 2.5% of the original data dimensions. Frequently selected EMT features achieved average AUC values of 0.83 on TCGA data. Employing EMT signatures on the independent dataset stratified the patients into significantly different prognostic groups. Multi-omics features showed superior performance over single-omics features on both TCGA data and the independent data. Additionally, we tested the performance of a few relational and non-relational databases for storing and retrieving omics data. Since biological data have large volume, high velocity, and wide variety, database systems are needed that meet the requirements of integrative omics data analysis.
Based on the results, we provide some advice on building scalable omics data infrastructures.

    Topological Data Analysis of High-dimensional Correlation Structures with Applications in Epigenetics

    This thesis comprises a comprehensive study of the correlation of high-dimensional datasets from a topological perspective. Prompted by a lack of efficient algorithms for big data analysis and motivated by the importance of finding a structure of correlations in genomics, we have developed two analytical tools inspired by the topological data analysis approach that describe and predict the behavior of the correlated design. These models allowed us to study epigenetic interactions from a local and a global perspective, taking into account the different levels of complexity. We applied graph-theoretic and algebraic topology principles to quantify structural patterns in local correlation networks and, based on them, proposed a network model that was able to predict the locally high correlations of DNA methylation data. This model provided an efficient tool to measure the evolution of the correlation with the aging process. Furthermore, we developed a powerful computational algorithm to analyze the correlation structure globally that was able to detect differentiated methylation patterns across sample groups. This methodology is intended to serve as a diagnostic tool, as it provides selected epigenetic biomarkers associated with a specific phenotype of interest. Overall, this work establishes a novel perspective on the analysis and modulation of hidden correlation structures, specifically those of great dimension and complexity, contributes to the understanding of epigenetic processes, and is designed to be useful for non-biological fields too.
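The abstract does not specify the thesis's tools, so the sketch below only illustrates one elementary topological summary of a correlation network, under stated assumptions: features are linked when their absolute Pearson correlation reaches a threshold, and the number of connected components is counted as the threshold varies. The function names and the toy methylation-like profiles are hypothetical, and constant profiles (zero variance) are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def components_at(profiles, threshold):
    """Count connected components when features with |r| >= threshold are linked."""
    parent = {name: name for name in profiles}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in combinations(profiles, 2):
        if abs(pearson(profiles[a], profiles[b])) >= threshold:
            parent[find(a)] = find(b)
    return len({find(v) for v in profiles})


# Made-up profiles for three sites measured in four samples.
profiles = {
    "site1": [0.1, 0.2, 0.3, 0.4],
    "site2": [0.2, 0.4, 0.6, 0.8],
    "site3": [0.9, 0.1, 0.8, 0.2],
}
for threshold in (0.9, 0.4):
    print(threshold, components_at(profiles, threshold))
```

Tracking how the component count changes as the threshold is lowered is a very reduced version of the persistence idea behind topological data analysis; richer summaries such as cycles would need a dedicated library.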

    FAIR and bias-free network modules for mechanism-based disease redefinitions

    Even though chronic diseases are the cause of 60% of all deaths around the world, the underlying causes of most of them are not fully understood. Hence, diseases are defined based on organs and symptoms, and therapies largely focus on mitigating symptoms rather than on cure. This is also reflected in the most commonly used disease classifications. The complex nature of diseases, however, can be better defined in terms of networks of molecular interactions. This research applies the approaches of network medicine, a field that uses network science for identifying and treating diseases, to multiple diseases with high unmet medical need, such as stroke and hypertension. The results show the success of this approach in analysing complex disease networks and predicting drug targets for different conditions, which are validated through preclinical experiments and are currently in human clinical trials.

    Machine learning and large scale cancer omic data: decoding the biological mechanisms underpinning cancer

    Many of the mechanisms underpinning cancer risk and tumorigenesis are still not fully understood. However, the next-generation sequencing revolution and the rapid advances in big data analytics allow us to study cells and complex phenotypes at unprecedented depth and breadth. While experimental and clinical data are still fundamental to validate findings and confirm hypotheses, computational biology is key for the analysis of system- and population-level data, for the detection of hidden patterns, and for the generation of testable hypotheses. In this work, I tackle two main questions regarding cancer risk and tumorigenesis that require novel computational methods for the analysis of system-level omic data. First, I focused on how frequent, low-penetrance inherited variants modulate cancer risk in the broader population. Genome-Wide Association Studies (GWAS) have shown that Single Nucleotide Polymorphisms (SNPs) contribute to cancer risk with multiple subtle effects, but they still fail to give further insight into their synergistic effects. I developed a novel hierarchical Bayesian regression model, BAGHERA, to estimate heritability at the gene level from GWAS summary statistics. I then used BAGHERA to analyse data from 38 malignancies in the UK Biobank. I showed that genes with high heritable risk are involved in key processes associated with cancer and are often localised in genes that are somatically mutated drivers. Heritability analysis, like many other omics analysis methods, studies the effects of DNA variants on single genes in isolation. However, we know that most biological processes require the interplay of multiple genes, and we often lack a broad perspective on them. For the second part of this thesis, I therefore worked on the integration of Protein-Protein Interaction (PPI) graphs and omics data, which bridges this gap and recapitulates these interactions at a system level.
First, I developed a modular and scalable Python package, PyGNA, that enables robust statistical testing of the topological properties of genesets. PyGNA complements the literature with a tool that can be routinely introduced into automated bioinformatics pipelines. With PyGNA I processed multiple genesets obtained from genomics and transcriptomics data. However, topological properties alone have proven insufficient to fully characterise complex phenotypes. Therefore, I focused on a model that makes it possible to combine topological and functional data to detect multiple communities associated with a phenotype. Detecting cancer-specific submodules is still an open problem, but it has the potential to elucidate mechanisms detectable only by integrating multi-omics data. Building on recent advances in Graph Neural Networks (GNNs), I present a supervised geometric deep learning model that combines GNNs and Stochastic Block Models (SBMs). The model is able to learn multiple graph-aware representations of the attributed network, as multiple joint SBMs, accounting for nodes that participate in multiple processes. The simultaneous estimation of structure and function provides an interpretable picture of how genes interact in specific conditions and allows the detection of novel putative pathways associated with cancer.
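To make "statistical testing of the topological properties of genesets" concrete, here is a minimal sketch of a permutation test. It does not use PyGNA and does not reproduce its API or its statistics; the chosen statistic (number of network edges inside the geneset), the function names, and the toy network are all assumptions made for illustration.

```python
import random


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbor sets."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def internal_edges(adjacency, geneset):
    """Count edges whose two endpoints both lie in the geneset."""
    members = set(geneset)
    twice = sum(1 for g in members for n in adjacency.get(g, ()) if n in members)
    return twice // 2  # every internal edge was counted from both ends


def permutation_pvalue(adjacency, geneset, n_permutations=1000, seed=0):
    """Empirical p-value: how often a random geneset of equal size is as dense."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    size = len(set(geneset))
    observed = internal_edges(adjacency, geneset)
    hits = 0
    for _ in range(n_permutations):
        if internal_edges(adjacency, rng.sample(nodes, size)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy network (made up): a fully connected group of 4 genes plus a 16-gene ring.
clique = ["C0", "C1", "C2", "C3"]
edges = [(a, b) for i, a in enumerate(clique) for b in clique[i + 1:]]
ring = [f"R{i}" for i in range(16)]
edges += [(ring[i], ring[(i + 1) % 16]) for i in range(16)]
adjacency = build_adjacency(edges)

observed, p_value = permutation_pvalue(adjacency, clique, n_permutations=500, seed=1)
print(observed, p_value)
```

Sampling random genesets uniformly ignores degree bias in real PPI networks, so a practical null model would usually need to control for node degree.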