348 research outputs found

    Network Approaches to the Study of Genomic Variation in Cancer

    Advances in genomic sequencing technologies opened the door for a wider study of cancer etiology. By analyzing datasets with thousands of exomes (or genomes), researchers gained a better understanding of the genomic alterations that confer a selective advantage towards cancerous growth. A predominant narrative in the field has been based on a dichotomy of alterations that confer a strong selective advantage, called cancer drivers, and the bulk of other alterations assumed to have a neutral effect, called passengers. Yet, a series of studies questioned this narrative and assigned potential roles to passengers, be it in terms of facilitating tumorigenesis or countering the effect of drivers. Consequently, the passenger mutational landscape received a higher level of attention in an attempt to prioritize the possible effects of its alterations and to identify new therapeutic targets. In this dissertation, we introduce interpretable network approaches to the study of genomic variation in cancer. We rely on two types of networks, namely functional biological networks and artificial neural nets. In the first chapter, we describe a propagation method that prioritizes 230 infrequently mutated genes with respect to their potential contribution to cancer development. In the second chapter, we further transcend the driver-passenger dichotomy and demonstrate a gradient of cancer relevance across human genes. In the last two chapters, we present methods that simplify neural network models to render them more interpretable, with a focus on functional genomic applications in cancer and beyond.
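
The abstract above does not say which propagation scheme is used. As a rough illustration of the general idea only, the sketch below runs a random walk with restart, one common form of network propagation, on a toy network. The gene names `G1` to `G6`, the `restart` value, and the function name `propagate` are all hypothetical and are not taken from the dissertation.

```python
def propagate(adjacency, seed_scores, restart=0.5, iterations=100):
    """Random walk with restart on an undirected gene network.

    adjacency: dict mapping each gene to the set of its neighbours
               (toy input; every gene needs at least one neighbour).
    seed_scores: dict mapping genes to a non-negative starting score,
                 for example a mutation frequency.
    """
    genes = sorted(adjacency)
    total = sum(seed_scores.get(g, 0.0) for g in genes)
    if total <= 0:
        raise ValueError("at least one seed gene needs a positive score")
    # Normalised starting distribution; the walk restarts here.
    p0 = {g: seed_scores.get(g, 0.0) / total for g in genes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {}
        for g in genes:
            # Each neighbour passes on its score, split evenly over its edges.
            inflow = sum(p[n] / len(adjacency[n]) for n in adjacency[g])
            nxt[g] = (1 - restart) * inflow + restart * p0[g]
        p = nxt
    return p


# Hypothetical toy network: G1 is the only mutated (seed) gene.
network = {
    "G1": {"G2", "G3"},
    "G2": {"G1"},
    "G3": {"G1", "G4"},
    "G4": {"G3"},
    "G5": {"G6"},
    "G6": {"G5"},
}
scores = propagate(network, {"G1": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy run, genes connected to the seed receive a share of its score, while `G5` and `G6`, which sit in a separate component, stay at zero. A ranking of infrequently mutated genes would then be read off `ranked`.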

    Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer.

    Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations

    Cancer arises from the accumulation of somatic mutations and genetic alterations in cell division checkpoints and apoptosis, which often leads to abnormal tumor proliferation. Proper classification of cancer-linked driver mutations will considerably help our understanding of the molecular dynamics of cancer. In this study, we compared several cancer-specific predictive models for prediction of driver mutations in cancer-linked genes that were validated on canonical data sets of functionally validated mutations and applied to raw cancer genomics data. By analyzing pathogenicity prediction and conservation scores, we have shown that evolutionary conservation scores play a pivotal role in the classification of cancer drivers and were the most informative features in the driver mutation classification. Through extensive comparative analysis with structure-functional experiments and multicenter mutational calling data from PanCancer Atlas studies, we have demonstrated the robustness of our models and addressed the validity of computational predictions. We evaluated the performance of our models using standard diagnostic metrics such as sensitivity, specificity, area under the curve and F-measure. To address the interpretability of cancer-specific classification models and obtain novel insights about molecular signatures of driver mutations, we have complemented machine learning predictions with structure-functional analysis of cancer driver mutations in several key tumor suppressor genes and oncogenes. Through the experiments carried out in this study, we found that evolutionary-based features have the strongest signal in the machine learning classification of driver mutations and provide orthogonal information to the ensemble-based scores that are prominent in the ranking of feature importance.
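
For readers unfamiliar with the diagnostic metrics named in the abstract above, the following minimal sketch computes sensitivity, specificity, and the F-measure from confusion-matrix counts. The counts are made-up illustration values, not results from the study, and the function name `diagnostic_metrics` is hypothetical. Area under the curve is left out because it needs ranked prediction scores rather than counts.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard metrics for a binary driver/passenger classifier.

    tp, fp, tn, fn: counts of true positives, false positives,
    true negatives, and false negatives.
    """
    sensitivity = tp / (tp + fn)   # fraction of true drivers recovered
    specificity = tn / (tn + fp)   # fraction of passengers correctly rejected
    precision = tp / (tp + fp)     # fraction of predicted drivers that are real
    # F-measure (F1): harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Made-up counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)
```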

    Discerning Drivers of Cancer: Computational Approaches to Somatic Exome Sequencing Data

    Paired tumor-normal sequencing of thousands of patients’ exomes has revealed millions of somatic mutations, but functional characterization and clinical decision making are stymied because biologically neutral ‘passenger’ mutations greatly outnumber pathogenic ‘driver’ mutations. Since most mutations will return negative results if tested, conventional resource-intensive experiments are reserved for mutations that are observed in multiple patients or rarer mutations found in well-established cancer genes. Most mutations are therefore never tested, diminishing the potential to discover new mechanisms of cancer development and treatment opportunities. Computational methods that reliably prioritize mutations for testing would greatly increase the translation of sequencing results to clinical care. The goal of this thesis is to develop new approaches that use datasets of protein-coding somatic mutations to identify putative cancer-causing genes and mutations, and to validate these predictions in silico and experimentally. This work will be split among several inter-related efforts, which taken together will help experimental biologists and clinicians focus on hypotheses that can yield novel insights into cancer biology, development, and treatment.

    The search for cis-regulatory driver mutations in cancer genomes

    With the advent of high-throughput and relatively inexpensive whole-genome sequencing technology, the focus of cancer research has begun to shift toward analyses of somatic mutations in non-coding cis-regulatory elements of the cancer genome. Cis-regulatory elements play an important role in gene regulation, with mutations in these elements potentially resulting in changes to the expression of linked genes. The recent discoveries of recurrent TERT promoter mutations in melanoma, and recurrent mutations that create a super-enhancer regulating TAL1 expression in T-cell acute lymphoblastic leukaemia (T-ALL), have sparked significant interest in the search for other somatic cis-regulatory mutations driving cancer development. In this review, we look more closely at the TERT promoter and TAL1 enhancer alterations and use these examples to ask whether other cis-regulatory mutations may play a role in cancer susceptibility. In doing so, we make observations from the data emerging from recent research in this field, and describe the experimental and analytical approaches which could be adopted in the hope of better uncovering the true functional significance of somatic cis-regulatory mutations in cancer.

    Network-based identification of driver pathways in clonal systems

    Highly ethanol-tolerant bacteria for the production of biofuels, bacterial pathogens that are resistant to antibiotics, and cancer cells are examples of phenotypes that are of importance to society and are currently being studied. In order to better understand these phenotypes and their underlying genotype-phenotype relationships, it is now commonplace to investigate DNA and expression profiles using next generation sequencing (NGS) and microarray techniques. These techniques generate large amounts of omics data, which result in lists of genes that have mutations or expression profiles that potentially contribute to the phenotype. These lists often include a multitude of genes and are troublesome to verify manually, as performing literature studies and wet-lab experiments for a large number of genes is very time- and resource-consuming. Therefore, computational methods are required that can narrow these gene lists down by removing the generally abundant false positives and can ideally provide additional information on the relationships between the selected genes. Other high-throughput techniques such as yeast two-hybrid (Y2H), ChIP-Seq and ChIP-chip, but also a myriad of small-scale experiments and predictive computational methods, have generated a treasure trove of interactomics data over the last decade, most of which is now publicly available. By combining these data into a biological interaction network, which contains all molecular pathways that an organism can utilize and is thus the equivalent of the blueprint of an organism, it is possible to integrate the omics data obtained from experiments with these biological interaction networks. Biological interaction networks are key to the computational methods presented in this thesis, as they enable the methods to account for important relations between genes (and gene products). 
In doing so, it is possible not only to identify interesting genes but also to uncover molecular processes important to the phenotype. As the best way to analyze omics data from an interesting phenotype varies widely based on the experimental setup and the available data, multiple methods were developed and applied in the context of this thesis. In a first approach, an existing method (PheNetic) was applied to a consortium of three bacterial species that together are able to efficiently degrade a herbicide, although none of the species can do so efficiently on its own. For each of the species, expression data (RNA-seq) were generated for the consortium and for the species in isolation. PheNetic identified molecular pathways that were differentially expressed and likely contribute to a cross-feeding mechanism between the species in the consortium. With proof of concept obtained, PheNetic was adapted to cope with experimental evolution datasets in which, in addition to expression data, genomics data were also available. Two publicly available datasets were analyzed: amikacin resistance in E. coli and coexisting ecotypes in E. coli. The results revealed both well-known and newly found molecular pathways involved in these phenotypes. Experimental evolution sometimes generates datasets consisting of mutator phenotypes, which have high mutation rates. These datasets are hard to analyze due to the large amount of noise (most mutations have no effect on the phenotype). To this end, IAMBEE was developed. IAMBEE is able to analyze genomic datasets from evolution experiments even if they contain mutator phenotypes. IAMBEE was tested using an E. coli evolution experiment in which cells were exposed to increasing concentrations of ethanol. The results were validated in the wet lab. In addition to methods for analysis of causal mutations and mechanisms in bacteria, a method for the identification of causal molecular pathways in cancer was developed. 
As bacteria and cancerous cells are both clonal, they can be treated similarly in this context. The big differences are the amount of data available (many more samples are available in cancer) and the fact that cancer is a complex and heterogeneous phenotype. Therefore, we developed SSA-ME, which makes use of the concept that a causal molecular pathway has at most one mutation in a cancerous cell (mutual exclusivity). However, enforcing this criterion is computationally hard. SSA-ME is designed to cope with this problem and to search for mutually exclusive patterns in relatively large datasets. SSA-ME was tested on cancer data from the TCGA PAN-cancer dataset. From the results we could, in addition to already known molecular pathways and mutated genes, predict the involvement of a few rarely mutated genes. (nrpages: 246; status: published)
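
The mutual exclusivity criterion described above can be made concrete with a small scoring sketch. This is not the SSA-ME algorithm, whose search procedure the abstract does not describe; it only scores one fixed candidate gene set by its coverage (samples with at least one hit) and exclusivity (covered samples with exactly one hit). The sample identifiers, gene names, and the function name `exclusivity_summary` are hypothetical.

```python
def exclusivity_summary(samples, gene_set):
    """Score a candidate gene set for mutual exclusivity.

    samples: dict mapping a sample id to the set of genes mutated in it.
    gene_set: set of genes forming the candidate pathway.
    Returns (coverage, exclusivity) as fractions between 0 and 1.
    """
    # Number of candidate genes mutated in each sample.
    hits = [len(mutated & gene_set) for mutated in samples.values()]
    covered = sum(1 for h in hits if h >= 1)
    exclusive = sum(1 for h in hits if h == 1)
    coverage = covered / len(samples)
    exclusivity = exclusive / covered if covered else 0.0
    return coverage, exclusivity


# Toy mutation data, for illustration only.
toy_samples = {
    "s1": {"A"},
    "s2": {"B"},
    "s3": {"C"},
    "s4": {"A", "B"},   # violates exclusivity for the set {A, B, C}
    "s5": {"D"},        # not covered by the set {A, B, C}
}
coverage, exclusivity = exclusivity_summary(toy_samples, {"A", "B", "C"})
```

Scoring one gene set is cheap. The hard part noted in the abstract is searching over the very large number of possible gene sets, which this sketch does not attempt.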

    Computational approaches for identifying somatic intergenic mutations of relevance in cancer

    Cancer is a complex genomic disease characterized by accumulation of somatic mutations over the lifetime of a patient. Identification of somatic driver mutations that contribute to tumorigenesis is a major goal of cancer genomics. With recent advances in sequencing technologies, it has become possible to study somatic mutations on the whole-genome scale in multiple cancers. While most cancer genomics studies previously focused on identification of driver mutations affecting exons, several examples of driver events within the non-protein-coding regions of the genome were identified, including the recurrent TERT promoter mutations. Such findings have spurred searches for similar examples of recurrent non-coding mutations using computational cancer genomics. In my PhD thesis, I present several computational approaches aimed at identifying somatic driver mutations, with a specific focus on intergenic regions of the genome. The first part of this thesis focuses on the somatic mutational patterns along the cancer genome and addresses a fundamental problem of computational identification of recurrently mutated regions: regional mutational heterogeneity. Here I studied the correlation of specific genomic features with background somatic mutation rates and devised a background model that accounts for regional mutational heterogeneity. The second part of this thesis describes three different computational approaches designed to identify somatic driver events of functional relevance in cancer. The first approach integrates somatic mutation calls with gene expression data to identify variants associated with altered mRNA levels. The second approach is designed to predict changes in transcription factor binding sites in the presence of recurrent somatic mutations. The third approach uses a cross-validation scheme to enable parameter tuning in screens for recurrently somatically mutated regions in cancer genomes in an unbiased genome-wide manner. 
Using this approach, we identify several known cancer-relevant targets, both exonic (e.g., the TP53, MYC, and SMARCA4 genes) and non-coding regulatory regions (e.g., the TERT promoter), and uncover novel candidate regulatory driver regions. Among those is a cluster of recurrent intergenic mutations occurring in an enhancer element near the FADS2 gene, which encodes a critical enzyme in the biosynthesis of long chain polyunsaturated fatty acids and has been previously implicated in cancer. Collectively, the computational approaches presented here helped in uncovering novel somatic candidate events of relevance in cancer and can be further used for various applications in cancer genomics.

    Integration of Random Forest Classifiers and Deep Convolutional Neural Networks for Classification and Biomolecular Modeling of Cancer Driver Mutations

    Development of machine learning solutions for prediction of the functional and clinical significance of cancer driver genes and mutations is paramount in modern biomedical research and has gained significant momentum in the past decade. In this work, we integrate different machine learning approaches, including tree-based methods, random forest and gradient boosted tree (GBT) classifiers, along with deep convolutional neural networks (CNN) for prediction of cancer driver mutations in genomic datasets. The feasibility of using CNNs on raw nucleotide sequences for classification of cancer driver mutations was initially explored by employing label encoding, one-hot encoding, and embedding to preprocess the DNA information. These classifiers were benchmarked against their tree-based alternatives in order to evaluate the performance on a relative scale. We then integrated DNA-based scores generated by CNN with various categories of conservational, evolutionary and functional features into a generalized random forest classifier. The results of this study have demonstrated that CNN can learn high-level features from genomic information that are complementary to the ensemble-based predictors often employed for classification of cancer mutations. By combining a deep learning-generated score with only two main ensemble-based functional features, we can achieve a superior performance of various machine learning classifiers. Our findings have also suggested that the synergy of nucleotide-based deep learning scores and integrated metrics derived from protein sequence conservation scores can allow for robust classification of cancer driver mutations with a limited number of highly informative features. 
Machine learning predictions are leveraged in molecular simulations, protein stability, and network-based analysis of cancer mutations in the protein kinase genes to obtain insights about molecular signatures of driver mutations and enhance the interpretability of cancer-specific classification models.
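
Two of the DNA preprocessing schemes named in the abstract above, label encoding and one-hot encoding, can be sketched in a few lines. The choices below are assumptions for illustration and are not taken from the paper: the alphabet order `ACGT`, mapping any other symbol (such as `N`) to an extra label or an all-zero row, and the function names. The third scheme, a learned embedding, needs a trained model and is not shown.

```python
ALPHABET = "ACGT"  # assumed base order; the paper's order is not stated
LABELS = {base: index for index, base in enumerate(ALPHABET)}


def label_encode(sequence):
    """Map each base to an integer; unknown symbols get the label 4."""
    return [LABELS.get(base, len(ALPHABET)) for base in sequence.upper()]


def one_hot(sequence):
    """Map each base to a 4-element indicator row.

    Unknown symbols such as N become an all-zero row.
    """
    return [
        [1 if base == letter else 0 for letter in ALPHABET]
        for base in sequence.upper()
    ]


# Short made-up sequence, for illustration only.
labels = label_encode("ACGTN")
matrix = one_hot("ACGTN")
```

A one-hot matrix of shape (sequence length, 4) is the usual input layout for a convolutional network over nucleotides, whereas label encoding imposes an arbitrary numeric order on the bases.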

    Sequence analysis methods for the design of cancer vaccines that target tumor-specific mutant antigens (neoantigens)

    The human adaptive immune system is programmed to distinguish between self and non-self proteins, and if trained to recognize markers unique to a cancer, it may be possible to stimulate the selective destruction of cancer cells. Therapeutic cancer vaccines aim to boost the immune system by selectively increasing the population of T cells specifically targeted to the tumor-unique antigens, thereby initiating cancer cell death. In the past, this approach has primarily focused on targeted selection of ‘shared’ tumor antigens, found across many patients. The advent of massively parallel sequencing and specialized analytical approaches has enabled more efficient characterization of tumor-specific mutant antigens, or neoantigens. Specifically, methods to predict which tumor-specific mutant peptides (neoantigens) can elicit anti-tumor T cell recognition improve predictions of immune checkpoint therapy response and identify one or more neoantigens as targets for personalized vaccines. Selecting the most immunogenic neoantigens from a large number of mutations is an important challenge, in particular in cancers with a high mutational load, such as melanomas and smoker-associated lung cancers. To address such a challenging task, Chapter 1 of this thesis describes a genome-guided in silico approach to identifying tumor neoantigens that integrates tumor mutation and expression data (DNA- and RNA-Seq). The cancer vaccine design process, from read alignment to variant calling and neoantigen prediction, typically assumes that the genotype of the Human Reference Genome sequence surrounding each somatic variant is representative of the patient’s genome sequence, and does not account for the effect of nearby variants (somatic or germline) in the neoantigenic peptide sequence. 
Because the accuracy of neoantigen identification has important implications for many clinical trials and studies of basic cancer immunology, Chapter 2 describes and supports the need for patient-specific inclusion of proximal variants to address this previously oversimplified assumption in the identification of neoantigens. The method of neoantigen identification described in Chapter 1 was subsequently extended (Chapter 3) and improved by the addition of a modular workflow that aids in each component of the neoantigen prediction process, from neoantigen identification and prioritization to data visualization and DNA vaccine design. These chapters describe massively parallel sequence analysis methods that will help in the identification and subsequent refinement of patient-specific antigens for use in personalized immunotherapy.
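
One step that any neoantigen pipeline of this kind needs is enumerating the candidate mutant peptides around a missense variant. The sketch below is a simplified illustration of that step only and is not the thesis's workflow. The protein string, the 0-based `position`, the default `length` of 9, and the function name `mutant_peptides` are all hypothetical. It also assumes the surrounding sequence matches the reference, which is exactly the simplification Chapter 2 argues against; a proximal-variant-aware version would first apply the patient's nearby variants to `protein`.

```python
def mutant_peptides(protein, position, new_residue, length=9):
    """List every peptide of the given length that spans a substitution.

    protein: reference protein sequence as a string of one-letter codes.
    position: 0-based index of the substituted residue.
    new_residue: one-letter code of the mutant amino acid.
    """
    mutated = protein[:position] + new_residue + protein[position + 1:]
    # Window starts are limited so each window contains the mutated
    # residue and stays inside the protein.
    first_start = max(0, position - length + 1)
    last_start = min(position, len(mutated) - length)
    return [mutated[s:s + length] for s in range(first_start, last_start + 1)]


# Made-up ten-residue sequence with a Y-to-H change at index 4.
peptides = mutant_peptides("MKTAYIAKQR", position=4, new_residue="H", length=3)
```

Each resulting peptide would then be passed to a binding or immunogenicity predictor for prioritization, which is outside the scope of this sketch.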