309 research outputs found

    Identifying cancer-driving variations in the non-coding genome and the transcriptome

    Functional annotation of somatic mutations has been a consistent focus of cancer genomics studies. In the past, researchers preferentially focused on mutations in the coding fraction of the genome, for which ample bioinformatics tools were developed to distinguish cancer-driver mutations from neutral ones. In recent years, as an increasing number of variants have been identified as disease-associated in the non-coding genome, interpreting non-coding cancer mutations has become an urgent task. The completion of large-scale projects such as ENCODE has made functional interpretation of cancer variants achievable, and several programs have been produced based on this functional information. However, these prediction tools still have limitations, such as low prediction accuracy, lack of cancer mutation information and significant ascertainment bias. In chapter 2 of this thesis, in order to functionally interpret non-coding mutations in cancer, we developed two independent random forest models, referred to as SNP and SOM. Given a combination of features at a given genome position, the SNP model predicts the expected fraction of rare SNPs (a measure of negative selection), and the SOM model predicts the expected somatic mutation density at this position. We applied our two models to score non-coding disease-associated ClinVar and HGMD variants and a set of random control SNPs. Results showed that disease-associated variants scored higher than control SNPs with the SNP model and lower than control SNPs with the SOM model, supporting our hypothesis that purifying selection, as measured by the fraction of rare SNPs and somatic mutation density, is informative for evaluating the functional impact of cancer mutations in the non-coding genome. In the past, researchers have preferentially considered protein-coding genes as critical to the initiation and progression of cancers. However, recent evidence has shown that ncRNAs, in particular lncRNAs, are actively implicated in various cancer processes. A chapter of this thesis is devoted to this class of non-coding transcripts. Similar to protein-coding genes, there might be a large number of lncRNAs with cancer-driving functions. The development of bioinformatics tools to prioritize them has become a new focus of research for computational oncologists. The last part of this thesis is devoted to the implementation of methods for discovering potential cancer-driving non-coding elements in lncRNA and protein-coding genes. We applied three scoring tools, CADD, funSeq2 and GWAVA, together with our SNP and SOM scoring systems to prioritize cancer-associated elements using a permutation-based algorithm. For each locus, we compute the average score of all observed variants using one of the models, then randomly draw the same number of variants and compute their average score, repeating this 1 million times to form a null distribution and obtain a P value for this locus. To validate our hypothesis and permutation model, we tested this system on 61 cancer-related lncRNAs and 452 cancer genes using somatic mutation data from liver cancer, lung cancer, CLL and melanoma. We observed that both cancer lncRNAs and cancer protein-coding genes had significantly lower average P values than total lncRNAs and protein-coding genes in all cases. Applying the permutation test to lncRNAs with five different scoring systems enabled us to prioritize hundreds to thousands of cancer-related lncRNA candidates. These candidates can be used for future experimental validation.
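    The per-locus permutation test described in this abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The function name, the flat `background_scores` pool that random variants are drawn from, sampling without replacement, the one-sided "higher score is more extreme" direction and the add-one correction are all assumptions made for illustration. The default of 10,000 draws is also an assumption chosen for speed; the abstract reports 1 million.

```python
import random


def permutation_p_value(observed_scores, background_scores,
                        n_permutations=10_000, seed=0):
    """Empirical one-sided P value for the mean score of a locus.

    observed_scores   -- scores of the variants observed at the locus
    background_scores -- hypothetical pool of scores to draw random variants from
    """
    rng = random.Random(seed)            # fixed seed for reproducibility
    k = len(observed_scores)
    observed_mean = sum(observed_scores) / k

    hits = 0
    for _ in range(n_permutations):
        # Draw the same number of variants as were observed at the locus.
        sample = rng.sample(background_scores, k)
        if sum(sample) / k >= observed_mean:
            hits += 1

    # Add-one correction (an assumption) keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    background = [i / 100 for i in range(100)]   # toy score pool
    locus = [0.97, 0.98, 0.99]                   # toy high-scoring locus
    print(permutation_p_value(locus, background))
```

    A score for which lower values are more extreme (as the abstract reports for the SOM model on disease-associated variants) would need the comparison reversed.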

    Multi-omics Portraits of Cancer

    Precision oncology demands accurate portrayal of a disease at all molecular levels. However, current large-scale omics studies are often isolated by data type. I have been developing computational tools to conduct integrative analyses of omics data, identifying the unique molecular etiology of each tumor. In particular, this dissertation presents the following contributions to the computational omics of cancer: (1) uncovering the predisposition landscape in 33 cancers and how the germline genome collaborates with somatic alterations in oncogenesis; (2) pioneering methods to combine genomic and proteomic data to identify treatment opportunities; and (3) revealing selective phosphorylation of kinase-substrate pairs. These findings advance our understanding of tumor biology on a systematic scale and inform clinical practice of cancer diagnosis and treatment design.

    Insights into Neuroblastoma Initiation and Disease Progression Through Integrative Genomics and Epigenomics

    In this dissertation, we use integrative genomics to gain new insights into the molecular lesions and mechanisms that drive neuroblastoma. In Part 1, we use imputation and epigenetic profiling to identify the causal germline SNP that drives differential susceptibility to neuroblastoma at the LMO1 oncogene locus. In Part 2, we use whole genome sequencing and Bayesian statistical modeling to understand the clonal evolution that occurs between diagnosis and relapse. Part 1: Neuroblastoma is a pediatric malignancy that typically arises in early childhood and is derived from the developing sympathetic nervous system. A previous genome-wide association study identified common polymorphisms at the LMO1 gene locus that are highly associated with neuroblastoma susceptibility and oncogenic addiction to LMO1 in the tumor cells. Here we investigate the causal DNA variant at this locus. We show that SNP rs2168101 G>T is the most highly associated variant and resides in a super-enhancer defined by extensive acetylation of histone H3 lysine 27 within the first intron of LMO1. The ancestral G allele that is associated with tumor formation resides in a conserved GATA transcription factor binding motif. We show that the newly evolved protective TATA allele ablates GATA3 binding and enhancer activity, and is associated with decreased total and allele-specific LMO1 expression in neuroblastoma primary tumors. These findings indicate that a recently evolved polymorphism within a super-enhancer element in the first intron of LMO1 influences neuroblastoma susceptibility through differential GATA transcription factor binding and direct modulation of LMO1 expression in cis. Part 2: The majority of high-risk neuroblastomas initially respond to chemotherapy, but over half of patients will experience therapy-resistant relapses, which are nearly always fatal. The molecular defects driving relapse and drug resistance are unknown. 
We performed whole genome sequencing of 23 paired diagnostic and relapsed neuroblastomas, and corresponding normal lymphocyte DNAs, to define genetic alterations associated with relapse. Unbiased pathway analysis of the somatic mutations detected in the relapse tissues identified a strong enrichment in genes associated with RAS-MAPK signaling (18 of 23 patients). These RAS-MAPK mutations were clonally enriched at relapse and exist within clonal or major subclonal tumor populations. Similar MAPK pathway mutations were detected in 11 of 18 human neuroblastoma-derived cell lines, and these lesions are predicted to be sensitive to small molecule inhibition of MEK in vitro and in vivo. In this study of 23 neuroblastoma cases, MAPK pathway mutations were highly enriched in the relapsed genomes, providing a potential biomarker for new therapeutic approaches to chemotherapy-refractory disease. Collectively, these studies provide important insights into the genetic and epigenetic factors driving neuroblastoma, and suggest new opportunities for pathway-targeted therapies.

    Integrative Transcriptomic Analysis of Long Intergenic Non-Coding RNAs in Cancer.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Whole-genome sequencing of chronic lymphocytic leukemia identifies subgroups with distinct biological and clinical features.

    The value of genome-wide over targeted driver analyses for predicting clinical outcomes of cancer patients is debated. Here, we report the whole-genome sequencing of 485 chronic lymphocytic leukemia patients enrolled in clinical trials as part of the United Kingdom's 100,000 Genomes Project. We identify an extended catalog of recurrent coding and noncoding genetic mutations that represents a source for future studies, and we provide the most complete high-resolution map of structural variants, copy number changes and global genome features, including telomere length, mutational signatures and genomic complexity. We demonstrate the relationship of these features with clinical outcome and show that integration of 186 distinct recurrent genomic alterations defines five genomic subgroups that associate with response to therapy, refining conventional outcome prediction. While requiring independent validation, our findings highlight the potential of whole-genome sequencing to inform future risk stratification in chronic lymphocytic leukemia.

    Extensive disruption of protein interactions by genetic variants across the allele frequency spectrum in human populations

    Each human genome carries tens of thousands of coding variants. The extent to which this variation is functional and the mechanisms by which it exerts its influence remain largely unexplored. To address this gap, we leverage the ExAC database of 60,706 human exomes to investigate experimentally the impact of 2009 missense single nucleotide variants (SNVs) across 2185 protein-protein interactions, generating interaction profiles for 4797 SNV-interaction pairs, of which 421 SNVs segregate at > 1% allele frequency in human populations. We find that interaction-disruptive SNVs are prevalent at both rare and common allele frequencies. Furthermore, these results suggest that 10.5% of missense variants carried per individual are disruptive, a higher proportion than previously reported; this indicates that each individual's genetic makeup may be significantly more complex than expected. Finally, we demonstrate that candidate disease-associated mutations can be identified through shared interaction perturbations between variants of interest and known disease mutations.

    Identification of Novel Causative Genes for Colorectal Adenomatous Polyposis

    In up to 50% of families with clinically verified adenomatous polyposis, no germline mutations in the established genes APC and MUTYH can be identified during routine diagnostics, although the presence of high numbers of colorectal adenomas strongly argues for an underlying genetic cause, either as a monogenic or a genetically complex trait. Therefore, the aim of this study was (i) to identify cryptic germline mutations in the APC gene which were not detected by routine diagnostics; (ii) to identify novel causative genes of adenomatous polyposis by a genome-wide SNP-array-based CNV analysis; and (iii) to further evaluate the pathogenic relevance of the candidate genes by additional experiments, including screening for germline point mutations in the patient cohort. Firstly, a functional study at the mRNA (transcript) level was carried out to look for deep intronic APC mutations. We identified aberrant transcript patterns in 8 (6%) of 125 unrelated patients. Five of them carried a founder germline mutation in intron 4 and three patients showed germline point mutations in intron 10, which led to the inclusion of a pseudoexon 4 and a pseudoexon 10 at the transcript level. The pseudoexons are predicted to result in frameshift mutations and premature stop codons. Therefore, a few deep intronic mutations contribute substantially to the APC mutation spectrum, and cDNA analysis and/or target sequencing of intronic regions should be considered as an additional mutation discovery approach in polyposis patients. To uncover novel causative genes in patients with unexplained adenomatous polyposis, a genome-wide analysis of germline copy number variants (CNV) using high-resolution SNP arrays was performed in 221 unrelated, well-characterized, APC and MUTYH mutation-negative German patients. Putative CNVs were filtered according to stringent criteria, compared with those of 531 population-based German controls, and validated by qPCR. 
125 unique rare germline CNVs in 93 (42%) of 221 patients were identified. These CNVs involved 68 deleted and 168 duplicated genes. The vast majority of patients harbor one CNV only. To further evaluate the pathogenic relevance of the candidate genes, additional filtering and prioritization steps at the gene level, including expression analysis in cDNA from human colon tissue, network analysis, enrichment analyses of genes and pathways, and data mining, were performed. Ninety-eight candidate genes remained, 32 of which showed molecular and cellular functions related to tumorigenesis. To further explore the clinical relevance of the candidate genes in the absence of recurrent alterations and lack of segregation information, a germline point mutation analysis was performed in a validation cohort using a targeted next generation sequencing (NGS) approach. Fifteen rare heterozygous truncating point mutations in 11 genes were identified in 15 patients. In these 11 genes, we found an additional 27 rare missense mutations which were predicted to be deleterious. CNTN6 and FOCAD showed different truncating mutations in more than one patient, whereas KIF26B had the highest frequency of potentially deleterious mutations overall. By integrating all results and recent studies of early-onset colorectal and breast cancer, CNTN6, EPHB4, KIF26B, MCM3AP, FOCAD, and HSPH1 were selected as the most convincing predisposing genes for colorectal adenomatous polyposis. In addition, in the canonical Wnt pathway oncogene CTNNB1 (β-catenin), two potential gain-of-function mutations were found. This thesis identified a group of rarely affected genes which are likely to predispose to colorectal adenoma formation and confirmed previously published candidates for tumor predisposition as etiologically relevant. Our analysis demonstrated that the underlying genetic factors of unexplained colorectal polyposis are likely to be very heterogeneous, which makes clinical validation challenging. 
To further characterize the functional relevance of the selected genes, international collaborations with large patient cohorts and functional studies are needed.

    Modifiers of CAG repeat instability: insights from mammalian models

    At thirteen different genomic locations, the expansion of a CAG/CTG repeat causes a neurodegenerative or neuromuscular disease, the most common being Huntington's disease and myotonic dystrophy type 1. These disorders are characterized by germline and somatic instability of the causative CAG/CTG repeat mutations. Repeat lengthening, or expansion, in the germline leads to an earlier age of onset or more severe symptoms in the next generation. In somatic cells, repeat expansion is thought to accelerate disease progression. The mechanisms underlying repeat instability are not well understood. Here we review the mammalian model systems that have been used to study CAG/CTG repeat instability, and the modifiers identified in these systems. Mouse models have demonstrated prominent roles for proteins in the mismatch repair pathway as critical drivers of CAG/CTG instability, a role also suggested by recent genome-wide association studies in humans. We draw attention to a network of connections between modifiers identified across several systems that might indicate pathway crosstalk in the context of repeat instability, and which could provide hypotheses for further validation or discovery. Overall, the data indicate that repeat dynamics might be modulated by altering the levels of DNA metabolic proteins, their regulation, their interaction with chromatin, or by direct perturbation of the repeat tract. Applying novel methodologies and technologies to this area of research will be needed to gain deeper mechanistic insight that can be harnessed for therapies aimed at preventing repeat expansion or promoting repeat contraction.

    Small genomic insertions form enhancers that misregulate oncogenes

    The non-coding regions of tumour cell genomes harbour a considerable fraction of total DNA sequence variation, but the functional contribution of these variants to tumorigenesis is ill-defined. Among these non-coding variants, somatic insertions are among the least well characterized due to challenges with interpreting short-read DNA sequences. Here, using a combination of ChIP-seq to enrich enhancer DNA and a computational approach with multiple DNA alignment procedures, we identify enhancer-associated small insertion variants. Among the 102 tumour cell genomes we analyse, small insertions are frequently observed in enhancer DNA sequences near known oncogenes. Further study of one insertion, somatically acquired in primary leukaemia tumour genomes, reveals that it nucleates formation of an active enhancer that drives expression of the LMO2 oncogene. The approach described here to identify enhancer-associated small insertion variants provides a foundation for further study of these abnormalities across human cancers.