
    Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer.
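The abstract above mentions de novo approaches that look for statistical patterns of mutual exclusivity. As a minimal sketch of that idea, and not the method of any specific tool reviewed in the paper, the following computes a one-sided hypergeometric p-value for two genes co-occurring in as few samples as observed. The function name, the sample identifiers, and the choice of a hypergeometric null are all assumptions made for illustration.

```python
from math import comb


def mutual_exclusivity_pvalue(n_samples, mutated_a, mutated_b):
    """One-sided hypergeometric p-value for mutual exclusivity.

    n_samples: total number of tumor samples in the cohort.
    mutated_a, mutated_b: sets of sample identifiers in which gene A
    (respectively gene B) carries a mutation. These inputs are hypothetical.

    Returns the probability of seeing this many co-mutated samples or fewer
    if the samples mutated in B were drawn at random from the cohort.
    """
    a = len(mutated_a)
    b = len(mutated_b)
    both = len(mutated_a & mutated_b)
    # Smallest overlap that is possible given the two marginal counts.
    lowest = max(0, a + b - n_samples)
    total = comb(n_samples, b)
    favourable = sum(
        comb(a, k) * comb(n_samples - a, b - k)
        for k in range(lowest, both + 1)
    )
    return favourable / total


# Ten samples: gene A is mutated in samples 0-4, gene B in samples 5-9.
p_value = mutual_exclusivity_pvalue(10, set(range(5)), set(range(5, 10)))
print(p_value)
```

A small p-value suggests that the two genes are mutated together less often than chance would predict. A pairwise test like this ignores per-sample mutation rates and multiple testing, both of which matter on real cohorts.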

    Cancer genetics research in the era of new sequencing methods (Syöpägenetiikan tutkimus uusien sekvensointimenetelmien aikakaudella)

    Research in cancer genetics aims to detect genetic causes for the excessive growth of cells, which may subsequently form a tumor and further develop into cancer. The Human Genome Project succeeded in mapping the majority of the human DNA sequence, which enabled modern sequencing technologies, namely next-generation sequencing (NGS), to emerge. The new era of disease genetics research shifted DNA analyses from the laboratory to computer screens. Since then, the massive growth of sequencing data has been facilitating the detection of novel disease-causing mutations and thus improving the screening and medical treatment of cancer. However, the exponential growth of sequencing data brought new challenges for computing. The sheer size of the data is not only expensive to store and maintain, but also highly demanding to process and analyze. Moreover, not only has the amount of sequencing data increased, but new kinds of functional genomics data, which are instrumental in figuring out the consequences of detected mutations, have also emerged. To this end, continuous software development has become essential to enable the utilization of all produced research data, new and old. This thesis describes software for the analysis and visualization of NGS data (publication I) that allows the integration of genomic data from various sources. The software, BasePlayer, was designed to meet the need for efficient and user-friendly methods for analyzing and visualizing massive variant datasets and various other types of genomic data. To this end, we developed a multi-purpose tool for the analysis of genomic data, such as DNA, RNA, ChIP-seq, and DNase. The capabilities of BasePlayer in the detection of putatively causative variants and data visualization have already been used in over twenty scientific publications. The applicability of the software is demonstrated in this thesis with two distinct analysis cases, publications II and III.
The second study considered somatic mutations in colorectal cancer (CRC) genomes. We were able to identify distinct mutation patterns at the CTCF/Cohesin binding sites (CBSs) by analyzing whole-genome sequencing (WGS) data with BasePlayer. The sites were observed to be frequently mutated in CRC, especially in samples with a specific mutational signature. However, the source of the mutation accumulation remained unclear. In contrast, a subset of samples with an ultra-mutator phenotype, caused by a defective polymerase epsilon (POLE) gene, exhibited an inverse pattern at CBSs. We detected the same signal in other, predominantly gastrointestinal, cancers as well. However, we were not able to measure changes in gene expression at mutated sites, so the role of the CBS mutations in tumorigenesis remains to be elucidated. The third study considered esophageal squamous cell carcinoma (ESCC), and the objective was to detect predisposing mutations using the Finnish Cancer Registry (FCR) data. We performed clustering analysis for the FCR data, with additional information obtained from the Population Information System of Finland. We detected an enrichment of ESCC in the Karelia region and were able to collect and sequence 30 formalin-fixed paraffin-embedded (FFPE) samples from the region. We reported several candidate genes, out of which EP300 and DNAH9 were considered the most interesting. The study not only reported putative genes predisposing to ESCC but also worked as a proof of concept for the feasibility of conducting genetic research utilizing both clustering of the FCR data and FFPE exome sequencing in such studies.
The aim of cancer genetics research is to find the underlying causes of the excessive growth of cells, which can lead to the formation of a tumor and further develop into cancer. The large-scale Human Genome Project, whose aim was to determine the entire human DNA sequence (the genome), was largely completed at the turn of the millennium.
Determining whole genomes enabled the development and adoption of second-generation sequencing methods (next-generation sequencing, NGS). This began a new era, especially in disease genetics, and moved analyses from laboratories to computer screens. The vast amounts of data produced by NGS methods accelerated new genetic discoveries but also brought new challenges, especially to biological data processing, that is, bioinformatics. In addition to data volumes, the number of different data types also grew and continues to grow; processing all produced data into an analyzable form requires very powerful computers and algorithms. Moreover, combining (integrating) many different samples and data types into a sensible whole demands flexibility and efficiency from analysis software, especially in studies of regulatory regions. Continuous development of bioinformatics software is therefore of paramount importance so that all produced data can be made as useful as possible to researchers. This dissertation presents a software package, BasePlayer, which was developed for large-scale sequence data analysis and visualization (publication I). BasePlayer combines, in a graphical user interface, the features needed for genetic analysis, data integration, and visualization. The software enables, for example, the simultaneous examination of hundreds of tumor samples, which makes it possible to identify predisposing or cancer-driving mutations in gene regulatory regions. BasePlayer has already been used in more than twenty scientific publications, two of which are part of this dissertation (publications II and III). In the second publication, BasePlayer was used to search for cancer-driving mutations in gene regulatory regions using more than two hundred colorectal cancer samples. With whole-genome sequencing data, we observed that in some samples mutations have accumulated abundantly, especially at cohesin binding sites.
Cohesin is involved in several important functions related to, among other things, DNA structure and gene regulation. We also observed a depletion of mutations in the same regions in samples that were ultra-mutated (a hundredfold mutation count compared with average colorectal tumors). The accumulation of mutations was also observed in other cancers, especially those of the gastrointestinal tract. The role of the observed phenomenon in tumor development, however, remained unresolved. In the third study, we searched for gene mutations predisposing to esophageal cancer. Using the Finnish Cancer Registry and the Population Information System, we searched for regions where, based on surname, esophageal cancer occurred significantly more often than average. A significantly enriched region was found in the area of ceded Karelia, from which we were able to collect and sequence 30 archived tissue samples. We used BasePlayer's sample comparison features to detect variants enriched in patients compared with the general population. The most interesting results concerned rare variants in the EP300 and DNAH9 genes. In addition to reporting possible new predisposition genes, this work showed that disease clusters of the kind described can be found using the cancer registry, and also that archived tissue material is usable in sequencing-based studies of this kind.

    Genomic basis for RNA alterations in cancer

    Transcript alterations often result from somatic changes in cancer genomes. Various forms of RNA alterations have been described in cancer, including overexpression, altered splicing and gene fusions; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed 'bridged' fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

    Mapping the Landscape of Mutation Rate Heterogeneity in the Human Genome: Approaches and Applications

    All heritable genetic variation is ultimately the result of mutations that have occurred in the past. Understanding the processes which determine the rate and spectra of new mutations is therefore fundamentally important in efforts to characterize the genetic basis of heritable disease, infer the timing and extent of past demographic events (e.g., population expansion, migration), or identify signals of natural selection. This dissertation aims to describe patterns of mutation rate heterogeneity in detail, identify factors contributing to this heterogeneity, and develop methods and tools to harness such knowledge for more effective and efficient analysis of whole-genome sequencing data. In Chapters 2 and 3, we catalog granular patterns of germline mutation rate heterogeneity throughout the human genome by analyzing extremely rare variants ascertained from large-scale whole-genome sequencing datasets. In Chapter 2, we describe how mutation rates are influenced by local sequence context and various features of the genomic landscape (e.g., histone marks, recombination rate, replication timing), providing detailed insight into the determinants of single-nucleotide mutation rate variation. We show that these estimates reflect genuine patterns of variation among de novo mutations, with broad potential for improving our understanding of the biology of underlying mutation processes and the consequences for human health and evolution. These estimated rates are publicly available at http://mutation.sph.umich.edu/. In Chapter 3, we introduce a novel statistical model to elucidate the variation in rate and spectra of multinucleotide mutations throughout the genome. We catalog two major classes of multinucleotide mutations: those resulting from error-prone translesion synthesis, and those resulting from repair of double-strand breaks. 
In addition, we identify specific hotspots for these unique mutation classes and describe the genomic features associated with their spatial variation. We show how these multinucleotide mutation processes, along with sample demography and mutation rate heterogeneity, contribute to the overall patterns of clustered variation throughout the genome, promoting a more holistic approach to interpreting the source of these patterns. In Chapter 4, we develop Helmsman, a computationally efficient software tool to infer mutational signatures in large samples of cancer genomes. By incorporating parallelization routines and efficient programming techniques, Helmsman performs this task up to 300 times faster and with a memory footprint 100 times smaller than existing mutation signature analysis software. Moreover, Helmsman is the only such program capable of directly analyzing arbitrarily large datasets. The Helmsman software can be accessed at https://github.com/carjed/helmsman. Finally, in Chapter 5, we present a new method for quality control in large-scale whole-genome sequencing datasets, using a combination of dimensionality reduction algorithms and unsupervised anomaly detection techniques. Just as the mutation spectrum can be used to infer the presence of underlying mechanisms, we show that the spectrum of rare variation is a powerful and informative indicator of sample sequencing quality. Analyzing three large-scale datasets, we demonstrate that our method is capable of identifying samples affected by a variety of technical artifacts that would otherwise go undetected by standard ad hoc filtering criteria. We have implemented this method in a software package, Doomsayer, available at https://github.com/carjed/doomsayer.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147537/1/jedidiah_1.pd
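The abstract above relies on the mutation spectrum, both for inferring mutational signatures and for quality control. As a rough illustration of what a spectrum is, the sketch below tallies single-nucleotide variants into the six conventional pyrimidine-referenced substitution classes. It is an assumption-laden toy and does not reproduce the Helmsman or Doomsayer implementations; the function names and the input format of (ref, alt) pairs are hypothetical.

```python
from collections import Counter

# Watson-Crick complements, used to fold purine-reference variants
# onto the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The six substitution classes when the reference base is a pyrimidine.
CLASSES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def substitution_class(ref, alt):
    """Return the pyrimidine-referenced class of a single-base substitution."""
    if ref in ("A", "G"):
        # A G>A change on one strand is a C>T change on the other.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def mutation_spectrum(variants):
    """Return the fraction of variants in each of the six classes.

    variants: iterable of (ref, alt) single-base pairs (hypothetical input).
    """
    counts = Counter(substitution_class(ref, alt) for ref, alt in variants)
    total = sum(counts.values())
    if total == 0:
        return {cls: 0.0 for cls in CLASSES}
    return {cls: counts[cls] / total for cls in CLASSES}


spectrum = mutation_spectrum([("G", "A"), ("C", "T"), ("A", "G"), ("T", "C")])
print(spectrum)
```

Signature analyses usually extend these six classes with the flanking bases, giving 96 trinucleotide types, but the strand-folding step is the same.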

    Network Approaches to the Study of Genomic Variation in Cancer

    Advances in genomic sequencing technologies opened the door for a wider study of cancer etiology. By analyzing datasets with thousands of exomes (or genomes), researchers gained a better understanding of the genomic alterations that confer a selective advantage towards cancerous growth. A predominant narrative in the field has been based on a dichotomy of alterations that confer a strong selective advantage, called cancer drivers, and the bulk of other alterations assumed to have a neutral effect, called passengers. Yet, a series of studies questioned this narrative and assigned potential roles to passengers, be it in terms of facilitating tumorigenesis or countering the effect of drivers. Consequently, the passenger mutational landscape received a higher level of attention in an attempt to prioritize the possible effects of its alterations and to identify new therapeutic targets. In this dissertation, we introduce interpretable network approaches to the study of genomic variation in cancer. We rely on two types of networks, namely functional biological networks and artificial neural nets. In the first chapter, we describe a propagation method that prioritizes 230 infrequently mutated genes with respect to their potential contribution to cancer development. In the second chapter, we further transcend the driver-passenger dichotomy and demonstrate a gradient of cancer relevance across human genes. In the last two chapters, we present methods that simplify neural network models to render them more interpretable with a focus on functional genomic applications in cancer and beyond.
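The abstract above describes a propagation method over functional biological networks without giving its details. One common scheme for this kind of prioritization is a random walk with restart, sketched below under stated assumptions: the graph is undirected with a symmetric adjacency mapping and no isolated nodes, and the gene names, seed scores, and restart probability are illustrative only. This is not claimed to be the dissertation's own method.

```python
def propagate(adjacency, seed_scores, restart=0.5, n_iter=100, tol=1e-9):
    """Random walk with restart over an undirected gene network.

    adjacency: dict mapping each gene to the set of its neighbours
        (assumed symmetric, every gene having at least one neighbour).
    seed_scores: dict mapping seed genes to their initial score,
        for example a mutation frequency (hypothetical input).
    """
    nodes = sorted(adjacency)
    degree = {gene: len(adjacency[gene]) for gene in nodes}
    scores = {gene: seed_scores.get(gene, 0.0) for gene in nodes}
    for _ in range(n_iter):
        updated = {}
        for gene in nodes:
            # Each neighbour passes on an equal share of its current score.
            incoming = sum(scores[nb] / degree[nb] for nb in adjacency[gene])
            updated[gene] = (
                restart * seed_scores.get(gene, 0.0)
                + (1.0 - restart) * incoming
            )
        change = max(abs(updated[gene] - scores[gene]) for gene in nodes)
        scores = updated
        if change < tol:
            break
    return scores


# Toy path network GENE_A - GENE_B - GENE_C with one seed gene.
network = {
    "GENE_A": {"GENE_B"},
    "GENE_B": {"GENE_A", "GENE_C"},
    "GENE_C": {"GENE_B"},
}
result = propagate(network, {"GENE_A": 1.0})
print(result)
```

Genes that are rarely mutated themselves but sit close to frequently mutated neighbours end up with non-zero scores, which is the intuition behind using propagation to rank infrequently mutated genes.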

    Signatures of TOP1 transcription-associated mutagenesis in cancer and germline

    The mutational landscape is shaped by many processes. Genic regions are vulnerable to mutation but are preferentially protected by transcription-coupled repair [1]. In microorganisms, transcription has been demonstrated to be mutagenic [2,3]; however, the impact of transcription-associated mutagenesis remains to be established in higher eukaryotes [4]. Here we show that ID4, a cancer insertion–deletion (indel) mutation signature of unknown aetiology [5] characterized by short (2 to 5 base pair) deletions, is due to a transcription-associated mutagenesis process. We demonstrate that defective ribonucleotide excision repair in mammals is associated with the ID4 signature, with mutations occurring at a TNT sequence motif, implicating topoisomerase 1 (TOP1) activity at sites of genome-embedded ribonucleotides as a mechanistic basis. Such TOP1-mediated deletions occur somatically in cancer, and the ID-TOP1 signature is also found in physiological settings, contributing to genic de novo indel mutations in the germline. Thus, although topoisomerases protect against genome instability by relieving topological stress [6], their activity may also be an important source of mutations in the human genome.

    Computational investigation of cancer genomes

    Cancer is a leading cause of death worldwide, and its incidence is increasing as modern lifestyles prolong human life. All cancers originate from a single cell that has acquired genetic aberrations enabling uncontrolled proliferation. Each cancer is unique in its aberrant genetic makeup, which defines, to a large extent, its biology, aggressiveness, and vulnerabilities to different treatments. Furthermore, the genetic makeup of each cancer is heterogeneous among its constituent cancer cells, and dynamic, with the ability to evolve in order to preserve the survival of cancer cells. Sequencing technologies are currently producing massive amounts of data that, with the help of specialized computational methods, can revolutionize our knowledge of cancer. A key question in cancer research is how to personalize the treatment of cancer patients, so that each cancer is treated according to its molecular characteristics. The first study in this thesis takes a step in that direction through a proposed novel molecular classification system of diffuse large B-cell lymphoma (DLBCL), which is the most common hematological malignancy in adults. The suggested classification, derived from the integrative analysis of gene expression and DNA mutations, stratifies DLBCL into four groups with distinct biology, genetic landscapes, and clinical outcome. These subtypes could help identify patients at high risk who may benefit from an altered treatment plan. Understanding the genomic evolution of cancer that transforms a typically curable primary tumor into an incurable drug-resistant metastasis is another aspect of cancer research under intensive investigation. The second study in this thesis investigates the spreading patterns of metastasis in breast cancer, which is the most common cancer in women. Using phylogenetic analysis of somatic mutations from longitudinal breast cancer samples, the metastasis routes were uncovered.
The study revealed that breast cancer spreads either in parallel from the primary tumor to multiple distant sites, or linearly from the primary tumor to a distant site, and then from that site to another. However, in all cases, axillary lymph nodes did not mediate the spreading to distant sites. This provided genetics-based evidence for the redundancy of lymph node dissection in breast cancer management. For genetics-based diagnostics in cancer, the computational methods used to detect genetic aberrations need to be evaluated for their accuracy. The third study in this thesis compares methods for detecting somatic copy number alterations from cancer samples. The study evaluated several commonly used methods for two different sequencing platforms using simulated and real cancer data. The results provided an overview of the weaknesses of the different methods that could be methodologically improved. Altogether, this thesis gives an overview of the field of computational cancer genomics and presents three studies that exemplify the clinical relevance of computational research.