15 research outputs found

    Feature-based classifiers for somatic mutation detection in tumour–normal paired sequencing data

    Motivation: The study of cancer genomes now routinely involves using next-generation sequencing technology (NGS) to profile tumours for single nucleotide variant (SNV) somatic mutations. However, surprisingly few published bioinformatics methods exist for the specific purpose of identifying somatic mutations from NGS data and existing tools are often inaccurate, yielding intolerably high false prediction rates. As such, the computational problem of accurately inferring somatic mutations from paired tumour/normal NGS data remains an unsolved challenge

    A community effort to create standards for evaluating tumor subclonal reconstruction

    Methods for reconstructing tumor evolution are benchmarked in the DREAM Somatic Mutation Calling Tumour Heterogeneity Challenge. Tumor DNA sequencing data can be interpreted by computational methods that analyze genomic heterogeneity to infer evolutionary dynamics. A growing number of studies have used these approaches to link cancer evolution with clinical progression and response to therapy. Although the inference of tumor phylogenies is rapidly becoming standard practice in cancer genome analyses, standards for evaluating them are lacking. To address this need, we systematically assess methods for reconstructing tumor subclonality. First, we elucidate the main algorithmic problems in subclonal reconstruction and develop quantitative metrics for evaluating them. Then we simulate realistic tumor genomes that harbor all known clonal and subclonal mutation types and processes. Finally, we benchmark 580 tumor reconstructions, varying tumor read depth, tumor type and somatic variant detection. Our analysis provides a baseline for the establishment of gold-standard methods to analyze tumor heterogeneity. Peer reviewed

    Identifying driver mutations in sequenced cancer genomes: computational approaches to enable precision medicine

    High-throughput DNA sequencing is revolutionizing the study of cancer and enabling the measurement of the somatic mutations that drive cancer development. However, the resulting sequencing datasets are large and complex, obscuring the clinically important mutations in a background of errors, noise, and random mutations. Here, we review computational approaches to identify somatic mutations in cancer genome sequences and to distinguish the driver mutations that are responsible for cancer from random, passenger mutations. First, we describe approaches to detect somatic mutations from high-throughput DNA sequencing data, particularly for tumor samples that comprise heterogeneous populations of cells. Next, we review computational approaches that aim to predict driver mutations according to their frequency of occurrence in a cohort of samples, or according to their predicted functional impact on protein sequence or structure. Finally, we review techniques to identify recurrent combinations of somatic mutations, including approaches that examine mutations in known pathways or protein-interaction networks, as well as de novo approaches that identify combinations of mutations according to statistical patterns of mutual exclusivity. These techniques, coupled with advances in high-throughput DNA sequencing, are enabling precision medicine approaches to the diagnosis and treatment of cancer

    Detection and benchmarking of somatic mutations in cancer genomes using RNA-seq data

    To detect functional somatic mutations in tumor samples, whole-exome sequencing (WES) is often used for its reliability and relatively low cost. RNA-seq, while generally used to measure gene expression, can potentially also be used for identification of somatic mutations. However, there has been little systematic evaluation of the utility of RNA-seq for identifying somatic mutations. Here, we develop and evaluate a pipeline for processing RNA-seq data from glioblastoma multiforme (GBM) tumors in order to identify somatic mutations. The pipeline entails the use of the STAR aligner 2-pass procedure jointly with MuTect2 from the Genome Analysis Toolkit (GATK) to detect somatic variants. Variants identified from RNA-seq data were evaluated by comparison against the COSMIC and dbSNP databases, and also compared to somatic variants identified by exome sequencing. We also estimated the putative functional impact of coding variants in the most frequently mutated genes in GBM. Interestingly, variants identified by RNA-seq alone showed better representation of GBM-related mutations cataloged by COSMIC. RNA-seq-only data substantially outperformed the ability of WES to reveal potentially new somatic mutations in known GBM-related pathways, and allowed us to build a high-quality set of somatic mutations common to exome and RNA-seq calls. Using RNA-seq data in parallel with WES data to detect somatic mutations in cancer genomes can thus broaden the scope of discoveries and lend additional support to somatic variants identified by exome sequencing alone

    Single-cell genomic variation induced by mutational processes in cancer

    How cell-to-cell copy number alterations that underpin genomic instability in human cancers drive genomic and phenotypic variation, and consequently the evolution of cancer, remains understudied. Here, by applying scaled single-cell whole-genome sequencing to wild-type, TP53-deficient and TP53-deficient;BRCA1-deficient or TP53-deficient;BRCA2-deficient mammary epithelial cells (13,818 genomes), and to primary triple-negative breast cancer (TNBC) and high-grade serous ovarian cancer (HGSC) cells (22,057 genomes), we identify three distinct 'foreground' mutational patterns that are defined by cell-to-cell structural variation. Cell- and clone-specific high-level amplifications, parallel haplotype-specific copy number alterations and copy number segment length variation (serrate structural variations) had measurable phenotypic and evolutionary consequences. In TNBC and HGSC, clone-specific high-level amplifications in known oncogenes were highly prevalent in tumours bearing fold-back inversions, relative to tumours with homologous recombination deficiency, and were associated with increased clone-to-clone phenotypic variation. Parallel haplotype-specific alterations were also commonly observed, leading to phylogenetic evolutionary diversity and clone-specific mono-allelic expression. Serrate variants were increased in tumours with fold-back inversions and were highly correlated with increased genomic diversity of cellular populations. Together, our findings show that cell-to-cell structural variation contributes to the origins of phenotypic and evolutionary diversity in TNBC and HGSC, and provide insight into the genomic and mutational states of individual cancer cells

    Knowledge Driven Approaches and Machine Learning Improve the Identification of Clinically Relevant Somatic Mutations in Cancer Genomics

    For cancer genomics to fully expand its utility from research discovery to clinical adoption, somatic variant detection pipelines must be optimized and standardized to ensure identification of clinically relevant mutations and to reduce laborious and error-prone post-processing steps. To address the need for improved catalogues of clinically and biologically important somatic mutations, we developed DoCM, a Database of Curated Mutations in Cancer (http://docm.info), as described in Chapter 2. DoCM is an open source, openly licensed resource to enable the cancer research community to aggregate, store and track biologically and clinically important cancer variants. DoCM currently comprises 1,364 variants in 132 genes across 122 cancer subtypes, based on the curation of 876 publications. To demonstrate the utility of this resource, the mutations in DoCM were used to identify variants of established significance in cancer that were missed by standard variant discovery pipelines (Chapter 3). Sequencing data from 1,833 cases across four TCGA projects were reanalyzed and 1,228 putative variants that were missed in the original TCGA reports were identified. Validation sequencing data were produced from 93 of these cases to confirm the putative variants we detected with DoCM. Here, we demonstrated that at least one functionally important variant in DoCM was recovered in 41% of cases studied. A major bottleneck in the DoCM analysis in Chapter 3 was the filtering and manual review of somatic variants. Several steps in this post-processing phase of somatic variant calling have already been automated. However, false positive filtering and manual review of variant candidates remain a major challenge, especially in high-throughput discovery projects or in clinical cancer diagnostics. 
In Chapter 4, an approach that systematized and standardized the post-processing of somatic variant calls using machine learning algorithms, trained on 41,000 manually reviewed variants from 20 cancer genome projects, is outlined. The approach accurately reproduced the manual review process on held-out test samples, and accurately predicted which variants would be confirmed by orthogonal validation sequencing data. When compared to traditional manual review, this approach increased identification of clinically actionable variants by 6.2%. These chapters outline studies that result in substantial improvements in the identification and interpretation of somatic variants, the use of which can standardize and streamline cancer genomics, enabling its use at high throughput as well as clinically

    Tumor subclone structure reconstruction with genomic variation data

    Thesis advisor: Gabor Marth. Unlike normal tissue cells, which contain identical copies of the same genome, tumors are composed of genetically divergent cell subpopulations, or subclones. The abilities to identify the number of subclones, their frequencies within the tumor mass, and the evolutionary relationships among them are crucial in understanding the basis of tumorigenesis, drug response, relapse, and metastasis. This is also essential information for informed, personalized therapeutic decisions. Studies have attempted to reconstruct subclone structure by identifying distinct allele frequency distribution modes at a handful of somatic single nucleotide variant loci, but this question was not adequately addressed with computational means at the start of this dissertation work, and recent efforts either enforce certain assumptions or resort to statistical procedures that cannot guarantee coverage of the complete solution space. This dissertation presents a computational framework that examines somatic variation events, such as copy number changes, loss of heterozygosity, or point mutations, in order to identify the underlying subclone structure. Chapter 2 discusses the presence of intra-tumoral heterogeneity and, for historical interest, a method to reconstruct the parsimonious solution based on simplifying assumptions about the tumor micro-evolution process. Analysis results from this method on clinical datasets for Ovarian Serous Carcinoma and Intracranial Germ Cell Tumor, which confirmed the genomic complexity, are also presented. Because the linkage information, i.e. whether two mutations co-localize in the same cancer cell, is lost during tissue homogenization and DNA fragmentation, common sample preparation steps in whole genome profiling techniques, there is often more than one subclone model capable of explaining the observation. 
Chapter 3 describes an extended method that is able to search for all models consistent with the observation. Consequently, the solution to a specific input dataset is a set of possible subclone structures. The method then trims this solution space when more than one sample from the same patient is available, such as primary and relapse tumor pairs. Furthermore, a statistical framework is developed that, when further trimming is not possible, predicts whether two mutations co-localize in the same subclone. The formal definition of the problem of subclone structure reconstruction, as well as techniques to pre-process various types of genomic variation data, are given here as well. Results from the analysis of published and novel datasets, covering cancer types including Acute Myeloid Leukemia, Sinonasal Undifferentiated Carcinoma and Ovarian Serous Carcinoma, and data types including whole genome sequencing, copy number array, single nucleotide polymorphism array and single nucleotide variant calls with deep sequencing, are also included. They show that the method is applicable to this wide range of cancer and data types, able to independently replicate published conclusions based on manual reasoning, and able to provide novel insights into the pattern of tumor recurrence and chemoresistance. They also show that the method can be valuable in prioritizing variants for functional study. Thesis (PhD), Boston College, 2014. Submitted to: Boston College. Graduate School of Arts and Sciences. Discipline: Biology