Assessing the gene regulatory landscape in 1,188 human tumors
Cancer is characterised by somatic genetic variation, but the effect of the majority of non-coding somatic variants and the interface with the germline genome are still unknown. We analysed the whole genome and RNA-seq data from 1,188 human cancer patients as provided by the Pan-cancer Analysis of Whole Genomes (PCAWG) project to map cis expression quantitative trait loci of somatic and germline variation and to uncover the causes of allele-specific expression patterns in human cancers. The availability of the first large-scale dataset with both whole genome and gene expression data enabled us to uncover the effects of the non-coding variation on cancer. In addition to confirming known regulatory effects, we identified novel associations between somatic variation and expression dysregulation, in particular in distal regulatory elements. Finally, we uncovered links between somatic mutational signatures and gene expression changes, including TERT and LMO2, and we explained the inherited risk factors in APOBEC-related mutational processes. This work represents the first large-scale assessment of the effects of both germline and somatic genetic variation on gene expression in cancer and creates a valuable resource cataloguing these effects
Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis.
Long non-coding RNAs (lncRNAs) are a growing focus of cancer genomics studies, creating the need for a resource of lncRNAs with validated cancer roles. Furthermore, it remains debated whether mutated lncRNAs can drive tumorigenesis, and whether such functions could be conserved during evolution. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we introduce the Cancer LncRNA Census (CLC), a compilation of 122 GENCODE lncRNAs with causal roles in cancer phenotypes. In contrast to existing databases, CLC requires strong functional or genetic evidence. CLC genes are enriched amongst driver genes predicted from somatic mutations, and display characteristic genomic features. Strikingly, CLC genes are enriched for driver mutations from unbiased, genome-wide transposon-mutagenesis screens in mice. We identified 10 tumour-causing mutations in orthologues of 8 lncRNAs, including LINC-PINT and NEAT1, but not MALAT1. Thus CLC represents a dataset of high-confidence cancer lncRNAs. Mutagenesis maps are a novel means for identifying deeply-conserved roles of lncRNAs in tumorigenesis
High-coverage whole-genome analysis of 1220 cancers reveals hundreds of genes deregulated by rearrangement-mediated cis-regulatory alterations.
The impact of somatic structural variants (SVs) on gene expression in cancer is largely unknown. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole-genome sequencing data and RNA sequencing from a common set of 1220 cancer cases, we report hundreds of genes for which the presence within 100 kb of an SV breakpoint associates with altered expression. For the majority of these genes, expression increases rather than decreases with corresponding breakpoint events. Up-regulated cancer-associated genes impacted by this phenomenon include TERT, MDM2, CDK4, ERBB2, CD274, PDCD1LG2, and IGF2. TERT-associated breakpoints involve ~3% of cases, most frequently in liver biliary, melanoma, sarcoma, stomach, and kidney cancers. SVs associated with up-regulation of PD1 and PDL1 genes involve ~1% of non-amplified cases. For many genes, SVs are significantly associated with increased numbers or greater proximity of enhancer regulatory elements near the gene. DNA methylation near the promoter is often increased with a nearby SV breakpoint, which may involve inactivation of repressor elements.
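The 100 kb proximity criterion in this abstract can be illustrated with a small sketch. It assumes a gene is given as a (chromosome, start, end) tuple and each breakpoint as a (chromosome, position) tuple; the function name, the tuple layout, and the example coordinates are hypothetical and are not taken from the study.

```python
def has_nearby_breakpoint(gene, breakpoints, window=100_000):
    """Return True if any breakpoint lies within `window` bp of the gene body.

    gene: (chrom, start, end) tuple (hypothetical layout).
    breakpoints: iterable of (chrom, position) tuples (hypothetical layout).
    """
    chrom, start, end = gene
    return any(
        bp_chrom == chrom and start - window <= pos <= end + window
        for bp_chrom, pos in breakpoints
    )


# Made-up coordinates for illustration only.
example_gene = ("chr1", 1_000_000, 1_050_000)
example_breakpoints = [("chr1", 1_120_000), ("chr2", 1_010_000)]
print(has_nearby_breakpoint(example_gene, example_breakpoints))
```

Whether the distance is measured from the gene body, as here, or from the transcription start site is a design choice that the abstract does not specify.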
Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.
The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
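As an illustration of what combining significance levels from several methods can look like, the sketch below implements Fisher's method using only the Python standard library. The abstract does not name the combination strategy used in the study, so this is a stand-in and not the authors' procedure. Fisher's method also assumes independent p-values, which is unlikely to hold for driver discovery methods run on the same mutation data.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method (illustrative stand-in)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    # Under the null, X = -2 * sum(ln p) is chi-square with 2k degrees of freedom.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form survival function for even degrees of freedom:
    # P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values from three driver discovery methods for one element.
print(fisher_combine([0.01, 0.04, 0.20]))
```

When the input p-values are positively correlated, this combination tends to be anti-conservative, which is one reason a correlation-aware strategy is needed in practice.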
Multiplatform Analysis of 12 Cancer Types Reveals Molecular Classification within and across Tissues of Origin
Recent genomic analyses of pathologically-defined tumor types identify “within-a-tissue” disease subtypes. However, the extent to which genomic signatures are shared across tissues is still unclear. We performed an integrative analysis using five genome-wide platforms and one proteomic platform on 3,527 specimens from 12 cancer types, revealing a unified classification into 11 major subtypes. Five subtypes were nearly identical to their tissue-of-origin counterparts, but several distinct cancer types were found to converge into common subtypes. Lung squamous, head & neck, and a subset of bladder cancers coalesced into one subtype typified by TP53 alterations, TP63 amplifications, and high expression of immune and proliferation pathway genes. Of note, bladder cancers split into three pan-cancer subtypes. The multi-platform classification, while correlated with tissue-of-origin, provides independent information for predicting clinical outcomes. All datasets are available for data-mining from a unified resource to support further biological discoveries and insights into novel therapeutic strategies
Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples
Funder: NCI U24CA211006
Abstract: The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.
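The shared and private mutation counts discussed in this abstract can be sketched as a set comparison. The example assumes each call is keyed by a (chromosome, position, reference allele, alternate allele) tuple and that the overlap fraction is taken over the union of both call sets. Both are assumptions, since the abstract does not define how the ~80% figure was computed, and the function name is hypothetical.

```python
def compare_call_sets(wes_calls, wgs_calls):
    """Count shared and private mutation calls between two call sets.

    Each call is a hashable key such as (chrom, pos, ref, alt); this key
    layout is an assumption made for illustration.
    """
    wes, wgs = set(wes_calls), set(wgs_calls)
    shared = wes & wgs
    union = wes | wgs
    return {
        "shared": len(shared),
        "private_wes": len(wes - wgs),
        "private_wgs": len(wgs - wes),
        # Fraction of all distinct calls found by both platforms.
        "overlap_fraction": len(shared) / len(union) if union else 0.0,
    }


# Made-up calls for illustration only.
wes_example = [("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")]
wgs_example = [("chr1", 200, "G", "C"), ("chr2", 300, "C", "A")]
print(compare_call_sets(wes_example, wgs_example))
```

A real comparison would first restrict both call sets to regions covered by both assays, as the abstract's phrase "covered exonic regions" implies.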
Metannot: A succinct data structure for compression of colors in dynamic de Bruijn graphs
Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for and novel application of the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche
RNA Instant Quality Check: Alignment-Free RNA-Degradation Detection
With the constant increase of large-scale genomic data projects, automated and high-throughput quality assessment becomes a crucial component of any analysis. Whereas small projects often have a more homogeneous design and a manageable structure allowing for a manual per-sample analysis of quality, large-scale studies tend to be much more heterogeneous and complex. Many quality metrics have been developed to assess the quality of an individual sample on the raw read level. Degradation effects are typically assessed based on the RNA integrity number (RIN) score, or on post-alignment data. In this study, we show that single commonly used quality criteria such as the RIN score alone are not sufficient to ensure RNA sample quality. We developed a new approach and provide an efficient tool that estimates RNA sample degradation by computing the 5'/3' bias based on all genes in an alignment-free manner. This enables degradation assessment right after data generation rather than during the analysis procedure, allowing for early intervention in the sample handling process. Our analysis shows that this strategy is fast, robust to annotation and differences in library size, and provides complementary quality information to RIN scores, enabling the accurate identification of degraded samples.
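To make the idea of a 5'/3' bias concrete, the sketch below computes the ratio of mean coverage near the 5' end of a transcript to mean coverage near its 3' end. The abstract does not describe how the tool derives its signal without alignment, so the per-position coverage list, the 20% end fraction, and the function name are all assumptions made for illustration.

```python
def five_three_bias(coverage, fraction=0.2):
    """Ratio of mean 5'-end coverage to mean 3'-end coverage for one transcript.

    coverage: per-position counts ordered from 5' to 3' (assumed input).
    fraction: share of the transcript treated as each end (assumed value).
    Returns None when the ratio is undefined.
    """
    n = len(coverage)
    if n == 0:
        return None
    width = max(1, int(n * fraction))
    five_prime = sum(coverage[:width]) / width
    three_prime = sum(coverage[-width:]) / width
    if three_prime == 0:
        return None
    return five_prime / three_prime


# Made-up coverage profile with depleted 5' coverage.
print(five_three_bias([1, 1, 5, 5, 5, 5, 5, 5, 4, 4]))
```

Under these assumptions a ratio well below 1 means coverage is concentrated at the 3' end, a pattern commonly associated with degraded RNA in poly(A)-selected libraries. A per-sample score could then be summarised across all genes, for example with the median.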