79 research outputs found
PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions
As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species _Drosophila_ genome alignments exceeds all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE
A gene expression atlas of embryonic neurogenesis in Drosophila reveals complex spatiotemporal regulation of lncRNAs
Cell type specification during early nervous system development in Drosophila melanogaster requires precise regulation of gene expression in time and space. Resolving the programs driving neurogenesis has been a major challenge owing to the complexity and rapidity with which distinct cell populations arise. To resolve the cell type-specific gene expression dynamics in early nervous system development, we have sequenced the transcriptomes of purified neurogenic cell types across consecutive time points covering crucial events in neurogenesis. The resulting gene expression atlas comprises a detailed resource of global transcriptome dynamics that permits systematic analysis of how cells in the nervous system acquire distinct fates. We resolve known gene expression dynamics and uncover novel expression signatures for hundreds of genes among diverse neurogenic cell types, most of which remain unstudied. We also identified a set of conserved long noncoding RNAs (lncRNAs) that are regulated in a tissue-specific manner and exhibit spatiotemporal expression during neurogenesis with exquisite specificity. lncRNA expression is highly dynamic and demarcates specific subpopulations within neurogenic cell types. Our spatiotemporal transcriptome atlas provides a comprehensive resource for investigating the function of coding genes and noncoding RNAs during crucial stages of early neurogenesis
Heterologous Stop Codon Readthrough of Metazoan Readthrough Candidates in Yeast
Recent analysis of genomic signatures in mammals, flies, and worms indicates that functional translational stop codon readthrough is considerably more abundant in metazoa than previously recognized, but this analysis provides only limited clues about the function or mechanism of readthrough. If an mRNA known to be read through in one species is also read through in another, perhaps these questions can be studied in a simpler setting. With this end in mind, we have investigated whether some of the readthrough genes in human, fly, and worm also exhibit readthrough when expressed in S. cerevisiae. We found that readthrough was highest in a gene with a post-stop hexamer known to trigger readthrough, while other metazoan readthrough genes exhibit borderline readthrough in S. cerevisiae.National Institutes of Health (U.S.) (5U54HG004555-03
Stop codon readthrough generates a C-terminally extended variant of the human vitamin D receptor with reduced calcitriol response
Although stop codon readthrough is used extensively by viruses to expand their gene expression, verified instances of mammalian readthrough have only recently been uncovered by systems biology and comparative genomics approaches. Previously our analysis of conserved protein coding signatures that extend beyond annotated stop codons predicted stop codon readthrough of several mammalian genes, all of which have been validated experimentally. Four mRNAs display highly efficient stop codon readthrough, and these mRNAs have a UGA stop codon immediately followed by CUAG (UGA_CUAG) that is conserved throughout vertebrates. Extending on the identification of this readthrough motif, we here investigated stop codon readthrough, using tissue culture reporter assays, for all previously untested human genes containing UGA_CUAG. The readthrough efficiency of the annotated stop codon for the sequence encoding vitamin D receptor (VDR) was 6.7%. It was the highest of those tested but all showed notable levels of readthrough. The VDR is a member of the nuclear receptor superfamily of ligand-inducible transcription factors and binds its major ligand, calcitriol, via its C-terminal ligand-binding domain. Readthrough of the annotated VDR mRNA results in a 67 amino-acid-long C-terminal extension that generates a VDR proteoform named VDRx. VDRx may form homodimers and heterodimers with VDR but, compared to VDR, VDRx displayed a reduced transcriptional response to calcitriol even in the presence of its partner retinoid X receptor
Discovery of high-confidence human protein-coding genes and exons by whole-genome PhyloCSF helps elucidate 118 GWAS loci.
The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization
Genomic RNA Elements Drive Phase Separation of the SARS-CoV-2 Nucleocapsid
We report that the SARS-CoV-2 nucleocapsid protein (N-protein) undergoes liquid-liquid phase separation (LLPS) with viral RNA. N-protein condenses with specific RNA genomic elements under physiological buffer conditions and condensation is enhanced at human body temperatures (33°C and 37°C) and reduced at room temperature (22°C). RNA sequence and structure in specific genomic regions regulate N-protein condensation while other genomic regions promote condensate dissolution, potentially preventing aggregation of the large genome. At low concentrations, N-protein preferentially crosslinks to specific regions characterized by single-stranded RNA flanked by structured elements and these features specify the location, number, and strength of N-protein binding sites (valency). Liquid-like N-protein condensates form in mammalian cells in a concentration-dependent manner and can be altered by small molecules. Condensation of N-protein is RNA sequence and structure specific, sensitive to human body temperature, and manipulatable with small molecules, and therefore presents a screenable process for identifying antiviral compounds effective against SARS-CoV-2
Evolution of enhanced innate immune evasion by SARS-CoV-2
Emergence of SARS-CoV-2 variants of concern (VOCs) suggests viral adaptation to enhance human-to-human transmission1,2. Although much effort has focused on characterisation of spike changes in VOCs, mutations outside spike likely contribute to adaptation. Here we used unbiased abundance proteomics, phosphoproteomics, RNAseq and viral replication assays to show that isolates of the Alpha (B.1.1.7) variant3 more effectively suppress innate immune responses in airway epithelial cells, compared to first wave isolates. We found that Alpha has dramatically increased subgenomic RNA and protein levels of N, Orf9b and Orf6, all known innate immune antagonists. Expression of Orf9b alone suppressed the innate immune response through interaction with TOM70, a mitochondrial protein required for RNA sensing adaptor MAVS activation. Moreover, the activity of Orf9b and its association with TOM70 was regulated by phosphorylation. We propose that more effective innate immune suppression, through enhanced expression of specific viral antagonist proteins, increases the likelihood of successful Alpha transmission, and may increase in vivo replication and duration of infection4. The importance of mutations outside Spike in adaptation of SARS-CoV-2 to humans is underscored by the observation that similar mutations exist in the Delta and Omicron N/Orf9b regulatory regions
GENCODE 2021
© The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.National Human Genome Research Institute of the National Institutes of Health [U41HG007234]; the content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health; Wellcome Trust [WT108749/Z/15/Z, WT200990/Z/16/Z]; European Molecular Biology Laboratory; Swiss National Science Foundation through the National Center of Competence in Research ‘RNA & Disease’ (to R.J.); Medical Faculty of the University of Bern (to R.J). Funding for open access charge: National Institutes of Health
Recommended from our members
GENCODE: reference annotation for the human and mouse genomes in 2023
Data availability: No new data were generated or analysed in support of this research.Copyright © The Author(s) 2022. GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.National Human Genome Research Institute of the National Institutes of Health [U41HG007234, R01HG004037]; Wellcome Trust [WT222155/Z/20/Z]; European Molecular Biology Laboratory. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Funding for open access charge: National Institutes of Health
- …