63 research outputs found
A Third Approach to Gene Prediction Suggests Thousands of Additional Human Transcribed Regions
The identification and characterization of the complete ensemble of genes is a main goal of deciphering the digital information stored in the human genome. Many algorithms for computational gene prediction have been described, ultimately derived from two basic concepts: (1) modeling gene structure and (2) recognizing sequence similarity. Successful hybrid methods combining these two concepts have also been developed. We present a third orthogonal approach to gene prediction, based on detecting the genomic signatures of transcription, accumulated over evolutionary time. We discuss four algorithms based on this third concept: Greens and CHOWDER, which quantify mutational strand biases caused by transcription-coupled DNA repair, and ROAST and PASTA, which are based on strand-specific selection against polyadenylation signals. We combined these algorithms into an integrated method called FEAST, which we used to predict the location and orientation of thousands of putative transcription units not overlapping known genes. Many of the newly predicted transcriptional units do not appear to code for proteins. The new algorithms are particularly apt at detecting genes with long introns and lacking sequence conservation. They therefore complement existing gene prediction methods and will help identify functional transcripts within many apparent “genomic deserts.”
A comparative genomics multitool for scientific discovery and conservation
A whole-genome alignment of 240 phylogenetically diverse species of eutherian mammal, including 131 previously uncharacterized species, from the Zoonomia Project provides data that support biological discovery, medical research and conservation. The Zoonomia Project is investigating the genomics of shared and specialized traits in eutherian mammals. Here we provide genome assemblies for 131 species, of which all but 9 are previously uncharacterized, and describe a whole-genome alignment of 240 species of considerable phylogenetic diversity, comprising representatives from more than 80% of mammalian families. We find that regions of reduced genetic diversity are more abundant in species at a high risk of extinction, discern signals of evolutionary selection at high resolution and provide insights from individual reference genomes. By prioritizing phylogenetic diversity and making data available quickly and without restriction, the Zoonomia Project aims to support biological discovery, medical research and the conservation of biodiversity.
Insights into mammalian TE diversity through the curation of 248 genome assemblies
[INTRODUCTION] An estimated 160 million years have passed since the first placental mammals evolved. These eutherians are categorized into 19 orders consisting of nearly 4000 extant species, with ~70% being bats or rodents. Broad, in-depth, and comparative genomic studies across Eutheria have previously been unachievable because of the lack of genomic resources. The collaboration of the Zoonomia Consortium made available hundreds of high-quality genome assemblies for comparative analysis. Our focus within the consortium was to investigate the evolution of transposable elements (TEs) among placental mammals. Using these data, we identified previously known TEs, described previously unknown TEs, and analyzed the TE distribution among multiple taxonomic levels.
[RATIONALE] The emergence of accurate and affordable sequencing technology has propelled efforts to sequence ever more nonmodel mammalian genomes in the past decade. Most of these efforts have traditionally focused on genic regions, searching for patterns of selection or variation in gene regulation. The common trend of ignoring or trivializing TE annotation in newly published genomes has resulted in a severe lag in TE analyses, leaving extensive TE variation undiscovered. This oversight has neglected an important source of evolution because the accumulation of TEs is linked to drastic alterations in genome architecture, including insertions, deletions, duplications, translocations, and inversions. Our approach to the Zoonomia dataset was to provide future inquirers with accurate and meticulous TE curations and to describe taxonomic variation among eutherians.
[RESULTS] We annotated the TE content of 248 mammalian genome assemblies, which yielded a library of 25,676 consensus TE sequences, 8263 of which were previously unidentified TE sequences (available at https://dfam.org). We affirmed that the largest component of a typical mammalian genome consists of TEs (average 45.6%). Of the 248 assemblies, the lowest genomic percentage of TEs was found in the star-nosed mole (27.6%), and the largest percentage was seen in the aardvark (74.5%), whose increase in TE accumulation drove a corresponding increase in genome size, a correlation we observed across Eutheria. The overall genomic proportions of recently accumulated TEs were roughly similar across most mammals in the dataset, with a few notable exceptions (see the figure). Diversity of recently accumulated TEs is highest among multiple families of bats, mostly driven by substantial DNA transposon activity. Our data also exhibit an increase of recently accumulated DNA transposons among carnivore lineages over their herbivorous counterparts, which suggests that diet may play a role in determining the genomic content of TEs.
[CONCLUSION] The copious TE data provided in this work emanated from the largest comprehensive TE curation effort to date. Considering the wide-ranging effects that TEs impose on genomic architecture, these data are an important resource for future inquiries into mammalian genomics and evolution and suggest avenues for continued study of these important yet understudied genomic denizens.
This project was partially supported by NSF grant DEB 1838283 (D.D.M.-S. and D.A.R.), NSF grant IOS 2032006 (D.D.M.-S. and D.A.R.), National Institutes of Health (NIH) grant R01HG002939 (J.M.S., R.H., A.F.A.S., and J.Ros.), NIH grant U24HG010136 (J.M.S., R.H., A.F.A.S., and J.Ros.), NSF grant DEB 1838273 (L.M.D.), NSF grant DGE 1633299 (L.M.D.), NIH grant NHGRI R01HG008742 (Zoonomia Consortium), and a Swedish Research Council Distinguished Professor Award (Zoonomia Consortium).
Dissecting the Shared Genetic Architecture of Suicide Attempt, Psychiatric Disorders, and Known Risk Factors
Background: Suicide is a leading cause of death worldwide, and nonfatal suicide attempts, which occur far more frequently, are a major source of disability and social and economic burden. Both have substantial genetic etiology, which is partially shared and partially distinct from that of related psychiatric disorders. Methods: We conducted a genome-wide association study (GWAS) of 29,782 suicide attempt (SA) cases and 519,961 controls in the International Suicide Genetics Consortium (ISGC). The GWAS of SA was conditioned on psychiatric disorders using GWAS summary statistics via multitrait-based conditional and joint analysis, to remove genetic effects on SA mediated by psychiatric disorders. We investigated the shared and divergent genetic architectures of SA, psychiatric disorders, and other known risk factors. Results: Two loci reached genome-wide significance for SA: the major histocompatibility complex and an intergenic locus on chromosome 7, the latter of which remained associated with SA after conditioning on psychiatric disorders and replicated in an independent cohort from the Million Veteran Program. This locus has been implicated in risk-taking behavior, smoking, and insomnia. SA showed strong genetic correlation with psychiatric disorders, particularly major depression, and also with smoking, pain, risk-taking behavior, sleep disturbances, lower educational attainment, reproductive traits, lower socioeconomic status, and poorer general health. After conditioning on psychiatric disorders, the genetic correlations between SA and psychiatric disorders decreased, whereas those with nonpsychiatric traits remained largely unchanged. Conclusions: Our results identify a risk locus that contributes more strongly to SA than other phenotypes and suggest a shared underlying biology between SA and known risk factors that is not mediated by psychiatric disorders.
Erratum: Corrigendum: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
International Chicken Genome Sequencing Consortium.
The Original Article was published on 09 December 2004.
Nature 432, 695–716 (2004).
In Table 5 of this Article, the last four values listed in the ‘Copy number’ column were incorrect. These should be: LTR elements, 30,000; DNA transposons, 20,000; simple repeats, 140,000; and satellites, 4,000. These errors do not affect any of the conclusions in our paper.
Additional information.
The online version of the original article can be found at 10.1038/nature0315
Cohort Profile: Burden of Obstructive Lung Disease (BOLD) study
The Burden of Obstructive Lung Disease (BOLD) study was established to assess the prevalence of chronic airflow obstruction, a key characteristic of chronic obstructive pulmonary disease, and its risk factors in adults (≥40 years) from general populations across the world.
The baseline study was conducted between 2003 and 2016, in 41 sites across Africa, Asia, Europe, North America, the Caribbean and Oceania, and collected high-quality pre- and post-bronchodilator spirometry from 28 828 participants.
The follow-up study was conducted between 2019 and 2021, in 18 sites across Africa, Asia, Europe and the Caribbean. At baseline, there were in these sites 12 502 participants with high-quality spirometry. A total of 6452 were followed up, with 5936 completing the study core questionnaire. Of these, 4044 also provided high-quality pre- and post-bronchodilator spirometry.
On both occasions, the core questionnaire covered information on respiratory symptoms, doctor diagnoses, health care use, medication use and health status, as well as potential risk factors. Information on occupation, environmental exposures and diet was also collected.
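The follow-up counts quoted in this cohort profile can be turned into retention proportions with simple arithmetic. The sketch below is illustrative only: the variable names and the `pct` helper are assumptions introduced here, and the choice of denominators (each stage relative to the one before it) is one reading of the abstract, not a calculation reported by the study.

```python
# Counts taken from the abstract above; names are hypothetical.
baseline_in_followup_sites = 12502   # baseline participants in the 18 follow-up sites
followed_up = 6452                   # participants followed up
completed_questionnaire = 5936       # completed the study core questionnaire
provided_spirometry = 4044           # also gave high-quality pre/post spirometry


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Each stage is expressed relative to the stage before it (an assumption).
followup_rate = pct(followed_up, baseline_in_followup_sites)
questionnaire_rate = pct(completed_questionnaire, followed_up)
spirometry_rate = pct(provided_spirometry, completed_questionnaire)

print(followup_rate, questionnaire_rate, spirometry_rate)
```

Under these assumptions, roughly half of the eligible baseline participants were followed up, and about two thirds of those completing the questionnaire also provided usable spirometry.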
Accuracy of multiple sequence alignment methods in the reconstruction of transposable element families.
The construction of a high-quality multiple sequence alignment (MSA) from copies of a transposable element (TE) is a critical step in the characterization of a new TE family. Most studies of MSA accuracy have been conducted on protein or RNA sequence families, where structural features and strong signals of selection may assist with alignment. Less attention has been given to the quality of sequence alignments involving neutrally evolving DNA sequences such as those resulting from TE replication. Transposable element sequences are challenging to align due to their wide divergence ranges, fragmentation, and predominantly neutral mutation patterns. To gain insight into the effects of these properties on MSA accuracy, we developed a simulator of TE sequence evolution, and used it to generate a benchmark with which we evaluated the MSA predictions produced by several popular aligners, along with Refiner, a method we developed in the context of our RepeatModeler software. We find that MAFFT and Refiner generally outperform other aligners for low to medium divergence simulated sequences, while Refiner is uniquely effective when tasked with aligning highly divergent and fragmented instances of a family.
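One common way to score a predicted MSA against a simulated reference is the sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the prediction. The abstract does not say which accuracy measure the authors used, so the sketch below is an assumption chosen for illustration; the function names and the toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs that share a column in the alignment.

    Each residue is identified as (sequence index, ungapped position).
    """
    positions = [0] * len(msa)          # next ungapped position per sequence
    pairs = set()
    for col in range(len(msa[0])):
        present = []
        for seq_idx, row in enumerate(msa):
            if row[col] != "-":
                present.append((seq_idx, positions[seq_idx]))
                positions[seq_idx] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(predicted, reference):
    """Fraction of reference-aligned residue pairs recovered by the prediction."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(predicted) & ref_pairs) / len(ref_pairs)


# Toy example: the prediction misplaces the gap in the second sequence.
reference = ["ACGT", "A-GT"]
predicted = ["ACGT", "AG-T"]
score = sum_of_pairs_score(predicted, reference)
print(score)
```

In the toy example the reference aligns three residue pairs and the prediction recovers two of them, so the score is 2/3; an identical alignment scores 1.0.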
Curation Guidelines for de novo Generated Transposable Element Families.
Transposable elements (TEs) have the ability to alter individual genomic landscapes and shape the course of evolution for species in which they reside. Such profound changes can be understood by studying the biology of the organism and the interplay of the TEs it hosts. Characterizing and curating TEs across a wide range of species is a fundamental first step in this endeavor. This protocol employs techniques honed while developing TE libraries for a wide range of organisms and specifically addresses: (1) the extension of truncated de novo results into full-length TE families; (2) the iterative refinement of TE multiple sequence alignments; and (3) the use of alignment visualization to assess model completeness and subfamily structure. © 2021 Wiley Periodicals LLC.
Basic Protocol: Extension and edge polishing of consensi and seed alignments derived from de novo repeat finders.
Support Protocol: Generating seed alignments using a library of consensi and a genome assembly.