180 research outputs found

    Quantifying the mechanisms of domain gain in animal proteins.

    Get PDF
    RIGHTS : This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may : copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.BACKGROUND: Protein domains are protein regions that are shared among different proteins and are frequently functionally and structurally independent from the rest of the protein. Novel domain combinations have a major role in evolutionary innovation. However, the relative contributions of the different molecular mechanisms that underlie domain gains in animals are still unknown. By using animal gene phylogenies we were able to identify a set of high confidence domain gain events and by looking at their coding DNA investigate the causative mechanisms. RESULTS: Here we show that the major mechanism for gains of new domains in metazoan proteins is likely to be gene fusion through joining of exons from adjacent genes, possibly mediated by non-allelic homologous recombination. Retroposition and insertion of exons into ancestral introns through intronic recombination are, in contrast to previous expectations, only minor contributors to domain gains and have accounted for less than 1% and 10% of high confidence domain gain events, respectively. Additionally, exonization of previously non-coding regions appears to be an important mechanism for addition of disordered segments to proteins. We observe that gene duplication has preceded domain gain in at least 80% of the gain events. CONCLUSIONS: The interplay of gene duplication and domain gain demonstrates an important mechanism for fast neofunctionalization of genes.Published versio

    The importance of identifying alternative splicing in vertebrate genome annotation

    Get PDF
    While alternative splicing (AS) can potentially expand the functional repertoire of vertebrate genomes, relatively few AS transcripts have been experimentally characterized. We describe our detailed manual annotation of vertebrate genomes, which is generating a publicly available geneset rich in AS. In order to achieve this we have adopted a highly sensitive approach to annotating gene models supported by correctly mapped, canonically spliced transcriptional evidence combined with a highly cautious approach to adding unsupported extensions to models and making decisions on their functional potential. We use information about the predicted functional potential and structural properties of every AS transcript annotated at a protein-coding or non-coding locus to place them into one of eleven subclasses. We describe the incorporation of new sequencing and proteomics technologies into our annotation pipelines, which are used to identify and validate AS. Combining all data sources has led to the production of a rich geneset containing an average of 6.3 AS transcripts for every human multi-exon protein-coding gene. The datasets produced have proved very useful in providing context to studies investigating the functional potential of genes and the effect of variation may have on gene structure and function

    Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates

    Get PDF
    Novel human pseudogenes are identified that had previous functionality and their age is estimated. The rate of loss-of-function occurred uniformly

    SCN1A: bioinformatically-informed revised boundaries for promoter and enhancer regions

    Get PDF
    Pathogenic variations in the sodium voltage-gated channel alpha subunit 1 (SCN1A) gene are responsible for multiple epilepsy phenotypes, including Dravet syndrome (DS), febrile seizures (FS), and genetic epilepsy with febrile seizures plus (GEFS+). Phenotypic heterogeneity is a hallmark of SCN1A-related epilepsies, the causes of which are yet to be clarified. Genetic variation in the non-coding regulatory regions of SCN1A could be one potential causal factor. However, a comprehensive understanding of the SCN1A regulatory landscape is currently lacking. Here, we summarised the current state of knowledge of SCN1A regulation, providing details of its promoter and enhancer regions. We then integrated currently available data on SCN1A promoters by extracting information related to the SCN1A locus from genome-wide repositories, and clearly defined the promoter and enhancer regions of SCN1A. Further, we explored the cellular specificity of differential SCN1A promoter usage. We also reviewed and integrated the available human brain-derived enhancer databases and mouse-derived data to provide a comprehensive computationally-developed summary of SCN1A brain-active enhancers. By querying genome-wide data repositories, extracting SCN1A-specific data and integrating the different types of independent evidence, we created a comprehensive catalogue that better defines the regulatory landscape of SCN1A, which could be used to explore the role of SCN1A regulatory regions in disease

    Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes

    Get PDF
    An analysis of ribosomal protein pseudogenes in the four mammalian genomes reveals no correlation between number of pseudogenes and mRNA abundance

    Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment

    Get PDF
    BACKGROUND: This study analyzes the predictions of a number of promoter predictors on the ENCODE regions of the human genome as part of the ENCODE Genome Annotation Assessment Project (EGASP). The systems analyzed operate on various principles and we assessed the effectiveness of different conceptual strategies used to correlate produced promoter predictions with the manually annotated 5' gene ends. RESULTS: The predictions were assessed relative to the manual HAVANA annotation of the 5' gene ends. These 5' gene ends were used as the estimated reference transcription start sites. With the maximum allowed distance for predictions of 1,000 nucleotides from the reference transcription start sites, the sensitivity of predictors was in the range 32% to 56%, while the positive predictive value was in the range 79% to 93%. The average distance mismatch of predictions from the reference transcription start sites was in the range 259 to 305 nucleotides. At the same time, using transcription start site estimates from DBTSS and H-Invitational databases as promoter predictions, we obtained a sensitivity of 58%, a positive predictive value of 92%, and an average distance from the annotated transcription start sites of 117 nucleotides. In this experiment, the best performing promoter predictors were those that combined promoter prediction with gene prediction. The main reason for this is the reduced promoter search space that resulted in smaller numbers of false positive predictions. CONCLUSION: The main finding, now supported by comprehensive data, is that the accuracy of human promoter predictors for high-throughput annotation purposes can be significantly improved if promoter prediction is combined with gene prediction. Based on the lessons learned in this experiment, we propose a framework for the preparation of the next similar promoter prediction assessment

    RNAcentral : a hub of information for non-coding RNA sequences

    Get PDF
    RNAcentral is a comprehensive database of non-coding RNA (ncRNA) sequences, collating information on ncRNA sequences of all types from a broad range of organisms. We have recently added a new genome mapping pipeline that identifies genomic locations for ncRNA sequences in 296 species. We have also added several new types of functional annotations, such as tRNA secondary structures, Gene Ontology annotations, and miRNA-target interactions. A new quality control mechanism based on Rfam family assignments identifies potential contamination, incomplete sequences, and more. The RNAcentral database has become a vital component of many workflows in the RNA community, serving as both the primary source of sequence data for academic and commercial groups, as well as a source of stable accessions for the annotation of genomic and functional features. These examples are facilitated by an improved RNAcentral web interface, which features an updated genome browser, a new sequence feature viewer, and improved text search functionality. RNAcentral is freely available at https://rnacentral.org
    corecore