490 research outputs found

    PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments

    Get PDF
    PAL2NAL is a web server that constructs a multiple codon alignment from the corresponding aligned protein sequences. Such codon alignments can be used to evaluate the type and rate of nucleotide substitutions in coding DNA for a wide range of evolutionary analyses, such as the identification of levels of selective constraint acting on genes, or to perform DNA-based phylogenetic studies. The server takes a protein sequence alignment and the corresponding DNA sequences as input. In contrast to other existing applications, this server is able to construct codon alignments even if the input DNA sequence has mismatches with the input protein sequence, or contains untranslated regions and polyA tails. The server can also deal with frame shifts and inframe stop codons in the input models, and is thus suitable for the analysis of pseudogenes. Another distinct feature is that the user can specify a subregion of the input alignment in order to specifically analyze functional domains or exons of interest. The PAL2NAL server is available at

    Extraction of Transcript Diversity from Scientific Literature

    Get PDF
    Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/

    Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation.</p> <p>Results</p> <p>We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124).</p> <p>Conclusion</p> <p>Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.</p

    The PAM domain, a multi-protein complex-associated module with an all-alpha-helix fold

    Get PDF
    BACKGROUND: Multimeric protein complexes have a role in many cellular pathways and are highly interconnected with various other proteins. The characterization of their domain composition and organization provides useful information on the specific role of each region of their sequence. RESULTS: We identified a new module, the PAM domain (PCI/PINT associated module), present in single subunits of well characterized multiprotein complexes, like the regulatory lid of the 26S proteasome, the COP-9 signalosome and the Sac3-Thp1 complex. This module is an around 200 residue long domain with a predicted TPR-like all-alpha-helical fold. CONCLUSIONS: The occurrence of the PAM domain in specific subunits of multimeric protein complexes, together with the role of other all-alpha-helical folds in protein-protein interactions, suggest a function for this domain in mediating transient binding to diverse target proteins

    Annotation of the M. tuberculosis Hypothetical Orfeome: Adding Functional Information to More than Half of the Uncharacterized Proteins

    Get PDF
    The genome of Mycobacterium tuberculosis (H37Rv) contains 4,019 protein coding genes, of which more than thousand have been categorized as ‘hypothetical’ implying that for these not even weak functional associations could be identified so far. We here predict reliable functional indications for half of this large hypothetical orfeome: 497 genes can be annotated based on orthology, and another 125 can be linked to interacting proteins via integrated genomic context analysis and literature mining. The assignments include newly identified clusters of interacting proteins, hypothetical genes that are associated to well known pathways and putative disease-relevant targets. All together, we have raised the fraction of the proteome with at least some functional annotation to 88% which should considerably enhance the interpretation of large-scale experiments targeting this medically important organism

    Universally distributed single-copy genes indicate a constant rate of horizontal transfer

    Get PDF
    Single copy genes, universally distributed across the three domains of life and encoding mostly ancient parts of the translation machinery, are thought to be only rarely subjected to horizontal gene transfer (HGT). Indeed it has been proposed to have occurred in only a few genes and implies a rare, probably not advantageous event in which an ortholog displaces the original gene and has to function in a foreign context (orthologous gene displacement, OGD). Here, we have utilised an automatic method to identify HGT based on a conservative statistical approach capable of robustly assigning both donors and acceptors. Applied to 40 universally single copy genes we found that as many as 68 HGTs (implying OGDs) have occurred in these genes with a rate of 1.7 per family since the last universal common ancestor (LUCA). We examined a number of factors that have been claimed to be fundamental to HGT in general and tested their validity in the subset of universally distributed single copy genes. We found that differing functional constraints impact rates of OGD and the more evolutionarily distant the donor and acceptor, the less likely an OGD is to occur. Furthermore, species with larger genomes are more likely to be subjected to OGD. Most importantly, regardless of the trends above, the number of OGDs increases linearly with time, indicating a neutral, constant rate. This suggests that levels of HGT above this rate may be indicative of positively selected transfers that may allow niche adaptation or bestow other benefits to the recipient organism

    Protein coding potential of retroviruses and other transposable elements in vertebrate genomes

    Get PDF
    We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs; which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions
    corecore