
    PROMPT: a protein mapping and comparison tool

    BACKGROUND: Comparison of large protein datasets has become a standard task in bioinformatics. Typically researchers wish to know whether one group of proteins is enriched in certain annotation attributes or sequence properties compared to another group, and whether this enrichment is statistically significant. In order to conduct such comparisons it is often required to integrate molecular sequence data and experimental information from disparate, incompatible sources. While many specialized programs exist for comparisons of this kind in individual problem domains, such as expression data analysis, no generic software solution capable of addressing a wide spectrum of routine tasks in comparative proteomics is currently available. RESULTS: PROMPT is a comprehensive bioinformatics software environment which enables the user to compare arbitrary protein sequence sets, revealing statistically significant differences in their annotation features. It allows automatic retrieval and integration of data from a multitude of molecular biological databases as well as from a custom XML format. Similarity-based mapping of sequence IDs makes it possible to link experimental information obtained from different sources despite discrepancies in gene identifiers and minor sequence variation. PROMPT provides a full set of statistical procedures to address the following four use cases: i) comparison of the frequencies of categorical annotations between two sets, ii) enrichment of nominal features in one set with respect to another one, iii) comparison of numeric distributions, and iv) correlation of numeric variables. Analysis results can be visualized in the form of plots and spreadsheets and exported in various formats, including Microsoft Excel. CONCLUSION: PROMPT is a versatile, platform-independent, easily expandable, stand-alone application designed to be a practical workhorse in analysing and mining protein sequences and associated annotation. The availability of the Java Application Programming Interface and scripting capabilities on one hand, and the intuitive Graphical User Interface with context-sensitive help system on the other, make it equally accessible to professional bioinformaticians and biologically-oriented users. PROMPT is freely available for academic users from

    CodingQuarry: Highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts

    Background: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes or to significantly improve ab initio predictions. Generalised hidden Markov models (GHMM) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study. Results: CodingQuarry is a highly accurate, self-training GHMM fungal gene predictor designed to work with assembled, aligned RNA-seq transcripts. RNA-seq data informs annotations both during gene-model training and in prediction. Our approach capitalises on the high quality of fungal transcript assemblies by incorporating predictions made directly from transcript sequences. Correct predictions are made despite transcript assembly problems, including those caused by overlap between the transcripts of adjacent gene loci. Stringent benchmarking against high-confidence annotation subsets showed CodingQuarry predicted 91.3% of Schizosaccharomyces pombe genes and 90.4% of Saccharomyces cerevisiae genes perfectly. These results are 4-5% better than those of AUGUSTUS, the next best performing RNA-seq driven gene predictor tested. Comparisons against whole genome Sc. pombe and S. cerevisiae annotations further substantiate a 4-5% improvement in the number of correctly predicted genes. Conclusions: We demonstrate the success of a novel method of incorporating RNA-seq data into GHMM fungal gene prediction. This shows that a high quality annotation can be achieved without relying on protein homology or a training set of genes. CodingQuarry is freely available (https://sourceforge.net/projects/codingquarry/), and suitable for incorporation into genome annotation pipelines.

    Genomic survey of the non-cultivatable opportunistic human pathogen, Enterocytozoon bieneusi

    © 2009 The Authors. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in PLoS Pathogens 5 (2009): e1000261, doi:10.1371/journal.ppat.1000261.
    Enterocytozoon bieneusi is the most common microsporidian associated with human disease, particularly in the immunocompromised population. In the setting of HIV infection, it is associated with diarrhea and wasting syndrome. Like all microsporidia, E. bieneusi is an obligate, intracellular parasite, but unlike others, it is in direct contact with the host cell cytoplasm. Studies of E. bieneusi have been greatly limited due to the absence of genomic data and lack of a robust cultivation system. Here, we present the first large-scale genomic dataset for E. bieneusi. Approximately 3.86 Mb of unique sequence was generated by paired end Sanger sequencing, representing about 64% of the estimated 6 Mb genome. A total of 3,804 genes were identified in E. bieneusi, of which 1,702 encode proteins with assigned functions. Of these, 653 are homologs of Encephalitozoon cuniculi proteins. Only one E. bieneusi protein with assigned function had no E. cuniculi homolog. The shared proteins were, in general, evenly distributed among the functional categories, with the exception of a dearth of genes encoding proteins associated with pathways for fatty acid and core carbon metabolism. Short intergenic regions, high gene density, and shortened protein-coding sequences were observed in the E. bieneusi genome, all traits consistent with genomic compaction. Our findings suggest that E. bieneusi is a likely model for extreme genome reduction and host dependence.
    This research was supported by National Institutes of Health (NIH) grants R21 AI064118 (DEA) and R21 AI52792 (ST). HGM was supported in part by NIH contracts HHSN266200400041C and HHSN2662004037C (Bioinformatics Resource Centers) and by the G. Unger Vetlesen Foundation.