
    MIPS bacterial genomes functional annotation benchmark dataset.

    Motivation: Developing new methods for automatic functional annotation of proteins from their sequences requires high-quality benchmark data, as well as tedious preparatory work to generate the sequence parameters that machine learning methods take as input. Different program settings and incompatible protocols make it difficult to compare the methods under analysis. Results: The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). The resource includes precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that can be used to develop and benchmark methods for functional annotation of bacterial protein sequences. The data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation.

    High specificity automatic function assignment for enzyme sequences

    The number of protein sequences deposited in databases is growing rapidly as a result of large-scale, high-throughput genome sequencing efforts. A large proportion of these sequences have no experimentally determined structure, and relatively few have high-quality, specific, experimentally determined functions. Because of the time, cost and technical complexity of experimental procedures for determining protein function, this situation is unlikely to change in the near future. One of the major challenges for bioinformatics is therefore to automatically assign highly accurate, high-specificity functional information to these uncharacterised protein sequences. As yet, this problem has not been solved to a level of accuracy and reliability that is acceptable as a basis for detailed biological analysis on a genome-wide, automated, high-throughput scale. This thesis aims to address this shortfall by providing and benchmarking methods that can be used to improve the accuracy of high-specificity protein function prediction from enzyme sequences. The datasets used in these studies are multiple alignments of evolutionarily related protein sequences, identified through BLAST sequence database searches. Firstly, a number of non-standard amino acid substitution matrices were used to re-score the benchmark multiple sequence alignments. A subset of these matrices was shown to improve the accuracy of specific function annotation when compared to both the original BLAST sequence similarity ordering and a random sequence selection model. Following this, two established methods for identifying functional specificity determining residues (fSDRs) were used to identify regions within the aligned sequences that are functionally and phylogenetically informative. These localised sequence regions were then used to re-score the aligned sequences, and their ability to improve the specific functional annotation of the benchmark sequence sets was assessed. Finally, a machine learning approach (support vector machines) was used to evaluate whether fSDRs that improve annotation accuracy can be identified directly from alignments of closely related protein sequences, without prior knowledge of their specific functional sub-types. The performance of this SVM-based method was then assessed by applying it to the automatic functional assignment of a number of well-studied classes of enzymes.
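
The re-scoring step described in the abstract above can be illustrated with a minimal sketch. It assumes that each hit is already aligned to the query (equal-length strings with `-` for gaps) and that re-scoring means summing per-column substitution scores and re-ordering the hits by that sum. The matrix values, gap penalty and function names are hypothetical placeholders for illustration; they are not the matrices or scoring scheme used in the thesis.

```python
from typing import Dict, List, Tuple

# Toy symmetric substitution scores (hypothetical values, not a published matrix).
TOY_MATRIX: Dict[Tuple[str, str], int] = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "S"): 1,
    ("G", "G"): 6, ("G", "S"): 0,
    ("S", "S"): 4,
}
GAP_PENALTY = -2        # assumed flat penalty for a residue aligned to a gap
DEFAULT_MISMATCH = -1   # assumed score for pairs missing from the matrix


def pair_score(a: str, b: str, matrix: Dict[Tuple[str, str], int]) -> int:
    """Score one alignment column; a column with gaps in both rows scores 0."""
    if a == "-" or b == "-":
        return 0 if a == b else GAP_PENALTY
    # The matrix stores each unordered pair once, so try both orientations.
    return matrix.get((a, b), matrix.get((b, a), DEFAULT_MISMATCH))


def rescore(query: str, hit: str, matrix: Dict[Tuple[str, str], int]) -> int:
    """Sum column scores for a hit that is already aligned to the query."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(query, hit))


def rerank(query: str, hits: Dict[str, str],
           matrix: Dict[Tuple[str, str], int]) -> List[Tuple[str, int]]:
    """Return (name, score) pairs ordered from best to worst re-score."""
    scored = [(name, rescore(query, seq, matrix)) for name, seq in hits.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    aligned_hits = {"hit1": "AGSA", "hit2": "AG-S", "hit3": "SSSS"}
    for name, score in rerank("AGSA", aligned_hits, TOY_MATRIX):
        print(name, score)
```

Swapping in a different matrix only changes the `matrix` argument, which is what makes it straightforward to compare the resulting order against the original BLAST ordering.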