6 research outputs found

    A Protein Classification Benchmark collection for machine learning

    Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks: 3297 on protein sequences, 3095 on protein structures and 13 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparisons of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.

    Homology modelling and protein engineering strategy of subtilases, the family of subtilisin-like serine proteinases

    Subtilases are members of the family of subtilisin-like serine proteases. Presently, >50 subtilases are known, >40 of them with complete amino acid sequences. We have compared these sequences and the available three-dimensional structures (subtilisin BPN', subtilisin Carlsberg, thermitase and proteinase K). The mature enzymes contain up to 1775 residues, with N-terminal catalytic domains ranging from 268 to 511 residues, and signal and/or activation peptides ranging from 27 to 280 residues. Several members contain C-terminal extensions, relative to the subtilisins, which display additional properties such as sequence repeats, processing sites and membrane anchor segments. Multiple sequence alignment of the N-terminal catalytic domains allows the definition of two main classes of subtilases. A structurally conserved framework of 191 core residues has been defined from a comparison of the four known three-dimensional structures. Eighteen of these core residues are highly conserved, nine of which are glycines. While the α-helix and β-sheet secondary structure elements show considerable sequence homology, this is less so for the peptide loops that connect the core secondary structure elements. These loops can vary in length by >150 residues. While the core three-dimensional structure is conserved, insertions and deletions are preferentially confined to surface loops. From the known three-dimensional structures various predictions are made for the other subtilases concerning essential conserved residues, allowable amino acid substitutions, disulphide bonds, Ca²⁺-binding sites, substrate-binding site residues, ionic and aromatic interactions, proteolytically susceptible surface loops, etc. These predictions form a basis for protein engineering of members of the subtilase family for which no three-dimensional structure is known.