Prediction of Protein Function Using Statistically Significant Sub-Structure Discovery

By Craig Lucas


Proteins perform a vast number of functional roles. The number of protein structures available for analysis continues to grow and, with the development of methods that predict protein structure directly from genetic sequence without imaging technology, the number of structures with unknown function is likely to increase. Computational methods for predicting the function of protein structures are therefore desirable. Several existing systems attempt to assign function, but their use is inadvisable without human intervention. Methods that search proteins of shared function for a shared structural feature are often limited in ways that are counterproductive to a general discovery solution. Assigning accurate scores to significant sub-structures also remains an area of development. A method is presented that can find common sub-structures between multiple proteins, without the size or structural limitations of existing discovery methods. A novel measure for assigning statistical significance is also presented. These methods are tested on artificially generated and real protein data to demonstrate their ability to discover statistically significant sub-structures. With a database of such sub-structures, it is then shown that the function of a new protein can be predicted from the presence of the discovered significant patterns.

Publisher: School of Computing (Leeds)
Year: 2007

