Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins.

By Norman E. Davey, Richard J. Edwards and Denis C. Shields


Background: large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable.

Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.

Results: a widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly.

Secondly, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs occurring with a score of p or less as being equal to p.

Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues).

Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.

Conclusions: a method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
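
The contrast between the equal-length heuristic and a per-protein calculation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the independent-sites model for a single protein, and the Poisson-binomial recursion are all assumptions made for illustration, and wildcard positions are ignored. The last function sketches the grouping idea, in which motifs that are permutations of the same residues share one probability and can be counted with a multinomial coefficient instead of being enumerated one by one.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb, factorial, prod


def per_protein_prob(motif, freqs, length):
    """P(motif occurs at least once in one protein).

    Assumption: start positions are treated as independent, which ignores
    overlapping occurrences. `freqs` maps residue -> background frequency.
    """
    p_site = prod(freqs[a] for a in motif)       # match probability at one start
    n_sites = max(length - len(motif) + 1, 0)    # number of possible starts
    return 1.0 - (1.0 - p_site) ** n_sites


def binomial_tail(p, n, k):
    """P(X >= k) when all n proteins share the same occurrence probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def poisson_binomial_tail(ps, k):
    """P(X >= k) when protein i has its own occurrence probability ps[i]."""
    dist = [1.0]  # dist[j] = P(exactly j of the proteins seen so far match)
    for p in ps:
        new = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            new[j] += d * (1 - p)      # this protein lacks the motif
            new[j + 1] += d * p        # this protein contains the motif
        dist = new
    return sum(dist[k:])


def composition_groups(freqs, k):
    """Yield (residues, site_probability, n_orderings) per residue multiset.

    Every ordering of the same k residues has the same site probability,
    so one calculation covers n_orderings distinct fixed-position motifs.
    """
    for combo in combinations_with_replacement(sorted(freqs), k):
        n_orderings = factorial(k)
        for count in Counter(combo).values():
            n_orderings //= factorial(count)
        yield combo, prod(freqs[a] for a in combo), n_orderings


if __name__ == "__main__":
    # Hypothetical toy background and protein lengths, for illustration only.
    freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
    lengths = [50, 200, 800]
    ps = [per_protein_prob("CCB", freqs, n) for n in lengths]
    mean_len = sum(lengths) // len(lengths)
    p_equal = per_protein_prob("CCB", freqs, mean_len)
    print("equal-length heuristic:", binomial_tail(p_equal, len(lengths), 2))
    print("per-protein calculation:", poisson_binomial_tail(ps, 2))
```

When every protein has the same occurrence probability the two tail functions agree; they diverge as lengths or compositions vary across the dataset, which is the kind of bias the abstract describes.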

Year: 2010
OAI identifier: oai:eprints.soton.ac.uk:142439
Provided by: e-Prints Soton
