Testing statistical significance scores of sequence comparison methods with structure similarity

AA Schaffer; AD Kester; EV Kriventseva; G Salton; GA Price; HS Booth; J Park; Jack AM Leunissen; Jacob de Vlieg; JJ Codani; JP Comet; JT Reese; M Gribskov; O Bastien; P Agarwal; Peter MA Groenen; R Apweiler; RF Doolittle; S Henikoff; SE Brenner; SF Altschul; T Hulsen; T Rognes; TF Smith; Tim Hulsen; WR Pearson; Z Chen

Publication date: 1 January 2006
Publisher: BioMed Central

Abstract

BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing applied to an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.

RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as references. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular achieves very high scores.

CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
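To make the idea of a Monte-Carlo Z-score concrete, the following is a minimal Python sketch, not the implementation evaluated in the paper. It assumes a toy scoring scheme (fixed match/mismatch values and a linear gap penalty rather than a substitution matrix with affine gaps), shuffles only the subject sequence, and uses hypothetical function names (`smith_waterman_score`, `monte_carlo_z_score`) and parameters (`n_shuffles`, `seed`) chosen purely for illustration. The Z-score is the observed alignment score minus the mean score against shuffled sequences, divided by the standard deviation of those scores.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (toy scoring, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def monte_carlo_z_score(query, subject, n_shuffles=100, seed=0):
    """Z-score of the observed score against shuffled-subject scores.

    Assumption: only the subject is shuffled, which preserves its length
    and residue composition; other shuffling schemes are possible.
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(query, subject)
    letters = list(subject)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(smith_waterman_score(query, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: every shuffle scored the same.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd
```

Each Z-score in this sketch requires `n_shuffles` additional alignments per sequence pair, which illustrates why the abstract calls the Z-score compute-intensive compared with an e-value derived from a single alignment score.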