Search CORE

1,171 research outputs found

Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version

Author: Sahinalp S. Cenk
Salari Raheleh
Schönhuth Alexander
Publication venue
Publication date: 11/06/2010
Field of study

Although computationally aligning sequence is a crucial step in the vast majority of comparative genomics studies our understanding of alignment biases still needs to be improved. To infer true structural or homologous regions computational alignments need further evaluation. It has been shown that the accuracy of aligned positions can drop substantially in particular around gaps. Here we focus on re-evaluation of score-based alignments with affine gap penalty costs. We exploit their relationships with pair hidden Markov models and develop efficient algorithms by which to identify gaps which are significant in terms of length and multiplicity. We evaluate our statistics with respect to the well-established structural alignments from SABmark and find that indel reliability substantially increases with their significance in particular in worst-case twilight zone alignments. This points out that our statistics can reliably complement other methods which mostly focus on the reliability of match positions.Comment: 17 pages, 7 figure

arXiv.org e-Print Archive

CWI's Institutional Repository

Scaling Laws and Similarity Detection in Sequence Alignment with Gaps

Author: Drasdo Dirk
Hwa Terence
Lassig Michael
Publication venue
Publication date: 01/01/1998
Field of study

We study the problem of similarity detection by sequence alignment with gaps, using a recently established theoretical framework based on the morphology of alignment paths. Alignments of sequences without mutual correlations are found to have scale-invariant statistics. This is the basis for a scaling theory of alignments of correlated sequences. Using a simple Markov model of evolution, we generate sequences with well-defined mutual correlations and quantify the fidelity of an alignment in an unambiguous way. The scaling theory predicts the dependence of the fidelity on the alignment parameters and on the statistical evolution parameters characterizing the sequence correlations. Specific criteria for the optimal choice of alignment parameters emerge from this theory. The results are verified by extensive numerical simulations.Comment: 25 pages, 11 figure

arXiv.org e-Print Archive

CiteSeerX

CERN Document Server

Testing statistical significance scores of sequence comparison methods with structure similarity

Author: AA Schaffer
AD Kester
EV Kriventseva
G Salton
GA Price
HS Booth
J Park
Jack AM Leunissen
Jacob de Vlieg
JJ Codani
JP Comet
JT Reese
M Gribskov
O Bastien
P Agarwal
Peter MA Groenen
R Apweiler
RF Doolittle
S Henikoff
SE Brenner
SE Brenner
SF Altschul
T Hulsen
T Rognes
TF Smith
Tim Hulsen
WR Pearson
WR Pearson
WR Pearson
WR Pearson
Z Chen
Publication venue: BioMed Central
Publication date: 01/01/2006
Field of study

BACKGROUND: In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested if this could be validated when applied to existing, evolutionary related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Wageningen University & Research Publications

Radboud Repository

Island method for estimating the statistical significance of profile-profile alignment scores

Author: A Dembo
A Gambin
A Poleksic
A Poleksic
AG Murzin
Aleksandar Poleksic
D Fischer
D Przybylski
DA Debe
E Lindahl
EJ Gumbel
G Yona
H Pang
J Heringa
J Moult
J Söding
JF Collins
JF Lawless
K Ginalski
L Holm
L Rychlewski
L Rychlewski
M Frenkel-Morgenstern
MS Waterman
MS Waterman
O Bastien
O Bastien
R Mott
R Mott
R Olsen
RI Sadreyev
RI Sadreyev
S Karlin
S Karlin
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
T Hulsen
TF Smith
TF Smith
WR Pearson
WR Pearson
YK Yu
YK Yu
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive. Results We demonstrate that the background distribution of profile-profile alignment scores heavily depends on profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments. Conclusion The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Northern Iowa

A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Author: A Krogh
A Marchler-Bauer
A Milosavljević
A Pertsemlidis
AA Schäffer
AY Mitrophanov
BJ Webb
Burkhard Rost
C Barrett
C Webber
D Drasdo
D Metzler
D Siegmund
DJC MacKay
EJ Gumbel
EP Nawrocki
ET Jaynes
I Letunic
J Park
JD Storey
JF Lawless
JS Liu
K Karplus
K Karplus
K Sjölander
M Madera
MG Kann
MQ Zhang
MS Waterman
N Chia
P Bucher
R Bundschuh
R Durbin
R Mott
R Mott
R Mott
R Olsen
RC Edgar
RD Finn
S Johnson
S Karlin
S Karlin
S Miyazawa
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
SR Eddy
TF Smith
WR Pearson
Y-K Yu
Y-K Yu
Y-K Yu
Y-K Yu
Publication venue: Public Library of Science
Publication date: 01/05/2008
Field of study

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central