
INRIA a CCSD electronic archive server

Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty

Author: A Agrawal
A Agrawal
AA Schäffer
AK Hartmann
Ankit Agrawal
AY Mitrophanov
CA Orengo
J Rocha
M Kschischo
M Pagni
ML Sierk
MS Waterman
P Bucher
PH Sellers
R Mott
R Mott
R Mott
R Olsen
RF Mott
S Grossmann
S Karlin
S Kotz
S Sheetlin
S Wolfsheimer
SE Brenner
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
TF Smith
WR Pearson
WR Pearson
WR Pearson
WR Pearson
WR Pearson
X Huang
X Huang
Xiaoqiu Huang
YK Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Accurate estimation of the statistical significance of a pairwise alignment is an important problem in sequence comparison. Recently, a comparative study of pairwise statistical significance with database statistical significance was conducted. In this paper, we extend the earlier work on pairwise statistical significance by incorporating the use of multiple parameter sets. Results: Results for a knowledge discovery application of homology detection reveal that using multiple parameter sets for pairwise statistical significance estimates gives better coverage than using a single parameter set, at least at some error levels. Further, the results of pairwise statistical significance using multiple parameter sets are shown to be significantly better than database statistical significance estimates reported by BLAST and PSI-BLAST, and comparable to, and at times significantly better than, those of SSEARCH. Using non-zero parameter set change penalty values gives better performance than a zero penalty. Conclusion: The fact that homology detection performance does not degrade when using multiple parameter sets is strong evidence for the validity of the assumption that the alignment score distribution follows an extreme value distribution even when using multiple parameter sets. The parameter set change penalty is a useful parameter for alignment using multiple parameter sets. Pairwise statistical significance using multiple parameter sets can be used effectively to determine the relatedness of one or a few pairs of sequences without performing a time-consuming database search.
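The idea of pairwise statistical significance described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses a single toy scoring scheme (not multiple parameter sets), generates the null score distribution by shuffling one sequence, and fits the extreme value (Gumbel) distribution by the method of moments. The function names, scoring values, and shuffle count are all assumptions made for illustration.

```python
import math
import random

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a toy linear-gap scoring scheme."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

def fit_gumbel_moments(scores):
    """Method-of-moments estimate of Gumbel scale (lam) and location (mu)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    if var == 0:
        raise ValueError("null scores have zero variance; cannot fit")
    lam = math.pi / math.sqrt(6.0 * var)
    mu = mean - 0.5772156649 / lam  # Euler-Mascheroni constant
    return lam, mu

def pairwise_p_value(a, b, n_shuffles=200, seed=0):
    """Score a vs. b, then estimate the P-value from shuffled copies of b."""
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves the composition and length of b
        null_scores.append(smith_waterman_score(a, "".join(letters)))
    lam, mu = fit_gumbel_moments(null_scores)
    # Gumbel survival function: P(S >= x) = 1 - exp(-exp(-lam * (x - mu)))
    p = -math.expm1(-math.exp(-lam * (observed - mu)))
    return observed, p
```

Because the null distribution is fitted to the specific sequence pair, no database search is needed; the cost is one alignment per shuffle.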

Digital Repository @ Iowa State University (ISU)

Island method for estimating the statistical significance of profile-profile alignment scores

Author: A Dembo
A Gambin
A Poleksic
A Poleksic
AG Murzin
Aleksandar Poleksic
D Fischer
D Przybylski
DA Debe
E Lindahl
EJ Gumbel
G Yona
H Pang
J Heringa
J Moult
J Söding
JF Collins
JF Lawless
K Ginalski
L Holm
L Rychlewski
L Rychlewski
M Frenkel-Morgenstern
MS Waterman
MS Waterman
O Bastien
O Bastien
R Mott
R Mott
R Olsen
RI Sadreyev
RI Sadreyev
S Karlin
S Karlin
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
T Hulsen
TF Smith
TF Smith
WR Pearson
WR Pearson
YK Yu
YK Yu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive. Results: We demonstrate that the background distribution of profile-profile alignment scores heavily depends on profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments. Conclusion: The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.

University of Northern Iowa

The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment

Author: Altschul
Altschul
Altschul
Bundschuh
Collins
Gotoh
Henikoff
J. L. Spouge
Karlin
Mott
Mott
Mott
Needleman
Robinson
S. Sheetlin
Smith
Smith
Storey
Waterman
Y. Park
Yu
Publication venue: Oxford University Press
Publication date: 06/09/2005

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
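To make the roles of λ and k concrete, the following Python sketch evaluates the standard Gumbel approximation for local alignment scores: the expected number of chance alignments scoring at least S between sequences of lengths m and n is E = k·m·n·exp(−λS), and the P-value is 1 − exp(−E). The function names are hypothetical, and the numeric values in the usage lines are illustrative placeholders rather than results from the paper.

```python
import math

def gumbel_e_value(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`."""
    return k * m * n * math.exp(-lam * score)

def gumbel_p_value(score, m, n, lam, k):
    """P(S >= score) = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-gumbel_e_value(score, m, n, lam, k))

# Illustrative parameter values only (assumed, not taken from the paper).
lam, k = 0.267, 0.041
for s in (40, 60, 80):
    e = gumbel_e_value(s, 140, 140, lam, k)
    print(s, e, gumbel_p_value(s, 140, 140, lam, k))
```

Because S enters through exp(−λS), a small relative error in λ shifts the E-value far more than the same relative error in k, which is consistent with the tighter error tolerance quoted for λ (0.8%) than for k (10%).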

Public Library of Science (PLOS)

A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Author: A Krogh
A Marchler-Bauer
A Milosavljević
A Pertsemlidis
AA Schäffer
AY Mitrophanov
BJ Webb
Burkhard Rost
C Barrett
C Webber
D Drasdo
D Metzler
D Siegmund
DJC MacKay
EJ Gumbel
EP Nawrocki
ET Jaynes
I Letunic
J Park
JD Storey
JF Lawless
JS Liu
K Karplus
K Karplus
K Sjölander
M Madera
MG Kann
MQ Zhang
MS Waterman
N Chia
P Bucher
R Bundschuh
R Durbin
R Mott
R Mott
R Mott
R Olsen
RC Edgar
RD Finn
S Johnson
S Karlin
S Karlin
S Miyazawa
Sean R. Eddy
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SF Altschul
SR Eddy
SR Eddy
TF Smith
WR Pearson
Y-K Yu
Y-K Yu
Y-K Yu
Y-K Yu
Publication venue: Public Library of Science
Publication date: 01/05/2008

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments
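The two conjectured distributions in this abstract can be written down directly. The Python sketch below assumes bit scores with λ = log 2, a Gumbel form for Viterbi scores, and an exponential high-scoring tail for Forward scores. The location parameters `mu` and `tau` and all function names are hypothetical; in practice the locations would have to be fitted for each model, which this sketch does not attempt.

```python
import math

LAMBDA = math.log(2)  # conjectured constant slope for bit scores

def viterbi_p_value(bits, mu):
    """Gumbel tail: P(S > x) = 1 - exp(-exp(-lambda * (x - mu)))."""
    return -math.expm1(-math.exp(-LAMBDA * (bits - mu)))

def forward_tail_p_value(bits, tau):
    """Exponential tail: P(S > x) ~ exp(-lambda * (x - tau)), capped at 1."""
    return min(1.0, math.exp(-LAMBDA * (bits - tau)))

def e_value(p_value, n_targets):
    """Expected number of false hits in a search over n_targets sequences."""
    return p_value * n_targets
```

With λ = log 2, each additional bit of score halves the tail probability, so only the location parameter needs to be determined for a given model.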

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Author: A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Agrawal
A Mitrophanov
A Poleksic
A Samuel
AA Schäffer
Alok Choudhary
Ankit Agrawal
C Camacho
D Honbo
DS Roos
L Ligowski
M Pagni
M Waterman
Md Mostofa Ali Patwary
ML Sierk
ML Sierk
NVIDIA
NVIDIA
P Aleksandar
R Mott
R O
S Altschul
S Karlin
S Manavski
S Ryoo
S Yooseph
S Zuyderduyn
Sanchit Misra
SF Altschul
SR Eddy
T Rognes
T Smith
W Liu
W Pearson
W Pearson
Wei-keng Liao
WR Pearson
Y Liu
Y Liu
Y Yu
Y Yu
Y Zhang
Y Zhang
Yuhong Zhang
Zhiguang Qin
Publication venue: BioMed Central
Publication date: 01/01/2012

Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail

Author: A Dembo
A Dieker
A García
A Gelman
A Hartmann
A Raftery
A Robinson
Alexander K Hartmann
B Efron
Bernd Burghardt
C Fraser
C Geyer
C Geyer
D Earl
D Metzler
D Siegmund
E Gumbel
E Marinari
H Katzgraber
J Liu
J Liu
K Hukushima
M Cowles
M Dayhoff
M Kschischo
M Körner
O Gotoh
P Sellers
R Arratia
R Olsen
R Schwartz
R Zhou
R Zhou
R Zhou
R Zhou
S Altschul
S Altschul
S Altschul
S Altschul
S Auer
S Brooks
S Brown
S Henikoff
S Karlin
S Karlin
S Rashidi
SB Needleman
Stefan Wolfsheimer
T Hwa
TF Smith
W Wilbur
WK Hastings
X Meng
Y Yu
Y Yu
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case, where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant. Results: We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which has been outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution, which is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was just applied to one single example case. Therefore, we consider here a large set of parameters: We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we study also the distribution of the sum statistics of the k best alignments.
Conclusion: Our results show that the statistics of gapped and ungapped local alignments deviates significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein databases. The case of sum statistics of k best alignments is included.

RSEARCH: Finding homologs of single structured RNA sequences

Author: Eddy Sean R
Klein Robert J
Publication venue: BioMed Central
Publication date: 01/01/2003

BACKGROUND: For many RNA molecules, secondary structure rather than primary sequence is the evolutionarily conserved feature. No programs have yet been published that allow searching a sequence database for homologs of a single RNA molecule on the basis of secondary structure. RESULTS: We have developed a program, RSEARCH, that takes a single RNA sequence with its secondary structure and utilizes a local alignment algorithm to search a database for homologous RNAs. For this purpose, we have developed a series of base pair and single nucleotide substitution matrices for RNA sequences called RIBOSUM matrices. RSEARCH reports the statistical confidence for each hit as well as the structural alignment of the hit. We show several examples in which RSEARCH outperforms the primary sequence search programs BLAST and SSEARCH. The primary drawback of the program is that it is slow. The C code for RSEARCH is freely available from our lab's website. CONCLUSION: RSEARCH outperforms primary sequence programs in finding homologs of structured RNA sequences

Digital Commons@Becker

Back-translation for discovering distant protein homologies in the presence of frameshift mutations

Author: Gîrdea Marta
Kucherov Gregory
Noé Laurent
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. Results: We developed a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a powerful scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/. Conclusions: Our approach makes it possible to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.

CiteSeerX

HAL - Lille 3

INRIA a CCSD electronic archive server