7 research outputs found

    The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment

    The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
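    As a minimal sketch of how the two Gumbel parameters are used once they are known, the snippet below evaluates the standard Karlin-Altschul form E = k·m·n·exp(−λS) and the tail probability P = 1 − exp(−E). The function names and the numeric values of λ and k are illustrative assumptions, not values or code taken from the paper.

```python
import math

def gumbel_e_value(score, m, n, lam, k):
    """Expected number of chance alignments scoring at least `score`
    for sequences of lengths m and n: E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)

def gumbel_p_value(score, m, n, lam, k):
    """Tail probability P(S >= score) = 1 - exp(-E).
    expm1 keeps precision when E is very small."""
    return -math.expm1(-gumbel_e_value(score, m, n, lam, k))

# Placeholder parameters (assumptions for illustration only).
LAM, K = 0.267, 0.041

e = gumbel_e_value(60, m=140, n=140, lam=LAM, k=K)
p = gumbel_p_value(60, m=140, n=140, lam=LAM, k=K)
print(f"E = {e:.3g}, P = {p:.3g}")
```

    Because the score enters through exp(−λS) while k only scales the result linearly, an error in λ is amplified far more than the same relative error in k, which fits the tighter tolerance the abstract quotes for λ.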

    Where Does the Alignment Score Distribution Shape Come from?

    Alignment algorithms are powerful tools for searching for homologous proteins in databases, providing a score for each sequence present in the database. It has been well known for 20 years that the shape of the score distribution looks like an extreme value distribution. The extremely large number of times biologists face this class of distributions raises the question of the evolutionary origin of this probability law.

    A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

    Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
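    The two conjectured forms can be written down directly. The sketch below assumes a Gumbel survival function for Viterbi bit scores and an exponential tail for Forward bit scores, both with λ = log 2. The location parameters `mu` and `tau` are hypothetical names for quantities that would still have to be fitted for each model; they are not specified in the abstract.

```python
import math

LAMBDA = math.log(2)  # conjectured constant slope for bit scores

def viterbi_p_value(bits, mu):
    """Gumbel tail P(S > bits) = 1 - exp(-exp(-LAMBDA * (bits - mu))).
    `mu` is an assumed, separately fitted location parameter."""
    return -math.expm1(-math.exp(-LAMBDA * (bits - mu)))

def forward_p_value(bits, tau):
    """Exponential tail P(F > bits) = exp(-LAMBDA * (bits - tau)),
    clamped to 1 below the assumed tail offset `tau`."""
    return min(1.0, math.exp(-LAMBDA * (bits - tau)))

def e_value(p_value, n_targets):
    """Expected number of chance hits in a search of n_targets sequences."""
    return p_value * n_targets

# With LAMBDA = log 2, each extra bit halves the Forward tail probability.
print(forward_p_value(11.0, tau=10.0))
print(forward_p_value(12.0, tau=10.0))
```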

    Sequence-specific sequence comparison using pairwise statistical significance

    Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix consisting of pairwise scores for aligning different residues with each other (like BLOSUM62), and give an alignment score for the given sequence pair. Biologists routinely use such pairwise alignment programs to identify similar, or more specifically, related sequences (having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This research addresses the problem of accurately estimating the statistical significance of pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research work are as follows. Firstly, sequence-specific strategies for pairwise sequence alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, and accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Secondly, pairwise statistical significance is used to improve the performance of the most popular database search program, PSI-BLAST. Thirdly, heuristics are designed and implemented to speed up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online. With the all-pervasive application of sequence alignment methods in bioinformatics using the ever-increasing sequence data, this work is expected to offer useful contributions to the research community.
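    To make the notion of an alignment score concrete, here is a minimal sketch of a Smith-Waterman local alignment score with a linear gap penalty. The choice of local alignment, the toy match/mismatch scores standing in for a real substitution matrix such as BLOSUM62, and the gap cost are all assumptions made for illustration; the thesis's own methods are not reproduced here.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of strings a and b.

    Uses a toy match/mismatch scheme and a linear gap penalty; a real
    application would look up residue pairs in a substitution matrix.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

print(local_alignment_score("ACGT", "TTACGTTT"))
```

    A raw score like this is what significance estimation then converts into a P-value or E-value.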

    Generalization of predicates with string arguments

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2002. Thesis (Master's), Bilkent University, 2002. Includes bibliographical references (leaves 60-63). Canıtezer, Göker. M.S.
    String/sequence generalization is used in many different areas such as machine learning, example-based machine translation and DNA sequence alignment. In this thesis, a method is proposed to find the generalizations of predicates with string arguments from given examples. Learning from examples is a very hard problem in machine learning, because finding the globally optimal point at which to stop generalization is difficult and time-consuming. All previous work employs a heuristic to find the best solution, and this work does the same. In this study, some restrictions applied by the SLGG (Specific Least General Generalization) algorithm, which was developed for use in an example-based machine translation system, are relaxed to find all possible alignments of two strings. Moreover, a Euclidean-distance-like scoring mechanism is used to find the most specific generalizations. Some of the generated templates are eliminated by four different selection/filtering approaches to get a good solution set. Finally, the result set is presented as a decision list, which allows exceptional cases to be handled.

    Rapid Significance Estimation in Local Sequence Alignment with Gaps

    In order to assess the significance of sequence alignments it is crucial to know the distribution of alignment scores of pairs of random sequences. For gapped local alignment it is empirically known that the shape of this distribution is of the Gumbel form. However, determining the parameters of this distribution is computationally very expensive. We present a new algorithmic approach that estimates the more important of the Gumbel parameters at least five times faster than the traditional methods. Actual runtimes of our algorithm, ranging from less than a second to a few minutes on a workstation, bring significance estimation into the realm of interactive applications.

    Text-basierte Ähnlichkeitssuche zur Treffer- und Leitstruktur-Identifizierung

    This work investigated the applicability of global pairwise sequence alignment to the detection of functional analogues in virtual screening. This variant of sequence comparison was developed for the identification of homologous proteins based on amino acid or nucleotide sequences. Because of the significant differences between biopolymers and small molecules, several aspects of this approach to sequence comparison had to be adapted. All proposed concepts were implemented as the ‘Pharmacophore Alignment Search Tool’ (PhAST) and evaluated in retrospective experiments on the COBRA dataset in version 6.1. The aim of identifying functional analogues made it necessary to identify and classify functional properties in molecular structures. This was realized by fragment-based atom-typing, where one out of nine functional properties was assigned to each non-hydrogen atom in a structure. These properties were pre-assigned to atoms in the fragments. Whenever a fragment matched a substructure in a molecule, the assigned properties were transferred from fragment atoms to structure atoms. Each functional property was represented by exactly one symbol. Unlike amino acid or nucleotide sequences, small drug-like molecules contain branches and cycles. This was a major obstacle in the application of sequence alignment to virtual screening, since this technique can only be applied to linear sequences of symbols. The best linearization technique was shown to be Minimum Volume Embedding. To the best of our knowledge, this work represents the first application of dimensionality reduction to graph linearization. Sequence alignment relies on a scoring system that rates symbol equivalences (matches) and differences (mismatches) based on functional properties that correspond to rated symbols. Existing scoring schemes are applicable only to amino acids and nucleotides. In this work, scoring schemes for functional properties in drug-like molecules were developed based on property frequencies and isofunctionality judged from chemical experience, pairwise sequence alignments, pairwise kernel-based assignments and stochastic optimization. The scoring system based on property frequencies and isofunctionality proved to be the most powerful (measured in enrichment capability). All developed scoring systems outperformed simple scoring approaches that rate matches and mismatches uniformly. The frameworks proposed for score calculations can be used to guide modifications to the atom-typing in promising directions. The scoring system was further modified to allow for emphasis on particular symbols in a sequence. It was proven that applying weights to symbols that correspond to key interaction points important to receptor-ligand interaction significantly improves the screening capabilities of PhAST. It was demonstrated that the systematic application of weights to all sequence positions in retrospective experiments can be used for pharmacophore elucidation. A scoring system based on structural instead of functional similarity was investigated and found to be suitable for similarity searches in shape-constrained datasets. Three methods for similarity assessment based on alignments were evaluated: sequence identity, alignment score and significance. PhAST achieved significantly higher enrichment with alignment scores than with sequence identity. p-values as significance estimates were calculated with a combination of Markov chain Monte Carlo simulation and importance sampling. p-values were adapted to library size in a Bonferroni correction, yielding E-values. A significance threshold of an E-value of 1 × 10⁻⁵ was proposed for application in prospective screenings. PhAST was compared to state-of-the-art methods for virtual screening. The unweighted version was shown to exhibit comparable enrichment capabilities.
Compound rankings obtained with PhAST were proven to be complementary to those of other methods. The application to three-dimensional instead of two-dimensional molecular representations resulted in altered compound rankings without increased enrichment. PhAST was employed in two prospective applications. A screening for non-nucleoside analogue inhibitors of bacterial thymidine kinase yielded a hit with a distinct structural framework but only weak activity. The search for drugs that are not members of the NSAID (non-steroidal anti-inflammatory drug) class as modulators of gamma-secretase resulted in a potent modulator with clear structural distinction from the reference compound. The calculation of significance estimates, the emphasis on key interactions, the pharmacophore elucidation capabilities and the unique compound rankings set PhAST apart from other screening techniques.
This work investigated the applicability of pairwise global sequence alignment to the problem of molecule comparison in virtual screening, a subfield of computer-based drug development. Sequence alignment was developed to identify homologous proteins. Until now it has been applied only to sequences of amino acids or nucleotides. Because of the differences between biopolymers and drug-like molecules, this approach to sequence comparison was modified and adapted to the specific problem. All methods presented and investigated were implemented under the name ‘Pharmacophore Alignment Search Tool’ (PhAST). The goal in developing PhAST was to compute the functional similarity between molecules. This required an approach that assigns functional properties to the atoms of a molecule, which was realized by fragment-based atom typing. Properties were assigned, to the best of the author's knowledge and judgment, to the atoms of a collection of predefined fragments.
Whenever one of the fragments occurred as a substructure of a molecule, the atom typings were transferred from that fragment to the atoms of the molecule. In total, PhAST distinguishes nine functional properties and their combinations, with exactly one symbol assigned to each atom type. In contrast to sequences of amino acids and nucleotides, drug-like molecules are branched, undirected and contain ring closures. Sequence alignment, however, is applicable only to linear sequences. Consequently, molecules with their functional properties first had to be stored in a linearized form. Minimum Volume Embedding was shown to be the best-performing of the linearization methods tested. To the best of the author's knowledge, this work is the first to apply dimensionality reduction methods to the problem of canonical indexing of graphs. Computing sequence alignments requires a scoring system for equivalences and differences between symbols. Existing systems are applicable only to amino acids and nucleotides. Therefore, a scoring system for atom properties was developed from chemical intuition, along with three automated methods for computing such systems. The scoring scheme built from chemical intuition was found to be significantly superior to the others. The flexibility of the scoring system in global sequence alignment made it possible to give greater influence on the computed alignments to symbols known to represent essential interactions in receptor-ligand interaction. This weighting was shown to significantly increase the screening capabilities of PhAST.
It was further shown that, by systematically applying weights to all sequence positions, PhAST was able to identify interactions essential for receptor-ligand interaction. This required, however, that a suitable dataset of active and inactive compounds be available. This work evaluated several methods for computing similarities from alignments: sequence identity, alignment score and p-values. The alignment score was shown to be significantly superior to sequence identity for use in PhAST. To compute p-values for determining the significance of alignments, the distribution of alignment scores for random sequences of given lengths first had to be determined. This was done with a combination of Markov chain Monte Carlo simulation and importance sampling. The computed p-values were subjected to a Bonferroni correction and thereby converted to E-values, taking into account the total number of molecules compared in the virtual screening. As a result of this work, an E-value of 1 × 10⁻⁵ is proposed as a threshold, with alignments with lower E-values to be regarded as significant. PhAST was compared with other virtual screening methods in retrospective screenings. Its ability to identify functional analogues was shown to be comparable or even superior to that of the best other methods. Compound collections ranked by the similarities PhAST computes were shown to differ from the rankings of other methods. In this work, PhAST was successfully used in two prospective applications. A weak inhibitor of bacterial thymidine kinase that is not a nucleoside analogue was identified.
In a screening for modulators of gamma-secretase, a potent molecule was identified that showed clear structural differences from the reference used.
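    The Bonferroni step described in the abstract above amounts to multiplying a p-value by the number of compounds compared and checking the result against the proposed threshold. The sketch below illustrates only that arithmetic; the function names and the example library size are hypothetical, and the Monte Carlo estimation of the p-values themselves is not shown.

```python
E_THRESHOLD = 1e-5  # significance threshold proposed in the abstract

def bonferroni_e_value(p_value, library_size):
    """Bonferroni-corrected E-value: p-value scaled by the number of
    molecules compared in the screening."""
    return p_value * library_size

def is_significant(p_value, library_size, threshold=E_THRESHOLD):
    """True if the corrected E-value falls below the threshold."""
    return bonferroni_e_value(p_value, library_size) < threshold

# Hypothetical screening library of 10,000 compounds.
print(is_significant(1e-10, 10_000))
print(is_significant(1e-8, 10_000))
```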