6 research outputs found

    Counting approximately-shortest paths in directed acyclic graphs

    Given a directed acyclic graph with positive edge weights, two vertices s and t, and a threshold weight L, we present a fully polynomial-time approximation scheme for the problem of counting the s-t paths of length at most L. We extend the algorithm to the case of two (or more) instances of the same problem. That is, given two graphs that have the same vertices and edges and differ only in edge weights, and given two threshold weights L_1 and L_2, we show how to approximately count the s-t paths that have length at most L_1 in the first graph and length at most L_2 in the second graph. We believe that our algorithms should find application in counting approximate solutions of related optimization problems, where finding an (optimum) solution can be reduced to the computation of a shortest path in a purpose-built auxiliary graph.
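To make the counting problem concrete, the sketch below counts s-t paths of length at most L exactly, by tracking for each vertex how many paths reach it at each total weight. It is not the approximation scheme described in the abstract: it is an exact baseline whose table can grow very large (it is pseudo-polynomial only when the weights are small integers), which is the cost an approximation scheme avoids. The function name `count_paths_within` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+


def count_paths_within(edges, s, t, limit):
    """Exactly count s-t paths of total weight <= limit in a weighted DAG.

    edges: iterable of (u, v, weight) triples with positive weights.
    """
    succ = defaultdict(list)
    preds = defaultdict(set)
    nodes = {s, t}
    for u, v, w in edges:
        succ[u].append((v, w))
        preds[v].add(u)
        nodes.update((u, v))

    # Visit vertices so that every predecessor is processed first.
    order = TopologicalSorter({n: preds[n] for n in nodes}).static_order()

    # counts[v][d] = number of s-v paths with total weight exactly d
    counts = {n: defaultdict(int) for n in nodes}
    counts[s][0] = 1
    for u in order:
        for d, c in list(counts[u].items()):
            for v, w in succ[u]:
                if d + w <= limit:  # prune paths already over the threshold
                    counts[v][d + w] += c
    return sum(counts[t].values())


if __name__ == "__main__":
    demo = [("s", "a", 1), ("s", "b", 2), ("a", "t", 1), ("b", "t", 1), ("a", "b", 1)]
    print(count_paths_within(demo, "s", "t", 3))
```

In the demo graph the three s-t paths have weights 2, 3 and 3, so the count depends on where the threshold falls.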

    HMM sampling and applications to gene finding and alternative splicing

    The standard method of applying hidden Markov models to biological problems is to find a Viterbi (maximal weight) path through the HMM graph. The Viterbi algorithm reduces the problem of finding the most likely hidden state sequence that explains given observations to a dynamic programming problem for corresponding directed acyclic graphs. For example, in the gene finding application, the HMM is used to find the most likely underlying gene structure given a DNA sequence. In this note we discuss the applications of sampling methods for HMMs. The standard sampling algorithm for HMMs is a variant of the common forward-backward and backtrack algorithms, and has already been applied in the context of Gibbs sampling methods. Nevertheless, the practice of sampling state paths from HMMs does not seem to have been widely adopted, and important applications have been overlooked. We show how sampling can be used for finding alternative splicings for genes, including alternative splicings that are conserved between genes from related organisms. We also show how sampling from the posterior distribution is a natural way to compute probabilities for predicted exons and gene structures being correct under the assumed model. Finally, we describe a new memory-efficient sampling algorithm for certain classes of HMMs which provides a practical sampling alternative to the Hirschberg algorithm for optimal alignment. The ideas presented have applications not only to gene finding and HMMs but more generally to stochastic context-free grammars and RNA structure prediction.
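The standard sampling procedure mentioned above (a forward pass followed by a stochastic backtrack) can be sketched as follows. This is a minimal textbook version, not the memory-efficient algorithm the abstract introduces. The dictionary-based model layout and the name `sample_state_path` are assumptions for illustration, and the forward values are left unscaled, so long sequences would underflow.

```python
import random


def sample_state_path(obs, states, start_p, trans_p, emit_p, rng=random):
    """Draw one hidden-state path from P(path | obs).

    Forward pass, then backward sampling: the last state is drawn in
    proportion to the final forward values, and each earlier state in
    proportion to (forward value) * (transition into the state already drawn).
    """
    # fwd[i][k] = P(obs[0..i], state at position i is k)
    fwd = [{k: start_p[k] * emit_p[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = fwd[-1]
        fwd.append(
            {k: emit_p[k][x] * sum(prev[j] * trans_p[j][k] for j in states)
             for k in states}
        )

    def draw(weights):
        keys = list(weights)
        return rng.choices(keys, weights=[weights[k] for k in keys])[0]

    path = [draw(fwd[-1])]
    for i in range(len(obs) - 2, -1, -1):
        nxt = path[-1]
        path.append(draw({j: fwd[i][j] * trans_p[j][nxt] for j in states}))
    path.reverse()
    return path
```

Calling the function repeatedly and counting how often a given state (for example, an exon state in a gene-finding model) appears at a position gives a Monte Carlo estimate of its posterior probability under the model.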

    Efficient homology search for genomic sequence databases

    Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. 
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte-packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance; however, existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy. 
Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
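The idea of comparing nucleotide sequences in packed form, four bases at a time, can be illustrated with a small sketch. The bit layout (two bits per base, first base in the high bits), the precomputed 256-entry table and all names here are assumptions chosen for illustration; they are not taken from blast's actual format or from the thesis's methods.

```python
# Hypothetical 2-bit code for each base.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    if len(seq) % 4:
        raise ValueError("this sketch only handles lengths that are a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4]:
            b = (b << 2) | CODE[base]
        out.append(b)
    return bytes(out)


def _matches_in_byte(x):
    # x is the XOR of two packed bytes; a 2-bit field equal to 00 means
    # the two bases at that position are identical.
    return sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 0b11 == 0)


# One table lookup per byte replaces four per-base comparisons.
MATCH_TABLE = [_matches_in_byte(x) for x in range(256)]


def count_matches(packed_a, packed_b):
    """Count identical positions without unpacking either sequence."""
    return sum(MATCH_TABLE[a ^ b] for a, b in zip(packed_a, packed_b))
```

For example, `count_matches(pack("ACGTACGT"), pack("ACGAACGT"))` inspects two bytes per sequence rather than eight characters.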

    Text-basierte Ähnlichkeitssuche zur Treffer- und Leitstruktur-Identifizierung (Text-based similarity search for hit and lead structure identification)

    This work investigated the applicability of global pairwise sequence alignment to the detection of functional analogues in virtual screening. This variant of sequence comparison was developed for the identification of homologous proteins based on amino acid or nucleotide sequences. Because of the significant differences between biopolymers and small molecules, several aspects of this approach for sequence comparison had to be adapted. All proposed concepts were implemented as the ‘Pharmacophore Alignment Search Tool’ (PhAST) and evaluated in retrospective experiments on the COBRA dataset in version 6.1. The aim of identifying functional analogues made it necessary to identify and classify functional properties in molecular structures. This was realized by fragment-based atom typing, where one out of nine functional properties was assigned to each non-hydrogen atom in a structure. These properties were pre-assigned to atoms in the fragments. Whenever a fragment matched a substructure in a molecule, the assigned properties were transferred from fragment atoms to structure atoms. Each functional property was represented by exactly one symbol. Unlike amino acid or nucleotide sequences, small drug-like molecules contain branches and cycles. This was a major obstacle in the application of sequence alignment to virtual screening, since this technique can only be applied to linear sequences of symbols. The best linearization technique was shown to be Minimum Volume Embedding. To the best of our knowledge, this work represents the first application of dimensionality reduction to graph linearization. Sequence alignment relies on a scoring system that rates symbol equivalences (matches) and differences (mismatches) based on functional properties that correspond to rated symbols. Existing scoring schemes are applicable only to amino acids and nucleotides. 
In this work, scoring schemes for functional properties in drug-like molecules were developed based on property frequencies and isofunctionality judged from chemical experience, pairwise sequence alignments, pairwise kernel-based assignments and stochastic optimization. The scoring system based on property frequencies and isofunctionality proved to be the most powerful (measured by enrichment capability). All developed scoring systems outperformed simple scoring approaches that rate matches and mismatches uniformly. The frameworks proposed for score calculations can be used to guide modifications to the atom typing in promising directions. The scoring system was further modified to allow for emphasis on particular symbols in a sequence. It was shown that applying weights to symbols that correspond to key interaction points in the receptor-ligand interaction significantly improves the screening capabilities of PhAST. It was demonstrated that the systematic application of weights to all sequence positions in retrospective experiments can be used for pharmacophore elucidation. A scoring system based on structural instead of functional similarity was investigated and found to be suitable for similarity searches in shape-constrained datasets. Three methods for similarity assessment based on alignments were evaluated: sequence identity, alignment score and significance. PhAST achieved significantly higher enrichment with alignment scores than with sequence identity. p-values as significance estimates were calculated with a combination of Markov chain Monte Carlo simulation and importance sampling. p-values were adapted to library size in a Bonferroni correction, yielding E-values. A significance threshold of an E-value of 1 × 10^-5 was proposed for the application in prospective screenings. PhAST was compared to state-of-the-art methods for virtual screening. The unweighted version was shown to exhibit comparable enrichment capabilities. 
Compound rankings obtained with PhAST were shown to be complementary to those of other methods. The application to three-dimensional instead of two-dimensional molecular representations resulted in altered compound rankings without increased enrichment. PhAST was employed in two prospective applications. A screening for non-nucleoside analogue inhibitors of bacterial thymidine kinase yielded a hit with a distinct structural framework but only weak activity. The search for drugs that are not members of the NSAID (non-steroidal anti-inflammatory drug) class as modulators of gamma-secretase resulted in a potent modulator with clear structural distinction from the reference compound. The calculation of significance estimates, the emphasis on key interactions, the pharmacophore elucidation capabilities and the unique compound rankings set PhAST apart from other screening techniques.
This work investigated the applicability of pairwise global sequence alignment to the problem of molecule comparison in virtual screening, a subfield of computer-aided drug development. Sequence alignment was developed for identifying homologous proteins. Until now it has been applied only to sequences of amino acids or nucleotides. Because of the differences between biopolymers and drug-like molecules, this approach to sequence comparison was modified and adapted to the specific problem. All methods presented and investigated were implemented under the name ‘Pharmacophore Alignment Search Tool’ (PhAST). The aim in developing PhAST was to compute the functional similarity between molecules. This required implementing an approach that assigns functional properties to the atoms of a molecule. This was realized by fragment-based atom typing. Properties were assigned to the atoms of a collection of predefined fragments according to best knowledge and judgment. 
Whenever one of the fragments occurred as a substructure of a molecule, the atom types were transferred from that fragment to the atoms of the molecule. In total, PhAST distinguishes nine functional properties and their combinations, with exactly one symbol assigned to each atom type. Unlike sequences of amino acids and nucleotides, drug-like molecules are branched and undirected and contain ring closures. Sequence alignment, however, is applicable only to linear sequences. Consequently, molecules with their functional properties first had to be stored in a linearized form. Minimum Volume Embedding was shown to be the best-performing of the linearization methods tested. To the best of our knowledge, this work is the first to apply dimensionality reduction methods to the problem of canonical indexing of graphs. Computing sequence alignments requires a scoring system for equivalences and differences between symbols. The existing systems are applicable only to amino acids and nucleotides. Therefore, a scoring system for atom properties was developed based on chemical intuition, along with three automated methods for computing such systems. The scoring scheme built on chemical intuition was found to be significantly superior to the others. The flexibility of the scoring system in global sequence alignment made it possible to give greater influence on the computed alignments to symbols known to represent essential interactions between receptor and ligand. This weighting was shown to significantly increase the screening capabilities of PhAST. It was further shown that, by systematically applying weights to all sequence positions, PhAST was able to identify interactions essential to the receptor-ligand interaction. 
This required, however, that a suitable dataset of active and inactive compounds was available. This work evaluated several methods for computing similarities from alignments: sequence identity, alignment score and p-values. The alignment score was shown to be significantly superior to sequence identity for use in PhAST. To compute p-values for assessing the significance of alignments, the distribution of alignment scores for random sequences of given lengths first had to be determined. This was done with a combination of Markov chain Monte Carlo simulation and importance sampling. The computed p-values were subjected to a Bonferroni correction and thus, taking into account the total number of molecules compared in the virtual screening, converted into E-values. As a result of this work, an E-value of 1 × 10^-5 is proposed as the threshold, with alignments that have lower E-values to be regarded as significant. PhAST was compared with other virtual screening methods in retrospective screenings. Its ability to identify functional analogues was shown to be comparable or even superior to that of the best other methods. Compound collections ranked by PhAST similarities were shown to differ from the rankings produced by other methods. Within this work, PhAST was successfully used in two prospective applications. A weak inhibitor of bacterial thymidine kinase that is not a nucleoside analogue was identified. In a screening for modulators of gamma-secretase, a potent molecule was identified that showed clear structural differences from the reference used.
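As a rough illustration of two building blocks named in this abstract, the sketch below computes a global (Needleman-Wunsch) alignment score with a pluggable symbol scoring function, and converts a p-value into an E-value by multiplying with the library size, which is the usual reading of a Bonferroni-style correction. It is not PhAST's implementation: the linear gap penalty, the function names and the uniform example scoring are all assumptions for illustration.

```python
def global_alignment_score(a, b, score, gap=-1.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    score(x, y) rates aligning symbol x with symbol y; a scheme that weights
    particular symbols more heavily can be plugged in here.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + score(x, y),  # align x with y
                           prev[j] + gap,              # gap in b
                           cur[j - 1] + gap))          # gap in a
        prev = cur
    return prev[-1]


def e_value(p_value, library_size):
    """Expected number of chance hits when library_size comparisons are made."""
    return p_value * library_size


def uniform(x, y):
    # Simple baseline scoring: every match +1, every mismatch -1.
    return 1.0 if x == y else -1.0
```

With the uniform baseline, `global_alignment_score("ACGT", "AGT", uniform)` aligns three matching symbols and pays for one gap. A hit would then count as significant under the threshold proposed in the abstract when `e_value(p, n)` falls below 1 × 10^-5.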