
    Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications

    We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence A of pairs (a_i, w_i) for i = 1, ..., n and w_i > 0, a segment A(i,j) is a consecutive subsequence of A starting with index i and ending with index j. The width of A(i,j) is w(i,j) = sum_{i <= k <= j} w_k, and the density is (sum_{i <= k <= j} a_k) / w(i,j). The maximum-density segment problem takes A and two values L and U as input and asks for a segment of A with the largest possible density among those of width at least L and at most U. When U is unbounded, we provide a relatively simple, O(n)-time algorithm, improving upon the O(n \log L)-time algorithm by Lin, Jiang and Chao. When both L and U are specified, there are no previous nontrivial results. We solve the problem in O(n) time if w_i = 1 for all i, and more generally in O(n + n\log(U-L+1)) time when w_i >= 1 for all i. Comment: 23 pages, 13 figures. A significant portion of these results appeared under the title, "Fast Algorithms for Finding Maximum-Density Segments of a Sequence with Applications to Bioinformatics," in Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), volume 2452 of Lecture Notes in Computer Science (Springer-Verlag, Berlin), R. Guigo and D. Gusfield editors, 2002, pp. 157--17
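
The problem statement in this abstract can be made concrete with a small sketch. The code below is a quadratic-time brute force over prefix sums, written only to illustrate the definitions of width and density; it is not the linear-time algorithm the paper describes. The function name `max_density_segment` and the 1-based index convention are assumptions made for this illustration.

```python
from itertools import accumulate


def max_density_segment(a, w, L, U=None):
    """Brute-force reference for the maximum-density segment problem.

    a, w : equal-length lists of values a_i and positive widths w_i.
    L, U : lower and upper bound on segment width (U=None means unbounded).
    Returns (density, i, j) with 1-based inclusive indices, or None when
    no segment satisfies the width constraints.
    """
    n = len(a)
    # Prefix sums: pa[j] - pa[i-1] is the sum of a_i..a_j (same for widths).
    pa = [0] + list(accumulate(a))
    pw = [0] + list(accumulate(w))
    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            width = pw[j] - pw[i - 1]
            if width < L:
                continue
            if U is not None and width > U:
                break  # widths only grow with j because every w_k > 0
            density = (pa[j] - pa[i - 1]) / width
            if best is None or density > best[0]:
                best = (density, i, j)
    return best
```

A brute force like this is mainly useful as a correctness oracle when testing a faster implementation on small inputs.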

    Detection of recombination in DNA multiple alignments with hidden Markov models

    Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
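
To illustrate how an HMM over tree topologies can flag a recombination breakpoint, here is a minimal Viterbi decoding sketch. It rests on several assumptions that do not come from the abstract: the per-site emission probabilities are supplied directly rather than computed from phylogenetic trees, the chain stays in the same topology with probability `1 - nu` and switches to each other topology with equal probability, and the initial distribution is uniform. The function name `viterbi_topologies` is hypothetical, and the EM parameter optimization the paper describes is not shown.

```python
import math


def viterbi_topologies(emission, nu):
    """Most probable topology index for each alignment column.

    emission[t][k] : assumed P(column t | topology k), all values > 0.
    nu             : assumed probability of changing topology between
                     adjacent columns (0 < nu < 1); needs >= 2 topologies.
    """
    n_states = len(emission[0])
    stay = math.log(1.0 - nu)
    switch = math.log(nu / (n_states - 1))
    # Uniform initial distribution over topologies (an assumption).
    score = [math.log(1.0 / n_states) + math.log(e) for e in emission[0]]
    back = []
    for t in range(1, len(emission)):
        new_score, ptr = [], []
        for k in range(n_states):
            # Best predecessor state j for ending in state k at column t.
            cands = [score[j] + (stay if j == k else switch)
                     for j in range(n_states)]
            j_best = max(range(n_states), key=cands.__getitem__)
            new_score.append(cands[j_best] + math.log(emission[t][k]))
            ptr.append(j_best)
        score = new_score
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

A change of state along the decoded path marks a candidate recombination breakpoint between the two columns involved.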

    Selective review of offline change point detection methods

    This article presents a selective survey of algorithms for the offline detection of multiple change points in multivariate time series. A general yet structuring methodological strategy is adopted to organize this vast body of work. More precisely, detection algorithms considered in this review are characterized by three elements: a cost function, a search method and a constraint on the number of changes. Each of those elements is described, reviewed and discussed separately. Implementations of the main algorithms described in this article are provided within a Python package called ruptures.
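
The three elements named in this abstract can be seen together in a short sketch. The code below picks one option for each: a least-squares cost (sum of squared deviations from the segment mean), an exact dynamic-programming search, and a fixed number of changes. It is plain Python on a univariate signal, written for illustration; it does not use or reproduce the API of the ruptures package, and the names `l2_cost` and `detect_changes` are hypothetical.

```python
def l2_cost(prefix, prefix_sq, s, e):
    """Cost function: sum of squared deviations from the mean on signal[s:e]."""
    n = e - s
    total = prefix[e] - prefix[s]
    return (prefix_sq[e] - prefix_sq[s]) - total * total / n


def detect_changes(signal, n_changes):
    """Search method: exact dynamic programming.

    Constraint: exactly n_changes change points (needs len(signal) > n_changes).
    Returns the sorted start indices of every segment after the first.
    """
    n = len(signal)
    prefix, prefix_sq = [0.0], [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)
    inf = float("inf")
    # best[k][e]: minimal cost of splitting signal[:e] into k + 1 segments.
    best = [[inf] * (n + 1) for _ in range(n_changes + 1)]
    back = [[0] * (n + 1) for _ in range(n_changes + 1)]
    for e in range(1, n + 1):
        best[0][e] = l2_cost(prefix, prefix_sq, 0, e)
    for k in range(1, n_changes + 1):
        for e in range(k + 1, n + 1):
            for s in range(k, e):  # s is the start of the last segment
                c = best[k - 1][s] + l2_cost(prefix, prefix_sq, s, e)
                if c < best[k][e]:
                    best[k][e] = c
                    back[k][e] = s
    # Follow the back-pointers from the end of the signal.
    changes, e = [], n
    for k in range(n_changes, 0, -1):
        e = back[k][e]
        changes.append(e)
    return sorted(changes)
```

Swapping the cost function, replacing the exact search with an approximate one, or penalizing the number of changes instead of fixing it each changes one element while leaving the other two untouched.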

    The Mathematics of Phylogenomics

    The grand challenges in biology today are being shaped by powerful high-throughput technologies that have revealed the genomes of many organisms, global expression patterns of genes and detailed information about variation within populations. We are therefore able to ask, for the first time, fundamental questions about the evolution of genomes, the structure of genes and their regulation, and the connections between genotypes and phenotypes of individuals. The answers to these questions are all predicated on progress in a variety of computational, statistical, and mathematical fields. The rapid growth in the characterization of genomes has led to the advancement of a new discipline called Phylogenomics. This discipline results from the combination of two major fields in the life sciences: Genomics, i.e., the study of the function and structure of genes and genomes; and Molecular Phylogenetics, i.e., the study of the hierarchical evolutionary relationships among organisms and their genomes. The objective of this article is to offer mathematicians a first introduction to this emerging field, and to discuss specific mathematical problems and developments arising from phylogenomics. Comment: 41 pages, 4 figures

    Integrating protein structural information

    Dissertation presented to obtain the degree of Doctor in Biochemistry (Structural Biochemistry) at the Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia. The central theme of this work is the application of constraint programming and other artificial intelligence techniques to protein structure problems, with the goal of better combining experimental data with structure prediction methods. Part one of the dissertation introduces the main subjects of protein structure and constraint programming, summarises the state of the art in the modelling of protein structures and complexes, sets the context for the techniques described later on, and outlines the main point of the thesis: the integration of experimental data in modelling. The first chapter, Protein Structure, introduces the reader to the basic notions of amino acid structure, protein chains, and protein folding and interaction. These are important concepts for understanding the work described in parts two and three. Chapter two, Protein Modelling, gives a brief overview of experimental and theoretical techniques to model protein structures. The information in this chapter provides the context of the investigations described in parts two and three, but is not essential to understanding the methods developed. Chapter three, Constraint Programming, outlines the main concepts of this programming technique. Understanding variable modelling, the notions of consistency and propagation, and search methods should greatly help the reader interested in the details of the algorithms, as described in part two of this book. The fourth chapter, Integrating Structural Information, is a summary of the thesis proposed here. This chapter is an overview of the objectives of this work, and gives an idea of how the algorithms developed here could help in modelling protein structures. The main goal is to provide a flexible and continuously evolving framework for the integration of structural information from a diversity of experimental techniques and theoretical predictions. Part two describes the algorithms developed, which make up the main original contribution of this work. This part is aimed especially at developers interested in the details of the algorithms, in replicating the results, in improving the method or in integrating them in other applications. Biochemical aspects are dealt with briefly and as necessary, and the emphasis is on the algorithms and the code.

    A FAST ALGORITHM FOR COMPUTING HIGHLY SENSITIVE MULTIPLE SPACED SEEDS

    The main goal of homology search is to find similar segments, or local alignments, between two DNA or protein sequences. Since the dynamic programming algorithm of Smith-Waterman is too slow, heuristic methods have been designed to achieve both efficiency and accuracy. Seed-based methods were made well known by their use in BLAST, the most widely used software program in biological applications. The seed of BLAST trades sensitivity for speed and spaced seeds were introduced in PatternHunter to achieve both. Several seeds are better than one and near perfect sensitivity can be obtained while maintaining the speed. Therefore, multiple spaced seeds quickly became the state-of-the-art in similarity search, being employed by many software programs. However, the quality of these seeds is crucial and computing optimal multiple spaced seeds is NP-hard. All but one of the existing heuristic algorithms for computing good seeds are exponential. Our work has two main goals. First we engineer the only existing polynomial-time heuristic algorithm to compute better seeds than any other program, while running orders of magnitude faster. Second, we estimate its performance by comparing its seeds with the optimal seeds in a few practical cases. In order to make the computation feasible, a very fast implementation of the sensitivity function is provided.

    Algorithms for efficient sequence comparison without alignment (Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja)

    Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

    Table of Contents

    1. General introduction
        1.1. Suffix trees and other index data structures used in biological sequence analysis
            1.1.1. Suffix tree
            1.1.2. The space and the time complexity of the algorithms for the suffix tree construction
            1.1.3. Suffix array
            1.1.4. The space and the time complexity of the algorithms for suffix array construction
            1.1.5. Enhanced suffix array
            1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm
            1.1.7. Self-index
            1.1.8. Burrows-Wheeler transform
            1.1.9. The FM-index and the backward search algorithm
            1.1.10. The space and the time complexity of the FM-index
    2. Efficient estimation of pairwise distances between genomes
        2.1. Introduction
        2.2. Methods
            2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr
            2.2.2. Problem statement
            2.2.3. Time complexity analysis of the previous approach (kr 1)
            2.2.4. Time complexity analysis of the new approach (kr 2)
            2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences
            2.2.6. The implementation of kr version 2
        2.3. Analysis of Kr on simulated data sets
            2.3.1. Auxiliary programs
            2.3.2. Consistency of Kr
            2.3.3. The effect of horizontal gene transfer on the accuracy of Kr
            2.3.4. The effect of genome duplication on the accuracy of Kr
            2.3.5. Run time comparison of kr 1 and kr 2
        2.4. Application of kr version 2
            2.4.1. Auxiliary software used for the analysis of real data sets
            2.4.2. The analysis of 12 Drosophila genomes
            2.4.3. The analysis of 13 Escherichia coli and Shigella genomes
            2.4.4. The analysis of 825 HIV-1 pure subtype genomes
        2.5. Discussion
    3. Efficient alignment-free detection of local sequence homology
        3.1. Introduction
        3.2. Methods
            3.2.1. Problem statement – determining subtype(s) of a query sequence
            3.2.2. Construction of locally homologous segments
            3.2.3. Time complexity of computing a list of intervals Ii
            3.2.4. Algorithm 2: Construction of an interval tree
            3.2.5. Computing a list of segments Gi
        3.3. Analysis of st on simulated data sets
            3.3.1. Run-time and memory usage analysis of st
            3.3.2. Consistency of st
            3.3.3. Comparison to SCUEAL on simulated data sets
        3.4. Application of st
            3.4.1. The analysis of Neisseria meningitidis
            3.4.2. The analysis of a recombinant form of HIV-1
            3.4.3. The analysis of circulating recombinant forms of HIV-1
            3.4.4. The analysis of an avian pathogenic Escherichia coli strain
        3.5. Discussion
    Abstract
    Sažetak (abstract in Croatian)
    Curriculum vitae
    Životopis (curriculum vitae in Croatian)