199 research outputs found

    Genome Rearrangement Problems

    Get PDF
    Various global rearrangements of permutations, such as reversals and transpositions, have recently become of interest because of their applications in computational molecular biology. A reversal is an operation that reverses the order of a substring of a permutation. A transposition is an operation that swaps two adjacent substrings of a permutation. The problem of determining the smallest number of reversals required to transform a given permutation into the identity permutation is called sorting by reversals. Similar problems can be defined for transpositions and other global rearrangements. Related to sorting by reversals is the problem of establishing the reversal diameter. The reversal diameter of Sn (the symmetric group on n elements) is the maximum number of reversals required to sort a permutation of length n. Of course, diameter problems can be posed for other global rearrangements. These various problems are of interest because the permutations can be used to represent sequences of genes in chromosomes, and the global rearrangements then represent evolutionary events. As a result, we call these problems genome rearrangement problems. Genome rearrangement problems seem to be unlike previously studied algorithmic problems on sequences, so new methods have had to be developed to deal with them. These methods predominantly employ graphs to model permutation structure. However, even using these methods, often a genome rearrangement problem has no obvious polynomial-time algorithm, and in some cases can be shown to be NP-hard. For example, the problem of sorting by reversals is NP-hard, whereas the computational complexity of sorting by transpositions is open. For problems like these, it is natural to seek polynomial-time approximation algorithms that achieve an approximation guarantee. In this thesis, we study several genome rearrangement problems as interesting and challenging algorithmic problems in their own right, including some problems for which the global rearrangement has no immediate biological equivalent. For example, we define a block-interchange to be a rearrangement that swaps any two substrings of the permutation. We examine, in particular, how the graph theoretic models relate to the genome rearrangement problems that we study. The major new results contained in this thesis are as follows: We present a 3/2-approximation algorithm for sorting by reversals. This is the best known approximation algorithm for the problem, and improves upon the 7/4 approximation bound of the previous best algorithm. We give a polynomial-time algorithm for a significant special case of sorting by reversals, thereby disproving a conjecture of Kececioglu and Sankoff, who had suggested that this special case was likely to be NP-hard. We analyse the structure of the so-called cpcle graph of a permutation in the context of sorting by transpositions, and thereby gain a deeper insight into this problem. Among the consequences are; a tighter lower bound for the problem, a simpler 3/2-aproximation algorithm than had previously been described, and algorithms that, in empirical tests, almost always find the exact transposition distance of random permutations. We introduce a natural generalisation of sorting by transpositions called sorting by block-interchanges, and present a polynomial-time algorithm for this problem. We initiate the study of analogous problems on strings over a fixed length alphabet. We establish upper and lower bounds and diameter results for the problems over a binary alphabet. We also prove that the problems analogous to sorting by reversals and sorting by block-interchanges are NP-hard. (Abstract shortened by ProQuest.)

    Transposition Rearrangement: Linear Algorithm for Length-Cost Model

    Get PDF
    The contemporary computational biology gives motivation to study dependencies between finite sequences. Primary structures of DNA or proteins are represented by such sequences (also called words or strings). In the paper a linear algorithm, computing the distance between two words, is presented. The model operates with transpositions of single letters. The cost of a single transposition is equal to the distance which transposed letter has to cover. Other papers concerning the model give, as the best known, algorithms of time complexity O(n log n). The complexity of our algorithm is O(nk), where k is the size of the alphabet, and O(n) when the size is fixed

    Sorting by Prefix Block-Interchanges

    Get PDF
    We initiate the study of sorting permutations using prefix block-interchanges, which exchange any prefix of a permutation with another non-intersecting interval. The goal is to transform a given permutation into the identity permutation using as few such operations as possible. We give a 2-approximation algorithm for this problem, show how to obtain improved lower and upper bounds on the corresponding distance, and determine the largest possible value for that distance

    A fast algorithm for the multiple genome rearrangement problem with weighted reversals and transpositions

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Due to recent progress in genome sequencing, more and more data for phylogenetic reconstruction based on rearrangement distances between genomes become available. However, this phylogenetic reconstruction is a very challenging task. For the most simple distance measures (the breakpoint distance and the reversal distance), the problem is NP-hard even if one considers only three genomes.</p> <p>Results</p> <p>In this paper, we present a new heuristic algorithm that directly constructs a phylogenetic tree w.r.t. the weighted reversal and transposition distance. Experimental results on previously published datasets show that constructing phylogenetic trees in this way results in better trees than constructing the trees w.r.t. the reversal distance, and recalculating the weight of the trees with the weighted reversal and transposition distance. An implementation of the algorithm can be obtained from the authors.</p> <p>Conclusion</p> <p>The possibility of creating phylogenetic trees directly w.r.t. the weighted reversal and transposition distance results in biologically more realistic scenarios. Our algorithm can solve today's most challenging biological datasets in a reasonable amount of time.</p

    Sorting by Block Moves

    Get PDF
    The research in this thesis is focused on the problem of Block Sorting, which has applications in Computational Biology and in Optical Character Recognition (OCR). A block in a permutation is a maximal sequence of consecutive elements that are also consecutive in the identity permutation. BLOCK SORTING is the process of transforming an arbitrary permutation to the identity permutation through a sequence of block moves. Given an arbitrary permutation π and an integer m, the Block Sorting Problem, or the problem of deciding whether the transformation can be accomplished in at most m block moves has been shown to be NP-hard. After being known to be 3-approximable for over a decade, block sorting has been researched extensively and now there are several 2-approximation algorithms for its solution. This work introduces new structures on a permutation, which are called runs and ordered pairs, and are used to develop two new approximation algorithms. Both the new algorithms are 2-approximation algorithms, yielding the approximation ratio equal to the current best. This work also includes an analysis of both the new algorithms showing they are 2-approximation algorithms

    Models and Algorithms for Sorting Permutations with Tandem Duplication and Random Loss

    Get PDF
    A central topic of evolutionary biology is the inference of phylogeny, i. e., the evolutionary history of species. A powerful tool for the inference of such phylogenetic relationships is the arrangement of the genes in mitochondrial genomes. The rationale is that these gene arrangements are subject to different types of mutations in the course of evolution. Hence, a high similarity in the gene arrangement between two species indicates a close evolutionary relation. Metazoan mitochondrial gene arrangements are particularly well suited for such phylogenetic studies as they are available for a wide range of species, their gene content is almost invariant, and usually free of duplicates. With these properties gene arrangements of mitochondrial genomes are modeled by permutations in which each element represents a gene, i. e., a specific genetic sequence. The mutations that shape the gene arrangement of genomes are then represented by operations that rearrange elements in permutations, so-called genome rearrangements, and thereby bridge the gap between evolutionary biology and optimization. Many problems of phylogeny inference can be formulated as challenging combinatorial optimization problems which makes this research area especially interesting for computer scientists. The most prominent examples of such optimization problems are the sorting problem and the distance problem. While the sorting problem requires a minimum length sequence of rearrangements that transforms one given permutation into another given permutation, i. e., it aims for a hypothetical scenario of gene order evolution, the distance problem intends to determine only the length of such a sequence. This minimum length is called distance and used as a (dis)similarity measure quantifying the evolutionary relatedness. Most evolutionary changes occurring in gene arrangements of mitochondrial genomes can be explained by the tandem duplication random loss (TDRL) genome rearrangement model. A TDRL consists of a duplication of a consecutive set of genes in tandem followed by a random loss of one copy of each duplicated gene. In spite of the importance of the TDRL genome rearrangement in mitochondrial evolution, its combinatorial properties have rarely been studied. In addition, models of genome rearrangements which include all types of rearrangement that are relevant for mitochondrial genomes, i. e., inversions, transpositions, inverse transpositions, and TDRLs, while admitting computational tractability are rare. Nevertheless, especially for metazoan gene arrangements the TDRL rearrangement should be considered for the reconstruction of phylogeny. Realizing that a better understanding of the TDRL model is indispensable for the study of mitochondrial gene arrangements, the central theme of this thesis is to broaden the horizon of TDRL genome rearrangements with respect to mitochondrial genome evolution. For this purpose, this thesis provides combinatorial properties of the TDRL model and its variants as well as efficient methods for a plausible reconstruction of rearrangement scenarios between gene arrangements. The methods that are proposed consider all types of genome rearrangements that predominately occur during mitochondrial evolution. More precisely, the main points contained in this thesis are as follows: The distance problem and the sorting problem for the TDRL model are further examined in respect to circular permutations, a formal concept that reflects the circular structure of mitochondrial genomes. As a result, a closed formula for the distance is provided. Recently, evidence for a variant of the TDRL rearrangement model in which the duplicated set of genes is additionally inverted have been found. Initiating the algorithmic study of this new rearrangement model on a certain type of permutations, a closed formula solving the distance problem is proposed as well as a quasilinear time algorithm that solves the corresponding sorting problem. The assumption that only one type of genome rearrangement has occurred during the evolution of certain gene arrangements is most likely unrealistic, e. g., at least three types of rearrangements on top of the TDRL rearrangement have to be considered for the evolution metazoan mitochondrial genomes. Therefore, three different biologically motivated constraints are taken into account in this thesis in order to produce plausible evolutionary rearrangement scenarios. The first constraint is extending the considered set of genome rearrangements to the model that covers all four common types of mitochondrial genome rearrangements. For this 4-type model a sharp lower bound and several close additive upper bounds on the distance are developed. As a byproduct, a polynomial-time approximation algorithm for the corresponding sorting problem is provided that guarantees the computation of pairwise rearrangement scenarios that deviate from a minimum length scenario by at most two rearrangement operations. The second biologically motivated constraint is the relative frequency of the different types of rearrangements occurring during the evolution. The frequency is modeled by employing a weighting scheme on the 4-type model in which every rearrangement is weighted with respect to its type. The resulting NP-hard sorting problem is then solved by means of a polynomial size integer linear program. The third biologically motivated constraint that has been taken into account is that certain subsets of genes are often found in close proximity in the gene arrangements of many different species. This observation is reflected by demanding rearrangement scenarios to preserve certain groups of genes which are modeled by common intervals of permutations. In order to solve the sorting problem that considers all three types of biologically motivated constraints, the exact dynamic programming algorithm CREx2 is proposed. CREx2 has a linear runtime for a large class of problem instances. Otherwise, two versions of the CREx2 are provided: The first version provides exact solutions but has an exponential runtime in the worst case and the second version provides approximated solutions efficiently. CREx2 is evaluated by an empirical study for simulated artificial and real biological mitochondrial gene arrangements

    O problema da ordenação de permutações usando rearranjos de prefixos e sufixos

    Get PDF
    Orientador: Zanoni DiasTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: O Problema das Panquecas tem como objetivo ordenar uma pilha de panquecas que possuem tamanhos distintos realizando o menor número possível de operações. A operação permitida é chamada reversão de prefixo e, quando aplicada, inverte o topo da pilha de panquecas. Tal problema é interessante do ponto de vista combinatório por si só, mas ele também possui algumas aplicações em biologia computacional. Dados dois genomas que compartilham o mesmo número de genes, e assumindo que cada gene aparece apenas uma vez por genoma, podemos representá-los como permutações (pilhas de panquecas também são representadas por permutações). Então, podemos comparar os genomas tentando descobrir como um foi transformado no outro por meio da aplicação de rearranjos de genoma, que são eventos de mutação de grande escala. Reversões e transposições são os tipos mais comumente estudados de rearranjo de genomas e uma reversão de prefixo (ou transposição de prefixo) é um tipo de reversão (ou transposição) que é restrita ao início da permutação. Quando o rearranjo é restrito ao final da permutação, dizemos que ele é um rearranjo de sufixo. Um problema de ordenação de permutações por rearranjos é, portanto, o problema de encontrar uma sequência de rearranjos de custo mínimo que ordene a permutação dada. A abordagem tradicional considera que todos os rearranjos têm o mesmo custo unitário, de forma que o objetivo é tentar encontrar o menor número de rearranjos necessários para ordenar a permutação. Vários esforços foram feitos nos últimos anos considerando essa abordagem. Por outro lado, um rearranjo muito longo (que na verdade é uma mutação) tem mais probabilidade de perturbar o organismo. Portanto, pesos baseados no comprimento do segmento envolvido podem ter um papel importante no processo evolutivo. Dizemos que essa abordagem é ponderada por comprimento e o objetivo nela é tentar encontrar uma sequência de rearranjos cujo custo total (que é a soma do custo de cada rearranjo, que por sua vez depende de seu comprimento) seja mínimo. Nessa tese nós apresentamos os primeiros resultados que envolvem problemas de ordenação de permutações por reversões e transposições de prefixo e sufixo considerando ambas abordagens tradicional e ponderada por comprimento. Na abordagem tradicional, consideramos um total de 10 problemas e desenvolvemos novos resultados para 6 deles. Na abordagem ponderada por comprimento, consideramos um total de 13 problemas e desenvolvemos novos resultados para todos elesAbstract: The goal of the Pancake Flipping problem is to sort a stack of pancakes that have different sizes by performing as few operations as possible. The operation allowed is called prefix reversal and, when applied, flips the top of the stack of pancakes. Such problem is an interesting combinatorial problem by itself, but it has some applications in computational biology. Given two genomes that share the same genes and assuming that each gene appears only once per genome, we can represent them as permutations (stacks of pancakes are also represented by permutations). Then, we can compare the genomes by figuring out how one was transformed into the other through the application of genome rearrangements, which are large scale mutations. Reversals and transpositions are the most commonly studied types of genome rearrangements and a prefix reversal (or prefix transposition) is a type of reversal (or transposition) which is restricted to the beginning of the permutation. When the rearrangement is restricted to the end of the permutation, we say it is a suffix rearrangement. A problem of sorting permutations by rearrangements is, therefore, the problem to find a sequence of rearrangements with minimum cost that sorts a given permutation. The traditional approach considers that all rearrangements have the same unitary cost, in which case the goal is trying to find the minimum number of rearrangements that are needed to sort the permutation. Numerous efforts have been made over the past years regarding this approach. On the other hand, a long rearrangement (which is in fact a mutation) is more likely to disturb the organism. Therefore, weights based on the length of the segment involved may have an important role in the evolutionary process. We say this is the length-weighted approach and the goal is trying to find a sequence of rearrangements whose total cost (the sum of the cost of each rearrangement, which depends on its length) is minimum. In this thesis we present the first results regarding problems of sorting permutations by prefix and suffix reversals and transpositions considering both the traditional and the length-weighted approach. For the traditional approach, we considered a total of 10 problems and developed new results for 6 of them. For the length-weighted approach, we considered a total of 13 problems and developed new results for all of themDoutoradoCiência da ComputaçãoDoutora em Ciência da Computação140017/2013-52013/01172-0FAPESPCNP

    Sorting by Strip Moves and Strip Swaps

    Get PDF
    Genome rearrangement problems in computational biology [19, 29, 27] and zoning algorithms in optical character recognition [14, 4] have been modeled as combinatorial optimization problems related to the familiar problem of sorting, namely transforming arbitrary permutations to the identity permutation. The term permutation is used for an arbitrary arrangement of the integers 1, 2,···, n, and the term identity permutation for the arrangement of 1, 2,···, n in increasing order. When a permutation is viewed as the string of integers from 1 through n, any substring in it that is also a substring in the identity permutation will be called a strip. The objective in the combinatorial optimization problems arising from the applications is to obtain the identity permutation from an arbitrary permutation in the least number of a particular chosen strip operation. Among the strip operations which have been investigated thus far in the literature are strip moves, transpositions, reversals, and block interchanges [16, 2, 25, 11, 34]. However, it is important to note that most of the existing research on sorting by strip operations has been focused on obtaining hardness results or designing approximation algorithms, with little work carried out thus far on the implementation of the proposed approximation algorithms. This research starts with implementing two existing algorithms [5, 34] and as the main contributions, provides two new algorithms for sorting by strip swaps: 1) A greedy algorithm in which each strip swap reduces the number of strips the most, and puts maximum strips in their correct positions; 2) Another algorithm that uses the strategy of bringing closest consecutive pairs together called the closest consecutive pair (CCP) algorithm. The approximation ratios for the implemented algorithms are also experimentally estimated
    corecore