    On the distribution of the number of cycles in the breakpoint graph of a random signed permutation

    We use the finite Markov chain embedding technique to obtain the distribution of the number of cycles in the breakpoint graph of a uniformly random signed permutation. This in turn gives a very good approximation of the distribution of the reversal distance between two random genomes.
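
    As a rough illustration of the quantity studied above, the following sketch counts the alternating cycles in the breakpoint graph of a signed permutation and estimates their distribution by plain Monte Carlo sampling. It is not the finite Markov chain embedding technique of the paper. The function names, and the standard "unsigned extension" construction used to build the graph, are assumptions made for this example.

```python
import random
from collections import Counter


def breakpoint_graph_cycles(perm):
    """Count alternating cycles in the breakpoint graph of a signed permutation.

    perm is a list of nonzero integers whose absolute values are 1..n.
    """
    n = len(perm)
    # Unsigned extension: +x becomes (2x-1, 2x), -x becomes (2x, 2x-1),
    # framed by 0 on the left and 2n+1 on the right.
    ext = [0]
    for x in perm:
        ext += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    ext.append(2 * n + 1)
    pos = {value: index for index, value in enumerate(ext)}

    visited = [False] * len(ext)
    cycles = 0
    for start in range(len(ext)):
        if visited[start]:
            continue
        cycles += 1
        i = start
        while not visited[i]:
            visited[i] = True
            j = i ^ 1            # black edge: positions 2k and 2k+1 are adjacent
            visited[j] = True
            i = pos[ext[j] ^ 1]  # grey edge: values 2m and 2m+1 belong together
    return cycles


def sample_cycle_counts(n, trials, seed=0):
    """Empirical distribution of the cycle count (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        perm = list(range(1, n + 1))
        rng.shuffle(perm)
        # Give each element an independent uniform random sign.
        perm = [x if rng.random() < 0.5 else -x for x in perm]
        counts[breakpoint_graph_cycles(perm)] += 1
    return counts
```

    With this construction the identity permutation on n elements yields n + 1 cycles, the maximum possible, and `sample_cycle_counts(n, trials)` returns a `Counter` mapping each observed cycle count to its frequency.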

    Compound Poisson Approximation and Testing for Gene Clusters with Multigene Families

    We present a compound Poisson approximation for computing the probabilities involved in significance tests for conserved genomic regions between different species. We consider the case in which the conserved genomic regions are found by the reference region approach. An important aspect of our computations is that we take into account the existence of multigene families. We obtain convergence results for the error of our approximation by using the Stein-Chen method for compound Poisson approximation.

    MinMax-Profiles: A Unifying View of Common Intervals, Nested Common Intervals and Conserved Intervals of K Permutations

    Common intervals of K permutations over the same set of n elements were first investigated by T. Uno and M. Yagiura (Algorithmica, 26:290-309, 2000), who proposed an efficient algorithm to find common intervals when K=2. Several particular classes of intervals have been defined since then, e.g. conserved intervals and nested common intervals, with applications mainly in genome comparison. Each such class, including common intervals, led to the development of a specific algorithmic approach for K=2 and, except for nested common intervals, to its extension to an arbitrary K. In this paper, we propose a common and efficient algorithmic framework for finding different types of common intervals in a set P of K permutations, with arbitrary K. Our generic algorithm is based on a global representation of the information stored in P, called the MinMax-profile of P, and an efficient data structure, called an LR-stack, that we introduce here. We show that common intervals (and their subclasses of irreducible common intervals and same-sign common intervals), nested common intervals (and their subclass of maximal nested common intervals) as well as conserved intervals (and their subclass of irreducible conserved intervals) may be obtained by appropriately setting the parameters of our algorithm in each case. All the resulting algorithms run in O(Kn+N) time and need O(n) additional space, where N is the number of solutions. The algorithms for nested common intervals and maximal nested common intervals are new for K>2, in the sense that no other algorithm has been given so far to solve the problem with the same complexity, or better. The other algorithms are as efficient as the best known algorithms. Comment: 25 pages, 2 figures
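
    To make the notion of a common interval concrete, here is a minimal brute-force sketch. It is not the MinMax-profile or LR-stack algorithm described above: it runs in O(Kn^2) time rather than O(Kn+N). The function name and the choice to report only intervals of size at least 2 are assumptions of this example.

```python
def common_intervals(perms):
    """Return the common intervals (size >= 2) of a list of permutations.

    Each permutation is a list over the same set of elements. A set of
    elements is a common interval if it occupies consecutive positions
    in every permutation.
    """
    first = perms[0]
    n = len(first)
    # Position of every element in each of the other permutations.
    positions = [{v: i for i, v in enumerate(p)} for p in perms[1:]]
    result = []
    for i in range(n):
        lo = [n] * len(positions)   # smallest position seen so far
        hi = [-1] * len(positions)  # largest position seen so far
        for j in range(i, n):
            v = first[j]
            for k, pos in enumerate(positions):
                lo[k] = min(lo[k], pos[v])
                hi[k] = max(hi[k], pos[v])
            # first[i..j] is contiguous elsewhere iff the position
            # spread equals the interval length minus one.
            if j > i and all(h - l == j - i for l, h in zip(lo, hi)):
                result.append(frozenset(first[i:j + 1]))
    return result
```

    For example, for the permutations [1, 2, 3, 4] and [2, 1, 4, 3] the sketch reports {1, 2}, {3, 4} and {1, 2, 3, 4}.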

    Models and Algorithms for Sorting Permutations with Tandem Duplication and Random Loss

    A central topic of evolutionary biology is the inference of phylogeny, i. e., the evolutionary history of species. A powerful tool for the inference of such phylogenetic relationships is the arrangement of the genes in mitochondrial genomes. The rationale is that these gene arrangements are subject to different types of mutations in the course of evolution. Hence, a high similarity in the gene arrangement between two species indicates a close evolutionary relation. Metazoan mitochondrial gene arrangements are particularly well suited for such phylogenetic studies as they are available for a wide range of species, their gene content is almost invariant, and usually free of duplicates. With these properties gene arrangements of mitochondrial genomes are modeled by permutations in which each element represents a gene, i. e., a specific genetic sequence. The mutations that shape the gene arrangement of genomes are then represented by operations that rearrange elements in permutations, so-called genome rearrangements, and thereby bridge the gap between evolutionary biology and optimization. Many problems of phylogeny inference can be formulated as challenging combinatorial optimization problems which makes this research area especially interesting for computer scientists. The most prominent examples of such optimization problems are the sorting problem and the distance problem. While the sorting problem requires a minimum length sequence of rearrangements that transforms one given permutation into another given permutation, i. e., it aims for a hypothetical scenario of gene order evolution, the distance problem intends to determine only the length of such a sequence. This minimum length is called distance and used as a (dis)similarity measure quantifying the evolutionary relatedness. Most evolutionary changes occurring in gene arrangements of mitochondrial genomes can be explained by the tandem duplication random loss (TDRL) genome rearrangement model. 
A TDRL consists of a duplication of a consecutive set of genes in tandem followed by a random loss of one copy of each duplicated gene. In spite of the importance of the TDRL genome rearrangement in mitochondrial evolution, its combinatorial properties have rarely been studied. In addition, models of genome rearrangements which include all types of rearrangement that are relevant for mitochondrial genomes, i. e., inversions, transpositions, inverse transpositions, and TDRLs, while admitting computational tractability are rare. Nevertheless, especially for metazoan gene arrangements the TDRL rearrangement should be considered for the reconstruction of phylogeny. Because a better understanding of the TDRL model is indispensable for the study of mitochondrial gene arrangements, the central theme of this thesis is to broaden the horizon of TDRL genome rearrangements with respect to mitochondrial genome evolution. For this purpose, this thesis provides combinatorial properties of the TDRL model and its variants as well as efficient methods for a plausible reconstruction of rearrangement scenarios between gene arrangements. The methods that are proposed consider all types of genome rearrangements that predominantly occur during mitochondrial evolution. More precisely, the main points contained in this thesis are as follows: The distance problem and the sorting problem for the TDRL model are further examined with respect to circular permutations, a formal concept that reflects the circular structure of mitochondrial genomes. As a result, a closed formula for the distance is provided. Recently, evidence has been found for a variant of the TDRL rearrangement model in which the duplicated set of genes is additionally inverted. Initiating the algorithmic study of this new rearrangement model on a certain type of permutations, a closed formula solving the distance problem is proposed as well as a quasilinear time algorithm that solves the corresponding sorting problem. 
The assumption that only one type of genome rearrangement has occurred during the evolution of certain gene arrangements is most likely unrealistic, e. g., at least three types of rearrangements on top of the TDRL rearrangement have to be considered for the evolution of metazoan mitochondrial genomes. Therefore, three different biologically motivated constraints are taken into account in this thesis in order to produce plausible evolutionary rearrangement scenarios. The first constraint is extending the considered set of genome rearrangements to the model that covers all four common types of mitochondrial genome rearrangements. For this 4-type model a sharp lower bound and several close additive upper bounds on the distance are developed. As a byproduct, a polynomial-time approximation algorithm for the corresponding sorting problem is provided that guarantees the computation of pairwise rearrangement scenarios that deviate from a minimum length scenario by at most two rearrangement operations. The second biologically motivated constraint is the relative frequency of the different types of rearrangements occurring during the evolution. The frequency is modeled by employing a weighting scheme on the 4-type model in which every rearrangement is weighted with respect to its type. The resulting NP-hard sorting problem is then solved by means of a polynomial size integer linear program. The third biologically motivated constraint that has been taken into account is that certain subsets of genes are often found in close proximity in the gene arrangements of many different species. This observation is reflected by demanding rearrangement scenarios to preserve certain groups of genes which are modeled by common intervals of permutations. In order to solve the sorting problem that considers all three types of biologically motivated constraints, the exact dynamic programming algorithm CREx2 is proposed. CREx2 has a linear runtime for a large class of problem instances. 
Otherwise, two versions of CREx2 are provided: the first version provides exact solutions but has an exponential runtime in the worst case, and the second version provides approximate solutions efficiently. CREx2 is evaluated by an empirical study on simulated artificial and real biological mitochondrial gene arrangements.
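
    The TDRL operation defined in this abstract (tandem duplication of a consecutive set of genes, then loss of one copy of each duplicated gene) can be sketched on a linear permutation as follows. This is only an illustration of a single operation, not one of the thesis's distance or sorting algorithms. The function name and the `keep_first` parameter, which states which genes survive in the first copy, are assumptions of this example.

```python
def tdrl(perm, start, end, keep_first):
    """Apply one tandem duplication random loss to a linear permutation.

    The segment perm[start:end] is duplicated in tandem. Genes listed in
    keep_first survive in the first copy; all other duplicated genes
    survive in the second copy.
    """
    segment = perm[start:end]
    tandem = segment + segment
    kept = [
        gene
        for index, gene in enumerate(tandem)
        # Keep a gene from the first copy iff it is in keep_first,
        # and from the second copy iff it is not.
        if (gene in keep_first) == (index < len(segment))
    ]
    return perm[:start] + kept + perm[end:]
```

    For example, `tdrl([1, 2, 3, 4, 5], 1, 4, {2, 4})` duplicates the segment 2 3 4 into 2 3 4 2 3 4, keeps 2 and 4 from the first copy and 3 from the second, and returns [1, 2, 4, 3, 5].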

    Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

    The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, interaction data such as gene-protein interactions, protein-protein interactions, etc. This amount is rapidly increasing and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures that appear surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern discovery task, exploring three types of patterns that can model different regularities of the input dataset. While mask patterns have been designed to model short repeated biological sequences, showing a high conservation of their content at some specific positions, permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency but not their ordering in all the pattern occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string, detecting large insertions and deletions in their alignment.

    The incompatible desiderata of gene cluster properties

    There is widespread interest in comparative genomics in determining whether historically and/or functionally related genes are spatially clustered in the genome, and whether the same sets of genes reappear in clusters in two or more genomes. We formalize and analyze the desirable properties of gene clusters and cluster definitions. Through detailed analysis of two commonly applied types of cluster, r-windows and maxgap, we investigate the extent to which a single definition can embody all of these properties simultaneously. We show that many of the most important properties are difficult to satisfy within the same definition. We also examine whether one commonly assumed property, which we call nestedness, is satisfied by the structures present in real genomic data.