11 research outputs found

    Evolution through segmental duplications and losses : A Super-Reconciliation approach

    The classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is not appropriate for genes grouped into syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation problem, which consists in inferring a history of segmental duplication and loss events (involving a set of neighboring genes) leading to a set of present-day syntenies from a single ancestral one. In other words, we extend the traditional Duplication-Loss reconciliation problem of a single gene tree to a set of trees, accounting for segmental duplications and losses. Existence of a Super-Reconciliation depends on individual gene tree consistency. In addition, ignoring rearrangements implies that existence also depends on gene order consistency. We first show that the problem of reconstructing a most parsimonious Super-Reconciliation, if any, is NP-hard and give an exact exponential-time algorithm to solve it. Alternatively, we show that accounting for rearrangements in the evolutionary model, but still only minimizing segmental duplication and loss events, leads to an exact polynomial-time algorithm. We finally assess the time efficiency of the former exponential-time algorithm for the Duplication-Loss model on simulated datasets, and give a proof of concept on the opioid receptor genes.

    New Algorithms and Methodology for Analysing Distances

    Distances arise in a wide variety of different contexts, one of which is partitional clustering, that is, the problem of finding groups of similar objects within a set of objects. These groups are seemingly very easy to find for humans, but very difficult to find for machines as there are two major difficulties to be overcome: the first is defining an objective criterion for the vague notion of “groups of similar objects”, and the second is the computational complexity of finding such groups given a criterion. In the first part of this thesis, we focus on the first difficulty and show that even seemingly similar optimisation criteria used for partitional clustering can produce vastly different results. In the process of showing this we develop a new metric for comparing clustering solutions called the assignment metric. We then prove some new NP-completeness results for problems using two related “sum-of-squares” clustering criteria. Closely related to partitional clustering is the problem of hierarchical clustering. We extend and formalise this problem to the problem of constructing rooted edge-weighted X-trees, that is, trees with a leaf set X. It is well known that an X-tree can be uniquely reconstructed from a distance on X if the distance is an ultrametric. But in practice the complete distance on X may not always be available. In the second part of this thesis we look at some of the circumstances under which a tree can be uniquely reconstructed from incomplete distance information. We use a concept called a lasso and give some theoretical properties of a special type of lasso. We then develop an algorithm which can construct a tree together with a lasso from partial distance information and show how this can be applied to various incomplete datasets.
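
The ultrametric condition mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis. It assumes a complete, symmetric distance stored in a dictionary, and the names `make_dist` and `is_ultrametric` are hypothetical. It checks the standard three-point condition: in every triple of elements, the two largest pairwise distances must be equal.

```python
from itertools import combinations


def make_dist(table):
    """Turn {(x, y): value} into a symmetric distance function (hypothetical helper)."""
    lookup = {frozenset(pair): value for pair, value in table.items()}
    return lambda x, y: 0.0 if x == y else lookup[frozenset((x, y))]


def is_ultrametric(points, dist, tol=1e-9):
    """Return True if dist satisfies the three-point condition on every triple."""
    for x, y, z in combinations(points, 3):
        d = sorted([dist(x, y), dist(x, z), dist(y, z)])
        # The two largest of the three pairwise distances must coincide.
        if abs(d[2] - d[1]) > tol:
            return False
    return True


# a and b are close to each other and equally far from c: ultrametric.
tree_like = make_dist({("a", "b"): 2, ("a", "c"): 6, ("b", "c"): 6})
# The two largest distances (5 and 6) differ: not ultrametric.
not_tree_like = make_dist({("a", "b"): 2, ("a", "c"): 5, ("b", "c"): 6})

print(is_ultrametric("abc", tree_like))
print(is_ultrametric("abc", not_tree_like))
```

The sketch only covers the complete-distance case; the incomplete-distance setting studied in the thesis needs the lasso machinery, which is not reproduced here.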

    The matroid structure of representative triple sets and triple-closure computation

    The closure cl(R) of a consistent set R of triples (rooted binary trees on three leaves) provides essential information about tree-like relations that are shown by any supertree that displays all triples in R. In this contribution, we are concerned with representative triple sets, that is, subsets R' of R with cl(R') = cl(R). In this case, R' still contains all information on the tree structure implied by R, although R' might be significantly smaller. We show that representative triple sets that are minimal w.r.t. inclusion form the basis of a matroid. This in turn implies that minimal representative triple sets also have minimum cardinality. In particular, the matroid structure can be used to show that minimum representative triple sets can be computed in polynomial time with a simple greedy approach. For a given triple set R that “identifies” a tree, we provide an exact value for the cardinality of its minimum representative triple sets. In addition, we utilize the latter results to provide a novel and efficient method to compute the closure cl(R) of a consistent triple set R that improves the time complexity $O(|R||L_R|^4)$ of the currently fastest known method proposed by Bryant and Steel (1995). In particular, if a minimum representative triple set for R is given, it can be shown that the time complexity to compute cl(R) can be improved by a factor up to $|R||L_R|$. As it turns out, collections of quartets (unrooted binary trees on four leaves) do not provide a matroid structure, in general.
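
To make the notion of a consistent triple set concrete, here is a minimal sketch of the classical BUILD procedure of Aho et al. (1981). The abstract does not name this procedure, so its use here is an assumption for illustration. A triple `(a, b, c)` stands for ab|c, meaning a and b are more closely related to each other than either is to c. The function name `build` and the nested-tuple tree encoding are hypothetical choices.

```python
def build(leaves, triples):
    """Return a tree (nested tuples) displaying all triples, or None if inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Union-find over the leaves, used to get the components of the auxiliary graph.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Keep only triples whose three leaves all lie in the current leaf set,
    # and join a and b for every such triple ab|c.
    relevant = [t for t in triples if set(t) <= leaves]
    for a, b, _c in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)

    # A connected graph on two or more leaves means no tree displays all triples.
    if len(components) == 1:
        return None

    subtrees = []
    for component in components.values():
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


consistent = [("a", "b", "c"), ("c", "d", "a")]
conflicting = [("a", "b", "c"), ("a", "c", "b")]

print(build("abcd", consistent) is not None)
print(build("abc", conflicting) is None)
```

This only decides consistency and returns one displaying tree; computing the closure cl(R) or a minimum representative triple set, as described in the abstract, requires additional steps that are not sketched here.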

    Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell

    The genome of an organism is its complete set of DNA nucleotides, spanning all of its genes and also its non-coding regions. It contains most of the information necessary to build and maintain an organism. It is therefore no surprise that sequencing the genome provides an invaluable tool for the scientific study of an organism. Via the inference of an evolutionary (phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary history of a set of species. DNA sequences, or genotype data, have also proven useful for predicting an organism's phenotype (i.e. observed traits) from its genotype. This is the objective of association studies. While methods for finding the DNA sequence of an organism have existed for decades, the recent advent of Next Generation Sequencing (NGS) has meant that the availability of such data has increased to such an extent that the computational challenges that now form an integral part of biological studies can no longer be ignored. By focusing on phylogenetics and Genome-Wide Association Studies (GWAS), this thesis aims to help address some of these challenges. As a consequence, this thesis is in two parts, with the first one centring on phylogenetics and the second one on GWAS. In the first part, we present theoretical insights for reconstructing phylogenetic trees from incomplete distances. This problem is important in the context of NGS data as incomplete pairwise distances between organisms occur frequently with such input and ignoring taxa for which information is missing can introduce undesirable bias. In the second part we focus on the problem of inferring population stratification between individuals in a dataset due to reproductive isolation. While powerful methods for doing this have been proposed in the literature, they tend to struggle when faced with the sheer volume of data that comes with NGS. To help address this problem we introduce the novel PSIKO software and show that it scales very well when dealing with large NGS datasets.

    A list of parameterized problems in bioinformatics

    In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results.

    The design and evaluation of a QuA implementation broker based on peer-to-peer technology

    In the QuA component-based middleware architecture, the implementation broker assists the service planner in service planning by performing resource discovery. Pluggable core services are a key feature in QuA, and the implementation broker role is one of them. However, at the start of this thesis, there was only one component available for this role: the Basic Implementation Broker. The Basic Implementation Broker is designed to perform resource discovery of local resources. A second implementation should not only be able to share offer space for resources among instances of QuA; its ability to scale well, self-organize and provide robustness to data loss from node failure would also allow for a larger field of use for the component. Peer-to-peer technology has evolved greatly since the rise and fall of Napster, and the scalability, robustness and self-organization properties make peer-to-peer technology a good basis for an architectural model for distributed systems. This thesis aims to investigate the feasibility of using peer-to-peer technology in QuA resource discovery by designing and implementing an implementation broker component based on peer-to-peer technology. The component is also tested and evaluated in terms of scalability, robustness and ability to self-organize a network of peer-to-peer broker components without any centralized control. The design of the component is only technology generation specific, but the implementation described uses the FreePastry implementation of the Pastry technology. The component is fully operational as an implementation broker component in QuA. The evaluation of the component shows that the component is able to distribute responsibility for query resolution on resources as evenly as the underlying technology permits on participating nodes in a network of peer-to-peer broker components. Further, it is able to re-organize responsibility for resources among participating nodes both in the event of nodes joining and departing from the network. The replication scheme is also proven to be working, and through that, robustness to data loss from node failure is also achieved.

    Algorithms, haplotypes and phylogenetic networks

    Preface. Before I started my PhD in computational biology in 2005, I had never even heard of this term. Now, almost four years later, I think I have some idea of what is meant by it. One of the goals of my PhD was to explore different topics within computational biology and to see where the biggest opportunities for discrete/combinatorial mathematicians could be found. Roughly speaking, the first two years of my PhD I focussed mainly on problems related to haplotyping and genome rearrangements and the last two years on phylogenetic networks. I must say I really enjoyed learning so much about both mathematics and biology. It was especially amazing to learn how exact, theoretical mathematics can be used to solve complex, practical problems from biology. The topics I studied clearly show how extremely useful mathematics can be for biology. But I also learned that there are many more interesting topics in computational biology than the ones that I could study so far. The number of opportunities for discrete mathematicians is absolutely immense. I did not include my studies on genome rearrangements in this thesis, because my most interesting results [Hur07a; Hur07b] are not directly related to biology. This work is nevertheless interesting to mathematicians and I recommend them to read it. I can certainly conclude that also in this field there is a vast number of opportunities for mathematicians and that the topic genome rearrangements provides numerous beautiful mathematical problems. I could never have written this thesis without a great amount of help from many different people. I want to thank my supervisors Leen Stougie and Judith Keijsper for guiding me, for helping me, for correcting my mistakes, for supplying ideas and for the enjoyable time I had while working with them. I also want to thank the Dutch BSIK/BRICKS project for funding my research and Gerhard Woeginger for giving me the opportunity to work in his group and being my second promotor. 
I want to thank Jens Stoye and Julia Zakotnik for the work we did together and for the great time I had in Bielefeld. I want to thank Ferry Hagen and Teun Boekhout for helping me to make my work relevant for "real" biology. I also want to thank John Tromp, Rudi Cilibrasi, Cor Hurkens and all others I worked with during my PhD. I want to thank Erik de Vink and Mike Steel for reading and commenting on my thesis. I want to thank my colleagues from the Combinatorial Optimisation group at the Technische Universiteit Eindhoven for the pleasant working conditions and the fun things we did besides work. I especially want to thank Matthias Mnich, not only a great colleague but also a good friend, for all his ideas, his humour and our good and fruitful cooperation. I also want to thank Steven Kelk. I must say that I was very lucky to work with Steven during my PhD. He introduced me to problems, had an enormous amount of ideas, found the critical mistakes in my proofs and made my PhD a success both in terms of results and in terms of enjoying work. Finally, I want to thank Conno Hendriksen and Bas Heideveld for assisting me during my PhD defence and I want to thank them and all my other friends and family for helping me with everything in my life but research.