
Variants of the Consecutive Ones Property: Algorithms, Computational Complexity and Applications to Genomics

Abstract

Genome mapping problems in bioinformatics can be modelled as problems of finding sequences of vertices in hypergraphs, subject to consecutivity constraints. These problems are related to the consecutive ones property, a well-studied structural property of binary matrices. Many variants of this property have been introduced to capture subtleties in the model, such as upper bounds on the number of times a vertex may appear in a sequence, the distance of the input from having the property, and confidence values for the consecutivity constraints. Most problems involving these variants are intractable, and efficient solutions call for restrictions on the structure of the input, exponential time algorithms, or approximations. This document discusses these problems from both a theoretical perspective and the genomics point of view. We encounter two main classes of problems, divided into models which account for repeated elements in genomes and those which do not. Orthogonally, we divide the problems into decision and optimization questions. For models with repeats, we discuss when the given input can be used to reconstruct the genome map of interest, and whether we can discard a minimal set of encoded consecutivity information from the model to obtain an input which can be used to reconstruct this genome map. We also discuss the problem of ambiguity introduced by repeats, and introduce the concept of repeat spanning intervals in order to address it. We show that the problem of optimizing over the set of repeat spanning intervals is NP-hard in general, and give an algorithm for the case when the intervals are small. In models without repeated elements, we discuss the optimization problem of finding a solution that minimizes the distortion in the consecutivity information, generalizing the concepts of bandwidth and minimum linear arrangement to hypergraphs.
We design approximation algorithms for two versions of the latter problem, with an approximation ratio of $O\left(\sqrt{\log n}\,\log\log n\right)$. Finally, we provide details of implementations of some of the methods developed for genome mapping and scaffolding on ancestral genomes. We include results on real data for the genome of the Black Death agent, and for ancestral Anopheles mosquitoes.
