36 research outputs found

    A Decomposition of the Pure Parsimony Problem

    We partially order a collection of genotypes so that we can represent the problem of inferring the least number of haplotypes in terms of substructures we call g-lattices. This representation allows us to prove that if the genotypes partition into chains with a certain structure, then the NP-hard problem can be solved efficiently. Even without the specified structure, the decomposition shows how to separate the underlying integer programming model into smaller models.

    Diversity Graphs

    Bipartite graphs have long been used to study and model matching problems, and in this paper we introduce the bipartite graphs that explain a recent matching problem in computational biology. The problem is to match haplotypes to genotypes in a way that minimizes the number of haplotypes, a problem called the Pure Parsimony problem. The goal of this work is not to address the computational or biological issues but rather to explore the mathematical structure through a study of the underlying graph theory.
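
The Pure Parsimony problem described in this abstract can be illustrated with a tiny brute-force sketch. It is not the method of any paper listed here. It assumes a common encoding that the abstracts do not specify: each genotype is a string over '0', '1', '2', where '0' and '1' mark homozygous sites and '2' marks a heterozygous site. The function names `resolutions` and `pure_parsimony` are hypothetical, and the exhaustive search is only practical for very small inputs.

```python
from itertools import product


def resolutions(genotype):
    """Yield every haplotype pair (h1, h2) that explains one genotype.

    Assumed encoding: '0'/'1' = homozygous site, '2' = heterozygous site.
    """
    het = [i for i, s in enumerate(genotype) if s == "2"]
    if not het:
        # A fully homozygous genotype is explained by two identical haplotypes.
        yield (genotype, genotype)
        return
    # Fix h1 to '0' at the first heterozygous site to skip mirrored pairs.
    for bits in product("01", repeat=len(het) - 1):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, b in zip(het, ("0",) + bits):
            h1[site] = b
            h2[site] = "1" if b == "0" else "0"
        yield ("".join(h1), "".join(h2))


def pure_parsimony(genotypes):
    """Return (haplotype set, chosen pairs) using the fewest distinct haplotypes.

    Exhaustive search over one resolution per genotype: exponential time.
    """
    best = None
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < len(best[0]):
            best = (used, choice)
    return best


haplotypes, pairs = pure_parsimony(["22", "02", "20"])
print(sorted(haplotypes))
```

Here genotype "22" can be resolved as (00, 11) or (01, 10). Only the second choice reuses the haplotypes already forced by "02" and "20", so three haplotypes suffice.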

    Parsimony-based genetic algorithm for haplotype resolution and block partitioning

    This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. The multilocus LD measure (Normalized Entropy Difference) is used as a block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm is within the range shown by the other algorithms. The block structure output by our algorithm generally agrees with the block structure that the other algorithms provide for the same data. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some valuable new features, such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds a haplotype resolution and block partitioning for each cluster.

    G-lattices for an Unrooted Perfect Phylogeny

    We look at the Pure Parsimony problem and the Perfect Phylogeny Haplotyping problem. From the Pure Parsimony problem we consider structures of genotypes called g-lattices. These structures either provide solutions or give bounds to the Pure Parsimony problem. In particular, we investigate which of these structures support an unrooted perfect phylogeny, a condition that adds biological interpretation. By understanding which g-lattices support an unrooted perfect phylogeny, we connect two of the standard biological inference rules used to recreate how genetic diversity propagates across generations.

    A Column Generation Approach for Pure Parsimony Haplotyping


    On the Complexity of the Single Individual SNP Haplotyping Problem

    We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the problems Minimum Error Correction (MEC) and Longest Haplotype Reconstruction (LHR) for different restrictions on the input data. Specifically, we look at the gapless case, where every row of the input corresponds to a gapless haplotype fragment, and the 1-gap case, where at most one gap per fragment is allowed. We prove that MEC is APX-hard in the 1-gap case and still NP-hard in the gapless case. In addition, we question earlier claims that MEC is NP-hard even when the input matrix is restricted to being completely binary. Concerning LHR, we show that this problem is NP-hard and APX-hard in the 1-gap case (and thus also in the general case), but is polynomial-time solvable in the gapless case. Comment: 26 pages. Related to the WABI 2005 paper, "On the Complexity of Several Haplotyping Problems", but with more/different results. This paper has just been submitted to the IEEE/ACM Transactions on Computational Biology and Bioinformatics and we are awaiting a decision on acceptance. It differs from the mid-August version of this paper because here we prove that 1-gap LHR is APX-hard. (In the earlier version of the paper we could prove only that it was NP-hard.)
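
The Minimum Error Correction objective named in this abstract can be made concrete with a small brute-force sketch. It is an illustration under stated assumptions, not the paper's construction: fragments are equal-length strings over '0', '1' and '-', where '-' marks a missing entry, and the cost of a bipartition is the number of entries that must be flipped so that the fragments on each side agree column by column. The names `mec_cost` and `mec_brute_force` are hypothetical, and the search is exponential in the number of fragments.

```python
from itertools import product


def mec_cost(fragments, assignment):
    """Flips needed so that each side of the bipartition is conflict-free."""
    cost = 0
    for side in (0, 1):
        rows = [f for f, a in zip(fragments, assignment) if a == side]
        for col in range(len(fragments[0])):
            zeros = sum(r[col] == "0" for r in rows)
            ones = sum(r[col] == "1" for r in rows)
            # Keep the majority value in this column and flip the minority.
            cost += min(zeros, ones)
    return cost


def mec_brute_force(fragments):
    """Try every assignment of fragments to two haplotypes; return the best cost."""
    if not fragments:
        return 0
    return min(
        mec_cost(fragments, assignment)
        for assignment in product((0, 1), repeat=len(fragments))
    )


fragments = ["0101", "01-1", "1010", "10-0", "0111"]
print(mec_brute_force(fragments))
```

In this example the fragments 0101, 1010 and 0111 conflict pairwise, so two of them must share a side and at least one flip is unavoidable.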

    Maximum parsimony xor haplotyping by sparse dictionary selection

    Background: The xor-genotype is a cost-effective alternative to the genotype sequence of an individual. Recent methods developed for haplotype inference have aimed at finding the solution based on xor-genotype data. Given the xor-genotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual with the aid of a small number of regular genotypes. Results: We propose a framework of maximum parsimony inference of haplotypes based on the search of a sparse dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xor-genotypes augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic data sets with different numbers of individuals and SNPs, and compare its performance with the state-of-the-art xor-haplotyping methods PPXH and XOR-HAPLOGEN. Conclusions: Experimental results show good inference quality for the proposed method under all circumstances, especially on large data sets. Results on a real database, CFTR, also demonstrate significantly better performance. The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors.
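
A short sketch may help show why xor-genotypes need to be augmented by a few regular genotypes. It assumes binary haplotype strings and the usual reading of an xor-genotype as the site-wise XOR of an individual's two haplotypes, so it records only which sites are heterozygous. The '0'/'1'/'2' encoding of the regular genotype and both function names are assumptions made for illustration.

```python
def xor_genotype(h1, h2):
    """Site-wise XOR of two haplotypes: '1' where they differ, '0' elsewhere."""
    return "".join("1" if a != b else "0" for a, b in zip(h1, h2))


def regular_genotype(h1, h2):
    """Assumed encoding: the shared allele at homozygous sites, '2' otherwise."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Two different haplotype pairs that differ only at a homozygous site.
pair_a = ("0101", "0011")
pair_b = ("1101", "1011")

print(xor_genotype(*pair_a), xor_genotype(*pair_b))
print(regular_genotype(*pair_a), regular_genotype(*pair_b))
```

Both pairs produce the same xor-genotype but different regular genotypes, because the xor-genotype discards the allele carried at homozygous sites.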

    10091 Abstracts Collection -- Data Structures

    From February 28th to March 5th 2010, the Dagstuhl Seminar 10091 "Data Structures" was held in Schloss Dagstuhl – Leibniz Center for Informatics. It brought together 45 international researchers to discuss recent developments concerning data structures in terms of research, but also in terms of new technologies that impact how data can be stored, updated, and retrieved. During the seminar a fair number of participants presented their current research, and open problems were discussed. This document first briefly describes the seminar topics and then gives the abstracts of the presentations given during the seminar.

    A branch-and-price approach for Pure Parsimony haplotyping

    This thesis is the result of a detailed study of decomposition methods for large-scale problems and their application to a particular problem arising in computational biology. Improvements in computer capabilities and programming techniques in recent decades have widened the set of problems that can be easily solved as Mixed Integer Linear programs. However, several applications still require formulations that involve an intractable amount of data to describe the geometry of the solution space. In these cases, decomposition methods are used to reduce the size of the problems to be addressed. In this thesis we propose the application of some of these methods, such as Dantzig-Wolfe reformulation, column generation and Lagrangian relaxation, to a problem related to the study of the human genome. Human DNA is made of two double chains, each of which consists of a sequence of nucleotides. Among these, the ones related to Single Nucleotide Polymorphisms (SNPs) are interesting as they describe the differences between individuals. We define a haplotype as a sequence of nucleotides that describes a portion of the SNPs found in a particular chromosome, and a genotype as the sequence that aggregates the information on SNPs coming from the double DNA chain of an individual. The problem we address falls into the class of Haplotype Inference problems, which consist of recovering the structure of the haplotypes given the information on the genotypes. In particular, we consider the parsimony criterion, which means that we want to find the minimum number of haplotypes able to explain all the genotypes. This problem is known to be APX-hard. The contributions in the literature can be divided into two main classes of mixed integer linear formulations. The first one presents a polynomial number of both variables and constraints, thus these formulations are solved using a branch-and-cut approach.
    The second class consists of formulations that present an exponential number of constraints and variables, solved with a branch-and-cut-and-price approach. The scope of this thesis is to investigate how a new formulation that involves an exponential number of variables and a polynomial number of constraints can be solved by a branch-and-price approach. Its aim is to provide an algorithm that is competitive with other formulations from the literature, in particular those with a polynomial number of constraints and variables. We start by providing a review of the state of the art on the Haplotype Inference problem, with particular focus on the Mixed Integer Linear programming approaches for the Haplotype Inference by Pure Parsimony (HIPP) problem. We then consider a new mathematical programming formulation for HIPP that includes a set of quadratic constraints. By applying Dantzig-Wolfe reformulation, we obtain a new integer linear programming formulation with an exponential number of variables and a polynomial number of constraints with respect to the input data. This model is the basis for the development of a branch-and-price approach. Due to the large number of variables involved, a column-generation approach is needed to solve the linear relaxation at a generic node of the search tree. An initial feasible solution is easily found by means of heuristics and used as a starting point to build the Restricted Master Problem (RMP). In order to find variables to be added to the RMP, we solve a dedicated subproblem, the pricing problem, which in our case has a quadratic objective function. We propose different ways of solving the pricing problem.
    Among the exact methods, we consider the integer linear model obtained by linearizing the quadratic objective function and a Smart Enumeration approach, which partitions the set of feasible solutions and solves the pricing problem restricted to each subset, exploiting some extra available information to further reduce the size of the subproblems. As heuristic approaches, we first note that the pricing problem is easily solved for particular haplotypes. Then, to investigate the remaining solutions, we propose a local search-based heuristic and an Early-terminated Smart Enumeration, where we stop the Smart Enumeration approach as soon as we find a variable that can be added to the RMP. The oscillatory behaviour of the dual variables involved in the definition of the pricing problem is limited by introducing a stabilization technique adapted to our formulation. This procedure uses dual values obtained as convex combinations of the actual dual variables and a chosen stability center. In particular, we extend its proof of convergence to the cases in which the stabilized dual variables are feasible for the dual problem. In order to solve the integer model, the solution of the linear relaxation is embedded in a branch-and-price approach. The branching rule we present is inspired by the well-known Ryan-Foster branching rule for set-partitioning problems. The correctness of our approach has been proved. Further observations on the similarity of the formulation's constraints to multiple set-covering ones suggest that we can relax a family of constraints to obtain a new formulation similar to a multiple set-covering problem. However, we note that the proposed branch-and-price algorithm applied to this formulation does not provide a feasible solution for HIPP, thus we need to integrate the proposed branching rule and recover a feasible optimal solution for HIPP.
    This branch-and-price approach has been implemented in C++, with the aid of the SCIP libraries and the CPLEX solver. Results have been obtained on different classes of instances found in the literature, coming from real biological data and generated using ad-hoc programs, as well as on newly generated ones. The branch-and-price approach proposed for our formulation proves to be competitive with state-of-the-art polynomial-sized formulations. In fact, the linear relaxation of our formulation is tighter than other linear relaxations and provides an effective starting solution for the branch-and-price algorithm. Results show that our approach is efficient, in particular on the set of instances that contain a larger number of genotypes. We have therefore shown that a branch-and-price procedure is a good solution approach for a formulation with an exponential number of variables and a polynomial number of constraints. Further work may include enhancements to the implementation details, such as exploring different ways of ordering the genotypes or combining heuristic and exact methods in the stabilized framework to solve the pricing problem. Moreover, it is possible to investigate the generalization of the proposed approach in order to solve set-partitioning problems.