68 research outputs found
A Preprocessing Procedure for Haplotype Inference by Pure Parsimony
Haplotype data is especially important in the study of complex diseases since it contains more information than genotype data. However, obtaining haplotype data is technically difficult and expensive. Computational methods have proved to be an effective way of inferring haplotype data from genotype data. One of these methods, the haplotype inference by pure parsimony approach (HIPP), casts the problem as an optimization problem that has been proved to be NP-hard. We have designed and developed a new preprocessing procedure for this problem. Our proposed algorithm works with groups of haplotypes rather than individual haplotypes. It iteratively searches for and deletes haplotypes that are not helpful for finding the optimal solution. This preprocess can be coupled with any of the current solvers for the HIPP that need to preprocess the genotype data. To test it, we have used two state-of-the-art solvers, RTIP and GAHAP, with simulated and real HapMap data. Thanks to the reduction in computational time and memory achieved by our preprocess, problem instances that were previously unaffordable can now be solved efficiently.
Efficient Haplotype Inference with Pseudo-Boolean Optimization
Abstract. Haplotype inference from genotype data is a key computational problem in bioinformatics, since directly retrieving haplotype information from DNA samples is not feasible using existing technology. One of the methods for solving this problem uses the pure parsimony criterion, an approach known as Haplotype Inference by Pure Parsimony (HIPP). Initial work in this area was based on a number of different Integer Linear Programming (ILP) models and branch and bound algorithms. Recent work has shown that the use of a Boolean Satisfiability (SAT) formulation and state-of-the-art SAT solvers represents the most efficient approach for solving the HIPP problem. Motivated by the promising results obtained using SAT techniques, this paper investigates the use of modern Pseudo-Boolean Optimization (PBO) algorithms for solving the HIPP problem. The paper starts by applying PBO to existing ILP models. The results are promising, and motivate the development of a new PBO model (RPoly) for the HIPP problem, which has a compact representation and eliminates key symmetries. Experimental results indicate that RPoly outperforms the SAT-based approach on most problem instances and is, in general, significantly more efficient.
Boosting Haplotype Inference with Local Search
Abstract. A very challenging problem in the genetics domain is to infer haplotypes from genotypes. This process is expected to identify genes affecting health, disease and response to drugs. One of the approaches to haplotype inference aims to minimise the number of different haplotypes used, and is known as haplotype inference by pure parsimony (HIPP). The HIPP problem is computationally difficult, being NP-hard. Recently, a SAT-based method (SHIPs) has been proposed to solve the HIPP problem. This method iteratively considers an increasing number of haplotypes, starting from an initial lower bound. Hence, one important aspect of SHIPs is the lower bounding procedure, which reduces the number of iterations of the basic algorithm, and also indirectly simplifies the resulting SAT model. This paper describes the use of local search to improve existing lower bounding procedures. The new lower bounding procedure is guaranteed to be as tight as the existing procedures. In practice the new procedure is in most cases considerably tighter, allowing significant improvement of performance on challenging problem instances.
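The iterative scheme described above (try an increasing number of haplotypes, starting from a lower bound) can be illustrated with a minimal sketch. This is not SHIPs itself: the SAT model is replaced here by plain brute-force enumeration, and the genotype encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) as well as all function names are assumptions made for illustration only.

```python
from itertools import combinations, product

def explains(h1, h2, g):
    """True if the haplotype pair (h1, h2) explains genotype g."""
    for a, b, s in zip(h1, h2, g):
        if s == 2:            # heterozygous site: the two haplotypes must differ
            if a == b:
                return False
        elif not (a == b == s):  # homozygous site: both must equal the genotype value
            return False
    return True

def candidate_haplotypes(genotypes):
    """All haplotypes compatible with at least one genotype."""
    cands = set()
    for g in genotypes:
        options = [(0, 1) if s == 2 else (s,) for s in g]
        cands.update(product(*options))
    return sorted(cands)

def covers(hset, genotypes):
    """True if every genotype is explained by some pair drawn from hset."""
    return all(
        any(explains(h1, h2, g) for h1 in hset for h2 in hset)
        for g in genotypes
    )

def hipp_brute_force(genotypes, lower_bound=1):
    """Smallest haplotype set explaining all genotypes (exponential time)."""
    cands = candidate_haplotypes(genotypes)
    # Increase the allowed number of haplotypes r until a covering set exists.
    for r in range(lower_bound, len(cands) + 1):
        for hset in combinations(cands, r):
            if covers(hset, genotypes):
                return list(hset)
    return None

genotypes = [(2, 2, 0), (0, 2, 0), (2, 0, 0)]
solution = hipp_brute_force(genotypes)
print(len(solution), solution)
```

A tighter `lower_bound` skips the values of `r` that cannot succeed, which is the role the lower bounding procedure plays in the abstract; the sketch is only practical for tiny instances.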
Parsimony-based genetic algorithm for haplotype resolution and block partitioning
This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. The multilocus LD measure (Normalized Entropy Difference) is used as a block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries which represent measures of strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing results to those of the existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm is within the range of the same indicators shown by the other algorithms. The block structure output by our algorithm in general agrees with the block structure for the same data provided by the other algorithms. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some new valuable features, such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds haplotype resolution and block partitioning for each cluster.
Algorithmic approaches for the single individual haplotyping problem
Since its introduction in 2001, the Single Individual Haplotyping problem has received ever-increasing attention from the scientific community. In this paper we survey, in the form of an annotated bibliography, the developments in the study of the problem from its origin to the present day.
A branch-and-price approach for Pure Parsimony haplotyping
This thesis comes as the result of a detailed study of decomposition methods for large-scale problems and their application to a particular problem arising in computational biology.
Improvements in computer capabilities and programming techniques over the last decades have widened the set of problems that can be easily solved as Mixed Integer Linear programs. However, several applications still require formulations that involve a non-tractable amount of data necessary to describe the geometry of the solution space. In these cases, decomposition methods are used to reduce the size of the problems to be addressed.
In this thesis we propose the application of some of these methods, such as Dantzig-Wolfe reformulation, column generation and Lagrangian relaxation, to a problem related to the study of the human genome. Human DNA is made of two double chains, each of which consists of a sequence of nucleotides. Among these, the ones related to the Single Nucleotide Polymorphisms (SNPs) are interesting as they describe the differences between individuals. We define a haplotype as a sequence of nucleotides that describes a portion of the SNPs found in a particular chromosome, and a genotype as the sequence that aggregates the information on SNPs coming from the double DNA chain of an individual. The problem we address falls into the class of Haplotype Inference problems, which consist of recovering the structure of the haplotypes, given the information on the genotypes. In particular, we consider the parsimony criterion, which means that we want to find the minimum number of haplotypes able to explain all the genotypes.
This problem is known to be APX-hard.
There are several contributions in the literature, which can be divided into two main classes of mixed integer linear formulations. The first class presents a polynomial number of both variables and constraints, so these formulations are solved using a branch-and-cut approach. The second class consists of formulations that present an exponential number of constraints and variables, solved with a branch-and-cut-and-price approach.
The scope of this thesis is to investigate how a new formulation that involves an exponential number of variables and a polynomial number of constraints can be solved by a branch-and-price approach. Its aim is to provide a competitive algorithm with respect to other formulations from the literature, in particular those with a polynomial number of constraints and variables.
We start by providing a review of the state of the art on the Haplotype Inference problem, with particular focus on the Mixed Integer Linear programming approaches for the Haplotype Inference by Pure Parsimony (HIPP) problem. We then consider a new mathematical programming formulation for HIPP that includes a set of quadratic constraints. By applying Dantzig-Wolfe reformulation, we obtained a new integer linear programming formulation, presenting an exponential number of variables and a polynomial number of constraints on the input data. This model is the basis for the development of a branch-and-price approach.
Due to the large number of variables involved, a column-generation approach is needed to solve the linear relaxation at a generic node of the search tree. An initial feasible solution is easily found by means of heuristics and used as the starting point to build the Restricted Master Problem (RMP). In order to find variables to be added to the RMP, we solve a dedicated subproblem, the pricing problem, which in our case presents a quadratic objective function. We propose different ways of solving the pricing problem. Among the exact methods, we consider the integer linear model obtained by linearizing the quadratic objective function and a Smart Enumeration approach, which partitions the set of feasible solutions and solves the pricing problem restricted to each subset, exploiting some extra available information to further reduce the size of the subproblems. As heuristic approaches, we first note that the pricing problem is easily solved for particular haplotypes. Then, to investigate the remaining solutions, we propose a local search-based heuristic and an Early-terminated Smart Enumeration, where we stop the Smart Enumeration approach as soon as we find a variable that can be added to the RMP.
The oscillatory behaviour of the dual variables involved in the definition of the pricing problem is limited by introducing a stabilization technique adapted to our formulation. In particular, we extended the proof of convergence of this procedure, that consists in using dual values obtained as convex combinations between real dual variables and a chosen stability center, to the cases in which the stabilized dual variables are feasible for the dual problem.
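As a hedged illustration of the stabilization idea above, the convex combination of dual values can be written as follows. The symbols are assumptions introduced here and are not the thesis's notation: $\pi^{k}$ denotes the dual solution of the current RMP, $\bar{\pi}$ the chosen stability center, and $\alpha$ a smoothing weight.

\[
\tilde{\pi}^{k} = \alpha\,\bar{\pi} + (1-\alpha)\,\pi^{k}, \qquad 0 \le \alpha < 1 .
\]

In such a scheme the pricing problem is solved with $\tilde{\pi}^{k}$ in place of $\pi^{k}$; with $\alpha = 0$ it reduces to standard column generation.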
In order to solve the integer model, the solution of the linear relaxation is embedded in a branch-and-price approach. The branching rule we present is inspired by the well-known Ryan-Foster branching rule for set-partitioning problems. The correctness of our approach has been proved.
Further observations on the similarity of the formulation's constraints to multiple set-covering ones suggest that we can relax a family of constraints to obtain a new formulation similar to a multiple set-covering. However, we note that the proposed branch-and-price algorithm applied to this formulation does not provide a feasible solution for HIPP, thus we need to integrate the proposed branching rule and recover a feasible optimal solution for HIPP.
This branch-and-price approach has been implemented in C++, with the aid of the SCIP libraries and the CPLEX solver. Results have been obtained from different classes of instances found in the literature, coming from real biological data and generated using ad-hoc programs, as well as newly generated ones. The branch-and-price approach proposed for our formulation proves to be competitive with state-of-the-art polynomial-sized formulations. In fact, the linear relaxation of our formulation is tighter than other linear relaxations and provides an effective starting solution for the branch-and-price algorithm. Results show that our approach is efficient, in particular on the set of instances that contain a larger number of genotypes.
We have therefore shown that a branch-and-price procedure provides a good solution approach for a formulation with an exponential number of variables and a polynomial number of constraints. Further work may include enhancements to the implementation details, such as exploring different ways of ordering the genotypes or combining heuristic and exact methods in the stabilized framework to solve the pricing problem. Moreover, it is possible to investigate the generalization of the proposed approach in order to solve set-partitioning problems.
Diversity, Distribution and Evolution of the Planktonic Diatom Family Chaetocerotaceae
The number and abundance of diatom species in environmental samples are traditionally counted by means of light microscopy (LM). However, recognizing, let alone counting, species is often challenging because of the existence of cryptic species and intraspecific phenotypic plasticity. Proper characterization requires isolation of cells, growing them into monoclonal cultures, and characterizing the cultures genetically and morphologically. However, not all species grow in culture, featureless ones are less likely to be isolated, and the procedure is laborious. High-throughput sequencing (HTS) metabarcoding bypasses morphology: DNA is collected from environmental samples, a particular marker is sequenced, and the resulting sequences are sorted into clusters or terminal clades assumed to represent species. Yet, reference barcodes of taxonomically validated species are needed to identify these clades. This exercise is the main aim of my thesis.
Since it is impossible to do this for all the diversity within a PhD thesis project, we selected Chaetocerotaceae, an abundant and diverse family of marine planktonic diatoms, containing two genera: Chaetoceros and Bacteriastrum. Its members uniquely share setae, thin siliceous tubes emerging from the valve corners, which facilitate detection in samples. Strains were obtained from the Gulf of Naples (GoN), from Central Chile and from Roscoff, at sites for which LTER time series data are available.
A total of 270 strains were obtained from these sites, and their 18S- and partial 28S rDNA sequences and morphological information gathered. The strains grouped into 60 genetically distinct species, thus providing a dataset of validated Chaetocerotacean 18S reference barcodes. Inferred molecular phylogenies showed monophyletic Chaetocerotaceae as well as monophyletic Bacteriastrum inside paraphyletic Chaetoceros, and the presence of cryptic diversity. To start with taxonomic updates, the species C. sporotruncatus and C. dichatoensis were described within the C. socialis species-complex based on spore morphology and sequence differences. Several rDNA sequences contained spliceosomal introns (ca. 100bp) and/or group-I introns (ca. 400bp). Phylogenies inferred from the introns did not corroborate rDNA phylogenies, suggesting horizontal gene transfer. Presence/absence of introns in conspecific strains sampled in different seasons suggests population differentiation between these seasons.
An HTS dataset consisting of V4-sequences (part of 18S) from 48 seawater samples taken over the seasons in the GoN revealed 76 terminal clades, of which 46 grouped with a reference barcode. Some of these species occur year-round whereas most others are seasonal. Surprisingly, of the 30 clades belonging to unknown Chaetocerotacean species, two appear to be among the most abundant in the GoN.
ppSAT: Towards Two-Party Private SAT Solving
We design and implement a privacy-preserving Boolean satisfiability (ppSAT) solver, which allows mutually distrustful parties to evaluate the conjunction of their input formulas while maintaining privacy. We first define a family of security guarantees reconcilable with the (known) exponential complexity of SAT solving, and then construct an oblivious variant of the classic DPLL algorithm which can be integrated with existing secure two-party computation (2PC) techniques. We further observe that most known SAT solving heuristics are unsuitable for 2PC, as they are highly data-dependent in order to minimize the number of exploration steps. Faced with how best to trade off between the number of steps and the cost of obliviously executing each one, we design three efficient oblivious heuristics, one deterministic and two randomized. As a result of this effort we are able to evaluate our ppSAT solver on small but practical instances arising from the haplotype inference problem in bioinformatics. We conclude by looking towards future directions for making ppSAT solving more practical, in particular the integration of conflict-driven clause learning (CDCL).
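For readers unfamiliar with DPLL, the following is a minimal sketch of the classic (non-oblivious) algorithm that the abstract takes as its starting point. It is not the ppSAT construction: the clause representation (lists of non-zero integers, negative meaning negated) and the function names are assumptions made for illustration, and the branching and unit-propagation steps here depend directly on the data, which is the kind of behaviour an oblivious variant would have to avoid.

```python
def simplify(clauses, lit):
    """Set lit to True: drop satisfied clauses, remove -lit from the others."""
    out = []
    for clause in clauses:
        if lit in clause:
            continue  # clause already satisfied
        out.append([x for x in clause if x != -lit])
    return out

def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    # Unit propagation: repeatedly assign literals forced by one-literal clauses.
    while True:
        if any(len(c) == 0 for c in clauses):
            return None  # an empty clause means a conflict
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
    if not clauses:
        return assignment  # every clause is satisfied
    # Branch on the first literal of the first remaining clause.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        extended = {**assignment, abs(choice): choice > 0}
        result = dpll(simplify(clauses, choice), extended)
        if result is not None:
            return result
    return None

formula = [[1, 2], [-1, 3], [-2, -3]]
print(dpll(formula))
```

Variables that never get assigned can take either value; the returned dictionary only lists the variables the search actually fixed.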