102 research outputs found
Efficient Haplotype Inference with Pseudo-Boolean Optimization
Abstract. Haplotype inference from genotype data is a key computational problem in bioinformatics, since retrieving directly haplotype information from DNA samples is not feasible using existing technology. One of the methods for solving this problem uses the pure parsimony criterion, an approach known as Haplotype Inference by Pure Parsimony (HIPP). Initial work in this area was based on a number of different Integer Linear Programming (ILP) models and branch and bound algorithms. Recent work has shown that the utilization of a Boolean Satisfiability (SAT) formulation and state of the art SAT solvers represents the most efficient approach for solving the HIPP problem. Motivated by the promising results obtained using SAT techniques, this paper investigates the utilization of modern Pseudo-Boolean Optimization (PBO) algorithms for solving the HIPP problem. The paper starts by applying PBO to existing ILP models. The results are promising, and motivate the development of a new PBO model (RPoly) for the HIPP problem, which has a compact representation and eliminates key symmetries. Experimental results indicate that RPoly outperforms the SAT-based approach on most problem instances, being, in general, significantly more efficient
TRIO LOGIC REGRESSION - DETECTION OF SNP - SNP INTERACTIONS IN CASE-PARENT TRIOS
Statistical approaches to evaluate higher order SNP-SNP and SNP-environment interactions are critical in genetic association studies, as susceptibility to complex disease is likely to be related to the interaction of multiple SNPs and environmental factors. Logic regression (Kooperberg et al., 2001; Ruczinski et al., 2003) is one such approach, where interactions between SNPs and environmental variables are assessed in a regression framework, and interactions become part of the model search space. In this manuscript we extend the logic regression methodology, originally developed for cohort and case-control studies, for studies of trios with affected probands. Trio logic regression accounts for the linkage disequilibrium (LD) structure in the genotype data, and accommodates missing genotypes via haplotype-based imputation. We also derive an efficient algorithm to simulate case-parent trios where genetic risk is determined via epistatic interactions
Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT
<p>Abstract</p> <p>Background</p> <p>Haplotype inference based on unphased SNP markers is an important task in population genetics. Although there are different approaches to the inference of haplotypes in diploid species, the existing software is not suitable for inferring haplotypes from unphased SNP data in polyploid species, such as the cultivated potato (<it>Solanum tuberosum</it>). Potato species are tetraploid and highly heterozygous.</p> <p>Results</p> <p>Here we present the software SATlotyper which is able to handle polyploid and polyallelic data. SATlo-typer uses the Boolean satisfiability problem to formulate Haplotype Inference by Pure Parsimony. The software excludes existing haplotype inferences, thus allowing for calculation of alternative inferences. As it is not known which of the multiple haplotype inferences are best supported by the given unphased data set, we use a bootstrapping procedure that allows for scoring of alternative inferences. Finally, by means of the bootstrapping scores, it is possible to optimise the phased genotypes belonging to a given haplotype inference. The program is evaluated with simulated and experimental SNP data generated for heterozygous tetraploid populations of potato. We show that, instead of taking the first haplotype inference reported by the program, we can significantly improve the quality of the final result by applying additional methods that include scoring of the alternative haplotype inferences and genotype optimisation. For a sub-population of nineteen individuals, the predicted results computed by SATlotyper were directly compared with results obtained by experimental haplotype inference via sequencing of cloned amplicons. Prediction and experiment gave similar results regarding the inferred haplotypes and phased genotypes.</p> <p>Conclusion</p> <p>Our results suggest that Haplotype Inference by Pure Parsimony can be solved efficiently by the SAT approach, even for data sets of unphased SNP from heterozygous polyploids. SATlotyper is freeware and is distributed as a Java JAR file. The software can be downloaded from the webpage of the GABI Primary Database at <url>http://www.gabipd.org/projects/satlotyper/</url>. The application of SATlotyper will provide haplotype information, which can be used in haplotype association mapping studies of polyploid plants.</p
Incremental Cardinality Constraints for MaxSAT
Maximum Satisfiability (MaxSAT) is an optimization variant of the Boolean
Satisfiability (SAT) problem. In general, MaxSAT algorithms perform a
succession of SAT solver calls to reach an optimum solution making extensive
use of cardinality constraints. Many of these algorithms are non-incremental in
nature, i.e. at each iteration the formula is rebuilt and no knowledge is
reused from one iteration to another. In this paper, we exploit the knowledge
acquired across iterations using novel schemes to use cardinality constraints
in an incremental fashion. We integrate these schemes with several MaxSAT
algorithms. Our experimental results show a significant performance boost for
these algo- rithms as compared to their non-incremental counterparts. These
results suggest that incremental cardinality constraints could be beneficial
for other constraint solving domains.Comment: 18 pages, 4 figures, 1 table. Final version published in Principles
and Practice of Constraint Programming (CP) 201
Statistical physics methods in computational biology
The interest of statistical physics for combinatorial optimization is not new, it suffices to think of a famous tool as
simulated annealing. Recently, it has also resorted to statistical inference to address some "hard" optimization problems, developing a new class of message passing algorithms. Three applications to computational biology are presented in this thesis, namely:
1) Boolean networks, a model for gene regulatory networks;
2) haplotype inference, to study the genetic information present in a population;
3) clustering, a general machine learning tool
A Preprocessing Procedure for Haplotype Inference by Pure Parsimony
Haplotype data is especially important in the study of complex diseases
since it contains more information than genotype data. However,
obtaining haplotype data is technically difficult and expensive. Computational
methods have proved to be an effective way of inferring haplotype
data from genotype data. One of these methods, the haplotype inference
by pure parsimony approach (HIPP), casts the problem as an optimization
problem and as such has been proved to be NP-hard. We have designed
and developed a new preprocessing procedure for this problem. Our proposed
algorithm works with groups of haplotypes rather than individual
haplotypes. It iterates searching and deleting haplotypes that are not
helpful in order to find the optimal solution. This preprocess can be coupled
with any of the current solvers for the HIPP that need to preprocess
the genotype data. In order to test it, we have used two state-of-the-art
solvers, RTIP and GAHAP, and simulated and real HapMap data. Due to
the computational time and memory reduction caused by our preprocess,
problem instances that were previously unaffordable can be now efficiently
solved
Pseudo-Booleanilainen optimisaatio käyttäen implisiittisiä osumisjoukkoja
There are many computationally difficult problems where the task is to find a solution with the lowest cost possible that fulfills a given set of constraints. Such problems are often NP-hard and are encountered in a variety of real-world problem domains, including planning and scheduling. NP-hard problems are often solved using a declarative approach by encoding the problem into a declarative constraint language and solving the encoding using a generic algorithm for that language. In this thesis we focus on pseudo-Boolean optimization (PBO), a special class of integer programs (IP) that only contain variables that admit the values 0 and 1.
We propose a novel approach to PBO that is based on the implicit hitting set (IHS) paradigm, which uses two separate components. An IP solver is used to find an optimal solution under an incomplete set of constraints. A pseudo-Boolean satisfiability solver is used to either validate the feasibility of the solution or to extract more constraints to the integer program. The IHS-based PBO algorithm iteratively invokes the two algorithms until an optimal solution to a given PBO instance is found.
In this thesis we lay out the IHS-based PBO solving approach in detail. We implement the algorithm as the PBO-IHS solver by making use of recent advances in reasoning techniques for pseudo-Boolean constraints. Through extensive empirical evaluation we show that our PBO-IHS solver outperforms other available specialized PBO solvers and has complementary performance compared to classical integer programming techniques
- …