Haplotype inference in general pedigrees with two sites
Background: Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with two sites for all members. Results: We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem and therefore can be solved by an O(2^k · n^2) exact algorithm, where n is the number of members and k is the number of recombination events. Conclusions: Our work can therefore be useful for genetic disease studies to track down how changes in haplotypes such as recombinations relate to genetic disease.
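As a rough illustration of the target problem named in this abstract, Bipartization by Edge Removal asks for the fewest edges whose deletion leaves a graph bipartite. The sketch below is a plain brute-force search, not the parameterized algorithm from the paper; the function names `is_bipartite` and `min_edge_bipartization` and the edge-list input format are assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def is_bipartite(n, edges):
    """Return True if the graph on vertices 0..n-1 can be 2-colored."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    # Two adjacent vertices share a color: odd cycle found.
                    return False
    return True


def min_edge_bipartization(n, edges):
    """Brute force: try removal sets of growing size k until one works.

    Exponential in the number of edges; meant only for tiny examples.
    Returns (k, removed_edges).
    """
    for k in range(len(edges) + 1):
        for removed in combinations(range(len(edges)), k):
            dropped = set(removed)
            kept = [e for i, e in enumerate(edges) if i not in dropped]
            if is_bipartite(n, kept):
                return k, [edges[i] for i in removed]


# Two triangles sharing the edge (1, 2): one removal suffices.
k, removed = min_edge_bipartization(4, [(0, 1), (1, 2), (2, 0), (1, 3), (3, 2)])
print(k, removed)
```

In the setting of the abstract, the parameter k would correspond to the number of recombination events; how the pedigree is encoded as a graph is specific to the paper and is not reproduced here.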
An FPT haplotyping algorithm on pedigrees with a small number of sites
Background: Genetic disease studies investigate relationships between changes in chromosomes and genetic diseases. Single haplotypes provide useful information for these studies but extracting single haplotypes directly by biochemical methods is expensive. A computational method to infer haplotypes from genotype data is therefore important. We investigate the problem of computing the minimum number of recombination events for general pedigrees with a small number of sites for all members. Results: We show that this NP-hard problem can be parametrically reduced to the Bipartization by Edge Removal problem with additional parity constraints. We solve this problem with an exact algorithm that runs in [formula not reproduced in source text] time, where n is the number of members, m is the number of sites, and k is the number of recombination events. Conclusions: This algorithm infers haplotypes for a small number of sites, which can be useful for genetic disease studies to track down how changes in haplotypes such as recombinations relate to genetic disease.
Parsimony-based genetic algorithm for haplotype resolution and block partitioning
This dissertation proposes a new algorithm for performing simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block identification criterion. The proposed algorithm incorporates missing data as a part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm is within the range achieved by the other algorithms. The block structure output by our algorithm in general agrees with the block structure that the other algorithms produce for the same data. Thus, the proposed algorithm can be successfully used for block partitioning and haplotype phasing while providing some valuable new features, such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.
Haplotype Inference through Sequential Monte Carlo
Technological advances in the last decade have given rise to large Genome Wide Studies, which have helped researchers gain better insights into the genetic basis of many common diseases. As the number of samples and the genome coverage have increased dramatically, it is now typical for individuals to be genotyped on high-throughput platforms at more than 500,000 Single Nucleotide Polymorphisms. At the same time, theoretical and empirical arguments have been made for the use of haplotypes, i.e. combinations of alleles at multiple loci in individual chromosomes, as opposed to genotypes, so the problem of haplotype inference is particularly relevant. Existing haplotyping methods include population-based methods, methods for pooled DNA samples, and methods for family and pedigree data. Furthermore, the vast amount of available data poses new challenges for haplotyping algorithms. Candidate methods should scale well to the size of the datasets, as the number of loci and the number of individuals are well into the thousands. In addition, as genotyping can be performed routinely, researchers encounter a number of specific new scenarios, which can be seen as hybrids between the population and pedigree inference scenarios and require special care to incorporate the maximum amount of information. In this thesis we present a Sequential Monte Carlo framework (TDS) and tailor it to address instances of haplotype inference and frequency estimation problems. Specifically, we first adjust our framework to perform haplotype inference in trio families, resulting in a methodology that demonstrates an excellent tradeoff between speed and accuracy. Subsequently, we extend our method to handle general nuclear families and demonstrate the gain from using our approach as opposed to alternative scenarios. We further address the problem of haplotype inference in pooling data, where we show that our method achieves improved performance over existing approaches in datasets with a large number of markers.
We finally present a framework to handle the haplotype inference problem in regions of CNV/SNP data. Using our approach we can phase datasets where the ploidy of an individual can vary along the region and each individual can have different breakpoints.
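To make the haplotype inference problem concrete: an unphased diploid genotype is compatible with several haplotype pairs, and a phasing method has to choose among them. The sketch below enumerates those pairs for a single individual. It is only an illustration of the search space, not the Sequential Monte Carlo method of the thesis; the 0/1/2 genotype coding (copies of allele 1 at each biallelic site) and the function name `compatible_haplotype_pairs` are assumptions made for this example.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    Assumed coding: each entry counts copies of allele 1 at a biallelic
    site, so 0 and 2 are homozygous and 1 is heterozygous.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype entries must be 0, 1 or 2")
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes (0 -> 0, 2 -> 1).
    base = [g // 2 for g in genotype]
    if not het_sites:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Pin allele 0 on the first haplotype at the first heterozygous site,
    # so each unordered pair is generated exactly once.
    for rest in product((0, 1), repeat=len(het_sites) - 1):
        h1, h2 = list(base), list(base)
        for site, allele in zip(het_sites, (0,) + rest):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


for pair in compatible_haplotype_pairs([1, 1, 0, 2]):
    print(pair)
```

With h heterozygous sites there are 2^(h-1) candidate pairs, which is why exhaustive enumeration stops being practical at the marker counts mentioned in the abstract and why statistical or pedigree information is needed to choose among candidates.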
Statistical physics methods in computational biology
The interest of statistical physics in combinatorial optimization is not new; it suffices to think of a well-known tool such as simulated annealing. Recently, statistical physics has also turned to statistical inference to address some "hard" optimization problems, developing a new class of message passing algorithms. Three applications to computational biology are presented in this thesis, namely:
1) Boolean networks, a model for gene regulatory networks;
2) haplotype inference, to study the genetic information present in a population;
3) clustering, a general machine learning tool.
Estimating genealogies from linked marker data: a Bayesian approach
Background: Answers to several fundamental questions in statistical genetics would ideally require knowledge of the ancestral pedigree and of the gene flow therein. A few examples of such questions are haplotype estimation, relatedness and relationship estimation, gene mapping by combining pedigree and linkage disequilibrium information, and estimation of population structure. Results: We present a probabilistic method for genealogy reconstruction. Starting with a group of genotyped individuals from some population isolate, we explore the state space of their possible ancestral histories under our Bayesian model by using Markov chain Monte Carlo (MCMC) sampling techniques. The main contribution of our work is the development of sampling algorithms in the resulting vast state space with highly dependent variables. The main drawback is the computational complexity that limits the time horizon within which explicit reconstructions can be carried out in practice. Conclusion: The estimates for IBD (identity-by-descent) and haplotype distributions are tested in several settings using simulated data. The results appear to be promising for a further development of the method.
Modelling dependencies in genetic-marker data and its application to haplotype analysis
The objective of this thesis is to develop new methods to reconstruct haplotypes from phase-unknown genotypes. The need for new methodologies is motivated by the increasing availability of high-resolution marker data for many species. Such markers typically exhibit correlations, a phenomenon known as Linkage Disequilibrium (LD). It is believed that reconstructed haplotypes for markers in high LD can be valuable for a variety of application areas in population genetics, including reconstructing population history and identifying genetic disease variants.
Traditionally, haplotype reconstruction methods can be categorized according to whether they operate on a single pedigree or a collection of unrelated individuals. The thesis begins with a critical assessment of the limitations of existing methods, and then presents a unified statistical framework that can accommodate pedigree data, unrelated individuals and tightly linked markers. The framework makes use of graphical models, where inference entails representing the relevant joint probability distribution as a graph and then using associated algorithms to facilitate computation. The graphical model formalism provides invaluable tools to facilitate model specification, visualization, and inference.
Once the unified framework is developed, a broad range of simulation studies are conducted using previously published haplotype data. Important contributions include demonstrating the different ways in which the haplotype frequency distribution can impact the accuracy of both the phase assignments and haplotype frequency estimates; evaluating the effectiveness of using family data to improve accuracy for different frequency profiles; and, assessing the dangers of treating related individuals as unrelated in an association study.