Parsimony-based genetic algorithm for haplotype resolution and block partitioning

Author: Sazonova Nadezhda A.
Publication venue: The Research Repository @ WVU
Publication date: 01/12/2007

This dissertation proposes a new algorithm for simultaneous haplotype resolution and block partitioning. The algorithm is based on a genetic algorithm approach and the parsimony principle. A multilocus LD measure (Normalized Entropy Difference) is used as the block identification criterion. The proposed algorithm incorporates missing data as part of the model and allows blocks of arbitrary length. In addition, the algorithm provides scores for the block boundaries, which measure the strength of the boundaries at specific positions. The performance of the proposed algorithm was validated by running it on several publicly available data sets, including the HapMap data, and comparing the results to those of existing state-of-the-art algorithms. The results show that the accuracy of haplotype decomposition achieved by the proposed genetic algorithm falls within the range achieved by the other algorithms. The block structure output by the algorithm generally agrees with the block structure that the other algorithms produce for the same data. Thus, the proposed algorithm can be used for block partitioning and haplotype phasing while providing new features such as scores for block boundaries and fully incorporated treatment of missing data. In addition, the proposed algorithm for haplotyping and block partitioning is used in the development of a new clustering algorithm for two-population mixed genotype samples. The proposed clustering algorithm extracts from the given genotype sample two clusters with substantially different block structures and finds the haplotype resolution and block partitioning for each cluster.
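
The abstract names the Normalized Entropy Difference but does not define it. The sketch below assumes one common formulation of such a measure: the entropy expected under linkage equilibrium (the sum of single-locus entropies) minus the observed haplotype entropy, divided by the equilibrium entropy. The function names and the representation of haplotypes as equal-length strings are illustrative assumptions, not taken from the dissertation.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of an empirical distribution given as counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def normalized_entropy_difference(haplotypes):
    """Assumed definition: (H_equilibrium - H_block) / H_equilibrium.

    haplotypes: list of equal-length strings, one character per locus.
    Returns 0.0 when every locus is monomorphic (no entropy to normalize by).
    """
    n_loci = len(haplotypes[0])
    # Entropy of the observed multilocus haplotype distribution.
    h_block = entropy(Counter(haplotypes).values())
    # Entropy under independence: sum of the per-locus allele entropies.
    h_equil = sum(
        entropy(Counter(h[j] for h in haplotypes).values())
        for j in range(n_loci)
    )
    if h_equil == 0:
        return 0.0
    return (h_equil - h_block) / h_equil
```

Under this assumed definition, values near 0 indicate loci that behave independently, and larger values indicate stronger multilocus LD within the candidate block.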

The Research Repository @ WVU (West Virginia University)

A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies

Author: Leray Philippe
Mourad Raphaël
Sinoquet Christine
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Discovering the genetic basis of common genetic diseases in the human genome represents a public health issue. However, the dimensionality of the genetic data (up to 1 million genetic markers) and its complexity make the statistical analysis a challenging task. Results: We present an accurate modeling of dependences between genetic markers, based on a forest of hierarchical latent class models, which is a particular class of probabilistic graphical models. This model offers an adapted framework to deal with the fuzzy nature of linkage disequilibrium blocks. In addition, the data dimensionality can be reduced through the latent variables of the model, which synthesize the information borne by genetic markers. To tackle the learning of both the forest structure and the probability distributions, a generic algorithm has been proposed. A first implementation of our algorithm has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals. Conclusions: The forest of hierarchical latent class models offers several advantages for genome-wide association studies: accurate modeling of linkage disequilibrium, flexible data dimensionality reduction, and biological meaning borne by latent variables.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Haplotype Inference through Sequential Monte Carlo

Author: Iliadis Alexandros
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2013

Technological advances in the last decade have given rise to large genome-wide studies, which have helped researchers gain better insight into the genetic basis of many common diseases. As the number of samples and genome coverage have increased dramatically, individuals are now typically genotyped on high-throughput platforms at more than 500,000 Single Nucleotide Polymorphisms. At the same time, theoretical and empirical arguments have been made for the use of haplotypes, i.e. combinations of alleles at multiple loci in individual chromosomes, as opposed to genotypes, so the problem of haplotype inference is particularly relevant. Existing haplotyping methods include population-based methods, methods for pooled DNA samples, and methods for family and pedigree data. Furthermore, the vast amount of available data poses new challenges for haplotyping algorithms. Candidate methods should scale well to the size of the datasets, as the number of loci and the number of individuals are well into the thousands. In addition, as genotyping can be performed routinely, researchers encounter a number of specific new scenarios, which can be seen as hybrids between the population and pedigree inference scenarios and require special care to incorporate the maximum amount of information. In this thesis we present a Sequential Monte Carlo framework (TDS) and tailor it to address instances of haplotype inference and frequency estimation problems. Specifically, we first adjust our framework to perform haplotype inference in trio families, resulting in a methodology that demonstrates an excellent trade-off between speed and accuracy. We then extend our method to handle general nuclear families and demonstrate the gain from using our approach as opposed to alternative scenarios. We further address the problem of haplotype inference in pooling data, in which we show that our method achieves improved performance over existing approaches in datasets with a large number of markers. We finally present a framework to handle the haplotype inference problem in regions of CNV/SNP data. Using our approach we can phase datasets where the ploidy of an individual can vary along the region and each individual can have different breakpoints.

Columbia University Academic Commons

Quantification and Visualization of LD Patterns and Identification of Haplotype Blocks

Author: Dudoit Sandrine
Wang Yan
Publication venue: Collection of Biostatistics Research Archive
Publication date: 29/06/2004

Classical measures of linkage disequilibrium (LD) between two loci, based only on the joint distribution of alleles at these loci, present noisy patterns. In this paper, we propose a new distance-based LD measure, R, which takes into account multilocus haplotypes around the two loci in order to exploit information from neighboring loci. The LD measure R yields a matrix of pairwise distances between markers, based on the correlation between the lengths of shared haplotypes among chromosomes around these markers. Data analysis demonstrates that visualization of LD patterns through the R matrix reveals more deterministic patterns, with much less noise, than using classical LD measures. Moreover, the patterns are highly compatible with recently suggested models of haplotype block structure. We propose to apply the new LD measure to define haplotype blocks through cluster analysis. Specifically, we present a distance-based clustering algorithm, DHPBlocker, which performs hierarchical partitioning of an ordered sequence of markers into disjoint and adjacent blocks with a hierarchical structure. The proposed method integrates information on the two main existing criteria in defining haplotype blocks, namely, LD and haplotype diversity, through the use of silhouette width and description length as cluster validity measures, respectively. The new LD measure and clustering procedure are applied to single nucleotide polymorphism (SNP) datasets from the human 5q31 region (Daly et al. 2001) and the class II region of the human major histocompatibility complex (Jeffreys et al. 2001). Our results are in good agreement with published results. In addition, analyses performed on different subsets of markers indicate that the method is robust with regard to the allele frequency and density of the genotyped markers. Unlike previously proposed methods, our new cluster-based method can uncover hierarchical relationships among blocks and can be applied to polymorphic DNA markers or amino acid sequence data.
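
For context on the "classical measures" the abstract contrasts itself with, here is a minimal sketch of the standard two-locus statistics D, |D'| and r², computed from phased biallelic haplotypes. It does not implement the paper's measure R. The function name and the 0/1 tuple encoding of haplotypes are assumptions made for illustration.

```python
def pairwise_ld(haplotypes, i, j):
    """Classical two-locus LD statistics from phased biallelic haplotypes.

    haplotypes: list of sequences of 0/1 alleles (assumed encoding).
    Returns (D, |D'|, r^2); a ratio is NaN when its denominator is zero,
    i.e. when one of the two loci is monomorphic.
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at locus i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at locus j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype

    d = p_ab - p_a * p_b

    # r^2: squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")

    # D': D scaled by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")

    return d, d_prime, r2
```

Because these statistics use only the joint allele distribution at the two loci, they ignore the neighboring markers that the measure R described above is designed to exploit.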

Collection Of Biostatistics Research Archive

A haplotype inference algorithm for trios based on deterministic sampling

Author: Anastassiou Dimitris
Iliadis Alexandros
Wang Xiaodong
Watkinson John
Publication venue: BioMed Central
Publication date: 01/01/2010

In genome-wide association studies, thousands of individuals are genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). Statistical power can be increased when haplotypes, rather than three-valued genotypes, are used in analysis, so the problem of haplotype phase inference (phasing) is particularly relevant. Several phasing algorithms have been developed for data from unrelated individuals, based on different models, some of which have been extended to father-mother-child "trio" data. We introduce a technique for phasing trio datasets using a tree-based deterministic sampling scheme. We have compared our method with the publicly available algorithms PHASE v2.1, BEAGLE v3.0.2 and 2SNP v1.7 on datasets with varying numbers of markers and trios. We have found that the computational complexity of PHASE makes it prohibitive for routine use; on the other hand 2SNP, though the fastest method for small datasets, was significantly inaccurate. We have shown that our method outperforms BEAGLE in both speed and accuracy for small to intermediate numbers of trios, for all marker sizes examined. Our method is implemented in the "Tree-Based Deterministic Sampling" (TDS) package, available for download at http://www.ee.columbia.edu/~anastas/tds. Using a Tree-Based Deterministic Sampling technique, we present an intuitive and conceptually simple phasing algorithm for trio data. The trade-off between speed and accuracy achieved by our algorithm makes it a strong candidate for routine use on trio datasets.
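
To illustrate why trio data helps with phasing, the sketch below resolves the child's phase at a single biallelic site from the three genotypes using Mendelian transmission rules alone. This is not the TDS algorithm described in the abstract; it only shows the per-site constraint that any trio phasing method can exploit. The function names and the 0/1/2 genotype encoding (count of the alternate allele) are assumptions.

```python
def transmissible(genotype):
    """Alleles (0 or 1) that a parent with the given genotype can transmit.

    Assumed encoding: genotype is the count of the alternate allele (0, 1 or 2).
    """
    return {0: (0,), 1: (0, 1), 2: (1,)}[genotype]


def phase_child_site(child, father, mother):
    """Return (paternal_allele, maternal_allele) for the child at one site.

    Returns None when the site is ambiguous (all three individuals are
    heterozygous) and raises ValueError on a Mendelian inconsistency.
    """
    options = [
        (p, m)
        for p in transmissible(father)
        for m in transmissible(mother)
        if p + m == child
    ]
    if not options:
        raise ValueError("genotypes are not Mendelian-consistent")
    return options[0] if len(options) == 1 else None
```

For example, `phase_child_site(1, 0, 1)` gives `(0, 1)`: the father can only transmit allele 0, so the child's alternate allele must be maternal. Sites where all three individuals are heterozygous remain unresolved, and a statistical model is needed to phase them.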

Crossref

Springer - Publisher Connector

Columbia University Academic Commons

PubMed Central

Haplotype block partitioning as a tool for dimensionality reduction in SNP association studies

Author: Fallin Danièle M
Parmigiani Giovanni
Pattaro Cristian
Ruczinski Ingo
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Identification of disease-related genes in association studies is challenged by the large number of SNPs typed. To address the dilution of power caused by high dimensionality, and to generate results that are biologically interpretable, it is critical to take into consideration the spatial correlation of SNPs along the genome. With the goal of identifying true genetic associations, partitioning the genome according to spatial correlation can be a powerful and meaningful way to address this dimensionality problem. Results: We developed and validated an MCMC Algorithm To Identify blocks of Linkage DisEquilibrium (MATILDE) for clustering contiguous SNPs, and a statistical testing framework to detect association using partitions as units of analysis. We compared its ability to detect true SNP associations to that of the most commonly used algorithm for block partitioning, as implemented in the Haploview and HapBlock software. Simulations were based on artificially assigning phenotypes to individuals with SNPs corresponding to region 14q11 of the HapMap database. When block partitioning is performed using MATILDE, the ability to correctly identify a disease SNP is higher, especially for small effects, than it is with the alternatives considered. Advantages can be both in terms of true positive findings and in limiting the number of false discoveries. Finer partitions provided by LD-based methods or by marker-by-marker analysis are efficient only for detecting big effects, or in the presence of large sample sizes. The probabilistic approach we propose offers several additional advantages, including: a) adapting the estimation of blocks to the population, technology, and sample size of the study; b) probabilistic assessment of uncertainty about block boundaries and about whether any two SNPs are in the same block; c) user selection of the probability threshold for assigning SNPs to the same block. Conclusion: We demonstrate that, in realistic scenarios, our adaptive, study-specific block partitioning approach is at least as efficient as currently available LD-based approaches in guiding the search for disease loci.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A model-based approach to selection of tag SNPs

Author: A Barron
A Thomas
AP Dempster
B Halldórsson
BV Halldórsson
CE Shannon
CS Carlson
D Botstein
DC Crawford
EC Anderson
Fengzhu Sun
G Schwarz
GA McVean
H Akaike
H Mannila
J Besag
JD Wall
JFC Kingman
JN Hirschhorn
K Zhang
L Breiman
L Excoffier
L Li
LE Baum
Lei M Li
LR Rabiner
M Koivisto
M Nothnagel
M Stephens
MJ Daly
N Li
N Patil
Pierre Nicolas
S Lin
SB Gabriel
SE Ptak
T Niu
TG Schulze
The International HapMap Consortium
TM Cover
W Zhai
X Ke
X Sun
Z Liu
Z Meng
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphism found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs, and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs. This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software that implements our approach is available.
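
The entropy-maximization strategy in this abstract can be sketched with a greedy search that repeatedly adds the SNP giving the largest joint entropy of the tag set. Two simplifications are assumed here and are not from the paper: entropy is taken from the empirical haplotype distribution rather than from a fitted model such as the Li and Stephens hidden Markov model, and the greedy search only approximates the optimal set. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, loci):
    """Empirical joint entropy (bits) of the haplotypes restricted to `loci`."""
    counts = Counter(tuple(h[j] for j in loci) for h in haplotypes)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def greedy_tag_snps(haplotypes, k):
    """Greedily pick up to k loci, each step maximizing joint entropy.

    This is an approximation under the stated assumptions; ties are broken
    in favor of the lowest locus index.
    """
    n_loci = len(haplotypes[0])
    selected = []
    for _ in range(min(k, n_loci)):
        candidates = [j for j in range(n_loci) if j not in selected]
        best = max(
            candidates,
            key=lambda j: joint_entropy(haplotypes, selected + [j]),
        )
        selected.append(best)
    return selected
```

When two SNPs are perfectly correlated, adding the second contributes no entropy, so the greedy step prefers a SNP that carries new information. This mirrors the compression analogy drawn in the abstract.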

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

HAL Descartes

Hal-Diderot

Visualization of Pairwise and Multilocus Linkage Disequilibrium Structure Using Latent Forests

Author: Dina Christian
Leray Philippe
Mourad Raphaël
Sinoquet Christine
Publication venue: Public Library of Science
Publication date: 01/01/2011

The study of linkage disequilibrium is a major topic in statistical genetics, as it plays a fundamental role in gene mapping and helps us to learn more about human history. The complex structure of linkage disequilibrium makes exploratory data analysis essential yet challenging. Visualization methods, such as the triangular heat map implemented in Haploview, provide simple and useful tools to help understand complex genetic patterns, but remain insufficient to fully describe them. Probabilistic graphical models have been widely recognized as a powerful formalism allowing a concise and accurate modeling of dependences between variables. In this paper, we propose a method for short-range, long-range and chromosome-wide linkage disequilibrium visualization using forests of hierarchical latent class models. Thanks to its hierarchical nature, our method is shown to provide a compact view of both pairwise and multilocus linkage disequilibrium spatial structures for the geneticist. In addition, a multilocus linkage disequilibrium measure has been designed to evaluate linkage disequilibrium in hierarchy clusters. To learn the proposed model, a new scalable algorithm is presented. It constrains the dependence scope, relying on physical positions, and is able to deal with more than one hundred thousand single nucleotide polymorphisms. The proposed algorithm is fast and does not require phased genotypic data.

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central