ISHAPE: new rapid and accurate software for haplotyping
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licens
A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data
The perennial problem of "how many clusters?" remains an issue of substantial
interest in data mining and machine learning communities, and becomes
particularly salient in large data sets such as populational genomic data where
the number of clusters needs to be relatively large and open-ended. This
problem gets further complicated in a co-clustering scenario in which one needs
to solve multiple clustering problems simultaneously because of the presence of
common centroids (e.g., ancestors) shared by clusters (e.g., possible descendants
from a certain ancestor) from different multiple-cluster samples (e.g.,
different human subpopulations). In this paper we present a hierarchical
nonparametric Bayesian model to address this problem in the context of
multi-population haplotype inference. Uncovering the haplotypes of single
nucleotide polymorphisms is essential for many biological and medical
applications. While it is not uncommon for the genotype data to be pooled from
multiple ethnically distinct populations, few existing programs have explicitly
leveraged the individual ethnic information for haplotype inference. In this
paper we present a new haplotype inference program, Haploi, which makes use of
such information and is readily applicable to genotype sequences with thousands
of SNPs from heterogeneous populations, with competent and sometimes superior
speed and accuracy compared to the state-of-the-art programs. Underlying
Haploi is a new haplotype distribution model based on a nonparametric Bayesian
formalism known as the hierarchical Dirichlet process, which represents a
tractable surrogate to the coalescent process. The proposed model is
exchangeable, unbounded, and capable of coupling demographic information of
different populations. Comment: Published at http://dx.doi.org/10.1214/08-AOAS225 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques
Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics
Empirical vs Bayesian approach for estimating haplotypes from genotypes of unrelated individuals
BACKGROUND: The completion of the HapMap project has stimulated further development of haplotype-based methodologies for disease associations. A key aspect of such development is the statistical inference of individual diplotypes from unphased genotypes. Several methodologies for inferring haplotypes have been developed, but they have not been evaluated extensively to determine which method not only performs well, but also can be easily incorporated in downstream haplotype-based association analyses. In this paper, we attempt to do so. Our evaluation was carried out by comparing the two leading Bayesian methods, implemented in PHASE and HAPLOTYPER, and the two leading empirical methods, implemented in PL-EM and HPlus. We used these methods to analyze real data, namely the dense genotypes on the X chromosome of 30 European and 30 African trios provided by the International HapMap Project, and simulated genotype data. Our conclusions are based on these analyses. RESULTS: All programs performed very well on X-chromosome data, with an average similarity index of 0.99 and an average prediction rate of 0.99 for both European and African trios. On simulated data with approximation of coalescence, PHASE, which implements the Bayesian method based on the coalescence approximation, outperformed other programs on small sample sizes. When the sample size increased, other programs performed as well as PHASE. PL-EM and HPlus, which implement empirical methods, required much less running time than the programs implementing the Bayesian methods. They required only one hundredth or one thousandth of the running time required by PHASE, particularly when analyzing large sample sizes and large numbers of SNPs.
CONCLUSION: For large sample sizes (hundreds or more), which most association studies require, the two empirical methods might be used since they infer the haplotypes as accurately as any Bayesian methods and can be incorporated easily into downstream haplotype-based analyses such as haplotype-association analyses
Haplotype Assembly and Small Variant Calling using Emerging Sequencing Technologies
Short read DNA sequencing technologies from Illumina have made sequencing a human genome significantly more affordable, greatly accelerating studies of biological function and the association of genetic variants to disease. These technologies are frequently used to detect small genetic variants such as single nucleotide variants (SNVs) using a reference genome. However, short read sequencing technologies have several limitations. First, the human genome is diploid and short reads contain limited information for assembling haplotypes, or the sequences of alleles on homologous chromosomes. Moreover, there is significant input DNA required, which poses challenges for analyzing single cells. Further, there is limited ability to detect genetic variants inside long duplicated sequences that occur in the genome. As a result, there has been widespread development of novel methods to overcome these deficiencies using short reads. These include clone based sequencing, linked read sequencing, and proximity ligation sequencing, as well as various single cell sequencing methods. There are also entirely new sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies that produce significantly longer reads. While these emerging methods and technologies demonstrate improvements compared to short reads, they also have properties and error modalities that pose unique computational challenges. Moreover, there is a shortage of bioinformatics methods for accurate small variant detection and haplotype assembly using these approaches compared to short reads. This dissertation aims to address this problem with the introduction of several new algorithms for highly accurate haplotype assembly and SNV calling. First, it introduces HapCUT2, an algorithm that can rapidly assemble haplotypes using a broad range of sequencing technologies. 
Second, it introduces an algorithm for variant calling and haplotyping using SISSOR, a recently introduced microfluidics based technology for sequencing single cells. Finally, it introduces Longshot, an algorithm for detecting and phasing SNVs using error-prone long read technologies. In each case, the algorithms are benchmarked using multiple real whole-genome sequencing datasets and are found to be highly accurate. The methods introduced in this dissertation contribute to the goal of sequencing diploid genomes accurately and completely for a broad range of scientific and clinical purposes
Shape-IT: new rapid and accurate algorithm for haplotype inference
BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al. in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computations of posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotype inference problem by the smart exploration of the most plausible pathways (i.e., haplotypes) in the binary trees. RESULTS: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300 K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. In terms of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster, though still less accurate, at inferring haplotypes on larger SNP datasets. CONCLUSION: Shape-IT deserves to be extensively used for regular haplotype inference but also in the context of the new high-throughput genotyping chips, since it allows the genetic model of Phase v2.1 to be fitted to large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more broadly to other fields using HMMs.
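The "exponential aspect" mentioned in the Shape-IT abstract can be illustrated with a minimal Python sketch. This is not Shape-IT's binary-tree algorithm; it is a brute-force enumeration that only shows why the candidate set grows quickly. The genotype coding (0, 1, 2 = copies of the alternate allele, so 1 marks a heterozygous site) and the function name `compatible_haplotype_pairs` are assumptions made for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of 0, 1 or 2 (copies of the alternate allele per SNP).
    This coding is an assumption of the sketch, not taken from the abstract.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are fixed on both haplotypes: 0 -> allele 0, 2 -> allele 1.
    base = [g // 2 for g in genotype]
    pairs = []
    # The first heterozygous site is pinned to allele 0 on haplotype A so that
    # (A, B) and (B, A) are not both listed.
    for bits in product((0, 1), repeat=max(len(het_sites) - 1, 0)):
        hap_a, hap_b = list(base), list(base)
        for site, allele in zip(het_sites, (0,) + bits):
            hap_a[site] = allele
            hap_b[site] = 1 - allele
        pairs.append((tuple(hap_a), tuple(hap_b)))
    return pairs


if __name__ == "__main__":
    for k in (1, 5, 10, 15):
        n_pairs = len(compatible_haplotype_pairs([1] * k))
        print(f"{k} heterozygous sites: {n_pairs} candidate pairs")
```

A genotype with k heterozygous sites (k at least 1) has 2^(k-1) compatible pairs, so exhaustive enumeration becomes impractical well before the segment sizes discussed in the abstract.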
Hum Hered
The inference of haplotype pairs directly from unphased genotype data is a key step in the analysis of genetic variation in relation to disease and pharmacogenetically relevant traits. Most popular methods such as Phase and PL require either the coalescence assumption or the assumption of linkage between the single-nucleotide polymorphisms (SNPs). We have now developed novel approaches that are independent of these assumptions. First, we introduce a new optimization criterion in combination with a block-wise evolutionary Monte Carlo algorithm. Based on this criterion, the 'haplotype likelihood', we develop two kinds of estimators, the maximum haplotype-likelihood (MHL) estimator and its empirical Bayesian (EB) version. Using both real and simulated data sets, we demonstrate that our proposed estimators allow substantial improvements over both the expectation-maximization (EM) algorithm and Clark's procedure in terms of capacity/scalability and error rate. Thus, hundreds or more ambiguous loci and potentially very large sample sizes can be processed. Moreover, applying our proposed EB estimator can result in significant reductions of error rate in the case of unlinked or only weakly linked SNPs
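For orientation, the EM baseline named in this abstract can be sketched in a few lines of Python. This is a plain textbook EM for haplotype frequencies under random mating (Hardy-Weinberg equilibrium), not the MHL or EB estimators proposed by the authors and not the PL-EM program. The 0/1/2 genotype coding, the function names, and the fixed iteration count are all assumptions of the sketch.

```python
from collections import defaultdict
from itertools import product


def haplotype_pairs(genotype):
    """List unordered haplotype pairs compatible with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites are fixed
    pairs = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        a, b = list(base), list(base)
        for site, allele in zip(het, (0,) + bits):
            a[site], b[site] = allele, 1 - allele
        pairs.append((tuple(a), tuple(b)))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming random mating."""
    candidates = [haplotype_pairs(g) for g in genotypes]
    haps = {h for pairs in candidates for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each compatible pair; a pair of
            # two different haplotypes can arise in two orders.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2N chromosomes.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq


if __name__ == "__main__":
    # Two homozygotes and one double heterozygote whose phase is ambiguous.
    estimates = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
    for hap, p in sorted(estimates.items()):
        print(hap, round(p, 4))
```

Because every compatible pair is enumerated per individual, this sketch scales exponentially in the number of heterozygous loci, which is the capacity limit the abstract contrasts its estimators against.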