The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
Populations in statistical genetic modelling and inference
What is a population? This review considers how a population may be defined
in terms of understanding the structure of the underlying genetics of the
individuals involved. The main approach is to consider statistically
identifiable groups of randomly mating individuals, which is well defined in
theory for any type of (sexual) organism. We discuss generative models using
drift, admixture and spatial structure, and the ancestral recombination graph.
These are contrasted with statistical models for inference, principal component
analysis and other 'non-parametric' methods. The relationships between these
approaches are explored with both simulated and real-data examples. The
state-of-the-art practical software tools are discussed and contrasted. We
conclude that populations are a useful theoretical construct that can be well
defined in theory and often approximately exist in practice.
NGS Based Haplotype Assembly Using Matrix Completion
We apply matrix completion methods for haplotype assembly from NGS reads to
develop the new HapSVT, HapNuc, and HapOPT algorithms. This is performed by
applying a mathematical model to convert the reads to an incomplete matrix and
estimating unknown components. This process is followed by quantizing and
decoding the completed matrix in order to estimate haplotypes. These algorithms
are compared to the state-of-the-art algorithms using simulated data as well as
the real fosmid data. It is shown that the SNP missing rate and the haplotype
block length of the proposed HapOPT are better than those of HapCUT2 with
comparable accuracy in terms of reconstruction rate and switch error rate. A
program implementing the proposed algorithms in MATLAB is freely available at
https://github.com/smajidian/HapMC
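The reads-to-matrix, completion, and quantization steps described above can be illustrated with a minimal sketch. It is not the HapSVT, HapNuc, or HapOPT method: it swaps in a plain rank-1 alternating least-squares fit, and it assumes a hypothetical encoding in which each read is a row with +1 or -1 at covered SNPs and 0 where the SNP is not covered. The names `READS`, `complete_rank1`, and `quantize` are illustrative only. Under this encoding, an error-free diploid read matrix is rank 1, because every read is either the haplotype or its complement.

```python
# Hypothetical toy data: 5 reads over 6 SNPs, 0 = SNP not covered by the read.
# Reads 0 and 2 come from one haplotype, reads 1, 3 and 4 from its complement.
READS = [
    [+1, -1, -1,  0,  0,  0],
    [ 0, +1, +1, -1,  0,  0],
    [ 0,  0, -1, +1, +1,  0],
    [ 0,  0,  0, -1, -1, +1],
    [-1, +1, +1, -1,  0,  0],
]


def complete_rank1(reads, n_iter=20):
    """Fit reads[i][j] ~ c[i] * h[j] using only the observed (non-zero) entries."""
    n_reads, n_snps = len(reads), len(reads[0])
    h = [float(x) for x in reads[0]]  # seed the haplotype with the first read
    c = [0.0] * n_reads
    for _ in range(n_iter):
        # Update each read's origin score given the current haplotype estimate.
        for i in range(n_reads):
            obs = [j for j in range(n_snps) if reads[i][j] != 0]
            den = sum(h[j] ** 2 for j in obs)
            c[i] = sum(reads[i][j] * h[j] for j in obs) / den if den else 0.0
        # Update each SNP's haplotype score given the current origin scores.
        for j in range(n_snps):
            obs = [i for i in range(n_reads) if reads[i][j] != 0]
            den = sum(c[i] ** 2 for i in obs)
            h[j] = sum(reads[i][j] * c[i] for i in obs) / den if den else 0.0
    return c, h


def quantize(values):
    """Map real-valued scores to +1 / -1, leaving exact zeros (no information) as 0."""
    return [1 if v > 0 else -1 if v < 0 else 0 for v in values]


origin, haplotype = complete_rank1(READS)
phase = quantize(haplotype)          # one haplotype
complement = [-x for x in phase]     # the other haplotype of the diploid pair
```

The sign of each entry of `origin` indicates which haplotype a read is assigned to. Real data with sequencing errors and low coverage would need the more robust completion methods the abstract names.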
Genome-Wide Fine-Mapping Of Diabetic Traits
Type 2 diabetes results from both genes and the environment. Mapping genetic loci in animal models can help identify genes that are involved in type 2 diabetes to better understand the disease. Heterogeneous stock (HS) rats are derived from eight inbred founder strains and maintained in a breeding strategy that minimizes inbreeding. HS rats have a highly recombinant genome, which allows for rapid fine-mapping of complex traits genome-wide. However, this results in a complicated set of relationships between animals that traditional genetic mapping methods do not encounter. To fine-map traits involved in type 2 diabetes, multiple diabetic phenotypes were collected in 1,038 HS male rats and these animals were genotyped using the Affymetrix 10K SNP array. Following ancestral haplotype reconstruction, a mixed modeling approach was used to identify genetic loci involved in two phenotypes suggestive of diabetes: fasting glucose and glucose area under the curve after a glucose tolerance test. Sibship was used as a random effect in the model to account for the complex family relationships. A genome-wide significant marker interval was detected on chromosome 11 for fasting glucose with a 95% confidence interval of 5.75 Mb. Genome-wide significant marker intervals were also detected on chromosomes 1, 3, 10, and 13 for glucose area under the curve, with the average 95% confidence interval for these loci being only 3.15 Mb. A multilocus modeling technique involving resample model averaging was applied to the fasting glucose phenotype. This technique determines how frequently each locus is detected when resampling a portion of the original dataset, thus reducing potential false positives. Multilocus modeling results for fasting glucose coincided with the significant marker interval demonstrated in the mixed modeling approach.
Both approaches are effective at detecting significant marker intervals that are expected to be involved in the phenotype of interest, with greater resolution than traditional methods.
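The resampling idea in this abstract, counting how often each locus is detected across subsamples of the data, can be sketched as follows. This is only an illustration under stated assumptions: the detection rule is a simple marginal-correlation threshold rather than the multilocus model used in the study, the data are synthetic, and the names `detection_frequency`, `GENOTYPES`, and `PHENOTYPE`, as well as the 80% subsample fraction and 0.5 threshold, are hypothetical choices.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def detection_frequency(genotypes, phenotype, n_resamples=200,
                        fraction=0.8, threshold=0.5, seed=0):
    """Fraction of subsamples in which each locus passes the detection rule."""
    rng = random.Random(seed)
    n = len(phenotype)
    k = round(fraction * n)
    counts = [0] * len(genotypes)
    for _ in range(n_resamples):
        idx = rng.sample(range(n), k)  # subsample animals without replacement
        y = [phenotype[i] for i in idx]
        for locus, g in enumerate(genotypes):
            if abs(pearson([g[i] for i in idx], y)) > threshold:
                counts[locus] += 1
    return [c / n_resamples for c in counts]


# Synthetic example: 20 animals, 3 loci; only locus 0 affects the phenotype.
GENOTYPES = [
    [0, 1] * 10,
    [0, 0, 1, 1] * 5,
    ([0] * 5 + [1] * 5) * 2,
]
PHENOTYPE = [2.0 * GENOTYPES[0][i] + 0.1 * (i % 3) for i in range(20)]

freq = detection_frequency(GENOTYPES, PHENOTYPE)
```

A locus that is detected in nearly every subsample is more credible than one that appears only occasionally, which is how the approach reduces false positives.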
A model-based approach to selection of tag SNPs
BACKGROUND: Single Nucleotide Polymorphisms (SNPs) are the most common type of polymorphisms found in the human genome. Effective genetic association studies require the identification of sets of tag SNPs that capture as much haplotype information as possible. Tag SNP selection is analogous to the problem of data compression in information theory. According to Shannon's framework, the optimal tag set maximizes the entropy of the tag SNPs subject to constraints on the number of SNPs. This approach requires an appropriate probabilistic model. Compared to simple measures of Linkage Disequilibrium (LD), a good model of haplotype sequences can more accurately account for LD structure. It also provides machinery for predicting tagged SNPs and thereby for assessing the performance of tag sets through their ability to predict larger SNP sets. RESULTS: Here, we compute the description code-lengths of SNP data for an array of models and we develop tag SNP selection methods based on these models and the strategy of entropy maximization. Using data sets from the HapMap and ENCODE projects, we show that the hidden Markov model introduced by Li and Stephens outperforms the other models in several aspects: description code-length of SNP data, information content of tag sets, and prediction of tagged SNPs. This is the first use of this model in the context of tag SNP selection. CONCLUSION: Our study provides strong evidence that the tag sets selected by our best method, based on the Li and Stephens model, outperform those chosen by several existing methods. The results also suggest that information content evaluated with a good model is more sensitive for assessing the quality of a tagging set than the correct prediction rate of tagged SNPs. In addition, we show that haplotype phase uncertainty has an almost negligible impact on the ability of good tag sets to predict tagged SNPs.
This justifies the selection of tag SNPs on the basis of haplotype informativeness, although genotyping studies do not directly assess haplotypes. Software that implements our approach is available.
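The entropy-maximization strategy can be made concrete with a small sketch. It assumes the simplest possible model, empirical haplotype frequencies, in place of the Li and Stephens hidden Markov model that the study found best, and it uses a greedy search. The function names and the toy haplotypes are hypothetical.

```python
from collections import Counter
from math import log2


def joint_entropy(haplotypes, snps):
    """Empirical entropy (bits) of the allele patterns at the given SNP indices."""
    counts = Counter(tuple(h[s] for s in snps) for h in haplotypes)
    n = len(haplotypes)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def greedy_tags(haplotypes, k):
    """Greedily pick k SNPs, each time adding the one that maximizes joint entropy."""
    n_snps = len(haplotypes[0])
    tags = []
    for _ in range(k):
        best = max((s for s in range(n_snps) if s not in tags),
                   key=lambda s: joint_entropy(haplotypes, tags + [s]))
        tags.append(best)
    return tags


# Toy haplotypes over 4 SNPs: SNPs 0 and 1 are in perfect LD, SNP 3 is rare.
HAPLOTYPES = ["0000", "0000", "0010", "0010", "1100", "1100", "1110", "1111"]

tags = greedy_tags(HAPLOTYPES, 2)
```

In this toy set the second tag is chosen outside the perfectly linked pair, because a SNP in perfect LD with an existing tag adds no entropy.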
Shape-IT: new rapid and accurate algorithm for haplotype inference
BACKGROUND: We have developed a new computational algorithm, Shape-IT, to infer haplotypes under the genetic model of coalescence with recombination developed by Stephens et al. in Phase v2.1. It runs much faster than Phase v2.1 while exhibiting the same accuracy. The major algorithmic improvements rely on the use of binary trees to represent the sets of candidate haplotypes for each individual. These binary tree representations: (1) speed up the computation of posterior probabilities of the haplotypes by avoiding the redundant operations made in Phase v2.1, and (2) overcome the exponential aspect of the haplotype inference problem by the smart exploration of the most plausible pathways (i.e., haplotypes) in the binary trees. RESULTS: Our results show that Shape-IT is several orders of magnitude faster than Phase v2.1 while being as accurate. For instance, Shape-IT runs 50 times faster than Phase v2.1 to compute the haplotypes of 200 subjects on 6,000 segments of 50 SNPs extracted from a standard Illumina 300K chip (13 days instead of 630 days). We also compared Shape-IT with other widely used software, Gerbil, PL-EM, Fastphase, 2SNP, and Ishape, in various tests: Shape-IT and Phase v2.1 were the most accurate in all cases, followed by Ishape and Fastphase. In terms of speed, Shape-IT was faster than Ishape and Fastphase for datasets smaller than 100 SNPs, but Fastphase became faster, though still less accurate, at inferring haplotypes on larger SNP datasets. CONCLUSION: Shape-IT deserves to be extensively used for regular haplotype inference but also in the context of the new high-throughput genotyping chips, since it makes it possible to fit the genetic model of Phase v2.1 on large datasets. This new algorithm based on tree representations could be used in other HMM-based haplotype inference software and may apply more broadly to other fields using HMMs.
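The "exponential aspect" mentioned above can be seen by enumerating the haplotype pairs compatible with one genotype. The sketch below is not Shape-IT's tree algorithm; it is a brute-force enumeration that assumes genotypes coded as 0, 1, or 2 copies of the alternate allele, and the function name `compatible_pairs` is hypothetical. Each heterozygous site acts as a binary branch, so an individual with k heterozygous sites has 2^(k-1) unordered candidate pairs.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with a 0/1/2-coded genotype."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites: 0 -> allele 0, 2 -> allele 1
    pairs = []
    # Fix the first heterozygous site to allele 0 on the first haplotype so that
    # each unordered pair is listed exactly once.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        a, b = base[:], base[:]
        for site, bit in zip(het, (0,) + bits):
            a[site], b[site] = bit, 1 - bit
        pairs.append((tuple(a), tuple(b)))
    return pairs


pairs = compatible_pairs([1, 2, 1, 0])       # two heterozygous sites
n_pairs_10_het = len(compatible_pairs([1] * 10))
```

Exhaustive enumeration quickly becomes infeasible as the number of heterozygous sites grows, which is why pruning to the most plausible paths matters.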
A comparison of methods for haplotype inference
This study presents some of the available methods for haplotype reconstruction and evaluates the accuracy and efficiency of three different software programs that utilize these methods. The analysis is performed on the QTLMAS XII common dataset, which is publicly available. LinkPHASE 5+ is rule-based software that considers pedigree information (deduction and linkage) only. HiddenPHASE is likelihood-based software that takes into account molecular information (linkage disequilibrium). The DualPHASE software combines both of the above-mentioned methods. We show how the use of different available sources of information, as well as the shape of the data, affects haplotype inference.
- …