Haplotype Estimation from Fuzzy Genotypes Using Penalized Likelihood
The Composite Link Model is a generalization of the generalized linear model in which expected values of observed counts are constructed as a sum of generalized linear components. When combined with penalized likelihood, it provides a powerful and elegant way to estimate haplotype probabilities from observed genotypes. Uncertain (“fuzzy”) genotypes, like those resulting from AFLP scores, can be handled by adding an extra layer to the model. We describe the model and the estimation algorithm. We apply it to a data set of accurate human single nucleotide polymorphisms (SNPs) and to a data set of fuzzy tomato AFLP scores.
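To make the idea of estimating haplotype probabilities from unphased genotypes concrete, here is a minimal sketch. It does not implement the penalized Composite Link Model described in the abstract; it swaps in the classical EM algorithm for haplotype frequencies as a simpler stand-in, and it assumes accurate (not fuzzy) biallelic genotypes coded as allele counts 0, 1 or 2. The function names `compatible_pairs` and `em_haplotype_freqs` are hypothetical.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs whose allele sums equal the genotype."""
    haps = list(product((0, 1), repeat=len(genotype)))
    pairs = []
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.append((h1, h2))
    return pairs


def em_haplotype_freqs(genotypes, n_iter=200):
    """Classical EM estimate of haplotype frequencies (an illustrative stand-in)."""
    haps = list(product((0, 1), repeat=len(genotypes[0])))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each phase under random mating;
            # a pair of distinct haplotypes can arise in two orders.
            weights = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freq
```

With two loci, only the double heterozygote `(1, 1)` is ambiguous: it can be resolved as haplotypes `(0, 0)` and `(1, 1)` or as `(0, 1)` and `(1, 0)`, and the E-step splits each such individual between the two phases according to the current frequency estimates.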
Learning the optimal scale for GWAS through hierarchical SNP aggregation
Motivation: Genome-Wide Association Studies (GWAS) seek to identify causal
genomic variants associated with rare human diseases. The classical statistical
approach for detecting these variants is based on univariate hypothesis
testing, with healthy individuals being tested against affected individuals at
each locus. Given that an individual's genotype is characterized by up to one
million SNPs, this approach lacks precision, since it may yield a large number
of false positives that can lead to erroneous conclusions about genetic
associations with the disease. One way to improve the detection of true genetic
associations is to reduce the number of hypotheses to be tested by grouping
SNPs. Results: We propose a dimension-reduction approach which can be applied
in the context of GWAS by making use of the haplotype structure of the human
genome. We compare our method with standard univariate and multivariate
approaches on both synthetic and real GWAS data, and we show that reducing the
dimension of the predictor matrix by aggregating SNPs gives greater precision
in the detection of associations between the phenotype and genomic regions.
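The grouping idea can be illustrated with a small sketch. This is not the authors' hierarchical method; it is a simplified, assumed variant that merges each SNP with its left neighbour when their squared Pearson correlation (used here as a rough proxy for linkage disequilibrium) reaches a threshold, then sums allele counts within each group. The names `pearson_r`, `aggregate_adjacent_snps` and `r2_threshold` are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def aggregate_adjacent_snps(snps, r2_threshold=0.5):
    """Group neighbouring SNPs by squared correlation and sum counts per group.

    `snps` is a list of SNP columns, each a list of allele counts (0/1/2)
    with one entry per individual. Returns (groups, aggregated_columns).
    """
    groups = [[0]]
    for j in range(1, len(snps)):
        r = pearson_r(snps[j - 1], snps[j])
        if r * r >= r2_threshold:
            groups[-1].append(j)   # extend the current block
        else:
            groups.append([j])     # start a new block
    aggregated = [
        [sum(snps[j][i] for j in group) for i in range(len(snps[0]))]
        for group in groups
    ]
    return groups, aggregated
```

Testing one aggregated variable per group instead of one per SNP reduces the number of hypotheses, so a Bonferroni-style threshold such as `0.05 / len(groups)` is less stringent than `0.05 / len(snps)`.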
Bayesian Statistical Methods for Genetic Association Studies with Case-Control and Cohort Design
Large-scale genetic association studies are carried out with the hope of discovering single
nucleotide polymorphisms involved in the etiology of complex diseases. We propose a
coalescent-based model for association mapping which potentially increases the power to
detect disease-susceptibility variants in genetic association studies with case-control and cohort
design. The approach uses Bayesian partition modelling to cluster haplotypes with
similar disease risks by exploiting evolutionary information. We focus on candidate gene
regions and we split the chromosomal region of interest into sub-regions or windows of high
linkage disequilibrium (LD), within each of which we assume a perfect phylogeny. The haplotype space is
then partitioned into disjoint clusters within which the phenotype-haplotype association is
assumed to be the same. The novelty of our approach consists in the fact that the distance
used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered
according to the time to their most recent common mutation. Our approach is fully
Bayesian and we develop Markov Chain Monte Carlo algorithms to sample efficiently over
the space of possible partitions. We have also developed a Bayesian survival regression model
for high-dimensional, small-sample-size settings. We provide a Bayesian variable selection
procedure and shrinkage tool by imposing shrinkage priors on the regression coefficients. We
have developed a computationally efficient optimization algorithm to explore the posterior
surface and find the maximum a posteriori estimates of the regression coefficients. In
simulation studies and on real datasets, we compare the proposed methods with both
single-marker analyses and recently proposed multi-marker methods, and show that
our methods perform similarly in localizing the causal allele while yielding lower false positive
rates. Moreover, our methods offer computational advantages over other multi-marker
approaches.
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis. Comment: Published at
http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/)
by the Institute of Mathematical Statistics (http://www.imstat.org)
Mapping Haplotype-haplotype Interactions with Adaptive LASSO
Background: The genetic etiology of complex diseases in humans is commonly viewed as a complex process in which genetic and environmental factors act together in complicated ways. Interactions among genetic variants often play major roles in determining the susceptibility of an individual to a particular disease. Statistical methods for modeling interactions between single genetic variants (e.g. single nucleotide polymorphisms or SNPs) underlying complex diseases have been extensively studied. Recently, haplotype-based analysis has gained popularity in genetic association studies. When multiple sequence or haplotype interactions are involved in determining an individual's susceptibility to a disease, statistical modeling and testing of the interaction effects become daunting, largely because of the higher-order epistatic complexity. Results: In this article, we propose a new strategy for modeling haplotype-haplotype interactions under the penalized logistic regression framework with an adaptive L1 penalty. We consider interactions of sequence variants between haplotype blocks. The adaptive L1 penalty allows simultaneous effect estimation and variable selection in a single model. We propose a new parameter estimation method which estimates and selects parameters by a modified Gauss-Seidel method nested within the EM algorithm. Simulation studies show that it has a low false positive rate and reasonable power in detecting haplotype interactions. The method is applied to test haplotype interactions between the mother and offspring genomes in a small for gestational age (SGA) neonates data set, and significant interactions between different genomes are detected. Conclusions: As demonstrated by the simulation studies and real data analysis, the approach developed provides an efficient tool for the modeling and testing of haplotype interactions. The implementation of the method in R code can be freely downloaded from http://www.stt.msu.edu/~cui/software.html.
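The way an adaptive L1 penalty estimates and selects effects at the same time can be sketched with coordinate-wise updates, which are Gauss-Seidel in spirit. This is a simplified illustration under stated assumptions, not the method from the abstract: it uses a squared-error loss instead of logistic regression, omits the EM step for haplotype uncertainty, and takes the penalty weights as 1 / (|initial estimate| + eps). The names `soft_threshold`, `adaptive_lasso_cd`, `lam` and `init_beta` are hypothetical; the columns of `X` could hold haplotype main-effect and interaction terms.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def adaptive_lasso_cd(X, y, lam, init_beta, n_sweeps=200, eps=1e-8):
    """Minimize (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| by coordinate descent.

    X is a list of rows; w_j = 1 / (|init_beta[j]| + eps), so coefficients
    with small initial estimates are penalized more heavily.
    """
    n, p = len(X), len(X[0])
    w = [1.0 / (abs(b) + eps) for b in init_beta]
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam * w[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Because each coefficient has its own threshold `lam * w[j]`, a term with a weak initial estimate is set exactly to zero while a strong one is only mildly shrunk, which is the sense in which estimation and selection happen in a single model.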
- …