Fast empirical Bayesian LASSO for multiple quantitative trait locus mapping
Background: The Bayesian shrinkage technique has been applied to multiple quantitative trait locus (QTL) mapping to estimate the genetic effects of QTLs on quantitative traits from a very large set of possible effects, including the main and epistatic effects of QTLs. Although the recently developed empirical Bayes (EB) method significantly reduced computation compared with the fully Bayesian approach, its speed and accuracy are limited because numerical optimization is required to estimate the variance components in the QTL model. Results: We developed a fast empirical Bayesian LASSO (EBLASSO) method for multiple QTL mapping. The closed-form estimation of the variance components, together with other algorithmic techniques, makes the EBLASSO method more efficient and accurate. Compared with the EB method, the EBLASSO method substantially improved computational speed in our simulation study and detected more QTL effects without increasing the false positive rate. In particular, the EBLASSO algorithm running on a personal computer easily handled a linear QTL model with more than 100,000 variables in our simulation study. Real data analysis also demonstrated that the EBLASSO method detected more reasonable effects than the EB method. Compared with the LASSO, our simulation showed that the current version of the EBLASSO implemented in Matlab ran at a speed similar to that of the LASSO implemented in Fortran, and that the EBLASSO detected the same number of true effects as the LASSO but far fewer false positive effects. Conclusions: The EBLASSO method can handle a large number of effects, possibly including both the main and epistatic QTL effects, environmental effects and the effects of gene-environment interactions. It will be a very useful tool for multiple QTL mapping.
Sparse Model Learning for Inferring Genotype and Phenotype Associations
Genotype and phenotype associations are of paramount importance in understanding the genetic basis of living organisms, improving traits of interest in animal and plant breeding, and gaining insights into complex biological systems and the etiology of human diseases. With advances in molecular biology such as microarrays, high-throughput next-generation sequencing and RNA-seq, the number of available genotype markers far exceeds the number of available samples in association studies. The objective of this dissertation is to develop sparse models for such high-dimensional data, develop accurate sparse variable selection and estimation algorithms for the models, and design statistical methods for robust hypothesis tests of genotype and phenotype associations. We develop a novel empirical Bayesian least absolute shrinkage and selection operator (EBlasso) algorithm with Normal, Exponential and Gamma (NEG), and Normal, Exponential (NE) hierarchical prior distributions, and an empirical Bayesian elastic net (EBEN) algorithm with an innovative Normal and generalized Gamma (NG) hierarchical prior distribution, for both general linear and generalized logistic regression models. Both empirical Bayes methods estimate the variance components of the regression coefficients with closed-form solutions and perform automatic variable selection, such that a variable with zero variance is excluded from the model. With closed-form solutions for the variance components and without estimating the posterior modes for excluded variables, the two empirical Bayes methods infer sparse models efficiently. With both the covariance and the posterior modes estimated, they also provide a statistical testing method that considers as much information as possible without increasing the degrees of freedom (DF). Extensive simulation studies are carried out to evaluate the performance of the proposed methods, and real datasets are analyzed for validation.
Both simulation and real data analyses suggest that the two methods are fast and accurate genotype-phenotype association methods that can easily handle high-dimensional data including possible main and interaction effects. Comparing the two methods, EBlasso typically selects one variable out of a group of highly correlated effects, whereas the EBEN algorithm encourages a grouping effect that selects a group of effects if they are correlated. Beyond the verification simulations and real dataset analyses, we further demonstrate the advantage of the developed algorithms through two exploratory applications: whole-genome QTL mapping for an elite rice hybrid and a pathway-based genome-wide association study (GWAS) for human Parkinson's disease (PD). In the first application, we exploit whole-genome markers of an immortalized F2 population derived from an elite rice hybrid to perform QTL mapping for the rice-yield phenotype. Our QTL model includes additive and dominance main effects of 1,619 markers and all pair-wise interactions, with a total of more than 5 million possible effects. This study not only reveals the major role of epistasis in influencing rice yield, but also provides a set of candidate genetic loci for further experimental investigation. In the second application, we employ the EBlasso logistic regression model for pathway-based GWAS to include all possible main effects and a large number of pair-wise interactions of single nucleotide polymorphisms (SNPs) in a pathway, with a total of more than 32 million effects included in the model. With effects inferred by EBlasso, the statistical significance of a pathway is tested with the Wald statistic, and reliable effects in a significant pathway are selected using the stability selection technique. Another important area of genotype and phenotype association is the inference of the structure of gene regulatory networks (GRNs).
We developed a GRN inference algorithm by exploring sparse model selection and estimation methods in structural equation models (SEMs). We extend a previously developed sparse-aware maximum likelihood (SML) algorithm to incorporate the adaptive elastic net penalty for the SEM likelihood function (SEM-EN) and infer the model using a parallelized block coordinate ascent algorithm. With the versatile penalty function and powerful parallel computation, the SEM-EN algorithm is able to infer a network with thousands of nodes. The performance of the developed algorithm is demonstrated through simulation studies, in which power of detection and false discovery rate both suggest that SEM-EN significantly improves GRN inference over the previously developed SEM-SML algorithm. When applied to infer the GRN of a real budding yeast dataset with more than 3,000 nodes, SEM-EN infers a sparse network corroborated by previous independent studies in terms of the roles of hub nodes and the functions of key clusters. Given the fundamental importance of genotype and phenotype associations in understanding the genetic basis of complex biological systems, the EBlasso-NE, EBlasso-NEG, EBEN and SEM-EN algorithms and software packages developed in this dissertation achieve the effectiveness, robustness and efficiency needed for successful application to biology. With the advancement of high-throughput molecular technologies in generating information at genetic, epigenetic, transcriptional and post-transcriptional levels, the methods developed in this dissertation can have broad applications in inferring different types of genotype and phenotype associations.
Whole-genome quantitative trait locus mapping reveals major role of epistasis on yield of rice
Although rice yield has doubled in most parts of the world since the 1960s, thanks to advances in breeding technologies, the biological mechanisms controlling yield are largely unknown. To understand the genetic basis of rice yield, a number of quantitative trait locus (QTL) mapping studies have been carried out, but whole-genome QTL mapping incorporating all interaction effects is still lacking. In this paper, we exploited whole-genome markers of an immortalized F2 population derived from an elite rice hybrid to perform QTL mapping for rice yield, characterized by yield per plant and three yield component traits. Our QTL model includes additive and dominance main effects of 1,619 markers and all pair-wise interactions, with a total of more than 5 million possible effects. The QTL mapping identified 54, 5, 28 and 4 significant effects involving 103, 9, 52 and 7 QTLs for the four traits, namely the number of panicles per plant, the number of grains per panicle, grain weight, and yield per plant. Most identified QTLs are involved in digenic interactions. An extensive literature survey of experimentally characterized genes related to crop yield shows that 19 of 54 effects, 4 of 5 effects, 12 of 28 effects and 2 of 4 effects for the four traits, respectively, involve at least one QTL that lies within 2 cM of at least one yield-related gene. This study not only reveals the major role of epistasis in influencing rice yield, but also provides a set of candidate genetic loci for further experimental investigation.
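As a rough check on the size of the model described above, the sketch below counts the candidate effects for 1,619 markers. It assumes, beyond what the abstract states, that each marker pair contributes four interaction terms (additive-by-additive, additive-by-dominance, dominance-by-additive and dominance-by-dominance). The function name count_effects and its parameters are illustrative only.

```python
from math import comb


def count_effects(n_markers: int, effects_per_marker: int = 2) -> dict:
    """Count candidate effects in a QTL model with main and pair-wise terms.

    effects_per_marker=2 stands for one additive and one dominance effect
    per marker. Assumption (not stated in the abstract): every pair of
    markers contributes effects_per_marker**2 interaction terms.
    """
    main = effects_per_marker * n_markers
    marker_pairs = comb(n_markers, 2)  # unordered pairs of distinct markers
    interactions = marker_pairs * effects_per_marker ** 2
    return {
        "main": main,
        "interactions": interactions,
        "total": main + interactions,
    }


counts = count_effects(1619)
print(counts)
# {'main': 3238, 'interactions': 5239084, 'total': 5242322}
```

Under this assumption the total is about 5.24 million, consistent with the "more than 5 million possible effects" reported in the abstract.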
Data from: Empirical Bayesian elastic net for multiple quantitative trait locus mapping
In multiple quantitative trait locus (QTL) mapping, a high-dimensional sparse regression model is usually employed to account for possible multiple linked QTLs. The QTL model may include closely linked and thus highly correlated genetic markers, especially when high-density marker maps are used in QTL mapping because of advances in sequencing technology. Although existing algorithms, such as Lasso, empirical Bayesian Lasso (EBlasso) and elastic net (EN), are available to infer such QTL models, more powerful methods are highly desirable to detect more QTLs in the presence of correlated QTLs. We developed a novel empirical Bayesian EN (EBEN) algorithm for multiple QTL mapping that inherits the efficiency of our previously developed EBlasso algorithm. Simulation results demonstrated that EBEN provided higher power of detection and almost the same false discovery rate compared with EN and EBlasso. In particular, EBEN can identify correlated QTLs that the other two algorithms may fail to identify. When analyzing a real dataset, EBEN detected more effects than EN and EBlasso. EBEN provides a useful tool for inferring high-dimensional sparse models in multiple QTL mapping and other applications. An R software package ‘EBEN’ implementing the EBEN algorithm is available on the Comprehensive R Archive Network (CRAN).
Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping
Background: Complex binary traits are influenced by many factors, including the main effects of many quantitative trait loci (QTLs), the epistatic effects involving more than one QTL, environmental effects and the effects of gene-environment interactions. Although a number of QTL mapping methods for binary traits have been developed, an efficient and powerful method that can handle both main and epistatic effects of a relatively large number of possible QTLs is still lacking. Results: In this paper, we use a Bayesian logistic regression model as the QTL model for binary traits that includes both main and epistatic effects. Our logistic regression model employs hierarchical priors for regression coefficients similar to the ones used in the Bayesian LASSO linear model for multiple QTL mapping for continuous traits. We develop efficient empirical Bayesian algorithms to infer the logistic regression model. Our simulation study shows that our algorithms can easily handle a QTL model with a large number of main and epistatic effects on a personal computer, and that they outperform five other methods examined, namely the LASSO, HyperLasso, BhGLM, RVM and the single-QTL mapping method based on logistic regression, in terms of power of detection and false positive rate. The utility of our algorithms is also demonstrated through analysis of a real data set. A software package implementing the empirical Bayesian algorithms in this paper is freely available upon request. Conclusions: The EBLASSO logistic regression method can handle a large number of effects, possibly including the main and epistatic QTL effects, environmental effects and the effects of gene-environment interactions. It will be a very useful tool for multiple QTL mapping for complex binary traits.