Sparse Probit Linear Mixed Model
Linear Mixed Models (LMMs) are important tools in statistical genetics. When
used for feature selection, they make it possible to find a sparse set of genetic traits
that best predict a continuous phenotype of interest, while simultaneously
correcting for various confounding factors such as age, ethnicity and
population structure. Formulated as models for linear regression, LMMs have
been restricted to continuous phenotypes. We introduce the Sparse Probit Linear
Mixed Model (Probit-LMM), where we generalize the LMM modeling paradigm to
binary phenotypes. As a technical challenge, the model no longer possesses a
closed-form likelihood function. In this paper, we present a scalable
approximate inference algorithm that lets us fit the model to high-dimensional
data sets. We show on three real-world examples from different domains that in
the setup of binary labels, our algorithm leads to better prediction accuracies
and also selects features which show less correlation with the confounding
factors.
Comment: Published version, 21 pages, 6 figures
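To make the setting concrete, the following is a minimal simulation sketch of binary labels generated from a sparse linear signal plus confounder-correlated noise. The abstract does not spell out the noise model, so the covariance form `lam_iid * I + lam_conf * K`, the synthetic confounder matrix `C`, and every name and default below are assumptions chosen for illustration, not the authors' specification. The sketch only generates data; it does not implement the paper's approximate inference algorithm.

```python
import numpy as np


def simulate_probit_lmm(n=200, d=50, n_active=5, lam_iid=1.0, lam_conf=1.0, seed=0):
    """Simulate binary labels from a sparse probit-style model (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Feature matrix (e.g. genetic markers); synthetic standard-normal values here.
    X = rng.standard_normal((n, d))

    # Sparse weight vector: only n_active features carry signal.
    w = np.zeros(d)
    active = rng.choice(d, size=n_active, replace=False)
    w[active] = rng.standard_normal(n_active)

    # Hypothetical confounders (stand-ins for age, population structure, ...)
    # and the similarity kernel they induce between samples.
    C = rng.standard_normal((n, 3))
    K = C @ C.T / C.shape[1]

    # Assumed noise covariance: independent part plus confounder-driven part.
    sigma = lam_iid * np.eye(n) + lam_conf * K
    eps = rng.multivariate_normal(np.zeros(n), sigma)

    # Binary phenotype: the sign of the noisy linear predictor.
    y = np.where(X @ w + eps >= 0, 1, -1)
    return X, y, w


X, y, w = simulate_probit_lmm()
```

Because the noise is correlated across samples, the probability of an observed label vector is a multivariate Gaussian integral over an orthant, which is one way to see why no closed-form likelihood is available.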
Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion
The prevailing method of analyzing GWAS data is still to test each marker
individually, although from a statistical point of view it is quite obvious
that in the case of complex traits such single marker tests are not ideal. Recently
several model selection approaches for GWAS have been suggested, most of them
based on LASSO-type procedures. Here we will discuss an alternative model
selection approach which is based on a modification of the Bayesian Information
Criterion (mBIC2) which was previously shown to have certain asymptotic
optimality properties in terms of minimizing the misclassification error.
Heuristic search strategies are introduced which attempt to find the model
which minimizes mBIC2, and which are efficient enough to allow the analysis of
GWAS data.
Our approach is implemented in a software package called MOSGWA. Its
performance in case-control GWAS is compared with the two algorithms HLASSO and
GWASelect, as well as with single marker tests, where we performed a simulation
study based on real SNP data from the POPRES sample. Our results show that
MOSGWA performs slightly better than HLASSO, whereas according to our
simulations GWASelect does not control the type I error when used to
automatically determine the number of important SNPs. We also reanalyze the
GWAS data from the Wellcome Trust Case-Control Consortium (WTCCC) and compare
the findings of the different procedures.
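The abstract does not state the criterion itself. As a rough sketch, the form usually quoted for mBIC2 is -2 log L + k log n + 2k log(p/4) - 2 log(k!), where k is the number of selected markers, n the sample size and p the number of candidate markers. The constant 4 and the function names below are assumptions for illustration and are not taken from the MOSGWA source; check them against the original publication before relying on them.

```python
import math


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Standard BIC: -2 log L + k log n."""
    return -2.0 * log_likelihood + k * math.log(n)


def mbic2(log_likelihood: float, k: int, n: int, p: int, c: float = 4.0) -> float:
    """Commonly quoted mBIC2 form (the constant c = 4 is an assumption here).

    mBIC2 = -2 log L + k log n + 2 k log(p / c) - 2 log(k!)
    """
    if k < 0 or n <= 0 or p <= 0:
        raise ValueError("k must be >= 0, and n and p must be positive")
    penalty = k * math.log(n) + 2.0 * k * math.log(p / c)
    # lgamma(k + 1) equals log(k!), computed without overflow for large k.
    return -2.0 * log_likelihood + penalty - 2.0 * math.lgamma(k + 1)


# Hypothetical numbers: 3 selected SNPs out of 500,000 candidates, 1,000 samples.
score_bic = bic(-100.0, k=3, n=1000)
score_mbic2 = mbic2(-100.0, k=3, n=1000, p=500_000)
```

With p much larger than n, the extra 2k log(p/4) term penalizes each added marker far more heavily than plain BIC does, which is what keeps the selected models small in a GWAS setting.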
Exact Dimensionality Selection for Bayesian PCA
We present a Bayesian model selection approach to estimate the intrinsic
dimensionality of a high-dimensional dataset. To this end, we introduce a novel
formulation of the probabilistic principal component analysis model based on a
normal-gamma prior distribution. In this context, we exhibit a closed-form
expression of the marginal likelihood which allows us to infer an optimal number
of components. We also propose a heuristic based on the expected shape of the
marginal likelihood curve in order to choose the hyperparameters. In
non-asymptotic frameworks, we show on simulated data that this exact
dimensionality selection approach is competitive with both Bayesian and
frequentist state-of-the-art methods.
Biomarker Detection in Association Studies: Modeling SNPs Simultaneously via Logistic ANOVA
In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs compared to the sample size inhibits the application of classical methods such as multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets and shows promise in biomarker detection.
Bayesian and frequentist analysis of an Austrian genome-wide association study of colorectal cancer and advanced adenomas
Most genome-wide association studies (GWAS) were analyzed using single marker tests in combination with stringent correction procedures for multiple testing. Thus, a substantial proportion of associated single nucleotide polymorphisms (SNPs) remained undetected and may account for missing heritability in complex traits. Model selection procedures present a powerful alternative to identify associated SNPs in high-dimensional settings. In this GWAS including 1060 colorectal cancer cases, 689 cases of advanced colorectal adenomas and 4367 controls, we pursued a dual approach to investigate genome-wide associations with disease risk, applying both single marker analysis and model selection based on the modified Bayesian information criterion, mBIC2, implemented in the software package MOSGWA. For different case-control comparisons, we report models including between 1 and 14 candidate SNPs. A genome-wide significant association of rs17659990 (P = 5.43 × 10^-9, DOCK3, chromosome 3p21.2) with colorectal cancer risk was observed. Furthermore, 56 SNPs known to influence susceptibility to colorectal cancer and advanced adenoma were tested in a hypothesis-driven approach and several of them were found to be relevant in our Austrian cohort. After correction for multiple testing (alpha = 8.9 × 10^-4), the most significant associations were observed for SNPs rs10505477 (P = 6.08 × 10^-4) and rs6983267 (P = 7.35 × 10^-4) of CASC8, rs3802842 (P = 8.98 × 10^-5, COLCA1,2), and rs12953717 (P = 4.64 × 10^-4, SMAD7). All previously unreported SNPs demand replication in additional samples. Reanalysis of existing GWAS datasets using model selection as a tool to detect SNPs associated with a complex trait may present a promising resource to identify further genetic risk variants not only for colorectal cancer.
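The corrected significance level quoted in this abstract is consistent with a Bonferroni correction of a family-wise level of 0.05 over the 56 candidate SNPs, since 0.05 / 56 is roughly 8.9 × 10^-4. The abstract does not name the correction method, so both the Bonferroni reading and the 0.05 level are assumptions; the 5 × 10^-8 genome-wide threshold is the conventional value and is likewise not stated in the abstract. A short check of the arithmetic, using only P-values reported above:

```python
# Assumed family-wise error level; the abstract reports only the corrected alpha.
ALPHA_FAMILY = 0.05
N_CANDIDATE_SNPS = 56  # number of hypothesis-driven SNPs, from the abstract

# Bonferroni threshold: divide the family-wise level by the number of tests.
threshold = ALPHA_FAMILY / N_CANDIDATE_SNPS

# P-values of the candidate SNPs as reported in the abstract.
reported_p = {
    "rs10505477": 6.08e-4,
    "rs6983267": 7.35e-4,
    "rs3802842": 8.98e-5,
    "rs12953717": 4.64e-4,
}
passes = {snp: p < threshold for snp, p in reported_p.items()}

# Conventional genome-wide significance threshold (assumption, not in the abstract).
GENOME_WIDE = 5e-8
dock3_hit_is_genome_wide = 5.43e-9 < GENOME_WIDE

print(f"Bonferroni threshold: {threshold:.2e}")
for snp, ok in passes.items():
    print(snp, "passes" if ok else "fails")
```

Under these assumptions all four listed candidate SNPs fall below the corrected threshold, and the rs17659990 association falls below the conventional genome-wide threshold.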
- …