Variable selection for BART: An application to gene regulation
We consider the task of discovering gene regulatory networks, which are
defined as sets of genes and the corresponding transcription factors which
regulate their expression levels. This can be viewed as a variable selection
problem, potentially with high dimensionality. Variable selection is especially
challenging in high-dimensional settings, where it is difficult to detect
subtle individual effects and interactions between predictors. Bayesian
Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a
novel nonparametric alternative to parametric regression approaches, such as
the lasso or stepwise regression, especially when the number of relevant
predictors is sparse relative to the total number of available predictors and
the fundamental relationships are nonlinear. We develop a principled
permutation-based inferential approach for determining when the effect of a
selected predictor is likely to be real. Going further, we adapt the BART
procedure to incorporate informed prior information about variable importance.
We present simulations demonstrating that our method compares favorably to
existing parametric and nonparametric procedures in a variety of data settings.
To demonstrate the potential of our approach in a biological context, we apply
it to the task of inferring the gene regulatory network in yeast (Saccharomyces
cerevisiae). We find that our BART-based procedure is best able to recover the
subset of covariates with the largest signal compared to other variable
selection methods. The methods developed in this work are readily available in
the R package bartMachine.
Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/14-AOAS755 by the
Institute of Mathematical Statistics (http://www.imstat.org).
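The permutation idea in this abstract can be illustrated with a minimal sketch. It is not the bartMachine procedure: as a labeled assumption, the importance score here is the absolute Pearson correlation with the response, standing in for BART variable inclusion proportions, and the function names, the per-variable cutoff rule, and the simulated data are hypothetical.

```python
import random
import statistics


def importance(X, y):
    """Stand-in importance score (assumption): |Pearson correlation| of each
    column of X with y. The paper uses BART inclusion proportions instead."""
    y_mean = statistics.fmean(y)
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = statistics.fmean(col)
        c_dev = [v - c_mean for v in col]
        c_ss = sum(d * d for d in c_dev)
        cov = sum(a * b for a, b in zip(c_dev, y_dev))
        denom = (c_ss * y_ss) ** 0.5
        scores.append(abs(cov) / denom if denom > 0 else 0.0)
    return scores


def permutation_select(X, y, n_perm=200, alpha=0.05, seed=0):
    """Keep predictors whose observed score exceeds the (1 - alpha) quantile
    of their own null distribution, built by permuting the response."""
    rng = random.Random(seed)
    observed = importance(X, y)
    null = [[] for _ in observed]
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # breaks any real link between X and y
        for j, score in enumerate(importance(X, y_perm)):
            null[j].append(score)
    selected = []
    for j, obs in enumerate(observed):
        ranked = sorted(null[j])
        cutoff = ranked[min(len(ranked) - 1, int((1 - alpha) * len(ranked)))]
        if obs > cutoff:
            selected.append(j)
    return selected


# Simulated data (hypothetical): only column 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * row[0] + data_rng.gauss(0, 1) for row in X]
selected = permutation_select(X, y)
```

Permuting the response keeps the joint distribution of the predictors intact while removing their relationship to the outcome, so the null scores show how large an importance value can get by chance alone.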
Doubly Robust Inference when Combining Probability and Non-probability Samples with High-dimensional Data
Non-probability samples have become increasingly popular in survey statistics but
may suffer from selection biases that limit the generalizability of results to
the target population. We consider integrating a non-probability sample with a
probability sample which provides high-dimensional representative covariate
information of the target population. We propose a two-step approach for
variable selection and finite population inference. In the first step, we use
penalized estimating equations with folded-concave penalties to select
important variables for the sampling score of selection into the
non-probability sample and the outcome model. We show that the penalized
estimating equation approach enjoys the selection consistency property for
general probability samples. The major technical hurdle is due to the possible
dependence of the sample under the finite population framework. To overcome
this challenge, we construct martingales that enable us to apply the Bernstein
concentration inequality for martingales. In the second step, we focus on a
doubly robust estimator of the finite population mean and re-estimate the
nuisance model parameters by minimizing the asymptotic squared bias of the
doubly robust estimator. This estimating strategy mitigates the possible
first-step selection error and renders the doubly robust estimator root-n
consistent if either the sampling probability or the outcome model is correctly
specified.
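As a rough illustration of the second step, the sketch below computes a generic doubly robust estimate of a finite population mean from already-fitted nuisance values. It assumes the commonly used form that adds inverse-probability-weighted residuals from the non-probability sample to design-weighted predictions from the probability sample; the function name, argument layout, and toy numbers are hypothetical, and the penalized selection and bias-minimizing re-estimation described in the abstract are not shown.

```python
def doubly_robust_mean(nonprob, prob, pop_size):
    """Generic doubly robust estimate of a finite population mean.

    nonprob:  (y, pi_hat, m_hat) per unit in the non-probability sample, where
              pi_hat is the estimated sampling score and m_hat the predicted outcome.
    prob:     (d, m_hat) per unit in the probability sample, where d is the
              design weight.
    pop_size: population size N, assumed known here.
    """
    # Inverse-probability-weighted residuals correct the outcome model's errors.
    residual_term = sum((y - m_hat) / pi_hat for y, pi_hat, m_hat in nonprob)
    # Design-weighted predictions estimate the population total of m(x).
    prediction_term = sum(d * m_hat for d, m_hat in prob)
    return (residual_term + prediction_term) / pop_size


# Toy numbers (hypothetical), chosen so the arithmetic is easy to follow:
# residual term = 1/0.5 + 1/0.25 = 6, prediction term = 3*9 + 3*11 = 60.
nonprob_sample = [(10.0, 0.5, 9.0), (12.0, 0.25, 11.0)]
prob_sample = [(3.0, 9.0), (3.0, 11.0)]
estimate = doubly_robust_mean(nonprob_sample, prob_sample, pop_size=6)
```

If the outcome model is correct, the residuals average out and the prediction term carries the estimate; if the sampling score is correct instead, the weighted residuals repair a misspecified outcome model. That is the sense in which either model being right is enough.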
Bayesian shrinkage in mixture-of-experts models: identifying robust determinants of class membership
A method for implicit variable selection in mixture-of-experts frameworks is proposed.
We introduce a prior structure where information is taken from a set of independent
covariates. Robust class membership predictors are identified using a normal gamma
prior. The resulting model setup is used in a finite mixture of Bernoulli distributions
to find homogeneous clusters of women in Mozambique based on their information
sources on HIV. Fully Bayesian inference is carried out via the implementation of a
Gibbs sampler.
Bayesian Inference under Cluster Sampling with Probability Proportional to Size
Cluster sampling is common in survey practice, and the corresponding
inference has been predominantly design-based. We develop a Bayesian framework
for cluster sampling and account for the design effect in the outcome modeling.
We consider a two-stage cluster sampling design where the clusters are first
selected with probability proportional to cluster size, and then units are
randomly sampled inside selected clusters. Challenges arise when the sizes of
nonsampled clusters are unknown. We propose nonparametric and parametric
Bayesian approaches for predicting the unknown cluster sizes, with this
inference performed simultaneously with the model for survey outcome.
Simulation studies show that the integrated Bayesian approach outperforms
classical methods with efficiency gains. We use Stan for computing and apply
the proposal to the Fragile Families and Child Wellbeing study as an
illustration of complex survey inference in health surveys.
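The two-stage design described in this abstract can be sketched as follows. This covers only the sampling step, not the Bayesian model or the Stan code. As labeled assumptions, clusters are drawn with replacement in the first stage and units are drawn by simple random sampling without replacement in the second; the function name and cluster sizes are hypothetical.

```python
import random


def two_stage_pps(cluster_sizes, n_clusters, n_units, seed=0):
    """Draw a two-stage sample: clusters with probability proportional to
    size, then a simple random sample of units inside each drawn cluster."""
    rng = random.Random(seed)
    ids = list(cluster_sizes)
    # Stage 1 (assumption: with replacement): larger clusters are more likely.
    chosen = rng.choices(ids, weights=[cluster_sizes[c] for c in ids], k=n_clusters)
    sample = []
    for c in chosen:
        # Stage 2: units are indexed 0..size-1 and drawn without replacement.
        k = min(n_units, cluster_sizes[c])
        sample.append((c, rng.sample(range(cluster_sizes[c]), k)))
    return sample


# Hypothetical population of four clusters with known sizes.
sizes = {"A": 50, "B": 200, "C": 20, "D": 130}
sample = two_stage_pps(sizes, n_clusters=3, n_units=5)
```

The sketch needs the size of every cluster to form the selection weights. The abstract's challenge is the analysis side of the same fact: the sizes of nonsampled clusters are unknown to the analyst and have to be predicted.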