A Permutation Approach for Selecting the Penalty Parameter in Penalized Model Selection
We describe a simple, efficient, permutation based procedure for selecting
the penalty parameter in the LASSO. The procedure, which is intended for
applications where variable selection is the primary focus, can be applied in a
variety of structural settings, including generalized linear models. We briefly
discuss connections between permutation selection and existing theory for the
LASSO. In addition, we present a simulation study and an analysis of three real
data sets in which permutation selection is compared with cross-validation
(CV), the Bayesian information criterion (BIC), and a selection method based on
recently developed testing procedures for the LASSO.
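The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of permutation selection: permute the response to break its link with the predictors, record the smallest penalty at which the LASSO selects no variables, and take the median over permutations. The function name `permutation_lambda`, the use of the median, and the scaling of the objective as (1/(2n))·||y − Xb||² + λ·||b||₁ are assumptions made for illustration, not details confirmed by the source.

```python
import numpy as np

def permutation_lambda(X, y, n_perm=100, seed=0):
    """Hypothetical helper: pick a LASSO penalty from permuted responses.

    Assumes the objective (1/(2n)) * ||y - X b||^2 + lam * ||b||_1, for which
    the smallest penalty giving an all-zero solution is max|X^T y| / n
    (after centering, which stands in for an unpenalized intercept).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)   # center the predictors
    yc = y - y.mean()         # center the response
    lambdas = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(yc)  # shuffling destroys any real signal
        lambdas[b] = np.max(np.abs(Xc.T @ y_perm)) / n
    # Assumption: summarize the null penalties by their median.
    return float(np.median(lambdas))
```

The returned value could then be passed as the penalty to any LASSO solver that uses the same objective scaling; solvers that scale the loss differently would need the value rescaled accordingly.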
A Penalty Approach to Differential Item Functioning in Rasch Models
A new diagnostic tool for the identification of differential item functioning (DIF) is proposed. Classical approaches to DIF allow only a few subpopulations, such as ethnic groups, to be considered when investigating whether the solution of items depends on membership in a subpopulation. We propose an explicit model for differential item functioning that includes a set of variables, containing metric as well as categorical components, as potential candidates for inducing DIF. The ability to include a set of covariates entails that the model contains a large number of parameters. Regularized estimators, in particular penalized maximum likelihood estimators, are used to solve the estimation problem and to identify the items that induce DIF. It is shown that the method is able to detect items with DIF. Simulations and two applications demonstrate the applicability of the method.
Doubly Robust Inference when Combining Probability and Non-probability Samples with High-dimensional Data
Non-probability samples are becoming increasingly popular in survey statistics but
may suffer from selection biases that limit the generalizability of results to
the target population. We consider integrating a non-probability sample with a
probability sample which provides high-dimensional representative covariate
information of the target population. We propose a two-step approach for
variable selection and finite population inference. In the first step, we use
penalized estimating equations with folded-concave penalties to select
important variables for the sampling score of selection into the
non-probability sample and the outcome model. We show that the penalized
estimating equation approach enjoys the selection consistency property for
general probability samples. The major technical hurdle is due to the possible
dependence of the sample under the finite population framework. To overcome
this challenge, we construct martingales which enable us to apply the Bernstein
concentration inequality for martingales. In the second step, we focus on a
doubly robust estimator of the finite population mean and re-estimate the
nuisance model parameters by minimizing the asymptotic squared bias of the
doubly robust estimator. This estimating strategy mitigates the possible
first-step selection error and renders the doubly robust estimator root-n
consistent if either the sampling probability or the outcome model is correctly
specified.
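To make the second step concrete, here is a minimal sketch of a generic doubly robust estimator of a finite population mean that combines the two samples. The abstract does not give the estimator's formula, so the normalized (Hájek-type) form below, the function name `doubly_robust_mean`, and all argument names are assumptions for illustration; the fitted sampling scores and outcome-model predictions are taken as given inputs rather than estimated here.

```python
import numpy as np

def doubly_robust_mean(y_a, m_a, pi_a, m_b, d_b):
    """Hypothetical doubly robust estimate of a finite population mean.

    y_a  : outcomes observed in the non-probability sample A
    m_a  : outcome-model predictions for units in A
    pi_a : estimated sampling scores (selection probabilities) for units in A
    m_b  : outcome-model predictions for units in the probability sample B
    d_b  : design weights for units in B
    """
    y_a = np.asarray(y_a, dtype=float)
    m_a = np.asarray(m_a, dtype=float)
    pi_a = np.asarray(pi_a, dtype=float)
    m_b = np.asarray(m_b, dtype=float)
    d_b = np.asarray(d_b, dtype=float)

    n_hat_a = np.sum(1.0 / pi_a)  # population size estimated from A
    n_hat_b = np.sum(d_b)         # population size estimated from B

    # Inverse-probability-weighted residuals from A correct the model-based
    # prediction term computed on B.
    residual_term = np.sum((y_a - m_a) / pi_a) / n_hat_a
    prediction_term = np.sum(d_b * m_b) / n_hat_b
    return float(residual_term + prediction_term)
```

If the outcome model is exact, the residual term vanishes and the estimate reduces to the design-weighted mean of the predictions on B; if the predictions are all zero, it reduces to the inverse-probability-weighted mean of the outcomes in A. This is the sense in which only one of the two models needs to be right.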
Regularization in regression: comparing Bayesian and frequentist methods in a poorly informative situation
Using a collection of simulated and real benchmarks, we compare Bayesian and
frequentist regularization approaches under a low informative constraint when
the number of variables is almost equal to the number of observations.
This comparison includes new global noninformative
approaches for Bayesian variable selection built on Zellner's g-priors that are
similar to Liang et al. (2008). The interest of those calibration-free
proposals is discussed. The numerical experiments we present highlight the
appeal of Bayesian regularization methods, when compared with non-Bayesian
alternatives. They dominate frequentist methods in the sense that they provide
smaller prediction errors while selecting the most relevant variables in a
parsimonious way.
Sparse regulatory networks
In many organisms the expression levels of each gene are controlled by the
activation levels of known "Transcription Factors" (TF). A problem of
considerable interest is that of estimating the "Transcription Regulation
Networks" (TRN) relating the TFs and genes. While the expression levels of
genes can be observed, the activation levels of the corresponding TFs are
usually unknown, greatly increasing the difficulty of the problem. Based on
previous experimental work, it is often the case that partial information about
the TRN is available. For example, certain TFs may be known to regulate a given
gene or in other cases a connection may be predicted with a certain
probability. In general, the biology of the problem indicates there will be
very few connections between TFs and genes. Several methods have been proposed
for estimating TRNs. However, they all suffer from problems such as unrealistic
assumptions about prior knowledge of the network structure or computational
limitations. We propose a new approach that can directly utilize prior
information about the network structure in conjunction with observed gene
expression data to estimate the TRN. Our approach uses penalties on the
network to ensure a sparse structure. This has the advantage of being
computationally efficient as well as making many fewer assumptions about the
network structure. We use our methodology to construct the TRN for E. coli and
show that the estimate is biologically sensible and compares favorably with
previous estimates. Comment: Published at http://dx.doi.org/10.1214/10-AOAS350 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org