An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Multiple imputation is a common approach for dealing with missing values in
statistical databases. The imputer fills in missing values with draws from
predictive models estimated from the observed data, resulting in multiple,
completed versions of the database. Researchers have developed a variety of
default routines to implement multiple imputation; however, there has been
limited research comparing the performance of these methods, particularly for
categorical data. We use simulation studies to compare repeated sampling
properties of three default multiple imputation methods for categorical data,
including chained equations using generalized linear models, chained equations
using classification and regression trees, and a fully Bayesian joint
distribution based on Dirichlet Process mixture models. We base the simulations
on categorical data from the American Community Survey. In the circumstances of
this study, the results suggest that default chained equations approaches based
on generalized linear models are dominated by the default regression tree and
Bayesian mixture model approaches. They also suggest competing advantages for
the regression tree and Bayesian mixture model approaches, making both
reasonable default engines for multiple imputation of categorical data.
Supplementary material for this article is available online.
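For illustration only, the sketch below shows one way a chained-equations cycle with a classification tree as the conditional model could be coded for categorical data; the data frame, column handling, and tree settings are hypothetical, and the study above relies on established default routines (such as the mice package in R) rather than a hand-rolled loop like this.

# Minimal sketch (assumptions flagged above): one chained-equations pass per cycle,
# drawing each missing categorical value from a classification tree's leaf probabilities.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def cart_chained_imputation(df, n_cycles=5, seed=None):
    rng = np.random.default_rng(seed)
    df = df.copy()
    miss = {c: df[c].isna() for c in df.columns}
    # initialise missing cells with random draws from the observed values
    for c in df.columns:
        obs = df.loc[~miss[c], c].to_numpy()
        df.loc[miss[c], c] = rng.choice(obs, size=miss[c].sum())
    for _ in range(n_cycles):
        for c in df.columns:
            if not miss[c].any():
                continue
            X = pd.get_dummies(df.drop(columns=c))
            tree = DecisionTreeClassifier(min_samples_leaf=5)
            tree.fit(X[~miss[c]], df.loc[~miss[c], c])
            # draw imputations from the estimated leaf class probabilities
            proba = tree.predict_proba(X[miss[c]])
            df.loc[miss[c], c] = [rng.choice(tree.classes_, p=p) for p in proba]
    return df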
High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression
Motivation: The high dimensionality of genomic data calls for the development
of specific classification methodologies, especially to prevent over-optimistic
predictions. This challenge can be tackled by compression and variable
selection, which combined constitute a powerful framework for classification,
as well as data visualization and interpretation. However, currently proposed
combinations lead to unstable and non-convergent methods due to inappropriate
computational frameworks. We hereby propose a stable and convergent approach
for classification in high-dimensional settings based on sparse Partial Least Squares
(sparse PLS). Results: We start by proposing a new solution for the sparse PLS
problem that is based on proximal operators for the case of univariate
responses. Then we develop an adaptive version of the sparse PLS for
classification, which combines iterative optimization of logistic regression
and sparse PLS to ensure convergence and stability. Our results are confirmed
on synthetic and experimental data. In particular we show how crucial
convergence and stability can be when cross-validation is involved for
calibration purposes. Using gene expression data we explore the prediction of
breast cancer relapse. We also propose a multicategorical version of our method
for the prediction of cell types based on single-cell expression data.
Availability: Our approach is implemented in the plsgenomics R package. Comment: 9 pages, 3 figures, 4 tables + Supplementary Materials 8 pages, 3 figures, 10 tables.
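To give a rough sense of the compression step, the classical closed form for a single sparse PLS component with a univariate response can be written as soft-thresholding (the proximal operator of the l1 penalty) applied to the covariance vector X'y. The sketch below, with an illustrative lambda, shows only that generic building block, not the adaptive, logistic-regression-coupled algorithm described above.

import numpy as np

def soft_threshold(v, lam):
    # proximal operator of the l1 norm: shrink toward zero and clip at zero
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_component(X, y, lam):
    # first sparse PLS weight vector for a univariate response:
    # soft-threshold the covariance vector X'y, then normalise
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = soft_threshold(Xc.T @ yc, lam)
    nrm = np.linalg.norm(w)
    if nrm > 0:
        w = w / nrm
    return w, Xc @ w   # sparse weight vector and latent score vector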
The chopthin algorithm for resampling
Resampling is a standard step in particle filters and more generally
sequential Monte Carlo methods. We present an algorithm, called chopthin, for
resampling weighted particles. In contrast to standard resampling methods the
algorithm does not produce a set of equally weighted particles; instead it
merely enforces an upper bound on the ratio between the weights. Simulation
studies show that the chopthin algorithm consistently outperforms standard
resampling methods. The algorithm chops up particles with large weight and
thins out particles with low weight, hence its name. It implicitly guarantees a
lower bound on the effective sample size. The algorithm can be implemented
efficiently, making it practically useful. We show that the expected
computational effort is linear in the number of particles. Implementations for
C++, R (on CRAN), Python and Matlab are available. Comment: 14 pages, 4 figures.
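The flavour of chop-and-thin resampling can be conveyed by a naive step around a fixed weight threshold, as in the toy sketch below; the published chopthin algorithm instead solves for the threshold so that the expected particle count and the weight-ratio bound are met exactly, so this code (with its arbitrary threshold choice) is only an unbiased illustration, not the authors' implementation.

import math
import numpy as np

def chop_and_thin(particles, weights, seed=None):
    # Toy illustration: split ("chop") heavy particles into equal-weight copies
    # and stochastically drop ("thin") light ones, so that each particle's
    # expected total weight is preserved and the post-step weights stay within
    # a bounded ratio of the threshold tau.
    rng = np.random.default_rng(seed)
    tau = np.mean(weights)            # arbitrary threshold; chopthin solves for it
    new_p, new_w = [], []
    for x, w in zip(particles, weights):
        if w > tau:
            k = math.ceil(w / tau)    # chop into k copies of equal weight
            new_p.extend([x] * k)
            new_w.extend([w / k] * k)
        elif rng.random() < w / tau:  # thin: keep with probability w/tau at weight tau
            new_p.append(x)
            new_w.append(tau)
    return np.array(new_p), np.array(new_w)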
Variable Selection in General Multinomial Logit Models
The use of the multinomial logit model is typically restricted to applications with few predictors, because in
high-dimensional settings, maximum likelihood estimates tend to deteriorate. In this paper we propose a sparsity-inducing penalty that accounts for the special structure of multinomial models. In contrast to existing methods, it penalizes the parameters that are linked to one variable
in a grouped way and thus yields variable selection instead of parameter selection. We develop a proximal gradient method that is able to efficiently compute stable estimates.
In addition, the penalization is extended to the important case of predictors that vary across response categories. We apply our estimator to the modeling of party choice of voters in Germany, including not only voter-specific variables like age and gender but also party-specific features like stance on nuclear energy and immigration.
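The grouping idea can be illustrated by the proximal (group soft-thresholding) update that such a penalty induces: all category-specific coefficients belonging to one predictor are shrunk jointly and either kept or set to zero as a block. The function below is a generic sketch of that update only, not the authors' full proximal gradient algorithm.

import numpy as np

def group_prox(B, lam, step):
    # B: (p, K-1) coefficient matrix, one row per predictor and one column per
    # non-reference response category. Each row is shrunk as a block, so a
    # predictor is either retained or removed for all categories at once.
    B = B.copy()
    for j in range(B.shape[0]):
        nrm = np.linalg.norm(B[j])
        if nrm <= step * lam:
            B[j] = 0.0                        # variable j dropped entirely
        else:
            B[j] *= 1.0 - step * lam / nrm    # joint shrinkage of the block
    return B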
Binary and Ordinal Random Effects Models Including Variable Selection
A likelihood-based boosting approach for fitting binary and ordinal mixed models is presented. In contrast to common procedures, it can be used in high-dimensional settings where a large number of potentially influential explanatory variables is available. Constructed as a componentwise boosting method, it is able to perform variable selection, with the complexity of the resulting estimator determined by information criteria. The method is investigated in simulation studies for both cumulative and sequential models and is illustrated using real data sets.
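For intuition, the componentwise selection mechanism can be sketched as repeated one-variable updates: at each step every candidate predictor receives a small likelihood-based refit against the current working residual, and only the single best-performing one is updated. The sketch below does this for a plain logistic model without random effects, so it illustrates the componentwise idea only, not the mixed-model estimator described above.

import numpy as np

def componentwise_logit_boost(X, y, n_steps=200, nu=0.1):
    # Componentwise boosting for a logistic model: variables never selected
    # keep a zero coefficient, which is how variable selection arises.
    n, p = X.shape
    beta = np.zeros(p)
    intercept = np.log(y.mean() / (1 - y.mean()))
    col_ss = np.sum(X ** 2, axis=0) + 1e-12
    for _ in range(n_steps):
        eta = intercept + X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        resid = y - mu                      # negative gradient of the log-loss
        cand = X.T @ resid / col_ss         # one-variable least-squares refits
        scores = cand * (X.T @ resid)       # fit improvement proxy per predictor
        j = int(np.argmax(scores))
        beta[j] += nu * cand[j]             # update only the best component
    return intercept, beta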
A method of moments estimator for semiparametric index models
We propose an easy-to-use, derivative-based two-step estimation procedure for semiparametric index models. In the first step various functionals involving the derivatives of the unknown function are estimated using nonparametric kernel estimators. The functionals used provide moment conditions for the parameters of interest, which are used in the second step within a method-of-moments framework to estimate the parameters of interest. The estimator is shown to be root-N consistent and asymptotically normal. We extend the procedure to multiple equation models. Our identification conditions and estimation framework provide natural tests for the number of indices in the model. In addition we discuss tests of separability, additivity, and linearity of the influence of the indices. Keywords: semiparametric estimation, multiple index models, average derivative functionals, generalized method of moments estimator, rank testing.
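In the single-index case the first step can be illustrated very simply: the sample average of the nonparametrically estimated gradient of E[y|x] is proportional to the index coefficients, so a kernel regression plus numerical differentiation already yields a moment-based estimate of the index direction. The code below is such a naive sketch (Nadaraya-Watson with a fixed, arbitrary bandwidth and finite differences), not the multi-equation GMM procedure of the paper.

import numpy as np

def nw_regression(x0, X, y, h):
    # Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian product kernel
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / h ** 2)
    return np.sum(w * y) / np.sum(w)

def average_derivative_direction(X, y, h=0.5, eps=1e-2):
    # Average the numerical gradient of the kernel regression over the sample;
    # for a single-index model E[y|x] = g(x'beta) this average is proportional to beta.
    n, p = X.shape
    grad_sum = np.zeros(p)
    for i in range(n):
        for j in range(p):
            x_plus, x_minus = X[i].copy(), X[i].copy()
            x_plus[j] += eps
            x_minus[j] -= eps
            grad_sum[j] += (nw_regression(x_plus, X, y, h)
                            - nw_regression(x_minus, X, y, h)) / (2 * eps)
    direction = grad_sum / n
    return direction / np.linalg.norm(direction)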
Hazard function models to estimate mortality rates affecting fish populations with application to the sea mullet (Mugil cephalus) fishery on the Queensland coast (Australia)
Fisheries management agencies around the world collect age data for the
purpose of assessing the status of natural resources in their jurisdiction.
Estimates of mortality rates represent key information for assessing the
sustainability of fish stock exploitation. Contrary to medical research or
manufacturing where survival analysis is routinely applied to estimate failure
rates, survival analysis has seldom been applied in fisheries stock assessment
despite similar purposes between these fields of applied statistics. In this
paper, we developed hazard functions to model the dynamics of an exploited fish
population. These functions were used to estimate all parameters necessary for
stock assessment (including natural and fishing mortality rates as well as gear
selectivity) by maximum likelihood using age data from a sample of catch. This
novel application of survival analysis to fisheries stock assessment was tested
by Monte Carlo simulations to verify that it provided unbiased estimates of
relevant quantities. The method was applied to data from the Queensland
(Australia) sea mullet (Mugil cephalus) commercial fishery collected between
2007 and 2014. It provided, for the first time, an estimate of natural
mortality affecting this stock: 0.22 ± 0.08 year^-1.
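To see how age data enter such an estimator, consider the simplest special case of a constant total mortality rate Z: ages at capture beyond full recruitment then follow an exponential (constant-hazard) distribution, and the maximum likelihood estimate of Z is the reciprocal of the mean age in excess of the recruitment age. The sketch below shows only this toy calculation with simulated ages; the paper's hazard functions additionally separate natural from fishing mortality and account for gear selectivity.

import numpy as np

def constant_hazard_mle(ages, age_recruit):
    # Constant-hazard (exponential) model for ages at capture above the age of
    # full recruitment: the MLE of total mortality Z is 1 / mean excess age.
    excess = np.asarray(ages, dtype=float) - age_recruit
    excess = excess[excess >= 0]
    return 1.0 / excess.mean()        # per year, if ages are in years

# Hypothetical usage: simulated ages with Z = 0.5 per year, fully recruited at age 2
rng = np.random.default_rng(1)
ages = 2 + rng.exponential(scale=1 / 0.5, size=5000)
print(constant_hazard_mle(ages, age_recruit=2))   # close to 0.5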