An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Multiple imputation is a common approach for dealing with missing values in
statistical databases. The imputer fills in missing values with draws from
predictive models estimated from the observed data, resulting in multiple,
completed versions of the database. Researchers have developed a variety of
default routines to implement multiple imputation; however, there has been
limited research comparing the performance of these methods, particularly for
categorical data. We use simulation studies to compare repeated sampling
properties of three default multiple imputation methods for categorical data,
including chained equations using generalized linear models, chained equations
using classification and regression trees, and a fully Bayesian joint
distribution based on Dirichlet Process mixture models. We base the simulations
on categorical data from the American Community Survey. In the circumstances of
this study, the results suggest that default chained equations approaches based
on generalized linear models are dominated by the default regression tree and
Bayesian mixture model approaches. They also suggest competing advantages for
the regression tree and Bayesian mixture model approaches, making both
reasonable default engines for multiple imputation of categorical data.
Supplementary material for this article is available online.
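As a concrete illustration of the chained-equations idea, here is a minimal
Python sketch of CART-based multiple imputation for categorical data,
assuming a pandas DataFrame df whose columns are all categorical; the
function and its defaults are hypothetical, not the study's default routines.

    # Minimal sketch of multiple imputation by chained equations (MICE)
    # with CART as the per-column model; illustrative only.
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    def mice_cart(df, n_imputations=5, n_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        completed = []
        for _ in range(n_imputations):
            data = df.copy()
            for col in data.columns:
                miss = data[col].isna()
                # Initialize by drawing from the observed values.
                data.loc[miss, col] = rng.choice(
                    data.loc[~miss, col].to_numpy(), miss.sum())
            for _ in range(n_iter):
                for col in df.columns:
                    miss = df[col].isna()
                    if not miss.any():
                        continue
                    X = pd.get_dummies(data.drop(columns=col))
                    tree = DecisionTreeClassifier(min_samples_leaf=5)
                    tree.fit(X[~miss], data.loc[~miss, col])
                    # Draw from leaf-level class probabilities rather than
                    # taking the modal prediction, to propagate uncertainty.
                    proba = tree.predict_proba(X[miss])
                    data.loc[miss, col] = [
                        rng.choice(tree.classes_, p=p) for p in proba]
            completed.append(data)
        return completed

Each element of the returned list is one completed version of the database;
analyses are run on each and combined with the usual multiple imputation
rules.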
Robust bootstrap procedures for the chain-ladder method
Insurers are faced with the challenge of estimating the future reserves
needed to handle historic and outstanding claims that are not fully settled. A
well-known and widely used technique is the chain-ladder method, which is a
deterministic algorithm. To include a stochastic component one may apply
generalized linear models to the run-off triangles based on past claims data.
Analytical expressions for the standard deviation of the resulting reserve
estimates are typically difficult to derive. A popular alternative approach to
obtain inference is to use the bootstrap technique. However, the standard
procedures are very sensitive to the possible presence of outliers. These
atypical observations, deviating from the pattern of the majority of the data,
may inflate or deflate traditional reserve estimates and distort the
corresponding inference, such as their standard errors. Even when paired with
a robust
chain-ladder method, classical bootstrap inference may break down. Therefore,
we discuss and implement several robust bootstrap procedures in the claims
reserving framework and we investigate and compare their performance on both
simulated and real data. We also illustrate their use for obtaining the
distribution of one-year risk measures.
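For reference, the deterministic chain-ladder step itself is short; a minimal
sketch on a cumulative run-off triangle (rows are accident years, columns are
development periods) might look as follows, with invented figures.

    # Classical chain-ladder projection; NaN marks future cells.
    import numpy as np

    triangle = np.array([
        [100., 180., 210., 220.],
        [110., 200., 235., np.nan],
        [120., 215., np.nan, np.nan],
        [130., np.nan, np.nan, np.nan],
    ])

    filled = triangle.copy()
    for j in range(triangle.shape[1] - 1):
        obs = ~np.isnan(triangle[:, j + 1])   # rows observed in both columns
        # Volume-weighted development factor for period j -> j+1.
        f = triangle[obs, j + 1].sum() / triangle[obs, j].sum()
        future = np.isnan(filled[:, j + 1])
        filled[future, j + 1] = filled[future, j] * f

    # Reserve per accident year: ultimate minus latest observed cumulative.
    reserves = filled[:, -1] - np.nanmax(triangle, axis=1)
    print(reserves, reserves.sum())

The bootstrap procedures discussed in the paper resample residuals around
this kind of projection to obtain a predictive distribution of the reserves.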
Q-learning: flexible learning about useful utilities
Dynamic treatment regimes are fast becoming an important part of medicine, with the corresponding change in emphasis from treatment of the disease to treatment of the individual patient. Because of the limited number of trials to evaluate personally tailored treatment sequences, inferring optimal treatment regimes from observational data has increased importance. Q-learning is a popular method for estimating the optimal treatment regime, originally in randomized trials but more recently also in observational data. Previous applications of Q-learning have largely been restricted to continuous utility end-points with linear relationships. This paper is the first attempt both to extend the framework to discrete utilities and to move the modelling of covariates from linear models to the more flexible generalized additive model (GAM) framework. Simulated data results show that the GAM-adapted Q-learning typically outperforms Q-learning with linear models and other frequently used methods based on propensity scores in terms of coverage and bias/MSE. This represents a promising step toward a more fully general Q-learning approach to estimating optimal dynamic treatment regimes.
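As a rough illustration of the backward induction at the heart of Q-learning, the following sketch fits two-stage linear Q-functions to simulated data; the paper's extension would swap the linear regressors for GAM smooths, and all variables here are hypothetical.

    # Two-stage Q-learning with linear Q-functions (illustrative only).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)                     # baseline covariate
    a1 = rng.integers(0, 2, size=n)             # stage-1 treatment
    x2 = x1 + rng.normal(size=n)                # intermediate covariate
    a2 = rng.integers(0, 2, size=n)             # stage-2 treatment
    y = x2 * a2 + x1 * a1 + rng.normal(size=n)  # observed utility

    # Stage 2: regress the utility on history and stage-2 treatment.
    H2 = np.column_stack([x1, a1, x2, a2, x2 * a2])
    q2 = LinearRegression().fit(H2, y)

    def v2(a):  # predicted value when stage-2 treatment is forced to a
        return q2.predict(np.column_stack([x1, a1, x2, np.full(n, a), x2 * a]))

    # Pseudo-outcome: the utility under the optimal stage-2 decision.
    y_tilde = np.maximum(v2(0), v2(1))

    # Stage 1: regress the pseudo-outcome on stage-1 history.
    H1 = np.column_stack([x1, a1, x1 * a1])
    q1 = LinearRegression().fit(H1, y_tilde)

    def v1(a):
        return q1.predict(np.column_stack([x1, np.full(n, a), x1 * a]))

    a1_opt = (v1(1) > v1(0)).astype(int)        # estimated stage-1 rule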
Zero-inflated truncated generalized Pareto distribution for the analysis of radio audience data
Extreme value data with a high clump-at-zero occur in many domains. Moreover,
it might happen that the observed data are either truncated below a given
threshold and/or might not be reliable enough below that threshold because of
the recording devices. These situations occur, in particular, with radio
audience data measured using personal meters that record environmental noise
every minute, that is then matched to one of the several radio programs. There
are therefore genuine zeros for respondents not listening to the radio, but
also zeros corresponding to real listeners for whom the match between the
recorded noise and the radio program could not be achieved. Since radio
audiences are important for radio broadcasters in order, for example, to
determine advertisement price policies, possibly according to the type of
audience at different time points, it is essential to be able to explain not
only the probability of listening to the radio but also the average time spent
listening to the radio by means of the characteristics of the listeners. In
this paper we propose a generalized linear model for the zero-inflated
truncated Pareto distribution (ZITPo) that we use to fit radio audience data.
Because it is based on the generalized Pareto distribution, the ZITPo model
has nice properties, such as invariance to the choice of the threshold, and a
natural residual measure can be derived from it to assess the model's fit to
the data. From a general formulation of the most popular models for
zero-inflated
data, we derive our model by considering successively the truncated case, the
generalized Pareto distribution and then the inclusion of covariates to explain
the nonzero proportion of listeners and their average listening time. By means
of simulations, we study the performance of the maximum likelihood estimator
(and derived inference) and use the model to fully analyze the audience data of
a radio station in a certain area of Switzerland.

Comment: Published at http://dx.doi.org/10.1214/10-AOAS358 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
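To make the model family concrete, the sketch below writes down a simplified
likelihood for a zero-inflated, left-truncated generalized Pareto model
without covariates; the parameterization, threshold and data are illustrative
assumptions, not the paper's full ZITPo specification.

    # Zero-inflated, left-truncated GPD: a point mass p at zero and, for
    # positive observations above a known threshold u, a generalized
    # Pareto density for the exceedance y - u. Illustrative sketch only.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import genpareto

    def neg_log_lik(theta, y, u):
        logit_p, log_sigma, xi = theta
        p = 1.0 / (1.0 + np.exp(-logit_p))  # zero-inflation probability
        sigma = np.exp(log_sigma)           # GPD scale, kept positive
        zeros = y == 0
        exceed = y[~zeros] - u
        if np.any(exceed <= 0):
            return np.inf
        ll = zeros.sum() * np.log(p) + (~zeros).sum() * np.log1p(-p)
        ll += genpareto.logpdf(exceed, c=xi, scale=sigma).sum()
        return -ll

    # Hypothetical data: genuine/unmatched zeros plus listening times
    # above a threshold of u = 1 minute.
    rng = np.random.default_rng(1)
    y = np.where(rng.random(300) < 0.4, 0.0,
                 1.0 + genpareto.rvs(c=0.2, scale=5.0, size=300,
                                     random_state=2))
    fit = minimize(neg_log_lik, x0=[0.0, 1.0, 0.1], args=(y, 1.0),
                   method="Nelder-Mead")

In the paper's model, both the nonzero proportion and the mean listening time
are further linked to listener characteristics through covariates.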
A Generalized Estimating Equations Approach to Model Heterogeneity and Time Dependence in Capture-Recapture Studies
Individual heterogeneity in capture probabilities and time dependence are fundamentally important for estimating closed animal population parameters in capture-recapture studies. We present a generalized estimating equations (GEE) approach that accounts for both linear correlation among capture-recapture occasions and individual heterogeneity in capture probabilities within a closed-population model allowing for individual heterogeneity and time variation. The estimated capture probabilities are used to estimate the animal population parameters. Two real data sets are used for illustrative purposes, and a simulation study is carried out to assess the performance of the GEE estimator. A Quasi-Likelihood Information Criterion (QIC) is applied to select the best-fitting model. The approach performs well when the estimated population parameters depend on individual heterogeneity and on the nature of the linear correlation among capture-recapture occasions.
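As an illustration of the kind of model fit involved, here is a minimal sketch using statsmodels' GEE with a logit link and an exchangeable working correlation across occasions; the data layout, covariates and simulated effects are hypothetical, not the paper's examples.

    # GEE for capture-recapture: one binary capture indicator per animal
    # per occasion, correlated within animal. Illustrative sketch only.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n_animals, n_occasions = 200, 5
    df = pd.DataFrame({
        "animal": np.repeat(np.arange(n_animals), n_occasions),
        "occasion": np.tile(np.arange(n_occasions), n_animals),
    })
    # Individual heterogeneity: an animal-level effect on capture probability.
    effect = np.repeat(rng.normal(size=n_animals), n_occasions)
    logit_p = -1.0 + 0.8 * effect + 0.1 * df["occasion"]
    df["captured"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

    model = smf.gee("captured ~ C(occasion)", groups="animal", data=df,
                    family=sm.families.Binomial(),
                    cov_struct=sm.cov_struct.Exchangeable())
    result = model.fit()
    print(result.summary())
    print(result.qic())  # QIC, as used in the paper for model selection

Fitted capture probabilities from such a model can then be plugged into a closed-population estimator of abundance.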