Nonparametric inference in hidden Markov models using P-splines
Hidden Markov models (HMMs) are flexible time series models in which the
distributions of the observations depend on unobserved serially correlated
states. The state-dependent distributions in HMMs are usually taken from some
class of parametrically specified distributions. The choice of this class can
be difficult, and an unfortunate choice can have serious consequences, for
example for state estimates and forecasts, and more generally for the resulting
model complexity and interpretation, in particular with respect to the number of
states. We develop a novel approach for estimating the state-dependent
distributions of an HMM in a nonparametric way, which is based on the idea of
representing the corresponding densities as linear combinations of a large
number of standardized B-spline basis functions, imposing a roughness penalty
in order to maintain a good balance between goodness of fit and
smoothness. We illustrate the nonparametric modeling approach in a real data
application concerned with vertical speeds of a diving beaked whale,
demonstrating that, compared to parametric counterparts, it can lead to models
that are more parsimonious in terms of the number of states yet fit the data
equally well.
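As a compact sketch of the core construction (notation assumed here, not quoted from the paper): each state-dependent density is represented as a convex combination of K standardized B-spline basis densities, and a difference penalty on neighbouring coefficients is subtracted from the HMM log-likelihood,

f_i(x) = \sum_{k=1}^{K} a_{i,k}\,\phi_k(x), \qquad a_{i,k} \ge 0, \quad \sum_{k=1}^{K} a_{i,k} = 1,

\ell_p(\boldsymbol{\theta}) = \ell(\boldsymbol{\theta}) - \sum_{i=1}^{N} \lambda_i \sum_{k=3}^{K} \bigl(\Delta^2 a_{i,k}\bigr)^2, \qquad \Delta^2 a_{i,k} = a_{i,k} - 2a_{i,k-1} + a_{i,k-2},

where N is the number of states and the smoothing parameters \lambda_i govern the trade-off between goodness of fit and smoothness.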
Scalable visualisation methods for modern Generalized Additive Models
In the last two decades the growth of computational resources has made it
possible to handle Generalized Additive Models (GAMs) that formerly were too
costly for serious applications. However, the growth in model complexity has
not been matched by improved visualisations for model development and results
presentation. Motivated by an industrial application in electricity load
forecasting, we identify the areas where the lack of modern visualisation tools
for GAMs is particularly severe, and we address the shortcomings of existing
methods by proposing a set of visual tools that a) are fast enough for
interactive use, b) exploit the additive structure of GAMs, c) scale to large
data sets and d) can be used in conjunction with a wide range of response
distributions. All the new visual methods proposed in this work are implemented
in the mgcViz R package, which is available on the Comprehensive R Archive
Network.
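A minimal usage sketch, assuming the mgcViz interface as documented on CRAN (the simulated data and model are illustrative stand-ins, not the paper's electricity load application):

library(mgcv)
library(mgcViz)

set.seed(1)
dat <- gamSim(1, n = 1000, dist = "normal")  # mgcv's built-in test-data simulator
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

v <- getViz(b)  # convert the fitted GAM to a 'gamViz' object
# Layered plot of one smooth: fitted curve, confidence band, rug of covariate values
print(plot(sm(v, 1)) + l_fitLine() + l_ciLine() + l_rug())
check(v)  # residual-based model-checking plots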
Likelihood Inference for Models with Unobservables: Another View
There have been controversies among statisticians on (i) what to model and
(ii) how to make inferences from models with unobservables. One such
controversy concerns the difference between estimation methods for marginal
means, which do not necessarily have a probabilistic basis, and statistical
models with unobservables, which do. Another concerns likelihood-based
inference for statistical models with unobservables. This requires an
extended-likelihood framework, and we show how one such extension,
hierarchical likelihood, allows this to be done. Modeling of unobservables
leads to rich classes of new probabilistic models from which likelihood-type
inferences can be made naturally with hierarchical likelihood. Comment: This
paper is discussed in [arXiv:1010.0804], [arXiv:1010.0807], and
[arXiv:1010.0810], with a rejoinder at [arXiv:1010.0814]. Published in
Statistical Science (http://dx.doi.org/10.1214/09-STS277) by the Institute of
Mathematical Statistics (http://www.imstat.org/sts/).
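For orientation, the standard definition from the hierarchical-likelihood literature (notation assumed here, not quoted from the paper): for data y, unobservables v, and fixed parameters (\beta, \theta), the hierarchical (extended) log-likelihood is

h(\beta, \theta; y, v) = \log f(y \mid v; \beta, \theta) + \log f(v; \theta),

where inference about v maximizes h given the fixed parameters, and inference about (\beta, \theta) proceeds via adjusted profiling of h over v (e.g. Laplace-type adjustments).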
Using simulation studies to evaluate statistical methods
Simulation studies are computer experiments that involve creating data by
pseudorandom sampling. The key strength of simulation studies is the ability to
understand the behaviour of statistical methods because some 'truth' (usually
one or more parameters of interest) is known from the process of generating the data.
This allows us to consider properties of methods, such as bias. While widely
used, simulation studies are often poorly designed, analysed and reported. This
tutorial outlines the rationale for using simulation studies and offers
guidance for design, execution, analysis, reporting and presentation. In
particular, this tutorial provides: a structured approach for planning and
reporting simulation studies, which involves defining aims, data-generating
mechanisms, estimands, methods and performance measures ('ADEMP'); coherent
terminology for simulation studies; guidance on coding simulation studies; a
critical discussion of key performance measures and their estimation; guidance
on structuring tabular and graphical presentation of results; and new graphical
presentations. With a view to describing recent practice, we review 100
articles taken from Volume 34 of Statistics in Medicine that included at least
one simulation study and identify areas for improvement. Comment: 31 pages, 9 figures (2 in appendix), 8 tables (1 in appendix).
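To make the ADEMP structure concrete, here is a minimal self-contained R sketch (the example is mine, not taken from the tutorial): aim, assess bias; data-generating mechanism, samples from N(0, sigma^2); estimand, sigma^2 = 4; methods, the maximum likelihood variance estimator (divisor n) versus the unbiased estimator (divisor n - 1); performance measure, bias with its Monte Carlo standard error.

set.seed(2024)
nsim <- 5000; n <- 20; sigma2 <- 4
est <- t(replicate(nsim, {
  y <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(ml = sum((y - mean(y))^2) / n,  # ML estimator, divisor n
    ub = var(y))                    # unbiased estimator, divisor n - 1
}))
bias <- colMeans(est) - sigma2          # estimated bias for each method
mcse <- apply(est, 2, sd) / sqrt(nsim)  # Monte Carlo SE of the bias estimates
round(rbind(bias, mcse), 4)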
A review of R-packages for random-intercept probit regression in small clusters
Generalized Linear Mixed Models (GLMMs) are widely used to model clustered categorical outcomes. To tackle the intractable integration over the random-effects distribution, several approximation approaches have been developed for likelihood-based inference. As these seldom yield satisfactory results when analyzing binary outcomes from small clusters, estimation within the Structural Equation Modeling (SEM) framework has been proposed as an alternative. We compare the performance of R packages for random-intercept probit regression relying, within the GLMM framework, on the Laplace approximation, adaptive Gaussian quadrature (AGQ), Penalized Quasi-Likelihood (PQL), an MCMC implementation, and integrated nested Laplace approximation, and, within the SEM framework, on robust diagonally weighted least squares estimation. In terms of bias of the fixed- and random-effect estimators, SEM usually performs best for cluster size two, while AGQ prevails in terms of precision (mainly because of SEM's robust standard errors). As the cluster size increases, however, AGQ becomes the best choice for both bias and precision.
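A hedged illustration of two of the likelihood-based approaches compared, using lme4's documented interface (the simulated data and all names are mine, not from the study):

library(lme4)
set.seed(1)
K <- 100                                 # number of clusters
m <- 2                                   # cluster size two, the hardest case
cl <- rep(seq_len(K), each = m)
u <- rnorm(K)                            # random intercepts, sd = 1
x <- rnorm(K * m)
y <- as.integer(runif(K * m) < pnorm(-0.5 + x + u[cl]))  # latent-probit outcome
d <- data.frame(y, x, cl)
# Laplace approximation (nAGQ = 1) versus adaptive Gaussian quadrature (nAGQ = 15)
f_lap <- glmer(y ~ x + (1 | cl), data = d, family = binomial(link = "probit"), nAGQ = 1)
f_agq <- glmer(y ~ x + (1 | cl), data = d, family = binomial(link = "probit"), nAGQ = 15)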
An Empirical Comparison of Multiple Imputation Methods for Categorical Data
Multiple imputation is a common approach for dealing with missing values in
statistical databases. The imputer fills in missing values with draws from
predictive models estimated from the observed data, resulting in multiple,
completed versions of the database. Researchers have developed a variety of
default routines to implement multiple imputation; however, there has been
limited research comparing the performance of these methods, particularly for
categorical data. We use simulation studies to compare repeated sampling
properties of three default multiple imputation methods for categorical data,
including chained equations using generalized linear models, chained equations
using classification and regression trees, and a fully Bayesian joint
distribution based on Dirichlet Process mixture models. We base the simulations
on categorical data from the American Community Survey. In the circumstances of
this study, the results suggest that default chained equations approaches based
on generalized linear models are dominated by the default regression tree and
Bayesian mixture model approaches. They also suggest competing advantages for
the regression tree and Bayesian mixture model approaches, making both
reasonable default engines for multiple imputation of categorical data.
Supplementary material for this article is available online.
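A hedged sketch of two of the default routines compared, using the mice package's documented interface (the small built-in data set below stands in for the paper's American Community Survey extract; the fully Bayesian Dirichlet Process mixture engine is implemented separately, e.g. in the NPBayesImpute package):

library(mice)
data(nhanes2, package = "mice")  # small example data with missing values
# Chained equations with GLM-type conditional models (logistic / polytomous)
imp_glm <- mice(nhanes2, m = 5,
                defaultMethod = c("norm", "logreg", "polyreg", "polr"),
                printFlag = FALSE, seed = 1)
# Chained equations with classification and regression trees
imp_cart <- mice(nhanes2, m = 5, method = "cart", printFlag = FALSE, seed = 1)
# Analyze each completed data set and pool the results with Rubin's rules
fit <- with(imp_cart, lm(chl ~ age + bmi))
summary(pool(fit))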
Advocating better habitat use and selection models in bird ecology
Studies of habitat use and habitat selection represent a basic aspect of bird ecology, due to their importance in natural history, distribution, response to environmental changes, management and conservation. Basically, a statistical model that identifies environmental variables linked to a species' presence is searched for. In this sense, there is a wide array of analytical methods that identify important explanatory variables within a model, with higher explanatory and predictive power than classical regression approaches. However, some of these powerful models are not widespread in ornithological studies, partly because of their complex theory and, in some cases, difficulties in their implementation and interpretation. Here, I describe generalized linear models and five other statistical models for the analysis of bird habitat use and selection that outperform classical approaches: generalized additive models, mixed-effects models, occupancy models, binomial N-mixture models and decision trees (classification and regression trees, bagging, random forests and boosting). Each of these models has its benefits and drawbacks, but major advantages include dealing with non-normal distributions (the presence-absence and abundance data typically found in habitat use and selection studies), heterogeneous variances, non-linear and complex relationships among variables, lack of statistical independence and imperfect detection. To aid ornithologists in making use of the methods described, a readable description of each method is provided, as well as a flowchart along with some recommendations to help them decide on the most appropriate analysis. The use of these models in ornithological studies is encouraged, given their huge potential as statistical tools in bird ecology.
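As a minimal illustration of the baseline approach the review starts from, a presence-absence GLM in R (simulated data; all variable names are hypothetical):

set.seed(42)
n <- 300
canopy <- runif(n, 0, 100)  # canopy cover (%)
shrub <- runif(n, 0, 1)     # shrub density index
presence <- rbinom(n, 1, plogis(-2 + 0.04 * canopy + 1.5 * shrub))
m <- glm(presence ~ canopy + shrub, family = binomial)  # logistic regression
summary(m)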