Cross-validatory Model Comparison and Divergent Regions Detection using iIS and iWAIC for Disease Mapping
The well-documented problems associated with mapping raw rates of disease have resulted in an increased use of Bayesian hierarchical models to produce maps of "smoothed" estimates of disease rates. Two statistical problems arise in using Bayesian hierarchical models for disease mapping. The first problem is comparing the goodness of fit of various models, which can be used to test different hypotheses. The second problem is identifying outlier/divergent regions with unusually high or low residual risk of disease, or those whose disease rates are not well fitted. The results of outlier detection may generate further hypotheses as to what additional covariates might be necessary for explaining the disease. Leave-one-out cross-validatory (LOOCV) model assessment has been used for these two problems. However, actual LOOCV is time-consuming. This thesis introduces two methods, namely iIS and iWAIC, for approximating LOOCV using only Markov chain samples simulated from a posterior distribution based on the full data set. In iIS and iWAIC, we first integrate out the latent variables without reference to the held-out observation, then apply the IS and WAIC approximations to the integrated predictive density and evaluation function. We apply iIS and iWAIC to two real data sets. Our empirical results show that iIS and iWAIC can provide significantly better estimation of LOOCV model assessment than existing methods, including DIC, importance sampling, WAIC, posterior checking and ghosting methods.
Approximating Cross-validatory Predictive P-values with Integrated IS for Disease Mapping Models
An important statistical task in disease mapping problems is to identify
outlier/divergent regions with unusually high or low residual risk of disease.
Leave-one-out cross-validatory (LOOCV) model assessment is a gold standard for
computing predictive p-values that can flag such outliers. However, actual LOOCV
is time-consuming because one needs to re-simulate a Markov chain for each
posterior distribution in which an observation is held out as a test case. This
paper introduces a new method, called iIS, for approximating LOOCV with only
Markov chain samples simulated from a posterior based on a full data set. iIS
is based on importance sampling (IS). iIS integrates the p-value and the
likelihood of the test observation with respect to the distribution of the
latent variable without reference to the actual observation. The predictive
p-values computed with iIS can be proved to be equivalent to the LOOCV
predictive p-values, following the general theory for IS. We compare iIS and
three other existing methods in the literature using a lip cancer dataset
collected in Scotland. Our empirical results show that iIS provides predictive
p-values that are almost identical to the actual LOOCV predictive p-values
and outperforms the three existing methods, including the recently proposed
ghosting method by Marshall and Spiegelhalter (2007). Comment: 21 pages
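The integration step described in this abstract can be illustrated with a small sketch. It assumes a Poisson model with mean `expected_i * exp(b_i)` and a normal conditional distribution for the latent log relative risk `b_i` given everything else in each posterior draw; both assumptions, the mid-p form of the tail probability, and all function and argument names are hypothetical and are not taken from the paper.

```python
import math
import random


def poisson_pmf(y, mu):
    """Poisson probability mass at y for mean mu > 0."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def upper_tail_pvalue(y, mu):
    """P(Y > y) + 0.5 * P(Y = y) for Y ~ Poisson(mu) (assumed p-value form)."""
    cdf = sum(poisson_pmf(k, mu) for k in range(y + 1))
    return (1.0 - cdf) + 0.5 * poisson_pmf(y, mu)


def iis_pvalue(y_i, expected_i, cond_mean, cond_sd, n_inner=200, seed=0):
    """Integrated importance-sampling estimate of a LOOCV predictive p-value.

    cond_mean[s], cond_sd[s]: mean and sd of the (assumed normal) distribution
    of the latent b_i given the other quantities in posterior draw s, which
    does not involve y_i.
    """
    rng = random.Random(seed)
    num, den = 0.0, 0.0
    for m, sd in zip(cond_mean, cond_sd):
        lik, pval = 0.0, 0.0
        # Monte Carlo integration over the latent variable of region i.
        for _ in range(n_inner):
            mu = expected_i * math.exp(rng.gauss(m, sd))
            lik += poisson_pmf(y_i, mu)
            pval += upper_tail_pvalue(y_i, mu)
        lik /= n_inner
        pval /= n_inner
        # Importance weight: reciprocal of the integrated likelihood.
        weight = 1.0 / lik
        num += weight * pval
        den += weight
    return num / den
```

For example, `iis_pvalue(3, 1.0, [0.1, -0.2, 0.3], [0.5, 0.4, 0.6])` returns a value strictly between 0 and 1. The sketch does not guard against an integrated likelihood that underflows to zero, which would need handling on the log scale in practice.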
Bayesian comparison of latent variable models: Conditional vs marginal likelihoods
Typical Bayesian methods for models with latent variables (or random effects)
involve directly sampling the latent variables along with the model parameters.
In high-level software code for model definitions (using, e.g., BUGS, JAGS,
Stan), the likelihood is therefore specified as conditional on the latent
variables. This can lead researchers to perform model comparisons via
conditional likelihoods, where the latent variables are considered model
parameters. In other settings, however, typical model comparisons involve
marginal likelihoods where the latent variables are integrated out. This
distinction is often overlooked despite the fact that it can have a large
impact on the comparisons of interest. In this paper, we clarify and illustrate
these issues, focusing on the comparison of conditional and marginal Deviance
Information Criteria (DICs) and Watanabe-Akaike Information Criteria (WAICs) in
psychometric modeling. The conditional/marginal distinction corresponds to
whether the model should be predictive for the clusters that are in the data or
for new clusters (where "clusters" typically correspond to higher-level units
like people or schools). Correspondingly, we show that marginal WAIC
corresponds to leave-one-cluster out (LOcO) cross-validation, whereas
conditional WAIC corresponds to leave-one-unit out (LOuO). These results lead
to recommendations on the general application of the criteria to models with
latent variables. Comment: Manuscript in press at Psychometrika; 31 pages, 8 figures
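The conditional/marginal distinction can be made concrete with a toy random-intercept model, in which unit j of a cluster has y_j = mu + b + e_j with b ~ N(0, tau^2) and e_j ~ N(0, sigma^2). This model, and the function names below, are assumptions chosen for illustration rather than the psychometric models used in the paper.

```python
import math


def conditional_loglik(y, mu, b, sigma):
    """Per-unit log-likelihoods given the cluster's latent intercept b.

    These are the terms a conditional criterion works with: one per unit.
    """
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return [const - 0.5 * ((v - mu - b) / sigma) ** 2 for v in y]


def marginal_loglik(y, mu, tau, sigma):
    """Log-likelihood of a whole cluster with the intercept integrated out.

    The cluster is multivariate normal with covariance
    sigma^2 * I + tau^2 * 1 1^T; the determinant and quadratic form
    are written in closed form. One term per cluster.
    """
    n = len(y)
    resid = [v - mu for v in y]
    s2, t2 = sigma ** 2, tau ** 2
    logdet = (n - 1) * math.log(s2) + math.log(s2 + n * t2)
    quad = (sum(r * r for r in resid) - t2 / (s2 + n * t2) * sum(resid) ** 2) / s2
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)
```

Evaluating `conditional_loglik` at each posterior draw gives unit-level terms (the leave-one-unit-out view), whereas `marginal_loglik` gives a single term per cluster (the leave-one-cluster-out view).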
Scoring Model Predictions using Cross-Validation
We formalize a framework for quantitatively assessing agreement between two datasets that are assumed to come from two distinct data generating mechanisms. We propose a methodology for prediction scoring which provides a measure of the distance between two unobserved data generating mechanisms (DGMs), along the dimension of a particular model. The cross-validated scores can be used to evaluate preregistered hypotheses and to perform model validation in the face of complex statistical models. Using human behavior data from the Next Generation Social Science (NGS2) program, we demonstrate that prediction scores can be used as model assessment tools and that they can reveal insights based on data collected from different populations and across different settings. Our proposed cross-validated prediction scores are capable of quantifying true differences between data generating mechanisms, allow for the validation and assessment of complex models, and serve as valuable tools for reproducible research.
Improving the identification of antigenic sites in the H1N1 Influenza virus through accounting for the experimental structure in a sparse hierarchical Bayesian model
Understanding how genetic changes allow emerging virus strains to escape the protection afforded by vaccination is vital for the maintenance of effective vaccines. We use structural and phylogenetic differences between pairs of virus strains to identify important antigenic sites on the surface of the influenza A(H1N1) virus through the prediction of haemagglutination inhibition (HI) titre: pairwise measures of the antigenic similarity of virus strains. We propose a sparse hierarchical Bayesian model that can deal with the pairwise structure and inherent experimental variability in the H1N1 data through the introduction of latent variables. The latent variables represent the underlying HI titre measurement of any given pair of virus strains and help to account for the fact that, for any HI titre measurement between the same pair of virus strains, the difference in the viral sequence remains the same. Through accurately representing the structure of the H1N1 data, the model can select virus sites which are antigenic, while its latent structure achieves the computational efficiency that is required to deal with large virus sequence data, as typically available for the influenza virus. In addition to the latent variable model, we also propose a new method, the block‐integrated widely applicable information criterion (biWAIC), for selecting between competing models. We show how this enables us to select the random effects effectively when used with the model proposed and we apply both methods to an A(H1N1) data set.
Spatio temporal modeling of species distribution
The aim of this thesis is to study the spatial distribution of different groups from different perspectives and to analyse the different approaches to this
problem.
We move away from the classical approach, commonly used by ecologists, to more complex solutions already applied in several disciplines.
We focus on applying advanced modelling techniques in order to understand species distribution, species behaviour and the
relationships between them and environmental factors. We first used the most common models applied in ecology and then moved to more
advanced and complex perspectives.
From a general perspective, comparing the different models applied during the process, from MaxEnt to spatio-temporal models with
INLA, we can affirm that the models we have developed show better results than existing ones. It is difficult to compare
the different approaches, but the Bayesian approach shows more flexibility, and including a spatial field or a latent spatio-temporal
process allows residuals to be included as a proxy for unmeasured variables.
Compared with additive models with thin plate splines, probably considered one of the best methods for analysing species distribution
with presence-absence data and comparable to MaxEnt, CART and MARS, our results show a better fit and more flexibility in the design.
As a natural progression, we have realised that the Bayesian approach could be a better solution, or at least a different approach worth considering.
The main advantage of the Bayesian model formulation is the computational ease of model fitting and prediction compared to classical
geostatistical methods. To achieve this, instead of MCMC we have used the novel integrated nested Laplace approximation (INLA) through the
Stochastic Partial Differential Equation (SPDE) approach. The SPDE approach can be easily implemented and provides results in reasonable
computing time (compared with MCMC). We showed that SPDE is a useful tool in the analysis of species distribution. This modelling could
be extended to the spatio-temporal domain by incorporating an extra term for the temporal effect, using parametric or semiparametric
constructions to reflect linear, nonlinear, autoregressive or more complex behaviours.
We conclude that spatial and spatio-temporal Bayesian models are a valuable approach for understanding environmental
dynamics, not only because they allow more complex problems to be formulated and solved but also because of the easy understanding of the
implementation processes.