Bayesian Modeling of Presence-only Data
This thesis develops models and methods for statistical analysis of presence-only
data. Besides constructing new models, the emphasis is on the theoretical characteristics
of new models and on Bayesian prediction. Markov chain Monte Carlo
algorithms are developed for the new presence-only data models in order
to simulate the posterior distribution of the unknowns and the predictive distribution
of the variable of interest. The new methods are applied to simulated data. An
application in ecological science has been a driving force behind the work
Bayesian logistic regression for presence-only data
Presence-only data arise in situations in which a censoring mechanism acts on a binary response, so that the response can be observed only partially and only with respect to one outcome, usually denoting the presence of an attribute of interest. A typical example is the recording of species presence in ecological surveys. In this work a Bayesian approach to the analysis of presence-only data based on a two-level scheme is presented. A probability law and a case-control design are combined to handle the double source of uncertainty: one due to censoring and the other due to sampling. Using a stratified sampling design with non-overlapping strata, a new formulation of the logistic model for presence-only data is proposed. In particular, logistic regression with a linear predictor is considered. Estimation is carried out with a new Markov chain Monte Carlo algorithm with data augmentation, which does not require a priori knowledge of the population prevalence. The performance of the new algorithm is validated by means of extensive simulation experiments using three scenarios and comparison with optimal benchmarks. An application to data from the literature is reported in order to discuss the model behaviour in real-world situations, together with the results of an original study on termite occurrence data
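The data structure described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm or design: the function name `simulate_presence_only`, the single Gaussian covariate, the coefficient values, and the sample sizes are all assumptions made for illustration. It only shows what a presence-only sample looks like: a set of recorded presences plus a background sample whose true response is censored.

```python
import math
import random


def logistic(eta):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_presence_only(n_pop=5000, n_presence=200, n_background=800,
                           beta0=-1.0, beta1=1.5, seed=0):
    """Simulate a presence-only sample (all parameter values are hypothetical)."""
    rng = random.Random(seed)
    # Population: one covariate and a latent binary response per unit.
    x = [rng.gauss(0.0, 1.0) for _ in range(n_pop)]
    y = [1 if rng.random() < logistic(beta0 + beta1 * xi) else 0 for xi in x]

    # Only a subset of the true presences is ever recorded.
    presence_idx = [i for i in range(n_pop) if y[i] == 1]
    observed = rng.sample(presence_idx, min(n_presence, len(presence_idx)))

    # Background units are drawn from the whole population;
    # their response is censored, so the label is None.
    background = rng.sample(range(n_pop), n_background)

    records = [(x[i], 1) for i in observed] + [(x[i], None) for i in background]
    true_prevalence = sum(y) / n_pop  # unknown to the analyst in practice
    return records, true_prevalence


records, prevalence = simulate_presence_only()
n_labelled = sum(1 for _, label in records if label == 1)
n_censored = sum(1 for _, label in records if label is None)
print(n_labelled, n_censored)
```

The analyst sees only `records`; the population prevalence returned alongside it is the quantity that, according to the abstract, the proposed algorithm does not need to know in advance.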
Bayesian Modeling and MCMC Computation in Linear Logistic Regression for Presence-only Data
Presence-only data arise in situations in which, given a censoring
mechanism, a binary response can be observed only with respect to one outcome,
usually called presence. In this work we present a Bayesian approach
to the problem of presence-only data based on a two-level scheme. A
probability law and a case-control design are combined to handle the double
source of uncertainty: one due to the censoring and one due to the sampling. We
propose a new formalization for the logistic model with presence-only data that
allows further insight into inferential issues related to the model. We
concentrate on the case of the linear logistic regression and, in order to make
inference on the parameters of interest, we present a Markov Chain Monte Carlo
algorithm with data augmentation that does not require the a priori knowledge
of the population prevalence. A simulation study concerning 24,000 simulated
datasets related to different scenarios is presented comparing our proposal to
optimal benchmarks.
Comment: Affiliations: Fabio Divino, Division of Physics, Computer Science
and Mathematics, University of Molise; Giovanna Jona Lasinio and Natalia
Golini, Department of Statistical Sciences, University of Rome "La Sapienza";
Antti Penttinen, Department of Mathematics and Statistics, University of
Jyväskylä
Functional zoning of biodiversity profiles
Spatial mapping of biodiversity is crucial to investigate spatial variations
in natural communities. Several indices have been proposed in the literature to
represent biodiversity as a single statistic. However, these indices only
provide information on individual dimensions of biodiversity, thus failing to
grasp its complexity comprehensively. Consequently, relying solely on these
single indices can lead to misleading conclusions about the actual state of
biodiversity. In this work, we focus on biodiversity profiles, which provide a
more flexible framework to express biodiversity through non-negative and convex
curves, which can be analyzed by means of functional data analysis. By treating
the whole curves as single entities, we propose to achieve a functional zoning
of the region of interest by means of a penalized model-based clustering
procedure. This provides a spatial clustering of the biodiversity profiles,
which is useful for policy-makers both for conserving and managing natural
resources and revealing patterns of interest. Our approach is discussed through
the analysis of Harvard Forest Data, which provides information on the spatial
distribution of woody stems within a plot of the Harvard Forest
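To make the notion of a biodiversity profile concrete, the sketch below evaluates one commonly used profile family on a vector of species counts. The abstract does not say which family the authors use; the Patil and Taillie form, Delta(beta) = (1 - sum_i p_i^(beta+1)) / beta with the Shannon index as the limit at beta = 0, is assumed here, and the function name `diversity_profile` is hypothetical.

```python
import math


def diversity_profile(counts, betas):
    """Evaluate an assumed Patil-Taillie diversity profile on a grid of beta values.

    counts: species abundances at one site.
    betas:  values of beta >= -1 at which to evaluate the curve.
    """
    total = sum(counts)
    p = [c / total for c in counts if c > 0]  # relative abundances
    profile = []
    for b in betas:
        if abs(b) < 1e-12:
            # Limit as beta -> 0 is the Shannon index.
            profile.append(-sum(pi * math.log(pi) for pi in p))
        else:
            profile.append((1.0 - sum(pi ** (b + 1.0) for pi in p)) / b)
    return profile


# Four equally abundant species.
curve = diversity_profile([10, 10, 10, 10], [-1.0, 0.0, 1.0])
print(curve)
```

Under this assumed family, beta = -1 gives species richness minus one, beta = 0 the Shannon index, and beta = 1 the Simpson index, so a single curve carries the information of several classical indices at once.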
Agrimonia: a dataset on livestock, meteorology and air quality in the Lombardy region, Italy
The air in the Lombardy region, Italy, is among the most polluted in Europe because of limited air circulation and high emission levels. There is broad scientific consensus that the agricultural sector has a significant impact on air quality. To support studies quantifying the role of the agricultural and livestock sectors in air quality in Lombardy, this paper presents a harmonised dataset containing daily values of air quality, weather, emissions, livestock, and land and soil use in the years 2016–2021, for the Lombardy region. The daily scale is obtained by averaging hourly data and interpolating other variables. The pollutant data come from the European Environment Agency and the Lombardy Regional Environment Protection Agency, weather and emissions data from the European Copernicus programme, livestock data from the Italian zootechnical registry, and land and soil use data from the CORINE Land Cover project. The resulting dataset is designed to be used as is by those using air quality data for research
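The harmonisation to a daily scale mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the dataset's actual pipeline: the helper names `daily_mean` and `interpolate_daily` are hypothetical, the numbers are made up, and linear interpolation is an assumption, since the abstract does not state which interpolation method is used.

```python
from collections import defaultdict
from datetime import date, datetime


def daily_mean(hourly):
    """Average (timestamp, value) pairs within each calendar day."""
    buckets = defaultdict(list)
    for ts, value in hourly:
        buckets[ts.date()].append(value)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}


def interpolate_daily(d0, v0, d1, v1, day):
    """Linearly interpolate a sparsely observed variable to a given day (assumed method)."""
    weight = (day - d0).days / (d1 - d0).days
    return v0 + weight * (v1 - v0)


# Made-up hourly pollutant readings.
hourly = [
    (datetime(2016, 1, 1, 0), 10.0),
    (datetime(2016, 1, 1, 12), 20.0),
    (datetime(2016, 1, 2, 6), 30.0),
]
print(daily_mean(hourly))

# Made-up variable observed on two dates, interpolated to a day in between.
print(interpolate_daily(date(2016, 1, 1), 100.0, date(2016, 1, 11), 200.0, date(2016, 1, 6)))
```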
Spatiotemporal modelling of PM concentrations in Lombardy (Italy) -- A comparative study
This study presents a comparative analysis of three predictive models with an
increasing degree of flexibility: hidden dynamic geostatistical models (HDGM),
generalised additive mixed models (GAMM), and the random forest spatiotemporal
kriging models (RFSTK). These models are evaluated for their effectiveness in
predicting PM concentrations in Lombardy (North Italy) from 2016 to
2020. Despite their differing methodologies, all models capture the
spatiotemporal patterns in the air pollution data well and show similar
out-of-sample performance. Furthermore, the study delves into station-specific
analyses, revealing variable model performance contingent on localised
conditions. Model interpretation, facilitated by parametric coefficient
analysis and partial dependence plots, unveils consistent associations between
predictor variables and PM concentrations. Despite nuanced variations
in modelling spatiotemporal correlations, all models effectively account for
the underlying dependence. In summary, this study underscores the efficacy of
conventional techniques in modelling correlated spatiotemporal data,
concurrently highlighting the complementary potential of Machine Learning and
classical statistical approaches