36 research outputs found
VAT tax gap prediction: a 2-steps Gradient Boosting approach
Tax evasion is the illegal evasion of taxes by individuals, corporations, and
trusts. The revenue loss from tax avoidance can undermine the effectiveness and
equity of government policies. A standard measure of tax evasion is the tax
gap, which can be estimated as the difference between the total amounts of tax
theoretically collectable and the total amounts of tax actually collected in a
given period. This paper presents an original contribution to the bottom-up
approach, based on results from fiscal audits, through the use of Machine
Learning. The major disadvantage of bottom-up approaches is represented by
selection bias when audited taxpayers are not randomly selected, as in the case
of audits performed by the Italian Revenue Agency. Our proposal, based on a
2-steps Gradient Boosting model, produces a robust tax gap estimate and embeds
a solution to correct for the selection bias that does not require any
assumptions on the underlying data distribution. The 2-steps Gradient Boosting
approach is used to estimate the Italian Value-added tax (VAT) gap on
individual firms on the basis of fiscal and administrative data income tax
returns gathered from the Tax Administration Data Base, for the fiscal year 2011.
The proposed method significantly improves predictive performance with
respect to the classical parametric approaches.
Comment: 27 pages, 4 figures, 8 tables. Presented at NTTS 2019 conference. Under
review at another peer-reviewed journal
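The abstract above defines the tax gap as theoretical minus collected tax and names selection bias as the main weakness of bottom-up estimates. The sketch below illustrates both points on made-up numbers. It is not the authors' 2-steps Gradient Boosting model, whose steps the abstract does not spell out. It swaps in plain inverse-probability (Horvitz-Thompson) weighting and assumes the audit selection probabilities are known. All names and figures are hypothetical.

```python
def tax_gap(theoretical, collected):
    """Tax gap: tax theoretically collectable minus tax actually collected."""
    return theoretical - collected


def naive_total(audited, population_size):
    """Extrapolate the mean audited gap to the whole population.

    Biased when audits target firms that are more likely to evade.
    """
    mean_gap = sum(gap for gap, _ in audited) / len(audited)
    return mean_gap * population_size


def weighted_total(audited):
    """Horvitz-Thompson total: each audited gap is divided by its
    (assumed known) probability of having been selected for audit."""
    return sum(gap / prob for gap, prob in audited)


# Hypothetical population of 10 firms: 2 with a gap of 100, 8 with a gap of 10,
# so the true total gap is 2 * 100 + 8 * 10 = 280.
# Audits are not random: high-gap firms are audited with probability 0.5,
# low-gap firms with probability 0.125. One firm of each kind was audited.
audited = [
    (tax_gap(theoretical=300.0, collected=200.0), 0.5),
    (tax_gap(theoretical=60.0, collected=50.0), 0.125),
]

naive = naive_total(audited, population_size=10)
corrected = weighted_total(audited)
print(naive, corrected)
```

In this toy setting the naive extrapolation overstates the total gap because high-gap firms are over-represented among the audited ones, while the weighted total recovers the true value.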
A Bayesian Spatio-Temporal Extension to Poisson Auto-Regression: Modeling the Disease Infection Rate of COVID-19 in England
The COVID-19 pandemic provided many modeling challenges to investigate the
evolution of an epidemic process over areal units. A suitable encompassing
model must describe the spatio-temporal variations of the disease infection
rate of multiple areal processes while adjusting for local and global inputs.
We develop an extension to Poisson Auto-Regression that incorporates
spatio-temporal dependence to characterize the local dynamics while borrowing
information among adjacent areas. The specification includes up to two sets of
space-time random effects to capture the spatio-temporal dependence and a
linear predictor depending on an arbitrary set of covariates. The proposed
model, adopted in a fully Bayesian framework and implemented through a novel
sparse-matrix representation in Stan, provides a framework for evaluating local
policy changes over the whole spatial and temporal domain of the study. It has
been validated through a substantial simulation study and applied to the weekly
COVID-19 cases observed in the English local authority districts between May
2020 and March 2021. The model detects substantial spatial and temporal
heterogeneity and allows a full evaluation of the impact of two alternative
sets of covariates: the level of local restrictions in place and the value of
the Google Mobility Indices. The paper also formalizes various novel
model-based investigation methods for assessing additional aspects of disease
epidemiology.
Comment: 24 pages + supplementary, 8 figures, 12 tables
A non-parametric Hawkes process model of primary and secondary accidents on a UK smart motorway
A self-exciting spatio-temporal point process is fitted to incident data from the UK National Traffic Information Service to model the rates of primary and secondary accidents on the M25 motorway in a 12-month period during 2017-18. This process uses a background component to represent primary accidents, and a self-exciting component to represent secondary accidents. The background consists of periodic daily and weekly components, a spatial component and a long-term trend. The self-exciting components are decaying, unidirectional functions of space and time. These components are determined via kernel smoothing and likelihood estimation. Temporally, the background is stable across seasons with a daily double peak structure reflecting commuting patterns. Spatially, there are two peaks in intensity, one of which becomes more pronounced during the study period. Self-excitation accounts for 6-7% of the data with associated time and length scales around 100 minutes and 1 kilometre respectively. In-sample and out-of-sample validation are performed to assess the model fit. When we restrict the data to incidents that resulted in large speed drops on the network, the results remain coherent.
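The split into a background component and a self-exciting component can be made concrete with a minimal, purely temporal sketch of a Hawkes conditional intensity. It rests on several assumptions: the abstract describes nonparametric components obtained by kernel smoothing over space and time, whereas the code uses an exponential triggering kernel and a sinusoidal daily background only for illustration. The function names and all parameter values are hypothetical, with alpha and beta chosen loosely after the reported 6-7% share and 100-minute time scale.

```python
import math


def hawkes_intensity(t, event_times, background, alpha, beta):
    """Conditional intensity at time t (in minutes).

    background(t) stands in for the rate of primary events; each past event
    adds alpha * beta * exp(-beta * (t - t_i)), a decaying, one-directional
    contribution that stands in for secondary events.
    """
    excitation = sum(
        alpha * beta * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t  # only events strictly in the past can excite
    )
    return background(t) + excitation


def daily_background(t, base=0.002, amplitude=0.001, period=1440.0):
    """Hypothetical periodic background with a one-day period (1440 minutes)."""
    return base + amplitude * math.sin(2.0 * math.pi * t / period)


# Illustrative values: alpha is the expected number of secondary events
# triggered by one event, and 1 / beta is the decay time scale in minutes.
events = [30.0, 45.0, 120.0]
lam = hawkes_intensity(130.0, events, daily_background, alpha=0.07, beta=0.01)
print(lam)
```

Because the exponential kernel integrates to alpha, that parameter can be read as the average number of secondary events per event, which is why a value below one keeps the process from growing without bound.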
Bayesian hierarchical modeling and analysis for physical activity trajectories using actigraph data
Rapid developments in streaming data technologies are continuing to generate
increased interest in monitoring human activity. Wearable devices, such as
wrist-worn sensors that monitor gross motor activity (actigraphy), have become
prevalent. An actigraph unit continually records the activity level of an
individual, producing a very large amount of data at a high-resolution that can
be immediately downloaded and analyzed. While this kind of big data
includes both spatial and temporal information, the variation in such data
seems to be more appropriately modeled by considering stochastic evolution
through time while accounting for spatial information separately. We propose a
comprehensive Bayesian hierarchical modeling and inferential framework for
actigraphy data reckoning with the massive sizes of such databases while
attempting to offer full inference. Building upon recent developments in this
field, we construct Nearest Neighbour Gaussian Processes (NNGPs) for actigraphy
data to compute at large temporal scales. More specifically, we construct a
temporal NNGP and we focus on the optimized implementation of the collapsed
algorithm in this specific context. This approach permits improved model
scaling while also offering full inference. We test and validate our methods on
simulated data and subsequently apply and verify their predictive ability on an
original dataset concerning a health study conducted by the Fielding School of
Public Health of the University of California, Los Angeles.
Finite mixtures in capture-recapture surveys for modelling residency patterns in marine wildlife populations
In this work, the goal is to estimate the abundance of an animal population
using data coming from capture-recapture surveys. We leverage the prior
knowledge about the population's structure to specify a parsimonious finite
mixture model tailored to its behavioral pattern. Inference is carried out
under the Bayesian framework, where we discuss suitable prior specifications
that could alleviate label-switching and non-identifiability issues affecting
finite mixtures. We conduct simulation experiments to show the competitive
advantage of our proposal over less specific alternatives. Finally, the
proposed model is used to estimate the common bottlenose dolphins' population
size at the Tiber River estuary (Mediterranean Sea), using data collected via
photo-identification from 2018 to 2020. Results provide novel insights on the
population's size and structure, and shed light on some of the ecological
processes governing the population dynamics.
Nowcasting COVID-19 incidence indicators during the Italian first outbreak
A novel parametric regression model is proposed to fit incidence data typically collected during epidemics. The proposal is motivated by real-time monitoring and short-term forecasting of the main epidemiological indicators within the first outbreak of COVID-19 in Italy. Accurate short-term predictions, including the potential effect of exogenous or external variables, are provided. This makes it possible to accurately predict important characteristics of the epidemic (e.g., peak time and height), allowing for a better allocation of health resources over time. Parameter estimation is carried out in a maximum likelihood framework. All computational details required to reproduce the approach and replicate the results are provided.
Covid-19 in Italy: Modelling, communications, and collaborations
When Covid-19 arrived in Italy in early 2020, a group of statisticians came together to provide tools to make sense of the unfolding epidemic and to counter misleading media narratives. Here, members of StatGroup-19 reflect on their work to date.
Features of Primary Chronic Headache in Children and Adolescents and Validity of ICHD 3 Criteria
Introduction: Chronic headaches are not a rare condition in children and adolescents, with negative effects on their quality of life. Our aims were to investigate the clinical features of chronic headache and the usefulness of the International Classification of Headache Disorders 3rd edition (ICHD 3) criteria for the diagnosis in a cohort of pediatric patients. Methods: We retrospectively reviewed the charts of patients attending the Headache Center of Bambino Gesù Children and Insubria University Hospital during the 2010-2016 time interval. Statistical analysis was conducted to study possible correlations between: (a) chronic primary headache (CPH) and demographic data (age and sex), (b) CPH and headache qualitative features, (c) CPH and risk of medication overuse headache (MOH), and (d) CPH and response to prophylactic therapies. Moreover, we compared the diagnosis obtained by ICHD 3 vs. ICHD 2 criteria. Results: We included 377 patients with CPH (66.4% females, 33.6% males, under 18 years of age). CPH was less frequent under 6 years of age (0.8%; p < 0.05) and there was no correlation between age/sex and different CPH types. The risk of developing MOH was higher after 15 years of age (p < 0.05). When we compared the diagnosis obtained by ICHD 2 and ICHD 3 criteria we found a significant difference for the undefined diagnosis (2.6% vs. 7.9%; p < 0.05), while the diagnosis of probable chronic migraine was only possible by using the ICHD 2 criteria (11.9% of patients; p < 0.05). The main criterion that was not satisfied for a definitive diagnosis was the duration of the attacks, which was less than 2 h (70% of patients younger than 6 years; p < 0.005). Amitriptyline and topiramate were the most effective drugs (p < 0.05), although no significant difference was found between them (p > 0.05). Conclusion: The ICHD 3 criteria show limitations when applied to children under 6 years of age. The risk of developing MOH increases with age. Although our "real-world" study shows that amitriptyline and topiramate are the most effective drugs regardless of the CPH type, the lack of placebo-controlled data and the limited follow-up results did not allow us to draw conclusions about drug efficacy.
Innovative approaches in spatio-temporal modeling: handling data collected by new technologies
This thesis illustrates and puts in context two of the main research projects I worked on during my Ph.D. program, in collaboration with several national and international co-authors from "La Sapienza" and other prestigious universities. Both research lines concern spatial and spatio-temporal analysis of geo-referenced datasets, which is of broad and current interest in the statistical research literature and applications. My focus on this area of statistics was not planned before the start of the program. However, while pursuing my original research interests in the broader domain of Bayesian statistics, I realized there was an ever-increasing demand for viable and efficient statistical methods to analyze spatial and spatio-temporal data. That is a consequence of the extraordinary technological development that has affected data collection systems during the last few decades. Innovative, cutting-edge technologies have produced new devices that can record and store data and information about the most diverse phenomena, possibly at a fine spatial scale and with high temporal resolution. Such capabilities were just a dream up to 20 or 30 years ago. Spatial statistics methods are rapidly evolving to face this surge of novel data structures in various application fields: geology, meteorology, ecology, epidemiology, economics, politics, and more.
The first chapter of this thesis introduces the general idea behind spatial statistics, that is, the branch of statistics devoted to analyzing and modeling temporal and spatial structure in time and/or geo-referenced datasets. A brief historical introduction of its developments is provided, starting from the first (sometimes unwitting) applications of its logic to practical and theoretical problems at the end of the 19th century. Many methods and techniques in this domain evolved independently, driven by the specific needs of the application fields in which they were developed. The historical excursus leads to a coarse (but reasonable) distinction into three main areas: continuous spatial variations, discrete spatial variations, and spatial point patterns. These areas present further facets within themselves, making spatial statistics an incredibly diverse and rich topic. A truly comprehensive review would require an entire book to be written and maybe a lifetime to be thoroughly studied. Therefore, in the following chapters, the discussion is focused on specific areas and techniques used in the studies. Only those tools that proved valuable for the analysis performed in Alaimo Di Loro et al. (2021) and Kalair et al. (2020) are extensively treated.
The second chapter focuses on analyzing continuous spatial variation, which is the modeling of outcomes varying continuously over some space. First, the most relevant properties for continuous spatial processes are introduced; second, some of the most common methodologies for performing spatial interpolation of the mean trend and stochastic modeling of the residuals are listed and sketched. In particular, the chapter digresses on Spline Regression as a valid technique to capture the first-order structure in spatial data. Soon after, the Geo-Statistical methods and the Bayesian Hierarchical framework are presented as invaluable tools to attain the simultaneous estimation of the first and second-order structure of a process. Extension to spatio-temporal contexts is not as trivial as it may seem and must be approached with due care. An extensive discussion about the possible pitfalls and viable solutions is included in the same chapter. Finally, the problems arising in the analysis of big spatial data are highlighted in the last section, where the Nearest Neighbor Gaussian Process (NNGP, Datta et al. (2016a,b)) model is introduced as a highly scalable framework for providing full inference on massive spatial and spatio-temporal datasets.
The third chapter includes an extended version of the paper Alaimo Di Loro et al. (2021), currently under review and published as a pre-print. It describes how the aforementioned technological development has strongly affected human tracking and monitoring capabilities, generating substantial interest in monitoring human activity. New non-intrusive wearable devices, such as wrist-worn sensors that monitor gross motor activity (miniature accelerometers), can continuously record individual activity levels, producing massive amounts of high-resolution measurements. Analyzing such data needs to account for spatial and temporal information on trajectories or paths traversed by subjects wearing such devices. Inferential objectives include estimating a subject's physical activity levels along a given trajectory, identifying trajectories that are more likely to produce higher levels of physical activity for a given subject, and predicting expected levels of physical activity in any proposed new trajectory for a given set of health attributes. We argue that the underlying process is more appropriately modeled as a stochastic evolution through time while accounting for spatial information separately. Building upon recent developments in this field, we construct temporal processes using directed acyclic graphs (DAG) along the lines of the NNGP, include spatial dependence through penalized spline regression, and develop optimized implementations of the collapsed Markov chain Monte Carlo (MCMC) algorithm. The resulting Bayesian hierarchical modeling framework for the analysis of spatial-temporal actigraphy data proves able to deliver fully model-based inference on trajectories while accounting for subject-level health attributes and spatial-temporal dependencies. We undertake a comprehensive analysis of an original dataset from the Physical Activity through Sustainable Transport Approaches in Los Angeles (PASTA-LA) study to formally ascertain spatial zones and trajectories exhibiting significantly higher physical activity levels. Suggestions for further extensions and improvements on the currently adopted methodology are discussed in the last section of the chapter.
Chapter four marks a paradigm shift and introduces the basic theory and tools of spatial point pattern analysis. Some common probabilistic models for point processes are briefly discussed, with some of their properties and limitations highlighted. The rest of the chapter is instead entirely focused on the Hawkes process and its spatio-temporal extension. It is a particular kind of self-exciting point process that presents a strong inter-dependence structure. Although it was conceived in Hawkes (1971a), its use in statistical applications was for a long time limited to the analysis of earthquake dynamics. The recent growth of data at high temporal resolution, sometimes accompanied by spatial information, has favored its use in modeling event dynamics in diverse fields: finance, society, biology, etc. In particular, its defining properties are presented and state-of-the-art estimation methods of the spatio-temporal version are introduced.
In the fifth chapter, the semi-parametric Hawkes process with a periodic background originally introduced in Zhuang and Mateu (2019) is outlined. While very recent, it has already proved very useful for modeling phenomena that are likely to present a cyclic pattern. It assumes that primary events occur as an effect of the background intensity, while secondary events are associated with the self-excitation effect. There are sound motivations that justify its use in the context of road accident dynamics: for example, excitation may occur when a driver, reacting to the disruption of one accident, triggers a subsequent accident upstream of the first one. The proposed framework is tested on two original applications on two original sets of data: the first one, somewhat preliminary, involves the modeling and analysis of road accidents that occurred on the urban road network of Rome, in Italy; the second is instead a conclusive analysis recently published in Kalair et al. (2020), conducted on a collection of road accidents that occurred on the M25 London Orbital, in the United Kingdom. Adaptations of the original methodology to the road accident setting were deemed necessary in both cases to consider specific features of car accidents and the geometry of the underlying space. The final results permit a fruitful interpretation of the temporal and spatial background that detects the typical commuting behavior in the Rome and London communities. The self-excitation component appears to have slightly different intensities in the two contexts, suggesting excitation mechanisms that vary between urban networks and motorways.
Finally, the sixth chapter summarizes the main points of the thesis, highlighting the original contributions of the previous chapters. It also offers a take-home message about the fundamental importance of statistical modeling as a scientific tool for formulating and verifying hypotheses, one that must not be discouraged by new challenges and technological advancements.