Differentially Private Synthetic Heavy-tailed Data
The U.S. Census Bureau's Longitudinal Business Database (LBD) contains
employment and payroll information for all U.S. establishments and firms dating
back to 1976 and is an invaluable resource for economic research. However, the
sensitive information in the LBD requires confidentiality measures, which the
Census Bureau partially addressed by releasing a synthetic version (SynLBD)
that protects firms' privacy while keeping the data usable for research, though
without provable privacy guarantees. Generating synthetic heavy-tailed data
with a formal privacy guarantee while preserving high utility is a challenging
problem for data curators and researchers. In this paper, we propose using the
framework of differential privacy (DP), which offers strong provable protection
against arbitrary adversaries, to generate such data. Specifically, we use the
K-Norm Gradient Mechanism (KNG) with quantile regression for DP synthetic data
generation; this methodology offers the flexibility of the well-known
exponential mechanism while adding less noise. We further propose implementing
KNG in a stepwise and sandwich order, so that each new quantile estimate relies
on previously sampled quantiles, making more efficient use of the privacy-loss
budget. Through a simulation study and an application to the Synthetic
Longitudinal Business Database, we show that the proposed methods achieve
better data utility than the original KNG at the same privacy-loss budget.
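To make the mechanism concrete, the following is a minimal sketch of KNG applied to a single quantile; the notation is assumed for illustration rather than taken from the paper.

```latex
% Sketch of KNG for a single quantile (notation assumed, not the paper's own):
% release a private estimate \tilde{\theta} by sampling from
\[
  f(\theta) \;\propto\; \exp\!\Big( -\tfrac{\epsilon}{2\Delta}\,
      \big\| \nabla_\theta\, \ell(\theta; D) \big\| \Big),
  \qquad
  \ell(\theta; D) \;=\; \sum_{i=1}^{n} \rho_\tau (y_i - \theta),
  \quad
  \rho_\tau(u) \;=\; u\,\big(\tau - \mathbf{1}\{u < 0\}\big),
\]
% where \Delta bounds the sensitivity of the gradient. Because the check-loss
% gradient is small near the empirical \tau-quantile, the sampled estimate
% concentrates there, which is the source of the "flexibility of the
% exponential mechanism with less noise" claim.
```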
Small Area Estimation of Inequality Measures using Mixtures of Betas
Economic inequality measures for specific regions are crucial for understanding
spatial heterogeneity. Income surveys are generally designed to produce reliable
estimates at the country or macro-region level, so we implement a small area
model for a set of inequality measures (the Gini, Relative Theil, and Atkinson
indexes) to obtain micro-region estimates. Because inequality estimators are
defined on the unit interval and have skewed, heavy-tailed distributions, we
propose a Bayesian hierarchical area-level model involving a Beta mixture. An
application to EU-SILC data is carried out and a design-based simulation is
performed. Our model outperforms the standard Beta regression model in terms of
bias, coverage, and error. Moreover, we extend the analysis of inequality
estimators by deriving their approximate variance functions.
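One plausible reading of such an area-level Beta-mixture model can be written out as follows; the notation is illustrative, not the paper's exact specification.

```latex
% Area-level sketch (illustrative notation): a direct estimator
% \hat{\theta}_d of an inequality index in area d, with a K-component
% Beta-mixture linking model in the mean--precision parameterization.
\[
  \hat{\theta}_d \mid \theta_d, \phi_d \;\sim\;
      \mathrm{Beta}\big( \theta_d \phi_d,\; (1 - \theta_d)\,\phi_d \big),
  \qquad
  \theta_d \;\sim\; \sum_{k=1}^{K} \pi_k\,
      \mathrm{Beta}\big( \mu_k \psi_k,\; (1 - \mu_k)\,\psi_k \big).
\]
% The mixture prior accommodates the skewness and heavy tails on (0, 1)
% that a single Beta regression linking model cannot capture.
```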
New methodological contributions in time series clustering
Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01 [Abstract]
This thesis presents new procedures for cluster analysis of time series.
First, a two-stage procedure based on comparing the frequencies and magnitudes
of the absolute maxima of the spectral densities is proposed. Assuming that the
clustering purpose is to group series according to their underlying dependence
structures, a detailed study is also carried out of the clustering behavior of
a dissimilarity based on comparing estimated quantile autocovariance functions
(QAF). A prediction-based resampling algorithm proposed by Dudoit and Fridlyand
is adapted to select the optimal number of clusters. The asymptotic behavior of
the sample quantile autocovariances is studied, and an algorithm to determine
optimal combinations of lags and pairs of quantile levels for clustering is
introduced. The proposed metric is used to perform both hard and soft
partitioning-based clustering. First, a broad simulation study examines the
behavior of the proposed metric in crisp clustering using hierarchical and PAM
procedures. Then, a novel fuzzy C-medoids algorithm based on the
QAF dissimilarity is proposed. Three robust versions of this fuzzy algorithm
are also presented to deal with data containing outlier time series. Finally,
other approaches to soft clustering are explored, namely probabilistic
D-clustering and clustering based on mixture models.
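The quantity underpinning the QAF dissimilarity is standard and can be stated directly; the notation below is assumed for illustration.

```latex
% Quantile autocovariance at lag l and quantile levels (\tau, \tau'),
% the standard object behind the QAF dissimilarity (notation assumed):
\[
  \gamma_l(\tau, \tau') \;=\;
      \mathrm{Cov}\Big( \mathbf{1}\{ X_t \le q_\tau \},\;
                        \mathbf{1}\{ X_{t+l} \le q_{\tau'} \} \Big),
\]
% where q_\tau denotes the \tau-quantile of X_t. A dissimilarity between
% two series then compares their vectors of sample QAF values over a chosen
% grid of lags and quantile-level pairs, e.g. by Euclidean distance.
```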
The GARCH-EVT-Copula model and simulation in scenario-based asset allocation
Financial market integration, in particular portfolio allocations from advanced economies to South African markets, continues to strengthen volatility linkages and quicken volatility transmissions between participating markets. Largely as a result, South African portfolios are net recipients of returns and volatility shocks emanating from major world markets. In light of these, and other, sources of risk, this dissertation proposes a methodology to improve risk management systems in funds by building a contemporary asset allocation framework that offers practitioners an opportunity to explicitly model combinations of hypothesised global risks and their effects on investments. The framework models portfolio return variables and their key risk driver variables separately and then joins them to model their combined dependence structure. The separate modelling of univariate and multivariate (MV) components has the benefit of capturing the data generating processes with improved accuracy. Univariate variables were modelled using ARMA-GARCH-family structures paired with a variety of skewed and leptokurtic conditional distributions. Model residuals were fitted using the Peaks-over-Threshold method from Extreme Value Theory for the tails and a non-parametric kernel density for the interior, forming a complete semi-parametric distribution (SPD) for each variable. Asset and risk factor returns were then combined and their dependence structure jointly modelled with a MV Student t copula. Finally, the SPD margins and Student t copula were used to construct a MV meta t distribution. Monte Carlo simulations were generated from the fitted MV meta t distribution, on which an out-of-sample test was conducted. The 2014-to-2015 horizon served as an out-of-sample, forward-looking scenario for a set of key risk factors against which a hypothetical, diversified portfolio was optimised. Traditional mean-variance and contemporary mean-CVaR optimisation techniques were used and their results compared. As an addendum, performance over the in-sample 2008 financial crisis was reported.
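The pipeline described here (GARCH margins, probability-integral transforms, a Student t copula, Monte Carlo scenarios) can be sketched compactly. The following is a minimal illustration, not the dissertation's implementation: it uses the `arch` package, fixes the copula degrees of freedom, and substitutes an empirical CDF for the semi-parametric GPD-tail distribution; all variable names and parameter choices are assumptions.

```python
# Minimal sketch of a GARCH + PIT + Student-t copula simulation pipeline.
import numpy as np
from arch import arch_model
from scipy import stats

rng = np.random.default_rng(0)

def fit_garch_pit(returns):
    """Fit a GARCH(1,1) with t errors; return the result, the standardized
    residuals, and their empirical probability-integral transforms (used
    here in place of the semi-parametric GPD-tail distribution)."""
    res = arch_model(returns, vol="GARCH", p=1, q=1, dist="t").fit(disp="off")
    z = res.resid / res.conditional_volatility   # standardized residuals
    u = stats.rankdata(z) / (len(z) + 1)         # empirical PIT -> (0, 1)
    return res, z, u

# Two illustrative series: a portfolio return and a risk-factor return.
n = 2000
x = rng.standard_t(df=5, size=n) * 0.01
y = 0.6 * x + rng.standard_t(df=5, size=n) * 0.008

res_x, z_x, u_x = fit_garch_pit(x * 100)   # arch prefers percent returns
res_y, z_y, u_y = fit_garch_pit(y * 100)

# Student-t copula fitted by moments: map uniforms through a t quantile,
# estimate the correlation, then simulate from the implied multivariate t.
nu = 6.0                                   # copula dof, fixed for simplicity
t_x, t_y = stats.t.ppf(u_x, nu), stats.t.ppf(u_y, nu)
rho = np.corrcoef(t_x, t_y)[0, 1]

def simulate_copula(n_sims):
    """Draw dependent uniforms from the fitted Student-t copula."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    g = rng.multivariate_normal([0.0, 0.0], cov, size=n_sims)
    w = rng.chisquare(nu, size=n_sims) / nu
    return stats.t.cdf(g / np.sqrt(w)[:, None], nu)   # t draws -> uniforms

u_sim = simulate_copula(10_000)
# Map back to standardized-residual space via empirical quantiles, then scale
# by each model's one-step-ahead volatility forecast to get return scenarios.
z_sim_x = np.quantile(z_x, u_sim[:, 0])
z_sim_y = np.quantile(z_y, u_sim[:, 1])
sigma_x = np.sqrt(res_x.forecast(horizon=1).variance.iloc[-1, 0])
sigma_y = np.sqrt(res_y.forecast(horizon=1).variance.iloc[-1, 0])
scenarios = np.column_stack([z_sim_x * sigma_x, z_sim_y * sigma_y])
```

The scenario matrix would then feed a mean-variance or mean-CVaR optimiser, as in the out-of-sample test the abstract describes.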
The final Objective (7) addressed management and conservation strategies for the NMBM. The NMBM wetland database produced during this research is currently being used by the Municipality and will be added to the latest National Wetland Map. From the database, and the tools developed in this research, approximately 90 wetlands have been identified as highly vulnerable to anthropogenic and environmental factors (Chapter 6) and should be earmarked as key conservation priority areas. Based on field experience and the data collected, this study has also made conservation and rehabilitation recommendations for eight locations. Recommendations are also provided for six more wetland systems (or regions) that should be prioritised for further research, as these systems lack fundamental information on where the threat from anthropogenic activities is greatest. This study has made a significant contribution to understanding the underlying geomorphological processes in depressions, seeps and wetland flats. The desktop mapping component of this study illustrated the dominance of wetlands in the wetter parts of the Municipality. Perched wetland systems were identified in the field, on shallow bedrock, calcrete or clay. The prevalence of these perches in depressions, seeps and wetland flats also highlighted the importance of rainfall in driving wetland formation in the NMBM, by allowing water to pool on the perches. These perches are likely to be a key factor in the high number of small, ephemeral wetlands observed in the study area compared to other semi-arid regions. This research therefore highlights the value of multi-faceted and multi-scalar wetland research and shows how similar approaches could be applied in future studies. The approach, along with the tools and methods developed in this study, has facilitated the establishment of priority areas for conservation and management within the NMBM. Furthermore, the research approach has revealed emergent wetland properties that are only apparent when looking across different spatial scales. This research has highlighted the complex biological and geomorphological interactions between wetlands that operate over various spatial and temporal scales. As such, wetland management should occur across a wetland complex, rather than at individual sites, to account for these multi-scalar influences.
Econometrics of Machine Learning Methods in Economic Forecasting
This paper surveys recent advances in machine learning methods for economic
forecasting. The survey covers the following topics: nowcasting, textual data,
panel and tensor data, high-dimensional Granger causality tests, time series
cross-validation, and classification with economic losses.
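Of the surveyed topics, time series cross-validation is easy to make concrete. Below is a minimal sketch, not taken from the paper (function names and parameters are illustrative), of expanding-window cross-validation, where each validation block lies strictly after its training window so that temporal order is preserved.

```python
# Expanding-window time-series cross-validation: unlike i.i.d. K-fold CV,
# every validation block follows its training window in time.
import numpy as np

def expanding_window_splits(n_obs, n_folds, min_train):
    """Yield (train_idx, val_idx) pairs with an expanding training window."""
    fold_size = (n_obs - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = min(train_end + fold_size, n_obs)
        yield np.arange(train_end), np.arange(train_end, val_end)

# Example: score a 1-step-ahead mean forecast on simulated AR(1) data.
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

errors = []
for train, val in expanding_window_splits(len(y), n_folds=5, min_train=250):
    forecast = y[train].mean()              # placeholder model
    errors.append(np.mean((y[val] - forecast) ** 2))
print(f"CV MSE: {np.mean(errors):.3f}")
```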
Modelling South Africa's market risk using the APARCH model and heavy-tailed distributions.
Master of Science in Statistics. University of KwaZulu-Natal, Durban, 2016. Estimating the Value-at-Risk (VaR) of stock returns, especially in emerging economies, has recently attracted the attention of both academics and risk managers, mainly because stock returns have become relatively more volatile than their historical trend. VaR and other risk management tools, such as expected shortfall (conditional VaR), are highly dependent on an appropriate set of underlying distributional assumptions. Thus, identifying a distribution that best captures all aspects of financial returns is of great interest to both academics and risk managers. This study therefore compares the relative performance of GARCH-type models combined with heavy-tailed distributions, namely the skew Student's t distribution, the Pearson Type IV distribution (PIVD), the Generalized Pareto distribution (GPD), the Generalized Extreme Value distribution (GEVD), and the stable distribution, in estimating the Value-at-Risk of South African All Share Index (ALSI) returns. Model adequacy is checked through a backtesting procedure using the Kupiec likelihood ratio test. The proposed models capture volatility clustering (conditional heteroskedasticity), the asymmetric (leverage) effect, and heavy-tailedness in the returns: the GARCH framework accounts for volatility clustering and the leverage effect, while the heavy-tailed conditional distribution models the heavy-tailed behaviour. The main findings indicate that the APARCH model combined with these heavy-tailed distributions performed well in modelling South African market risk at both long and short positions. When compared in terms of predictive ability, the APARCH model combined with the PIVD or the GPD gives better VaR estimates for the short position, while the APARCH model combined with the stable distribution gives better VaR estimates for the long position. Thus, the APARCH model combined with a heavy-tailed distribution provides a good alternative for modelling stock returns. The outcomes of this research are expected to be of value to financial analysts, portfolio managers, risk managers and financial market researchers, giving a better understanding of the South African market.
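For reference, the two standard objects this abstract leans on can be written out explicitly; these are the textbook forms, with notation assumed rather than taken from the thesis.

```latex
% APARCH(1,1) conditional-volatility recursion (Ding-Granger-Engle form)
% and the Kupiec proportion-of-failures backtest; notation illustrative.
\[
  \sigma_t^{\delta} \;=\; \omega
      + \alpha_1 \big( |\varepsilon_{t-1}| - \gamma_1 \varepsilon_{t-1} \big)^{\delta}
      + \beta_1\, \sigma_{t-1}^{\delta},
  \qquad \delta > 0,\; |\gamma_1| < 1,
\]
\[
  \mathrm{LR}_{\mathrm{POF}} \;=\;
      -2 \ln \frac{ (1-p)^{\,n-x}\, p^{\,x} }
                  { \big(1 - \tfrac{x}{n}\big)^{\,n-x} \big(\tfrac{x}{n}\big)^{\,x} }
      \;\sim\; \chi^2_1,
\]
% where x is the number of VaR violations observed in n days at nominal
% coverage p; the asymmetry parameter \gamma_1 lets negative shocks raise
% volatility more than positive ones (the leverage effect).
```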
Variational Elliptical Processes
We present elliptical processes, a family of non-parametric probabilistic
models that subsume Gaussian processes and Student's t processes. This
generalization includes a range of new heavy-tailed behaviors while retaining
computational tractability. Elliptical processes are based on a representation
of elliptical distributions as a continuous mixture of Gaussian distributions.
We parameterize this mixture distribution as a spline normalizing flow, which
we train using variational inference. The proposed form of the variational
posterior enables a sparse variational elliptical process applicable to
large-scale problems. We highlight advantages compared to Gaussian processes
through regression and classification experiments. Elliptical processes can
supersede Gaussian processes in several settings, including cases where the
likelihood is non-Gaussian or when accurate tail modeling is essential.Comment: 14 pages, 15 figures, appendix 9 page