112 research outputs found

    Differentially Private Synthetic Heavy-tailed Data

    Full text link
    The U.S. Census Longitudinal Business Database (LBD) contains employment and payroll information for all U.S. establishments and firms dating back to 1976 and is an invaluable resource for economic research. Because the LBD holds sensitive information, the U.S. Census Bureau has in part addressed confidentiality by releasing a synthetic version of the data (SynLBD) that protects firms' privacy while keeping the data usable for research, but without provable privacy guarantees. In this paper, we propose using the framework of differential privacy (DP), which offers strong provable protection against arbitrary adversaries, to generate synthetic heavy-tailed data with a formal privacy guarantee while preserving high levels of utility. We propose using the K-Norm Gradient Mechanism (KNG) with quantile regression for DP synthetic data generation; the proposed methodology offers the flexibility of the well-known exponential mechanism while adding less noise. We further propose implementing KNG in a stepwise and sandwich order, such that each new quantile estimate relies on previously sampled quantiles, to use the privacy-loss budget more efficiently. Generating synthetic heavy-tailed data with a formal privacy guarantee and high utility is a challenging problem for data curators and researchers, but through a simulation study and an application to the Synthetic Longitudinal Business Database we show that the proposed methods achieve better data utility than the original KNG at the same privacy-loss budget.
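
    A minimal sketch of the "sandwich" ordering described above: release the median first, then estimate each quartile only on the side of the released median, splitting the privacy-loss budget across the three releases. The paper's KNG-with-quantile-regression mechanism is not reproduced here; a standard exponential-mechanism quantile release stands in for it, and the data bounds, budget split, and function names are assumptions for illustration only.

```python
import numpy as np

def dp_quantile_expmech(x, q, eps, lo, hi, rng):
    """Release the q-quantile of x (clipped to [lo, hi]) via the exponential mechanism."""
    x = np.sort(np.clip(x, lo, hi))
    z = np.concatenate(([lo], x, [hi]))           # candidate interval endpoints
    n = x.size
    k = np.arange(n + 1)                          # data points below each interval
    utility = -np.abs(k - q * n)                  # sensitivity-1 utility
    widths = np.diff(z)
    logw = np.log(np.maximum(widths, 1e-12)) + 0.5 * eps * utility
    p = np.exp(logw - logw.max())
    p /= p.sum()
    i = rng.choice(n + 1, p=p)                    # pick an interval, then a point inside it
    return rng.uniform(z[i], z[i + 1])

def sandwich_quantiles(x, eps_total, lo, hi, seed=0):
    """Median first, then quartiles searched only on one side of the released median."""
    rng = np.random.default_rng(seed)
    eps = eps_total / 3.0                         # naive equal split; sequential composition
    med = dp_quantile_expmech(x, 0.50, eps, lo, hi, rng)
    q25 = dp_quantile_expmech(x, 0.25, eps, lo, med, rng)
    q75 = dp_quantile_expmech(x, 0.75, eps, med, hi, rng)
    return q25, med, q75

# Example on synthetic heavy-tailed (log-normal) data with assumed public bounds [0, 1e7]:
x = np.random.default_rng(1).lognormal(mean=10.0, sigma=1.5, size=5000)
print(sandwich_quantiles(x, eps_total=1.0, lo=0.0, hi=1e7))
```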

    Small Area Estimation of Inequality Measures using Mixtures of Betas

    Full text link
    Measures of economic inequality for specific regions are crucial for understanding spatial heterogeneity. Income surveys are generally designed to produce reliable estimates at the country or macro-region level, so we implement a small area model for a set of inequality measures (the Gini, relative Theil, and Atkinson indexes) to obtain micro-region estimates. Because inequality estimators are defined on the unit interval and have skewed, heavy-tailed distributions, we propose a Bayesian hierarchical area-level model involving a Beta mixture. An application to EU-SILC data is carried out and a design-based simulation is performed; our model outperforms the standard Beta regression model in terms of bias, coverage, and error. Moreover, we extend the analysis of inequality estimators by deriving their approximate variance functions. Comment: 28 pages, 7 figures, 2 tables, 2 pages of supplementary material.
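
    For reference, a small sketch of the direct point estimators of the inequality measures the paper models at area level. The paper smooths survey-weighted direct estimates with a Beta-mixture area-level model and uses a relative Theil index; the unweighted textbook versions below, the function names, and the Atkinson parameter choice are illustrative assumptions only.

```python
import numpy as np

def gini(x):
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # G = 2 * sum(i * x_(i)) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n
    return 2.0 * np.sum(np.arange(1, n + 1) * x) / (n * x.sum()) - (n + 1) / n

def theil(x):
    x = np.asarray(x, dtype=float)
    r = x / x.mean()
    return np.mean(r * np.log(r))                      # Theil's T index

def atkinson(x, eps=1.0):
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    if eps == 1.0:
        return 1.0 - np.exp(np.mean(np.log(x))) / mu   # geometric-mean form
    ede = np.mean(x ** (1.0 - eps)) ** (1.0 / (1.0 - eps))
    return 1.0 - ede / mu                              # equally-distributed-equivalent income

# Example: direct estimates for one small area's (positive) income sample.
incomes = np.random.default_rng(0).lognormal(mean=10.0, sigma=0.8, size=400)
print(gini(incomes), theil(incomes), atkinson(incomes, eps=1.0))
```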

    New methodological contributions in time series clustering

    Get PDF
    Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01
    [Abstract] This thesis presents new procedures for the cluster analysis of time series. First, a two-stage procedure based on comparing the frequencies and magnitudes of the absolute maxima of the spectral densities is proposed. Assuming that the purpose of clustering is to group series according to their underlying dependence structures, a detailed study of the clustering behavior of a dissimilarity based on comparing estimated quantile autocovariance functions (QAF) is also carried out. A prediction-based resampling algorithm proposed by Dudoit and Fridlyand is adapted to select the optimal number of clusters. The asymptotic behavior of the sample quantile autocovariances is studied, and an algorithm to determine optimal combinations of lags and pairs of quantile levels for clustering is introduced. The proposed metric is used to perform hard and soft partitioning-based clustering. First, a broad simulation study examines the behavior of the proposed metric in crisp clustering using hierarchical and PAM procedures. Then, a novel fuzzy C-medoids algorithm based on the QAF dissimilarity is proposed, along with three robust versions of this fuzzy algorithm to deal with data containing outlier time series. Finally, other approaches to soft clustering are explored, namely probabilistic D-clustering and clustering based on mixture models. [Resumo] and [Resumen]: Galician and Spanish versions of the abstract above.
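
    A compact sketch of the QAF-based dissimilarity studied in the thesis: each series is summarized by its sample quantile autocovariances over a grid of lags and pairs of quantile levels, and two series are compared via the Euclidean distance between these feature vectors. The specific lags and quantile levels below are illustrative choices, not the optimal combinations selected by the thesis's algorithm.

```python
import numpy as np
from itertools import product

def qaf_features(x, lags=(1, 2), taus=(0.1, 0.5, 0.9)):
    """Vector of sample quantile autocovariances over all (lag, tau1, tau2) combinations."""
    x = np.asarray(x, dtype=float)
    q = {t: np.quantile(x, t) for t in taus}
    feats = []
    for l, (t1, t2) in product(lags, product(taus, taus)):
        a = (x[:-l] <= q[t1]).astype(float) - t1   # centred indicator at lag 0
        b = (x[l:] <= q[t2]).astype(float) - t2    # centred indicator at lag l
        feats.append(np.mean(a * b))               # sample quantile autocovariance
    return np.array(feats)

def qaf_distance(x, y, **kw):
    return np.linalg.norm(qaf_features(x, **kw) - qaf_features(y, **kw))

# Example: an AR(1) series is typically closer to another AR(1) than to white noise.
rng = np.random.default_rng(0)
def ar1(phi, n=500):
    e = rng.standard_normal(n)
    s = np.zeros(n)
    for t in range(1, n):
        s[t] = phi * s[t - 1] + e[t]
    return s

print(qaf_distance(ar1(0.8), ar1(0.8)), qaf_distance(ar1(0.8), rng.standard_normal(500)))
```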

    The GARCH-EVT-Copula model and simulation in scenario-based asset allocation

    Get PDF
    Financial market integration, in particular, portfolio allocations from advanced economies to South African markets, continues to strengthen volatility linkages and quicken volatility transmissions between participating markets. Largely as a result, South African portfolios are net recipients of returns and volatility shocks emanating from major world markets. In light of these and other sources of risk, this dissertation proposes a methodology to improve risk management systems in funds by building a contemporary asset allocation framework that offers practitioners an opportunity to explicitly model combinations of hypothesised global risks and their effects on their investments. The framework models portfolio return variables and their key risk driver variables separately and then joins them to model their combined dependence structure. The separate modelling of the univariate and multivariate (MV) components offers the benefit of capturing the data generating processes with improved accuracy. Univariate variables were modelled using ARMA-GARCH-family structures paired with a variety of skewed and leptokurtic conditional distributions. Model residuals were fitted using the Peaks-over-Threshold method from Extreme Value Theory for the tails and a non-parametric kernel density for the interior, forming a complete semi-parametric distribution (SPD) for each variable. Asset and risk factor returns were then combined and their dependence structure jointly modelled with a MV Student t copula. Finally, the SPD margins and Student t copula were used to construct a MV meta t distribution. Monte Carlo simulations were generated from the fitted MV meta t distribution, on which an out-of-sample test was conducted. The 2014-to-2015 horizon served as a proxy for an out-of-sample, forward-looking scenario for a set of key risk factors against which a hypothetical, diversified portfolio was optimised. Traditional mean-variance and contemporary mean-CVaR optimisation techniques were used and their results compared. As an addendum, performance over the in-sample 2008 financial crisis was reported.
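
    A rough sketch of one building block of the framework above: turning standardized GARCH residuals into pseudo-uniform observations with a semi-parametric CDF, i.e. a generalized Pareto fit beyond high and low thresholds (Peaks-over-Threshold) and the empirical CDF in the interior; the copula would then be fitted on these uniforms. The 10th/90th-percentile thresholds, the use of scipy's genpareto, and the placeholder residuals are assumptions, not the dissertation's settings.

```python
import numpy as np
from scipy.stats import genpareto

def spd_cdf(z, lower_q=0.10, upper_q=0.90):
    """Return a callable semi-parametric CDF fitted to the sample z."""
    z = np.asarray(z, dtype=float)
    lo, hi = np.quantile(z, [lower_q, upper_q])
    # GPD fits to exceedances over each threshold (lower tail fitted to negated exceedances).
    cu, _, su = genpareto.fit(z[z > hi] - hi, floc=0.0)
    cl, _, sl = genpareto.fit(-(z[z < lo] - lo), floc=0.0)
    z_sorted = np.sort(z)

    def cdf(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        u = np.empty_like(x)
        mid = (x >= lo) & (x <= hi)
        u[mid] = np.searchsorted(z_sorted, x[mid], side="right") / z.size   # empirical interior
        up = x > hi
        u[up] = 1.0 - (1.0 - upper_q) * genpareto.sf(x[up] - hi, cu, scale=su)
        dn = x < lo
        u[dn] = lower_q * genpareto.sf(-(x[dn] - lo), cl, scale=sl)
        return u

    return cdf

# Example: map placeholder standardized residuals to pseudo-uniforms in (0, 1).
z = np.random.default_rng(0).standard_t(df=5, size=2000)
u = spd_cdf(z)(z)
print(u.min(), u.max())
```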

    Econometrics of Machine Learning Methods in Economic Forecasting

    Full text link
    This paper surveys recent advances in machine learning methods for economic forecasting. The survey covers the following topics: nowcasting, textual data, panel and tensor data, high-dimensional Granger causality tests, time series cross-validation, and classification with economic losses.
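
    As a brief illustration of one of the topics listed, time series cross-validation evaluates forecasts on an expanding (or rolling) origin rather than on random folds, so that training data always precede test data. The window sizes and the simple AR(1)-style forecaster below are illustrative choices, not anything taken from the survey.

```python
import numpy as np

def expanding_window_cv(y, min_train=50, horizon=1):
    """Yield (training slice, test index) pairs with a strictly increasing forecast origin."""
    for origin in range(min_train, len(y) - horizon + 1):
        yield y[:origin], origin + horizon - 1

def naive_ar1_forecast(train):
    # One-step forecast phi_hat * y_T, with phi_hat from a no-intercept least-squares fit.
    phi = np.dot(train[:-1], train[1:]) / np.dot(train[:-1], train[:-1])
    return phi * train[-1]

# Simulated AR(1) data and an out-of-sample error measure.
rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()

errors = [y[i] - naive_ar1_forecast(tr) for tr, i in expanding_window_cv(y)]
print("out-of-sample RMSE:", float(np.sqrt(np.mean(np.square(errors)))))
```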

    Modelling South Africa's market risk using the APARCH model and heavy-tailed distributions.

    Get PDF
    Master of Science in Statistics. University of KwaZulu-Natal, Durban, 2016. Estimating the Value-at-Risk (VaR) of stock returns, especially in emerging economies, has recently attracted the attention of both academics and risk managers, mainly because stock returns have become relatively more volatile than their historical trend. VaR and other risk management tools, such as expected shortfall (conditional VaR), depend heavily on an appropriate set of underlying distributional assumptions. Thus, identifying a distribution that best captures all aspects of financial returns is of great interest to both academics and risk managers. This study therefore compares the relative performance of GARCH-type models combined with heavy-tailed distributions, namely the skew Student's t distribution, the Pearson Type IV distribution (PIVD), the Generalized Pareto distribution (GPD), the Generalized Extreme Value distribution (GEVD), and the stable distribution, in estimating the Value-at-Risk of South African All Share Index (ALSI) returns. Model adequacy is checked through backtesting, using the Kupiec likelihood ratio test. The proposed models capture volatility clustering (conditional heteroskedasticity) and the asymmetric (leverage) effect through the GARCH framework, while at the same time modelling the heavy-tailed behaviour of the returns through the heavy-tailed distribution. The main findings indicate that the APARCH model combined with these heavy-tailed distributions performs well in modelling South African market risk for both long and short positions. When compared in terms of predictive ability, the APARCH model combined with the PIVD or the GPD gives better VaR estimates for the short position, while the APARCH model combined with the stable distribution gives better VaR estimates for the long position. Thus, the APARCH model combined with a heavy-tailed distribution provides a good alternative for modelling stock returns. The outcomes of this research are expected to be of value to financial analysts, portfolio managers, risk managers, and financial market researchers, giving a better understanding of the South African market.
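
    A small sketch of the backtest named above: the Kupiec proportion-of-failures (unconditional coverage) likelihood-ratio test, which compares the observed share of VaR violations with the nominal level and is asymptotically chi-squared with one degree of freedom. The example numbers are made up; the APARCH/heavy-tailed VaR forecasts themselves are not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def kupiec_pof(violations, p):
    """Kupiec unconditional-coverage LR test; assumes 0 < number of violations < T."""
    v = np.asarray(violations, dtype=bool)
    T, x = v.size, int(v.sum())
    pi = x / T                                            # observed violation rate
    loglik = lambda q: (T - x) * np.log(1.0 - q) + x * np.log(q)
    lr = -2.0 * (loglik(p) - loglik(pi))                  # likelihood-ratio statistic
    return lr, chi2.sf(lr, df=1)                          # p-value under chi2(1)

# Example: 500 trading days of 99% VaR forecasts with 9 observed violations (5 expected).
hits = np.zeros(500, dtype=bool)
hits[:9] = True
lr, pval = kupiec_pof(hits, p=0.01)
print(lr, pval)   # a small p-value would indicate the VaR model under- or over-covers
```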

    Variational Elliptical Processes

    Full text link
    We present elliptical processes, a family of non-parametric probabilistic models that subsume Gaussian processes and Student's t processes. This generalization includes a range of new heavy-tailed behaviors while retaining computational tractability. Elliptical processes are based on a representation of elliptical distributions as a continuous mixture of Gaussian distributions. We parameterize this mixture distribution as a spline normalizing flow, which we train using variational inference. The proposed form of the variational posterior enables a sparse variational elliptical process applicable to large-scale problems. We highlight advantages compared to Gaussian processes through regression and classification experiments. Elliptical processes can supersede Gaussian processes in several settings, including cases where the likelihood is non-Gaussian or when accurate tail modeling is essential. Comment: 14 pages, 15 figures, 9-page appendix.
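
    A quick sketch of the construction mentioned above: an elliptical process draw can be generated as a Gaussian process draw whose covariance is rescaled by a random mixing variable. With an inverse-gamma mixing distribution this recovers the Student's t process; the paper instead learns the mixing distribution with a spline normalizing flow trained by variational inference, which is not reproduced here. The kernel and its hyperparameters are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.5, var=1.0, jitter=1e-8):
    d2 = (x[:, None] - x[None, :]) ** 2
    return var * np.exp(-0.5 * d2 / lengthscale**2) + jitter * np.eye(x.size)

def sample_elliptical_process(x, nu=4.0, n_samples=5, seed=0):
    """Draw paths f = sqrt(s) * g, with g ~ GP(0, K) and s ~ InvGamma(nu/2, nu/2)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(rbf_kernel(x))
    g = L @ rng.standard_normal((x.size, n_samples))                      # Gaussian process draws
    s = 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n_samples)   # inverse-gamma scales
    return g * np.sqrt(s)                                                 # heavier tails for small nu

x = np.linspace(0.0, 1.0, 100)
paths = sample_elliptical_process(x, nu=3.0)
print(paths.shape)   # (100, 5): five heavy-tailed sample paths on the grid
```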