
    An Object-Oriented Framework for Statistical Simulation: The R Package simFrame

    Simulation studies are widely used by statisticians to gain insight into the quality of developed methods. Usually some guidelines regarding, e.g., simulation designs, contamination, missing data models or evaluation criteria are necessary in order to draw meaningful conclusions. The R package simFrame is an object-oriented framework for statistical simulation that allows researchers to make use of a wide range of simulation designs with minimal programming effort. Its object-oriented implementation provides clear interfaces for extensions by the user. Since statistical simulation is an embarrassingly parallel process, the framework supports parallel computing to increase computational performance. Furthermore, an appropriate plot method is selected automatically depending on the structure of the simulation results. In this paper, the implementation of simFrame is discussed in detail and the functionality of the framework is demonstrated in examples for different simulation designs.
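    The abstract's point that statistical simulation is embarrassingly parallel can be illustrated with a short sketch (in Python rather than R, and not using simFrame's actual API; the estimator and run counts are purely illustrative):

```python
import random
from multiprocessing import Pool

def one_run(seed, n=100):
    """One simulation run: draw a sample and evaluate an estimator on it.
    The sample mean of n standard-normal draws is a stand-in estimator."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return sum(sample) / n

def simulate(n_runs=200):
    """Runs are mutually independent, so they can be farmed out to
    worker processes without any coordination between them."""
    with Pool() as pool:
        return pool.map(one_run, range(n_runs))

if __name__ == "__main__":
    results = simulate()
    print(len(results))  # one estimate per independent run
```

    Because each run depends only on its own seed, the runs can be distributed across cores with no shared state, which is what makes the speed-up from parallel computing nearly linear.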

    Extreme incomes and the estimation of poverty and inequality indicators from EU-SILC

    Micro-data estimates of welfare indices are known to be sensitive to observations from the tails of the income distribution. It is therefore customary to adjust extreme data before estimating inequality and poverty statistics. This paper systematically evaluates the impact of such adjustments on indicators estimated from the EU-SILC (Community Statistics on Income and Living Conditions), which is expected to become the reference source for comparative statistics on income distribution and social exclusion in the EU. Emphasis is put on the robustness of cross-country comparisons to alternative adjustments. Results from a sensitivity analysis considering both simple, classical adjustments and a more sophisticated approach based on parametrically modelling the tails of the income distribution are reported. Reassuringly, ordinal comparisons of countries are found to be robust to variants of the data adjustment procedures. However, data adjustments are far from innocuous: cardinal comparisons of countries prove sensitive to the treatment of extreme incomes, even for seemingly small adjustments.
    Keywords: social indicators; poverty and inequality; extreme incomes; parametric tail; EU-SILC
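    A minimal sketch of the kind of simple, classical adjustment the paper evaluates: top-coding (winsorizing) extreme incomes before computing an inequality index. The functions and the cutoff are illustrative, not the paper's actual procedure:

```python
def gini(incomes):
    """Gini coefficient via the sorted-rank formula (1-indexed ranks)."""
    xs = sorted(incomes)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def winsorize_top(incomes, pct=0.99):
    """Replace incomes above the pct empirical quantile by that quantile
    (simple top-coding; real adjustments are usually more careful)."""
    xs = sorted(incomes)
    cut = xs[int(pct * (len(xs) - 1))]
    return [min(x, cut) for x in incomes]
```

    With a single extreme income, e.g. [1, 2, 3, 1000], top-coding at even a modest quantile pulls the Gini down sharply, which is exactly why the paper warns that cardinal country comparisons can hinge on the adjustment chosen.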

    Strategies for Multiply Imputed Survey Data and Modeling in the Context of Small Area Estimation

    To target resources and policies where they are most needed, it is essential that policy-makers are provided with reliable socio-demographic indicators on sub-groups. These sub-groups can be defined by regional divisions or by demographic characteristics and are referred to as areas or domains. Information on these domains is usually obtained through surveys, often planned at a higher level, such as the national level. As sample sizes at disaggregated levels may become small or unavailable, estimates based on survey data alone may no longer be considered reliable or may not be available. Increasing the sample size is time-consuming and costly. Small area estimation (SAE) methods aim to solve this problem and achieve higher precision. SAE methods enrich information from survey data with data from additional sources and "borrow" strength from other domains. This is done by modeling and linking the survey data with administrative or register data and by using area-specific structures. Auxiliary data are traditionally population data available at the micro or aggregate level that can be used to estimate unit-level models or area-level models. Due to strict privacy regulations, it is often difficult to obtain these data at the micro level. Therefore, models based on aggregated auxiliary information, such as the Fay-Herriot model and its extensions, are of great interest for obtaining SAE estimators. In addition to the problem of small sample sizes at the disaggregated level, surveys often suffer from high non-response. One possible solution to item non-response is multiple imputation (MI), which replaces missing values with multiple plausible values. The missing values and their replacement introduce additional uncertainty into the estimate. Part I focuses on the Fay-Herriot model, where the resulting estimator is a combination of a design-unbiased estimator based only on the survey data (hereafter called the direct estimator) and a synthetic regression component. 
Solutions are presented to account for the uncertainty introduced by missing values in the SAE estimator using Rubin's rules. Since financial assets and wealth are sensitive topics, surveys on this type of data suffer particularly from item non-response. Chapter 1 focuses on estimating private wealth at the regionally disaggregated level in Germany. Data from the 2010 Household Finance and Consumption Survey (HFCS) are used for this application. In addition to the non-response problem, income and wealth data are often right-skewed, requiring a transformation to fully satisfy the normality assumptions of the model. Therefore, Chapter 1 presents a modified Fay-Herriot approach that incorporates the uncertainty of missing values into the log-transformed direct estimator of a mean. Chapter 2 complements Chapter 1 by presenting a framework that extends the general class of transformed Fay-Herriot models to account for the additional uncertainty due to MI by including it in the direct component and simultaneously in the regression component of the Fay-Herriot estimator. In addition, the uncertainty due to missing values is also included in the mean squared error estimator, which serves as the uncertainty measure. The estimation of a mean, the use of the log transformation for skewed data, and the arcsine transformation for proportions as target indicators are considered. The proposed framework is evaluated for the three cases in a model-based simulation study. To illustrate the methodology, 2017 data from the HFCS for European Union countries are used to estimate the average value of bonds at the national level. The approaches presented in Chapters 1 and 2 contribute to the literature by providing solutions for estimating SAE models in the presence of multiply imputed survey data. In particular, Chapter 2 presents a general approach that can be extended to other indicators. 
To obtain the best possible SAE estimator in terms of accuracy and precision, it is important to find the optimal model for the relationship between the target variable and the auxiliary data. The notion of "optimal" can be multifaceted. One way to look at optimality is to find the best transformation of the target variable to fully satisfy model assumptions or to account for nonlinearity. Another perspective is to identify the most important covariates and their relationship to each other and to the target variable. Part II of this dissertation therefore brings together research on optimal transformations and model selection in the context of SAE. Chapter 3 considers both problems simultaneously for linear mixed models (LMM) and proposes a model selection approach for LMM with data-driven transformations. In particular, the conditional Akaike information criterion is adapted by introducing the Jacobian into the criterion to allow comparison of models at different scales. The methodology is evaluated in a simulation experiment comparing different transformations with different underlying true models. Since SAE models are LMMs, this methodology is applied to the unit-level small-area method, the empirical best predictor (EBP), in an application with Mexican survey and census data (ENIGH - National Survey of Household Income and Expenditure) and shows improvements in efficiency when the optimal (linear mixed) model and the transformation parameters are found simultaneously. Chapter 3 bridges the gap between model selection and optimal transformations to satisfy normality assumptions in unit-level SAE models in particular and LMMs in general. Chapter 4 explores the problem of model selection from a different perspective and for area-level data. To model interactions between auxiliary variables and nonlinear relationships between them and the dependent variable, machine learning methods can be a versatile tool. 
For unit-level SAE models, mixed-effects random forests (MERFs) provide a flexible solution to account for interactions and nonlinear relationships, ensure robustness to outliers, and perform implicit model selection. In Chapter 4, the idea of MERFs is transferred to area-level models and the linear regression synthetic part of the Fay-Herriot model is replaced by a random forest to benefit from the above properties and to provide an alternative modeling approach. Chapter 4 therefore contributes to the literature by proposing a first way to combine area-level SAE models with random forests for mean estimation to allow for interactions, nonlinear relationships, and implicit variable selection. Another advantage of random forests is their non-extrapolation property, i.e. the range of predictions is limited by the lowest and highest observed values. This could help to avoid transformations at the area level when estimating indicators defined on a fixed range. The standard Fay-Herriot model was originally developed to estimate a mean, and transformations are required when the indicator of interest is, for example, a share or a Gini coefficient. This usually requires the development of appropriate back-transformations and MSE estimators. Chapter 5 presents a Fay-Herriot model for estimating logit-transformed Gini coefficients with a bias-corrected back-transformation and a bootstrap MSE estimator. A model-based simulation is performed to show the validity of the methodology, and regionally disaggregated data from Germany are used to illustrate the proposed approach. Chapter 5 contributes to the existing literature by providing, from a frequentist perspective, an alternative to the Bayesian area-level model for estimating Gini coefficients using a logit transformation.
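Two building blocks that this abstract repeatedly combines can be sketched in a few lines: the Fay-Herriot composite estimator (a shrinkage-weighted average of the direct and synthetic components) and Rubin's rules for pooling multiply imputed estimates. This is a hedged illustration with invented argument names, not the dissertation's implementation:

```python
def fay_herriot_blup(direct, synthetic, sigma2_u, psi):
    """Area-level composite estimator: shrink the direct estimate toward
    the regression-synthetic part. sigma2_u is the random-effect variance,
    psi the sampling variance of the direct estimator for this area."""
    gamma = sigma2_u / (sigma2_u + psi)  # shrinkage factor in [0, 1]
    return gamma * direct + (1.0 - gamma) * synthetic

def rubin_pool(estimates, variances):
    """Rubin's rules: pool M multiply-imputed point estimates and their
    within-imputation variances into one estimate and total variance."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled estimate
    w_bar = sum(variances) / m                                # within variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between variance
    total_var = w_bar + (1.0 + 1.0 / m) * b
    return q_bar, total_var
```

The dissertation's contribution is, roughly, to feed Rubin-pooled direct estimates (and their inflated variances) into components like the first function, so that imputation uncertainty propagates into the SAE estimator and its MSE.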

    The Use of Data-driven Transformations and Their Applicability in Small Area Estimation

    One of the goals of data analysts is to establish relationships between variables using regression models. Standard statistical techniques for linear and linear mixed regression models are commonly associated with interpretation, estimation, and inference. These techniques rely on basic assumptions underlying the working model, listed below:
    - Normality: transforming data to create symmetry in order to correctly use interpretation and inferential techniques
    - Homoscedasticity: creating equality of spread as a means to gain efficiency in estimation processes and to properly use inference processes
    - Linearity: linearizing relationships in an effort to avoid misleading conclusions from estimation and inference techniques
    Different options are available to the data analyst when the model assumptions are not met in practice. Researchers could formulate the regression model under alternative and more flexible parametric assumptions. They could also use a regression model that minimizes the use of parametric assumptions, or rely on robust estimation. Another option would be to parsimoniously redesign the model by finding an appropriate transformation such that the model assumptions hold. A standard practice in applied work is to transform the target variable by computing its logarithm. However, this type of transformation does not adjust to the underlying data. Therefore, some research effort has shifted towards alternative data-driven transformations, such as the Box-Cox, which includes a transformation parameter that adjusts to the data. The literature on transformations in theoretical statistics and practical case studies in different research fields is rich, and most relevant results were published during the early 1980s. More sophisticated and complex techniques and tools are available nowadays to the applied statistician as alternatives to using transformations. 
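    The Box-Cox family mentioned above can be sketched compactly: the transformation parameter is chosen by maximizing a profile log-likelihood that includes the Jacobian of the transformation, so that fits on different scales remain comparable. This is a minimal grid-search illustration, not the trafo package's implementation:

```python
import math

def boxcox(x, lam):
    """Box-Cox transform of a positive observation x."""
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def boxcox_loglik(data, lam):
    """Profile log-likelihood of lambda under the normal model (up to a
    constant), including the Jacobian term (lam - 1) * sum(log x)."""
    n = len(data)
    y = [boxcox(x, lam) for x in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    jacobian = (lam - 1.0) * sum(math.log(x) for x in data)
    return -0.5 * n * math.log(var) + jacobian

def estimate_lambda(data, grid=None):
    """Grid-search maximum-likelihood estimate of the transformation
    parameter; a real implementation would optimize continuously."""
    if grid is None:
        grid = [i / 20.0 for i in range(-40, 41)]  # lambda in [-2, 2]
    return max(grid, key=lambda lam: boxcox_loglik(data, lam))
```

    Lambda = 1 leaves the data essentially untransformed and lambda = 0 recovers the fixed log transform, so the data-driven estimate interpolates between the two standard practices the abstract contrasts.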
However, simplicity is still a gold standard in statistical practice, and it is often achieved by applying suitable transformations within the working model. In general, researchers have used data transformations as a go-to tool to assist scientific work under the classical and linear mixed regression models instead of developing new theories, applying complex methods or extending software functions. However, transformations are often applied automatically and routinely, without considering different aspects of their utility. In Part 1 of this work, modeling guidelines on transformations for practitioners are presented. An extensive guideline and an overview of different transformations and estimation methods for transformation parameters in the context of linear and linear mixed regression models are presented in Chapter 1. Furthermore, in order to provide an extensive collection of transformations usable in linear regression models and a wide range of estimation methods for the transformation parameter, the package trafo is presented in Chapter 2. This package complements and extends the methods that exist in R so far, and offers a simple, user-friendly framework for selecting a suitable transformation depending on the research purpose. In the literature, little attention has been paid to techniques for the linear mixed regression model when working with transformations. This becomes a challenge for users of small area estimation (SAE) methods, since the most commonly used SAE methods are based on the linear mixed regression model, which often relies on Gaussian assumptions. In particular, the empirical best predictor is widely used in practice to produce reliable estimates of general indicators for areas with small sample sizes. The issue of data transformations is addressed in the current SAE literature in a fairly ad hoc manner. 
Contrary to standard practice in applied work, recent empirical work indicates that using transformations in SAE is not as simple as transforming the target variable by computing its logarithm. In Part 2 of the present work, transformations in the context of SAE are applied and further developed. Chapter 3 proposes a protocol for the production of small area official statistics that is based on three stages, namely (i) Specification, (ii) Analysis/Adaptation and (iii) Evaluation. In this chapter, the use of some adaptations of the working model by means of transformations is shown as part of stage (ii). In Chapter 4 we extend the use of data-driven transformations under linear mixed model-based SAE methods, in particular the estimation of the transformation parameter under maximum likelihood theory. First, we analyze how the performance of SAE methods is affected by departures from normality and how such transformations can help improve the validity of the model assumptions and the precision of small area prediction. Particular attention is paid to the estimation of poverty and inequality indicators, due to their socio-economic relevance and political impact. Second, we adapt the mean squared error estimator to account for the additional uncertainty due to the estimation of transformation parameters. Finally, as in Chapter 3, the methods are illustrated using real survey and census data from Mexico. In order to improve some features of existing software packages suitable for the estimation of indicators for small areas, the package emdi is developed in Chapter 5. This package offers a methodological and computational framework for the estimation of regionally disaggregated indicators using SAE methods, as well as tools for assessing, processing, and presenting the results. Finally, in Part 3, the applicability of transformations is discussed in the context of generalized linear models (GLMs). 
In Chapter 6, a comparison is made in terms of precision between using count data transformations within the classical regression model and applying GLMs, in particular for the Poisson case. The methodological differences are presented and a simulation study is carried out. The lesson from this analysis is the importance of knowing the research purpose and the data scenario in order to choose which methodology is preferable in any given situation.

    A Framework for the Estimation of Disaggregated Statistical Indicators Using Tree-Based Machine Learning Methods

    The thesis combines four papers that introduce a coherent framework based on MERFs for the estimation of spatially disaggregated economic and inequality indicators and associated uncertainties. Chapter 1 focuses on flexible domain prediction using MERFs. We discuss characteristics of semi-parametric point and uncertainty estimates for domain-specific means. Extensive model- and design-based simulations highlight the advantages of MERFs in comparison to 'traditional' LMM-based SAE methods. Chapter 2 introduces the use of MERFs under limited covariate information. Access to population-level micro-data for auxiliary information imposes barriers for researchers and practitioners. We introduce an approach that adaptively incorporates aggregated auxiliary information through calibration weights in the absence of unit-level auxiliary data. We apply the proposed method to German survey data and use aggregated covariate census information from the same year to estimate the average opportunity cost of care work for 96 planning regions in Germany. In Chapter 3, we discuss the estimation of non-linear poverty and inequality indicators. Our proposed method allows domain-specific cumulative distribution functions to be estimated, from which desired (non-linear) poverty estimators can be obtained. We evaluate the proposed point and uncertainty estimators in a design-based simulation and focus on a case study uncovering spatial patterns of poverty for the Mexican state of Veracruz. Additionally, Chapter 3 informs a methodological discussion on the differences and advantages between the use of predictive algorithms and (linear) statistical models in the context of SAE. The final Chapter 4 complements the previous research by implementing the discussed methods for point and uncertainty estimates in the open-source R package SAEforest. The package facilitates the use of the discussed methods and accessibly adds MERFs to the existing toolbox for SAE and official statistics. 
Overall, this work aims to synergize aspects from two statistical spheres (namely 'traditional' parametric models and nonparametric predictive algorithms) by critically discussing and adapting tree-based methods for applications in SAE. In this perspective, the thesis contributes to the existing literature along three dimensions: 1) the methodological development of alternative semi-parametric methods for the estimation of non-linear domain-specific indicators and means under unit-level and aggregated auxiliary covariates; 2) the proposition of a general framework that enables further discussion between 'traditional' and algorithmic approaches to SAE, as well as an extensive comparison between LMM-based methods and MERFs in applications and several model- and design-based simulations; and 3) the provision of an open-source software package to facilitate the usability of the methods, thus making MERFs and general SAE methodology accessible for tailored research applications by statistical, institutional and political practitioners.
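The alternating scheme behind MERFs can be sketched in a stylized form: fit the fixed part on responses adjusted for the current random effects, then re-estimate the area effects from the residuals, and repeat until convergence. Here a global mean stands in for the random forest and the random effects are unshrunken area means, so this is only the skeleton of the algorithm, not SAEforest's implementation:

```python
def merf_sketch(y, areas, n_iter=25):
    """Stylized MERF alternation: (1) fit the fixed part on responses
    adjusted for the current random effects; (2) re-estimate each area's
    random effect from the residuals of the fixed part."""
    ids = sorted(set(areas))
    u = {a: 0.0 for a in ids}  # area-level random effects, start at zero
    mu = 0.0                   # "fixed part" (a forest prediction in a real MERF)
    for _ in range(n_iter):
        # Step 1: fixed part on adjusted responses y_i - u_{area(i)}.
        adj = [yi - u[a] for yi, a in zip(y, areas)]
        mu = sum(adj) / len(adj)
        # Step 2: random effect = mean residual within each area (no shrinkage).
        for a in ids:
            resid = [yi - mu for yi, ai in zip(y, areas) if ai == a]
            u[a] = sum(resid) / len(resid)
    return mu, u
```

A domain prediction is then mu + u[area]; a real MERF replaces mu by a refitted random forest at every iteration and shrinks u via estimated variance components, which is where the robustness and implicit model selection discussed above come from.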

    Inequalities' Impacts: State of the Art Review

    By way of introduction, this report provides the firm foundation for anchoring the research that will be performed by the GINI project. It subsequently considers the fields covered by each of the main work packages: ● inequalities of income, wealth and education, ● social impacts, ● political and cultural impacts, and ● policy effects on and of inequality. Though extensive, this review does not pretend to be exhaustive. The review may be "light" in some respects and can be expanded as the analysis evolves. In each of the four fields a significant number of discussion papers will be produced, in total well over 100. These will add to the state of the art while also covering new ground and generating results that will be incorporated in the Analysis Reports to be prepared for the work packages. In that sense, the current review provides the starting point. At the same time, the existing body of knowledge is broader or deeper depending on the particular field and its tradition of research. The very motivation of GINI's focused study of the impacts of inequalities is that a systematic study is lacking and relatively little is known about those impacts. This also holds for the complex collection of effects that inequality can have on policy making and the contributions that policies can make to mitigating inequalities but also to enhancing them. By contrast, analyses of inequality itself are many, not least because there is a wide array of inequalities; inequalities have become more easily studied comparatively and much of that analysis has a significant descriptive flavour that includes an extensive discussion of measurement issues. GINI hopes to go beyond that and cover the impacts of inequalities at the same time.

    Inequalities' impacts
