
    The Use of Data-driven Transformations and Their Applicability in Small Area Estimation

    One of the goals of data analysts is to establish relationships between variables using regression models. Standard statistical techniques for linear and linear mixed regression models are commonly associated with interpretation, estimation, and inference. These techniques rely on basic assumptions underlying the working model:
    - Normality: transforming data to create symmetry so that interpretation and inferential techniques can be used correctly.
    - Homoscedasticity: creating equality of spread in order to gain efficiency in estimation and to apply inference procedures properly.
    - Linearity: linearizing relationships in order to avoid misleading conclusions from estimation and inference techniques.
    Different options are available to the data analyst when the model assumptions are not met in practice. Researchers could formulate the regression model under alternative, more flexible parametric assumptions. They could also use a regression model that minimizes the use of parametric assumptions, or rely on robust estimation. Another option is to parsimoniously redesign the model by finding an appropriate transformation such that the model assumptions hold. A standard practice in applied work is to transform the target variable by taking its logarithm. However, this type of transformation does not adapt to the underlying data. Therefore, some research effort has shifted towards data-driven transformations, such as the Box-Cox, which include a transformation parameter that adjusts to the data. The literature on transformations in theoretical statistics and in practical case studies across research fields is rich, and most of the relevant results were published in the early 1980s. More sophisticated and complex techniques and tools are available nowadays to the applied statistician as alternatives to transformations. However, simplicity remains a gold nugget in statistical practice, and it is often achieved by applying suitable transformations within the working model. In general, researchers have used data transformations as a go-to tool under the classical and linear mixed regression models instead of developing new theory, applying complex methods, or extending software functions. However, transformations are often applied automatically and routinely, without considering different aspects of their utility. Part 1 of this work therefore presents modeling guidelines on transformations for practitioners. Chapter 1 gives an extensive guideline and an overview of different transformations and of estimation methods for transformation parameters in the context of linear and linear mixed regression models. Furthermore, in order to provide an extensive collection of transformations usable in linear regression models together with a wide range of estimation methods for the transformation parameter, Chapter 2 presents the package trafo. This package complements and enlarges the methods that exist in R so far, and offers a simple, user-friendly framework for selecting a suitable transformation depending on the research purpose. In the literature, little attention has been paid to transformation techniques for the linear mixed regression model. This is a challenge for users of small area estimation (SAE) methods, since the most commonly used SAE methods are based on the linear mixed regression model, which often relies on Gaussian assumptions.
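    As a rough illustration of the data-driven idea described above, the Box-Cox parameter can be chosen by profile likelihood; the following R sketch uses MASS::boxcox with simulated data (the model, data, and lambda grid are illustrative assumptions, not the thesis' application):

```r
# Minimal sketch: estimating the Box-Cox parameter by profile likelihood.
# The data set, formula, and lambda grid are illustrative assumptions.
library(MASS)

set.seed(1)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.4))  # right-skewed response

fit <- lm(y ~ x)

# Profile the log-likelihood over a grid of lambda values;
# boxcox() returns the grid (x) and the log-likelihood (y).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Refit on the transformed scale (lambda = 0 corresponds to log y).
y_t <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat
fit_t <- lm(y_t ~ x)
```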
    In particular, the empirical best predictor is widely used in practice to produce reliable estimates of general indicators for areas with small sample sizes. The issue of data transformations is addressed in the current SAE literature in a fairly ad-hoc manner. Contrary to standard practice in applied work, recent empirical work indicates that using transformations in SAE is not as simple as taking the logarithm of the target variable. In Part 2 of the present work, transformations in the context of SAE are applied and further developed. Chapter 3 proposes a protocol for the production of small area official statistics that is based on three stages, namely (i) Specification, (ii) Analysis/Adaptation, and (iii) Evaluation. In this chapter, adaptations of the working model by means of transformations are shown as part of stage (ii). In Chapter 4 we extend the use of data-driven transformations under linear mixed model-based SAE methods, in particular the estimation of the transformation parameter under maximum likelihood theory. First, we analyze how the performance of SAE methods is affected by departures from normality and how such transformations can help improve the validity of the model assumptions and the precision of small area predictions. Particular attention is paid to the estimation of poverty and inequality indicators, due to their socio-economic relevance and political impact. Second, we adapt the mean squared error estimator to account for the additional uncertainty due to the estimation of the transformation parameter. Finally, as in Chapter 3, the methods are illustrated using real survey and census data from Mexico. In order to improve on some features of existing software packages for the estimation of indicators for small areas, Chapter 5 develops the package emdi. This package offers a methodological and computational framework for the estimation of regionally disaggregated indicators using SAE methods, as well as tools for assessing, processing, and presenting the results. Finally, Part 3 discusses the applicability of transformations in the context of generalized linear models (GLMs). Chapter 6 compares, in terms of precision, the use of count data transformations within the classical regression model against applying GLMs, in particular for the Poisson case. Methodological differences are presented and a simulation study is carried out. The lesson from this analysis is the relevance of knowing the research purpose and the data scenario in order to choose which methodology is preferable in a given situation.
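    For readers unfamiliar with the poverty indicators mentioned here, a minimal sketch of the Foster-Greer-Thorbecke (FGT) family, which includes the headcount ratio and the poverty gap, may help; the incomes and the poverty line below are simulated, not the Mexican data:

```r
# Minimal sketch of the FGT poverty indicators used in such applications.
# Incomes and the poverty line are illustrative, not the Mexican data.
fgt <- function(income, z, alpha = 0) {
  # FGT(alpha): mean of ((z - y)/z)^alpha over the poor (y < z)
  gap <- pmax(z - income, 0) / z
  mean(ifelse(income < z, gap^alpha, 0))
}

set.seed(2)
income <- rlnorm(1000, meanlog = 7, sdlog = 0.8)
z <- 0.6 * median(income)                 # a common relative poverty line

headcount   <- fgt(income, z, alpha = 0)  # share of poor
poverty_gap <- fgt(income, z, alpha = 1)  # average normalized shortfall
```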

    Data-Driven Transformations In Small Area Estimation

    Small area models typically depend on the validity of model assumptions. For example, a commonly used version of the Empirical Best Predictor relies on the Gaussian assumptions of the error terms of the linear mixed model, a feature rarely observed in applications with real data. The present paper proposes to tackle the potential lack of validity of the model assumptions by using data-driven scaled transformations as opposed to ad-hoc chosen transformations. Different types of transformations are explored, the estimation of the transformation parameters is studied in detail under a linear mixed model, and transformations are used in small area prediction of linear and non-linear parameters. The use of scaled transformations is crucial as it allows for fitting the linear mixed model with standard software and hence simplifies the work of the data analyst. Mean squared error estimation that accounts for the uncertainty due to the estimation of the transformation parameters is explored using parametric and semi-parametric (wild) bootstrap. The proposed methods are illustrated using real survey and census data for estimating income deprivation parameters for municipalities in the Mexican state of Guerrero. Extensive simulation studies and the results from the application show that using carefully selected, data-driven transformations can improve small area estimation.
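    Under a common convention (not necessarily the paper's exact definition), a scaled transformation of this kind is the geometric-mean-scaled Box-Cox, which leaves the transformed response on a scale where standard mixed-model software can be applied directly; a minimal R sketch:

```r
# Sketch of a geometric-mean-scaled Box-Cox transformation (a standard
# convention; the paper's exact scaling may differ).
scaled_boxcox <- function(y, lambda) {
  gm <- exp(mean(log(y)))  # geometric mean of the (positive) response
  if (abs(lambda) < 1e-8) {
    gm * log(y)
  } else {
    (y^lambda - 1) / (lambda * gm^(lambda - 1))
  }
}

# After transforming, a standard linear mixed model can be fitted, e.g.:
# library(lme4)
# fit <- lmer(scaled_boxcox(y, lambda_hat) ~ x + (1 | domain), data = smp)
```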

    Variable selection using conditional AIC for linear mixed models with data-driven transformations

    When data analysts use linear mixed models, they usually encounter two practical problems: (a) the true model is unknown and (b) the Gaussian assumptions of the errors do not hold. While these problems commonly appear together, researchers tend to treat them individually, by (a) finding an optimal model based on the conditional Akaike information criterion (cAIC) and (b) applying transformations to the dependent variable. However, the optimal model depends on the transformation and vice versa. In this paper, we aim to solve both problems simultaneously. In particular, we propose an adjusted cAIC that uses the Jacobian of the particular transformation, such that model candidates with differently transformed data can be compared. From a computational perspective, we propose a step-wise selection approach based on the introduced adjusted cAIC. Model-based simulations are used to compare the proposed selection approach to alternative approaches. Finally, the introduced approach is applied to Mexican data to estimate poverty and inequality indicators for 81 municipalities.
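    A plausible reading of this adjustment, sketched below under the assumption of a Box-Cox transformation and using the cAIC4 package (both are illustrative choices, not necessarily the paper's implementation), corrects the cAIC by the log-Jacobian of the transformation so that models on different response scales become comparable:

```r
# Sketch: comparing models fitted to differently transformed responses by
# correcting the conditional AIC with the log-Jacobian of the
# transformation (for Box-Cox, log|J| = (lambda - 1) * sum(log(y))).
# The use of cAIC4 and the exact form of the adjustment are assumptions
# for illustration, not the paper's implementation.
library(lme4)
library(cAIC4)

adjusted_caic <- function(fit, y_original, lambda) {
  log_jacobian <- (lambda - 1) * sum(log(y_original))
  cAIC(fit)$caic - 2 * log_jacobian
}

# Candidate models on different scales can then be compared, e.g.:
# fit_log <- lmer(log(y) ~ x + (1 | domain), data = smp)
# adjusted_caic(fit_log, smp$y, lambda = 0)   # log = Box-Cox with lambda 0
```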

    Intercensal updating using structure-preserving methods and satellite imagery

    Censuses are fundamental building blocks of most modern-day societies, yet they are conducted every ten years at best. We propose an extension of the widely used census updating technique Structure Preserving Estimation that incorporates auxiliary information in order to take ongoing subnational population shifts into account. We apply our method by incorporating satellite imagery as an additional source to derive annual small-area updates of multidimensional poverty indicators from 2013 to 2020 for a population at risk: female-headed households in Senegal. We evaluate the performance of our proposal using data from two different census periods.
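    Structure Preserving Estimation rests on rescaling a census-based cross-classification so that it matches newer margins, which is essentially iterative proportional fitting; a minimal R sketch with an invented toy table:

```r
# Minimal sketch of the iterative proportional fitting (IPF) step that
# underlies structure-preserving estimation: a census association
# structure is rescaled to match newer row/column totals. The toy table
# and margins are illustrative assumptions.
ipf <- function(seed, row_targets, col_targets, tol = 1e-8, max_iter = 1000) {
  tab <- seed
  for (i in seq_len(max_iter)) {
    tab <- tab * (row_targets / rowSums(tab))              # match row margins
    tab <- sweep(tab, 2, col_targets / colSums(tab), `*`)  # match column margins
    if (max(abs(rowSums(tab) - row_targets)) < tol) break
  }
  tab
}

census_structure <- matrix(c(40, 10, 20, 30), nrow = 2)  # areas x categories
updated <- ipf(census_structure,
               row_targets = c(60, 45),   # newer area totals
               col_targets = c(70, 35))   # newer category totals
```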

    Cost-effectiveness of Artificial Intelligence as a Decision-Support System Applied to the Detection and Grading of Melanoma, Dental Caries, and Diabetic Retinopathy

    Objective: To assess the cost-effectiveness of artificial intelligence (AI) for supporting clinicians in detecting and grading diseases in dermatology, dentistry, and ophthalmology. Importance: AI has been referred to as a facilitator for more precise, personalized, and safer health care, and AI algorithms have been reported to achieve diagnostic accuracies at or above the average physician in dermatology, dentistry, and ophthalmology. Design, setting, and participants: This economic evaluation analyzed data from 3 Markov models used in previous cost-effectiveness studies that were adapted to compare AI vs standard of care to detect melanoma on skin photographs, dental caries on radiographs, and diabetic retinopathy on retina fundus imaging. The general US and German population aged 50 and 12 years, respectively, as well as individuals with diabetes in Brazil aged 40 years, were modeled over their lifetime. Monte Carlo microsimulations and sensitivity analyses were used to capture lifetime efficacy and costs. An annual cycle length was chosen. Data were analyzed between February 2021 and August 2021. Exposure: AI vs standard of care. Main outcomes and measures: Association of AI with tooth retention-years for dentistry and quality-adjusted life-years (QALYs) for individuals in dermatology and ophthalmology; diagnostic costs. Results: In 1000 microsimulations with 1000 random samples, AI as a diagnostic-support system showed limited cost savings and gains in tooth retention-years and QALYs. In dermatology, AI showed mean costs of $750 (95% CI, $608-$970) and was associated with 86.5 QALYs (95% CI, 84.9-87.9 QALYs), while the control showed higher costs of $759 (95% CI, $618-$970) with a similar QALY outcome. In dentistry, AI accumulated costs of €320 (95% CI, €299-€341) (purchasing power parity [PPP] conversion, $429 [95% CI, $400-$458]) with 62.4 tooth retention-years (95% CI, 60.7-65.1 years). The control was associated with higher costs, €342 (95% CI, €318-€368) (PPP, $458; 95% CI, $426-$493), and fewer tooth retention-years (60.9 years; 95% CI, 60.5-63.1 years). In ophthalmology, AI accrued costs of R$ 1321 (95% CI, R$ 1283-R$ 1364) (PPP, $559; 95% CI, $543-$577) at 8.4 QALYs (95% CI, 8.0-8.7 QALYs), while the control was less expensive (R$ 1260; 95% CI, R$ 1222-R$ 1303) (PPP, $533; 95% CI, $517-$551) and associated with similar QALYs. Dominance in favor of AI depended on small differences in the fee paid for the service and the treatment assumed after diagnosis. The fee paid for AI was a relevant factor in the differences in cost-effectiveness between strategies. Conclusions and relevance: The findings of this study suggest that marginal improvements in diagnostic accuracy when using AI may translate into a marginal improvement in outcomes. The current evidence supporting AI as decision support from a cost-effectiveness perspective is limited; AI should be evaluated on a case-specific basis to capture not only differences in costs and payment mechanisms but also treatment after diagnosis.
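    To make the model type concrete, the following toy R sketch runs a two-state Markov microsimulation with an annual cycle; all transition probabilities, costs, and utilities are invented and bear no relation to the study's inputs:

```r
# Toy two-state Markov microsimulation (disease-free vs. diseased) with an
# annual cycle; all inputs below are invented for illustration only.
set.seed(42)
n_persons <- 1000
n_cycles  <- 30

p_progress <- c(AI = 0.015, SoC = 0.020)  # assumed annual progression risk
cost_diag  <- c(AI = 12, SoC = 14)        # assumed annual diagnostic cost
utility    <- c(healthy = 0.95, diseased = 0.70)

simulate_arm <- function(arm) {
  qalys <- costs <- numeric(n_persons)
  for (i in seq_len(n_persons)) {
    diseased <- FALSE
    for (t in seq_len(n_cycles)) {
      if (!diseased) diseased <- runif(1) < p_progress[arm]
      qalys[i] <- qalys[i] + if (diseased) utility["diseased"] else utility["healthy"]
      costs[i] <- costs[i] + cost_diag[arm]
    }
  }
  c(mean_qalys = mean(qalys), mean_costs = mean(costs))
}

simulate_arm("AI")
simulate_arm("SoC")
```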

    A framework for the production of small area official statistics

    Small area estimation is a research area in official and survey statistics of great practical relevance for National Statistical Institutes and related organisations. Despite rapid developments in methodology and software, researchers and users would benefit from having practical guidelines that assist the process of small area estimation. In this paper we propose a general framework for the production of small area statistics that is based on three broadly defined stages, namely Specification, Analysis/Adaptation and Evaluation. The cornerstone of the proposed framework is the principle of parsimony. Emphasis is given to the interaction between a user and a methodologist for specifying the target geography and parameters in light of the available data. Model-free and model-dependent methods are described with a focus on model selection and testing, model diagnostics and adaptations, e.g. the use of data transformations. The use of uncertainty measures and of model- and design-based simulations for method evaluation are also at the centre of the paper. We illustrate each stage of the process both theoretically and by using real data for estimating both a simple and a complex (non-linear) indicator.
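    As an illustration of the diagnostics step in the Analysis/Adaptation stage, a minimal R sketch (with simulated data) fits a linear mixed model and checks the normality of both error components:

```r
# Sketch of the kind of model-diagnostic check mentioned above: fit a
# linear mixed model and test the normality of both error components.
# Data and formula are illustrative.
library(lme4)

set.seed(5)
smp <- data.frame(
  y      = rlnorm(500, meanlog = 7, sdlog = 0.7),
  x      = rnorm(500),
  domain = factor(sample(1:25, 500, replace = TRUE))
)

fit <- lmer(y ~ x + (1 | domain), data = smp)

shapiro.test(residuals(fit))           # unit-level errors
shapiro.test(ranef(fit)$domain[, 1])   # domain-level random effects
# Clear departures from normality would motivate an adaptation of the
# working model, e.g. a (data-driven) transformation of y.
```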

    The role of personality traits and social support in relations of health-related behaviours and depressive symptoms

    Background: Previous evidence has suggested that physically inactive individuals and extensive media users are at high risk of experiencing depressive symptoms. We examined personality traits and perceived social support as potential moderators of this association. Personality and perceived social support were included as two of the most frequently considered variables when determining predisposing factors for media use phenomena, which are also discussed in relation to physical activity. Methods: We analysed cross-sectional data from 1402 adults (18-31 years old) who participated in a national health survey in Germany (KiGGS, Study on the health of children and adolescents in Germany, wave 2). The data included one-week accelerometer assessments as objective indicators of physical activity, as well as self-reported media use, depressive symptoms, perceived social support and Big 5 personality traits. An elastic net regression model was fit with depressive symptoms as the outcome, using ten-fold cross-validation. Results: Among the main effects, we found that high media use was positively correlated with depressive symptoms, whereas physical activity was not. Considering support and individual differences as moderators revealed that PC use was more strongly correlated with depressive symptoms at low levels of perceived social support. Positive associations of social media use with depressive symptoms were more pronounced, whereas negative associations of moderate-to-vigorous physical activity with depressive symptoms were less pronounced, in extraverts than in introverts. Conclusions: The results highlight the importance of considering individual factors in deriving more valid recommendations on protective health behaviours.
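    A minimal R sketch of the analysis type described, using glmnet with simulated stand-ins for the KiGGS variables (the mixing parameter alpha = 0.5 is an illustrative choice):

```r
# Sketch of an elastic net with ten-fold cross-validation. Predictors and
# outcome are simulated stand-ins for the KiGGS variables.
library(glmnet)

set.seed(7)
n <- 1402
X <- cbind(
  activity     = rnorm(n),  # accelerometer-based physical activity
  media_use    = rnorm(n),  # self-reported media use
  support      = rnorm(n),  # perceived social support
  extraversion = rnorm(n)   # Big 5 trait
)
# Moderation enters as interaction terms:
X <- cbind(X, media_x_support = X[, "media_use"] * X[, "support"])
y <- 0.3 * X[, "media_use"] - 0.2 * X[, "media_x_support"] + rnorm(n)

cv_fit <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)
coef(cv_fit, s = "lambda.min")  # selected effects at the CV-optimal penalty
```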

    Small area estimation in R with application to Mexican income data

    In recent decades, policy decisions have increasingly been based on statistical measures. The more detailed this information is, the better the basis for targeting policies and evaluating policy programs. For instance, the United Nations suggests further disaggregation of statistical indicators for monitoring its Sustainable Development Goals, and the number of National Statistical Institutes (NSIs) that recognize the need for more disaggregated statistics is increasing. Dimensions for disaggregation can be characteristics of individuals or households, such as sex, age, ethnicity or economic activity, or spatial dimensions such as metropolitan areas or districts. The primary data sources for the variables used to estimate statistical indicators are national household surveys. However, sample sizes are usually small, or even zero, at disaggregated levels. Therefore, direct estimators based only on survey data can be unreliable or unavailable for small domains. While the option of more specific surveys is costly, model-based methodologies for dealing with small sample sizes can help to obtain reliable estimates for small domains. So-called Small Area Estimation (SAE) methods [1,2] link survey data that are only available for a fraction of households with administrative or census data available for all households in the area of interest. Even though a wide range of SAE methods has been proposed by academic researchers, these are, so far, applied only by a small number of NSIs and other practitioners such as the World Bank. This gap between theoretical possibilities and practical application can have several reasons. One reason may be the lack of suitable statistical software. The free software environment R helps to counteract this issue, since researchers can make their code available to the public via packages. Thus, new methods can reach practitioners faster than with non-free software. The next two sections summarize which packages are already available and what could be improved in the future.
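    The following R sketch (with simulated survey data) illustrates why direct estimators become unreliable or unavailable at disaggregated levels:

```r
# Sketch: domain means from survey data alone become noisy or unavailable
# as the domain sample size shrinks. The survey data are simulated.
set.seed(3)
smp <- data.frame(
  income = rlnorm(300, meanlog = 7, sdlog = 0.8),
  domain = factor(sample(1:50, 300, replace = TRUE), levels = 1:50)
)

direct_mean <- tapply(smp$income, smp$domain, mean)  # NA for empty domains
n_domain    <- table(smp$domain)
# Domains with only a handful of observations yield very noisy direct
# estimates; empty domains yield none at all -- the case SAE addresses.
```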

    The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators

    The R package emdi offers a methodological and computational framework for the estimation of regionally disaggregated indicators using small area estimation methods and provides tools for assessing, processing and presenting the results. A range of indicators, including the mean of the target variable, the quantiles of its distribution, and complex, non-linear or customized indicators, can be estimated simultaneously using direct estimation and the empirical best predictor (EBP) approach (Molina and Rao 2010). In the application presented in this paper, package emdi is used for estimating inequality indicators and the median of the income distribution for small areas in Austria. Because the EBP approach relies on the normality of the mixed model error terms, the user is further assisted by an automatic selection of data-driven transformation parameters. Estimating the uncertainty of small area estimates, via a mean squared error (MSE) measure, is achieved using both parametric bootstrap and semi-parametric wild bootstrap. The additional uncertainty due to the estimation of the transformation parameter is also captured in the MSE estimation. The semi-parametric wild bootstrap further protects the user against departures from the assumptions of the mixed model, in particular those on the unit-level error term. The bootstrap schemes are facilitated by computationally efficient code that uses parallel computing. The package supports users beyond the production of small area estimates. Firstly, tools are provided for exploring the structure of the data and for diagnostic analysis of the model assumptions. Secondly, tools that allow the spatial mapping of the estimates enable the user to create high-quality visualizations. Thirdly, results and model summaries can be exported to Excel™ spreadsheets for further reporting purposes.
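    A rough sketch of such an analysis in emdi, based on the interface and the Austrian example data documented by the package (argument names should be verified against the current manual):

```r
# Sketch of an EBP run with emdi, using the Austrian example data shipped
# with the package; argument names follow the documented interface but
# should be checked against the current manual.
library(emdi)
data("eusilcA_pop")
data("eusilcA_smp")

emdi_model <- ebp(
  fixed          = eqIncome ~ gender + eqsize + cash + self_empl + unempl_ben,
  pop_data       = eusilcA_pop,
  pop_domains    = "district",
  smp_data       = eusilcA_smp,
  smp_domains    = "district",
  transformation = "box.cox",  # data-driven transformation parameter
  MSE            = TRUE
)

estimators(emdi_model, indicator = c("Gini", "Median"), MSE = TRUE)
```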

    The R Package emdi for Estimating and Mapping Regionally Disaggregated Indicators

    The R package emdi enables the estimation of regionally disaggregated indicators using small area estimation methods and includes tools for processing, assessing, and presenting the results. The mean of the target variable, the quantiles of its distribution, the headcount ratio, the poverty gap, the Gini coefficient, the quintile share ratio, and customized indicators are estimated using direct and model-based estimation with the empirical best predictor (Molina and Rao 2010). The user is assisted by automatic estimation of data-driven transformation parameters. Parametric and semi-parametric (wild) bootstrap methods for mean squared error estimation are implemented, with the latter offering protection against possible misspecification of the error distribution. Tools for (a) customized parallel computing, (b) model diagnostic analyses, (c) creating high-quality maps and (d) exporting the results to Excel and OpenDocument spreadsheets are included. The functionality of the package is illustrated with example data sets for estimating the Gini coefficient and median income for districts in Austria.
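    Complementing the sketch above, the wild bootstrap option and the spreadsheet export described here might be invoked as follows (the boot_type argument and write.excel() follow the documented interface, but treat this as an assumption to verify against the manual):

```r
# Sketch of the robust MSE option and export tools described above; the
# boot_type argument and write.excel() are taken from the documented
# interface and should be checked against the current package manual.
library(emdi)
data("eusilcA_pop")
data("eusilcA_smp")

emdi_model <- ebp(
  fixed       = eqIncome ~ gender + eqsize + cash,
  pop_data    = eusilcA_pop,  pop_domains = "district",
  smp_data    = eusilcA_smp,  smp_domains = "district",
  MSE         = TRUE,
  boot_type   = "wild"        # semi-parametric wild bootstrap
)

write.excel(emdi_model, file = "emdi_results.xlsx", MSE = TRUE)
```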