
    A Review of Hot Deck Imputation for Survey Non-response

    Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a "similar" unit. Although it is widely used in practice, its theory is not as well developed as that of other imputation methods. We found that there is no consensus on the best way to apply the hot deck and obtain inferences from the completed data set. Here we review the different forms of the hot deck and the existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide numerous examples of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some possible areas of future research are highlighted. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/78729/1/j.1751-5823.2010.00103.x.pd
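    As context for the technique under review, a minimal sketch of a within-cell hot deck in Python follows; the pandas layout, the column names, and the uniform donor draw are illustrative assumptions, not the Census Bureau's CPS procedure.

        import numpy as np
        import pandas as pd

        def hot_deck_impute(df, target, cell_vars, rng=None):
            """Within each adjustment cell (rows sharing the cell_vars values),
            replace missing values of the target column with values drawn at
            random from the observed donors in the same cell."""
            rng = rng or np.random.default_rng()
            out = df.copy()
            for _, idx in out.groupby(cell_vars).groups.items():
                cell = out.loc[idx, target]
                donors = cell.dropna().to_numpy()
                missing = cell.index[cell.isna()]
                if len(donors) > 0 and len(missing) > 0:
                    out.loc[missing, target] = rng.choice(donors, size=len(missing))
            return out

        # Hypothetical usage: impute income within age-group x sex cells.
        # completed = hot_deck_impute(df, "income", ["age_group", "sex"])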

    Filling in wind speed data using the Hot Deck method for estimating electric energy production based on the wind resource

    During wind measurement campaigns at meteorological stations, atypical conditions can cause the loss of data, whether through equipment failure, failure of the backup power supply, saturation of the storage space, or other causes. The data series must therefore be completed while trying to reduce the uncertainty introduced in the process. The present work uses wind speed data provided by the meteorological station installed at the Universidad Politécnica Salesiana, Sede Quito – Campus Sur. The hourly records are complete and validated by the runs (Rachas) method. From the complete series, three additional data series are obtained by randomly removing 10, 40, and 70% of the data. Applying the Hot-Deck method, the constructed series are completed and compared with the original complete data series. The Weibull distribution is used to estimate electric energy production. Finally, the results are presented, analysing the effectiveness of the gap filling under the proposed scenarios. The computational aids RStudio and Matlab were used for the development of this work
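    A toy Python version of the paper's experimental design follows, assuming a synthetic stand-in for the hourly series and a simple same-hour-of-day donor rule; the station data, the actual donor selection, and the Rachas validation are not reproduced.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        v_full = rng.weibull(2.0, 8760) * 6.0  # synthetic hourly wind speeds (m/s)

        for frac in (0.10, 0.40, 0.70):        # the three deletion scenarios
            v = v_full.copy()
            gaps = rng.choice(v.size, size=int(frac * v.size), replace=False)
            v[gaps] = np.nan
            # Hot deck: donors share the hour of day (assumes every hour of day
            # retains at least one observed value).
            hour = np.arange(v.size) % 24
            for h in range(24):
                m = hour == h
                donors = v[m & ~np.isnan(v)]
                miss = m & np.isnan(v)
                v[miss] = rng.choice(donors, size=miss.sum())
            # Weibull parameters feed the energy-production estimate.
            k, _, lam = stats.weibull_min.fit(v, floc=0)
            print(f"{frac:.0%} gaps -> Weibull shape {k:.2f}, scale {lam:.2f}")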

    Multiple Imputation via Local Regression (Miles)

    Methods for statistical analyses generally rely upon complete rectangular data sets. When the data are incomplete due to, e.g., nonresponse in surveys, the researcher must choose between three alternatives:
    1. The analysis rests on the complete cases only. This is almost always the worst option. In market research, for example, missing values occur more often among younger respondents; because relevant behavior such as media consumption or past purchases often correlates with age, a complete-case analysis provides the researcher with misleading answers.
    2. The missing data are imputed (i.e., filled in) by an ad-hoc method. Ad-hoc methods range from filling in mean values to applying nearest neighbor techniques. Whereas filling in mean values performs poorly, nearest neighbor approaches have the advantage of imputing plausible values and work well in some applications. Yet ad-hoc approaches generally suffer from two limitations: they do not apply to complex missing data patterns, and they distort statistical inference, such as t-tests, on the completed data sets.
    3. The missing data are imputed by a method based on an explicit model. Such model-based methods can cope with the broadest range of missing data problems. However, they depend on a considerable set of assumptions and are susceptible to their violations.
    This dissertation proposes the two new methods <midastouch> and <Miles>, which build on ideas by Cleveland & Devlin (1988) and Siddique & Belin (2008). Both methods combine model-based imputation with nearest neighbor techniques. Compared to default model-based imputation, they are as broadly applicable but require fewer assumptions and should therefore appeal to practitioners. In this text, the proposed methods' theoretical derivations in the multiple imputation framework (Rubin, 1987) precede their performance assessments using both artificial data and a natural TV consumption data set from the GfK SE company. In highly nonlinear data, we observe <Miles> outperform alternative methods and thus recommend its use in such applications
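    Both proposed methods combine model-based prediction with donor-based nearest neighbor selection, the family that predictive mean matching belongs to. The following is a minimal sketch of plain predictive mean matching in Python; the linear model, the unweighted k-nearest donor pool, and all names are simplifying assumptions and do not reproduce <midastouch> or <Miles> themselves.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        def pmm_impute(X, y, k=5, rng=None):
            """Predictive mean matching: a regression supplies predictions,
            the k observed cases with the closest predictions form the donor
            pool, and the imputed value is an actually observed response."""
            rng = rng or np.random.default_rng()
            obs = ~np.isnan(y)
            model = LinearRegression().fit(X[obs], y[obs])
            pred_obs = model.predict(X[obs])
            y_out, donors = y.copy(), y[obs]
            for i, p in zip(np.where(~obs)[0], model.predict(X[~obs])):
                nearest = np.argsort(np.abs(pred_obs - p))[:k]
                y_out[i] = rng.choice(donors[nearest])
            return y_out

    Because the imputed values are actually observed responses, this family of methods keeps imputations plausible even when the regression itself is misspecified.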

    A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.

    Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are also both crucial and challenging
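    The abstract does not spell out the comparison protocol, but a common generic scheme, given here as a minimal sketch under the assumption of a fully observed reference matrix, is to mask known entries, impute them, and score the error; the function names and the RMSE criterion are illustrative, not the paper's method.

        import numpy as np

        def score_imputer(impute_fn, X_complete, mask_frac=0.1, rng=None):
            """Hide a fraction of known cells, impute them, and report the
            RMSE on the hidden cells (X_complete must have no missing values)."""
            rng = rng or np.random.default_rng()
            X = X_complete.copy()
            mask = rng.random(X.shape) < mask_frac   # cells to hide
            X[mask] = np.nan
            X_imputed = impute_fn(X)                 # any imputer: MICE, kNN, ...
            err = X_imputed[mask] - X_complete[mask]
            return np.sqrt(np.mean(err ** 2))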

    An algorithm for augmenting cancer registry data for epidemiological research applied to oesophageal cancers

    Oesophageal cancer is an important cancer with short survival, but the relationship between pre-diagnosis health behaviour and post-diagnosis survival remains poorly understood. Cancer registries can provide a high quality census of cancer cases but do not record pre-diagnosis exposures. The aim of this thesis is to document relationships between pre-diagnosis health behaviours and post-diagnosis survival times in oesophageal cancer, developing new methods as required. A systematic review and meta-analysis was conducted in 2014, and updated in 2021, to investigate the association between pre-diagnosis health behaviours and oesophageal cancer. Visualising health behaviour variables as part of the cancer registry data set, with 100% missing data, led to the development of new approaches for augmenting US oesophageal cancer registry data with health behaviour data from a US national health survey. Firstly, the health survey data were used to create logistic regression models of the probability of each behaviour relative to demographic characteristics, and these models were then applied to cancer cases to estimate their probability of each behaviour. Secondly, cold-deck imputation was applied such that two randomly selected but demographically similar health survey respondents both donated their health behaviour to the matching cancer case; the agreement between these two imputed values was used as an estimate of the misclassification and corrected for during the analyses. The logistic regression imputation-based analyses returned accurate point estimates, with wide confidence intervals, if the behaviour occurred in more than approximately 5% of cases. Our reviews and analyses confirmed that pre-diagnosis smoking decreased survival in oesophageal cancer (hazard ratio (HR) 1.08, 95% confidence interval (CI) 1.00-1.17), particularly squamous cell carcinoma when comparing highest to lowest lifetime exposure (HR 1.55, 95% CI 1.25-1.94), with similar associations for alcohol consumption. Pre-diagnosis leisure time physical activity was found to be associated with reduced hazard (HR 0.25, 95% CI 0.03-0.81) overall. Findings from these analyses can assist in modelling the impact of current changes in community health behaviour, as well as informing prognosis and treatment decisions at the individual level. This novel method of augmenting cancer registry data with pre-diagnosis variables appears to be effective and will benefit from further validation. This thesis has significantly progressed both issues and identified future opportunities for research and development
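    A minimal sketch of the two-donor cold-deck idea in Python follows; the column names, exact matching on demographic variables, and the bare agreement rate are simplifying assumptions, whereas the thesis uses demographic similarity matching and a misclassification correction during analysis.

        import numpy as np
        import pandas as pd

        def two_donor_cold_deck(cases, survey, match_vars, behaviour, rng=None):
            """For each registry case, draw two matched survey respondents as
            donors (assumes at least two matches exist); their agreement rate
            serves as an estimate of the imputation misclassification."""
            rng = rng or np.random.default_rng()
            d1, d2 = [], []
            for _, case in cases.iterrows():
                pool = survey
                for v in match_vars:                 # exact demographic match
                    pool = pool[pool[v] == case[v]]
                i, j = rng.choice(pool.index, size=2, replace=False)
                d1.append(survey.loc[i, behaviour])
                d2.append(survey.loc[j, behaviour])
            d1, d2 = np.array(d1), np.array(d2)
            return d1, d2, (d1 == d2).mean()         # donors + agreement rate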

    Non-response error in surveys

    Non-response is an error common to most surveys. In this dissertation, the error of non-response is described in terms of its sources and its contribution to the Mean Square Error of survey estimates. Various response and completion rates are defined. Techniques are examined that can be used to identify the extent of non-response bias in surveys. Methods to identify auxiliary variables for use in non-response adjustment procedures are described. Strategies for dealing with non-response are classified into two types, namely preventive strategies and post hoc adjustments of data. Preventive strategies discussed include the use of call-backs and follow-ups and the selection of a probability sub-sample of non-respondents for intensive follow-ups. Post hoc adjustments discussed include population and sample weighting adjustments and raking ratio estimation to compensate for unit non-response, as well as various imputation methods to compensate for item non-response. Mathematical Sciences. M. Com. (Statistics)
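    As a concrete instance of one of the post hoc adjustments mentioned, here is a minimal sketch of raking ratio estimation (two-margin iterative proportional fitting) in Python; the margins, totals, and fixed iteration count are illustrative assumptions.

        import numpy as np

        def rake(w, groups_a, totals_a, groups_b, totals_b, iters=50):
            """Alternately rescale the weights so that the weighted counts of
            each margin match the known population totals."""
            w = w.astype(float).copy()
            for _ in range(iters):
                for groups, totals in ((groups_a, totals_a), (groups_b, totals_b)):
                    for g, t in totals.items():
                        m = groups == g
                        w[m] *= t / w[m].sum()
            return w

        # Hypothetical usage with sex and age-band margins:
        # w = rake(np.ones(n), sex, {"f": 5100, "m": 4900},
        #          age, {"<40": 4500, "40+": 5500})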

    Composed Index for the Evaluation of Energy Security in Power Systems within the Frame of Energy Transitions—The Case of Latin America and the Caribbean

    Energy transitions are transforming energy systems around the globe. Such a shift has caused the power system to become a critical piece of infrastructure for the economic development of every nation on the planet. Therefore, guaranteeing its security is crucial, not only for energy purposes but also as part of a national security strategy. This paper presents a multidimensional index developed to assess the energy security of electrical systems in the long term. This tool, named the Power System Security Index (PSIx), which has previously been used for the evaluation of a single country in two different time frames, is applied to evaluate the member countries of the Latin American Energy Organization, located within the Latin America and the Caribbean region, and to measure their performance on energy security. Mixed results were obtained from the analysis, with clear top performers in the region such as Argentina, while others show broad areas of opportunity, as is the case of Haiti

    The application of nonparametric data augmentation and imputation using classification and regression trees within a large-scale panel study

    Generally, multiple imputation is the recommended method for handling item nonresponse in surveys. Usually it is applied as a chained equations approach based on parametric models. As Burgette & Reiter (2010) have shown, classification and regression trees (CART) are a good alternative to parametric conditional models, especially when complex models, interactions, and nonlinear relationships have to be handled and the number of variables is very large. In large-scale panel studies, many types of data sets with special data situations have to be handled. Building on the study of Burgette & Reiter (2010), this thesis further assesses the suitability of CART in combination with multiple imputation and data augmentation in some of these special situations.
    Unit nonresponse, panel attrition in particular, is a problem with a high impact on survey quality in the social sciences. The first application aims at imputing missing values by CART to generate a proper data base for deciding whether weighting has to be considered. This decision was based on auxiliary information about respondents and nonrespondents. Both the auxiliary information and the participation status, as response indicator, contained missing values that had to be imputed. The described situation originated in a school survey: the schools were asked to transmit auxiliary information about their students without knowing whether the students had participated in the survey. In the end, the auxiliary information and the participation status were to be combined via identification numbers by the survey research institute. Some data were collected and transmitted correctly, some were not. Due to those errors, four data situations were distinguished and handled in different ways:
    1) Complete cases, i.e. no missing values in either the participation status or the auxiliary information: the information whether the student participated was available, and the auxiliary information was completely observed and correctly merged.
    2) The participation status was missing, but the auxiliary information was complete. This happened when a school transmitted the auxiliary data of a student completely, but the combination with the survey participation information failed.
    3) The participation status was available, but there were missing values in the auxiliary information.
    4) There were missing values in both the participation status and the auxiliary information.
    The procedure for the complete-data situation 1) was a standard probit analysis. A Probit Forecast Draw was applied in situations 2) and 4), based on a Metropolis-Hastings algorithm that used the available information on the maximum number of participants conditional on an auxiliary variable: in practice, the numbers of male and female students that participated in the survey were known and served as maxima when the auxiliary information was combined with a probable participation status. All missing values in the auxiliary information, i.e. in situations 3) and 4), were augmented by CART, meaning that the imputation values were drawn via Bayesian bootstrap from the final nodes of the classification and regression trees. Together, the imputation and the probit model with the response indicator as dependent variable resulted in a data augmentation approach. All steps were chained to use as much information as possible for the analysis.
    The application shows that CART can be flexibly combined with data augmentation, resulting in a Markov chain Monte Carlo method, more precisely a Gibbs sampler. The analysis of the (meta-)data showed a selectivity due to nonparticipation which could be explained by the variable sex: female students were more likely to participate than male students. The results based on CART differed clearly from those of the complete-case analysis ignoring the second-level random effect, as well as from those of the complete-case analysis including the second-level random effect.
    Surveys based on flexible filtering offer the opportunity to adjust the questionnaire to the respondent's situation; data quality can thus be increased and response burden decreased. Filters are therefore often implemented in large-scale surveys, resulting in a complex data structure that has to be considered when imputing. The second study of this thesis shows how a data set containing many filters and a high filter depth, which limits the admissible range of values for multiple imputation, can be handled using CART. In more detail, a very large and complex data set contained variables that were used for the analysis of household net income. The variables were distributed over modules, i.e. blocks of questions referring to certain topics which are partially steered by filters; within those modules, the survey was additionally steered by filter questions. As a consequence, the number of respondents differed for each variable. Given the structure of the survey, it can be assumed that missing values were mainly produced by filters or caused intentionally by the respondent, and that only a minor part were missing because, e.g., interviewers overlooked them. The second application shows that the described procedure is able to respect this complex data structure, as the draws from CART are flexibly limited by the changing filter structure, which is itself generated by imputed filter-steering values. Given the 213 variables chosen for the household net income imputation, CART, in contrast to other approaches, clearly saves time, as no model specification is needed for each variable to be imputed.
    Still, feedback on the suitability of CART-based imputation is needed. Therefore, as the third application of this thesis, a simulation study was conducted to assess the performance of CART in combination with multiple imputation by chained equations (MICE) on cross-sectional data, and to check whether a change of settings improves the performance for the given data. Three different data generating functions of Y were used: the first was a typical linear model with a normally distributed error term, the second included a chi-squared error term, and the third included a non-linear (logarithmic) term. The rate of missing values was set to 60%, steered by a missing-at-random mechanism. Regression parameters, means, quantiles, and correlations were calculated and combined. The quality of the estimation for the before-deletion, complete-case, and imputed data was measured by coverage, i.e. the proportion of 95%-confidence intervals for the estimated parameters that contain the true value; bias and mean squared error were calculated as well. Then the settings were changed for the first type of data set, the ordinary linear model: first, the initialization was changed to a tree-based initialization instead of draws from the unconditional empirical distribution; second, the iterations of the tree-based MI approach were increased from 20 to 50; third, the number of imputed data sets combined for the confidence intervals was doubled from 15 to 30. CART-based MICE showed a good performance (coverage of 88.8% to 91.8%) for all three data sets, and it was not worthwhile changing the settings of CART for the partitioning of the simulated data. The third application thus offers insights into the performance and the settings of CART-based MICE: many default settings and peculiarities have to be considered, and the results suggest that the default settings, and the performance of CART in general, lead to sufficient results on cross-sectional data. Regarding the settings, changing the initialization from tree-based draws to draws from the unconditional empirical distribution is recommendable for typical survey data, i.e. data with missing values in large parts of the data.
    The fourth application gives some insights into the performance of CART-based MICE on panel data. For this purpose, the first simulated data set was extended to panel data containing information from two waves. Four data situations were distinguished: three random effects models with different combinations of time-variant and time-invariant variables, and a fixed effects model, the latter defined by an intercept that is correlated with a regressor, the missingness-steering variable X1. CART-based MICE showed a good performance (coverage of 89.0% to 91.4%) for all four data sets. CART chose the variables from the correct wave for each of the four data situations and waves: only first-wave information was used for the imputation of the first-wave variable Yt=1, and only second-wave information for the second-wave variable Yt=2. This is crucial, as the data generation for each wave was either independent of the other wave or the variables were time-variant in all four data situations.
    This thesis demonstrates that CART can be used as a highly flexible imputation component which can be recommended, with constraints, for large-scale panel studies. Missing values in cross-sectional data as well as panel data can be handled with CART-based MICE; the accuracy, of course, depends on the available explanatory power and correlations in both cases. The combination of CART with data augmentation and the extension concerning the filtering of the data are both feasible and promising. Further research on the performance of CART is highly recommended, for example extending the current simulation study with changes of the variables over time based on past values of the same variable, more waves, or different data generation processes
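    As an illustration of the central imputation step used throughout these applications, here is a minimal sketch in Python; scikit-learn's DecisionTreeRegressor stands in for the CART implementation, and a uniform donor draw within each final node replaces the full Bayesian bootstrap described above, so this is a simplified sketch rather than the thesis's exact method.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        def cart_impute_step(X, y, min_leaf=5, rng=None):
            """Grow a CART on the observed cases, drop each incomplete case
            into its final node, and draw the imputed value from the observed
            responses in that node (uniform draw standing in for the
            Bayesian bootstrap)."""
            rng = rng or np.random.default_rng()
            obs = ~np.isnan(y)
            tree = DecisionTreeRegressor(min_samples_leaf=min_leaf)
            tree.fit(X[obs], y[obs])
            leaf_obs = tree.apply(X[obs])      # final node of every donor
            y_out, donors = y.copy(), y[obs]
            for i, leaf in zip(np.where(~obs)[0], tree.apply(X[~obs])):
                y_out[i] = rng.choice(donors[leaf_obs == leaf])
            return y_out

    Within a MICE loop, this step would be cycled over every incomplete variable, each time conditioning on the current completed versions of the others.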

    Pure imputation for statistical surveys
