Some methods for handling missing data in surveys
Missing data, or incomplete data, inevitably occur in many surveys. They arise mainly from nonresponse, when sample units fail to respond, fully or in part, to the survey items, but they can also arise from sample selection. For example, two-phase sampling can be viewed as a missing data problem in the sense that the study variable is not observed in the first phase. Truncated data, which are intentionally selected by the researcher, also pose a missing data problem if we are interested in estimating properties of the non-truncated data. Statistical methods for handling missing data fall broadly into two types: weighting methods and imputation methods. Weighting methods such as propensity score adjustment, which uses the response probability to compensate for nonresponse, are popular for reducing nonresponse bias. Imputation methods, in turn, are widely used to create complete data sets for subsequent estimation and inference. In this thesis we investigate new statistical methods in both the weighting and imputation approaches, corresponding to three different missing data situations: (i) propensity score adjustment for nonignorable nonresponse data with several follow-ups, (ii) correlation estimation for singly truncated bivariate samples, and (iii) fractional hot deck imputation for multivariate missing data.
Propensity score adjustment with several follow-ups
Propensity score weighting adjustment is commonly used to handle unit nonresponse. When the response mechanism is nonignorable in the sense that the response probability depends directly on the study variable, a follow-up sample is commonly used to obtain an unbiased estimator using the framework of two-phase sampling, where the follow-up sample is assumed to respond completely. In practice, the follow-up sample is also subject to missingness. We consider propensity score weighting adjustment for nonignorable nonresponse when there are several follow-ups and the final follow-up sample is also subject to missingness. We propose a method-of-moments estimator for estimating parameters in the response probability. The proposed method can be implemented using the generalized method of moments and a consistent variance estimate can be obtained relatively easily. A limited simulation study shows the robustness of the proposed method. The proposed methods are applied to a Korean household survey of employment
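To make the weighting idea concrete, the following sketch illustrates generic propensity-score adjustment for nonresponse: fit a response model, then reweight respondents by the inverse of their estimated response probability. This is a simplified illustration with simulated data and an ignorable response mechanism, not the paper's method-of-moments estimator for nonignorable nonresponse with follow-ups; all variable names and parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated survey: x is an auxiliary variable, y the study variable.
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)

# Ignorable response mechanism for illustration: response depends on x only.
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))
r = rng.random(n) < p_true          # response indicator

# Fit a logistic response model P(r=1|x) by Newton-Raphson.
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (r - p)                             # score
    hess = -(X * (p * (1 - p))[:, None]).T @ X       # Hessian
    beta -= np.linalg.solve(hess, grad)

p_hat = 1.0 / (1.0 + np.exp(-X @ beta))

# Inverse-probability (Hajek-type) weighted estimator of the mean of y.
naive = y[r].mean()                                      # respondents only
ipw = np.sum(y[r] / p_hat[r]) / np.sum(1.0 / p_hat[r])   # propensity-adjusted

print(f"true mean ~ 2.0, naive {naive:.3f}, PS-adjusted {ipw:.3f}")
```

Because respondents here tend to have larger x (and hence larger y), the naive respondent mean is biased upward, while the propensity-adjusted estimator recovers the population mean.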
Identification enhanced generalised linear model estimation with nonignorable missing outcomes
Missing data often result in undesirable bias and loss of efficiency. These
become substantial problems when the response mechanism is nonignorable, such
that the response model depends on unobserved variables. It is necessary to
estimate the joint distribution of unobserved variables and response indicators
to manage nonignorable nonresponse. However, model misspecification and
identification issues prevent robust estimates despite careful estimation of
the target joint distribution. In this study, we modelled the distribution of
the observed parts and derived sufficient conditions for model identifiability,
assuming a logistic regression model as the response mechanism and generalised
linear models as the main outcome model of interest. More importantly, the
derived sufficient conditions are testable with the observed data and do not
require any instrumental variables, which are often assumed to guarantee model
identifiability but cannot be practically determined beforehand. To analyse
missing data, we propose a new imputation method which incorporates verifiable
identifiability using only observed data. Furthermore, we present the
performance of the proposed estimators in numerical studies and apply the
proposed method to two sets of real data: exit polls from the 19th South
Korean election and public data from the Korean Survey of Household
Finances and Living Conditions.
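The abstract's setup can be sketched numerically: with a Gaussian outcome model and a logistic response model that depends on the unobserved outcome, the observed-data likelihood sums a complete term for respondents and an integrated term for nonrespondents. The toy below builds that likelihood with Gauss-Hermite quadrature and checks that it prefers the true nonignorable response model over a misspecified ignorable one. This is a minimal illustration under assumed parameter values, not the paper's estimation method or its identifiability conditions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 5_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)      # Gaussian outcome model

# Nonignorable response: probability depends on the outcome y itself.
p_resp = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * y)))
r = rng.random(n) < p_resp

# Gauss-Hermite nodes/weights for integrals against a N(mu, s^2) density.
nodes, wts = np.polynomial.hermite_e.hermegauss(40)

def obs_loglik(b0, b1, s, a0, a1):
    """Observed-data log-likelihood: respondents contribute
    f(y|x) P(r=1|y); nonrespondents contribute the integral of
    f(y|x) P(r=0|y) over the missing y (quadrature)."""
    mu = b0 + b1 * x
    ll_r = (norm.logpdf(y[r], mu[r], s)
            - np.log1p(np.exp(-(a0 + a1 * y[r])))).sum()
    t = mu[~r][:, None] + s * nodes[None, :]
    p0 = 1.0 / (1.0 + np.exp(a0 + a1 * t))          # P(r=0 | y=t)
    ll_m = np.log((wts * p0).sum(axis=1) / np.sqrt(2 * np.pi)).sum()
    return ll_r + ll_m

ll_true = obs_loglik(1.0, 2.0, 1.0, -0.5, 1.0)   # true response model
ll_ignorable = obs_loglik(1.0, 2.0, 1.0, -0.5, 0.0)  # coefficient on y forced to 0
print(ll_true > ll_ignorable)
```

With a large sample, the likelihood evaluated at the true nonignorable parameters exceeds the ignorable misspecification; maximizing such a likelihood in practice is exactly where the identifiability issues the paper addresses become critical.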
Cost-Effective Extreme Case-Control Design Using a Resampling Method
The nested case-control sampling design is a popular method in cohort studies whose events are rare. Controls are randomly selected, with or without the matching variable fully observed across all cohort samples, to control for confounding factors. In this article, we propose a new nested case-control sampling design incorporating both an extreme case-control design and a resampling technique. The new algorithm has two main advantages over the conventional nested case-control design. First, it inherits the strength of the extreme case-control design in that it does not require the risk sets at each event time to be specified. Second, the target number of controls can be determined solely by budget and time constraints, and the resampling method allows an undersampling design, meaning that the total number of sampled controls can be smaller than the number of cases. A simulation study demonstrated that the proposed algorithm performs well even when the number of controls is smaller than the number of cases. The proposed sampling algorithm is applied to public data collected for the “Thorotrast Study.”
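The undersampling-plus-resampling idea can be caricatured as follows: keep every case, repeatedly draw a small control sample with replacement, and average the estimate across resamples. This toy works with a simple 2x2-table odds ratio on simulated data; the paper's actual algorithm operates on event times in a survival setting, so everything below (the exposure model, the estimator, the constants) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000
# Cohort with a binary exposure; events are rare.
z = rng.random(N) < 0.3                    # exposure indicator
p_event = np.where(z, 0.004, 0.001)        # true odds ratio is about 4
case = rng.random(N) < p_event

cases = np.flatnonzero(case)       # keep every case
controls = np.flatnonzero(~case)
m = len(cases) // 2                # undersampling: fewer controls than cases
B = 200                            # number of resamples

def log_or(z_cases, z_ctrls):
    # 2x2-table log odds ratio with a 0.5 continuity correction.
    a = z_cases.sum() + 0.5; b = (~z_cases).sum() + 0.5
    c = z_ctrls.sum() + 0.5; d = (~z_ctrls).sum() + 0.5
    return np.log(a * d / (b * c))

# Resampling: average the estimate over B small control samples.
est = np.mean([
    log_or(z[cases], z[rng.choice(controls, size=m, replace=True)])
    for _ in range(B)
])
print(f"true log OR ~ {np.log(4):.2f}, resampled estimate {est:.2f}")
```

Averaging over resamples is what lets a control sample smaller than the case set still deliver a usable estimate, mirroring the budget-constrained design described above.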
Does the Written Word Matter? The Role of Uncovering and Utilizing Information from Written Comments in Housing Ads
The hedonic price model is a popular method for estimating the implicit prices of the observed attributes of a property. However, the inputs to the model are typically limited to numerically quantified information. This study quantifies the unstructured qualitative statements contained in the written descriptions from Multiple Listing Service (MLS) data. These statements contain unstructured text describing the features and setting of the house, providing important but typically unused qualitative information. Our approach is unique in that we use this qualitative information to classify the words into eight groups that reflect previously unmeasured housing quality. The purpose of the study is to test whether these previously unmeasured attributes of a property affect its selling price and its time on the market. The dataset consists of 5,160 home sales in Ames, Iowa between the second quarter of 2003 and the second quarter of 2015. Our findings show that the role of unstructured qualitative text varies: some groups are redundant with the quantitative information already in the models and have no effect, while others, particularly those reflecting the quality of the structure, carry unique information and are important predictors of housing prices and time on the market.
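The pipeline implied by the abstract — map free-text remarks to group dummies, then add the dummies to a hedonic regression — can be sketched briefly. The keyword list, the single "quality" group, and all simulated numbers below are hypothetical stand-ins; the study itself uses eight groups derived from actual MLS text.

```python
import numpy as np

# Hypothetical keyword group standing in for one of the paper's eight
# categories; the study's actual word lists are not reproduced here.
QUALITY_WORDS = {"granite", "updated", "remodeled", "custom"}

def quality_dummy(remarks: str) -> int:
    """1 if the listing's free text mentions any quality keyword."""
    words = set(remarks.lower().replace(",", " ").split())
    return int(bool(words & QUALITY_WORDS))

rng = np.random.default_rng(2)
n = 2_000
sqft = rng.uniform(800, 3500, n)
has_quality = rng.random(n) < 0.4
# Simulated listing remarks: quality homes mention a quality keyword.
remarks = np.where(has_quality,
                   "updated kitchen with granite counters",
                   "cozy home close to schools")
price = 40_000 + 90 * sqft + 15_000 * has_quality + rng.normal(0, 10_000, n)

# Hedonic regression: price on structured size plus the text-derived dummy.
q = np.array([quality_dummy(s) for s in remarks], dtype=float)
X = np.column_stack([np.ones(n), sqft, q])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
print(np.round(beta))   # roughly [40000, 90, 15000]
```

The coefficient on the text-derived dummy is the "implicit price" of the previously unmeasured attribute, which is exactly the quantity the study tests for significance.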
FHDI: An R Package for Fractional Hot Deck Imputation
Fractional hot deck imputation (FHDI), proposed by Kalton and Kish (1984) and investigated by Kim and Fuller (2004), is a tool for handling item nonresponse in survey sampling. In FHDI, each missing item is filled with multiple observed values, yielding a single completed data set for subsequent analyses. An R package, FHDI, is developed to perform FHDI as well as the fully efficient fractional imputation (FEFI) method of Fuller and Kim (2005) to impute multivariate missing data with arbitrary missing patterns. FHDI substitutes missing items with a few observed values jointly obtained from a set of donors, whereas FEFI uses all possible donors. This paper introduces FHDI as a tool for implementing the multivariate version of fractional hot deck imputation discussed in Im et al. (2015) as well as FEFI. For variance estimation of FHDI and FEFI, the jackknife method is implemented, and replicated weights are provided as part of the output.

This article is published as Im, J., Cho, I. H., & Kim, J. K. (2018). FHDI: An R package for fractional hot deck imputation. The R Journal, 10(1), 140-154.
DOI: 10.32614/RJ-2018-020.
Copyright 2018 The R Foundation.
Attribution 4.0 International (CC BY 4.0).
Posted with permission
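The core fractional-weighting idea — each missing item receives several donor values, each carrying a fractional weight, so that one completed data set supports weighted analyses — can be illustrated in a few lines. The package itself is used from R; the Python toy below assumes univariate data, MCAR missingness, and random donor selection, all simplifications of the package's multivariate, cell-based donor scheme.

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 1_000, 5          # M donor values per missing item

y = rng.normal(10, 2, size=n)
miss = rng.random(n) < 0.3       # MCAR missingness, for illustration only
obs = y[~miss]

rows, vals, w = [], [], []
# Observed items keep weight 1; each missing item is replaced by M donor
# values, each with fractional weight 1/M, so every record still
# contributes total weight 1 to the completed data set.
for i in range(n):
    if miss[i]:
        donors = rng.choice(obs, size=M, replace=False)
        rows += [i] * M; vals += list(donors); w += [1.0 / M] * M
    else:
        rows.append(i); vals.append(y[i]); w.append(1.0)

vals, w = np.array(vals), np.array(w)
fhdi_mean = np.sum(w * vals) / np.sum(w)   # weights sum to n
print(f"complete-data mean {y.mean():.3f}, fractional-imputation mean {fhdi_mean:.3f}")
```

The `rows` index shows how the completed data set stores multiple weighted records per originally missing unit, which is what makes a single completed file usable for many downstream analyses.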
Propensity score adjustment with several follow-ups
This is a pre-copyedited, author-produced PDF of an article accepted for publication in Biometrika following peer review. The version of record (Jae Kwang Kim, Jongho Im; Propensity score adjustment with several follow-ups. Biometrika 2014; 101 (2): 439-448) is available online at doi:10.1093/biomet/asu003. Posted with permission.
Biodistribution and respiratory toxicity of chloromethylisothiazolinone/methylisothiazolinone following intranasal and intratracheal administration
A variety of isothiazolinone-containing small molecules have been registered and used as chemical additives in many household products. However, their biodistribution and potential harmful effects on human health, especially respiratory effects, have not yet been characterized in sufficient detail. The purpose of this study was to investigate whether a biocide comprising a mixture of chloromethylisothiazolinone (CMIT) and methylisothiazolinone (MIT) could reach the lungs and induce lung injury when exposure occurs by two administration routes involving the respiratory tract: intratracheal and intranasal instillation. To investigate the biodistribution of CMIT/MIT, we quantified the uptake of 14C-labeled CMIT/MIT in experimental animals for up to seven days after intratracheal and intranasal instillation. In the toxicity study, lung injury was assessed in mice using total inflammatory cell counts in bronchoalveolar lavage fluid (BALF) and lung histopathology. The results of the biodistribution study indicated that CMIT/MIT were rapidly distributed throughout the respiratory tract. Using quantitative whole-body autoradiogram analysis, we confirmed that following intranasal exposure, CMIT/MIT reached the lungs via the respiratory tract (nose–trachea–lung). At 5 min after intratracheal and intranasal instillation, the amounts of radiotracer ([14C]CMIT/MIT) in the lungs were 2720 ng g−1 and 752 ng g−1 tissue, respectively, and lung damage was observed. A higher amount of the radiotracer resulted in higher toxicity. Both intratracheal and intranasal instillation of CMIT/MIT increased inflammatory cell counts in the BALF and induced injuries in the alveoli. The frequency and severity scores of injuries caused by intratracheal instillation were approximately four to five times higher than those induced by intranasal instillation.
Therefore, we concluded that CMIT/MIT can reach the lungs following intranasal and intratracheal exposure and cause lung injury, the extent of which depends on the exposure dose.