
    Assessing the disclosure protection provided by misclassification for survey microdata

    Government statistical agencies often apply statistical disclosure limitation techniques to survey microdata to protect confidentiality. There is a need for ways to assess the protection provided. This paper develops some simple methods for disclosure limitation techniques that perturb the values of categorical identifying variables. The methods are applied in numerical experiments based upon census data from the United Kingdom which are subject to two perturbation techniques: data swapping and the post-randomisation method (PRAM). Some simplifying approximations to the measure of risk are found to work well in capturing the impacts of these techniques. These approximations provide simple extensions of existing risk assessment methods based upon Poisson log-linear models. A numerical experiment is also undertaken to assess the impact of multivariate misclassification with an increasing number of identifying variables. The methods developed in this paper may also be used to obtain more realistic assessments of risk which take account of the kinds of measurement and other non-sampling errors commonly arising in surveys.
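    To make the effect of such perturbation concrete, the sketch below applies a PRAM-style misclassification step to toy categorical microdata and compares the count of sample-unique key combinations, a crude proxy for disclosure risk, before and after perturbation. The data, retention probability, and risk proxy are illustrative assumptions, not the paper's actual experimental setup.

```python
# Illustrative sketch (not the paper's experiment): PRAM-style
# misclassification of categorical identifying variables, with the number
# of sample-unique key combinations used as a crude disclosure-risk proxy.
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Toy microdata: three categorical identifying variables.
levels = [8, 6, 12]
data = np.column_stack([rng.integers(0, k, size=n) for k in levels])

def pram_column(col, k, keep=0.85):
    # Retain each value with probability `keep`; otherwise replace it with
    # a category drawn uniformly from the other k - 1 categories.
    change = rng.random(col.size) > keep
    shifted = (col + 1 + rng.integers(0, k - 1, size=col.size)) % k
    return np.where(change, shifted, col)

perturbed = np.column_stack(
    [pram_column(data[:, j], k) for j, k in enumerate(levels)]
)

def n_sample_uniques(arr):
    # Count key combinations that occur exactly once in the file.
    _, counts = np.unique(arr, axis=0, return_counts=True)
    return int((counts == 1).sum())

print("sample uniques before PRAM:", n_sample_uniques(data))
print("sample uniques after PRAM: ", n_sample_uniques(perturbed))
```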

    Calibrated imputation of numerical data under linear edit restrictions

    A common problem faced by statistical offices is that data may be missing from collected data sets. The typical way to overcome this problem is to impute the missing data. The problem of imputing missing data is complicated by the fact that statistical data often have to satisfy certain edit rules and that values of variables sometimes have to sum up to known totals. Standard imputation methods for numerical data as described in the literature generally do not take such edit rules and totals into account. In this paper we describe algorithms for the imputation of missing numerical data that do take edit restrictions into account and ensure that sums are calibrated to known totals. The methods impute the missing data sequentially, i.e. the variables with missing values are imputed one by one. To assess the performance of the imputation methods, both a simulation study and an evaluation study based on a real dataset are carried out.
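    A minimal sketch of the idea, under assumptions not taken from the paper: impute missing non-negative components by column means, then prorate the imputed cells in each record so the components add up to a known row total. The paper's algorithms handle general linear edit restrictions; this illustrates only a single sum constraint with a non-negativity edit.

```python
# Illustrative sketch: mean imputation followed by a calibration (prorating)
# step so that each record's components sum to a known total, while keeping
# the non-negativity edit x_j >= 0 satisfied.
import numpy as np

rng = np.random.default_rng(7)

# Toy data: 3 non-negative components per record plus a known row total.
X = rng.gamma(shape=2.0, scale=10.0, size=(8, 3))
totals = X.sum(axis=1)                  # known totals, e.g. from a register
mask = rng.random(X.shape) < 0.3        # True marks a missing cell
X_obs = np.where(mask, np.nan, X)

col_means = np.nanmean(X_obs, axis=0)   # step 1: initial imputations

X_imp = X_obs.copy()
for i in range(X_imp.shape[0]):
    miss = np.isnan(X_imp[i])
    if not miss.any():
        continue
    X_imp[i, miss] = col_means[miss]
    gap = totals[i] - X_imp[i, ~miss].sum()  # what imputed cells must sum to
    s = X_imp[i, miss].sum()
    if s > 0 and gap >= 0:
        X_imp[i, miss] *= gap / s            # step 2: prorate to the total

# Every record now satisfies the sum constraint (gap >= 0 holds here
# because the observed components never exceed the true total).
assert np.allclose(X_imp.sum(axis=1), totals)
```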

    Measuring risk of re-identification in microdata: state of the art and new directions

    We review the influential research carried out by Chris Skinner in the area of statistical disclosure control, and in particular quantifying the risk of re-identification in sample microdata from a random survey drawn from a finite population. We use the sample microdata to infer population parameters when the population is unknown, and estimate the risk of re-identification based on the notion of population uniqueness using probabilistic modelling. We also introduce a new approach to measuring the risk of re-identification for a subpopulation in a register that is not representative of the general population, for example a register of cancer patients. Moreover, the additional information from the register can be used to measure the risk of re-identification for the sample microdata. This new approach was developed by the two authors and is published here for the first time. We demonstrate the approach in an application study based on UK census data, where we can compare the estimated risk measures to the known truth.
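    As a concrete illustration of risk measures built on population uniqueness, the sketch below computes two classical file-level quantities on simulated data where, as in a census-based evaluation, the population counts are known: the number of sample uniques that are also population unique, and the expected number of correct re-identifications among sample uniques. The simulated data and key structure are assumptions for illustration, not the authors' study design.

```python
# Illustrative sketch: file-level re-identification risk when population
# counts F_k are known. For each key (cell) k with sample count f_k:
#   tau1 = #{k : f_k = 1 and F_k = 1}   (sample uniques that are
#                                        population unique)
#   tau2 = sum_{k : f_k = 1} 1 / F_k    (expected number of correct
#                                        re-identifications)
import numpy as np

rng = np.random.default_rng(1)

N, n = 50_000, 1_000
keys = rng.integers(0, 40, size=(N, 3))            # 3 identifying variables
sample_idx = rng.choice(N, size=n, replace=False)  # simple random sample

pop_cells, F = np.unique(keys, axis=0, return_counts=True)
samp_cells, f = np.unique(keys[sample_idx], axis=0, return_counts=True)

# Look up the population count for each cell present in the sample.
pop_lookup = {tuple(c): int(Fk) for c, Fk in zip(pop_cells, F)}
F_samp = np.array([pop_lookup[tuple(c)] for c in samp_cells])

uniques = f == 1
tau1 = int((uniques & (F_samp == 1)).sum())
tau2 = float((1.0 / F_samp[uniques]).sum())
print(f"sample uniques: {int(uniques.sum())}, tau1: {tau1}, tau2: {tau2:.2f}")
```

    In practice the population counts are unknown, and tau1 and tau2 must be estimated from the sample alone, e.g. via probabilistic models such as the Poisson log-linear approach mentioned above.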

    Participant recruitment in sensitive surveys: a comparative trial of ‘opt in’ versus ‘opt out’ approaches

    BACKGROUND: Although in health services survey research we strive for a high response rate, this must be balanced against the need to recruit participants ethically and considerately, particularly in surveys of a sensitive nature. In survey research there are no established recommendations to guide the recruitment approach, and an ‘opt-in’ system that requires potential participants to request a copy of the questionnaire by returning a reply slip is frequently adopted. However, in observational research the risk to participants is lower than in clinical research, and so some surveys have used an ‘opt-out’ system. The effect of this approach on response and distress is unknown. We sought to investigate this in a survey of end-of-life care completed by bereaved relatives. METHODS: From a sample of 1422 bereaved relatives, we assigned potential participants to one of two study groups: an ‘opt-in’ group (n = 711), where a letter of invitation was issued with a reply slip to request a copy of the questionnaire; or an ‘opt-out’ group (n = 711), where the survey questionnaire was provided alongside the invitation letter. We compared response and distress between groups. RESULTS: Of the 1422 potential participants, 473 returned questionnaires. Response was higher in the ‘opt-out’ group than in the ‘opt-in’ group (40% compared to 26.4%: χ² = 29.79, p < 0.01), there were no differences in distress or complaints about the survey between groups, and assignment to the ‘opt-out’ group was an independent predictor of response (OR = 1.84, 95% CI: 1.45-2.34). Moreover, the ‘opt-in’ group were more likely to decline to participate (χ² = 28.60, p < 0.01), and there was a difference in the pattern of questionnaire responses between study groups. CONCLUSION: Given that the ‘opt-out’ method of recruitment is associated with a higher response than the ‘opt-in’ method, seems to have no impact on complaints or distress about the survey, and there are differences in the patterns of responses between groups, the ‘opt-out’ method can be recommended as the most efficient way to recruit into surveys, even those of a sensitive nature.
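    The headline comparison can be reproduced arithmetically. The sketch below back-calculates approximate cell counts from the reported response rates (40% and 26.4% of 711 invitations per group) and runs a chi-squared test of independence; because the counts are rounded, the statistic differs slightly from the published χ² = 29.79.

```python
# Reproducing the opt-out vs opt-in comparison from reported percentages.
# Cell counts are back-calculated approximations, so the statistic will
# not match the published value of 29.79 exactly.
from scipy.stats import chi2_contingency

opt_out_resp = round(0.400 * 711)        # ~284 responders
opt_in_resp = round(0.264 * 711)         # ~188 responders
table = [
    [opt_out_resp, 711 - opt_out_resp],  # opt-out: responded / did not
    [opt_in_resp, 711 - opt_in_resp],    # opt-in:  responded / did not
]

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")
```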

    Improving Probabilistic Record Linkage Using Statistical Prediction Models

    Record linkage brings together information from records in two or more data sources that are believed to belong to the same statistical unit, based on a common set of matching variables. Matching variables, however, can appear with errors and variations, and the challenge is to link statistical units that are subject to such errors. We provide an overview of record linkage techniques and specifically investigate the classic Fellegi-Sunter probabilistic record linkage framework, to assess whether the decision rule for classifying pairs into sets of matches and non-matches can be improved by incorporating a statistical prediction model. We also study whether the enhanced linkage rule can provide better results in terms of preserving associations between variables in the linked data file that are not used in the matching procedure. A simulation study and an application based on real data are used to evaluate the methods.
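    A minimal sketch of the classical Fellegi-Sunter decision rule follows; the m- and u-probabilities, variables, and thresholds are invented for illustration (in practice they are estimated, e.g. via the EM algorithm), and the paper's proposed enhancement, incorporating a statistical prediction model into the classification step, is not shown.

```python
# Illustrative Fellegi-Sunter scoring: each candidate pair receives a
# log2 likelihood-ratio weight per matching variable, computed from the
# probability of agreement among true matches (m) and among non-matches
# (u); the summed weight is compared against two thresholds.
import math

# Assumed m- and u-probabilities for three matching variables.
m = {"surname": 0.95, "birth_year": 0.90, "postcode": 0.85}
u = {"surname": 0.01, "birth_year": 0.05, "postcode": 0.02}

def pair_weight(agreements):
    """Sum the agreement/disagreement weights over matching variables."""
    w = 0.0
    for var, agrees in agreements.items():
        if agrees:
            w += math.log2(m[var] / u[var])
        else:
            w += math.log2((1 - m[var]) / (1 - u[var]))
    return w

UPPER, LOWER = 6.0, 0.0  # assumed thresholds: link / review / non-link
w = pair_weight({"surname": True, "birth_year": True, "postcode": False})
decision = "link" if w > UPPER else ("non-link" if w < LOWER else "review")
print(f"weight = {w:.2f} -> {decision}")
```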

    Improving Statistical Matching when Auxiliary Information is Available

    There is growing interest within National Statistical Institutes in combining available datasets containing information on a large variety of social domains. Statistical matching approaches can be used to integrate data sources through a common set of variables, where each dataset contains different units that belong to the same target population. However, a common problem relates to the assumption of conditional independence among variables observed in different data sources. In this context, an auxiliary dataset containing all the variables jointly can be used to improve the statistical matching by providing information on the correlation structure of variables observed across the different datasets. We propose modifying the prediction models from the auxiliary dataset through a calibration step and show that this can improve the outcome of statistical matching in a variety of settings. We evaluate the proposed approach via simulation and an application based on the European Union Statistics on Income and Living Conditions (EU-SILC) and the Living Costs and Food Survey for the United Kingdom.
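    The following sketch illustrates the role of the auxiliary file with invented data and a plain regression model; the calibration step that constitutes the paper's actual proposal is omitted. Recipient file A observes (X, Y), a donor file would observe (X, Z), and a small auxiliary file C observes all three jointly, so a model of Z given (X, Y) fitted on C preserves the Y-Z association that the conditional independence assumption would discard.

```python
# Illustrative sketch: statistical matching helped by an auxiliary file
# that observes (X, Y, Z) jointly. The paper's calibration step is omitted.
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.normal(size=n)
    y = 0.8 * x + rng.normal(scale=0.6, size=n)
    z = 0.5 * x + 0.7 * y + rng.normal(scale=0.5, size=n)  # Z depends on Y | X
    return x, y, z

xa, ya, _ = make_data(2000)   # recipient file A: Z not observed
xc, yc, zc = make_data(300)   # auxiliary file C: everything observed

# Fit Z on (1, X, Y) in the auxiliary file by least squares.
design_c = np.column_stack([np.ones_like(xc), xc, yc])
beta, *_ = np.linalg.lstsq(design_c, zc, rcond=None)

# Impute Z in file A with the auxiliary-file model; a model fitted on a
# donor file observing only (X, Z) could not capture the Y-Z association.
design_a = np.column_stack([np.ones_like(xa), xa, ya])
z_imputed = design_a @ beta

print("coefficients (intercept, X, Y):", np.round(beta, 2))
print("corr(Y, imputed Z) in A:", round(float(np.corrcoef(ya, z_imputed)[0, 1]), 2))
```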