This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical “key” variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example, tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of “reasonable” models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria that detect “underfitting” (associated with overestimation of the risk). The criteria may also reveal “overfitting” (associated with underestimation), although not so clearly, so we suggest employing a forward model selection approach. Our criteria turn out to be related to established methods of testing for overdispersion in Poisson log-linear models. We show how our approach may be used for both file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census, where the true risks can be determined, and show that a forward selection approach leads to good risk estimates.
There are several “good” models between which our approach provides little discrimination. The risk estimates are found to be stable across these models, implying a form of robustness. We also apply our approach to a large survey dataset. There is no indication that increasing the sample size necessarily leads to the selection of a more complex model. The risk estimates for this application display more variation but suggest a suitable upper bound.
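To illustrate the kind of estimation the abstract describes, the sketch below is a hypothetical, simplified example (not code from the article): it builds a synthetic population cross-classified by three key variables, draws a Bernoulli sample, fits the all-main-effects (independence) Poisson log-linear model to the sample counts, and evaluates the standard file-level risk measures for sample-unique cells, τ1 = Σ P(F_k = 1 | f_k = 1) and τ2 = Σ E[1/F_k | f_k = 1]. All variable names, cell dimensions, and sample sizes here are illustrative assumptions; the article's applications involve six key variables and millions of cells, and model selection beyond the independence model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (illustrative, not from the article): three categorical
# key variables with 6, 5, and 4 categories, a population of 100,000
# cross-classified into 120 cells, and a 1% Bernoulli sample (pi = 0.01).
pi = 0.01
shape = (6, 5, 4)
probs = rng.dirichlet(np.ones(np.prod(shape))).reshape(shape)
pop = rng.multinomial(100_000, probs.ravel()).reshape(shape)
sample = rng.binomial(pop, pi)  # Bernoulli subsampling of each population cell
n = sample.sum()

# Fit the independence log-linear model to the sample counts. For this model
# the fitted cell means have the closed form mu_k = n * p_a * p_b * p_c,
# i.e. products of the observed marginal proportions.
pa = sample.sum(axis=(1, 2)) / n
pb = sample.sum(axis=(0, 2)) / n
pc = sample.sum(axis=(0, 1)) / n
mu_sample = n * pa[:, None, None] * pb[None, :, None] * pc[None, None, :]

# Scale fitted sample means up to population cell means lambda_k, then, for
# each sample-unique cell (f_k = 1), F_k | f_k = 1 is 1 + Poisson(m) with
# m = lambda_k * (1 - pi), giving:
#   P(F_k = 1 | f_k = 1) = exp(-m)
#   E[1 / F_k | f_k = 1] = (1 - exp(-m)) / m
lam = mu_sample / pi
uniques = sample == 1
m = lam[uniques] * (1 - pi)
tau1_hat = np.exp(-m).sum()
tau2_hat = ((1 - np.exp(-m)) / m).sum()

# Because the population is synthetic, the true number of sample uniques that
# are also population uniques is known, so the estimate can be checked.
tau1_true = (uniques & (pop == 1)).sum()
print(f"sample uniques: {uniques.sum()}, "
      f"tau1_hat: {tau1_hat:.2f} (true {tau1_true}), tau2_hat: {tau2_hat:.2f}")
```

Swapping the independence model for richer log-linear models (adding interaction terms) is where the article's concern arises: more complex models tend to produce lower risk estimates, which motivates the overdispersion-based criteria and the forward selection strategy summarized above.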