Processing of Erroneous and Unsafe Data

Waal, A.G. de

thesis

Authors: A.G. de Waal
Publication date: 19 June 2003
Abstract

Statistical offices have to overcome many problems before they can publish reliable data. Two of these problems are examined in this thesis. The first problem is the occurrence of errors in the collected data. Because of these errors, publication figures cannot be based directly on the collected data. Before publication, the errors in the data have to be localised and corrected. In this thesis we focus on the localisation of errors in a mix of categorical and numerical data. The problem is formulated as a mathematical optimisation problem. Several new algorithms for solving this problem are proposed, and computational results of the most promising algorithms are compared with each other. The second problem examined in this thesis is the occurrence of unsafe data, i.e. data that would reveal too much sensitive information about individual respondents. Before publication, such unsafe data need to be protected. In the thesis we examine various aspects of the protection of unsafe data.

Statistical offices have to overcome numerous problems before they can publish the results of their surveys. The thesis addresses two of these problems. The first problem is that collected data may be erroneous. Because errors may be present, the data must first be checked and, where necessary, corrected before results are published. The thesis pays particular attention to locating the erroneous data. By assuming that as few errors as possible have been made, locating the erroneous values can be formulated as a mathematical optimisation problem. The thesis develops a number of methods for solving this complex problem efficiently.

The second problem examined in the thesis is that no data may be published that harm the privacy of individual respondents or small groups of respondents. To protect the data of individual respondents or small groups of respondents, protective measures must be taken, such as not publishing certain information. The thesis addresses the mathematical problems that the protection of sensitive data entails. Solutions are described for a number of these problems, such as calculating the information loss caused by protecting sensitive data and minimising the amount of information that is not published.
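The idea of treating error localisation as an optimisation problem can be illustrated with a small sketch: given a record and a set of consistency rules, look for the fewest fields whose values would have to change for the record to pass every rule. The code below is only an exhaustive-search illustration under stated assumptions, not one of the algorithms proposed in the thesis. The record, the candidate values in `domains`, the rules in `edits`, and the function names are all hypothetical.

```python
from itertools import combinations, product

# Hypothetical record mixing categorical and numerical fields.
record = {"age": 12, "marital_status": "married", "income": 30000}

# Hypothetical candidate values per field; small finite grids keep the
# search exhaustive (real numerical fields would need a different treatment).
domains = {
    "age": [12, 30, 50],
    "marital_status": ["unmarried", "married"],
    "income": [0, 30000],
}

# Hypothetical consistency rules; each returns True when the record passes.
edits = [
    lambda r: not (r["age"] < 16 and r["marital_status"] == "married"),
    lambda r: not (r["age"] < 16 and r["income"] > 0),
]


def passes(r):
    """True if the record satisfies every rule."""
    return all(rule(r) for rule in edits)


def localise_errors(record, domains):
    """Return all smallest sets of fields that can be changed so every rule passes."""
    fields = list(record)
    # Try changing 0 fields, then 1, then 2, ... and stop at the first size that works.
    for k in range(len(fields) + 1):
        solutions = []
        for subset in combinations(fields, k):
            # Try every combination of candidate values for the chosen fields.
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if passes(candidate):
                    solutions.append(set(subset))
                    break
        if solutions:
            return solutions
    return []


solutions = localise_errors(record, domains)
print(solutions)  # → [{'age'}]
```

In this toy record, changing `age` alone is enough to satisfy both rules, whereas changing only `marital_status` or only `income` is not, so `age` is flagged as the most plausible error. Exhaustive search grows exponentially with the number of fields, which is one reason efficient algorithms for this problem are of interest.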