1,071 research outputs found

    A genetic approach to statistical disclosure control

    Statistical disclosure control is the collective name for a range of tools used by data providers such as government departments to protect the confidentiality of individuals or organizations. When the published tables contain magnitude data such as turnover or health statistics, the preferred method is to suppress the values of certain cells. Assigning a cost to the information lost by suppressing any given cell creates the cell suppression problem. This consists of finding the minimum cost solution which meets the confidentiality constraints. Solving this problem simultaneously for all of the sensitive cells in a table is NP-hard and not feasible for medium-sized to large tables. In this paper, we describe the development of a heuristic tool for this problem which hybridizes linear programming (to solve a relaxed version for a single sensitive cell) with a genetic algorithm (to seek an order for considering the sensitive cells which minimizes the final cost). Considering a range of real-world and representative artificial datasets, we show that the method is able to provide relatively low cost solutions for far larger tables than the optimal approach can tackle. We show that our genetic approach is able to significantly improve on the initial solutions provided by existing heuristics for cell ordering, and outperforms local search. This approach is then extended and applied to large statistical tables with over 200,000 cells. © 2012 IEEE
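    The cell suppression problem described in this abstract can be illustrated on a toy table. The sketch below is not the paper's LP and genetic-algorithm hybrid. It is a brute-force search under a simplified attack model in which an intruder only recovers cells by repeated subtraction from published row and column totals. The function names `recoverable` and `cheapest_pattern` and the cost matrix are hypothetical.

```python
from itertools import combinations


def recoverable(shape, suppressed):
    """Cells an intruder recovers by repeated subtraction from row/column totals.

    Simplified attack model (an assumption of this sketch): a suppressed cell
    is exposed as soon as it is the only hidden cell left in its row or column.
    """
    n_rows, n_cols = shape
    hidden = set(suppressed)
    recovered = set()
    changed = True
    while changed:
        changed = False
        for axis, size in ((0, n_rows), (1, n_cols)):
            for i in range(size):
                line = [cell for cell in hidden if cell[axis] == i]
                if len(line) == 1:
                    hidden.remove(line[0])
                    recovered.add(line[0])
                    changed = True
    return recovered


def cheapest_pattern(costs, sensitive):
    """Try every set of secondary suppressions and keep the cheapest safe one."""
    n_rows, n_cols = len(costs), len(costs[0])
    sensitive = set(sensitive)
    others = [(r, c) for r in range(n_rows) for c in range(n_cols)
              if (r, c) not in sensitive]
    best = None
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            pattern = sensitive | set(extra)
            if recoverable((n_rows, n_cols), pattern) & sensitive:
                continue  # a sensitive cell would still be disclosed
            cost = sum(costs[r][c] for r, c in extra)
            if best is None or cost < best[0]:
                best = (cost, set(extra))
    return best


# Hypothetical suppression costs for a 3x3 table; cell (0, 0) is sensitive.
costs = [[5, 1, 9],
         [2, 8, 3],
         [7, 4, 6]]
best_cost, secondary = cheapest_pattern(costs, {(0, 0)})
```

    The search enumerates every subset of non-sensitive cells, so its running time doubles with each added cell. That growth is one way to see why exact approaches do not scale and heuristics are of interest.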

    A posteriori disclosure risk measure for tabular data based on conditional entropy

    Statistical database protection, also known as Statistical Disclosure Control (SDC), is a part of information security which tries to prevent published statistical information (tables, individual records) from disclosing the contribution of specific respondents. This paper deals with the assessment of the disclosure risk associated with the release of tabular data. So-called sensitivity rules are currently being used to measure the disclosure risk for tables. These rules operate on an a priori basis: the data are examined and the rules are used to decide whether the data can be released as they stand or should rather be protected. In this paper, we propose to complement a priori risk assessment with a posteriori risk assessment in order to achieve a higher level of security, that is, we propose to take the protected information into account when measuring the disclosure risk. The proposed a posteriori disclosure risk measure is compatible with a broad class of disclosure protection methods and can be extended for computing disclosure risk for a set of linked tables. In the case of linked table protection via cell suppression, the proposed measure allows detection of secondary suppression patterns which offer more protection than others.
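    The abstract does not give the formula for its risk measure, so the sketch below only shows the standard definition of conditional entropy that the title refers to, H(X|Y) = -Σ p(x, y) log2 p(x|y). Reading X as a confidential cell value and Y as what the released table reveals is an assumption made for illustration, and `conditional_entropy` is a hypothetical helper name.

```python
from collections import defaultdict
from math import log2


def conditional_entropy(joint):
    """H(X | Y) in bits, given a dict mapping (x, y) -> joint probability."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    h = 0.0
    for (_, y), p in joint.items():
        if p > 0:
            h -= p * log2(p / p_y[y])  # p / p_y[y] is p(x | y)
    return h


# The release says nothing about X: four equally likely values remain (2 bits).
uninformative = {(x, "released") for x in range(4)}
uninformative = {key: 0.25 for key in uninformative}

# The release pins X down exactly: no uncertainty remains (0 bits).
revealing = {(0, "a"): 0.5, (1, "b"): 0.5}

h_safe = conditional_entropy(uninformative)
h_risky = conditional_entropy(revealing)
```

    Under this reading, a lower value means the intruder is left with less uncertainty about the confidential value after seeing the protected table.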

    Solving the disclosure auditing problem for secondary cell suppression by means of linear programming

    National Statistical Institutes (NSIs) have the obligation to protect the privacy of individual persons or enterprises against disclosure of potentially sensitive information. For this reason, NSIs protect tabular data against disclosure of sensitive information before they are released. For tabular magnitude data, the starting point of this protection process is usually a sensitivity measure for individual cells. Such a sensitivity measure defines whether a cell value is considered safe for publication. A frequently used method to protect a table with unsafe cells against disclosure of sensitive information is cell suppression. [5] argues that the standard criterion for deciding whether a table after suppression is safe or not is somewhat inconsistent and proposes a new criterion. [5] also gives a mixed-integer programming problem formulation for applying this new criterion. The problem with that formulation is that it is quite large and very hard to solve for even moderately sized tables. To be more precise, that mixed-integer programming problem formulation suggests that the auditing problem based on the criterion of [5] is NP-hard. The general assumption among operations research experts is that the computing time for NP-hard problems is non-polynomial in their input parameters. In the current paper, we propose solving a number of smaller and computationally much easier linear programming problems instead of solving one large mixed-integer programming problem. Solving linear programming problems can be done in time polynomial in their input parameters.
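    The abstract mentions a sensitivity measure for individual cells without naming one. As an illustration only, the sketch below uses the widely used p% rule, which is an assumption here and not necessarily the measure the paper applies: a cell is unsafe when the contributions other than the two largest sum to less than p percent of the largest one, because the second-largest contributor could then estimate the largest contribution too closely. The function name `is_unsafe_p_percent` is hypothetical.

```python
def is_unsafe_p_percent(contributions, p):
    """Apply the p% rule to one cell of a magnitude table.

    contributions: non-negative respondent contributions that sum to the cell value
    p: required protection level, as a percentage of the largest contribution
    """
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False  # an empty cell discloses nothing under this rule
    largest = ordered[0]
    # Everything except the two largest contributors.
    remainder = sum(ordered[2:])
    return remainder < (p / 100.0) * largest


# One dominant respondent: the remainder (3) is below 10% of 100, so unsafe.
dominated = is_unsafe_p_percent([100, 5, 3], p=10)

# Contributions are spread out: the remainder (30) exceeds 10% of 40, so safe.
balanced = is_unsafe_p_percent([40, 30, 20, 10], p=10)
```

    Cells flagged this way are the primary suppressions; the auditing problem discussed in the abstract asks whether the table is still safe once secondary suppressions have been added.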

    Protecting Micro-Data Privacy: The Moment-Based Density Estimation Method and its Application

    Privacy concerns pertaining to the release of confidential micro-level information are increasingly relevant to organisations and institutions. Controlling the dissemination of disclosure-prone micro-data by means of suppression, aggregation and perturbation techniques often entails different levels of effectiveness and drawbacks depending on the context and properties of the data. In this dissertation, we briefly review existing disclosure control methods for microdata and undertake a study demonstrating the applicability of micro-data methods to proportion data. This is achieved by using the sample size efficiency related to a simple hypothesis test for a fixed significance level and power, as a measure of statistical utility. We compare a query-based differential privacy mechanism to the multiplicative noise method for disclosure control and demonstrate that with the correct specification of noise parameters, the multiplicative noise method, which is a micro-data based method, achieves similar disclosure protection properties with reduced statistical efficiency costs
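    The two mechanisms compared in this abstract can be sketched side by side. The noise distributions and parameters below are assumptions chosen for illustration, since the abstract does not specify them: the multiplicative method scales each record by a uniform factor, and the query-based mechanism is the standard Laplace mechanism for a counting query. The names `multiplicative_noise` and `laplace_count` are hypothetical.

```python
import random


def multiplicative_noise(values, spread, rng):
    """Perturb each micro-data record by a factor drawn from U(1 - spread, 1 + spread)."""
    return [v * rng.uniform(1 - spread, 1 + spread) for v in values]


def laplace_count(true_count, epsilon, rng, sensitivity=1.0):
    """Answer a counting query with Laplace noise of scale sensitivity / epsilon.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with location 0 and that scale.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed so the sketch is repeatable
masked_records = multiplicative_noise([100.0, 200.0, 300.0], spread=0.1, rng=rng)
noisy_answer = laplace_count(true_count=50, epsilon=0.5, rng=rng)
```

    The first function changes the released records themselves, while the second leaves the data untouched and perturbs only the answer to a query, which is the structural difference the abstract's comparison rests on.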

    A Heuristic Evolutionary Method for the Complementary Cell Suppression Problem

    Cell suppression is a common method for disclosure avoidance used to protect sensitive information in two-dimensional tables where row and column totals are published along with non-sensitive data. In tables with only positive cell values, cell suppression has been demonstrated to be NP-hard. Therefore, finding more efficient methods for producing low-cost solutions is an area of active research. Genetic algorithms (GA) have been shown to be effective in finding good solutions to the cell suppression problem. However, these methods have the shortcoming that they tend to produce a large proportion of infeasible solutions. The primary goal of this research was to develop a GA that produced low-cost solutions with fewer infeasible solutions created at each generation than previous methods without introducing excessive CPU runtime costs. This research involved developing a GA that produces low-cost solutions with fewer infeasible solutions produced at each generation; and implementing selection and replacement operations that maintained genetic diversity during the evolution process. The GA's performance was tested using tables containing 10,000 and 100,000 cells. The primary criteria for evaluating the effectiveness of the GA were the total cost of the complementary suppressions and the CPU runtime. Experimental results indicate that the GA-based method developed in this dissertation produced better quality solutions than those produced by extant heuristics. Because existing heuristics are very effective, this GA-based method was able to surpass them only modestly. Existing evolutionary methods have also been used to improve upon the quality of solutions produced by heuristics. Experimental results show that the GA-based method developed in this dissertation is computationally more efficient than GA-based methods proposed in the literature. This is attributed to the fact that the specialized genetic operators designed in this study produce fewer infeasible solutions. The results of these experiments suggest the need for continued research into non-probabilistic methods to seed the initial populations; selection and replacement strategies that factor in genetic diversity on the level of the circuits protecting sensitive cells; solution-preserving crossover and mutation operators; and the use of cost-benefit ratios to determine program termination.

    Statistical disclosure control for HESA: Part 1: Review of SDC theory

    This report for the Higher Education Statistics Agency (HESA) is a summary of statistical disclosure control (SDC) methods for tabular outputs