
    A linear optimization based method for data privacy in statistical tabular data

    National Statistical Agencies routinely disseminate large amounts of data. Prior to dissemination these data have to be protected to avoid releasing confidential information. Controlled tabular adjustment (CTA) is one of the available methods for this purpose. CTA formulates an optimization problem that looks for the safe table closest to the original one. The standard CTA approach results in a mixed integer linear optimization (MILO) problem, which is very challenging for current technology. In this work we present a much less costly variant of CTA that formulates a multiobjective linear optimization (LO) problem, where binary variables are pre-fixed, and the resulting continuous problem is solved by lexicographic optimization. Extensive computational results are reported using both commercial (CPLEX and XPRESS) and open source (Clp) solvers, with either simplex or interior-point methods, on a set of real instances. Most instances were successfully solved with the LO-CTA variant in less than one hour, while many of them are computationally very expensive with the MILO-CTA formulation. The interior-point method outperformed simplex in this particular application.
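
    To make the LO-CTA idea concrete, here is a minimal sketch on a toy one-dimensional table, with the protection direction of the single sensitive cell assumed pre-fixed (upward) so that only an LP remains; the data, variable names, and the use of scipy.optimize.linprog are illustrative assumptions, not the authors' implementation.

        # LO-CTA sketch: toy 1-D table with three interior cells and a grand total.
        # The up/down binary choice for the sensitive cell is pre-fixed, so the
        # remaining problem is a plain LP (all data here are illustrative).
        import numpy as np
        from scipy.optimize import linprog

        a = np.array([20.0, 50.0, 30.0, 100.0])  # original cells; last one is the total
        sens, upl = 1, 5.0                       # sensitive cell and its protection level
        n = len(a)

        # Variables x = [d_plus, d_minus]; the adjusted table is z = a + d_plus - d_minus.
        c = np.ones(2 * n)                       # minimize sum(d_plus + d_minus) = ||z - a||_1

        # Additivity: interior adjustments must equal the adjustment of the total.
        A_eq = np.zeros((1, 2 * n))
        A_eq[0, :3], A_eq[0, 3] = 1.0, -1.0           # d_plus part
        A_eq[0, n:n + 3], A_eq[0, n + 3] = -1.0, 1.0  # d_minus part

        # Protection: d_plus[sens] - d_minus[sens] >= upl, written in <= form.
        A_ub = np.zeros((1, 2 * n))
        A_ub[0, sens], A_ub[0, n + sens] = -1.0, 1.0

        res = linprog(c, A_ub=A_ub, b_ub=[-upl], A_eq=A_eq, b_eq=[0.0],
                      bounds=(0, None))
        z = a + res.x[:n] - res.x[n:]
        print(z)  # e.g. [20. 55. 30. 105.]: the sensitive cell moves up by 5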

    Statistical disclosure control in tabular data

    Data disseminated by National Statistical Agencies (NSAs) can be classified as either microdata or tabular data. Tabular data is obtained from microdata by crossing one or more categorical variables. Although tables provide aggregated information, they also need to be protected. This chapter is a short introduction to tabular data protection. It contains three main sections. The first one shows the different types of tables that can be obtained, and how they are modeled. The second describes the practical rules used by NSAs for the detection of sensitive cells. Finally, an overview of protection methods is provided, with a particular focus on two of them: the “cell suppression problem” and “controlled tabular adjustment”.
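
    As an illustration of the sensitivity rules mentioned in the second section, below is a minimal sketch of the widely used p% rule (function name and default threshold are illustrative): a magnitude cell is deemed unsafe when the contributions other than the two largest sum to less than p% of the largest contribution, because the second-largest respondent could then estimate the largest one too precisely.

        # p% rule sketch (illustrative; agencies choose their own p and details).
        def is_sensitive_p_percent(contributions, p=10.0):
            xs = sorted(contributions, reverse=True)
            if len(xs) < 2:
                return True                 # a lone contributor is always disclosive
            remainder = sum(xs[2:])         # total minus the two largest contributions
            return remainder < (p / 100.0) * xs[0]

        print(is_sensitive_p_percent([100.0, 40.0, 5.0]))   # True:  5 < 10% of 100
        print(is_sensitive_p_percent([100.0, 40.0, 30.0]))  # False: 30 >= 10% of 100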

    Stabilized Benders methods for large-scale combinatorial optimization, with application to data privacy

    The Cell Suppression Problem (CSP) is a challenging Mixed-Integer Linear Problem arising in statistical tabular data protection. Medium-sized instances of CSP involve thousands of binary variables and millions of continuous variables and constraints. However, CSP has the typical structure that allows application of the renowned Benders’ decomposition method: once the “complicating” binary variables are fixed, the problem decomposes into a large set of linear subproblems on the “easy” continuous ones. This makes it possible to project away the easy variables, reducing the problem to a master problem in the complicating ones, where the value functions of the subproblems are approximated with the standard cutting-plane approach. Hence, Benders’ decomposition suffers from the same drawbacks as the cutting-plane method, i.e., oscillation and slow convergence, compounded by the fact that the master problem is combinatorial. To overcome these drawbacks we present a stabilized Benders decomposition whose master is restricted to a neighborhood of successful candidates by local branching constraints, which are dynamically adjusted, and even dropped, during the iterations. Our experiments with randomly generated and real-world CSP instances with up to 3600 binary variables, 90M continuous variables and 15M inequality constraints show that our approach is competitive with both the current state-of-the-art (cutting-plane-based) code for cell suppression and the Benders implementation in CPLEX 12.7. On some instances, stabilized Benders was able to provide a very good solution in less than one minute, while the other approaches could not find any feasible solution within one hour.
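
    For orientation, the local branching constraint that defines the stabilizing neighborhood can be written, in standard notation assumed here rather than quoted from the paper, with incumbent binary vector \bar{y} and S = \{ i : \bar{y}_i = 1 \}, as

        \Delta(y, \bar{y}) \;=\; \sum_{i \in S} (1 - y_i) \;+\; \sum_{i \notin S} y_i \;\le\; k,

    so the master problem only considers candidates within Hamming distance k of the incumbent; the stabilization then adjusts k, recenters \bar{y}, or drops the constraint as the iterations proceed.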

    Testing variants of minimum distance controlled tabular adjustment

    Controlled tabular adjustment (CTA) and its minimum distance variants are a recent methodology for the protection of tabular data. Given a table to be protected, the purpose of the method is to find the closest table that guarantees the confidentiality of the sensitive cells. This is achieved by adding slight adjustments to the remaining cells, preferably excluding totals, whose values are preserved. Unlike other approaches, this methodology can efficiently protect large tables of any number of dimensions and structure. In this work, we test some minimum distance variants of CTA on a close-to-real data set and analyze the quality of the solutions provided. As another alternative, we suggest a restricted CTA (RCTA) approach, where adjustments are only allowed in a subset of cells. This subset is computed a priori, for instance by a fast heuristic for the cell suppression problem. We discuss the benefits of RCTA and suggest several approaches for its solution.
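
    In generic notation (introduced here for orientation, not taken from the paper), the minimum distance variants share the skeleton

        \min_z \;\| z - a \|_\ell \quad \text{s.t.} \quad A z = b, \quad l \le z \le u, \quad z_i \le a_i - lpl_i \;\text{ or }\; z_i \ge a_i + upl_i \;\; (i \in \mathcal{S}),

    where a is the original table, Az = b encodes additivity, \mathcal{S} is the set of sensitive cells with lower and upper protection levels lpl_i and upl_i, and \ell \in \{1, 2, \infty\} selects the variant; RCTA additionally fixes z_i = a_i for every cell outside the precomputed adjustable subset.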

    Solving the disclosure auditing problem for secondary cell suppression by means of linear programming

    National Statistical Institutes (NSIs) have the obligation to protect the privacy of individual persons or enterprises against disclosure of potentially sensitive information. For this reason, NSIs protect tabular data against disclosure of sensitive information before they are released. For tabular magnitude data, the starting point of this protection process is usually a sensitivity measure for individual cells. Such a sensitivity measure defines when a cell value is considered safe for publication. An often used method to protect a table with unsafe cells against disclosure of sensitive information is cell suppression. [5] argues that the standard criterion for deciding whether a table after suppression is safe is somewhat inconsistent, and proposes a new criterion together with a mixed-integer programming formulation for applying it. The problem with that formulation is that it is quite large and very hard to solve even for moderately sized tables. More precisely, the formulation suggests that the auditing problem based on the criterion of [5] is NP-hard, and the general assumption among operations research experts is that the computing time for NP-hard problems is non-polynomial in their input parameters. In the current paper, we propose solving a number of smaller and computationally much easier linear programming problems instead of one large mixed-integer programming problem; linear programming problems can be solved in time polynomial in their input parameters.
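
    A minimal sketch of the LP-based auditing idea (toy data; the actual criterion of [5] involves more than these plain feasibility bounds): for each suppressed cell, two LPs yield the tightest interval an attacker could infer from the published table.

        # Audit sketch: bounds an attacker can deduce for a suppressed cell via two
        # LPs. Toy 1-D table x0 + x1 + x2 = 100 with cells 0 and 1 suppressed and
        # cell 2 published as 30 (all data and names are illustrative).
        import numpy as np
        from scipy.optimize import linprog

        A_eq = np.array([[1.0, 1.0, 1.0]])
        b_eq = [100.0]
        bounds = [(0, None), (0, None), (30, 30)]   # the published value is fixed

        c = np.zeros(3); c[0] = 1.0                 # audit suppressed cell x0
        lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun    # minimize x0
        hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun  # maximize x0
        print(lo, hi)  # 0.0 70.0 -- the interval inferable for the suppressed cell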

    Avoiding disclosure of individually identifiable health information: a literature review

    Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss. Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility.

    A systematic overview on methods to protect sensitive data provided for various analyses

    In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison of methods stemming from different fields is often desired. More concretely, it is important to provide guidance for practice, which lacks an overview of suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.

    Optimization Methods for Tabular Data Protection

    In this thesis we consider a minimum distance Controlled Tabular Adjustment (CTA) model for statistical disclosure limitation (control) of tabular data. The goal of the CTA model is to find the closest safe table to some original tabular data set that contains sensitive information. Closeness is usually measured using the l1 or l2 norm, each with its own advantages and disadvantages. Depending on the chosen norm, CTA can be formulated as an optimization problem: Linear Programming (LP) for l1, Quadratic Programming (QP) for l2. In this thesis we present an alternative reformulation of l1-CTA as a Second-Order Cone (SOC) optimization problem. All three models can be solved using appropriate versions of Interior-Point Methods (IPM). The validity of the new approach was tested on randomly generated two-dimensional tabular data sets. It was shown numerically that the SOC formulation compares favorably to the QP and LP formulations.
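
    For orientation (notation assumed here, not quoted from the thesis), the l1 objective min ||z - a||_1 can be lifted with auxiliary variables t_i to min \sum_i t_i subject to the two-dimensional cone constraints

        (t_i, \, z_i - a_i) \in \mathcal{Q}^2, \qquad \text{i.e.} \qquad |z_i - a_i| \le t_i,

    which turns the whole l1-CTA model into a Second-Order Cone program that conic interior-point methods can solve directly.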