665 research outputs found
A linear optimization based method for data privacy in statistical tabular data
National Statistical Agencies routinely disseminate large amounts of data. Prior to dissemination these data have to be protected to avoid releasing confidential information. Controlled tabular adjustment (CTA) is one of the available methods for this purpose. CTA formulates an optimization problem that looks for the safe table which is closest to the original one. The standard CTA approach results in a mixed integer linear optimization (MILO) problem, which is very challenging for current
technology. In this work we present a much less costly variant of CTA that formulates a multiobjective linear optimization (LO) problem, where binary variables are pre-fixed, and the resulting continuous problem is solved by lexicographic optimization. Extensive computational results are reported using both commercial (CPLEX and XPRESS) and open source (Clp) solvers, with either simplex or interior-point methods, on a set of real instances. Most instances were successfully solved with
the LO-CTA variant in less than one hour, while many of them are computationally very expensive with the MILO-CTA formulation. The interior-point method outperformed simplex in this particular application.
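As an illustration of the continuous problem that remains once the binary decisions are pre-fixed, the following sketch solves a tiny L1-distance CTA instance with `scipy.optimize.linprog`. The 2x2 table, the single sensitive cell, and the protection level are invented for the example, not taken from the paper:

```python
import numpy as np
from scipy.optimize import linprog

# Cells of a 2x2 table with marginals, in the order:
# [c11, c12, c21, c22, r1, r2, k1, k2, T]
n = 9

# Additivity: every row, column and grand total must still add up,
# so the vector of cell deviations z must satisfy M z = 0.
M = np.array([
    [1, 1, 0, 0, -1,  0,  0,  0,  0],   # c11 + c12 = r1
    [0, 0, 1, 1,  0, -1,  0,  0,  0],   # c21 + c22 = r2
    [1, 0, 1, 0,  0,  0, -1,  0,  0],   # c11 + c21 = k1
    [0, 1, 0, 1,  0,  0,  0, -1,  0],   # c12 + c22 = k2
    [0, 0, 0, 0,  1,  1,  0,  0, -1],   # r1  + r2  = T
], dtype=float)

# L1 objective: split z = zp - zm with zp, zm >= 0, minimize sum(zp + zm).
c = np.ones(2 * n)
A_eq = np.hstack([M, -M])
b_eq = np.zeros(M.shape[0])

# Sensitive cell c11 (index 0) must move upward by at least 3 units;
# the upward direction plays the role of the pre-fixed binary variable.
lpl = 3.0
A_ub = np.zeros((1, 2 * n))
A_ub[0, 0] = -1.0   # -zp_c11
A_ub[0, n] = 1.0    # +zm_c11   =>  -(zp - zm) <= -lpl, i.e. z_c11 >= lpl
b_ub = np.array([-lpl])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, None), method="highs")
z = res.x[:n] - res.x[n:]
print(res.fun)   # total absolute adjustment over all cells
print(z[0])      # deviation of the sensitive cell
```

In this toy instance the cheapest protection is a +/-3 pattern over the four internal cells, which leaves every marginal unchanged; a real CTA instance simply has many more additivity equations and protection constraints of the same shape.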
Statistical disclosure control in tabular data
Data disseminated by National Statistical Agencies (NSAs) can be classified
as either microdata or tabular data. Tabular data is obtained from microdata by
crossing one or more categorical variables. Although cell tables provide aggregated
information, they also need to be protected. This chapter is a short introduction to
tabular data protection. It contains three main sections. The first one shows the different
types of tables that can be obtained, and how they are modeled. The second
describes the practical rules for detection of sensitive cells that are used by NSAs.
Finally, an overview of protection methods is provided, with a particular focus on
two of them: “cell suppression problem” and “controlled tabular adjustment”.
Stabilized Benders methods for large-scale combinatorial optimization, with application to data privacy
The Cell Suppression Problem (CSP) is a challenging Mixed-Integer Linear Problem arising in statistical tabular data protection. Medium-sized instances of CSP involve thousands of binary variables and millions of continuous variables and constraints. However, CSP has the typical
structure that allows application of the renowned Benders’ decomposition method: once the “complicating” binary variables are fixed, the problem decomposes into a large set of linear subproblems on the “easy” continuous ones. This makes it possible to project away the easy variables, reducing the problem to a master problem in the complicating ones, where the value functions of the subproblems are approximated with the standard cutting-plane approach. Hence, Benders’ decomposition suffers from the same drawbacks as the cutting-plane method, i.e., oscillation and slow convergence, compounded by the fact that the master problem is combinatorial. To overcome this, we present a stabilized Benders decomposition whose master problem is restricted to a neighborhood of successful candidates by local branching constraints, which are dynamically adjusted, and even dropped, during the iterations. Our experiments with randomly generated and real-world CSP instances with up to 3600 binary variables, 90M continuous variables and 15M inequality constraints show that our approach is competitive with both the current state-of-the-art (cutting-plane-based) code for cell suppression and the Benders implementation in CPLEX 12.7. In some instances, stabilized Benders is able to provide a very good solution in less than one minute, while the other approaches were not able to find any feasible solution in one hour.
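The local-branching stabilization described above restricts the master problem to a Hamming ball around the incumbent binary vector, and that ball can be written as a single linear inequality. A minimal sketch (the helper names and the toy incumbent are ours, not from the paper):

```python
def local_branching_row(x_hat, k):
    """Coefficients (a, b) of the linear inequality  a @ x <= b  that keeps
    binary x within Hamming distance k of the incumbent x_hat:
        sum_{i: x_hat_i = 1} (1 - x_i) + sum_{i: x_hat_i = 0} x_i <= k
    Rearranged into linear form: sum_{i in S0} x_i - sum_{i in S1} x_i <= k - |S1|.
    """
    a = [-1.0 if xi == 1 else 1.0 for xi in x_hat]
    b = k - sum(x_hat)
    return a, b

def hamming_ok(x, x_hat, k):
    """Check whether x satisfies the local branching constraint around x_hat."""
    a, b = local_branching_row(x_hat, k)
    return sum(ai * xi for ai, xi in zip(a, x)) <= b

x_hat = [1, 0, 1, 0]
print(hamming_ok([1, 0, 1, 0], x_hat, k=1))  # distance 0 -> True
print(hamming_ok([0, 1, 1, 0], x_hat, k=1))  # distance 2 -> False
```

In the stabilized scheme, a row of this form is added to the combinatorial master problem, and the radius k is enlarged, shrunk, or the row dropped altogether as the iterations progress.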
Testing variants of minimum distance controlled tabular adjustment
Controlled tabular adjustment (CTA), and its minimum distance variants, is a recent methodology for the
protection of tabular data. Given a table to be protected, the purpose of the method is to find the closest one that guarantees
the confidentiality of the sensitive cells. This is achieved by adding slight adjustments to the remaining cells, preferably
excluding total ones, whose values are preserved. Unlike other approaches, this methodology can efficiently protect large
tables of any number of dimensions and structure. In this work, we test some minimum distance variants of CTA on a
close-to-real data set, and analyze the quality of the solutions provided. As another alternative, we suggest a restricted
CTA (RCTA) approach, where adjustments are only allowed in a subset of cells. This subset is a priori computed, for
instance by a fast heuristic for the cell suppression problem. We discuss benefits of RCTA, and suggest several approaches
for its solution.
Solving the disclosure auditing problem for secondary cell suppression by means of linear programming
National Statistical Institutes (NSIs) have the obligation to protect the privacy of individual persons or enterprises against disclosure of potentially sensitive information. For this reason, NSIs protect tabular data against disclosure of sensitive information before they are released. For tabular magnitude data, the starting point of this protection process usually is a sensitivity measure for individual cells. Such a sensitivity measure defines when a cell value is considered safe for publication or not. An often-used method to protect a table with unsafe cells against disclosure of sensitive information is cell suppression. [5] argues that the standard criterion for deciding whether a table after suppression is safe or not is somewhat inconsistent and proposes a new criterion. [5] also gives a mixed-integer programming problem formulation for applying this new criterion. The problem with that formulation is that it is quite large and very hard to solve for even moderately sized tables. To be more precise, that mixed-integer programming problem formulation suggests that the auditing problem based on the criterion of [5] is NP-hard. The general assumption among operations research experts is that the computing time for NP-hard problems is non-polynomial in their input parameters. In the current paper, we propose solving a number of smaller and computationally much easier linear programming problems instead of solving one large mixed-integer programming problem. Solving linear programming problems can be done in time polynomial in their input parameters.
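The per-cell attacker problems behind this idea are plain LPs: for each suppressed cell, minimize and maximize its value over all tables consistent with the published margins. A toy audit for a 2x2 table with all four internal cells suppressed (the data are invented for illustration; this sketches the general idea, not the exact criterion of [5]):

```python
import numpy as np
from scipy.optimize import linprog

# Suppressed internal cells x = [x11, x12, x21, x22] of a 2x2 table;
# published margins: rows (10, 12), columns (9, 13).
A_eq = np.array([
    [1, 1, 0, 0],   # x11 + x12 = 10
    [0, 0, 1, 1],   # x21 + x22 = 12
    [1, 0, 1, 0],   # x11 + x21 = 9
], dtype=float)     # the fourth margin is implied by the other three
b_eq = np.array([10.0, 12.0, 9.0])

def feasibility_interval(i):
    """Attacker's tightest bounds on suppressed cell i, via two LPs."""
    c = np.zeros(4)
    c[i] = 1.0
    lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return lo.fun, -hi.fun

lo, hi = feasibility_interval(0)
print(lo, hi)   # range of values for x11 consistent with the published margins
```

If the resulting interval is narrower than the protection required for the cell, the suppression pattern fails the audit; running one such pair of LPs per sensitive cell is what keeps the overall audit polynomial.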
Avoiding disclosure of individually identifiable health information: a literature review
Achieving data and information dissemination without harming anyone is a central task of any entity in charge of collecting data. In this article, the authors examine the literature on data and statistical confidentiality. Rather than comparing the theoretical properties of specific methods, they emphasize the main themes that emerge from the ongoing discussion among scientists regarding how best to achieve the appropriate balance between data protection, data utility, and data dissemination. They cover the literature on de-identification and reidentification methods with emphasis on health care data. The authors also discuss the benefits and limitations of the most common access methods. Although there is abundant theoretical and empirical research, their review reveals a lack of consensus on fundamental questions for empirical practice: how to assess disclosure risk, how to choose among disclosure methods, how to assess reidentification risk, and how to measure utility loss.
Keywords: public use files, disclosure avoidance, reidentification, de-identification, data utility
A systematic overview on methods to protect sensitive data provided for various analyses
In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison between various methods stemming from different fields is often desired. More concretely, it is important to provide guidance for practice, which lacks an overview of suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k-anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding practice, this categorization of concepts and methods is intended as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
Optimization Methods for Tabular Data Protection
In this thesis we consider a minimum distance Controlled Tabular Adjustment (CTA) model for statistical disclosure limitation (control) of tabular data. The goal of the CTA model is to find the closest safe table to some original tabular data set that contains sensitive information. Closeness is usually measured using the l1 or l2 norm, with each measure having its advantages and disadvantages. Depending on the chosen norm, CTA can be formulated as an optimization problem: Linear Programming (LP) for l1, Quadratic Programming (QP) for l2. In this thesis we present an alternative reformulation of l1-CTA as a Second-Order Cone (SOC) optimization problem. All three models can be solved using appropriate versions of Interior-Point Methods (IPM). The validity of the new approach was tested on randomly generated two-dimensional tabular data sets. It was shown numerically that the SOC formulation compares favorably to the QP and LP formulations.
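The contrast between the l1 (LP) and l2 (QP) formulations can be seen on a one-equation toy instance (our own, not from the thesis; we use scipy rather than a conic solver, so the SOC reformulation itself is not reproduced here): cells a + b = r, with sensitive cell a required to move up by at least 2.

```python
import numpy as np
from scipy.optimize import linprog, minimize

# Deviations z = (z_a, z_b, z_r) must keep additivity: z_a + z_b - z_r = 0,
# and the sensitive cell must move by at least 2: z_a >= 2.

# l1-CTA as an LP: split z = zp - zm, minimize sum(zp + zm).
M = np.array([[1.0, 1.0, -1.0]])
A_eq = np.hstack([M, -M])
A_ub = np.zeros((1, 6)); A_ub[0, 0] = -1.0; A_ub[0, 3] = 1.0
l1 = linprog(np.ones(6), A_ub=A_ub, b_ub=[-2.0],
             A_eq=A_eq, b_eq=[0.0], bounds=(0, None), method="highs")

# l2-CTA as a QP, solved here with a general constrained NLP method for brevity.
l2 = minimize(lambda z: np.sum(z ** 2), x0=np.zeros(3),
              constraints=[{"type": "eq",   "fun": lambda z: z[0] + z[1] - z[2]},
                           {"type": "ineq", "fun": lambda z: z[0] - 2.0}])

print(l1.fun)   # minimum total absolute adjustment (l1)
print(l2.fun)   # minimum squared adjustment (l2)
```

The two norms behave differently even here: the l1 solution concentrates the compensation in a single cell, while the l2 solution spreads it evenly over the non-sensitive cells, which is one reason each measure has its advantages and disadvantages.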