Rounding methods for protecting EU-aggregates
In the European Statistical System, statistical information is collected by the National
Statistical Institutes (NSIs). The NSIs produce aggregate tables at the national level. They are also
responsible for the proper protection of these tables and hence have to keep certain cells confidential,
suppressing them from publications. Eurostat produces statistical information at the EU level. However,
the national suppressions severely hamper the publication of EU aggregates, although it is often only a
few smaller countries that have to keep their contribution to the EU total confidential.
This paper reports on a research project that aims to make more EU aggregates available while at the
same time guaranteeing that the nationally suppressed figures remain confidential.
A French Anonymization Experiment with Health Data
In this paper, a case study of a microdata anonymization test is presented. The work considers a French administrative health dataset with indirect identifiers and sensitive variables about hospital stays. Two approaches to building a k-anonymized file are described, and the software tools used in the test are compared.
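The k-anonymity property at the heart of the test can be checked mechanically: every combination of indirect (quasi-) identifier values must occur in at least k records. A minimal sketch in Python; the field names and records are illustrative, not taken from the French dataset:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True iff every quasi-identifier combination occurs in >= k records."""
    classes = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(classes.values()) >= k

# Illustrative hospital-stay records (invented, not the actual health data)
stays = [
    {"age_band": "30-39", "zip3": "750", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "750", "diagnosis": "B"},
    {"age_band": "30-39", "zip3": "750", "diagnosis": "A"},
    {"age_band": "40-49", "zip3": "130", "diagnosis": "C"},
]
print(is_k_anonymous(stays, ["age_band", "zip3"], k=3))  # False: last class has 1 record
```

Anonymization then consists of coarsening the quasi-identifiers (recoding, suppression) until this check passes.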
A linear optimization based method for data privacy in statistical tabular data
National Statistical Agencies routinely disseminate large amounts of data. Prior to dissemination, these data have to be protected to avoid releasing confidential information. Controlled tabular adjustment (CTA) is one of the available methods for this purpose. CTA formulates an optimization problem that looks for the safe table closest to the original one. The standard CTA approach results in a mixed integer linear optimization (MILO) problem, which is very challenging for current
technology. In this work we present a much less costly variant of CTA that formulates a multiobjective linear optimization (LO) problem, where binary variables are pre-fixed and the resulting continuous problem is solved by lexicographic optimization. Extensive computational results are reported using both commercial (CPLEX and XPRESS) and open-source (Clp) solvers, with either simplex or interior-point methods, on a set of real instances. Most instances were successfully solved with
the LO-CTA variant in less than one hour, while many of them are computationally very expensive with the MILO-CTA formulation. The interior-point method outperformed simplex in this particular application.
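The idea of an LO variant of CTA with pre-fixed binary decisions can be illustrated on a toy table. The sketch below is a single-objective L1 simplification, not the paper's lexicographic multiobjective formulation, and all numbers are invented: a 2x2 table with margins is adjusted so that one sensitive cell moves in a pre-fixed (upward) direction by at least a protection level p, while additivity is preserved and the total adjustment is minimized.

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2x2 table with margins; cell order: x11, x12, x21, x22, r1, r2, c1, c2, total
a = np.array([20, 30, 15, 25, 50, 40, 35, 55, 90], dtype=float)

# Additivity relations M @ cells = 0 (rows, columns, grand total)
M = np.array([
    [1, 1, 0, 0, -1,  0,  0,  0,  0],   # x11 + x12 = r1
    [0, 0, 1, 1,  0, -1,  0,  0,  0],   # x21 + x22 = r2
    [1, 0, 1, 0,  0,  0, -1,  0,  0],   # x11 + x21 = c1
    [0, 1, 0, 1,  0,  0,  0, -1,  0],   # x12 + x22 = c2
    [0, 0, 0, 0,  1,  1,  0,  0, -1],   # r1 + r2  = total
], dtype=float)

n = a.size
# Variables: upward deviations d_plus and downward deviations d_minus per cell
A_eq = np.hstack([M, -M])           # M @ (d_plus - d_minus) = 0 keeps the table additive
b_eq = np.zeros(M.shape[0])
c = np.ones(2 * n)                  # minimize total absolute adjustment (L1 distance)

p = 5.0                             # protection level for the sensitive cell x11
bounds = [(0, None)] * (2 * n)
bounds[0] = (p, None)               # pre-fixed "upward" protection: d_plus[x11] >= p
bounds[n] = (0, 0)                  # ... and no downward move for x11

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
adjusted = a + res.x[:n] - res.x[n:]
```

With the direction fixed, the problem stays a continuous LP; the MILO difficulty in full CTA comes precisely from choosing those directions with binary variables.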
Statistical disclosure control in tabular data
Data disseminated by National Statistical Agencies (NSAs) can be classified
as either microdata or tabular data. Tabular data is obtained from microdata by
crossing one or more categorical variables. Although tables provide aggregated
information, they also need to be protected. This chapter is a short introduction to
tabular data protection. It contains three main sections. The first one shows the different
types of tables that can be obtained, and how they are modeled. The second
describes the practical rules for detection of sensitive cells that are used by NSAs.
Finally, an overview of protection methods is provided, with a particular focus on
two of them: the “cell suppression problem” and “controlled tabular adjustment”.
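The practical rules for detecting sensitive cells mentioned in the second section can be sketched in a few lines. Below are hedged implementations of two commonly used rules, the minimum-frequency (threshold) rule and the p% dominance rule; the default parameter values are illustrative, not NSA-mandated ones.

```python
def threshold_sensitive(contributions, n_min=3):
    """Minimum-frequency rule: a cell with fewer than n_min contributors is unsafe."""
    return len(contributions) < n_min

def p_percent_sensitive(contributions, p=10.0):
    """p% rule: the cell is unsafe if the largest contributor could estimate the
    second-largest to within p%, i.e. the sum of all remaining contributions is
    less than p% of the largest contribution."""
    c = sorted(contributions, reverse=True)
    if len(c) < 2:
        return True                      # a single contributor is always unsafe
    return sum(c) - c[0] - c[1] < (p / 100.0) * c[0]

print(p_percent_sensitive([100, 20, 5]))   # True: remainder 5 < 10% of 100
print(p_percent_sensitive([50, 40, 30]))   # False: remainder 30 >= 10% of 50
```

Cells flagged by rules like these are then treated by a protection method such as cell suppression or controlled tabular adjustment.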
An ethical framework for sharing patient data without consent
Background There is no consensus on how to share patient records privately. Data privacy concepts are surveyed and a framework is presented for the safe sharing of sensitive data. It is argued that tailoring the data sharing to the privacy breach risks of each project holds out the best compromise for keeping the trust of the public and providing for the best quality data where detailed patient consent is not possible.
Objective To improve the protection of data by reducing privacy breaches and thus enable appropriate patient data sharing without consent.
Framework Any harm arising from data sharing must come from the data being identified, either fully or partially. The first step is an agreement on an acceptable privacy breach risk. Next, proceed to measure that risk for the proposed data when held by a given recipient. Finally, select from a menu of mitigation strategies (people, process and technical) to achieve acceptable risk. The framework is tested against the current UK approach administered by the Patient Information Advisory Group.
Discussion The hard problem of non-consented data sharing should be divided into the easier (though non-trivial) ones of data and recipient breach-risk measurement. Directed research in these two areas will help move the data sharing problem into the 'solved' pile.
On Utilizing Association and Interaction Concepts for Enhancing Microaggregation in Secure Statistical Databases
This paper presents a possibly pioneering endeavor to tackle the microaggregation techniques (MATs) in secure statistical databases by resorting to the principles of associative neural networks (NNs). The prior art has improved the available solutions to the MAT by incorporating proximity information: it recursively reduces the size of the data set by excluding points that are farthest from the centroid and points that are closest to these farthest points. Thus, although the method is extremely effective, arguably, it uses only the proximity information while ignoring the mutual interaction between the records. In this paper, we argue that interrecord relationships can be quantified in terms of the following two entities: 1) their “association” and 2) their “interaction.” This means that records that are not necessarily close to each other may still be “grouped,” because their mutual interaction, which is quantified by invoking transitive-closure-like operations on the latter entity, could be significant, as suggested by the theoretically sound principles of NNs. By repeatedly invoking the interrecord associations and interactions, the records are grouped into sizes of cardinality “k,” where k is the security parameter in the algorithm. Our experimental results, obtained on artificial data and benchmark real-life data sets, demonstrate that the newly proposed method is superior to the state of the art not only from the information loss (IL) perspective but also with respect to a criterion that involves a combination of the IL and the disclosure risk (DR).
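For context, the proximity-only baseline that the paper improves on can be sketched as plain microaggregation: partition the records into groups of at least k and replace each record by its group centroid. The sketch below is a naive fixed-size variant ordered by a single attribute, not the association/interaction NN method proposed in the paper; the data and k are illustrative.

```python
import numpy as np

def microaggregate(X, k):
    """Naive microaggregation: sort records by their first attribute, partition
    into consecutive groups of size k (folding a short tail into the last full
    group so every group has >= k records), and replace each record by its
    group centroid."""
    order = np.argsort(X[:, 0], kind="stable")
    blocks = [order[i:i + k] for i in range(0, len(order), k)]
    if len(blocks) > 1 and len(blocks[-1]) < k:
        blocks[-2] = np.concatenate([blocks[-2], blocks[-1]])
        blocks.pop()
    out = X.astype(float).copy()
    for idx in blocks:
        out[idx] = X[idx].mean(axis=0)
    return out

# Illustrative two-attribute records
X = np.array([[1, 10], [2, 11], [9, 30], [10, 31], [11, 29], [3, 12], [12, 28]], float)
Y = microaggregate(X, k=3)
```

Because every released record is a group centroid shared by at least k records, the output satisfies k-anonymity on those attributes; the information loss is the within-group variance destroyed by the averaging.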
Identification and protection of tabular data: the case of the INSS pension statistics
Departament d'Econometria, Estadística i Economia Espanyola (Universitat de Barcelona). Application of statistical disclosure control to the contributory pension data of the Spanish Social Security. The cell suppression problem in tabular data is solved with the statistical software R through the sdcTable package. The goal is to generate automatic procedures and statistical disclosure control criteria that effectively protect tabular data against a systematic exploitation of the information contained in the Social Security registers. Using the R package, routines would be created for the identification of unsafe cells and their corresponding protection (via recoding and other techniques), and the computations would be integrated into the SPSS statistical package by means of the R Integration Package for IBM Statistics.
Exact and heuristic methods for statistical tabular data protection
One of the main purposes of National Statistical Agencies (NSAs) is to provide citizens and researchers with a large amount of trustworthy, high-quality statistical information. NSAs must guarantee that no confidential individual information can be obtained from the released statistical outputs. The discipline of statistical disclosure control (SDC) aims to prevent confidential information from being derived from released data while, at the same time, maintaining as much of the data utility as possible. NSAs work with two types of data: microdata and tabular data. Microdata files contain records of individuals or respondents (persons or enterprises) with attributes. For instance, a national census might collect attributes such as age, address, salary, etc. Tabular data contain aggregated information obtained by crossing one or more categorical variables from those microdata files. Several SDC methods are available to ensure that no confidential individual information can be obtained from the released microdata or tabular data. This thesis focuses on tabular data protection, although the research carried out can be applied to other classes of problems. Controlled Tabular Adjustment (CTA) and the Cell Suppression Problem (CSP) have concentrated most of the recent research in the tabular data protection field. Both methods formulate Mixed Integer Linear Programming (MILP) problems which are challenging for tables of moderate size. Even finding a feasible initial solution may be a challenging task for large instances. Because many end users give priority to fast executions and are thus satisfied, in practice, with suboptimal solutions, as a first result of this thesis we present an improvement of a known and successful heuristic for finding feasible solutions of MILPs, called the feasibility pump.
The new approach, based on the computation of analytic centers, is named the Analytic Center Feasibility Pump. The second contribution consists of the application of the fix-and-relax heuristic (FR) to the CTA method. FR (alone or in combination with other heuristics) is shown to be competitive with CPLEX branch-and-cut in terms of quickly finding either a feasible solution or a good upper bound. The last contribution of this thesis deals with general Benders decomposition, which is improved with the application of stabilization techniques. A stabilized Benders decomposition is presented, which focuses on finding new solutions in the neighborhood of "good" points. This approach is efficiently applied to the solution of realistic and real-world CSP instances, outperforming alternative approaches. The first two contributions are already published in indexed journals (Operations Research Letters and Computers and Operations Research). The third contribution is a working paper to be submitted soon.
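The basic feasibility pump that the thesis builds on alternates between solving the LP relaxation and rounding. The sketch below shows the textbook pump for a pure binary feasibility problem, not the analytic-center variant contributed by the thesis; the instance is a toy and the anti-cycling perturbation is the simplest possible one.

```python
import numpy as np
from scipy.optimize import linprog

def feasibility_pump(A_ub, b_ub, n, max_iter=50, seed=0):
    """Basic feasibility pump for A_ub @ x <= b_ub, x in {0,1}^n."""
    rng = np.random.default_rng(seed)
    bounds = [(0, 1)] * n
    # Start from any point of the LP relaxation
    x = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=bounds).x
    for _ in range(max_iter):
        x_int = np.round(x)
        if np.all(A_ub @ x_int <= b_ub + 1e-9):
            return x_int                         # integer-feasible point found
        # LP step: minimize the L1 distance to the rounded point over the relaxation;
        # for binaries, dist = sum_{x_int_j=0} x_j + sum_{x_int_j=1} (1 - x_j)
        c = np.where(x_int == 0, 1.0, -1.0)
        x_new = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x
        if np.allclose(np.round(x_new), x_int):  # cycling: flip a random coordinate
            j = int(rng.integers(n))
            x_new = x_new.copy()
            x_new[j] = 1.0 - x_int[j]
        x = x_new
    return None

# Toy instance: exactly one of x1, x2 must equal 1 (x1 + x2 = 1 as two inequalities)
A = np.array([[1.0, 1.0], [-1.0, -1.0]])
b = np.array([1.0, -1.0])
sol = feasibility_pump(A, b, n=2)
```

The two improvements the thesis describes target exactly the weak spots visible here: where the LP iterates are taken from (analytic centers instead of optimal vertices) and how the subproblems are solved at scale.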
Norby Collection Databases, Brookings Businesses Listed by Avenue Address
The Databases sub-group is composed of material compiled by George Norby. This material covers topics related to Brookings (S.D.) and includes businesses, historic homes, churches, city and county government, and South Dakota State University. As noted by George Norby within the collection, the information compiled in these databases is as accurate as possible and was gathered from the following sources: Brookings County Press, Brookings Register, Brookings County Sentinel, Brookings telephone directories and business directories, Brookings City publications, Brookings County election returns, Brookings County Commission minutes, and records in the Brookings County Register of Deeds office. While this material is quite extensive, it is recommended that researchers verify information from more than one source in order to conduct an accurate search.