Cell suppression problem: A genetic-based approach

Author: Almeida Maria Teresa
Carvalho Filipa D.
Schütz G.
Publication venue: Elsevier
Publication date: 01/01/2008

Cell suppression is one of the most frequently used techniques to prevent the disclosure of sensitive data in statistical tables. Finding the minimum-cost set of nonsensitive entries to suppress, along with the sensitive ones, in order to make a table safe for publication is an NP-hard problem, known as the cell suppression problem (CSP). In this paper, we present GenSup, a new heuristic for the CSP, which combines the general features of genetic algorithms with safety conditions derived by several authors. The safety conditions are used to develop fast procedures to generate multiple initial solutions and also to recombine, perturb and repair solutions in order to improve their quality. The results obtained for 300 tables, the largest with more than 90,000 entries, show that GenSup is very effective at finding low-cost sets of complementary suppressions to protect confidential data in two-dimensional tables.
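
To illustrate why complementary suppressions are needed at all, here is a minimal Python sketch. It is not GenSup and not one of the safety conditions the abstract refers to: it only checks a simple necessary condition, namely that no row or column with a published total contains exactly one suppressed cell. The function names `fill_lone_gap` and `lines_with_lone_suppression` are hypothetical and chosen for this example.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row index, column index)


def fill_lone_gap(row: List[Optional[int]], total: int) -> List[Optional[int]]:
    """Recover a suppressed value (None) when it is the only gap in the row.

    With the row total published, a single suppressed cell equals
    total minus the sum of the visible cells.
    """
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        return list(row)  # zero or several gaps: nothing to recover this way
    known = sum(v for v in row if v is not None)
    filled = list(row)
    filled[missing[0]] = total - known
    return filled


def lines_with_lone_suppression(
    suppressed: Set[Cell], n_rows: int, n_cols: int
) -> Tuple[List[int], List[int]]:
    """Return the rows and columns that contain exactly one suppressed cell.

    An empty result is only a necessary condition for safety (an assumption
    of this sketch); real methods also verify protection ranges.
    """
    rows = [r for r in range(n_rows)
            if sum((r, c) in suppressed for c in range(n_cols)) == 1]
    cols = [c for c in range(n_cols)
            if sum((r, c) in suppressed for r in range(n_rows)) == 1]
    return rows, cols
```

For example, `fill_lone_gap([10, None, 30], 60)` recovers the hidden value 20, whereas suppressing the four cells of a 2×2 block leaves no row or column with a lone gap.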

UTL Repository

A genetic approach to statistical disclosure control

Author: Clark Alistair
Serpell Martin
Smith Jim
Staggemeier Andrea T.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012

Statistical disclosure control is the collective name for a range of tools used by data providers such as government departments to protect the confidentiality of individuals or organizations. When the published tables contain magnitude data such as turnover or health statistics, the preferred method is to suppress the values of certain cells. Assigning a cost to the information lost by suppressing any given cell creates the cell suppression problem. This consists of finding the minimum cost solution which meets the confidentiality constraints. Solving this problem simultaneously for all of the sensitive cells in a table is NP-hard and not possible for medium to large sized tables. In this paper, we describe the development of a heuristic tool for this problem which hybridizes linear programming (to solve a relaxed version for a single sensitive cell) with a genetic algorithm (to seek an order for considering the sensitive cells which minimizes the final cost). Considering a range of real-world and representative artificial datasets, we show that the method is able to provide relatively low cost solutions for far larger tables than is possible for the optimal approach to tackle. We show that our genetic approach is able to significantly improve on the initial solutions provided by existing heuristics for cell ordering, and outperforms local search. This approach is then extended and applied to large statistical tables with over 200000 cells. © 2012 IEEE

Crossref

UWE Bristol Research Repository

Contribution to privacy-enhancing technologies for machine learning applications

Author: Rodríguez Hoyos Ana Fernanda
Publication venue: Universitat Politècnica de Catalunya
Publication date: 21/10/2020

For some time now, big data applications have been enabling revolutionary innovation in every aspect of our daily life by taking advantage of the large amounts of data generated from the interactions of users with technology. Supported by machine learning and unprecedented computation capabilities, different entities are capable of efficiently exploiting such data to obtain significant utility. However, since personal information is involved, these practices raise serious privacy concerns. Although multiple privacy protection mechanisms have been proposed, some challenges need to be addressed for these mechanisms to be adopted in practice, i.e., to be "usable" beyond the privacy guarantee offered. To start, the real impact of privacy protection mechanisms on data utility is not clear, so an empirical evaluation of such impact is crucial. Moreover, since privacy is commonly obtained through the perturbation of large data sets, usable privacy technologies may require not only preservation of data utility but also algorithms that are efficient in terms of computation speed. Satisfying both requirements is key to encouraging the adoption of privacy initiatives. Although considerable effort has been devoted to designing less "destructive" privacy mechanisms, the utility metrics employed may not be appropriate, so the quality of such mechanisms may be incorrectly measured. On the other hand, despite the advent of big data, more efficient approaches are not being considered. Not complying with the requirements of current applications may hinder the adoption of privacy technologies. In the first part of this thesis, we address the problem of measuring the effect of k-anonymous microaggregation on the empirical utility of microdata. We quantify utility as the accuracy of classification models learned from microaggregated data and evaluated over original test data. Our experiments show that the impact of the de facto microaggregation standard on the performance of machine-learning algorithms is often minor for a variety of data sets. Furthermore, experimental evidence suggests that the traditional measure of distortion in the community of microdata anonymization may be inappropriate for evaluating the utility of microaggregated data. Secondly, we address the problem of preserving the empirical utility of data. By transforming the original data records to a different data space, our approach, based on linear discriminant analysis, enables k-anonymous microaggregation to be adapted to the application domain of the data. To do this, data is first rotated (projected) towards the direction of maximum discrimination and then scaled in this direction, penalizing distortion across the classification threshold. As a result, data utility is preserved in terms of the accuracy of machine-learned models for a number of standardized data sets. Afterwards, we propose a mechanism to reduce the running time of the k-anonymous microaggregation algorithm. This is obtained by simplifying the internal operations of the original algorithm. Through extensive experimentation over multiple data sets, we show that the new algorithm is significantly faster. Interestingly, this remarkable speedup is achieved with no additional loss of data utility. Finally, in a more applied approach, we propose a tool to protect the privacy of individuals and organizations by anonymizing sensitive data contained in security logs. Different anonymization mechanisms are designed to be implemented based on the definition of a privacy policy, in the context of a European project that aims to build a unified security system.
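
As a rough illustration of what k-anonymous microaggregation does, the following Python sketch groups the sorted values of a single numeric attribute into blocks of at least k records and replaces each value with its block mean. This is a simplified fixed-size univariate scheme assumed for illustration; it is not the multivariate algorithm evaluated in the thesis, and the function name `microaggregate` is hypothetical.

```python
from typing import List


def microaggregate(values: List[float], k: int) -> List[float]:
    """Replace each value by the mean of its group of at least k neighbours.

    Groups are formed over the sorted values; a tail of fewer than k
    records is merged into the last full group, so every group has
    between k and 2k - 1 members.
    """
    n = len(values)
    if k < 1:
        raise ValueError("k must be at least 1")
    if n < k:
        raise ValueError("need at least k records")

    # Indices of the records in increasing order of value.
    order = sorted(range(n), key=lambda i: values[i])
    result = [0.0] * n

    start = 0
    while start < n:
        end = start + k
        # If fewer than k records would be left over, absorb them here.
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result
```

With `k = 3`, the input `[1, 2, 3, 10, 11, 12, 13]` becomes three copies of 2.0 followed by four copies of 11.5, so every released value is shared by at least three records. A classifier trained on such output and tested on the original data gives the kind of empirical utility measure the abstract describes.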

Tesis Doctorals en Xarxa

UPCommons. Portal del coneixement obert de la UPC

Data Mining: The Next Generation

Author: Agrawal Rakesh
Bollinger Toni
Clifton Christopher W.
Dzeroski Saso
Freytag Johann-Christoph
Hipp Jochen
Keim Daniel
Kramer Stefan
Kriegel Hans-Peter
Leser Ulf
Liu Bing
Mannila Heikki
Meo Rosa
Morishita Shinichi
Ng Raymond
Pei Jian
Raghavan Prabhakar
Ramakrishnan Raghu
Spiliopoulou Myra
Srivastava Jaideep
Torra Vicenc
Publication venue: Dagstuhl Seminar Proceedings. 04292 - Perspectives Workshop: Data Mining: The Next Generation
Publication date: 01/01/2005

Dagstuhl Research Online Publication Server

Combinatorial Optimization

Author
Publication venue: Oberwolfach-Walke : Mathematisches Forschungsinstitut Oberwolfach
Publication date: 01/01/1999

[no abstract available]

Repositorium für Naturwissenschaften und Technik

Exact and heuristic methods for statistical tabular data protection

Author: Baena Mirabete Daniel
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2017

One of the main purposes of National Statistical Agencies (NSAs) is to provide citizens and researchers with a large amount of trustworthy, high-quality statistical information. NSAs must guarantee that no confidential individual information can be obtained from the released statistical outputs. The discipline of statistical disclosure control (SDC) aims to prevent confidential information from being derived from released data while preserving as much data utility as possible. NSAs work with two types of data: microdata and tabular data. Microdata files contain records of individuals or respondents (persons or enterprises) with attributes. For instance, a national census might collect attributes such as age, address, salary, etc. Tabular data contains aggregated information obtained by crossing one or more categorical variables from those microdata files. Several SDC methods are available to ensure that no confidential individual information can be obtained from the released microdata or tabular data. This thesis focuses on tabular data protection, although the research carried out can be applied to other classes of problems. Controlled Tabular Adjustment (CTA) and the Cell Suppression Problem (CSP) have been the focus of most recent research in the tabular data protection field. Both methods formulate Mixed Integer Linear Programming problems (MILPs), which are challenging even for tables of moderate size. Even finding a feasible initial solution may be a challenging task for large instances. Because many end users give priority to fast executions and are thus satisfied, in practice, with suboptimal solutions, the first result of this thesis is an improvement of a known and successful heuristic for finding feasible solutions of MILPs, called the feasibility pump. The new approach, based on the computation of analytic centers, is named the Analytic Center Feasibility Pump. The second contribution consists of the application of the fix-and-relax heuristic (FR) to the CTA method. FR (alone or in combination with other heuristics) is shown to be competitive with CPLEX branch-and-cut in terms of quickly finding either a feasible solution or a good upper bound. The last contribution of this thesis deals with general Benders decomposition, which is improved with the application of stabilization techniques. A stabilized Benders decomposition is presented, which focuses on finding new solutions in the neighborhood of "good" points. This approach is efficiently applied to the solution of realistic and real-world CSP instances, outperforming alternative approaches. The first two contributions are already published in indexed journals (Operations Research Letters and Computers and Operations Research). The third contribution is a working paper to be submitted soon.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Data Mining

Author
Publication venue: 'IntechOpen'
Publication date: 27/07/2022

The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computers, and even business analytics. This book presents basic concepts, ideas, and research in data mining.

Directory of Open Access Books (DOAB)


Privacy-aware publication and utilization of healthcare data

Author: Park Yubin
Publication venue
Publication date: 28/10/2014

Open access to health data can bring enormous social and economic benefits. However, such access can also lead to privacy breaches, which may result in discrimination in insurance and employment markets. Privacy is a subjective and contextual concept, so it should be interpreted from both systemic and information perspectives to clearly understand potential breaches and consequences. This dissertation investigates three popular use cases of healthcare data: 1) synthetic data publication, 2) aggregate data utilization, and 3) privacy-aware API implementation. For each case, we develop statistical models that improve the privacy-utility Pareto frontier by leveraging a variety of machine learning techniques such as information-theoretic privacy measures, Bayesian graphical models, non-parametric modeling, and low-rank factorization techniques. The dissertation shows that much utility can be extracted from health records while maintaining strong privacy guarantees and protection of sensitive health information.

Texas ScholarWorks