43 research outputs found

    p-probabilistic k-anonymous microaggregation for the anonymization of surveys with uncertain participation

    Get PDF
    We develop a probabilistic variant of k-anonymous microaggregation which we term p-probabilistic resorting to a statistical model of respondent participation in order to aggregate quasi-identifiers in such a manner that k-anonymity is concordantly enforced with a parametric probabilistic guarantee. Succinctly owing the possibility that some respondents may not finally participate, sufficiently larger cells are created striving to satisfy k-anonymity with probability at least p. The microaggregation function is designed before the respondents submit their confidential data. More precisely, a specification of the function is sent to them which they may verify and apply to their quasi-identifying demographic variables prior to submitting the microaggregated data along with the confidential attributes to an authorized repository. We propose a number of metrics to assess the performance of our probabilistic approach in terms of anonymity and distortion which we proceed to investigate theoretically in depth and empirically with synthetic and standardized data. We stress that in addition to constituting a functional extension of traditional microaggregation, thereby broadening its applicability to the anonymization of statistical databases in a wide variety of contexts, the relaxation of trust assumptions is arguably expected to have a considerable impact on user acceptance and ultimately on data utility through mere availability.Peer ReviewedPostprint (author's final draft

    Contribution to privacy-enhancing tecnologies for machine learning applications

    Get PDF
    For some time now, big data applications have been enabling revolutionary innovation in every aspect of our daily life by taking advantage of lots of data generated from the interactions of users with technology. Supported by machine learning and unprecedented computation capabilities, different entities are capable of efficiently exploiting such data to obtain significant utility. However, since personal information is involved, these practices raise serious privacy concerns. Although multiple privacy protection mechanisms have been proposed, there are some challenges that need to be addressed for these mechanisms to be adopted in practice, i.e., to be “usable” beyond the privacy guarantee offered. To start, the real impact of privacy protection mechanisms on data utility is not clear, thus an empirical evaluation of such impact is crucial. Moreover, since privacy is commonly obtained through the perturbation of large data sets, usable privacy technologies may require not only preservation of data utility but also efficient algorithms in terms of computation speed. Satisfying both requirements is key to encourage the adoption of privacy initiatives. Although considerable effort has been devoted to design less “destructive” privacy mechanisms, the utility metrics employed may not be appropriate, thus the wellness of such mechanisms would be incorrectly measured. On the other hand, despite the advent of big data, more efficient approaches are not being considered. Not complying with the requirements of current applications may hinder the adoption of privacy technologies. In the first part of this thesis, we address the problem of measuring the effect of k-anonymous microaggregation on the empirical utility of microdata. We quantify utility accordingly as the accuracy of classification models learned from microaggregated data, evaluated over original test data. Our experiments show that the impact of the de facto microaggregation standard on the performance of machine-learning algorithms is often minor for a variety of data sets. Furthermore, experimental evidence suggests that the traditional measure of distortion in the community of microdata anonymization may be inappropriate for evaluating the utility of microaggregated data. Secondly, we address the problem of preserving the empirical utility of data. By transforming the original data records to a different data space, our approach, based on linear discriminant analysis, enables k-anonymous microaggregation to be adapted to the application domain of data. To do this, first, data is rotated (projected) towards the direction of maximum discrimination and, second, scaled in this direction, penalizing distortion across the classification threshold. As a result, data utility is preserved in terms of the accuracy of machine learned models for a number of standardized data sets. Afterwards, we propose a mechanism to reduce the running time for the k-anonymous microaggregation algorithm. This is obtained by simplifying the internal operations of the original algorithm. Through extensive experimentation over multiple data sets, we show that the new algorithm gets significantly faster. Interestingly, this remarkable speedup factor is achieved with no additional loss of data utility.Les aplicacions de big data impulsen actualment una accelerada innovació aprofitant la gran quantitat d’informació generada a partir de les interaccions dels usuaris amb la tecnologia. Així, qualsevol entitat és capaç d'explotar eficientment les dades per obtenir utilitat, emprant aprenentatge automàtic i capacitats de còmput sense precedents. No obstant això, sorgeixen en aquest escenari serioses preocupacions pel que fa a la privacitat dels usuaris ja que hi ha informació personal involucrada. Tot i que s'han proposat diversos mecanismes de protecció, hi ha alguns reptes per a la seva adopció en la pràctica, és a dir perquè es puguin utilitzar. Per començar, l’impacte real d'aquests mecanismes en la utilitat de les dades no esta clar, raó per la qual la seva avaluació empírica és important. A més, considerant que actualment es manegen grans volums de dades, una privacitat usable requereix, no només preservació de la utilitat de les dades, sinó també algoritmes eficients en temes de temps de còmput. És clau satisfer tots dos requeriments per incentivar l’adopció de mesures de privacitat. Malgrat que hi ha diversos esforços per dissenyar mecanismes de privacitat menys "destructius", les mètriques d'utilitat emprades no serien apropiades, de manera que aquests mecanismes de protecció podrien estar sent incorrectament avaluats. D'altra banda, tot i l’adveniment del big data, la investigació existent no s’enfoca molt en millorar la seva eficiència. Lamentablement, si els requisits de les aplicacions actuals no es satisfan, s’obstaculitzarà l'adopció de tecnologies de privacitat. A la primera part d'aquesta tesi abordem el problema de mesurar l'impacte de la microagregació k-Gnónima en la utilitat empírica de microdades. Per això, quantifiquem la utilitat com la precisió de models de classificació obtinguts a partir de les dades microagregades. i avaluats sobre dades de prova originals. Els experiments mostren que l'impacte de l’algoritme de rmicroagregació estàndard en el rendiment d’algoritmes d'aprenentatge automàtic és usualment menor per a una varietat de conjunts de dades avaluats. A més, l’evidència experimental suggereix que la mètrica tradicional de distorsió de les dades seria inapropiada per avaluar la utilitat empírica de dades microagregades. Així també estudiem el problema de preservar la utilitat empírica de les dades a l'ésser anonimitzades. Transformant els registres originaIs de dades en un espai de dades diferent, el nostre enfocament, basat en anàlisi de discriminant lineal, permet que el procés de microagregació k-anònima s'adapti al domini d’aplicació de les dades. Per això, primer, les dades són rotades o projectades en la direcció de màxima discriminació i, segon, escalades en aquesta direcció, penalitzant la distorsió a través del llindar de classificació. Com a resultat, la utilitat de les dades es preserva en termes de la precisió dels models d'aprenentatge automàtic en diversos conjunts de dades. Posteriorment, proposem un mecanisme per reduir el temps d'execució per a la microagregació k-anònima. Això s'aconsegueix simplificant les operacions internes de l'algoritme escollit Mitjançant una extensa experimentació sobre diversos conjunts de dades, vam mostrar que el nou algoritme és bastant més ràpid. Aquesta acceleració s'aconsegueix sense que hi ha pèrdua en la utilitat de les dades. Finalment, en un enfocament més aplicat, es proposa una eina de protecció de privacitat d'individus i organitzacions mitjançant l'anonimització de dades sensibles inclosos en logs de seguretat. Es dissenyen diferents mecanismes d'anonimat per implementar-los en base a la definició d'una política de privacitat, en el context d'un projecte europeu que té per objectiu construir un sistema de seguretat unificat.Postprint (published version

    Contribution to privacy-enhancing tecnologies for machine learning applications

    Get PDF
    For some time now, big data applications have been enabling revolutionary innovation in every aspect of our daily life by taking advantage of lots of data generated from the interactions of users with technology. Supported by machine learning and unprecedented computation capabilities, different entities are capable of efficiently exploiting such data to obtain significant utility. However, since personal information is involved, these practices raise serious privacy concerns. Although multiple privacy protection mechanisms have been proposed, there are some challenges that need to be addressed for these mechanisms to be adopted in practice, i.e., to be “usable” beyond the privacy guarantee offered. To start, the real impact of privacy protection mechanisms on data utility is not clear, thus an empirical evaluation of such impact is crucial. Moreover, since privacy is commonly obtained through the perturbation of large data sets, usable privacy technologies may require not only preservation of data utility but also efficient algorithms in terms of computation speed. Satisfying both requirements is key to encourage the adoption of privacy initiatives. Although considerable effort has been devoted to design less “destructive” privacy mechanisms, the utility metrics employed may not be appropriate, thus the wellness of such mechanisms would be incorrectly measured. On the other hand, despite the advent of big data, more efficient approaches are not being considered. Not complying with the requirements of current applications may hinder the adoption of privacy technologies. In the first part of this thesis, we address the problem of measuring the effect of k-anonymous microaggregation on the empirical utility of microdata. We quantify utility accordingly as the accuracy of classification models learned from microaggregated data, evaluated over original test data. Our experiments show that the impact of the de facto microaggregation standard on the performance of machine-learning algorithms is often minor for a variety of data sets. Furthermore, experimental evidence suggests that the traditional measure of distortion in the community of microdata anonymization may be inappropriate for evaluating the utility of microaggregated data. Secondly, we address the problem of preserving the empirical utility of data. By transforming the original data records to a different data space, our approach, based on linear discriminant analysis, enables k-anonymous microaggregation to be adapted to the application domain of data. To do this, first, data is rotated (projected) towards the direction of maximum discrimination and, second, scaled in this direction, penalizing distortion across the classification threshold. As a result, data utility is preserved in terms of the accuracy of machine learned models for a number of standardized data sets. Afterwards, we propose a mechanism to reduce the running time for the k-anonymous microaggregation algorithm. This is obtained by simplifying the internal operations of the original algorithm. Through extensive experimentation over multiple data sets, we show that the new algorithm gets significantly faster. Interestingly, this remarkable speedup factor is achieved with no additional loss of data utility.Les aplicacions de big data impulsen actualment una accelerada innovació aprofitant la gran quantitat d’informació generada a partir de les interaccions dels usuaris amb la tecnologia. Així, qualsevol entitat és capaç d'explotar eficientment les dades per obtenir utilitat, emprant aprenentatge automàtic i capacitats de còmput sense precedents. No obstant això, sorgeixen en aquest escenari serioses preocupacions pel que fa a la privacitat dels usuaris ja que hi ha informació personal involucrada. Tot i que s'han proposat diversos mecanismes de protecció, hi ha alguns reptes per a la seva adopció en la pràctica, és a dir perquè es puguin utilitzar. Per començar, l’impacte real d'aquests mecanismes en la utilitat de les dades no esta clar, raó per la qual la seva avaluació empírica és important. A més, considerant que actualment es manegen grans volums de dades, una privacitat usable requereix, no només preservació de la utilitat de les dades, sinó també algoritmes eficients en temes de temps de còmput. És clau satisfer tots dos requeriments per incentivar l’adopció de mesures de privacitat. Malgrat que hi ha diversos esforços per dissenyar mecanismes de privacitat menys "destructius", les mètriques d'utilitat emprades no serien apropiades, de manera que aquests mecanismes de protecció podrien estar sent incorrectament avaluats. D'altra banda, tot i l’adveniment del big data, la investigació existent no s’enfoca molt en millorar la seva eficiència. Lamentablement, si els requisits de les aplicacions actuals no es satisfan, s’obstaculitzarà l'adopció de tecnologies de privacitat. A la primera part d'aquesta tesi abordem el problema de mesurar l'impacte de la microagregació k-Gnónima en la utilitat empírica de microdades. Per això, quantifiquem la utilitat com la precisió de models de classificació obtinguts a partir de les dades microagregades. i avaluats sobre dades de prova originals. Els experiments mostren que l'impacte de l’algoritme de rmicroagregació estàndard en el rendiment d’algoritmes d'aprenentatge automàtic és usualment menor per a una varietat de conjunts de dades avaluats. A més, l’evidència experimental suggereix que la mètrica tradicional de distorsió de les dades seria inapropiada per avaluar la utilitat empírica de dades microagregades. Així també estudiem el problema de preservar la utilitat empírica de les dades a l'ésser anonimitzades. Transformant els registres originaIs de dades en un espai de dades diferent, el nostre enfocament, basat en anàlisi de discriminant lineal, permet que el procés de microagregació k-anònima s'adapti al domini d’aplicació de les dades. Per això, primer, les dades són rotades o projectades en la direcció de màxima discriminació i, segon, escalades en aquesta direcció, penalitzant la distorsió a través del llindar de classificació. Com a resultat, la utilitat de les dades es preserva en termes de la precisió dels models d'aprenentatge automàtic en diversos conjunts de dades. Posteriorment, proposem un mecanisme per reduir el temps d'execució per a la microagregació k-anònima. Això s'aconsegueix simplificant les operacions internes de l'algoritme escollit Mitjançant una extensa experimentació sobre diversos conjunts de dades, vam mostrar que el nou algoritme és bastant més ràpid. Aquesta acceleració s'aconsegueix sense que hi ha pèrdua en la utilitat de les dades. Finalment, en un enfocament més aplicat, es proposa una eina de protecció de privacitat d'individus i organitzacions mitjançant l'anonimització de dades sensibles inclosos en logs de seguretat. Es dissenyen diferents mecanismes d'anonimat per implementar-los en base a la definició d'una política de privacitat, en el context d'un projecte europeu que té per objectiu construir un sistema de seguretat unificat

    Enhanced Privacy Preserving Accesscontrol in Incremental Datausing Microaggregation

    Get PDF
    In microdata releases, main task is to protect the privacy of data subjects. Microaggregation technique use to disclose the limitation at protecting the privacy of microdata. This technique is an alternative to generalization and suppression, which use to generate k-anonymous data sets. In this dataset, identity of each subject is hidden within a group of k subjects. Microaggregation perturbs the data and additional masking allows refining data utility in many ways, like increasing data granularity, to avoid discretization of numerical data, to reduce the impact of outliers. If the variability of the private data values in a group of k subjects is too small, k-anonymity does not provide protection against attribute disclosure. In this work Role based access control is assumed. The access control policies define selection predicates to roles. Then use the concept of imprecision bound for each permission to define a threshold on the amount of imprecision that can be tolerated. So the proposed approach reduces the imprecision for each selection predicate. Anonymization is carried out only for the static relational table in the existing papers. Privacy preserving access control mechanism is applied to the incremental data

    Incremental k-Anonymous microaggregation in large-scale electronic surveys with optimized scheduling

    Get PDF
    Improvements in technology have led to enormous volumes of detailed personal information made available for any number of statistical studies. This has stimulated the need for anonymization techniques striving to attain a difficult compromise between the usefulness of the data and the protection of our privacy. k-Anonymous microaggregation permits releasing a dataset where each person remains indistinguishable from other k–1 individuals, through the aggregation of demographic attributes, otherwise a potential culprit for respondent reidentification. Although privacy guarantees are by no means absolute, the elegant simplicity of the k-anonymity criterion and the excellent preservation of information utility of microaggregation algorithms has turned them into widely popular approaches whenever data utility is critical. Unfortunately, high-utility algorithms on large datasets inherently require extensive computation. This work addresses the need of running k-anonymous microaggregation efficiently with mild distortion loss, exploiting the fact that the data may arrive over an extended period of time. Specifically, we propose to split the original dataset into two portions that will be processed subsequently, allowing the first process to start before the entire dataset is received, while leveraging the superlinearity of the microaggregation algorithms involved. A detailed mathematical formulation enables us to calculate the optimal time for the fastest anonymization, as well as for minimum distortion under a given deadline. Two incremental microaggregation algorithms are devised, for which extensive experimentation is reported. The theoretical methodology presented should prove invaluable in numerous data-collection applications, including largescale electronic surveys in which computation is possible as the data comes in.Peer ReviewedPostprint (published version

    Efficient k-anonymous microaggregation of multivariate numerical data via principal component analysis

    Get PDF
    © . This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/k-Anonymous microaggregation is a widespread technique to address the problem of protecting the privacy of the respondents involved beyond the mere suppression of their identifiers, in applications where preserving the utility of the information disclosed is critical. Unfortunately, microaggregation methods with high data utility may impose stringent computational demands when dealing with datasets containing a large number of records and attributes. This work proposes and analyzes various anonymization methods which draw upon the algebraic-statistical technique of principal component analysis (PCA), in order to effectively reduce the number of attributes processed, that is, the dimension of the multivariate microaggregation problem at hand. By preserving to a high degree the energy of the numerical dataset and carefully choosing the number of dominant components to process, we manage to achieve remarkable reductions in running time and memory usage with negligible impact in information utility. Our methods are readily applicable to high-utility SDC of large-scale datasets with numerical demographic attributes.Peer ReviewedPostprint (author's final draft

    Mathematically optimized, recursive prepartitioning strategies for k-anonymous microaggregation of large-scale datasets

    Get PDF
    © Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/The technical contents of this work fall within the statistical disclosure control (SDC) field, which concerns the postprocessing of the demographic portion of the statistical results of surveys containing sensitive personal information, in order to effectively safeguard the anonymity of the participating respondents. A widely known technique to solve the problem of protecting the privacy of the respondents involved beyond the mere suppression of their identifiers is the k-anonymous microaggregation. Unfortunately, most microaggregation algorithms that produce competitively low levels of distortions exhibit a superlinear running time, typically scaling with the square of the number of records in the dataset. This work proposes and analyzes an optimized prepartitioning strategy to reduce significantly the running time for the k-anonymous microaggregation algorithm operating on large datasets, with mild loss in data utility with respect to that of MDAV, the underlying method. The optimization strategy is based on prepartitioning a dataset recursively until the desired k-anonymity parameter is achieved. Traditional microaggregation algorithms have quadratic computational complexity in the form T(n2). By using the proposed method and fixing the number of recurrent prepartitions we obtain subquadratic complexity in the form T(n3/2), T(n4/3), ..., depending on the number of prepartitions. Alternatively, fixing the ratio between the size of the microcell and the macrocell on each prepartition, quasilinear complexity in the form T(nlog¿n) is achieved. Our method is readily applicable to large-scale datasets with numerical demographic attributes.Peer ReviewedPostprint (author's final draft

    Computational methods for large-scale microdata anonymization

    Get PDF
    En enquestes i votacions electròniques és necessari garantir l'anonimat de les persones que hi participen, però a la vegada seria interessant poder fer un processament estadístic dels resultats (per exemple, que ha votat/opinat les persones d'una determinada franja d'edat). L'objectiu d'aquest projecte es adaptar mecanismes d'anonimat dissenyats per a bases de dades a l'entorn d'enquestes/votacions electròniques.Abstract— Statistical disclosure control (SDC) concerns safeguarding the privacy of people and/or companies whose confidential information is released as large datasets. This data is used in statistical studies, and thus, in addition of ensuring the privacy of the individuals, the statistical dependences of the published data should be very similar to the original ones. Microaggregation, and more concretely, k-anonymous microaggregation, solves this problem with an acceptable preservation of the privacy. However, it fails in terms of execution time, since, even if for small amounts of data the required execution time is affordable, when dealing with bigger datasets the required execution time is not acceptable. This degree project report presents new algorithms that, while preserving the quality of the released information, improve the computational cost of k-anonymous microaggregation. This is done by using dimensionality-reduction techniques, which allow to decrease noticeably the execution time, and at the same time do not incur in a significant loss of statistical dependences of the information being published.Resumen — El objetivo del control de revelación estadística de microdatos es salvaguardar la privacidad de personas y/o empresas cuyos datos confidenciales se publican en forma de conjuntos de datos. Estos datos son utilizados en estudios estadísticos y, por lo tanto, además de asegurar la privacidad de las personas/empresas, las dependencias estadísticas de los datos publicados deberían ser muy similares a las de los datos originales. La microagregación, y más concretamente, la microagregación k-anonima, solventa este problema asegurando una preservación de la privacidad aceptable. Sin embargo, en términos de tiempo de ejecución, no es conveniente, ya que, aunque para conjuntos de datos no muy grandes el tiempo de ejecución es aceptable, cuando tratamos con conjuntos de datos más grandes, este tiempo incrementa hasta tal punto que hace el proceso inviable. Este informe de proyecto final de grado presenta nuevos algoritmos que, preservando la calidad de los datos publicados, mejora el coste computacional de la microagregación k-anonima. Para tal efecto se utilizan técnicas de reducción dimensional, que permiten reducir notablemente el tiempo de ejecución, y al mismo tiempo no conllevan una gran pérdida de información respecto a las dependencias estadísticas de los datos que se publican.Resum — L’objectiu del control de revelació estadística de microdades és protegir la privacitat de persones i/o empreses les dades confidencials de les quals es publiquen en forma de conjunts de dades. Aquestes dades són utilitzades en estudis estadístics i, per tant, a més d’assegurar la privacitat de les persones/empreses, les dependències estadístiques de les dades publicades haurien d’ésser molt similars a les de les dades originals. La microagregació, i més concretament, la microagregació k-anònima, solventa aquest problema assegurant una preservació de la privacitat acceptable. No obstant, en termes de temps d’execució, no és convenient, ja que, encara que per conjunts de dades no molt grans el temps d’execució és acceptable, quan tractem amb conjunts de dades més grans, aquest temps s’incrementa fins a tal punt que fa el procés inviable. Aquest informe de projecte final de grau presenta nous algoritmes que, preservant la qualitat de les dades publicades, millora el cost computacional de la microagregació k-anònima. Per a tal cosa s’utilitzen tècniques de reducció dimensional, que permeten reduir notablement el temps d’execució, i al mateix temps no comporten una gran pèrdua d’informació respecte a les dependències estadístiques de les dades que es publiquen

    Revisiting distance-based record linkage for privacy-preserving release of statistical datasets

    Get PDF
    Statistical Disclosure Control (SDC, for short) studies the problem of privacy-preserving data publishing in cases where the data is expected to be used for statistical analysis. An original dataset T containing sensitive information is transformed into a sanitized version T' which is released to the public. Both utility and privacy aspects are very important in this setting. For utility, T' must allow data miners or statisticians to obtain similar results to those which would have been obtained from the original dataset T. For privacy, T' must significantly reduce the ability of an adversary to infer sensitive information on the data subjects in T. One of the main a-posteriori measures that the SDC community has considered up to now when analyzing the privacy offered by a given protection method is the Distance-Based Record Linkage (DBRL) risk measure. In this work, we argue that the classical DBRL risk measure is insufficient. For this reason, we introduce the novel Global Distance-Based Record Linkage (GDBRL) risk measure. We claim that this new measure must be evaluated alongside the classical DBRL measure in order to better assess the risk in publishing T' instead of T. After that, we describe how this new measure can be computed by the data owner and discuss the scalability of those computations. We conclude by extensive experimentation where we compare the risk assessments offered by our novel measure as well as by the classical one, using well-known SDC protection methods. Those experiments validate our hypothesis that the GDBRL risk measure issues, in many cases, higher risk assessments than the classical DBRL measure. In other words, relying solely on the classical DBRL measure for risk assessment might be misleading, as the true risk may be in fact higher. Hence, we strongly recommend that the SDC community considers the new GDBRL risk measure as an additional measure when analyzing the privacy offered by SDC protection algorithms.Postprint (author's final draft
    corecore