
    Supervised learning using a symmetric bilinear form for record linkage

Record linkage is used to link records from two different files that correspond to the same individuals. These algorithms are used for database integration. In data privacy, they are used to evaluate the disclosure risk of a protected data set by linking records that belong to the same individual: the degree of success when linking the original (unprotected) data with the protected data gives an estimate of the disclosure risk. In this paper we propose a new parameterized aggregation operator and a supervised learning method for disclosure risk assessment. The parameterized operator is a symmetric bilinear form, and the supervised learning method is formalized as an optimization problem whose target is to find the values of the aggregation parameters that maximize the number of re-identifications (correct links). We evaluate and compare our proposal with non-parameterized variations of record linkage, such as those using the Mahalanobis distance and the Euclidean distance (one of the most widely used approaches for this purpose). We also compare it with previously presented parameterized aggregation operators for record linkage, such as the weighted mean and the Choquet integral. These comparisons show that the proposed aggregation operator is able to outperform, or at least match, the other parameterized operators. We also study the conditions the optimization problem must satisfy for the described aggregation functions to be metric functions.
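
    To make the linkage model concrete, here is a minimal sketch, assuming a bilinear-form distance d(a, b) = (a - b)^T M (a - b) with a symmetric parameter matrix M and a nearest-neighbour linkage rule; the data, the matrix, and the counting of correct links are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def bilinear_distance(a, b, M):
    """d(a, b) = (a - b)^T M (a - b) for a symmetric parameter matrix M."""
    d = a - b
    return d @ M @ d

def count_correct_links(original, protected, M):
    """Link each original record to its nearest protected record under M;
    row i of both arrays is assumed to belong to the same individual."""
    correct = 0
    for i, rec in enumerate(original):
        dists = [bilinear_distance(rec, p, M) for p in protected]
        if int(np.argmin(dists)) == i:
            correct += 1
    return correct

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # stand-in "original" records
Xp = X + rng.normal(scale=0.3, size=X.shape)  # stand-in "protected" records

M_euclid = np.eye(4)  # Euclidean distance arises as the special case M = I
print(count_correct_links(X, Xp, M_euclid))
# A supervised learner would instead search for the symmetric M that
# maximizes this count on training pairs.
```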

    Revisiting distance-based record linkage for privacy-preserving release of statistical datasets

Statistical Disclosure Control (SDC, for short) studies the problem of privacy-preserving data publishing in cases where the data is expected to be used for statistical analysis. An original dataset T containing sensitive information is transformed into a sanitized version T' which is released to the public. Both utility and privacy are important in this setting. For utility, T' must allow data miners or statisticians to obtain results similar to those which would have been obtained from the original dataset T. For privacy, T' must significantly reduce the ability of an adversary to infer sensitive information about the data subjects in T. One of the main a posteriori measures that the SDC community has considered up to now when analyzing the privacy offered by a given protection method is the Distance-Based Record Linkage (DBRL) risk measure. In this work, we argue that the classical DBRL risk measure is insufficient. For this reason, we introduce the novel Global Distance-Based Record Linkage (GDBRL) risk measure. We claim that this new measure must be evaluated alongside the classical DBRL measure in order to better assess the risk of publishing T' instead of T. We then describe how this new measure can be computed by the data owner and discuss the scalability of those computations. We conclude with extensive experimentation comparing the risk assessments offered by our novel measure and by the classical one, using well-known SDC protection methods. Those experiments validate our hypothesis that the GDBRL risk measure issues, in many cases, higher risk assessments than the classical DBRL measure. In other words, relying solely on the classical DBRL measure for risk assessment might be misleading, as the true risk may in fact be higher. Hence, we strongly recommend that the SDC community consider the new GDBRL risk measure as an additional measure when analyzing the privacy offered by SDC protection algorithms.
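
    As an illustration of the difference between the two measures, the sketch below contrasts record-by-record nearest-neighbour linkage with a global one-to-one linkage formulated as a minimum-cost assignment; the assignment-based reading of GDBRL is our assumption here, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def dbrl_risk(T, Tp):
    """Classical DBRL: link each original record to its nearest sanitized
    record independently; risk = share of correct links."""
    D = cdist(T, Tp)
    return np.mean(D.argmin(axis=1) == np.arange(len(T)))

def gdbrl_risk(T, Tp):
    """Global variant (assumed): a one-to-one assignment minimizing total
    distance, so two originals can never claim the same sanitized record."""
    D = cdist(T, Tp)
    rows, cols = linear_sum_assignment(D)
    return np.mean(cols == rows)

rng = np.random.default_rng(1)
T = rng.normal(size=(200, 3))                 # stand-in original dataset
Tp = T + rng.normal(scale=0.5, size=T.shape)  # stand-in sanitized dataset
print(dbrl_risk(T, Tp), gdbrl_risk(T, Tp))    # the global risk is often higher
```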

    Record-Linkage from a Technical Point of View

Record linkage is used for preparing sampling frames, deduplicating lists, and combining information on the same object from two different databases. If the same objects in two different databases have error-free, unique common identifiers, such as personal identification numbers (PIDs), record linkage is a simple file-merge operation. If the identifiers contain errors, record linkage is a challenging task. In many applications, the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues, and future research topics are discussed. Keywords: record linkage, data mining, privacy-preserving protocols.
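
    The two regimes described above can be illustrated with a small sketch: an exact merge when an error-free PID is available, and approximate string matching when only error-prone identifiers exist. The records, the similarity function, and the threshold are invented for illustration.

```python
from difflib import SequenceMatcher

file_a = [{"pid": 1, "name": "John Smith"}, {"pid": 2, "name": "Mary Jones"}]
file_b = [{"pid": 2, "name": "Mary Jnoes"}, {"pid": 1, "name": "Jon Smith"}]

# Regime 1: error-free unique PIDs -> record linkage is a plain file merge.
merged = {a["pid"]: (a, b) for a in file_a for b in file_b
          if a["pid"] == b["pid"]}

# Regime 2: no reliable PID -> compare error-prone strings and accept the
# best candidate only if its similarity clears a threshold.
def best_match(record, candidates, threshold=0.8):
    scored = [(SequenceMatcher(None, record["name"], c["name"]).ratio(), c)
              for c in candidates]
    score, cand = max(scored, key=lambda t: t[0])
    return cand if score >= threshold else None

for a in file_a:
    print(a["name"], "->", best_match(a, file_b))
```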

    Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation. Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches. Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling-out and linkability attacks. Distributional and inferential similarity with the original data were high for all datasets. Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to the field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods. Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
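
    As a rough illustration of tree-based synthesis with distance-based filtering, the sketch below resamples each attribute within the leaves of a regression tree and then drops synthetic records that fall too close to an original record; the sampling and filtering rules are simplifying assumptions, not the authors' four-step framework.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))  # stand-in for the original dataset

synthetic = X.copy()
for j in range(X.shape[1]):
    others = np.delete(synthetic, j, axis=1)
    tree = DecisionTreeRegressor(min_samples_leaf=20).fit(others, X[:, j])
    leaves = tree.apply(others)
    # Resample each value from the observed values in the same leaf, so
    # synthetic values preserve conditional structure without copying rows.
    for leaf in np.unique(leaves):
        idx = np.where(leaves == leaf)[0]
        synthetic[idx, j] = rng.choice(X[idx, j], size=len(idx))

# Distance-based filtering: drop synthetic records that are nearly
# identical to an original record (threshold chosen arbitrarily here).
d_min = cdist(synthetic, X).min(axis=1)
released = synthetic[d_min > 0.05]
print(len(released), "of", len(synthetic), "synthetic records released")
```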

    Parameter determination of ONN (Ordered Neural Networks)

The need for data privacy motivates the development of new methods that allow data to be protected while minimizing the disclosure risk and without losing information. In this paper, we propose a new protection method for numerical data called the Ordered Neural Networks (ONN) method. ONN presents a new way to protect data based on the use of Artificial Neural Networks (ANNs). ONN combines the use of ANNs with a new strategy for preprocessing data, consisting of the vectorization, sorting, and partitioning of all the values of the attributes to be protected in the data set. We also present a statistical analysis that identifies the most important parameters affecting the quality of our method, and we show that it is possible to find a good configuration for these parameters. Finally, we compare our method to the best methods presented in the literature, using data provided by the US Census Bureau. Our experiments show that ONN outperforms the previously proposed methods, demonstrating that ANNs can be used in this setting to protect data efficiently without losing the statistical properties of the set.
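
    The preprocessing strategy described above can be sketched as follows, assuming the values are gathered into one vector, sorted, and split into equal partitions that are each later modelled by a network; the partition count and the ANN stage itself are assumptions for illustration.

```python
import numpy as np

def onn_preprocess(data, n_partitions):
    """Vectorize, sort, and partition all values of the protected attributes."""
    values = data.ravel()                  # vectorization
    order = np.argsort(values)             # sorting (kept so values map back)
    parts = np.array_split(values[order], n_partitions)  # partitioning
    return parts, order

rng = np.random.default_rng(3)
data = rng.normal(size=(100, 4))
parts, order = onn_preprocess(data, n_partitions=8)
print([p.shape for p in parts])
# Each partition would then be modelled by an ANN; the network outputs
# replace the original values, and `order` maps them back into the table.
```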

    Challenges and Solutions in Constructing a Microsimulation Model of the Use and Costs of Medical Services in Australia

This paper describes the development of a microsimulation model, 'HealthMod', which simulates the use and costs of medical and related services by Australian families. Australia has a universal social insurance scheme known as 'Medicare' which provides all Australians with access to free or low-cost essential medical services. These services are provided primarily by general practitioners as well as specialist doctors, but also include diagnostic and imaging services. Individuals may pay a direct out-of-pocket contribution if fees charged for services are higher than the reimbursement schedule set by the government. HealthMod is based on the Australian 2001 National Health Survey. This survey had a number of deficiencies in terms of modelling the national medical benefits scheme. The article outlines three major methodological steps that had to be taken in the model construction: the imputation of synthetic families, the imputation of short-term health conditions, and the annualisation of doctor visits and costs. Some preliminary results on the use of doctor services subsidised through Australia's Medicare are presented. Keywords: economic microsimulation modelling, medical services, use and costs, Australia.
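
    As a toy illustration of the annualisation step, the sketch below converts two-week recall counts of doctor visits to annual counts, contrasting naive scaling with simulating the remaining weeks from a count model; the rates and the Poisson assumption are ours, not HealthMod's.

```python
import numpy as np

rng = np.random.default_rng(4)
two_week_visits = rng.poisson(lam=0.4, size=1000)  # stand-in survey data

# Naive scaling: 26 two-week periods per year; the mean is right but each
# person's annual count is forced to be a multiple of 26.
naive_annual = two_week_visits * 26

# Simulation: estimate a smoothed weekly rate per person and draw the
# remaining 50 weeks of the year from a Poisson model.
person_rate = (two_week_visits + 0.1) / 2
simulated_annual = two_week_visits + rng.poisson(person_rate * 50)

print(naive_annual.mean(), simulated_annual.mean())
```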

    Anonymizing data via polynomial regression

The amount of confidential information accessible through the Internet is growing continuously. In this scenario, the improvement of anonymizing methods becomes crucial to avoid revealing sensitive information about individuals. Among the protection methods proposed, those based on linear regression are widely used. However, there is no reason to assume that linear regression is better than more complex polynomial regressions. In this paper, we present PoROP-k, a family of anonymizing methods able to protect a data set using polynomial regressions. We show that PoROP-k not only reduces the loss of information but also obtains a better level of protection than previous proposals based on linear regressions.
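
    A minimal sketch of the idea: fit a degree-k polynomial to a sensitive attribute and release fitted values in place of the originals. The degree, the predictor, and the noise scale are illustrative assumptions rather than the PoROP-k algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=300)  # non-confidential predictor (assumed)
y = 2.0 + 0.5 * x - 0.03 * x**2 + rng.normal(scale=0.5, size=x.size)

k = 3
coeffs = np.polyfit(x, y, deg=k)   # degree-k polynomial fit
y_protected = np.polyval(coeffs, x)  # regression surface replaces y
y_protected += rng.normal(scale=0.2, size=y.size)  # optional perturbation

# Higher k tracks the data more closely (less information loss) but can
# also leak more about individual records, hence the risk-utility tradeoff.
print(np.corrcoef(y, y_protected)[0, 1])
```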

    Protecting Micro-Data Privacy: The Moment-Based Density Estimation Method and its Application

Privacy concerns pertaining to the release of confidential micro-level information are increasingly relevant to organisations and institutions. Controlling the dissemination of disclosure-prone micro-data by means of suppression, aggregation, and perturbation techniques often entails different levels of effectiveness and drawbacks depending on the context and properties of the data. In this dissertation, we briefly review existing disclosure control methods for micro-data and undertake a study demonstrating the applicability of micro-data methods to proportion data. This is achieved by using, as a measure of statistical utility, the sample-size efficiency related to a simple hypothesis test at a fixed significance level and power. We compare a query-based differential privacy mechanism to the multiplicative noise method for disclosure control and demonstrate that, with the correct specification of noise parameters, the multiplicative noise method, which is a micro-data based method, achieves similar disclosure protection properties with reduced statistical efficiency costs.
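
    The comparison can be sketched as follows, using a mean-one multiplicative noise perturbation of the micro-data against a Laplace-perturbed proportion query; the noise distributions and parameters are textbook choices assumed for illustration, not those of the dissertation.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.binomial(1, 0.3, size=2000).astype(float)  # binary micro-data

# Micro-data route: release each record multiplied by positive noise with
# mean one, so unbiased proportion estimates remain possible.
noise = rng.lognormal(mean=-0.02, sigma=0.2, size=x.size)  # E[noise] = 1
x_protected = x * noise
print("multiplicative-noise estimate:", x_protected.mean())

# Query route: release only the proportion, perturbed with Laplace noise
# calibrated to sensitivity 1/n for epsilon-differential privacy.
epsilon, n = 1.0, x.size
dp_estimate = x.mean() + rng.laplace(scale=1 / (n * epsilon))
print("differentially private estimate:", dp_estimate)
```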