72 research outputs found

    Spectral anonymization of data

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 87-96).Data anonymization is the process of conditioning a dataset such that no sensitive information can be learned about any specific individual, but valid scientific analysis can nevertheless be performed on it. It is not sufficient to simply remove identifying information because the remaining data may be enough to infer the individual source of the record (a reidentification disclosure) or to otherwise learn sensitive information about a person (a predictive disclosure). The only known way to prevent these disclosures is to remove additional information from the dataset. Dozens of anonymization methods have been proposed over the past few decades; most work by perturbing or suppressing variable values. None have been successful at simultaneously providing perfect privacy protection and allowing perfectly accurate scientific analysis. This dissertation makes the new observation that the anonymizing operations do not need to be made in the original basis of the dataset. Operating in a different, judiciously chosen basis can improve privacy protection, analytic utility, and computational efficiency. I use the term 'spectral anonymization' to refer to anonymizing in a spectral basis, such as the basis provided by the data's eigenvectors. Additionally, I propose new measures of reidentification and prediction risk that are more generally applicable and more informative than existing measures. I also propose a measure of analytic utility that assesses the preservation of the multivariate probability distribution. Finally, I propose the demanding reference standard of nonparticipation in the study to define adequate privacy protection. I give three examples of spectral anonymization in practice. The first example improves basic cell swapping from a weak algorithm to one competitive with state of-the-art methods merely by a change of basis.(cont) The second example demonstrates avoiding the curse of dimensionality in microaggregation. The third describes a powerful algorithm that reduces computational disclosure risk to the same level as that of nonparticipants and preserves at least 4th order interactions in the multivariate distribution. No previously reported algorithm has achieved this combination of results.by Thomas Anton Lasko.Ph.D

    ASPECTS OF STATISTICAL DISCLOSURE CONTROL

    Get PDF

    GENETIC ALGORITHMS AND THEIR APPLICATIONS TO SYNTHETIC DATA GENERATION

    Get PDF

    Multiple Imputation Methods for Nonignorable Nonresponse, Adaptive Survey Design, and Dissemination of Synthetic Geographies

    Get PDF
    <p>This thesis presents methods for multiple imputation that can be applied to missing data and data with confidential variables. Imputation is useful for missing data because it results in a data set that can be analyzed with complete data statistical methods. The missing data are filled in by values generated from a model fit to the observed data. The model specification will depend on the observed data pattern and the missing data mechanism. For example, when the reason why the data is missing is related to the outcome of interest, that is nonignorable missingness, we need to alter the model fit to the observed data to generate the imputed values from a different distribution. Imputation is also used for generating synthetic values for data sets with disclosure restrictions. Since the synthetic values are not actual observations, they can be released for statistical analysis. The interest is in fitting a model that approximates well the relationships in the original data, keeping the utility of the synthetic data, while preserving the confidentiality of the original data. We consider applications of these methods to data from social sciences and epidemiology.</p><p>The first method is for imputation of multivariate continuous data with nonignorable missingness. Regular imputation methods have been used to deal with nonresponse in several types of survey data. However, in some of these studies, the assumption of missing at random is not valid since the probability of missing depends on the response variable. We propose an imputation method for multivariate data sets when there is nonignorable missingness. We fit a truncated Dirichlet process mixture of multivariate normals to the observed data under a Bayesian framework to provide flexibility. With the posterior samples from the mixture model, an analyst can alter the estimated distribution to obtain imputed data under different scenarios. To facilitate that, I developed an R application that allows the user to alter the values of the mixture parameters and visualize the imputation results automatically. I demonstrate this process of sensitivity analysis with an application to the Colombian Annual Manufacturing Survey. I also include a simulation study to show that the correct complete data distribution can be recovered if the true missing data mechanism is known, thus validating that the method can be meaningfully interpreted to do sensitivity analysis.</p><p>The second method uses the imputation techniques for nonignorable missingness to implement a procedure for adaptive design in surveys. Specifically, I develop a procedure that agencies can use to evaluate whether or not it is effective to stop data collection. This decision is based on utility measures to compare the data collected so far with potential follow-up samples. The options are assessed by imputation of the nonrespondents under different missingness scenarios considered by the analyst. The variation in the utility measures is compared to the cost induced by the follow-up sample sizes. We apply the proposed method to the 2007 U.S. Census of Manufactures.</p><p>The third method is for imputation of confidential data sets with spatial locations using disease mapping models. We consider data that include fine geographic information, such as census tract or street block identifiers. This type of data can be difficult to release as public use files, since fine geography provides information that ill-intentioned data users can use to identify individuals. We propose to release data with simulated geographies, so as to enable spatial analyses while reducing disclosure risks. We fit disease mapping models that predict areal-level counts from attributes in the file, and sample new locations based on the estimated models. I illustrate this approach using data on causes of death in North Carolina, including evaluations of the disclosure risks and analytic validity that can result from releasing synthetic geographies.</p>Dissertatio

    Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques

    Full text link
    The growing use of voice user interfaces has led to a surge in the collection and storage of speech data. While data collection allows for the development of efficient tools powering most speech services, it also poses serious privacy issues for users as centralized storage makes private personal speech data vulnerable to cyber threats. With the increasing use of voice-based digital assistants like Amazon's Alexa, Google's Home, and Apple's Siri, and with the increasing ease with which personal speech data can be collected, the risk of malicious use of voice-cloning and speaker/gender/pathological/etc. recognition has increased. This thesis proposes solutions for anonymizing speech and evaluating the degree of the anonymization. In this work, anonymization refers to making personal speech data unlinkable to an identity while maintaining the usefulness (utility) of the speech signal (e.g., access to linguistic content). We start by identifying several challenges that evaluation protocols need to consider to evaluate the degree of privacy protection properly. We clarify how anonymization systems must be configured for evaluation purposes and highlight that many practical deployment configurations do not permit privacy evaluation. Furthermore, we study and examine the most common voice conversion-based anonymization system and identify its weak points before suggesting new methods to overcome some limitations. We isolate all components of the anonymization system to evaluate the degree of speaker PPI associated with each of them. Then, we propose several transformation methods for each component to reduce as much as possible speaker PPI while maintaining utility. We promote anonymization algorithms based on quantization-based transformation as an alternative to the most-used and well-known noise-based approach. Finally, we endeavor a new attack method to invert anonymization.Comment: PhD Thesis Pierre Champion | Universit\'e de Lorraine - INRIA Nancy | for associated source code, see https://github.com/deep-privacy/SA-toolki

    Scalable and approximate privacy-preserving record linkage

    No full text
    Record linkage, the task of linking multiple databases with the aim to identify records that refer to the same entity, is occurring increasingly in many application areas. Generally, unique entity identifiers are not available in all the databases to be linked. Therefore, record linkage requires the use of personal identifying attributes, such as names and addresses, to identify matching records that need to be reconciled to the same entity. Often, it is not permissible to exchange personal identifying data across different organizations due to privacy and confidentiality concerns or regulations. This has led to the novel research area of privacy-preserving record linkage (PPRL). PPRL addresses the problem of how to link different databases to identify records that correspond to the same real-world entities, without revealing the identities of these entities or any private or confidential information to any party involved in the process, or to any external party, such as a researcher. The three key challenges that a PPRL solution in a real-world context needs to address are (1) scalability to largedatabases by efficiently conducting linkage; (2) achieving high quality of linkage through the use of approximate (string) matching and effective classification of the compared record pairs into matches (i.e. pairs of records that refer to the same entity) and non-matches (i.e. pairs of records that refer to different entities); and (3) provision of sufficient privacy guarantees such that the interested parties only learn the actual values of certain attributes of the records that were classified as matches, and the process is secure with regard to any internal or external adversary. In this thesis, we present extensive research in PPRL, where we have addressed several gaps and problems identified in existing PPRL approaches. First, we begin the thesis with a review of the literature and we propose a taxonomy of PPRL to characterize existing techniques. This allows us to identify gaps and research directions. In the remainder of the thesis, we address several of the identified shortcomings. One main shortcoming we address is a framework for empirical and comparative evaluation of different PPRL solutions, which has not been studied in the literature so far. Second, we propose several novel algorithms for scalable and approximate PPRL by addressing the three main challenges of PPRL. We propose efficient private blocking techniques, for both three-party and two-party scenarios, based on sorted neighborhood clustering to address the scalability challenge. Following, we propose two efficient two-party techniques for private matching and classification to address the linkage quality challenge in terms of approximate matching and effective classification. Privacy is addressed in these approaches using efficient data perturbation techniques including k-anonymous mapping, reference values, and Bloom filters. Finally, the thesis reports on an extensive comparative evaluation of our proposed solutions with several other state-of-the-art techniques on real-world datasets, which shows that our solutions outperform others in terms of all three key challenges
    • …
    corecore