
    Parameterized Complexity of the k-anonymity Problem

    The problem of publishing personal data without giving up privacy is becoming increasingly important. An interesting formalization that has recently been proposed is k-anonymity. This approach requires that the rows of a table be partitioned into clusters of size at least k and that all the rows in a cluster become the same tuple after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is known to be APX-hard even when the record values are over a binary alphabet and k = 3, and when the records have length at most 8 and k = 4. In this paper we study how the complexity of the problem is influenced by different parameters. First, we show that the problem is W[1]-hard when parameterized by the size of the solution (and the value k). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the size of the alphabet and the number of columns. Finally, we investigate the computational (and approximation) complexity of the k-anonymity problem when the instance is restricted to records of length at most 3 and k = 3. We show that even this restriction is APX-hard. Comment: 22 pages, 2 figures
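The suppression model described above can be made concrete with a minimal sketch (not taken from the paper): within each cluster, every column on which the rows disagree is replaced by "*" in all of them, and the cost of a clustering is the total number of suppressed entries.

```python
def suppress_cluster(rows):
    """Make all rows in a cluster identical by suppressing every column
    on which the rows disagree; return the common tuple and the cost."""
    cols = len(rows[0])
    out = []
    cost = 0
    for j in range(cols):
        vals = {r[j] for r in rows}
        if len(vals) == 1:
            out.append(rows[0][j])       # all rows agree: keep the value
        else:
            out.append("*")              # disagreement: suppress the column
            cost += len(rows)            # one suppressed entry per row
    return tuple(out), cost

# Binary records of length 3, one cluster of size k = 3:
cluster = [(0, 1, 0), (0, 1, 1), (0, 0, 1)]
anonymized, cost = suppress_cluster(cluster)
# columns 2 and 3 disagree, so both are suppressed in all three rows
```

The optimization problem the paper studies is choosing the partition into clusters of size at least k so that this total cost is minimized.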

    A theory and toolkit for the mathematics of privacy: methods for anonymizing data while minimizing information loss

    Thesis (S.M.)--Massachusetts Institute of Technology, Engineering Systems Division, Technology and Policy Program; and, Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (leaves 85-86). By Hooman Katirai.
    Privacy laws are an important facet of our society, but they can also serve as formidable barriers to medical research. The same laws that prevent casual disclosure of medical data have also made it difficult for researchers to access the information they need to conduct research into the causes of disease. It is possible, however, to overcome some of these legal barriers through technology. The US law known as HIPAA, for example, allows medical records to be released to researchers without patient consent if the records are provably anonymized prior to their disclosure. It is not enough for records to be seemingly anonymous: one researcher estimates that 87.1% of the US population can be uniquely identified by the combination of their zip code, gender, and date of birth - fields that most people would consider anonymous. One promising technique for provably anonymizing records is called k-anonymity. It modifies each record so that it matches k other individuals in a population, where k is an arbitrary parameter. This is achieved by, for example, changing specific information, such as a date of birth, to a less specific counterpart, such as a year of birth. Previous studies have shown that achieving k-anonymity while minimizing information loss is an NP-hard problem; thus a brute-force search is out of the question for most real-world data sets. In this thesis, we present an open-source Java toolkit that seeks to anonymize data while minimizing information loss. It uses an optimization framework and methods typically used to attack NP-hard problems, including greedy search and clustering strategies.
    To test the toolkit, a number of previously unpublished algorithms and information-loss metrics have been implemented. These algorithms and measures are then empirically evaluated on a data set of 1000 real patient medical records taken from a local hospital. The theoretical contributions of this work include: (1) a new threat model for privacy, which allows an adversary's capabilities to be modeled using a formalism called a virtual attack database; (2) rationally defensible information-loss measures - we show that previously published information-loss measures are difficult to defend because they fall prey to what is known as the "weighted indexing problem," and to remedy this we propose a number of information-loss measures that are in principle more attractive than previously published ones; (3) a proof that suppression and generalization - two concepts previously thought to be distinct - are in fact the same thing, insofar as each generalization can be represented by a suppression and vice versa; (4) a demonstration that Domain Generalization Hierarchies can be harvested to assist the construction of a Bayesian network for measuring information loss; (5) a technique that treats a database as a sub-sample of a population and predicts k-anonymity in the population, which allows us, under some conditions, to release records that match fewer than k individuals in the database while still achieving k-anonymity against an adversary with some probability and confidence interval. While we have chosen to focus this thesis on the anonymization of medical records, our methodologies, toolkit, and command-line tools are equally applicable to any tabular data, such as the data one finds in relational databases - the most common type of database today.
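The generalization operation the abstract mentions (a date of birth replaced by a less specific counterpart) can be sketched as levels of a domain generalization hierarchy. This is a hedged illustration, not the thesis toolkit's actual API; the function name and level scheme are assumptions.

```python
def generalize_dob(dob: str, level: int) -> str:
    """Generalize an ISO date 'YYYY-MM-DD' by hierarchy level:
    0 = full date, 1 = year-month, 2 = year, 3 = fully suppressed."""
    parts = dob.split("-")
    if level == 0:
        return dob
    if level == 1:
        return "-".join(parts[:2])   # e.g. '1975-06'
    if level == 2:
        return parts[0]              # e.g. '1975'
    return "*"                       # the top of the hierarchy

# Note the abstract's contribution (3): at the top level a generalization
# withholds the value entirely, i.e. it coincides with a suppression.
print(generalize_dob("1975-06-14", 2))
```

Each step up the hierarchy trades specificity (information) for a larger set of matching individuals, which is exactly the loss the toolkit's metrics try to minimize.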

    A New Mathematical Optimization-Based Method for the m-invariance Problem

    The issue of ensuring privacy for users who share their personal information has become a growing priority in a business and scientific environment where the use of different types of data, and the laws that protect it, have increased in tandem. Several technologies have been widely developed for static publications, i.e., where the information is published only once, such as k-anonymity and ε-differential privacy. Where microdata is published dynamically, although established notions such as m-invariance and τ-safety already exist, developments for improving utility remain superficial. We propose a new heuristic approach to the NP-hard combinatorial problems of m-invariance and τ-safety, based on a mathematical-optimization column generation scheme. The quality of a solution to m-invariance and τ-safety can be measured by its Information Loss (IL), a value in [0, 100], the closer to 0 the better. We show that our approach substantially improves on current heuristics, in some instances providing solutions with ILs of 1.87, 8.5, and 1.93 where state-of-the-art methods reported ILs of 39.03, 51.84, and 57.97, respectively.
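The IL score in [0, 100] can be illustrated with one common style of information-loss measure for a numerical attribute: the width of each record's generalized interval relative to the attribute's full domain, averaged over records and scaled to 100. This is only a hedged sketch of the general idea; the paper's exact IL definition may differ.

```python
def information_loss(generalized, domain_min, domain_max):
    """generalized: list of (low, high) intervals, one per record.
    Returns a score in [0, 100]; 0 means no generalization at all,
    100 means every record was widened to the full domain."""
    width = domain_max - domain_min
    per_record = [(hi - lo) / width for lo, hi in generalized]
    return 100.0 * sum(per_record) / len(per_record)

# Ages generalized to ranges over a domain of [0, 100]:
ranges = [(30, 40), (30, 40), (55, 60)]
il = information_loss(ranges, 0, 100)
```

Under a measure of this shape, the reported gap between ILs near 2 and ILs near 50 means the optimization-based solutions keep intervals dramatically tighter than the earlier heuristics.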

    On the Complexity of t-Closeness Anonymization and Related Problems

    An important issue in releasing individual data is to protect the sensitive information from being leaked and maliciously utilized. Famous privacy-preserving principles that aim to ensure both data privacy and data integrity, such as k-anonymity and l-diversity, have been extensively studied both theoretically and empirically. Nonetheless, these widely adopted principles are still insufficient to prevent attribute disclosure if the attacker has partial knowledge of the overall sensitive-data distribution. The t-closeness principle has been proposed to fix this, and it has the additional benefit of supporting numerical sensitive attributes. However, in contrast to k-anonymity and l-diversity, the theoretical aspects of t-closeness have not been well investigated. We initiate the first systematic theoretical study of the t-closeness principle under the commonly used attribute-suppression model. We prove that for every constant t with 0 ≤ t < 1, it is NP-hard to find an optimal t-closeness generalization of a given table. The proof consists of several reductions, each of which works for different values of t, which together cover the full range. To complement this negative result, we also provide exact and fixed-parameter algorithms. Finally, we answer some open questions regarding the complexity of k-anonymity and l-diversity left in the literature. Comment: An extended abstract to appear in DASFAA 201
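The requirement t-closeness formalizes can be sketched for a categorical sensitive attribute: each equivalence class's distribution of sensitive values must lie within distance t of the overall distribution. The original definition uses the Earth Mover's Distance; for categorical values under the equal-distance ground metric that reduces to the total variation distance used in this hedged sketch.

```python
from collections import Counter

def distribution(values):
    """Empirical distribution of a list of sensitive values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def variational_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def satisfies_t_closeness(classes, overall, t):
    """classes: list of equivalence classes, each a list of sensitive values."""
    q = distribution(overall)
    return all(variational_distance(distribution(c), q) <= t for c in classes)

overall = ["flu", "flu", "hiv", "flu", "hiv", "flu"]
classes = [["flu", "flu", "hiv"], ["flu", "hiv", "flu"]]
ok = satisfies_t_closeness(classes, overall, 0.1)
```

The hardness result above concerns the harder converse task: choosing which entries to suppress so that a check like this passes while losing as little information as possible.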

    Balancing between data utility and privacy preservation in data mining

    Data mining plays a vital role in today's information world, where it has been widely applied in various organizations. The current trend is to share data for mutual benefit. However, there has been a great deal of concern over privacy in recent years, and sharing raises a potential threat of revealing sensitive data about an individual when the data is released publicly. Various methods have been proposed to tackle the privacy-preservation problem, such as anonymization and perturbation, but the natural consequence of privacy preservation is information loss. The loss of specific information about certain individuals may affect the data quality, and in the extreme case the data may become completely useless. Methods like cryptography anonymize the dataset completely, rendering it useless: the utility of the data is entirely lost. We need to protect the private information while preserving the data utility as much as possible. The objective of this thesis is therefore to find an optimum balance between privacy and utility when publishing the dataset of any organization. Privacy preservation is a hard requirement that must be satisfied, and utility is the measure to be optimized. One method for preserving privacy is k-anonymization, which preserves privacy to a good extent. k-anonymity demands that every tuple in the released dataset be indistinguishably related to no fewer than k respondents. We used the k-means algorithm for clustering the dataset, followed by k-anonymization. Decision-stump classification is used to determine utility, and privacy is determined by firing random queries at the anonymized dataset. The balancing point is where the utility and privacy curves intersect or tend to converge; it will vary with the dataset and with the choice of quasi-identifier and sensitive attribute.
    For our experiment, the balancing point is found to be around 50-60 percent, the intersection point of the privacy and utility curves.
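One wrinkle in the cluster-then-anonymize pipeline described above is that k-means can produce clusters smaller than the anonymity parameter k. A hedged sketch (not the thesis code) of the repair step: repeatedly merge the smallest undersized cluster into the next-smallest cluster until every group has at least k members, as k-anonymity requires.

```python
def enforce_min_size(clusters, k):
    """clusters: list of lists of records (e.g. from k-means).
    Merge undersized clusters until every cluster has >= k members."""
    clusters = [list(c) for c in clusters if c]     # drop empties, copy
    while len(clusters) > 1 and min(len(c) for c in clusters) < k:
        clusters.sort(key=len)
        smallest = clusters.pop(0)                  # undersized cluster
        clusters[0].extend(smallest)                # merge into next-smallest
    return clusters

groups = [[1], [2, 3], [10, 11, 12]]
result = enforce_min_size(groups, k=3)
# the singleton and the pair are merged; both final groups have size 3
```

Merging into the next-smallest cluster is just one policy; a utility-aware variant would merge into the nearest cluster instead, at the cost of a distance computation.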

    Towards trajectory anonymization: a generalization-based approach

    Trajectory datasets are becoming popular due to the massive usage of GPS and location-based services. In this paper, we address privacy issues regarding the identification of individuals in static trajectory datasets. We first adapt the notion of k-anonymity to trajectories and propose a novel generalization-based approach for the anonymization of trajectories. We further show that releasing anonymized trajectories may still leak some private information. Therefore, we propose a randomization-based reconstruction algorithm for releasing anonymized trajectory data, and we also present how the underlying techniques can be adapted to other anonymity standards. Experimental results on real and synthetic trajectory datasets show the effectiveness of the proposed techniques.
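The generalization-based idea can be illustrated with a minimal sketch (not the paper's algorithm): snap each GPS point to a coarse grid cell, so that nearby trajectories generalize to the same cell sequence and become indistinguishable. The grid size here is an assumed parameter.

```python
def generalize_trajectory(points, cell_size):
    """points: list of (lat, lon) pairs in degrees.
    Returns the sequence of grid-cell indices the trajectory visits."""
    return [(int(lat // cell_size), int(lon // cell_size))
            for lat, lon in points]

# Two distinct trajectories that follow roughly the same route:
t1 = [(40.01, 29.02), (40.03, 29.08)]
t2 = [(40.04, 29.01), (40.02, 29.06)]

# With a 0.1-degree grid both generalize to the same cell sequence,
# so neither can be told apart from the other in the released data.
g1 = generalize_trajectory(t1, 0.1)
g2 = generalize_trajectory(t2, 0.1)
```

As the abstract notes, this alone is not sufficient: the released cell sequences may still leak information, which is what motivates the paper's additional randomization-based reconstruction step.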