    Privacy-preserving publishing of hierarchical data

    Many applications today rely on the storage and management of semi-structured information, for example, in XML databases and document-oriented databases. These data often have to be shared with untrusted third parties, which makes individuals' privacy a fundamental problem. In this article, we propose anonymization techniques for the privacy-preserving publishing of hierarchical data. We show that the problem of anonymizing hierarchical data poses unique challenges that cannot be readily solved by existing mechanisms. We extend two standards for privacy protection in tabular data (k-anonymity and ℓ-diversity) and apply them to hierarchical data. We present utility-aware algorithms that enforce these definitions of privacy using generalizations and suppressions of data values. To evaluate our algorithms and their heuristics, we experiment on synthetic and real datasets obtained from two universities. Our experiments show that our methods significantly outperform related methods that provide comparable privacy guarantees.
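
    A minimal Python sketch of the tabular building block may help: generalization- and suppression-based k-anonymity, the standard the article extends to hierarchical data. The one-level generalization hierarchy, the attribute names, and the toy records are hypothetical illustrations; this is not the authors' hierarchical algorithm.

        # Minimal sketch of generalization-based k-anonymity on tabular data;
        # the hierarchy and records below are hypothetical.
        from collections import Counter

        GENERALIZE = {  # hypothetical one-level value -> parent mappings
            "zip": {"14850": "148**", "14853": "148**", "90210": "902**"},
            "age": {"23": "20-29", "27": "20-29", "45": "40-49"},
        }

        def k_anonymize(records, quasi_ids, k):
            # Generalize quasi-identifiers one level, then suppress records
            # whose quasi-identifier combination still occurs fewer than k times.
            generalized = [
                {a: GENERALIZE.get(a, {}).get(v, v) for a, v in r.items()}
                for r in records
            ]
            counts = Counter(tuple(r[a] for a in quasi_ids) for r in generalized)
            return [r for r in generalized
                    if counts[tuple(r[a] for a in quasi_ids)] >= k]

        rows = [{"zip": "14850", "age": "23"}, {"zip": "14853", "age": "27"},
                {"zip": "90210", "age": "45"}]
        print(k_anonymize(rows, ["zip", "age"], k=2))  # the 90210 row is suppressed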

    Privacy-preserving learning analytics: challenges and techniques

    Graph-based modelling of query sets for differential privacy

    Differential privacy has gained attention from the community as a principal mechanism for privacy protection. Significant effort has focused on its application to data analysis, where statistical queries are submitted in batch and the answers to these queries are perturbed with noise. The magnitude of this noise depends on the privacy parameter ε and the sensitivity of the query set. However, computing the sensitivity is known to be NP-hard. In this study, we propose a method that approximates the sensitivity of a query set. Our solution builds a query-region-intersection graph, and we prove that computing the maximum clique size of this graph is equivalent to bounding the sensitivity from above. Our bounds are, to the best of our knowledge, the tightest known in the literature. Our solution currently supports a limited but expressive subset of SQL queries (i.e., range queries) and almost all popular aggregate functions directly (except AVERAGE). Experimental results show the efficiency of our approach: even for large query sets (e.g., more than 2K queries over 5 attributes), by utilizing a state-of-the-art solution for the maximum clique problem, we can approximate sensitivity in under a minute.
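
    The core construction lends itself to a short sketch. Assuming one-dimensional range COUNT queries (the paper supports multi-attribute ranges and more aggregates), each query is a region, overlapping regions share an edge, and a maximum clique bounds the sensitivity from above; in one dimension, the intervals in a clique share a common point, so the bound is tight. The query set below is hypothetical.

        # Rough sketch of the query-region-intersection graph for 1-D range
        # COUNT queries. One record can perturb exactly the queries whose
        # ranges cover it, so the maximum clique size bounds the L1
        # sensitivity from above. Not the paper's full construction.
        import networkx as nx

        queries = [(0, 10), (5, 15), (8, 20), (30, 40)]  # hypothetical ranges

        G = nx.Graph()
        G.add_nodes_from(range(len(queries)))
        for i, (lo1, hi1) in enumerate(queries):
            for j, (lo2, hi2) in enumerate(queries):
                if i < j and lo1 <= hi2 and lo2 <= hi1:  # intervals overlap
                    G.add_edge(i, j)

        clique, size = nx.max_weight_clique(G, weight=None)
        print("sensitivity bound:", size)  # 3: queries 0, 1, 2 all cover point 8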

    Explode: an extensible platform for differentially private data analysis

    Differential privacy (DP) has emerged as a popular standard for privacy protection and has received great attention from the research community. However, practitioners often find DP cumbersome to implement, since it requires additional protocols (e.g., for randomized response and noise addition) and changes to existing database systems. To avoid these issues, we introduce Explode, a platform for differentially private data analysis. The power of Explode comes from its ease of deployment and use: the data owner can install Explode on top of an SQL server without modifying any existing components. Explode then hosts a web application that allows users to conveniently perform many popular data analysis tasks through a graphical user interface, e.g., issuing statistical queries, classification, and correlation analysis. Explode automatically converts these tasks to collections of SQL queries and uses the techniques in [3] to determine the right amount of noise to add to satisfy DP while producing high-utility outputs. This paper describes the current implementation of Explode, together with potential improvements and extensions.
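
    The pattern Explode automates can be illustrated in a few lines. This is not Explode's code: the table, column, and parameters are hypothetical, and the noise calibration shown is the basic Laplace mechanism rather than the techniques in [3].

        # Run a statistical SQL query on an unmodified database, then
        # perturb the answer with Laplace noise of scale sensitivity/epsilon.
        import sqlite3
        import numpy as np

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE patients (age INTEGER)")
        conn.executemany("INSERT INTO patients VALUES (?)",
                         [(a,) for a in (25, 37, 41, 52, 60)])

        def private_count(conn, sql, epsilon, sensitivity=1.0):
            # COUNT has sensitivity 1: one added/removed row changes it by 1.
            true_answer = conn.execute(sql).fetchone()[0]
            return true_answer + np.random.laplace(scale=sensitivity / epsilon)

        print(private_count(conn, "SELECT COUNT(*) FROM patients WHERE age > 40",
                            epsilon=0.5))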

    Known sample attacks on relation preserving data transformations

    Many data mining applications such as clustering and k-NN search rely on distances and relations in the data. Thus, distance-preserving transformations, which perturb the data but retain the distances between records, have emerged as a prominent privacy protection method. In this paper, we present a novel attack on a generalized form of distance-preserving transformations, called relation-preserving transformations. Our attack exploits not the exact distances between data records, but the relationships between the distances. We show that an attacker with a few known samples (4 to 10) and direct access to relations can retrieve unknown data records with more than 95 percent precision. In addition, experiments demonstrate that simple methods of noise addition or perturbation are not sufficient to prevent our attack, as they decrease precision by only 10 percent.
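
    Why preserved relations leak information can be seen in a toy one-dimensional search. This is only an illustration of the principle, not the paper's attack, and the samples and domain are hypothetical: if the ordering of distances to known samples survives the transformation, every candidate inconsistent with that ordering can be eliminated.

        # Toy sketch: an adversary who knows a few plaintext samples and can
        # observe the *ranking* of the secret's distances to them eliminates
        # every candidate whose ranking differs, pinning the secret down.
        import numpy as np

        known = np.array([1.0, 4.0, 9.0])   # known plaintext samples (1-D)
        secret = 6.2                         # unknown record to recover

        # Relation observed from the transformed space: the distance ranking.
        observed_ranking = tuple(np.argsort(np.abs(known - secret)))

        # Keep only domain values consistent with that ranking.
        candidates = [x for x in np.linspace(0, 10, 1001)
                      if tuple(np.argsort(np.abs(known - x))) == observed_ranking]
        print(min(candidates), max(candidates))  # secret pinned to about (5, 6.5)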

    Location disclosure risks of releasing trajectory distances

    Location tracking devices enable trajectories to be collected for new services and applications such as vehicle tracking and fleet management. While trajectory data is a lucrative source for data analytics, it also contains sensitive and commercially critical information. This has led to the development of systems that enable privacy-preserving computation over trajectory databases, yet many such systems in fact (directly or indirectly) allow an adversary to compute the distance (or similarity) between two trajectories. We show that the use of such systems raises privacy concerns when the adversary has a set of known trajectories. Specifically, given a set of known trajectories and their distances to a private, unknown trajectory, we devise an attack that yields, with high confidence, the locations the private trajectory has visited. The attack can be used to disclose both positive results (i.e., the victim has visited a certain location) and negative results (i.e., the victim has not visited a certain location). Experiments on real and synthetic datasets demonstrate the accuracy of our attack.
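
    The underlying leak resembles classic lateration: distances to known objects constrain an unknown one. The paper's attack operates on trajectory distances and is considerably more involved; the sketch below only illustrates the principle for a single secret point, with hypothetical coordinates.

        # Toy lateration: three known anchor points and their Euclidean
        # distances to a secret point recover it by linear least squares.
        import numpy as np

        anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])  # known
        secret = np.array([3.0, 7.0])                                # unknown
        d = np.linalg.norm(anchors - secret, axis=1)                 # observed

        # Subtract the first circle equation from the others to linearize.
        A = 2 * (anchors[1:] - anchors[0])
        b = (d[0] ** 2 - d[1:] ** 2
             + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2))
        estimate, *_ = np.linalg.lstsq(A, b, rcond=None)
        print(estimate)  # ~[3. 7.]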

    Sensitivity analysis for non-interactive differential privacy: bounds and efficient algorithms

    Differential privacy (DP) has gained significant attention lately as the state of the art in privacy protection. It achieves privacy by adding noise to query answers. We study the problem of privately and accurately answering a set of statistical range queries in batch mode (i.e., under non-interactive DP). The noise magnitude in DP depends directly on the sensitivity of the query set, and calculating sensitivity has been proven NP-hard. Therefore, efficiently bounding the sensitivity of a given query set is still an open research problem. In this work, we propose upper bounds on sensitivity that are tighter than those in previous work. We also propose a formulation to exactly calculate sensitivity for a set of COUNT queries. However, implementing these bounds is impractical without sophisticated methods. We therefore introduce methods that build a graph model G from a query set Q, such that implementing the aforementioned bounds reduces to solving two well-known clique problems on G. We draw on the literature on these clique problems to realize our bounds efficiently. Experimental results show that for query sets with a few hundred queries, it takes only a few seconds to obtain results.
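
    For reference, the quantity whose exact computation is NP-hard is the L1 sensitivity of the query set; in standard textbook notation (not the paper's own),

        \Delta(Q) = \max_{D \sim D'} \sum_{q \in Q} \left| q(D) - q(D') \right|

    where D \sim D' ranges over neighboring databases differing in one record, and the Laplace mechanism adds noise of scale \Delta(Q)/\varepsilon to each answer.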

    Differentially private nearest neighbor classification

    Instance-based learning, and the k-nearest neighbors algorithm (k-NN) in particular, provide simple yet effective classification algorithms for data mining. Classifiers are often executed on sensitive information such as medical or personal data. Differential privacy has recently emerged as the accepted standard for privacy protection in sensitive data. However, straightforward applications of differential privacy to k-NN classification yield rather inaccurate results. Motivated by this, we develop algorithms to increase the accuracy of private instance-based classification. We first describe the radius neighbors classifier (r-N) and show that its accuracy under differential privacy can be greatly improved by a non-trivial sensitivity analysis. Then, for k-NN classification, we build algorithms that convert k-NN classifiers to r-N classifiers. We experimentally evaluate the accuracy of both classifiers using various datasets. Experiments show that our proposed classifiers significantly outperform both baseline private classifiers (i.e., straightforward applications of differential privacy) and classifiers executed on a dataset published using differential privacy. In addition, the accuracy of our proposed k-NN classifiers is at least comparable to, and in many cases better than, that of other differentially private machine learning techniques.
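
    A baseline illustration of the r-N idea under differential privacy (not the paper's optimized sensitivity analysis; the data, radius, and epsilon are hypothetical): perturb each per-class neighbor count with Laplace noise before taking the argmax. Adding or removing one record changes exactly one count by at most 1, so the count vector has sensitivity 1.

        # Minimal differentially private radius-neighbors (r-N) classifier.
        import numpy as np

        def private_rn_classify(X, y, query, r, epsilon, classes):
            dists = np.linalg.norm(X - query, axis=1)
            noisy = {c: np.sum((dists <= r) & (y == c))
                        + np.random.laplace(scale=1.0 / epsilon)
                     for c in classes}
            return max(noisy, key=noisy.get)  # class with largest noisy count

        X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.9]])
        y = np.array([0, 0, 1, 1])
        print(private_rn_classify(X, y, np.array([0.2, 0.1]), r=1.0,
                                  epsilon=1.0, classes=[0, 1]))  # usually 0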

    PRISM: A Web Server for Prediction and Visualization of Protein-Protein Interactions

    PRISM is a web server for the querying, visualization and analysis of protein interfaces and putative protein-protein interactions derived from known protein structures in the PDB. Putative interactions between proteins are predicted with an efficient algorithm that uses structural and evolutionary similarities. The algorithm seeks possible binary interactions between proteins (targets) through similar known interfaces (templates). The template dataset is a structurally and evolutionarily representative subset of the biological interfaces in the PDB. Starting with all ~50,000 interfaces available as of February 2006, 8205 distinct interface clusters are generated, each with a representative interface. After eliminating antigen-antibody complexes, peptides, ligands, synthetic proteins, membrane proteins, and interfaces having fewer than 3 hotspots (evolutionarily conserved residues on the interface) in each partner chain, and considering only biologically relevant interfaces, 1738 template interfaces are obtained. The target dataset is a sequentially non-redundant subset of all structures available in the PDB with less than 50% pairwise homology; it contains 16415 structures, of which 4952 are complexes and 11463 are monomers. Surfaces of the target proteins are extracted by invoking NACCESS; a residue is considered a surface residue if its relative surface accessibility is greater than 5%. The prediction algorithm rests on the premise that if two proteins contain regions similar to the complementary partners of a template interface, then these two proteins may interact through those complementary regions. Each template interface is split into its two complementary partner chains, and these partners are structurally aligned with the surfaces of the target proteins. Similarity is measured with a scoring function that has two parts: (i) an evolutionary similarity score and (ii) a structural similarity score. The evolutionary part includes the hotspot match ratio; the structural part includes the RMSD and the residue match ratio between the target protein and one partner of the template interface. The prediction algorithm yields 58817 potential interactions at a score threshold of 0.85. As biological evidence, we used the p53 (tumor suppressor protein) interaction network generated by Kohn et al. in 1999, which contains 94 interactions between 55 PDB structures. PRISM verified 84 of these 94 interactions at a score threshold of 0.5; these verified interactions support the consistency of our prediction algorithm. Using the PRISM web server, one can browse and query the template and target datasets and the predicted interactions. Another feature of PRISM is the visualization of the protein interaction network derived from the predicted binary interactions. This tool is based on a multi-modal network representation that incorporates relationships between entities such as interfaces, proteins, domains, and functional annotations; together with its automatic layout feature, this model helps identify biologically important network modules.
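
    The two-part scoring function can be sketched as follows. The abstract names its ingredients (hotspot match ratio, RMSD, residue match ratio) but not the exact formula, so the weights and the RMSD scaling below are hypothetical.

        # Hedged sketch of a PRISM-style match score: an evolutionary part
        # (hotspot match ratio) plus a structural part (residue match ratio
        # and an RMSD term). Weights and rmsd_cap are invented for illustration.
        def prism_style_score(hotspot_match_ratio, residue_match_ratio, rmsd,
                              w_evol=0.5, w_struct=0.5, rmsd_cap=2.0):
            evolutionary = hotspot_match_ratio
            structural = (0.5 * residue_match_ratio
                          + 0.5 * max(0.0, 1.0 - rmsd / rmsd_cap))
            return w_evol * evolutionary + w_struct * structural

        # A pair matching 2 of 3 hotspots and half the interface residues
        # at 1.0 A RMSD:
        print(prism_style_score(2 / 3, 0.5, 1.0))  # ~0.58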