An Efficient Technique to Secure Data Access for Multiple Domains using Overlapping Slicing
Data mining is the process of analysing data from different perspectives, summarizing it into useful information, and extracting the needed information from a database. Most enterprises collect and store data in large databases. Database privacy is an important responsibility of organizations, which must protect their clients' sensitive information because their clients trust them to do so. Various anonymization techniques have been proposed for the privacy of sensitive microdata; however, there is a trade-off between the level of privacy and the usefulness of the published data. Recently, slicing was proposed as a technique for anonymizing published datasets by partitioning the dataset vertically and horizontally. This paper proposes a technique to increase the utility and privacy of a sliced dataset by allowing overlapped slicing while still preventing membership disclosure; it also provides secure data access for multiple domains. This novel approach uses overlapped slicing to preserve data utility and privacy better than traditional slicing.
DOI: 10.17762/ijritcc2321-8169.16045
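The slicing idea behind this abstract can be illustrated with a small sketch: attributes are partitioned into column groups (vertical partition), tuples into buckets (horizontal partition), and sub-tuples are permuted within each bucket to break linkage across groups. The toy table, column groupings, and bucket size below are illustrative choices, not parameters from the paper; pandas is assumed for the table handling.

```python
# A minimal sketch of (overlapped) slicing on a toy table.
import random

import pandas as pd

random.seed(0)

df = pd.DataFrame({
    "Age":     [23, 27, 35, 52, 41, 60],
    "Zip":     ["47677", "47602", "47678", "47905", "47909", "47906"],
    "Disease": ["flu", "flu", "cancer", "cancer", "hiv", "flu"],
})

# Overlapped slicing: the attribute "Age" is placed in BOTH column groups,
# so correlations of Age with Zip and with Disease both survive.
column_groups = [["Age", "Zip"], ["Age", "Disease"]]
bucket_size = 3  # horizontal partition: tuples per bucket

published = []
for group in column_groups:
    parts = []
    for start in range(0, len(df), bucket_size):
        idx = list(df.index[start:start + bucket_size])
        # Permute whole sub-tuples inside the bucket: linkage across
        # column groups is broken, within-group values stay intact.
        perm = random.sample(idx, len(idx))
        parts.append(df.loc[perm, group].reset_index(drop=True))
    published.append(pd.concat(parts, ignore_index=True))

release = pd.concat(published, axis=1)  # side-by-side sliced table
print(release)
```

Because the permutation happens within buckets, aggregate statistics over each column group are exactly preserved, which is where the utility gain over full generalization comes from.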
Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances
Clinical data analysis could lead to breakthroughs. However, clinical data contain sensitive information about participants that could be utilized for unethical activities, such as blackmailing, identity theft, mass surveillance, or social engineering. Data anonymization is a standard step during data collection, before sharing, to overcome the risk of disclosure. However, conventional data anonymization techniques are not foolproof and also hinder the opportunity for personalized evaluations. Much research has been done for synthetic data generation using generative adversarial networks and many other machine learning methods; however, these methods are either not free to use or are limited in capacity. This study evaluates the performance of an emerging tool named synthpop, an R package producing synthetic data as an alternative approach for data anonymization. This paper establishes data standards derived from the original data set based on the utilities and quality of information and measures variations in the synthetic data set to evaluate the performance of the data synthesis process. The methods to assess the utility of the synthetic data set can be broadly divided into two approaches: general utility and specific utility. General utility assesses whether synthetic data have overall similarities in the statistical properties and multivariate relationships with the original data set. Simultaneously, the specific utility assesses the similarity of a fitted model’s performance on the synthetic data to its performance on the original data. The quality of information is assessed by comparing variations in entropy bits and mutual information to response variables within the original and synthetic data sets. 
The study reveals that the synthetic data passed all utility tests with statistically non-significant differences, and not only preserved the utilities but also preserved the complexity of the original data set according to the data standard established in this study. Therefore, synthpop fulfills all the necessities and opens a wide range of opportunities for the research community, including easy data sharing and information protection.
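The two utility notions above can be illustrated with a small Python analogue (synthpop itself is an R package; the naive column-wise bootstrap below stands in for a real synthesizer purely so there is something to compare, and the variable names are invented):

```python
# Illustrative comparison of general vs. specific utility.
import numpy as np

rng = np.random.default_rng(42)

n = 500
original = np.column_stack([
    rng.normal(50, 10, n),    # e.g. age
    rng.normal(120, 15, n),   # e.g. blood pressure
])

# Naive synthesis: independent bootstrap of each column (this destroys
# cross-column structure; real generators aim to preserve it).
synthetic = np.column_stack([
    rng.choice(original[:, j], size=n, replace=True) for j in range(2)
])

# General utility: compare marginal statistics of the two data sets.
mean_gap = np.abs(original.mean(axis=0) - synthetic.mean(axis=0))

# Specific utility: fit the same model (here, a least-squares slope of
# column 1 on column 0) on both data sets and compare the coefficients.
def slope(data):
    x, y = data[:, 0], data[:, 1]
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

coef_gap = abs(slope(original) - slope(synthetic))

print("marginal mean gaps:", mean_gap)
print("slope gap:", coef_gap)
```

General utility asks "do the distributions look alike?"; specific utility asks "would an analyst's fitted model come out the same?" — a synthesizer can pass the first while failing the second.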
On Analysis of Mixed Data Classification with Privacy Preservation
Privacy-preserving data classification is a pervasive task in privacy-preserving data mining (PPDM). The main goal is to prevent the identification of individuals from the released data in order to avoid privacy breaches. However, the goal of classification is accurate data classification. The problem is thus how to accurately mine large amounts of data, extracting relevant knowledge while at the same time protecting the sensitive information in the database. One approach is to anonymize the data set containing individuals' sensitive information before releasing it for data analysis. In this paper, we analyze the proposed method Microaggregation-based Classification Tree (MiCT), which uses the properties of a decision tree for privacy-preserving classification of mixed data. The evaluations are based on various privacy models developed with the different situations that may arise during data analysis in mind.

Keywords: microaggregation, decision tree, mixed data, data perturbation, classification accuracy, anonymous data
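The microaggregation building block can be sketched in a few lines: values are sorted, grouped into clusters of at least k records, and each value is replaced by its group mean, so no published value belongs to fewer than k individuals. This is a minimal univariate sketch with a fixed-size grouping heuristic; MiCT combines the idea with a decision tree and handles mixed data, which is not reproduced here.

```python
# Minimal univariate microaggregation with minimum group size k.
import numpy as np

def microaggregate(values, k=3):
    """Replace each value by the mean of its size->=k sorted group."""
    order = np.argsort(values)
    out = np.empty_like(values, dtype=float)
    n = len(values)
    start = 0
    while start < n:
        # The last group absorbs the remainder so every group has >= k members.
        end = n if n - start < 2 * k else start + k
        group = order[start:end]
        out[group] = values[group].mean()
        start = end
    return out

ages = np.array([21, 22, 23, 35, 36, 37, 60, 61])
agg = microaggregate(ages, k=3)
print(agg)
```

Each published value is now an average over at least k = 3 records, trading a controlled amount of within-group information loss for k-anonymity on that attribute.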
Cor-Split: Defending Privacy in Data Re-Publication from Historical Correlations and Compromised Tuples
Abstract. Several approaches have been proposed for privacy preserving data publication. In this paper we consider the important case in which a certain view over a dynamic dataset has to be released a number of times during its history. The insufficiency of techniques used for one-shot publication in the case of subsequent releases has been previously recognized, and some new approaches have been proposed. Our research shows that relevant privacy threats, not recognized by previous proposals, can occur in practice. In particular, we show the cascading effects that a single (or a few) compromised tuples can have in data re-publication when coupled with the ability of an adversary to recognize historical correlations among released tuples. A theoretical study of the threats leads us to a defense algorithm, implemented as a significant extension of the m-invariance technique. Extensive experiments using publicly available datasets show that the proposed technique preserves the utility of published data and effectively protects from the identified privacy threats.
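The m-invariance property that this work extends can be made concrete with a small checker: a tuple that appears in several releases must always sit in a bucket whose signature (set of distinct sensitive values) is identical across those releases and contains at least m values. The release encoding below is an invented toy format, not the paper's; it only illustrates the invariant being defended.

```python
# Hedged sketch: checking m-invariance across a sequence of releases.
def signatures(release):
    """release: list of (tuple_id, bucket_id, sensitive_value) triples."""
    sig = {}
    for _, bid, s in release:
        sig.setdefault(bid, set()).add(s)
    return {tid: frozenset(sig[bid]) for tid, bid, _ in release}

def is_m_invariant(releases, m):
    seen = {}
    for release in releases:
        for tid, s in signatures(release).items():
            if len(s) < m:
                return False  # bucket not diverse enough
            if seen.setdefault(tid, s) != s:
                return False  # signature changed between releases
    return True

r1 = [(1, "a", "flu"), (2, "a", "hiv"), (3, "b", "flu"), (4, "b", "cancer")]
r2 = [(1, "c", "flu"), (2, "c", "hiv"), (4, "d", "cancer"), (5, "d", "flu")]
print(is_m_invariant([r1, r2], m=2))
```

The threat studied in the paper is precisely what this check cannot see on its own: once one tuple in a bucket is compromised, its signature shrinks for the adversary, and historical correlations let the effect cascade to other tuples.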
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method
Introduction: The amount of data generated by original research is growing exponentially. Publicly releasing them is recommended to comply with the Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data represent a promising answer to this challenge. This approach is explored by the French Centre de Recherche en Épidémiologie et Santé des Populations in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.

Materials and Methods: Our synthesis framework consists of four successive steps, each of which is designed to prevent specific risks of disclosure. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.

Results: Computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling out and linkability attacks. Distributional and inferential similarity with the original data were high with all datasets.

Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. Formal and empirical tools specifically developed for this demonstration are a valuable contribution to this field. Further research should focus on the extension and validation of these tools, in an effort to specify the intrinsic qualities of alternative data synthesis methods.

Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which seems ripe for full-scale implementation.
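The CART-based synthesis at the core of such frameworks can be sketched as follows: the first variable is bootstrap-resampled, and each later variable is generated by fitting a regression tree on the already-synthesized variables and resampling observed values within the matching leaf. This is a generic two-variable sketch using scikit-learn with invented data; the four-step Open-CESP pipeline and its distance-based filtering are not reproduced here.

```python
# Hedged sketch of sequential CART-based synthetic data generation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

n = 400
age = rng.normal(45, 12, n)
bp = 90 + 0.6 * age + rng.normal(0, 8, n)  # blood pressure depends on age
original = np.column_stack([age, bp])

# Step 1: the first variable is synthesized by bootstrap resampling.
syn_age = rng.choice(age, size=n, replace=True)

# Step 2: each later variable is modeled from the already-synthesized
# ones with a CART; resampling observed values inside the matching leaf
# reinstates the natural variability around the tree's predictions.
tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
tree.fit(age.reshape(-1, 1), bp)
leaf_of = tree.apply(age.reshape(-1, 1))       # leaf of each training row
syn_leaf = tree.apply(syn_age.reshape(-1, 1))  # leaf of each synthetic row
syn_bp = np.array([rng.choice(bp[leaf_of == leaf]) for leaf in syn_leaf])
synthetic = np.column_stack([syn_age, syn_bp])

print("original corr:", round(np.corrcoef(original.T)[0, 1], 2))
print("synthetic corr:", round(np.corrcoef(synthetic.T)[0, 1], 2))
```

Because synthetic values are drawn leaf by leaf, the age–blood-pressure relationship survives into the synthetic data; a disclosure-filtering step, like the distance-based one described above, would then discard synthetic records that sit too close to real ones.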