
    A universal global measure of univariate and bivariate data utility for anonymised microdata

    This paper presents a new global data utility measure based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings: they are limited to continuous variables, to univariate utility assessment, or to local information loss measurements. The proposed global data utility model addresses these limitations. It combines univariate and bivariate data utility measures, which calculate information loss using statistical tests and association measures such as the two-sample Kolmogorov–Smirnov test, the chi-squared test (Cramér's V), the ANOVA F test (eta squared), the Kruskal–Wallis H test (epsilon squared), Spearman's rank correlation coefficient (rho) and Pearson's correlation coefficient (r). The model is universal: it also includes new local utility measures for global recoding and variable-removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to the generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: checking the assumptions of the statistical tests, and establishing the statistical significance, direction and strength (effect size) of the association. Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be completed with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups.
Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop program code or an R package for measuring data utility, and to establish the relationship between the univariate, bivariate and multivariate data utility of anonymised data.
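The univariate and bivariate utility checks named in this abstract can be illustrated with standard statistical libraries. The sketch below, using SciPy with made-up data, compares an original and a perturbed continuous variable with the two-sample Kolmogorov–Smirnov test, and computes Cramér's V from a chi-squared test on an illustrative contingency table; it is a minimal example of the named measures, not the paper's actual model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
orig = rng.normal(50, 10, 1000)          # original continuous variable (synthetic)
anon = orig + rng.normal(0, 2, 1000)     # perturbed (anonymised) version

# Univariate utility: two-sample Kolmogorov-Smirnov test on the distributions
ks_stat, ks_p = stats.ks_2samp(orig, anon)

# Bivariate utility for categorical variables: chi-squared test with
# Cramer's V as the effect-size (strength of association) measure
table = np.array([[30, 20], [25, 25]])   # toy contingency table of two recoded variables
chi2, p, dof, _ = stats.chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
```

A full utility model would compare such statistics between the original and anonymised files and aggregate the differences into a global score.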

    Min Max Normalization Based Data Perturbation Method for Privacy Protection

    Data mining systems contain large amounts of private and sensitive data, such as healthcare, financial and criminal records. Because these data cannot be shared with everyone, privacy protection is required in data mining systems to avoid privacy leakage. Data perturbation is one of the best-known methods for privacy preservation; we use it to preserve both privacy and accuracy. In this method, individual data values are distorted before the data mining application is run. In this paper, we present a data perturbation method based on the min-max normalization transformation. Privacy parameters are used to measure the degree of privacy protection, while a utility measure shows the performance of the data mining technique after data distortion. We performed experiments on a real-life dataset, and the results show that the min-max normalization based data perturbation method effectively protects confidential information while maintaining the performance of the data mining technique after data distortion.
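Min-max normalization itself is a standard transformation; a minimal sketch of using it as a perturbation step is shown below. The function name, the `[new_min, new_max]` range, and the salary data are illustrative assumptions, not the paper's implementation — the idea is only that raw magnitudes are distorted while relative order (and hence much data mining utility) is preserved.

```python
import numpy as np

def min_max_perturb(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max].

    Raw magnitudes are hidden unless the original min/max are known,
    but the ordering of records is preserved.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo) * (new_max - new_min) + new_min

salaries = np.array([30000, 45000, 52000, 81000])   # toy confidential values
masked = min_max_perturb(salaries)                  # released in place of the originals
```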

    Feedback-based integration of the whole process of data anonymization in a graphical interface

    The interactive, web-based point-and-click application presented in this article allows anonymizing data without any knowledge of a programming language. Anonymization plays an important role in data mining, but creating safe, anonymized data is by no means a trivial task. Both methodological issues and the know-how of subject matter specialists should be taken into account when anonymizing data. Even though specialized software such as sdcMicro exists, it is often difficult for non-experts in a particular software package, and for users without programming skills, to anonymize datasets without an appropriate app. The presented app is not restricted to applying disclosure limitation techniques; rather, it facilitates the entire anonymization process. The interface allows users to upload data to the system, modify them, and create an object defining the disclosure scenario. Once such a statistical disclosure control (SDC) problem has been defined, users can apply anonymization techniques to this object and get instant feedback on the impact on risk and data utility after SDC methods have been applied. Additional features, such as an undo button, the possibility to export the anonymized dataset or the code required for reproducibility, as well as its interactive features, make it convenient for both experts and non-experts in R, the free software environment for statistical computing and graphics, to protect a dataset using this app.

    A risk model for privacy in trajectory data

    Time sequence data relating to users, such as medical histories and mobility data, are good candidates for data mining, but often contain highly sensitive information. Different methods in privacy-preserving data publishing are used to release such private data so that individual records in the released data cannot be re-linked to specific users with a high degree of certainty. These methods provide theoretical worst-case privacy risks as measures of the privacy protection they offer. However, with many real-world data the worst-case scenario is too pessimistic and does not provide a realistic view of the privacy risks: the real probability of re-identification is often much lower than the theoretical worst-case risk. In this paper, we propose a novel empirical risk model for privacy which, by relating risk to the cost of privacy attacks, better demonstrates the practical risks associated with a privacy-preserving data release. We present a detailed evaluation of the proposed risk model using k-anonymised real-world mobility data, and then show how the empirical privacy risk follows a different trend in synthetic data describing random movements.
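The gap between worst-case and empirical risk can be made concrete. In k-anonymised data a record's re-identification probability is at most 1/k (the worst case), but empirically it is 1 over the size of the record's equivalence class, which is often larger than k. The toy sketch below, with made-up trajectory prefixes standing in for quasi-identifiers, is only an illustration of that observation, not the paper's risk model.

```python
from collections import Counter

# Quasi-identifier tuples after anonymisation (toy trajectory prefixes);
# records sharing a tuple form one equivalence class.
records = [("A", "B"), ("A", "B"), ("A", "B"), ("C", "D"), ("C", "D")]
class_sizes = Counter(records)

# Theoretical worst case: 1 / (smallest class size, i.e. the k in k-anonymity)
worst_case = 1 / min(class_sizes.values())

# Empirical average risk: mean of 1/|class| over all records
avg_risk = sum(1 / class_sizes[r] for r in records) / len(records)
```

Here the worst-case risk is 0.5 while the average empirical risk is only 0.4; on realistic data the gap is typically much wider.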

    Statistical Disclosure Control for Micro-Data Using the R Package sdcMicro

    The demand for data from surveys, censuses or registers containing sensitive information on people or enterprises has increased significantly in recent years. However, before data can be provided to the public or to researchers, confidentiality has to be respected for any dataset possibly containing sensitive information about individual units. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data in order to decrease the disclosure risk. The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package performs automated recalculation of frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational cost to work with large datasets. Reporting facilities that summarize the anonymization process can also be easily used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test dataset distributed by the International Household Survey Network.
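sdcMicro itself is an R package, but the frequency counts it recalculates after each anonymisation step are easy to illustrate in a language-neutral way. The Python sketch below, with invented quasi-identifier records, counts how often each key combination occurs; records whose combination is sample-unique (count 1) are the ones that drive individual disclosure risk.

```python
from collections import Counter

# Toy quasi-identifier records: (sex, age, region)
keys = [
    ("m", 34, "rural"),
    ("m", 34, "rural"),
    ("f", 51, "urban"),   # sample-unique -> high individual risk
]

fk = Counter(keys)                          # sample frequency of each key combination
uniques = [k for k in keys if fk[k] == 1]   # sample-unique records
```

Recoding or suppression increases these counts, which is exactly the effect the package's automated recalculation tracks.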

    Preserving Privacy Against Side-Channel Leaks

    Privacy-preserving issues have received significant attention in various domains, and various models and techniques have been proposed to achieve optimal privacy at minimal cost. However, side-channel leakages (such as publicly known data publishing algorithms, observable traffic information in web applications, and fine-grained readings in smart metering) further complicate the process of privacy preservation. In this thesis, we make a first effort to investigate a general framework for modeling side-channel attacks across different domains and apply the framework to various categories of applications. In privacy-preserving data publishing with publicly known algorithms, we first theoretically study a generic strategy independent of data utility measures and syntactic privacy properties, and then propose an efficient approach to preserving diversity. In privacy-preserving traffic padding (PPTP) in web applications, we first propose a formal PPTP model to quantify privacy and costs, based on the key observation that data publishing and traffic padding are similar problems, and then introduce randomness into previous solutions to provide a background-knowledge-resistant privacy guarantee. In privacy-preserving smart metering, we propose a lightweight approach to simultaneously preserving privacy in both billing and consumption aggregation, based on the key observation that the privacy issue extends beyond the fine-grained readings.
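One common form of traffic padding can be sketched concretely: observable object sizes are rounded up to the next multiple of a fixed bucket, so several distinct objects become indistinguishable to an eavesdropper. The bucket size and the example sizes below are illustrative assumptions, not parameters from the thesis.

```python
import math

def pad_size(size: int, bucket: int = 512) -> int:
    """Round an observable size up to the next multiple of `bucket`,
    so objects within the same bucket share one observable size."""
    return math.ceil(size / bucket) * bucket

sizes = [100, 480, 500, 1300]            # true object sizes (bytes)
padded = [pad_size(s) for s in sizes]    # [512, 512, 512, 1536]
```

The first three objects collapse into a single observable size, at the cost of the extra bytes transmitted; the PPTP model described above quantifies exactly this privacy-versus-cost trade-off.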