
    Privacy preserving data mining

    A fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. Specifically, we address the following question: since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? We analyze the possibility of privacy in data mining techniques in two phases: randomization and reconstruction. Data mining services require accurate input data for their results to be meaningful, but privacy concerns may lead users to provide spurious information. To preserve client privacy in the data mining process, techniques based on random perturbation of data records are used. Suppose there are many clients, each holding some personal information, and one server, which is interested only in aggregate, statistically significant properties of this information. The clients can protect the privacy of their data by perturbing it with a randomization algorithm and then submitting the randomized version. This approach is called randomization. The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. For the concept of using value distortion to protect privacy to be useful, we need to be able to reconstruct the original data distribution so that data mining techniques can be effectively applied to yield the required statistics.

    Analysis. Let x_i be the original instance of data at client i. We introduce a random shift y_i using the randomization technique explained below. The server runs the reconstruction algorithm (also explained below) on the perturbed values z_i = x_i + y_i to obtain an approximation of the original data distribution suitable for data mining applications.

    Randomization. We used the following randomizing operator for data perturbation: given x, let R(x) = x + ε (mod 1001), where ε is chosen uniformly at random from {-100, ..., 100}.

    Reconstruction of a discrete data set. Given P(X = x) = f_X(x), P(Y = y) = f_Y(y), and P(Z = z) = f_Z(z), where Z = X + Y and the noise Y is generated independently of X:

        f_{X|Z}(x | z) = P(X = x | Z = z)
                       = P(X = x, Z = z) / P(Z = z)
                       = P(X = x, X + Y = z) / f_Z(z)
                       = P(X = x, Y = z - x) / f_Z(z)
                       = P(X = x) * P(Y = z - x) / f_Z(z)      (by independence)
                       = f_X(x) * f_Y(z - x) / f_Z(z)

    Results. In this project we carried out both aspects of privacy-preserving data mining: the first phase perturbs the original data set using the randomization operator, and the second phase reconstructs the randomized data set using the proposed algorithm to obtain an approximation of the original distribution. Performance metrics such as percentage deviation, accuracy, and privacy breaches were calculated. We studied the technical feasibility of realizing privacy-preserving data mining; the basic premise is that the sensitive values in a user's record are perturbed with a randomizing function, and an approximation of the original distribution is then recovered from the perturbed data set by the reconstruction algorithm.
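    To make the two phases concrete, here is a minimal Python sketch, assuming the discrete domain {0, ..., 1000} and uniform noise on {-100, ..., 100} stated above. It is an illustration, not the project's code: the function names, the FFT-based circular convolution, and the iterative Bayesian re-estimation scheme are assumptions layered on the posterior formula.

        import numpy as np

        DOMAIN = 1001   # values live in {0, ..., 1000}
        NOISE = 100     # epsilon is uniform on {-100, ..., 100}

        def randomize(xs, rng):
            """R(x) = (x + eps) mod 1001 with eps uniform in {-100, ..., 100}."""
            eps = rng.integers(-NOISE, NOISE + 1, size=len(xs))
            return (xs + eps) % DOMAIN

        def noise_pmf():
            """f_Y as a probability vector over the cyclic domain."""
            f_y = np.zeros(DOMAIN)
            for e in range(-NOISE, NOISE + 1):
                f_y[e % DOMAIN] = 1.0 / (2 * NOISE + 1)
            return f_y

        def reconstruct(zs, iters=100):
            """Iteratively re-estimate f_X from the perturbed values zs, weighting
            by the posterior P(X=x | Z=z) ~ f_X(x) * f_Y(z - x) derived above."""
            f_y = noise_pmf()
            F_y = np.fft.fft(f_y)
            f_x = np.full(DOMAIN, 1.0 / DOMAIN)                # uniform prior
            emp = np.bincount(zs, minlength=DOMAIN) / len(zs)  # empirical f_Z
            for _ in range(iters):
                # model's f_Z: circular convolution of f_X and f_Y
                f_z = np.real(np.fft.ifft(np.fft.fft(f_x) * F_y))
                ratio = np.divide(emp, f_z, out=np.zeros_like(emp),
                                  where=f_z > 1e-12)
                # back-project: f_x(a) *= sum_z f_Y(z - a) * ratio(z)
                f_x = f_x * np.real(np.fft.ifft(np.fft.fft(ratio) * np.conj(F_y)))
                f_x = np.clip(f_x, 0.0, None)
                f_x /= f_x.sum()
            return f_x

        rng = np.random.default_rng(0)
        xs = rng.integers(400, 600, size=20000)   # toy "original" client data
        f_hat = reconstruct(randomize(xs, rng))   # mass concentrates near 400-600

    Comparing f_hat against the empirical distribution of xs would then give the kind of percentage-deviation and accuracy figures the results phase reports.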

    Verification in Privacy Preserving Data Publishing

    Privacy-preserving data publication is a major concern for both the owners of data and the data publishers. Principles like k-anonymity and l-diversity have been proposed to reduce privacy violations. On the other hand, no studies were found on verification of the anonymized data in terms of adversarial breach and anonymity levels. The anonymized data is still prone to attacks due to the presence of dependencies among quasi-identifiers and sensitive attributes. This paper presents a novel framework to detect the existence of those dependencies and a solution to reduce them. The advantages of our approach are that i) privacy violations can be detected, ii) the extent of privacy risk can be measured, and iii) re-anonymization can be done on vulnerable blocks of data. The work is further extended to show how the adversary's breach knowledge increases as new tuples are added, and an on-the-fly solution to reduce it is discussed. Experimental results are reported and analyzed.
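    As a rough illustration of measuring risk per anonymized block, the sketch below (not the paper's actual framework; the entropy criterion and threshold are assumptions) flags equivalence classes whose sensitive-attribute distribution is too skewed, i.e. candidates for re-anonymization.

        from collections import Counter
        import math

        def block_entropy(sensitive_values):
            """Shannon entropy (bits) of the sensitive attribute in one block;
            low entropy means an adversary can predict the value."""
            n = len(sensitive_values)
            return -sum((c / n) * math.log2(c / n)
                        for c in Counter(sensitive_values).values())

        def flag_vulnerable_blocks(blocks, min_entropy=1.0):
            """Indices of blocks whose sensitive values are too predictable."""
            return [i for i, b in enumerate(blocks)
                    if block_entropy(b) < min_entropy]

        # Block 0 is diverse; block 1 leaks (all tuples share one disease).
        blocks = [["flu", "hiv", "flu", "cancer"], ["hiv", "hiv", "hiv"]]
        print(flag_vulnerable_blocks(blocks))   # -> [1]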

    On the use of economic price theory to determine the optimum levels of privacy and information utility in microdata anonymisation

    Get PDF
    Statistical data, such as in the form of microdata, is used by different organisations as a basis for creating knowledge to assist in their planning and decision-making activities. However, before microdata can be made available for analysis, it needs to be anonymised in order to protect the privacy of the individuals whose data is released. The protection of privacy requires us to hide or obscure the released data. On the other hand, making data useful for its users implies that we should provide data that is accurate, complete and precise. Ideally, we should maximise both the level of privacy and the level of information utility of a released microdata set. However, as we increase the level of privacy, the level of information utility decreases. Without guidelines for selecting the optimum levels of privacy and information utility, it is difficult to determine the optimum balance between the two goals. The objective and constraints of this optimisation problem can be captured naturally with concepts from Economic Price Theory. In this thesis, we present an approach based on Economic Price Theory for guiding the process of microdata anonymisation such that optimum levels of privacy and information utility are achieved.
    Thesis (PhD), Computer Science, University of Pretoria, 2010.
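    The trade-off described in the abstract can be pictured with a toy model (not the thesis's actual formulation; the functional forms are assumptions): privacy benefit grows with diminishing returns while information utility falls, and the optimum sits where the marginal gain in privacy equals the marginal loss in utility.

        import numpy as np

        p = np.linspace(0.0, 1.0, 1001)   # anonymisation/privacy level
        benefit = np.sqrt(p)              # diminishing returns to more privacy
        utility = 1.0 - p ** 2            # accelerating loss of information utility

        p_opt = p[np.argmax(benefit + utility)]
        # Marginal condition: 1 / (2 * sqrt(p)) = 2 * p  =>  p = 4 ** (-2/3) ~ 0.40
        print(f"optimum privacy level ~ {p_opt:.2f}")   # -> ~0.40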