A Data Perturbation Approach to Privacy Protection in Data Mining
Advances in data mining techniques have raised growing concerns about the privacy of personal information. Organizations that use their customers’ records in data mining activities are forced to take actions to protect the privacy of the individuals involved. A common practice for many organizations today is to remove the identity-related attributes from customer records before releasing them to data miners or analysts. In this study, we investigate the effect of this practice and demonstrate that a majority of the records in a dataset can be uniquely identified even after identity-related attributes are removed. We propose a data perturbation method that can be used by organizations to prevent such unique identification of individual records while still providing the data to analysts for data mining. The proposed method attempts to preserve the statistical properties of the data based on privacy protection parameters specified by the organization. We show that the problem can be solved in two phases, with a linear programming formulation in phase one (to preserve the marginal distribution), followed by a simple Bayes-based swapping procedure in phase two (to preserve the joint distribution). The proposed method is compared with a random perturbation method in terms of classification performance on two real-world datasets. The results of the experiments indicate that it significantly outperforms the random method.
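The paper's actual method combines a linear program with a Bayes-based swapping procedure; as a much simpler illustration of why swapping is attractive for this purpose, the sketch below (not the paper's algorithm; all names are hypothetical) swaps a quasi-identifier between random pairs of records, which leaves every column's marginal distribution exactly unchanged while breaking the attribute combinations that make records uniquely identifiable:

```python
import random

def swap_attribute(records, attr, swap_fraction=0.3, seed=0):
    """Randomly swap `attr` values between pairs of records.

    Swapping within one column preserves that column's marginal
    distribution exactly; only the joint distribution with the other
    columns is perturbed.
    """
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    n_pairs = int(len(records) * swap_fraction) // 2
    perturbed = [dict(r) for r in records]  # copy; leave input intact
    for k in range(n_pairs):
        i, j = idx[2 * k], idx[2 * k + 1]
        perturbed[i][attr], perturbed[j][attr] = (
            perturbed[j][attr], perturbed[i][attr])
    return perturbed
```

The paper's phase-two procedure is more careful than this, using Bayes-based swapping to also preserve the joint distribution up to the organization's privacy parameters.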
The misty crystal ball: Efficient concealment of privacy-sensitive attributes in predictive analytics
Individuals are becoming increasingly concerned with privacy. This curtails their willingness to share sensitive attributes like age, gender or personal preferences; yet firms largely rely upon customer data in any type of predictive analytics. Hence, organizations are confronted with a dilemma in which they must trade off a sparse use of data against the utility of better predictive analytics. This paper proposes a masking mechanism that obscures sensitive attributes while maintaining a large degree of predictive power. More precisely, we efficiently identify data partitions that are best suited for (i) shuffling, (ii) swapping and, as a form of randomization, (iii) perturbing attributes by conditional replacement. By operating on data partitions that are derived from a predictive algorithm, we achieve the objective of masking privacy-sensitive attributes with marginal downsides for predictive modeling. The resulting trade-off between masking and predictive utility is empirically evaluated in the context of customer churn where, for instance, a stratified shuffling of attribute values rarely reduces predictive accuracy by more than a percentage point. Our proposed framework entails direct managerial implications, as a growing share of firms adopts predictive analytics and thus requires mechanisms that better adhere to user demands for information privacy.
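To make the stratified-shuffling idea concrete, here is a minimal sketch (an assumption-laden illustration, not the paper's implementation): the sensitive attribute is shuffled only among records in the same partition, where `partition_of` is an assumed function mapping a record to its partition label (in the paper, partitions are derived from a predictive algorithm). Because values never leave their partition, the attribute's relationship to the model's predictions is largely preserved while individual values are decoupled from identities:

```python
import random
from collections import defaultdict

def stratified_shuffle(records, sensitive, partition_of, seed=0):
    """Shuffle `sensitive` values within each partition of the data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, r in enumerate(records):
        groups[partition_of(r)].append(i)
    out = [dict(r) for r in records]  # copy; leave input intact
    for idx in groups.values():
        vals = [records[i][sensitive] for i in idx]
        rng.shuffle(vals)  # permute values inside this partition only
        for i, v in zip(idx, vals):
            out[i][sensitive] = v
    return out
```

Within each partition the multiset of sensitive values is unchanged, which is why aggregate statistics used by a per-partition predictor survive the masking.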
Identity Disclosure Protection: A Data Reconstruction Approach for Preserving Privacy in Data Mining
User's Privacy in Recommendation Systems Applying Online Social Network Data, A Survey and Taxonomy
Recommender systems have become an integral part of many social networks and extract knowledge from a user's personal and sensitive data, both explicitly, with the user's knowledge, and implicitly. This trend has created major privacy concerns, as users are mostly unaware of what data is being used, how much of it, and how securely. In this context, several works have addressed privacy concerns arising from the use of online social network data and from recommender systems. This paper surveys the main privacy concerns, measurements, and privacy-preserving techniques used in large-scale online social networks and recommender systems. It builds on prior work on security, privacy preservation, statistical modeling, and datasets to provide an overview of the technical difficulties and problems associated with privacy preservation in online social networks.
Comment: 26 pages, IET book chapter on big data recommender system
Why They Self-Disclose? Examining Factors Influencing People's Personal Information Disclosure in Online Healthcare Communities (Research-in-Progress)
Online healthcare communities (OHCs) encourage people to disclose their personal information to others in order to seek support, accelerate research, and help create better treatments. However, disclosing personal information can expose individuals to privacy risks. This paper aims to explore which factors affect people's intention to disclose personal information in OHCs, and how. Based on the "risk-motivation" perspective, we identify perceived usefulness as an extrinsic motivation and social support as an intrinsic motivation, and distinguish four kinds of risk, testing the effects of these motivation and risk factors on people's personal information disclosure intention in OHCs. Two constructs describing the characteristics of OHCs, expected disease severity and common identity, are posited to moderate the effects of the motivation and risk factors. The theoretical contribution of this paper is a model that explains people's personal information disclosure intention in OHCs and integrates constructs describing the characteristics of OHCs; the practical implication is insight for OHC managers into operating communities for viability while protecting members' privacy. Finally, limitations and future work are also presented.
Sharing Patient Disease Data with Privacy Preservation
When patient data are shared for studying a specific disease, a privacy disclosure occurs as soon as an individual is known to be in the shared data. Individuals in such specific disease data are thus subject to higher disclosure risk than those in datasets covering different diseases. This problem has been overlooked in privacy research and practice. In this study, we analyze disclosure risks for this problem and identify appropriate risk measures. An efficient algorithm is developed for anonymizing the data. An experimental study is conducted to demonstrate the effectiveness of the proposed approach.
Customer-Base Analysis using Repeated Cross-Sectional Summary (RCSS) Data
We address a critical question that many firms are facing today: Can customer data be stored and analyzed in an easy-to-manage and scalable manner without significantly compromising the inferences that can be made about the customers’ transaction activity? We address this question in the context of customer-base analysis. A number of researchers have developed customer-base analysis models that perform very well given detailed individual-level data. We explore the possibility of estimating these models using aggregated data summaries alone, namely repeated cross-sectional summaries (RCSS) of the transaction data. Such summaries are easy to create, visualize, and distribute, irrespective of the size of the customer base. An added advantage of the RCSS data structure is that individual customers cannot be identified, which makes it desirable from a data privacy and security viewpoint as well. We focus on the widely used Pareto/NBD model and carry out a comprehensive simulation study covering a vast spectrum of market scenarios. We find that the RCSS format of four quarterly histograms serves as a suitable substitute for individual-level data. We confirm the results of the simulations on a real dataset of purchases from an online fashion retailer.
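A repeated cross-sectional summary of the kind described can be sketched in a few lines (a hypothetical illustration, not the authors' code; the input format is an assumption): for each quarter, count how many customers made 0, 1, 2, … transactions. The resulting histograms contain no customer identifiers, yet, per the paper, four such quarterly histograms suffice to estimate models like Pareto/NBD:

```python
from collections import Counter

def rcss_histograms(transactions, customers, quarters):
    """Build per-quarter histograms of transaction counts.

    `transactions` is a list of (customer_id, quarter) pairs;
    `customers` lists the full customer base so that customers with
    zero transactions in a quarter are counted in the 0 bucket.
    """
    histograms = {}
    for q in quarters:
        per_customer = Counter(c for c, qq in transactions if qq == q)
        # histogram: number of customers with each transaction count
        counts = Counter(per_customer.get(c, 0) for c in customers)
        histograms[q] = dict(counts)
    return histograms
```

Note that the zero-transaction bucket is essential: models of customer activity need to know how many customers were silent in each period, not just how the active ones behaved.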
Protecting Privacy Against Regression Attacks in Predictive Data Mining
Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a regression attack, has been overlooked in the literature. Existing privacy-preserving techniques are not appropriate for coping with this problem. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach adopts a novel measure that considers the tradeoff between disclosure risk and data utility in a regression tree pruning process. We also propose a dynamic value-concatenation method, which overcomes the limitation of requiring a user-defined generalization hierarchy in traditional k-anonymity approaches. Our approach can be used for anonymizing both numeric and categorical data. An experimental study is conducted to demonstrate the effectiveness of the proposed approach.
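The disclosure risk behind a regression attack can be illustrated without fitting an actual tree (a simplified sketch under assumed field names, not the paper's method): a tree leaf effectively groups records that share the same public attributes, and whenever the sensitive value is constant within such a group, an attacker who knows only the public attributes learns the sensitive value exactly:

```python
from collections import defaultdict

def disclosed_groups(records, public_attrs, sensitive):
    """Return attribute combinations that fully determine `sensitive`.

    Records are grouped by their public-attribute values (standing in
    for tree leaves); a group with a single distinct sensitive value
    discloses that value to anyone who knows the public attributes.
    """
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[a] for a in public_attrs)
        groups[key].add(r[sensitive])
    return {k for k, vals in groups.items() if len(vals) == 1}
```

The paper's pruning-based defense can be read as deliberately merging or coarsening such leaves until the disclosure risk drops, at the cost of some predictive utility.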