47 research outputs found

    Privacy Protection in Data Mining

    Get PDF

    A Data Perturbation Approach to Privacy Protection in Data Mining

    Get PDF
    Advances in data mining techniques have raised growing concerns about the privacy of personal information. Organizations that use their customers’ records in data mining activities are forced to take actions to protect the privacy of the individuals involved. A common practice for many organizations today is to remove identity-related attributes from customer records before releasing them to data miners or analysts. In this study, we investigate the effect of this practice and demonstrate that a majority of the records in a dataset can be uniquely identified even after identity-related attributes are removed. We propose a data perturbation method that can be used by organizations to prevent such unique identification of individual records, while providing the data to analysts for data mining. The proposed method attempts to preserve the statistical properties of the data based on privacy protection parameters specified by the organization. We show that the problem can be solved in two phases, with a linear programming formulation in phase one (to preserve the marginal distribution), followed by a simple Bayes-based swapping procedure in phase two (to preserve the joint distribution). The proposed method is compared with a random perturbation method in classification performance on two real-world datasets. The results of the experiments indicate that it significantly outperforms the random method.
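
    The sketch below is a loose illustration of the problem and the flavor of the remedy: it measures how many records are uniquely identifiable from quasi-identifier attributes, then reduces uniqueness by randomly swapping values within each column. Column-wise swapping preserves every marginal distribution exactly; the paper's two-phase LP-plus-Bayes method, which also preserves the joint distribution, is not reproduced here, and all data and parameter names are illustrative assumptions.

        import random
        from collections import Counter

        def unique_fraction(records, quasi_ids):
            """Fraction of records whose quasi-identifier combination is unique."""
            combos = Counter(tuple(r[a] for a in quasi_ids) for r in records)
            return sum(1 for r in records
                       if combos[tuple(r[a] for a in quasi_ids)] == 1) / len(records)

        def swap_perturb(records, attrs, swap_rate=0.3, seed=0):
            """Randomly swap a fraction of values within each attribute column;
            each column's marginal distribution is unchanged."""
            rng = random.Random(seed)
            perturbed = [dict(r) for r in records]
            n = len(perturbed)
            for a in attrs:
                for i in range(n):
                    if rng.random() < swap_rate:
                        j = rng.randrange(n)
                        perturbed[i][a], perturbed[j][a] = perturbed[j][a], perturbed[i][a]
            return perturbed

        # Toy customer records: most quasi-identifier combinations are unique.
        records = [{"age": 30 + (i * 7) % 40, "zip": 10000 + i % 9, "sex": "MF"[i % 2]}
                   for i in range(200)]
        qids = ["age", "zip", "sex"]
        print("unique before:", unique_fraction(records, qids))
        print("unique after: ", unique_fraction(swap_perturb(records, qids), qids))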

    The misty crystal ball: Efficient concealment of privacy-sensitive attributes in predictive analytics

    Get PDF
    Individuals are becoming increasingly concerned with privacy. This curtails their willingness to share sensitive attributes such as age, gender, or personal preferences; yet firms rely heavily on customer data in virtually every type of predictive analytics. Hence, organizations face a dilemma in which they must trade off a sparing use of data against the utility of better predictive analytics. This paper proposes a masking mechanism that obscures sensitive attributes while maintaining a large degree of predictive power. More precisely, we efficiently identify data partitions that are best suited for (i) shuffling, (ii) swapping, and, as a form of randomization, (iii) perturbing attributes by conditional replacement. By operating on data partitions derived from a predictive algorithm, we achieve the objective of masking privacy-sensitive attributes with only marginal downsides for predictive modeling. The resulting trade-off between masking and predictive utility is empirically evaluated in the context of customer churn, where, for instance, a stratified shuffling of attribute values rarely reduces predictive accuracy by more than a percentage point. Our proposed framework entails direct managerial implications, as a growing share of firms adopts predictive analytics and thus requires mechanisms that better adhere to user demands for information privacy.
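
    A minimal sketch of one of the three masking operations described above, stratified shuffling: a sensitive attribute is permuted within data partitions (here, the leaves of a fitted decision tree), so row-level values are obscured while the attribute's distribution inside each predictive partition is preserved. scikit-learn and the churn-like toy data are assumptions for illustration; the paper's partition-selection procedure is not reproduced.

        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        n = 1000
        age = rng.integers(18, 80, n)            # privacy-sensitive attribute
        usage = rng.normal(50.0, 15.0, n)        # non-sensitive attribute
        churn = (usage + 0.2 * age + rng.normal(0.0, 10.0, n) > 65).astype(int)

        # Partitions are derived from a predictive algorithm: here, tree leaves.
        X = np.column_stack([age, usage])
        tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0).fit(X, churn)
        leaves = tree.apply(X)

        # Shuffle the sensitive attribute within each leaf partition.
        masked_age = age.copy()
        for leaf in np.unique(leaves):
            idx = np.where(leaves == leaf)[0]
            masked_age[idx] = rng.permutation(masked_age[idx])

        # Leaf-level age distributions are unchanged, but row-level links
        # between an individual and an age value are broken.
        print("rows with changed age:", int((masked_age != age).sum()), "of", n)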

    User's Privacy in Recommendation Systems Applying Online Social Network Data, A Survey and Taxonomy

    Full text link
    Recommender systems have become an integral part of many social networks and extract knowledge from a user's personal and sensitive data, both explicitly, with the user's knowledge, and implicitly. This trend has created major privacy concerns, as users are mostly unaware of what data is being used, how much of it, and how securely. In this context, several works have addressed privacy concerns in the use of online social network data by recommender systems. This paper surveys the main privacy concerns, measurements, and privacy-preserving techniques used in large-scale online social networks and recommender systems. It builds on prior work on security, privacy preservation, statistical modeling, and datasets to provide an overview of the technical difficulties and problems associated with privacy preservation in online social networks. (Comment: 26 pages, IET book chapter on big data recommender systems.)

    WHY THEY SELF-DISCLOSE? EXAMINING FACTORS INFLUENCING PEOPLE'S PERSONAL INFORMATION DISCLOSURE IN ONLINE HEALTHCARE COMMUNITIES (RESEARCH-IN-PROGRESS)

    Get PDF
    Online healthcare communities (OHCs) encourage people to disclose personal information to others in order to seek support, accelerate research, and help create better treatments. However, disclosing personal information may lead to privacy disclosure and other risks. This paper aims to explore which factors affect people’s intention to disclose personal information in OHCs, and how. Based on the “risk-motivation” perspective, we identify perceived usefulness as extrinsic motivation and social support as intrinsic motivation, and distinguish four kinds of risk to test the effects of these motivation and risk factors on people’s personal information disclosure intention in OHCs. Two constructs describing the characteristics of OHCs, expected disease severity and common identity, are hypothesized to moderate the effects of the motivation and risk factors. The theoretical contribution of this paper is a model that explains people’s personal information disclosure intention in OHCs and integrates constructs describing the characteristics of OHCs; the practical implication is insight into how OHC managers can operate communities for viability while protecting people’s privacy. Finally, limitations and future work are also presented.

    Sharing Patient Disease Data with Privacy Preservation

    Get PDF
    When patient data are shared for studying a specific disease, a privacy disclosure occurs as soon as an individual is known to be in the shared data. Individuals in such specific-disease data are thus subject to higher disclosure risk than those in datasets covering many different diseases. This problem has been overlooked in privacy research and practice. In this study, we analyze the disclosure risks for this problem and identify appropriate risk measures. An efficient algorithm is developed for anonymizing the data. An experimental study is conducted to demonstrate the effectiveness of the proposed approach.
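
    The core risk idea lends itself to a small illustration: because every record in a specific-disease dataset shares the disease, any quasi-identifier combination matched by few records carries high disclosure risk. The sketch below is an assumption-laden simplification rather than the paper's algorithm; it scores worst-case risk as the reciprocal of the smallest equivalence-class size and coarsens an age attribute until the risk drops below a threshold.

        from collections import Counter

        def max_risk(records, qids):
            """Worst-case re-identification risk: 1 / smallest equivalence class."""
            classes = Counter(tuple(r[a] for a in qids) for r in records)
            return 1.0 / min(classes.values())

        def generalize_age(records, width):
            """Coarsen 'age' into bins of the given width."""
            return [{**r, "age": (r["age"] // width) * width} for r in records]

        # Toy shared dataset: every record implicitly has the same disease.
        records = [{"age": a, "zip": z} for a in range(20, 60)
                   for z in (10001, 10002)]
        width = 1
        while max_risk(generalize_age(records, width), ["age", "zip"]) > 0.2:
            width *= 2
        print("chosen age-bin width:", width)  # widens bins until risk <= 0.2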

    Customer-Base Analysis using Repeated Cross-Sectional Summary (RCSS) Data

    Get PDF
    We address a critical question that many firms face today: can customer data be stored and analyzed in an easy-to-manage and scalable manner without significantly compromising the inferences that can be made about the customers’ transaction activity? We address this question in the context of customer-base analysis. A number of researchers have developed customer-base analysis models that perform very well given detailed individual-level data. We explore the possibility of estimating these models using aggregated data summaries alone, namely repeated cross-sectional summaries (RCSS) of the transaction data. Such summaries are easy to create, visualize, and distribute, irrespective of the size of the customer base. An added advantage of the RCSS data structure is that individual customers cannot be identified, which makes it desirable from a data privacy and security viewpoint as well. We focus on the widely used Pareto/NBD model and carry out a comprehensive simulation study covering a vast spectrum of market scenarios. We find that the RCSS format of four quarterly histograms serves as a suitable substitute for individual-level data. We confirm the results of the simulations on a real dataset of purchases from an online fashion retailer.
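
    For concreteness, the sketch below builds the RCSS data structure from raw transactions under illustrative assumptions: for each quarter it produces a histogram of how many customers made 0, 1, 2, ... purchases, so no individual identity survives in the summary. Estimating the Pareto/NBD model from such histograms is beyond this sketch.

        from collections import Counter

        # (customer_id, quarter) pairs; toy transaction log, an assumption here
        transactions = [(1, 1), (1, 1), (2, 1), (3, 2), (1, 2), (2, 2), (2, 2)]
        customers = {c for c, _ in transactions}
        quarters = sorted({q for _, q in transactions})

        rcss = {}
        for q in quarters:
            counts = Counter(c for c, qq in transactions if qq == q)
            # include customers with zero transactions in this quarter
            hist = Counter(counts.get(c, 0) for c in customers)
            rcss[q] = dict(sorted(hist.items()))

        print(rcss)  # {1: {0: 1, 1: 1, 2: 1}, 2: {1: 2, 2: 1}}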

    Protecting Privacy Against Regression Attacks in Predictive Data Mining

    Get PDF
    Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-mining technique, can be used to effectively reveal individuals' sensitive data. This problem, which we call a regression attack, has been overlooked in the literature, and existing privacy-preserving techniques are not appropriate for coping with it. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach adopts a novel measure that considers the tradeoff between disclosure risk and data utility in a regression tree pruning process. We also propose a dynamic value-concatenation method, which overcomes the limitation of requiring a user-defined generalization hierarchy in traditional k-anonymity approaches. Our approach can be used for anonymizing both numeric and categorical data. An experimental study is conducted to demonstrate the effectiveness of the proposed approach.
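
    A hedged sketch of the attack scenario described above: an adversary fits a regression tree on released, non-sensitive attributes to infer a sensitive numeric attribute (income, in this toy setup). scikit-learn and the synthetic data are assumptions, and the paper's risk-aware pruning and value-concatenation defenses are not reproduced.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor

        rng = np.random.default_rng(0)
        n = 2000
        age = rng.integers(22, 65, n)
        education_years = rng.integers(10, 21, n)
        income = 2000 * education_years + 300 * age + rng.normal(0.0, 5000.0, n)

        # The adversary trains on individuals whose income is already known ...
        X = np.column_stack([age, education_years])
        attack = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
        attack.fit(X[:1500], income[:1500])

        # ... then infers the sensitive attribute for the remaining individuals.
        inferred = attack.predict(X[1500:])
        err = np.mean(np.abs(inferred - income[1500:]))
        print(f"mean absolute inference error: {err:.0f}"
              f" vs. average income {income.mean():.0f}")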