52 research outputs found

    Randomization Based Privacy Preserving Categorical Data Analysis

    The success of data mining relies on the availability of high-quality data. To ensure quality data mining, effective information sharing between organizations has become a vital requirement in today’s society. Since data mining often involves sensitive information about individuals, the public has expressed deep concern about privacy. Privacy-preserving data mining is the study of eliminating privacy threats while, at the same time, preserving useful information in the released data for data mining. This dissertation investigates the data utility and privacy of randomization-based models in privacy-preserving data mining for categorical data. For the analysis of data utility in the randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. Then we propose a general framework for theoretical analysis of how the randomization process affects the accuracy of various measures adopted in categorical data analysis. We also examine data utility when randomization mechanisms are not disclosed to data miners in order to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization, and then extend the analysis to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference for data miners about what they can and cannot do with certainty on randomized data directly, without knowledge of the original data distribution and the distortion information. Data privacy and data utility are commonly considered a pair of conflicting requirements in privacy-preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attacks in data publishing. We propose efficient solutions to determine optimal distortion parameters such that utility preservation is maximized while privacy requirements are still satisfied. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects: reconstructed distributions, accuracy of answering queries, and preservation of correlations. Our empirical results show that randomization incurs a significantly smaller utility loss.
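
The abstract does not spell out the distortion scheme, but the classical instance of this randomization model is randomized response: each categorical value is retained with probability p and otherwise replaced by a uniformly drawn category. A minimal sketch, assuming that scheme (all function names and parameter values here are illustrative, not the dissertation's code), shows how an analyst who knows p can reconstruct the original distribution by inverting the distortion matrix:

```python
import numpy as np

def randomize(values, categories, p, rng):
    """Randomized response: keep each value with probability p,
    otherwise replace it with a uniformly random category."""
    values = np.asarray(values)
    keep = rng.random(len(values)) < p
    noise = rng.choice(categories, size=len(values))
    return np.where(keep, values, noise)

def reconstruct_distribution(randomized, categories, p):
    """Invert the distortion matrix P = p*I + (1-p)/k * J to
    estimate the original category frequencies from noisy data."""
    k = len(categories)
    P = p * np.eye(k) + (1 - p) / k * np.ones((k, k))
    observed = np.array([(randomized == c).mean() for c in categories])
    return np.linalg.solve(P, observed)

rng = np.random.default_rng(0)
categories = np.array(["A", "B", "C"])
original = rng.choice(categories, size=10_000, p=[0.6, 0.3, 0.1])
noisy = randomize(original, categories, p=0.7, rng=rng)
print(reconstruct_distribution(noisy, categories, p=0.7))  # ~ [0.6, 0.3, 0.1]
```

Inverting the distortion matrix is exactly the step that becomes unavailable when, as the dissertation also studies, the distortion parameters are withheld from the data miner.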

    Verification in Privacy Preserving Data Publishing

    Privacy-preserving data publication is a major concern for both the owners of data and the data publishers. Principles such as k-anonymity and l-diversity have been proposed to reduce privacy violations. On the other hand, no studies have addressed verification of the anonymized data in terms of adversarial breach and anonymity levels. The anonymized data remains prone to attacks due to the presence of dependencies among quasi-identifiers and sensitive attributes. This paper presents a novel framework to detect the existence of those dependencies and a solution to reduce them. The advantages of our approach are that i) privacy violations can be detected, ii) the extent of privacy risk can be measured, and iii) re-anonymization can be performed on vulnerable blocks of data. The work is further extended to show how the adversary's breach knowledge grows as new tuples are added, and an on-the-fly solution to reduce it is discussed. Experimental results are reported and analyzed.
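
The paper's concrete dependency tests are not given in the abstract; as a rough illustration of the verification idea, the following sketch (the block structure, the entropy threshold, and all names are assumptions for illustration, not the authors' method) flags anonymized blocks whose sensitive values are too few or too skewed:

```python
from collections import Counter
from math import log2

def block_entropy(sensitive_values):
    """Shannon entropy of the sensitive values in one anonymized block."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def flag_vulnerable_blocks(blocks, l=3):
    """Flag blocks that fail distinct l-diversity or whose sensitive
    values are heavily skewed (entropy below log2(l))."""
    flagged = []
    for block_id, sens in blocks.items():
        if len(set(sens)) < l or block_entropy(sens) < log2(l):
            flagged.append(block_id)
    return flagged

blocks = {"b1": ["flu", "flu", "flu", "hiv"],
          "b2": ["flu", "hiv", "cancer", "ulcer"]}
print(flag_vulnerable_blocks(blocks, l=3))  # ['b1']
```

Flagged blocks are the natural candidates for the re-anonymization step the paper describes.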

    Privacy by Design in Data Mining

    Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is one reason why their diffusion is often more limited than expected. Moreover, people are reluctant to provide true personal data unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish, and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store, and analyze complex data describing human activities in great detail and resolution. As a result, anonymization simply cannot be accomplished by de-identification alone. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common finding is that no general method exists which is capable of both dealing with “generic personal data” and preserving “generic analytical results”. In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. Therefore, we propose the privacy-by-design paradigm, which sheds new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to: a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones. This thesis investigates two new research issues which arise in modern data mining and data privacy: individual privacy protection in data publishing while preserving specific data mining analyses, and corporate privacy protection in data mining outsourcing.

    Secure information sharing on Decentralized Social Networks.

    Decentralized Social Networks (DSNs) are web-based platforms built on distributed systems (federations) composed of multiple providers (pods) that run the same social networking service. DSNs have been presented as a valid alternative to Online Social Networks (OSNs), replacing the centralized paradigm of OSNs with a decentralized distribution of the features offered by the social networking platform. Similarly to commercial OSNs, DSNs offer their subscribed users a number of distinctive features, such as the possibility to share resources with other subscribed users or to establish virtual relationships with other DSN users. On the other hand, each DSN user takes part in the service, choosing either to store personal data on his/her own trusted provider inside the federation or to deploy his/her own provider on a private machine. This gives each DSN user direct control over his/her data and prevents the social network provider from performing data mining analysis over this information. Unfortunately, the deployment of a personal DSN pod is not as simple as it sounds. Indeed, each pod’s owner has to maintain the security, integrity, and reliability of all the data stored in that provider. Furthermore, given the amount of data produced each day in a social network service, it is reasonable to assume that the majority of users cannot afford the upkeep of hardware capable of handling such an amount of information. As a result, it has been shown that most DSN users prefer to subscribe to an existing provider rather than setting up a new one, leading to an indirect centralization of data that makes DSNs suffer from the same issues as centralized social network services. In order to overcome this issue, in this thesis we have investigated the possibility for DSN providers to rely on modern cloud-based storage services so as to offer a cloud-based information sharing service. This has required dealing with many challenges. We have investigated the definition of cryptographic protocols enabling DSN users to securely store their resources in the public cloud, along with the definition of communication protocols ensuring that decryption keys are distributed only to authorized users, that is, users who satisfy at least one of the access control policies specified by the data owner according to the Relationship-based access control model (RelBAC) [20, 34]. In addition, it has emerged that DSN users have the same difficulties as OSN users in defining RelBAC rules that properly express their attitude towards their own privacy. Indeed, it is nowadays well accepted that the definition of access control policies is an error-prone task. Since misconfigured RelBAC policies may lead to harmful data release and may expose the privacy of others as well, we believe that DSN users should be assisted in the RelBAC policy definition process. For this purpose, we have designed a RelBAC policy recommendation system that learns from DSN users their own attitude towards privacy and exploits the learned data to assist DSN users in the definition of RelBAC policies by suggesting customized privacy rules. Nevertheless, despite the presence of the above-mentioned policy recommender, it is reasonable to assume that misconfigured RelBAC rules may still appear in the system.
However, rather than considering all misconfigured policies as leading to potentially harmful situations, we have considered that they might instead lead to an exacerbated data restriction that brings a loss of utility to DSN users. As an example, assuming that a low-resolution and a high-resolution version of the same picture are uploaded in the network, we believe that the low-resolution version should be granted to all those users who are granted access to the high-resolution version, even though, due to a misconfigured system, no policy explicitly authorizes them on the low-resolution picture. We have therefore designed a technique capable of exploiting all the existing data dependencies (i.e., any correlation between data) as a means of increasing the system utility, that is, the number of queries that can be safely answered. We have then defined a query rewriting technique capable of extending defined access control policy authorizations by exploiting data dependencies, in order to authorize unauthorized but inferable data. In this thesis we present a complete description of the above-mentioned proposals, along with the experimental results of the tests that have been carried out to verify the feasibility of the presented techniques.
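
The thesis's cryptographic protocols are not detailed in the abstract; a common pattern that matches the description (encrypt a resource once, distribute decryption keys only to authorized users) is envelope encryption. A minimal sketch with the Python cryptography package, where the per-user symmetric keys and all names are assumptions for illustration:

```python
# pip install cryptography
from cryptography.fernet import Fernet

def share_resource(resource: bytes, authorized_user_keys: dict):
    """Envelope encryption: encrypt the resource once with a fresh
    data key, then wrap that data key for every authorized user."""
    data_key = Fernet.generate_key()
    ciphertext = Fernet(data_key).encrypt(resource)
    wrapped = {user: Fernet(k).encrypt(data_key)
               for user, k in authorized_user_keys.items()}
    return ciphertext, wrapped

def read_resource(ciphertext, wrapped_key, user_key):
    """An authorized user unwraps the data key, then decrypts."""
    data_key = Fernet(user_key).decrypt(wrapped_key)
    return Fernet(data_key).decrypt(ciphertext)

alice_key, bob_key = Fernet.generate_key(), Fernet.generate_key()
ct, wrapped = share_resource(b"low-res holiday picture",
                             {"alice": alice_key, "bob": bob_key})
print(read_resource(ct, wrapped["alice"], alice_key))
```

In an actual DSN deployment the per-user key wrapping would presumably use each user's public key, with the set of authorized users determined by evaluating the RelBAC policies.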

    Sanitizing and Minimizing Databases for Software Application Test Outsourcing

    Testing software applications that use nontrivial databases is increasingly outsourced to test centers in order to achieve lower cost and higher quality. Not only do data privacy laws prevent organizations from sharing this data with test centers because databases contain sensitive information, but the situation is also aggravated by big data: it is time-consuming and difficult to anonymize, distribute, and test with large databases. Deleting data randomly often leads to significantly worse test coverage and fewer uncovered faults, thereby reducing the quality of software applications. We propose a novel approach for Protecting and mInimizing databases for Software TestIng taSks (PISTIS) that both sanitizes and minimizes a database that comes along with an application. PISTIS uses a weight-based data clustering algorithm that partitions data in the database using information obtained by program analysis that describes how this data is used by the application. For each cluster, a centroid object is computed that represents the different persons or entities in the cluster, and we use association rule mining to compute and use constraints that ensure the centroid objects are representative of the general population of the data in the cluster. Doing so also sanitizes information, since these centroid objects replace the original data, making it difficult for attackers to infer sensitive information. Thus, we reduce a large database to a few centroid objects, and we show in our experiments with two applications that test coverage stays within a close range of its original level.
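
PISTIS's weight-based clustering and constraint mining cannot be reproduced from the abstract alone; the sketch below only illustrates the core reduce-to-centroids step, with scikit-learn's KMeans standing in for the paper's algorithm and random weights standing in for the program-analysis-derived importance weights:

```python
# pip install scikit-learn
import numpy as np
from sklearn.cluster import KMeans

def sanitize_and_minimize(rows, weights, n_clusters):
    """Cluster the database rows (weights stand in for importance
    derived from program analysis) and keep one centroid record per
    cluster: the minimized, sanitized test database."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(rows, sample_weight=weights)
    return km.cluster_centers_

rng = np.random.default_rng(1)
rows = rng.normal(size=(1_000, 4))           # e.g. numeric columns of a test DB
weights = rng.uniform(0.5, 2.0, size=1_000)  # illustrative importance weights
print(sanitize_and_minimize(rows, weights, n_clusters=5).shape)  # (5, 4)
```

Replacing a thousand rows with five centroids is what simultaneously minimizes the database and hides the original individuals behind aggregate records.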

    Survey of Privacy-Preserving Data Publishing Methods and Speedy: a multi-threaded algorithm preserving k-anonymity

    Nowadays, many organizations, enterprises, and public services collect and manage a vast amount of personal information. Typical examples of such datasets include clinical tests conducted in hospitals, query logs held by search engines, social data produced by social networks, financial data from public-sector information systems, etc. These datasets often need to be published for research or statistical studies without revealing sensitive information about the individuals they describe. The anonymization process is more complicated than simply hiding attributes that can directly identify an individual (name, SSN, etc.) in the published dataset. Even without these attributes, an adversary can cause privacy leakage by cross-linking with other publicly available datasets or by having some sort of background knowledge. Therefore, privacy preservation in data publishing has gained considerable attention in recent years, with several privacy models proposed in the literature. In this thesis, we discuss the most common attacks that can be made on published datasets and we present state-of-the-art privacy guarantees and anonymization algorithms to counter these attacks. Furthermore, we propose a novel multi-threaded anonymization algorithm which exploits the capabilities of modern CPUs to speed up the anonymization process, achieving k-anonymity in the anonymized dataset.
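
Speedy itself is not specified in the abstract; the sketch below shows the k-anonymity property it targets and one simple way to verify it concurrently over disjoint partitions (a sufficient check, since per-partition group counts only add up globally). The partitioning, worker count, and generalized example values are illustrative assumptions:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous iff every combination of
    quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def check_partitions_parallel(partitions, quasi_identifiers, k, workers=4):
    """Check disjoint partitions concurrently, in the spirit of a
    multi-threaded anonymizer (this is only the verification step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda part: is_k_anonymous(part, quasi_identifiers, k),
            partitions))
    return all(results)

part = [{"zip": "130**", "age": "[20-30]", "disease": "flu"}] * 3
print(check_partitions_parallel([part], ["zip", "age"], k=2))  # True
```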

    Abduction and Anonymity in Data Mining

    This thesis investigates two new research problems that arise in modern data mining: reasoning on data mining results, and the privacy implications of data mining results. Most data mining algorithms rely on inductive techniques, trying to infer information that generalizes from the input data. But very often this inductive step on raw data is not enough to answer the user's questions, and the data must be processed again using other inference methods. In order to answer high-level user needs such as explanation of results, we describe an environment able to perform abductive (hypothetical) reasoning, since the solutions of such queries can often be seen as the set of hypotheses that satisfy some requirements. Using cost-based abduction, we show how classification algorithms can be boosted by performing abductive reasoning over the data mining results, improving the quality of the output. Another growing research area in data mining is privacy-preserving data mining. Due to the availability of large amounts of data, easily collected and stored via computer systems, new applications are emerging, but privacy concerns can make data mining unsuitable. We study the privacy implications of data mining in a mathematical and logical context, focusing on the anonymity of the people whose data are analyzed. A formal theory of anonymity-preserving data mining is given, together with a number of anonymity-preserving algorithms for pattern mining. The post-processing improvement of data mining results (w.r.t. utility and privacy) is the central focus of the problems investigated in this thesis.
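
Cost-based abduction, as invoked here for explaining results, selects a cheapest set of hypotheses that jointly covers the observations. A toy brute-force sketch for small instances (the hypothesis table and costs are invented for illustration; the thesis's reasoning environment is far more general):

```python
from itertools import combinations

def cost_based_abduction(observations, hypotheses):
    """Return the cheapest set of hypotheses whose explanations
    jointly cover all observations (exhaustive search)."""
    names = list(hypotheses)
    best, best_cost = None, float("inf")
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            covered = set().union(*(hypotheses[h]["explains"] for h in combo))
            cost = sum(hypotheses[h]["cost"] for h in combo)
            if observations <= covered and cost < best_cost:
                best, best_cost = combo, cost
    return best, best_cost

hypotheses = {
    "h1": {"cost": 2.0, "explains": {"o1", "o2"}},
    "h2": {"cost": 1.5, "explains": {"o2", "o3"}},
    "h3": {"cost": 4.0, "explains": {"o1", "o2", "o3"}},
}
print(cost_based_abduction({"o1", "o2", "o3"}, hypotheses))
# (('h1', 'h2'), 3.5): cheaper than explaining everything with h3 alone
```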

    Data Mining for Discrimination Discovery

    In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. Discrimination in credit, mortgage, insurance, the labor market, and education has been investigated by researchers in economics and the human sciences. With the advent of automatic decision support systems, such as credit scoring systems, the ease of data collection opens several challenges to data analysts in the fight against discrimination. In this paper, we introduce the problem of discovering discrimination through data mining in a dataset of historical decision records, taken by humans or by automatic systems. We formalize the processes of direct and indirect discrimination discovery by modelling protected-by-law groups and contexts where discrimination occurs in a classification-rule-based syntax. Basically, classification rules extracted from the dataset allow for unveiling contexts of unlawful discrimination, where the degree of burden over protected-by-law groups is formalized by an extension of the lift measure of a classification rule. In direct discrimination, the extracted rules can be mined directly in search of discriminatory contexts. In indirect discrimination, the mining process needs some background knowledge as a further input, e.g., census data, that combined with the extracted rules might allow for unveiling contexts of discriminatory decisions. A strategy adopted for combining extracted classification rules with background knowledge is called an inference model. In this paper, we propose two inference models and provide automatic procedures for their implementation. An empirical assessment of our results is provided on the German credit dataset and on the PKDD Discovery Challenge 1999 financial dataset.
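
The "extension of the lift measure" can be illustrated with the extended lift of a classification rule: the ratio between the rule's confidence with and without the protected-group condition in its premise. The sketch below uses invented records; the threshold above which a context counts as discriminatory is a matter for the inference model, not fixed here:

```python
def confidence(records, premise, conclusion):
    """conf(premise -> conclusion) over a list of dict records."""
    matching = [r for r in records
                if all(r[k] == v for k, v in premise.items())]
    if not matching:
        return 0.0
    hits = [r for r in matching
            if all(r[k] == v for k, v in conclusion.items())]
    return len(hits) / len(matching)

def elift(records, protected, context, decision):
    """Extended lift: how much adding the protected-group condition
    raises the confidence of a negative decision in a context."""
    base = confidence(records, context, decision)
    combined = confidence(records, {**protected, **context}, decision)
    return combined / base if base else float("inf")

records = (
    [{"sex": "female", "job": "unskilled", "credit": "deny"}] * 8
    + [{"sex": "female", "job": "unskilled", "credit": "grant"}] * 2
    + [{"sex": "male", "job": "unskilled", "credit": "deny"}] * 4
    + [{"sex": "male", "job": "unskilled", "credit": "grant"}] * 6
)
print(elift(records, {"sex": "female"}, {"job": "unskilled"},
            {"credit": "deny"}))  # 0.8 / 0.6 ≈ 1.33
```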

    Toward Privacy in High-Dimensional Data Publishing

    Nowadays, data sharing among multiple parties has become inevitable in various application domains for diverse reasons, such as decision support, policy development, and data mining. Yet data in its raw format often contains person-specific sensitive information, and publishing such data without proper protection may jeopardize individual privacy. This fact has spawned extensive research on privacy-preserving data publishing (PPDP), which balances the fundamental trade-off between individual privacy and the utility of published data. Early research on PPDP focused on protecting private and sensitive information in relational and statistical data. However, the recent prevalence of several emerging types of high-dimensional data has raised unique challenges that prevent traditional PPDP techniques from being directly applied. In this thesis, we address the privacy concerns in publishing four types of high-dimensional data, namely set-valued data, trajectory data, sequential data, and network data. We develop effective and efficient non-interactive data publishing solutions for various utility requirements. Most of our solutions satisfy a rigorous privacy guarantee known as differential privacy, which has become the de facto standard for privacy protection. This thesis demonstrates that our solutions show great promise for releasing useful high-dimensional data without endangering individual privacy.
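
Most of the mechanisms described satisfy differential privacy; the basic building block behind noisy count releases is the Laplace mechanism, sketched here on a hypothetical trajectory count (the count and the epsilon values are purely illustrative):

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Laplace mechanism: a count query has sensitivity 1, so adding
    Laplace(1/epsilon) noise yields epsilon-differential privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
trajectories_through_station = 1_284  # hypothetical true count
for eps in (0.1, 1.0):
    # smaller epsilon -> stronger privacy -> noisier release
    print(eps, round(laplace_count(trajectories_through_station, eps, rng), 1))
```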