35 research outputs found
Comprehensive survey on big data privacy protection
In recent years, the ever-mounting problem of Internet phishing has been threatening the secure propagation of sensitive data over the web, thereby resulting in either outright decline of data distribution or inaccurate data distribution from several data providers. Therefore, user privacy has evolved into a critical issue in various data mining operations. User privacy has turned out to be a foremost criterion for allowing the transfer of confidential information. The intense surge in storing the personal data of customers (i.e., big data) has resulted in a new research area, which is referred to as privacy-preserving data mining (PPDM). A key issue of PPDM is how to manipulate data using a specific approach to enable the development of a good data mining model on modified data, thereby meeting a specified privacy need with minimum loss of information for the intended data analysis task. The current review study aims to utilize the tasks of data mining operations without risking the security of individuals’ sensitive information, particularly at the record level. To this end, PPDM techniques are reviewed and classified using various approaches for data modification. Furthermore, a critical comparative analysis is performed for the advantages and drawbacks of PPDM techniques. This review study also elaborates on the existing challenges and unresolved issues in PPDM.
Generative Adversarial Networks for Multi-Objective Synthetic Data Generation
Synthetic data has become increasingly accessible due to remarkable advancements in machine learning. This data is extremely useful to researchers due to its wide range of applications. Synthetic data may be used to bolster populations that are under-sampled, or to create permutations of some existing data, generating combinations not seen in the original data. Synthetic data may also be used in place of the original data completely when sensitive aspects limit the distribution. Previously, research in synthetic data generation has been primarily focused on generating data that is maximally realistic. Significantly less attention has been paid to assurances of other components of the data, such as privacy concerns or data diversity. This has left a gap in the field of synthetic data generation. We address this through the investigation of multi-agent synthetic data generation. In this dissertation, we expand the scope of data generation by introducing agents that optimize various facets of synthetic data, such as privacy, class diversity, and training utility. We propose a novel, multi-objective synthetic generation framework to allow all of these objectives to be optimized. We finally demonstrate this framework can generate high quality data across multiple domains for an arbitrary number of objectives
RANDOMIZATION BASED PRIVACY PRESERVING CATEGORICAL DATA ANALYSIS
The success of data mining relies on the availability of high quality data. To ensure quality data mining, effective information sharing between organizations becomes a vital requirement in today’s society. Since data mining often involves sensitive information of individuals, the public has expressed a deep concern about their privacy. Privacy-preserving data mining is a study of eliminating privacy threats while, at the same time, preserving useful information in the released data for data mining.
This dissertation investigates data utility and privacy of randomization-based models in privacy preserving data mining for categorical data. For the analysis of data utility in randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. Then we propose a general framework to conduct theoretical analysis on how the randomization process affects the accuracy of various measures adopted in categorical data analysis.
We also examine data utility when randomization mechanisms are not provided to data miners to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization. We then extend it to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference to data miners about what they can do and what they cannot do with certainty upon randomized data directly without the knowledge about the original distribution of data and distortion information.
Data privacy and data utility are commonly considered as a pair of conflicting requirements in privacy preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on the attribute disclosure under linking attack in data publishing. We propose efficient solutions to determine optimal distortion parameters such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss
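As a rough illustration of what randomization of categorical data and reconstruction of its distribution can look like, the following minimal sketch uses a generic randomized-response scheme. It is not the dissertation's method: the retention probability `p`, the uniform replacement rule, and the names `randomize` and `estimate_distribution` are all assumptions made for this example.

```python
import random
from collections import Counter


def randomize(value, categories, p, rng):
    """Report the true value with probability p, otherwise a uniformly
    random category (assumed distortion scheme, for illustration only)."""
    if rng.random() < p:
        return value
    return rng.choice(categories)


def estimate_distribution(reported, categories, p):
    """Estimate the original category proportions from randomized data.

    Under the scheme above, observed = p * true + (1 - p) / k,
    so true = (observed - (1 - p) / k) / p.
    """
    n = len(reported)
    k = len(categories)
    counts = Counter(reported)
    return {c: (counts[c] / n - (1 - p) / k) / p for c in categories}


# Hypothetical data: three categories with a skewed true distribution.
rng = random.Random(0)
categories = ["A", "B", "C"]
true_values = rng.choices(categories, weights=[0.7, 0.2, 0.1], k=20000)
reported = [randomize(v, categories, 0.6, rng) for v in true_values]
estimate = estimate_distribution(reported, categories, 0.6)
```

A smaller `p` hides individual values more strongly but makes the reconstructed proportions noisier, which is the privacy and utility trade-off the abstract refers to.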
Privacy in trajectory micro-data publishing : a survey
We survey the literature on the privacy of trajectory micro-data, i.e.,
spatiotemporal information about the mobility of individuals, whose collection
is becoming increasingly simple and frequent thanks to emerging information and
communication technologies. The focus of our review is on privacy-preserving
data publishing (PPDP), i.e., the publication of databases of trajectory
micro-data that preserve the privacy of the monitored individuals. We classify
and present the literature of attacks against trajectory micro-data, as well as
solutions proposed to date for protecting databases from such attacks. This
paper serves as an introductory reading on a critical subject in an era of
growing awareness about privacy risks connected to digital services, and
provides insights into open problems and future directions for research. Comment: Accepted for publication at Transactions for Data Privacy
Spherical microaggregation : anonymizing sparse vector spaces
Unstructured texts are a very popular data type and still widely unexplored in the privacy preserving data mining field. We consider the problem of providing public information about a set of confidential documents. To that end we have developed a method to protect a Vector Space Model (VSM), to make it public even if the documents it represents are private. This method is inspired by microaggregation, a popular protection method from statistical disclosure control, and adapted to work with sparse and high dimensional data sets
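For readers unfamiliar with microaggregation, the sketch below shows the classic fixed-size variant on a single numeric attribute: values are sorted, grouped into clusters of at least `k` records, and each value is replaced by its cluster mean. This is only the general idea from statistical disclosure control, not the spherical variant for sparse vector spaces described in the abstract; the function name and grouping rule are assumptions.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of at least k
    neighbouring values (simple univariate, fixed-size heuristic)."""
    if k < 1 or len(values) < k:
        raise ValueError("need at least k values and k >= 1")
    # Indices of the records, ordered by their value.
    order = sorted(range(len(values)), key=values.__getitem__)
    n_groups = len(values) // k
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder so no group is smaller than k.
        end = (g + 1) * k if g < n_groups - 1 else len(values)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


# Hypothetical input: seven values, groups of at least three.
protected = microaggregate([12, 1, 10, 3, 13, 2, 11], 3)
```

Every published value is then shared by at least `k` records, which is what gives microaggregation its k-anonymity-style guarantee on the aggregated attribute.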
Differential privacy by sampling
Collection and storage of immense volumes of data has become commonplace in today's digital age, making the protection of personal data increasingly important. Private data often includes sensitive information about an individual, and is gathered by medical and financial institutions, research and social science organisations, government, etc., taking full advantage of data-driven analytics and knowledge-based decision-making to improve products and services, enterprise statistical analysis, comprehensive studies of demographic trends, and many others. The disclosure or sharing of such information among different parties could infringe on privacy. This information can be used for malicious purposes, such as identity theft, scams or targeted advertising. This work examines the field of privacy-preserving data publishing.
Quality of published data significantly affects not only understanding and processing strategy, but the accuracy of data analysis as well as consequently the interpretation and decisions derived from the data. In order to meet this challenge, synthetic anonymization techniques, such as k-anonymity and its enhanced algorithms, are applied. However, they are based on the background knowledge of the adversary. A semantic model, or differential privacy, is a more rigorous mathematical notion of privacy assurance that operates under no assumptions. Nevertheless, differential privacy applies to the subsequent phase, namely privacy preserving data mining, query answering and aggregate statistics.
In the scope of this work, a subsampling anonymization algorithm DP-anonym providing k-anonymity with integrated differential privacy mechanisms, such as Laplace mechanism and exponential mechanisms, is elaborated. The algorithm provides synthetic and semantic privacy, combining the best of the two areas of private data exploration. According to experimental results, the proposed DP-anonym algorithm provides better data utility when compared to standard anonymization algorithms among general data utility metrics. It also provides more precise answers to typical database queries as it uses multidimensional generalization approach. In contrast to standard methods, DP-anonym achieves (epsilon, delta)-differential privacy, which guarantees the privacy of published anonymized data more efficiently.
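The Laplace mechanism named in this abstract can be sketched for a single counting query as follows. This is a minimal illustration of the standard mechanism only, not the DP-anonym algorithm, and the names `laplace_noise` and `private_count` are assumptions. A count changes by at most 1 when one record is added or removed, so noise with scale 1/epsilon gives epsilon-differential privacy for that query.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two independent
    exponential variables that each have mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer a counting query with the Laplace mechanism.

    The sensitivity of a count is 1, so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of 100 people, query "how many are under 30?".
rng = random.Random(0)
ages = list(range(100))
noisy_answer = private_count(ages, lambda age: age < 30, 1.0, rng)
```

Smaller values of `epsilon` add more noise and give stronger privacy; repeated queries consume additional privacy budget, which this sketch does not track.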
A Comprehensive Bibliometric Analysis on Social Network Anonymization: Current Approaches and Future Directions
In recent decades, social network anonymization has become a crucial research
field due to its pivotal role in preserving users' privacy. However, the high
diversity of approaches introduced in relevant studies poses a challenge to
gaining a profound understanding of the field. In response to this, the current
study presents an exhaustive and well-structured bibliometric analysis of the
social network anonymization field. To begin our research, related studies from
the period of 2007-2022 were collected from the Scopus Database then
pre-processed. Following this, the VOSviewer was used to visualize the network
of authors' keywords. Subsequently, extensive statistical and network analyses
were performed to identify the most prominent keywords and trending topics.
Additionally, the application of co-word analysis through SciMAT and the
Alluvial diagram allowed us to explore the themes of social network
anonymization and scrutinize their evolution over time. These analyses
culminated in an innovative taxonomy of the existing approaches and
anticipation of potential trends in this domain. To the best of our knowledge,
this is the first bibliometric analysis in the social network anonymization
field, which offers a deeper understanding of the current state and an
insightful roadmap for future research in this domain. Comment: 73 pages, 28 figures
Privacy in trajectory micro-data publishing: a survey
We survey the literature on the privacy of trajectory micro-data, i.e., spatiotemporal information about the mobility of individuals, whose collection is becoming increasingly simple and frequent thanks to emerging information and communication technologies. The focus of our review is on privacy-preserving data publishing (PPDP), i.e., the publication of databases of trajectory micro-data that preserve the privacy of the monitored individuals. We classify and present the literature of attacks against trajectory micro-data, as well as solutions proposed to date for protecting databases from such attacks. This paper serves as an introductory reading on a critical subject in an era of growing awareness about privacy risks connected to digital services, and provides insights into open problems and future directions for research
A Data Mining Perspective in Privacy Preserving Data Mining Systems
Privacy Preserving Data Mining (PPDM) presents a novel framework for extracting and deriving information when the data is distributed amongst multiple parties. The privacy preservation of data and the use of efficient data mining algorithms in such systems remain a major issue. Most existing systems employ a cryptographic key exchange process and a key computation process accomplished by means of a trusted server or a third party. To eliminate the key exchange and key computation overheads, this paper discusses the Key Distribution-Less Privacy Preserving Data Mining system. The novelty of the system is that no data is published; only the association rules are published to achieve effective data mining results. The system embodies the data mining algorithm for classification rule generation and data mining. The results discussed in this paper compare the system based on key exchange with the key distribution-less system, and the efficiency in rule generation, overhead reduction and classification efficiency of the latter is demonstrated
Privacidade em comunicações de dados para ambientes contextualizados
Doctorate in Informatics. Internet users consume online targeted advertising based on information collected
about them and voluntarily share personal information in social networks.
Sensor information and data from smart-phones is collected and used
by applications, sometimes in unclear ways. As it happens today with smartphones,
in the near future sensors will be shipped in all types of connected
devices, enabling ubiquitous information gathering from the physical environment,
enabling the vision of Ambient Intelligence. The value of gathered data,
if not obvious, can be harnessed through data mining techniques and put to
use by enabling personalized and tailored services as well as business intelligence
practices, fueling the digital economy.
However, the ever-expanding information gathering and use undermines the
privacy conceptions of the past. Natural social practices of managing privacy
in daily relations are overridden by socially-awkward communication tools, service
providers struggle with security issues resulting in harmful data leaks,
governments use mass surveillance techniques, the incentives of the digital
economy threaten consumer privacy, and the advancement of consumer-grade
data-gathering technology enables new inter-personal abuses.
A wide range of fields attempts to address technology-related privacy problems,
however they vary immensely in terms of assumptions, scope and approach.
Privacy of future use cases is typically handled vertically, instead
of building upon previous work that can be re-contextualized, while current
privacy problems are typically addressed per type in a more focused way.
Because significant effort was required to make sense of the relations and
structure of privacy-related work, this thesis attempts to transmit a structured
view of it. It is multi-disciplinary - from cryptography to economics, including
distributed systems and information theory - and addresses privacy issues of
different natures.
As existing work is framed and discussed, the contributions to the state-of-the-art
done in the scope of this thesis are presented. The contributions add to
five distinct areas: 1) identity in distributed systems; 2) future context-aware
services; 3) event-based context management; 4) low-latency information flow
control; 5) high-dimensional dataset anonymity. Finally, having laid out such
landscape of the privacy-preserving work, the current and future privacy challenges
are discussed, considering not only technical but also socio-economic
perspectives.