An Efficient Rule-Hiding Method for Privacy Preserving in Transactional Databases
One of the obstacles to using data mining techniques such as association rule mining is the risk of leaking sensitive data after a database is released to the public. Therefore, the trade-off between data privacy and data mining utility is of great importance and must be managed carefully. In this study, an efficient algorithm is introduced for preserving the privacy of association rules using a distortion-based method, in which sensitive association rules are hidden through deletion and reinsertion of items in the database. To reduce the side effects on non-sensitive rules, the algorithm computes the item correlation between sensitive and non-sensitive rules and selects the item with the minimum influence on non-sensitive rules as the victim item. To reduce the degree of distortion and preserve data quality, the transactions with the highest number of sensitive items are selected for modification. The results show that the proposed algorithm performs better on non-dense real databases, with fewer side effects and less data loss, than on dense real databases. Furthermore, the results on synthetic databases are considerably better than those on real databases.
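The two heuristics described in the abstract (victim-item selection by minimum influence, then preferring transactions with the most sensitive items) can be sketched roughly as follows. This is a minimal illustration of the general distortion-based idea, not the paper's actual algorithm; all names are hypothetical.

```python
# Hypothetical sketch of one distortion-based rule-hiding pass.
# Assumptions: transactions are sets of items; rules are item sets.

def hide_sensitive_rule(db, sensitive_items, nonsensitive_rules):
    """Delete one occurrence of a 'victim' item to lower support
    of the sensitive rule while limiting side effects."""
    # Victim item: the sensitive item appearing in the fewest
    # non-sensitive rules (minimum influence on them).
    def influence(item):
        return sum(1 for rule in nonsensitive_rules if item in rule)
    victim = min(sensitive_items, key=influence)

    # Prefer transactions carrying the most sensitive items, so the
    # fewest transactions overall need to be distorted.
    candidates = sorted(
        (t for t in db if victim in t),
        key=lambda t: len(sensitive_items & t),
        reverse=True,
    )
    if candidates:
        candidates[0].discard(victim)  # remove the victim item once
    return victim
```

A full sanitizer would repeat this pass until every sensitive rule's support or confidence falls below threshold, and would also handle the reinsertion step the abstract mentions.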
Middleware-based Database Replication: The Gaps between Theory and Practice
The need for high availability and performance in data management systems has
been fueling a long running interest in database replication from both academia
and industry. However, academic groups often attack replication problems in
isolation, overlooking the need for completeness in their solutions, while
commercial teams take a holistic approach that often misses opportunities for
fundamental innovation. This has created over time a gap between academic
research and industrial practice.
This paper aims to characterize the gap along three axes: performance,
availability, and administration. We build on our own experience developing and
deploying replication systems in commercial and academic settings, as well as
on a large body of prior related work. We sift through representative examples
from the last decade of open-source, academic, and commercial database
replication systems and combine this material with case studies from real
systems deployed at Fortune 500 customers. We propose two agendas, one for
academic research and one for industrial R&D, which we believe can bridge the
gap within 5-10 years. This way, we hope to both motivate and help researchers
in making the theory and practice of middleware-based database replication more
relevant to each other.
Comment: 14 pages. Appears in Proc. ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 200
SoK: Cryptographically Protected Database Search
Protected database search systems cryptographically isolate the roles of
reading from, writing to, and administering the database. This separation
limits unnecessary administrator access and protects data in the case of system
breaches. Since protected search was introduced in 2000, the area has grown
rapidly; systems are offered by academia, start-ups, and established companies.
However, there is no best protected search system or set of techniques.
Design of such systems is a balancing act between security, functionality,
performance, and usability. This challenge is made more difficult by ongoing
database specialization, as some users will want the functionality of SQL,
NoSQL, or NewSQL databases. This database evolution will continue, and the
protected search community should be able to quickly provide functionality
consistent with newly invented databases.
At the same time, the community must accurately and clearly characterize the
tradeoffs between different approaches. To address these challenges, we provide
the following contributions:
1) An identification of the important primitive operations across database
paradigms. We find there are a small number of base operations that can be used
and combined to support a large number of database paradigms.
2) An evaluation of the current state of protected search systems in
implementing these base operations. This evaluation describes the main
approaches and tradeoffs for each base operation. Furthermore, it puts
protected search in the context of unprotected search, identifying key gaps in
functionality.
3) An analysis of attacks against protected search for different base
queries.
4) A roadmap and tools for transforming a protected search system into a
protected database, including an open-source performance evaluation platform
and initial user opinions of protected search.
Comment: 20 pages, to appear in IEEE Security and Privacy
Impacts of frequent itemset hiding algorithms on privacy preserving data mining
Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2010. Includes bibliographical references (leaves: 54-58). Text in English; Abstract: Turkish and English. x, 69 leaves.
The rapid growth of computer capabilities and the collection of large amounts of data in recent years have made data mining a popular analysis tool. Association rules (frequent itemsets), classification, and clustering are the main methods used in data mining research. The first part of this thesis is the implementation and comparison of two frequent itemset mining algorithms that work without candidate itemset generation: Matrix Apriori and FP-Growth. Comparison of these algorithms revealed that Matrix Apriori has higher performance owing to its faster data structure. One of the great challenges of data mining is finding hidden patterns without violating data owners' privacy, and privacy preserving data mining came into prominence as a solution. In the second part of the thesis, the Matrix Apriori algorithm is modified and a frequent itemset hiding framework is developed. Four frequent itemset hiding algorithms are proposed such that: i) all versions work without pre-mining, so the privacy breach caused by the knowledge obtained by finding frequent itemsets is prevented in advance; ii) efficiency is increased since no pre-mining is required; iii) supports are found during the hiding process, and at the end the sanitized dataset and the frequent itemsets of this dataset are given as outputs, so no post-mining is required; iv) the heuristics use pattern lengths rather than transaction lengths, eliminating the possibility of distorting more valuable data.
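For readers unfamiliar with the mining step being hidden against, the notion of frequent itemset support can be illustrated with a deliberately naive miner. This candidate-generation style is exactly what Matrix Apriori and FP-Growth avoid; the sketch below only shows what "frequent itemset" and "support" mean, and its names are illustrative.

```python
# Naive frequent-itemset miner, for illustration only. Support of an
# itemset = number of transactions containing all of its items.

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    k = 1
    current = [frozenset([i]) for i in items]
    while current:
        # Count support of each candidate in one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        # Join step: build (k+1)-candidates from frequent k-itemsets.
        k += 1
        current = list({a | b for a in frequent for b in frequent
                        if len(a | b) == k})
    return result
```

Hiding an itemset then means distorting the data until its support here drops below `min_support`, ideally without pushing any non-sensitive itemset below threshold.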
Association rule hiding using integer linear programming
Privacy preserving data mining has become the focus of attention of government statistical agencies and the database security research community, who are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding has emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of the experimental evaluation of the proposed approach on real-life datasets indicate the promising performance of the approach in terms of side effects on the original database.
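The core decision problem behind the formulation above (choose the fewest transactions to sanitize so each sensitive itemset's support drops below a threshold) can be made concrete on a toy instance. The sketch below solves it by brute force rather than with an actual ILP solver, and its function names are assumptions, not the paper's formulation.

```python
# Toy version of constraint-based sanitization: pick the minimum
# number of transactions to modify so that every sensitive itemset's
# support falls to max_support or below. Brute force stands in for
# the integer linear program; feasible only for tiny instances.
from itertools import combinations

def sanitize(transactions, sensitive, max_support):
    """Return a minimum-cardinality set of transaction indices to sanitize."""
    n = len(transactions)
    for size in range(n + 1):  # smallest objective value first
        for chosen in combinations(range(n), size):
            ok = True
            for s in sensitive:
                # Support of s after removing s's items from chosen rows.
                support = sum(
                    1 for i, t in enumerate(transactions)
                    if i not in chosen and s <= t
                )
                if support > max_support:
                    ok = False
                    break
            if ok:
                return set(chosen)
    return set(range(n))
```

In the real formulation, each transaction gets a binary decision variable, the objective minimizes (weighted) modifications, and the support constraints are linear inequalities over those variables; a solver replaces the enumeration.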
Survey of Privacy-Preserving Data Publishing Methods and Speedy: a multi-threaded algorithm preserving k-anonymity
Nowadays, many organizations, enterprises or public services collect and manage
a vast amount of personal information. Typical examples of such datasets
include clinical tests conducted in hospitals, query logs held by search
engines, social data produced by social networks, financial data from public
sector information systems etc. These datasets often need to be published for
research or statistical studies without revealing sensitive information of the
individuals they describe. The anonymization process is more complicated than
hiding attributes that can directly identify an individual (name, SSN etc.)
from the published dataset. Even without these attributes an adversary can
cause privacy leakage by cross-linking with other publicly available datasets
or having some sort of background knowledge. Therefore, privacy preservation in
data publishing has gained considerable attention during recent years with
several privacy models proposed in the literature. In this thesis, we discuss
the most common attacks that can be made on published datasets and we present
state-of-the-art privacy guarantees and anonymization algorithms to counter
these attacks. Furthermore, we propose a novel multi-threaded anonymization
algorithm which exploits the capabilities of modern CPUs to speed up the
anonymization process, achieving k-anonymity in the anonymized dataset.
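The k-anonymity guarantee the abstract targets is easy to state in code: every combination of quasi-identifier values must occur at least k times. The sketch below checks this property, with partitions verified in parallel threads as a nod to the multi-threaded design; it is an illustration, not the thesis's Speedy algorithm, and all names are hypothetical.

```python
# Illustrative k-anonymity check. A record is a dict of column -> value;
# quasi-identifiers are the columns an adversary could cross-link on.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def is_k_anonymous(records, quasi_identifiers, k):
    """True iff every quasi-identifier combination appears >= k times."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())

def check_partitions(partitions, quasi_identifiers, k, workers=4):
    """Verify k-anonymity of several dataset partitions concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda p: is_k_anonymous(p, quasi_identifiers, k), partitions
        )
    return all(results)
```

An anonymizer generalizes or suppresses quasi-identifier values (e.g. ZIP 12345 to 1****, exact age to a range) until this check passes, trying to lose as little information as possible along the way.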