
    A Classification Framework for Imbalanced Data

    As information technology advances, the demand for reliable and highly accurate predictive models is increasing in many domains. Traditional classification algorithms can be limited in their performance on highly imbalanced data sets. In this dissertation, we study two common problems that arise when training data is imbalanced, and propose effective algorithms to solve them. First, we investigate the problem of building a multi-class classification model from an imbalanced class distribution. We develop an effective technique to improve the performance of the model by formulating the problem as a multi-class SVM with an objective to maximize the G-mean value. A ramp loss function is used to simplify and solve the problem. Experimental results on multiple real-world datasets confirm that our new method can effectively solve the multi-class classification problem when the datasets are highly imbalanced. Second, we explore the problem of learning a global classification model from distributed data sources with privacy constraints. In this problem, not only do the data sources have different class distributions, but combining the data into one central repository is also prohibited. We propose a privacy-preserving framework for building a global SVM from distributed data sources. Our new framework avoids constructing a global kernel matrix by mapping non-linear inputs to a linear feature space and then solving a distributed linear SVM from these virtual points. Our method can solve both the imbalance and privacy problems while achieving the same level of accuracy as a regular SVM. Finally, we extend our framework to handle high-dimensional data by utilizing Generalized Multiple Kernel Learning to select a sparse combination of features and kernels. This new model produces a smaller set of features, but yields much higher accuracy.
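    As a rough, self-contained illustration of the two ingredients named above (not the dissertation's actual formulation), the sketch below computes the G-mean of per-class recalls and a ramp loss, i.e. a hinge loss truncated at 1 - s so that badly misclassified points contribute only a bounded penalty:

```python
import numpy as np

def g_mean(y_true, y_pred, classes):
    """Geometric mean of per-class recalls: the quantity the model maximizes."""
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def ramp_loss(margin, s=-1.0):
    """Hinge loss max(0, 1 - margin) truncated at 1 - s (here 2), so a single
    extreme outlier cannot dominate the objective."""
    return np.clip(1.0 - margin, 0.0, 1.0 - s)

y_true = np.array([0, 0, 0, 0, 1, 1])          # imbalanced toy labels
y_pred = np.array([0, 0, 0, 0, 1, 0])
print(g_mean(y_true, y_pred, classes=[0, 1]))  # ~0.707: the minority-class miss is penalized
print(ramp_loss(np.array([-5.0, 0.2, 2.0])))   # [2.0, 0.8, 0.0]
```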

    Privacy-Preserving Methods for Sharing Financial Risk Exposures

    Unlike other industries in which intellectual property is patentable, the financial industry relies on trade secrecy to protect its business processes and methods, which can obscure critical financial risk exposures from regulators and the public. We develop methods for sharing and aggregating such risk exposures that protect the privacy of all parties involved, without the need for a trusted third party. Our approach employs secure multi-party computation techniques from cryptography, in which multiple parties are able to compute joint functions without revealing their individual inputs. In our framework, individual financial institutions evaluate a protocol on their proprietary data that cannot be inverted, leading to secure computations of real-valued statistics such as concentration indexes, pairwise correlations, and other single- and multi-point statistics. The proposed protocols are computationally tractable on realistic sample sizes. Potential financial applications include: the construction of privacy-preserving real-time indexes of bank capital and leverage ratios; the monitoring of delegated portfolio investments; financial audits; and the publication of new indexes of proprietary trading strategies.
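    The core primitive here, computing a joint function without revealing individual inputs, can be sketched with plain additive secret sharing, a standard MPC building block; the function names and three-bank setup below are illustrative, not the paper's actual protocols:

```python
import secrets

MOD = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(value, n):
    """Split an integer into n additive shares, each individually uniform."""
    parts = [secrets.randbelow(MOD) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % MOD)
    return parts

def secure_sum(private_inputs):
    """Each party distributes shares of its input; each party then sums the
    shares it holds. Only the final total is ever reconstructed."""
    n = len(private_inputs)
    shares = [share(v, n) for v in private_inputs]
    partials = [sum(shares[j][i] for j in range(n)) % MOD for i in range(n)]
    return sum(partials) % MOD

bank_exposures = [120, 455, 310]   # each bank keeps its exposure private
assert secure_sum(bank_exposures) == sum(bank_exposures)
```

    A concentration index such as a Herfindahl index could then, for example, be assembled from securely aggregated sums of exposures and of squared exposures, without any bank revealing its own figure.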

    Privacy-preserving distributed data mining

    This thesis is concerned with privacy-preserving distributed data mining algorithms. The main challenges in this setting are inference attacks and the formation of collusion groups. The inference problem is the reconstruction of sensitive data by attackers from non-sensitive sources, such as intermediate results, exchanged messages, or public information. Moreover, in a distributed scenario, malicious insiders can organize collusion groups to deploy more effective inference attacks. This thesis shows that existing privacy measures do not adequately protect privacy against inference and collusion. Therefore, new measures based on information theory are developed to overcome the identified limitations. Furthermore, a new distributed data clustering algorithm is presented. The clustering approach is based on a kernel density estimate approximation that generates a controlled amount of ambiguity in the density estimates and thereby provides privacy for the original data. In addition, this thesis introduces the first privacy-preserving algorithms for frequent pattern discovery in distributed time series. Time series are transformed into a set of n-dimensional data points, and finding frequent patterns is reduced to finding local maxima in the n-dimensional density space. The proposed algorithms are linear in the size of the dataset and have low communication costs, as validated by experimental evaluation on different datasets.
    This thesis addresses privacy-preserving data mining in distributed environments, with a focus on selected n-agent attack scenarios for the inference problem in data clustering and time-series analysis. These are attacks by individual agents or subgroups of agents within a distributed data mining group, or by a single agent outside that group. First, two new privacy measures are presented which, in contrast to existing ones, satisfy the properties generally required for privacy preservation in distributed data mining, and whose measured degree of privacy relates to the data analysis method used and to the number of attackers. For privacy-preserving distributed data clustering, a new kernel-density-estimation-based method called KDECS is presented. KDECS uses an approximation of the original local kernel density estimates, so that the original data of other agents in the data mining group can no longer be reconstructed with a probability above a predefined threshold. The method is provably more secure than data clustering with generative mixture models and SMC-based secure k-means clustering. In addition, we present new methods, called DPD-TS, DPD-HE, and DPD-FS, for privacy-preserving distributed pattern discovery in time series, whose complexity and degree of security are analyzed with the new privacy measures mentioned above. The minimum degree of security of DPD-TS and DPD-FS, specified individually by the agents of a data mining group, depends only on the dimensionality reduction of the time-series values and their discretization, and can easily be verified. The DPD-HE method offers even stronger protection of sensitive data by means of homomorphic encryption. In addition to the theoretical analysis, experimental performance evaluations of the developed methods were carried out on various publicly available datasets.
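    A minimal sketch of the idea behind KDE-based private clustering, assuming a 1-D toy dataset (this is not the KDECS algorithm itself): only a density estimate evaluated on a coarse grid is shared, so individual points cannot be read off, and cluster centres are recovered as local maxima of the density:

```python
import numpy as np

def grid_kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate evaluated only on a coarse grid;
    the grid resolution controls the ambiguity about the raw points."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 0.8, 300)])
grid = np.linspace(-5, 6, 45)     # coarse grid = controlled information loss
dens = grid_kde(data, grid, bandwidth=0.5)
modes = grid[1:-1][(dens[1:-1] > dens[:-2]) & (dens[1:-1] > dens[2:])]
print(modes)                      # local maxima near the true centres -2 and 3
```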

    Development of Privacy-Preserving Machine Learning Techniques to Protect Sensitive Information

    Doctoral dissertation, Seoul National University, Department of Industrial Engineering, College of Engineering, August 2022 (advisor: Jaewook Lee). Recent development of artificial intelligence systems has been driven by various factors, such as the development of new algorithms and the explosive increase in the amount of available data. In real-world scenarios, individuals or corporations benefit by providing data for training a machine learning model, or by providing the trained model itself. However, it has been revealed that sharing data or models can lead to invasions of personal privacy by leaking sensitive personal information. In this dissertation, we focus on developing privacy-preserving machine learning methods that can protect sensitive information, using two techniques: homomorphic encryption and differential privacy. Homomorphic encryption can protect the privacy of the data and the models, because machine learning algorithms can be applied directly to encrypted data, but it requires much larger computation time than conventional operations. For efficient computation, we take two approaches. The first is to reduce the amount of computation in the training phase. We present an efficient training algorithm that encrypts only a small amount of the most important information. Specifically, we develop a ridge regression algorithm that greatly reduces the amount of computation when one or two sensitive variables are encrypted. Furthermore, we extend the method to classification problems by developing a new logistic regression algorithm that maximally excludes the hyper-parameter search, which is not suitable for machine learning with homomorphic encryption. The second approach is to apply homomorphic encryption only when the trained model is used for inference, which prevents direct exposure of the test data and the model information. We propose a homomorphic-encryption-friendly algorithm for inference in support vector clustering.
    Although homomorphic encryption can prevent various threats to the data and the model information, it cannot defend against secondary attacks through inference APIs. It has been reported that an adversary can extract information about the training data using only his or her own input and the corresponding output of the model. For instance, the adversary can determine whether specific data is included in the training data or not. Differential privacy is a mathematical concept which guarantees a defense against such attacks by reducing the impact of any specific data sample on the trained model. Differential privacy has the advantage of being able to quantitatively express the degree of privacy, but it reduces the utility of the model because randomness must be added to the algorithm. Therefore, we propose a novel method, based on Morse theory, which improves the utility of differentially private clustering algorithms while maintaining their privacy. The privacy-preserving machine learning methods proposed in this dissertation protect privacy at different levels and thus complement each other. We expect that our methods can be combined into an integrated system and applied to various domains where machine learning involves sensitive personal information.
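    As a small illustration of the privacy-utility trade-off that differential privacy introduces (a generic Laplace-mechanism example, not the dissertation's Morse-theory construction):

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """epsilon-DP release of a mean: clipping bounds each record's influence,
    so the sensitivity of the mean is (upper - lower) / n."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(7)
incomes = rng.lognormal(mean=10, sigma=0.5, size=5_000)
for eps in (0.1, 1.0, 10.0):          # smaller epsilon = stronger privacy, more noise
    print(eps, dp_mean(incomes, 0, 100_000, eps, rng))
```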

    Accurate training of the Cox proportional hazards model on vertically-partitioned data while preserving privacy

    BACKGROUND: Analysing distributed medical data is challenging because of data sensitivity and the various regulations governing access to and combination of data. Some privacy-preserving methods are known for analysing horizontally-partitioned data, where different organisations have similar data on disjoint sets of people. Technically more challenging is the case of vertically-partitioned data, where organisations hold different attributes for overlapping sets of people. We use an emerging technology based on cryptographic techniques called secure multi-party computation (MPC), and apply it to perform privacy-preserving survival analysis on vertically-distributed data by means of the Cox proportional hazards (CPH) model. Both MPC and CPH are explained. METHODS: We use a Newton-Raphson solver to securely train the CPH model with MPC, jointly with all data holders, without revealing any sensitive data. Securely computing the log-partial likelihood in each iteration raises several technical challenges for preserving the efficiency and security of our solution. To tackle these challenges, we generalise a cryptographic protocol for securely computing the inverse of the Hessian matrix and develop a new method for securely computing exponentiations. A theoretical complexity estimate is given to provide insight into the computational and communication effort that is needed. RESULTS: Our secure solution is implemented in a setting with three different machines, each representing a different data holder, which can communicate through the internet. The MPyC platform is used for implementing this privacy-preserving solution to obtain the CPH model. We test the accuracy and computation time of our methods on three standard benchmark survival datasets. We identify future work to make our solution more efficient. CONCLUSIONS: Our secure solution is comparable with the standard, non-secure solver in terms of accuracy and convergence speed. The computation time is considerably larger, although the theoretical complexity is still cubic in the number of covariates and quadratic in the number of subjects. We conclude that this is a promising way of performing parametric survival analysis on vertically-distributed medical data, while realising a high level of security and privacy.
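    For reference, the computation that the protocol distributes can be written down in a few lines in the clear. The sketch below is a plain, non-secure Newton-Raphson solver for the Cox partial likelihood, assuming no tied event times; the risk-set sums, exponentiations, and Hessian inverse appearing here are exactly the pieces the paper evaluates under MPC:

```python
import numpy as np

def cox_newton(times, events, X, iterations=8):
    """Newton-Raphson on the Cox log-partial likelihood (no ties assumed).
    Sorting by descending time makes every risk set a prefix of the arrays."""
    order = np.argsort(-times)
    times, events, X = times[order], events[order], X[order]
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iterations):
        w = np.exp(X @ beta)                                  # relative hazards
        S0 = np.cumsum(w)                                     # risk-set totals
        S1 = np.cumsum(w[:, None] * X, axis=0)                # weighted covariate sums
        S2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
        grad, hess = np.zeros(p), np.zeros((p, p))
        for i in np.where(events == 1)[0]:
            xbar = S1[i] / S0[i]
            grad += X[i] - xbar
            hess -= S2[i] / S0[i] - np.outer(xbar, xbar)
        beta -= np.linalg.solve(hess, grad)                   # Newton update
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
times = rng.exponential(scale=np.exp(-X @ np.array([0.8, -0.5])))
events = np.ones(100, dtype=int)          # no censoring in this toy example
print(cox_newton(times, events, X))       # roughly recovers [0.8, -0.5]
```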

    Zero-knowledge Proof Meets Machine Learning in Verifiability: A Survey

    With the rapid advancement of artificial intelligence technology, the use of machine learning models is gradually becoming part of our daily lives. High-quality models rely not only on efficient optimization algorithms but also on training and learning processes built upon vast amounts of data and computational power. However, in practice, due to various challenges such as limited computational resources and data privacy concerns, users in need of models often cannot train machine learning models locally. This has led them to explore alternative approaches such as outsourced learning and federated learning. While these methods address the feasibility of model training effectively, they introduce concerns about the trustworthiness of the training process, since computations are not performed locally. Similar trustworthiness issues arise with outsourced model inference. These two problems can be summarized as the trustworthiness problem of model computations: how can one verify that the results computed by other participants are derived according to the specified algorithm, model, and input data? To address this challenge, verifiable machine learning (VML) has emerged. This paper presents a comprehensive survey of zero-knowledge proof-based verifiable machine learning (ZKP-VML) technology. We first analyze the potential verifiability issues that may exist in different machine learning scenarios. Subsequently, we provide a formal definition of ZKP-VML. We then conduct a detailed analysis and classification of existing works based on their technical approaches. Finally, we discuss the key challenges and future directions in the field of ZKP-based VML.

    Privacy-Preserving Crowdsourcing-Based Recommender Systems for E-Commerce & Health Services

    Our society lives in an age where the eagerness for information has resulted in problems such as infobesity, especially after the arrival of Web 2.0. In this context, automatic systems such as recommenders are increasing in relevance, since they help to distinguish noise from useful information and to optimize decision making, for example in the field of e-commerce. However, recommender systems such as Collaborative Filtering have several limitations, such as non-response and privacy. An important part of this thesis is devoted to the development of methodologies to cope with these limitations. In addition to the previously stated research topics, in this dissertation we also focus on the worldwide process of urbanisation that is taking place and the need for more sustainable and liveable cities. In this context, we propose smart health (s-health) solutions and efficient wireless channel characterisation methodologies, in order to provide sustainable healthcare in the context of smart cities.

    PRIVACY PRESERVING DATA MINING FOR NUMERICAL MATRICES, SOCIAL NETWORKS, AND BIG DATA

    Motivated by increasing public awareness of possible abuse of confidential information, which is considered a significant hindrance to the development of e-society and of the medical and financial markets, a privacy-preserving data mining framework is presented so that data owners can carefully process data in order to preserve confidential information while guaranteeing information functionality within an acceptable boundary. First, among many privacy-preserving methodologies, a popular class of data perturbation methods achieves a balance between data utility and information privacy by adding a noise signal, following a statistical distribution, to an original numerical matrix. With the help of an analysis in the eigenspace of the perturbed data, the potential privacy vulnerability of a popular data perturbation method is analyzed in the presence of very little information leakage from privacy-preserving databases. The vulnerability to very little data leakage is theoretically proved and experimentally illustrated. Second, in addition to numerical matrices, social networks play a critical role in the modern e-society. Security and privacy in social networks have received a lot of attention because of recent security scandals among some popular social network service providers. The need to protect confidential information from being disclosed motivates us to develop multiple privacy-preserving techniques for social networks. Affinities (or weights) attached to edges are private and can lead to personal security leakage. To protect the privacy of social networks, several algorithms are proposed, including Gaussian perturbation, a greedy algorithm, and a probabilistic random walk algorithm. They can quickly modify the original data in large-scale settings to satisfy different privacy requirements. Third, the era of big data is approaching in both industry and academia, as the quantity of collected data increases exponentially. Three issues are studied in the age of big data with privacy preservation: obtaining high confidence in the accuracy of specific differentially private queries; speedily and accurately updating a private summary of a binary stream with I/O-awareness; and launching mutual private information retrieval for big data. All three issues are handled by two core tools: differential privacy and the Chernoff bound.
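    The eigenspace vulnerability described above can be illustrated in a few lines, assuming the original matrix is approximately low-rank (a toy demonstration, not the dissertation's analysis): spectral filtering of the perturbed matrix recovers the data far more closely than the nominal noise level suggests:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))   # rank-3 "true" data
perturbed = data + rng.normal(scale=1.0, size=data.shape)     # published matrix

# Attacker projects onto the top singular directions, filtering most of the noise.
u, s, vt = np.linalg.svd(perturbed, full_matrices=False)
estimate = (u[:, :3] * s[:3]) @ vt[:3]

print(np.linalg.norm(perturbed - data))   # ~100: nominal perturbation magnitude
print(np.linalg.norm(estimate - data))    # far smaller: the noise mostly cancelled
```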

    Continuous Release of Data Streams under both Centralized and Local Differential Privacy

    In this paper, we study the problem of publishing a stream of real-valued data satisfying differential privacy (DP). One major challenge is that the maximal possible value can be quite large; it is therefore necessary to estimate a threshold so that numbers above it are truncated, reducing the amount of noise that must be added to the data. The estimation must be done on the data in a private fashion. We develop such a method using the Exponential Mechanism with a quality function that approximates the utility goal well while maintaining low sensitivity. Given the threshold, we then propose a novel online hierarchical method and several post-processing techniques. Building on these ideas, we formalize the steps into a framework for the private publishing of stream data. Our framework consists of three components: a threshold optimizer that privately estimates the threshold, a perturber that adds calibrated noise to the stream, and a smoother that improves the result using post-processing. Within our framework, we design an algorithm satisfying the more stringent setting of DP called local DP (LDP). To our knowledge, this is the first LDP algorithm for publishing streaming data. Using four real-world datasets, we demonstrate that our mechanism outperforms the state of the art by 6-10 orders of magnitude in terms of utility (measured by the mean squared error of answering a random range query).
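    A heavily simplified sketch of just the perturber component, assuming the threshold has already been chosen (the paper's threshold optimizer and hierarchical smoother are omitted): truncating at the threshold theta caps each value's sensitivity, so Laplace noise of scale theta/epsilon suffices per release:

```python
import numpy as np

def perturb_stream(stream, theta, epsilon, rng):
    """Truncate each incoming value at theta, then add Laplace noise calibrated
    to sensitivity theta. Larger theta = less truncation bias, more noise."""
    return [min(v, theta) + rng.laplace(0.0, theta / epsilon) for v in stream]

rng = np.random.default_rng(42)
stream = rng.lognormal(mean=2.0, sigma=1.5, size=8)   # heavy-tailed input stream
print(perturb_stream(stream, theta=50.0, epsilon=1.0, rng=rng))
```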

    A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage

    Today many application domains, such as national statistics, healthcare, business analytics, fraud detection, and national security, require data to be integrated from multiple databases. Record linkage (RL) is a process used in data integration which links multiple databases to identify matching records that belong to the same entity. RL enriches the usefulness of data by removing duplicates, errors, and inconsistencies, which improves the effectiveness of decision making in data analytics applications. Often, organisations are not willing or authorised to share the sensitive information in their databases with any other party due to privacy and confidentiality regulations. The linkage of databases of different organisations is an emerging research area known as privacy-preserving record linkage (PPRL). PPRL facilitates the linkage of databases while ensuring the privacy of the entities in these databases. In the multidatabase (MD) context, PPRL is significantly challenged by the intrinsic exponential growth in the number of potential record pair comparisons. Such linkage often requires significant time and computational resources to produce the resulting sets of matching records. Preserving the privacy of the data also becomes more problematic as the number of parties involved in the linkage process increases, due to the increased risk of collusion. Blocking is commonly used to scale the linkage of large databases. The aim of blocking is to remove those record pairs that correspond to non-matches (i.e., that refer to different entities). Many blocking techniques have been proposed for RL and PPRL on two databases; however, many of them are not suitable for blocking multiple databases. This creates a need to develop blocking techniques for the multidatabase linkage context, as real-world applications increasingly require more than two databases. This thesis is the first to conduct extensive research on blocking for multidatabase privacy-preserving record linkage (MD-PPRL). We consider several research problems in blocking for MD-PPRL. First, we start with a broad background literature review on PPRL, which allows us to identify the main research gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking framework for MD-PPRL which provides more flexibility and control to database owners in the block generation process. Third, we propose different techniques that are used in our framework for (1) blocking multiple databases, (2) identifying blocks that need to be compared across subgroups of these databases, and (3) filtering redundant record pair comparisons by the efficient scheduling of block comparisons, to improve the scalability of MD-PPRL. Each of these techniques covers an important aspect of blocking in real-world MD-PPRL applications. Finally, this thesis reports on an extensive evaluation of the combined application of these methods on real datasets, which illustrates that they outperform existing approaches in terms of scalability, accuracy, and privacy.
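    To make the role of blocking concrete, here is a minimal, non-private standard-blocking sketch using simplified Soundex keys; the thesis's MD-PPRL techniques operate on privacy-preserving block representations instead, but the effect is the same: only records sharing a block key are ever compared:

```python
from collections import defaultdict

def soundex(name):
    """Simplified Soundex phonetic code, a common blocking key for names."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, last = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            out += code
        last = code
    return (out + "000")[:4]

def block(records):
    """Group record ids by blocking key; candidate pairs are generated only
    within blocks, removing most non-match comparisons up front."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[soundex(name)].append(rec_id)
    return dict(blocks)

print(block([(1, "Smith"), (2, "Smyth"), (3, "Jones")]))  # {'S530': [1, 2], 'J520': [3]}
```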