694 research outputs found
Analyzing Partitioned FAIR Health Data Responsibly
It is widely anticipated that the use of health-related big data will enable
further understanding and improvements in human health and wellbeing. Our
current project, funded through the Dutch National Research Agenda, aims to
explore the relationship between the development of diabetes and socio-economic
factors such as lifestyle and health care utilization. The analysis involves
combining data from the Maastricht Study (DMS), a prospective clinical study,
and data collected by Statistics Netherlands (CBS) as part of its routine
operations. However, a wide array of social, legal, technical, and scientific
issues hinder the analysis. In this paper, we describe these challenges and our
progress towards addressing them.Comment: 6 pages, 1 figure, preliminary result, project repor
Privacy Preservation and Analytical Utility of E-Learning Data Mashups in the Web of Data
Virtual learning environments contain valuable data about students that can be correlated and analyzed to optimize learning. Modern learning environments based on data mashups that collect and integrate data from multiple sources are relevant for learning analytics systems because they provide insights into students' learning. However, data sets involved in mashups may contain personal information of sensitive nature that raises legitimate privacy concerns. Average privacy preservation methods are based on preemptive approaches that limit the published data in a mashup based on access control and authentication schemes. Such limitations may reduce the analytical utility of the data exposed to gain students' learning insights. In order to reconcile utility and privacy preservation of published data, this research proposes a new data mashup protocol capable of merging and k-anonymizing data sets in cloud-based learning environments without jeopardizing the analytical utility of the information. The implementation of the protocol is based on linked data so that data sets involved in the mashups are semantically described, thereby enabling their combination with relevant educational data sources. The k-anonymized data sets returned by the protocol still retain essential information for supporting general data exploration and statistical analysis tasks. The analytical and empirical evaluation shows that the proposed protocol prevents individuals' sensitive information from re-identifying.The Spanish National Research Agency (AEI) funded this research through the project CREPES (ref. PID2020-115844RB-I00) with ERDF funds
Transformation and integration of heterogeneous health data in a privacy-preserving distributed learning infrastructure
Problem statement: A growing volume and variety of personal health data are being collected by different entities, such as healthcare providers, insurance companies, and wearable device manufacturers. Combining heterogeneous health data offers unprecedented opportunities to augment our understanding of human health and disease. However, a major challenge to research lies in the difficulty of accessing and analyzing health data that are dispersed in their format (e.g. CSV, XML), sources (e.g., medical records, laboratory data), representation (unstructured, structured), and governance (e.g., data collection and maintenance)[2]. Such considerations are crucial when we link and use personal health data across multiple legal entities with different data governance and privacy concerns
ciTIzen-centric DatA pLatform (TIDAL): Sharing Distributed Personal Data in a Privacy-Preserving Manner for Health Research
Developing personal data sharing tools and standards in conformity with data protection regulations is essential to empower citizens to control and share their health data with authorized parties for any purpose they approve. This can be, among others, for primary use in healthcare, or secondary use for research to improve human health and well-being. Ensuring that citizens are able to make fine-grained decisions about how their personal health data can be used and shared will significantly encourage citizens to participate in more health-related research. In this paper, we propose a ciTIzen-centric DatA pLatform (TIDAL) to give individuals ownership of their own data, and connect them with researchers to donate the use of their personal data for research while being in control of the entire data life cycle, including data access, storage and analysis. We recognize that most existing technologies focus on one particular aspect such as personal data storage, or suffer from executing data analysis over a large number of participants, or face challenges of low data quality and insufficient data interoperability. To address these challenges, the TIDAL platform integrates a set of components for requesting subsets of RDF (Resource Description Framework) data stored in personal data vaults based on SOcial LInked Data (Solid) technology and analyzing them in a privacy-preserving manner. We demonstrate the feasibility and efficiency of the TIDAL platform by conducting a set of simulation experiments using three different pod providers (Inrupt, Solidcommunity, Self-hosted Server). On each pod provider, we evaluated the performance of TIDAL by querying and analyzing personal health data with varying scales of participants and configurations. The reasonable total time consumption and a linear correlation between the number of pods and variables on all pod providers show the feasibility and potential to implement and use the TIDAL platform in practice. TIDAL facilitates individuals to access their personal data in a fine-grained manner and to make their own decision on their data. Researchers are able to reach out to individuals and send them digital consent directly for using personal data for health-related research. TIDAL can play an important role to connect citizens, researchers, and data organizations to increase the trust placed by citizens in the processing of personal data.publishedVersio
Computing Statistics from Private Data
In several domains, privacy presents a significant obstacle to scientific and analytic research, and limits the economic, social, health and scholastic benefits that could be derived from such research. These concerns stem from the need for privacy about personally identifiable information (PII), commercial intellectual property, and other types of information. For example, businesses, researchers, and policymakers may benefit by analyzing aggregate information about markets, but individual companies may not be willing to reveal information about risks, strategies, and weaknesses that could be exploited by competitors. Extracting valuable utility from the new “big data” economy demands new privacy technologies to overcome barriers that impede sensitive data from being aggregated and analyzed. Secure multiparty computation (MPC) is a collection of cryptographic technologies that can be used to effectively cope with some of these obstacles, and provide a new means of allowing researchers to coordinate and analyze sensitive data collections, obviating the need for data-owners to share the underlying data sets with other researchers or with each other. This paper outlines the findings that were made during interdisciplinary workshops that examined potential applications of MPC to data in the social and health sciences. The primary goals of this work are to describe the computational needs of these disciplines and to develop a specific roadmap for selecting efficient algorithms and protocols that can be used as a starting point for interdisciplinary projects between cryptographers and data scientists
Exploiting Record Similarity for Practical Vertical Federated Learning
As the privacy of machine learning has drawn increasing attention, federated
learning is introduced to enable collaborative learning without revealing raw
data. Notably, \textit{vertical federated learning} (VFL), where parties share
the same set of samples but only hold partial features, has a wide range of
real-world applications. However, existing studies in VFL rarely study the
``record linkage'' process. They either design algorithms assuming the data
from different parties have been linked or use simple linkage methods like
exact-linkage or top1-linkage. These approaches are unsuitable for many
applications, such as the GPS location and noisy titles requiring fuzzy
matching. In this paper, we design a novel similarity-based VFL framework,
FedSim, which is suitable for more real-world applications and achieves higher
performance on traditional VFL tasks. Moreover, we theoretically analyze the
privacy risk caused by sharing similarities. Our experiments on three synthetic
datasets and five real-world datasets with various similarity metrics show that
FedSim consistently outperforms other state-of-the-art baselines
Accurate training of the Cox proportional hazards model on vertically-partitioned data while preserving privacy
BACKGROUND: Analysing distributed medical data is challenging because of data sensitivity and various regulations to access and combine data. Some privacy-preserving methods are known for analyzing horizontally-partitioned data, where different organisations have similar data on disjoint sets of people. Technically more challenging is the case of vertically-partitioned data, dealing with data on overlapping sets of people. We use an emerging technology based on cryptographic techniques called secure multi-party computation (MPC), and apply it to perform privacy-preserving survival analysis on vertically-distributed data by means of the Cox proportional hazards (CPH) model. Both MPC and CPH are explained. METHODS: We use a Newton-Raphson solver to securely train the CPH model with MPC, jointly with all data holders, without revealing any sensitive data. In order to securely compute the log-partial likelihood in each iteration, we run into several technical challenges to preserve the efficiency and security of our solution. To tackle these technical challenges, we generalize a cryptographic protocol for securely computing the inverse of the Hessian matrix and develop a new method for securely computing exponentiations. A theoretical complexity estimate is given to get insight into the computational and communication effort that is needed. RESULTS: Our secure solution is implemented in a setting with three different machines, each presenting a different data holder, which can communicate through the internet. The MPyC platform is used for implementing this privacy-preserving solution to obtain the CPH model. We test the accuracy and computation time of our methods on three standard benchmark survival datasets. We identify future work to make our solution more efficient. CONCLUSIONS: Our secure solution is comparable with the standard, non-secure solver in terms of accuracy and convergence speed. The computation time is considerably larger, although the theoretical complexity is still cubic in the number of covariates and quadratic in the number of subjects. We conclude that this is a promising way of performing parametric survival analysis on vertically-distributed medical data, while realising high level of security and privacy
- …