23 research outputs found

    Privacy at Risk: Exploiting Similarities in Health Data for Identity Inference

    Smartwatches enable the efficient collection of health data that can be used for research and comprehensive analysis to improve the health of individuals. Alongside these analysis capabilities, ensuring privacy when handling health data is a critical concern as the collection and analysis of such data become pervasive. Since health data contains sensitive information, it should be handled responsibly and is therefore often anonymised. However, the data itself can also be exploited to reveal information and break anonymity. We propose a novel similarity-based re-identification attack on time-series health data and thereby unveil a significant vulnerability. Despite privacy measures that remove identifying information, our attack demonstrates that a short sample of data from various sensors of a target individual can be enough to identify them within a database of other samples, based solely on sensor-level similarities. In our example scenario, where data owners leverage health data from smartwatches, findings show that we are able to correctly link the target data in two out of three cases. User privacy is thus inherently threatened by the data itself, even when personal information is removed.
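A minimal sketch of the general idea of similarity-based linking, not the authors' method: the abstract does not name a similarity measure, so plain Euclidean distance is assumed here, and the pseudonyms, readings, and function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def link_target(target, database):
    """Return the pseudonym whose stored sample is closest to the target."""
    return min(database, key=lambda pid: euclidean_distance(target, database[pid]))


# Hypothetical heart-rate snippets (beats per minute), keyed by pseudonym.
anonymised_db = {
    "p01": [62, 64, 63, 70, 75],
    "p02": [80, 82, 85, 90, 88],
    "p03": [55, 57, 56, 60, 58],
}

# A short sample known to belong to the target individual (assumed data).
target_sample = [79, 83, 84, 91, 87]
match = link_target(target_sample, anonymised_db)
```

Here `match` names the pseudonymous record most similar to the target sample, which is the linking step the abstract describes at a high level.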

    DATA MINING AND RE-IDENTIFICATION: ANALYSIS OF DATABASE QUERY PATTERNS THAT POSE A THREAT TO ANONYMISED INFORMATION

    To maintain today's globally connected society, a number of sectors are built on the gathering and sharing of data. Personal and sensitive data are collected and shared about the individuals using the services offered by these sectors. Data controllers rely on the robustness of anonymisation measures to protect the privacy of personal and sensitive attributes in the shared dataset. Typically, the dataset is stripped of direct identifiers such as names and National Insurance (NI) numbers, such that individuals in the dataset are not uniquely identifiable. However, details in the dataset that data controllers perceive to have no negative privacy impact can be used by attackers to perform a re-identification attack. Such an attack uses the details shared in the dataset in conjunction with a secondary data source to rebuild a personally identifiable profile for individual(s) in the supposedly anonymised shared dataset. There have been a few publicised cases of re-identification attacks, but from the information reported about them it is unclear what constitutes a re-identification attack from a technical perspective, other than its outcome. The work in this thesis explores real cases of successful re-identification attacks to analyse and build a technical profile of what re-identification entails. Using the Netflix Prize Data and the re-identification of Governor William Weld as case studies, synthetic datasets are created to represent the anonymised databases shared in each of these re-identification attack cases. An exploratory study is conducted to represent re-identification attacks technically as database queries in SQL. This involves performing re-identification attacks on the synthetic databases by executing a series of SQL queries.
With the hypothesis that there is enough similarity in the patterns of SQL database queries that lead to re-identification attacks on anonymised databases, this research employs data mining techniques and machine learning algorithms to train classifiers to recognise re-identification patterns in SQL queries. Four classification algorithms, Multilayer Perceptron (MLP), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Logistic Regression (LR), are trained in this research to recognise and predict attempts at re-identification attacks. The results of the performance evaluation and unseen data testing indicate that the MLP, Multinomial Naive Bayes (MNB), and LR classifiers are most effective at recognising patterns of re-identification attacks. During performance evaluation, the MLP classifier achieved an accuracy of 100%, the MNB achieved 79.3%, and the LR achieved 100%. The unseen data testing shows that the MLP, MNB, and LR classifiers are able to predict new instances of re-identification attack attempts 79%, 71%, and 79% of the time respectively, indicating good generalisation performance. To the best of the author's knowledge, the work in this thesis is the only effort to date to automate the recognition and prediction of re-identification attack attempts on anonymised databases. The novel system developed in this research can be implemented to improve the monitoring of anonymised databases in data sharing environments.
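To illustrate what a re-identification attack expressed as an SQL query might look like, the sketch below joins a supposedly anonymised table with a secondary source on shared quasi-identifiers (ZIP code, birth date, sex), in the spirit of the Weld case. The schema, table names, and all rows are invented for illustration and are not taken from the thesis.

```python
import sqlite3

# In-memory database holding two synthetic tables (all values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medical (zip TEXT, birth_date TEXT, sex TEXT, diagnosis TEXT);
CREATE TABLE voters  (name TEXT, zip TEXT, birth_date TEXT, sex TEXT);

INSERT INTO medical VALUES
  ('10001', '1950-01-15', 'M', 'condition A'),
  ('10001', '1962-07-30', 'F', 'condition B'),
  ('10002', '1950-01-15', 'M', 'condition C');

INSERT INTO voters VALUES
  ('Person One',   '10001', '1950-01-15', 'M'),
  ('Person Two',   '10001', '1962-07-30', 'F'),
  ('Person Three', '10003', '1971-03-02', 'F');
""")

# The linkage step: match rows that agree on every quasi-identifier.
LINKAGE_QUERY = """
SELECT v.name, m.diagnosis
FROM medical AS m
JOIN voters AS v
  ON m.zip = v.zip
 AND m.birth_date = v.birth_date
 AND m.sex = v.sex
ORDER BY v.name
"""

reidentified = conn.execute(LINKAGE_QUERY).fetchall()
conn.close()
```

Each row in `reidentified` pairs a name from the secondary source with a sensitive attribute from the anonymised table; a monitoring system of the kind the abstract describes would presumably look for query patterns like this join.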

    A Scalable Blocking Framework for Multidatabase Privacy-preserving Record Linkage

    Today many application domains, such as national statistics, healthcare, business analytics, fraud detection, and national security, require data to be integrated from multiple databases. Record linkage (RL) is a process used in data integration which links multiple databases to identify matching records that belong to the same entity. RL enriches the usefulness of data by removing duplicates, errors, and inconsistencies, which improves the effectiveness of decision making in data analytics applications. Often, organisations are not willing or authorised to share the sensitive information in their databases with any other party due to privacy and confidentiality regulations. The linkage of databases of different organisations is an emerging research area known as privacy-preserving record linkage (PPRL). PPRL facilitates the linkage of databases while ensuring the privacy of the entities in these databases. In a multidatabase (MD) context, PPRL is significantly challenged by the intrinsic exponential growth in the number of potential record pair comparisons. Such linkage often requires significant time and computational resources to produce the resulting matching sets of records. Due to the increased risk of collusion, preserving the privacy of the data becomes more problematic as the number of parties involved in the linkage process increases. Blocking is commonly used to scale the linkage of large databases. The aim of blocking is to remove those record pairs that correspond to non-matches (that is, pairs that refer to different entities). Many techniques have been proposed for RL and PPRL for blocking two databases. However, many of these techniques are not suitable for blocking multiple databases. This creates a need to develop blocking techniques for the multidatabase linkage context, as real-world applications increasingly require more than two databases. This thesis is the first to conduct extensive research on blocking for multidatabase privacy-preserving record linkage (MD-PPRL).
We consider several research problems in blocking for MD-PPRL. First, we start with a broad review of the background literature on PPRL. This allows us to identify the main research gaps that need to be investigated in MD-PPRL. Second, we introduce a blocking framework for MD-PPRL which provides more flexibility and control to database owners in the block generation process. Third, we propose different techniques that are used in our framework for (1) blocking of multiple databases, (2) identifying blocks that need to be compared across subgroups of these databases, and (3) filtering redundant record pair comparisons by the efficient scheduling of block comparisons to improve the scalability of MD-PPRL. Each of these techniques covers an important aspect of blocking in real-world MD-PPRL applications. Finally, this thesis reports on an extensive evaluation of the combined application of these methods with real datasets, which illustrates that they outperform existing approaches in terms of scalability, accuracy, and privacy.
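The basic effect of blocking across more than two databases can be sketched as follows. This is plain, non-private standard blocking rather than the thesis's framework: no encoding or privacy protection is applied, and the blocking key, record fields, and database names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first letter of the surname plus the birth year."""
    return (record["surname"][:1].upper(), record["birth_year"])


def candidate_pairs(databases):
    """Return cross-database record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for db_name, records in databases.items():
        for rec in records:
            blocks[blocking_key(rec)].append((db_name, rec["id"]))

    pairs = []
    for members in blocks.values():
        for (db_a, id_a), (db_b, id_b) in combinations(members, 2):
            if db_a != db_b:  # only compare records from different databases
                pairs.append(((db_a, id_a), (db_b, id_b)))
    return pairs


# Three small synthetic databases (all values are made up).
databases = {
    "A": [
        {"id": 1, "surname": "Smith", "birth_year": 1980},
        {"id": 2, "surname": "Jones", "birth_year": 1975},
    ],
    "B": [
        {"id": 7, "surname": "Smyth", "birth_year": 1980},
        {"id": 8, "surname": "Brown", "birth_year": 1990},
    ],
    "C": [
        {"id": 4, "surname": "smith", "birth_year": 1980},
    ],
}

pairs = candidate_pairs(databases)
```

Without blocking, these three databases would yield eight cross-database pairs to compare; with this key, only the pairs inside the shared block remain as candidates.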

    Privacy-preserving E-ticketing Systems for Public Transport Based on RFID/NFC Technologies

    Pervasive digitization of the human environment has dramatically changed our everyday lives. New technologies which have become an integral part of our daily routine have deeply affected our perception of the surrounding world and have opened qualitatively new opportunities. In an urban environment, the influence of such changes is especially tangible and acute. For example, ubiquitous computing (also commonly referred to as UbiComp) is no longer a mere vision and has transformed the digital world dramatically. Pervasive use of smartphones, integration of processing power into various artefacts, and the overall miniaturization of computing devices can already be witnessed on a daily basis even by laypersons. In particular, transport, being an integral part of any urban ecosystem, has been affected by these changes. Consequently, public transport systems have undergone transformation as well and are currently evolving dynamically. In many cities around the world, so-called electronic ticketing (e-ticketing) is being used extensively for issuing travel permissions, which may result in conventional paper-based tickets being completely phased out in the near future. Opal Card in Sydney, Oyster Card in London, Touch & Travel in Germany, and many more are all examples of how well e-ticketing has been accepted both by customers and by public transport companies. Despite the numerous benefits provided by such e-ticketing systems for public transport, serious privacy concerns arise. The main reason is that using these systems may imply a dramatic multiplication of the digital traces left by individuals, also beyond the transport scope. Unfortunately, there has been little effort so far to explicitly tackle this issue. There is still not enough motivation and public pressure on industry to invest in privacy.
In academia, the majority of solutions targeted at this problem often limit the real-world pertinence of the resulting privacy-preserving concepts, because the inherent advantages of e-ticketing systems for public transport cannot be fully leveraged. This thesis aims to solve this problem by providing a privacy-preserving framework which can be used for developing e-ticketing systems for public transport with privacy protection integrated from the outset. At the same time, the advantages of e-ticketing such as fine-grained billing, flexible pricing schemes, and transparent use (which are often the main drivers for public transport companies to roll out such systems) can be retained.

    Design and Development of a Knowledge Modelling Approach to Govern the Use of Electronic Health Records for Research

    There is now increasing commitment internationally to using electronic healthcare records collected during routine care delivery to conduct clinical research. This must be rigorously controlled by an extensive set of information governance requirements defining the legal, ethical and practical guidelines to respect the privacy rights of the people about whom the records are kept, uphold the clinical profession’s duty of confidentiality and protect the interests of participants, practitioners and researchers. The development of information security policies is a highly regarded method of meeting these requirements. This is hampered by the need to interpret a complex framework of legislation and guidelines, lack of clear advice and inconsistency in authoring, interpretation and understanding amongst the people whose behaviour they are expected to guide. By using the results of several UK and European research and information platform development projects in which the author has participated and by gathering requirements from stakeholders in the clinical and research communities, this thesis defines a knowledge management representation to specify policy requirements in a computable form. The work provides the first set of knowledge requirements for governing research uses of electronic healthcare records, and a knowledge model that describes information security policies and generates a web application tool. The tool allows policy control authoring that provides a consistent, clear and unambiguous view of governance requirements to researchers and service providers. The model and tool have been evaluated in a laboratory setting to explore their effects on behaviour and understanding of invited participants when authoring policy about handling healthcare records in research and making decisions about sharing information. The work has resulted in a validation of the model and demonstrated the potential positive effects of this new approach on practice. 
The thesis makes recommendations about how the approach should be used in working practice and for educating people about information governance when performing clinical research to improve care provision.