
    Data mining framework

    The purpose of this document is to build a framework for working with clinical data. Vast amounts of clinical records, stored in health repositories, contain information that can be used to improve the quality of health care. However, the information generated from these records depends largely on the manner in which the data is arranged, and a number of factors need to be considered before information can be extracted from the patient records. This document deals with the preparation of a framework for the data before it can be mined. One issue to address is that clinical records contain information about the patient that can be used for identification purposes; a means to create anonymous records is discussed in this document. Once the records have been de-identified, they can be used for data mining. In addition to storing patient records, the document also discusses the possibility of 'abstracting' information from these documents and storing it in the repository. Information generated from the combination of patient records and abstracted information could be used to improve the quality of health care. The document also discusses the possibility of creating a means to query information from the data repository, and describes a prototype application which provides all these facilities in a form that can be accessed from any remote location. In addition, the prospect of using the Clinical Document Architecture format to store the clinical records is explored.
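
    The de-identification step described above can be sketched in a few lines of Python. This is a minimal illustration, not the system from the document; the field names and the phone-number pattern are assumptions.

        import re

        # Hypothetical direct identifiers to drop outright (field names assumed).
        DIRECT_IDENTIFIERS = {"name", "phone", "ssn", "address", "email"}

        def deidentify(record: dict) -> dict:
            """Return a copy of a patient record with direct identifiers removed
            and free-text notes scrubbed of phone-number-like digit runs."""
            clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
            if "notes" in clean:
                # Crude scrub: mask digit runs that look like phone numbers.
                clean["notes"] = re.sub(r"\b\d{3}[-.\s]?\d{3,4}[-.\s]?\d{0,4}\b",
                                        "[REDACTED]", clean["notes"])
            return clean

        record = {"name": "Jane Doe", "phone": "555-123-4567",
                  "diagnosis": "hypertension",
                  "notes": "Follow-up call 555-123-4567 next week."}
        print(deidentify(record))  # identifiers dropped, notes scrubbed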

    A Prototype For Learning Privacy-Preserving Data Publishing

    Governments, organizations, companies, and individuals collect data every day that can later be processed with data mining methods to derive new knowledge, and the party processing the data is not necessarily the party that collected it. Because the trustworthiness of the data recipient is often unknown, data must be anonymized before release, in such a way that individuals can no longer be identified from the published data sets. This master's thesis discusses different threats to privacy, describes and compares privacy-preserving methods that mitigate these threats, and gives an overview of possible implementations of these methods. The other output of this thesis is educational software that allows students to learn and practice privacy-preserving methods. The final part of the thesis is a validation of the designed software.
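
    The abstract does not name the specific methods the teaching tool covers, but k-anonymity is a standard technique in this area; as a hedged illustration, checking the k-anonymity level of an already-generalized table takes only a few lines of Python:

        from collections import Counter

        def k_anonymity(rows, quasi_identifiers):
            """Smallest equivalence-class size over the quasi-identifier columns;
            a table is k-anonymous if every combination occurs at least k times."""
            groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
            return min(groups.values())

        rows = [
            {"zip": "537**", "age": "30-39", "disease": "flu"},
            {"zip": "537**", "age": "30-39", "disease": "asthma"},
            {"zip": "537**", "age": "40-49", "disease": "flu"},
            {"zip": "537**", "age": "40-49", "disease": "cancer"},
        ]
        print(k_anonymity(rows, ["zip", "age"]))  # -> 2, i.e. 2-anonymous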

    The VISTA Science Archive

    We describe the VISTA Science Archive (VSA) and its first public release of data from five of the six VISTA Public Surveys. The VSA exists to support the VISTA Surveys through their lifecycle: the VISTA Public Survey consortia can use it during their quality control assessment of survey data products before submission to the ESO Science Archive Facility (ESO SAF); it supports their exploitation of survey data prior to its publication through the ESO SAF; and, subsequently, it provides the wider community with survey science exploitation tools that complement the data product repository functionality of the ESO SAF. This paper has been written in conjunction with the first public release of public survey data through the VSA and is designed to help its users understand the data products available and how the functionality of the VSA supports their varied science goals. We describe the design of the database and outline the database-driven curation processes that take data from nightly pipeline-processed and calibrated FITS files to create science-ready survey datasets. Much of this design, and the codebase implementing it, derives from our earlier WFCAM Science Archive (WSA), so this paper concentrates on the VISTA-specific aspects and on improvements made to the system in the light of experience gained in operating the WSA.
    Comment: 22 pages, 16 figures. Minor edits to fonts and typos after sub-editing. Published in A&A.
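
    The survey-exploitation tools mentioned above include an SQL interface to the curated catalogue tables. Purely as an illustration (the table and column names below follow WSA/VSA naming conventions but are assumptions, not taken from this paper), a box search against a merged source table might be composed like this:

        # Illustrative only: vhsSource and the magnitude columns are assumed
        # names in the WSA/VSA style, not verified against the live schema.
        ra0, dec0, r = 180.0, -30.0, 0.1  # field centre and half-width, degrees

        sql = f"""
        SELECT sourceID, ra, dec, japermag3, kspermag3
        FROM   vhsSource
        WHERE  ra  BETWEEN {ra0 - r} AND {ra0 + r}
          AND  dec BETWEEN {dec0 - r} AND {dec0 + r}
        """
        print(sql)  # paste into the archive's freeform-SQL query form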

    A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

    Many interesting data sets available on the Internet are of a medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality.
    Comment: 30 pages, plus supplementary material.
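
    The paper's grammar is implemented for R, but the extract-transform-load pattern it formalizes, pulling a remote file through a transformation into a disk-backed database, can be sketched in Python with the standard library (the URL and the two-column schema here are hypothetical):

        import csv, io, sqlite3, urllib.request

        def etl(url: str, db_path: str = "medium.db") -> None:
            """Extract a remote CSV, transform rows lazily, and load them into
            SQLite so analyses query the disk-backed table instead of RAM."""
            con = sqlite3.connect(db_path)
            con.execute("CREATE TABLE IF NOT EXISTS obs (id TEXT, value REAL)")
            with urllib.request.urlopen(url) as resp:                    # extract
                reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
                rows = ((r["id"], float(r["value"])) for r in reader)    # transform
                con.executemany("INSERT INTO obs VALUES (?, ?)", rows)   # load
            con.commit()
            con.close()

        # etl("https://example.org/data.csv")  # hypothetical source and schema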

    Technical Privacy Metrics: a Systematic Survey

    The goal of privacy metrics is to measure the degree of privacy enjoyed by users in a system and the amount of protection offered by privacy-enhancing technologies. In this way, privacy metrics contribute to improving user privacy in the digital world. The diversity and complexity of privacy metrics in the literature make an informed choice of metrics challenging. As a result, instead of using existing metrics, new metrics are proposed frequently, and privacy studies are often incomparable. In this survey we alleviate these problems by structuring the landscape of privacy metrics. To this end, we explain and discuss a selection of over eighty privacy metrics and introduce categorizations based on the aspect of privacy they measure, their required inputs, and the type of data that needs protection. In addition, we present a method for choosing privacy metrics, based on nine questions that help identify the right privacy metrics for a given scenario, and we highlight topics where additional work on privacy metrics is needed. Our survey spans multiple privacy domains and can be understood as a general framework for privacy measurement.
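
    As a flavour of what such metrics look like (a generic textbook example, not code from the survey), the Shannon entropy of the adversary's probability distribution over an anonymity set is one classic uncertainty-based metric:

        import math

        def anonymity_entropy(probabilities):
            """Shannon entropy (bits) of the adversary's distribution over
            candidate users; more uncertainty means more privacy."""
            return -sum(p * math.log2(p) for p in probabilities if p > 0)

        uniform = [0.25] * 4               # adversary cannot tell 4 users apart
        skewed  = [0.85, 0.05, 0.05, 0.05] # adversary strongly suspects one user
        print(anonymity_entropy(uniform))  # 2.0 bits, the maximum for 4 users
        print(anonymity_entropy(skewed))   # ~0.85 bits, far less protection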

    Computational disclosure control: a primer on data privacy protection

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2001. Includes bibliographical references (leaves 213-216) and index.
    Today's globally networked society places great demand on the dissemination and sharing of person-specific data for many new and exciting uses. When these data are linked together, they provide an electronic shadow of a person or organization that is as identifying and personal as a fingerprint, even when the information contains no explicit identifiers such as name and phone number. Other distinctive data, such as birth date and ZIP code, often combine uniquely and can be linked to publicly available information to re-identify individuals. Producing anonymous data that remain specific enough to be useful is often a very difficult task, and practice today tends either to assume incorrectly that confidentiality is maintained when it is not, or to produce data that are practically useless. The goal of the work presented in this book is to explore computational techniques for releasing useful information in such a way that the identity of any individual or entity contained in the data cannot be recognized while the data remain practically useful. I begin by demonstrating ways to learn information about entities from publicly available information. I then provide a formal framework for reasoning about disclosure control and the ability to infer the identities of entities contained within the data. I formally define and present null-map, k-map and wrong-map as models of protection. Each model provides protection by ensuring that released information maps to no, k or incorrect entities, respectively. The book ends by examining four computational systems that attempt to maintain privacy while releasing electronic information. These systems are: (1) my Scrub System, which locates personally identifying information in letters between doctors and notes written by clinicians; (2) my Datafly II System, which generalizes and suppresses values in field-structured data sets; (3) Statistics Netherlands' μ-Argus System, which is becoming a European standard for producing public-use data; and (4) my k-Similar algorithm, which finds optimal solutions such that data are minimally distorted while still providing adequate protection. By introducing anonymity and quality metrics, I show that Datafly II can overprotect data, that Scrub and μ-Argus can fail to provide adequate protection, and that k-Similar finds optimal results.
    by Latanya Sweeney, Ph.D.
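
    Datafly's core move, generalizing quasi-identifier values level by level until each value is frequent enough, can be illustrated with a toy ZIP-code hierarchy. This is a sketch in that spirit, not Sweeney's algorithm or code, and it omits the suppression step:

        def generalize_zip(zipcode: str, level: int) -> str:
            """Replace the last `level` digits of a ZIP code with '*',
            the classic domain generalization hierarchy for ZIP codes."""
            return zipcode[:len(zipcode) - level] + "*" * level

        def generalize_until_k(rows, k: int):
            """Datafly-style loop (simplified): generalize ZIP one level at a
            time until every ZIP value occurs at least k times."""
            for level in range(6):
                zips = [generalize_zip(r["zip"], level) for r in rows]
                if min(zips.count(z) for z in zips) >= k:
                    return [{**r, "zip": z} for r, z in zip(rows, zips)]
            return None  # would require suppression of outliers instead

        rows = [{"zip": "02139"}, {"zip": "02138"}, {"zip": "02145"}]
        print(generalize_until_k(rows, k=3))  # all three become '021**'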

    The Prom Problem: Fair and Privacy-Enhanced Matchmaking with Identity Linked Wishes

    In the Prom Problem (TPP), Alice wishes to attend a school dance with Bob and needs a risk-free, privacy-preserving way to find out whether Bob shares that same wish. If not, no one should know that she inquired about it, not even Bob. TPP represents a special class of matchmaking challenges, augmenting the properties of privacy-enhanced matchmaking by further requiring fairness and support for identity-linked wishes (ILW), i.e. wishes involving specific identities that are only valid if all involved parties have those same wishes. The Horne-Nair (HN) protocol was proposed as a solution to TPP, along with a sample pseudo-code embodiment leveraging an untrusted matchmaker. Neither identities nor pseudo-identities are included in any messages or stored in the matchmaker's database, and privacy-relevant data stay within user control. A security analysis and proof-of-concept implementation validated the approach, fairness was quantified, and a feasibility analysis demonstrated practicality in real-world networks and systems, thereby bounding risk prior to incurring the full costs of development. The SecretMatch™ Prom app leverages one embodiment of the patented HN protocol to achieve privacy-enhanced and fair matchmaking with ILW. The endeavor led to practical lessons learned and recommendations for privacy engineering in an era of rapidly evolving privacy legislation. Next steps include the design of SecretMatch™ apps for contexts like voting negotiations in legislative bodies and executive recruiting. The roadmap toward a quantum-resistant SecretMatch™ began with the design of a Hybrid Post-Quantum Horne-Nair (HPQHN) protocol. Future directions include enhancements to HPQHN, a fully post-quantum HN protocol, and more.
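
    The abstract does not specify the HN protocol itself, so the following is only a toy illustration of the baseline matchmaking idea it improves upon: a matchmaker that compares keyed commitments to wishes rather than storing identities. It deliberately lacks the fairness and ILW guarantees that distinguish HN, and the shared event key is an assumption.

        import hashlib, hmac

        def wish_token(event_key: bytes, me: str, other: str, wish: str) -> str:
            """Keyed commitment to an identity-linked wish. Two parties derive
            the same token only if they name the same pair and the same wish."""
            pair = "|".join(sorted([me, other]))  # order-independent pairing
            msg = f"{pair}|{wish}".encode()
            return hmac.new(event_key, msg, hashlib.sha256).hexdigest()

        key = b"hypothetical pre-shared event key"  # assumption, not from HN
        alice = wish_token(key, "alice", "bob", "prom-2025")
        bob   = wish_token(key, "bob", "alice", "prom-2025")

        # The matchmaker stores only opaque tokens; seeing the same token twice
        # signals a mutual wish without revealing a one-sided inquiry.
        print(alice == bob)  # True -> mutual wish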