222 research outputs found

    Mathematical techniques for the protection of patient's privacy in medical databases

    In modern society, balancing privacy against public access to information is an increasingly widespread problem. Valid data is crucial for many kinds of research, but the public good should not be achieved at the expense of individuals. In creating a central database of patients, the CSIOZ wishes to provide statistical information to selected institutions. However, there are plans to extend access by providing the statistics to researchers or even to citizens. This could pose a significant risk of disclosing private, sensitive information about individuals. This report proposes methods to prevent such data leaks. One category of suggestions is based on modifying statistics so that they remain useful to statisticians while still guaranteeing the protection of patients' privacy. Another group of proposed mechanisms, though sometimes difficult to implement, enables one to obtain precise statistics while restricting those queries which might reveal sensitive information.
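    The query-restriction idea the report alludes to can be illustrated with a minimal set-size control: refuse any COUNT query whose query set is either very small or nearly the whole database, since the complement of a tiny set is just as revealing. The data, the threshold `k`, and the function name below are illustrative, not taken from the report:

```python
def restricted_count(records, predicate, k=3):
    """Answer a COUNT query only if the query set is neither very
    small nor nearly the whole database; both extremes can single
    out individuals (the complement of a tiny set is also revealing)."""
    n = len(records)
    c = sum(1 for r in records if predicate(r))
    if c < k or c > n - k:
        return None  # refuse to answer
    return c

patients = [{"age": a, "diagnosis": d} for a, d in
            [(34, "flu"), (71, "diabetes"), (45, "flu"), (62, "asthma"),
             (29, "flu"), (80, "diabetes"), (55, "flu"), (41, "asthma"),
             (67, "flu"), (38, "flu")]]

print(restricted_count(patients, lambda r: r["diagnosis"] == "flu"))  # 6
print(restricted_count(patients, lambda r: r["age"] > 75))            # None (only 1 match)
```

    A real interface would also have to track overlapping query histories, since sequences of individually safe queries can still combine into a disclosure.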

    A posteriori disclosure risk measure for tabular data based on conditional entropy

    Statistical database protection, also known as Statistical Disclosure Control (SDC), is a part of information security which tries to prevent published statistical information (tables, individual records) from disclosing the contribution of specific respondents. This paper deals with the assessment of the disclosure risk associated with the release of tabular data. So-called sensitivity rules are currently used to measure the disclosure risk for tables. These rules operate on an a priori basis: the data are examined and the rules are used to decide whether the data can be released as they stand or should rather be protected. In this paper, we propose to complement a priori risk assessment with a posteriori risk assessment in order to achieve a higher level of security; that is, we propose to take the protected information into account when measuring the disclosure risk. The proposed a posteriori disclosure risk measure is compatible with a broad class of disclosure protection methods and can be extended to compute disclosure risk for a set of linked tables. In the case of linked table protection via cell suppression, the proposed measure allows detection of secondary suppression patterns which offer more protection than others.
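    The abstract does not spell out the measure itself, but the core idea of an entropy-based a posteriori assessment can be sketched as follows: after protection, compute the entropy of an attacker's posterior distribution over the feasible values of a protected cell, where higher entropy means more residual uncertainty. All numbers below are illustrative:

```python
import math

def entropy_bits(posterior):
    """Shannon entropy (in bits) of an attacker's posterior over the
    feasible values of a protected cell; higher entropy = more
    residual uncertainty = more protection."""
    return -sum(p * math.log2(p) for p in posterior if p > 0)

# A suppressed cell known from the published margins to lie in
# {10, ..., 19}; with no further information the posterior is uniform:
print(entropy_bits([0.1] * 10))   # ~3.32 bits

# Combining linked tables narrows the feasible set to {14, 15}, so a
# purely a priori assessment would overstate the protection achieved:
print(entropy_bits([0.5, 0.5]))   # 1.0 bit
```

    This is exactly the situation the paper targets for linked tables: two suppression patterns can look equally safe a priori yet leave very different posterior entropies once the released tables are taken into account.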

    De-identifying a public use microdata file from the Canadian national discharge abstract database

    Background: The Canadian Institute for Health Information (CIHI) collects hospital discharge abstract data (DAD) from Canadian provinces and territories. There are many demands for the disclosure of these data for research and analysis to inform policy making. To expedite disclosure for some of these purposes, the construction of a DAD public use microdata file (PUMF) was considered. Such purposes include: confirming published results, providing broader feedback to CIHI to improve data quality, training students and fellows, providing an easily accessible data set for researchers to prepare for analyses on the full DAD data set, and serving as a large health data set for computer scientists and statisticians to evaluate analysis and data mining techniques. The objective of this study was to measure the probability of re-identification for records in a PUMF, and to de-identify a national DAD PUMF consisting of 10% of records.
    Methods: Plausible attacks on a PUMF were evaluated. Based on these attacks, the 2008-2009 national DAD was de-identified. A new algorithm was developed to minimize the amount of suppression while maximizing the precision of the data. The acceptable threshold for the probability of correct re-identification of a record was set between 0.04 and 0.05. Information loss was measured in terms of the extent of suppression and entropy.
    Results: Two different PUMF files were produced, one with geographic information, and one with no geographic information but more clinical information. At a threshold of 0.05, the maximum proportion of records with the diagnosis code suppressed was 20%, but these suppressions represented only 8-9% of all values in the DAD. Our suppression algorithm has less information loss than a more traditional approach to suppression. Smaller regions, patients with longer stays, and age groups that are infrequently admitted to hospitals tend to be the ones with the highest rates of suppression.
    Conclusions: The strategies we used to maximize data utility and minimize information loss can result in a PUMF that would be useful for the specific purposes noted earlier. However, to create a more detailed file with less information loss, suitable for more complex health services research, the risk would need to be mitigated by requiring the data recipient to commit to a data sharing agreement.
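    One standard way to compute the per-record re-identification probability that the study thresholds at 0.04-0.05 is the prosecutor-risk model: a record sharing its quasi-identifier combination with f records has risk 1/f. The sketch below uses that model with made-up quasi-identifiers; the paper's actual algorithm, which also minimizes suppression, is more elaborate:

```python
from collections import Counter

def flag_risky(records, quasi_ids, threshold=0.05):
    """Prosecutor-model re-identification risk: 1/f for a record whose
    quasi-identifier combination is shared by f records; records above
    the threshold are flagged for suppression or generalization."""
    key = lambda r: tuple(r[q] for q in quasi_ids)
    freq = Counter(key(r) for r in records)
    return [(1 / freq[key(r)], 1 / freq[key(r)] > threshold)
            for r in records]

# 25 records share one combination; one combination is unique:
records = ([{"region": "ON", "age_group": "60-69"}] * 25
           + [{"region": "NT", "age_group": "90+"}])
risks = flag_risky(records, ["region", "age_group"])
print(risks[0])   # (0.04, False) -- within the 0.04-0.05 threshold
print(risks[-1])  # (1.0, True)   -- unique combination, must be treated
```

    This also makes the Results section concrete: rare combinations (small regions, unusual ages) produce small equivalence classes, hence high 1/f risk and high suppression rates.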

    Economic Analysis and Statistical Disclosure Limitation

    This paper explores the consequences for economic research of methods used by data publishers to protect the privacy of their respondents. We review the concept of statistical disclosure limitation for an audience of economists who may be unfamiliar with these methods. We characterize what it means for statistical disclosure limitation to be ignorable. When it is not ignorable, we consider the effects of statistical disclosure limitation for a variety of research designs common in applied economic research. Because statistical agencies do not always report the methods they use to protect confidentiality, we also characterize settings in which statistical disclosure limitation methods are discoverable; that is, they can be learned from the released data. We conclude with advice for researchers, journal editors, and statistical agencies.

    Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods

    This paper has been replaced with http://digitalcommons.ilr.cornell.edu/ldi/37. We consider the problem of the public release of statistical information about a population, explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickrey-Clarke-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner's problem using the technology set implied by (ε, δ)-differential privacy with (α, β)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner's problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.
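    The paper analyzes the Private Multiplicative Weights mechanism; the much simpler Laplace mechanism below illustrates the same underlying accuracy/privacy-loss trade-off: for a count query (sensitivity 1), ε-differential privacy is achieved by adding noise of scale 1/ε, so a weaker privacy guarantee (larger ε) buys more accurate statistics. The query and numbers are illustrative:

```python
import math, random

def laplace_count(true_count, epsilon):
    """epsilon-differentially private COUNT via the Laplace mechanism:
    a count has sensitivity 1, so Laplace noise with scale 1/epsilon
    suffices. Larger epsilon = weaker privacy = more accuracy."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling of Laplace(0, 1/epsilon):
    noise = -(1 / epsilon) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_count + noise

random.seed(0)
for eps in (0.1, 1.0):
    err = sum(abs(laplace_count(1000, eps) - 1000) for _ in range(5000)) / 5000
    print(f"epsilon={eps}: mean absolute error ~ {err:.2f}")  # roughly 1/epsilon
```

    The expected absolute error is exactly the noise scale 1/ε, which is the production-possibilities-frontier shape the planner's problem trades off against privacy loss.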

    Statistical disclosure control: Applications in healthcare

    Statistical disclosure control is a progressive subject which offers techniques with which tables of data intended for public release can be protected from the threat of disclosure. In this sense, disclosure will usually mean information on an individual subject being revealed by the release of a table. The techniques used centre around detecting potential disclosure in a table and then removing this disclosure by somehow adjusting the original table. This thesis has been produced in conjunction with the Information and Services Division (Scotland) (ISD) and therefore concentrates on the applications of statistical disclosure control in the field of healthcare, with particular reference to the problems encountered by ISD. The thesis predominantly aims to give an overview of current statistical disclosure control techniques. It investigates how these techniques would work in the ISD scenario and ultimately aims to provide ISD with advice on how they should proceed in any future update of their statistical disclosure control policy. Chapter 1 introduces statistical disclosure and investigates some of the legal and social issues associated with the field. It also provides information on the techniques used by other organisations worldwide. Further, there is an introduction to both the ISD scenario and a leading computing package in the area, Tau-Argus. Chapter 2 gives an overview of the techniques currently used in statistical disclosure control. This overview includes technical justification for the techniques, along with the advantages and disadvantages associated with using each technique. Chapter 3 provides a decision rule approach to the selection of the disclosure control techniques described in Chapter 2, and much of Chapter 3 revolves around a description of the implications derived from the choices made.
    Chapter 4 presents the results from an application of statistical disclosure control techniques to a real ISD data set concerned with diabetes in children in Scotland. The results include a quantification of the information lost in the table when the disclosure control technique is applied. The investigation concentrated on two- and three-dimensional tables, and the analysis was carried out using the Tau-Argus computing package. Chapter 5 concludes by providing a summary of the main findings of the thesis and recommendations based on these findings. There is also a discussion of potential further study which may be useful to ISD as they attempt to update their statistical disclosure control policy.
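    A concrete instance of the table-protection step described above is the minimum-frequency threshold rule for primary suppression, one of the standard rules implemented in packages such as Tau-Argus. The threshold value and the cells below are illustrative, not taken from the thesis:

```python
def primary_suppress(counts, n=3):
    """Minimum-frequency threshold rule for primary suppression:
    any cell with a positive count below n is a disclosure risk and
    is suppressed (None); true zeros and large counts are published.
    The threshold n=3 and the example cells are illustrative."""
    return {cell: (None if 0 < c < n else c) for cell, c in counts.items()}

table = {("Region A", "0-4"): 12, ("Region A", "5-9"): 2,
         ("Region B", "0-4"): 0,  ("Region B", "5-9"): 7}
print(primary_suppress(table))
# ("Region A", "5-9") is suppressed; the other three cells pass.
```

    In practice, secondary suppression of further cells is then required so that the suppressed value cannot be recovered by subtracting the published cells from the row and column totals; that is where the information-loss quantification of Chapter 4 comes in.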

    A Study of Inference Control Techniques

    Security is a major issue in every field, and its impact is greatest when intruders obtain information about an individual from a database, whether directly or indirectly. Here we study the techniques which protect confidentiality from indirect disclosure. These techniques are called inference control techniques, and are also known as statistical disclosure control methods. Indirect disclosure differs from other security problems in that presumptive intruders or external users deduce information from a set of available low-risk queries: they perform computations on available non-sensitive information and from it derive sensitive information. Inference control techniques protect the publicly released statistics of companies and institutions, so that a presumptive user cannot obtain private information about any individual entity. Inference control in statistical databases is a part of information security which tries to prevent published statistical information (tables, individual records) from disclosing the contribution of specific respondents. Here we analyze the information loss, disclosure risk measures, and performance of the various techniques. The major challenge for statistical disclosure control is that modified data should provide as much useful information as possible with as little disclosure risk as possible; that is, protection should be maximized and information loss minimized. However, there is a trade-off between the two: typically, lower disclosure risk means greater information loss, and vice versa. We propose some ideas which may provide optimal results.
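    The indirect disclosure described above can be made concrete with a classic differencing attack: two individually low-risk COUNT queries whose difference reveals one person's sensitive value. The table, names, and attributes below are invented for illustration:

```python
# Hypothetical hospital employee table: (department, age, test result).
# The attacker's outside knowledge: the target is the only dept_A
# employee over 50.
db = [("dept_A", 57, "positive"),
      ("dept_A", 34, "negative"),
      ("dept_A", 41, "negative"),
      ("dept_B", 29, "negative")]

def count(pred):
    return sum(1 for row in db if pred(row))

# Each query alone looks like a harmless aggregate:
q1 = count(lambda r: r[0] == "dept_A" and r[2] == "positive")                 # 1
q2 = count(lambda r: r[0] == "dept_A" and r[1] <= 50 and r[2] == "positive")  # 0

# Their difference isolates the single employee over 50:
print("target is positive" if q1 - q2 == 1 else "inconclusive")
```

    Inference control techniques counter exactly this: query restriction refuses one of the two queries, while perturbative methods add noise so the difference no longer identifies an individual's value.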