58 research outputs found

    Priv Stat Databases

    In this paper we propose a method for statistical disclosure limitation of categorical variables that we call Conditional Group Swapping. This approach is suitable for design and strata-defining variables, the cross-classification of which leads to the formation of important groups or subpopulations. These groups are considered important because, from the point of view of data analysis, it is desirable to preserve analytical characteristics within them. In general, data swapping can be quite distorting ([12, 18, 15]), especially for the relationships between the variables, not only within the subpopulations but also for the overall data. To reduce the damage incurred by swapping, we propose to choose the records for swapping using conditional probabilities which depend on the characteristics of the exchanged records. In particular, our approach exploits the results of propensity score methodology for the computation of swapping probabilities. The experimental results presented in the paper show good utility properties of the method.
    CC999999/ImCDC/Intramural CDC HHS/United States. 2020-03-23. PMID 32206763. PMC7087407.

    Confidentiality challenges in releasing longitudinally linked data

    Longitudinally linked household data allow researchers to analyse trends over time as well as at a cross-sectional level. Such analysis requires households to be linked across waves, but this increases the possibility of disclosure risks. We focus on an inter-wave disclosure risk specific to such data sets, where intruders can make use of intimate knowledge gained about the household in one wave to learn new sensitive information about the household in future waves. We consider a specific way this risk could occur when households split in one wave, so that an individual has left the household, and illustrate this risk using the Wealth and Assets Survey. We also show that simply removing the links between waves may be insufficient to adequately protect confidentiality. To mitigate this risk we investigate two statistical disclosure control (SDC) methods, perturbation and synthesis, that alter sensitive information on these households in the current wave. In this way no new sensitive information will be disclosed to these individuals, while utility should be largely preserved provided the SDC measures are applied appropriately. © 2020, University of Skövde. All rights reserved.

    Disclosure Risk Assessments and Control.

    Recent advances in technology dramatically increase the volume of data that statistical agencies can gather and disseminate. The improved accessibility translates into a higher risk of identifying individuals from public microdata, and therefore increases the importance of evaluating disclosure risk and confidentiality control. This dissertation addresses three related but distinct research questions in statistical data confidentiality. The first study concerns the evaluation of disclosure risk for microdata when an intruder attempts to identify survey respondents by linking data records with a large external commercial data file based on a set of common variables. The dependence of disclosure risk on the commercial data coverage, the accuracy of the common identification information, and the amount of identification information to which an intruder has access is discussed theoretically and tested empirically in an experiment. The second study presents a practical implementation of a fully imputed synthetic data approach for a large, complex longitudinal survey as a means of protecting confidentiality, following the initial proposals by Rubin (1993) and Little (1993). The imputation uses separate semiparametric algorithms for continuous, binary and categorical variables. A new combining rule for synthetic data inference is proposed to account for the uncertainty due to simultaneously imputing item-missing data and generating synthetic data. The loss of data utility is evaluated using a propensity score approach in addition to three information loss metrics. The third study extends this fully synthetic data approach to cope with situations where small area statistics are essential. This research is the first in the statistical disclosure control literature to consider small area statistics. The goal is to create synthetic data with enough geographical detail to permit small area analyses, which are otherwise impossible because such geographical identifiers are usually suppressed for disclosure control. A Bayesian framework for appropriate small area models is proposed to generate synthetic microdata from the posterior predictive distributions. Two simulation studies and one empirical illustration are used to evaluate this approach.
    Ph.D., Survey Methodology, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61661/1/mandiyu_1.pd

    Investigating Solutions to Minimise Participation Bias in Case-Control Studies

    Case-control studies are used in epidemiology to try to determine variables associated with a disease, by comparing those with the disease (cases) against those without (controls). Participation rates in epidemiology studies have declined over recent years, particularly in the control group, where there is less motivation to participate. Non-participation can lead to bias, and this can result in the findings differing from the truth. A literature review of the last nine years shows that non-participation occurred in published studies as recently as 2015, and an assessment of articles from three high impact factor epidemiology journals concludes that participation bias is a possibility which is not always controlled for. Methods to reduce bias resulting from non-participation are provided, which suit different data structures and purposes. A guidance tool is subsequently developed to aid the selection of a suitable approach. Many of these methods rely on the assumption that the data are missing at random. Therefore, a new solution is developed which utilises population data in place of the control data, and which recovers the true odds ratio even when data are missing not at random. Chain event graphs, a graphical representation of a statistical model, are used for the first time to draw conclusions about the missingness mechanisms resulting from non-participation in case-control data. These graphs are also adapted specifically to further investigate non-participation in case-control studies. Throughout, in addition to hypothetical examples and simulated data, a diabetes dataset is used to demonstrate the methods. Critical comparisons are drawn between existing methods and the new methods developed here, and discussion is provided of when each method is suitable. Identification of factors associated with a disease is crucial for improved patient care, and accurate analyses of case-control data, with minimal biases, are one way in which this can be achieved.

    Generation and assessment of useful and privacy preserving synthetic datasets

    Synthetic datasets are gaining traction as a potential solution for allowing access to sensitive data while protecting the privacy of individuals. However, the assessment of both the utility and disclosure risk of synthetic data is still an open question on which there is little consensus. Solutions that are theoretically good have been proposed, but these are not currently feasible for most use cases. Meanwhile, most practicable disclosure risk assessments are ad hoc, unsuitable for more than a few sensitive variables, and only consider a narrow range of risk scenarios. For greater uptake of synthetic data it is important to establish a standard for its assessment. In this thesis, we evaluate methods for the assessment of synthetic data and identify several clear issues in the literature. We develop a practical framework for the quantitative assessment of disclosure risk for synthetic data. Hierarchical regression models are used for the evaluation and comparison of disclosure risk for multiple sensitive variables, synthetic datasets and intruder assumptions simultaneously. We demonstrate our methods on two example datasets: a small dataset containing fewer than 1000 samples and 9 variables, and a larger dataset containing over 50000 samples and 40 variables. We find that the method of prediction has a significantly larger effect on attribute disclosure risk than the synthetic data generation method.

    Propensity Score Based Conditional Group Swapping for Disclosure Limitation of Strata-Defining Variables

    In this paper we propose a method for statistical disclosure limitation of categorical variables that we call Conditional Group Swapping. This approach is suitable for design and strata-defining variables, the cross-classification of which leads to the formation of important groups or subpopulations. These groups are considered important because, from the point of view of data analysis, it is desirable to preserve analytical characteristics within them. In general, data swapping can be quite distorting [13, 16, 20], especially for the relationships between the variables, not only within the subpopulations but also for the overall data. To reduce the damage incurred by swapping, we propose to choose the records for swapping using conditional probabilities which depend on the characteristics of the exchanged records. In particular, our approach exploits the results of propensity score methodology for the computation of swapping probabilities. The experimental results presented in the paper show good utility properties of the method.

    SIS 2017. Statistics and Data Science: new challenges, new generations

    The 2017 SIS Conference aims to highlight the crucial role of Statistics in Data Science. In this new domain of ‘meaning’ extracted from data, the increasing amount of data produced and made available in databases has brought new challenges. These involve different fields: statistics, machine learning, information and computer science, optimization, and pattern recognition. Together, these fields make a considerable contribution to the analysis of ‘Big data’, open data, and relational and complex data, both structured and unstructured. The aim is to collect contributions from the different domains of Statistics on high-dimensional data quality validation, sample extraction, dimension reduction, pattern selection, data modelling, hypothesis testing, and confirming conclusions drawn from the data.

    The drivers of Corporate Social Responsibility in the supply chain. A case study.

    Purpose: The paper studies the way in which an SME integrates CSR into its corporate strategy, the practices it puts in place, and how its CSR strategies are reflected in its supplier and customer relations. Methodology/Research limitations: A qualitative case study methodology is used. The use of a single case study limits the generalizability of the findings. Findings: The entrepreneur’s ethical beliefs and value system play a fundamental role in shaping sustainable corporate strategy. Furthermore, the competitive strategy selected, based on innovation, quality and responsibility, clearly emerges both in well-defined management procedures and in supply chain relations as a whole, which are aimed at involving partners in the process of sustainable innovation. Originality/value: The paper presents an SME that has devised an original, innovative business model. The study pivots on the issues of innovation and eco-sustainability in a context of drivers for CSR and business ethics. These values are considered fundamental at international level; the United Nations declared 2011 the “International Year of Forests”.

    Perceived Alcohol Stigma and Treatment for Alcohol Use Disorders

    Despite the availability of effective treatments, the overwhelming majority (85%) of individuals suffering from alcohol use disorders (AUDs) never receive help for their problems. AUDs include the disorders of alcohol abuse and alcohol dependence. An objective of Healthy People 2020 is to increase the number of individuals diagnosed with AUDs who receive alcohol treatment. The extent to which one believes that stigmatizing attitudes towards those with AUDs exist is defined as perceived alcohol stigma (PAS). Although it is known that persons with AUDs who have higher levels of PAS are at an even greater risk of not receiving treatment, the specific mechanisms by which PAS affects treatment utilization remain unknown. Additionally, while the comorbidity of AUDs and other psychiatric disorders is highly prevalent, scant research has explored the relationship between PAS and comorbidity. The aims of this study were: (1) to examine how PAS may influence the receipt of alcohol treatment for those who have met criteria for AUDs in their lifetime, and (2) to examine PAS in persons with AUDs alone as compared to those with co-occurring AUDs and other psychiatric disorders. This study used data from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC), which is a population-representative survey of United States adults living in noninstitutionalized settings. Respondents were included in the analyses if they completed both Wave 1 (collected during 2001-2002) and Wave 2 (collected during 2004-2005) survey interviews, and met criteria for DSM-IV AUD. Based on these criteria, data from 11,303 out of 43,093 respondents were analyzed. The primary analytic strategy was structural equation modeling. While prior work identified an inverse relationship between PAS and alcohol treatment utilization among persons with lifetime AUDs, this study revealed that the relationship between PAS and perceived need for treatment and actual treatment utilization is complex. In each of the two aims of this study, one of three hypotheses was directly supported. Important considerations for design, measurement, and theory development were derived. However, longitudinal research and an improvement in the assessments of alcohol stigma, problem recognition, and perceived need for alcohol treatment must be accomplished in order to better quantify and describe any potential effect of PAS on treatment utilization.