77,912 research outputs found

    Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation

    Full text link
    The growing expanse of e-commerce and the widespread availability of online databases raise many fears regarding loss of privacy and many statistical challenges. Even with encryption and other nominal forms of protection for individual databases, we still need to protect against the violation of privacy through linkages across multiple databases. These issues parallel those that have arisen and received some attention in the context of homeland security. Following the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. We present an overview of some proposals that have surfaced for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. We also explore their link to the related literature on privacy-preserving data mining. In particular, we focus on the matching problem across databases and the concept of ``selective revelation'' and their confidentiality implications.Comment: Published at http://dx.doi.org/10.1214/088342306000000240 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Robust scaling in fusion science: case study for the L-H power threshold

    Get PDF
    In regression analysis for deriving scaling laws in the context of fusion studies, standard regression methods are usually applied, of which ordinary least squares (OLS) is the most popular. However, concerns have been raised with respect to several assumptions underlying OLS in its application to fusion data. More sophisticated statistical techniques are available, but they are not widely used in the fusion community and, moreover, the predictions by scaling laws may vary significantly depending on the particular regression technique. Therefore we have developed a new regression method, which we call geodesic least squares regression (GLS), that is robust in the presence of significant uncertainty on both the data and the regression model. The method is based on probabilistic modeling of all variables involved in the scaling expression, using adequate probability distributions and a natural similarity measure between them (geodesic distance). In this work we revisit the scaling law for the power threshold for the L-to-H transition in tokamaks, using data from the multi-machine ITPA databases. Depending on model assumptions, OLS can yield different predictions of the power threshold for ITER. In contrast, GLS regression delivers consistent results. Consequently, given the ubiquity and importance of scaling laws and parametric dependence studies in fusion research, GLS regression is proposed as a robust and easily implemented alternative to classic regression techniques

    Monte Carlo Co-Ordinate Ascent Variational Inference

    Get PDF
    In Variational Inference (VI), coordinate-ascent and gradient-based approaches are two major types of algorithms for approximating difficult-to-compute probability densities. In real-world implementations of complex models, Monte Carlo methods are widely used to estimate expectations in coordinate-ascent approaches and gradients in derivative-driven ones. We discuss a Monte Carlo Co-ordinate Ascent VI (MC-CAVI) algorithm that makes use of Markov chain Monte Carlo (MCMC) methods in the calculation of expectations required within Co-ordinate Ascent VI (CAVI). We show that, under regularity conditions, an MC-CAVI recursion will get arbitrarily close to a maximiser of the evidence lower bound (ELBO) with any given high probability. In numerical examples, the performance of MC-CAVI algorithm is compared with that of MCMC and -- as a representative of derivative-based VI methods -- of Black Box VI (BBVI). We discuss and demonstrate MC-CAVI's suitability for models with hard constraints in simulated and real examples. We compare MC-CAVI's performance with that of MCMC in an important complex model used in Nuclear Magnetic Resonance (NMR) spectroscopy data analysis -- BBVI is nearly impossible to be employed in this setting due to the hard constraints involved in the model

    A general method for the statistical evaluation of typological distributions

    Get PDF
    The distribution of linguistic structures in the world is the joint product of universal principles, inheritance from ancestor languages, language contact, social structures, and random fluctuation. This paper proposes a method for evaluating the relative significance of each factor ā€” and in particular, of universal principles ā€” via regression modeling: statistical evidence for universal principles is found if the odds for families to have skewed responses (e.g. all or most members have postnominal relative clauses) as opposed to having an opposite response skewing or no skewing at all, is significantly higher for some condition (e.g. VO order) than for another condition, independently of other factors

    Routes for breaching and protecting genetic privacy

    Full text link
    We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.Comment: Draft for comment
    • ā€¦
    corecore