Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation
The growing expanse of e-commerce and the widespread availability of online
databases raise many fears regarding loss of privacy and many statistical
challenges. Even with encryption and other nominal forms of protection for
individual databases, we still need to protect against the violation of privacy
through linkages across multiple databases. These issues parallel those that
have arisen and received some attention in the context of homeland security.
Following the events of September 11, 2001, there has been heightened attention
in the United States and elsewhere to the use of multiple government and
private databases for the identification of possible perpetrators of future
attacks, as well as an unprecedented expansion of federal government data
mining activities, many involving databases containing personal information. We
present an overview of some proposals that have surfaced for the search of
multiple databases which supposedly do not compromise possible pledges of
confidentiality to the individuals whose data are included. We also explore
their link to the related literature on privacy-preserving data mining. In
particular, we focus on the matching problem across databases and the concept
of ``selective revelation'' and their confidentiality implications.
Comment: Published at http://dx.doi.org/10.1214/088342306000000240 in the
Statistical Science (http://www.imstat.org/sts/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Robust scaling in fusion science: case study for the L-H power threshold
In regression analysis for deriving scaling laws in the context of fusion studies, standard regression methods are usually applied, of which ordinary least squares (OLS) is the most popular. However, concerns have been raised with respect to several assumptions underlying OLS in its application to fusion data. More sophisticated statistical techniques are available, but they are not widely used in the fusion community and, moreover, the predictions by scaling laws may vary significantly depending on the particular regression technique. Therefore we have developed a new regression method, which we call geodesic least squares regression (GLS), that is robust in the presence of significant uncertainty on both the data and the regression model. The method is based on probabilistic modeling of all variables involved in the scaling expression, using adequate probability distributions and a natural similarity measure between them (geodesic distance). In this work we revisit the scaling law for the power threshold for the L-to-H transition in tokamaks, using data from the multi-machine ITPA databases. Depending on model assumptions, OLS can yield different predictions of the power threshold for ITER. In contrast, GLS regression delivers consistent results. Consequently, given the ubiquity and importance of scaling laws and parametric dependence studies in fusion research, GLS regression is proposed as a robust and easily implemented alternative to classic regression techniques
Monte Carlo Co-Ordinate Ascent Variational Inference
In Variational Inference (VI), coordinate-ascent and gradient-based
approaches are two major types of algorithms for approximating
difficult-to-compute probability densities. In real-world implementations of
complex models, Monte Carlo methods are widely used to estimate expectations in
coordinate-ascent approaches and gradients in derivative-driven ones. We
discuss a Monte Carlo Co-ordinate Ascent VI (MC-CAVI) algorithm that makes use
of Markov chain Monte Carlo (MCMC) methods in the calculation of expectations
required within Co-ordinate Ascent VI (CAVI). We show that, under regularity
conditions, an MC-CAVI recursion will get arbitrarily close to a maximiser of
the evidence lower bound (ELBO) with any given high probability. In numerical
examples, the performance of the MC-CAVI algorithm is compared with that of MCMC
and -- as a representative of derivative-based VI methods -- of Black Box VI
(BBVI). We discuss and demonstrate MC-CAVI's suitability for models with hard
constraints in simulated and real examples. We compare MC-CAVI's performance
with that of MCMC in an important complex model used in Nuclear Magnetic
Resonance (NMR) spectroscopy data analysis; BBVI is nearly impossible to
employ in this setting because of the hard constraints in the model.
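To make the idea of Monte Carlo expectations inside coordinate ascent concrete, here is a minimal sketch, not the authors' algorithm. It assumes a toy conjugate model (Gaussian data with a Normal-Gamma prior) in which the required expectation is actually available in closed form, and it draws independent samples from the current q(mu) rather than running MCMC as the abstract describes. The function name `mc_cavi_gaussian` and all parameter names are hypothetical.

```python
import random
import statistics


def mc_cavi_gaussian(data, n_iters=20, n_samples=200, seed=0,
                     mu0=0.0, lam0=1.0, a0=1.0, b0=1.0):
    """Toy coordinate-ascent VI for x_i ~ N(mu, 1/tau) with
    mu | tau ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0).

    Mean-field family: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n).
    The expectation needed for the q(tau) update is estimated by plain
    Monte Carlo draws from q(mu) (an illustrative stand-in for MCMC).
    """
    rng = random.Random(seed)
    n = len(data)
    xbar = statistics.fmean(data)

    # These two quantities do not change across iterations in this model.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0

    e_tau = a0 / b0  # initial guess for E_q[tau]
    for _ in range(n_iters):
        # Update q(mu) given the current E_q[tau].
        lam_n = (lam0 + n) * e_tau
        sd = lam_n ** -0.5

        # Monte Carlo estimate of
        # E_q(mu)[ sum_i (x_i - mu)^2 + lam0 * (mu - mu0)^2 ].
        total = 0.0
        for _ in range(n_samples):
            mu = rng.gauss(mu_n, sd)
            total += sum((x - mu) ** 2 for x in data)
            total += lam0 * (mu - mu0) ** 2

        # Update q(tau) using the estimated expectation.
        b_n = b0 + 0.5 * total / n_samples
        e_tau = a_n / b_n

    return mu_n, lam_n, a_n, b_n


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(2.0, 0.5) for _ in range(100)]
    mu_n, lam_n, a_n, b_n = mc_cavi_gaussian(sample)
    print("E[mu] =", mu_n, " E[tau] =", a_n / b_n)
```

Because each update uses a noisy estimate, the iterates fluctuate around the fixed point rather than settling on it exactly; increasing `n_samples` reduces that fluctuation at the cost of more computation per iteration.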
A general method for the statistical evaluation of typological distributions
The distribution of linguistic structures in the world is the joint product of universal principles, inheritance from ancestor languages, language contact, social structures, and random fluctuation. This paper proposes a method for evaluating the relative significance of each factor, and in particular of universal principles, via regression modeling: statistical evidence for universal principles is found if the odds for families to have skewed responses (e.g. all or most members have postnominal relative clauses), as opposed to having an opposite response skewing or no skewing at all, are significantly higher for some condition (e.g. VO order) than for another condition, independently of other factors.
Routes for breaching and protecting genetic privacy
We are entering the era of ubiquitous genetic information for research,
clinical care, and personal curiosity. Sharing these datasets is vital for
rapid progress in understanding the genetic basis of human diseases. However,
one growing concern is the ability to protect the genetic privacy of the data
originators. Here, we technically map threats to genetic privacy and discuss
potential mitigation strategies for privacy-preserving dissemination of genetic
data.
Comment: Draft for comment