Linear Regression from Strategic Data Sources
Linear regression is a fundamental building block of statistical data
analysis. It amounts to estimating the parameters of a linear model that maps
input features to corresponding outputs. In the classical setting where the
precision of each data point is fixed, the famous Aitken/Gauss-Markov theorem
in statistics states that generalized least squares (GLS) is a so-called "Best
Linear Unbiased Estimator" (BLUE). In modern data science, however, one often
faces strategic data sources, namely, individuals who incur a cost for
providing high-precision data.
In this paper, we study a setting in which features are public but
individuals choose the precision of the outputs they reveal to an analyst. We
assume that the analyst performs linear regression on this dataset, and
individuals benefit from the outcome of this estimation. We model this scenario
as a game where individuals minimize a cost comprising two components: (a) an
(agent-specific) disclosure cost for providing high-precision data; and (b) a
(global) estimation cost representing the inaccuracy in the linear model
estimate. In this game, the linear model estimate is a public good that
benefits all individuals. We establish that this game has a unique non-trivial
Nash equilibrium. We study the efficiency of this equilibrium and we prove
tight bounds on the price of stability for a large class of disclosure and
estimation costs. Finally, we study the estimator accuracy achieved at
equilibrium. We show that, in general, Aitken's theorem does not hold under
strategic data sources, though it does hold if individuals have identical
disclosure costs (up to a multiplicative factor). When individuals have
non-identical costs, we derive a bound on the improvement of the equilibrium
estimation cost that can be achieved by deviating from GLS, under mild
assumptions on the disclosure cost functions.
Comment: This version (v3) extends the results on the sub-optimality of GLS
(Section 6) and improves writing in multiple places compared to v2. Compared
to the initial version v1, it also fixes an error in Theorem 6 (now Theorem
5), and extended many of the result
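
As a rough illustration of the GLS estimator named in the abstract above, the following Python sketch assumes independent noise, so the noise covariance is diagonal and GLS reduces to precision-weighted least squares. The function name `gls_estimate` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def gls_estimate(X, y, precisions):
    """Precision-weighted least squares (GLS with diagonal noise covariance).

    Assumption for this sketch: data point i has independent noise with
    variance 1 / precisions[i]. Returns the estimate and its covariance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam = np.asarray(precisions, dtype=float)

    # X.T has shape (d, n); broadcasting scales column i by precision i.
    XtL = X.T * lam
    info = XtL @ X                      # information matrix X^T Lambda X
    beta_hat = np.linalg.solve(info, XtL @ y)
    cov = np.linalg.inv(info)           # covariance of the GLS estimate
    return beta_hat, cov

# Toy data (hypothetical): noise-free outputs from a known linear model.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
beta_true = np.array([2.0, -1.0])
y = X @ beta_true
precisions = np.array([1.0, 4.0, 0.5])

beta_hat, cov = gls_estimate(X, y, precisions)
_, cov_doubled = gls_estimate(X, y, 2.0 * precisions)
```

In this sketch, doubling every precision halves the covariance of the estimate, which is one simple way to see why the estimate behaves like a public good: any individual who reports more precisely lowers the estimation cost for everyone.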
Engineering Privacy in Public: Confounding Face Recognition
The objective of DARPA’s Human ID at a Distance (HID) program is to develop automated biometric identification technologies to detect, recognize and identify humans at great distances. While nominally intended for security applications, if deployed widely, such technologies could become an enormous privacy threat, making practical the automatic surveillance of individuals on a grand scale. Face recognition, as the HID technology most rapidly approaching maturity, deserves immediate research attention in order to understand its strengths and limitations, with an objective of reliably foiling it when it is used inappropriately. This paper is a status report for a research program designed to achieve this objective within a larger goal of similarly defeating all HID technologies.
DATA CLUSTERING AND MICRO-PERTURBATION FOR PRIVACY-PRESERVING DATA SHARING AND ANALYSIS
Clustering-based data masking approaches are widely used for privacy-preserving data sharing and data mining. Existing approaches, however, cannot cope with the situation where confidential attributes are categorical. For numeric data, these approaches are also unable to preserve important statistical properties such as variance and covariance of the data. We propose a new approach that handles these problems effectively. The proposed approach adopts a minimum spanning tree technique for clustering data and a micro-perturbation method for masking data. Our approach is novel in that it (i) incorporates an entropy-based measure, which represents the disclosure risk of the categorical confidential attribute, into the traditional distance measure used for clustering in an innovative way; and (ii) introduces the notion of cluster-level micro-perturbation (as opposed to conventional micro-aggregation) for masking data, to preserve the statistical properties of the data. We provide both analytical and empirical justification for the proposed methodology.
Disclosure Analysis for Two-Way Contingency Tables
Ministry of Education, Singapore under its Academic Research Funding Tier 1; SMU Research Offic
Privacy-preserving data mining
In the research of privacy-preserving data mining, we address issues related to extracting
knowledge from large amounts of data without violating the privacy of the data owners.
In this study, we first introduce an integrated baseline architecture, design principles, and
implementation techniques for privacy-preserving data mining systems. We then discuss
the key components of privacy-preserving data mining systems which include three
protocols: data collection, inference control, and information sharing. We present and
compare strategies for realizing these protocols. Theoretical analysis and experimental
evaluation show that our protocols can generate accurate data mining models while
protecting the privacy of the data being mined.