10 research outputs found

Linear Regression from Strategic Data Sources

Author: Gast Nicolas
Ioannidis Stratis
Loiseau Patrick
Roussillon Benjamin
Publication date: 12/12/2019

Linear regression is a fundamental building block of statistical data analysis. It amounts to estimating the parameters of a linear model that maps input features to corresponding outputs. In the classical setting where the precision of each data point is fixed, the famous Aitken/Gauss-Markov theorem in statistics states that generalized least squares (GLS) is a so-called "Best Linear Unbiased Estimator" (BLUE). In modern data science, however, one often faces strategic data sources, namely, individuals who incur a cost for providing high-precision data. In this paper, we study a setting in which features are public but individuals choose the precision of the outputs they reveal to an analyst. We assume that the analyst performs linear regression on this dataset, and individuals benefit from the outcome of this estimation. We model this scenario as a game where individuals minimize a cost comprising two components: (a) an (agent-specific) disclosure cost for providing high-precision data; and (b) a (global) estimation cost representing the inaccuracy in the linear model estimate. In this game, the linear model estimate is a public good that benefits all individuals. We establish that this game has a unique non-trivial Nash equilibrium. We study the efficiency of this equilibrium and we prove tight bounds on the price of stability for a large class of disclosure and estimation costs. Finally, we study the estimator accuracy achieved at equilibrium. We show that, in general, Aitken's theorem does not hold under strategic data sources, though it does hold if individuals have identical disclosure costs (up to a multiplicative factor). 
When individuals have non-identical costs, we derive a bound on the improvement of the equilibrium estimation cost that can be achieved by deviating from GLS, under mild assumptions on the disclosure cost functions.

Comment: This version (v3) extends the results on the sub-optimality of GLS (Section 6) and improves writing in multiple places compared to v2. Compared to the initial version v1, it also fixes an error in Theorem 6 (now Theorem 5), and extended many of the result
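As a rough illustration of the classical, non-strategic setting that the abstract above takes as its starting point, the sketch below compares generalized least squares (GLS) with ordinary least squares (OLS) for a one-parameter model y_i = beta * x_i + noise_i, where each data point has a fixed, known noise variance. The function names, the scalar model, and the sample numbers are assumptions made for this illustration; they are not taken from the paper, and the game-theoretic part of the paper is not modeled.

```python
# Minimal sketch (assumed scalar model, hypothetical names): GLS vs. OLS when
# each data point i has a fixed, known noise variance variances[i].

def gls_scalar(xs, ys, variances):
    """GLS estimate of beta and its variance: weight each point by 1 / variance."""
    num = sum(x * y / v for x, y, v in zip(xs, ys, variances))
    den = sum(x * x / v for x, v in zip(xs, variances))
    return num / den, 1.0 / den

def ols_scalar(xs, ys, variances):
    """OLS estimate of beta (ignores precisions) and its variance under the same noise."""
    sxx = sum(x * x for x in xs)
    beta = sum(x * y for x, y in zip(xs, ys)) / sxx
    var = sum(x * x * v for x, v in zip(xs, variances)) / (sxx * sxx)
    return beta, var

# Illustrative data: outputs are exactly 2 * x, with very different precisions.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
variances = [0.1, 1.0, 1.0, 10.0]

beta_gls, var_gls = gls_scalar(xs, ys, variances)
beta_ols, var_ols = ols_scalar(xs, ys, variances)
print(beta_gls, var_gls)
print(beta_ols, var_ols)
```

Both estimators are linear and unbiased here; with fixed precisions the GLS variance is never larger than the OLS variance, which is the scalar case of the Aitken/Gauss-Markov statement.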

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Engineering Privacy in Public: Confounding Face Recognition

Author: A. Leon-Garcia
A. Serjantov
A.M. Martínez
C. Díaz
D. Valentin
J. Daugman
J.-F. Raymond
J.F. Traub
J.R. Rao
K. Fukunaga
M.K. Reiter
M.S. Aldenderfer
P.J. Phillips
R. Motwani
R.O. Duda
R.R. Sokal
S. Garfinkel
S. Lawrence
T.M. Cover
Publication venue: ScholarlyCommons
Publication date: 01/01/2003

The objective of DARPA’s Human ID at a Distance (HID) program is to develop automated biometric identification technologies to detect, recognize and identify humans at great distances. While nominally intended for security applications, if deployed widely, such technologies could become an enormous privacy threat, making practical the automatic surveillance of individuals on a grand scale. Face recognition, as the HID technology most rapidly approaching maturity, deserves immediate research attention in order to understand its strengths and limitations, with an objective of reliably foiling it when it is used inappropriately. This paper is a status report for a research program designed to achieve this objective within a larger goal of similarly defeating all HID technologies.

Crossref

ScholarlyCommons@Penn

Data Clustering and Micro-Perturbation for Privacy-Preserving Data Sharing and Analysis

Author: Li Xiao-Bai
Sarkar Sumit
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2010

Clustering-based data masking approaches are widely used for privacy-preserving data sharing and data mining. Existing approaches, however, cannot cope with the situation where confidential attributes are categorical. For numeric data, these approaches are also unable to preserve important statistical properties such as variance and covariance of the data. We propose a new approach that handles these problems effectively. The proposed approach adopts a minimum spanning tree technique for clustering data and a micro-perturbation method for masking data. Our approach is novel in that it (i) incorporates an entropy-based measure, which represents the disclosure risk of the categorical confidential attribute, into the traditional distance measure used for clustering in an innovative way; and (ii) introduces the notion of cluster-level micro-perturbation (as opposed to conventional micro-aggregation) for masking data, to preserve the statistical properties of the data. We provide both analytical and empirical justification for the proposed methodology.
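To make concrete why conventional micro-aggregation fails to preserve variance, as the abstract above notes, here is a minimal sketch of a simplified univariate micro-aggregation: values are sorted, split into fixed-size groups, and each value is replaced by its group mean. The grouping rule, the function name, and the sample data are assumptions for illustration only. This is not the authors' minimum-spanning-tree clustering or their micro-perturbation method, which the abstract does not specify.

```python
from statistics import mean, pvariance

def micro_aggregate(values, k):
    """Simplified micro-aggregation (assumed scheme): sort the values, form
    consecutive groups of size k, and replace every value by its group mean.
    The output is in sorted order, not the original record order."""
    ordered = sorted(values)
    masked = []
    for start in range(0, len(ordered), k):
        group = ordered[start:start + k]
        masked.extend([mean(group)] * len(group))
    return masked

# Illustrative data with two well-separated clusters.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
masked = micro_aggregate(values, 3)

print(masked)
print(mean(values), mean(masked))
print(pvariance(values), pvariance(masked))
```

Replacing values by group means keeps the overall mean but removes all within-group variation, so the masked data has a smaller variance than the original.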

AIS Electronic Library (AISeL)

Disclosure Analysis for Two-Way Contingency Tables

Author: LI Yingjiu
LU Haibing
WU Xintao
Publication venue: Springer Science and Business Media LLC
Publication date: 01/09/2006

Ministry of Education, Singapore under its Academic Research Funding Tier 1; SMU Research Offic

Institutional Knowledge at Singapore Management University

Population recovery and partial identification

Author: A Blum
A Kalai
Amir Yehudayoff
Avi Wigderson
CK Liew
E Kushilevitz
FA Matsen
J Feldman
J Traub
L Beck
R Agrawal
S Floyd
SL Warner
W Johnson
Publication venue: Springer Science and Business Media LLC

Crossref

Privacy-preserving data mining

Author: Zhang Nan
Publication date: 15/05/2009

In research on privacy-preserving data mining, we address issues related to extracting knowledge from large amounts of data without violating the privacy of the data owners. In this study, we first introduce an integrated baseline architecture, design principles, and implementation techniques for privacy-preserving data mining systems. We then discuss the key components of privacy-preserving data mining systems, which include three protocols: data collection, inference control, and information sharing. We present and compare strategies for realizing these protocols. Theoretical analysis and experimental evaluation show that our protocols can generate accurate data mining models while protecting the privacy of the data being mined.

Texas A&M Repository