Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

Authors: Geambasu Roxana; Huang Tzu-Kuo; Lecuyer Mathias; Sen Siddhartha; Spahn Riley
Publication date: 21/05/2017

Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads in order to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data.
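
As a rough illustration of the count featurization idea named in the abstract, the sketch below replaces a categorical value with a smoothed estimate of the label rate observed for that value, so that a model can train on compact count tables rather than on the raw records. It is not Pyramid's implementation: the class name `CountFeaturizer`, the binary label, and the smoothing parameters `prior_pos` and `prior_total` are assumptions made for this example.

```python
class CountFeaturizer:
    """Toy count featurizer for one categorical feature and a binary label.

    Hypothetical illustration only; names and smoothing are assumptions.
    """

    def __init__(self, prior_pos=1.0, prior_total=2.0):
        self.prior_pos = prior_pos      # pseudo-count of positive labels
        self.prior_total = prior_total  # pseudo-count of all observations
        self.pos = {}                   # value -> number of positive labels
        self.total = {}                 # value -> number of observations

    def observe(self, value, label):
        """Update the count tables with one (value, label) observation."""
        self.total[value] = self.total.get(value, 0) + 1
        self.pos[value] = self.pos.get(value, 0) + (1 if label else 0)

    def featurize(self, value):
        """Return the smoothed positive rate for this categorical value."""
        pos = self.pos.get(value, 0)
        total = self.total.get(value, 0)
        return (pos + self.prior_pos) / (total + self.prior_total)


featurizer = CountFeaturizer()
for value, label in [("ad1", 1), ("ad1", 0), ("ad1", 1), ("ad2", 0)]:
    featurizer.observe(value, label)

ad1_rate = featurizer.featurize("ad1")      # (2 + 1) / (3 + 2)
ad2_rate = featurizer.featurize("ad2")      # (0 + 1) / (1 + 2)
unseen_rate = featurizer.featurize("ad3")   # falls back to the prior, 1 / 2
```

Once the tables are built, only the counts need to stay in accessible storage for featurization; the raw observations are no longer consulted by this sketch.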

arXiv.org e-Print Archive

Crossref

Linear and Range Counting under Metric-based Local Differential Privacy

Authors: Ding Bolin; He Xi; Xiang Zhuolun; Zhou Jingren
Publication date: 16/05/2020

Local differential privacy (LDP) enables private data sharing and analytics without the need for a trusted data collector. Error-optimal primitives (for, e.g., estimating means and item frequencies) under LDP have been well studied. For analytical tasks such as range queries, however, the best known error bound depends on the domain size of the private data, which is potentially prohibitive. This deficiency is inherent, as LDP enforces the same level of indistinguishability between any pair of private data values for each data owner. In this paper, we utilize an extension of $\epsilon$-LDP called Metric-LDP or $E$-LDP, where a metric $E$ defines heterogeneous privacy guarantees for different pairs of private data values and thus provides a more flexible knob than $\epsilon$ does to relax LDP and tune utility-privacy trade-offs. We show that, under such privacy relaxations, for analytical workloads such as linear counting, multi-dimensional range counting queries, and quantile queries, we can achieve significant gains in utility. In particular, for range queries under $E$-LDP where the metric $E$ is the $L^1$-distance function scaled by $\epsilon$, we design mechanisms with errors independent of the domain sizes; instead, their errors depend on the metric $E$, which specifies at what granularity the private data is protected. We believe that the primitives we design for $E$-LDP will be useful in developing mechanisms for other analytical tasks, and encourage the adoption of LDP in practice.
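
To make the metric-based guarantee concrete, here is a minimal sketch that assumes numeric private values and the metric E(x, x') = epsilon * |x - x'|. It uses the standard Laplace mechanism as a stand-in, not the mechanisms designed in the paper: each data owner adds Laplace noise of scale 1/epsilon locally, so the output densities for any two inputs differ by a factor of at most exp(E(x, x')), and the noise does not grow with the domain size. The function names and the toy data are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(x, epsilon, rng):
    """Local report: the true value plus Laplace noise of scale 1/epsilon."""
    return x + laplace_noise(1.0 / epsilon, rng)


def report_density(y, x, epsilon):
    """Density of observing report y when the true value is x."""
    return 0.5 * epsilon * math.exp(-epsilon * abs(y - x))


def satisfies_metric_bound(y, x1, x2, epsilon):
    """Check density(y | x1) <= exp(epsilon * |x1 - x2|) * density(y | x2)."""
    bound = math.exp(epsilon * abs(x1 - x2)) * report_density(y, x2, epsilon)
    return report_density(y, x1, epsilon) <= bound * (1.0 + 1e-9)


rng = random.Random(0)
epsilon = 1.0
values = [i % 100 for i in range(20000)]           # toy private values
reports = [privatize(v, epsilon, rng) for v in values]

true_mean = sum(values) / len(values)
estimated_mean = sum(reports) / len(reports)       # linear (sum) query, rescaled
```

Nearby values such as 40 and 41 are strongly protected under this metric, while distant values such as 0 and 99 are easier to tell apart; that is the relaxation the abstract describes as protecting data at a chosen granularity.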

arXiv.org e-Print Archive

Crossref

An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices

Authors: Abowd John M; Schmutte Ian M
Publication venue: DigitalCommons@ILR
Publication date: 15/08/2018

Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.

DigitalCommons@ILR