Advances in privacy-preserving machine learning
Building useful predictive models often involves learning from personal data. For instance, companies use customer data to target advertisements, online education platforms collect student data to recommend content and improve user engagement, and medical researchers fit diagnostic models to patient data. A recent line of research aims to design learning algorithms that provide rigorous privacy guarantees for user data, in the sense that their outputs---models or predictions---leak as little information as possible about individuals in the training data. The goal of this dissertation is to design private learning algorithms with performance comparable to the best possible non-private ones. We quantify privacy using \emph{differential privacy}, a well-studied privacy notion that limits how much information is leaked about an individual by the output of an algorithm. Training a model using a differentially private algorithm prevents an adversary from confidently determining whether a specific person's data was used for training the model.
We begin by presenting a technique for practical differentially private convex optimization that can leverage any off-the-shelf optimizer as a black box. We also perform an extensive empirical evaluation of the state-of-the-art algorithms on a range of publicly available datasets, as well as in an industry application.
Next, we present a learning algorithm that outputs a private classifier when given black-box access to a non-private learner and a limited amount of unlabeled public data. We prove that the accuracy guarantee of our private algorithm in the PAC model of learning is comparable to that of the underlying non-private learner. Such a guarantee is not possible, in general, without public data.
Lastly, we consider building recommendation systems, which we model using matrix completion. We present the first algorithm for matrix completion with provable user-level privacy and accuracy guarantees. Our algorithm consistently outperforms the state-of-the-art private algorithms on a suite of datasets. Along the way, we give an optimal algorithm for differentially private singular vector computation which leads to significant savings in space and time when operating on sparse matrices. It can also be used for private low-rank approximation.
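The differential privacy guarantee described above can be illustrated with the classical Gaussian mechanism for releasing a mean (a minimal sketch, not an algorithm from the dissertation; the function name and parameters are illustrative):

```python
import numpy as np

def private_mean(data, lo, hi, epsilon, delta):
    """Differentially private mean via the Gaussian mechanism.

    Each record is clipped to [lo, hi], so changing one person's data
    moves the mean by at most (hi - lo) / n (the L2 sensitivity)."""
    n = len(data)
    clipped = np.clip(data, lo, hi)
    sensitivity = (hi - lo) / n
    # Standard noise calibration for (epsilon, delta)-DP, valid for epsilon <= 1.
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped.mean() + np.random.normal(0.0, sigma)

# With n = 10,000 records the added noise is tiny relative to the signal,
# yet an adversary cannot confidently tell whether any one record was used.
m_priv = private_mean(np.ones(10_000), lo=0.0, hi=1.0, epsilon=1.0, delta=1e-5)
```

Because sensitivity shrinks as 1/n, the privacy cost of a fixed noise level vanishes as the dataset grows, which is why private estimates can approach non-private accuracy at scale.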
Unintended Memorization in Large ASR Models, and How to Mitigate It
It is well-known that neural networks can unintentionally memorize their
training examples, causing privacy concerns. However, auditing memorization in
large non-auto-regressive automatic speech recognition (ASR) models has been
challenging due to the high compute cost of existing methods such as hardness
calibration. In this work, we design a simple auditing method to measure
memorization in large ASR models without the extra compute overhead.
Concretely, we speed up randomly-generated utterances to create a mapping
between vocal and text information that is difficult to learn from typical
training examples. Hence, accurate predictions only for sped-up training
examples can serve as clear evidence for memorization, and the corresponding
accuracy can be used to measure memorization. Using the proposed method, we
showcase memorization in state-of-the-art ASR models. To mitigate
memorization, we apply gradient clipping during training to bound the influence
of any individual example on the final model. We empirically show that clipping
each example's gradient can mitigate memorization for sped-up training examples
with up to 16 repetitions in the training set. Furthermore, we show that in
large-scale distributed training, clipping the average gradient on each compute
core maintains neutral model quality and compute cost while providing strong
privacy protection.
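The per-example clipping mitigation described above can be sketched as follows (an illustrative NumPy toy, not the paper's training code; the function name is my own):

```python
import numpy as np

def clip_per_example(grads, max_norm):
    """Clip each example's gradient to L2 norm <= max_norm, then average.

    This bounds the influence of any single training example on the
    model update, the mitigation described in the abstract above."""
    clipped = []
    for g in grads:
        norm = np.linalg.norm(g)
        scale = min(1.0, max_norm / (norm + 1e-12))  # shrink only if too large
        clipped.append(g * scale)
    return np.mean(clipped, axis=0)

# One outlier gradient cannot dominate the averaged update.
batch = [np.array([0.1, 0.2]), np.array([100.0, 0.0])]
update = clip_per_example(batch, max_norm=1.0)
```

Clipping the per-core average gradient instead of each example's gradient, as in the distributed variant mentioned above, trades a weaker per-example bound for zero extra compute over standard training.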
Guaranteed validity for empirical approaches to adaptive data analysis
We design a general framework for answering
adaptive statistical queries that focuses on
providing explicit confidence intervals along
with point estimates. Prior work in this area
has either focused on providing tight confidence
intervals for specific analyses, or providing
general worst-case bounds for point estimates.
Unfortunately, as we observe, these
worst-case bounds are loose in many settings
— often not even beating simple baselines like
sample splitting. Our main contribution is
to design a framework for providing valid,
instance-specific confidence intervals for point
estimates that can be generated by heuristics.
When paired with good heuristics, this
method gives guarantees that are orders of
magnitude better than the best worst-case
bounds. We provide a Python library implementing
our method. http://proceedings.mlr.press/v108/rogers20a.htm
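The sample-splitting baseline mentioned above can be sketched as follows (illustrative names, not the paper's library; the confidence interval uses a standard normal approximation on the held-out half):

```python
import numpy as np

def split_estimate(data, select_query, alpha_z=1.96):
    """Sample-splitting baseline for adaptive data analysis.

    The query is chosen on one half of the data; the point estimate and
    a ~95% normal-approximation confidence interval come from the other
    half, which is valid because the held-out half is independent of
    the (possibly adaptive) selection step."""
    n = len(data)
    explore, holdout = data[: n // 2], data[n // 2 :]
    query = select_query(explore)      # adaptive choice may overfit explore
    vals = query(holdout)              # fresh, unbiased evaluation
    est = vals.mean()
    half_width = alpha_z * vals.std(ddof=1) / np.sqrt(len(vals))
    return est, (est - half_width, est + half_width)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
# After peeking at the explore half, the analyst settles on "mean of x**2".
est, ci = split_estimate(data, lambda explore_half: (lambda h: h**2))
```

The cost of this baseline is that every answer uses only half the data; the framework described above aims to beat it by giving instance-specific intervals on the full sample.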
Revealing and protecting labels in distributed training
Distributed learning paradigms such as federated learning often involve transmitting
model updates, or gradients, over a network, thereby avoiding transmission of the
raw private data. However, it is possible for sensitive information about the training
data to be revealed from such gradients. Prior works have demonstrated that labels
can be revealed analytically from the last layer of certain models (e.g., ResNet),
or they can be reconstructed jointly with model inputs by using Gradients Matching
[1] with additional knowledge about the current state of the model. In this work,
we propose a method to discover the set of labels of training samples from only the
gradient of the last layer and the id-to-label mapping. Our method is applicable to
a wide variety of model architectures across multiple domains. We demonstrate
the effectiveness of our method for model training in two domains: image classification
and automatic speech recognition. Furthermore, we show that existing
reconstruction techniques improve their efficacy when used in conjunction with our
method. Conversely, we demonstrate that gradient quantization and sparsification
can significantly reduce the success of the attack.
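The analytic label leakage from the last layer can be illustrated in the well-known single-example case with softmax cross-entropy, where the gradient with respect to the last-layer bias equals p - y and is therefore negative only at the true class (a toy sketch of that observation, not the paper's batch method):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def reveal_label(bias_grad):
    """For one example under softmax cross-entropy, the last-layer bias
    gradient is p - y.  Since 0 < p_j < 1 and y is one-hot, the gradient
    is negative exactly at the true class, so argmin recovers the label."""
    return int(np.argmin(bias_grad))

logits = np.array([1.0, 2.0, 0.5, -1.0])
true_label = 2
y = np.eye(4)[true_label]
bias_grad = softmax(logits) - y  # analytic CE gradient w.r.t. the bias
```

The attack in the abstract generalizes this single-example observation to recover the set of labels present in an averaged batch gradient.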
Sex differences in the Simon task help to interpret sex differences in selective attention.
In the last decade, a number of studies have reported sex differences in selective attention, but a unified explanation for these effects is still missing. This study aims to better understand these differences and place them in an evolutionary psychological context. A total of 418 adult participants performed a computer-based Simon task, in which they responded to the direction of a left- or right-pointing arrow appearing left or right of a fixation point. Women were more strongly influenced by task-irrelevant spatial information than men (i.e., the Simon effect was larger in women, Cohen's d = 0.39). Further, the analysis of sex differences in behavioral adjustment to errors revealed that women slow down more than men following mistakes (d = 0.53). Based on the combined results of previous studies and the current data, it is proposed that sex differences in selective attention are caused by underlying sex differences in core abilities, such as spatial or verbal cognition.
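The two reported effects can be made concrete with a toy computation of the per-participant Simon effect and the between-group Cohen's d (the data below are illustrative, not the study's):

```python
import numpy as np

def simon_effect(rt_congruent, rt_incongruent):
    """Simon effect for one participant: mean reaction time on
    incongruent trials (arrow direction conflicts with its location)
    minus mean reaction time on congruent trials, in milliseconds."""
    return np.mean(rt_incongruent) - np.mean(rt_congruent)

def cohens_d(group_a, group_b):
    """Cohen's d between two groups using the pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

# Toy per-group Simon effects (ms); d = 0.39 in the study corresponds to
# a small-to-medium standardized difference like the one computed here.
women_effects = [40.0, 45.0, 50.0]
men_effects = [20.0, 25.0, 30.0]
d = cohens_d(women_effects, men_effects)
```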