14,077 research outputs found
The Limits of Post-Selection Generalization
While statistics and machine learning offers numerous methods for ensuring
generalization, these methods often fail in the presence of adaptivity---the
common practice in which the choice of analysis depends on previous
interactions with the same dataset. A recent line of work has introduced
powerful, general purpose algorithms that ensure post hoc generalization (also
called robust or post-selection generalization), which says that, given the
output of the algorithm, it is hard to find any statistic for which the data
differs significantly from the population it came from.
In this work we show several limitations on the power of algorithms
satisfying post hoc generalization. First, we show a tight lower bound on the
error of any algorithm that satisfies post hoc generalization and answers
adaptively chosen statistical queries, showing a strong barrier to progress in
post selection data analysis. Second, we show that post hoc generalization is
not closed under composition, despite many examples of such algorithms
exhibiting strong composition properties
Hacking Smart Machines with Smarter Ones: How to Extract Meaningful Data from Machine Learning Classifiers
Machine Learning (ML) algorithms are used to train computers to perform a
variety of complex tasks and improve with experience. Computers learn how to
recognize patterns, make unintended decisions, or react to a dynamic
environment. Certain trained machines may be more effective than others because
they are based on more suitable ML algorithms or because they were trained
through superior training sets. Although ML algorithms are known and publicly
released, training sets may not be reasonably ascertainable and, indeed, may be
guarded as trade secrets. While much research has been performed about the
privacy of the elements of training sets, in this paper we focus our attention
on ML classifiers and on the statistical information that can be unconsciously
or maliciously revealed from them. We show that it is possible to infer
unexpected but useful information from ML classifiers. In particular, we build
a novel meta-classifier and train it to hack other classifiers, obtaining
meaningful information about their training sets. This kind of information
leakage can be exploited, for example, by a vendor to build more effective
classifiers or to simply acquire trade secrets from a competitor's apparatus,
potentially violating its intellectual property rights
Using Metrics Suites to Improve the Measurement of Privacy in Graphs
The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link.Social graphs are widely used in research (e.g., epidemiology) and business (e.g., recommender systems). However, sharing these graphs poses privacy risks because they contain sensitive information about individuals. Graph anonymization techniques aim to protect individual users in a graph, while graph de-anonymization aims to re-identify users. The effectiveness of anonymization and de-anonymization algorithms is usually evaluated with privacy metrics. However, it is unclear how strong existing privacy metrics are when they are used in graph privacy. In this paper, we study 26 privacy metrics for graph anonymization and de-anonymization and evaluate their strength in terms of three criteria: monotonicity indicates whether the metric indicates lower privacy for stronger adversaries; for within-scenario comparisons, evenness indicates whether metric values are spread evenly; and for between-scenario comparisons, shared value range indicates whether metrics use a consistent value range across scenarios. Our extensive experiments indicate that no single metric fulfills all three criteria perfectly. We therefore use methods from multi-criteria decision analysis to aggregate multiple metrics in a metrics suite, and we show that these metrics suites improve monotonicity compared to the best individual metric. This important result enables more monotonic, and thus more accurate, evaluations of new graph anonymization and de-anonymization algorithms
Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting
Machine learning algorithms, when applied to sensitive data, pose a distinct
threat to privacy. A growing body of prior work demonstrates that models
produced by these algorithms may leak specific private information in the
training data to an attacker, either through the models' structure or their
observable behavior. However, the underlying cause of this privacy risk is not
well understood beyond a handful of anecdotal accounts that suggest overfitting
and influence might play a role.
This paper examines the effect that overfitting and influence have on the
ability of an attacker to learn information about the training data from
machine learning models, either through training set membership inference or
attribute inference attacks. Using both formal and empirical analyses, we
illustrate a clear relationship between these factors and the privacy risk that
arises in several popular machine learning algorithms. We find that overfitting
is sufficient to allow an attacker to perform membership inference and, when
the target attribute meets certain conditions about its influence, attribute
inference attacks. Interestingly, our formal analysis also shows that
overfitting is not necessary for these attacks and begins to shed light on what
other factors may be in play. Finally, we explore the connection between
membership inference and attribute inference, showing that there are deep
connections between the two that lead to effective new attacks
Online graph coloring against a randomized adversary
Electronic version of an article published as
Online graph coloring against a randomized adversary. "International journal of foundations of computer science", 1 Juny 2018, vol. 29, nĂșm. 4, p. 551-569. DOI:10.1142/S0129054118410058 © 2018 copyright World Scientific Publishing Company. https://www.worldscientific.com/doi/abs/10.1142/S0129054118410058We consider an online model where an adversary constructs a set of 2s instances S instead of one single instance. The algorithm knows S and the adversary will choose one instance from S at random to present to the algorithm. We further focus on adversaries that construct sets of k-chromatic instances. In this setting, we provide upper and lower bounds on the competitive ratio for the online graph coloring problem as a function of the parameters in this model. Both bounds are linear in s and matching upper and lower bound are given for a specific set of algorithms that we call âminimalistic online algorithmsâ.Peer ReviewedPostprint (author's final draft
Online Isotonic Regression
We consider the online version of the isotonic regression problem. Given a
set of linearly ordered points (e.g., on the real line), the learner must
predict labels sequentially at adversarially chosen positions and is evaluated
by her total squared loss compared against the best isotonic (non-decreasing)
function in hindsight. We survey several standard online learning algorithms
and show that none of them achieve the optimal regret exponent; in fact, most
of them (including Online Gradient Descent, Follow the Leader and Exponential
Weights) incur linear regret. We then prove that the Exponential Weights
algorithm played over a covering net of isotonic functions has a regret bounded
by and present a matching
lower bound on regret. We provide a computationally efficient version of this
algorithm. We also analyze the noise-free case, in which the revealed labels
are isotonic, and show that the bound can be improved to or even to
(when the labels are revealed in isotonic order). Finally, we extend the
analysis beyond squared loss and give bounds for entropic loss and absolute
loss.Comment: 25 page
Efficient Transductive Online Learning via Randomized Rounding
Most traditional online learning algorithms are based on variants of mirror
descent or follow-the-leader. In this paper, we present an online algorithm
based on a completely different approach, tailored for transductive settings,
which combines "random playout" and randomized rounding of loss subgradients.
As an application of our approach, we present the first computationally
efficient online algorithm for collaborative filtering with trace-norm
constrained matrices. As a second application, we solve an open question
linking batch learning and transductive online learningComment: To appear in a Festschrift in honor of V.N. Vapnik. Preliminary
version presented in NIPS 201
- âŠ