Privately Releasing Conjunctions and the Statistical Query Barrier
Suppose we would like to know all answers to a set of statistical queries C
on a data set up to small error, but we can only access the data itself using
statistical queries. A trivial solution is to exhaustively ask all queries in
C. Can we do any better?
+ We show that the number of statistical queries necessary and sufficient for
this task is, up to polynomial factors, equal to the agnostic learning
complexity of C in Kearns' statistical query (SQ) model. This gives a complete
answer to the question when running time is not a concern.
+ We then show that the problem can be solved efficiently (allowing arbitrary
error on a small fraction of queries) whenever the answers to C can be
described by a submodular function. This includes many natural concept classes,
such as graph cuts and Boolean disjunctions and conjunctions.
While interesting from a learning theoretic point of view, our main
applications are in privacy-preserving data analysis:
Here, our second result leads to the first algorithm that efficiently
releases differentially private answers to all Boolean conjunctions with 1%
average error. This presents significant progress on a key open problem in
privacy-preserving data analysis.
Our first result on the other hand gives unconditional lower bounds on any
differentially private algorithm that admits a (potentially
non-privacy-preserving) implementation using only statistical queries. Not only
our algorithms, but also most known private algorithms can be implemented using
only statistical queries, and hence are constrained by these lower bounds. Our
result therefore isolates the complexity of agnostic learning in the SQ-model
as a new barrier in the design of differentially private algorithms.
Robust Interactive Learning
In this paper we propose and study a generalization of the standard
active-learning model where a more general type of query, class conditional
query, is allowed. Such queries have been quite useful in applications, but
have been lacking theoretical understanding. In this work, we characterize the
power of such queries under two well-known noise models. We give nearly tight
upper and lower bounds on the number of queries needed to learn both for the
general agnostic setting and for the bounded noise model. We further show that
our methods can be made adaptive to the (unknown) noise rate, with only
negligible loss in query complexity.
On the Power of Learning from k-Wise Queries
Several well-studied models of access to data samples, including statistical queries, local differential privacy and low-communication algorithms, rely on queries that provide information about a function of a single sample. (For example, a statistical query (SQ) gives an estimate of E_{x ~ D}[q(x)] for any choice of the query function q mapping X to the reals, where D is an unknown data distribution over X.) Yet some data analysis algorithms rely on properties of functions that depend on multiple samples. Such algorithms would be naturally implemented using k-wise queries, each of which is specified by a function q mapping X^k to the reals. Hence it is natural to ask whether algorithms using k-wise queries can solve learning problems more efficiently, and by how much.
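The difference between a unary SQ and a k-wise query can be illustrated with a minimal sketch. It assumes the oracle is simulated by empirical averaging over an i.i.d. sample; the function names `sq_oracle` and `kwise_sq_oracle` are hypothetical and do not come from the paper, and a real SQ oracle is only required to answer within some tolerance rather than return this exact average.

```python
def sq_oracle(sample, q):
    """Simulated unary SQ: empirical estimate of E_{x ~ D}[q(x)].

    Assumption: `sample` is a list of i.i.d. draws from the unknown D.
    """
    return sum(q(x) for x in sample) / len(sample)


def kwise_sq_oracle(sample, q, k):
    """Simulated k-wise query: empirical estimate of E[q(x_1, ..., x_k)].

    The sample is cut into disjoint consecutive k-tuples so that the
    tuples are independent draws from D^k; leftover points are dropped.
    """
    tuples = [sample[i:i + k] for i in range(0, len(sample) - k + 1, k)]
    return sum(q(*t) for t in tuples) / len(tuples)


if __name__ == "__main__":
    data = [0, 1, 1, 0]
    # Unary query: the mean of the sample.
    print(sq_oracle(data, lambda x: x))
    # 2-wise query: how often two independent draws collide.
    print(kwise_sq_oracle(data, lambda a, b: float(a == b), 2))
```

The 2-wise collision query depends jointly on two samples, which is exactly the kind of statistic a single unary query function q mapping X to the reals cannot express directly.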
Blum, Kalai and Wasserman (2003) showed that for any weak PAC learning problem over a fixed distribution, the complexity of learning with k-wise SQs is smaller than the (unary) SQ complexity by a factor of at most 2^k. We show that for more general problems over distributions the picture is substantially richer. For every k, the complexity of distribution-independent PAC learning with k-wise queries can be exponentially larger than learning with (k+1)-wise queries. We then give two approaches for simulating a k-wise query using unary queries. The first approach exploits the structure of the
problem that needs to be solved. It generalizes and strengthens (exponentially)
the results of Blum et al. It allows us to derive strong lower bounds for
learning DNF formulas and stochastic constraint satisfaction problems that hold
against algorithms using k-wise queries. The second approach exploits the
k-party communication complexity of the k-wise query function.
Active classification with comparison queries
We study an extension of active learning in which the learning algorithm may
ask the annotator to compare the distances of two examples from the boundary of
their label-class. For example, in a recommendation system application (say for
restaurants), the annotator may be asked whether she liked or disliked a
specific restaurant (a label query), or which of two restaurants she
liked more (a comparison query).
We focus on the class of half spaces, and show that under natural
assumptions, such as large margin or bounded bit-description of the input
examples, it is possible to reveal all the labels of a sample of size using
approximately queries. This implies an exponential improvement over
classical active learning, where only label queries are allowed. We complement
these results by showing that if any of these assumptions is removed then, in
the worst case, queries are required.
Our results follow from a new general framework of active learning with
additional queries. We identify a combinatorial dimension, called the
inference dimension, that captures the query complexity when each
additional query is determined by examples (such as comparison queries,
each of which is determined by the two compared examples). Our results for half
spaces follow by bounding the inference dimension in the cases discussed above.
Comment: 23 pages (not including references), 1 figure. The new version
contains a minor fix in the proof of Lemma 4.
Near-Optimal Active Learning of Halfspaces via Query Synthesis in the Noisy Setting
In this paper, we consider the problem of actively learning a linear
classifier through query synthesis where the learner can construct artificial
queries in order to estimate the true decision boundaries. This problem has
recently gained a lot of interest in automated science and adversarial reverse
engineering for which only heuristic algorithms are known. In such
applications, queries can be constructed de novo to elicit information (e.g.,
automated science) or to evade detection with minimal cost (e.g., adversarial
reverse engineering). We develop a general framework, called dimension coupling
(DC), that 1) reduces a d-dimensional learning problem to d-1 low dimensional
sub-problems, 2) solves each sub-problem efficiently, 3) appropriately
aggregates the results and outputs a linear classifier, and 4) provides a
theoretical guarantee for all possible schemes of aggregation. The proposed
method is proved resilient to noise. We show that the DC framework avoids the
curse of dimensionality: its computational complexity scales linearly with the
dimension. Moreover, we show that the query complexity of DC is near optimal
(within a constant factor of the optimal algorithm). To further support our
theoretical analysis, we compare the performance of DC with the existing work.
We observe that DC consistently outperforms prior art in terms of query
complexity while often running orders of magnitude faster.
Comment: Accepted by AAAI 201