Noise-adaptive Margin-based Active Learning and Lower Bounds under Tsybakov Noise Condition
We present a simple noise-robust margin-based active learning algorithm to find homogeneous (passing through the origin) linear separators and analyze its error convergence when labels are corrupted by noise. We show that when the imposed noise satisfies the Tsybakov low noise condition (Mammen and Tsybakov 1999; Tsybakov 2004), the algorithm is able to adapt to the unknown noise level and achieves the optimal statistical rate up to poly-logarithmic factors. We also derive lower bounds for margin-based active learning algorithms under the Tsybakov noise condition (TNC) for the membership query synthesis scenario (Angluin 1988). Our result implies lower bounds for the stream-based selective sampling scenario (Cohn 1990) under TNC for some fairly simple data distributions. Quite surprisingly, we show that the sample complexity cannot be improved even if the underlying data distribution is as simple as the uniform distribution on the unit ball. Our proof involves the construction of a well-separated hypothesis set on the d-dimensional unit ball along with carefully designed label distributions for the Tsybakov noise condition. Our analysis might provide insights for other forms of lower bounds as well.
Comment: 16 pages, 2 figures. An abridged version to appear in the Thirtieth AAAI Conference on Artificial Intelligence (AAAI), held in Phoenix, AZ, USA in 201
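Below is a minimal sketch, in Python with NumPy, of the generic margin-based active learning template the abstract refers to: labels are queried only for points falling inside a band around the current homogeneous separator, and the band shrinks between rounds. The sampling schedule, band widths, and least-squares refit are illustrative placeholders, not the algorithm analyzed in the paper.

```python
import numpy as np

def margin_based_active_learning(X, query_label, rounds=5, batch=50, shrink=0.5, seed=0):
    """X: (n, d) pool of unlabeled points; query_label(i) returns the {-1,+1} label of X[i]."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)          # homogeneous separator: a unit vector through the origin
    band = 1.0
    for _ in range(rounds):
        margins = np.abs(X @ w)
        # query labels only for points closest to the current decision boundary
        idx = np.argsort(margins)[:batch]
        idx = idx[margins[idx] <= band]
        if idx.size == 0:
            break
        y = np.array([query_label(i) for i in idx])
        # refit a homogeneous separator on the queried points (least-squares proxy)
        w_new, *_ = np.linalg.lstsq(X[idx], y, rcond=None)
        if np.linalg.norm(w_new) > 0:
            w = w_new / np.linalg.norm(w_new)
        band *= shrink              # focus on a narrower margin region in the next round
    return w
```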
Learning by mirror averaging
Given a finite collection of estimators or classifiers, we study the problem
of model selection type aggregation, that is, we construct a new estimator or
classifier, called aggregate, which is nearly as good as the best among them
with respect to a given risk criterion. We define our aggregate by a simple
recursive procedure which solves an auxiliary stochastic linear programming
problem related to the original nonlinear one and constitutes a special case of
the mirror averaging algorithm. We show that the aggregate satisfies sharp
oracle inequalities under some general assumptions. The results are applied to
several problems including regression, classification and density estimation.
Comment: Published at http://dx.doi.org/10.1214/07-AOS546 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
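A minimal sketch of the kind of recursive exponential-weights aggregate the abstract describes, assuming NumPy and squared loss; the temperature `beta` and the loss are illustrative choices rather than the paper's exact tuning. The key point is that the aggregate averages the weight vectors over iterations rather than keeping only the final weights.

```python
import numpy as np

def mirror_averaging(preds, y, beta=1.0):
    """preds: (n, M) predictions of M base estimators on n observations (in arrival order),
    y: (n,) responses. Returns a weight vector over the M base estimators."""
    n, M = preds.shape
    cum_loss = np.zeros(M)
    avg_weights = np.zeros(M)
    for t in range(n):
        w = np.exp(-beta * (cum_loss - cum_loss.min()))   # stabilized exponential weights
        w /= w.sum()
        avg_weights += w / n                              # the aggregate averages weights over time
        cum_loss += (preds[t] - y[t]) ** 2                # update cumulative squared losses
    return avg_weights
```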
Discussion of "2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii
Discussion of "2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii [arXiv:0708.0083]
Comment: Published at http://dx.doi.org/10.1214/009053606000001064 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
An -Regularization Approach to High-Dimensional Errors-in-variables Models
Several new estimation methods have been recently proposed for the linear regression model with observation error in the design. Different assumptions on the data generating process have motivated different estimators and analyses. In particular, the literature has considered (1) observation errors in the design uniformly bounded by some constant, and (2) zero-mean independent observation errors. Under the first assumption, the rates of convergence of the proposed estimators depend explicitly on the bound, while the second assumption has been applied when an estimator for the second moment of the observation error is available. This work proposes and studies two new estimators which, compared to other procedures for regression models with errors in the design, exploit an additional norm regularization. The first estimator is applicable when both (1) and (2) hold but does not require an estimator for the second moment of the observation error. The second estimator is applicable under (2) and requires an estimator for the second moment of the observation error. Importantly, we impose no assumption on the accuracy of this pilot estimator, in contrast to previously known procedures. As in the recent proposals, we allow the number of covariates to be much larger than the sample size. We establish the rates of convergence of the estimators and compare them with the bounds obtained for related estimators in the literature. These comparisons reveal interesting insights into the interplay between the assumptions and the achievable rates of convergence.
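The toy below illustrates only the general idea of adding a second regularizer on top of an l1 penalty in a least-squares objective, solved by proximal gradient descent; it is not the paper's estimator (whose additional norm penalty and tuning are specific to the errors-in-variables setting), and NumPy is assumed.

```python
import numpy as np

def doubly_regularized_ls(X, y, lam1=0.1, lam2=0.1, steps=500):
    """Minimize (1/(2n))||y - X b||^2 + lam1*||b||_1 + (lam2/2)*||b||_2^2 by proximal gradient."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam2)   # 1 / Lipschitz constant of the smooth part
    for _ in range(steps):
        grad = X.T @ (X @ beta - y) / n + lam2 * beta     # gradient of least squares + l2 term
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)  # soft-threshold (l1 prox)
    return beta
```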
Exponential convergence of testing error for stochastic gradient methods
We consider binary classification problems with positive definite kernels and
square loss, and study the convergence rates of stochastic gradient methods. We
show that while the excess testing loss (squared loss) converges slowly to zero
as the number of observations (and thus iterations) goes to infinity, the
testing error (classification error) converges exponentially fast if low-noise
conditions are assumed.
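A minimal sketch, assuming NumPy and a Gaussian kernel, of stochastic gradient descent with the square loss in an RKHS as considered above; the step-size schedule is an illustrative choice. The testing (classification) error discussed in the abstract is that of the sign of the resulting estimate.

```python
import numpy as np

def kernel_sgd(X, y, gamma=1.0, c0=1.0):
    """X: (n, d) inputs processed in order, y: (n,) labels in {-1, +1}.
    Returns dual coefficients alpha so that f(x) = sum_i alpha_i * k(X_i, x)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    for t in range(n):
        k_t = np.exp(-gamma * np.sum((X[:t] - X[t]) ** 2, axis=1))  # k(X_i, X_t) for i < t
        f_t = alpha[:t] @ k_t                       # current prediction at the new point
        step = c0 / np.sqrt(t + 1)                  # illustrative decreasing step size
        alpha[t] = -step * (f_t - y[t])             # SGD step on the square loss
    return alpha

def predict(alpha, X_train, x, gamma=1.0):
    k = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    return np.sign(alpha @ k)                       # classification by the sign of the estimate
```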
Estimation of high-dimensional low-rank matrices
Suppose that we observe entries or, more generally, linear combinations of entries of an unknown matrix corrupted by noise. We are particularly interested in the high-dimensional setting where the number of unknown entries can be much larger than the sample size. Motivated by several applications, we consider estimation of this matrix under the assumption that it has small rank. This can be viewed as a dimension reduction or sparsity assumption. In order to shrink toward a low-rank representation, we investigate penalized least squares estimators with a Schatten-$p$ quasi-norm penalty term, $0 < p \le 1$. We study these estimators under two possible assumptions: a modified version of the restricted isometry condition and a uniform bound on the ratio "empirical norm induced by the sampling operator/Frobenius norm." The main results are stated as nonasymptotic upper bounds on the prediction risk and on the Schatten-$q$ risk of the estimators, where $q \in [p, 2]$. The rates that we obtain for the prediction risk are, up to logarithmic factors, proportional to the rank of the unknown matrix. The particular examples of multi-task learning and matrix completion are worked out in detail. The proofs are based on tools from the theory of empirical processes. As a by-product, we derive bounds for the $k$th entropy numbers of the quasi-convex Schatten class embeddings, which are of independent interest.
Comment: Published at http://dx.doi.org/10.1214/10-AOS860 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
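For intuition, the sketch below implements the convex endpoint $p = 1$ of a Schatten penalty (the nuclear norm) for matrix completion via proximal gradient, where the proximal step soft-thresholds singular values; the regularization level and step count are illustrative, and NumPy is assumed.

```python
import numpy as np

def nuclear_norm_completion(Y, mask, lam=1.0, steps=200):
    """Y: matrix with observed entries (any finite placeholder elsewhere),
    mask: boolean array marking observed positions. Returns a low-rank estimate."""
    A = np.zeros_like(Y, dtype=float)
    for _ in range(steps):
        grad = np.where(mask, A - Y, 0.0)           # gradient of 0.5*||P_obs(A - Y)||_F^2
        Z = A - grad                                # unit step (the Lipschitz constant is 1)
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                # soft-threshold the singular values
        A = (U * s) @ Vt                            # nuclear-norm proximal update
    return A
```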
Gradient-free optimization of highly smooth functions: improved analysis and a new algorithm
This work studies minimization problems with zero-order noisy oracle information under the assumption that the objective function is highly smooth and possibly satisfies additional properties. We consider two kinds of zero-order projected gradient descent algorithms, which differ in the form of the gradient estimator. The first algorithm uses a gradient estimator based on randomization over the $\ell_2$ sphere, due to Bach and Perchet (2016). We present an improved analysis of this algorithm on the class of highly smooth and strongly convex functions studied in the prior work, and we derive rates of convergence for two more general classes of non-convex functions. Namely, we consider highly smooth functions satisfying the Polyak-Łojasiewicz condition and the class of highly smooth functions with no additional property. The second algorithm is based on randomization over the $\ell_1$ sphere, and it extends to the highly smooth setting the algorithm that was recently proposed for Lipschitz convex functions in Akhavan et al. (2022). We show that, in the case of a noiseless oracle, this novel algorithm enjoys better bounds on bias and variance than the $\ell_2$ randomization and the commonly used Gaussian randomization algorithms, while in the noisy case both the $\ell_1$ and $\ell_2$ randomization algorithms benefit from similar improved theoretical guarantees. The improvements are achieved thanks to new proof techniques based on Poincaré-type inequalities for uniform distributions on the $\ell_1$ or $\ell_2$ spheres. The results are established under weak (almost adversarial) assumptions on the noise. Moreover, we provide minimax lower bounds proving optimality or near optimality of the obtained upper bounds in several cases.
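A minimal sketch, assuming NumPy, of a zero-order gradient estimator based on randomization over the Euclidean sphere (a two-point form) plugged into plain gradient descent; the smoothing radius and step size are illustrative, and the sketch omits the higher-order smoothing kernels and the $\ell_1$ randomization analyzed in the paper.

```python
import numpy as np

def sphere_gradient_estimate(f, x, h, rng):
    """Two-point zero-order gradient estimate using a uniform direction on the unit sphere."""
    d = x.size
    zeta = rng.normal(size=d)
    zeta /= np.linalg.norm(zeta)                    # uniform direction on the Euclidean sphere
    return d / (2 * h) * (f(x + h * zeta) - f(x - h * zeta)) * zeta

def zero_order_descent(f, x0, steps=1000, step=0.1, h=0.01, seed=0):
    """Gradient descent that only queries function values of f."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = sphere_gradient_estimate(f, x, h, rng)
        x -= step * g
    return x
```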
PAC-Bayesian bounds for sparse regression estimation with exponential weights
We consider the sparse regression model where the number of parameters $p$ is larger than the sample size $n$. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performance. The BIC estimator, for instance, performs well from the statistical point of view \cite{BTW07} but can only be computed for values of $p$ of at most a few tens. The Lasso estimator is the solution of a convex minimization problem, hence computable for large values of $p$. However, stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov \cite{arnak} propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large $p$ and satisfies nice statistical properties under weak assumptions on the design. However, \cite{arnak} proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of \cite{arnak} but with improved statistical performance. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of the exponential weights estimator. We also propose an MCMC method to compute our estimator for reasonably large values of $p$.
Comment: 19 pages
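The toy below, assuming NumPy, computes an exponentially weighted aggregate over a small enumerated set of sparse least-squares fits with a sparsity-favoring prior; the temperature and prior are illustrative, and the brute-force enumeration stands in for the MCMC computation the paper uses when $p$ is large.

```python
import numpy as np
from itertools import combinations

def exp_weights_sparse(X, y, beta=4.0, max_support=2, prior_decay=0.5):
    """Exponential-weights aggregate over least-squares fits on all supports of size <= max_support."""
    n, p = X.shape
    candidates, log_w = [], []
    for k in range(max_support + 1):
        for S in combinations(range(p), k):
            theta = np.zeros(p)
            if k > 0:
                coef, *_ = np.linalg.lstsq(X[:, list(S)], y, rcond=None)
                theta[list(S)] = coef
            risk = np.mean((y - X @ theta) ** 2)
            candidates.append(theta)
            log_w.append(-beta * n * risk + k * np.log(prior_decay))  # data fit + sparsity prior
    log_w = np.array(log_w)
    weights = np.exp(log_w - log_w.max())           # stabilized exponential weights
    weights /= weights.sum()
    return np.sum(weights[:, None] * np.array(candidates), axis=0)   # the aggregated coefficient vector
```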
Efficient Active Learning Halfspaces with Tsybakov Noise: A Non-convex Optimization Approach
We study the problem of computationally and label efficient PAC active learning of $d$-dimensional halfspaces with Tsybakov Noise \citep{tsybakov2004optimal} under structured unlabeled data distributions. Inspired by \cite{diakonikolas2020learning}, we prove that any approximate first-order stationary point of a smooth nonconvex loss function yields a halfspace with a low excess error guarantee. In light of this structural result, we design a nonconvex optimization-based algorithm with an improved label complexity bound (hiding factors of the form \polylog(d, \frac{1}{\epsilon}, \frac{1}{\delta})), under an assumption on the Tsybakov noise parameter; this narrows the gap between the label complexities of the previously known efficient passive or active algorithms \citep{diakonikolas2020polynomial,zhang2021improved} and the information-theoretic lower bound in this setting.
Comment: 29 pages
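A minimal sketch, assuming NumPy, of the kind of step the structural result above concerns: projected gradient descent on a smooth nonconvex surrogate loss for a halfspace over the unit sphere, returning an approximate stationary point. The sigmoid surrogate, step size, and direct access to labeled samples are illustrative choices, not the paper's label-efficient algorithm.

```python
import numpy as np

def sigmoid_loss_halfspace(X, y, steps=500, step=0.5, scale=2.0, seed=0):
    """X: (n, d) labeled points, y: (n,) labels in {-1, +1}. Returns a unit-norm halfspace."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(steps):
        m = y * (X @ w)                              # signed margins
        s = 1.0 / (1.0 + np.exp(scale * m))          # smooth nonconvex surrogate: sigmoid(-scale*m)
        grad = -(scale * s * (1 - s) * y) @ X / len(y)
        w -= step * grad
        w /= np.linalg.norm(w)                       # project back onto the unit sphere
    return w
```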