49,875 research outputs found
Asymptotics of Discrete MDL for Online Prediction
Minimum Description Length (MDL) is an important principle for induction and
prediction, with strong relations to optimal Bayesian learning. This paper
deals with learning non-i.i.d. processes by means of two-part MDL, where the
underlying model class is countable. We consider the online learning framework,
i.e. observations come in one by one, and the predictor is allowed to update
his state of mind after each time step. We identify two ways of predicting by
MDL for this setup, namely a static} and a dynamic one. (A third variant,
hybrid MDL, will turn out inferior.) We will prove that under the only
assumption that the data is generated by a distribution contained in the model
class, the MDL predictions converge to the true values almost surely. This is
accomplished by proving finite bounds on the quadratic, the Hellinger, and the
Kullback-Leibler loss of the MDL learner, which are however exponentially worse
than for Bayesian prediction. We demonstrate that these bounds are sharp, even
for model classes containing only Bernoulli distributions. We show how these
bounds imply regret bounds for arbitrary loss functions. Our results apply to a
wide range of setups, namely sequence prediction, pattern classification,
regression, and universal induction in the sense of Algorithmic Information
Theory among others.Comment: 34 page
Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
Various optimality properties of universal sequence predictors based on
Bayes-mixtures in general, and Solomonoff's prediction scheme in particular,
will be studied. The probability of observing at time , given past
observations can be computed with the chain rule if the true
generating distribution of the sequences is known. If
is unknown, but known to belong to a countable or continuous class \M
one can base ones prediction on the Bayes-mixture defined as a
-weighted sum or integral of distributions \nu\in\M. The cumulative
expected loss of the Bayes-optimal universal prediction scheme based on
is shown to be close to the loss of the Bayes-optimal, but infeasible
prediction scheme based on . We show that the bounds are tight and that no
other predictor can lead to significantly smaller bounds. Furthermore, for
various performance measures, we show Pareto-optimality of and give an
Occam's razor argument that the choice for the weights
is optimal, where is the length of the shortest program describing
. The results are applied to games of chance, defined as a sequence of
bets, observations, and rewards. The prediction schemes (and bounds) are
compared to the popular predictors based on expert advice. Extensions to
infinite alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.Comment: 34 page
Strong Asymptotic Assertions for Discrete MDL in Regression and Classification
We study the properties of the MDL (or maximum penalized complexity)
estimator for Regression and Classification, where the underlying model class
is countable. We show in particular a finite bound on the Hellinger losses
under the only assumption that there is a "true" model contained in the class.
This implies almost sure convergence of the predictive distribution to the true
one at a fast rate. It corresponds to Solomonoff's central theorem of universal
induction, however with a bound that is exponentially larger.Comment: 6 two-column page
Pattern Recognition for Conditionally Independent Data
In this work we consider the task of relaxing the i.i.d assumption in pattern
recognition (or classification), aiming to make existing learning algorithms
applicable to a wider range of tasks. Pattern recognition is guessing a
discrete label of some object based on a set of given examples (pairs of
objects and labels). We consider the case of deterministically defined labels.
Traditionally, this task is studied under the assumption that examples are
independent and identically distributed. However, it turns out that many
results of pattern recognition theory carry over a weaker assumption. Namely,
under the assumption of conditional independence and identical distribution of
objects, while the only assumption on the distribution of labels is that the
rate of occurrence of each label should be above some positive threshold.
We find a broad class of learning algorithms for which estimations of the
probability of a classification error achieved under the classical i.i.d.
assumption can be generalised to the similar estimates for the case of
conditionally i.i.d. examples.Comment: parts of results published at ALT'04 and ICML'0
On the Convergence Speed of MDL Predictions for Bernoulli Sequences
We consider the Minimum Description Length principle for online sequence
prediction. If the underlying model class is discrete, then the total expected
square loss is a particularly interesting performance measure: (a) this
quantity is bounded, implying convergence with probability one, and (b) it
additionally specifies a `rate of convergence'. Generally, for MDL only
exponential loss bounds hold, as opposed to the linear bounds for a Bayes
mixture. We show that this is even the case if the model class contains only
Bernoulli distributions. We derive a new upper bound on the prediction error
for countable Bernoulli classes. This implies a small bound (comparable to the
one for Bayes mixtures) for certain important model classes. The results apply
to many Machine Learning tasks including classification and hypothesis testing.
We provide arguments that our theorems generalize to countable classes of
i.i.d. models.Comment: 17 page
Fast rates in statistical and online learning
The speed with which a learning algorithm converges as it is presented with
more data is a central problem in machine learning --- a fast rate of
convergence means less data is needed for the same level of performance. The
pursuit of fast rates in online and statistical learning has led to the
discovery of many conditions in learning theory under which fast learning is
possible. We show that most of these conditions are special cases of a single,
unifying condition, that comes in two forms: the central condition for 'proper'
learning algorithms that always output a hypothesis in the given model, and
stochastic mixability for online algorithms that may make predictions outside
of the model. We show that under surprisingly weak assumptions both conditions
are, in a certain sense, equivalent. The central condition has a
re-interpretation in terms of convexity of a set of pseudoprobabilities,
linking it to density estimation under misspecification. For bounded losses, we
show how the central condition enables a direct proof of fast rates and we
prove its equivalence to the Bernstein condition, itself a generalization of
the Tsybakov margin condition, both of which have played a central role in
obtaining fast rates in statistical learning. Yet, while the Bernstein
condition is two-sided, the central condition is one-sided, making it more
suitable to deal with unbounded losses. In its stochastic mixability form, our
condition generalizes both a stochastic exp-concavity condition identified by
Juditsky, Rigollet and Tsybakov and Vovk's notion of mixability. Our unifying
conditions thus provide a substantial step towards a characterization of fast
rates in statistical learning, similar to how classical mixability
characterizes constant regret in the sequential prediction with expert advice
setting.Comment: 69 pages, 3 figure
Online Optimization Methods for the Quantification Problem
The estimation of class prevalence, i.e., the fraction of a population that
belongs to a certain class, is a very useful tool in data analytics and
learning, and finds applications in many domains such as sentiment analysis,
epidemiology, etc. For example, in sentiment analysis, the objective is often
not to estimate whether a specific text conveys a positive or a negative
sentiment, but rather estimate the overall distribution of positive and
negative sentiments during an event window. A popular way of performing the
above task, often dubbed quantification, is to use supervised learning to train
a prevalence estimator from labeled data.
Contemporary literature cites several performance measures used to measure
the success of such prevalence estimators. In this paper we propose the first
online stochastic algorithms for directly optimizing these
quantification-specific performance measures. We also provide algorithms that
optimize hybrid performance measures that seek to balance quantification and
classification performance. Our algorithms present a significant advancement in
the theory of multivariate optimization and we show, by a rigorous theoretical
analysis, that they exhibit optimal convergence. We also report extensive
experiments on benchmark and real data sets which demonstrate that our methods
significantly outperform existing optimization techniques used for these
performance measures.Comment: 26 pages, 6 figures. A short version of this manuscript will appear
in the proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, KDD 201
Learning From Noisy Singly-labeled Data
Supervised learning depends on annotated examples, which are taken to be the
\emph{ground truth}. But these labels often come from noisy crowdsourcing
platforms, like Amazon Mechanical Turk. Practitioners typically collect
multiple labels per example and aggregate the results to mitigate noise (the
classic crowdsourcing problem). Given a fixed annotation budget and unlimited
unlabeled data, redundant annotation comes at the expense of fewer labeled
examples. This raises two fundamental questions: (1) How can we best learn from
noisy workers? (2) How should we allocate our labeling budget to maximize the
performance of a classifier? We propose a new algorithm for jointly modeling
labels and worker quality from noisy crowd-sourced data. The alternating
minimization proceeds in rounds, estimating worker quality from disagreement
with the current model and then updating the model by optimizing a loss
function that accounts for the current estimate of worker quality. Unlike
previous approaches, even with only one annotation per example, our algorithm
can estimate worker quality. We establish a generalization error bound for
models learned with our algorithm and establish theoretically that it's better
to label many examples once (vs less multiply) when worker quality is above a
threshold. Experiments conducted on both ImageNet (with simulated noisy
workers) and MS-COCO (using the real crowdsourced labels) confirm our
algorithm's benefits.Comment: 18 pages, 3 figure
- …