1,102 research outputs found
Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
Various optimality properties of universal sequence predictors based on
Bayes-mixtures in general, and Solomonoff's prediction scheme in particular,
will be studied. The probability of observing $x_t$ at time $t$, given past
observations $x_1...x_{t-1}$, can be computed with the chain rule if the true
generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. If
$\mu$ is unknown, but known to belong to a countable or continuous class $\M$,
one can base one's prediction on the Bayes-mixture $\xi$, defined as a
$w_\nu$-weighted sum or integral of distributions $\nu\in\M$. The cumulative
expected loss of the Bayes-optimal universal prediction scheme based on $\xi$
is shown to be close to the loss of the Bayes-optimal, but infeasible,
prediction scheme based on $\mu$. We show that the bounds are tight and that no
other predictor can lead to significantly smaller bounds. Furthermore, for
various performance measures, we show Pareto-optimality of $\xi$ and give an
Occam's razor argument that the choice $w_\nu = 2^{-K(\nu)}$ for the weights
is optimal, where $K(\nu)$ is the length of the shortest program describing
$\nu$. The results are applied to games of chance, defined as a sequence of
bets, observations, and rewards. The prediction schemes (and bounds) are
compared to the popular predictors based on expert advice. Extensions to
infinite alphabets, partial, delayed and probabilistic prediction,
classification, and more active systems are briefly discussed.
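As an illustrative sketch (not from the paper), the chain-rule prediction and posterior-weight update for a Bayes-mixture $\xi$ can be written for a tiny hypothetical class $\M$ of Bernoulli sources; the class, weights, and sample sequence below are assumptions chosen for demonstration:

```python
import math

# Hypothetical class M of Bernoulli(theta) sources with prior weights w_nu.
thetas = [0.1, 0.5, 0.9]           # the class M (illustrative)
weights = [1/3, 1/3, 1/3]          # prior weights w_nu, summing to 1

def predict(weights, thetas):
    """Mixture probability xi(next bit = 1) = sum_nu w_nu * nu(1)."""
    return sum(w * t for w, t in zip(weights, thetas))

def update(weights, thetas, bit):
    """Bayes update: w_nu <- w_nu * nu(bit) / xi(bit)."""
    likes = [t if bit == 1 else 1 - t for t in thetas]
    z = sum(w * l for w, l in zip(weights, likes))
    return [w * l / z for w, l in zip(weights, likes)]

# Sequentially predict a sequence from the (unknown) true source.
seq = [1, 1, 0, 1, 1, 1, 0, 1]     # e.g. generated with theta near 0.75
log_loss = 0.0
for bit in seq:
    p1 = predict(weights, thetas)
    log_loss -= math.log(p1 if bit == 1 else 1 - p1)
    weights = update(weights, thetas, bit)
```

After the loop, the posterior weight concentrates on the component closest to the empirical bit frequency, which is the mechanism behind the loss bounds relative to the scheme based on $\mu$.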
Bad Universal Priors and Notions of Optimality
A big open question of algorithmic information theory is the choice of the
universal Turing machine (UTM). For Kolmogorov complexity and Solomonoff
induction we have invariance theorems: the choice of the UTM changes bounds
only by a constant. For the universally intelligent agent AIXI (Hutter, 2005)
no invariance theorem is known. Our results are entirely negative: we discuss
cases in which unlucky or adversarial choices of the UTM cause AIXI to
misbehave drastically. We show that Legg-Hutter intelligence and thus balanced
Pareto optimality is entirely subjective, and that every policy is Pareto
optimal in the class of all computable environments. This undermines all
existing optimality properties for AIXI. While it may still serve as a gold
standard for AI, our results imply that AIXI is a relative theory, dependent on
the choice of the UTM.
On Universal Prediction and Bayesian Confirmation
The Bayesian framework is a well-studied and successful framework for
inductive reasoning, which includes hypothesis testing and confirmation,
parameter estimation, sequence prediction, classification, and regression. But
standard statistical guidelines for choosing the model class and prior are not
always available or fail, in particular in complex situations. Solomonoff
completed the Bayesian framework by providing a rigorous, unique, formal, and
universal choice for the model class and the prior. We discuss in breadth how
and in which sense universal (non-i.i.d.) sequence prediction solves various
(philosophical) problems of traditional Bayesian sequence prediction. We show
that Solomonoff's model possesses many desirable properties: Strong total and
weak instantaneous bounds, and in contrast to most classical continuous prior
densities has no zero p(oste)rior problem, i.e. can confirm universal
hypotheses, is reparametrization and regrouping invariant, and avoids the
old-evidence and updating problem. It even performs well (actually better) in
non-computable environments.
Algorithmic Complexity Bounds on Future Prediction Errors
We bound the future loss when predicting any (computably) stochastic sequence
online. Solomonoff finitely bounded the total deviation of his universal
predictor $M$ from the true distribution $\mu$ by the algorithmic complexity of
$\mu$. Here we assume we are at a time $t>1$ and have already observed
$x=x_1...x_t$. We bound the future prediction performance on $x_{t+1}x_{t+2}...$
by a new variant of algorithmic complexity of $\mu$ given $x$, plus the
complexity of the randomness deficiency of $x$. The new complexity is monotone in its condition
in the sense that this complexity can only decrease if the condition is
prolonged. We also briefly discuss potential generalizations to Bayesian model
classes and to classification problems.
Discrete Denoising with Shifts
We introduce S-DUDE, a new algorithm for denoising DMC-corrupted data. The
algorithm, which generalizes the recently introduced DUDE (Discrete Universal
DEnoiser) of Weissman et al., aims to compete with a genie that has access, in
addition to the noisy data, also to the underlying clean data, and can choose
to switch, up to $m$ times, between sliding window denoisers in a way that
minimizes the overall loss. When the underlying data form an individual
sequence, we show that the S-DUDE performs essentially as well as this genie,
provided that $m$ is sub-linear in the size of the data. When the clean data is
emitted by a piecewise stationary process, we show that the S-DUDE achieves the
optimum distribution-dependent performance, provided that the same
sub-linearity condition is imposed on the number of switches. To further
substantiate the universal optimality of the S-DUDE, we show that when the
number of switches is allowed to grow linearly with the size of the data,
\emph{any} (sequence of) scheme(s) fails to compete in the above senses. Using
dynamic programming, we derive an efficient implementation of the S-DUDE, which
has complexity (time and memory) growing only linearly with the data size and
the number of switches $m$. Preliminary experimental results are presented,
suggesting that S-DUDE has the capacity to significantly improve on the
performance attained by the original DUDE in applications where the nature of
the data abruptly changes in time (or space), as is often the case in practice.
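For intuition, here is a heavily simplified, DUDE-flavored two-pass sliding-window denoiser for binary data. The window size, flip threshold, and decision rule are illustrative assumptions only: the actual DUDE rule depends on the channel matrix and loss function, and S-DUDE additionally switches between such denoisers along the sequence.

```python
from collections import Counter

def context_denoise(noisy, k=1, ratio=2.0):
    """Two-pass sketch: count symbol occurrences per length-2k context,
    then flip a bit if the opposite value is `ratio` times more common
    among positions sharing its context (illustrative rule, not DUDE's)."""
    n = len(noisy)
    counts = {}
    # Pass 1: gather context statistics from the noisy sequence itself.
    for i in range(k, n - k):
        ctx = (tuple(noisy[i - k:i]), tuple(noisy[i + 1:i + 1 + k]))
        counts.setdefault(ctx, Counter())[noisy[i]] += 1
    # Pass 2: apply a single (non-switching) denoising rule per context.
    out = list(noisy)
    for i in range(k, n - k):
        ctx = (tuple(noisy[i - k:i]), tuple(noisy[i + 1:i + 1 + k]))
        c = counts[ctx]
        if c[1 - noisy[i]] >= ratio * c[noisy[i]]:
            out[i] = 1 - noisy[i]
    return out
```

An isolated flipped bit inside a constant run is corrected, while structured data such as an alternating pattern is left untouched; S-DUDE's contribution is to let the effective rule change up to $m$ times along the sequence.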
Minimax Optimal Bayes Mixtures for Memoryless Sources
Tasks such as data compression and prediction commonly require choosing a probability distribution over all possible sequences. To achieve an efficient prediction strategy, the chosen distribution should be a good approximation of the true distribution underlying the data. Similarly, an efficient compression strategy should assign shorter codes for more probable sequences. In particular, a compression strategy that minimizes the code-length can be shown to minimize the often-used logarithmic prediction loss. However, the optimal strategy requires knowing the true distribution which is not available in most applications.
In universal compression or prediction we assume that the true probability distribution is not known but belongs to a known class of distributions. A universal code is a code that can compress the data essentially as well as the best distribution in the class in hindsight. Similarly, a universal predictor achieves low prediction loss regardless of the distribution. We call a universal code minimax optimal if it minimizes the worst-case regret, i.e. excess code-length or prediction loss compared to the best distribution in the class.
In this thesis we assume the known class to be discrete memoryless sources. The minimax optimal code for this class is given by the normalized maximum likelihood (NML) distribution. However, in practice computationally more efficient distributions such as Bayes mixtures have to be used. A Bayes mixture is a mixture of the probability distributions in the class weighted by a prior distribution. The conjugate prior to the multinomial distribution is the Dirichlet distribution, using which asymptotically minimax codes have been developed. The Dirichlet distribution requires a hyperparameter that dictates the amount of prior mass given to the outcomes. The distribution given by the symmetric hyperparameter 1/2 has been widely studied and has been shown to minimize the worst-case expected regret asymptotically.
Previous work on minimax optimal Bayes mixtures has mainly been concerned with large sample sizes in comparison to the alphabet size. In this thesis we investigate the minimax optimal Dirichlet prior in the large alphabet setting. In particular, we find that when the alphabet size is large compared to the sample size, the optimal hyperparameter for the Dirichlet distribution is 1/3. The worst-case regret of this mixture turns out to approach the NML regret when the alphabet size grows, and the distribution provides an efficient approximation of the NML distribution. Furthermore, we develop an efficient algorithm for finding the optimal hyperparameter for any sample size or alphabet size.
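The sequential form of such a Bayes mixture is easy to state: under a symmetric Dirichlet(alpha) prior over an m-ary memoryless source, the predictive probability of symbol j after n observations is (c_j + alpha) / (n + m*alpha), where c_j counts past occurrences of j (alpha = 1/2 gives the Krichevsky-Trofimov estimator). A minimal sketch, with illustrative inputs:

```python
import math

def mixture_code_length(seq, m, alpha):
    """Code length -log2 P(seq) of an m-ary sequence under the
    Dirichlet(alpha) Bayes mixture, computed via the chain rule:
    P(x_{n+1} = j | x_1..x_n) = (c_j + alpha) / (n + m * alpha)."""
    counts = [0] * m
    bits = 0.0
    for n, x in enumerate(seq):
        p = (counts[x] + alpha) / (n + m * alpha)
        bits -= math.log2(p)
        counts[x] += 1
    return bits

# Compare hyperparameters on a hypothetical large-alphabet sample.
sample = [0, 7, 3, 0, 12, 7]   # m = 16 symbols, only 6 observations
bits_half = mixture_code_length(sample, m=16, alpha=1/2)
bits_third = mixture_code_length(sample, m=16, alpha=1/3)
```

This makes the thesis question concrete: which alpha minimizes the worst-case excess of `mixture_code_length` over the best distribution in hindsight, as a function of the sample-size-to-alphabet-size ratio.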
Universality of Bayesian mixture predictors
The problem is that of sequential probability forecasting for finite-valued
time series. The data is generated by an unknown probability distribution over
the space of all one-way infinite sequences. It is known that this measure
belongs to a given set C, but the latter is completely arbitrary (uncountably
infinite, without any structure given). The performance is measured with
asymptotic average log loss. In this work it is shown that the minimax
asymptotic performance is always attainable, and it is attained by a convex
combination of countably many measures from the set C (a Bayesian mixture).
This was previously only known for the case when the best achievable asymptotic
error is 0. This also contrasts previous results that show that in the
non-realizable case all Bayesian mixtures may be suboptimal, while there is a
predictor that achieves the optimal performance.