140,405 research outputs found

    Characterizing predictable classes of processes

    Get PDF
    The problem is sequence prediction in the following setting. A sequence x1,...,xn,...x_1,...,x_n,... of discrete-valued observations is generated according to some unknown probabilistic law (measure) μ\mu. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure μ\mu belongs to an arbitrary class \C of stochastic processes. We are interested in predictors ρ\rho whose conditional probabilities converge to the "true" μ\mu-conditional probabilities if any \mu\in\C is chosen to generate the data. We show that if such a predictor exists, then a predictor can also be obtained as a convex combination of a countably many elements of \C. In other words, it can be obtained as a Bayesian predictor whose prior is concentrated on a countable set. This result is established for two very different measures of performance of prediction, one of which is very strong, namely, total variation, and the other is very weak, namely, prediction in expected average Kullback-Leibler divergence

    Universality of Bayesian mixture predictors

    Full text link
    The problem is that of sequential probability forecasting for finite-valued time series. The data is generated by an unknown probability distribution over the space of all one-way infinite sequences. It is known that this measure belongs to a given set C, but the latter is completely arbitrary (uncountably infinite, without any structure given). The performance is measured with asymptotic average log loss. In this work it is shown that the minimax asymptotic performance is always attainable, and it is attained by a convex combination of a countably many measures from the set C (a Bayesian mixture). This was previously only known for the case when the best achievable asymptotic error is 0. This also contrasts previous results that show that in the non-realizable case all Bayesian mixtures may be suboptimal, while there is a predictor that achieves the optimal performance

    Asymptotics of Discrete MDL for Online Prediction

    Get PDF
    Minimum Description Length (MDL) is an important principle for induction and prediction, with strong relations to optimal Bayesian learning. This paper deals with learning non-i.i.d. processes by means of two-part MDL, where the underlying model class is countable. We consider the online learning framework, i.e. observations come in one by one, and the predictor is allowed to update his state of mind after each time step. We identify two ways of predicting by MDL for this setup, namely a static} and a dynamic one. (A third variant, hybrid MDL, will turn out inferior.) We will prove that under the only assumption that the data is generated by a distribution contained in the model class, the MDL predictions converge to the true values almost surely. This is accomplished by proving finite bounds on the quadratic, the Hellinger, and the Kullback-Leibler loss of the MDL learner, which are however exponentially worse than for Bayesian prediction. We demonstrate that these bounds are sharp, even for model classes containing only Bernoulli distributions. We show how these bounds imply regret bounds for arbitrary loss functions. Our results apply to a wide range of setups, namely sequence prediction, pattern classification, regression, and universal induction in the sense of Algorithmic Information Theory among others.Comment: 34 page

    Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet

    Full text link
    Various optimality properties of universal sequence predictors based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will be studied. The probability of observing xtx_t at time tt, given past observations x1...xt1x_1...x_{t-1} can be computed with the chain rule if the true generating distribution μ\mu of the sequences x1x2x3...x_1x_2x_3... is known. If μ\mu is unknown, but known to belong to a countable or continuous class \M one can base ones prediction on the Bayes-mixture ξ\xi defined as a wνw_\nu-weighted sum or integral of distributions \nu\in\M. The cumulative expected loss of the Bayes-optimal universal prediction scheme based on ξ\xi is shown to be close to the loss of the Bayes-optimal, but infeasible prediction scheme based on μ\mu. We show that the bounds are tight and that no other predictor can lead to significantly smaller bounds. Furthermore, for various performance measures, we show Pareto-optimality of ξ\xi and give an Occam's razor argument that the choice wν2K(ν)w_\nu\sim 2^{-K(\nu)} for the weights is optimal, where K(ν)K(\nu) is the length of the shortest program describing ν\nu. The results are applied to games of chance, defined as a sequence of bets, observations, and rewards. The prediction schemes (and bounds) are compared to the popular predictors based on expert advice. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.Comment: 34 page

    On Generalized Computable Universal Priors and their Convergence

    Full text link
    Solomonoff unified Occam's razor and Epicurus' principle of multiple explanations to one elegant, formal, universal theory of inductive inference, which initiated the field of algorithmic information theory. His central result is that the posterior of the universal semimeasure M converges rapidly to the true sequence generating posterior mu, if the latter is computable. Hence, M is eligible as a universal predictor in case of unknown mu. The first part of the paper investigates the existence and convergence of computable universal (semi)measures for a hierarchy of computability classes: recursive, estimable, enumerable, and approximable. For instance, M is known to be enumerable, but not estimable, and to dominate all enumerable semimeasures. We present proofs for discrete and continuous semimeasures. The second part investigates more closely the types of convergence, possibly implied by universality: in difference and in ratio, with probability 1, in mean sum, and for Martin-Loef random sequences. We introduce a generalized concept of randomness for individual sequences and use it to exhibit difficulties regarding these issues. In particular, we show that convergence fails (holds) on generalized-random sequences in gappy (dense) Bernoulli classes.Comment: 22 page

    Discrete MDL Predicts in Total Variation

    Get PDF
    The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed.Comment: 15 LaTeX page

    Absolutely No Free Lunches!

    Get PDF
    This paper is concerned with learners who aim to learn patterns in infinite binary sequences: shown longer and longer initial segments of a binary sequence, they either attempt to predict whether the next bit will be a 0 or will be a 1 or they issue forecast probabilities for these events. Several variants of this problem are considered. In each case, a no-free-lunch result of the following form is established: the problem of learning is a formidably difficult one, in that no matter what method is pursued, failure is incomparably more common that success; and difficult choices must be faced in choosing a method of learning, since no approach dominates all others in its range of success. In the simplest case, the comparison of the set of situations in which a method fails and the set of situations in which it succeeds is a matter of cardinality (countable vs. uncountable); in other cases, it is a topological matter (meagre vs. co-meagre) or a hybrid computational-topological matter (effectively meagre vs. effectively co-meagre)
    corecore