15,915 research outputs found

    The Computational Power of Optimization in Online Learning

    Full text link
    We consider the fundamental problem of prediction with expert advice where the experts are "optimizable": there is a black-box optimization oracle that can be used to compute, in constant time, the leading expert in retrospect at any point in time. In this setting, we give a novel online algorithm that attains vanishing regret with respect to NN experts in total O~(N)\widetilde{O}(\sqrt{N}) computation time. We also give a lower bound showing that this running time cannot be improved (up to log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the standard, oracle-free setting where the required time for vanishing regret is Θ~(N)\widetilde{\Theta}(N). These results demonstrate an exponential gap between the power of optimization in online learning and its power in statistical learning: in the latter, an optimization oracle---i.e., an efficient empirical risk minimizer---allows to learn a finite hypothesis class of size NN in time O(logN)O(\log{N}). We also study the implications of our results to learning in repeated zero-sum games, in a setting where the players have access to oracles that compute, in constant time, their best-response to any mixed strategy of their opponent. We show that the runtime required for approximating the minimax value of the game in this setting is Θ~(N)\widetilde{\Theta}(\sqrt{N}), yielding again a quadratic improvement upon the oracle-free setting, where Θ~(N)\widetilde{\Theta}(N) is known to be tight

    Universal Learning of Repeated Matrix Games

    Full text link
    We study and compare the learning dynamics of two universal learning algorithms, one based on Bayesian learning and the other on prediction with expert advice. Both approaches have strong asymptotic performance guarantees. When confronted with the task of finding good long-term strategies in repeated 2x2 matrix games, they behave quite differently.Comment: 16 LaTeX pages, 8 eps figure

    Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet

    Full text link
    Various optimality properties of universal sequence predictors based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will be studied. The probability of observing xtx_t at time tt, given past observations x1...xt1x_1...x_{t-1} can be computed with the chain rule if the true generating distribution μ\mu of the sequences x1x2x3...x_1x_2x_3... is known. If μ\mu is unknown, but known to belong to a countable or continuous class \M one can base ones prediction on the Bayes-mixture ξ\xi defined as a wνw_\nu-weighted sum or integral of distributions \nu\in\M. The cumulative expected loss of the Bayes-optimal universal prediction scheme based on ξ\xi is shown to be close to the loss of the Bayes-optimal, but infeasible prediction scheme based on μ\mu. We show that the bounds are tight and that no other predictor can lead to significantly smaller bounds. Furthermore, for various performance measures, we show Pareto-optimality of ξ\xi and give an Occam's razor argument that the choice wν2K(ν)w_\nu\sim 2^{-K(\nu)} for the weights is optimal, where K(ν)K(\nu) is the length of the shortest program describing ν\nu. The results are applied to games of chance, defined as a sequence of bets, observations, and rewards. The prediction schemes (and bounds) are compared to the popular predictors based on expert advice. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.Comment: 34 page

    Justification of Logarithmic Loss via the Benefit of Side Information

    Full text link
    We consider a natural measure of relevance: the reduction in optimal prediction risk in the presence of side information. For any given loss function, this relevance measure captures the benefit of side information for performing inference on a random variable under this loss function. When such a measure satisfies a natural data processing property, and the random variable of interest has alphabet size greater than two, we show that it is uniquely characterized by the mutual information, and the corresponding loss function coincides with logarithmic loss. In doing so, our work provides a new characterization of mutual information, and justifies its use as a measure of relevance. When the alphabet is binary, we characterize the only admissible forms the measure of relevance can assume while obeying the specified data processing property. Our results naturally extend to measuring causal influence between stochastic processes, where we unify different causal-inference measures in the literature as instantiations of directed information
    corecore