    Minimax Optimal Bayes Mixtures for Memoryless Sources

    Tasks such as data compression and prediction commonly require choosing a probability distribution over all possible sequences. To achieve an efficient prediction strategy, the chosen distribution should be a good approximation of the true distribution underlying the data. Similarly, an efficient compression strategy should assign shorter codes to more probable sequences. In particular, a compression strategy that minimizes the code-length can be shown to minimize the often-used logarithmic prediction loss. However, the optimal strategy requires knowing the true distribution, which is not available in most applications. In universal compression or prediction we assume that the true probability distribution is not known but belongs to a known class of distributions. A universal code is a code that can compress the data essentially as well as the best distribution in the class in hindsight. Similarly, a universal predictor achieves low prediction loss regardless of the distribution. We call a universal code minimax optimal if it minimizes the worst-case regret, i.e., the excess code-length or prediction loss compared to the best distribution in the class. In this thesis we assume the known class to be the class of discrete memoryless sources. The minimax optimal code for this class is given by the normalized maximum likelihood (NML) distribution. However, in practice, computationally more efficient distributions such as Bayes mixtures have to be used. A Bayes mixture is a mixture of the probability distributions in the class weighted by a prior distribution. The conjugate prior for the multinomial distribution is the Dirichlet distribution, which has been used to construct asymptotically minimax codes. The Dirichlet distribution requires a hyperparameter that dictates the amount of prior mass given to the outcomes. The mixture given by the symmetric hyperparameter 1/2 has been widely studied and has been shown to minimize the worst-case expected regret asymptotically. Previous work on minimax optimal Bayes mixtures has mainly been concerned with sample sizes that are large in comparison to the alphabet size. In this thesis we investigate the minimax optimal Dirichlet prior in the large-alphabet setting. In particular, we find that when the alphabet size is large compared to the sample size, the optimal hyperparameter for the Dirichlet distribution is 1/3. The worst-case regret of this mixture approaches the NML regret as the alphabet size grows, so the mixture provides an efficient approximation of the NML distribution. Furthermore, we develop an efficient algorithm for finding the optimal hyperparameter for any sample size and alphabet size.
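    The trade-off described above can be checked numerically for small instances. The sketch below is not code from the thesis; the alphabet size, sample size, and hyperparameter values are illustrative choices. It enumerates all count vectors to compute, in nats, the worst-case regret of a symmetric Dirichlet Bayes mixture and the NML regret it is compared against.

```python
from math import lgamma, log, exp

def compositions(n, m):
    """Yield every count vector (n_1, ..., n_m) of non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest

def ml_log_prob(counts, n):
    """Log-probability of one sequence with these counts under its maximum-likelihood parameters."""
    return sum(c * log(c / n) for c in counts if c > 0)

def bayes_log_prob(counts, n, alpha):
    """Log-probability of one sequence under the symmetric Dirichlet(alpha) Bayes mixture."""
    m = len(counts)
    return (lgamma(m * alpha) - lgamma(n + m * alpha)
            + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts))

def worst_case_regret(n, m, alpha):
    """Maximum excess code-length (nats) of the mixture over the ML code, over all sequences."""
    return max(ml_log_prob(c, n) - bayes_log_prob(c, n, alpha)
               for c in compositions(n, m))

def nml_regret(n, m):
    """Log of the NML normalizer, i.e. the (constant) regret of the NML distribution, in nats."""
    total = sum(exp(lgamma(n + 1) - sum(lgamma(c + 1) for c in counts) + ml_log_prob(counts, n))
                for counts in compositions(n, m))
    return log(total)

if __name__ == "__main__":
    n, m = 6, 12   # small sample size, comparatively large alphabet (illustrative values)
    print("NML regret               :", round(nml_regret(n, m), 4))
    print("Dirichlet(1/2) worst case:", round(worst_case_regret(n, m, 1 / 2), 4))
    print("Dirichlet(1/3) worst case:", round(worst_case_regret(n, m, 1 / 3), 4))
```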

    Achievability of Asymptotic Minimax Regret in Online and Batch Prediction

    The normalized maximum likelihood model achieves the minimax coding (log-loss) regret for data of fixed sample size n. However, it is a batch strategy, i.e., it requires that n be known in advance. Furthermore, it is computationally infeasible for most statistical models, and several computationally feasible alternative strategies have been devised. We characterize the achievability of asymptotic minimaxity by batch strategies (i.e., strategies that depend on n) as well as online strategies (i.e., strategies independent of n). On one hand, we conjecture that for a large class of models, no online strategy can be asymptotically minimax. We prove that this holds under a slightly stronger definition of asymptotic minimaxity. Our numerical experiments support the conjecture about non-achievability by so-called last-step minimax algorithms, which are independent of n. On the other hand, we show that in the multinomial model, a Bayes mixture defined by the conjugate Dirichlet prior with a simple dependency on n achieves asymptotic minimaxity for all sequences, thus providing a simpler asymptotic minimax strategy compared to earlier work by Xie and Barron. The numerical results also demonstrate superior finite-sample behavior by a number of novel batch and online algorithms.
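    For concreteness, a symmetric Dirichlet(alpha) Bayes mixture in the multinomial model can be evaluated sequentially through its "add-alpha" predictive rule (alpha = 1/2 gives the Krichevsky-Trofimov predictor). The sketch below is a generic illustration, not the authors' code, and it does not reproduce the specific n-dependent hyperparameter analysed in the paper; it simply computes the regret of the add-alpha predictor against the best i.i.d. distribution chosen in hindsight.

```python
from math import log

def sequential_log_loss(sequence, m, alpha):
    """Cumulative log-loss (nats) of the add-alpha predictor on one symbol sequence.

    After t symbols with counts c_1..c_m, symbol j is predicted with probability
    (c_j + alpha) / (t + m * alpha); by the chain rule this equals the Dirichlet(alpha)
    Bayes mixture probability of the whole sequence.
    """
    counts = [0] * m
    loss = 0.0
    for t, symbol in enumerate(sequence):
        p = (counts[symbol] + alpha) / (t + m * alpha)
        loss += -log(p)
        counts[symbol] += 1
    return loss

def ml_log_loss(sequence, m):
    """Log-loss of the best i.i.d. distribution chosen in hindsight (the ML distribution)."""
    n = len(sequence)
    counts = [0] * m
    for s in sequence:
        counts[s] += 1
    return -sum(c * log(c / n) for c in counts if c > 0)

if __name__ == "__main__":
    seq = [0, 0, 1, 0, 2, 0, 0, 1]    # toy ternary sequence
    m = 3
    for alpha in (0.5, 1.0):          # KT (1/2) vs Laplace (1) predictors
        regret = sequential_log_loss(seq, m, alpha) - ml_log_loss(seq, m)
        print(f"alpha = {alpha}: regret = {regret:.4f} nats")
```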

    Minimum Description Length Model Selection - Problems and Extensions

    The thesis treats a number of open problems in Minimum Description Length model selection, especially prediction problems. It is shown how techniques from the "Prediction with Expert Advice" literature can be used to improve model selection performance, which is particularly useful in nonparametric settings.
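    As a generic illustration of the expert-advice machinery referred to above (not the thesis's specific construction), the sketch below aggregates the per-round probabilities of candidate models with exponential weights on their cumulative log-loss; with learning rate 1 this coincides with Bayesian model averaging. The toy experts and outcomes are made up for the example.

```python
from math import exp, log

def aggregate_log_loss(expert_probs, outcomes, eta=1.0):
    """Cumulative log-loss of the exponential-weights mixture of experts.

    expert_probs[t][k] is expert k's predicted probability that outcome t equals 1;
    outcomes[t] is 0 or 1. Weights are proportional to exp(-eta * cumulative log-loss).
    """
    n_experts = len(expert_probs[0])
    cum_loss = [0.0] * n_experts
    total = 0.0
    for probs, y in zip(expert_probs, outcomes):
        weights = [exp(-eta * l) for l in cum_loss]
        z = sum(weights)
        p1 = sum(w * p for w, p in zip(weights, probs)) / z   # mixture probability of outcome 1
        total += -log(p1 if y == 1 else 1.0 - p1)
        for k, pk in enumerate(probs):                        # update each expert's loss
            cum_loss[k] += -log(pk if y == 1 else 1.0 - pk)
    return total

if __name__ == "__main__":
    # Two toy "models": one always predicts 0.9 for outcome 1, the other 0.5.
    outcomes = [1, 1, 0, 1, 1, 1, 0, 1]
    expert_probs = [[0.9, 0.5] for _ in outcomes]
    print("mixture log-loss:", round(aggregate_log_loss(expert_probs, outcomes), 4))
```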

    A tutorial introduction to the minimum description length principle

    This tutorial provides an overview of and introduction to Rissanen's Minimum Description Length (MDL) Principle. The first chapter provides a conceptual, entirely non-technical introduction to the subject. It serves as a basis for the technical introduction given in the second chapter, in which all the ideas of the first chapter are made mathematically precise. The main ideas are discussed in great conceptual and technical detail. This tutorial is an extended version of the first two chapters of the collection "Advances in Minimum Description Length: Theory and Application" (edited by P. Grunwald, I. J. Myung and M. Pitt, to be published by the MIT Press, Spring 2005). Comment: 80 pages, 5 figures; report with 2 chapters.

    Empirical Bayes estimation: When does g-modeling beat f-modeling in theory (and in practice)?

    Empirical Bayes (EB) is a popular framework for large-scale inference that aims to find data-driven estimators to compete with the Bayesian oracle that knows the true prior. Two principled approaches to EB estimation have emerged over the years: f-modeling, which constructs an approximate Bayes rule by estimating the marginal distribution of the data, and g-modeling, which estimates the prior from data and then applies the learned Bayes rule. For the Poisson model, the prototypical examples are the celebrated Robbins estimator and the nonparametric MLE (NPMLE), respectively. It has long been recognized in practice that the Robbins estimator, while being conceptually appealing and computationally simple, lacks robustness and can be easily derailed by "outliers" (data points that were rarely observed before), unlike the NPMLE, which provides a more stable and interpretable fit thanks to its Bayes form. On the other hand, not only do the existing theories shed little light on this phenomenon, but they all point to the opposite, as both methods have recently been shown optimal in terms of the regret (excess over the Bayes risk) for compactly supported and subexponential priors with exact logarithmic factors. In this paper we provide a theoretical justification for the superiority of the NPMLE over Robbins for heavy-tailed data by considering priors with bounded $p$-th moment previously studied for the Gaussian model. For the Poisson model with sample size $n$, assuming $p>1$ (for otherwise triviality arises), we show that the NPMLE with appropriate regularization and truncation achieves a total regret $\tilde{\Theta}(n^{\frac{3}{2p+1}})$, which is minimax optimal within logarithmic factors. In contrast, the total regret of the Robbins estimator (with similar truncation) is $\tilde{\Theta}(n^{\frac{3}{p+2}})$ and hence suboptimal by a polynomial factor.
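    A minimal simulation makes the contrast concrete. The sketch below is not the authors' code and omits the regularization and truncation analysed in the paper: it implements the Robbins estimator (f-modeling) and a grid-based EM approximation to the NPMLE (g-modeling) for Poisson data; the prior used to generate the data and the grid settings are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import poisson

def robbins(x):
    """Robbins estimate of E[theta | X = x_i]: (x + 1) * N(x + 1) / N(x), N being empirical counts."""
    x = np.asarray(x)
    counts = np.bincount(x, minlength=x.max() + 2)
    return (x + 1) * counts[x + 1] / np.maximum(counts[x], 1)

def npmle_grid(x, grid_size=200, n_iter=500):
    """Grid-constrained NPMLE of the prior via EM, then the plug-in posterior mean."""
    x = np.asarray(x)
    grid = np.linspace(1e-3, x.max() + 1.0, grid_size)   # support points of the estimated prior
    w = np.full(grid_size, 1.0 / grid_size)              # mixing weights
    lik = poisson.pmf(x[:, None], grid[None, :])         # n-by-grid_size likelihood matrix
    for _ in range(n_iter):
        resp = lik * w                                   # E-step: responsibilities
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)                            # M-step: update the prior weights
    post = lik * w
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid                                   # posterior mean under the fitted prior

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.exponential(scale=2.0, size=2000)        # "true" prior, chosen only for the demo
    x = rng.poisson(theta)
    for name, est in [("Robbins", robbins(x)), ("NPMLE (grid EM)", npmle_grid(x))]:
        print(f"{name:16s} mean squared error: {np.mean((est - theta) ** 2):.3f}")
```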

    MDL, Penalized Likelihood, and Statistical Risk

    We determine, for both countable and uncountable collections of functions, information-theoretic conditions on a penalty $\mathrm{pen}(f)$ such that the optimizer $\hat{f}$ of the penalized log-likelihood criterion $\log 1/\mathrm{likelihood}(f) + \mathrm{pen}(f)$ has risk not more than the index of resolvability corresponding to the accuracy of the optimizer of the expected value of the criterion. If $F$ is the linear span of a dictionary of functions, traditional description-length penalties are based on the number of non-zero terms (the $\ell_0$ norm of the coefficients). We specialize our general conclusions to show that the $\ell_1$ norm of the coefficients times a suitable multiplier $\lambda$ is also an information-theoretically valid penalty.
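    The criterion itself is easy to instantiate. The sketch below is illustrative only: the multiplier lam is an arbitrary placeholder rather than the information-theoretically valid value derived in the paper, the noise level is assumed known, and the dictionary is a random toy design. It minimizes a Gaussian negative log-likelihood plus an $\ell_1$ penalty on the coefficients.

```python
import numpy as np
from scipy.optimize import minimize

def l1_penalized_fit(X, y, lam, sigma=1.0):
    """Coefficients minimizing the Gaussian negative log-likelihood plus an l1 penalty."""
    n, d = X.shape

    def criterion(beta):
        resid = y - X @ beta
        neg_log_lik = 0.5 * np.sum(resid ** 2) / sigma ** 2   # Gaussian -log likelihood, up to a constant
        return neg_log_lik + lam * np.sum(np.abs(beta))

    result = minimize(criterion, np.zeros(d), method="Powell")  # derivative-free; the l1 term is non-smooth
    return result.x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 10))       # a dictionary of 10 candidate functions
    true_beta = np.zeros(10)
    true_beta[:2] = [3.0, -2.0]          # only two non-zero terms
    y = X @ true_beta + rng.normal(size=100)
    beta_hat = l1_penalized_fit(X, y, lam=5.0)
    print(np.round(beta_hat, 2))
```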

    Learning Locally Minimax Optimal Bayesian Networks

    We consider the problem of learning Bayesian network models in a non-informative setting, where the only available information is a set of observational data, and no background knowledge is available. The problem can be divided into two different subtasks: learning the structure of the network (a set of independence relations), and learning the parameters of the model (that fix the probability distribution from the set of all distributions consistent with the chosen structure). There are not many theoretical frameworks that consistently handle both these problems together, the Bayesian framework being an exception. In this paper we propose an alternative, information-theoretic framework which sidesteps some of the technical problems facing the Bayesian approach. The framework is based on the minimax-optimal Normalized Maximum Likelihood (NML) distribution, which is motivated by the Minimum Description Length (MDL) principle. The resulting model selection criterion is consistent, and it provides a way to construct highly predictive Bayesian network models. Our empirical tests show that the proposed method compares favorably with alternative approaches in both model selection and prediction tasks.
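    The sketch below illustrates an NML-based local score in the spirit of the criterion described above; it is not the authors' implementation, and the toy data, variable indices, and candidate parent sets are illustrative. For each configuration of the parents, the conditional maximized log-likelihood of the child is penalized by the log of the multinomial NML normalizer, computed with a known linear-time recurrence.

```python
from math import lgamma, log, exp
from collections import Counter

def multinomial_nml_normalizer(K, n):
    """C(K, n): sum over all length-n sequences from a K-ary alphabet of their ML probability."""
    if n == 0:
        return 1.0
    c_prev, c_curr = 1.0, sum(                       # C(1, n) = 1 and the binomial sum for C(2, n)
        exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + (k * log(k / n) if k else 0.0)
            + ((n - k) * log((n - k) / n) if n - k else 0.0))
        for k in range(n + 1))
    if K == 1:
        return c_prev
    for j in range(3, K + 1):                        # recurrence: C(j, n) = C(j-1, n) + n/(j-2) * C(j-2, n)
        c_prev, c_curr = c_curr, c_curr + n / (j - 2) * c_prev
    return c_curr

def local_nml_score(data, child, parents, arity):
    """NML-penalized local score of `child` given `parents` (higher is better)."""
    groups = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        groups.setdefault(key, []).append(row[child])
    score = 0.0
    for values in groups.values():
        n_j = len(values)
        counts = Counter(values)
        score += sum(c * log(c / n_j) for c in counts.values())        # conditional ML log-likelihood
        score -= log(multinomial_nml_normalizer(arity[child], n_j))    # NML penalty for this block
    return score

if __name__ == "__main__":
    # Toy binary data over variables 0, 1, 2; compare two candidate parent sets for variable 2.
    data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
            (0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
    arity = {0: 2, 1: 2, 2: 2}
    print("score(2 | {0,1}):", round(local_nml_score(data, 2, (0, 1), arity), 4))
    print("score(2 | {})   :", round(local_nml_score(data, 2, (), arity), 4))
```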