Minimax Optimal Bayes Mixtures for Memoryless Sources
Tasks such as data compression and prediction commonly require choosing a probability distribution over all possible sequences. To achieve an efficient prediction strategy, the chosen distribution should be a good approximation of the true distribution underlying the data. Similarly, an efficient compression strategy should assign shorter codes for more probable sequences. In particular, a compression strategy that minimizes the code-length can be shown to minimize the often-used logarithmic prediction loss. However, the optimal strategy requires knowing the true distribution which is not available in most applications.
In universal compression or prediction we assume that the true probability distribution is not known but belongs to a known class of distributions. A universal code is a code that can compress the data essentially as well as the best distribution in the class in hindsight. Similarly, a universal predictor achieves low prediction loss regardless of the distribution. We call a universal code minimax optimal if it minimizes the worst-case regret, i.e. excess code-length or prediction loss compared to the best distribution in the class.
In this thesis we assume the known class to be discrete memoryless sources. The minimax optimal code for this class is given by the normalized maximum likelihood (NML) distribution. However, in practice computationally more efficient distributions such as Bayes mixtures have to be used. A Bayes mixture is a mixture of the probability distributions in the class weighted by a prior distribution. The conjugate prior to the multinomial distribution is the Dirichlet distribution, using which asymptotically minimax codes have been developed. The Dirichlet distribution requires a hyperparameter that dictates the amount of prior mass given to the outcomes. The distribution given by the symmetric hyperparameter 1/2 has been widely studied and has been shown to minimize the worst-case expected regret asymptotically.
Previous work on minimax optimal Bayes mixtures has mainly been concerned with large sample sizes in comparison to the alphabet size. In this thesis we investigate the minimax optimal Dirichlet prior in the large alphabet setting. In particular, we find that when the alphabet size is large compared to the sample size, the optimal hyperparameter for the Dirichlet distribution is 1/3. The worst-case regret of this mixture turns out to approach the NML regret when the alphabet size grows, and the distribution provides an efficient approximation of the NML distribution. Furthermore, we develop an efficient algorithm for finding the optimal hyperparameter for any sample size or alphabet size.
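To make the quantities in this abstract concrete, the following is a minimal brute-force sketch, not the efficient algorithm the thesis develops. It compares the NML regret of a discrete memoryless source with the worst-case regret of a symmetric Dirichlet Bayes mixture by enumerating every count vector. All function names are hypothetical, regrets are in nats, and the enumeration is only feasible for small sample size n and alphabet size m.

```python
from math import exp, lgamma, log


def compositions(n, m):
    """Yield every count vector of m non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, m - 1):
            yield (first,) + rest


def log_max_likelihood(counts):
    """Log maximized likelihood of one sequence having these symbol counts."""
    n = sum(counts)
    return sum(c * log(c / n) for c in counts if c > 0)


def log_multinomial_coef(counts):
    """Log of the number of sequences that share these symbol counts."""
    return lgamma(sum(counts) + 1) - sum(lgamma(c + 1) for c in counts)


def nml_regret(n, m):
    """Log of the NML normalizer, i.e. the (constant) regret of NML."""
    total = sum(
        exp(log_multinomial_coef(c) + log_max_likelihood(c))
        for c in compositions(n, m)
    )
    return log(total)


def log_dirichlet_mixture(counts, alpha):
    """Log-probability of one sequence under the symmetric Dirichlet(alpha) mixture."""
    n, m = sum(counts), len(counts)
    return (
        lgamma(m * alpha)
        - lgamma(n + m * alpha)
        + sum(lgamma(c + alpha) - lgamma(alpha) for c in counts)
    )


def worst_case_regret(n, m, alpha):
    """Largest regret of the Dirichlet(alpha) mixture over all sequences of length n."""
    return max(
        log_max_likelihood(c) - log_dirichlet_mixture(c, alpha)
        for c in compositions(n, m)
    )
```

Calling `worst_case_regret(n, m, alpha)` for a grid of `alpha` values and comparing against `nml_regret(n, m)` gives a rough way to explore how the hyperparameter affects the gap to NML for small n and m; the worst-case regret of any mixture can never fall below the NML regret.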
Achievability of Asymptotic Minimax Regret in Online and Batch Prediction
The normalized maximum likelihood model achieves the minimax coding (log-loss) regret for data of fixed sample size n. However, it is a batch strategy, i.e., it requires that n be known in advance. Furthermore, it is computationally infeasible for most statistical models, and several computationally feasible alternative strategies have been devised. We characterize the achievability of asymptotic minimaxity by batch strategies (i.e., strategies that depend on n) as well as online strategies (i.e., strategies independent of n). On the one hand, we conjecture that for a large class of models, no online strategy can be asymptotically minimax. We prove that this holds under a slightly stronger definition of asymptotic minimaxity. Our numerical experiments support the conjecture about non-achievability by so-called last-step minimax algorithms, which are independent of n. On the other hand, we show that in the multinomial model, a Bayes mixture defined by the conjugate Dirichlet prior with a simple dependency on n achieves asymptotic minimaxity for all sequences, thus providing a simpler asymptotic minimax strategy compared to earlier work by Xie and Barron. The numerical results also demonstrate superior finite-sample behavior by a number of novel batch and online algorithms.
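As an illustration of what an online strategy looks like in the multinomial model, here is a minimal sketch of the standard add-alpha sequential predictor, which is the predictive form of a Bayes mixture with a symmetric Dirichlet(alpha) prior. With a fixed `alpha` the predictor never needs the horizon n. The function name and the integer symbol encoding are assumptions made for this example, and the paper's specific n-dependent choice of hyperparameter is not reproduced here.

```python
from math import log


def sequential_log_loss(sequence, alphabet_size, alpha=0.5):
    """Cumulative log-loss (in nats) of the add-alpha predictor.

    Symbols are assumed to be integers in range(alphabet_size).
    At step t the predicted probability of symbol j is
    (count_j + alpha) / (t + alphabet_size * alpha).
    """
    counts = [0] * alphabet_size
    loss = 0.0
    for t, symbol in enumerate(sequence):
        prob = (counts[symbol] + alpha) / (t + alphabet_size * alpha)
        loss -= log(prob)
        counts[symbol] += 1
    return loss
```

The cumulative loss depends only on the symbol counts, not on their order, so `sequential_log_loss([0, 1, 0], 2)` and `sequential_log_loss([0, 0, 1], 2)` agree. A batch variant would simply pass an `alpha` chosen as a function of the known sample size.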
Minimum Description Length Model Selection - Problems and Extensions
The thesis treats a number of open problems in Minimum Description Length model selection, especially prediction problems. It is shown how techniques from the "Prediction with Expert Advice" literature can be used to improve model selection performance, which is particularly useful in nonparametric settings.
A tutorial introduction to the minimum description length principle
This tutorial provides an overview of and introduction to Rissanen's Minimum
Description Length (MDL) Principle. The first chapter provides a conceptual,
entirely non-technical introduction to the subject. It serves as a basis for
the technical introduction given in the second chapter, in which all the ideas
of the first chapter are made mathematically precise. The main ideas are
discussed in great conceptual and technical detail. This tutorial is an
extended version of the first two chapters of the collection "Advances in
Minimum Description Length: Theory and Application" (edited by P. Grunwald, I.J.
Myung and M. Pitt, to be published by the MIT Press, Spring 2005).
Comment: 80 pages, 5 figures. Report with 2 chapters.
Empirical Bayes estimation: When does $g$-modeling beat $f$-modeling in theory (and in practice)?
Empirical Bayes (EB) is a popular framework for large-scale inference that
aims to find data-driven estimators to compete with the Bayesian oracle that
knows the true prior. Two principled approaches to EB estimation have emerged
over the years: $f$-modeling, which constructs an approximate Bayes rule by
estimating the marginal distribution of the data, and $g$-modeling, which
estimates the prior from data and then applies the learned Bayes rule. For the
Poisson model, the prototypical examples are the celebrated Robbins estimator
and the nonparametric MLE (NPMLE), respectively. It has long been recognized in
practice that the Robbins estimator, while being conceptually appealing and
computationally simple, lacks robustness and can be easily derailed by
"outliers" (data points that were rarely observed before), unlike the NPMLE
which provides more stable and interpretable fit thanks to its Bayes form. On
the other hand, not only do the existing theories shed little light on this
phenomenon, but they all point to the opposite, as both methods have recently
been shown optimal in terms of the regret (excess over the Bayes risk)
for compactly supported and subexponential priors with exact logarithmic
factors.
In this paper we provide a theoretical justification for the superiority of
NPMLE over Robbins for heavy-tailed data by considering priors with bounded
$p$th moment previously studied for the Gaussian model. For the Poisson model
with sample size $n$, assuming $p>1$ (for otherwise triviality arises), we show
that the NPMLE with appropriate regularization and truncation achieves a total
regret $\tilde{\Theta}(n^{\frac{3}{2p+1}})$, which is minimax optimal within
logarithmic factors. In contrast, the total regret of the Robbins estimator (with
similar truncation) is $\tilde{\Theta}(n^{\frac{3}{p+2}})$ and hence suboptimal
by a polynomial factor.
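For readers unfamiliar with the Robbins estimator mentioned in this abstract, the following is a minimal sketch of its classical form for the Poisson model: an observation y is mapped to (y + 1) N(y + 1) / N(y), where N(k) is the number of observations equal to k. The function name is hypothetical, and the truncation and regularization analyzed in the paper are deliberately left out.

```python
from collections import Counter


def robbins_estimates(observations):
    """Map each observed value y to the Robbins estimate of its Poisson mean.

    Uses the classical formula (y + 1) * N(y + 1) / N(y), where N(k) counts
    how many observations equal k. No truncation or smoothing is applied.
    """
    counts = Counter(observations)
    return {y: (y + 1) * counts.get(y + 1, 0) / counts[y] for y in counts}
```

On the toy sample `[0, 0, 1, 1, 1, 2, 7]` the isolated value 7 receives the estimate 0, because no observation equals 8. This is the kind of fragility to rarely seen data points that the abstract describes.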
MDL, Penalized Likelihood, and Statistical Risk
We determine, for both countable and uncountable collections of functions, information-theoretic conditions on a penalty pen(f) such that the optimizer $\hat{f}$ of the penalized log likelihood criterion log 1/likelihood(f) + pen(f) has risk not more than the index of resolvability corresponding to the accuracy of the optimizer of the expected value of the criterion. If F is the linear span of a dictionary of functions, traditional description-length penalties are based on the number of non-zero terms (the $\ell_0$ norm of the coefficients). We specialize our general conclusions to show that the $\ell_1$ norm of the coefficients times a suitable multiplier λ is also an information-theoretically valid penalty.
Learning Locally Minimax Optimal Bayesian Networks
We consider the problem of learning Bayesian network models in a non-informative setting, where the only available information is a set of observational data, and no background knowledge is available. The problem can be divided into two different subtasks: learning the structure of the network (a set of independence relations), and learning the parameters of the model (that fix the probability distribution from the set of all distributions consistent with the chosen structure). There are not many theoretical frameworks that consistently handle both these problems together, the Bayesian framework being an exception. In this paper we propose an alternative, information-theoretic framework which sidesteps some of the technical problems facing the Bayesian approach. The framework is based on the minimax-optimal Normalized Maximum Likelihood (NML) distribution, which is motivated by the Minimum Description Length (MDL) principle. The resulting model selection criterion is consistent, and it provides a way to construct highly predictive Bayesian network models. Our empirical tests show that the proposed method compares favorably with alternative approaches in both model selection and prediction tasks.