The Minimum Description Length (MDL) principle selects the model that has the
shortest code for data plus model. We show that for a countable class of
models, MDL predictions are close to the true distribution in a strong sense.
The result is completely general. No independence, ergodicity, stationarity,
identifiability, or other assumption on the model class need to be made. More
formally, we show that for any countable class of models, the distributions
selected by MDL (or MAP) asymptotically predict (merge with) the true measure
in the class in total variation distance. Implications for non-i.i.d. domains
like time-series forecasting, discriminative learning, and reinforcement
learning are discussed.Comment: 15 LaTeX page