A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)
The concept of overfitting in model selection is explained and demonstrated
with an example. After providing some background information on information
theory and Kolmogorov complexity, we provide a short explanation of Minimum
Description Length and error minimization. We conclude with a discussion of the
typical features of overfitting in model selection.
Comment: 20 pages, Chapter 1 of The Paradox of Overfitting, Master's thesis,
Rijksuniversiteit Groningen, 200
Applying MDL to Learning Best Model Granularity
The Minimum Description Length (MDL) principle is solidly based on a provably
ideal method of inference using Kolmogorov complexity. We test how the theory
behaves in practice on a general problem in model selection: that of learning
the best model granularity. The performance of a model depends critically on
the granularity, for example the choice of precision of the parameters. Too
high a precision generally means modeling accidental noise, and too low a
precision may conflate models that should be distinguished. This
precision is often determined ad hoc. In MDL the best model is the one that
most compresses a two-part code of the data set: this embodies "Occam's
Razor." In two quite different experimental settings the theoretical value
determined using MDL coincides with the best value found experimentally. In the
first experiment the task is to recognize isolated handwritten characters in
one subject's handwriting, irrespective of size and orientation. Based on a new
modification of elastic matching, using multiple prototypes per character, the
optimal prediction rate is predicted for the learned parameter (length of
sampling interval) considered most likely by MDL, which is shown to coincide
with the best value found experimentally. In the second experiment the task is
to model a robot arm with two degrees of freedom using a three-layer
feed-forward neural network where we need to determine the number of nodes in
the hidden layer giving best modeling performance. The optimal model (the one
that extrapolates best on unseen examples) is predicted for the number of nodes
in the hidden layer considered most likely by MDL, which again is found to
coincide with the best value found experimentally.
Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, to
appear
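
To make the two-part-code trade-off concrete, here is a minimal Python sketch
(not the paper's code): it selects the precision of the parameters of a
quantized linear fit by minimizing model bits plus data bits; the [-4, 4)
coefficient grid and the idealized Gaussian residual code are illustrative
assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 200)
    y = 2.7 * x - 1.3 + rng.normal(scale=0.1, size=x.size)  # toy data: line plus noise

    def two_part_code_length(bits_per_param: int) -> float:
        # L(model): each coefficient quantized to a grid of 2**bits points on [-4, 4).
        coeffs = np.polyfit(x, y, deg=1)
        step = 8.0 / (2 ** bits_per_param)
        quantized = np.round(coeffs / step) * step
        model_bits = coeffs.size * bits_per_param
        # L(data | model): idealized Gaussian code for the residuals
        # (a differential code length, so it may go negative; fine for comparison).
        residuals = y - np.polyval(quantized, x)
        sigma2 = max(residuals.var(), 1e-12)
        data_bits = 0.5 * y.size * np.log2(2 * np.pi * np.e * sigma2)
        return model_bits + data_bits

    best = min(range(1, 33), key=two_part_code_length)
    print("MDL-optimal bits per parameter:", best)

As the precision grows, the model term rises linearly while the data term
stops improving once quantization error falls below the noise level; the
minimum sits at that crossover, which is the effect the experiments above
verify.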
Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity
The relationship between the Bayesian approach and the minimum description
length approach is established. We sharpen and clarify the general modeling
principles MDL and MML, abstracted as the ideal MDL principle and defined from
Bayes's rule by means of Kolmogorov complexity. The basic condition under which
the ideal principle should be applied is encapsulated as the Fundamental
Inequality, which in broad terms states that the principle is valid when the
data are random relative to every contemplated hypothesis, and these
hypotheses are in turn random relative to the (universal) prior. Basically, the ideal
principle states that the prior probability associated with the hypothesis
should be given by the algorithmic universal probability, and the sum of the
code length of the model (minus the log universal probability) plus the code
length of the data given the model should be minimized. If we restrict the
model class to the
finite sets then application of the ideal principle turns into Kolmogorov's
minimal sufficient statistic. In general we show that data compression is
almost always the best strategy, both in hypothesis identification and
prediction.
Comment: 35 pages, LaTeX. Submitted to IEEE Trans. Inform. Theory
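
In symbols, and writing m for the algorithmic universal probability (so that
-log m(H) = K(H) + O(1) by the coding theorem), the ideal principle selects

    \min_{H} \big[ -\log \mathbf{m}(H) - \log \Pr(D \mid H) \big]
      = \min_{H} \big[ K(H) - \log \Pr(D \mid H) \big] + O(1),

the two-part trade-off between hypothesis complexity and goodness of fit.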
We Are Not Your Real Parents: Telling Causal from Confounded using MDL
Given data over variables (X_1, ..., X_m, Y) we consider the problem of
finding out whether X jointly causes Y or whether they are all confounded by
an unobserved latent variable Z. To do so, we take an information-theoretic
approach based on Kolmogorov complexity. In a nutshell, we follow the
postulate that first encoding the true cause, and then the effects given that
cause, results in a shorter description than any other encoding of the
observed variables. The ideal score is not computable, and hence we have to
approximate it. We propose to do so using the Minimum Description Length
(MDL) principle. We compare the MDL scores under the models where X causes Y
and where there exists a latent variable Z confounding both X and Y, and show
our scores are consistent. To find potential confounders we propose using
latent factor modeling, in particular probabilistic PCA (PPCA). Empirical
evaluation on both synthetic and real-world data shows that our method, CoCa,
performs very well, even when the true generating process of the data is far
from the assumptions made by the models we use. Moreover, it is robust, as
its accuracy goes hand in hand with its confidence.
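
A schematic Python sketch of this comparison (not the authors' CoCa
implementation): it scores the causal model by a regression code and the
confounded model by a probabilistic PCA fit, each with a BIC-style parameter
penalty standing in for the paper's MDL model cost; the data, penalty form,
and helper names are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    n = 1000
    z = rng.normal(size=n)                               # hidden confounder
    X = (z + 0.3 * rng.normal(size=n)).reshape(-1, 1)    # observed candidate cause
    y = z + 0.3 * rng.normal(size=n)                     # observed effect

    def gaussian_bits(res: np.ndarray) -> float:
        # Idealized code length (in nats) for residuals under a Gaussian model.
        s2 = max(res.var(), 1e-12)
        return 0.5 * res.size * np.log(2 * np.pi * np.e * s2)

    # Model 1: X causes y. Cost = code for X, plus code for y given X,
    # plus a BIC-style parameter penalty (an illustrative choice).
    reg = LinearRegression().fit(X, y)
    causal = (gaussian_bits(X - X.mean()) + gaussian_bits(y - reg.predict(X))
              + 0.5 * (X.shape[1] + 1) * np.log(n))

    # Model 2: one latent factor confounds X and y jointly (probabilistic PCA);
    # sklearn's PCA.score returns the average PPCA log-likelihood per sample.
    D = np.c_[X, y]
    confounded = (-PCA(n_components=1).fit(D).score(D) * n
                  + 0.5 * (2 * D.shape[1] + 1) * np.log(n))

    print("verdict:", "confounded" if confounded < causal else "X causes y")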
Kolmogorov's Structure Functions and Model Selection
In 1974 Kolmogorov proposed a non-probabilistic approach to statistics and
model selection. Let data be finite binary strings and models be finite sets of
binary strings. Consider model classes consisting of models of given maximal
(Kolmogorov) complexity. The "structure function" of the given data expresses
the relation between the complexity level constraint on a model class and the
least log-cardinality of a model in the class containing the data. We show that
the structure function determines all stochastic properties of the data: for
every constrained model class it determines the individual best-fitting model
in the class irrespective of whether the "true" model is in the model class
considered or not. In this setting, this happens with certainty, rather
than with high probability as in the classical case. We precisely quantify
the goodness-of-fit of an individual model with respect to individual data. We
show that, within the obvious constraints, every graph is realized by the
structure function of some data. We determine the (un)computability properties
of the various functions contemplated and of the "algorithmic minimal
sufficient statistic."
Comment: 25 pages, LaTeX, 5 figures. In part in Proc. 47th IEEE FOCS; this
final version (more explanations, cosmetic modifications) to appear in IEEE
Trans. Inform. Theory
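
For reference, the structure function of data x over finite-set models is

    h_x(\alpha) = \min_{S} \{ \log_2 |S| : x \in S,\; K(S) \le \alpha \},

where S ranges over finite sets of binary strings; a witness of this minimum
is the best-fitting model at complexity level \alpha.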
Facticity as the amount of self-descriptive information in a data set
Using the theory of Kolmogorov complexity, the notion of facticity φ(x) of a
string is defined as the amount of self-descriptive information it contains.
It is proved that (under reasonable assumptions: the existence of an empty
machine and the availability of a faithful index) facticity is definite,
i.e. random strings have facticity 0 and compressible strings satisfy
0 < φ(x) < |x|/2 + O(1). Consequently facticity objectively measures the
tension in a data set between structural and ad-hoc information. For binary
strings there is a so-called facticity threshold that depends on their
entropy. Strings with facticity above this threshold have no optimal
stochastic model and are essentially computational. The shape of the
facticity-versus-entropy plot coincides with the well-known sawtooth curves
observed in complex systems. The notion of factic processes is discussed.
This approach overcomes problems with earlier proposals that use two-part
codes to define the meaningfulness or usefulness of a data set.
Comment: 10 pages, 2 figures
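
One schematic way to picture this definition (an interpretive gloss based on
the two-part-code reading above, not necessarily the paper's exact formula)
is as the structural half of an optimal two-part code:

    \phi(x) \approx K(S^{*}), \qquad
    S^{*} \in \arg\min_{S \ni x} \big[ K(S) + \log_2 |S| \big],

so that for a random string the trivial model suffices and φ(x) = 0,
consistent with the bounds quoted above.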
Shannon Information and Kolmogorov Complexity
We compare the elementary theories of Shannon information and Kolmogorov
complexity, the extent to which they have a common purpose, and where they are
fundamentally different. We discuss and relate the basic notions of both
theories: Shannon entropy versus Kolmogorov complexity, the relation of both to
universal coding, Shannon mutual information versus Kolmogorov ('algorithmic')
mutual information, probabilistic sufficient statistic versus algorithmic
sufficient statistic (related to lossy compression in the Shannon theory versus
meaningful information in the Kolmogorov theory), and rate distortion theory
versus Kolmogorov's structure function. Part of the material has appeared in
print before, scattered through various publications, but this is the first
comprehensive systematic comparison. The last-mentioned relations are new.
Comment: Survey, LaTeX, 54 pages, 3 figures. Submitted to IEEE Trans.
Information Theory
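
The tightest bridge between the two theories, developed in the survey, is
that expected complexity equals entropy up to the complexity of the
distribution itself: for a computable probability mass function P,

    H(P) \le \sum_{x} P(x)\, K(x) \le H(P) + K(P) + O(1).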
VoG: Summarizing and Understanding Large Graphs
How can we succinctly describe a million-node graph with a few simple
sentences? How can we measure the "importance" of a set of discovered subgraphs
in a large graph? These are exactly the problems we focus on. Our main ideas
are to construct a "vocabulary" of subgraph-types that often occur in real
graphs (e.g., stars, cliques, chains), and from a set of subgraphs, find the
most succinct description of a graph in terms of this vocabulary. We measure
success in a well-founded way by means of the Minimum Description Length (MDL)
principle: a subgraph is included in the summary if it decreases the total
description length of the graph.
Our contributions are three-fold: (a) formulation: we provide a principled
encoding scheme to choose vocabulary subgraphs; (b) algorithm: we develop
VoG, an efficient method to minimize the description cost, and (c)
applicability: we report experimental results on multi-million-edge real
graphs, including Flickr and the Notre Dame web graph.
Comment: SIAM International Conference on Data Mining (SDM) 201
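
A schematic rendering of that MDL criterion in Python (not the authors'
encoding): a candidate subgraph enters the summary only if it lowers the
total description length, where the error term codes the edges the summary
gets wrong; the cost functions and greedy ordering are simplified stand-ins
for the paper's scheme.

    import math

    def bits_for_edges(num_wrong: int, num_possible: int) -> float:
        # Idealized cost of pointing out the erroneous cells: log2 C(possible, wrong).
        return math.log2(math.comb(num_possible, num_wrong)) if num_wrong else 0.0

    def greedy_summary(candidates, graph_edges, num_possible):
        # candidates: list of (structure_bits, edge_set) pairs from the vocabulary.
        summary, covered = [], set()
        cost = bits_for_edges(len(graph_edges), num_possible)  # empty-summary baseline
        for structure_bits, edges in sorted(candidates, key=lambda c: -len(c[1])):
            new_covered = covered | edges
            wrong = len(graph_edges ^ new_covered)  # edges missed or overclaimed
            new_cost = (sum(s for s, _ in summary) + structure_bits
                        + bits_for_edges(wrong, num_possible))
            if new_cost < cost:  # MDL criterion: keep the structure only if it compresses
                summary.append((structure_bits, edges))
                covered, cost = new_covered, new_cost
        return summary, cost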
Algorithmic complexity for psychology: A user-friendly implementation of the coding theorem method
Kolmogorov-Chaitin complexity has long been believed to be impossible to
approximate when it comes to short sequences (e.g. of length 5-50). However,
with the newly developed \emph{coding theorem method} the complexity of strings
of length 2-11 can now be numerically estimated. We present the theoretical
basis of algorithmic complexity for short strings (ACSS) and describe an
R-package providing functions based on ACSS that will cover psychologists'
needs and improve upon previous methods in three ways: (1) ACSS is now
available not only for binary strings, but for strings based on up to 9
different symbols, (2) ACSS no longer requires time-consuming computing, and
(3) a new approach based on ACSS gives access to an estimation of the
complexity of strings of any length. Finally, three illustrative examples show
how these tools can be applied to psychology.
Comment: to appear in "Behavioral Research Methods", 14 pages in journal
format, R package at http://cran.r-project.org/web/packages/acss/index.htm
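
The estimates rest on Levin's coding theorem, which ties complexity to the
algorithmic probability that a universal prefix machine U outputs s:

    K(s) = -\log_2 \mathbf{m}(s) + O(1), \qquad
    \mathbf{m}(s) = \sum_{p : U(p) = s} 2^{-|p|}.

The coding theorem method approximates m(s) by the output frequency D(s) of s
over a large enumeration of small Turing machines, giving the numerical
estimate K(s) ≈ -log_2 D(s).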