Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity
The relationship between the Bayesian approach and the minimum description
length approach is established. We sharpen and clarify the general modeling
principles MDL and MML, abstracted as the ideal MDL principle and defined from
Bayes's rule by means of Kolmogorov complexity. The basic condition under which
the ideal principle should be applied is encapsulated as the Fundamental
Inequality, which in broad terms states that the principle is valid when the
data are random, relative to every contemplated hypothesis and also these
hypotheses are random relative to the (universal) prior. Basically, the ideal
principle states that the prior probability associated with the hypothesis
should be given by the algorithmic universal probability, and the sum of the
log universal probability of the model plus the log of the probability of the
data given the model should be minimized. If we restrict the model class to the
finite sets then application of the ideal principle turns into Kolmogorov's
minimal sufficient statistic. In general we show that data compression is
almost always the best strategy, both in hypothesis identification and
prediction.
Comment: 35 pages, LaTeX. Submitted IEEE Trans. Inform. Theor.
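In symbols, the principle described in this abstract can be sketched as follows. The notation is an assumption, since the abstract fixes no symbols: H is a hypothesis, D the data, P the prior, \mathbf{m} the universal distribution, and K prefix Kolmogorov complexity. The logarithms are read as code lengths, that is, as negative log probabilities, so that maximizing the posterior in Bayes's rule becomes minimizing a sum of two code lengths.

```latex
\begin{align*}
\Pr(H \mid D) &= \frac{\Pr(D \mid H)\, P(H)}{\Pr(D)} \\
H_{\mathrm{MDL}} &= \arg\min_{H} \bigl\{ -\log \mathbf{m}(H) - \log \Pr(D \mid H) \bigr\} \\
-\log \mathbf{m}(H) &= K(H) + O(1)
\end{align*}
```

The last line is the standard coding-theorem relation between universal probability and prefix complexity; it is what lets the first term be read as the description length of the model.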
Absolutely No Free Lunches!
This paper is concerned with learners who aim to learn patterns in infinite
binary sequences: shown longer and longer initial segments of a binary
sequence, they either attempt to predict whether the next bit will be a 0 or
will be a 1 or they issue forecast probabilities for these events. Several
variants of this problem are considered. In each case, a no-free-lunch result
of the following form is established: the problem of learning is a formidably
difficult one, in that no matter what method is pursued, failure is
incomparably more common than success; and difficult choices must be faced in
choosing a method of learning, since no approach dominates all others in its
range of success. In the simplest case, the comparison of the set of situations
in which a method fails and the set of situations in which it succeeds is a
matter of cardinality (countable vs. uncountable); in other cases, it is a
topological matter (meagre vs. co-meagre) or a hybrid computational-topological
matter (effectively meagre vs. effectively co-meagre).
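A minimal sketch of why no deterministic next-bit predictor can succeed everywhere: for any such predictor, one can build a sequence on which it errs at every step. This is the familiar diagonal construction, not necessarily the argument used in the paper, and the names `adversarial_sequence`, `count_errors`, and `majority_predictor` are hypothetical.

```python
from typing import Callable, List

# A predictor maps the bits seen so far to a guess (0 or 1) for the next bit.
Predictor = Callable[[List[int]], int]


def adversarial_sequence(predict: Predictor, length: int) -> List[int]:
    """Build a binary sequence on which `predict` is wrong at every step."""
    seq: List[int] = []
    for _ in range(length):
        guess = predict(seq)
        seq.append(1 - guess)  # always emit the opposite of the guess
    return seq


def count_errors(predict: Predictor, seq: List[int]) -> int:
    """Count positions where the prediction from the prefix misses the bit."""
    return sum(predict(seq[:i]) != seq[i] for i in range(len(seq)))


def majority_predictor(prefix: List[int]) -> int:
    """Illustrative learner: guess the most frequent bit so far (ties give 0)."""
    return 1 if 2 * sum(prefix) > len(prefix) else 0


if __name__ == "__main__":
    seq = adversarial_sequence(majority_predictor, 50)
    print(count_errors(majority_predictor, seq), "errors out of", len(seq))
```

The construction only assumes that the predictor is a deterministic function of the prefix; any other such learner can be passed in place of `majority_predictor`.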
Learning Recursive Functions Refutably
Learning of recursive functions refutably means that for every recursive function, the learning machine has either to learn this function or to refute it, i.e., to signal that it is not able to learn it. Three modi of making precise the notion of refuting are considered. We show that the corresponding types of learning refutably are of strictly increasing power, where already the most stringent of them turns out to be of remarkable topological and algorithmical richness. All these types are closed under union, though in different strengths. Also, these types are shown to be different with respect to their intrinsic complexity; two of them do not contain function classes that are “most difficult” to learn, while the third one does. Moreover, we present characterizations for these types of learning refutably. Some of these characterizations make clear where the refuting ability of the corresponding learning machines comes from and how it can be realized, in general.
For learning with anomalies refutably, we show that several results from standard learning without refutation stand refutably. Then we derive hierarchies for refutable learning. Finally, we show that stricter refutability constraints cannot be traded for more liberal learning criteria.
On Learning of Functions Refutably
Learning of recursive functions refutably informally means that for every recursive function, the learning machine has either to learn this function or to refute it, that is to signal that it is not able to learn it. Three modi of making precise the notion of refuting are considered. We show that the corresponding types of learning refutably are of strictly increasing power, where already the most stringent of them turns out to be of remarkable topological and algorithmical richness. Furthermore, all these types are closed under union, though in different strengths. Also, these types are shown to be different with respect to their intrinsic complexity; two of them do not contain function classes that are “most difficult” to learn, while the third one does. Moreover, we present several characterizations for these types of learning refutably. Some of these characterizations make clear where the refuting ability of the corresponding learning machines comes from and how it can be realized, in general.
For learning with anomalies refutably, we show that several results from standard learning without refutation stand refutably. From this we derive some hierarchies for refutable learning. Finally, we prove that in general one cannot trade stricter refutability constraints for more liberal learning criteria.
Inductive reasoning and Kolmogorov complexity
Reasoning to obtain the “truth” about reality, from external data, is an important, controversial, and complicated issue in man's effort to understand nature. (Yet, today, we try to make machines do this.) There have been old useful principles, new exciting models, and intricate theories scattered in vastly different areas including philosophy of science, statistics, computer science, and psychology. We focus on inductive reasoning in correspondence with ideas of R. J. Solomonoff. While his proposals result in perfect procedures, they involve the noncomputable notion of Kolmogorov complexity. In this paper we develop the thesis that Solomonoff's method is fundamental in the sense that many other induction principles can be viewed as particular ways to obtain computable approximations to it. We demonstrate this explicitly in the cases of Gold's paradigm for inductive inference, Rissanen's minimum description length (MDL) principle, Fisher's maximum likelihood principle, and Jaynes' maximum entropy principle. We present several new theorems and derivations to this effect. We also delimit what can be learned and what cannot be learned in terms of Kolmogorov complexity, and we describe an experiment in machine learning of handwritten characters. We also give an application of Kolmogorov complexity in Valiant-style learning, where we want to learn a concept probably approximately correct in feasible time and examples.
Applying MDL to Learning Best Model Granularity
The Minimum Description Length (MDL) principle is solidly based on a provably
ideal method of inference using Kolmogorov complexity. We test how the theory
behaves in practice on a general problem in model selection: that of learning
the best model granularity. The performance of a model depends critically on
the granularity, for example the choice of precision of the parameters. Too
high precision generally involves modeling of accidental noise and too low
precision may lead to confusion of models that should be distinguished. This
precision is often determined ad hoc. In MDL the best model is the one that
most compresses a two-part code of the data set: this embodies "Occam's
Razor." In two quite different experimental settings the theoretical value
determined using MDL coincides with the best value found experimentally. In the
first experiment the task is to recognize isolated handwritten characters in
one subject's handwriting, irrespective of size and orientation. Based on a new
modification of elastic matching, using multiple prototypes per character, the
optimal prediction rate is predicted for the learned parameter (length of
sampling interval) considered most likely by MDL, which is shown to coincide
with the best value found experimentally. In the second experiment the task is
to model a robot arm with two degrees of freedom using a three layer
feed-forward neural network where we need to determine the number of nodes in
the hidden layer giving best modeling performance. The optimal model (the one
that extrapolates best on unseen examples) is predicted for the number of nodes
in the hidden layer considered most likely by MDL, which again is found to
coincide with the best value found experimentally.
Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, to appear.
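The trade-off between parameter precision and fit that this abstract describes can be illustrated with a toy two-part code. The sketch below is not the paper's experiment: it assumes a Bernoulli model whose single parameter is stored with `bits` bits of precision, and it picks the precision that minimizes model bits plus data bits. The function names and the midpoint quantization grid are assumptions made for illustration.

```python
import math


def two_part_code_length(ones: int, zeros: int, bits: int) -> float:
    """Total code length in bits: parameter cost plus data cost.

    The parameter is the maximum-likelihood estimate rounded to a grid of
    2**bits midpoints, so the rounded value is never exactly 0 or 1.
    """
    n = ones + zeros
    levels = 2 ** bits
    p_hat = ones / n
    idx = min(int(p_hat * levels), levels - 1)  # grid cell containing p_hat
    p = (idx + 0.5) / levels                    # midpoint of that cell
    data_bits = -(ones * math.log2(p) + zeros * math.log2(1 - p))
    return bits + data_bits


def best_precision(ones: int, zeros: int, max_bits: int = 16) -> int:
    """Return the precision (in bits) with the shortest two-part code."""
    return min(range(max_bits + 1),
               key=lambda b: two_part_code_length(ones, zeros, b))


if __name__ == "__main__":
    for ones, zeros in [(50, 50), (900, 100)]:
        b = best_precision(ones, zeros)
        print(ones, zeros, "-> precision", b, "bits,",
              round(two_part_code_length(ones, zeros, b), 2), "total bits")
```

With balanced data the zero-bit model (p = 0.5) already codes the data optimally, so extra precision only adds cost; with skewed data a few bits of precision pay for themselves, while many more bits would only encode noise in the estimate.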
On the Impact of Forgetting on Learning Machines
People tend not to have perfect memories when it comes to learning, or to anything else for that matter. Most formal studies of learning, however, assume a perfect memory. Some approaches have restricted the number of items that could be retained. We introduce a complexity theoretic accounting of memory utilization by learning machines. In our new model, memory is measured in bits as a function of the size of the input. There is a hierarchy of learnability based on increasing memory allotment. The lower bound results are proved using an unusual combination of pumping and mutual recursion theorem arguments. For technical reasons, it was necessary to consider two types of memory: long and short term.
Causal inference using the algorithmic Markov condition
Inferring the causal structure that links n observables is usually based upon
detecting statistical dependences and choosing simple graphs that make the
joint measure Markovian. Here we argue why causal inference is also possible
when only single observations are present.
We develop a theory of how to generate causal graphs explaining similarities
between single objects. To this end, we replace the notion of conditional
stochastic independence in the causal Markov condition with the vanishing of
conditional algorithmic mutual information and describe the corresponding
causal inference rules.
We explain why a consistent reformulation of causal inference in terms of
algorithmic complexity implies a new inference principle that also takes into
account the complexity of conditional probability densities, making it
possible to select among Markov equivalent causal graphs. This insight provides
a theoretical foundation of a heuristic principle proposed in earlier work.
We also discuss how to replace Kolmogorov complexity with decidable
complexity criteria. This can be seen as an algorithmic analog of replacing the
empirically undecidable question of statistical independence with practical
independence tests that are based on implicit or explicit assumptions on the
underlying distribution.
Comment: 16 figures
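One common way to make Kolmogorov complexity computable in practice is to substitute the output length of a real compressor. The sketch below uses that substitution to approximate algorithmic mutual information as C(x) + C(y) - C(xy). It is an illustration under that assumption, not the complexity criterion proposed in the paper; the function names are hypothetical and `zlib` is chosen only because it ships with Python.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Compressed size in bytes: a computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


def approx_mutual_information(x: bytes, y: bytes) -> int:
    """C(x) + C(y) - C(xy), a compression-based analog of I(x : y)."""
    return compressed_length(x) + compressed_length(y) - compressed_length(x + y)


# Fixed seed so the example is reproducible.
rng = random.Random(0)
x = bytes(rng.getrandbits(8) for _ in range(2000))
z = bytes(rng.getrandbits(8) for _ in range(2000))                # unrelated to x
y = x[:1000] + bytes(rng.getrandbits(8) for _ in range(1000))     # shares half of x

if __name__ == "__main__":
    print("I(x : y) ~", approx_mutual_information(x, y))
    print("I(x : z) ~", approx_mutual_information(x, z))
```

Objects that share structure score clearly higher than unrelated ones, which is the kind of dependence signal a causal inference rule based on vanishing mutual information would test for. The estimate inherits the compressor's limits, such as zlib's 32 KB window, so it is only meaningful for inputs the compressor can relate.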