A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)
The concept of overfitting in model selection is explained and demonstrated
with an example. After providing some background information on information
theory and Kolmogorov complexity, we provide a short explanation of Minimum
Description Length and error minimization. We conclude with a discussion of the
typical features of overfitting in model selection.
Comment: 20 pages, Chapter 1 of The Paradox of Overfitting, Master's thesis,
Rijksuniversiteit Groningen, 200
Kolmogorov's Structure Functions and Model Selection
In 1974 Kolmogorov proposed a non-probabilistic approach to statistics and
model selection. Let data be finite binary strings and models be finite sets of
binary strings. Consider model classes consisting of models of given maximal
(Kolmogorov) complexity. The "structure function" of the given data expresses
the relation between the complexity level constraint on a model class and the
least log-cardinality of a model in the class containing the data. We show that
the structure function determines all stochastic properties of the data: for
every constrained model class it determines the individual best-fitting model
in the class irrespective of whether the "true" model is in the model class
considered or not. In this setting, this happens with certainty, rather
than with high probability as in the classical case. We precisely quantify
the goodness-of-fit of an individual model with respect to individual data. We
show that, within the obvious constraints, every graph is realized by the
structure function of some data. We determine the (un)computability properties
of the various functions contemplated and of the "algorithmic minimal
sufficient statistic."
Comment: 25 pages LaTeX, 5 figures. In part in Proc 47th IEEE FOCS; this final
version (more explanations, cosmetic modifications) to appear in IEEE Trans
Inform T
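For reference, the structure function described in this abstract is commonly written as below. The symbols are notation assumed here rather than taken from the abstract: x is the data string, S ranges over finite sets of binary strings containing x, K(S) is the Kolmogorov complexity of S, and alpha is the complexity bound on the model class.

```latex
h_x(\alpha) \;=\; \min_{S} \bigl\{ \log_2 |S| \;:\; x \in S,\; K(S) \le \alpha \bigr\}
```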
A model-free feature selection technique of feature screening and random forest based recursive feature elimination
In this paper, we propose a model-free feature selection method for
ultra-high dimensional data with a very large number of features. It is a
two-phase procedure that combines the fused Kolmogorov filter with random
forest based RFE to remove model limitations and reduce computational
complexity. The method is fully nonparametric and can work with various types
of datasets. It has several appealing characteristics, namely accuracy,
freedom from model assumptions, and computational efficiency, and can be
widely used in practical problems, such as
multiclass classification, nonparametric regression, and Poisson regression,
among others. We show that the proposed method is selection consistent and
consistent under weak regularity conditions. We further demonstrate the
superior performance of the proposed method over other existing methods by
simulations and real data examples.
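A minimal sketch of the screening phase only, assuming a binary response and the plain (unfused) Kolmogorov filter: each feature is scored by the two-sample Kolmogorov-Smirnov distance between its class-conditional empirical distributions, and the highest-scoring features are kept. The function names and toy data are hypothetical. The fused filter and the random forest RFE phase described in the abstract are not reproduced here.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted its CDF is 1 and the gap can only shrink.
    return gap


def screen(columns, labels, keep):
    """Rank features (given as columns) by KS distance between classes 0 and 1."""
    scores = []
    for idx, col in enumerate(columns):
        group0 = [v for v, y in zip(col, labels) if y == 0]
        group1 = [v for v, y in zip(col, labels) if y == 1]
        scores.append((ks_statistic(group0, group1), idx))
    # Highest score first; ties broken by feature index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:keep]]


if __name__ == "__main__":
    # Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
    columns = [
        [0.1, 0.2, 0.3, 0.9, 1.0, 1.1],
        [0.5, 0.1, 0.9, 0.4, 0.2, 0.8],
    ]
    labels = [0, 0, 0, 1, 1, 1]
    print(screen(columns, labels, keep=1))
```

The features that survive this step would then be passed to a wrapper method such as recursive feature elimination, which is where the random forest enters in the two-phase procedure.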
Applying MDL to Learning Best Model Granularity
The Minimum Description Length (MDL) principle is solidly based on a provably
ideal method of inference using Kolmogorov complexity. We test how the theory
behaves in practice on a general problem in model selection: that of learning
the best model granularity. The performance of a model depends critically on
the granularity, for example the choice of precision of the parameters. Too
high precision generally involves modeling of accidental noise and too low
precision may lead to confusion of models that should be distinguished. This
precision is often determined ad hoc. In MDL the best model is the one that
most compresses a two-part code of the data set: this embodies "Occam's
Razor." In two quite different experimental settings the theoretical value
determined using MDL coincides with the best value found experimentally. In the
first experiment the task is to recognize isolated handwritten characters in
one subject's handwriting, irrespective of size and orientation. Based on a new
modification of elastic matching, using multiple prototypes per character, the
optimal prediction rate is predicted for the learned parameter (length of
sampling interval) considered most likely by MDL, which is shown to coincide
with the best value found experimentally. In the second experiment the task is
to model a robot arm with two degrees of freedom using a three-layer
feed-forward neural network where we need to determine the number of nodes in
the hidden layer giving best modeling performance. The optimal model (the one
that extrapolates best on unseen examples) is predicted for the number of nodes
in the hidden layer considered most likely by MDL, which again is found to
coincide with the best value found experimentally.
Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, To
appea
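The trade-off between parameter precision and fit described in this abstract can be illustrated with a toy two-part code. This is a sketch under stated assumptions, not the paper's experiments: the data are a binary sequence, the model is a Bernoulli parameter theta stored with a chosen number of bits, and the cost of stating the precision itself is ignored. All names are hypothetical.

```python
import math


def data_bits(ones, zeros, theta):
    """Code length in bits of the data under a Bernoulli(theta) model."""
    return -(ones * math.log2(theta) + zeros * math.log2(1.0 - theta))


def two_part_length(ones, zeros, precision_bits):
    """Return (total_bits, theta) for the best theta on a grid of 2**precision_bits cells."""
    cells = 2 ** precision_bits
    # Cell midpoints are used so that theta is never exactly 0 or 1.
    candidates = [(j + 0.5) / cells for j in range(cells)]
    theta = min(candidates, key=lambda t: data_bits(ones, zeros, t))
    # Part one: bits to name the grid cell. Part two: bits for the data given theta.
    return precision_bits + data_bits(ones, zeros, theta), theta


def best_precision(ones, zeros, max_bits=12):
    """Pick the precision that minimizes the total two-part code length."""
    lengths = {d: two_part_length(ones, zeros, d)[0] for d in range(max_bits + 1)}
    return min(lengths, key=lengths.get), lengths


if __name__ == "__main__":
    d, lengths = best_precision(ones=30, zeros=70)
    print("chosen precision (bits):", d)
    for bits in range(5):
        print(bits, round(lengths[bits], 2))
```

With more precision the data part shrinks toward the empirical entropy, but every extra bit of theta is paid for in the model part, so the minimum sits at a coarse precision for short sequences and moves to finer precision as the sample grows.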
An approach for selecting cost estimation techniques for innovative high value manufacturing products
This paper presents an approach for determining the most appropriate technique for cost estimation of innovative high value manufacturing products depending on the amount of prior data available. Case study data from the United States Scheduled Annual Summary Reports for the Joint Strike Fighter (1997-2010) is used to exemplify how, depending on the attributes of a priori data, certain techniques for cost estimation are more suitable than others. The data attribute focused on is the computational complexity involved in identifying whether or not there are patterns suited for propagation. Computational complexity is calculated based upon established mathematical principles for pattern recognition which argue that at least 42 data sets are required for the application of standard regression analysis techniques. The paper proposes that below this threshold a generic dependency model and starting conditions should be used and iteratively adapted to the context. In the special case of having fewer than four data sets available it is suggested that no contemporary cost estimating techniques other than analogy or expert opinion are currently applicable and alternative techniques must be explored if more quantitative results are desired. By applying the mathematical principles of complexity groups the paper argues that when fewer than four consecutive data sets are available the principles of topological data analysis should be applied. The precondition is that the cost variance of at least three cost variance types for one to three time discrete continuous intervals is available so that it can be quantified based upon its geometrical attributes, visualised as an n-dimensional point cloud and then evaluated based upon the symmetrical properties of the evolving shape. Further work is suggested to validate the provided decision trees in cost estimation practice.