40,545 research outputs found

A Short Introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length (MDL)

Author: Nannen Volker
Publication date: 01/01/2010

The concept of overfitting in model selection is explained and demonstrated with an example. After some background on information theory and Kolmogorov complexity, we give a short explanation of Minimum Description Length and error minimization. We conclude with a discussion of the typical features of overfitting in model selection.

Comment: 20 pages, Chapter 1 of The Paradox of Overfitting, Master's thesis, Rijksuniversiteit Groningen, 200

arXiv.org e-Print Archive

University of Groningen Digital Archive

CERN Document Server

Kolmogorov's Structure Functions and Model Selection

Author: Vereshchagin Nikolai
Vitanyi Paul
Publication date: 01/01/2004

In 1974 Kolmogorov proposed a non-probabilistic approach to statistics and model selection. Let data be finite binary strings and models be finite sets of binary strings. Consider model classes consisting of models of given maximal (Kolmogorov) complexity. The "structure function" of the given data expresses the relation between the complexity level constraint on a model class and the least log-cardinality of a model in the class containing the data. We show that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best-fitting model in the class, irrespective of whether the "true" model is in the model class considered or not. In this setting, this happens with certainty, rather than with high probability as in the classical case. We precisely quantify the goodness-of-fit of an individual model with respect to individual data. We show that, within the obvious constraints, every graph is realized by the structure function of some data. We determine the (un)computability properties of the various functions contemplated and of the "algorithmic minimal sufficient statistic."

Comment: 25 pages LaTeX, 5 figures. In part in Proc 47th IEEE FOCS; this final version (more explanations, cosmetic modifications) to appear in IEEE Trans Inform T
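The structure function described in the abstract above is commonly written as follows. This is a sketch in standard notation; the symbols $h_x$, $\alpha$, $S$, and $K$ do not appear in the abstract and are assumed here: $x$ is the data string, $S$ ranges over finite sets of strings (the models), $K(S)$ is the Kolmogorov complexity of $S$, and $\alpha$ is the complexity bound on the model class.

```latex
h_x(\alpha) \;=\; \min_{S} \bigl\{ \log |S| \;:\; x \in S,\ K(S) \le \alpha \bigr\}
```

Read this way, $h_x(\alpha)$ is the least log-cardinality of a model of complexity at most $\alpha$ that contains the data, which matches the verbal definition in the abstract.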

arXiv.org e-Print Archive

International Migration, Integration and Social Cohesion online publications

A model-free feature selection technique of feature screening and random forest based recursive feature elimination

Author: Xia Siwei
Yang Yuehan
Publication date: 14/02/2023

In this paper, we propose a model-free feature selection method for ultra-high dimensional data with a massive number of features. It is a two-phase procedure that combines the fused Kolmogorov filter with random forest based recursive feature elimination (RFE) to remove model limitations and reduce the computational complexity. The method is fully nonparametric and can work with various types of datasets. It has several appealing characteristics, namely accuracy, freedom from model assumptions, and computational efficiency, and can be widely used in practical problems such as multiclass classification, nonparametric regression, and Poisson regression, among others. We show that the proposed method is selection consistent and $L_2$ consistent under weak regularity conditions. We further demonstrate the superior performance of the proposed method over other existing methods by simulations and real data examples.
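To make the screening phase concrete, here is a minimal sketch of Kolmogorov-style feature screening for a binary label. It is not the authors' fused filter and it omits the random forest RFE phase entirely: it only ranks features by the two-sample Kolmogorov-Smirnov distance between class-conditional samples. The function names `ks_statistic` and `kolmogorov_screen` and the toy data are hypothetical.

```python
from bisect import bisect_right
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: sup_x |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The supremum of the gap between two empirical CDFs is attained at a sample point.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(f_a - f_b))
    return gap


def kolmogorov_screen(X, y, keep):
    """Rank features by the KS distance between the label-1 and label-0 samples."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(ks_statistic(pos, neg))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Toy data (hypothetical): feature 0 shifts with the label, features 1-4 are pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [
    [rng.gauss(2.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
    for label in y
]

selected, scores = kolmogorov_screen(X, y, keep=2)
print("selected features:", selected)
print("KS scores:", [round(s, 2) for s in scores])
```

In a two-phase procedure of the kind the abstract describes, the features that survive this screen would then be passed to a second, more expensive selection step.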

arXiv.org e-Print Archive

Applying MDL to Learning Best Model Granularity

Author: Gao Qiong
Li Ming
Vitanyi Paul
Publication date: 01/01/2000

The Minimum Description Length (MDL) principle is solidly based on a provably ideal method of inference using Kolmogorov complexity. We test how the theory behaves in practice on a general problem in model selection: that of learning the best model granularity. The performance of a model depends critically on the granularity, for example the choice of precision of the parameters. Too high a precision generally involves modeling of accidental noise, and too low a precision may lead to confusion of models that should be distinguished. This precision is often determined ad hoc. In MDL the best model is the one that most compresses a two-part code of the data set: this embodies "Occam's Razor." In two quite different experimental settings the theoretical value determined using MDL coincides with the best value found experimentally. In the first experiment the task is to recognize isolated handwritten characters in one subject's handwriting, irrespective of size and orientation. Based on a new modification of elastic matching, using multiple prototypes per character, the optimal prediction rate is predicted for the learned parameter (length of sampling interval) considered most likely by MDL, which is shown to coincide with the best value found experimentally. In the second experiment the task is to model a robot arm with two degrees of freedom using a three-layer feed-forward neural network, where we need to determine the number of nodes in the hidden layer giving the best modeling performance. The optimal model (the one that extrapolates best on unseen examples) is predicted for the number of nodes in the hidden layer considered most likely by MDL, which again is found to coincide with the best value found experimentally.

Comment: LaTeX, 32 pages, 5 figures. Artificial Intelligence journal, To appea
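The two-part code idea in the abstract above can be illustrated with a small sketch that picks a histogram bin count as the "granularity". This is not the paper's experimental setup: the code-length formula is an assumed, commonly used approximation (0.5 * log2(n) bits per free parameter plus the negative log2-likelihood), and the names `two_part_code_length` and `best_granularity` and the toy data are hypothetical.

```python
import math
import random


def two_part_code_length(data, k):
    """Approximate two-part code length (bits) of data in [0, 1) under a k-bin histogram.

    Assumption: each of the k - 1 free bin probabilities costs 0.5 * log2(n) bits.
    This is a standard asymptotic approximation, not the coding scheme of the paper.
    """
    n = len(data)
    counts = [0] * k
    for x in data:
        counts[min(int(x * k), k - 1)] += 1
    # L(data | model): negative log2-likelihood under the histogram density.
    # The constant for the recording precision of the data is the same for every k,
    # so it is omitted; this is why the value can be negative.
    data_bits = -sum(c * math.log2(c / n * k) for c in counts if c > 0)
    # L(model): cost of stating the k - 1 free parameters.
    model_bits = 0.5 * (k - 1) * math.log2(n)
    return model_bits + data_bits


def best_granularity(data, candidates):
    """Return the bin count with the shortest total code length."""
    return min(candidates, key=lambda k: two_part_code_length(data, k))


# Toy data (hypothetical): about 80% of the mass in [0, 0.5), about 20% in [0.5, 1).
rng = random.Random(0)
data = [rng.random() * 0.5 + (0.0 if rng.random() < 0.8 else 0.5) for _ in range(1000)]

candidates = [1, 2, 4, 8, 16, 32]
for k in candidates:
    print(k, round(two_part_code_length(data, k), 1))
print("selected bins:", best_granularity(data, candidates))
```

Too few bins miss real structure and inflate the data part of the code; too many bins fit noise and inflate the model part. The minimum of the sum is the trade-off the abstract calls the best granularity.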

arXiv.org e-Print Archive

Elsevier - Publisher Connector

CWI's Institutional Repository

CERN Document Server

International Migration, Integration and Social Cohesion online publications

An approach for selecting cost estimation techniques for innovative high value manufacturing products

Author: Erkoyuncu John
Schwabe Oliver
Shehab Essam
Publication venue: Elsevier BV
Publication date: 01/01/2016

This paper presents an approach for determining the most appropriate technique for cost estimation of innovative high value manufacturing products, depending on the amount of prior data available. Case study data from the United States Scheduled Annual Summary Reports for the Joint Strike Fighter (1997-2010) is used to exemplify how, depending on the attributes of a priori data, certain techniques for cost estimation are more suitable than others. The data attribute focused on is the computational complexity involved in identifying whether or not there are patterns suited for propagation. Computational complexity is calculated based upon established mathematical principles for pattern recognition, which argue that at least 42 data sets are required for the application of standard regression analysis techniques. The paper proposes that below this threshold a generic dependency model and starting conditions should be used and iteratively adapted to the context. In the special case of having fewer than four datasets available, it is suggested that no contemporary cost estimating techniques other than analogy or expert opinion are currently applicable, and alternative techniques must be explored if more quantitative results are desired. By applying the mathematical principles of complexity groups, the paper argues that when fewer than four consecutive datasets are available the principles of topological data analysis should be applied. The precondition is that the cost variance of at least three cost variance types for one to three time discrete continuous intervals is available, so that it can be quantified based upon its geometrical attributes, visualised as an n-dimensional point cloud, and then evaluated based upon the symmetrical properties of the evolving shape. Further work is suggested to validate the provided decision trees in cost estimation practice.

Elsevier - Publisher Connector

Cranfield CERES