An application of minimum description length clustering to partitioning learning curves
© Copyright 2005 IEEE. We apply a Minimum Description Length-based clustering technique to the problem of partitioning a set of learning curves. The goal is to partition experimental data collected from different sources into groups of sources that are statistically the same. We solve this problem by defining statistical models for the data-generating processes and then partitioning them using the Normalized Maximum Likelihood criterion. Unlike many alternative model selection methods, this approach is optimal (in a minimax coding sense) for data of any sample size. We present an application of the method to the cognitive modeling problem of partitioning human learning curves for different categorization tasks.
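The Normalized Maximum Likelihood (NML) idea can be made concrete in the simplest case. Below is a minimal sketch, assuming each source produces Bernoulli success/failure counts (an illustrative stand-in for the paper's learning-curve models, not its actual formulation): two sources are grouped when coding their pooled data under one NML model is cheaper than coding them separately. All function names are hypothetical.

```python
import math

def nml_codelength_bernoulli(k, n):
    """NML codelength (nats) of a specific binary sequence with k ones in n trials."""
    def max_ll(k, n):
        # log-likelihood of the sequence at the maximum-likelihood parameter k/n
        if k == 0 or k == n:
            return 0.0
        return k * math.log(k / n) + (n - k) * math.log((n - k) / n)
    # parametric complexity: log of the sum of maximized likelihoods over all
    # possible outcomes (C(n, j) distinct sequences share each count j)
    comp = math.log(sum(
        math.exp(max_ll(j, n)
                 + math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1))
        for j in range(n + 1)))
    return -max_ll(k, n) + comp

def should_merge(k1, n1, k2, n2):
    """Group two sources when one pooled NML code beats two separate codes."""
    pooled = nml_codelength_bernoulli(k1 + k2, n1 + n2)
    separate = nml_codelength_bernoulli(k1, n1) + nml_codelength_bernoulli(k2, n2)
    return pooled < separate
```

For example, sources with 5/10 and 6/10 successes are merged, while 1/10 and 9/10 are kept apart.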
Structure selection for convolutive non-negative matrix factorization using normalized maximum likelihood coding
Convolutive non-negative matrix factorization (CNMF) is a promising method for extracting features from sequential multivariate data. Conventional algorithms for CNMF require that the structure, or the number of bases for expressing the data, be specified in advance. We are concerned with the issue of how to select the best structure of CNMF from given data. We first introduce a framework of probabilistic modeling of CNMF and reduce this issue to statistical model selection. The problem here is that conventional model selection criteria such as AIC, BIC, and MDL cannot be applied straightforwardly, since the probabilistic model for CNMF is irregular in the sense that its parameters are not uniquely identifiable. We overcome this problem by proposing a novel criterion for selecting the best structure of CNMF. The key idea is to apply the technique of latent variable completion in combination with a normalized maximum likelihood coding criterion under the minimum description length principle. We empirically demonstrate the effectiveness of our method using artificial and real data sets.
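The paper's NML-with-latent-variable-completion criterion is involved; as a rough illustration of structure selection for plain (non-convolutive) NMF, here is a sketch that scores each candidate number of bases with a simple BIC-style penalized reconstruction error. This is a stand-in for, not a reproduction of, the proposed criterion; the function names and penalty form are assumptions.

```python
import numpy as np

def nmf(V, r, iters=300, seed=0):
    """Plain NMF, V ≈ W @ H, via Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H

def select_rank(V, ranks):
    """Choose the number of bases by a BIC-style penalized squared error
    (an illustrative stand-in for the NML criterion in the abstract)."""
    m, n = V.shape
    best_r, best_score = None, np.inf
    for r in ranks:
        W, H = nmf(V, r)
        rss = np.sum((V - W @ H) ** 2)
        k = r * (m + n)  # free parameters in W and H
        score = m * n * np.log(rss / (m * n) + 1e-300) + k * np.log(m * n)
        if score < best_score:
            best_r, best_score = r, score
    return best_r
```

On easy synthetic low-rank data this recovers the generating number of bases; on irregular models like CNMF the abstract's point is precisely that such off-the-shelf penalties are not justified, which motivates the NML-based criterion.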
Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
We propose a framework, MIC (Multiple Inclusion Criterion), for learning sparse models based on the information-theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC-based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two-level sparsity, selecting a subset of the feature groups and also selecting features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two-level sparsity, selecting a subset of the features and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.
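A minimal sketch of two-part MDL feature selection in the spirit of MIC (not the paper's actual coding scheme): each selected feature costs log2(p) bits to name plus 0.5*log2(n) bits for its coefficient, and a feature is added only if it shortens the total code. All names and the exact bit accounting are illustrative assumptions.

```python
import numpy as np

def description_length(y, X, subset):
    """Two-part codelength (bits): name each chosen feature (log2 p bits) and
    encode its coefficient (0.5*log2 n bits), plus residual bits under a
    Gaussian noise model."""
    n, p = X.shape
    if subset:
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
    else:
        rss = np.sum(y ** 2)
    model_bits = len(subset) * (np.log2(p) + 0.5 * np.log2(n))
    data_bits = 0.5 * n * np.log2(rss / n + 1e-300)
    return model_bits + data_bits

def mdl_forward_select(y, X):
    """Greedy forward selection: add a feature only if it shortens the code."""
    p = X.shape[1]
    chosen = []
    best = description_length(y, X, chosen)
    improved = True
    while improved:
        improved = False
        for j in range(p):
            if j in chosen:
                continue
            cand = description_length(y, X, chosen + [j])
            if cand < best:
                best, best_j, improved = cand, j, True
        if improved:
            chosen.append(best_j)
    return sorted(chosen)
```

The bit prices make the trade-off explicit: a feature enters only when the residual-coding savings exceed the cost of naming it and encoding its coefficient, which is what keeps the selected set small.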
An F-ratio-Based Method for Estimating the Number of Active Sources in MEG
Magnetoencephalography (MEG) is a powerful technique for studying human brain function. However, accurately estimating the number of sources that contribute to the MEG recordings remains a challenging problem due to the low signal-to-noise ratio (SNR), the presence of correlated sources, inaccuracies in head modeling, and variations in individual anatomy. To address these issues, our study introduces a robust method for accurately estimating the number of active sources in the brain based on the F-ratio statistical approach, which allows for a comparison between a full model with a higher number of sources and a reduced model with fewer sources. Using this approach, we developed a formal statistical procedure that sequentially increases the number of sources in the multiple dipole localization problem until all sources are found. Our results revealed that the selection of thresholds plays a critical role in determining the method's overall performance: appropriate thresholds needed to be adjusted for the number of sources and SNR levels, while they remained largely invariant to different inter-source correlations, modeling inaccuracies, and different cortical anatomies. By identifying optimal thresholds and validating our F-ratio-based method on simulated, real phantom, and human MEG data, we demonstrated its superiority over existing state-of-the-art statistical approaches such as the Akaike Information Criterion (AIC) and Minimum Description Length (MDL). Overall, when tuned for optimal selection of thresholds, our method offers researchers a precise tool to estimate the true number of active brain sources and accurately model brain function.
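The sequential F-ratio procedure can be sketched generically, with columns of a design matrix standing in for candidate dipole sources; as the abstract stresses, the stopping threshold must be tuned. The function names, the linear stand-in, and the threshold value below are illustrative assumptions, not the paper's MEG pipeline.

```python
import numpy as np

def f_ratio(rss_reduced, rss_full, n, p_reduced, p_full):
    """F-statistic comparing a reduced model to a full model it is nested in."""
    num = (rss_reduced - rss_full) / (p_full - p_reduced)
    den = rss_full / (n - p_full)
    return num / den

def estimate_num_sources(y, components, threshold):
    """Add one candidate component (column) at a time and stop as soon as the
    F-ratio for the newly added component falls below the chosen threshold."""
    n = len(y)
    rss_prev = np.sum(y ** 2)  # zero-source model
    k = 0
    for k_next in range(1, components.shape[1] + 1):
        X = components[:, :k_next]
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        if f_ratio(rss_prev, rss, n, k_next - 1, k_next) < threshold:
            break
        k, rss_prev = k_next, rss
    return k
```

The threshold plays exactly the role the abstract describes: too low and spurious components survive the test, too high and weak true components are missed, so it must be calibrated to the expected SNR.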
The Minimum Description Length Principle and Model Selection in Spectropolarimetry
It is shown that the two-part Minimum Description Length Principle can be used to discriminate among different models that can explain a given observed dataset. The description length is chosen to be the sum of the length of the message needed to encode the model plus the length of the message needed to encode the data when the model is applied to the dataset. It is verified that the proposed principle can efficiently distinguish the model that correctly fits the observations while avoiding over-fitting. The capabilities of this criterion are shown in two simple problems for the analysis of observed spectropolarimetric signals. The first is the de-noising of observations with the aid of the PCA technique. The second is the selection of the optimal number of parameters in LTE inversions. We propose this criterion as a quantitative approach for distinguishing the most plausible model among a set of proposed models. This quantity is very easy to implement as an additional output in existing inversion codes. Comment: Accepted for publication in the Astrophysical Journal.
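The first application (PCA de-noising) can be illustrated with a generic two-part MDL rule for choosing how many principal components to retain: model bits for the retained scores and loadings plus residual bits, with the number of components that minimizes the total chosen. The exact codelength used in the paper is not reproduced here, so the penalty form below is an assumption.

```python
import numpy as np

def mdl_pca_components(X, max_k=None):
    """Two-part MDL choice of how many principal components to keep."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    total = np.sum(s ** 2)
    if max_k is None:
        max_k = min(n, d) - 1
    best_k, best_len = 0, np.inf
    for k in range(max_k + 1):
        rss = total - np.sum(s[:k] ** 2)  # energy outside the top-k subspace
        params = k * (n + d)              # k score vectors plus k loading vectors
        model_bits = 0.5 * params * np.log2(n * d)
        data_bits = 0.5 * n * d * np.log2(rss / (n * d) + 1e-300)
        codelength = model_bits + data_bits
        if codelength < best_len:
            best_k, best_len = k, codelength
    return best_k
```

Retaining a component is only worthwhile when the residual-coding savings exceed the bits needed to describe it, which is what stops the rule from keeping noise-dominated components.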
Model selection and adaptation for biochemical pathways
In bioinformatics, biochemical signal pathways can be modeled by large systems of differential equations. How to fit the huge number of parameters of these equations to the available data is still an open problem. Here, the approach of systematically obtaining the most appropriate model and learning its parameters is extremely interesting. One of the most frequently used approaches to model selection is to choose the least complex model which "fits the needs". For noisy measurements, the model with the smallest mean squared error on the observed data fits the data too accurately: it is overfitting. Such a model will perform well on the training data, but worse on unknown data. This paper proposes as model selection criterion the least complex description of the observed data by the model: the minimum description length. The performance of the approach is evaluated on the small but important example of inflammation modeling. Keywords: biochemical pathways, differential equations, septic shock, parameter estimation, overfitting, minimum description length
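The overfitting point can be demonstrated with a toy polynomial-regression stand-in for the pathway equations: training error keeps falling as the model grows, while a two-part MDL score favors a parsimonious degree. The score form is a common textbook choice, not the paper's exact criterion, and all names are illustrative.

```python
import numpy as np

def poly_rss(x, y, deg):
    """Training residual sum of squares of a least-squares polynomial fit."""
    c = np.polyfit(x, y, deg)
    return float(np.sum((y - np.polyval(c, x)) ** 2))

def mdl_score(rss, n, k):
    """Two-part codelength: 0.5*log2(n) bits per parameter plus residual bits."""
    return 0.5 * k * np.log2(n) + 0.5 * n * np.log2(rss / n + 1e-300)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.1 * rng.standard_normal(60)  # true degree: 2

rss = [poly_rss(x, y, d) for d in range(9)]
mdl = [mdl_score(r, 60, d + 1) for d, r in enumerate(rss)]
best = int(np.argmin(mdl))  # MDL's pick; training RSS alone always prefers degree 8
```

The training RSS is monotonically non-increasing in the degree, so minimizing it alone always selects the most complex model; the MDL score turns over once extra parameters stop paying for themselves.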