161,081 research outputs found
Substructure Discovery Using Minimum Description Length and Background Knowledge
The ability to identify interesting and repetitive substructures is an
essential component to discovering knowledge in structural data. We describe a
new version of our SUBDUE substructure discovery system based on the minimum
description length principle. The SUBDUE system discovers substructures that
compress the original data and represent structural concepts in the data. By
replacing previously-discovered substructures in the data, multiple passes of
SUBDUE produce a hierarchical description of the structural regularities in the
data. SUBDUE uses a computationally-bounded inexact graph match that identifies
similar, but not identical, instances of a substructure and finds an
approximate measure of closeness of two substructures when under computational
constraints. In addition to the minimum description length principle, other
background knowledge can be used by SUBDUE to guide the search towards more
appropriate substructures. Experiments in a variety of domains demonstrate
SUBDUE's ability to find substructures capable of compressing the original data
and to discover structural concepts important to the domain. Description of
Online Appendix: This is a compressed tar file containing the SUBDUE discovery
system, written in C. The program accepts as input databases represented in
graph form, and will output discovered substructures with their corresponding
value.Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this articl
The Unsupervised Acquisition of a Lexicon from Continuous Speech
We present an unsupervised learning algorithm that acquires a
natural-language lexicon from raw speech. The algorithm is based on the optimal
encoding of symbol sequences in an MDL framework, and uses a hierarchical
representation of language that overcomes many of the problems that have
stymied previous grammar-induction procedures. The forward mapping from symbol
sequences to the speech stream is modeled using features based on articulatory
gestures. We present results on the acquisition of lexicons and language models
from raw speech, text, and phonetic transcripts, and demonstrate that our
algorithm compares very favorably to other reported results with respect to
segmentation performance and statistical efficiency.Comment: 27 page technical repor
Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally allows to trade-off model complexity with
goodness of fit, by which overfitting and the need for hyperparameter tuning
are effectively avoided. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion
- …