The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim of obtaining compact, high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active, related data analysis problems.
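As a toy illustration of the two-part MDL principle the survey builds on, the sketch below compares the total description length L(M) + L(D|M) of a skewed binary sequence under a uniform code and under a fitted code. The half-log parameter cost is a standard but simplified choice, and all names here are ours, not the survey's.

```python
import math

def data_bits(data, probs):
    """L(D|M): Shannon code length of the data under model probabilities."""
    return sum(-math.log2(probs[x]) for x in data)

def total_bits(data, probs, model_bits):
    """Two-part MDL score: model cost L(M) plus data cost L(D|M)."""
    return model_bits + data_bits(data, probs)

data = "A" * 90 + "B" * 10

# Model 1: uniform code, no parameters to transmit.
uniform = total_bits(data, {"A": 0.5, "B": 0.5}, model_bits=0.0)

# Model 2: fitted code; transmit one frequency parameter at the usual
# (1/2) * log2(n) bits.
fitted = total_bits(data, {"A": 0.9, "B": 0.1},
                    model_bits=0.5 * math.log2(len(data)))

# On skewed data the fitted model compresses better despite its
# parameter cost, so MDL selects it.
```

The same trade-off drives pattern selection: a pattern enters the model only when the bits it saves on the data exceed the bits needed to describe the pattern itself.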
Interpretable multiclass classification by MDL-based rule lists
Interpretable classifiers have recently witnessed an increase in attention
from the data mining community because they are inherently easier to understand
and explain than their more complex counterparts. Examples of interpretable
classification models include decision trees, rule sets, and rule lists.
Learning such models often involves optimizing hyperparameters, which typically
requires substantial amounts of data and may result in relatively large models.
In this paper, we consider the problem of learning compact yet accurate
probabilistic rule lists for multiclass classification. Specifically, we
propose a novel formalization based on probabilistic rule lists and the minimum
description length (MDL) principle. This results in virtually parameter-free
model selection that naturally allows trading off model complexity against
goodness of fit, by which overfitting and the need for hyperparameter tuning
are effectively avoided. Finally, we introduce the Classy algorithm, which
greedily finds rule lists according to the proposed criterion. We empirically
demonstrate that Classy selects small probabilistic rule lists that outperform
state-of-the-art classifiers when it comes to the combination of predictive
performance and interpretability. We show that Classy is insensitive to its
only parameter, i.e., the candidate set, and that compression on the training
set correlates with classification performance, validating our MDL-based
selection criterion.
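A minimal sketch of the compression-driven greedy selection the abstract describes, with simplifying assumptions of ours: conditions are itemsets, labels are coded per rule with a smoothed categorical code (standing in for the prequential plug-in code), and the model cost is a crude one bit per literal.

```python
import math

def label_bits(counts, alpha=0.5):
    """Bits to code one group's labels under smoothed class probabilities
    (a stand-in for the NML/prequential codes used in the paper)."""
    n, k = sum(counts), len(counts)
    return sum(-c * math.log2((c + alpha) / (n + alpha * k))
               for c in counts if c > 0)

def list_cost(rules, X, y, n_classes):
    """Total description length of a rule list: one bit per literal as a
    crude model cost, plus the coded labels, where each instance is covered
    by the first rule whose itemset condition it satisfies (the empty
    default rule at the end always matches)."""
    model_bits = sum(len(cond) for cond in rules)
    counts = [[0] * n_classes for _ in rules]
    for items, label in zip(X, y):
        for i, cond in enumerate(rules):
            if cond <= items:
                counts[i][label] += 1
                break
    return model_bits + sum(label_bits(c) for c in counts)

def greedy_rule_list(candidates, X, y, n_classes):
    """Classy-style greedy search: repeatedly insert, before the default
    rule, the candidate that most reduces the total description length,
    stopping when no candidate improves compression."""
    rules = [frozenset()]  # default rule only
    best = list_cost(rules, X, y, n_classes)
    while True:
        scored = [(list_cost(rules[:-1] + [c, frozenset()], X, y, n_classes), i)
                  for i, c in enumerate(candidates)]
        cost, i = min(scored)
        if cost >= best:
            return rules, best
        best = cost
        rules = rules[:-1] + [candidates[i], frozenset()]

# Toy data: item "a" marks class 0, item "b" marks class 1.
X = [frozenset({"a"})] * 20 + [frozenset({"b"})] * 20
y = [0] * 20 + [1] * 20
rules, cost = greedy_rule_list([frozenset({"a"}), frozenset({"b"})], X, y, 2)
```

On this toy data a single rule already separates the classes, so the search stops after one insertion: adding the second rule would cost more model bits than it saves on the labels.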
MDL for FCA: is there a place for background knowledge?
The Minimum Description Length (MDL) principle is a powerful and well-founded approach that has been successfully applied to a wide range of data mining tasks. In this paper we address the problem of pattern mining with MDL. We discuss how constraints, i.e., background knowledge on the interestingness of patterns, can be embedded into MDL, and argue for the benefits of MDL over a simple selection of patterns based on interestingness measures.
{MDL4BMF}: Minimum Description Length for Boolean Matrix Factorization
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
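To make the model order selection concrete, here is a deliberately naive two-part code for BMF, with assumptions that are ours: each Boolean matrix is transmitted as its index among matrices of the same shape with the same number of ones. This is far cruder than the data-to-model encoding developed in the paper, but it exhibits the same trade-off between factor size and error.

```python
import math

def matrix_bits(M):
    """Bits to transmit a Boolean matrix of known shape: the index of this
    matrix among all matrices with the same number of ones."""
    cells = len(M) * len(M[0])
    ones = sum(sum(row) for row in M)
    return math.log2(math.comb(cells, ones))

def bmf_dl(D, A, B):
    """Two-part description length of D under the factorization A ∘ B:
    transmit both factors plus the error matrix D xor (A ∘ B)."""
    n, m, k = len(D), len(D[0]), len(A[0])
    prod = [[int(any(A[i][t] and B[t][j] for t in range(k)))
             for j in range(m)] for i in range(n)]
    err = [[D[i][j] ^ prod[i][j] for j in range(m)] for i in range(n)]
    return matrix_bits(A) + matrix_bits(B) + matrix_bits(err)

# Two clean 2x2 blocks: the true Boolean rank is 2.
D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

# Correct order k = 2: exact factorization, empty error matrix.
A2 = [[1, 0], [1, 0], [0, 1], [0, 1]]
B2 = [[1, 1, 0, 0], [0, 0, 1, 1]]

# Underfitted k = 1: cheap factors, but 8 error cells to transmit.
A1 = [[1], [1], [1], [1]]
B1 = [[1, 1, 1, 1]]

dl2 = bmf_dl(D, A2, B2)
dl1 = bmf_dl(D, A1, B1)
# MDL prefers the correct model order: dl2 < dl1.
```

Sweeping k and keeping the order with the smallest total is exactly the automatic, likelihood-free selection the abstract advertises, just with a much simpler code.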
Robust subgroup discovery
We introduce the problem of robust subgroup discovery, i.e., finding a set of
interpretable descriptions of subsets that 1) stand out with respect to one or
more target attributes, 2) are statistically robust, and 3) are non-redundant. Many
attempts have been made to mine either locally robust subgroups or to tackle
the pattern explosion, but we are the first to address both challenges at the
same time from a global modelling perspective. First, we formulate the broad
model class of subgroup lists, i.e., ordered sets of subgroups, for univariate
and multivariate targets that can consist of nominal or numeric variables, and
that includes traditional top-1 subgroup discovery in its definition. This
novel model class allows us to formalise the problem of optimal robust subgroup
discovery using the Minimum Description Length (MDL) principle, where we resort
to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and
numeric targets, respectively. Second, as finding optimal subgroup lists is
NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists
and guarantees that the most significant subgroup found according to the MDL
criterion is added in each iteration, which is shown to be equivalent to a
Bayesian one-sample proportions, multinomial, or t-test between the subgroup
and dataset marginal target distributions plus a multiple hypothesis testing
penalty. We empirically show on 54 datasets that SSD++ outperforms previous
subgroup set discovery methods in terms of quality and subgroup list size.
Comment: For associated code, see https://github.com/HMProenca/RuleList ;
submitted to the Data Mining and Knowledge Discovery journal.
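The per-iteration acceptance criterion can be sketched for a single nominal target, with our own simplifications: labels are coded with a smoothed categorical code rather than the paper's NML encoding, and the multiple-testing penalty is collapsed to log2 of the candidate count.

```python
import math

def label_bits(counts, alpha=0.5):
    """Bits to code labels with the given class counts under smoothed
    empirical class probabilities."""
    n, k = sum(counts), len(counts)
    return sum(-c * math.log2((c + alpha) / (n + alpha * k))
               for c in counts if c > 0)

def subgroup_gain(sub_counts, total_counts, n_candidates):
    """Compression gain (bits) from coding a subgroup's labels with its own
    distribution instead of the dataset marginal, minus a multiple-testing
    penalty; a subgroup is kept only if this gain is positive."""
    rest = [t - s for s, t in zip(sub_counts, total_counts)]
    before = label_bits(total_counts)
    after = label_bits(sub_counts) + label_bits(rest)
    return before - after - math.log2(n_candidates)

# A subgroup with 18-of-20 positives in a balanced 25/25 dataset stands
# out; a 10-of-20 subgroup mirrors the marginal and is rejected.
good = subgroup_gain([18, 2], [25, 25], n_candidates=10)
bad = subgroup_gain([10, 10], [25, 25], n_candidates=10)
```

This positive-gain test plays the role the abstract describes: a deviation test between the subgroup's and the marginal target distribution, penalised for the number of hypotheses tried.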
Workshop Notes of the Sixth International Workshop "What can FCA do for Artificial Intelligence?"