A Bayesian Network Model for Interesting Itemsets
Mining itemsets that are the most interesting under a statistical model of
the underlying data is a commonly used and well-studied technique for
exploratory data analysis, with the most recent interestingness models
exhibiting state-of-the-art performance. Continuing this highly promising line
of work, we propose the first, to the best of our knowledge, generative model
over itemsets, in the form of a Bayesian network, and an associated novel
measure of interestingness. Our model is able to efficiently infer interesting
itemsets directly from the transaction database using structural EM, in which
the E-step employs the greedy approximation to weighted set cover. Our approach
is theoretically simple, straightforward to implement, trivially parallelizable
and retrieves itemsets whose quality is comparable to, if not better than,
existing state-of-the-art algorithms, as we demonstrate on several real-world
datasets.
Comment: Supplementary material attached as Ancillary File; in PKDD 2016:
European Conference on Machine Learning and Knowledge Discovery in Databases
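The abstract above mentions a greedy approximation to weighted set cover in the E-step. As a rough illustration of that general technique only (not the authors' implementation), the following sketch repeatedly picks the candidate set with the lowest cost per newly covered element. The function name, the input layout, and the example costs are all hypothetical.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Greedy approximation to weighted set cover.

    universe:   iterable of elements to cover (e.g. the items of one transaction)
    candidates: dict mapping a label to (frozenset of elements, positive cost)
    Returns the labels of the chosen sets, in the order they were picked.
    All names and the input layout are assumptions made for illustration.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_label, best_ratio = None, float("inf")
        for label, (elements, cost) in candidates.items():
            gain = len(elements & uncovered)
            if gain == 0:
                continue  # this set covers nothing new
            ratio = cost / gain  # cost per newly covered element
            if ratio < best_ratio:
                best_label, best_ratio = label, ratio
        if best_label is None:
            raise ValueError("the candidate sets cannot cover the universe")
        chosen.append(best_label)
        uncovered -= candidates[best_label][0]
    return chosen


# Hypothetical example: cover the items 1..5 with four candidate itemsets.
example_candidates = {
    "A": (frozenset({1, 2, 3}), 3.0),
    "B": (frozenset({3, 4}), 1.0),
    "C": (frozenset({4, 5}), 1.0),
    "D": (frozenset({1, 2, 3, 4, 5}), 10.0),
}
cover = greedy_weighted_set_cover({1, 2, 3, 4, 5}, example_candidates)
```

The greedy rule is a standard heuristic with a logarithmic approximation guarantee. How the costs are derived from the model is specific to the paper and is not reproduced here.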
The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim to obtain compact high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active related data analysis problems.
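For orientation, the MDL principle referred to above is commonly stated in its two-part ("crude") form, shown below. The symbols $\mathcal{M}$ (the class of candidate models, for example pattern sets), $L(M)$, and $L(D \mid M)$ are notation assumed here and are not taken from the abstract.

```latex
% Two-part MDL: choose the model that minimizes the total encoded length.
% L(M):        length, in bits, of the description of the model M
% L(D \mid M): length, in bits, of the data D when encoded with the help of M
M^{*} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \bigl( L(M) + L(D \mid M) \bigr)
```

A richer pattern set makes $L(M)$ larger but usually makes $L(D \mid M)$ smaller, so minimizing the sum balances model complexity against fit.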
MDL4BMF: Minimum Description Length for Boolean Matrix Factorization
Matrix factorizations, where a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF), where data, factors, and matrix product are Boolean, has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
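To make the idea of model order selection concrete, here is a minimal sketch of the Boolean matrix product (where 1 + 1 = 1) together with a deliberately naive description length. The encoding below (one bit per factor entry plus the index of every mismatched cell) is an assumption chosen for illustration and is not one of the encodings developed in the paper. All function names are hypothetical.

```python
from math import log2


def boolean_product(B, C):
    """Boolean product of an n-by-k and a k-by-m 0/1 matrix (assumes k >= 1)."""
    k, m = len(C), len(C[0])
    return [
        [int(any(row[l] and C[l][j] for l in range(k))) for j in range(m)]
        for row in B
    ]


def reconstruction_errors(A, B, C):
    """Number of cells where the Boolean product of B and C differs from A."""
    P = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, P) for a, p in zip(row_a, row_p))


def naive_description_length(A, B, C):
    """Illustrative two-part length in bits; NOT the paper's encoding.

    Model part: one bit per entry of the two factor matrices.
    Data part:  each mismatched cell is sent as an index into the n*m cells.
    """
    n, m, k = len(A), len(A[0]), len(C)
    model_bits = k * (n + m)
    error_bits = reconstruction_errors(A, B, C) * log2(n * m)
    return model_bits + error_bits


# Hypothetical 3-by-3 data matrix with an exact rank-2 Boolean factorization.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
B2, C2 = [[1, 0], [1, 1], [0, 1]], [[1, 1, 0], [0, 1, 1]]
B1, C1 = [[1], [1], [0]], [[1, 1, 0]]

length_rank1 = naive_description_length(A, B1, C1)
length_rank2 = naive_description_length(A, B2, C2)
```

Under this toy encoding the rank-2 factorization yields the shorter total description, so it would be the preferred model order for this example.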