4 research outputs found
An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming
The main advantage of Constraint Programming (CP) approaches for sequential
pattern mining (SPM) is their modularity, which includes the ability to add new
constraints (regular expressions, length restrictions, etc). The current best
CP approach for SPM uses a global constraint (module) that computes the
projected database and enforces the minimum frequency; it does this with a
filtering algorithm similar to the PrefixSpan method. However, the resulting
system is not as scalable as some of the most advanced mining systems like
Zaki's cSPADE. We show how, using techniques from both data mining and CP, one
can use a generic constraint solver and yet outperform existing specialized
systems. This is mainly due to two improvements in the module that computes the
projected frequencies: first, computing the projected database can be sped up
by pre-computing the positions at which an symbol can become unsupported by a
sequence, thereby avoiding to scan the full sequence each time; and second by
taking inspiration from the trailing used in CP solvers to devise a
backtracking-aware data structure that allows fast incremental storing and
restoring of the projected database. Detailed experiments show how this
approach outperforms existing CP as well as specialized systems for SPM, and
that the gain in efficiency translates directly into increased efficiency for
other settings such as mining with regular expressions.Comment: frequent sequence mining, constraint programmin
Hybrid ASP-based Approach to Pattern Mining
Detecting small sets of relevant patterns from a given dataset is a central
challenge in data mining. The relevance of a pattern is based on user-provided
criteria; typically, all patterns that satisfy certain criteria are considered
relevant. Rule-based languages like Answer Set Programming (ASP) seem
well-suited for specifying such criteria in a form of constraints. Although
progress has been made, on the one hand, on solving individual mining problems
and, on the other hand, developing generic mining systems, the existing methods
either focus on scalability or on generality. In this paper we make steps
towards combining local (frequency, size, cost) and global (various condensed
representations like maximal, closed, skyline) constraints in a generic and
efficient way. We present a hybrid approach for itemset, sequence and graph
mining which exploits dedicated highly optimized mining systems to detect
frequent patterns and then filters the results using declarative ASP. To
further demonstrate the generic nature of our hybrid framework we apply it to a
problem of approximately tiling a database. Experiments on real-world datasets
show the effectiveness of the proposed method and computational gains for
itemset, sequence and graph mining, as well as approximate tiling.
Under consideration in Theory and Practice of Logic Programming (TPLP).Comment: 29 pages, 7 figures, 5 table
A unified framework for frequent sequence mining with subsequence constraints
Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints—including and beyond those considered in the literature—can be unified in a single framework. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive “pattern expressions” to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite-state transducers, which we use as computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms—although more general—are efficient and, when used for sequence mining with prior constraints studied in literature, competitive to (and in some cases superior to) state-of-the-art specialized methods
An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming
The main advantage of Constraint Programming (CP) approaches for sequential pattern mining (SPM) is their modularity, which includes the ability to add new constraints (regular expressions, length restrictions, etc.). The current best CP approach for SPM uses a global constraint (module) that computes the projected database and enforces the minimum frequency; it does this with a filtering algorithm similar to the PrefixSpan method. However, the resulting system is not as scalable as some of the most advanced mining systems like Zaki’s cSPADE. We show how, using techniques from both data mining and CP, one can use a generic constraint solver and yet outperform existing specialized systems. This is mainly due to two improvements in the module that computes the projected frequencies: first, computing the projected database can be sped up by pre-computing the positions at which a symbol can become unsupported by a sequence, thereby avoiding to scan the full sequence each time; and second by taking inspiration from the trailing used in CP solvers to devise a backtracking-aware data structure that allows fast incremental storing and restoring of the projected database. Detailed experiments show how this approach outperforms existing CP as well as specialized systems for SPM, and that the gain in efficiency translates directly into increased efficiency for other settings such as mining with regular expressions. The data and software related to this paper are available at http://​sites.​uclouvain.​be/​cp4dm/​spm/​