107 research outputs found
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track
Reductions for Frequency-Based Data Mining Problems
Studying the computational complexity of problems is one of the - if not the
- fundamental questions in computer science. Yet, surprisingly little is known
about the computational complexity of many central problems in data mining. In
this paper we study frequency-based problems and propose a new type of
reduction that allows us to compare the complexities of the maximal frequent
pattern mining problems in different domains (e.g. graphs or sequences). Our
results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader
range of data mining problems. Our results show that, by allowing constraints
in the pattern space, the complexities of many maximal frequent pattern mining
problems collapse. These problems include maximal frequent subgraphs in
labelled graphs, maximal frequent itemsets, and maximal frequent subsequences
with no repetitions. In addition to theoretical interest, our results might
yield more efficient algorithms for the studied problems.Comment: This is an extended version of a paper of the same title to appear in
the Proceedings of the 17th IEEE International Conference on Data Mining
(ICDM'17
Using Answer Set Programming for pattern mining
Serial pattern mining consists in extracting the frequent sequential patterns
from a unique sequence of itemsets. This paper explores the ability of a
declarative language, such as Answer Set Programming (ASP), to solve this issue
efficiently. We propose several ASP implementations of the frequent sequential
pattern mining task: a non-incremental and an incremental resolution. The
results show that the incremental resolution is more efficient than the
non-incremental one, but both ASP programs are less efficient than dedicated
algorithms. Nonetheless, this approach can be seen as a first step toward a
generic framework for sequential pattern mining with constraints.Comment: Intelligence Artificielle Fondamentale (2014
Constraint-based sequence mining using constraint programming
The goal of constraint-based sequence mining is to find sequences of symbols
that are included in a large number of input sequences and that satisfy some
constraints specified by the user. Many constraints have been proposed in the
literature, but a general framework is still missing. We investigate the use of
constraint programming as general framework for this task. We first identify
four categories of constraints that are applicable to sequence mining. We then
propose two constraint programming formulations. The first formulation
introduces a new global constraint called exists-embedding. This formulation is
the most efficient but does not support one type of constraint. To support such
constraints, we develop a second formulation that is more general but incurs
more overhead. Both formulations can use the projected database technique used
in specialised algorithms. Experiments demonstrate the flexibility towards
constraint-based settings and compare the approach to existing methods.Comment: In Integration of AI and OR Techniques in Constraint Programming
(CPAIOR), 201
Mining Frequent Graph Patterns with Differential Privacy
Discovering frequent graph patterns in a graph database offers valuable
information in a variety of applications. However, if the graph dataset
contains sensitive data of individuals such as mobile phone-call graphs and
web-click graphs, releasing discovered frequent patterns may present a threat
to the privacy of individuals. {\em Differential privacy} has recently emerged
as the {\em de facto} standard for private data analysis due to its provable
privacy guarantee. In this paper we propose the first differentially private
algorithm for mining frequent graph patterns.
We first show that previous techniques on differentially private discovery of
frequent {\em itemsets} cannot apply in mining frequent graph patterns due to
the inherent complexity of handling structural information in graphs. We then
address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling
based algorithm. Unlike previous work on frequent itemset mining, our
techniques do not rely on the output of a non-private mining algorithm.
Instead, we observe that both frequent graph pattern mining and the guarantee
of differential privacy can be unified into an MCMC sampling framework. In
addition, we establish the privacy and utility guarantee of our algorithm and
propose an efficient neighboring pattern counting technique as well.
Experimental results show that the proposed algorithm is able to output
frequent patterns with good precision
Exploiting incomparability in solution dominance : improving general purpose constraint-based mining
In data mining, finding interesting patterns is a challenging task. Constraint-based mining is a well-known approach to this, and one for which constraint programming has been shown to be a well-suited and generic framework. Constraint dominance programming (CDP) has been proposed as an extension that can capture an even wider class of constraint-based mining problems, by allowing us to compare relations between patterns. In this paper we improve CDP with the ability to specify an incomparability condition. This allows us to overcome two major shortcomings of CDP: finding dominated solutions that must then be filtered out after search, and unnecessarily adding dominance blocking constraints between incomparable solutions. We demonstrate the efficacy of our approach by extending the problem specification language ESSENCE and implementing it in a solver-independent manner on top of the constraint modelling tool CONJURE. Our experiments on pattern mining tasks with both a CP solver and a SAT solver show that using the incomparability condition during search significantly improves the efficiency of dominance programming and reduces (and often eliminates entirely) the need for post-processing to filter dominated solutions.Publisher PD
Efficient incremental modelling and solving
Funding: This work is supported by EPSRC grant EP/P015638/1. Nguyen Dang is a Leverhulme Trust Early Career Fellow (ECF-2020-168).In various scenarios, a single phase of modelling and solving is either not sufficient or not feasible to solve the problem at hand. A standard approach to solving AI planning problems, for example, is to incrementally extend the planning horizon and solve the problem of trying to find a plan of a particular length. Indeed, any optimization problem can be solved as a sequence of decision problems in which the objective value is incrementally updated. Another example is constraint dominance programming (CDP), in which search is organized into a sequence of levels. The contribution of this work is to enable a native interaction between SAT solvers and the automated modelling system Savile Row to support efficient incremental modelling and solving. This allows adding new decision variables, posting new constraints and removing existing constraints (via assumptions) between incremental steps. Two additional benefits of the native coupling of modelling and solving are the ability to retain learned information between SAT solver calls and to enable SAT assumptions, further improving flexibility and efficiency. Experiments on one optimisation problem and five pattern mining tasks demonstrate that the native interaction between the modelling system and SAT solver consistently improves performance significantly.Publisher PD
- …