56 research outputs found
Constraint-based Sequential Pattern Mining with Decision Diagrams
Constrained sequential pattern mining aims at identifying frequent patterns
on a sequential database of items while observing constraints defined over the
item attributes. We introduce novel techniques for constraint-based sequential
pattern mining that rely on a multi-valued decision diagram representation of
the database. Specifically, our representation can accommodate multiple item
attributes and various constraint types, including a number of non-monotone
constraints. To evaluate the applicability of our approach, we develop an
MDD-based prefix-projection algorithm and compare its performance against a
typical generate-and-check variant, as well as a state-of-the-art
constraint-based sequential pattern mining algorithm. Results show that our
approach is competitive with or superior to these other methods in terms of
scalability and efficiency. Comment: AAAI201
Reductions for Frequency-Based Data Mining Problems
Studying the computational complexity of problems is one of the - if not the
- fundamental questions in computer science. Yet, surprisingly little is known
about the computational complexity of many central problems in data mining. In
this paper we study frequency-based problems and propose a new type of
reduction that allows us to compare the complexities of the maximal frequent
pattern mining problems in different domains (e.g. graphs or sequences). Our
results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader
range of data mining problems. Our results show that, by allowing constraints
in the pattern space, the complexities of many maximal frequent pattern mining
problems collapse. These problems include maximal frequent subgraphs in
labelled graphs, maximal frequent itemsets, and maximal frequent subsequences
with no repetitions. In addition to theoretical interest, our results might
yield more efficient algorithms for the studied problems. Comment: This is an extended version of a paper of the same title to appear in
the Proceedings of the 17th IEEE International Conference on Data Mining
(ICDM'17
Direct mining of subjectively interesting relational patterns
Data is typically complex and relational. Therefore, the development of relational data mining methods is an increasingly active topic of research. Recent work has resulted in new formalisations of patterns in relational data and in a way to quantify their interestingness in a subjective manner, taking into account the data analyst's prior beliefs about the data. Yet, a scalable algorithm to find the most interesting such patterns is lacking. We introduce a new algorithm based on two notions: (1) the use of Constraint Programming, which results in a notably shorter development time, faster runtimes, and more flexibility for extensions such as branch-and-bound search, and (2) the direct search for the most interesting patterns only, instead of exhaustive enumeration of patterns before ranking them. Through empirical evaluation, we find that our novel bounds yield speedups of up to several orders of magnitude, especially on dense data with a simple schema. This makes it possible to mine the most subjectively interesting relational patterns present in databases where this was previously impractical or impossible
The most persistent soft-clique in a set of sampled graphs
When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of the most persistent soft-clique. This is a subset of vertices that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness that essentially counts the number of edges missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique can be cast either as a) a max-min two-person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data, we show that the proposed method is able to reliably find soft cliques in graph data, even if the data is distorted by random noise or unreliable observations
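As a rough illustration of the clique-ness idea described in this abstract, the sketch below counts how many edges a vertex subset is missing to be a clique, and then reports the fraction of graph instances in which the subset is "almost" a clique. The function names `missing_edges` and `persistence` and the `max_missing` threshold are hypothetical and are not taken from the paper; the paper's actual measure and its max-min and min-min formulations are not reproduced here.

```python
from itertools import combinations


def missing_edges(vertices, edges):
    """Count the vertex pairs in `vertices` that are not joined by an edge.

    Illustrative clique-ness score (assumed, not the paper's exact definition):
    0 means the subset is a clique in this graph instance.
    """
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1
        for u, v in combinations(sorted(vertices), 2)
        if frozenset((u, v)) not in edge_set
    )


def persistence(vertices, graphs, max_missing):
    """Fraction of graph instances in which the subset misses at most
    `max_missing` edges (hypothetical helper for illustration)."""
    hits = sum(1 for edges in graphs if missing_edges(vertices, edges) <= max_missing)
    return hits / len(graphs)


# Three noisy observations of the same three-vertex graph.
graphs = [
    [(1, 2), (2, 3), (1, 3)],  # full triangle
    [(1, 2), (2, 3)],          # one edge dropped
    [(1, 2)],                  # two edges dropped
]
print(persistence({1, 2, 3}, graphs, max_missing=1))
```

A threshold on missing edges is only one simple way to read "almost fully connected"; the abstract instead casts the search as an optimization problem.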
A Triclustering Approach for Time Evolving Graphs
This paper introduces a novel technique to track structures in time evolving
graphs. The method is based on a parameter-free approach for three-dimensional
co-clustering of the source vertices, the target vertices and the time. All
these features are simultaneously segmented in order to build time segments and
clusters of vertices whose edge distributions are similar and evolve in the
same way over the time segments. The main novelty of this approach lies in that
the time segments are directly inferred from the evolution of the edge
distribution between the vertices, thus not requiring the user to make an a
priori discretization. Experiments conducted on a synthetic dataset illustrate
the good behaviour of the technique, and a study of a real-life dataset shows
the potential of the proposed approach for exploratory data analysis
Prefix-Projection Global Constraint for Sequential Pattern Mining
Sequential pattern mining under constraints is a challenging data mining
task. Many efficient ad hoc methods have been developed for mining sequential
patterns, but they all suffer from a lack of genericity. Recent works
have investigated Constraint Programming (CP) methods, but they are still not
effective because of their encoding. In this paper, we propose a global
constraint based on the projected databases principle which remedies this
drawback. Experiments show that our approach clearly outperforms CP approaches
and competes well with ad hoc methods on large datasets
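The projected databases principle mentioned in this abstract can be sketched with a small PrefixSpan-style miner. This is a minimal illustration under simplifying assumptions (each sequence is a list of single items, and no constraints other than minimum support are applied); it is not the global constraint proposed in the paper, and the name `prefix_span` is chosen here for illustration only.

```python
from collections import defaultdict


def prefix_span(sequences, min_support):
    """Return {pattern: support} for all frequent subsequences.

    Each projected database is stored as (sequence index, start position)
    pairs, so suffixes are never copied.
    """
    results = {}

    def mine(prefix, projected):
        # Count, for each item, the sequences whose suffix contains it.
        supporters = defaultdict(set)
        for sid, start in projected:
            for item in sequences[sid][start:]:
                supporters[item].add(sid)

        for item in sorted(supporters):
            support = len(supporters[item])
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support

            # Project on the first occurrence of `item` in each suffix.
            new_projected = []
            for sid, start in projected:
                seq = sequences[sid]
                try:
                    pos = seq.index(item, start)
                except ValueError:
                    continue
                new_projected.append((sid, pos + 1))
            mine(pattern, new_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


database = [list("abc"), list("acb"), list("ab")]
for pattern, support in sorted(prefix_span(database, min_support=2).items()):
    print("".join(pattern), support)
```

Growing patterns only from items that are frequent in the current projection is what lets this style of search avoid generating and checking candidates that cannot be frequent.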
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration. Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track
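To make "sampled proportional to a given quality measure" concrete, the sketch below draws itemsets with probability proportional to their support. It is a naive baseline that enumerates every frequent itemset first, which is exactly the step a sampler such as Flexics is meant to avoid; it does not use SAT-based sampling and gives no accuracy guarantees. The helper names `frequent_itemsets` and `sample_patterns` are hypothetical.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Exhaustively enumerate itemsets with support >= min_support.

    Exponential in the number of items; only suitable for tiny examples.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


def sample_patterns(quality, k, rng):
    """Draw k patterns with probability proportional to their quality."""
    patterns = list(quality)
    weights = [quality[p] for p in patterns]
    return rng.choices(patterns, weights=weights, k=k)


transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
quality = frequent_itemsets(transactions, min_support=2)
print(sample_patterns(quality, k=5, rng=random.Random(0)))
```

Here the quality measure is plain support and the only constraint is a minimum support threshold; other measures could be swapped in by changing the values stored in `quality`.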
BIG DATA MINING FOR INTERESTING PATTERNS WITH MAP REDUCE TECHNIQUE
Many data mining algorithms search for interesting patterns in transactional databases of precise data. Frequent pattern mining is a technique for finding frequently occurring items. Most techniques find all interesting patterns in a collection of precise data, where the items occurring in each transaction are known to the system with certainty. However, in many real-time applications, users are interested in only a tiny portion of the large set of frequent patterns. The proposed user-constrained mining approach helps to find the frequent patterns in which the user is interested. This approach efficiently finds these patterns by applying user constraints to collections of uncertain data. Users specify their interests in the form of constraints, and the MapReduce model is used to find the uncertain frequent patterns that satisfy the user-specified constraints
Discovering Knowledge from Local Patterns with Global Constraints
It is well known that local patterns are at the core of a lot of
knowledge which may be discovered from data. Nevertheless, use of local
patterns is limited by their huge number and computational costs. Several approaches (e.g.,
condensed representations, pattern set discovery) aim at grouping or
synthesizing local patterns to provide a global view of the data. A
global pattern is a pattern which is a set or a synthesis of local
patterns coming from the data. In this paper, we propose the idea of
global constraints to write queries addressing global patterns. A key
point is the ability to bias the design of global patterns according
to the expectations of the user. For instance, a global pattern can be
oriented towards the search for exceptions or a clustering. This requires
writing queries that take such biases into account. Open issues are to
design a generic framework to express powerful global constraints and
solvers to mine them. We think that global constraints are a promising
way to discover relevant global patterns
- …