Discovering Knowledge using a Constraint-based Language
Discovering pattern sets or global patterns is an attractive topic for the
pattern mining community because such patterns provide useful information. By
combining local patterns that satisfy a joint meaning, this approach produces
higher-level patterns that are more useful to the data analyst than the usual
local patterns, while reducing the number of patterns. In parallel, recent works
investigating relationships between data mining and constraint programming (CP)
show that the CP paradigm is a suitable framework to model and mine such patterns
in a declarative and generic way. We present a constraint-based language which
enables us to define queries addressing pattern sets and global patterns. The
usefulness of such a declarative approach is highlighted by several examples
drawn from association-based clustering. This language has been
implemented in the CP framework.
Comment: 12 pages
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.
Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track)
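To make the idea of sampling patterns "proportional to a given quality measure" concrete, here is a minimal sketch for the simplest case, where the quality measure is frequency (support). It is not Flexics: it uses the well-known two-step procedure (draw a transaction with weight 2^|t|, then a uniform random subset of it), and the function names and toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def sample_itemset(transactions, rng):
    """Draw one itemset with probability proportional to its support.

    Step 1: pick a transaction t with weight 2^|t|.
    Step 2: return a uniformly random subset of t.
    An itemset X is then drawn with probability support(X) / Z, because each
    transaction containing X contributes (2^|t| / Z) * (1 / 2^|t|).
    The empty itemset is included in the sample space.
    """
    weights = [2 ** len(t) for t in transactions]
    t = rng.choices(transactions, weights=weights, k=1)[0]
    return frozenset(item for item in sorted(t) if rng.random() < 0.5)


def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical toy database of four transactions.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("a")]
rng = random.Random(0)
counts = Counter(sample_itemset(transactions, rng) for _ in range(20000))

# {a} has support 4 and {a, b} has support 2, so {a} should be drawn
# roughly twice as often as {a, b}.
ratio = counts[frozenset("a")] / counts[frozenset("ab")]
```

This sketch handles no constraints and only one quality measure, which is exactly the kind of limitation the abstract says a flexible sampler is meant to remove.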
Discovering Knowledge from Local Patterns with Global Constraints
It is well known that local patterns are at the core of much of the
knowledge which may be discovered from data. Nevertheless, the use of local
patterns is limited by their huge number and computational costs. Several
approaches (e.g.,
condensed representations, pattern set discovery) aim at grouping or
synthesizing local patterns to provide a global view of the data. A
global pattern is a pattern which is a set or a synthesis of local
patterns coming from the data. In this paper, we propose the idea of
global constraints to write queries addressing global patterns. A key
point is the ability to bias the design of global patterns toward
the expectations of the user. For instance, a global pattern can be
oriented towards the search for exceptions or a clustering. This requires
writing queries that take such biases into account. Open issues are to
design a generic framework to express powerful global constraints and
solvers to mine them. We think that global constraints are a promising
way to discover relevant global patterns.
Extraction sous Contraintes d'Ensembles de Cliques Homogènes
Document on the LIRIS site: http://liris.cnrs.fr/Documents/Liris-4915.pdf
We propose a data mining method for graphs in which each vertex has an associated set of labels. One application, for example, is analyzing a social network of co-authoring researchers in which the labels indicate the conferences where they publish. We define the constraint-based extraction of sets of cliques such that every vertex of the cliques involved shares sufficiently many labels. We propose a method to compute all Maximal Homogeneous Clique Sets that satisfy a conjunction of constraints set by the analyst, concerning the number of separate cliques, the size of the cliques, and the number of shared labels. Experiments show that the approach works on large graphs built from real data and brings out interesting structures.
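As an illustration of the kind of constraint involved, the sketch below checks a single candidate vertex set against a size constraint and a shared-label constraint. It is a simplified assumption, not the method of the paper: it tests one clique rather than enumerating maximal sets of cliques, and the function name, parameters, and toy co-author graph are hypothetical.

```python
from itertools import combinations


def is_homogeneous_clique(vertices, edges, labels, min_size, min_shared):
    """Return True if `vertices` form a clique of at least `min_size` vertices
    whose members all share at least `min_shared` labels."""
    vertices = list(vertices)
    if not vertices or len(vertices) < min_size:
        return False
    # Clique test: every pair of vertices must be connected by an edge.
    if any(frozenset(pair) not in edges for pair in combinations(vertices, 2)):
        return False
    # Homogeneity test: labels common to all vertices of the clique.
    shared = set.intersection(*(set(labels[v]) for v in vertices))
    return len(shared) >= min_shared


# Hypothetical co-author graph: vertices are researchers, labels are venues.
edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
labels = {
    "a": {"KDD", "ICDM"},
    "b": {"KDD", "ICDM", "ECML"},
    "c": {"KDD", "ICDM"},
    "d": {"VLDB"},
}

triangle_ok = is_homogeneous_clique("abc", edges, labels, min_size=3, min_shared=2)
pair_ok = is_homogeneous_clique("cd", edges, labels, min_size=2, min_shared=1)
```

Here the triangle a, b, c satisfies both constraints, while the edge c, d is a clique whose endpoints share no label.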
Gibbs sampling subjectively interesting tiles
The local pattern mining literature has long struggled with the so-called pattern explosion problem: the size of the set of patterns found exceeds the size of the original data. This causes computational problems (enumerating a large set of patterns will inevitably take a substantial amount of time) as well as problems for interpretation and usability (trawling through a large set of patterns is often impractical). Two complementary research lines aim to address this problem. The first aims to develop better measures of interestingness, in order to reduce the number of uninteresting patterns that are returned [6, 10]. The second aims to avoid an exhaustive enumeration of all 'interesting' patterns (where interestingness is quantified in a more traditional way, e.g. frequency), by directly sampling from this set in such a way that more 'interesting' patterns are sampled with higher probability [2]. Unfortunately, the first research line does not reduce computational cost, while the second may miss out on the most interesting patterns. In this paper, we combine the best of both worlds for mining interesting tiles [8] from binary databases. Specifically, we propose a new pattern sampling approach based on Gibbs sampling, where the probability of sampling a pattern is proportional to its subjective interestingness [6], an interestingness measure reported to better represent true interestingness. The experimental evaluation confirms the theory, but also reveals an important weakness of the proposed approach, which we speculate is shared with any other pattern sampling approach. We thus conclude with a broader discussion of this issue and a forward look.
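The general mechanism of Gibbs sampling over tiles can be sketched as follows. This is an illustration under stated assumptions, not the sampler of the paper: a tile is represented as a pair (set of rows, set of columns), and the target distribution is proportional to exp(log_quality), where log_quality is a hypothetical stand-in (ones covered minus twice the zeros covered) rather than the subjective interestingness measure.

```python
import math
import random


def log_quality(data, rows, cols):
    """Hypothetical stand-in score: +1 per covered one, -2 per covered zero."""
    ones = sum(data[r][c] for r in rows for c in cols)
    zeros = len(rows) * len(cols) - ones
    return ones - 2.0 * zeros


def gibbs_tiles(data, n_sweeps, rng):
    """Gibbs sampler over tiles; returns one tile per full sweep.

    Each row and column has a binary membership indicator. One sweep
    resamples every indicator from its conditional distribution given
    all the others, under a target proportional to exp(log_quality).
    """
    n_rows, n_cols = len(data), len(data[0])
    rows, cols = set(), set()
    samples = []
    for _ in range(n_sweeps):
        for index_count, members in ((n_rows, rows), (n_cols, cols)):
            for i in range(index_count):
                members.discard(i)
                log_without = log_quality(data, rows, cols)
                members.add(i)
                log_with = log_quality(data, rows, cols)
                # P(include i | rest) = q_with / (q_with + q_without)
                p_with = 1.0 / (1.0 + math.exp(log_without - log_with))
                if rng.random() >= p_with:
                    members.discard(i)
        samples.append((frozenset(rows), frozenset(cols)))
    return samples


# Hypothetical 3x3 binary database with a dense 2x2 block in the top-left.
data = [[1, 1, 0],
        [1, 1, 0],
        [0, 0, 1]]
samples = gibbs_tiles(data, n_sweeps=200, rng=random.Random(0))
```

Consecutive samples from such a chain are correlated, so in practice one would typically discard an initial burn-in and thin the remaining samples.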
FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery
Subgroup discovery (SD) is the task of discovering interpretable patterns in the data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate a class from the others is one of the most common SD tasks. Standard approaches in the literature are based on local pattern discovery, which is known to provide an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and discriminates the target class well. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery where both discriminative power and diversity of the subgroup set are incorporated in the same quality measure. We propose an efficient and parameter-free algorithm, dubbed FSSD, based on a greedy scheme. FSSD uses several optimization strategies that enable it to efficiently provide a high-quality pattern set in a short amount of time.
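A greedy scheme for selecting a subgroup set under a single set-level quality measure can be sketched as below. This is not the FSSD algorithm: the abstract does not specify its quality measure or optimizations, so the sketch assumes, for illustration only, that a set is scored by the weighted relative accuracy (WRAcc) of the union of its members' covers. All names and the toy data are hypothetical.

```python
def wracc(cover, positives, n):
    """Weighted relative accuracy of a cover (a set of row indices)."""
    if not cover:
        return 0.0
    coverage = len(cover) / n
    precision = len(cover & positives) / len(cover)
    return coverage * (precision - len(positives) / n)


def greedy_subgroup_set(candidates, positives, n):
    """Repeatedly add the candidate that yields the best score for the union
    of covers; stop as soon as no candidate strictly improves the score."""
    selected, union, best = [], set(), 0.0
    while True:
        scored = [
            (wracc(union | cover, positives, n), name)
            for name, cover in candidates.items()
            if name not in selected
        ]
        if not scored:
            break
        score, name = max(scored)
        if score <= best:
            break
        selected.append(name)
        union |= candidates[name]
        best = score
    return selected, best


# Hypothetical data: 10 rows, rows 0..4 belong to the target class.
positives = {0, 1, 2, 3, 4}
candidates = {
    "A": {0, 1, 2},
    "B": {3, 4, 5},
    "C": {0, 1, 2, 6, 7},
    "D": {0, 1},
}
selected, best = greedy_subgroup_set(candidates, positives, n=10)
```

On this toy input, subgroup D is never selected because its cover is already contained in A, which shows how scoring the set as a whole penalizes redundant patterns.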