15 research outputs found

A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database

Author: Charnois Thierry
Loudni Samir
Métivier Jean-Philippe
Publication date: 23/09/2013

Constraint-based pattern discovery is at the core of numerous data mining tasks. Patterns are extracted with respect to a given set of constraints (frequency, closedness, size, etc.). In the context of sequential pattern mining, a large number of dedicated techniques have been developed for solving particular classes of constraints. The aim of this paper is to investigate the use of Constraint Programming (CP) to model and mine sequential patterns in a sequence database. Our CP approach offers a natural way to combine, within a single framework, a large set of constraints of various origins. Experiments show the feasibility and the interest of our approach.

arXiv.org e-Print Archive

HAL - Normandie Université

HAL-Paris 13
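
As a small illustration of the kind of constraints named in the abstract of the sequential pattern mining entry above (frequency and size), the following Python sketch counts how many sequences of a toy database contain a pattern as a subsequence. It is a plain enumeration check, not the CP model of the paper; the function names, the toy database, and the thresholds are all assumptions made for this example.

```python
from typing import List, Sequence


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so matches must occur left to right.
    return all(item in it for item in pattern)


def support(pattern: Sequence[str], database: List[Sequence[str]]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


def satisfies(pattern: Sequence[str], database: List[Sequence[str]],
              min_support: int, max_size: int) -> bool:
    """Conjunction of a frequency constraint and a size constraint (hypothetical thresholds)."""
    return len(pattern) <= max_size and support(pattern, database) >= min_support


# Toy sequence database, invented for illustration.
database = [
    ["a", "b", "c", "d"],
    ["a", "c", "b"],
    ["b", "a", "c"],
    ["d", "a"],
]

print(support(["a", "c"], database))
print(satisfies(["a", "b"], database, min_support=3, max_size=2))
```

Further constraints (closedness, gaps, and so on) would be added as extra conjuncts in `satisfies`; a CP solver would instead propagate them during search rather than test candidates one by one.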

Reductions for Frequency-Based Data Mining Problems

Author: Miettinen Pauli
Neumann Stefan
Publication date: 01/01/2017

Studying the computational complexity of problems is one of the - if not the - fundamental questions in computer science. Yet, surprisingly little is known about the computational complexity of many central problems in data mining. In this paper we study frequency-based problems and propose a new type of reduction that allows us to compare the complexities of the maximal frequent pattern mining problems in different domains (e.g. graphs or sequences). Our results extend those of Kimelfeld and Kolaitis [ACM TODS, 2014] to a broader range of data mining problems. Our results show that, by allowing constraints in the pattern space, the complexities of many maximal frequent pattern mining problems collapse. These problems include maximal frequent subgraphs in labelled graphs, maximal frequent itemsets, and maximal frequent subsequences with no repetitions. In addition to theoretical interest, our results might yield more efficient algorithms for the studied problems. Comment: This is an extended version of a paper of the same title to appear in the Proceedings of the 17th IEEE International Conference on Data Mining (ICDM'17).

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Constraint Programming for Multi-criteria Conceptual Clustering

Author: B Ganter
J Motwani
L Hossain
M Khiari
MM Ahmad
N Lazaar
N Pasquier
P Schaus
T Guns
T Uno
TBH Dao
W Ugarte
YC Law
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 28/08/2017

A conceptual clustering is a set of formal concepts (i.e., closed itemsets) that defines a partition of a set of transactions. Finding a conceptual clustering is an NP-complete problem for which Constraint Programming (CP) and Integer Linear Programming (ILP) approaches have been recently proposed. We introduce new CP models to solve this problem: a pure CP model that uses set constraints, and a hybrid model that uses a data mining tool to extract formal concepts in a preprocessing step and then uses CP to select a subset of formal concepts that defines a partition. We compare our new models with recent CP and ILP approaches on classical machine learning instances. We also introduce a new set of instances coming from a real application case, which aims at extracting setting concepts from an Enterprise Resource Planning (ERP) software. We consider two classic criteria to optimize, i.e., the frequency and the size. We show that these criteria lead to extreme solutions with either very few small formal concepts or many large formal concepts, and that compromise clusterings may be obtained by computing the Pareto front of non-dominated clusterings.

Crossref
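
The definition in the abstract of the conceptual clustering entry above (a set of formal concepts whose extents partition the transactions) can be checked directly on a toy dataset. The Python sketch below only verifies a candidate clustering; it is neither the CP model nor the hybrid model of the paper, and the helper names and the data are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Tids = FrozenSet[int]


def cover(itemset: Itemset, transactions: Dict[int, Itemset]) -> Tids:
    """Ids of the transactions that contain every item of the itemset."""
    return frozenset(t for t, items in transactions.items() if itemset <= items)


def closure(tids: Tids, transactions: Dict[int, Itemset]) -> Itemset:
    """Items shared by every transaction in tids (tids must be non-empty)."""
    return frozenset.intersection(*(transactions[t] for t in tids))


def is_formal_concept(itemset: Itemset, tids: Tids,
                      transactions: Dict[int, Itemset]) -> bool:
    """A pair (itemset, tids) is a formal concept if each component determines the other."""
    return (bool(tids)
            and cover(itemset, transactions) == tids
            and closure(tids, transactions) == itemset)


def is_conceptual_clustering(concepts: List[Tuple[Itemset, Tids]],
                             transactions: Dict[int, Itemset]) -> bool:
    """All pairs are formal concepts and every transaction is covered exactly once."""
    covered = [t for _, tids in concepts for t in tids]
    return (all(is_formal_concept(i, t, transactions) for i, t in concepts)
            and sorted(covered) == sorted(transactions))


# Toy transactions, invented for illustration.
transactions = {
    1: frozenset("ab"),
    2: frozenset("abc"),
    3: frozenset("cd"),
    4: frozenset("d"),
}
good = [(frozenset("ab"), frozenset({1, 2})), (frozenset("d"), frozenset({3, 4}))]
bad = [(frozenset("ab"), frozenset({1, 2})), (frozenset("c"), frozenset({2, 3}))]

print(is_conceptual_clustering(good, transactions))
print(is_conceptual_clustering(bad, transactions))
```

In `bad`, both pairs are valid formal concepts, but transaction 2 is covered twice and transaction 4 not at all, so the extents do not form a partition.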

FSSD - A Fast and Efficient Algorithm for Subgroup Set Discovery

Author: Belfodil Adnene
Belfodil Aimene
Bendimerad Anes
Kaytoue Mehdi
Lamarre Philippe
Plantevit Marc
Robardet Céline
Publication venue: HAL CCSD
Publication date: 05/10/2019

Subgroup discovery (SD) is the task of discovering interpretable patterns in the data that stand out w.r.t. some property of interest. Discovering patterns that accurately discriminate a class from the others is one of the most common SD tasks. Standard approaches in the literature are based on local pattern discovery, which is known to provide an overwhelmingly large number of redundant patterns. To solve this issue, pattern set mining has been proposed: instead of evaluating the quality of patterns separately, one should consider the quality of a pattern set as a whole. The goal is to provide a small pattern set that is diverse and discriminates the target class well. In this work, we introduce a novel formulation of the task of diverse subgroup set discovery where both discriminative power and diversity of the subgroup set are incorporated in the same quality measure. We propose an efficient and parameter-free algorithm, dubbed FSSD, based on a greedy scheme. FSSD uses several optimization strategies that make it possible to efficiently provide a high-quality pattern set in a short amount of time.

Crossref

HAL

Hal-Diderot

Cost-optimal constrained correlation clustering via weighted partial Maximum Satisfiability

Author: Berg Jeremias
Järvisalo Matti
Publication date: 01/03/2017

Peer reviewed

Crossref

Helsingin yliopiston digitaalinen arkisto

Pattern Set Mining with Schema-based Constraint

Author: Bruno Giulia
Cagliero Luca
Chiusano Silvia
Garza Paolo
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

Pattern set mining entails discovering groups of frequent itemsets that represent potentially relevant knowledge. Global constraints are commonly enforced to focus the analysis on the most interesting pattern sets. However, these constraints evaluate and select each pattern set individually based on its itemset characteristics. This paper extends traditional global constraints by proposing a novel constraint, called schema-based constraint, tailored to relational data. When coping with relational data, itemsets consist of sets of items belonging to distinct data attributes, which constitute the itemset schema. The schema-based constraint allows us to effectively combine all the itemsets that are semantically correlated with each other into a unique pattern set, while filtering out those pattern sets covering a mixture of different data facets or giving a partial view of a single facet. Specifically, it selects all the pattern sets that are (i) composed only of frequent itemsets with the same schema and (ii) characterized by maximal size among those corresponding to that schema. Since existing approaches are unable to select one representative pattern set per schema in a single extraction, we propose a new Apriori-based algorithm to efficiently mine pattern sets satisfying the schema-based constraint. The experimental results achieved on both real and synthetic datasets demonstrate the efficiency and effectiveness of our approach.

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Hybrid ASP-based Approach to Pattern Mining

Author: Miettinen Pauli
Paramonov Sergey
Stepanova Daria
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2017

Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in the form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, on developing generic mining systems, the existing methods focus either on scalability or on generality. In this paper we make steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset, sequence and graph mining which exploits dedicated highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. To further demonstrate the generic nature of our hybrid framework we apply it to a problem of approximately tiling a database. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains for itemset, sequence and graph mining, as well as approximate tiling. Under consideration in Theory and Practice of Logic Programming (TPLP). Comment: 29 pages, 7 figures, 5 tables.

arXiv.org e-Print Archive

MPG.PuRe

The Minimum Description Length Principle for Pattern Mining: A Survey

Author: Galbrun Esther
Publication date: 28/07/2021

This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.

arXiv.org e-Print Archive

Relational data factorization

Author: Leeuwen M. van
Paramonov S.
Raedt L. de
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 08/08/2017

Motivated by an analogy with matrix factorization, we introduce the problem of factorizing relational data. In matrix factorization, one is given a matrix and has to factorize it as a product of other matrices. In relational data factorization, the task is to factorize a given relation as a conjunctive query over other relations, i.e., as a combination of natural join operations. Given a conjunctive query and the input relation, the problem is to compute the extensions of the relations used in the query. Thus, relational data factorization is a relational analog of matrix factorization; it is also a form of inverse querying as one has to compute the relations in the query from the result of the query. The result of relational data factorization is neither necessarily unique nor required to be a lossless decomposition of the original relation. Therefore, constraints can be imposed on the desired factorization and a scoring function is used to determine its quality (often similarity to the original data). Relational data factorization is thus a constraint satisfaction and optimization problem. We show how answer set programming can be used for solving relational data factorization problems. Algorithms and the Foundations of Software technology

Leiden University Scholary Publications
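
To make the idea in the abstract of the relational data factorization entry above concrete, the Python sketch below joins two candidate factor relations with a natural join and scores the result against the original relation. It uses plain Python rather than answer set programming, and the scoring function (size of the symmetric difference), the attribute names, and the data are assumptions chosen for this example, not taken from the paper.

```python
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Row = Dict[str, Hashable]


def natural_join(r: List[Row], s: List[Row]) -> List[Row]:
    """Natural join: combine rows that agree on all attributes they share."""
    out = []
    for x in r:
        for y in s:
            shared = x.keys() & y.keys()
            if all(x[a] == y[a] for a in shared):
                out.append({**x, **y})
    return out


def as_set(relation: List[Row]) -> Set[FrozenSet[Tuple[str, Hashable]]]:
    """Order-independent, duplicate-free view of a relation."""
    return {frozenset(row.items()) for row in relation}


def score(original: List[Row], reconstructed: List[Row]) -> int:
    """Hypothetical quality measure: tuples missing or spurious (lower is better)."""
    return len(as_set(original) ^ as_set(reconstructed))


# Toy relation over attributes a, b, c and two candidate factors over (a, b) and (b, c).
original = [
    {"a": 1, "b": "x", "c": "p"},
    {"a": 2, "b": "x", "c": "p"},
    {"a": 3, "b": "y", "c": "q"},
]
r = [{"a": 1, "b": "x"}, {"a": 2, "b": "x"}, {"a": 3, "b": "y"}]
s = [{"b": "x", "c": "p"}, {"b": "y", "c": "q"}]
s_noisy = s + [{"b": "x", "c": "q"}]

print(score(original, natural_join(r, s)))
print(score(original, natural_join(r, s_noisy)))
```

The first factorization reproduces the original relation exactly, while the extra tuple in `s_noisy` introduces two spurious joined rows. A solver-based approach would search over the contents of `r` and `s` to optimize such a score under user constraints.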