An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming
The main advantage of Constraint Programming (CP) approaches for sequential
pattern mining (SPM) is their modularity, which includes the ability to add new
constraints (regular expressions, length restrictions, etc.). The current best
CP approach for SPM uses a global constraint (module) that computes the
projected database and enforces the minimum frequency; it does this with a
filtering algorithm similar to the PrefixSpan method. However, the resulting
system is not as scalable as some of the most advanced mining systems like
Zaki's cSPADE. We show how, using techniques from both data mining and CP, one
can use a generic constraint solver and yet outperform existing specialized
systems. This is mainly due to two improvements in the module that computes the
projected frequencies: first, computing the projected database can be sped up
by pre-computing the positions at which a symbol can become unsupported by a
sequence, thereby avoiding a scan of the full sequence each time; and second,
by
taking inspiration from the trailing used in CP solvers to devise a
backtracking-aware data structure that allows fast incremental storing and
restoring of the projected database. Detailed experiments show how this
approach outperforms existing CP as well as specialized systems for SPM, and
that the gain in efficiency translates directly into increased efficiency for
other settings such as mining with regular expressions. The data and software
related to this paper are available at http://sites.uclouvain.be/cp4dm/spm/
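The two improvements can be illustrated with a small sketch. The code below is not the paper's implementation: all names (`precompute_last_pos`, `projected_frequencies`, `ProjectionTrail`) are ours, and it assumes sequences of hashable symbols with projections represented by per-sequence start indices.

```python
# Hedged sketch of the two improvements described in the abstract;
# names and data layout are our own assumptions, not the paper's code.

def precompute_last_pos(db):
    """For each sequence, map each symbol to its last occurrence index.

    With this table, a symbol is still supported by sequence i in a
    projection starting at `start` iff last_pos[i][sym] >= start, so
    computing projected frequencies never rescans the suffix.
    """
    return [{sym: i for i, sym in enumerate(seq)} for seq in db]

def projected_frequencies(db, last_pos, starts):
    """Count, per symbol, how many projected sequences still support it."""
    freq = {}
    for i, seq in enumerate(db):
        if starts[i] >= len(seq):       # sequence dropped out of projection
            continue
        for sym, last in last_pos[i].items():
            if last >= starts[i]:       # O(1) support test via last position
                freq[sym] = freq.get(sym, 0) + 1
    return freq

class ProjectionTrail:
    """Backtracking-aware storage of projection start indices, in the
    spirit of CP-solver trailing: old values are pushed on a trail and
    popped on backtrack, so store/restore is incremental."""

    def __init__(self, n):
        self.starts = [0] * n
        self.trail = []                 # stack of (seq_index, old_start)

    def set_start(self, i, new_start):
        self.trail.append((i, self.starts[i]))
        self.starts[i] = new_start

    def mark(self):
        return len(self.trail)

    def backtrack(self, mark):
        while len(self.trail) > mark:
            i, old = self.trail.pop()
            self.starts[i] = old
```

In a search, extending the prefix with a symbol would move each sequence's start past that symbol's next occurrence via `set_start`, and `backtrack(mark)` undoes exactly those moves when the solver retracts the extension.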
Community Structure Characterization
This entry discusses the problem of describing communities identified in a
complex network of interest, in a way that allows them to be interpreted. We
assume the community structure has already been detected through one of the
many methods proposed in the literature. The question is then how to extract
valuable information from this first result, so as to enable human
interpretation. This requires subsequent processing, which we describe in the
rest of this entry.
A Novel Decision Tree Approach for the Handling of Time Series
Time series play a major role in many analysis tasks. As an example,
in the stock market, they can be used to model price histories
and to make predictions about future trends. Sometimes, information
contained in a time series is complemented by other kinds of
data, which may be encoded by static attributes, e.g., categorical
or numeric ones, or by more general discrete data sequences. In
this paper, we present J48SS, a novel decision tree learning algorithm
capable of natively mixing static, sequential, and time series
data for classification purposes. The proposed solution is based
on the well-known C4.5 decision tree learner, and it relies on the
concept of time series shapelets, which are generated by means
of multi-objective evolutionary computation techniques and, unlike
most previous approaches, are not required to be part of the training
set. We evaluate the algorithm on a set of well-known UCR time series
datasets, and we show that it provides better classification performance
than previous approaches based on decision trees, while generating highly
interpretable models and effectively reducing the data preparation effort.
Moreover, some preliminary insights suggest that J48SS trees may be
combined into relatively small ensemble models, providing even higher
classification accuracy, although at the price of a loss in
interpretability.
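As background on the shapelet tests such a tree uses, here is a minimal sketch of the standard shapelet distance (the minimum Euclidean distance between the shapelet and any same-length window of the series) and the resulting node test. This is textbook shapelet machinery, not J48SS's evolutionary search, and the function names are ours.

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between `shapelet` and any window of
    `series` of the same length (the standard shapelet distance; a
    sketch, not J48SS's actual implementation)."""
    m = len(shapelet)
    best = math.inf
    for i in range(len(series) - m + 1):
        # Squared distance between the shapelet and the window at i.
        d = sum((series[i + j] - shapelet[j]) ** 2 for j in range(m))
        best = min(best, d)
    return math.sqrt(best)

def shapelet_split(series, shapelet, threshold):
    """A shapelet test as used at a tree node: branch on whether the
    series contains a sufficiently close match to the shapelet."""
    return shapelet_distance(series, shapelet) <= threshold
```

A learner that evolves shapelets, as J48SS reportedly does, only needs this distance to score candidate shapelet/threshold pairs against the class labels at each node.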
J48S: A Sequence Classification Approach to Text Analysis Based on Decision Trees
Sequences play a major role in the extraction of information from data. As an example, in business intelligence, they can be used to track the evolution of customer behaviors over time or to model relevant relationships. In this paper, we focus on the domain of contact centers, where sequential data typically take the form of oral or written interactions and where word sequences often play a major role in text classification, and we investigate the connections between sequential data and text mining techniques.
The main contribution of the paper is a new machine learning algorithm, called J48S, that associates semantic knowledge with telephone conversations. The proposed solution is based on the well-known C4.5 decision tree learner, and it is natively able to mix static data, that is, numeric or categorical attributes, with sequential data, such as texts, for classification purposes. The algorithm, evaluated in a real business setting, is shown to provide competitive classification performance compared with classical approaches, while generating highly interpretable models and effectively reducing the data preparation effort.
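A common form of node test for a sequence-aware tree of this kind checks whether a frequent sequential pattern occurs, possibly with gaps, inside the input word sequence. The sketch below shows that generic test only; the function name is ours and this is not J48S's exact pattern-extraction pipeline.

```python
def contains_pattern(sequence, pattern):
    """True iff `pattern` occurs in `sequence` as an order-preserving
    (possibly gapped) subsequence -- the kind of test a sequence-aware
    decision-tree node can apply alongside ordinary attribute tests."""
    it = iter(sequence)
    # `item in it` advances the iterator past the first match, so pattern
    # symbols must be found in order; gaps between them are allowed.
    return all(item in it for item in pattern)
```

For example, `contains_pattern("the call was dropped twice".split(), ["call", "dropped"])` holds, while reversing the pattern does not, so such a test can route a conversation transcript down different branches of the tree.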