
    Community Structure Characterization

    This entry discusses the problem of describing communities identified in a complex network of interest in a way that allows them to be interpreted. We suppose the community structure has already been detected through one of the many methods proposed in the literature. The question is then how to extract valuable information from this first result in order to allow human interpretation. This requires subsequent processing, which we describe in the rest of this entry.

    An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming

    The main advantage of Constraint Programming (CP) approaches for sequential pattern mining (SPM) is their modularity, which includes the ability to add new constraints (regular expressions, length restrictions, etc.). The current best CP approach for SPM uses a global constraint (module) that computes the projected database and enforces the minimum frequency; it does this with a filtering algorithm similar to the PrefixSpan method. However, the resulting system is not as scalable as some of the most advanced mining systems, such as Zaki's cSPADE. We show how, using techniques from both data mining and CP, one can use a generic constraint solver and yet outperform existing specialized systems. This is mainly due to two improvements in the module that computes the projected frequencies. First, computing the projected database can be sped up by pre-computing the positions at which a symbol can become unsupported by a sequence, thereby avoiding a scan of the full sequence each time. Second, taking inspiration from the trailing used in CP solvers, we devise a backtracking-aware data structure that allows fast incremental storing and restoring of the projected database. Detailed experiments show that this approach outperforms existing CP as well as specialized systems for SPM, and that the gain in efficiency translates directly into increased efficiency for other settings such as mining with regular expressions. The data and software related to this paper are available at http://sites.uclouvain.be/cp4dm/spm/
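
The two improvements named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TrailedProjection`, the helper `last_positions`, and the function `mine` are hypothetical names, sequences are assumed to be plain lists of single symbols, and the search is an ordinary PrefixSpan-style depth-first enumeration rather than a CP solver. The sketch precomputes the last occurrence of each symbol per sequence, so that support can be counted without scanning suffixes, and it stores projected databases as blocks appended to shared arrays, so that backtracking only truncates those arrays.

```python
from typing import Dict, List, Tuple


def last_positions(db: List[List[str]]) -> List[Dict[str, int]]:
    """For each sequence, map every symbol to the index of its last occurrence."""
    return [{sym: i for i, sym in enumerate(seq)} for seq in db]


class TrailedProjection:
    """Projected databases stored as stacked blocks (a trail-like layout)."""

    def __init__(self, db: List[List[str]]) -> None:
        self.db = db
        self.last = last_positions(db)
        # Initial block: every sequence, each starting at position 0.
        self.sids = list(range(len(db)))
        self.poss = [0] * len(db)
        self.blocks: List[Tuple[int, int]] = [(0, len(db))]  # (start, size)

    def support(self) -> Dict[str, int]:
        """Count, per symbol, the sequences of the current block that still support it."""
        start, size = self.blocks[-1]
        counts: Dict[str, int] = {}
        for k in range(start, start + size):
            sid, pos = self.sids[k], self.poss[k]
            # A symbol is still supported iff its last occurrence is at or after pos.
            for sym, last in self.last[sid].items():
                if last >= pos:
                    counts[sym] = counts.get(sym, 0) + 1
        return counts

    def push(self, sym: str) -> None:
        """Extend the prefix by sym: append a new block after the current one."""
        start, size = self.blocks[-1]
        new_start = len(self.sids)
        for k in range(start, start + size):
            sid, pos = self.sids[k], self.poss[k]
            if self.last[sid].get(sym, -1) >= pos:
                nxt = self.db[sid].index(sym, pos)  # first occurrence at or after pos
                self.sids.append(sid)
                self.poss.append(nxt + 1)
        self.blocks.append((new_start, len(self.sids) - new_start))

    def pop(self) -> None:
        """Backtrack: drop the most recent block by truncating the arrays."""
        if len(self.blocks) == 1:
            raise IndexError("cannot pop the initial projection")
        start, _ = self.blocks.pop()
        del self.sids[start:]
        del self.poss[start:]


def mine(db: List[List[str]], min_support: int) -> List[Tuple[List[str], int]]:
    """Enumerate frequent sequential patterns depth-first."""
    proj = TrailedProjection(db)
    results: List[Tuple[List[str], int]] = []

    def search(prefix: List[str]) -> None:
        for sym, count in sorted(proj.support().items()):
            if count >= min_support:
                proj.push(sym)
                results.append((prefix + [sym], count))
                search(prefix + [sym])
                proj.pop()

    search([])
    return results
```

Calling `mine(db, min_support)` on a small list of symbol sequences returns each frequent pattern together with its support. Because restoring a projected database is a truncation, the cost of backtracking here does not depend on the number of sequences.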

    A Novel Decision Tree Approach for the Handling of Time Series

    Time series play a major role in many analysis tasks. As an example, in the stock market, they can be used to model price histories and to make predictions about future trends. Sometimes, information contained in a time series is complemented by other kinds of data, which may be encoded by static attributes, e.g., categorical or numeric ones, or by more general discrete data sequences. In this paper, we present J48SS, a novel decision tree learning algorithm capable of natively mixing static, sequential, and time series data for classification purposes. The proposed solution is based on the well-known C4.5 decision tree learner, and it relies on the concept of time series shapelets, which are generated by means of multi-objective evolutionary computation techniques and, unlike in most previous approaches, are not required to be part of the training set. We evaluate the algorithm against a set of well-known UCR time series datasets, and we show that it provides better classification performance than previous approaches based on decision trees, while generating highly interpretable models and effectively reducing the data preparation effort. Moreover, some preliminary insights suggest that J48SS trees may be combined in relatively small ensemble models, providing even higher classification accuracy, although at the price of a loss in interpretability.
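
For readers unfamiliar with shapelets, the following minimal sketch shows how a shapelet is commonly turned into a numeric feature that a decision tree can split on. It is not taken from J48SS: the function name `shapelet_distance` is hypothetical, and the choice of Euclidean distance without normalization is an assumption, since the abstract does not state which distance the algorithm uses.

```python
import math
from typing import Sequence


def shapelet_distance(series: Sequence[float], shapelet: Sequence[float]) -> float:
    """Smallest Euclidean distance between the shapelet and any
    contiguous window of the series with the same length."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        # Squared distance between the shapelet and the window starting here.
        d = sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        best = min(best, d)
    return math.sqrt(best)
```

A tree node could then test whether `shapelet_distance(series, shapelet)` falls below a learned threshold, which is what makes such splits readable: the series either contains a close match to the shape or it does not.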

    J48S: A Sequence Classification Approach to Text Analysis Based on Decision Trees

    Sequences play a major role in the extraction of information from data. As an example, in business intelligence, they can be used to track the evolution of customer behaviors over time or to model relevant relationships. In this paper, we focus our attention on the domain of contact centers, where sequential data typically take the form of oral or written interactions and word sequences often play a major role in text classification, and we investigate the connections between sequential data and text mining techniques. The main contribution of the paper is a new machine learning algorithm, called J48S, that associates semantic knowledge with telephone conversations. The proposed solution is based on the well-known C4.5 decision tree learner, and it is natively able to mix static data, that is, numeric or categorical attributes, with sequential data, such as texts, for classification purposes. The algorithm, evaluated in a real business setting, is shown to provide competitive classification performance compared with classical approaches, while generating highly interpretable models and effectively reducing the data preparation effort.