14,652 research outputs found

    Free-rider Episode Screening via Dual Partition Model

    One of the drawbacks of frequent episode mining is that overwhelmingly many of the discovered patterns are redundant. A free-rider episode, a typical example, consists of a real pattern doped with additional noise events. Because the embedded noise events may themselves have high support, such free-rider episodes can have abnormally high support and therefore cannot be filtered out by a purely frequency-based framework. An effective technique for filtering free-rider episodes is to use a partition model that divides an episode into two consecutive subepisodes and compares the observed support of the episode with its expected support under the assumption that the two subepisodes occur independently. In this paper, we take more complex subepisodes into consideration and develop a novel partition model named EDP for filtering free-rider episodes from a given set of episodes. It combines (1) a dual partition strategy that divides an episode into an underlying real pattern and potential noise, and (2) a novel definition of the expected support of a free-rider episode based on the proposed partition strategy. We deem an episode interesting if its observed support is substantially higher than the expected support estimated by our model. Experiments on synthetic and real-world datasets demonstrate that EDP can effectively filter free-rider episodes compared with existing state-of-the-art methods. Comment: The 23rd International Conference on Database Systems for Advanced Applications (DASFAA 2018), 16 pages
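
    As a rough illustration of the kind of test such partition models perform (not the EDP dual-partition model itself), the sketch below scores an episode by comparing its observed support with the support expected if its two subepisodes occurred independently; the function names and the window-based probability estimate are illustrative assumptions.

```python
# A minimal sketch of the classic two-way partition test that the paper builds on
# (not the EDP dual-partition model itself). Names and the window-based
# probability estimate are illustrative assumptions.

def lift_score(obs_support, prefix_support, suffix_support, num_windows):
    """Ratio of observed support of an episode to its expected support,
    assuming the prefix and suffix subepisodes occur independently."""
    p_prefix = prefix_support / num_windows
    p_suffix = suffix_support / num_windows
    expected = num_windows * p_prefix * p_suffix
    return obs_support / expected if expected > 0 else float("inf")

# An episode whose observed support barely exceeds the independence
# expectation (score near 1) is a likely free-rider and can be filtered out.
print(lift_score(obs_support=120, prefix_support=400, suffix_support=300, num_windows=1000))
```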

    Discovering Compressing Serial Episodes from Event Sequences

    Most pattern mining methods output a very large number of frequent patterns, and isolating a small but relevant subset is a challenging problem of current interest in frequent pattern mining. In this paper we consider discovery of a small set of relevant frequent episodes from data sequences. We use the Minimum Description Length principle to formulate the problem of selecting a subset of episodes. Using an interesting class of serial episodes with inter-event constraints and a novel encoding scheme for data using such episodes, we present algorithms for discovering a small set of episodes that achieves good data compression. Using the example of data streams obtained from distributed sensors in a composable coupled conveyor system, we show that our method is very effective in unearthing highly relevant episodes and that our scheme also achieves good data compression. Comment: 27 pages, 3 figures
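
    The following toy sketch conveys the general MDL flavor of such episode selection under heavily simplified assumptions: the model cost is just the total length of the chosen episodes and the data cost is the number of events left uncovered by their occurrences. It is not the paper's encoding scheme (which handles inter-event constraints), and all names and the greedy search are illustrative.

```python
# A toy two-part MDL selection sketch: choose episodes that reduce
# total description length = model cost + data cost. Deliberately simplified;
# not the paper's actual encoding scheme.

def cover(sequence, episode):
    """Greedily mark events covered by non-overlapping occurrences of a serial episode."""
    covered, pos, match = set(), 0, []
    for i, ev in enumerate(sequence):
        if ev == episode[pos]:
            match.append(i)
            pos += 1
            if pos == len(episode):      # one full occurrence matched
                covered.update(match)
                pos, match = 0, []
    return covered

def description_length(sequence, chosen):
    covered = set()
    for ep in chosen:
        covered |= cover(sequence, ep)
    model_cost = sum(len(ep) for ep in chosen)      # cost of encoding the episodes
    data_cost = len(sequence) - len(covered)        # events encoded "as is"
    return model_cost + data_cost

def greedy_select(sequence, candidates):
    chosen, best = [], description_length(sequence, [])
    improved = True
    while improved:
        improved = False
        for ep in candidates:
            if ep in chosen:
                continue
            dl = description_length(sequence, chosen + [ep])
            if dl < best:
                best, pick, improved = dl, ep, True
        if improved:
            chosen.append(pick)
    return chosen

seq = list("abxabyabzcd")
print(greedy_select(seq, [("a", "b"), ("c", "d"), ("x", "y")]))   # -> [('a', 'b')]
```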

    Temporal data mining for root-cause analysis of machine faults in automotive assembly lines

    Engine assembly is a complex and heavily automated distributed-control process, with large amounts of fault data logged every day. We describe an application of temporal data mining for analyzing fault logs in an engine assembly plant. The frequent episode discovery framework is a model-free method that can be used to deduce (temporal) correlations among events from the logs in an efficient manner. In addition to being theoretically elegant and computationally efficient, frequent episodes are also easy to interpret in the form of actionable recommendations. Incorporating domain-specific information is critical to the successful application of the method for analyzing fault logs in the manufacturing domain. We show how domain-specific knowledge can be incorporated using heuristic rules that act as pre-filters and post-filters to frequent episode discovery. The system described here is currently being used in one of the engine assembly plants of General Motors and is planned for adaptation in other plants. To the best of our knowledge, this paper presents the first real, large-scale application of temporal data mining in the manufacturing domain. We believe that the ideas presented in this paper can help practitioners engineer tools for analysis in other similar or related application domains as well.

    Inferring Neuronal Network Connectivity using Time-constrained Episodes

    Discovering frequent episodes in event sequences is an interesting data mining task. In this paper, we argue that this framework is very effective for analyzing multi-neuronal spike train data. Analyzing spike train data is an important problem in neuroscience, though no data mining approaches have been reported for it so far. Motivated by this application, we introduce different temporal constraints on the occurrences of episodes. We present algorithms for discovering frequent episodes under such temporal constraints. Through simulations, we show that our method is very effective for analyzing spike train data and for unearthing the underlying connectivity patterns. Comment: 9 pages. See also http://neural-code.cs.vt.edu

    Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application

    In this paper we address the problem of discovering a small set of frequent serial episodes from sequential data so as to adequately characterize or summarize the data. We discuss an algorithm based on the Minimum Description Length (MDL) principle; the algorithm is a slight modification of an earlier method called CSC-2. We present a novel generative model for sequence data containing prominent pairs of serial episodes and, using this, provide some statistical justification for the algorithm. We believe this is the first instance of such a statistical justification for an MDL-based algorithm for summarizing event sequence data. We then present a novel application of this data mining algorithm to text classification. By treating text documents as temporal sequences of words, the data mining algorithm can find a set of characteristic episodes for the training data as a whole. The words that are part of these characteristic episodes can then be taken as the only relevant words for the dictionary, resulting in a considerably reduced feature vector dimension. We show, through simulation experiments on benchmark data sets, that the discovered frequent episodes can be used to achieve a more than four-fold reduction in dictionary size without losing any classification accuracy. Comment: 12 pages. Under review for IEEE TKDE
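
    A minimal sketch of the dictionary-reduction step, assuming the episode miner has already produced a set of characteristic word episodes (the episodes and words below are made up): only words occurring in those episodes are kept as features.

```python
# Reduce the bag-of-words vocabulary to words appearing in the discovered
# characteristic episodes. The episodes here are illustrative placeholders,
# not output of the paper's miner.

from collections import Counter

episodes = [("interest", "rate"), ("oil", "price", "rise")]   # assumed miner output
vocabulary = sorted({word for ep in episodes for word in ep})

def features(document_words):
    counts = Counter(w for w in document_words if w in vocabulary)
    return [counts[w] for w in vocabulary]

print(vocabulary)
print(features("the oil price may rise as the interest rate falls".split()))
```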

    Mining Closed Strict Episodes

    Discovering patterns in a sequence is an important aspect of data mining. One popular choice of such patterns are episodes: patterns in sequential data describing events that often occur in the vicinity of each other. Episodes also constrain the order in which the events are allowed to occur. In this work we introduce a technique for discovering closed episodes. Adapting existing approaches for discovering traditional patterns, such as closed itemsets, to episodes is not straightforward. First of all, we cannot define a unique closure based on frequency because an episode may have several closed superepisodes. Moreover, to define a closedness concept for episodes we need a subset relationship between episodes, which is not trivial to define. We approach these problems by introducing strict episodes. We argue that this class is general enough, and at the same time we are able to define a natural subset relationship within it and use it efficiently. In order to mine closed episodes we define an auxiliary closure operator. We show that this closure satisfies the properties needed so that we can use the existing framework for mining closed patterns. Discovering the true closed episodes can then be done as a post-processing step. We combine these observations into an efficient mining algorithm and demonstrate its performance empirically. Comment: Journal version. The previous version is the conference version.

    Mining Local Process Models

    In this paper we describe a method to discover frequent behavioral patterns in event logs. We express these patterns as local process models. Local process model mining can be positioned in between process discovery and episode/sequential pattern mining. The technique presented in this paper is able to learn behavioral patterns involving sequential composition, concurrency, choice, and loops, as in process mining. However, we do not look at start-to-end models, which distinguishes our approach from process discovery and creates a link to episode/sequential pattern mining. We propose an incremental procedure for building local process models that capture frequent patterns based on so-called process trees. We propose five quality dimensions and corresponding metrics for local process models, given an event log. We show monotonicity properties for some quality dimensions, enabling a speedup of local process model discovery through pruning. We demonstrate through a real-life case study that mining local patterns allows us to gain insights into processes where regular start-to-end process discovery techniques are only able to learn unstructured, flower-like models. Comment: Published in Elsevier's Journal of Innovation in Digital Ecosystems, Special Issue on Data Mining

    Mining Non-Redundant Local Process Models From Sequence Databases

    Sequential pattern mining techniques extract patterns corresponding to frequent subsequences from a sequence database. A practical limitation of these techniques is that they overload the user with too many patterns. Local Process Model (LPM) mining is an alternative approach coming from the field of process mining. While in traditional sequential pattern mining a pattern describes one subsequence, an LPM captures a set of subsequences. Also, while traditional sequential patterns only match subsequences that are observed in the sequence database, an LPM may capture subsequences that are not explicitly observed but are related to observed ones. In other words, LPMs generalize the behavior observed in the sequence database. These properties make it possible for a set of LPMs to cover the behavior of a much larger set of sequential patterns. Yet, existing LPM mining techniques still suffer from the pattern explosion problem because they produce sets of redundant LPMs. In this paper, we propose several heuristics to mine a set of non-redundant LPMs either from a set of redundant LPMs or from a set of sequential patterns. We empirically compare the proposed heuristics with each other and against existing (local) process mining techniques in terms of the coverage, redundancy, and complexity of the produced sets of LPMs.

    A unified view of Automata-based algorithms for Frequent Episode Discovery

    The frequent episode discovery framework is a popular framework in temporal data mining with many applications. Over the years, many different notions of the frequency of an episode have been proposed, along with different algorithms for episode discovery. In this paper we present a unified view of all such frequency counting algorithms. We present a generic algorithm of which all current algorithms are special cases. This unified view allows one to gain insights into the different frequencies, and we present quantitative relationships among them. Our unified view also helps in obtaining correctness proofs for various algorithms, as we show here. We also point out how this unified view helps us to consider generalizations of the algorithms so that they can discover episodes with general partial orders.
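
    For concreteness, the sketch below shows the simplest special case of automaton-based counting: a single automaton tracking one serial episode and counting its non-overlapped occurrences. The generic algorithm of the paper manages several automata per episode and covers other frequency notions; the code here is only an illustrative baseline.

```python
# A minimal automaton-based counter for one serial episode, using the
# non-overlapped occurrence count. Only the simplest special case of the
# unified framework described in the abstract.

def count_non_overlapped(event_stream, episode):
    state, count = 0, 0            # state = how many episode events matched so far
    for event in event_stream:
        if event == episode[state]:
            state += 1
            if state == len(episode):   # automaton reached its accepting state
                count += 1
                state = 0               # reset to count the next non-overlapped occurrence
    return count

print(count_non_overlapped(list("abcabcab"), ("a", "b", "c")))   # -> 2
```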

    ONCE and ONCE+: Counting the Frequency of Time-constrained Serial Episodes in a Streaming Sequence

    As a representative sequential pattern mining problem, counting the frequency of serial episodes in a streaming sequence has drawn continuous attention in academia due to its wide applicability in practice, e.g., telecommunication alarms, stock markets, transaction logs, and bioinformatics. Although a number of serial episode mining algorithms have been developed recently, most of them are neither stream-oriented, as they require multiple passes over the dataset, nor time-aware, as they fail to take into account the time constraints of serial episodes. In this paper, we propose two novel one-pass algorithms, ONCE and ONCE+, which compute two popular frequencies of given episodes satisfying a predefined time constraint as the signals in a stream arrive one after another. ONCE is used for the non-overlapped frequency, where the occurrences of a serial episode in the sequence do not intersect. ONCE+ is designed for the distinct frequency, where the occurrences of a serial episode do not share any event. A theoretical study proves that our algorithms correctly compute the frequency of the target time-constrained serial episodes in a given stream. An experimental study over both real-world and synthetic datasets demonstrates that the proposed algorithms can work, with little time and space, in signal-intensive streams where millions of signals arrive within a single second. Moreover, the algorithms have been applied in a real stream processing system, where the efficacy and efficiency of this work are tested in practical applications. Comment: 14 pages, 7 figures, 4 tables
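
    A simplified sketch of the underlying counting problem, assuming the stream arrives as (timestamp, event) pairs: a single greedy automaton counts non-overlapped occurrences whose span fits within a time constraint T. The real ONCE/ONCE+ algorithms track multiple candidate occurrences and give exact counts; this greedy version may undercount and is for illustration only.

```python
# One-pass, greedy counting of non-overlapped occurrences of a serial episode
# whose total time span must be at most T. Simplified illustration, not the
# ONCE/ONCE+ algorithms themselves (which track multiple candidate occurrences).

def count_time_constrained(stream, episode, T):
    """stream: iterable of (timestamp, event) pairs in timestamp order."""
    state, start, count = 0, None, 0
    for t, event in stream:
        if state > 0 and t - start > T:   # partial occurrence can no longer finish in time
            state, start = 0, None
        if event == episode[state]:
            if state == 0:
                start = t                 # remember when this occurrence began
            state += 1
            if state == len(episode):
                count += 1
                state, start = 0, None
    return count

stream = [(1, "a"), (2, "b"), (9, "c"), (10, "a"), (11, "b"), (12, "c")]
print(count_time_constrained(stream, ("a", "b", "c"), T=4))   # -> 1
```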