Free-rider Episode Screening via Dual Partition Model
One of the drawbacks of frequent episode mining is that overwhelmingly many
of the discovered patterns are redundant. A free-rider episode, a typical
example, consists of a real pattern doped with some additional noise events.
Because the noise events inside them may have high support, such
free-rider episodes may have such abnormally high support that they cannot be
filtered by a frequency-based framework. An effective technique for filtering
free-rider episodes is using a partition model to divide an episode into two
consecutive subepisodes and comparing the observed support of such episode with
its expected support under the assumption that these two subepisodes occur
independently. In this paper, we take more complex subepisodes into
consideration and develop a novel partition model named EDP for free-rider
episode filtering from a given set of episodes. It combines (1) a dual
partition strategy which divides an episode into an underlying real pattern and
potential noises; (2) a novel definition of the expected support of a
free-rider episode based on the proposed partition strategy. We can deem the
episode interesting if the observed support is substantially higher than the
expected support estimated by our model. The experiments on synthetic and
real-world datasets demonstrate that EDP can effectively filter free-rider episodes
compared with existing state-of-the-art methods.
Comment: The 23rd International Conference on Database Systems for Advanced
Applications (DASFAA 2018), 16 Page
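The single-partition check described in the abstract above (split an episode into two consecutive subepisodes and compare observed support with the support expected under independence) can be sketched as follows. This is a minimal illustration, not the EDP model itself. The sliding-window definition of support and the names `occurs`, `window_support`, and `partition_lift` are assumptions made for this example.

```python
from typing import Sequence


def occurs(window: Sequence[str], episode: Sequence[str]) -> bool:
    """Return True if `episode` appears in `window` as an ordered subsequence."""
    it = iter(window)
    # Each `any` consumes the iterator up to the first match, preserving order.
    return all(any(e == x for x in it) for e in episode)


def window_support(seq: Sequence[str], episode: Sequence[str], width: int) -> float:
    """Fraction of sliding windows of length `width` that contain `episode`.

    Assumption: support is measured over fixed-width windows.
    """
    n = len(seq) - width + 1
    if n <= 0:
        return 0.0
    hits = sum(occurs(seq[i:i + width], episode) for i in range(n))
    return hits / n


def partition_lift(seq: Sequence[str], episode: Sequence[str],
                   split: int, width: int) -> float:
    """Observed support divided by the support expected if the two
    consecutive subepisodes episode[:split] and episode[split:] occurred
    independently. Returns 0.0 when the expected support is zero."""
    observed = window_support(seq, episode, width)
    expected = (window_support(seq, episode[:split], width)
                * window_support(seq, episode[split:], width))
    return observed / expected if expected > 0 else 0.0
```

A ratio well above 1 suggests that the two parts co-occur more often than chance would predict. A ratio near or below 1 is what one would expect when one part is merely frequent noise attached to a real pattern.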
Discovering Compressing Serial Episodes from Event Sequences
Most pattern mining methods output a very large number of frequent patterns
and isolating a small but relevant subset is a challenging problem of current
interest in frequent pattern mining. In this paper we consider discovery of a
small set of relevant frequent episodes from data sequences. We make use of the
Minimum Description Length principle to formulate the problem of selecting a
subset of episodes. Using an interesting class of serial episodes with
inter-event constraints and a novel encoding scheme for data using such
episodes, we present algorithms for discovering a small set of episodes that
achieve good data compression. Using an example of the data streams obtained
from distributed sensors in a composable coupled conveyor system, we show that
our method is very effective in unearthing highly relevant episodes and that
our scheme also achieves good data compression.
Comment: 27 pages 3 figur
Temporal data mining for root-cause analysis of machine faults in automotive assembly lines
Engine assembly is a complex and heavily automated distributed-control
process, with large amounts of fault data logged every day. We describe an
application of temporal data mining for analyzing fault logs in an engine
assembly plant. The frequent episode discovery framework is a model-free method
that can be used to deduce (temporal) correlations among events from the logs
in an efficient manner. In addition to being theoretically elegant and
computationally efficient, frequent episodes are also easy to interpret in the
form of actionable recommendations. Incorporation of domain-specific information
is critical to successful application of the method for analyzing fault logs in
the manufacturing domain. We show how domain-specific knowledge can be
incorporated using heuristic rules that act as pre-filters and post-filters to
frequent episode discovery. The system described here is currently being used
in one of the engine assembly plants of General Motors and is planned for
adaptation in other plants. To the best of our knowledge, this paper presents
the first real, large-scale application of temporal data mining in the
manufacturing domain. We believe that the ideas presented in this paper can
help practitioners engineer tools for analysis in other similar or related
application domains as well.
Inferring Neuronal Network Connectivity using Time-constrained Episodes
Discovering frequent episodes in event sequences is an interesting data
mining task. In this paper, we argue that this framework is very effective for
analyzing multi-neuronal spike train data. Analyzing spike train data is an
important problem in neuroscience though there are no data mining approaches
reported for this. Motivated by this application, we introduce different
temporal constraints on the occurrences of episodes. We present algorithms for
discovering frequent episodes under temporal constraints. Through simulations,
we show that our method is very effective for analyzing spike train data for
unearthing underlying connectivity patterns.
Comment: 9 pages. See also http://neural-code.cs.vt.edu
Summarizing Event Sequences with Serial Episodes: A Statistical Model and an Application
In this paper we address the problem of discovering a small set of frequent
serial episodes from sequential data so as to adequately characterize or
summarize the data. We discuss an algorithm based on the Minimum Description
Length (MDL) principle and the algorithm is a slight modification of an earlier
method, called CSC-2. We present a novel generative model for sequence data
containing prominent pairs of serial episodes and, using this, provide some
statistical justification for the algorithm. We believe this is the first
instance of such a statistical justification for an MDL based algorithm for
summarizing event sequence data. We then present a novel application of this
data mining algorithm in text classification. By considering text documents as
temporal sequences of words, the data mining algorithm can find a set of
characteristic episodes for all the training data as a whole. The words that
are part of these characteristic episodes could then be considered the only
relevant words for the dictionary thus resulting in a considerably reduced
feature vector dimension. We show, through simulation experiments using
benchmark data sets, that the discovered frequent episodes can be used to
achieve a more than four-fold reduction in dictionary size without losing any
classification accuracy.
Comment: 12 pages. Under review for IEEE TKD
Mining Closed Strict Episodes
Discovering patterns in a sequence is an important aspect of data mining. One
popular choice of such patterns is episodes, patterns in sequential data
describing events that often occur in the vicinity of each other. Episodes also
enforce in which order the events are allowed to occur.
In this work we introduce a technique for discovering closed episodes.
Adapting existing approaches for discovering traditional patterns, such as
closed itemsets, to episodes is not straightforward. First of all, we cannot
define a unique closure based on frequency because an episode may have several
closed superepisodes. Moreover, to define a closedness concept for episodes we
need a subset relationship between episodes, which is not trivial to define.
We approach these problems by introducing strict episodes. We argue that this
class is general enough, and at the same time we are able to define a natural
subset relationship within it and use it efficiently. In order to mine closed
episodes we define an auxiliary closure operator. We show that this closure
satisfies the needed properties so that we can use the existing framework for
mining closed patterns. Discovering the true closed episodes can be done as a
post-processing step. We combine these observations into an efficient mining
algorithm and demonstrate empirically its performance in practice.
Comment: Journal version. The previous version is the conference versio
Mining Local Process Models
In this paper we describe a method to discover frequent behavioral patterns
in event logs. We express these patterns as local process models. Local
process model mining can be positioned in-between process discovery and episode
/ sequential pattern mining. The technique presented in this paper is able to
learn behavioral patterns involving sequential composition, concurrency, choice
and loop, like in process mining. However, we do not look at start-to-end
models, which distinguishes our approach from process discovery and creates a
link to episode / sequential pattern mining. We propose an incremental
procedure for building local process models capturing frequent patterns based
on so-called process trees. We propose five quality dimensions and
corresponding metrics for local process models, given an event log. We show
monotonicity properties for some quality dimensions, enabling a speedup of
local process model discovery through pruning. We demonstrate through a
real-life case study that mining local patterns allows us to gain insights into
processes where regular start-to-end process discovery techniques are only able
to learn unstructured, flower-like models.
Comment: Published in Elsevier's Journal of Innovation in Digital Ecosystems,
Special Issue on Data Minin
Mining Non-Redundant Local Process Models From Sequence Databases
Sequential pattern mining techniques extract patterns corresponding to
frequent subsequences from a sequence database. A practical limitation of these
techniques is that they overload the user with too many patterns. Local Process
Model (LPM) mining is an alternative approach coming from the field of process
mining. While in traditional sequential pattern mining, a pattern describes one
subsequence, an LPM captures a set of subsequences. Also, while traditional
sequential patterns only match subsequences that are observed in the sequence
database, an LPM may capture subsequences that are not explicitly observed, but
that are related to observed subsequences. In other words, LPMs generalize the
behavior observed in the sequence database. These properties make it possible
for a set of LPMs to cover the behavior of a much larger set of sequential
patterns. Yet, existing LPM mining techniques still suffer from the pattern
explosion problem because they produce sets of redundant LPMs. In this paper,
we propose several heuristics to mine a set of non-redundant LPMs either from a
set of redundant LPMs or from a set of sequential patterns. We empirically
compare the proposed heuristics with one another and against existing (local)
process mining techniques in terms of coverage, redundancy, and complexity of
the produced sets of LPMs.
A unified view of Automata-based algorithms for Frequent Episode Discovery
The Frequent Episode Discovery framework is a popular framework in Temporal Data
Mining with many applications. Over the years many different notions of
frequencies of episodes have been proposed along with different algorithms for
episode discovery. In this paper we present a unified view of all such
frequency counting algorithms. We present a generic algorithm such that all
current algorithms are special cases of it. This unified view allows one to
gain insights into different frequencies and we present quantitative
relationships among different frequencies. Our unified view also helps in
obtaining correctness proofs for various algorithms as we show here. We also
point out how this unified view helps us to consider generalizations of the
algorithm that can discover episodes with general partial orders.
ONCE and ONCE+: Counting the Frequency of Time-constrained Serial Episodes in a Streaming Sequence
As a representative sequential pattern mining problem, counting the frequency
of serial episodes from a streaming sequence has drawn continuous attention in
academia due to its wide application in practice, e.g., telecommunication
alarms, stock market, transaction logs, bioinformatics, etc. Although a number
of serial episode mining algorithms have been developed recently, most of them
are neither stream-oriented, as they require multiple passes over the dataset, nor
time-aware, as they fail to take into account the time constraint of serial
episodes. In this paper, we propose two novel one-pass algorithms, ONCE and
ONCE+, which respectively compute two popular frequencies of given
episodes satisfying a predefined time constraint as signals in a stream arrive
one after another. ONCE is used only for the non-overlapped frequency, where the
occurrences of a serial episode in the sequence do not intersect. ONCE+ is
designed for the distinct frequency where the occurrences of a serial episode
do not share any event. Theoretical study proves that our algorithms can
correctly mine the frequency of target time-constrained serial episodes in a
given stream. Experimental study over both real-world and synthetic datasets
demonstrates that the proposed algorithms can work, with little time and space,
in signal-intensive streams where millions of signals arrive within a single
second. Moreover, the algorithms have been applied in a real stream processing
system, where the efficacy and efficiency of this work are tested in practical
applications.
Comment: 14 pages, 7 figures, 4 table
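To make the notion of one-pass, time-constrained, non-overlapped counting concrete, here is a minimal sketch. It is not the ONCE algorithm from the paper. It is a simple greedy counter that tracks, for each prefix of the episode, the latest start time seen so far. The function name `count_non_overlapped`, the `(timestamp, event_type)` input format, and the interpretation of the time constraint as a maximum span between first and last event are assumptions made for this example.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def count_non_overlapped(events: Iterable[Tuple[float, str]],
                         episode: Sequence[str],
                         max_span: float) -> int:
    """Count occurrences of a serial episode in a single pass.

    An occurrence is counted only if the time between its first and last
    event is at most `max_span` (assumed form of the time constraint).
    After each counted occurrence all state is reset, so counted
    occurrences never interleave.
    """
    k = len(episode)
    # start[j]: latest start time of an occurrence of the first j event types.
    start: List[Optional[float]] = [None] * (k + 1)
    count = 0
    for t, e in events:
        # Visit longer prefixes first so one event never extends a prefix twice.
        for j in range(k - 1, -1, -1):
            if e != episode[j]:
                continue
            if j == 0:
                start[1] = t
            elif start[j] is not None:
                start[j + 1] = start[j]
        if start[k] is not None:
            if t - start[k] <= max_span:
                count += 1
                start = [None] * (k + 1)  # reset after a counted occurrence
            else:
                start[k] = None  # too long; keep shorter prefixes alive
    return count
```

For instance, on the stream `[(1, "A"), (2, "B"), (3, "A"), (10, "B"), (11, "A"), (12, "B")]` with episode `["A", "B"]`, the pair `(3, "A"), (10, "B")` is rejected when `max_span` is 2 and accepted when it is 100.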