Discovering general partial orders in event streams
Frequent episode discovery is a popular framework for pattern discovery in
event streams. An episode is a partially ordered set of nodes with each node
associated with an event type. Efficient (and separate) algorithms exist for
episode discovery when the associated partial order is total (serial episode)
and trivial (parallel episode). In this paper, we propose efficient algorithms
for discovering frequent episodes with general partial orders. These algorithms
can be easily specialized to discover serial or parallel episodes. Also, the
algorithms are flexible enough to be specialized for mining in the space of
certain interesting subclasses of partial orders. We point out that there is an
inherent combinatorial explosion in frequent partial order mining and most
importantly, frequency alone is not a sufficient measure of interestingness. We
propose a new interestingness measure for general partial order episodes and a
discovery method based on this measure, for filtering out uninteresting partial
orders. Simulations demonstrate the effectiveness of our algorithms.
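To make the two special cases named above concrete, here is a minimal Python sketch of what it means for a serial (totally ordered) or a parallel (unordered) episode to occur in a sequence of event types. The function names `occurs_serial` and `occurs_parallel` are illustrative assumptions, not taken from the paper, and the sketch only tests for occurrence; it does not implement the paper's discovery algorithms for general partial orders.

```python
from collections import Counter

def occurs_serial(event_types, episode):
    """True if the episode's event types appear in the given order
    (not necessarily contiguously) in the sequence."""
    it = iter(event_types)
    # Each `any` consumes the shared iterator up to the first match, so
    # later episode nodes can only match strictly later events.
    return all(any(e == wanted for e in it) for wanted in episode)

def occurs_parallel(event_types, episode):
    """True if every event type of the episode appears often enough,
    in any order (the trivial partial order)."""
    available = Counter(event_types)
    needed = Counter(episode)
    return all(available[e] >= c for e, c in needed.items())
```

For the sequence A, C, B, the serial episode A then B occurs, the serial episode B then A does not, and the parallel episode {A, B} occurs.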
ONCE and ONCE+: Counting the Frequency of Time-constrained Serial Episodes in a Streaming Sequence
As a representative sequential pattern mining problem, counting the frequency
of serial episodes from a streaming sequence has drawn continuous attention in
academia due to its wide application in practice, e.g., telecommunication
alarms, stock market, transaction logs, bioinformatics, etc. Although a number
of serial episode mining algorithms have been developed recently, most of them
are neither stream-oriented, as they require multiple passes over the dataset, nor
time-aware, as they fail to take into account the time constraint of serial
episodes. In this paper, we propose two novel one-pass algorithms, ONCE and
ONCE+, which respectively compute two popular frequencies of given
episodes satisfying a predefined time constraint as the signals in a stream arrive
one after another. ONCE is used only for the non-overlapped frequency, where the
occurrences of a serial episode in the sequence do not intersect. ONCE+ is
designed for the distinct frequency where the occurrences of a serial episode
do not share any event. Theoretical analysis proves that our algorithms
correctly compute the frequency of the target time-constrained serial episodes in a
given stream. Experimental study over both real-world and synthetic datasets
demonstrates that the proposed algorithms work, with little time and space,
in signal-intensive streams where millions of signals arrive within a single
second. Moreover, the algorithms have been applied in a real stream processing
system, where the efficacy and efficiency of this work are tested in practical
applications.
Comment: 14 pages, 7 figures, 4 tables
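As an illustration of one-pass counting under a time constraint, the following Python sketch counts non-overlapped occurrences of a serial episode whose total span is at most `max_span`. It is not the ONCE algorithm from the paper; the function name, the `(time, type)` event representation, the span-based form of the time constraint, and the assumption of strictly increasing timestamps are all choices made for this example.

```python
def count_non_overlapped(events, episode, max_span):
    """Count non-overlapped occurrences of a serial episode in one pass.

    events:   iterable of (time, event_type), times strictly increasing
    episode:  list of event types in the required order
    max_span: maximum allowed time between first and last event
    """
    k = len(episode)
    # latest_start[j] is the latest start time of a partial occurrence
    # covering the first j nodes of the episode (None if there is none).
    latest_start = [None] * (k + 1)
    count = 0
    for t, etype in events:
        # Go from the last node to the first so a single event cannot
        # extend a partial occurrence that it has just created.
        for j in range(k - 1, -1, -1):
            if etype != episode[j]:
                continue
            start = t if j == 0 else latest_start[j]
            if start is None or t - start > max_span:
                continue  # nothing to extend, or already too long
            if latest_start[j + 1] is None or start > latest_start[j + 1]:
                latest_start[j + 1] = start
        if latest_start[k] is not None:
            # A full occurrence ended here: count it and forget all
            # partial occurrences so the next one cannot overlap it.
            count += 1
            latest_start = [None] * (k + 1)
    return count
```

Keeping only the latest start time per prefix means an occurrence is recognised as soon as any span-respecting occurrence ends, and resetting after each completed occurrence enforces non-overlap.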
A unified view of Automata-based algorithms for Frequent Episode Discovery
The Frequent Episode Discovery framework is a popular framework in Temporal Data
Mining with many applications. Over the years many different notions of
frequencies of episodes have been proposed along with different algorithms for
episode discovery. In this paper we present a unified view of all such
frequency counting algorithms. We present a generic algorithm such that all
current algorithms are special cases of it. This unified view allows one to
gain insights into different frequencies and we present quantitative
relationships among different frequencies. Our unified view also helps in
obtaining correctness proofs for various algorithms as we show here. We also
point out how this unified view helps us to consider generalizations of the
algorithms so that they can discover episodes with general partial orders.
Discovering Compressing Serial Episodes from Event Sequences
Most pattern mining methods output a very large number of frequent patterns
and isolating a small but relevant subset is a challenging problem of current
interest in frequent pattern mining. In this paper we consider discovery of a
small set of relevant frequent episodes from data sequences. We make use of the
Minimum Description Length principle to formulate the problem of selecting a
subset of episodes. Using an interesting class of serial episodes with
inter-event constraints and a novel encoding scheme for data using such
episodes, we present algorithms for discovering a small set of episodes that
achieve good data compression. Using an example of the data streams obtained
from distributed sensors in a composable coupled conveyor system, we show that
our method is very effective in unearthing highly relevant episodes and that
our scheme also achieves good data compression.
Comment: 27 pages, 3 figures
Towards Chip-on-Chip Neuroscience: Fast Mining of Frequent Episodes Using Graphics Processors
Computational neuroscience is being revolutionized with the advent of
multi-electrode arrays that provide real-time, dynamic, perspectives into brain
function. Mining event streams from these chips is critical to understanding
the firing patterns of neurons and to gaining insight into the underlying
cellular activity. We present a GPGPU solution to mining spike trains. We focus
on mining frequent episodes, which capture coordinated events across time even
in the presence of intervening background/"junk" events. Our algorithmic
contributions are two-fold: MapConcatenate, a new computation-to-core mapping
scheme, and a two-pass elimination approach to quickly find supported episodes
from a large number of candidates. Together, they help realize a real-time
"chip-on-chip" solution to neuroscience data mining, where one chip (the
multi-electrode array) supplies the spike train data and another (the GPGPU)
mines it at a scale unachievable previously. Evaluation on both synthetic and
real datasets demonstrates the potential of our approach.
Inferring Neuronal Network Connectivity using Time-constrained Episodes
Discovering frequent episodes in event sequences is an interesting data
mining task. In this paper, we argue that this framework is very effective for
analyzing multi-neuronal spike train data. Analyzing spike train data is an
important problem in neuroscience though there are no data mining approaches
reported for this. Motivated by this application, we introduce different
temporal constraints on the occurrences of episodes. We present algorithms for
discovering frequent episodes under temporal constraints. Through simulations,
we show that our method is very effective for analyzing spike train data for
unearthing underlying connectivity patterns.
Comment: 9 pages. See also http://neural-code.cs.vt.edu
Temporal data mining for root-cause analysis of machine faults in automotive assembly lines
Engine assembly is a complex and heavily automated distributed-control
process, with large amounts of fault data logged every day. We describe an
application of temporal data mining for analyzing fault logs in an engine
assembly plant. The frequent episode discovery framework is a model-free method
that can be used to deduce (temporal) correlations among events from the logs
in an efficient manner. In addition to being theoretically elegant and
computationally efficient, frequent episodes are also easy to interpret in the
form of actionable recommendations. Incorporation of domain-specific information
is critical to successful application of the method for analyzing fault logs in
the manufacturing domain. We show how domain-specific knowledge can be
incorporated using heuristic rules that act as pre-filters and post-filters to
frequent episode discovery. The system described here is currently being used
in one of the engine assembly plants of General Motors and is planned for
adaptation in other plants. To the best of our knowledge, this paper presents
the first real, large-scale application of temporal data mining in the
manufacturing domain. We believe that the ideas presented in this paper can
help practitioners engineer tools for analysis in other similar or related
application domains as well.
Efficient Discovery of Large Synchronous Events in Neural Spike Streams
We address the problem of finding patterns from multi-neuronal spike trains
that give us insights into the multi-neuronal codes used in the brain and help
us design better brain computer interfaces. We focus on the synchronous firings
of groups of neurons as these have been shown to play a major role in coding
and communication. With large electrode arrays, it is now possible to
simultaneously record the spiking activity of hundreds of neurons over large
periods of time. Recently, techniques have been developed to efficiently count
the frequency of synchronous firing patterns. However, when the number of
neurons being observed grows they suffer from the combinatorial explosion in
the number of possible patterns and do not scale well. In this paper, we
present a temporal data mining scheme that overcomes many of these problems. It
generates a set of candidate patterns from frequent patterns of smaller size;
not all possible patterns are counted. Also, we count only a certain well-defined
subset of occurrences, and this makes the process more efficient. We
highlight the computational advantage that this approach offers over the
existing methods through simulations.
We also propose methods for assessing the statistical significance of the
discovered patterns. We detect only those patterns that repeat often enough to
be significant and are thus able to automatically fix the threshold for the
data-mining application. Finally, we discuss the usefulness of these methods for
brain computer interfaces.
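The idea of generating candidates only from frequent patterns of smaller size can be sketched as follows. This is a generic Apriori-style join-and-prune step in Python, assuming each synchronous pattern is represented as a set of neuron identifiers; the function name `generate_candidates` and this representation are assumptions made for the example, not details from the paper.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Build size-(k+1) candidate patterns from frequent size-k patterns.

    A candidate is kept only if every one of its size-k subsets is
    itself frequent, so most possible patterns are never counted.
    """
    frequent = {frozenset(p) for p in frequent_k}
    if not frequent:
        return set()
    k = len(next(iter(frequent)))
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) != k + 1:
            continue  # a and b do not share exactly k-1 neurons
        if all(frozenset(sub) in frequent for sub in combinations(union, k)):
            candidates.add(union)
    return candidates
```

For example, from the frequent pairs {1,2}, {1,3}, {2,3} and {2,4}, the only surviving triple is {1,2,3}, because {1,4} and {3,4} are not frequent.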
Ranking Episodes using a Partition Model
One of the biggest setbacks in traditional frequent pattern mining is that
overwhelmingly many of the discovered patterns are redundant. A prototypical
example of such redundancy is a freerider pattern where the pattern contains a
true pattern and some additional noise events. A technique for filtering
freerider patterns that has proved to be efficient in ranking itemsets is to
use a partition model where a pattern is divided into two subpatterns and the
observed support is compared to the expected support under the assumption that
these two subpatterns occur independently.
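For itemsets, the partition model described above can be sketched in a few lines of Python. The helper names `support` and `partition_score` are illustrative assumptions, transactions are assumed to be sets of items, and the pattern is assumed to contain at least two items. The sketch covers only the itemset case; it is not the episode model developed in the paper.

```python
from itertools import combinations

def support(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def partition_score(transactions, pattern):
    """Return (observed support, largest expected support over all
    splits of the pattern into two non-empty subpatterns), where the
    expectation assumes the two parts occur independently."""
    n = len(transactions)
    items = sorted(pattern)
    observed = support(transactions, items)
    first, rest = items[0], items[1:]
    best_expected = 0.0
    # Fix the first item on the left side so each split is visited once;
    # r < len(rest) keeps the right side non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(items) - left
            expected = support(transactions, left) * support(transactions, right) / n
            best_expected = max(best_expected, expected)
    return observed, best_expected
```

A pattern whose observed support is no larger than the best expected support is explained by its parts and can be ranked as uninteresting.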
In this paper we develop a partition model for episodes, patterns discovered
from sequential data. An episode is essentially a set of events, with possible
restrictions on the order of events. Unlike with itemset mining, computing the
expected support of an episode requires surprisingly sophisticated methods. In
order to construct the model, we partition the episode into two subepisodes. We
then model how likely the events in each subepisode are to occur close to each other.
If this probability is high, which is often the case if the subepisode has a
high support, then we can expect that when one event from a subepisode occurs,
the remaining events also occur close by. This approach increases the
expected support of the episode, and if this increase explains the observed
support, then we can deem the episode uninteresting. We demonstrate in our
experiments that using the partition model can effectively and efficiently
reduce the redundancy in episodes.