Towards Chip-on-Chip Neuroscience:
Fast Mining of Frequent Episodes Using Graphics Processors
Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta,
Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan
Department of Computer Science
Virginia Tech, VA 24061, USA
Abstract
Computational neuroscience is being revolutionized with the advent of multi-electrode arrays that provide real-time, dynamic perspectives into brain function. Mining event streams from these chips is critical to understanding the firing
patterns of neurons and to gaining insight into the underlying cellular activity. We present a GPGPU solution to mining
spike trains. We focus on mining frequent episodes, which capture coordinated events across time even in the presence of
intervening background/“junk” events. Our algorithmic contributions are two-fold: MapConcatenate, a new computation-
to-core mapping scheme, and a two-pass elimination approach to quickly find supported episodes from a large number
of candidates. Together, they help realize a real-time “chip-on-chip” solution to neuroscience data mining, where one
chip (the multi-electrode array) supplies the spike train data and another (the GPGPU) mines it at a scale unachievable
previously. Evaluation on both synthetic and real datasets demonstrates the potential of our approach.
Keywords: Frequent episode mining, GPGPU, spike train datasets, computational neuroscience.
1 Introduction
Temporal (symbolic) event streams are popular in scientific domains, such as physical plants, medical diagnostics, and
neuroscience. In all these domains, we are given occurrences of events of interest over a time course and the goal is to
identify trends and behaviors that serve discriminatory or descriptive purposes. In this paper, we focus exclusively on event
streams gathered through multi-electrode array (MEA) chips for studying neuronal function, although our algorithms and
implementations are applicable to a wider variety of domains.
An MEA records spiking action potentials from an ensemble of neurons, which, after various pre-processing steps, yields
a spike train dataset providing real-time, dynamic perspectives into brain function (see Figure 1). Identifying sequences
(e.g., cascades) of firing neurons, determining their characteristic delays, and reconstructing the functional connectivity
of neuronal circuits are key problems of interest. This provides critical insights into the cellular activity recorded in the
neuronal tissue.
Figure 1: (left) Chip-on-Chip neuroscience: spike trains recorded from a multi-electrode array (MEA) are mined by a
GPGPU to yield frequent episodes which can be summarized to reconstruct the underlying neuronal circuitry. (right)
Architecture of the NVIDIA GTX280 GPGPU in detail.
With rapid developments in instrumentation and data acquisition technology, the size of event streams recorded by
MEAs has concomitantly grown, leading us to exhaust the abilities of sequential computation. For instance, just a few
minutes of cortical recording from a 64-channel MEA can easily generate millions of spikes! It has thus become imperative
to enable fast, near real-time computation, data mining, and visualization of mined patterns. How such mining computations are mapped onto parallel hardware is critical, as the choice of mapping can result in orders-of-magnitude differences in performance.
We adopt GPGPUs as the platform of choice for neuroscience data mining for several reasons. First, they are partic-
ularly economical in morphing a desktop into a rather powerful parallel computing machine (price-to-performance ratio).
This is especially attractive to neuroscientists who might not have access to clusters of workstations. At the same time,
mapping our application to a GPGPU architecture is non-trivial. GPGPUs were originally designed for data-parallel ap-
plications (e.g., rendering), not quite the class of problems that temporal data mining falls under. Our main contributions
are as follows:
1. MapConcatenate, a computation-to-core mapping scheme suitable for frequent episode mining that takes advantage
of the computational power of hundreds of processing cores on a GPU. MapConcatenate is a two-stage scheme like
MapReduce [1] but quite different in the intent and applicability of both stages.
2. A two-pass elimination approach to find supported temporal episodes from a large number of candidates. The
first pass, conducted with relaxed timing constraints, eliminates most of the non-supported episode candidates;
accurate counts for the surviving candidates are then computed in the second pass. Designing this elimination approach and proving
its correctness is non-trivial because the set of episodes mined under the relaxed constraints is not a superset of
those mined under the original constraints, although the relaxed counts are indeed an upper bound on the original counts.
3. A ‘chip-on-chip’ real-time solution to neuroscience data mining where one chip (the MEA) provides the data and
another chip (the graphics processor) mines it. Our solution is not a complete data streaming solution; nevertheless,
we achieve real-time responsiveness by processing partitions of the data stream in turn.
2 Problem Statement
A spike train dataset can be modeled as an event stream, where each symbol/event type corresponds to a specific neuron
(or clump of neurons) and the dataset encodes occurrence times of these events over the time course.
Definition 2.1. An event stream is denoted by a sequence of events 〈(E1, t1), (E2, t2), . . . (En, tn)〉, where n is the total
number of events. Each event (Ei, ti) is characterized by an event type Ei and a time of occurrence ti. The sequence is
ordered by time, i.e., $t_i \le t_{i+1}$ for all $i = 1, \ldots, n-1$, and the $E_i$'s are drawn from a finite set ξ.
One class of interesting patterns that we wish to discover are frequently occurring groups of events (i.e., firing cascades
of neurons) with some constraints on ordering and timing of these events. This is captured by the notion of episodes, the
original framework for which was proposed by Mannila et al [9].
Definition 2.2. A (serial) episode α is an ordered tuple of event types Vα ⊆ ξ.
For example, (A → B → C → D) is a 4-node serial episode; it captures the pattern that neuron A fires, followed
by neurons B, C, and D in that order, possibly with intervening ‘junk’ firings of neurons (even of
neurons A, B, C, or D themselves). This ability to intersperse noise or don’t-care states, of arbitrary length, between the event symbols
in the definition of an episode is what makes these patterns non-trivial, useful, and comprehensible.
Frequency of episodes: The notion of frequency of an episode can be defined in several ways. In [9], it is defined as the
fraction of windows in which the episode occurs. Another measure of frequency is the non-overlapped count which is the
size of the largest set of non-overlapped occurrences of an episode. Two occurrences are non-overlapped if no event of one
occurrence appears in between the events of the other. In the event stream of Fig. 2, there are at most two non-overlapped
occurrences of the episode A→ B, although there are 8 occurrences in total.
We use the non-overlapped occurrence count as the frequency measure of choice due to its strong theoretical properties
under a generative model of events [7]. It has also been argued in [10] that, for the neuroscience application, counting
non-overlapped occurrences is natural because episodes then correspond to causative, “syn-fire”, chains that happen at
different times again and again.
Temporal constraints: Besides the frequency threshold, a further level of constraint can be imposed on the definition
of episodes. In multi-neuronal datasets, if one would like to infer that neuron A’s firings cause neuron B to fire, then
spikes from neuron B cannot occur immediately or spontaneously after A’s spikes, owing to axonal conduction delays; nor can they
occur too long after A’s spikes, for similar reasons. Such minimum and maximum inter-event delays are
common in other application domains as well. We hence place inter-event time constraints, giving rise to episodes such as:
$$(A \xrightarrow{(t^1_{low},\, t^1_{high}]} B \xrightarrow{(t^2_{low},\, t^2_{high}]} C)$$
In a given occurrence of episode A → B → C, let $t_A$, $t_B$, and $t_C$ denote the times of occurrence of the corresponding event types. A valid occurrence of the serial episode satisfies
$$t^1_{low} < (t_B - t_A) \le t^1_{high}, \qquad t^2_{low} < (t_C - t_B) \le t^2_{high}.$$
(In general, an N-node serial episode is associated with N − 1 inter-event constraints.) In Fig. 2, there is exactly one
occurrence of the episode $A \xrightarrow{(5,10]} B \xrightarrow{(10,15]} C$ satisfying the desired inter-event constraints (shown in dotted boxes).
Problem 1: Given an event stream $\{(E_i, t_i)\}$, $i \in \{1 \ldots n\}$, and a set of inter-event constraints $I = \{(t^k_{low}, t^k_{high}]\}$, $k \in \{1 \ldots l\}$, find all serial episodes α of the form
$$\alpha = \langle E^{(1)} \xrightarrow{(t^{(1)}_{low},\, t^{(1)}_{high}]} E^{(2)} \cdots \xrightarrow{(t^{(N-1)}_{low},\, t^{(N-1)}_{high}]} E^{(N)} \rangle$$
such that the non-overlapped occurrence count of each such episode α is ≥ θ, a user-specified threshold. Here the $E^{(\cdot)}$'s are the event types in the episode α and the $(t^{(\cdot)}_{low}, t^{(\cdot)}_{high}] \in I$ are the corresponding inter-event constraints.
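To fix notation for the rest of the paper, the following sketch shows one possible in-memory representation of an event stream and of a candidate episode with its inter-event constraints. The struct names and the use of std::vector are our own illustrative assumptions and are not part of the mining algorithms themselves.

#include <cstdint>
#include <vector>

// Sketch of one possible host-side layout (names are illustrative assumptions).
struct Event {
    uint8_t type;   // index into the event-type alphabet (e.g., a neuron id)
    float   time;   // occurrence time t_i; non-decreasing across the stream
};

// An N-node candidate episode E(1) -> E(2) -> ... -> E(N) together with the
// N-1 inter-event constraints (t_low, t_high] between consecutive event types.
struct Episode {
    std::vector<uint8_t> types;  // N event types
    std::vector<float>   tLow;   // N-1 lower bounds (exclusive)
    std::vector<float>   tHigh;  // N-1 upper bounds (inclusive)
};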
3 Prior Work
We review prior work in three categories: mining frequent episodes, data mining using GPGPUs, and the map-reduce
framework for large scale computations.
Mining Frequent Episodes: The overall mining procedure for frequent episodes is based on level-wise mining. Within
this framework there are two broad classes of algorithms: window-based [9] and state machine based [6, 7], and they
primarily differ in how they define the frequency of an episode. The window based algorithms define frequency of an
episode as the fraction of windows on the event sequence in which the episode occurs. The state machine based algorithms
are more efficient and define frequency as the size of largest set of non-overlapped occurrences of an episode. Within the
class of state machine algorithms, serial episode discovery using non-overlapped counts was described in [7], and their
extension to temporal constraints is given in [10]. With the introduction of temporal constraints, the state machine based
algorithms become more complicated: they must keep track of what part of an episode has been seen and which event is
expected next, and, when occurrences interleave, they must decide which events to use in forming an occurrence of the episode.
Data Mining Using GPGPUs: Many researchers have harnessed the capabilities of GPGPUs for data mining. The key to
porting a data mining algorithm onto a GPGPU is to, in one sense, “dumb it down”; i.e., conditionals, program branches,
and complex decision constraints are not easily parallelizable on a GPGPU and algorithms using these constraints will
require significant reworking to fit this architecture (temporal episode mining unfortunately falls in this category). There
are many emerging publications in this area but, due to space restrictions, we survey only a few here. The SIGKDD tutorial
by Guha et al. [4] provides a gentle introduction to data mining on GPGPUs through problems like k-means
clustering, reverse nearest neighbor (RNN), discrete wavelet transform (DWT), sorting networks, etc. In [2], a bitmap
technique is proposed to support counting and is used to design GPGPU variants of Apriori and k-means clustering. This
work also proposes co-processing for itemset mining, where the complicated trie data structure is kept and updated in the
main memory of the CPU and only the itemset counting is executed in parallel on the GPU. A sorting algorithm on GPGPUs
with applications to frequency counting and histogram construction is discussed in [3] which essentially recreates sorting
networks on the GPU. Li et al. [8] present a ‘cut-and-stitch’ algorithm for approximate learning of Kalman filters. Although
this is not a GPGPU solution per se, we point out that our proposed approach shares with this work the difficulties of mining
temporal behavior in a parallel context.
MapReduce: Modeled after LISP primitives, MapReduce [1] provides a distribution framework for large scale compu-
tations using two functions: map, and reduce. It has received a fair share of attention (and, sometimes, criticism) from
the cloud computing and data-intensive computing communities. The framework has been ported to many platforms,
including GPGPUs (e.g., see [5]). See Section 5.2.2 for comparisons between our proposed framework and MapReduce.
4 GPGPU Architecture
To understand the algorithmic details behind our approach, we provide a brief overview of GPGPU and its architecture.
The initial purpose of specialized GPUs was to accelerate the display of 2D and 3D graphics, much in the same
way that the FPU focused on accelerating floating-point instructions. However, the rapid technological advances of the
GPU, coupled with extraordinary speed-ups of application “toy” benchmarks on the specialized GPUs, led GPU vendors
to transform the GPU from a specialized processor to a general-purpose graphics processing unit (GPGPU), such as
the NVIDIA GTX 280, as shown in Figure 1 (right). To lessen the steep learning curve, GPU vendors also introduced
programming environments, such as the Compute Unified Device Architecture (CUDA).
Processing Elements: The basic execution unit on the GTX 280 is a scalar processing core, of which 8 together form
a multiprocessor. While the number of multiprocessors and processor clock frequency depends on the make and model
of the GPU, every multiprocessor in CUDA executes in SIMT (Single Instruction, Multiple Thread) fashion, which is
similar in nature to SIMD (Single Instruction, Multiple Data) execution. Specifically, groups of 32 threads form a warp
and execute the same instruction stream in lockstep. When codepaths within a warp diverge, the warp must serially execute every
taken path with the non-participating threads masked off, implying that optimal performance is attained when all 32
threads of a warp follow the same codepath. With 8 cores per multiprocessor and a warp size of 32, the execution of a single
instruction for a warp completes in 4 cycles.
Memory Hierarchy: The GTX 280 contains multiple forms of memory. Beginning with the furthest from the GPU
processing elements, the device memory is located off-chip on the graphics card and provides the main source of storage
for the GPU while being accessible from both the CPU and the GPU. Each multiprocessor on the GPU contains
three caches — a texture cache, constant cache, and shared memory. The texture cache and constant cache are both
read-only memory providing fast access to immutable data. Shared memory, on the other hand, is read-write to provide
each core with fast access to the shared address space of a thread block within a multiprocessor. Finally, each multiprocessor
provides a large register file, so that there is minimal reliance on local memory, which resides off-chip in device
memory.
Parallelism Abstractions: At the highest level, the CUDA programming model requires the programmer to offload
functionality to the GPGPU as a compute kernel. This kernel is evaluated as a set of thread blocks logically arranged
in a grid to facilitate organization. In turn, each block contains a set of threads, which will be executed on the same
multiprocessor, with the threads scheduled in warps, as mentioned above.
5 Algorithms
Our overall approach for solving Problem 1 is based on a state machine algorithm with inter-event constraints [10]. There
are two major phases of this algorithm: generating episode candidates and counting these episodes, and we focus on
the latter since it is the key performance bottleneck, typically by a few orders of magnitude. Therefore, we present a
GPU-accelerated algorithm for counting episodes of various sizes, while candidate generation is executed sequentially on
a CPU.
In this section, we first present the standard sequential algorithm for mining frequent episodes with inter-event constraints. Next, we introduce our GPU-based algorithm (called A1). We then propose a two-pass counting approach, where
the first counting pass eliminates most of the unsupported episodes and the second pass completes the counting task
for the remaining episodes. Since the first pass uses a much less complex algorithm (called A2), the execution time
saved in this step contributes to an overall performance gain even though we go through two passes of counting (A2 + A1).
5.1 Episode Mining with Inter-Event Constraints
Algorithm 1 outlines the serial counting procedure for a single episode α. Briefly, it maintains a data structure s which is
a list of lists. Each list s[k] in s corresponds to the event type $E^{(k)} \in \alpha$ and stores the occurrence times of those events
of type $E^{(k)}$ that satisfy the inter-event constraint $(t^{(k-1)}_{low}, t^{(k-1)}_{high}]$ with at least one entry $t_j \in s[k-1]$. This
requirement is relaxed for s[1]: every time an event of type $E^{(1)}$ is seen in the data, its occurrence time is pushed into s[1].
When an event of type $E^{(k)}$, $2 \le k \le N$, is seen at time t, we look for an entry $t_j \in s[k-1]$ such that $t - t_j \in (t^{(k-1)}_{low}, t^{(k-1)}_{high}]$. Therefore, if we are able to add the event to the list s[k], there exists at least one
previous event of type $E^{(k-1)}$ in the data stream that satisfies the inter-event constraint
between $E^{(k-1)}$ and $E^{(k)}$. Applying this argument recursively, if we are able to add an event of type $E^{(|\alpha|)}$ (the last event type of α) to its
corresponding list in s, then there exists a sequence of events, one for each event type in α, satisfying the respective
inter-event constraints. Such an event marks the end of an occurrence, after which the count for α is incremented and the
data structure s is reinitialized. Figure 2 illustrates the data structure s for counting $A \xrightarrow{(5,10]} B \xrightarrow{(10,15]} C$.
The maximality of the left-most and inner-most occurrence counts for a general serial episode has been shown in [7].
Similar arguments hold for serial episodes with inter-event constraints and are not presented here for lack of space.
Figure 2: Illustration of the data structure s for counting $A \xrightarrow{(5,10]} B \xrightarrow{(10,15]} C$ on an example event sequence.
Algorithm 1 Serial Episode Mining
Input: Candidate N-node episode $\alpha = \langle E^{(1)} \xrightarrow{(t^{(1)}_{low},\, t^{(1)}_{high}]} \cdots E^{(N)} \rangle$ and event sequence $S = \{(E_i, t_i)\}$, $i \in \{1 \ldots n\}$.
Output: Count of non-overlapped occurrences of α satisfying the inter-event constraints
1: count = 0; s = [[], ..., []]  // list of |α| lists
2: for all (E, t) ∈ S do
3:    for i = |α| down to 1 do
4:       E(i) = i-th event type of α
5:       if E = E(i) then
6:          iprev = i − 1
7:          if i > 1 then
8:             k = |s[iprev]|
9:             while k > 0 do
10:               tprev = s[iprev, k]
11:               if t^(iprev)_low < t − tprev ≤ t^(iprev)_high then
12:                  if i = |α| then
13:                     count++; s = [[], ..., []]; break to Line 3
14:                  else
15:                     s[i] = s[i] ∪ {t}
16:                     break to Line 9
17:               k = k − 1
18:          else
19:             s[i] = s[i] ∪ {t}
20: RETURN count
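As a concrete (sequential) reference, the following C++ sketch transcribes Algorithm 1 directly; it assumes the illustrative Event/Episode layout from Section 2 and makes no attempt at the GPU-side optimizations discussed later.

#include <vector>

// Sequential reference for Algorithm 1 (sketch). Counts non-overlapped
// occurrences of "ep" in the time-ordered stream "S" under the episode's
// inter-event constraints; s[i] holds accepted times for episode node i.
int countEpisodeSerial(const std::vector<Event>& S, const Episode& ep) {
    const int N = static_cast<int>(ep.types.size());
    std::vector<std::vector<float>> s(N);
    int count = 0;
    for (const Event& e : S) {
        bool completed = false;
        for (int i = N - 1; i >= 0 && !completed; --i) {
            if (e.type != ep.types[i]) continue;
            if (i == 0) { s[0].push_back(e.time); continue; }   // first node: always record
            // Search s[i-1] (newest first) for a time satisfying (t_low, t_high].
            for (int k = static_cast<int>(s[i - 1].size()) - 1; k >= 0; --k) {
                float dt = e.time - s[i - 1][k];
                if (dt > ep.tLow[i - 1] && dt <= ep.tHigh[i - 1]) {
                    if (i == N - 1) {                            // full occurrence completed
                        ++count;
                        for (auto& lst : s) lst.clear();         // reinitialize s
                        completed = true;                        // move on to the next event
                    } else {
                        s[i].push_back(e.time);                  // extend the partial occurrence
                    }
                    break;                                       // corresponds to "break to Line 9"
                }
            }
        }
    }
    return count;
}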
5.2 GPU Mining Algorithm: A1
5.2.1 PTPE: per-thread per-episode
In order to design a parallel algorithm for counting the supports of M episodes, we need to identify a set of
independent computational tasks/units that can be mapped onto different computational cores of the GPU. A standard approach
is to use M computational units, each of which counts one episode. We then map each computational unit to a logical
GPU thread, which can execute in parallel with other threads. Because the counting of one episode in each computational
unit is executed sequentially, as on the CPU, we simply implement Algorithm 1 as the CUDA kernel for each counting thread.
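A minimal CUDA sketch of the PTPE mapping is shown below: one thread counts one episode by running the serial procedure of Algorithm 1 over the event stream in global memory. The flat array layout, the MAX_N/MAX_LIST bounds, and all identifiers are our own assumptions; the paper's actual kernel additionally tunes shared-memory and register usage.

#define MAX_N     8     // maximum episode length supported by this sketch
#define MAX_LIST  16    // maximum retained timestamps per episode node

__global__ void ptpeCountKernel(const unsigned char* evType, const float* evTime,
                                int numEvents,
                                const unsigned char* epTypes,  // [numEpisodes * N]
                                const float* epLow,            // [numEpisodes * (N-1)]
                                const float* epHigh,           // [numEpisodes * (N-1)]
                                int N, int numEpisodes, int* counts) {
    int ep = blockIdx.x * blockDim.x + threadIdx.x;
    if (ep >= numEpisodes) return;

    float s[MAX_N][MAX_LIST];   // s[i]: accepted times for episode node i
    int   len[MAX_N] = {0};
    int   count = 0;

    for (int e = 0; e < numEvents; ++e) {
        unsigned char t = evType[e];
        float tm = evTime[e];
        bool completed = false;
        for (int i = N - 1; i >= 0 && !completed; --i) {
            if (t != epTypes[ep * N + i]) continue;
            if (i == 0) {                                   // first node: always record
                if (len[0] < MAX_LIST) s[0][len[0]++] = tm;
                continue;
            }
            for (int k = len[i - 1] - 1; k >= 0; --k) {
                float dt = tm - s[i - 1][k];
                if (dt > epLow[ep * (N - 1) + i - 1] && dt <= epHigh[ep * (N - 1) + i - 1]) {
                    if (i == N - 1) {                       // occurrence completed
                        ++count;
                        for (int j = 0; j < N; ++j) len[j] = 0;
                        completed = true;
                    } else if (len[i] < MAX_LIST) {
                        s[i][len[i]++] = tm;
                    }
                    break;
                }
            }
        }
    }
    counts[ep] = count;
}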
5.2.2 MapConcatenate: multiple threads per episode
When the number of episodes is smaller than the number of cores, a per-thread per-episode model is prone to inefficiencies
in core usage, leading to higher execution times. Hence we seek to achieve a higher level of parallelism within the counting
of a single episode. We divide the input event stream into segments and use each thread block to count one
segment. The sub-counts of all segments then need to be concatenated for the final count; hence we refer to this
two-step approach as “MapConcatenate”. It includes a Map step, where each event segment is mapped to one thread
block for local counting, and a Concatenate step, where the counting results from neighboring segments, corresponding to
episode occurrences that straddle them, are concatenated/merged together.
Figure 3: Illustration of splitting a data sequence into segments and counting within each segment.
MapReduce vs MapConcatenate
This new computation-to-core mapping scheme, MapConcatenate, is quite different from MapReduce [1], in both
stages. Whereas the map functions in MapReduce operate independently across segments, the map functions in Map-
Concatenate require data access from adjacent segments. Similarly, whereas the results from any two segments can be
“reduced” in MapReduce, only the results from two adjacent segments can be concatenated in MapConcatenate.
Design of MapConcatenate
When we divide the input stream into segments, some occurrences of an episode may span the boundary between
consecutive segments. As an example, see Fig. 3, which depicts a data sequence divided into two segments.
The shaded rectangles on the top mark the non-overlapped occurrences h1 . . . h4 of an episode (A → B → C) (assume
for now that inter-event constraints are always satisfied), as seen by a state machine α0 on the unified event sequence. α0
is thus the reference (serial) state machine. Let α1 and α2 be the state machines counting occurrences in segment 1 and
segment 2 respectively. During the Map step, α1 and α2 are executed in parallel, and each state machine can see a local
view of the episode occurrences, which are shown by empty rectangles below the event sequence. α1 sees the occurrence
g1 and a partial occurrence g2. For α2, it will first see the occurrence g3 and therefore miss h3 before moving onto g4.
We propose Map and Concatenate steps that use multiple state machines in each segment, so that the counting of a
segment is able to anticipate partial occurrences near boundaries. Let us explain in detail why multiple state machines are
necessary, and how we design the Map step and Concatenate step to maintain the correctness of counting.
Figure 4: Illustration of the different possibilities of an occurrence splitting across two adjacent data segments.
Assume that we are counting an episode $\alpha = \langle E^{(1)} \xrightarrow{(t^{(1)}_{low},\, t^{(1)}_{high}]} \cdots E^{(N)} \rangle$, that the data sequence is divided into P segments,
and that the events in the p-th data segment lie in the range $(\tau_p, \tau_{p+1}]$. An occurrence of episode α can be split across two adjacent
segments in at least N ways, as shown in Figure 4. For each possible split, we need one state machine $\alpha^k_p$, $0 \le k \le N-1$, to
count the p-th segment, starting at $t = \tau_p - \sum_{i=1}^{k} t^{(i)}_{high}$. So we have N different state machines, all counting occurrences
of episode α using Algorithm 1, handling all possible cases of a split between the current segment and the previous segment.
Figure 5: Illustration of a Map step.
For each segment p, the Map step is designed as follows, and illustrated in Figure 5.
1. Each state machine $\alpha^k_p$ maintains its own count, $count^k_p$.
2. $\alpha^k_p$ does not increment its count for occurrences ending at time $t \le \tau_p$.
3. $\alpha^k_p$ stores the end time of the first occurrence that completes at a time t with $\tau_p < t < \tau_p + \sum_{i=1}^{N-1} t^{(i)}_{high}$. Let this be $a^k_p$.
If there is no such occurrence, $a^k_p$ is set to $\tau_p$.
4. $\alpha^k_p$, on reaching the end of the segment, crosses over into the next segment to complete its current partial occurrence
and continues until $t < \tau_{p+1} + \sum_{i=1}^{N-1} t^{(i)}_{high}$. Let the end time of this occurrence be $b^k_p$. Note that the count is not
incremented for this occurrence. In case the occurrence cannot be completed, $b^k_p$ is set to $\tau_{p+1}$.
The result of the Map step for each segment p is a set of tuples $(a^k_p, count^k_p, b^k_p)$. Based on these results, we design our
Concatenate step for pairs of consecutive segments as follows, and illustrate it in Figure 6.
1. Start Concatenate step at level 1.
Figure 6: Illustration of a Concatenate step.
2. At level i, concatenate the tuples of segment $(j-1)2^i + 1$ with the tuples of segment $(j-1)2^i + 1 + 2^{i-1}$, for
all possible j. The procedure for concatenating the tuples of the s-th segment and the t-th segment is: find all pairs of tuples
$(a^k_s, count^k_s, b^k_s)$ and $(a^l_t, count^l_t, b^l_t)$ such that $b^k_s = a^l_t$, and then concatenate these pairs to obtain the next-level
(i+1) tuples $(a^k_s, count^k_s + count^l_t, b^l_t)$ for the s-th segment.
3. After all adjacent segment pairs have been concatenated at level i, increase the level to i + 1 and repeat the previous step
until only one segment is left at the current level.
It is worth mentioning that, at level i of the Concatenate step, segment $(j-1)2^i + 1$ and segment $(j-1)2^i + 1 + 2^{i-1}$ are
considered adjacent segments. We also choose the number of segments P to be a power of 2, say $2^q$, so that the Concatenate
step takes exactly q + 1 levels and $2^{q+1} - 1$ concatenate operations to finish.
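To make the tuple-merging rule in step 2 concrete, the following host-side sketch concatenates the per-split tuples of two adjacent segments; the SplitTuple/SegmentResult names are illustrative assumptions, not the paper's data structures.

#include <vector>

// Sketch of concatenating the (a, count, b) tuples of two adjacent segments.
// Each segment carries one tuple per state machine / split offset.
struct SplitTuple { float a; int count; float b; };
using SegmentResult = std::vector<SplitTuple>;

// Merge the left segment's tuples with the right segment's tuples: whenever a
// left tuple's boundary-crossing occurrence ends exactly where a right tuple's
// first occurrence ends (b of left == a of right), the two sub-counts are summed.
SegmentResult concatenateSegments(const SegmentResult& left, const SegmentResult& right) {
    SegmentResult merged;
    for (const SplitTuple& ls : left) {
        for (const SplitTuple& rt : right) {
            if (ls.b == rt.a) {   // both are copies of the same event time, so == is safe
                merged.push_back({ls.a, ls.count + rt.count, rt.b});
            }
        }
    }
    return merged;
}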
5.2.3 A Hybrid Algorithm
MapConcatenate can greatly outperform the PTPE algorithm when the number of episodes is small and some
GPU cores would be left idle under PTPE. In other cases, PTPE runs much faster than MapConcatenate. So, as a practical approach, we would like to exploit their selective superiorities by automatically deciding which
algorithm to execute. The logic for making this choice can be very simple: if the GPU can be fully utilized with the
PTPE algorithm, then we choose it; otherwise we choose MapConcatenate. In NVIDIA's CUDA framework, the GPU is fully
utilized when:
$$S > MP \times B_{MP} \times T_B \qquad (1)$$
where S is the number of episodes to be counted, $T_B$ is the number of threads per block as defined by the algorithm, MP
is the number of multiprocessors on the GPGPU, and $B_{MP}$ is the number of blocks that the compiler determines can
fit on one multiprocessor.
Algorithm 2 GPU Mining Algorithm (A1)
1: if S > MP ×BMP × TB × f(N) then
2: Call PTPE Algorithm
3: else
4: Call MapConcatenate Algorithm
However, another key factor that determines the selective superiority is the size/length (N ) of the episode being
counted. For example, MapConcatenate will run slower on larger N , since the Map step needs N state machines to
count each event segment, forcing the Concatenate step to take more time to concatenate the state machines together. The
performance of the PTPE algorithm is also dependent on N , but will not change as much as MapConcatenate, since it
only uses one state machine to count each episode. Therefore, we need to consider the effect of N when deciding which
algorithm to use:
$$S > MP \times B_{MP} \times T_B \times f(N) \qquad (2)$$
where f(N) is a performance penalty factor dependent on the episode length/level of the mining algorithm. We define the
cross-over point as the number of episodes beyond which we estimate that the PTPE algorithm will outperform MapCon-
catenate.
By considering both GPU utilization and episode size, we propose the Hybrid algorithm (A1) for mining temporal
episodes on the GPU, as shown in Algorithm 2.
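The resulting selection logic is tiny on the host side. The sketch below assumes an f(N) of the exponential form discussed in Section 6.2 with placeholder constants; the calibrated values depend on the GPU and dataset.

#include <cmath>

// Penalty factor f(N): Section 6.2 fits a form a^N + b to measured crossover
// points; the constants here are placeholders, not the calibrated values.
static double penalty(int N, double a = 0.7, double b = 0.05) {
    return std::pow(a, N) + b;
}

enum class Kernel { PTPE, MapConcatenate };

// Decide which counting kernel to launch for S candidate episodes of size N.
// MP = multiprocessors, BMP = resident blocks per multiprocessor, TB = threads
// per block (defaults reflect the GTX 280 figures quoted in Section 6.2).
Kernel chooseKernel(int S, int N, int MP = 30, int BMP = 1, int TB = 32) {
    return (S > MP * BMP * TB * penalty(N)) ? Kernel::PTPE : Kernel::MapConcatenate;
}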
5.3 A Two-Pass Elimination Approach: A2+A1
After a thorough analysis of the GPU performance of algorithm A1, we find that its performance is largely limited by the
large amount of shared memory and the large number of GPU registers required by each GPU thread. For example, if the
episode size is 5, each thread requires 220 bytes of shared memory and 97 bytes of register file. This means that only 32
threads can be allocated on a GPU multiprocessor, which has 16 KB of shared memory and a limited register file. When each
thread requires more resources, fewer threads can run on the GPU at the same time, resulting in a longer overall execution time.
The only way to address this problem is to reduce the complexity of the algorithm without losing correctness. In
this section, we introduce a two-pass elimination approach that searches large numbers of episodes more efficiently,
further improving the overall performance. The idea is to use a far less complex algorithm, A2, to eliminate most of the
non-supported episodes, and to use the complex algorithm A1 only to determine whether each of the remaining episodes is supported. To
introduce algorithm A2, we first consider the solution to a slightly relaxed problem, which plays an important role in our
two-pass elimination approach.
5.3.1 Less-Constrained Mining: Algorithm A2
Let us consider a relaxed version of Problem 1: instead of enforcing both lower and upper limits on the inter-event
constraints, we design a counting solution that enforces only the upper limits.
Let α′ be an episode with the same event types as in α, where α uses the original episode definition from Problem 1.
The lower bounds on the inter-event constraints in α are relaxed in α′, as shown below:
$$\alpha' = \langle E^{(1)} \xrightarrow{(0,\, t^{(1)}_{high}]} E^{(2)} \cdots \xrightarrow{(0,\, t^{(N-1)}_{high}]} E^{(N)} \rangle$$
Observation 5.1. In Algorithm 1, if the lower bounds of the inter-event constraints in episode α are relaxed as in α′, the size of
each list s[k], 1 ≤ k ≤ N, can be reduced to 1.
Proof. In Algorithm 1, when an event of type $E^{(k)}$ is seen at time t while scanning the event sequence, $s[E^{(k-1)}]$ is
searched for at least one entry $t^{k-1}_i$ such that $t - t^{k-1}_i \in (0, t^{(k-1)}_{high}]$. Here $t^{k-1}_i$ denotes the i-th entry of $s[E^{(k-1)}]$,
the list corresponding to the (k−1)-th event type in α.
Let $s[E^{(k-1)}] = \{t^{k-1}_1, \ldots, t^{k-1}_m\}$ and let $t^{k-1}_i$ be the first entry which satisfies the inter-event constraint $(0, t^{(k-1)}_{high}]$, i.e.,
$$0 < t - t^{k-1}_i \le t^{(k-1)}_{high} \qquad (3)$$
Equation 4 below follows from the fact that $t^{k-1}_i$ is the first entry in $s[E^{(k-1)}]$ matching the time constraint:
$$t^{k-1}_i < t^{k-1}_j \le t, \quad \forall j \in \{i+1, \ldots, m\} \qquad (4)$$
From Equations 3 and 4, Equation 5 follows:
$$0 < t - t^{k-1}_j \le t^{(k-1)}_{high}, \quad \forall j \in \{i+1, \ldots, m\} \qquad (5)$$
This shows that every entry in $s[E^{(k-1)}]$ following $t^{k-1}_i$ also satisfies the inter-event constraint; this is a consequence of
relaxing the lower bound. Therefore it is sufficient to keep only the latest time stamp $t^{k-1}_m$ in $s[E^{(k-1)}]$, since it
serves the purpose of itself and all entries before it, thus reducing $s[E^{(k-1)}]$ to a single time stamp rather than a
list (as in Algorithm 1).
5.3.2 Combined Algorithm: A2+A1
Now, we can return to the original mining problem (with both upper and lower bounds). By combining Algorithm A2
with Algorithm A1, we can develop a two-pass elimination approach (A2 + A1) that can deal with the cases on which
Algorithm A1 cannot be executed. Algorithm A2 +A1 is as follows.
Algorithm 3 Less-Constrained Mining: A2
Input: Candidate N-node episode $\alpha = \langle E^{(1)} \xrightarrow{(0,\, t^{(1)}_{high}]} \cdots E^{(N)} \rangle$ and event sequence $S = \{(E_i, t_i)\}$, $i \in \{1 \ldots n\}$.
Output: Count of non-overlapped occurrences of α
1: count = 0; s = [·, ..., ·]  // array of |α| time stamps, initially unset
2: for all (E, t) ∈ S do
3:    for i = |α| down to 1 do
4:       E(i) = i-th event type of α
5:       if E = E(i) then
6:          iprev = i − 1
7:          if i > 1 then
8:             if t − s[iprev] ≤ t^(iprev)_high then
9:                if i = |α| then
10:                  count++; reset s; break to Line 3
11:               else
12:                  s[i] = t
13:          else
14:             s[i] = t
15: Output: count
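Since each per-node list collapses to a single timestamp, A2 needs only N scalars of per-episode state. A minimal sketch of the per-episode counting loop, reusing the flat layout assumed in the earlier PTPE sketch, is shown below.

// Sketch of the less-constrained counter A2 for one episode: only the most
// recent accepted timestamp per episode node is kept, so s[i] is a scalar.
// Flat layout and identifiers mirror the earlier PTPE sketch (illustrative).
__host__ __device__
int countLessConstrained(const unsigned char* evType, const float* evTime, int numEvents,
                         const unsigned char* types,   // the N event types of the episode
                         const float* tHigh,           // the N-1 upper bounds only
                         int N) {
    const float UNSET = -1.0e30f;
    float s[8];                                   // assumes N <= 8 for this sketch
    for (int i = 0; i < N; ++i) s[i] = UNSET;
    int count = 0;
    for (int e = 0; e < numEvents; ++e) {
        unsigned char t = evType[e];
        float tm = evTime[e];
        for (int i = N - 1; i >= 0; --i) {
            if (t != types[i]) continue;
            if (i == 0) { s[0] = tm; continue; }  // first node: always record the latest time
            if (s[i - 1] != UNSET && tm - s[i - 1] <= tHigh[i - 1]) {
                if (i == N - 1) {                 // occurrence completed under the upper bounds
                    ++count;
                    for (int j = 0; j < N; ++j) s[j] = UNSET;
                    break;                        // move on to the next event
                }
                s[i] = tm;                        // overwrite: the latest time suffices (Obs. 5.1)
            }
        }
    }
    return count;
}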
Algorithm 4 Combined Algorithm: A2+A1
1: (First pass) For each episode α, run A2 on its less-constrained counterpart, α′.
2: Eliminate every episode α, if count(α′) < CTh, where CTh is the support count threshold.
3: (Second Pass) Run A1 on each remaining episode α, with both inter-event constraints (lower and upper) enforced.
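On the host, the two passes combine into a simple filter-then-recount driver. The sketch below mirrors Algorithm 4; runA2 and runA1 are hypothetical wrappers around the counting kernels sketched earlier.

#include <functional>
#include <utility>
#include <vector>

// Two-pass driver mirroring Algorithm 4. runA2 and runA1 are hypothetical
// callbacks wrapping the counting kernels sketched earlier: runA2 returns the
// relaxed (upper-bound) counts, runA1 the exact counts under full constraints.
std::vector<std::pair<Episode, int>> twoPassCount(
        const std::vector<Episode>& candidates, int supportThreshold,
        const std::function<std::vector<int>(const std::vector<Episode>&)>& runA2,
        const std::function<std::vector<int>(const std::vector<Episode>&)>& runA1) {
    std::vector<int> upper = runA2(candidates);              // first pass (A2)
    std::vector<Episode> survivors;
    for (size_t i = 0; i < candidates.size(); ++i)
        if (upper[i] >= supportThreshold) survivors.push_back(candidates[i]);

    std::vector<int> exact = runA1(survivors);               // second pass (A1)
    std::vector<std::pair<Episode, int>> frequent;
    for (size_t i = 0; i < survivors.size(); ++i)
        if (exact[i] >= supportThreshold) frequent.push_back({survivors[i], exact[i]});
    return frequent;
}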
Algorithm A2 + A1 yields the correct solution to Problem 1. Although the set of episodes mined under the less-constrained
version is not a superset of those mined under the original problem definition, we can show the following
result:
Theorem 5.1. count(α′) ≥ count(α), i.e., the count obtained from Algorithm A2 is an upper-bound on the count obtained
from Algorithm A1.
Proof. Let h be an occurrence of α. Note that h is a map from the event types in α to events in the data sequence S. Let the
time stamps of the events of h be $\{t^{(1)}, \ldots, t^{(k)}\}$. Since h is an occurrence of α, it follows that
$$t^i_{low} < t^{(i+1)} - t^{(i)} \le t^i_{high}, \quad \forall i \in \{1, \ldots, k-1\} \qquad (6)$$
Note that $t^i_{low} > 0$. The inequality in Equation 6 still holds after we replace $t^i_{low}$ with 0, giving Equation 7:
$$0 < t^{(i+1)} - t^{(i)} \le t^i_{high}, \quad \forall i \in \{1, \ldots, k-1\} \qquad (7)$$
This corresponds exactly to the relaxed inter-event constraints in α′. Therefore every occurrence of α is also an occurrence
of α′ (but not necessarily the other way round); in particular, any set of non-overlapped occurrences of α is also a set of
non-overlapped occurrences of α′. Hence count(α′) ≥ count(α).
In our two-pass elimination approach, Algorithm A2 is less complex and runs faster than Algorithm A1, because it
reduces the time complexity of each inter-event constraint check from $O(|s[E^{(k-1)}]|)$ to O(1). Therefore, the performance
of Algorithm A2 + A1 is significantly better than that of Algorithm A1 alone when the number of candidate episodes is very large and the number
of episodes culled in the first pass is also large, as shown by the experimental results described next.
6 Experimental Results
6.1 Datasets and Testbed
6.1.1 Dataset Description
Our datasets are drawn both from mathematical models of spiking neurons and from real datasets gathered by Wagenaar et
al. [11] in their analysis of cortical cultures. Both sources of data are described in detail in [10]. The mathematical
model involves 26 neurons (event types) whose activity is modeled via inhomogeneous Poisson processes. Each neuron
has a basal firing rate of 20 Hz, and two causal chains of connections—one short and one long—are embedded in the
data. This dataset (Sym26) covers 60 seconds and contains 50,000 events. The real datasets (2-1-33, 2-1-34, 2-1-35) record
dissociated cultures on days 33, 34, and 35 of over five weeks of development. The original goal of this study was to
characterize the bursty behavior of neurons during development.
6.1.2 Testbed and Runtime Parameters
We evaluated the performance of our GPU algorithms on a machine equipped with an Intel Core 2 Quad 2.33 GHz CPU and 4 GB of
system memory. We used an NVIDIA GTX280 GPU, which has 240 processor cores clocked at 1.3 GHz and
1 GB of device memory.
There are two CUDA runtime parameters we need to determine for each execution on the GPU: number of threads
per block, T , and the total number of blocks. The second parameter is always calculated based on T so that all required
computation can be finished with T threads per block and within one CUDA kernel function call.
Parameter T is determined by the algorithm and by the size of the episode (N). For the PTPE algorithm, we calculate
the maximum number of threads per block at each N. The larger N is, the more shared memory is needed per thread. When
N = 1, we use 128 threads, and as N increases, the maximum number of threads per block decreases due to the shared
memory limit. When N = 6, we cannot have more than 32 threads per block. For MapConcatenate, the event stream is
segmented into a number (R) of sub-streams, as mentioned in Section 5.2.2. Recall that we need to create multiple threads
to count all sub-streams independently and run multiple state machines, as shown in Figure 5. The number of threads for
each block T can be calculated as T = R × N , since there are R sub-streams and N state machines. Again, we must
limit the number of sub-streams to keep the number of threads within the shared memory limit, which is affected by N. For
Algorithm A2, we generate as many threads as possible per block until shared memory usage reaches the hardware limit
(16 KB). In this case, T is normally much larger than 32, since Algorithm A2 does not place a strict memory requirement
on the GPU.
6.2 PTPE vs MapConcatenate
(a) Run Times at Different Episode Sizes.
(b) Speedups at Different Support Thresholds.
Figure 7: Comparison of PTPE, MapConcatenate, and Hybrid on Sym26 dataset.
Our first set of tests evaluated the three algorithms: PTPE, MapConcatenate, and the hybrid approach (A1). As seen
in Figure 7(a), it is clear that blindly choosing to execute the PTPE or MapConcatenate approach for all levels is not the
best solution. For episode sizes of 1, 2, and 5 both approaches complete in roughly the same amount of time. However,
PTPE significantly outperforms MapConcatenate at episode sizes of 3 and 4 by 3.96X and 2.84X, respectively. On the
other hand, PTPE performs slower than MapConcatenate for episodes of size 6 (1.32X) and 7 (2.63X).
These crossover points exist in all of our tests for this dataset (see the supplementary information); for lack of space,
Table 1 shows only the crossover points determined experimentally for the Sym26 dataset.
Table 1: Crossover points on the number of episodes below which MapConcatenate should be run (i.e., the fewer-episodes
case). For the other episode sizes (1, 2, and >8), MapConcatenate should be chosen.
Level 3 4 5 6 7 8
Crossover 415 190 200 100 100 60
Recall Equation 2 in Section 5.2.3, where optimal execution occurs when the GPU is fully utilized and a small factor
accounting for episode size is taken into account. Using the table above, with MP = 30, $T_B$ = 32, and $B_{MP}$ = 1, we find that an f(N) of
the form $a^N + b$ is a better fit than $a \times N + b$, as seen in Figure 8.
Figure 8: Crossover points fitted to $a^N + b$ and $a \times N + b$.
With these crossover points determined, the hybrid approach was evaluated on the same support thresholds and the
speedup of this approach over both PTPE and MapConcatenate is shown in Figure 7(b). The improvement reaches almost
3X over PTPE and over 4.5X over MapConcatenate. When the number of episodes is large (i.e., low support threshold),
there are enough episodes to fully utilize the GPU and as such Hybrid shows little improvement over PTPE. Conversely,
Hybrid shows little improvement over MapConcatenate when the support threshold is high with very few episodes to
search for.
6.3 One-Pass (A1) vs Two-Pass (A2+A1)
As stated in Section 5.3, the performance of algorithm A1 suffers from its large shared memory and register file requirements,
especially when the episode size is large. We therefore introduce algorithm A2, which can eliminate most of the
non-supported episodes while requiring much less shared memory and register space; the more complex algorithm A1 then needs to be
executed on far fewer episodes, resulting in performance gains. The amount of elimination that A2 achieves
can greatly affect the execution time at different episode sizes. In Figure 9(a), the two-pass algorithm eliminates over
99.9% (43634 out of 43656) of the episodes of size four. The end result is a speedup of 3.6X over the hybrid A1 algorithm
for this episode size and an overall speedup for this support threshold of 2.53X. Speedups for three different datasets at
different support thresholds are shown in Figure 9(b) where in every case, the two-pass approach outperforms the one-pass
approach with speedups ranging from 1.2X to 2.8X.
(a) Execution time of Two-Pass and One-Pass algorithms for Support=3600 on Dataset 2-1-35 at different episode sizes.
(b) Speedup of Two-Pass algorithm over One-Pass for multiple support thresholds on multiple datasets.
Figure 9: Execution time and speedup comparison of One-Pass algorithm versus Two-Pass algorithm.
We also used the CUDA Visual Profiler to analyze the execution of algorithms A1 and A2 and quantify how A2 outperforms A1 on the GPU. We analyzed various GPU performance factors, such as GPU occupancy,
coalesced global memory access, shared memory bank conflicts, divergent branching, and local memory loads and stores.
We find that the last two factors account for most of the performance difference between A1 and A2, as shown in
Figure 10. Algorithm A1 requires 17 registers and 80 bytes of local memory for each counting thread, while algorithm A2
requires only 13 registers and no local memory. Since local memory is used as a supplement for registers and is mapped onto
global memory space, it is accessed very frequently and has the same high latency as global memory. In Figure 10(a),
the local memory accesses of both the two-pass approach and the one-pass approach come entirely from algorithm A1.
Since algorithm A2 eliminates most of the non-supported episodes and requires no local memory access, the two-pass approach
performs far fewer local memory accesses than the one-pass approach as the episode size increases. At size 4, A2
eliminates all episode candidates, so there is no A1 execution and no local memory access, resulting in a large performance
gain for the two-pass approach over the one-pass approach. As shown in Figure 10(b), the amount of divergent branching also
contributes to the GPU performance difference between the two-pass approach and the one-pass approach.
Figure 10: Comparison between algorithm A1 and algorithm A2 for support threshold 1650 on dataset 2-1-33. (a) Total
number of loads and stores of local memory. (b) Total number of divergent branches.
6.4 GPU vs. CPU
To demonstrate the performance gain of our GPU approach, we implemented Algorithm 1 for a quad-core CPU with four
threads and tested it on our testbed machine. The CPU implementation is written in C++ and optimized for sequential
execution. For example, since each CPU thread counts a large number of episodes, we read the event
stream exactly once per thread and update all state machines in that thread with each event. We also used
an acceleration structure to speed up the search for which state machines need to be updated with a given event. In addition, we
implemented the two-pass (A2 + A1) elimination approach on the CPU to obtain a fair performance comparison against our GPU
implementation.
Compared with the performance of the CPU implementation, our GPU algorithms exhibit a significant speedup, as
shown in Figure 11. For the 2-1-35 dataset, the speedup is approximately 15-fold for a support threshold of 2700.
6.5 Mining Evolving Neuronal Circuits
Due to space limitations, we are unable to discuss the biological significance of our mined episodes. In addition to understanding the activity of an ensemble of neurons, one of the advantages of our GPGPU solution is the ability to mine evolving cultures and watch the progression of neuronal development in real time. Please see our supplemental website http://neural-code.cs.vt.edu/gpgpu for videos depicting how our mining reveals key characteristics of cortical culture growth.
Figure 11: Speedup of GPU implementation over the CPU implementation
7 Discussion
We have presented a powerful and non-trivial framework for conducting frequent episode mining on GPGPUs and shown
its capabilities for mining neuronal circuits in spike train datasets. For the first time, neuroscientists can enjoy the benefits
of data mining algorithms without needing access to costly and specialized clusters of workstations. Our supplementary
website (http://neural-code.cs.vt.edu/gpgpu) provides auxiliary plots and videos demonstrating how we can track evolving
cultures to reveal the progression of neural development in real-time.
Our future work is in four categories. First, our experiences with the neuroscience application have opened up the
interesting topic of mapping finite state machine based algorithms onto GPGPUs. A general framework to map any
finite state machine algorithm for counting will be extremely powerful not just for neuroscience but for many other areas
such as (massive) sequence analysis in bioinformatics and linguistics. Second, the development of MapConcatenate
highlights the importance of developing new programming abstractions specifically geared toward data mining
on GPGPUs. Third, we found that the two-pass approach performs significantly better than running the complex counting
algorithm on the entire input. The first pass solves a simpler problem and generates an upper bound on the solution to the
original problem. This significantly reduces the input size for the complex second pass, speeding up the entire process.
Finally, we wish to integrate more aspects of the application context into our algorithmic pipeline, such as candidate
generation, streaming analysis, and rapid “fast-forward” and “slow-play” facilities for visualizing the development of
neuronal circuits.
References
[1] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proc. OSDI’04, San Francisco,
CA, December 2004.
[2] W. Fang, K.K. Lau, M. Lu, X. Xiao, C.K. Lam, P.Y. Yang, B. He, Q. Luo, P.V. Sander, and K. Yang. Parallel
data mining on graphics processors. Technical Report HKUST-CS08-07, Hong Kong University of Science and
Technology, Oct 2008.
[3] N.K. Govindaraju, N. Raghuvanshi, and D. Manocha. Fast and approximate stream mining of quantiles and frequen-
cies using graphics processors. In Proc. SIGMOD’05, pages 611–622, 2005.
[4] S. Guha, S. Krishnan, and S. Venkatasubramanian. Data Visualization and Mining using the GPU. Tutorial at ACM
SIGKDD’05, 2005.
[5] B. He, W. Fang, Q. Luo, N.K. Govindaraju, and T. Wang. MARS: A MapReduce Framework on Graphics Processors.
In Proc. PACT'08, pages 260–269, 2008.
[6] P. Hingston. Using finite state automata for sequence mining. Australian Computer Science Communications, Vol.
24(1):105–110, Jan–Feb 2002.
[7] S. Laxman, P.S. Sastry, and K.P. Unnikrishnan. A fast algorithm for finding frequent episodes in event streams. In
Proc. KDD’07, pages 410–419, 2007.
[8] L. Li, W. Fu, F. Guo, T.C. Mowry, and C. Faloutsos. Cut-and-stitch: efficient parallel learning of linear dynamical
systems on SMPs. In Proc. KDD’08, 2008.
[9] H. Mannila, H. Toivonen, and A.I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and
Knowledge Discovery, 1(3):259–289, Nov 1997.
[10] D. Patnaik, P.S. Sastry, and K.P. Unnikrishnan. Inferring Neuronal Network Connectivity from Spike Data: A
Temporal Data Mining Approach. Scientific Programming, 16(1):49–77, 2008.
[11] D.A. Wagenaar, J. Pine, and S.M. Potter. An extremely rich repertoire of bursting patterns during the
development of cortical cultures. BMC Neuroscience, 2006.
