Parallel algorithms for generating harmonised state identifiers and characterising sets by Hierons, RM & Turker, U
Parallel Algorithms for Generating Harmonised
State Identifiers and Characterising Sets
Robert M. Hierons, Senior Member, IEEE and Uraz Cengiz Türker
Abstract—Many automated finite state machine (FSM) based test generation algorithms require that a characterising set (CS) or a
set of harmonised state identifiers (HSIs) is first produced. The only previously published algorithms for partial FSMs were brute-force
algorithms with exponential worst case time complexity. This paper presents polynomial time algorithms and also massively parallel
implementations of both the polynomial time algorithms and the brute-force algorithms. In the experiments the parallel algorithms scaled
better than the sequential algorithms and took much less time. Interestingly, while the parallel version of the polynomial time algorithm
was fastest for most sizes of FSMs, the parallel version of the brute-force algorithm scaled better due to lower memory requirements.
Keywords—Software engineering/software/program verification, Software engineering/testing and debugging, Software engineer-
ing/test design, Finite State Machine, Characterising sets, Harmonised state identifiers, General purpose graphics processing units.
1 INTRODUCTION
Software testing is an important part of software de-
velopment but typically is expensive, manual and error
prone. One possible solution is to base testing on a
formal model or specification [1], [2], allowing rigorous
test generation algorithms to be used. In this context, one
of the most widely used types of formal model is the
Finite State Machine (FSM). The tester might devise an
FSM to drive test generation and execution or an FSM
might be produced from the semantics of a model in
a more expressive language, such as Specification and
Description Language (SDL) or State-Charts [3], [4], [5].
Approaches that derive a test sequence (an input se-
quence) from an FSM model have been developed in var-
ious application domains such as sequential circuits [6],
lexical analysis [7], software design [8], communication
protocols [5], [9], [10], [11], object-oriented systems [12],
and web services [13], [14]. Such techniques have also
been shown to be effective when used in important
industrial projects [15]. Once a test sequence has been de-
vised from an FSM specification M , the test sequence is
applied to the implementation N . The output sequences
produced by N and M are then compared and if they
differ then we can deduce that N is faulty.
There are many techniques that automate the genera-
tion of test sequences from an FSM specification M , with
this work going back to the seminal papers of Moore
[16] and Hennie [17]. Most such techniques use state
identification sequences that distinguish the states of M [8],
[17], [18], [19], [20], [21], [22], [23]. For example, a test
technique can check that the input of x in state s leads
to output y and state s′ as follows: start with a preamble
(input sequence) x̄ that takes M to s, then apply x, check
that the output produced is y, and finally use one or
more input sequences to distinguish the expected state
s′ from all other states of M . The test technique might
also check the state reached by x̄.
• Department of Computer Science, Brunel University London, UK.
E-mail: {rob.hierons,uraz.turker}@brunel.ac.uk
One approach to state identification is to use a charac-
terising set (CS): a set of input sequences that distinguish
all pairs of states [8], [24]. Harmonised state identifiers
(HSIs) improve on CSs by allowing different sets of input
sequences for different states [25]. One of the benefits
of CSs and HSIs is that every minimal deterministic
FSM1 has a CS and an HSI and so test generation
techniques that use these are generally applicable. This
paper focuses on generating CSs and HSIs. The main
focus of previous work has been complete FSMs, where
the response to input x in state s is defined for every
input x and state s. However, it has been observed
that often FSM specifications are not complete (they
are partial) [26], [27], [28], [29]. Complete FSMs are a
special class of partial FSMs and so partial FSMs model
a wider range of systems. However, the traditional state
identification methodologies are not directly applicable
to partial FSMs [17], [30]. The assumption that FSMs are
complete is typically justified by assuming that missing
transitions can be completed by, for example, adding
transitions with null output.
Although it is sometimes possible to complete a partial
FSM, this is not a general solution [31]. For example,
an FSM M might specify a component that receives
inputs from another component M ′; the behaviour of
M ′ influences what input sequences can be received
by M . In addition, it might not be possible for certain
input sequences to be provided by the environment due
to physical constraints. Furthermore, ensuring completeness
may introduce redundancy. For example, during the
experiments we found a benchmark FSM ‘scf’ provided
in [32] that has 97 states and 13 million inputs but only
166 transitions. Clearly, it is not sensible to complete such
an FSM. There has thus been interest in techniques for
testing from a partial FSM [20], [21], [25], [31], [33], [34],
[35], [36], [37], [38].
1. Similar to most other work, we restrict attention to minimal
deterministic FSMs; these terms are defined in Section 2.
While one might expect a tester to use quite small
models, an FSM M might represent the semantics of a
model M ′ written in a more expressive language such
as State-Charts or SDL. A state of M will represent
a combination of a logical state for M ′ and a tuple
of values for the variables in M ′. Thus, even small
models can lead to large FSMs. However, the scalability
of deriving CSs and HSIs from partial FSMs has received
little attention despite FSM specifications often being
partial and many FSM-based test generation methods
using a CS or HSI. Note that by ‘scalability’ we refer
to the size (number of states) of the largest partial FSMs
that can be processed by an algorithm in a reasonable
amount of time. To our knowledge there exists only
one paper that proposes algorithms for deriving CSs
and HSIs for partial FSMs [25] despite there being test
generation algorithms, for testing from a partial FSM,
that use such state identifiers [20], [21], [25], [31], [37],
[38]. The CS/HSI generation algorithms presented in [25] are
sequential algorithms that operate on a single thread.
The sequential CS algorithm has exponential worst case
time complexity and the sequential HSI algorithm re-
quires CSs to first be constructed.
Despite the increasing interest in Graphics Processing
Units (GPUs) [39], [40], [41], [42], [43], [44], no previous
work has utilised GPU technology to generate CSs or
HSIs. In this work, our primary motivation is to address
the scalability problem raised when constructing CSs
and HSIs and to do so through using GPU technology.
As noted, the only published algorithm for generat-
ing CSs and HSIs from partial FSMs has exponential
worst case time complexity. This paper tackles scalability
from two directions. First, we devise a polynomial
time sequential algorithm for generating CSs. Second, we
use parallelism: we produce a parallel implementation of
this algorithm, devise a parallel HSI construction algorithm,
and give parallel implementations of the brute-force
algorithms for generating CSs and HSIs (based on previous work). We
also present the results of experiments. In experiments
with randomly generated FSMs, the parallel algorithms
scaled much better than the sequential algorithms. Such
improvements in performance should help test gen-
eration techniques scale to larger FSMs. Interestingly,
although the parallel brute-force (worst case exponential
time) algorithms were slower than the parallel versions
of the polynomial time algorithms, they scaled better.
This is because the polynomial time algorithms have
greater memory requirements. The parallel algorithms
also outperformed the sequential algorithms on bench-
mark FSMs.
There were several challenges in designing a scalable
massively parallel algorithm for deriving CSs and HSIs
from partial FSMs. First, it was necessary to develop
data structures that (1) can be processed quickly; (2)
can be efficiently stored in GPU memory; and (3) can
encapsulate enough data for constructing CSs and HSIs.
It is also important that the parallel algorithm maximises
GPU utilisation (occupancy). There is a trade-off between
these factors since, for example, storing information in
local GPU memory reduces memory access time but
can also reduce the number of threads that can run in
parallel.
This paper is organised as follows. We provide prelim-
inary material in Section 2 and in Section 3 we develop
a polynomial time sequential algorithm. In Section 4 we
present the proposed parallel HSI and CS algorithms.
Section 5 outlines the experiments and the results of
these. Section 6 describes threats to validity and in
Section 7 we draw conclusions. The appendix explains
how the algorithms were implemented on a GPU.
2 PRELIMINARIES
2.1 Finite State Machines (FSMs)
An FSM is defined by a tuple M = (S, s0, X, Y, h) where
S = {s1, s2, . . . , sn} is the finite set of states; s0 ∈ S is
the initial state; X = {x1, x2, . . . , xp} is the finite set of
inputs; Y = {y1, y2, . . . , yq} is the finite set of outputs (we
assume that X is disjoint from Y ); and h ⊆ S×X×Y ×S
is the set of transitions of M. We let h = {τ1, τ2, . . . , τκ}.
If (s, x, y, s′) ∈ h and input x ∈ X is applied when M
is in state s then M can change its state to s′ and produce
output y. Here τ = (s, x, y, s′) is a transition of M with
starting state s, ending state s′, and label x/y. An FSM M
is deterministic if for all s ∈ S and x ∈ X we have that
M has at most one transition of the form (s, x, y, s′).
An FSM can be represented by a directed graph.
Figure 1 gives two FSMs with state sets {s1, s2, s3, s4}, in-
puts {x1, x2, x3, x4, x5, x6}, and outputs {y1, y2}. A node
represents a state and a directed edge from a node
labelled s to a node labelled s′ with label x/y represents
the transition τ = (s, x, y, s′).
An input x is defined at state s if there exists a transition
of the form (s, x, y, s′) ∈ h and we then use the notation
δ(s, x) = s′ and λ(s, x) = y. If x is not defined at state
s then we write δ(s, x) = e (for a special symbol e ∉ S)
and λ(s, x) = ε. Thus δ is a function from S×X to S∪{e}
and λ is a function from S ×X to Y ∪ {ε}. If for every
state s ∈ S and input x ∈ X , there exists a transition of
the form (s, x, y, s′) then M is a complete FSM, otherwise
it is a partial FSM. In this paper, we consider partial
deterministic FSMs and from now on we use the term
FSM to refer to partial deterministic FSMs.
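To make the definitions of δ and λ concrete, the following minimal Python sketch stores the transition set h as a dictionary. The three-state FSM EXAMPLE_H, the use of None for the symbol e, and the use of the empty string for ε are illustrative assumptions; none of them is taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

# h maps (state, input) -> (output, next state). A missing key means the input
# is not defined at that state; one entry per key keeps the FSM deterministic.
Transitions = Dict[Tuple[int, str], Tuple[str, int]]

# Hypothetical partial FSM with states 1..3, inputs "a", "b" and outputs "0", "1".
EXAMPLE_H: Transitions = {
    (1, "a"): ("0", 2), (1, "b"): ("0", 1),
    (2, "a"): ("0", 3), (2, "b"): ("0", 2),
    (3, "a"): ("1", 1),  # "b" is not defined at state 3
}


def delta(h: Transitions, s: int, x: str) -> Optional[int]:
    """Next state δ(s, x); None stands in for the special symbol e."""
    t = h.get((s, x))
    return None if t is None else t[1]


def lam(h: Transitions, s: int, x: str) -> str:
    """Output λ(s, x); the empty string stands in for ε."""
    t = h.get((s, x))
    return "" if t is None else t[0]


def is_complete(h: Transitions, states: List[int], inputs: List[str]) -> bool:
    """True if every input is defined at every state, False for a partial FSM."""
    return all((s, x) in h for s in states for x in inputs)
```

In this sketch `is_complete(EXAMPLE_H, [1, 2, 3], ["a", "b"])` is false because input "b" has no transition from state 3.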
We use juxtaposition to denote concatenation: if x1, x2,
and x3 are inputs then x1x2x3 is an input sequence. We
use ε to represent the empty sequence and use pref(.)
to denote the set of all the prefixes of parameter (.)
longer than 0. For example, pref(x1x2) = {x1, x1x2} and























































Figure 1: CS for M1 has 4(4 − 1)/2 elements W =
{{x1}, {x2}, {x3}, {x4}, {x5}, {x6}}. The length of sepa-
rating sequence for states s1, s2 of M2 is 4(4− 1)/2 and
is x1x2x3x4x5x6.
The behaviour of an FSM M is defined in terms
of the labels of walks of M. A walk is a sequence
τ̄ = (s1, x1, y1, s2)(s2, x2, y2, s3) . . . (sm, xm, ym, sm+1) of
consecutive transitions. This walk has starting state
s1, ending state sm+1, and label x1/y1 x2/y2 . . . xm/ym.
z̄ = x1/y1 x2/y2 . . . xm/ym is an input/output sequence,
x1x2 . . . xm is the input portion (in(z̄)) of z̄, and y1y2 . . . ym
is the output portion (out(z̄)) of z̄. For example, M1 has the
walk (s2, x4, y1, s3)(s3, x6, y1, s4) that has starting state s2
and ending state s4. The label of this walk is x4/y1 x6/y1
and this has input portion x4x6 and output portion y1y1.
FSM M is strongly connected if for every ordered pair
(s, s′) of states of M there is a walk with starting state s
and ending state s′.
An input sequence x̄ = x1x2 . . . xm is a defined input
sequence for state s if M has a walk with starting state s
and label z̄ such that in(z̄) = x1 . . . xm. For example,
since (s2, x4, y1, s3)(s3, x6, y1, s4) is a walk of M1, we
have that x4x6 is a defined input sequence for s2. We will
let LIM (s) denote the set of defined input sequences for
s and LIM (S′) = ∩s∈S′ LIM (s) (the set of input sequences
defined in all states in S′). In an abuse of notation we
use λ and δ with input sequences: if x and x̄ denote an
input and an input sequence respectively such that xx̄ is
a defined input sequence for s then δ(s, ε) = s, δ(s, xx̄) =
δ(δ(s, x), x̄), λ(s, ε) = ε, and λ(s, xx̄) = λ(s, x)λ(δ(s, x), x̄).
For example, in M1 we have that δ(s2, x4x6) = s4 and
λ(s2, x4x6) = y1y1. Note that at times we let x̄/ȳ denote
the input/output sequence z̄ such that x̄ = in(z̄) and
ȳ = out(z̄). Given set S′ of states and x̄ ∈ LIM (S′) we let
δ(S′, x̄) = ∪s∈S′{δ(s, x̄)} and λ(S′, x̄) = ∪s∈S′{λ(s, x̄)}.
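The extension of δ and λ to input sequences can be sketched as a single walk over the transitions. The function names and the example FSM below are hypothetical; returning None for an undefined input sequence is a representation choice of this sketch, not something prescribed by the paper.

```python
from typing import Dict, Iterable, Optional, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]

# Hypothetical partial FSM: (state, input) -> (output, next state).
EXAMPLE_H: Transitions = {
    (1, "a"): ("0", 2), (1, "b"): ("0", 1),
    (2, "a"): ("0", 3), (2, "b"): ("0", 2),
    (3, "a"): ("1", 1),  # "b" is not defined at state 3
}


def run(h: Transitions, s: int, seq: str) -> Optional[Tuple[int, str]]:
    """Return (δ(s, seq), λ(s, seq)), or None if seq is not a defined
    input sequence for s. The empty sequence gives (s, "")."""
    out = []
    for x in seq:
        if (s, x) not in h:
            return None          # the walk leaves the defined behaviour
        y, s = h[(s, x)]
        out.append(y)
    return s, "".join(out)


def defined_in_all(h: Transitions, states: Iterable[int], seq: str) -> bool:
    """True if seq belongs to the set of input sequences defined in every
    state of the given set."""
    return all(run(h, s, seq) is not None for s in states)
```

For instance, "ab" is defined from state 3 in this example (the first input moves the FSM to state 1, where "b" is defined), whereas "b" on its own is not.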
We let LM (s) denote the set of labels of walks of
M with starting state s and so LM (s) = {x̄/ȳ|x̄ ∈
LIM (s) ∧ ȳ = λ(s, x̄)}. For example, LM1(s2) contains the
input/output sequence x4/y1 x6/y1. FSM M defines the
language L(M) = LM (s0) of labels of walks with starting
state s0. Given S′ ⊆ S we let LM (S′) = ∪s∈S′LM (s).
States s, s′ are equivalent if LM (s) = LM (s′) and FSMs M
and N are equivalent if L(M) = L(N).
Given a set X we let X∗ denote the set of finite
sequences of elements of X and let Xk denote the set of
sequences in X∗ of length k. An input sequence x̄ ∈ X∗
is a separating sequence for states s and s′ if x̄ is a defined
input sequence for s and s′ and λ(s, x̄) ≠ λ(s′, x̄). Consider,
for example, M1 (Figure 1a) and states s1 and s2.
Then x1 is a separating sequence for this pair since x1 is
defined in both states and λ(s1, x1) = y1 ≠ y2 = λ(s2, x1).
In contrast, no input sequence starting with input x4 can
be a separating sequence for a pair that contains s1 since
x4 is not defined in s1. If every pair of states of FSM M
has a separating sequence then M is a minimal FSM.
In this work, we consider only minimal FSMs. If an
FSM is not minimal then a minimal FSM can be formed
by merging pairs of compatible states in an iterative
manner, where two states are compatible if they cannot
be distinguished. It is possible to check whether two
states are compatible in polynomial time and so it is also
possible to construct a minimal FSM M’ from a partial
FSM M in polynomial time [45] 2. The main restriction
we make is that we consider deterministic FSMs. The
main focus of FSM-based testing has been on determin-
istic FSMs and these have been used in areas such as
hardware [47], protocol conformance testing [5], [8], [9],
[11], object-oriented systems [12], web services [13], [48],
[49], and general software [50].
There are a number of approaches to state verification
and this paper focuses on two that are applicable to
any minimal FSM. A characterising set (CS) is a set of
input sequences that, between them, distinguish all of
the states of M.
Definition 2.1: A CS for FSM M = (S, s0, X, Y, h) is a
set W ⊆ X∗ such that for all si, sj ∈ S with i ≠ j there
exists x̄ ∈ W such that a prefix of x̄ distinguishes si and
sj .
We use SM = {(si, sj)|si, sj ∈ S, i < j} to denote the
set of distinct pairs (si, sj) of states of M. The restriction
that i < j ensures that a pair of states is represented
exactly once. We will use 𝒮 to denote SM and γij will
denote a pair (si, sj) in SM . Since |S| = n, set 𝒮 contains
n(n−1)/2 pairs. A CS W is a set of input sequences that
distinguish the pairs in 𝒮.
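Definition 2.1 can be checked mechanically. The sketch below assumes that "distinguishes" means "is a separating sequence for", so a prefix must be defined in both states and yield different outputs; the helper names and the example FSM are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Optional, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]

# Hypothetical partial FSM: (state, input) -> (output, next state).
EXAMPLE_H: Transitions = {
    (1, "a"): ("0", 2), (1, "b"): ("0", 1),
    (2, "a"): ("0", 3), (2, "b"): ("0", 2),
    (3, "a"): ("1", 1),  # "b" is not defined at state 3
}


def output(h: Transitions, s: int, seq: str) -> Optional[str]:
    """λ(s, seq), or None if seq is not a defined input sequence for s."""
    out = ""
    for x in seq:
        if (s, x) not in h:
            return None
        y, s = h[(s, x)]
        out += y
    return out


def separates(h: Transitions, s: int, t: int, seq: str) -> bool:
    """True if seq is defined in both states and the outputs differ."""
    a, b = output(h, s, seq), output(h, t, seq)
    return a is not None and b is not None and a != b


def is_characterising_set(h: Transitions, states: List[int], W: Iterable[str]) -> bool:
    """Every pair of states must be separated by a non-empty prefix of some w in W."""
    W = list(W)
    return all(
        any(separates(h, s, t, w[:k]) for w in W for k in range(1, len(w) + 1))
        for s, t in combinations(states, 2)
    )
```

In this example the single sequence "aa" is a CS: its prefix "a" separates state 3 from states 1 and 2, and "aa" itself separates states 1 and 2. The set {"b"} is not a CS because "b" is undefined at state 3.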
When considering a minimal complete FSM M with
state set S (|S| = n), a set A of input sequences defines
an equivalence relation ∼A over S, with two states s, s′
being equivalent under ∼A if λ(s, x̄) = λ(s′, x̄) for all
x̄ ∈ A. Let us suppose that we add an input sequence
x̄ to A, to form A′, and this makes A more effective
at distinguishing the states of M : there exist s, s′ ∈ S
such that s ∼A s′ and λ(s, x̄) ≠ λ(s′, x̄). Then ∼A′ has
more equivalence classes than ∼A. Since an equivalence
relation ∼A on S can have at most n equivalence classes,
it is thus straightforward to see that, starting from the
empty set, we can add at most n − 1 input sequences if
each input sequence is to make the set more effective.
2. The problem of constructing a smallest such M’ is NP-hard [46].
Thus, M has a CS with at most n−1 sequences. Further,
we can use the following result, which is straightforward
to prove, to deduce that there is such a CS where each
input sequence is of length at most n− 1.
Proposition 2.1: If input sequence x̄ distinguishes
states s, s′, no proper prefix of x̄ distinguishes these
states, and x̄ = x̄′x̄′′ for input sequences x̄′ and x̄′′ then
x̄′′ distinguishes states δ(s, x̄′) and δ(s′, x̄′).
For minimal partial FSMs we have different bounds
since a set A of input sequences need not define an
equivalence relation on the states of a partial FSM M
if different input sequences from A are defined from
different states of M . In particular, it is possible to
construct an FSM M such that for each pair of states s, s′
there is a unique input x such that x is the only input
defined in both s and s′ (and so a characterising set must
contain at least n(n − 1)/2 input sequences). However,
for a minimal partial FSM M we require at most one
input sequence for each pair of states in SM and so at
most n(n−1)/2 sequences. We can use Proposition 2.1 to
deduce that we require input sequences of length at most
n(n−1)/2 and so a CS requires O(n⁴) memory. Figure 1
contains two FSMs. FSM M1 has n(n− 1)/2 inputs and
for every input xi there is a pair of states s, s′ such that
xi is the only input that is defined in both s and s′ and
so a CS must contain an input sequence that starts with
xi. For FSM M2 there is a pair of states whose shortest
separating sequence is of length n(n− 1)/2.
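The memory bound follows by multiplying the two worst-case quantities above. The following worked step is a sketch of that arithmetic, with n = 4 (the size of the FSMs in Figure 1) used as an illustration:

```latex
\[
\underbrace{\frac{n(n-1)}{2}}_{\text{number of sequences}}
\times
\underbrace{\frac{n(n-1)}{2}}_{\text{maximum length}}
= \frac{n^{2}(n-1)^{2}}{4} = O(n^{4}),
\qquad
n = 4:\ 6 \times 6 = 36 \text{ input symbols.}
\]
```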
It may not be necessary to execute all sequences from
a CS to distinguish a state s from all other states [25].
Definition 2.2: A state identifier (SI) for a state si of FSM
M = (S, s0, X, Y, h) is a set Hi ⊆ X∗ such that for all sj ∈
S \ {si}, there exists x̄ ∈ Hi such that x̄ is a separating
sequence for si and sj .
This leads to HSIs.
Definition 2.3: A set of Harmonised State Identifiers
(HSIs) for FSM M = (S, s0, X, Y, h) is a set of state identifiers
H = {H1, H2, . . . , Hn} such that for all si, sj ∈ S
with i ≠ j, there exists x̄ ∈ pref(Hi) ∩ pref(Hj) that is a
separating sequence for si and sj .
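The harmonisation condition of Definition 2.3 can be checked pair by pair. This is a minimal sketch: the dictionary H (state to set of input sequences), the helper names, and the example FSM are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Iterable, Optional, Set, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]

# Hypothetical partial FSM: (state, input) -> (output, next state).
EXAMPLE_H: Transitions = {
    (1, "a"): ("0", 2), (1, "b"): ("0", 1),
    (2, "a"): ("0", 3), (2, "b"): ("0", 2),
    (3, "a"): ("1", 1),  # "b" is not defined at state 3
}


def output(h: Transitions, s: int, seq: str) -> Optional[str]:
    """λ(s, seq), or None if seq is not a defined input sequence for s."""
    out = ""
    for x in seq:
        if (s, x) not in h:
            return None
        y, s = h[(s, x)]
        out += y
    return out


def separates(h: Transitions, s: int, t: int, seq: str) -> bool:
    a, b = output(h, s, seq), output(h, t, seq)
    return a is not None and b is not None and a != b


def prefixes(seqs: Iterable[str]) -> Set[str]:
    """pref(.) applied to a set: all non-empty prefixes of its members."""
    return {w[:k] for w in seqs for k in range(1, len(w) + 1)}


def is_hsi(h: Transitions, H: Dict[int, Set[str]]) -> bool:
    """For every pair of states, some common prefix of their identifiers
    must be a separating sequence for that pair."""
    for s, t in combinations(sorted(H), 2):
        common = prefixes(H[s]) & prefixes(H[t])
        if not any(separates(h, s, t, w) for w in common):
            return False
    return True
```

With H1 = H2 = {"aa"} and H3 = {"a"} the condition holds for the example FSM, and state 3 needs only the shorter sequence. This illustrates why HSIs can be cheaper to apply than a CS.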
As suggested in [25] one can derive HSIs from a CS.
This provides an upper bound of O(n⁴) on the size of HSIs.
2.2 Previous HSI generation method
The HSI construction algorithm given in [25] takes an
FSM M and CS W for M and in the first step the
algorithm constructs an SI for the initial state (s0). To
do so the algorithm generates a subset H0 of W such
that for all sj ∈ S \ {s0} there exists at least one input
sequence x̄ in H0 such that x̄ is a separating sequence
for s0 and sj . The remaining SIs are computed in the
second phase. For state si, the algorithm finds a subset
Hi of W such that
1) For j < i, there exists x̄ ∈ Hi and x̄′ ∈ Hj such
that some input sequence in pref(x̄) ∩ pref(x̄′)
distinguishes si and sj .
2) For all sj with i < j, there exists x̄ ∈ Hi and a
prefix x̄′ of x̄ with x̄′ ∈ LIM (sj) (the set of input
sequences defined in sj) such that x̄′ distinguishes
si and sj .
3 NOVEL CS GENERATION ALGORITHM
The previously devised CS generation algorithm, for
partial FSMs, has exponential worst case execution time.
In this section we devise a polynomial time algorithm.
The first step of the proposed algorithm computes
separating sequences of length one. The following is an
immediate consequence of Proposition 2.1.
Proposition 3.1: Let M be a minimal FSM. Then
there exists a pair of states (s, s′) ∈ SM and an input
x ∈ X such that x distinguishes (s, s′).
After the first step we have two sets of pairs of states:
a set P√ of pairs with separating sequences of length one
and a set P× of pairs whose separating sequences have
not yet been computed. In the second step, the algorithm
computes new separating sequences through using the
previously computed separating sequences.
Proposition 3.2: Let M be a minimal FSM, let P√ ≠
∅ be the set of pairs of states with known separating
sequences, and let P× = SM \ P√. Then there exists
(s, s′) ∈ P×, (s′′, s′′′) ∈ P√ with separating sequence x̄,
and a defined input x ∈ X for s, s′ such that xx̄ is a
separating sequence for (s, s′).
We now present terminology used in this section. A
pair-node η is a tuple (si, sj , ϑ) such that (si, sj) ∈ SM
(i < j) and ϑ ∈ X∗ is a separating sequence for si
and sj or is ε if such a separating sequence has not yet
been found. We use λ(η, x) to denote {λ(si, x), λ(sj , x)}.
If ϑ = ε, δ({si, sj}, x) = {s, s′} and there is a pair-node
η′ = (s, s′, ϑ′) with ϑ′ ≠ ε then we say that η ‘evolves
to’ (becomes) (si, sj , xϑ′). In such a situation, xϑ′ is a
separating sequence for si, sj .
The algorithm (Algorithm 1) initialises P : for each
(s, s′) ∈ SM it adds pair-node η = (s, s′, ε) (Line 1). This
requires O(n²) time. The algorithm then enters a loop
(first-loop) in which it computes separating sequences
of length one, iterating over P and X (Lines 2-4). The
first-loop requires O(n²p) time.
The algorithm then enters a while loop (main-loop)
and computes separating sequences through evolving
the elements of P . At each iteration the algorithm enters
a for-loop that iterates over P×X . For a pair-node η from
set P with ϑ = ε and input x ∈ X , the algorithm checks
whether η evolves to a pair-node η′ (with x) that has
separating sequence ϑ′ ≠ ε. If so, the algorithm uses the
separating sequence ϑ = xϑ′ (Lines 6-8). If, after the for-loop,
no pair-node has evolved, the algorithm declares
that M is not minimal (Lines 9-10) and otherwise it
continues. If all items in P have non-empty separating
sequences, the algorithm returns P and terminates. Each
pass of the for-loop requires O(n²p) time and, since every
pass that does not end the algorithm evolves at least one
of the O(n²) pairs, the main-loop requires O(n⁴p) time.
The main-loop is the most expensive component.
Theorem 3.1: If M is an FSM with n states and p inputs
then Algorithm 1 requires O(n⁴p) time.
Algorithm 1: Sequential CS generation algorithm
Input: An FSM M = (S, s0, X, Y, h) where |S| = n and n > 1
Output: CS for M if M is minimal
begin
1   P ← {(s1, s2, ε), (s1, s3, ε), . . . , (sn−1, sn, ε)}
    // first-loop
2   foreach η = (si, sj , ϑ) ∈ P, x ∈ X do
3       if |λ(η, x)| > 1 and ε ∉ λ(η, x) then
4           ϑ ← x
    // main-loop
5   while there exists a pair-node η = (si, sj , ϑ) ∈ P such that ϑ = ε do
        // for-loop
6       foreach η = (si, sj , ϑ) ∈ P such that ϑ = ε and x ∈ X with ε ∉ λ(η, x) do
7           if {s, s′} = δ({si, sj}, x), η′ = (s, s′, ϑ′) ∈ P and ϑ′ ≠ ε then
8               ϑ ← xϑ′
9       if no pair-node is updated then
10          Declare that M is not minimal and terminate.
11  Return P
The following results, which demonstrate that Algo-
rithm 1 is correct, follow immediately from the construc-
tion of the algorithm and Proposition 2.1.
Theorem 3.2: If Algorithm 1 is given a minimal FSM
M then it returns a set P .
Theorem 3.3: If Algorithm 1 returns P when given
minimal FSM M then P defines a CS for M .
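The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' implementation: the dictionary encoding of h, the function name, and the use of an exception for the "not minimal" case are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]


def sequential_cs(h: Transitions, states: List[int], inputs: List[str]) -> Dict[Tuple[int, int], str]:
    """Return a separating sequence for every pair of states.
    h maps (state, input) -> (output, next state); "" stands for ε."""
    # Line 1: one pair-node per pair, with no separating sequence yet.
    theta = {pair: "" for pair in combinations(states, 2)}

    def defined(si: int, sj: int, x: str) -> bool:
        return (si, x) in h and (sj, x) in h

    # first-loop (Lines 2-4): separating sequences of length one.
    for (si, sj) in theta:
        for x in inputs:
            if defined(si, sj, x) and h[(si, x)][0] != h[(sj, x)][0]:
                theta[(si, sj)] = x
                break

    # main-loop (Lines 5-10): evolve unresolved pair-nodes.
    while any(t == "" for t in theta.values()):
        updated = False
        for (si, sj), t in list(theta.items()):
            if t != "":
                continue
            for x in inputs:
                if not defined(si, sj, x):
                    continue
                a, b = h[(si, x)][1], h[(sj, x)][1]
                if a == b:
                    continue  # both states merge, so x cannot help
                succ = theta.get((a, b)) or theta.get((b, a))
                if succ:
                    theta[(si, sj)] = x + succ  # Line 8: ϑ ← xϑ′
                    updated = True
                    break
        if not updated:
            raise ValueError("FSM is not minimal")
    return theta
```

On a hypothetical three-state FSM in which input "a" separates state 3 from the other two states but states 1 and 2 agree on every single input, the first-loop resolves two pairs and one pass of the main-loop resolves the third by prepending "a" to an already known sequence.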
4 PARALLEL CS AND HSI ALGORITHMS
In this section we first provide the approach employed
to address scalability problems and then the parallel CS
and HSI generation algorithms.
4.1 Design Choices
There are two main strategies for designing GPU algorithms: the Fat Thread strategy and
the Thin Thread strategy [51]. The fat thread approach
minimises data access latency by having threads process
a large amount of data on shared memory [51]. However,
the number of threads may be restricted by the available
shared memory and this may reduce performance.
In contrast, the thin thread approach aims to maximise
the number of threads by storing only small amounts
of data in shared memory. Although global memory
transactions are relatively slow, it has been reported
that the high global memory transaction latency can be
hidden when there are many threads [51]. In this work
we employed the thin thread strategy.
4.2 Parallel CS algorithm
4.2.1 Parallel CS algorithm: parallel design
To implement a thin thread based algorithm we propose
what we call a conditional pair-node vector (CPn-vector for
short), and later we will see that a scalable CS generation
algorithm can be based on this. A CPn-vector P captures
information related to pair-nodes plus additional infor-
mation that will allow us to evolve its elements.
Definition 4.1: A conditional pair-node vector (CPn-
vector) P for an FSM M = (S, s0, X, Y, h) with n states
is a vector with n(n− 1)/2 conditional pair-nodes. Each
element ρ in the vector P is a tuple (f, η) for a pair-node
η = (s, s′, ϑ), and a flag f ∈ {T, F} that states whether
states s, s′ are distinguished by ϑ.
Let ρ = (F, (s, s′, ε)) and ρ′ = (T, (s′′, s′′′, ϑ′)) be CPns.
As in the case of pair-nodes, we say ρ ‘evolves’ to ρ′
with a defined input x, if δ({s, s′}, x) = {s′′, s′′′}. After
the evolution, ρ becomes (T, (s, s′, xϑ′)). As the elements
of a CPn-vector evolve, we update the flag values.
The following are based on Theorem 3.3 and the
definition of evolution of CPns and show how CPn-
vectors are related to the states distinguished.
Lemma 4.1: Given CPn-vector P , if P contains ρ =
(T, (s, s′, ϑ)) then ϑ is a separating sequence for (s, s′).
Theorem 4.1: Given FSM M = (S, s0, X, Y, h), if the
flags of all elements of CPn-vector P of M are set to
true then the input sequences retrieved from pair-nodes
of P define a Characterising Set for M.
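A CPn-vector can be laid out as a flat array so that the element for a given pair is found by index arithmetic rather than by search. The sketch below uses 0-based state numbers and a row-major index formula; both are assumptions of this sketch, as are the class and function names (the paper's own layout is described in its appendix).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]


@dataclass
class CPn:
    flag: bool   # T/F: is (s, t) known to be distinguished by theta?
    s: int
    t: int
    theta: str   # separating sequence; "" stands for ε


def pair_index(i: int, j: int, n: int) -> int:
    """Position of the pair {i, j} (0-based states, i != j) in the vector."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def make_cpn_vector(n: int) -> List[CPn]:
    """n(n-1)/2 elements, all unflagged, in the order used by pair_index."""
    return [CPn(False, i, j, "") for i in range(n) for j in range(i + 1, n)]


def try_evolve(vec: List[CPn], h: Transitions, n: int, k: int, x: str) -> bool:
    """Evolve element k with input x if x is defined in both states and
    leads to an element whose flag is already T."""
    rho = vec[k]
    if rho.flag or (rho.s, x) not in h or (rho.t, x) not in h:
        return False
    a, b = h[(rho.s, x)][1], h[(rho.t, x)][1]
    if a == b:
        return False
    target = vec[pair_index(a, b, n)]
    if not target.flag:
        return False
    vec[k] = CPn(True, rho.s, rho.t, x + target.theta)
    return True
```

Because each element is addressed by a computed index, a thread that handles element k only needs the transition table and the single successor element it reads, which fits the thin thread strategy.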
4.2.2 Parallel CS algorithm: an overview
Algorithm 2: Parallel CS generation algorithm, highlighted
lines are executed in parallel.
Input: An FSM M = (S, s0, X, Y, h) where |S| = n and n > 1
Output: CS for M if M is minimal
begin
1   P ← {(F, (s1, s2, ε)), (F, (s1, s3, ε)), . . . , (F, (sn−1, sn, ε))}
    // first-loop
2   foreach x ∈ X do
3       if there exists ρ = (F, (si, sj , ϑ)) ∈ P such that ϑ = ε, |λ(η, x)| > 1 and ε ∉ λ(η, x) then
4           ρ ← (T, (si, sj , x))
    // main-loop
5   while there exists an element ρ = (F, (si, sj , ϑ)) ∈ P do
6       foreach ρ = (F, (si, sj , ϑ)) ∈ P and x ∈ X do
7           if {s, s′} = δ({si, sj}, x), (T, (s, s′, ϑ′)) ∈ P and ϑ′ ≠ ε then
8               ρ ← (T, (si, sj , xϑ′))
9       if no conditional pair-node ρ is altered then
10          Declare that M is not minimal and terminate.
11  Return P
The parallel-CS algorithm first initialises the CPn-
vector P in parallel (Line 1). A naïve implementation
would require O(n(n − 1)/2) time to initialise P . How-
ever, as we will see later, the initialisation of P can be
achieved in O(n) time if Γ ≥ n, where Γ is the number
of threads used by the GPU.
Afterwards, the algorithm enters the first-loop in
which it finds all elements that are distinguished by a
single input (Lines 2-4). This step takes O(n(n−1)p/(2Γ))
time. The algorithm then enters the main-loop. In the
main-loop the algorithm applies all inputs to elements
of P whose flag is F and evolves elements. If an
evolution to a distinguished pair of states is possible
then the flag values are set to T (Lines 6-8); this can be
achieved in O(n(n−1)p/(2Γ)) time. If no elements of P
have changed then the algorithm terminates, otherwise
it continues (Lines 9-10). This step may also require
O(n(n − 1)p/(2Γ)) time. As the length of a separating
sequence is bounded above by n(n−1)/2, the main-loop
iterates O(n²) times and hence the algorithm requires
O(n⁴p/Γ) time.
Theorem 4.2: If Algorithm 2 is given an FSM with n
states and p inputs and there are Γ threads then it
requires O((n⁴p)/Γ) time.
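The round structure of Algorithm 2 can be simulated sequentially. In the sketch below each call to work_item plays the role of one thread: it reads only a snapshot of the vector taken at the start of the round, and all results are written back together. This is plain Python rather than GPU code, and the names, the dictionary encoding, and the snapshot mechanism are assumptions made to illustrate the idea.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]
Pair = Tuple[int, int]


def parallel_style_cs(h: Transitions, states: List[int], inputs: List[str]):
    """Return (separating sequence per pair, number of rounds executed)."""
    pairs = list(combinations(states, 2))
    theta: Dict[Pair, str] = {p: "" for p in pairs}  # "" plays the role of flag F

    def work_item(p: Pair, snapshot: Dict[Pair, str], first_loop: bool) -> str:
        """What one thread would compute for pair p, reading only the snapshot."""
        si, sj = p
        for x in inputs:
            if (si, x) not in h or (sj, x) not in h:
                continue
            (yi, a), (yj, b) = h[(si, x)], h[(sj, x)]
            if first_loop:
                if yi != yj:
                    return x                      # separated by a single input
            elif a != b:
                succ = snapshot.get((a, b)) or snapshot.get((b, a))
                if succ:
                    return x + succ               # evolve to a flagged element
        return ""

    rounds, first_loop = 0, True
    while any(t == "" for t in theta.values()):
        snapshot = dict(theta)
        results = {p: work_item(p, snapshot, first_loop) for p in pairs if not snapshot[p]}
        changed = {p: t for p, t in results.items() if t}
        if not changed and not first_loop:
            raise ValueError("FSM is not minimal")
        theta.update(changed)                     # all writes land after the round
        first_loop = False
        rounds += 1
    return theta, rounds
```

Since every work item in a round is independent of the others, the pairs can be distributed over Γ threads, which is where the division by Γ in Theorem 4.2 comes from.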
4.3 Parallel HSI algorithm
4.3.1 Parallel HSI algorithm: parallel design
The existing HSI generation algorithm takes as input a
CS for the FSM M . However, we require O(n⁴) space to
store a CS for an FSM with n states. This has at least two
practical implications: (1) it may be impossible to keep
data in the main memory and (2) threads will process a
very large amount of data.
Recently, Hierons and Türker proposed a heuristic to
construct HSIs for complete FSMs that overcomes this
bottleneck [52]. Instead of using an existing CS, they
construct HSIs from incomplete distinguishing sequences.
Their algorithm keeps a list (Q) of pairs of states (a list
for the items of set SM ) such that at each iteration an input
sequence that removes the maximum number of pairs
from Q is selected and the algorithm terminates when Q
is empty. In the Parallel-HSI algorithm, we adopted this
strategy by using state-trace vectors. A state-trace vector
contains the information regarding which pairs from SM
have been distinguished by an input sequence x̄.
Definition 4.2: A state-trace vector (ST-vector) D for
an FSM M = (S, s0, X, Y, h) with n states is a vector
associated with input sequence x̄ ∈ X∗ and having n
elements such that: an element d of D is a tuple (si, sc, ȳ)
such that si is an initial state, sc = δ(si, x̄) is a current state,
and ȳ = λ(si, x̄) is an output sequence.
We may need to construct a set of ST-vectors since
there may not be a single input sequence that distin-
guishes all pairs of states.
Theorem 4.3: A set of state-trace vectors for an FSM
M defines an HSI for M if for every pair of states si, sj
there exists a state-trace vector D associated with input
sequence x̄ such that there exists x̄′ ∈ pref(x̄) that is a
separating sequence for si and sj .
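A state-trace vector and its evolution can be sketched as below. Each element is a list [initial state, current state, output so far]; marking the current state as None once a prefix is undefined, and collecting the distinguished pairs in a set, are choices of this sketch and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]


def evolve_st_vector(h: Transitions, states: List[int], seq: str):
    """Apply seq from every state and report which pairs of initial states
    are separated by some prefix of seq that is defined in both."""
    # One element per state: [initial state, current state, output sequence].
    D: List[List] = [[s, s, ""] for s in states]
    distinguished: Set[Tuple[int, int]] = set()
    for x in seq:
        for d in D:
            if d[1] is not None and (d[1], x) in h:
                y, nxt = h[(d[1], x)]
                d[1], d[2] = nxt, d[2] + y
            else:
                d[1] = None  # this prefix is not defined from d's initial state
        alive = [d for d in D if d[1] is not None]
        for k, di in enumerate(alive):
            for dj in alive[k + 1:]:
                if di[2] != dj[2]:
                    distinguished.add((di[0], dj[0]))
    return D, distinguished
```

The comparison after each input corresponds to checking a prefix of x̄, so a pair is reported as soon as any prefix defined in both states yields different outputs.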
4.3.2 Parallel-HSI algorithm: an overview
We now give a brief overview of the algorithm and in the
Appendix we show how the Parallel-HSI algorithm was
implemented using GPUs. For an FSM with n states, the
parallel HSI algorithm uses a boolean valued vector B of
length n(n−1)/2. The elements of B correspond to pairs
in SM : the ith pair of SM corresponds to the ith item of B and
is set to 1 if these states have been distinguished. Initially
all elements in B are set to 0 and when all elements are
1 the algorithm terminates.
The parallel HSI algorithm (Algorithm 3) has three
nested loops. The first loop (main-loop) iterates as long
as at least one element in B is 0. In every iteration, the
algorithm checks whether the upper-bound on the length
of the input sequence x̄ has been reached. If so, the al-
gorithm terminates. Otherwise, the algorithm enters the
middle-loop and increments ℓ (initially ℓ = 0). The middle-loop
iterates as long as there exists an unprocessed input
sequence of length ℓ and not all elements in B are 1.
In the middle-loop the algorithm first resets the ST-vector
D. It then receives the next input sequence x̄
of length ℓ and evolves elements in D with x̄. Since
the FSM is partially specified, the algorithm evolves an
element only if x̄ is defined at the associated state. Then the
algorithm executes the inner loop (for-loop). The for-loop
iterates over the states and for state si it compares the
output sequence obtained from si and all other states (in
parallel). If there exists a state sj that produces an output
sequence that is different from that produced from si,
then the algorithm writes 1 to the corresponding element
of B (corresponding to γij or γji). Later it writes x̄ to the
corresponding SIs (the sets Hi) of distinguished pairs. For state
si there are n − 1 pairs in SM (and in B) with state si and
so the size of Hi cannot be larger than n − 1.
The process of finding a separating sequence for a pair
of states might require all possible input sequences to be
considered. Therefore the worst case time complexity of
the algorithm is exponential.
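The enumeration strategy described above can be sketched sequentially as follows. The set pending plays the role of the zero entries of B; the function names and the dictionary encoding of the FSM are hypothetical, and the comparisons that the paper performs in parallel are done here in an ordinary loop.

```python
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

Transitions = Dict[Tuple[int, str], Tuple[str, int]]


def outputs_by_prefix(h: Transitions, s: int, seq: str) -> List[str]:
    """λ(s, prefix) for each non-empty prefix of seq that is defined at s."""
    outs, out = [], ""
    for x in seq:
        if (s, x) not in h:
            break
        y, s = h[(s, x)]
        out += y
        outs.append(out)
    return outs


def brute_force_hsi(h: Transitions, states: List[int], inputs: List[str]) -> Dict[int, Set[str]]:
    """Enumerate input sequences by increasing length until all pairs are separated."""
    n = len(states)
    pending = set(combinations(states, 2))   # pairs whose entry in B is still 0
    H: Dict[int, Set[str]] = {s: set() for s in states}
    for length in range(1, n * (n - 1) // 2 + 1):
        for seq in map("".join, product(inputs, repeat=length)):
            traces = {s: outputs_by_prefix(h, s, seq) for s in states}
            for si, sj in sorted(pending):
                # zip stops at the longest prefix defined in both states
                if any(a != b for a, b in zip(traces[si], traces[sj])):
                    pending.discard((si, sj))
                    H[si].add(seq)
                    H[sj].add(seq)
            if not pending:
                return H
    raise ValueError("FSM is not minimal")
```

The loop over `product(inputs, repeat=length)` visits up to p^ℓ sequences for each length ℓ, which is the source of the exponential worst case; only the vector of flags and the current traces need to be kept in memory.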
4.3.3 Generating characterising sets from HSIs
We first show how HSIs relate to characterising sets.
Lemma 4.2: Let M = (S, s0, X, Y, h) be an FSM and
H be a set of harmonised state identifiers for M . Then
⋃Hi∈H Hi defines a characterising set of M .
Following the intuition provided by Lemma 4.2 we
can construct a CS by using the Parallel-HSI algorithm
through replacing Line 18 with the following line:
18 Return $\bigcup_{H_i \in H} H_i$.
It is possible to parallelise the process of taking the
union of harmonised state identifiers through two steps.
In the first step we sort all the input sequences in parallel
and in the second step we pick unique state identifiers
to form the CS. We present details in the Appendix.
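The two steps can be sketched sequentially as follows. This is a minimal illustration under the assumption that input sequences are stored as vectors of integers; `std::sort` and `std::unique` stand in for the parallel sort and selection steps, and the name `unionOfIdentifiers` is hypothetical.

```cpp
#include <algorithm>
#include <vector>

using Sequence = std::vector<int>;  // an input sequence, one int per input symbol

// Union of the state identifiers: concatenate, sort, and drop duplicates.
std::vector<Sequence> unionOfIdentifiers(const std::vector<std::vector<Sequence>>& H) {
    std::vector<Sequence> cs;
    for (const auto& Hi : H) cs.insert(cs.end(), Hi.begin(), Hi.end());
    std::sort(cs.begin(), cs.end());                        // step 1: lexicographic sort
    cs.erase(std::unique(cs.begin(), cs.end()), cs.end());  // step 2: keep one copy of each
    return cs;
}
```

Sorting first places equal sequences next to one another, so that the second step only has to compare neighbouring elements.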
5 EXPERIMENTS
5.1 Experimental Design
The experiments had two main aims.
1) To explore how well the algorithms scaled. There-
fore, we recorded the time taken.
2) To explore properties of the CSs and HSIs con-
structed. We recorded the number of sequences and
the lengths of these sequences since fewer/shorter
sequences lead to cheaper testing.
Initial experiments used FSMs generated by the tool
used in [53]. This randomly assigns δ(s, x) and λ(s, x)
Algorithm 3: Parallel HSI construction algorithm,
highlighted lines are executed in parallel
Input: An FSM M = (S, s0, X, Y, h) where |S| = n and n > 1
Output: HSI for M
begin
 1  H ← {H0 ← ∅, H1 ← ∅, . . . , Hn−1 ← ∅}
    // Construct vector D and a boolean vector B of size |S| such that
    // the ith item of B corresponds to the ith pair of S.
 2  Construct D with n elements such that x̄ ← ε and each element di
    is associated with initial/current state si and ȳ ← ε. Construct a
    boolean vector B of length n(n − 1)/2 whose elements are 0.
 3  Execute ← T, ℓ ← 0
    // (main-loop)
 4  while Execute is T do
 5      if ℓ + 1 ≤ n(n − 1)/2 then
 6          ℓ ← ℓ + 1
        else
 7          Declare that FSM is not minimal and terminate
        // (middle-loop)
        while there exists unprocessed x̄ ∈ X^ℓ and Execute is T do
            // For every element of D set current state value to
            // initial state value
 8          For every element d of D, sc ← si
            // Replace x̄ of D with a new input sequence x̄′ retrieved
            // from X^ℓ
 9          x̄ ← x̄′ for some x̄′ ∈ X^ℓ that has not yet been used
            // Evolve D with x̄′
10          For every element d of D, d ← evolve(d, x̄)
            // Analyse the outcome of application of x̄.
11          foreach si ∈ S do
                // Mark distinguished states with x̄ and store x̄
12              if ∃di, dj ∈ D with i ≠ j and B[γij] = 0 and
                   x̄′ ∈ pref(x̄) that is a separating sequence for si, sj
                then
13                  Hi ← Hi ∪ {x̄}
14                  Hj ← Hj ∪ {x̄}
15                  Let index denote the index of pair γij on S
                    B[index] ← 1
16          if all elements in B are set to 1 then
17              Execute ← F
18  Return H.
for each s ∈ S and x ∈ X , discarding the FSM if it is
not strongly connected and minimal. After constructing
M , the tool randomly selects 1 ≤ K ≤ np and K state-
input pairs. For a pair (s, x) it erases the transition of M
with start state s and input x. If deleting a transition
τ disconnects M then τ is retained and another pair
chosen. We used the tool to construct three test suites.
In test suite one (TS1), for each $n \in \{2^6, 2^7, \ldots, 2^{17}\}$ we
had 100 FSMs with number of inputs/outputs p/q = 3/3.
These experiments explored the performance of the algo-
rithms under varying numbers of states. To see the effect
of the numbers of inputs and outputs we constructed
TS2 and TS3. In TS2 we set n = 1024 and q = 3 and for
each of $p \in \{2^4, 2^5, \ldots, 2^8\}$ we had 100 FSMs. For TS3 we
set n = 1024 and p = 3 and for each of $q \in \{2^4, 2^5, \ldots, 2^8\}$
we again had 100 FSMs. Therefore we used 2200 FSMs³.
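The transition-erasing step used when generating partial FSMs can be sketched as follows. This is not the tool from [53]: the table encoding (`Table`, with -1 for an erased transition), the connectivity check and the function names are assumptions made for illustration. Note that the sketch visits each defined transition at most once, so it may erase fewer than K transitions if no further transition can be removed without disconnecting the machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

using Table = std::vector<std::vector<int>>;  // next[s][x], -1 if undefined

// Number of states reachable from `start`, following edges forwards or backwards.
static std::size_t reachable(const Table& next, int start, bool reversed) {
    const std::size_t n = next.size();
    std::vector<char> seen(n, 0);
    std::vector<int> stack(1, start);
    seen[start] = 1;
    std::size_t count = 1;
    while (!stack.empty()) {
        const int u = stack.back();
        stack.pop_back();
        for (std::size_t v = 0; v < n; ++v) {
            if (seen[v]) continue;
            // Forward: is there an edge u -> v?  Reversed: is there an edge v -> u?
            const std::vector<int>& row = reversed ? next[v] : next[u];
            const int target = reversed ? u : static_cast<int>(v);
            if (std::find(row.begin(), row.end(), target) != row.end()) {
                seen[v] = 1;
                ++count;
                stack.push_back(static_cast<int>(v));
            }
        }
    }
    return count;
}

bool stronglyConnected(const Table& next) {
    const std::size_t n = next.size();
    return n == 0 || (reachable(next, 0, false) == n && reachable(next, 0, true) == n);
}

// Try to erase K randomly chosen transitions; a transition whose removal would
// disconnect the machine is retained. Returns the number actually erased.
std::size_t eraseTransitions(Table& next, std::size_t K, std::mt19937& rng) {
    std::vector<std::pair<int, int>> defined;  // candidate (state, input) pairs
    for (std::size_t s = 0; s < next.size(); ++s)
        for (std::size_t x = 0; x < next[s].size(); ++x)
            if (next[s][x] >= 0)
                defined.emplace_back(static_cast<int>(s), static_cast<int>(x));
    std::shuffle(defined.begin(), defined.end(), rng);

    std::size_t erased = 0;
    for (const auto& sx : defined) {
        if (erased == K) break;
        const int saved = next[sx.first][sx.second];
        next[sx.first][sx.second] = -1;
        if (stronglyConnected(next)) ++erased;
        else next[sx.first][sx.second] = saved;  // retain and try another pair
    }
    return erased;
}
```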
It is possible that FSM specifications of real life sys-
tems differ from randomly generated FSMs. Therefore
3. After an FSM is generated we do not check whether it is minimal.
However, during experiments we found that 316 FSMs were not
minimal. When such an FSM was found we simply generated a
replacement and so each test suite contained 100 minimal FSMs.
we also performed experiments on case studies: FSMs
from the ACM/SIGDA benchmarks. This is a set of FSMs
used in workshops in 1989, 1991 and 1993 [32]. In Figure 2 we
present the specifications.
Name     |X|         |S|    |S| ∗ |X|
dk27     2           7      14
bbtas    4           6      24
dk17     4           8      32
dk15     8           4      32
s298     8           135    1080
ex7      4           10     40
mc       5           15     75
dk512    2           15     30
dk16     4           27     108
ex4      64          14     21
tbk      64          16     776
planet   128         48     115
s386     128         13     1664
bbsse    128         13     1664
s1       256         18     4351
ex1      512         18     138
styr     512         30     166
sand     2048        32     184
scf      134217728   97     166
Figure 2: Properties of benchmark FSMs.
5.2 Experimental settings
Throughout this section we use SEQ-BF-CS to refer to
the sequential brute force CS generation algorithm [25]
and SEQ-BF-HSI to refer to the sequential brute force
HSI generation algorithm [25]. PAR-BF-HSI, PAR-BF-
CS, SEQ-PLY-CS, PAR-PLY-CS will denote the Parallel-
HSI algorithm, the CS generation algorithm based on the
Parallel-HSI algorithm, the sequential polynomial time
CS generation algorithm, and the parallel polynomial
time CS generation algorithm respectively. We set a
bound of n on the length of sequences considered in
the PAR-BF-CS algorithm. However, this did not affect
the results regarding scalability since there were no cases
where the PAR-BF-CS algorithm failed to find separat-
ing sequences and other algorithms returned separating
sequences of length greater than n.
We used an Intel Core 2 Extreme CPU (Q6850) with
8GB RAM and NVIDIA TESLA K40 GPU under 64 bit
Windows Server 2008 R2. During the experiments we
stored the generated CSs and HSIs on the hard disk
drive as for large FSMs the available CPU/GPU memory
becomes insufficient. The timing information does not
include the time for storing the sequences.
Finally, to perform the experiments in an acceptable
amount of time, we set 1500 seconds as the limiting time.
5.3 The effect of the number of states
Figures 3 and 4 present the mean construction times in
ms. The results are promising: on average PAR-BF-HSI
was 420 times faster than SEQ-BF-HSI and 1836 times
faster when n = 1024. On average, PAR-BF-CS was 605
times faster than SEQ-BF-CS; SEQ-PLY-CS was 3 times
Figure 3: Average time required to construct CSs (TS1).
Figure 4: Average time required to construct HSIs (TS1).
faster than SEQ-BF-CS; and PAR-PLY-CS was 6316 times
faster than SEQ-BF-CS. With n = 2048, PAR-BF-CS was
3118 times faster than SEQ-BF-CS and PAR-PLY-CS was
33241 times faster than SEQ-BF-CS.
The bottleneck for PAR-BF-HSI is the B vector, re-
quiring n(n − 1)/2 boolean variables. PAR-BF-HSI was
able to process FSMs with 131072 states, making PAR-
BF-HSI 128 times more scalable than the existing HSI
construction algorithm. The bottleneck for PAR-PLY-CS
is the P vector, requiring 2n(n(n − 1)/2) integer values
plus n(n − 1)/2 boolean values. PAR-PLY-CS was able
to process FSMs with 32768 states making PAR-PLY-CS
16 times more scalable than the sequential brute-force
characterising set generation algorithm.
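The two storage formulas can be evaluated directly, as in the sketch below; the function names are hypothetical and the element sizes are not fixed by the formulas.

```cpp
#include <cstdint>

// Number of unordered pairs of states, i.e. the length of the B vector.
constexpr std::uint64_t pairCount(std::uint64_t n) { return n * (n - 1) / 2; }

// Number of integer values quoted for the P vector: 2n(n(n-1)/2).
constexpr std::uint64_t pVectorIntegers(std::uint64_t n) { return 2 * n * pairCount(n); }
```

For n = 131072, `pairCount` gives 8,589,869,056 flags. If each flag occupied one byte (an assumption; the storage format is not stated), this would be just under 8 GiB, which illustrates why B becomes the limiting factor for PAR-BF-HSI.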
Table 1 shows the mean sequence lengths. For CSs
there are no differences. In addition, the mean sequence
length for PAR-BF-HSI was slightly less than that for
SEQ-BF-HSI. However, when we applied the non-parametric
Kruskal-Wallis [54] significance test⁴, we found that the
difference was not statistically significant.
Table 2 shows the mean size of state identifiers. PAR-
BF-HSI tended to generate SIs that, on average, were
slightly smaller than those returned by SEQ-BF-HSI.
According to the Kruskal-Wallis test there is a statistically
significant difference when n ≥ 256. This indicates that
as the number of states increases, the parallel HSI gener-
ation algorithm tends to find fewer SIs. Moreover, we can
see that the size of the CS generated does not depend
on whether the algorithm is sequential or parallel. Over-
all, the results suggest that PAR-BF-HSI constructs more
compact HSIs, and does so faster, than SEQ-BF-HSI, and that
SEQ-PLY-CS and PAR-PLY-CS are faster than the brute-force
versions.
4. The results obtained from SEQ-BF-HSI were the nominal variable and
the results obtained from PAR-BF-HSI were the measurement variable.
5.4 The effect of the number of inputs and outputs
The results for TS2 and TS3 are in Figure 5. As the
number of inputs increases (Figures 5a, 5c), the time
required to construct CSs and HSIs increases regardless
of the algorithm used. However, mean lengths of state
identifiers reduce as we increase the number of inputs
(Figure 5e). This reflects there being more transitions that
might be used in finding separating sequences. Figure 5g
shows the mean number of separating sequences in
HSIs, the differences between the HSIs constructed by
SEQ-BF-HSI and PAR-BF-HSI being limited. The numbers
of separating sequences constructed by SEQ-BF-CS, PAR-
BF-CS, SEQ-PLY-CS and PAR-PLY-CS are shown in Fig-
ure 5i. The number of separating sequences reduces with
the number of inputs, indicating that when there are
more inputs it was possible to find separating sequences
that distinguish more states.
Consider now the experiments where we increased the
number of outputs (TS3). The time required, the mean
length of the separating sequences and mean number
of separating sequences produced are in Figure 5b, Fig-
ure 5d, Figure 5f, Figure 5h, Figure 5j. The results are as
expected: as the number of outputs increases it is easier
to find separating sequences (one expects separating
sequences to be shorter) and so the time taken reduces.
5.5 Benchmark FSMs
We found that PAR-BF-CS, PAR-PLY-CS, and SEQ-BF-
CS generated identical CSs for each FSM except for scf.
Only the PAR-BF-CS and PAR-PLY-CS algorithms were
able to construct CSs for scf (Figure 6 gives the times).
Moreover, PAR-BF-HSI and SEQ-BF-HSI generated iden-
tical state identifiers except for scf. SEQ-BF-HSI was not
able to construct HSIs for scf.
The results are promising. Although the FSMs are
relatively small, we see that the parallel algorithms
outperformed the sequential algorithms.
5.6 Discussion
The proposed algorithms accelerated the construction
of CSs and HSIs and increased scalability. However,
randomly generated FSMs will tend to have separating
sequences that are much shorter than the upper bound.
Sokolovskii introduced a special class of (complete)
FSMs that we called s-FSMs [55]. The shortest separating
sequence for states s1 and s2 of an s-FSM with n states
has length n− 1. We performed additional experiments
with a set of s-FSMs to explore how the algorithms
perform when the separating sequences are relatively
long. The transition and the output functions of an s-
FSM are defined as follows in which n′ = n/2 and n > 2.
(a) Average time to construct CSs for FSMs in TS2.
(b) Average time to construct CSs for FSMs in TS3.
(c) Average time to construct HSIs for FSMs in TS2.
(d) Average time to construct HSIs for FSMs in TS3.
(e) Average length of sequences for FSMs in TS2.
(f) Average length of sequences for FSMs in TS3.
(g) Average number of state identifiers for FSMs in TS2.
(h) Average number of state identifiers for FSMs in TS3.
(i) Average number of separating sequences for FSMs in TS2.
(j) Average number of separating sequences for FSMs in TS3.
Figure 5: Results of experiments on TS2 and TS3.
Algorithms               Number of States
                         64      128     256     512     1024    2048    4096    8192    16384   32768   65536   131072
SEQ-BF-CS                3.21    3.22    3.34    3.76    3.88    4.23
PAR-BF-CS                3.21    3.22    3.34    3.76    3.88    4.23    4.43    4.52    4.72    4.84    5.03    5.22
SEQ-PLY-CS               3.21    3.22    3.34    3.76    3.88    4.23    4.43    4.52    4.72    4.84
PAR-PLY-CS               3.21    3.22    3.34    3.76    3.88    4.23    4.43    4.52    4.72    4.84
SEQ-BF-HSI (per state)   3.05    3.11    3.19    3.22    3.45
PAR-BF-HSI (per state)   3.05    3.09    3.15    3.19    3.46    3.68    3.86    4.25    4.69    4.96    5.13    5.37
Table 1: Average length of state identifiers generated for TS1.
Algorithms               Number of States
                         64         128        256        512        1024       2048       4096       8192       16384      32768      65536      131072
SEQ-BF-CS                156.34     194.53     1200.56    1489.43    2935.45    7956.32
PAR-BF-CS                156.34     194.53     1200.56    1489.43    2935.45    7956.32    13405.33   21230.35   58431.59   102440.91  173539.21  298552.78
SEQ-PLY-CS               190.54     233.11     1399.12    1611.55    3202.44    8301.22    15005.50   22532.07   68441.47   140448.02
PAR-PLY-CS               190.54     233.11     1399.12    1611.55    3202.44    8301.22    15005.50   22532.07   68441.47   140448.02
SEQ-BF-HSI (per state)   41.54      41.83      52.67      52.88      63.01
PAR-BF-HSI (per state)   41.36      41.62      52.13      52.56      52.78      62.99      73.10      83.28      93.49      93.71      93.90      103.98
Table 2: Average number of state identifiers generated for TS1.
Figure 6: Results for benchmark FSMs.
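\[
\delta(s_i, x_j) =
\begin{cases}
s_{i+1}, & \text{if } j = 1 \wedge i \neq n' \wedge i \neq n \\
s_1, & \text{if } j = 1 \wedge i = n' \\
s_{n'+1}, & \text{if } j = 1 \wedge i = n \\
s_i, & \text{if } j = 0 \wedge 1 \leq i \leq n' - 1 \\
s_{n'+1}, & \text{if } j = 0 \wedge i = n'
\end{cases}
\qquad
\lambda(s_i, x_j) =
\begin{cases}
y_0, & \text{if } j = 0 \wedge i = n \\
y_1, & \text{otherwise}
\end{cases}
\tag{2}
\]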
We generated 600 s-FSMs with n states, n ∈ {10, 20,
. . . , 6000}. The results for the brute-force approaches
(Table 3) indicate that scalability drops drastically for
FSMs with long separating sequences. SEQ-BF-CS, PAR-
BF-HSI, and PAR-BF-CS can process s-FSMs with up to
30 states. While SEQ-BF-HSI is very fast, this is because
it does not construct separating sequences: for state s it
picks state identifiers from CSs that have previously been
constructed. However, it requires the output of SEQ-
BF-CS and so SEQ-BF-HSI also is not scalable for such
FSMs. However, SEQ-PLY-CS and PAR-PLY-CS were able
Algorithm     Number of states
              10      20        30
PAR-BF-HSI    2.56    1427.43   109746.34
SEQ-BF-HSI    0.01    1.32      2.21
PAR-BF-CS     2.51    1555.01   111000.34
SEQ-BF-CS     0.01    103.51    1553571.28
Table 3: Time taken for the s-FSMs, in milliseconds.
Figure 7: log10 scale of time (in ms) required to construct CSs
for the s-FSMs with PAR-PLY-CS and SEQ-PLY-CS.
to process all 600 s-FSMs. When n = 30, SEQ-PLY-CS re-
quired 150 milliseconds and PAR-PLY-CS required 2.98
milliseconds and when n = 6000 SEQ-PLY-CS required
307 seconds and PAR-PLY-CS required 2.2 seconds. We
present the results of these experiments in Figure 7.
These suggest that for s-FSMs, SEQ-PLY-CS and PAR-
PLY-CS are at least 200 times more scalable than SEQ-BF-
CS, PAR-PLY-CS is 537000 times faster than SEQ-BF-CS
and SEQ-PLY-CS is 10357 times faster than SEQ-BF-CS.
We also investigated the effect of K by examining the
relationship between performance and K for PAR-BF-
HSI when n = 4096 in TS1. In Table 4 we observe that
as we increase the number of transitions we increase
the number of separating sequences but reduce the
mean separating sequence length and the time required
by PAR-BF-HSI. These results are unsurprising since having
more transitions typically allows pairs of states to
be distinguished using shorter sequences.
6 THREATS TO VALIDITY
Threats to internal validity concern factors that might
introduce bias and so largely concern the tools used
in the experiments. The tool that generated the FSMs,
used as experimental subjects, is one that has previously
Properties                                         Number of Transitions
                                                   5000    6000    7000    8000    9000    10000   11000
Number of separating sequences per state           9.85    21.45   38.45   75.45   90.45   104.45  110.45
Average length of separating sequences per state   6.84    6.13    4.79    2.94    2.57    1.84    1.73
Average time to construct separating sequences     904.45  882.35  819.44  737.67  679.11  649.45  648.63
Table 4: The effect of number of transitions on the distinguishability of FSMs where n = 4096, p/q = 3/3.
been used and we also tested this. We carefully checked
and tested the implementations of the algorithms. We
used C++ to code the sequential CS and HSI methods
and CUDA C++ to implement the parallel algorithms.
We used the CUDA-Thrust library for sorting output
sequences while constructing characterising sets.
Threats to construct validity refer to the possibility
that the properties measured are not those of interest
in practice. Our main concern was scalability and this is
important if FSM based test techniques are to be applied
to larger systems. However, the sets of separators will
typically be generated to be used within a test generation
algorithm and so we are also interested in the potential
effect on the size of the resultant test. There are two
aspects that affect this: the number of sequences in a
CS/HSI and the lengths of these sequences. We therefore
measured mean values for these in the experiments.
Threats to external validity concern the degree to
which we can generalise from the results. One cannot
avoid such threats since the set of ‘real FSMs’ is not
known and we cannot uniformly sample from this. To
reduce this threat we varied the number of states, inputs
and outputs. We also used FSMs from industry.
7 CONCLUSIONS AND FUTURE WORK
This paper explored the use of GPU computing to devise
massively parallel algorithms for generating CSs/HSIs.
The main motivation was to make such algorithms scale
to larger FSMs. An FSM M used in test generation could
represent the semantics of a model M ′ written in a more
expressive language such as State-Charts or SDL and
even quite modest models can result in large FSMs.
We used the thin thread strategy in which relatively
little data is stored in shared memory, this allowing
there to be many threads. We devised polynomial time
algorithms and massively parallel versions of the brute-
force and polynomial time algorithms.
We performed experiments with randomly generated
partial FSMs, the parallel algorithms being much quicker
than the sequential algorithms and scaling much better.
Interestingly, the parallel brute-force algorithm, with
exponential worst case complexity, outperformed the se-
quential polynomial time algorithm. The parallel version
of the polynomial time algorithm was fastest but did
not scale as well as the parallel brute-force algorithm
due to its memory requirements. When the techniques
were applied to FSMs with relatively long separating
sequences, only the polynomial time algorithms scaled to
FSMs with 40 states or more. However, these algorithms
scaled quite well, easily handling FSMs with 6000 states.
As expected, the parallel version of the polynomial time
algorithm was much faster than the sequential version.
There are several lines of future work. First, experi-
ments might explore how the results change when using
different GPU cards. Second, it would be interesting to
run additional experiments with FSMs generated from
models in languages such as SDL and State-Charts.
Third, there is the challenge of designing and imple-
menting parallel algorithms for CS and HSI algorithms
that can run on multi-core systems. Finally, there is the
potential to investigate new approaches that are capable
of constructing HSIs for larger FSMs.
8 ACKNOWLEDGEMENTS
This work was supported by The Scientific and Tech-
nological Research Council of Turkey (TUBITAK) under
grant 1059B191400424 and by the NVIDIA corporation.
REFERENCES
[1] M. Broy, B. Jonsson, and J.-P. Katoen, Model-Based Testing of
Reactive Systems: Advanced Lectures LNCS. Springer, 2005.
[2] R. M. Hierons, K. Bogdanov, J. P. Bowen, R. Cleaveland, J. Der-
rick, J. Dick, M. Gheorghe, M. Harman, K. Kapoor, P. Krause,
G. Lüttgen, A. J. H. Simons, S. A. Vilkomir, M. R. Woodward,
and H. Zedan, “Using formal specifications to support testing,”
ACM Computing Surveys, vol. 41, no. 2, pp. 9:1–9:76, 2009.
[3] A. Y. Duale and M. U. Uyar, “A method enabling feasible
conformance test sequence generation for EFSM models,” IEEE
Transactions on Computers, vol. 53, no. 5, pp. 614–627, 2004.
[4] W. Grieskamp, “Multi-paradigmatic model-based testing,” in
Formal Approaches to Software Testing and Runtime Verification
FATES/RV, ser. LNCS, vol. 4262. Springer, 2006, pp. 1–19.
[5] D. Lee and M. Yannakakis, “Principles and methods of testing
finite-state machines - a survey,” Proceedings of the IEEE, vol. 84,
no. 8, pp. 1089–1123, 1996.
[6] A. Friedman and P. Menon, Fault detection in digital circuits, ser.
Computer Applications in Electrical Engineering Series. Prentice-
Hall, 1971.
[7] A. Aho, R. Sethi, and J. Ullman, Compilers, principles, tech-
niques, and tools, ser. Addison-Wesley series in computer science.
Addison-Wesley Pub. Co., 1986.
[8] T. S. Chow, “Testing software design modelled by finite state
machines,” IEEE Transactions on Software Engineering, vol. 4, pp.
178–187, 1978.
[9] E. Brinksma, “A theory for the derivation of tests,” in Proceedings
of Protocol Specification, Testing, and Verification VIII. Atlantic City:
North-Holland, 1988, pp. 63–74.
[10] S. Low, “Probabilistic conformance testing of protocols with un-
observable transitions,” in 1993 International Conference on Network
Protocols, Oct, pp. 368–375.
[11] D. P. Sidhu and T.-K. Leung, “Formal methods for protocol test-
ing: A detailed study,” IEEE Transactions on Software Engineering,
vol. 15, no. 4, pp. 413–426, 1989.
[12] R. V. Binder, Testing Object-Oriented Systems: Models, Patterns, and
Tools. Addison-Wesley, 1999.
[13] M. Haydar, A. Petrenko, and H. Sahraoui, “Formal verification
of web applications modeled by communicating automata,” in
Formal Techniques for Networked and Distributed Systems FORTE,
ser. LNCS, vol. 3235. Madrid: Springer-Verlag, September 2004,
pp. 115–132.
[14] M. Utting, A. Pretschner, and B. Legeard, “A taxonomy of model-
based testing approaches,” Software Testing, Verification and Relia-
bility, vol. 22, no. 5, pp. 297–312, 2012.
[15] W. Grieskamp, N. Kicillof, K. Stobie, and V. A. Braberman,
“Model-based quality assurance of protocol documentation: tools
and methodology,” Software Testing, Verification and Reliability,
vol. 21, no. 1, pp. 55–71, 2011.
[16] E. F. Moore, “Gedanken-experiments on sequential machines,” in
Automata Studies, C. Shannon and J. McCarthy, Eds. Princeton
University Press, 1956.
[17] F. C. Hennie, “Fault-detecting experiments for sequential circuits,”
in Proceedings of Fifth Annual Symposium on Switching Circuit Theory
and Logical Design, Princeton, New Jersey, November 1964, pp. 95–
110.
[18] R. M. Hierons and H. Ural, “Generating a checking sequence
with a minimum number of reset transitions,” Automated Software
Engineering, vol. 17, no. 3, pp. 217–250, 2010.
[19] H. Ural and K. Zhu, “Optimal length test sequence generation
using distinguishing sequences,” IEEE/ACM Transactions on Net-
working, vol. 1, no. 3, pp. 358–371, 1993.
[20] G. L. Luo, G. v. Bochmann, and A. Petrenko, “Test selection
based on communicating nondeterministic finite-state machines
using a generalized Wp-method,” IEEE Transactions on Software
Engineering, vol. 20, no. 2, pp. 149–161, 1994.
[21] A. Petrenko, N. Yevtushenko, and G. v. Bochmann, “Testing
deterministic implementations from nondeterministic FSM spec-
ifications,” in IFIP TC6 9th International Workshop on Testing of
Communicating Systems. Darmstadt, Germany: Chapman and
Hall, 9–11 September 1996, pp. 125–141.
[22] A. Petrenko and A. Simão, “Generalizing the DS-methods for
testing non-deterministic fsms,” The Computer Journal, vol. 58,
no. 7, pp. 1656–1672, 2015.
[23] A. da Silva Simão, A. Petrenko, and N. Yevtushenko, “On re-
ducing test length for FSMs with extra states,” Software Testing,
Verification and Reliability, vol. 22, no. 6, pp. 435–454, 2012.
[24] M. P. Vasilevskii, Failure Diagnosis of Automata. Cybernetics.
Plenum Publishing Corporation, 1973.
[25] G. Luo, A. Petrenko, and G. v. Bochmann, “Selecting test se-
quences for partially-specified nondeterministic finite state ma-
chines,” in The 7th IFIP Workshop on Protocol Test Systems. Tokyo,
Japan: Chapman and Hall, November 8–10 1994, pp. 95–110.
[26] P.-C. Tsai, S.-J. Wang, and F.-M. Chang, “FSM-based pro-
grammable memory BIST with macro command,” in 2005 IEEE
International Workshop on Memory Technology, Design, and Testing,
2005. (MTDT), Aug., pp. 72–77.
[27] K. Zarrineh and S. Upadhyaya, “Programmable memory BIST and
a new synthesis framework,” in Fault-Tolerant Computing, 1999.
Digest of Papers. Twenty-Ninth Annual International Symposium on,
June, pp. 352–355.
[28] L. Xie, J. Wei, and G. Zhu, “An improved FSM-based method for
BGP protocol conformance testing,” in International Conference on
Communications, Circuits and Systems, 2008, pp. 557–561.
[29] A. Drumea and C. Popescu, “Finite state machines and their
applications in software for industrial control,” in 27th Int. Spring
Seminar on Electronics Technology: Meeting the Challenges of Electron-
ics Technology Progress, vol. 1, 2004, pp. 25–29.
[30] A. Petrenko and N. Yevtushenko, “Testing from partial determin-
istic FSM specifications,” IEEE Transactions on Computers, vol. 54,
no. 9, pp. 1154–1165, 2005.
[31] ——, “Testing from partial deterministic FSM specifications,”
IEEE Transactions on Computers, vol. 54, no. 9, pp. 1154–1165, 2005.
[32] F. Brglez, “ACM/SIGDA benchmark dataset,” Available
online at http://www.cbl.ncsu.edu/benchmarks/Benchmarks-
upto-1996.html, 1996, accessed: 2014-02-13.
[33] N. Kushik, N. Yevtushenko, and A. R. Cavalli, “On testing against
partial non-observable specifications,” in 9th International Confer-
ence on the Quality of Information and Communications Technology
(QUATIC 2014). IEEE, 2014, pp. 230–233.
[34] A. L. Bonifácio and A. V. Moura, “Test suite completeness and
partial models,” in 12th International Conference on Software En-
gineering and Formal Methods (SEFM 2014), ser. LNCS, vol. 8702.
Springer, 2014, pp. 96–110.
[35] ——, “Partial models and weak equivalence,” in 11th International
Colloquium on Theoretical Aspects of Computing ICTAC 2014), ser.
LNCS, vol. 8687. Springer, 2014, pp. 80–96.
[36] ——, “On the completeness of test suites,” in Symposium on
Applied Computing (SAC 2014). ACM, 2014, pp. 1287–1292.
[37] A. Petrenko and N. Yevtushenko, “Conformance tests as checking
experiments for partial nondeterministic FSM,” in 5th Int. Work-
shop on Formal Approaches to Software Testing, ser. LNCS, vol. 3997.
Springer, 2006, pp. 118–133.
[38] A. da Silva Simão and A. Petrenko, “Generating checking se-
quences for partial reduced finite state machines,” in 20th IFIP TC
6/WG 6.1 International Conference Testing of Software and Communi-
cating Systems, 8th International Workshop on Formal Approaches to
Testing of Software TestCom/FATES, ser. LNCS, vol. 5047. Springer,
2008, pp. 153–168.
[39] J. Dümmler and S. Egerland, “Interval-based performance
modeling for the all-pairs-shortest-path problem on GPUs,” The
Journal of Supercomputing, vol. 71, no. 11, pp. 4192–4214, 2015.
[Online]. Available: http://dx.doi.org/10.1007/s11227-015-1514-9
[40] F. Busato and N. Bombieri, “An efficient implementation of
the bellman-ford algorithm for kepler GPU architectures,” IEEE
Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp.
1–1, 2015.
[41] H. Liu and H. H. Huang, “Enterprise: Breadth-first graph traver-
sal on GPUs,” in Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis, ser. SC
’15. New York, NY, USA: ACM, 2015, pp. 68:1–68:12.
[42] F. Al Farid, M. Uddin, S. Barman, A. Ghods, S. Das, and
M. Hasan, “A novel approach toward parallel implementation of
BFS algorithm using graphic processor unit,” in 2015 International
Conference on Electrical Engineering and Information Communication
Technology (ICEEICT), May 2015, pp. 1–4.
[43] S. Lai, G. Lai, G. Shen, J. Jin, and X. Lin, “Gpregel: A GPU-
based parallel graph processing model,” in High Performance
Computing and Communications (HPCC), 2015 IEEE 7th International
Symposium on Cyberspace Safety and Security (CSS), Aug 2015, pp.
254–259.
[44] H. Djidjev, G. Chapuis, R. Andonov, S. Thulasidasan, and D. Lave-
nier, “All-pairs shortest path algorithms for planar graph for
GPU-accelerated clusters,” Journal of Parallel and Distributed Com-
puting, vol. 85, pp. 91 – 103, 2015, {IPDPS} 2014 Selected Papers
on Numerical and Combinatorial Algorithms.
[45] G. Mealy, “A method for synthesizing sequential circuits,” Bell
System Technical Journal, vol. 34, no. 5, pp. 1045–1079, 1955.
[46] C. P. Pfleeger, “State reduction in incompletely specified finite-
state machines,” IEEE Transactions on Computers, vol. 22, no. 12,
pp. 1099–1102, Dec. 1973.
[47] V. Jusas and T. Neverdauskas, “FSM based functional test gen-
eration framework for VHDL,” in 18th International Conference on
Information and Software Technologies ICIST, ser. Communications
in Computer and Information Science, vol. 319. Springer, 2012,
pp. 138–148.
[48] I. Pomeranz and S. M. Reddy, “Test generation for multiple
state-table faults in finite-state machines,” IEEE Transactions on
Computers, vol. 46, no. 7, pp. 783–794, 1997.
[49] S. Thummalapenta, K. V. Lakshmi, S. Sinha, N. Sinha, and
S. Chandra, “Guided test generation for web applications,” in
35th International Conference on Software Engineering (ICSE). IEEE
/ ACM, 2013, pp. 162–171.
[50] C. D. Nguyen, A. Marchetto, and P. Tonella, “Combining model-
based and combinatorial testing for effective test case generation,”
in International Symposium on Software Testing and Analysis (ISSTA).
ACM, 2012, pp. 100–110.
[51] G. Klingbeil, R. Erban, M. Giles, and P. Maini, “Fat versus
thin threading approach on GPUs : Application to stochastic
simulation of chemical reactions,” IEEE Transactions on Parallel and
Distributed Systems, vol. 23, no. 2, pp. 280–287, Feb 2012.
[52] R. M. Hierons and U. C. Türker, “Incomplete distinguishing
sequences for finite state machines,” The Computer Journal, vol. 58,
no. 11, pp. 3089–3113, 2015.
[53] ——, “Distinguishing sequences for partially specified FSMs,” in
6th NASA Formal Methods Symposium (NFM), 2014, pp. 62–76.
[54] W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion
variance analysis,” Journal of the American Statistical Association,
vol. 47, no. 260, pp. 583–621, 1952.
[55] M. Sokolovskii, “Diagnostic experiments with automata,” Kiber-
netica, no. 6, pp. 44–49, 1971.
[56] D. B. Kirk and W.-m. W. Hwu, Programming massively parallel
processors: a hands-on approach. Newnes, 2012.
[57] J. Hoberock and N. Bell, “Thrust: A parallel template library,”
2010, version 1.7.0. [Online]. Available: http://thrust.github.io/
Symbol Meaning
ti Thread with id i
CurrentStates A vector of states
InputSequence A vector of input symbols
OutputSequence A vector of output symbols
FSM A vector holding FSM transition structure
Flags A vector of boolean variables
Table 5: Nomenclature for the algorithms.
APPENDIX
We first discuss issues affecting performance and then
describe how the parallel algorithms were implemented.
Table 5 gives the terms used.
Performance considerations
In implementing a massively parallel algorithm, one
needs to consider coalesced memory transaction and thread
divergence. As we followed the thin thread strategy, we
needed to perform many global memory transactions. In
a GPU, global memory is accessed in chunks of aligned
32, 64 or 128 bytes. If the threads of a block access global
memory in the right pattern, the GPU can pack several
accesses into one memory transaction. This is a coalesced
memory transaction [56]. For example, whenever a thread
ti requests a single item, the entire line from global
memory is brought to cache. If thread ti+1 requests
the neighbouring item, ti+1 can read this data from the
cache. As reading global memory is hundreds of times
slower than reading cache memory, coalesced memory
access may drastically improve performance. Therefore,
to reduce the time spent on global memory transactions,
we needed an appropriate storage layout.
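The effect of the storage layout can be illustrated with a deliberately simplified model that counts how many aligned segments a group of threads touches. The group size of 32 threads, the 128-byte segment and the function name `segmentsTouched` are assumptions of this sketch, and the model ignores caching and elements that straddle a segment boundary.

```cpp
#include <cstddef>
#include <set>

// Simplified model: thread t reads one element of elemBytes bytes at index
// t * stride. Returns the number of distinct aligned segments that are touched,
// which approximates the number of memory transactions required.
std::size_t segmentsTouched(std::size_t stride, std::size_t elemBytes,
                            std::size_t threads = 32,
                            std::size_t segmentBytes = 128) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t)
        segments.insert(t * stride * elemBytes / segmentBytes);
    return segments.size();
}
```

Under this model, 32 threads reading neighbouring 4-byte items (stride 1) touch a single segment, whereas the same threads reading with stride 32 touch 32 segments.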
All threads in a multiprocessor execute the same code
(kernel). If the code has branching statements such as if
or switch then some of the threads may follow different
branches and hence different threads need to execute
different lines of code. In such cases the GPU will
serialise execution: a GPU is not capable of executing
if or switch like statements in parallel. This problem is
known as thread divergence [56]. However, if one can
guarantee that the threads execute the same sequence
of instructions then one can use branching statements.
Parallel-CS algorithm
Data structures
In implementing the Parallel-CS algorithm we used the
P vector and the FSM vector. The P vector holds
n(n − 1)/2 CPn-vectors and we use P[i] to denote the
ith element of P. Such an element holds a node-pair
and a flag, where a node-pair consists of a pair of states
(s, s′) and possibly a separating sequence ϑ for this pair.
Although the maximum length of ϑ is n(n − 1)/2, we set
this bound to n, and therefore the memory required for
a CPn-vector was $O(n^3)$.
The transition structure of the FSM is kept in the FSM
vector. For state s and input x, the FSM vector returns
output y ∈ Y ∪{ε} and next state s′ ∈ S∪{e}. The size of
the FSM vector is therefore 2n|X|. For a thread ti and
an input sequence x̄ of length greater than 1, reads on
the FSM vector may not be coalesced as the memory
access pattern on FSM is data dependent. For example,
let us assume that we need to apply x̄ = x1x2 to si.
For x1, thread ti will retrieve output and next state (sj)
information from the ith location of FSM and it will
then apply input x2 to sj which will cause thread ti to
access a different location of the FSM vector.
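The data-dependent access pattern can be illustrated with a minimal sequential sketch in Python. The interleaved [output, next state] layout, the integer codes EPS and NO_STATE, and the function names are assumptions made for illustration; the text above fixes only the size 2n|X| of the FSM vector.

```python
EPS = -1       # assumed code for "no output" (ε)
NO_STATE = -1  # assumed code for "no next state" (e)

def make_fsm_vector(n, num_inputs, transitions):
    """Flat vector of size 2*n*|X|: [output, next state] per (state, input)."""
    fsm = [EPS, NO_STATE] * (n * num_inputs)
    for (s, x), (y, s_next) in transitions.items():
        base = 2 * (s * num_inputs + x)
        fsm[base], fsm[base + 1] = y, s_next
    return fsm

def apply_sequence(fsm, num_inputs, s, xs):
    """Apply inputs xs from state s; return (outputs, final state),
    or None if some transition is undefined."""
    outputs = []
    for x in xs:
        # The location read here depends on the state reached so far,
        # which is why reads after the first input need not be coalesced.
        base = 2 * (s * num_inputs + x)
        y, s = fsm[base], fsm[base + 1]
        if s == NO_STATE:
            return None
        outputs.append(y)
    return outputs, s

# A small partial FSM with 3 states and 2 inputs; (2, 1) is undefined.
transitions = {(0, 0): (0, 1), (0, 1): (1, 2), (1, 0): (1, 2),
               (1, 1): (0, 0), (2, 0): (0, 0)}
fsm = make_fsm_vector(3, 2, transitions)
```

Applying x1x2 from state 0 first reads the block for state 0 and then a block chosen by the state that x1 leads to, so neighbouring threads can end up reading locations that are far apart.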
Initiating the P vector
First note that for a pair (si, sj) there are n−i elements in
P that start with si. Therefore, if an initialisation kernel
receives the P vector, integers i and n as its parameters
and launches with n− i threads, it can set n− i elements
of P . Since i varies from 1 to n − 1, the Host can call
the kernel O(n) times. If Γ ≥ n the time required for
initialisation is of O(n) as stated in Section 4.2.
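The launch pattern can be sketched sequentially in Python as follows. The inner loop stands in for the n − i threads of one launch. The offset formula and the element layout [s, s′, ϑ, flag] are assumptions introduced for illustration and are not given in the text.

```python
def init_kernel(P, i, n):
    """One simulated launch for 1-based state index i, using n - i 'threads'."""
    # Assumed offset: number of pairs that start with s_1 .. s_{i-1}.
    offset = (i - 1) * n - (i - 1) * i // 2
    for t in range(n - i):                    # thread t handles pair (s_i, s_{i+1+t})
        P[offset + t] = [i, i + 1 + t, [], 0]  # empty sequence, flag 0

def init_p(n):
    P = [None] * (n * (n - 1) // 2)
    for i in range(1, n):                     # the host issues n - 1 launches
        init_kernel(P, i, n)
    return P

P = init_p(4)
```

For n = 4 the three launches fill 3, 2 and 1 elements respectively, which together cover all n(n − 1)/2 = 6 pairs.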
Computing initial separating sequences
To compute the initial separating sequences, we can
launch a kernel with n(n − 1)/2 threads and with the
FSM and the P vectors, inputs X and a boolean variable
isMin (set to F prior to the kernel call) as parameters.
Thread ti processes one element P[i] = ((s, s′, ϑ), f) in
a for-loop. The for-loop iterates over inputs X and in
each iteration the kernel applies input x to states s, s′
and retrieves outputs from the FSM vector and then
checks if λ(s, x) ≠ λ(s′, x), λ(s, x) ≠ ε, and λ(s′, x) ≠ ε.
If these conditions are met, the thread sets the flag to 1,
separating sequence to x, and isMin to T and exits from
the for-loop. More than one thread can set isMin; as the
only value to be written to isMin is T , this does not
cause a race. After all threads have finished, the kernel
returns and the algorithm checks the value of isMin. If
it is F , the algorithm declares that M is not minimal and
the algorithm terminates, otherwise it continues.
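A minimal sequential sketch of this kernel in Python is given below. Each iteration of the outer loop stands in for one GPU thread. The output table out, the use of None for ε, and the element layout [s, s′, ϑ, flag] are assumptions made for illustration.

```python
EPS = None  # assumed representation of ε (no output defined)

def initial_separation(P, out, inputs):
    """P: list of [s, s', theta, flag]; out[s][x] is an output or EPS.
    Returns the value of isMin after the simulated kernel."""
    is_min = False
    for elem in P:                       # one iteration per simulated thread
        s, t = elem[0], elem[1]
        for x in inputs:
            ys, yt = out[s][x], out[t][x]
            if ys is not EPS and yt is not EPS and ys != yt:
                elem[2], elem[3] = [x], 1   # separating sequence x, flag 1
                is_min = True               # every writer stores the same value
                break
    return is_min

# Three states, two inputs; state 2 has no output defined for input 0.
out = [[0, 1], [0, 0], [EPS, 0]]
P = [[0, 1, [], 0], [0, 2, [], 0], [1, 2, [], 0]]
is_min = initial_separation(P, out, [0, 1])
```

In this example input 1 separates state 0 from both other states, while the pair (1, 2) keeps flag 0 and is left for the evolution phase.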
Evolving the P vector
After some separating sequences are set, the algorithm
uses these to compute additional separating sequences.
To achieve this we implemented two kernels.
The first kernel is launched with n(n − 1)/2 threads,
and the FSM vector, the P vector, n, and inputs X as
parameters. A thread ti processes P[i] = ((s, s′, ϑ), f)
through a for-loop if and only if f = 0. The for-loop
iterates over inputs; at each iteration it applies an input
to s and s′ and receives si = δ(s, x) and sj = δ(s′, x) from
the FSM vector. ti then computes the index k for pair
(si, sj). If the flag of P[k] is 1 then the thread retrieves
the separating sequence ϑ′, assigns xϑ′ to ϑ, and sets
f = 2. The thread sets the flag value to 2 rather than 1 to
avoid long separating sequences being constructed. For
example, consider the scenario in which thread ti tries to
evolve P[i] and there is a freshly evolved element P[j]
(evolved in the current iteration) and an element P[ℓ] that
has already evolved. As ti checks input symbols one by
one, if the flags of P[j] and P[ℓ] are 1 then ti may use ϑj
in forming its separating sequence. However, |ϑj| > |ϑℓ|.
By setting the flags of freshly evolved elements to 2, we
ensure that these cannot be used to form new separating
sequences until the next iteration.
After the first kernel returns, the algorithm launches
another kernel with P and isMin to convert 2s to 1s. A
thread ti checks if the flag of P[i] is 2; if so, it sets the flag
to 1 and sets isMin to T. After the kernel returns, the
algorithm checks isMin and if it reads F it terminates
declaring that M is not minimal; otherwise it continues.
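One iteration of the two kernels can be sketched sequentially in Python as below. The pair-index formula, the next-state table nxt, the use of None for an undefined transition, and the skipping of successor pairs that are undefined or identical are assumptions made for illustration; the text does not specify them.

```python
def pair_index(i, j, n):
    """Assumed 0-based position of the pair (s_i, s_j) in P (row-major, i < j)."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def evolve_once(P, nxt, inputs, n):
    """P: list of [s, s', theta, flag]. Returns True if some pair was separated."""
    # First kernel: try to extend a known separating sequence by one input.
    for elem in P:
        s, t, _, flag = elem
        if flag != 0:
            continue
        for x in inputs:
            si, sj = nxt[s][x], nxt[t][x]
            if si is None or sj is None or si == sj:
                continue                      # assumed handling of these cases
            k = pair_index(si, sj, n)
            if P[k][3] == 1:                  # flag 2 (fresh) is deliberately ignored
                elem[2] = [x] + P[k][2]
                elem[3] = 2
                break
    # Second kernel: convert 2s to 1s and report progress.
    progressed = False
    for elem in P:
        if elem[3] == 2:
            elem[3] = 1
            progressed = True
    return progressed

n = 3
nxt = [[0, 1], [1, 2], [2, 0]]                # nxt[s][x] for inputs 0 and 1
P = [[0, 1, [], 0], [0, 2, [], 0], [1, 2, [0], 1]]
first = evolve_once(P, nxt, [0, 1], n)
```

In this example the pair (0, 1) is separated in the first iteration by prefixing input 1 to the sequence for (1, 2). The pair (0, 2) depends on (0, 1), whose flag is 2 at that point, so it is only separated in the next iteration.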
Parallel-HSI algorithm
Data structures
The Parallel-HSI algorithm uses a boolean vector B, of
size n(n − 1)/2, to record which pairs of states have
known separating sequences. Recall that elements of an
ST-vector are associated with an input sequence, initial
and current states and output sequences. We simulated
an ST-vector by using CurrentStates, InputSequence
and OutputSequences vectors. The InputSequence vector
holds the input sequence x̄ to be applied and so has
length at most n(n−1)/2. CurrentStates[i] gives the cur-
rent state δ(si, x̄) for initial state si and so CurrentStates
has size n. The OutputSequences vector holds the output
sequences produced from the different states and so its
length can vary from n to n(n(n− 1)/2).
The Parallel-HSI algorithm uses the FSM vector. The
algorithm also uses a vector Flags: for state s, Flags[i]
is larger than 0 if and only if si has been distinguished
from s. As the FSM and InputSequence vectors never
change they are kept in the Texture memory.
Resetting/Initiating Vectors
The Parallel-HSI algorithm first initialises the B vector
using a kernel in which thread ti writes 0 to B[i]. The
algorithm then enters a loop and in every iteration it
resets vector D. As D is simulated by different vectors,
we used more than one kernel to achieve this. The first
kernel receives CurrentStates and its length and during
execution thread ti writes si to CurrentStates[i]. The
second kernel receives a vector (InputSequence, Flags,
OutputSequences etc.) and its size. During execution a
thread ti writes 0 to the ith index of the vector.
Evolving the ST-vector
A thread ti applies the input sequence x̄ to state si, iterat-
ing over a loop (kernel-loop). The number of iterations
of the kernel-loop is equal to the length of the input
sequence. At each iteration of the kernel-loop, a thread
ti reads the next input x from the InputSequence vector
and retrieves the next state s and the observed output y
from the FSM vector. It writes s to CurrentStates[i] and
writes the output observed in the current iteration to the
corresponding index of the OutputSequences vector.
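The kernel-loop can be sketched sequentially in Python as follows. The flat [output, next state] layout of the FSM vector, the row-per-state layout of the output vector, the codes EPS and NO_STATE, and the choice to stop a thread once it reaches an undefined transition are assumptions made for illustration.

```python
EPS = 0         # assumed code for ε; real outputs are assumed to be non-zero
NO_STATE = -1   # assumed code for "no next state" (e)

def evolve_st(fsm, num_inputs, n, input_seq):
    """Apply input_seq from every state; return (current states, outputs)."""
    length = len(input_seq)
    current = list(range(n))               # CurrentStates[i] starts as s_i
    outputs = [EPS] * (n * length)         # row i holds the outputs from s_i
    for i in range(n):                     # one iteration per simulated thread t_i
        for step, x in enumerate(input_seq):   # kernel-loop
            if current[i] == NO_STATE:
                break                      # assumption: nothing more to apply
            base = 2 * (current[i] * num_inputs + x)
            outputs[i * length + step] = fsm[base]
            current[i] = fsm[base + 1]
    return current, outputs

# Two states, two inputs; the transition (state 1, input 0) is undefined.
fsm = [1, 1,  2, 0,  EPS, NO_STATE,  1, 0]
current, outputs = evolve_st(fsm, 2, 2, [0, 1])
```

Here state 0 produces the outputs 1, 1 and returns to state 0, whereas state 1 has no transition on the first input, so its row stays at ε.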
Evaluating the output sequences
After the elements of D have been evolved, the Parallel-
HSI algorithm evaluates the output sequences through a
loop (states-loop) called by the CPU. At each iteration,
the states-loop chooses a state si and calls a kernel that
writes 0 to all elements of the Flags vector. It then exe-
cutes another kernel in which a thread tj compares the
output sequences produced by states si and sj through
a loop (kernel-loop-2). At each iteration of kernel-loop-2,
thread tj retrieves outputs yi, yj respectively and checks
whether the value (ε⊕yi)∧(ε⊕yj) is 0 (here ⊕ denotes XOR).
If so the thread is suspended since the input sequence is
not a defined sequence (from at least one of si and sj).
Otherwise it writes (yi ⊕ yj) ∨ Flags[j] to Flags[j].
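The comparison can be sketched sequentially in Python as below. The sketch reads ∧ and ∨ as logical operators on "non-zero" truth values and encodes ε as 0 with non-zero real outputs; both are interpretations adopted for illustration rather than details given in the text.

```python
EPS = 0  # assumed code for ε; real outputs are assumed to be non-zero integers

def compare_outputs(out_seqs, i):
    """out_seqs[j] is the output sequence observed from s_j.
    Returns Flags for the chosen state s_i."""
    n = len(out_seqs)
    flags = [0] * n                          # first kernel: clear Flags
    for j in range(n):                       # one iteration per simulated thread t_j
        for yi, yj in zip(out_seqs[i], out_seqs[j]):   # kernel-loop-2
            defined = (EPS ^ yi) != 0 and (EPS ^ yj) != 0
            if not defined:
                break                        # sequence undefined from s_i or s_j
            flags[j] = int((yi ^ yj) != 0 or flags[j] != 0)
    return flags

out_seqs = [[1, 2], [1, 3], [1, EPS], [1, 2]]
flags = compare_outputs(out_seqs, 0)
```

In this example only state 1 is flagged: state 2 reaches an undefined output before any difference is seen, and state 3 produces the same outputs as state 0.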
Gathering state identifiers
After the output sequences have been compared, a
kernel compares the values of Flags and B. An input
sequence is added to Hi and Hj if it distinguishes
si and sj (Flags[j] is true) and the states were not
previously distinguished (!B[index] is true). Thus, tj
writes InputSequence to Hj , Hi and 1 to B[index] if
Flags[j]∧!B[index] is larger than 0.
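A minimal sequential sketch of this step in Python is given below. The pair-index formula and the use of Python sets for the state identifiers Hi are assumptions made for illustration.

```python
def pair_index(i, j, n):
    """Assumed 0-based position of the pair (s_i, s_j) in B (row-major, i < j)."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

def gather(i, flags, B, H, input_seq, n):
    """Add input_seq to H[i] and H[j] for each newly distinguished pair."""
    for j in range(n):                       # one iteration per simulated thread t_j
        if j == i:
            continue
        index = pair_index(i, j, n)
        if flags[j] and not B[index]:
            H[i].add(tuple(input_seq))
            H[j].add(tuple(input_seq))
            B[index] = 1                     # the pair is now distinguished

n = 3
B = [0, 0, 0]
H = [set() for _ in range(n)]
gather(0, [0, 1, 1], B, H, [1, 0], n)
```

A later call with a different input sequence that also distinguishes s0 from s1 leaves H unchanged for that pair, because B already records it as distinguished.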
Constructing a characterising set from HSIs
While constructing the CS W from HSIs we need to
eliminate duplicates. Thus, once HSIs have been com-
puted by the Parallel-HSI algorithm we form a group
G by collecting state identifiers from every Hi. We then
sort these input sequences (sort(G)) in parallel, removing
duplicates using the generateW(G) kernel. We used the
Thrust Sort function [57] to sort input sequences.
The generateW kernel receives G and an empty W . A
loop iterates |x̄i| times and at each iteration a thread ti
compares the jth input of neighbouring input sequences:
it compares the jth input of x̄i with both x̄i+1 and x̄i−1.
If ti finds that x̄i is different from a neighbouring input
sequence then it adds this input sequence to W .
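The sort-then-compare idea can be sketched sequentially in Python as follows. The sketch is a simplification, not the kernel itself: it uses the built-in sort in place of Thrust, compares whole sequences rather than one input per iteration, and keeps a sequence when it differs from its left neighbour only, which is one way to retain exactly one copy of each duplicate.

```python
def generate_w(G):
    """Return the distinct input sequences of G in sorted order."""
    G = sorted(G)                            # stands in for the parallel sort
    W = []
    for i, seq in enumerate(G):              # one iteration per simulated thread t_i
        if i == 0 or seq != G[i - 1]:        # differs from the left neighbour
            W.append(seq)
    return W

G = [(1, 0), (0,), (1, 0), (0, 1), (0,)]
W = generate_w(G)
```

Because equal sequences are adjacent after sorting, each thread needs to inspect only its immediate neighbour to decide whether its sequence is a duplicate.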
Robert M. Hierons received a BA in Mathe-
matics (Trinity College, Cambridge), and a Ph.D.
in Computer Science (Brunel University). He
then joined the Department of Mathematical
and Computing Sciences at Goldsmiths College,
University of London, before returning to Brunel
University in 2000. He was promoted to full
Professor in 2003.
Uraz Cengiz Türker received the BA, MSc and
PhD degrees in Computer Science (Sabanci
University, Turkey), in 2006, 2008, and 2014, re-
spectively. He is now a postdoctoral researcher at
Brunel University London under the supervision
of Prof. Robert M. Hierons.
