Turkish Journal of Electrical Engineering and Computer Sciences
Volume 27

Number 5

Article 20

1-1-2019

Parallel brute-force algorithm for deriving reset sequences from
deterministic incomplete finite automata
URAZ CENGİZ TÜRKER


Recommended Citation
TÜRKER, URAZ CENGİZ (2019) "Parallel brute-force algorithm for deriving reset sequences from
deterministic incomplete finite automata," Turkish Journal of Electrical Engineering and Computer
Sciences: Vol. 27: No. 5, Article 20. https://doi.org/10.3906/elk-1809-1
Available at: https://journals.tubitak.gov.tr/elektrik/vol27/iss5/20


Turkish Journal of Electrical Engineering & Computer Sciences
http://journals.tubitak.gov.tr/elektrik/

Research Article

Turk J Elec Eng & Comp Sci
(2019) 27: 3544 – 3556
© TÜBİTAK
doi:10.3906/elk-1809-1

Parallel brute-force algorithm for deriving reset sequences from deterministic
incomplete finite automata
Uraz Cengiz TÜRKER∗
Department of Computer Engineering, Faculty of Engineering, Gebze Technical University, Kocaeli, Turkey
Received: 01.09.2018 • Accepted/Published Online: 26.03.2019 • Final Version: 18.09.2019

Abstract: A reset sequence (RS) for a deterministic finite automaton A is an input sequence that brings A to a
particular state regardless of the initial state of A . Incomplete finite automata (FA) are strong in modeling reactive
systems, but despite their importance, there are no works published for deriving RSs from FA. This paper proposes a
massively parallel algorithm to derive short RSs from FA. Experimental results reveal that the proposed parallel algorithm
can construct RSs from FA with 16,000,000 states. When multiple GPUs are added to the system the approach can
handle larger FA.
Key words: Finite automata, incomplete finite automata, brute-force approach, reset sequences, GPGPU programming

1. Introduction
Finite automata (FA) have many practical applications. They have been used in various fields, including
automata theory, robotics, biocomputing, set theory, propositional calculus, model-based testing, and many
more [1–12].
For example, in model-based testing, checking experiment construction requires a reset sequence to
bring the implementation to the specific state in which the designed test sequence is to be applied (e.g., see [13–15]). In [5], the authors studied a practical problem related to automated part orienting on an assembly line.
In this work the authors showed that, under some assumptions, the part orienting problem is reducible to the
problem of constructing reset sequences for deterministic finite automata. Another interesting example arises in
biocomputing. In [16, 17], the researchers showed that in a controlled environment they ran 3 × 10^12 automata
per microliter, performing 6.6 × 10^10 transitions per second, and in order to run these automata, we need to
construct reset words from synthetic nucleotides. Moreover, in [18], the authors proposed a molecular automaton
that plays Tic-Tac-Toe against a human opponent. Such an automaton, after the game ends, requires a reset
word to bring the automaton to the “new game” state.
∗ Correspondence: urazc@gtu.edu.tr
This work is licensed under a Creative Commons Attribution 4.0 International License.

1.1. Formal background
A deterministic finite automaton (or simply an automaton) is defined by a triple A = (Q, Σ, δ) where Q is a
finite set of states, Σ is a finite input alphabet, and δ : Q × Σ → Q is a transition function. For a given state
q, we say that input x is defined at q if δ(q, x) ∈ Q. Automaton A is an incomplete automaton if for some
state-input pairs the function δ is undefined. Otherwise, it is a complete automaton. Throughout this paper we
use automaton (or automata), finite automaton (or finite automata), or FA to denote an incomplete automaton
(or incomplete automata). We assume that only one input can be applied to the automaton at a time. Upon
receiving the input, the automaton changes its state according to its transition function. Application of inputs
one after another forms a sequence of inputs w = x1 x2 x3 ... and is called an input sequence. The transition
function can be extended to input sequences in the usual way: for all q ∈ Q, input sequence w ∈ Σ⋆, and input
symbol x ∈ Σ, δ̂(q, ε) = q and δ̂(q, wx) = δ(δ̂(q, w), x), where ε denotes the empty input sequence. Throughout
this paper instead of δ̂ we use δ. Moreover, for a set Q̄ ⊆ Q, we use δ(Q̄, w) to denote the set ⋃_{q ∈ Q̄} {δ(q, w)}.
For a FA, a word w ∈ Σ⋆ is said to be defined at a state q ∈ Q if for all w′, w′′ ∈ Σ⋆ and all x ∈ Σ such that w = w′xw′′,
δ(δ(q, w′), x) is defined. We use symbol e to denote the state reached from a given state q with application of
an undefined input at q, i.e. δ(q, x) = e iff x is not defined at state q. We use Σ^ℓ to denote the set of
input sequences of length ℓ and we use n to denote the number of states of the automaton. An automaton
A = (Q, Σ, δ) is synchronizable if there exists an input sequence w ∈ Σ⋆ such that |δ(Q, w)| = 1 and w is
defined at every state of Q. An automaton has a reset functionality if it can be reset to a single state by reading a special
sequence w. In this case w is called a reset sequence (RS). On the other hand, an input sequence w is called
a collapsing sequence for a set of states Q̄ if and only if |δ(Q̄, w)| < |Q̄| [19].
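The definitions above translate almost directly into code. The following is a minimal Python sketch, not the paper's implementation: the dictionary DELTA describes a hypothetical 3-state incomplete automaton over {a, b}, and the helper names run, image, is_collapsing, and is_reset_sequence are illustrative. As an assumption borrowed from Lemma 1, is_collapsing additionally requires w to be defined at every state of the set.

```python
ERR = "e"  # plays the role of the symbol e: the input was not defined

# Hypothetical incomplete automaton; pairs missing from the dict are undefined.
DELTA = {
    ("q1", "a"): "q2", ("q2", "a"): "q2", ("q3", "a"): "q2",
    ("q1", "b"): "q3", ("q2", "b"): "q1",
    # ("q3", "b") is deliberately left undefined
}

def run(delta, q, w):
    """Extended transition function: state reached from q by w, or ERR."""
    for x in w:
        q = delta.get((q, x), ERR)
        if q == ERR:
            return ERR
    return q

def image(delta, states, w):
    """delta(Q', w): the set of states reached from the states in Q'."""
    return {run(delta, q, w) for q in states}

def is_collapsing(delta, states, w):
    """w is defined on `states` (assumption) and shrinks the set."""
    states = set(states)
    img = image(delta, states, w)
    return ERR not in img and len(img) < len(states)

def is_reset_sequence(delta, states, w):
    """w is defined on `states` and sends all of them to one state."""
    img = image(delta, states, w)
    return ERR not in img and len(img) == 1
```

With this table, the one-letter word a is a reset sequence for {q1, q2, q3}, while b is not even defined at q3.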
1.2. Problem statement
Although reset sequences are important in many areas, the scalability of methods for constructing such sequences
has not been addressed thoroughly. By scalability we refer to the maximum size of FA that can be processed
in an acceptable amount of time. Thus, a more scalable algorithm can process larger FA. Besides, although
FA can model a wide range of systems, previous approaches (except the methods given in [20–22]) for deriving
RSs have been developed for deriving short reset sequences from complete FA [23–28] and surprisingly, to our
knowledge, there are no proposed works for effectively constructing short RSs from FA.
One reason for this is clearly the difficulty of managing high memory/time demand when constructing
RSs from FA. The only method for deriving RSs from such FA is the “successor tree” method [20–22]. In this
method, a tree data structure called a successor tree for an FA is constructed by following a breadth-first search
process until (1) the RS is computed or (2) the upper-bound is reached. However, as the size of a successor tree
grows exponentially, an approach that constructs a successor tree would not process very large FA and hence
leads to poor scalability. Therefore, considering the importance of RSs, we believe that the scalability problem
is very important and hence needs to be addressed.
On the other hand, researchers have been taking advantage of general-purpose graphics processing unit (GPGPU) technology in various problems including shortest path, breadth-first search, sorting, and many more [29–34]. Moreover, researchers recently proposed a parallel algorithm to
accelerate the execution of a FA [35, 36]. Massively parallel algorithms have also been presented recently for
deriving state identification sequences from finite state machines [37, 38]. Karahoda et al. recently presented
a fast version of an existing greedy RS generation algorithm [4] for deterministic complete automata [39].
However, despite these significant advances, surprisingly, algorithms for deriving short RSs from complete and
partial FA have not changed in 30 years and in this paper we aim to address this gap.
1.3. Contributions
In this paper, we introduce a new strategy to derive RSs. Existing approaches to generate RSs from
FA use successor trees [23–28]. However, due to the high memory demand of such trees, we abandon this strategy
and introduce a vector-based approach built on a new and efficient data structure that (1) is suitable
for the memory architecture of GPUs and (2) demands less memory and effectively encapsulates data that are
required to derive RSs. In other words, we propose a parallelizable data structure that demands less memory
for generating RSs from FA. Based on the proposed formalism, we propose a massively parallel algorithm that
can derive RSs from FA with many states. We present in Section 4 how the proposed algorithm can be
implemented by using the CUDA tool-kit for CUDA-enabled GPUs. We present the results of experiments
conducted to measure the scalability of the proposed algorithm. Experimental results show that the proposed
algorithm can derive reset sequences from the FA with 16 million states in about 6 minutes.
1.4. Organization of the paper
This paper is organized as follows: in Section 2, we introduce data structures that are suitable for modern
GPGPU hardware, and after that, we present a massively parallel algorithm for deriving reset sequences from
deterministic incomplete finite automata. In Section 3, we present the experimental results. Experimental
results suggest that the proposed algorithm is more scalable and effective in deriving reset sequences compared
to the existing search-tree-based approaches. In Section 4, we draw conclusions and suggest some future work.
We also provide implementation-related details in Section 4.
2. Parallel RS generation algorithm (PRSGA)
2.1. Algorithm design
The proposed algorithm relies on the thin thread strategy, in which threads are exposed to a very limited
amount of data. Therefore, the number of threads that can be launched by a kernel is usually high [40].
The existing RS generation algorithms for partial FA aim to construct a successor tree for the given FA
to derive RSs. Edges expanded from a common node of a successor tree have different input labels. Nodes
are associated with the current states reached from the initial states through application of the input sequence
formed by concatenating the edge labels on the path from the root to that node. Although successor trees are
good at providing the required information, they are very expensive to keep in CPU memory (see Figure 1).
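To make the memory concern concrete, the breadth-first construction of a successor tree can be sketched as follows. This is an illustrative Python sketch under our own assumptions, not the implementation of [20–22]: the function name successor_tree_search, the dictionary encoding of δ, and the pruning of words that hit an undefined transition are all hypothetical choices. The counter nodes_created shows how many tree nodes had to be materialized.

```python
from collections import deque

ERR = None  # marker for an undefined transition

def successor_tree_search(delta, states, alphabet, max_len):
    """Breadth-first search over the successor tree, up to depth max_len.

    Each node stores the current states reached from `states` (in a fixed
    order) by the word labelling the path from the root.
    Returns (word, nodes_created); word is None when no RS was found.
    """
    queue = deque([((), tuple(states))])
    nodes_created = 1
    while queue:
        word, current = queue.popleft()
        if len(set(current)) == 1:
            return word, nodes_created
        if len(word) == max_len:
            continue
        for x in alphabet:
            nxt = tuple(delta.get((q, x), ERR) for q in current)
            if ERR in nxt:
                continue  # word is not defined at some initial state
            nodes_created += 1
            queue.append((word + (x,), nxt))
    return None, nodes_created
```

Because no node is ever shared, the queue can hold up to |Σ|^depth nodes of n states each, which is the growth the text refers to.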
To address this bottleneck, we need a data structure that demands less memory space and is parallelizable.
In order to achieve this, we need to highlight important aspects of a successor tree. First note that a path from
the root to a leaf of a successor tree labels a unique input sequence w . Moreover, we say that w is a RS if the
current states within the leaf are all the same. Therefore, if we somehow associate input sequences with current
states, we can check whether w is a RS for the FA under consideration.
Unlike other RS derivation methods, the proposed parallel RS generation algorithm uses
a states-vector to construct RSs. This, we believe, allows us to exploit the massive parallelism provided by
modern GPUs. We now define states-vectors and present some of their properties.
Definition 1 For a given automaton A = (Q, Σ, δ), a states-vector V is a vector for states Q̄ ⊆ Q, associated
with an input sequence w ∈ Σ⋆, having |Q̄| = m elements such that each element v is associated with an initial
state v_i = q ∈ Q and a current state v_c = q′ ∈ Q such that δ(q, w) = q′. We use the notation V[i] to denote
the i-th element of the states-vector and we use U(V) to denote the number of distinct current states associated
with V.
[Figure 1: successor tree diagram. The root node holds the states s1, s2, s3, s4, ..., sn; each child node holds the current states reached after one more input.]
Figure 1. An example of a successor tree (edge labels are not given to reduce visual complexity).

[Figure 2: six states-vectors V0, ..., V5, each initialized with s1, s2, s3, s4, ..., sn and advanced to V′_i = δ(Q, w_i) for 0 ≤ i ≤ 5. Only V′_4 has all of its current states equal (sa, sa, sa, sa, ..., sa).]
Figure 2. An example of a vector approach. Note that w4 is a RS as after application of w4 to Q we reach state sa
only.

The elements of a states-vector “advance” with input symbol x. Let v be an element of a states-vector
associated with an initial state v_i and a current state v_c. If x is an input symbol then adv(v, x) = v′ if and
only if v_i = v′_i and δ(v_c, x) = v′_c.
Note that when the elements of a states-vector advance with an input sequence w, one of four
possible cases occurs. Let V be a states-vector for a set of states Q̄; then:
1. For each element v ∈ V, we have that v_c = s for a single state s ∈ Q.
2. For at least two elements v, v′ ∈ V, we have that v_c = v′_c = s where s ∈ Q, and for each element
v′′ ∈ V \ {v, v′}, we have that v′′_c = s′ for some s′ ∈ Q.
3. For at least one element v of V, we have that v_c = e, where e ∉ Q.
4. For any pair of elements v, v′ with v ≠ v′, we have that v_c ≠ v′_c, v_c = s, v′_c = s′, and s, s′ ∈ Q.
Note that the first two possibilities state that w is a collapsing/reducing sequence for Q̄. Moreover, if Q̄ = Q
and the first possibility occurs after advancing V with w, then w is a reset sequence for A.
The following shows how states-vectors are related to collapsing sequences.
Lemma 1 Let V be a states-vector for a set of states Q̄ associated with input sequence w ∈ Σ⋆. If the elements
of vector V are advanced with w and for at least two elements v, v′ ∈ V, we have v_c = v′_c, and for each element
v of V we have that v_c ≠ e, then w is a collapsing sequence for Q̄.
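A states-vector, the adv operation, the measure U(V), and the four cases listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: elements are encoded as (initial, current) pairs, the table DELTA is hypothetical, and classify is a helper name of ours that checks the cases in the order 3, 1, 2, 4 so that exactly one number is returned.

```python
ERR = "e"  # the symbol e for "input not defined"

# Hypothetical incomplete automaton over {x, y, z}.
DELTA = {
    ("s1", "x"): "s1", ("s2", "x"): "s1", ("s3", "x"): "s3",
    ("s1", "y"): "s2", ("s2", "y"): "s2", ("s3", "y"): "s2",
    ("s1", "z"): "s3",
}

def make_states_vector(states):
    """Each element is (initial state, current state); initially both agree."""
    return [(q, q) for q in states]

def adv(v, x, delta):
    """Advance one element: the initial state is kept, the current one moves."""
    v_i, v_c = v
    return (v_i, ERR if v_c == ERR else delta.get((v_c, x), ERR))

def advance(V, w, delta):
    for x in w:
        V = [adv(v, x, delta) for v in V]
    return V

def U(V):
    """Number of distinct current states (ERR, if present, is counted too)."""
    return len({v_c for _, v_c in V})

def classify(V):
    """Which of the four cases holds, checked in the order 3, 1, 2, 4."""
    currents = [v_c for _, v_c in V]
    if ERR in currents:
        return 3
    if len(set(currents)) == 1:
        return 1
    if len(set(currents)) < len(currents):
        return 2
    return 4
```

In the sense of Lemma 1, a result of 1 or 2 from classify means that the applied word is a collapsing sequence for the set of initial states.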
Although states-vectors provide some level of parallelism, we can further exploit the massive parallelism
provided by a GPU through a group of states-vectors called a cluster.
Definition 2 A cluster C_k for a set of states Q̄ is a vector having k states-vectors for Q̄. Intuitively, C_k[i]
returns the i-th states-vector and C_k[i][j] returns the j-th element of the i-th states-vector.
Here, k is the cardinality of the cluster.
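One way to picture a cluster is as a k × m array stored in a single flat buffer, where cell (i, j) can be advanced independently of every other cell. The Python sketch below illustrates this idea with plain loops. The row-major layout, the integer encoding of states and inputs, and the names reset_cluster, evolve_cluster, and row are assumptions made for illustration and are not taken from the paper's CUDA implementation.

```python
ERR = -1  # integer code standing in for the symbol e

def reset_cluster(v_min_currents, k):
    """k copies of the current states of V_min, stored row-major in one list."""
    return list(v_min_currents) * k

def evolve_cluster(cluster, m, words, delta):
    """Advance row i with words[i]; delta[q][x] is the next state or ERR.

    Every (i, j) cell is independent of the others, which is what a GPU
    kernel with one thread per cell could exploit; here it is a plain loop.
    """
    out = list(cluster)
    for i, w in enumerate(words):
        for j in range(m):
            q = out[i * m + j]
            for x in w:
                if q == ERR:
                    break
                q = delta[q][x]
            out[i * m + j] = q
    return out

def row(cluster, m, i):
    """The i-th states-vector (current states only) of the cluster."""
    return cluster[i * m:(i + 1) * m]
```

For a hypothetical table delta = [[1, 2], [1, 0], [1, ERR]] and k = 2, applying the words (0,) and (1, 0) to the states 0, 1, 2 yields the rows [1, 1, 1] and [1, 1, ERR].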
We are now ready to provide the intuition behind the proposed algorithm. The algorithm keeps a states-vector (V_min) and an input sequence (w_min) that are used to record the best choices made throughout the
execution.
The algorithm iteratively forms a reset sequence R. At each iteration, for a given states-vector V_min,
the algorithm forms a cluster (C_k) and then generates k different input sequences (w1, w2, ..., wk) from Σ^κ,
where k, κ are positive integers supplied to the algorithm. This is then followed by advancing the elements of
cluster C_k. Upon completion, the algorithm checks whether the current states of at least one states-vector (V_i)
of cluster C_k are the same. If so, then the algorithm concatenates w_i to R, declares that a RS is found, and
returns R. Otherwise, the algorithm searches for a minimum states-vector V_j in C_k. A states-vector V_j of C_k is
called minimum if (1) U(V_j) < U(V_min) and (2) for every element V_i ≠ V_j of C_k we have that U(V_j) ≤ U(V_i).
If there exists a minimum states-vector V_j, it sets V_min to V_j and w_min to w_j and repeats these steps until
either it finds a sequence that resets all the current states or all input sequences from Σ^κ are applied.
When all input sequences from set Σ^κ have been applied to the elements of C_k, then depending on the outcome,
the algorithm follows one of the following options: (1) If neither a minimum states-vector nor a sequence that
resets all the current states is found, then the algorithm increments κ by one and checks whether κ is less than
an integer ℓ, where ℓ is supplied to the algorithm. If κ < ℓ the algorithm resets C_k (i.e.,
generates k copies of V_min to form C_k) and then repeats the process mentioned above. Otherwise it returns
the message “A does not have a reset sequence”. (2) If all input sequences from set Σ^κ are applied and a
minimum states-vector is found, i.e. V_min is updated, then the algorithm concatenates R with w_min, resets
C_k, and then repeats the process mentioned above.
Since all input sequences from set Σ^ℓ may be applied, the proposed algorithm requires exponentially many
steps to find a RS or declare that the underlying automaton has no RS. However, as checking the existence of
a RS for an FA is a PSPACE-complete problem, we cannot escape this consequence.
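To give a feel for the size of the search space, the short Python sketch below counts the input sequences involved. The function names are ours, and the sample values (a ternary alphabet, ℓ = 20, k = 60) are the ones used later in the experiments. The counts cover a single sweep over the lengths 1 to ℓ only, since the algorithm restarts κ after each update of V_min.

```python
def words_of_length(sigma_size, kappa):
    """|Sigma^kappa|: number of input sequences of length kappa."""
    return sigma_size ** kappa

def words_up_to(sigma_size, ell):
    """Number of input sequences of length 1..ell (one full sweep)."""
    return sum(sigma_size ** kappa for kappa in range(1, ell + 1))

def clusters_needed(sigma_size, kappa, k):
    """Clusters of k states-vectors needed to cover Sigma^kappa (ceiling)."""
    return -(-sigma_size ** kappa // k)

# Ternary alphabet, ell = 20, k = 60:
longest = words_of_length(3, 20)      # 3486784401
sweep = words_up_to(3, 20)            # 5230176600
clusters = clusters_needed(3, 20, 60) # 58113074
```

Even for three inputs, a single sweep to length 20 touches more than five billion sequences, which is why the number of sequences processed per cluster matters in practice.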
2.2. High-level algorithm overview
The outline of the algorithm is given in the Algorithm listing. The algorithm receives an automaton A and positive
integers ℓ, k ∈ ℤ_{>0}. The algorithm has three nested loops: OuterLoop, MiddleLoop, and InnerLoop.
OuterLoop iterates until all input sequences from set Σ^ℓ have been applied or a RS is constructed.
In OuterLoop, the algorithm first checks whether all input sequences have been applied or not. If not, then it
increments the current length value (κ) by one and enters MiddleLoop. Otherwise, it declares that “A does
not have a reset sequence” and terminates (Lines 2–7).
MiddleLoop iterates at most ⌊|Σ|^κ / k⌋ + 1 times. In the MiddleLoop the algorithm carries out the
following steps:
1. It resets C_k (in parallel; see footnote 1) (Line 9).
2. It retrieves k different input sequences w1, w2, ..., wk of length κ (Line 10).
3. It evolves elements of C_k (in parallel) (Line 11).
4. It enters the InnerLoop, which iterates over input sequences: at the i-th iteration, 1 ≤ i ≤ k, the algorithm
checks whether elements in C_k[i] are the same. If so, the algorithm returns w_i as a reset sequence (in
parallel) (Lines 13–14). Besides, the algorithm searches for a minimum states-vector, and if it finds one it
updates V_min (Lines 15–17).
Note that for a given k, the algorithm may not always retrieve k input sequences (Line 10), as k may
not be a factor of |Σ|^κ. This happens when |Σ|^κ mod k ≠ 0. In such a case, after ⌊|Σ|^κ / k⌋ iterations (at
the (⌊|Σ|^κ / k⌋ + 1)-th iteration), the algorithm retrieves |Σ|^κ mod k input sequences and continues to execute.
1 For each i, j, it writes the initial state value (q_j) to C[i][j].
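The retrieval of at most k input sequences per MiddleLoop iteration can be sketched with a lazy generator, so that Σ^κ never has to be stored in full. This is an illustrative Python sketch; the name batches and the lexicographic enumeration order are assumptions, since the text does not fix an order.

```python
from itertools import islice, product

def batches(alphabet, kappa, k):
    """Yield successive lists of at most k words from Sigma^kappa."""
    words = product(alphabet, repeat=kappa)
    while True:
        batch = list(islice(words, k))
        if not batch:
            return
        yield batch

# |Sigma| = 3, kappa = 2, k = 4: nine words arrive as batches of 4, 4, and 1.
sizes = [len(b) for b in batches("abc", 2, 4)]
```

Here 9 mod 4 = 1, so the third and last batch holds a single word, and the number of batches equals ⌊9 / 4⌋ + 1 = 3. When k divides |Σ|^κ exactly, one batch fewer is produced, which is consistent with the bound being stated as "at most".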

Algorithm: Parallel RS construction algorithm. Statements marked (in parallel) are executed in parallel.
Input: An automaton A = (Q, Σ, δ) and positive integers ℓ, k
Output: A reset sequence for A
begin
 1    Execute ← True, κ ← 0, R ← ε, isFound ← False, w_min ← ε,
      V_min ← ⟨(q1, q1), (q2, q2), ..., (qn, qn)⟩
      // OuterLoop
 2    while Execute is True do
 3        if !isFound then
 4            if κ < ℓ then
 5                κ ← κ + 1
              end
              else
 6                Declare that A is not synchronizing and terminate
              end
          end
          else
 7            κ ← 1, isFound ← False
          end
          // MiddleLoop
 8        while there exists unprocessed w ∈ Σ^κ and Execute is True do
 9            Reset C_k with V_min (in parallel)
10            Retrieve k input sequences w1, w2, ..., wk ∈ Σ^κ and associate them with C_k
11            Evolve elements of C_k (in parallel)
              // InnerLoop
12            foreach w_i, 1 ≤ i ≤ k do
13                if all elements of vector C_k[i] are the same then (in parallel)
14                    Return R.w_i
                  end
15                if U(V_i) < U(V_min) then
16                    V_min ← V_i and w_min ← w_i
17                    isFound ← True
                  end
              end
          end
      end
end
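The loop structure of the listing can be sketched sequentially in Python as follows. This is a hedged illustration, not the paper's CUDA implementation: the parallel steps are replaced by ordinary loops, only the distinct current states of V_min are kept (the initial-state component is dropped), and the best candidate of a pass is committed to R at the end of the pass, following the textual description in Section 2.1. The names prsga_sequential and advance are ours.

```python
from itertools import islice, product

ERR = None  # marker for an undefined transition

def advance(currents, w, delta):
    """Current states reached from each state in `currents` by word w."""
    out = []
    for q in currents:
        for x in w:
            q = delta.get((q, x), ERR)
            if q is ERR:
                break
        out.append(q)
    return out

def prsga_sequential(states, alphabet, delta, ell, k):
    """Return a reset sequence as a tuple of inputs, or None if none is found."""
    v_min = tuple(sorted(set(states)))      # distinct current states of V_min
    if len(v_min) == 1:
        return ()
    reset_seq = ()                          # R
    kappa = 0
    while kappa < ell:                      # OuterLoop
        kappa += 1
        best_w, best_v = None, v_min
        words = product(alphabet, repeat=kappa)
        while True:                         # MiddleLoop: one cluster per batch
            batch = list(islice(words, k))
            if not batch:
                break
            for w in batch:                 # InnerLoop: one states-vector per word
                cur = advance(v_min, w, delta)
                if ERR in cur:
                    continue                # w is not defined on V_min
                distinct = tuple(sorted(set(cur)))
                if len(distinct) == 1:
                    return reset_seq + w
                if len(distinct) < len(best_v):
                    best_w, best_v = w, distinct
        if best_w is not None:              # V_min was improved during this pass
            reset_seq += best_w
            v_min = best_v
            kappa = 0                       # next pass starts again at length 1
    return None
```

Each commit strictly shrinks V_min, so at most n − 1 commits can occur before either a reset sequence is returned or κ reaches ℓ without progress.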

We now show the execution of the PRSGA using an example. Consider the FA A1 given in Figure 3(a).
Let us suppose that A1, ℓ = 10, and k = 50 are provided to the PRSGA. Then the algorithm will first
increment κ and hence we have κ = 1. Since k > 3, the algorithm first constructs |Σ^1| = 3 states-vectors.
This is then followed by advancing the states-vectors with these inputs (top image of Figure 3(b)). Note that
the algorithm fails to find a RS, but it finds that input sequence x1 results in a minimum states-vector. Thus,
the algorithm updates V_min with s1, s2, and s3, and then the algorithm moves to the second iteration.
In the second iteration, the algorithm again constructs |Σ^1| = 3 states-vectors by using the V_min vector
and then it evolves the states-vectors with these inputs. Again the algorithm fails to find a RS, but it finds that
collapsing sequence x1 results in a minimum states-vector (consisting of s1, s3). Thus, the algorithm updates
V_min with s1 and s3, and then the algorithm moves to the next iterations.
The algorithm will keep applying this procedure until it finds a collapsing sequence that merges states s1
and s3. Note that when κ = 5, the PRSGA finds an input sequence x2 x3 x3 x1 x1 (see Figure 3(b)) that merges states s1 and
s3.
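The example can be replayed in a few lines of Python. The transition table below is an assumption: it is read off the first-iteration rows of Figure 3(b) (the images of s1, s2, s3, s4 under x1, x2, and x3), because the edge labels of Figure 3(a) are not reproduced in the text. The helper names run and image are ours. The sketch checks the word x2 x3 x3 x1 x1, which appears in Figure 3(b), on the states s1 and s3.

```python
ERR = "er"  # the state labeled ERROR in Figure 3

# ASSUMPTION: table reconstructed from the 1st-iteration rows of Figure 3(b).
A1 = {
    "s1": {"x1": "s1", "x2": "s1", "x3": "s2"},
    "s2": {"x1": "s3", "x3": "s4"},
    "s3": {"x1": "s3", "x2": "s4"},
    "s4": {"x1": "s2", "x3": "s1"},
}

def run(state, word):
    """State reached from `state` by `word`, or ERR if the word is undefined."""
    for x in word:
        state = A1[state].get(x, ERR)
        if state == ERR:
            return ERR
    return state

def image(states, word):
    return [run(q, word) for q in states]

ALL = ["s1", "s2", "s3", "s4"]
first = {x: image(ALL, [x]) for x in ("x1", "x2", "x3")}   # 1st iteration
merge_word = ["x2", "x3", "x3", "x1", "x1"]
merged = image(["s1", "s3"], merge_word)                   # V_min = {s1, s3}
full = image(ALL, ["x1", "x1"] + merge_word)               # R followed by the word
```

Under the assumed table, merge_word sends both s1 and s3 to s3, and the concatenation x1 x1 x2 x3 x3 x1 x1 sends all four states of A1 to s3.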
[Figure 3, panel (b), recoverable content. 1st iteration: the states-vectors V0, V1, V2 for s1, s2, s3, s4 advance to ⟨s1, s3, s3, s2⟩ under x1, ⟨s1, er, s4, er⟩ under x2, and ⟨s2, s4, er, s1⟩ under x3. 2nd iteration: V3, V4, V5 reach ⟨s1, s3, s3⟩ under x1, ⟨s1, s4, s4⟩ under x2, and ⟨s2, er, er⟩ under x3. In the last iteration shown, V_min has current states s1, s3 and input sequences of length 5 are applied; δ(V_min, x2 x3 x3 x1 x1) gives ⟨s3, s3⟩.]
(a) FA A1 with 3 inputs and 4 states
(b) Execution steps of PRSGA when A1, κ = 10, and k = 50 are provided to the algorithm, where er denotes the state labeled ERROR

Figure 3. An example of FA and execution steps of the PRSGA.

Clearly, for sufficiently large ℓ values, the PRSGA algorithm returns a RS if the underlying automaton
A possesses one such reset sequence. However, as mentioned above, the algorithm may require an exponential
amount of time and space for constructing such a sequence.
3. Experiments
In this section we first discuss what tools were used to construct FA and present the test data. Afterwards we
present results of experiments conducted on FA and finally we present some discussions. We implemented the
parallel reset generation algorithm (PRSGA) using CUDA. We present the implementation details in Section
4. We implemented the brute-force algorithm (BF(CPU)) described in [22] (see footnote 2) using C++. The experiments
explored two aspects that are of practical importance: the time required to construct RSs and the length of
these RSs. Naturally, the lower the length and the time of derivation, the better the approach.
2 Through constructing a successor tree as described in Chapter 13.2.
3.1. FA generation
We used the automata generation method and the tool used in [41] to generate test cases. In this approach, for
a given n and Σ, we first randomly pick a completeness ratio R from an interval that we call the completeness
interval, I = [low, high]. Afterwards, we set the transitions of FA A: for each state q and for each input x ∈ Σ
we randomly set a next state q′. Then we randomly drop R · n · |Σ| / 100 transitions from A. This is
then followed by testing whether the produced automaton has a computable RS (see footnote 3). If not, we discard this automaton;
otherwise, we keep the produced automaton. Following this method, we constructed three sets of automata:
• The aim of test set SET1 was to understand the performance of the algorithm under varying state sizes.
We randomly generated automata with n states where n ∈ {2^6, 2^8, ..., 2^24}, with ternary input and with
I = 15%–25%. In SET1, for each n we generated 100 FA.
• In SET2 our aim was to see the effect of the number of inputs on the performance of the algorithm.
Therefore, we randomly generated FA with n = 2^8 and |Σ| ∈ {2^4, 2^6, 2^8, 2^10} and with I = 15%–25%.
In SET2, for each different |Σ| value we constructed 100 FA.
• The aim of test set SET3 was to understand the effect of the completeness interval. Hence, we randomly
generated automata with 2^8 states with ternary input where I ∈ {[5, 10], [15, 20], [25, 30], [35, 40], [45, 50],
[55, 60], [65, 70]}. Again, in SET3 for each I we generated 100 FA.
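The generation procedure described above can be sketched in Python as follows. This is a minimal sketch and not the tool of [41]: the function name generate_fa, the uniform choice of R within the interval, and the integer encoding of states are assumptions, and the final test for a computable RS is left out.

```python
import random

def generate_fa(n, alphabet, interval, rng):
    """Random incomplete automaton with n states; returns (delta, ratio).

    delta maps (state, input) to the next state; dropped pairs are absent.
    """
    low, high = interval
    ratio = rng.uniform(low, high)                 # R, in percent (assumed uniform)
    pairs = [(q, x) for q in range(n) for x in alphabet]
    delta = {p: rng.randrange(n) for p in pairs}   # start from a complete automaton
    n_drop = int(ratio * n * len(alphabet) / 100)  # R * n * |Sigma| / 100
    for p in rng.sample(pairs, n_drop):            # drop that many transitions
        del delta[p]
    return delta, ratio

# Example in the spirit of SET1: 64 states, ternary input, I = [15, 25].
rng = random.Random(0)
delta, ratio = generate_fa(64, "abc", (15, 25), rng)
```

Passing an explicit random.Random instance keeps the generated automata reproducible across runs, which helps when the same test set has to be handed to several algorithms.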
In total, we used 2100 FA. In order to accomplish experiments in an acceptable amount of time,
throughout the experiments, we set ℓ = 20 (see footnote 4) and we allowed algorithms to finish their execution in 1250
seconds. Experiments were carried out on an Intel Core 2 Extreme CPU (Q6850) with 8 GB RAM and an NVIDIA
Tesla K40 GPU under the 64-bit Windows Server 2008 R2 operating system. During generation we also
counted the number of FA with RSs. Figure 4 gives the result. While forming SET1, we noted the chance of
constructing a FA with a RS. The probability of constructing an automaton with a RS is > 89%.
[Figure 4: plot of the chance of resetting (vertical axis, ranging from 0.897 to 0.9) against the number of states (horizontal axis, 64 to 16777216).]
Figure 4. Probability of generating FA with RS.


3 Note that a FA with n states may have a RS of exponential length. Therefore we set 1250 seconds to derive RSs.
4 Note that for Σ and κ the number of input sequences to be generated is bounded above by |Σ|^κ.
3.2. Effect of number of states
For each automaton in SET1 we constructed RSs by using the PRSGA and BF(CPU). The results for the
experiments performed on SET1 are given in Figure 5(a). The results verify our intuition: the GPU-accelerated
algorithm (PRSGA) is 9 times faster than the BF(CPU) construction algorithm (when k = 60 and n ≤ 256).
Note that the proposed algorithm constructs RSs for FA with 16 million states in less than 228 seconds on average (in
about 4 minutes) when k = 60. As the proposed algorithm exploits the parallelism provided by the
GPU through processing k states-vectors, the number k affects the performance of the proposed algorithm.
From Figure 5(a) we see that as the k value drops the time required to construct RSs increases. Moreover, we
observe that BF(CPU) fails to construct RSs for FA with n > 256 states within 1245.1 seconds, indicating that
the proposed algorithm increases the scalability of constructing RSs of partial FA by a factor of 65536.
Note that the length of the RSs returned by the PRSGA algorithm does not depend on the parameter k
and so we present the results obtained when k = 60. We see in Figure 5(b) that the average length of RS is
not larger than 20 and increases with the number of states.
[Figure 5: two plots over the number of states (64 to 16777216). Panel (a) shows log2 of the time in seconds for PRSGA(k=20), PRSGA(k=40), PRSGA(k=60), and BF(CPU); panel (b) shows log2 of the RS length.]
(a) Average time required to construct RSs for FA in SET1
(b) Average length of RSs constructed for FA in SET1

Figure 5. Experimental results on test set SET1.

3.3. Effect of number of inputs
The results of the experiments conducted on SET2 are given in Figure 6(a). Results are interesting; we observed
that BF(CPU) could only return 2 RSs for FA with 16 inputs. We noted that lengths were both 3 ; moreover,
BF(CPU) could return 1 RS for FA with 64 inputs where the length of the RS was 3 . For all other cases
BF(CPU) failed to return RSs due to insufficient memory or long computation time. However, the PRSGA
algorithm could compute RSs for all of the FA. We observed that as the number of inputs increases, the time
required to construct RSs increases as well regardless of the parameter k . Clearly this is as expected since the
number of input sequences to be processed should increase with the number of inputs. On the other hand, we
also observe from Figure 6(b) that the length of the RSs reduces as the number of inputs increases.
3.4. Effect of completeness interval (I)
The completeness interval (I) determines what percentage of the state/input pairs are unspecified. One would
therefore expect the value of the completeness interval to affect the performance of the proposed algorithm.
The results of experiments indicate that the time required to construct RSs increases exponentially when
using the higher set of values for I (Figure 7(a)). Moreover, we observe that the transition saturation ratio
affects the length of RSs; as we increase the completeness interval, the length of RSs increases exponentially.
3.5. Threats to validity
This section briefly reviews the threats to validity and how these were reduced. We consider two types of
threats: those to internal validity and external validity.

[Figure 6: two plots for PRSGA(k=20), PRSGA(k=40), and PRSGA(k=60) over the values 16, 64, 256, and 1024 on the horizontal axis; one vertical axis is labeled RS Length.]
(a) Average time required to construct RSs for FA in SET2
(b) Average length of RSs constructed for FA in SET2
Figure 6. Experimental results on test set SET2.

[Figure 7: two plots for PRSGA(k=20), PRSGA(k=40), and PRSGA(k=60); one vertical axis is labeled RS Length.]
(a) Average time required to construct RSs for FA in SET3
(b) Average length of RSs constructed for FA in SET3
Figure 7. Experimental results on test set SET3.

Threats to internal validity concern factors that might introduce bias. The main source of such threats
is the tools used to run the experiments. The FA generation tool has been used in a number of projects and
was tested. The implementations of these algorithms were carefully checked and also tested with a range of FA.
To further reduce this threat, we also used an existing tool that checks if a given input sequence is a RS for the
underlying FA. This tool was used to check all of the RSs generated by the PRSGA and BF(CPU).
Threats to external validity concern our ability to generalize from the experiments. There is always such
a threat to validity since we do not know the space of relevant FA and certainly have no good way of sampling
from this. We reduced this threat by using randomly generated FA. We also varied the number of inputs and
states and the completeness interval.
3.6. Discussion
Recall that in Section 2, we observed that the PRSGA is an exponential algorithm; this cannot be avoided since
determining the existence of a RS is a PSPACE-hard problem. As the lengths of the RSs generated
from the FA given in SET1, SET2, and SET3 are not longer than the logarithm of the number of states, it
appears that we have not found such long executions.
Another important point related to the PRSGA is the need to select the k -cluster parameter k . The
experiments revealed that when we select a value for k that is too large, the algorithm gets faster as the size
of input sequences to be processed increases. However, on the other hand, if we select a value of k that is too
small then there is a risk of decreasing the GPU occupancy and increasing the memory traffic between the CPU
memory and the GPU memory. Furthermore if we select a value of k that is too large, then there is a risk of
calling too many threads that reduce the GPU performance due to the high interleaved transactions. Therefore,
before the PRSGA, the parameter k should be selected carefully.
The experimental results presented in this paper suggest that the algorithm is capable of deriving RSs from FA
and complete FA with 16 million states in less than 6 minutes. These results are important for two reasons.
First, we are not aware of any previous work that can process such large FA. Second, with this method researchers
can investigate properties of RSs on larger FA, which may lead to interesting research directions.
4. Conclusion
In this paper we addressed the scalability issue encountered while constructing RSs from incomplete FA through
massively parallel GPGPUs. We first provided a high-level overview of the algorithm. We then explained the
results of an experimental study done to measure the performance of the algorithm by using randomly generated
FA. The results showed that the proposed algorithm can effectively derive RSs from large partial FA. We provide
low-level descriptions for the algorithm in the Appendix.
There are several lines of future work. First, it would be interesting to study massively parallel heuristic
algorithms for deriving RSs from FA. Second, despite the high energy requirement, it would be interesting to
investigate multicore approaches for constructing RSs. Finally, it would also be interesting to extend this work
to nondeterministic partial-complete FA.
References
[1] Jourdan G, Ural H, Yenigün H. Reduced checking sequences using unreliable reset. Information Processing Letters
2015; 115 (5): 532-535. doi:10.1016/j.ipl.2015.01.002
[2] Türker UC, Yenigün H. Complexities of some problems related to synchronizing, non-synchronizing and monotonic automata. International Journal of Foundations of Computer Science 2015; 26 (1): 99-122. doi:10.1142/
S0129054115500057
[3] Ananichev DS, Volkov MV. Synchronizing monotonic automata. Theoretical Computer Science 2004; 327 (3): 225239.
[4] Eppstein D. Reset sequences for monotonic automata. SIAM Journal of Computing 1990; 19 (3): 500-510.
[5] Natarajan BK. An Algorithmic Approach to the Automated Design of Parts Orienters. In: 27th Annual Symposium
on Foundations of Computer Science; Toronto, Canada; 1986. pp. 132-142.
[6] Chow TS. Testing software design modeled by finite-state machines. IEEE Transactions on Software Engineering
1978; 4 (3): 178-187.
[7] Boute R. Distinguishing sets for optimal state identification in checking experiments. IEEE Transactions on Computers 1974; 23 (8): 874-877. doi:10.1109/T-C.1974.224043
[8] Petrenko A, Bochmann G. Selecting test sequences for partially-specified nondeterministic finite state machines. In:
International Workshop on Protocol Test Systems; London, UK; 1995. pp. 95-110.

3554

Uraz Cengiz TÜRKER/Turk J Elec Eng & Comp Sci

[9] Hierons R. Minimizing the number of resets when testing from a finite state machine. Information Processing Letters
2004; 90 (6): 287-292.
[10] Hierons R, Ural H. Generating a checking sequence with a minimum number of reset transitions. Automated Software
Engineering 2010; 17 (3): 217-250.
[11] Rezaki A, Ural H. Construction of checking sequences based on characterization sets. Computer Communications
1995; 18 (12): 911-920.
[12] Vasilevskii M. Failure diagnosis of automata. Cybernetics 1973; 9 (4): 653-665. doi:10.1007/BF01068590
[13] Gonenc G. A method for the design of fault detection experiments. IEEE Transactions on Computers 1970: 19 (6):
551-558. doi:10.1109/T-C.1970.222975
[14] Ural H, Wu X, Zhang F. On minimizing the lengths of checking sequences. IEEE Transactions on Computers 1997;
46 (1): 93-99.
[15] Hierons RM, Ural H. Reduced length checking sequences. IEEE Transactions on Computers 2002; 51 (9): 1111-1117.
[16] Benenson Y, Paz-Elizur T, Rivka A, Keinan E, Livneh Z et al. Programmable and autonomous computing machine
made of biomolecules. Nature 2001; 414 (6862): 430-434. doi:10.1038/35106533
[17] Benenson Y, Adar R, Paz-Elizur ZLT, Shapiro E. DNA molecule provides a computing machine with both data
and fuel. Proceedings of the National Academy of Sciences of the United States 2003; 100 (5): 2191-2196. doi:
10.1073/pnas.0535624100
[18] Stojanovic MN, Stefanovic D. A deoxyribozyme-based molecular automaton. Nature Biotechnology 2003; 21 (9):
1069-1074.
[19] Cherubini A, Gawrychowski P, Kisielewicz A, Piochi B. A combinatorial approach to collapsing words. Mathematical
Foundations of Computer Science 2006; 4162: 256-266.
[20] Gill A. Introduction to the Theory of Finite State Machines. New York, NY, USA: McGraw-Hill, 1962.
[21] Hennie FC. Finite-State Models For Logical Machines. New York, NY, USA: Wiley, 1968.
[22] Kohavi Z. Switching and Finite State Automata Theory. New York, NY, USA: McGraw-Hill, 1978.
[23] Roman A. Synchronizing finite automata with short reset words. Applied Mathematics and Computation 2009;
209 (1): 125-136.
[24] Roman A. Genetic algorithm for synchronization. In: Language and Automata Theory and Applications, Third
International Conference; Tarragona, Spain; 2009. pp. 684-695.
[25] Roman A. New algorithms for finding short reset sequences in synchronizing automata. In: International Enformatika Conference; Prague, Czech Republic; 2005. pp. 13-17.
[26] Kudłacik R, Roman A, Wagner H. Effective synchronizing algorithms. Expert Systems and Applications 2012;
39 (14): 11746-11757.
[27] Kisielewicz A, Szykuła M. Generating small automata and the Černý conjecture. In: Implementation and Application of Automata; Halifax, Canada; 2013. pp. 340-348. doi:10.1007/978-3-642-39274-0_30
[28] Kisielewicz A, Kowalski J, Szykuła M. A fast algorithm finding the shortest reset words. In: Computing and
Combinatorics; Hangzhou, China; 2013. pp. 182-196. doi:10.1007/978-3-642-38768-5_18
[29] Satish N, Harris M, Garland M. Designing efficient sorting algorithms for manycore GPUs. In: IEEE International
Symposium on Parallel & Distributed Processing; Rome, Italy; 2009. pp. 1-10.
[30] Merrill D, Garland M, Grimshaw A. Scalable GPU graph traversal. ACM SIGPLAN Notices 2012; 47: 117-128.
[31] Che S, Boyer M, Meng J, Tarjan D, Sheaffer JW et al. A performance study of general-purpose applications on
graphics processors using CUDA. Journal of Parallel and Distributed Computing 2008; 68 (10): 1370-1380.

3555

Uraz Cengiz TÜRKER/Turk J Elec Eng & Comp Sci

[32] Luo L, Wong M, Hwu WM. An effective GPU implementation of breadth-first search. In: Design Automation
Conference; Anaheim, CA, USA; 2010. pp. 52-55.
[33] Harish P, Narayanan P. Accelerating large graph algorithms on the GPU using CUDA. In: High Performance
Computing; Goa, India; 2007. pp. 197-208.
[34] Kirk DB, Wen-Mei WH. Programming Massively Parallel Processors: A Hands-On Approach. Oxford, UK: Newnes,
2012.
[35] Mytkowicz T, Musuvathi M, Schulte W. Data-parallel finite-state machines. ACM SIGPLAN Notices 2014; 49 (4):
529-542.
[36] Mytkowicz T, Schulte W. Maine: A Library for Data Parallel Finite Automata. Technical Report. New York, NY,
USA: Microsoft Research, 2012.
[37] Hierons RM, Türker UC. Parallel algorithms for testing finite state machines: generating UIO sequences. IEEE
Transactions on Computers 2016: 42 (11): 1077-1091.
[38] Hierons RM, Türker UC. Parallel algorithms for testing finite state machines: harmonised state identifiers and
characterising sets. IEEE Transactions on Computers 2016; 65 (11): 3370-3383.
[39] Karahoda S, Erenay OT, Kaya K, Türker UC, Yenigün H. Parallelizing heuristics for generating synchronizing
sequences. In: Testing Software and Systems; Graz, Austria; 2016. pp. 106-122.
[40] Klingbeil G, Erban R, Giles M, Maini P. Fat versus thin threading approach on GPUs: application to stochastic
simulation of chemical reactions. IEEE Transactions on Parallel and Distributed Systems 2012; 23 (2): 280-287.
[41] Hierons RM, Türker UC. Distinguishing sequences for partially specified FSMs. In: NASA Formal Methods;
Houston, TX, USA; 2014. pp. 62-76.

3556

Uraz Cengiz TÜRKER/Turk J Elec Eng & Comp Sci

Appendices
Low-level algorithm details.
In developing the PRSGA we used several data structures to perform efficient memory transactions and
efficient thread utilization.
• The A vector holds the transitions of the underlying FA: given a state qi and an input symbol x, A returns
either a state qj such that δ(qi, x) = qj or ERROR, indicating that the input is not defined. We also add
|Σ| elements to A for the ERROR state, such that for all x ∈ Σ, δ(ERROR, x) = ERROR. Hence, the
size of A is (n + 1)|Σ|.
• The CurrentStates vector corresponds to the elements of k-cluster Ck, i.e. it holds the relationship between
initial and current states. Hence, the size of CurrentStates is at most kn.
• The InputSequence vector corresponds to the input sequences associated with the states-vectors of
k-cluster Ck. That is, the InputSequence vector holds the input sequences that are going to be applied to
the elements of the states-vectors of k-cluster Ck. Since there are k input sequences, and the current length
value of input sequences (ℓ) varies between 1 and κ, the size of InputSequence is at most kκ.
The PRSGA uses three kernels: Reset (called on line 9 of Algorithm), Apply (called on line 11 of
Algorithm), and Test (called on line 13 of Algorithm).
The Reset kernel is used to reset the CurrentStates vector. In order to do this, the Reset kernel receives
the CurrentStates vector, the Vmin vector, the size of the Vmin vector, and the value of k. The Reset kernel is
called with kη threads. During the execution, thread ti (where 0 ≤ i < kη) copies the (i − ⌊i/η⌋η)th value
of the Vmin vector to the ith value of the CurrentStates vector. Note that the read/write operations carried out in
the Reset kernel are coalesced and do not cause thread divergence.
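The index arithmetic of the Reset kernel can be illustrated with a sequential Python sketch in which each loop iteration stands for one GPU thread. The function name `reset_kernel` is hypothetical, and η is assumed to be the length of the Vmin vector.

```python
def reset_kernel(current_states, v_min, k):
    """Sequential stand-in for the Reset kernel: one loop iteration per thread t_i."""
    eta = len(v_min)
    for i in range(k * eta):
        # i - floor(i / eta) * eta is the position of thread i inside its states-vector
        current_states[i] = v_min[i - (i // eta) * eta]


v_min = [4, 7, 9]
current_states = [0] * (2 * len(v_min))
reset_kernel(current_states, v_min, k=2)
print(current_states)  # [4, 7, 9, 4, 7, 9]
```

Consecutive threads read and write consecutive positions, which is consistent with the coalesced access pattern noted above.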
The Apply kernel is used to evolve elements of the Ck vector. In other words, threads that execute the
Apply kernel apply input sequences retrieved from the InputSequence vector to the current states retrieved from
the CurrentStates vector. Therefore, the Apply kernel receives CurrentStates, InputSequence, number of input
sequences k , and current length value κ. The Apply kernel is launched with kη threads. The Apply kernel is
an iterative procedure and the number of iterations is equal to κ.
At the jth iteration, thread ti retrieves the value (state id) from CurrentStates[i], computes the input-index
value Σ(i, j, η, κ), and fetches the input symbol from InputSequence[Σ(i, j, η, κ)]. The input-index value
is computed as follows:
$$\Sigma(i, j, \eta, \kappa) = \lfloor i/\eta \rfloor \kappa + j. \tag{1}$$
After the input and the current state have been retrieved, thread ti retrieves the next state value from
vector A and writes the next state value to CurrentStates[i]. Thread ti repeats these steps as long as j < κ.
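A sequential Python sketch of the Apply kernel is given next as an illustration; the outer loop stands for the kη threads. The row-major layout of A, the encoding of ERROR as state id n, and the function names `input_index` and `apply_kernel` are assumptions of this sketch rather than details taken from the implementation.

```python
def input_index(i, j, eta, kappa):
    """Input-index value from Equation (1): floor(i / eta) * kappa + j."""
    return (i // eta) * kappa + j


def apply_kernel(A, alphabet_size, current_states, input_sequence, k, eta, kappa):
    """Sequential stand-in for the Apply kernel: the outer loop plays the role of thread t_i."""
    for i in range(k * eta):
        for j in range(kappa):
            state = current_states[i]
            symbol = input_sequence[input_index(i, j, eta, kappa)]
            current_states[i] = A[state * alphabet_size + symbol]


# FA with 3 states, 2 inputs, ERROR encoded as state 3 (assumed row-major layout of A).
A = [1, 0, 2, 3, 2, 0, 3, 3]
current_states = [0, 1, 2, 0, 1, 2]   # k = 2 copies of the states-vector (0, 1, 2)
input_sequence = [0, 0, 1, 0]         # w1 = (0, 0), w2 = (1, 0)
apply_kernel(A, 2, current_states, input_sequence, k=2, eta=3, kappa=2)
print(current_states)  # [2, 2, 2, 1, 3, 1]
```

In this example w1 merges all three states into state 2, while w2 drives the second state to ERROR because input 1 is undefined there.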
The algorithm evaluates the outcome of the Apply kernel through the Test kernel. The Test kernel
is called within InnerLoop, which iterates over input sequences w1, w2, . . . , wk. At each call, the Test kernel
receives CurrentStates and index (indicating the current index of the input sequence windex). It also receives an
integer vector (bulk) of n elements with all elements set to 0.
Note that the CurrentStates vector holds kη elements (in other words, k states-vectors). For input
sequence windex, threads that execute the Test kernel should evaluate the index th states-vector, i.e. all the
elements in the interval [η(index − 1) + 1, η(index − 1) + η] of the CurrentStates vector. To allow this,
we launch the Test kernel with η threads. During the execution of the Test kernel, thread i first reads the
value (ς) that resides in the (η(index − 1) + 1 + i)th element of the CurrentStates vector and then performs an
(atomic) increment on the ςth element of the bulk vector. If ς = ERROR, then the Test kernel writes 2n
to the first element of the bulk vector.
When the Test kernel returns, we apply a generic parallel-MAX function and obtain the maximum value
in the bulk vector. Note that the maximum value of the bulk vector corresponds to the maximum number
of states that are merged/collapsed by advancing the Vmin vector with input sequence windex. Hence, if the
maximum value returned by the parallel-MAX function equals η, then we append windex to R and terminate;
otherwise, if the maximum value is larger than U(Vmin), then we copy windex to wmin and parallel copy the
index th states-vector of Ck to Vmin. If the maximum value is equal to or larger than 2n, we deduce that input
sequence windex is not defined for Vmin and therefore the algorithm jumps to input sequence windex+1 and
continues.
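The decision taken after the parallel-MAX step can be summarized with the following Python sketch. The function name `evaluate_bulk`, the returned labels, and the parameter `u_vmin` (standing for the value U(Vmin)) are assumptions of this illustration. The sketch checks the 2n marker first, so that an undefined sequence is never mistaken for an improvement.

```python
def evaluate_bulk(bulk, eta, n, u_vmin):
    """Classify input sequence w_index from the maximum of its bulk vector."""
    maximum = max(bulk)               # stands in for the parallel-MAX function
    if maximum >= 2 * n:
        return "undefined"            # skip to the next input sequence
    if maximum == eta:
        return "reset"                # all eta states merged: append w_index to R and stop
    if maximum > u_vmin:
        return "improved"             # copy w_index to w_min and the states-vector to V_min
    return "no_change"


print(evaluate_bulk([0, 0, 3], eta=3, n=3, u_vmin=1))  # reset
print(evaluate_bulk([6, 2, 0], eta=3, n=3, u_vmin=1))  # undefined
print(evaluate_bulk([0, 2, 1], eta=3, n=3, u_vmin=1))  # improved
```

Since η is at most n, the marker value 2n can never be confused with a genuine count of merged states.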


