Abstract: This paper describes an efficient parallel algorithm that uses many-core GPUs for automatically deriving Unique Input Output sequences (UIOs) from Finite State Machines. The proposed algorithm uses the global scope of the GPU's global memory through coalesced memory access and minimises the transfer between CPU and GPU memory. The results of experiments indicate that the proposed method performs considerably better than a single core UIO construction algorithm. Our algorithm is scalable and when multiple GPUs are added into the system the approach can handle FSMs whose size is larger than the memory available on a single GPU.
INTRODUCTION
Software testing is an important part of the software development process but is typically expensive, manual and error prone. This has led to significant interest in automation and one of the most promising approaches is model-based testing (MBT) in which test automation is based on a model of the system under test (SUT) or some aspect of the SUT. Many MBT methods base test automation on a finite state machine (FSM), with this line of work going back to Moore's seminal paper [1].
Many FSM-based test generation methods check that the transitions of the FSM specification M have been implemented correctly. In order to check a transition it is necessary to have some method that checks that the state of the SUT, after input x in state s, is the expected state s'. This is typically achieved by using input sequences that distinguish the states of M. Ideally, one has a distinguishing sequence (an input sequence that distinguishes all of the states of M) and early work by Hennie showed how a test sequence can be automatically derived when there is a known distinguishing sequence [2]. However, an FSM need not have a distinguishing sequence and instead one might use a unique input output sequence (UIO) for a state s': an input sequence that distinguishes s' from all other states of M but need not distinguish any other pairs of states of M. Although not all FSMs have a UIO for each state, it has been reported that in practice most FSMs do have such UIOs [3] and this has led to the development of many FSM-based test generation methods that use UIOs [3], [4], [5], [6], [7], [8], [9], [10], [11]. However, it is known that the problem of checking the existence of a UIO is PSPACE-hard, so one cannot expect to find polynomial time algorithms that construct UIOs, and there is no polynomial upper bound on UIO length. Since the length of a UIO sequence can be exponential, the duration and hence the cost of deriving such sequences from large FSMs can be very high. This has led to interest in methods that generate UIOs relatively efficiently [12], [13], [14], [15] but ultimately we cannot get away from the worst case complexity. Previous approaches for deriving UIOs have developed sequential algorithms that operate on a single thread and have not used Graphics Processing Units (GPUs) despite the increasing interest in GPUs.
Recently, with the publication of the Compute Unified Device Architecture (CUDA) development toolkit that allows GPU programming in a C-like language, the use of GPUs has been extended to a range of application domains [16], [17], [18], [19], [20].
In this paper, we address the scalability problem that can arise while constructing UIOs for large completely-specified FSMs through the use of massively parallel GPU technology. The work was motivated by the fact that GPUs have become an important tool in large scale applications in which massively parallel processing is needed. As far as we are aware, this problem has not previously been explored. One of the reasons for this may be that there is a need to model the UIO generation problem in a manner that is suitable for GPUs and this is not straightforward; previous algorithms for constructing UIOs have used data structures that are not suitable for GPU computing. As previously noted [21], the biggest difference between CPUs and GPUs can be considered to be that much of the physical space of a GPU is reserved for computing units rather than memory. This space is divided into many relatively simple cores, with the instruction set of a core being much smaller than that of a standard CPU. Despite their limited instruction sets, the performance of GPUs makes them highly effective when used to solve certain types of problems. In addition, the demand for high performance graphics (due to computer gaming, high resolution image processing and big data research) led to increasing parallelism. In fact, new massively parallel computing processors such as NVIDIA's Tesla K40 already have 2,880 cores with a core clock speed of 745 MHz, 288 GB/sec memory bandwidth, and 12 GB of memory.
The constraints imposed by GPU computing led to us devising an algorithm that is entirely new with the exception of the fact that it utilises the 'Unique Predecessor' approach devised by Naik [15] . Otherwise the proposed algorithm is unique since all the existing brute force approaches in the FSM based literature construct a 'UIO Tree' and this is not appropriate due to the space/time limitations (as shown by the experiments). There were several additional challenges related to memory management. These challenges include the need to efficiently distribute data processing between the CPU and GPU, thread synchronisation, optimisation of data transfer, and the capacity constraints of GPU memory.
This paper proposes a massively parallel UIO (P-UIO) generation algorithm that addresses these problems in the context of deriving UIOs for a deterministic completely specified FSM. The P-UIO algorithm was evaluated against Naik's algorithm using randomly generated FSMs with up to 1,048,576 states. In the experiments the proposed algorithm constructed UIOs significantly faster (by a factor of 11,000 on average) and for much larger FSMs (by a factor of 512). For example, the P-UIO algorithm was able to handle FSMs with 1,048,576 states in under 2 seconds on average while the implementation of Naik's algorithm took 1,231 seconds on average for FSMs with 2,048 states. We also performed experiments on some much smaller benchmark FSMs, with between 4 and 48 states. The performance of the two algorithms was similar for these FSMs but the P-UIO algorithm took less time to find UIOs for the larger benchmark FSMs. Unsurprisingly, the differences in performance were less significant for the (much smaller) benchmark FSMs.
This paper is organised as follows. Section 2 introduces the terminology used and reviews previously devised UIO generation techniques. Section 3 provides an overview of the proposed parallel UIO algorithm and describes the high-level design. Section 4 provides the low-level design. In Section 5 we describe the experiments designed to evaluate the proposed UIO construction algorithm and the results of these experiments. Finally, in Section 6 we provide concluding remarks and discuss possible lines of future work.
PRELIMINARIES

Finite State Machines (FSMs)
A finite state machine M is defined by a tuple $(S, X, Y, \delta, \lambda)$ where $S$ is a finite set of states, $X = \{x_1, x_2, \ldots, x_p\}$ is a finite set of inputs, $Y = \{o_1, o_2, \ldots, o_r\}$ is a finite set of outputs, $\delta$ is the transition function of type $\delta : S \times X \to S$ and $\lambda$ is the output function of type $\lambda : S \times X \to Y$. The functions $\delta$ and $\lambda$ are total functions and so M is completely-specified. If FSM M is in state $s \in S$ and input $x \in X$ is applied then M moves to the state $s' = \delta(s, x)$ and produces output $o = \lambda(s, x)$. Such a transition will be denoted $t = (s, x/o, s')$ and we say that $x/o$ is the label of $t$ ($label(t)$), $s$ is the start state of $t$ ($start(t)$), and $s'$ is the end state of $t$ ($end(t)$). Note that sometimes the definition of an FSM includes an initial state; we do not include this since we are interested in distinguishing states of an FSM and so do not require there to be an initial state.
We use juxtaposition to denote concatenation: if $x_1$, $x_2$, and $x_3$ are inputs then $x_1 x_2 x_3$ is an input sequence. Given a set $X$ we let $X^*$ denote the set of finite sequences of elements of $X$ and let $X^k$ denote the set of sequences in $X^*$ that have length $k$. The symbol $\varepsilon$ is used to denote the empty sequence.
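The definitions above can be illustrated with a minimal Python sketch. The three-state machine below is a hypothetical example (it is not $M_1$ from Fig. 1), and the names `delta`, `lam` and `run` are chosen here purely for illustration.

```python
# Hypothetical completely-specified FSM (not M_1 from Fig. 1).
# Both dictionaries are total: every (state, input) pair has an entry.
STATES = ["s1", "s2", "s3"]
INPUTS = ["a", "b"]

delta = {  # transition function: S x X -> S
    ("s1", "a"): "s2", ("s1", "b"): "s1",
    ("s2", "a"): "s3", ("s2", "b"): "s1",
    ("s3", "a"): "s1", ("s3", "b"): "s3",
}
lam = {  # output function: S x X -> Y
    ("s1", "a"): 0, ("s1", "b"): 1,
    ("s2", "a"): 0, ("s2", "b"): 0,
    ("s3", "a"): 1, ("s3", "b"): 1,
}


def run(state, inputs):
    """Apply an input sequence from `state`.

    Returns the end state and the tuple of outputs produced.
    The empty sequence leaves the state unchanged and produces no output.
    """
    outputs = []
    for x in inputs:
        outputs.append(lam[(state, x)])
        state = delta[(state, x)]
    return state, tuple(outputs)
```

For instance, `run("s1", "ab")` follows the transitions labelled a/0 and b/0 and ends back in `s1`.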
An input/output sequence consists of a sequence of input/output pairs of the form $x/o$.
In this work, we consider only deterministic, completely-specified, minimal FSMs. An FSM can be minimised in polynomial time [22]. Further, an FSM that is not completely-specified can often be completed by adding either an error state or transitions with null output.¹ Thus, the main restriction is that we only consider deterministic FSMs. While non-determinism can be a useful abstraction technique, and some classes of systems are non-deterministic, the main focus of FSM-based testing work has been on deterministic FSMs and these have been found to be sufficient in important application domains such as hardware [24], protocol conformance testing [5], [25], [26], [27], [28], [29], object-oriented systems [30], web services [31], [32], [33], [34], and general software [35].
An FSM M can be represented by using a directed graph G where the vertices of G correspond to the states of M and the edges of G correspond to the transitions of M. An FSM M is strongly connected if the corresponding directed graph G is strongly connected (for any ordered pair $(v, v')$ of vertices there is a path from $v$ to $v'$). In Fig. 1 an example FSM $M_1$ is given, where $S = \{s_1, s_2, s_3, s_4\}$, $X = \{x_1, x_2\}$, and $Y = \{o_1, o_2\}$. It is straightforward to see that this FSM is strongly connected. Note that $M_1$ is a minimal machine, since the input sequence $x_1 x_2 x_1 x_1 x_2 x_1$ is a splitting sequence for every pair of different states.
For a given FSM, state verification can be achieved by using a distinguishing sequence (DS), unique input/output sequences (UIOs), a characterising set (CS), or state identifiers. A distinguishing sequence is an input sequence $\bar{x}$ such that for every pair $(s, s')$ of distinct states of FSM M, M produces different output sequences in response to $\bar{x}$ from $s$ and $s'$ ($\lambda(s, \bar{x}) \neq \lambda(s', \bar{x})$). Unfortunately, not all FSMs have a DS. An alternative is to use an adaptive distinguishing sequence (ADS), which is an adaptive process that determines the next input to apply on the basis of the output observed. Since ADSs generalise DSs, an FSM that has a DS also has an ADS and an additional benefit is that it is possible to determine whether an FSM has an ADS in low-order polynomial time [36].
¹ As has been previously noted, it is not always possible to complete an FSM since, for example, unspecified input may correspond to input that should not occur [23].
A UIO for a state s is an input sequence that distinguishes s from all other states of M. There may be value in using UIOs even when an FSM has a DS since the UIOs may be shorter than the DS. For example, state $s_1$ of $M_1$ has a UIO ($x_1/o_2$) of length 1 but it is straightforward to check that $M_1$ does not have a DS of length 1. It has also been found that in practice many FSMs have UIOs for all states [3].
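As a rough illustration of what it means for an input sequence to be a UIO, the following Python sketch checks the property directly and searches for a shortest UIO by brute force. It is not the algorithm proposed in this paper: the machine is a hypothetical three-state FSM (not $M_1$), and `is_uio` and `shortest_uio` are names introduced here for illustration only.

```python
from itertools import product

# Hypothetical completely-specified FSM (not M_1 from Fig. 1).
STATES = ["s1", "s2", "s3"]
INPUTS = ["a", "b"]
delta = {
    ("s1", "a"): "s2", ("s1", "b"): "s1",
    ("s2", "a"): "s3", ("s2", "b"): "s1",
    ("s3", "a"): "s1", ("s3", "b"): "s3",
}
lam = {
    ("s1", "a"): 0, ("s1", "b"): 1,
    ("s2", "a"): 0, ("s2", "b"): 0,
    ("s3", "a"): 1, ("s3", "b"): 1,
}


def outputs(state, inputs):
    """Output sequence produced when `inputs` is applied from `state`."""
    out = []
    for x in inputs:
        out.append(lam[(state, x)])
        state = delta[(state, x)]
    return tuple(out)


def is_uio(state, inputs):
    """True if no other state produces the same output sequence as `state`."""
    target = outputs(state, inputs)
    return all(outputs(other, inputs) != target
               for other in STATES if other != state)


def shortest_uio(state, max_len):
    """Exhaustive search over input sequences of length 1..max_len.

    Exponential in max_len; returns None if no UIO is found within the bound.
    """
    for k in range(1, max_len + 1):
        for seq in product(INPUTS, repeat=k):
            if is_uio(state, seq):
                return seq
    return None
```

In this hypothetical machine neither single input separates all three states, so there is no DS of length 1, yet two of the states have UIOs of length 1. The exhaustive search also makes the exponential worst case visible: the number of candidate sequences grows as the number of inputs raised to the length bound.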
A characterising set is a set W of input sequences that can distinguish any pair of states. A minimal FSM with n states has a CS with at most $n - 1$ sequences of length at most $n - 1$ and such a CS can be found in polynomial time. If every sequence in W is executed from state s, the set of output sequences identifies/verifies s. However, the use of characterising sets could lead to long test sequences [37]. In addition, most test generation techniques that use a characterising set return many test sequences and it has been noted that the process of resetting a system between test sequences can be expensive [38], [39], [40], [41]. As a result, it is desirable to use a DS or UIOs where they exist (and are sufficiently short) and use a characterising set otherwise. Since the size of a characterising set is of $O(n^2)$, it has been suggested that if a UIO or DS is longer than an $O(n^2)$ upper bound then it might be best to use a characterising set. This has led to the suggestion that one might initially attempt to find a DS or UIOs but only use a DS/UIOs if the length is lower than the given bound. There has been much interest in UIOs since they help in state transition fault detection and have been found to yield shorter test sequences than DSs and CSs [42], [43].
It has been observed that it may not be necessary to use all of the sequences from a characterising set in order to identify a state s of the FSM [44] . This has led to the notion of state identifiers, where a state identifier (or separating set) for state s is a set of input sequences that distinguish s from all other states of the FSM M. The use of state identifiers can lead to smaller test suites, when compared to characterising sets.
Previous UIO Generation Methods and Inference Rules
Since UIOs have been used in automated FSM-based test generation methods, there has been interest in the problem of devising UIOs. Although the problem of checking the existence of UIOs is PSPACE-Complete [36] , the value of UIOs in test generation has led to significant interest in UIO generation.
It is possible to represent UIO generation in terms of a UIO tree (Definition 2.1) [15].
Definition 2.1. Let M be an FSM with set of states $S$ ($|S| = n$). A Unique Input Output Tree for $S$ is a rooted tree $T$ such that nodes are labeled with two groups (initial group $I$ and current group $C$) and edges are labeled with input/output pairs. A node of $T$ is called a leaf node if its groups have cardinality one. If two distinct edges leaving a node $v$ share the same input label then these edges have different output labels. For every node $v$ of $T$, if $\bar{x}/\bar{o}$ is the input/output sequence formed by concatenating the edge labels on the path from the root node to $v$, then we have that $I = \{s \in S \mid \lambda(s, \bar{x}) = \bar{o}\}$ and $C = \{s' \in S \mid \exists s \in I.\ s' = \delta(s, \bar{x})\}$. A UIO tree is complete if for every state $s_i$ there exists a leaf node $v$ with initial set $I = \{s_i\}$.
An example UIO tree for $M_1$ is given in Fig. 2. As the upper bound on UIO length is exponential, the process of constructing UIO trees from scratch can be expensive. This led to Naik [15] proposing an approach to construct UIOs in which inference rules are used. In this method some minimal length UIOs are found and these UIOs are used to deduce UIOs for other states. The inference rules operate as follows: if $\bar{x}/\bar{o}$ is a UIO for state $s$ and $t = (s', x/o, s)$ is the only transition that reaches state $s$ with label $x/o$, then $x\bar{x}/o\bar{o}$ is a UIO for $s'$. Here state $s'$ is known as a unique predecessor of state $s$. Note that the notion of 'unique' here is for the state and input/output; there might be more than one unique predecessor for a state $s$. Since a machine M is assumed to be strongly connected, it may be possible to use inference rules to compute UIO sequences for all states once we have constructed only a few UIOs. In order to achieve this we have a database, called a rule base, containing the known inference rules for the FSM. In the following, we formalise what it means for a state to be a unique predecessor.
[...] Fig. 3b. Let us assume that by using a UIO tree we can construct a UIO for state $s_1$. Then by using the rule base table, we can derive UIOs for all of the states of the FSM (Fig. 3c).
If an FSM does not possess UIOs for all of its states, the inference rules do not solve the problem. Naik's algorithm then constructs the full UIO tree, which requires exponential time/space.
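The unique predecessor rule can be sketched in Python as follows. This is an illustrative reading of the rule rather than Naik's implementation: the FSM is a hypothetical three-state machine, and the rule base layout and the function names `unique_predecessors` and `infer_uios` are assumptions made for this sketch.

```python
from collections import deque

# Hypothetical completely-specified FSM (not M_1 from Fig. 1).
STATES = ["s1", "s2", "s3"]
INPUTS = ["a", "b"]
delta = {
    ("s1", "a"): "s2", ("s1", "b"): "s1",
    ("s2", "a"): "s3", ("s2", "b"): "s1",
    ("s3", "a"): "s1", ("s3", "b"): "s3",
}
lam = {
    ("s1", "a"): 0, ("s1", "b"): 1,
    ("s2", "a"): 0, ("s2", "b"): 0,
    ("s3", "a"): 1, ("s3", "b"): 1,
}


def unique_predecessors(states, inputs, delta, lam):
    """Build a rule base: state s -> list of (s_prev, x, o) such that
    (s_prev, x/o, s) is the only transition reaching s with label x/o."""
    incoming = {}
    for s in states:
        for x in inputs:
            key = (delta[(s, x)], x, lam[(s, x)])
            incoming.setdefault(key, []).append(s)
    rules = {s: [] for s in states}
    for (target, x, o), sources in incoming.items():
        if len(sources) == 1:
            rules[target].append((sources[0], x, o))
    return rules


def infer_uios(known, rules):
    """Propagate known UIOs backwards through unique predecessors.

    `known` maps a state to (input sequence, output sequence), both tuples.
    """
    uios = dict(known)
    queue = deque(known)
    while queue:
        s = queue.popleft()
        xs, outs = uios[s]
        for s_prev, x, o in rules[s]:
            if s_prev not in uios:
                # Prefix the UIO of s with the label x/o of the unique transition.
                uios[s_prev] = ((x,) + xs, (o,) + outs)
                queue.append(s_prev)
    return uios
```

Starting from the single UIO a/1 for `s3`, calling `infer_uios({"s3": (("a",), (1,))}, unique_predecessors(STATES, INPUTS, delta, lam))` yields UIOs for the other two states of this machine without building a tree. The inferred sequences need not be of minimal length.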
The CUDA Programming Model
Compute Unified Device Architecture is NVIDIA's parallel computing architecture that combines software and hardware architectures. We first present an overview of the CUDA hardware model.
At the hardware level, a CUDA capable GPU processor is a collection of multiprocessors (SMX), each having a number of processors. Each multiprocessor has its own shared memory which is common to all its processors. It also has a set of 32-bit (or 64-bit depending on the card) registers and caches for texture memory (a read only memory for the GPU) and constant memory (a read only memory for the GPU that has the lowest access latency). In any given cycle, each processor in the multiprocessor executes the same instruction on different data and so a multiprocessor is a single instruction multiple data (SIMD) processor. Communication between multiprocessors can be achieved through the global device memory, which is available to all the processors in all multiprocessors.
From the programmer's point of view the CUDA model is a collection of threads running in parallel, with a collection of threads, called a warp, running simultaneously on a multiprocessor. The warp size can vary according to the GPU. The programmer decides on the number of threads to be executed. If the number of threads is more than the warp size then these threads are time-shared internally on the multiprocessor. At a given time, a block of threads runs on a multiprocessor. The maximum number of threads in a block can vary according to the underlying GPU. However, multiple blocks can be assigned to a single multiprocessor and their execution is again time-shared. The collection of blocks for a single program is called a grid and the maximum number of grids can vary according to the GPU.
All threads of all blocks executing on a single multiprocessor divide its resources equally amongst themselves. Each thread executes a piece of code called a kernel. The kernel is the core code to be executed on a multiprocessor. Upon execution, thread $t_i$ is given a unique ID and during execution thread $t_i$ can access data residing in the GPU by using its ID. Since the GPU memory is available to all the threads, a thread can access any memory location. This allows programmers to interpret a device as a Parallel Random Access Machine (PRAM) architecture through the usage of global device memory. However, the performance improves with the use of shared memory (which can only be accessed by threads within a block), as such memory can be accessed faster than the global device memory. During GPU computation the CPU can continue to operate. Therefore the CUDA programming model is a hybrid computing model in which a GPU is referred to as a co-processor (device) for the CPU (host).
HIGH-LEVEL DESIGN
In this section we present the high-level design of the proposed massively parallel (P-UIO) algorithm for deriving UIOs from FSMs. We start by discussing the design decisions made in developing the P-UIO algorithm and then provide an overview of the P-UIO algorithm. In Section 4 we provide additional information regarding the data structures used.
Parallel Design: From UIO-Tree to Sorting
In the proposed parallel algorithm, we aimed to address several bottlenecks that we may encounter while using naïve UIO tree construction algorithms.
1) Sequential process: A naïve sequential UIO tree generation algorithm iterates over a UIO tree T and would process this tree node-by-node. Here each node is associated with a group and the same input is applied to each state in a group (since these states have yet to be distinguished) (Fig. 4).
2) Memory requirements: During UIO tree computation, all portions of the UIO tree would be kept in memory since it includes information about how the states are split (Fig. 5a) and it makes it possible to back-track when required (Fig. 5b).
In developing a massively parallel approach, one might construct a UIO-tree and choose to have a thread $t_i$ process a single node of a UIO-tree. The thread $t_i$ would process all of the data associated with a node [12], [15] (Fig. 7a). However, a node can have many current and initial states and for an FSM with n states, a node is associated with data whose size is of $O(n)$. Although the maximum number of states associated with a node reduces as the depth of the tree increases, the rate at which this happens will vary between FSMs (Fig. 7b). As a result, an approach that directly represents the UIO tree may not scale well for very large FSMs.
These are crucial obstacles in designing a scalable UIO generation algorithm. In order to ease these issues, we need to devise a scalable alternative approach which demands less memory, can be parallelised, and can be used to derive UIO sequences. We now explain how this can be achieved.
Consider Fig. 6a, which gives output sequences produced by the n states of an FSM in response to an input sequence $\bar{x}$. Input sequence $\bar{x}$ uniquely distinguishes a state if and only if the output sequence produced by this state is unique. Thus, we can use an approach that finds columns in which the output sequences are unique. We can re-formalise this problem as follows: we are given a set of sequences of the same length and want to find the sequences that occur only once. This problem can be solved by sorting the output sequences: after sorting, if the column being considered is different from its neighbouring columns then it is unique (Fig. 6b). A benefit of this is that sorting can be parallelised and can be efficiently performed by GPUs [19]. Thus, in the P-UIO algorithm we used an approach based on sorting in order to determine which states have been distinguished. In implementing the parallel UIO algorithm we used stable merge sort.
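The sort-then-compare-neighbours idea can be sketched sequentially in Python. Here Python's built-in stable sort stands in for the parallel stable merge sort used on the GPU, and `distinguished_states` is a name introduced only for this illustration.

```python
def distinguished_states(output_seqs):
    """output_seqs[i] is the output sequence produced from state i.

    Returns the indices of the states whose output sequence is unique,
    found by sorting and comparing each entry with its two neighbours.
    """
    # Sort state indices by output sequence (Python's sort is stable).
    order = sorted(range(len(output_seqs)), key=lambda i: output_seqs[i])
    unique = []
    for pos, i in enumerate(order):
        same_left = pos > 0 and output_seqs[order[pos - 1]] == output_seqs[i]
        same_right = (pos + 1 < len(order)
                      and output_seqs[order[pos + 1]] == output_seqs[i])
        if not same_left and not same_right:
            unique.append(i)
    return sorted(unique)
```

After the sort, each position only needs to inspect its immediate neighbours, which is what makes the check easy to distribute over independent threads.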
In order to be able to use sorting we need to introduce an alternative formalisation for constructing UIO sequences. Rather than represent the problem in terms of UIO-trees, we use what we call input output vectors; later we will see how we can base a scalable parallel UIO generation algorithm on this formalisation. 
HIERONS AND TÜRKER: PARALLEL ALGORITHMS FOR TESTING FINITE STATE MACHINES: GENERATING UIO SEQUENCES
An IO-vector is homogeneous if every pair of elements that have the same output sequence also have the same current state.
We are interested in whether an IO-vector is homogeneous since there is no value in extending such an IO-vector with further input: if two initial states s and s' have not been distinguished in this IO-vector (they have the same output sequences) then they are mapped to the same current state and so cannot be distinguished by further input. Thus, whenever the search for UIOs finds a homogeneous IO-vector it will back-track.
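The back-tracking test described above can be sketched as follows, reading the condition as "elements that share an output sequence also share a current state". The element layout (a triple of initial state, current state and output sequence) and the function name are assumptions made for this sketch.

```python
def is_homogeneous(elements):
    """elements: iterable of (initial_state, current_state, output_seq).

    Returns True if all elements with the same output sequence are in the
    same current state, so further input cannot separate them.
    """
    current_for_output = {}
    for _initial, current, out in elements:
        # Remember the first current state seen for this output sequence
        # and compare every later element against it.
        if current_for_output.setdefault(out, current) != current:
            return False
    return True
```

In the first vector below, `s1` and `s2` have merged into the same current state and cannot be separated by further input, so a search would back-track; in the second they are still in different states and extending the vector may yet distinguish them.

```python
merged = [("s1", "s3", (0,)), ("s2", "s3", (0,)), ("s3", "s1", (1,))]
still_open = [("s1", "s2", (0,)), ("s2", "s3", (0,)), ("s3", "s1", (1,))]
print(is_homogeneous(merged), is_homogeneous(still_open))
```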
In UIO generation, we will 'evolve' the elements of an IO-vector and will do so in a manner that is consistent with the notion of a UIO. We sort the output sequences to determine whether the states corresponding to two elements have been distinguished: two elements share the same output sequence if they have not been distinguished. In each iteration, for each output sequence that appears in the current IO-vector the algorithm chooses a next input to use. Let [...]
Note that an IO-vector may not evolve into a UIO-vector. The reason for this is that in evolving an IO-vector the same input is applied to all elements that have the same output sequence. The problem here is that, for example, an FSM might have UIOs for all states but have a pair of states $s, s'$ such that the UIOs for $s$ and $s'$ start with different inputs: such a scenario cannot be captured by an IO-vector. Consequently, in order to construct UIOs one may need to construct a set of IO-vectors.
A set $\mathcal{V}$ of IO-vectors is a full set for FSM M with state set $S$ if for all $s \in S$, there exists an IO-vector $V' \in \mathcal{V}$ that has an element $v$ whose initial state is $s$ and whose input sequence $X(v)$ is such that $X(v)/O(v)$ is a UIO for $s$.
The following is an immediate consequence of the definition of a full set.
Importantly, an element of an IO-vector contains all information related to the evolution from a single state including the input/output sequence, initial and current states. This representation allows us to have a one-to-one correspondence between threads of the GPU and the elements of an IO-vector (Fig. 8), overcoming the issue we had with UIO-trees where if a thread $t_i$ processes a node then $t_i$ considers $O(n)$ states.
However, note that if we insist on keeping input/output sequences within elements then each thread will process a whole input/output sequence during every iteration. As a result, the memory used by a single thread will increase and this may reduce the scalability of the algorithm. In order to avoid this, we devised an approach in which the input sequence associated with an element is kept elsewhere (not in the element). This is not problematic for inputs: when evolving an IO-vector we do not need to know about the previous inputs. However, we need to determine which elements of an IO-vector must be evolved using the same input. This suggests that threads should consider all the output sequences observed so far.
We addressed this problem as follows. Instead of keeping/sorting all output sequences observed, each element will keep a unique representation of an output sequence, the aim being to reduce the amount of data stored. In order to achieve this, an output sequence $\bar{o}$ is represented by an enumeration ($enum(\bar{o})$) that assigns a unique representation (number) to $\bar{o}$. For example, if we reach a point where only two output sequences have been observed then we could simply use the numbers 0 and 1. In the next section we describe how enumeration was done.
We will see that one important property of enumeration is that the equality relation over strings is preserved. We can thus use the enumeration function to reduce the size of the information stored regarding the output sequences observed. It will be possible to define an enumeration function such that the size of the representation of the output sequence will be no larger than $\log(n)$ since it takes $\log(n)$ space to represent the integers from 0 to $n - 1$ and there can be at most n different output sequences. Further, this information will allow the algorithm to determine which states have not been distinguished in constructing an IO-vector (so must be followed by the same input) using space of size no more than $\log(n)$ (Fig. 9). As a result, the proposed approach will satisfy our requirements regarding GPU memory.
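One way to realise such an enumeration is sketched below. It is an assumption about how the idea could be implemented, not a description of the authors' code: each element keeps only its previous integer, and the new integer is the rank of the pair (previous integer, newly observed output) among the distinct pairs. The function name `update_enumeration` is hypothetical.

```python
def update_enumeration(prev_enum, new_output):
    """Re-enumerate output sequences after one more step.

    prev_enum[i]  : integer standing for the output sequence seen so far
                    from initial state i
    new_output[i] : output(s) observed from that state in the latest step
    Returns integers in 0..n-1 such that two elements receive the same
    integer exactly when their full output sequences are equal.
    """
    pairs = list(zip(prev_enum, new_output))
    # Rank the distinct pairs; equal pairs share a rank.
    rank = {pair: r for r, pair in enumerate(sorted(set(pairs)))}
    return [rank[p] for p in pairs]
```

Because equal prefixes already share an integer, comparing the pair is equivalent to comparing the whole output sequence, so the full sequence never has to be stored per element. With at most n distinct pairs the ranks stay below n.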
An Overview of the P-UIO Algorithm
In this section, we provide an overview of the P-UIO algorithm.
The P-UIO algorithm receives an FSM and positive integers $d$ and $\ell$ and computes UIOs. The loop iterates until either (1) UIOs have been found for all states or (2) the algorithm cannot back-track. The overall algorithm is like a depth-first search that, in an iteration of the main loop, increases the length of the input sequence being considered by $d$. We could simply increase the input sequence length by 1 in each iteration; there would then be no need to introduce the parameter $d$. However, as we will see later, this choice could lead to more frequent (relatively slow) memory transfers between the GPU and CPU to store the current data (for back-tracking).
If the overall depth of the process is greater than or equal to a bound $\ell$ at the end of an iteration of the main loop then the algorithm back-tracks and stores the elements that are associated with UIOs (those with a unique enumeration of an output sequence). Therefore, the algorithm stores UIOs for M when they are found.
The algorithm can return UIOs of length greater than $\ell$ if $d$ is not a factor of $\ell$: if $kd < \ell \le (k + 1)d$ for integer $k$ then the main loop back-tracks if the overall depth reaches $(k + 1)d$. In addition, longer UIOs can be returned if the inference rules are used. The algorithm has the following phases in every iteration of the main loop. Fig. 10 describes the approach. For a given state $s \in S$ we may need to consider every possible input sequence whose length is below the upper-bound $\ell$ and so the P-UIO algorithm is an exponential algorithm.
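The remark about the step size not dividing the bound can be checked with a small calculation. The sketch assumes that the depth starts at 0 and grows by exactly the step size in each iteration, so the first depth at which the bound is met is the bound rounded up to a multiple of the step; `backtrack_depth` is a name used only here.

```python
def backtrack_depth(bound, step):
    """Smallest multiple of `step` that is >= `bound`.

    Assumes the search depth starts at 0 and increases by `step`
    in every iteration of the main loop.
    """
    k = -(-bound // step)  # ceiling division using integers only
    return k * step
```

For example, with a bound of 10 and a step of 4 the depths visited are 4, 8 and 12, so sequences of length up to 12 may be examined before back-tracking, whereas with a bound of 12 the search stops exactly at the bound.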
The following is immediate from the definition of the termination condition of the algorithm. Note that this is not an 'if and only if' result since the use of $d$ and inference rules allows the P-UIO algorithm to return UIOs of length greater than $\ell$.
LOW-LEVEL DESIGN
In the previous section we provided a high-level overview of the proposed algorithm. However, there is a need to map this to a structure that can be implemented using GPUs. In this section we give these low-level design details of the P-UIO algorithm. Although the P-UIO algorithm consists of only a few steps, in order to obtain high performance from the GPU, one has to consider the following design principles:
1) Minimise global memory transactions
2) Maximise number of parallel threads per block
3) Prevent thread divergence.
Fig. 9. An illustration for enumeration. Each string is compacted to another string. Note that the enum function produces the same values when given the same input string (red coloured text).
Fig. 10. Overview of the steps taken in one iteration of the P-UIO algorithm: the algorithm keeps evolving elements through the IDP. Before each iteration the algorithm stores the current data and chooses inputs to be applied. After each IDP, the algorithm checks for distinguished elements, applies inference rules, updates enumerations, and decides to continue or not. If the vector is homogeneous then it back-tracks.
Recall that in designing the P-UIO algorithm, our objective is to realise a one-to-one correspondence between the states of the FSM and the threads of the GPU. Therefore, we want to maximise the number of threads used in a block. However, this has implications regarding the use of shared memory on a GPU as shared memory usage limits the number of threads in a block.
Therefore, in the P-UIO algorithm we did not use shared memory but instead tried to minimise global memory transaction latency. As previously reported [45], global memory transaction latency can be hidden by (1) using many threads and (2) supporting coalesced memory access, in which the threads in a block access global memory in a manner that allows the GPU to bundle a number of memory accesses into one memory transaction. The principle of memory coalescing is similar to the cache line principle of a CPU, in which a cache line is either completely replaced or not at all. Even if only a single data item is requested, the entire line is read so whenever a neighbouring item is subsequently requested, it is already in the cache.
Recall that threads from a block are grouped into warps for execution on a CUDA core and threads within a warp must follow the same execution trajectory (otherwise we have thread divergence). That is, all threads must execute the same instruction at the same time. In order to satisfy this constraint, in the P-UIO algorithm we avoided if-else structures where possible.
We now discuss how the high-level P-UIO algorithm was refined for use with GPUs. In order to perform coalesced global memory access, in the P-UIO algorithm we use several structures to represent an IO-vector V ; these structures are to be kept in the global memory of the GPU (as opposed to the local memory of a thread) and hold the following information.
1) The D_states vector holds the relationship between initial and current states: given initial state $s_i$, D_states[i] is the corresponding current state.
2) The D_inputs vector holds the inputs that will be applied during the next step of IDP.
3) The D_outputs vector holds output data: enumerations of output sequences observed from initial states. It therefore provides information regarding which states have been distinguished (split).
4) The D_FSM vector holds the transitions of the underlying FSM.
5) D_inferenceRules holds the unique predecessor information for each state.
Table 1 summarises the memory management. The FSM and inference rules were associated with the texture memory. The vectors used to construct UIOs (such as D_inputs, D_outputs, D_states) were declared as global memory. Variables used for the computation were automatically associated with registers by the compiler. In developing the P-UIO algorithm we did not use local and shared memory.
As explained in Section 3.2, the P-UIO algorithm has a main loop that has three phases:
1) In the first phase IDP is applied to extend the depth by $d$ (Phase 1 in the high-level description of the loop, described in detail in Section 4.1);
2) In the second phase, the outcome of IDP is analysed (Phase 2 in the high-level description, described in detail in Section 4.2); and
3) In the final phase the algorithm decides whether to continue or back-track (Phase 3 in the high-level description, described in detail in Section 4.3).
The algorithm is summarised in Algorithm 1. Here lines 3-5 correspond to Phase 1 in the high-level description (Section 3.2); lines 6-9 correspond to Phase 2; and lines 10-11 correspond to Phase 3. The shading shows the steps that are carried out in parallel.
Applying the Iterative Deepening Process
Before IDP begins, id and the vectors are stored in CPU memory for back-tracking (Line 4) and then the input vector is generated by a parallel random combinator generator (PRCG) (Line 5). This procedure receives an integer value id, the number of states n, the alphabet X, the D_inputs vector, and the iterative deepening parameter d. PRCG first calls a kernel called the random number generator (RNG). RNG receives n, d, and p and returns an integer value y in the range [0, p^{nd}]. Then the algorithm checks if it can increment id; if so the PRCG calls
The next step is to update the D_outputs vector. To achieve this, we follow a procedure similar to the one used to check whether new pairs of states are split, which is explained in the following section. In summary, we apply two steps: first sort the D_outputs vector, then write integer values (starting with 0) to the elements of the D_outputs vector so that two elements of the D_outputs vector are identical if and only if they receive the same integer value. As there are at most n different possible output sequences, the number assigned to an element of the D_outputs vector is between 0 and n.
During IDP, for a state s_i a thread t_i will normally read the corresponding indexes of D_states, D_inputs, D_outputs and D_FSM many times. Although the reads and writes on the D_states and D_outputs vectors are coalesced, transactions on D_inputs are not, because the index values in the D_inputs vector depend on the data retrieved from the D_outputs vector. Moreover, since the host can write the FSM transition structure to D_FSM once and the kernels can read this FSM structure many times, the D_FSM vector is stored in the texture memory and so coalescing is not an issue. On the other hand, note that IDP does not introduce thread divergence; that is, all threads in a warp will process the same instruction of a kernel.
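The per-thread work in IDP can be illustrated with a small sequential sketch in Python, in which the loop index i stands in for thread t_i. The indexing of the input vector by j·n plus the current output enumeration is inferred from the worked example later in this paper and should be read as an assumption, as should the re-enumeration of outputs after every step, the function name `idp`, and the toy FSM `DELTA`/`LAM`. The actual GPU kernels are not reproduced here.

```python
# Hypothetical 4-state FSM with inputs {0, 1} and outputs {0, 1}.
# DELTA[s][x] is the next state, LAM[s][x] the output (illustrative only).
DELTA = [[1, 2], [2, 3], [3, 0], [0, 1]]
LAM = [[0, 0], [0, 0], [0, 1], [1, 0]]


def idp(d_states, d_outputs, d_inputs, delta, lam, d):
    """Sequentially emulate d steps of iterative deepening."""
    n = len(d_states)
    states, outputs = list(d_states), list(d_outputs)
    for j in range(d):
        pairs = []
        for i in range(n):  # index i plays the role of thread t_i
            # Assumed indexing: step j, selected by the current enumeration.
            x = d_inputs[j * n + outputs[i]]
            s = states[i]
            pairs.append((outputs[i], lam[s][x]))
            states[i] = delta[s][x]
        # Assumed update: equal (old enumeration, new output) pairs receive
        # the same integer, starting from 0.
        rank = {p: k for k, p in enumerate(sorted(set(pairs)))}
        outputs = [rank[p] for p in pairs]
    return states, outputs
```

With the input vector [0, 0, 0, 0, 1, 0, 0, 0] and d = 2, initial states 0 and 2 still share an enumeration after two steps, while initial states 1 and 3 have been separated from all others.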
Gathering Outcomes of IDP
Once IDP ends, we need to check if new pairs of states are split. This is done through a parallel stable sort (Line 7). The sorting algorithm takes the vector D_outputs and a vector (D_keys) which holds the relative order of the items of D_states (the initial state information). It then sorts the states according to D_outputs (Fig. 11; see footnote 3). The result of the sort reveals states that are distinguished from all other states (singleton states) and pairs of states that produce the same output sequences.
In order to achieve this a temporary vector (D_singletons) is used. After receiving D_keys, D_outputs and D_singletons, a thread (thread t_i) selects a single (ith) item of the D_outputs vector and compares the enumeration of the output sequence to those of the neighbouring values (the enumerations of the output sequences read from the (i+1)th and (i−1)th locations of the D_outputs vector). If the ith value is different from both of these values then the thread reads the initial state information from the D_keys vector and stores it in the D_singletons vector. In order to determine which states are split, another temporary vector called the D_groups vector is used as follows: a thread again selects a single (ith) item of the D_outputs vector and reads the index of the initial state information from the D_keys vector only if exactly one of the neighbouring output data is the same as that of the ith item of the D_outputs vector. As a result of this process, for each group of states with the same output data we have two values in the D_groups vector indicating the starting and ending indexes of the initial states in a group from the D_keys vector (Fig. 12). Note that the process of finding singletons and groups of states can cause thread divergence, which may prevent threads in a warp from executing concurrently.
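The neighbour comparison described above can be sketched sequentially in Python. This is an illustration under assumptions rather than the kernel itself: the function name `gather_outcomes` is hypothetical, Python's built-in stable sort replaces the parallel sort, and the loop index i stands in for thread t_i.

```python
def gather_outcomes(d_outputs):
    """Return (d_keys, singletons, groups) for a vector of enumerations."""
    n = len(d_outputs)
    # Stable sort of the initial-state indexes by their output enumeration.
    d_keys = sorted(range(n), key=lambda i: d_outputs[i])
    sorted_out = [d_outputs[k] for k in d_keys]

    singletons, boundaries = [], []
    for i in range(n):
        same_left = i > 0 and sorted_out[i - 1] == sorted_out[i]
        same_right = i + 1 < n and sorted_out[i + 1] == sorted_out[i]
        if not same_left and not same_right:
            # Differs from both neighbours: the initial state is a singleton.
            singletons.append(d_keys[i])
        elif same_left != same_right:
            # Exactly one equal neighbour: first or last index of a group.
            boundaries.append(i)
    # Boundaries arrive as start, end, start, end, ... in sorted order.
    groups = list(zip(boundaries[0::2], boundaries[1::2]))
    return d_keys, singletons, groups
```

For the enumerations [0, 1, 0, 2] the sketch reports initial states 1 and 3 as singletons and one group occupying positions 0 to 1 of the key vector.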
After singletons have been found, a kernel uses the inference rules to try to find UIOs for other states (Lines 8-9). In order to achieve this, the kernel receives the list of singletons found and the D_inferenceRules vector. Each thread selects one state from the D_singletons vector and finds its unique predecessors from the D_inferenceRules vector. Note that, similar to the D_FSM vector, the host can write the D_inferenceRules vector once and the kernels can read this data many times; therefore the D_inferenceRules vector is stored in texture memory and so coalesced memory access is not an issue. Moreover, since the algorithm should also consider unique predecessors of fresh states, the kernel may be called by the host until it reaches a point where no fresh states are found.
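The repeated application of inference rules to fresh states can be sketched as follows. The rule used here is an assumption made for illustration: state s is treated as a unique predecessor of t under input x when no other state reaches t with the same input and output, in which case x followed by a UIO for t is a UIO for s. The function names and the dictionary representation are hypothetical and are not the D_inferenceRules layout.

```python
from collections import defaultdict


def unique_predecessors(delta, lam):
    """Map each state t to the pairs (s, x) where s is its only x/y source."""
    sources = defaultdict(list)
    for s in range(len(delta)):
        for x in range(len(delta[s])):
            sources[(delta[s][x], x, lam[s][x])].append(s)
    rules = defaultdict(list)
    for (t, x, _y), preds in sources.items():
        if len(preds) == 1:
            rules[t].append((preds[0], x))
    return rules


def apply_inference(uios, rules):
    """Extend the known UIOs until no fresh state is found."""
    uios = dict(uios)
    fresh = list(uios)
    while fresh:
        next_fresh = []
        for t in fresh:
            for s, x in rules.get(t, []):
                if s not in uios:
                    uios[s] = [x] + uios[t]  # prefix the connecting input
                    next_fresh.append(s)
        fresh = next_fresh
    return uios
```

Starting from a single known UIO, each pass of the outer loop corresponds to one call of the kernel by the host.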
Once singletons and groups have been revealed, the algorithm assigns unique integers (beginning from 0) to singleton states and groups of states and this defines the enumeration of each corresponding output sequence (Line 10). The algorithm then updates the representations (i.e., enum(O(s))) of states. A thread t_i is assigned to a group S' and generates a unique integer value (k_{S'} = i). Then for all s ∈ S', t_i retrieves the initial state information from the D_keys vector and writes k_{S'} to D_outputs (Fig. 12).
3. Note that after the sort, the information in D_outputs[i] may not belong to initial state s_i.
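A minimal sketch of this enumeration step is given below. It assumes that singletons and groups are both supplied as inclusive (start, end) position ranges into the key vector and that the result is indexed by initial state; the function name `enumerate_blocks` is hypothetical.

```python
def enumerate_blocks(d_keys, blocks):
    """Write the block number k to every initial state of block k."""
    d_outputs = [0] * len(d_keys)
    for k, (start, end) in enumerate(blocks):  # k stands in for thread index i
        for pos in range(start, end + 1):
            d_outputs[d_keys[pos]] = k
    return d_outputs
```

Two initial states then carry the same integer exactly when they belong to the same block, which is the property the enumeration is meant to preserve.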
Checking Termination Conditions
After enumerations are computed, the algorithm checks if UIOs of some states have been found in the current level; if so, the algorithm stores the corresponding input sequences in CPU memory (Lines 11-12). Afterwards the algorithm decides what to do next. If not all input sequences of length less than ℓ have been applied and either the underlying IO-vector is homogeneous or it has reached the upper bound ℓ, then the P-UIO algorithm back-tracks. If it back-tracks, all the data (except data related to singletons) computed in the current iteration of the main loop is discarded and the previous data, which resides in CPU memory, is brought to GPU memory. The P-UIO algorithm then continues to execute. Otherwise, if a UIO has been found for every state then the algorithm ends execution. If neither of these conditions holds then the algorithm continues to execute with the current D_states and D_outputs vectors.
Note that after each IDP the algorithm stores the current data in CPU memory. As memory transactions between CPU and GPU are expensive, it is good practice to reduce the number of such transactions. As a result, it makes sense to select relatively large values of d. If we pick d = 1 then each time we increase the input sequence length by 1 we need to send data back to the CPU and this will reduce the performance of the algorithm. In the next section we report on the results of experiments that show how the value of the parameter d affects the performance of the algorithm.
Example
We now show the execution of the P-UIO algorithm using an example. Consider the FSM given in Fig. 3a. Let us suppose that M_2, d = 2 and ℓ = 5 are provided to the P-UIO algorithm as parameters. Then the algorithm first sets id = 0 and initialises the vectors (an IO-vector).
It then stores the values of id and the vectors in CPU memory. Afterwards it randomly generates an input sequence, increments id, and sets id = 1. Let us suppose that the generated D_inputs vector begins with the possible inputs for the 0th iteration.
Note that the length of D_inputs is dn = 8, since d = 2 and n = 4. The P-UIO algorithm then evolves the elements of the vectors as follows. In the 0th iteration, as all output values are 0, the Apply kernel picks the element at index 0 × 4 + 0 of the D_inputs vector (x_2) and updates the vectors accordingly.
After the second iteration (since d = 2) the algorithm moves to the next step. Now, as the initial state s_4 has a different output, the algorithm concludes that x_2 x_1 is an input sequence that distinguishes state s_4 from all other states. It then proceeds with the inference rules given in Fig. 3b and finds input sequences for the other states.
Since a UIO has been found for each state, the algorithm terminates.
EMPIRICAL STUDY
In this section we present the results of our experiments. We used an Intel Core 2 Extreme CPU (Q6850) with 8 GB RAM and 64 bit Windows Server 2008 R2 operating system. The GPU computing approach (separately) used three NVIDIA GPUs: a TESLA K40, a TESLA c2070, and a TESLA c1060. In the experiments, we evaluated the methods by investigating the average time to construct UIOs for FSMs and the average length of the UIOs constructed. For the P-UIO method, we used d = 40 as the default value. However, this value affects the performance of the algorithm and so we also performed experiments with different d values.
We used several sets of FSMs, described below. Moreover, we also set the upper bound on the length of UIOs as ℓ = n^2, where n is the number of states. This value was chosen since, as noted earlier, it is an upper bound on the sum of the lengths of the sequences in a characterisation set (assuming a sensible algorithm has been applied to generate the characterisation set). In Naik's algorithm, there is no upper bound on the length of UIOs. However we believe that this has at least two drawbacks. First, very long UIOs will typically be of little value when computing test sequences; instead we can use alternative approaches (with polynomial upper bounds on size) such as characterising sets. In addition, if the FSM does not possess UIOs for all of its states, then Naik's algorithm constructs the complete UIO-tree in the worst case. Note that some FSMs did not have UIOs of length ℓ or less for all states and these were also discarded; later we report on this.
FSMs Used in the Experiments
The FSMs in SUITE I
The FSMs in this suite were designed to investigate the performance of the methods under varying number of states. We fixed the number of inputs and outputs to be p ¼ 2 and r ¼ 2.
The FSMs in this class were generated as follows. First, for each input x and state s we randomly assigned the values of δ(s, x) and λ(s, x). After an FSM M was generated we checked its suitability as follows. We checked whether M was strongly connected and minimal. If the FSM failed one or more of these tests then we omitted this FSM and produced another. Consequently, all FSMs were strongly connected and minimal.
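The generate-and-check loop described above can be sketched in Python. This is not the FSM generation tool used in the experiments: the function names, the uniform sampling, the list-of-lists representation, and the use of partition refinement to test minimality are all assumptions made for illustration.

```python
import random


def random_fsm(n, p, r, rng):
    """Draw the next state and output of every (state, input) pair at random."""
    delta = [[rng.randrange(n) for _ in range(p)] for _ in range(n)]
    lam = [[rng.randrange(r) for _ in range(p)] for _ in range(n)]
    return delta, lam


def covers_all(adj):
    """True if every state is reachable from state 0 in the graph adj."""
    seen, stack = {0}, [0]
    while stack:
        for t in adj[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return len(seen) == len(adj)


def strongly_connected(delta):
    reverse = [[] for _ in delta]
    for s, row in enumerate(delta):
        for t in row:
            reverse[t].append(s)
    # State 0 reaches every state, and every state reaches state 0.
    return covers_all(delta) and covers_all(reverse)


def minimal(delta, lam):
    """Refine the state partition until it is stable or fully split."""
    n = len(delta)
    block = [0] * n
    while True:
        sig = [(block[s], tuple(lam[s]), tuple(block[t] for t in delta[s]))
               for s in range(n)]
        ids = {g: k for k, g in enumerate(sorted(set(sig)))}
        if len(ids) == n:
            return True   # every pair of states is distinguishable
        if len(ids) == len(set(block)):
            return False  # partition is stable but some states are equivalent
        block = [ids[g] for g in sig]


def generate(n, p, r, seed=0):
    """Discard random FSMs until one passes both checks."""
    rng = random.Random(seed)
    while True:
        delta, lam = random_fsm(n, p, r, rng)
        if strongly_connected(delta) and minimal(delta, lam):
            return delta, lam
```

A call such as `generate(64, 2, 2)` would correspond to one FSM of the smallest size in this suite, with p = 2 inputs and r = 2 outputs.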
By following this procedure we constructed 100 FSMs with n states, where n is a power of 2 and n ∈ {64, 128, ..., 524,288, 1,048,576}. In total we constructed 1,500 FSMs for the first test suite.
The FSMs in Test SUITE II
These FSMs were used to explore the effect of the size of the output alphabet. We fixed the number of states to be 1,024 and constructed 100 FSMs with each of the following sizes i/o of input/output alphabets: i/o ∈ {128/2, 128/128, 128/256}. As a result there were 300 FSMs in SUITE II.
The FSMs in Test SUITE III
While using randomly generated FSMs allowed us to perform experiments with many subjects and see how performance changes as the problem size increases, it is possible that FSMs used in practice differ from these randomly generated FSMs. We therefore complemented the experiments with case studies from the ACM/SIGDA benchmarks, which is a set of test suites (FSMs) used in workshops between 1989 and 1993 [46] . The benchmark suite has 59 FSMs, for circuits, obtained from industry.
The circuits were represented using the kiss2 file format, a standard format devised by manufacturers [46]. In this format, inputs and outputs are represented as binary numbers, and states are represented as alphanumeric characters. For example, a transition provided in kiss2 file format (s1, 11/10111000, s1) tells us that if input three is received when the FSM is in the state called s1 then there is no state change and output 184 is produced. Therefore, it is straightforward to obtain an FSM specification from a circuit design written in the kiss2 file format.
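The decoding of the example transition can be checked with a few lines of Python. The sketch assumes that a transition is written on one line with the fields in the order input, current state, next state, output, and that the input contains no don't-care symbols; the function name `decode_transition` is hypothetical.

```python
def decode_transition(line):
    """Turn one transition line into (state, input, output, next_state)."""
    inp, current, nxt, out = line.split()
    # Inputs and outputs are binary strings; read them as integers.
    return current, int(inp, 2), int(out, 2), nxt
```

For the line `11 s1 s1 10111000` this yields input 3 and output 184 with no state change, matching the example in the text.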
We used FSMs from the benchmark that were minimal and deterministic. We completed partial FSMs by introducing self loop transitions for missing transitions. Thus, for example, if there was no transition from state s with input x then a transition from s to s with input x and null output was added.
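The completion step can be sketched as follows, assuming transitions are held in a dictionary keyed by (state, input) with (next state, output) values. The function name `complete_fsm` and the use of `None` as the null output are illustrative choices, not details taken from the experiments.

```python
def complete_fsm(transitions, states, inputs, null_output=None):
    """Add a self-loop with null output for every missing (state, input)."""
    completed = dict(transitions)  # leave the partial FSM untouched
    for s in states:
        for x in inputs:
            completed.setdefault((s, x), (s, null_output))
    return completed
```

Existing transitions are kept as they are, so the completed machine agrees with the partial one wherever the latter is defined.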
Results
In order to carry out these experiments for each FSM we computed UIO sequences using (1) Naik's UIO construction algorithm (implemented as given in [15] ), and (2) the P-UIO algorithm. For a given method we constructed UIO sequences for each FSM in our pool.
Results of Experiments for FSMs in SUITE I
We present the mean timing results in Fig. 13a. As expected, when the size of the FSM grows, the time required to construct UIOs increases. We observe that Naik's approach took less than three seconds on average to generate UIOs for FSMs with 512 states. For FSMs with 1,024 states the time rises to 68.45 seconds, and for FSMs with 2,048 states the average time to construct UIOs is 1,231 seconds. Therefore we did not apply Naik's algorithm to FSMs with more than 2,048 states. These results suggest that the P-UIO algorithm can increase the scalability of Naik's algorithm by a factor of 512 (from 2,048 to 1,048,576 states). The results show that when the NVIDIA TESLA K40 card was used, UIOs for very large FSMs (FSMs with 1 million states) could be constructed in less than two seconds (1,626 msec on average). With the TESLA c2070 card the average time required increased to 3,170 msec and with the TESLA c1060 card the average time increased to 3,658 msec. In Table 2 we provide the reduction in timings. The results for SUITE I indicate that P-UIO can be 11,000 times faster than the existing UIO construction algorithm on average.
In Fig. 13b we present the distribution of time spent by the P-UIO algorithm where Sort, Inference Rules, MemCpy, and Iterative Deepening stand for the average time spent for sorting, average time spent for finding new UIOs using inference rules, average time spent for memory transactions between the CPU and the GPU and the average time spent for IDP respectively.
The results suggest that most of the time required to construct UIO sequences was spent on sorting (averages vary between 37.6 and 45.01 percent) and we also observe that the use of inference rules took 25-30 percent of the time on average, which (compared to time spent on sorting) was not expected. Therefore we investigated the effect of using inference rules by counting the number of UIOs found using inference rules and the number of UIOs found during exploration. Fig. 13d summarises the results.
The results suggest that on average at least 80 percent of the UIOs were found using inference rules. These results justify the time spent on inference rules. In addition, the average percentage of time used for back-tracking (MemCpy) and iterative deepening reduces as the size of the FSM increases. This implies that as we increase the number of states, the utilisation of the GPU increases.
Fig. 13c gives the results regarding UIO length. Note that the length of the UIOs returned by the P-UIO algorithm does not depend on the underlying card and so we present the results obtained from the TESLA K40 GPU card. The results suggest that, compared to the P-UIO algorithm, Naik's approach can find shorter UIOs (13 percent shorter on average). This result may be caused by the iterative deepening process. As the P-UIO algorithm iteratively deepens an IO-vector until it reaches depth d, it need not find the shortest UIOs. To investigate this we performed a set of experiments and repeated the tests on the P-UIO algorithm with different d values.
The results are presented in Figs. 13e and 13f. These results suggest that as we decrease the depth parameter (to d = 20), the time required to construct UIOs increases. This is because as we decrease d, the performance of the GPU reduces due to the frequent memory copy operations. However, the length of the UIO sequences reduces: when d = 20, the average difference between the length of UIO sequences constructed by the Naik and the P-UIO algorithms reduces to 6 percent. On the other hand, as we increase the iterative deepening parameter to d = 80, the time required to construct UIOs again increases. This may be due to the fact that as we increase d, we also increase the amount of data that is sorted after IDP. Moreover, when we use d = 80 the length of the UIOs increases: Naik's algorithm generates UIOs that are 38 percent shorter on average than those of the P-UIO algorithm.
Clearly having longer UIOs increases the cost of testing and so shorter UIOs are preferable. As the results suggest, the parameter d used in the P-UIO algorithm allows a tradeoff between the quality (length) and the computation time; hence in using the P-UIO algorithm, one can adjust d to obtain shorter UIOs.
Recall that if UIOs are not found for all states then one can instead use a characterising set that contains at most n − 1 sequences of length at most n − 1, and a characterising set can be found in polynomial time.
Results of Experiments for FSMs in SUITE II
The time required to construct UIOs for FSMs in SUITE II is given in Table 3. Throughout these experiments we set d = 40. As expected, as the number of outputs increases, the time required to construct UIOs decreases. We observe that one particular reason for this is that as the number of outputs increases the length of the UIOs derived from FSMs tends to reduce (Table 4). This is to be expected: as the number of outputs increases, the algorithms (Naik, P-UIO) have more opportunities to split states, hence the length of the UIOs reduces. However, we see that the performance of the P-UIO algorithm is far better than that of Naik's algorithm (9,900 times faster on average).
Results of Experiments for FSMs in SUITE III
The results are presented in Table 5 where we set d = 40.
The times required to construct UIOs with Naik's algorithm and the P-UIO algorithm (with the Tesla K40 card) are similar for FSMs dk27, bbtas, dk17, and dk15. Moreover, for these FSMs the P-UIO algorithm is slower when the Tesla C2070 and C1060 cards were used. However, as the FSMs get larger the time required to construct UIOs with Naik's approach increases faster than that required by the P-UIO algorithm. As before, we also observe that the UIOs found are shorter when Naik's approach is used.
Threats to Validity
This section briefly reviews threats to validity and how these were reduced. We consider threats to internal validity, construct validity, and external validity. Threats to internal validity concern factors that might introduce bias. The main source of such threats is the tools used to run the experiments. The FSM generation tool has been used in a number of projects and was tested. The implementations of the two algorithms were carefully checked and also tested with a range of FSMs. To further reduce this threat, we also used an existing tool that checks if an input sequence is a UIO for the FSM. This tool was used to check all of the UIOs generated by the P-UIO algorithm and Naik's approach.
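The property that such a checking tool verifies follows directly from the definition of a UIO: the output sequence produced from the given state must differ from that produced from every other state. A minimal sketch is shown below; it is not the tool used in the experiments, and the function names and list-based FSM representation are assumptions.

```python
def run(delta, lam, state, seq):
    """Return the output sequence produced by seq from the given state."""
    outputs = []
    for x in seq:
        outputs.append(lam[state][x])
        state = delta[state][x]
    return tuple(outputs)


def is_uio(delta, lam, state, seq):
    """True if seq separates state from every other state of the FSM."""
    expected = run(delta, lam, state, seq)
    return all(run(delta, lam, other, seq) != expected
               for other in range(len(delta)) if other != state)
```

The check costs one simulation of the sequence per state, so it is cheap compared to constructing the UIOs.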
Another threat to internal validity concerns the random process employed while selecting input sequences: the order of selection may affect the performance of the algorithm. To investigate this factor, we repeated each experiment on Test SUITE III 100 times. The results are provided in Table 6. We observe that, except for the specification named planet, the variances of the timing and the length of UIOs are low; that is to say, for this set of FSMs the random input selection process has limited effect.
Threats to construct validity reflect the potential for the measurements made to not reflect properties that are of interest in practice. The main focus of our study was the time taken to generate UIOs and, as a result, the scalability of the algorithm. We want FSM-based test generation techniques that scale to large FSMs and so scalability is important. Note that FSMs are likely to be particularly large when one cannot abstract out all of the data of a model, since we then obtain a separate state for each logical state of the model combined with each possible combination of values for the model's variables. However, to reduce the scope for threats to construct validity we also recorded the mean UIO length.
Threats to external validity concern our ability to generalise from the experiments. There is always such a threat to validity since we do not know the space of relevant FSMs and certainly have no good way of sampling from this. We reduced this threat by using a combination of randomly generated FSMs and FSMs from industry that are in a benchmark. We also varied the number of outputs and states.
Discussion
Recall that in Section 3 we observed that the P-UIO algorithm is an exponential algorithm; this cannot be avoided since determining the existence of UIOs is PSPACE-hard. As the lengths of the UIO sequences generated from the FSMs in SUITE I and SUITE II are not longer than the logarithm of the number of states, it appears that we have not encountered such long executions. However, it has been reported [12] that this is usual: the lengths of UIO sequences are often no longer than the logarithm of the number of states of the FSM. Another important point is the need to select the parameter d.
The experiments revealed that when we select a value for d that is too large, the algorithm gets slower as the size of the data to be sorted increases. However, if d is too small then this may decrease the GPU occupancy and increase the traffic between the CPU memory and the GPU memory. Therefore, the parameter d should be selected carefully.
CONCLUSIONS
This paper explored the problem of constructing UIOs for very large FSMs. We proposed a new massively parallel algorithm that can construct UIOs for FSMs with millions of states. We presented the parallel design, the issues encountered, and proposed solutions for these issues. The proposed algorithm has exponential worst-case time complexity. In order to evaluate the proposed algorithm, we performed an experimental study by comparing the proposed algorithm with a well-known UIO generation algorithm and investigated both the time required to construct UIOs and the lengths of the UIOs produced. In the experiments the P-UIO algorithm was able to handle FSMs with 1,048,576 states in under 2 seconds on average while the implementation of Naik's algorithm took 1,231 seconds on average for FSMs with 2,048 states. The two algorithms had similar performance on the smallest benchmark FSMs (the benchmark FSMs were much smaller, with at most 48 states), but there was a difference in performance for the larger benchmark FSMs.
There are several possible lines of future research. We plan to investigate massively parallel UIO generation algorithms for partial deterministic and nondeterministic FSMs. We also plan to investigate massively parallel algorithms for generating other types of sequences such as distinguishing sequences, characterising sets and checking sequences for complete/partial and deterministic/nondeterministic FSMs. There is also the question of whether guidance can be provided regarding the choice of d. Finally, there may be potential to adapt the proposed approach to problems regarding FSM inference.
