Recent technological developments made available various many-core hardware platforms. For example, a SIMD-like hardware architecture became easily accessible for many users who have their computers equipped with modern NVIDIA GPU cards with CUDA technology. In this paper we redesign the maximal accepting predecessors algorithm [7] for LTL model checking in terms of matrix-vector product in order to accelerate LTL model checking on many-core GPU platforms. Our experiments demonstrate that using the NVIDIA CUDA technology results in a significant speedup of verification process.
Introduction
Model-checking [1] is a wide-spread technique for automated formal verification. Given a formal description of a system and desired system property, the goal of the model-checking procedure is to analyze reachable system configurations in order to decide whether the system satisfies the property or not. In this paper we deal with LTL model checking, which is the case when the property to be verified is given as a formula of Linear Temporal ogic (LTL). In LTL model checking, the question of satisfaction of the property can be reduced to the problem of detection of an accepting cycle (cycle through at least one vertex denoted as accepting vertex) in a directed graph.
All the know algorithms for accepting cycle detection can be divided into two classes. Algorithms such as Nested DFS [10] or algorithms based on Tarjan's SCC decomposition algorithm [17] , [18] exhibit optimal (linear) time complexity, but are incompatible with parallel processing. This is because they strongly rely on the so called depth-first search (DFS) postorder for computation of which no scalable parallel algorithm is known [16] . The other group of algorithms for accepting cycle detection are algorithms such as OWCTY [9] , [12] or MAP [7] that avoid DFS postorder, but exhibit unoptimal time complexity. How-ever, it has been demonstrated that the unoptimality is easily outweighted by parallel processing [2] , [19] . As a result, the unoptimal algorithms are actually faster than the optimal sequential algorithms if contemporary parallel hardware is used.
Moreover, the graph to be analyzed tends to be very large for realistic systems and it is handled only with difficulties by a single memory-limited machine. Consequently, optimal utilization of resources of various hardware platforms have got much attention by the model checking community. As most modern hardware platforms are actually parallel platforms, the desire for full utilization of the power available rendered all sequential algorithms obsolete.
Modern graphics processing units (GPUs) have emerged as a revolutionary technological opportunity due to their tremendous massive parallelism, floating point capability, low cost, and ubiquitous presence in commodity computer systems. Many key computational kernels have been redesigned to exploit the performance of this modern hardware. The key to effective utilization of GPUs for scientific computing is the design and implementation of efficient data-parallel algorithms that can scale to hundreds of tightly coupled processing units.
In this paper we show how one of the parallel algorithms for accepting cycle detection, namely the MAP algorithm, can be effectively accelerated on GPU if the input data are given in an appropriate format. The MAP algorithm was chosen, because, unlike the OWCTY algorithm, it allows on-the-fly verification.
The rest of the paper is organized as follows. In Section 2 we briefly recall basics of automata-theoretic approach to LTL Model Checking. Sections 3, 5 and 6 describe how we adapted the algorithm MAP to GPU processing and what enhancements we did to achieve good performance. In Sections 4 we recapitulate NVIDIA CUDA hardware platform a show how our adapted algorithm can be implemented on CUDA. Section 7 reports on an experimental evaluation of our approach, and finally, Section 8 summarizes achieved results and plots some future directions.
LTL Model Checking
For LTL model checking purposes, the system to be analyzed has to be described in some modeling language, ProMeLa [14] for example, and the property to be checked has to be given as formula of Linear Temporal Logic (LTL) [1] . To answer the LTL model checking question, tools, such as SPIN [14] or DiVinE [5] , employ automata-theoretic approach to reduce the model checking problem to the problem of non-emptiness of Büchi automata. In particular, the model of a system S is viewed as a finite automaton A S describing all possible behaviors of the system. The property to be checked (LTL formula ϕ) is negated and translated into Büchi automaton A ¬ϕ describing all the behaviors violating ϕ. In order to check whether the system violates ϕ, a synchronous product A S × A ¬ϕ of A S and A ¬ϕ is constructed describing those behaviors of the system that violates ϕ,
The automata A S , A ¬ϕ , and A S × A ¬ϕ are referred to as system, property, and product automata, respectively. System S satisfies formula ϕ if and only if the language of the product automaton is empty, which is if and only if there is no reachable accepting cycle in the underlying graph of the product automaton. The LTL model checking problem is thus reduced to the problem of the detection of an accepting cycle in the product automaton graph.
There are several parallel algorithms for accepting cycle detection. One of them is the algorithm MAP [7] which we now briefly introduce in its successor version. Let G = (V, E, v 0 , A) be the graph of the product automaton, where V is a finite set of vertices, E is a set of edges, v 0 is an initial vertex, and A is a vertex predicate indicating whether a state is accepting or not. Let < be a linear ordering of the set of vertices, given e.g. by the vertex numbering. We extend the ordering to the set V ∪ {⊥} (⊥ / ∈ V ) and put ⊥< v for all v ∈ V . Furthermore, let map :: V → V ∪ {⊥} is a function returning the maximal accepting successor of a given vertex or ⊥ if it does not exist, i.e.
The idea of the algorithm to detect an accepting cycle is as follows. If a vertex u is its own maximal accepting successor, i.e. u = map(u), the presence of an accepting cycle is guaranteed. If there is an accepting cycle in the graph, but for none of its vertices u = map(u), then the maximal accepting successor of all the vertices of the cycle must be the same, must lie outside the cycle and can thus be marked as non-accepting. The idea of the algorithm is to process the graph in a few iterations so that each iteration computes map values for all the vertices. If no accepting cycle is discovered, all maximal accepting successors that occur in map(u) for some u are marked as non-accepting for all the following iterations. The algorithm iterates until an accepting cycle is found or the set of accepting vertices becomes empty. See the pseudo-code in Algorithm 1. A key procedure of the
if A S×¬ϕ contains accepting cycle false, otherwise 1: 
COMPUTEALLMAPS(G,<) 3: if (∃u ∈ V : u = map(u)) then A(v) ← false 7: end while 8: return false algorithm is COMPUTEALLMAPS() that is responsible for computing the values of the function map for all the vertices reachable from the initial vertex. See the pseudo-code in Algorithm 2. Initially, the values of map(u) are set to ⊥ for all u ∈ V . These values are then repeatedly updated until a global fix-point is reached, i.e. no update can be done for any value of map(u). Suppose a directed edge (u, v) from u to v, the new value of map(u), the so called update along the edge (u, v), is computed using function maxacc(u, v) as follows:
Algorithm 2 COMPUTEALLMAPS(G,<) 
end if 11: end for 12: end while 13: return map Henceforward, we also refer to the iterations of the while loop of Algorithm MAP given in Algorithm 1 as of outer iterations, and the iterations of the while loop of procedure COMPUTEALLMAPS given in Algorithm 2 as of inner iterations.
The practical performance of the basic algorithm may be further enhanced if the graph to be checked for the presence of an accepting cycle is partitioned into subgraphs so that no cycle of the original graph maps to multiple partitions. In that case the inner iterations as performed in procedure COMPUTEALLMAPS may be prevented from propagating values of map along edges that cross partition boundaries. This brings no complexity improvement, but it generally reduces the number of inner iterations needed to achieve the fixpoint.
One technique to partition the product automaton graph is part of the algorithm itself. It builds upon the fact that if two vertices differ in their values of map, they cannot lie on the same cycle. Therefore, the propagation in procedure COMPUTEALLMAPS may be localized to those edges (u, v) for which the values of map(v) and map(u) computed in the previous outer iteration are the same. The values of map function from the previous outer iteration are referred to as oldmap values.
Reformulation of MAP algorithm
In order to accelerate the MAP algorithm on CUDA we reformulate it as a matrix-vector multiplication algorithm.
Let us assume the graph G of the product automaton A S×¬ϕ is represented as an adjacency matrix M. The matrix keeps value 1 at row u and column v for every directed edge (u, v). See Figure 1 . Additional data to be stored with every vertex of the graph are not stored directly in the matrix, but they are rather organized in separate vectors. Namely, vector m of map values, vector o of oldmap values, vector A of values of the predicate A, and an output vector r of bits indicating a recent update to the value of map. Vector r is used to detect the fix-point of computation of inner iterations, i.e. the situation when no update to map function occurs in two successive inner iterations.
The algorithm proceeds as illustrated in Figure 1 . Initially, all map and oldmap values are set to ⊥, see the column vectors m and o. The algorithm repeatedly updates values of the vector m until a fix-point is reached (the three topmost matrix-vector product equations). Since map(v) = v for all vertices v, the maximal accepting vertex v 3 cannot be part of an accepting cycle (map(v 3 ) < v 3 ). The algorithm resets its accepting status, copies map values to oldmap values and sets all values of map to ⊥. Note that there are two subgraphs to be further processed identified by the oldmap values. Vertex v 3 is part of one of the subgraphs, but since it is not accepting anymore, it cannot influence any future values of map computed for vertices within the subgraph. Then the next outer iteration proceeds. The algorithm detects (using two inner iterations) that map(v 1 ) = v 1 and so it terminates reporting accepting cycle through vertex v 1 .
The key observation is that the vector of map values computed in each inner iteration is computed as a matrix-vector product by substituting max for the standard + operation, and maxacc for the standard · operation. The result, however, considers only those summands for which oldmap values are the same as oldmap value of the updated vertex. For example, the resulting value of m[u] is computed as 
CUDA Architecture
The Compute Unified Device Architectures (CUDA) [11] , developed by NVIDIA, is parallel programing model and software environment providing general purpose programming on Graphics Processing Units. At the hardware level, GPU device is a collection of multiprocessors each consisting of eight scalar processor cores, instruction unit, on-chip shared memory, and texture and constant memory caches. Every core has a large set of local 32-bit registers but no cache. See the structure as depicted in Figure 2 . The multiprocessors follow the SIMD architecture, i.e. they concurrently execute the same program instruction on different data. Communication among multiprocessors is realized through the shared device memory that is accessible for every processor core.
On the software side, the CUDA programming model extends the standard C/C++ programming language with a set of parallel programming supporting primitives. A CUDA program consist of a host code running on a CPU and a device code running on the GPU. The device code is structured into so called kernels. A kernel executes the same scalar sequential program in many data independent parallel threads. Within the kernel, threads are organized into thread blocks forming a grid of one or more blocks, see Figure 3 . Each thread is given a unique index within its block threadIdx and each block is given a unique index blockIdx within the grid. The threads of a single block are guaranteed to be executed on the same multiprocessor, thus, they can easily access data stored in shared memory of the multiprocessor. The programmer specifies both the number of blocks and number of threads per block to be created before a kernel is launched. These values are available to the kernel as gridDim and blockDim values, respectively. Using CUDA to accelerate the computation is easily exemplified on a vector summation problem. Suppose two vectors of length n to be summed. In the standard imperative programming language, a programmer would use a for loop to sum individual vector elements successively. Using CUDA, however, the vector elements can be summed concurrently in a single kernel call populated with n threads, each responsible for summation of a single pair of vector elements at the position given be the thread index.
CUDA Accelerated Algorithm MAP
Data structures used for CUDA accelerated computation must be designed with care. First, they have to allow independent thread-local data processing so that the CUDA hardware can employ massive parallelism. And second, they have to be small so that the high latency device-memory access and limited device-memory bandwidth are not large performance bottlenecks. As for the algorithm MAP, it is the matrix representation of the graph A S×¬ϕ to be encoded appropriately at the first place. Note that uncompressed matrix or dynamically linked adjacency lists violate the requirements and as such they are inappropriate for CUDA computing.
We decided to encode the matrix of the product automaton graph as a sparse matrix using compresse sparse row (CSR) format. In this format a sparse matrix is encoded using three one-dimensional arrays M r , M c , and M n as follows. All the non-zero elements of a matrix M are stored in the array M r in left-to- while repropagate ∧ ¬acc cycle found do 8: repropagate ← false 9: map Kernel(gM c , gM n , gm, acc cycle found) 10: check repropagate Kernel( gm, repropagate) Since the values of map and oldmap are technically pointers, we were able to store the two other bits of information into unused pointer bits reducing thus the space needed to record all the data for one vertex to two times 4 Bytes.
As explained in Section 3, the major computation part of the algorithm MAP can be formulated in terms of matrix-vector product. Given CSR matrix representation and a column vector, an efficient CUDA accelerated matrix-vector product procedure was described in [6] , [13] . The idea of the procedure is to map every row of the matrix to one thread. Since in our case the edges of the graph are more or less uniformly spread in the matrix, this approach leads to a satisfactory balanced load of CUDA cores.
The pseudo-code of the CUDA accelerated algorithm MAP follows. Algorithm 3 lists the overall host code, i.e. the part that is executed on the CPU. The inner and outer while loops listed in the pseudocode correspond with the inner and outer iterations as introduced in Section 2.
There are three kernel functions called from the propagate ← ⊥
7:
u.updated ← false 8: for column ← row begin to row end − 1 do
if u.map = v.old ∧ u.old = v.old then 11: u.old ← v.old 12: u.map ← ⊥
13:
propagate ← ⊥ Note that an update of map value of a single vertex requires access to all immediate successors of it, which can be done effectively in CSR matrix representation. If we opted for the predecessor version of the algorithm, the algorithm would require to access 
if u.acc ∧ u.map < row then
5:
u.map ← ⊥
6:
u.old ← row
7:
u.acc ← false 8:
unmarked ← true 10:
end if 11: end if immediate predecessors of vertices, which would mean to transpose the matrix first. This would prevent onthe-fly computation at all.
On-The-Fly Verification
The last not-yet-discussed but quite essential procedure of the whole verification process is the transformation of the input data as given to the model checker into the form suitable for CUDA accelerated computation. In the model checking process, the graph to be searched for accepting cycles is given implicitly. Implicit definition of a graph involves a function to enumerate initial vertices, a function to enumerate edges emanating from a given vertex, and a function to check for accepting status of a given vertex. In order to use our CUDA accelerated accepting cycle detection algorithm, we have to turn the implicit definition of the graph into an explicit one. This process is generally referred to as state space generation. In addition to explicit state space construction we also build its CSR representation.
A distinguished property of the MAP algorithm is that it can be altered to work on-the-fly [7] . An on-thefly algorithm can detect the presence of an accepting cycle before the state space generation procedure completes its task. We were able to adapt our implementation to mimic this behavior as well. In particular, we let the CPU to perform state space generation during which we let the GPU to apply CUDA accelerated MAP algorithm on partially constructed graph. If the part of the graph constructed so far contains an accepting cycle, CUDA accelerated MAP algorithm simply reveals it before the state space generation is complete.
To further accelerate CUDA computation, we employed another technique to decompose the product automaton graph [4] , [15] . The idea is to decompose the property automaton into strongly connected components and then project this decomposition to the final graph. Moreover some parts of the product automata graph are known to be without accepting vertices in advance and may be omitted when constructing CSR representation of the graph. This technique significantly reduced the size of the matrix as well as the number of repropagations needed.
Experimental evaluation
We have implemented the algorithm as a part of the DiVinE-Cluster model checker version 0.8.2 [5] . We compared the performance of the CUDA implementation against the algorithms MAP and OWCTY as provided by the model checker. To order vertices as required by the algorithm, we employed inverse ordering on row numbers since with this ordering the numbers of inner iterations were very small. For the details on how the ordering influences performance of the algorithm, see [8] .
To compare the CUDA algorithm with the existing algorithms implemented in the DiVinE Cluster model checker, we used DiVinE native models as listed in Table 1 . All the experiments were run on a Linux workstation equipped with two AMD Phenom(tm) II X4 940 Processors @ 3MHz, 8 GB DDR2 @ 1066 MHz RAM and NVIDIA GeForce GTX 280 GPU with 1GB of GPU memory. Table 2 captures various statistics of our experiments. The difference between stored and generated states illustrates how much of the state space is made of subgraphs without accepting states. Note that if the graph contains an accepting cycle, the reported numbers refer to numbers of states and transitions generated and stored before the accepting cycle was discovered. #MAP iterations reports the number of outer iterations, #kernel executions gives the total number of calls to CUDA kernels, and avg kernel time gives an average time a single call to a CUDA kernel took. Table 3 provides details on run-times of individual algorithm parts. As for the CUDA MAP algorithm, the total run-time includes the initialization time (not reported in the Table 4 . The overall run-times in seconds, and speedup of the whole model checking procedure.
slower than construction of the CSR representation. This is because the first iteration of the CPU MAP not only generates the state space, but also computes first stable values of map. Also note the different number of outer iterations in CUDA MAP (reported in Table 2 ) and CPU MAP. The difference is a result of employing maximal accepting predecessors in CPU MAP and maximal accepting successors in CUDA MAP. The number of iterations of CUDA MAP is consistently smaller, for which we have no good explanation yet. Algorithms MAP and OWCTY were running on a single core. Finally, Table 4 gives a comparison of overall runtimes for both valid and invalid model checking instances. We can see that if the whole model checking procedure is considered, the speedup is not that impressive. This is obviously due to the CSR representation preparation. Though, the speedup is still significant.
Conclusions
We demonstrated successful reformulation of the LTL model checking algorithm MAP in terms of matrix-vector product that allows for significant GPU accelerated model checking process. The main bottleneck of the whole approach is the costly procedure of preparation of data structures that are necessary for efficient acceleration. Though we put significant effort in designing accelerated CSR representation computation, we did not achieve a procedure with consistent speed-up. Therefore, we consider GPU accelerating of the data structures preparation to be the next challenge for model checking community.
We are aware of other representations that could be used for CUDA efficient matrix-vector product, the other representations even exhibit better CUDA performance, however, their preparation is generally more complex, hence not very suitable for our domain.
In the future we would like to accelerate slow CSR representation preparation at least by means of multi-core processing, which we believe may bring similar speed-up as in the case of state space generation [3] . Another problem we are aware of is the limited memory size of a single CUDA device. We intend to overcome this limit by employing multiple CUDA devices for which we already have some initial thoughts.
