Abstract Recent developments in computer hardware have brought about a more widespread emergence of shared memory, multi-core systems. These architectures offer opportunities to speed up various tasks, model checking and reachability analysis among others. In this paper, we present a design for a parallel shared memory LTL model checker that is based on a distributed memory algorithm. To improve the scalability of our tool, we have devised a number of implementation techniques which we present in this paper. We also report on a number of experiments we conducted to analyse the behaviour of our tool under different conditions using various models. We demonstrate that our tool exhibits significant speedup in comparison with sequential tools, which improves the workflow of verification in general.
important bottleneck instead. This naturally raises interest in using parallelism to improve the performance of many formal verification tools.
Much of the extensive research on the parallelisation of model checking algorithms followed the distributed memory programming model which stemmed from the necessity to fight the memory constraints of a single computer system. Networks of workstations are easily accessible and they provide the desired computational power, aggregated memory in particular. Parallel, distributed memory techniques have been successfully applied to explicit-state (or enumerative) model checking [2, 4, 41] , symbolic model checking [25, 26] , analysis of stochastic [27] and timed [9] systems, equivalence checking [12] and other related problems [10, 13, 22] . For a survey on parallel LTL model checking algorithms we refer to [6] .
A recent shift in architecture design toward multi-cores with large amounts of local RAM has intensified research pertaining to the shared memory paradigm as well. In [29] Holzmann and Bosnacki proposed an extension of the SPIN model checker for multicore machines. They suggested two different parallel algorithms for verification of safety and liveness properties. While the algorithm for checking safety properties scales well to N-core systems, the scalability of the algorithm for liveness checking, which is based on SPIN's original nested depth-first search (DFS) algorithm, is limited to dual-core systems. Scalable verification of general liveness properties on N cores with time complexity linear in the size of the product automaton remains an open problem.
A different approach to shared memory model checking is presented in [31] , based on CTL * translation to Hesitant Alternating Automata. The proposed algorithm uses a so-called non-emptiness game for deciding validity of the original formula and is, therefore, largely unrelated to the algorithms based on fair cycle detection.
In this paper, we propose a design for a parallel shared memory model checking tool that is based on known distributed memory algorithms. For the prototype implementation, we considered the algorithm by Černá and Pelánek [18]. This algorithm is linear for properties expressible as weak Büchi automata, which comprise the majority of LTL properties encountered in practice. Although the worst-case complexity is quadratic, the algorithm exhibits very good performance with real-life verification problems. To achieve good scalability, we have devised several implementation techniques, as presented in this paper, and applied them to the algorithm. We expect that application of the proposed implementation approaches to other distributed memory algorithms for LTL model checking may bring about similar improvements in scalability on multi-core systems.
We have published a tool based on the presented results, under the name DiVinE Multi-Core. Its full source code is available from [3], together with instructions on compilation and usage. The tool is meant to be used on shared memory multi-processor and multi-core computers, and it is capable of performing both reachability analysis and OWCTY-based LTL model checking on N-core systems.
In Sect. 2 we summarise the existing parallel algorithms for LTL model checking (accepting cycle detection). In Sect. 3, we give an overview of implementation techniques that were applied to multi-core implementations of the selected algorithms. In Sect. 4, we give a broad selection of experimental data on scalability and performance. A comparison to the most recent multi-core-capable version (5.1.4) of SPIN is given as well. Moreover, the effect of several optimisations on both performance and scalability is measured.
Parallel LTL model-checking algorithms
An efficient parallel solution to many problems often requires approaches radically different from those used to solve the same problems sequentially. Classical examples are list ranking, connected components, and depth-first search in planar graphs. In the area of LTL model checking, the best-known enumerative sequential algorithms based on fair cycle detection are the Nested DFS algorithm [20, 30] (implemented, e.g., in the model checker SPIN [28]) and SCC-based algorithms originating in Tarjan's algorithm for the decomposition of the graph into strongly connected components (SCCs) [39]. These optimal sequential algorithms differ in their space requirements, length of the counter-example produced, and other aspects. For a recent survey we refer to [40]. The main idea of the Nested DFS algorithm is to use two interleaved searches to detect reachable accepting cycles. The first search discovers accepting states while the second one, the nested one, checks for self-reachability. Several modifications of the algorithm have been suggested to remedy some of its disadvantages. References [23, 24] have proposed modifications of Tarjan's algorithm, whose common feature is that they recognise an accepting cycle as soon as all transitions on the cycle are explored. For a survey and detailed comparison we refer to [37].
However, both types of algorithms rely on inherently sequential depth-first search (DFS) postorder. Unfortunately, it is not known how the DFS postorder can be computed efficiently in parallel. Therefore, it is difficult to adapt known DFS-based LTL model checking algorithms to parallel architectures. This is also the reason why the parallel algorithm of SPIN is limited to dual-core platforms. In particular, the algorithm performs two nested depth-first searches while the outer search must preserve the DFS postorder on backtracked states. As a result, the outer search cannot be executed in parallel and the algorithm cannot efficiently use more than two cores.
Consequently, different techniques and algorithms are needed for a parallel verification. Unlike the LTL model checking, the reachability analysis is a verification problem for which an efficient parallel solution is available. The reason is that the exploration of the state space is independent of the search order. In the following, we sketch four parallel algorithms for enumerative LTL model checking that are, more or less, based on performing multiple parallel reachability procedures to detect a reachable accepting cycle. The reader is kindly asked to consult the original sources for the details.
MAP. The main idea of the Maximal Accepting Predecessor Algorithm [14, 16] is based on the fact that every accepting vertex lying on an accepting cycle is its own predecessor. An algorithm derived directly from this idea would require expensive computation as well as space to store all proper accepting predecessors of all (accepting) vertices. To overcome this obstacle, the MAP algorithm stores only a single representative of all proper accepting predecessors for every vertex. The representative is chosen as the maximal accepting predecessor according to a presupposed linear ordering ≺ of vertices (given, for example, by their memory representation). Clearly, if an accepting vertex is its own maximal accepting predecessor, it lies on an accepting cycle. Unfortunately, it can happen that all the maximal accepting predecessors lie outside accepting cycles. In that case, the algorithm removes all accepting vertices that are maximal accepting predecessors of some vertex, and recomputes the maximal accepting predecessors. This is repeated until an accepting cycle is found, or there are no more accepting vertices in the graph.
The time complexity of the algorithm is O(a² · m), where a is the number of accepting vertices and m is the number of edges. One of the key aspects influencing the overall performance of the algorithm is the underlying ordering of vertices used by the algorithm. Computing the optimal ordering is, however, difficult to parallelise; hence, heuristics for computing a suitable vertex ordering are used.
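The iteration described above can be illustrated by a small sequential sketch. It is not the parallel implementation from [14, 16]: the function name, the dictionary-based graph encoding, and the use of Python's natural ordering on vertex labels as the ordering ≺ are all assumptions made for illustration.

```python
from collections import deque

def map_accepting_cycle(graph, accepting, init):
    """Sequential sketch of the MAP idea (hypothetical helper).

    graph: dict mapping a vertex to a list of its successors.
    Vertices must be mutually comparable; that ordering plays the role of ≺.
    """
    accepting = set(accepting)
    while accepting:
        # best[v] = maximal accepting proper predecessor of v found so far
        best = {init: None}
        queue = deque([init])
        while queue:
            u = queue.popleft()
            # Value propagated along the edges leaving u.
            cand = best[u]
            if u in accepting and (cand is None or u > cand):
                cand = u
            for v in graph.get(u, ()):
                if v in accepting and cand == v:
                    return True  # v is its own maximal accepting predecessor
                improved = cand is not None and (best.get(v) is None or cand > best[v])
                if v not in best or improved:
                    best[v] = cand
                    queue.append(v)
        # No cycle found in this round: un-mark every vertex that served as
        # a maximal accepting predecessor and recompute.
        used = {m for m in best.values() if m is not None}
        if not used:
            return False
        accepting -= used
    return False
```

In the second test case below, the accepting cycle on vertices 1 and 2 is hidden in the first round by the larger accepting vertex 5, and is only found after 5 has been un-marked.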
OWCTY. The next algorithm [18] is an extended enumerative version of the One Way Catch Them Young Algorithm [21]. The idea of the algorithm is to repeatedly remove vertices from the graph that cannot lie on an accepting cycle. The two removal rules are as follows: First, a vertex is removed from the graph if it has no successors in the graph (the vertex cannot lie on a cycle), and second, a vertex is removed if it cannot reach an accepting vertex (a potential cycle the vertex lies on is non-accepting). The algorithm performs removal steps as long as there are vertices to be removed. In the end, either there are some vertices remaining in the graph, meaning that the original graph contained an accepting cycle, or all vertices have been removed, meaning that the original graph has no accepting cycles.
The time complexity of the algorithm is O(h · m), where h is the height of the SCC quotient graph. Here, the factor m comes from the computation of elimination rules while the factor h relates to the number of global iterations required for application of the removal rules. Also note that an alternative algorithm is obtained if the rules are replaced with their backward search counterparts.
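As an illustration of the two removal rules, the following is a minimal sequential sketch, assuming a graph given as a dictionary of successor lists. The helper names are hypothetical, and the elimination step is written naively for readability, so it does not achieve the O(h · m) bound of the actual algorithm.

```python
from collections import deque

def _closure(edges, sources, allowed):
    """Vertices of `allowed` reachable from `sources` along `edges`."""
    seen = {s for s in sources if s in allowed}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in edges.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def owcty_has_accepting_cycle(graph, accepting, init):
    vertices = set(graph) | {v for succs in graph.values() for v in succs}
    live = _closure(graph, [init], vertices)  # only reachable vertices matter
    reverse = {}
    for u, succs in graph.items():
        for v in succs:
            reverse.setdefault(v, []).append(u)
    while live:
        # Rule 2: keep only vertices that can reach an accepting vertex
        # (backward search from the accepting vertices still in the graph).
        current = _closure(reverse, [a for a in accepting if a in live], live)
        # Rule 1: repeatedly drop vertices without a successor in the graph.
        while True:
            dead = {u for u in current
                    if not any(v in current for v in graph.get(u, ()))}
            if not dead:
                break
            current -= dead
        if current == live:
            return True  # fixpoint with vertices left: accepting cycle exists
        live = current
    return False
```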
OWCTY is the algorithm we have chosen as a primary one for the tool, in part due to its favourable time complexity and also thanks to its performance and scaling behaviour observed in practice. This choice does not preclude the use of the presented techniques with any other distributed memory algorithm, although we do not expect their performance to be an improvement over OWCTY.
NEGC. The idea behind the Negative Cycle Algorithm [15] is a transformation of the LTL model checking problem to the problem of negative cycle detection. Every edge of the graph outgoing from a non-accepting vertex is labelled with 0 while every edge outgoing from an accepting vertex is labeled with −1. Clearly, the graph contains a negative cycle if and only if it has an accepting cycle.
The algorithm exploits the walk to root strategy to detect the presence of a negative cycle. The strategy involves construction of the so-called parent graph that keeps the shortest path to the initial vertex for every vertex of the graph. The parent graph is repeatedly checked for the existence of the path. If the shortest path does not exist for a given vertex, then the vertex is part of a negative (and therefore accepting) cycle. The worst-case time complexity of the algorithm is O(n · m), where n is the number of vertices and m is the number of edges.
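The reduction to negative cycle detection can be demonstrated with a short sketch. Note that it deliberately swaps in plain Bellman-Ford style relaxation in place of the walk to root strategy, since only the edge labelling is being illustrated; the function name and graph encoding are assumptions.

```python
def negc_has_accepting_cycle(graph, accepting, init):
    """Edges leaving accepting vertices weigh -1, all other edges weigh 0."""
    vertices = set(graph) | {v for succs in graph.values() for v in succs}
    edges = [(u, v, -1 if u in accepting else 0)
             for u, succs in graph.items() for v in succs]
    dist = {init: 0}  # only vertices reachable from init ever get a distance
    for _ in range(len(vertices)):
        changed = False
        for u, v, w in edges:
            if u in dist and (v not in dist or dist[u] + w < dist[v]):
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False  # distances stabilised: no reachable negative cycle
    # Still relaxing after |V| rounds: a reachable negative cycle exists,
    # which by construction contains an accepting vertex.
    return True
```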
BLEDGE. An edge (u, v) is called a back-level edge if it does not increase the distance of the target vertex v from the initial vertex of the graph. The key observation connecting the cycle detection problem with the back-level edge concept, as used in the Back-Level Edges Algorithm [1], is that every cycle contains at least one back-level edge. Back-level edges are therefore used as triggers to start a procedure that checks whether the edge belongs to an accepting cycle. However, this is too expensive to be done completely for every back-level edge. Therefore, several improvements and heuristics are suggested and integrated within the algorithm to decrease the number of tested edges and speed up the cycle test.
The BFS procedure which detects back-level edges runs in time O(m + n). In the worst case, each back-level edge may trigger a search for an accepting cycle, which requires linear time O(m + n) as well. Since there are at most m back-level edges, the overall time complexity of the algorithm is O(m · (m + n)).
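The detection of back-level edges during BFS can be sketched as follows. This covers only the triggering step, not the subsequent cycle test or the heuristics of [1]; the function name and graph encoding are assumptions.

```python
from collections import deque

def back_level_edges(graph, init):
    """Return the edges (u, v) with level(v) <= level(u), in BFS order."""
    level = {init: 0}
    queue = deque([init])
    found = []
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in level:
                level[v] = level[u] + 1  # tree edge: v is one level deeper
                queue.append(v)
            elif level[v] <= level[u]:
                found.append((u, v))  # edge does not increase the distance
    return found
```

Each edge returned here would be a candidate for the (more expensive) check of whether it closes an accepting cycle.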
All the algorithms allow for a scalable parallel implementation based on partitioning the graph (its vertices) into disjoint parts. Suitable partitioning is an important factor in overall efficiency of the parallelisation.
One particular technique that is specific to automata-based LTL model checking is cycle locality preserving problem decomposition [5, 32] . The graph (product automaton) originates from a synchronous product of the property and system automata. Hence, vertices of product automaton graph are ordered pairs. An interesting observation is that every cycle in a product automaton graph arises from cycles in system and property automaton graphs. Let A, B be Büchi automata and A ⊗ B their synchronous product. If C is a SCC in the automaton graph of A ⊗ B, then the A-projection of C and the B-projection of C are (not necessarily maximal) SCCs in the automaton graphs of A and B, respectively.
As the property automaton is derived from the LTL formula to be verified, it is typically quite small and can be pre-analysed. In particular, it is possible to identify all SCCs of the property automaton graph. A partition function may then be devised that respects SCCs of the property automaton and therefore preserves cycle locality. The partitioning strategy is to assign all vertices that project to the same SCC of the property automaton graph to the same subproblem. Since no cycle is split among different subproblems, it is possible to employ a localised Nested DFS algorithm to perform local accepting cycle detection simultaneously.
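A cycle locality preserving partition function might look like the sketch below. Because the property automaton is small, its SCCs are computed here by naive mutual reachability rather than by Tarjan's algorithm; the names `scc_ids` and `partition`, and the encoding of a product vertex as a (system state, property state) pair, are assumptions.

```python
def _reach(graph, src):
    """All vertices reachable from src, including src itself."""
    seen, stack = {src}, [src]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def scc_ids(graph):
    """Map every vertex of a (small) graph to the index of its SCC."""
    vertices = set(graph) | {v for succs in graph.values() for v in succs}
    reach = {v: _reach(graph, v) for v in vertices}
    ids, next_id = {}, 0
    for v in sorted(vertices):
        if v not in ids:
            for w in reach[v]:
                if v in reach[w]:  # v and w reach each other: same SCC
                    ids[w] = next_id
            next_id += 1
    return ids

def partition(product_vertex, property_scc):
    """Assign a product vertex to the subproblem of its property SCC."""
    _system_state, property_state = product_vertex
    return property_scc[property_state]
```

All product vertices whose property component lies in the same SCC end up in the same subproblem, so no cycle of the product graph is split.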
Moreover, further interesting information can be drawn from the property automaton graph decomposition. Maximal SCCs can be classified into three categories:
Type F: (Fully Accepting) Any cycle within the component contains at least one accepting vertex. (There is no non-accepting cycle within the component.)
Type P: (Partially Accepting) There is at least one accepting cycle and one non-accepting cycle within the component.
Type N: (Non-Accepting) There is no accepting cycle within the component.
Realising that a vertex of a product automaton graph is accepting only if the corresponding vertex in the property automaton graph is accepting, it is possible to characterise the types of the SCCs in the product automaton, based on the types of the corresponding components in the property automaton. This classification of components into three distinct types, N , F, and P, can be used to gain additional improvements that may be incorporated into the algorithms given above.
Specifically, the OWCTY algorithm needs to use this additional information for it to be linear for all weak graphs. However, for practically encountered property automata, this is not strictly necessary, as the algorithm exhibits linear behaviour even without this modification.
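The classification of a maximal SCC of the property automaton into types F, P and N can be sketched as below. The helper names are hypothetical, and the treatment of a trivial component without any cycle as type N is an assumption of this sketch (the definitions above leave that case open).

```python
def has_cycle(graph, vertices):
    """True if the subgraph induced by `vertices` contains a cycle."""
    live = set(vertices)
    while True:
        dead = {u for u in live
                if not any(v in live for v in graph.get(u, ()))}
        if not dead:
            return bool(live)  # every remaining vertex has a live successor
        live -= dead

def classify_component(graph, component, accepting):
    """Classify a maximal SCC as 'F', 'P' or 'N'."""
    comp = set(component)
    # In an SCC with at least one internal edge every vertex lies on a cycle,
    # so one accepting vertex is enough for an accepting cycle.
    accepting_cycle = has_cycle(graph, comp) and bool(comp & set(accepting))
    non_accepting_cycle = has_cycle(graph, comp - set(accepting))
    if not accepting_cycle:
        return "N"
    return "P" if non_accepting_cycle else "F"
```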
Implementation techniques
It is a well-known fact that a distributed memory, parallel algorithm can be straightforwardly transformed into a shared memory one. However, there are several inefficiencies involved in this direct translation. The shared memory architecture has several traits which may offer advantages in real-world performance of such implementations. In this section, we present our approaches to the challenges of shared memory architecture and its specific characteristics. We will briefly describe the techniques introduced in [2] , concerning communication, memory allocation, and termination detection, and we will show their application to the OWCTY algorithm described in Sect. 2. In addition to this, we introduce some of our latest results regarding implementation. First of all though, let us describe the target platform in more detail.
Shared-memory platform
Since we will work with several assumptions about the targeted hardware architecture in the paper, we will briefly describe the platform first.
Our working environment is POSIX threads. Therefore, we work with a model based on threads that share all the memory, although they have separate stacks in their shared address space and a special thread-local storage to store thread-private data. Contrast this with SPIN 5.1 that employs processes and inter-process communication channels to handle parallel computation.
Critical Sections, Locking and Lock Contention. In a shared memory setting, an access to a memory place that is used by multiple threads has to be controlled; otherwise, a race condition may occur. This is generally achieved by using a "mutual exclusion device", the so-called mutex. A thread wishing to access the memory place has to lock the associated mutex to guarantee the exclusiveness. The access is then performed in the so-called critical section. Locking procedure may block the calling thread if the mutex is locked by some other thread. Such a situation is called a resource or lock contention. It occurs whenever two or more threads happen to need to access the same critical section (and therefore lock the same mutex) at the same time. If the critical sections are long or they are entered very often, the contention starts to cause observable performance degradation, as more and more time is spent waiting for mutexes.
Processor Cache: Locality and Coherence. There are currently two main architectures in use for Level 2 cache. Either each processing unit has a completely private Level 2 cache (the Symmetric Multiprocessing case), or a package of 2 cores shares a single Level 2 cache. In bigger shared memory computer systems, it is usual to encounter split caches, since such systems often contain on the order of 8-64 cores attached to a single memory block. In recent hardware, the basic building units are dual-core CPUs with shared cache, but among the different units, the caches are still separate.
Due to coherence requirements, a read of data from thread B subsequent to a write of the data from thread A may (and usually does) incur a significant penalty. Since the cache often works with smallest units of 128 or more bytes (the so-called cache lines), various pieces of data may share a single cache line if they are adjacent in memory. If different threads access some data on a single cache line very often, the performance of the computation may suffer dramatically. Note that this is also the case if the data accessed by two threads are disjoint, but close enough to occupy the same cache line. Such a situation is referred to as false sharing. A much more detailed study of these phenomena may be found in, e.g. [38].
Implementing algorithms in shared memory
Whenever an algorithm is about to be implemented in a shared memory setting, all the technical details laid out in previous paragraphs must be taken into account. Recall that our goal is to adapt distributed memory algorithms to a shared memory environment and to achieve a scalable implementation. The scalability is inversely proportional to communication overhead and its growth with increasing number of threads. Therefore, the techniques we designed were aimed at reducing communication overhead by exploiting traits of shared memory systems that are not available in a distributed memory environment. However, keeping in mind the possibility to scale beyond shared memory systems, we tried to keep the implementation in a shape that would make it feasible to build a combined tool that works efficiently on clusters of multi-CPU machines.
When we venture into a strictly shared memory implementation, one may pose a question whether a different approach of using a standard serial algorithm modified to allow parallelisation at lower levels of abstraction would give a scalable, efficient program for multi-CPU and/or multi-core systems. Our efforts at extracting such a micro-parallelism in our codebase have been largely fruitless, due to high synchronisation cost relative to the amount of work we were able to perform in parallel. An example of such micro-parallel approach would be to implement a parallel successor generator, where there is certain independence of the sub-tasks involved, but these sub-tasks are very small and on current hardware, it is faster to perform them in sequence than in parallel.
In the following sections, we explore the possibilities to build on existing distributed memory approaches, in the vein of statically partitioned graphs, reducing the overhead using idioms only possible due to locality of memory.
Communication
Generally, in a distributed computation, all communication is accomplished by passing messages, e.g. using a library like MPI for cluster message passing. However, in communication-intensive programs, or those sensitive to communication delay, using general-purpose message passing may be fairly inefficient.
In shared memory, most of the communication overhead can be eliminated by using more appropriate communication primitives, like high-performance, contention- and lock-free FIFOs (First In, First Out queues). We have adopted a variant of the two-lock algorithm presented in [35], which is a decent compromise between performance on the one hand and simplicity and portability on the other. Our modifications involve improved cache-efficiency and only use a single write-lock, instead of a pair of locks, one for reading and one for writing, since there is only ever one thread reading, while there may be several trying to write.
Representation and pseudo-code for enqueue and dequeue algorithms are found in Figs. 1, 2 and 3 , respectively. The correctness, linearisability (atomicity) and liveness proofs as given in [35] are straightforwardly adapted to our implementation and thus left out.
Originally, every thread involved in the computation owned a single instance of the FIFO and all messages for that thread were enqueued into it. However, in communication-intensive workloads (like our parallel model checking algorithms), the write lock has been observed to be a point of contention, creating a bottleneck even when only 4 CPU cores were involved.
Although there is also a completely lock-free design described in [35], we have opted for a matrix-style communication primitive, with a private FIFO for each pair of communicating threads. This is partially motivated by the use of atomic compare and swap instructions in the lock-free queue design, which are relatively expensive compared with regular memory access. Moreover, even when lock contention is removed, the lower level contention for a single memory location on the CPU level is likely to be a cause for concern. For our use-case, all of the mentioned issues can be addressed by using a larger number of queues without incurring any significant penalties.
Alternatives to our implementation, which may be more appropriate in different settings, include a ring-buffer FIFO implementation (if there is a bound on the amount of in-flight data known beforehand, the ring-buffer implementation may be more efficient) and possibly an algorithm based on swapping incoming and outgoing queues (which could be easily implemented as a pointer swap). The latter gives results comparable with the described FIFO method, although the code and locking behaviour are much more complex and error-prone, which made us opt for the simpler FIFO implementation.
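The matrix-style primitive with one private FIFO per pair of communicating threads can be modelled roughly as follows. This is only a sketch of the bookkeeping: it uses Python's `collections.deque` (whose `append` and `popleft` are thread-safe) in place of the linked-list FIFO adapted from [35], and the class and method names are assumptions.

```python
from collections import deque

class FifoMatrix:
    """One private FIFO for every (sender, receiver) pair of threads."""

    def __init__(self, n_threads):
        self.n = n_threads
        # queues[src][dst] is written only by src and read only by dst,
        # so no two writers ever compete for the same queue.
        self.queues = [[deque() for _ in range(n_threads)]
                       for _ in range(n_threads)]

    def send(self, src, dst, message):
        self.queues[src][dst].append(message)

    def receive(self, dst):
        """Return (src, message) for the next pending message, or None."""
        for src in range(self.n):
            queue = self.queues[src][dst]
            if queue:
                return src, queue.popleft()
        return None

    def idle(self, dst):
        """True if no message is pending for thread dst."""
        return not any(self.queues[src][dst] for src in range(self.n))
```

The price of this layout is a number of queues quadratic in the number of threads, which is acceptable for the thread counts of current shared memory machines.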
Memory allocation
In a distributed computation, every process has simply its own memory which it fully manages. In a shared memory, however, we prefer to manage the memory as a single shared area, since an equal partitioning of available memory and separate management may fall short of efficient resource usage. However, this poses some challenges, especially in allocation-intensive environment like ours.
Efficient allocation and deallocation routines. Since the workload we are facing involves large amounts of fixed-size allocations and deallocations, it appears natural to implement tailored allocation and deallocation routines.
A very simple O (1) memory pool has been devised, optimised for many allocations of a limited set of sizes. Of course, from time to time it needs to obtain memory from the system and this operation is not constant-time; however, it is at most linear in the block size and therefore amortises over the individual allocations as well (given a fixed allocation size).
Each thread has its own private pool and therefore the implementation is lock-free. This is possible since most operations are thread-local: the remaining case of cross-thread deallocation is discussed in the following.
Concurrent allocation and deallocation. First, a naïve approach of protecting the allocation routines with a simple mutual exclusion is highly prone to resource contention. Fortunately, modern general-purpose allocator implementations refrain from this idea and have a generally non-contending behaviour on allocation. However, releasing memory back for reuse is more complex to achieve without introducing contention, in a setting where it is often the case that a thread other than the one that allocated the chunk needs to release it.
There are known general-purpose solutions to this problem, e.g. [34]; however, they are currently not in widespread use in general-purpose allocators. Therefore, when relying on the system's allocator, we have to refrain from the aforementioned pattern of releasing memory from a thread other than the allocating one, in order to avoid contention and the accompanying slowdown.
The message-passing implementation we employ is pointer-based; in other words, the message sent is only a pointer and the payload (actual interesting message content) is allocated on the shared heap, and it may be either reused or released by the receiving thread. Observe, however, that releasing the associated memory in the receiving thread will introduce the situation which we are trying to avoid. Therefore, the allocation routines handle these cases differently. Instead of manipulating the memory pool of a different thread, the memory (allocated in a different thread) is appended to the freelist of the current thread. Thanks to the workload distribution scheme employed, this approach is very much feasible and does not introduce significant memory overhead. Moreover, it is correct, since the memory handed over to the new thread is never again examined by the original thread. We have dubbed the technique "memory stealing", since the releasing thread "steals" the memory from its previous owner.
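The freelist bookkeeping behind "memory stealing" can be modelled in a few lines. Python manages memory automatically, so this is purely an illustration of who ends up owning a released block, not a usable allocator; `BlockPool` and its members are hypothetical names.

```python
class BlockPool:
    """Per-thread pool of fixed-size blocks; never shared, hence lock-free."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.freelist = []   # blocks owned by this pool and ready for reuse
        self.fresh = 0       # number of blocks obtained from the system

    def alloc(self):
        if self.freelist:
            return self.freelist.pop()      # O(1) reuse
        self.fresh += 1
        return bytearray(self.block_size)   # stand-in for a system allocation

    def free(self, block):
        # The block may come from another thread's pool. It is appended to
        # the freelist of the releasing thread ("stolen"), so the pool of
        # the original owner is never touched.
        self.freelist.append(block)
```

A block allocated by the sending thread and released by the receiving thread is thus served again by the receiver's pool.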
General purpose memory allocation. Since apart from the state allocation and deallocation, there are several important memory-intensive routines (one example being the FIFOs, another is the model parser and interpreter) in the model checker, which do not exhibit the behaviour described above, we also need a high-performance, general-purpose memory allocator. Moreover, the pool allocator described needs to obtain memory blocks somewhere as well.
We have opted for Emery Berger's excellent HOARD multi-threaded memory allocator [11] . Apart from having very good performance and scalability properties, HOARD strives to avoid heap layouts leading to false sharing, further improving performance.
Termination detection
Since our algorithms rely on work distribution among several largely independent threads, we need an algorithm for shared memory termination detection. Similar to the distributed memory setting, our algorithm should introduce minimal overhead and avoid as much serialisation as possible.
One possible solution is presented in [33]. The solution avoids locking altogether; however, it requires that the system provides an enqueue-with-wakeup primitive. We decided not to follow this lock-free approach, as we were able to achieve a quite satisfactory solution much more easily by employing primitives available in the POSIX Thread API. In particular, the API offers a mutex implementation that allows threads to use the mutex in a lock-or-fail manner, as opposed to the standard lock-or-wait, which is usually employed for protecting critical sections. We can leverage this mechanism to achieve an efficient termination detection algorithm as follows.
The idea is that each thread is associated with a mutex whose status corresponds to the status of the thread: whenever a thread is idle, its corresponding mutex is unlocked and conversely, whenever the thread is busy, its mutex is locked.
In order to detect termination we run a separate thread performing the termination detection. The termination detection algorithm tries to lock mutexes of all worker threads, one by one, using the lock-or-fail behaviour. If it succeeds in locking all mutexes (all working threads are idle) it proceeds to check the communication queues. If these are empty, the termination has occurred and the algorithm terminates. Pseudo-code for the algorithm is shown in Fig. 4 .
In order to cope with the termination detection algorithm, every working thread is augmented so that it locks the corresponding mutex whenever it starts processing pending work, and unlocks the mutex whenever it becomes idle. To reduce overhead caused by repeated polling for incoming work, the thread, in addition, enters a sleeping mode after becoming idle. Since we run the termination detection algorithm in a dedicated thread, the termination detection thread may wake up threads that have pending work, but are sleeping due to previous idling; i.e., if the termination thread has successfully grabbed any locks and some queues belonging to those locked threads are found non-empty, the corresponding threads are awakened. After every run of the termination detection algorithm, all grabbed locks are released again.
Fig. 4 Termination detection in shared-memory
Moreover, although this algorithm works correctly as-is, it is rather inefficient if the termination detection thread is left running in a loop. Therefore, the termination detection thread goes to sleep after every iteration and is woken up by any worker thread that goes idle.
This modification introduces a race condition to the algorithm. If the last thread going to sleep wakes up the termination detection thread, which then runs the algorithm before the calling thread manages to go to sleep, the system may deadlock. We solved the problem with further technical modifications of the algorithm; however, we do not list these modifications for the sake of simplicity.
An alternative approach would be to synchronously execute the termination detection algorithm in the thread that has become idle; but due to the nature of the system, the above procedure is more practical code-wise and incurs only negligible overhead.
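A single pass of the lock-or-fail termination check described in this section can be sketched with Python's `threading.Lock`, whose `acquire(blocking=False)` behaves like a POSIX trylock. The sketch omits the sleeping and wake-up logic as well as the race-condition fix mentioned above; the function name and the representation of queues as deques are assumptions.

```python
import threading
from collections import deque

def termination_pass(mutexes, queues):
    """One pass of the detector; True means the computation has terminated.

    mutexes[i] is held by worker i while it is busy and free while it is idle.
    """
    held = []
    try:
        for mutex in mutexes:
            if not mutex.acquire(blocking=False):  # lock-or-fail
                return False                       # some worker is busy
            held.append(mutex)
        # All workers are idle; termination requires empty queues as well.
        return all(len(queue) == 0 for queue in queues)
    finally:
        for mutex in held:                         # release every grabbed lock
            mutex.release()
```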
Workload partitioning
One of the traditional approaches, when exploring the state space of an implicitly specified model, is that the algorithm starts from the initial state and using a transition function, generates successors of every explored state. Visited states are stored in a hash table, to facilitate quick insertion of newly visited states and quick lookup of states that have already been visited.
The usual approach in distributed algorithms is to partition the state space statically, using a partition function [17, 19] (which is usually in turn based on a hash function over the state representation). This partition function unambiguously assigns each state to one of the computation nodes. The same approach can be leveraged in shared memory computation, where each thread of control assumes ownership of a private hash table and potentially also a private memory area for storing actual state representations.
All the processors (and therefore threads of control) share a single continuous block of local memory, with uniform accessibility from all the CPUs and/or cores. This gives us two new options, compared with the situation in distributed environment. First, if several hash tables are used, threads can look into tables they do not own, and the second option is to have a single shared hash table, used by all the threads.
Static and Dynamic Partitioning. In our research on this topic in [8] , we have arrived to the conclusion that when available (i.e. the algorithm allows), static partitioning is preferable to dynamic. Such a scheme leads to distinct hash tables for each thread and appears to improve both performance and scalability by a significant margin. The issues with shared hash table are a subject of further research, and unfortunately, we cannot give more insight into the cause of these problems at this time.
Therefore, we have opted for statically partitioning the state space and using private hash tables. The implementation used in this paper is fully based on this approach. For an experimental evaluation of the shared hash table options, please refer to [8] .
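Static partitioning with private hash tables can be sketched as follows. This is an illustration under stated assumptions rather than the DiVinE implementation: the partition function is taken to be a CRC32 hash of the state representation modulo the number of threads, the transition function successors describes a made-up toy model, and the workers are simulated by taking turns in a single thread so that the example stays deterministic. All names are hypothetical.

```python
from collections import deque
import zlib

def owner(state, n_threads):
    """Static partition function: stable hash of the state, modulo thread count."""
    return zlib.crc32(repr(state).encode()) % n_threads

def successors(state, modulus=1000):
    """Hypothetical transition function of a toy implicit model."""
    return [(2 * state + 1) % modulus, (3 * state + 2) % modulus]

def partitioned_bfs(initial, n_threads):
    """Explore the state space; return one private visited set per worker."""
    visited = [set() for _ in range(n_threads)]   # private hash tables
    inbox = [deque() for _ in range(n_threads)]   # states sent to each owner
    inbox[owner(initial, n_threads)].append(initial)
    while any(inbox):                             # some queue is non-empty
        for tid in range(n_threads):              # workers take turns (simulation)
            while inbox[tid]:
                state = inbox[tid].popleft()
                if state in visited[tid]:         # lookup only in the own table
                    continue
                visited[tid].add(state)
                for succ in successors(state):
                    # Hand every successor to the thread that owns it.
                    inbox[owner(succ, n_threads)].append(succ)
    return visited
```

Because every state has exactly one owner, the private tables are disjoint and their union is the set of reachable states, whatever the number of workers.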
3.7 Implementing OWCTY in shared-memory
As can be seen from the pseudo-code (refer to Fig. 5), the main OWCTY loop consists of a few steps, namely reachability, elimination and reset. Each of them can be parallelised, but only on its own, which requires a barrier after each step.
The algorithm uses a BFS state space visitor to implement both reachability and elimination. The underlying BFS is currently implemented using a partition function, i.e., every state is unambiguously assigned to one of the threads. The barriers (i.e. barrier synchronisation) are straightforwardly implemented using the termination detection algorithm presented earlier: the computation is initiated by the main thread, and the termination detection is then executed in this same thread, which also doubles as a scheduler. When the step terminates, the main thread prepares the next step, spawns the worker threads and initiates the computation again. Since the hash table is always thread-private, i.e. owned exclusively by a single thread, the main thread has to transfer the hash table among different threads in the serial portion of the computation. This is nonetheless done cheaply (only a few pointer operations), so it is likely not worth parallelising.
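The structure of the main loop can be made concrete with a small sequential sketch. It follows the reset, reachability and elimination steps as OWCTY is commonly described, and it is not a transcription of Fig. 5 nor of the parallel implementation; in a parallel version, each of the three steps would run across the worker threads and be followed by a barrier. The function names and the graph interface (a successors callable and an accepting predicate) are assumptions made for this example.

```python
def owcty(initial, successors, accepting):
    """Return True iff an accepting cycle is reachable from `initial`."""

    def reach(seeds, within=None):
        # Forward reachability from `seeds`, optionally restricted to `within`.
        seen, stack = set(), list(seeds)
        while stack:
            state = stack.pop()
            if state in seen or (within is not None and state not in within):
                continue
            seen.add(state)
            stack.extend(successors(state))
        return seen

    def eliminate(states):
        # Repeatedly remove states that have no predecessor inside the set.
        preds = {s: 0 for s in states}
        for s in states:
            for t in successors(s):
                if t in preds:
                    preds[t] += 1
        alive = set(states)
        dead = [s for s, count in preds.items() if count == 0]
        while dead:
            s = dead.pop()
            alive.discard(s)
            for t in successors(s):
                if t in alive:
                    preds[t] -= 1
                    if preds[t] == 0:
                        dead.append(t)
        return alive

    current = reach([initial])
    previous = None
    while current and current != previous:            # iterate to a fixpoint
        previous = current
        seeds = {s for s in current if accepting(s)}  # reset
        current = reach(seeds, within=current)        # reachability
        current = eliminate(current)                  # elimination
    return bool(current)
```

For example, on the graph 0 -> 1 -> 2 -> 1 the call owcty(0, graph.__getitem__, lambda s: s == 2) finds the accepting cycle through state 2, whereas making only state 0 accepting yields an empty fixpoint.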
Experiments

Methodology
The main testing machine we have used is a 16-way AMD Opteron 885 (8 CPU units with 2 cores each). A second testing configuration has been a 4-way Intel Xeon 5130. All timed programs were compiled using unpatched gcc 4.2.2 with -O2. We have used both 32- and 64-bit builds, using -m32 and -m64, respectively. If not specified otherwise, a 64-bit build has been used, due to the memory demands of the verification runs (32-bit pointers can only address up to 4 gigabytes of memory, part of which is reserved by the operating system). The 32-bit version has only been used for comparison on a few of the smaller models.
For this paper, our main concern is speed and scalability; therefore, we focus on these two parameters. Measurement has been done using the standard UNIX time command, which measures the real and CPU times used by a program.
For the experimental evaluation we implemented the algorithms on top of the state generator from DiVinE [7]. All the models we have used are listed in Table 1, including the verified properties. The models come from the BEEM database [36], which contains the models in the DiVinE-native modelling language as well as in ProMeLa. We used the ProMeLa models for comparison with the SPIN model checker.
Generator influence
In addition to these models, we have implemented a special-purpose ("dummy") state space generator which generates very small states very quickly, to evaluate scalability when using a high-performance generator. This is important for identifying the role that our current generator, which is known to be sub-optimal, plays in the performance and scalability of the framework. The results are presented in Fig. 6. It can be seen that, although the scaling is roughly 1:2 (the number of cores needs to be quadrupled to double the speed, i.e. efficiency is about 50%), it is fairly flat across the board (i.e. the speedup between 1 and 2 cores is not much higher than the speedup between 2 and 4 cores, and so on). This is an interesting result, since it gives us a good lower bound estimate on the scaling behaviour of the system once we improve the state space generator. Scalability is inversely proportional to communication overhead: with a faster generator, the proportion of time spent in communication is higher than with a slower one.
In addition to the "dummy" generator, a model from BEEM with a similar runtime on a single core is plotted in the same figure for comparison. The expected scalability behaviour with an improved state space generator falls between these two (i.e. worse than the current behaviour with realistic models, but better than that of the "dummy" generator).
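The speedup and efficiency figures quoted above follow from simple arithmetic on measured wall-clock times: speedup on n cores is the single-core time divided by the n-core time, and efficiency is speedup divided by n. The helper below is a small sketch with a hypothetical name, and the timings are invented for illustration; they are not measurements from this paper.

```python
def speedup_and_efficiency(wall_times):
    """Map core count -> (speedup, efficiency) relative to the 1-core run.

    wall_times: dict {number_of_cores: wall-clock seconds}; must contain key 1.
    """
    base = wall_times[1]
    result = {}
    for cores, seconds in sorted(wall_times.items()):
        speedup = base / seconds
        result[cores] = (speedup, speedup / cores)
    return result

# Hypothetical timings (seconds), NOT measurements from the paper.
example = speedup_and_efficiency({1: 80.0, 4: 40.0, 16: 10.0})
```

With these invented numbers, quadrupling the cores from 1 to 4 doubles the speed, which corresponds to an efficiency of 0.5 on 4 cores.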
Results
We report both runtimes and speedup for BFS reachability in Fig. 7. Measurements for the implementation of the OWCTY algorithm (for full LTL model checking) are provided in Fig. 8. These were obtained using a 64-bit build on the 16-way AMD Opteron machine. Another set of data points comes from the 4-way Intel machine (again a 64-bit build), visualised in Fig. 9 for reachability.
Some of the phenomena visible in the plots need to be explained. First of all, the exact distribution of graph vertices among worker threads strongly depends on the actual number of worker threads used, due to the distribution scheme (please refer to Sect. 3.6: DiVinE Multi-Core uses hash(vertex)/threadcount to assign a given vertex to its owner thread). We call an edge a "cross" edge when the two vertices it connects belong to different threads. Clearly, communication overhead depends on the proportion of such "cross" edges in the system. Moreover, this proportion depends on the exact partitioning of the state space, and this varies with the actual number of threads used. This sometimes leads to situations where adding more CPU cores to the computation changes the partitioning of the graph in such a way that the higher communication overhead cancels out the benefit of increased computational power (and sometimes even causes the whole computation to slow down).
Moreover, it can be seen that the OWCTY algorithm, which has inherently higher communication costs than simple reachability, has accordingly poorer scalability behaviour. Another issue that contributes to the inferior scalability of OWCTY is the ordering restriction imposed in the "elimination" pass, which reduces the amount of work that can be done in parallel and therefore limits the achievable speedup.
In addition to these experiments, we have performed a few comparison runs, first for 32-bit and 64-bit builds, with results presented in Figs. 10 and 11. These have been done in order to establish the effect of pointer width on performance and scalability. The conclusion we can draw from the experiment is that, while the wider pointers incur a performance penalty on a single-core run (which is observable in reachability, but negligible in OWCTY), this penalty is eventually evened out as cores are added (which in turn means that the penalty is divided evenly among the cores).
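The dependence of the "cross" edge proportion on the number of threads can be explored with a short script. This is a sketch under assumptions: the partition function is taken to be a CRC32 hash modulo the thread count, which may differ from the scheme actually used by the tool, and the edge list comes from a made-up toy graph. The names owner, toy_edges and cross_edge_fraction are hypothetical.

```python
import zlib

def owner(state, n_threads):
    """Assumed partition function: stable hash of the state, modulo thread count."""
    return zlib.crc32(repr(state).encode()) % n_threads

def toy_edges(n=2000):
    """Edges of a hypothetical graph: each vertex has two successors."""
    return [(s, t) for s in range(n)
            for t in ((2 * s + 1) % n, (3 * s + 2) % n)]

def cross_edge_fraction(edges, n_threads):
    """Fraction of edges whose endpoints are owned by different threads."""
    cross = sum(1 for u, v in edges
                if owner(u, n_threads) != owner(v, n_threads))
    return cross / len(edges)

edges = toy_edges()
# One value per thread count; with one thread there are no cross edges.
fractions = {n: cross_edge_fraction(edges, n) for n in (1, 2, 4, 8, 16)}
```

If the hash spreads states uniformly and independently of the graph structure, roughly (n - 1)/n of the edges between distinct vertices can be expected to be cross edges on n threads; the exact value for a particular model and thread count depends on how the partitioning happens to fall, which is the effect described above.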
Furthermore, the effect of the custom-made pool allocator is shown in Fig. 12. It can be seen that the tailored allocator does indeed help with scalability at the extreme right end.
The SPIN-generated verifier has been used with parameters -E -A -w27 -m5000000 and compiled with -DMEMLIM=8000 -DNOREDUCE -DVMAX=512, plus -DSAFETY for reachability. We have used a bigger stack or hash table than strictly necessary (to avoid running into excessive hash table collisions or exceeding the search depth; determining the best sizes for each model would be impractical). For the NCORE > 1 runs, it has also been necessary to increase the system shared-memory limit. For the Anderson model, the stack limit needed to be quadrupled to facilitate verification.
This fulfils our goal of implementing a scalable multi-core LTL model checker. It maintains a linear time complexity for the majority of LTL properties verified in practice and provides scalability that makes it practical to use on machines with several CPU cores available. Moreover, in the range of 4-8 cores, the tool's performance rivals that of SPIN and even exceeds it at higher core counts, which is in itself a considerable achievement.
From the profiling work we have done, it is clear that the most important current bottleneck of DiVinE is its state generator. The experimental data presented in the paper support this observation, especially when compared with SPIN, with its famously fast state space generator. Improvements in this area should reduce the absolute running times, although they may negatively affect relative scalability. Nevertheless, the current implementation of the algorithm and its supporting mechanisms still offers a speedup close to (number-of-cores/2) even when used with a very fast state space generator.
To sum up, this paper presents a significant achievement in the stated goal of implementing scalable algorithms and support code for reachability and LTL model checking in DiVinE Multi-Core.
