As computer systems grow in complexity and degree of parallelism, formal methods for ensuring their quality have become ever more important. Correctness and reliability have become a must-have for business success, and therefore various techniques for automated and semi-automated formal verification and analysis have been designed and successfully applied. Formal verification and analysis bring many benefits, such as early integration into the design process and more effective detection of logic errors. Even though introducing formal analysis is rather costly, it pays off, as it results in a significant reduction in verification time as well as in development costs and time-to-market. Attempts are being made to integrate formal verification techniques and tools with other design approaches to support the engineering of complex industrial systems. The iFEST Artemisia project [59] is an example of promising tool integration in the embedded systems domain.
Fighting the Space Explosion
Model checking is a distinguished technique for the formal verification of complex hardware and software designs. The founders of the technique, Edmund M. Clarke Jr. (CMU, USA), E. Allen Emerson (University of Texas at Austin, USA), and Joseph Sifakis (IMAG Grenoble, France), received the ACM Turing Award in 2007 for their roles in developing model checking into a highly effective verification technology, widely adopted in the hardware and software industries. Unfortunately, the model checking procedure is in general computationally demanding and memory-intensive; hence, its applicability to the large and complex systems routinely seen in practice is still limited. The major hampering factor is the state space explosion problem, due to which large industrial models cannot be handled efficiently unless more sophisticated and scalable methods are used.
A lot of attention has been paid to the development of approaches to fight the state space explosion problem [37] in the field of automated formal verification [77]. Many techniques, such as state compaction [47], compression [56], state space reduction [76, 36, 44], and symbolic state space representation [29], were introduced to reduce the memory needed to handle a verification problem with a standard sequential software tool. These techniques allowed system developers to verify larger systems without the need for increased computing power.
To verify even larger systems, however, no option was left other than to employ the combined computing power of multiple computing devices. Unfortunately, some verification techniques cannot preserve their efficiency when adapted to non-sequential models of computation, and therefore an urgent need emerged for new and quite different verification procedures. Many new techniques have been introduced. Some of them are applicable across a broad range of computing platforms, while others are tailored to the specific capabilities of a particular hardware architecture. Examples include techniques that fight memory limits through efficient utilization of external memory devices [80], cluster-based algorithms that employ the aggregate power of network-interconnected computers [79, 73, 46, 4], and techniques that speed up the verification process on multi-core processors [58, 11, 71].
At the beginning of the 21st century, however, many of these techniques had yet to be discovered. Even then, the idea of combining resources to increase computational power was far from new in formal verification. Attempts to use hard drives or parallel computers for the verification of large systems appeared in the very early years of the automated formal verification era. However, the unavailability of cheap parallel computers with sufficiently fast external memory devices, together with negative theoretical complexity results, kept these approaches out of the mainstream of formal verification. Moreover, thanks to Moore's law, the performance of software tools kept improving for years as the power of single-core CPUs grew. The situation changed dramatically with the advent of multi-core CPU chips. Progress in computer design over the past decades had spanned several orders of magnitude with respect to various physical parameters, such as power consumption, efficiency, physical size, and cost. As a result, it became more efficient for chip producers to place multiple CPU cores on a single chip than to increase the speed of a single core. As the speed of a single core virtually stopped growing, software built upon serial algorithms could no longer take advantage of technological progress. The focus of the parallel and distributed-memory computing community shifted away from unique massively parallel systems competing for world records towards smaller and more cost-effective systems built from small and cheap personal computer parts. Suddenly, the need for parallel processing became general and widespread in all fields of science that rely on complex computation, automated formal verification being no exception. As a result, interest in platform-dependent formal verification has been revived.
The DiVinE Story!
DiVinE [15, 18] is a tool for LTL model checking and reachability analysis of discrete distributed systems. The tool efficiently exploits the aggregate computing power of multiple network-interconnected multi-core workstations in order to deal with extremely large verification tasks. As such, it allows the analysis of systems whose size is far beyond what regular sequential tools can handle. DiVinE follows the explicit-state automata-based approach to LTL model checking. Due to Vardi and Wolper [81], the LTL model checking problem reduces to the emptiness problem for Büchi automata, and hence to the problem of detecting an accepting cycle in the underlying directed graph of a Büchi automaton.
Parallel Algorithms in LTL Model Checking
The need for parallel processing in automated formal verification stemmed from the desire to fight the state space explosion problem by employing the aggregate memory of multiple network-interconnected workstations. The crucial aspect studied at first was how to distribute the work among the participating processors in order to take advantage of aggregate memory and parallel processing at the same time.
Based on a parallel algorithm for state space generation [30], a static partitioning scheme relying on a hash function was introduced [34]. As observed by multiple researchers, hash-based partitioning yields better locality if only parts of the state descriptor are used as the input to the partitioning function. Some approaches required the user of the tool to specify the concrete parts of the state descriptor to be used for partitioning [34, 73], while others employed automated or semi-automated techniques to choose them [35]. By default, DiVinE uses hash-based partitioning over the full state descriptor; the parts of the descriptor used for partitioning can be statically redefined prior to compilation. Techniques for load balancing the set of visited states, also known as re-partitioning techniques, have been suggested [2, 74, 70], as have state space generation schemes employing probabilistic aspects [67, 66]. Nevertheless, none of them has been implemented in DiVinE.
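The static hash-based partitioning described above can be illustrated with a minimal Python sketch. It is not DiVinE's code: the function name `owner`, the `fields` parameter, the tuple encoding of states, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib

def owner(state, num_nodes, fields=None):
    """Return the index of the node that owns `state` (a tuple).

    `fields` is a hypothetical parameter: when given, only those
    positions of the state descriptor are fed to the hash function.
    """
    selected = state if fields is None else tuple(state[i] for i in fields)
    # CRC32 is used only because it is deterministic across processes;
    # any well-distributed hash function would serve the same purpose.
    digest = zlib.crc32(repr(selected).encode("utf-8"))
    return digest % num_nodes

# Full-descriptor partitioning.
print(owner((3, 0, 7), num_nodes=4))
# Partial-descriptor partitioning: states that agree on position 0
# always land on the same node.
print(owner((3, 0, 7), 4, fields=[0]) == owner((3, 9, 1), 4, fields=[0]))
```

Hashing only part of the descriptor keeps a state and those successors that leave the selected parts unchanged on the same node, which is one plausible reading of the locality benefit mentioned above.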
When the development of the DiVinE model checker started, several earlier tools already existed. The first known public implementation of a distributed-memory tool for the verification of communication protocols was the parallel implementation of the Murϕ tool [39, 79]. The parallel work-flow of Murϕ relied on the standard MPI-like approach to messaging; active messages were later introduced into Murϕ to improve its efficiency [43]. The success of Murϕ was followed by other verification tools: SPIN [73, 74], CADP [46], UPPAAL [22], etc. Distributed-memory state space generation as a technique of automated formal verification also appeared in the context of Petri nets [34, 55] and Markov chains [54, 53].
Prior to the DiVinE model checker, the existing distributed-memory parallel tools focused on state space generation and reachability analysis only. The reason was simple: the lack of parallel algorithms for accepting cycle detection in the distributed-memory setting. The Nested Depth-First Search algorithm (Nested DFS) and other algorithms relying on the depth-first search stack cannot be used in the distributed-memory setting, as distributed and parallel maintenance of the depth-first search stack is inefficient [4]. Therefore, new parallel and distributed-memory algorithms for accepting cycle detection had to be introduced with the development of the DiVinE tool. The first implementation of DiVinE employed the so-called dependency structure [14] to record the reachability relation among accepting states of a distributed graph and applied the topological sort algorithm [65] to detect the presence of a self-reachable accepting state. Other parallel algorithms appeared over time, building upon various ideas: detection of negative cycles (NEGC Algorithm) [28, 26], explicit-state implementation of symbolic SCC hull detection (OWCTY Algorithm) [31], value propagation (MAP Algorithm) [27], and an algorithm based on back-level edges as produced by a breadth-first search procedure (BLEDGE Algorithm) [8, 9]. These new algorithms differed in theoretical complexity as well as in practical efficiency. After extensive experimental evaluation, some of the algorithms were discontinued in DiVinE. The latest version of DiVinE employs a combination of the MAP and OWCTY algorithms by default [12]. The table in Figure 1 gives the complexities and on-the-fly abilities of the newly introduced parallel LTL model checking algorithms.
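To make the idea of stack-free accepting cycle detection concrete, the following is a sequential Python sketch in the spirit of the OWCTY approach: it alternates forward reachability from accepting states with elimination of states that have no predecessor, and reports an accepting cycle if anything survives. It is only an illustration under stated assumptions, not the algorithm as implemented in DiVinE; the graph encoding (a dictionary of successor lists) and all function names are hypothetical. Both phases rely only on set-based traversal, never on a depth-first search stack.

```python
from collections import deque

def reachable(succ, sources, within=None):
    """Forward closure of `sources`, optionally restricted to `within`."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in succ.get(u, ()):
            if v not in seen and (within is None or v in within):
                seen.add(v)
                queue.append(v)
    return seen

def eliminate(succ, states):
    """Repeatedly remove states with no predecessor inside `states`."""
    states = set(states)
    indegree = {s: 0 for s in states}
    for u in states:
        for v in succ.get(u, ()):
            if v in states:
                indegree[v] += 1
    work = [s for s in states if indegree[s] == 0]
    while work:
        u = work.pop()
        states.discard(u)
        for v in succ.get(u, ()):
            if v in states:
                indegree[v] -= 1
                if indegree[v] == 0:
                    work.append(v)
    return states

def has_accepting_cycle(succ, init, accepting):
    """True if an accepting cycle is reachable from `init`."""
    current = reachable(succ, [init])
    previous = None
    while current != previous:          # iterate until a fixpoint
        previous = current
        seeds = current & accepting     # accepting states still alive
        current = reachable(succ, seeds, within=current)
        current = eliminate(succ, current)
    return bool(current)

graph = {0: [1], 1: [2], 2: [1]}
print(has_accepting_cycle(graph, 0, {2}))  # state 2 lies on the cycle 1 -> 2 -> 1
print(has_accepting_cycle(graph, 0, {0}))  # state 0 lies on no cycle
```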
Distributed-memory processing cannot fight the state space explosion problem alone and must be combined with other techniques. One of the most successful techniques for fighting state space explosion in explicit-state model checking is Partial Order Reduction [76]. DiVinE is able to perform this reduction; however, a new topological sort proviso had to be developed in order to maintain the efficiency of parallel and distributed-memory processing [13]. Another important algorithmic improvement relates to the classification of LTL formulas [32]. For some classes of LTL formulas (weak LTL), the parallel algorithms may be significantly improved [7]. With this observation, the OWCTY algorithm can be improved so that its complexity matches that of the optimal sequential Nested DFS algorithm; see the table in Figure 1. However, this algorithm suffers from not being an on-the-fly algorithm. Since on-the-fly verification is an important practical aspect, we have devised a modification of this algorithm that allows for on-the-fly verification in most verification instances [12].
While DiVinE focuses on "complete" verification, parallel distributed-memory "incomplete" verification based on lossy state compaction has been introduced by the PReach tool [24].
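Figure 1: Complexity, optimality, and on-the-fly abilities of the parallel LTL model checking algorithms (table columns: Complexity, Optimality, On-The-Fly).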
Algorithm Engineering for Parallel LTL Model Checking
There is no doubt that without an appropriate parallel algorithm the LTL model checking procedure cannot be successfully adapted to contemporary parallel computing platforms. Nevertheless, the algorithm is not the only ingredient required. Even the algorithms that are best in theory may not outperform good-but-not-optimal algorithms equipped with platform-aware heuristics. This observation applies even more to parallel processing, where scalability and absolute runtime reduction are typically valued more than theoretical optimality. To that end, there is another ingredient behind the development of the parallel and distributed-memory tool DiVinE: Algorithm Engineering.
Efforts must be made to ensure that promising algorithms discovered by the theory community are implemented, tested and refined to the point where they can be usefully applied in practice.
[Aho et al. [1997], Emerging Opportunities for Theoretical Computer Science]
In other words, the characteristics of individual computing platforms must be taken into account in order to obtain efficient implementations on these platforms. To take advantage of the processing power that the various platforms provide, implementations of algorithms and data structures need to be platform-dependent and platform-aware.
Parallelism in Distributed-Memory
The distributed-memory parallel platform was the first platform that the DiVinE tool was adapted to. The intention was to aggregate the computational power and distributed system memory of multiple network-interconnected workstations (clusters) in order to facilitate the verification of large model checking instances [14, 4]. The general idea of employing the distributed-memory platform for the execution of a parallel graph algorithm was, and still is, as follows. The set of vertices of the graph to be processed is partitioned among the participating computation nodes using a static partitioning function. When a computation node processes a vertex, it enumerates all its immediate successors and checks their ownership. If a newly generated vertex is local according to the partitioning function, it is pushed to the local queue, where it waits for further processing. Otherwise, a network message containing the vertex is created and sent to the queue of the owning computation node. With this work-flow, a message is generated for every edge connecting vertices from different partitions of the graph. This is where the theory ends; when it comes to the implementation, however, there are still numerous design choices to be made. Some of them are detailed for individual computing platforms in the following subsections.
Message aggregation and buffering are the standard techniques in parallel computing for alleviating the burden of network communication overhead. Therefore, the DiVinE tool maintains buffers of messages to be sent to the individual network-attached computing nodes. In the first implementation of DiVinE, a buffer was flushed (its messages were sent to the network) in any of the following situations: 1) the buffer was explicitly flushed by the executed graph algorithm, 2) the maximal number of messages that fit into the buffer had been reached, 3) the local computing node was (otherwise) idle, and 4) the messages in the buffer were too old.
Thorough experimental evaluation, however, showed that the fourth condition is completely ineffective in terms of network flow, while checking it is quite expensive. After the fourth flushing rule was dropped, DiVinE gained significantly in performance.
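The buffering policy that remains after dropping the age-based rule can be sketched as follows. This is a toy Python model, not DiVinE's implementation: the class `SendBuffers`, its methods, and the `send` callback standing in for the network layer are hypothetical names introduced here.

```python
class SendBuffers:
    """Per-destination message buffers with three flush rules."""

    def __init__(self, num_nodes, capacity, send):
        self.capacity = capacity
        # `send(dest, messages)` is a hypothetical stand-in for the
        # actual network call (e.g. a single aggregated message).
        self.send = send
        self.buffers = [[] for _ in range(num_nodes)]

    def push(self, dest, message):
        buffer = self.buffers[dest]
        buffer.append(message)
        if len(buffer) >= self.capacity:   # rule 2: buffer is full
            self.flush(dest)

    def flush(self, dest):                 # rule 1: explicit flush
        if self.buffers[dest]:
            self.send(dest, self.buffers[dest])
            self.buffers[dest] = []

    def on_idle(self):                     # rule 3: node is otherwise idle
        # This toy version flushes every buffer at once; a tuned
        # implementation would probably stagger these flushes.
        for dest in range(len(self.buffers)):
            self.flush(dest)

sent = []
buffers = SendBuffers(num_nodes=2, capacity=2,
                      send=lambda dest, msgs: sent.append((dest, list(msgs))))
buffers.push(0, "s1")
buffers.push(0, "s2")   # capacity reached, buffer 0 is sent
buffers.push(1, "s3")
buffers.on_idle()       # remaining messages are sent
print(sent)
```

Note that no timestamp is stored per message, so nothing has to be checked on each push beyond the buffer length; this is the cost that the dropped fourth rule would have added.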
There were other distributed-memory performance bugs in earlier versions of DiVinE, for example uncontrolled polling of incoming network messages, massive flushing of all buffers at the same time, and insufficient separation of the initialization and computation phases. For more details see [82]. The cumulative effect of eliminating these bugs from DiVinE is shown in Figure 2.
Parallelism in Shared-Memory
Most techniques and results known from the distributed-memory setting are straightforwardly applicable to shared-memory architectures. The DiVinE architecture follows this observation, which means that if DiVinE is executed on a multi-core machine with shared memory, it mimics distributed-memory behavior. In particular, the graph to be processed is partitioned among the individual parallel shared-memory threads in the same way as it would be in the distributed-memory setting. Each thread maintains its own hash table and its own pool of vertices to be processed. Vertices belonging to other threads are pushed to the pools of their owners by means of lock-free shared-memory queues [10]. The relative advantages and disadvantages of shared versus private hash tables within the context of thread-private pools of vertices to be processed have been discussed in [21]. These approaches were evaluated, both theoretically and practically, in a prototype implementation [11].
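The partitioned work-flow with private hash tables and private pools can be simulated in a few lines of Python. The sketch below is single-threaded and deliberately simplified: workers are served round-robin instead of running in parallel, plain deques stand in for the lock-free queues, and the names `partitioned_reachability` and `owner` are hypothetical. It also counts how many edges cross partitions, i.e. how many hand-offs (or network messages, in the distributed-memory setting) the scheme would generate.

```python
from collections import deque

def partitioned_reachability(succ, init, num_workers, owner):
    """Explore the graph with one visited set and one pool per worker."""
    visited = [set() for _ in range(num_workers)]  # private hash tables
    pools = [deque() for _ in range(num_workers)]  # private vertex pools
    cross_edges = 0
    pools[owner(init)].append(init)
    while any(pools):                   # some pool is still non-empty
        for worker in range(num_workers):
            while pools[worker]:
                vertex = pools[worker].popleft()
                if vertex in visited[worker]:
                    continue            # duplicate, already processed
                visited[worker].add(vertex)
                for target in succ.get(vertex, ()):
                    dest = owner(target)
                    if dest != worker:
                        cross_edges += 1    # hand-off to another worker
                    pools[dest].append(target)
    return visited, cross_edges

graph = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
visited, cross_edges = partitioned_reachability(
    graph, init=0, num_workers=2, owner=lambda v: v % 2)
print(visited)       # each worker only ever stores its own vertices
print(cross_edges)   # edges whose endpoints have different owners
```

Because duplicate detection happens only at the owner, no worker ever reads another worker's hash table; the price is one hand-off per cross-partition edge.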
Nevertheless, the scalability of parallel distributed-memory solutions on shared-memory architectures is often limited. Therefore, shared-memory specific techniques are needed to improve the efficiency and scalability of existing parallel distributed-memory solutions on shared-memory architectures. Examples of successful shared-memory specific techniques include shared communication data structures [60, 10], specific termination detection techniques [10], dual-core algorithms [58], and a quite unique partitioning scheme [57]. As for DiVinE, it seems that the design choice of having thread-private pools of vertices to be processed was not the best one [71]. However, experimental confirmation has yet to be done.
Employing External Memory
Efficient algorithmic usage of computing devices with memory hierarchies is an established research topic [75]. Numerous algorithms were devised to efficiently utilize external-memory block devices, such as disks. The efficiency of such an algorithm is typically measured in the number of I/O (input/output) operations. To that end, I/O efficient complexity has been defined [1] and the standard breadth-first graph traversal algorithms have been adapted to the I/O setting. The crucial technique used to do so is the so-called delayed duplicate detection [33], which has been further improved in [83, 3, 49] and specialized for undirected graphs [68, 69]. Regarding formal verification, graph traversal algorithms are used for state space generation and verification of safety properties; see, e.g., the disk extension of the verification tool Murϕ [80, 78].
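The principle of delayed duplicate detection can be shown with an in-memory Python sketch. Sorted lists stand in for the sorted files an external-memory implementation would keep on disk, and each breadth-first layer is checked against the visited set in one sequential merge pass instead of one random lookup per generated state. The function names and the graph encoding are assumptions for illustration, and states are assumed to be totally ordered.

```python
import heapq

def subtract_sorted(candidates, visited):
    """Return the items of sorted `candidates` not in sorted `visited`,
    using a single linear scan over both lists."""
    result, i = [], 0
    for candidate in candidates:
        while i < len(visited) and visited[i] < candidate:
            i += 1
        if i == len(visited) or visited[i] != candidate:
            result.append(candidate)
    return result

def bfs_delayed_duplicate_detection(succ, init):
    """Breadth-first search that defers duplicate checks to layer ends."""
    visited = [init]        # sorted list, stands in for the on-disk set
    frontier = [init]
    layers = [[init]]
    while frontier:
        # Expansion: collect all successors with no duplicate check.
        candidates = [t for v in frontier for t in succ.get(v, ())]
        # Delayed detection: sort, drop duplicates within the layer,
        # then remove already visited states in one merge pass.
        candidates = sorted(set(candidates))
        frontier = subtract_sorted(candidates, visited)
        visited = list(heapq.merge(visited, frontier))
        if frontier:
            layers.append(frontier)
    return layers

graph = {0: [1, 2], 1: [3], 2: [3], 3: [0]}
print(bfs_delayed_duplicate_detection(graph, 0))
```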
As for problems beyond state space generation, breadth-first search graph traversal algorithms are unsuitable. Therefore, the first approach to LTL model checking with an external memory device employed a generic reduction of the LTL model checking problem to the reachability problem [23]. Unfortunately, such a reduction results in quadratic growth of the memory demands, which effectively rules out its application to large-scale industrial cases. Therefore, "incomplete" verification approaches dominated the research field for some time. These included random walk strategies [64], iterative deepening and A* algorithms [61, 62], and breadth-first search based approaches with a limited amount of stored information [72].
The I/O branch of DiVinE was started with the invention of a new I/O algorithmic technique that efficiently avoids the quadratic space overhead [19]. The new approach was further improved by the introduction of so-called merge omissions [20], which allowed for more efficient delayed duplicate detection in the later stages of the computation. Various formulas for controlling what should be omitted were introduced [45]; however, they were not implemented within the I/O branch of DiVinE. A completely different technique for trading time for space, employing perfect hashing, has been implemented in the I/O branch of DiVinE. This technique is referred to as the semi-external approach to the LTL model checking problem [40].
Many-Core Parallelism
After NVIDIA's CUDA technology [38] was introduced, many computationally demanding tasks were accelerated by GPU-aware algorithms. Examples of GPU-accelerated procedures include, but are not limited to, sorting [48], reduce operations [52], and numerous biological and physical simulations, such as protein folding [63]. As for graph theory, successful adaptations of general graph traversal algorithms have been reported too [50, 51], demonstrating the tremendous computational power of CUDA devices. On the other hand, graphs to be explored efficiently with a CUDA-accelerated algorithm must be encoded explicitly in a compact way.
CUDA technology as a computing platform has also attracted researchers in the field of automated formal verification. The key challenge, for which no satisfactory solution is known yet, is how to accelerate the generation of an explicitly encoded state space graph from its implicit definition. Preliminary attempts to do so relate to explicit model checking. They suggest a massively parallel check for enabled transitions emanating from the vertices on the frontier of the search, followed by massively parallel execution of those transitions [41, 42].
Once the state space is generated and explicitly represented in an appropriate sparse-matrix-like structure, many verification tasks can be accelerated using CUDA technology. This has been successfully demonstrated, e.g., on the verification of probabilistic systems [25] and on LTL model checking [17]. The latest developments in the DiVinE CUDA tool [16] allow for efficient utilization of multiple CUDA devices [5] and for acceleration of the detection of strongly connected components [6].
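One common compact encoding of this kind is the compressed sparse row (CSR) layout, sketched below in Python. The text does not state which layout the DiVinE CUDA tool uses, so CSR is an assumption here, as are the function names `to_csr` and `successors`. The point of the layout is that the whole graph becomes two flat integer arrays, which can be copied to a device in bulk and indexed without pointers.

```python
def to_csr(num_vertices, edges):
    """Encode a directed graph as two flat arrays (CSR layout).

    row_offsets[u] .. row_offsets[u + 1] delimits the slice of
    `columns` that holds the successors of vertex u.
    """
    row_offsets = [0] * (num_vertices + 1)
    for source, _ in edges:
        row_offsets[source + 1] += 1        # out-degree of each vertex
    for i in range(num_vertices):
        row_offsets[i + 1] += row_offsets[i]    # prefix sums
    columns = [0] * len(edges)
    cursor = row_offsets[:-1]               # next free slot per vertex (a copy)
    for source, target in edges:
        columns[cursor[source]] = target
        cursor[source] += 1
    return row_offsets, columns

def successors(row_offsets, columns, vertex):
    """Successors of `vertex`, read from the flat arrays."""
    return columns[row_offsets[vertex]:row_offsets[vertex + 1]]

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
row_offsets, columns = to_csr(4, edges)
print(row_offsets, columns)
print(successors(row_offsets, columns, 0))
```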
Summary
Platform-dependent verification is an alternative way to make automated formal verification attractive for industry. Despite significant progress in the development of various specific techniques and tools on the algorithmic level, mainly for parallel architectures, there is still a gap between pseudo-code and implementation. Implementations must be tuned for specific platforms; memory access patterns, for example, seem to play a crucial role. In platform-dependent verification we should learn to appreciate engineering solutions.
