Abstract-Existing algorithms for I/O Linear Temporal Logic (LTL) model checking usually output a single counterexample for a system which violates the property. However, in real-world applications, such as diagnosis and debugging in software and hardware system designs, people often need to have a set of counterexamples or even all counterexamples. For this purpose, we propose an I/O efficient approach for detecting all accepting cycles, called Detecting All Accepting Cycles (DAAC), where the properties to be verified are in LTL. Different from other algorithms for finding all cycles, DAAC first searches for the accepting strongly connected components (ASCCs), and then finds all accepting cycles of every ASCC, which avoids exploring the many paths that cannot be extended to accepting cycles. In order to further lower DAAC's I/O complexity and improve its performance, we propose an intersection computation technique and a dynamic path management technique, and exploit a minimal perfect hash function (MPHF). We carry out both complexity and experimental comparisons with the state-of-the-art algorithms including Detect Accepting Cycle (DAC), Maximal Accepting Predecessors (MAP) and Iterative-Deepening Depth-First Search (IDDFS). The comparative results show that our approach is better on the whole in terms of I/O complexity and practical performance, despite the fact that it finds all counterexamples.
INTRODUCTION
MODEL checking has become one of the most attractive and important approaches to verification for software and hardware systems. The automata-theoretic approach, as one of the most important model checking techniques, translates the LTL model checking problem into the detection of reachable accepting cycles in a directed graph [1]. However, with the increasing scale and complexity of software and hardware systems, the directed graph tends to be extremely large and induces the state explosion problem. One of the most effective approaches to this problem is to exploit external memory devices (disks), which is called the external-memory approach, and has been extensively studied [2], [3], [4], [5], [6], [7], [8], [9], [10], [11].
Compared to internal memory, external memory devices (disks) can provide much larger space. Moreover, in the past few years, there have been enormous increases in the capacity of magnetic disks, with little increase in their cost, resulting in dramatic reductions in the cost per byte. Magnetic disks are about two and a half orders of magnitude cheaper than semiconductor memory [12]. These two facts together suggest the idea of using external memory in model checking large-scale systems.
Previous I/O LTL model checking approaches are usually designed to find one counterexample, as one counterexample is sufficient to judge the invalidity of the model [2] , [3] , [4] . However, in real-world applications, such as diagnosis and debugging in system designs, people often need to have a set of counterexamples or even all counterexamples.
To demonstrate that all (or a set of) counterexamples can be helpful in locating software bugs, consider the following example. Suppose a software engineer wrote a program to compute $x^2 + 6x + 9$, where $x$ is an integer from $-100$ to $+100$, and his code is as follows:
scanf("%d", &x);
y = x * x;
y = y + 6 * x;
y = y - 9;

Apparently, he mistakenly typed $+9$ as $-9$. When debugging the program, he finds three counterexamples, among them $s_{34}$ $(x = 1, y = -2)$. Now he wants to locate the wrong code line by these three counterexamples. Clearly, one counterexample alone may not be enough to complete the work. However, when considering all three counterexamples, he may find that the difference between the correct function value and the output of the program is the constant 18 in all three. Thus it is most likely that the line containing a constant addend is written mistakenly, and he concludes that the fourth line is most likely the wrong code line. In this case, the three counterexamples together provide more useful information than any one of them does.
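The arithmetic behind this example can be checked with a short sketch. The function names `intended` and `buggy` are hypothetical and only mirror the four code lines above; the sketch simply confirms that the gap between the intended value and the program output is the same constant for every input in the stated range.

```python
def intended(x):
    # The function the engineer meant to compute: x^2 + 6x + 9.
    return x * x + 6 * x + 9


def buggy(x):
    # Mirrors the program above, including the mistyped "- 9".
    y = x * x
    y = y + 6 * x
    y = y - 9
    return y


# Collect the distinct differences over the whole input range.
differences = {intended(x) - buggy(x) for x in range(-100, 101)}
print(differences)
```

A single differing value says little, whereas a difference that stays constant across several inputs points at the line that adds a constant.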
Indeed, a lot of work about applications of a set of counterexamples in diagnosis and debugging of software and hardware systems has been carried out [13], [14], [15], [16], [17]. Leue and Befrouei developed an automated method for explaining counterexamples indicating the occurrence of deadlocks in concurrent systems [13]. The method is based on an analysis of a set of counterexamples that can be generated by a model checking tool. The set of counterexamples is referred to as the bad dataset. With the aid of the model checking tool, the method also gets a set of execution traces that do not violate the property, which is referred to as the good dataset. By examining the differences between the good and bad datasets, it extracts a number of ordered sequences of actions, which point to the root causes of the deadlock occurrence in the model. Fady Copty et al. presented a diagnosis and rectification solution based on a set of counterexamples [14]. The usage flow of the solution consists of running a model checker, dumping all the model checking data needed to compute all the counterexamples of a given length, and then debugging in an interactive environment by loading the pre-dumped model checking data. The efficiency of their approach was evaluated through case studies on Intel's actual design [14].
Thus it is very important to develop efficient approaches for locating root bugs (or root causes of bugs) by all or sufficiently many counterexamples. Therefore, to find all or sufficiently many accepting cycles and to efficiently perform debugging in software and hardware system designs by these accepting cycles are two closely related and significant problems.
On the other hand, for large models, though it could take several minutes for a model checker to find a counterexample, a user may find it convenient to get all (or a set of) counterexamples, and then fix them. This paper focuses on investigating an external memory approach for finding all accepting cycles of large-scale systems. We outline the main technical contributions of this paper as follows.
1) To the best of our knowledge, we are the first to propose an I/O efficient approach that detects all accepting cycles of large-scale systems, called detect all accepting cycles (DAAC). DAAC can be very helpful for debugging in software and hardware designs, as the information contained in the found counterexamples can be better used to identify root bugs accurately.
2) This paper provides a general framework for finding all counterexamples of a large-scale system, which is different from previous algorithms [18], [19], [20], [21], [22], [23]. Previous algorithms search for all elementary cycles directly from the directed graph. In our framework (see Section 4), in order to find all accepting cycles, DAAC traverses the accepting state set. For each accepting state, DAAC first searches for the ASCC containing the accepting state, and then finds accepting cycles of the ASCC. When all accepting states are traversed, all accepting cycles of the system are found. The framework has the advantage that it avoids searching for a great many ineffective paths, where an ineffective path is a path that cannot be extended to an accepting cycle. This framework contributes substantially to the good performance of our approach, as shown in Section 4.
3) To further lower the I/O complexity and improve the performance of our approach, we propose a technique for computing the intersection of two state sets and a technique for dynamic path management (DPM). The intersection computation technique has linear worst-case I/O complexity and thus can improve the performance of the algorithm that searches for ASCCs. The dynamic path management technique can reduce the memory dithering phenomenon during search, and thus can improve the performance of the algorithm that finds all accepting cycles of an ASCC. Memory dithering in this paper refers to the phenomenon that states are frequently moved into and out of the internal memory, which may significantly increase the number of I/O operations and thus reduce the efficiency of an algorithm (see Section 4.3 for more details).
To demonstrate the effectiveness of our approach, we compare our approach with Detect Accepting Cycle (DAC) [2], Maximal Accepting Predecessors (MAP) [3], [5] and Iterative-Deepening Depth-First Search (IDDFS) [4], in terms of both I/O complexity and practical performance. DAC, MAP and IDDFS are I/O efficient LTL model checking algorithms that achieve state-of-the-art performance and represent the most recent advances. The comparative results show that our approach is better on the whole in terms of I/O complexity and practical performance, despite the fact that it finds all counterexamples.
This paper is organized as follows. We discuss related work in the next section, and introduce some preliminary notions in Section 3. Section 4 proposes an I/O efficient approach for detection of all accepting cycles, followed by a correctness proof in Section 5. In Sections 6 and 7, we compare our approach with DAC, MAP and IDDFS, in terms of both I/O complexity and performance. Section 8 discusses the parameter of dynamic path management and on-the-fly approaches that directly search for all (or a set of) accepting cycles from the graph, and explores two variants of DAAC. We conclude this paper in Section 9.
RELATED WORK
In this section, we introduce existing algorithms for I/O LTL Model Checking and those for finding all elementary cycles.
I/O LTL Model Checking
The first I/O-efficient algorithm for LTL model checking was proposed by Edelkamp and Jabbar [6]. The algorithm reduces the accepting cycle detection problem to the reachability problem [24], which is designed for symbolic exploration with BDDs. Afterwards, many I/O efficient algorithms for solving the LTL model checking problem were proposed [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. To the best of our knowledge, among algorithms of this kind, DAC [2], MAP [3], [5] and IDDFS [4] achieve state-of-the-art performance and represent the most recent advances.
The algorithm DAC [2] adapts an existing non-DFS-based accepting cycle detection algorithm, One Way Catch Them Young (OWCTY) [25], to the I/O efficient setting. The algorithm first inserts all reachable vertices into an approximation set. After that, it repeatedly reduces the approximation set until a fixpoint is reached. In detail, vertices violating the condition are gradually removed from the approximation set using two procedures. One procedure removes from the approximation set those vertices that lie outside any cycle. The other removes vertices lying on non-accepting cycles. Finally, if the approximation set is empty, there is no accepting cycle in the graph; otherwise the presence of an accepting cycle is ensured. The algorithm is especially useful for verification of large systems with valid properties. Nevertheless, it needs to create the whole state space.
Since DAC does not work in an on-the-fly way, Barnat et al., the authors of DAC, further proposed an on-the-fly algorithm, the MAP algorithm [3], [5], which is a revisiting resistant algorithm for I/O efficient LTL model checking. Revisiting resistant graph algorithms are those that can tolerate reexploration of edges without yielding incorrect results. The main idea behind the MAP algorithm is based on the fact that each accepting vertex lying on an accepting cycle is its own accepting predecessor (an accepting predecessor of a state t is an accepting state from which t is reachable). Instead of the expensive computation and storage of all accepting predecessors for each (accepting) vertex, the algorithm computes and stores a single representative accepting predecessor for each vertex, namely the maximal one in a suitable ordering of vertices. Experiments showed that the algorithm outperformed previous I/O efficient algorithms on invalid LTL properties.
IDDFS [4] is a 5-bit semi-external LTL model checking algorithm proposed by Edelkamp et al. Semi-external graph algorithms are algorithms in which the vertices but not the edges fit in memory [26]. IDDFS uses the heuristic EPH to construct a minimal perfect hash function (MPHF) from the vertex set stored on disk, which allows compressing $V$ to $5|V|$ bits, so that only these $5|V|$ bits, and not $V$ itself, need to be stored in internal memory, where $V$ is the vertex set of a graph and $|V|$ is the size of $V$. Thus the algorithm can handle state spaces that are orders of magnitude larger than internal memory. However, IDDFS still has a limitation on the size of the graph because it needs 5 bits of internal memory for every vertex. This algorithm works on-the-fly by applying an iterative-deepening strategy.
The main difference between our approach and the I/O LTL model checking approaches above is as follows: our approach is designed to detect all accepting cycles, whereas the latter are designed to detect a single accepting cycle and can therefore require that no state be visited more than once. Because our approach needs to find all accepting cycles, a state may lie on several accepting cycles and so may be visited repeatedly, which significantly increases the search time. Our approach therefore has to meet a higher demand on efficiency.
Algorithms for Finding All Elementary Cycles
The algorithm elementary circuit (EC) proposed by Tiernan [18] is one of the earliest algorithms for finding all elementary cycles in a directed graph. The algorithm finds all elementary cycles by a backtracking procedure which explores elementary paths of the graph and checks whether they are elementary cycles. The vertices of the graph are numbered from 1 to n, where n is the number of vertices in the graph. All elementary paths begin from different vertices. The first path starts from vertex 1. A path is extended from its end, one arc at a time. In the extension process, the index of the extended vertex must be larger than that of the first vertex of the path. If the index of the extended vertex is equal to that of the first vertex of the path, then the path is an elementary circuit (i.e., cycle). The algorithm enumerates each elementary cycle exactly once, because each cycle contains a unique vertex with smallest index. The time complexity of this algorithm is exponential in the size of the graph.
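The backtracking idea can be illustrated with a minimal in-memory sketch. It is a recursive simplification written for this discussion, not Tiernan's original formulation: the function name `elementary_cycles` and the dictionary-based graph encoding are assumptions, and a path is extended only with vertices whose index exceeds that of the path's start vertex, so that each cycle is reported once, rooted at its smallest vertex.

```python
def elementary_cycles(succ):
    """Enumerate elementary cycles of a directed graph.

    succ maps an integer vertex to an iterable of successor vertices
    (hypothetical encoding chosen for this sketch).
    """
    cycles = []

    def extend(path, on_path):
        start, last = path[0], path[-1]
        for v in sorted(succ.get(last, ())):
            if v == start:
                # The path closes on its start vertex: report a cycle.
                cycles.append(list(path))
            elif v > start and v not in on_path:
                # Only larger, unused vertices may extend the path.
                path.append(v)
                on_path.add(v)
                extend(path, on_path)
                path.pop()
                on_path.remove(v)

    for s in sorted(succ):
        extend([s], {s})
    return cycles


example = {1: [2, 3], 2: [1, 3], 3: [1]}
print(elementary_cycles(example))
```

Each cycle is listed as the sequence of vertices starting from its smallest one; the closing arc back to the start is implicit.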
Tarjan [19] improved the EC algorithm and gave a bound of $O(n \cdot e(c + 1))$ for the running time of the algorithm on an entire graph in the worst case, where $n$ is the number of vertices, $e$ is the number of edges, and $c$ is the number of elementary circuits in the graph. The idea of this algorithm is as follows. For each vertex s, the algorithm generates an elementary path p which starts at s and contains no vertex whose index is smaller than that of s. The elementary path p is stored in a point stack. A vertex v is marked if it is on the elementary path p or if every path leading from v to s intersects p at a point other than s. A vertex v becomes unmarked if the algorithm finds a path from v to s which does not intersect p other than at s. Once a vertex v has been used in a path, it can only be used in a new path when it has been deleted from the point stack and when it becomes unmarked. Whenever the last vertex on p is adjacent to the start vertex s, the elementary path p is an elementary circuit. A similar algorithm with the same bound was given by Ehrenfeucht and Osterweil [20].
A more efficient algorithm with the time bound $O((n + e)(c + 1))$ was proposed by Johnson [21]. The algorithm has two important strategies. The first is that, to avoid duplicating circuits, a vertex v must be blocked when it is added to some elementary path beginning in s, and stays blocked as long as every path from v to s intersects the current elementary path at a vertex other than s; the other is that a vertex does not become a root vertex for constructing elementary paths unless it is the least vertex in at least one elementary circuit. These two strategies eliminate much of the fruitless searching of previous algorithms, which results in the lower time bound.
There are some other algorithms for finding all the elementary circuits. For example, Weinblatt gave an algorithm similar to Tiernan's, but it has relatively large storage requirements [22] . Also based on the edge weight, Yamada and Kinoshita developed a heuristic algorithm to list all negative cycles [23] .
The main difference between the algorithms above and our approach is as follows. Firstly, the algorithms above search for all elementary cycles directly from the directed graph, while our approach first computes accepting strongly connected components (ASCCs), and then searches for elementary accepting cycles of every ASCC, which avoids searching for a great many ineffective paths. Secondly, the algorithms above are based on internal memory, but our approach stores the whole state space and hash tables in external memory, which allows it to handle larger-scale instances of the problem of finding all elementary cycles.
PRELIMINARIES
In this section, for convenience, we give a brief introduction of some necessary notions about graphs, I/O complexity model, and minimal perfect hash function used in this paper. Please refer to [27] , [28] , [29] , [30] for more details.
Related Definitions of Graphs
In this paper, we treat an automaton as a directed graph that has an initial state $s_0$ and an accepting state set $F$. Sometimes we use the term state for node and transition for edge.
A directed graph $G$ is a pair $(V, E)$ where $V$ is a set of nodes (or vertices), and $E \subseteq V \times V$ is a set of edges. A node $u$ is said to be reachable from $v$ in $(V, E)$ if and only if there is a sequence of nodes
A cycle is a path in which the first and last nodes are identical. The graphs in our definitions contain no loops (edges of the form $(v, v)$) and no multiple edges between the same nodes; namely, if edge $(u, v) \in E$, then $u \neq v$ and $(u, v)$ is the unique directed arc from $u$ to $v$. An accepting cycle is a cycle going through some accepting state. Previous I/O LTL model checking algorithms [2], [3], [4] work by finding an accepting cycle as a counterexample. In this paper, we aim at finding all accepting cycles.
Given a graph $(V, E)$, a strongly connected component (SCC) is a maximal subgraph (V
An accepting strongly connected component of the automaton is defined as an SCC containing some accepting state $s \in F$. Given a state $s$, according to the definition of ASCC, we know that if $G_1$ and $G_2$ are two ASCCs containing state $s$, then $G_1$ and $G_2$ are identical. We use $ASCC(s)$ to denote the ASCC containing state $s$.
I/O Complexity Model
As disk access is orders of magnitude slower than internal memory access [31], complexities of external memory algorithms are usually measured in terms of the number of I/O operations. For complexity analysis of external memory algorithms, a widely used model is the one by Aggarwal and Vitter [28]. In the model, the complexity of an I/O algorithm is further parameterized by $N$, $M$, and $B$, where $N$ denotes the total number of items of a system stored on disk, $M$ denotes the number of items that fit into the internal memory, and $B$ denotes the number of items that can be transferred in a single I/O operation. Thus, the number of I/O operations for scanning $N$ nodes on disk is $O(scan(N)) = O(N/B)$. As another important operation, sorting $N$ nodes on disk has an I/O complexity of $O(sort(N)) = O((N/B) \cdot \log_{M/B}(N/B))$.
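To give a feel for these two bounds, the following sketch evaluates them for concrete parameters. The function names `scan_ios` and `sort_ios` are hypothetical, constant factors hidden by the $O$-notation are ignored, and the logarithm is rounded up to a whole number of merge passes, so the results are rough estimates rather than exact I/O counts.

```python
import math


def scan_ios(n, b):
    # Blocks read when scanning n items with b items per I/O operation.
    return math.ceil(n / b)


def sort_ios(n, m, b):
    # (N/B) * log_{M/B}(N/B), with the number of passes rounded up.
    # Assumes m > b so that the merge fan-out m / b exceeds 1.
    blocks = math.ceil(n / b)
    if blocks <= 1:
        return blocks
    passes = max(1, math.ceil(math.log(blocks, m / b)))
    return blocks * passes


# Example: one million items, 100,000 fit in memory, 1,000 per block.
print(scan_ios(10**6, 10**3), sort_ios(10**6, 10**5, 10**3))
```

With these example parameters, sorting costs about two scans of the data, which is why algorithms in this model try to express their work as a small number of scans and sorts.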
In this paper, we adopt this complexity model to analyze the I/O complexity of our approach.
The Minimal Perfect Hash Function
A minimal perfect hash function is a one-to-one correspondence between a static set $Q$ and $\{0, \ldots, |Q| - 1\}$, which requires that all elements in $Q$ are mutually distinct. An MPHF has many applications, as other hash functions do, but with the advantage that no collision resolution has to be implemented. An asymptotically space-optimal, practical algorithm for generating MPHFs was designed in [29], which has I/O complexity $O(sort(|Q|))$. The external memory variant, referred to as external perfect hashing (EPH), was also given in [30].
In this paper, we use the algorithm EPH to construct a minimal perfect hash function. From references [4] and [30], we know that an MPHF constructed by EPH can be stored in less than 4 bits of internal memory per state. Thus our approach can handle systems with more than $2 \cdot 2^{30} \cdot 8 / 4 = 2^{32}$ states on a computer with 2 GB RAM.
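The two properties used here, a collision-free mapping onto a dense index range and the resulting capacity estimate, can be illustrated as follows. The helper `toy_mphf` is only a stand-in built by sorting the state set in memory (indices run from 0 to $|Q| - 1$); it is not EPH and does not achieve the space bound quoted above. The helper `max_states` is likewise a hypothetical name for the capacity arithmetic.

```python
def toy_mphf(states):
    # Stand-in for an MPHF: rank of each state in sorted order.
    # Unlike EPH, this keeps the whole state set in memory.
    ordered = sorted(set(states))
    return {state: index for index, state in enumerate(ordered)}


def max_states(ram_bytes, bits_per_state):
    # Number of states whose hash information fits in the given RAM.
    return (ram_bytes * 8) // bits_per_state


h = toy_mphf(["s2", "s0", "s1"])
print(sorted(h.values()))            # dense indices, no collisions
print(max_states(2 * 2**30, 4))      # capacity with 2 GB and 4 bits/state
```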
I/O EFFICIENT DETECTION OF ALL ACCEPTING CYCLES
Each counterexample can be represented as a pair of an accepting cycle and a path from the initial state to the cycle. When finding an accepting cycle, we easily find a path from the initial state to the cycle by using DFS. We say two counterexamples are equivalent if they have the same accepting cycles. Thus, the problem of detecting all (or a set of) counterexamples may be reduced to that of detecting all (or a set of) accepting cycles. In this section, we propose an I/O efficient approach for detecting all accepting cycles, namely DAAC.
Framework of DAAC
The basic framework of DAAC is as follows. To find all accepting cycles, DAAC proceeds by traversing the accepting state set. For each accepting state, DAAC first searches for the ASCC containing the accepting state by using the algorithm SFA (Search For ASCC), and then finds accepting cycles of the ASCC by the algorithm Find Accepting Cycles of the ASCC (FACA). When all accepting states are traversed, all accepting cycles of the system are found. The above framework is based on the observation that given a graph with an accepting state set, all accepting cycles going through the same accepting state must be in the same ASCC, and all ASCCs are disjoint. Thus, finding all accepting cycles of a system amounts to finding accepting cycles in each ASCC.
According to the definition of ASCC, each state in an ASCC must be on some accepting cycle. Thus, compared to the naive method which directly finds accepting cycles from the graph, this framework enables our approach to avoid searching for a great many ineffective paths. For example, consider Fig. 1 and suppose the graph has $2n + 4$ states and only one accepting state $s$. Obviously, the left subgraph is one accepting cycle, and the right subgraph contains the following $n^2$ cycles (not accepting cycles):
Note that to find all accepting cycles, every state may be visited repeatedly. In our approach, we first compute the ASCC by traversing the whole state space using BFS and MPHF, which needs to execute $2(2n + 4) = 4n + 8$ steps. Then, we find the accepting cycle with three steps. Thus our approach needs $4n + 11$ steps in total. Previous approaches, such as Johnson's algorithm [21], search directly from the graph to find all accepting cycles, traversing not only the whole state space but also all possible paths, which needs to execute
In order to further lower the I/O complexity and improve the performance of DAAC, we propose an intersection computation technique and a dynamic path management technique, and exploit a minimal perfect hash function. The intersection computation technique lowers the I/O complexity and improves the performance of the algorithm SFA. The dynamic path management technique reduces memory dithering and so improves the performance of the algorithm FACA. With MPHF, our algorithm requires only one I/O operation to locate a record in a hash table on disk according to a hash value, which significantly improves the efficiency of DAAC.
In the following, we describe three main algorithms in DAAC: the top-level algorithm, SFA, and FACA. Note that all state tables are stored on disk and have the same data structure, consisting of one field "state"; all hash tables are stored on disk and have the same data structure, consisting of two fields: the hash field and the tag field. Initially every hash table has $N$ different records, in which values of the hash field range from 1 to $N$ and all values of the tag field are 0, and every hash table is sorted by the hash field, where $N$ is the number of states of the system. In fact, the values of the hash field of a hash table correspond to the hash values of the $N$ states of the system.
DAAC Algorithm: Top Level
We now describe the DAAC algorithm (see Algorithm 1). DAAC first enumerates all reachable states of the automaton using external breadth first search (BFS) by invoking the function enumerateBFS(), and then constructs an MPHF by using the algorithm EPH (constructMPHF()). After that, the algorithm finds all accepting cycles of the system according to the proposed framework above. To avoid repeated computation and speed up DAAC, after finding the ASCC containing an accepting state, DAAC deletes all accepting states contained in the ASCC from $F$ by a for loop. The function clear() resets a hash table by setting all tags to 0. The function gettag(tableH, hash(t)) gets the tag value corresponding to hash(t) from the hash table tableH.
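The control flow of the top level can be sketched in memory as follows. This is not Algorithm 1 itself: state enumeration and MPHF construction are omitted, sets replace the on-disk tables, and the parameters `sfa` and `faca` are hypothetical callables standing in for the two sub-algorithms (one returns the set of states of the ASCC of a given accepting state, the other returns the accepting cycles of that ASCC).

```python
def daac_top_level(accepting, sfa, faca):
    """Visit each ASCC once and collect its accepting cycles."""
    remaining = set(accepting)
    cycles = []
    while remaining:
        s = min(remaining)           # any remaining accepting state works
        ascc = sfa(s)                # states of the ASCC containing s
        cycles.extend(faca(ascc))    # accepting cycles inside that ASCC
        # Accepting states already covered by this ASCC need no second
        # search, so they are removed from the work set.
        remaining -= set(ascc)
        remaining.discard(s)
    return cycles
```

A usage example with stub sub-algorithms: if states 1 and 2 are accepting and share an ASCC {1, 2, 3}, while accepting state 5 lies on no cycle, then `sfa` is called twice rather than three times.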
The algorithm enumerateBFS() exploits an external BFS technique presented in Section 3.1 of [7], which uses external memory (disk) and delayed duplicate detection (DDD) [7], [32]. External memory is used to avoid the limitation that internal memory imposes on the algorithm, and DDD greatly improves the algorithm's efficiency. DDD is based on the observation that when a BFS is used to enumerate the space, it has to maintain a set of visited states to prevent their re-exploration. Since the graphs are large for large systems, the set of visited states cannot be completely kept in the main memory and should be stored on the external memory device. A newly generated state does not need to be checked against the state set immediately; one can postpone the checking until an entire level of the BFS has been explored and then check all states in the level together by linearly reading the set from the disk. Please refer to [7], [32] for more details.
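The level-wise structure of delayed duplicate detection can be sketched as follows. This is an in-memory illustration under stated assumptions, not the external algorithm of [7]: the function name `bfs_ddd` is hypothetical, Python lists stand in for files on disk, and the single pass over `visited` per level stands in for the linear read of the visited set.

```python
def bfs_ddd(start, succ):
    """Breadth-first enumeration with duplicate checks delayed per level."""
    visited = [start]        # stands in for the visited set kept on disk
    level = [start]
    levels = [[start]]
    while level:
        # Generate successors of the whole level without any lookup.
        candidates = []
        for u in level:
            candidates.extend(succ.get(u, ()))
        # Delayed step 1: sort the level and drop duplicates within it.
        candidates = sorted(set(candidates))
        # Delayed step 2: one pass over the visited set removes old states.
        seen = set(visited)
        level = [v for v in candidates if v not in seen]
        if level:
            levels.append(level)
            visited = sorted(visited + level)
    return levels


graph = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: []}
print(bfs_ddd(0, graph))
```

The point of the design is that duplicate checking costs one sort and one scan per level instead of one random disk access per generated state.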
Search for ASCC
Given an accepting state, the algorithm SFA is to compute the ASCC containing the accepting state. The underlying idea is based on the observation that every state in this ASCC is reachable from this accepting state, and this accepting state is also reachable from every state in this ASCC. Thus we first compute the forward reachable state set and backward reachable state set from this accepting state, and then compute their intersection which is just the ASCC containing this accepting state.
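The idea of intersecting the forward and backward reachable sets can be sketched in memory as follows. The names `reachable` and `ascc_of` are hypothetical, Python sets replace the on-disk hash tables, and the predecessor relation is built explicitly from the successor relation; the sketch shows only the underlying idea, not the I/O efficient algorithm.

```python
from collections import deque


def reachable(start, edges):
    # Plain BFS over the given edge relation (dict: node -> successors).
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def ascc_of(s, succ):
    # Build the backward relation, then intersect both reachable sets.
    pred = {}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, []).append(u)
    return reachable(s, succ) & reachable(s, pred)


graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
print(ascc_of(1, graph), ascc_of(4, graph))
```

In the example, states 4 and 5 are reachable from state 1 but cannot reach it, so they are excluded from the component of state 1.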
In the computation process of reachable sets, we need to consider memory shortage problem for large-scale systems, and the problem of efficiency of duplicate detection of states, where duplicate detection of a state is to check if the state was searched for before. The first problem can be effectively solved by storing the state tables and hash tables on disk. The second problem can be effectively solved by using hash tables on disk. This is because hash values ranging from 1 to N in a hash table are sorted, and locating a state in the hash table needs only one I/O operation, where N is the number of states of the system. Thus the duplicate detection of a state costs only one I/O operation.
Computing the intersection of two state sets is another key factor that affects the performance and the I/O complexity of the ASCC computation. Generally, intersection computation has non-linear worst-case time complexity. To the best of our knowledge, the best algorithm for set intersection [33] has a linear expected time complexity. In this section, based on a minimal perfect hash function (MPHF), we propose an I/O efficient algorithm for computing set intersection with linear worst-case I/O complexity, which lowers the I/O complexity and improves the performance of SFA. The basic idea of this algorithm is to translate the intersection computation problem of two state sets into the tag comparison problem of two hash tables.
In this section, we first propose an algorithm for the intersection computation of two state sets, and then introduce an algorithm for the computation of reachable sets. Finally, based on these two techniques, we propose an I/O efficient algorithm for searching for an ASCC, namely SFA.
Computing Intersection of Two State Sets
We first define an equivalent representation of a state set. Suppose $Q$ is a state set; then an equivalent representation of $Q$ is a hash table $H$ which satisfies the condition that $t \in Q$ iff the entry that contains $hash(t)$ in $H$ has tag value 1. Thus, we can translate the problem of computing the intersection of two state sets into the problem of tag checking of two hash tables.
the tag value corresponding to $hash(t)$ in $H_1$ to $*$. The computation process of the intersection of two state sets is described in Algorithm 2.
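The translation from set intersection to tag comparison can be illustrated with a small in-memory sketch. It is not Algorithm 2: the names `tag_table`, `intersect_tags` and `members` are hypothetical, a dictionary stands in for the MPHF, Python lists stand in for the sorted on-disk hash tables, and indices start at 0 rather than 1. Because both tables are ordered by hash value, one synchronized pass over them suffices.

```python
def tag_table(states, mphf):
    # Equivalent representation of a state set: tag 1 iff the state is in it.
    tags = [0] * len(mphf)
    for t in states:
        tags[mphf[t]] = 1
    return tags


def intersect_tags(h1, h2):
    # One linear pass over both tables, compared position by position.
    return [1 if a == 1 and b == 1 else 0 for a, b in zip(h1, h2)]


def members(tags, mphf):
    # Recover the state set represented by a tag table.
    return {t for t, index in mphf.items() if tags[index] == 1}


mphf = {"a": 0, "b": 1, "c": 2, "d": 3}
forward = tag_table({"a", "b", "c"}, mphf)
backward = tag_table({"b", "c", "d"}, mphf)
print(members(intersect_tags(forward, backward), mphf))
```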
Search for the Reachable State Set of an Accepting State
We start with Algorithm 3, which computes the sets of states reachable from an accepting state $s \in F$, including the forward reachable set and the backward reachable set. Let I-QUEUE be the type of internal queues, and E-QUEUE be the type of external queues. We assume the basic operations over a queue data structure, such as enqueue(), which inserts an element at the end of a queue, and dequeue(), which pops an element from the beginning of a queue. An empty queue is represented by . Moreover, we assume an internal queue has a capacity of $m$ elements, bounded by the limited internal memory. Note that $succ$ may be $succ^{+}$ or $succ^{-}$, which express forward and backward transitions, respectively.
all levels. The outer while loop creates the next level from the current level, which is stored in currentTable. In every iteration of the outer while loop, the algorithm first moves $\min(m, \#(currentTable))$ states from the current level to the queue current by a repeat loop, then computes all non-duplicate successors of every state in the queue current and stores them in the queue next, and sets the tag values of these non-duplicate states
Computing ASCC
We now consider the algorithm SFA. Based on the idea above, the algorithm for the computation of set intersection, and the algorithm for the computation of reachable sets, Algorithm 4 can efficiently compute $ASCC(s)$ for a given accepting state $s$. The result $H$ returned by the function SFA() is a hash table corresponding to $ASCC(s)$.
Algorithm 4. The Algorithm of Search for ASCC

Search for all Accepting Cycles in an ASCC
After obtaining an ASCC using the algorithm SFA, we need to find all accepting cycles in this ASCC. This is accomplished by the algorithm FACA, which is an extension of Johnson's algorithm [21] to external memory and our framework. In the algorithm FACA, the whole ASCC is stored in external memory by a hash table, and the search path is first stored in a stack in internal memory. We need to manage the stack efficiently because of the memory dithering problem that arises when using a stack. In this section, we first propose a dynamic path management technique (DPM), and then design the algorithm FACA by combining DPM, MPHF and Johnson's algorithm [21].
Equivalence of Accepting Cycles
We first give some definitions about accepting cycles and an equivalence proposition.
Definition 1.
A state s is said to be elementary for an accepting cycle C if there is no accepting subcycle of C after s is deleted from C; otherwise, s is said to be non-elementary.
Definition 2.
An accepting cycle C is said to be elementary if every state on C is elementary.
Definition 3. Two accepting cycles are said to be equivalent if they have the same elementary states.
According to elementary graph theory, we prove an important claim: Any accepting cycle C is equivalent to some elementary accepting cycle.
Proof. Assume the accepting cycle C goes through the accepting state s, and t is a state appearing twice on C. Obviously, after deleting the sequence of states between the first t and the second t, the remaining sequence of states still forms an accepting cycle going through s.
We deal with all duplicate states in this way and finally obtain an accepting cycle D without duplicate states. This accepting cycle is obviously an elementary accepting cycle and is equivalent to C. □
With the definitions and the claim above, we have that, in order to find all accepting cycles, it is sufficient to find all elementary accepting cycles. This can significantly reduce the number of counterexamples to be searched for, without impairing work that uses multiple counterexamples for better debugging.
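The construction used in the proof can be sketched as follows. The function name `to_elementary` and the list representation are assumptions made for illustration: a cycle is given as a list whose first entry is the accepting state s, and the edge from the last entry back to s is implicit.

```python
def to_elementary(cycle):
    """Sketch: delete the states between two occurrences of the same state.

    cycle[0] is assumed to be the accepting state s; the closing edge from
    cycle[-1] back to cycle[0] is implicit.
    """
    path = []        # states kept so far, in order
    position = {}    # state -> index in path
    for state in cycle:
        if state in position:
            # Second visit of `state`: drop everything after its first visit.
            cut = position[state]
            for dropped in path[cut + 1:]:
                del position[dropped]
            del path[cut + 1:]
        else:
            position[state] = len(path)
            path.append(state)
    return path

print(to_elementary(["s", "a", "t", "b", "c", "t", "d"]))
```

Because the first entry is never removed, the result still passes through s, and its last entry is the last entry of the input, so the closing edge back to s is preserved.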
Dynamic Path Management
When an elementary accepting cycle is too long to be stored in the internal memory, the search path management has a big impact on the performance of DAAC. In this section, we propose a dynamic path management technique, namely DPM. In this algorithm, we employ a stack in internal memory and a state table tableP in external memory to store the whole search path.
During the search, when the stack is full and a new state is generated, we move states from the stack to tableP in order to avoid memory overflow. However, this may result in states being moved into and out of internal memory frequently.
In the following, we analyze how this phenomenon occurs. Suppose that we swap M states at a time between internal memory and disk, where M is the number of states in the stack when it is full. When the stack is full and we transfer M states from the stack to disk, the stack becomes empty. If the algorithm then needs to pop a state from the stack because of backtracking, those M states just moved to disk must immediately be moved back to internal memory, and the stack becomes full again. If another new state is then generated and needs to be pushed onto the stack, the M states have to be moved out of internal memory again to make space for it, and so on. We call this frequent movement of state blocks memory dithering; it significantly increases disk accesses and thus the algorithm's I/O complexity.
In order to reduce memory dithering, we propose an efficient technique which works as follows. 1) When the stack is full, the algorithm moves only some states from the stack into tableP to make memory space for new states. In more detail, because the stack operates in First In, Last Out (FILO) order, the algorithm moves the k states at the bottom of the stack, which will not be used for the time being, into tableP, releases the corresponding memory space, and sets the bottom pointer of the stack to the $(k+1)$-th state, where $k = \#(\mathit{stack}) \cdot r$ and r is a parameter with $0 < r < 1$. This procedure is implemented by the function Mem-Disk(). 2) When the stack becomes empty and tableP is not empty, the algorithm pushes the k states stored most recently in tableP onto the stack in turn and deletes them from tableP, where $k = \min(\#(\mathit{tableP}), M \cdot r)$. This procedure is implemented by the function Disk-Mem(). By keeping some states in the stack when it is full, we ensure that there are always some states in the stack, which to some extent reduces memory dithering.
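The two steps above can be illustrated with a small in-memory sketch. It is not the paper's implementation: the class name `PathStack`, the transfer counter, the rounding of k down to an integer (with a minimum of 1), and the use of a Python list as a stand-in for the on-disk tableP are all assumptions.

```python
class PathStack:
    """Sketch of DPM: a bounded in-memory stack backed by a table on 'disk'."""

    def __init__(self, capacity, r):
        assert capacity >= 2 and 0 < r < 1
        self.capacity = capacity   # M: number of states the stack holds when full
        self.r = r                 # fraction of the stack moved out when full
        self.stack = []            # in-memory part of the path (top is the end)
        self.table_p = []          # stand-in for tableP in external memory
        self.transfers = 0         # number of block movements, for illustration

    def _mem_disk(self):
        # Move the k bottom states to tableP, k = #(stack) * r (rounded down).
        k = max(1, int(len(self.stack) * self.r))
        self.table_p.extend(self.stack[:k])
        del self.stack[:k]
        self.transfers += 1

    def _disk_mem(self):
        # Bring back the k most recently stored states, k = min(#(tableP), M * r).
        k = min(len(self.table_p), max(1, int(self.capacity * self.r)))
        self.stack = self.table_p[-k:] + self.stack
        del self.table_p[-k:]
        self.transfers += 1

    def push(self, state):
        if len(self.stack) == self.capacity:
            self._mem_disk()
        self.stack.append(state)

    def pop(self):
        if not self.stack and self.table_p:
            self._disk_mem()
        return self.stack.pop()

    def path(self):
        """Whole search path: states on disk followed by states in memory."""
        return self.table_p + self.stack

p = PathStack(capacity=4, r=0.5)
for state in range(5):
    p.push(state)              # the fifth push triggers one Mem-Disk transfer
for _ in range(10):
    p.pop()
    p.push(99)                 # backtracking at the boundary causes no transfers
print(p.transfers, p.path())
```

Since only part of the stack is written out, alternating pops and pushes right after a transfer are served from memory, which is the effect the technique aims for.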
Finding All Elementary Accepting Cycles in an ASCC
Given an ASCC described by a hash In the $i$-th traverse (namely the $i$-th while loop, $1 \le i \le m$), FACA searches for all elementary accepting cycles going through $s_i$ by recursively calling the function dfs(). In the search process, the successors of every state are implicitly generated by the function successor(). Note that here we only consider the successors in the ASCC, namely all successors t such that ASCCof[hash(t)] = 1. The states on the current search path are kept on the stack. If a successor of some state is $s_i$ itself, then FACA has found an elementary accepting cycle going through $s_i$, which is the sequence of states stored successively in tableP and the stack; FACA outputs the cycle and then continues to search for other elementary accepting cycles going through $s_i$. When the stack is full, FACA invokes the function Mem-Disk() to manage the stack. To avoid state duplication on the same accepting cycle, a state u is blocked by setting blocked(u) to true when it is added to the current search path beginning at $s_i$. It stays blocked as long as every path from u to $s_i$ intersects the current path at a state other than $s_i$. This eliminates much fruitless searching. To avoid duplicate searching for elementary accepting cycles going through $s_i$, the algorithm deletes $s_i$ from the ASCC by setting ASCCof[hash($s_i$)] = 0 after finding all elementary accepting cycles going through $s_i$. FACA is described by Algorithm 5. Note that when the number of accepting cycles is very large, we may set a bound k on the number of accepting cycles in advance; when the number of accepting cycles found reaches the bound k, the search terminates.
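The search described above can be sketched in memory as a Johnson-style enumeration that starts only from accepting states. This is a simplified illustration, not Algorithm 5: the function name `elementary_accepting_cycles`, the adjacency dictionary, and the set `alive` (standing in for the ASCCof flags) are assumptions, and the path is kept in an ordinary list instead of the stack and tableP.

```python
def elementary_accepting_cycles(succ, accepting):
    """Sketch: list every elementary cycle that passes through an accepting state.

    succ maps each state of one ASCC to its successors; accepting is the
    ordered list s_1, ..., s_m of accepting states in that ASCC.
    """
    alive = set(succ)            # plays the role of ASCCof[hash(t)] = 1
    cycles = []
    for s in accepting:
        blocked = set()
        waiting = {v: set() for v in alive}   # states to unblock together with v
        stack = []

        def unblock(u):
            blocked.discard(u)
            while waiting[u]:
                w = waiting[u].pop()
                if w in blocked:
                    unblock(w)

        def dfs(v):
            found = False
            stack.append(v)
            blocked.add(v)
            for w in succ[v]:
                if w not in alive:
                    continue                  # successor outside the remaining ASCC
                if w == s:
                    cycles.append(list(stack))  # path closes back at s
                    found = True
                elif w not in blocked and dfs(w):
                    found = True
            if found:
                unblock(v)
            else:
                # v stays blocked until one of its successors is unblocked.
                for w in succ[v]:
                    if w in alive:
                        waiting[w].add(v)
            stack.pop()
            return found

        dfs(s)
        alive.discard(s)         # corresponds to setting ASCCof[hash(s_i)] = 0
    return cycles

succ = {"s1": ["a"], "a": ["s1", "b"], "b": ["s1", "s2"], "s2": ["b"]}
for cycle in elementary_accepting_cycles(succ, ["s1", "s2"]):
    print(cycle)
```

The recursion is only suitable for small graphs; a version for large state spaces would need an explicit stack managed as in the dynamic path management technique.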
Algorithm 5. The Algorithm for Finding Accepting Cycles in an ASCC
CORRECTNESS OF DAAC
In this section, we give the proof for the correctness of DAAC. The proof consists of two parts. One is for the correctness of algorithm SFA which finds accepting strongly connected components; the other for that of algorithm FACA which finds all elementary accepting cycles of some ASCC. 
Correctness of SFA
Correctness of FACA
The correctness of the algorithm for finding all accepting cycles of an ASCC can be proved along similar lines to [21].
Lemma 3. In FACA, for any state u, if there is a path from u to s which intersects the stack only at s, then blocked(u) is false.
Proof. Suppose the path is $(u = v_k, v_{k-1}, \ldots, v_1, s)$. Assume that blocked($v_k$) is true. Then u must be a state which was added to the stack earlier and was subsequently popped from it. Certainly $v_1, v_2, \ldots, v_{k-1}$ were also popped from the stack successively before that. Because s is a successor of $v_1$, the flag is true before $v_1$ is popped from the stack. It follows that unblock($v_1$) is called, blocked($v_1$) is set to false, and dfs($v_1$) returns true. Because dfs($v_1$) returns true, the flag is also true before $v_2$ is popped from the stack. It follows that unblock($v_2$) is called, blocked($v_2$) is set to false, and dfs($v_2$) returns true, and so on. Finally unblock($v_k$) is called and blocked($v_k$) is set to false, which contradicts the assumption. Thus, we conclude that blocked(u) is false. □
Lemma 4. In the function dfs(), for any state $x \ne s$, there is a call unblock(u) which sets blocked(x) to false if and only if 1) there is a path, containing u, from x to s on which only u and s are on the stack; and 2) there is no path from x to s on which only s is on the stack.
Proof. The main difference between dfs() of DAAC and CIRCUIT() of Johnson's algorithm [21] is that the graph (or stack) is partly put on disk in dfs(), which does not change the structure of the graph (or stack); namely, its vertex set and edge set do not change. In the function FACA(), for $i < m$, the algorithm deletes $s_i$ from the ASCC after it executes dfs($s_i$), and then executes dfs($s_{i+1}$). Thus, for $i \ne j$, the elementary accepting cycles generated by dfs($s_i$) are different from those generated by dfs($s_j$). Moreover, in the call dfs($s_i$), no elementary accepting cycle is output more than once, since for any stack $(s_i = v_1, v_2, \ldots, v_k)$ with $v_k$ on top, once $v_k$ is removed the same stack cannot reoccur. Therefore, every elementary accepting cycle is output only once. □
COMPLEXITY ANALYSIS
In this section, we apply the model of Aggarwal and Vitter to analyze the I/O complexity of DAAC, and compare DAAC with DAC, MAP and IDDFS in terms of I/O complexity.
The I/O Complexity of DAAC
The I/O complexity of DAAC directly depends on that of algorithms SFA and FACA.
Lemma 6. The I/O complexity of the algorithm for computing the intersection of two state sets is $O(\mathrm{scan}(N))$, where N is the number of states in an automaton.
Proof. The main factors affecting the I/O complexity of Algorithm 3 are the three repeat statements, gettag() and settag(). The accumulated effect of the first repeat statement over all iterations is equivalent to moving N states from disk to internal memory. The accumulated effect of the other two repeat statements is equivalent to moving N states from internal memory to disk. Thus, these three repeat statements take $2 \cdot N/B$ I/O operations in total over all iterations. The functions gettag() and settag() are each executed N times, which takes 2N I/O operations in total. Therefore, the I/O complexity of the algorithm is $O(\mathrm{scan}(N))$. □
Theorem 6. The I/O complexity of the algorithm SFA is $O(\mathrm{scan}(N))$.
Proof. By Lemmas 6 and 7, clearly the I/O complexity of the algorithm SFA is $O(\mathrm{scan}(N))$.
Proof. By the proof of [21, Lemma 3], we know that no state can be unblocked more than once unless an elementary accepting cycle is output. Thus, in the search for any elementary accepting cycle C in the ASCC, the worst case is that all states in the ASCC are traversed twice and the number of state-block movements between internal memory and disk is maximal. Thus, when the stack is full, FACA moves $M \cdot r$ states from the stack to tableP, and then pops $M \cdot (1 - r)$ states one by one from the stack until it becomes empty, where M is the number of states in the stack when it is full; afterwards, FACA transfers $M \cdot r$ states from tableP back to the stack, and then generates and pushes $M \cdot (1 - r)$ states one by one onto the stack until it becomes full; and so on until all states in the ASCC are traversed. In this process, the block of states is moved $((N' - M \cdot r)/(M \cdot (1 - r)) - 1)$ times from the stack to tableP. Because every movement of the block of states needs $M \cdot r / B$ I/O operations and the number of states moved from the stack to tableP is equal to that moved from tableP to the stack, the whole process costs 2((N
In Algorithm 1, enumerating all reachable states using the function enumerateBFS() needs $O(\ell \cdot \mathrm{scan}(N) + \mathrm{sort}(|E|))$ I/O operations [4], [7], [32], where $\ell = \max\{d(s,u) - d(s,v) + 1 \mid (u,v) \in E\}$ is the length of the longest back-edge in the BFS graph, or its locality [7]. Constructing the minimal perfect hash function MPHF by using the algorithm EPH costs $O(\mathrm{sort}(N))$ I/O operations.
Executing the for loop needs $O(\mathrm{scan}(|F|))$ I/O operations. Thus, by Theorems 6 and 7, executing the $i$-th while loop needs $O(\mathrm{scan}(N) + \mathrm{scan}(|F|) + c_i \cdot \mathrm{scan}(N))$, namely $O(c_i \cdot \mathrm{scan}(N))$, I/O operations. Therefore, DAAC has an I/O complexity of $O(\ell \cdot \mathrm{scan}(N) + \mathrm{sort}(|E|) + \mathrm{sort}(N) + (\sum_{i=1}^{L} c_i) \cdot \mathrm{scan}(N))$, namely $O((\ell + c) \cdot \mathrm{scan}(N) + \mathrm{sort}(|E|))$. □
Complexity Comparison
We compare the I/O complexity of DAAC with that of the algorithms DAC, MAP and IDDFS according to Theorem 8 and existing results concerning the I/O complexity of these algorithms. Theorem 1 of [2] claims that the I/O complexity of DAC is $O(l_{SCC} \cdot (h_{BFS} + |p_{\max}| + |E|/M) \cdot \mathrm{scan}(N))$, where $p_{\max}$ is the length of the longest path in the graph going through a trivial SCC (without self-loops), $h_{BFS}$ is the BFS height, and $|E|$ is the number of edges in the graph. Since $O(\mathrm{sort}(|E|)) = O(|E|/B \cdot \log_{M/B}(|E|/B))$ and $|E| \le N^2$, $O(\mathrm{sort}(|E|))$ is smaller than $O(2|E|/B \cdot \log_{M/B}(N))$. We know $O(\log_{M/B}(N))$ is much lower than $\mathrm{scan}(N)$. It follows that $O(\mathrm{sort}(|E|))$ is much lower than $O(|E| \cdot \mathrm{scan}(N))$. Therefore, DAAC is far superior to DAC from the I/O complexity viewpoint.
Regarding MAP, its I/O complexity is $O(|F| \cdot ((d + |E|/M + |F|) \cdot \mathrm{scan}(N) + \mathrm{sort}(N)))$ when the candidate set is in RAM, and $O(|F| \cdot ((d + |F|) \cdot \mathrm{scan}(N) + \mathrm{sort}(|F| \cdot |E|)))$ when the candidate set is on disk, where d is the diameter of the graph [3], [5]. The I/O complexity of the first case is affected mainly by $O(|F| \cdot |E|/M \cdot \mathrm{scan}(N))$. From the above, we know that $O(\mathrm{sort}(|E|))$ is much lower than $O(|E| \cdot \mathrm{scan}(N))$. Thus, $O(\mathrm{sort}(|E|))$ is much lower than $O(|F| \cdot |E|/M \cdot \mathrm{scan}(N))$. Generally N is less than $|E|$. Thus, $O(\mathrm{sort}(N))$ is lower than $O(\mathrm{sort}(|E|))$. It follows that, for the first case, DAAC outperforms MAP in terms of I/O complexity. In addition, it is obvious that $O(\mathrm{sort}(|E|))$ is lower than $O(\mathrm{sort}(|F| \cdot |E|))$. Thus, for the second case, DAAC also outperforms MAP in terms of I/O complexity.
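The bound on $\mathrm{sort}(|E|)$ used in this comparison can be spelled out step by step. The derivation below is a sketch with constant factors dropped; it assumes the standard definitions $\mathrm{scan}(N) = N/B$ and $\mathrm{sort}(N) = (N/B)\log_{M/B}(N/B)$ of the Aggarwal and Vitter model, together with $B \ge 1$ and $M/B > 1$.

```latex
\begin{align*}
\mathrm{sort}(|E|)
  &= \frac{|E|}{B}\,\log_{M/B}\frac{|E|}{B}
   \;\le\; \frac{|E|}{B}\,\log_{M/B} N^{2}
   && \text{since } |E|/B \le |E| \le N^{2} \\
  &= \frac{2\,|E|}{B}\,\log_{M/B} N
   \;=\; \frac{2\,\log_{M/B} N}{N}\;\cdot\;|E|\cdot\mathrm{scan}(N)
   && \text{since } \mathrm{scan}(N) = N/B .
\end{align*}
```

The factor $2\log_{M/B} N / N$ tends to zero as N grows, which is the sense in which $\mathrm{sort}(|E|)$ is much lower than $|E| \cdot \mathrm{scan}(N)$.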
The I/O complexity of IDDFS is $O(\varepsilon_s \cdot \mathrm{sort}(N) + \ell \cdot \mathrm{scan}(N) + \mathrm{sort}(|E|))$, where $\varepsilon_s = \max\{d(s,v) \mid v \in V\}$ and $\ell = \max\{d(s,u) - d(s,v) + 1 \mid (u,v) \in E\}$ is the length of the longest back-edge in the BFS graph, or its locality [4]. It is obvious that $O(\mathrm{scan}(N))$ is lower than $O(\mathrm{sort}(N))$. Thus, we can observe that DAAC has lower I/O complexity than IDDFS.
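To make the relation between the two primitives concrete, the following sketch evaluates the standard scan and sort bounds of the Aggarwal and Vitter model for one set of parameters. The function names and the sample values of N, M and B are hypothetical and are not taken from the experiments.

```python
import math

def scan(n, b):
    """I/Os to read n items sequentially in blocks of b items: ceil(n / b)."""
    return -(-n // b)

def sort_io(n, m, b):
    """Sorting bound (n / b) * log_{m/b}(n / b), constant factors dropped."""
    blocks = n / b
    return blocks * math.log(blocks, m / b)

# Hypothetical parameters: 10^6 states, memory for 10^4 states, blocks of 100.
N, M, B = 10**6, 10**4, 100
print(scan(N, B), sort_io(N, M, B))
```

With these values the sort bound is about twice the scan bound; the gap only appears when the data does not fit in internal memory, that is, when N exceeds M.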
EXPERIMENT
In this section, we demonstrate by experiments that finding all accepting cycles can be nearly as I/O efficient as searching for one. To do so, we compare runtime and allocated disk space of DAAC with that of DAC, MAP and IDDFS. The experimental results confirm the efficiency of DAAC for finding all accepting cycles.
Benchmarks
In this experiment, we selected the same benchmarks as [2] and [4], plus the models Peterson(6), P4 and Szyman.(6), P4. These two models show the limit on the scale of systems that IDDFS can verify. All selected benchmarks are from the BEEM project [34]; they include models with valid properties and models with invalid properties, and range from less than 50,000 states to more than 6,000,000,000 states. They are typical in the literature and serve as a good test bed for assessing the efficiency and performance of model checking algorithms.
The notation used in the benchmark names is explained below; please refer to [34] for more details. Elev2(16), P4 is an elevator model with 16 levels and the fourth property to be verified. MSC(5), P4 is a queue lock mutual exclusion model (or algorithm) with five processes and the fourth property to be verified. Phils(16, 1), P3 is a dining philosophers model with 16 philosophers, one plate and the third property to be verified. Lamport(5), P4 is Lamport's fast mutual exclusion model with five processes and the fourth property to be verified. Peterson(6), P4 is Peterson's mutual exclusion protocol with six processes and the fourth property to be verified. Szyman.(6), P4 is Szymanski's mutual exclusion protocol with six processes and the fourth property to be verified. Bakery(5, 5), P3 is the Bakery mutual exclusion algorithm with five processes, five critical sections and the third property to be verified. Lifts(7), P4 is a distributed model for lifting trucks with seven lifts and the fourth property to be verified.
Experimental Setup
The four algorithms have been implemented on top of the DiVine library [35], which provides the state space generator, and the STXXL library [36], which provides the I/O primitives. In the experiments, for DAAC, we set the parameter r to 0.94 and placed no bound on the number of counterexamples.
All experiments were run on a PC with a P4 2.4 GHz CPU, 2 GB of memory, 400 GB of disk space, and the Ubuntu Linux 9.0 operating system. For each instance, each algorithm was run 10 times. For each algorithm on each instance, we report the average runtime ("time") and average disk consumption ("disk"). The time format is hh:mm:ss (elapsed hours, minutes and seconds), the same as that of [2]; for example, "01:01:01" means that an algorithm took 1 hour, 1 minute and 1 second. "size" is the number of states of an instance, and "cycles" is the number of accepting cycles detected by an algorithm.
Experimental Results
The experimental results are listed in Tables 1 and 2. On models with invalid properties, our approach finds all accepting cycles, whereas the other three approaches find only one accepting cycle. Even so, our approach is still superior to DAC. As shown in Table 1, of the four instances with invalid properties, DAAC is much faster than DAC on three. In addition, our approach is comparable with MAP and IDDFS. From Table 1, we can observe that for the two instances Elevator2(7), P5 and Lifts(7), P4, which have a small number of elementary accepting cycles, our approach is as fast as MAP and IDDFS, but for the two instances Bakery(5, 5), P3 and Szyman.(4), P2, which have a large number of elementary accepting cycles, our approach is a bit slower than MAP and IDDFS. However, this is worthwhile because our approach finds all accepting cycles in one run.
On models with valid properties, as a by-product, our approach also outperforms DAC, MAP and IDDFS. Table 1 shows that DAAC is faster than its competitors on almost all models with valid properties, especially on large-scale models such as Peterson(6), P4 and Szyman.(6), P4. The main cause is that for models with valid properties, DAC, MAP and IDDFS eventually need to create the whole state space, as DAAC does, but DAAC exploits efficient techniques such as the dynamic path management technique, the efficient framework and the intersection computation technique, which greatly improve its performance. Additionally, DAAC can handle the benchmarks Peterson(6), P4 and Szyman.(6), P4, which are too large for IDDFS. This is because IDDFS exploits heuristic EPH to construct a minimal perfect hash function, which needs 5 bits of internal memory for every state. Therefore, IDDFS cannot handle systems with more than $2 \cdot 2^{30} \cdot 8 / 5$ states on a computer with 2 GB RAM [4]. Moreover, the space consumption of DAAC is less than that of DAC, MAP and IDDFS on almost all models with valid properties.
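The state limit quoted for IDDFS follows from simple arithmetic, sketched below. The function name `max_states` is hypothetical, and the calculation assumes that the whole 2 GB of RAM is available to the hash function, which ignores all other memory use.

```python
def max_states(ram_bytes, bits_per_state):
    """Number of states whose per-state bits fit into ram_bytes of memory."""
    return ram_bytes * 8 // bits_per_state

two_gb = 2 * 2**30                     # 2 GB expressed in bytes
print(max_states(two_gb, 5))           # 5 bits per state, as stated for IDDFS
```

This gives roughly 3.4 billion states for 5 bits per state.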
In summary, DAAC has better practical efficiency on the whole than DAC, MAP and IDDFS.
Similar to IDDFS, DAAC also has the problem that the scale of the systems it can verify is to some extent limited, because it exploits a minimal perfect hash function. But DAAC can generally handle larger-scale systems than IDDFS. This is mainly because DAAC uses EPH to construct a minimal perfect hash function and needs less than 4 bits of internal memory for every state. By comparison, IDDFS exploits heuristic EPH and needs 5 bits of internal memory for every state; otherwise it cannot obtain the performance claimed in the literature [4]. This is also confirmed by the experimental results on the instances Peterson(6), P4 and Szyman.(6), P4 in Table 1.
In the following, we briefly analyse the approaches above further, so as to understand the main reasons why DAAC is more efficient than DAC, MAP and IDDFS. DAC also generates the whole state space, but does not exploit MPHF; MAP does not exploit MPHF either, though it improves on DAC. These two algorithms need many more I/O operations than DAAC to carry out duplicate state checking. This, together with their higher I/O complexity compared with DAAC, significantly reduces their efficiency. Though IDDFS exploits MPHF, it carries out its search level by level. Beginning with the first level, whenever it generates a new level, it regenerates the sub state space from the first level to the new level, reconstructs the MPHF corresponding to that sub state space, and searches it again. This makes IDDFS less efficient than DAAC. DAAC avoids the disadvantages above by using an efficient framework, an intersection computation technique and a dynamic path management technique, which significantly improves its performance.
DISCUSSION
In this section, we first study the parameter in dynamic path management, and then develop two variants of our approach that may output a set of counterexamples. Finally, we discuss those on-the-fly approaches for directly finding all accepting cycles from the graph.
The r Parameter
The r parameter must be set in order to execute DAAC. It influences the performance of DAAC to some extent by controlling the size of the state blocks moved into or out of internal memory.
To investigate the parameter's impact on the performance of DAAC, we ran DAAC with different values of the r parameter. The experimental environment and selected benchmarks are the same as those of Section 7, except that the models Elevator2(7), P5 and Lifts(7), P4 are omitted because small models do not need the DPM technique. The experimental results are reported in Table 3.
From Table 3, we observe that the r parameter has an impact on the performance of DAAC, and that its optimal value varies slightly across instances. The optimal value lies mainly between 0.94 and 0.96, and thus in practical applications we may simply fix the r parameter to a constant in this range, say 0.94 as in our experiments, without any major loss of efficiency.
Variants of DAAC
To deal with systems that have a great many counterexamples or are too large, we consider two variants, TNC-DAAC and TT-DAAC. The experimental environment is the same as that of Section 7. As six models have no counterexample, we select only the remaining four models with invalid properties from Section 7 as benchmarks.
In TNC-DAAC, the threshold on the number of counterexamples is set in advance. When the number of counterexamples reaches the threshold, TNC-DAAC terminates. Table 4 shows the time consumption of TNC-DAAC on every model when the threshold on the number of accepting cycles is 1, 100, 1,000 and 10,000, respectively. From the table, we observe that when the threshold is 1, TNC-DAAC takes less time than DAC on all models. Moreover, TNC-DAAC is still faster than MAP and IDDFS on the two instances Elevator2(7), P5 and Lifts(7), P4, which have a small number of accepting cycles, though it is a bit slower than MAP and IDDFS on the other two models. For the other thresholds, the experimental results are similar.
TT-DAAC works in an anytime way, which means its running time is bounded by a time threshold. Table 5 shows the numbers of counterexamples that TT-DAAC finds within different time thresholds for every instance. The experimental results show that TT-DAAC needs different time thresholds to find one or more counterexamples for different instances. Generally, the larger a system is, the greater the time threshold needed to find one or more counterexamples.
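Both stopping rules can be expressed as a thin wrapper around any source of cycles, as the sketch below illustrates. It is not the TNC-DAAC or TT-DAAC implementation: the function name `bounded`, its parameters, and the use of a Python iterator as the cycle source are assumptions.

```python
import time

def bounded(cycles, max_count=None, time_budget=None):
    """Yield cycles until a count threshold or a time budget (seconds) is hit."""
    deadline = None if time_budget is None else time.monotonic() + time_budget
    source = iter(cycles)
    found = 0
    while True:
        # Check both thresholds before asking the search for another cycle.
        if max_count is not None and found >= max_count:
            return
        if deadline is not None and time.monotonic() >= deadline:
            return
        try:
            cycle = next(source)
        except StopIteration:
            return               # the search has enumerated every cycle
        found += 1
        yield cycle

# Stand-in cycle source: any iterable works, e.g. a generator running the search.
print(list(bounded(range(10), max_count=3)))
```

In this sketch the time budget is only checked between cycles, so a single long search step can overrun it.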
As for many other application conditions, such as returning all counterexamples of a given length, we can also develop the corresponding variants of our approach by minor revisions, which may be considered as future work.
On-the-Fly Approaches for Directly Finding All Accepting Cycles from the Graph
One may adopt an on-the-fly approach to find all accepting cycles. Previous on-the-fly approaches (including MAP and IDDFS) which use external memory device are designed to detect one counterexample. Thus, they exit when finding one counterexample, and do not need to search the whole state space. However, in order to find all accepting cycles, whatever approach we use, it has to search the whole state space eventually. This is because we cannot ensure that there is no counterexample in the remaining part of the state space. To compare our approach with on-the-fly approaches that directly search for all (or a set of) accepting cycles from the graph, we consider the number of I/O operations which they need in the searching. The main factor affecting the number of I/O operations is the check of states.
For a large-scale system, to find all accepting cycles, an on-the-fly approach has to store a large set of visited states on the external memory device and check each visited state against the state set on disk [7], [32]. Because the serial numbers of the states in the set are generally not contiguous, locating a state in the set costs $\log m$ I/O operations on average, where m is the number of states in the set. Thus, every check also costs $\log m$ I/O operations on average. Moreover, a typical on-the-fly approach has to traverse all possible paths to find all accepting cycles, and a great many of these paths may be ineffective. This results in a great number of unnecessary I/O operations.
Though our approach first generates the whole state space by using the algorithm enumerateBFS(), the algorithm needs to traverse the whole state space only once. Its aim is to construct an MPHF and obtain the accepting state set, which are exploited in later parts. With the MPHF, checking a state needs only one I/O operation in the process of searching for ASCCs and all accepting cycles. In addition, our framework avoids searching a great many ineffective paths, as shown by the example in Section 4.1. For these reasons, our approach needs fewer I/O operations than on-the-fly approaches that directly search for all (or a set of) accepting cycles from the graph.
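The difference between the two kinds of state check can be illustrated with an in-memory sketch. The names `binary_search_probes` and `FlagTable` are hypothetical, a Python dictionary stands in for a real minimal perfect hash function, and each probe of the sorted list stands in for one disk access.

```python
def binary_search_probes(sorted_states, target):
    """Look up target in a sorted list and count how many entries are probed."""
    lo, hi, probes = 0, len(sorted_states), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_states[mid] == target:
            return True, probes
        if sorted_states[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, probes

class FlagTable:
    """One flag per state, addressed directly through a perfect hash."""

    def __init__(self, states):
        # Stand-in for an MPHF: maps each known state to a distinct slot 0..n-1.
        self.index = {state: slot for slot, state in enumerate(states)}
        self.flags = bytearray(len(self.index))

    def set(self, state, value=1):
        self.flags[self.index[state]] = value

    def get(self, state):
        return self.flags[self.index[state]]   # a single table access

states = list(range(0, 2048, 2))               # 1024 non-contiguous state ids
print(binary_search_probes(states, 2046))
table = FlagTable(states)
table.set(2046)
print(table.get(2046))
```

The sorted-set lookup needs on the order of log m probes, while the flag table is read at one known position.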
CONCLUSION AND FUTURE WORK
In this paper, we proposed DAAC, an I/O efficient approach for detecting all accepting cycles of a large-scale system. It exploits a new framework for finding all accepting cycles of a large-scale system, a new linear algorithm for computing the intersection of two state sets, a dynamic path management technique, and a minimal perfect hash function. Because so far there is no I/O efficient algorithm for detecting all counterexamples, we compared DAAC with state-of-the-art algorithms designed to find one counterexample, namely DAC, MAP and IDDFS. Despite the fact that it computes all counterexamples, DAAC has on the whole better I/O complexity and practical performance. Moreover, because the counterexamples found can be used to better locate root bugs, our approach is significant for debugging in large-scale system designs.
As for future work, we plan to extend DAAC to a distributed version in order to further improve our approach's efficiency. Since DAAC proceeds by handling every state in the accepting state set and finds all elementary accepting cycles by dealing with every ASCC, it is not difficult to implement a distributed version.
We will also further investigate how to explain and extract information from all (or a set of) accepting cycles, and how to apply it in debugging of large-scale software and hardware systems.
