Abstract: Static analysis of concurrent programs has been hindered by the well-known state explosion problem. Although many different techniques have been proposed to combat this state explosion, there is little empirical data comparing the performance of the methods. This information is essential for assessing the practical value of a technique and for choosing the best method for a particular problem. In this paper, we carry out an evaluation of three techniques for combating the state explosion problem in deadlock detection: reachability search with a partial order state space reduction, symbolic model checking, and inequality necessary conditions. We justify the method used for the comparison, and carefully analyze several sources of potential bias. The results of our evaluation provide valuable data on the kinds of programs to which each technique might best be applied. Furthermore, we believe that the methodological issues we discuss are of general significance in comparison of analysis techniques.
I. Introduction
Ada tasks arm software developers with the power, and dangers, of concurrency. With this power, many systems composed of cooperating agents can be concisely specified, but if the communication protocol used by these agents is faulty, the resulting program may contain concurrency errors such as deadlock or starvation. Concurrent software is increasingly a part of safety-critical systems, thus research into methods and tools for increasing the reliability of such software is badly needed [35].
When a concurrent system is modeled as a set of communicating finite-state components, static reachability analysis provides a method for automatically detecting most concurrency errors. Its application in practice, however, has been limited by the state explosion problem: the number of states in a concurrent system tends to increase exponentially with the number of processes. Many techniques have been proposed to combat this explosion, including state space reductions [20, 29, 40, 42], symbolic model checking [4, 32], compositional techniques [43], abstraction [6], data flow analysis [16, 30], and integer programming techniques [1, 34].

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
This research was supported by National Science Foundation grant CCR-9308067.
The author is in the Department of Information and Computer Sciences at the University of Hawai'i at Mānoa, Honolulu, Hawaii 96822 (corbett@hawaii.edu).
A preliminary version of this paper appeared in the Proceedings of the International Symposium on Software Testing and Analysis, 1994.
Although each of the techniques excels on certain examples, complexity results indicate that they are all heuristic: when a concurrent system is modeled as a set of communicating finite-state automata, analysis of even simple properties like deadlock is PSPACE-hard [38]. In addition, empirical data comparing the performance of the different techniques are rare. This is understandable given the effort of constructing even a single analysis tool, the variations among the models and input languages used by various existing tools, and the difficulty of conducting a fair and meaningful evaluation. Nevertheless, this kind of data is essential for assessing the practical value of the techniques and in assisting developers in selecting a technique for their particular application. There is growing recognition that software engineering research must focus on "designing and carrying out experiments that yield quantifiable and reproducible results" [39].
In this paper we make two contributions. First, we examine the methodological issues of empirically comparing different deadlock detection techniques for Ada tasking programs. Second, we evaluate the performance of three analysis techniques and report the results of this evaluation. Specifically, we evaluate the efficacy of a partial order state space reduction [20], symbolic model checking [4], and inequality necessary conditions [1] in detecting communication deadlocks of Ada tasking programs. While we found that none of the techniques was clearly superior to the others overall, there was significant variation in the performance of the techniques on particular programs. Our data provide some indication of the kinds of programs to which each technique might best be applied. Although our evaluation is narrow in scope, being restricted to one property of one class of concurrent system, we believe that our basic approach is broadly applicable. Anyone conducting an empirical comparison of automated verification techniques for finite-state concurrent systems would benefit from our experience.
This paper is organized as follows. Section II defines the model of concurrent software used throughout the paper. Section III gives an overview of different approaches to the state explosion problem and provides a brief description of the three techniques evaluated. Section IV discusses the method used for the comparison and the issues that arise in such an evaluation. Section V presents the results of the experiments and draws some conclusions about the relative strengths and weaknesses of the techniques evaluated. Section VI considers several alternative models of concurrent software and explores how such models might have affected the results of our evaluation. Finally, Section VII summarizes our experience in conducting the evaluation.
II. Model
To make the presentation (and later, the evaluation) more uniform, we adopt a canonical model of concurrent software. There are many different kinds of concurrent systems, ranging from gate-level hardware, to asynchronous network protocols, to Ada tasking programs. Here, we chose a model that seemed the most natural for representing the class of systems over which we plan to conduct the evaluation: Ada tasking programs. The selection of the model is one of many issues that can affect the outcome of an evaluation, as we will discuss in Section IV-E. Formally, we model a concurrent system as a set of communicating finite-state automata (FSAs).

To illustrate the techniques, we will use a simple concurrent system comprising two tasks, each of which can either synchronize with the other or perform some internal action. The FSAs used to model this system, as well as the automaton for the system itself, are shown in Fig. 1.

In this section, we review the basic approaches to the state explosion problem, discuss how we selected the techniques to evaluate, and provide a brief overview of the three techniques selected for evaluation.
A. Approaches to the State Explosion Problem
Many different techniques have been proposed to combat the state explosion problem. Among them:

State-space reductions make a standard reachability analysis more efficient by reducing the number of states that must be explored to verify a property. They range from virtual coarsening techniques [29, 40], which coalesce internal actions into adjacent external actions, to partial order techniques [20, 42], which alleviate the effects of representing concurrency with interleaving, to symmetry techniques [31, 37], which take advantage of symmetries in the state space.

Symbolic model checking [4, 32] uses a symbolic representation of a system's states, which is sometimes much more compact than an explicit enumeration. These techniques have proven especially successful in verifying hardware.

Compositional techniques [5, 7, 43] exploit modularity in a system by dividing it into smaller subsystems, verifying each subsystem, and then combining the results of these analyses to verify the full system. If the subsystems have simple interfaces, such a hierarchical analysis can be quite effective.

Abstraction

Of all the techniques available, we selected a partial order state space reduction [20], symbolic model checking [4], and inequality necessary conditions [1] to evaluate. We selected these three techniques for several reasons. First, they represent three very different approaches to the state explosion problem and might be expected to perform best on different kinds of systems. Second, they are all "base case" techniques for verifying properties of systems or parts of systems without further decomposition. Compositional and abstraction techniques are part of a divide-and-conquer strategy for verification. They may be used to simplify the system into manageable pieces, but at the lowest level these pieces must be verified by some other technique. We expect compositional and abstraction techniques to play a vital role in the verification of any large system, regardless of the technique used to verify the base case subsystems.
Finally, we selected these three techniques because an implementation of each was available to us, as discussed in Section IV-C.
In the worst case, the time complexity of all of these techniques is exponential in the size of the system [1, 4, 20]. In practice, the techniques may have a much lower complexity on certain kinds of systems. In the remainder of this section, we provide a brief overview of the three techniques selected for evaluation.
C. Partial Order State Space Reduction
The simplest deadlock analysis technique is to enumerate the states of $M$ (the automaton representing the system) and search for deadlock states. This is often intractable since the number of states of $M$ is usually an exponential function of $n$ (the number of tasks). One possible solution to this problem is to construct a machine $M' = (S', \Sigma, \Delta', s_0, F')$ with $S' \subseteq S$, $\Delta' \subseteq \Delta$, and $F' \subseteq F$ such that $M$ has some property $P$ (e.g., a deadlock state) if and only if $M'$ has property $P$. Then we can test for $P$ in $M$ by testing for $P$ in $M'$, which may be much smaller. This general approach is called state space reduction.
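The exhaustive search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any tool evaluated here: the class `FSA`, the functions `successors` and `find_deadlocks`, and the transitions of the two example systems are hypothetical, and a deadlock state is assumed to be a reachable global state with no enabled action in which some task is not in a final state.

```python
from collections import deque

class FSA:
    """Hypothetical encoding: trans maps (state, symbol) -> next state."""
    def __init__(self, start, finals, trans):
        self.start = start
        self.finals = set(finals)
        self.trans = dict(trans)
        self.alphabet = {sym for (_, sym) in self.trans}

def successors(fsas, state):
    """Yield (symbol, next_global_state) for every enabled action."""
    symbols = set().union(*(m.alphabet for m in fsas))
    for sym in sorted(symbols):
        nxt, enabled = list(state), True
        for i, m in enumerate(fsas):
            if sym in m.alphabet:
                # Every FSA sharing the symbol must be able to take it.
                if (state[i], sym) in m.trans:
                    nxt[i] = m.trans[(state[i], sym)]
                else:
                    enabled = False
                    break
        if enabled:
            yield sym, tuple(nxt)

def find_deadlocks(fsas):
    """Breadth-first enumeration of the global states, collecting deadlocks."""
    start = tuple(m.start for m in fsas)
    seen, queue, deadlocks = {start}, deque([start]), []
    while queue:
        s = queue.popleft()
        succ = list(successors(fsas, s))
        all_final = all(s[i] in m.finals for i, m in enumerate(fsas))
        if not succ and not all_final:
            deadlocks.append(s)
        for _, t in succ:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen, deadlocks

# Assumed two-task system: internal actions a and c, shared action b.
task1 = FSA(1, {2}, {(1, "a"): 2, (1, "b"): 2})
task2 = FSA(3, {4}, {(3, "c"): 4, (3, "b"): 4})
states, deadlocks = find_deadlocks([task1, task2])

# Assumed faulty protocol: each task waits for the other's second action.
p = FSA(0, {2}, {(0, "p"): 1, (1, "q"): 2})
q = FSA(0, {2}, {(0, "q"): 1, (1, "p"): 2})
_, bad = find_deadlocks([p, q])
```

Under these assumptions the first system has four reachable global states and no deadlock, while the second deadlocks in its start state. The `seen` set is what grows exponentially with the number of tasks.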
Some of the most powerful state space reductions use partial orders. These techniques are based on the observation that much of the state explosion is due to the modeling of concurrency with interleaving. For example, if the FSAs in Fig. 1 each perform their internal actions, then the concurrency of these actions would be represented by the presence of the traces $ac$ and $ca$ in $M$. While this produces simple semantics, it greatly increases the size of $M$.
Partial order semantics represents the actions of a concurrent system as a set of partially ordered events. Two events are related by the partial order if one must precede the other given the semantics of processes and their interaction. In the example above, a single partial order in which $a$ and $c$ are unrelated represents both possible traces. Properties such as deadlock depend only on the partial order of the events that occur and not on the particular linearization of those events. Thus if $M'$ contains at least one trace for each partial order, we may check for deadlock in $M$ by checking for deadlock in $M'$. Conditional stubborn sets and sleep sets are two techniques for constructing such an $M'$. These techniques involve identifying sets of transitions that "commute" (do not disable each other) at each global state and then firing only one transition from each set. In the example of Fig. 1, $\{((1,3), a, (2,3)), ((1,3), c, (1,4))\}$ is one such set of transitions enabled in the system's start state. By firing only one of these transitions, we represent only one possible order of the two events, reducing the number of states generated. See [19, 20] for details.
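The saving from firing only one of a set of commuting transitions can be seen in a deliberately simple case. The sketch below assumes $n$ tasks that each perform a single internal action, so that all transitions are pairwise independent; it is not an implementation of stubborn sets or sleep sets, and the function name `explore` is hypothetical. Firing only the first enabled transition is sound here only because of that independence assumption.

```python
def explore(n, reduced=False):
    """Count global states of n tasks that each perform one internal action.

    A global state is a tuple of booleans (True = the task has acted).
    With reduced=True only one enabled transition is fired per state.
    """
    start = tuple([False] * n)
    seen, stack, terminal = {start}, [start], set()
    while stack:
        s = stack.pop()
        enabled = [i for i in range(n) if not s[i]]
        if not enabled:
            terminal.add(s)          # no action left: a terminal state
            continue
        if reduced:
            enabled = enabled[:1]    # represent a single interleaving
        for i in enabled:
            t = s[:i] + (True,) + s[i + 1:]
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return len(seen), terminal

full_count, full_terminal = explore(10)
reduced_count, reduced_terminal = explore(10, reduced=True)
```

For ten tasks the full interleaving visits $2^{10} = 1024$ states and the reduced search visits 11, and both reach the same terminal state, so a check on terminal states gives the same answer in either case.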
D. Symbolic Model Checking
Another approach to making deadlock detection more tractable is to use a different representation for $M$. State-space searches typically generate the states of $M$ explicitly and store them in a hash table. The state space of real systems is often very large, but also very regular. For example, consider a simple system modeling an $n$-bit counter with states $0, 1, \ldots, 2^n - 1$ and a transition from each state $i$ to state $(i + 1) \bmod 2^n$. For $n = 32$, the size of a typical machine register, this system is far too large for state enumeration techniques, yet the transition relation is given "symbolically" by $\Delta = \{(i, a, (i + 1) \bmod 2^n)\}$.
One way to represent $\Delta$ symbolically is to encode the relation as a boolean function represented by an Ordered Binary Decision Diagram (OBDD) [3]. OBDDs represent many frequently occurring boolean functions very compactly (e.g., symmetric functions, addition). An OBDD for a function $f(x_1, \ldots, x_n)$ and a total order $<$ on the boolean variables $x_1, \ldots, x_n$ is a rooted directed acyclic graph with the following properties. Each internal node $v$ is labeled by some variable $\mathrm{var}(v) \in \{x_1, \ldots, x_n\}$ and has exactly two children, denoted $\mathrm{lo}(v)$ and $\mathrm{hi}(v)$. Each leaf node is labeled by 0 or 1. For any internal node $v$, $\mathrm{var}(v) < \mathrm{var}(\mathrm{lo}(v))$ and $\mathrm{var}(v) < \mathrm{var}(\mathrm{hi}(v))$ (i.e., along any path from the root, variables appear in order). Finally, given an assignment of values to $x_1, \ldots, x_n$, the value of $f$ is the label of the leaf node reached by traversing a path from the root node, following the arc to $\mathrm{lo}(v)$ if $\mathrm{var}(v) = 0$ and following the arc to $\mathrm{hi}(v)$ if $\mathrm{var}(v) = 1$. The OBDD for an example function $f$ is given in Fig. 2. Dashed arcs are used to connect $v$ to $\mathrm{lo}(v)$, solid arcs to connect $v$ to $\mathrm{hi}(v)$. Any finite set $S$ can be encoded as an OBDD over $n = \lceil \log_2 |S| \rceil$ variables by assigning each element $e \in S$ a unique $n$-bit sequence $\phi(e)$ and building the OBDD for the characteristic function:

$$f(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \exists e \in S : \phi(e) = x_1 \ldots x_n \\ 0 & \text{otherwise} \end{cases}$$
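The evaluation rule and the ordering condition can be made concrete with a small sketch. The `Node` class, the functions `evaluate` and `is_ordered`, and the example set are assumptions introduced for illustration; real OBDD packages also share and reduce nodes, which is omitted here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Union

@dataclass(frozen=True)
class Node:
    var: int        # index of the variable tested at this node
    lo: "Obdd"      # child followed when the variable is 0
    hi: "Obdd"      # child followed when the variable is 1

Obdd = Union[Node, int]   # leaves are the integers 0 and 1

def evaluate(root, assignment):
    """Follow lo/hi arcs from the root until a leaf label is reached."""
    node = root
    while isinstance(node, Node):
        node = node.hi if assignment[node.var] else node.lo
    return node

def is_ordered(node, last=-1):
    """Check that variable indices strictly increase along every path."""
    if not isinstance(node, Node):
        return True
    return (last < node.var
            and is_ordered(node.lo, node.var)
            and is_ordered(node.hi, node.var))

# Characteristic function of the assumed set S = {01, 10} of 2-bit codes.
root = Node(0, lo=Node(1, lo=0, hi=1), hi=Node(1, lo=1, hi=0))
members = {bits for bits in product((0, 1), repeat=2) if evaluate(root, bits)}
```

Here `members` recovers exactly the two encoded elements, `(0, 1)` and `(1, 0)`, so the diagram acts as the characteristic function of the set.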
Relations can be similarly encoded since they are simply sets of tuples. The function $f$ in Fig. 2 encodes the transition relation for $M_1$ in Fig. 1, where $x$ represents the current state, $y$ represents the input, and $z$ represents the next state. The encodings for states 1 and 2 and symbols $a$ and $b$ are 0, 1, 0, 1, respectively. Given the transition relation $\Delta$, the set of reachable states $R$ is the smallest set of states containing $s_0$ and including any state that is reachable from a state in $R$ via $\Delta$. The OBDD for $R$ can be computed from the OBDD of $\Delta$ with fixed-point techniques that manipulate sets using their characteristic functions encoded as OBDDs. The OBDD for $R$ can be used to check for reachable states with certain properties (e.g., deadlock). In fact, model checking of temporal logic formulas can be performed in this framework without ever constructing an explicit representation of $M$. See [4, 32] for details.
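The fixed-point computation of $R$ can be sketched for the $n$-bit counter. In this illustration plain Python sets stand in for the OBDD-encoded characteristic functions, so it shows only the shape of the iteration and none of the compactness of a symbolic representation; the function names `image` and `reachable` are hypothetical.

```python
def image(states, n):
    """States reachable in one step under i -> (i + 1) mod 2**n."""
    return {(i + 1) % (2 ** n) for i in states}

def reachable(start, n):
    """Least fixed point of R = {start} union image(R)."""
    r = {start}
    while True:
        new = r | image(r, n)
        if new == r:          # nothing added: the fixed point is reached
            return r
        r = new

r4 = reachable(0, 4)
```

For $n = 4$ the iteration converges to all sixteen counter values. A symbolic model checker performs the same loop with the union and image operations implemented on OBDDs.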
E. Inequality Necessary Conditions
Yet another approach to making deadlock detection more tractable is to forgo any representation of the state space. To verify that a system has a property $P$, the technique generates necessary conditions for the existence of a trace of $M$ violating $P$. If these conditions are not satisfiable, then $M$ must have property $P$. If the conditions are satisfiable, however, then $M$ may or may not satisfy $P$, since the conditions are necessary but not sufficient. If the conditions are strong (i.e., are rarely satisfiable if $P$ holds) and easy to check, then such a technique can be quite effective. Unlike the previous two techniques, however, this kind of technique can yield an inconclusive result (of course, for any technique, an intractable analysis is inconclusive). Different kinds of necessary conditions have been used for deadlock analysis in Ada tasking. Masticola

[Fig. 3: necessary-condition inequalities for the example of Fig. 1; surviving labels: Hang, Require.]
Necessary conditions in the form of linear inequalities have been used to verify a variety of different properties of concurrent systems, including freedom from deadlock [1], general safety and liveness properties [8], and real-time properties [2, 11]. The basic idea is to view each FSA $M_i$ as a flowgraph and find a flow from the start state to some final state. This flow represents the path $M_i$ takes in the trace being sought (i.e., the trace violating $P$). The flow through arc $i$ is represented by an integer variable $x_i$. Flows are found by generating a flow equation at each state equating the flow into the state with the flow out of the state. There is an implicit flow of 1 into the start state, and an implicit flow of 1 out of the final state. Additional inequalities are generated to enforce some consistency among the paths taken by the FSAs.
Consider the example in Fig. 1. Let action $b$ represent a synchronization between tasks 1 and 2. Let internal action $a$ represent task 1 becoming permanently blocked waiting for task 2 to synchronize on $b$, and let internal action $c$ represent task 2 becoming permanently blocked waiting for task 1 to synchronize on $b$. The inequalities representing necessary conditions for the existence of a trace in which some task becomes permanently blocked are given in Fig. 3. The flow equations find flows in each FSA; the communication equation requires that the $b$ communication occur the same number of times in the two tasks; the hang inequality prevents both tasks from becoming permanently blocked waiting for the same communication action (i.e., events $a$ and $c$ cannot both occur); the requirement inequality requires that one task become permanently blocked (i.e., one of events $a$ or $c$ must occur). Notice that the inequalities have no integral solution, proving that deadlock is impossible. See [1, 12] for details.
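The argument can be checked by brute force on a plausible reading of the example. The sketch below assumes that each task has exactly two arcs from its start state to its final state (the internal action and the $b$ communication), and it writes the constraints in the form suggested by the description above; the variable names and the small search bound are assumptions, and an integer programming package would replace the enumeration in practice.

```python
from itertools import product

def solutions(bound=2, use_hang=True):
    """Enumerate integer solutions with every variable in 0..bound.

    xa, xc  : flow through the internal arcs a (task 1) and c (task 2)
    xb1, xb2: flow through the b arc in task 1 and in task 2
    """
    found = []
    for xa, xb1, xc, xb2 in product(range(bound + 1), repeat=4):
        flow1 = xa + xb1 == 1        # unit flow through task 1
        flow2 = xc + xb2 == 1        # unit flow through task 2
        comm = xb1 == xb2            # b occurs equally often in both tasks
        hang = xa + xc <= 1          # a and c cannot both occur
        require = xa + xc >= 1       # some task must become blocked
        if flow1 and flow2 and comm and require and (hang or not use_hang):
            found.append((xa, xb1, xc, xb2))
    return found

with_hang = solutions()
without_hang = solutions(use_hang=False)
```

With all constraints present there is no solution, in agreement with the text. Dropping the hang inequality admits the single spurious solution in which both tasks block, which shows why that inequality is needed.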
IV. Method for Comparison
The performance of an analysis technique depends on many different factors, including the examples to which it is applied, the property to be verified, the quality of the technique's implementation, the way in which problems are modeled/specified in the input language, and possibly other parameters specific to the particular technique (e.g., the OBDD variable ordering for symbolic model checking). A method for comparison must control all these factors in such a way that the resulting performance data gathered may be meaningfully compared. In this section, we describe a method for empirically evaluating deadlock detection techniques for Ada tasking programs.
A. Selection of Examples
As noted in Section II, there are many different kinds of concurrent systems. We restrict the scope of our evaluation to one kind of concurrent system, Ada tasking programs, for several reasons. First, this allows us to use a very simple interleaving model of concurrency in which only one type of communication must be represented. Asynchronous protocols are more naturally represented using a model with explicit message buffers. Hardware circuits are more naturally represented using models that allow the next state of a component to depend on the states of many other components and allow many components to change state simultaneously. A model that could represent everything would probably represent nothing well. Second, given our limited resources, the results of an evaluation over a limited class of systems are more meaningful since the range of examples provides a better coverage of that particular class. Third, we are familiar with this class of systems and have a collection of examples to use. Of course, this restriction in the class of systems limits the scope of our results. The experiments described in Section V must not be interpreted as an evaluation of the techniques in general, but only as an evaluation of the techniques as applied to Ada tasking programs.
Given this restriction on the class of systems, the technique should be tried on as many examples as possible. Many examples are scalable in some parameter and applying a technique to several sizes of such an example gives some indication of the scalability of the technique. We collected as many real Ada tasking programs as possible and also used standard benchmark examples from the concurrency analysis literature. The examples analyzed are listed in Section V-A. Our choice of Ada reflects a standard in the field of concurrent software analysis [1, 14, 28, 30, 43, 46].
B. Selecting a Property
We used the techniques to test for deadlock in the communications protocol used by the tasks. We selected deadlock since it is almost always an undesirable property in this setting and is essentially the same for all examples (i.e., a program deadlocks if and only if its automaton contains a deadlock state, as defined in Section II). More complex properties, such as mutual exclusion or starvation, are more system specific and often require more knowledge of a program than is present in its source code. As with the restriction to one class of systems, our restriction to one particular property limits the scope of our results. We note, however, that the verification of any safety property can be reduced to a check for deadlock [21].

Given our decision to use existing implementations of the techniques, we are faced with a problem: each analysis tool has its own unique specification language. We see two solutions to this problem. We could specify each example directly in each tool's specification language, or we could specify each example in some canonical form and then generate the input for each tool automatically from this canonical form. The first approach is technically simpler since it avoids the tricky issues involved in automatic generation of input. On the other hand, if we specify an example in two different specification languages, on what formal basis can we say that these specifications represent the same system? This issue is more serious than it may appear since specification languages can be very different and slight changes in the way a system is specified can produce large variations in the performance of a tool. We must be sure that specifications of the same example in different languages are formally equivalent before we can meaningfully compare the performance of the techniques on the example. We believe this requires using the second approach.
The next step is to decide on a canonical form for the examples. One possibility is to use the input language for one of the tools as the canonical form. The problem with this approach is that the semantics of most specification languages are sufficiently complex and varied that translating one into another is very difficult. Each language has constructs that might be awkward to represent in another, thus the tool whose input language is considered canonical might enjoy an unfair advantage in the evaluation. Since all of the tools, regardless of their input language, represent a concurrent system as a finite-state transition system, we instead use the more abstract model of concurrent systems from Section II as the canonical form for all of the examples. This model has simple semantics, thus it is relatively straightforward to embed in the richer semantics of the specification languages. Given each example specified as a set of communicating FSAs, we use translators to automatically generate semantically equivalent input for the different tools. The translators for SPIN+PO and SMV are described in Sections IV-F and IV-G, respectively. The back end of INCA (which performs the analysis) accepts communicating FSAs directly, thus no translation is required. The communicating FSAs representing each example are generated automatically from an Ada-like specification by the front end of the INCA tool. This aspect of the comparison method is summarized in Fig. 4.
Although it may seem that we are using the input language for INCA as the canonical form, we view the front end of INCA only as a tool for constructing the canonical form. INCA constructs the FSA for each task using standard techniques for constructing control flow graphs [17]; these automata are very similar to those produced (internally) by SPIN or any other tool that constructs a finite-state representation of the control flow of imperative code. Thus we do not believe that using the front end of INCA to produce the communicating FSAs conveys any advantage on the back end of INCA, which performs the analysis. While it is true that the input for INCA does not go through a translator, we could easily have written an INCA translator that reversed the mapping performed by the INCA front end (i.e., converted the communicating FSAs back to the original Ada-like specification) and then applied the whole of INCA to this translated input. Since the same communicating FSAs would be analyzed in either case, however, the performance of INCA would not have changed. The real issue is whether our choice of canonical form introduces a bias; we discuss this issue in Section IV-E.
A description of the algorithm used by the front end of INCA to translate our Ada-like specification language into communicating FSAs is given in [9]. Since the details of this translation are extensive and probably beyond the scope of this paper, here we give only an example of a sample specification and the FSAs generated from it. Fig. 5 shows our Ada-like specification for the basic dining philosophers problem and Fig. 6 shows two of the FSAs generated by the INCA front end from this specification (we selected this example because it has the smallest specification and is probably the most familiar). The constant Problem_Size is set by the INCA front end so that varying sizes of the example can be generated from the same source code. The task discriminant I is set to the index in the array at which the philosopher/fork task is placed. A rendezvous between two tasks is modeled by a shared symbol that encodes the caller, the acceptor, and the entry. For this example, the FSA state encodes only the syntactic location within the source code. For many examples, the values of a few small ranged variables (e.g., flags or counters) must be encoded into the task state for accurate modeling of the task's synchronization behavior; adding variables to the task specifications causes INCA to perform this encoding.
E. Identifying Potential Bias
Our model of concurrent systems may introduce a bias against SPIN and SMV. While our model is very natural for representing Ada tasking programs, it is not particularly appropriate for representing asynchronous protocols or hardware, the domains for which the other tools were designed. We believe this bias is much worse for the OBDD-based technique for two reasons. First, we use an interleaving model of concurrency rather than a simultaneous model (in which multiple actions can occur simultaneously). OBDD-based techniques generally perform better on simultaneous models, although this depends on the communication structure of the system [32]. Second, the encoding of task variable values within a monolithic task state may increase the size of the OBDDs needed to represent the transition relation of the task's FSA. We elaborate on this second effect.
Large FSAs are generated by data-intensive tasks containing variables that must be encoded into the state of the task for accurate modeling of the task's synchronization behavior. When a system is directly specified in the SMV input language, SMV can encode the states more efficiently for representation by OBDDs. For example, consider again the $n$-bit counter with states $0, \ldots, 2^n - 1$ and transition relation $\Delta = \{(i, a, (i + 1) \bmod 2^n)\}$. Since addition can be represented very efficiently with OBDDs, the OBDD representing this transition relation would be much smaller if

The model was chosen to match the domain of the examples, not the analysis tools. Since we are conducting the evaluation over a domain for which SPIN and SMV were not designed, the introduction of some bias seems unavoidable. INCA was built to analyze Ada tasking programs, thus it does not suffer from this bias. In Section VI, we explore the extent of the bias by considering different models.
F. SPIN Translator
Here, we describe how the communicating FSAs representing a concurrent system were translated into Promela, the input language for SPIN. Promela is a guarded command language like CSP [24], with a C-like syntax. It directly supports communicating processes, thus the translation is relatively straightforward.
Each FSA is represented by a process created using the proctype declaration. Each state of the FSA is represented by a statement labeled with the name of the state. The statement implements the transitions out of the state it represents using goto statements guarded by the action causing the transition. Fig. 1 shows a pair of communicating FSAs and Fig. 7 shows the Promela code generated for these FSAs (with some comments added by hand).
The if statement in Promela is like the alternative command in CSP, allowing an arbitrary selection among alternatives whose guards are true. There is one alternative for each transition out of the state. If the state has no out transitions, then the statement for that state is 0, which is not executable, causing the process to halt. Each alternative representing a transition on action A to state i consists of a goto statement to state i's label guarded by a statement representing A. Internal actions are always enabled and are represented by the statement skip. Communication actions are discussed below. If a state label begins with the string "end", then it is considered a final state of the process for purposes of deadlock detection.
Promela supports communication between processes via channels, which are declared like ordinary variables having type chan. Channel variables must be initialized, indicating the number of messages that the channel can buffer. We use channels with zero capacity to implement synchronous communication. We declare such a channel for each communication action in the FSAs. Each communication action is shared by two processes, one of which is designated the sender, and the other, the receiver. The Promela statements c!m and c?m are used to send and receive a message of type m to/from a channel c, respectively. Since the channels have zero capacity, these statements are executable (i.e., may be chosen in the if statement for a state) only when both processes containing the corresponding communication action are in states in which they can take that action. No information is passed in the messages, thus we declare and use a single message type synch using the mtype declaration.
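The translation rules above can be sketched as a small generator. This is an illustration under stated assumptions, not the translator used in the evaluation: the function names, the label scheme (`s1`, `end_s2`), the use of `active proctype` to start each process, and the initial `goto` to the start state are all choices made for the example, and the emitted text has not been run through SPIN.

```python
def label(state, finals):
    """Final states get labels beginning with "end"."""
    return "%s%s" % ("end_s" if state in finals else "s", state)

def header(channels):
    """One zero-capacity channel per communication action."""
    lines = ["mtype = { synch };"]
    lines += ["chan %s = [0] of { mtype };" % c for c in sorted(channels)]
    return "\n".join(lines)

def fsa_to_promela(name, start, finals, trans, sends):
    """Emit a Promela process for one FSA.

    trans maps state -> list of (action, next_state). An action present in
    `sends` is a communication (True: this process sends, False: receives);
    any other action is internal and is guarded by skip.
    """
    targets = {t for outs in trans.values() for _, t in outs}
    states = sorted({start} | set(trans) | targets)
    blocks = ["  goto %s" % label(start, finals)]
    for s in states:
        outs = trans.get(s, [])
        if not outs:
            # No out transitions: the non-executable statement 0.
            blocks.append("%s:\n  0" % label(s, finals))
            continue
        alts = []
        for action, target in outs:
            if action in sends:
                guard = "%s%ssynch" % (action, "!" if sends[action] else "?")
            else:
                guard = "skip"
            alts.append("  :: %s -> goto %s" % (guard, label(target, finals)))
        blocks.append("%s:\n  if\n%s\n  fi" % (label(s, finals), "\n".join(alts)))
    return "active proctype %s() {\n%s\n}" % (name, ";\n".join(blocks))

# Assumed two-task example: internal actions a and c, shared action b.
decls = header({"b"})
task1 = fsa_to_promela("Task1", 1, {2}, {1: [("a", 2), ("b", 2)]}, {"b": True})
task2 = fsa_to_promela("Task2", 3, {4}, {3: [("c", 4), ("b", 4)]}, {"b": False})
```

Printing `decls`, `task1`, and `task2` in that order gives a candidate Promela model in which task 1 is the sender and task 2 the receiver on channel `b`.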
Promela also supports the specification of safety and liveness properties. We do not generate such specifications in our translation, causing SPIN (and SPIN+PO) to search only for invalid endstates (i.e., deadlocks).
G. SMV Translator
We now describe how a set of communicating FSAs is translated into the SMV input language. Unlike Promela and Ada, the SMV input language was designed primarily for hardware speci cation. In the SMV language, systems are speci ed as a set of variables and a set of functions that de ne the next value of those variables in terms of the current values. Facilities for constructing, replicating, and connecting components are also provided. While this is convenient for specifying gate-level logic and even certain kinds of protocols, we found it awkward for specifying communication between sequential processes. Fortunately, the SMV language also provides an escape that allows the transition relation to be speci ed directly. This was meant speci cally to facilitate writing translators from other languages into SMV and proved invaluable for this comparison.
The SMV speci cation we generate has four parts. This section presents the results of the experiments conducted using the comparison method described in Section IV. We applied the SPIN+PO, SMV, and INCA to seventeen families of scalable examples and to several non-scalable programs, measuring the amount of time and memory used by the tools to check for deadlock. For each scalable example, we applied each tool to several sizes of the example to gauge the scalability of the technique. We also applied SPIN, a straight reachability analyzer, to each example to provide a baseline for measuring the e ectiveness of the other tools in curbing the state explosion problem. From the raw performance data for the scalable examples, we derived a numerical measure of how fast the resource requirements grew with the problem size. Using these growth rates and the raw data for the non-scalable examples, we were able to correlate the scalability of each technique to certain features of the examples and thus characterize the kinds of programs to which each technique might best be applied.
This section is organized as follows. Section V-A presents a brief description of the examples used in the evaluation. Section V-B discusses some issues that arose in modeling these examples for analysis. Section V-C describes the general approach we used (i.e., what analyses should be run). Section V-D presents the raw data from the analyses and the details of its collection. In Section V-E, we develop a numerical measure for the rate of growth of the time/memory required by the tools as an example is scaled up. Finally, in Section V-F, we use these growth rates to draw conclusions about the scalability of the various techniques on different kinds of programs.
A. Examples
This section lists the examples analyzed. We use m as the size parameter for scalable examples, and denote the size m version of scalable example X as X(m). Space permits only a brief description of each example. The Ada-like specifications of these examples, the communicating FSAs generated from them by the INCA front end, and the Promela and SMV inputs generated by the translators are available for anonymous FTP on ftp.ics.hawaii.edu in /pub/corbett/eval.tar.Z.
Some raw characteristics of each example are summarized in Table I, whose columns give the lines of code, the number of tasks, the number of unique rendezvous, and some additional features used in our analysis of the results in Section V-F. The lines of code given is for the Ada-like input to INCA; this is not an accurate measure of the size of the resulting system since all the scalable examples use arrays of tasks, and thus this measure does not vary with the problem size. For specifications extracted from real programs, we give the size of our specification rather than the size of the original program since the amount of unmodeled non-concurrent code is not relevant to our analyses. We count each distinct shared symbol as a unique rendezvous; this symbol encodes the caller, the acceptor, the entry name, and any parameters passed in the rendezvous.
The examples analyzed were:

Alternating Bit Protocol (ABP) A simple but often analyzed example modeled with 6 tasks representing two users, a sender, a receiver, and two lossy channels.

[...] Although not a very realistic problem, it does contain a nontrivial deadlock and is probably the most commonly analyzed example [1, 14, 28, 42, 46]. We model each of the m philosophers and m forks with a task. These tasks synchronize to model forks being acquired and released. In addition to the standard version (DP), which can deadlock, we also analyzed several versions of the problem where deadlock is prevented. In the version with host (DPH), there is an additional host task with which a philosopher must synchronize before attempting to acquire her forks. This might model a real-world situation in which a task wishing [...]

[...] Customers arrive and pay the operator for gas. The operator activates a pump, at which the customer then pumps gas. When the customer is finished, the pump reports the amount of gas actually pumped to the operator, who then gives the customer her change. We analyzed versions with one operator task, two pump tasks, and m customer tasks. We analyzed two different versions of this example. In the original version (GASQ) from [1, 23], the operator task queues customer requests and must keep track of which customers are waiting for each pump and in what order. In the non-queuing version (GASNQ) from [14], the operator does not enforce a first-come-first-serve order on the customers and must only record the number of customers waiting for each pump in order to activate the pump when any waiting customers remain.

For a few of the examples (DPFM, ELEVATOR, GASNQ, GASQ), it is convenient for a task accepting an entry to be able to test the value of a call parameter before deciding whether to accept the call. This functionality is absent from Ada 83, although the requeue statement of Ada 95 effectively adds it.
The same effect can be achieved in Ada 83 by using a different entry for each value of the critical parameter, though this often makes scaling the specification cumbersome. In our specifications, we employ a special assume statement for this purpose. For example, the operator task in GASNQ should not accept the CHARGE entry for a pump for which no customers are currently waiting. This can be expressed as:

    accept CHARGE (P : PUMP) do
       assume WAITING(P) > 0;
       ...
    end CHARGE;
The assume statement is modeled after the assuming clause used by Yeh and Young in [44].
Although we believe all of our examples except the standard dining philosophers (DP) are free of deadlocks, our models of the Ada programs DARTES, KEY, and SPEED contain spurious deadlocks due to the presence of a global variable used for synchronizing task termination. Currently, global variables are not processed by the INCA front end, although they can be represented in our model of concurrent systems using an additional FSA for each variable to hold the value of that variable, along with communication actions for testing and setting the value of the variable. Our model of the Q program also contains a spurious deadlock due to the abstraction of timing information (the program makes heavy use of conditional entry calls). Since we had so few examples that contain deadlocks, we chose to leave these spurious deadlocks in the models to provide more data on how quickly the tools can find a deadlock when one exists. One drawback is that these deadlocks are not subtle: unlike the deadlock in the dining philosophers problem, these deadlocks would likely be found by random simulation.
Several of the examples (DAC, ELEVATOR, HARTSTONE, KEY, Q, SENTEST, SPEED) used the Ada terminate alternative to synchronize the termination of groups of tasks. In all of these examples, all tasks are declared within a single package; thus a task will select its terminate alternative exactly when all other tasks are either terminated or similarly blocked on terminate alternatives. We represent this in our model by making a state of a task FSA that can select a terminate alternative a final state.
C. General Approach
We ran SPIN+PO, SMV, and INCA on the examples described in Section V-A. We also ran SPIN, a straight reachability analyzer, on each of the examples to give a baseline for measuring the efficacy of the techniques in curbing the state explosion problem. For all examples, we measure the CPU time and memory consumed by the tools in performing the analysis.
For each scalable example, we selected four arithmetically increasing sizes ending near the maximum size that could be handled by all of the tools. This facilitates comparison of the tools, although it makes most of the measured run times small since the maximum size is set by the tool that performs worst on that example. The step value for the size growth was chosen to magnify the variation in the resource measurements. For most examples, this meant dividing the size range into roughly equal pieces (e.g., if the maximum size is 12, run sizes 3, 6, 9, 12). For a few examples, however, the resource requirements for one or more tools increased very quickly with the size and were thus very small until the size approached the maximum. For such examples, we chose to use larger sizes to minimize the number of small measurements, which are dominated by fixed overhead.

Rather than finding the largest size each tool can handle given certain resource constraints, as we did in the preliminary version of this paper [10], we simply measure the growth in the resources consumed as the example is scaled up (the calculation of these growth rates is described in Section V-E). We believe these growth rates are more meaningful than the maximum sizes gathered in [10] for several reasons. First, various kinds of constant overhead in the implementations are factored out. Second, for some of the examples, the use of translated input, not the tool itself, imposes the maximum size (see the discussion of the difficulty in scaling HARTSTONE and SENTEST in Section V-D). For larger sizes of such examples, either the INCA front end runs out of memory building the FSAs or (more often) the translated input, being much larger than a native specification, is too large for the tool.
This is a limitation of our comparison method, although a more compact canonical model (e.g., the EFSAs of Section VI-B) and better translators, which use language constructs like Promela arrays to replicate components, would largely solve this problem. Third, the resource constraints, although reasonable, are somewhat arbitrary, especially the limit on CPU time. In [10] we used three hours, though one could argue for a higher or lower limit.
In Section V-E, we will investigate the relationship between the size of the concurrent system and the resources consumed by the tools to perform the analysis. Given a concurrent system in the model of Section II, we must define exactly what the size of this system is. We might use the number of reachable states in the model ($|S|$), but this obscures the state explosion by making the size measure itself explode as systems are scaled up. Also, it bears little relation to the size of the Ada program from which the model was derived. Alternatively, we might use the number of tasks ($n$). This is closer to the programmer's view of a program's size, but does not take into account the size of the tasks (e.g., the 2-state fork tasks in DP are counted the same as the 1400-state operator task of GASQ(4)). To account for task size, we might use the number of component states ($\sum_i |S_i|$) as the measure, but this obscures the state explosion that results from the use of data within a component (as in the gas station and elevator examples). Instead, we use the number of bits required to store the state of the task as a measure of the task's size, and the sum of these measures ($\sum_i \log_2 |S_i|$) as a measure of the size of the program. To avoid confusion, we call this measure of size the scale of the example and continue using the word "size" to denote the value of m in the scalable examples. We note that, for all of our examples, the scale is a linear function of the size, thus an arithmetic sequence of sizes of an example will have an arithmetic sequence of scales.
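The scale measure can be illustrated with a minimal Python sketch. The function names and the philosopher state count below are assumptions made only for illustration; the only figure taken from the text is the 2-state fork task of DP. The sketch computes the sum of log2 |S_i| over the task FSAs and shows that, for a family built from arrays of identical tasks, the scale grows linearly with the size m.

```python
import math


def scale(component_state_counts):
    """Return the scale of a system: the sum over tasks of log2(|S_i|)."""
    return sum(math.log2(n) for n in component_state_counts)


def dp_like_state_counts(m, fork_states=2, philosopher_states=4):
    """State counts for m fork tasks and m philosopher tasks.

    fork_states=2 matches the 2-state fork tasks mentioned in the text;
    philosopher_states=4 is a hypothetical value chosen for illustration.
    """
    return [fork_states] * m + [philosopher_states] * m


# An arithmetic sequence of sizes yields an arithmetic sequence of scales.
scales = [scale(dp_like_state_counts(m)) for m in (3, 6, 9, 12)]
print(scales)
```

Under these assumed state counts each size step of 3 adds the same amount to the scale, which is the linearity property the growth rate calculation of Section V-E relies on.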
D. Raw Data
In this section, we present the data from our experiments along with various details of how they were collected.
All experiments were conducted on a SPARCstation 10 Model 51 with 96 MB of memory. Analysis times are reported in user CPU seconds collected using the time command of tcsh and the (get-internal-run-time) function of Allegro Common Lisp. Statistical analysis of the behavior of these functions reveals that the time they report has a near normal distribution with a standard deviation around 0.06. If a tool takes very little time on all sizes of an example, these small variations can have a significant effect on the growth rate we calculate in Section V-E. Therefore, if the measured run times of a tool on all sizes of an example are less than two seconds, then we run the tool 100 times and use the average time as a point estimate. For the INCA results, the analysis times reported include the translation from an Ada-like source language to the FSAs. For the other tools, the analysis times include only the actual run times of those tools, not the time to translate the Ada-like source to FSAs, nor the time to translate the FSAs into the tool's input language.
The tools themselves report the amount of memory they used; we trust these figures, but regard them as approximations. Memory usage is very difficult to measure externally since a tool will generally allocate more memory than it actually uses. Also, there is some variation among the tools as to what memory (i.e., code, stack, heap) is counted in the total reported. These differences are small constant factors, however, and do not significantly affect the rate of growth of memory usage as examples are scaled up.
For these experiments, we used version 1.5.10 of SPIN, version 2.0 of the partial order package for SPIN, an unofficial version of SMV dated 8/6/93, and version 3.2 of INCA. SPIN, SPIN+PO, and SMV are all written in C. INCA is written in Common Lisp, with the integer programming package written in FORTRAN.
SPIN+PO and SMV take various command line parameters that can affect their performance. We used the parameters suggested by the authors of those tools, which we found produced the best performance. SPIN+PO was run with the "proviso" disabled (the -p flag). The proviso causes SPIN+PO to generate more states than are needed for deadlock detection in order to allow state assertion checking. Since we are evaluating deadlock detection only, this is not needed and removing it improves the performance of the tool. SMV was run with the -f flag, which calculates the reachable states of the system before checking the CTL formula.
SPIN and SPIN+PO use arrays whose sizes are set by various command line parameters, including: the maximum number of processes, the size of the state vector, and the maximum search depth. While the default values for these parameters sufficed for most of the examples, we had to increase them to complete the analysis of some of the examples. The maximum number of processes (default: 32) was raised to 34 for DARTES and RW, and to 110 on HARTSTONE and SENTEST. The state vector size (default: 1024) was raised to 2048 on DARTES, HARTSTONE, and SENTEST. The maximum search depth (default: 10K) was raised to 100K on DP (for SPIN only). Parameters were raised uniformly for all sizes of a scalable example. This increased the memory usage unnecessarily for smaller sizes of those examples, but we feel that this is more consistent since we do not vary these parameters for the different sizes of the other scalable examples. In our analysis of performance in Section V-E, we will be concerned primarily with growth rates rather than magnitudes.
For SPIN and SPIN+PO, the analysis tool first generates C source code for an analyzer which is then compiled and run to perform the analysis. We do not include the generation and compilation times in our data. The translated input we generate is much larger than the equivalent Promela code that would be used to specify the same system, and takes much longer to generate and compile (e.g., for GASNQ(5), a native Promela specification takes 5 seconds to generate and compile, compared to 62 seconds for the translated input). Thus the real run times of these tools would be slightly larger.
The HARTSTONE and SENTEST programs have a very simple communication structure. Indeed, the number of states in these examples grows linearly with the problem size (m). We had difficulty scaling these examples to the limits of the tools, however, due to our use of translated input for SPIN and SPIN+PO. As noted above, the translated input is usually much larger than the equivalent native Promela specification. For large sizes of HARTSTONE and SENTEST, the generated analyzer was too big for our C compiler, and thus could not be run.
For SMV, the order in which the state variables are declared in the input file is the order in which they appear in the OBDDs. Since this greatly impacts the performance of the technique, we assisted the tool by providing a reasonable ordering for each example based upon our knowledge of OBDDs and some limited trials with different possible orderings. Unfortunately, there is no general algorithm to determine the best order for a particular situation, although heuristics exist. In general, the state variables for tasks that communicate were placed as close together as possible.
The examples DP, DARTES, KEY, Q, and SPEED contain deadlocks. For these systems, SPIN, SPIN+PO, and SMV display the sequence of state changes leading up to the deadlock state, and INCA finds a solution that is interpreted as a sequence of actions of each task. For all other examples, the tools correctly declared that the system was free of deadlocks.
The raw performance data are given in Tables II-III. The columns of the table show [...]

In this section, we consider what our data suggest about the scalability of the evaluated techniques. In general, it is difficult to characterize the scalability of an analysis technique. Complexity results indicate that there must exist problems on which a technique will not scale. In practice, we are more interested in the average case, but it is difficult to know what an average program looks like. Techniques may scale well on certain kinds of programs and poorly on others. Furthermore, most techniques are sufficiently complex that it is hard to estimate their cost on a particular problem a priori. The best we can do is examine the performance of the technique on a nontrivial collection of programs and try to determine the kinds of programs on which the technique seems to scale well.

[...] we have a set of four points $\{(x_i, y_i) \mid i = 1, 2, 3, 4\}$ where each $x_i$ is the scale of a different size of the example and each $y_i$ is the amount of the resource consumed by the tool on that size. By looking down each column, we can get an informal sense of how quickly the resource requirements are growing with the scale of the problem. Such examination of the data is tedious, however, and makes comparisons difficult. We explored graphing the raw data, but the range of the measured resource units is too great to plot all the data for each example on a single graph, and using many separate graphs with different scales does not facilitate comparison. We tried selectively plotting only certain data or using mathematical transformations to make the data fit (e.g., log-linear or log-log graphs), but we found that the resulting graphs were at best difficult to interpret, and at worst extremely misleading. In the end, we decided to obtain a numerical measure of the rate of growth of each resource of each tool on each example.
We want to measure how quickly the resource requirement ($y_i$) is growing with the scale ($x_i$) of the example. At first, we considered fitting a curve of some kind to the points $\{(x_i, y_i) \mid i = 1, 2, 3, 4\}$ and using a parameter of the fitted curve to estimate the growth rate (e.g., we might use a linear fit and take the slope, or fit the points to $2^{ax+b}$ and use the parameter $a$). Unfortunately, the underlying forms of the actual resource functions generating the data are unknown. Furthermore, they are clearly not of a single form; some of the data appear to be linear, while some appear highly exponential. Rather than make unjustifiable assumptions about the form of the actual resource functions, we simply estimate how much faster the function appears to be growing on the right side of the interval $[x_1, x_4]$ than on the left side. Specifically, for each data set $\{(x_i, y_i) \mid i = 1, 2, 3, 4\}$, we calculate a growth rate for the resource function by taking the ratio of the slope of the line segment connecting $(x_2, y_2)$ and $(x_4, y_4)$ to the slope of the line segment connecting $(x_1, y_1)$ and $(x_3, y_3)$. (We considered using the ratio of the slopes of the line segments connecting $(x_1, y_1)$ to $(x_2, y_2)$ and $(x_3, y_3)$ to $(x_4, y_4)$, but in some data sets, consecutive resource measurements are equal and thus the slope of one of these segments is zero.) We also determined the growth rate of the state space for each example in a similar way (i.e., by using the points $\{(x_i, |S_i|) \mid i = 1, 2, 3, 4\}$ where $S_i$ is the set of global states of the $i$th size of the example). The growth rates obtained are shown in Table IV.
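The growth rate calculation can be sketched in a few lines of Python. This is an illustrative reading of the definition above, not the code used to produce Table IV; the function names and the sample data are assumptions. The data sets are synthetic and were chosen so that the arithmetic is exact.

```python
def slope(p, q):
    """Slope of the line segment connecting points p and q."""
    (xp, yp), (xq, yq) = p, q
    return (yq - yp) / (xq - xp)


def growth_rate(points):
    """Ratio of the slope over (x2, x4) to the slope over (x1, x3).

    `points` holds four (scale, resource) pairs ordered by scale.
    Raises ZeroDivisionError if y1 == y3, i.e. the left slope is zero.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four (x, y) points")
    p1, p2, p3, p4 = points
    return slope(p2, p4) / slope(p1, p3)


xs = (3, 6, 9, 12)                          # hypothetical scales
linear = [(x, 100 * x + 7) for x in xs]     # fixed overhead 7, factor 100
exponential = [(x, 2 ** x) for x in xs]     # resource doubles per unit scale

print(growth_rate(linear))        # 1.0
print(growth_rate(exponential))   # 8.0
```

The linear data set yields 1.0 regardless of the additive overhead or the multiplicative factor, while the exponential data set yields a rate well above 1.0, which is the sense in which the measure captures apparent curvature only.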
This growth rate measures only the apparent curvature of the actual resource function and factors out any fixed overhead (e.g., the memory taken by the analyzer code) or constant factors (e.g., the units in which the resources are measured, the speed of the machine used to run the tool). Consequently, we may not only compare the growth rate of a single resource for different tools and examples, but we may directly compare the growth rates of different resources. On the other hand, our measure conceals the actual slope of the growth function over the interval $[x_1, x_4]$ (e.g., both $f(x) = x$ and $g(x) = 100x$ have a growth rate of 1.0). Although the growth rate is clearly more important than such constant factors in the limit, in practice large constant factors may have a significant impact on the scalability of a technique over the range of sizes on which its tool can be run. Thus, if the slope of a resource function over the interval $[x_1, x_4]$ is large, then this should be considered along with the function's growth rate in estimating the scalability of the technique on the example.

As discussed in Section V-C, we chose to measure growth rates rather than determine the maximum sizes each tool could handle given certain resource constraints. One concern with this approach is that the behavior of the tool on large sizes of an example may be dominated by different factors than its behavior on the small sizes over which we measure the growth rate. In other words, the measured growth rate may not be an accurate characterization of the scalability of a tool. We validate our scalability measure by comparing the growth rates calculated here with the maximum sizes determined in [10] on a couple of examples.
First we consider DP, an example on which most of the tools can be scaled to much larger sizes than those shown in Table III. For this example, SPIN, SPIN+PO, and INCA all exhausted the 64 MB limit imposed in [10], while SMV exceeded the three hour time limit instead. The memory growth rate was 5.9 for SPIN, 2.7 for SPIN+PO, 0.5 for SMV, and 1.0 for INCA. From these rates, we would expect SPIN+PO to be able to handle significantly larger sizes than SPIN, and INCA to be able to handle much larger sizes. In fact, SPIN exhausted its memory at size 14, SPIN+PO at size 22, and INCA around size 325. Note that, while it would be difficult to predict the maximum sizes without additional information, they are roughly consistent with the growth rates.
We also consider the data intensive example GASNQ. In [10], SPIN and SPIN+PO exhausted the memory limit, while SMV and INCA exhausted the time limit. The memory growth rate is 9.1 for SPIN and 7.6 for SPIN+PO. From these rates, we might expect SPIN+PO to be able to scale a bit farther than SPIN. In fact, both tools exhausted their memory at size 6. Since the memory growth rate is high, each additional customer adds a great deal of memory, and SPIN+PO could not take the extra step without exceeding the limit. The memory growth rates for SMV and INCA are 1.5 and 1.8, respectively. As expected, memory is not a problem for these tools on this example. The time growth rate is 4.0 for SMV and 2.9 for INCA. From these rates, we might expect INCA to be able to scale a bit farther than SMV. In fact, both tools exhausted the time limit at size 10. Examining the raw data reveals that the time function for INCA has a much larger slope than the time function for SMV, so again, the maximum sizes seem to be consistent with the growth rates. We therefore believe that our growth rates provide a useful characterization of the behavior of the tools on larger sizes of the examples, and thus that they are a reasonable measure of scalability.
F. Results
We now discuss the implications of our growth rates for the scalability of the techniques on different kinds of programs. For convenience in our discussion below, we will refer to rates below 2.0 as low, rates above 5.0 as high, and intermediate rates as moderate. We chose these boundaries such that rates near 1.0 (linear functions) would be low, and such that most of the state growth rates would be high, although we admit that this classification is somewhat arbitrary. A picture of the overall results is given in Fig. 9, which plots the growth rates from Table IV for each tool. We were able to correlate the performance of the different tools to various features of the example programs, principally:
Communication structure. We may view the communication structure of a program as a graph in which each task is represented by a node, and a (possible) rendezvous between two tasks is represented by an edge between the corresponding nodes. Our examples exhibited several different communication structures. In a linear communication structure, the tasks can be arranged in a line or ring such that each task communicates primarily with its neighbors. In a single star communication structure, most communication is between one particular task and the other tasks. A couple of our examples exhibited a combination of these two patterns (i.e., a line or ring of tasks communicating with their neighbors and one central task). Finally, in a multiple star communication structure, several tasks communicate with many other tasks.

Task size/structure. As discussed in Section IV-E, data intensive tasks require that the values of certain task variables be encoded into the state of the task's FSA for accurate modeling of the task's synchronization behavior. [...]

In Table I, we characterize the communication structure and task size/structure of each example. For the scalable examples, the task size classification is determined by the size of the tasks in the largest size of the example analyzed (e.g., the fork manager task has only 3 states in DPFM(2), but has 1025 states in DPFM(11) and is therefore classified as large in the table).
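The low/moderate/high terminology introduced at the start of this section can be expressed as a small Python sketch. The function name is an assumption; the thresholds (2.0 and 5.0) and the sample rates (the DP memory growth rates quoted in Section V-E) are taken from the text.

```python
def classify_growth_rate(rate, low_bound=2.0, high_bound=5.0):
    """Label a growth rate as 'low', 'moderate', or 'high'.

    Rates below low_bound are low, rates above high_bound are high,
    and everything in between (bounds included) is moderate.
    """
    if rate < low_bound:
        return "low"
    if rate > high_bound:
        return "high"
    return "moderate"


# Memory growth rates on DP as quoted in Section V-E.
dp_memory_rates = {"SPIN": 5.9, "SPIN+PO": 2.7, "SMV": 0.5, "INCA": 1.0}
labels = {tool: classify_growth_rate(r) for tool, r in dp_memory_rates.items()}
print(labels)
```

Treating the two boundary values themselves as moderate is a choice made for this sketch; the text does not say how rates of exactly 2.0 or 5.0 are classified.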
Of the deadlock-free scalable examples, SPIN exhibited low growth rates only for HARTSTONE and SENTEST, examples on which the state spaces grow linearly with the scale. Although the synchronization structure of these examples is very simple (a master task starts/stops m worker tasks), such a structure may not be uncommon in software. SPIN also exhibited low growth rates on KEY, an example for which SPIN is able to find the deadlock state without exploring a significant fraction of the state space.
SPIN+PO exhibited significantly lower memory growth rates than SPIN for CYCLIC, DAC, DP, DPD, OVER, and RING. The common feature of these six examples is that their communication structure is linear. This structure creates the many commuting transitions that allow the partial order technique used by SPIN+PO to achieve significant reduction in the state space. Although the partial order state space reduction helped in most cases (especially with the memory growth rate, which is generally the limiting factor for the state space tools), SPIN+PO exhibited somewhat higher growth rates than SPIN for DPH, ELEVATOR, and RW. These examples contain a single star communication structure. Unlike SPIN, SPIN+PO creates a data structure for each possible pair of synchronizing transitions, and thus performs poorly on examples with this communication structure. This memory overhead is an artifact of the implementation, not of the technique itself.
SMV exhibited low time growth rates for CYCLIC, DAC, and RW, and moderate time growth rates for DP, DPD, DPH, GASNQ, HARTSTONE, KEY, MMGT, RING, and SENTEST. Like SPIN+PO, SMV performed better on examples with linear communication patterns, although it also performed reasonably well on examples with a single star communication pattern. SMV performed worse on ELEVATOR, FURNACE, GASNQ, GASQ, and DARTES, examples whose communication patterns contained multiple stars. This is not surprising since such a nonlinear structure makes it difficult to find a good variable ordering for the OBDDs. SMV also exhibited moderate/high growth rates for programs with data intensive tasks. These programs include DPFM, which has a single star communication pattern, as well as ELEVATOR, GASNQ, and GASQ. Although symbolic model checking has been used primarily for the verification of hardware, our experience indicates that it may also prove effective for verifying software. Note that SMV performed better than SPIN on most of the scalable examples, exhibiting lower growth rates for time and much lower growth rates for memory.
INCA performs worst on DPFM, ELEVATOR, GASNQ, GASQ, and RING. Of these, DPFM, ELEVATOR, GASNQ, and GASQ all contain one data intensive task whose size grows rapidly as the example is scaled up. Unlike the other tools, INCA is generally not sensitive to the communication structure of the program, but rather to the kind of tasks that comprise it. The time required to solve the ILP problems INCA generates increases very rapidly with the size of the task FSAs, unless they have a simple linear structure, like those in DPH, HARTSTONE, RW, and SENTEST. As mentioned in Section III-E, INCA uses necessary conditions and thus its analysis can be inconclusive if these conditions are not strong enough. We note, however, that these conditions were strong enough for all of the analyses in this paper.
Meaningful comparisons are more difficult for the non-scalable examples. SMV is clearly worse than the other tools on Q and DARTES (large systems with simple deadlocks), but was clearly better on FTP (a large system with no deadlocks). This reflects a general trend in our experiments: SMV tended to be slower than the other tools in finding deadlocks. SPIN(+PO) and INCA use techniques that allow them to stop as soon as a deadlock state (or integral solution, in the case of INCA) is found. SMV must construct the OBDD for the reachable states of the system in its entirety before checking for deadlock states. We do not draw any conclusions from these data, however, since our sample contained too few programs with deadlocks, and all but one of the deadlocks were trivial and would likely have been found by random simulation. In general, it is much more difficult to evaluate the performance of analysis tools when they are used to find errors rather than prove their absence since the time it takes to find an error is very dependent on factors over which the analyst has little control (e.g., the order in which a reachability analyzer explores a state space). Note that SMV performed better than SPIN in finding the obscure deadlock in the standard dining philosophers (DP).
On each deadlock-free scalable example, SPIN exhibited a memory growth rate similar to the growth rate of the state space. SPIN+PO, SMV, and INCA, however, each exhibited significantly lower memory growth rates on several examples, indicating that the techniques used by those tools tend to require more time than memory as a problem is scaled. Since memory is usually the scarcer of the two resources, this often allows these tools to tackle larger problems.
VI. Alternative Models
In our experience, the most controversial aspect of our comparison method is our choice of communicating FSAs as the canonical model. Several people have questioned whether the choice of this model and the use of translators does not bias the evaluation against one or more of the techniques. To address this issue, we explore alternative canonical models. First, we consider using an informal model and manually generated native specifications for each analysis tool (thus avoiding the use of potentially biasing translators). Second, we consider using a more complex canonical model in which data values are made explicit. Finally, we consider a simultaneous model of concurrency. In the end, we found that these alternative models either did not affect or actually degraded the performance of the tools, thus increasing our confidence in the validity of our results.
A. Native Specifications
Rather than use a formal canonical model, we might have used an informal model of each example (e.g., a prose description) and used this to specify the example in each tool's specification language directly. The SMV input language is intended for describing systems much different from the programs we analyze. As a result, we would have had to convert the programs to some kind of state machine just to encode them in the language. Thus we had little choice but to use a more abstract model of the examples to generate the input for SMV. Promela, however, can easily specify communicating processes. To determine what effect using native Promela specifications might have had on the experiments, we selected two examples, coded them directly in Promela, and used SPIN (a straight reachability analyzer) to compare the number of states in the native model with the number of states in the translated model.
The first example we selected was a version of the standard dining philosophers problem in which deadlock is avoided by having the first philosopher pick up her left fork first while all other philosophers pick up their right fork first. This example is representative of many of the programs we analyzed in that no variables are modeled; the states of the automata representing the program encode only the control location within each task. For this example, the number of states in the native model was exactly the same as the number of states in the translated model, a reassuring result.
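The deadlock-avoidance scheme described above can be illustrated with a small explicit-state reachability search. This is a minimal sketch, not the FSA model of Section II and not what SPIN does internally: the three control locations per philosopher (thinking, holding the first fork, eating), the fork numbering, and the names `fork_order`, `successors`, and `explore` are all assumptions introduced for illustration.

```python
from collections import deque


def fork_order(i, n, asymmetric):
    """Return the (first, second) fork indices picked up by philosopher i."""
    left, right = i, (i + 1) % n
    if asymmetric and i == 0:
        return left, right   # philosopher 0 reaches for her left fork first
    return right, left       # all others reach for their right fork first


def held_forks(state, n, asymmetric):
    """Forks in use, derived purely from the control locations."""
    held = set()
    for i, loc in enumerate(state):
        first, second = fork_order(i, n, asymmetric)
        if loc >= 1:
            held.add(first)
        if loc == 2:
            held.add(second)
    return held


def successors(state, n, asymmetric):
    """Interleaving successors: exactly one philosopher moves per step."""
    held = held_forks(state, n, asymmetric)
    result = []
    for i, loc in enumerate(state):
        first, second = fork_order(i, n, asymmetric)
        if loc == 0 and first not in held:
            nxt = 1          # pick up the first fork
        elif loc == 1 and second not in held:
            nxt = 2          # pick up the second fork and eat
        elif loc == 2:
            nxt = 0          # put both forks down
        else:
            continue         # this philosopher is blocked
        result.append(state[:i] + (nxt,) + state[i + 1:])
    return result


def explore(n, asymmetric):
    """Breadth-first search; returns (number of states, deadlocked states)."""
    start = (0,) * n
    seen = {start}
    queue = deque([start])
    deadlocks = []
    while queue:
        state = queue.popleft()
        nexts = successors(state, n, asymmetric)
        if not nexts:
            deadlocks.append(state)
        for t in nexts:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return len(seen), deadlocks
```

Calling `explore(n, asymmetric=True)` and `explore(n, asymmetric=False)` for small `n` contrasts the deadlock-free variant with the fully symmetric one. The state counts from this sketch depend on the assumed three-location philosopher and are not comparable to those reported by SPIN.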
The second example we selected was the non-queuing version of the common gas station problem (GASNQ). This example is representative of the programs in which the states of the task automata encode the values of task variables as well as the control location within the task. The story behind our selection of this example is interesting. We sent a draft of the predecessor to this paper [10] to Gerard Holzmann, the author of SPIN, to solicit comments on our use of his tool. At his request, we also supplied the generated Promela inputs. He was concerned that using translated inputs would unduly bias the evaluation, and gave us a version of GASNQ that he had directly coded in Promela. While our translated Promela could be scaled only to 5 customers, his native Promela could be scaled up to 50 customers, a worrisome result suggesting that our translated Promela code was much inferior to a native Promela specification.
Upon closer examination of his code, however, we noticed that his version of GASNQ was not quite the same as ours. For those familiar with the problem, the difference was that our operator task allowed many customers to prepay and kept a count of how many customers had prepaid at each pump so that the pumps could be activated so long as any waiting customers remained. This causes a state explosion in the operator task as the number of customers is scaled up. Holzmann, who worked directly from the translated Promela without knowledge of the problem or reference to our Ada-like specification, specified a system in which the operator allows only one customer to prepay at each pump. For this system, the size of the operator task does not increase with the number of customers.
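The difference between the two operator designs can be made concrete by counting the data valuations each one must distinguish. The following is only a rough sketch under stated assumptions: it counts per-pump counter values and ignores the operator's control locations, it assumes each customer has at most one outstanding prepayment, and the pump count of 2 and the function names are hypothetical choices for illustration rather than figures taken from the experiments.

```python
from itertools import product
from math import comb


def counter_valuations(customers, pumps):
    """Per-pump prepaid counts whose total does not exceed the customers."""
    return [v for v in product(range(customers + 1), repeat=pumps)
            if sum(v) <= customers]


def flag_valuations(pumps):
    """At most one prepaid customer per pump: one boolean flag per pump."""
    return list(product((0, 1), repeat=pumps))


if __name__ == "__main__":
    pumps = 2  # assumed number of pumps, for illustration only
    for customers in range(1, 8):
        multi = len(counter_valuations(customers, pumps))
        # Cross-check the enumeration against the closed-form count.
        assert multi == comb(customers + pumps, pumps)
        print(customers, multi, len(flag_valuations(pumps)))
```

Under these assumptions the counter-based operator has a number of valuations that grows with the number of customers, while the flag-based operator stays fixed once the number of pumps is fixed, which matches the qualitative difference described above.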
We pointed out this difference and wrote our own native Promela version of GASNQ to illustrate the structure of the program we intended to model. Because it was the first real Promela specification we had ever written, it was quite inefficient, and could be scaled only to 4 customers, one size smaller than our translated model. Holzmann then used our specification as a guide and modified his own version to allow multiple customers to prepay. This version could be scaled to 7 customers, two sizes larger than the translated Promela. When we examined this new version, however, we noticed that it too was not quite our GASNQ. Promela allows multiple processes to read from the same channel, while Ada allows only one task to accept an entry call. Holzmann's Promela used the same channel to represent several Ada entries, and achieved some reduction in the state space as a result. In this example, at most one task would call such an entry at any given time, thus the behaviors generated were the same. Without knowing that multiple entries sharing a channel would not be called simultaneously (something that would have to be verified independently), this reduction cannot be applied since it would not correctly model the synchronization behavior of the Ada tasks; in general, each entry must be modeled with its own channel. When we modified Holzmann's specification to use one channel per entry, the resulting model could be scaled to 5 customers, the same as the translated model, and at this size had roughly three times as many states as the translated model; again, a reassuring result.
This exchange gave us confidence that our Promela translator is not introducing a significant bias into the evaluation. More importantly, it also convinced us that the use of a canonical model is essential, since seemingly small differences in the specification of a problem can produce large variations in the resulting model.
B. Extended Model
The simple model of Section II is well suited for programs in which little or no data must be represented. For data intensive programs, however, that model may introduce a bias, as discussed in Section IV-E. We modified the INCA front end to build EFSAs from Ada-like specifications. We then wrote new translators for SPIN and SMV to generate input from this new canonical form. Both SPIN and SMV support arrays of integers, thus the translation was still relatively straightforward despite the richer semantics of EFSAs. We omit the formal definition of EFSAs and the details of these translators since both are extensive and are not used in the evaluation described in Section V. Note that by encoding the memory contents into the state of an automaton, INCA translates EFSAs into FSAs for use by the inequality necessary condition technique, which is thus not affected by the different model.
We applied the new translators to two examples and compared the performance of the tools on the inputs generated from EFSAs to their performance on the inputs generated from FSAs. The examples we used for this comparison were the dining philosophers with host (DPH) with five philosophers, and the non-queuing gas station (GASNQ) with two customers. The DPH example has only one task with data: the host task has a single variable, which is used as a counter. The GASNQ example has several tasks with data: the operator task has an array of two counters, and the customer tasks each have a variable storing the pump selected. The results of this experiment are shown in Table V, the columns of which show: the example, the model, the number of reachable states reported by SPIN, the analysis time (in seconds) for SPIN, the number of OBDD nodes in the transition relation generated by SMV, and the analysis time for SMV. See Section V-D for details on how the analysis times were obtained in all of the experiments reported in this paper.
Using a richer canonical model generally degraded the performance of the tools reading the translated input. This effect was greater for SMV, and for the more data intensive example GASNQ. For SPIN, the EFSA model of DPH was almost identical to the FSA model, but the EFSA model was much worse for GASNQ. Most of the extra states in the EFSA model result from the inability to express a certain kind of atomicity in Promela. In the FSA model, the memory updates for both processes involved in a rendezvous are performed atomically with the rendezvous (since the memory is encoded in the state). The EFSA translator employed the atomic sequence construct of Promela to simulate this, but the semantics are not quite the same. The problem is that it is not possible to make the memory updates in two processes part of the same atomic action, thus the EFSA model generated by SPIN must have additional states. The performance of SPIN+PO was similarly degraded using the EFSA model.
For SMV, the EFSA models for both examples required much larger OBDDs to represent the transition relation. We believe any benefit of representing the variable values explicitly was overwhelmed by the significant increase in the number of state variables required to store the task memories. In the case of GASNQ, we believe that array indexing caused the explosion in OBDD size that exhausted the memory. The Ada version of this example is most naturally coded by using an entry accepting the pump number as a parameter, which is then used as an index into an array storing the number of customers waiting for each pump. For comparison, we coded the example without the array using a separate entry for each pump and using two integer variables to hold the number of waiting customers. SMV was able to analyze the translated input generated from these EFSAs in just about twice the time it took to analyze the input generated from the FSAs, a slowdown comparable to that obtained for DPH.
After some experience with the EFSA translators, we were disappointed to find that they produced uniformly worse performance than the FSA translators. We believe it is possible that an improved EFSA translator might produce comparable or better performance for SMV. Hu et al. [27] have explored the verification of higher-level specifications with OBDDs and use techniques that we have not attempted, such as interleaving the bits of memory cells that are functionally related and partitioning the transition relation. We have decided not to pursue this matter further at this time for several reasons. First, most of the Ada tasking programs we have collected are not data intensive, so this issue is not critical to our evaluation. Second, even for data intensive examples, it is not clear that representing a task's state symbolically with OBDDs, as is done in the EFSA model, will produce better performance than explicitly enumerating the task's states, as is done in the FSA model. Hu and Dill [26] report that state enumeration is more efficient than their OBDD-based techniques on most of the real-life protocols they have tried. Finally, we have tried to avoid the use of special purpose techniques for specific kinds of problems in favor of techniques that are generally applicable. Fully automatic tools will have to sacrifice some efficiency for generality and ease of use.
C. Simultaneous Model
Another potential bias against the OBDD-based technique is the use of an interleaving model of concurrency rather than a simultaneous one. In an interleaving model, events are totally ordered, thus exactly one event occurs at each step. In a simultaneous model, many events may occur simultaneously. OBDD-based techniques tend to perform better on simultaneous models [32], especially when the number of asynchronous processes is large. In this section, we convert the model of Section II into a simultaneous model, describe a translation scheme for this model into the SMV language, and compare the performance of SMV on this simultaneous model to its performance on the interleaving model.
We begin by redefining the composition of a set of FSAs.
We ran several sizes of the standard dining philosophers problem using the interleaving and simultaneous models. These results are shown in Table VI, the columns of which show: the example (and size), the model, the number of OBDD nodes in the transition relation generated by SMV, and the analysis time for SMV. The dining philosophers systems have a ring structure on which the simultaneous model should perform well. Unfortunately, the addition of the action state variables, which are unnecessary for the interleaving model, makes the performance worse. We note that our original translation scheme for the interleaving model included action state variables and caused similar performance problems for SMV. We conclude that an interleaving model, like the one in Section II, is better than a simultaneous model for representing Ada tasking programs.
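The distinction between interleaving and simultaneous steps drawn at the start of this section can be sketched on a toy system. This is a minimal illustration only: it assumes fully independent two-state processes with no rendezvous, it does not model the action state variables or the SMV translation discussed above, and the names `toggle`, `interleaving_successors`, `simultaneous_successors`, and `reachable` are hypothetical.

```python
from itertools import combinations


def toggle(state, i):
    """Local step of process i: flip its single bit of state."""
    return state[:i] + (1 - state[i],) + state[i + 1:]


def interleaving_successors(state):
    """Interleaving model: exactly one process moves per step."""
    return {toggle(state, i) for i in range(len(state))}


def simultaneous_successors(state):
    """Simultaneous model: any nonempty set of processes moves per step."""
    result = set()
    n = len(state)
    for k in range(1, n + 1):
        for group in combinations(range(n), k):
            nxt = state
            for i in group:
                nxt = toggle(nxt, i)
            result.add(nxt)
    return result


def reachable(start, successors):
    """Set of states reachable from start under the given step relation."""
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for t in successors(state):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen
```

In this toy system both step relations reach the same set of states; what differs is the number of transitions out of each state, which grows linearly with the number of processes under interleaving and exponentially under simultaneous steps. How that trade-off affects OBDD size depends on the encoding, as the results in Table VI indicate.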
VII. Conclusion
We have explored the methodological issues involved in empirically evaluating deadlock detection techniques for Ada tasking programs. Among these issues are the selection of examples and implementations, the specification of the examples, and the analysis of the resulting behavior of the implementations. We chose to represent each program in a canonical model and apply tools implementing the techniques to inputs generated automatically from this canonical representation. In our analysis, we calculated a numerical measure for the rate of growth of the time and memory required by the tools to complete the analysis as the example is scaled up. We believe these issues are of general significance in the empirical comparison of analysis techniques.
We have conducted an empirical evaluation of three techniques for deadlock detection in Ada tasking programs: a partial order state space reduction, symbolic model checking, and inequality necessary conditions. No technique was clearly superior to the others, but rather each excelled on certain kinds of programs. The state space reduction and symbolic model checking techniques performed best on programs with a linear communication structure. For programs with a single star communication structure, symbolic model checking generally performed better than the state space reduction technique. Inequality necessary conditions performed well on programs with small or linear tasks, regardless of the communication structure.
While our evaluation gives some indication of the kinds of programs to which the evaluated techniques might best be applied, it is only the beginning. Considerable effort on the part of many researchers will be required to fully characterize the range of applicability of each technique. Such work will require a large collection of example programs, a common model, and an agreed upon method for evaluation. This paper takes an important first step towards such a rigorous evaluation of concurrency analysis techniques.
