Fast Parallel Deterministic and Randomized Algorithms for Model Checking by Lee, Insup & Rajasekaran, Sanguthevar
University of Pennsylvania 
ScholarlyCommons 
Technical Reports (CIS) Department of Computer & Information Science 
January 1993 
Fast Parallel Deterministic and Randomized Algorithms for Model 
Checking 
Insup Lee 
University of Pennsylvania, lee@cis.upenn.edu 
Sanguthevar Rajasekaran 
University of Pennsylvania 
Recommended Citation 
Insup Lee and Sanguthevar Rajasekaran, "Fast Parallel Deterministic and Randomized Algorithms for
Model Checking", January 1993.
University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-93-09. 
NOTE: Page 2 is missing. 
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/cis_reports/869 
Abstract 
Model checking is a powerful technique for verification of concurrent systems. One of the potential 
problems with this technique is state space explosion. There are two ways in which one could cope with 
state explosion: reducing the search space and searching less space. Most of the existing algorithms are 
based on the first approach. 
One successful approach to reducing the search space uses Binary Decision Diagrams (BDDs) to
represent the system. Systems with a large number of states (of the order of $5 \times 10^{20}$) have been thus
verified. But there are limitations to this heuristic approach. Even systems of reasonable complexity have 
many more states. Also, the BDD approach might fail even on some simple systems. In this paper we 
propose the use of parallelism to extend the applicability of BDDs in model checking. In particular we 
present very fast algorithms for model checking that employ BDDs. The algorithms presented are much 
faster than the best known previous algorithms. We also describe searching less space as an attractive 
approach to model checking. In this paper we demonstrate the power of this approach. We also suggest 
the use of randomization in the design of model checking algorithms. 
School of Engineering and Applied Science
Computer and Information Science Department
Philadelphia, PA 19104-6389
January 1993
Fast Parallel Deterministic and Randomized Algorithms
for Model Checking*
Insup Lee and Sanguthevar Rajasekaran





1 Introduction
One of the most successful techniques for automatic verification of concurrent systems has been
model checking, which was developed by Clarke, Emerson and Sistla [9]. Model checking determines
whether a finite state system satisfies a property specified as a formula in propositional branching
time temporal logic by computing the set of states that satisfies the formula. This method has been
implemented and used to successfully prove the correctness of a large class of hardware circuits.
*This research was supported in part by ONR N00014-89-J-1131, DARPA/NSF CCR90-14621 and ARO DAAL
03-89-C-0031.
1.3 Organization of this Paper
The rest of the paper is organized as follows. Section 2 contains the definition of BDDs and
introduces parallel computational models. Sections 3 and 4 describe and analyze parallel algorithms
for various operations on BDDs. In Section 5, we present parallel model checking algorithms that
employ BDDs and their time complexities. Section 6 has a very simple encoding scheme that is
optimal with respect to memory usage. In Section 7 we show how one could use probabilism to
search less space. In Section 8 we demonstrate the power of randomization in algorithm design.
In particular we present a randomized parallel algorithm for the Coarsest Partitioning Problem
(with a single function) that is better than the previously best known algorithm. Finally, Section
9 concludes this paper.
2 Preliminaries
2.1 Binary Decision Diagrams
A Binary Decision Diagram is a canonical representation of a boolean formula. Its structure is an
acyclic graph. Each node in the BDD is labeled with a variable and has two children corresponding
to the two values this variable can take (viz. 0 and 1). There is a total order imposed on the
occurrence of variables along any path starting from the root and ending in a leaf. The leaves of
the BDD are labeled with a 0 or a 1. Any path starting from the root will end up in a leaf labeled
1 if and only if under the corresponding assignment to the variables along the path, the formula
has a value 'true'. To check if a given assignment to variables satisfies the boolean formula, one
traverses a path dictated by the assignment from the root of the BDD until a leaf node is reached.
This leaf node gives the value of the formula under the given assignment. It has been shown that
there is a unique minimal BDD corresponding to any boolean formula under a given ordering of
variables [3].
In Figure 1, an example is given. This BDD represents the formula $a + bc$, with $a < b < c$. For
instance, to check if $a = 1$, $b = 0$, $c = 1$ will satisfy the formula we start from the root and traverse
the corresponding path to reach the second leaf (from left) which is labeled with a 0.
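The traversal described above can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of nodes and the names `ROOT` and `evaluate` are assumptions made for the example, and the diagram built here is one possible BDD for $a + bc$ under the ordering $a < b < c$, not necessarily the drawing in Figure 1.

```python
from itertools import product

# A node is a leaf (the int 0 or 1) or a triple (variable, low_child, high_child),
# where low_child is followed when the variable is 0 and high_child when it is 1.
# Hypothetical encoding of a + bc under the ordering a < b < c.
C_NODE = ("c", 0, 1)
B_NODE = ("b", 0, C_NODE)
ROOT = ("a", B_NODE, 1)


def evaluate(node, assignment):
    """Walk from `node` to a leaf along the path dictated by `assignment`."""
    while not isinstance(node, int):
        var, low, high = node
        node = high if assignment[var] else low
    return node


if __name__ == "__main__":
    # Print the value of the formula under every assignment.
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, "->", evaluate(ROOT, {"a": a, "b": b, "c": c}))
```

Each evaluation visits at most one node per variable, so its cost is bounded by the number of variables rather than by the size of the diagram.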
2.2 Parallel Machine Models
A large number of parallel machine models have been proposed. Some of the widely accepted
models are: 1) fixed connection machines, 2) shared memory models, 3) the boolean circuit model,
and 4) the parallel comparison trees. Of these 1) and 2) are the most popular. The time complexity
of a parallel machine is a function of its input size. Precisely, time complexity is a function $g(n)$
that is the maximum, over all inputs of size $n$, of the time elapsed from when the first processor
begins execution until the last processor stops execution.
Figure 1: A Binary Decision Diagram
A fixed connection network is a directed graph G(V, E) whose nodes represent processors and
whose edges represent communication links between processors. Usually we assume that the degree
of each node is either a constant or a slowly increasing function of the number of nodes in the
graph. Fixed connection networks are supposed to be the most practical models. The Connection
Machine, Intel Hypercube, ILLIAC IV, Butterfly, etc. are examples of fixed connection machines.
In shared memory models (also known as PRAMs), a number (call it $P$) of processors work
synchronously, communicating with each other with the help of a common block of memory accessible
by all. Each processor is a random access machine. Every step of the algorithm is an arithmetic
operation, a comparison, or a memory access. Several conventions are possible to resolve read or
write conflicts that might arise while accessing the shared memory. EREW PRAM is the shared
memory model where no simultaneous read or write is allowed on any cell of the shared memory.
CREW PRAM is a variation which permits concurrent read but not concurrent write. And finally,
CRCW PRAM model allows both concurrent read and concurrent write. Read or write conflicts
in the above models are taken care of with a priority scheme.
The parallel run time $T$ of any algorithm for solving a given problem cannot be less than $S/P$,
where $P$ is the number of processors employed and $S$ is the run time of the best known sequential
algorithm for solving the same problem. We say a parallel algorithm is optimal if it satisfies the
equality $PT = O(S)$. The product $PT$ is referred to as the work done by the parallel algorithm.
The model assumed in this paper is the PRAM. Though a PRAM is supposed to be impractical,
it is easy to design algorithms on this model and usually algorithms developed for this model can
be easily mapped on to more practical models. Also there is a simulation algorithm that will map
any PRAM algorithm into an algorithm for the hypercube network (the CM being one) with at
the most a logarithmic factor of slow down [15]. Thus all the time bounds mentioned in this paper
will apply to the CM if multiplied by a logarithmic factor.
2.3 Some Useful Lemmas
In this section we state some results which will prove useful in interpreting results presented in this
paper.
Lemma 2.1 [2] If $W$ is the total number of operations performed by all the processors using a
parallel algorithm in time $T$, we can simulate this algorithm using $P$ processors such that the new
algorithm runs in time $\lfloor W/P \rfloor + T$.
As a consequence of the above Lemma we can also get:
Lemma 2.2 If a problem $\pi$ can be solved in time $T$ using $P$ processors, we can solve the same
problem using $P'$ processors (for any $P' \le P$) in time $O(PT/P')$.
Definition. Given a sequence of numbers $k_1, k_2, \ldots, k_n$, the problem of prefix sums computation
is to output the numbers $k_1, k_1 + k_2, \ldots, k_1 + k_2 + \cdots + k_n$.
The following Lemma is folklore [12]:
Lemma 2.3 Prefix sums of a sequence of $n$ numbers can be computed in $O(\log n)$ time using
$n/\log n$ EREW PRAM processors.
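The bound of Lemma 2.3 is usually obtained with a three-phase scheme: each processor sums a block of about $\log n$ inputs, the block totals are scanned in $O(\log n)$ doubling rounds, and each processor then rescans its block. The sketch below is a sequential Python simulation of that organization, given only as an illustration; the function name `prefix_sums` is an assumption, and the code makes no attempt to respect the exclusive-read restriction of an EREW PRAM.

```python
import math


def prefix_sums(values):
    """Inclusive prefix sums, organized in the three phases described above."""
    n = len(values)
    if n == 0:
        return []
    block = max(1, math.ceil(math.log2(n)))        # about log n inputs per processor
    blocks = [values[i:i + block] for i in range(0, n, block)]

    # Phase 1: each simulated processor sums its own block sequentially.
    totals = [sum(b) for b in blocks]

    # Phase 2: doubling scan over the block totals. Each pass of the while
    # loop models one parallel round; there are O(log n) rounds.
    offsets = totals[:]
    step = 1
    while step < len(offsets):
        offsets = [offsets[k] + (offsets[k - step] if k >= step else 0)
                   for k in range(len(offsets))]
        step *= 2

    # Phase 3: each processor rescans its block, starting from the sum of
    # all earlier blocks.
    out = []
    for k, b in enumerate(blocks):
        running = offsets[k] - totals[k]
        for x in b:
            running += x
            out.append(running)
    return out
```

For example, `prefix_sums([3, 1, 4, 1, 5, 9, 2, 6])` returns `[3, 4, 8, 9, 14, 23, 25, 31]`.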
3 A Fast Parallel Algorithm for Reducing a BDD
One of the frequently used operations in the model checking algorithm of [6] is reducing a BDD.
The problem of reducing a BDD is to take as input an arbitrary BDD and reduce it into an
equivalent BDD with as few states as possible. Bryant [3] has shown that there is a unique minimal
BDD corresponding to any boolean formula under a given ordering for the variables. He has also
presented an efficient sequential algorithm for reduction. In this section we present an efficient
parallel algorithm for BDD reduction. We also analyze the sequential and parallel complexity of
Bryant's algorithm.
3.1 Bryant's Algorithm
Let $G(V, E)$ be the given BDD for a boolean formula on $n$ variables that has to be reduced. Then
Bryant's algorithm can be shown to have a run time of $O(n + |V| \log |V|)$ (even though no such
bound is explicitly mentioned in [3]). A brief description of this algorithm is in order to be able
to follow the rest of the discussion. His algorithm processes nodes level by level starting from the
leaves and proceeding toward the root. The leaves of G are associated with values (0 or 1), whereas
the rest of the nodes are associated with variables. The algorithm will assign unique labels to the
nodes such that all the equivalent nodes get the same label. In the case of leaves, all the nodes
with a 0 value get the label 1 and the nodes with a 1 value get the label 2.
As for the rest of the nodes, two simple rules are used to label them. If $v$ is any node in $G$, let
LEFT(v) and RIGHT(v) stand for the left and right children of $v$ respectively. Let LABEL(v)
denote the label assigned to $v$. The rules are: 1) If there are two nodes $v_1$ and $v_2$ such that
LABEL(LEFT($v_1$)) = LABEL(LEFT($v_2$)) and LABEL(RIGHT($v_1$)) = LABEL(RIGHT($v_2$)),
then $v_1$ and $v_2$ should get the same label; 2) If LABEL(LEFT(v)) = LABEL(RIGHT(v)) for
any node $v$ in $G$, then we should set LABEL(v) = LABEL(LEFT(v)), since this implies that the
node $v$ is redundant and can be eliminated.
Assume inductively that all the nodes in level i + 1 or below of G have been processed (i.e.,
each node in these levels has been assigned a label). Now we describe how to label the nodes in
level i. Let v be an arbitrary node in level i. If LABEL(LEFT(v)) = LABEL(RIGHT(v)) we
immediately set LABEL(v) = LABEL(LEFT(v)) and the processing of node v is over. If not, we
create a tuple (LABEL(LEFT(v)),LABEL(RIGHT(v))) corresponding to this node. For all the
nodes in level i that have not been assigned a label yet, we take their tuples and sort these tuples
in lexicographical order. The effect of this sorting is to group all the nodes with the same tuple
together. Clearly all the nodes with the same tuple should be assigned the same label.
Let $m_i$ be the number of distinct tuples at this level. Let $n_i$ and $N_i$ denote the largest label
assigned to any node at level $i$ and the number of nodes at level $i$ respectively, for any $i \le n$.
Nodes at level $i$ are labeled with integers in the range $n_{i+1} + 1, n_{i+1} + 2, \ldots, n_{i+1} + m_i = n_i$. This
completes the processing of all the nodes at level $i$. Here is a detailed version of the algorithm:
Algorithm Reduce(G(V, E));
(* LIST is a list of lists. LIST(i) is a list of nodes at level i, for 1 ≤ i ≤ n + 1.
   LABEL is an array such that LABEL(v) is the label assigned to node v ∈ V. *)
for each node v in LIST(n + 1) do
    (* Process the leaves *)
    if the node v has a value 0 then LABEL(v) := 1 else LABEL(v) := 2;
nextlabel := 3;
for i := n downto 1 do (* Process the nodes in level i, n ≥ i ≥ 1 *)
    TEMP := ∅; (* TEMP is the list of nodes at level i that should get new labels *)
    for each node v in LIST(i) do
        if LABEL(LEFT(v)) = LABEL(RIGHT(v)) then
            LABEL(v) := LABEL(LEFT(v))
        else
            add v to TEMP with a corresponding 'key' of (LABEL(LEFT(v)),
            LABEL(RIGHT(v)));
    Sort the nodes in TEMP with respect to their keys in lexicographical
    order;
    Scan through this sorted list and assign labels to nodes starting from
    nextlabel, such that nodes with the same key get the same label;
    nextlabel := nextlabel + m_i;
    (* m_i is the number of distinct keys in TEMP *)
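The labeling procedure of Algorithm Reduce can be sketched sequentially in Python as follows. The data layout (a list of levels plus dictionaries for children and leaf values) and the name `reduce_labels` are assumptions made for this illustration; the sketch only computes the labels and does not rebuild the reduced diagram.

```python
def reduce_labels(levels, left, right, leaf_value):
    """
    levels     : list of lists; levels[k] holds the node names at level k + 1,
                 and the last entry holds the leaves.
    left/right : dicts mapping each internal node to its children.
    leaf_value : dict mapping each leaf to 0 or 1.
    Returns a dict LABEL in which equivalent nodes share a label.
    """
    # Leaves: value 0 gets label 1, value 1 gets label 2.
    label = {v: 1 if leaf_value[v] == 0 else 2 for v in levels[-1]}
    next_label = 3
    for level in reversed(levels[:-1]):              # i = n downto 1
        temp = []
        for v in level:
            key = (label[left[v]], label[right[v]])
            if key[0] == key[1]:
                label[v] = key[0]                    # redundant node
            else:
                temp.append((key, v))
        temp.sort(key=lambda pair: pair[0])          # equal keys become adjacent
        previous = None
        for key, v in temp:
            if key != previous:                      # a new distinct key
                previous = key
                current = next_label
                next_label += 1
            label[v] = current
    return label
```

The number of distinct labels returned equals the number of nodes in the reduced BDD, since every class of equivalent nodes collapses to a single node.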
Clearly, the amount of time needed to process nodes at level $i$ is dominated by the sorting of the
tuples. If there are $N_i$ nodes at level $i$, this sorting can be done in $O(N_i \log N_i)$ time [1]. Therefore
the run time of the algorithm is $\sum_{i=1}^{n} O(N_i \log N_i) = O(|V| \log |V| + n)$. Thus we get the following:
Theorem 3.1 Bryant's algorithm for BDD reduction runs in time $O(|V| \log |V| + n)$, where $|V|$ is
the number of nodes in the BDD and $n$ is the number of variables in the formula.
3.2 The Parallel Algorithm
In [14] Kimura and Clarke present a parallel algorithm for BDD reduction. This algorithm is based
on the observation that one could think of a BDD as a deterministic finite state automaton (DFA).
Though no explicit mention is made in the paper about the complexity of their algorithm, it is
straightforward to realize that this algorithm can be implemented on a CREW PRAM in time
$T = O(n + n \log \frac{|V|}{n})$ using $\frac{|V| \log |V|}{T}$ processors. However, their algorithm makes the following
assumption: The nodes of the BDD are not explicitly associated with variables; the level of any
node uniquely determines the variable associated with it. This assumption may not hold always.
For instance in Figure 1, the leftmost leaf and the node associated with b are at the same level,
but they do not correspond to the same variable. Thus in order to make their algorithm applicable
to all inputs, one has to perform some preprocessing. Later in this section we demonstrate how to
accomplish this preprocessing in parallel.
The same time bound can be achieved when we implement Bryant's algorithm (described
in the previous section) on a CREW PRAM. In this parallel version of Bryant's algorithm no
assumption is needed on the structure of the input BDD. An analysis of the parallel run time of
Bryant's algorithm follows: To process the nodes at level $i$, the tuples can be created (if there is
a need) in time $O(1)$ using $N_i$ processors. Total work done is $O(N_i)$. The next step is to sort the
tuples (there can be at most $N_i$ tuples). This can be done in $O(\log N_i)$ time using $N_i$ processors
[12]. Total work done for sorting is $O(N_i \log N_i)$.
Thus the parallel run time of Bryant's algorithm is $\sum_{i=1}^{n} O(\log N_i)$, which is $O(n + n \log \frac{|V|}{n}) = T$,
say. Total work done (i.e., number of operations) in all the $n$ levels is
$\sum_{i=1}^{n} O(N_i + N_i \log N_i) = O(|V| \log |V|)$.
Therefore, using Brent's Lemma 2.1, this algorithm runs in time $O(n + n \log \frac{|V|}{n})$ using $\frac{|V| \log |V|}{T}$
processors. Thus we get:
Lemma 3.1 Bryant's algorithm can be implemented on a CREW PRAM in time $T = O(n +
n \log \frac{|V|}{n})$ using $\frac{|V| \log |V|}{T}$ processors. Here $|V|$ and $n$ stand for the number of nodes and the number
of variables in the BDD.
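The step from the sum of the per-level sorting times to the closed form in Lemma 3.1 can be filled in using the concavity of the logarithm. The following is a sketch under two assumptions that the text does not state explicitly: every level is nonempty ($N_i \ge 1$), and each level costs $O(1)$ for bookkeeping in addition to its sort, which is where the additive $n$ comes from.

```latex
\sum_{i=1}^{n} O(1 + \log N_i)
  \;\le\; O\Bigl(n + n \log\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} N_i\Bigr)\Bigr)
  \;\le\; O\Bigl(n + n \log \tfrac{|V|}{n}\Bigr),
\qquad \text{since } \sum_{i=1}^{n} N_i \le |V| .
```

The first inequality is Jensen's inequality applied to $\log$, and the second uses the monotonicity of $\log$.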
Therefore as far as the work done is concerned, Bryant's algorithm can be implemented optimally
on the CREW PRAM. But the question is: can we reduce the run time further while keeping the work
done the same or nearly the same? Ideally we would like to have a run time which is only a
polynomial in the logarithms of $|V|$ and $n$. Next we present an algorithm for BDD reduction which
runs in $O(\log^2 |V|)$ time using only $M_{|V|}$ processors, where $M_{|V|}$ is the number of processors needed
to reduce a DFA with $|V|$ nodes into minimal form in $O(\log^2 |V|)$ time. We first make the
same assumption about the structure of the input BDD as in [14], namely that the level of any
node uniquely determines the variable associated with the node. We call a BDD that satisfies this
assumption a 'BDD in restricted form' from here on. Later we show how to preprocess the input
in parallel such that the resultant BDD will satisfy this assumption. Our algorithm will also be
based on the fact that one could associate a BDD in this restricted form with a DFA.
A DFA $A$ is a quintuple $(Q, \Sigma, \delta, q_0, F)$ where $Q$ is the set of states of $A$, $\Sigma$ is the alphabet,
$\delta : Q \times \Sigma \to Q$ is the state transition function, $q_0$ is the start state, and $F$ is the set of
accepting states. $A$ is said to accept a string $a_0, a_1, \ldots, a_n$ (where $a_i \in \Sigma$ for each $i$) if and only if
$\delta(\cdots\delta(\delta(q_0, a_0), a_1)\cdots, a_n) \in F$.
A BDD $B$ in restricted form can be associated with a DFA $A$ in the following manner: $\Sigma$ for
$A$ is $\{0, 1\}$. $Q$ is the set of nodes of $B$. The start state of the automaton $A$ is the root of $B$.
The only accepting states of $A$ will be the leaves that have a 1 in them. There is a one-to-one
correspondence between the nodes of $B$ and the states of $A$. For any state $s$ in $A$, let CORR(s)
stand for the corresponding node in $B$. If $\delta$ is the transition function of $A$, then, for any state $s$
in $A$, $\delta(s, 0) = t$ ($\delta(s, 1) = t$) if and only if CORR(t) is the left (right) child of CORR(s). This
correspondence simply means that the transition graph of $A$ is the same as the DAG representing
the BDD. Thus the task of converting a BDD into a DFA is trivial.
Also notice that if B is a BDD on n variables and if A is the corresponding DFA, then the
following holds: If a binary string of length n is accepted by A, then the corresponding assignment
to variables will satisfy the boolean formula that B represents.
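The correspondence between a BDD in restricted form and a DFA can be made concrete with a short Python sketch. The dictionary-based representation and the names `bdd_to_dfa` and `accepts` are assumptions made for illustration; the example diagram is an unreduced, restricted-form BDD for $a + bc$ in which every root-to-leaf path has length 3.

```python
def bdd_to_dfa(root, left, right, leaf_value):
    """Return (Q, Sigma, delta, q0, F) for a BDD in restricted form."""
    states = set(left) | set(leaf_value)
    delta = {}
    for s in left:
        delta[(s, 0)] = left[s]       # input 0 follows the left child
        delta[(s, 1)] = right[s]      # input 1 follows the right child
    accepting = {s for s, value in leaf_value.items() if value == 1}
    return states, (0, 1), delta, root, accepting


def accepts(dfa, bits):
    """Run the DFA on a binary string whose length equals the number of variables."""
    _, _, delta, state, accepting = dfa
    for bit in bits:
        state = delta[(state, bit)]
    return state in accepting


# Restricted-form BDD for a + bc (ordering a < b < c); "z" and "o" are the leaves.
LEFT = {"r": "b0", "b0": "c00", "b1": "c10",
        "c00": "z", "c01": "z", "c10": "o", "c11": "o"}
RIGHT = {"r": "b1", "b0": "c01", "b1": "c11",
         "c00": "z", "c01": "o", "c10": "o", "c11": "o"}
LEAF_VALUE = {"z": 0, "o": 1}
DFA = bdd_to_dfa("r", LEFT, RIGHT, LEAF_VALUE)
```

With these definitions, `accepts(DFA, (0, 1, 1))` is true and `accepts(DFA, (0, 1, 0))` is false, matching the value of $a + bc$ under those assignments.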
There are many parallel algorithms in the literature for minimizing a DFA. One of the fastest
algorithms has been given by JáJá and Kosaraju [13]:
Theorem 3.2 If $G$ is a DFA with $n$ states, we could obtain the minimal form of $G$ in $O(\log^2 n)$
time on the EREW PRAM using $n^4$ processors.
The above algorithm can be used in the reduction of BDDs: 1) Given a BDD $B$, convert it into
a DFA $A$ in a straightforward way (as explained above). This can be done in $O(1)$ time using
$|V|$ CREW PRAM processors; 2) Minimize this DFA using any good algorithm. This will take
$O(\log^2 |V|)$ time using $M_{|V|}$ CREW PRAM processors; and finally 3) Convert the minimal DFA
into a BDD. This can again be accomplished in time $O(1)$ using $|V|$ processors. The total work
done in this parallel algorithm is $O(M_{|V|} \log^2 |V|)$. It is an open problem whether one could reduce the
work done to $O(|V| \log |V|)$ keeping the time bound the same. The above algorithm together with
an application of Brent's Lemma 2.2 yields the following:
Theorem 3.3 A BDD with $|V|$ nodes can be reduced in time $O\left(\frac{M_{|V|} \log^2 |V|}{P} + \log^2 |V|\right)$ using $P$
processors (as long as $P \le M_{|V|}$).
3.3 Converting a BDD into Restricted Form
The algorithm given in the previous section for BDD reduction assumes that the BDD B is in
restricted form, i.e., the level of any node uniquely determines the associated variable. But usually
in practice we can not expect the input BDD to be in this form. In this section we show how to
convert an arbitrary BDD into restricted form.
The idea is to ensure that every path from the root to a leaf is of the same length. Consider a
BDD $B$ on $n$ variables $x_1, x_2, \ldots, x_n$, with a variable ordering of $x_1 < x_2 < \cdots < x_n$. We may have to
introduce redundant nodes in the BDD in order to bring it to restricted form. (We say a node is
redundant if the transition from this node on either input leads to the same node.) For instance if there
is a node associated with $x_1$ and if one of its children is associated with $x_4$, then we introduce two
nodes associated with $x_2$ and $x_3$ along the path from $x_1$ to $x_4$. From $x_2$ there is a transition to $x_3$
on either input, and also there is a transition from $x_3$ to $x_4$ on either input.
In general, for any $i < j$, if a node associated with $x_i$ has a child associated with $x_j$, the path
from $x_i$ to $x_j$ is completed with redundant nodes with the variables $x_{i+1}, x_{i+2}, \ldots, x_{j-1}$. Without
loss of generality assume that the nodes of $B$ are named $1, 2, \ldots, |V|$. A detailed description of the
parallel algorithm follows:
Algorithm Restrict(G(V, E));
(* N1 and N2 are arrays of size |V|. For each node v ∈ V, N1(v) (N2(v)) is
   the number of nodes to be introduced between v and its left (right) child. *)
for i := 1 to |V| do
    Let x_p, x_q, x_r be the variables associated with i, its left child and its
    right child respectively;
    N1(i) := q - p - 1; N2(i) := r - p - 1;
Compute the prefix sums of N1(1), N1(2), ..., N1(|V|), N2(1), N2(2), ..., N2(|V|);
Let $N = \sum_{i=1}^{|V|} (N1(i) + N2(i))$;
If we have N processors we could create all the redundant paths in O(1) time,
each processor being in charge of creating a single redundant node; Realize
that the scheduling problem of deciding which processor should create which
node is solved by the above prefix sums computation.
Figure 2: A bad input for algorithm Restrict
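The steps of Restrict can be sketched sequentially in Python. The representation is an assumption made for illustration: `var[v]` holds the index of the variable at node `v`, leaves are given the index $n + 1$, and fresh redundant nodes are named `n_nodes + 1, n_nodes + 2, ...`. The prefix sums play the role described in the algorithm: they hand each edge a private block of fresh names.

```python
from itertools import accumulate


def restrict(var, left, right, n_nodes):
    """
    var        : dict, var[v] = p if node v is associated with x_p (leaves: n + 1).
    left/right : child maps for the internal nodes; nodes are named 1..n_nodes.
    Returns (var, left, right) maps in which every edge joins consecutive levels.
    """
    internal = sorted(left)
    # Step 1: number of redundant nodes needed on each left and right edge.
    gaps = ([var[left[v]] - var[v] - 1 for v in internal] +
            [var[right[v]] - var[v] - 1 for v in internal])
    # Step 2: prefix sums assign each edge a block of fresh node names.
    ends = list(accumulate(gaps))
    new_var, new_left, new_right = dict(var), dict(left), dict(right)
    edges = ([(v, new_left) for v in internal] +
             [(v, new_right) for v in internal])
    # Step 3: create the redundant chains (one simulated processor per new node).
    for (v, child_map), gap, end in zip(edges, gaps, ends):
        target = child_map[v]
        first = n_nodes + end - gap + 1          # first fresh name for this edge
        chain = list(range(first, first + gap))
        for k, u in enumerate(chain):
            new_var[u] = var[v] + 1 + k
            successor = chain[k + 1] if k + 1 < gap else target
            new_left[u] = new_right[u] = successor   # redundant: same node on 0 and 1
        if gap:
            child_map[v] = chain[0]
    return new_var, new_left, new_right
```

As a small usage example, a BDD for $x_1 x_3$ on three variables with nodes `1` ($x_1$), `2` ($x_3$) and leaves `3`, `4` needs three redundant nodes, which the sketch names `5`, `6` and `7`.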
Analysis. We could compute all the N1(i)'s and N2(i)'s in $O(1)$ time given $|V|$ processors. Prefix
sums can be computed in $O(\log |V|)$ time using $\frac{|V|}{\log |V|}$ processors (cf. Lemma 2.3). Total work done
in these two steps is $O(|V|)$. Finally, creation of redundant nodes can be completed in $O(1)$ time
given $N$ processors. Total work done in this step is $O(N)$. Thus applying Lemma 2.1 we get the
following:
Theorem 3.4 An arbitrary BDD can be converted into restricted form in time $O\left(\log |V| + \frac{N + |V|}{P}\right)$
using $P$ processors.
An interesting question is how large $N$ can be. Clearly an upper bound on $N$ is $O(|V| n)$. In
fact there are examples for which $N$ could be $\Omega(|V| n)$. A description of one such BDD $B$ follows:
Say $B$ is on $n$ variables. Up to level $\frac{n}{2}$, $B$ is a complete binary tree. There are only two more
levels. Each node at level $\frac{n}{2}$ has two children. These children are labeled (from left to right) with
$x_{\frac{n}{2}+1}, x_{\frac{n}{2}+2}, \ldots, x_n$ in a cyclic fashion. The leaves have arbitrary labels (from $\{0, 1\}$). See Figure 2.
The number of nodes in the BDD is $2^{\frac{n}{2}+2} - 1$. The number of redundant nodes that should be
introduced in order to convert $B$ into restricted form is $\ge 2^{\frac{n}{2}} \left(\frac{n}{2} - 1\right)$, implying that the size of $B$
in restricted form is $\Omega(|V| n)$.
We could perform the following preprocessing in the algorithm Restrict, which may be helpful
on certain inputs: Search through $B$ to identify variables that do not occur in the label of any node
or which occur only in redundant form. We could reduce the number of levels in the output BDD
accordingly. Identifying such variables can be done, for example, by sorting the nodes with respect
to their labels. This will take deterministic time $O(\log |V|)$ on a $|V|$-processor CREW PRAM [12],
or randomized $O(\log |V|)$ time using $\frac{|V|}{\log |V|}$ CRCW PRAM processors [18].
4 Other Operations on BDDs
In applying BDDs to model checking we must be able to perform the following operations on BDDs:
1) Boolean AND; 2) Boolean OR; and 3) Boolean NOT. Other operations such as EXOR, existential
quantification, etc. can be expressed as a sequence of the above three elementary operations. Thus
we will focus our attention on these three operations.
Kimura and Clarke make use of the DFA representation of BDDs for the simple reason that
standard algorithms available for DFA manipulation can then be employed. If $B_1$ and $B_2$ are
any two BDDs, let $A_1$ and $A_2$ be their corresponding DFAs. The boolean AND of the two BDDs
can be constructed by intersecting $A_1$ and $A_2$. Similarly the boolean OR and the boolean NOT
correspond to the union and the complement operations in the DFA domain.
In this section we show how to perform the AND of two given BDDs. The other boolean
operations can be handled along the same lines. Elegant sequential and parallel algorithms are
given in [14]. Here we give a very simple parallel algorithm which is much faster than that in
[14]. If there are $N_1$ states in $A_1$ and $N_2$ states in $A_2$, then this algorithm takes constant time
to construct the product automaton using $N_1 \times N_2$ CREW PRAM processors. But the resultant
product automaton may not be minimal. We then apply the algorithm of Theorem 3.3. This
minimization will take $O(\log^2(N_1 N_2))$ time using $M_{N_1 N_2}$ EREW PRAM processors.
Let $q_{1,1}, q_{1,2}, \ldots, q_{1,N_1}$ be the states of $A_1$ and $q_{2,1}, q_{2,2}, \ldots, q_{2,N_2}$ be the states of $A_2$. The only
possible states of the product machine are tuples of the form $(q_{1,i}, q_{2,j})$ where $q_{1,i}$ is any state of
$A_1$ and $q_{2,j}$ is any state of $A_2$ at the same level. The arcs going out of any such node in the
product can also be determined easily. For instance if there is an arc from $q_{1,i}$ to $q_{1,k}$ on input 0
and there is an arc from $q_{2,j}$ to $q_{2,l}$ on the same input, then there will be an arc from $(q_{1,i}, q_{2,j})$ to
$(q_{1,k}, q_{2,l})$ in the product machine on input 0. In general, the arcs going out of any such node can
be determined in $O(1)$ time using a single processor. A description of the algorithm follows: For
any state $s$ in $A_1$, IN1(s) is a list of incident edges and OUT1_0(s), OUT1_1(s) are arrays of outgoing
edges (on inputs 0 and 1 respectively). Similarly define IN2(), OUT2_0(), OUT2_1() for $A_2$, and
IN3(), OUT3_0(), and OUT3_1() for the product automaton. Each node in the product automaton is
labeled as $(i, j)$ for $1 \le i \le N_1$ and $1 \le j \le N_2$. A processor is in charge of each such node. The
corresponding processor will also be named with the tuple $(i, j)$.




type
    edge = record
        node: the other end of the edge; input: 0..1;
    end;
var
    IN1, OUT1 : array[1..N1] of list of edge;
    IN2, OUT2 : array[1..N2] of list of edge;
    IN3, OUT3 : array[1..N1, 1..N2] of list of edge;
for 1 ≤ i ≤ N1, 1 ≤ j ≤ N2, processor (i, j) does in parallel:
    if the level of node i in A1 and the level of j in A2 are not the same
    then quit;
    Let the transition state from i on input 0 be p (this information is
    obtained from OUT1_0) and the transition state from j be q on the
    same input;
    Draw an edge from (i, j) to (p, q), i.e., set OUT3_0[i, j].node := (p, q);
    Let the transition state from i on input 1 be r and the transition
    state from j be t on the same input;
    Set OUT3_1[i, j].node := (r, t);
(* Notice that by now, the product automaton has been constructed; But we
   need to fill the array IN3 with the necessary information: *)
Sort the nodes such that all the nodes that have the same transition state
on input 0 are grouped together, and all the nodes with the same transition
state on input 1 are grouped together;
Now easily fill in the array IN3;
Minimize the resultant automaton using Theorem 3.3;
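The product construction can be sketched sequentially in Python; the nested loops stand in for the $N_1 \times N_2$ processors, and a dictionary grouping stands in for the parallel sort that fills IN3. The representation (level maps, transition dictionaries keyed by `(state, bit)`) and the names `product_and` and `run` are assumptions made for illustration, and the minimization step is omitted.

```python
def product_and(level1, delta1, acc1, level2, delta2, acc2):
    """
    levelK : dict mapping each state of A_K to its level.
    deltaK : dict mapping (state, bit) to the successor state.
    accK   : set of accepting states of A_K.
    Returns (delta3, acc3, incoming) for the product automaton.
    """
    delta3, acc3 = {}, set()
    for i in level1:                      # conceptually: processor (i, j)
        for j in level2:
            if level1[i] != level2[j]:
                continue                  # the processor for (i, j) quits
            if i in acc1 and j in acc2:
                acc3.add((i, j))
            for bit in (0, 1):
                if (i, bit) in delta1 and (j, bit) in delta2:
                    delta3[((i, j), bit)] = (delta1[(i, bit)], delta2[(j, bit)])
    # Fill the incoming-edge lists by grouping the edges on their target.
    incoming = {}
    for (source, bit), target in delta3.items():
        incoming.setdefault(target, []).append((source, bit))
    return delta3, acc3, incoming


def run(delta, accepting, start, bits):
    """Follow `bits` from `start` and report whether the final state accepts."""
    state = start
    for bit in bits:
        state = delta[(state, bit)]
    return state in accepting
```

Starting the product at the pair of start states and reading one bit per variable, the run accepts exactly when both component automata accept, which is the AND of the two formulas.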
If we have $N_1 \times N_2$ processors, we can construct the product machine in $O(1)$ time. Sorting of
the nodes to fill in IN3 can be done in $O(\log(N_1 N_2))$ time using $N_1 N_2$ processors. However there
may be many nodes in the automaton thus constructed which are either impossible or cannot be
reached from the start state. For instance if there are two nodes $q_{1,i}$ and $q_{2,j}$ at the same level in
$A_1$ and $A_2$ respectively, and if the incoming arc to $q_{1,i}$ is labeled with a 0 and the incoming arc to
$q_{2,j}$ is labeled with a 1, then we do not have to consider the node $(q_{1,i}, q_{2,j})$ since such a node is clearly
impossible. There may still remain nodes which are not reachable from the start node. We will
eliminate these nodes when we apply the minimization algorithm. The minimization algorithm will
be modified in order to perform this. Thus we have the following:
Theorem 4.1 The product of two automata with $N_1$ and $N_2$ nodes respectively can be constructed
in $O\left(\frac{M_{N_1 N_2} \log^2(N_1 N_2)}{P} + \log^2(N_1 N_2)\right)$ time using $P$ CREW PRAM processors (for any $P \le M_{N_1 N_2}$).
5 Model Checking
Model checking has been used widely to verify finite state systems. A model checking procedure
typically involves the following steps: 1) Obtain the state transition graph of the system. If the
system consists of, say, $n$ components, the number of states in the global system will be exponential
in $n$, causing the state space explosion. But for a moment assume that the state transition graph
has been constructed; 2) Express the properties to be verified (like fairness, deadlock freedom, liveness,
etc.) as formulas in an appropriate logic (such as CTL); 3) Perform a reachability analysis in the
state transition graph to identify all the states that satisfy the formula derived in step 2.
Even for reasonably large values of n, it is impossible to construct an exponentially sized graph.
Thus it becomes necessary to represent the system 'symbolically', i.e., we should come up with
efficient encoding schemes. BDDs have been used successfully in verifying systems with even
$5 \times 10^{20}$ states. However it should be pointed out that any such clever encoding scheme is at best a
heuristic and is bound to fail in the worst case unless P = NP.
Consider a boolean circuit on $n$ variables that we want to verify. Any assignment of values to
these $n$ variables can be thought of as a state of the system. One could represent the states of such
a system as well as the transition graph of the system as BDDs in a straightforward way [6].
5.1 Algorithms for Reachability Analysis
Let $V$ stand for the state variables of the system to be verified. Let $N(V, V')$ stand for the BDD
corresponding to the transition graph of the system. Here $V'$ is a copy of $V$ and stands for the next states.
The problem of identifying the set of reachable states from a set of start states is an essential
step in model checking. We present an efficient parallel algorithm in this section for reachability
analysis. This algorithm makes use of procedures for manipulating BDDs. In particular, the
following operations on BDDs will be used: OR, AND, substitution (for a variable), and existential
quantification.
If $\phi$ is a BDD involving a variable, say $v$, the substitution operation is defined as follows: $[\phi]_{v=0}$
(or $[\phi]_{v=1}$) is nothing but the BDD for the formula of $\phi$ when the value of $v$ is substituted as 0
(or 1). This operation can easily be performed in $O(\log |V|)$ time using $\frac{|V|}{\log |V|}$ processors, where $V$
is the set of nodes in $\phi$. Existential quantification is defined as $\exists v\, \phi$. Given the BDD for $\phi$, we
could obtain the BDD for $\exists v\, \phi$ as follows: $[\phi]_{v=0} \lor [\phi]_{v=1}$.
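Substitution and existential quantification can be illustrated with a small sequential Python sketch. The tuple encoding of nodes and the names `substitute`, `bdd_or`, `exists` and `evaluate` are assumptions made for the example; the recursive OR used here is a plain sequential procedure and not the parallel product construction of Section 4.

```python
def substitute(node, var, value):
    """Diagram for the formula with `var` fixed to `value`: bypass its nodes."""
    if isinstance(node, int):
        return node
    v, low, high = node
    if v == var:
        return high if value else low
    return (v, substitute(low, var, value), substitute(high, var, value))


def bdd_or(f, g, order):
    """OR of two diagrams that respect the variable ordering `order`."""
    if f == 1 or g == 1:
        return 1
    if f == 0:
        return g
    if g == 0:
        return f
    top = min(f[0], g[0], key=order.index)       # earliest variable of the two roots
    f0, f1 = (f[1], f[2]) if f[0] == top else (f, f)
    g0, g1 = (g[1], g[2]) if g[0] == top else (g, g)
    low, high = bdd_or(f0, g0, order), bdd_or(f1, g1, order)
    return low if low == high else (top, low, high)


def exists(node, var, order):
    """Existential quantification: OR of the two substitutions for `var`."""
    return bdd_or(substitute(node, var, 0), substitute(node, var, 1), order)


def evaluate(node, assignment):
    """Follow the path dictated by `assignment` down to a leaf."""
    while not isinstance(node, int):
        var, low, high = node
        node = high if assignment[var] else low
    return node
```

For the diagram `("a", ("b", 0, ("c", 0, 1)), 1)` of $a + bc$, quantifying out `b` with the ordering `["a", "b", "c"]` yields `("a", ("c", 0, 1), 1)`, a diagram for $a + c$.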
Let $V = \{x_1, x_2, \ldots, x_n\}$ and let $B_0$ stand for the BDD of the set of start states. Let $B_i$ (for
$i = 1, 2, \ldots$) stand for the BDD of states that are reachable with $i$ or fewer transitions from the
start states. ($B_i(V)$ is also used to denote $B_i$.) One could compute the BDD $B_i$, given $B_{i-1}$ and
$N(V, V')$, as follows:
Algorithm Reach;
(* B and C are BDDs *)
C := 0; B := $B_0$;
repeat until B = C
    C := B;
    for j := 1 to n do
        B(V) := $B(V) \lor \exists x_j \in V'\, [B(V') \land N(V', V)]$
In the above algorithm, $B_i$ is nothing but $B$ after $i$ iterations of the repeat loop. If the
number of nodes in $B_{i-1}$ is $N_{i-1}$, then we could execute one run of the for loop in time $O(\log^2 N_i')$
using $M (N_i')^2$ processors, where $N_i'$ is defined to be the maximum size of any BDD generated in
computing $B_i$. Thus one iteration of the repeat loop runs in $O(n \log^2 N_i')$ time using $M (N_i')^2$
processors. The above fixed point computation can be sped up by the iterated squaring technique.
If $m$ iterations are needed to compute the fixed point, only $O(\log m)$ iterated squaring steps are
needed. Thus we arrive at the following
Theorem 5.1 Each iteration of the above analysis can be performed in $O\left(\frac{n M N^2 \log^2 N}{P} + n \log^2 N\right)$
time using $P$ CREW PRAM processors (for any $P \le M N^2$), where $N$ is defined to be the maximum
size of any BDD generated in the process. If $m$ iterations are needed to compute the fixed point, we
can reduce this number to $O(\log m)$ with the help of the same number of processors.
There may be some problem with the repeated squaring trick, as has been pointed out in [5].
The size of the intermediate BDDs may be much larger than the final BDD, in which case it may
help to perform the iterations sequentially. But each phase of the iteration can be substantially
sped up as explained above.
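The contrast between the step-by-step fixed point and iterated squaring can be seen in a small sketch. Explicit Python sets of states and of state pairs stand in for the BDDs $B$ and $N$ here, so this is only an illustration of the iteration counts, not of the BDD algorithms themselves; the function names are assumptions.

```python
def reachable(start, edges):
    """Fixed point of B := B OR image(B), mirroring Algorithm Reach."""
    current, previous, rounds = set(start), None, 0
    while current != previous:
        previous = set(current)
        current |= {v for (u, v) in edges if u in previous}
        rounds += 1
    return current, rounds

def compose(r1, r2):
    """Relational composition: (u, w) whenever u -r1-> v -r2-> w."""
    by_source = {}
    for v, w in r2:
        by_source.setdefault(v, set()).add(w)
    return {(u, w) for (u, v) in r1 for w in by_source.get(v, ())}

def reachable_by_squaring(start, edges, states):
    """Square the reflexive transition relation until it stabilises.
    After k squarings the relation covers all paths of length <= 2**k."""
    closure = set(edges) | {(s, s) for s in states}
    squarings = 0
    while True:
        squared = compose(closure, closure)
        if squared == closure:
            break
        closure, squarings = squared, squarings + 1
    return {v for (u, v) in closure if u in start}, squarings

# A chain 0 -> 1 -> ... -> 8: the farthest state is 8 transitions away.
edges = {(i, i + 1) for i in range(8)}
print(reachable({0}, edges)[1])                       # 9 (8 growth rounds + 1 check)
print(reachable_by_squaring({0}, edges, range(9))[1]) # 3 squarings
```

The squared relations in this sketch grow much faster than the reachable set itself, which is the explicit-set analogue of the intermediate-size concern raised in [5].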
5.2 Checking for Other Conditions
One may want to check for liveness, fairness, absence of deadlock, etc. The approach is essentially
the same. Given any condition, we express it as a formula in an appropriate logic and compute all
the states that satisfy this condition. The steps involved in such a computation will be analogous
to those in the reachability analysis, and can be completed along similar lines.
In summary, the efficient parallel algorithms proposed in this paper for BDD manipulation
can be put to use in model checking. This could potentially reduce the computing time by several
orders of magnitude.
6 A New State Graph Encoding
Consider a concurrent system with $n$ components. Let $G_i(V_i, E_i)$ (for $i = 1, 2, \ldots, n$) be the state
graphs of the processes and let $G(V, E)$ be the state graph of the global system. Assume w.l.o.g.
that $|V_i| = 2^{k_i}$ for some integer $k_i$ and for all $i$ in the range $[1, n]$. Then we could map the nodes
of $G$ uniquely into the integers $\{1, 2, \ldots, |V|\}$. Given an integer $N$ in the interval $[1, |V|]$, the
corresponding node of $G$ can be inferred readily: the first $k_1$ bits correspond to the state of the
first process, the next $k_2$ bits correspond to the state of the second process, and so on.
Likewise, we could also map the edges of $G$ uniquely into the integers $\{1, 2, \ldots, |V|^2\}$. (The
cardinality of this set could be reduced to $|E|$. But for simplicity of discussion assume this range.)
If $M$ is any integer in the interval $[1, |V|^2]$, the first $\log |V|$ bits and the second $\log |V|$ bits of $M$
represent the two end vertices of the corresponding edge.
The above simple encoding has the following property: The vertices of $G$ are simply
$\{1, 2, \ldots, |V|\}$. Also, given an integer $M$ in the interval $[1, |V|^2]$, we could check if it is an edge of
$G$ or not, in $O(\log n)$ time using $n / \log n$ processors. One of the major advantages of this encoding is
that we never have to generate and store the whole of G; we could generate the nodes and edges
on the fly. This encoding scheme and the BDDs can be thought of as two extremes of a spectrum
of encoding schemes.
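A minimal sketch of this encoding follows. It uses 0-based integers rather than the 1-based ranges above, and the edge test assumes an interleaving semantics in which exactly one process takes a local transition per global step; the paper does not fix the composition rule at this point, so that rule and all function names are assumptions of the example.

```python
def encode(local_states, bits):
    """Pack per-process state indices into one integer (first process in the high bits)."""
    code = 0
    for state, k in zip(local_states, bits):
        assert 0 <= state < (1 << k)
        code = (code << k) | state
    return code

def decode(code, bits):
    """Recover the list of per-process states from a global state code."""
    states = []
    for k in reversed(bits):
        states.append(code & ((1 << k) - 1))
        code >>= k
    return states[::-1]

def encode_edge(u, v, total_bits):
    """First log|V| bits: source vertex; second log|V| bits: target vertex."""
    return (u << total_bits) | v

def decode_edge(m, total_bits):
    return m >> total_bits, m & ((1 << total_bits) - 1)

def is_edge(m, bits, local_edges):
    """Edge test on the fly, without building G (ASSUMED interleaving rule)."""
    u, v = decode_edge(m, sum(bits))
    src, dst = decode(u, bits), decode(v, bits)
    moved = [i for i in range(len(bits)) if src[i] != dst[i]]
    return len(moved) == 1 and (src[moved[0]], dst[moved[0]]) in local_edges[moved[0]]

bits = [1, 1]                                  # two processes, two states each
local_edges = [{(0, 1), (1, 0)}, {(0, 1)}]     # hypothetical local transitions
m = encode_edge(encode([0, 0], bits), encode([1, 0], bits), sum(bits))
print(is_edge(m, bits, local_edges))           # True: process 0 moves 0 -> 1
```

The per-process comparisons inside `is_edge` are independent of one another, which is what makes the check easy to distribute over processors.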
BDD is a very efficient encoding scheme and it makes use of the special structure of the graph
to be encoded, namely, that the graph represents a boolean function. The very structure of BDDs
enables the design of efficient algorithms for their manipulation. On the other hand, the simple
encoding scheme we mentioned above applies to an arbitrary graph. Though the memory usage
is minimal, manipulating the graphs might be time consuming under this encoding scheme. An
interesting and important question is: Are there other encoding schemes that are in between these
two encoding schemes, whose applicability will be more general than that of BDDs and which
will lead to more efficient manipulation algorithms than the above simple encoding scheme? For
instance, one could obtain a number of encoding schemes in the following manner: Partition the
components of the concurrent system. Use BDDs to represent each group and employ the above
encoding scheme for the conjunction of groups.
7 Searching Less Space: Probabilistic Model Checking
The idea of probabilistic model checking is to prune the search in the state graph using information
about the transition probabilities. If G(V, E) is the state graph of a single process, edges of this
graph could be labeled with probabilities in the following manner: Edge $(i, j)$ is labeled with $p_{ij}$
if the probability of a transition from state $i$ to state $j$ is $p_{ij}$. If we have such a labeled graph for
each one of the n processes, we could obtain a labeled graph for the composition of these processes
as well (using basic rules for composing probabilities).
In probabilistic model checking, for instance, we could search only certain portions of the state
graph and make high probability inferences. As an example consider a system of n processes in
which we would like to detect any deadlocks that may be present. We randomly select and explore
only a (small) portion of the graph and based on the information gained, we output (a confidence
interval on) the probability of a deadlock occurring in the system. The nodes and edges to be
explored can be generated on the fly; there is no need for generating the whole state graph.
If we can afford to visit only a small portion of the state graph, it is quite reasonable to expect
that our answer may not be correct in the worst case. But we can show that with high probability
our assertions will be correct. To illustrate our idea, consider the following simple scenario: Assume
that each node has information about whether or not it is a deadlock node. Also assume that the
deadlocks are uniformly distributed in the following sense: Starting from a node in the graph, if we
perform a random walk in the graph for $t$ steps (for some $t$), the probability, $p_t$, of encountering a
deadlock node is independent of the starting node.
Under the above assumption it turns out that we only have to explore the graph (for $t$ steps)
starting from $L = O(\log N)$ random nodes in the state graph (here $N$ is the number of nodes in the
composed state graph), before we'll be able to specify a confidence interval on the probability of a
deadlock. Similar random sampling techniques have been employed in designing optimal sequential
and parallel algorithms for various problems (see e.g., [19] and [18]).
If $L$ ($= \Theta(\alpha \log N)$) is the total number of random origins explored and if $M$ is the number
of origins starting from which a deadlock node was visited, we conclude that the total fraction of
deadlock nodes in the whole system will be in the interval $[(1 - \epsilon)\frac{M}{L}, (1 + \epsilon)\frac{M}{L}]$, where $\epsilon$ is any
fixed number ($0 < \epsilon < 1$). In the special case $M = 0$, this interval reduces to a single point, namely,
0. The probability that our conclusion is correct can be shown to be $\ge 1 - \exp(-\epsilon^2 \alpha \log N)$, using
the sampling lemmas proven in [19]. For instance, we can make this probability $\ge 1 - 1/N$ (which
is very nearly equal to 1) for proper choices of $\epsilon$ and $\alpha$. If we desire a narrow confidence interval
(i.e., a small $\epsilon$), naturally we will have to spend more time (i.e., we have to choose a bigger $\alpha$).
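The sampling procedure can be sketched as follows. The state graph is supplied through two callbacks so that nodes are generated on the fly; the callback interface, the uniform choice among successors, and the decision to count a deadlock at the origin itself are all assumptions of this example rather than details fixed by the discussion above.

```python
import math
import random

def estimate_deadlock_probability(successors, is_deadlock, num_states, t, alpha, rng):
    """Run t-step random walks from L = ceil(alpha * ln N) random origins.
    Returns (M, L), where M counts the origins whose walk met a deadlock node."""
    L = max(1, math.ceil(alpha * math.log(num_states)))
    M = 0
    for _ in range(L):
        node = rng.randrange(num_states)       # random origin
        hit = is_deadlock(node)
        for _step in range(t):
            if hit:
                break
            nxt = successors(node)
            if not nxt:                        # no outgoing transition
                break
            node = rng.choice(nxt)
            hit = is_deadlock(node)
        M += hit
    return M, L

# Hypothetical 16-state ring with no deadlock nodes at all.
ring = lambda s: [(s + 1) % 16]
M, L = estimate_deadlock_probability(ring, lambda s: False, 16, 5, 4.0, random.Random(0))
print(M, L)   # 0 12
```

Under the reading that the interval is centred on the observed fraction $M/L$, the caller would then report $[(1 - \epsilon) M/L, (1 + \epsilon) M/L]$ for the chosen $\epsilon$.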
The above discussion applies to any representation of the concurrent system, in particular to
CTSMP as well as BDD representations. We propose to investigate the above idea of probabilistic
search in general instances of model checking.
The encoding scheme we mentioned in the last section seems to yield easily to parallelism. For
example, if G has a regular structure (for example G could be a complete graph), and if we have
P processors we could deterministically partition the task of reachability analysis equally among
the P processors and achieve optimal speedup. This involves each processor analyzing different
portions of the state graph in parallel. Even if G has no known structure we could use the following
randomized strategy for obtaining near optimal speedup. Each processor starts from a random
node in G and starts exploring. For example, each processor could perform a depth restricted DFS
starting from a random node. We can show that with high probability the load will be equally (up
to a small multiplicative constant) shared by all the P processors.
This encoding scheme is also well suited in the context of probabilistic model checking. Probabilistic
search on this encoding will enable us to prune the search considerably. Any processor while exploring
a path could give up this path if the path has a very low associated probability. Notice also that
the memory requirements are very reasonable. Even if one performs a (sequential) recursive DFS
of the whole graph, at any time we need only a memory that is polynomial in n for the stack.
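A depth-restricted search that gives up on low-probability paths might look like the sketch below. The `successors` callback, which yields (next state, transition probability) pairs on the fly, and the fixed `threshold` are assumptions of the example. No visited set is kept, so the only memory used is the recursion stack, at the cost of possibly revisiting states.

```python
def explore(state, successors, threshold, max_depth, prob=1.0):
    """Yield every state on a path from `state` whose probability stays >= threshold."""
    yield state
    if max_depth == 0:
        return
    for nxt, p in successors(state):
        if prob * p >= threshold:              # otherwise abandon this path
            yield from explore(nxt, successors, threshold, max_depth - 1, prob * p)

# Hypothetical labeled graph; states without an entry have no successors.
graph = {0: [(1, 0.9), (2, 0.1)], 1: [(3, 0.5), (4, 0.5)], 2: [(5, 1.0)]}
succ = lambda s: graph.get(s, [])
print(sorted(set(explore(0, succ, 0.2, 3))))   # [0, 1, 3, 4]
```

Lowering the threshold to 0.05 in this example also admits the path through states 2 and 5.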
8 Randomization: A Powerful Technique
We believe that both of our approaches to state space explosion can be enhanced with the wonderful
technique of randomization. Since its introduction in 1976, randomization has been used to solve
myriads of computational problems. Recent implementation results show that this technique is
very practical as well. In this section we give a brief introduction to randomized algorithms. One
of the goals of this project will be to identify steps in model checking which can benefit from
randomization. We have some preliminary results in this direction as well, as we mention toward
the end of this section.
Classical approaches to introducing randomness in algorithms typically assume a distribution
on possible inputs and compute the expected performance of various (deterministic) algorithms.
Quicksort is a good example. If one assumes that each input permutation is equally likely to
occur, Quicksort runs in an expected O(n log n) time to sort n numbers. The credibility of such
an approach critically depends on the assumption made on the inputs. There may be applications
where the input distribution is quite different from the one used in the probabilistic analysis.
As an attractive alternative, Rabin [16] proposed introducing randomness in the algorithm itself.
A randomized algorithm is one wherein certain decisions are made based on the outcomes of coin
flips. No matter what the input is, a large fraction of all possible outcomes for the coin flips will
ensure 'good performance' of the algorithm. Thus the two approaches differ in the probability space
used for analysis. In the former one considers the space of all possible inputs and in the latter one
employs the space of all possible outcomes for coin flips.
To give a flavor for the above notions, we now give an example of a randomized algorithm. We
are given a polynomial of $n$ variables $f(x_1, \ldots, x_n)$ over a field $F$. It is required to check if $f$ is
identically zero. We generate a random $n$-vector $(r_1, \ldots, r_n)$ ($r_i \in F$, $i = 1, \ldots, n$) and check if
$f(r_1, \ldots, r_n) = 0$. We repeat this for $k$ independent random vectors. If there was at least one vector
on which $f$ evaluated to a nonzero value, of course $f$ is nonzero. If $f$ evaluated to zero on all the $k$
vectors tried, we conclude f is zero. It can be shown that the probability of error in our conclusion
will be very small if we choose a sufficiently large k. In comparison, the best known deterministic
algorithm for this problem is much more complicated and has a much higher time bound.
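The test just described can be sketched in a few lines. The field is taken to be the integers modulo a prime, and the polynomial is passed as a Python callable; both choices, along with the function name, are assumptions made for the example.

```python
import random

def probably_zero(f, num_vars, prime, trials, rng):
    """Randomized identity test over the integers modulo `prime`.
    False is always correct; True may be wrong with small probability."""
    for _ in range(trials):
        point = [rng.randrange(prime) for _ in range(num_vars)]
        if f(*point) % prime != 0:
            return False        # a nonzero value certifies that f is not zero
    return True                 # f vanished on every sampled point

identity = lambda x, y: (x + y) ** 2 - x * x - 2 * x * y - y * y
shifted = lambda x, y: x * y - y * x + 1
rng = random.Random(1)
print(probably_zero(identity, 2, 1_000_003, 20, rng))   # True
print(probably_zero(shifted, 2, 1_000_003, 20, rng))    # False
```

The error is one-sided: only the answer "zero" can be wrong, and by the Schwartz-Zippel bound each trial on a nonzero polynomial of total degree $d$ misses with probability at most $d/|F|$, so $k$ independent trials miss with probability at most $(d/|F|)^k$.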
Notice that the algorithm we mentioned in section 7 for checking for the absence of deadlock
(i.e., the case when $M = 0$) resembles the above algorithm for polynomial identity. The same
algorithm could be used to check for other conditions such as liveness, fairness, etc. In fact the
search algorithm presented there is a specific randomized algorithm. However, randomization is a
powerful tool that can be used in many different ways in model checking.
Two extremely important advantages of randomized algorithms are their simplicity and efficiency.
This has been demonstrated many times in this decade by the large number of randomized
algorithms that have been designed for several vital problems. We list some of these problems
now (this list is by no means exhaustive): packet routing, parallel sorting, the maximal independent
set problem, parallel connectivity, parallel DFS, parallel dictionary operations, and computational
geometry problems such as Voronoi diagrams and convex hulls. A survey on randomized parallel
algorithms for comparison problems is [19].
We have already obtained some success in applying randomization to a problem related to
model checking using BDDs. As was mentioned before, DFA minimization is an important step in
manipulating BDDs. DFA minimization can be thought of as the Coarsest Partition Problem (CPP)
with two functions. Recently, Sarnath [20] has given a simple algorithm for CPP with one function.
His algorithm runs in time $O(\log^2 n)$ on an $n$-processor CREW PRAM. In this section we present
a randomized algorithm for the same problem that runs in time $O(\log^* n \log n)$ using $n / \log^* n$ CRCW
PRAM processors. Though this algorithm does not immediately imply an optimal algorithm for
DFA minimization, we believe that the ideas employed can be extended to DFA minimization. We
plan to investigate this further.
Next we give a brief summary of Sarnath's algorithm and our improvement of this algorithm.
Coarsest Partition Problem. Given a set $S$ of $n$ elements $a_1, a_2, \ldots, a_n$, a partition $\pi =
(\pi_1, \pi_2, \ldots, \pi_k)$ of $S$ and a function $f : S \to S$, the coarsest partition problem (CPP) is to find the
coarsest partition $\pi' = (\pi'_1, \pi'_2, \ldots, \pi'_\ell)$ such that:
1. $\forall a_i, a_j \in S$, $a_i, a_j \in \pi'_m \Rightarrow \exists p$ such that $f(a_i), f(a_j) \in \pi'_p$; and
2. $\forall m$, $1 \le m \le \ell$, $\exists p$, $1 \le p \le k$ such that $\pi'_m \subseteq \pi_p$.
The crucial lemma proven in [20] is the following: Two elements $a_i$ and $a_j$ of $S$ will belong to
different blocks in the final partition if and only if $\exists k$, $0 \le k \le n(n - 1)/2$, such that $f^k(a_i)$ and
$f^k(a_j)$ belong to different initial blocks. (Here $f^0(a) = a$ and $f^k(a) = f(f^{k-1}(a))$.)
The idea of the CPP algorithm is to make use of the above fact and the 'pointer jumping' or
'doubling trick'. In order to check if two elements $a_i$ and $a_j$ from the same initial block will
belong to the same final block we can evaluate $f^k(a_i)$ and $f^k(a_j)$ for $k = 1, 2, \ldots, n(n - 1)/2$.
We can stop as soon as we find a $k$ for which they fall in different initial blocks. Instead of checking this
condition for each k, we can use the doubling trick and reduce the number of evaluations drastically.
Here is a description of the algorithm: $P$ is an array of size $n$. After $t$ iterations of the for loop,
$P[i]$ holds the index of $f^{2^t}(a_i)$. Initially, $P[i]$ will have a value $j$ if $f(a_i) = a_j$. $L$ is another array
of size $n$. To begin with, $L$ will have the initial labels (i.e., partition labels) of elements. As the
algorithm proceeds, the labels will get changed such that at any given time, two elements will have
the same label if they have not been so far shown to belong to two different final partitions.
for t := 1 to 2 log n do
    1.1 for each i in parallel: L[i] := n * (L[i] - 1) + L[P[i]];
    1.2 for each i in parallel: P[i] := P[P[i]];
    2.  Sort the array L and assign L[i] := Rank of L[i] in L;
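A sequential simulation of this loop may help in following the relabeling. Each list comprehension plays the role of one parallel step (all reads use the old arrays), indices are 0-based, and the rank in step 2 is taken to be a dense rank so that equal labels stay equal; the function name and the rounding of $2 \log n$ are assumptions of the sketch.

```python
import math

def coarsest_partition(f, labels):
    """f[i] is the index of f(a_i); labels[i] in 1..n is the initial block of a_i.
    Returns final labels in 1..n; equal labels mean the same final block."""
    n = len(f)
    P, L = list(f), list(labels)
    rounds = 2 * max(1, math.ceil(math.log2(n)))
    for _ in range(rounds):
        L = [n * (L[i] - 1) + L[P[i]] for i in range(n)]        # step 1.1
        P = [P[P[i]] for i in range(n)]                          # step 1.2
        rank = {v: r for r, v in enumerate(sorted(set(L)), 1)}   # step 2
        L = [rank[v] for v in L]
    return L

# a0 -> a1 -> a2 -> a3 -> a3, with a3 alone in its initial block:
# every element reaches a3 after a different number of steps.
print(coarsest_partition([1, 2, 3, 3], [1, 1, 1, 2]))   # [1, 2, 3, 4]
```

When $f$ merely swaps elements inside each initial block, as in `coarsest_partition([1, 0, 3, 2], [1, 1, 2, 2])`, the initial partition is already the answer and the labels never change.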
The crucial observation we make is that step 2 in the above for loop can be replaced with a
more efficient operation. Sorting is an expensive step. In step 1.1, each element is assigned a label
in the range $[0, n^2]$ such that elements that belong to the same partition have the same label. In
step 2, we reassign a label to each element from the range $[0, n]$ with the help of sorting (again,
elements in the same partition will get the same label). We can indeed get the same effect by
making use of an algorithm for 'representative selection'. There is a randomized algorithm for this
[11] that runs in $O(\log^* n)$ time using $n / \log^* n$ CRCW PRAM processors.
The representative selection algorithm runs as follows: A random hash function $h : \{1, 2, \ldots, N\} \to
\{1, 2, \ldots, cn\}$ ($c$ being a constant $> 1$) is used, where $[1, N]$ is the range of values that the elements
can take. (In the CPP algorithm above, $N = n^2$.) Each element with value $i$ tries to write in common memory
cell $h(i)$. If more than one processor tries to write in the same cell, one of them succeeds.
All the elements with the same value $i$ have now chosen a representative (i.e., a label in the
range $[1, cn]$). One can show that with high probability almost all the elements will have chosen a
representative in the first attempt. Those which do not succeed in the first attempt participate in
future rounds. It is shown in [11] that no more than $O(\log^* n)$ attempts will be needed before every element
has a representative.
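The round structure can be imitated sequentially as below. This is only a simulation of the idea, not the algorithm of [11]: the salted use of Python's built-in `hash` stands in for the random hash function, a fresh salt is drawn in every round, and a dictionary of claimed cells replaces the concurrent writes. All of these are assumptions of the sketch.

```python
import random

def select_representatives(values, c, rng):
    """Give equal values the same label in 1..c*n and distinct values distinct labels.
    Returns (list of labels, number of rounds used)."""
    n = len(values)
    claimed = {}              # cell index -> value that won the write
    label = {}                # value -> cell index serving as its representative
    pending = set(values)
    rounds = 0
    while pending:
        rounds += 1
        salt = rng.getrandbits(64)                    # fresh random hash per round
        for v in sorted(pending):
            cell = hash((salt, v)) % (c * n) + 1
            if cell not in claimed:                   # exactly one value wins a cell
                claimed[cell] = v
                label[v] = cell
        pending -= set(label)                         # losers retry in the next round
    return [label[v] for v in values], rounds

labels, rounds = select_representatives([5, 9, 5, 16, 9, 1], 2, random.Random(7))
print(labels[0] == labels[2], len(set(labels)))       # True 4
```

Used in place of step 2 of the CPP loop, such labels fall in $[1, cn]$ rather than $[1, n]$, so the multiplier in step 1.1 would have to be $cn$ instead of $n$ in this simulation.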
It is still an open problem whether one could obtain a parallel algorithm for DFA minimization
that is optimal.
9 Conclusions
We have presented algorithms for model checking. Our approach to state explosion falls under
two categories: 1) Reducing the search space, and 2) Searching less space. In particular, we have
given efficient parallel algorithms for manipulating BDDs. These algorithms are of independent
interest in many application domains. As an application we have shown how to perform reachability
analysis in model checking efficiently. The BDD algorithms could also be used, for instance, in
solving the satisfiability problem, in addition to many other applications. We have also offered
probabilistic searching as an attractive alternative to existing approaches, and we have shown that
randomization could help in the design of algorithms for model checking.
References
[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms,
Addison-Wesley Publications, 1974.
[2] R.P. Brent, The Parallel Evaluation of General Arithmetic Expressions, Journal of the ACM,
21(2), pp. 201-208, 1974.
[3] R.E. Bryant, Graph-Based Algorithms for Boolean Function Manipulation, IEEE Trans. on
Computers, Vol. C-35, No.8, 1986, pp. 677-691.
[4] J.R. Burch, E.M. Clarke, and D.E. Long, Symbolic Model Checking with Partitioned Transition
Relations, in Proc. VLSI Conference, Edinburgh, 1991.
[5] J.R. Burch, E.M. Clarke, K.L. McMillan, and D.L. Dill, Sequential Circuit Verification Using
Symbolic Model Checking, Draft, 1990.
[6] J.R. Burch, E.M. Clarke, K.L. McMillan, D.L. Dill, and L.J. Hwang, Symbolic Model Checking:
$10^{20}$ States and Beyond, in Proc. International Workshop on Formal Methods in VLSI Design,
1991.
[7] E.M. Clarke and O. Grumberg, Research on Automatic Verification of Finite-State Concurrent
Systems, Ann. Rev. Comput. Sci., 1987, pp. 269-290.
[8] E.M. Clarke, D.E. Long, and K.L. McMillan, Compositional Model Checking, manuscript,
1991.
[9] E.M. Clarke, E.A. Emerson, and A.P. Sistla, Automatic Verification of Finite State Concurrent
System Using Temporal Logic Specification, ACM Transactions on Programming Languages
and Systems, 8 (2): 244-263, April 1986.
[10] R. Gerber, and I. Lee, Specification and Analysis of Resource-Bound Real-Time Systems,
Technical Report, Dept. of CIS, Univ. of Pennsylvania, 1991.
[11] J. Gil, Y. Matias, and U. Vishkin, Towards a Theory of Nearly Constant Time Parallel Algo-
rithms, Proc. IEEE Symp. on Foundations of Computer Science, 1991, pp. 698-710.
[12] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley Publishing Company, 1992.
[13] J. JáJá and S.R. Kosaraju, Parallel Algorithms for Planar Graph Isomorphism and Related
Problems, IEEE Trans. on Circuits and Systems, 35(3), pp. 304-310, 1988.
[14] S. Kimura and E.M. Clarke, A Parallel Algorithm for Constructing Binary Decision Diagrams,
in Proc. IEEE International Conference on Computer Design, pp. 220-223, 1990.
[15] F.T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
Hypercubes, Morgan Kaufmann Publishers, 1992.
[16] M.O. Rabin, Probabilistic Algorithms, in: J.F. Traub, ed., Algorithms and Complexity,
Academic Press, New York, 1976, pp. 21-36.
[17] S. Rajasekaran and J.H. Reif, Derivation of Randomized Sorting and Selection Algorithms,
Technical Report, Aiken Computing Lab., Harvard University, 1985.
[18] S. Rajasekaran and J.H. Reif, Optimal and Sub-Logarithmic Time Randomized Parallel Sorting
Algorithms, SIAM Journal on Computing, vol. 18, no. 3, 1989, pp. 594-607.
[19] S. Rajasekaran and S. Sen, Random Sampling Techniques for Parallel Algorithms Design, in
Synthesis of Parallel Algorithms, edited by J.H. Reif, Morgan Kaufmann Publishers, 1992.
[20] R. Sarnath, An Improved Algorithm for DFA Minimization, to be presented in the Thirtieth
Annual Allerton Conference on Communication, Control, and Computing, Illinois, Oct. 1992.
