The Test Model-checking Approach to the Verification of
Formal Memory Models of Multiprocessors

Ratan Nalumasu, Rajnish Ghughal, Abdel Mokkedem and Ganesh Gopalakrishnan
UUCS
Department of Computer Science, University of Utah
Salt Lake City, UT
Contact email: {ratan, ganesh}@cs.utah.edu
This technical report combines work reported in CAV and SPAA.
Abstract
We offer a solution to the problem of verifying formal memory models of processors by combining
the strengths of model-checking and a formal testing procedure for parallel machines. We
characterize the formal basis for abstracting the tests into test automata and associated memory
rule safety properties whose violations pinpoint the ordering rule being violated. Our experimental
results on Verilog models of a commercial split transaction bus demonstrate the ability of our
method to effectively debug design models during early stages of their development.
Keywords: formal memory models, shared memory multiprocessors, formal testing, model-checking
Introduction
The fundamentally important problem [AG] of verifying whether a given memory system model (or
"a memory system") provides a formal memory model (or "memory model") appears in a number of
guises. CPU designers are interested in knowing whether some of the aggressive execution techniques,
such as speculative issue of memory operations, violate sequential consistency. I/O bus designers are
interested in knowing the exact semantics of shared accesses provided by split I/O transactions
[Cor]. Even designers of multi-threaded languages such as Java that support shared
updates [GJS] are interested in this problem. Formal verification methods are ideally suited for
this problem because (i) the semantics of memory orderings are too subtle to be fathomed through
informal reasoning alone, and (ii) ad hoc testing methods cannot provide assurance that the desired
memory model has been implemented. Unfortunately, despite the central importance of this problem
and the large body of formal methods research in this area, there is still no single formally based
method that the designer of a realistic multiprocessor system can use on his/her detailed design
model to quickly find violations in the design. In this paper we describe such a method, called test
model-checking.
Test modelchecking formally adapts to the realm of modelchecking a formally based architectural
testing method called Archtest Archtest has been successfully used on a number of commercial
multiprocessors Col by running a suite of testprograms on them Archtest is an incomplete
 
Supported in part by ARPA Order  B under SPAWAR Contract  NC Avalanche	
 DARPA
under contract  DABTC UV	

testing method in that it does not under all circumstances detect violations of memory orderings
Col  Nevertheless its tests have been shown to be incisive in practice Col Most importantly
the formal theory of memory ordering rules developed by Collier in Col  forms the basis for
Archtest which means that whenever a violation is detected by Archtest there is a formal line
of reasoning leading back to the precise cause
Being based on Archtest, test model-checking is also incomplete. However, none of the presumed
complete alternatives to date have been shown to be practical for verifying large designs. For
example, [PD] involves the use of manually guided mechanical theorem proving. Even approaches
based on conventional model-checking are impossibly difficult to use in practice. For example, the
assertions pertaining to the sequential consistency of lazy caching [Ger] (a simple memory system),
expressed in various temporal logics by [Gra] in $\forall$CTL* [CES] and by [LLOR] in TLA [Lam],
are highly complex. We do not believe that descriptions of this style will scale up. On the other
hand, the test model-checking method has not only been able to comfortably handle the memory
system defined by the symmetric multiprocessor (SMP) bus called Runway [BCS, GGH], used
by Hewlett-Packard in their high-end machines, but it has also discovered many subtle bugs in the
early Utah Runway Model (URM) that we created. Our URM includes a number of details such as split
transactions, out-of-order transaction completions, and even an element of speculative execution.
The errors we made in capturing these details could well have been made in an actual industrial
context. We believe that, with growing system complexity, the role of debugging methods that are
effective and formally based will only grow in significance, regardless of whether the methods are
complete or not.
Test modelchecking has a number of other desirable features It involves modelchecking a xed
set of safety properties for each formal memory model that are very nearly independent of the actual
memory system model being tested This xed nature greatly facilitates the use of test model
checking within the design cycle where debugging is most eective design changes are frequent
and timeconsuming alterations to the properties being veried following design changes would be
frowned upon test modelchecking will not need such alterations Also the formal adaptation of the
tests of Archtest made in test modelchecking can be veried once and for all thanks to the xed
set of tests used in test modelchecking we describe and argue the correctness of these abstractions
later Finally in test modelchecking a memory model is viewed as a collection of simpler ordering
rules and for each constituent ordering rule a specic property is tested on the memory system We
found that this signicantly helps compartmentalize errors as opposed to producing nonintuitive
error traces that could result during conventional modelchecking which can be very dicult to
understand for nontrivial memory systems
Test modelchecking is also a more eective debugger for memory models than Archtest in a
formal sense The tests of Archtest are straightline programs of length k one per node Such
programs execute on various nodes of the multiprocessor concurrently The recommendation accom
panying Archtest is that users run the tests for as large a k that is feasible because then the
chances of being scheduled according to dierent interleavings by the underlying operating system
memory controller arbiter etc increase In adapting the tests of Archtest test modelchecking
gives the eect of choosing k   Thus we cover all possible schedules The subtle bugs detected
by test modelchecking on realistic examples that are reported in Section  corroborate our intuition
that test modelchecking is indeed an eective debugging tool for memory models
To reiterate, our specific contributions in this paper are: (i) the adaptation of a formal testing
method for memory models to model-checking, which can be applied during the design of modern
microprocessors whose memory systems are very complex; (ii) a formal characterization, accompanied
by proofs, of how the tests of the testing method are abstracted and turned into a fixed set of safety
properties that are then model-checked; and (iii) experimental results on three examples using the
VIS model-checker, the last example being much larger than any previously reported in this context.

[Figure body not recoverable: temporal-logic formulas over the predicates enable-read, after-write
and avail, quantified over addresses, data and process indices.]
Figure: Part of the specification of Sequential Consistency from [Gra]
Related Work
In [Gra], abstract interpretation [CC] is employed to reduce infinite-system verification to
finite $\forall$CTL* model-checking. They apply this technique to verify the sequential consistency
of lazy caching with unbounded queues. They recognize that to get an exact characterization of
sequential consistency involving only the observable event names, one needs full second-order logic
[Gra]. To be able to express sequential consistency in $\forall$CTL*, they give a stronger
characterization of sequential consistency. For this stronger characterization, the expression of
sequential consistency is very complex, as shown in the figure reproduced from [Gra], which shows only
part of their sequential consistency expression.
A technique very similar to test model-checking was proposed in [McM] under the section heading
"Sequential Consistency". To give a historic perspective, our test model-checking idea originated in
our attempt to answer the following two questions: (i) which memory ordering rules is [McM]
really verifying? (ii) is this a general technique, i.e., can other memory ordering rules be verified
in the same fashion? We still have not found a satisfactory answer to the first question, because the
test in [McM] uses only one location, so it cannot be a test for sequential consistency;
it could plausibly be a test for coherence, which again does not correspond to what Collier formally
proves in [Col]. One of our contributions is that we address these questions by elaborating on the
theoretical as well as practical aspects of test model-checking.
In PD  the authors use a method called aggregation on a distributed shared memory coherence
protocol used in an experimental multiprocessor to arrive at a simplied model of system behavior
Their technique involves manual theorem proving The work in HMTLB  as well as DPN 
are aimed at verifying that synchronization routines work correctly under various memory models
where the memory models themselves are described using nitestate operational models They do
not address the problem of establishing the memory models provided by detailed memory subsystem
designs which is our contribution In GK  GK  the authors analyze the problem of deciding
whether a given set of traces are sequentially consistent Our approach diers in two respects First
we are interested in proving that detailed models of memory systems are correct while they obtain
traces presumably from actual machines and analyze them for sequential consistency Second our
method is more useful for CPU designers as it can give feedback during early phases of the design
pinpointing which ordering rules are violated if any
Overview of Archtest
Archtest is based on the theory presented in [Col] that formally defines and characterizes
architectural rules obeyed by memory subsystems of multiprocessors.

        P_1             P_2
L_1:    A := 1          X[1] := A
L_2:    A := 2          X[2] := A
...
L_k:    A := k          X[k] := A
Figure: Test_ROWO, the Archtest test for A(CMP, RO, WO)

Although these rules are elemental, Archtest tests for compound rules: obeying a compound rule is
tantamount to obeying all the constituent elemental rules, and violating a compound rule is tantamount
to violating at least one of the constituent elemental rules. Each such elemental rule describes a
constraint on the order in which various read and write events can occur. For read operations, there is
one read event per read operation. However, for write operations, there is one write event per
process per write operation, which captures the effect of a write operation becoming visible to different
processors at different times. Some of the elemental ordering rules are:
Rule of Computation (CMP): This is a basic rule defining how the terminal value of each
operand is calculated from the initial values of the operands. Though most of the literature
on memory architectures implicitly assumes this rule, we will often keep it explicit in our
discussions.
Rule of Read Order (RO): For any pair of read events a and b in the same process, if a comes
before b in program order, then a happens before b.
Rule of Write Order (WO): For any pair of write events a and b in the same process, if a comes
before b in program order, then a happens before b.
Rule of Program Order (PO): For any pair of events a and b in the same process, if a comes
before b in program order, then a happens before b. Event a or b can be either a read or a write
event, so both RO and WO are special cases of PO. This is one of the strongest ordering rules
and is essential for sequential consistency.
Rule of Write Atomicity (WA): A write operation becomes visible to all processes
instantaneously. More precisely, one conceptual store $S_i$ is associated with each processor node $P_i$.
Then, for each write operation $W$, one write event $W_i$ is defined per store $S_i$. WA then
guarantees that there are no $i$, $j$ and no event $e$ such that $e$ is before $W_i$ and after $W_j$.
In order to check memory subsystems for a compound rule, Archtest provides a test for each
compound rule along with a set of conditions to be checked. If any of the conditions is violated,
then a failure to obey the compound rule is detected.
Test_ROWO: Archtest test for A(CMP, RO, WO)
The test of Archtest for the compound rule consisting of the elemental rules CMP, RO and
WO, denoted A(CMP, RO, WO), is shown in the Test_ROWO figure. Process $P_1$ executes a sequence of
write instructions intended to check for WO, and $P_2$ executes a sequence of read instructions
intended to check for RO. If the memory system correctly realizes A(CMP, RO, WO), then the Monotonic
condition below holds.
Condition (Monotonic): The sequence of X values is monotonically non-decreasing, i.e.,
$\forall i, j : 1 \le i < j \le k \Rightarrow X[i] \le X[j]$, or equivalently $\forall i : 1 \le i < k \Rightarrow X[i] \le X[i+1]$.
If the Monotonic condition is violated, then at least one of the CMP, RO and WO rules is violated.
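As a minimal sketch of how the Monotonic condition can be evaluated on the values collected by the reading process, assuming the array X is available as a Python list (the function names are illustrative and not part of Archtest):

```python
from typing import Optional, Sequence


def first_monotonic_violation(x: Sequence[int]) -> Optional[int]:
    """Return the first 0-based index i with x[i] > x[i+1], or None."""
    for i in range(len(x) - 1):
        if x[i] > x[i + 1]:
            return i
    return None


def monotonic(x: Sequence[int]) -> bool:
    """True when the adjacent-pair form of the Monotonic condition holds."""
    return first_monotonic_violation(x) is None
```

Checking adjacent pairs suffices because the pairwise and the adjacent forms of the condition are equivalent, as stated above.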












[Program listing garbled in this copy. As far as it can be recovered: two processes write the
ascending values 1, ..., k to A and to B respectively; a third process alternately executes
U[i] := A and V[i] := B; a fourth alternately executes X[i] := B and Y[i] := A.]
Figure: Test_WA, the Archtest test for A(CMP, RO, WO, WA)

Initially A = B = 0
            P_1             P_2
L_1:        A := 1          B := 1
L_2:        Y[1] := B       X[1] := A
L_3:        A := 2          B := 2
L_4:        Y[2] := B       X[2] := A
...
L_{2k-1}:   A := k          B := k
L_{2k}:     Y[k] := B       X[k] := A
Figure: Test_PO, the Archtest test for A(CMP, PO)
Test_WA: Archtest test for A(CMP, RO, WO, WA)
Test_WA, shown in the Test_WA figure, tests for A(CMP, RO, WO, WA), with the conditions checked being
(i) the Monotonic condition suitably modified for arrays U, V, X, Y, and (ii) Atomic, which is
Condition (Atomic): $\forall i, j : 1 \le i, j \le k : \neg\,(V[i] < X[j] \wedge Y[j] < U[i])$
The Atomic condition watches for the possibility that a write operation from one writing process and a
write operation from the other [...]







  Archtest test for ACMP  PO
Test
PO
 shown in Figure  tests for ACMP  PO with the conditions checked being i the
Monotonic condition suitably modied for arrays X  Y  and ii PO Cross which is
Condition  PO Cross  i  j    i  j  k  X i j  Y j  i  X i  j  Y j  i




etc are meant to be run on real machines
and there cant be any real guarantees that the particular interleavings that reveal violations such
as for memory ordering rule WA watched by condition Atomic in Test
WA
 will indeed happen
To allow for as many interleavings as possible Archtest recommends that its tests be run for
large values of k With test modelchecking we eectively run the tests for k   Test model
checking achieves this by transforming each Archtest test into a test automata which exploits
nondeterminism to eectively check for k   Also the modelchecking framework guarantees
that we explore all possible interleavings than a particular interleaving

 Test modelchecking
Test modelchecking converts the tests of Archtest to corresponding memory rule test automata
	test automata
 that drive model of the memory system being examined In our experiments we
use the Verilog language supported by VIS Ver to capture the memory system models as well as
the test automata The Conditions corresponding to each compound memory rule being tested
are turned into corresponding memory rule safety properties that are checked by the VIS tool The
reader may take a peek at Section  to know which compound rules dene sequential consistency
Lam  In the remainder of this section we explain the assumptions under which we formally
derive test automata as well as memory rule safety properties followed by a description of how test
automata as well as memory rule safety properties are derived for specic cases
Assumptions about memory systems realized in hardware
Memory systems realized in hardware, as well as finite-state models thereof, are assumed to be data
independent, i.e., the control logic of the system moves data around and does not base its control
point settings on the data values themselves. We also assume that the system is address
semi-dependent [HB], i.e., the control logic can at most compare two addresses for equality or inequality
and base its actions on the outcome of this test. These assumptions are standard and form the basis
for defining test automata as well as memory rule safety properties.
Creation of test automata
As illustrated in the test automata figures, we obtain test automata for various memory models by
finitely abstracting the data used in the tests of Archtest, using nondeterminism to justify the
abstraction. For example, we abstract the specific activities of process $P_1$ of Test_ROWO into that
of nondeterministically writing all possible ascending values over {0, 1}, as shown in $P_1$ of the
corresponding test automaton. Also, since we cannot store infinite arrays in creating process $P_2$,
we turn $P_2$ and the corresponding memory rule safety property into an automaton that checks that
the array values read are monotonically non-decreasing. This in turn can be performed using just two
consecutive array values, $x_1$ and $x_2$, that are nondeterministically recorded by $P_2$. Hence the
memory rule safety property we model-check for is: $P_2$ in final state $\Rightarrow x_1 \le x_2$.
We now provide a justification that these abstractions preserve the memory rule safety properties,
i.e., for the same memory system model, a violation of a condition occurs in a test of Archtest
for $k = \infty$ iff the same violation occurs in model-checking the corresponding memory rule safety
property when test automata are used to drive the memory system model. To keep the presentation
simple, we formally argue how the test automaton finds every violation present in the test of Archtest
with $k = \infty$; the opposite direction of the iff, i.e., how a test of Archtest with $k = \infty$
finds the violations found by the test automaton, is easy to see because the test automaton just
appears as a "stuttering" of the test of Archtest. For example, the actions of $P_1$ in the test
automaton can be viewed as repeating the [...] of the Archtest test. Our proof sketches
are illustrated on the two tests presented in the overview of Archtest and another test described in
this section.
Abstracting Test_ROWO
We show that if the test program in Test_ROWO shows that Monotonic is violated, then the test
automaton also reveals the error. Since Monotonic is violated,
$\exists i : 1 \le i < k : X[i] > X[i+1]$
$\Rightarrow \exists i, \alpha : 1 \le i < k : X[i] > \alpha \wedge X[i+1] \le \alpha$
$\Rightarrow \exists i, \alpha : 1 \le i < k : (X[i] > \alpha) \wedge \neg (X[i+1] > \alpha)$

(a)
        P_1                     P_2
L_1:    A := 1                  X[1] := (A > \alpha)
L_2:    A := 2                  X[2] := (A > \alpha)
...
L_k:    A := k                  X[k] := (A > \alpha)

(b)
        P_1                     P_2
L_1:    A := (1 > \alpha)       X[1] := A
L_2:    A := (2 > \alpha)       X[2] := A
...
L_k:    A := (k > \alpha)       X[k] := A

(c)
                P_1         P_2
L_1:            A := 0      X[1] := A
...
L_\alpha:       A := 0      X[\alpha] := A
L_{\alpha+1}:   A := 1      X[\alpha+1] := A
...
L_k:            A := 1      X[k] := A
Figure: Abstraction of Test_ROWO

Since the last formula compares $X[i]$ and $X[i+1]$ only to $\alpha$, we can rewrite the test program
as shown in panel (a), assuming data independence, and rewrite the last formula as
$\exists i : 1 \le i < k : X[i] = 1 \wedge X[i+1] = 0$
Note that in panel (a) all reads of A occur in the expression $A > \alpha$. Hence we can replace every
$A := v$ with $A := (v > \alpha)$ and $X[i] := (A > \alpha)$ with $X[i] := A$ without affecting
Monotonic (again, if data independence holds) to obtain panel (b). Panel (c) is obtained by
simplifying panel (b): each $v > \alpha$ evaluates to 1 for $v > \alpha$ and 0 otherwise. This figure is
generalized to obtain the test automaton for Test_ROWO. Intuitively, the automaton finds the violation
as follows. $P_1$ remains in the initial state for $\alpha$ iterations, executing $A := 0$, and then
switches to the second state, executing $A := 1$. Also, $P_2$ remains in the initial state for $i - 1$
iterations and then switches to the second state, recording $x_1$ and then $x_2$ (dashed edges show
when these variables are recorded). Thus the test automaton's execution is identical to that in
panel (c), except that the test automaton gives the effect of taking $k$ to $\infty$. Also notice that
$x_1$ and $x_2$ get the values corresponding to $X[i]$ and $X[i+1]$. Also, corresponding to
$X[i] = 1 \wedge X[i+1] = 0$ we have $x_1 = 1 \wedge x_2 = 0$. Hence the memory rule
safety property corresponding to condition Monotonic is found violated by the test automaton
exactly when Test_ROWO for $k = \infty$ detects a violation. Note that the nondeterminism employed in [...]
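The threshold argument can be illustrated with a few lines of Python. This is only a sketch of the data abstraction step: the threshold name `alpha` and the helper names are assumptions chosen for illustration, and the sample arrays are invented.

```python
def abstract(values, alpha):
    """Map each value v to 1 if v > alpha, else 0 (the two-valued abstraction)."""
    return [int(v > alpha) for v in values]


def violation_witness(x):
    """Return (i, alpha) for the first 0-based i with x[i] > x[i+1],
    choosing alpha = x[i+1] so that x[i] > alpha and x[i+1] <= alpha."""
    for i in range(len(x) - 1):
        if x[i] > x[i + 1]:
            return i, x[i + 1]
    return None


# Invented example: the reader observed a drop from 5 to 3.
reads = [1, 2, 2, 5, 3, 6]
i, alpha = violation_witness(reads)
abstract_reads = abstract(reads, alpha)
# The writer's sequence 1, 2, ..., k becomes 0s followed by 1s under the same threshold.
abstract_writes = abstract(range(1, 7), alpha)
```

Under the chosen threshold, the concrete drop shows up as a 1 followed by a 0 in the abstract reads, while the writer's abstract sequence is still ascending, which is exactly what the two-state writer automaton produces.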








Test automaton for Test_WA
[...] ascending sequences of {0, 1} in A and B respectively. Each processor independently and
nondeterministically decides to switch from writing 0 to writing 1. Modifications similar to those in
Test_ROWO [...]

[Figure body not recoverable.]
Figure: Test automata for A(CMP, RO, WO, WA)

[...] $X[j]$, $Y[j]$ pair are recorded in $u$, $v$ and $x$, $y$. The memory rule safety property
corresponding to condition Atomic is: [the reading processes] in their final states
$\Rightarrow \neg\,(v < x \wedge y < u)$. As was explained for Test_ROWO, our abstraction avoids
having to remember the entire extent of the arrays U, V, X and Y. In Test_WA [...]
$\exists i, j : U[i] > Y[j] \wedge X[j] > V[i]$
$\Rightarrow \exists i, j, \alpha, \beta : Y[j] \le \alpha < U[i] \wedge V[i] \le \beta < X[j]$
Similar to Test_ROWO, assuming data independence, we have an execution of the test automaton
[...] iterate for $\alpha$, $i - 1$, $j - 1$, $\beta$ times respectively in their initial
states before switching to their final states. This test automaton execution detects violations of
Atomic exactly when Test_WA for $k = \infty$ would. A violation of Atomic happens exactly when
$u = 1 \wedge v = 0 \wedge x = 1 \wedge y = 0$.
Abstracting Test_PO
We now discuss a test for the elemental ordering rule Program Order (PO), which is somewhat more
complex than the previous two tests. PO requires that two events of the same process occur in the
order specified by the program. Archtest provides the test for the compound rule A(CMP, PO)
shown in the Test_PO figure. Violation of A(CMP, PO) is detected if the PO Cross condition fails. We
obtain the test automaton and the memory rule safety property for Test_PO as illustrated in the test
automata figure for A(CMP, PO). $P_1$ executes a pair of instructions, a write to A followed by a read
from B, infinitely often. The value [...]
nondeterministically selects a pair of write followed by read instructions. It assigns the value written
to A to $j$ and the value read from B to $y$. Similarly, processor 2 updates $i$ and $x$. The dashed
edges in the figure show when $x$, $y$, $i$, $j$ are updated. The memory rule safety property
corresponding to [...] in their final states $\Rightarrow$ [a condition over $x$, $j$, $y$, $i$
mirroring PO Cross; comparison operators lost in this copy].

[Derivation garbled in this copy: a violation of PO Cross for some $i$, $j$ is shown to imply a
violation over thresholded values of $X_i$, $j$, $Y_j$, $i$.]

Similar to the case of Test_WA, if $\exists i, j : X[i] < j \wedge Y[j] < i$, then we can get a case in
the test automata where $x = 0 \wedge j = 1 \wedge y = 0 \wedge i = 1$. Hence the memory rule safety
property corresponding to PO Cross will be violated in the test automata if and only if PO Cross [...]

[Figure body not recoverable.]
Figure: Test automata for A(CMP, PO)

Event           Action or condition
R_id(a, d)      if Mem[a] = d
W_id(a, d)      Mem[a] := d
Figure: Serial memory transaction rules

To demonstrate the effectiveness of our approach, we verified three different memory systems, namely
serial memory, lazy caching, and a simplified version of the Runway bus, all using VIS [Ver]. These
three memory systems are described in some detail below, along with some of the subtle bugs that
we could detect using test model-checking. Details of all our experiments can be obtained from the
Web [Mok] or by contacting the authors.
How do we check for sequential consistency?
A sequentially consistent memory system [Lam] requires that there be a single self-consistent trace
[...] $(a, d)$ for processor $i$ is according to program order for $P_i$. As suggested in [Col],
we can show that sequential consistency is A(CMP, PO, WA).
As [Col] does not list a single compound test to check for A(CMP, PO, WA), we can use
the following two tests that are available: Test_WA, which tests for A(CMP, RO, WO, WA), and
Test_PO, which tests for A(CMP, PO). This combination is exactly equivalent to testing sequential
consistency, because PO implies RO and WO as formally defined in [Col]. For every memory
system we consider, these two tests are model-checked separately, and the results are summarized in a figure.
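As an illustration of the program-order test on the simplest memory system, the sketch below enumerates every interleaving of a length-k instance on a serial memory and checks one crossing condition: no pair (i, j) with X[i] < j and Y[j] < i. That reading of the condition is an assumption (only part of PO Cross is recoverable here), and the `reorder` flag, which issues each read before the preceding write, is a hypothetical way to inject a program-order bug.

```python
from itertools import combinations


def programs(k, reorder=False):
    """P1: A := n; Y[n] := B.  P2: B := n; X[n] := A, for n = 1..k."""
    p1, p2 = [], []
    for n in range(1, k + 1):
        pair1 = [("write", "A", n), ("read", "B", ("Y", n))]
        pair2 = [("write", "B", n), ("read", "A", ("X", n))]
        if reorder:  # injected bug: the read overtakes the preceding write
            pair1.reverse()
            pair2.reverse()
        p1 += pair1
        p2 += pair2
    return p1, p2


def run(schedule, p1, p2):
    """Execute one interleaving on a serial memory (reads see the last write)."""
    mem, out = {"A": 0, "B": 0}, {}
    streams = {1: iter(p1), 2: iter(p2)}
    for pid in schedule:
        kind, loc, arg = next(streams[pid])
        if kind == "write":
            mem[loc] = arg
        else:
            out[arg] = mem[loc]
    return out


def schedules(n1, n2):
    """All interleavings of n1 steps of process 1 with n2 steps of process 2."""
    for slots in combinations(range(n1 + n2), n1):
        chosen = set(slots)
        yield [1 if t in chosen else 2 for t in range(n1 + n2)]


def cross_holds(out, k):
    return not any(out[("X", i)] < j and out[("Y", j)] < i
                   for i in range(1, k + 1) for j in range(1, k + 1))


def all_schedules_pass(k, reorder=False):
    p1, p2 = programs(k, reorder)
    return all(cross_holds(run(s, p1, p2), k)
               for s in schedules(len(p1), len(p2)))
```

For small k the serial memory passes on every schedule, while the reordered variant fails on a schedule in which both processes read before either writes.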
 





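[Figure body not recoverable: the transition table of the lazy caching protocol, with events
$R_i$, $W_i$, $MW_i$, $MR_i$, $CU_i$, $CI_i$ acting on Mem, the caches $C_i$, and the queues $Out_i$ and $In_i$.]
Fairness: no action other than $CI_i$ can be always enabled but never taken.
W = write, MW = memory write, CU = cache update,
R = read, MR = memory read, CI = cache invalidate
Figure: Gerth's version of the lazy caching algorithm (from a figure of [Ger])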
Serial memory and Lazy caching
The serial memory protocol for n processors and a memory is shown in the serial memory figure. Serial
memories are often used to define SC operationally. The lazy caching protocol [Ger], shown in the lazy
caching figure, also implements sequential consistency and is geared towards a bus-based architecture.
The memory interface still consists of reads and writes; however, caches $C_i$ are interposed between
the shared memory Mem and the processors $P_i$. Each cache $C_i$ contains a part of the memory Mem and
has two queues associated with it: an out-queue $Out_i$ in which write requests are buffered, and an
in-queue $In_i$ in which the pending cache updates are stored. These queues model the asynchronous
behavior of write events in a sequentially consistent memory. A write event $W_i(a, d)$ doesn't have
an immediate effect. Instead, a request $(d, a)$ is placed in $Out_i$. When the write request is taken
out of the queue by an internal memory-write event $MW_i(a, d)$, the memory is updated and a cache
update request $(d, a)$ is placed in every in-queue. This cache update is eventually removed by an
internal cache update event $CU_j(a, d)$, as a result of which the cache $C_j$ gets updated. Cache
evictions are modeled by internal cache invalidate events: $CI_i$ can arbitrarily remove locations from
cache $C_i$. Caches are filled both as the delayed result of write events and through internal
memory-read events $MR_i(a, d)$. The latter events model the effect of a cache miss; in that case the
read event stalls until the location is copied from the memory. A read event $R_i(a, d)$ predictably
stalls until a copy of location $a$ is present in $C_i$, but also until the copy contains a correct
value in the following sense: SC demands that a processor $P_i$ reads the value at a location $a$ that
was most recently written by $P_i$, unless some other processor updated $a$ in the meantime. Hence a
read event $R_i(a, d)$ cannot occur unless all pending writes in $Out_i$ are processed, as well as the
cache update requests from $In_i$ that correspond to writes of $P_i$. For this reason, such cache
update requests are marked with a star (*).
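The event rules described above can be sketched as a small Python class. This is an illustrative model, not the Verilog model used in the experiments. Three points are assumptions: memory starts at 0 for every location, a memory read enqueues an unstarred update in the in-queue (the text only says that memory reads fill the cache), and a stalled read is represented by returning None.

```python
from collections import deque


class LazyCaching:
    def __init__(self, n):
        self.mem = {}                                # shared memory Mem (default 0, assumed)
        self.cache = [dict() for _ in range(n)]      # caches C_i
        self.out = [deque() for _ in range(n)]       # out-queues Out_i
        self.inq = [deque() for _ in range(n)]       # in-queues In_i

    def write(self, i, a, d):
        """W_i(a, d): buffer the write; no immediate effect on Mem or caches."""
        self.out[i].append((d, a))

    def memory_write(self, i):
        """MW_i: move the oldest buffered write of P_i to memory and
        enqueue a cache update everywhere, starred in P_i's own in-queue."""
        d, a = self.out[i].popleft()
        self.mem[a] = d
        for j, queue in enumerate(self.inq):
            queue.append((d, a, j == i))

    def cache_update(self, j):
        """CU_j: apply the oldest pending update to cache C_j."""
        d, a, _star = self.inq[j].popleft()
        self.cache[j][a] = d

    def memory_read(self, i, a):
        """MR_i: request a copy of Mem[a] (assumed to travel through In_i)."""
        self.inq[i].append((self.mem.get(a, 0), a, False))

    def cache_invalidate(self, i, a):
        """CI_i: evict location a from cache C_i."""
        self.cache[i].pop(a, None)

    def can_read(self, i, a):
        """R_i is enabled only if a is cached, Out_i is empty, and In_i
        holds no starred entries (updates stemming from P_i's own writes)."""
        own_pending = any(star for _d, _a, star in self.inq[i])
        return a in self.cache[i] and not self.out[i] and not own_pending

    def read(self, i, a):
        """R_i(a, d): return d, or None when the read must stall."""
        return self.cache[i][a] if self.can_read(i, a) else None
```

A short usage example: after `m = LazyCaching(2)` and `m.write(0, "a", 1)`, the call `m.read(0, "a")` stalls until both `m.memory_write(0)` and `m.cache_update(0)` have happened, after which it returns 1.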
The accompanying figure shows the structure of the Verilog model we created for the memory model
verification.

[Block diagram not recoverable; surviving component labels: CCC1, DR1, CCC2, DR2.]
Figure: Simplified View of Runway/PA Memory System
Transaction    Generated by    State            ccr
*              self            *                coh_ok
*              other           invalid          coh_ok
rsp            other           private-clean    coh_shared
rsp            other           shared           coh_shared
rp             other           shared           coh_ok
rp             other           private-clean    coh_ok
*              other           dirty            coh_copyout
Figure: ccr generated when a transaction gets to the head of the CCC queue
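The table can be read as a first-match lookup, sketched below. The "*" wildcard entries and the underscore spellings of the state and response names are assumptions about cells that are blank or run together in this copy; the function name `ccr` is illustrative.

```python
# Rows are (transaction, generated_by, state, ccr); "*" matches anything (assumed).
CCR_RULES = [
    ("*", "self", "*", "coh_ok"),
    ("*", "other", "invalid", "coh_ok"),
    ("rsp", "other", "private_clean", "coh_shared"),
    ("rsp", "other", "shared", "coh_shared"),
    ("rp", "other", "shared", "coh_ok"),
    ("rp", "other", "private_clean", "coh_ok"),
    ("*", "other", "dirty", "coh_copyout"),
]


def ccr(transaction, generated_by, state):
    """Return the coherency response for a transaction at the head of the CCC queue."""
    for rule_txn, rule_by, rule_state, response in CCR_RULES:
        if (rule_txn in ("*", transaction)
                and rule_by == generated_by
                and rule_state in ("*", state)):
            return response
    raise ValueError(f"no rule for {(transaction, generated_by, state)}")
```

For instance, `ccr("rsp", "other", "shared")` yields `coh_shared`, and any transaction seen by a client holding the line dirty yields `coh_copyout`.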
[...] the data in the data return queue until the ccr is sent out.
Delay in ccr generation
If a client has a ccw transaction for a line yet to go on Runway, then it delays generating any more
ccrs for that line. To see why this is necessary, consider the following. Suppose a client $C_1$ has a
dirty line. Client $C_2$ requests this line by issuing an rsp transaction on the bus. $C_1$ will
generate coh_copyout in response to $C_2$'s request, invalidate its own line, and create a ccw
transaction for $C_2$. Note that the most recent data for this line is with $C_1$ and not HOST. Now a
client $C_3$ requests the same line by issuing rsp. $C_2$ and $C_3$ generate, respectively, coh_shared
and coh_ok ccrs in response to $C_3$'s request. $C_1$'s ccr will be coh_ok in response to $C_3$'s
request. If $C_1$ sends coh_ok to HOST before its ccw goes on the bus, then HOST can provide stale data
to $C_3$ by its hdr transaction. To avoid this, $C_1$ delays generating the ccr until the ccw goes on
the bus.
Arbitration
Runway follows a complex pipelined arbitration algorithm to determine the bus master. Here we
present only an approximation of the algorithm. Every bus user (client or HOST) must become the
bus master before it can drive the bus. Bus mastership at cycle N is acquired by initiating
arbitration in an earlier cycle, driving a request on dedicated arbitration lines (not shown in the
figure). Before cycle N, every potential bus user evaluates the others' drives and, in conjunction
with round-robin pointers for arbitration priorities, determines who wins bus mastership for cycle
N. Those who do not win bus mastership keep off the bus. Bus arbitration proceeds in a pipelined
manner, concurrently with transaction processing.
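
The role of a round-robin pointer in picking a winner can be illustrated as follows. This is only a sketch of generic round-robin arbitration, not the Runway algorithm: the function `arbitrate`, the use of a single pointer, and the rule that the winner becomes the lowest-priority user afterwards are all assumptions, and the pipelining across cycles is not modeled.

```python
def arbitrate(requests, pointer):
    """Pick the bus master among requesters using a round-robin pointer.

    requests: list of booleans; requests[i] is True if user i drove a request.
    pointer:  index of the user with the highest priority in this round.
    Returns (winner, next_pointer); winner is None if nobody requested.
    """
    n = len(requests)
    for offset in range(n):
        candidate = (pointer + offset) % n
        if requests[candidate]:
            # The winner gets the lowest priority in the next round.
            return candidate, (candidate + 1) % n
    return None, pointer
```

Because every bus user sees the same request lines and holds the same pointer value, each one can evaluate this function locally and reach the same decision, which is what lets the losers keep off the bus without a central arbiter.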

PA Runway interface
In addition to the Runway specifics described above, the PA Runway interface (PARI) also adheres
to the following constraints in order to ensure Program Order and Write Atomicity. PARI allows a
client to initiate Runway transactions for various cache misses; it is possible that these
transactions complete out of order. However, all instructions strictly complete in program order.
PARI guarantees that the client will stall the coherency response for any cache line for which it
has an outstanding miss, i.e., it has initiated a Runway transaction and has assumed ownership but
is still waiting for the data. The coherency response will be generated only after the client has
received the data and has used it to make forward progress by at least one instruction. PARI
guarantees that if a client receives data for its Runway transaction before it has assumed
ownership, then it will not modify or use the data until it processes its own transaction and thus
assumes ownership. PARI guarantees that if a client has a ccw transaction, then it gets the highest
priority to go on the Runway.
 The RunwayPA			 in VIS Verilog
We constructed a Verilog model of the RunwayPA system Utah Runway Model URM and




to verify that its memory model is sequentially consistent.
The complexity of the system stems from a number of sources: (a) multiple outstanding transactions
for each processor; (b) out-of-order completion of the Runway transactions but in-order completion
of instructions; (c) eager assumption of ownership without receiving the corresponding data;
(d) "equivalent" states introduced by decoupled execution due to coherency queues; (e) speculative
execution features of the processor to ensure performance in spite of in-order completion of the
instructions; (f) an involved distributed pipelined arbitration algorithm. We did not try to model
each of these features in their full glory, but we did include a modicum of these aggressive
features in our URM, which still occupies a substantial number of lines of VIS Verilog code (see
[Mok]). For instance, all essential features of (a), (b), (c), and (e) are included; (f) is
abstracted by using nondeterminism; (d) is abstracted as explained below.
Abstraction of Queues: Additional abstraction effort was necessary to make our URM digestible
by VIS. This essentially consists of getting rid of the CCC, CCR, and DR queues, which are the main
cause of state explosion, while retaining the HDR queue in the HOST and the CCW queues in the HOST
and clients.
In Runway, most of the conflicts are detected and resolved by the HOST. There is one situation
where a client detects a conflict: the client has a pending ccw transaction. The client resolves
this by delaying its coherency response; the net result of this delay is that the HOST would not
generate hdr transactions until the ccw goes on the Runway. Since we abstracted away the CCR queues
in our URM, the clients send the coherency response for a coherent transaction immediately after
its occurrence on the bus. Hence, in our URM, the clients cannot resolve conflicts by delaying the
coherency response; instead, the HOST computes whether the coherency response needed to be delayed
and, if so, delays the hdrs appropriately. This is achieved as follows. A counter is associated with
each HDR queue entry. If the counter is nonzero, then the entry is waiting for some ccw transactions
for that line from the clients; hence the hdr needs to be delayed. After all the pending ccw
transactions for that line go on the bus, the counter becomes zero, and hence the hdr transaction
can go on the bus. In our URM we used a two-bit counter, which allows up to four processors.
In Runway all clients save the data returns hdr and ccw transactions in DR queue until
the corresponding request appears at the head of its CCC queue This is necessary to enforce in
order completion of instructions We abstract away the CCC queues and the data return queues by
associating a onebit information with each cache line in each client This bit is set for an address

ACMP
PO	  states  bdd nodes conditions veried runtime mnsec	
serial memory   Vacuity 
PO Cond 
lazy caching e  Vacuity 
PO Cond 





WA	  states  bdd nodes conditions veried runtime mnsec	
serial memory   Vacuity 
Cond  Cond 
lazy caching e  Vacuity 
Cond  Cond 
URM   Vacuity 
Cond  Cond h
Figure  Verication results using VIS on a SPARC ULTRA with  MB Memory
a whenever a data return happens for a but a preceding instruction is not yet completed. After all




The tables in gure  show execution time for modelchecking our Serial memory Lazy caching and
URM models for tests of ACMP PO and ACMPROWOWA recall that ACMP PO WA





checked for the following conditions (the figure does not show some of these states).
Test_WA




































































































As can be seen, all these conditions are safety properties and are independent of the model itself,
which is a distinct advantage over other methods.
The size of the state space and the number of nodes in the BDDs are also reported. Note that lazy
caching has more states than Runway due to the queues present in the model. However, the complexity
of the Runway protocol is much higher, which results in a larger BDD size and a higher run time.
Nevertheless, in all our experiments, whenever there was any memory ordering rule violation in our
model, test model-checking detected it quickly, in the order of minutes. A very desirable feature
one can provide in a tool based on test model-checking is a menu of previously generated test
automata for the various compound rules in [Col], using which designers can probe their model.
Our Verilog models capture quite faithfully the cache coherence protocol and the ordering rules
of the three memory systems





have a high condence that the memory model provided by Lazy caching and RunwayPA is
sequentially consistent The verication of serial memory was straightforward

Description of a Bug found in preliminary model of lazy caching: The following bug in
our model of lazy caching was caught by a violation of PO Cross in Test_PO. The bug was in the
queues used by lazy caching, which were implemented as shift registers. We forgot to shift the bit
in In_i when the processor P_i receives a cache update from the In_i queue. With this bug, it is
possible that the In_i queue is not marked when it should be, and consequently reads in P_i may
bypass writes. This results in a violation of PO. This is a difficult bug to catch because its
detection involves understanding the complex feedback among all components of the protocol (queues,
memory, and caches). Moreover, this bug is interesting because it violates PO but does not violate
WA. This is so because only the write-read (WR) order is affected by this bug. Our technique
effectively caught this bug





note that it doesnt involve PO passes This shows the futility of ad hoc testing methods
one could apply subjective criteria to consider a test similar to Test
WA
to be suciently incisive
when in fact it fails to account for a crucial ordering relation such as PO
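
The kind of shift-register slip described for the lazy caching queues can be reproduced in a few lines. This is a minimal sketch and not the authors' Verilog: the class `ShiftQueue`, the `shift_marks` switch, and the representation of the per-entry bit as a list of booleans are all hypothetical, chosen only to show how shifting the data without shifting the accompanying bits leaves an entry looking unmarked.

```python
class ShiftQueue:
    """Fixed-depth queue kept in two parallel shift registers: data and mark bits."""

    def __init__(self, depth, shift_marks=True):
        self.data = [None] * depth
        self.marks = [False] * depth
        self.count = 0
        self.shift_marks = shift_marks   # False reproduces the forgotten shift

    def push(self, value, marked):
        if self.count == len(self.data):
            raise OverflowError("queue full")
        self.data[self.count] = value
        self.marks[self.count] = marked
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        head = self.data[0]
        self.data = self.data[1:] + [None]          # data always shifts
        if self.shift_marks:
            self.marks = self.marks[1:] + [False]   # bits must shift with it
        self.count -= 1
        return head

    def has_marked_entry(self):
        """True if any entry still in the queue carries its mark bit."""
        return any(self.marks[: self.count])
```

With one unmarked entry followed by one marked entry, popping the head leaves the marked entry at the front. In the correct queue `has_marked_entry()` is still true; with `shift_marks=False` the bit stays behind in the old slot, the remaining entry appears unmarked, and a processor that consults this check would let a read proceed too early.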
Description of a Bug found in preliminary URM: Similarly, another corner-case bug was
caught by test model-checking in our URM, through a violation of the PO Cross condition using
Test_PO. This bug generated a long counterexample trace due to the depth of the sequential logic of
the model. The trace revealed the following situation:
1. client_i has removed its own read transaction from the bus, then
2. client_i sends coh_ok in response to a subsequent coherent transaction for the same line before
   getting the data for its transaction by hdr or ccw.
This problem was fixed using the counter in the HOST's HDR entries to record the pending ccws,
and the one-bit information in the clients' cache lines to record whether the data is supplied, as
explained in the paragraph on abstraction of queues. After fixing the bug, the PO condition passed.
Conclusion and Future Plans
We presented a new approach to verifying multiprocessors against formal memory models, which
combines two existing powerful techniques: model-checking and the testing method of Archtest. From
our results, we conclude that test model-checking can be of great value in detecting bugs during
the early stages of the design cycle of modern microprocessors whose memory subsystems are complex.
Our results on our URM of the HP PA/Runway bus attest to this.
So far we have identified the rules and corresponding tests for sequential consistency. We are
currently working on identifying similar rules and tests for other well-known formal memory models,
such as TSO, PSO, and RMO [AG], that are described in the SPARC architecture manual [WG].
This work may involve defining new rules as well as new tests corresponding to them.
We are currently working to formulate some reasonable assumptions about the memory system
model under which the tests administered by our test automata can be rendered complete. Also,
for a limited class of models, model-checking the test for some small value of k might actually be
sufficient. Our initial attempts in this direction are encouraging.
Acknowledgments: We would like to thank Dr. Collier for his help in explaining his work, his
very informative emails, and for providing Archtest. We would like to thank Dr. Narendran for many
fruitful discussions. We would like to thank Dr. Al Davis and his Avalanche team for offering us
the unique opportunity to work on state-of-the-art processors and busses.

References
[AG] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial.
    Computer, December.
[BCS] William R. Bryg, Kenneth K. Chan, and Nicholas S. Fiduccia. A high-performance, low-cost
    multiprocessor bus for workstations and midrange servers. Hewlett-Packard Journal, February.
[Cam] Albert Camilleri. A hybrid approach to verifying liveness in a symmetric multiprocessor. In
    Theorem Proving in Higher Order Logics, International Conference (TPHOLs), Murray Hill, NJ,
    August. Springer-Verlag LNCS.
[CC] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis
    of programs by construction or approximation of fixpoints. In Proceedings of POPL, Los Angeles,
    CA. ACM Press.
[CES] E. M. Clarke, E. A. Emerson, and A. P. Sistla. Automatic verification of finite-state
    concurrent systems using temporal logic specifications. ACM TOPLAS.
[Col] W. W. Collier. Multiprocessor diagnostics. httpwwwinfomallorgdiagnosticsarchtesthtml
[Col] W. W. Collier. Reasoning About Parallel Architectures. Prentice-Hall, Englewood Cliffs, NJ.
[Cor] Francisco Corella. Invited talk at Computer Hardware Description Languages, Toledo, Spain,
    on Verifying I/O Systems, April.
[DPN] David L. Dill, Seungjoon Park, and Andreas Nowatzyk. Formal specification of abstract memory
    models. In Gaetano Borriello and Carl Ebeling, editors, Research on Integrated Systems. MIT
    Press.
[Ger] Rob Gerth. Introduction to sequential consistency and the lazy caching

    G. Gopalakrishnan, R. Ghughal, R. Hosabettu, A. Mokkedem, and R. Nalumasu. Formal modeling and
    validation applied to a commercial coherent bus: A case study. In Hon F. Li and David K.
    Probst, editors, CHARME, Montreal, Canada.
[GJS] James Gosling, Bill Joy, and Guy Steele. The Java(TM) Language Specification. Sun
    Microsystems, August. Appeared also as a book with the same title in Addison-Wesley's The Java
    Series.
[GK] Phillip B. Gibbons and Ephraim Korach. On testing cache-coherent shared memories. In
    Proceedings of the Annual Symposium on Parallel Algorithms and Architectures, New York, NY,
    USA, June. ACM Press.
[GK] Phillip B. Gibbons and Ephraim Korach. Testing shared memories. SIAM Journal on Computing,
    August.
[Gra] S. Graf. Verification of a distributed cache memory by using abstractions. Lecture Notes in
    Computer Science.
[HB] R. Hojati and R. Brayton. Automatic datapath abstraction of hardware systems. In Conference
    on Computer Aided Verification.
[HMTLB] R. Hojati, R. Mueller-Thuns, P. Loewenstein, and R. Brayton. Automatic verification of
    memory systems which service their requests out of order. In CHDL.
[Kan] Gerry Kane. PA-RISC Architecture. Prentice Hall.
[Lam] Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess
    programs. IEEE Transactions on Computers.
[Lam] Leslie Lamport. How to make a correct multiprocess program execute correctly on a
    multiprocessor. Technical report, Digital Equipment Corporation, Systems Research Center,
    February.
[Lam] Leslie Lamport. The temporal logic of actions. ACM Transactions on Programming Languages
    and Systems, May. Also appeared as an SRC Research Report.
[LLOR] P. Ladkin, L. Lamport, B. Olivier, and D. Roegel. Lazy caching in TLA. Distributed
    Computing.
[McM] Kenneth L. McMillan. Symbolic Model Checking. Kluwer Academic Press.
[Mok] A. Mokkedem. Verification of three memory systems using test model-checking.
    httpwwwcsutahedu mokkedemvisvishtml
[PD] Seungjoon Park and David L. Dill. Verification of FLASH cache coherence protocol by
    aggregation of distributed transactions. In SPAA, Padua, Italy, June.
[Ver] VIS release. httpwwwcadeecsberkeleyeduRespepResearchvisindexhtml
[WG] David L. Weaver and Tom Germond. The SPARC Architecture Manual. PTR Prentice-Hall,
    Englewood Cliffs, NJ, USA.

