In this paper we put forward a design for a multicomputer system based on a network of workstations, which we call COMA-BC. It has a common address space in which a shared-variables programming model can be used. The shared address space is managed in a similar way to that of existing multiprocessor COMA systems. To be exact, the shared address space is divided into blocks, and their copies reside in the attraction memories of the workstations.
Introduction
COMA-BC [20] is a multicomputer system based on a set of workstations connected by a standard shared-medium network. This network has only one physical line, which is used to send and receive all the information exchanged in the system. COMA-BC is a shared memory system with physically distributed memory modules. The different processors use the same address space, and a fraction of this space is common to all of them [16].
Each workstation in COMA-BC is a standard UNIX-type machine. The idea of a multicomputer system based on a network of workstations is not original [17, 2], but COMA-BC has a new feature: the shared memory is managed using the same techniques used in COMA multiprocessors [11].
The design target of COMA-BC is to minimize the communication time in parallel workloads. With a standard local area network, the communication time depends on two factors: the number of messages exchanged and their size. In the COMA-BC design, the main concepts underlying COMA multiprocessors have been used to enable communication with a small number of small messages.
The shared memory space in COMA-BC is divided into blocks. Each workstation (system node) can have access to a copy of each block. These copies are sent through the LAN to the nodes that need the information. The blocks have no home nodes; that is, there is no node where the information of each block resides permanently. Each node has a copy of the block, but it is just a copy. Thus, the local memory of each node can be considered a cache of the shared address space; this local memory is called attraction memory (AM), using the usual terminology of multiprocessor COMA systems. The memory blocks stored in the attraction memory are cache copies of the blocks of the shared address space, which is why we call them "cache blocks". As in a COMA multiprocessor, the only physical location of the shared address space is the set of attraction memories, that is, the set of cache memories of the system.
When a processor tries to use a specific location of the shared space there are two possibilities: either a copy is stored in the local attraction memory or it is not. If a copy is not available, a cache miss happens. That miss is always related to a read or write memory operation.
One cache block can have multiple copies in different nodes; that is why one of the key elements in the development of COMA-BC is its coherence protocol, which is responsible for maintaining the coherence of the information stored in the different cache copies of each block of the shared space. This coherence protocol is a cache coherence protocol. The COMA-BC protocol is invalidation-based and uses a hybrid snoopy and directory scheme. The reason for using this technique is to best exploit the characteristics of the shared-medium local area network in two ways: 1) each node can see every block movement through the network, 2) each node can send a message that is received simultaneously by all the nodes. The coherence directories are used to eliminate race conditions. A detailed description of the directory system is given later in this paper.
A COMA-BC system can be built in two different ways: 1) with standard workstations, in which the management of access to the shared address space is carried out by software running in the local workstation, 2) with modified workstations using a specifically designed memory controller that manages every access to the shared space and every action related to the coherence protocol.
In order to check the correctness of COMA-BC, simulation and verification studies have been carried out on three levels: 1) Verification of the coherence protocol for systems with 2 and 3 nodes. 2) Simulation of the system with Coloured Petri Nets and with other models based on finite state machines, up to 100 nodes. 3) Simulation based on a synthetic workload using Ptolemy.
A multiprocessor simulation system has been specifically developed for measuring the capacity of a COMA-BC system. This simulation system is program-driven; a significant number of standard parallel applications have been executed on it. The applications belong to the SPLASH-2 parallel benchmark. The real execution of these applications has yielded a series of useful indexes, such as the number of accesses to the shared address space, the number of access misses, the number of network messages needed and the size of those messages. In addition, a simple analytic model has been developed to provide temporal indexes (speedups). The analytic model takes as inputs the results of the multiprocessor simulation, together with the processing speeds of the nodes and the physical characteristics of the local area network used to build COMA-BC.
The results obtained show that COMA-BC can be used as a platform for exploiting parallelism with a standard LAN interconnecting workstations.
This article is organized as follows. Firstly, a detailed description of COMA-BC is given. Secondly, due to its importance, the functioning of the COMA-BC coherence protocol is specified. Thirdly, the simulation and verification procedure used to check the correctness of COMA-BC is explained. Fourthly, the multiprocessor simulation environment is described. After that, the analytic model is studied, which allows us to obtain performance results in terms of speedup for the various standard parallel applications executed in the simulator. Finally, the experimental results obtained are explained and discussed, and the main conclusions of this work are summarized.
2 Description of a COMA-BC System
The architecture of a COMA-BC system appears in figure 1. COMA-BC systems are made up of a set of nodes connected by a common-medium interconnection network. The nodes are workstations. The network is a standard LAN. The coherence messages are sent and received using the network. Each node has a processor that uses the information contained in its local memory. The local memory of each node is divided in two parts: exclusive memory and attraction memory. Exclusive memory contains the text of the local processor programs and the data that is used exclusively by the local processor. Attraction memory plays the role of a cache memory for the shared address space; the data needed by several nodes is stored in this address space. The exclusive memory works in the same way as the main memory of an ordinary computer; that is the reason why we focus on the description of the attraction memory.
As mentioned above, attraction memories are the physical support, and the only physical support, of the shared address space of the different nodes. Each attraction memory works as a cache of that space, but there is no other memory component in the system, apart from the attraction memories, that acts as its physical support. Accordingly, the shared address space must be managed using the same rules as those used in COMA multiprocessors. In this kind of system, information does not reside permanently in the same host node; instead, it moves through the system, residing in the nodes that most frequently access it. All the nodes have the same characteristics for accessing and managing the information.
The shared address space is divided into cache blocks. Each attraction memory is divided into a set of frames, each of which can store one cache block. The relationship between blocks and frames is established by a direct mapping. In COMA-BC each attraction memory can store one copy of the whole shared address space. This is due to an initial design decision that tends to minimize the number of necessary operations in the network, thus making a replacement mechanism unnecessary. The disadvantage of this decision is that the attraction memories do not make up pieces of memory that can be joined together to form a larger common space.
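The direct mapping between blocks and frames can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the names `block_of` and `frame_of`, the base address and the number of blocks are hypothetical, and the 256-byte block size is the value used later in the experiments.

```python
# Hypothetical sketch of direct mapping between shared-space blocks and
# attraction-memory frames. All names and constants are illustrative.

BLOCK_SIZE = 256            # bytes per cache block (value used in the experiments)
SHARED_BASE = 0x10000000    # assumed start of the shared address space
NUM_BLOCKS = 1024           # assumed number of blocks in the shared space
NUM_FRAMES = NUM_BLOCKS     # each attraction memory holds the whole shared space


def block_of(address: int) -> int:
    """Return the index of the shared-space block that contains `address`."""
    offset = address - SHARED_BASE
    if not 0 <= offset < NUM_BLOCKS * BLOCK_SIZE:
        raise ValueError("address outside the shared address space")
    return offset // BLOCK_SIZE


def frame_of(block: int) -> int:
    """Direct mapping: with as many frames as blocks, every block has its own
    frame, so no two blocks compete and nothing ever has to be replaced."""
    return block % NUM_FRAMES


if __name__ == "__main__":
    address = SHARED_BASE + 3 * BLOCK_SIZE + 17
    print(block_of(address), frame_of(block_of(address)))
```

Because `NUM_FRAMES` equals `NUM_BLOCKS` in this sketch, the mapping is one-to-one, which mirrors the design decision that makes a replacement mechanism unnecessary.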
COMA-BC Protocol
The COMA-BC cache coherence protocol has been especially designed to manage cache coherence in the COMA-BC system while taking into account an important design restriction: the interconnection support is a standard shared-medium local area network with one physical line interconnecting the computers. The best example of this kind of network is an Ethernet network. The COMA-BC protocol is built on the basis of assigning a state to each copy of a cache block (from now on simply a "copy"). This state is used to indicate that the information contained in the copy is valid, and can be read directly by the processor, or that the information contained in the copy is not valid because a copy of it has been written on another node by the corresponding processor. If the information is not valid and the processor tries to read it, then the coherence controller detects a cache miss and must request a valid copy. State information is also used by the coherence controller to indicate whether a processor can write on a copy without coherence violation. The COMA-BC protocol uses five states for each copy. These states are: 1) Invalid, 2) Clean, 3) Dirty, 4) Invalid awaiting RRI (briefly IARRI), and 5) Invalid awaiting RRB (briefly IARRB).
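As a rough illustration of how a per-copy state can drive miss detection, the following Python sketch models the five states as an enumeration. The identifiers are hypothetical, the two transient "invalid awaiting" states are given neutral names, and the rule that only Clean and Dirty copies are readable is an assumption drawn from the description above rather than a statement of the full protocol.

```python
from enum import Enum, auto


class CopyState(Enum):
    """State of one local copy of a cache block (names are illustrative)."""
    INVALID = auto()
    CLEAN = auto()
    DIRTY = auto()
    INVALID_AWAITING_A = auto()  # first transient "invalid awaiting" state
    INVALID_AWAITING_B = auto()  # second transient "invalid awaiting" state


# Assumption: a copy holds valid, directly readable data only when it is
# Clean or Dirty; the three invalid states force a miss on a read.
READABLE_STATES = frozenset({CopyState.CLEAN, CopyState.DIRTY})


def is_read_miss(state: CopyState) -> bool:
    """Return True when a processor read cannot be served from the copy."""
    return state not in READABLE_STATES


if __name__ == "__main__":
    for state in CopyState:
        print(state.name, "miss" if is_read_miss(state) else "hit")
```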
The COMA-BC protocol is invalidation-based. This means that when a node writes a copy, its coherence controller sends a message through the network in such a way that every other node in the system receives the message and invalidates its own copy. This behavior is especially well adapted to shared-medium networks, which are able to broadcast information with just one message. In addition, every coherence controller receives all messages sent through the network by every node; this means that it receives all of the coherence information for every copy of a cache block and uses it to keep state and directory information updated. This is usual in snoopy bus coherence protocols.
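The invalidation-by-broadcast idea can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the classes `Node` and `SharedMedium` are hypothetical, each copy is reduced to a valid/invalid flag, and a broadcast is modelled as one counted message that every attached node snoops.

```python
class Node:
    """Toy node that tracks only whether each local copy is valid."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.valid = {}  # block index -> True/False

    def snoop_invalidate(self, writer_id, block):
        # Every node sees every message; all nodes except the writer
        # invalidate their own copy of the written block.
        if writer_id != self.node_id:
            self.valid[block] = False


class SharedMedium:
    """Toy shared medium: one transmission reaches all attached nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.messages = 0

    def broadcast_invalidate(self, writer_id, block):
        self.messages += 1  # a single message on the shared line
        for node in self.nodes:
            node.snoop_invalidate(writer_id, block)


if __name__ == "__main__":
    nodes = [Node(k) for k in range(3)]
    for node in nodes:
        node.valid[0] = True          # everybody starts with a valid copy
    lan = SharedMedium(nodes)
    lan.broadcast_invalidate(writer_id=0, block=0)
    print([node.valid[0] for node in nodes], lan.messages)
```

The point of the sketch is the message count: invalidating any number of remote copies costs one access to the medium.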
Besides the state information of each copy, the COMA-BC protocol uses the concept of the owner node of a block. The only processor that can write over a copy is the owner node processor. Each coherence controller manages a directory containing the owner nodes of every block of the shared space. The ownership concept is dynamic, changing during the execution of the programs to those nodes that need to write on each block. When a non-owner node needs to write over its copy, it must first ask the owner in order to get the ownership of the block; to do this it uses directory information to locate the owner. When a node gains ownership, it can write over its copy. Accordingly, the directory managed by each coherence controller has as many entries as blocks in the shared space, and each entry stores just one number from 1 to n, where n is the number of nodes in the system. To this end, each node in a COMA-BC system has one assigned number that does not change and is used to identify the node in the coherence directories. A node will consider itself to be the owner of a block when its own node number appears in the corresponding directory entry. The protocol must guarantee that just one node is the owner of each block.
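A minimal Python sketch of the replicated ownership directory follows. It assumes, for illustration only, that node 1 initially owns every block and that an ownership transfer is observed by all controllers at once (as a snooped broadcast would allow); the class and function names are hypothetical and the request/reply messages of the real protocol are not modelled.

```python
class CoherenceController:
    """Toy controller holding its own copy of the ownership directory."""

    def __init__(self, node_id, num_blocks, initial_owner=1):
        self.node_id = node_id
        # One entry per block; each entry is a node number in 1..n.
        self.directory = [initial_owner] * num_blocks

    def is_owner(self, block):
        return self.directory[block] == self.node_id

    def observe_transfer(self, block, new_owner):
        # Called when an ownership change is seen on the network.
        self.directory[block] = new_owner


def transfer_ownership(controllers, block, new_owner):
    """All controllers snoop the transfer and update their directories."""
    for controller in controllers:
        controller.observe_transfer(block, new_owner)


def has_single_owner(controllers, block):
    """Invariant from the text: exactly one node owns each block."""
    return sum(c.is_owner(block) for c in controllers) == 1


if __name__ == "__main__":
    n, blocks = 4, 8
    controllers = [CoherenceController(k, blocks) for k in range(1, n + 1)]
    transfer_ownership(controllers, block=5, new_owner=3)
    print(controllers[2].is_owner(5), has_single_owner(controllers, 5))
```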
Validation of the COMA-BC Protocol
The study to validate the COMA-BC protocol has been carried out on three levels: 1) Different simulations have been done using Petri Nets and other models based on finite state machines. 2) The verification of the protocol for systems with 2 and 3 nodes has been done. 3) The simulation of the global operation of COMA-BC has been performed. We shall now comment on each of these three levels.
1) Simulations of the protocol using Petri Nets.
Firstly, the protocol has been formalized using Colored Petri Nets [13]. Using this formalization the initial design errors were easily detected and corrected. The development of this formalization was done using the De-
2) Verification of the protocol. The verification of the protocol has been carried out using the method of state expansion applied to a COMA-BC model based on a finite state machine. This study has been carried out using the software tool Murφ [7] version 3.0. Due to the state explosion problem inherent in this type of verification, it could only be performed for two and three nodes. For these, as verification is an exhaustive technique, the appearance of protocol errors has been completely ruled out. For systems larger than three nodes, the same software tool has been used with the same finite state machine model in order to obtain simulation results with up to 100 nodes. No errors were detected in the normal working of the protocol.
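To give a feel for what exhaustive state expansion means, the following Python sketch explores every reachable state of a deliberately tiny model and checks an invariant in each one. It is neither the Murφ model nor the COMA-BC protocol: the state (an owner plus one valid flag per node for a single block), the two transitions and the invariant are all simplifying assumptions chosen for illustration.

```python
from collections import deque


def successors(state, n):
    """Yield the states reachable in one step in the toy model."""
    owner, valid = state
    for i in range(n):
        # Read by node i: it ends up with a valid copy.
        after_read = list(valid)
        after_read[i] = True
        yield (owner, tuple(after_read))
        # Write by node i: it becomes owner and all other copies are invalidated.
        yield (i, tuple(j == i for j in range(n)))


def explore(n, invariant):
    """Breadth-first expansion of all reachable states.

    Returns (visited_states, first_violating_state_or_None).
    """
    initial = (0, tuple(j == 0 for j in range(n)))
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return seen, state
        for nxt in successors(state, n):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen, None


def owner_has_valid_copy(state):
    owner, valid = state
    return valid[owner]


if __name__ == "__main__":
    for n in (2, 3):
        states, violation = explore(n, owner_has_valid_copy)
        print(n, len(states), violation)
```

Even in this toy model the number of reachable states grows as n * 2**(n - 1), which hints at why exhaustive verification of a realistic protocol model is limited to very few nodes.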
3) Global simulation of the system. Finally, a global simulation of the system has been carried out using the simulation tool Ptolemy. In this third phase of the validation, the simulation was driven by synthetic workloads. Basically, using the simulation environment, the different elements of the COMA-BC system have been reproduced, including the processor, the coherence controller, the network interface and the interconnection network. On the basis of the experiments carried out, we have been able to rule out the existence of errors in the working of the system as a whole that are due to design faults in the COMA-BC protocol.
5 The COMA-BC Multiprocessor Simulation Environment
A prototype of a COMA-BC system has not been built. Thus, to get results concerning the performance of the proposed system it has been necessary to build a multiprocessor simulation environment which, together with an analytical model, allows standard parallel applications to be simulated and, finally, performance indexes of the parallel execution to be obtained in terms of efficiency and speedup. In this section we explain the multiprocessor simulation environment; in the following section we comment on the characteristics of the analytic model used.
The simulation environment developed is based on carrying out a working simulation of COMA-BC driven by the execution of standard parallel applications [6, 15, 3]. This simulation is performed using standard workstations connected via a 10 Mbps Ethernet network. Each workstation reproduces one COMA-BC node executing its part of the parallel application and generating all the memory references to the shared address space that a real COMA-BC system would generate. Each workstation executes two processes which implement the simulation and which we shall call the functional emulator and the architecture emulator, representing the application and the coherence controller respectively.
The architecture emulator is a process that reproduces the functions of a COMA-BC coherence controller. It receives requests for access to the shared address space, manages access misses and executes all the protocol actions needed to maintain the system coherence.
The functional emulator is a process that simulates the execution of the part of the parallel application that is executed in a particular COMA-BC node. When the functional emulator is executed, what actually runs on the workstation processor is the corresponding part of the parallel application, since the functional emulator is built by instrumenting every access to the shared address space [8]. To be exact, this instrumentation consists of detecting all accesses to memory addresses that are inside the shared space and replacing them with a function call that sends a message to the architecture emulator. Once the functional emulator has sent the request, it waits until it is resolved. This instrumentation is implemented by expanding the source code of the parallel application. To do this the tool Pegaxo [12] is used. The expanded source code can be assembled and linked on the destination workstation, and the resulting executable code is what makes up, in fact, the functional emulator.
The described simulation environment allows the execution of parallel applications developed in C with the programming style based on the use of the Argonne [18] PARMACS macros. Pegaxo expands the PARMACS macros and replaces each memory access by an invocation of a certain procedure. Pegaxo has been developed for HPPA and SunSPARC architectures. It is important for this expansion process that the processors used as the base for the simulation are RISC architectures, because this reduces the complexity of analyzing the source code, as the set of memory access instructions is very simple.
To manage the various parallel processes involved in the simulations (that is, the pairs of functional and architecture emulators in the different nodes) and to implement the communications between them, PVM has been used. PVM allows the sending and receiving of protocol events between the different architecture emulators (there is one architecture emulator on each workstation) and the sending and receiving operations between the functional and the architecture emulators on each workstation.
The main drawback of PVM in this simulation is that the messages it uses to access the network (that is, those that reproduce the protocol events between nodes) cannot be broadcast messages; they must have one particular destination node. Thus, it is impossible to send events with just one access to the shared medium of the network. To work around this problem, each architecture emulator simulates a broadcast by sending a PVM message with the protocol event to each one of the remaining architecture emulators and waiting for the confirmation from each one of them.
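The workaround can be pictured with a small Python sketch that counts point-to-point messages. It does not use PVM at all: the function `emulated_broadcast` and the in-memory inboxes are hypothetical stand-ins, and the only thing the sketch is meant to show is that one protocol event costs 2(n - 1) point-to-point messages in the simulator (one send and one confirmation per remaining node), whereas the modelled COMA-BC system would need a single access to the shared medium.

```python
def emulated_broadcast(sender, num_nodes, event):
    """Deliver `event` to every node except `sender` using unicast sends.

    Returns (inboxes, messages), where `messages` counts both the unicast
    sends and the confirmations that the sender waits for.
    """
    inboxes = {node: [] for node in range(num_nodes)}
    messages = 0
    for destination in range(num_nodes):
        if destination == sender:
            continue
        inboxes[destination].append((sender, event))  # unicast with the event
        messages += 1
        messages += 1  # confirmation sent back by the destination
    return inboxes, messages


if __name__ == "__main__":
    inboxes, messages = emulated_broadcast(sender=0, num_nodes=4, event="INVALIDATE")
    print(messages, inboxes[3])
```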
As explained before, the COMA-BC multiprocessor simulation allows us to obtain indexes of how parallel applications really behave in a COMA-BC system. These indexes are obtained from a series of statistical variables updated during the execution of the simulation; some of them are updated by the functional emulator and others by the architecture emulator. The information obtained from these statistical variables is finally reduced to a set of indexes as follows: 1) i: total number of executed instructions. 2) m: total number of executed read and write instructions. 3) s: total number of read and write instructions sent to the shared address space. 4) e: total number of events or messages exchanged over the network to manage access misses. 5) $l_m$: average size of the messages exchanged over the network, expressed in bytes per message. These indexes are obtained for each one of the workstations in the parallel simulation environment, and thus refer to how each parallel branch of the application has been executed.
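One possible way to accumulate these per-workstation indexes is sketched below in Python. The class `NodeStats` and its methods are hypothetical; the sketch assumes that every shared-space access is also counted as a memory instruction and that the average message size is the total number of bytes divided by the number of messages.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Per-workstation counters behind the indexes i, m, s, e and l_m."""
    i: int = 0            # executed instructions
    m: int = 0            # read/write instructions
    s: int = 0            # read/write instructions to the shared space
    e: int = 0            # events (messages) sent over the network
    total_bytes: int = 0  # bytes carried by those messages

    def record_instruction(self, is_memory=False, is_shared=False):
        self.i += 1
        if is_memory or is_shared:
            self.m += 1   # a shared access is also a memory access
        if is_shared:
            self.s += 1

    def record_message(self, size_in_bytes):
        self.e += 1
        self.total_bytes += size_in_bytes

    @property
    def l_m(self):
        """Average message size in bytes per message (0.0 if no messages)."""
        return self.total_bytes / self.e if self.e else 0.0


if __name__ == "__main__":
    stats = NodeStats()
    stats.record_instruction()
    stats.record_instruction(is_memory=True)
    stats.record_instruction(is_memory=True, is_shared=True)
    stats.record_message(256)
    stats.record_message(64)
    print(stats.i, stats.m, stats.s, stats.e, stats.l_m)
```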
Analytical Model
A simple analytic model has been developed to obtain performance results for parallel applications executing in a COMA-BC system. This model is based on the assumption that all the workstations in COMA-BC are physically identical. The analytical model has both input and output data. The input data of the model can be divided into two subsets: 1) The results obtained in the multiprocessor simulation through the collection of statistical variables explained in section 5. 2) The parameters describing the physical characteristics of a real COMA-BC system. Inside the second group there are two more subsets: i) parameters describing the physical characteristics of the processor, including the duration of the processor cycle time ($\tau$) and the number of cycles needed to execute each instruction; ii) parameters describing the behavior of the interconnection network of the workstations, which is represented by means of the LogP model [4]. Using the analytic model, the execution time of a parallel application in a COMA-BC system can be obtained. This index represents the execution time of the same application in a real COMA-BC system with the same number of processors as used in the simulation. From this index, significant performance indexes, such as speedup and efficiency for each parallel workload and for the different numbers of processors, can be obtained.
The total execution time of the parallel application is calculated from the beginning of the execution of the code, supposing that this is simultaneous for all the processors, up to the moment the parallel code ends. As the different workstations can execute parts of the workload of different sizes, the execution time T(n), in a set of n workstations, is defined as the largest of the execution times calculated for the individual workstations. Note that in the simulation phase, the output data described in section 5 refers to each workstation involved. From this point on, these results refer to the workstation which has the biggest workload.
The analytic model obtains the total execution time in a set of n workstations as the sum of two terms:

$$T(n) = t_c + t_r$$

where $t_c$ represents the time spent executing the part of the workload that does not involve communication through the network, and $t_r$ represents the time taken by the network communications that are necessary to execute the workload. To calculate $t_c$, the time spent in the execution of each instruction is considered to be divided into three parts: 1) a fixed time for each instruction, which we call $t_i$; 2) an extra fixed time $t_m$, if the instruction is a read or write to memory; 3) another extra fixed time $t_s$, if the read or write to memory falls inside the shared address space. Then:

$$t_c = i\,t_i + m\,t_m + s\,t_s = (i\,c_i + m\,c_m + s\,c_s)\,\tau$$

The three terms $c_i$, $c_m$ and $c_s$ represent the times $t_i$, $t_m$ and $t_s$ expressed in processor clock cycles. They constitute the three input parameters of the model that describe the number of cycles necessary to execute each instruction. They must be adjusted (like $\tau$) according to the physical architecture of the workstations used.
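The computation-time term can be evaluated with a few lines of Python. In this sketch the function name `compute_time` is hypothetical, the cycle counts per instruction class are the ones quoted in the experimental section (1, 2 and 10 cycles), and the instruction counts and the 10 ns cycle time are made-up values used only to exercise the formula.

```python
def compute_time(i, m, s, c_i, c_m, c_s, tau):
    """Time spent on computation: (i*c_i + m*c_m + s*c_s) * tau.

    i, m, s       : instruction counts from the simulation
    c_i, c_m, c_s : cycles charged per instruction, per memory access and
                    per shared-space access
    tau           : processor cycle time in seconds
    """
    cycles = i * c_i + m * c_m + s * c_s
    return cycles * tau


if __name__ == "__main__":
    # Made-up counts for one workstation and an assumed 10 ns cycle time.
    t_c = compute_time(i=1_000_000, m=300_000, s=50_000,
                       c_i=1, c_m=2, c_s=10, tau=10e-9)
    print(f"t_c = {t_c:.6f} s")
```

With these illustrative numbers the workload costs 2,100,000 cycles, so the shared-space accesses, although only 5% of the instructions, account for almost a quarter of the computation time.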
To calculate $t_r$, this time is considered to include all the operations needed to execute, using the network, the read and write instructions directed to the shared address space that have generated cache misses. In other words, $t_r$ covers all the memory references that need the network in order to be completed. It can be expressed simply as $t_r = e\,w$, where $w$ represents the average time needed to send an event or message through the network. In order to calculate $w$ we use a representation of the network based on the LogP model. To be exact, the frequency of the messages is assumed to be low enough to ignore the time between two consecutive messages (the "gap") [14]. Then $w$ can be expressed as:
where $o_s(l)$, $o_r(l)$ and $L(l)$ are, respectively, the fixed cost of sending a message, the fixed cost of receiving a message and the latency of the network for messages of $l$ bytes. The terms $a$ and $b$ are the input parameters of the model that describe the behavior of the interconnection network. The term $l_m$ represents the average size of the messages exchanged, as mentioned above.
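The way the two time terms combine into performance figures can be sketched in Python. The sketch takes the average per-message time w directly as an input instead of deriving it from the network parameters, and it assumes the usual definitions of speedup, S(n) = T(1)/T(n), and efficiency, E(n) = S(n)/n; the function names and all numeric values are hypothetical.

```python
def network_time(e, w):
    """Communication time t_r = e * w for e messages of average cost w."""
    return e * w


def total_time(t_c, t_r):
    """Execution time T(n) = t_c + t_r of the most loaded workstation."""
    return t_c + t_r


def speedup(t_1, t_n):
    """Assumed definition: S(n) = T(1) / T(n)."""
    return t_1 / t_n


def efficiency(t_1, t_n, n):
    """Assumed definition: E(n) = S(n) / n."""
    return speedup(t_1, t_n) / n


if __name__ == "__main__":
    # Made-up figures: a sequential run of 8 s and a 4-node run whose most
    # loaded workstation computes for 1.5 s and sends 1000 messages of 1 ms.
    t_1 = 8.0
    t_4 = total_time(t_c=1.5, t_r=network_time(e=1000, w=0.001))
    print(f"T(4) = {t_4:.3f} s, S(4) = {speedup(t_1, t_4):.2f}, "
          f"E(4) = {efficiency(t_1, t_4, 4):.2f}")
```

The sketch makes the sensitivity to w visible: doubling the per-message cost in this example raises T(4) from 2.5 s to 3.5 s, which is why the choice of network and protocol stack dominates the results.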
Experimental Results
The execution of six Splash-2 applications [21] has been simulated: LU, Ocean, FFT, Radix, Barnes-Hut and Radiosity. The workload for each application is the standard one described in [21].
These applications have been executed in a simulated COMA-BC system with 2, 4, 8 and 16 workstations. To execute the parallel simulation, the same number of workstations as simulated nodes has been used, all of them connected by a 10 Mbps Ethernet network. A functional emulator and an architecture emulator are executed on each workstation. The results obtained are detailed in table 1. These results correspond to the workstation that had to execute the biggest workload in the simulation; that is, the node that limits the total execution time of the application. All the simulation experiments were carried out using a block size of 256 bytes. The reason is that this size allows us to obtain a compromise between spatial locality and the access miss rate [20].
With these simulation results and using the analytic model proposed in the previous section, an estimation of the execution time, speedup and efficiency can be obtained for a specific COMA-BC system. The COMA-BC system proposed is based on workstations with a cycle time $\tau$ of 7 ns, that is, a clock frequency of 167 MHz. In accordance with the data obtained from standard architectures, the values for the rest of the parameters are: $c_i$ = 1 cycle, $c_m$ = 2 cycles and $c_s$ = 10 cycles. There are five possibilities for the interconnection network: a) Ethernet network at 10 Mbps with the TCP/IP protocol stack, b) Ethernet network at 10 Mbps without TCP/IP and with active messages, c) Fast Ethernet at 100 Mbps, d) FDDI network with an active message layer [19] and e) Myrinet network [1] (ATM with fast messages). Each of these methods of interconnection can be assigned input parameters, which we call a and b, for the analytic model. These two parameters are obtained from the bibliography [14, 19, 9] and are detailed in table 2. Finally, feeding the results from the simulation and the parameters that describe the specific COMA-BC system into the analytic model, speedup and efficiency data are obtained (table 3).
From the results obtained, it can be concluded that a COMA-BC system with a low fixed cost of sending and receiving messages and low latency in transmissions over the network is viable for obtaining speedup in the execution of parallel applications over a network of workstations. Unfortunately, there are no other performance results referring to similar systems, that is, systems based on a network of workstations with a shared-variables programming model. There are results, however, of speedup for multiprocessor systems with similar memory management, such as SGI-Challenge and Origin2000 [5]. Compared with these systems, the speedups obtained in the best of the cases considered for COMA-BC (a Myrinet network) are about half those of the aforesaid systems. This is not much in absolute terms, but it can be considered a good result if we take into account that they have been obtained using a system costing much less than half as much as the former and using standard resources (workstations, interconnection network).
It is clear that the construction of a COMA-BC system based on an interconnection network with high fixed costs of sending and receiving messages and low bandwidth is not viable; examples of such systems are Ethernet at 10 Mbps or Fast Ethernet at 100 Mbps with TCP/IP. If the fixed cost of sending a message is reduced, the results improve considerably, as can be observed when, in the 10 Mbps Ethernet network, the TCP/IP stack is replaced by active messages.
Conclusions
We have shown that COMA-BC is an example of how a distributed shared memory system over a network of workstations can be implemented. We have also established the usefulness of applying the concepts of COMA multiprocessors to the design of COMA-BC. The idea of having only "cache copies", that is, copies of the cache blocks that migrate to those nodes that need them at each instant, has been seen to be useful for carrying out this design. Moreover, having separated the concept of cache block of the shared space from the pages of virtual memory, the size of the blocks used is small, and thus less time is needed for them to travel through the interconnection network.
The simulation study shows the viability of the proposed system for exploiting parallelism in a network of workstations. In the first version, it is not possible to use a standard network in COMA-BC because of the loss of performance. To be exact, we have shown that it is not possible to use a standard Ethernet network with the TCP/IP protocol stack. It is necessary to use lower-latency networks and protocol stacks with lower costs of sending and receiving messages, such as the active messages protocol stack.
Finally, it is necessary to emphasize that the key concept in COMA-BC is the cache coherence protocol, because it manages the coherence of the block copies in the different workstations without an excessive increase of traffic in the network. This is a difficult task because the proposed system is not hierarchical and because a shared-medium interconnection network is used.

Table 3: Performance results expressed as speedup S(n) and efficiency E(n) for several Splash-2 applications in COMA-BC.
