Parallel computing performance on scalable shared-memory architectures is affected by the structure of the interconnection networks linking processors to memory modules and by the efficiency of the memory/cache management systems. Cache Coherence Non-Uniform Memory Access (CC-NUMA) and Cache Only Memory Access (COMA) are two effective memory systems, and the hierarchical ring structure is an efficient interconnection network in hardware. This paper focuses on comparative performance modeling and evaluation of CC-NUMA and COMA on a hierarchical ring shared-memory architecture. Analytical models of the two memory systems are presented for comparative evaluation. Intensive performance measurements on data migrations have been conducted on the KSR-1, a COMA hierarchical ring shared-memory machine. Experimental results support the analytical models, and we present practical observations and comparisons of the two cache coherence memory systems. Our analytical and experimental results show that a COMA system balances the work load well. However, the overhead of frequent data movement may offset the gains obtained from improving load balance. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement.
should be done based on a particular network structure, because different network structures can make
2.1 The architecture for the CC-NUMA and the COMA
The CC-NUMA and the COMA systems to be discussed in the following sections share the same ring interconnection network architecture shown in Figure 1. This architecture has the following hardware and software structures, functions and parameters, which are similar to the ones in [7] and [13]:
1. The architecture consists of a global ring and M local rings. Each local ring is connected to the global ring through an inter-ring port with a pair of buffers. The buffers are used to temporarily store and forward packets passing the port. The global ring has M equally sized slots connecting
2. Each station module consists of one main memory module, one subcache and one processor. In the CC-NUMA system, the main memory module is a home addressed memory which occupies a unique contiguous portion of a flat, global (physical) address space. In the COMA system, the main memory module is a big-capacity cache which results in dynamic mapping between the context address and the system logic address through segment translation tables.
3. Both the global ring and the local rings rotate constantly. The rotation period between slots is defined as $t_r$. A processor node in a local ring that is ready to transmit a message waits until an empty slot is available. The response to a request, such as a read/write, will be rotated back to the requesting processor.
4. Each inter-ring port is able to determine the path of the message packets passed through it. In the COMA system, each port keeps a directory to record the mapping relationships among the cache memory modules in the corresponding local ring. In CC-NUMA, each port can determine the destination of the message packets according to the home address carried by the message packet.
Cache coherence protocols of CC-NUMA and COMA
The cache coherence protocols in our models are based on the available ones which have been proposed/implemented on hierarchical ring architectures (see, e.g., [2], [4] and [7]). In order to compare the differences between the two memory systems, similar hierarchical data directories and cache coherence protocols are designed in each system. In both systems, sequential consistency is preserved.
Cache coherence protocol of CC-NUMA
1. Hierarchical directory
Each processor maintains a local directory in its local cache, which records the data allocation information in the local cache. Each local ring maintains a global directory built in the inter-ring port, which records the data allocation information in the local ring.
2. Ownerships of shared data
Share: there is more than one copy of the shared data existing in other memory modules. Exclusive: the current copy is the only one in the system.
3. Read/write protocol
Reading the shared data: the processor will get the data from its local memory if it is available there; otherwise it will get it from one of the memory modules in the local ring, or in a remote ring, through searching. The newly loaded copy will have the "Share" ownership.
Writing the shared data: the processor will either write the data locally if it is available or will do a remote-write in the destination memory module. The associated invalidation operations are defined as follows. The invalidation of shared data is conducted while a write request travels to the home node and returns from the home node. Each time a write request passes by a global directory which has copies of the requested data, it will produce an invalidation packet to invalidate the copies in the local ring.
Cache coherence protocol of COMA
1. Hierarchical directory
It has the same structure as the one defined for CC-NUMA.
2. Ownerships of cache segments
Copy: there is more than one copy of a physical address segment in the system. Non-Exclusive Copy: it has the same features as "Copy" except that the location is the owner of the physical address segment. Exclusive: there is only one valid copy of the physical address segment in the system. Invalidation: the copy in the cache segment is not valid.
3. Read/write protocol
Read shared data: if the copy exists in the local cache, the processor performs the read immediately. Otherwise, it sends a probe packet into the local ring or remote rings to search for a copy of its required physical address segment. The processor will receive a copy with the ownership of "Copy". The ownership of the cache copy in the destination module will be changed to "Non-Exclusive Copy" if its original ownership is "Exclusive".
Write shared data: the processor will first search for the owner of the shared data through the entire ring hierarchy. As soon as the owner is found in a cache module, the processor loads the data back to its local cache and invalidates all the existing copies in the system. After the invalidation, the processor performs the write in the local cache. The updated data copy becomes "Exclusive".
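The COMA ownership transitions described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the protocol implementation of any real machine: the names `Ownership`, `on_remote_read` and `on_write` are hypothetical, and only the transitions spelled out in the text are modeled.

```python
from enum import Enum


class Ownership(Enum):
    COPY = "Copy"
    NON_EXCLUSIVE_COPY = "Non-Exclusive Copy"
    EXCLUSIVE = "Exclusive"
    INVALID = "Invalidation"


def on_remote_read(supplier: Ownership) -> tuple[Ownership, Ownership]:
    """Return (new supplier state, requester state) after a read miss is served."""
    if supplier is Ownership.INVALID:
        raise ValueError("an invalid segment cannot serve a read")
    # An Exclusive owner stays the owner but is no longer the only copy.
    new_supplier = (Ownership.NON_EXCLUSIVE_COPY
                    if supplier is Ownership.EXCLUSIVE else supplier)
    # The requester always receives its copy with the ownership "Copy".
    return new_supplier, Ownership.COPY


def on_write(copies: dict[str, Ownership], writer: str) -> dict[str, Ownership]:
    """The writer loads the segment, invalidates all other copies, then writes."""
    result = {node: Ownership.INVALID for node in copies}
    result[writer] = Ownership.EXCLUSIVE
    return result
```

For instance, a write by one of two sharers leaves the writer "Exclusive" and the other sharer invalid, which is the data migration behavior that the COMA latency model later has to account for.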
Performance parameters and assumptions
In order to fairly compare the performance differences between CC-NUMA and COMA, it is necessary to define a common evaluation base. The models presented in the next sections are based on the following common performance parameters:
1. $\lambda$: request miss rate of each local cache.
2. $\lambda_h$: the fraction of $\lambda$ directed to a hot-spot address segment.
3. $\lambda_l$: the fraction of $\lambda(1 - \lambda_h)$ directed to memory modules on the local ring.
4. $\lambda_r$: read miss fraction of $\lambda$.
5. $\lambda_w$: write miss fraction of $\lambda$.
6. $t_r$: rotation period of each ring.
7. $N$: the number of stations connected to each local ring (or the number of slots in a local ring).
8. $M$: the number of local rings connected to the global ring.
9. $N_c$: number of cache segments in each local cache memory (this parameter is only used for the COMA system).
Furthermore, we assume:
1. The request miss rate of each local cache follows a Poisson process.
2. The local request rate and the non-local request rate are uniformly distributed.
3. In the request sequence of each station, the read misses and write misses to a data location are uniformly distributed.
4. One message packet can be completely carried by one slot which only conveys this message packet. So the successive slots behave independently.
5. When a station receives a message packet from a slot, it will produce a reply into the same slot without any delay.
In this paper, the major latency analyses for both memory systems on the ring architecture are based on evaluating two important performance factors. First, the ring network contention is modeled by studying hot spot effects. Second, analytical models of read/write miss latencies are constructed using the network contention models associated with the cache coherence protocols. The M/G/1 model is the major mathematical tool used to derive the analytical latency formulas. In the following two sections, we present the objectives, assumptions and major results of each model to evaluate CC-NUMA and COMA systems on a hierarchical ring architecture. For the detailed derivation of the mathematical models, the interested reader may refer to Appendices A and B.
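As background for the M/G/1 model mentioned above, the following sketch evaluates the standard Pollaczek-Khinchine mean waiting time, $W_q = \lambda E[S^2] / (2(1 - \rho))$ with $\rho = \lambda E[S]$. It is the generic textbook formula only, not the derivation of Appendices A and B; the function name and its arguments are illustrative assumptions.

```python
def mg1_mean_wait(arrival_rate: float, mean_service: float,
                  second_moment_service: float) -> float:
    """Mean time spent waiting in queue at an M/G/1 server."""
    rho = arrival_rate * mean_service  # server utilization
    if not 0.0 <= rho < 1.0:
        raise ValueError("queue is unstable unless 0 <= utilization < 1")
    # Pollaczek-Khinchine: W_q = lambda * E[S^2] / (2 * (1 - rho))
    return arrival_rate * second_moment_service / (2.0 * (1.0 - rho))
```

With a deterministic service time of one rotation period ($E[S] = E[S^2] = 1$) and an arrival rate of 0.5, the formula gives a mean wait of 0.5 rotation periods; with exponential service of the same mean ($E[S^2] = 2$) the wait doubles, which shows why the second moment of the service time matters.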
3 An analytical model for the hierarchical ring based CC-NUMA
Network contention
In a hierarchical CC-NUMA system, network contention can be well characterized by a hot spot environment where a large number of processors try to access a globally shared variable across the network. In this case, a hierarchical ring is divided into three regions in terms of network activities: the hot local ring, which is the local ring where the hot spot is located; the cool local rings, which are the rest of the local rings without the presence of the hot spot; and the global ring. A comprehensive access delay model for the entire hierarchical ring in the presence of the hot spot is presented based on contention in each of the three parts of the rings. In Appendix A, the following CC-NUMA latency factors are obtained:
$d_n$: the mean waiting time for a message to find an empty slot in a cool local ring.
$q^{cool}_{lport}$: the queuing time of a message in the interface port from the global ring to a cool local ring.
$d_h$: the mean waiting time for a message to find an empty slot in the hot local ring.
$q^{hot}_{lport}$: the mean queuing time of a message in the interface port from the global ring to the hot local ring.
$q^{hot}_{gport}$: the mean queuing time of a message in the interface port from the hot local ring to the global ring.
$q^{cool}_{gport}$: the mean queuing time of a message in the interface port from a cool local ring to the global ring.
3.2 Latency of a remote-write to the hot spot
A remote-write to the hot spot will be satisfied in one of the following two possible situations:
1. The write request is from the hot local ring, with probability $1/M$. This request only needs to travel the hot local ring for one circle. The traveling time, denoted by $T^{l}_{numa}$, consists of the time for the source processor to find an empty slot on the hot local ring and the time for the request to travel the hot local ring for one circle:
$$T^{l}_{numa} = d_h + N t_r. \qquad (3.1)$$
2. The write request is from a cool local ring, with probability $(M-1)/M$. This write will access the hot memory remotely. The remote-write time, denoted by $T^{g}_{numa}$, consists of four parts: the time from the source cool ring to the global ring, the time from the global ring to the hot ring, the time for searching the destination processor in the hot ring, and the time for the data packet to go back to the source processor:
$$T^{g}_{numa} = d_n + q^{cool}_{gport} + q^{hot}_{gport} + q^{hot}_{lport} + q^{cool}_{lport} + t_r (M + 2N). \qquad (3.2)$$
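The two cases above, weighted by the probability that the requester sits on the hot ring or on a cool ring, can be evaluated numerically as in the sketch below. The function and argument names are illustrative assumptions; the contention terms (the $d$ and $q$ values) come from Appendix A in the paper and are simply passed in as numbers here.

```python
def numa_write_latency(M: int, N: int, t_r: float, d_h: float, d_n: float,
                       q_cool_gport: float, q_hot_gport: float,
                       q_hot_lport: float, q_cool_lport: float) -> float:
    """Average remote-write latency to the hot spot in the CC-NUMA model."""
    # Case 1: the requester is on the hot local ring (one trip around it).
    t_local = d_h + N * t_r
    # Case 2: the requester is on a cool ring and must cross the global ring.
    t_global = (d_n + q_cool_gport + q_hot_gport + q_hot_lport
                + q_cool_lport + t_r * (M + 2 * N))
    # Weight the cases by 1/M and (M - 1)/M respectively.
    return t_local / M + (M - 1) * t_global / M
```

Setting every contention term to zero gives the pure rotation cost, a useful lower bound when checking the effect of ring size.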
Therefore, the average latency of a remote-write to the hot spot is
$$T^{w}_{numa} = \frac{T^{l}_{numa}}{M} + \frac{(M-1)\,T^{g}_{numa}}{M}. \qquad (3.3)$$
3.3 Latency of a remote-read to the hot spot
A read-miss process is more complex than a write-miss process because multiple copies of the data may exist in the system. In general, the process of a remote-read can be described by the state transition graph shown in Figure 2. In Figure 2, O represents the initial state, L represents the state where the requesting processor receives the data from the local ring, G represents the state where the requesting processor receives data from a remote ring, and $T_l$ and $T_g$ represent the latency in states L and G, respectively. Because we have assumed that read misses and write misses are uniformly distributed in the request sequences, whether a data item has multiple copies distributed in other memory modules is determined by the ratio of the read miss rate to the write miss rate. The transition probability $P$ can be determined as follows:
1. When $\lambda_r \le \lambda_w$, each read miss must be preceded by a write miss. So the probability for a hot read to get the data item from a copy of the home data is zero. In this case, the probability $P$ of reading the hot memory module in the local ring equals the probability of the local ring being the hot ring, which is $1/M$.
2. When $\lambda_r > \lambda_w$, there are $\lambda_r/\lambda_w$ read misses per write miss, of which only the first visits the home data and the others visit a copy of the home data. So the probability for a read miss to visit the home data is $\lambda_w/\lambda_r$. Moreover, the probability for the hot data item to be located on a different ring from the request is $(M-1)/M$. Therefore, the probability $P$ for a request to get the data item from a copy of the home data or from the home data on the local ring is $1 - \frac{\lambda_w}{\lambda_r} \cdot \frac{M-1}{M}$.
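The two cases above can be collected into one function. This is a sketch under an explicit assumption: the second branch is reconstructed from the reasoning in case 2 (a read leaves the local ring only when it must visit the home data and the home data is on another ring), and the function and argument names are hypothetical.

```python
def local_read_probability(M: int, read_frac: float, write_frac: float) -> float:
    """Probability P that a hot read miss is served within the local ring."""
    if read_frac <= 0 or write_frac < 0:
        raise ValueError("read fraction must be positive, write fraction non-negative")
    if read_frac <= write_frac:
        # No copies exist: the read is local only if its ring is the hot ring.
        return 1.0 / M
    # A read goes remote only if it must visit the home data (probability
    # write_frac / read_frac) and the home data is on another ring ((M - 1) / M).
    return 1.0 - (write_frac / read_frac) * (M - 1) / M
```

The two branches agree at the boundary where the read and write fractions are equal, which is a quick consistency check on the reconstruction.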
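Concluding the above analyses, the transition probability $P$ can be represented as
$$P = \begin{cases} \dfrac{1}{M}, & \lambda_r \le \lambda_w, \\[2mm] 1 - \dfrac{\lambda_w}{\lambda_r} \cdot \dfrac{M-1}{M}, & \lambda_r > \lambda_w. \end{cases} \qquad (3.4)$$
By (3.4), the latency of a read-miss to the hot spot is
$$T^{r}_{numa} = P\,T_l + (1 - P)\,T_g, \qquad (3.5)$$
where $T_l$ and $T_g$ are computed under the following two conditions: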
1. $\lambda_w \ge \lambda_r$: A read always visits the hot data item in the home node because no copies of the hot data item exist in this situation. Hence, by (3.1) and (3.2), we have $T_l = T^{l}_{numa}$ and $T_g = T^{g}_{numa}$.
2. $\lambda_w < \lambda_r$: In this case, the remote-read miss latency $T_g$ is $T^{g}_{numa}$. The local-read miss latency $T_l$ is given by (3.6), where $(d_n + N t_r)$ is the searching time of a read-miss in a non-hot local ring and $(d_h + N t_r)$ is the searching time of a read-miss on the hot local ring.
4 An analytical model for the hierarchical ring based COMA
Based on the cache coherence protocols defined in section 2, there are only two types of packets running in the ring: probe packets, which carry access requests to search for their destinations, and data packets, which carry the data segments back to the source processors. In a steady state, each ring can be considered to have the same number of probe packets and data packets. Moreover, a COMA system may cause a physical address segment to be moved among different local caches at different times. Therefore, a physical address segment can be assumed to have the same probability of residing in every local cache at a given time, under the condition that each processor requests a physical address segment with the same probability during a unit period of time. Based on this unique COMA data migration feature, we can assume that each local ring has the same contention pattern in a steady state, which is independent of the contention differences among physical address segments. This is the major difference between the home addressed CC-NUMA (where a data item is fixed in a memory module and is accessed by remote-read and remote-write) and the changeably addressed COMA. However, the data migration feature of a COMA system makes the read/write miss process more complicated than that in a CC-NUMA system. In a COMA system, a read/write miss will dynamically chase the requested data because the requested data does not have a home address and will be dynamically moved into a local cache by a write access. In the following, the latency of a read/write miss is derived mainly by modeling the dynamic chasing process of a read/write miss.
Modeling the network contention in COMA
For all local rings, each interface port from the global ring to a local ring contributes an equal amount of traffic to the local ring, which is not affected by the hot spot effects because of the dynamic data migration feature. So the fraction of accesses to the hot spot in each processor should be considered uniformly distributed among the memory modules in N local rings. Based on the above analysis, the packet arrival rate, denoted as $\lambda_{lp}$, in each interface port from the global ring to a local ring can be expressed as
$$\lambda_{lp} = \frac{N(M-1)\lambda\lambda_h}{M} + N\lambda(1-\lambda_h)(1-\lambda_l) + \frac{N(M-1)\lambda\lambda_h}{M} + N\lambda(1-\lambda_h)(1-\lambda_l),$$
where the hot request rate to a local ring is $N(M-1)\lambda\lambda_h/M$, the non-hot request rate to a local ring is $N\lambda(1-\lambda_h)(1-\lambda_l)$, the data packet rate to respond to the hot requests of a local ring is $N(M-1)\lambda\lambda_h/M$, and the data packet rate to respond to the non-hot requests of a local ring is $N\lambda(1-\lambda_h)(1-\lambda_l)$.
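The four rate components listed above can be summed in a few lines. This sketch rests on two assumptions that are not explicit in the surviving text: that the port arrival rate is the sum of the four listed rates, and that each rate scales with the per-processor miss rate (written `lam` here). All names are hypothetical.

```python
def coma_port_arrival_rate(lam: float, lam_h: float, lam_l: float,
                           M: int, N: int) -> float:
    """Packet arrival rate at a global-to-local interface port in the COMA model."""
    # Probe packets entering the local ring from other rings.
    hot_requests = N * (M - 1) * lam * lam_h / M
    nonhot_requests = N * lam * (1.0 - lam_h) * (1.0 - lam_l)
    # Each request is answered by one data packet, so the data packet
    # rates mirror the request rates.
    hot_replies = hot_requests
    nonhot_replies = nonhot_requests
    return hot_requests + nonhot_requests + hot_replies + nonhot_replies
```

Because the rate contains no term that singles out one ring, it is the same for every port, which reflects the balanced contention pattern that data migration is assumed to produce.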
Then, using the same method described in Appendix A, we can obtain the following three important performance results:
1. $U^{l}_{coma}$, the utilization of a local ring in COMA.
2. $q^{wait}_{coma}$, the waiting time for a message to find an empty slot in a local ring.
Latency of remote-write to a hot memory
In a COMA system, the searching process of a write-miss can be expressed as the state transition graph shown in Figure 3, based on the dynamic migration feature of data. State O represents the initial state of a write miss at its source processor. States LS and GS represent two possibilities for a write-miss to find the owner of its required address segment, where LS is the process of searching and getting the hot segment in the local ring with probability $1/M$, and GS is the process of getting the hot segment in a remote memory with probability $(M-1)/M$. States INV1 and INV2 represent the corresponding invalidation states of LS and GS, respectively. We use $t_{ls}$, $t_{inv1}$, $t_{gs}$ and $t_{inv2}$ to represent the time spent in each corresponding state. Based on Figure 3, the latency of a remote-write to a hot spot, denoted as $T^{w}_{coma}$, can be expressed as
$$T^{w}_{coma} = \frac{t_{ls} + t_{inv1}}{M} + \frac{(M-1)(t_{gs} + t_{inv2})}{M} + q^{wait}_{coma}, \qquad (4.13)$$
where the detailed derivation of $t_{ls}$, $t_{inv1}$, $t_{gs}$ and $t_{inv2}$ is given in Appendix B.
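Equation (4.13) is a probability-weighted sum of the two search-and-invalidate paths plus the slot waiting time. The sketch below evaluates it for given state times; the function and argument names are illustrative assumptions, and the state times themselves would come from Appendix B.

```python
def coma_write_latency(M: int, t_ls: float, t_inv1: float,
                       t_gs: float, t_inv2: float, q_wait: float) -> float:
    """Average remote-write latency to a hot spot in the COMA model, per (4.13)."""
    # Owner found on the local ring (probability 1/M): states LS then INV1.
    local_path = (t_ls + t_inv1) / M
    # Owner found on a remote ring (probability (M - 1)/M): states GS then INV2.
    global_path = (M - 1) * (t_gs + t_inv2) / M
    # Every write miss first waits for an empty slot on its local ring.
    return local_path + global_path + q_wait
```

For example, with four rings, a local path costing 12 time units, a global path costing 36 and a wait of 1, the average is 3 + 27 + 1 = 31 units, dominated by the global path.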
Latency of remote-read to a hot memory
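In a COMA system, the remote-read process can be described by the same state transition graph as shown in Figure 2. The transition probability $P$ also has the following expression:
$$P = \begin{cases} \dfrac{1}{M}, & \lambda_r \le \lambda_w, \\[2mm] 1 - \dfrac{\lambda_w}{\lambda_r} \cdot \dfrac{M-1}{M}, & \lambda_r > \lambda_w. \end{cases} \qquad (4.14)$$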
The hot read miss latency is
$$T^{r}_{coma} = P\,T_l + (1 - P)\,T_g, \qquad (4.15)$$
where the computation of the local search latency $T_l$ and the global search latency $T_g$ is more complicated than that in a CC-NUMA system because a read miss in a COMA system involves a process of dynamically chasing the data. In the following, we calculate $T_l$ and $T_g$ under two conditions:
1. $\lambda_w \ge \lambda_r$: Each read miss to data must be preceded by a write miss to the data, which means that no copies of a data item exist when a read miss to the data occurs. In this situation, the data search procedure of a read miss is the same as that of a write miss except that a read miss does not involve an invalidation process. Hence, we have
$$T_l = t_{ls} + N t_r/2, \qquad T_g = t_{gs} + (M + N) t_r/2. \qquad (4.16)$$
2. $\lambda_w < \lambda_r$: There are $\lambda_r/\lambda_w$ read misses per write miss. So the average number of copies of a data item in the system is $(1 - \lambda_r/\lambda_w)/2$ when a read miss to the data occurs, which reduces the global search latency $T_g$ to
$$T_g = \max\left\{ (2N + M) t_r + q^{wait}_{coma} + q^{l}_{coma} + q^{g}_{coma},\ \frac{t_{gs} + (M + 2N) t_r/2}{1 + (1 - \lambda_r/\lambda_w)/2} \right\}, \qquad (4.17)$$
where $(2N + M) t_r + q^{wait}_{coma} + q^{l}_{coma} + q^{g}_{coma}$ is the least time of remotely getting data, and $\frac{t_{gs} + (M + 2N) t_r/2}{1 + (1 - \lambda_r/\lambda_w)/2}$ is the global search latency reduced by the $(1 - \lambda_r/\lambda_w)/2$ data copies. Furthermore, the average number of copies of a data item in one of the M local rings is $(1 - \lambda_r/\lambda_w)/(2M)$, which reduces the local search latency $T_l$ to
$$T_l = \max\left\{ N t_r + q^{wait}_{coma},\ \frac{t_{ls} + N t_r/2}{1 + (1 - \lambda_r/\lambda_w)/(2M)} \right\}, \qquad (4.18)$$
where $N t_r + q^{wait}_{coma}$ is the least time for traveling a local ring for a circle, and $\frac{t_{ls} + N t_r/2}{1 + (1 - \lambda_r/\lambda_w)/(2M)}$ is the search time reduced by the multiple copies.
5 Comparative performance evaluation between CC-NUMA and COMA based on the analytical models
In this section, we provide analytical results on various architecture effects such as access-miss latency and bandwidth. The analysis of the bandwidth is based on the analysis of the upper bound of the request rate per processor. The architectural factors to be considered are the size of a ring and the rotation speed of the ring. The system factors to be considered are the data locality, the hot spot effects and the ratio between read misses and write misses.
The analysis on access-miss latency

The hot spot effects
Here we choose 32 as the number of slots in each local ring and the global ring. The rotation period is one unit time. The effects of the hot spot on miss latency under this condition in CC-NUMA and COMA systems are shown in Figure 4. Figure 4 indicates that the hot spot has little effect on remote-read latency in both systems and on remote-write latency in CC-NUMA. But it affects remote-write latency in COMA to a certain degree because of more frequent data migration. The results show that the structure of the hierarchical ring network can balance network traffic well when a hot spot occurs.
The locality effects
The locality is defined as the ratio between the number of accesses to a local ring and the total number of non-hot memory accesses in the system. The performance parameters are selected as in section 5.1.1 except that the total miss rate is 0.0005. Figure 5 presents the effects of the locality on the miss latencies in both memory systems. It shows that increasing the locality significantly reduces the miss latencies in both systems. In particular, the read-miss latencies in both systems present almost the same curves. But the write-miss latency in a COMA system is less than that in a CC-NUMA system.
The effects caused by different miss rates and request rates
The network contention is mainly determined by the miss rate in each processor. Assuming a uniform distribution of miss rates in each processor, the effects of the miss rate on access-miss latencies in both systems are shown in Figure 6. The results show that an increase in miss rate causes higher read/write miss latency in both systems. But the write-miss latency in the COMA is slightly smaller than the write-miss latency in the CC-NUMA because of a more balanced load in the COMA.
The effects of read/write miss distributions
The effects of read/write miss distributions on the miss latency are measured by changing the ratio between read misses and write misses. Figure 7 shows that, as the read-miss rate increases, the read-miss latencies in both systems decrease significantly, but the write-miss latencies in both systems increase because there are more cache copies to be invalidated. On the other hand, the COMA handles read/write misses slightly more effectively than the CC-NUMA.
System effects by changing the ring size
The size of a ring is an important architecture factor. Assuming the miss rate in each processor is uniformly distributed, and the rotation period of the ring is one unit time, Figure 8 shows that, as the size of a ring increases, the read-miss latencies in both systems follow the same increasing curves, but the write-miss latency in the COMA increases slightly more slowly than that in the CC-NUMA.
System effects by changing the rotation period
The rotation period of the rings is another important architecture factor affecting performance. Figure 9 plots the latencies of both systems as the rotation speed is slowed down step by step. The results indicate that the COMA performs slightly better than the CC-NUMA as the rotation period changes.
[Figure 10: Effects of changing program locality with a specific hot spot rate on the request bound. Curves: hot spot rates of 10% and 30%, each for COMA and NUMA.]
Bandwidth analysis
In CC-NUMA and COMA systems, the access-miss request rate of each processor is bounded by the network contention. It is important to compare how changing the performance parameters affects this request bound in the two memory systems. The detailed mathematical models for bandwidth analysis are given in Appendix C. Figure 10 presents the effects of locality on the upper bound of the miss rate. It shows that the upper bound of the miss rate in the COMA increases slightly faster than that in the CC-NUMA. Figure 11 shows that the effects of the hot spot on the upper bound of the miss rate in both CC-NUMA and COMA are almost identically significant. Figure 12 presents the effects of changing the size of the rings, with a certain hot spot rate, on the upper bound of the access-miss rate. The curves show that both systems have nearly the same performance. Figure 13 shows that, as the rotation speed decreases, the upper bound of the miss rates in both the CC-NUMA and COMA systems is almost identically affected.
[Figure 12: Effects of changing the ring size with a specific hot spot rate on the request bound. Curves: hot spot rates of 20% and 30%, each for COMA and NUMA.]
6 Experiment-based validation on the KSR-1
Effects of locality
Effects of architectural factors
An overview of the experiment-based validation
To validate analytical models, execution-driven simulation and real-machine-based experiments are two alternatives. Although execution-driven simulation can measure a variety of architecture features by flexibly changing various architectural parameters, it may be difficult or even impossible to verify whether the simulator has correctly modeled a real multiprocessor system. As pointed out in [10], the validation of a simulation only shows whether it can produce results similar to another simulator. Therefore, we decided to conduct experiments on a real hierarchical ring system to validate our analytical results.
The KSR-1 [7], introduced by Kendall Square Research, is a hierarchical ring based COMA multiprocessor system which provides a direct testbed for validating the analytical results of the COMA system. To validate the analytical results on a hierarchical ring based CC-NUMA system, we simulate its memory operations on the KSR-1. A key part of simulating a CC-NUMA system in a COMA system is to generate the CC-NUMA memory access patterns on the KSR-1. While in a CC-NUMA system data is home addressed, in a COMA system a data item is dynamically duplicated and moved upon read/write requests. In order to fix a data item, we use an array on the KSR-1 to simulate a home addressed variable in a CC-NUMA system. The array is called the extension vector of the variable. The memory access pattern of a CC-NUMA read/write operation sequence is simulated by directing a read/write operation to each independent element of the variable's extension vector based on the following rule:
Let $s$ be a variable and $s[m]$ be the extension vector of $s$, where $m$ is the length of the vector for multiple accesses to $s$. Let $a_1(s), a_2(s), \ldots, a_t(s)$ be a read/write sequence on $s$ in a CC-NUMA system, where $t$ is the length of the sequence. This sequence is simulated in a COMA system by the sequence $a'_1, a'_2, \ldots, a'_t$, which is constructed as follows:
The above rule guarantees that a series of consecutive read operations access the same variable, and that a write operation does not move the location of the data item. Hence, a CC-NUMA access pattern can be rigorously simulated with the support of the extension vector. The cache coherence protocol proposed for the CC-NUMA system in section 2 indicates that a remote write always has the same execution trace, which is independent of the number of data copies in the system. This is because the system uses multiple parallel invalidation packets to invalidate the copies in local rings. If two processors in a CC-NUMA system produce two similar operation sequences on a shared variable, the similar memory access pattern can be simulated by making each processor produce its own operation sequence on two different variables in the same memory module.
In the rest of this section, we report two sets of experiments on the KSR-1 to validate our analytical results and to give more comparative results between the CC-NUMA and the COMA which complement the results from the analytical models. The first set of experiments, designed for performance validation, is called the uniform experiments: the memory access patterns in the analytical models are simulated, and the effects of cache miss rate, read/write rate, cache coherence protocol, hot spot and locality are measured for comparison with the analytical results. Based on the analytical models, this set of experiments was uniformly constructed such that all 64 processors on two rings were employed. In execution, each processor generated read/write request misses to the memory modules on its local ring and on the remote ring, respectively. The memory access patterns were adjusted by four parameters: request rate, read (or write) rate, hot spot rate and locality rate. In the second set of experiments, additional memory access patterns were generated to measure the effects of the hot spot and the two cache coherence protocols.
The objective of running these experiments is to validate the analytical results presented in earlier sections. Since the analytical models and the experiments were performed on two different bases, the absolute latency measures are different. However, we can still validate and compare the performance models based on the performance tendencies and implications of both the analytical and the experimental results.
Cache Coherence Effects
The KSR-1 maintains consistency of the data in each cache using a write-invalidate cache coherence strategy. Whenever a data item is requested by a processor for an update, every other copy of that data item, in every other cache where that subpage is located, is marked "invalid" so that the proper value of the data item can be maintained. The distributed cache directories maintain a list of each subpage in the local cache, along with an indication of the "state" of that subpage. To validate the analytical results on cache coherence effects, we performed the following three experiments. The first experiment was designed to validate assumption 5 in section 2.3. In this experiment, one processor writes an array of 500,000 elements into its local cache. All other processors then read that array into their local caches, leaving the exclusive ownership of this array with the processor which originally wrote it and leaving copies of the array located in every other cache. The original processor updates the array again, requiring that all other copies be invalidated. The tests were conducted in two different ways: 1) the number of processors was scaled from 1 to 62 continuously, 0-31 on one ring, then moving to the second ring for 32-62; and 2) the number of processors was scaled in pairs, one processor on each of the two rings being a member of the pair. With each increment of the number of processors, one processor was added to each ring. Consequently, the number of processors scaled by 2, e.g., 2, 4, 6, 8, ..., 62. The results in Figure 14 reflect these two different ways of scaling the problem, and conclusions will be drawn from these two different configurations. Figure 14 shows that the maintenance of cache coherence bears no additional cost in the KSR-1 over and above the latency of the ring rotation itself. The latency does not increase with the number of processors; it increases only with the distance the data invalidation must travel from one ring to another ring.
This is consistent with assumption 5 in section 2.3, where the ring rotation is clocked so that any and all actions that could possibly take place during a stop at each cell can be accomplished.
In the second and the third experiments, the uniform experiments were conducted under different sets of parameters. To validate the effect of the request miss rate on network contention, in the second experiment we used the same set of performance parameters as in the analytical model, where the hot spot rate was set to 0, the read rate was 0.7 and the locality rate was uniformly distributed. The write miss latencies and the read miss latencies were measured while the request rate was changed from 0.00015 to 0.00045 through a delay function. The measurements are given in Table 1, which shows tendencies of miss latency similar to the analytical results in Figure 6. To validate the effects of different read/write miss distributions on miss latency, in the third experiment we set the following conditions: the request rate was fixed at 0.0005, the locality rate was uniformly distributed, the hot spot rate was set to 0 and the read rate was changed from 0.1 to 0.8. Table 2 lists the measurement results, showing that the read-miss latency is very close to the write-miss latency when the read rate is less than 0.5, and then decreases significantly as the read rate increases beyond 0.5. This is due to the effect of multiple copies of data items in both COMA and CC-NUMA systems. These experimental results agree with the analytical results given in Figure 7 in terms of the system effects.
Locality effects
The effect of the locality rate was measured through the uniform experiments under the following conditions: the request rate was fixed at 0.0005, the hot spot rate was set to 0 and the read rate was set to 0.7. The locality rate was varied from 0.1 (where 90% of the requests sent by a processor are directed to a remote processor in the remote ring) to 0.9 (where 90% of the requests sent by a processor are directed to a remote processor on the local ring). The measurement results reported in Table 3 show that the rate of decrease of the miss latencies is similar to the analytical result given in Figure 5. For example, both analytical and experimental results show that the miss latencies fall by about 25% when the locality rate increases from 0.1 to 0.5.
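The read rate, locality rate and hot spot rate of the uniform experiments can be read as the probabilities of a synthetic request generator. The following Python sketch is illustrative only and is not the program run on the KSR-1; the function name, the tuple encoding of targets and the two-ring default are assumptions made here.

```python
import random


def make_request(rng, src_ring, n_rings, read_rate, locality_rate, hot_spot_rate):
    """Draw one synthetic miss request as (kind, target).

    kind   : "read" with probability read_rate, otherwise "write"
    target : "hot_spot" with probability hot_spot_rate; otherwise the
             requester's own ring with probability locality_rate, else
             a uniformly chosen remote ring
    """
    kind = "read" if rng.random() < read_rate else "write"
    if rng.random() < hot_spot_rate:
        return kind, "hot_spot"
    if rng.random() < locality_rate:
        return kind, ("ring", src_ring)
    remote = rng.choice([r for r in range(n_rings) if r != src_ring])
    return kind, ("ring", remote)


# Example: the locality experiment at its upper end (read rate 0.7,
# locality rate 0.9, hot spot rate 0) on a two-ring system.
rng = random.Random(0)
sample = [make_request(rng, 0, 2, 0.7, 0.9, 0.0) for _ in range(20000)]
```

Sweeping `locality_rate` from 0.1 to 0.9 while holding the other arguments fixed corresponds to the sweep reported in Table 3.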
Hot spot performance on the KSR-1
In practice, a hot spot may occur under different memory access patterns, resulting in different performance degradation. Hence our measurements were conducted not only through the uniform experiments for validating the analytical results presented in Figure 4, but also through three additional experiments for studying the hot spot effects under practical memory access patterns.
Effects on memory access delay
The hot spot on the KSR-1 is allocated either in a fixed location, called a fixed hot spot, for CC-NUMA, or in movable locations, called a movable hot spot, for COMA. The fixed hot spot remains physically on one processor as other processors try to read it as a single variable or as a block of data. The movable hot spot migrates around the ring on demand of any processor that reads it as a single variable or as a block of data. This data migration is a feature of the KSR-1 intended to enhance data locality.
To validate our analytical results, we first evaluated the hot spot effects through the uniform experiments under the following conditions: each processor in the two rings generated request misses at the fixed rate of 0.0005, the locality rate was uniformly distributed, the read rate was fixed at 0.7 and the hot spot rate was varied from 0.1 to 0.6. The read/write miss latencies were averaged over 10000 test cases and are listed in Table 4. The measurement results show that a write-miss in the COMA system is more sensitive to a hot spot than in the CC-NUMA system, due to the overhead of more frequent data movement in the COMA system. This conclusion is consistent with the analytical results reported in Figure 4 in terms of hot spot effects.
In practice, a hot spot is usually generated by only part of the processors in the system. The experiments reported in [16] simulate this type of memory access pattern: 57 out of 64 processors in a KSR-1 system were used to generate the hot spot on another remote cache module, leaving 6 remote cool cache modules. The miss latencies of remote reads and remote writes of one word, one block, two blocks and three blocks were measured, respectively, under an environment without any hot spots, an environment with the hot spot generated by cache references in a word unit, and an environment with the hot spot generated by cache references in a block unit. [...] cool variables in the cool cache modules due to heavier traffic caused by more data movement.
Another experiment to verify our modeling work examines whether a hot spot can affect remote reads among processors that are not involved in generating the hot spot. Again, two rings were used for the experiment. The difference between this experiment and the previous one is that all of the processors contributing towards generating the hot spot are on the same ring (the hot ring), while the processors on the other ring (the cool ring) are not involved in generating the hot spot. The hot spot is fixed to one processor within the hot ring. We varied the experiment by increasing the number of processors on the hot ring involved in generating the hot spot, using 50% and 97% of the available processors on the hot ring for these variations. Thus, for 50% usage of the processors on the hot ring, there were 16 processors on the hot ring that were not involved in generating the hot spot, and 16 processors were likewise used on the cool ring. These two sets of 16 processors were used to do remote reads of their counterpart processors, respectively. The remote reads were unidirectional between the two rings during any run of the experiment. Thus, the 16 processors on the cool ring read from the 16 processors on the hot ring during the hot spot activity in one run of the experiment. We then reversed the direction and had the 16 processors on the hot ring read remotely from the 16 processors on the cool ring during the hot spot activity in another run. The hot spot was generated by either reading a single variable or reading a block of data. Thus, there were 4 different timings from the hot spot activity that were compared to the remote reads when there was no hot spot. For 97% usage of the processors on the hot ring, there was 1 available processor on the hot ring and 1 available processor on the cool ring.
As shown in Tables 5 and 6 for the different variations, the hot spot has very little effect on all the remote reads in this experiment. This additional experiment on the KSR-1 further strengthens our analysis that a hierarchical ring based architecture, such as the KSR machine, handles hot spot activity efficiently, as presented by the analytical models in Figure 4.
Effects on normal parallel computations in cool nodes
In this experiment we measured the effects that a hot spot may have on a matrix multiplication application. Again, there were two forms of hot spots, fixed for CC-NUMA and movable for COMA.
Table 5: Reading measurements (µs) when the hot spot is generated by 50% of processors on the hot ring on the KSR-1.

                        (1 var)  (1 blk)  (2 blks)  (3 blks)
from cool R to cool R   27.83    32.14    61. [...]

Table 6: Reading measurements (in µs) when the hot spot is generated by 97% of processors on the hot ring on the KSR-1.

We used 64 processors for our experiments. The number of processors doing the matrix multiplication was increased through 1, 2, 4, 8 and 16. The product computed is A × B, and the result is put in matrix C. The size of each matrix is 224 × 224. The computation resided on the same ring. The matrices A and C are distributed so that each processor has to access data from its neighbor to do its share of the computation. Each of the contributing processors has a local copy of the matrix B. When a hot spot is present, 48 of the 64 processors (75%) contribute towards generating it. The ring where the hot spot resides is the hot ring (this is true only for the fixed hot spot implementation); the other ring is the cool ring. As in our other experiments, the hot spot was generated by either reading a single variable or reading a block of data.

Table 7 presents the timing results of fixed hot spot effects on the matrix multiplication. The first row gives the timings of the matrix multiplication (MM) without a hot spot present. The second and third rows show the timings for the setup in which the processors doing the matrix multiplication all reside on the same ring (the cool ring) while the hot spot resides on the hot ring. The fourth and fifth rows show the opposite setup: all processors involved in the matrix multiplication reside on the hot ring (where the hot spot is located). The sixth and seventh rows show the same setup as the previous two rows with one difference: one of the processors contributing to the matrix multiplication is the processor where the hot spot resides. This is marked in the table with a one in parentheses, (1). As we predicted, there was virtually no difference in the timings in the presence of a fixed hot spot with any of the implementations.
In Table 8, we show the timings in the presence of a movable hot spot. Row one shows the timings without the hot spot, while rows two and three show the timings with the movable hot spot. Again, there was virtually no difference in the timings in the presence of the hot spot. This group of experiments further supports our analytical performance evaluation in Section 5.1.1.
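The work division in the matrix multiplication experiment can be pictured with a small Python sketch. It is a sequential stand-in written for illustration, not the KSR-1 program: the contiguous row-block layout and the function names are assumptions, and the neighbor accesses and hot spot traffic of the real experiment are not modeled.

```python
def row_blocks(n, n_procs):
    """Split rows 0..n-1 into n_procs contiguous blocks (n divisible by n_procs)."""
    size = n // n_procs
    return [range(p * size, (p + 1) * size) for p in range(n_procs)]


def block_matmul(a, b, n_procs):
    """Compute C = A x B, one row block of C per simulated processor.

    Every simulated processor reads the whole of B, mirroring the local
    copy of B that each contributing processor holds in the experiment.
    """
    n, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(n)]
    for rows in row_blocks(n, n_procs):  # one iteration per processor
        for i in rows:
            for j in range(cols):
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return c
```

With 224 rows, the processor counts 1, 2, 4, 8 and 16 used in the experiment all divide the matrix evenly, giving 224, 112, 56, 28 and 14 rows per processor.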
Conclusion
In this paper, our analytical models characterize the performance differences between CC-NUMA and COMA on a hierarchical ring architecture. The models consider the interconnection network and the memory systems, which are two important factors affecting shared-memory performance. We also conducted experiments on the KSR-1, a hierarchical ring COMA system, to verify some of the analytical results. We summarize the performance evaluation results as follows:
1. In a hierarchical ring based architecture, a slotted ring orders and delays remote data access requests. This structure naturally reduces network contention for programs with hot spots. Analysis indicates that in the presence of hot spots overall ring traffic will be moderately increased, but it will be distributed evenly in the ring network. The analytical results have been verified by the experiments on the KSR-1. When the hot spot memory access rate is increased, the write-miss latency in a COMA system becomes slightly larger than that in CC-NUMA.
2. In the presence of a hot spot, COMA generates higher write-miss latency due to more frequent data migrations and a larger number of invalidations.
4. Our analyses and experiments show that for applications with dominant read-misses at either high or low rates, COMA and CC-NUMA have nearly identical performance. In contrast, the simulation results in [12] indicate that the two systems have nearly identical performance only for applications with low miss rates. A main reason for the different performance results is the different evaluation testbeds. The constant latency assumption in the flat network architecture simulator causes longer delay for COMA, and is likely to make the network contention independent of the memory access patterns of applications. In addition, a flat network architecture is more sensitive to hot spots than a hierarchical network. In comparison, the KSR ring architecture allows the system to exploit hierarchical locality of reference by moving referenced data to a local cache and satisfying data references from nearby copies of a data item whenever possible.
5. We show that CC-NUMA handles coherence misses only slightly more efficiently than COMA in the ring architecture, while the simulation results in [12] indicate that the difference is significant. Again, this is related to the different network architectures used for the evaluations.
We conclude that CC-NUMA and COMA systems behave similarly on a hierarchical ring architecture. Two main reasons for this are that the overhead of data migration in COMA matches the saving from improving locality in CC-NUMA, and that a slotted ring architecture balances the network contention. Our study indicates that a hierarchical ring network is a reasonable candidate for both COMA and CC-NUMA systems. We believe our performance results could be further generalized to the two memory systems on a hierarchical network architecture. Although a CC-NUMA system may not automatically balance the load at the system level, it provides an option for a user to explicitly handle data locality for a possible performance improvement. Therefore, the decision to use or build CC-NUMA or COMA systems on a hierarchical ring should be determined by the programming and manufacturing cost of the systems. Finally, based on our study, we believe a fair and valuable comparative performance evaluation of COMA and CC-NUMA systems should be conducted on each particular network architecture.

Appendix A: Modeling network contention using the M/G/1 queue theory

There are N stations and one interface port in a local ring, which are assumed to be independent of each other. Each independent station contributes an equal amount of traffic to the ring at the same miss request rate $\lambda$, which follows a Poisson process. The interface port also inputs traffic to the local ring at the rate $\lambda^{cool}_{lport}$, which is expressed as the sum of the probe packet arrival rate and the data packet arrival rate:
$$\lambda^{cool}_{lport} = N\lambda\lambda_h + 2N\lambda(1-\lambda_h)(1-\lambda_l). \qquad (7.19)$$

A general network utilization is defined by

$$U = \lim_{t\to\infty} \frac{C}{t}, \qquad (7.20)$$

where C is the total number of bytes transmitted to the network from all the connected stations and the interface port during a period of time t. Based on (7.19) and (7.20), the utilization of a non-hot local ring is formulated as

$$U_l = t_r(\lambda^{cool}_{lport} + N\lambda) = N\lambda t_r\,(1 + \lambda_h + 2(1-\lambda_h)(1-\lambda_l)).$$
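As a numerical check on the two forms of the local ring utilization, the following Python sketch evaluates (7.19) and both sides of the utilization formula. It is a sketch under stated assumptions: `lam`, `lam_h` and `lam_l` stand for the miss request rate, the hot spot rate and the locality rate, `t_r` is the time per ring slot, and the sample values are taken from the experimental settings above purely for illustration.

```python
def cool_port_rate(n, lam, lam_h, lam_l):
    """Port traffic into a cool local ring, equation (7.19)."""
    return n * lam * lam_h + 2 * n * lam * (1 - lam_h) * (1 - lam_l)


def cool_ring_utilization(n, lam, lam_h, lam_l, t_r):
    """Utilization as t_r times (port traffic plus station traffic)."""
    return t_r * (cool_port_rate(n, lam, lam_h, lam_l) + n * lam)


def cool_ring_utilization_factored(n, lam, lam_h, lam_l, t_r):
    """The same utilization with N * lam * t_r factored out."""
    return n * lam * t_r * (1 + lam_h + 2 * (1 - lam_h) * (1 - lam_l))


# Illustrative values: 32 stations, request rate 0.0005, hot spot rate 0.1,
# locality rate 0.5, and one time unit per slot (t_r is an assumed unit).
u_direct = cool_ring_utilization(32, 0.0005, 0.1, 0.5, 1.0)
u_factored = cool_ring_utilization_factored(32, 0.0005, 0.1, 0.5, 1.0)
```

The two values coincide up to floating-point rounding, which confirms that the factored form follows from (7.19) by adding the station traffic $N\lambda$.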
Because the successive slots have been assumed to behave independently, the probability of a slot on a cool local ring being full is $U_l$. The time needed for a packet in each station to find an empty slot can be approximated to have a geometric distribution, with mean $d_n$. (7.22)

The interface port from the global ring to a cool local ring uses a queue to buffer the message packets. In order to calculate the average waiting time for a request in the queue, we model the port as an M/G/1 queue with packet arrival rate $\lambda^{cool}_{lport}$. The average queue length in the buffer may be calculated by Little's law,

$$Q = \lambda^{cool}_{lport}\, q^{cool}_{lport}, \qquad (7.23)$$

where $q^{cool}_{lport}$, the average waiting time in the queue, may be calculated by

$$q^{cool}_{lport} = (d_n + t_r) + Q(d_n + t_r). \qquad (7.24)$$

Combining (7.23) and (7.24), the waiting time becomes

$$q^{cool}_{lport} = \frac{d_n + t_r}{1 - \lambda^{cool}_{lport}(d_n + t_r)}. \qquad (7.25)$$

In the hot local ring, the interface port from the global ring to the hot local ring contributes traffic to the hot local ring at the rate $\lambda^{hot}_{lport}$, which is calculated as

$$\lambda^{hot}_{lport} = (M-1)N\lambda\lambda_h + 2N\lambda(1-\lambda_h)(1-\lambda_l). \qquad (7.26)$$

By formulas (7.20), (7.22), (7.23), (7.24) and (7.26), $d_h$, the mean time to find an empty slot on the hot local ring, and $q^{hot}_{lport}$, the mean queueing time in the interface port between the global ring and the hot local ring before a message enters the hot ring, are derived as: [...]

Appendix B: Modeling read/write miss latencies in COMA system

7.1 Modeling the search time $t_{ls}$ in LS state in Figure 3

Using a local ring, we know that the probability for a slot to be non-empty in a local ring is $U^{coma}_l$.
Since there are only two types of packets running on the rings, the probe packets and the data packets, the probability for a slot to have a probe packet is $U^{coma}_l/2$. In addition, the probability for a probe packet to be a hot write miss is $(\lambda_h + \frac{1-\lambda_h}{NMN_c})\lambda_w$. Therefore, the probability for a slot to have a hot write probe packet is

$$P_{whl} = \frac{U^{coma}_l}{2}\left(\lambda_h + \frac{1-\lambda_h}{NMN_c}\right)\lambda_w. \qquad (7.33)$$

In state LS, let $i_1$ be the initial distance (number of slots) from the hot segment to the write miss request, which is a random value in $[0, N]$. We assume that the probability for the hot segment to be located in local cache $j$ ($j = 1, 2, \ldots, N$) is $1/N$. Then the average of $i_1$ is $N/2$. The probability for the hot write request to catch up with the hot segment at distance $i_1$ is $(1 - P_{whl})^{i_1}$, which is the probability that none of the other write requests writes the hot segment in advance. If the write-miss has not found the hot segment at distance $i_1$, the hot segment must have been carried onto another local cache, at distance $i_2$ from the hot write probe block. Variable $i_2$ is a random value in $[0, N/2]$ with an average of $N/4$. The probability for the write request to catch up with the hot segment at the second local cache is $(1 - (1 - P_{whl})^{i_1})(1 - P_{whl})^{i_2}$. The write request repeats this process until it gets the hot-spot segment. Thus, the search process of a write request can be expressed as the state transition graph shown in Figure 15.
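The chain of catch-up probabilities described above can be tabulated with a short Python sketch. It rests on assumptions that should be read as such: the non-hot term of (7.33) is taken to be $(1-\lambda_h)/(NMN_c)$, each stage uses the average distance ($N/2$, $N/4$, and so on, halving at every stage), and the function names are introduced here for illustration.

```python
def hot_write_probe_prob(u_l, lam_h, lam_w, n, m, n_c):
    """Probability that a slot carries a hot write probe, after (7.33)."""
    return (u_l / 2) * (lam_h + (1 - lam_h) / (n * m * n_c)) * lam_w


def catch_up_probs(p_whl, n, stages):
    """Probability that the write request first catches the hot segment
    at stage 1, 2, ..., using the average distance N / 2**i at stage i."""
    probs = []
    missed_so_far = 1.0  # probability that every earlier stage failed
    for i in range(1, stages + 1):
        p_i = (1 - p_whl) ** (n / 2 ** i)  # no competing hot write on the way
        probs.append(missed_so_far * p_i)
        missed_so_far *= 1 - p_i
    return probs
```

The first two entries of `catch_up_probs` reproduce $(1 - P_{whl})^{N/2}$ and $(1 - (1 - P_{whl})^{N/2})(1 - P_{whl})^{N/4}$ from the text; when `p_whl` is zero the request always succeeds at the first stage.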
In Figure 15, $k = \log(N)$, $p_i = (1 - P_{whl})^{N/2^i}$, $q_i = 1 - p_i$, and the time $T_i$ can be approximately evaluated as $Nt_r/2^i$, because the new owner of the hot spot can be any local cache except the old ones and the source node at each state $i$ ($i = 1, 2, \ldots, k + 1$ [...]

[...] Figure 3

When a hot write request changes from state LS to state INV 1, the invalidation process in INV 1 is determined by the ownership changes of the owner of the hot data:
Case 1: the owner's status of the data copy is changed from "Exclusive" to "Invalidation" with probability $\lambda_w$. The invalidation block carries the hot data directly to the source processor. Hence the invalidation time $t'_{inv1}$ is $Nt_r/2$.
Case 2: the owner's status of the data copy is changed from "Non-exclusive" to "Invalidation" with probability $\lambda_r$. The invalidation packet must travel the global ring for one circle to invalidate the other copies and then go back to the source processor. The invalidation time $t''_{inv1}$ consists of the following components:
1. the time traveling from a local ring to the global ring.
2. the time traveling one circle of the global ring.
3. the time traveling from the global ring to the local ring.
4. the time traveling to the source processor in the local ring.
So $t''_{inv1}$ can be expressed as

$$t''_{inv1} = (2N + M)t_r + q^{coma}_{wait} + 2q^{coma}_l + q^{coma}_g.$$
Combining the above two cases, the invalidation time in state INV 1 is:
$$t_{inv1} = \lambda_w t'_{inv1} + \lambda_r t''_{inv1}.$$
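The combination of the two invalidation cases in state INV 1 can be evaluated with a minimal Python sketch. The argument names are assumptions chosen here: `lam_w` and `lam_r` are the probabilities of Case 1 and Case 2, and `q_wait`, `q_l` and `q_g` stand for the COMA queueing terms, whose values would come from the contention model.

```python
def t_inv1(n, m, t_r, lam_w, lam_r, q_wait, q_l, q_g):
    """Expected invalidation time in state INV 1.

    Case 1 (owner was exclusive): half a trip around the local ring.
    Case 2 (owner was non-exclusive): local ring hops, one circle of the
    global ring, and the queueing delays at the ports.
    """
    t_exclusive = n * t_r / 2
    t_shared = (2 * n + m) * t_r + q_wait + 2 * q_l + q_g
    return lam_w * t_exclusive + lam_r * t_shared
```

For example, with N = 32, M = 2, unit slot time and no queueing, Case 1 costs 16 time units and Case 2 costs 66, so a 0.3/0.7 write/read split gives an expected invalidation time of 51 units.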
7.3 Modeling the global search time $t_{gs}$ in state GS in Figure 3

Similar to (7.33), the probability for a slot on the global ring to have a hot write probe block is

$$P_{whg} = \frac{U^{coma}_g}{2}\left(\lambda_h + \frac{1-\lambda_h}{NMN_c}\right)\lambda_w. \qquad (7.35)$$
The time $t_{gs}$ consists of the following three components:
1. $t_{tra}$, the time traveling from a local ring to the global ring, which is $t_{tra} = Nt_r/2 + q^{coma}_l$;
2. $t_{sea}$, the time searching the directory along the global ring. Initially we can consider $M/2$ (in number of slots) to be the average distance from the source processor to the hot global directory which connects to the hot local ring. When the request, denoted by $r_1$, reaches the directory, the directory may have become cool, which means that another write request entered the hot ring before this one. The request $r_1$ must continue searching along the global ring, wait for another global directory to become hot, and then repeat the above procedure until $r_1$ catches up with the hot global directory. Each time a write request runs into the hot global directory, the hot directory becomes cool. There will be no hot directory until the write request carries the hot segment into the source local ring, which makes the global directory on the source local ring hot.
The delay, denoted by $d$, in the system between generating two hot directories is the sum of the time for the write request to enter the hot local ring, the time to search for the hot segment, the time to carry the hot segment to the global ring and the time to travel to the interface port connected to the source local ring. Hence, this delay can be expressed as

$$d = q^{coma}_g + q^{coma}_l + Mt_r/2 + t_{ls}. \qquad (7.36)$$
The probability for the write request, $r_1$ [...]

[...] Figure 3

In state INV 2, the invalidation process has the same two alternatives as those in INV 1. Using similar analysis techniques, the invalidation time in INV 2 can be expressed as

$$t_{inv2} = Nt_r + q^{coma}_l + q^{coma}_g + Mt_r(\lambda_r + 1)/2. \qquad (7.39)$$
