Analytical models were developed and simulations of memory latency were performed for Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Local-Remote-Global (LRG), and Replicated Concurrent-Read (RCR) architectures for hit rates from 0.1 to 0.9 in steps of 0.1, memory access times of 10 to 100 ns, proportions of read/write access from 0.01 to 0.1, and block sizes of 8 to 64 words. The RCR architecture provides favorable performance over UMA and NUMA architectures for all ranges of application and system parameters. RCR outperforms LRG architectures when the hit rate of the processor cache exceeds 80% and that of the replicated memory exceeds 25%. Thus, inclusion of a small replicated memory at each processor significantly reduces expected access time since all replicated memory hits become independent of global traffic. For configurations of up to 32 processors, results show that latency is further reduced by distinguishing, in burst-mode transfers, between isolated memory accesses and those which are incrementally outside the working set.
Introduction
Rapid changes in the cost and density of semiconductor memory technology have made the previously preferred multiprocessor design approaches obsolete. In particular, traditional interconnection strategies between multiple processors and a common memory regard the storage space as a very scarce resource. These conventional approaches restrict scalability by requiring latency to transfer data whenever and wherever remote memory references occur. Previous designs have addressed this problem by including local caches, multistage combining networks, and elaborate referencing schemes, but require sophisticated hardware to maintain coherence between the physically-distinct memories.
The most direct and efficient conversion of sequential programs to a parallel form is via the shared-memory programming model. Consequently, there is a need to design architectural support for this model that possesses a reasonable balance between cost and performance. The design criteria should include the following principles:
(1) The design should be simple and inexpensive to build, (2) Coherence among the distributed memories must be maintained, and (3) The design should be optimized for typical memory reference characteristics.
The first criterion can be met in part by using multiple physically-distinct memory units organized hierarchically. By adding small, fast cache memories, a considerable performance improvement is obtained for a negligible increase in system cost.
However, in the memory hierarchy of a multiprocessor computing system, data inconsistency may occur between adjacent levels of different processors or even within the same level of memory hierarchy. This coherence problem generally arises during asynchronous and independent operations of multiprocessors having physically-distributed memories.
While the third criterion can have a very significant impact on system performance, it has essentially been neglected for many years because multiprocessor READ and WRITE latencies were identical, as in uniprocessors. In practice, however, their difference in frequency of occurrence turns out to be the key to solving the multiprocessor design problem.
Previous Work
Distributed shared-memory multiprocessors communicate using mutually accessible stored data structures called shared variables, with READs and WRITEs of multiple CPUs capable of accessing the shared data. The memory design objective is to match the effective memory bandwidth with the peak processor throughput, so that the maximum demand for memory words can be satisfied. Ideally, the memory bandwidth would match the transfer rate demanded by each processor even after coherence and contention are taken into account. In this section, we review some of the most recent approaches in order to provide the basis for our analysis.
In the Uniform Memory Access (UMA) multiprocessor model, all processors P i experience an equal expected access time to all shared-memory modules M j for all 1 ≤ i, j ≤ N . In this model, each processor P i is attached to a private cache C i so that it can take advantage of locality of reference of the data and/or instructions which are currently used. All processors may access their private cache in t c time with hit rate h. In case of a miss, P i may have to compete with other processors to access global memory. The waiting time to access shared-memory is dependent on the number of pending memory references and their characteristics.
In a Non-Uniform Memory Access (NUMA) multiprocessor model, the shared-memory is physically divided into smaller regions and each is assigned to a processor (P i ) forming an ensemble of local memories, LM i , where 1 ≤ i ≤ N . Together, these local memories (LM i ) comprise the global address space accessible to all processors. In a NUMA multiprocessor model, the processor's access time to main memory varies with the location of memory word within the global space. Every P i is attached to a private cache C i . In case of a miss, then P i may have to access local or remote memories. In this shared-memory model, access time of a local processor to a local memory is less than the same processor's access time to any remote memory. The increase in access time results from the added delay and possible contention within the interconnection network.
Local-Remote-Global (LRG) architecture consists of several clusters with every cluster containing two processors and a shared local memory. Each processor on the board is attached to a private cache. 1 The LRG architecture provides also a global memory module which is uniformly accessible by all processors. A READ hit is answered in t c time. A READ miss requires checking the local memory for a copy of the data. The time it takes to fetch a word from local memory is t L plus the waiting time for the local bus. A processor will access the global memory if there is a read-miss on local memory or the processor is performing a shared-write.
In Replicated Concurrent-Read (RCR) architecture 2 every processor is attached to a private cache, a local cache, and a replicated memory module as shown in Fig. 1. A private READ/WRITE is done locally by a processor with no delay. A shared READ will take place in one clock cycle since each processor is attached to a replicated memory module. A READ miss will cause an access to auxiliary memory through the write or read bus. If the word is located incrementally outside the fence, then a block of data will be transferred to all replicated memories; otherwise the intended word will be loaded to the requesting processor immediately. Simultaneously, the spatial cache controller (SCC) monitors all addresses being accessed so that it can broadcast blocks of data to replicated memories during periods of slow traffic. In the case of shared WRITEs, all replicated memories will be updated concurrently. A demux attached to the write/read bus determines whether the address is in auxiliary memory or in replicated memories, as shown in Fig. 1.
Uniform Memory Access (UMA) Architecture
Configuration parameters
In UMA architecture, all processors share a global memory which is equally close to all processing elements, while every processor is attached to a private cache. A READ hit fetches data from the cache in t c time. The probability of a cache hit is denoted as h c . A READ miss will cause P i to access the global memory in order to fetch data. The probability of having to access global memory is (1 − h c ). P i may have to wait to access the cache if there are a number of pending WRITEs since only one invalidation can occur at a time. Since every processor in the system has to access shared memory with a probability of (1 − h c ), some delay in accessing the shared memory is inevitable. As the number of processors increases, these delays increase the average memory access time. As a result, UMA architecture can support only a small number of processors. The UMA simulator generates memory accesses and defines the characteristics of each access. Table 1 lists the input to the simulator.
In the UMA machine, all processors share a global memory module while every processor is attached to a local cache and a separate private cache. Every memory access starts by searching for the data in either the local cache or the private cache, depending upon the address of the access. In the case of a miss, global memory is accessed. Let h c denote the cache hit rate. A READ hit fetches data directly from the cache in t c time if there are no pending WRITEs, since only one invalidation can occur at a time. The probability of having to wait for a pending write to complete may be expressed as P shared write . The time penalty involved depends on the number of pending WRITEs in the queue. Let t UMA wait pending write denote the time a processor may have to wait before accessing data from the cache. A READ miss requires checking the global memory for a copy of the data.
The time required to access any such data is simply the global memory cycle time t G . The chance of having to wait for another memory reference to complete its execution is (1 − h c ). The wait time depends also on the number of pending memory accesses. Let t UMA wait global bus denote the time a processor may have to wait for accessing global memory. The average READ time may be expressed as:
t UMA wait pending write, the delay caused by other processors' pending invalidations, can be approximated by scaling the probability of each individual processor requesting global access by the expected number of requests pending from other processors, which is (N − 1)/2. Thus, the expected value of t UMA wait pending write may be calculated as follows:
The waiting time for global bus is dependent on whether other processors are trying to access global memory. t UMA wait global bus may be expressed as:
A private WRITE takes place in t c time plus waiting time for any other pending WRITEs. Any update to shared-data requires also updating the global memory. As a result, any shared-write memory accesses may have to wait for global traffic. Let P private write denote the probability of a private-write memory reference such that P private write = 1 − P read − P shared write . The average WRITE time may be expressed as: 
The overall memory access time is dependent on the probability of an access being READ or WRITE. Let t Ave denote the overall memory access time. t Ave may be expressed as:
As seen in Eq. (5), the expected memory access time is the sum of the average READ and WRITE times, each weighted by the probability of an access being a READ or a WRITE.
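As an illustration of the weighting described for Eq. (5), the following sketch combines average READ and WRITE times into one expected access time. It assumes that every access is either a READ or a WRITE, so that the WRITE probability is 1 − P read; the function and parameter names are hypothetical, and the times in the usage lines are made-up values rather than simulator results.

```python
def average_access_time(p_read, t_read, t_write):
    """Probability-weighted mean of the average READ and WRITE times.

    Assumption: an access is either a READ or a WRITE, so the WRITE
    probability is taken as 1 - p_read.
    """
    if not 0.0 <= p_read <= 1.0:
        raise ValueError("p_read must lie in [0, 1]")
    return p_read * t_read + (1.0 - p_read) * t_write


# Hypothetical inputs: 90% READs, 20 ns average READ, 50 ns average WRITE.
t_ave = average_access_time(0.9, 20.0, 50.0)
print(f"expected access time: {t_ave:.2f} ns")
```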
Simulation results
The simulation code consists of a series of functions in the C programming language. The main program contains a FOR loop which allows simulations in ten-nanosecond iterations per cycle. Within the FOR loop, memory references are issued by calling the related routines. Each memory reference could be a READ or WRITE access. As the simulation progresses, the total memory access time is recorded and, finally, expected access times are computed. A few assumptions have been made in order to ensure consistency in analyzing the data generated by the simulators. It is assumed that the workload is evenly distributed between processors in all models. This implies that each processor has the same chance to perform a READ/WRITE to shared/private data. If there is a READ miss, a block of data will be transferred to the cache. In case of a WRITE request, a write-through policy is assumed. For the sake of simplicity, it is assumed that block sizes are the same between all levels of caches.
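The accumulate-then-average structure described above can be sketched as follows. This is not the authors' C simulator: it is a deliberately reduced Python illustration that models READs only, ignores contention and block transfers, and assumes a hit costs t c and a miss costs t G. All names are hypothetical.

```python
import random


def simulate_mean_access_time(h_c, t_c, t_g, n_refs=100_000, seed=0):
    """Issue n_refs READs and return the mean access time.

    Simplifying assumptions: no WRITEs, no bus contention, a cache hit
    costs t_c and a cache miss costs t_g.
    """
    rng = random.Random(seed)      # seeded so that runs are repeatable
    total_time = 0.0
    for _ in range(n_refs):
        if rng.random() < h_c:     # cache hit with probability h_c
            total_time += t_c
        else:                      # cache miss: go to global memory
            total_time += t_g
    return total_time / n_refs


def analytic_mean_access_time(h_c, t_c, t_g):
    """Closed-form counterpart of the simulation under the same assumptions."""
    return h_c * t_c + (1.0 - h_c) * t_g


simulated = simulate_mean_access_time(0.9, 10.0, 100.0)
expected = analytic_mean_access_time(0.9, 10.0, 100.0)
print(f"simulated {simulated:.2f} ns, analytic {expected:.2f} ns")
```

Comparing the simulated mean with the closed form is a quick sanity check before contention terms are added to either side.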
In UMA architecture, since there is no local memory other than cache, the cache hit rate is a major concern in its performance evaluation. Figure 2 shows the results of a simulation as the cache hit rate increases from 10% to 90%. The simulation has been repeated for various numbers of processors in the system. The average access time shows an improvement of over 85% as h c increases from 10% to 90%. The effects of other simulation parameters will be discussed when the UMA, NUMA, LRG, and RCR simulation results are compared.
Non-Uniform Memory Access (NUMA) Architecture

Configuration parameters
In NUMA architecture, shared memory is distributed among all processors. Every processor can address its local memory or remote memories of other processors. Every P i is also attached to a private cache. A READ hit fetches data from the cache in t c time. The cache hit rate is denoted as h c . A cache miss with a probability of (1 − h c ) will cause an access to local memory. Fetching data from local memory may be delayed if there are other pending READs or WRITEs by other processors. In the case of a local memory miss, remote memories will be accessed. This NUMA simulator generates memory references as it defines their characteristics whether the access:
• is a READ or WRITE,
• is a cache hit or miss,
• is a local memory hit or miss, or
• refers to shared or private data.
Table 2 lists the input to the simulator.
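The four characteristics listed above can be captured in a small record type. The sketch below is an assumption-laden illustration rather than the simulator's actual data structure: it draws each characteristic independently from its own probability, and the names `MemoryReference`, `generate_reference`, and `p_shared` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryReference:
    is_read: bool      # READ (True) or WRITE (False)
    cache_hit: bool    # hit or miss in the private cache
    local_hit: bool    # hit or miss in local memory
    shared: bool       # shared (True) or private (False) data


def generate_reference(rng, p_read, h_c, h_l, p_shared):
    """Draw one reference; each field is sampled independently (assumption)."""
    return MemoryReference(
        is_read=rng.random() < p_read,
        cache_hit=rng.random() < h_c,
        local_hit=rng.random() < h_l,
        shared=rng.random() < p_shared,
    )


rng = random.Random(1)
print(generate_reference(rng, p_read=0.9, h_c=0.5, h_l=0.5, p_shared=0.05))
```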
The waiting time for pending writes depends on whether other processors are trying to invalidate a shared word in the private cache. The number of pending write accesses in the queue also affects how long a processor waits to access the cache. This can be approximated by scaling the probability of each individual processor requesting global access by the expected number of requests pending from other processors, which is (N − 1)/2. Thus, t NUMA wait pending write could be calculated as:
$$t^{\mathrm{NUMA}}_{\text{wait pending write}} = \frac{N-1}{2}\,\left(P_{\text{shared write}}\; t_c\right).$$
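The expression for t NUMA wait pending write given above can be evaluated directly. The following sketch does so; the function name and the sample inputs are hypothetical and chosen only to show the scaling with N.

```python
def numa_wait_pending_write(n_processors, p_shared_write, t_c):
    """Expected wait for pending writes: (N - 1) / 2 * (P_shared_write * t_c)."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return (n_processors - 1) / 2 * (p_shared_write * t_c)


# Hypothetical inputs: 5% shared writes and a 10 ns cache access time.
for n in (1, 8, 16, 32, 64):
    wait_ns = numa_wait_pending_write(n, 0.05, 10.0)
    print(f"N = {n:2d}: expected wait = {wait_ns:.3f} ns")
```

With a single processor the term vanishes, and it grows linearly in N for fixed P shared write and t c.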
The waiting time for local bus depends on the number of remote accesses by other processors and characteristics of those remote references. Thus, t NUMA wait local bus may be expressed as:
where h L is the local memory hit rate. The waiting time for the global bus is a function of the number of remote accesses in global traffic and of the local traffic in the specified remote memory module. The characteristics of pending accesses in local and global traffic are also a factor in the calculation of the waiting time for the global bus. Thus, t NUMA wait global bus could be calculated as:
A private WRITE access takes place in t c time plus the waiting time in the cache queue. A shared-write also requires updating the distributed memory module. The average WRITE access may be expressed as:
The overall memory access time may be obtained by combining expected READ and WRITE access times with respect to the probability of an access being READ or WRITE. The expected memory access time may be expressed as:
As seen in Eq. (11), the overall access time is a function of the average READ and WRITE times, weighted by the probability of an access being a READ or a WRITE.
NUMA simulation results
The cache hit rate h c and local memory hit rate h L are two major parameters in the performance evaluation of NUMA machines. Figure 3 shows results of simulation for h c = 0.50 and varying h L percentages. There is a direct relationship between average access time and h L . An increase in h L will decrease t ave directly as shown in Fig. 3 . This experiment is repeated with h c = 0.90 for varying percentages of h L in order to study the effects of higher h c . As shown in Fig. 4 with h c = 0.90, a 77% improvement in average memory access time is achieved over h c = 0.50. The effects of other parameters of simulation will be discussed as RCR, UMA, NUMA, and LRG simulation results are compared. 
Local-Remote-Global (LRG) Architecture
Configuration parameters
Local-Remote-Global Architecture is a combination of UMA and NUMA machines. Every cluster contains two processors and a shared local memory. Each processor on the cluster is attached to a private cache. LRG provides also a global memory accessible by all processors.
A READ hit with a probability of h c fetches data directly from the cache in t c time. In the case of a cache miss, the local memory is referenced. Let h L denote the local memory hit rate. The probability of accessing global memory is (1 − h c )(1 − h L ). As a result, the average access time is a function of h c and h L . Table 3 lists the input to the simulator. Performance of Scalable Shared-Memory Architectures 11 
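To make the three-level lookup concrete, the sketch below splits a READ into the probability of being served by the cache, by local memory, or by global memory, using the (1 − h c )(1 − h L ) expression from the text. The function name and dictionary keys are hypothetical, and h L is treated as the hit rate conditional on a cache miss.

```python
def lrg_service_probabilities(h_c, h_l):
    """Probability that a READ is served at each level of the hierarchy.

    Assumption: h_l is the local-memory hit rate given a cache miss.
    """
    for name, value in (("h_c", h_c), ("h_l", h_l)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return {
        "cache": h_c,
        "local": (1.0 - h_c) * h_l,
        "global": (1.0 - h_c) * (1.0 - h_l),
    }


probs = lrg_service_probabilities(0.8, 0.5)
print(probs)
```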
Local-Remote-Global analytical model
Local-Remote-Global (LRG) architecture consists of several clusters with every cluster containing two processors and a shared local memory as shown in Fig. 5. Each processor on the board is attached to a private cache. The LRG architecture also provides a global memory module which is uniformly accessible by all processors. A READ hit is answered in t c time. A READ miss requires checking the local memory for a copy of the data. The time it takes to fetch a word from local memory is t L plus the waiting time for the local bus. Let h c and h L denote the cache and local memory hit rates, respectively. The chance of having to access global memory is (1 − h c )(1 − h L ). The time it takes to fetch a word from global memory is t G plus the waiting time for the global bus. Let t LRG wait global bus denote the waiting time for the global bus. The average READ time may be expressed as:
The waiting time for the local bus is dependent on whether the other processor on the board is accessing the local memory and also on the characteristics of its access. A processor will access local memory if:
(1) the processor is performing a shared write, or (2) there is a cache read-miss.
t LRG wait local bus may be expressed as:
The waiting time for global bus is dependent on the number of pending global memory accesses by other processors and characteristics of those accesses in the queue. A processor will access the global memory if:
(1) the processor is performing a shared write, or (2) there is a read-miss on local memory.
t LRG wait global bus could be calculated as:
The average WRITE time is dependent on whether a WRITE is on a private or shared word. A private WRITE takes place in t c time. A shared WRITE requires also updating local and global memories. The average WRITE time may be expressed as:
The overall memory access time depends on the average READ and WRITE times, weighted by the probability of an access being a READ or a WRITE. Let t LRG ave be the expected memory access time; t LRG ave may be expressed as:
Equation (16) reflects the fact that the average memory access time is a direct function of the number of memory accesses and their characteristics at any given time.
LRG simulation results
For a complete analysis of the effects of h c and h L on expected memory access time, various percentages of h c and h L are studied. Figure 6 shows the average access time as h L increases from 10% to 90% for various cache hit rates. The cache hit rate has a more drastic effect on expected access time than the local memory hit rate. As shown in Fig. 6 , the local memory hit rate also has a significant effect on the expected memory access time. The effects of other parameters of simulation will be discussed as RCR, UMA, NUMA, and LRG simulation results are compared.
Replicated Concurrent-Read (RCR) Architecture
Configuration parameters
In the RCR, a READ miss will cause an access to replicated memory in search of a requested word. If the search of replicated memory is unsuccessful, then the auxiliary memory will be referenced. Let P i denote processor # i. The probability of a cache hit is h c and the probability of a replicated memory hit is h L . P i faces a cache miss with probability (1 − h c ) and, given a cache miss, a replicated memory miss with probability (1 − h L ). Therefore, the chance of having to access auxiliary memory is (1 − h c )(1 − h L ). P i will have to wait until this data is transferred to C i (P i 's private cache). During this time, P i will be inactive. P i has to compete with other processors to access the global bus. As a result, P i may have to wait for its turn to access the auxiliary memory.
A WRITE access is treated differently in RCR architecture. Every shared write access is broadcast to all replicated memories. The simulator determines all memory reference characteristics:
• whether the memory access is a READ or WRITE,
• whether the memory access refers to shared data,
• whether the memory access is a cache hit or miss, and
• whether it is a replicated memory hit or miss.
The simulator also generates the number of other memory references pending for bus access in order to compute the wait time for P i . Table 4 lists the input to the simulator.
Table 4. RCR system parameters (columns: Input parameter, Value range).
Analytical model
In this architecture, every processor is attached to a private cache, a local cache, and a replicated memory module as shown in Fig. 1. A READ hit fetches data from the cache in t c time, or one clock cycle. Let h c denote the probability of a cache hit. A READ miss, with probability (1 − h c ), will cause an access to replicated memory in search of the requested word. The time required to access such a word is simply the replicated memory cycle time t M . Let h R denote the probability of a replicated memory hit. In the case of a miss, the processor has to access the auxiliary memory in t aux time. The auxiliary memory space is partitioned into addresses in the working set and addresses outside the working set. Replicated memories are images of the working set. References to auxiliary memory addresses are divided into isolated memory accesses and those which are incrementally outside the working set by some distance δ, such that L − δ ≤ X ≤ H + δ, where X is the referenced address and L and H are the lowest and highest addresses currently stored in replicated memory, respectively. If an access is to an isolated memory address, such that X < L − δ or X > H + δ, then the requested word will be transferred directly from the auxiliary memory to the processor. In the case of a reference being Incrementally Outside the Fence (IOF), a quantity of D IOF words will be transferred to the processor. Let P IOF denote the probability of a memory access being incrementally outside the fence.
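A minimal sketch of the fence test follows. It assumes that an incrementally-outside-the-fence address is one that falls outside the replicated range [L, H] but within δ of it, and that any address farther away is isolated. The function name and the returned labels are hypothetical.

```python
def classify_address(x, low, high, delta):
    """Classify address x against the replicated range [low, high].

    Assumed reading of the fence rule:
      - inside [low, high]                     -> "replicated"
      - outside it, but within delta of it     -> "iof"
      - farther than delta from either bound   -> "isolated"
    """
    if low > high or delta < 0:
        raise ValueError("require low <= high and delta >= 0")
    if low <= x <= high:
        return "replicated"
    if low - delta <= x <= high + delta:
        return "iof"
    return "isolated"


# Hypothetical working set 1000..1999 with a fence distance of 64 words.
for addr in (1500, 990, 2050, 100, 5000):
    print(addr, classify_address(addr, 1000, 1999, 64))
```

Under this reading, an "iof" result would trigger the D IOF block transfer, while an "isolated" result would fetch the single requested word.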
The time it takes to fetch a word from the auxiliary memory also depends on the global bus traffic. Every processor is equally likely to reference the memory address space. The chance of a READ reaching auxiliary memory is (1 − h c )(1 − h R ). A processor may have to wait for other READs or WRITEs to complete their execution. The time penalty associated with the wait depends on the number of pending READs and WRITEs in the bus queue. Let N and P i denote the total number of processors and processor # i, respectively. At any given time, the number of pending memory accesses is between 0 and N inclusive. As a result, P i may have to wait for between 0 and (N − 1) memory references to be completed before its turn. Let t RCR wait global bus denote the wait time. The WRITE access time depends on whether it is a shared or private WRITE.
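The value (N − 1)/2 used in this paper for the expected number of pending requests is consistent with the range just described if one assumes that each queue length k from 0 to N − 1 is equally likely. That uniformity is an assumption made here for illustration and is not stated in the text. Under it, the expected number of references ahead of P i is:

$$E[k] \;=\; \sum_{k=0}^{N-1} k \cdot \frac{1}{N} \;=\; \frac{1}{N}\cdot\frac{(N-1)\,N}{2} \;=\; \frac{N-1}{2}.$$

For example, with N = 8 processors a request would wait behind 3.5 other references on average.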
A private WRITE takes place in t c time to the private cache of the writing PE. Shared WRITEs, on the other hand, are broadcast to all replicated memories via the global bus. A WRITE to memory takes place in t w time. Let t RCR read denote the average READ time, which may be expressed as the sum of products for each access type and time mentioned above:
Equation (17) could be simplified as:
The value of t RCR wait global bus depends on the number of pending accesses in the queue and the time penalty associated with it. This can be approximated by scaling the probability of each individual processor requesting global access by the expected number of requests pending from other processors which is (N − 1)/2. The time penalty with (N − 1)/2 accesses in the queue depends on the characteristics of each access:
(1) whether the memory access is a READ or WRITE, and (2) whether the requested word is incrementally outside the fence or in isolated memory space.
The probability of a READ memory reference being in the queue is (1 − h c )(1 − h R ). Let P shared write and P private write denote the probabilities of a shared-write and a private-write memory access, respectively, such that P shared write + P private write + P read = 1, i.e. P private write = 1 − P read − P shared write . The value of t RCR wait global bus may be expressed as:
Equation (19) may be simplified as:
Hence, the t RCR read may be expressed as:
The WRITE access time depends on whether it is a shared or private WRITE. A private WRITE takes place in t c time while a shared WRITE is broadcast to all replicated memories via the global bus. In the case of a shared WRITE, P i may have to wait for between 0 and (N − 1) memory references to be completed before its turn.
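The broadcast behaviour of shared WRITEs can be illustrated with a toy model. The sketch below is not the RCR hardware or the authors' simulator: it represents each replicated memory as a dictionary, counts one bus transaction per shared WRITE, and ignores timing and queueing entirely. The class and method names are hypothetical.

```python
class ReplicatedMemorySystem:
    """Toy model: one replicated memory per processor, kept identical."""

    def __init__(self, n_processors):
        if n_processors < 1:
            raise ValueError("n_processors must be at least 1")
        self.replicas = [dict() for _ in range(n_processors)]
        self.bus_transactions = 0

    def shared_write(self, address, value):
        """Broadcast a shared WRITE: one bus transaction updates every replica."""
        self.bus_transactions += 1
        for replica in self.replicas:
            replica[address] = value

    def shared_read(self, processor, address):
        """Serve a shared READ from the processor's own replica (no bus use)."""
        return self.replicas[processor].get(address)


rcr = ReplicatedMemorySystem(n_processors=4)
rcr.shared_write(0x10, 7)
print([rcr.shared_read(p, 0x10) for p in range(4)], rcr.bus_transactions)
```

The point of the model is that READs never touch the bus counter, which mirrors the claim that replicated memory hits are independent of global traffic.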
Let t RCR ave be the average memory access time. The overall expected memory access time depends on the percentages of READ/WRITE operations with respect to total memory references. By adding average READ/WRITE access time with respect to probability of an access being READ/WRITE, the overall expected memory access time becomes:
RCR simulation results
Various cache hit rates with respect to different replicated memory hit rates have been studied. Figure 7 shows the average access time of RCR with a cache hit rate of 10% in conjunction with various replicated memory hit rates. This experiment has been conducted for 8-, 16-, 32-, and 64-processor systems. Average access time decreases as the replicated memory hit rate increases, as shown in Fig. 7. Let N denote the total number of processors. When N = 8, there is more than a 64% improvement in access time as the replicated memory hit rate increases from 10% to 80%. Systems with 16, 32, and 64 processors also demonstrate an improvement in average access time of over 68%. Figure 8 shows the expected access time for a cache hit rate (h c ) of 80% and various replicated memory hit rates. This figure shows that expected access time improves by over 72% as h L increases from 10% to 80% for N = 8. For N = 16, 32, and 64, expected memory access time improves by over 84%, 89%, and 89%, respectively. Comparing Fig. 8 with Fig. 7 shows the effect of h c on average memory access time (t ave ). As more memory accesses are satisfied by the cache and replicated memory, average access time improves and CPU utilization increases.
Conclusion
Comparison of distributed shared-memory architectures performance
For the purpose of comparing these machines, the effects of several varying parameters are examined. Figure 9 illustrates the effect of various cache hit rates. As expected, the NUMA machine shows great improvement in average access time as the cache hit rate increases, because the delays caused by the interconnection network are reduced. All of the other machines also show improvement as h c is increased from 10% to 90%. When h c is above 75%, RCR delivers a reduced memory access time. When h c is below 75%, RCR's memory access time is comparable to LRG's average access time.
Figure 10 shows the effect of varying local memory hit rates on expected memory access time. Since the UMA machine does not have local memory, its average memory access time does not vary. RCR demonstrates a direct effect as a result of increasing the replicated memory hit rate. The NUMA machine demonstrates better performance as the hit rate increases. The LRG machine is less affected by the local memory hit rate than the RCR and NUMA machines. Figure 11 illustrates the effect of varying local memory hit rates when the cache hit rates increase. The effect of varying local memory hit rates accompanied by higher cache hit rates is more pronounced with the RCR and NUMA machines. The other machines did show improvement, but not as significantly as the RCR machine. Figure 12 also illustrates the effects of varying local memory hit rates in conjunction with a 90% cache hit rate. The RCR, LRG, and NUMA machines demonstrate an improvement as local memory hit rates increase.
The effect of varying shared-write percentages on these machines has also been analyzed. Let P shared write denote the probability of a shared-write memory access such that P shared write + P private write + P read = 1. The results shown in Fig. 13 illustrate a slight increase in the average memory access time as the probability of shared-writes increases from 0.0 to 0.5.
Figure 14 shows the effect of varying block sizes on expected memory access time for the RCR, UMA, NUMA, and LRG machines. The NUMA machine demonstrates a drastic increase in memory access time as the block size increases from 8 to 64 words. The main reason for this increase is the transfer of blocks of data from remote memories. The UMA machine also experiences a significant increase in memory access time due to the transfer of blocks of data from global memory. The RCR and LRG machines are less affected as block size increases. Finally, the effect of varying the probability of READs is examined. Let P read denote the probability of a READ. The results shown in Fig. 15 indicate that the NUMA machine's performance decreases as P read increases, because READs from remote memories are costly. The UMA machine demonstrates a similar effect but at a lesser rate. The RCR machine shows the most desirable result since it performs READs locally. The RCR configuration is less sensitive to the READ/WRITE percentage.
Fig. 15. Expected access time for various P read percentages.
Summary
Memory latency is the most significant issue in the design of a shared-memory multiprocessor. In a uniprocessor, memory latency can be improved by increasing the cache hit rate. Unlike in uniprocessors, increasing the size of caches is not a dominant factor in multiprocessor hit rate. In a multiprocessor system, maintaining coherence between caches is a significant factor in memory latency. These coherence misses are independent of the cache size. Increasing cache sizes will not decrease the expected memory access time, since data sharing causes invalidations and extra misses because of coherence. Replicated Concurrent-Read architecture offers favorable performance over UMA and NUMA architectures for all ranges of application and system parameters. RCR outperforms LRG architectures when the hit rate of the processor cache exceeds 80% and that of the replicated memory exceeds 25%.
