Abstract-Shared memory is a common inter-processor communication paradigm for on-chip multiprocessor SoC (MPSoC) platforms. The latency overhead of switch-based interconnection networks plays a critical role in shared memory MPSoC designs. [...] architecture with distributed shared cache and distributed shared memory. It is able to reduce the number of home node cache accesses, which results in a reduction in the inter-cache transfer time and the total execution time. Simulation results verify that the proposed methodology can improve performance substantially over a design in which directory caches are not embedded in the switches.

I. INTRODUCTION

Rapid advances in silicon and parallel processing technologies have made it possible to build multiprocessor systems-on-chip (MPSoCs). In particular, packet-switched MPSoCs [1], which are called networks-on-chip (NoC) [2], are becoming increasingly attractive platforms due to their better scalability, higher data throughput, flexible IP reuse, and solutions to the clock skew problems associated with bus-based on-chip interconnection schemes.

Distributed shared memory (DSM) [3] or distributed shared cache (DSC) [4] is an architectural approach which allows multiprocessors to support a single shared address space that is implemented with physically distributed memory. A DSM or DSC multiprocessor platform is also called non-uniform memory access (NUMA) [5] or non-uniform cache architecture (NUCA) [6], since the access time depends on the physical location of a data word in memory or cache. Coherence protocols allow such architectures to use caching in order to take advantage of temporal and spatial locality without changing the programmer's model of memory or cache. [...] to propagate coherence operations only to those processors that must participate in the operations. This solution requires us to keep track of which processors have copies of various [...]. This state information is stored in a place called the directory, and the cache coherence scheme based on such information is called directory cache coherence.

In a distributed shared memory MPSoC that connects all the processors through switches, the directory cache coherence scheme can be applied. In the conventional directory cache protocol, each directory resides in a distributed shared memory bank or distributed L2 cache bank, and it contains entries for each memory or cache block. An entry points to the exact locations of every cached copy of a memory block and maintains its status for future reference. The classical full-map directory scheme proposed by Censier [7] uses an invalidation approach and allows for the existence of multiple unmodified cached copies of the same block in the system. However, in such a system, each directory with distributed shared memory or cache is distributed among all nodes in the system to provide a closer local memory or local cache and several remote memories. While local memory access latencies can be tolerated, the remote memory accesses generated during the execution can reduce the performance of applications.

In this paper, we present a method to mitigate the impact of remote memory access latency. We propose a switch architecture for low-latency cache coherency of a distributed shared memory MPSoC platform which we denote as DCOS, Directory Cache On a Switch. The proposed architecture was applied to our proposed MPSoC platform that features packet [...]

[...] off-chip shared memory. The proposed DCOS architecture was implemented within each switch to reduce the cache-to-cache data transfer time, and the switches were connected through a 4 x 2 2D mesh topology. Wormhole routing was adopted as the packet switching methodology. For our simulations, we used the MIPS R10000 core model, which is supported by RSIM, as shown in Figure 2. The main features of the MIPS R10000 [...]

[...] coherence mechanisms. We exploit a full-map directory cache coherence protocol as our DCOS architecture. In this scheme the directory resides in main memory and contains entries for each memory block. An entry points to the exact locations of every cached copy of a memory block and maintains its status. With this information, the directory preserves the coherence of data in each distributed shared- [...]

[...] fixing the cache size of the shared L2 directory at 32 entries.
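The full-map directory scheme described above, in which each entry records every cached copy of a block together with its status and writes invalidate other copies, can be illustrated with a minimal sketch. This is not the authors' RSIM implementation: the class and method names, the presence-bit list, the single dirty flag, and the choice of eight nodes (assuming one node per switch of the 4 x 2 mesh) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """Hypothetical full-map entry: one presence bit per node plus a dirty flag."""
    presence: list
    dirty: bool = False


class FullMapDirectory:
    """Illustrative full-map directory with an invalidation-based write policy."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.entries = {}  # block address -> DirectoryEntry

    def _entry(self, block):
        # Create an entry with no sharers the first time a block is referenced.
        if block not in self.entries:
            self.entries[block] = DirectoryEntry([False] * self.num_nodes)
        return self.entries[block]

    def sharers(self, block):
        """Return the nodes currently recorded as holding a copy of the block."""
        entry = self._entry(block)
        return [n for n, present in enumerate(entry.presence) if present]

    def read(self, block, node):
        """Record a read by `node`.

        Returns the node holding a modified copy that must supply the data,
        or None if no other node holds a modified copy.
        """
        entry = self._entry(block)
        owner = None
        if entry.dirty:
            holder = entry.presence.index(True)
            if holder != node:
                owner = holder
                entry.dirty = False  # the block becomes shared and unmodified
        entry.presence[node] = True
        return owner

    def write(self, block, node):
        """Record a write by `node` and return the nodes that must be invalidated."""
        entry = self._entry(block)
        invalidated = [n for n, present in enumerate(entry.presence)
                       if present and n != node]
        entry.presence = [False] * self.num_nodes
        entry.presence[node] = True
        entry.dirty = True
        return invalidated
```

In this sketch, several nodes may hold unmodified copies at once, and a write leaves exactly one modified copy, which mirrors the invalidation approach attributed to the classical full-map scheme [7].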
IV. SIMULATION ENVIRONMENT
We have used the RSIM simulator for distributed shared memory multiprocessor systems. Some core parts of the simulator, written in C++, were modified for a shared L2 cache environment, and the proposed directory cache module was added to the default switch block. The application programs used in our evaluations are FFT, Radix, Ocean, and Barnes from the SPLASH-2 benchmark suite. The input data sizes are shown in Table II.

As shown in Figure 5(a), the total execution time for each benchmark application was reduced proportionally as the size of the on-switch directory cache was increased. When comparing the execution time of the non-DCOS to the DCOS for 2048 entries, the overheads for FFT, Radix, Ocean and Barnes were [...]. In addition, as shown in Figure 5(b), the total cache-to-cache transfer time overhead from home node to local node was analyzed to determine the impact of the DCOS scheme. Cache-to [...]
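The dependence on the number of on-switch directory cache entries can be explored with a small model of a fixed-capacity directory cache. The sketch below is only an illustration under stated assumptions: the text does not specify the replacement policy, so least-recently-used (LRU) replacement is assumed here, and the names `SwitchDirectoryCache` and `hit_count` as well as the synthetic access trace are hypothetical. It does not reproduce the SPLASH-2 workloads or the reported results.

```python
from collections import OrderedDict


class SwitchDirectoryCache:
    """Hypothetical on-switch directory cache with an assumed LRU replacement policy."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # block address -> set of sharer nodes
        self.hits = 0
        self.misses = 0

    def lookup(self, block):
        """Return the cached sharer set, or None when the home node must be consulted."""
        if block in self.entries:
            self.entries.move_to_end(block)  # mark as most recently used
            self.hits += 1
            return self.entries[block]
        self.misses += 1
        return None

    def install(self, block, sharers):
        """Insert or refresh an entry, evicting the least recently used one if full."""
        if block in self.entries:
            self.entries.move_to_end(block)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[block] = sharers


def hit_count(capacity, trace):
    """Count directory cache hits for a sequence of block addresses."""
    cache = SwitchDirectoryCache(capacity)
    for block in trace:
        if cache.lookup(block) is None:
            cache.install(block, frozenset())
    return cache.hits


# Synthetic trace: four passes over 64 distinct blocks (an assumption, not SPLASH-2 data).
trace = [i % 64 for i in range(256)]
small = hit_count(32, trace)
large = hit_count(2048, trace)
```

With this synthetic trace, a cache whose capacity is smaller than the working set never hits under LRU, while a cache large enough to hold the working set hits on every access after the first pass. Real benchmark behavior depends on the actual sharing pattern.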
