The Chip Multiprocessor (CMP) 
Introduction
With the emergence of the CMP design paradigm, multiple processors and large caches have been integrated on a single chip. The CMP makes it possible to execute multiple processes/threads simultaneously and to exploit process/thread-level parallelism. The CMP architecture has been studied in the context of a wide range of applications, such as commercial transactional servers, network devices, personal computers and embedded systems. Parallel multiprocessor computing and high-capacity caches can substantially improve system performance, so the design of the cache architecture plays an important role in the overall performance of a CMP.
Several commercial offerings [1, 2, 3, 4] and research projects have addressed CMP L2 cache design. For instance, Liu et al. [5] proposed an L2 cache organization in which different numbers of cache banks can be dynamically allocated to processor nodes connected through shared bus interconnects. In their scheme, a hardware-based mapping table must be maintained by the OS so that the necessary number of cache banks can be allocated to each processor. Liu et al. evaluated multi-thread applications running separately and also examined the scenario of two multi-thread benchmarks. Kim et al. [6] proposed fair cache sharing metrics and dynamic partitioning algorithms to increase throughput, using a set of co-scheduled pairs of benchmarks. Marino [7] evaluated a 32-processor CMP with different numbers of processors sharing each L2 slice, using multi-thread benchmarks. Related work on adjusting cache size to suit application characteristics is reported in [8] and [9], where the cache space is partitioned between multiple processes/threads executing on one processor in order to improve IPC (instructions per cycle). These papers mixed several single-thread benchmarks for testing but did not consider multi-thread applications. NUCA approaches are used in [10] to optimise the degree of sharing and improve performance; only multi-thread benchmarks are used in the simulation. CMP-NuRapid [11] is based on private L2 caches with a bus-based protocol. It makes copies close to requestors to allow fast access for read-only sharing, and does not make copies for read-write sharing in order to avoid coherence misses. In that work, multi-programmed and multi-threaded workloads are simulated separately.
A CMP should be able to handle hybrid workloads, in which different combinations of single-thread and multi-thread tasks execute simultaneously. To the best of our knowledge, no prior work has evaluated the performance of CMP cache architectures under hybrid workloads.
In the real world, hybrid workloads arise in many applications, such as the control systems of autonomous vehicles and missiles. One particular research application that we are studying is the development of the host computer platform for an Autonomous Underwater Vehicle (AUV). Some of the computation tasks are simple single-thread real-time tasks, while others are designed as multi-thread programs that require a large proportion of CPU and memory resources. Some vital tasks, for instance navigation and guidance control, must be executed simultaneously and continuously, whereas other tasks, such as sonar image processing, image classification and acoustic communication, may be loaded when needed. Given the complexity of the AUV system, we have designed our CMP architecture to include a hybrid workload-aware L2 cache architecture that meets the needs of mixed single-thread and multi-thread workloads. In our scheme, each processor has a Split Private and Shared L2 (SPS2) cache. This scheme makes efficient use of on-chip L2 cache resources across different tasks and has a low miss ratio and low access latency.
The rest of the paper is organized as follows. In Section 2 the proposed cache hardware architecture is explained and the corresponding cache coherence protocol is described in Section 3. The simulation results are presented in Section 4. Section 5 presents a summary of this paper with conclusions and future work.
2. Hybrid workload-aware cache architecture
The simplest and the most cost-efficient architecture is the bus-based, shared memory multiprocessor platform. The advantages of such a system are its simple and well-understood programming model with low communication latency. An additional benefit of this multiprocessor organization is that multi-threading and any uni-processor system software, in general, can be easily applied to bus-based shared memory multiprocessor architectures. This is due to the fact that the physical memory is shared amongst all the processors. As the common bus is inherently a broadcast medium, snoop-based cache coherence protocols are used in general.
In this paper, we limit our consideration to a 2-level on-chip cache hierarchy, although the proposed SPS2 architecture can be readily extended to include more than two levels. A traditional bus-based shared-memory multiprocessor has either private L1s and private L2s, or private L1s and a shared L2. We refer to these two structures as L2P and L2S, respectively. Both schemes have their advantages and disadvantages. The L2P architecture has fast L2 hit latency but can suffer from large numbers of replicated copies of shared data, which reduce effective on-chip capacity and increase the off-chip access rate. Conversely, the L2S architecture reduces the off-chip access rate for large shared working sets by avoiding wasting cache space on replicated copies. Banking the shared L2 cache is a well-known method to reduce access latency. However, the average L2 access latency is still influenced by relative placement on the die and by network congestion.
We propose a new scheme, SPS2, to organize the placement of data and cater for combinations of single-thread and multi-thread workloads. All data items are categorised into one of two classes depending on whether the data is shared or exclusive. Correspondingly, the L2 cache of each processor is divided into two parts, private and shared. Each processor has its own private L1 (PL1), private L2 (PL2) and shared L2 (SL2). In this paper, shared data refers to data that has been accessed by more than one processor, and private or exclusive data refers to data that has been accessed by only one processor. The proposed scheme places exclusive data in the PL2 and shared data in the SL2 cache. For hybrid workloads, this arrangement provides fast cache accesses for private data from the PL2. It also allows large amounts of data to be shared between several processors without replication and thus makes better use of the available SL2 cache capacity. The proposed SPS2 cache scheme is shown in Figure 1. SL2 is a multi-banked, multi-port cache that can be accessed by all the processors directly over the bus. To minimise the latency of PL2 and SL2, a design layout similar to that used in [12] could be employed for SPS2: SL2 would reside in the centre of the chip, with processors placed around the SL2 and PL2s located around the outer boundary, close to their processors. As shown in Figure 1, two local buses connect L1 and PL2 (dashed line) and L1 and SL2 (solid line). Bus transactions for cache coherence run separately on another bus between the L2s and memory (bold line). In this paper, we define a node as an entity comprising a single processor and three caches, i.e., PL1, PL2, and SL2. In a physical realization, SL2 could have the same number of banks as the number of processors. The role of the SL2 is to store shared data, while the role of PL2 is to store private data.
Data in PL1 and PL2 are exclusive, but PL1 and SL2 could be inclusive. If a data block exists in PL1 then it cannot also exist in PL2, but the existence of a data block in SL2 does not imply that a copy must exist in any PL1. All new data is initially fetched from memory to PL1 as private data. If private data is evicted from PL1, it is placed in PL2; if shared data is evicted from PL1, it is placed in SL2. When PL2 is full, private data can take over capacity from SL2.
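The placement rule for blocks leaving PL1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controller logic of the paper: the names `Block`, `sharers` and `eviction_target` are hypothetical, a block is treated as shared once more than one processor id has been recorded for it, and the stealing of SL2 capacity by a full PL2 is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    addr: int
    # Hypothetical bookkeeping: ids of the processors that have accessed the block.
    sharers: set = field(default_factory=set)

    @property
    def is_shared(self) -> bool:
        # "Shared" in the paper's sense: accessed by more than one processor.
        return len(self.sharers) > 1


def eviction_target(block: Block) -> str:
    """Return the L2 structure that receives a block evicted from PL1."""
    return "SL2" if block.is_shared else "PL2"


private_block = Block(addr=0x100, sharers={0})
shared_block = Block(addr=0x200, sharers={0, 2})
print(eviction_target(private_block))  # PL2
print(eviction_target(shared_block))   # SL2
```

In the protocol itself this distinction is carried by the coherence state of the block rather than by an explicit set of processor ids; the set is used here only to keep the sketch self-contained.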
Unlike a unified L2 cache structure, the SPS2 system with its split private and shared L2 caches can be designed flexibly and individually according to demand. First, PL2 could be designed as a direct-mapped cache to provide fast access and low power, while SL2 could be designed as a set-associative cache to reduce conflicts. Second, PL2 and SL2 do not have to match each other in size, and they could have different replacement policies. The advantages of separating shared and exclusive data can be seen from an example with two tasks running on the system. One is a multi-thread application with a large quantity of shared data that is used by several threads on different processors, while the other is a single-thread application in which most of the data is exclusive. In the L2P architecture, several copies of the same shared data set will exist in the L2 caches of the different processors. This architecture therefore wastes cache space and incurs more on-chip misses for large data sets and/or multiple concurrent threads. In the L2S architecture, any processor can access all of the shared data. However, since many of the requested data blocks will not be available in the local bank, access latencies are high while blocks are fetched from other banks. At the same time, there is a high possibility of data conflicts between different threads. In the SPS2 architecture, each processor has two separate L2 caches (PL2 and SL2), which can be accessed individually and simultaneously to reduce access latency. In addition, SPS2 reduces contention between shared data and private data. It provides a low L2 hit latency because most of the private data should be found in the local PL2. Shared data is placed in the SL2 banks, which collectively provide high storage capacity and help reduce off-chip accesses.
The SPS2 cache system does not need any additional CPU instructions to support its protocol, and hence no changes to the instruction set or CPU hardware interface are required for the cache system to work with a conventional multiprocessor design. The only parts that need to be modified are the cache architecture and the cache controller, which implements the cache coherence protocol.
Description of coherence protocol
The protocol employed in SPS2 is based on the MOSI (Modified, Owned, Shared, Invalid) protocol and is expanded to incorporate six states (M1, M2, O, S1, S2, I). The subscript 1 or 2 indicates whether the block has been accessed by one processor or by two or more processors. Data contained in PL1 and PL2 may have all six possible states (M1, M2, O, S1, S2, I), while data contained in SL2 has only four states (M2, O, S2, I). The SPS2 protocol uses the write-invalidate policy on write-back caches. To keep consistency and coherency between the three different caches, the cache coherence protocol should also be modified accordingly.
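As a compact restatement of the state sets just listed, the following Python sketch encodes the six states and the subset permitted in each cache. The identifiers `State`, `ALLOWED` and `is_legal` are names chosen here for illustration and are not taken from the paper or from GEMS.

```python
from enum import Enum


class State(Enum):
    M1 = "M1"  # modified, accessed by one processor only
    M2 = "M2"  # modified, accessed by two or more processors
    O = "O"    # owned
    S1 = "S1"  # shared state, accessed by one processor only
    S2 = "S2"  # shared state, accessed by two or more processors
    I = "I"    # invalid


ALL_STATES = frozenset(State)

# States a block may take in each cache, as listed in the text.
ALLOWED = {
    "PL1": ALL_STATES,
    "PL2": ALL_STATES,
    "SL2": frozenset({State.M2, State.O, State.S2, State.I}),
}


def is_legal(cache: str, state: State) -> bool:
    """Check whether a block in the given cache may be in the given state."""
    return state in ALLOWED[cache]


print(is_legal("SL2", State.M1))  # False
print(is_legal("PL2", State.M1))  # True
```

A check of this kind could serve as an assertion inside a simulator to catch a block with a single-accessor state (M1 or S1) ending up in SL2.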
Coherence Protocol Procedure
The protocol behaves as follows. Initially, every entry in the three caches (PL1, PL2 and SL2) is Invalid (I). When node i makes a read access for an instruction or data block at a given address, PL1_i will be searched first. Since PL1_i is empty, PL2_i and SL2 will be searched next. Neither PL2_i nor SL2 will have the requested data, so a GetS message will be sent on the bus. Since all the caches in all the processors are initially invalid, the memory will put the data on the bus, and PL1_i will store the data and change its state from I to S1. If this block is replaced, it will be put in SL2. If another node j requires the same data shortly after, the data will be copied from SL2 to PL1_j without needing to fetch it from memory, and the state in node i will be changed from S1 to S2. If a read request finds the data in the local PL1_i, then no bus transaction is needed and the data will be supplied to the processor directly.
When node i needs to make a write access and a write miss is detected because the data block is not present in any of PL1_i, PL2_i, or SL2, the cache controller starts a block replacement cycle, possibly evicting a dirty block from the cache. A GetX message will be sent on the bus to fetch data from the other nodes or memory and place the requested data in the recently vacated slot. All the other nodes will check their own PL1 and PL2 caches for the requested data. If none of the other nodes have valid data, then memory will send the data to PL1_i and its state will be changed to M1. However, if any node, for example j, finds valid data (M1, M2 or O) with the same address as the requested data, the contents will be sent to PL1_i and all the caches (including j and excluding i) should invalidate data with the same address. Once the data is placed in PL1_i and updated, its state will be changed to M2. Since the SPS2 scheme employs a write-back policy, modified data will not be written back to memory until it is replaced. A data block with state M1 means that this node exclusively retains the most recent version of the data and that the data in memory is obsolete. State M2 means that the data has previously been accessed by other nodes for reading or writing, so data with state M2 in PL1 will be transferred to SL2 when it is evicted. If, conversely, the write operation finds the data block in PL1_i or PL2_i with state M1 or M2 (a write hit), then the write proceeds without any bus transaction.
Suppose that, after node i executes a write command, another node j needs to read data from the same address. Node j will check PL1_j first, then PL2_j and SL2. Since any copy that node j may have previously held would have been invalidated by the write operation from node i, it will be unable to find the data locally. Therefore a GetS message will be placed on the bus requesting the other nodes to send back the data. Node i will check its own PL1_i and PL2_i, and find the requested data block with state M1 or M2 in PL1_i. The modified data will then be placed on the bus and stored in PL1_j. The cache state in PL1_i will be changed from M1 or M2 to O and that in PL1_j will be set to S2.
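The read-miss and write-miss sequences described above can be summarised as two small functions, one for the state installed at the requester and one for the reaction of a remote cache that snoops the bus. This is a sketch under assumptions rather than the SLICC specification used in the paper: the function names are hypothetical, only the transitions stated in the text are encoded, and any state whose reaction to GetS is not described (O and S2) is assumed to stay unchanged.

```python
def requester_state(msg: str, remote_copy_found: bool) -> str:
    """State installed in the requester's PL1 after a miss."""
    if msg == "GetS":   # read miss
        return "S2" if remote_copy_found else "S1"
    if msg == "GetX":   # write miss
        return "M2" if remote_copy_found else "M1"
    raise ValueError(f"unknown bus message: {msg}")


def snoop(state: str, msg: str) -> str:
    """New state of a block in a remote cache that observes msg on the bus."""
    if msg not in ("GetS", "GetX"):
        raise ValueError(f"unknown bus message: {msg}")
    if state == "I":
        return "I"
    if msg == "GetX":
        return "I"  # write-invalidate: every other copy is dropped
    # GetS: transitions stated in the text; other states assumed unchanged.
    return {"S1": "S2", "M1": "O", "M2": "O"}.get(state, state)


# Node i writes a block nobody else holds, then node j reads it.
state_i = requester_state("GetX", remote_copy_found=False)  # M1
state_j = requester_state("GetS", remote_copy_found=True)   # S2
state_i = snoop(state_i, "GetS")                             # M1 -> O
print(state_i, state_j)  # O S2
```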
If no free slot is available in any of the caches, then an existing data block will need to be swapped out and replaced with the new block. The old data block in PL1 will be evicted to PL2 if its state is M1 or S1. Data with state S2, M2 or O in PL1 will be relocated to SL2. Data with state M1 evicted from PL2 and data with state M2 or O in SL2 will be returned to memory. If the state of the data in PL1, PL2 and SL2 is S1 or S2, indicating it is shared data, then the data will simply be invalidated.
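The replacement rules can be collected into a lookup table keyed by cache and state. The sketch below is one possible reading of the paragraph, not a definitive implementation: for PL1 it follows the explicit relocation rules (M1 and S1 to PL2; S2, M2 and O to SL2) and applies the final "simply invalidated" rule only to PL2 and SL2. The names `EVICTION_RULES` and `on_evict` are hypothetical, and combinations the text does not describe are reported as such rather than guessed.

```python
# (cache, state) -> action taken when the block is chosen for replacement.
EVICTION_RULES = {
    ("PL1", "M1"): "move to PL2",
    ("PL1", "S1"): "move to PL2",
    ("PL1", "S2"): "move to SL2",
    ("PL1", "M2"): "move to SL2",
    ("PL1", "O"): "move to SL2",
    ("PL2", "M1"): "write back to memory",
    ("PL2", "S1"): "invalidate",
    ("PL2", "S2"): "invalidate",
    ("SL2", "M2"): "write back to memory",
    ("SL2", "O"): "write back to memory",
    ("SL2", "S2"): "invalidate",
}


def on_evict(cache: str, state: str) -> str:
    """Look up the replacement action for a block in the given cache and state."""
    return EVICTION_RULES.get((cache, state), "not specified in the text")


print(on_evict("PL1", "M2"))  # move to SL2
print(on_evict("PL2", "M1"))  # write back to memory
print(on_evict("PL2", "O"))   # not specified in the text
```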
State graph of SPS2 cache protocol
To maintain data consistency between caches and memory, each node is equipped with a finite-state controller that reacts to read and write requests. Abstracting from the low-level implementation details of read, write, and synchronization primitives, one may consider the cache coherence protocol as a family of identical finite-state machines. This section illustrates how the SPS2 protocol works using the state machine description shown in Figure 2. Our coherence protocol requires three different sets of commands. All the transition arcs in Figure 2(a) correspond to access commands issued by the local processor. These commands are labelled read and write. The arcs in Figure 2(b) represent transfer-related commands, i.e., the replacement commands rep2 and repS. All the arcs in Figure 2(c) correspond to commands issued by other processors via the snooping bus. They include GetS and GetX. All these commands are defined below:
read: issued when a processor needs to read an instruction or data;
write: issued when a processor needs to write data;
rep2: issued when PL2 needs room for new data, and old data exists with state S1, S2, M1, M2 or O;
repS: issued when SL2 needs room for new data, and old data exists with state S2, M2 or O;
GetS: issued when requesting to share data, following a read miss;
GetX: issued when requesting for exclusive data, following a write miss.
As shown in Figure 2, the cache state of any node will change to the next state according to its current state and the received command.
Simulation analysis
To evaluate the performance, we employ GEMS SLICC (Specification Language Including Cache Coherence) [12] to describe three different cache coherence protocols (L2S, L2P and SPS2). GEMS is based on Simics [14], a full-system functional simulator, as a foundation on which various timing simulation modules can be dynamically loaded. The GEMS Ruby module provides a detailed memory system simulator. The three protocols are modified versions of the MOSI SMP broadcast protocol from GEMS. The simulated processor is the UltraSPARC-IV, which has a 64-byte-wide, 64 Kbyte L1 cache. We use CACTI [15] to estimate cache latencies, assuming a 65 nm technology. In L2S, all processors share one 4 MB, 8-way, 4-port SRAM with a latency of 18 cycles. In L2P, each processor has a private 1 MB, 4-way, 1-port SRAM with a latency of 6 cycles. In SPS2, each processor has a private 0.5 MB, 4-way, 1-port SRAM with a latency of 5 cycles, while the four processors share one 2 MB, 8-way, 4-port SRAM with a latency of 12 cycles. We assume a shared 4 GB memory with a latency of 200 cycles. The MOSI protocol is employed to manage the shared bus interconnect and maintain cache coherence across the processors. The Linux Aurora (Linux kernel 2.6.15) operating system is used to set CPU affinity, and gcc 3.4.2 is installed to produce binaries of the application code.
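The three L2 configurations can be written down as data to check that they are compared at equal total on-chip L2 capacity. The sketch below assumes a four-processor system, which is inferred from the statement that four processors share the SL2 and is not stated as a separate parameter; the class and variable names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    size_mb: float
    ways: int
    ports: int
    latency_cycles: int
    instances: int  # number of such caches on the chip


NUM_PROCESSORS = 4  # assumption inferred from the text

CONFIGS = {
    "L2S": [CacheConfig(4.0, 8, 4, 18, 1)],
    "L2P": [CacheConfig(1.0, 4, 1, 6, NUM_PROCESSORS)],
    "SPS2": [
        CacheConfig(0.5, 4, 1, 5, NUM_PROCESSORS),  # PL2, one per processor
        CacheConfig(2.0, 8, 4, 12, 1),              # SL2, shared
    ],
}


def total_l2_mb(scheme: str) -> float:
    """Total on-chip L2 capacity of a scheme in MB."""
    return sum(c.size_mb * c.instances for c in CONFIGS[scheme])


for scheme in CONFIGS:
    print(scheme, total_l2_mb(scheme))  # each scheme totals 4.0 MB
```

Under this assumption all three schemes provide 4 MB of L2 in total, so the differences reported in the results stem from organisation and latency rather than from raw capacity.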
We use a mixture of multi-thread scientific benchmarks from the SPLASH-2 suite [16] (ocean and barnes) and single-thread benchmarks from MiBench [17] (gsm, blowfish, adpcm and susan) to simulate typical computation tasks of an AUV. The PARMACS macros must be installed in order to run the multi-thread benchmarks. The main parameters of these benchmarks are listed in Table 1. The combinations of different applications are shown in Table 2. To minimise the start-up overhead caused by filling the cache, collection of statistics is delayed until after the initialisation period. We have implemented and evaluated three different L2 cache architectures, L2P, L2S, and SPS2, and compared them using three metrics: L1 cache misses, miss latency, and network traffic. The results are shown in Figures 3-5. The horizontal axis shows 8 different combinations of benchmarks from Table 2. Each combination has three columns, one for each L2 cache scheme: L2P, L2S, and SPS2.
(Table 2 note: multi-threaded benchmarks run on processors 0-3; single-thread benchmarks run on processor 0.)
L1 cache miss breakdowns are shown in Figure 3. In the L2P protocol, requests that miss in the local L1 cache are sent to the local L2 cache, where the request could hit, be forwarded to remote L1 and L2 caches, or be sent off-chip. Because of duplication, nearly 36-48% of L2P requests find valid data in the local L2, while 18-24% of requests get data from remote caches. In L2S, requests that miss in the local L1 cache are sent to the shared L2 cache, where the request could hit, be forwarded to remote L1 caches, or be sent off-chip. Shared L2 cache hits include local SL2 bank hits and remote SL2 bank hits. Figure 3 shows that the larger capacity of L2S leads to fewer off-chip memory accesses (2-5% less) than L2P, but more data has to be fetched from remote caches. Between one-third and one-half of SL2 accesses are satisfied by local banks. In our SPS2 scheme, the PL2s keep only private data and have half of L2P's capacity, so only 5-10% of requests hit in the private L2. However, SL2 holds all the data that has been accessed at least twice, so this data has a high chance (38-45%) of being used again. In addition, nearly one third of SL2 hits occur in the local bank. After searching PL2 and SL2, only a small proportion of the requests in SPS2 have to go to remote caches (5-7%), compared with L2P (9-14%) and L2S (15-27%). SPS2 also has slightly fewer off-chip accesses than L2P and L2S. Comparing the two CPU affinity configurations, configuration 1 has fewer L1 misses than configuration 2. This is because in configuration 1, single-thread applications exclusively use one processor and multi-thread applications use the other three processors, so resources are distributed more efficiently than when all kinds of applications are mixed.
Figure 3. Comparison of L1 cache misses breakdown for the three architectures
To calculate the L1 miss latency, we first define the access frequency to level-i memory as

$$f_i = (1 - h_1)(1 - h_2) \cdots (1 - h_{i-1})\, h_i,$$

so that $\sum_{i=1}^{n} f_i = 1$ and $f_1 = h_1$, where $h_i$ indicates the hit ratio of level-i memory. Using the access frequency $f_i$ for i = 1, 2, ..., n, we can formally define the access latency of a memory hierarchy as follows:

$$T = \sum_{i=1}^{n} f_i\, t_i, \qquad (1)$$

where $t_i$ denotes the access latency of level-i memory.

For the L2P architecture, L1 miss requests fall into three categories: local L2 cache hits, remote hits, and off-chip accesses. Similarly, L1 miss requests in L2S fall into four categories: local SL2 bank hits, remote SL2 bank hits, remote hits, and memory accesses. The miss latency of SPS2 involves five categories: local private L2 (PL2) hits, local SL2 bank hits, remote SL2 bank hits, remote hits, and off-chip memory accesses. Using the above formula and the data in Figure 3, we calculate the L1 miss latency shown in Figure 4. SPS2 generally achieves shorter latency than the L2P and L2S architectures. For the M2C1 benchmark, SPS2 has 9% less latency than the L2P scheme and 5% less than L2S. For M1C2, the L1 miss latency of SPS2 is 0.2% higher than that of L2P but 6.8% lower than that of L2S. Comparing CPU affinity configurations, configuration 2 has a smaller average latency than configuration 1.
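The latency calculation can be illustrated with a short Python sketch. It assumes the usual hierarchical definition in which the access frequency of level i is the probability of missing in all earlier levels and hitting in level i, which is consistent with the stated properties that the frequencies sum to 1 and that the first frequency equals the first hit ratio. The function names are hypothetical, and the hit ratios in the example are invented for illustration; they are not measurements from Figure 3.

```python
def access_frequencies(hit_ratios):
    """f_i = (1 - h_1) ... (1 - h_{i-1}) * h_i for each level i."""
    freqs = []
    miss_so_far = 1.0  # probability that all earlier levels missed
    for h in hit_ratios:
        freqs.append(miss_so_far * h)
        miss_so_far *= 1.0 - h
    return freqs


def average_latency(hit_ratios, latencies):
    """Weighted sum of per-level latencies using the access frequencies."""
    if len(hit_ratios) != len(latencies):
        raise ValueError("hit_ratios and latencies must have the same length")
    freqs = access_frequencies(hit_ratios)
    return sum(f * t for f, t in zip(freqs, latencies))


# Illustrative three-level example: PL2 (5 cycles), SL2 (12 cycles),
# memory (200 cycles). The last level always hits, so its hit ratio is 1.0.
hit_ratios = [0.5, 0.5, 1.0]   # hypothetical values
latencies = [5, 12, 200]       # cycle counts quoted in the simulation setup
print(access_frequencies(hit_ratios))          # [0.5, 0.25, 0.25]
print(average_latency(hit_ratios, latencies))  # 55.5
```

The same weighted sum applies when the categories are the measured miss breakdown fractions (local hit, remote bank hit, remote cache hit, off-chip access) rather than per-level hit ratios, provided the fractions sum to 1.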
Figure 5. Comparison of network traffic for the three architectures
For shared-memory multiprocessor systems, network traffic reflects the usage of the bus, which increasingly becomes a bottleneck as the number of processors increases. Figure 5 shows that L2S has the highest network traffic, while L2P generates less because most private data can be found locally, so fewer network transactions are needed than in L2S. Although the SPS2 protocol has extra operations to keep PL2 and SL2 consistent, the network traffic of SPS2 is 14-29% less than that of L2P and 33-51% less than that of L2S. From the differences between the L2S and SPS2 protocols described in Section 2, we see that SPS2 separates local requests and remote requests onto different buses, whereas L2S mixes all requests together, which leads to higher traffic. In addition, SPS2 issues fewer GET transactions (GetS, GetInstr, and GetX) than L2P and L2S. Because SPS2 issues fewer requests, less data is transferred on the bus than in L2P and L2S. In L2P, data replication limits the available space, so more data is written back to memory (PUT in Figure 5). A comparison of the network traffic metric reveals that all benchmarks with configuration 1 generate less bus traffic than those with configuration 2.
Conclusions and future work
To improve the performance of the CMP cache structure for hybrid workloads, we have proposed a new cache architecture, SPS2. The SPS2 architecture combines the low latency of L2P with the high capacity of L2S, because each node has private L1, private L2, and shared L2 caches. We use a state transition graph to describe the SPS2 cache coherence protocol.
Using hybrid benchmarks, we compared SPS2 with two traditional cache architectures. Simulation results show that SPS2 has fewer remote accesses and off-chip accesses than L2P and L2S. For the M2C1 benchmark, the SPS2 scheme reduces the L1 miss latency by 9% compared with the L2P scheme and by 5% compared with the L2S scheme. The network traffic of SPS2 is 14-29% less than that of L2P and 33-51% less than that of L2S. These results indicate that a hybrid workload-aware cache design allows better system performance under mixed single-thread and multi-thread loads. In addition, a comparison of different CPU affinity configurations suggests that the performance of SPS2 can be improved when single-thread tasks are exclusively assigned to separate CPUs.
This paper is only a first exploration of hybrid workload-aware cache design. Future work in this direction will include a wider variety of CPU affinity configurations, the time constraints of single-thread tasks, and configurable PL2 and SL2 cache capacities.
