ABSTRACT
INTRODUCTION
Increasingly large numbers of modern embedded applications impose high performance demands while at the same time exhibiting stringent power constraints. Such applications include, to name a few, multimedia support such as audio/image/video capture and processing, and data-intensive wireless devices such as sensor nodes for environmental, industrial, or security data acquisition and analysis. Since design cost and time-to-market are major requirements for product success, implementation platforms based on processor cores are typically utilized instead of custom hardware. In order to meet the high performance requirements, implementations containing multiple processor cores have started to emerge. Such Multi-Processor System-on-Chip (MP-SOC) platforms are quite natural, as task parallelism and specialization are inherent in these applications.
Shared-memory multiprocessor architectures are typically used for MP-SOC platforms as they exhibit low communication latency, a well-understood programming model, and a decentralized topology that easily supports heterogeneous processing units. Since all the processors share the bus for their memory accesses, the available bandwidth can be easily exhausted, which can quickly lead to significant performance degradation. To alleviate this problem, caches are used to replicate the data and bring it closer to the requesting processors, thus saving bus bandwidth and minimizing memory contention. However, caches must be kept coherent, since when a processor modifies cached data, other caches might be left with an older version of the same data. To resolve this issue, snoop-cache coherence protocols are used. These protocols use broadcast bus transactions and snoopy cache controllers in order to keep caches coherent. The general-purpose nature of this scheme results in significant power consumption, which prevents the utilization of these powerful platforms for energy-constrained embedded applications. It has been reported [1] that the power due to snoop-related cache lookups can amount to 40% of the total power consumed at the cache subsystem.
The snoop-cache coherence schemes are general-purpose in their nature, as no prior knowledge regarding the application structure, and its communication patterns in particular, is available; it is assumed that every memory reference can potentially access shared data. This is a very conservative approach, where each bus transaction triggers a cache lookup in all the processors in order to find out whether locally cached data needs to be invalidated, updated, or written back to the memory, thus leading to significant power consumption. Clearly, only a small fraction of all memory accesses refer to shared memory that needs remote cache invalidation or update. This power overhead can be drastically reduced if information regarding the application communication patterns and ranges of shared memory is captured and utilized dynamically by the hardware.
Recently, related work targeting snoop protocol optimization was proposed in [2]. The authors introduced a cache-like structure, which dynamically records which remote memory references are known not to be present in the local cache. The introduced table is updated each time the cache is probed by the snoop controller or new data is brought into the cache. A number of other approaches exist which reduce energy through cache resizing or circuit techniques [3], [4].
Most of these studies focus on general purpose processor architectures and in most of the cases the introduced approach relies on dynamic, run-time identification and utilization of program patterns. The methodology proposed in this paper targets embedded applications, where application knowledge is available in advance and can be captured, analyzed, and exploited in a deterministic manner.
The communication patterns for each task are captured and provided to the system software and hardware by the compiler or the software developer. Each time a task is created it informs the system software which set of virtual address ranges must be treated as shared memory for that task. The system software, in turn, identifies the corresponding physical address ranges and informs the special hardware support. The information regarding the exact shared memory is captured in an extremely efficient way through special unique identifiers assigned to the page translation entries by the OS memory manager. As capturing and utilizing the set of shared memory regions is implemented in a software-programmable way, the run-time switch between tasks using the proposed approach can be performed in a transparent way with practically no performance and power overheads. 
MOTIVATION
Various forms of multiprocessing configurations and interconnect topologies exist. However, the simplest and the most cost-efficient one is the bus-based, shared memory multiprocessor platform. The advantages of such a system are its simple and well-understood programming model and its low communication latency. An additional benefit of this multiprocessor organization is that multi-threading, and any uniprocessor system software in general, can be easily extended to bus-based shared memory multiprocessors. This is due to the fact that the physical memory is shared amongst all the processors, and thus all system code and data structures are placed in that memory. The processor cores simply provide multiple hardware contexts to the shared system software layer.
The common bus in these systems, however, can quickly become a bottleneck as each access to the shared memory has to be performed through the bus. A common practice to resolve this problem is to employ local caches at each processor node. In this way, the data is replicated and brought closer to the processors. Not only is the amount of traffic on the bus reduced, but also bank conflicts on the shared memory are eliminated. Caching, however, introduces the fundamental problem of incoherent data stored in the local caches. In bus-based shared memory systems, write-back caches are typically used, since the goal is to minimize the bus utilization as much as possible.
To resolve the cache coherence problem, coherence protocols have been introduced for general-purpose multiprocessor systems. As the common bus is inherently a broadcast medium, snoop-based cache coherence protocols are generally used. The fundamental principle of these protocols is that each memory reference placed on the bus is detected by all the snoop controllers in the system, and each one of them probes its local cache to check whether the data requested from the memory through the shared bus happens to be present in the local cache. If so, then depending on the type of request and the state of the cache line, different actions must be undertaken.
The general architecture of the bus-based shared memory multiprocessor with support for snoop cache coherence is shown in Figure 1. Each cache is associated with a snoop controller and each cache line has its own state. The snoop controllers monitor the system bus for read and write misses that are generated by the processors in the system. Such memory requests are generated by the processors when the needed data is not present in the local cache.
Fundamentally, the purpose of the snoop controllers is to react to any such memory requests on the bus. For instance, for a read-miss on the bus, the snoop controllers of all nodes must probe their local caches to check whether the requested data is present and modified but not yet written back to main memory. In such a case, the corresponding cache would be the only place in the system where the up-to-date data can be found. Alternatively, when a processor modifies data in its local cache, it needs to place a write-miss on the common bus (even when the data is already in the local cache) in order to notify all the other processors to invalidate this data if it happens to be present in some of the other caches as well. When the snoop controllers of the other processors detect a write-miss transaction on the bus, the local caches are probed in order to invalidate this data if it is cached locally. Several versions of snoop-based cache coherence protocols exist; nonetheless, all of them are based on the same principle.
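The reactions described above can be sketched for a simplified three-state (invalid, shared, modified) write-back protocol. This is an illustrative assumption only: the text does not commit to a particular protocol variant, and the names State, snoop, and the action strings are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


def snoop(cache, address, request):
    """React to a remote bus request and return the action taken.

    `cache` maps a line address to its State; absent lines are INVALID.
    """
    state = cache.get(address, State.INVALID)
    if state is State.INVALID:
        return "none"  # line not cached locally, nothing to do
    if request == "read_miss":
        if state is State.MODIFIED:
            # Only this cache holds the up-to-date data: write it back.
            cache[address] = State.SHARED
            return "write_back"
        return "none"
    if request == "write_miss":
        # Another processor is modifying the line: drop the local copy.
        cache[address] = State.INVALID
        if state is State.MODIFIED:
            return "write_back_and_invalidate"
        return "invalidate"
    raise ValueError("unknown request type: " + request)


# Hypothetical local cache contents for one node.
local_cache = {0x40: State.MODIFIED, 0x80: State.SHARED}
action = snoop(local_cache, 0x40, "read_miss")
```

Note that every call to snoop corresponds to a tag lookup in the local cache, regardless of whether the line turns out to be present.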
Probing the local cache to identify whether an address requested through the common bus is present in the local cache entails an almost full cache lookup. In this probing, the tag arrays of the cache structure need to be accessed. The tags stored in all the associativity ways need to be read and compared with the actual tag of the address present on the bus. Such snoop induced cache lookups for each memory request are the major contributing factor to the excessive power consumption of the cache coherence protocols.
It can be immediately observed from this brief description of snoop-based cache coherence protocols that these protocols are general-purpose in their nature. It is conservatively assumed that each memory request on the common bus is a request to possibly shared data; hence, the request needs to be handled appropriately. However, if application knowledge regarding the shared memory regions of the tasks running on each processor node is made available to their snoop controllers, a large amount of snoop-related cache probing can be eliminated. Consider, for example, a shared memory multiprocessor system running a set of tasks which happen to communicate through a single shared memory region from address 100 to address 200. If such knowledge is made available to the snoop controllers, they can filter out all memory requests outside of this region.
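The filtering decision in this example can be sketched as follows. This is a minimal illustration rather than the proposed hardware: the names SHARED_RANGES and needs_snoop are hypothetical, and the region bounds are treated as inclusive, which the text does not specify.

```python
# Hypothetical shared regions known to one snoop controller, stored as
# inclusive (start, end) address pairs. The single range mirrors the
# example in the text; inclusive bounds are an assumption.
SHARED_RANGES = [(100, 200)]


def needs_snoop(address, shared_ranges=SHARED_RANGES):
    """Return True if a bus request to `address` must probe the local cache."""
    return any(start <= address <= end for start, end in shared_ranges)


# Requests observed on the bus; only those inside a shared range are probed.
bus_requests = [50, 100, 150, 200, 4096]
probed = [a for a in bus_requests if needs_snoop(a)]
```

In this made-up request stream, two of the five requests fall outside the shared region and would not trigger a cache lookup.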
FUNCTIONAL OVERVIEW
In shared memory multiprocessors, application tasks and processor nodes communicate through shared memory. At application level, parallel processes or threads are created by the software developer in order to utilize the underlying multiprocessor platform. Shared memory regions are allocated and each task is given a means of accessing them. In the case of multi-threading, this happens implicitly, as the threads run in the same address space and are usually implemented as a procedure defined as a part of the program, which has access to all globally allocated data.
Information regarding shared data access for each parallel task is readily available when the application software is developed and compiled for the underlying system. For instance, it may be the case that only one of the global arrays is used by a particular task as an input buffer, and one for an output buffer, while all other memory references of that task are to private data. However, this information is lost when the application is transformed into a binary form and loaded into the system. The only events observable from the thread library and the operating system are the creation of threads and utilization of synchronization primitives. What is left at hardware level is simply memory references, which the memory system needs to handle assuming that they can refer to any possible memory location.
In the proposed methodology, we make the information regarding shared memory utilization available down to the hardware level, where the snoop controller can judiciously utilize it and filter out all the memory references which do not refer to a shared memory region of interest to the local processor node. The information is transferred from the application to the system software, which in turn utilizes it to identify the physical page frames that belong to the shared regions.
Compiler and Application Support
As part of task creation, the compiler or the software developer informs the thread library or the operating system which global arrays should be treated as shared memory regions for that task. This can be easily achieved in multiple ways, one of them being to include a pointer to the beginning of each global array and its size when calling the primitive for creating and starting a new thread. In this way, the application explicitly informs the underlying system software that only memory references to the specified global arrays must be treated as references to shared memory; all other memory references are private to that task, and hence no other processor in the system can generate a valid reference to them.
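One way such an interface might look is sketched below. Since Python has no raw pointers, each global array is described by a hypothetical virtual base address and a size in bytes; create_task, SharedArray, and is_shared_reference are illustrative names and not part of any real thread library.

```python
from dataclasses import dataclass, field


@dataclass
class SharedArray:
    base: int  # virtual address of the first byte (stands in for a pointer)
    size: int  # size of the array in bytes


@dataclass
class Task:
    name: str
    shared: list = field(default_factory=list)


def create_task(name, shared_arrays):
    """Hypothetical thread-creation primitive that also records which
    (base, size) arrays the new task treats as shared memory."""
    return Task(name, [SharedArray(base, size) for base, size in shared_arrays])


def is_shared_reference(task, vaddr):
    """True if `vaddr` falls inside one of the task's declared shared arrays."""
    return any(a.base <= vaddr < a.base + a.size for a in task.shared)


# A task that uses one array as input buffer and one as output buffer.
worker = create_task("worker", [(0x10000, 8192), (0x20000, 4096)])
```

Every reference of the task that is not covered by the declared arrays is treated as private.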
System Software Support
The memory manager module, which is usually a part of the system software, is the component responsible for allocating the data into the physical memory. The application executes in a virtual address space, which is mapped to the available physical memory. When a parallel application task is created, the information regarding its shared global arrays is made available to the operating system. At this stage, the memory manager identifies the set of physical memory frames which correspond to each shared array of the task. Additionally, a unique identifier is provided for each such shared memory region in the system, which is utilized by the hardware in order to determine whether a given memory reference bus transaction refers to a shared memory region.
A region corresponds to a consecutive set of virtual pages. At application level most often a shared region corresponds to a global array; our granularity of forming shared memory regions is at page level. Consequently, each region is assigned a unique ID and each physical page is associated with one such ID. The region ID is associated to each translation entry in the page table and is also stored in the hardware translation cache.
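The page-level assignment of region IDs can be sketched as follows. The page table layout (a dictionary from virtual page number to an entry with "frame" and "region" fields), the use of None for private pages, and the function names are assumptions made for illustration; the 4096-byte page size matches the one used in the experiments.

```python
PAGE_SIZE = 4096


def pages_of(base, size):
    """Virtual page numbers touched by the byte range [base, base + size)."""
    first = base // PAGE_SIZE
    last = (base + size - 1) // PAGE_SIZE
    return range(first, last + 1)


def register_regions(page_table, arrays):
    """Tag the translation entries of each shared array with a region ID.

    `arrays` is a list of (base, size) pairs; array i becomes region i.
    """
    for region_id, (base, size) in enumerate(arrays):
        for vpn in pages_of(base, size):
            page_table[vpn]["region"] = region_id
    return page_table


# Hypothetical page table: virtual page number -> frame and region ID.
page_table = {
    16: {"frame": 7, "region": None},
    17: {"frame": 3, "region": None},
    18: {"frame": 9, "region": None},   # private page, stays untagged
    32: {"frame": 5, "region": None},
}
register_regions(page_table, [(0x10000, 8192), (0x20000, 100)])
```

Because the granularity is a page, the 100-byte array in this example still claims the whole of page 32 for its region.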
On each cache miss the address of the required data is put on the shared bus. When the processor generates an address, which is virtual, this address is translated to a physical one by the TLB. Both the physical page number and the associated unique region identifier are extracted from the TLB. At this stage, we can annotate each memory reference with the identifier of the shared region to which it belongs. If all the snoop controllers in the system can observe this region identifier for each memory request, very small and efficient hardware suffices to check whether that region ID matches one of the shared regions of the particular processor node. Providing the shared region identifier to all the processor nodes can be performed very easily by placing it on the common bus together with the memory request. It can be easily observed that the memory requests related to cache coherence are mostly cache misses. These memory requests use only the address lines of the common bus, since they need to either fetch new data or inform every other processor node that data is being modified in the local cache. Consequently, the region identifier can be included as a part of the bus transaction by simply using the available data lines of the shared bus; no additional bus lines are needed to transfer the region ID and hence no hardware modifications to the bus structure are needed.
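The annotation step can be sketched as a translation that returns the region ID alongside the physical address. The TLB is modeled here as a dictionary from virtual page number to a (frame, region ID) pair, and BusRequest is a hypothetical record whose region field stands in for the otherwise idle data lines; none of these names come from the original design.

```python
from collections import namedtuple

PAGE_SIZE = 4096

# `region` models the ID carried on the data lines of a miss request.
BusRequest = namedtuple("BusRequest", "kind paddr region")


def translate(tlb, vaddr):
    """Return (physical address, region ID) for a virtual address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame, region = tlb[vpn]
    return frame * PAGE_SIZE + offset, region


def make_bus_request(tlb, kind, vaddr):
    """Build the bus transaction that a cache miss on `vaddr` would issue."""
    paddr, region = translate(tlb, vaddr)
    return BusRequest(kind, paddr, region)


# Hypothetical TLB contents: page 16 is in region 0, page 18 is private.
tlb = {16: (7, 0), 18: (9, None)}
req = make_bus_request(tlb, "read_miss", 0x10010)
```

The region ID is obtained in the same lookup as the frame number, so the sketch performs no extra table access per request.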
HARDWARE SUPPORT
The purpose of the hardware support is to capture the set of shared regions of each task currently executing on a given processor in the system. This information becomes a part of the state of the task and is loaded by the operating system or the thread library when the parallel task is scheduled for execution. Since the only information that the snoop controllers require is a status bit indicating whether a region identifier is part of the shared regions of the local processor, this can be easily implemented by a bit-mask register. The hardware register is an n-bit register, where n is the number of shared regions, and log2(n) bits are used to denote the region ID. Our experimental results indicate that 8 shared regions are enough for all the benchmarks we have used. Each bit of the n-bit register indicates whether the region with ID equal to the bit index is a shared region for that task or not. For example, if the region ID is 3, then the fourth LSB is set to 1.
Note that region identifiers are assigned the values from 0 to n-1. The proposed hardware architecture is depicted in Figure 2. When a bus request is seen by the snoop controller, the region ID is used as an index to check the value of the corresponding bit in the bit-mask register. In order to accomplish this, a simple decoder circuit is required, which in the case of 8 or 16 regions is trivial in size, power, and delay. If the bit in the register is set, the snoop controller probes the cache. Otherwise, no cache probing is needed, as the address on the bus is either a private address of a remote node or a shared address that is not used by the particular local processor node. It is noteworthy that the introduced hardware is extremely cost-efficient, as it consists of a single bit-mask register, whose bits are indexed with the region ID of the snoop-related bus transactions.
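The bit-mask check can be modeled in a few lines. This sketch assumes n = 8 regions with IDs that each index one bit of the register, and it treats a request without a region ID (None) as private; make_mask and should_probe are hypothetical names for what the text describes as a register load and a decoder lookup.

```python
N_REGIONS = 8  # register width; 8 regions sufficed for the benchmarks used


def make_mask(region_ids):
    """Build the bit-mask loaded when a task is scheduled on the node."""
    mask = 0
    for rid in region_ids:
        if not 0 <= rid < N_REGIONS:
            raise ValueError("region ID out of range: %d" % rid)
        mask |= 1 << rid  # bit index equals region ID
    return mask


def should_probe(mask, region_id):
    """Decide whether the snoop controller must probe the local cache."""
    if region_id is None:  # assumption: private requests carry no region ID
        return False
    return (mask >> region_id) & 1 == 1


# A task sharing regions 0 and 3; region 3 sets the fourth LSB.
mask = make_mask([0, 3])
```

Switching tasks only requires overwriting this one register, which is why the context switch cost stays small.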
EXPERIMENTAL RESULTS
We have evaluated the proposed approach on benchmarks chosen from the SPLASH-2 benchmark suite [5]. The benchmarks were chosen from this suite because of the realistic workload provided by the kernels and the benchmarks' inherent ability to expose application parallelism for shared-memory multiprocessor systems. Specifically, we chose the FFT, the LU, and the RADIX kernels.
The LU kernel factors a dense matrix into the product of lower triangular and upper triangular matrices. In our case, we have used LU to decompose a 128 × 128 matrix. The FFT data set consists of 2048 complex data points to be transformed, and another set of 2048 complex data points containing the roots of unity. The RADIX kernel implements the traditional radix sort. In our experiments, we have used RADIX to sort 3072 keys. It is to be noted that increasing the size of the data sets in each of these kernels results in an increased number of shared pages per region.
We simulated our target platform using the Simics 2.0.25 [6] functional simulator. Simics is a full system simulation platform capable of running unmodified commercial operating systems on top of a simulated multiprocessor machine. For our experimental results, we simulated a 4-processor system with SPARC processors running the Solaris operating system. As our approach focuses on the memory system, the choice of a particular RISC instruction set architecture makes no difference for the methodology which we propose. We have used Ruby [7] as a detailed memory system simulator, including shared memory, communication bus, local caches, and snoop controllers. Since we target the memory system, the Ruby simulator is our main driver.
Figure 4: Shared Read/Write misses
In order to obtain the experimental results, we added instrumentation code to the cache coherence protocol module to generate our desired statistics. We have also inserted additional code in the benchmark suite to obtain the virtual addresses of the shared memory regions. These virtual addresses were translated into physical addresses inside the simulator. The region based statistics is thus based on actual physical addresses. The physical addresses (pages) that belong to shared regions were assigned a unique region ID. Further, the region ID is used to match the physical address requests in our modification to the snoop controller simulation module.
For our baseline architecture, we have performed experiments on a 32KB direct mapped (DM) and a 2-way set associative (SA) L1 cache. The results in terms of total misses, read misses (RM) and write misses (WM) per processor node are given in Figure 3.
For each of the benchmarks we identified the shared arrays by inspecting the benchmark code. We calculated the size of these arrays and divided them into physical pages, where the page size is 4096 bytes. These shared arrays were then annotated with a region ID. Since the size of the logically shared arrays depends on the input data set, the number of pages also varies according to the input data size. For our benchmarks we obtained between 1 and 5 shared regions per benchmark, each of which is a set of shared physical pages. In our approach we filter the snoop requests based on the addresses of the read or write misses placed on the shared bus. For this we need to determine how many of these addresses lie in a shared region. We compare the physical page number of each address on the bus with the physical page numbers of the shared arrays and keep a count of the number of addresses that have the same page number as the shared arrays. We divide this count into shared read misses and shared write misses by taking into account the type of miss that was placed on the bus. These statistics are shown in Figure 4. With our approach we probe the cache only if the address lies in a shared region.
The energy consumption per access for the tag arrays of data caches is measured using the CACTI tool [8] for a baseline 32KB cache. The percentage of energy savings per benchmark is shown in Figure 5. As we discussed earlier, the energy contribution of the snoop operations to the total energy of the memory system can be very high and depends on the cache/memory sizes and bus organization. This pattern is repeated for all the benchmarks. From the results we can see that we get the maximum reduction in the FFT and RADIX benchmarks. However, the energy reduction in LU is not as significant, for two reasons. Firstly, LU loops and threads operate on a larger part of the shared regions than FFT or RADIX. Since the number of private regions in LU is small, the energy savings are not as high as the ones achieved for the other benchmarks. Secondly, our filtering process is conservative, as it currently operates at page-level granularity. We filter shared addresses based on the physical page number of the address; hence some private data which belongs to a shared page also triggers a snoop cache probe.
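The counting procedure described above can be sketched as follows. The trace format of (miss type, physical address) pairs, the function names, and the example values are all hypothetical and are not measured results; the fraction computed at the end is the share of snoop probes that filtering would avoid, which is not the same as the CACTI-based energy figure.

```python
PAGE_SIZE = 4096


def count_shared_misses(trace, shared_pages):
    """Count bus misses whose physical page belongs to a shared array.

    `trace` yields (kind, paddr) pairs with kind "RM" (read miss) or
    "WM" (write miss); `shared_pages` is a set of physical page numbers.
    """
    counts = {"RM": 0, "WM": 0, "total": 0}
    for kind, paddr in trace:
        counts["total"] += 1
        if paddr // PAGE_SIZE in shared_pages:
            counts[kind] += 1
    return counts


def filtered_fraction(counts):
    """Fraction of bus misses that need no cache probe under filtering."""
    if counts["total"] == 0:
        return 0.0
    shared = counts["RM"] + counts["WM"]
    return 1 - shared / counts["total"]


# Made-up trace: two misses hit shared page 7, two hit private pages.
trace = [("RM", 0x7010), ("WM", 0x7020), ("RM", 0x3000), ("RM", 0x9000)]
counts = count_shared_misses(trace, shared_pages={7})
```

Any private data that happens to reside on page 7 would be counted as shared here, which mirrors the page-granularity conservatism noted above.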
CONCLUSION
In this paper, we have presented a low-power methodology for keeping caches coherent in an embedded multiprocessor system. The proposed approach exploits application information regarding the shared memory regions of the communicating tasks in order to eliminate a large number of power-consuming snoop-induced cache probes. The proposed methodology is very cost-efficient, as the required additions to the system software and the hardware architecture are minimal and impose no performance or area overheads. Such an approach would be of great utility to many modern embedded applications, for which both high performance and low power are of great importance.
