We propose a novel technique for reducing the power consumed by the on-chip cache in an SNUCA chip multicore platform. This is achieved by what we call a "remap table", which maps accesses to the cache banks that are as close as possible to the cores on which the processes are scheduled. With this technique, instead of using all the available cache, we allocate only a portion of the cache to the application. We formulate the problem as an energy-delay (ED) minimization problem and solve it offline using a scalable genetic algorithm approach. Our experiments show savings of up to 40% in memory sub-system power consumption and up to 47% in energy-delay product (ED).
INTRODUCTION
The rise of cell phones as the most preferred mobile computing device has created a need for more general-purpose solutions on embedded systems, a trend reflected in the introduction of multi-core ARM Cortex Series processors with configurable cache sizes, which let the designer choose an appropriate cache size based on the use case. The number of cores in embedded processors is increasing due to the increasing number of applications on the mobile platform, making larger caches necessary on these platforms. One of the major challenges in designing an efficient cache hierarchy is to achieve both low cache access latency and a low cache miss rate, which are often conflicting demands. To strike a balance between them, a large cache is partitioned into multiple banks, and these banks are connected using a switched network or a crossbar network.
Similarly, to make a CMP platform scalable, cores and shared caches are arranged in identical building blocks called tiles (Fig. 1). Tiles are replicated and connected through an on-chip switched network (NoC). Each tile has a core, a private L1 instruction and data cache, an L2 cache and a router. The L2 cache is distributed across all tiles and is shared by all the cores. To maintain cache coherence between the private L1 caches, a directory is present in each tile. The tiles are connected to one another via a 2D-mesh switched network and per-tile routers. The memory address determines the L2 bank where the data will be cached, which we call the "home location" of that address. Data can reside only in its "home" L2 bank. Hence, the architecture is referred to as a Static Non-Uniform Cache Access (SNUCA) architecture.
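The static home mapping can be illustrated with a minimal sketch. The text does not give the exact address-to-bank function, so the block-interleaved mapping, the 64-byte block size, the 4x4 mesh and the row-major tile numbering below are all assumptions, and the function names are hypothetical.

```python
BLOCK_BYTES = 64                      # assumed cache block size
MESH_DIM = 4                          # assumed 4x4 mesh, i.e. sixteen tiles
NUM_SLICES = MESH_DIM * MESH_DIM

def home_slice(addr: int) -> int:
    """Home L2 slice of an address under an assumed block-interleaved mapping."""
    return (addr // BLOCK_BYTES) % NUM_SLICES

def hops(src_tile: int, dst_tile: int) -> int:
    """Minimum hop count between two tiles on the 2D mesh (row-major tile ids)."""
    sx, sy = src_tile % MESH_DIM, src_tile // MESH_DIM
    dx, dy = dst_tile % MESH_DIM, dst_tile // MESH_DIM
    return abs(sx - dx) + abs(sy - dy)

# A core on tile 0 touching consecutive blocks visits every slice in turn,
# including the far corner of the mesh.
for addr in (0, 64, 15 * 64):
    print(addr, home_slice(addr), hops(0, home_slice(addr)))
```

Under these assumptions, consecutive cache blocks land on different tiles, so even a small working set can involve many L2 banks.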
When an application is executed on such a tiled architecture with a distributed but shared L2 cache, its accesses are dispersed to L2 slices in all tiles. This increases the time that accesses spend in transit and, as a result, degrades the application's runtime. The use of different L2 slices is not uniform. Due to the large size of the cache, most of the L2 slices remain unused. We use a novel technique, which we call a "Remap Table", to merge the accesses made to the under-utilized distant L2 slices into nearer L2 slices. The unused L2 slices can then be switched off, thereby saving power.
For embedded applications such as video applications, the remap table can be configured offline by determining the parameters that influence the application's cache requirement. The cache requirement is independent of the input data values if these influential parameters are kept constant.
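One way to picture the remap table is as a small array indexed by home slice. The sketch below is only an illustration: the table layout, the helper names and the example configuration (folding all accesses onto four slices near tile 0) are assumptions, not the configuration used in this work.

```python
NUM_SLICES = 16   # sixteen-tile platform

def identity_table():
    """Reference configuration: every home slice serves its own accesses."""
    return list(range(NUM_SLICES))

def remap(table, home):
    """Slice that actually serves an access whose home slice is `home`."""
    return table[home]

def unused_slices(table):
    """Slices that receive no accesses and could therefore be switched off."""
    return sorted(set(range(len(table))) - set(table))

# Hypothetical table: merge all sixteen homes onto slices 0, 1, 4 and 5,
# which sit next to tile 0 on a row-major 4x4 mesh.
near = [0, 1, 4, 5]
table = [near[h % len(near)] for h in range(NUM_SLICES)]
print(unused_slices(table))
```

With the identity table no slice is idle; with the hypothetical table above, twelve of the sixteen slices receive no accesses.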
MOTIVATION
We varied the number of allocated L2 slices in steps of 2 on a sixteen-tile CMP platform. For each configuration, we exhaustively searched all remap table configurations and evaluated them using simulation. We seek the remap table that gives the highest ED savings over the reference configuration, which allocates all sixteen L2 slices.
Allocating fewer than the required number of L2 slices increases the number of off-chip DRAM accesses, which degrades the execution time of the application compared to the reference. Allocating more than the required number of L2 slices disperses the accesses and increases the time spent in transit. Hence, one needs to allocate the optimum number of L2 slices to the application.
In our problem, both the number of L2 slices allocated to the application and their location are important. Clearly, as the number of tiles, and hence the number of L2 slices on the CMP platform, increases, exhaustive search will not scale. Hence, we solve this problem using more scalable genetic algorithms.
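The scaling argument can be made concrete with a rough count. The sketch below counts only the ways of choosing which slices stay powered on, for even allocation sizes as in the steps of 2 above; it ignores how home locations are assigned onto the chosen slices, so it should be read as an assumed lower bound on the search space rather than the exact size explored in our experiments. The function name is hypothetical.

```python
from math import comb

def slice_placements(num_slices: int, allocated: int) -> int:
    """Number of ways to choose which `allocated` of `num_slices` slices stay on."""
    return comb(num_slices, allocated)

for n in (16, 64):
    # Allocation sizes 2, 4, ..., n, mirroring the "steps of 2" sweep.
    total = sum(slice_placements(n, k) for k in range(2, n + 1, 2))
    print(f"{n} slices: {total} candidate placements")
```

Even under this simplified count, moving from sixteen to sixty-four slices takes the number of placements from tens of thousands to more than 10^18, each of which would need a simulation run.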
Our contributions are as follows:
• We propose to reduce the cache footprint of an application whose cache requirement is smaller than the available cache, using a remap table that allows the choice of both the number and the position of the L2 slices used.
• We model the penalty incurred by all L1 misses and the energy consumption of the memory sub-system components in the presence of a remap table, using a trace-based approach.
• We apply a genetic algorithm (GA) to determine the optimum remap table configuration. We observe that our remap table strategy gives power and energy-delay savings of up to 40% and 47%, respectively, without compromising much performance. Such an offline determination of the remap table is applicable to embedded platforms.
• The simulation framework models the power consumption of all memory sub-system components, such as off-chip DRAM, interconnect and cache, unlike prior research.
Our experimental evaluation and conclusions are described in sections 3 and 4, respectively.
EVALUATION
We formulate the problem as an energy-delay minimization problem and use genetic algorithms to determine a near-optimal remap table configuration. Please refer to the technical report [1] for more details. We use multithreaded workloads from the Splash2 and Alpbench benchmarks to evaluate our strategy.
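To give a feel for the formulation, the following is a minimal sketch of a genetic algorithm searching over remap tables. It is not the model from the technical report: the toy cost function, all constants, the trace format of (core tile, home slice) pairs, and the choice of elitist selection, one-point crossover and single-entry mutation are assumptions made purely for illustration.

```python
import random

MESH_DIM = 4
NUM_SLICES = MESH_DIM * MESH_DIM

def hops(a, b):
    """Minimum hop count between tiles a and b on a row-major 2D mesh."""
    return abs(a % MESH_DIM - b % MESH_DIM) + abs(a // MESH_DIM - b // MESH_DIM)

def energy_delay(table, trace, needed_slices=4):
    """Toy ED cost of a remap table; every constant here is an assumption."""
    active = len(set(table))                       # slices left powered on
    transit = sum(hops(core, table[home]) for core, home in trace)
    # Assumed penalty: too few active slices cause extra off-chip accesses.
    shortfall = max(0, needed_slices - active)
    delay = len(trace) * (10 + 50 * shortfall) + 2 * transit
    energy = 0.01 * active * delay + 0.5 * transit + 20 * shortfall * len(trace)
    return energy * delay

def evolve(trace, generations=50, pop_size=30, seed=0):
    """Return the lowest-ED remap table found by a simple elitist GA."""
    rng = random.Random(seed)
    identity = list(range(NUM_SLICES))             # reference: all slices used
    pop = [identity] + [[rng.randrange(NUM_SLICES) for _ in range(NUM_SLICES)]
                        for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=lambda t: energy_delay(t, trace))
        parents = pop[:pop_size // 2]              # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, NUM_SLICES)     # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(NUM_SLICES)] = rng.randrange(NUM_SLICES)  # mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda t: energy_delay(t, trace))

# Hypothetical trace: a single core on tile 0 touching every home slice.
trace = [(0, home) for home in range(NUM_SLICES)] * 4
best = evolve(trace)
print(sorted(set(best)), energy_delay(best, trace))
```

Because the reference table is seeded into the initial population and the better half always survives, the returned table is never worse than the reference under the toy cost. In this sketch, the number and the position of the active slices are both encoded implicitly in the table entries.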
For applications like FFT, the memory requirement depends on the number of points on which the FFT is performed and not on the actual values of these points (Table 1(a)). We varied the values of these points and performed simulations using the same remap table, obtained with the reference input. The percentage savings in execution time, power and energy-delay do not change even when the input data set is changed.
For video applications like the MPEG decoder, the MPEG encoder and H.264, one can vary the most influential parameters, such as frame resolution and bit rate, and configure the remap table for various frame resolutions. For such applications, the working set does not change drastically even when the number of frames changes. For example, we determined the remap table using an input image with a frame resolution of 640x336 for the MPG Decoder and Encoder applications and took readings for images with different frame resolutions. The same remap table continues to give power savings, as shown in Table 1(b). These experiments indicate that one can determine the important performance-determining parameters of media applications and configure the remap table using an offline method.
CONCLUSIONS AND FUTURE WORK
In this paper, we propose and implement a technique for reducing the power consumption of the on-chip cache on an SNUCA chip multicore platform. We use a remap table to achieve this. To arrive at appropriate table sizes and entries, we formulate the problem as an energy-delay minimization problem and solve it offline using a scalable genetic algorithm. Our technique results in power savings of up to 40% and energy-delay savings of up to 47%. The results obtained using the genetic algorithm approach will help us analyze the effectiveness of online remap table changes, which we plan to study in the future.
