This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low-power embedded systems. Advanced integration technology makes it possible to realize high bandwidth between logic and DRAM. System-in-Silicon is one architectural framework that realizes such high bandwidth: an ASIC and a specific DRAM are mounted onto a silicon interposer, and each chip is connected to the interposer by eutectic solder bumps. In this framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve performance, and we exploit this cache to reduce the DRAM energy consumption. During program execution, the adequate cache line size, i.e., the one that produces the lowest cache miss ratio, varies because the amount of spatial locality of memory references changes. A large cache line size provides a prefetching effect. However, it consumes more DRAM energy than a small line size because a large number of banks are accessed. The SC-VLS cache can change its line size to an adequate one at runtime with small area and power overheads. Before executing a target program, we analyze the adequate line sizes and insert line-size change instructions at the beginning of each function. In our evaluation, the SC-VLS cache reduces the DRAM energy consumption by up to 88% compared to a conventional cache with fixed 256 B lines.
key words: low power, variable line-size, on-chip DRAM, high bandwidth, embedded systems
Introduction
System-in-Silicon (SiS) is an architectural framework for implementing a whole system on a silicon interposer [1], [2]. For instance, as shown in Fig. 1, an application ASIC and a specific DRAM (SiS-DRAM) are mounted on the interposer and connected via eutectic solder bumps. This framework offers two main advantages. First, high memory bandwidth is easily achieved by increasing the number of on-silicon communication wires. Second, the energy consumption for DRAM accesses is reduced because no high-capacitance I/O pads need to be activated. Although the SiS architecture improves memory bandwidth, the large access latency of the DRAM arrays is still an important issue. To solve this problem, we employ a small cache memory, called SiS-cache. Note that this is not the cache memory included in the microprocessor core.
Although the SiS framework has an advantage in terms of energy consumption, as explained above, further energy reduction is still required, particularly for SiS-DRAM accesses. Since the load capacitance of SiS-DRAM is large, it may dominate the total energy consumption. The high bandwidth achieved by the SiS framework makes it possible to transfer a large amount of data between SiS-cache and SiS-DRAM at a time. Therefore, SiS-cache can employ a larger cache line size to obtain a prefetching effect, resulting in higher performance. However, a larger cache line size also requires more SiS-DRAM bank activations, so a large amount of energy is required for each SiS-DRAM access.
Manuscript received August 18, 2008. Manuscript revised November 15, 2008.
† The author is with the Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka-shi, 819-0395 Japan.
†† The authors are with the Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka-shi, 819-0395 Japan.
††† The author is with the System Fabrication Technologies, Inc.
a) E-mail: ono@c.csce.kyushu-u.ac.jp
b) E-mail: inoue@ikyushu-u.ac.jp
c) E-mail: murakami@ikyushu-u.ac.jp
d) E-mail: yoshida@s-f-t.co.jp
DOI: 10.1587/transele.E92.C.433
This paper proposes a SiS-cache architecture, called Software-Controllable Variable Line-Size cache (SC-VLS cache), in order to reduce SiS-DRAM energy without sacrificing performance. Unlike a conventional cache mechanism, the proposed approach attempts to optimize the amount of data transferred between SiS-cache and SiS-DRAM. Namely, when a target application program does not require high memory bandwidth, the SC-VLS cache reduces the cache line size. Since the total number of SiS-DRAM banks accessed decreases, energy is saved. The SC-VLS cache does not require any hardware monitor to decide the line size, so it reduces energy consumption with trivial hardware overhead. In our evaluation, the SC-VLS cache reduces SiS-DRAM energy consumption by up to 88%. The abstract of this paper has been presented in [3].
The rest of this paper is organized as follows. In Sect. 2, we explain the energy issue for SiS-DRAM accesses. Section 3 presents the software-controllable variable line-size cache. We show evaluation results of our approach in Sect. 4. Section 5 describes related work and Sect. 6 summarizes our work.
Energy Issue for SiS-DRAM Accesses
Total SiS-DRAM energy consumption ($E_{mem}$) is calculated by the following equation:
$$E_{mem} = \sum_{i=1}^{access} E_{Bank} \times \frac{LineSize_{SCVLS_i}}{LineSize_{Bank}} \qquad (1)$$
where $E_{Bank}$ is the energy consumption per bank access, $LineSize_{SCVLS_i}$ is the SiS-cache line size used at the $i$-th access, $LineSize_{Bank}$ is the bit width of the SiS-DRAM banks, and $access$ is the total number of accesses to SiS-DRAM. An increase in the SiS-cache miss rate increases the value of $access$. In addition, a larger $LineSize_{SCVLS_i}$ increases the number of banks activated, and hence the energy consumed, per access. Therefore, employing a larger cache line size does not necessarily reduce $E_{mem}$.
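The energy model above can be sketched in a few lines of Python. This is a minimal illustration under one reading of the model, namely that each SiS-DRAM access activates LineSize / LineSize_Bank banks and each activation costs E_Bank. The function name, the unit energy value, and the miss counts are hypothetical and are not taken from the paper.

```python
def sis_dram_energy(line_sizes_per_access, e_bank, bank_width=32):
    """Total SiS-DRAM energy under the bank-count reading of the model.

    line_sizes_per_access: line size in bytes used for each SiS-DRAM access
    e_bank: energy per bank access (arbitrary unit, assumed value)
    bank_width: bytes supplied by one bank (LineSize_Bank, 32 B here)
    """
    total = 0.0
    for line_size in line_sizes_per_access:
        banks_activated = line_size / bank_width
        total += e_bank * banks_activated
    return total


# Hypothetical workload: 100 misses with fixed 256 B lines versus
# 150 misses with 32 B lines (more misses, but one bank per miss).
fixed_256 = sis_dram_energy([256] * 100, e_bank=1.0)
small_32 = sis_dram_energy([32] * 150, e_bank=1.0)
print(fixed_256, small_32)
```

In this made-up case the smaller line size consumes less energy even though it causes more misses, which is the trade-off the section describes.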
If a large SiS-cache line size is employed, we can expect a prefetching effect. However, it might worsen system performance if programs do not have enough spatial locality of memory references. Fig. 2 shows the SiS-cache miss rate for each line size. The x-axis lists the benchmark programs, which are a part of MiBench [4]. We can observe that larger line sizes perform well in most programs. Larger SiS-cache line sizes, however, tend to consume more energy than small lines because many DRAM banks need to be activated. We assume that SiS-DRAM is divided into banks and that each bank can be activated individually. Fig. 3 shows the transition of the adequate line size, i.e., the line size that produces the lowest cache miss rate. The x-axis denotes time intervals and the y-axis is the adequate line size. The program is stringsearch in MiBench. We assume that a direct-mapped 2 KB SiS-cache can ideally choose the best line size from 32 B, 64 B, 128 B and 256 B, thus the y-axis in Fig. 3 indicates one of these line sizes for each time interval. To obtain the results shown in Fig. 3, we executed the target program four times, once for each supported line size, and then picked the line size producing the lowest cache miss rate for each interval. From this figure, it is observed that the adequate line size changes during program execution. We can improve SiS-DRAM energy consumption and the SiS-cache miss rate by optimizing the SiS-cache line size. If we use a SiS-cache line size that is not suitable for an application program, the cache increases not only the execution time but also the SiS-DRAM energy consumption. Therefore, an intelligent control mechanism for the SiS-cache is needed to achieve high performance and low energy consumption.
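The per-interval selection used to obtain Fig. 3 can be sketched as follows. This is a minimal sketch, assuming the miss rates of each run have already been collected per time interval. The function name and the miss-rate values are illustrative only, and breaking ties in favor of the smaller line size is an assumption that the text does not state.

```python
def adequate_line_size_per_interval(miss_rates):
    """Pick, for every interval, the line size with the lowest miss rate.

    miss_rates maps a line size in bytes to a list of miss rates,
    one entry per time interval (one list per simulation run).
    """
    sizes = sorted(miss_rates)
    n_intervals = len(miss_rates[sizes[0]])
    best = []
    for t in range(n_intervals):
        # Assumed tie-break: prefer the smaller line size, since it
        # activates fewer DRAM banks per miss.
        best.append(min(sizes, key=lambda s: (miss_rates[s][t], s)))
    return best


# Illustrative (made-up) miss rates for three intervals.
example = {
    32: [0.10, 0.04, 0.08],
    64: [0.08, 0.04, 0.09],
    128: [0.06, 0.05, 0.11],
    256: [0.05, 0.07, 0.12],
}
transition = adequate_line_size_per_interval(example)
print(transition)
```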
A Software-Controllable Variable Line Size Cache

Architecture
To reduce SiS-DRAM energy consumption, we attempt to optimize the SiS-cache line size at runtime. A Software-Controllable Variable Line-Size Cache (SC-VLS cache) is able to select a data transfer size, which is indicated in a status register. In this paper, we call a SiS-cache that has the ability to select a line size at runtime the SC-VLS cache. Fig. 4 illustrates the block diagram of a direct-mapped SC-VLS cache. In the SC-VLS cache, an SRAM cell array and a DRAM cell array are divided into several subarrays. Data transfer for cache replacement is performed between corresponding SRAM and DRAM subarrays. Since the adequate line size is indicated by the status register, it is impossible to use more than one line size at the same time, even at bank granularity. The minimum line size is 32 B, and four line sizes (32 B, 64 B, 128 B and 256 B) are provided. Figure 5 shows the four cache line sizes of the SC-VLS cache. For instance, if we need to replace data with the 64 B line size, the data is transferred between two adjacent subarrays.
A SiS-cache access is always performed at the minimum line size, i.e., 32 B. On a cache hit, the required data is read in the same manner as in a conventional cache. On a cache miss, a line of the size indicated in the status register is replaced.
In order to set the status register to an adequate line size, we exploit a memory-mapped I/O scheme. A memory address is assigned to the status register. Therefore, by executing a store instruction, we can control the line size of the SC-VLS cache. Note that the performance overhead caused by executing the store instructions is trivial.
Fig. 4 The SC-VLS cache architecture.
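The access and replacement behavior described in this section can be illustrated with a small behavioral model. It is a sketch under several assumptions that the text does not specify: the address `STATUS_REG_ADDR`, the register encoding (codes 0 to 3 selecting 32 B to 256 B), write-allocate handling of ordinary stores, and the class and method names are all hypothetical. Only the 32 B lookup granularity, the four line sizes, the direct-mapped organization and the memory-mapped status register come from the text.

```python
SUBLINE = 32                      # minimum line size in bytes (from the text)
LINE_SIZES = (32, 64, 128, 256)   # supported line sizes (from the text)
STATUS_REG_ADDR = 0xFFFF0000      # hypothetical memory-mapped address


class SCVLSCacheModel:
    """Behavioral sketch of a direct-mapped SC-VLS cache (tags only)."""

    def __init__(self, size_bytes=2048):
        self.n_sublines = size_bytes // SUBLINE
        self.tags = [None] * self.n_sublines   # None marks an invalid subline
        self.line_size = 32
        self.misses = 0
        self.banks_activated = 0

    def store(self, addr, value):
        """A store to STATUS_REG_ADDR selects the line size.

        The encoding 0..3 -> 32..256 B is an assumption. Other stores
        are treated as ordinary (write-allocate) cache accesses.
        """
        if addr == STATUS_REG_ADDR:
            self.line_size = LINE_SIZES[value]
            return
        self.access(addr)

    def access(self, addr):
        """Look up one 32 B subline; on a miss, refill line_size bytes."""
        block = addr // SUBLINE
        index = block % self.n_sublines
        tag = block // self.n_sublines
        if self.tags[index] == tag:
            return True
        self.misses += 1
        per_line = self.line_size // SUBLINE
        first = (block // per_line) * per_line   # align to the line size
        for b in range(first, first + per_line):
            self.tags[b % self.n_sublines] = b // self.n_sublines
        self.banks_activated += per_line         # one bank per 32 B subline
        return False


cache = SCVLSCacheModel()
cache.store(STATUS_REG_ADDR, 3)                  # select 256 B lines
results = [cache.access(0x1000), cache.access(0x10E0)]
cache.store(STATUS_REG_ADDR, 0)                  # select 32 B lines
results += [cache.access(0x2000), cache.access(0x2020)]
print(results, cache.misses, cache.banks_activated)
```

With 256 B lines the second access hits because the whole aligned 256 B region was fetched on the first miss. With 32 B lines each miss activates a single bank, but neighboring sublines are not prefetched.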
Adequate Line Size Analysis and Code Generation
For the SC-VLS cache, it is very important to consider when and how often the cache line size is changed. One of the most promising approaches is to set the line size when the amount of spatial locality varies. However, it is not easy to accurately detect changes of program execution phases in terms of memory access behavior. A design alternative is to perform fine-grained line size optimization, e.g., at load/store granularity. Although this can effectively follow the varying amount of spatial locality, the performance overhead caused by executing extra store instructions for line size setting can be significant. On the other hand, the most coarse-grained optimization is to specify an adequate line size program by program. In this case, however, it is impossible to exploit the dynamic behavior of memory accesses shown in Fig. 3. Considering this tradeoff, we have decided to employ function-level optimization, i.e., an adequate line size is specified function by function.
To determine an adequate line size, we execute a cache simulation with each line size independently. The analysis then consists of the following steps.
• An average cache miss rate of each function is calculated.
• We compare the average cache miss rates of all line size candidates. The line size with the smallest cache miss rate is chosen as the adequate line size.
It is not easy to decide on a single adequate line size for each function because memory access behavior changes within a program execution. However, we need to fix one adequate line size per function, because line size change instructions are inserted into the program code. Thus we determine the adequate line size based on average cache miss ratios.
We explain the analysis algorithm by using Fig. 6. We assume that the SC-VLS cache supports 32 B and 64 B line sizes. First, we measure the cache misses in each function using a cache simulator. For instance, foo1() causes 10 misses out of 200 accesses with the 64 B line size. Second, we calculate the average miss rate of foo1(), which is 5.0% for the 64 B line size and 2.0% for the 32 B line size. Finally, we decide the adequate line size for foo1() by comparing the average cache miss rate of the 64 B line size with that of the 32 B line size. In this example, the adequate line size is 32 B, because the 32 B lines produce a lower cache miss rate on average than the 64 B lines. We decide the adequate line sizes of the other functions in the same way.
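The function-level analysis can be sketched as below. This is a minimal sketch, assuming the simulator reports a (misses, accesses) pair for each function and line size. The function name and data layout are hypothetical. The foo1() counts follow the example (10 of 200 accesses is 5.0%, and 4 of 200 is the count implied by the stated 2.0%), while the foo2() numbers are made up for illustration.

```python
def adequate_line_sizes(profile):
    """Choose, per function, the line size with the lowest average miss rate.

    profile: {function name: {line size in bytes: (misses, accesses)}}
    returns: {function name: adequate line size in bytes}
    """
    result = {}
    for func, per_size in profile.items():
        def miss_rate(size):
            misses, accesses = per_size[size]
            return misses / accesses if accesses else 0.0
        # Candidates are visited in ascending order, so a tie keeps the
        # smaller line size (an assumption; the text does not cover ties).
        result[func] = min(sorted(per_size), key=miss_rate)
    return result


profile = {
    "foo1": {32: (4, 200), 64: (10, 200)},    # 2.0% vs. 5.0%
    "foo2": {32: (30, 300), 64: (12, 300)},   # hypothetical: 10.0% vs. 4.0%
}
chosen = adequate_line_sizes(profile)
print(chosen)
```

The resulting table is what a code generator would consult when inserting the line size change instruction at the start of each function.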
After the adequate line size analysis, line size change instructions are inserted at the start of each function in the original program code. Before the body of a function is executed, the instruction sets the status register to indicate the adequate line size.
Discussion
The SiS-cache can improve performance except for programs with a high SiS-cache miss rate, because such programs increase the AMAT (Average Memory Access Time) [5]. To improve performance, the SiS-cache needs to satisfy the following condition:
$$T_{SiS\text{-}cache} + (1 - HR_{SiS\text{-}cache}) \times T_{SiS\text{-}DRAM} < T_{SiS\text{-}DRAM} \qquad (2)$$
where $T_{SiS\text{-}cache}$ is the access time of SiS-cache, $T_{SiS\text{-}DRAM}$ is the access time of SiS-DRAM and $HR_{SiS\text{-}cache}$ is the hit rate of SiS-cache. Our scheme is advantageous when reducing the cache line size does not worsen the cache miss rate. The SC-VLS cache is suitable for programs whose adequate line size changes frequently. In addition, it is preferable that the adequate line sizes do not vary depending on the input data sets of the programs.
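The performance condition can be made concrete with a short derivation. This is a sketch assuming the standard AMAT model, in which every access pays the SiS-cache access time and a miss additionally pays the SiS-DRAM access time. The cycle counts in the example are illustrative and are not taken from the paper.

```latex
\begin{align*}
T_{SiS\text{-}cache} + (1 - HR_{SiS\text{-}cache})\,T_{SiS\text{-}DRAM}
  &< T_{SiS\text{-}DRAM} \\
T_{SiS\text{-}cache}
  &< HR_{SiS\text{-}cache}\,T_{SiS\text{-}DRAM} \\
HR_{SiS\text{-}cache}
  &> \frac{T_{SiS\text{-}cache}}{T_{SiS\text{-}DRAM}}
\end{align*}
% Illustrative values: T_cache = 2 cycles, T_DRAM = 20 cycles
% => the SiS-cache pays off once its hit rate exceeds 2/20 = 0.1.
```

Under this model, the slower SiS-DRAM is relative to SiS-cache, the lower the hit rate at which the cache still improves performance.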
It is very important to find the adequate line size for each function, because inadequate line sizes increase not only the miss rate but also the DRAM energy consumption due to the increased value of $LineSize_{SCVLS_i}$ in Eq. (1). Unfortunately, it is hard to perfectly predict the adequate line size in the analysis phase, mainly for the following two reasons. First, we use test input data to decide the adequate line sizes. Therefore, the analysis does not work well if the memory access behavior strongly depends on the input data. Second, we assume a warm-start condition in the analysis phase. We execute the target program assuming a fixed line size. For instance, if the SC-VLS cache supports the line sizes of 32 B, 64 B, 128 B, and 256 B, the program is executed four times. Then, for each function, we pick the line size that produces the lowest cache miss rate. However, in the execution phase, functions executed consecutively may use different line sizes. For instance, function foo2() with the 256 B line size may be executed just after function foo1() using the 32 B line size. In this scenario, function foo2() is executed under a cold-start condition, resulting in a higher cache miss rate.
Evaluation
Experimental Setup
In this section, we quantitatively evaluate the efficiency of our approach. The processor configuration assumed in this paper is shown in Table 1. We assume an SC-VLS cache with the ability to dynamically choose among four line sizes: 32 B, 64 B, 128 B and 256 B. The SiS-cache is a data cache, i.e., it is located between the L1 data cache and SiS-DRAM. We modified the SimpleScalar tool set [6] in order to measure cache miss rates. We assume that $LineSize_{Bank}$ in Eq. (1) is 32 B. The SC-VLS cache has three main energy sources: the energy for data memory (SRAM) accesses, that for status register accesses, and that consumed by peripheral circuits. We have measured the dynamic read energy consumed in the 4 KB data memory by using CACTI 5.3 [7], [8], and have found that it is only 0.01 nJ. This is negligible because it is two orders of magnitude smaller than the energy consumed in the 128 MB DRAM arrays. In addition, the energy consumed for accesses to the 8-bit status register and for peripheral operations is trivial. We therefore ignore the SC-VLS cache energy consumption because it is much smaller than the energy dissipated for SiS-DRAM accesses. For the same reason, we do not take the energy for status register accesses into account.
Fig. 7 The results of SiS-DRAM energy consumption.
We use 20 programs in MiBench [4]. Since the adequate line sizes depend on inputs, we used two types of input data sets in this evaluation. To analyze the adequate line sizes (analysis phase), we use the small input sets of MiBench. The large input sets are used for evaluating our approach (execution phase).
Energy Reduction and SC-VLS Cache Miss Rates
Here we discuss the efficiency of the SC-VLS cache in terms of energy and cache miss reduction. Figure 7 shows the SiS-DRAM energy consumption for each benchmark program. The x-axis shows the benchmark programs and the y-axis shows the energy consumption normalized to the result of FIX256B for each program. Figure 8 shows the SiS-cache miss rate. The x-axis shows the benchmark programs and the y-axis shows the SiS-cache miss rate. FIX32B, FIX64B, FIX128B and FIX256B denote the results for the fixed SiS-cache line sizes, and SCVLS is the result of our approach. We see from Fig. 7 that our approach reduces SiS-DRAM energy compared with the fixed 256 B line size. In particular, the SC-VLS cache reduces energy consumption by 88% for blowfish enc and blowfish dec. Table 2 reports the average cache line size for each benchmark program. We see that blowfish enc and blowfish dec tend to choose smaller line sizes. Their average cache line sizes are very close to 32 B, resulting in a large energy reduction as explained by Eq. (1).
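As a rough consistency check on the 88% figure, the following arithmetic assumes the bank-count reading of Eq. (1), in which each access activates LineSize / LineSize_Bank banks with LineSize_Bank = 32 B. It also assumes, as a simplification that is not made in the paper, the same number N of SiS-DRAM accesses in both configurations.

```latex
\begin{align*}
\frac{E_{mem}^{32\,\mathrm{B}}}{E_{mem}^{256\,\mathrm{B}}}
  &= \frac{N \cdot E_{Bank} \cdot (32/32)}{N \cdot E_{Bank} \cdot (256/32)}
   = \frac{1}{8} = 0.125 \\
\text{reduction} &= 1 - 0.125 = 87.5\%
\end{align*}
```

Under these assumptions, a program whose average line size is close to 32 B approaches an 87.5% reduction, which is in line with the reported value of about 88%. In practice the access counts differ between configurations because the miss rates differ.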
The SiS-cache miss rate is reduced compared with FIX256B in jpeg enc, jpeg dec, tiff2bw, tiffmedian, dijkstra, stringsearch and sha. In tiffdither, rsynth and rijndael dec, the miss rate of the SC-VLS cache is the same as that of FIX256B. In these programs, the SC-VLS cache reduces SiS-DRAM energy consumption without sacrificing the miss rate. In the other programs, however, the SC-VLS cache increases the miss rate. There, the SC-VLS cache selects an unsuitable line size because the adequate line size in the execution phase differs from that found in the analysis phase, as explained in Sect. 3.3.
Overhead
To set the status register to an adequate line size, our approach needs to execute extra store instructions. Executing the extra store instructions may worsen the processor performance and energy consumption, because it increases the total number of instructions executed. Figure 9 shows the ratio of the extra store instructions to all executed instructions. It is observed that the overhead is lower than 2% in most benchmark programs. Therefore, the extra instructions have little impact on the processor performance and energy consumption.
In blowfish and CRC32, the extra store instruction ratio is larger than in the other programs. Table 3 shows the number of functions in each program. It is observed that the number of functions in blowfish and CRC32 is almost the same as in other programs, e.g., adpcm and dijkstra. In adpcm and dijkstra, however, the overhead is trivial. It is likely that the executed function changes frequently during program execution in blowfish and CRC32, so the number of extra store instructions executed increases.
Fig. 8 The results of SiS-VLS cache miss rate.
Sensitivity to Input Data
In Fig. 8, the miss rate increases because the adequate line sizes can differ depending on the input data of the programs. Fig. 10 and Fig. 11 show the SiS-cache energy consumption and miss rate, respectively, for three SiS-cache configurations. The L1 cache size is 4 KB and the SiS-cache size is 2 KB. In SCVLS-SI and SCVLS-DI, the programs are executed using the adequate line sizes analyzed with the same input data as the execution phase and with input data different from the execution phase, respectively.
The SiS-cache energy consumption of SCVLS-DI is higher than that of SCVLS-SI in bitcount, patricia, ispell, stringsearch and CRC32. In mad and tiffdither, the energy consumption is lower than that of SCVLS-SI, although the miss rate of SCVLS-DI is higher than that of SCVLS-SI. The energy consumption strongly depends on how many DRAM and SRAM subarrays are activated, which in turn depends on the cache line sizes used. This means that a smaller number of subarray activations produces lower energy consumption even if the cache miss rate is high. Indeed, for mad and tiffdither, SCVLS-DI achieves lower energy consumption despite its higher cache miss rates, because it uses the 256 B line size less often than SCVLS-SI. Table 4 and Table 5 show the breakdown of the SC-VLS cache misses. From these tables, it is clear that the total number of cache misses with the largest line size (256 B) in SCVLS-DI is smaller than that in SCVLS-SI. For these programs, we need to analyze the adequate line size with more input data sets.
Table 3 The number of the function. (Columns: Benchmarks, The number of the function.)
Fig. 9 Overhead of the extra store instructions to change the line size.
Fig. 10 The impact of input data for SiS-cache energy consumption.
Sensitivity to SiS-VLS Cache Size
The SiS-cache size can affect SiS-DRAM energy consumption and performance because the hit ratio differs for each size. Figure 12 shows the energy reduction for four different sizes. The y-axis is the energy consumption normalized to the result of FIX256B for each size. The L1 cache size is 4 KB. Note that our approach is effective for small SC-VLS caches in many cases. The SiS-cache miss rate is high when the cache is small, so the number of SiS-DRAM accesses increases and the energy consumption also increases. In this situation, the SC-VLS cache can reduce the data transfer size compared with the fixed 256 B line size.
Fig. 11 The impact of input data for SiS-cache miss rate.
Fig. 12 The impact of SiS-VLS cache size.
Related Work
To improve cache performance or reduce cache energy consumption, several approaches with the ability to choose appropriate line sizes have been proposed. The line size selection methods are classified into two types, based on static and dynamic selection techniques.
Static selection techniques have been studied. Agarwal et al. [9] present a software approach for controlling memory bandwidth consumption. They target memory references that are likely to exhibit sparse memory access patterns, and employ static analysis of source code to identify memory references that have the potential to access memory sparsely. Vleet et al. [10] proposed using off-line profiling to determine whether the line size should be normal or large. The line size is determined based on their cost model, which describes the trade-off between improving the cache hit ratio and minimizing the additional bytes transferred into the cache. They have also proposed a dynamic selection algorithm. These studies have looked at improving cache performance, but do not describe any mechanism to reduce energy consumption. Grun et al. [11] customize the local memory architecture to suit both the diverse access patterns and the locality types present in an application program. They decrease the main memory bandwidth, thus generating power savings. We also exploit memory access behavior. However, our approach can dynamically change the cache line size. Witchel et al. propose a software-controlled cache line size [12]. The compiler specifies how much data to fetch on a miss, allowing greater cache utilization and reducing the bandwidth requirement. Zhang et al. present a configurable line size cache [13]. The cache has a counter in the cache controller, which specifies how many words to read from the off-chip memory. An embedded system designer would determine the best line size for a program based on simulations or actual executions on the platform. The designer would then modify the boot or reset part of the program to set the cache's configuration registers to the chosen line size. They thus determine the best line size for a whole program. Inoue et al. [14] propose determining an adequate line size based on cache simulation. They also determine the best line size for a whole program.
Our approach, on the other hand, is able to choose an adequate line size for each function of a program.
Dynamic selection techniques have also been proposed. Kumar and Wilkerson suggested using the SFP (Spatial Footprint Predictor) to improve cache miss ratios [15]. The SFP predicts the neighboring words that should be prefetched on a cache miss. Chen et al. proposed a Spatial Pattern Predictor to reduce leakage energy and to improve processor performance [16]. González et al. [17] presented a dual cache architecture which has a temporal cache and a spatial cache. They employ a dynamic prediction mechanism to store data in either the temporal or the spatial cache. Veidenbaum et al. [18] proposed a reconfigurable cache architecture to improve the cache miss rate. They used dynamic selection techniques that are based on monitoring the accesses to a given line and changing the future line size accordingly. The line size adjustment is computed during line replacement based on what was observed during the line's current stay in the cache. Johnson et al. [19] presented the use of a Spatial Locality Detection Table to alternate line fetch sizes between a conventional line size and a macroblock size. Inoue et al. [20], [21] proposed a realistic cache architecture, which is referred to as the dynamic variable line-size cache (D-VLS cache). The D-VLS cache changes its cache line size at run time according to the characteristics of the application programs being executed. A line size determiner selects adequate line sizes based on recently observed data reference behavior.
Dynamic selection techniques select adequate line sizes based on recently observed data reference behavior. These schemes do not require any modification of the instruction set architecture or program code. However, extra mechanisms are needed for online profiling, resulting in power and area overheads which are prohibitive in embedded systems. Most dynamic selection techniques have this weak point. Static selection techniques, on the other hand, do not require such extra mechanisms. Instead, the adequate line sizes have to be analyzed beforehand in order to execute the programs on the system.
To save DRAM energy consumption, many approaches have been proposed. They are classified into compiler-directed [22]-[24], operating system based [25]-[29] and hardware-based [30]-[32] methods. Some approaches [33]-[35] reduce the energy consumption by exploiting the low power (energy) modes of DRAM through management of the DRAM controllers. Our approach, on the other hand, attempts to reduce the number of accesses to DRAM banks with trivial area and performance overheads.
Conclusions
We have proposed a Software-Controllable Variable Line-Size cache (SC-VLS cache) to reduce on-chip DRAM energy consumption for low-power embedded systems. The SiS-cache line size is dynamically changed during program execution with small hardware and performance overheads. In our evaluation, the SC-VLS cache reduces SiS-DRAM energy consumption by up to 88%, compared to a conventional cache whose line size is fixed to 256 B.
In this work, we need to analyze the adequate line sizes using software cache simulation. If there are four SiS-cache line size candidates, we need to run the simulation four times. In our future work, we plan to reduce the analysis costs.
Takatsugu Ono was born in Fukuoka, Japan in 1982. He received the B.E. and M.E. degrees in engineering from Fukuoka University, Japan in 2004 and 2006, respectively. Since 2006, he has been a doctoral student at Kyushu University, Japan. His current interests are performance evaluation techniques, workload characterization, and low power design. He is a member of the IEEE, the IEEE Computer Society, and the IPSJ (Information Processing Society of Japan).
