Abstract: Mobile SoCs with multiple clock domains typically adopt dynamic voltage and frequency scaling (DVFS) for flexible power/energy management. However, adopting a system-level cache alongside a DVFS-enabled CPU may produce an abnormal cache hierarchy (i.e., a latency reversal between the high-level and low-level caches). This can cause performance and energy inefficiency due to slower data delivery and meaningless accesses to intermediate cache levels. To resolve this problem, we propose a DVFS-aware cache bypassing technique. Our technique profiles the latencies of the various cache levels. Based on the profiled information, it adaptively bypasses intermediate cache levels when the cache hierarchy is abnormal and power-gates the bypassed cache for better energy efficiency. According to our evaluation, our technique reduces L2 and system-level cache energy consumption by up to 14.5% while improving performance by up to 0.13% compared to the baseline.
Introduction
Modern mobile computing systems are trying to keep up with high-end system performance. For example, the performance of smartphones and tablet PCs has increased greatly over the past several years [1]. On the other hand, employing high-performance features has also led to high energy consumption in mobile systems. Since mobile devices are generally battery-powered, it is crucial to manage energy consumption properly. Otherwise, the limited battery capacity leads to shorter battery life, eventually resulting in user dissatisfaction.
In terms of performance improvement, modern mobile application processors (APs) or SoCs generally have large cache memories to meet performance requirements and reduce off-chip memory traffic. For example, Apple A9 mobile processors are known to have 4 MB system-level caches [2]. According to [2], this cache operates as a victim last-level cache and temporarily holds data from all IPs in the SoC. In desktop and server processors, large last-level caches (on-chip SRAM, on-chip eDRAM, or off-chip DRAM) are also quite prevalent, as these processors need much higher performance under a less tight power budget than mobile environments.
In terms of energy efficiency, modern mobile system-on-chips (SoCs) widely adopt dynamic voltage and frequency scaling (DVFS) for efficient power/energy management. Mobile processors support a range of voltages and frequencies, and operating systems typically control the voltage and frequency levels depending on the required computing load. For example, when high computational power is required (i.e., when running computationally heavy workloads), the OS increases the voltage and frequency level to meet the performance requirement. In the opposite case, the OS decreases the voltage and frequency to reduce power/energy consumption.
On the other hand, typical mobile SoCs have multiple clock frequency domains [3] for energy efficiency rather than a single global frequency domain. Thus, mobile CPUs, GPUs, various types of controllers (or accelerators), and large system-level caches may run at different clock frequencies at runtime. However, because the IPs in an SoC run at different clock frequencies, an abnormal cache access behavior can arise: a higher-level cache may have a longer latency than the last-level (L3) or system-level cache, which operates in a different clock domain from the CPUs and GPUs. This can happen when the CPUs run at a relatively low clock frequency while the last-level cache operates at a high clock frequency. In this case, accessing the cache inside the CPU, rather than directly accessing the system-level (last-level) cache, may lead to performance degradation and energy inefficiency.
To tackle this problem, we propose a DVFS-aware adaptive cache bypassing technique. Our technique adaptively determines whether or not to use L2 caches depending on the clock frequency of CPU and system-level cache. When L2 caches are not used, they are power-gated and we only use L1 and system-level caches in our cache hierarchy. Through bypassing L2 caches, our technique can lead to better energy-efficiency while improving performance thanks to faster data delivery to the processor cores.
Related work
There have been many studies on cache bypassing. Cache bypassing is particularly effective when there is a large amount of streaming data (used only once and never again), since such data will not be accessed in the future and needlessly occupies effective cache capacity. In general, the core of a cache bypassing technique is deciding which blocks should be bypassed. For example, reuse distance can be a key measure for this decision [4, 5, 6, 7]. Other works use reuse counts to make the bypass decision [8, 9, 10, 11]. In [12], the bypass decision is made by a support vector machine. However, previous studies on cache bypassing have focused on general computer systems or architectures and do not consider DVFS in heterogeneous systems with multiple clock domains (for more detailed information on recent work on cache bypassing mechanisms, please refer to the survey [13]).
For DVFS-aware power management in computer systems, Deng et al. proposed a combined approach for managing CPU and memory system DVFS in server systems [14]. In [15], a multi-component DVFS algorithm is proposed that considers the data and control flow of the IPs in the SoC. Pathania et al. proposed a CPU-GPU coordinated DVFS technique for 3D mobile game applications [16]. For efficient runtime power/energy management, Linux provides various policies that consider the application load [17]. The frequency governor in Linux controls the voltage and frequency levels of multiple IPs in mobile SoCs. However, as far as we know, no work has considered cache hierarchy inefficiency when applying DVFS to the CPU within a mobile SoC.
3 A DVFS-aware cache bypassing technique
Preliminaries
In this subsection, we explain our baseline architecture and assumptions. Our baseline SoC architecture consists of various types of IPs, including CPUs, system-level caches, and specialized accelerators. System-level caches are shared across all IPs inside the SoC. The mobile CPU consists of four processing cores with private L2 caches. The processing cores are modeled after the ARM Cortex-A15 [18] as closely as possible. From the perspective of the CPU, the system-level cache operates as a last-level cache (LLC), although it can be shared across all IPs in the SoC. In this work, we focus only on the mobile CPU and system-level cache because they are the most power-hungry and crucial parts of the mobile SoC. The mobile CPU, system-level cache, and other IPs operate in different clock frequency domains for fine-grained power/energy management [3]. The CPU can operate from 0.4 GHz to 2.0 GHz in steps of 400 MHz, while the system-level cache operates at 1 GHz. Please note that DVFS in modern mobile systems mainly targets the CPU and GPU, while the memory-related system instead uses a fixed clock frequency [16]. Table I summarizes the assumptions for our mobile CPU and system-level caches.
Motivation
As a motivational example, we compare two possible scenarios in our SoC. In case 1 of Fig. 1 (for brevity, only one of the four cores is shown in the figure), both the CPU and the system-level cache operate at their maximum frequencies (2.0 and 1.0 GHz, respectively). In this case, the order of cache access latencies exactly follows the cache hierarchy, meaning that the L1, L2, and system-level cache access latencies are in ascending order (i.e., access latency: L1 cache < L2 cache < system-level cache). This is the normal case of the cache hierarchy. In the other scenario (case 2 of Fig. 1), the CPU operates at a low frequency (400 MHz) while the system-level cache operates at its normal frequency (1.0 GHz). This can happen when light loads are assigned to the CPU cores while heavy loads are assigned to other IPs (thus requiring high system-level cache and memory bandwidth). In this case, the latencies no longer follow the ascending order of the cache hierarchy. As shown in Fig. 1, the L2 cache becomes slower than the system-level cache. If we then follow the typical cache access sequence (L1 → L2 → system-level cache), the result is energy inefficiency and performance degradation. It would therefore be better not to use the L2 cache and to use the system-level cache as the second-level cache instead (i.e., bypass the L2 cache). In addition, if we apply power-gating to the L2 caches in case 2, we may eliminate a large amount of leakage energy consumed by the L2 caches.
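The latency reversal described above follows from converting each cache's cycle latency into wall-clock time in its own clock domain. The following is a minimal sketch of that arithmetic, not the paper's model: the cycle counts `L2_CYCLES` and `SYS_CYCLES` are hypothetical placeholders (the values of Table I are not reproduced here), chosen only so that the crossover falls between 0.8 GHz and 1.2 GHz, and interconnect delay is ignored.

```python
from fractions import Fraction

# Hypothetical cycle counts; the paper's Table I values are NOT reproduced here.
L2_CYCLES = 12        # assumed L2 hit latency, in CPU clock cycles
SYS_CYCLES = 14       # assumed system-level cache latency, in its own clock cycles
SYS_FREQ_MHZ = 1000   # system-level cache clock (1 GHz, as stated in the text)
CPU_FREQS_MHZ = [400, 800, 1200, 1600, 2000]  # CPU DVFS levels from the text


def latency_ns(cycles, freq_mhz):
    """Convert a cycle count at a given clock (in MHz) to nanoseconds, exactly."""
    return Fraction(cycles * 1000, freq_mhz)


def latency_table():
    """Return (cpu_freq_mhz, l2_ns, sys_ns, reversed) for every CPU DVFS level."""
    sys_ns = latency_ns(SYS_CYCLES, SYS_FREQ_MHZ)
    rows = []
    for freq in CPU_FREQS_MHZ:
        l2_ns = latency_ns(L2_CYCLES, freq)
        # The hierarchy is "abnormal" when L2 is no faster than the system-level cache.
        rows.append((freq, l2_ns, sys_ns, l2_ns >= sys_ns))
    return rows


if __name__ == "__main__":
    for freq, l2_ns, sys_ns, is_reversed in latency_table():
        print(f"{freq:5d} MHz  L2={float(l2_ns):5.1f} ns  "
              f"SYS={float(sys_ns):5.1f} ns  reversed={is_reversed}")
```

With these assumed cycle counts, the L2 latency in nanoseconds grows as the CPU clock drops while the system-level cache latency stays constant, so the reversal appears only at the lowest CPU frequencies.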
Our DVFS-aware cache bypassing technique
As we explained in Section 3.2, if the CPU and system-level cache frequencies are not carefully considered, the cache hierarchy can become significantly inefficient in terms of performance and energy. To optimize the performance and energy of cache hierarchies in SoCs with multiple clock frequency domains, we propose a DVFS-aware cache bypassing technique.
Our technique provides two operation modes: normal and bypass. The normal mode uses the entire cache hierarchy (Fig. 2(a)). In contrast, in the bypass mode, we do not use the L2 caches, and L1 cache misses are served directly by the system-level cache (Fig. 2(b)). To reduce leakage energy, we apply power-gating to the L2 caches in the bypass mode.
To determine the appropriate operation mode, we compare the L2 cache access latency (LAT_L2) with the system-level cache latency (LAT_sys). If LAT_L2 is less than LAT_sys, we use the L2 caches and the CPU operates with the normal cache hierarchy (i.e., normal mode). Otherwise, we do not use the L2 caches, and L1 cache misses are served directly by the system-level cache (i.e., bypass mode). This decision is made at every DVFS scheduling time quantum (assumed to be 10 ms in this work). After the operating system determines the clock frequency of the CPU, our technique determines whether to use the L2 caches by comparing LAT_L2 and LAT_sys. Since the CPU clock frequency can be adjusted dynamically at runtime, the L2 caches can likewise be turned on and off dynamically based on this decision.
When we decide to bypass the L2 caches (from 'normal' to 'bypass' mode), the L2 caches are power-gated (turned off) and their contents are lost. The dirty data must therefore be written to the system-level cache first to prevent data corruption. Thus, our technique writes all dirty data in the L2 caches to the system-level cache (i.e., dirty data flushing) before power-gating the L2 caches. During dirty data flushing, CPU operation must be temporarily stopped. This can cause a performance loss because the CPU is frozen until the write-back requests complete. However, the impact on performance is marginal since there are only a limited number of dirty cache lines in the L2 caches. In the opposite case, where we decide to use the L2 caches again (i.e., from 'bypass' to 'normal' mode), the L2 caches are empty, resulting in compulsory misses. This may also cause a performance loss, as it takes time to warm up the L2 caches. However, the performance overhead from compulsory misses is also negligible. Please note that we fully modeled the performance overhead of dirty data flushing and compulsory misses in our performance evaluation (see Section 4.2.2).
Fig. 3 shows a brief hardware block diagram of our technique. The bypass controller resides between the L1 and L2 caches to control cache bypassing. The bypass controller has three main components: the control logic, the data consistency logic, and the L2 power gating logic. At every DVFS interval, the control logic determines whether to bypass the L2 cache. It simply compares LAT_L2 and LAT_sys, and if LAT_L2 is greater than or equal to LAT_sys, it decides to bypass the L2 caches. If the operation mode changes from 'bypass' to 'normal' (b_to_n change), the control logic disables both the L2 power gating logic (so that the L2 caches can be turned on) and the data consistency logic. In the opposite case, where the operation mode changes from 'normal' to 'bypass' (n_to_b change), the control logic enables both the L2 power gating logic and the data consistency logic. The data consistency logic generates write-back requests to the system-level cache for all dirty cache lines in the L2 caches, which guarantees functional correctness. The L2 power gating logic applies power-gating to the L2 caches, since the L2 caches are not used until the operation mode changes back to 'normal'. In the next subsection, we explain our controller implementation in detail.
Controller implementation and cost
We implemented our bypass controller on a Xilinx Zynq-7000 FPGA board as a proof of concept and verified the logic to ensure correct operation. There are three main blocks in our controller: control logic, data consistency logic, and power gating logic. In the control logic (Fig. 4), there are input ports that receive the core frequency (Cpu_freq) and the system-level cache frequency (Cache_freq). The cycle-level latencies of the L2 and system-level caches are stored in L2_var and SYS_var, respectively. Based on this information, the control logic calculates the latencies of the L2 and system-level caches and stores them in LAT_L2 and LAT_SYS (corresponding to LAT_L2 and LAT_sys above), respectively. It then compares LAT_L2 and LAT_SYS. If LAT_L2 is equal to or greater than LAT_SYS and there is a frequency change (i.e., Check_in = 1), it drives '0' on L2_on and '1' on DataConsistency_enable (n_to_b change). In the opposite case, where LAT_L2 is smaller than LAT_SYS and Check_in = 1 (b_to_n change), it drives '1' on L2_on and '0' on DataConsistency_enable. If there is no change in Cpu_freq or Cache_freq (i.e., Check_in = 0), no comparison is performed and the control logic simply maintains its current state.
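The control logic described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the authors' RTL: the signal names (Cpu_freq, Cache_freq, L2_var, SYS_var, Check_in, L2_on, DataConsistency_enable) follow the text, while the class `ControlLogic`, its `step` method, the initial state, and the use of integer cross-multiplication in place of an explicit division are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class ControlLogic:
    """Behavioral model of the control logic; signal names follow the text."""
    l2_var: int                       # L2 latency in CPU cycles (L2_var)
    sys_var: int                      # system-level cache latency in its cycles (SYS_var)
    l2_on: int = 1                    # assumed reset state: L2 powered (normal mode)
    data_consistency_enable: int = 0  # assumed reset state: no flush requested

    def step(self, cpu_freq_mhz, cache_freq_mhz, check_in):
        """Evaluate one DVFS interval; return (L2_on, DataConsistency_enable)."""
        if check_in == 0:
            # No frequency change: skip the comparison and hold the current state.
            return self.l2_on, self.data_consistency_enable
        # LAT_L2 = L2_var / Cpu_freq and LAT_SYS = SYS_var / Cache_freq.
        # Cross-multiplying keeps the comparison in integers (no division).
        lat_l2_scaled = self.l2_var * cache_freq_mhz
        lat_sys_scaled = self.sys_var * cpu_freq_mhz
        if lat_l2_scaled >= lat_sys_scaled:
            # Bypass mode: gate the L2 and request a dirty-data flush.
            self.l2_on, self.data_consistency_enable = 0, 1
        else:
            # Normal mode: power the L2 and disable the flush logic.
            self.l2_on, self.data_consistency_enable = 1, 0
        return self.l2_on, self.data_consistency_enable
```

For example, with hypothetical latencies `ControlLogic(l2_var=12, sys_var=14)`, calling `step(400, 1000, 1)` selects bypass mode, and a subsequent `step(2000, 1000, 0)` holds that state because Check_in is 0. The model follows the description literally, so DataConsistency_enable is driven to '1' on every bypass decision, including when the controller is already in bypass mode.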
In the data consistency logic (Fig. 5(a)), if DataConsistency_enable is equal to '1', it accesses a 512×8-bit memory array (which maintains dirty-bit information) and sends write-back requests via the wbreg port. When updating the array, it accepts a 9-bit set index and 8-bit Data from the L2 cache controller and sets the mem_Write signal to '1'. The wbreg_out_end signal is used for synchronization between the data consistency logic and the control logic.
The L2 power gating logic (Fig. 5(b)) receives only the L2_on signal. If the L2_on signal is asserted, the L2 power gating logic inverts the signal and delivers it to the L2 cache controller. Please note that L2 cache power gating can be implemented using various schemes (e.g., applying Gated-Vdd [20] to SRAM cells); thus, the implementation of the power gating circuit is outside the scope of this work.
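The dirty-bit array in the data consistency logic can be sketched in software as follows. This is an illustration only and rests on one assumption not stated in the text: that each 8-bit entry holds one dirty bit per way of an 8-way set, so that a 9-bit set index and 8-bit Data identify the dirty lines of one set. The class `DirtyBitArray` and its methods are hypothetical names.

```python
NUM_SETS = 512   # addressed by the 9-bit set index
NUM_WAYS = 8     # ASSUMPTION: each bit of the 8-bit Data is one way's dirty bit


class DirtyBitArray:
    """Software model of the 512x8-bit dirty-bit memory array."""

    def __init__(self):
        self.mem = [0] * NUM_SETS

    def write(self, set_index, data):
        """Update one entry, as done by the L2 cache controller (mem_Write = 1)."""
        if not 0 <= set_index < NUM_SETS:
            raise ValueError("set index must fit in 9 bits")
        if not 0 <= data < (1 << NUM_WAYS):
            raise ValueError("data must fit in 8 bits")
        self.mem[set_index] = data

    def flush(self):
        """Return (set, way) write-back requests for all dirty lines, then clear."""
        requests = []
        for set_index, bits in enumerate(self.mem):
            for way in range(NUM_WAYS):
                if bits & (1 << way):
                    requests.append((set_index, way))
            self.mem[set_index] = 0  # lines are clean once written back
        return requests
```

In this sketch, `flush()` plays the role of the write-back stream issued when DataConsistency_enable is '1'; the number of returned requests corresponds to the number of dirty lines that must be written to the system-level cache before the L2 is power-gated.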
Evaluation

Evaluation framework
We evaluate our bypassing technique in terms of performance and energy. The M-SIM architectural simulator [21] was used for performance evaluation. We fully modeled our bypassing technique and the DVFS algorithm in the simulator. Since multi-core CPUs are widely used these days, we modeled a quad-core CPU. Each core has a private L2 cache, and the system-level cache works as an L3 (last-level) cache shared across the four CPU cores. The processor architectural parameters are summarized in Table I. We run 10 multi-programmed workload groups. Each group consists of four different SPEC2006 workloads (one workload per core), and the workload groups are the same as those used in [22]. For accuracy, we fast-forward (warm up) 2 billion instructions per process and then simulate 1 billion instructions (the simulation finishes once any one of the four processes reaches the 1 billion instruction limit).
The processor core operates between 0.4 GHz and 2.0 GHz. The frequency step is 400 MHz, meaning that the possible clock frequencies of the CPU are 0.4, 0.8, 1.2, 1.6, and 2.0 GHz. To model realistic DVFS behavior, we modeled three different DVFS scenarios: random, updown, and set400. The 'random' scenario changes the DVFS level randomly every 10 ms. The 'updown' scenario changes the DVFS level by one step at each 10 ms time quantum, which resembles a fluctuating CPU load (thrashing between heavy and light loads) under the conservative governor in Linux. The 'set400' scenario always sets the clock frequency to 0.4 GHz, which is similar to the powersave governor in Linux. Fig. 6 depicts the three DVFS scenarios.
For energy evaluation, we evaluate L2 and system-level cache energy. Energy parameters are estimated with the CACTI cache modeling tool [23]. Since the L2 caches can operate at various voltage and frequency levels, we also extracted the L2 dynamic energy and leakage power for each DVFS level. Both the L2 and system-level caches are modeled as 6T SRAM-based caches. Table III shows the per-access dynamic energy and leakage power parameters. Total energy consumption is calculated using the cache access statistics, which are also collected from the M-SIM simulator.
Evaluation results
Energy
Fig. 7 shows the normalized energy results of our proposed technique compared to the baseline for the three DVFS scenarios (random, updown, and set400). In the 'random' and 'updown' scenarios, our DVFS-aware bypassing technique reduces L2 and system-level cache energy consumption by 5.9% and 5.5% compared to the baseline, respectively. The amount of energy saving depends on the ratio of time spent in the normal mode to time spent in the bypass mode. In the 'random' scenario, our technique operates in the bypass mode for 40% of the execution time on average, saving a large amount of L2 cache energy. In the 'updown' scenario, 37.5% of the execution time is spent in the bypass mode, resulting in an energy saving of 5.5% in the L2 and system-level caches (slightly lower than in the 'random' scenario). In the 'set400' scenario, our DVFS-aware bypassing technique saves 14.5% of L2 and system-level cache energy. This is because our technique keeps the L2 cache in the bypass mode for nearly 100% of the execution time, resulting in almost full power-gating of the L2 caches.
Fig. 8 shows the energy breakdown of the baseline and our proposed technique across the three DVFS scenarios. The energy saving from our technique mainly comes from the reduction in L2 cache leakage energy. In the L2 and system-level caches, since most cache accesses are filtered by the L1 caches, dynamic energy consumption is almost negligible compared to leakage energy consumption. If we account only for L2 cache energy (i.e., not considering system-level cache energy), our technique reduces L2 cache energy by 37.4%, 34.8%, and 99.9% compared to the baseline in the 'random', 'updown', and 'set400' scenarios, respectively. Since our technique fully bypasses the L2 caches in the 'set400' scenario, we obtain nearly 100% L2 cache energy saving.
In terms of system-wide energy consumption, our technique also leads to non-negligible energy savings. We performed system-wide energy estimation using McPAT [19] with a 32 nm technology node. We show the case of workload group8 as a conservative estimate (the least energy saving across the workload groups). Table IV summarizes the ratio of per-component energy consumption to total energy consumption when running workload group8. As shown in Table IV, the L2 and L3 caches consume a non-negligible share of the energy (49.9% of the total energy). Based on our estimation, our bypassing technique leads to system-level energy reductions of 3.1%, 2.6%, and 7.1% (on average) in the 'random', 'updown', and 'set400' scenarios, respectively. In addition, we can expect larger system-wide energy savings for the other workload groups.
Performance
Fig. 9 shows the performance comparison across the three DVFS scenarios. In the 'random' and 'updown' scenarios, our technique shows performance benefits of 0.03% and 0.05%, respectively. The performance benefit of our proposed technique mainly comes from faster data delivery at low CPU clock frequencies (0.4 GHz and 0.8 GHz). Although our technique enables faster data delivery than the baseline, a lower AMAT (average memory access time) does not directly translate into a performance benefit. This is because the benefit can be offset by the overhead of dirty data flushing and compulsory cache misses during operation mode changes. As we explained in Section 3.3, the entire CPU operation must be stopped during dirty data flushing. In addition, in the case of a b_to_n change, the L2 caches are entirely empty and take time to warm up (i.e., a non-negligible number of compulsory misses may occur). In the 'set400' scenario, our technique improves performance by 0.13% because this scenario incurs no operation mode changes while providing a lower AMAT than the baseline.
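The bypass-mode residencies reported for the three DVFS scenarios (40%, 37.5%, and nearly 100%) can be cross-checked with a short sketch. It assumes that bypass mode is selected at 0.4 GHz and 0.8 GHz, which is inferred from the performance discussion rather than stated as a rule; the function names and the starting level of the 'updown' sequence are illustrative, and the generators only approximate the scenarios of Fig. 6.

```python
import random
from fractions import Fraction

LEVELS_MHZ = [400, 800, 1200, 1600, 2000]  # CPU DVFS levels from the text
BYPASS_LEVELS_MHZ = {400, 800}             # ASSUMPTION: levels where bypass mode is chosen


def updown(num_quanta):
    """Move one DVFS level per 10 ms quantum, reversing direction at either end."""
    seq, idx, step = [], 0, 1
    for _ in range(num_quanta):
        seq.append(LEVELS_MHZ[idx])
        if idx + step < 0 or idx + step >= len(LEVELS_MHZ):
            step = -step
        idx += step
    return seq


def set400(num_quanta):
    """Always run at 0.4 GHz."""
    return [400] * num_quanta


def random_levels(num_quanta, seed=0):
    """Pick a DVFS level uniformly at random for every quantum."""
    rng = random.Random(seed)
    return [rng.choice(LEVELS_MHZ) for _ in range(num_quanta)]


def bypass_fraction(seq):
    """Fraction of quanta spent at a frequency where the L2 is bypassed."""
    hits = sum(1 for freq in seq if freq in BYPASS_LEVELS_MHZ)
    return Fraction(hits, len(seq))
```

Under this assumption, one full 'updown' period visits eight levels, three of which (0.4, 0.8, and 0.8 GHz) are bypass levels, giving 3/8 = 37.5%; a uniform random choice over five levels gives an expected 2/5 = 40%; and 'set400' gives 100%. These fractions are also close to the reported L2-only energy savings (34.8%, 37.4%, and 99.9%), which is consistent with L2 energy being dominated by leakage.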
Conclusions
In this paper, we propose a DVFS-aware cache bypassing technique that improves performance and energy efficiency under DVFS-enabled CPUs in SoCs. In addition, we implement our proposed controller as a proof of concept. Our technique reduces the energy consumption of the L2 and L3 caches by up to 14.5% compared to the baseline and reduces system-wide energy consumption by up to 7.1%. Furthermore, it slightly improves performance, by up to 0.13% compared to the baseline.
