Energy harvesting is one of the most promising battery alternatives for powering future generations of embedded systems in the Internet of Things (IoT). However, energy harvesting powered embedded systems suffer from frequent execution interruptions due to unstable energy supply. To bridge intermittent program execution across power cycles, the non-volatile processor (NVP) was proposed to checkpoint register contents during a power failure. Together with register contents, the cache contents also need to be preserved during a power failure. While a pure non-volatile memory (NVM) based cache is an intuitive option, it suffers from inferior performance due to high write latency and energy overhead. In this paper, we propose replacement and checkpointing policies for an SRAM and NVM based hybrid cache in NVPs whose execution is interrupted frequently. Checkpointing aware cache replacement policies and smart checkpointing policies are proposed to achieve satisfactory performance, efficient checkpointing upon a power failure, and fast resumption when power returns. The experimental results show that the proposed architecture and policies outperform existing cache architectures for NVPs.
INTRODUCTION
The applications of Internet of Things (IoT), such as smart manufacturing, smart city and transportation, and smart energy, have been and will continue to transform the way we live in a positive way. In IoT applications, small sensors and systems are used to collect information of interest to support optimal decision making. It is predicted that IoT will consist of 50 billion objects by 2020 [1]. While the vision is promising and exciting, there are several challenges in achieving this goal. One of the most important challenges is how to power these 50 billion embedded devices. While battery power has been the energy source for most embedded systems these days, it is not a favorable solution in the long run due to size, longevity, safety, and recharging concerns. Therefore, researchers are actively pursuing power alternatives. Among all solutions, energy harvesting is one of the most promising techniques to meet the requirements of large scale embedded devices.
CODES/ISSS
Energy harvesters generate electric energy from their ambient environment using direct energy conversion techniques. Examples of power sources include kinetic, light, RF, and thermal energy. The obtained energy can be used to charge a capacitor that powers the electronics. However, harvested energy has an intrinsic drawback: all of these sources are unstable. With an unstable power supply, the whole computer system is interrupted frequently, which causes severe performance degradation. Worse, large tasks may never finish since the intermediate results cannot be saved.
To bridge intermittent execution under an unstable energy supply, nonvolatile processors (NVPs) [17, 30, 39] based on non-volatile memories (NVMs) [18, 30, 20] were proposed. Upon a power failure, the program state, including registers and caches, is saved in NVM. When the power comes back on, the saved contents are loaded back to the registers and caches so that execution can continue from where it was interrupted. Since NVMs retain data even when the power is off, they can preserve the computation state across power cycles.
In existing NVPs, ferroelectric memory (FRAM) based NV registers were adopted, where a FRAM cell is attached to each standard flip-flop. The standard flip-flops are accessed during normal execution and the FRAM cells are used to save the state during a power failure. The same design strategy could be adopted for caches in NVPs. Li et al. [12] integrate an SRAM cell and a non-volatile element at the cell level, forming a direct bit-to-bit connection. In this design, the NVM part is underutilized because the NV elements are idle most of the time, and the area is unnecessarily large. Therefore, it is not always desirable. Another possible design is a pure NVM based cache [16]. However, since NVMs such as STT-RAM [7] and PCM [19, 21] typically have high write latency and energy overhead, a pure NVM based cache becomes the major performance bottleneck.
Unlike the two designs mentioned above, the hybrid cache architecture, which consists of both SRAM and NVM (e.g., STT-RAM), was proposed to achieve energy efficiency and high performance [31, 28, 33, 37]. It is also a promising cache architecture for energy harvesting systems since it offers both high efficiency and non-volatility. However, all existing hybrid cache architectures and policies are designed for energy efficiency and performance. None of them considers checkpointing efficiency. Therefore, in this paper, we aim to develop a checkpointing aware hybrid cache architecture and policies that achieve the following goals: 1) high performance; 2) full utilization; 3) reliable and efficient checkpointing during a power failure.
The complexity lies in the following aspects. First, during a power failure, the available energy for checkpointing is limited by the capacitance of the capacitor. Second, NVM is both used for normal cache access and for preservation of the volatile blocks at power failure. The usage of its space between these two purposes should be balanced to achieve the best performance. Third, it is also challenging to identify data that is unnecessary to checkpoint. Consequently, the cache architecture and cache management policies must be carefully investigated for energy harvesting powered systems. In order to answer these challenges, this work makes the following contributions:
• We present a hybrid cache architecture built with SRAM and STT-RAM, and STT-RAM is fully utilized not only for normal cache access but also for volatile SRAM checkpoint.
• We design replacement and migration policies to balance the usage of STT-RAM space between normal access and checkpoint. Proactive write back policy is also designed to guarantee successful and efficient checkpointing.
• We propose an efficient checkpointing policy to save all necessary volatile blocks to STT-RAM before capacitor energy is depleted upon a power failure.
The remainder of this paper is organized as follows. Section 2 presents the related works. Section 3 describes the system architecture. Section 4 describes the motivation of this paper. Section 5 presents the new cache architecture and checkpointing policy. Detailed experimental evaluation is provided in Section 6. Finally, Section 7 concludes this paper.
RELATED WORKS
In this section, we will describe related works on energy harvesting systems, NVPs, and cache architectures.
Energy harvesting system: Energy harvesting technologies are promising long-lasting replacements for batteries in embedded systems. Ambient sources such as solar [26, 6], motion [25, 10], radio-frequency (RF) electromagnetic radiation, and thermal gradients [9, 11] can provide enough energy for embedded systems to be completely self-sustainable. For ultra low power devices, sources of low power density, such as micro-solar [26] and body heat (2.4∼4.8 W), should be able to provide sufficient power to drive the devices at low duty cycles [9]. For systems that require high reliability, multiple energy sources can be employed [32]. For example, [26] designs a micro-solar powered sensor network, [27] proposes a nonvolatile microprocessor powered by a solar energy harvesting system that operates even under low solar irradiance, and [22] designs a self-powered pushbutton controller that functions with a piezoelectric conversion mechanism. Even when the harvested power is lower than the power required by the complete system, it is still possible to operate the system with proper energy management. [9] presents a self-powered wearable health monitoring system where each module works under a different number of recharge cycles.
However, the instability of energy harvesting sources leads to power failures and, as a consequence, state loss, which impedes the wide adoption of large software in energy harvesting powered embedded systems. This problem has attracted a great deal of interest from both academia and industry. Researchers have deployed NV memory in energy-harvesting devices to store the execution state [39, 23, 24, 14] because of its non-volatility and low leakage. Among all these works, the non-volatile processor appears as a promising solution to bridge intermittent program execution under unstable power.
Non-volatile processor: An NV processor attaches a nonvolatile memory cell to each volatile element and therefore allows fast local backup of intermediate results and fast recovery. FRAM based processors [39, 17, 36] present great potential to be deployed in energy-harvesting devices. They show many characteristics that are desirable for energy-harvesting systems, such as no battery, zero standby power, and fast access. FRAM also has a superior endurance of as many as $10^{14}$ write cycles. For example, Yu et al. [8] propose a non-volatile processor architecture which integrates non-volatile elements into volatile memory at bit granularity. Wang et al. [30] design a FRAM based processor, which attaches an NV FRAM cell to each volatile standard flip-flop. The flip-flops are accessed during normal execution while the FRAM cells are used to checkpoint the states of the flip-flops at power failure. To reduce the backup overhead and energy, different techniques have been proposed, including instruction scheduling [35], register reduction [38], compare-and-write [29], and instruction selection [34].
Cache architecture: In addition to register contents, cache contents in NVPs should also be saved to ensure correctness and fast resumption. Among all existing cache designs, two options can be adopted in energy harvesting systems. One is a pure NVM based cache, which replaces SRAM with NVM entirely [16]. However, this design incurs a large write overhead and degrades system performance, especially when the cache is part of a pipeline stage. Another option is NVSRAM [15], which integrates an SRAM cell and a non-volatile element at the cell level, forming a direct bit-to-bit connection. In this design, the NVM part is underutilized because the NV elements are idle most of the time, and the area is unnecessarily large. There are other designs based on both SRAM and NVM that achieve energy efficiency and high performance [31, 33]. However, all existing hybrid cache architectures and policies are designed for energy efficiency and performance. None of them considers the scenario of intermittent power supply, and therefore they are not resistant to power failure. In this paper, we design a checkpointing aware cache architecture for energy harvesting systems.
BACKGROUND
In this section, the system architecture will be presented and the design of the energy harvesting embedded systems will be explored.
System Architecture
The target energy harvesting based embedded system is shown in Figure 1. The system is powered with energy harvested from ambient sources, such as solar, thermal, piezoelectric, or radio frequency (RF) energy. In addition to the harvested energy, a small storage capacitor is used to supply energy for checkpointing during a power failure. The NVP consists of three elements: NV registers, an NV hybrid cache, and NV main memory. Since all three parts of the processor have non-volatile components, the execution state can be preserved when power is interrupted under an unstable power supply. After power returns, the NVP restores the execution state and resumes execution from the interrupted point.
[Figure 1. Blocks shown: Ambient Energy; Energy Harvesting and Management; Energy Storage; Peripheral Devices; Nonvolatile Processor; Register]

The small storage capacitor allows execution progress to accumulate across power cycles by supporting checkpointing. Notably, with the proposed cache architecture, the energy stored in this capacitor is sufficient for a successful checkpoint whenever a power failure occurs. Moreover, checkpointing is always necessary. Without it, although the states in SRAM can be retained for a while with the energy in the capacitor, everything in SRAM is lost if power does not come back before the capacitor is depleted.
Non-volatile Registers
Figure 2(a) shows an NV processor that is based on the Intel 8051 micro-controller. Figure 2(b) shows the architecture of an NV register based on ferroelectric memory. In this NV register, the left side is a standard two-stage flip-flop. A ferroelectric memory based non-volatile storage element is attached to the standard flip-flop. The content of the standard flip-flop can be copied to the ferroelectric memory to save the state. This design makes the processor's registers non-volatile, so execution can be quickly recovered after a power failure.
Checkpointing Aware Cache
As mentioned in the introduction, there are currently two designs for caches in NVPs. Li et al. [12] integrate an SRAM cell and a non-volatile element at the cell level, forming a direct bit-to-bit connection. In this design, the NVM part is underutilized because the NV elements are idle most of the time. Another possible design is a pure NVM based cache [16]. However, since NVMs typically have high write latency, a pure NVM based cache becomes the performance bottleneck. In this work, we propose a checkpointing aware hybrid cache which consists of both SRAM and STT-RAM. Hybrid cache architecture is not new. However, most existing designs aim for high performance and low power consumption. [31, 33] are two representative works on hybrid caches, which build the last level cache with a small region of SRAM for fast access and a large region of NVM for large capacity. However, these hybrid caches are not specifically designed for energy harvesting embedded systems, where often only a single-level L1 cache is implemented. Therefore, if this kind of hybrid cache is directly used as a first level cache, the performance will be largely degraded and important states in SRAM will be lost when there is a power failure. Instead, the cache proposed in this work aims for reasonably high performance, reliable checkpointing, instant resumption, and low energy consumption. To give a better understanding of the targeted architecture, we compare it with existing cache designs in Table 1.
MOTIVATION
In this section, we show the motivation of this work. First, the performance overhead of building a pure STT-RAM cache is presented. Then, several observations are provided to show that the traditional hybrid cache architecture is not suitable for energy harvesting systems. An energy harvesting powered processor built with a non-volatile cache is able to overcome the data loss caused by frequent interrupts. However, a purely non-volatile cache degrades performance due to its expensive access time. Among all NVMs, STT-RAM is considered the most promising candidate for building a non-volatile cache because of its fast access time and high density. However, its write latency is 10 times as large as its read latency [7]. In a pipelined processor, the slowest pipeline stage determines the clock cycle. We configured two processors, one with a pure SRAM cache and one with a pure STT-RAM cache, in gem5 [2] and compared their performance. The SRAM cache and the STT-RAM cache are of the same size and their features are shown in Table 4. Figure 3 shows the performance comparison of these two systems. The figure shows that the STT-RAM based cache increases system execution time by 50% on average when compared with the SRAM based cache.
Compared with the pure STT-RAM solution for energy harvesting NVPs, a hybrid cache architecture based on SRAM and NVM can be a better solution. During a power failure, all dirty blocks in SRAM need to be saved since they have been modified. Figure 4 shows the percentage of dirty blocks for four different benchmarks throughout their execution time. From the figure, we can see that the percentage of dirty blocks is high for most benchmarks except sperand. In existing hybrid caches, only the last level consists of NVM, and NVM is also used for storing dirty data. Therefore, in the presence of a power failure, it is possible that most NVM blocks are dirty and should not be overwritten. Thus, dirty data in SRAM still needs to be written back to non-volatile main memory. Without any optimization, the checkpoint process is both time and energy consuming due to the large amount of data to write. Even worse, the checkpoint may fail because of the limited energy provided by the capacitor. In this work, we adjust the usage of NVM blocks dynamically such that certain blocks are reserved for possible checkpointing. Another observation from Figure 4 is that, if we reserve NVM blocks for possible checkpointing, there might be extra clean NVM blocks available after saving all dirty SRAM blocks to NVM blocks for benchmarks with few dirty blocks, such as sperand. In such cases, additional clean blocks can be checkpointed during a power failure to improve system performance after resuming. The proposed cache architecture takes the utilization of NVM, checkpoint efficiency, and system performance after resuming into consideration.

Figure 4: Percentage of dirty blocks during lifetime
CHECKPOINTING AWARE CACHE ARCHITECTURE
In this section, we present a checkpointing aware cache architecture that achieves both high performance and checkpointing efficiency. The hybrid cache architecture is presented in Section 5.1. The replacement and migration policies are presented in Section 5.2. Finally, the checkpointing policy is presented in Section 5.3.
Hybrid Cache Architecture
In a typical processor for embedded systems, there is often only one level of cache and the cache is not large. Besides, it takes longer to access a second level cache than the first level. Therefore, in this paper, a hybrid one-level cache architecture is designed. The proposed hybrid cache architecture is shown in Figure 5. In each cache set, there are both SRAM cache blocks and STT-RAM cache blocks. For the SRAM portion, there is a counter, DBCounter, that records the current number of dirty SRAM cache blocks. For each cache set, there is an SRAM based migration counter, MCounter, for checkpointing purposes. Besides, each cache block has three state bits: a valid bit (VB), a dirty bit (DB), and a live bit (LB). These three bits are used to direct cache placement. For illustration purposes, the figure shows a four-way set-associative cache with two SRAM blocks and two STT-RAM blocks in each set. However, the cache can have other associativity settings with flexible ratios of SRAM to STT-RAM. When a power failure is detected, the necessary SRAM cache blocks are backed up to STT-RAM. When power returns, no restoration process is needed to restart execution.
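The per-block state bits and the two counters can be pictured with a small data model. The following is only an illustrative sketch: the class names `CacheBlock`, `CacheSet`, `HybridCache` and the helper `make_set` are hypothetical and do not come from the paper, and the four-set, 2+2 configuration merely mirrors the illustration used in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheBlock:
    volatile: bool        # True for an SRAM block, False for an STT-RAM block
    tag: int = 0
    valid: bool = False   # VB
    dirty: bool = False   # DB
    live: bool = True     # LB, maintained by the dead block predictor


@dataclass
class CacheSet:
    blocks: List[CacheBlock]
    mcounter: int         # per-set migration counter (MCounter)


@dataclass
class HybridCache:
    sets: List[CacheSet]
    db_counter: int = 0   # DBCounter: dirty SRAM blocks in the whole SRAM part


def make_set(n_v: int = 2, n_nv: int = 2) -> CacheSet:
    """Build one set with n_v SRAM blocks and n_nv STT-RAM blocks."""
    blocks = [CacheBlock(volatile=True) for _ in range(n_v)]
    blocks += [CacheBlock(volatile=False) for _ in range(n_nv)]
    # MCounter starts at the number of non-volatile blocks in the set.
    return CacheSet(blocks=blocks, mcounter=n_nv)


# Hypothetical configuration: four sets, each four-way (2 SRAM + 2 STT-RAM).
cache = HybridCache(sets=[make_set() for _ in range(4)])
```

Keeping MCounter per set and DBCounter global reflects the two different limits they guard: space within a set, and capacitor energy for the cache as a whole.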
The novelties of the proposed hybrid cache architecture lie in the following aspects:
• This is the first time that STT-RAM is implemented as the first level cache in embedded systems. The high write latency of STT-RAM is kept off the critical path by searching only the volatile blocks in the same set when there is a write request.
• The migration and write-back policies directed by DBCounter and MCounter fully utilize the fast access speed of SRAM and the space of STT-RAM, while reserving a portion of the STT-RAM for checkpointing upon power failures.
• The cache architecture supports reliable checkpointing. When there is power outage, the most important states in SRAM are always successfully backed up to STT-RAM. When power returns, the execution can resume instantly without restoration.
Replacement and Migration Policies
This section will discuss the checkpointing aware placement policy and migration policy. These two policies will be able to maintain the overall efficiency of the cache architecture, as well as guaranteeing a reliable and fast checkpointing at each power interrupt.
Replacement Policy
The basic replacement policy is the dead block based LRU replacement policy, assisted by proactive write back. The technique proposed in [13] is used to predict dead blocks. Before we explain the details of the policy, we define the following notation. Suppose the cache is N-way set associative and there are M sets in total. In each set of the hybrid L1 cache, $N_v$ cache blocks are volatile and $N_{nv}$ cache blocks are non-volatile, and the blocks are distributed among multiple banks. Therefore, we have:

$$N = N_v + N_{nv} \qquad (1)$$
Upon a power failure, there should be enough space for checkpointing the dirty volatile cache blocks in the same set. Therefore, the following constraint should be satisfied for every set $i$:

$$DB_v^i \le CB_{nv}^i \qquad (2)$$

In this inequality, $DB_v^i$ is the number of dirty volatile cache blocks in set $i$, and $CB_{nv}^i$ is the number of clean non-volatile cache blocks in set $i$.
Suppose the total available energy in the storage capacitor can only support checkpointing $T$ cache blocks. To guarantee a successful checkpoint, the total number of dirty cache blocks in SRAM should not exceed this threshold. That is,

$$\sum_{i=1}^{M} DB_v^i \le T \qquad (3)$$
The replacement policy will be directed by the inequality (2) and inequality (3). To satisfy these two conditions, we implemented two counters in the cache structure: DBCounter for SRAM portion and MCounter for each cache set.
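The two conditions can be checked mechanically: each set needs at least as many clean non-volatile blocks as it has dirty volatile blocks, and the total number of dirty volatile blocks must fit within the energy budget. The sketch below is illustrative only; the function name `checkpoint_is_guaranteed` and its list-based inputs are hypothetical, and it assumes the energy threshold is inclusive (exactly T dirty blocks can still be saved).

```python
from typing import Sequence


def checkpoint_is_guaranteed(dirty_volatile: Sequence[int],
                             clean_nonvolatile: Sequence[int],
                             t: int) -> bool:
    """Check the per-set space condition and the global energy condition.

    dirty_volatile[i]    -- dirty SRAM blocks in set i
    clean_nonvolatile[i] -- clean STT-RAM blocks in set i
    t                    -- blocks the capacitor energy can checkpoint
    """
    # Space: every set can absorb its own dirty SRAM blocks.
    space_ok = all(db <= cb
                   for db, cb in zip(dirty_volatile, clean_nonvolatile))
    # Energy: the total dirty SRAM count stays within the budget
    # (inclusive bound is an assumption of this sketch).
    energy_ok = sum(dirty_volatile) <= t
    return space_ok and energy_ok
```

For example, with two sets holding 1 and 2 dirty SRAM blocks, 1 and 2 clean STT-RAM blocks, and a budget of 3 blocks, both conditions hold; lowering the budget to 2 violates the energy condition.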
The DBCounter records the current total number of dirty cache blocks in the SRAM part of the cache. When the DBCounter exceeds the preset threshold, a DVvictim block in the same cache set is written back to the main memory. By writing back a dirty block, the total number of dirty volatile cache blocks stays at or below the preset threshold, so that checkpointing can always succeed with the energy in the capacitor.
The MCounter is used to identify whether there is enough space in the non-volatile portion for checkpointing the dirty volatile cache blocks in the same set. The size of the counter depends on the number of non-volatile cache blocks in each set, $N_{nv}$. The number of bits for MCounter in each cache set is set as follows:
This counter tracks the state of the cache blocks in the same set. Its value is initially set to $N_{nv}$. If one cache block in the set becomes dirty, the value of MCounter is decremented by one. If one dirty cache block becomes clean, the value is incremented by one. Once the value reaches zero, there is no space left in STT-RAM for checkpointing more SRAM blocks. Therefore, from this point on, a DNVvictim (or a DVvictim if there is no DNVvictim) is proactively written back if another cache block becomes dirty in the same set.
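The bookkeeping of the two counters can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the hardware design: the Python classes `MCounter` and `DBCounter` and their method names are hypothetical, and the model lets MCounter drop below zero momentarily to signal that a proactive write back is required.

```python
class MCounter:
    """Per-set counter of remaining checkpoint space (hypothetical model)."""

    def __init__(self, n_nv: int) -> None:
        self.value = n_nv            # initially the number of STT-RAM blocks

    def block_dirtied(self) -> bool:
        """Record a newly dirty block; True means write back a victim."""
        self.value -= 1
        return self.value < 0

    def block_cleaned(self) -> None:
        """Record that a dirty block became clean (e.g. after write back)."""
        self.value += 1


class DBCounter:
    """Cache-wide count of dirty SRAM blocks (hypothetical model)."""

    def __init__(self, threshold: int) -> None:
        self.value = 0
        self.threshold = threshold   # blocks the capacitor can checkpoint

    def sram_block_dirtied(self) -> bool:
        """Record a newly dirty SRAM block; True means write back a DVvictim."""
        self.value += 1
        return self.value > self.threshold

    def sram_block_cleaned(self) -> None:
        self.value -= 1


# Usage: a set with two STT-RAM blocks absorbs two dirty blocks;
# the third one triggers a proactive write back, which cleans one block.
m = MCounter(n_nv=2)
events = [m.block_dirtied() for _ in range(3)]
m.block_cleaned()
```

After the three events and the write back, the counter is back at zero, which again means "no spare checkpoint space" rather than an error state.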
When there is a read, all cache blocks in the same set will be searched. Since STT-RAM's read latency is close to SRAM's read latency, this will not slow down the pipeline stage. In this way, STT-RAM blocks are not only used for checkpointing as in [12] but also for data accesses. If it is a read hit, it will be served right away. If it is a read miss, data will be loaded to the blocks in this set. Due to the maintenance of MCounter, there are always clean blocks. The priority of the destination blocks will be clean SRAM > clean STT-RAM. LRU dead block will be chosen first as victim. If there is no LRU dead block, then LRU live block will be chosen.
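The choice of destination block on a read miss can be expressed as a sort key. The sketch below is one possible reading of the stated priorities, and the way they are combined is an assumption: clean SRAM blocks are preferred over clean STT-RAM blocks, then dead blocks over live ones, then the least recently used block. The `Block` tuple and `pick_read_fill_victim` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    name: str
    volatile: bool   # True for SRAM, False for STT-RAM
    dirty: bool
    live: bool       # False if predicted dead
    age: int         # larger value = less recently used


def pick_read_fill_victim(blocks: Sequence[Block]) -> Optional[Block]:
    """Pick the clean block that a read miss should overwrite."""
    clean = [b for b in blocks if not b.dirty]
    if not clean:
        return None  # not expected when MCounter is maintained
    # Tuple ordering: SRAM first, then dead before live, then oldest.
    return min(clean, key=lambda b: (not b.volatile, b.live, -b.age))


example_set = [
    Block("sram0", volatile=True, dirty=True, live=True, age=3),
    Block("sram1", volatile=True, dirty=False, live=True, age=1),
    Block("stt0", volatile=False, dirty=False, live=False, age=2),
    Block("stt1", volatile=False, dirty=True, live=True, age=0),
]
victim = pick_read_fill_victim(example_set)
```

In this example the clean SRAM block `sram1` is chosen even though `stt0` is predicted dead, because the SRAM preference is applied first in this sketch.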
When there is a write, only the SRAM blocks in the cache set are searched instead of all blocks. The rationale is that, since writing to STT-RAM takes much longer than reading, placing STT-RAM writes on the critical path of the pipeline stage would slow down the whole stage. If the write hits in the SRAM blocks, it is served, and DBCounter and MCounter are updated at the same time. If the write misses in the SRAM blocks but the data is in the STT-RAM blocks, the pipeline is stalled and the two data blocks are swapped. In this way, the STT-RAM blocks behave as a second level cache for the SRAM blocks during cache writes. Since STT-RAM is on-chip, the stall cycles are relatively few compared with fetching data from the main memory. However, if the requested block is absent from both the SRAM and STT-RAM blocks, the migration policy described in the following subsection is used to manage the cache replacement process and fetch data from the main memory.
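The three write outcomes can be sketched as follows. This is a simplified illustration: `handle_write` is a hypothetical function, each list holds the tags of one set ordered from MRU to LRU, and swapping with the LRU SRAM block is an assumption, since the text does not say which SRAM block takes part in the swap.

```python
from typing import List


def handle_write(tag: int, sram: List[int], sttram: List[int]) -> str:
    """Serve a write to one set; lists hold tags ordered MRU first."""
    if tag in sram:
        # Write hit in SRAM: serve it and move the block to MRU position.
        sram.remove(tag)
        sram.insert(0, tag)
        return "sram_hit"
    if tag in sttram:
        # Data is in STT-RAM: stall and swap it with an SRAM block
        # (choosing the LRU SRAM block is an assumption of this sketch).
        slot = sttram.index(tag)
        displaced = sram.pop()
        sttram[slot] = displaced
        sram.insert(0, tag)
        return "swap"
    # Absent from both parts: left to the migration policy.
    return "miss"


sram_tags = [1, 2]
sttram_tags = [3, 4]
first = handle_write(3, sram_tags, sttram_tags)
```

After the call, tag 3 sits in SRAM as the MRU block and the displaced tag 2 occupies the STT-RAM slot that held tag 3.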
Migration Policy
The goal of the migration policy is to conduct data migrations such that the space of the STT-RAM should be fully utilized while most read or write hits still land on SRAM blocks. The proposed policy answers this challenge by using SRAM as a sifter, and placing the blocks sifted out in the STT-RAM.
Once there is a miss and the requested block is absent from both the SRAM and STT-RAM blocks, a Vvictim, which is either an LRU dead block or an LRU live block if all blocks are live, is selected in SRAM for replacement. If the Vvictim is predicted live, we migrate it to the STT-RAM. This is because, although it is selected in the SRAM as a victim to make room for the new block, there is still a high probability that it will be used in the future. If it were written back to the main memory, reloading it for a future request would take a long time. Therefore, rather than writing it back to the main memory, we migrate it to the non-volatile cache part, since this takes less time and energy. The new destination for the Vvictim is the NVvictim. If the Vvictim is predicted dead, there is no need to migrate it. If it is clean, it is simply overwritten, since there is very little chance that it will be accessed again. If it is dirty, it is written back to the memory in case the prediction was a mistake. Besides, a small write buffer is implemented to mitigate the latency of migration.

Figure 6: Flow of checkpointing friendly block placement policy

Figure 6 illustrates the detailed cache policies with a write miss example. In this example, there are 4 blocks in the cache set, including volatile blocks 1 and 2, and non-volatile blocks 3 and 4. Initially, DBCounter is equal to 32. Because there is 1 dirty volatile block and 1 dirty non-volatile block, the MCounter is 0. When there is a request for block tag5 and it is a write miss, volatile block tag2 is selected as the Vvictim and is written to the write buffer (operation 1). After that, the new block tag5 is loaded into volatile block tag2 from the main memory (operation 2) and it becomes the MRU. At the same time, the block in the write buffer is written to non-volatile block tag4 (operation 3), which is selected as the NVvictim. In this way, operation 2 does not need to wait for a long non-volatile write before it starts; it only waits for a volatile write operation. At this time, MCounter is less than 0, which means there is no space for checkpointing block tag5. Therefore, we choose a DNVvictim to write back (operation 4) and MCounter increases by 1.
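The fate of the Vvictim depends only on its live and dirty bits, which makes the decision easy to state as code. The following is a small sketch; the `Action` enum and `vvictim_action` are hypothetical names introduced for illustration.

```python
from enum import Enum, auto


class Action(Enum):
    MIGRATE_TO_STTRAM = auto()   # move to the NVvictim slot via write buffer
    WRITE_BACK = auto()          # write to main memory
    OVERWRITE = auto()           # drop silently


def vvictim_action(live: bool, dirty: bool) -> Action:
    """Decide what happens to the SRAM victim on a miss in both parts."""
    if live:
        # Predicted live: likely reused, so keep it on-chip in STT-RAM.
        return Action.MIGRATE_TO_STTRAM
    if dirty:
        # Predicted dead but modified: main memory must get the data.
        return Action.WRITE_BACK
    # Predicted dead and clean: nothing to preserve.
    return Action.OVERWRITE
```

Note that a live victim is migrated regardless of its dirty bit; the dirty bit only matters once the block is predicted dead.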
In all, the main novelty of the proposed cache replacement and management mechanism lies in two aspects: 1) Six kinds of victim blocks are defined in Table 2 , and the new replacement policy can determine an appropriate kind of victim block for replacement or for writing back, given the system requirements.
2) The new mechanism not only considers the performance of cache access but also considers the checkpointing ability when there is power outage via replacement policy and migration policy.
Checkpointing Policy
In this section, we present the checkpointing policy based on the proposed cache architecture. This checkpointing policy specifies which volatile cache blocks are checkpointed, as well as where they are stored.
Selecting Volatile Blocks for Checkpointing
Checkpointing is performed within each cache set. Therefore, upon a power failure, the necessary volatile cache blocks are backed up to non-volatile blocks in the same set. The first question to answer is what should be checkpointed. All dirty blocks need to be backed up. However, if the energy supply and STT-RAM space allow, clean blocks that will be used in the near future can also be backed up to NVM to improve performance after power resumes. This means that if the number of dirty blocks in SRAM is less than $T$, then $T - \sum_{i=1}^{M} DB_v^i$ clean blocks will also be checkpointed to improve cache performance. In this case, the most recently used live clean volatile cache blocks should be chosen first so that they can be accessed right away after power recovers.
The selection of clean volatile blocks depends on two considerations: first, there should be available non-volatile space in the same set for placing the selected clean volatile block; second, the remaining energy in the capacitor should be sufficient for checkpointing the selected clean blocks.

Figure 7: Clean block selecting policy (labels: ICache, DCache, MRU, LRU)
For the first consideration, during the selection of clean volatile cache blocks, only the MCounter most recently used clean blocks are selected for checkpointing. The other blocks are dropped. ICache does not have this consideration because it has no dirty blocks, so there is always enough space for checkpointing. For the second consideration, age bits are maintained for ICache and DCache blocks in order to differentiate LRU and MRU blocks. As shown in Figure 7, during the checkpointing process, the most recently accessed cache set is scanned first, for DCache and ICache in sequence. Before a clean block is checkpointed, its LB bit is checked. If the block is already predicted dead, it is skipped, because a dead block is unlikely to be accessed again, and another live block is selected instead. In this work, we apply the dead block prediction policy proposed in [13] to guide the content selection; it predicts dead blocks based on bursts of accesses to a cache block.
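The selection of clean blocks within one set can be sketched as a filter followed by a bounded pick. This is an illustrative model only: `CleanBlock`, `select_clean_blocks`, and the `energy_budget` parameter (the number of additional blocks the remaining capacitor energy can save) are hypothetical names, and the age field stands in for the age bits mentioned in the text.

```python
from typing import List, NamedTuple, Sequence


class CleanBlock(NamedTuple):
    tag: str
    live: bool   # LB bit; False means predicted dead
    age: int     # smaller value = more recently used


def select_clean_blocks(candidates: Sequence[CleanBlock],
                        mcounter: int,
                        energy_budget: int) -> List[str]:
    """Return tags of clean SRAM blocks to checkpoint, MRU first."""
    # Both the free STT-RAM space in the set (MCounter) and the
    # remaining energy bound how many clean blocks can be saved.
    limit = max(0, min(mcounter, energy_budget))
    # Dead blocks are skipped; live ones are taken most recent first.
    live = sorted((b for b in candidates if b.live), key=lambda b: b.age)
    return [b.tag for b in live[:limit]]


clean = [
    CleanBlock("a", live=True, age=2),
    CleanBlock("b", live=False, age=0),
    CleanBlock("c", live=True, age=1),
]
chosen = select_clean_blocks(clean, mcounter=1, energy_budget=3)
```

Here block `b` is the most recently used but is predicted dead, so the single available slot goes to block `c`.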
Selecting Non-volatile Blocks
After we have determined the SRAM blocks that need to be checkpointed, we need to decide which blocks in STT-RAM should be used to store them. Not all non-volatile cache blocks can be used to hold the checkpointed volatile blocks, since the non-volatile cache portion also contains some dirty cache blocks. Therefore, only clean non-volatile cache blocks can be overwritten to hold the dirty volatile cache blocks. Among all clean non-volatile cache blocks, dead clean blocks or LRU live clean blocks in STT-RAM are the first choice. As a result, for each dirty volatile cache block, the CNVvictim is selected for checkpointing each time. In addition, MCounter is updated to ensure there is always enough space for checkpointing in STT-RAM.

Figure 8 illustrates the process of checkpointing with two cache sets. Before checkpointing, there is only 1 dirty volatile block, whose tag is tag2, in set 1, and there are two dirty volatile blocks in set 2. Therefore, the current DBCounter is 3. If a power failure happens, we first find the CNVvictim in set 1, which is block tag3, and replace block tag3 with block tag2. After that, we find the CNVvictim in set 2, which is block tag8, and replace the content of block tag8 with the first dirty block, tag5. Then block tag6 is selected as the new CNVvictim, and it is used to checkpoint the other dirty volatile block, tag7. After that, if the threshold for DBCounter is larger than 3, the clean block tag1 in set 1 can be further checkpointed to block tag4 for better performance, because it is the MRU block while tag4 is the LRU.
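The pairing of dirty SRAM blocks with clean STT-RAM victims in the two-set example can be reproduced with a few lines. This is a sketch under stated assumptions: `checkpoint_set` is a hypothetical helper, and the order of the victim list (dead clean blocks first, then LRU live clean blocks) is supplied by the caller rather than computed.

```python
from typing import Dict, Sequence


def checkpoint_set(dirty_volatile: Sequence[str],
                   cnv_victims: Sequence[str]) -> Dict[str, str]:
    """Map each dirty SRAM tag to the clean STT-RAM block it overwrites.

    cnv_victims must already be ordered by victim priority.
    """
    if len(dirty_volatile) > len(cnv_victims):
        # Should not happen when the per-set space condition is maintained.
        raise ValueError("not enough clean STT-RAM blocks in this set")
    return dict(zip(dirty_volatile, cnv_victims))


# Tags follow the two-set example in the text.
set1 = checkpoint_set(["tag2"], ["tag3", "tag4"])
set2 = checkpoint_set(["tag5", "tag7"], ["tag8", "tag6"])
```

In set 1 the unused victim `tag4` remains available, which is what allows the clean MRU block `tag1` to be saved as well when the energy budget permits.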
When a power failure happens, not only the volatile cache blocks but also the values in MCounter and DBCounter are lost, because these counters are built from SRAM for fast access. Therefore, they also need to be saved to non-volatile memory. After power returns, no restoration process is needed, since the STT-RAM based cache part is also used for normal access. Therefore, execution can start quickly without copying anything back from STT-RAM.
Hardware Overhead
This section analyzes the area overhead introduced by the extra state bits, dead block prediction, and checkpointing logic. We take an 8-way associative L1 cache (16K hybrid I-cache and 16K hybrid D-cache) with a block size of 64B as an example. The overhead of the proposed hybrid cache architecture mainly comes from three sources.
1. One dirty bit (VB) and one live bit (LB) for each cache block; one migration counter (MCounter) for each cache set; one dirty block counter (DBCounter) for the whole SRAM part. The 16K DCache has 16*1024/64=256 blocks. Since it is 8-way associative, there are 256/8=32 sets in total. Since the volatile part of DCache has 256/2=128 blocks, DBCounter needs 7 bits. In addition, each MCounter needs 3 bits. Therefore, the overhead for DCache is 2*256+3*32+7=615 bits. For ICache, only an LB is needed for each block, so the area overhead for ICache is 256 bits. Therefore, the total overhead is 256+615=871 bits.
2. An age counting module is maintained for checkpointing the clean volatile part of both ICache and DCache. Since there are 32 cache sets in total, 6 bits can be attached to each set to store the age. Therefore, the memory area for counting the priority is calculated as 6*64*2=768 bits.
3. The dead block prediction employed in this paper needs less than 1024 bits.
In total, the area overhead is 871+768+1024=2663 bits, or about 2.6 Kbit. Relative to the 32 KB (262,144 bits) of cache data storage, this is about 1.0%. Compared to the efficiency gained by the proposed architecture, the hardware overhead is small.
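The bit counts above can be re-derived with a short script. It is only a check of the arithmetic as stated in the text; the variable names are ours, and the age-module term is taken as written (6*64*2) rather than derived from the cache geometry.

```python
CACHE_BYTES = 16 * 1024  # one 16K cache (ICache or DCache)
BLOCK_BYTES = 64
WAYS = 8

blocks = CACHE_BYTES // BLOCK_BYTES  # blocks per cache
sets = blocks // WAYS                # sets per cache

# DCache: VB + LB per block, a 3-bit MCounter per set, a 7-bit DBCounter.
dcache_bits = 2 * blocks + 3 * sets + 7
# ICache: one LB per block.
icache_bits = blocks
state_bits = dcache_bits + icache_bits

age_bits = 6 * 64 * 2        # age counting module, as stated in the text
dead_pred_bits = 1024        # upper bound for the dead block predictor

total_bits = state_bits + age_bits + dead_pred_bits
print(blocks, sets, dcache_bits, state_bits, total_bits)
```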
EXPERIMENTAL EVALUATION
In this section, we will first present the experiment setup in subsection 6.1. Then, the evaluation results will be presented in subsection 6.2.
Experiment Setup
The experiments are carried out on the gem5 simulator [2], in which the proposed SRAM and STT-RAM based hybrid cache architecture and policies are implemented. Table 3 details the experimental system configuration. The hybrid cache architectures evaluated include three SRAM/STT-RAM way configurations per set: 4/4, 5/3, and 6/2. We obtain STT-RAM and SRAM parameters using NVSim [3]. The storage capacitor is sized to support checkpointing a quarter of the whole cache. The baselines for comparison are two existing cache architectures proposed for NVPs: the pure STT-RAM based cache architecture proposed in [16] and the SRAM based cache architecture proposed in [12]. Eight benchmarks from MiBench [4] and four benchmarks from the SPEC CPU2006 suite [5] are selected for evaluation, and their characteristics are shown in Table 5. In this table, the first eight rows show benchmarks from MiBench and the last four rows show benchmarks from SPEC CPU2006. These benchmarks are chosen because MiBench is a representative benchmark suite for embedded system applications, while the SPEC CPU2006 suite targets CPU-intensive workloads on general purpose microprocessors.
Results
The performance of the proposed hybrid cache architecture is evaluated in two scenarios: first, programs are executed under a stable power supply; second, programs are interrupted frequently by power failures.
Execution Under Relatively Stable Power
In this section, we evaluate the proposed hybrid cache under relatively stable power. The cache performance, energy consumption, and miss rate are analyzed to evaluate the proposed cache policies.
Performance Evaluation
Figure 9 shows the execution time of 12 benchmarks under 5 cache settings when the power supply is relatively stable. Here, all caches are 32 KB. In this figure, the performance of the other four cache architectures is normalized to that of the non-volatile cache architecture. From this figure, we can see that the non-volatile cache architecture has the worst performance of the five because of its high write latency. The performance of the proposed hybrid cache architectures is 18% better than the non-volatile cache architecture on average. The middle three architectures are based on different ratios of SRAM and STT-RAM, which are 4/4, 5/3, and 6/2. Normally, the 6/2 architecture performs better than the other two, since the larger the SRAM is, the more blocks are placed and accessed in SRAM. However, this does not mean the 6/2 architecture is the best, because SRAM has higher leakage energy than STT-RAM, while STT-RAM is three or more times denser than SRAM. Moreover, in order to checkpoint successfully, dirty blocks need to be written back to the main memory more frequently in the 6/2 configuration. Therefore, the selection of a hybrid architecture also needs to take into account the requirements of size, energy, and main memory stress.
Energy Consumption Evaluation
Figure 10 shows the energy consumption comparison between the pure non-volatile cache, the hybrid cache, and the pure SRAM cache. For each benchmark, there are three bars. In each bar, the solid upper part shows the dynamic energy consumption and the textured bottom part shows the leakage energy consumption of each cache architecture. From the figure, we can see that the pure non-volatile cache incurs the largest dynamic energy consumption and the lowest leakage energy consumption. On the other hand, the pure volatile cache incurs the largest leakage energy and the lowest dynamic energy. The hybrid cache's energy consumption lies between those of the other two configurations. This is because the leakage power of SRAM is more than six times that of STT-RAM, as shown in Table 4. From the energy comparison results, we can see that the proposed hybrid cache is more energy efficient than the pure non-volatile cache. However, this does not mean the proposed hybrid cache is less energy efficient than the volatile cache. These results are obtained when the processor is running; when it is idle,
Miss Rate Evaluation
Table 6 shows the cache miss rate comparison between the proposed hybrid cache and the pure non-volatile cache. We want to determine whether the checkpointing aware cache replacement policy increases the cache miss rate. From the table, we can see that the increase in miss rate is negligible. The largest miss rate increase is 2.180%. One of the benchmarks, libquantum, actually shows a decreased miss rate. On average, the miss rate increases by only 0.116%. Table 7 shows the cache miss rate comparison between the proposed hybrid cache and the SRAM based cache [12] of the same size. From Table 7, we can see that the cache miss rate of the proposed hybrid cache architecture is lower than that of the SRAM based cache, and the proposed cache architecture decreases the cache miss rate by 0.394% on average.
This improvement over the baseline cache arises because the non-volatile memory in the SRAM based cache is not used for cache access and only stores volatile states, so the actual working SRAM part is half the size of the proposed architecture. In contrast, the proposed hybrid cache architecture takes full advantage of the non-volatile memory.
Execution Frequently Interrupted
In this section, the checkpointing ability of the 4/4 cache architecture is evaluated in a scenario with frequent power failures.
The power failures are simulated by applying two different power traces in which power failures happen at different frequencies. In the first power trace, a power failure happens about every 500ms; in the second power trace, a power failure happens every 200ms. The failure frequencies of both power traces are set very high to stress the architecture. In reality, power failures do not happen as frequently as in the two power traces used here. Therefore, the results are quite conservative: it takes much less time for a benchmark to run on the hybrid cache architecture in normal energy harvesting systems.
Figure 11 shows the performance of the proposed hybrid cache architecture when there are frequent power failures. In this figure, the first column shows the execution time of the non-volatile cache architecture, and the second and third columns show the execution time of the hybrid cache architecture when power failures happen every 200ms and 500ms, respectively. From this figure, we can see that even when facing extremely frequent power failures, the proposed cache architecture outperforms the pure non-volatile cache architecture. The checkpointing aware hybrid architecture performs better under less frequent power failures than under more frequent ones.
Figure 12 shows the energy consumption of the proposed hybrid cache architecture when there are frequent power failures. In this figure, the first column shows the energy consumption of the non-volatile cache architecture, and the second and third columns show the energy consumption of the hybrid cache architecture when power failures happen every 200ms and 500ms, respectively. For each column, the solid upper part shows the dynamic energy consumption and the textured bottom part shows the leakage energy consumption of each cache architecture.
From this figure, we can see that even when facing extremely frequent power failures, the proposed cache architecture outperforms the pure non-volatile cache architecture for most benchmarks. When power failures happen every 200ms, the hybrid cache consumes more energy than when power failures happen every 500ms. This is because leakage energy increases with execution time, while dynamic energy increases with the additional memory accesses generated by checkpoints.
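The trend described above can be illustrated with a first-order energy model: leakage energy is leakage power times execution time, and dynamic energy is the number of accesses times the energy per access. The sketch below is an assumption-laden toy, not the simulation used in the evaluation; every parameter value (leakage power, access counts, checkpoint cost, run length) is a hypothetical placeholder and does not come from the tables in this paper.

```python
def energy_with_failures(run_ms, interval_ms, leak_mw=1.0,
                         base_accesses=1_000_000, nj_per_access=0.1,
                         ckpt_accesses=128, ckpt_ms=1):
    """Return (leakage_uJ, dynamic_uJ) for a run interrupted every interval_ms.

    All default values are hypothetical placeholders for illustration.
    """
    n_ckpt = run_ms // interval_ms              # number of power failures
    total_ms = run_ms + n_ckpt * ckpt_ms        # checkpoints lengthen the run
    leakage_uj = leak_mw * total_ms             # mW * ms = uJ
    accesses = base_accesses + n_ckpt * ckpt_accesses
    dynamic_uj = accesses * nj_per_access / 1000  # nJ -> uJ
    return leakage_uj, dynamic_uj

leak_200, dyn_200 = energy_with_failures(run_ms=10_000, interval_ms=200)
leak_500, dyn_500 = energy_with_failures(run_ms=10_000, interval_ms=500)
print(leak_200 > leak_500, dyn_200 > dyn_500)
```

Under these assumptions, both components grow when failures are more frequent, because each failure adds checkpoint accesses and extends the time over which leakage accumulates.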
From the experimental results, we can see that the proposed hybrid cache works efficiently for energy harvesting powered systems. Compared with existing cache architectures, it offers high performance, relatively low energy consumption, and instant resumption, as shown in Table 1. In addition, this hybrid cache supports reliable checkpointing because we can control which volatile states in the SRAM are checkpointed.
CONCLUSION
Hybrid SRAM/STT-RAM cache is a promising candidate for future generation energy harvesting embedded systems in the Internet of Things (IoT), due to its fast access, high density, and low leakage. To guarantee successful checkpointing with limited capacitor storage, we propose a proactive write-back policy guided by monitoring the state of the whole SRAM cache part and the migration state of each associative set, so that the percentage of dirty blocks always remains within the checkpointing capability. To fully utilize the STT-RAM, we propose to migrate less recently used live victim blocks to it while retaining enough space in STT-RAM for checkpointing SRAM states. When power goes down, the proposed efficient checkpointing policy first checkpoints dirty blocks and then important clean blocks in SRAM. After power returns, the system resumes quickly without a special restoration process.
