Though low-power nonvolatile processors and ASICs can meet the requirements of many Internet of Things (IoT) applications, a platform that combines hardware programmability with better energy efficiency for edge computing is badly needed. Recently, ultralow-power field-programmable gate arrays (FPGAs) have attracted increasing attention for IoT applications. For example, iCE40 devices [3] achieve 25-µW static power and are designed for sensor hub and wearable scenarios. Such emerging ultralow-power FPGAs are possible candidates for many future IoT applications.
However, relatively high leakage power and large chip area overhead are still two major challenges for FPGAs. It has been shown that when the same application is implemented on an ASIC and on an FPGA, the FPGA version consumes 21 times more silicon area and 10 times more power [4] than the ASIC version. Moreover, as transistor feature size and threshold voltage go down, the leakage power of FPGAs becomes even worse.
The development of nonvolatile memory technologies, such as ReRAM [5], [6], ferroelectric RAM, and STT-RAM [7], [8], brings new opportunities to FPGAs. Many previous works leverage nonvolatile memories in FPGAs [9]-[22]. They can effectively reduce the leakage power and speed up configuration. However, computation data are still volatile and tasks have to be reexecuted after a power failure. Fig. 1 shows the forward progress among different nonvolatile levels of an FPGA. SRAM-based FPGAs can retain neither configuration nor computation data, and therefore, they have to be reconfigured after every power recovery. Even worse, programs need to be reexecuted, since computation data are lost. Such FPGAs can hardly make progress. If FPGAs adopt nonvolatile memory for configuration data, the reconfiguration process can be avoided. However, rollbacks due to the loss of computation data still exist, which limits forward progress. If both configuration and computation data are nonvolatile, FPGAs will work correctly and achieve the best forward progress, which is the focus of this paper.
Several works have explored nonvolatile architectures for computation data in FPGAs [23]-[30]. They make computation data nonvolatile by two approaches. The first is to replace the master latch of CMOS D flip-flops (DFFs) with nonvolatile memory [23]-[25]. The other is to back up computation data to nonvolatile memory when power is switched off [26]-[30]. These nonvolatile flip-flops incur large area and latency overhead. Furthermore, these works adopt a full backup strategy for computation data, which is overly pessimistic and leads to large peak current and backup energy. In order to address these challenges for the computation data of FPGAs, previous work [31] leverages nonvolatile block memories (BRAMs) to make computation data nonvolatile. Only necessary computation data are backed up. However, it still needs to back up state after each power failure.
Observing that giving up backup at some locations and allowing a certain amount of computation rollback can achieve higher energy efficiency, this paper proposes an extension of CP-FPGA to support computation rollback. Energy efficiency is improved by offline backup data analysis and online backup/rollback decisions. To the best of our knowledge, this is the first hardware/software codesign for an energy-efficient nonvolatile FPGA that leverages online/offline checkpointing optimization.
The contributions of this paper are as follows. It identifies registers to be backed up at each state to avoid full backup. A CAD tool based on VTR is also proposed to map an application to the proposed CP-FPGA. 3) It formulates the offline checkpointing optimization problem as an integer linear program to determine offline checkpoints with less backup cost. At these checkpoints, CP-FPGA routinely backs up its computation state to limit the rollback distance so that it can achieve better energy efficiency.
The rest of this paper is organized as follows. Section II shows the related work, as well as the assumptions and operation principles of CP-FPGA. Section III presents the hardware architecture design of CP-FPGA. Section IV shows the software design of this work, including offline analysis and CAD tools to map an application to CP-FPGA. Section V presents simulation results. Finally, the conclusion is drawn in Section VI.
II. PRELIMINARY
In this section, we first introduce the related work. After that, the assumptions and operation principles of CP-FPGA are given.
A. Related Work
There are plenty of studies on nonvolatile FPGAs that accelerate the reconfiguration process. Liauw et al. [17] proposed the first 3-D FPGA chip based on ReRAM technology. Chen et al. [21] demonstrated a novel design of a run-time reconfigurable FPGA with distributed BRAMs. Gaillardon et al. [19] proposed a generic memristive structure to replace the traditional pass gates in FPGAs. Cong and Xiao [15] proposed FPGA-RPI, which has a 96% smaller footprint compared to traditional SRAM FPGAs. Huang et al. [9] proposed a nonvolatile FPGA architecture with a stacked 1D2R RRAM array, which reduces area and power by 70% and 43.6%, respectively. However, all the above works focus on eliminating the need to reload configuration data. The computation data are still lost after each power failure.
Recently, several works explored nonvolatile FPGA architectures for computation data [23]-[30]. Zhao et al. [23]-[25] replaced the master latch of CMOS DFFs with nonvolatile memory. However, the slow writing speed and large writing current of nonvolatile memory lead to long latency and high energy consumption. The works in [26]-[30] back up computation data into nonvolatile memory when power is OFF. For example, Gaillardon et al. [27] proposed an ultralow-power FPGA based on nonvolatile flip-flops (NVFFs). These flip-flops operate as standard volatile CMOS flip-flops during regular operation and store data into RRAM cells before a power failure. However, NVFFs lead to a large area overhead, resulting in longer routing delay, larger tile area, and nontrivial performance loss. Furthermore, all the above works adopt a full backup strategy for computation data, which is overly pessimistic and leads to large peak current and energy.
To address these challenges, Yuan et al. [31] leveraged nonvolatile block memories (BRAMs), which have higher density and yield than NVFFs, to make computation nonvolatile. Furthermore, only necessary computation data are backed up to achieve higher energy efficiency. The drawback is that it still needs to back up state after each power failure. In this paper, backups at some locations are abandoned and computation rollbacks are utilized to achieve higher energy efficiency.
B. Operation Principles and Assumptions of CP-FPGA
CP-FPGA minimizes the backup energy via two operation principles: 1) we analyze applications offline and mark the checkpointing locations that consume less energy to back up; and 2) when a power failure occurs, an online scheduler decides, based on a cost evaluation, whether to back up directly or to roll back without backup.
We have two implicit assumptions for CP-FPGA to save backup energy. First, we assume that each application can be modeled as a finite state machine (FSM). In fact, all high-level synthesis tools can transform HDL codes into FSMs. Second, we can find some FSM states in applications whose backup data are smaller than those of others. This is typical in many applications, because the number of variables in an application changes during execution. In Sections III and IV, we present the details of CP-FPGA.
III. CP-FPGA HARDWARE ARCHITECTURE
In this section, we introduce the hardware architecture of the proposed CP-FPGA.
A. Architecture Overview
The proposed architecture is an extension of island-style FPGAs (Fig. 2) . CP-FPGA consists of an nvController, several checkpoint units (CUs), and several IOs, as shown in Fig. 2(a) . The CUs are the smallest backup units in CP-FPGA and the nvController is used to control the backup and restore of computation data. The IOs are universal parts of FPGAs.
B. Checkpoint Units
A CU is the smallest backup unit and it consists of four different parts [ Fig. 2(b) ], including configurable logic blocks (CLBs), connection boxes (CBs), switch boxes (SBs), and a checkpoint BRAM (CPRAM).
CLBs are the main logic blocks of CP-FPGA [Fig. 2(c)]. There are N CLBs in one CU. Each CLB has K logic elements (LEs). In this paper, K is set to 4 to balance inter-routing and intra-routing resources [32]. N is set to 8 to balance area and energy efficiency. We will show the impact of N in the experiment section. Each LE [Fig. 2(e)] contains a nonvolatile lookup table (LUT) and a CMOS DFF storing computation data. We use a normal CMOS DFF instead of an NVFF, because NVFFs suffer from large area overhead. In the proposed architecture, each CLB has an additional data path called the shift register path (SR path). When the FPGA needs to be backed up, each CLB can be turned into a K-bit shift register that uses this data path. Computation data are stored into the CPRAM through this mechanism.
CBs are used to connect CLBs with routing resources, and SBs are used to connect routing resources in different directions (Fig. 2). All these routing boxes are controlled by RRAM cells, which maintain the configuration bits after a power failure. RRAM cells have small reading energy but large writing energy. The configuration bits are read frequently, but only need to be written once, when an application is loaded onto the FPGA. RRAM cells are therefore suitable for these configuration bits.
There is one CPRAM in each CU. BRAMs exist in almost every modern commercial FPGA, where they store temporary data or are configured as large LUTs. CPRAMs in the proposed architecture are RRAM-based RAMs and have two working modes. In the normal mode, they work as normal BRAMs. When an FPGA needs to back up its computation data, CPRAMs work in the backup mode. In this paper, the capacity of each CPRAM is set to 10 kb, which is the same as in Cyclone V, a commercial FPGA from Altera. Note that the capacity can be customized based on the purpose of the CP-FPGA. Fig. 2(h) shows the details of CPRAMs. In the backup mode, the ports of a CPRAM are directly connected to those of the CLBs in the same CU through the SR paths, and computation data are backed up to or recovered from the CPRAM.
The address ports of CPRAMs are controlled by an A-bit signal called CPAddr, which is generated by the nvController. A is the address width of the CPRAMs. For a 10-kb (1250 × 8 bit) CPRAM, A is set to 11. Furthermore, each CU has C bits of additional configuration data called the checkpoint flag. The nvController generates another C-bit signal called the checkpoint signal (CPSignal). Only 1 bit of CPSignal can be 1, which means that CP-FPGA can support at most C different application states. The two signals are compared, and only the necessary CPRAMs back up computation data, which avoids storing redundant data. In this paper, we set C to 32, because we find that most applications have fewer than 32 different FSM states.
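The address width and the flag comparison described above can be illustrated with a minimal sketch. It assumes that the checkpoint flag is a C-bit mask with bit s set when the CU holds data needed at state s, and that the comparison is a bitwise AND with the one-hot CPSignal; the function names are hypothetical and only serve this illustration.

```python
C = 32  # number of supported FSM states, as chosen in the text


def address_width(words: int) -> int:
    """Smallest number of address bits able to index `words` entries."""
    return max(1, (words - 1).bit_length())


def one_hot(state_id: int, width: int = C) -> int:
    """One-hot CPSignal for the current state (exactly one bit is 1)."""
    if not 0 <= state_id < width:
        raise ValueError("state_id out of range")
    return 1 << state_id


def cu_must_back_up(checkpoint_flag: int, cp_signal: int) -> bool:
    """Assumed comparison: the CU backs up only if its flag bit matches."""
    return (checkpoint_flag & cp_signal) != 0


# A 1250-word CPRAM needs 11 address bits, matching A = 11 in the text.
print(address_width(1250))
# A CU flagged for states 0 and 2 backs up at state 2 but not at state 1.
print(cu_must_back_up(0b101, one_hot(2)), cu_must_back_up(0b101, one_hot(1)))
```

With a one-hot CPSignal, each CU needs only a single AND-reduce over C bits to decide whether it participates in a backup.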
Once the backup process is started, N bits of computation data are written to the CPRAM in each clock cycle, since there are N CLBs in one CU. The whole backup process takes K clock cycles, because all CLBs are configured as K-bit shift registers. The number of CLBs in one CU (N) influences the backup granularity. If there are fewer CLBs in one CU, the backup has higher parallelism, but CP-FPGA then needs more CPRAMs. There is therefore a break-even point between backup granularity and the number of CPRAMs.
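The timing above can be made concrete with a small behavioral sketch of one CU. It is not a hardware description: the shift direction and the bit ordering inside each word are assumptions, and `backup_cu` and `restore_cu` are hypothetical names.

```python
def backup_cu(clb_registers):
    """Shift out a CU's DFF contents into CPRAM words.

    clb_registers: list of N lists, each holding the K DFF bits of one CLB.
    Returns K words of N bits each, one word per clock cycle.
    """
    k = len(clb_registers[0])
    chains = [list(bits) for bits in clb_registers]  # copy, keep input intact
    words = []
    for _ in range(k):
        # Every CLB shifts out one bit per clock; the N bits form one word.
        words.append([chain.pop() for chain in chains])
    return words


def restore_cu(words):
    """Inverse of backup_cu: shift the stored words back into the CLBs."""
    n = len(words[0])
    chains = [[] for _ in range(n)]
    for word in reversed(words):
        for chain, bit in zip(chains, word):
            chain.append(bit)
    return chains


# N = 8 CLBs with K = 4 DFFs each, as in the configuration used in the paper.
regs = [[(i + j) % 2 for j in range(4)] for i in range(8)]
words = backup_cu(regs)
print(len(words), len(words[0]))   # K words, N bits per word
print(restore_cu(words) == regs)   # round trip recovers the register state
```

The sketch shows why the backup latency depends only on K, while N sets the word width written in each cycle.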
C. nvController Design
To complete backup and restore operations in CP-FPGA, an nvController is designed, as shown in Fig. 2(g). It has two input ports: StateID and power. If power equals zero, energy is insufficient and CP-FPGA has to be shut down. StateID shows the current forward progress of the task, and it is inserted into the Verilog code as an output port. The nvController has four main components as follows.
1) RunCounter is a counter that records the number of clock cycles since the last checkpoint for the online decision. It denotes the rollback cost if CP-FPGA gives up backup after a power failure. If the rollback cost is much larger than the backup cost, CP-FPGA should back up its state. It is a 32-bit counter, since a rollback overhead of 2^32 clock cycles is much larger than the backup cost in most practical situations.
2) RKFlag is a 1-bit RRAM cell that avoids repeated rollbacks to the same checkpoint. When CP-FPGA gives up backup and rolls back to a previous checkpoint, this flag is set to true. While the flag is set, CP-FPGA is forced to save its state instead of rolling back to the same checkpoint again. This limits the rollback cost and ensures the progress of tasks.
3) CostRAM stores the checkpoint cost of tasks at different states. The capacity of CostRAM is 32 × 32 bits, because we assume that CP-FPGA can support at most 32 different computation states (C = 32) and RunCounter is a 32-bit counter. When a power failure happens, the output value of CostRAM is compared with RunCounter. The cost values are loaded into CostRAM offline. In Section IV, we will show how to calculate these checkpoint cost values. Backups are always decided for the whole FPGA, because we assume that the whole FPGA is powered by one power source and has only one nvController. Finer granularity can be achieved if we divide the FPGA into several parts and add an nvController for each part. However, the hardware overhead will increase. This is a tradeoff between backup parallelism and area overhead.
4) The control logic of nvController is an FSM [Fig. 2(g)].
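How RunCounter, RKFlag, and CostRAM could interact at a power failure is sketched below. The exact comparison rule and the points at which the flag and the counter are cleared are assumptions made for illustration; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NvController:
    cost_ram: list            # per-state checkpoint cost, loaded offline
    run_counter: int = 0      # clock cycles since the last checkpoint
    rk_flag: bool = False     # set after a rollback without backup

    def tick(self) -> None:
        """Advance one clock cycle; the counter saturates at 32 bits."""
        self.run_counter = min(self.run_counter + 1, 2**32 - 1)

    def checkpoint_done(self) -> None:
        """Assumed behavior: a completed backup clears counter and flag."""
        self.run_counter = 0
        self.rk_flag = False

    def on_power_failure(self, state_id: int) -> str:
        """Decide between 'backup' and 'rollback' for the whole FPGA."""
        rollback_cost = self.run_counter
        backup_cost = self.cost_ram[state_id]
        if self.rk_flag or rollback_cost > backup_cost:
            self.checkpoint_done()
            return "backup"
        # Give up the backup: execution will resume from the last checkpoint.
        self.rk_flag = True
        self.run_counter = 0
        return "rollback"


ctrl = NvController(cost_ram=[100] * 32)
for _ in range(10):
    ctrl.tick()
print(ctrl.on_power_failure(3))  # cheap to recompute, so no backup is made
print(ctrl.on_power_failure(3))  # the flag now forces a backup
```

Forcing a backup on the second consecutive failure is what bounds the rollback distance in this sketch, mirroring the stated purpose of RKFlag.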
IV. CP-FPGA SOFTWARE
This section illustrates how to map an application to CP-FPGA. An overview is given first. Then, the details are presented step by step. Finally, we compare the workflows of different FPGAs to show how the hardware and software designs of CP-FPGA work together.
A. CAD Flow Overview
The flow of the checkpoint scheme is shown in Fig. 3(a). The input of the flow is a C/C++ description of an application. An HLS tool is then used to synthesize the C/C++ code into HDL. The FSM and the control data flow graph (CDFG) of the application are collected from the HLS tool. An extra output port (StateID) is inserted into the design to show the current FSM state of the application. The design is mapped onto CP-FPGA with traditional FPGA tools to obtain placement results. The software flow works with traditional tools as an additional component. We have validated the feasibility on VTR, an open source FPGA tool [33]. Other tools should support the CP-FPGA software in a similar way. The backup cost of each state is calculated by analyzing the CDFG and the FSM. Candidates with small backup cost are selected as the offline checkpoints, where CP-FPGA backs up its state routinely to limit the rollback cost with small overhead. Finally, the placement and routing configuration bits and the checkpoint configuration bits are loaded into CP-FPGA.
B. Checkpointing Cost Calculation
An example CDFG and the related FSM of an application are shown in Fig. 3(b). A block corresponds to one state of the FSM. There are four types of blocks. First, combinational logic blocks do not have registers. Second, sequential logic blocks have registers. These blocks are the focus of this paper, since registers contain the computation data of FPGAs. Third, blocks with atomic operations contain operations that cannot be interrupted. Fourth, blocks with child modules call a child module, whose state should be saved. These four types of blocks lead to three kinds of paths.
Three kinds of paths are shown in Fig. 3(c). An atom path is an edge that represents an atomic operation, such as reading a memory or communicating with peripherals. Any interruption in such a path will lead to a fatal error; therefore, checkpoints cannot be set in these paths. If we want to insert a checkpoint that interrupts these edges, the backup cost will be infinite. A call path means that the application is running a child module. If a checkpoint is inserted at these edges, all registers in the child module need to be backed up. Another important path is the checkpoint path (CP-path). Different from the previous two paths, a CP-path may contain multiple edges instead of only one. A CP-path exists between blocks m and n if and only if: 1) block m has a register; 2) block m can reach n; and 3) the blocks on the path from m to n do not have registers. A CP-path means that block n depends on m. If checkpoints occur in these CP-paths, the registers in block m need to be backed up.
Algorithm 1 Calculate Checkpoint Cost of CP-paths
The principle of the CP-path analysis is that any register that will be used in the future should be backed up. That is, a register should be backed up if and only if: 1) it belongs to a state that has been executed; and 2) a state that will be executed in the future depends on it. The execution order of FSM states depends on the application input, but the number of states is independent of it, since the application input only decides how many times loops are executed. In order to address this problem, two binary matrices are defined. The registers that need to be backed up at different FSM states are obtained after the checkpointing cost calculation step. Since the basic backup unit is the CU, we convert the number of registers to the number of CUs based on placement information.
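The CP-path conditions and the register-to-CU conversion can be illustrated with a minimal sketch. It is not Algorithm 1 and does not use the two binary matrices; it works on a plain adjacency list, and the names `cp_paths`, `blocks_to_save`, and `cus_to_back_up`, as well as the example graph and placement, are hypothetical.

```python
def cp_paths(succ, has_reg):
    """Return all (m, n) pairs joined by a CP-path.

    succ: dict mapping each block to a list of successor blocks.
    has_reg: dict mapping each block to True if it contains registers.
    """
    pairs = set()
    for m in succ:
        if not has_reg[m]:
            continue                      # condition 1: m has a register
        stack, seen = list(succ[m]), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            pairs.add((m, n))             # condition 2: m reaches n
            if not has_reg[n]:            # condition 3: only pass through
                stack.extend(succ[n])     # register-free blocks
    return pairs


def blocks_to_save(pairs, executed, future):
    """Blocks already executed that a future block still depends on."""
    return {m for (m, n) in pairs if m in executed and n in future}


def cus_to_back_up(blocks, regs_of_block, cu_of_reg):
    """Map the registers of the selected blocks to CUs via placement."""
    return {cu_of_reg[r] for b in blocks for r in regs_of_block[b]}


# Example chain A -> B -> C -> D, where B is purely combinational.
succ = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
has_reg = {"A": True, "B": False, "C": True, "D": True}
pairs = cp_paths(succ, has_reg)

# Checkpoint taken after B: A and B have run, C and D are still to come.
save = blocks_to_save(pairs, executed={"A", "B"}, future={"C", "D"})
regs_of_block = {"A": ["r0", "r1"], "C": ["r2"], "D": ["r3"]}
cu_of_reg = {"r0": 0, "r1": 0, "r2": 1, "r3": 2}
print(sorted(pairs), save, cus_to_back_up(save, regs_of_block, cu_of_reg))
```

In this example only the registers of block A are live at the checkpoint, and both are placed in the same CU, so a single CPRAM takes part in the backup.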
C. Offline Checkpoint Decision
The goal of the offline checkpoint decision is to find states with small backup cost to serve as offline checkpoints, where CP-FPGA backs up its state routinely to limit the rollback cost with small overhead. The offline checkpoint optimization problem can be formulated as minimizing the total execution energy of a task. In order to formulate this problem, several notations are defined in Table I.
The goal of the optimization is to determine the values of b_i so that the task can be finished with minimum total execution energy. The optimization problem can be written as follows:
$$\min E_{\mathrm{total}} \qquad (1)$$
Here, n is the total number of states of the application. The total energy is the sum of the execution energy, the backup energy, and the rollback energy (2). The backup energy is the sum of the energy of each checkpoint (4). The rollback energy contains two parts (5). One part is the energy to recompute from the nearest checkpoint, where m in (6) denotes the nearest checkpoint. The other part is the energy of loading computation data from the CPRAMs. The probability that a power failure happens in each state is proportional to the execution time of this state (7).
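The structure of this objective can be illustrated with a small sketch for a loop-free chain of states. It is one plausible reading of the description above, not the paper's formulation: the assumptions are that a checkpoint is taken at the end of a state, that a failure in state i forces all states after the nearest earlier checkpoint to be recomputed, and that exhaustive search replaces the ILP solver. All parameter names are hypothetical.

```python
from itertools import product


def total_energy(b, t, e_exec, c, restore, p_fail):
    """Expected energy of one run for a checkpoint selection b.

    b[i] = 1 if state i is an offline checkpoint, t[i] is its execution
    time, e_exec[i] its execution energy, c[i] its backup energy,
    restore the energy to reload data from CPRAMs, and p_fail the overall
    failure probability, spread over states in proportion to t[i].
    """
    total_t = sum(t)
    energy = sum(e_exec) + sum(ci for bi, ci in zip(b, c) if bi)
    last_cp = -1                                   # no checkpoint yet
    for i in range(len(b)):
        p_i = p_fail * t[i] / total_t              # failure lands in state i
        recompute = sum(e_exec[last_cp + 1 : i + 1])
        reload = restore if last_cp >= 0 else 0.0
        energy += p_i * (recompute + reload)
        if b[i]:
            last_cp = i                            # checkpoint at end of i
    return energy


def best_checkpoints(t, e_exec, c, restore, p_fail):
    """Exhaustive search over all 2^n selections (fine for small n)."""
    return min(product((0, 1), repeat=len(t)),
               key=lambda b: total_energy(b, t, e_exec, c, restore, p_fail))


t = [1, 1, 1, 1]
e_exec = [10, 10, 10, 10]
c = [50, 1, 50, 50]          # state 1 is cheap to back up
best = best_checkpoints(t, e_exec, c, restore=1, p_fail=1.0)
print(best, total_energy(best, t, e_exec, c, 1, 1.0))
```

In this toy instance the search selects only the state with the small backup cost, which reflects the intuition behind offline checkpoints: a cheap routine backup shortens the expected rollback distance for all later states.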
Note that the FSM of an application is a directed graph that may contain loops. The number of states is independent of the input, but the execution order of the FSM states depends on it. We cannot track offline how many times each loop is executed. Therefore, the FSM should be unrolled and divided into loop-free branches. The execution time of a child module should be set to infinity, since this time also depends on the input of the application and cannot be tracked offline. The resulting optimization problem is an integer linear programming (ILP) problem and can be solved by commercial ILP solvers to obtain the offline checkpoints. As the number of FSM states in real-world applications is usually small, it can be solved in a reasonable amount of time. CP-FPGA will make routine backups at these offline checkpoint locations.
D. Workflow of Different FPGAs
Fig. 4 shows the workflows of different types of FPGAs. After a power failure, the following cases arise.
A1) Traditional SRAM-based FPGAs lose both computation data and configuration data. Upon power recovery, they need to load configuration bits from off-chip and reexecute the task from the beginning.
A2) RRAM + DFF-based FPGAs do not need to reload configuration bits but lose computation data. Upon power recovery, they still need to reexecute the task from the beginning.
A3) RRAM + NVFF-based FPGAs back up computation data to the RRAM cells of NVFFs. Upon power recovery, they restore computation data from the NVFFs and continue the task.
A4) The offline version of CP-FPGA backs up computation data into CPRAMs. Upon power recovery, it recovers computation data from the CPRAMs and continues the task.
A5) The hybrid version of CP-FPGA finds that it is too far from the last checkpoint, so the nvController decides to make an online checkpoint and back up computation data to the CPRAMs. After power recovery, it restores computation data from the CPRAMs. After a while, it reaches an offline checkpoint, which was marked by the offline analysis due to its small backup overhead. Thus, it backs up at this offline checkpoint and continues the task. Soon a power failure occurs again. The nvController finds that the rollback energy is very small, because an offline checkpoint was just made, and CP-FPGA directly shuts off without backup. After power recovery, it rolls back to the previous offline checkpoint and continues the task.
SRAM + DFF-based and RRAM + DFF-based FPGAs cannot finish the tasks, because all computation data are lost. They need to restart the tasks. RRAM + NVFF-based FPGAs can back up computation data into NVFFs after a power failure. However, all computation data need to be backed up. Since some computation data will not be used in the future, part of the energy is wasted on backing up these redundant data. The offline version of CP-FPGA only backs up necessary data to CPRAMs, which improves energy efficiency. However, when a power failure occurs at some states, it needs to back up too much computation data. The hybrid version allows a certain amount of computation rollback to avoid costly backups, which further improves energy efficiency.

V. EXPERIMENT RESULTS
In this section, we will first present the experiment setup. Then, the comparison of total execution energy and time is given to show the advantages of the proposed work. Next, we will discuss the impact of different parameters in the proposed work. Finally, area and latency overhead of the proposed work will be given.
A. Experiment Setup
As illustrated in Fig. 3(a), Vivado HLS is used as the high-level synthesis tool and the VTR toolbox [33] is used to simulate the proposed FPGA architecture. The CMOS technology is based on the PTM 45-nm technology given by VTR, and the RRAM is based on the back-end-of-line process of CMOS technology [6]. The other experimental parameters are shown in Table II.
Several benchmarks are selected from two popular HLS benchmark suites, CHStone [34] and MachSuite [35]. The FIR filter is widely used in digital signal processing. KMP is an important algorithm for string matching. SORT is a merge sort algorithm used on parallel platforms, and AES/SHA are two popular cryptographic algorithms. These benchmarks are tested on CP-FPGA under a wide range of power failure frequencies. In the simulation, the power failure frequency obeys a uniform distribution. We find that the experimental results are mainly affected by the expectation of the distribution. Different types of distribution lead to similar results if the expectation is the same. The impact of power failure frequency will be discussed in Section V-C.
B. Execution Energy and Time
In this section, we will first compare the total execution energy and total execution time breakdown of different benchmarks on different FPGAs. Then, we will show two interesting experimental results.
As shown in Fig. 5, SRAM + DFF- and RRAM + DFF-based FPGAs cannot finish the tasks under an unstable power supply. SRAM + DFF-based FPGAs need to load configuration data from off-chip whenever a power failure happens. RRAM + DFF-based FPGAs fail to finish tasks, because they lose all computation data and restart the tasks every time power recovers. RRAM + NVFF-based FPGAs and the proposed CP-FPGA can both finish the work. Compared with the RRAM + NVFF-based FPGA, the offline version of CP-FPGA reduces energy consumption by 21.94%. The energy reduction is attributed to analyzing the CDFG and FSM of the application offline. Redundant data are discarded when CP-FPGA backs up its computation data. The hybrid (offline+online) version of this paper can further reduce energy consumption by 17.6% compared with the offline version. That means the proposed work can reduce energy consumption by 39.5% compared with the traditional RRAM + NVFF-based FPGA. In the best scenario, it reduces execution energy by 64.9%, which means that the hybrid version only needs 35.1% of the energy of the RRAM + NVFF-based FPGA.
Compared with the offline version of CP-FPGA, the hybrid version needs 16.4% more execution time. The overhead comes from the task rollback. Compared with RRAM + NVFF, the execution time of CP-FPGA increases by 54% in the SHA benchmark due to four factors: 1) task rollback accounts for 26.4%; 2) power OFF accounts for 16.4%, because longer execution time causes more power failures; 3) RRAM writing accounts for 10.3%; and 4) RRAM reading accounts for 0.9%. However, we believe that energy efficiency is more critical in low-power scenarios, since it allows more tasks to be executed. We can therefore trade execution time for energy efficiency.
One interesting result is that computation rollback does not necessarily require more execution time to finish tasks. As shown in Fig. 5(b), the hybrid version of CP-FPGA uses less time to finish FIR than the offline version, which does not allow computation rollback. The reason is that rollback gives up backup in some situations, which reduces the backup time. Since RRAM writing is time-consuming, rolling back may take less execution time than backing up directly.
Another interesting result is that the hybrid version typically has shorter backup time than the offline version, because in many situations the backup is given up. For AES, however, the hybrid version has longer backup time than the offline version. The reason is that AES has many offline checkpoints. In order to limit the rollback cost, CP-FPGA backs up at these checkpoints whether the power is sufficient or not. These routine checkpoints increase the total backup time, but the rollback energy is limited without much backup energy overhead, since these routine checkpoints have small backup energy.
C. Discussion
In this section, we will discuss the breakdown of hybrid (online+offline) version CP-FPGA, the impact of power failure frequency, and the impact of CLB number in one CU.
1) Breakdown of the Hybrid Version CP-FPGA:
The execution progress and the execution energy of AES are shown in Fig. 6(a) and (b). The offline version can finish the work faster than the hybrid version. The first reason is that the offline version does not have computation rollback. The second reason is that the hybrid version routinely backs up at states with less backup cost. Note that at some points in time, the progress of the hybrid version is faster than that of the offline version. This happens when power recovers quickly after a power failure. The offline version backs up its state and then restores it, and backing up data to CPRAMs takes a relatively long time. The hybrid version gives up backup and directly restores to continue the task. When the rollback time is less than the backup time, the hybrid version achieves more progress.
The hybrid version has lower execution energy. This is because in some places the offline version needs to back up at a high cost, whereas the hybrid version gives up backup. Therefore, the execution energy of the hybrid version grows more slowly than that of the offline version.
2) Impact of Power Failure Frequency Expectation: We consider that power failures occur in a random manner with a certain power failure frequency expectation. The impact of the power failure frequency expectation is shown in Fig. 6(c) and (d). Both execution time and execution energy decrease as the power failure frequency decreases, because a lower power failure frequency leads to fewer backups and fewer rollbacks. In Fig. 6(c), there exists a time crosspoint. When the frequency is higher than this point, the offline version needs more execution time. Otherwise, the hybrid version needs more execution time. The reason is that when the frequency of power failure is very high, the offline version backs up its state after each power failure, and backing up data to CPRAMs takes too much time. The hybrid version gives up backup and directly restores, and the rollback time is less than the backup time.
There also exists an energy crosspoint in Fig. 6(d). When the power failure frequency is lower than this crosspoint, the offline version needs less energy than the hybrid version. The reason is that the frequency is so low that power failures hardly happen, while the hybrid version still needs to routinely back up at checkpoints, so it needs more energy than the offline version. In CP-FPGA, the energy saving of each backup operation is independent of the power failure frequency, and its average value is 48.1% in the experiments. However, the number of power failures changes the ratio of total system energy savings, because efficient backup operations become more dominant when more power failures occur. The typical energy savings of the whole system are 39.5%.
3) Impact of CLB Number in One CU: The impact of the CLB number in one CU (N) is shown in Fig. 6(e) and (f). The number of CLBs in one CU influences the backup granularity and the chip area. Fig. 6(e) shows the area comparison between CP-FPGA and the traditional RRAM + NVFF-based FPGA. Larger N means fewer CPRAMs, so the area of CP-FPGA becomes smaller than that of the RRAM + NVFF-based FPGA, since NVFFs introduce large area overhead. When N is small, the area overhead introduced by CPRAMs is larger than that of NVFFs, and CP-FPGA has the larger area. Fig. 6(f) shows the execution energy comparison between CP-FPGA and the NVFF-based FPGA. Since the CU is the smallest backup unit in CP-FPGA, small N means that each CU is small. CP-FPGA can then select CLBs at a finer granularity, so more redundant data are eliminated. When N increases, CP-FPGA has smaller area but lower data reduction. Therefore, the execution energy of the offline version gets closer to that of the RRAM + NVFF-based FPGA. The execution energy of the hybrid version also increases with N, but it grows more slowly than that of the offline version. The reason is that the hybrid version abandons those backups with too much energy overhead.
D. Area and Latency Overhead
Fig. 7 shows the area and critical path latency comparison among four different FPGA architectures based on 20 Microelectronics Center of North Carolina (MCNC) benchmarks. MCNC benchmarks are very popular in FPGA area and latency estimation.
After a power failure, traditional SRAM + DFF-based FPGAs lose both configuration data and computation data. RRAM + DFF-based FPGAs do not need to reload configuration data from off-chip but still lose computation data. RRAM + NVFF-based FPGAs and CP-FPGA can keep both configuration and computation data. Both can be powered by energy-harvesting devices.
Compared with the SRAM-based FPGA, CP-FPGA reduces average area and latency by 28.5% and 1.82%, respectively, because of the small feature size of RRAM cells. Compared with RRAM + DFF-based FPGAs, CP-FPGA has 0.23% area overhead and 2.18% latency overhead. The area overhead is caused by the nvController and the CPRAMs. The latency overhead is caused by additional tracks in the data path. Compared with the RRAM + NVFF-based FPGA, CP-FPGA reduces average area and latency by 4.2% and 0.7%, respectively. The reason is that NVFFs introduce large area overhead compared with DFFs. Larger area leads to longer routing tracks and larger latency.
VI. CONCLUSION
Ultralow-power FPGAs are an emerging platform for IoT applications with programmability and strong edge computing capability. However, volatile FPGAs suffer from frequent rollbacks due to power failures. Nonvolatile FPGAs become promising candidates under an unstable power supply. This paper proposes a hardware/software codesign of a nonvolatile FPGA with an efficient checkpointing strategy (CP-FPGA). Backup data can be reduced by offline analysis. An online scheduler is used to balance computation rollback against backup energy. Experimental results show that the offline CP-FPGA reduces energy consumption by 21.94% compared with the latest RRAM + NVFF-based FPGA. The enhanced hybrid CP-FPGA further reduces energy consumption by 17.6% compared with the offline CP-FPGA. Besides the robustness against power failures, CP-FPGA also provides a way to handle the soft error problem by checkpointing, which will be future work.
