Energy harvesting is an attractive way to power future IoT devices since it can eliminate the need for batteries or power cables. However, harvested energy is intrinsically unstable. While FPGAs have been widely adopted in various embedded systems, they can hardly survive unstable power since all the memory components in an FPGA are based on volatile SRAMs. Emerging non-volatile memory based FPGAs offer a promising way to keep configuration data on the chip during power outages. However, few works have considered implementing efficient runtime checkpointing of intermediate data on non-volatile FPGAs. To realize accumulative computation under intermittent power on FPGA, this paper proposes a low-cost design framework, Data-Flow-Tracking FPGA (DFT-FPGA), which utilizes binary counters to track intermediate data flow. Instead of keeping all on-chip intermediate data, DFT-FPGA targets only the necessary data, which is labeled by off-line analysis and identified by an online tracking system. The evaluation shows that, compared with state-of-the-art techniques, DFT-FPGA can realize accumulative computing with less off-line workload and significantly reduce online roll-back time and resource utilization. This paper has been accepted by ACM Journal on Emerging Technologies in Computing Systems (JETC).
INTRODUCTION
FPGAs have been widely adopted in various embedded systems that are powered by batteries. However, in the emerging Internet of Things (IoT) [1] [2] [3], which is full of tiny, cost-sensitive, and space-constrained widgets, batteries are no longer an ideal power supply due to poor scalability and to recharging and safety concerns. Among the possible alternatives, energy harvesting systems are becoming one of the most promising candidates because they convert ambient energy from their surroundings. Widely known energy harvesting techniques include photovoltaics (PV), thermoelectric generators (TEGs), and piezoelectric (PZ) harvesters. A device equipped with these harvesters can use the converted energy directly or recharge its energy storage (e.g., capacitors). The ease of access to power makes energy harvesting a very competitive power source for portable devices. However, there are two primary challenges for such energy harvesting systems: unstable power and low power input.
Even though there are ultra-low power FPGAs such as the Lattice iCE40 series, which can operate at µW-level power [4], the unpredictability of available energy renders the power intermittent, which will interrupt computations. Under such conditions, long computations may be prohibited since the intermediate data will be lost and the computation has to start over from the beginning. Thus, it is essential to preserve the FPGA configuration data and intermediate data during a power outage. Configuration data defines the functionality of an FPGA chip, and intermediate data is the data generated during computation. By keeping both, long computations can be completed by retrieving a checkpoint after power resumes. Non-volatile memory based FPGAs (NV-FPGAs) are natural candidates to address this challenge. By substituting NVMs such as ReRAM, STT-RAM, and PCM for SRAMs, configuration data can be retained locally on the chip, with the benefits of low leakage power, short critical path, small area, etc. [5] [6] [7] [8] [9] [10]. Therefore, the costs associated with loading configuration data from off-chip flash memories are avoided when the FPGA recovers from a power outage.
In the existing NV-FPGAs and traditional FPGAs, intermediate data is held by registers, which consist of volatile flip-flops (FFs). Like configuration data, intermediate data needs to be saved when power is lost or weak in order to resume the system state after the power comes back. To preserve intermediate data, non-volatile flip-flops (NV-FFs) have been integrated into processors. Such a non-volatile processor freezes all register data locally on the chip when it is shut off [11]. The success of NV-FFs in processors makes them a good candidate for FPGA flip-flops. However, an FPGA has significantly more register resources than a processor, and FPGA resource utilization varies from design to design. Freezing all registers on an FPGA will waste scarce energy in an energy harvesting system.
To improve the efficiency of preserving intermediate data and to reduce the roll-back impact of power interruptions, this paper proposes DFT-FPGA, a data flow tracking methodology on FPGA via High-Level-Synthesis (HLS). HLS takes software functions as inputs and compiles them to a Register-Transfer-Level (RTL) design. DFT-FPGA builds data flow trackers for such functions and a control unit that parses the trackers' status in HLS. With an offline mapping of functions to trackers, DFT-FPGA can track the intermediate data inside a function online via a tracker. Then, instead of all data, only a set of tracked intermediate data in registers is stored locally in non-volatile flip-flops. In this way, the cost of preserving intermediate data can be significantly reduced. The main contributions of this work are as follows:
• Design of a binary counter based tracker framework which tracks data flow in FPGA.
• Design of control unit which stores the mapping of intermediate data and its on-chip physical address.
• Design of an off-line intermediate data to tracker mapping algorithm.
• Design of function split and merge method for DFT-FPGA.
• A demonstration of the performance of NV-FF based FPGA.
• A demonstration of the efficiency of FPGA design on representative benchmarks.

The rest of the paper is organized as follows. Section 2 presents FPGA in an energy harvesting system, non-volatile FPGA, High-Level-Synthesis, and the motivation of DFT-FPGA. Section 3 presents the DFT-FPGA framework. Section 4 presents the function-to-tracker mapping algorithms and Section 5 presents the evaluation results.
PRELIMINARY
We will first present FPGA in an energy harvesting system in subsection 2.1. Then works related to NV-FPGA and NV-FF architecture will be presented in subsection 2.1.1. The background of High-Level-Synthesis will be introduced in subsection 2.2 and the motivation of the proposed work will be presented in subsection 2.3.
FPGA and Energy Harvesting System
FPGA is widely adopted in open field applications such as wireless network platforms [12] [13] [14]. In this application scenario, an energy harvesting system outperforms other power sources such as batteries and cables thanks to its perpetual power supply. In energy harvesting systems, ambient energy such as solar, wind, mechanical strain, ambient radiation, and human motion can be harvested to power the energy consumer and its peripheral devices. Though energy harvesting can provide a perpetual power supply, compared to systems powered by cable, an energy harvesting system may only harvest and supply a small amount of energy. Moreover, the passive energy harvesting approach makes the harvested energy unpredictable and unreliable. It is observed that small harvesters such as a wrist-worn motion harvester can provide about 40 µW of power, with worst-case power outages every 10 ms in daily activities [15, 16]. Therefore, a device with an energy harvesting system should not only be able to work in a low-energy mode but also be robust under intermittent power.

An energy harvesting system usually consists of an energy source, a regulator, a capacitor, and an energy consumer. A classical system is shown in Fig. 1. The regulator is a bridge between the ambient energy, the energy consumer, and an auxiliary capacitor. The energy consumer is powered by the regulator when the harvested energy is sufficient. Meanwhile, the capacitor is fully charged and stands by for an energy spike or for a power outage or weak power. If the ambient energy becomes weak, the additional energy from the capacitor should either sustain the rest of the consumer's work or preserve its current state. As the computation complexity varies from application to application, the energy needed for each application is different. Therefore, preserving the current state is widely adopted in designs [17] [18] [19] [20] [21] [22] [23] [24].
Moreover, the energy from capacitors is limited, as the size of capacitors in small-scale energy harvesting systems can be as small as µF and under 5 [19]. For energy consumers like microcontrollers, ASICs, and FPGAs, dynamic power varies from cycle to cycle. Evaluating and recording the energy of each cycle through to the end can be unrealistic, as a regular application can run for thousands or even tens of thousands of cycles.

BRAMs. The breakdown structure of a CLB from a Xilinx FPGA is shown in Figure 3 (a). The Look Up Table (LUT) is the smallest programmable computation unit inside a CLB and is wrapped by a Basic Logic Element (BLE). As shown for this 4-input LUT, the LUT can implement different logic by changing the binary values in its mask; different combinations of inputs then lead to different logic results. Each LUT is followed by two flip-flops in a BLE, which compose registers to support sequential logic in high-frequency computation. The registers hold the intermediate results from the LUTs in each clock cycle. Then, data is sent out of the CLB through the Connect Box and routed through Switch Boxes and Routing Channels. For a specific application, according to its complexity, a number of LUTs, flip-flops, Switch Boxes, and Routing Channels are selected and activated in the FPGA to build the corresponding computation logic and data flow path. (In a Xilinx FPGA, the CLB shown in Fig. 3 (a) is named a SLICE, and two SLICEs form a physical CLB.) In a traditional FPGA, SRAMs are used to build all these components. SRAMs are usually fast but volatile. Thus, the intrinsic flexibility of reconfiguration lets an FPGA switch easily between applications. However, configuration data and computation data vanish from the chip when power is lost or weak. In order to keep the data, part of the FPGA components need to be replaced by non-volatile memory.

2.1.2 Non-volatile FPGA. The study of non-volatile FPGA has been conducted by Gaillardon, Cong, et al. The existing NV-FPGAs preserve all configuration data on the chip.
As the configuration data for LUTs, Connect Boxes, and Switch Boxes is stored in SRAMs bit by bit, replacing the configuration data SRAMs with non-volatile memories (NVMs) such as ReRAM and STT-RAM allows the configuration data to be permanently retained on the FPGA. Though NVM writing and reading is slower than for SRAMs, an NVM preserves its data even when power is lost. Figure 3 shows how the configuration data SRAMs are replaced by NVMs on-chip [25]. In this figure, an SRAM is represented by a grey block and an NVM is represented by a green block. The LUT configuration data is stored in its mask, which is held by SRAMs. After replacing the SRAMs in the mask with NVMs, the configuration data for the LUT can be permanently preserved on chip after one programming. Figure 3 [11, 33, 34, 35].

Contrary to configuration data, intermediate data is frequently refreshed, and the data flip frequency can be hundreds of megahertz (MHz). That is, at each clock cycle, the data in a register is refreshed. Therefore, the NVM writing and reading time may introduce delay that limits the FPGA working frequency. It is observed that a ReRAM NV-FF in 65nm technology has achieved a writing time as low as 4 µs and 46.2 pJ per bit [11]. Meanwhile, a regular FPGA usually works at hundreds of MHz, i.e., with a clock period of a few ns. With such performance, if an NVM write happens in the NV-FF every clock cycle, the FPGA frequency may be degraded to dozens of MHz. The architecture of the NV-FF is shown in Fig. 3 (d). Based on a master-slave flip-flop, two NVM cells are integrated into its slave logic. By adding an extra NVM control part, data in the slave logic can be optionally stored into or retrieved from the NVM cells. The NV-FF works like a regular flip-flop if NVM writing is not triggered, and it can write its data to the NVM or recover data from the NVM upon a trigger. When writing to the NVM is triggered, the clock is held while the data is written to the NVM, and likewise when data is retrieved.
This avoids writing to NVMs every clock cycle, and the working frequency is not influenced if NVM writing is not triggered. The NVM brings non-volatility but also a larger logic area. It is observed in [11] that an NV-FF based processor has an extra 39% area for a single FF but less than 10% extra area for the whole chip. The 39% single-FF area overhead is caused by the NVM's control logic; the NVM itself is fabricated on top of the chip and does not add extra area. Though NVM writing is not triggered at every clock cycle, the area overhead of the control logic increases the flip-flop's intrinsic delay and the size of a CLB, leading to a bigger FPGA, which increases the routing distance. This may decrease the FPGA working speed. The impact on the FPGA's working frequency after introducing NV-FFs is evaluated in section 5.
Each NV-FF can be triggered to work in regular, store, or retrieve mode. If the other FPGA components are non-volatile, then by storing the intermediate data to the NVM when power is weak, the chip state can be held on board and retrieved later. However, a single flip-flop cannot be selected on an FPGA chip, because an individual flip-flop cannot be indexed inside a CLB. Moreover, an FPGA's flip-flops can number in the thousands or tens of thousands. In the proposed design, selected flip-flops need to be stored in order to keep intermediate data, and flip-flop storing and retrieving are executed SLICE by SLICE. This is because FPGA tools like Vivado pack flip-flops into the same SLICE [36]. In DFT-FPGA, the SLICE addresses are acquired after FPGA synthesis and pre-loaded to the FPGA. During online intermediate data tracking, these SLICE addresses can be read out after parsing the trackers' status. Then, the flip-flops in these SLICEs can be triggered for writing or retrieving.
High-Level-Synthesis (HLS)
High-Level-Synthesis converts a software language such as C/C++ to a Hardware Description Language (HDL) like Verilog or VHDL. The efficiency and accuracy of HLS have been verified in modern FPGA applications [37] [38] [39] [40]. It takes software functions F as inputs to generate HDL modules M and a state transition flow S. After HLS, a program is split into multiple modules according to the program hierarchy. These modules are interpreted as different states in the state transition control.
State transition control initializes and terminates the modules during FPGA operation. In HLS, each function under the top function is generally compiled to a standalone module. Based on the data dependency between modules (functions), these modules are triggered in parallel or sequentially by the state transition control. A basic HLS FPGA design with its data flow control is shown in Fig. 4 [41].
In this program, there are three functions F_1, F_2, and F_3 under the top function F_main. After HLS, the functions are converted to modules as shown in Fig. 4 (a). In the top function, F_1 and F_2 have a data dependency and F_3 is independent of F_1 and F_2. Therefore, as shown in the state transition control in Fig. 4 (b), F_2 has to be placed after F_1 in the state flow S. Meanwhile, F_3 can start with F_1. In states S_1, S_2, and S_3, component information in modules M_1, M_2, and M_3, such as module names, registers, connections, and cycles, can be collected. An unroll of S_2 is shown in Figure.
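The dependency-driven ordering of states described above can be illustrated with a small sketch. It is not how an HLS tool is implemented; the helper `schedule_states` and the dependency dictionary are hypothetical, and only the names F1, F2, and F3 come from the example. Each function is placed at the earliest state level that its data dependencies allow, so functions on the same level may start together.

```python
def schedule_states(deps):
    """Assign each function the earliest state level allowed by its data dependencies.

    deps maps a function name to the set of functions whose outputs it consumes.
    (Hypothetical helper for illustration; not part of any HLS tool.)
    """
    level = {}

    def resolve(name, visiting=()):
        if name in level:
            return level[name]
        if name in visiting:
            raise ValueError(f"cyclic dependency at {name}")
        preds = deps.get(name, set())
        # A function runs one level after its latest predecessor; roots run at level 1.
        level[name] = 1 + max(
            (resolve(p, visiting + (name,)) for p in preds), default=0
        )
        return level[name]

    for name in deps:
        resolve(name)
    return level


# F2 consumes the result of F1; F3 is independent of both.
deps = {"F1": set(), "F2": {"F1"}, "F3": set()}
levels = schedule_states(deps)
```

In this sketch F1 and F3 share the first level and F2 follows F1, which mirrors the ordering in the state flow S.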
Motivation
Existing designs preserve chip state by placing checkpoints in the computation [17-19, 22-24, 42]. Preserving the data for the entire chip may also be an option [11]. By triggering writing and retrieving of the flip-flops in all FPGA SLICEs, all the data on chip can be fully preserved. However, such a strategy can be heavily resource intensive.
As FPGA resource utilization varies from design to design, preserving the entire chip causes unnecessary NVM cell writing in unused memories, which wastes scarce harvested energy and slows down writing. As shown in Figure 6 (c), this computation occupies only 15% of the chip resources, yet the remaining 85% of components are still stored and retrieved if the whole chip's data is preserved. After applying DFT-FPGA, unnecessary back-up can be avoided. The benefits and impact of applying DFT-FPGA will be presented in section 5.
Our previous work FC-FPGA applies a shift-register-like data flow tracker to locate and retrieve intermediate data at the RTL level [43]. In this work, binary counters are adopted to further reduce resource utilization, and a full High-Level-Synthesis based workflow is proposed, which significantly reduces the workload.

DFT-FPGA FRAMEWORK

The proposed DFT-FPGA methodology includes both framework and data-to-tracker mapping algorithm designs. In this section, we will introduce the framework of DFT-FPGA. The algorithms that work with the framework will be introduced in Section 4.
Hardware Architecture Overview
The proposed design includes function trackers f and an NV-FF control unit CU, as shown in Figure 8 (b). In this figure, a finite state machine is generated by HLS in the back-end to control data transition in functions, i.e., the transition between states. Each function is assigned a function tracker. Trackers are read by the control unit, which pre-loads SLICE addresses and maps them to the associated registers. In DFT-FPGA, at every power outage, the control unit reads the function trackers' status and then selects the corresponding SLICEs to trigger the action. In the proposed design, the control path to the NV-FF control part is considered to be already embedded in the NV-FPGA.
Function Trackers
As all intermediate data is held by registers, function trackers are designed to track the active registers at each clock cycle. Therefore, a tracker is built to run for the same number of clock cycles as its corresponding function. By reading the tracker's status, the location of the data flow in its function can be acquired.
In DFT-FPGA, each function is assigned a function tracker to trace its data flow. The tracker is activated simultaneously with its corresponding function, and they are terminated at the same time. The method to insert trackers into a software program is shown in Figure 7 (a). Functions F_1, F_2, and F_3 under F_main are assigned private trackers f_1, f_2, and f_3. The state transition and timing among F_1 to F_3 are as illustrated in Fig. 4 (b). Therefore, the initialization of trackers f_1 to f_3 should follow the same order. In the proposed design, a lock is utilized between trackers to keep them initialized in the right order. The lock in a tracker consists of lock_head and lock_tail. If a tracker is initialized after its anterior tracker, its lock_head is the lock_tail of the anterior tracker. As shown in Fig. 7 (c), tracker f_2's lock^2_head is f_1's lock^1_tail. After tracker f_1 is terminated, lock^1_tail is set to 1. Thus, tracker f_2 is always blocked while F_1 and f_1 are not finished. As there is no data dependency between function F_1 and tracker f_1, f_1 can start simultaneously with function F_1. For trackers that start with the beginning function, such as f_1, lock^1_head is pre-defined to be 1 so that the tracker unblocks itself. In this way, all trackers can be initialized and terminated with their corresponding functions.
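The lock chaining between trackers can be sketched behaviorally as follows. This is a software model under stated assumptions, not the generated hardware: the class `Tracker`, the helper `run`, the example lengths, and the choice to sample all lock heads before updating any tracker (to mimic registered signals) are all hypothetical.

```python
class Tracker:
    """Behavioral model of a tracker guarded by lock_head / lock_tail (illustrative)."""

    def __init__(self, name, length, predecessor=None):
        self.name = name
        self.length = length            # cycles the tracked function runs
        self.predecessor = predecessor  # anterior tracker, if any
        self.count = 0
        self.lock_tail = 0              # set to 1 once this tracker terminates
        self.start_cycle = None

    @property
    def lock_head(self):
        # A tracker with no anterior tracker is pre-defined as unlocked (1);
        # otherwise its lock_head is the anterior tracker's lock_tail.
        return 1 if self.predecessor is None else self.predecessor.lock_tail

    def step(self, cycle, head):
        if self.lock_tail or not head:  # already finished, or still blocked
            return
        if self.start_cycle is None:
            self.start_cycle = cycle
        self.count += 1
        if self.count == self.length:
            self.lock_tail = 1          # releases the posterior tracker


def run(trackers, cycles):
    for cycle in range(cycles):
        # Sample every lock before updating, as registered signals would behave.
        heads = [t.lock_head for t in trackers]
        for t, head in zip(trackers, heads):
            t.step(cycle, head)


f1 = Tracker("f1", length=3)
f2 = Tracker("f2", length=2, predecessor=f1)  # F2 depends on F1
f3 = Tracker("f3", length=4)                  # F3 is independent
run([f1, f2, f3], cycles=6)
```

In this model f1 and f3 start in the first cycle, while f2 stays blocked until f1 has set its lock_tail.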
A function tracker consists of a loop arbitration t, a binary counter count, a tracker status register f_status, and the tracker locks lock_head and lock_tail. Without loss of generality, trackers can track functions with loops or with regular operations.
An example of a function and its tracker logic is shown in Fig. 7 (b) and (c). This is a function with an outer loop and an inner loop. In the tracker, the first loop t handles functions with loops. The loop iteration count t corresponds to the number of outer loop iterations in the function, and count_max is the length of all operations under the function's outer loop. The binary counter increments to count_max and is then reset to zero for the next outer loop iteration. Thus, the binary counter can be reused in all outer loop iterations. If a function does not contain any loop, t is set to one and count_max is the length of the function. When power is lost during computing, the energy harvesting system sends P_loss = 1 to DFT-FPGA.
With P_loss|resume = 1, the tracker sends out its count as f_status. The trackers' f_status will be further parsed by the control unit. When the end of tracking is reached, f_status is reset to zero. In this way, the tracker counts along with the function's progress and reports the current stage when power is lost. After power resumes (P_resume), the trackers' data is recovered by the NV-FPGA first, and the resumed trackers' status is used to wake up the stored intermediate data.
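A behavioral sketch of the tracker may make the interplay of t, count, and f_status more concrete. It is only a software model with hypothetical names (`FunctionTracker`, `tick`, `f_status`); counting from 1 to count_max within an iteration is an assumption chosen so that a status of zero can mean "not active".

```python
class FunctionTracker:
    """Software model of a tracker: loop arbitration t plus a reusable binary counter."""

    def __init__(self, loop_iterations, count_max):
        self.t = loop_iterations    # number of outer loop iterations (1 if no loop)
        self.count_max = count_max  # cycles in one outer loop iteration
        self.iteration = 0
        self.count = 0              # 0 means not started or terminated
        self.done = False

    def tick(self):
        """Advance one clock cycle alongside the tracked function."""
        if self.done:
            return
        if self.count == self.count_max:  # one outer loop iteration has finished
            self.iteration += 1
            self.count = 0                # counter is reused by the next iteration
            if self.iteration == self.t:  # end of tracking: status stays at zero
                self.done = True
                return
        self.count += 1

    def f_status(self, p_loss):
        """Report the counter value only when power loss (or resume) is signalled."""
        return self.count if p_loss else None


# A function with 2 outer loop iterations of 3 cycles each.
tracker = FunctionTracker(loop_iterations=2, count_max=3)
for _ in range(5):
    tracker.tick()
status_at_loss = tracker.f_status(p_loss=True)
```

Here power is lost in the second cycle of the second iteration, so the reported status identifies that cycle within the iteration rather than within the whole function.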
To further reduce the counter size, the binary counter can be defined with a lower bit-width such as 4 bits, 8 bits, or 16 bits according to the length of a function. In most cases, the binary counter is found to be as small as 4 to 8 bits (tracking lengths ranging from 225 to 65025 cycles). The resource utilization, tracking length, and performance of trackers of different sizes will be discussed in section 5.

In the figure showing the storage layout of cu_BRAM, every tracker's f_status is combined with an offset, including that of the first tracker f_1. This is because the trackers themselves need to be preserved in DFT-FPGA as well: cu_BRAM keeps the trackers' SLICE addresses at indexes zero to offset_1. The SLICEs where trackers are placed are always stored when power is lost. By reading f_status online, the corresponding SLICE addresses are parsed by the control unit. SLICE storing or retrieving can then be performed on these SLICEs. As a tracker's f_status is zero if it is not triggered or has terminated, no action on SLICEs is executed for such trackers and the control unit parses no addresses for them.
NV-FF Control Unit
In DFT-FPGA, one BRAM, cu_BRAM, is instantiated to keep all the trackers' mapped SLICE addresses. During HLS, if the depth of cu_BRAM is too large to be placed in a single physical BRAM, the synthesis tool will automatically expand it to multiple BRAMs. As a SLICE address in the FPGA is organized as XxYy, a data structure is used in DFT-FPGA to keep the address in the X and Y directions, as shown in Figure 8 (c). The resource utilization and performance of the control unit will be discussed in section 5.
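The address lookup performed by the control unit can be sketched as a table indexed by offset plus f_status. The layout below is an assumption made for illustration (tracker SLICEs first, then one region per tracker, one entry per status value); the helpers `build_cu_bram` and `slices_to_store` and the example coordinates are hypothetical and do not describe the actual BRAM contents.

```python
def build_cu_bram(tracker_slices, mappings):
    """Lay out one flat table: tracker SLICEs first, then one region per tracker.

    tracker_slices: list of (X, Y) addresses where the trackers themselves sit.
    mappings: tracker name -> list indexed by status-1 of lists of (X, Y) addresses.
    """
    bram = [[addr] for addr in tracker_slices]  # always-stored entries
    offsets = {}
    for name, per_status in mappings.items():
        offsets[name] = len(bram) - 1           # status s >= 1 lands at offset + s
        bram.extend(per_status)
    return bram, offsets


def slices_to_store(bram, offsets, n_tracker_entries, statuses):
    """Return the SLICE addresses to store (or retrieve) for the given tracker statuses."""
    selected = set()
    for entry in bram[:n_tracker_entries]:      # trackers are always preserved
        selected.update(entry)
    for name, status in statuses.items():
        if status == 0:                         # idle or terminated tracker: no action
            continue
        selected.update(bram[offsets[name] + status])
    return sorted(selected)


tracker_slices = [(0, 0)]
mappings = {
    "f1": [[(1, 0)], [(1, 0), (1, 1)]],  # registers live at status 1 and status 2
    "f2": [[(2, 0)]],
}
bram, offsets = build_cu_bram(tracker_slices, mappings)
to_store = slices_to_store(bram, offsets, len(tracker_slices), {"f1": 2, "f2": 0})
```

With f1 at status 2 and f2 idle, only the tracker SLICE and the two SLICEs mapped to that status of f1 are selected.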
After assigning trackers to functions, two tasks need to be accomplished before DFT-FPGA can work properly. First, for a given program with multiple functions, we need to identify which registers keep intermediate data at a given clock cycle in each function. Second, as tracker assignment is arranged according to the function hierarchy, the main function should contain only the instantiation of functions. For this task, function merge and function split will be discussed.
DFT-FPGA OFF-LINE ANALYSIS
In the previous sections, we showed how to generate and assign trackers to functions. In this section, we will present how to establish mappings between a function and its tracker, and how to merge or split functions in a program.
Function to Tracker Mapping
After analyzing the program hierarchy and the state transition, the activation of the trackers is determined. The mapping of tracker status to data flow in a function can be determined by unrolling the function's state. A state breakdown is shown in Figure 9 (a). In this figure, rectangular blocks represent operations inside the function, arranged by clock cycle from t_start to t_end. At each clock cycle t_n, the operations p_tn with their registers re_tn are placed. The connecting arrows indicate the data flow inside a state. In this figure, p_2 and p_3 are the operations p_tn+2; re_2 and re_3 are the registers re_tn+2. Every operation is followed by its register, which holds its data at every clock cycle. The registers re_tn in a state keep the intermediate data within a function at different clock cycles. These registers are the targets to be tracked by trackers, defined as checkpoint_tn. As different registers are triggered as the function progresses, the registers that hold intermediate data at each clock cycle should be determined, and those registers' SLICE addresses are stored by the control unit.
In an FPGA design, operations can execute in parallel, like p_2 and p_3, and operations may span multiple cycles, like p_4. Therefore, at a certain clock cycle, multiple registers re_tn, or earlier registers re_tn', may hold the intermediate data. The method to determine checkpoint_tn is illustrated in Algorithm 1. At each cycle t_n, the register re_tn is added to checkpoint_tn because its operations end at this cycle. If a multi-cycle operation crossing t_n starts at t_n and ends at t_n*, the register re_tn' ahead of this operation is also added to checkpoint_tn, e.g., checkpoint_tn+2 = re_2 ∪ re_3 and checkpoint_tn+3 = re_2 ∪ re_3. In this way, the registers holding intermediate data at each clock cycle are acquired. For a multi-cycle operation like p_4 at t_n+3, DFT-FPGA inserts roll-back logic to ensure consistency between tracker and function after retrieval. The roll-back logic is shown in Fig. 9 (b). After applying Algorithm 1 to all functions and trackers, the mappings between registers and tracker status are established.
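The checkpoint determination described above can be sketched in a few lines. This is an interpretation of the prose rather than a transcription of Algorithm 1: the `Op` record, the function `checkpoints`, and the concrete schedule (p1 in cycle 1, p2 and p3 in cycle 2, p4 spanning cycles 3 and 4) are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    start: int           # first cycle of the operation
    end: int             # cycle in which its result is registered
    out_reg: str         # register following the operation
    in_regs: tuple = ()  # registers the operation reads


def checkpoints(ops, t_start, t_end):
    """For every cycle, collect the registers that hold live intermediate data."""
    result = {}
    for n in range(t_start, t_end + 1):
        regs = set()
        for op in ops:
            if op.end == n:            # operation finishes here: keep its register
                regs.add(op.out_reg)
            if op.start <= n < op.end: # multi-cycle operation still running:
                regs.update(op.in_regs)  # keep the registers feeding it
        result[n] = regs
    return result


ops = [
    Op("p1", 1, 1, "re1"),
    Op("p2", 2, 2, "re2", ("re1",)),
    Op("p3", 2, 2, "re3", ("re1",)),
    Op("p4", 3, 4, "re4", ("re2", "re3")),  # two-cycle operation
]
cp = checkpoints(ops, t_start=1, t_end=4)
```

In this schedule the checkpoint for cycle 3 still consists of re2 and re3, because p4 has not yet produced its result, which matches the example given for checkpoint_tn+3.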
If this is a loop function, as indicated by the dashed line in Fig. 9 (a), the span from t_start to t_end is tracked by count and the loop iteration is controlled by t in the tracker. The same binary counter count will be reused t times. In this way, DFT-FPGA can scale down the binary counter size and save more flip-flop resources. By analyzing the mapping relation of one iteration, the mapping for the whole function is acquired. During synthesis, these registers' SLICE addresses can be acquired.
After that, the mappings between the trackers' status and the SLICE addresses can be established. Then, by applying the mapping algorithm, registers are mapped to trackers. However, for computations that use few registers, DFT-FPGA can choose not to assign a tracker if the tracker would need more flip-flop resources than the target registers. Such cases are studied in section 5.
After analyzing the mapping between a tracker and its function, the trackers for the main function F_main need to be arranged. The proposed DFT-FPGA will tune the hierarchy of F_main to make it suitable for tracker assignment. The next task includes function split and function forming.
Function Split and Merge

Function Split.
Software languages are normally flexible in hierarchy and coding style, so a program's hierarchy may need to be tuned before DFT-FPGA can be applied. A function needs to be split if it contains independent loops or operations outside of a loop. Such cases are shown in Figure 10 (a) and (b). For independent loops, as shown in Fig. 10 (a), the function is split according to its outer loops. As the tracker in DFT-FPGA is designed to track one outer loop with all its inner operations, a function is split into multiple functions according to the number of independent loops. Then, trackers are assigned to the split functions. If a function contains operations other than loops, it needs to be split at the boundaries between the loops and the other operations. A case where a loop is followed by other operations is shown in Fig. 10 (b). It needs to be split into two functions. Through splitting functions, the proposed tracker logic can be successfully applied.

Algorithm 1 checkpoint_t determination
Input: function state S, start point t_start, end point t_end, operations p_tn, registers re_tn
Output: checkpoint_t
Define: start < n < end, start < n' < n, n < n* < end
for t_n ∈ S do
    checkpoint_t.append(re_tn)
    for t_n' ∈ S do
        for t_n* ∈ S do
            if re_tn' ♦ p_tn* = 1 then
                checkpoint_t.append(re_tn')    \\ ♦
            end if
        end for
    end for
end for
return checkpoint_t
Function Merge. Under the main function, there can be operations between functions. As the state transition is arranged between functions, those non-function operations will be automatically merged into a function by HLS. This causes ambiguity in the function-to-tracker mapping, and trackers cannot be directly applied to the auto-merged function. Such a case is shown in Fig. 11. By wrapping these operations into a function of their own, the state transition can be arranged across distinct functions. Then, trackers can be built and assigned to all tuned functions.
EXPERIMENT
In this section, we first evaluate the performance of the FPGA architecture with NV-FFs in subsection 5.1. Second, we evaluate the resource utilization and performance of the proposed tracker and control unit in subsection 5.2. Third, we evaluate the resource utilization and performance after applying the proposed DFT-FPGA to different benchmarks.
NV-FF FPGA performance
In this subsection, we start from the FPGA architecture k6_frac_N10_mem32K_40nm from VTR and evaluate the impact of NV-FFs by adding delay to its D flip-flop module and increasing the CLB area. As presented in section 2.1.3, the control part in an NV-FF brings extra area. Increasing the area of a single flip-flop can degrade its timing performance, but flip-flop area is not simulated in VTR. Therefore, we add extra delay to each flip-flop according to the area scaling ratios of 39% and 49% [11, 44]. The extra 39% case is caused by an additional 15T2R, i.e., 15 transistors and 2 ReRAMs [11].
The extra 49% case is caused by an additional 22T2R, i.e., 22 transistors and 2 ReRAMs [44]. The increase in flip-flop size also brings a larger CLB area, which leads to longer routing distances and may degrade the FPGA's working frequency. In the evaluated FPGA architecture, each CLB contains 10 flip-flops. Because VTR adopts the minimum width transistor area [45] to define the size of components, for the two types of NV-FF structure we increase the CLB size by 15 and 22 minimum width transistor areas, respectively. The FPGA architecture size increment is shown in Figure 12 (a). In this figure, the blue column represents the base architecture, the orange column represents the 15T2R architecture, and the grey column represents the 22T2R architecture. We observe a 0.2% logic area increase from base to 15T2R and 0.07% from 15T2R to 22T2R. The routing area increases by 0.16% from base to 15T2R and by 0.04% from 15T2R to 22T2R.
To evaluate the impact on the FPGA working frequency, which suffers from longer flip-flop delay and routing distance, we apply seven VTR benchmarks to each of the three FPGA architectures. The critical path delay and maximum working frequency are shown in Fig. 12 (b) and (c). In the evaluated benchmarks, we observe that the 15T2R architecture causes less than 3% additional critical path delay and the 22T2R architecture causes less than 7.4% additional critical path delay compared to the base architecture. As the critical path delay determines the maximum working frequency, we also show the maximum working frequency in Fig. 12 (c). We observe that less than 6.5MHz of degradation is caused by 15T2R and less than 10MHz by 22T2R. From the evaluation, we can see that integrating NV-FFs on an FPGA brings a slight performance degradation, i.e., several MHz. However, as most designs running on FPGAs work at hundreds of MHz [37, 39, 46, 47], such degradation has little impact on overall performance while achieving non-volatility of the flip-flops.
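Since the maximum frequency is the reciprocal of the critical path delay, a relative delay increase translates into a frequency drop as sketched below. The 100 MHz base frequency is an arbitrary assumption for illustration and is not one of the measured benchmark values; `degraded_frequency_mhz` is a hypothetical helper.

```python
def degraded_frequency_mhz(base_mhz, extra_delay_fraction):
    """Maximum frequency after the critical path grows by the given fraction.

    f_max = 1 / T_crit, so scaling T_crit by (1 + d) scales f_max by 1 / (1 + d).
    """
    return base_mhz / (1.0 + extra_delay_fraction)


base = 100.0                                       # assumed base frequency in MHz
f_15t2r = degraded_frequency_mhz(base, 0.03)       # 3% extra critical path delay
f_22t2r = degraded_frequency_mhz(base, 0.074)      # 7.4% extra critical path delay
drop_15t2r = base - f_15t2r
drop_22t2r = base - f_22t2r
```

Under this assumed base frequency the two upper-bound delay increases correspond to drops of a few MHz; the actual drops depend on each benchmark's own base frequency and delay increase.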
Tracker and Control Unit Evaluation
In the previous section, we evaluated the performance of the NV-FF based FPGA architecture. In this section, we will evaluate the resource utilization and timing performance of the DFT-FPGA framework. The evaluation is based on the FPGA chip xc7z020clg484. Table 1 shows the flip-flop and LUT usage and the maximum tracking cycles when the tracker size scales from 4 bits to 9 bits. The resources of xc7z020clg484 are also listed. In the evaluation, we observe that less than 0.03% of flip-flops and less than 0.21% of LUTs are used to build a tracker. Similarly, less than 0.05% of flip-flops and less than 0.31% of LUTs are used to build a control unit. If DFT-FPGA is applied on large-scale FPGAs, the utilization ratio will be further reduced.
The maximum tracking cycles are Max Cycle = t × count_max, as illustrated in section 3.2. For example, an 8-bit tracker can count a function with up to 65025 = 255 × 255 cycles. The control unit keeps the SLICE addresses and indexes them after reading the trackers' status. The control unit's resource utilization is shown in Table 2. In this table, we show the resources of a control unit with one tracker when the tracker size scales from 4 bits to 9 bits. An n-bit counter needs a BRAM depth of count_max to store SLICE addresses because the count starts over on entering each iteration of t. Thus, the depth needed for a tracker is much smaller than the length of a function.
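The relation between counter width, tracking length, and BRAM depth can be checked with a short calculation. The sketch assumes, in line with the 255 × 255 example, that both t and count_max are bounded by the largest value of an n-bit counter; the function names are hypothetical.

```python
def counter_limit(bits):
    """Largest value an n-bit binary counter can hold."""
    return (1 << bits) - 1


def max_tracking_cycles(bits):
    """Max Cycle = t * count_max, with both bounded by the counter limit (assumption)."""
    limit = counter_limit(bits)
    return limit * limit


def bram_depth(bits):
    """Entries needed per tracker: one per counter value within a single iteration."""
    return counter_limit(bits)


cycles_4bit = max_tracking_cycles(4)
cycles_8bit = max_tracking_cycles(8)
depth_8bit = bram_depth(8)
```

For an 8-bit tracker this gives 65025 trackable cycles against only 255 stored entries, which is why the table depth stays far below the function length.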
When the counter is less than 8bit, Vivado HLS optimizes it into ip-ops and LUTs to save BRAM resource.
e timing performance of a DFT-FPGA framework is shown in Table 3 . In this evaluation, we measure the maximum working frequency of 8-bit standalone tracker, standalone control unit, and DFT-FPGA which is an integration of one 8-bit tracker and control unit. We can observe that a standalone tracker can work up to 500MHz and a standalone control unit can work up to 400MHz on FPGA xc7z020cl 484. With the evaluated NV-FF FPGA performance and the DFT-FPGA performance, we can conclude that the DFT-FPGA consumes a small amount of FPGA resources and will not be the speed bo leneck a er being applied to the source program. 
DFT-FPGA Case study
In the previous section, we showed the performance of a tracker, a control unit and an individual DFT-FPGA framework. ... sm, and float, a resource increment of less than 1.5% is observed. DFT-FPGA generates trackers and a control unit to track their data flow. For small benchmarks such as global and struct, a resource increment of less than 0.3% is observed. Such benchmarks consume a small number of flip-flops and LUTs, even fewer than their trackers would require.
For such benchmarks, DFT-FPGA generates only a control unit, which keeps all associated SLICE addresses and saves all data.
In this way, DFT-FPGA achieves resource efficiency across benchmarks of different sizes. The proposed design shows good adaptability when the source program's size scales up or down. It consumes a small amount of FPGA resources to achieve intermediate data tracking in the analyzed benchmarks. The roll-back time and the number of stored flip-flops in different benchmarks are shown in Figure 13 and Figure 14.
In Fig. 13, the x-axis shows the number of power losses during a benchmark run and the y-axis shows the roll-back clock cycles. Power losses are triggered at random points within a benchmark's computation length; e.g., a value of 5 means that five randomly triggered power losses occur during the computation. In the evaluation, we simulate from 1 to 10 power losses during the computation to mimic different power conditions. Without loss of generality, for each power-loss case we record the mean over 10 test rounds. As shown in the figure, the roll-back time of the proposed design is near zero in all benchmarks and all power conditions. Its performance is not affected even when the power condition worsens. This is because DFT-FPGA is aware of the data flow inside each state and can resume computation from where it was interrupted. CP-FPGA, in contrast, needs to find its nearest checkpoint and recover from that point. If the states in a benchmark are long, the interval between two checkpoints is large, which causes a long roll-back if power is lost in the middle of such a state. This is observed in the sm benchmark, which consists of multiple states with over one thousand cycles each. When a power loss occurs in one of these states, a long roll-back happens. The performance of periodic checkpoint placement is significantly influenced by the source program and the power condition. Meanwhile, the proposed DFT-FPGA shows good adaptability, keeping roll-back time minimal for different benchmarks under different power conditions.
Figure 14 shows the stored flip-flop data for the benchmarks under different power conditions. In this figure, we record the number of flip-flops that are stored by CP-FPGA and by the proposed design. The x-axis shows the power-loss case, ranging from 1 to 10. The y-axis shows the number of flip-flops that have been stored once the computation finishes.
As CP-FPGA periodically places checkpoints in the design, the number of flip-flops stored is constant across all power conditions. After a state is called, it stores the state's result data from registers to BRAM as a checkpoint. This leads to unnecessary data storing when power losses happen only occasionally, such as once or twice in a computation. For DFT-FPGA, the saved flip-flops consist of both intermediate data registers and the trackers' flip-flop resources. We observe a linear increase in stored flip-flops as the number of power losses grows. This is because data storage in DFT-FPGA happens only at a power loss. Periodic checkpoint placement may store fewer flip-flops if only a small number of checkpoints are placed. One example is struct, which is arranged into several long states; CP-FPGA stores fewer flip-flops in such benchmarks. However, this brings long roll-back time, as shown in Fig. 13.
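The trade-off described for Fig. 13 and Fig. 14 can be illustrated with a simplified cost model. This is only a sketch under stated assumptions, not the evaluation setup itself: checkpoints are assumed to sit exactly at state boundaries, the tracking design is idealized as resuming at the interrupted cycle (zero roll-back), and the state lengths, register counts and function names are hypothetical.

```python
import bisect
import random
from itertools import accumulate


def checkpoint_rollback(state_lengths, cycle):
    """Cycles re-executed when power fails at `cycle` and execution
    resumes from the start of the state containing that cycle."""
    starts = [0] + list(accumulate(state_lengths))  # first cycle of each state
    idx = bisect.bisect_right(starts, cycle) - 1
    return cycle - starts[idx]


def mean_rollback(state_lengths, losses, rounds=10, seed=0):
    """Mean total roll-back over `rounds` runs, each with `losses`
    power failures at uniformly random cycles."""
    rng = random.Random(seed)
    total = sum(state_lengths)
    per_round = []
    for _ in range(rounds):
        cycles = [rng.randrange(total) for _ in range(losses)]
        per_round.append(sum(checkpoint_rollback(state_lengths, c) for c in cycles))
    return sum(per_round) / rounds


def stored_ffs_checkpoint(regs_per_state, num_states):
    """Periodic checkpoints store every state's registers,
    independent of how often power is lost."""
    return regs_per_state * num_states


def stored_ffs_tracking(live_regs, tracker_ffs, losses):
    """Tracking stores data only at a power loss, so the total
    grows linearly with the number of losses."""
    return losses * (live_regs + tracker_ffs)


if __name__ == "__main__":
    states = [100, 1200, 1500, 80]  # hypothetical state lengths in cycles
    for losses in range(1, 11):
        print(losses, mean_rollback(states, losses))
```

In this model, long states dominate the checkpoint roll-back cost, while the stored flip-flop count is flat for checkpoints and linear in the number of power losses for tracking, matching the qualitative trends described above.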
CONCLUSION
We propose a data-flow tracking framework, DFT-FPGA, for non-volatile FPGAs. It is a fully High-Level-Synthesis based framework targeting non-volatile FPGAs that can track and locate the physical location of intermediate data online. By parsing, storing, and retrieving certain areas of FPGA SLICEs, the proposed design can assist NV-FPGAs in intermittent computing with minimal resource overhead. The proposed DFT-FPGA also shows good adaptability across different benchmarks under various power conditions, with better resource utilization and less roll-back time.
