Debugging consumes a large portion of FPGA design time, and with the growing complexity of traditional FPGA systems and the additional verification challenges posed by multiple FPGAs interacting within data centers, debugging productivity is becoming even more important. Current debugging flows either depend on simulation, which is extremely slow but has full visibility, or on hardware execution, which is fast but provides very limited control and visibility. In this paper, we present StateMover, a checkpointing-based debugging framework for FPGAs, which can move design state back and forth between an FPGA and a simulator in a seamless way. StateMover leverages the speed of hardware execution and the full visibility and ease-of-use of a simulator. This enables a novel debugging flow that has a software-like combination of speed with full observability and controllability. StateMover adds minimal hardware to the design to safely stop the design under test so that its state can be extracted or modified in an orderly manner. The added hardware has no timing overhead and a very small area overhead. StateMover currently supports Xilinx UltraScale devices, and its underlying techniques and tools can be ported to other device families that support configuration readback. Moving the state from/to an FPGA to/from a simulator can be performed in a few seconds for large FPGAs, enabling a new debugging flow.
controllability of FPGA hardware. This causes major development productivity issues, as verification time often takes more than 50% of the design cycle. This worsens as the complexity of FPGA designs grows, and hence demand for intelligent debugging flows has increased. These flows should take advantage of the instant programming of FPGAs and the speed of running a design on an FPGA compared to a simulator while providing simulator-like visibility. In addition, the increasing deployment of FPGAs in data centers, where thousands of CPUs and FPGAs may be collaborating, pushes for novel software-like FPGA debugging flows that have full state observability for crash analysis. For software crashes, data centers rely on checkpointing of software processes to be able to trace the root cause of intermittent faults/bugs, and hence similar capabilities for FPGAs are highly desirable.
Hardware debugging is often unavoidable for large, complex systems in which simulation would take impractically long. An IBM study shows that verifying a complex design (a multi-core processor) using a multi-FPGA prototype is 100,000x faster than RTL simulation, which would take almost five years just to simulate Linux booting on that design [3]. In this FPGA prototyping case, the FPGA clock is slowed down to 4 MHz. This means that when directly running an FPGA-targeted application, which typically runs 10-100x faster, the slowdown of simulation relative to hardware execution is on the order of 1M-10Mx. Besides the speed disadvantage, RTL simulation usually runs without delays and certainly not with the spectrum of different possible signal delays on different chips. Thus, it does not fully capture everything that could happen on hardware; many timing-related bugs often appear only in hardware execution, such as data races between clocks that were missed in the timing constraints, metastability, and clock-domain crossing issues. Moreover, the speed of hardware debugging enables running vastly more test cases, allowing detection of rare bugs such as unusual input combinations, error states, and unsafe resets.
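The 1M-10Mx estimate follows from multiplying the two factors quoted above. The short Python check below reproduces the arithmetic; the variable names are ours and carry no meaning beyond this illustration.

```python
# Reproduce the order-of-magnitude estimate from the text.
prototype_speedup = 100_000      # multi-FPGA prototype vs. RTL simulation, per [3]
prototype_clock_mhz = 4          # prototype clock quoted in the text
native_factor_range = (10, 100)  # FPGA-targeted designs run 10-100x faster than the prototype

# Speed of native hardware execution relative to RTL simulation.
low, high = (prototype_speedup * f for f in native_factor_range)

# Clock range implied by the 10-100x factor (derived here, not quoted in the text).
clock_low, clock_high = (prototype_clock_mhz * f for f in native_factor_range)

print(f"hardware vs. simulation: {low:,}x to {high:,}x")
print(f"implied native clock: {clock_low}-{clock_high} MHz")
```

The implied 40-400 MHz clock range is only a consequence of the 10-100x factor, not a figure from the study.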
For the aforementioned reasons, on-chip debug is an important part of FPGA debugging flows; however, on-chip observability is a major challenge. Debugging a faulty complex design with thousands or millions of buried state elements by looking at the input and output trace is almost impossible. Embedded Logic Analyzer (ELA) tools such as Xilinx's Chipscope Pro [28] and Intel's Signal-Tap II [14] can provide limited on-chip visibility; a small portion of hardware signals can be observed by using ELAs (or trace buffers) and only for a limited time window due to limited hardware resources. Thus, the user has to carefully choose which signals to monitor and at which trigger conditions, and then try to deduce the root cause of the bug from the received window of observed signal values, which can be extremely difficult for intermittent bugs. Any change to these signals or conditions requires a lengthy design recompilation. Previous work has proposed techniques to eliminate the need for complete recompilation [12, 13] , but these techniques cannot be used in highly utilized FPGAs [9] and still do not provide simulator-like visibility.
Current debugging flows for FPGAs are better than those of ASICs, but still vastly inferior to software debugging flows. This is because in FPGA debugging the user must choose between running the design with full visibility and a huge (~10,000,000x) slowdown (RTL simulation) or running the design with very limited visibility at full speed (on-chip debug). On the other hand, software can run with full visibility, typically with less than 4x slowdown, by using a debug build. Software debugging is getting even better; recently, software debuggers have appeared that automatically create several checkpoints of a running program. This not only allows examination of how state changes as the program runs forward in time, but also allows going backward to see what caused a bug [21]. In this work, we seek to combine simulation and hardware execution in a seamless way to enable a novel FPGA debugging flow that has the software-like combination of speed and observability, and can create several checkpoints similarly to these recent software debuggers.
In this paper, we present StateMover: an FPGA debugging framework based on checkpointing that allows moving the design state back and forth between an FPGA and simulator. This tool is capable of: 1) safely interrupting a running design by leveraging the techniques in [4] , 2) checkpointing a design and transferring its state from an FPGA to a host computer, 3) loading a saved checkpoint of a design into a simulator and simulating the design starting from this checkpoint, 4) extracting the state of a design from a simulator, 5) writing a design checkpoint back into an FPGA and seamlessly resuming hardware execution. These capabilities are sufficient to realize the aforementioned novel debugging flow. StateMover leverages the speed of hardware execution by allowing the design to run at full speed on hardware until the design reaches a point of interest (a user-defined breakpoint or a detected fault). Then the entire on-chip state of the design is read out and loaded into a simulator, providing not only full observability, but also the ability to simulate the design starting from this checkpoint for further debugging, thereby fast forwarding hours of simulation. By taking periodic design snapshots, the user can load a previous checkpoint into the simulator to trace back the root cause of a bug. In addition, having the on-chip design state loaded on a simulator augments hardware debugging with the ease-of-use of a simulator where the user can navigate through the hierarchy of the design and not only inspect signal values but also modify the value of any signal to perform "what if" tests. The writeback capability of StateMover allows the modified design state to be pushed to the FPGA which enables the implementation of state changes at no cost (i.e. full controllability) and also allows fast forwarding the simulation in uninteresting periods. 
StateMover can also be used to perform debugging-unrelated tasks such as error injection and correction for soft error mitigation and fault recovery in which a fault-free checkpoint is written back.
To our knowledge, this is the first work that can move a design state back and forth between a simulator and an FPGA. It is also the first work to enable in-system debugging with full visibility for complex designs that have multiple clock domains and multi-cycle I/O interfaces. Debugging techniques that offer full visibility require task interruption, which if performed naively can cause data loss and deadlocks in such complex systems [20, 24] . StateMover can achieve safe task interruption for such systems, thereby enabling in-system debugging with full observability and controllability. This paper is organized as follows. Section 2 provides a literature review of FPGA debugging techniques. Section 3 introduces StateMover, and the implementation details are given in Section 4. StateMover is evaluated and tested on various designs to verify its functionality in Section 5. Section 6 concludes the paper.
RELATED WORK
Hardware debugging techniques can be divided into three broad categories: 1) trace-based techniques, 2) scan-based techniques, and 3) readback-based techniques.
Trace-based Techniques
Trace-based techniques insert trace buffers into the design to record the values of on-chip signals. The recording is done during the circuit operation, which means that the design can run at its full speed during debugging. The observed signal values are stored in on-chip memories, and then transferred to a workstation on which the values can be shown in a simulation-style waveform. Due to the limited on-chip memories, only a small subset of hardware signals can be observed and for a limited time. Hence, these approaches require the designer to create appropriate trigger logic that hopefully fires shortly before the error state is reached to capture the relevant state data. Besides limited visibility, trace-based techniques have other disadvantages such as the need for recompilation of the entire design every time the list of observed signals or the trigger conditions are modified, the modification of the design netlist which could hide some timing-related bugs, and the usage of precious on-chip memories. Previous work has tackled these problems using various approaches. The first approach is to intelligently select the most effective signals to observe in generic designs [11], high level synthesis (HLS) designs [8], and application-specific (e.g. machine learning and soft processors) designs [10, 23]. Second, previous work has proposed incrementally inserting the trace buffers and the trigger logic after placing and routing the design, which not only eliminates the need for full design recompilation but also preserves the design mapping [6, 12, 13]. Unfortunately, incremental trace insertion is infeasible in highly utilized (70%-90%) FPGAs [9]. Incrementally inserting trace logic and preserving as much of the previous design implementation as possible reduces compile time by 40% [5], but this still leads to fairly lengthy (minutes to hours) compiles for each trace insertion.
The third approach is to use overlays that can be partially reconfigured to dynamically select the observed signals and change the trigger conditions [7, 16]. This can reduce the compilation time to change the signals traced, but adds restrictions on how many signals can be observed. These trace-based approaches still have less visibility than simulators, which have full observability in space and time and full controllability. Simulators also have more powerful features for watching how the signals evolve over time. Moreover, these approaches require the presence of spare on-chip resources.
Scan-based Techniques
Session: High-Level Abstractions and Tools II FPGA '20, February 23-25, 2020, Seaside, CA, USA
Scan-based techniques add state access hardware, such as scan chains, to the design to provide state visibility. They have significant area overhead to access all the design registers, which can be close to 100% logic overhead for shadow scan chains [19].
In [18], Kim et al. propose DESSERT, an FPGA-accelerated method for RTL simulation. It uses scan chains to provide full visibility of a certain region in the design. The RTL design is first translated by the FIRRTL compiler, which automatically adds scan chains into the design and transforms assert and print statements in the RTL into error-checking and log generator circuits. DESSERT supports error replays by running two hardware instances of the design spaced apart in simulation time. Once an error is detected on the lead instance, a state snapshot is taken from the second instance and replayed in RTL simulation. DESSERT is similar to StateMover in that it can move the state from hardware to a simulator, but since it relies on scan chains, it has a logic overhead of up to 80% and reduces the design frequency by more than 100%, and it completely changes the mapping of the original circuit, which can hide timing-related bugs. In addition, it assumes that the hardware execution is deterministic, which is not the case for complex designs that may have data races. It also does not support writing back the state.
Readback-based Techniques
Readback is an alternative method to provide on-chip visibility. It is a hardware feature available on Xilinx FPGAs and on Intel's Stratix 10 FPGAs. Its main purpose is to verify that the FPGA is configured correctly by reading back the configuration frames, but it can also be used to read out the current values of on-chip registers and memories such as CLB/ALM registers, distributed memories and block RAM contents [26]. It provides complete visibility of most FPGA state elements and does not require any instrumentation or modification of the design. Note that some state elements, such as the registers embedded within Xilinx DSP blocks and the hyperregisters within the Stratix 10 interconnect, cannot be read back as they are not part of the configuration architecture (i.e. cannot be initialized). The main disadvantage of readback is that it requires stopping the design before reading out its state. Previous work has leveraged readback to build debugging frameworks for FPGAs. In [2], Angepat et al. present NIFD, a non-intrusive FPGA debugger, which provides a gdb-like interface that supports single-stepping and breakpoints. NIFD leverages readback to allow users to inspect the value of any state element in the design by typing its name in the console. NIFD adds minimal hardware to the design such as a clock controller to stop/resume the design, and a breakpoint controller. The major drawback of this framework is that it does not provide a way to visualize the readback data. It is almost impossible to track down a bug in a complex design with millions of signals by inspecting the signal values through a console with no signal history. In addition, NIFD supports only the relatively old Xilinx Virtex-II FPGAs. In [17], Khan et al. present gNOSIS, an automated verification tool based on readback which verifies the correctness of hardware execution. The design is first simulated with the VCD (value change dump) option enabled to dump all the simulation output.
Then, the design runs on the FPGA for some interval. After that, gNOSIS performs a readback and compares the values of registers in the FPGA with their expected values in the VCD file. If they match, gNOSIS resumes the design for another interval, and so on. Otherwise, the location and the time of the error are reported. The main problem with this flow is the need for a complete RTL simulation to be performed. Thus, it does not benefit from the speed of hardware execution. Moreover, gNOSIS can only read out the values of on-chip registers; it has no support for distributed or block memories. It supports Xilinx Virtex-5 FPGAs.
In [15], Iskander et al. propose a readback-based low-level debug (LLD) framework. It consists of an on-chip processor which controls the design execution and communicates with the host over a serial port. Like NIFD, the serial console provides a gdb-like interface which allows the user to inspect register values using their design names. The framework also has on-chip condition-based breakpoint logic which allows the design to work at full speed until it reaches this breakpoint. The breakpoint logic is implemented in a separate reconfigurable region so it can be dynamically modified as long as the breakpoint condition is formed from the signals that are already connected to this region. This framework shares the major drawback of NIFD: it shows only signal values on the console, without a time history, which makes it impossible to debug complex designs. Moreover, register value inspection is very slow in LLD, mainly because readback is controlled through the on-chip processor, which initiates a configuration frame readback for each register bit inquiry; a single bit value is retrieved in half a second, making examination of the full design state impractical. LLD supports Xilinx Virtex-5 FPGAs.
In [20], Li et al. propose the AMIDAR debugging framework for debugging software and hardware problems of soft-core processors. The framework provides a user interface through Eclipse which is used for software debugging and also for inspecting the value of on-chip state elements through readback. The framework supports Xilinx 7 series FPGAs. The framework does not freeze the design during readback to avoid data loss from the DRAM controller. However, this introduces some problems. First, since the BRAMs cannot be read out while they are being accessed by the design, AMIDAR has to force the processor not to execute any BRAM accesses during readback. Second, since the design is not frozen, the readback values of different state elements could be captured in different clock cycles (i.e. inconsistent state). Third, modifying the contents of state elements (i.e. writeback) is not possible. Moreover, in recent FPGAs such as Xilinx UltraScale, reading back the state without stopping all the design clocks is not supported [27]. The debugging framework presented in [27] extends the previous readback-based debugging frameworks by supporting applications with multiple asynchronous clocks. This is accomplished by using a clock-stop-restart-controller (CSRC) block which provides deterministic clock stopping and restarting. Once all the clocks of a multi-clock design are stopped, the framework issues a readback, and constructs a waveform of the readback values for visualization. The framework has a Tcl interface, and supports the latest Xilinx UltraScale FPGAs.
These readback-based frameworks do achieve full visibility but still have the following major drawbacks. First, most of these techniques give the user a gdb-like interface to inspect signal values which could be acceptable in small designs in which the user knows what to look for, but for complex hardware designs with thousands of on-chip registers and memories, they will be very difficult to use. Second, readback, unlike trace buffers, offers no signal history, and hence it provides the user with only a single value for each on-chip signal making it harder to debug the root cause of a bug.
To retrieve signal values over time in order to provide the user with a simulation-style waveform similar to what trace-based techniques offer, readback-based frameworks provide clock-stepping circuits which run the design for one cycle and then stop it so that a readback can be performed again, and so on. Besides being extremely slow, and hence losing the speed advantage of hardware execution, the need for stopping the design every cycle leads us to the third major problem of these techniques, which is: can these frameworks be used for debugging modern, complex designs? Stopping a hardware task that has multiple clocks or multi-cycle I/O interfaces at an arbitrary time can cause several hazards such as data loss and deadlocks [4, 24] . For example, stopping a task in the middle of sending/receiving a transaction on a DDR, Ethernet, or PCIe interface is definitely hazardous. Thus, these readback-based frameworks are not suitable for in-system debugging of complex designs involving multiple clocks and multi-cycle I/O interfaces. In contrast, StateMover, which also uses readback for checkpointing on-chip designs, supports in-system debugging of such complex designs and it overcomes all these other problems (no visualization, no signal history, and unsafe interruption) as shown in Section 3.1.
STATEMOVER DEBUGGING FRAMEWORK
In this section, we introduce StateMover, a debugging framework for FPGAs, and explain how to use it.
Overview
StateMover is an FPGA debugging framework that combines simulation and hardware execution in a seamless way. It is based on checkpointing, which is a technique in which a snapshot of the state of a design is taken and stored so that when loaded back, the design can continue from this state. StateMover allows moving the design state back and forth between a simulator and an FPGA, which we believe enables a novel debugging flow that has a software-like combination of observability and speed. Users are no longer required to sacrifice speed for full visibility or sacrifice on-chip resources to increase visibility.
StateMover has the following features. First, it can safely interrupt a design by bringing it into a state such that when loaded again, the design can continue from exactly where it left off, without any data loss or deadlocks. This includes designs that have multiple clocks and multi-cycle I/O interfaces. Second, StateMover can read out the entire on-chip state of a design and load it into a simulator in which the user can inspect all the signal values and simulate the design starting from this state. Being able to simulate the design starting from an on-chip state shows how the design behaves and is more powerful than readback on its own, which only retrieves a single value for each signal; the simulation provides the user with a waveform that is similar to the signal history waveform provided by trace-based debugging techniques but with full visibility and at no hardware cost. Simulating the design to construct the waveform is much faster than the traditional method used by previous readback-based techniques: single-stepping the on-chip design and initiating a readback multiple times. Third, StateMover can extract the design state from a simulator and write it back to an FPGA, which not only speeds up the simulation by fast forwarding uninteresting periods, 
StateMover Flow
The tool flow of StateMover is shown in Figure 1. First, the designer has to add some interruption logic (IL) modules that we provide to allow safe interruption of the design under test, which we refer to as the task. These IL modules, which are discussed in Section 4.2, are added outside the task, have a small area and do not affect the operating frequency of the design. We also provide two Verilog procedures (ILC) that are added into the designer's test-bench to control the IL during simulation. Next, the design files are passed through Xilinx's Vivado implementation flow. After the implementation step is complete, an additional script (SM Extract) we wrote as part of StateMover is run to extract additional information from the design and generate the bitstream files. The design is now ready to run in simulation and in hardware. The designer starts a simulation augmented with StateMover's CSR-SIM, and opens the StateMover (SM) console, which programs and controls the FPGA. The simulator (augmented with CSR-SIM) and the SM console together form the user interface of StateMover. They provide commands that allow the design state to be extracted from the simulator and from the hardware, and similarly loaded into the hardware and simulator. The designer can run the simulation to a certain point and then continue the execution on hardware. This is performed by triggering the dump signal in the simulator using the force command at that point, and then executing the writeback command on the SM console. The designer can also stop the hardware execution at a certain point, then move the on-chip state of the task to the simulator, and continue the execution on the simulator with full visibility. This is performed by first using the set_breakpoint command in the SM console to stop the design.
After the design is stopped, the designer executes the readback command in the console to retrieve the state from the FPGA, and then triggers the load signal in the simulator to load the extracted state.
IMPLEMENTATION DETAILS
In this section, we describe the implementation details of StateMover. As shown in Figure 2, StateMover is divided into three major components: the user interface, the interruption logic and the backend. The user interface consists of a simulator instance to simulate the design and the StateMover console to control the hardware execution. The interruption logic consists of a task interruption (TI) controller and breakpoint (BP) logic. The backend provides the main functionality of StateMover, and consists of CSR-SIM and StateConfig.
User Interface
The user interface of StateMover is designed to be simple and to allow simulation and hardware execution to run independently.
The first part of StateMover's interface is a normal simulator instance. We are using ModelSim for simulation; however, StateMover is compatible with most commercial HDL simulators. The designer starts a ModelSim simulation that is linked with StateMover's CSR-SIM by passing the CSR-SIM object using the pli command line option. The task that the designer wants to debug using StateMover has to have been implemented through Vivado, and the implemented netlist, which is generated by the SM Extract script, is used for simulation. We simulate a post-implementation netlist so that the hardware and simulation state elements match; this allows register values, distributed memory contents, and on-chip memory contents to be loaded into the simulator or written back to the device properly, no matter what optimizations were made by Vivado. Note that the task does not have to occupy the entire design; StateMover supports debugging of reconfigurable modules inside a larger design.
The designer can dump the simulation state or load the hardware state into the simulator by triggering the dump and load signals, respectively. These signals are defined in the ILC procedures that are added to the designer's test-bench in Figure 1 . The first procedure defines the dump signal and waits for its assertion, and when this signal is asserted, the procedure interrupts the task using the interruption logic and waits until the task is safely stopped (i.e. the task interruption request is granted). It then triggers the dump_sim_state function in CSR-SIM which extracts the state of this task from the simulator and dumps it into the sim_state file. The second procedure waits for the assertion of the load signal, and then stops the task and triggers the load_hw_state function in CSR-SIM. This function loads the on-chip state of the task, which is saved in the hw_state file, into the simulator. In summary, to load a saved checkpoint (hw_state), all the designer needs to do is to trigger the load signal by using the force command in ModelSim. Then, the user can inspect the entire on-chip state of the task in the simulator, run the simulation from this checkpoint, and modify any signal value. To dump the current state of the task so that it can be written back to the FPGA, the user needs to trigger the dump signal. This can again be performed with the force command, or triggered by any other logic the designer desires.
The second part of the user interface is the SM console, which runs on top of Vivado's Tcl console. The designer can execute any Tcl command supported by Vivado in addition to the commands that StateMover provides. The SM console is used for programming the FPGA, controlling the hardware execution, and performing a readback or writeback. After invoking the SM console, the designer sets some variables related to the files needed by StateMover: 1) the bitstream file of the implemented design, 2) the partial bitstream file of the task if partial reconfiguration is used, and 3) the logic and RAM location (LL and RL) files (explained in Section 4.4). All these files are generated by the SM Extract script. The SM console provides commands for interrupting (stop) and resuming the design (run), and setting a hardware breakpoint (set_breakpoint). These commands interface with the interruption logic on the FPGA through JTAG. Most importantly, the SM console provides the readback and writeback commands. The readback command makes sure that the task is stopped, then performs a readback to read out the configuration frames that contain the state of the task, and then calls the extract_hw_state function in StateConfig, which extracts the state out of the readback file and dumps it into the hw_state file. The writeback command calls the embed_sim_state function in StateConfig, which reads the checkpoint inside the sim_state file dumped by the simulator and embeds it inside the configuration frames located in the bitstream file. A partial/full reconfiguration is then performed to write this checkpoint back to the FPGA.
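The core of the readback and writeback commands is a mapping between named state elements and bit positions in configuration data. The Python sketch below illustrates that idea only. It assumes a flat byte buffer and a hypothetical name-to-bit-offset table; the real LL and RL file formats and the frame layout handled by StateConfig are not modeled, and `extract_state` and `embed_state` are illustrative names, not StateConfig functions.

```python
def extract_state(frames, locations):
    """Read each named element out of a configuration buffer.

    frames    -- bytes-like buffer standing in for readback data (assumption)
    locations -- {name: [bit offsets]}, offsets[0] being bit 0 of the element
    """
    state = {}
    for name, offsets in locations.items():
        value = 0
        for i, off in enumerate(offsets):
            bit = (frames[off // 8] >> (off % 8)) & 1  # LSB-first within a byte (arbitrary choice)
            value |= bit << i
        state[name] = value
    return state


def embed_state(frames, locations, state):
    """Return a copy of the buffer with the given element values written in."""
    out = bytearray(frames)
    for name, value in state.items():
        for i, off in enumerate(locations[name]):
            mask = 1 << (off % 8)
            if (value >> i) & 1:
                out[off // 8] |= mask
            else:
                out[off // 8] &= ~mask & 0xFF
    return out


# Hypothetical location table: a 3-bit register scattered across the buffer and a 1-bit flag.
locations = {"task/state_reg": [3, 17, 40], "task/flag": [9]}
blank = bytearray(8)

written = embed_state(blank, locations, {"task/state_reg": 0b101, "task/flag": 1})
recovered = extract_state(written, locations)
print(recovered)
```

A round trip through both functions returns the values that were embedded, which is the property a checkpoint needs when it moves from the simulator to the device and back.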
Interruption Logic
StateMover adds minimal hardware to the design. The main functionality of the interruption logic is stopping the task so that a readback or a writeback can be performed properly. The interaction between the added logic and the Tcl console is performed through Xilinx virtual I/Os (VIOs) [30] over JTAG.
The breakpoint logic controls when the design is going to be interrupted. The design is run normally at full speed until a breakpoint is reached. Depending on the use case, various breakpoint logic can be implemented. Since StateMover supports partial reconfiguration, the breakpoint logic can also be implemented in a separate reconfigurable region so that it can be dynamically reconfigured [15] . In the current setup, we use a counter that starts counting after the design is reset, and the value of the counter is compared to the value set by the user from the Tcl console using the set_breakpoint command. If they match, the breakpoint logic interrupts the design through the TI controller. This provides a simple but effective way to checkpoint the design at any point of interest.
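The counter-based breakpoint described above can be modeled in a few lines. The following Python sketch is a behavioral illustration only, not the hardware implementation; the class and method names are hypothetical, with `set_breakpoint` chosen to mirror the console command.

```python
class CounterBreakpoint:
    """Toy cycle-level model of a counter-based breakpoint."""

    def __init__(self):
        self.count = 0      # cycles elapsed since the last reset
        self.target = None  # cycle count set by the user; None means disabled

    def reset(self):
        self.count = 0

    def set_breakpoint(self, cycles):
        self.target = cycles

    def tick(self):
        """Advance one clock cycle; return True when an interrupt should be requested."""
        self.count += 1
        return self.target is not None and self.count == self.target


def run_until_break(bp, max_cycles):
    """Run at 'full speed' until the breakpoint fires or max_cycles elapse."""
    for _ in range(max_cycles):
        if bp.tick():
            return bp.count
    return None


bp = CounterBreakpoint()
bp.set_breakpoint(1000)
stopped_at = run_until_break(bp, 10_000)
print(stopped_at)  # 1000
```

In the real design the match would raise a TI request rather than return a value, and the design would keep running until the TI controller grants it.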
The TI controller takes a TI request as an input, and after it safely stops the task, it grants that request. In simple designs that have one clock domain or do not have multi-cycle I/O interfaces, task interruption is performed by deasserting the clk_en of the BUFG that drives the clock to the design. This is the method used in most prior readback-based debugging frameworks for stopping the design. However, StateMover supports debugging of complex designs that have multiple clocks and multi-cycle interfaces, and stopping these designs at an arbitrary point can cause several hazards such as data loss and deadlocks. StateMover also supports debugging a specific task in a complex system, so we have to guarantee that other parts of the system are not affected by stopping that task. To achieve this, task interruption is performed according to the techniques in [4], which enable safe task interruption, and hence safe checkpointing. The work in [4] proposes TI wrappers which are controlled by the TI controller, and together the controller and wrappers provide an implementation for a set of design rules that should be followed to achieve safe task interruption. TI wrappers are placed on the input and output interfaces of a task, and when a TI request is asserted, the TI controller sends a stop request to all the TI wrappers. Once a TI wrapper receives the stop request, it prevents any new transactions from being issued, and waits for the in-flight transactions to complete. This ensures that the task state is confined inside the task borders (i.e. there is no essential state in the I/O controllers or other interfaces to the task). TI wrappers support the industry-standard interfaces used by Xilinx and Intel: AXI/Avalon memory-mapped, and AXI/Avalon streaming interfaces. They add no timing overhead, and very small area overhead.
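The stop-and-drain handshake between the TI controller and the TI wrappers can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden simplification: it tracks only a count of outstanding transactions per interface, and all class and method names are hypothetical rather than taken from [4].

```python
class TIWrapper:
    """Toy model of a wrapper on one task interface."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0     # transactions issued but not yet completed
        self.stopping = False  # set once a stop request has been received

    def try_issue(self):
        """Attempt to start a new transaction; refused after a stop request."""
        if self.stopping:
            return False
        self.in_flight += 1
        return True

    def complete(self):
        """One outstanding transaction finishes."""
        if self.in_flight > 0:
            self.in_flight -= 1

    def request_stop(self):
        self.stopping = True

    def is_quiescent(self):
        return self.stopping and self.in_flight == 0


class TIController:
    """Broadcasts the stop request and grants it once every wrapper has drained."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def request_interrupt(self):
        for w in self.wrappers:
            w.request_stop()

    def granted(self):
        return all(w.is_quiescent() for w in self.wrappers)


mem, stream = TIWrapper("mem"), TIWrapper("stream")
ctrl = TIController([mem, stream])

mem.try_issue()
mem.try_issue()                      # two memory transactions are in flight
ctrl.request_interrupt()
blocked = not mem.try_issue()        # new transactions are now refused
granted_early = ctrl.granted()       # still waiting for the in-flight ones
mem.complete()
mem.complete()
granted_late = ctrl.granted()        # all interfaces drained: safe to checkpoint
print(blocked, granted_early, granted_late)
```

The grant is deliberately delayed until no transaction is outstanding, which is what keeps essential state from being stranded outside the task when the checkpoint is taken.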
CSR-SIM
CSR-SIM is a Context Saving and Restoring SIMulator, that can read, write, and modify the entire state (context) of a specific task in a design during simulation. It is written in C++ and uses PLI/VPI [25] to interface with the HDL simulation of the design.
CSR-SIM takes as input 1) the name of the HDL module (i.e. task) for which the user wants to dump (save) and load (restore) state, and 2) the name of the signals that trigger the dump_sim_state and load_hw_state functions. At the beginning of the HDL simulation, CSR-SIM traverses the design simulation model created by the simulator and creates a list of all the state elements, including registers and memories, inside the specified module and all its sub-modules recursively. When the dump_sim_state function is triggered, it retrieves the values of those state elements from the simulation, and dumps the name and the value of each state element into the sim_state file in a readable format. The load_hw_state function reads the hw_state file which contains state element names along with their on-chip values, and it then searches in the state element list for those state elements and overwrites their values in the simulation. The hw_state and sim_state (checkpoint) files have the same format, which allows comparison of the on-chip and simulation design state at a specific point using any diff tool. A snippet of the checkpoint file of one of the test designs is shown in Listing 1, which shows the value of a register, a LUTRAM, and a shift register LUT (SRL).
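Because both checkpoint files share one format, a discrepancy check reduces to comparing two name-to-value maps. The sketch below assumes one `name value` pair per line, which is a hypothetical layout chosen for illustration (the actual layout of Listing 1 may differ); the element names and values are made up.

```python
def parse_checkpoint(text):
    """Parse 'name value' lines into a dict; blank lines are skipped."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, value = line.rsplit(None, 1)  # value is the last whitespace-separated field
        state[name] = value
    return state


def diff_checkpoints(hw_state, sim_state):
    """Return {name: (hw_value, sim_value)} for elements that differ or are missing."""
    names = set(hw_state) | set(sim_state)
    return {
        n: (hw_state.get(n), sim_state.get(n))
        for n in sorted(names)
        if hw_state.get(n) != sim_state.get(n)
    }


# Made-up checkpoint contents for illustration.
HW = """
top/cnt_reg[0] 1
top/fifo/mem 00ff
"""
SIM = """
top/cnt_reg[0] 1
top/fifo/mem 00fe
"""
mismatches = diff_checkpoints(parse_checkpoint(HW), parse_checkpoint(SIM))
```

Here `mismatches` contains only `top/fifo/mem`, the one element whose on-chip and simulated values disagree.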
To seamlessly move state between the simulator and hardware, we must match each state element in the simulation model with the corresponding hardware element; as a large design contains millions of state elements, this must be done automatically. Algorithm 1 shows the overall method. First, CSR-SIM traverses the simulation netlist by querying ModelSim through PLI/VPI in order to find all the registers and memories. However, this actually creates a superset of state elements because some of these registers and memories do not map to physical FFs or memories, even when an implemented netlist is used. Thus, we perform some extra filtering to precisely identify physical state elements. CSR-SIM has a small database of Xilinx's UltraScale register primitives (e.g. FDRE, FDSE) and memory primitives (e.g. RAM32M, SRL16E, RAMB36E) [29]; while we created this list for UltraScale, it will mostly be compatible with other Xilinx families as well, so support for them could be added with minor updates.

Session: High-Level Abstractions and Tools II FPGA '20, February 23-25, 2020, Seaside, CA, USA

Algorithm 1 Create State Element List
 1: let task_name: The name of the top-level module of the task
 2: let reg_prim: Arch block types that are mapped to registers
 3: let mem_prim: Arch block types that are mapped to memories
 4: state_elements ← ∅ // Map of (name, simulation node) pairs
 5: m ← get_module_by_name(task_name)
 6: while m != null do // traverse down the hierarchy
 7:   while (reg ← next_register(m)) != null do
 8:     if type(m) ∈ reg_prim then
 9:       if is_state_holder(reg, type(m)) then
10:         port ← get_output_port(m)
11:         signal ← get_signal_connected(port)
12:         state_elements.insert(name(signal), reg)
13:       end if
14:     else if type(m) ∈ mem_prim then
15:       if is_state_holder(reg, type(m)) then
16:         state_elements.insert(name(reg), reg)
Before inserting a state element into the state element list (state_elements), CSR-SIM checks if the parent module of this state element is a register primitive (line 8), a distributed memory (LUTRAM or SRL) primitive or a block RAM primitive (line 14). The algorithm also checks if this state element is the actual register/memory that holds the state inside that primitive (line 9, 15). For example, in the simulation model of some Xilinx primitives, some parameters that are only used for simulation are defined with the reg data type; we exclude these elements. Finally, to match state element names between simulation and hardware, CSR-SIM employs the naming rules used by Vivado as it creates the state element list. The naming depends on the primitive type. For example, the FF name in Xilinx's logic location file, which is used to get the location of FFs in the configuration frames, is actually the name of the net that is connected to the output of the register primitive, rather than the name of the primitive instance. These names are stored in the state element list, and are used for dumping and searching for state elements, and hence allowing a complete hardware state to be loaded into the simulator or written back to the device properly.
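The filtering and naming rules above can be sketched in Python over a toy module hierarchy. This is an illustration of the idea rather than CSR-SIM itself, which is written in C++ against PLI/VPI; the `Module` and `Reg` classes, the `holds_state` flag, and the `module/reg` naming used for memories are assumptions made for this example.

```python
from dataclasses import dataclass, field

REG_PRIM = {"FDRE", "FDSE"}                 # register primitives named in the text
MEM_PRIM = {"RAM32M", "SRL16E", "RAMB36E"}  # memory primitives named in the text


@dataclass
class Reg:
    name: str
    holds_state: bool = True  # False for simulation-only 'reg' variables


@dataclass
class Module:
    name: str
    type: str
    regs: list = field(default_factory=list)
    output_net: str = ""      # net on the primitive's output port, if any
    children: list = field(default_factory=list)


def collect_state_elements(task):
    """Return {checkpoint name: Reg} for physical state elements under task."""
    state_elements = {}
    stack = [task]
    while stack:                          # walk down the hierarchy
        m = stack.pop()
        for reg in m.regs:
            if not reg.holds_state:
                continue                  # skip simulation-only variables
            if m.type in REG_PRIM:
                # FFs are keyed by the net connected to the primitive's output
                state_elements[m.output_net] = reg
            elif m.type in MEM_PRIM:
                state_elements[f"{m.name}/{reg.name}"] = reg
        stack.extend(m.children)
    return state_elements


ff = Module("top/cnt_reg[0]", "FDRE",
            [Reg("Q_out"), Reg("notifier", holds_state=False)],
            output_net="top/cnt[0]")
srl = Module("top/dly_srl", "SRL16E", [Reg("data")])
glue = Module("top/u_glue", "user_logic", [Reg("tmp")], children=[ff])
top = Module("top", "user_logic", children=[glue, srl])
elements = collect_state_elements(top)
```

In this toy hierarchy the `tmp` variable in user logic and the simulation-only `notifier` are both excluded, the flip-flop is recorded under its output net name, and the SRL is recorded under its own name.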
StateConfig
StateConfig is a tool written in Python which is responsible for extracting the design state from the configuration frames that are read back from the FPGA, and embedding the design state extracted from the simulator into the bitstream. Figure 3 shows the design state extraction flow. First, the designer invokes a readback through JTAG using the readback command in the SM console. When a readback of the entire FPGA is performed, a readback file is generated which contains all device configuration frames. The SM console then invokes StateConfig's extract_hw_state function with the following arguments: 1) the readback file, 2) the logic location (LL) file and 3) the RAM location (RL) file. StateConfig then parses the LL file and the RL file. The LL file, which is generated by Vivado during the write bitstream phase, contains information about the location of state element bits in the configuration frames. It also contains the names of the design nets associated with these state elements. Since the net name is shown only for CLB registers, the location of memories cannot be looked up by their name. Thus, we use our SM Extract script to extract the placement information of distributed memories (LUTRAMs and SRLs) and block RAMs from an implemented design, and dump this information into an RL file. By using the RL file along with the LL file, the value of any state element can be extracted from the readback file. After parsing the LL and RL files, StateConfig extracts the value of each state element from the readback file, and then dumps the state element name along with the extracted value in the hw_state file. This is performed for all the register names found inside the logic location file, and for all memory names found inside the RAM location file. Writing the value of distributed memories in the hw_state file requires an additional step which we refer to as physical to logical (P2L) RAM mapping.
P2L RAM mapping reconstructs the logical representation of the distributed memories' state from the relevant physical bits read out from the FPGA. This is complicated by the fact that the simulator (logical) view of the design primitives does not perfectly match the hardware. The extracted physical bits from the readback file represent the memory content of 6-LUTs but these bits can be used in many ways, so we need to map these bits to the same format expected by the distributed memory primitives used in simulation. For example, a LUTRAM primitive can describe multiple physical LUTs (CLB-wide) for wide memories, and two SRLs can be packed inside the same physical 6-LUT. Thus, we had to analyze how each type of LUTRAM or SRL is mapped to the FPGA and how its value is stored in the primitive model. Then, depending on the primitive type, we reconstruct the distributed memory state and dump it to the hw_state file. For example, to reconstruct the 16-bit value of an SRL, we first get the 32-bit value of the 5-LUT, which implements that SRL, out of the 64 physical bits representing the entire 6-LUT, and then discard every second bit of that 5-LUT, as the SRL state is stored only in odd bits. Another challenge for reading back distributed memories is that the LUT content cannot be read out in some cases. When partial reconfiguration is enabled in Vivado, we found that the LUT content is no longer readable when configured for dynamic operation (i.e. used as memory); presumably this is because Vivado assumes readback is being used to validate (CRC check) a programmed bitstream, and hence masks out the (changing) LUTRAM bits. By using some bitstream hacking techniques, we determined that when partial reconfiguration is enabled, specific bits in each configuration frame are set to mask the LUT memory bits if the LUT is configured for dynamic operation.
Thus, if partial reconfiguration is enabled, StateConfig is used to enable distributed memory readback by manipulating the bitstream before programming the FPGA to turn the masking bits off in the frames that are associated with the distributed memories we would like to read back. This is shown as the (optional) RAM Readback Enable step in Figure 3 ; it is only required when the partial reconfiguration compile flow is used.
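The SRL reconstruction described above (select the 32-bit 5-LUT out of the 64 LUT bits, then keep only the odd bits) can be sketched as follows, together with its inverse for the writeback direction. The bit ordering of the list and the `upper_half` flag that selects the 5-LUT are assumptions made for illustration; the real mapping depends on the primitive type and its placement.

```python
def p2l_srl16(lut6_bits, upper_half=False):
    """Recover a 16-bit SRL value from the 64 bits of a 6-LUT.

    lut6_bits: sequence of 64 ints (0/1), index i = LUT bit i (assumed ordering).
    upper_half: which 5-LUT implements the SRL (assumed encoding).
    """
    if len(lut6_bits) != 64:
        raise ValueError("expected 64 LUT bits")
    start = 32 if upper_half else 0
    lut5 = lut6_bits[start:start + 32]
    return list(lut5[1::2])              # keep odd positions, drop even ones


def l2p_srl16(srl_bits, lut6_bits, upper_half=False):
    """Inverse mapping: write 16 SRL bits into the odd positions of one 5-LUT."""
    if len(srl_bits) != 16:
        raise ValueError("expected 16 SRL bits")
    out = list(lut6_bits)                # leave all other LUT bits untouched
    start = 32 if upper_half else 0
    out[start + 1:start + 32:2] = srl_bits
    return out


srl = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
physical = l2p_srl16(srl, [0] * 64, upper_half=True)
recovered = p2l_srl16(physical, upper_half=True)
```

A round trip through both functions returns the original 16 bits, which is the property the P2L and L2P mappings must satisfy for state to survive a move in either direction.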

Embedding the State.
When the designer invokes the writeback command in the SM console, StateConfig's embed_sim_state function is called to embed the simulator state in the configuration frames of the bitstream file so it can be moved to hardware. This bitstream manipulation requires understanding of the bitstream file and the underlying configuration architecture. StateConfig supports embedding the design state into both full and partial bitstreams, in either binary (BIT files) or ASCII (RBT files) format. Figure 4 shows the design state embedding flow. First, StateConfig reads the sim_state file to extract state element names and their values. It then uses the logic location and RAM location files to find each state element's location in the bitstream. The location information consists of three parts: the bit offset, which is an offset calculated from the start of the first configuration frame; the frame address; and the frame offset, which is an offset calculated from the start of this frame. For embedding the state in a full bitstream, StateConfig skips the bitstream header and the configuration commands, and then uses the bit offset to jump to the location of the state element. Then, it performs bit manipulation to overwrite the initial value of the state element with the value written in the sim_state file. For distributed memories, a logical to physical (L2P) RAM mapping is performed, which is exactly the inverse of the P2L RAM mapping, to reconstruct the memory content of the 6-LUT. Embedding the state in a partial bitstream, which is necessary if we wish to write back the task state without affecting other parts of the system, is more complex and requires some reverse engineering. In a partial bitstream, we cannot use the bit offset to jump to the appropriate configuration bit, as the bit offset is valid only for a full bitstream, not a partial one where only the relevant configuration frames are included.
Instead, we must compute the bitstream location of the state elements from the frame address and frame offset. The configuration frame addresses are not continuous, and there is no documentation for how the frame address is incremented, and hence, the frame address cannot be directly used. Instead, we convert the discontinuous frame address to another continuous form, which we refer to as frame index. This conversion is device-dependent because the incrementation of the frame address depends on the number of columns and rows in the device, the number of frames allocated for each resource (e.g. CLB column, DSP column, ...), and the number of hard blocks and their distribution inside the device.
StateConfig implements a function that performs the conversion from a frame address to a frame index on-the-fly by reading a device-specific database which we refer to as the frame count database. To know the location of a certain state element in the partial bitstream, StateConfig uses this function to convert the frame address of the first configuration frame in the partial bitstream, and the frame address of that state element, into frame indices (FI). We then use Equation 1 to find the state element (SE) location in the partial bitstream and then modify its state as shown in Figure 4. To create the frame count database for a specific FPGA, we implemented a function that takes the number of rows and columns in that FPGA, and an LL file of a design implemented on that FPGA that has at least one register in each CLB column (e.g. a huge shift register). It then uses the information in the LL file, together with some reverse-engineered information about the number of frames allocated for each resource, which is published in [22], to create the frame count database, which contains the number of frames in each column. It supports any FPGA from Xilinx's latest UltraScale family and could be extended to other families using the same procedure.
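The sketch below illustrates one plausible way to turn a frame address into a contiguous frame index and then into a bit position inside a partial bitstream. It is not the paper's Equation 1, which is not reproduced here: the `(row, column, minor)` address tuple, the per-column frame count list, the frame size constant, and the MSB-first bit numbering are all assumptions for illustration, and a real bitstream also contains a header and configuration commands that must be skipped first.

```python
FRAME_WORDS = 123            # 32-bit words per frame; assumed, check the device's configuration guide
FRAME_BITS = FRAME_WORDS * 32


def frame_index(addr, frames_per_column):
    """Map a (row, column, minor) frame address to a contiguous frame index.

    frames_per_column: hypothetical frame count database, one entry per column,
    assumed identical for every row.
    """
    row, column, minor = addr
    if minor >= frames_per_column[column]:
        raise ValueError("minor address outside column")
    frames_per_row = sum(frames_per_column)
    return row * frames_per_row + sum(frames_per_column[:column]) + minor


def se_bit_position(se_addr, se_frame_offset, first_addr, frames_per_column):
    """Bit position of a state element relative to the first frame of a partial bitstream."""
    fi_se = frame_index(se_addr, frames_per_column)
    fi_first = frame_index(first_addr, frames_per_column)
    return (fi_se - fi_first) * FRAME_BITS + se_frame_offset


def set_bit(frames, bit_pos, value):
    """Overwrite one bit in a bytearray of frame data (MSB-first within a byte, assumed)."""
    byte, bit = divmod(bit_pos, 8)
    mask = 0x80 >> bit
    frames[byte] = (frames[byte] | mask) if value else (frames[byte] & ~mask)


columns = [12, 4, 12]        # made-up frame counts for three columns
pos = se_bit_position((0, 2, 3), 100, first_addr=(0, 1, 0), frames_per_column=columns)
frame_data = bytearray((pos // 8) + 1)
set_bit(frame_data, pos, 1)
```

The key point is that the difference of two frame indices, unlike the difference of two raw frame addresses, counts the frames that actually lie between the start of the partial bitstream and the target state element.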
EVALUATION RESULTS
In this section, we evaluate StateMover and test it on various designs to verify its functionality. We use the Xilinx Kintex UltraScale KCU105 board for all our tests; however, StateMover supports any UltraScale device. We first use a simple counter design to show StateMover in action. The counter increments every second, by using another counter as a clock divider, and is connected to the GPIO LEDs of the FPGA board. The design is simulated in ModelSim, and after the counter has incremented three times, which took 12 minutes in simulation, we trigger the dump signal in ModelSim and invoke the writeback command in the SM console. This moves the design state to the FPGA, thereby updating the LEDs to the new count. We set a breakpoint that fires after one minute using the SM console, and after the design is stopped, we invoke the readback command and trigger the load signal in ModelSim. This seamlessly moves the on-chip state to the simulator as displayed in ModelSim's waveform shown in Figure 5(a), thereby fast forwarding the simulation by four hours on this simple design. Note that whenever the load or dump signal is triggered, the interruption logic (IL) controlled by the ILC procedures stops the clock feeding the counter (the task) so that the state can be loaded or dumped properly, and then enables the clock again as shown in Figure 5(b).
Test Designs
Figure 6: AES System

To verify the functionality of StateMover, which is moving the entire design state back and forth between an FPGA and a simulator, we created several test designs: a crossbar, a FIR filter, and a sorting network [32]. We chose these designs because they have different design elements, which shows that StateMover can be used on any design: the crossbar design uses logic only, the FIR filter uses 64 (unpipelined) hard DSP blocks, and the sorting network uses memories. To verify StateMover functionality, we wrapped these designs with a built-in-self-test (BIST) structure that consists of a linear feedback shift register (LFSR) and a multiple input signature register (MISR), similar to [1]. The LFSR feeds the circuits with pseudo-random inputs, and the output of the design is fed to the MISR. For each design, we first run the design on the FPGA to generate the golden signature. Then in the second run, we begin running in hardware, interrupt the design and move its state to the simulator, and simulate it to complete generation of the test signature. We also test the other way around, where we generate the golden signature and begin test signature generation in the simulator, interrupt the design in the simulator, and then continue the execution on the FPGA to complete the test signature generation. The test signature matches the golden signature for all the designs, which means that StateMover is able to move the entire design state between the simulator and the FPGA without state corruption, deadlock or other issues.
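The BIST-based check can be mimicked in software to show why a matching signature implies that the complete state was transferred. In the sketch below, an accumulator stands in for the design under test, and copying a dictionary stands in for the readback and load steps; the polynomial, seed, and cycle counts are arbitrary choices for illustration, not those of the test designs.

```python
import copy

MASK = 0xFFFF


def lfsr_next(s):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = ((s >> 0) ^ (s >> 2) ^ (s >> 3) ^ (s >> 5)) & 1
    return (s >> 1) | (bit << 15)


def misr_next(sig, data):
    """Fold a 16-bit data word into the signature register."""
    return lfsr_next(sig) ^ (data & MASK)


class BistDesign:
    def __init__(self):
        self.state = {"lfsr": 0xACE1, "acc": 0, "misr": 0}

    def step(self):
        s = self.state
        s["acc"] = (s["acc"] + s["lfsr"]) & MASK  # stand-in for the design under test
        s["misr"] = misr_next(s["misr"], s["acc"])
        s["lfsr"] = lfsr_next(s["lfsr"])

    def run(self, cycles):
        for _ in range(cycles):
            self.step()
        return self.state["misr"]


golden = BistDesign().run(1000)            # uninterrupted run

hw = BistDesign()
hw.run(400)                                # first part runs in "hardware"
checkpoint = copy.deepcopy(hw.state)       # stands in for readback + hw_state
sim = BistDesign()
sim.state = checkpoint                     # stands in for load_hw_state
test_signature = sim.run(600)              # remainder runs in the "simulator"
```

The resumed run reproduces the golden signature only because every state element (LFSR, design state, and MISR) is part of the checkpoint; omitting any of them would make the two runs diverge.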
To show that StateMover is able to debug a task in a larger system, we build the system shown in Figure 6. The system consists of an AES core, which is the task we want to debug, interfacing with a memory controller through an AXI interface. The AXI interface is shared with another task (AES Check) that also interfaces with the memory controller. The AES core reads the contents of the memory, encrypts them and writes them back. The AES Check core reads the encrypted memory contents written by the AES core, compares them to the expected data, and outputs the number of matches. StateMover interrupts only the AES core, keeping the other parts of the system working, and moves its state to the simulator. This shows the importance of the TI wrappers, which are placed on the AXI interface of the AES core. Without the TI wrapper, if we just stop the clock of the AES core at an arbitrary time, several problems can (and do) occur. For example, when the AES core is stopped in the middle of a transaction, the AXI interconnect is locked, and hence the AES Check core cannot run. Moreover, the AES core cannot resume properly in the simulator because of the incomplete transaction; some relevant state is in the AXI interconnect and the memory controller, and it is not transferred to the simulator. The TI wrapper ensures that the core state is confined inside the task borders by stopping the core from issuing any new transactions once the interrupt signal is raised, while allowing in-flight ones to complete. Thus, the AES core can be safely interrupted and its execution can be resumed in the simulator properly. We ran this test for various interruption points and all checks passed. The area of these designs, in terms of the number of LUTs (including LUTs used as distributed memories, i.e. LUTRAMs and SRLs) and registers, is reported in Table 1.
The interruption logic added to the BIST designs has a small fixed area (232 LUTs and 590 REGs), and most (> 90%) of this area actually comes from the Xilinx VIOs that we are using to control the interruption logic from the SM console. Since the interruption logic only controls the clock of the BIST designs, there is no timing overhead. In the AES system design, the added TI wrapper also has a small area (26 LUTs and 20 REGs), and it does not affect the operating frequency. The area overhead of the added logic for all the designs is shown in Table 1 .
One of the features of StateMover is creating checkpoints of a design, which can be loaded into a simulator, resumed on hardware, or used for simulation/hardware discrepancy detection. The sizes of checkpoint files (hw_state and sim_state) of the test designs are reported in Table 1 . The checkpoint size is small when compared to the 132 MB size of the readback file.
Speed and Scalability
We tested the speed of moving the design state back and forth between the simulator and the FPGA for all the test designs and for a larger version of one of the designs to show the scalability of our proposed flow.
To retrieve the design state from the FPGA, a configuration readback is performed; a readback of the entire FPGA of the Kintex UltraScale KCU105 board, which is a large FPGA that contains 530K logic cells, takes 10 seconds. A partial readback, in which only the relevant configuration frames are read, can also be performed, and will reduce this time significantly as shown in Table 1 . The extraction of the design state out of the readback file takes 1-4 seconds, depending on the design size. To put the design state back to the FPGA, StateConfig first embeds the state into the full/partial bitstream, which takes 0.5 to 5.5 seconds, depending on the design size. The FPGA is then reconfigured with the bitstream. A full bitstream takes 9 seconds, while a partial bitstream can be written back in one second or less. The time to extract and embed the state can be further improved by switching to C++ instead of Python in StateConfig. CSR-SIM, which is written in C++, loads and dumps the state from the simulator in a tenth of a second even for the largest design. The time of readback and writeback can also be improved by using a high bandwidth configuration port instead of the 15 MHz JTAG interface. However, the current setup is already fast enough to allow both interactive debugging and regular checkpointing of a design in case later debugging is needed. As the last column of Table 1 shows, hardware execution has enormous speedups vs. simulation; from 10,000x to 11,000,000x with the speedup increasing with design size. We calculated this speedup by measuring the functional simulation time of a certain number of clock cycles for each design on an Intel Xeon (E5-1620v3, 3.5GHz) CPU running the full version of ModelSim. 
It takes ModelSim ~127 days to simulate one second of hardware execution for the crossbar-big design, highlighting the productivity gains possible by using StateMover to run a design in hardware and move to simulation only when full visibility and controllability are necessary for debugging.
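The quoted figures can be cross-checked with simple arithmetic. The sketch below assumes that the 11,000,000x upper end of the reported speedup range corresponds to the crossbar-big design, and it reuses the counter example's numbers (three hardware seconds simulated in 12 minutes, one minute of hardware execution).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(hw_seconds, speedup):
    """Wall-clock days a simulator needs to cover hw_seconds of hardware execution."""
    return hw_seconds * speedup / SECONDS_PER_DAY


# One second of hardware execution at an 11,000,000x speedup.
crossbar_big_days = simulation_days(1, 11_000_000)

# Counter demo: 12 simulated minutes per 3 hardware seconds, so 60 hardware
# seconds cost 60 * (12 / 3) minutes of simulation.
counter_hours = 60 * (12 / 3) / 60
```

Both results agree with the text: roughly 127 days for crossbar-big, and four hours of simulation skipped by running the counter design in hardware for one minute.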
CONCLUSION AND FUTURE WORK
In this paper, we proposed StateMover, a novel FPGA debugging framework which can safely stop a running design on an FPGA and move its entire state into a simulator where the designer can inspect and modify any signal value. StateMover can also extract the design state from a simulator and transfer it to a hardware FPGA design, which allows fast forwarding uninteresting periods of the simulation given the ~10M× speedup of hardware execution. StateMover supports any Xilinx UltraScale FPGA, and can be extended to any FPGA that supports readback. Reading and writing back the entire state of a large FPGA takes 9-10 seconds. Extracting the state out of the readback data and loading it into the simulator takes less than four seconds for a large design, and around five seconds for the other way around. StateMover opens the door for new debugging flows based on checkpointing, similar to the latest generation of software debuggers.
Although StateMover supports reading the content of on-chip block RAMs, the test designs used in this paper do not use any block RAMs (we forced Vivado to use distributed memories instead) because the internal pipeline registers of block RAMs as well as the pipeline registers inside DSP blocks cannot be read back. Hence, we would have an incomplete design checkpoint if these internal registers contain relevant data. Our future work includes adding support for reading the data inside these registers using techniques such as automatically adding readable shadow registers to the design to make these state elements visible.
