Abstract: Rapidly increasing FPGA density and complexity have heightened the need for higher levels of abstraction in validation and for more rapid, focused approaches to design inspection. We present two methods of validating and debugging active, implemented FPGA designs running at target speeds. The first binds high-level software reference models directly to hardware, enabling complex, automated, software-controlled testing scenarios and reducing the reliance on simulation. The second approach provides direct interactivity and visibility into a running FPGA design, enabling software-controlled breakpoints and arbitrary access to design registers. In-circuit breakpoints can be modified without the need to re-implement the entire design.
I. INTRODUCTION
Despite recent, impressive gains in the performance and density of FPGAs, development methods and tools have not similarly evolved. Current techniques will not scale to accommodate the near-term projected increases in FPGA size, density, or complexity. Developer productivity can be envisioned as being inversely proportional to FPGA growth: as FPGAs increase in size, design, development, and debugging tasks take longer and become more cumbersome [1], [2].
We present our debug methodology as part of the Dynamic Modular Development (DMD) framework. DMD utilizes the Xilinx Partial Reconfiguration (PR) flow to accelerate design and development by partitioning design modules into separate partially reconfigurable regions and automatically merging design modules which are no longer being modified into the surrounding non-PR region (static region). These PR regions make modifications to the debug logic almost immediate by allowing rapid changes to breakpoint conditions. DMD's debug methodology targets two distinct processes of FPGA development. High-Level Validation (HLV) connects synthesized hardware with a high-level language reference design. This gives designers the ability to testbench their hardware implementations against the original reference model using test vectors from their primary development environment. Low-Level Debug (LLD) offers a more agile and software-oriented approach to addressing low-level debugging tasks than current commercial products. LLD brings the familiarity, convenience, and flexibility of command-line debugger scripting to a live design running on an FPGA.
Unlike FPGA development environments, software development environments offer intuitive means for quickly testing modifications and validating projects. We informally define three prominent attributes of software development and base our framework on improving them for FPGAs: visibility, controllability, and agility.
Visibility is the extent to which design elements, such as signals, ports, and registers, are observable once implemented in hardware. In software debugging, special annotations such as symbol tables are compiled into the binaries. Once a variable is in scope, it can be printed or "watched" within a debugger. No such standardized facility is widely available in FPGAs other than diagnostic LEDs, and debugging remains largely technology- and vendor-dependent.
Controllability is the extent to which a design's state can be manipulated or altered during execution. Examples include forcing variables to values at runtime, a feature supported by software debuggers but not architecturally supported in FPGAs, and altering the state of the debug mechanism such as breakpoints. Software breakpoint modifications do not require re-compilation or restarting of the affected unit, while most FPGA vendor tools implement capture or trigger mechanisms as part of the monolithic design.
Finally, we define agility as the ease and efficiency with which modifications can be made to a design. It is a common occurrence during FPGA development that a trivial change requires a re-implementation of the entire design, a lengthy process taking potentially tens of hours for large designs [3]. By contrast, software build tools such as Make selectively rebuild only the affected units within the dependency tree of the modified unit.
The remainder of the paper is organized as follows: Section II discusses related research and commercial products for FPGA debugging. Section III discusses DMD in detail, including an overview of the PATIS floorplanner. Section IV provides results from two benchmark designs, followed by conclusions in Section V.
II. RELATED WORK
A review of commercial and research products for debugging FPGA designs reveals two prominent approaches: embedded logic analyzers and JTAG-based analysis tools. Embedded logic analyzers recreate the familiar interface of physical logic analyzers but are implemented as part of the FPGA design by inserting special cores with physical connections to signals of interest. These cores rely on a capture methodology: the selected signals' activity is recorded based on predetermined conditions. While some changes can be made at runtime, the trigger conditions that initiate the capture are physically implemented as part of the overall design and largely specified at design time. These cores compete against the design for system resources, particularly on-chip RAMs, as captured signal traces are stored until read out and displayed in a graphical user interface. Interaction with the design is limited: neither debug scenarios nor tasks can be scripted into a larger comprehensive or automated test framework. Capture conditions cannot be significantly modified without re-implementing the entire design, and different parts of the design cannot be inspected other than those signals initially specified. However, designs can typically run at or close to their target operating frequency. Xilinx's ChipScope Pro [4] and Altera's SignalTap II [5] are two examples of the embedded logic analyzer class of tools. The Synopsys Identify Debugger [6] can annotate original HDL source code for debugging, using an approach similar to the other two tools.
The other popular debug approach involves JTAG capture. These tools do not require the invasive instantiation of cores and routing of signals as found in embedded logic analyzers and thus do not compete for resources. A snapshot of the device state is captured and serially shifted out for inspection. Once retrieved, any arbitrary state of the design can be inspected. While these tools offer complete visibility without significantly altering the design as embedded logic analyzers do, they are limited in controllability and agility. JTAG operates at a significantly slower clock rate than the average FPGA design and, as devices increase in size and density, the time required to shift the entire chain out increases proportionally. Xilinx's now discontinued JBits [7] and BoardScope [8] are examples of research-grade JTAG inspection interfaces. Sandbyte's FPGAXpose is a commercially available product in this category [9]. GateRocket's Device Native Verification system allows designs to be run directly on an FPGA on an external RocketDrive unit, bypassing software simulation and executing directly on hardware [10].
III. DYNAMIC MODULAR DESIGN AND VALIDATION
When creating a complex design, a high-level language (HLL) reference model is often first created to develop and validate an algorithm. This reference model becomes isolated from the remaining development cycles due to language and technological incompatibilities. Subsequent development in Hardware Description Languages (HDL) re-implements the algorithm anew, focusing on data flow, extracting parallelism, timing, and control signal interaction, all aspects not captured in the reference model. After simulation, the HDL design may be adapted for synthesis. Though the same HDL may be used for both simulation and synthesis, this is not always the case, as third-party IP may be implemented differently for simulation and synthesis. Simulation models may be obfuscated or encrypted to protect intellectual property. Unintentional variations between models may exist, or optimizing synthesizers may misinterpret designer intentions, providing unreliable or unpredictable results during development [11].
High-Level Synthesis (HLS) offers a direct path to creating synthesizable HDL from an HLL. While HLLs are the fastest, most precise and abstract means of specifying and validating algorithm correctness, HLS has yet to gain widespread acceptance for reasons such as poor coverage, large overhead, and dead logic. Machine-generated RTL is difficult to integrate with existing IP cores and to process with formal verification tools [12]. We refer to this gap between HLL models and synthesizable hardware as the model verification gap.
Our DMD methodology envisions Xilinx's PR [13] flow not as a runtime strategy, but a design time methodology. While developers have not widely adopted PR for production designs, our research has shown that rapid development turnaround times can be achieved by partitioning frequently modified modules into separate PR regions [14] . DMD's use of PR does not extend beyond the development environment with PR regions gradually and automatically merged out of the design. The two principal components of DMD, the PATIS floorplanner and the validation tools, are discussed below.
A. PATIS Floorplanner
DMD extends the traditional PR flow with our Partial module-producing, Automatic, Timing-aware, Incremental, Speculative (PATIS) floorplanner [3] . While the standard PR flow has traditionally been a manual process not normally associated with enhanced productivity, the PATIS tools simplify much of the implementation details of PR by automating this process. Since runtime reconfiguration capabilities are not required, PATIS uses a simplified version of the PR flow that does not have the complications found in runtime hot-swapped modules. Bus macros are automatically inserted on module boundaries and provide passive, readback-based observability of all communication between floorplanned modules [15] .
The primary obstacle to fast implementation times for FPGAs is place-and-route (PAR). While vendor tools implementing parallel PAR algorithms and leveraging the lower cost of multi-core processors have been introduced, they have yielded only a modest speedup due to the complexity of the algorithms and the large, complex data structures which saturate memory bandwidth and overrun caches. PATIS counters these obstacles through a "divide-and-conquer" strategy that creates independent floorplan variants of the design that are parallelizable on separate machines and do not require shared memory access. By separating modules that are currently under development into independent PR regions, modules can be separately implemented without re-implementing the entire design, dramatically reducing implementation times.
The PATIS floorplanner creates a modestly oversized PR region for each top-level module, allowing modules to grow during the course of development without disturbing the rest of the design. If an updated version of a module no longer fits within its boundaries, PATIS selects an appropriate floorplan from a database or re-implements the entire design. A speculative floorplanning background process also runs that explores the design space, generating potential future floorplans based on estimated module completeness and past changes. Timing is analyzed across module interfaces and compared to top-level constraints. A thorough discussion of PATIS can be found in [14]. Fig. 1 illustrates the PATIS flow.
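The fallback behavior described above can be sketched in Python. This is only an illustrative model, not the PATIS implementation: the `Floorplan` type, the slice-count fit test, and the "smallest total area" tie-break are all assumptions introduced here, since the text does not define what makes a stored floorplan "appropriate".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Floorplan:
    """Hypothetical floorplan record: slices reserved per module PR region."""
    name: str
    region_slices: Dict[str, int]


def fits(plan: Floorplan, needs: Dict[str, int]) -> bool:
    """True if every module's slice requirement fits inside its PR region."""
    return all(plan.region_slices.get(m, 0) >= n for m, n in needs.items())


def choose_floorplan(current: Floorplan,
                     database: List[Floorplan],
                     needs: Dict[str, int]) -> Optional[Floorplan]:
    """Keep the current floorplan if possible, else pick one from the database.

    Returns None when nothing fits, in which case the caller would fall back
    to re-implementing the entire design.
    """
    if fits(current, needs):
        return current
    candidates = [p for p in database if fits(p, needs)]
    if not candidates:
        return None
    # Assumed policy: prefer the candidate that reserves the least total area.
    return min(candidates, key=lambda p: sum(p.region_slices.values()))
```

A module that grows slightly stays in its oversized region and the current floorplan is reused; only when it outgrows the region is the database consulted.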
B. High-Level Validation
HLV addresses the model verification gap wherein the reference model produced during the initial development stages becomes isolated from hardware development. HLV provides automated functional validation of synthesized hardware using the original, untimed reference model, much like unit testing for software development. HLV is a hardware/software framework based on capture methodologies, similar to those found in embedded logic analyzers. Since HLV is targeted towards automated nightly unit tests, the memory overhead of capture techniques is not considered detrimental to its usefulness. Fig. 2 outlines the HLV framework which consists of an on-chip processor and test harness peripheral. The peripheral provides input and output ports and the necessary hardware control logic to queue input data and capture a predetermined window of output data.
HLV operates by first executing the reference model using prepared, stored, or randomly generated input data and storing the results. Next, the same input data is staged in queues at the inputs of the device under test (DUT). Information gathered during HDL simulation, such as timing, control signal interaction, and latencies, is applied using HLV's API. The capture window parameters are programmatically set and define the clock cycle range of output data to store for comparison. Output data outside the capture window range is discarded. The use of queues for input and output data enables the design to run at its target operating frequency without having to develop a technique to control the execution of the design. This can conceivably be used to detect timing errors within the module that produce erroneous results and remains as future work. Captured output data is then compared against the results of the execution of the software model for correctness, with results reported on a console. HLV is designed to be used as a nightly, software-controlled, hardware unit-testing framework. With simulation data for control signal interaction and latencies, development of a testbench is relatively quick. HLV testbench software can provide software control of complex designs even during development, scanning output for expected data and reviewing output streams without re-implementing the design as with conventional hardware debugging. Software-controlled testbenching has rapid turnaround and requires no hardware re-implementation to inspect different parts of the design. Elaborate, software-generated unit tests can be created for each module, enabling hardware validation against the reference model.
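The comparison step can be illustrated with a small software-only model. The sketch below is not HLV's API, which is not listed in this paper: the function names, the representation of the DUT output as a per-cycle list, and the use of a single latency value to position the capture window are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def capture(stream: Sequence[int], start: int, length: int) -> List[int]:
    """Keep only outputs from clock cycles [start, start + length)."""
    return list(stream[start:start + length])


def validate(reference: Callable[[int], int],
             inputs: Sequence[int],
             dut_stream: Sequence[int],
             latency: int) -> List[int]:
    """Compare captured DUT output with the reference model.

    `dut_stream` holds one output word per clock cycle (assumed layout);
    `latency` is the cycle at which the first valid result appears, as
    learned from HDL simulation. Returns the indices of mismatching results.
    """
    expected = [reference(x) for x in inputs]
    captured = capture(dut_stream, latency, len(expected))
    mismatches = [i for i, (e, c) in enumerate(zip(expected, captured)) if e != c]
    # Results missing from a short capture also count as mismatches.
    mismatches.extend(range(len(captured), len(expected)))
    return mismatches
```

An empty return value corresponds to a passing unit test; a change in the hardware that alters latency shifts the data out of the window and shows up as mismatches.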
[Figure labels: High-Level Functional Model; Hardware Implementation; Online Comparison of Results]
C. Low-Level Debug
DMD's LLD aims to provide the interactivity found in software development tasks to FPGA development environments. An on-chip processor controls the execution of the design, providing a programmable and scriptable command-line interface via a workstation console application. Whereas embedded software running on an FPGA would likely be too slow and limited to meaningfully interact and monitor a design, LLD uses the on-chip processor to provide a user console over a serial link. Through the use of a custom hardware peripheral, the processor is capable of halting the design at any arbitrary location and reading design registers. A diagram of LLD is given in Fig. 3. During debugging, the desktop application constructs a model of the design mapping hierarchical design names to individual bit locations in the FPGA. The application then assembles and issues a stream of instructions to retrieve and decode the state information from the on-board processor using the Internal Configuration Access Port (ICAP).
The workstation application raises the abstraction of common low-level debugging tasks to the symbolic level consistent with the original design. The workstation handles tasks too intensive for the limited resources of the on-chip processor, such as processing the large logic allocation file. The logic allocation file is an optional text file produced during bitstream generation and maps the bit locations of design registers to the absolute position within the shifted JTAG chain and the configuration frame and offset accessible from the ICAP. The user application additionally provides an intuitive interface including command- and register-name tab-completion. By masking the delineation between a user workstation and a resource-limited on-chip processor that has full design visibility and control, an insightful and productive interface to an FPGA is created.
An agile and interactive breakpoint management strategy is the novel contribution of this work. Up to 32 software-addressable breakpoints on top-level signals can be programmed into the PR region, managed from the user workstation application. Breakpoints can be modified and quickly re-implemented in hardware without a full re-implementation of the entire design, achieving significant time savings especially for large or high-utilization designs. Breakpoint statements can be selectively enabled or disabled from the command-line through a software-controlled breakpoint mask. This allows users to temporarily disable a statement or initially program all the required statements and selectively enable only those required for different scenarios. Breakpoints are implemented in two different ways: as a conventional conditional breakpoint that suspends execution when a condition is met, and as an assertion breakpoint that suspends execution once a condition fails to be met. All breakpoints are implemented using asynchronous logic which allows execution to be suspended at or immediately following the next clock cycle.
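The interaction between the breakpoint mask and the two breakpoint kinds can be modeled in a few lines of Python. This is a behavioral sketch only: the class and function names, and the "break"/"assert" labels, are hypothetical and are not taken from the LLD tools. The one-bit-per-breakpoint, 32-bit mask follows the description above.

```python
from typing import List, Sequence, Tuple

NUM_BREAKPOINTS = 32  # number of software-addressable breakpoints


def active_mask(breakpoints: Sequence[Tuple[str, bool]]) -> int:
    """Build the active-breakpoint bitmask from (kind, condition) pairs.

    A conditional breakpoint ("break") fires when its condition is true;
    an assertion breakpoint ("assert") fires when its condition is false.
    """
    mask = 0
    for index, (kind, condition) in enumerate(breakpoints[:NUM_BREAKPOINTS]):
        fired = condition if kind == "break" else not condition
        if fired:
            mask |= 1 << index
    return mask


class BreakpointMask:
    """Software-controlled enable mask, one bit per breakpoint."""

    def __init__(self) -> None:
        self.mask = 0

    def _check(self, index: int) -> None:
        if not 0 <= index < NUM_BREAKPOINTS:
            raise IndexError(f"breakpoint index {index} out of range")

    def enable(self, index: int) -> None:
        self._check(index)
        self.mask |= 1 << index

    def disable(self, index: int) -> None:
        self._check(index)
        self.mask &= ~(1 << index)

    def triggered(self, active: int) -> List[int]:
        """Indices of breakpoints that are both enabled and active."""
        hits = self.mask & active
        return [i for i in range(NUM_BREAKPOINTS) if hits & (1 << i)]
```

Because enabling and disabling only changes the mask, no hardware is rebuilt; only adding or rewriting a breakpoint condition requires re-implementing the breakpoint region.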
The Programmable Debug Controller (PDC), a custom peripheral of the on-chip processor, handles aggregate signals from the breakpoint logic, manages clock logic, and controls the ICAP. Like JTAG, the ICAP can be used to read device state, but internally and through random access rather than a serial shift-chain. This noticeably accelerates the process of reading register values since only the requested values are read on-demand. The ICAP API reads an entire configuration logic frame, which is loaded with register state after an ICAP capture command is issued. Using the map generated from the logic allocation file, the user application generates and maintains a map of symbolic design names to configuration frames and bit offset locations, thereby enabling arbitrary symbolic access to the design similar to that of conventional software debuggers.
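The symbolic lookup can be sketched as follows. The line layout parsed here (`Bit <offset> <frame address> <frame offset> ... Net=<name>`) is an assumed, simplified form of a logic allocation entry, and representing each configuration frame as a Python integer indexed by bit offset is a simplification for illustration; the actual file contents and frame layout depend on the device and tool version.

```python
import re
from typing import Dict, Tuple

# Assumed entry format: "Bit <offset> <frame addr> <frame offset> ... Net=<name>"
LINE = re.compile(r"^Bit\s+(\d+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+.*\bNet=(\S+)")
# Splits "cpu/pc<3>" into ("cpu/pc", "3"); scalar nets have no index.
NET = re.compile(r"^(.*?)(?:<(\d+)>)?$")

DesignMap = Dict[str, Dict[int, Tuple[int, int]]]


def build_map(text: str) -> DesignMap:
    """Map each register name to {bit index: (frame address, frame offset)}."""
    design: DesignMap = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip headers and entries without a net name
        _, frame, offset, net = m.groups()
        name, index = NET.match(net).groups()
        design.setdefault(name, {})[int(index or 0)] = (int(frame, 16), int(offset))
    return design


def read_register(design: DesignMap, frames: Dict[int, int], name: str) -> int:
    """Assemble a register value from captured frames.

    `frames` maps a frame address to the frame contents as an integer,
    where bit `offset` of the integer is the bit at that frame offset.
    """
    value = 0
    for bit_index, (frame, offset) in design[name].items():
        value |= ((frames[frame] >> offset) & 1) << bit_index
    return value
```

Only the frames that appear in a register's map entries need to be read back, which is why a register whose bits are scattered across many frames takes longer to print than one packed into a single frame.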
The PDC's clock management unit enables fine-grained control of the design's master clock through the use of a clock buffer. The design can be run freely at its full, intended design speed or stepped an arbitrary number of clock cycles. The clock is disengaged and execution suspended when a breakpoint condition occurs, a user-issued command to stop is given, or the programmable step counter expires. Hardware-controlled clock stepping allows the design to be stepped a predetermined number of cycles using the actual system clock at the design's intended target frequency rather than by a software-generated clock waveform which could mask timing errors. A high-level diagram of LLD's clock control is shown in Fig. 4. The architecture currently only addresses the master clock, though it could be extended to handle complex scenarios which we leave for future work.
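The three stop conditions can be captured in a small behavioral model. This is a software sketch of the control logic only, with hypothetical names; it assumes that a breakpoint asserted during a cycle gates the clock after that cycle completes, and it does not model the clock buffer's setup and hold behavior.

```python
from typing import Optional


class ClockController:
    """Behavioral model of a gated master clock with a step counter."""

    def __init__(self) -> None:
        self.running = False
        self.steps_left: Optional[int] = None  # None means free-running
        self.cycles = 0                        # cycles delivered to the design

    def run(self) -> None:
        """Let the design run freely at the system clock rate."""
        self.running = True
        self.steps_left = None

    def step(self, count: int) -> None:
        """Enable the clock for exactly `count` cycles."""
        if count <= 0:
            raise ValueError("step count must be positive")
        self.running = True
        self.steps_left = count

    def stop(self) -> None:
        """User-issued stop command."""
        self.running = False

    def tick(self, breakpoint_hit: bool = False) -> bool:
        """Model one system clock edge; return True if the design was clocked."""
        if not self.running:
            return False
        self.cycles += 1
        if self.steps_left is not None:
            self.steps_left -= 1
            if self.steps_left == 0:   # step counter expired
                self.running = False
                self.steps_left = None
        if breakpoint_hit:             # breakpoint logic requests suspension
            self.running = False
        return True
```

In this model the design is always clocked by the real system clock edge; the controller only decides whether a given edge is passed through, mirroring the hardware-controlled stepping described above.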
IV. RESULTS
To evaluate DMD, benchmarks were developed and validated using DMD's HLV and LLD tools. All designs targeted a Xilinx XC5VLX110T-1 FPGA using version 12.4 of the Linux Xilinx ISE design suite with the PR patch running on a 2.80 GHz Intel Core i7-930 processor with 24 GB of RAM. Both HLV and LLD utilize Xilinx's MicroBlaze soft-core processor.
To evaluate LLD, the OpenSPARC T1 processor [16] was implemented for our FPGA. Fig. 5 shows a comparison of the implementation times for combinations of building the design alone, with our LLD framework, and with Xilinx's ChipScope debugging framework, as well as the time required for common debugging tasks with both tools. A complete build of the OpenSPARC took over 3 hours, with synthesis accounting for approximately 55 minutes and PAR taking nearly two hours. The OpenSPARC utilized 99% of the FPGA's SLICE resources, 17% of DSPs, and 15% of total BRAM resources.
Tool run times, particularly Map and PAR, were highly variable. The insertion of instrumentation logic can extend or even reduce implementation times in an unpredictable fashion. In general, for large designs the addition of a ChipScope core can be assumed to not noticeably change the original implementation times. The use of the partial reconfiguration flow introduced some unexpected overhead. While only the reconfigurable region would be re-implemented in subsequent runs, we observed the tools processing (but not implementing) the entire design, lengthening run times. For instance, bitfile generation time is proportional to the total design size, not the size of the partial region as for each run a complete static bitstream is produced, in addition to the partial bitstreams.
To evaluate the efficiency of our approach, we performed several common debugging tasks. In many cases ChipScope may require a post-synthesis re-implementation of the design to accommodate a new task, dominated largely by Map and PAR runtime. However, once implemented with the LLD flow, we were able to quickly re-target for different breakpoint scenarios by only re-implementing the altered breakpoint region. For instance, with ChipScope it is not possible to arbitrarily halt the design and read a randomly selected device register. The embedded logic analyzer and control cores must be reconfigured and re-implemented for a specific event. Using LLD, breakpoints were set to suspend execution so that registers could be inspected and then later the design was stepped by a small number of clock cycles to advance execution and continue inspection. LLD allows register values to be randomly read from the command-line using their full hierarchical design name in 2 to 8 seconds depending on the width of the signal and fragmentation of the bits across the FPGA's configuration frames.
The number of resources required for the breakpoint region varies depending on the number of inputs and breakpoints implemented, but the region requires only slices and therefore does not compete with the target design for other, more limited resources. A simple set of breakpoints occupied as few as 10 slices during experimentation; however, our test region was generously sized at 204 slices, allowing for more complex scenarios without increasing the breakpoint region size. A transcript of a debug session is shown in Listing 1. The syntax and commands are similar to those found in the GNU gdb debugger.
A debugging session is started by connecting to the on-chip processor through a serial line and then building the design map by loading the logic allocation file. Conditional breakpoints are defined using top-level ports and signals instantiated into the breakpoint module. An information command is used to display breakpoint status. The information output begins with two status masks: the breakpoint mask shows which breakpoints are enabled, while the active breakpoint mask indicates which of the 32 breakpoints are activated. This is followed by a detailed listing, shortened here for clarity, which defines the index of the breakpoint, whether the breakpoint is a conventional conditional breakpoint or assertion, whether or not the breakpoint is active or enabled, and the original text of the breakpoint as entered. As with conventional command-line debuggers, breakpoints can be individually enabled or disabled through software-controlled masks without the need to rebuild or restart the design. The final print command in the listing demonstrates the ability to print the value of an arbitrary design register from the command-line.
Connections to the breakpoint logic region use FPGA routing resources by necessity. Although we would normally consider this approach a shortcoming, DMD's restriction to top-level signals and the ability to arbitrarily read any design register through the processor did not present any of the expected problems such as routing issues. DMD's core is instantiated as part of the static region, meaning subsequent modifications of partitioned modules do not require lengthy rerouting. While a connectionless approach would have allowed any arbitrary register of the design to appear in a breakpoint condition and eliminated the need for an additional PR region, this would require a costly polling scheme of the targeted signals, reducing execution performance. LLD breakpoints are implemented as asynchronous logic, signaling the PDC to interrupt execution and allow software identification of active breakpoints. Each statement represents a breakpoint as entered at the command-line. Activated rules signal to the PDC which of the 32 breakpoints was triggered and to suspend execution. Clock buffers respond to control signals at the next rising edge, provided that setup and hold time requirements are met. If not, transition is deferred until the following rising edge, providing at most a two-cycle latency. We observed consistent one-cycle response times when the design was halted by generated breakpoints and validated the response by checking register values against the programmed breakpoint.
Synthesis often optimizes away logic internal to the modules under observation, even when instructed not to using synthesis pragmas. This was evident from the absence of registers from the final logic allocation file and synthesis logs. We were, however, still able to successfully use these signals as breakpoint conditions when they appeared as top-level ports. The frameworks were useful during their own development, discovering logic and signal interaction errors, and inspecting data to discover inconsistencies between implementation and simulation. One error was the failure to properly initialize a state machine, which when encoded did not default to a zeroed state after reset. We also observed that memory-mapped registers in our MicroBlaze peripheral did not reset to zero after power-up, but only after their first access.
HLV evaluation was performed on a SHA-1 core. Programming the HLV testbench took approximately 20 minutes, including incorporating simulation metadata and test cases. Simulation results provided the basis for correctly programming the framework to reproduce the correct interaction of control signals and extract the expected output window for results. Once this information is incorporated into the testbench, any changes that affect latencies or final results will be detected. The capture window was used to validate the hardware implementation against the simulation. It is possible to programmatically expand and move the capture window's limits, enabling scanning of control or data output. A transcript of an HLV test session is shown in Listing 2, showing the execution of the software reference model, the configuration of the hardware, and finally the comparison between the two.
V. CONCLUSIONS AND FUTURE WORK
Two approaches to rapidly validating FPGA designs were presented. High-Level Validation (HLV) bridges the model verification gap in which high-level reference or algorithmic models are excluded or isolated from the subsequent hardware development. HLV enables the original software reference model to function as the testbench for a hardware design running on an FPGA. This approach provides a higher level of abstraction to hardware and enables complex testing scenarios practical only in software. Low-Level Debug (LLD) utilizes partial reconfiguration to rapidly modify breakpoint logic without requiring a full re-implementation of the design. Software control provides interactivity and full-visibility into a running design from a command-line that can be automated, a feature absent from many commercial tools.
Our next steps are to optimize DMD for improved usability and efficiency, including automated generation of HLV software testbenches directly from simulation. Sophisticated FSM-based breakpoint mechanisms, analysis and statistics gathering, and complex clock control mechanisms for LLD are being investigated. 
ACKNOWLEDGMENT
