Abstract: Commonly used in software design, assertions are statements placed into a design to ensure that its behavior matches that expected by a designer. Although assertions apply equally to hardware design, they are typically supported only for logic simulation, and discarded prior to physical implementation. We propose a new hardware design language-agnostic language for describing latency-insensitive assertions and novel methods to add such assertions transparently to an already placed-and-routed circuit without affecting the existing design. We also describe how this language and associated methods can be used to implement semi-transparent exception handling. The key to this paper is that by treating hardware assertions and exceptions as being oblivious or less sensitive to latency, assertion logic need only use spare FPGA resources. We use network-flow techniques to route necessary signals to assertions via spare flip-flops, eliminating any performance degradation, even on large designs (92% of slices in one test). Experimental evaluation shows zero impact on critical-path delay, even on large benchmarks operating above 200 MHz, at the cost of a small power penalty.
I. INTRODUCTION
FIELD-PROGRAMMABLE gate arrays (FPGAs) are a general-purpose silicon technology capable of implementing almost any digital design. This prefabricated flexibility provides generic logic resources (e.g., lookup-tables and switched routing interconnect) that can be configured at implementation-time. Synthesizing a design onto an FPGA uses computer-aided design (CAD) tools to compute a feasible configuration of a subset of these resources to implement the requested circuit.
Modern FPGA devices can exceed 20 billion transistors; hence: 1) FPGA CAD can be time-consuming [1] and 2) due to the heuristic nature of CAD algorithms, synthesized solution quality can be unstable. Rubin and DeHon [2] found that small perturbations to the initial conditions of routing algorithms affect delay by 17%-110%. Thus, circuit modifications require resynthesizing, a lengthy procedure that may return worse results and reduce designer productivity. We present a solution that inserts new latency-oblivious logic, such as in-circuit assertions, into an existing design transparently, without needing to recompile the entire circuit. We define a latency-oblivious circuit as one with no strict constraints on the number of clock cycles for computing its result; one example is using trace-buffers to record on-chip signal behavior [3]: pipelining trace signals does not affect observability. Another example is invoking circuit reset when the system becomes unresponsive. The key advantage of latency-oblivious circuits is that they introduce a new dimension of synthesis flexibility, allowing transparent insertion.
Traditionally, digital circuits have been developed using a logic simulation environment due to unlimited signal visibility, fast recompilation cycles, and software-like instrumentation. However, as designs become increasingly complex, circuit simulation speed slows. In turn, this causes circuit testing to be less thorough, and reduces designer productivity.
A promising approach uses in-circuit assertions [4] to verify designs at run-time. Because they run in the same circuit as the design under test, in-circuit assertions can run much faster than simulation, allowing testing to be more thorough. In-circuit assertions can be latency-oblivious since designers typically care more about if any assertions were violated rather than needing to be alerted immediately.
We insert additional logic, such as assertions, transparently, without affecting performance or functionality. To do so, we insert the logic after place-and-route, using only spare FPGA resources not used by the original user circuit. By using such mutually exclusive resources, new functionality can be added without affecting the user design. To eliminate any impact on the critical path of the original design, we aggressively pipeline the new circuitry, which is possible because of its latency-oblivious nature. Our methods allow even large circuits to be augmented in this way: we have tested on circuits using up to 92% of the slices of a large FPGA. We thus make the following contributions.
1) An approach for reclaiming the spare, unused resources on FPGAs to transparently insert new logic, such as in-circuit assertions, after circuit implementation.
2) An assertion language based on Boolean logic, allowing assertions to be described at a high level.
3) Use of minimum-cost graph flow techniques to simultaneously pipeline-and-route all input signals required by this logic, without impact on circuit timing.
4) An extension to in-circuit exceptions, allowing some circuit errors to be fixed without rerunning place-and-route.
5) Experimental validation, showing that our techniques incur only a small power penalty.
The remainder of this paper is organized as follows. Section II reviews related work, Section III shows our assertion language; Section IV describes our transparent insertion approach in detail. Section V describes exception handling. Sections VI and VII evaluate the methodology and show experimental results. Finally, Section VIII concludes, outlining future work.
The key concepts in this paper were first presented in [5]. Since then, we have developed a new language to describe latency-oblivious assertions and exceptions. The high-level language allows compact description of complex assertions, and translation to multiple design descriptions.
II. BACKGROUND AND RELATED WORK

A. Latency-Insensitive Design
We exploit the flexibility of inserting latency-oblivious logic, that is, logic without strict constraints on the number of clock cycles in which it must return a result. An example of latency-oblivious logic is trace-buffers used to record on-chip signal activity for debugging; pipelining each traced signal does not affect its observability. Latency-insensitive design [6] is a methodology to create designs that are insensitive to communication delays between components, allowing tools to pipeline them arbitrarily to meet performance criteria. This improved flexibility comes at the cost of area overhead and is unsuited to designs with poor communication locality. Note that only the elements we add are latency-insensitive: the rest of the design need not be.
B. In-Circuit Assertions
Assertions specify Boolean conditions that should always hold true if the design is working correctly. An example in software may be that a "malloc()" system call must return a nonzero value; a hardware example could check that the carry-out bit of an adder is always "0" to indicate that no overflow occurs. While it may not be practical to halt a hardware prototype in the same way as in simulation, it is nonetheless beneficial to alert the designer if any assertion fails. Assertions may be combinational, or include state as well, for example, checking that each DRAM access latency lies within a bound, or even statistical properties [7].
Hardware assertions form part of the SystemVerilog language standard (SVA) [8], and can also be described using the Property Specification Language [9]. Typically, such constructs are supported only by logic simulators or formal verification tools and are discarded for hardware, although researchers have proposed extending these into silicon [4], [10]. Previous approaches, however, insert assertions by modifying the original hardware description and resynthesizing the entire circuit: HLS assertions can degrade FPGA performance by 3% [4]. Although incremental compilation approaches can accelerate this procedure, commonly the original circuit must be partitioned in advance to reserve space for assertions.
C. Network Flow Algorithms in FPGA Tools
A flow network is a graph G(V, E), with a set of vertices V and a set of directed edges E, each edge connecting two vertices and with capacity u ∈ N. A valid flow solution exists when: 1) the flow carried by each edge does not exceed its capacity and 2) conservation of flow holds at all vertices, that is, the sum of all flows entering a vertex equals the sum of all flows exiting, with two exceptions: the source and the sink. The source node may only produce flow; the sink node may only consume flow. A single-commodity network has only one type of flow present.
Efficient algorithms to compute the maximum integer flow of a single-commodity network exist (multicommodity maximum integer flow is known to be NP-complete), and are applied in FPGA CAD. FlowMap [11] employs a max-flow algorithm (specifically, its dual, the min-cut) during FPGA technology-mapping to compute a mapped netlist with the minimum logic-depth, while Lemieux et al. [12] used max-flow to evaluate routability of depopulated FPGA switch matrices.
Both min-cost and max-flow algorithms are combined in [3], where they are used to connect signals to trace-buffers during FPGA debug. In contrast, we use flow techniques in this paper to concentrate signals into a single region (rather than connecting to trace-buffers distributed across the device) in a way that does not impact circuit performance. While prior work reports that adding trace-buffer connections reduced the maximum clock frequency from 75 to 55 MHz, we pipeline our signal routing to mitigate all impact on performance.
Recent work on incremental trigger insertion [13] uses spare FPGA resources to insert trigger circuits for enabling debug buffers. Unlike our approach, that work incurs critical-path delay penalties of up to 107% because it does not pipeline the signals.
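The two validity conditions can be made concrete with a short check. The following Python sketch is illustrative only: the function name, the edge-dictionary representation, and the toy network are assumptions made for this example and are not part of any FPGA tool. It does not verify that the source only produces and the sink only consumes flow.

```python
from collections import defaultdict

def is_valid_flow(capacity, flow, source, sink):
    """Check the two validity conditions for a single-commodity flow.

    capacity, flow: dicts mapping a directed edge (u, v) to a non-negative int.
    """
    # Condition 1: no edge carries more than its capacity.
    for edge, f in flow.items():
        if f < 0 or f > capacity.get(edge, 0):
            return False
    # Condition 2: flow is conserved at every vertex except source and sink.
    net = defaultdict(int)  # flow out minus flow in, per vertex
    for (u, v), f in flow.items():
        net[u] += f
        net[v] -= f
    return all(net[v] == 0 for v in net if v not in (source, sink))

# Toy network (hypothetical): s -> a -> t and s -> a -> b -> t.
capacity = {("s", "a"): 2, ("a", "t"): 1, ("a", "b"): 1, ("b", "t"): 1}
good = {("s", "a"): 2, ("a", "t"): 1, ("a", "b"): 1, ("b", "t"): 1}
bad = {("s", "a"): 2, ("a", "t"): 1, ("b", "t"): 1}  # one unit vanishes at "a"
```

Here `good` satisfies both conditions, while `bad` violates conservation at vertex "a".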
III. ASSERTION LANGUAGE
We develop a high-level language for describing in-circuit assertions, based on Boolean logic, and show an implementation by systematic translation into a target language such as VHDL. Since assertions are written in a high-level language, they are independent of particular implementations, thus potentially reusable between different but related designs.
Compared to industrial assertion languages such as SVA, our assertion language corresponds to SVA's concurrent assertions, which evaluate once per cycle and run concurrently with design code. Our language does not support SVA immediate assertions, since these depend on simulation concepts such as delta time. Unlike SVA, our assertions are not limited to VHDL designs, but can target other descriptions, such as Verilog and OpenSPL.
The assertion language includes useful primitives for complex designs: arithmetic expressions, including floating-point, counters and accumulators, allowing complex assertion conditions without needing to use lower-level primitives as in VHDL. Delays allow assertions to match latencies of pipelined circuits. Users can declare external hardware blocks, allowing assertion conditions to use design-specific primitives. 
A. Translating to Latency-Oblivious Assertions
We systematically translate from the assertion language into a latency-oblivious implementation. The translation is syntax-directed, proceeding recursively from the root of the assertion condition to the leaves, which will be atomic propositions or Boolean literals (true or false). Each operator maps one-to-one to a block in the implementation, for example, a VHDL block implementing that operator. The only restriction is that the latency in cycles of the resulting circuit must be the same from each circuit input to the circuit output, ensuring all data is synchronized. The circuit can be arbitrarily pipelined to meet the timing of the design under test. We automatically insert pipeline registers to ensure inputs from the same cycle arrive on the same cycle throughout the graph, using a straightforward as-soon-as-possible (ASAP) algorithm.
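The latency-balancing step can be illustrated with a minimal Python sketch. This is one possible reading of an as-soon-as-possible schedule over an expression tree, not the authors' implementation; the tuple encoding, the operator names, the per-operator latencies, and the 2-cycle `scale` block are all hypothetical.

```python
def balance(node, pads=None):
    """Return (output_cycle, pads) for an expression tree.

    A leaf is a signal name (str), available at cycle 0. An operator is a
    tuple (name, latency, [children]). pads collects (operator, operand_index,
    registers) for every operand that needs extra pipeline registers.
    """
    if pads is None:
        pads = []
    if isinstance(node, str):
        return 0, pads
    name, latency, children = node
    arrivals = [balance(child, pads)[0] for child in children]
    start = max(arrivals)  # ASAP: fire once the slowest operand is ready
    for index, arrival in enumerate(arrivals):
        if arrival < start:
            # Delay the early operand so same-cycle data meets at this operator.
            pads.append((name, index, start - arrival))
    return start + latency, pads

# (L <= C) and (scale(C) <= H), where scale is a hypothetical 2-cycle block.
tree = ("and", 1, [("le_lo", 1, ["L", "C"]),
                   ("le_hi", 1, [("scale", 2, ["C"]), "H"])])
total_latency, pads = balance(tree)
```

In this example the input H is delayed by two registers to match the `scale` block, and the output of `le_lo` is delayed by two registers before the final `and`, so every input-to-output path has the same latency of four cycles.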
Example:
where line 1 declares an assertion with two compile-time parameters L and H (inside the angle brackets) and one runtime parameter C; line 2 is an expression checking that C is in the range [L, H]. This could be used wherever a value must be in a defined range, for example to ensure a soft CPU only reads instructions from a valid memory space.
Step 1 compiles the user-circuit as normal (for example, by using Xilinx ISE) without reserving any resources a priori or specifying additional constraints over a regular compilation run.
Step 2 examines the floorplan of the compiled result, identifying an underutilized region (typically at the peripheries of the device) that could host any new logic. Currently, this step is manual; future work could automate it.
Step 3 applies minimum-cost flow techniques to transport user signals (perhaps distributed across the whole device) needed by the assertion circuit into its vicinity, via pipelining registers. The exact number of pipeline stages, and the maximum distance between stages are user parameters. Crucially, only spare logic and routing resources not consumed by the original circuit are used-this makes our approach transparent.
Based on results from step 3, which specifies a template containing the location of all flip-flops used in pipelining, and all logic resources occupied by the user circuit, step 4 applies vendor tools to compile (but not route) a separate circuit implementing the new logic tailored to this template, again using only those spare resources. As this new logical circuit is mutually exclusive to the original user circuit, step 5 merges the pipelined-and-routed circuit from step 3 with the newly placed circuit from step 4. Finally, step 6 completes the unrouted connections inside the merged circuit (connecting from the final pipelining stage to the new circuit, and within the new circuit) using vendor tools.
For new functionality using the same set of prerouted signals [case (i) of Fig. 1], only steps 4-6 would need to be repeated. However, for new logic operating on signals not already routed [case (ii)], step 3 must also be repeated, to compute new pipelined connections for any new signals.
Pipeline-and-Route: A key ability of this toolflow is transporting circuit signals, perhaps scattered across a device, into a concentrated region as inputs to a new circuit, while only using spare resources. Routing such signals directly incurs large distance-dependent routing delays. To mitigate these delays which can introduce new critical-paths, we pipeline the signals. As our approach targets latency-oblivious logic, additional pipelining stages are acceptable. Although fanout increases by one for each signal routed, this is unlikely to affect overall design timing. Modern commercial FPGAs contain buffered routing-adding an extra routing branch to an existing net incurs only a small capacitative load; on the Xilinx platform we use in testing, timing analysis reports the effect as < 5 ps.
We transform the FPGA routing resource graph (with nodes occupied by the user circuit removed) into a flow network using similar techniques to [3] and employ minimum-cost flow techniques to route all necessary signals to unique pipelining registers from a candidate set. An important degree of freedom with this particular routing problem (and that does not exist with user routing) is that each signal can connect to any register from the candidate set; this provides significant routing flexibility even under constrained scenarios. Our approach differs from the separate placement and routing stages employed by traditional CAD tools; in some ways, our tool can be seen as routing signals, resolving congestion, and placing pipelining registers simultaneously. Furthermore, unlike [3] , we do not seek to find the routing solution with maximum signal observability, but instead use flow algorithms to perform both placement and routing during signal pipelining.
Given timing estimates (costs) for each edge in the flow network, the objective function minimized is the average-case timing for each connection, not the worst-case timing across all connections that determines the critical-path delay. Nevertheless, our experiments show that when a user chooses the candidate register set conservatively (via the number of pipelining hops, and the distance of each hop from the anchor point), our approach can return solutions that do not increase critical-path delay. It is worth pointing out that we do not apply min-cost flow techniques to find the optimal timing solution, for the following reasons: 1) due to the nature of the network flow problem, it is only possible to optimize for average-case timing; 2) we modify the network heuristically to guide algorithm behavior in ways that do not reflect the true device; and 3) while each application of min-cost flow is proven to find the global optimum, when applying this technique iteratively (in a piecewise fashion) to each pipeline stage, optimality is no longer guaranteed. Instead, we consider the flow approach to be an effective heuristic for this particular routing problem.
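To illustrate why a flow formulation minimizes total (and hence average) delay rather than the worst case, the following Python sketch solves a heavily simplified version of the problem: each signal connects directly to a candidate register at a given delay cost, with no intermediate routing nodes. The successive-shortest-path solver, the node names, and the delay values are assumptions made for this example; the tool described here operates on the full routing resource graph and uses the LEMON library.

```python
def min_cost_assignment(delay):
    """Assign each signal to a distinct register, minimising total delay.

    delay[signal][register] is a hypothetical timing estimate (e.g., in ps).
    Solved as min-cost flow by successive shortest augmenting paths.
    """
    signals = list(delay)
    registers = sorted({r for row in delay.values() for r in row})
    # edges[i] = [to, residual_capacity, cost]; edge i ^ 1 is its reverse.
    edges, adjacency = [], {}

    def add_edge(u, v, cost):
        for a, b, cap, c in ((u, v, 1, cost), (v, u, 0, -cost)):
            adjacency.setdefault(a, []).append(len(edges))
            edges.append([b, cap, c])

    for s in signals:
        add_edge("SRC", ("sig", s), 0)
        for r, d in delay[s].items():
            add_edge(("sig", s), ("reg", r), d)
    for r in registers:
        add_edge(("reg", r), "SINK", 0)  # unit capacity: one signal per register

    total = 0
    for _signal in signals:  # push one unit of flow per signal
        dist, via = {"SRC": 0}, {}
        for _round in range(len(adjacency)):  # Bellman-Ford on the residual graph
            changed = False
            for u in list(dist):
                for i in adjacency[u]:
                    v, cap, cost = edges[i]
                    if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                        dist[v], via[v], changed = dist[u] + cost, i, True
            if not changed:
                break
        if "SINK" not in dist:
            raise ValueError("not enough free registers for all signals")
        total += dist["SINK"]
        node = "SINK"
        while node != "SRC":  # augment along the shortest path
            i = via[node]
            edges[i][1] -= 1
            edges[i ^ 1][1] += 1
            node = edges[i ^ 1][0]

    chosen = {}
    for s in signals:
        for i in adjacency[("sig", s)]:
            if i % 2 == 0 and edges[i][1] == 0:  # saturated forward edge
                chosen[s] = edges[i][0][1]
    return chosen, total

# Hypothetical delays: both signals prefer ff_a, but only one can have it.
delay = {"pc0": {"ff_a": 100, "ff_b": 300}, "pc1": {"ff_a": 120, "ff_b": 500}}
chosen, total = min_cost_assignment(delay)
```

In this toy case the solver gives `ff_a` to `pc1` and sends `pc0` to the slower `ff_b`, because that minimizes the sum of delays (420 rather than 600) even though `pc0` alone would be faster on `ff_a`. The objective is the total, not the slowest individual connection.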
In our tool, the candidate set of registers is specified as spare flip-flops that fall within a user specified radius from an (X, Y) anchor location. Spare flip-flops may exist inside slices partially occupied by the user circuit (care must be taken to ensure that such logic slices belong to a compatible clock domain to the signals being transported) or within unoccupied sites. The region determined by the anchor and radius is a circle (or a segment, if clipped by the FPGA boundary). By iteratively reducing the radius of this circle over multiple routing passes, and hence reducing the candidate set of pipelining registers, it is possible to migrate signals to the anchor point, at the cost of additional latency for each pipeline hop. to a different flip-flop inside the region, to maintain latency between signals.
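Candidate selection by anchor and radius can be sketched in a few lines of Python. The function name and the list of spare sites are hypothetical, and Euclidean distance in slice coordinates is assumed because the region is described as a circle; the anchor and radii are borrowed from the LEON3 experiment reported later.

```python
import math

def candidate_registers(spare_flops, anchor, radius):
    """Return the spare flip-flop sites lying within `radius` of `anchor`."""
    ax, ay = anchor
    return [(x, y) for (x, y) in spare_flops
            if math.hypot(x - ax, y - ay) <= radius]

# Hypothetical spare flip-flop sites, as (X, Y) slice coordinates.
spare = [(10, 180), (100, 100), (150, 30), (40, 240)]
anchor = (0, 185)

# Shrinking the radius on each pass narrows the candidate set toward the anchor.
first_hop = candidate_registers(spare, anchor, 160)
second_hop = candidate_registers(spare, anchor, 80)
```

Each pass would then route every signal to some register in the current candidate set, so the signals converge on the anchor one pipeline hop at a time.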
To guide the min-cost flow algorithm toward a valid routing solution, we make two heuristic modifications to our network. First, we apply a penalty to all network edges crossing FPGA clock regions. In most devices, all resources are exclusively associated with a single clock region, and due to the clock network design, signals crossing between regions incur clock skew. In our experiments, we observe that the min-cost algorithm sometimes returns very short routing paths bridging two different regions, which, combined with a positive clock skew, result in a hold time violation. To discourage such paths, we add an inflated delay penalty to all such edges. Second, we observe that it is possible for the min-cost algorithm to connect to pipelining registers whose output pin is blocked due to routing congestion. Given that we route signals piecewise, it would not be possible for one min-cost iteration to understand the routability of the next iteration. To alleviate this, during candidate flip-flop selection, we prune all registers without sufficient free fan-outs left for downstream usage.
V. EXCEPTIONS: SEMI-TRANSPARENT LOGIC INSERTION
Our method inserts logic transparently (circuit behavior is preserved), but there are limits to what transparent insertion can achieve; essentially, we are limited to adding extra circuit outputs. In this section, we extend our method to allow limited changes to circuit behavior (abandoning strict transparency), which can allow faults in the circuit to be corrected. By analogy with software, we call these additions exceptions; like software exceptions they allow error correction and recovery. We call these additions semi-transparent: they only affect the replaced circuit signal; the rest of the circuit is not directly affected.
A. Motivating Example
The previous section uses range checking: detecting that a circuit signal, such as the program counter in a soft-processor, lies within a valid range. Correcting the program counter could, for example, replace a faulty value with the address of a service routine, allowing operating system software to handle the error. Our approach applies to exceptions, where a bounded amount of latency can be tolerated-such as the program counter for a pipelined processor.
B. Assertion Language Extensions for Exceptions
We extend our assertion language to allow for exceptions. Unlike software exceptions, each exception maps one-to-one with an assertion. An extended Backus-Naur form grammar follows:
where the extra production allows an assertion declaration to have an exception handler: an assignment statement, allowing one of the run-time arguments to the assertion to be overwritten.
The program counter range-checker could look like
where lines 1 and 2 declare the assertion as before; lines 3 and 4 declare an exception handler which, if the assertion fails, overwrites the circuit signal C with the value OutOfRangeTrap, the address of the software handler.
C. Implementation
Fig. 3 shows our implementation using the methods of Section IV: SRC is the circuit signal with associated assertion and exception. First, inputs to assertion and exception logic are transported to the spare (possibly disjoint) logic region(s), where the assertion and exception are located. Next, we apply the pipeline-and-route method again to transport the assertion condition and exception value back to its original driver. A multiplexer chooses between SRC and the exception value depending on the assertion condition, reusing as much routing as possible. Note that the total latency in cycles from SRC to SRC' via the exception path will be the sum of the latency through exception and assertion circuits (which must be equally balanced), and may be required to be less than some constraint (e.g., the processor pipeline depth). Although the two circuits must be balanced, this need not be to a fixed value; furthermore, this latency can be arbitrarily distributed between input and output links and in the spare region, for further CAD flexibility.
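The effect of the multiplexer and the exception-path latency can be explored with a small cycle-level model in Python. This is a sketch under stated assumptions rather than a description of the actual hardware: it assumes that the check of the value seen at cycle t controls the multiplexer at cycle t + latency, it is open-loop (the replaced value does not feed back into the source), and the bounds, trap value, and latency are hypothetical.

```python
def run_with_exception(src_stream, low, high, trap, latency):
    """Model SRC' for a stream of SRC values, one value per clock cycle.

    The range check of src_stream[t] selects the multiplexer at cycle
    t + latency; on a violation, the trap value replaces SRC in that cycle.
    """
    out = []
    for t, value in enumerate(src_stream):
        checked = t - latency  # cycle whose check result arrives now
        violated = checked >= 0 and not (low <= src_stream[checked] <= high)
        out.append(trap if violated else value)
    return out

# Hypothetical program-counter trace with one out-of-range value (99).
pcs = [4, 8, 99, 12, 16, 20]
out = run_with_exception(pcs, low=0, high=64, trap=1000, latency=2)
```

In this model the faulty value 99 still reaches downstream logic, and the trap value appears two cycles later, which is why the exception-path latency may need to be bounded by a constraint such as the processor pipeline depth.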
D. Semantics and Correctness
Clearly adding exceptions is not completely transparent, since circuit behavior is changed if an assertion with an exception is violated. However, circuit timing can be preserved if: 1) the monitored signal SRC is not on a critical path and 2) the additional multiplexer does not make the path from SRC to its downstream readers critical. Even if timing is preserved, the circuit could still be incorrect: if the assertion is triggered, the monitored signal is delayed by several cycles, because it passes through the pipeline-and-route network and the assertion and exception logic. For some applications, this will not matter: the datapath of video or audio applications may tolerate a few cycles of delay. In the program counter example, the processor runs for a few cycles (but before any erroneous computation is flushed) before jumping to the trap routine: the exception value replaces the program counter.
E. Summary
Extending our approach to allow exceptions, replacing a circuit signal by an exceptional value if the assertion is triggered, is not transparent, but can be useful in some applications if care is taken not to alter the critical path or introduce a new critical path. In general, the resulting circuit may not be identical; however, for some useful applications this approach allows a circuit to be corrected without rerunning the time-consuming place-and-route process.
VI. EVALUATION METHODOLOGY: XILINX
Although we believe that our techniques can apply to all FPGA vendors, we evaluate our approach on Xilinx technology. In our evaluation, we first employ Xilinx ISE v13.3 to compile the original user circuit (step 1 from Fig. 1). For designs with timing constraints we apply those to ISE, but for designs without we operate ISE in "performance evaluation mode", which infers all clocks from the circuit and minimizes their periods. For step 2, we open the compiled design in Xilinx's FPGA editor to visualize its floorplan, and identify an underutilized region to host any new circuitry.
Next, step 3 translates the placed-and-routed netlist returned by ISE from its proprietary binary format, NCD, into the Xilinx Design Language (XDL) format using the command xdl -ncd2xdl. The XDL format is human-readable and contains a complete description of Xilinx netlists: from lookup table (LUT) masks and component placements to source and sink pins, and even which individual wires comprise every routed net. Toolkits, such as Torc [14], can manipulate this format. After decoding the circuit, we apply our pipeline-and-route tool (using Torc to manipulate XDL, and LEMON [15] for flow computations) to execute the procedure described in Section IV. Fig. 4 illustrates: given an XDL circuit netlist, a set of signals to be routed (possibly regular expressions matching nets in the XDL netlist), their clock domain, and the set of candidate registers [specified by an anchor point (X, Y) and radius r], the tool applies our techniques to transport all signals to a pipelining flop within this region. The output is an augmented circuit netlist in XDL format, and a template that can be used to build the new circuit in the next step: a Verilog file specifying the location of all pipelining registers, and a Xilinx User Constraints File (UCF) specifying which resources on the device are occupied (using the PROHIBIT constraint). Our pipeline-and-route tool is re-entrant: the output netlist can be used as the input netlist for the next routing run, allowing this procedure to be executed iteratively for each pipeline hop.
Step 4 takes the template produced in the previous step, adds new functionality into the source, and synthesizes and places (without routing) this circuit using ISE. The UCF constraints file forces: 1) mutual exclusivity between logic resources in user and the assertion circuits and 2) the Xilinx placer (with the AREA_GROUP constraint) to use only the host region identified in step 2. Note it is currently impractical (perhaps impossible in the Xilinx toolflow) to enforce mutual exclusivity on routing resources. For step 5, we translate the added circuit into XDL, then use a custom tool to merge with the circuit from step 3.
Finally, step 6 converts the merged XDL circuit into NCD format using command xdl -xdl2ncd (also invoking the Design Rule Check) and invokes the router in re-entrant mode to: 1) route the added circuit and 2) complete last-mile routing from the final pipelining stage to the new circuit's inputs. We set the RCT_SIGFILE environment variable to force use of only spare routing instead of ripping-up user circuit nets.
We target the Xilinx ML605 evaluation kit, containing a Virtex6 FPGA (xc6vlx240t) with 150 000 six-input LUTs within a grid of 162 × 240 slices. We employ four benchmarks, chosen for complexity and high clock rates: 1) LEON3, a System-on-Chip design; 2) and 3) two variants of an advanced encryption standard (AES) encoder/decoder; and 4) a floating-point datapath. For each, we insert assertions to verify correct operation.
Benchmark 1: LEON. The Aeroflex Gaisler LEON3 [16] is an open-source VHDL multicore SoC design capable of booting Linux, parameterized to customize the number, size and configuration of SPARC cores and on-chip peripherals. We configure the LEON3 with eight cores, each with 64 kB of I-cache and D-cache, and MMU, DDR3 memory controller, Ethernet and CompactFlash peripherals. The LEON3 ML605 template constrains the main SoC clock to 75 MHz (13.33 ns).
Benchmarks 2 and 3: AES. For a datapath-oriented benchmark, we build two variants of a 128-bit AES encoder/decoder; Fig. 5 shows a block diagram for the 3-pair variant. The circuit is derived from [17], modified to insert an extra pipelining stage in each AES round, improving performance but doubling encoding and decoding latency to 20 cycles. The advantage of this benchmark is that it is entirely self-stimulating (both plaintext and encoder key inputs are generated by linear-feedback shift registers), and self-checking, with each encoder paired with a decoder, allowing the decoded result to be verified against the original plaintext input (regenerated via an offset LFSR).
Benchmark 4: FloPoCo. Lastly, we use a floating-point datapath built using FloPoCo [18]. We use P parallel copies of a W-tap single-precision floating-point moving average filter. Each filter's input is stimulated using one 32-bit LFSR; for a 400 MHz target frequency, FloPoCo returns a circuit with pipeline latency of 45. To generate a medium utilization circuit, we choose P = 24, W = 8 and disable shift-register extraction in ISE (which would convert pipeline registers to shift-registers), creating a benchmark with higher flip-flop utilization.
VII. RESULTS
A. Experiment 1-Simple In-Circuit Assertion for LEON3
We insert an assertion to check that the program counter of each of the 8 cores lies in the memory space of the memory controller, ensuring that instructions come only from main memory.
Using our assertion language, the assertion is shown in Section III; we systematically translate this to the implementation.
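The behavior of this range check can be modelled in software with a brief Python sketch. It is not the assertion-language source; the function names are invented for illustration, and the address window is a hypothetical placeholder, since the LEON3 memory map is not given here.

```python
def check_range(low, high):
    """Build a range assertion for fixed bounds (the compile-time L and H)."""
    def assertion(c):
        # c plays the role of the runtime parameter C.
        return low <= c <= high
    return assertion

# Hypothetical main-memory window; the real bounds depend on the SoC memory map.
pc_in_memory = check_range(0x4000_0000, 0x4FFF_FFFF)

# One sample program counter per core (hypothetical values, three cores shown).
program_counters = [0x4000_0100, 0x4000_2000, 0x4FFF_FFFC]
all_ok = all(pc_in_memory(pc) for pc in program_counters)
```

Fixing the bounds when the checker is built mirrors the split between compile-time and runtime parameters, so one checker definition can be instantiated once per core.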
Unmodified, the LEON3 benchmark consumes 81% of logic slices, 54% of LUT resources, meeting a 13.33 ns (75 MHz) clock constraint (Table I, column 2). Examining the floorplan shows an underutilized region by the upper-left of the device; the anchor point is (0,185). We invoke pipeline-and-route twice (step 3 from Fig. 1), transporting signals toward the anchor via two stages, with radii 160 and 80, respectively. In total, 240 bits are routed: the 30-bit program counter (the 2 least significant bits are unused) for each of the 8 cores. The resulting circuit consumes modest additional resources (registers from existing and new slices, plus LUTs used as route-throughs).
Next, we synthesize the assertion circuit (step 4); it occupies 50 slices and 153 LUTs over the pipelined circuit. Due to the simple assertion, the prerouting critical-path timing estimate for its pipelined circuit is 3.73 ns (in fact, the estimated critical-path is between the final pipeline stage and the assertion circuit), comfortably meeting the 13.33 ns circuit constraint. After merging and routing the assertion circuit with the user circuit (steps 5 and 6) we find that no new critical paths have been introduced, and the circuit meets timing at 13.32 ns.
We compare the efficiency of our transparent logic insertion with the traditional approach of adding the assertion at source-level and resynthesizing the whole circuit. To ensure fairness, we manually modify the source code to extract the signals of interest out through the circuit hierarchy, attaching them to an identical instance of the assertion hardware design language (HDL). Table I shows the results under the "Resynthesis" heading. While the final result shows that, for this experiment, there is no impact on timing because both circuits meet the constraint, designers would still have to resynthesize their circuit for each set of assertions. Interestingly, there is a significant 10% difference in logic slice utilization between the original and instrumented circuits; apparently adding a small amount of extra logic causes the CAD tools to make very different packing decisions. Fig. 6 charts the runtime advantage of our approach. On this benchmark, inserting assertions transparently is 3.9× faster than resynthesizing. For pipeline-and-routing, runtime is dominated by final routing using vendor tools.
B. Experiment 2-Stateful Assertion for AES (3-Pair)
Our second experiment inserts stateful assertion logic into a circuit with both high maximum clock frequency and high device utilization: AES, with 3 encoder-decoders pairs (Fig. 5) .
Using our assertion language, the assertion is 1 user uint<128> deAES( uint<128>, uint<128> ) 2 { latency=N }; 3 assertion checkAES( uint<128> msg, uint<128> key1, 4 uint<128> key2, uint<128> key3,
deAES(delay<1 * N>(key3), enc))); 10 } where lines 1 and 2 declare the AES decoder (a user-defined block deAES with latency N) and lines 3-9 define the assertion as a chain of AES decoders; delayed keys balance decoding latency. This circuit uses 71% of the LUTs, 92% of logic slices, showing that our methods apply to large designs. The AES circuit has no timing constraints, so we operate ISE in performance evaluation mode to find the best timing; to mitigate CAD noise, we compile using five different placement seeds (placer cost tables), the best result returns a critical-path delay of 4.21 ns, or 237 MHz. Table II lists timing for all five seeds.
Examining the original circuit floorplan [Fig. 7(a)], we see that the top-right region of the device is underutilized, and invoke our tool five times to pipeline-and-route signals into this region. We chose the top-rightmost coordinate (161,239) as the anchor position, using a decreasing radius on each iteration: 200, 160, 120, 80, 40. The signals we pick are 128-bit buses taken from each of the 3 encoders (specifically, the key_out[127:0] register from the fifth of ten coding rounds), totalling 384 signals. Fig. 7(b) shows the pipelining flip-flops used; successive iterations alternate between yellow and green.
The output of a secure cryptographic function should be uniformly distributed; that is, the output should resemble a uniform random number generator. The monobit test [19] counts the number of "1"s in a data stream. Over a long sequence, the number of 0s and 1s should match, within some statistical bound. We attach three such assertions to the AES circuit, one per encoder, then AND their results, driving an off-chip LED. The monobit circuit counts the number of 1s per 128-bit vector, accumulated over 256 cycles (making a stream of 32 768 bits). A range check tests that the number of 1s lies in bounds: for a statistical significance p-value < 0.01, this is (32768/2) ± 466. In total, the three monobit circuits consume 155 logic slices and 567 LUTs, with a prerouting timing estimate of 2.44 ns.
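As a rough software model of the check described above (not the authors' hardware or their assertion language), the following Python sketch accumulates population counts over a window of 128-bit words and applies the range test. The function names and the `half_width` parameter are hypothetical; the window of 256 words and the half-width of 466 are taken from the text.

```python
def popcount128(word: int) -> int:
    """Count the 1 bits in the low 128 bits of `word`."""
    return bin(word & ((1 << 128) - 1)).count("1")


def monobit_ok(words, half_width: int = 466) -> bool:
    """Range-check the number of 1s accumulated over a window of 128-bit words.

    `half_width` defaults to the bound quoted in the text for a
    256-word (32768-bit) window; it is an input here, not derived.
    """
    words = list(words)
    total_bits = 128 * len(words)
    ones = sum(popcount128(w) for w in words)
    return abs(ones - total_bits // 2) <= half_width


# A window of alternating bits is perfectly balanced and passes;
# a window of all zeros is maximally biased and fails.
balanced = [int("AA" * 16, 16)] * 256
stuck_low = [0] * 256
print(monobit_ok(balanced), monobit_ok(stuck_low))
```

In hardware the accumulation is spread over 256 clock cycles; the sketch collapses that into a single sum, which is sufficient for checking the arithmetic of the range test.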
Using our assertion language, a monobit test looks like
where line 1 declares a user-defined block to count high bits; lines 2-7 form the assertion, declaring a counter (line 4), accumulating population counts while the counter is nonzero (line 5), and testing the range condition (line 6). Fig. 7(b) shows the final merged circuit floorplan: assertion circuit logic in white; pipeline signal routing for one signal in cyan. After routing the merged circuit, preserving all existing user nets, static timing analysis by Xilinx tools shows no effect on the critical-path; the circuit still meets timing at 237 MHz.
A comparison with resynthesizing the circuit (with assertions) shows a negligible effect (7 ps improvement) on critical-path delay between original and instrumented circuits, over five placement seeds. By chance, the best placement in both cases is found with seed value 5; examining other seeds (Table II) shows significant deviations between the two synthesis solutions: for the default seed value of 1, this timing impact exceeds 10%. The runtime improvement for the transparent approach on this circuit is 3.0 times; while the routing runtime has decreased because the circuit is less complex (only one clock domain), we must invoke our pipeline-and-route tool five times.
C. Experiment 3-Complex Assertion for AES (2-Pair)
This experiment uses 2 pairs of the AES encoder/decoder circuit, which occupy 69% of logic slices and 47% of LUTs and run at 241 MHz.
We route two 128-bit buses from each of the two encoder blocks in this benchmark (totalling 512 bits) into the top-right region of the device, applying a more complex pattern counter test to each. This test divides each 128-bit value into disjoint 4-bit segments, counting the occurrences of each 4-bit pattern. As with the monobit test, over a long stream of bits, each of the $2^4 = 16$ possible patterns should be equally probable. The four pattern counters occupy 1155 logic slices and 4262 LUTs.
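The pattern counter can be modeled in software along the following lines. This is a minimal Python sketch, not the authors' circuit: the function names are hypothetical, and the `tolerance` argument is an assumption, since the text does not state the acceptance bound used for this test.

```python
def nibble_histogram(words):
    """Count occurrences of each 4-bit pattern across 128-bit words.

    Each word is split into 32 disjoint 4-bit segments.
    """
    counts = [0] * 16
    for w in words:
        for i in range(32):
            counts[(w >> (4 * i)) & 0xF] += 1
    return counts


def patterns_balanced(words, tolerance: float) -> bool:
    """Check every pattern count lies within `tolerance` of the mean.

    `tolerance` is a hypothetical parameter; the source gives no bound.
    """
    counts = nibble_histogram(words)
    expected = sum(counts) / 16
    return all(abs(c - expected) <= tolerance for c in counts)


# A word containing every hex digit exactly twice is perfectly balanced.
uniform_word = int("0123456789ABCDEF" * 2, 16)
print(nibble_histogram([uniform_word]))
```

Compared with the monobit test, this check maintains 16 counters per bus rather than one, which is consistent with the larger slice and LUT usage reported above.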
Using our method does not affect the original critical-path delay (4.15 ns). Inserting the same assertion at source level and resynthesizing degrades the critical-path delay to 4.32 ns (232 MHz). The assertion code resembles the monobit test. 
D. Experiment 4-FloPoCo Assertion
The final experiment uses our FloPoCo design. With shift-register extraction disabled, the benchmark utilizes 65% of all logic slices, 41% of all LUTs, with a critical-path delay of 6.23 ns (160 MHz). The assertion checks for infinity or NaN conditions at each tap in this pipeline. Each condition is represented in FloPoCo's internal format by one bit going high; for all taps this totals 144 bits.
Rather than just signaling whether any assertion fails, we build a priority encoder to transform the 144-bit input into an 8-bit encoded output, to assist a designer in locating the failure.
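To illustrate the idea, a priority encoder over 144 flag bits can be sketched in Python as below. The encoding details are assumptions, not taken from the source: this sketch gives priority to the lowest-indexed set bit and uses a hypothetical sentinel value `NO_FAILURE` (0xFF) when no flag is raised. Since 144 < 256, every index fits in 8 bits.

```python
NO_FAILURE = 0xFF  # hypothetical sentinel: no flag bit is set


def priority_encode(flags: int, width: int = 144) -> int:
    """Return the index of the lowest set bit among `width` flag bits.

    Bits at or above `width` are ignored. Returns NO_FAILURE if no
    flag is set. Lowest-index priority is an assumption of this sketch.
    """
    flags &= (1 << width) - 1
    if flags == 0:
        return NO_FAILURE
    # `flags & -flags` isolates the lowest set bit.
    return (flags & -flags).bit_length() - 1


# Flags raised at taps 2 and 3: the encoder reports tap 2.
print(priority_encode(0b1100))
```

A designer reading the 8-bit output can map the reported index back to a specific tap and condition, rather than only learning that some assertion failed.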
The FloPoCo assertion can be defined as follows: where line 1 declares the priority encoder as a user-defined block, and lines 2-10 define the assertion, whose inputs are a 24×8 array of 32-bit floating-point numbers and which concatenates the output of priority encoders whose inputs are bits 32 and 33 of each array element (the NaN and infinity bits of each tap in each parallel filter). Future versions of our assertion language will add loops to ease generation of repetitive assertions.
This assertion circuit is also successfully added into the user circuit without impacting the critical-path delay, while resynthesis with the same placement seed degraded the maximum frequency from 160 MHz to less than 100 MHz. Over five seeds, the best resynthesis result was 162 MHz, as shown in Table II.
E. Power Evaluation
Lastly, we investigate the power usage of circuits with and without assertions. We employ the ML605's support for on-chip power measurement (via the Virtex-6's System Monitor). Table III shows power consumption for three cases: the original user circuit without assertion checking; assertions added at source level, where the entire circuit is resynthesized; and the transparent approach (this paper). All power measurements used ChipScope Analyzer, averaged over 128 seconds, once the die temperature had stabilized.
For experiment 1, we boot a Linux image supporting up to 4 cores on the SoC, stressing each core using a gzip instance sourced from /dev/urandom. For experiments 2 and 3 based on variants of the self-stimulating AES benchmark, we collect results at two different clock rates. Unfortunately, the high device temperature/current caused by running "AES x3" at 200 MHz triggers the power regulator's shutdown mechanism, so we only show results at 150 MHz.
The results show that, unsurprisingly, adding extra assertion circuitry increases power consumption-on average by 2% for resynthesis, and 4% for our techniques. Although resynthesis may be more efficient (smaller area due to denser packing decisions) than the original user circuit, our approach consumes more power due to transporting all assertion inputs, via pipelining registers, into one region to feed the assertion circuit. This incurs multiple hops of extra switching activity not existing in the resynthesis approach, which can distribute the assertion logic close to the signal source without pipelining. For a circuit resynthesized with assertion logic, however, unless additional gating techniques are used this 2% power overhead is permanent, while for our approach it is only temporary-if the assertion logic is no longer needed, the 4% overhead can be recovered by reverting to the original bitstream.
VIII. CONCLUSION
We propose a language for describing in-circuit hardware assertions in an HDL-agnostic manner, and describe methods to insert latency-oblivious assertion circuitry into a synthesized circuit transparently. Our flow inserts new circuitry after the user circuit has been placed-and-routed, using only spare resources; assertions can be added, changed, or removed without affecting the original circuit. To maintain critical-path delay, we aggressively pipeline both the newly inserted circuit and the routing for its inputs. To pipeline signals, we use min-cost flow techniques to efficiently transport signals via pipelining registers, placing and routing them simultaneously.
The key benefits for transparent insertion are: 1) only spare resources are needed, even on large, complex designs; 2) the critical-path delay is unaffected; and 3) it is 2-3.9-fold faster than resynthesis. Our approach incurs a small, temporary, power overhead: extra switching from pipelining new circuit inputs.
We further extend our technique to allow in-circuit exceptions; by relaxing strict transparency, some circuit errors can be fixed without rerunning expensive place-and-route.
Currently, our transparent insertion flow is encumbered by overly broad constraints, owing to our use of the Xilinx toolflow for an unsupported application. When building the inserted circuit (step 4 of our flow), we can only mark logic resources as occupied at slice granularity: even if only one of four slice LUTs is occupied, we cannot use the rest of the slice. Furthermore, we cannot mark occupied routing resources in the same manner.
Furthermore, we must use constraints to force inserted circuits to be placed near the pipelined signals, to minimize routing congestion between user and inserted circuits, given that the current flow compiles the inserted circuit without knowledge of leftover routing. These limitations may be lifted by building toolflows to create and insert transparent circuits, e.g., modifying the VTR-to-Bitstream project [20] .
In the long term, we would like to consider enhancements to FPGA architectures and CAD toolflows to further improve the effectiveness of inserted assertions and exceptions.
