The quality (computation time, chip area) of the generated FSMDs has been evaluated on modern FPGAs. Our approach overcomes the C code limitations of four HLS tools while maintaining a good speed/area balance.
Some tools restrict their input to perfectly nested constant-bound loops. GAUT is incapable of handling non-static loops. SPARK handles only loops with fixed constant iteration counts and assumes that all data are transferred to the chip before the computation starts, rendering some designs infeasible. C-to-Verilog is an LLVM-based Verilog backend; however, it presents limitations in accessing arrays within functions. TransC supports streaming constructs for data exchange and process synchronization through non-standard C constructs.
Existing approaches have certain drawbacks: a) most frontends do not emit self-contained FSMD specifications; b) they mandate the use of code templates to detect memory accesses and intermodule communication; and c) they succumb to vendor and technology dependence.
In this paper, NAC [2] serves as both a compiler IR and a natural FSMD specification. To support this, a frontend that translates GIMPLE dumps to NAC is used throughout the text. RTL facets relevant to memory accesses, hierarchical modules and hardware-oriented optimizations such as operation chaining are automatically generated. FSMD synthesis does not rely on code templates, since it uses a graph-based backend. Further, the generated HDL code is human-readable and completely vendor- and technology-independent. We have implemented our approach as part of the HercuLeS high-level synthesis tool.
II. NAC (N-ADDRESS CODE)
NAC supports n-input/m-output mappings, user-defined data types (integer, fixed-/floating-point arithmetic), SSA form, and scalar, single-dimensional array and streamed I/O procedure arguments. NAC statements are n-address operations or procedure calls. An (n, m)-operation specifies a mapping from a set of n ordered inputs to a set of m ordered outputs:

o1, ..., om <= operation i1, ..., in;

where o1, ..., om are the m outputs and i1, ..., in the n inputs of the operation. NAC uses the notions of "globalvar" (a global scalar or array variable), "localvar" (a local variable), "in" (an input argument to the given procedure), and "out" (an output argument).
For instance, an addition of two scalar operands is written as: a <= add b, c;. Control-transfer operations include explicit conditional and unconditional jumps. An unconditional jump is written as: BB5 <= jmpun;, while conditional jumps always declare both targets: BB1, BB2 <= jmpeq i, 10;. Multi-way branches corresponding to compound decoding clauses can easily be added.
The memory access model defines dedicated address spaces per array. For an indexed load in C (b = a[i];), a frontend would generate the following NAC: b <= load a, i;, while for an indexed store (a[i] = b;) it is: a <= store b, i;, both using the array identifier as an explicit operand.
Procedures are non-atomic operations; for instance, in (y) <= sqrt(x); the square root of operand x is computed and assigned to output y.
III. EXTENDED FSMDS
The FSMD [7] introduces embedded actions within the next-state generation logic of an FSM. Our extended synchronous FSMD model supports: array input and output ports, streaming I/O, communication with embedded block and distributed LUT memories, the design of a latency-insensitive local interface between caller and callee FSMDs, and the design of memory interconnects for the FSMD units.
A. Conventions
Each FSMD organizes its computation into n + 2 states, where n is the number of required computational states. The two overhead states, S_ENTRY and S_EXIT, correspond to the source and sink nodes of the CDFG of the given procedure, respectively. One possible optimization is merging the sink state with its immediate predecessors. Input registering is supported, although this intent has to be made explicit in NAC. The control interface is simple:
• clk (I): signal from external clock
• reset (I): synchronous or asynchronous reset
• start (I): activates the FSMD so that the first computational state is reached in the next cycle
• ready (O): the block is ready to accept new input
• valid (O): asserted when the corresponding data output port is streamed out from the block
• done (O): end of computation for the block

ready signifies only the ability to accept new (non-streamed) input and does not address the status of an output.
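As an illustration, a minimal VHDL entity carrying this control interface is sketched below; the data ports (x, y) and their widths are hypothetical placeholders, not actual HercuLeS output.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical FSMD block exposing the control interface described above.
entity fsmd_block is
  port (
    clk   : in  std_logic;                      -- external clock
    reset : in  std_logic;                      -- synchronous or asynchronous reset
    start : in  std_logic;                      -- enter the first computational state
    x     : in  std_logic_vector(15 downto 0);  -- example (assumed) data input
    y     : out std_logic_vector(15 downto 0);  -- example (assumed) data output
    ready : out std_logic;                      -- able to accept new input
    valid : out std_logic;                      -- y is being streamed out
    done  : out std_logic                       -- end of computation
  );
end fsmd_block;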
B. Communication with embedded memories
We assume a RAM model with a write enable, and separate data input (din) and output (dout) ports sharing a common address port (rwaddr). A store operation raises the write enable (mem_we) in a given single-cycle state, so that the data are stored in memory and made available in the subsequent state/machine cycle.
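A minimal sketch of this memory model follows, assuming a 256x16 single-port RAM with an asynchronous (distributed-LUT) read; the entity name ram_sp and the port widths are assumptions of this sketch.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram_sp is
  port (
    clk    : in  std_logic;
    mem_we : in  std_logic;                     -- write enable
    rwaddr : in  std_logic_vector(7 downto 0);  -- shared read/write address
    din    : in  std_logic_vector(15 downto 0); -- data input
    dout   : out std_logic_vector(15 downto 0)  -- data output
  );
end ram_sp;

architecture rtl of ram_sp is
  type ram_t is array (0 to 255) of std_logic_vector(15 downto 0);
  signal ram : ram_t;
begin
  -- A store raises mem_we in a single-cycle state; the word is written
  -- on the clock edge and is readable in the following machine cycle.
  process (clk)
  begin
    if rising_edge(clk) then
      if mem_we = '1' then
        ram(to_integer(unsigned(rwaddr))) <= din;
      end if;
    end if;
  end process;

  -- Asynchronous (distributed-LUT) read; a registered dout would model
  -- the synchronous block-RAM case discussed next.
  dout <= ram(to_integer(unsigned(rwaddr)));
end rtl;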
Synchronous load requires the introduction of a waitstate register. This register assists in devising a dual-cycle sub-state for performing the load; Fig. 1 illustrates its implementation. During the first cycle of STATE_1 the memory block is addressed. In the second cycle, the requested data are made available through mem_dout and assigned to register mysignal. These data can be read from mysignal_reg during STATE_2.
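The following is a hedged sketch of the waitstate mechanism; the state encoding, the index register i_reg and the port widths are assumptions here, and the actual code generated per Fig. 1 may differ.

library ieee;
use ieee.std_logic_1164.all;

entity load_waitstate is
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    i_reg    : in  std_logic_vector(7 downto 0);   -- load index (assumed)
    mem_dout : in  std_logic_vector(15 downto 0);  -- RAM data output
    rwaddr   : out std_logic_vector(7 downto 0);   -- RAM address port
    mysignal : out std_logic_vector(15 downto 0)   -- loaded value
  );
end load_waitstate;

architecture rtl of load_waitstate is
  type state_t is (STATE_1, STATE_2);
  signal state        : state_t   := STATE_1;
  signal waitstate    : std_logic := '0';
  signal mysignal_reg : std_logic_vector(15 downto 0);
begin
  rwaddr   <= i_reg;        -- the memory block is addressed in STATE_1
  mysignal <= mysignal_reg;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state     <= STATE_1;
        waitstate <= '0';
      else
        case state is
          when STATE_1 =>
            if waitstate = '0' then
              waitstate <= '1';          -- 1st cycle: address applied
            else
              mysignal_reg <= mem_dout;  -- 2nd cycle: capture loaded data
              waitstate    <= '0';
              state        <= STATE_2;
            end if;
          when STATE_2 =>
            null;                        -- mysignal_reg is read here
        end case;
      end if;
    end if;
  end process;
end rtl;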
C. Hierarchical FSMDs
Hierarchical FSMDs define entire systems with caller and callee CDFGs. A two-state protocol describes proper communication, using a "preparation" state and an "evaluation" superstate where the entire computation applied by the callee FSMD is effectively hidden.
The caller FSMD performs computations where new values are assigned to _next signals and registered values are read from _reg signals. To avoid multiple drivers on a signal, callee procedure instances produce _eval data outputs that are connected to register inputs by hardwiring them to _next signals. Fig. 2 illustrates a call ((m) <= isqrt(x);) to an integer square root routine. STATE_1 sets up the isqrt_0 callee instance, which reads x_reg and produces m_eval. In SUPERSTATE_2, control is transferred to the component instance of the callee. When the callee instance terminates, isqrt_ready is raised. Since isqrt_start is kept low, the generated output data can be transferred to the m register via its m_next input. Control is then handed over to STATE_3.
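A sketch of the caller-side protocol is given below, assuming the signal names of Fig. 2 (isqrt_start, isqrt_ready, m_eval, m_next); the callee instance itself is elided, so its handshake and data signals appear as ports of this illustrative entity.

library ieee;
use ieee.std_logic_1164.all;

entity caller_fsm is
  port (
    clk, reset  : in  std_logic;
    isqrt_ready : in  std_logic;                      -- raised when the callee terminates
    m_eval      : in  std_logic_vector(15 downto 0);  -- callee data output
    isqrt_start : out std_logic;                      -- activates the callee instance
    m           : out std_logic_vector(15 downto 0)   -- registered call result
  );
end caller_fsm;

architecture rtl of caller_fsm is
  type state_t is (STATE_1, SUPERSTATE_2, STATE_3);
  signal state  : state_t := STATE_1;
  signal m_reg  : std_logic_vector(15 downto 0);
  signal m_next : std_logic_vector(15 downto 0);
begin
  -- The callee's _eval output is hardwired to the register's _next input,
  -- so the callee never drives the register directly (single-driver rule).
  m_next <= m_eval;
  m      <= m_reg;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state       <= STATE_1;
        isqrt_start <= '0';
      else
        case state is
          when STATE_1 =>            -- "preparation": set up the callee
            isqrt_start <= '1';
            state       <= SUPERSTATE_2;
          when SUPERSTATE_2 =>       -- "evaluation": callee computes
            isqrt_start <= '0';
            if isqrt_ready = '1' then
              m_reg <= m_next;       -- capture result while start is low
              state <= STATE_3;
            end if;
          when STATE_3 =>
            null;                    -- m_reg is read from here on
        end case;
      end if;
    end if;
  end process;
end rtl;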
D. Streaming ports
Streaming suits applications with an absence of control flow. In a prime factorization algorithm (pfactor), a streaming output, outp, can be used to produce successive factors. The streaming port is accessed based on valid. Thus, outp is accessed periodically in the context of basic block BB4, as shown in Fig. 3.

procedure pfactor (in u16 x, out u16 outp) {
  localvar u16 i, n, t0;
BB1:
  n <= mov x;
  i <= ldc 2;
  BB2 <= jmpun;
BB2:
  BB3, BB_EXIT <= jmple i, n;
BB3:
  t0 <= rem n, i;
  BB4, BB5 <= jmpeq t0, 0;
BB4:
  n <= div n, i;
  outp <= mov i;
  BB3 <= jmpun;
BB5:
  i <= add i, 1;
  BB2 <= jmpun;
BB_EXIT:
  nop;
}

Figure 3. NAC code for a prime factorization algorithm.
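A hedged sketch of how the streamed output of pfactor could be driven follows; the signal in_bb4 stands in for the FSM's decode of basic block BB4's state and, like the entity name and widths, is an assumption of this sketch.

library ieee;
use ieee.std_logic_1164.all;

entity stream_out is
  port (
    clk, reset : in  std_logic;
    in_bb4     : in  std_logic;                      -- '1' while in BB4's state
    i_reg      : in  std_logic_vector(15 downto 0);  -- current factor
    outp       : out std_logic_vector(15 downto 0);  -- streamed output port
    valid      : out std_logic                       -- outp holds a new factor
  );
end stream_out;

architecture rtl of stream_out is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        valid <= '0';
      elsif in_bb4 = '1' then
        outp  <= i_reg;   -- stream out the next factor (outp <= mov i;)
        valid <= '1';     -- the consumer samples outp while valid is high
      else
        valid <= '0';
      end if;
    end if;
  end process;
end rtl;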
E. Operation chaining
Operation chaining assigns dependent SSA operations to a single control step. A simple means for selective operation chaining is the merging of successive ASAP states. In successive states, intermediate registers are eliminated by wiring assignments to _next signals and reusing them in the subsequent chained computation, instead of reading from the stored _reg value. To avoid excessive critical paths, a heuristic disallows flow-dependent multiple occurrences of expensive operators in the same newly defined state. In Fig. 4, states S_1_3 to S_1_5 comprise intermediate computations in a merged S_1_1 state.
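A minimal sketch of the chaining idea on a generic add-multiply pair (entity and signal names are illustrative): the intermediate result stays a wire (_next) and feeds the dependent operation in the same merged state, instead of being registered and read back as a _reg value one cycle later.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity chain_demo is
  port (
    clk     : in  std_logic;
    en      : in  std_logic;             -- '1' while in the merged state
    b, c, e : in  unsigned(15 downto 0);
    d       : out unsigned(31 downto 0)  -- registered chained result
  );
end chain_demo;

architecture rtl of chain_demo is
  signal t_next : unsigned(15 downto 0);
begin
  -- First operation: a wire (_next), not a register.
  t_next <= b + c;

  process (clk)
  begin
    if rising_edge(clk) then
      if en = '1' then
        -- Second, flow-dependent operation chained in the same cycle;
        -- no intermediate t_reg is ever stored.
        d <= t_next * e;
      end if;
    end if;
  end process;
end rtl;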
F. High-level optimizations
A set of grammatical transformations has been developed using TXL. As a proof of concept, matrix flattening and argument globalization are examined.
Matrix flattening reduces the dimensions of an array from N to one; for instance, with row-major layout, the C access a[i][j] to an array declared as a[ROWS][COLS] becomes a[i*COLS + j] over a one-dimensional array. This optimization creates multiple benefits: addressing, interface and communication simplifications, and direct mapping to physical memory. Argument globalization replaces multiple copies of a given array with a single-access globalvar array. It prevents exhausting interconnect resources in single-threaded applications. Through a bus-based hardware interface, globalvar arrays can be accessed by any procedure.
IV. PERFORMANCE OF FSMDS
HercuLeS is used for C-to-VHDL synthesis with the help of a prototype translator from GCC GIMPLE dumps to NAC. It extracts Graphviz CDFGs from NAC, which are then synthesized to vendor-independent self-contained RTL hardware descriptions.
Computation- (C) and memory-intensive (M) benchmarks have been selected from the public domain. In Table I, for each benchmark (Bench.), column 2 shows its type (C, M or both) and column 3 gives a short description. Columns 4-7 give the source lines of the user C code, the NAC, the Graphviz CDFGs and the generated VHDL, respectively.

Figure 5. Number of cycles, MOF and TCT (geometric means) for the generated FSMDs.
A. Speed measurements
To assess the performance of the generated hardware, the minimum propagation delay (MPD), maximum operating frequency (MOF) and total computation time (TCT = cycles × MPD) are evaluated. Four different scheduling scenarios have been examined: S1) sequential scheduling; S2) ASAP; S3) S2 with chaining; S4) S3 with synchronous-read (block RAM) memories. Such linear-complexity schedulers are critical for providing fast compiles on sizable benchmarks.
A graphical view of this information, showing relative TCT metrics, is given in Fig. 5. All designs were synthesized for the XC6VLX75T Xilinx Virtex-6 device using Xilinx ISE WebPACK 12.3i.
Average TCT is reduced by 47% when comparing S1 to S3. BRAMs impose a fixed readout latency, which limits this gain to about 36.4% under S4. Computer arithmetic problems (float2half, half2float) achieve about a 4× reduction in TCT. Memory-intensive benchmarks (matmult, smwat, walsh) present fewer opportunities, since memory-access activity interferes with the chaining of arithmetic operations within the same clock cycle. All benchmarks achieve MOFs in the range of 120-450 MHz.
B. Chip area measurements
An aspect of the FPGA area measurements is shown in Fig. 6. S1 allows for the generation of the smallest hardware in terms of slice LUTs and registers. In S3, LUT and register demand increases by 90.1% and 59.1%, respectively, over S1, a price paid for much higher speed. However, it is more meaningful to compare S2, S3 and S4, which all apply ASAP scheduling on the NAC SSA form. Comparing geometric means, S4 reduces registers by 17.5% relative to S2. The tradeoff between S3 and S4 is clear: S3 provides better speed in exchange for worse LUT and register utilization, where S4 excels.
C. Comparison against accessible HLS tools
Quantitative comparisons against other HLS tools were carried out on a kernel suite (HLSbench), as shown in Fig. 7.
Applications in the HLSbench C suite include array sum, bit reversal, Easter date calculation, distance approximation, iterative Fibonacci series, greatest common divisor, integer square root, a synthetic loop, perfect number detection and population count. All tools except HercuLeS required benchmark source code adaptations; GAUT and SPARK required more effort than C-to-Verilog and TransC. Tweaks were applied to absorb third-party tool issues; for instance, C-to-Verilog produced simulation-only code. No tool other than HercuLeS supports hardware division, due to the lack of both the respective component libraries and support for state multicycling.
HercuLeS is the only tool that supports the entire HLSbench application suite; it maintains performance comparable to SPARK, but lower than GAUT. GAUT can effectively pipeline the generated designs for the applications it supports (5 out of 12). SPARK has the lowest ANSI C compatibility (4/12). C-to-Verilog and TransC appear to have closely matched results, supporting 7/12 kernels.
HercuLeS is second only to SPARK in chip area; GAUT introduces impractical LUT and register demands. Since only a limited set of optimizations is currently considered, we expect improved performance in future versions through extra optimizations.

V. CONCLUSION

An extended FSMD model supporting memory communication, hierarchy, streaming I/O and hardware optimizations has been presented. Further, an IR that enables the automated synthesis of FSMD-based accelerators using HercuLeS has been discussed. The proposed techniques have been evaluated over four different optimization scenarios on a number of benchmarks using HercuLeS, as well as against accessible HLS tools, with promising results.
