Cones [23]             Early, combinational only
HardwareC [12]         Behavioral synthesis-centric
Transmogrifier C [8]   Limited scope
SystemC [9]            Verilog in C++
Ocapi [19]             Algorithmic structural descriptions
C2Verilog [21]         Comprehensive; company defunct
Cyber [24]             Restricted C with extensions (NEC)
Handel-C [2]           C with CSP (Celoxica)
SpecC [7]              Resolutely refinement-based
Bach C [10]            Untimed semantics (Sharp)
CASH [1]               Synthesizes asynchronous circuits

SpecC [7] adds constructs for finite-state machines, concurrency, pipelining, and structure through thirty-three keywords [5]. Systems written in the complete language must be refined into the synthesizable subset.
Sharp's Bach C [10] adds explicit concurrency and rendezvous communication. The compiler does the scheduling; the number of cycles taken by each construct is not set by a rule. It supports arrays but not pointers.
Budiu et al.'s CASH [1] is unique because it generates asynchronous hardware. It identifies instruction-level parallelism in ANSI C and generates asynchronous dataflow circuits.
Concurrency is the biggest difference between hardware, for which it is fundamental, and software. Efficient software algorithms are rarely the best choice in hardware. More disturbing is that C and C++ are optimized for expressing sequential algorithms and contain no language-level support for concurrency, in part because there is no agreed-upon model for parallel programming [20]. The absence of concurrency support means it must be added or inferred by the compiler.
About half the languages require the programmer to express concurrency with parallel constructs. HardwareC, SystemC, and Ocapi all use process-level constructs; Handel-C, Bach C, and SpecC can also group concurrent statements. SystemC's parallelism resembles Verilog or VHDL's: a system is a collection of clock-edge-triggered processes. Handel-C, SpecC, and Bach C's approaches are more software-like: their constructs dispatch groups of instructions in parallel.
Concurrency introduces a fundamental change to the language, demanding substantially different programmer thinking. Even a programmer experienced with the usual thread-and-shared-memory model of concurrent programming will find that the parallel constructs in hardware languages differ substantially.
Other languages present a sequential model to the programmer and rely on the compiler to identify parallelism. While compilers for languages with parallel constructs also identify parallelism, Cones, Transmogrifier C, C2Verilog, and CASH rely on the compiler completely. Cones flattens each function, including loops and conditionals, into a single two-level network. CASH, by contrast, takes a VLIW-compiler-like approach, analyzing inter-instruction dependencies and scheduling instructions to maximize parallelism.
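The flattening idea can be illustrated in plain C. The sketch below is not Cones input or output; the function names and the 8-bit width are assumptions chosen for illustration. It shows a bounded loop with a conditional next to an equivalent single expression with no control flow, which is the shape that maps directly onto combinational logic. Such a rewrite is only possible when the loop bound is known at compile time, as it is here.

```c
#include <stdint.h>

/* Loop form: a bounded loop containing a conditional,
 * as a programmer would naturally write it. */
unsigned count_ones_loop(uint8_t x)
{
    unsigned n = 0;
    for (int i = 0; i < 8; i++) {
        if (x & (1u << i))
            n++;
    }
    return n;
}

/* Flattened form: the loop is fully unrolled and the conditional
 * is replaced by arithmetic on the extracted bits. With no loops
 * or branches left, every output depends only on the current
 * inputs, as in a combinational network. */
unsigned count_ones_flat(uint8_t x)
{
    return ((x >> 0) & 1u) + ((x >> 1) & 1u)
         + ((x >> 2) & 1u) + ((x >> 3) & 1u)
         + ((x >> 4) & 1u) + ((x >> 5) & 1u)
         + ((x >> 6) & 1u) + ((x >> 7) & 1u);
}
```

Both functions return the same value for every 8-bit input; only the second has a structure that corresponds one-to-one to gates.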
Two common approaches to identifying parallelism differ in their granularity. Instruction-level parallelism (ILP) groups nearby instructions that can run simultaneously. Although ILP is now the preferred approach in the computer architecture community, fundamental limits appear to make parallelism beyond about five simultaneous instructions unlikely [25, 26]. Pipelining, the second approach, requires less hardware than ILP but can be less effective. Again, dependencies and control-flow transfers limit parallelism. Pipelining works well on regular loops, e.g., in scientific computation [11], but is less effective in general.
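How dependencies bound ILP can be seen in a small plain C sketch. The function names are hypothetical, and unsigned arithmetic is assumed so that reassociating the additions cannot change the result. Both functions compute the same sum, but the first forms a chain of three dependent additions, while the second exposes two independent additions that a scheduler could issue simultaneously.

```c
/* Chain: each addition needs the result of the previous one,
 * so the three additions must execute one after another. */
unsigned sum4_chain(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned t1 = a + b;
    unsigned t2 = t1 + c;   /* depends on t1 */
    unsigned t3 = t2 + d;   /* depends on t2 */
    return t3;
}

/* Tree: t1 and t2 do not depend on each other, so they can run
 * in the same step; only the final addition must wait. The
 * dependence depth drops from three additions to two. */
unsigned sum4_tree(unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned t1 = a + b;
    unsigned t2 = c + d;    /* independent of t1 */
    return t1 + t2;         /* depends on both */
}
```

A compiler that relies on ILP must discover such independence itself, and it can only rebalance an expression like this when doing so provably preserves the result.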
For hardware, relying on the compiler to expose parallelism is awkward because using it effectively requires understanding details of the compiler's operation. Efficient implementations demand careful coding, and appropriate idioms would be awkward for programmers accustomed to writing efficient C.
Time is absent from the C programming model. The model guarantees causality but says nothing about execution time. Although this keeps the model simple for both programmers and compilers, it can make achieving timing constraints difficult. The transparency of C software compilation makes gross improvements easy, but improving an already-optimized fragment is difficult.
Meeting a performance target under power and cost constraints is usually mandatory in hardware, since it is always easier to implement a function in software. Thus, any hardware synthesis technique needs a way to meet timing constraints.
The C-like languages in this paper generate synchronous hardware (except Cones, which generates combinational logic, and CASH, which generates asynchronous circuits), so there must be a mechanism for dividing time into clock cycles. Solutions range from mandatory cycle annotations to implicit rules.
A designer using Ocapi specifies state machines and each state gets a cycle. State machines in the SpecC refinement flow may start with implicit clock boundaries, but they are made concrete eventually. SystemC's combinational processes become combinational logic, but its sequential processes denote cycle boundaries with wait calls.
As is typical in high-level synthesis, HardwareC supports timing constraints such as "these three statements must execute in two cycles." While such constraints can be subtle for the designer and challenging for the compiler, they allow easier design-space exploration. Bach C is similar.
The C2Verilog compiler inserts cycles using complex rules and provides mechanisms for imposing timing constraints. Unlike HardwareC, these constraints are outside the language.
Transmogrifier C and Handel-C use implicit rules for inserting clocks. In Handel-C, only assignment and delay statements take a clock cycle. In Transmogrifier C, only loop iterations and function calls take a cycle. While simple to understand, such rules can require recoding to meet timing: in Handel-C, assignment statements may need to be fused, and in Transmogrifier C, loops may need to be unrolled.
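The recoding these rules can force is sketched below in plain C. This is neither Handel-C nor Transmogrifier C syntax: the clock_model counter and all function names are hypothetical, and the counter is incremented by hand purely to model the two rules stated above (one cycle per assignment, one cycle per loop iteration).

```c
/* Hypothetical cycle counter used only to model the implicit rules. */
typedef struct { unsigned cycles; } clock_model;

/* Rule "each assignment takes one cycle": two assignments, two cycles. */
unsigned scale_sum_two_step(unsigned b, unsigned c, clock_model *clk)
{
    unsigned a, d;
    a = b + c;   clk->cycles++;   /* first assignment */
    d = a * 2u;  clk->cycles++;   /* second assignment */
    return d;
}

/* Fused form: the same computation in one assignment, one cycle. */
unsigned scale_sum_fused(unsigned b, unsigned c, clock_model *clk)
{
    unsigned d;
    d = (b + c) * 2u;  clk->cycles++;   /* single assignment */
    return d;
}

/* Rule "each loop iteration takes one cycle": four iterations. */
unsigned sum4_loop(const unsigned v[4], clock_model *clk)
{
    unsigned s = 0;
    for (int i = 0; i < 4; i++) {
        s += v[i];
        clk->cycles++;            /* one cycle per iteration */
    }
    return s;
}

/* Unrolled by two: the same sum in two iterations. */
unsigned sum4_unrolled2(const unsigned v[4], clock_model *clk)
{
    unsigned s = 0;
    for (int i = 0; i < 4; i += 2) {
        s += v[i] + v[i + 1];
        clk->cycles++;            /* one cycle per iteration */
    }
    return s;
}
```

Under this model the fused and unrolled versions halve the cycle count while computing the same values. In real hardware the saving is not free, since packing more work into one cycle generally lengthens the combinational path within that cycle.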
