A modern system-on-chip (SoC) includes many heterogeneous IP components. Generally, a few embedded processors are integrated into SoCs. An asynchronous circuit design technique is employed to achieve low power/energy consumption. In this paper, we design an asynchronous embedded processor on FPGAs and analyze its possible benefits on commercial FPGAs. We use commercially available 65 nm high-performance Virtex-5 and 45 nm low-power Spartan-6 Xilinx FPGAs to show the impact on power consumption for the two different extreme cases. For the high-performance Virtex-5, our asynchronous processor shows 36.8% lower power consumption when compared with its synchronous counterpart. On the other hand, the asynchronous processor consumes 25.6% more power in a low-power Spartan-6 FPGA. However, through simple analysis and power simulation, we show that the event-driven nature of asynchronous circuits can further save power/energy even in the Spartan-6 FPGA.
Introduction
Modern system-on-chip (SoC) designs are composed of tens or hundreds of IPs, and microprocessors or microcontrollers play important roles in the functionality of these SoC designs. In particular, low power and low energy consumption are often demanding design requirements for the embedded processor IPs in mobile SoC platforms.
Asynchronous circuit design techniques are considered a promising alternative for such highly demanding design requirements in current CMOS design technology [1]. An asynchronous circuit design methodology provides a potential solution to the notorious clock distribution and system synchronization problems [1], [2]. By utilizing asynchronous circuit techniques, the power consumed by a clock network can be eliminated, but the handshaking overhead can induce performance loss and extra power consumption.
In general, full custom or ASIC design techniques have been widely used for the implementations of asynchronous circuits. In modern circuit and chip design technology, an FPGA device is used not only as a prototyping platform but also as a versatile SoC platform.
FPGA devices are attracting much attention from industry as devices supporting reconfigurable computing, and such reconfigurable devices are expected to be used more frequently in the SoC market in the near future.
Despite the fact that asynchronous circuits and FPGA design techniques both have a long history of investigation, there has been little work on the implementation of asynchronous circuits on FPGA devices [3]-[5]. It is important for both synchronous and asynchronous designers to exploit the traditional benefits of asynchronous circuits, such as low power consumption, low electromagnetic interference, average-case performance, and delay insensitivity, by implementing asynchronous circuits on those promising devices. However, chip implementations of asynchronous circuits on commercial FPGA devices are rare due to the difficulty of controlling the timing of signal propagation delays.
Some of the work on both asynchronous circuits and FPGAs has focused on new FPGA architectures that best support asynchronous circuits [6]-[8]. In addition to these newly proposed FPGA architectures for asynchronous circuits, it is necessary to exploit asynchronous circuits on currently available commercial FPGA devices.
In this work, we try to describe the best use of FPGAs with asynchronous circuits by exploiting their clock net characteristics and reconfiguration capability. Furthermore, the event-driven nature of asynchronous circuits can be best exploited from the viewpoint of power/energy consumption in an event-driven, sensor-based ubiquitous embedded SoC because they consume energy only when they work.
To show power and energy efficiencies, we design an asynchronous processor on high-performance Virtex-5 (based on 65 nm process technology) and low-power Spartan-6 (based on 45 nm process technology) FPGA devices and analyze them in terms of power consumption. A bundled delay model has been used for efficient implementation, and a new double C-element handshake circuit (called C 2 -HC) has been proposed for mapping high-speed handshaking circuits onto an FPGA. Some layout design issues have also been considered in order to meet the timing constraints that the circuits have to satisfy for proper operation. In a final test, the chips are verified to work correctly even after eliminating the clock source from our FPGA development board. As far as the authors know, this is the first work reporting the power consumption features of asynchronous circuits mapped onto a commercial FPGA. The remainder of this paper is organized as follows. The detailed design and optimization strategies for our target core architecture are presented in Section 2. In Section 3, the benefits of asynchronous circuit designs on an FPGA device are discussed in detail. Section 4 describes simulation and measurement results with the interpretation behind the results. Finally, in Section 5, we conclude this paper with a summary of this work.
Design
In this section, a target architecture and detailed design techniques for the proposed asynchronous processor are presented. Special design considerations for FPGA devices are discussed as well. Our presentation is mainly based on a Virtex-5 device. To port the design to Spartan-6, there are no special considerations except for delay matching.
Target Processor Architecture
Our target asynchronous embedded processor is based on the well-known MIPS architecture [9]. Fig. 1 shows our asynchronous MIPS processor architecture. Instead of using a single global clock feeding timing references to all the pipeline registers, asynchronous handshaking circuits are used to feed local clock signals to the storage elements at each pipeline stage.
To increase the speed of the asynchronous handshaking protocol, we devised a new C 2 -HC by extending the simple handshake controllers proposed in [5], [10]. C 2 -HC can be implemented with two adjacent lookup tables (LUTs). To embed the functionality of C-elements onto LUTs and guarantee their adjacency, we need to perform an LUT-level design and apply location constraints to those instances [5], [11], [12].
• Data hazard resolution: Conventional MIPS processors use hazard resolution schemes to reduce the number of pipeline stalls and thus improve their performance. However, the asynchronous implementation of the hazard handling schemes is too complex owing to the absence of a global time reference among the stages. The implementation can significantly degrade performance when it is realized on an FPGA device. This is because of the many synchronizations that are incurred between the decoding, execution, and memory access stages [13], [14].
In a low power embedded processor, forwarding schemes have not been employed in order to reduce chip-area and power-consumption [15] , [16] by eliminating a large number of MUX-logic and wire resources that consume a significant portion of the total power. Note that wire resources are becoming more expensive than logic resources from the viewpoint of power consumption and timing in the current FPGA technology. In our previous work [5] , the interconnect delay takes 83% of the worst cycle time delay on average. Consequently, we assume that a compiler produces NOP instructions whenever stalls occur to resolve the data hazard.
Nevertheless, it would be interesting to evaluate the power/energy efficiency of using scheduled NOP instructions, with no forwarding datapath or with only a partial forwarding datapath, against that of a full forwarding datapath. For synchronous microprocessors targeting low cost or low power/energy consumption, [15], [16] have explored the design space of forwarding datapaths and have optimized the transistor/wire cost or power consumption by employing only cost-effective or power-efficient forwarding paths.
In an asynchronous case, however, designing and implementing forwarding techniques themselves are possible research issues. Although we do not design an asynchronous MIPS with the forwarding mechanism due to its complexity and the huge amount of required design time, we believe that the energy efficiency of the forwarding techniques in an asynchronous MIPS will be much lower than that of the NOP based asynchronous MIPS without forwarding circuits.
Asynchronous Design on FPGA Devices
Many handshake protocols and high-speed handshake controllers have been investigated to reduce the asynchronous handshaking overhead [10], [17]. However, because previous handshake controllers have been optimized for full-custom or ASIC design, their application to an LUT-based FPGA can lead to significant performance degradation owing to their circuit complexities. In our asynchronous MIPS design, we devise a new high-speed C 2 -HC that can be implemented on an FPGA with only two LUTs. Such a simplification of the handshake controller leads to a speed improvement in our FPGA implementation. Fig. 2 shows typical implementation styles for a Muller C-element, which is the most fundamental and important circuit in asynchronous circuit designs [5]. As shown in Fig. 2, there are two design styles: (a) a latch-based C-element design and (b) a latch-free C-element design. The latch-based C-element uses no feedback signal and has no delay constraint for wires or gates. On the other hand, the latch-free C-element has a feedback signal that must be routed locally so that the C-element stabilizes before new inputs arrive, which guarantees its functionality. Therefore, a timing constraint must be imposed on the feedback signal. From the viewpoint of delay insensitivity, the latch-based C-element is better than the latch-free design.
However, from the viewpoint of speed, the latch-free C-element is superior to the latch-based design because the C-element functionality can be easily mapped to a single LUT. The latch-based C-element, in contrast, requires more resources: two LUTs (one for the AND gate and another for the NOR gate) and one latch. On the basis of this qualitative comparison, we chose the latch-free design for high-speed performance.
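To make the single-LUT mapping concrete, the following minimal Python sketch models a two-input latch-free C-element whose previous output is fed back as a third LUT input, and packs its truth table into an initialization word. The pin assignment (input 0 = a, input 1 = b, input 2 = fed-back output) and the helper names are assumptions made for illustration; they are not taken from the design in Fig. 2 or Fig. 3.

```python
def c_element_next(a: int, b: int, prev: int) -> int:
    """Muller C-element: the output follows the inputs when they agree,
    otherwise it holds its previous value."""
    return a if a == b else prev


def lut_init(func, n_inputs: int) -> int:
    """Pack a Boolean function into a LUT initialization word.

    Bit k of the word is the output for the input pattern whose binary
    value is k, with input 0 as the least significant bit.
    """
    init = 0
    for k in range(2 ** n_inputs):
        bits = [(k >> i) & 1 for i in range(n_inputs)]
        if func(*bits):
            init |= 1 << k
    return init


# Hypothetical pin assignment: input 0 = a, input 1 = b, input 2 = feedback.
C_ELEMENT_INIT = lut_init(c_element_next, 3)
print(hex(C_ELEMENT_INIT))  # 0xe8, i.e. the 3-input majority function
```

The result shows that a C-element with feedback is simply a three-input majority function, which is why it fits in one LUT as long as the feedback wire is short enough to settle before the next input event.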
Hazards in asynchronous circuits are an important issue for the correct operation of the circuits. In particular, the C-element is the most essential circuit element in asynchronous circuits and is used frequently in our asynchronous pipeline circuits. A C-element designed with a conventional synchronous tool can have logic hazards that cause a system halt when they occur in handshake control circuits. Therefore, it is critical for the safety of a system to design a hazard-free C-element when we are using synchronous tools. Since the Xilinx tools we use assume that the implemented circuits operate as synchronous circuits with a global reference clock, we have to pay special attention to the synthesized netlist and timing of our target asynchronous circuits. In our work, to solve the hazard problems in the design of asynchronous C-elements, we do not use behavioral synthesis to generate the asynchronous control circuits; instead, we use a primitive LUT-level circuit design. In addition, for proper placement of the designed LUTs and routing of the wires between them, we add constraints such as relative location constraints on the LUT elements so that the C-elements work correctly without producing hazards. During design, the C-elements sometimes do produce hazards. However, by checking the layout, including the wiring, with the Xilinx FPGA Editor tool [12], we correct and validate the timing-dependent functionality of the C-elements.
In Fig. 3, a simple single C-element based handshake controller is described with its circuit schematic and the corresponding Verilog-HDL style description of its LUT-level instantiation. The schematic for the simple single C-element based handshake controller includes a reset port and logic inversion on the input port 'b'. Note that only one four-input LUT is required to implement the simple single C-element based handshake controller, as shown in Fig. 3. In Fig. 3(b), "LUT4" is a Xilinx library template of a four-input LUT primitive and ".INIT" sets the functionality of the LUT for its four inputs [11].
Fig. 4(a) shows the proposed circuit structure of a C 2 -HC. Our C 2 -HC is similar to the simple handshake controller in [5], [10]. However, our C 2 -HC uses two C-elements for each stage of the handshake controller instead of the one used in the simple handshake controller. This doubling of the C-element allows every pipeline stage to hold a data item (this feature is called "full capacity"), which is not true for the simple handshake controller, in which only the odd stages or only the even stages can hold data items at a given time (this feature is called "half capacity"). Fig. 4(b) shows the timing diagram for a three-stage asynchronous pipeline. The timing diagram mainly shows the signaling events for the six C-element outputs, which are denoted by out-i', out-i, out-j', out-j, out-k' and out-k. The alternating signal events at out-i, out-j and out-k work as local clocks for the registers in their individual stages.
The C-elements for out-i', out-j' and out-k' are used for decoupling the handshaking controls between stages so that any two adjacent stages can hold data. When the right output environment is blocked and new data items keep coming into the pipeline from the left input environment, all the pipeline stages will eventually contain data items, as shown in Fig. 4(b) (see stage k, stage j, and stage i, which hold the first, second, and third data items, respectively), and then the whole pipeline will be blocked. In addition to this full capacity property, our C 2 -HC has a performance benefit since it can work at high speed on an FPGA owing to its simple circuit structure (only two LUTs are required).
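The difference between full and half capacity can be illustrated with a small token-level model. The sketch below is an abstraction introduced here for illustration, not a model of the C 2 -HC circuit itself: it assumes that an item may advance into an empty stage, and that under half capacity the stage after the destination must also be empty, so that adjacent stages never hold data at the same time. The function name and the stage-list representation are hypothetical.

```python
def fill_blocked_pipeline(n_stages: int, full_capacity: bool) -> list:
    """Feed items into a pipeline whose output is blocked until nothing moves.

    Returns a list of booleans, one per stage (index 0 is the input side),
    that tells which stages end up holding a data item.
    """
    stages = [False] * n_stages

    def can_enter(dst: int) -> bool:
        if stages[dst]:
            return False
        if full_capacity or dst + 1 >= n_stages:
            return True
        # Half capacity: the stage after the destination must be empty too.
        return not stages[dst + 1]

    moved = True
    while moved:
        moved = False
        # Advance items starting from the output side.
        for src in range(n_stages - 2, -1, -1):
            if stages[src] and can_enter(src + 1):
                stages[src], stages[src + 1] = False, True
                moved = True
        # The input environment always offers a new item.
        if can_enter(0):
            stages[0] = True
            moved = True
    return stages


print(sum(fill_blocked_pipeline(3, full_capacity=True)))   # 3 items held
print(sum(fill_blocked_pipeline(3, full_capacity=False)))  # 2 items held
```

Under these assumptions, a blocked three-stage pipeline holds three items with full capacity but only two (in alternating stages) with half capacity.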
There exist two handshake controllers in [10] that are more advanced than the simple handshake controller: a semi-decoupled latch controller and a fully-decoupled latch controller. These two controllers have the full capacity property as well. However, the former uses two asymmetric C-elements with three and two inputs each [10], while the latter uses four asymmetric C-elements with two, three and four inputs, not counting a reset signal. Note that our C 2 -HC uses two C-elements having three inputs each, including a reset signal. In addition to using more gates, those controllers use more internal wires than C 2 -HC, which can degrade the performance owing to the delay induced by the internal signal wires. Finally, extra delay elements can be added to the handshake controller to guarantee the timing constraints that must be satisfied for correct operation. Consequently, our C 2 -HC can be implemented as a high-speed circuit thanks to its logic and interconnect structures being simpler than those of the previous handshake controllers, while still providing the full capacity benefit.
It is noteworthy that an "asymmetric delay" for rising and falling signal propagation is implemented in a delay element so as not to waste time transferring a falling request signal in the return-to-zero phase of our handshake protocol [18]. The circuit schematic of the asymmetric delay is shown in Fig. 7.
Timing Constraints
For a C 2 -HC based asynchronous pipeline to work correctly, the handshake controller should not allow a later data item to catch up with an earlier data item. First, no special constraints are required when flip-flops are used as the data registers between stages, thanks to the strict signal ordering enforced by the C 2 -HC-driven handshake control and the short sampling time of a flip-flop. However, when latches are used for the data registers between stages, there are timing constraints that have to be satisfied for the correct operation of the pipeline. Fig. 5 shows a possible situation in which a second data item catches up with the first data item. After a rising signal at out-j makes the memory latches transparent (open) to pass the first data item, the rising event propagates forward to transfer a request signal to the next stage with the bundled data item, and it also propagates backward to transfer the acknowledge signal to the previous stage. The forwarded request signal goes to the C-element for out-k'. Then, the rising event at the C-element output for out-k' goes backward to the inverted input of the C-element for out-j. When the falling event happens at out-j, it makes the latches opaque (closed) and a safe capture of the first data item is finally completed. The forward cycle time is denoted by $t_{FwCycle}$ and drawn as a dotted line in Fig. 5(a).
On the other hand, at the same time, the back-propagated signal event from out-j reaches the inverted input of the C-element for out-i via the C-element for out-j'. Then, the C-element for out-i produces a rising signal event to read a new (second) data item. After the second data item is read, the new data propagates through the combinational logic to the inputs of the latches at stage j. This signal propagation path is shown in Fig. 5(a). This backward cycle time is denoted by $t_{BwCycle}$ and drawn as a dashed line in the figure.
In Fig. 5(b), the corresponding signaling dependencies are embedded onto the timing diagram. From the diagram, we can expect the second data item to catch up with the first data item under the condition $t_{FwCycle} \geq t_{BwCycle}$, which means that the latches at stage j cannot become opaque before the new data item arrives at the inputs of the latches. Consequently, the timing condition $t_{FwCycle} < t_{BwCycle}$ has to be satisfied for the correct operation of our C 2 -HC based asynchronous pipeline if latches are used for the data registers of the pipeline stages.
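The catch-up condition can be expressed as a simple margin check. The sketch below is only an illustration under stated assumptions: the two cycle times are taken as already-extracted totals (for example, from a post-layout timing report), and the function names, the guard-band parameter, and the numeric values in nanoseconds are all hypothetical.

```python
def catch_up_margin(t_fw_cycle: float, t_bw_cycle: float) -> float:
    """Timing margin of a latch-based stage.

    Positive when the latches close (forward cycle) before the next data
    item arrives at their inputs (backward cycle).
    """
    return t_bw_cycle - t_fw_cycle


def is_latch_stage_safe(t_fw_cycle: float, t_bw_cycle: float,
                        guard_band: float = 0.0) -> bool:
    """True when t_FwCycle < t_BwCycle holds with at least `guard_band` to spare."""
    return catch_up_margin(t_fw_cycle, t_bw_cycle) > guard_band


# Hypothetical delays in nanoseconds, chosen only to exercise the check.
print(is_latch_stage_safe(3.0, 3.5))                   # True: 0.5 ns of margin
print(is_latch_stage_safe(3.0, 3.5, guard_band=1.0))   # False: margin too small
print(is_latch_stage_safe(3.5, 3.0))                   # False: second item catches up
```

A non-zero guard band is one way to account for the layout-dependent wire delays, which may differ from run to run of placement and routing.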
The detailed timing path equations and the condition that has to be satisfied are summarized as follows.
In Eq. 1 and Eq. 2, $t_{wire}$ is the delay of a wire segment between circuit components, $t_{dCL}$ is the delay of a matched delay element, $d_{CL}$, and $t_{C\text{-}element}$ is the delay of a C-element. $t_{lc\downarrow}$ and $t_{lc\uparrow}$ are the delays of the falling and rising local clock signals, respectively. $t_{Latch\text{-}Open}$ and $t_{Latch\text{-}Close}$ are the clock-to-latch-open delay (from the rising clock event at the clock port of a latch to the time the latch becomes open/transparent) and the clock-to-latch-close delay (from the falling clock event at the clock port of a latch to the time the latch becomes closed/opaque), respectively. Since a bundled data model is employed in our paper, the sum of the first three delay components in $t_{FwCycle}$, $(t_{wire} + t_{dCL} + t_{wire})$, is a matched delay for the combinational logic and is denoted by $t_{Fw.dCL}$. Consequently, $t_{Fw.dCL}$ is greater than or equal to $t_{dCL}$.
We assume that all circuit elements of the same type have the same delay. In addition, $t_{Fw.dCL} \approx t_{CL}$, $t_{lc\downarrow} \approx t_{lc\uparrow}$, and $t_{Latch\text{-}Close} \approx t_{Latch\text{-}Open}$ are assumed. Finally, the following simplification of the condition is possible.
Eq. 3 shows that $t_{BwCycle}$ is longer than $t_{FwCycle}$ under our assumptions. Although the condition is shown to be satisfiable with the timing symbols in Eq. 3, the timing margin for the actual correct operation of our pipeline ($t_{wire}$ can be thought of as the margin for safe/correct operation) seems very small. Consequently, very careful attention has to be paid to circuit layout and timing in order to make the assumptions hold.
It is noteworthy that flip-flops are used in our asynchronous MIPS datapath instead of the latches that are commonly used in conventional asynchronous circuits. This is because of the internal structure of commercial FPGAs. Because a flip-flop and a latch are instantiated from the same storage cell by changing its configuration in FPGA devices, there is no benefit in instantiating a latch with respect to performance or power.
Layout-Aware Design and Delay Matching / Optimization
Our C 2 -HC has been designed with a mix of low-level LUT instantiations and high-level behavioral descriptions together with relative location (RLOC) constraints [5], [12]. Fig. 6 shows the placement-constrained C 2 -HC handshake circuit description in Verilog-HDL. The multiple instantiations of the C-element with the "SimpleC" circuit defined in Fig. 3 and a delay element (i.e., delayLUT2) are described in Fig. 6 with RLOC constraints. The constraints are used for locating those three components as closely as possible to one another. The delay element is used to control the timing between the two C-elements for correct handshaking. Fig. 7 shows the internal structure of a delay element, in which LUTs are cascaded in series. This is an asymmetric delay element having a fast return-to-zero behavior [18]. Finally, we note that delay matching needs iterative design modifications and P&R in an FPGA-based design flow. Since a small modification in one part of a design affects the whole circuit structure and the corresponding layout in the FPGA, the unmodified parts of the design can also change unexpectedly, and the corresponding timings will change as well.
One possible way of avoiding the iterations or reducing their number is to use a "hardmacro" for a circuit module, in particular for a delay element or a handshake module in asynchronous circuit designs. A hardmacro is a predeveloped circuit (including P&R) that keeps its circuit layout structure when mapped to an FPGA, so that the timing properties of the signals inside the hardmacro are preserved. The problem with the hardmacro is the loss of layout optimality because no further optimization is performed inside it.
In addition to the delay elements that are used as matching delays for the combinational logic datapaths, extra delay elements are required to guarantee the correct and timely operation of the feedback control and data signals. Of the delay elements shown in Fig. 1, $d_{branch}$ and $d_{branchD}$ are imposed on the feedback signals for branch address selection, while $d_{WB}$ is imposed to delay the write-back signal that is used as a clock signal for the register file. Because the pipeline stages are not synchronized and work elastically, we need to synchronize the data produced at two different stages by adjusting the arrival time of the feedback signals at the stage where those signals are consumed.
In order to adjust the delay elements in our asynchronous MIPS design, we take a time-consuming trial-and-error approach to optimize them. It can be seen as a sort of "iterative optimization" used in conventional search algorithms. We randomly choose a delay element, reduce its delay by reducing the number of LUTs in the delay element, and then test the impact on the timing correctness of our asynchronous MIPS core by simulation. If the reduced delay in the selected delay element leads to a timing error, then the number of LUTs in the delay element is restored to its previous value. We perform this trial-and-error delay tuning iteratively to improve the performance of the asynchronous MIPS core.
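The trial-and-error tuning loop described above can be sketched as follows. This is a minimal illustration, not the actual tuning flow: the `passes_timing` callback stands in for a full timing simulation of the core, and the initial LUT counts, the minimum counts used by the stand-in oracle, and the iteration budget are all hypothetical values.

```python
import random


def tune_delays(lut_counts, passes_timing, iterations=2000, seed=0):
    """Randomly shrink delay elements, undoing any step that breaks timing.

    lut_counts:    dict mapping a delay-element name to its number of LUTs.
    passes_timing: callable that returns True when a candidate dict of LUT
                   counts still yields a correctly working core.
    """
    rng = random.Random(seed)
    tuned = dict(lut_counts)
    names = sorted(tuned)
    for _ in range(iterations):
        name = rng.choice(names)
        if tuned[name] == 0:
            continue
        tuned[name] -= 1                # try a shorter delay
        if not passes_timing(tuned):
            tuned[name] += 1            # timing error: restore the previous value
    return tuned


# Stand-in oracle (hypothetical): each element needs a minimum number of LUTs.
MIN_NEEDED = {"d_branch": 3, "d_branchD": 2, "d_WB": 4}


def passes_timing(counts):
    return all(counts[k] >= MIN_NEEDED[k] for k in MIN_NEEDED)


initial = {"d_branch": 8, "d_branchD": 8, "d_WB": 8}
print(tune_delays(initial, passes_timing))
```

With a real simulator in place of the oracle, delays may interact, so the result would be a local optimum that depends on the order in which elements are tried.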
Benefits of Asynchronous Design on FPGA Devices
We need to address the beneficial aspects for applying asynchronous circuit design techniques onto an FPGA reconfigurable device. The three main advantages are as follows.
Time-varying functionality
An FPGA is an adaptable device that can change its functionality according to time-varying functional demands. A small microcontroller alone can be programmed onto an FPGA device so that it works with minimal power consumption while it performs minimal computations, e.g., interrupt listening, flag checking, etc. On the other hand, the entire SoC circuit should be configured onto the device to provide full functionality, with full power consumption, when the full functionality is required. Fig. 8 shows one possible scenario for time-varying dynamic reconfiguration that depends on the phase of two different processing demands. Depending on the size of those circuits, a different circuit implementation style can be utilized for the modules on the SoC to save power/energy. When the entire SoC circuit is required, a synchronous clocked circuit can be employed to support the full workload. However, when only light-weight control processing is required, the corresponding small controller can be configured onto the FPGA instead of the entire SoC. The reconfigured small controller circuit, marked by μP′ † in Fig. 8, can be implemented as an asynchronous circuit to further save power/energy by not utilizing the power-hungry thick clock lines of high-performance/high-capacity FPGA devices.
Furthermore, there are two more design issues that have to be considered together for the best use of this feature, time-varying functionality, and the issues are described as follows.
• On the use of a power gating technique: Power gating is known as an effective technique for achieving low power consumption. In the field of FPGAs, there have been many relevant works on power gating for reducing leakage power consumption. However, there is still no support for power gating in actual FPGA devices, at least from major FPGA companies such as Xilinx and Altera. Nevertheless, further power savings will be obtained when only a small portion of the FPGA circuit area is utilized (e.g., the small microprocessor, μP′, in the light-weight processing phase) if power gating becomes available on FPGA devices.
† The apostrophe denotes that μP′ is an asynchronous version of the synchronous μP.
Since our work focuses mainly on dynamic power reduction through an asynchronous circuit design technique, any technique for reducing static power, such as power gating, can be used independently to lower power consumption regardless of whether a synchronous or an asynchronous circuit design technique is employed.
• On the cost of context switching for dynamic reconfiguration: Every dynamic reconfiguration requires additional power and delay to switch contexts (i.e., a reprogramming step that loads a bitstream into the configuration memory). In the field of reconfigurable architecture and computing, the overhead of context switching has always been a special concern. Context switching is worthwhile only when the switches, which themselves incur delay and power consumption, lead to an overall improvement in performance and a reduction in power.
In our case as well, the context switching overhead can be amortized if the runtime of the reconfigured small circuit is long enough that the overall power consumption, including the switching overhead, is smaller than that of continuously running the large circuit without reconfiguration. Consequently, the decision of when to switch contexts should be made carefully for a given application by investigating its runtime features.
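The amortization argument above can be turned into a simple break-even estimate. The sketch below assumes a constant power draw in each configuration and a fixed total reconfiguration energy (covering the switch to the small circuit and back); the function name and all numeric values are hypothetical and are not measurements from this work.

```python
def break_even_runtime(e_reconfig_j: float, p_large_w: float,
                       p_small_w: float) -> float:
    """Minimum time (s) the small circuit must run for reconfiguration to pay off.

    Reconfiguring saves energy when
        e_reconfig + p_small * t < p_large * t,
    i.e. when t > e_reconfig / (p_large - p_small).
    """
    if p_small_w >= p_large_w:
        return float("inf")  # the small circuit never saves energy
    return e_reconfig_j / (p_large_w - p_small_w)


# Hypothetical numbers: 2.0 J of total switching energy, 1.5 W for the
# full SoC configuration, and 0.5 W for the small controller.
print(break_even_runtime(2.0, 1.5, 0.5))  # 2.0 seconds
```

If the light-weight processing phase is expected to last longer than this break-even time, reconfiguring to the small circuit reduces total energy under the stated assumptions.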
Efficient Multi-Clock Domain Design
To make an SoC with heterogeneous multi-clock domains on FPGAs, Xilinx DCM IP modules need to be instantiated for supporting multiple clock frequencies for each clock domain [19] . However, the instantiated DCM modules consume extra power/energy that is comparable to power/energy consumption of original clock-net. In asynchronous circuits, each circuit module can be working at its individually optimized speed without considering the working frequencies of the other IP modules. This leads to natural power/energy savings.
Event-driven nature of asynchronous circuits
The workload running on an SoC can vary over time, and sometimes the arrival patterns of data coming into the system can be bursty. In particular, these types of data patterns are easily found in ubiquitous event-driven (sometimes called data-driven) embedded systems [18]. The problem with using synchronous circuits in these types of systems is the free-running clock, which continuously consumes power/energy. Even if there are sleep or idle states in which clock gating can be used to stop the clock from feeding the circuits, extra timing and power overheads are incurred at the transition point between the power-saving states and the active state [21].
Furthermore, clock gating requires a designer to know which parts of the circuit have to be idle without clocking, when the clock has to be stopped, and how long the clock has to be stopped for the specific circuit parts. In addition, that information has to be stored in a memory, and extra control circuits have to produce the proper control signals to stop clocking some parts of the circuit according to the stored information. Consequently, the choice of fine or coarse granularity in the application of clock gating is an important design decision for reducing its overhead.
The general equation for the dynamic power consumption of a processor core is described by the following [2]:

$$P = C_{total} \cdot V^{2} \cdot f \qquad (4)$$

In Eq. 4, $C_{total}$ is the effective capacitance, $V$ is the supply voltage level and $f$ is the frequency. $C_{total}$ can be further refined into three components, $C_{logic}$, $C_{signal}$ and $C_{clock}$, as in Eq. 5:

$$C_{total} = C_{logic} + C_{signal} + C_{clock} \qquad (5)$$

In synchronous circuits, $C_{clock}$ continuously contributes to power consumption even in idle states. However, $C_{clock}$ does not contribute to power and energy consumption during the core's idle time in asynchronous circuits, since they use locally generated clocks that are activated according to an event-driven handshake protocol. Fig. 9 shows the two scenarios of clocking behavior for the synchronous and asynchronous cases. In the figure, the idling phase time is defined as the time duration in which a core performs no useful computations, while the working phase time is defined as the time duration in which a core performs useful computations. The clock frequency component can be classified into two types of frequency: $f_{work}$, the frequency during the working phase time, and $f_{idle}$, the frequency during the idling phase time. If we use α as an activation factor, then
$f_{work}$ is the part of the frequency involved in useful computations, while $f_{idle}$ is the part of the frequency during which no operations are performed. If we use $P_{sync}$ and $P_{async}$ to describe the power consumption of synchronous and asynchronous circuits, respectively, then we have the following equations:
As described before, there is no $f_{idle}$ term in Eq. 8 because it is zero in $P_{async}$. Now, we can calculate the working condition under which the power consumption of asynchronous circuits is lower than that of synchronous circuits as follows:
If we summarize Eq. 9, then the following condition can be derived. Finally, Eq. 10 can be rewritten as the following equation:
Consequently, an asynchronous circuit can be superior to its synchronous counterpart in terms of average power consumption, even if the power consumption of the fully working asynchronous circuit during the working phase time is higher than that of the synchronous counterpart. This holds if the ratio $f_{idle} / (f_{work} + f_{idle})$ is higher than the value calculated from the right-hand side of Eq. 11.
We define this ratio as the idling-ratio and use the symbol β for it. With the assumption that $\Delta C_{logic}$, $\Delta C_{signal}$ and $\Delta C_{clock}$ are all small relative to $C^{sync}_{clock}$, asynchronous circuits consume less power than synchronous circuits even with only a small idling-ratio.
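For reference, one way the idling-ratio condition may be worked out from the definitions in the text is sketched below. The explicit forms of $P_{sync}$ and $P_{async}$ are assumptions reconstructed from the prose (the synchronous clock capacitance is charged during both phases, and the asynchronous circuit has no $f_{idle}$ term), and $\Delta C_x = C^{async}_x - C^{sync}_x$ is an assumed definition; the resulting bound may differ in form from the original Eq. 11.

```latex
% Assumed power models (common supply voltage V):
\begin{align*}
P_{sync}  &= \left(C^{sync}_{logic} + C^{sync}_{signal}\right) V^{2} f_{work}
            + C^{sync}_{clock}\, V^{2} \left(f_{work} + f_{idle}\right),\\
P_{async} &= \left(C^{async}_{logic} + C^{async}_{signal} + C^{async}_{clock}\right) V^{2} f_{work}.
\end{align*}
% With \Delta C_x = C^{async}_x - C^{sync}_x (assumed) and
% \Delta C = \Delta C_{logic} + \Delta C_{signal} + \Delta C_{clock}:
\begin{align*}
P_{async} < P_{sync}
  &\iff \Delta C \, f_{work} < C^{sync}_{clock}\, f_{idle}\\
  &\iff \beta = \frac{f_{idle}}{f_{work} + f_{idle}}
        > \frac{\Delta C}{C^{sync}_{clock} + \Delta C}
        \qquad \left(C^{sync}_{clock} + \Delta C > 0\right).
\end{align*}
```

Under these assumptions, the threshold on β approaches zero as $\Delta C$ becomes small relative to $C^{sync}_{clock}$, which matches the observation that a small idling-ratio already suffices. For example, if $\Delta C$ were one quarter of $C^{sync}_{clock}$, the asynchronous circuit would consume less power whenever β exceeds 0.2.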
Simulation and Measurement
Our asynchronous MIPS processor based on C 2 -HC has been developed in Verilog-HDL and synthesized with Xilinx ISE Design Suite 13.4. For simulation and measurement, we use Virtex-5 LX110T (speed grade -3) and Spartan-6 LX9 development boards.
• Simulation results: In order to investigate the impact of voltage and thermal variation on the Virtex-5 device (based on 65 nm technology), we observe the variation of the throughput performance under two corner conditions, the best case and the worst case operating conditions. We do not perform simulations on the Spartan-6 device for those two conditions because prorated delays (temperature and voltage) are not supported for the Spartan-6 device.
The throughput performances, 122.5 MHz and 106.1 MHz, are obtained on the Virtex-5 for the best case and worst case operating conditions, respectively. The worst case performance is approximately 1.15 times slower than the best case performance in the given variation range of voltage and temperature. The design on Spartan-6 shows a throughput performance of 50.6 MHz. On the other hand, with the synchronous version of MIPS, the throughput performances, 216.9 MHz and 195.3 MHz, are observed on the Virtex-5 for the best case and worst case operating conditions, respectively. In this synchronous case, the worst case performance is approximately 1.11 times slower than the best case performance.
Table 1 and Table 2 show the breakdowns of core dynamic power consumption by resource component type when the synchronous MIPS and our asynchronous MIPS cores are configured onto high-end Virtex-5 and low-power Spartan-6 devices, respectively. The results are extracted from the Xilinx XPower Analyzer tool for the worst case operating condition. We hand-wrote a simple benchmark program that calculates a Fibonacci sequence. Because no compiler is available for our MIPS, we cannot run simulations with a variety of benchmarks, but very similar power consumption patterns have been observed with simple data computation codes performing filter operations. For the Spartan-6 device, the condition of "1.16 V and 85 °C" is used for power evaluation. We do not show I/O and static power consumption because they are outside the scope of this work.
For a fair comparison, when the power consumption is evaluated, the clock frequencies of the synchronous MIPS are set to the average frequencies of our asynchronous MIPS cores. As shown in Table 1, a 36.8% power saving is observed in the high-performance Virtex-5, thanks to the significant reduction in the power consumed by clocking resources. On the other hand, 25.6% more power consumption is observed in the low-power Spartan-6 device, as shown in Table 2.
The resource utilizations of the synchronous MIPS and the asynchronous MIPS are almost the same. Only a small increase in LUT resources is observed for the asynchronous MIPS. This increase is caused by the delay elements added to the asynchronous handshake controllers. The additional logic elements in turn increase the logic power consumption.
Note that BUFG (global clock buffer) and low skew clock signal resources are used only in the synchronous core [11] . The clock buffer and signal lines of an FPGA device have to be designed to provide the maximum clock frequency at which chip circuits on the FPGA can operate. Consequently, the FPGA chip will waste power because of these over-engineered components if the device or some parts of the device are not working at the maximum frequency. The significant increases in power consumption in the synchronous designs in Virtex-5 devices are mainly owing to those power-hungry components. This implies that employing asynchronous circuits can save power consumption in such high performance and high-capacity FPGAs.
However, in the low-power Spartan-6, the clock lines are optimized for low power consumption. Furthermore, our Spartan-6 device is an LX9 series model that has very small logic capacity and wire resources. In these chips characterized by low logic capacity and very limited resources, an asynchronous MIPS core can consume more power because short wires that consume low power are not enough for routing both data signals and asynchronous local clocks. Nevertheless, the asynchronous design on the low-power Spartan-6 can be a better solution in an event-driven working environment depending on the idling-ratio defined in Section II.
A simplified first order analysis based on Eq. 11 shows the Spartan-6 based asynchronous MIPS consumes less power than the synchronous MIPS if the idling-ratio is higher than 0.38 with the values given in Table 2 (ΔC logic = (2.79 mW -2.02 mW)/(
• On the different ratios of working and idle phase times: The bursty workload characteristics of an event-driven environment can be modeled by running an application code for a specified time and then doing nothing for another specified time. In an interrupt-based system, the application code can be thought of as an interrupt handler: the handler processes bursty data on a core, and the core then returns to the idle state after completing the interrupt processing, waiting for a new interrupt signal that announces new data. The runtime of the interrupt processing can be regarded as the working phase time, and the time spent waiting for a new event can be regarded as the idling phase time. Consequently, the power benefit from a bursty workload in an event-driven environment can be evaluated by varying the ratio between the working phase time and the idling phase time.
In order to include the impact of the ratio of working and idle phase times on power consumption, we conduct simulations with synthetic benchmarks. Note that the synthetic benchmarks are made solely to simulate the impact of various ratios of working and idle phase times on power consumption. Each synthetic benchmark runs a Fibonacci application code actively for a specified time, followed by an idling run. In the idling run, the MIPS core runs without any useful computation during a specified idle phase time. The working phase time and idling phase time are specified in each benchmark according to the target ratio.
To evaluate the impact of the ratio of working and idle phase times, we use the ratio factor β = $f_{idle}/(f_{work}+f_{idle})$. We then build synthetic benchmarks with different β values as described above and simulate the power consumption of those benchmarks.
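The construction of a benchmark for a target β can be sketched as follows. This is a minimal illustration that assumes β can be read as the fraction of one benchmark period spent idling; the function names and the period length are hypothetical and do not come from the actual benchmark setup.

```python
def phase_times(beta, period_us):
    """Split one benchmark period into (working, idle) durations for a target beta.

    Assumption (for illustration): beta equals idle_time / (work_time + idle_time).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if period_us <= 0.0:
        raise ValueError("period must be positive")
    idle_us = beta * period_us
    work_us = period_us - idle_us
    return work_us, idle_us


def achieved_beta(work_us, idle_us):
    """Recover the idling-ratio from the two phase durations."""
    return idle_us / (work_us + idle_us)


# Hypothetical sweep over target ratios with a 1000 us benchmark period.
for target in (0.0, 0.2, 0.4, 0.6, 0.8):
    work_us, idle_us = phase_times(target, 1000.0)
    print(f"beta={target:.1f}: work={work_us:.0f} us, idle={idle_us:.0f} us")
```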
As shown in Table 3, the power consumption varies linearly with the ratio factor β. Fig. 10 plots the power consumption as a function of β. We find that our asynchronous version of MIPS is the better choice for low power consumption when β is 0.4 or larger; the crossover, marked "Crossing Point" in Fig. 10, is located between β = 0.35 and β = 0.4. It is noteworthy that this range includes the idling-ratio of 0.38 derived from our simple analysis using Eq. 11 with the data given in Table 2.
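Because both power curves are close to linear in β, a crossing point between two sampled β values can be estimated by linear interpolation. The sketch below illustrates this under stated assumptions: the function name is hypothetical and the sample data are made up for illustration only; they are not the values of Table 3.

```python
def crossing_beta(betas, p_sync_mw, p_async_mw):
    """Estimate the beta at which the asynchronous curve meets the synchronous one.

    Linearly interpolates between adjacent samples. Returns None if the
    asynchronous core never reaches the synchronous power in the sampled range.
    """
    diffs = [a - s for s, a in zip(p_sync_mw, p_async_mw)]
    if diffs[0] <= 0.0:
        return betas[0]  # asynchronous core is already no worse at the first sample
    for i in range(len(betas) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 > 0.0 >= d1:
            t = d0 / (d0 - d1)  # fraction of the way through this interval
            return betas[i] + t * (betas[i + 1] - betas[i])
    return None


# Hypothetical samples (mW): sync = 10 - 6*beta, async = 12 - 12*beta.
betas = [0.0, 0.2, 0.4, 0.6]
p_sync = [10.0, 8.8, 7.6, 6.4]
p_async = [12.0, 9.6, 7.2, 4.8]
estimate = crossing_beta(betas, p_sync, p_async)
print(f"estimated crossing beta = {estimate:.3f}")
```

With exactly linear curves the interpolated estimate coincides with the analytic intersection; with measured or simulated data it is only as accurate as the sampling step in β.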
• Measurement results: The configuration bitstream of our designs is downloaded onto the FPGA. In order to verify the correctness of our asynchronous MIPS core implementation on the FPGA, we run benchmarks on the MIPS implementation and print out the final results of the computations to on-board LED outputs. Finally, the externally observed results are compared with the correct results of the computations.
The measured average operating frequencies of the implementations are 140.8 MHz and 75.7 MHz at room temperature on the Virtex-5 and Spartan-6 devices, respectively, as shown in Fig. 11. The voltage oscillations on the 'lt M' signal shown in Fig. 1 are measured with an oscilloscope. These performance numbers are higher than those from the simulations; the differences result from the conservative estimates of the simulation tools.
Note that an on-board oscillator is necessary to feed a reference clock signal to a synchronous core, and a typical on-board oscillator also consumes a few mW of power depending on its frequency. Asynchronous circuits and cores do not need oscillator components to operate.
Conclusions
In this paper, we design an asynchronous processor core on commercial FPGAs with a newly proposed handshaking controller called C 2 -HC to show the power/energy savings of asynchronous circuit design on an FPGA. We use commercially available 65 nm Virtex-5 and 45 nm Spartan-6 Xilinx FPGAs to show the impact of asynchronous circuit design on power consumption. Our asynchronous processor core works correctly on the high-performance Virtex-5 FPGA and achieves a 36.8% reduction in core dynamic power compared with its synchronous counterpart. On a small-capacity low-power Spartan-6, it consumes 25.6% more power.
However, this power increase is amortized, and a net power reduction is achieved once the idling-ratio exceeds the crossing point (between 0.35 and 0.4 in our simulations), owing to the bursty workload of an event-driven working environment.
Asynchronous circuits are also considered a design technology for EMI-suppressed cores and modules. Thus, we are currently investigating EMI issues in asynchronous circuits and processors. We expect that an asynchronous processor chip emits lower EMI than its synchronous counterpart thanks to the absence of simultaneous clock and circuit power consumption.
