Abstract - Asynchronous integrated circuit technology provides low-power and low-noise operation for portable electronic security applications. Rather than using a global clock, asynchronous circuits employ a system of distributed handshake signals that control on-chip dataflow, confining power consumption to only those parts of a chip actively involved in computation. Sandia has developed an automated asynchronous design flow that enables the rapid development of these asynchronous ASICs. This paper describes the design of asynchronous DES encryption circuits using this flow, and evaluates their performance against standard synchronous implementations.
Index Terms - asynchronous logic, DES cryptography, low-power circuits.
I. INTRODUCTION
Asynchronous circuit technology uses self-timed logic to control digital computations and eliminates the need for global clock signals. Compared to more conventional synchronous logic, self-timed asynchronous circuits consume less power and generate less electromagnetic interference, making them ideal for portable electronic security applications. While academic researchers have a long history of demonstrating the feasibility of asynchronous ICs, performance-driven industry designers have largely ignored the benefits of asynchronous logic. However, several companies, most notably Europe's Philips Electronics, have recently launched successful asynchronous products in low-power consumer markets (e.g., automotive control, secure smartcards, and portable wireless devices) [1].

This paper focuses on designing asynchronous cryptographic circuits, a key component in many security applications. Asynchronous cryptographic circuits implement the same encryption and decryption operations that are found in traditional synchronous hardware designs. However, the temporal behavior of asynchronous cryptographic operations is different because they are not synchronized to a central clock. This gives asynchronous cryptographic logic more concurrency and does not restrict the duration of operations to a fixed clock period. In Sections II and III, asynchronous logic and asynchronous hardware languages are introduced. Section IV describes asynchronous implementations of the DES algorithm, Section V outlines the Sandia asynchronous logic design flow, and Section VI compares the asynchronous DES designs with their synchronous equivalents.
II. ASYNCHRONOUS LOGIC
The bundled-data design style is illustrated in Fig. 1. The datapath logic and DFF registers in a bundled-data pipeline are similar to conventional synchronous logic. The delay line is used to match the delay of the asynchronous controller with the delay through the datapath logic. For correct operation, the minimum control path delay must be larger than the maximum datapath delay. This is a one-sided timing constraint, sometimes referred to as the bundling constraint, and is equivalent to the setup constraint in synchronous logic. There is no equivalent hold constraint because the read and write register operations in this style of asynchronous logic are sequential (i.e., mutually exclusive).
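The bundling constraint can be expressed as a simple slack check. The following is a minimal sketch, assuming path delays are available as lists of numbers in nanoseconds; the function names and the example delay values are hypothetical and are not taken from the paper or from any timing tool.

```python
def bundling_margin(control_delays_ns, datapath_delays_ns):
    """Slack (ns) between the fastest control path and the slowest datapath."""
    return min(control_delays_ns) - max(datapath_delays_ns)


def satisfies_bundling_constraint(control_delays_ns, datapath_delays_ns):
    """True when the minimum control delay exceeds the maximum datapath delay."""
    return bundling_margin(control_delays_ns, datapath_delays_ns) > 0


# Hypothetical delays for one pipeline stage, in nanoseconds.
control = [2.4, 2.6]    # request path through the controller and delay line
datapath = [1.9, 2.1]   # combinational paths between the registers
print(satisfies_bundling_constraint(control, datapath))
```

Because the constraint is one-sided, only a lower bound on the control delay is checked; there is no corresponding upper bound or hold check.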
The Muller C-element, shown in Fig. 2(a), is found in nearly all asynchronous logic circuits and is used to synchronize asynchronous control signals. The output of the C-element gate goes high when all of its inputs are high, and the output goes low when all of its inputs are low. When the inputs to the C-element have different values, the output remains unchanged. The C-element's operation, summarized in the truth table shown in Fig. 2(b), requires a storage element to hold the state of its last output. In standard-cell implementations, this state-holding circuit is usually constructed from a combinational feedback path as illustrated in Fig. 2(c).
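The C-element behavior described above can be captured in a small behavioral model. This is a sketch only, assuming zero gate delay and an initial output of 0; the class name `CElement` is hypothetical.

```python
class CElement:
    """Behavioral model of a Muller C-element (zero-delay, state-holding)."""

    def __init__(self, n_inputs=2, initial=0):
        self.n_inputs = n_inputs
        self.out = initial

    def update(self, *inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        if all(inputs):          # all inputs high: output goes high
            self.out = 1
        elif not any(inputs):    # all inputs low: output goes low
            self.out = 0
        # inputs disagree: output holds its previous value
        return self.out


c = CElement()
for a, b in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, c.update(a, b))
```

The stored `out` attribute plays the role of the feedback path in the standard-cell implementation.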
Asymmetric C-elements have inputs that respond to transitions in only one direction. For example, the B input to the C-element shown in Fig. 3(a) is denoted by a '+' because it is ignored when the output transitions to a low value. This introduces an 'X' in the truth table (Fig. 3(b)) and simplifies the resulting standard-cell ASIC circuit (Fig. 3(c)). Muller C-elements can be made asymmetric when the relative ordering of asynchronous input events is predetermined.
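The asymmetric variant can be sketched in the same style. The model below assumes a two-input element in which input A takes part in both transitions and the '+' input B takes part only in the rising transition; the function name and argument order are hypothetical.

```python
def asymmetric_c(a, b_plus, prev):
    """Asymmetric C-element: b_plus is ignored when the output falls."""
    if a and b_plus:   # both inputs high: output rises
        return 1
    if not a:          # A low: output falls, B is a don't-care ('X')
        return 0
    return prev        # A high, B low: hold the previous output


out = 0
for a, b in [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0)]:
    out = asymmetric_c(a, b, out)
    print(a, b, out)
```

Note the row with A low and B high: a symmetric C-element would hold its output there, whereas this element drives it low.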
The S-element, shown in Fig. 4(a), is a common handshake circuit that encloses one asynchronous handshake within another. As illustrated in Fig. 4(b), the S-element operates as follows: (1) Areq initiates a handshake on channel A, (2) a full handshake is performed on channel B, and finally (3) the handshake on channel A is completed. Fig. 4(c) shows the schematic representation of the S-element. An asymmetric C-element is used in this implementation because the S-element behavior guarantees that Back goes low before Areq.

Using these C-elements and S-elements as building blocks, asynchronous pipeline controllers of arbitrary complexity can be constructed. The most basic asynchronous pipeline controller, with one input channel and one output channel, is shown in Fig. 5. The S-element is used to sequence the input and output handshakes, such that data is written to the register from the input channel before the register is read by the output channel. The C-element is used to generate the Rack output acknowledge signal by synchronizing the Rreq output request signal with the Aack output of the S-element.
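The S-element sequencing described above can be traced with a zero-delay model. The sketch below assumes one common gate-level realization (an internal state bit `y` held by an asymmetric C-element, with Breq = Areq AND NOT y and Aack = y AND NOT Back); these equations are an assumption for illustration and are not taken from Fig. 4(c).

```python
def s_element_step(areq, back, y):
    """Evaluate the assumed S-element equations; returns (breq, aack, y)."""
    # Asymmetric C-element: y rises when Areq and Back are both high,
    # and falls when Areq is low (Back is ignored on the falling edge).
    if areq and back:
        y = 1
    elif not areq:
        y = 0
    breq = int(areq and not y)   # request on B until the state bit is set
    aack = int(y and not back)   # acknowledge A once B has returned to zero
    return breq, aack, y


def run_handshake():
    """Drive one four-phase handshake on A and record the outputs."""
    y = 0
    trace = []
    events = [("Areq+", 1, 0), ("Back+", 1, 1), ("Back-", 1, 0), ("Areq-", 0, 0)]
    for label, areq, back in events:
        breq, aack, y = s_element_step(areq, back, y)
        trace.append((label, breq, aack))
    return trace


for row in run_handshake():
    print(row)
```

In the trace, Breq rises and falls (a complete handshake on channel B) before Aack rises, so the handshake on B is enclosed within the handshake on A.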
III. HIGH-LEVEL ASYNCHRONOUS DESIGN
Designing asynchronous logic from the bottom-up, by manually composing C-elements and S-elements, would be tedious and prone to errors. Instead, asynchronous logic compilers have been developed that synthesize these control elements from high-level logic descriptions using a top-down design flow.
The Sandia asynchronous design flow uses the Balsa hardware description language (HDL), which has an open-source compiler and simulator [2]. Balsa is a high-level programming language that has special constructs to efficiently describe the behavior of asynchronous logic. To verify correctness before performing logic synthesis, functional tests can be run on asynchronous Balsa programs using the Balsa simulator. A summary of the syntax used in the Balsa language is in the Appendix of this paper.
The Balsa HDL has programming constructs (e.g., buses, conditionals, modules, etc.) that are similar to high-level features found in the synchronous Verilog and VHDL languages. However, Balsa programs describe dataflow operations and hide the low-level asynchronous circuit details from the designer. Asynchronous logic blocks are described as procedures, data is stored in variables, and data communication between procedures is performed using unidirectional channels. The Balsa compiler is responsible for synthesizing these procedures into asynchronous circuit components such as registers, gates, and handshake wires.

Fig. 6 shows two versions of an asynchronous adder. The first version (Fig. 6(a)) concurrently reads both of its input channels, while the second version (Fig. 6(b)) reads its input channels sequentially and as a result is slightly slower. The synthesized datapath logic (Fig. 6(c)) is identical for both adders. However, the synthesized control logic for the parallel implementation (Fig. 6(d)) requires one more S-element than the control logic generated for the sequential implementation (Fig. 6(e)). This illustrates the tradeoff between performance and circuit area that a designer can exploit when using asynchronous logic. Note that the control logic is automatically inferred from the Balsa program, and that the Balsa channels are automatically expanded into data, request, and acknowledge wires.
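The latency difference between the two adders can be illustrated with a rough model. This sketch assumes each input handshake and the addition take fixed times and ignores control-logic overhead; the function names and the timing values are hypothetical and are not measurements from the paper.

```python
def parallel_read_latency(t_a, t_b, t_add):
    """Both input channels are read concurrently, then the sum is produced."""
    return max(t_a, t_b) + t_add


def sequential_read_latency(t_a, t_b, t_add):
    """Input channels are read one after the other, then the sum is produced."""
    return t_a + t_b + t_add


# Hypothetical handshake and adder delays in nanoseconds.
print(parallel_read_latency(1.0, 1.5, 2.0))
print(sequential_read_latency(1.0, 1.5, 2.0))
```

Under this model the parallel version is never slower than the sequential one, which is consistent with the text; the price is the extra S-element in its control logic.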
IV. DES ALGORITHMS
The Data Encryption Standard (DES), developed in the 1970s, was the first commercial-grade cryptographic algorithm with an open and publicly-available specification.
DES is a symmetric-key block cipher that encrypts data using 16 iterations, or rounds, to permute 64-bit data blocks with a 56-bit key. Each DES encryption round takes the lower 32 bits (L) and upper 32 bits (R) from the previous round and computes the input for the next round as follows:

$$L_i = R_{i-1}, \qquad R_i = L_{i-1} \oplus P\big(S(E(R_{i-1}) \oplus K_i)\big)$$

where P is a 32-bit permutation, S is a 48-to-32 bit substitution mapping, E is a 32-to-48 bit expansion permutation, and K is a 48-bit subkey generated from the main 56-bit key (see [3] for a full specification of the DES algorithm). In addition, the first round has a 64-bit initial permutation (permuteIn) and the final round has a 64-bit final permutation (permuteOut). The same algorithm can be used for DES decryption by reversing the order in which subkeys are applied to the rounds.
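The round structure, and the fact that decryption reuses the same rounds with the subkeys reversed, can be illustrated with a generic Feistel sketch. The round function `toy_f` below is a hypothetical stand-in for P(S(E(R) xor K)) and is not the real DES function; the initial and final permutations are omitted, the subkeys are arbitrary integers rather than a real DES key schedule, and the placement of the two halves within the 64-bit word is an assumption.

```python
MASK32 = 0xFFFFFFFF


def toy_f(r, k):
    """Hypothetical round function; NOT the DES P/S/E combination."""
    return ((r * 0x9E3779B1) ^ k ^ (r >> 7)) & MASK32


def feistel(block64, subkeys, f=toy_f):
    """Apply one Feistel round per subkey, then swap the two halves."""
    left, right = (block64 >> 32) & MASK32, block64 & MASK32
    for k in subkeys:
        # L_i = R_{i-1};  R_i = L_{i-1} xor f(R_{i-1}, K_i)
        left, right = right, left ^ f(right, k)
    # Swapping the halves lets the same routine undo itself
    # when it is run again with the subkeys in reverse order.
    return (right << 32) | left


def encrypt(block64, subkeys):
    return feistel(block64, subkeys)


def decrypt(block64, subkeys):
    return feistel(block64, list(reversed(subkeys)))


subkeys = list(range(1, 17))   # 16 placeholder subkeys
plain = 0x0123456789ABCDEF
print(decrypt(encrypt(plain, subkeys), subkeys) == plain)
```

The round trip succeeds for any round function, which is why a single hardware datapath can serve both encryption and decryption.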
The DES algorithm is well suited for hardware implementations and has been studied extensively in the literature (see [4] for a recent survey).

      In -> x;
      Out <- (x.Decrypt, x.Key, x.Rval, permuteRoundi(x.Lval, x.Rval, x.Key, x.Dec))
    end -- loop
  end -- procedure
(b)
Fig. 7. Pipelined asynchronous DES algorithm: (a) block diagram and (b) Balsa HDL pseudo code.
B. Iterative DES Algorithm
Unlike the pipelined algorithm, the iterative DES algorithm (Fig. 8(a)) is optimized for circuit area instead of performance.

The Balsa procedure describing the iterative asynchronous DES algorithm is listed in Fig. 8(b). The key is stored in a 56-bit register (k) and the control in a 1-bit register (d). A 5-bit counter that keeps track of the current DES iteration is stored in the cnt register. The DES round computation is stored in the 64-bit x register, whose L and R slices are accessed by the LRreg record type. The cnt_last and y variables are shadow registers that eliminate variable self-assignments (e.g., c:=f(c)), which are synthesized inefficiently by the Balsa compiler [2].

The iterative asynchronous DES circuit goes through 17 iterations: sixteen iterations to compute the DES rounds and an additional iteration to send the resultant 64-bit cipher text on the DesOut channel. In line 14, a "probe" statement is used instead of a channel receive to avoid requiring an additional 64-bit register to store the value of the DesIn channel. While the iterative asynchronous DES algorithm is significantly smaller than the pipelined implementation, it is also slower due to its increased control complexity and 17-iteration encryption latency.
V. ASYNCHRONOUS LOGIC SYNTHESIS
Sandia's asynchronous logic design flow, shown in Fig. 9, is composed of open-source, commercial, and in-house EDA tools. In addition to supporting a variety of commercial standard-cell ASIC libraries, this design flow is also compatible with Sandia's low-NRE, rad-hard structured ASIC devices [6]. While this flow is targeted towards asynchronous logic, it also supports system-on-chip applications that contain both asynchronous and synchronous logic. The remaining part of this section summarizes the design steps.
Asynchronous Synthesis: Balsa programs are synthesized to gate-level Verilog netlists using the Balsa compiler. These netlists are then remapped onto Sandia's generic asynchronous standard-cell library, which is compatible with the Synopsys GTECH™ format.

Static Timing Analysis: Synopsys PrimeTime™ is used to verify the asynchronous delay-line margins in the routed netlist, and any setup/hold time constraints in the synchronous blocks.

Gate-level Simulations: The routed netlist is back-annotated and simulated using Mentor ModelSim™. These simulations, along with cursory SPICE simulations, provide performance measurements for the asynchronous circuits.
VI. DES DESIGN EVALUATION
The iterative (iDES) and pipelined (pDES) asynchronous DES designs described in Section IV were compiled and taken through Sandia's asynchronous design flow up until physical place and route. Equivalent synchronous DES designs, obtained from opencores.org [7], were also synthesized for comparison. For meaningful area and power comparisons, the synchronous implementations were optimized to have approximately the same speed as their asynchronous counterparts. (Fig. 8(b)). This overhead is expected to decrease significantly when the Sandia design flow moves to the newer Balsa version 3.5, which synthesizes parallel control circuits more efficiently [8].
The current peaks and total accumulated energy for the clocked and asynchronous iterative DES designs are shown in Fig. 10 . While the total energy is a modest 10% less for the asynchronous design, this does not include the amortized energy dissipated in a large global clock tree that is required for system-on-chip applications with many design blocks. The clocked design will also consume clock tree energy when it is idle, whereas the asynchronous implementation will dissipate zero standby power. Even without including a large clock tree, the asynchronous DES design has 50% lower current peaks.
The power spectra for the clocked and asynchronous iterative DES designs are shown in Fig. 11. These figures illustrate how the power dissipation is distributed in the frequency domain, which is proportional to the electromagnetic emission spectrum. The asynchronous design has 2x lower average (DC) power and 3x lower peak harmonic power when compared to the clocked design.

Since the timing of asynchronous circuits is dependent on handshaking logic gates rather than well-defined periodic clocks, their exact operating speed depends on temperature, supply voltage, and fabrication process. Fig. 12 shows the robustness of the asynchronous DES designs across cold (1.65V, -55C, fast process), typical (1.5V, 25C, typical process), and hot (1.35V, 125C, slow process) conditions. The speeds of the equivalent synchronous designs are shown for reference.

VII. CONCLUSIONS
This paper described the design of asynchronous DES cryptographic circuits using Sandia's automated asynchronous design flow. The power consumption, electromagnetic emissions, and environmental robustness of asynchronous logic were demonstrated to be superior to equivalent synchronous logic.
