This paper addresses the design of asynchronous circuits for low
I. INTRODUCTION

A. Asynchronous Circuit Design
Most digital circuits designed and fabricated today are "synchronous." In essence, they are based on two fundamental assumptions that greatly simplify their design: 1) all signals are binary and 2) all components share a common and discrete notion of time, as defined by a clock signal distributed throughout the circuit.
Asynchronous circuits are fundamentally different; they also assume binary signals, but there is no common or discrete time. Instead the circuits use handshaking among their components in order to perform the necessary synchronization, communication, and sequencing of operations. This difference gives asynchronous circuits inherent properties that can be and have been exploited to advantage in the following areas:
• lower power consumption [1] - [6] ;
• higher operating speed [7] - [9] ;
• robustness toward variations in supply voltage, temperature, and fabrication process parameters [10] - [12] ;
• less emission of electromagnetic noise [1] , [13] ;
• better composability and modularity [14] - [18] ;
• no clock distribution and clock skew problems.
On the other hand, the asynchronous control logic that implements the handshaking normally represents an overhead in terms of silicon area, circuit speed, and power consumption. It is therefore a question whether the investment pays off, i.e., whether the use of asynchronous techniques results in a substantial improvement in one or more of the above areas.
Research in asynchronous design goes back to the mid-1950's [14] , [19] , but it was not until the 1990's that projects in academia and industry demonstrated that it is possible to design asynchronous circuits which exhibit significant benefits to nontrivial real-life examples. Low power consumption seems to be one of the more promising directions, and the design reported in this paper is one of these examples.
Many researchers, including the authors [20] , have experienced that "just going asynchronous" results in larger, slower, and more power consuming circuits. The crux is to use asynchronous techniques to exploit characteristics in the algorithm and architecture of the application in question. The work reported here represents a contribution toward a better understanding of this, as well as a contribution to the bulk of engineering experience needed to design efficient circuits.
0018-9219/99$10.00 © 1999 IEEE
B. Designing for Low Power
In CMOS circuits, the power consumption is mainly related to signal transitions and stems from the charging and discharging of the parasitic capacitances in transistors and wires and from short-circuit currents during switching [21] . Minimizing power consumption is therefore a question of avoiding unnecessary signal transitions which do not contribute to the computation in question [22] . In synchronous design this is addressed by stopping the clock signal in unused modules. This is called clock gating and is basically an ad hoc approach which is only manageable at a coarse grain level.
Asynchronous circuits have the inherent property of only activating modules and storage elements where and when needed. This may be viewed as a systematic way of introducing fine grain clock gating and variable length clocks. For this reason, asynchronous design holds significant promise in applications characterized by a significant data-dependent variation in their computational complexity: loops where the number of iterations is data dependent, and arithmetic operations on predominantly small numbers are a few examples of such characteristics.
C. The Interpolated Finite Impulse Response (IFIR) Filter Bank Application
So far, the work reported on low-power asynchronous design has focused on microprocessor design [3] , [4] , and on error-correcting codes [1] , [2] . The latter is an area that is characterized by algorithms where the number of steps is data dependent and where multiple clocks are involved.
The work presented here addresses a different area: digital signal processing with a moderate sampling rate (e.g., digital audio). At first glance, this application area does not seem to have any of the "right" characteristics: the algorithm involves a fixed number of steps and the environment is synchronous (due to the fixed sampling rate). However, as demonstrated in this paper, the asynchronous implementation is able to exploit low-level data dependencies in ways that are not available in synchronous design.
A seven-band IFIR filter bank served as a vehicle for this research. It is part of the fully digital hearing aid, DigiFocus, manufactured by Oticon, Inc., and it was chosen because it is a realistic industrial example where low power consumption is very important. The existing synchronous design and the asynchronous re-implementation have been fabricated in the same CMOS technology, and a fivefold reduction in power consumption has been measured [5] , [6] .
D. Contributions and Organization of the Paper
The paper makes several contributions: 1) the asynchronous audio filter bank chip is one of the very few existing asynchronous chips which exhibit significant advantage in a nontrivial industrial example and 2) the design falls within an application domain that has not previously been addressed by the asynchronous community and it exploits characteristics of this application in novel ways.
In addition to this, the paper can be read as an example-driven introduction to asynchronous low-power design. The IFIR filter bank is sufficiently complex to bring out some of the key characteristics of asynchronous design, and yet the algorithm and the architecture in question are sufficiently simple to be understood by a broad audience.
The paper is organized as follows: Section II describes the IFIR filter bank algorithm and the architecture used to implement it. Section III discusses the characteristics of the sampled audio input data which are exploited to minimize power consumption. Section IV gives a brief introduction to asynchronous handshaking protocols and describes the circuit implementation style used in the filter bank design. Section V explains the circuit implementation of key components in the filter bank design, and Section VI describes the physical implementation of the two chips and presents the speed and power figures of the two designs. Finally, Section VII concludes the paper.
II. ALGORITHM AND ARCHITECTURE
This section introduces the hearing-aid filter bank algorithm and describes the architecture of the circuit used to implement it.
A. Algorithm
A block diagram of the hearing aid is shown in Fig. 1(a) . The filter bank splits the input signal into seven frequency bands. These are amplified individually and merged into two frequency bands. Finally, these two signals undergo additional signal processing.
The filter bank constitutes about half of the signal processing circuitry in the hearing aid. As illustrated in Fig. 1(b) and (c), it consists of a tree structure of complementary interpolated linear phase FIR filters [23] . Explaining the details of the algorithm is beyond the scope of this paper and it is not necessary for the following discussion. We only mention that much effort has been devoted to reducing the number of multiplications in order to save power: many of the coefficients are zero and the nonzero coefficients in the individual IFIR filters are symmetric around the midpoint. This allows a folded implementation that effectively halves the number of multiplications [see Fig. 1(c) ]. Furthermore, the multiplications have been simplified by approximating the filter coefficients by numbers whose binary representation contains at most three ones. This approximation enables a simple implementation of the multiplier using a shift module and two adders.
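As an illustration of the simplified multiplication, the following Python sketch multiplies a sample by a coefficient whose binary fraction contains at most three ones, using only shifts and at most two additions. The function name and the encoding of a coefficient as a list of right-shift amounts are assumptions made for this example; they are not taken from the hearing-aid design.

```python
def shift_add_multiply(x, shifts):
    """Multiply integer x by sum(2**-s for s in shifts) with shifts and adds.

    `shifts` is a hypothetical coefficient encoding: one right-shift amount
    per one-bit in the binary fraction of the coefficient (one to three
    entries). Returns (product, number_of_additions_used).
    """
    if not 1 <= len(shifts) <= 3:
        raise ValueError("coefficient must contain one to three one-bits")
    # Each one-bit contributes a shifted copy of the operand.
    # Python's >> on negative integers rounds toward minus infinity.
    partial = [x >> s for s in shifts]
    product = partial[0]
    additions = 0
    for p in partial[1:]:
        product += p          # one adder per extra one-bit
        additions += 1
    return product, additions
```

A coefficient with a single one (0.5, 0.25, and so on) needs no addition at all, whereas a coefficient with three ones uses both adders. Because each partial product is truncated separately, the result may differ slightly from an exact product; this is a simplification of the sketch.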
The following figures provide some indication of the complexity of the design. In some areas the figures are approximate, as we cannot disclose exact values for the hearing aid.
• The sampling rate is approximately 20 kHz and the input is linear up to a sound pressure level of 100 dB.
• The entire IFIR filter bank structure requires storage of several hundred data values. During the processing of one input sample, only one fourth of these are accessed.
• The data samples, the filter coefficients, and the internal buses are in the 15-25-bit range.
• The number of nonzero filter coefficients is around 30, and the values of several of these are identical. 29% of the filter coefficients are represented using three ones, and 48% of the coefficients are represented using only a single one (corresponding to multiplication by 0.5, 0.25, etc., which can be implemented as shift operations).
B. Architecture
The modest speed requirement (sampling rate) allows for a highly sequential implementation. The algorithm can be serialized in several dimensions: using bit-serial arithmetic units and/or serialization in the time domain by mapping the arithmetic units depicted in Fig. 1(c) onto a smaller set of hardware units.
To avoid excessive power consumption due to handshaking overhead, bit-serial implementations should be avoided [20] . Also, structures where data are copied unchanged down a chain of registers without being used should be avoided, as this consumes power without contributing to the computation. This means that a straightforward dataflow implementation following the structure in Fig. 1(c) would be a poor solution from a power consumption point of view.
These simple arguments hint that the optimal choice is a dedicated processor structure (see Fig. 2 ) with a single add-multiply-accumulate (AMA) data path, a RAM for the data samples, a ROM for the filter coefficients, and an address sequencing and control unit. Due to the folded structure of the IFIR filters, it is convenient to use a dual-port RAM. Using this architecture, the processing of one input sample requires a sequence of approximately 30 AMA operations, corresponding to approximately 600 000 AMA operations per second. The main task of the address sequencing and control unit is to generate the sequence of read and write addresses for the dual-port RAM. For each IFIR filter, a portion of the dual-port RAM is administered as a cyclic buffer: when time progresses one step and a new data sample is input, it is stored in the location which holds the oldest data sample that is no longer needed. The input sample stays in this location throughout its lifetime (in order to avoid power-consuming data shifts). For this reason, the addresses in a computation sequence for an IFIR filter must be offset by one from one input sample to the next. In combination with the many coefficients being zero, this results in a very irregular address sequence. As an example, filter H7 in Fig. 1(c) has 31 delay elements, and its address sequence is defined by the code fragment in Fig. 3 .
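The cyclic-buffer addressing can be sketched in Python as follows. This is not the code fragment of Fig. 3; it is a simplified illustration in which the buffer size, the list of nonzero taps, and the direction of the offset are assumptions. It shows how the pair of read addresses for a folded (symmetric) tap pair moves by one location from one input sample to the next while the stored samples stay in place.

```python
def folded_tap_addresses(t, size, nonzero_taps):
    """Return (addr_a, addr_b) read-address pairs for one input sample.

    t            -- index of the current input sample (time step)
    size         -- number of RAM locations in the cyclic buffer (assumed)
    nonzero_taps -- tap indices k (0 = newest sample) in the first half of
                    the filter whose coefficient is nonzero (assumed)
    """
    newest = t % size                     # location written at this step
    pairs = []
    for k in nonzero_taps:
        addr_a = (newest - k) % size                # tap k
        addr_b = (newest - (size - 1 - k)) % size   # mirrored tap, same coefficient
        pairs.append((addr_a, addr_b))
    return pairs
```

With `size = 8` and nonzero taps 0 and 2, the pairs are `(0, 1), (6, 3)` at step 0 and `(1, 2), (7, 4)` at step 1: every address is offset by one, and skipping the zero coefficients makes the sequence irregular.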
All IFIR filters have an odd number of delay elements, and in the actual implementation the writing of a new input sample is performed in the same step as the last read operation related to the processing of the previous input sample. Furthermore, in this last step the data path performs a multiply-accumulate-subtract (MAS) operation, thereby producing the two outputs from the IFIR filter. This means that the data path must be able to perform both AMA and MAS operations.
III. DATA DEPENDENCIES
This section reports on an analysis of typical real-life input samples and discusses the implications it has on the implementation of the filter bank. In essence, the analysis shows that the stream of input samples to the filter is characterized by a huge predominance of numerically small values and a significant correlation among the data samples.
A. Switching Activity in Sampled Audio Signals
Fig. 4(a) shows the average signal transition probabilities in a 5-s recording of several people speaking at the same time, using a 17.5-kHz sampling rate, 16-bit resolution, and two's complement number representation. The figure shows a clear pattern which is typical in sampled real-life audio and video signals [24] . The most significant (MS) bits, 0-3, are outside the dynamic range of the signal and correspond to the sign bit and a number of sign extension bits. These bits change whenever the sign of the data changes. From a computational point of view the sign extension bits carry no information. The least significant (LS) bits, 8-15, exhibit a 50% switching probability, corresponding to random uncorrelated data. The middle bits, 4-7, correspond to a transition region where the switching probability falls from the 50% level to the level of the MS bits.
The analysis of switching activity shown in Fig. 4(a) is based on several people speaking at the same time. A further analysis shows that during a normal conversation, the filter is idle (processing background noise) for 20-50% of the time due to pauses in the conversation. Over a full day this is even more predominant, and it is thought provoking to realize that the battery lifetime is dominated by the power consumed when processing background noise.
The consequence of this is that the switching activity profile, shown in Fig. 4 (a), is shifted toward the right when real-life audio is considered. Depending on the environment, the background noise can have different activity profiles, but a 40-dB sound pressure level is common to most environments. Representing this requires less than half of the bits in a 16-bit sample.
B. Switching Activity in the Data Path
The switching activity profile is entirely different inside the filter bank circuit. The data samples are accessed out of order and the correlation is lost. Furthermore, as demonstrated in [22] , the choice of number representation (sign magnitude or two's complement) has a significant impact on the switching activity. Fig. 4(b) and (c) show the activity on the output of the dual-port memory, and on the MS 16 bits of the multiplier output, when processing the audio sequence whose switching profile is shown in Fig. 4(a).
As can be seen, the two's complement representation has a much higher switching activity than the sign-magnitude representation. The area between the two graphs represents wasted power for a two's complement representation. At the multiplier output this overhead is more than 100%, and considering the reduced switching activity in real-life audio, the overhead can easily exceed 200%. In large circuits with heavily loaded buses, the overhead can have a significant impact on the power consumption of the circuit. This is a strong argument for sign-magnitude representation, but it only considers the switching activity at the input and output ports of the arithmetic modules. Unfortunately, sign-magnitude addition and subtraction are complex operations to implement. A closer look at possible implementations reveals an internal switching activity similar to that of the two's complement interfaces described above (unless positive and negative numbers are dealt with separately). Instead we use a different approach, as explained below.
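The effect of the number representation on bus activity can be illustrated with a small Python sketch that counts bit flips between consecutive words. The helper names and the 16-bit width are assumptions for the example, and the sketch only models the words on a bus, not the internal activity of the arithmetic modules discussed above.

```python
def to_twos_complement(v, bits):
    """Two's complement bit pattern of v as a non-negative integer."""
    return v & ((1 << bits) - 1)

def to_sign_magnitude(v, bits):
    """Sign-magnitude bit pattern; assumes abs(v) < 2**(bits - 1)."""
    sign = 1 if v < 0 else 0
    return (sign << (bits - 1)) | abs(v)

def transitions_per_word(words):
    """Average number of bit flips between consecutive words."""
    flips = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
    return flips / (len(words) - 1)

def compare(samples, bits=16):
    """Return (two's complement activity, sign-magnitude activity)."""
    tc = [to_twos_complement(v, bits) for v in samples]
    sm = [to_sign_magnitude(v, bits) for v in samples]
    return transitions_per_word(tc), transitions_per_word(sm)
```

For a small signal that alternates in sign, such as `[1, -1, 1, -1]`, the two's complement words flip 15 of 16 bits at every step because of the sign extension, while the sign-magnitude words flip only the sign bit.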
C. Adapting the Number Range to the Actual Need
To avoid the excess power consumption inherent in two's complement representation, the data path and the memory blocks have been sliced, and the circuit automatically adapts the number range to the actual need by conditional activation of the more significant slices. The mechanism is based on tagging and overflow detection. The mechanism works as follows: a comparator at the input detects when the MS slice of the input sample carries redundant sign extension information, and a tag ("big" or "small") is appended to the LS slice of the data sample. As long as the operands from the RAM's and the intermediate results in the data path are all tagged "small" the MS slice is not activated. It is only activated if one or more "big" operands are present, or if an overflow in the LS slice occurs. Furthermore, it should be noted that the filter bank consists of filters with a finite impulse response. Therefore, it is not necessary to have a hardware mechanism that checks the result and resets the tag from "big" to "small." A sequence of "small" input samples will eventually flush all "big" operands from the circuit. If an output sample is "small," it is necessary to sign extend it to the full word length before outputting it. The actual implementation of the filter bank, described later in the paper, uses two slices as explained above. The same principle can be used in a scenario with more slices. As more slices call for more control logic this decision involves a tradeoff.
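A behavioral Python sketch of the tagging mechanism for two slices is given below. The slice width `ls_bits` and all function names are assumptions chosen for illustration; the sketch models only the decisions described above (tagging at the input, conditional activation of the MS slice, and sign extension of "small" outputs), not the circuit itself.

```python
def tag(value, ls_bits):
    """Return "small" when value fits in a signed ls_bits-wide LS slice,
    i.e., when the MS slice would carry only sign-extension bits."""
    lo = -(1 << (ls_bits - 1))
    hi = (1 << (ls_bits - 1)) - 1
    return "small" if lo <= value <= hi else "big"

def ms_slice_active(operand_tags, ls_overflow):
    """The MS slice is activated only if some operand is "big" or the
    LS slice overflows."""
    return ls_overflow or any(t == "big" for t in operand_tags)

def sign_extend(ls_word, ls_bits):
    """Interpret an ls_bits-wide two's complement word as a full-width
    integer, as needed before outputting a "small" result."""
    ls_word &= (1 << ls_bits) - 1
    sign = 1 << (ls_bits - 1)
    return (ls_word ^ sign) - sign
```

With a hypothetical 8-bit LS slice, `tag(100, 8)` is "small" and `tag(128, 8)` is "big"; two "small" operands without overflow leave the MS slice inactive.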
D. Scaling Down the Supply Voltage
Finally, it should be noted that the varying word length results in varying latencies of the operators in the data path. Because of the fixed sampling rate, the filter bank must be designed for the worst-case situation where the full word length is used. It is therefore possible to exploit the reduced latency in the typical case and obtain additional power savings by adaptively reducing the supply voltage. A mechanism for this is explained and analyzed in [12] . As the hearing aid operates from a very low battery voltage (close to the threshold voltage of the transistors), the advantage is limited. But for circuits operating from "normal" 3.3 or 5.0-V supplies the advantage will be significant.
IV. ASYNCHRONOUS IMPLEMENTATION STYLE
Asynchronous design is not a single well-defined method, but rather a wide spectrum of options in a number of areas: handshake protocol, circuit implementation style, and theoretical basis for the design of control circuits [25] . In the following, the choice of handshaking protocol and circuit implementation style is addressed in the context of low power consumption.
A. Asynchronous Handshake Protocols
The three commonly used asynchronous handshake protocols are illustrated in Fig. 6 . As will be clear from the following, they all have different properties which affect power consumption.
The four-phase bundled-data and the two-phase bundled-data protocols in Fig. 6(a) are self-explanatory: when data are ready the sender issues a request, and when data are received, the receiver issues an acknowledge. The four-phase protocol uses signal levels to signal request and acknowledge, and the two-phase protocol uses signal transitions. In both cases it is important to ensure that the timing relation "data before request" is not violated at the receiver's end due to different delays in the request and data wires. This is similar to the data setup-time and hold-time requirements in a synchronous circuit. It also leads to worst-case timing behavior, though only on a local scale where safety margins can be tighter.
The four-phase dual-rail protocol in Fig. 6(b) is insensitive to such delays. This is obtained by a combined encoding of data and request using two wires per data bit. This robustness comes at a very high price.
For the different protocols illustrated in Fig. 6 , Table 1 shows the number of wires and the number of signal transitions, including the request and acknowledge signal wires, when communicating an N-bit data word from one module to another. The number of signal transitions is a measure of the associated energy consumption.
For the bundled-data protocols, the number of signal transitions depends on the transition probability of the individual data bits. The values quoted in Table 1 assume a worst-case switching probability of 0.5, corresponding to uncorrelated data. For the four-phase dual-rail protocol, N of the data wires will make an up-going transition followed by a down-going transition. This is independent of the switching probability of the data bits. Consequently, the four-phase dual-rail protocol does not allow the designer to exploit the reduced switching activity found in many real-life data, as illustrated in Section III. Taking this into account, the four-phase dual-rail protocol can result in a power consumption that is an order of magnitude higher than that of a bundled-data protocol. Although the above arguments do not consider the switching activity inside the communicating circuit modules, the huge difference speaks for itself.
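The comparison can be made concrete with a short Python sketch that counts wires and signal transitions per transferred word. The counts below are derived from the protocol descriptions (one request and one acknowledge wire for bundled data, two wires per bit plus an acknowledge wire for dual rail); they are an assumption of this sketch and are not copied from Table 1.

```python
def handshake_cost(n_bits, p_switch=0.5):
    """Wires and expected signal transitions for one n_bits-wide transfer.

    p_switch is the switching probability of a data bit; it only affects
    the bundled-data protocols.
    """
    return {
        # request and acknowledge each rise and fall: 4 control transitions
        "four-phase bundled-data": {
            "wires": n_bits + 2,
            "transitions": p_switch * n_bits + 4,
        },
        # request and acknowledge each make a single transition
        "two-phase bundled-data": {
            "wires": n_bits + 2,
            "transitions": p_switch * n_bits + 2,
        },
        # one of the two wires per bit goes up and down again,
        # and the acknowledge rises and falls
        "four-phase dual-rail": {
            "wires": 2 * n_bits + 1,
            "transitions": 2 * n_bits + 2,
        },
    }
```

For a 16-bit word and a switching probability of 0.5, the sketch gives 12 transitions for four-phase bundled data and 34 for four-phase dual rail; lowering `p_switch` reduces only the bundled-data figures.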
The choice between the four-phase and the two-phase bundled-data protocol is also a simple one. In our experience, register implementations for the two-phase bundled-data protocol are significantly larger or significantly slower than the traditional latches used in four-phase designs. The same is true for the control circuitry used to implement conditional sequencing. The reader may find more details and circuit level insight about these matters in [4] and [20] . Furthermore, if the decision is on precharged logic rather than static logic, then the four-phase protocol comes as a natural choice since the request signal can directly control the precharging and evaluation of the circuits.
At this point it is relevant to mention that deciding on a four-phase bundled-data protocol conforms with what seems to be a general trend when focus is on power and area (and possibly also speed): Philips Research Laboratories have re-targeted their Tangram silicon compiler from four-phase dual-rail to four-phase bundled-data circuitry [2] , [26] , and the Amulet Group at Manchester University uses four-phase bundled-data circuitry in the second version of their asynchronous ARM microprocessor, where the first version used two-phase bundled-data circuitry.
For the sake of completeness, it should be noted that protocols other than the three described here do exist [27] - [29] , but in general they are impractical.
B. Circuit Implementation Style
The asynchronous IFIR filter bank uses the four-phase bundled-data protocol, and Fig. 7 gives a flavor of the associated circuit implementation style used in the design. Clock signals for latches and registers are derived locally from the handshake signals by small speed-independent asynchronous sequential control circuits (CTL) [30] - [32] . The request signals either undergo matched delays or they are derived by detecting completion of the corresponding operations.
Completion detection is possible by local use of the four-phase dual-rail protocol. Section V provides some examples of this mixing of protocols. A reader familiar with CMOS transistor-level design will notice that the dual-rail protocol is basically what is found in differential precharged structures such as RAM's, ROM's, and DCVSL gates which are used in synchronous CMOS circuits [21] , [33] . Furthermore, it should be noted that the dual-rail encoding shown in Fig. 6(b) is a one-hot encoding. It is often convenient to use four-phase n-rail one-hot generalizations of this protocol. Again, Section V provides some examples of this.
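The principle of dual-rail completion detection can be sketched in Python as follows. The rail ordering and the function names are assumptions of the example; the sketch models the code words only, not the precharged circuits.

```python
EMPTY = (0, 0)  # both rails low: no data (the return-to-zero phase)

def encode_dual_rail(bit):
    """Encode one bit as (true_rail, false_rail); exactly one rail is high."""
    return (1, 0) if bit else (0, 1)

def is_complete(rails):
    """True when every bit position carries valid data, which is the
    condition a completion detector signals as the request."""
    return all(t + f == 1 for t, f in rails)

def is_empty(rails):
    """True when every bit position has returned to the empty state."""
    return all(pair == EMPTY for pair in rails)
```

A word such as `[(1, 0), EMPTY, (0, 1)]` is neither complete nor empty: it represents data that are still arriving, and the receiver must wait.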
Using the design style described above, the difference between synchronous and asynchronous circuits has more or less diminished. The low-level component implementations are the same; the difference is that where a synchronous design uses clock distribution, clock buffering, and clock gating, the asynchronous circuit uses distributed asynchronous control to derive local clock signals for the individual latches and registers. This style of design gives asynchronous circuits inherent properties that resemble clock gating and variable-length clocks taken to the extreme: registers, latches, and combinational circuits are only activated where and when they are needed. However, there is one important difference: asynchronous design offers a systematic approach to achieve this.
For this reason, asynchronous techniques are advantageous for the implementation of algorithms which exhibit irregular and/or low-level data dependencies, i.e., in situations where a synchronous implementation using clock gating is not viable.
V. CIRCUIT IMPLEMENTATION
This section provides a closer look at the overall organization of the filter bank circuit and gives a detailed description of the low-level implementation of some of the most important and interesting components.
A. Overall Organization, Data Flow, and Control
The architecture of the asynchronous IFIR filter bank is shown in Fig. 8 . Comparing it with Fig. 2 , it can be seen that the only difference is that the dual-port RAM and the address sequencing and control logic have been partitioned into nine IFIR modules, corresponding to the nine IFIR filters in Fig. 1(b) . A top-level sequencer in turn requests each of the IFIR modules to "drive" the data path with the necessary sequence of operands (Fig. 3) and control signals, corresponding to the processing of one input sample. The coefficient module in Fig. 8 communicates directly with the multiplier in the data path, and the sequence of filter coefficients delivered by the coefficient module corresponds to a concatenation of the sequence dictated by the IFIR modules (but there is no direct communication between the two parts).
The partitioning of the dual-port RAM and the distribution of the control logic has several advantages that significantly reduce power consumption. In the RAM's it minimizes the capacitance of the bit lines and in the address sequencing logic it allows global buses and decoding logic to be avoided. Fig. 8 provides a closer look at one of the IFIR modules, and the important details are explained here: in a standard CMOS RAM, the set of internal row-select signals follows a one-hot n-rail four-phase protocol, but the external address bus is typically encoded using a binary representation. Since the RAM's in the nine IFIR modules are on-chip and of a moderate size, it is possible to generate the address signal directly as a one-hot encoded signal. The circuitry developed for the address sequencer consists of two cyclic one-hot counters, a step counter and an offset counter (cf. Fig. 3 ), from which the one-hot address is directly decoded. The address sequence is quite complex, yet the structure of one-hot counters and decoding logic provides a very power-efficient solution as explained in Section VI.
The same power-efficient one-hot counter was used in the top-level sequencer. Each of the bits in the one-hot code is connected directly to one of the IFIR modules, thus providing a request signal for that module, as can be seen in the detailed view of Fig. 8 . In conclusion, the reader should notice the data-driven nature of the implementation and, in particular, the distributed, autonomous, and hierarchical structure of simple one-hot control units. The activation scheme in the control logic is very complex, and a corresponding synchronous implementation is not viable.
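The behavior of a cyclic one-hot counter can be sketched in a few lines of Python. The list representation and the function names are assumptions of the example; the point is that each step toggles exactly two wires and that the active wire can serve directly as a request or row-select signal without a decoder.

```python
def one_hot_step(state):
    """Advance a cyclic one-hot counter by rotating the single 1."""
    if sum(state) != 1:
        raise ValueError("state must be one-hot")
    return state[-1:] + state[:-1]

def bit_flips(a, b):
    """Number of wires that change between two counter states."""
    return sum(x != y for x, y in zip(a, b))
```

Starting from `[1, 0, 0, 0]`, four steps visit each position once and return to the initial state, with two wires changing per step.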
B. The AMA Data Path
It is desirable to have a fast processing unit with low latency but at the same time avoid pipeline registers, since such registers consume substantial power. To achieve these goals, the data path in Fig. 9 was developed. A block diagram of the data path is seen to the left, and the request flow and data flow are illustrated to the right. The figure is only an example illustrating the principle; it is not an exact diagram. For instance, in the real implementation, the adders of the multiplier are activated according to the number of ones in the coefficients. Sometimes the coefficients correspond to a simple shift, and in that case none of the adders are activated. The total number of additions for an entire AMA operation therefore varies from two to four. The logic required for this behavior is not included in the figure.
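The range of two to four additions follows from the number of ones in the coefficient, as the following Python sketch shows. The breakdown into one input addition, up to two multiplier additions, and one accumulation is our reading of the AMA operation, and the weighted average assumes that the coefficients not covered by the figures quoted in Section II-A contain two ones and that all coefficients are used equally often; both are assumptions of the sketch.

```python
def additions_per_ama(coefficient_ones):
    """Additions activated in one AMA operation for a coefficient with
    the given number of ones (1, 2, or 3)."""
    if coefficient_ones not in (1, 2, 3):
        raise ValueError("coefficient must contain one to three ones")
    input_add = 1                          # add the two folded operands
    multiplier_adds = coefficient_ones - 1  # none for a pure shift
    accumulate = 1                         # add the product to the accumulator
    return input_add + multiplier_adds + accumulate

# Share of coefficients by number of ones: 48% and 29% are quoted in
# Section II-A; the remaining 23% are assumed to contain two ones.
share = {1: 0.48, 2: 0.23, 3: 0.29}
average_additions = sum(share[k] * additions_per_ama(k) for k in share)
```

Under these assumptions the average is about 2.8 additions per AMA operation, compared with four if all adders were always activated.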
The interface of the data path follows the four-phase bundled-data protocol as illustrated in Fig. 9 , but the data flow and the request flow are quite different inside the data path. This is illustrated by the shading and the arrows, respectively. The idea of the approach is simple: the validity of the sum bits in an adder can be guaranteed to be correct in a sequential order from the LS full adder to the MS full adder. The first full adder in the multiplier can therefore start computation immediately after the first bit is computed in the adder above it. The implementation of a dual-rail ripple-carry full adder that enables this is explained in Section V-C.
In the example in Fig. 9 , the shaded full adders indicate full adders that have finished computation. It is noticed that the computational wavefront progresses diagonally rather than straight down. Forcing this request flow has two significant advantages: 1) the completion of the computation is directly indicated by the request output of the accumulator, and therefore no completion logic is required, and 2) all hazards can be eliminated if each full adder only computes sum and carry outputs once, i.e., when requested. The power consumption in such two-dimensional array structures can otherwise be quite severe [34] .
The biggest disadvantage of the suggested approach is the fact that correct operation relies on delay matching. It must be guaranteed that the computation taking place in one row of full adders never "overtakes" the computation taking place in the row of full adders above it. The circuit implementation of the full adder is therefore critical and was carefully developed. Fig. 10 shows a transistor diagram of the full adder used throughout the data path, as well as a detailed view of the top row of full adders in Fig. 9 . As can be seen, the operand and sum bits are standard single-rail signals whereas the carry is a dual-rail signal. The dual-rail carry enables a strictly sequential evaluation order. To minimize the power consumption associated with the highly active carry signals, dynamic domino logic was used for the circuit implementation. This type of logic is well suited for dual-rail encoded signals [31] , [33] , and it reduces the load on signals to a minimum since all evaluation can be carried out in NMOS logic. The precharge is carried out by a single PMOS transistor, and to minimize the time required for the precharge operation all full adders are precharged in parallel.
C. The Full Adder
The chosen circuit implementation of the full adder is derived from [35] and is shown in Fig. 10(b) . To minimize the variation in propagation delay two dummy transistors are inserted to make the number of transistors from the output node to ground identical for any path in the evaluation logic. Note also that the placement of the carry transistors was chosen so that only one transistor loads the carry signals in each of the two circuits (sum and carry).
D. Slicing the Data Path
Section III-C explained briefly how the slicing and tagging used in the data path worked. This section explains the details and shows that the associated circuit overhead is insignificant.
Consider the adder and the multiplier of the data path in Fig. 2 . Adding two "small" numbers can only extend the result one bit beyond the slicing point of the adder, and all coefficients in the algorithm have a magnitude in the range [0; 0.5] which corresponds to shifting the operand at least one position toward the LS bits. For that reason, the combined add-multiply operation never causes an overflow, and the tag appended to the multiplier output is "big," only if one of the inputs to the adder is "big." This requires only an OR-gate, as shown in Fig. 11 .
Overflow may occur in the accumulator, and the tagging logic is therefore slightly more complex. Also, additional multiplexers are needed at the operand input of the MS part in order to extend the sign of the LS part into the MS part when overflow occurs. Fig. 12 shows the adder of the accumulator as well as the Boolean equations for the tagging logic.
The LS part of the adder is controlled directly by the request signal associated with the two operands, and the request input to the MS part is generated by the Tag Control circuit. To support this, the LS part generates a dual-rail encoded overflow signal. Whenever an overflow occurs in the LS part, or one of the operand inputs has a tag with value "big," the MS part is activated. The activating signal is also the tag of the result.
E. The Dual-Port RAM
The RAM in the filter bank is derived from a standard eight-transistor dual-port RAM cell [21]. Such a RAM typically uses differential signaling on the bit lines. If the size of the RAM is moderate, it is possible to avoid sense amplifiers and read from the RAM using only a single bit line. In order to activate only one bit line during read operations, each select transistor in the RAM cells was given separate access. The RAM was further optimized by dedicating one of the ports to reading only, thereby eliminating one bit line and the associated select transistor in the RAM cells. The resulting seven-transistor RAM cell is shown to the left in Fig. 13. Word1 corresponds to the port with both read and write capabilities and word2 corresponds to the dedicated read port.
These optimizations lead to substantial power reductions, since most operations taking place in FIR filters are read operations. When random data are stored in the RAM, the number of transitions on the bit-lines is halved.
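The halving can be illustrated with a simple count. The sketch below assumes precharged bit lines, that a differential read discharges exactly one of the two bit lines for every bit, and that a single-ended read discharges its bit line only for one of the two stored values (taken to be 0 here). These are modelling assumptions, not details confirmed by the text, and the function names are hypothetical.

```python
def differential_discharges(bits):
    """Differential read: one of the two bit lines discharges for every bit."""
    return len(bits)


def single_ended_discharges(bits, discharging_value=0):
    """Single-ended read: the bit line discharges only for one stored value."""
    return sum(1 for bit in bits if bit == discharging_value)


# A balanced pattern stands in for random data (equal numbers of 0s and 1s).
word_stream = [0, 1] * 8
ratio = single_ended_discharges(word_stream) / differential_discharges(word_stream)
```

With equally likely zeros and ones, `ratio` comes out at one half, which matches the reduction stated above.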
In order to detect when the read and write operations have completed, one of the RAM cells in a word is a traditional eight-transistor dual-port RAM cell, as shown to the right in Fig. 13. The two (differential) bit lines from that RAM cell behave according to the four-phase dual-rail protocol, and completion is easily detected from the output of this cell alone: it is simply assumed that all the other bits (from the seven-transistor RAM cells) arrive at practically the same time. In total, fewer than 2% extra transistors are added to the RAM in order to make it self-timed and accommodate the slicing.
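Completion detection on a single dual-rail pair can be sketched as follows. The sketch uses the usual logical encoding of the four-phase dual-rail protocol, in which (0, 0) is the empty state and exactly one asserted rail signals valid data; the physical polarity on precharged bit lines may be inverted, and the names are illustrative.

```python
EMPTY = (0, 0)  # spacer: neither rail asserted


def is_valid(rails):
    """A dual-rail pair carries data when exactly one rail is asserted."""
    true_rail, false_rail = rails
    return (true_rail ^ false_rail) == 1


def completion_trace(cell_states):
    """Return a done flag for each successive state of the completion cell."""
    return [is_valid(state) for state in cell_states]


# One four-phase read cycle: empty, valid data (a stored 1), back to empty.
trace = completion_trace([EMPTY, (1, 0), EMPTY])
```

Only the one eight-transistor cell per word is observed; the remaining bits are assumed to settle at practically the same time, as stated above.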
F. Discussion
The data path consists entirely of dynamic logic. This is truly remarkable in a two-dimensional structure of full adders, and it is only possible due to the completion indication found in the carry signal. No evaluation sequence can be guaranteed in an implementation with single-rail signals. A standard synchronous implementation using domino logic would therefore only be possible if separate clock signals were provided for each full adder in the array, and if each clock signal guaranteed that all inputs to the full adder are valid and stable. It is therefore seen that a class of circuits can be implemented with asynchronous logic which simply cannot be implemented using traditional synchronous methods. Thus, the asynchronous data path has no synchronous equivalent, but the data path as a whole could of course be used as a module in an otherwise synchronous design.
The data path illustrates an important characteristic of asynchronous design: guaranteed average-case latency. In the filter bank, the processing of an input sample is carried out as a sequence of AMA operations. The latency of the individual AMA operations depends on the filter coefficients. Since they are constants (built into the circuit), this latency is known in advance. It follows that the total latency of a sequence of AMA operations is also constant. From a performance point of view, this means that the total latency of a sequence of AMA operations is determined by the average latency of the individual AMA operations. In a synchronous design it is not possible to exploit this. The clock period would have to be set according to the worst-case AMA operation, or a higher rate clock in combination with a (carry-save) add-shift-accumulate unit could be used. The first solution has worst-case performance, and the latter solution results in significant excess power consumption from clocking the accumulator at a higher rate. Asynchronous design offers more freedom to the designer. In the filter bank design, the benefit of this property is marginal due to the overlapping evaluation in the adders in the data path, but in other applications the benefit may well be significant.
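A small numerical illustration of this point: the per-operation latencies below are invented for the example (they are not measurements from the design), and the function names are hypothetical.

```python
def async_total_latency(latencies):
    """Self-timed sequence: each operation takes only its own latency."""
    return sum(latencies)


def sync_total_latency(latencies):
    """Clocked sequence: every operation is given the worst-case period."""
    return len(latencies) * max(latencies)


# Hypothetical AMA latencies in nanoseconds, fixed by the filter coefficients.
ama_latencies_ns = [10, 14, 10, 22, 12]

async_ns = async_total_latency(ama_latencies_ns)   # 68
sync_ns = sync_total_latency(ama_latencies_ns)     # 110
```

Because the latencies are constants, the asynchronous total is equally predictable; it simply equals the number of operations times the average latency rather than times the worst case.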
Finally, a word on handshake protocols. At the module level, the design uses the four-phase bundled-data protocol, but locally, inside the RAM modules and inside the data path, dual-rail signaling is used. The primary reason for this is a power efficient circuit implementation, but it also has the advantage that completion detection becomes possible. In the filter bank this module level completion detection comes at basically no cost. The RAM modules are inherently dual-rail and the asynchronous overhead is only 2%. In the data path, the carry out of the most significant position of the accumulator directly indicates completion. Such a hybrid use of handshake protocols is typical in the design of efficient asynchronous circuits.
VI. RESULTS
To compare synchronous and asynchronous design techniques, both the asynchronous IFIR filter bank and its synchronous counterpart have been fabricated and tested. This section reports on the physical implementation of the two designs and the measured power consumption. It also includes a breakdown of the sources of power consumption in order to provide more insight into where and how power is saved.
A. Physical Implementation of the Two Chips
The two designs were fabricated in pairs on the same wafer in a standard 0.7-µm CMOS technology with transistor threshold voltages V and V. Die micrographs of the two chips are shown in Figs. 14 and 15.
The layout of the synchronous design was provided by Oticon, Inc. The layout was generated automatically using standard cells and a single-port RAM generator. From the chip micrograph in Fig. 14, it is seen that all logic is placed in one block of standard cells at the bottom of the chip and that the RAM at the top of the chip has been divided into four blocks. Consequently, several IFIR filters are mapped onto each of the four RAM's. The chip contains 48 000 transistors and the size of the core is 3.6 × 2.7 mm (excluding pad cells). The transistors in the standard cells are scaled individually with small transistors for logic and larger transistors in the output drivers. As in the asynchronous design, the RAM blocks are laid out using custom-made generators and the dimensions of the transistors are close to those used in the asynchronous design. In total, the layout of the synchronous design constitutes a good low-power design.
The layout of the asynchronous design involved more manual work. We developed a number of dedicated asynchronous standard cells. Examples of these are: 1) a set of Muller C-elements; 2) the precharged dual-rail-carry full adder described in the previous section; 3) various cells for building one-hot counters; and 4) a couple of latch controllers. Furthermore, an asynchronous dual-port RAM layout generator was developed (cf. Section V-E). This generator sizes transistors according to the number of words and the capacitive loading of the bit lines in order to minimize power consumption without compromising the rise and fall time of signals. The result of this effort is a highly customized and optimized set of building blocks.
In order to be able to control the delay matching (request versus data) at the many bundled-data handshake interfaces, cells and modules were assembled, placed, and routed manually. Tools for this task do exist; they were just not available in this (university) project. For this reason, the layout is not very dense. The chip micrograph is shown in Fig. 15 . It consists of a data path and eight IFIR modules (one is shared by two IFIR filters). The IFIR modules are easily identified from their dense and regular RAM blocks. As explained previously, they consist of a RAM block and the associated address sequencing logic. The core of the design contains approximately 70 000 transistors, and the size of the design is 4. 
B. Power Measurements
Two different test vector sets were developed to measure the power consumption: one set to activate only the least significant slice of the asynchronous chip (typical data), and the other to activate both slices. Both chips were functionally correct and operational at the required sampling rate, down to a supply voltage of 1.55 V. Table 2 shows the measured power consumption of the cores. As expected, the power consumption in the asynchronous design depends on the input data. Furthermore, the table shows that the asynchronous design has a remarkably low power consumption: 4.0-5.5 times lower than the synchronous design.
Table 2 Power Consumption of the Two Chips
Table 3 Breakdown of the Power Consumption in the Asynchronous Design for Worst-Case Data (>50 dB)
In order to provide more insight into these matters, Table 3 shows the breakdown of the power consumption in the asynchronous design. This breakdown is based on HSPICE simulations of extracted layout, calibrated to match the measured power consumption in the data path and in the IFIR modules.
The power consumption of the IFIR dual-port RAM's and the AMA array is directly proportional to the degree of slicing of the data path, whereas the power consumption in the other modules is independent of this. The slicing reduces power consumption by 30% in the typical case, and it is the single most important factor contributing to the reduced power consumption.
For worst-case data, the difference between the two designs is a factor of four. A closer look at the two designs shows that this difference is found across all modules in the two designs: approximately 1 : 4 in the RAM's and the data path, and 1 : 6 in the control logic. For the RAM's the optimizations explained in Section V-E explain the 1 : 4 power reduction, and the same techniques could be exploited in a synchronous design.
The power savings that can be attributed to the asynchronous techniques stem from the hierarchical one-hot address sequencing logic, the AMA, and the slicing of the data path. Altogether, this accounts for a 1 : 4.3 difference in power consumption for typical-case input data.
VII. CONCLUSION
The paper explores the use of asynchronous circuit techniques in the design of digital signal-processing circuits with low power consumption. The vehicle for this study is a seven-band IFIR filter bank. This circuit constitutes a major part of the fully digital hearing aid, DigiFocus, manufactured by Oticon, Inc., an industrial application where low power consumption is of paramount importance.
The asynchronous re-implementation and the existing synchronous counterpart were fabricated in the same 0.7-µm CMOS technology. The synchronous design contains 48 000 transistors and its power consumption is approximately 470 µW. The asynchronous design contains 70 000 transistors and its power consumption is 85 µW when processing input data corresponding to a sound pressure level less than 50 dB.
This fivefold power reduction is a strong argument in favor of asynchronous design, but a note of warning is appropriate: it is difficult to make a fair comparison of different design methods based on quantitative power figures for the resulting circuits. First, such a comparison is only valid for the particular benchmark circuit considered. Second, if the benchmark circuit is too small it may well be biased and favor one method, and if it is too complex, many factors other than the design methods themselves may offset the result. The IFIR filter bank considered in this paper represents a nontrivial yet moderately sized circuit. The synchronous design and its asynchronous counterpart use the same basic architecture: some RAM blocks and a single data path. At the layout level, the transistor sizing and standard-cell style are also comparable. Therefore, the two designs do allow the conclusion that asynchronous circuits are advantageous for implementation of low-power signal processing circuits.
Having said this, the most important contribution of the paper is that it demonstrates how asynchronous design techniques offer flexibility and freedom to exploit low-level data dependencies in the algorithm and obtain significant power savings. This insight is particularly interesting because, at first sight, the application seemed to be an obvious candidate for a synchronous implementation due to the fixed sampling rate and the fixed number of steps in the algorithm.
The asynchronous design exploits the fact that typical real-life audio signals are dominated by numerically small samples, and it adapts the number range to the actual need. This is implemented by slicing the data path and the RAM's and by using a tagging and overflow detection scheme that only activates the most significant slice when it is necessary. This alone accounts for 30% of the power reduction, making it the single most important measure.
The asynchronous design uses two slices to demonstrate the principle, and the 50-dB slicing point basically distinguishes between background noise and "real" audio signals. The idea could easily be extended to more than two slices leading to additional power savings. Because of the asynchronous implementation, the slices can have arbitrary sizes.
The slicing reduces power consumption by minimizing the switching activity in the circuit, but it also reduces the data path latency in the typical case. At the same time, the fixed sampling rate results in a worst-case design where response time must be guaranteed. This combination gives room for additional and significant power savings by adaptive scaling of the supply voltage. This option has not been pursued in the present design because of the very low nominal supply voltage in the hearing aid.
The transistor count of the asynchronous design is 45% higher than that of the synchronous design. This is due to the distributed organization of the address sequencing logic and the higher number of RAM modules, each requiring its own control and precharge logic. Such a tradeoff between area (transistor count) and power consumption is typical for low-power design in general.
Finally, it must be said that synchronous design and asynchronous design are not opposites. They are alternatives to the designer. Both have advantages and disadvantages, and most circuits involve both synchronous and asynchronous parts. The balance seems to be shifting toward more asynchronous circuitry, and this paper represents a contribution toward a better understanding of where and how asynchronous techniques can be exploited to advantage.
In 1997, he joined the hearing-aid company Oticon, Inc., Hellerup, Denmark, and his main research interests are DSP applications with a special focus on low-power design and asynchronous circuit design.
Jens Sparsø (Member, IEEE) was born in Silkeborg, Denmark, in 1955. He received the M.Sc. degree in electrical engineering from the Technical University of Denmark, Lyngby, in 1981.
Since 1982, he has been with the Department of Computer Science at the Technical University of Denmark, where he became an Associate Professor in 1986. He teaches courses on very large scale integration (VLSI), digital systems design, and computer architecture, and his research interests are architecture and design of VLSI systems, i.e., design methods, circuit techniques, and the interplay between technology and system architecture. This has included the design and implementation of several error-correcting decoders for telecommunication applications. For the past eight years, his main focus has been on design of asynchronous circuits and circuits with low power consumption. He spent the 1995-1996 academic year as a visiting Associate Professor with the Computer Science Department, University of Utah, Salt Lake City.
Prof. Sparsø is on the steering committee of several conferences in the area, and he has given a number of tutorials on asynchronous circuit design at European conferences.
