Abstract: Energy-efficient design for self-timed circuits is investigated. Null convention logic is employed to construct speed-independent self-timed circuits. For error-free computation, the supply voltage automatically tracks the input data rate so that the supply voltage can be kept as small as possible while maintaining the speed requirement. For error-tolerable computation, such as soft digital signal processing, further energy saving is achieved at the cost of signal-to-noise ratio when an ultralow supply voltage is applied. Cadence simulation shows that 40 to 70% of the power can be saved by introducing −15 to −11 dB error in typical speech signal processing.
Introduction
In recent years, very large-scale integrated (VLSI) system design for low power has been of great interest, given the proliferation of portable devices and the need to extend battery lifetime and reduce cooling costs. Many low-power techniques have been developed at different design levels, such as the algorithmic level [1], architecture level [2], logic level [3], circuit level [4], and device level [5]. Scaling of the supply voltage is one of the most effective approaches in low-power design because the power dissipation of a CMOS digital circuit is typically proportional to the square of the supply voltage $V_{DD}$ [6]. Although reducing the supply voltage leads to an increase in circuit delays, the increased delays are allowable as long as the circuit still meets the speed requirement.
A variable supply voltage scheme [7] has been developed for synchronous systems, where special attention is paid to increased delays due to the complicated clock distribution. This scheme includes a large logic control circuit, which consumes significant power and occupies a large chip area, so it is not suitable for smaller circuits. A technique that combines self-timed circuitry with adaptive supply voltage scaling was proposed in [8]. However, this technique needs FIFO buffers, and the minimal size of the buffers depends on the data rate. Also, the power dissipated by the buffers reduces the benefit of adaptive scaling.
It is possible that the available energy is not sufficient to support a real-time computing mission even if all possible low-power techniques are applied. In that case it is highly desirable to trade off some quality for operating time. This trade-off was implemented by scaling the supply voltage beyond the requirement imposed by the critical path delay in synchronous digital signal processing (DSP) [9], referred to as 'soft DSP'. This technique is limited in synchronous circuits as follows: (i) the applied circuit must have data-dependent delay; (ii) the probability of erroneous outputs depends not only on the supply voltage, but also on the distribution of input patterns. A high probability of erroneous outputs may degrade the accuracy of the final outputs to an unacceptable degree.
Asynchronous circuits can achieve low-power operation. They generally have transitions only where and when involved in the current computation [10]. Some asynchronous implementations inherently eliminate glitches, therefore decreasing energy consumption. Furthermore, a speed-independent circuit is an asynchronous circuit that functions correctly regardless of gate delay, with the delays of wires or interconnections being assumed negligible or zero [11]. This gate delay insensitivity gives the circuit further opportunities to scale the supply voltage for low-power operation [12, 13]. This paper investigates the implementation of low-power operation for speed-independent circuits by scaling the supply voltage. Null convention logic (NCL) [14, 15] is employed to construct these speed-independent circuits. The main contributions of this paper are: (i) a novel supply voltage scheme for error-free computation; and (ii) the implementation and thorough analysis of the energy-quality trade-off for error-tolerable computations in speed-independent circuits.
Overview of null convention logic
NCL uses symbolic completeness of expression to achieve speed-independent behaviour [14]. A symbolically complete expression is defined as an expression that depends only on the relationships of the symbols presented in the expression without reference to the time of evaluation. In general, a multi-rail signal can be used to incorporate data and control information into one mixed signal path to eliminate the time reference, and therefore to form a symbolically complete expression. Typically, a dual-rail signal D consists of two wires, D0 and D1, which represent a value from the set {Data0, Data1, Null}, shown in Fig. 1a. The signal values on a single wire are referred to as 'Data' and 'Null' for the high voltage level and the low voltage level respectively. The M-of-N threshold gate has two important properties: threshold behaviour and hysteresis behaviour. The threshold behaviour requires that the output becomes Data if at least M of the N inputs have become Data. The hysteresis behaviour requires that the output changes only after a sufficiently complete set of input values has been established. In the case of a transition to Data, the output remains at Null until at least M of the N inputs become Data. In the case of a transition to Null, the output remains at Data until all N inputs become Null. An N-of-N gate is an N-input Muller C-element, while a 1-of-N gate corresponds to an N-input OR gate. As an example, the schematic of a static 2-of-3 gate is shown in Fig. 1b. It includes four blocks (Go to Null, Go to Data, Hold Null, Hold Data), one inverter and two feedback transistors (pMOS for Hold Null, nMOS for Hold Data). Transistor-level design of the M-of-N threshold gates can be found in [16].
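The threshold and hysteresis rules described above can be illustrated with a small behavioural model. This is only a sketch: the class name `ThresholdGate` and its interface are hypothetical, inputs are modelled as booleans (True for Data, False for Null), and no timing or transistor-level behaviour is represented.

```python
class ThresholdGate:
    """Behavioural sketch of an M-of-N threshold gate with hysteresis.

    The name and interface are illustrative assumptions, not taken from
    the paper. True stands for Data, False for Null.
    """

    def __init__(self, m, n, initial=False):
        if not 1 <= m <= n:
            raise ValueError("need 1 <= m <= n")
        self.m = m
        self.n = n
        self.output = initial

    def update(self, inputs):
        """Apply N input values and return the resulting output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(1 for x in inputs if x)
        if asserted >= self.m:
            self.output = True    # at least M inputs are Data: go to Data
        elif asserted == 0:
            self.output = False   # all N inputs are Null: go to Null
        # Otherwise the gate holds its previous output (hysteresis).
        return self.output


gate = ThresholdGate(2, 3)
trace = [gate.update(v) for v in (
    [True, False, False],    # one Data input: output stays Null
    [True, True, False],     # two Data inputs: output becomes Data
    [False, True, False],    # not all inputs Null: output holds Data
    [False, False, False],   # all inputs Null: output returns to Null
)]
print(trace)
```

With M = N the same model behaves like a Muller C-element, and with M = 1 it behaves like an OR gate, matching the two special cases mentioned in the text.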
A typical NCL circuit structure forms a pipeline in which each stage consists of a register, a completion detection circuit and a combinational circuit [14]. Threshold gates can be used to build these blocks [17]. An NCL full adder is shown in Fig. Figure 3 illustrates the gate-level structures of the register and completion detection circuit. Th22nx0 (Th22dx0) is a 2-of-2 threshold gate that is initialised to Null (Data). Th12bx0 is a 1-of-2 threshold gate followed by an inverter. Th22x0 is a 2-of-2 threshold gate without an initialisation control. The completion detection in Fig. 3 is essentially a serial process and will therefore limit performance when the data width is large. An alternative completion detection scheme [18], which is a parallel process, can speed up the completion detection operation significantly.
A pipeline, which receives data from a synchronous system such as an A/D converter, is shown in Fig. 4a. If the request signal $r_2$ from the next stage is high to request Data, then Data are allowed to pass through the current register. After a complete set of Data has passed through the current register, the request signal $r_1$ becomes low to request Null from the previous stage, which means the computation in the current stage is finished and the circuit needs to be reset. A similar operating mode exists when the request signal $r_2$ from the next stage is low to request Null. The rate of input data from the synchronous system is limited by the speed of the pipeline. Assuming that the pipeline has been optimised so that each stage has the same delay, the timing constraint for error-free computations is illustrated in Fig. 4b, where $T_{data}$ is the input Data cycle, $D_{data}$ is the propagation delay of Data from register 1 to register 2, which includes the delays of two registers, the data path and completion detection, and $D_{null}$ is the propagation delay of Null from register 1 to register 2. The sum of $D_{data}$ and $D_{null}$ must be no more than $T_{data}$, i.e. a complete set of Data (or Null) must arrive after the corresponding request signal. Otherwise, the speed-independent circuit will miss some input data. Therefore, the allowed maximum input data rate is given by
$$f_{max} = \frac{1}{D_{data} + D_{null}} \tag{1}$$
Usually a speed margin is needed to guarantee that the speed-independent circuit works correctly.
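As a rough numerical illustration of the timing constraint, the sketch below computes the highest input data rate for which the sum of the Data and Null delays still fits within one input Data cycle. The function name, the `margin` parameter (a fractional speed margin) and the example delays are assumptions made for illustration only.

```python
def max_data_rate(d_data, d_null, margin=0.0):
    """Return the highest input data rate (Hz) that satisfies
    d_data + d_null <= T_data, with an optional fractional speed margin.

    d_data and d_null are propagation delays in seconds. The margin
    argument is a hypothetical way to model the speed margin.
    """
    if d_data <= 0 or d_null <= 0:
        raise ValueError("delays must be positive")
    shortest_cycle = (d_data + d_null) * (1.0 + margin)
    return 1.0 / shortest_cycle


# Hypothetical delays of 2.5 ns each give a 5 ns cycle, i.e. about 200 MHz.
f_max = max_data_rate(2.5e-9, 2.5e-9)
# A 10% margin lowers the usable rate accordingly.
f_safe = max_data_rate(2.5e-9, 2.5e-9, margin=0.10)
print(f_max, f_safe)
```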
Adaptive supply voltage scheme for error-free computation
When the input data rate is less than the allowed maximal input data rate given by (1), a feedback circuit can be designed to provide the data path with a lower supply voltage, as long as the delay of the data path meets the timing constraint in Fig. 4b. The feedback circuit is implemented as shown in Fig. 5. It consists of a completion detector, a D flip-flop and a DC-DC buck converter [19]. The completion detector could be constructed using C-elements. The output of the detector is high when a complete set of Nulls arrives and low when a complete set of Data arrives. Otherwise, the output does not change. A high output of the D flip-flop implies that the data path is waiting for the data input, i.e. that the data path works faster than required. Therefore, the supply voltage is allowed to decrease, and vice versa. A delay element is used to provide a margin of safety for the supply voltage. However, the advantage of this scheme will be offset if the delay is too large. In the following simulation, the delay is set to 1 ns. Based on Cadence simulation, typical waveforms in closed-loop voltage control are shown in Fig. 6. Initially, the maximum voltage $V_{DD}$ is applied to the self-timed data path by setting $M_1$ on and $M_2$ off. Data and Null are input to register 1 alternately at a constant rate below the allowed maximum data rate. After one Data cycle, the output of the D flip-flop becomes high because the request signal arrives at register 1 earlier than the corresponding Data does, which means that the data path operates too fast and that the supply voltage may decrease to save power. $V_x$ becomes low when $V_{out}$ is reduced to a specific value, which means that the data path operates too slowly and that the supply voltage needs to increase to ensure adequate speed. When the circuit is stable, the $V_x$ waveform should be a pulse signal with an average duty cycle P associated with the input data rate.
The output of the buck converter is a rough DC voltage with a small ripple, and the DC component is given by
$$V_{out} = P \cdot V_{DD} \tag{2}$$
The frequency of $V_x$ depends mainly on the input data rate and the delay characterisation of the data path. To achieve small ripple on the output voltage, the values of L and C are chosen so that the LC frequency constant $f_0 = 1/(2\pi\sqrt{LC})$ is much smaller than the frequency of $V_x$. On the other hand, increasing L and/or C will result in a longer time for the circuit to track the input data rate. Thus, choosing the values of L and C requires making a trade-off between ripple and transient performance.
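The two quantities discussed above, the DC component of the converter output and the LC frequency constant, can be evaluated with a few lines of code. The sketch assumes an ideal buck converter, for which the DC output is the duty cycle times the input voltage; the component values (10 µH, 1 µF), the duty cycle and the function names are hypothetical and are not taken from the paper.

```python
import math


def buck_dc_output(v_dd, duty):
    """DC output of an ideal buck converter (assumption: no losses)."""
    if not 0.0 <= duty <= 1.0:
        raise ValueError("duty cycle must lie in [0, 1]")
    return duty * v_dd


def lc_frequency_constant(inductance, capacitance):
    """Return f0 = 1 / (2*pi*sqrt(L*C)) in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))


# Hypothetical values for illustration only.
v_out = buck_dc_output(3.3, 0.5)          # half duty cycle at V_DD = 3.3 V
f0 = lc_frequency_constant(10e-6, 1e-6)   # roughly 50 kHz
print(v_out, f0)
```

Larger L or C lowers `f0`, which reduces ripple but lengthens the time needed to track a change in data rate, which is the trade-off described in the text.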
A 4 × 4-bit NCL Baugh-Wooley multiplier [20] is designed as the data path to demonstrate the energy saving effectiveness of the proposed scheme. The simulation is based on 0.18 µm CMOS technology. The dependences of the supply voltage $V_{out}$ and of the energy consumed by the multiplier on the data rate are plotted in Fig. 7a and Fig. 7b, respectively. When the supply voltage $V_{DD}$ = 3.3 V is applied to the data path, the allowed data rate is approximately 180 MHz. The supply voltage adaptively decreases as the input data rate goes down. Fig. 7b shows that significant energy can be saved when the input sample rate is relatively small.
Ultralow supply voltage operation for soft DSP
The gate delay insensitivity provides a speed-independent circuit with further low-power potential when an ultralow supply voltage (ULSV) is applied to the circuit. A supply voltage is termed ULSV if it is so small that the circuit cannot work fast enough to process the input data stream. When a ULSV is applied to a speed-independent circuit, further energy saving can be achieved while some errors are introduced because the constraint defined by (1) is violated. The following analysis shows that under the condition of ULSV, the speed-independent circuit would miss some input samples (Data or Null), and the outputs corresponding to the inputs not missed are always correct. In other words, an output is either lost or delivered correctly. In DSP systems, the missed samples lead to an equivalent signal-to-noise loss, which is why this mode of operation is referred to as soft DSP. This ultralow supply voltage scaling is particularly useful in systems with highly sequential algorithms that perform a large number of computation steps per data sample [8].
For simplicity the following assumptions are made:
(i) The input data rate is fixed, and the duration of Data is equal to that of Null, i.e. $T_{data} = T_{null} = 0.5\,T_{data+null}$.
(ii) The Data delay is the same as the Null delay, i.e. $D_{data} = D_{null}$.
This assumption requires that the rise time of the circuit is equal to its fall time.
(iii) $D < T_{data+null}$ so that no two consecutive samples are missed. This assumption ensures a miss rate of no more than 50%.
Figure 8 shows the implementation of soft DSP using NCL circuits. Register 1 is a modified register. This modification ensures that no single Data or single Null is missed, which means that only (Data, Null) pairs are missed, if any. Since no consecutive Data samples are missed under assumption (iii), one flag bit is attached to the data bus for miss detection. This flag bit is assigned Data0 and Data1 alternately for consecutive Data samples, and passes from the input register to the output register without processing. If two consecutive outputs have the same flag, Data0 (or Data1), an output must have been missed between them. The missed output is estimated by interpolation based on the outputs delivered by the speed-independent circuit. A linear interpolation method is adopted in this paper: the average of two consecutive outputs with the same flag is the estimate of the missed output.
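The flag-based miss detection and linear interpolation described above can be sketched in software as follows. This is an illustrative model, not the hardware of Fig. 8: the function name `recover_missed_outputs` and the representation of each delivered output as a `(flag, value)` tuple with flag 0 for Data0 and 1 for Data1 are assumptions.

```python
def recover_missed_outputs(delivered):
    """Rebuild an output stream from delivered (flag, value) pairs.

    Flags alternate between 0 and 1 on consecutive input samples, so two
    consecutive delivered outputs with the same flag indicate that exactly
    one output was missed between them. The missed value is estimated as
    the average of its two neighbours (linear interpolation).
    """
    recovered = []
    previous = None
    for flag, value in delivered:
        if previous is not None and previous[0] == flag:
            # Same flag twice in a row: insert the interpolated estimate.
            recovered.append((previous[1] + value) / 2.0)
        recovered.append(value)
        previous = (flag, value)
    return recovered


# The sample between 2.0 and 4.0 was missed, so both carry flag 1.
stream = [(0, 1.0), (1, 2.0), (1, 4.0), (0, 5.0)]
print(recover_missed_outputs(stream))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

A single flag bit is sufficient only because assumption (iii) rules out two consecutive misses; detecting longer gaps would need a wider counter.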
From Fig. 9 it can be observed that the time difference $\Delta t$ accumulates until a (Data, Null) pair is missed, where $\Delta t$ is defined by
In fact, a (Data, Null) pair is missed whenever $\Delta t$ accumulates to $0.5\,T_{data+null}$. Thus, let n be defined by
where $\lfloor x \rfloor$ is the floor function of x. Note that $n \geq 1$ due to assumption (iii). If one (Data, Null) pair is missed after an average of k pairs of (Data, Null) are delivered at the output, then the miss rate of (Data, Null) pairs is defined by
$$R_m = \frac{1}{k + 1} \tag{5}$$
where k is given by
Note that k is not necessarily an integer. Obviously, the miss rate is a two-dimensional function of the input data rate and the circuit delay (or supply voltage). By defining the input data rate f as the reciprocal of the Data-Null cycle $T_{data+null}$, replacing D in (3) by $D(V_{dd})$, and combining (3), (4), (5), and (6), the miss rate in (5) can be rewritten as
where
$$D(V_{dd}) = \frac{C_L V_{dd}}{\beta\,(V_{dd} - V_t)^{\alpha}} \tag{8}$$
$C_L$ is the total node capacitance, $\beta$ is the gate transconductance, $V_t$ is the device threshold voltage, and $\alpha$ is the velocity saturation index.
The circuit delay is estimated by (8) with good accuracy [21]. Since a speed-independent circuit exhibits average-case delay, instead of worst-case delay, $D(V_{dd})$ in (7) is an average delay in real operating environments.
As an example, the miss rate $R_m(V_{dd}, f)$ for a chain of eight full adders is plotted in Fig. 10, where the plane $(V_{dd}, f)$ is partitioned into different regions by a family of curves, and each region between any two neighbouring curves is defined by the miss rate of a (Data, Null) pair. Given an input data rate, the supply voltage can be reduced significantly by allowing a tolerable miss rate. Similarly, given a supply voltage, the maximal input data rate can be increased by allowing a tolerable miss rate. The curve 'critical $V_{dd}$' shows the minimum supply voltage for error-free operation.
A 4 × 4-bit NCL Baugh-Wooley multiplier is investigated when it is employed to process speech signals. A typical speech signal y(n) and its spectrum without missing data are plotted in Fig. 11. The output $\hat{y}(n)$ from the interpolation includes the ideal output signal y(n) and the error signal e(n), expressed by
$$\hat{y}(n) = y(n) + e(n) \tag{9}$$
The magnitude of the interpolation error, normalised to the ideal output signal, is defined by
$$E = 10 \log_{10}\!\left(\frac{\sigma_{error}^2}{\sigma_y^2}\right)\ \text{dB} \tag{10}$$
where $\sigma_{error}^2$ is the variance of the error e(n) and $\sigma_y^2$ is the variance of the ideal output signal y(n). Figure 12 shows the estimation error versus the reciprocal of the miss rate for this speech signal. The error depends on the signal bandwidth, interpolation method, and miss rate.
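The normalised error can be computed from an ideal sequence and its interpolated estimate as sketched below. The sketch assumes that the error is expressed in decibels as ten times the base-10 logarithm of the variance ratio, which matches the dB figures quoted in the text; the function names and the test sequences are hypothetical.

```python
import math


def variance(samples):
    """Population variance of a non-empty sequence of numbers."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)


def normalised_error_db(ideal, estimated):
    """Error variance relative to the ideal signal variance, in dB.

    Assumes the dB value is 10*log10(var(e) / var(y)), where
    e(n) = estimated(n) - ideal(n). The error must not be constant,
    otherwise its variance is zero and the logarithm is undefined.
    """
    if len(ideal) != len(estimated):
        raise ValueError("sequences must have equal length")
    errors = [e - y for y, e in zip(ideal, estimated)]
    return 10.0 * math.log10(variance(errors) / variance(ideal))


# Hypothetical data: the error amplitude is one tenth of the signal
# amplitude, so the variance ratio is 0.01, i.e. about -20 dB.
y = [1.0, -1.0, 1.0, -1.0]
y_hat = [1.1, -1.1, 1.1, -1.1]
print(normalised_error_db(y, y_hat))
```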
The delay of the 4 × 4-bit NCL multiplier is modelled by (8), where $\alpha = 1.1967$, $C_L/\beta = 0.899 \times 10^{-9}$ F·V²/A and $V_t = 0.75$ V, based on 0.18 µm CMOS technology. If the above speech signal passes through the multiplier with a ULSV, a miss rate exists for a specific supply voltage and input data rate. Given an input data frequency, the reciprocal of the miss rate is plotted in Fig. 13 as a function of the supply voltage. The reduction in power dissipation is characterised by the power savings (PS), defined as
$$PS = \frac{P_{critical} - P_{ULSV}}{P_{critical}} \times 100\% \tag{11}$$
where $P_{critical}$ is the power dissipation at $V_{dd} = V_{critical}$, and $P_{ULSV}$ is the power dissipation when $V_{dd}$ is less than $V_{critical}$. Neglecting the power dissipated by the error compensation circuit, the curves of power savings due to ULSV are plotted in Fig. 14 for input data rates of 200, 400, and 600 MHz, respectively. 40 to 70% of the power can be saved by introducing −15 to −10 dB error, which is tolerable in many DSP applications.
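The fitted parameters quoted above can be plugged into a delay model to see how quickly the delay grows as the supply voltage approaches the threshold voltage. The sketch below assumes the commonly used alpha-power-law form, delay = (C_L/β)·V_dd/(V_dd − V_t)^α, which may differ in detail from the paper's model; the function names and the example power values are hypothetical. Power savings are computed as the relative reduction with respect to the power at the critical voltage.

```python
ALPHA = 1.1967           # velocity saturation index (from the text)
CL_OVER_BETA = 0.899e-9  # C_L / beta (from the text)
V_T = 0.75               # threshold voltage in volts (from the text)


def delay(v_dd):
    """Delay in seconds under the assumed alpha-power-law form."""
    if v_dd <= V_T:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return CL_OVER_BETA * v_dd / (v_dd - V_T) ** ALPHA


def power_savings(p_critical, p_ulsv):
    """Relative power reduction in percent with respect to p_critical."""
    if p_critical <= 0:
        raise ValueError("p_critical must be positive")
    return 100.0 * (p_critical - p_ulsv) / p_critical


for v in (3.3, 2.0, 1.5, 1.0):
    print(v, delay(v))

# Hypothetical power figures: 10 mW at the critical voltage, 4 mW at ULSV.
print(power_savings(10.0, 4.0))  # 60.0
```

Under this assumed form the delay at 3.3 V comes out slightly below 1 ns and rises steeply below about 1.5 V, which is the region where a tolerable miss rate buys the largest voltage reduction.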
Summary
Low power operation can be achieved for a data path using self-timed design and an adaptive supply voltage scheme. The scheme proposed here consists of a simple logic circuit and buck converter. The handshake signals in a self-timed pipeline are employed to track the input data rate automatically, and thus to keep the supply voltage of the data path as small as possible. However, further work is needed to investigate the dynamic tracking performance.
We have also presented a low-voltage, low-power design for DSP applications. This approach exploits the robustness of a self-timed circuit to ultralow supply voltage to achieve significant power saving. The effectiveness of this approach is demonstrated by miss rate analysis and a DSP case study. However, the bandwidth and signal-to-noise ratio of the input signal limit the accuracy of error correction. On the other hand, the accuracy may be improved by a smaller miss rate and/or a more advanced interpolation method, such as linear prediction based on multiple samples and data autocorrelation.
