Abstract

We introduce new architectural optimizations for low-power asynchronous systems, such as the Tangram-based systems of van Berkel et al. Our goal is to reduce power consumption by improving system concurrency. We introduce two new sequencer designs, with greater concurrency than existing ones, that provide the opportunity for substantial power savings through voltage scaling. To safely accommodate this added concurrency, new latch designs are presented, for both dual-rail and single-rail implementations.
level Tangram programs. The programs are compiled, using syntax-directed translation, into handshake circuits, an intermediate-level representation of a circuit as a network of communicating processes. Every process is mapped to a circuit element in a self-timed library of modules. Such systems are macromodular, since they are constructed by combining modules into a working system. Macromodular circuits are robust and usually have few timing assumptions.
The goal of this paper is to present architectural-level optimizations for low-power asynchronous macromodular systems, such as those of van Berkel [13, 14]. In these systems, sequencing control and its interaction with the datapath are critical. Our goal is to increase the level of concurrency in the sequencing of data processing actions. This increased concurrency must be achieved without increasing the switching activity required for the computation (otherwise power consumption could increase).
In particular, we present the following new contributions. First, we introduce two new designs for asynchronous sequencers. Each design increases the concurrency of the datapath operations in the entire system. Second, we show that existing asynchronous datapaths will not operate correctly at this level of concurrency. We therefore modify the datapath to ensure correct operation. Specifically, we introduce new designs for asynchronous latches and multiplexers that handle concurrent operation safely in (a) "dual-rail" datapaths, and (b) "single-rail" datapaths (described below).
For dual-rail datapaths, our new components allow roughly twice the throughput of existing sequential designs. In this case, after voltage scaling, energy is reduced to less than one-half. For single-rail datapaths, two different schemes are referenced. Our new components result in twice the throughput of the first scheme, and roughly the same performance as the second one. However, our simpler approach has advantages over the latter in (i) ease of design and (ii) glitch avoidance in the datapath.

Organization of the paper. The paper is organized as follows. Section 2 reviews background on power consumption and asynchronous circuits. In Section 3, existing sequencers are examined, and two new concurrent sequencer designs are introduced. Section 4 introduces new latch and multiplexer designs, for dual-rail datapaths, to handle the increased concurrency. Similar modifications for single-rail datapaths are introduced in Section 5. Section 6 presents results of analysis and SPICE runs, and Section 7 presents conclusions.
Overview

2.1
There are three major sources of power consumption in CMOS circuits. Switching energy is associated with transitions on gate outputs. Short-circuit energy consumption is caused by simultaneous conduction of pull-up and pull-down stacks, allowing current flow directly from the power supply to ground. Finally, leakage energy occurs in standby mode, and is determined by technology factors. In most CMOS circuits, switching power dominates the other two.

Switching power is determined by (i) the amount of switching activity that takes place, i.e., the number of transitions; and (ii) the energy consumed per transition, which is a function of the capacitance that is being (dis)charged and the supply voltage. Power consumption can be reduced by reducing the capacitance, the number of transitions, or the supply voltage. Since power depends quadratically on the supply voltage, supply voltage scaling is an especially attractive scheme for power reduction [3]. Unfortunately, voltage scaling has the undesirable effect of reducing the speed of the circuit.

Our goal is therefore to increase the concurrency, and hence the throughput, of a system. Such throughput improvement compensates for the performance penalty that results from supply voltage reduction. If the increase in performance is achieved without increasing the switching activity required for the computation, a substantial reduction in power is possible after voltage scaling, with no net loss in performance.
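The trade-off between throughput and supply voltage can be sketched numerically. The following is a minimal illustration, not the paper's analysis: it assumes switching energy of one-half C V^2 per transition and an alpha-power-law gate delay model. The threshold voltage `vt`, the exponent `alpha`, the 5 V nominal supply, and all function names are assumptions chosen for the example.

```python
def switching_energy(c_load, vdd, transitions):
    """Switching energy: each transition (dis)charges c_load at supply vdd."""
    return 0.5 * c_load * vdd ** 2 * transitions


def gate_delay(vdd, vt=0.8, alpha=2.0):
    """Assumed alpha-power-law delay model (illustrative, not from the text)."""
    return vdd / (vdd - vt) ** alpha


def scaled_vdd(vdd, speedup, vt=0.8, alpha=2.0, tol=1e-9):
    """Lowest supply whose gate delay is `speedup` times the delay at vdd."""
    target = speedup * gate_delay(vdd, vt, alpha)
    lo, hi = vt + 1e-6, vdd  # delay falls monotonically as the supply rises
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gate_delay(mid, vt, alpha) > target:
            lo = mid  # too slow at mid: the supply must be higher
        else:
            hi = mid
    return hi


# A design with twice the throughput can tolerate gates that are twice as slow.
v_nominal = 5.0
v_scaled = scaled_vdd(v_nominal, speedup=2.0)
energy_ratio = (switching_energy(1e-12, v_scaled, 1000)
                / switching_energy(1e-12, v_nominal, 1000))
```

With the same capacitance and the same number of transitions, `energy_ratio` depends only on the square of the voltage ratio, which is why unchanged switching activity matters for the argument above.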
Asynchronous Circuit Operation
In this paper, we focus on asynchronous macromodular systems. Circuits of this type are designed as a network of predefined data and control modules [2]. Instead of a global clock signal, communication channels between modules are used to synchronize their operation and data interchange. These channels can be implemented using different protocols, and different codes can be used to represent and transmit data. Two protocols are most common: dual-rail and single-rail.

• Dual-rail Data Processing. In dual-rail datapaths, data is encoded using a dual-rail code [14, 6], a delay-insensitive code that requires two wires for every data bit. A controller, frequently a sequencer, uses handshake signals $C_r$ and $C_a$ to communicate with this datapath section, using a 4-phase handshake protocol (see below). Function F is implemented using hazard-free combinational logic that operates on dual-rail input data and generates dual-rail outputs. Hazard-free operation is required because any glitch in the data wires can be interpreted as a valid data signal and produce erroneous operation.

Optimization techniques for such sequencers typically focus only on reducing the latency of phases $\phi_2$ and $\phi_4$. Our goal is to provide significant reductions in dead time by introducing concurrent operation. The sequencer can start process $P_i$ as soon as $P_{i-1}$ has finished processing. In this way, every processing phase $P_i$ is overlapped with the return-to-zero phase $R_{i-1}$ of the previous process.
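The dual-rail code described above can be made concrete with a small sketch. The wire convention used here (a true rail and a false rail per bit, with both rails low as the empty codeword) is a common one but is an assumption, since the text does not spell it out; the function names are hypothetical.

```python
# Assumed convention (not stated in the text): each bit is a
# (true_rail, false_rail) pair, and (0, 0) is the empty "spacer" codeword
# used during the return-to-zero phase.
SPACER = (0, 0)


def encode_bit(bit):
    """Encode one data bit on two wires."""
    return (1, 0) if bit else (0, 1)


def encode_word(bits):
    return [encode_bit(b) for b in bits]


def is_valid(word):
    """True when every bit pair carries data (exactly one rail high)."""
    return all(t + f == 1 for t, f in word)


def is_empty(word):
    """True when every bit pair has returned to the spacer."""
    return all(pair == SPACER for pair in word)


def decode_word(word):
    if not is_valid(word):
        raise ValueError("word is not a complete dual-rail codeword")
    return [t for t, _ in word]
```

A word in which only some pairs have left the spacer is neither valid nor empty, so a receiver can detect completion from the data wires alone. This is also why a single glitch on a rail is dangerous: it is indistinguishable from a bit arriving.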
Previous Sequencers
We now describe existing sequencers that implement sequential and concurrent protocols and indicate their limitations. The sequencer is activated on its passive port, or channel, S (a passive port is indicated by a small white circle). The sequencer then communicates on active ports P1 and P2 to activate the first and second processes, respectively (an active port is indicated by a small black circle). Channels are implemented using request and acknowledge wires ($S_r$ and $S_a$ for channel S, and $r_i$ and $a_i$ for channel $P_i$). A complete 4-phase handshake occurs on port P1, followed by a complete 4-phase handshake on port P2:

$$*(S_r\uparrow;\ r_1\uparrow;\ a_1\uparrow;\ r_1\downarrow;\ a_1\downarrow;\ r_2\uparrow;\ a_2\uparrow;\ S_a\uparrow;\ S_r\downarrow;\ r_2\downarrow;\ a_2\downarrow;\ S_a\downarrow)$$
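The sequential protocol can be checked mechanically. The sketch below assumes the handshake expansion reads Sr↑; r1↑; a1↑; r1↓; a1↓; r2↑; a2↑; Sa↑; Sr↓; r2↓; a2↓; Sa↓, and encodes rising and falling transitions as `+` and `-` suffixes. The encoding and the helper names are illustrative assumptions.

```python
# One cycle of the 2-way sequential sequencer, as read from the expansion above.
TRACE = ["Sr+", "r1+", "a1+", "r1-", "a1-",
         "r2+", "a2+", "Sa+", "Sr-", "r2-", "a2-", "Sa-"]

# Each channel is a (request wire, acknowledge wire) pair.
CHANNELS = {"S": ("Sr", "Sa"), "P1": ("r1", "a1"), "P2": ("r2", "a2")}


def channel_events(trace, req, ack):
    """Project the trace onto the two wires of a single channel."""
    return [e for e in trace if e[:-1] in (req, ack)]


def is_four_phase(trace, req, ack):
    """A 4-phase handshake is req up, ack up, req down, ack down."""
    expected = [req + "+", ack + "+", req + "-", ack + "-"]
    return channel_events(trace, req, ack) == expected


def is_sequential(trace):
    """P1 must finish its return-to-zero before P2 is requested."""
    return trace.index("a1-") < trace.index("r2+")


all_four_phase = all(is_four_phase(TRACE, r, a) for r, a in CHANNELS.values())
```

The `is_sequential` check captures exactly the ordering that the concurrent designs relax: in a fully sequential trace the return-to-zero of P1 contributes dead time before P2 can start.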
Sequential Approaches
An implementation of the SEQ operator is shown in Figure 3(b). This circuit is speed-independent, i.e., it operates correctly assuming arbitrary, finite gate delays. An n-way sequencer consists of SEQ operators connected in a tree structure, as shown in Figure 3(c). There are two problems with the Tangram sequencer: (i) it has a long initial latency (the time it takes the start signal to reach the first process), and (ii) it has a long $\phi_2$ latency, equivalent to several gate delays.

The counter centralizes the state of the sequencer, and the decoder distributes the signals to the processes. The circuit is speed-independent and is currently used in several designs. The implementation has improved initial and $\phi_2$ latencies compared to the Tangram tree sequencer. Minor problems are that the circuit is not modular and is designed to work with an even number of processes.

• Bailey Chain Sequencer. Bailey [1] also introduced a distributed sequencer built as a linear chain of n modules, each controlling a process. The modules assume fundamental-mode operation: no new inputs can arrive until the component has stabilized from a previous input change. The long latencies present in the Tangram circuit do not occur in this design, resulting in a more efficient sequencer.
Concurrent Approaches
A concurrent sequencer was introduced by Unger [12]. However, it pays a large penalty in latency, area and energy.

• Unger Tree Sequencer. Unger [12] presents a 2-step module that implements a concurrent 2-way sequencer. The 2-step module assumes fundamental-mode operation and relies on reasonable timing assumptions. An n-way sequencer can be built as a balanced tree of 2-step modules [12]. There are several problems with this implementation: (i) the sequencer has a long initial latency, (ii) the inter-process latency ($\phi_4$) is different for every pair of processes and can be several gate delays, depending on how far up and down the tree the signals have to propagate, and (iii) the area and power consumption of this structure are significantly worse than the previous designs (see Section 6).
New Concurrent Sequencers
We now introduce two new concurrent sequencers. Both designs have good latency, area and power characteristics.
Burst-Mode Concurrent Sequencer
Our first sequencer tightly controls the overlap between a processing phase $P_i$ and the previous return-to-zero phase $R_{i-1}$. The sequencer operates as follows. A request on $S_r$ activates module M1, which starts a 4-phase handshake with process P1 by $r_1\uparrow$. P1 then responds with $a_1\uparrow$; modules M1 and M2 both receive this signal. M1 will respond with $r_1\downarrow$ while, concurrently, M2 will start a 4-phase handshake with P2 by $r_2\uparrow$. As a result, the reset phase of the first process ($R_1$) overlaps the next computation ($P_2$). The sequencer then waits for the completion of both phases to proceed: once $a_1\downarrow$ and $a_2\uparrow$ have both arrived, M2 continues the handshake with P2 concurrently with starting a 4-phase handshake with P3. As a result, $R_2$ overlaps $P_3$.
The same behavior continues until the end of the sequence.
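The effect of this overlap on total sequencing time can be sketched with a simple timing model. The model below is a simplification under stated assumptions: it ignores all gate delays in the sequencer itself and treats each step as finishing when both the current processing phase and the previous return-to-zero phase are done, as the description above suggests. The function names and the example durations are hypothetical.

```python
def sequential_time(proc, rtz):
    """Every processing phase is followed by its own return-to-zero phase."""
    return sum(proc) + sum(rtz)


def overlapped_time(proc, rtz):
    """Processing phase i runs concurrently with return-to-zero phase i-1.

    The sequencer proceeds only when both have completed, so each step
    costs the longer of the two. Sequencer gate delays are ignored.
    """
    total = proc[0]
    for i in range(1, len(proc)):
        total += max(proc[i], rtz[i - 1])
    return total + rtz[-1]  # the last return-to-zero has nothing to hide behind


# Four identical processes: processing takes 10 units, return-to-zero 6.
proc = [10, 10, 10, 10]
rtz = [6, 6, 6, 6]
saved = sequential_time(proc, rtz) - overlapped_time(proc, rtz)
```

For identical processes whose processing phase is at least as long as the return-to-zero phase, `saved` equals (N - 1) times the return-to-zero length, since only the final return-to-zero phase remains exposed.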
Note that in module M1, shown in Figure 5( The other modules,

Figure 6: Optimized Sequencer.
Our new sequencer design is shown in Figure 6(a). Although similar to the burst-mode sequencer, it offers three clear improvements: (i) a wire replaces the AND gate that generates $S_a$; (ii) each module has one fewer input ($a_{i-2}$), resulting in a reduced fan-out of the processes' acknowledge signals; and (iii) the module implementation, shown in Figure 6(b), is more efficient in terms of area and power.
We now examine the interaction of concurrent sequencers with the actual datapath, and point out problems that can arise. We then present modified latch and multiplexer designs that allow safe overlapped operation in the datapath.

A classical example where this hazard arises is in a dual-rail shift-register (SR). In the framework of Figure 1

If the data being written to the latch is equal to the data already stored in it, the write operation is not stalled and is acknowledged immediately, regardless of the state of the read port. This is a safe optimization: no changes are caused in the latch and no glitches are generated or propagated. Two versions of our modified latches are shown in Figure 8. Figures 8(a) and 8(b) highlight the gate-level and transistor-level changes, respectively. The latter solution requires only

tightly-coupled burst-mode sequencer, which allows WAR interactions between two consecutive computations only.
Dual-rail Datapaths
Overlapped Multiplexers
The operation of the datapath often requires multiple modules to write to the same latch. Since latches have only one write port, the different write requests must be multiplexed to this port. An existing handshake multiplexer [14], shown in Figure 9(a), requires mutually-exclusive requests on its two channels. A new multiplexer design, shown in Figure 9(b), allows overlapped requests. In this design, an overlapped request is stalled at the AND gate until the first operation is completed.
Single-rail Datapaths
Dual-rail datapaths are very robust but pay a large penalty in terms of area and power dissipation. We now examine single-rail datapaths as an alternative implementation.
Figure 10

Conservative Scheme. The conservative scheme uses a sequential controller, such as the Bailey chain sequencer, with the single-rail latch (Figure 10(a)). Performance is poor, since the sequencer does not allow overlapped operation. In this scheme, the result of the computation is valid at the end of the processing phase ($\phi_1$). Once processing is complete, the latches become transparent (see Figure 1(b)). The key point, in this scheme, is that the result remains stable throughout the return-to-zero phase ($\phi_3$), allowing the destination latch to remain open with valid and stable data.
A positive aspect of this scheme is that the latch is transparent only when data is valid and stable, so no undesired glitches are propagated to the rest of the circuit. The drawback of this scheme is that, even though the result of the computation is ready at the end of the processing phase, the stage still must go through the return-to-zero phase before the next computation can begin.
Fast Scheme. A fast scheme achieves a desirably high density of computation through a novel distribution of the computation across the phases of the handshake protocol. While the fast scheme can be twice as fast, it has potential problems in terms of datapath power dissipation and ease of realization.
In this scheme, delays are designed to match only half the value of the worst-case delay in the functional blocks.
As in the previous scheme, $C_r$ propagates through DF and becomes the data-valid signal for the output data from F. The difference is that, at the time this signal is asserted, only half of the computation time has elapsed, and data is not ready. The signal arrives as a write request to Z, making it transparent. The latch acknowledge signal goes to the controller as an indication of a completed processing phase even though computation is still going on. $C_r\downarrow$ starts the return-to-zero phase and propagates through the matched delay. At this point the result of the computation is available and stable in the data wires that feed the latch. When the control signal reaches Z, the latch is closed.
The advantage of this scheme is that it halves the length of the processing and return-to-zero phases of the handshake, obtaining roughly twice the density of computation of the conservative scheme. However, the scheme has two key drawbacks: (i) the matching of delays to half the value of the delay in the functional blocks is not straightforward, and, more significantly, (ii) the destination latch is made transparent when data is unstable. In fact, the outputs of the combinational circuit F can glitch many times during this period and these glitches will be propagated to every processing stage connected to the latch. This results in unpredictable power consumption that can be large, especially if the latch is connected to deep combinational circuits.
Overlapped Single-rail Operation
Our solution is to use one of our concurrent sequencers with the conservative datapath protocol, where the matched delay matches the full computation block. This results in essentially the same performance as the fast scheme but without the drawbacks: the latch is transparent when data is stable, eliminating glitch propagation, and the delays are matched to the worst-case value of the associated functional block. This approach is a valid solution, except for one problem in the interaction with the latches. As before, overlapped operation introduces the possibility of hazards if operations interact with the same latch. An analysis of the operation of the single-rail datapath (equivalent to the analysis of the dual-rail datapath in the previous section) reveals that three types of interaction are safe (RAR, RAW, and WAW) and only WAR interactions are unsafe and require modifications.
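The interaction analysis can be illustrated with a small classifier over consecutive operations. This is a sketch under assumptions: each operation is described only by the latches it reads and the latch it writes, only adjacent operations overlap, and the `Op` record, the helper names, and the latch names are all hypothetical.

```python
from collections import namedtuple

# Hypothetical record: an operation reads some latches and writes others.
Op = namedtuple("Op", ["name", "reads", "writes"])


def interactions(prev, nxt):
    """Classify how nxt's processing phase touches latches used by prev."""
    found = set()
    if set(prev.reads) & set(nxt.reads):
        found.add("RAR")
    if set(prev.writes) & set(nxt.reads):
        found.add("RAW")
    if set(prev.writes) & set(nxt.writes):
        found.add("WAW")
    if set(prev.reads) & set(nxt.writes):
        found.add("WAR")  # the only interaction the text identifies as unsafe
    return found


def unsafe_pairs(ops):
    """Consecutive operation pairs that need the modified latches."""
    return [(a.name, b.name)
            for a, b in zip(ops, ops[1:])
            if "WAR" in interactions(a, b)]


# A 3-stage shift register, shifted from the output end toward the input.
shift = [Op("s3", ["r2"], ["r3"]),
         Op("s2", ["r1"], ["r2"]),
         Op("s1", ["din"], ["r1"])]
hazards = unsafe_pairs(shift)
```

In the shift-register example every adjacent pair is flagged, because each step overwrites the latch that the previous step is still reading during its return-to-zero phase.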
Modifications for Safe Operation
The WAR hazard arises because the destination latch Z remains transparent throughout the return-to-zero phase while the overlapped processing phase can write to a source latch (X or Y). Latch Z has already stored the information and is only waiting for $C_r\downarrow$ to propagate through the matched delays as a close signal. Figure 10(b) shows the existing latch enable circuit. Two different solutions can be used:

Early close scheme. We can fast-forward $C_r\downarrow$ to the destination latch so that it closes early in the return-to-zero phase instead of at the end. Figure 10(c) shows this simple modification to the latch enable circuit. The latch will not open early, so no glitch propagation will occur. This scheme relies on some reasonable timing assumptions for correct operation.
Interlock scheme. A more robust approach is to stall the writing of the source latch until Z is opaque again. In this case, the destination latch acknowledge signal is used as an

We obtained results from SPICE simulations targeted to dual-rail implementations. In particular, we simulated several versions of an 8-stage dual-rail ripple shift-register. Table 1 lists relevant analytical results for the different sequencers. The information is given as a function of N, the number of processing stages being sequenced. The total number of transistors and gate-output transitions are used as first-order approximations to area and power consumption. The results show that the new designs are very competitive in both dimensions. Table 2 shows the expected performance of each sequencer controlling N identical processes. G is roughly the delay associated with a CMOS complex gate or an inverter, P represents the length of a processing phase, and R is the length of the return-to-zero phase. Again, the table shows that the new designs are very competitive. The substantial improvement in the computation time is due to the concurrent operation of the new sequencers, which eliminates the $(N-1)R$ term.
Conclusions
This paper has focused on concurrency optimizations targeted to low-power asynchronous systems. We introduced two new sequencer designs, with greater concurrency than existing designs. New latch and multiplexer designs that safely accommodate the added concurrency were presented for both dual-rail and single-rail implementations. In the dual-rail case, results showed improved throughput, providing the opportunity for substantial power savings through voltage scaling. We also indicated attractive features of our single-rail approach over existing approaches.
