Abstract: This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications using multimedia standards.
Introduction
Asynchronous circuits can sometimes consume very low average energy for a given average performance, partly because of their ability to adapt to variations in chip temperature and supply voltage. To achieve this goal, however, the asynchronous circuit should be optimized to process common data faster than rare data, and the overhead associated with the completion sensing logic should be minimized [1]. In this paper we propose novel asynchronous shifters that achieve these two objectives.
Shifting data is required in a variety of applications, including arithmetic operations, bit indexing, and variable length coding. Barrel shifters are a common design choice because they can perform multi-bit shifts in a single operation.
This paper proposes an orthogonal approach in which the structure of the barrel shifter is optimized so that the logic executed for common (smaller) shifts is simpler and has less capacitance than that of uncommon (larger) shifts, resulting in a barrel shifter with data-dependent delay. To best take advantage of this data-dependent delay, we build completion sensing circuitry that facilitates its incorporation into an asynchronous datapath. Pre- and post-layout simulations of both dynamic and static implementations suggest that significant reductions in both average delay and average power can be achieved.
Asynchronous barrel shifters
In our proposed architecture, each barrel shifter output is a network of muxes arranged such that more common shifts have shorter path delays than rare shifts. Consider the simple two-level mux network depicted in Figure 2. Here, for the more common shifts, S0 ... SM, the data is routed to the output through barrel shifter BS-1. In contrast, for the less common shifts, SM+1 ... SX, the data must pass through both barrel shifters BS-2 and BS-1.
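The two-level routing described above can be sketched behaviorally. This is a minimal model, not the circuit itself: the function names, the 16-bit output width, the 16 shift amounts, the input window layout, and the shift direction are all assumptions made for illustration. The default `num_fast=6` mirrors the 6-10 split discussed later in the paper.

```python
def two_level_output_bit(window, i, shift, num_fast, num_shifts=16):
    """Model one output bit of the two-level mux network.

    window   -- list of input bits (0/1); assumed layout, not from the paper
    i        -- index of the output bit
    shift    -- requested shift amount, 0 .. num_shifts - 1
    num_fast -- number of (common, small) shifts served by BS-1 alone
    Returns (bit value, number of mux levels traversed).
    """
    if not 0 <= shift < num_shifts:
        raise ValueError("shift out of range")
    if shift < num_fast:
        # Common shift: BS-1 selects directly among its data inputs.
        bs1_inputs = [window[i + s] for s in range(num_fast)]
        return bs1_inputs[shift], 1
    # Rare shift: BS-2 selects the bit, then BS-1 passes BS-2's output on.
    bs2_inputs = [window[i + s] for s in range(num_fast, num_shifts)]
    bs2_out = bs2_inputs[shift - num_fast]
    return bs2_out, 2


def two_level_shift(window, shift, num_fast=6, width=16, num_shifts=16):
    """Shift a bit window; return (output bits, mux levels used)."""
    results = [
        two_level_output_bit(window, i, shift, num_fast, num_shifts)
        for i in range(width)
    ]
    bits = [bit for bit, _ in results]
    levels = results[0][1]  # every output bit uses the same path depth
    return bits, levels
```

The functional result is the same for every shift amount; only the number of mux levels on the path differs, which is what gives rise to the data-dependent delay.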
Static design
The configuration for one output bit using static mux circuitry, along with the associated completion sensing circuitry, is depicted in Figure 3.
Note that conventional completion sensing circuits either require the data inputs to be dual-rail or use the bundled data approach, in which a single delay line models the worst-case component delay; neither is suitable for our purposes. A more recently developed idea is the speculative completion scheme [9], in which multiple delay lines are implemented and an additional mux controlled by abort circuitry is used to select one of them. Although promising, this approach seems to require the creation of the abort circuitry and an additional mux, either of which may incur significant delay overhead. In our implementation, however, we realize speculative completion in an efficient novel circuit that combines the necessary delay lines, mux, and abort logic.
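The difference between bundled-data and speculative completion can be illustrated with a small timing model. This is only a sketch: the two delay-line values, the function names, and the rule that the abort condition is simply "the shift is one of the less common ones" are assumptions for illustration, not details taken from the circuit.

```python
def bundled_done_delay(worst_case_delay):
    """Bundled data: one delay line, always matched to the worst case."""
    return worst_case_delay


def speculative_done_delay(shift, num_fast, fast_line, slow_line):
    """Speculative completion: pick the short delay line unless aborted.

    The abort condition is assumed here to be a less common shift,
    i.e. shift >= num_fast.
    """
    abort = shift >= num_fast
    return slow_line if abort else fast_line


def average_done_delay(shifts, num_fast, fast_line, slow_line):
    """Average completion delay over a stream of shift amounts."""
    delays = [
        speculative_done_delay(s, num_fast, fast_line, slow_line)
        for s in shifts
    ]
    return sum(delays) / len(delays)
```

For a stream dominated by small shifts, the average from `average_done_delay` approaches `fast_line`, whereas the bundled-data delay stays at the worst case regardless of the data.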
Dynamic design
Dynamic-logic-based shifters are faster than static-logic-based shifters because dynamic-logic-based shifters need only pass a logic zero through the NMOS muxes. On the other hand, because outputs must be pre-charged to zero before each evaluation, if the outputs more often evaluate to one than to zero, dynamic logic implementations will consume more power than their static counterparts. Thus, the decision as to which to use depends on the speed requirement and signal statistics of the data inputs to the barrel shifter.
The implementation of our two-level barrel shifter architecture in dynamic logic is illustrated in Figure 4. As is typically done in large dynamic gates, we add an extra PMOS pull-up transistor at the output of BS-2 to avoid charge-sharing problems that might otherwise cause accidental discharge of BS-1. As in the static design, the completion sensing circuitry assumes a four-phase handshaking protocol where the delay from a shift control rising (which can occur only after Go rises) to rising Done matches the delay of the shifter and the delay from falling Go to falling Done is quick and data independent. Specifically, this latter delay is the gate's pre-charge delay.
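The ordering of events in the four-phase handshake described above can be sketched as a simple timeline. All timing parameters and names below are hypothetical placeholders chosen for illustration; only the ordering of events and the data dependence of the rising Done edge follow the text.

```python
def four_phase_cycle(shift, num_fast=6, t_ctrl=0.1, t_fast=1.0,
                     t_slow=2.0, t_env=0.2, t_precharge=0.3):
    """Return event times for one handshake cycle (arbitrary time units).

    t_ctrl      -- Go rising to shift control rising (assumed)
    t_fast/slow -- matched delay for common / less common shifts (assumed)
    t_env       -- environment's response from Done rising to Go falling
    t_precharge -- Go falling to Done falling, data independent
    """
    events = {}
    events["go_rise"] = 0.0
    # A shift control line can rise only after Go has risen.
    events["ctrl_rise"] = events["go_rise"] + t_ctrl
    # Rising Done is matched to the shifter delay, which depends on the data.
    matched = t_fast if shift < num_fast else t_slow
    events["done_rise"] = events["ctrl_rise"] + matched
    events["go_fall"] = events["done_rise"] + t_env
    # Falling Done follows after the pre-charge delay for every shift.
    events["done_fall"] = events["go_fall"] + t_precharge
    return events
```

Comparing `four_phase_cycle(2)` with `four_phase_cycle(12)` shows a later rising Done for the less common shift, while the interval from falling Go to falling Done is identical in both cases.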
HSPICE Simulation Results and Conclusions
To determine which of the 16 different possible two-level decompositions is the best, we simulated all of them at the transistor level using the HP 0.35 µm HSPICE model. As Figure 5 indicates, the best overall design for our application is the "6-10 design", i.e., the design in which the 6 most common shifts are handled solely by BS-1 and the other 10 shifts must pass through both BS-2 and BS-1.
As reported in Table 1, we also compare our best designs with their synchronous static and dynamic counterparts, obtained by taking our one-level designs and removing the less-common and completion sensing circuitry. The delay improvements are approximately the same as in our earlier comparison with the one-level asynchronous design. This is because our logic models the delay of the asynchronous designs remarkably accurately.
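The search over two-level decompositions discussed in this section can be illustrated with a toy model. The paper's comparison relies on transistor-level HSPICE simulation; the sketch below instead assumes a hypothetical linear mux delay model (`base` and `per_input` are invented parameters) and a caller-supplied shift probability distribution, so it shows only the shape of the trade-off, not the reported results.

```python
def mux_delay(num_inputs, base=1.0, per_input=0.1):
    """Hypothetical delay model: a mux slows down as it gains inputs."""
    return base + per_input * num_inputs


def average_delay(probs, num_fast):
    """Expected delay when the first num_fast shifts use BS-1 alone.

    probs[s] is the probability of shift amount s (assumed to sum to 1).
    """
    n = len(probs)
    # BS-1 sees num_fast data inputs plus one input from BS-2's output.
    fast = mux_delay(num_fast + 1)
    # Less common shifts traverse BS-2 first, then BS-1.
    slow = fast + mux_delay(n - num_fast)
    p_fast = sum(probs[:num_fast])
    return p_fast * fast + (1.0 - p_fast) * slow


def best_split(probs):
    """Return the num_fast value with the lowest expected delay."""
    n = len(probs)
    return min(range(1, n), key=lambda k: average_delay(probs, k))
```

Under this model, enlarging BS-1 makes more shifts take the short path but slows that path for every shift, so the best split depends on how skewed the shift distribution is.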

The improvements in average energy compared to the synchronous counterparts, however, are significantly lower than when compared to the one-level asynchronous designs, due to the energy overhead of the done logic and the generation of the less-common signal.
One may claim, however, that the done logic provides more than just adaptivity to the data. In particular, another advantage of the done logic is that the asynchronous design can adapt to environmental changes in the power supply voltage level and chip temperature. Thus, it may be reasonable to compare the asynchronous design simulated at 3.3V and 25°C with the synchronous design simulated at 3.0V (worst-case voltage) and 100°C (worst-case temperature). As the table indicates, for the comparison of static (dynamic) logic designs, the improvement becomes 41% (42.2%) in average delay and -4.1% (-4.7%) in average energy (the energy loss arises because the synchronous circuit has a lower supply voltage). If we use voltage scaling, we can translate this excess speed of the asynchronous designs into savings in average energy consumption. In this case, we can run the asynchronous static (dynamic) component at 2.5V (2.1V) and achieve approximately 41.7% (60.8%) reduction in average energy consumption.
We laid out the synchronous design, the single-level asynchronous design, and the 6-10 two-level design, all using dynamic logic. The two asynchronous design layouts are shown in Figure 6. The 6-10 layout is 47% larger than the single-level asynchronous design and approximately 50% larger than the synchronous design. As reported in Table 1, the post-layout simulation results indicate that the energy improvements are not significantly altered but that the delay improvements are somewhat reduced. Considering the adaptivity to temperature and assuming architectural voltage scaling, however, the asynchronous design still yields a 52% savings in average energy consumption.
The inclusion of these designs into low power asynchronous variable length codecs (e.g., the Huffman decoder presented in [10]) is a topic of future work.
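The voltage-scaling argument in this section can be sanity-checked with a first-order estimate. The sketch assumes that switching energy scales with the square of the supply voltage and reads the reported -4.1% (-4.7%) energy improvement as the asynchronous design using 4.1% (4.7%) more energy than the synchronous one at 3.3V. Both are assumptions of this sketch; the paper's figures come from HSPICE simulation.

```python
def energy_ratio(v_scaled, v_nominal):
    """First-order CV^2 model: energy scales with supply voltage squared."""
    return (v_scaled / v_nominal) ** 2


def reduction_vs_sync(excess_at_nominal, v_scaled, v_nominal=3.3):
    """Estimated energy reduction relative to the synchronous design.

    excess_at_nominal -- fractional extra energy of the asynchronous
                         design at v_nominal (e.g. 0.041 for -4.1%)
    """
    relative = (1.0 + excess_at_nominal) * energy_ratio(v_scaled, v_nominal)
    return 1.0 - relative


static_estimate = reduction_vs_sync(0.041, 2.5)   # static design at 2.5V
dynamic_estimate = reduction_vs_sync(0.047, 2.1)  # dynamic design at 2.1V
```

This estimate gives roughly 40% for the static design and roughly 58% for the dynamic design, in the same range as the reported 41.7% and 60.8%. An exact match is not expected, since the quadratic model ignores effects that the circuit simulation captures.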
