A power-optimized 8-bit priority encoder (PE) cell that simplifies the conventional circuit from 102 to 62 transistors is presented. A parallel priority look-ahead architecture that reduces the delay of priority propagation is introduced. The 8-bit PE cell and the parallel priority look-ahead architecture are applied to the design of a 64-bit PE in a latch-based two-stage pipelined structure. Simulation results show that the 64-bit PE is 27% faster and 53% more power efficient than the conventional design using the same process technology.
INTRODUCTION
Priority encoders (PEs) are widely used in computer systems. A number of basic computing components are built on the PE algorithm, such as fixed- and floating-point units [1], comparators [2], and incrementers and decrementers [3]. As computer systems become faster and data widths grow, the speed of the PE becomes a key parameter in system performance. At the same time, the overwhelming demand for portable electronics encourages the development of a power-optimized PE structure.
The Priority Look-ahead (PL) scheme was first introduced by Delgado-Frias et al. [4] to reduce the PE delays associated with priority propagation. Soon after, Wang et al. [5] improved on the PL technique by using a 'Multilevel' Look-ahead architecture to improve speed and an NP domino logic technique to reduce power consumption. Recently, Huang et al. [3] reported a PE that uses Multilevel Look-ahead and 'Multilevel Folding' techniques to improve the delay of an N-bit PE to the order of O(log N). However, this Multilevel Folding architecture requires complex look-ahead signal routing, making it difficult to design and test.
In this paper, an 8-bit PE cell that is power-optimized by simplifying the circuit is presented. Then a new Parallel Priority Look-ahead (PPL) architecture is introduced. Finally, a 64-bit PE utilizing the above techniques and a latch-based two-stage pipelined structure is described and analyzed in comparison to conventional designs.
THE POWER-OPTIMIZED 8-BIT PE CELL
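In a multibit PE, the output of the ith bit is

$$EP_i = D_i \cdot P_i$$

where $D_i$ is the corresponding input data and $P_i$ stands for the priority token passed into this bit. When the input of the lower-significance bit is 0, the priority token is passed into the next bit:

$$P_{i+1} = \overline{D_i} \cdot P_i.$$

The general expression of $EP_i$ can be written as:

$$EP_i = D_i \cdot \overline{D_{i-1}} \cdot \overline{D_{i-2}} \cdots \overline{D_0} \cdot P_0. \tag{1}$$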
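The token-passing rule described above can be sketched behaviourally in Python. This is a minimal sketch rather than the authors' circuit: the function name `priority_encode` is hypothetical, and it assumes that bit 0 has the highest priority and that the initial token entering bit 0 is logic-1 by default.

```python
def priority_encode(d, p0=1):
    """Return the EP outputs for the input bits d.

    Assumptions (not stated in this form in the paper): d[0] has the
    highest priority, and the priority token enters bit 0 with value p0.
    """
    ep = []
    token = p0
    for bit in d:
        ep.append(bit & token)      # output is 1 only if this bit holds the token and is 1
        token = token & (1 - bit)   # token passes on only while the inputs seen so far are 0
    return ep


if __name__ == "__main__":
    # Bits 2, 4 and 5 are set; only bit 2 (the first set bit) wins.
    print(priority_encode([0, 0, 1, 0, 1, 1, 0, 0]))
    # → [0, 0, 1, 0, 0, 0, 0, 0]
```

At most one output is ever logic-1, which is the property the simplified cell later relies on when it shares evaluate-transistor chains.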
The conventional PE circuit [3, 5] for implementing this function is shown in Fig. 1. As an 8-bit PE with a three-level look-ahead, the circuit implements the following functions.
D7
The circuit in Fig. 1 is implemented in NP domino logic. Mp20-Mp27 are used to generate LA_out; the transistor chains in the right-side rows evaluate the data inputs and generate the corresponding EP outputs. Mrf0-Mrf7 are used to restore the outputs if they are erroneously changed at the beginning of the evaluation phase.
0-7803-8251-X/04/$17.00 ©2004 IEEE, ISCAS 2004

Fig. 1. The conventional 8-bit PE cell [3].
By studying the functions in equation (2), methods for reducing the complexity of this circuit can be identified. Notice that many functions in (2) are similar. For example, EP2 has a $\overline{D_1} \cdot \overline{D_0}$ term, which appears again in EP3. This suggests that the circuit may be simplified if portions could be shared. Recognizing that in the PE circuit only one chain of the evaluate transistors will discharge during evaluation, it becomes possible to simplify the functions in (2). By sharing common terms, the functions of the 8-bit PE cell can be rewritten as:
Based on the functions in (3), the new 8-bit PE cell circuit, shown in Fig. 2, has been designed. Notice that the logic definition of LA_in has been reversed: a logic-1 LA_in enables the corresponding cell. LA_inter and LA_out from equation (2) have been eliminated.

The new 8-bit PE cell has a number of advantages over the conventional cell. First, the circuit is greatly simplified. The transistor count is reduced from 102 to 62, and the number of nodes that must be precharged is reduced accordingly. Simulation shows that this simplification results in a power saving of more than 50%. Second, the number of gate delays in the data path is reduced. In the conventional design, EP7 has two gate delays (one OR gate delay to generate LA_inter and one AND gate delay to generate EP7), while the new circuit has only one AND gate delay. Although the new AND gate is slower due to its longer discharge path, post-layout simulation with all parasitics included shows that the overall delay is smaller than that of the two-gate conventional design. Third, the new design avoids the possibility of the outputs making more than one transition during evaluation, which could otherwise cause severe malfunction in the dynamic circuits that follow. Finally, the new PE circuit is much more regular, resulting in an easier layout and a smaller size.
THE PARALLEL PRIORITY LOOK-AHEAD ARCHITECTURE
Recently, Huang et al. [3] proposed a Priority Look-ahead technique named the Multilevel Folding architecture. Compared to the Multilevel Look-ahead architecture [5], the Multilevel Folding method is complex and the connections between PE cells are not easily understood. As a result, the look-ahead signal routing is highly irregular, complicating layout and testing, especially when the number of bits becomes large.

Further investigation shows that the Priority Look-ahead architecture can be simplified in structure to improve performance. Consider a 64-bit PE look-ahead scheme that uses 8-input OR gates to examine whether there is a logic-1 in each set of eight inputs. That is, for the first 8 bits,

$$\mathrm{OR0} = D_0 + D_1 + \cdots + D_7. \tag{4}$$
If all the inputs are logic-0, the output of the OR gate is logic-0 and the priority token is passed to the next 8-bit cell. The look-ahead signals LA0-LA7 can thus be expressed as:

$$\mathrm{LA}_k = \mathrm{OR}_k \cdot \overline{\mathrm{OR}_{k-1}} \cdots \overline{\mathrm{OR}_0}, \quad k = 0, \dots, 7. \tag{5}$$

Notice the similarity between (5) and (3), which suggests that the look-ahead logic can be generated using an additional PE, with OR0-OR7 as inputs and LA0-LA7 as outputs. Fig. 4 shows a circuit that uses this concept to implement a new Parallel Priority Look-ahead (PPL) architecture. The OR gates are designed using typical domino CMOS logic. All the PEs are implemented using the new power-optimized 8-bit PE cells described above.
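The structure described above (eight 8-input OR gates, one 'look-ahead' PE, and eight 'data' PEs) can be sketched behaviourally as follows. This is a logic-level illustration under stated assumptions, not the authors' circuit: the names `pe8` and `ppl_encode64` are hypothetical, bit 0 is assumed to have the highest priority, and the look-ahead PE is assumed to be permanently enabled.

```python
def pe8(d, la_in):
    """Behavioural 8-bit PE cell; a logic-1 la_in enables the cell."""
    ep, token = [], la_in
    for bit in d:
        ep.append(bit & token)   # a bit wins only if it is 1 and still holds the token
        token &= 1 - bit         # the token survives only while earlier bits are 0
    return ep


def ppl_encode64(d):
    """64-bit priority encode: OR gates, one look-ahead PE, eight data PEs."""
    assert len(d) == 64
    groups = [d[8 * k: 8 * k + 8] for k in range(8)]
    or_out = [int(any(g)) for g in groups]   # OR0..OR7: is there a 1 in each group?
    la = pe8(or_out, 1)                      # look-ahead PE produces LA0..LA7
    ep = []
    for g, la_k in zip(groups, la):          # each data PE is enabled by its own LA signal
        ep.extend(pe8(g, la_k))
    return ep


if __name__ == "__main__":
    data = [0] * 64
    data[10] = data[11] = data[40] = 1
    print(ppl_encode64(data).index(1))
    # → 10
```

Because every LA signal is computed from the OR outputs alone, no data PE has to wait for a token to ripple through the PE cells ahead of it.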
The new Parallel Priority Look-ahead architecture has several advantages. First, the look-ahead signals LA0-LA7 are generated in parallel. Thus, the lower-significance PE cells do not need to wait for the look-ahead signals from the higher-significance PE cells, as they must in the conventional design. For a 64-bit PE, the total gate delay is only three: one OR gate, one AND gate in the 'look-ahead' PE, and one AND gate in the 'data' PE. It can be shown that for a 128-bit PE and a 256-bit PE, the total gate delay is 4 and 5, respectively. That is, the total gate delay is approximately $\log_2 N - 3$ for the new PPL architecture, compared to $\log_2 N$ in the conventional design. Second, the look-ahead signal routing is much more regular than in the Multilevel Folding architecture, which makes it possible to predict the signal propagation delay along each wire due to parasitic capacitance and to optimize accordingly during layout. Third, the new architecture divides the data processing into two stages, the OR gate stage and the PE stage. This makes a pipelined structure possible.
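The gate-delay figures quoted above can be checked against the approximation with a few lines of Python. The helper name `ppl_gate_delays` is hypothetical, and the formula is only the approximation given in the text, evaluated here for widths that are powers of two.

```python
import math


def ppl_gate_delays(n_bits):
    """Approximate PPL gate-delay count from the text: log2(N) - 3."""
    return int(math.log2(n_bits)) - 3


if __name__ == "__main__":
    for n in (64, 128, 256):
        # width, PPL estimate, conventional log2(N) estimate
        print(n, ppl_gate_delays(n), int(math.log2(n)))
    # → 64 3 6
    # → 128 4 7
    # → 256 5 8
```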
A 64-bit PE with a latch-based two-stage pipeline structure is illustrated in Fig. 5. The outputs of the OR gates are latched by N-C2MOS latches. The two stages are clocked by two complementary clock phases; when the OR stage is in the evaluation phase, the PE stage is in the precharge phase. In this phase, OR0-OR7 are generated; they are latched on the next clock edge, when the PE stage enters the evaluation phase. The 'look-ahead' PE reads OR0-OR7 and changes one of the outputs LA0-LA7, determining which 'data' PE will be active and generate the final EP outputs (see equations (3)-(5)).

Fig. 5. A 64-bit PE with the pipelined structure.
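The two-phase operation can be illustrated with a small behavioural model. This is a sketch under assumptions, not a circuit-level description: the class and method names are hypothetical, bit 0 is assumed to have the highest priority, and the data inputs are assumed to stay stable while the PE stage evaluates (the model simply stores them).

```python
def pe8(d, la_in):
    """Behavioural 8-bit PE cell; a logic-1 la_in enables the cell."""
    ep, token = [], la_in
    for bit in d:
        ep.append(bit & token)
        token &= 1 - bit
    return ep


class PipelinedPE64:
    """Two-phase model: OR stage evaluates, its outputs are latched, PE stage evaluates."""

    def __init__(self):
        self.or_latch = [0] * 8   # models the latches holding OR0..OR7
        self.data = [0] * 64      # assumption: inputs are held stable for the PE phase

    def or_phase(self, d):
        """Phase 1: OR stage evaluates while the PE stage precharges."""
        self.data = list(d)
        self.or_latch = [int(any(d[8 * k: 8 * k + 8])) for k in range(8)]

    def pe_phase(self):
        """Phase 2: look-ahead PE reads the latched ORs, then the data PEs evaluate."""
        la = pe8(self.or_latch, 1)
        ep = []
        for k, la_k in enumerate(la):
            ep.extend(pe8(self.data[8 * k: 8 * k + 8], la_k))
        return ep


if __name__ == "__main__":
    pe = PipelinedPE64()
    worst = [0] * 63 + [1]        # only the last (lowest-priority) bit is set
    pe.or_phase(worst)
    print(pe.pe_phase().index(1))
    # → 63
```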
PERFORMANCE EVALUATION AND EXPERIMENTAL RESULTS
Two 64-bit PEs were designed in the MOSIS 3 V 0.6 µm CMOS technology: one using the new power-optimized 8-bit PE cell and the Parallel Priority Look-ahead architecture in a latch-based two-stage pipelined structure (the new design), and one using the conventional three-level look-ahead 8-bit cell and the three-level folding technique (the conventional design). Both designs use transistors of the same size to allow direct comparison. Post-layout simulation with all parasitics included shows that the propagation delay through the new PE is 1.73 ns with a 100 MHz clock, as shown in Fig. 6, while the conventional PE has a 2.34 ns delay. Power consumption was also simulated, and the results indicate that the new design uses only about half the power of the conventional design. Fig. 7 illustrates the performance comparison of the two designs.
CONCLUSION
A new power-optimized 8-bit PE cell and a parallel priority look-ahead approach have been presented. A 64-bit PE utilizing the new 8-bit PE cell and look-ahead architecture in a latch-based two-stage pipelined structure has been developed in a 3 V 0.6 µm CMOS technology. Simulation results show that the new circuit is 27% faster and 53% more power efficient than the conventional design.

Fig. 6. Simulation waveforms of the new 64-bit PE. When the input D0~D63 is 0x00 00 00 00 00 00 00 01, which is the worst case, the delay of the OR stage is 0.71 ns and the delay of the PE stage is 1.73 ns.
