Design of 128 Bit Round Robin Priority Encoder by Joshi, Kiran Raj
DESIGN OF 128 BIT ROUND ROBIN 
PRIORITY ENCODER 
By 
KIRAN RAJ JOSHI 
Bachelor of Engineering 
Institute of Engineering 
Kathmandu, Nepal 
2000 
Submitted to the Faculty of the 
Graduate College of the 
Oklahoma State University 
in partial fulfillment of 
the requirements for 
the Degree of 
MASTER OF SCIENCE 
July, 2004 





I would like to express my sincere appreciation to my father and mother. I would also 
like to thank my brother and my sisters for the moral support they have provided me over 
the years. 
I would like to thank Dr. Daniel Grischkowsky for financially supporting me as well as 
providing inspiring guidelines throughout my study here in Oklahoma State University. 
I would like to express my sincere appreciation to Dr. Louis Johnson for guiding me 
through this thesis and for his wonderful class lectures. I also like to extend my 
appreciation to my committee member Dr. Gary Yen for providing the encouragement to 
caiTy on the thesis. 
I would also like to thank my friends for making my stay m Stillwater memorable. 
111 
TABLE OF CONTENTS 
Chapter Page 
CHAPTER 1 ....................................................................................................................... 1 
INTRODUCTION AND THESIS ORGANIZATION ....................................................... I 
1.0 THESIS INTRODUCTION .............................................................................................. 1 
1. 1 SCOPE AND PREVIOUS STUDIES ................................................................................... 3 
1.2 DESIGN APPROACHES ................................................................................................. 5 
1.3 T~IESIS ORGANIZATION .............................................................................................. 6 
CHAPTER 2 ....................................................................................................................... 7 
DESIGN APPROACHES ................................................................................................... 7 
2.0 GENERAL DESCRIPTION ............................................................................................. 7 
2.1 OPERATION OF THE ENCODER ..................................................................................... 8 
2.1. l Reqi,est to tl1e E11coder ...................................................................................... 9 
2.1.2 !11dicatio11 of the highest priority ..................................................................... 10 
2.1.3 Passing of the priority token ............................................................................ JO 
2.1.4 /11pz1t p _i,-1 ......................................................................................................... 10 
2.1.5 /11p11t a_i11 ......................................................................................................... 11 
2.1. 6 /11pz1t b _i11 .................................•....................................................................... 11 
2. 1. 7 Oi,tpllt p _Olli .................................................................................................... 11 
2.1. 8 01,1tp11t e11c _ 0111 .. ............................................................................................... 11 
2.1. 9 Ci,·cz,it Fz111ctio11 ............................................................................................... 11 
2.2 BARREL SHIFT BASED PRIORITY ENCODER (BSPE) ................................................. 13 
2.2.1 Structure of left Shift Barrel Shifter ................................................................ 15 
2.2.2 Kill Signal Generator (KSG) ........................................................................... 17 
2.2.3 Right Shift Barrel Shifter ................................................................................. 18 
2. 2. 4 Encoded Signal Generator (ESG) .................................................................... 19 
2.2.5 Pe1forma11ce evaluation ................................................................................... 20 
2.3 LOOK AHEAD TREE PRIORITY ENCODER (LATPE) .................................................. 21 
2.3.1 T),pe 1 Node ..................................................................................................... 26 
2.3.2 T)pe 2 Nocle ..................................................................................................... 27 
2.3.3 Type 3 Node ..................................................................................................... 28 
2._'1.4 Circz1it Operatio11 ............................................................................................. 29 
2.4 MODULAR PRIORITY ENCODER (MPE) .................................................................... 35 
CHAPTER 3 ..................................................................................................................... 37 
MODULAR APPROACH OF DESIGN .......................................................................... 37 
IV 
Chapter Page 
3.0 [N'J'RODUCTION ......................................................................................................... 37 
3.1 IMPLEMENTATION DETAIL ....................................................................................... 39 
_1. J. I Stage I .............................................................................................................. 40 
3. 1.2 St(1ge 2 .............................................................................................................. 42 
_?,I·-~ St(tge 3 .............................................................................................................. 44 
3.2 IMPROVED MODULAR PRIORITY ENCODER (IMPE) ................................................. 46 
-~.2.1 011ert1tio11 of /}.;/PE .......................................................................................... 47 
CHAPTER 4 ..................................................................................................................... 50 
IMPLEMENTATION OF THE DESIGNS ...................................................................... 50 
4.0 INTRODUCTION ....... , ................................................................................................. 50 
4.1 IMPLEMENTATION OF BSPE ..................................................................................... 5 I 
4.2 IMPLEMENTATION OF LATPE .................................................................................. 5 I 
4.2. 1 Verilog lmp/eme11tatio11 ................................................................................... 51 
4.2.2 Lc1,·011t of LATPE .............................................................................................. 52 
4.3 IMPLEMENTATION OF MPE ...................................................................................... 56 
4.4 IMPLEMENTATION OF IMPE ..................................................................................... 56 
4.4. l Verilog Implementation ................................................................................... 56 
4.4.2 La1,·011t of IMPE ................................................................................................ 57 
CHAPTER 5 ..................................................................................................................... 60 
CHIP LAYOUT OF 128-BIT IMPE ................................................................................ 60 
5.0 INTRODUCTION ......................................................................................................... 60 
5.1 INPUT/OUTPUT PADS ................................................................................................ 61 
5.2 SCAN-BASED TEST TECHNIQUE ............................................................................... 63 
5. 2. 1 Sca,1 Registers .................................................................................................. 65 
5. 2. 2 Clock Generation and Distribution Issues ....................................................... 66 
5.3 FLOORPLAN OF IMPE IN PAD FRAME ....................................................................... 68 
CHAPTER 6 ..................................................................................................................... 73 
TESTING AND VERIFICATION RESULTS ................................................................. 73 
6.0 INTRODUCTION ......................................................................................................... 73 
6.1 TESTING PROCEDURES ............................................................................................. 7 4 
6.2 TESTING RES UL TS .............................................................................................. 77 
CHAPTER 7 ..................................................................................................................... 79 
CONCLUSION AND FUTURE WORK ......................................................................... 79 
7 .0 CONCLUSION ............................................................................................................ 79 
7.1 FUTURE STUDIES ...................................................................................................... 81 
REFERENCES ................................................................................................................. 83 
APPENDIXES .................................................................................................................. 86 
APPENDIX A ................................................................................................................... 86 
V 
Chapter Page 
DESCRIPTION OF 0.35 MICRON CMOS STANDARD CELLS USED ...................... 86 
APPENDIX 8 ................................................................................................................... 91 
LAYOUT IN AMI 0.6 MICRON PROCESS ................................................................... 91 
Vt 
LIST OF TABLES 
Table Page 
Table 3 .1: The maximum number of inputs without the use of buffers ........................... 39 
Table 6.1: Delay through the various versions of 128-bit encoder circuits ...................... 78 
Table 6.2: Chip area taken up by different designs of 128-bit priority encoders ............. 78 
Vll 
LIST OF FIGURES 
Figure Page 
Figure I. I: N bit Round Robin Priority Encoder ................................................................ 1 
Figure I .2: Assignment of the priority in N bit RRPE with input Ii set to highest priority 2 
Figure 2.1: General Structure of priority encoder ............................................................. 10 
Figure 2.2: Basic Structure of a BSPE .............................................................................. 14 
Figure 2.3: Structure of the Left Shift Barrel Shifter. ....................................................... 16 
Figure 2.4: Node structure for the left shift barrel shifter. ................................................ 16 
Figure 2.5: Stmcture for a four input KSG ....................................................................... 18 
Figure 2.6: Stmcture for a right shift barrel Shifter .......................................................... 19 
Figure 2. 7 (b ): AND plane ................................................................................................ 22 
Figure 2.8: Four-input LATPE ......................................................................................... 24 
Figure 2.9: N Stage Buffer circuit.. ................................................................................... 25 
Figure 2.10: Type I node structure ................................................................................... 27 
Figure 2. 1 I: Type 2 node structure ................................................................................... 28 
Figure 2.12: Type 3 node structure ................................................................................... 29 
Figure 2.13: Operation for the 4-input LA TPE when b _in3 = 0 ...................................... 33 
Figure 2. 14: Representational steps for a 4 bit LA TPE .................................................... 34 
Figure 3.1: Structure for the modular approach of Encoder design .................................. 38 
Figure 3.2: Structure of unit cell in stage 1 with four inputs ............................................ 41 
Figure 3.3: Structure for a four input unit in stage 2 ........................................................ 43 
Figure 3.4: Unit cell of Stage 3 with four inputs .............................................................. 45 
Figure 3.5: Block diagram for the improved modular priority encoder (IMPE) .............. 4 7 
Figure 3 .6: Basic Unit for stage 1 with four inputs used in IMPE .................................... 48 
Figure 4.1: Modified Type I Node for LATPE ................................................................ 52 
Figure 4.2: Modified type 2 node for LA TPE .................................................................. 53 
Figure 4.3: Modified type 3 node for LATPE .................................................................. 53 
Figure 4.4: Static CMOS implementation of the inverter. ................................................ 54 
Figure 4.5: Static CMOS implementation of the 2-input NAND gate ............................. 54 
Figure 4.6: Static CMOS implementation of the 2-input NOR gate ................................. 55 
Figure 4.7: Flom-planning for the 128-bit IMPE ............................................................... 58 
Figure 5 .1 : 40-pin pad frame ........................................................................................... 61 
Figure 5.2: Circuit diagram for the bidirectional input/output pad ................................... 62 
Figure 5.3: General diagram for scan based testing technique ......................................... 64 
Figure 5.4: Schematic for a true two-phased clock positive edge clock latch .................. 65 
Figure 5.5: Two-phase clock generator ............................................................................ 66 
Figure 5.6: Non Overlapping - clock ................................................................................ 67 
Figure 5. 7: H-tree for the distribution of clock ................................................................. 6 7 
Figure 5 .8: Floorplan with scan registers for a 128-bit IMPE .......................................... 69 
V111 
Figure Page 
Figure 5.9: 2 - I multiplexer used in the input side of the encoder circuit ...................... 71 
Figure 5.10: Latch circuit for the output registers ............................................................ 72 

















Iterative Serial Line IP 
Content Addressable Memory 
Round Robin Priority Encoder 
Simple Linear Priority Encoder 
Programmable Priority Encoder 
Barrel Shi ft Based Priority Encoder 
Kill Signal Generator 
Encoded Signal Generator 
Programmable Linear Priority Encoder 
Hardware Description Language 
Modular Priority Encoder 
Improved modular priority encoder 
Pipelined MPE 
Circuit Under Test 
X 
Chapter 1 
Introduction and Thesis organization 
1.0 Thesis Introduction 
This report presents a design of a 128 bit round robin priority encoder (RRPE), 
with a delay of order O(logN) and also presents an implementation method that greatly 
minimizes the effect of interconnect delays. 
The round robin encoder has the basic structure shown in figure I. I. 
enc out N enc out N-1 enc_out 2 enc_out 1 ~-L- ---------------
N bit Priority Encoder 
--------------
b in N b in N-1 - - b_in 2 IJ_in 1 
a_in N a_in N-1 a_in 2 a_in 1 
Figure 1.1: N bit Round Robin Priority Encoder 
1 
a_inj ~ j=N, ... , I , are the N inputs to the encoder, the inputs can be either one or 
zero at anytime depending on the request at that particular input. enc_outi, i= N, ... , 1 
represents the N outputs from the circuits. 
b_inj, j=N, ... , L are the N signals that denote which input signal has the highest 
priority. The input with the highest priority is denoted by resetting the corresponding 
begin signal to 'O'. For example, if the a_ini input is to be the highest priority, then the 
b_ini signal is zero. At any time, only one of the b_inj inputs can be zero, the rest of them 
should be one, which implies that only one of the inputs can be the highest priority input 
at anytime. The priority then decreases gradually from the highest priority input to the 
lowest in a round robin fashion from right to the left. Thus, if a_ini has the highest 
priority, then input a_inmodN(i+I > will have the next highest priority and so on. Thus, the 
input a_inmodN(i·I) will have the lowest priority. Figure 1.2 shows the pictorial 









a_in N a in· 
- I 
a_in 1 
Figure 1.2: Assignment of priority in N bit RRPE with input a_in i set to highest priority 
2 
1.1 Scope and previous studies 
Several studies of Round Robin priority encoders with large numbers of inputs 
have been done in the past [ 1] [2]. In [I] [2] designs have been presented that break the 
circuits into smaller modules, which helps to simplify the design complexity. A similar 
method has been used in this study to design the Round Robin encoders. This method 
greatly simplifies the design and reduces the design time. In [I] all the proposed circuits 
have been implemented using Altera's ACEXIK series CPLD. The delay for a 64 input 
was 25.5 ns. They have also simulated the circuit proposed by [2] for which the delay of 
39.9 ns has been shown for the encoder of size 64 bits. 
Most of the circuits in the current study makes use of the tree of AND and OR 
gates. Use of tree structures to minimize the delay through the circuit is well known. 
They have been used in the design of look ahead carry generators [5] [6] [7]. The 
concepts of look ahead signals have also been used in the design of fast linear priority 
encoders [8] [9] [10] [11]. The concept of the look ahead signal has also been utilized in 
the design of the round robin priority encoder., in order to obtain the delay O(logN) .. 
where N denotes the number of bits in the encoder. 
All of the circuits that have been designed in this study have been for 128 bit 
round robin encoders and have been implemented using either the Verilog hardware 
description language or using the Cadence Virtuoso layout editor. Therefore a direct 
3 
comparison of area and speed with the previously done work is difficult, since different 
technologies are used 
The proposed round robin encoder finds application in several different fields. 
They have been used in the design of high speed switches and scheduling [I]. For fast 
switching and fair control of congested flow in broadband networks [ 12]. 
There have been many designs of round robin priority encoders, which have been 
used in the realization of iterative serial line IP (iSLIP) and in the area of high speed 
packet switches with input buffers [2] [ 12]. They have also been used in the Content 
Addressable Memory (CAM) circuits [14] [15]. Most of these designs have been 
implemented for a relatively small number of inputs, of the order of 10. Although these 
designs have delays of order O(logN), if the same designs are expanded to a large number 
of inputs of the order of I 00, the delay will dramatically increase. The main reason for 
the increase in the delay is because of the increase in the delay of the interconnect 
network. The use of a tree type structure indicates that the delay increases exponentially 
with the increase in the number of inputs. 
The round robin encoder can be incorporated inside the modem superscalar 
processors to improve its performance. One of the motivations for the study of the round 
robin priority encoder was to access if the layout for the encoders with large inputs is 
reasonable enough to be incorporated inside the superscalar process to improve the 
4 
perfonnance. At the same time the circuit should be fast enough to perform its operation 
within a clock period of the processor. 
1.2 Design Approaches 
Two novel features of the design have been implemented. The first 
approach is based on the principle of a look-ahead signal, similar to the one used in look 
ahead caJTy adder circuits. The look-ahead approach is very easy to implement when the 
number of inputs to the encoder is small. As the number of inputs to the encoder 
increases .. the delay in the look ahead carry as well as the circuit area are dominated by 
the cross over network and the wrap around circuits, which makes them impractical for a 
large number of inputs. 
A modular approach in which the circuit is broken down into smaller sub blocks 
has been designed. The delay for the modular approach is very close to the O(log N). The 
delay for this design is given by 2logL + logM, where M and L are related by N as 
N=M x L. There is a large reduction in the area of the chip as compared to the look-ahead 
approach. Also many of the sub blocks used in the modular approach are similar in 
structure. Thus, these repeated blocks need to be designed only once, which makes the 
implementation of the modular approach less time consuming as compared to the look-
ahead approach. 
5 
1.3 Thesis Organization 
This document is divided into seven sections. Chapter 2 discusses the various 
design approaches available for the design of the round robin priority encoder. At the end 
of this chapter, a novel approach is also presented. Chapter 3 discusses the improved 
modular approach of priority encoder design. In Chapter 4., implementation details for the 
various designs are presented. Chapter 5 presents the implementation detai 1 of the layout 
of the improved modular priority encoder circuit in 40 pin pad frame. Chapter 6 presents 
the testing and verification methods and the simulation results. Chapter 7 discusses the 





In this chapter, two different methods of the implementation is described. The 
main objective of the thesis is to find a design approach which has O(logN) delay. One 
way to achieve the targeted delay is to use a binary tree structure. Two different ways to 
implement a tree are given below. 
2.0 General Description 
The following is the description of the binary tree approach. For the 
explanation~ purpose of a four bit Round Robin Priority Encoder (RRPE) has been 
described. 
There are four inputs to the encoder. Each of these inputs has an associated 
'begin' signal. The begin signal is used to indicate which of the input signals is of the 
highest priority. From the input with the highest priority, the priority gradually decreases 
towards the left. If the end of the circuit is reached, the priority will wrap around and will 
still continue to decrease in the right to left order until the input with the highest priority 
7 
is reached. Thus., in this scheme, the input which is to the immediate right of the input 
with the highest priority will be the input with the lowest priority. 
As opposed to a round robin priority encoder (RRPE) a linear priority encoder 
does not wrap around from the lowest priority input. Usually the priority order is fixed in 
the linear priority encoders. Either the leftmost input or the rightmost input will have the 
highest priority. For the purpose of this thesis we will assume that all the linear priority 
encoders have its rightmost input at the highest priority level and the leftmost input at the 
lowest priority level. And these encoders are referred to as the Simple Linear Priority 
Encoder (SLPE). 
2.1 Operation of the encoder 
The operation of the encoder can be completely specified by logic equations. For 
a fixed priority linear encoder with a_in as the input and enc_out as the output, the 
following equation completely defines the priority encoder. The order of the priority is 
such that the input with the smallest number has the highest priority. The i-th encoded 
output can be written as 
enc_out ;=a_in ;• _in ;_1+a_in ;_2+ .... +a_in 2+a in 1) 
enc out a in .• kill . 
- I I 
where, the 'kill i' is the kill signal and is defined as 
kill ;=a in ;_,+a in ;_ 2+ .... +a_in 2+a_in 1 
8 
For the round robin encoder there needs to be extra inputs which determines 
which of the inputs is of the highest priority. Also the priority wraps around from the 
circuit. This extra input can be termed as b_in. Only one of the b_in signals can be a t 
logic low level at any time which indicates the highest priority input, all the remaining 
b _in inputs must be at logic high. If a kill signal similar to the one in the linear encoder is 
generated then the logic equation for the kill signal is given by 
kil( =a_in,_1 •b_in; +a_in;-'!. •b_ini •b_in;_1 +···+a_i111 •b in; •b_in;_1 • 
···•b_in3 •b_in2 +a_inN •b_in; •b_in,_1 •···•b_in1 +a_in"._1 •b_in, • 
···•h_i111 •b_inN +···+a_in.,·+i •b_i111 •···•b_it11 •b_in.,. •···•b_in,+1 + 
a _in; •b _in1 •···•b _in1 •b _inx•···•b _in;+i ···············Eqn1 
The Encoded signal is generated from this kill signal as in the linear priority 
encoder as 
enc out .= a in .• kill . 
- I - I I 
Where, N indicated the total number of inputs to the encoder, and I can be varied 
from N to l. The simplest implementation of the round robin priority encoder is given in 
figure 2.1. The description for the encoder is presented below. 
2.1.1 Request to the Encoder 
Each individual input can request by setting the input, a_ini, to high. Thus at any 
instance of time there can be several of the inputs to the priority encoder with a value of 
logic I. 
9 
2.1.2 Indication of the highest priority 
The input with the highest priority will have its priority input, b_ini, as O all the 
remaining inputs will have its priority input, b_in as 1. Thus at any instance of time only 
one of the priority input can be 0. This ensures that there is only one, a_in input with the 
highest priority at any instant of time. 
2.1.3 Passing of the priority token 
At the beginning of the circuit operation, the priority token is with the input with 
the highest priority. This priority is then passed to the input in the immediate lower 
priority hierarchy. Thus, in its simples fonn the RRPE looks like the one shown in figure 
2.1 below. 
enc out enc_out N-1 
A N & 
. ~ rp~ou1 __ PJ•~+;outl pJ-- ·················· 
! a_in N I a_irli N-1 I 






.- p_out p_i~-~p_out p_i 
a_in 2 a_i'l 1 
b_in 2 b_in 1 
Figure 2.1: General Structure of priority encoder 
If there are N inputs to the encoder, there will be N such blocks. Each Block has 
the following inputs and outputs. 
2.1.4 Input p_in 
This indicates the priority input to the block. The signal is fed from the block with 
the next higher priority. If the particular block is the one with the highest priority the p _ in 
signal is ignored. This signa1 is 1ike acts 1ike a ki11 signal. 
10 
2.1.5 Input a_in 
The input is provided by the external circuit, which is using the priority encoder 
circuit and serves as the request to the priority encoder. 
2.1.6 Input b in 
This input is also provided by the external circuit. This is used to indicate which 
input has the highest priority. The assignment of the priority is according to the scheme 
described above. Thus only one of these b_in signals will be low, which makes that 
particular input signal the highest priority one. All the other b_in must be set to high so as 
lo prevent more than one input from becoming the highest priority one at any instance of 
time. 
2.1. 7 Output p_out 
Each of the blocks can send out a priority signal to the next block which is in the 
immediate lower priority level than itself. This signal acts as the kill signal to the next 
block. 
2.1.8 Output enc_out 
This is the output encoded signal. The signal is computed from the inputs a_in, 
b_in and the priority in signal, p_in. 
2.1.9 Circuit Function 
This type of priority encoder is called programmable p1;ority encoder (PPE) [ l ]. 
The PPE differs from the simple priority encoder in the sense that the priority for each of 
11 
the input signals can be set by an external circuitry [l]. All the encoders designed in this 
thesis fall in the general category of PPB as the b_in inputs can be used to indicate the 
priority for the input signals. 
In order to understand the circuit operation, we start from the input with the 
highest priority and then move down toward the lower priority one. 
enc_out i = a_in i · p_int i 
p_int i = b_in i • p_in i 
P out · = - a in · · p int · _ I _ I _ I 
Where,, the operators are defined as follows 
· Logical AND 
+ Logical OR 
- Logical NOT 
And i = N,, ... , l, N being the total number of the inputs to the encoder. 
p_in i = p_out modN (i- J) 
Thus the priority signal acts like a token. If the input with the higher priority uses 
the token, the input with the lower priority cannot be served even if it requests. If on the 
other hand, the one with the higher priority doesn't use the token, then it passes the token 
to its immediate lower priority. Then the process is repeated in a similar manner for the 
other inputs. Thus, if there are N inputs and the request is made by only the input with the 
lowest priority, then, in order for the input to be granted by the scheme described above, 
the priority token will have to pass through each of the inputs sequentially .. until it 
12 
reaches the last input. Thus the worst case delay of the circuit is O(N). This delay will not 
be significant as long as N is small, but if N is large, then the delay will render this type 
of circuit unsuitable for most modem high speed applications. While the delay for the last 
input will take N gate delays to get served, the input with the highest priority will take 
only one gate delay to get served. Thus, there is a huge imbalance in the delay from the 
first input to the last input in the priority hierarchy. 
One of the approaches to reduce this imbalance in delay is to spread the delay 
among the inputs so that the delay for all the inputs is approximately equal. One of the 
most common methods to implement this approach is to use a tree circuit which has look-
ahead signals to improve the performance. Several different approaches have been 
proposed. 
Some of the Round Robin priority encoder design is presented in the next section. 
The different aspects like delay, complexity and the chip area are compared for each of 
the design implementations. 
2.2 Barrel Shift Based Priority Encoder (BSPE) 
The block diagram for the BSPE is given in figure 2.2. There have been other 
designs of the barrel shifter based priority encoder in the past [ 13]. The design consists of 
four major blocks. The first block is a left shifter, the second block is a kill signal 
generator (KSG), the third block is the right shifter, and the fourth block is the encoded 
signal generator (ESG). The circuit functions as follows, the requests are fed to the inputs 
13 
a_in and the grants are assigned to the output enc_out. The pointer b_in indicates which 
input is currently the highest priority. The shifters are such that the amount of shift can be 
varied. The pointer. b_in, is connected to the shift amount input of both shifters. Thus, the 
first batTel shifter shifts the inputs such that the highest priority input to the circuit is 
rearranged to the highest priority input of the KSG. The KSG operates on the request and 
produces the kill output. Since the input to the KSG is presented by the first barrel shifter 
such that the highest priority request always lies in its highest priority input, the 
programmable nature of the circuit is transparent to the KSG. The output of the KSG is in 
the wrong order with respect to the input requests. Thus, the process by which the order 
of the inputs was changed has to be reversed for the output of the KSG. Since the input 
was shifted left by the amount specified by the pointer b_in, the output of the KSG also 
has to be shifted right by the same amount. Thus, the third block, right shifter, shifts the 
output of the KSG to the correct order. The output of the third stage is the rearranged kill 
signal, from which the actual encoded signal is generated. This is done by the fourth 
block. 
There have been several designs of the KSG, with O(log N) delay. The most 
commonly used shifter with O(log N) delay is the barrel shifter. Each of the blocks has 
been described in detail in the following section. 
a_in 
Kill Signal Generator Right Shift Shifter 
I 
I 
l I - - b_in 
Figure 2.2: Basic Structure of a BSPE 
14 
2.2.1 Structure of Left Shift Barrel Shifter 
The barrel shifter can be designed by using an overlapped tree structure as shown 
111 figure 2.3. The extra shift_in input to the shifter is used to supply the value to the 
shi ficr which would otherwise be undefined after the shift operation. Since the shifter is 
being used for the round robin encoder, the shift signals should be wrapped around and 
connected to the left inputs as shown in figure 2.3. For an N bit input encoder, an N bit 
shi ftcr is required. For an N=2" bit shifter, there are n stages of nodes. Each of these 
nodes is implemented with a 2 to 1 multiplexer as shown in the figure 2.4. From figure 
2.3~ it is seen that the number of wraparounds increase exponentially with the number of 
the stages. There is only one wrap around connection in stage 1, while there are two wrap 
around connections for the second stage. Thus for the p-th stage, there are 2p-t wrap 
around connections. Also the number of wrap around also increases exponentially for 
larger inputs, which make them very difficult to layout. 
Also the size of the cross over network in each stage increases exponentially. The 
number of bits crossed over at i-th stage is 2<i-l)_ As the number of stage increases the 
cross over network becomes too large, and thus, becomes the major cause of delay 
contribution to the circuit. In order to reduce these delays, additional buffers are needed. 
These buffers increase the area requirement of the circuit. 
The a_in input to this encoder circuit is applied to the inputs of the left shift barrel 
shifter. The b_in input to the encoder are decoded and are applied to the control inputs of 
the shifter. The Db_in signals in figure 2.3 represent the decoded signal of b_in inputs. 
15 
The inputs a_in are shifted by the amount specified by the Db_in inputs and appear at the 
output of the shifter in rearranged form. These rearranged inputs are tern1ed as ra_in as 




ra_in N-1 ra_in N-2 ra_in 3 ra_in 2 ra_in 1 
' ' ' ! ! 
!~ - .. -~----- -··-----------··---·-·--·· -&__~-'Y--
----=+====- ····················-· ______ r---1 I --=: 
---·-·-·-------------· -------- -• --------------·------- -----=.-------===±-----------. 
I l I l 
l ! 








------------. -- ------ i I 




' j \ (\ ___ ·· · · · i---i-\ --------------\ -- Db_in, 
t \_ -t \_ -----------------------4 '\_i \ \Shift_in 
a_ini N a_in N-1 a_in N-2 a_in 3 a_in 2 a_ini 1 
Figure 2.3: Structure of the Left Shift Barrel Shifter 
<==>s s 
Figure 2.4: Node structure for the left shift barrel shifter 
16 
2.2.2 Kill Signal Generator (KSG) 
Figure 2.5 shows the circuit for the KSG. The circuit is similar in construction to 
the barrel shi fier described above. The KSG is similar to a fixed priority encoder, with 
the input on the right most comer having highest priority. The priority then decreases 
subsequently towards the left. Thus the leftmost input in the KSG is the one with the 
lo\\'cst priority. The KSG can be implemented by using a simple tree of OR gates. As 
shown in figure 2.5 some of the inputs to the OR gates in the right side doesn't have any 
connection. These inputs must be connected to ground for the proper functioning of the 
circuit. For a two input OR gate, if one of the input is connected to the ground, then the 
output will be same as the other input. Thus those OR gates in the KSG circuits can be 
removed. But leaving them there serves two purposes. First of all it makes the circuit 
symmetrical, the second is that it serves as a buffer to drive the crossover networks. As in 
the case of the barrel shifter, as the number of stages increases the crossover network also 
increases exponentially, and eventually dictate the overall delay of the circuit. The cross 
over network also increases the complexity in the layout of the circuit as the number of 
inputs increases. 
The KSG receives it input from the left shift register, which is the rearranged for 
of the input a_in to the encoder. From figure 2 it is seen that if the right most input .. ra_in 
1 is at logic 1 then alt the outputs, k1, k2, k3 and kt will be at logic 1, irrespective of the 
state of the other inputs, ra_in 4, ra_in 3 or ra_in 2. If ra_in 1 is at low and ra_in 2 is at high 
then kl will be at tow but k2, k3 and kt will be at logic high irrespective of the logic levels 
of the other two inputs ra_in 4 or ra_in 3. The output signals are the kilt signals. Kill 
17 
signals start at the position where the highest priority input signal is at logic 1 and then 

















l i rt ~ ! ' 
' l \__I \_J \_J 
I ~-
ra_in 4 ra_in 3 ra_in 2 ra_in 1 
Figure 2.5: Structure for a four input KSG 
2.2.3 Right Shift Barrel Shifter 
Figure 2.6 shows the diagram for the right shift barrel shifter. The barrel shifter is 
similar to the left shift barrel shifter shown in figure 2.4. The only difference is in the 
connection of each of the nodes, so that the circuit shown in figure 2.6 will perform a 
right shift of the inputs by the amount specified by the input S. All the problems 
associated with the left shifter will also be present in the design and implementation of 
the right shift ba1Tel shifter. 
The input to the stage provided by the KSG, which are the k signals. These are the 
kill signals generated for the rearranged input signals ra_in. The right shifter again shift it 
by the amount specified by the amount by which the a_in was shifted but in opposite 
18 
direction to get the correct pattern of the kill signals. Thus the output from the right 
shi fler is the kill signal which corresponds to the input signal a_in to the encoder. The 
control signal Db_in to the right shifter is the decoded version of the b_in signals. 
Db_inn 
Shift_!!" 




~---- . r··_ 




l ! . . 
: : 
: : 
: : . . . . . . . . . . 
t I . . . . i I 
kill 3 kill 2 kill 1 
.::::-.::::::·.:::·.::::·.- ~==--i:~~~ --
---------------------- ___- I . --i---___.-- .-.-- ..--
I i . . . . . . 




I/ I,/ / --------------------- // / 
I.' 




I / t-/ 
.... __ ,,, 
I .. 
i 
k N k N-1 k N-2 k 3 k 1 
Figure 2.6: Structure for a right shift barrel Shifter 
2.2.4 Encoded Signal Generator (ESG) 
This block simply takes the kill signals that has been rearranged by the right shift 
buffer and then combines with a_in and b_in signal to generate encoded output, en_out. 
The following logic operations are perfonned for each of the kill signals. 
k_int i = kill i .b_in i 
19 
enc out i = -k int i • a in i - - -
where, i represents the i-th input position, and i runs from N to 1 for an N bit 
encoder. 
2.2.5 Performance evaluation 
The delay for the above circuit can be estimated by adding the delay through the 
three blocks. As mentioned above, all three of the blocks have approximately log N gate 
delays, where N signifies the number of the inputs. Thus the total delay of the circuit is 
roughly 3 log N gate delays thus limiting the speed at which the BSPE can operate. As 
the number of inputs to the circuit increases, the cross over network in each of the blocks 
increases exponentially. For the higher levels in each block, the cross over network will 
present a large capacitive load to the driving node, which significantly increases the 
delay. In order to reduce the delay, buffer circuits have to be inserted at carefully chosen 
places in the cross over networks. This need for the insertion of the buffer greatly 
increases the complexity of the circuit. Also, the area greatly increases as there are three 
identical blocks where these buffers have to be added at several points. Also, since each 
of the level in each block is different than any other, if additional inputs are required, the 
circuit has to have a new layout. This makes this approach less suitable for applications 
with a large number of inputs. 
20 
2.3 Look Ahead Tree Priority Encoder (LA TPE) 
The general equation to generate the kill signal for a round robin priority encoder 
is gi\·cn in section 2.1. It is seen from the equation that to generate the kill signal two 
logi c operations are involved. One is the AND operation and the next is the OR 
operat ion. These t,\·o operations are interlaced as given by equation 1 in section 2. 1. One 
method to compute this type of interlaced AND-OR operation has been used in the 
construction of the look ahead carry generator [5] [6] [7] . Also the concept of lookahead 
s ignals have been implemented in the design of large linear priority encoders [8] [9] [IO] 
[ I 11. A tree implementation of the circular priority encoder, which is same as round robin 
priority encoder in functionality is also presented in reference [12]. 
In order for the c ircuit to perfom1 at logN de lay a novel circuit configuration 
based on the binary tree structure has been designed. The basic structure of the circuit is 
shown in fi gure 2.8. The circuit can be divided into two different planes: the AND plane 
and the OR plane. The OR plane and the AND plane are very similar in construction. 
Both of these networks have a tree structure. The OR plane is exactly similar to the KSG 
show n 111 fi g ure 2 .5. W hil e the AND plane is similar to the Right Shift barrel shifter 
shown 111 fi gure 2.6, with each node being replaced by an AND gate instead o f the 
multiplexer. Figure 2 . 7(a) shows the structure for the OR plane and figure 2 . 7(b) for the 
AND plane. The LA TPE ci rcuit can be considered as the interlaced combination of the 
AND and the OR plane. These two planes when interlaced will generate the kill signal. 
21 
( 
Figure 2. 7 ( a): OR Plane 
Figure 2. 7 (b ): AND plane 
22 
For an N input LA TPE, where N=2", there are n+ 1 levels. Each of the nodes in 
the encoder is divided into several categories. Type 1 nodes are the nodes in level 1. The 
nodes in level 2 through level n are called type 2. The nodes in level n+ 1 are named type 
3. The inputs to the OR plane are called 'a_in' and the inputs to the AND plane are called 
~b _ _in·. Any external circuit which uses the encoder will choose the pattern for the a_in. 
For example if there are N different inputs vying for a common resource, then the request 
to the resource is made by setting the corresponding a_in inputs to 1. The b_in inputs 
<le fine the priority of the inputs at any instance of time. For an N input encoder there are 
N b _in signals. each associated with the corresponding a_in input. The highest priority 
input is indicated with its corresponding b_in input as logic 0. Thus at any instance of 
time. one and only one of the b_in inputs can be 0. All the rest of the b_in inputs should 
be set at logic I. 
The grant to the resource for a particular input is indicated by turning its 
corresponding 'enc_out' signal high. Since at a time only one of the inputs can have 
access to the resource, only one of the N enc_ out signals can be at logic 1 at any instant 
of time. All the remaining outputs should be at logic 0. 
The inputs to type 2 nodes in the i-th level and p-th column are made from three 
other nodes. 
• nodes of (i-1 )-th level and p-th column 
• nodes of (i-1)-th level in mod(p - i-1)wth column 





































i l -i-- - ----
._. _J 
I ~ 
b_in 4 a_in 3 
TYPE 1 I 
I 
I ~-, L 
b_in3 a_in 2 












I I . I I . I TYPE 2 
TYPE 1 I TYPE 1 
LI 
a_in 1 b_in 1 
Figure 2.8: Four-input LATPE 
Thus at any k-th level of the LAPTE there are 2k-1 cross over from the left end 
which wrap around to the right side and there are 2k-1 crossover from the right which 
wraparound to the left of the circuit. As the number of inputs increases, the number of 
levels also increases., and also the number of crossovers and wrap arounds. For each 
higher level., the length of the crossover also doubles as compared to the crossover in the 
preceding level. Thus., part of the loading for each node increases exponentially for each 
additional level of nodes. This will have two implications on the overall circuit 
perfonnance. For an encoder with a large number of inputs., the circuit area for the higher 
24 
level is mostly dominated by the cross over networks alone. Appendix B shows the 
layout of the LATPE for the 7th stage where the interconnect networks are 15 times larger 
than the circuit clements. Also the delay of the circuit increases dramatically because of 
the large capacitive load these cross over network presents to the nodes driving it. In 
order to obtain optimum delay through the circuits, extra buffer circuits have to be added 
lo the driving nodes. The buffer has the general structure as shown in figure 2.9. 
""'· ~-.. --------
IN 1 2 
_I __ . 
GOUT 
Figure 2.9: N Stage Buffer circuit 
The general fom1 of the equations to determine the number of buffer stages 
needed and the ratio of the buffer size between consecutive stages can be fonned. Solving 
these equations provides the number of stages required for optimum delay. The ratio of 
buffer size between the consecutive stages is usually not same between stages, but the 
solution for such a case will be very complicated and often requires a computer program 
for the solution. A simpler case which is suitable for the hand calculation can be obtained 
for the case when the size ratio between the consecutive stages is fixed. Such a result is 
presented below. 
{IN== COUT = y 
( c
0 JwL A I 




'N" is the number of buffer stages 
·a· is the fixed size ratio between the consecutive stages 
( \' ) is the capacitance per unit area for the poly gate 
· \V,' is the width of the transistor in the first buffer stage 
· L' is the channel length of the transistor used in the buffer 
·coUT' represents the load being driven by the buffer stages 
Usually ·a' is a fixed number with a value between 3 and 5 for a close to optimum 
delay through the buffer stages. 
The structure for the Type I, type 2 and the type 3 nodes are given m the 
following sections. 
2.3. 1 Type 1 Node 
Type I nodes are the simplest of all the nodes. They have only one AND gate in 
them. None of the type I nodes are directly connected to each other. The input to the type 
I nodes comes from the external circuitry, and the output of the node always goes to the 
input of the type 2 node in level 2. Figure 2.1 O shows the structure for the type I nodes 














a_in 4 b_in 4 a_in 3 b_in 3 a_in 2 b_in 2 a_in 1 
Figure 2.10: Type 1 node structure 
2.3.2 Type 2 Node 
TYPE 1 
b_in 1 
Type 2 nodes are the most common nodes in the encoder circuit. The structure of 
the type 2 nodes is shown in figure 2.11. The input to the type 2 nodes comes from either 
the output from the type 1 nodes or from the outputs of the other type 2 nodes. The output 
of the type 2 nodes is fed to the input of the type2 nodes in the higher level or to the input 
of the type 3 nodes. Each type 2 node has three gates, one AND gate and two OR gates. 
The largest delay experienced by any signal passing through any type 2 nodes is 
equivalent to two gate delays. 
27 
TYPE2 




,-..---.1----___._ _ _l_j 




Figure 2.11: Type 2 node structure 
TYPE2 
Type 3 nodes are used in the last stage of the encoder. Thus the input to the type 3 
nodes always comes from the output of the type 2 nodes. The structure of the type 3 
nodes is shown in figure 2.12. It consists of four gates, one OR gate and two AND gates 
and one inverter. 
28 
enc_out., enc out3 
l 
enc out2 enc out, 
TYPE 3 TYPE3 TYPE3 TYPE3 
~ i I I I 
Figure 2.12: Type 3 node structure 
2.3.4 Circuit Operation 
First it is verified that the circuit given in the figure 2.13 indeed provides the kill 
signal and the enc_ out signal as given by equation 1 presented in section 2.2. For this 
purpose the kill 3 signal is evaluated from figure 2.13. 
C =A+ B, where, A= b_in3. a_in2 and B = b_in2. a_in1 
F = G + H, where G = b_in4. a_in3 and H = b_in1 . a_in4 




kill-' = K. b_in3 
= (D + C). b_in3 = ( E. F +A+ B). b_in3 
= (b_in2. b_in1(G + H) + b_in3. a_in2 + b_in2. a_in1 ) • b_in3 
(b_in.2. b_in1 ( b_i114. a_in3 + b_in1 . a_i14) + b_in3. a_in2 + b_in2. a_in1) 
. b_in; 
= (a_in;. b_in..i. b_in2 • b_in1 + a_i114. b_in2. b_in1 + a_in2. b_in3 + a_in1 
. b_in2). b_in3 
= a_in2. b_in3 + a_in1 . b_in3. b_in2 + a_il4. b_in3. b_in2 . b_in1 + a_in3. 
b_in_,. b_in2. b_in1. b_i114 
This expression is same as given by the general equation ( 1) of section 2.2. The 
enc_ out signal is generated from the kill signal as seen from figure 2.13 as 
enc_out3 = a_in3. -kilb 
If all but one of the b _in signals is set at logic high then the LA TPE will work 
exactly as the KSG. All the b_in signals are fed to the AND plane, which consists of two 
input AND gates interconnected together. Therefore whenever one of the input is at logic 
one, the output is same as the other input. The whole AND plane will function as a 
simple wire network if all the b_in signals are set at logic high. Thus in such a case the 
circuit will functionally look as shown in figure 2. 7(a). This circuit is similar to the KSG 
shown in figure 2.5, except for the wrap around network. Thus if all the b_in signals are 
set to logic one, a loop circuit is formed. If any one of the inputs to the AND plane is set 
at logic zero, with all the rest of them at logic one, O's are formed on all of the inputs to 
30 
the vertical column in the OR plane. This effectively breaks the loop created in the circuit 
shown in figure 2.7 (a). Thus a zero at a particular b_in input will effectively act as a 
shield for the signals being fed to the OR plane from the right side of that particular input. 
This effectively opens the closed loop fonu of the circuit and the equivalent circuit acts 
exactly as the SLPE. 
As an example a typical situation is shown in figure 2.13. Here the input b _in3 is 
set at logic low and all the rest of the b_in inputs are set at logic high. The AND gates 
that are tumed pem1anently off by this particular input combination are shown in the 
figure as grayed. The crossover network flowing towards the left, which carry low signals 
arc shown as grayed in the figure. Thus the input b_in3 effectively has separated the 
circuit into two parts. The right part of the circuit with the inputs a_ini and a_in3, and the 
lell part with the inputs a_in2 and a_in1• Thus any input combination of the inputs a_in2 
and a_in 1 will not show in the outputs enc_ou4 and enc_out3. While if any of the inputs 
a_in4 or a_inJ are at logic high level, then this shows up in the output enc_out2 and 
enc_out1 through the wrap-around networks. If the right and the left part are taken 
individually without the wrap-around network, they can be considered as two 
independent SLPE's. The inputs a_in3 and a_i14 forms the first SLPE which can be 
named as SLPE 1. Since for the SLPE the rightmost input will have the highest priority, 
the input a_in3 will be the highest priority input in the SLPEl. The inputs a_in2 and a_in 1 
fonn the second SLPE named SLPE2. Here since a_in 1 is the rightmost input, this input 
wi II have the highest priority. The wrap around network serves as the connection between 
SLPE2 and SLPE I. Since the a_in inputs are propagated from right to left in the circuit 
31 
because of the orientation of the crossover network in the OR plane, SLPEl will have 
higher priority than SLPE2. Thus a combined SLPE can be assumed with SLPEl on the 
right side of SLPE2. This representational diagram is shown in figure 2.14. Here the wrap 
around circuit can be assumed to supply a signal, kill_in, to SLPE2. The kill_in signal is 
generated by SLPE 1.. if none of the inputs in the SLPEl are at logic high state, then the 
kill_in signal will have no effect on the operation of SLPE2. While if any one of the 
inputs in SLPE 1 is at logic high then the kill_in signal will be such that the output of the 
SLPE I is always at logic low independent of the inputs a_in1 and a_in2. Thus the entire 
LATPE can be represented as a simple SLPE with the input order rearranged as shown in 
step 4 of figure 2.14. Thus the new priority order from highest to lowest priority is a_in3, 
a_in.i, a_in 1 and a_in2• In this way for each of the new combination for the b_in input, the 
circuit can be represented by a simple SLPE but with a different order of input signals 
a_in. The dark band in the step 1 of figure 2.14 represents the permanently turned off 
AND gates shown in figure 2.13. Thus the closed loop can be broken at that point as 
shown in step 2. Step 3 shows how the two separated parts can be rearranged so that they 





enc_out ~ kill 3 enc_out 3 enc_out 2 enc_out, 
I 
: J1 Ii---' --H.-J,+-~-i- ( ._).------1-{ r-+-"--+"LJ--"-----t-Ll-+-'-+-l___,1....---=..............;_He--1 









~ ~ . _J i 
a_in. b_in 4 
(1) 
I I y ~., 





a_in 2 b_in2 a_in, b_in 1 
( 1) ( 1) 
Figure 2.13: Operation for the 4-input LATPE when b _in3 = 0 
33 




b_in3 (0) b_in2 (1) b_in, (1) 
a_in 1 





Step 2 b_in4 (1) 
b_in3 (0) b_in2 (1) b_in1 (1) 
a_in4 a_in3 a_in2 a_in1 




~ ~- -- -~.:::;;;:;;;:;;;.;:;:~ _____ a _ i_n_2 ____ a _ i_n_, -------· _ a __ in-4 _____ a _ i_n_3 ___ _ 
b_in2 (1) b_in, (1) b_in4 (1) b_in3 (0) 
enc_out2 enc_out, enc_outi enc_oub 
SLPE 




Figure 2.14: Representational steps for a 4 bit LA TPE 
34 
2.4 Modular Priority Encoder (MPE) 
From the analysis of LATPE it is clear that although the circuit offers an O(log N) 
delay. the practical implementation of the circuit becomes very difficult with the increase 
in the number of input signals. The overall design of the 128 bit LATPE shown in 
Appendix B shows that a large area is taken by the interconnect networks alone. This is 
because of the increase in the size of the cross over networks in the higher levels of the 
circuit. which dominates the delay. Although a buffer circuit as shown in figure 2.9 can 
be used to minimize the delay, the use of such a buffer greatly increases the area 
requirement of the circuit. The layout of the circuit also becomes very challenging, as the 
buffer circuit has to be redesigned for each new stage. Thus the amount of circuitry that 
can be reused in the layout process is very minimal. Another major delay contribution is 
caused by the wrap around network. These wrap around network along with the cross 
over network take up most of the area in the higher stages of the LATPE (see the layout 
of the 7'11 stage in Appendix B). While a SLPE is similar to the LATPE, the fact that there 
is no wrap around circuit makes the SLPE circuit much faster than the LA TPE. Moreover 
it is shown in section 2.3.2 that a LATPE can be represented as a cascade combination of 
two SLPE~s, with one of the SLPE providing kill_in signal to another. Thus if the same 
principle can be physically implemented, then a round robin encoder can be implemented 
by a simple combination of smaller SLPE's. Similar hierarchical methods have been used 
to design linear encoders in the past [14] [15]. A 32x32 switch has been implemented in 
reference [ 16 ]. Thus a method to break the total encoder circuit into smaller units can be 
found such that the units are independent of each other. Each of these smaller units can 
35 
be a SL PE of smaller order such that the crossover networks are not too large to require a 
buffer circuit. Appendix B shows the layout for the 3rd stage of the 128 bit LATPE where 
the interconnect networks are in similar proportion to the circuit elements. This 
significantly reduces the circuit area and delay. Each of these units is controlled by 
another circuit which should be in a round robin configuration. The purpose of this circuit 
is to provide appropriate kill signals to the individual units. Since all the SLPE units are 
identical in construction~ only a single unit is needed to be implemented which acts as a 
template. The other units are just instantiated from the constructed template. This greatly 
reduces the layout complexity of the circuit. By using such a method the LA TPE circuit 




Modular Approach of Design 
3.0 Introduction 
As mentioned in the section 2.4 a modular approach of design should have several 
smaller independent units which are controlled by means of a common circuit which 
supplies them with kill_in signals. A block diagram representation of the circuit is shown 
in figure 3.1. 
As shown in figure 3.1, the overall circuit can be divided into three stages. Stage I 
is the initial stage of the MPE. The inputs to this stage are the a_in and b_in inputs to the 
encoder circuit. This stage provides signals to the second stage. The structure for the 
second stage is similar to a LATPE circuit. This second stage has a much smaller number 
of inputs compared to the encoder circuit itself. This reduction in the number of inputs 
helps to keep the second stage much simpler to implement. The output of the second 
stage provides kill in signals to the third stage. The third stage consists of several 
independent units. These units are similar to the simple linear priority encoder. The 
number of inputs to these units is much smaller than the number of inputs to the encoder 
37 
circuit. Thus, the crossover network in these units is not large enough to require a buffer 
stage. This greatly simplifies the design of the circuit and also saves considerable chip 
area. Also only one of the units needs to be designed, since all of the units in stage 3 are 






















Round Robin Encoder 





(a_inr , .b_inr 
I (a_inr M-1 ,b_inr 1.1.,) (a_inr 2 ,b_inr 2) 
















Figure 3.1: Structure for the modular approach of Encoder design 
38 
3.1 Implementation Detail 
For the design of an N input priority encoder, the inputs can be subdivided into 
smal lcr groups. In figure 3.1, the N input signals have been grouped into M groups. For 
case of analysis as well as to make maximum reuse of the circuit, these groups can have 
an equal number of inputs. If the N inputs are divided into M equal groups then each of 
the groups consists of L inputs, where N = L x M. If L happens to be a fractional number 
while dividing N by M, then either the number of groups or the number of inputs can be 
varied such that al I the groups have an equal number of inputs except one of them which 
wi II have the remaining number of inputs. The choice of the number of groups, M, and 
the number of inputs within each group L must be done judiciously. The number of 
groups, M, detennines the size of the second stage, while the number of inputs within 
each stage gives the size of the units used in stage 1 and stage 3. From the layout of the 
circuit, the number of inputs for which the circuit does not require extra buffer stages for 
the crossover networks was found. The result is given in table 3.1. 
Round Robin Linear 
(Stage 2) (Stage 1 & 3) 
Maximum no of inputs 8 16 
Table 3.1: The maximum number of inputs without the use of buffers 
Thus if the above scheme is used, then the encoder with 128 inputs can be 
constructed without using any buffer stages for the interconnect network. Also only two 
39 
basic blocks need to be designed. One of them is a modified 8-input LATPE circuit used 
in stage 2, and the other is the modified SLPE circuits used in stage 1 and stage 3. Thus 
the modified SLPE unit is used 8 times in stage 1 and 8 times in stage 3. Since the 
modi lied LATPE has only 8 and the modified SLPE has only 16 inputs, the circuit area is 
not dominated by the wrap around or the cross over network. This results in substantial 
saving of area as compared to implementing the whole 128 bit encoder circuit using the 
LT APE approach. 
3. 1. I Stage I 
The input signals are bundled into M groups. If all the groups are assumed to have 
L inputs~ then each unit in stage 1 will have L of the a_in inputs and L of the b_in inputs. 
These signals are bundled into combined signal (A_in k, B_in k) in figure 3.1, where k 
denotes the group number for the signals. Each unit in stage 1 provides only two single 
bit outputs to stage 2. These two outputs are a_inr k and b_inr k, where k denotes the 
group number. If any of the L b_in inputs to a particular k-th stage 1 unit is at logic zero~ 
then the output b_inr k must be zero, otherwise it is at logic one. This means that the b_inr 
signal is used to indicate to stage 2 in which of the M groups of input signals does the 
highest priority input fall. The a_inr k signal indicates whether any one of the priority 
inputs in any k-th group is at logic high. Thus if all of the L a_in inputs for any particular 
k-th group is at logic zero, a_inr k will be at logic zero. For the case when one or more of 
the L a_in inputs are at logic one, the output a_inr k depends upon the value of the L b_in 
inputs. Two different cases for the L b_in inputs can be identified, the first is the case 
when all the L b _in inputs are at logic one. That means none of the L inputs in that group 
40 
is the highest priority input. In this case if any one of the L a_in signals is at logic 1, then 
the output a_inr 1,; will be at logic I. If on the other hand any of the L b_in signals within 
that group is at logic zero, indicating that it is the highest priority input, then the value of 
the output a_ inr k wi II depend upon the relative position of the a _in signals that are at 
logic one with respect to the highest priority signal b_in. If any of the a_in inputs that are 
at logic one arc to the left of the highest priority input then in that case the output a_inr k 
,vi 11 be at logic one. If on the other hand none of the a_ in inputs to the left of the highest 
priority input are at logic one, then the output a_inr k is at logic zero. 
a_in 4 b_in 4 
( 1) 
a_in 3 b_in 3 
(0) 
a_in 2 b_in 2 a_in, b_in, 
(1) (1) 
Figure 3.2: Structure of unit cell in stage 1 with four inputs 
41 
Thus based one the above explanation of the operation of the basic units of stage 
L it is obvious that the structure for the basic unit of the stage 1 must look like the SLPE. 
The on I y di tTcrence is, for an L input SLPE, there are L outputs, where as for the L input 
basic unit of stage 1, there are only two outputs. The structure for the 4-input basic unit 
block for stage I is shown in figure 3.2. 
The circuit clements that are presented in figure 3.2 in a grey color represent the 
clements that are not needed in stage 1. They are needed for stage 3 which will be 
discussed later. 
3.1.2 Stage 2 
Stage 2 takes the inputs from the stage 1 and then provides the output signal to 
stage 3. Stage 2 does not have any unit elements. The structure for the stage 2 is very 
similar to the structure of the LATPE. The exact structure of stage 2 is shown in figure 
3.3. The dark connection in figure 3.1 going from the left to the right of stage 2 represents 
the wrap around network. This is the only stage where the wrap around network is 
needed. The number of the wrap around signals depends on the number of inputs to stage 
2. Since the number of the inputs to the second stage is the number of unit elements in 
stage I . The stage 2 doesn't require a large number of inputs even for the case when the 
total numbers of inputs to the encoder circuit are high. For example, for a 128 bit input 
encoder if each unit of stage 1 has 16 inputs, then the second stage requires only 8 inputs. 
This reduction keeps the wrap around network as well as the cross over network small 
42 
enough such that no extra buffer stages are required. Since the wrap around networks is 





. I ',: 















I') i 1 \ 
-.-- 11"1" !~--__J ~_______: 
a_in 3 b_in 3 a_in 2 b_in 2 a_in, b_in, 
Figure 3.3: Structure for a four input unit in stage 2 
Stage 2 gets inputs from stage 1 which indicates which of the M groups of stage 1 
has the highest priority input and also which of the inputs in the M groups of L inputs are 
at logic high. The operation of the second stage is exactly the same as that of the LA TPE 
except for the fact that it generates the kill_in signal rather than the enc_ out signals of the 
LA TPE circuit. Generating the kill_in signal requires one less stage than to generate the 
43 
enc_ out output. In figure 3.3 the grayed elements represent the components that need to 
be removed from the LA TPE circuit. The kill_in signal always starts at logic high from 
the first highest input at logic high all the way to the lowest level input signal in the round 
robin fashion. Thus at any instance of time, there can be more than one kill_in signal at 
logic high. Stage 2 provides the kill_in signals to the stage 3. 
3.1.3 Stage 3 
Stage 3 has the same number of unit as stage 1. Each units of stage 3 has a kill_in 
input from the stage 2 as well as L a_in and the b_in inputs. The output of the units of 
stage 3 will produce the output enc_out. The structure for this unit is shown in figure 3.4. 
Figure 3.4 shows that the circuit is very similar to the unit elements of stage I shown in 
figure 3.2. The grayed circuit elements in figure 3.2 represent the additional circuit 
elements that are present in the unit elements of stage 3. The leftmost input in figure 3.2 
is pem1anently connected to logic low, whereas in figure 3.4 the same input is connected 
to the kill_in signal provided by stage 2. This kill_in signal if at logic low indicates that 
the priority token has not yet been used in any other higher priority group. Thus the unit 
behaves like a SLPE. On the other hand if the kill_in signal is at logic high then it 
indicates that the priority token has already been used in the higher priority blocks. At 
this time all the stage 3 units which do not have the highest priority inputs can be lumped 
into one category., and the single unit block which has the highest priority input can be 
placed into a second category. If the kill_in signals to the first category units are high, 
then all of the enc_out signals of those blocks must be at logic low., irrespective of the 
input bit combinations. On the other hand if the kill in signal is high to the second 
44 
category unit with L inputs where the highest priority input is at the k-th position from 
the right. then none of the rightmost enc_out outputs up to the k-1-th position will be at 
1. Of the remaining L-k+ I leftmost enc_out signals, one and only one can be at logic 
high, if in fact at least one of the leftmost L-k+ 1 input are at logic high. Since each stage 
3 units can be programmed to have any one of its input as the highest priority input and 
the operation of the circuit is linear without the wraparound networks, this circuit can be 

















a_in 4 b_in 4 
L 
a_in 3 b_in 3 a_in 2 b_in 2 
L 
a_in 1 b_in 1 




The grayed circuit elements in figure 3.4 show the extra elements as well as the 
network that are necessary to form the unit elements of stage 1. As explained earlier in 
the circuit operation of the LATPE circuit, the begin signal breaks the PLPE into two 
independent SLPEs. But in this case as the wrap around network is not present, the two 
separated encoders will act independently. Thus if the kill_in signal is connected to logic 
low. then it is possible to have more than one enc_out signal high at the same time, 
depending on the input bit combinations, if at least two high inputs fall in the two 
separated S LP Es. Thus the kill_in signal prevents this situation from happening. 
3.2 Improved Modular Priority Encoder (IMPE) 
From the operation of the MPE it is clear that the three stages; stage 1, stage 2 and 
the stage 3~ work in serial fashion. That means first the signals are fed to the stage 1 
which provides the output signals to stage 2 which then generates the kill_in signals to 
the stage three which will finally generate the encoder output signals. Thus while the 
signal is being processed in stage 1, stage 2 and stage 3 are sitting idle. And while the 
signal is being processed in stage 3, stage 1 is sitting idle. Also if the structure of the 
basic units in stage 1 and stage 3 are compared they show very similar structure. Figure 
3 .4 shows the structure for the basic unit of stage 3 which only requires one extra AND 
gate to tap the signal b _inr from it. The a _inr signal can be tapped directly from the 
leftmost column of the basic unit of stage 3, as shown in figure 3.4. Since stage 1 and 
stage 3 are never used at the same time and since the structure for each of them are very 
similar we can combine stage 1 and stage 3 together into a single stage. This combination 
46 
affccti\'cly cuts the circuit area by half without any additional disadvantage as compared 
to the M PE. The area reduction is mainly due to the fact that the main contributors to the 
area arc stage l and stage 3, each of which have M units, whereas stage 2 has only one 
unit with comparable size to the unit element of stage 1 or stage 3. Figure 3.5 shows the 
block diagram for the design. 
Stage 2 












Figure 3.5: Block diagram for the improved modular priority encoder (IMPE) 
3.2.1 Operation of IMPE 
It is seen from figure 3.5 that the IMPE has only two physical stages. The total 
delay is still the same three-stage delay as that of the MPE shown in figure 3.1 The 
reason for this is that the IMPE approach reuses stage 1 to perfonn the function of stage 3 
ofthc MPE. 
47 
The structure for the basic unit for stage 1, shown in figure 3 .6, is similar to the 
basic unit of the stage 3 for the MPE with the additional and gate and the tapping 
required for stage l. Thus the circuit operation is also very similar to that of the MPE. 
a inr b_inr 
enc_oul \~---=-1, enc_out J 
-\ I I\ i 




r __J I I -· 
I 
I 
~ I I 
,..._, 
I 














b_in 4 a_in J b_in J a_in 2 b_in 2 
a_in, b_in, 
Figure 3.6: Basic Unit for stage 1 with four inputs used in IMPE 
The input signals a_in and b_in are bundled into M groups, each of which can be 
of equal or unequal sizes. The choice is governed by the complexity that is bearable in 
the unit elements. Stage I decides which of the M group of elements have the highest 
priority input as well as which of the input groups have made the request to the encoder. 
This information is fed to stage 2 through the inputs a_inr and b_inr. At this time the 
48 
output enc_ out of the first stage are not the valid outputs. The structure for stage 2 is 
exactly the same as that for stage 2 of the MPE as shown in figure 3.3. Thus the stage 2 
for the lMPE operates the same as stage 2 of the MPE and provides the kill_in signals. 
Now these ki II_ in signals are fed back to the stage I. The effects of the kill_in signals in 
the IMPE arc exactly the same as the effect of the kill_in signals in the third stage of the 
M PE. When stage 1 provides the a_inr and the b_inr signals to stage 2, more than one of 
the enc_ out signals in the M input groups can be at active 1, because at this time each of 
the units of the stage I acts as a PLPE with kill_in as zero. When the kill in signals are 
provided by stage 2, only the enc_out with the highest priority will remain at logic high. 
All of the other enc_out will be forced to logic low. Since the signals travel through stage 
1 twice and once through the stage 2, the delay through the network is exactly the same 
as the delay through the MPE. The great advantage of this approach is the reduction in 
the chip area to half that of the MPE. 
49 
Chapter 4 
Implementation of the Designs 
4.0 Introduction 
This chapter gives the exact design implementation for each of the different 
encoder circuits that have been used in this thesis. The encoder circuits that have been 
implemented are the BSPE, LATPE, MPE and the IMPE. All of the designs were 
implemented using the Verilog hardware description language (HDL). Since the main 
objective of the design is to achieve the minimum delay that is possible, strict control of 
the structure of the circuits is necessary. Thus all of the circuits have been implemented 
using structural Verilog code. All of the encoders that have been implemented have 128 
a_in inputs and 128 b_in inputs. Because of the large number of inputs, the amount of 
HDL that has to be written is very time consuming if the whole circuit is implemented 
using structural Verilog code. Since all of the circuits that have been implemented have a 
very regular fonn, software programs that are written in high level language can be used 
to generate the structural Verilog code. Thus for each of the implementations, software 
programs in Visual Basic (VB) were written to generate the structural Verilog code. 
Apart from the Verilog implementation, the LATPE and the IMPE were also 
50 
implemented using the Cadence Virtuoso layout editor. Finally the IMPE was 
inco1vorated into a chip with a 40 pin dip pad driver for the fabrication of the circuit. 
4.1 Implementation of BSPE 
Implementation of the BSPE was done using the structural Verilog code only. The 
size of the encoder was 128 bits. In order to facilitate fast coding of the circuit, a program 
was written in Visual Basic to generate the Verilog code. Each of the nodes in the barrel 
shifter circuit was implemented using the built in multiplexer MX2 l of the AMI3HS 
0.35 µ CMOS standard cell library and the OR gates for the KSG were implemented using 
the standard cell OR2 l from the same cell library. The cell description and the delay for 
the components used from the cell library are presented in Appendix A. 
4.2 Implementation of LATPE 
Two different implementation of the 128-bit LATPE circuit were done. The first 
one was using the Verilog HDL and the second one using the Cadence Virtuoso Layout 
editor. The detail description for each of the implementation is presented below. 
4.2.1 Verilog Implementation 
The entire circuit as described in section 2.4 of this thesis was implemented in 
structural Verilog code. A software program written in Visual Basic was used to generate 
the structural Verilog code. The AMI3HS 0.35µ CMOS standard cell library was used to 
51 
implement the individual components. The AND plane was implemented using the AA2 l 
cell from the cell library and the OR plane was implemented with the OR21 cell of the 
cell library. The cell description and the delay for each of these are presented in 
Appendix A. 
4.2.2 Layout of LATPE 
The circuit layout was created using the Cadence Virtuoso Layout editor. No cell 
library was used for this purpose. The process used was the AMI 0.6 µ C5N from the 
American Microsystems Inc. This process has three metal layers and two poly layers. The 
second poly layer was never used in the circuit, since this poly layer is used for the 
fabrication of poly to poly capacitors. The basic implementation of the circuit is exactly 
as shown in figure 2.8. All the circuits are implemented as static CMOS circuits. The 
circuits are easy to implement if they are in NAND-NOR form rather than in AND-OR 
fonn. Thus the type I, type2 and the type3 networks as shown in the figure 2.10, 2.11 and 
2.12 respectively are changed to the NAND-NOR form and are shown in figure 4.1,4.2 
and 4.3 respectively. 
J 
I 




Figure 4.2: Modified type 2 node for LA TPE 
I 
"---, 






/ \ ! 
~ ! --- I,~~ 
I I 




The total components that are used in these modified three types of nodes are the 
2-input NAND, 2-input NOR and the inverter circuit. Each of these elements is 
implemented using static complementary MOS design as shown in figure 4.4 through 4.6. 
Both P and N type of MOSFET are used. All of the transistors that have been used are of 
minimum sized geometry. 
53 
IN .. OUT 















Figure 4.6: Static CMOS implementation of the 2-input NOR gate 
The entire wrap around network is implemented in Metal 3, so that the delay 
offered by them is minimum. The cross over network is implemented in metal 1 and 
metal 2. All the transistors used in the circuit are of minimum geometry, with length 0.6µ 
and width of 12µ, no sizing up of the transistors was done. The layout of each element 
was done in such a way that the number of contacts on the output node was minimized, 
which helps to make the circuit faster. Also doglegged gates were used wherever possible 
to save the circuit area. The N-type of transistor was laid out directly on the P-type 
substrate, whereas the P-type transistor was laid out on the N well. Substrate contacts 
were placed at a number of places to prevent the latch-up problem. 
55 
4.3 Implementation of MPE 
It is clear from the block diagram for the MPE that the hardware used in stage I is 
very similar to the hardware used in the stage 3. Thus there was no need to implement the 
circuit in the layout fom1, thus only Verilog version of the circuit was coded. This version 
allowed one to compare the relative speed with its IMPE counterpart. The entire circuit 
was coded in structural Yerilog using software developed in Visual Basic. The 128-bit 
encoder was implemented in purely structural Verilog as shown in the block diagram of 
figure 3.1. Each of the elements of stage 1 was implemented as shown in figure 3.2. The 
implementation of stage 2 and the unit element for stage 3 was done as shown in figure 
3.3 and figure 3.4 respectively. 
4.4 Implementation of IMPE 
The IMPE circuit was implemented using both Verilog HDL and the Cadence 
layout editor. The size of the encoder implemented was 128-bits. The following section 
provides the description for each of the implementations. 
4.4.1 Verilog Implementation 
The circuit was implemented as shown in figure 3.5. The unit element of stage 1 
was implemented as shown in figure 3.6. Since stage 2 is exactly the same as that for the 
MPE, figure 3.3 was used for its implementation. The entire circuit was defined in purely 
56 
structural Yerilog code. Software was developed m Visual Basic to generate the 
structural V cri log code. 
4.4.2 Layout of Il\1PE 
The layout of the circuit also follows the block diagram as shown in figure 3.5. 
For the first stage., the input signals are grouped into 8 groups with each group having 16 
inputs to it. Thus the second stage should have 8 inputs to it. This decision for taking 16 
bit units in the first stage and the 8 bit second stage is based on the observation of the 
complexity in the crossover or the wrap around network of the LATPE circuit. From the 
layout of the LATPE it was found that with the wrap around network, the circuit didn't 
require extra buffer stages to drive the interconnect networks if the input was 8 bits. For 
the basic unit of stage 1 in figure 3.5, the circuit is very similar to stage 2 but without the 
wrap around network. Thus for this stage, the maximum number of inputs that can be 
placed in a group without requiring the circuits to have the extra buffer stages for driving 
the interconnect load was around 16. Since the wrap around network also provides 
significant load, the unit element of stage I can have more inputs than for stage 2. As is 
the case for the layout of the LA TPE circuit, each of the gates are implemented in the 
NANO-NOR form rather than in the AND-OR form. All the gates are implemented in 
fully complementary static CMOS form and are shown in figure 4.4, 4.5 and 4.6 for the 
NOT, NANO and the NOR gate respectively. The floor plan for the layout is given in 
figure 4. 7, which is very similar to the block diagram shown in figure 3.5. 
57 
Enc_c;>l!~.~___.____. _____ ~ 
LIN RI 
·-1f1n I . 
- I 
E_nc _ outx.-CC: I ----. 
A_in 1 
I - ] 
·-a _in 2 ...,_.,.__---:-A_-:-in-2---:::~::c--' 
Enc_e>_1:1!~---'--+-.--------. 
I -·, ! -S_in 4 
. ii, 
• I I 


















I 'i I I 
'-+-------------! li 1 .....__,1--------------· ! 
Figure 4.7: Floorplanning for the 128-bit IMPE 
The block labeled ROUND contains the circuitry for stage 2 as shown in figure 
3.3. All the blocks that are labeled LINEAR contain the circuitry for stage 1 of the IMPE 
as shown in figure 3.6.The layout for only two blocks needs to be done~ one for the 
ROUND block and the other for the LINEAR block. Then the eight LINEAR blocks and 
the single ROUND block can be interconnected as shown in the figure above. The layout 
is made such that the encoder circuit as a whole is more or less square in shape. This 
58 
helps to keep the circuit compact. Also since the circuit consists of several smaller blocks 
that arc connected by a very few lines, the shape of the encoder can be adjusted to the 
a\'ailablc chip area if the encoder is used in conjunction with other circuits. Appendix B 
shows the layout for the 128 bit lMPE. The above circuit has been planned to be 
fabricated f'rom MOSlS~ thus additional circuitry are needed for connecting the inputs 
and outputs to the external world. The next chapter defines the floor plan and the 
additional layout issues that are associated with the complete chip. 
59 
Chapter 5 
Chip Layout of 128-bit IMPE 
5.0 Introduction 
The IM PE circuit with complete input and output pads has been laid out to be 
fabricated using the MOSIS fabrication process. In addition to the basic units for the 
IMP l: circuit that have been described above, other circuits are also needed to make the 
chip easily testable. The chip has to be tested both for functionality and speed. The pad 
circuits need to be used to interface with the external world. Since the encoder is 128 bits 
in size. there are 128 a_in inputs, 128 b_in inputs and 128 enc_out outputs. Thus the total 
numbers of input and output pads that are needed are around 384. To fabricate a chip 
with this many inputs requires a very expensive package. One common method that has 
been used to get around this problem is to use scan registers. This methodology has been 
implemented to reduce the number of inputs and outputs so that the entire circuitry can fit 
into a 40-pin package. The pad frame has been supplied by MOSIS group. With the use 
of scan registers, the additional issue of supplying the clock signals to the registers arises. 
AJso there has to be some mechanism to determine the speed of the encoder circuit alone 
without the additional delay of the scan registers and the input output pad drivers. 
60 
5.1 Input/Output Pads 
The signal generated by the circuit has to be transferred to the external world. The 
pad dri\·crs arc used for this purpose. The pads are then connected to the pin of the chip 
package using bond ,,·ires. For the purpose of testing the IMPE in this thesis, a pad frame 
laid out using 0.(1~t technology. The pad frame is shown in figure 5.1. The pad frame was 
pro\·idcd by MOS IS and was not designed as part of this thesis. 
~~---------1500µ-------- ......... 
[[I 1/0 1/0 
(11110 .......... ----900µ----~ 1/0 
111111111111111 ..... --..... ... ____ iiiiiiiiiiiiiii 
~1ro oo 
E[I 1/0 1/0 
1311 1/0 1/0 
Iii VDD GND 
6] 110 1/0 
[ED 1/0 1/0 
,~~, 1/0 1/0 
,~SI 1/0 1/0 
:::::: :::::: 
0 0 
Iii) Iii Ii] • 111 lil •••• 
Figure 5 .1 : 40-pin pad frame 
61 
There are 40 pads., with 38 of them being the bidirectional input/output pads and 
the remaining two being used for the ground and the VDD connections. The structure for 











j. __ J 





Figure 5.2: Circuit diagram for the bidirectional input/output pad 
All the incoming and the outgoing signals are defined with respect to the chip. 
Thus an output signifies that the signal is fed from the circuitry to the external world. The 
input signal signifies the signal flowing from the external world to the chip. Since each of 
the input/output pads is bidirectional, they can be used either as an input pad or an output 
pad. The OEN input is used to define whether the pad works as an input or the output. If 
OEN is set at logic high .. then the pad works as an output pad and if it is set at logic low it 
62 
,vorks as an input pad. The OEN input should be connected to the logic low level for all 
the pads that arc not used or used as input only pads. The transistors that are used for the 
output buffer arc ,·cry large with W/L of the order of 100. Also the channel length is 
made larger than the minimum length allowable to provide a high breakdown voltage. To 
prevent the gate resistance to grow to a large value for these wide transistors, folded 
layouts arc used. To prevent the latch up problem for these wide transistors guard rings 
arc used around the transistors. 
5.2 Scan-Based Test Technique 
The IMPE requires 128 a_in inputs and 128 b_in inputs as well as 128 outputs 
from the encoder circuit. Thus a large number of input/output ports are required. In order 
to reduce the number of input/output ports to a more practical limit, several approaches 
avai I able for testing can be used. In the current thesis, serial and parallel scan based 
techniques have been incorporated. A general block diagram for the serial scan technique 
is shown in figure 5.3. It shows that the combinational logic that has to be tested is 
sandwiched between two shift registers. The first shift register takes the input from the 
external world through the single input terminal termed as Scan_in in serial fashion. After 
all the input signals are scanned into the input register, the entire input signal is applied to 
the circuit under test in parallel fashion. The circuit under test in this case is the 
combinational logic shown in figure 5.3. The circuit under test operates on the input and 
generates the output signals, which is latched into the output scan register in parallel 
fashion. The content of the output scan register is then clocked out serially through the 
63 
Scan out tem1inal to the external world. Thus usmg this approach the number of 
input/output pads can be significantly reduced. 
Scan out .. - Scan Register 
Combinational Logic 
Scan Register 
Figure 5.3: General diagram for scan based testing technique 
If a single scan register is used for the entire inputs and outputs, the scan chain 
will be quite long which can dominate the testing time. The scan chain can be broken into 
several small chains. Thus in this approach, there can be more than one scan register at 
the inputs and the outputs. More about the scan based techniques can be found in 
reference 3. 
Each of the scan registers is implemented with of edge sensitive registers. Several 
different designs of the edge sensitive registers exist. The following subsections provide 
the design that has been used in this thesis and other issues related with the design. 
64 
5.2.1 Scan Registers 
The scan registers are implemented by cascading several edge sensitive latches in 
series. Di ff crcnt versions of the edge sensitive register design exist. For this thesis, static 
true two phase positive edge sensitive registers have been used. Figure 5.4 shows the 
basic diagram for the positive edge sensitive register used. 
OUT 
IN +---'---.... 
Figure 5.4: Schematic for a true two-phased clock positive edge clock latch 
The two-phased approach is used to make the register immune to clock skew. 
Clock skew and clock distribution problem are discussed in the next subsection. Phi 1 and 
Phi2 are the non-overlapping clock signals. The system can be made to work by 
increasing the non-overlapping period between Phi1 and Phh. Since non-full swing logic 
has been used, this approach consumes more power. But since it requires only two clock 
lines, clock distribution is easier. In this thesis, power is not an issue because there are 
only a few registers. Reducing area is more important since the entire IMPE circuit along 
with the scan registers has to fit into the pad frame with a central area of 900µ x 900µ. 
The true two phase approach which requires only two clock lines as opposed to the lower 
power consuming pseudo two phase clocked approach, which requires four clock lines., is 
more advantageous. 
65 
5.2.2 Clock Generation and Distribution Issues 
If a single-phase clock is used, the different edge sensitive latches of the scan 
register may receive the clock signal at different times. This is known as clock skew. 
Clock skew can be a significant problem in the circuit implemented as shown in figure 
5 .4. One of the common methods is to use non-overlapping two-phase clock signals. The 
non-overlapping clock signal can be generated from the single clock line as shown in the 
figure 5.5. 
Clk 
Figure 5.5: Two-phase clock generator 
Adding or removing the inverters can vary the non-overlap period. Figure 5.6 
shows the signal Phi I and Phii. 
Another issue that has to be addressed is that of clock distribution. After the clock 
generator generates the two-phase clock signals, both of these signals have to be routed to 
each and every edge triggered latch in the scan registers. In order for the clock signal to 
reach to all the registers at almost the same time, a balanced clock tree or the H clock tree 




Figure 5.6: Non Overlapping - clock 
The H tree is a very regular structure with recursive construction of H's. For each 
level of H structure, four smaller H structures can be added at the four end points of the H 
bars [ 4]. The size of the wires can be varied to account for the variations in the load 
capacitance to equalize the skew throughout the H tree [ 4].The H tree defines the floor 
plan of the circuit which may not be suitable for all cases. A balanced tree is based on the 
concept of the H tree, where the tree is irregular in shape but the branches are arranged so 
as to balance the branches to minimize the skew [4]. 
I\ 
Clk 
Figure 5.7: H-tree for the distribution of clock 
67 
5.3 Floorplan of IMPE in Pad Frame 
The lMPE circuit has been laid out inside the 40-pin pad frame with the internal 
area of 900p x 900~1. The tloorplan is shown in figure 5.8 and the layout in Appendix B. 
In order to prevent very large scan chains, the scan registers have been broken down into 
several small units. There are 12 scan registers. Each of these scan registers are 32 bits in 
size. The entire stage I of the IMPE has been divided into four groups. Since there are 
eight unit clements in stage I for the 128-bit encoder, each of the four groups consists of 
two unit blocks. Thus each group has 32 inputs. Tn figure 5.8, signals are named after the 
group to which they serve. Thus each 128-bit a_in input to the encoder has to be broken 
down into four groups of 32 bit words and fed serially through the input terminal ain 1, 
ain2. ain-' and ain4 • Similarly the 128 bit b_in inputs are fed in serially through the input 
pins named bin 1• bin2, bin3 and bi114. These 8 input registers are implemented using the 
edge triggered latched as shown in figure 5.4. A single true two-phase clock network 
controls all of these eight scan registers. The input terminal named, Clock_in, is used to 
give the clock signal to these eight registers. Since a single clock line is used to control 
all eight registers, the serial inputs to all the eight registers need to be supplied at the 
same time. The clock signal supplied by the external circuit is converted to the two-phase 
clock signal using an internal circuit as shown in figure 5.5. The distributions of these 
two phases are done utilizing a balanced tree clock distribution network as shown in 
figure 5.8. Since these two phase clock distribution networks run aJI over the chip 
surface, they provide considerable capacitive load to the two-phase clock generator 









. r-, r-,, ~-~--=-=-
O..tput Scan R•;,ster - "'-
-T-
0 LINEf I 
1· SNin-4'1R~ ,_ -
L.......J L.......J ~-~-~ - -
I 
' .--:-. r.:-:1. ~ --=-:::::a- -=-=- ---. - -~~lll~gtllel -
. ____ w,Ef [. J 
,,. ! Scan-en R-t•r -- -
L.......J-L=J~-~ - - - -
enouO bout, aout, 
LI I 
-11::"::::II.. --- I Oulput Scan Registor ~, 
( 1 U~B . 
Scan-in Register 
JI 
~ ---~ """C....!::r' 
\ 
~ -=-=- L..c-a::s:: ~ 
Oulplll Scan Register 
II 
Ll N~7 I Scan-lnR..--








. 11 11 r::"'l. ~cS:2:1 -=-'=-
Output Scan Rcgisi.r - -·--·T 
LINEf r w 
11 
-11:: -::::i.... ~ 1-c:-- --- ........ 








l : • r Scal\o<n R~l•r - I; - Sea~ Reglstor 11 




I II II r:-:t_'2::t.J5: ::a,_ -=-=- ~~ .J ~__..,____._ 
, Output Sc:.tn RctjllSlel - - - Output Scan Reg<$tl!r I_, 
;· -- --- LINE Rr I r1 [) NfAAS i1ilj--
•-' 4-, • • ·· Scol\o<n R sler 1 
-~ --' L.......J t.:....:.:....J t=:J-~~ ~ -.:....!:: • ~ ~ 
i. I I 
KillinE 
,: 
I ii ~ \I\ 1, -- )111• !- [ I I I r 1 I I I '".'"'.Co""--~ I I 11111 mr \ B_inr - , . RQIINO 
!]'1 i I: I 
'"i++inr 




I ·~ I [ ButtarCircud Buller Circuit 
0 0 
Latch_in Clock_in Latch_out 







capability. The circuit diagram for the buffer is as shown in figure 2.9. 
Each output from the scan register needs to be fed to the input of the encoder. 
Latches at the output of each of the scan-in registers are used to gate the input signals to 
the encoder circuit. While the scan-in registers are being fed serially, this latch is turned 
off thus disconnecting the scan-in registers from the encoder circuit. During this time~ 
69 
some input signals must be provided to the encoder circuit. Thus a second latch is 
connected to the inputs of the encoder circuit. The input to these latches is fixed by 
hardwiring them either to the ground or the supply terminal during layout. This latch 
should turn on when the latch connected to the output of the scan-in register is turned off. 
And it should turn off when the latch connected to the output of the scan-in registers are 
turned on. Thus these two latches connected to each of the input terminals of the encoder 
circuit should have complimentary control inputs. These two latches together act like a 2-
1 multiplexer. The schematic for the latch is shown in figure 5.9. Since all of these 
latches need to operate at precisely the same time so that they provide signals to all the 
inputs of the encoder at the same time, the control signal to all of the latches must arrive 
at the same time. Thus a balanced tree is used to supply the control signals to these 
latches. This control signal is connected to the input pad named, Latch_in. In order to 
enhance the driving capability this balanced tree is also driven by a buffer circuit as 
shown in figure 2.9. 
Finally there are four output registers, each of them are 32-bits wide. These 
registers operate slightly differently than the scan-in registers. While the scan in registers 
operate in serial input, parallel output and serial output mode, the output registers operate 
in parallel in and serial out mode only. The parallel input to the output registers are 
provided by the output from the encoder circuit. These outputs from the encoder circuit 
are latched into these registers by means of control signal named, Latch_ out. With a two-
phase clock generator circuit, this Latch_out signal is transformed in to true two non-
overlapping clock signals, which are distributed to all the latches in the output registers 
70 
by a balanced clock distribution network. After the output from the encoder is latched to 
the output registers~ the content of these registers are serially scanned out for the purpose 
of \'cri fication. Thus the four output pads enout1, enout2, enout3 and enout°.4 are used to 
To the clock tree 
Hard wired to 
either Vdd or Gnd 
To the input 
~--~, .-- terminal of the 
j encoder circuit 
for Latch_in 
L~, 
To the output of 
the Input scan 
register 
Figure 5. 9: 2 - 1 multiplexer used in the input side of the encoder circuit 
read the content of these output registers serially. Since the output registers are used in 
both parallel and serial mode, extra latches and control signals to separate these two 
conditions are needed. Thus an input pad named P _S is used for this purpose. This input 
signal is used to select the operation of the output registers in parallel or the serial mode. 
Figure 5.10 shows the latches that are needed for the output registers to function in either 
parallel or the serial mode as selected by the P _S input. 
The outputs from all the eight scan-in registers are also read serially for the 
purpose of verification. These signals are connected to the output pads named aout 1 ~ 
71 
aout~. aout ~ and aout.i for the a_in inputs and the output pads bout1, bout2, bout3 and bou4 
for the h in inputs. 
The V DD and the Ground lines are supplied to all the circuit blocks shown in 
figure 5.8 by using metal layers. Also all of the clock distribution lines are implemented 






p _s Encoded signal 
from the encoder 
circuit 
Two Phase Latch 
Figure 5.10: Latch circuit for the output registers 
72 
Chapter 6 
Testing and Verification Results 
6.0 Introduction 
This chapter presents the testing methodologies used to verify the functionality of 
the encoder circuit and the results from the simulation. A11 of the different approaches for 
the encoder designs were fully tested and verified for their functionality and correctness. 
All the circuits that were simulated were tested using a set of test vectors. Since 
all the circuits that were tested had 256 inputs, exhaustive testing was not possible. For a 
128-bit encoder there are 128 a_in inputs and 128 b_in inputs. Since each of the 128 a_in 
inputs can be either at logic high or low, irrespective of what logic levels the other bits 
are at, there can be 2' 28 = 3.402 x 10 38 total input combinations. Since only one of the 
b _in inputs can be at logic low at any time and all the remaining should be at logic high, 
the total number of input combinations for the b_in inputs is 128=27. Combining the 128 
a_in inputs and 128 b _in inputs the total input bits combination is 2'28 +7 = 2' 35 = 4.355 x 
1040 . With a typical testing device that can provide a test vector and sample the output at 
a clock frequency of 500 MHz, then it will take 2.76 x 10 24 years. Thus, the exhaustive 
73 
testing of the circuit is not possible. A practical approach is to test the circuit using a set 
of random I y generated test vectors from the full set of possible test vectors. A computer 
program was written in Visual basic to generate a set of randomly generated test vectors. 
All or the simulated circuits were tested using approximately 1000 test vectors. The 
folJowing sections describe the procedures used for testing the circuits. 
6.1 Testing Procedures 
All of the simulated circuits except the chip implementation of the IMPE follow 
the same general procedure. The circuits that are simulated using the Cadence Virtuoso 
Layout Editor are first extracted to determine the parasitic capacitances. Then a netlist of 
the circuit is generated. Then the test file is generated using the 'test vector generator 
so ft ware~. The test file contains the test vectors for verification as well as other 
infom1ation like the rate at which the inputs are to be supplied to the circuit under test 
( CUT)~ as well as the expected output from the CUT. For the circuit simulated using the 
Verilog code the extraction procedure is not necessary. As mentioned earlier, there are no 
inclusion of the capacitive load due to the interconnect networks. Apart from the chip 
implementation of the IMPE circuit, all the simulated circuits have parallel input, and 
parallel output. Thus the testing procedure merely involved applying the test vectors one 
at a time, waiting for a preset interval of time and then capturing the output of the CUT. 
This output is then compared to the expected output. If the expected output for all of the 
test vectors matches that of the simulated output, the circuit is working correctly in the 
preset inteival. The procedure is repeated until a least time interval between the applying 
74 
of the test ,·ector and capturing the output of the CUT is found for which the circuit 
output matches the expected output. This time interval is the minimum time interval at 
which the circuit works as expected. The inverse of this time will give the maximum 
frequency at which the circuit can operate. 
The testing for the chip implementation of the IMPE circuit is different from the 
other. The testing procedure follows the flowchart as shown in figure 6.1. First the output 
register is con figured into parallel mode by setting the P _S signal high. Then the inputs 
ain1, ain~, ain_,, ain.h bin,. bin2, bin3 and bi114 are set to the desired value of 1 or O and the 
C'lock_in signal is applied. Since each of the scan in registers are 32 bits wide, the input 
pattern is changed 32 times and then the Clock_in signal is applied. To allow the signals 
to be stable a certain delay is needed after this operation. After this operation, the input 
signals need to be fed into the encoder circuit. Thus, the Latch_in signal is set high. This 
operation connects the input of the circuits to the output of the scan-in registers, which 
have the valid input signals. After applying the input signals a certain delay is allowed 
until the output of the encoder circuit becomes valid. This set amount of time is 
detcm1ined iteratively, initially it is set to some value. If the circuit performs successfully 
in that time the duration is reduced, until the output of the circuit doesn't match the 
expected output. After waiting this preset interval of time, the output of the circuit is 
latched into the output registers. This is done by setting the Latch_out signals high. After 
the output signal is latched to the output registers, the output registers are configured to 
work in serial mode, by setting the P _S signal to 0. Then the content of the output 
registers as well as the input registers are read out serially by clocking 32 times into the 
75 
CUT_DELAY = Minimum 
expected delay through 
the encoder circuit 




' P_S= 1 
I 
' [~atch_out =O J 
- -- ' Set these inputs to the desired values 
a_in1 a_in2 a_in3 a_in.a 
b_in1 b_in2 b_in3 b_irt.a 
And apply the dock_in signal for 32 
times 
lw~ .. i.t for 1000 dock L cycles 
I Set Latch_in= 1 
Wail for CUT_DELAY 
seconds 
Set Latch_out= 1 
wait for 1 ns seconds 
Set Latch_out= O 
Wait for 1 ns seconds 
P_S=D 
' Apply the latch_out and Clock_in 32 
times while recording these outputs 
a_out, a_out2 a_out, a_out. 
b_out, b_out2 b_out3 b_out. 
enout, enout2 enout3 enout, 
Figure 6.1: Flowchart showing testing procedure for the chip implementation of IMPE 
76 
Clock in and the Latch_out signals. These seria1Iy read signals are stored and then 
combined to fonn a complete 128-bit string which are then compared with their expected 
values. The time interval denoted as CUT_DELA Yin figure 6.1 denotes the time interval 
bet\\'een the event when the Latch_in signal is turned high and the event when the 
Latch_ out is tumed high. 
For a particular value of CUT_DELAY, if the output of the circuit is as expected, 
then the circuit performs as expected within that time interval. The value of the 
CUT _DELAY can be progressively decreased until one or more of the output just doesn't 
match the expected output, which gives the minimum time interval at which the circuit 
performs properly. In the flowchart shown in figure 6.1, the CUT_DELAY also includes 
the delay caused by the latch circuits as well as the setup and hold time due to the output 
registers. To detenuine these extra delays in the hardware circuit is complicated. Thus, 
one can use a circuit simulator to determine the extra delays and then subtract them from 
the CUT DELA y to find the actual delay through the encoder circuit alone. 
6.2 TESTING RESULTS 
Table 6.1 presents the delay through each of the circuits that were designed and 
simulated for this thesis. It should be noted that the Verilog implementation uses the 
0.35 µ cell library, while the layout in the Cadence Virtuoso editor was done for 0.6µ 
process. 
77 

























Table 6.1: Delay through the various versions of 128-bit encoder circuits 
A1so the chip area comparison is necessary in order to assess the true value of a 
design apart from the delay determination. Only two of the designs need a direct 
comparison of the area. The relative area for the rest of the designs can be approximated 
from their structure itself. The two designs that needs direct comparisons are the LATPE 
and the IMPE layout circuits. Table 6.2 presents the sizes for these two designs for the 
input size of 128-bits. 
Design Type Area ( µ.!) 
LATPE 4575x567 =2594025 
IMPE 1225x637 = 780325 
Table 6.2: Chip area taken up by different designs of 128-bit priority encoders 
78 
Chapter 7 
Conclusion and Future Work 
7.0 Conclusion 
The main objective of this study was to design a round robin priority encoder with 
a large number of inputs with a delay of O(logN) where N denotes the number of inputs 
to the encoder circuits. Several different designs were conceived and then simulated as 
we11 as verified using either Veri1og HDL or the Cadence Virtuoso Layout tools. Each of 
the designs presents their own pros and cons which have been discussed in detail in the 
preceding chapters. The final results of the simulation are presented in table 6.1. The 
BSPE was simple in design but the fact that it has inherent delay of 0(3* log N) as well 
as the fact that the interconnect network gets too complicated in the higher levels of the 
barrel shifters make them less favorable. Also the size of the encoder more than linearly 
increases with the increase in the number of the inputs. Thus this is not an area efficient 
design. The LA TPE circuit has a smaller delay than the BSPE, which is also evident from 
Table 6.1. The LATPE has a delay of the order of O(logN). But this design still has larger 
interconnect loads for the higher levels, thus complicating the design. Also the relative 
chip area taken up by this circuit also increases linearly with thenumber of inputs and is 
79 
thus not an area efficient design. The MPE design is very simple, since it reuses the 
hardware designs. Only a few basic blocks need to be designed. These basic blocks can 
be instantiated many times to realize the final circuit. Thus, it is very easy to increase the 
number of inputs to the circuit if needed. The area taken up by the IMPE circuit is also 
very smaJJ compared to the area taken up by the BSPE and the LATPE circuits. The 
delay through this circuit is slightly larger than the O(logN) delay for LATPE. The exact 
delay through the MPE design depends upon the number of groupings that have been 
made in stage 1 and stage 2 of the design. If the N inputs have been grouped into M 
groups of L signals in the stage 1 then the total delay is approximately given by Jog(L) + 
log(M) + log(L) = 2 log(L) + log(M). Where the stage land stage 3 presents delay of the 
order O(logL) delays and the stage 2 presents delay of the order of O(logM). For the 128 
bit encoder in which the input signals are bundled as 8 group of 16 inputs the total delay 
is given by log216+log28+logil6 = 11 as opposed to the log2128 =7. Thus the increase in 
the delay is minimal, whereas the area reduction is very high, as well as the reduction in 
complexity of the design. The IMPE has a similar delay as that for the MPE circuit, but 
the added advantage is the reduction in area to almost half as compared to the MPE. This 
is due to the fact that the IMPE utilizes the similarities in structure of stage 1 and stage3 
of the MPE circuit and combines them into a single stage. Since stage 1 and stage 3 of the 
MPE are the major contributors to the area of the MPE circuit, the combination of stage 1 
and stage 3 in the IMPE reduces the area to almost half. Although the area for the IMPE 
is almost half of the area of the MPE, the delay is almost the same. The reason for this is 
that the signal travels twice through stage 1 of IMPE. Thus the delay remains unchanged 
as compared to the MPE circuit. For the IMPE circuit only two basic building blocks 
80 
need to be des igned. one for the stage 2 and another for the basic unit of stage! . The final 
circu it uses several instances of these basic buildino blocks thus makino the design 
b ' :::, ~ 
procedure very simple. Also since each of the building blocks have very few inputs to 
them. the interconnect networks in these blocks do not offer large capacitive loads to the 
s tage driYing th1,;m. thus obviating the need for the buffer circuits. This is another reason 
why these circuits are easy to design. The comparison of the area of the LATPE circuit 
against that of the IMPE is presented in table 6.2, which shows a large reduction in area. 
The IM PE circuit was also included into a 40 pin pad frame for tape out of the circuit. 
The c irc uit has been designed to be fabricated in a 0.5µ AMI process offered by MOSIS. 
The final circuit has additional circuits apart from the IMPE to faci litate testing and 
vcri fication. These additional circuits add extra delays, which is apparent from table 6.1. 
The delay caused by these extra circuits needs to be subtracted from the total delay to 
dete rmine the true delay of the encoder circuit a lone. 
7.1 Future Studies 
The lMPE circuit has a similar delay as that of the MPE circuit but the area for 
the IMP E is ha! f the area for the MPE. If the speed is the concern and not the area, then 
the MPE circuit can be slightly modified and used to f01m a very fast circ uit. From the 
structure of the MPE, it is quite obvious that the circuit consists of three distinct stages. 
The interconnections betvveen these stages are small in number. Moreover, the c ircuit 
uses the three s tages one at a time, in a seri al fashion. When stage 1 is being used, stage 2 
and s tage 3 are not being used and similarly when stage 3 is being used stage 1 and stage 
81 
2 arc not being used. The lower stages are just preserving their outputs until the higher 
stages reach a stable state. Thus pipeline registers can be included between these stages 
that preserve the state of the lower level stage. If this is done the lower level stages are 
free to start processing a new set of inputs. The circuit can be called the pipelined MPE 
(PMPE). Thus using the pipelined registers helps to greatly increase the throughput of the 
circuit. The pipelined registers need to be clocked at a certain frequency which is 
dctem1incd by the stage with the largest delay in it. For a 128 bit encoder, if the input 
signals are bundled into 8 groups of 16 signals, then the delay through the first and the 
last stage are the largeSL These delays detem1ine the pipeline clock frequency. The delay 
is given as log216 = 4, which is smaller than the log2128 =7. Apart from the delay of 
log216 there will be a small additional delay due to the setup and the hold time of the 
. · isters The disadvantage of th· d · . 
p1pelme reg · 16 es1gn 1s that, although the circuit works at a 
c. t speed there is some latency invol d Th 
very 1as ' ve · e latency of the circuit is 3 pipeline 
}es Thus for the applications wh · h 
clock eye · ic cannot afford latency, this approach will 
fi l If the circuit operation is not r~ 
not be use u · a iected by the latency then this approach 
. reat promise for a very fast operati . . . 
prov1des a g on. If the apphcatton circuit permits more 
the MPE circuit can be modifi d 
latency, then e to have a larger number of simpler stages~ 
. creasing the circuit throughput 
thus fu11her tn . 
82 
References 
[l] Gupta, Pankaj; Nick, McKe . D . . own, es1g11111g and J111ple111e11ti11g a Fast Crossbar 
Scheduler, Proc. The 45
111 IEEE Inte · . mationa \ Midwest Symposium on Circuits and 
Systems 2002, Page(s): 636-64 I 
[2] Sheng, S. Q.; Yang, Mei; Bl . 
. . 
anton, John; Galla, Prasad; Verchere, Dornm1que; A 
Simple and Fast para/le/ Round-Robin Arbiter for High-speed Switch colllrol and 
Scheduling, Micro, IEEE, Volume· J 9 1 · , ssue: I , Jan.-Feb. 1999, Page(s): 20 - 28 
[3] Wcstc, Neil H. E.; Eshraghian, K amran; Principles of CMOS VLSI design. A 
systems perspectii•e, Second Edition 
[ 4] Wolf, Wayne; lv/odem VLSI Design, Systems on Silicon. Second E
dition 
[5] Goldovsky, A.; Srinivas, H.R. ; Kolagotla, R.; Hengst, R
.; A folded 3.J-hir prefix 
tree adder in 0.16-µm static CMOS Circuits and Systems, 2000.
 Proceedings of the 43rd 
IEEE Midwest Symposium, Volume: 1, 8- 11 Aug. 2000 Page(s):
 368 -373 vol.1 
[6] P. Brent, Richard;Kung,H.T.; A regular Layout for
 parallel adders. IEEE 
Transact ions on Computers, 3 1: 260 -264, Mar, 1982. 
83 
[7] Beaumont-Smith, A.; Lim, C.-C.; Parallel prefix adder design, Computer 
Arithmetic. 2001. Proceedings. 15th IEEE Symposium on, 11-13 June 2001, Page(s): 218 
-225 
[81 .Jinn-Shyan Wang; Chung-Hsun Huang; High-speed and low-power CMOS 
priority encoders. Solid-State Circuits, IEEE Journal of, Volume: 35 Issue: 10, Oct. 2000, 
Pagc(s): 1511 -1514 
[9] Delgado-Frias, J.G.; Nyathi, J.; A VLSI high-performance encoder with priority 
lookahead. VLSI, 1998. Proceedings of the 8th Great Lakes Symposium on , 19-21 Feb. 
1998. Pagc(s): 59 -64 
[ 1 O] Delgado-Frias, J.G.; Nyathi, J.; A high-pe1formance encoder with priority 
lookahead. Circuits and Systems I: Fundamental Theory and Applications, IEEE 
Transactions on, Volume: 47 Issue: 9, Sept. 2000, Page(s): 1390 -1393 
(11] Jinn-Shyan Wang; Chung-Hsun Huang; High-speed and low-power CMOS 
priority encoders, Solid-State Circuits, IEEE Journal of, Volume: 35 Issue: 10 , Oct. 
2000, Page(s): 1511 -1514 
[ 12] Katevenis, M.; Fast Switching and Fair Control of Congested Flow in 
Broadband Networks, Selected Areas in Communications, IEEE Journal on, Volume: 5 
Issue: 8, Oct I 987, Page(s): I 315 -1326 
84 
f I~ I Sun Huajin: Gao Deyuan; Zhang Shengbing; Wang Danghui ; Design fast round 
rohin scheduler in FPGA. Communications, Circuits and Systems and West Sino 
Expositions. IEEE 2002. International Conference on, Volume: 2, 29 June- I July 2002, 
Pagc(s): 1257- 126 1 Yol.2 
/ 1-i_J Yamagata. T.: Mihara, M.; Hamamoto, T.; Kobayashi, T. ; Yamada, M.; A 288-
/,;hir Jit/~1· f1Ura!lcl co11tc11t addressable 111e11101J using stacked capacitor cell structure. 
Custom Integrated C ircuits Conference, 1991., Proceedings of the IEEE 1991 , 12-15, 
i\tl ay I l)l) I. Pagc(s): I 0.3/1 -I 0.3/4 
fl 5J Y,1magata, T.; Mihara, M.; Hamamoto, T.; Murai, Y.; Kobayashi , T. ; Yamada, 
M.: Ozaki, I l.; A ]88-kh ji1/ly parallel co111ent addressable memo,y using a stacked-
cu1)({ciror cell structure. Solid-State Circuits, IEEE Journal of, Volume: 27 Issue: I2 , 
Dec. 1992. Pagc(s): 1927 -1933 
[ 161 S hin. E.S.; Mooney, VJ., III ; Riley, G.F.; Ro1111d-robi11 Arbiter Design and 
Generution. System Synthesis, 2002. 15th International Symposium on, 2-4 Oct. 2002, 




Description of 0.35 Micron CMOS Standard Cells Used 
AA21 
Description 
AA21 is a two input gate which performs the logical AND function. 
Verilog Syntax: AA21 inst_name(Q,A,B); 
Where, Q is the output and, A and B are the two inputs to the AND gate. And 
each of the pins A and B offers 1 equivalent load to the circuit driving it. Each equivalent 
load is equal to 27.7fF. 
Propagation Delays (ns) 
No of Equivalent Loads 1 4 8 11 15 
From Any input to Q tpu-1 0.15 0.23 0.34 0.42 0.53 
trHL 0.14 0.23 0.34 0.41 0.5 
86 
lp111 : Input lo output propagation delay for a rising edge on the output 
lp111 . Input to output propagation delay for a falling edge on the output 
J\l so each or the equivalent load in this table needs to be multiplied by 1.47. 
becaus e an interconnect capacitance of 13fF is already added while calculating the 
dcla\·s. 
O R 2 1 
Description 
OR2 l is a two input gate which performs the logical OR function. 
Vcrilog Syntax: OR21 inst_name(Q,A,B); 
Where, Q is the output and, A and B are the two inputs to the OR gate. And each 
of the pins A and B offers I equivalent load to the circuit d1iving it. Each equivalent load 
is e qual to 27.7 fF. 
Pr·opagation Delays (ns) 
No of Equivalent Loads 1 4 8 11 
From Any input to Q tpu1 0.13 0.22 0.33 0.41 
lp1 IL 0.17 0.27 0.37 0.45 
tp1 11 : Input to output propagation delay for a rising edge on the output 





Also each of the equivalent load in this table needs to be multiplied by 1.4 7, 




AA2 l is a two to-one multiplexer. 
V cri log ................ .. ... ......... .... MX21 inst_name(Q,IO,Il ,S); 
Where, Q is 1hc output and, JO and J1 are the two inputs,and S is the control 
· · I J d t the circuit drivina it and pin S offers 
signal. And pins IO and I J offers I equ1va ent oa O b 
equivalent load of 1.5.Each equivalent load is equal to 27.ltF. 
Tn1th table 
s JO I I Q 
L L X L 
L H X H 
H X L L 
H X H H 
88 
Prnpagation Delays (ns) 
No of Equivalent Loads 1 4 8 11 
From Any IX input tp u-1 0.19 0.29 0.40 0.48 
to Q tp111. 0.23 0.33 0.43 0.50 
From S to Q tpu 1 0.27 0.36 0.47 0.55 
lp11L 0.32 0.42 0.53 0.61 
tp 1 11 : Input to output propagation delay for a rising edge on the output 






Also each of the equivalent load in this table needs to be multiplied b y 1.4 7 , 




INV 1 is an inverter that performs logical N OT function. 
Verilog ....... ..... ........ ....... .. ..... INVI inst_name(Q,A); 
Where, Q is the output and, A is the input to the inverter. Pin A offers 1 
equivalent load to the circuit driving it. Each equivalent load is equal to 27.7fF. 
89 
Propagation Delays (ns) 
No of Equivalent Loads 1 4 8 11 
From A to Q tpu-1 0.07 0.16 0.26 0.34 
tp1-1t 0.08 0.17 0.27 0.35 
lp1 11 : Input to output propagation delay for a rising edge on the output 




Also each of the equivalent load in this table needs to be multiplied by 1.47, 




Layout in AMI 0.6 Micron Process 
128 bit LA TPE 
7th Stage of bit LATPE 
91 
3rd Stage of 128 bit LATPE 
92 




128 bit IM PE in 40 pin pad frame 




KIRAN RAJ JOSHI 
Candidate for the Degree of 
Master of Science 
Thes is : DESIGN OF 128 BIT ROUND ROBIN 
PRIORITY ENCODER 
Major Field: Electrical Engineering 
Biographical: 
Personal Dara: Born in Kathmandu, Nepal; 
son of Krishna Gopal Joshi and Bl!agaba ri Joshi 
Education: Graduated from Viswa Nikctan Science Campus, 
Kathmamlu, Nepal in 1996; received a Bachelor De~ree 
of Engineering in Electronics Engineering (ro\\\ lnstitute 
of Engineering . Kathmandu, Nepal in October 2000. 
Completed Requirements for the Master of Science 
degree with a major in Electrical Engineering at 
Oklahoma State University in July 2004. 
Experience: Research Assistant, Electrical and Computer 
Engineering. Oklahoma State University, 2002-2004. 
Professional Associations: Phi Kappa Phi Honor Society 
