A new robust handshake for asymmetric asynchronous micro-pipelines by Cheng, Kuo-hsing
A New Robust Handshake for Asymmetric Asynchronous 
Micro-Pipelines 
Kuo-Hsing Cheng, Yang-Han Lee and Wei-Chun Chang 
Department of Electrical Engineering 
Tam-Kang University, Tam-Shei, Taiwan, R.O.C. 
E-mail: chenn@ee.tku.edu.tw
Abstract 
In this paper, a new handshake methodology that enhances the
performance of asynchronous micro-pipeline systems is
proposed. The proposed methodology offers more flexibility in
designing asymmetric asynchronous micro-pipeline systems. Its
advantages include latch-free operation, robustness, high
throughput, very short pre-charge time, fewer transistors, and
more flexibility in asymmetric data paths. A technique that
combines single-rail dynamic circuits with dual-rail dynamic
circuits is also proposed for the data path. Dual-rail dynamic
circuits are used in the critical-delay data paths to improve the
operating speed, and single-rail dynamic circuits are used
elsewhere. This reduces power consumption and die area while
maintaining the calculation speed. An asynchronous
micro-pipeline array multiplier was designed and implemented
with the new robust handshake methodology. Simulation results
based on the TSMC 0.35 µm CMOS technology show that the
proposed handshake methodology has the shortest latency and is
more robust than the other handshake methodologies.
1. Introduction 
The micro-pipeline structure is useful in very high speed
digital circuits. However, as clock speeds increase over large
silicon substrates, designers must consider two important
problems. First, the clock network brings high power
consumption. Second, the clock skew problem in synchronous
micro-pipeline systems becomes serious [1].
In view of these two problems in synchronous micro-pipeline
systems, asynchronous micro-pipeline systems have been
proposed [2]-[3]. Handshake control signals replace the global
clock, which turns the synchronous micro-pipeline system into
an asynchronous one. Asynchronous micro-pipeline systems relax
both problems. Moreover, the asynchronous design style brings
advantages such as low power consumption, high modularity,
freedom from races, high performance [4], and the absence of
large voltage spikes on the power supply. To implement a
reliable asynchronous micro-pipeline system, many handshake
methodologies have been proposed, as follows. Williams [2]-[3]
introduced asynchronous micro-pipeline systems without latches,
where the handshake methodology incurs no latency overhead,
making the total micro-pipeline latency equal to the computation
circuit delays alone. The handshake methodology of Matsubara
and Ide [5] targets a higher throughput rate by ensuring that
pre-charge time does not affect calculation time. Singh and
Nowick [6] analyzed and proposed five new handshake
methodologies to increase the throughput rate of asynchronous
micro-pipeline systems. These fall into three types of handshake
methodologies: (1) early evaluation, (2) early done, and (3) a
combination of both. Chan [1] proposed a new fast and robust
handshake methodology that improves on the timing constraints
of the Matsubara, Ide and Singh, Nowick LP3/1 handshake
methodology.
In this paper, a new handshake methodology is proposed.
Compared with previous work, its advantages include
robustness, high throughput, shorter pre-charge time, fewer
transistors, and more flexibility in non-ring and
non-symmetric data paths.
2. Basic and Previous Asynchronous Circuit 
Design 
Differential Cascode Voltage Switch Logic (DCVSL) is used as
the basic logic gate of the asynchronous micro-pipeline
systems. A completion detector determines whether the logic
value on the dual-rail data paths is valid or invalid. Some
handshake methodologies use the completion signal to control
the handshake circuits. A combination of single-rail static
circuits and dual-rail dynamic circuits is used to reduce power
consumption and die area while maintaining the calculation
speed.
A. DCVSL With Completion Detector 
A DCVSL circuit with completion detector and dynamic sense
amplifier, together with its timing diagram [7]-[9], is shown
in Fig. 1. The DCVSL circuit is a logic gate with a dual-rail
output. Compared with a single-ended output circuit, it offers
high logic flexibility, low wiring complexity, high packing
density, and high speed. In the data path circuits of this
paper, it is implemented with a complementary NMOS logic tree;
it can also be implemented with a complementary PMOS logic
tree. Adding a dynamic sense amplifier improves the speed of
the DCVSL circuit and avoids the charge-sharing problem. As
shown in Fig. 1, when the control signal P/E
(Pre-charge/Evaluation) is low, the logic circuit operates in
the pre-charge phase. Nodes QP and QN are pre-charged high,
outputs FP and FN are discharged low, and the dynamic sense
amplifier is disabled. When the control signal P/E is high, the
logic circuit operates in the evaluation phase. The dual-rail
input signals in the complementary NMOS logic tree are
evaluated. Meanwhile, the dynamic sense amplifier is enabled.
Then a path exists from the node QP or QN to ground through
V-209 0-7803-7761-3/03/$17.00 ©2003 IEEE
Authorized licensed use limited to: Tamkang University. Downloaded on March 23,2010 at 21:44:05 EDT from IEEE Xplore.  Restrictions apply. 
one side of the complementary NMOS logic tree. This creates a
small voltage difference between nodes QP and QN, which
causes the sense amplifier to trip. Finally, whichever of QP
and QN has the lower voltage is discharged rapidly to ground
while the other node voltage remains at VDD.
Figure 1. DCVSL Circuit with Completion Detector, Dynamic Sense
Amplifier, and Timing Diagram.
Figure 2. Completion Detector Circuit and Truth Table.
Figure 2 shows the completion detector, implemented by a
NAND or XOR gate, and its truth table. When the control
signal P/E is low, the completion signal C goes low once the
DCVSL circuit has finished the pre-charge phase. When the
control signal P/E is high, the completion signal C goes high
once the DCVSL circuit has finished the evaluation phase.
DCVSL circuits have two interesting behaviors. First, during
the pre-charge phase, the input data has no effect on the
output value. Therefore, no matter what data appears at the
input, the outputs (FP and FN) are pre-charged low. Second,
during the evaluation phase, the DCVSL gate begins
evaluation as soon as the input data are valid (01 or 10).
While the input data are not valid (00), the DCVSL gate
remains in the pre-charge state. These two behaviors lead to
two advantages of a dynamic asynchronous micro-pipeline
over its static counterpart, namely latch-free operation and
a simpler handshake methodology [1].
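The two behaviors above can be illustrated with a small behavioral sketch in Python. It is not the circuit of Fig. 1: the class name `DcvslBuffer` is hypothetical, the gate is assumed to compute the identity function, the outputs FP and FN are assumed to be the inverses of nodes QP and QN, and the completion signal is taken as the NAND of QP and QN (one of the two options named in the text).

```python
from dataclasses import dataclass


@dataclass
class DcvslBuffer:
    """Behavioral model of a dual-rail DCVSL buffer (illustrative only)."""

    qp: int = 1  # internal node QP, pre-charged high
    qn: int = 1  # internal node QN, pre-charged high

    def step(self, pe: int, in_p: int, in_n: int) -> None:
        if pe == 0:
            # Pre-charge phase: the input data has no effect.
            self.qp, self.qn = 1, 1
        elif self.qp == 1 and self.qn == 1:
            # Evaluation phase: only valid data (10 or 01) discharges a node.
            if (in_p, in_n) == (1, 0):
                self.qp = 0
            elif (in_p, in_n) == (0, 1):
                self.qn = 0
            # Invalid data (00) leaves the gate in the pre-charge state.

    @property
    def outputs(self) -> tuple:
        # Assumption: FP and FN are the inverted internal nodes.
        return 1 - self.qp, 1 - self.qn

    @property
    def completion(self) -> int:
        # Assumption: C = NAND(QP, QN); low after pre-charge, high after evaluation.
        return 1 - (self.qp & self.qn)


gate = DcvslBuffer()
gate.step(pe=0, in_p=1, in_n=0)   # pre-charge ignores the inputs
print(gate.outputs, gate.completion)
gate.step(pe=1, in_p=0, in_n=0)   # evaluation waits for valid data
print(gate.outputs, gate.completion)
gate.step(pe=1, in_p=1, in_n=0)   # valid data arrives, the gate evaluates
print(gate.outputs, gate.completion)
```

Because the gate holds the pre-charge state until valid data arrives and ignores its inputs while pre-charging, no explicit latch is needed between stages in this model.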
B. Previous Handshake Methodologies 
The handshake signal transition graph of Williams's handshake
methodology [2] is shown in Fig. 3(a). Stage N is pre-charged
when stage N+1 has received stage N's valid data. Stage N is
evaluated when stage N+1 finishes pre-charging. Thus the
handshake control circuits are connected in a ring without
being limited by any external signals. The other handshake
methodologies are connected in a ring as well. This handshake
has the advantage of very low forward latency; however, it is
throughput-limited.
Figure 3. Handshake Signal Transition Graphs. (a)
Williams. (b) Matsubara, Ide and Singh, Nowick's LP3/1.
(c) Chan. (d) The Proposed Handshake.
The handshake signal transition graph of the Matsubara, Ide [5]
and Singh, Nowick LP3/1 [6] handshake methodology is shown in
Fig. 3(b). Stage N is pre-charged when stage N+1 has received
stage N's valid data. Stage N is evaluated when stage N+2
has received stage N+1's valid data. This handshake
methodology has the advantage that pre-charge time does not
affect calculation time, which enhances the throughput rate.
However, it has a timing constraint, roughly
$T_{E_{N+2}} \geq T_{P_N}$, where $T_{E_{N+2}}$ is the
evaluation time of stage N+2 and $T_{P_N}$ is the pre-charge
time of stage N. It means that stage N must have enough time
to be pre-charged. The handshake signal transition graph of
Chan's [1] handshake methodology is shown in Fig. 3(c). Stage N
is pre-charged when stage N has valid data at its output and
stage N+1 has received stage N's valid data. Stage N is
evaluated when stage N+2 has received stage N+1's valid data
and stage N has finished pre-charging. This handshake
methodology keeps the advantage that pre-charge time does not
affect calculation time, and it relaxes the timing constraint
of the previous handshake methodology.
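The LP3/1 constraint above can be checked stage by stage. The sketch below is only an illustration: the function name `lp31_violations` is hypothetical, the pipeline is assumed to be linear (non-ring), and the delay values are invented for the example rather than taken from the paper.

```python
def lp31_violations(t_eval, t_pre):
    """Return the stage indices N for which T_E(N+2) >= T_P(N) fails.

    t_eval[n] and t_pre[n] are the evaluation and pre-charge times of
    stage n, in any common unit. Only stages that have a stage N+2 in a
    linear pipeline are checked (an assumption of this sketch).
    """
    if len(t_eval) != len(t_pre):
        raise ValueError("need one evaluation and one pre-charge time per stage")
    return [n for n in range(len(t_eval) - 2) if t_eval[n + 2] < t_pre[n]]


# Hypothetical delays in ns: stage 2 evaluates much faster than its neighbors,
# as could happen in an asymmetric data path.
t_eval = [1.0, 1.0, 0.3, 1.0]
t_pre = [0.5, 0.5, 0.5, 0.5]
print(lp31_violations(t_eval, t_pre))  # stage 0 does not get enough pre-charge time
```

With balanced stages the list is empty; a single fast stage is enough to break the constraint for the stage two positions upstream.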
C. Combination of Single-Rail Static Circuits and 
Dual-Rail Dynamic Circuits 
A micro-pipeline system has some critical data paths in each
stage. Matsubara and Ide implement the critical data paths in
DCVSL circuits and the non-critical data paths in static
single-rail logic. This reduces power consumption and die area
while maintaining the calculation speed.
3. Design New Handshake Methodology 
A. New Handshake Methodology 
Figure 3(d) shows the signal transition graph of the proposed
handshake. Chan's handshake circuit [1] is shown in Fig. 4(a),
and the proposed handshake circuit is shown in Fig. 4(b). They
are very similar, but the new handshake methodology is more
robust in the ring-type connection. Two signals control a
stage's pre-charge, and two control its evaluation. An example
explains how the new handshake is designed. C2P is the
completion signal from stage 2, and C2N is the complement of
C2P. P/E2 is the signal that controls whether stage 2
pre-charges or evaluates. The signals of the other stages are
defined in the same way.
Figure 4. Handshake Circuits. (a) Chan. (b) The
Proposed Handshake.
Figure 4(b) shows that C3P connects to MN2, C1N connects to
MN1, C4N connects to MP2, and C2P connects to MP1. In the
proposed handshake, C3P and C1N control the pre-charge of the
data paths, and C4N and C2P control their evaluation. First,
the proposed handshake keeps the advantage that pre-charge
time does not affect calculation time, which it takes from the
Matsubara, Ide [5] and Singh, Nowick LP3/1 [6] handshake
methodology. Thus, C3P connects to MN2, and C4N connects to
MP2. Second, the proposed handshake ensures that every stage
has enough pre-charge time, which it takes from Chan's [1]
handshake methodology. Thus, C2P connects to MP1. Finally,
Chan's handshake methodology still has a timing constraint,
roughly $T_{E_{N+1}} + \max(T_{P_N}, T_{E_{N+2}}) \geq T_{P_{N-1}}$.
Thus, the proposed handshake circuit replaces C2P with C1N as
the signal connected to MN1. This design makes the handshake
protocol more robust, because it relaxes Chan's timing
constraint.
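The gate connections listed above can be turned into a small behavioral sketch of the P/E2 control node. Several things here are assumptions, since Fig. 4 is not reproduced in the text: MN1 and MN2 are taken to be series NMOS devices that pull P/E2 low (pre-charge), MP1 and MP2 are taken to be series PMOS devices that pull P/E2 high (evaluation), and the node is taken to hold its previous value when neither stack conducts. The function names are hypothetical.

```python
def pe_node(pe_prev, mn1, mn2, mp1, mp2):
    """Four-transistor control node; arguments are the gate levels (0 or 1)."""
    pull_down = mn1 == 1 and mn2 == 1   # both NMOS on: pre-charge (P/E low)
    pull_up = mp1 == 0 and mp2 == 0     # both PMOS on: evaluate (P/E high)
    if pull_down and pull_up:
        # The text does not describe this case; the sketch only flags it.
        raise ValueError("both stacks conduct in this simplified model")
    if pull_down:
        return 0
    if pull_up:
        return 1
    return pe_prev                       # assumed: the node holds its state


def proposed_pe2(pe_prev, c1p, c2p, c3p, c4p):
    # Proposed handshake: C1N -> MN1, C3P -> MN2, C2P -> MP1, C4N -> MP2.
    return pe_node(pe_prev, mn1=1 - c1p, mn2=c3p, mp1=c2p, mp2=1 - c4p)


def chan_pe2(pe_prev, c2p, c3p, c4p):
    # Chan's handshake differs only at MN1, which is driven by C2P.
    return pe_node(pe_prev, mn1=c2p, mn2=c3p, mp1=c2p, mp2=1 - c4p)


# Stage 3 holds valid data: Chan's circuit pre-charges stage 2 immediately,
# while the proposed circuit also waits until C1N is high.
print(chan_pe2(1, c2p=1, c3p=1, c4p=0))
print(proposed_pe2(1, c1p=1, c2p=1, c3p=1, c4p=0))
print(proposed_pe2(1, c1p=0, c2p=1, c3p=1, c4p=0))
```

In this model the only difference between the two controllers is the extra condition on stage 1 before stage 2 is allowed to pre-charge, which mirrors the replacement of C2P by C1N described above.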
B. Combination of Single-Rail and Dual-Rail 
Dynamic Circuits 
To reduce power consumption and chip area, the combination
of single-rail and dual-rail dynamic circuits in a single chip
is proposed. The idea originated from Matsubara and Ide's
paper [5], which focused on implementing the data path units
so as to reduce power consumption and die area while
maintaining the calculation speed. In the proposed technique,
DCVSL is still used to implement the critical paths, but
Dynamic Single-Rail Logic (DSRL) circuits implement the
non-critical paths.
Figure 5. Dynamic Single-Rail Logic (DSRL) Circuit 
Figure 5 shows the basic DSRL circuit. When the control
signal P/E is low, the logic circuit operates in the pre-charge
phase. Output nodes DataOutP and DataOutN are pre-charged
high by PMOS transistors. When the control signal P/E is high,
the logic circuit operates in the evaluation phase. The
single-rail input signals in the NMOS logic tree are evaluated.
The DSRL circuit also has a feedback path from DataOutP to
DataOutN through a weak PMOS transistor. Thus the DSRL circuit
avoids the charge-sharing problem.
4. Asynchronous Micro-Pipeline Array 
Multiplier 
This section implements an asynchronous micro-pipeline array
multiplier with the proposed handshake methodology.
Multipliers are useful in digital signal processing (DSP) and
other applications, and an array multiplier can be easily
implemented as a micro-pipeline system. Fig. 6 shows the
architecture of the asynchronous micro-pipeline array multiplier.
The multiplier is built from latches, full adders, and half
adders implemented in DCVSL and DSRL circuits. Fig. 7 shows the
structure of the asynchronous micro-pipeline array multiplier
using the new handshake circuit. The combination of
single-rail and dual-rail dynamic circuits is applied in the
multiplier, which reduces power consumption and die area while
maintaining the calculation speed. The design was simulated
with Hspice.
When the input data are A3-A0 = 0010 and B3-B0 = 0001 in a
4x4-bit multiplier, output P1 is high and the other outputs are
low. Fig. 8 shows the simulation results of P1 for all
handshake methodologies. The Williams and the proposed
handshake methodologies evaluate correctly in an asymmetric
asynchronous micro-pipeline system. The Matsubara, Ide and
Singh, Nowick LP3/1 methodology does not function at all
because of the asymmetric loading, and the output of Chan's
handshake methodology is not correct. Therefore, these two
methodologies are not reliable in an asymmetric loading
application. The simulation results are also given in Table 1.
The measurements of interest are transistor count, cycle time,
and the ratio of pre-charge time to cycle time. The proposed
handshake methodology has the shortest pre-charge time with a
robust output function. Therefore, it is robust, has high
throughput, a shorter pre-charge time, and fewer transistors,
and offers more flexibility in non-ring and non-symmetric data
paths.
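The expected output pattern for this test vector can be reproduced with a bit-level sketch of a 4x4 array multiplier. This is a functional model only, not the pipelined DCVSL/DSRL circuit: the function names are hypothetical, and each row is summed with full adders alone, whereas the multiplier in the paper also uses half adders and latches.

```python
def full_adder(x, y, cin):
    """One-bit full adder returning (sum, carry_out)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout


def array_multiply(a, b, width=4):
    """Return the product bits P0..P(2*width-1), least significant first."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    acc = [0] * (2 * width)
    for j, bj in enumerate(b_bits):          # one adder row per bit of B
        carry = 0
        for i, ai in enumerate(a_bits):      # partial product bit is ai AND bj
            acc[i + j], carry = full_adder(acc[i + j], ai & bj, carry)
        acc[j + width] = carry               # carry out of the row
    return acc


p = array_multiply(0b0010, 0b0001)
print(p)  # only P1 is set for A = 0010, B = 0001
```

For A = 0010 and B = 0001 only P1 is high, matching the pattern that Fig. 8 checks for each handshake methodology.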
Figure 6. Architecture of Micro-Pipeline Multiplier.
Figure 7. Structure of the Micro-Pipeline with the New
Handshake Methodology.
Figure 8. Simulation results of P1. (a) Williams. (b)
Matsubara, Ide and Singh, Nowick's LP3/1. (c) Chan. (d)
The Proposed Handshake.
Table 1. Simulation Results.
Parameter                    Williams [2]   Matsubara [5] &        Chan [1]   The Proposed
                                            Nowick's LP3/1 [6]                Handshake
Transistors                  4              4                      4          4
Pre-charge Time/Cycle Time   46%            No Value               30%        14%
Cycle Time: 10.6 ns, 5.98 ns, *, * (values as extracted; their column positions are not recoverable)
5. Conclusions 
This paper presents the design of an asynchronous micro-pipeline
array multiplier with a new handshake methodology. The new
handshake methodology uses a simple, robust circuit with only
four transistors. It keeps the advantages of previous
handshake methodologies: pre-charge time does not affect
calculation time, and every stage has enough pre-charge time.
It also has a shorter pre-charge time; the measured ratio of
pre-charge time to cycle time is only 14%. A technique that
combines single-rail dynamic circuits with dual-rail dynamic
circuits is used in the data path. It reduces power
consumption and die area while maintaining the calculation
speed.
References 
[1] J.-L. Yang, C.-S. Choy, and C.-F. Chan, “A self-timed
divider using a new fast and robust pipeline scheme,” IEEE
Journal of Solid State Circuits, Volume: 36, Issue: 6,
Page(s): 917-923, June 2001
[2] T. E. Williams, and M. A. Horowitz, “A zero-overhead self-
timed 160-ns 54-b CMOS divider,” IEEE Journal of Solid
State Circuits, Volume: 26, Issue: 11, Page(s): 1651-1661,
Nov. 1991
[3] T. E. Williams, “Analyzing and improving the latency and
throughput performance of self-timed pipelines and rings,”
IEEE International Symposium on Circuits and Systems
1992, Volume: 2, Page(s): 665-668
[4] C.-S. Choy, J. Butas, J. Povazanic, and C.-F. Chan, “A new
control circuit for asynchronous micropipelines,” IEEE
Transactions on Computers, Volume: 50, Issue: 9, Page(s):
992-997, Sept. 2001
[5] G. Matsubara, and N. Ide, “A low power zero-overhead
self-timed division and square root unit combining a single-
rail static circuit with a dual-rail dynamic circuit,” Proc. of
3rd International Symposium on Advanced Research in
Asynchronous Circuits and Systems 1997, Page(s): 198-209
[6] M. Singh, and S. M. Nowick, “High-throughput
asynchronous pipelines for fine-grain dynamic datapaths,”
Proc. 6th International Symposium on Advanced Research
in Asynchronous Circuits and Systems 2000, Page(s): 198-209
[7] T. A. Grotjohn, and B. Hoefflinger, “Sample-set differential
logic (SSDL) for complex high-speed VLSI,” IEEE Journal
of Solid State Circuits, Volume: 21, Issue: 2, Page(s): 361-
369, Apr. 1986
[8] C.-Y. Wu, and K.-H. Cheng, “Latched CMOS differential
logic (LCDL) for complex high-speed VLSI,” IEEE Journal
of Solid State Circuits, Volume: 26, Issue: 9, Page(s):
1324-1328, Sept. 1991
[9] H.-Y. Huang, K.-H. Cheng, J.-S. Wang, Y.-H. Chu, T.-S.
Wu, and C.-Y. Wu, “Low-voltage low-power CMOS true-
single-phase clocking scheme with locally asynchronous
logic circuits,” IEEE International Symposium on Circuits
and Systems 1995, Volume: 3, Page(s): 1572-1575
