A High Performance Hybrid Wave-Pipelined Multiplier by Suryanarayana B. Tatapudi & José G. Delgado-Frias
A High Performance Hybrid Wave-Pipelined Multiplier 
Suryanarayana B. Tatapudi and José G. Delgado-Frias  
School of EECS, Washington State University  
Email: {statapud, jdelgado}@eecs.wsu.edu
Abstract 
   The clock period in a conventional pipeline scheme is 
proportional to the maximum delay, while in hybrid 
wave-pipelining it is proportional to the maximum 
delay difference. An 8×8-bit hybrid wave-pipelined 
multiplier using the carry-save adder technique is 
described. The multiplier has been designed in TSMC 
180nm technology. The basic cells of the multiplier are 
designed to have small propagation delay and delay 
variation. The hybrid wave-pipelined multiplier is able 
to achieve 2.86 billion multiplications per second.
1. Introduction 
    Pipelining has emerged as the design technique of 
choice for achieving high-throughput digital 
systems. This technique breaks down a single complex 
computational block into discrete blocks separated by 
clocked storage elements (CSEs) such as flip-flops and 
latches. In recent years the desire to scale clock 
frequencies and achieve higher performance has led to 
the implementation of super-pipelined systems, resulting 
in additional overhead from CSEs. In a super-pipelined 
system, the latency of the logic blocks may be 
comparable to the latency of the CSEs that envelop 
them, adversely affecting the performance of the 
system. In this paper we propose a novel multiplier 
architecture using hybrid wave-pipelining (hwpp), which 
achieves significant performance gains compared to the 
regular pipeline and wave-pipeline schemes.  
    The proposed hybrid wave-pipeline scheme modifies 
the wave-pipeline scheme [1, 2] to achieve improved 
power and performance. In this scheme the clock 
is also wave-pipelined, as shown in Figure 1. The clock 
frequency is determined by the stage with the 
maximum delay difference. Although this scheme 
resembles the regular pipeline scheme, it allows 
multiple data waves to exist in any stage, as in 
wave-pipelining. Higher clock frequencies are possible 
and the influence of clock uncertainties is mitigated. As 
can be seen, this scheme eliminates the need for 
complex clock distribution. Clock gating can be easily 
implemented to save power without affecting the 
pipeline's performance. 
[Figure 1 diagram: Input data, Registers, Logic blocks with delays ∆1, ∆2, ..., ∆N (Stage 1, Stage 2, ..., Stage N), Output data; Clock path with delay elements ∆R between registers.]
Figure 1. Hybrid wave-pipelining scheme 
    The clock period Tclk of this system is determined by 
the stage with the largest delay difference and the safe 
time required before a new data wave is admitted into 
this stage [3]. Fundamental circuit limitations 
determine the safe time needed to separate any two 
adjacent data waves. The bound on Tclk can be written as 
$$T_{clk} \geq \max(d_j) - \min(d_j) + T_S + T_H + 2\Delta_{clk} \qquad (1)$$
    where j denotes the stage with the largest 
difference between the maximum (dmax) and minimum 
(dmin) propagation delays, TS and TH are the register 
setup and hold times, and ∆clk is the unconstructive clock 
skew or clock uncertainty.  
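Equation (1) can be evaluated directly. The sketch below is only an illustration: the function names and the numeric values in the example are hypothetical and are not taken from the design described in this paper.

```python
def min_clock_period_ps(d_max, d_min, t_setup, t_hold, clk_uncertainty):
    """Lower bound on Tclk from Equation (1); all values in picoseconds."""
    if d_max < d_min:
        raise ValueError("d_max must not be smaller than d_min")
    return (d_max - d_min) + t_setup + t_hold + 2 * clk_uncertainty


def max_delay_difference_ps(t_clk, t_setup, t_hold, clk_uncertainty):
    """Equation (1) rearranged: largest d_max - d_min a stage may have."""
    return t_clk - t_setup - t_hold - 2 * clk_uncertainty


if __name__ == "__main__":
    # Hypothetical stage: illustrative numbers only, not from the paper.
    t_clk = min_clock_period_ps(d_max=500, d_min=400, t_setup=50,
                                t_hold=100, clk_uncertainty=10)
    print(t_clk)  # 270
    print(max_delay_difference_ps(t_clk, 50, 100, 10))  # 100
```

Note that only the difference between the two delays enters the bound, so a stage with a long but well-balanced logic path does not lengthen the clock period.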
2. 8 × 8 pipelined multiplier 
        An 8-bit multiplier was chosen for 
implementation in the hybrid wave-pipeline architecture 
as a proof of concept. The well-known Carry-Save Adder 
(CSA) technique has been used to implement the 
multiplier. Figure 2 shows the schematic of the 8×8-bit 
multiplier implemented in the hwpp architecture. The full 
adder used in the multiplier is based on 
transmission gates so that the Sum and Carry signals 
are generated simultaneously. A differential 
implementation (complementary inputs are used and 
complementary outputs are generated) was also chosen to 
speed up the full adder and avoid glitches.
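As a reminder of how the carry-save technique works, the following is a minimal behavioral sketch. It is not the transmission-gate circuit or the exact array arrangement of Figure 2; the function name and the word-level formulation are assumptions made for illustration. Each row compresses the running sum vector, the running carry vector and one partial product into a new sum and carry pair, so no carry ripples until the final addition.

```python
def csa_multiply(x, y, width=8):
    """Behavioral carry-save multiplication of two unsigned width-bit values."""
    if not (0 <= x < (1 << width) and 0 <= y < (1 << width)):
        raise ValueError("operands must fit in the given width")
    mask = (1 << (2 * width)) - 1
    sum_vec, carry_vec = 0, 0
    for i in range(width):
        # Partial product for bit i of y (a shifted copy of x, or zero).
        pp = (x << i) if (y >> i) & 1 else 0
        # Bitwise full adders: every bit position is compressed independently.
        new_sum = sum_vec ^ carry_vec ^ pp
        new_carry = ((sum_vec & carry_vec) | (sum_vec & pp)
                     | (carry_vec & pp)) << 1
        sum_vec, carry_vec = new_sum & mask, new_carry & mask
    # Final carry-propagate addition merges the two vectors.
    return (sum_vec + carry_vec) & mask


if __name__ == "__main__":
    print(csa_multiply(255, 255))  # 65025
    print(csa_multiply(13, 11))    # 143
```

Because the sum and carry outputs of each row are consumed together by the next row, producing them simultaneously, as the differential full adder does, keeps the delay spread between the two paths small.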
   An improved version of the Sense Amplifier based 
Flip-Flop (SAFF) with complementary push-pull [4] is 
the flip-flop used in the multiplier. It uses a 
single-phase clock and presents a small load to the clock 
distribution network. The first stage of the flip-flop is 
essentially a sense amplifier, which assures the accurate 
timing necessary in high-speed applications [4]. The 
8×8 multiplier was implemented in TSMC 180nm 
(drawn length 200nm) with a 1.8V supply voltage. 
[Figure 2 diagram: array of full-adder cells (+) and flip-flops (F); inputs X<7:0> and Y<7:0>; outputs M<15> to M<0>; Clock path with delay elements ∆1, ∆2, ∆3, ∆4.]
Figure 2. Hybrid wave-pipelined 8×8 multiplier 
    Extensive simulations were performed on the full 
adder to precisely characterize the performance of this 
cell. An iterative process was used to optimize the 
transistor sizes to achieve minimum propagation delay 
and delay variation. The propagation delay of the full 
adder varied from 210ps (dmin) to 280ps (dmax), resulting 
in a delay variation of 70ps. The internal node 
constraints dictate the rate at which new inputs can be 
applied to the full adder; simulations showed that 
the shortest interval at which new inputs can be applied 
is 175ps.  
        The transistor sizes in the SAFF [4] were determined 
through an iterative process. Simulations performed on 
the flip-flop revealed that the clock high time must be 
at least 160ps, the hold time is 130ps and the Clk-Q delay 
is approximately 295ps.  
        These results reveal that the bottleneck in the 
pipeline is the flip-flop. The clock frequency is 
dependent on the timing limitations imposed by the 
flip-flop. The clock period has to be at least 320ps. 
To compensate for possible clock uncertainties (2∆clk), a 
clock period Tclk of 350ps (≈2.86GHz) was chosen. 
According to Equation 1, dmax(j) – dmin(j) can take a 
maximum value of 190ps. The placement of flip-flops 
shown in Figure 2 was based on this calculated limit 
on the delay difference. The logic enclosed between any 
two adjacent flip-flop stages is wave-pipelined and has 
a delay difference of less than 190ps. Each data wave 
passes through the logic blocks shown and, as the wave 
propagates, each data path adds a different delay. As a 
result, the delay variation of the data waves increases. 
Figure 3 illustrates the delay variation of each data 
wave after the first stage. Since the delay variation at 
this point is close to the calculated limit, a flip-flop is 
used to synchronize the data waves. The synchronized 
data waves, as stored by the second flip-flop stage, are 
shown in Figure 4.  
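The stage budget described above can be checked with a few lines of arithmetic. The sketch below uses the figures quoted in the text (350ps clock period, 190ps delay-difference limit, 210ps and 280ps full-adder delays). The helper name and, more importantly, the assumption that the worst-case delay variation grows linearly with the number of cascaded full adders are simplifications introduced here, not a statement of how the flip-flop placement was actually derived.

```python
T_CLK_PS = 350           # chosen clock period (from the text)
MAX_DELAY_DIFF_PS = 190  # limit on dmax(j) - dmin(j) (from the text)
FA_D_MIN_PS = 210        # fastest full-adder delay (from the text)
FA_D_MAX_PS = 280        # slowest full-adder delay (from the text)

# Everything in Equation 1 other than the delay difference, lumped together.
LUMPED_OVERHEAD_PS = T_CLK_PS - MAX_DELAY_DIFF_PS


def max_cascaded_cells(limit_ps, d_min_ps, d_max_ps):
    """Largest number of cascaded cells whose accumulated delay spread
    stays strictly below limit_ps (ASSUMPTION: spreads add linearly)."""
    spread = d_max_ps - d_min_ps
    if spread <= 0:
        raise ValueError("d_max_ps must exceed d_min_ps")
    levels = 0
    while (levels + 1) * spread < limit_ps:
        levels += 1
    return levels


if __name__ == "__main__":
    print(LUMPED_OVERHEAD_PS)  # 160
    print(max_cascaded_cells(MAX_DELAY_DIFF_PS, FA_D_MIN_PS, FA_D_MAX_PS))  # 2
```

Under this simplified model, two full-adder levels accumulate 140ps of variation and a third would exceed the 190ps limit, which is the kind of reasoning that motivates inserting a flip-flop stage once the variation approaches the limit.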
Figure 3. Outputs of the first stage
Figure 4. Outputs of the second flip-flop stage 
    Simulations performed on the entire system showed 
that it successfully completes an 8×8-bit 
multiplication every clock period, i.e., every 350ps. 
Using the same technology, a 3-inverter ring oscillator 
has been simulated; this circuit yields an oscillation 
period of approximately 260ps. Comparing the multiplier 
and ring oscillator periods, it is remarkable that the 
multiplier's clock period is just 35% longer than this 
reference period, which we take as the shortest 
possible period in this technology. 
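The throughput and the comparison with the ring oscillator follow directly from the two periods quoted above. The short sketch below reproduces that arithmetic; the function names are chosen here for illustration.

```python
MULTIPLIER_PERIOD_PS = 350  # multiplier clock period (from the text)
RING_OSC_PERIOD_PS = 260    # 3-inverter ring oscillator period (from the text)


def rate_per_second(period_ps):
    """Operations per second when one result is produced every period_ps."""
    return 1e12 / period_ps


def percent_longer(period_ps, reference_ps):
    """How much longer period_ps is than reference_ps, in percent."""
    return 100.0 * (period_ps - reference_ps) / reference_ps


if __name__ == "__main__":
    # About 2.86 billion multiplications per second.
    print(round(rate_per_second(MULTIPLIER_PERIOD_PS) / 1e9, 2))
    # About 35 percent longer than the ring oscillator period.
    print(round(percent_longer(MULTIPLIER_PERIOD_PS, RING_OSC_PERIOD_PS)))
```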
3. Conclusion 
    An 8×8-bit hybrid wave-pipelined multiplier using 
the carry-save adder technique has been designed and 
simulated. The multiplier was implemented in TSMC 
180nm (drawn length 200nm). Since in hybrid 
wave-pipelining the clock period is proportional to the 
delay difference, short clock periods can be obtained by 
minimizing the delay difference. The basic cells of the 
multiplier have been designed to have small 
propagation delay and delay variation. 
        The pipelined multiplier is able to achieve 2.86 
billion multiplications per second. The number of 
flip-flops needed in this implementation is significantly 
smaller than in a conventional pipeline. The delay 
balancing necessary to reduce the delay variation is 
simpler in the hybrid wave-pipeline architecture than in 
the wave-pipeline architecture. 
4. References 
[1] C. T. Gray, W. Liu, and R. K. Cavin, “Timing Constraints 
for Wave-pipelined Systems,” IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and 
Systems, vol. 13, no. 8, Aug. 1994, pp. 987-1004. 
[2] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, 
“Wave-Pipelining: A Tutorial and Research Survey,” 
IEEE Transactions on Very Large Scale Integration 
(VLSI) Systems, vol. 6, no. 3, Sep. 1998, pp. 464-474. 
[3] J. Nyathi and J. G. Delgado-Frias, “A Hybrid Wave-
Pipelined Network Router,” IEEE Transactions on 
Circuits and Systems - I, vol. 49, no. 12, Dec. 2002, pp. 
1764-1772. 
[4] V. Stojanovic and V. G. Oklobdzija, “Flip-Flop,” US 
Patent No. 6,232,810, May 15, 2001.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI 
New Frontiers in VLSI Design 
0-7695-2365-X/05 $20.00 © 2005 IEEE