Superscalar pipelined inner product computation unit for signed unsigned number  by Rajput, Ravindra P. & Swamy, M.N. Shanmukha
Perspectives in Science (2016) 8, 606–610
Available  online  at  www.sciencedirect.com
ScienceDirect
journal homepage: www.elsevier.com/pisc
Superscalar pipelined inner product computation unit for signed unsigned number
Ravindra P. Rajput ∗, M.N. Shanmukha Swamy
JSS Research Foundation, SJCE Campus, Mysore, Karnataka, India
Received 19 February 2016; received in revised form 12 June 2016; accepted 13 June 2016
Available online 9 July 2016
KEYWORDS
Pipeline;
Superscalar;
MMBE;
VCA;
CLCSA
Summary  In this paper, we propose a superscalar pipelined inner product computation unit for signed-unsigned numbers operating at 16 GHz. It is designed as a five-stage pipeline with four 8 × 8 multipliers operating in parallel. The superscalar pipeline computes four 8 × 8 products in parallel in three clock cycles. In the fourth clock cycle, two inner products are computed using two adders in parallel. The fifth stage computes the final product by adding the two inner partial products. Once the pipeline is filled, a new 16 × 16-bit signed-unsigned product is obtained every clock cycle. The worst delay measured among the pipeline stages is 0.062 ns, and this delay is taken as the clock cycle period. With a clock period of 0.062 ns, the pipeline can be operated with a 16 GHz synchronous clock signal. Each superscalar pipeline stage is implemented using 45 nm CMOS process technology, and the comparison of results shows that the delay is decreased by 38%, the area is reduced by 45% and the power dissipation is reduced by 32%.
© 2016 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Introduction
Dedicated vector processors, supercomputers, and modern digital signal processors contain multiple pipeline units. In these units, multiplication is the most time-critical operation. Hence, we propose a five-stage superscalar pipelined inner product computation unit for high performance, small chip area, and lower power consumption. In the superscalar pipeline technique, multiple instructions and arithmetic operations are executed in parallel and overlapped. Combining multiple units, parallel operation and pipelining can enhance the performance of the system. Our proposed superscalar pipeline uses four 8-bit multipliers, operating in parallel and pipelined, to compute the product of 16 × 16-bit signed-unsigned numbers.

This article belongs to the special issue on Engineering and Materials Sciences.
∗ Corresponding author. Tel.: +91 9886083949.
E-mail address: rprajput2006@gmail.com (R.P. Rajput).
http://dx.doi.org/10.1016/j.pisc.2016.06.034
2213-0209/© 2016 Published by Elsevier GmbH. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Figure 1  Architecture of inner product computation. Four multipliers (A8LSB × B8LSB, A8MSB × B8LSB, A8LSB × B8MSB, A8MSB × B8MSB) feed an adder that produces P = A × B.
Figure 2  8 × 8 multiplier for signed and unsigned numbers. Operands a7–a0 and b7–b0 produce five partial product rows (p08–p00, p18–p10, p28–p20, p38–p30, and C47, p46–p40) with bits n0–n3, which are summed to give p15–p0.
$s_u = 0,\ a_n = a_{n+1} = 0$ and $b_n = b_{n+1} = 0$
To address the need for very high-speed modern processors, Hong et al. (2006), Alshawi et al. (2015), Olivieri (2001), Lin (2001), Hoyer et al. (2001) and Jou et al. (1997) illustrated the design of multipliers. Hong et al. (2006) and Alshawi et al. (2015) illustrated the design of very high-speed matrix multipliers, but without using the pipeline technique. Olivieri (2001) pipelined the Carry Save Adder (CSA) of the partial product reduction tree (PPRT) but not all the stages; therefore, the selection of a high-speed clock signal is difficult. Lin (2001) illustrated an array of 8 × 8 or 4 × 4 multipliers for inner product computation with high speed, small area, and lower power dissipation, but did not address the number of pipeline stages. Hoyer et al. (2001) illustrated micro-pipelines using an asynchronous local clock signal, but the pipelined operation is not regular. Hong et al. (2006), Alshawi et al. (2015), Olivieri (2001), Lin (2001), Hoyer et al. (2001) and Jou et al. (1997) all used the Modified Booth Encoder (MBE) technique as the partial product generator (PPG), and Shin et al. (2010), Yeh and Jen (2000), Kuang et al. (2009) and Wang et al. (2008) presented MBE designs with different numbers of transistors. The second phase in the multiplier converts n rows of partial products into two rows; this PPRT is illustrated in Wang et al. (2008), Goto et al. (1997), Chang et al. (2004), Radhakrishnan and Preethy (2000) and Prasad and Parhi (2001). Finally, high-speed carry propagate adders (CPA) such as the Carry Look-Ahead (CLA) adder are illustrated in Oklobdzija et al. (1996), Kim and Ambler (2000), Zlatanovici et al. (2009), Nagendra et al. (1996), Wang et al. (2002), Lee et al. (2001), Kim et al. (2002) and Nève et al. (2004).
Design of inner product superscalar pipeline multiplier
Fig. 1 shows the architecture of the inner product computation unit, which consists of four multipliers operating in parallel for the 16 × 16-bit multiplication. The 16-bit operands A and B are each decomposed into w = 8-bit MSBs and 8-bit LSBs, and the four 8-bit multipliers compute the product in parallel as given by Eq. (1). The five stages of the superscalar pipelined multiplier consist of the generation of partial products, the compression of an array of five rows into an array of two rows, and three high-speed adder stages that produce the product. The five stages of the superscalar pipeline multiplier are described in the following sections.
$A \times B = A_{8LSB} \times B_{8LSB} + (A_{8MSB} \times B_{8LSB})\,2^{w} + (A_{8LSB} \times B_{8MSB})\,2^{w} + (A_{8MSB} \times B_{8MSB})\,2^{2w}$  (1)
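The decomposition in Eq. (1) can be checked numerically. The following is a minimal Python sketch, not the authors' hardware description: the function names `split16` and `inner_product_16x16` are hypothetical, and the sketch assumes that in signed mode the MSB half carries the sign while the LSB half is always treated as unsigned.

```python
def split16(value, signed):
    """Split a 16-bit operand into (msb_half, lsb_half) with w = 8.

    Assumption of this sketch: the low half is always unsigned, and the
    high half is interpreted as two's complement when `signed` is true.
    """
    pattern = value & 0xFFFF          # keep the 16-bit pattern only
    lsb = pattern & 0xFF
    msb = pattern >> 8
    if signed and msb >= 0x80:        # high half carries the sign
        msb -= 0x100
    return msb, lsb


def inner_product_16x16(a, b, signed, w=8):
    """Evaluate Eq. (1): four 8 x 8 products combined with shifts by w and 2w."""
    a_msb, a_lsb = split16(a, signed)
    b_msb, b_lsb = split16(b, signed)
    return (a_lsb * b_lsb
            + ((a_msb * b_lsb) << w)
            + ((a_lsb * b_msb) << w)
            + ((a_msb * b_msb) << (2 * w)))


if __name__ == "__main__":
    print(inner_product_16x16(-12345, 321, signed=True) == -12345 * 321)
    print(inner_product_16x16(65535, 65535, signed=False) == 65535 * 65535)
```

The four products are independent of each other, which is what allows four 8 × 8 multipliers to work on them in parallel.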
Superscalar pipeline stage 1: four partial product generators in parallel
Superscalar pipeline stage 1 comprises four PPGs, as illustrated in Fig. 2. The PPGs accept operands a7–a0, a15–a8, b7–b0, and b15–b8 simultaneously during the first stage of the superscalar pipeline operation. The PPG, built from a multiplexer-based MBE (MMBE), is implemented using 16 transistors as shown in Fig. 3. The MMBE takes three bits of the multiplier operand at a time and produces five partial products. Since the superscalar multiplier must function for both signed (s_u = 1) and unsigned (s_u = 0) numbers, the required sign-extension logic is given by the following expressions.
$s_u = 1,\ a_{n-1} = 1,\ b_{n-1} = 0,\ a_n = a_{n+1} = 1$ and $b_n = b_{n+1} = 0$
$s_u = 1,\ a_{n-1} = 0,\ b_{n-1} = 1,\ a_n = a_{n+1} = 0$ and $b_n = b_{n+1} = 1$
ij = s u¯an−1an−2
Figure 3  Circuit diagram of MMBE.
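To illustrate why two extension bits lead to five partial products for an 8-bit operand, the following Python sketch models radix-4 modified Booth recoding at the behavioural level. It is an illustration under stated assumptions rather than the MMBE circuit of Fig. 3: the helper names are hypothetical, and the extension rule assumed here is that bits n and n+1 copy the sign bit in signed mode and are zero in unsigned mode.

```python
def extend_operand(pattern, signed, n=8):
    """Append extension bits at positions n and n+1 to an n-bit pattern.

    Signed mode copies the sign bit; unsigned mode appends zeros.
    """
    pattern &= (1 << n) - 1
    ext_bit = (pattern >> (n - 1)) & 1 if signed else 0
    return pattern | (ext_bit << n) | (ext_bit << (n + 1))


def to_value(ext_pattern, n=8):
    """Read an (n+2)-bit pattern as a two's complement integer."""
    width = n + 2
    if (ext_pattern >> (width - 1)) & 1:
        return ext_pattern - (1 << width)
    return ext_pattern


def booth_digits(ext_pattern, n=8):
    """Recode the extended multiplier into (n+2)/2 digits from {-2,...,2}.

    Each digit looks at three bits: b[2i+1], b[2i], b[2i-1], with b[-1] = 0.
    """
    def bit(k):
        return 0 if k < 0 else (ext_pattern >> k) & 1
    return [-2 * bit(2 * i + 1) + bit(2 * i) + bit(2 * i - 1)
            for i in range((n + 2) // 2)]


def booth_multiply(a, b, signed, n=8):
    """Return (partial product rows, product) for two n-bit patterns."""
    a_val = to_value(extend_operand(a, signed, n), n)
    digits = booth_digits(extend_operand(b, signed, n), n)
    rows = [d * a_val * 4 ** i for i, d in enumerate(digits)]
    return rows, sum(rows)


if __name__ == "__main__":
    rows, product = booth_multiply(200, 255, signed=False)
    print(len(rows), product == 200 * 255)
```

With n = 8 the extended multiplier has ten bits, so the recoding yields five digits and therefore five partial product rows in both modes.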
Superscalar pipeline stage 2: PPRT as 5:2 compressors
During the second stage of the superscalar pipeline, the five partial products from stage 1 are accepted by four PPRTs, reduced to an array of two rows, and latched into registers using the synchronous clock signal. The Vertical Column Adder (VCA) functions as the PPRT and converts five rows into two rows. In Goto et al. (1997), Chang et al. (2004), Radhakrishnan and Preethy (2000) and Prasad and Parhi (2001) the PPRT consists of full adders only, whereas the VCA consists of full adders and the Sum Carry Generate and Propagate (SCGP) logic. The SCGP logic circuit produces the Sum, the Carry Generate term and the Carry Propagate term, which are essential for the CLA operation. The high-performance full adder is implemented using Eqs. (2) and (3).
$s_i = x_{i+1} \oplus x_{i+2} \oplus c_i$  (2)
$c_{i+1} = (x_{i+1} \oplus x_{i+2})\,c_i + \overline{(x_{i+1} \oplus x_{i+2})}\,x_{i+1}$  (3)
The circuit diagram of the full adder is shown in Fig. 4. It is implemented in CMOS logic using only ten transistors. The logic required for the SCGP is derived from Eq. (3) and is given by Eqs. (4) and (5), where cp_i is the carry propagate term and cg_i is the carry generate term. Fig. 5 shows the circuit diagram of the SCGP logic, which is the final cell of each VCA. It produces the sum, carry generate and carry propagate terms in one cell, which saves the extra hardware otherwise needed for the carry generate and carry propagate terms, and it is implemented in CMOS logic using only ten transistors.
$cp_i = x_{i+1} \oplus x_{i+2}$  (4)
$cg_i = \overline{(x_{i+1} \oplus x_{i+2})}\,x_{i+1}$  (5)
Figure  4  Circuit  diagram  of  full  adder.
Figure  5  Circuit  diagram  of  SCGP.
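The full adder and SCGP equations can be checked exhaustively against integer addition. The sketch below is a behavioural Python model, not the ten-transistor circuits of Figs. 4 and 5. It assumes that the second term of Eq. (3), and likewise Eq. (5), uses the complement of the XOR, which is the multiplexer form of the carry and the only reading under which the adder is arithmetically correct. The function names are hypothetical.

```python
from itertools import product


def full_adder(x1, x2, c):
    """Eqs. (2) and (3): the sum, and the carry selected by the propagate signal."""
    p = x1 ^ x2
    s = p ^ c
    c_out = (p & c) | ((1 - p) & x1)   # p ? c : x1
    return s, c_out


def scgp(x1, x2, c):
    """Eqs. (4) and (5) plus the sum: returns (sum, carry generate, carry propagate)."""
    cp = x1 ^ x2
    cg = (1 - cp) & x1                 # equals x1 AND x2
    s = cp ^ c
    return s, cg, cp


def verify():
    """Compare both cells with integer addition over all eight input combinations."""
    for x1, x2, c in product((0, 1), repeat=3):
        s, c_out = full_adder(x1, x2, c)
        if 2 * c_out + s != x1 + x2 + c:
            return False
        s2, cg, cp = scgp(x1, x2, c)
        if s2 != s or (cg | (cp & c)) != c_out:
            return False
    return True


if __name__ == "__main__":
    print(verify())
```

The check `cg | (cp & c)` recombines the generate and propagate terms into the carry, which is the form a CLA adder consumes.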
Superscalar pipeline stage 3: computation of 8 × 8 inner product in parallel
During the third stage of the superscalar pipeline, the final two rows from the PPRT are added using four CPAs to obtain the products of the four 8 × 8 multipliers in parallel. The CPA, which combines the Carry Look-ahead Adder and the Carry Select Adder (CLCSA), is shown in Fig. 6. In the CLCSA, all the 8-bit CLA adders produce their carries in parallel, with two such 8-bit CLAs in each stage using 0 and 1 as the initial carry input. Depending on the carry output generated by the previous 8-bit CLA stage, the output selected by the 2:1 multiplexer becomes the carry signal to the next stage. Thus the CLCSA generates the carries in parallel and delivers the final product with high performance.
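The carry-select idea described above can be sketched behaviourally in Python. This is an illustrative model under assumptions, not the circuit of Fig. 6: the names `cla_block` and `clcsa_add` are hypothetical, the lookahead logic is modelled by the generate/propagate recurrence rather than by flattened gates, and the lowest block is assumed to use a single CLA with the external carry input.

```python
def cla_block(x, y, cin, width=8):
    """One CLA block modelled via generate/propagate terms.

    Returns (sum, carry_out) for `width`-bit inputs.
    """
    carry, total = cin, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        g, p = xi & yi, xi ^ yi
        total |= (p ^ carry) << i
        carry = g | (p & carry)
    return total, carry


def clcsa_add(x, y, width=16, block=8, cin=0):
    """Add two `width`-bit numbers with carry-select across CLA blocks.

    Every upper block is evaluated for carry-in 0 and 1; the carry out of
    the previous block then selects one result, as a 2:1 multiplexer would.
    """
    mask = (1 << block) - 1
    result, carry = 0, cin
    for k in range(0, width, block):
        xb, yb = (x >> k) & mask, (y >> k) & mask
        if k == 0:
            s, carry = cla_block(xb, yb, carry, block)
        else:
            s0, c0 = cla_block(xb, yb, 0, block)   # speculative, carry-in 0
            s1, c1 = cla_block(xb, yb, 1, block)   # speculative, carry-in 1
            s, carry = (s1, c1) if carry else (s0, c0)
        result |= s << k
    return result, carry


if __name__ == "__main__":
    total, cout = clcsa_add(0x12FF, 0x0001)
    print(hex(total), cout)
```

In hardware both speculative sums are available at the same time, so only the multiplexer delay is added per block once the incoming carry is known. Calling `clcsa_add` with `width=24` or `width=32` mirrors the extension by additional 8-bit blocks.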
Pipeline stage 4: inner partial product computation
The four 8 × 8 products obtained from pipeline stage 3 are k15–k0, l15–l0, m15–m0, and n15–n0. During pipeline stage 4, these four partial products are added using two 24-bit adders in parallel, as shown in Fig. 2. The CLCSA of Fig. 6 is extended for the addition of 24-bit operands by including an additional set of 8-bit CLA adders and a set of eight 2:1 multiplexers. The two CLCSA adders operating in parallel produce the two inner partial products x24–x0 and y24–y0, which are latched into the register using the synchronous clock signal. The delay measured for pipeline stage 4 is 0.051 ns. Thus, superscalar pipeline stage 4 requires four 16-bit operands every pipeline clock cycle at 16 GHz to produce two 25-bit inner products in parallel.
Pipeline stage 5: final product computation
During pipeline stage 5, the two inner partial products x and y obtained from stage 4 are added to obtain the product of the 16 × 16-bit multiplier. The CLCSA of Fig. 6 is extended for the addition of 32-bit operands by including two sets of 8-bit CLA adders and two sets of eight 2:1 multiplexers. The 32-bit CLCSA adder produces the 32-bit product, which is latched into the register using the synchronous clock signal. The delay measured for pipeline stage 5 is 0.062 ns, and the stage is operated with a clock frequency of 16 GHz. The pipeline fills up in the 5th clock cycle; once it is filled, a new 16 × 16-bit product is generated every clock cycle. The superscalar pipeline requires two 16-bit operands every pipeline clock cycle at 16 GHz to produce the 16 × 16-bit signed-unsigned product.
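The fill-up and one-result-per-cycle behaviour can be illustrated with a small cycle-level model in Python. This is a sketch under several assumptions, not the authors' design: it handles unsigned operands only, stages 1 and 2 are reduced to behavioural placeholders, and the pairing of the four products in stage 4 (x = k + l·2^8, y = m + n·2^8, product = x + y·2^8) is one arrangement consistent with Eq. (1) that the text does not spell out. All function names are hypothetical.

```python
def stage1(ops):
    """PPG placeholder: split both 16-bit operands into 8-bit halves."""
    a, b = ops
    return (a >> 8, a & 0xFF, b >> 8, b & 0xFF)


def stage2(halves):
    """PPRT placeholder: the row reduction is not modelled here."""
    return halves


def stage3(halves):
    """Four 8 x 8 products k, l, m, n computed side by side."""
    a_hi, a_lo, b_hi, b_lo = halves
    return (a_lo * b_lo, a_hi * b_lo, a_lo * b_hi, a_hi * b_hi)


def stage4(products):
    """Two adders in parallel produce the inner partial products x and y."""
    k, l, m, n = products
    return (k + (l << 8), m + (n << 8))


def stage5(xy):
    """Final adder: combine x and y into the 32-bit product."""
    x, y = xy
    return x + (y << 8)


STAGES = (stage1, stage2, stage3, stage4, stage5)


def run_pipeline(operand_pairs):
    """Clock the pipeline; return (cycle, product) for every finished result."""
    regs = [None] * len(STAGES)                  # pipeline registers
    stream = list(operand_pairs) + [None] * (len(STAGES) - 1)
    outputs = []
    for cycle, item in enumerate(stream, start=1):
        inputs = [item] + regs[:-1]              # each stage reads the previous register
        regs = [fn(v) if v is not None else None
                for fn, v in zip(STAGES, inputs)]
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs


if __name__ == "__main__":
    for cycle, p in run_pipeline([(65535, 65535), (1234, 5678), (7, 9)]):
        print(cycle, p)
```

In this model the first product appears at the end of cycle 5 and one further product follows in each subsequent cycle.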
Experimental results
For comparison, we have implemented several pipelined multipliers. Each pipelined multiplier is divided into five stages: the PPG, the PPRT, and three CPAs. Each pipeline stage is implemented as a digital schematic, and the Verilog HDL code is obtained by compiling the digital schematic. The Verilog HDL code is then compiled to obtain the layout using the 45 nm CMOS technology Microwind tool. Finally, the layout is synthesized and the critical path delay, the area, and the power consumption are measured. For the pipelined multiplier, the maximum critical path delay among the pipeline stages is taken as the clock cycle, that is, the maximum delay of the PPG, PPRT and CPA stages. The maximum of the measured delays (0.023 ns, 0.046 ns, 0.056 ns, 0.062 ns) is 0.062 ns. Therefore, 0.062 ns is taken as the pipeline clock cycle, and the pipeline is operated at the frequency f = 1/(0.062 ns) ≈ 16 GHz. The measured area and power are listed in Table 1. Comparison of results shows that our proposed five-stage pipeline multiplier improves delay by 38%, reduces area by 45% and reduces power dissipation by 32%.

Figure 6  Architecture of CLCSA for 8 × 8-bit multiplier. Inputs x15–x8, y15–y8, x7–x0 and y7–y0 feed 8-bit CLA adders (carry inputs cin0 and 1), eight 2:1 multiplexers selected by c8, and a clocked 16-bit register that holds p15–p0.

Table 1  Comparison of multipliers (multiplier size 16 × 16).
References             Number of     Cycle time   Area      Power   Clock frequency
                       transistors   (ns)         (µm²)     (mW)    (GHz)
Hong et al. (2006)     34,620        0.085        3093.28   605.8   11.0
Alshawi et al. (2015)  34,846        0.075        2646.00   680.8   13.0
Hoyer et al. (2001)    32,472        0.097        2580.48   518.4   10.0
Proposed               18,468        0.062        1679.60   477.0   16.0
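The clock-frequency arithmetic can be reproduced in a few lines of Python. The sketch below takes the stage delays and the Table 1 figures exactly as quoted. Because the text does not state which baseline the quoted percentages refer to, the helper `reduction_percent` (a hypothetical name) simply reports the reduction against each reference row.

```python
# Stage delays as listed in the text, in nanoseconds.
stage_delays_ns = [0.023, 0.046, 0.056, 0.062]

# The slowest stage sets the clock period; 1 / ns gives GHz directly.
clock_period_ns = max(stage_delays_ns)
clock_ghz = 1.0 / clock_period_ns

# Table 1 rows: (cycle time in ns, area, power in mW).
table1 = {
    "Hong et al. (2006)": (0.085, 3093.28, 605.8),
    "Alshawi et al. (2015)": (0.075, 2646.00, 680.8),
    "Hoyer et al. (2001)": (0.097, 2580.48, 518.4),
}
proposed = (0.062, 1679.60, 477.0)


def reduction_percent(baseline, new):
    """Relative reduction of `new` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    print(f"clock period {clock_period_ns} ns -> {clock_ghz:.2f} GHz")
    for name, row in table1.items():
        delay, area, power = (reduction_percent(b, p)
                              for b, p in zip(row, proposed))
        print(f"{name}: delay -{delay:.1f}%, area -{area:.1f}%, power -{power:.1f}%")
```

Rounding the computed frequency down gives the 16 GHz figure used throughout the paper.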
Conclusion
The superscalar pipeline multiplier operates on two 16-bit operands in parallel, and the operands are decomposed into four 8-bit operands. With the four operands processed in parallel, the four 8 × 8 inner products are available in parallel at the end of the third stage. Using the CPA at three levels, the final 16 × 16-bit product of signed-unsigned numbers is obtained. The superscalar pipeline fills up in five clock cycles, and thereafter a 16 × 16-bit product is generated every clock cycle. Finally, the comparison of results shows that our proposed five-stage superscalar pipeline multiplier improves delay by 38%, reduces area by 45% and reduces power dissipation by 32%, and operates at a clock frequency of 16 GHz.
Acknowledgements
The authors would like to acknowledge the Chief Executive .N. Nagbhushan and members of the JSS Research Foundation, SJCE Campus, Mysore, for all the facilities provided for this research work.
References
Alshawi, T., Bentrcia, A., Alshebeili, S., 2015. Design and low-complexity implementation of matrix-vector multiplier for iterative methods in communication systems. IEEE Trans. VLSI Syst. 23 (December (12)), 3099–3103.
Chang, C.-H., Gu, J., Zhang, M., 2004. Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits. IEEE Trans. Circuits Syst. 51 (October (10)), 1985–1997.
Goto, G., Inoue, A., Ohe, R., Kashiwakura, S., Mitarai, S., Tsuru, T., Izawa, T., 1997. A 4.1 ns compact 54 × 54-b multiplier utilising sign-select booth encoders. IEEE J. Solid State Circuit 32 (November (11)).
Hong, S., Park, K.-S., Mun, J.-H., 2006. Design and implementation of a high-speed matrix multiplier based on word-width decomposition. IEEE Trans. VLSI Syst. 14 (April (4)), 380–391.
Hoyer, G.N., Yee, G., Sechen, C., 2001. Locally clocked pipelines and dynamic logic. IEEE Trans. VLSI Syst. 10 (February (1)), 58–62.
Jou, S.-J., Chen, C.-Y., Yang, E.-C., Su, C.C., 1997. A pipelined multiplier accumulator using a high speed, low power, static and dynamic full adder design. IEEE J. Solid State Circuit 32 (January (1)), 114–118.
Kim, D., Ambler, T., 2000. Low power carry lookahead adder by using dependency between generation and propagation. IEEE.
Kim, J., Joshi, R., Chuang, C.-T., Roy, K., 2002. SOI-optimized 64-bit high-speed CMOS adder design. In: Symp. VLSI Circuits, pp. 122–125.
Kuang, S.-R., Wang, J.-P., Guo, C.-Y., 2009. Modified booth multipliers with a regular partial product array. IEEE Trans. Circuits Syst. II 56 (May (5)).
Lee, S.J., Woo, R., Yoo, H.J., 2001. 480 ps 64-bit race logic adder. In: Symp. VLSI Circuits Dig. Tech. Papers, pp. 27–28.
Lin, R., 2001. Reconfigurable parallel inner product processor architectures. IEEE Trans. VLSI Syst. 9 (April (2)), 261–272.
Nagendra, C., Irwin, M.J., Owens, R.M., 1996. Area-time-power tradeoffs in parallel adders. IEEE Trans. Circuits Syst. II: Analog DSP 43 (October (10)), 689–702.
Nève, A., Schettler, H., Ludwig, T., Flandre, D., 2004. Power-delay product minimization in high-performance 64-bit carry select adders. IEEE Trans. Very Large Scale Integr. Syst. 12 (March (3)), 235–244.
Oklobdzija, V., Vileger, D., Simon, S.L., 1996. A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach. IEEE Trans. Comput. 45 (March (3)), 294–306.
Olivieri, M., 2001. Design of synchronous and asynchronous variable-latency pipelined multipliers. IEEE Trans. VLSI Syst. 9 (April), 365–376.
Prasad, K., Parhi, K.K., 2001. Low-power 4-2 and 5-2 compressors. In: Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers, vol. 1, pp. 129–133.
Radhakrishnan, D., Preethy, A.P., 2000. Low power CMOS pass logic 4:2 compressor for high speed multiplication. In: Proc. 43rd IEEE Midwest Symp. on Circuits and Systems, August 8–11, pp. 1296–1298.
Shin, M.-C., Kang, S.-H., Park, I.-C., 2010. An Area-efficient Iterative Modified-booth Multiplier Based on Self-timed Clocking. Department of Electrical Engineering and Computer Science, KAIST, Daejeon, Korea.
Wang, Y., Pai, C., Song, X., 2002. The design of hybrid carry lookahead/carry select adders. IEEE Trans. Circuits Syst. II 49 (January (1)), 16–24.
Wang, L.R., Jou, S.-J., Lee, C.-L., 2008. A Well-structured Modified Booth Multiplier Design. IEEE, 978-1-4244-1617-2/08/$25.00.
Yeh, W.-C., Jen, C.-W., 2000. High speed booth encoded parallel multiplier design. IEEE Trans. Comput. 49 (July (7)), 692–701.
Zlatanovici, R., Kao, S., Nikolic, B., 2009. A 240 ps 64 b carry-lookahead adder in 90 nm CMOS. IEEE J. Solid State Circuit 44 (February (2)), 569–583.
