A digit-serial divider for fine grain heterogeneous parallel-pipelined processing by Fiallos-Aguilar, Mario & Duprat, Jean
HAL Id: hal-02102017
https://hal-lara.archives-ouvertes.fr/hal-02102017
Submitted on 17 Apr 2019
To cite this version:
Mario Fiallos-Aguilar, Jean Duprat. A digit-serial divider for fine grain heterogeneous parallel-pipelined processing. [Research Report] LIP RR-1993-26, Laboratoire de l'informatique du parallélisme. 1993, 2+11p. hal-02102017
LIP
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
A digit serial divider for ne grain
heterogeneous
parallel pipelined processing
Mario Fiallos Aguilar
Jean Duprat
septembre  
Research Report No 
Ecole Normale Supérieure de Lyon
46, Allée d'Italie, 69364 Lyon Cedex 07, France
Telephone: +33 72 72 80 00; Fax: +33 72 72 80 80
E-mail addresses: lip@frensl61.bitnet; lip@lip.ens-lyon.fr (uucp)
A digit serial divider for ne grain heterogeneous
parallel pipelined processing
Mario Fiallos Aguilar
Jean Duprat
septembre  
Abstract
We design a new radix-2 digit on-line (i.e., serial, most significant digit first) floating-point divider which performs its arithmetic operation in digit on-line mode both for the exponent and the mantissa. We have performed parallel discrete-event simulations of the circuit on a memory-distributed massively parallel computer.
Keywords: fine grain parallelism, heterogeneous processing, digit on-line computation
Resume
Ce document decrit un diviseur 	enligne
 en vrgule ottante fonctionant en
base   Lexposant comme la mantisse sont transmis chire a chire Des si
mulations paralleles devenements discrets du circuit ont ete eectuees sur une
machine parallele a memoire distribuee
Motscles  parallelisme a ganularite ne calcul heterogene calcul enligne
A digit serial divider for ne grain heterogeneous
parallel pipelined processing  
Mario Fiallos Aguilar y and Jean Duprat
Laboratoire de lInformatique du Parallelisme LIP	
Ecole Normale Superieure de Lyon

 Allee dItalie  Lyon cedex  France
malloslip
enslyon
fr
R esum e
We design a new radix   digit on line ie serial most signicant digit rst oating
point divider which performs its arithmetic operation in digit on line mode both for the
exponent and the mantissa We have performed parallel discreteevent simulations of
the circuit on a memorydistributed massively parallel computer
1 Introduction
Online arithmetic is a radical departure from conventional techniques for performing scien
tic computations    In such arithmetic the digits circulate serially most signi
cant digit rst Since in classical ie non redundant number systems carries are propa
gated from the least signicant digit to the most signicant one digit on line computations
are not possible in these systems Then we need to use a redundant number system which
enables carryfree computations Here we use the BS 	borrow save
 notation  which is
a special bitlevel implementation of the binary signeddigit representation 
The digit on line arithmetic operators are characterized by their delay that is the number
  such that p digits of the result are deduced from p   digits of the input operands When
successive digit on line operations are performed in digit pipelined mode the resulting de
lay will be the sum of the individual delays of operations and communications and the
computation of large numerical jobs can be executed in an ecient manner Here we will
assume that any communication has a delay of 
As we can see from gure  the computations in digit on line mode can be described as a
dataow graph DFG These graphs consist of nodes which indicate operations executed on
arithmetic units and edges from one node to another node which indicate the ow of data
between them A nodal operation can be executed only when the required information a
digit from all the input edges is received Typically a nodal operation requires one or two
operands and produces one result Once the node has been activated and the computations
   This work is part of a project called CARESSE which is partially supported by the  PRC Architectures
Nouvelles de Machine of the French Ministere de la Recherche et de la Technologie and the Centre National
de la Recherche Scientique
y  Supported by CNPq and Universidade Federal do Ceara Brazil

related to the input digits inside the arithmetic unit performed ie the node has red
the output digit is passed to the destination nodes This process is repeated until all nodes
have been activated and the nal result obtained Of course more than one node can be
red simultaneously
In this paper we deal with the digit on line oatingpoint implementation of the division
a
c
d
b
ith digit of a  b   c  d
 
 
 
i   th digits of the products
i   th digits of the inputs
    
    
Fig    Digit level pipelining in digit on line arithmetic
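The accumulation of delays along a dataflow graph can be sketched as follows. This is only an illustration: the graph encoding, the function name `online_delay` and all delay values are assumptions made for the example, not figures taken from this report.

```python
from typing import Dict, List, Tuple

# Each node maps to (operator delay, list of predecessor nodes).
# Primary inputs have delay 0 and no predecessors.
Graph = Dict[str, Tuple[int, List[str]]]


def online_delay(graph: Graph, node: str, comm_delay: int) -> int:
    """Accumulated on-line delay at the output of `node`.

    A node can only fire once a digit is present on every input edge,
    so the slowest input path determines when it starts producing digits.
    """
    op_delay, preds = graph[node]
    if not preds:
        return op_delay
    slowest_input = max(
        online_delay(graph, p, comm_delay) + comm_delay for p in preds
    )
    return slowest_input + op_delay


# Hypothetical delays (NOT taken from the report): 3 for each product node,
# 2 for the two-operand node that combines the products.
dfg: Graph = {
    "a": (0, []), "b": (0, []), "c": (0, []), "d": (0, []),
    "ab": (3, ["a", "b"]),
    "cd": (3, ["c", "d"]),
    "out": (2, ["ab", "cd"]),
}
```

With these assumed values, `online_delay(dfg, "out", 0)` adds one product delay and the delay of the combining node; a non-zero `comm_delay` is added once per edge on the path.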
We shall assume that both the exponents and the mantissas of numbers circulate in digit on-line mode and are represented in the BS system. We have already introduced digit on-line floating-point adders and multipliers. Recently, Tu [10] has studied floating-point implementations of digit on-line operators, but in a slightly different manner: he assumes that the exponents enter the operators in parallel.
2 The BS notation and the number format
2.1 The BS notation
An interesting implementation of a radix-2 carry-free redundant system is the Borrow Save notation, BS for short. In BS, the $i$-th digit $x_i$ of a number $x$ is represented by two bits $x_i^+$ and $x_i^-$, with $x_i = x_i^+ - x_i^-$. The digit 0 thus has two representations, $(0,0)$ and $(1,1)$. The digit 1 is represented by $(1,0)$ and the digit $-1$ (or $\bar{1}$) by $(0,1)$. Using the BS number system, addition can be computed without carry propagation. Figure 2 shows some elementary fixed-point BS circuits.
   Floatingpoint number format
A BS oatingpoint number X with n digits of mantissa and p digits of exponent is repre
sented by X  mx ex where mx 
Pn
imxi 
 i and ex 
Pp 
i exi 
i In our system the
exponents and the mantissas circulate in digit on line mode exponent rst See gure 
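As an illustration of this format, the sketch below evaluates a digit stream that carries the exponent first and the mantissa next, both most significant digit first. The function name and the list-based stream are assumptions made for the example.

```python
from fractions import Fraction


def bs_float_value(stream, p):
    """Value of a BS floating-point number given as one digit stream.

    `stream` holds p exponent digits (ex_{p-1} ... ex_0) followed by the
    mantissa digits (mx_1 ... mx_n); every digit is -1, 0 or 1.
    """
    exponent_digits, mantissa_digits = stream[:p], stream[p:]
    ex = 0
    for d in exponent_digits:  # Horner evaluation of sum ex_i * 2^i
        ex = 2 * ex + d
    mx = sum(Fraction(d, 2 ** i) for i, d in enumerate(mantissa_digits, start=1))
    return mx * Fraction(2) ** ex
```

For example, with `p = 3` the stream `[1, 0, -1, 1, 0, -1]` has exponent 4 - 1 = 3 and mantissa 1/2 - 1/8 = 3/8.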
  Pseudonormalization
In classical binary oatingpoint representation a number is said normalized if its mantissa
belongs to    or    Normalization of numbers leads to more accurate represen
tations and consequently results In BS representation to check if a number is normalized
needs sometimes the examination of all its digits For this reason we adopt the concept of
pseudonormalized numbers A number is said pseudonormalized if its mantissa belongs to


   


   

  
 
 
z  x    if inc   for all input
z  x    if inc   for all input
digits but the last two
digits but the last one
z  x if inc   for all input digits
z  x  k where k is a
binary positive number
latch
z
 
i
z
i
k
x
i
x
 
i
 
 


 
z
 
i
z

i
y
 
i
y
i
x
 
i
x

i
a
b
c t
u
a b c
t u
represents
 
 
 

latch
z
 
i
z

i
inc
x
i
x
 
i
circuit to compute the subtraction  x  y
The PPM  cell
 sub  
augmementer aug  
circuit to compute z  suk  
Fig     Some elementary xed point BS circuits
circulation of digits
exp    ex mx   mxn
Fig    The BS oating point format
  or   It is easier and faster to ensure that a number is pseudonormalized it
suces to forbid a mantissa beginning by    or  This pseudonormalization is
performed in two steps
 A four state automaton examines two consecutives digits and transforms the couples
  and   into   and   respectively and leaves the other couples unchanged
We call this operation an atomic pseudonormalization This automaton is shown in
gure 
  The second step consists in counting the zeroes generated by the previous computation
and adding the same quantity to the exponent
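The two steps can be sketched in software as follows. This is a simplified model under stated assumptions: the mantissa is a complete list of digits rather than a serial stream, the exponent is a plain integer, each removed leading zero shifts the mantissa left by one position and decreases the exponent by one, and the function name is hypothetical.

```python
def pseudo_normalize(mantissa, exponent):
    """Return (mantissa digits, exponent) with the same value mx * 2**ex.

    mantissa: list of digits in {-1, 0, 1}, most significant first
              (weights 2^-1, 2^-2, ...).
    """
    digits = list(mantissa)
    n = len(digits)
    shifts = 0
    while shifts < n:
        # Atomic pseudo-normalization of the leading couple.
        if n >= 2 and digits[0] == 1 and digits[1] == -1:
            digits[0], digits[1] = 0, 1
        elif n >= 2 and digits[0] == -1 and digits[1] == 1:
            digits[0], digits[1] = 0, -1
        if digits[0] != 0:
            break  # leading couple is allowed: mantissa is pseudo-normalized
        # Drop the leading zero and compensate in the exponent.
        digits = digits[1:] + [0]
        shifts += 1
    return digits, exponent - shifts
```

For instance, the mantissa `[1, -1, -1, 1]` (value 3/16) becomes `[1, 1, 0, 0]` (value 3/4) with the exponent decreased by 2. A zero mantissa cannot be pseudo-normalized; the loop then simply stops after n shifts.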
The divider could have a smaller delay if the divisor is guaranteed to be pseudo-normalized. In this case, the output of all arithmetic operators (adders, multipliers, dividers, etc.) must be pseudo-normalized.
But as our principal goal is to perform computations in digit-level pipelined mode, it is preferable to pseudo-normalize the inputs of the divider internally.
Note that the first solution makes the subtraction a variable-delay operation. The second one makes the divider more complex but allows the adders to have a fixed digit on-line delay. This last solution is preferable because division is less frequent than addition in scientific computation.

Fig. 4: The automaton of the pseudo-normalizer
 The digit online algorithm
The digit on line oatingpoint division algorithm performs three operations exponents
calculation mantissas centring and calculation A synchronization is performed between
the exponent and the mantissa The algorithm part of mantissa computation is based on
the algorithm presented in  Let us present the algorithm
 The algorithm
We want to compute $Q = X/Y$, with $X = mx \cdot 2^{ex}$, $Y = my \cdot 2^{ey}$, $Q = mq \cdot 2^{eq}$, and
  my  
jmxj  my
We will see in the next sections how to deal with the cases where the condition on mx does not hold and where the divisor mantissa is negative. The algorithm can be stated as follows:
Algorithm 1: Digit on-line division algorithm
Step 1: Exponent computation
1. Compute the subtraction of the exponents, except for its last two digits: $eq_{p-1}, \dots, eq_2$
Step 2: Mantissas shifting and exponent computation
  MY

 
P
imyi 
 i
 A

 
P
imxi 
 i
 if MY

    then MY   MY

  else MY  MY

 
 if jA

 j   MY     then A

  A

  else A

  A

 
 if A

  A

  then increment eq and compute eq
 if jA

j   MY     then A  A

  else A  A


 if A  A

  or MY   MY

 then increment eq and compute eq
Step 3: Mantissa computation
  for j   j  n  
f
 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
  if MY

    then
f
		 MYj   MYj myj  
 j 
		 Aj    Aj mxj  
   mqj MYj   Qjmyj  
 
g
else
f
		 MYj   MYj myj  
 j 
		 Aj    Aj mxj  
   mqj MYj   Qjmyj  
 
g
 Qj   Qj mqj   j 
g
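For readers unfamiliar with digit recurrences that produce quotient digits in {-1, 0, 1}, the following sketch shows a conventional, non-on-line variant in Python. It is deliberately not Algorithm 1: both operands are fully known in advance, exact rational arithmetic replaces the BS adders, and the selection constants of ±1/2 on the doubled residual are the classical ones for a divisor in [1/2, 1), assumed here for illustration. The function name is hypothetical.

```python
from fractions import Fraction


def sd_divide(x, d, n):
    """Radix-2 digit-recurrence division with quotient digits in {-1, 0, 1}.

    Requires 1/2 <= d < 1 and |x| <= d. Returns (digits, residual) such that
    x = d * sum(q_j * 2^-j) + residual * 2^-n, with |residual| <= d.
    """
    x, d = Fraction(x), Fraction(d)
    assert Fraction(1, 2) <= d < 1 and abs(x) <= d
    w = x  # residual, kept within [-d, d] by the selection rule below
    digits = []
    for _ in range(n):
        doubled = 2 * w
        if doubled >= Fraction(1, 2):
            q = 1
        elif doubled < Fraction(-1, 2):
            q = -1
        else:
            q = 0
        w = doubled - q * d
        digits.append(q)
    return digits, w
```

Because the digit set is redundant, the selection only needs a coarse comparison of the residual, which is what makes a truncated, carry-free estimate of the residual sufficient in an on-line setting.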
3.2 Proof of correctness
It is obvious that the computation of the exponent of the result is correct. For the alignment and computation of the mantissas, on the other hand, the situation is more complex. Let us explain this.
 Mantissas shifting
We show why it may be necessary to shift A

 and A

 one time each
According to the algorithm it must be guaranteed that jmxj  my Then as the shift must
be performed with only  digits of each mantissa we may have the following situations
 If MY


   jA

 j
MY
  
 
and mx
my
may be equal to  
 
 A shift is necessary
But as jA

j
MY
  
 
another shift is necessary and then jAj
MY
  
 
 With this it
is guaranteed that jmxj  my
 If MY

    then MY

 is shifted of one position The worst case is
jA

 j
MY
  
 

Then it is enough to shift A one position to guaranteed that jmxj  my With this
MY    Where MY is the mantissa of the divider
Then the exponent must be augmented in   or  
Mantissa computation
To perform the division correctly, the values of $mq_j$ chosen in Step 3 of the algorithm must be compatible with Robertson's conditions. They are:
 if MXj   MY  then mqj   
  if  MY  MXj   then mqj    or mqj   
 if MXj   then mqj    or mqj    or mqj   
 if   MXj MY  then mqj    or mqj   
 if MXj  MY  then mqj   
The following two equations are easily proved by induction:
If MY

   
Aj   
j
 
j X
i
mxi 
 i  
 jX
i
mqi 
 i
j X
i
myi 
 i
 
else if MY

   
Aj   
j
 
j X
i
mxi 
 i  
 jX
i
mqi 
 i
j X
i
myi 
 i 
  
Aj can be expressed also as
Aj   
j
 
j X
i
mxi 
 i  
 jX
i
mqi 
 i

MYj
 
$MY_j$ is the shifted mantissa of the divisor at step $j$.
We define a sequence as:
MX  mx
MXj    MXj  mqj MY

We nd that
MXj   
j
 
nX
i
mxi 
 i  
 jX
i
mqi 
 i

MY
 
MXj  Aj   
j
 
nX
ij 
mxi 
 i  
 jX
i
mqi 
 i

MY  MYj
 
As
MYj 
 Pj 
i myi 
 i if MY

   Pj 
i myi 
 i  if MY

   


We have
jMXj   Aj   
j
 
nX
ij 
  i 
 jX
i
  i

jMY  MYj j
 
As
jMY  MYj j 

  j  ifMY

   
  j ifMY

   


Then
jMXj  Aj j         
According to step  of the algorithm
 if mqj    then Aj   From equation  we nd that if Aj   then
MXj    Robertsons conditions  and  are satised
 Similarly if mqj    then Aj   Then MXj    Robertsons conditions 
and   are satised
 if mqj    then    Aj    From equation  we nd that    MXj 
  and as jMY j    then the Robersons conditions    and  are satised
Hence the algorithm computes the division correctly
However this algorithm can be improved The sequence of tests
Test  Test of Aj
		 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
needs the examination of all the digits of Aj ie j This examination involves a needless
loss of time the arithmetic operations on step  of the algorithm may be performed in
parallel without carry propagation using the BS number system Therefore this sequence
of test is the most timeconsuming part of the algorithm In order to avoid this drawback
we examine all the digits of Aj between the most signicant one and the digit which power
is    Namely Aj 
P
i  
 iaji
 Then the test will be performed on Aj instead of Aj
as following
Test  Test of Aj
		 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
The proof of the improved algorithm is similar to the previous one.
We obtain the obvious relation:
jAj  A

j j    
Then according to the modied Step  of the algorithm
 if mqj    then A

j   From equation  we nd that if A

j   then
Aj    and from  we nd that MXj  
  by now let us assume that Aj can be represented as a  digits expression
 Similarly if mqj    then A

j   Then Aj    and MXj  
 if mqj    then   A

j   As A

j is a multiple of   we have   
Aj    From equation  we nd    Aj    and from equation  we
nd that   MXj   
 Pseudonormalization
If the inputs of the oatingpoint divider are pseudonormalized then its output is also
pseudonormalized Let us prove that
 If MY

    then the worst case is
jXj
Y
  
 
 

and the quotient is pseudo
normalized
 If MY

    then the worst case is
jXj
Y
  
 
 

and the quotient is pseudo
normalized
4 The architecture
The floating-point divider consists of several blocks (figure 5):
 A serial circuit to compute the difference between the exponents.
 A serial augmenter to increase the exponent by   or  
 A serial automaton that computes the absolute value of Y.
 A serial overflow detector.
 A pseudonormalizer which ensures that   Y  
 A serial shifter synchronizer for the mantissas.
 A serial divider for the mantissas.
Fig. 5: The on-line floating-point divider (blocks: exponent subtraction, augmenter, overflow detector, absolute value automaton, pseudo-normalizer, mantissa shifting with delay and synchronizer, serial divider, sign inverter)
The rst two computations are performed with the circuits of gure  
The automaton that computes the absolute value of Y is shown in gure  The sign inverter
changes the sign of the mantissa of the result if the state of the maximum value automaton
is 
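A serial absolute-value computation on a signed-digit mantissa can be sketched as below. The sketch relies on a general property of signed-digit numbers (the sign of the value is the sign of its first non-zero digit); the three-valued `sign` variable plays the role of the automaton state, but its encoding is an assumption and need not match the states of figure 6.

```python
from fractions import Fraction


def sd_value(digits):
    """Value of a signed-digit mantissa, MSD first (weights 2^-1, 2^-2, ...)."""
    return sum(Fraction(d, 2 ** i) for i, d in enumerate(digits, start=1))


def serial_abs(digits):
    """Process digits most significant first; return (|Y| digits, sign of Y).

    sign is 0 while only zeros have been seen, then +1 or -1 once the first
    non-zero digit arrives. From that point on, every digit is multiplied
    by the sign, which negates the whole number when it is negative.
    """
    sign = 0
    out = []
    for d in digits:
        if sign == 0 and d != 0:
            sign = d
        out.append(d * sign)
    return out, sign
```

The returned sign is what a later sign inverter would need in order to restore the sign of the quotient mantissa.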
The detection of the overow is done at the output of the incrementer A small automaton

 






Initial
 state
Fig    The absolute value automaton
tries to nd a representation of the exponent so that to have the carry digit equal to  in
order to keep the p exponent of the format Figure  shows this automaton
xx
	
x
x




 
xx
xx



 


x
 
  are overflow states
xx
Initial
state
Fig    The overow detector automaton
The shifter synchronizer guarantees that if shifts have been performed then the exponent is augmented, and that otherwise the exponent remains unchanged. We now explain in more detail the pseudo-normalizer, the shifter synchronizer and the serial divider.
 Pseudonormalizer
The pseudonormalizer is shown in gure  The automaton is shown in gure  A binary
counter stores the number that the exponent must be decreased A zero tester is used to
avoid the delay of the serial circuit when the subtraction of the exponents is not performed
The overow detector is similar to the ones shown in gure  The delay of the pseudo
normalizer  pno is variable and depends on the degree of pseudonormalization of the
operands If le is the number of digits of the exponent and lbs the number of digits to
represent the oatingpoint number then
le    pno  lbs   
Then the delay of the normalizer may be in the worst case as great as the length of the
number representation plus  On the other hand if the input operand is already pseudo
normalized  pno has its minimum value Figure  shows an example
If the zero tester is not used a simplied design is obtained but the minimum value of the
delay will be augmented by  The serial subtraction can be replaced also by its parallel
version
Fig. 8: The pseudo-normalizer (blocks: input shift register, automaton, counter, zero tester, subtraction, serial overflow detector, demux, output shift register)
Fig. 9: Example of the internal synchronization on the pseudo-normalizer
4.2 Shifting the mantissas
The circuit performs the comparisons of the mantissas. The comparison on the divisor mantissa is performed before the comparison with mx. A second comparison delays mx by 1 or 2 cycles if necessary. No digit of mx is lost; digits are only delayed. It is assumed that these operations can be performed in one cycle.
4.3 The serial divider
The serial divider is shown in figure 11. Its upper part computes the product of the quotient digit $mq_j$ by the shifted divisor mantissa, and its lower part computes the product of the partial quotient $Q_j$ by the incoming divisor digit. The BS four-input parallel adder computes the next value of $A$. It is made up of two-input BS parallel adders. The format control is very simple: it requires only the test of a single digit of $A$, and if the value of this digit is different from zero then a second, fixed digit is inverted. This technique was originally proposed by Kla [8].

Fig. 10: The circuit for shifting the mantissas
 Let Z  zn   zzz z k  NzzK such that jZj  
if z    Z  zK else Z  zK









	
Fig. 11: The serial divider
 Internal synchronization of the oatingpoint divider
pseudo
and delay
output of
Subtraction
overflow
detector
normalizer
cycles
Mantissas
Output
divider
mantissas alignment
ey

p
ey

p
ey

p	
ey

p
ey

p

ey

p
ey

p
   ey



ey


ey

	
ey


ey


ey


my


my


my

	
my


my



my


my


  
ex

p
ex

p
ex

p	
ex

p
ex

p

ex

p
ex

p
   ex



ex


ex

	
ex


ex


ex


mx


mx


mx

	
mx


mx



mx


mx


  
e

p e

p
   e


e


e



e


e

	
e


e


e


e

p 
e

p e

p
   e


e


eqp    eq eq
mq mq
eqp    eq eq mq   
 augmentation
Augmenter
Fig     The internal synchronization on the on line oating point divider
As we can see from gure   the decision on augmenting or not the exponent can be taken
when their last two digits go through the incrementer As the last two digits of the exponent
are outputting the rst ve digits of the mantissas are available and then it is possible to
subtract   or   from the exponent of the result Using gures  and   we obtain the
interval values of the digit on line delay of the oatingpoint divider  div
le    div  lbs  
Note that if the inputs are guaranteed to be pseudonormalized the delay of the divider
would be 
5 Conclusion
We have described a new radix-2 digit on-line divider. This arithmetic unit has a variable digit on-line delay which depends on the degree of pseudo-normalization of the divisor.
This architecture has been fully simulated using parallel discrete-event simulations. The simulation runs on a MasPar MP-1, a memory-distributed massively parallel computer, where several operators work in parallel.
With this operator and the adders and multipliers already introduced, it is possible to perform complex computations, such as the Gaussian elimination algorithm for solving linear equations, in digit-level pipelined mode.
We are working on a project to simulate and build a digit on-line machine called CARESSE (the French abbreviation of Serial Redundant Scientific Computer), which will be made up of heterogeneous digit on-line arithmetic units.
References
 A Avizienis Signeddigit number representations for fast parallel arithmetic IRE
Transactions on Electronic Computers pp  
  J Duprat and M Fiallos Aguilar Datafow dot product on networks of heterogeneous
digitserial arithmetic units In submitted to IEEE th Symposium on Parallel and
Distributed Processing SPDP 
 J Duprat and M Fiallos On the simulation of pipelining of fully digit online oating
point adder networks on massively parallel computers In Second Joint Conference on
Vector and Parallel Processing Lecture Notes in Computer Science pages  
SpringerVerlag September  
 J Duprat M Fiallos J M Muller and H J Yeh Delays of online oatingpoint
operators in borrow save notation In Algorithms and Parallel VLSI Architectures II
pages    Noth Holland 
 MD Ercegovac Online arithmetic an overview In SPIE editor SPIE Real Time
Signal Processing VII pages pp  
 MD Ercegovac and KS Trivedi Online algorithms for division and multiplication
IEEE Trans	 Comp	 C pp  
 A Guyot Y Herreros and J M Muller Janus an online multiplier divider for
manipulating large numbers In IEEE th Symposium on Computer Arithmetic pages
 IEEE Computer Society Press 
 Sylvanus Kla Calcul Parall
ele et En Ligne des Fonctions Arithmetiques PhD thesis
Ecole Normale Superieure de Lyon France February 
 JM Muller Arithmetique des Ordinateurs Masson 
 P K Tu On line Arithmetic Algorithms for Ecient Implementation PhD thesis
Computer Science Department UCLA 
 PK Tu and MD Ercegovac Design of online division unit In th Symposium on
Computer arithmetic pages   IEEE 
  R J Zaccone and J L Barlow Eliminating the normalization problem in digit online
arithmetic IEEE Transaction on Computers C January 
