Dataflow dot product on networks of heterogeneous digit-serial arithmetic units by Duprat, Jean & Fiallos-Aguilar, Mario
HAL Id: hal-02102063
https://hal-lara.archives-ouvertes.fr/hal-02102063
Submitted on 17 Apr 2019
To cite this version:
Jean Duprat, Mario Fiallos-Aguilar. Dataflow dot product on networks of heterogeneous digit-serial
arithmetic units. [Research Report] LIP RR-1993-10, Laboratoire de l’informatique du parallélisme.
1993, 2+17p. hal-02102063
LIP
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Institut IMAG
Unité de recherche associée au CNRS n°1398 
Dataflow dot product on
networks of heterogeneous digit-serial
arithmetic units
Jean Duprat
Mario Fiallos Aguilar
March 1993
Research Report No 1993-10
Ecole Normale Supérieure de Lyon
46, Allée d’Italie, 69364 Lyon Cedex 07, France,
Telephone: +33 72 72 80 00; Fax: +33 72 72 80 80;
E-mail addresses:
lip@frensl61.bitnet; lip@lip.ens-lyon.fr (uucp).
Abstract
In this paper we deal with a new high-precision computation of the dot product. The key idea is to use hundreds of digit-serial arithmetic units that allow massive digit-level pipelining. Parallel discrete-event simulations performed on a distributed-memory massively parallel computer show that, with a limited number of arithmetic units, the computation of the dot product with a "classical" algorithmic technique (i.e., serial cumulative multiplications) is almost as fast as with an "optimal" divide-and-conquer algorithmic technique. Interconnection networks for both algorithmic techniques are considered.
Keywords: dataflow, dot product, massive pipelining, digit on-line computation.
Resume
Ce document decrit un produit scalaire a haute precision  Lidee principale est dutiliser
plusieurs centaines dunites arithmetiques permettant le 	pipeline
 au niveau du chire 
Des simulations paralleles devenements discrets faites sur des machines paralleles a me
moire distribuee montrent que lorsque le produit scalaire est calcule avec un nombre xe
dunites un ordononnancement de multiplications cumulatifs est presque aussi rapide
quun ordononnancement 	divideandconquer
  Les reseaux dinterconnection pour les
deux techniques sont aussi presentes 
Motscles  ot de donnees produit scalaire 	pipeline
 massive calcul enligne 
Dataflow dot product on networks of
heterogeneous digit-serial arithmetic units*
Jean Duprat and Mario Fiallos Aguilar†
Laboratoire de l'Informatique du Parallélisme (LIP)
Ecole Normale Supérieure de Lyon
46, Allée d'Italie, 69364 Lyon Cedex 07, France
mfiallos@lip.ens-lyon.fr, duprat@lip.ens-lyon.fr
Introduction
Matrix and vector operations based on dot product computation occur frequently in engineering and scientific applications. A lot of work has been done to obtain better algorithms and efficient implementations on parallel computers.
Unfortunately, in arithmetic algorithms that approximate real numbers by floating-point representations, inaccurate calculations and representations can lead to completely wrong results.
These errors are produced by cancellation and truncation of the floating-point numbers. A computer that allows the size of operands and results to be large enough to compute according to the needs of accuracy potentially resolves these problems.
However, as high accuracy is achieved using very long precision arithmetic, the representation of numbers needs a lot of bits, typically several thousands. It is more practical to carry all these bits serially than in parallel.
In the digit on-line mode of computation, the operands and the results flow through the arithmetic operators or units (aus) serially, digit by digit, starting with the most significant one, which allows digit-level pipelining.
This paper deals with the digit on-line computation of the dot product. Floating-point digit on-line adders and multipliers are used to compute with a maximum accuracy of  digits.
Two different algorithmic techniques for computing the dot product are studied. The first one is the "divide-and-conquer" technique, which is frequently used on parallel machines. The second consists basically in computing the dot product using cumulative multiplications, a technique frequently used on SISD computers.
A comparison of the two algorithmic techniques is performed using analytical methods and parallel discrete-event simulation on the MasPar MP.
 Online and dataow modes of computation
As stated above in digit online computation the operands and the results ow between arithmetic
units serially most signicant digit rst MSD  Similarly LSD means least signicant digit 
A consequence of this ow is the need of a redundant number system  In such systems addition
 This work is part of a project called CARESSE which is partially supported by the  PRC Architectures Nouvelles
de Machines  of the French Ministere de la Recherche et de la Technologie and the Centre National de la Recherche
Scientique
ySupported by CNPq and Universidade Federal do Ceara Brazil

is carry free and can be performed in parallel or in any serial mode MSD or LSD  The most
usual arithmetic operations can be calculated in MSD mode too  Digit online arithmetic is the
combination of MSD and redundant number system 
An interesting implementation of a radix carryfree redundant system is the Borrow Save notation
BS for short  In BS the iith digit xi of a number x is represented by two bits x
 
i and x
 
i with xi 
x i   x
 
i   Then  has two representations   and    The digit  is represented by   and
the digit  is represented by   
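As a small illustration of the Borrow Save encoding, the following Python sketch evaluates BS digits and MSD-first BS fractions. The function names are hypothetical, and the fraction is assumed to weight its i-th digit by 2^-i (i = 1, 2, ...). It only shows the redundancy of the notation, not the hardware.

```python
# Borrow-save (BS) digits: each digit is a pair of bits (plus, minus)
# whose value is plus - minus, so it lies in {-1, 0, 1}.

def bs_digit_value(plus: int, minus: int) -> int:
    """Value of one BS digit encoded as the bit pair (plus, minus)."""
    return plus - minus


def bs_fraction_value(digits: list[tuple[int, int]]) -> float:
    """Value of an MSD-first BS fraction: sum of d_i * 2**-i, i = 1..n."""
    return sum(bs_digit_value(p, m) * 2.0 ** -(i + 1)
               for i, (p, m) in enumerate(digits))


# Redundancy: the digit strings (1, -1) and (0, 1) denote the same value.
a = [(1, 0), (0, 1)]   # digits 1, -1  ->  1/2 - 1/4
b = [(0, 0), (1, 0)]   # digits 0,  1  ->  0   + 1/4
print(bs_fraction_value(a), bs_fraction_value(b))  # 0.25 0.25
```

Because several digit strings denote the same value, an adder never has to wait for a carry from the less significant digits, which is what makes MSD-first operation possible.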
A BS floating-point number $x$ with $n$ digits of mantissa and $p$ digits of exponent is represented by $x = m_x 2^{e_x}$, where $m_x = \sum_{i=1}^{n} m_{x,i} 2^{-i}$ and $e_x = \sum_{i=0}^{p-1} e_{x,i} 2^{i}$. In our system the exponents and the mantissas circulate in digit on-line mode, exponent first.
Digit on-line systems are characterized by their delay, that is, the number δ such that p digits of the result are deduced from p + δ digits of the input operands. When successive digit on-line operations are performed in digit-pipelined mode, the resulting delay is the sum of the individual delays of the operations and communications, so that large numerical jobs can be executed efficiently. We will assume that any communication has a delay of  . See figure 1.
As we can see from figure 1, the computations in digit on-line mode can be described as a data dependence graph or dataflow graph (DFG). These graphs consist of nodes, which indicate operations executed on arithmetic units, and edges from one node to another, which indicate the flow of data between them. A nodal operation can be executed only when the required information (a digit from each input edge) has been received. Typically a nodal operation requires one or two operands and produces one result. Once the node has been activated and the computations on the input digits have been performed inside the arithmetic unit (i.e., the node has fired), the output digit is passed to the destination nodes. This process is repeated until all nodes have been activated and the final result is obtained. Of course, more than one node can fire simultaneously.
Figure 1: Digit-level pipelining in digit on-line arithmetic
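The firing rule and the accumulation of delays can be sketched in a few lines of Python. The graph below (for a·b + c·d), the delay values (3 for a multiplier, 2 for an adder) and the unit communication delay are placeholders chosen for illustration, not the values of the units described in this report. The sketch assumes that input digit j is available at cycle j and that a node with on-line delay δ needs digit i + δ of every operand before it can emit its i-th result digit.

```python
from functools import lru_cache

# Hypothetical DFG: node -> (on-line delay, predecessors).
GRAPH = {
    "a": (0, ()), "b": (0, ()), "c": (0, ()), "d": (0, ()),
    "mul1": (3, ("a", "b")),
    "mul2": (3, ("c", "d")),
    "add": (2, ("mul1", "mul2")),
}
COMM = 1  # assumed communication delay per edge, in cycles


@lru_cache(maxsize=None)
def emit_cycle(node: str, i: int) -> int:
    """Cycle at which `node` emits its i-th digit (MSD is digit 0)."""
    delay, preds = GRAPH[node]
    if not preds:            # primary input: digit i enters at cycle i
        return i
    # The node fires for digit i once digit i + delay of every operand
    # has crossed the edge leading to it.
    return max(emit_cycle(p, i + delay) for p in preds) + COMM


print(emit_cycle("add", 0))                          # 7
print(emit_cycle("add", 5) - emit_cycle("add", 0))   # 5
```

Under these assumptions the first result digit appears after 3 + 1 + 2 + 1 = 7 cycles, the sum of the operator and communication delays on the path, and one further digit is produced per cycle afterwards.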
  Pseudonormalization
In classical binary oatingpoint representation a number is said normalized if its mantissa belongs
to   or    Normalization of numbers leads to more accurate representations and con
sequently results  In BS representation to check if a number is normalized may be necessary to
examine all its digits  For this reason we replace the concept of normalized numbers by that of
pseudonormalized numbers  A number is said pseudonormalized if its mantissa belongs to  
or    It is very easy to ensure that a number is pseudonormalized it suces to forbid a
mantissa beginning by    or   This pseudonormalization is performed serially 
In the next sections we will describe briey the two digit online oatingpoint arithmetic units aus
used in the computation of dot product 
 Fully digit online oatingpoint adder
As in classical oatingpoint adders the fully digit online adder addo for short  performs
three basic operations exponent calculation mantissa alignment and mantissa calculation  Figure
 shows the dierent blocks of the adder
 A serial maximizer computes the maximum of the two exponents 
 A serial aligner performs the mantissa alignment with a shift register using the dierence
between the exponents 
 A serial adder calculates the sum of the aligned mantissas 
 A synchronizer is used to guarantee that only an unavoidable carry appearing in the sum of the
mantissas will provoke the truncation of the mantissa and the incrementation of the exponent
of the result 
Figure 2: The on-line floating-point adder (blocks: maximizer, delay, incrementer, overflow detector, subtracter + counter, aligner, serial adder, normalizer + carry detector; inputs: exponents and mantissas of x and y)
The synchronizer normalizer  carry detector in gure  must test the carry digit and the more
signicant digit of the mantissas sum  When these digits are equal to       or   respectively
the incrementation of the exponent and the truncation of the mantissa the last digit is lost are
performed  When these digits are equal to       or   the exponent is not modied and by
substituting   by   or   by   when necessary the carry digit of the sum is  
The operation performed by the synchronizer is dierent from the systematic incrementation of
the exponent during all addition proposed by Tu   In this last case the truncation of the
mantissa of the result leads to a needless loss of information  In the solution we have proposed 
the incrementation is performed only when necessary  After the tests of the rst two bits of the
mantissas sum the exponent is incremented i the modied carry digit is not   Since the decision
of incrementing or not can be made when the last digit of the exponent result is outputting the
incrementer see gure  the online delay of addo  add becomes  
The digit pair   is used to transmit a  and the pair   for a non signicant transmission so that
the synchronization is insured automatically 
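The three operations of the adder (exponent maximum, mantissa alignment, mantissa sum) and the conditional incrementation of the exponent can be mimicked with ordinary exact arithmetic. The sketch below is not digit-serial and ignores the BS encoding and the loss of the last digit; it assumes that mantissas are fractions of absolute value below 1 and that a number has the value m·2^e. The name `fp_add` is hypothetical.

```python
from fractions import Fraction


def fp_add(mx: Fraction, ex: int, my: Fraction, ey: int):
    """Add mx * 2**ex and my * 2**ey; return (mantissa, exponent)."""
    e = max(ex, ey)                          # maximizer
    ax = mx * Fraction(1, 2 ** (e - ex))     # aligner: shift by the
    ay = my * Fraction(1, 2 ** (e - ey))     # exponent difference
    s = ax + ay                              # mantissa adder
    if abs(s) >= 1:                          # unavoidable carry only:
        s /= 2                               # shift the mantissa and
        e += 1                               # increment the exponent
    return s, e


# 3/4 * 2**2 + 1/2 * 2**1 = 3 + 1 = 4 = 1/2 * 2**3
print(fp_add(Fraction(3, 4), 2, Fraction(1, 2), 1))
# 1/2 + 1/4 needs no incrementation: the result is 3/4 * 2**0
print(fp_add(Fraction(1, 2), 0, Fraction(1, 4), 0))
```

The exponent changes only in the first call, where the aligned mantissas sum to 1; a systematic incrementation would also shift the second result and waste one digit of the mantissa.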
 Fully digit online oatingpoint multiplier
The oating point multiplier consists of three dierent parts 
 A serial adder for the exponents 
 A serial multiplier for the mantissas 
 A synchronizer which ensures that if two input numbers are pseudonormalized the output
will be pseudonormalized too 
Figure 3: Synchronization in the on-line floating-point adder (timing diagram, in cycles, of the input operands x and y, the maximum, the delay, the incrementer, the overflow detector, the serial adder, the pseudo-normalization and the output z). Legend: max_i: output digit of the maximum calculation operator; delay: delay of the computed maximum; adm_i: output digit of the mantissas adder; inc_i: output digit of the incrementer; ove_i: output digit of the overflow operator.
The serial multiplier is described in      . The serial adder and the pseudo-normalizer are similar to those of the floating-point adder and will not be presented here.
Since the two input mantissas belong to   in absolute value, the product belongs to  ; thus, if we want to keep the result pseudo-normalized, it is necessary to shift the mantissa by up to  positions to the right. This kind of normalization requires the dynamic subtraction of   or  to the exponent of the result. To generate the final exponent, the last two digits of the exponent are controlled by the three digits of the mantissa product. The normalizer contains a decrementer followed by an overflow detector, which is similar to the one of the adder. The digit on-line delay of the multiplier, δmul, is  . The internal synchronization is similar to the addo case.
Computing dot product in digit on-line mode
With the adder and the multiplier presented in the last two sections, we consider computing the dot product of two vectors in a massive digit-pipelined mode.
The dot product $C$ of two vectors $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$ is given by
$$C = \sum_{i=1}^{n} a_i b_i.$$
The first, "fast" mode of computation arises immediately: compute the products in parallel first, and then use adders to perform the reduction following a divide-and-conquer technique. See figure 4. The resulting dataflow graph is a complete binary tree (CBT for short) with $\lceil \log_2 n \rceil + 1$ levels.
We can also try to compute the dot product by cumulative multiplications (see figure 4): we compute the first product. This result is then added to the next one using an adder, and so on until the final result is reached. The resulting DFG is a linear array of operators (LA).¹
¹ The graph can also be defined as a strictly binary tree with n levels.
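The two orders of evaluation can be contrasted with ordinary Python numbers. This is only a sketch of the shape of the two dataflow graphs (the function names are hypothetical); it says nothing about digit-serial timing. `dot_tree` also counts the levels of the tree, including the level of the products.

```python
def dot_cumulative(a, b):
    """Linear array: each product is added to the running sum in turn."""
    acc = a[0] * b[0]
    for x, y in zip(a[1:], b[1:]):
        acc = acc + x * y          # one adder per further product
    return acc


def dot_tree(a, b):
    """Complete binary tree: all products first, then pairwise sums."""
    terms = [x * y for x, y in zip(a, b)]   # level of the multipliers
    levels = 1
    while len(terms) > 1:
        nxt = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:                  # odd term passes through
            nxt.append(terms[-1])
        terms = nxt
        levels += 1
    return terms[0], levels


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
print(dot_cumulative(a, b))   # 120
print(dot_tree(a, b))         # (120, 4)
```

Both graphs use n - 1 additions; they differ in depth, 3 adder levels for the tree against 7 chained adders for the linear array when n = 8.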
Figure 4: Two resulting graphs for dot product computation (complete binary tree of operators; linear array of operators)
Three cases arise when the dot product is performed in digit on-line mode:
1. The number of arithmetic units is greater than the number of operations to perform, and a minimum delay is obtained.
2. The number of arithmetic units is less than the number of operations to perform, but by reusing the idle operators it is possible to compute with minimum delay.
3. The number of arithmetic units is less than the number of operations to perform and, although the idle operators are reused as soon as possible, it is not possible to compute with minimum delay.
Note that in the last two cases a scheduling policy must be used.
Computing dot products with a number of aus greater than or equal to n
With δmul =  , δadd =   and assuming that any communication has a delay of  , it is easy to find that the minimum delay for computing a dot product using the divide-and-conquer technique (see fig. 4) is
 CBT   mul   add   log n     log n 
Similarly the minimum online delay for the computation of the dot product using cumulative
multiplications is
 CM   mul   add  n    n 
As we stated, the two cases above are not realistic.
Reusing the aus to compute with minimum delay
The problem of reusing operators is unavoidable in a real machine, where their number cannot grow indefinitely. However, it is possible to reuse the aus and still obtain a minimum delay.
On a computer that is not digit on-line (i.e., one that receives all digits of the operands in parallel), reusing is simple. For the complete binary tree this can be achieved with n  adders and n multipliers.
In a similar way, it is possible to reduce the number of operators needed to compute in digit on-line mode. But here the situation is more complex: because the numbers are transmitted serially, digit by digit, the predecessor operators may still be computing the last or some middle digits of a result whose other digits are already being consumed by the successor nodes, so the predecessors cannot be reused until they have produced the last digit of the result. A similar situation occurs in the LA of operators.
We first present the case of the complete binary tree graph and then the case of the linear array of operators.
Reusing aus in the tree
It is not possible to reach the minimum δCBT without using n multipliers. The problem is to know how many adders are necessary to compute with minimum delay, given n multipliers. We denote by ui the level of an operator in the CBT (the multipliers are at level  ) and by L the number of digits used to code the numbers. It is possible to reuse only the first level of adders. The beginning and ending times of the adders (tbega, tenda) and of the multipliers (tbegm, tendm) are
tbegm   
tendm   mul  L      L 
tbega  ui    add   mul  ui     ui    
tenda  tbega   add  L     ui    L 
The adders that have not begun their first digit computation when the ones at the second level of the CBT have finished are those with
uj  d  L  e 
For example if L   and n   the number of adders saved by reusing is only   For CBT
searching for the minimum digit online delay is not realistic 
Reusing aus in the linear array
If we enumerate the adders of the LA from left to right as a1, ..., an and the multipliers as m  and m1, ..., mn (see figure 4), we can compute the beginning and ending times of the operators as follows:
tbegai   mul     add  i    i 
tendai  tbegai   add  L       i L 
tbegmi  tbegai    mul     i 
tendmi  tbegai  L       i L 

For the first adder tenda   L. It can be reused at time tbegre  L. Then we look for the adders whose beginning time is greater than or equal to tbegre:
tbegrea    L 
  i    L  i  d  L e
We note that i is the subindex that identies each adder  For the other aus the situation is similar 
These results indicate that computation of dot product with minimum delay can be performed by
using a number of operators which depends only on the length of the format of the number L and
not on the number of products n 
Table 1 summarizes the number of operators required to compute with minimum delay as a function of
the number of digits used to represent the numbers for adders dLe and for multipliers
d  L  e   
Table 1: Number of operators to compute with CM technique with minimum delay (columns: number of digits, adders needed, multipliers needed)
A first comparison between the techniques and networks
The CM technique permits the computation of the dot product with a number of aus that is independent of the number of products to be performed. This contrasts with the divide-and-conquer technique, where the computation depends on both the length of the operands L and the dimension of the dot product n.
It is true that the digit on-line delay of the LA is greater than the delay of the divide-and-conquer technique, but it is interesting to investigate what happens in a real-world situation where the number of operators is limited. In the next sections we will show that the advantages of computing with the divide-and-conquer method may vanish when the number of operators is limited and the number of digits used to represent the numbers increases.
Moreover, it is easy to see that, as the operators must be reused, the computation of dot products by cumulative multiplications can be performed on a ring of operators instead of on an LA. That is, the output of the last adder of the computation is fed back to one of the inputs of the first adder. See figure 5. From now on we suppose that the computation of dot products using cumulative multiplications is performed on a ring of aus. Note also that the "divide-and-conquer" technique immediately suggests using a set of CBT-interconnected aus.
Figure 5: Computing dot product using a ring of operators
Computing dot product with a greater delay than the minimum
We suppose now that we have fewer aus than necessary to compute with minimum delay.
To compute the dot product we adopt a two-phase scheduling algorithm. The first phase is the assignment of a priority number to each task. Priorities are in decreasing order. The second phase schedules the tasks according to their priority number and the number of available aus. As there are two types of aus, adders and multipliers, the scheduling can be performed independently and in parallel for each type. We present this scheduling algorithm in more detail in the section on the scheduling algorithm and its simulation.
Let us first present the host computer on which the scheduling algorithm and the dot product were performed.
Parallel simulation and the host machine
We use discrete-event parallel simulation in our work. In the discrete approach to system simulation, state changes are represented by a series of discrete changes, or events, at specific instants of time. In our case the events are the input and output of digits of the aus.
The MasPar MP, the host computer of the simulation, is a SIMD massively parallel computer. In the MP all processors change state in a simple, predictable fashion. Parallelism in the MasPar is achieved by executing a single operation simultaneously across a large set of data. In the MP it is easy to determine the program state because all processes are either active or inactive, and the full synchronization guarantees that the value each processor retrieves is correct.
The processors (PEs) are interconnected by an xnet toroidal neighborhood mesh and a global multistage crossbar router network. The programming language used is MPL, a superset of C that includes commands for the data-parallel programming mode.
The key idea for simulating several aus on the MP is to map several au processes onto several PEs. It is possible to map several aus of the same or different types to each PE, but all the processors would simultaneously simulate the same type of operator, since the MP is a SIMD computer.
Description of the performed simulation
We have performed both the scheduling algorithm and the computations of the dot product using the parallel facilities of the MP.
The simulation can be viewed as a finite succession of two different steps: computation and communication. In fact, due to the data-parallel programming model of the MasPar, problems of synchronization between the different arithmetic units are easily solved. The computations are performed in one type of operator at a time. Static or dynamic scheduling may be applied to our problem. We use dynamic scheduling. As the digit on-line delay of the adders and multipliers is fixed, static scheduling seems more natural at first. But in digit on-line mode of computation there are some arithmetic operations, such as division, that cannot be computed with a constant digit on-line delay, and consequently static scheduling cannot be used for them. However, in both cases of scheduling the results will be identical.
Some other features of the simulation are:
• The event list is partitioned, or distributed, on the PEs. In fact each PE has a variable called priority that contains its priority relative to the other tasks of the same type.
• There is a global counter for counting the number of cycles used to perform the computations, and local counters to describe the state of the operator. The local counters are used to control the computational progress on the node they belong to. The global and local counters always progress forward.
• The time is advanced according to the production of the next event. That is, after one step of computation and communication the time is incremented by one unit.
• Using the data-parallel paradigm, it is guaranteed that the simulated computation time of each node that produces an output digit is less than the virtual, or simulated, receive time of the node that consumes the output digit.
Simulation of the fully digit on-line floating-point operators
Each node of the simulated DFG performs its discrete-event simulation by repeatedly processing the inputs, performing some computation and outputting its results. In our simulation a BS digit is represented by two bits. The floating-point BS format chosen may have from  to  digits for the mantissa and from  to  digits for the exponent. Control of each arithmetic unit process is assumed by a status variable. The process works like a global automaton which controls local ones (maximum, overflow detector, pseudo-normalizer, etc.) and circuits (serial adder, incrementer, etc.).
Mappings
It is necessary to map the tree and the linear array of operators on the mesh of the MP. The mappings of the tree and of the linear array of processors (DFGs) for a  -multiplication dot product are shown in figures 6 and 7 respectively.
Figure 6: Mapping of a  -product tree on a mesh
Figure 7: Mapping of a  -product linear array on a mesh
The simulation of the interconnection network
From figures 6 and 7 we see that the communication distances are short. In order to take advantage of this, the networks are simulated using the static mappings.
With this mode of simulation we activate a number of operators less than or equal to the number of available operators. The result is that the communication distances are kept short and the computation is performed fast.
Note also that, at this level of abstraction, the simulations of the linear array of operators and of the ring are equivalent.
The scheduling algorithm and its simulation
From now on the terms task and node are used as synonyms. A flag is used to indicate when an operator has been scheduled. Counter C stores the number of iterations of the algorithm. An information table contains the beginning and ending times of computation for each operator in the computation (tbeg, tend). If the two predecessors of a node have produced valid digits, then we say that the node is ready.
We present the scheduling algorithm applied to the CBT. The case of the LA of operators is similar. The algorithm can be stated as follows:
1. Assign priorities to the nodes that represent the additions, from one side of the CBT to the other, beginning at the first level of adders and ending at the level of the root. The node with  priority has the highest priority and the node n   the lowest one. In a similar way, assign priorities to the nodes that represent the multiplications.
2. Set counter C to zero.
3. As long as there are nodes to be scheduled, do the following:
   (a) For each type of task, determine the number of ready nodes. Schedule the maximum number of ready tasks according to the number of available operators of that type.
   (b) Set the beginning time tbeg of the arithmetic units selected in the previous step to the value of C. Compute tend for each of these nodes too.
   (c) Wait for the computations of the cycle to be performed.
   (d) Return to the group of available operators those whose interval of computation has expired.
   (e) Increment C.
4. End.
As one of the inputs of an operator may be delayed in relation to the other, a synchronization must be provided. Latches are used to delay the input that is ready first.
The scheduling algorithm was implemented in MPL using the parallel facilities of the MP. The scheduling was performed for one type of operator at a time, but using data-parallel statements.
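The steps of the algorithm can be sketched sequentially in Python for a complete binary tree. This is an illustrative model, not the MPL program: the function name, the heap numbering of the nodes, the delay values (3, 2) and the unit communication delay are assumptions, as are the two timing rules used here (an adder is ready once the first output digit of both predecessors has arrived, and an operator stays busy for δ + L cycles).

```python
def schedule_cbt(n, n_mul, n_add, L, d_mul=3, d_add=2, comm=1):
    """Cycle-stepped priority scheduling of an n-product binary tree.

    Nodes use heap numbering: 1..n-1 are adders (1 is the root),
    n..2n-1 are multipliers. Returns (tbeg, tend, total_cycles).
    """
    assert n >= 2 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    delay = {"add": d_add, "mul": d_mul}

    def kind(v):
        return "add" if v < n else "mul"

    # Step 1: priorities. Adders go level by level from the first level
    # of adders up to the root, each level from one side to the other.
    pending = {
        "mul": list(range(n, 2 * n)),
        "add": sorted(range(1, n), key=lambda v: (-v.bit_length(), v)),
    }
    free = {"mul": n_mul, "add": n_add}
    tbeg, tend = {}, {}

    def ready(v, c):
        if v >= n:                       # multipliers read primary inputs
            return True
        return all(p in tbeg and c >= tbeg[p] + delay[kind(p)] + comm
                   for p in (2 * v, 2 * v + 1))

    c = 0                                # step 2
    while pending["mul"] or pending["add"]:          # step 3
        for k in ("mul", "add"):
            chosen = [v for v in pending[k] if ready(v, c)][:free[k]]  # (a)
            for v in chosen:
                tbeg[v] = c                                            # (b)
                tend[v] = c + delay[k] + L - 1
                pending[k].remove(v)
            free[k] -= len(chosen)
        for v, t in tend.items():        # (d) release expired operators
            if t == c:
                free[kind(v)] += 1
        c += 1                           # (e)
    return tbeg, tend, max(tend.values()) + 1


print(schedule_cbt(4, n_mul=4, n_add=3, L=8)[2])   # 17
print(schedule_cbt(4, n_mul=1, n_add=1, L=8)[2])   # 57
```

With enough operators the four products and three additions overlap almost completely (17 cycles for 8-digit operands under these assumptions); with a single multiplier and a single adder the operations are serialized. The sketch does not model the latches needed to hold digits that arrive before their consumer is scheduled.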
Performance of the techniques and networks
In order to compare the networks we adopt the following measures of performance:
• Number of cycles: the number of cycles necessary to perform the computations. In fact, the number of cycles is equal to the digit on-line delay plus the length of the operands.
• The speed-up of computing with n operators of each type is defined as the ratio of the number of cycles necessary to compute with one operator of each type to the number of cycles necessary to compute with n operators of each type.
• Efficiency is the ratio of the speed-up to the number of operators used.
Finally, traces show the utilization of the different operators over time.
Footnote (valid digits): first outputs different from  .
Number of cycles to perform computations
Figures 8 and 9 show that, when the number of multiplications of the dot product is fixed, the cumulative multiplications technique performed on the ring has a performance comparable to that of the binary tree. When the number of available operators begins to increase, the performance of the binary tree, as can be expected, becomes better.
Figure 8: Number of cycles needed to perform a  -element dot product with L =   (cycles versus number of arithmetic units; curves: tree, ring)
Figure 9: Number of cycles needed to perform a  -element dot product with L =   (cycles versus number of arithmetic units; curves: tree, ring)
  Speedup
The gures  and  show the speedup obtained for the tree and the ring respectively  In the tree
the speedup is better when the number of aus is a power of   For the ring the speedup reaches a
maximum value relatively fast 






      
S
p
e
e
d

u
p
Number of arithmetic units
L   






 
    
L  
L   



 
 
 
  
    

Figure  Speedup on the tree for a element dot product







      
S
p
e
e
d

u
p
Number of arithmetic units
L   




        
L  
L   












        
Figure  Speedup on the ring for a element dot product
Q
 E	ciency
In the tree g   the eciency is better when the number of aus is a power of   The eciency
for the ring is self explanatory see g  







      
Figure: Efficiency on the tree for a dot product with a fixed number of elements (x-axis: number of arithmetic units; y-axis: efficiency; one curve per value of L).
Figure: Efficiency on the ring for a dot product with a fixed number of elements (x-axis: number of arithmetic units; y-axis: efficiency; one curve per value of L).
Traces of adders and multipliers
From the trace of adders on the tree, we see that the peak number of adders in use is reached a
number of times equal to the ratio of n, the dimension of the dot product, to the number of available
aus. For the ring (see the two ring traces below), the maximum number of multipliers and adders is
reached quickly and remains practically constant until the end of the computation.
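One simple way to check this kind of observation on a recorded trace is to count how many separate times the peak is reached. The sketch below does so for a trace given as a list of operators in use per cycle. The function name and both sample traces are hypothetical illustrations, not data from the simulations reported here.

```python
from itertools import groupby


def count_peak_plateaus(trace):
    """Return (peak, times): the maximum of the trace and the number of
    separate runs of consecutive cycles during which that maximum is held."""
    if not trace:
        raise ValueError("trace must not be empty")
    peak = max(trace)
    # groupby collapses consecutive equal values into a single run.
    times = sum(1 for value, _run in groupby(trace) if value == peak)
    return peak, times


# Hypothetical tree-like trace: the peak is reached in two separate bursts.
tree_like = [0, 2, 4, 4, 2, 1, 2, 4, 4, 4, 2, 0]
# Hypothetical ring-like trace: the peak is reached once and then held.
ring_like = [0, 3, 5, 5, 5, 5, 3, 0]

print(count_peak_plateaus(tree_like))
print(count_peak_plateaus(ring_like))
```

Counting runs rather than individual cycles keeps a long plateau, as seen on the ring, from being counted as many peaks.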








       
Figure: Traces of utilization of multipliers on the tree for a dot product with a fixed number of elements and a fixed L (x-axis: cycles; y-axis: operators used; one curve per number of aus).
Figure: Traces of utilization of adders on the tree for a dot product with a fixed number of elements and a fixed L (x-axis: cycles; y-axis: operators used; one curve per number of aus).
Figure: Traces of utilization of multipliers on the ring for a dot product with a fixed number of elements and a fixed L (x-axis: cycles, 0 to 1200; y-axis: operators used, 0 to 20; curves for 32, 64 and 128 aus).
Figure: Traces of utilization of adders on the ring for a dot product with a fixed number of elements and a fixed L (x-axis: cycles, 0 to 1200; y-axis: operators used, 0 to 18; curves for 32, 64 and 128 aus).

Concluding remarks and future work
We have described a heterogeneous computer made up of digit on-line adders and multipliers
working on the dot product problem, together with the simulation of this machine on a massively
parallel computer, the MasPar MP-1.
The main conclusion is that, owing to the natural digit-level pipelining of the digit on-line mode,
linear arrays perform very nearly as well as binary trees when the dimension of the problem is large
compared to the number of arithmetic operators. This phenomenon is amplified by the fact that, when
working at the digit level, the dimension of the problem is the product of the number of inputs
and the length of the numbers. Another interesting fact is that a ring with a reasonable number of
digit on-line arithmetic units can perform high-precision computations on large numbers.
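The first observation can be illustrated with a deliberately crude cost model, which is an assumption of this sketch and not the discrete-event simulator used in this report. It charges each topology a pipeline fill latency (proportional to the number of units for the ring, to the logarithm of that number for the tree) plus a streaming term equal to the digit-level problem size divided by the number of units. The function name, the per-operator delay of 3 cycles and the problem sizes are hypothetical.

```python
import math


def toy_cycles(n_inputs: int, digits: int, units: int, delay: int, topology: str) -> int:
    """Crude estimate: pipeline fill latency plus digit-level streaming time."""
    if min(n_inputs, digits, units) < 1 or delay < 0:
        raise ValueError("sizes must be positive and delay non-negative")
    # Digit-level problem size (inputs x digits) shared among the units.
    stream = math.ceil(n_inputs * digits / units)
    if topology == "ring":
        fill = units * delay                        # results cross every unit in turn
    elif topology == "tree":
        fill = math.ceil(math.log2(units)) * delay  # results cross one unit per level
    else:
        raise ValueError("topology must be 'ring' or 'tree'")
    return fill + stream


# Hypothetical problem: 1024 inputs of 64 digits, operator delay of 3 cycles.
for units in (8, 64, 512):
    ring = toy_cycles(1024, 64, units, 3, "ring")
    tree = toy_cycles(1024, 64, units, 3, "tree")
    print(f"{units:>4} units  ring={ring:>5}  tree={tree:>5}  ratio={ring / tree:.2f}")
```

In this model the two topologies are nearly indistinguishable while the streaming term dominates, and the tree pulls ahead once the fill latency is no longer negligible compared with the streaming term, which is qualitatively the behaviour described above.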
Other numerical computations are under study, including polynomial evaluation and Gaussian
elimination for solving systems of linear equations.
We are working on a project to simulate and build a digit on-line machine called CARESSE, the
French acronym for Serial Redundant Scientific Computer, which will be made of heterogeneous digit
on-line arithmetic units. A VLSI prototype of the multiplier has been designed and tested.
