Some issues on CARESSE, a new heterogeneous fine grain parallel-pipelined architecture by Fiallos-Aguilar, Mario & Duprat, Jean
HAL Id: hal-02101962
https://hal-lara.archives-ouvertes.fr/hal-02101962
Submitted on 17 Apr 2019
Some issues on CARESSE, a new heterogeneous fine
grain parallel-pipelined architecture
Mario Fiallos-Aguilar, Jean Duprat
To cite this version:
Mario Fiallos-Aguilar, Jean Duprat. Some issues on CARESSE, a new heterogeneous fine grain
parallel-pipelined architecture. [Research Report] LIP RR-1994-03, Laboratoire de l’informatique du
parallélisme. 1994, 2+26p. ￿hal-02101962￿
LIP
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
Some issues on CARESSE  a new
heterogeneous ne grain
parallelpipelined architecture
Mario Fiallos Aguilar
Jean Duprat
janvier  
Research Report No 
Ecole Normale Supérieure de Lyon
46, Allée d’Italie, 69364 Lyon Cedex 07, France,
Téléphone : + 33 72 72 80 00; Télécopieur : + 33 72 72 80 80;
Adresses électroniques : 
lip@frensl61.bitnet;                    lip@lip.ens−lyon.fr (uucp).
Some issues on CARESSE  a new heterogeneous ne grain
parallelpipelined architecture
Mario Fiallos Aguilar
Jean Duprat
janvier  
Abstract
Here, we deal with a new fine grain parallel-pipelined architecture made up of heterogeneous digit on-line arithmetic units (AUs). We present some main issues of such an architecture, including the model of computation, its digit-serial AUs, new scheduling heuristics and examples of linear algebra computations. Using parallel discrete-event simulations and computation visualization on a massively parallel computer, we present some measures of its performance.
Keywords: fine grain parallelism, heterogeneous processing, digit on-line computation
Resume
Dans ce rapport  nous nous interessons aux architectures paralleles pipeline a grain n for
mees dunites arithmetiques heterogenes Nous presentons quelques resultats importants
pour de telles architectures dont le modele de calcul  les unites arithmetiques calculant
en serie au niveau du chi	re  de nouvelles heuristiques dordonnancement et des exemples
de calcul tires de lalgebre lineaire En utilisant la simulations par evenements discrets
paralleles et la visualisation de la trace du calcul e	ectue sur une machine massivement
parallele  nous presentons quelques mesures de performance de ces architectures
Motscles  parallelisme a ganularite ne  calcul heterogene  calcul enligne
Some issues on CARESSE  a new heterogeneous
ne grain parallelpipelined architecture  
Mario Fiallos Aguilar y and Jean Duprat
Laboratoire de lInformatique du Parallelisme LIP	
Ecole Normale Superieure de Lyon

 Allee dItalie 

 Lyon Cedex  France
malloslipenslyonfr
R esum e
Here  we deal with a new ne grain parallelpipelined architecture made up of heterogeneous
digit on line arithmetic units AUs We present some main issues of such an architecture  in
cluding the model of computation  its digitserial AUs  new scheduling heuristics and examples
of linear algebra computations Using parallel discreteevent simulations and computation vi
sualization on a massively parallel computer  we present some measures of its performance
Introduction
It is well known that, in arithmetic algorithms that approximate real numbers by floating-point representations, inaccurate calculations and representations can lead to completely wrong results.
These errors are produced by cancellation and truncation of the floating-point numbers. A computer that allows the size of operands and results to be large enough to meet the accuracy requirements potentially resolves these problems.
However, as high accuracy is achieved using very-long-precision arithmetic, the representation of numbers needs a lot of bits, typically thousands. It is more practical to carry all these bits serially than in parallel.
In the digit on-line mode of computation, the operands and the results flow through the arithmetic operators or units (AUs) serially, digit by digit, starting with the most significant, allowing digit-level pipelining.
This paper deals with some issues of an architecture made up of heterogeneous digit on-line arithmetic units, called CARESSE, the French abbreviation of Serial Redundant Scientific Computer. We briefly present the AUs used in the architecture and, in some detail, the divider. These AUs are suitable for VLSI implementation. Two different scheduling heuristics allow the computation of numerically intensive applications with a limited number of AUs. In this paper we apply these heuristics to Gaussian elimination. To study the performance of CARESSE, we use parallel discrete-event simulations and computation visualization techniques on a MasPar MP.
* This work is part of a project called CARESSE, which is partially supported by the "PRC Architectures Nouvelles de Machines" of the French Ministère de la Recherche et de la Technologie and the Centre National de la Recherche Scientifique.
† Supported by CNPq and Universidade Federal do Ceará, Brazil.

Background
As stated above, in digit on-line computation the operands and the results flow between arithmetic units serially, most significant digit first (MSD).
This flow needs the use of a redundant number system. In such systems, addition is carry free and can be performed in parallel, or in any serial mode. The most usual arithmetic operations can be calculated in MSD mode too. Digit on-line arithmetic is the combination of MSD mode and a redundant number system.
An interesting implementation of a radix-2 carry-free redundant system is the Borrow Save notation, BS for short. In BS, the $i$th digit $x_i$ of a number $x$ is represented by two bits $x_i^+$ and $x_i^-$ with $x_i = x_i^+ - x_i^-$. Then 0 has two representations, $(0,0)$ and $(1,1)$. The digit 1 is represented by $(1,0)$ and the digit $\bar{1}$ (that is, $-1$) is represented by $(0,1)$. Using the BS number system, the addition can be computed without carry propagation. Figure 1 shows some elementary fixed-point BS circuits.
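As a minimal sketch of the BS encoding described above, the following Python fragment maps signed digits to bit pairs and evaluates a fractional BS number. The function names (`encode`, `bs_value`, `negate`) are illustrative assumptions, not taken from the report, and the choice of `(0, 0)` when encoding a zero digit is arbitrary since `(1, 1)` has the same value.

```python
from fractions import Fraction

# A BS digit is a pair (plus_bit, minus_bit); its value is plus_bit - minus_bit.
ENCODING = {1: (1, 0), 0: (0, 0), -1: (0, 1)}  # (1, 1) is the other encoding of 0


def encode(signed_digits):
    """Encode a list of digits in {-1, 0, 1} (most significant first) as BS pairs."""
    return [ENCODING[d] for d in signed_digits]


def bs_value(pairs):
    """Value of the fractional number 0.d1 d2 ... dn given as BS pairs."""
    return sum(Fraction(plus - minus, 2 ** (i + 1))
               for i, (plus, minus) in enumerate(pairs))


def negate(pairs):
    """Opposite of a BS number: exchange the two bits of every digit."""
    return [(minus, plus) for plus, minus in pairs]


x = encode([1, 0, -1])          # 1/2 - 1/8
print(bs_value(x))              # 3/8
print(bs_value(negate(x)))      # -3/8
```

The redundancy is visible in that `encode([1, -1])` and `encode([0, 1])` both evaluate to 1/4: several digit strings represent the same number, which is what makes carry-free addition possible.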
The digit on line AUs are characterized by their delay  that is the number   such that p digits of the
 
 
 
  
 
 
 


   
z  x   if inc   for all input
z  x   if inc   for all input
digits but the last two
digits but the last one
z  x if inc   for all input digits
z  x   k where k is a
binary positive number
latch
z

i
z
 
i
k
x
 
i
x

i


 
 

z

i 
z
 
i 
y

i
y
 
i
x

i
x 
i
a
b
c t
u
a b c
t u
represents

 

 
latch
z

i
z
 
i
inc
x
 
i
x

i
circuit to compute the subtraction  x   y
The PPM  cell
 sub  
incrementer inc  
circuit to compute z  subk  
Fig    Some elementary xed point BS circuits
result are deduced from p   digits of the input operands When successive digit on line operations
are performed in digit pipelined mode  the resulting delay will be the sum of the individual delays
of operations and communications  and the computation of large numerical jobs can be executed in
an ecient manner We will assume that any communication has a delay of  See gure 
As we can see from gure   the computations in digit on line mode can be described as a data
a
c
d
b
ith digit of a  b   c  d
 
 
 
i   th digits of the products
i   th digits of the inputs
   
   
Fig    Digit level pipelining in digit on line arithmetic
dependence graph or dataow graph  DFG These graphs consist of nodes  which indicate operations
executed on arithmetic units  and edges from one node to another node  which indicate the 
ow of
data between them A nodal operation can be executed only when the required information  a digit
from all the input edges is received Typically a nodal operation requires one or two operands and
produces one result Once that the node has been activated and the computations related to the
input digits inside the arithmetic unit performed ie the node has red  the output digit is sent
to the destination nodes This process is repeated until all nodes have been activated and the nal
result obtained Of course  more than one node can be red simultaneously
 Floatingpoint number format and pseudonormalization
A BS 
oatingpoint number X with n digits of mantissa and p digits of exponent is represented
by X  mxex  where mx 
Pn
imxi
 i and ex 
Pp 
i exi
i In our system the exponents
and the mantissas circulate in digit on line mode  exponent rst See gure  In classical binary
circulation of digits
exp       ex mx      mxn
Fig    The BS oating point format

oatingpoint representation  a number is said normalized if its mantissa belongs to   or  
Normalization of numbers leads to more accurate representations and consequently better results
In BS representation  to check if a number is normalized or not sometimes needs the examination
of all its digits For this reason  we adopt the concept of pseudonormalized numbers A number is
said pseudonormalized if its mantissa belongs to   or   It is easier and faster to ensure
that a number is pseudonormalized it suces to forbid a mantissa beginning by      or 
This pseudonormalization is performed in two steps
1. A four-state automaton examines two consecutive digits and transforms the couples $1\bar{1}$ and $\bar{1}1$ into $01$ and $0\bar{1}$ respectively, leaving the other couples unchanged. We call this operation an atomic pseudo-normalization. This automaton is shown in figure 4.
2. The second step consists in counting the zeroes generated by the previous computation and subtracting the same quantity from the exponent.
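The two steps can be sketched on a list of signed digits. This is only an illustration under stated assumptions: the rewritten couples are taken to be 1,-1 -> 0,1 and -1,1 -> 0,-1 (value-preserving rewrites that create a leading zero), each removed leading zero decreases the exponent by one, and the whole mantissa is held in a list rather than processed serially by an automaton. The name `pseudo_normalize` is hypothetical.

```python
from fractions import Fraction


def mantissa_value(digits):
    """Value of 0.d1 d2 ... dn with digits in {-1, 0, 1}, most significant first."""
    return sum(Fraction(d, 2 ** (i + 1)) for i, d in enumerate(digits))


def pseudo_normalize(digits, exponent):
    """Return (digits, exponent) representing the same number, with a mantissa
    that begins neither with 0 nor with the couples (1, -1) or (-1, 1).
    An all-zero mantissa cannot be pseudo-normalized and comes back as zeros."""
    d = list(digits)
    shifts = 0
    while shifts < len(d):
        # Atomic pseudo-normalization of the two leading digits (assumed rewrites).
        if len(d) >= 2 and (d[0], d[1]) == (1, -1):
            d[0], d[1] = 0, 1
        elif len(d) >= 2 and (d[0], d[1]) == (-1, 1):
            d[0], d[1] = 0, -1
        if d[0] != 0:
            break
        # Drop the leading zero: the mantissa doubles, the exponent decreases.
        d = d[1:] + [0]
        shifts += 1
    return d, exponent - shifts


m, e = pseudo_normalize([1, -1, -1, 1], 0)
print(m, e)  # [1, 1, 0, 0] -2
```

In the example the value is unchanged: 3/16 equals 3/4 times 2 to the power -2, and the new mantissa 3/4 lies in the pseudo-normalized range.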
x x
x

  

Initial




state
 
Fig    The automaton of the pseudo normalizer
The divider could have a smaller delay if the divisor is guaranteed to be pseudo-normalized. In this case the output of all arithmetic operators (adders, multipliers, dividers, etc.) must be pseudo-normalized.
But, as our main goal is to perform computations in digit-level pipelined mode, it is preferable to pseudo-normalize the inputs of the divider internally.
Note that the first solution makes the subtraction a variable-delay operation. The second one makes the divider more complex, but allows the adders to have a fixed digit on-line delay. We adopt this last solution.
The arithmetic units
The adder and the multiplier, together with their delays, have been extensively discussed elsewhere. We will present here only a new digit on-line floating-point divider.
One important characteristic of these AUs is that one of the two bit pairs representing the digit 0 is used to transmit a 0, and the other for a non-significant transmission, so that the synchronization is ensured automatically. In fact, an arithmetic unit must wait until its first two inputs are different from the non-significant pair before beginning its computation. This synchronization of the operands is performed by an interconnection network.
  The digit online division algorithm
We want to compute Q  XY with X  mxex  Y  myey   Q  mqeq and
  my  
jmxj  my
We will see in the next sections how to deal with the cases where this condition on $mx$ and $my$ does not hold and where the divisor mantissa is negative. The algorithm can be stated as follows.
Algorithm 1. Digit on-line division algorithm
Step 1: Exponent computation
Compute the subtraction of the exponents, except the last two digits: $eq_{p-1}, \ldots, eq_2$.
Step  Mantissa shifting and exponent computation
 MY

 
P
imyi
 i
 A

 
P
imxi
 i
 if MY

   then MY  MY

 else MY MY

 
 if jA

 j  MY    then A

  A

 else A

  A

 
 if A

  A

 then increment eq and compute eq
 if jA

j  MY    then A  A

 else A  A


 if A  A

 or MY  MY

 then increment eq and compute eq
Step  Mantissa computation
 For j   to n   do
 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
 if MY

   then  MYj   MYj myj 
 j 
 Aj   Aj mxj 
   mqj MYj   Qjmyj 
 
else  MYj  MYj myj 
 j 
 Aj   Aj mxj 
   mqj MYj   Qjmyj 
 
 Qj   Qj mqj 
 j 
Proof of correctness
It is obvious that the computation of the result exponent is correct. On the other hand, for the shifting and the computation of the mantissas the situation is more complex. Let us explain this.
Mantissa shifting
We show why it may be necessary to shift A

 and A

 one time each
According to the algorithm it must be guaranteed that jmxj  my Then  as the shift must be
performed with only  digits of each mantissa  we may have either of the following situations
 IfMY

   
jA

 j
MY
  
 
and  mx
my
may be equal to  
 
 A shift is necessary But as
jA

j
MY
  
 
another shift is necessary and then  jAj
MY
  
 
 With this  it is guaranteed
that jmxj  my 
 If MY

   then  MY

 is shifted of one position The worst case is
jA

 j
MY
  
 
 Then 
it is enough to shift A one position to guarantee that jmxj  my With this MY   
where  MY is the mantissa of the divider
Then  the exponent must be incremented by    or 
Mantissa computation
To perform the division correctly, the values of $mq_j$ chosen in step 3 of the algorithm must be compatible with Robertson's conditions. They are:
 if MXj   MY then mqj   
 if  MY MXj   then mqj    or mqj   
 if MXj   then mqj    or mqj    or mqj   
 if   MXj MY then mqj    or mqj   
 if MXj  MY then mqj   
The two following equations may be easily proved by induction
If MY

  
Aj  
j
 
j X
i
mxi
 i  
 jX
i
mqi
 i
j X
i
myi
 i
 
else if MY

  
Aj  
j
 
j X
i
mxi
 i  
 jX
i
mqi
 i
j X
i
myi
 i 
 
Aj can be expressed also as
Aj  
j
 
j X
i
mxi
 i  
 jX
i
mqi
 i

MYj
 
MYj is the shifted mantissa of the divisor at step j
We dene a sequence as 
MX  mx
MXj   MXj  mqj MY

We nd that
MXj  
j
 
nX
i
mxi
 i  
 jX
i
mqi
 i

MY
 
MXj   Aj  
j
 
nX
ij 
mxi
 i  
 jX
i
mqi
 i

MY  MYj
 
As
MYj 
 Pj 
i myi
 i ifMY

  Pj 
i myi
 i  ifMY

  


We have
jMXj   Ajj  
j
 
nX
ij 
 i 
 jX
i
 i

jMY  MYjj
 
As
jMY  MYjj 

 j ifMY

  
 j ifMY

  


Then
jMXj  Aj j       
According to step  of the algorithm
 if mqj    then  Aj   From equation  we nd that if Aj   then MXj  
Robertsons conditions  and  are satised
 Similarly  if mqj    then Aj   Then  MXj   Robertsons conditions  and 
are satised
 if mqj     then    Aj   From equation   we nd that   MXj  
and as jMY j    Robersons conditions    and  are satised
Hence  the algorithm computes the division correctly
However  the algorithm can be improved The sequence of tests
Test  Test of Aj
 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
needs the examination of all the digits of Aj ie  j   This examination involves a needless loss
of time the arithmetic operations on step  of the algorithm may be performed in parallel  without
carry propagation  using the BS number system Therefore  this sequence of tests is the most time
consuming part of the algorithm In order to avoid this drawback  we examine all the digits of Aj
between the most signicant one and the digit which has power   Namely  Aj 
P
i 
 iaji

Then  the test will be performed on Aj instead of Aj as following
Test  Test of Aj
 if Aj   then mqj   
else if Aj    then mqj    
else mqj   
The proof of the improved algorithm is similar to the previous one
We obtain the obvious relation
jAj  A

j j   
Then  according to the modied Step  of the algorithm
 if mqj    then Aj   From equation  we nd that if A

j   then Aj   and
from  we nd that MXj  
 Similarly  if mqj    then Aj   Then Aj   and MXj  
 if mqj     then   A

j   As A

j is a multiple of   we have   A

j 
 From equation  we nd   Aj   and  from equation   we nd that
 MXj  
	 Pseudonormalization
If the inputs to the 
oatingpoint divider are pseudonormalized then its output is also pseudo
normalized Let us prove that
 If MY

   then  the worst case is
jXj
Y
  
  


and the quotient is pseudo
normalized
 If MY

   then the worst case is
jXj
Y
  
 
 

and the quotient is pseudo
normalized
Architecture of the divider
The floating-point divider consists of several blocks (figure 5):
- A serial circuit to compute the difference between the exponents.
- A serial incrementer to increase the exponent.
- A serial automaton that computes the absolute value of Y.
- A serial overflow detector.
- A pseudo-normalizer, which ensures that the divisor Y is pseudo-normalized.
- A serial shifter-synchronizer for the mantissas.
- A serial divider for the mantissas.
Fig. 5. The on-line floating-point divider.
Fig. 6. The absolute value automaton.
The rst two computations are performed with the circuits of gure 
The automaton that computes the absolute value of Y is shown in gure  The sign inverter changes
the sign of the mantissa of the result if the state of the maximum value automaton is 
The detection of the over
ow is done at the output of the incrementer A small automaton tries
to nd a representation of the exponent so that the carry digit is equal to  in order to keep the
pdigit exponent of the format Figure  shows this automaton
xx
	
x
x




 
xx
xx



 


x

  are overflow states
xx

Fig    The overow detector automaton
The shiftersynchronizer guarantees that if shifts have been performed  then the exponent is aug
mented and otherwise the exponent remains unchanged We will explain with more detail the pseudo
normalizer  the shiftersynchronizer and the serial divider
  by now let us assume that Aj can be represented as a  digits expression
 Pseudonormalizer
The pseudonormalizer is shown in gure  The automaton is shown in gure  A binary counter
stores the number by which the exponent must be decreased A zero tester is used to avoid the delay
of the serial circuit when the subtraction of the exponents is not performed The over
ow detector
is similar to the one shown in gure  The delay of the pseudonormalizer  pno is variable and
depends on the degree of pseudonormalization of the operands If le is the number of digits of the
exponent and lbs the number of digits to represent the 
oatingpoint number  then
le     pno  lbs  
The delay of the normalizer may be  in the worst case  as great as the length of the number repre
sentation plus  On the other hand  if the input operand is already pseudonormalized   pno has its
minimum value Figure  shows an example
If the zero tester is not used a simplied design is obtained  but the minimum value of the delay will
be incremented by  The serial subtraction can also be replaced by its parallel version
Fig. 8. The pseudo-normalizer.
Fig. 9. Example of the internal synchronization on the pseudo-normalizer.
 Shifting the mantissas
The circuit performs the comparisons of the mantissas The comparison on MY

 with  is perfor
med before the comparison with A

  A second comparison delays mx for  or  cycles if necessary
The digits of $mx$ are not lost, but are delayed. It is assumed that these operations can be performed in one cycle.
Fig. 10. The circuit for shifting the mantissas.
 The serial divider
The serial divider is shown in gure  The upper part computes the term mqj MYj  while 
the lower part computes Qjmj  The BS fourinput parallel adder computes the term Aj The
fourinput parallel adder is made from three input BS parallel adders A input parallel adder
is proposed in  The format control is very simple and requires only the test of the digit with
power  If the value of this digit is di	erent from zero  then the digit with power  is inverted
remember  jAjj   This technique was originally proposed by Kla 
 Let Z  zn   zzz z k  NzzK such that jZj  
if z    Z  zK else Z  zK
Internal synchronization of the floating-point divider
As we can see from figure 12, the decision whether to increment the exponent can be taken when the last two digits go through the incrementer. As the last two digits of the exponent are emerging, the first five digits of the mantissas are available, and it is then possible to adjust the exponent of the result. Using the synchronization diagrams of the pseudo-normalizer and of the divider, we obtain the interval of values of the digit on-line delay of the floating-point divider, $\delta_{div}$:
le     div  lbs   
Note that if the inputs were guaranteed to be pseudonormalized  the delay of the divider will be 
A multipipeline network of heterogeneous AUs
Using the AUs described above (adder, multiplier, divider and opposers), we can perform parallel-pipelined, numerically intensive computations.
We suppose that there are two different types of AUs, namely the constant-delay ones (multipliers, adders, opposers, etc.) and the variable-delay ones (dividers, square rooters, etc.). Of course, these operators may have different delays and their number is limited. The interconnections between these operators transmit only one BS digit at a time, not all the digits.
All the operators are synchronized with the one that has the longest period of computation. In fact, this period will be used as the time unit. We will also suppose that the communication cost is unitary.
An AU may be reused when its last computation has ended. That means that the interconnection network must be reconfigurable during the computation.
Parallelism is combined with multipipelining when several operators begin to compute simultaneously. As the AUs have different digit on-line delays, it is necessary to synchronize the digits of their input operands in such a way that the digits entering an operator have the same power. In our network this is achieved using variable-length registers.

Fig. 11. The serial divider.
Then  some important characteristic of such an architecture are
 It is possible to multipipeline the AUs and at the same time to compute in parallel
 The AUs work in digitserial mode and are heterogeneous in the type of computation that they
perform  in their delay and time of computation
 The operators are synchronized to the slowest one
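The need for variable-length registers can be illustrated with a small sketch that computes, for each operator of a dataflow graph, the cycle at which its first output digit appears and the register length needed on each incoming edge so that digits of the same power meet. The delay values in the example are invented for illustration, and `align_operands` is a hypothetical name, not part of the report.

```python
def align_operands(delays, preds, comm=1):
    """delays: node -> on-line delay; preds: node -> list of predecessor nodes.
    Returns (first_out, registers): first_out[node] is the cycle of the first
    output digit, registers[(p, node)] the extra delay inserted on edge p -> node."""
    first_out = {}
    registers = {}

    def visit(node):
        if node in first_out:
            return first_out[node]
        # Cycle at which the first digit of each operand reaches this node.
        arrivals = {p: visit(p) + comm for p in preds.get(node, [])}
        start = max(arrivals.values(), default=0)
        for p, t in arrivals.items():
            registers[(p, node)] = start - t  # the early operand is delayed
        first_out[node] = start + delays[node]
        return first_out[node]

    for node in delays:
        visit(node)
    return first_out, registers


# Hypothetical delays: a multiplier (3) and a divider (7) feeding an adder (2).
delays = {"mul": 3, "div": 7, "add": 2}
preds = {"add": ["mul", "div"]}
first_out, registers = align_operands(delays, preds)
print(first_out)   # {'mul': 3, 'div': 7, 'add': 10}
print(registers)   # {('mul', 'add'): 4, ('div', 'add'): 0}
```

Here the product must wait four cycles in a register so that its digits reach the adder together with the digits of the slower quotient.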
Such a machine is shown in figure 13. For this type of architecture we answer the following questions:
- How to perform the scheduling on this type of machine in order to compute with the minimum delay of evaluation?
- How can AUs be reused?
- What are the delay, speedup and efficiency of such a machine?
- How to generate traces to learn more about such an architecture?
- What are the differences in behavior between some scheduling strategies?
Fig. 12. The internal synchronization on the on-line floating-point divider.
We apply the scheduling heuristics to some numerically intensive computations such as Gaussian elimination.
The scheduling problem in the machine
As the number of AUs is fixed and the number of nodes may be relatively large, reusing the AUs is unavoidable. A scheduling strategy must be adopted.
The main problem in scheduling tasks on such a machine is due to the fact that incomplete results can be used as operands for successive operations. Level-based algorithms are not well adapted, because the level no longer represents the precedence constraints of the threads of the machine: the preceding nodes can deliver some digits of their outputs before ending their computation, and not all these digits at once as in parallel arithmetic architectures.
A scheduling strategy for this architecture must consider the digit on-line delay, the synchronization, the number of AUs, and the number of digits used to code the numbers. Static scheduling strategies are limited because they can be used only when the delays of the AUs are fixed. As we consider variable-delay AUs, we use dynamic scheduling.
Let us introduce some graph notation.
We represent the task graph as a DFG or DAG called $G$, where $G = (N, A)$, $N = \{1, \ldots, n\}$ is the set of nodes and $A$ is the set of directed arcs. Each node represents an arithmetic operation and the arcs are used to represent the dependencies. In particular, an arc $a_{ij} \in A$ indicates that the operation corresponding to node $j$ uses the results of the operation corresponding to node $i$. We say that node $i$ is a predecessor of node $j$, and that $j$ is a successor of $i$.
We define $S(i)$ as the set of all the successors of $i$. The in-degree of a node $i$ is defined as the number of predecessors of that node. The out-degree of node $i$ is defined as the number of its successors. Nodes with in-degree zero are called input nodes and those with out-degree zero are called output nodes. We define $I = \{i_1, \ldots, i_p\}$ as the set of input nodes and $O = \{o_1, \ldots, o_t\}$ as the set of output nodes, where both $p, t \ge 1$. A path $P = P(i, j)$ is a sequence of nodes $(\nu_1, \ldots, \nu_m)$, where $\nu_1 = i$ and $\nu_m = j$. The level of $i$ is denoted $l(i)$ and is defined as $\max(l(S(i))) + 1$, where $l(S(i))$ is the set of the levels of the successors of $i$; the recursion starts from the output nodes, which all share the same base level. We also define $D$ as the highest level of $G$.









Fig. 13. A heterogeneous fine grain parallel-pipelined network.
The lowest delay of computation
Let us show how it is possible to compute with the lowest delay. We assume that the number of AUs is unlimited.
We define the delay path of a node $i$, represented by $DP(i, o_k)$, as the sum of all the delays of computation and communication on the path beginning with $i$ and ending with $o_k$. Then, if we call $C = C(k, l)$ the communication delay of path $P(k, l)$ and $(\nu_1 = i, \nu_2, \ldots, \nu_m = o_k)$ the nodes of the path,
$$DP(i, o_k) = \delta_{o_k} + C(\nu_{m-1}, o_k) + \delta_{\nu_{m-1}} + C(\nu_{m-2}, \nu_{m-1}) + \delta_{\nu_{m-2}} + \cdots + C(i, \nu_2) + \delta_i .$$
Of course, several $DP$s may exist for the same node. Let us define $SDP(i)$ as the set of delay paths of node $i$ and $MP(i)$ as the maximum value of these $DP$s:
$$MP(i) = \max(SDP(i)).$$
If we denote $SMP(G) = \{MP(1), \ldots, MP(n)\}$ and $MMP(G) = \max(SMP(G))$, then to compute with the lowest delay, the beginning time of computation of task $i$, $tb_i$, is
$$tb_i = MMP(G) - MP(i).$$
We easily find the ending time of computation of each node, $tf_i = tb_i + \delta_i + lbs$. The number of cycles needed to compute with the minimum delay is then $\max(Stf(G))$, where $Stf$ is the set of the ending times of the graph.
The following algorithm computes $tb_i$ and $tf_i$ for all nodes on a level-by-level basis, beginning at the level of the outputs.
Algorithm  Lowest delay of computation algorithm
 for k   k  D
 for each i with li  k do MP i  maxSDP i
 MMP G  maxSMP G
	 for k   k  n
f
 tbi  MMP G   MP i
 tfi  tbi   i 
 lbs   
g
Where  lbs is the number of digits used to code the numbers
Of course  the delays of computation of the variable delay AUs are unknown before the computation
and it is practically impossible to know them in all cases and hence the computation with the
minimal delay is impossible However  values of these delays can be given by the user from the task
graph and his experience in the problem Closer are the values of this userhint delays  lower will be
the delay of computation
Then  in a rst instance  our purpose is to let the user to use his experience to initialize the values
of the delays But  we consider also the case where defaultvalues must be assumed In our case  the
default value is the lowest delay of operation of the AU see the values of the divider in equation 
for example
The beginning time of computation of each node can be used as the priority for each operation
Lowest is the tbi associated to a node higher its priority is The maximum of the delay paths can
be used also as the priorities of the nodes  avoiding one step of the algorithm  Let us use these
issues to present two scheduling heuristics
One of these heuristics is adaptable to the delay changes of the AUs  in the sense that the priorities
are changed dynamically according to delay changes Let us present the adaptive ones rst This
strategy executes the algorithm  as many times as delay changes occur The nonadaptive computes
only  time the algorithm  Let us see the strategies
If two predecessors of a node have produced valid digits   then we will say that the node is ready
We dene CL as the number of iterations of the algorithm The adaptive heuristic can be stated as
follows
Heuristic  Adaptive delay changes scheduling heuristic
 for all the nodes representing variable length operations set their delay according to the user 
hints if desired else set these delays to their minimum value
 compute all the tbis according to algorithm 
	 assigns priorities to the nodes according to their tbi
 as long as there are nodes to be scheduled do the following
a for each type of task determine the number of ready nodes  Schedule the maximum num 
ber of ready tasks according to the number of available operators of the type and their
priorities
b wait computations of the cycle to be performed
c return to the group of available AUs those whose interval of computation have expired
d increment CL
e if there are tasks that have a delay dierent from the ones in step  set these delays to
their real values and go to step 
Deleting item e of the heuristic   we obtain the nonadaptive ones
	  rst output of each operator di
erents from 
 moreover to be considered ready an input node must satised tbi  CL too
Task graphs for Gaussian elimination
We present here one mode of computation of Gaussian elimination that makes intensive use of
divisions. Our purpose was to study the behaviour of the machine when it is biased towards this
type of AU. To simplify our approach, no pivoting is used. We use this well-known method to solve
linear systems, denoted AX = b.
A first figure shows the conventions used in the task graphs, and the resulting task graphs for
the three system sizes considered are shown in the figures that follow it. Each operation is
performed by the three AUs introduced above. Additionally, an opposer is easy to design: it
suffices to exchange the two bits that encode each digit of the mantissa. Because the delays
depend on the data, an analytical method cannot predict the performance of the heuristics. We
therefore use parallel discrete-event simulation, whose main ideas are presented in the following
section.
[Fig.: Conventions used in the task graphs]
[Fig.: Task graph of a linear system (first task graph)]
[Fig.: Task graph of a linear system (second task graph)]
[Fig.: Task graph of a linear system (third task graph)]
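To make the kind of computation behind these task graphs concrete, the following Python sketch
solves AX = b by Gaussian elimination without pivoting and counts the arithmetic operations it
performs. It is only one possible arrangement, assumed here for illustration: each pivot row is
normalised by dividing every remaining entry by the pivot, which keeps the number of divisions
high. The function name, the use of exact fractions and the operation counters are assumptions,
and a subtraction is counted as one addition (on CARESSE it would presumably be an opposer
followed by an adder).

```python
from fractions import Fraction


def gauss_no_pivot(a, b):
    """Solve A x = b without pivoting; return the solution and operation counts."""
    n = len(a)
    # Augmented matrix [A | b] in exact arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    ops = {"div": 0, "mul": 0, "add": 0}
    for k in range(n):
        pivot = m[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot: this sketch does no pivoting")
        for j in range(k + 1, n + 1):        # normalise row k: one division per entry
            m[k][j] /= pivot
            ops["div"] += 1
        m[k][k] = Fraction(1)
        for i in range(k + 1, n):            # eliminate column k below the pivot
            factor = m[i][k]
            for j in range(k + 1, n + 1):
                m[i][j] -= factor * m[k][j]
                ops["mul"] += 1
                ops["add"] += 1
            m[i][k] = Fraction(0)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):             # back substitution, unit diagonal
        x[i] = m[i][n]
        for j in range(i + 1, n):
            x[i] -= m[i][j] * x[j]
            ops["mul"] += 1
            ops["add"] += 1
    return x, ops
```

Each counted operation corresponds to one node of a task graph of the kind shown above, and the
data dependencies between the matrix entries give its edges.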
Parallel simulation
We use synchronous parallel discrete-event simulation to study some issues of CARESSE. In our
case, the events are the inputs and outputs of digits of the heterogeneous AUs. The MasPar MP-1,
the host computer of the simulation, is a SIMD massively parallel computer. The key idea for
simulating several AUs on the MP-1 is to map several processes onto several PEs. It is possible
to map several AUs of the same or of different types onto each PE, but all the processors then
simultaneously simulate the same type of operator, since the MP-1 is a SIMD computer.
The simulation can be viewed as a finite succession of two different steps: computation and
communication. Thanks to the data-parallel programming model of the MasPar, synchronization
problems between the different arithmetic units are easily solved. The computations are performed
on one type of operator at a time. Some important features of the simulation are:
- The event list is partitioned, or distributed, over the PEs. Each PE has a variable called
  priority that contains its priority relative to the other tasks of the same type.
- There is a global counter that counts the number of cycles used to perform the computations,
  and there are local counters that describe the state of each operator. The local counters
  control the progress of the computation on the node they belong to. The global and local
  counters always progress forward.
- Time is advanced according to the production of the next event. After one step of computation
  and communication, the time is incremented by one unit.
- Using the data-parallel paradigm on a SIMD computer guarantees that the simulated computation
  time of each node (an AU that produces an output digit) is less than the virtual, or simulated,
  receive time of the node that consumes that digit.
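The alternation of computation and communication steps can be illustrated with a small sequential
Python sketch. It does not reproduce the MasPar implementation: the `Unit` class is a hypothetical
stand-in that models an on-line AU as a pure delay line, the units form a simple chain, and the
names `compute`, `communicate` and `simulate` are assumptions. The sketch only shows the role of
the global counter, of the local counters, and of processing one operator type at a time.

```python
from collections import deque


class Unit:
    """Hypothetical stand-in for an on-line AU, modelled as a pure delay line."""

    def __init__(self, kind, delay, source=None):
        self.kind = kind                    # operator type, e.g. "add" or "mul"
        self.source = source                # upstream Unit, or None for the input node
        self.pipe = deque([None] * delay)   # digits still inside the operator
        self.inbox = None                   # digit received in the last communication step
        self.outbox = None                  # digit produced in the last computation step
        self.local = 0                      # local counter: digits emitted so far

    def compute(self):
        # Consume the latched digit and emit the one that entered `delay` steps earlier.
        self.pipe.append(self.inbox)
        self.outbox = self.pipe.popleft()
        if self.outbox is not None:
            self.local += 1


def communicate(units, pending):
    # Every unit latches what its producer emitted in the computation step just finished.
    for u in units:
        if u.source is None:
            u.inbox = pending.popleft() if pending else None
        else:
            u.inbox = u.source.outbox


def simulate(units, digits):
    """Alternate computation and communication steps until all digits are out."""
    pending = deque(digits)
    kinds = sorted({u.kind for u in units})
    clock = 0                               # global cycle counter
    out = []
    communicate(units, pending)             # load the first input digit
    while len(out) < len(digits):
        for kind in kinds:                  # SIMD style: one operator type at a time
            for u in units:
                if u.kind == kind:
                    u.compute()
        clock += 1
        if units[-1].outbox is not None:
            out.append((clock, units[-1].outbox))
        communicate(units, pending)
    return out, clock
```

Because a digit emitted in one computation step is only latched by its consumer in the following
communication step, a consumer never processes a digit before its producer has emitted it, which
is the ordering property stated in the last item of the list.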
Each node of the simulated DFG performs its discrete-event simulation by repeatedly processing
its inputs, performing some computation and outputting its results. In our simulation, a BS digit
is represented by two bits. In the floating-point BS format chosen, both the number of mantissa
digits and the number of exponent digits may vary within a fixed range. Control of each AU
process is handled by a status variable. The process works like a global automaton that controls
local ones (maximum, overflow detector, pseudo-normalizer, etc.) and circuits (serial adder,
incrementer, etc.).
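The two-bit coding of a BS (borrow-save) digit can be sketched as follows. In this radix-2
notation a digit is the difference of a positive bit and a negative bit, so it takes its value in
{-1, 0, 1}. The function names, and the choice of an integer value with the most significant
digit first, are assumptions made for illustration only.

```python
def bs_value(digits):
    """Value of a radix-2 borrow-save number given as (plus, minus) bit pairs,
    most significant digit first. Each digit is worth plus - minus."""
    value = 0
    for plus, minus in digits:
        value = 2 * value + (plus - minus)
    return value


def bs_negate(digits):
    """Opposite of a borrow-save number: exchange the two bits of every digit."""
    return [(minus, plus) for plus, minus in digits]
```

The representation is redundant: `[(1, 0), (0, 1), (1, 1), (0, 0)]` and
`[(0, 0), (1, 0), (0, 0), (0, 0)]` both denote the value 4, and `bs_negate` maps either of them
to a representation of -4 without any carry propagation.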
Studying the performance of CARESSE
Understanding and explaining the computation of numerical algorithms on CARESSE is a complex
task, for which graphical visualizations are useful tools. Our simulator measures the following
performance parameters:
- The number of cycles needed to perform the computations.
- The speedup of the computation with n operators of each type, defined as the ratio of the
  number of cycles needed with one operator of each type to the number of cycles needed with n
  operators of each type.
- The efficiency, defined as the ratio of the speedup to the number of AUs used.
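The last two definitions translate directly into code. In the sketch below the function and
parameter names are assumptions, and the cycle counts used in the example are invented for
illustration; they are not results from the simulator.

```python
def speedup(cycles_one, cycles_n):
    """Cycles needed with one operator of each type divided by cycles needed with n."""
    return cycles_one / cycles_n


def efficiency(cycles_one, cycles_n, aus_used):
    """Speedup divided by the number of AUs used."""
    return speedup(cycles_one, cycles_n) / aus_used
```

With these invented figures, a computation that takes 2400 cycles with one operator of each type
and 600 cycles with 8 AUs has a speedup of 4 and an efficiency of 0.5.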
Moreover, statistics, or traces, show how the utilization of the different AUs evolves over time.
The following traces have been generated by the simulator to show the behaviour of CARESSE for
linear systems of two sizes. In order to test the differences between the two strategies, a large
number of delay changes have been introduced.
[Fig.: The number of cycles of two linear systems of the same size. Each panel plots the number
of cycles against the number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1,
64-digit format and 5 delay changes; case 1, 128-digit format and 5 delay changes; case 2,
64-digit format and 14 delay changes; case 2, 128-digit format and 14 delay changes.]
[Fig.: The number of cycles of three linear systems of the same size. Each panel plots the number
of cycles against the number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1,
64-digit and 128-digit formats with 31 delay changes; case 2, 64-digit and 128-digit formats with
30 delay changes; case 3, 64-digit and 128-digit formats with 23 delay changes.]
[Fig.: Speedup in two linear systems of the same size. Each panel plots the speedup against the
number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1, 64-digit and 128-digit
formats with 5 delay changes; case 2, 64-digit and 128-digit formats with 14 delay changes.]
[Fig.: Speedup in three linear systems of the same size. Each panel plots the speedup against the
number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1, 64-digit and 128-digit
formats with 31 delay changes; case 2, 64-digit and 128-digit formats with 30 delay changes;
case 3, 64-digit and 128-digit formats with 23 delay changes.]
[Fig.: Efficiency in two linear systems of the same size. Each panel plots the efficiency against
the number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1, 64-digit and
128-digit formats with 5 delay changes; case 2, 64-digit and 128-digit formats with 14 delay
changes.]
[Fig.: Efficiency in three linear systems of the same size. Each panel plots the efficiency
against the number of AUs for the nonadaptive and adaptive heuristics. Panels: case 1, 64-digit
and 128-digit formats with 31 delay changes; case 2, 64-digit and 128-digit formats with 30 delay
changes; case 3, 64-digit and 128-digit formats with 23 delay changes.]
[Fig.: Traces in a linear system when a fixed number of AUs of each type is available. The panels
plot the number of adders, multipliers, dividers and opposers per cycle against the cycle count,
for case 2 with the 128-digit format: nonadaptive heuristic with no delay changes, and adaptive
heuristic with 31 delay changes.]
 
Conclusion
This study is part of a project called CARESSE. The goal of this project is to investigate the
feasibility of a multiprocessor machine working in digit on-line mode. Such a machine will be
heterogeneous, that is, it will be made up of different types of modules, each arithmetic
operation being performed by a specialized AU. A VLSI prototype of the multiplier has been
designed and tested.
We have presented a new step in the simulation of CARESSE, a machine well suited to
high-precision computations.
The division module presented above brings new difficulties to the scheduling problem. We have
introduced the concept of delay path to perform the scheduling on such an architecture. The
original aspect of the scheduling problem here is that the variable computation delay of the
division makes the execution time unpredictable. Using this concept, we have compared two
different scheduling heuristics for CARESSE.
The nonadaptive heuristic uses static priorities defined before the computation, whereas the
adaptive heuristic uses dynamic priorities. The main risk is starvation, that is, the blocking
of certain tasks by higher-priority ones, which results in higher computation delays.
Moreover, it is well known that list algorithms may not generate the optimal solution. The
simulations also show the well-known stability problem of such methods: it is not guaranteed that
an increase in the available resources will result in a proportional decrease in the computation
time. A comparison of the two heuristics has been performed for a division-intensive Gaussian
elimination. The results we have shown indicate that the choice between the two strategies
remains an open problem. The performance of these two heuristics in other numerically intensive
computations is under study.
References
 A Avizienis Signeddigit number representations for fast parallel arithmetic IRE Transactions
on Electronic Computers  pp    
 Bernstein D  Rodeh M  and Gerner I On the complexity of scheduling problems for paral
lelpipelined machines IEEE Transactions on Computers  c   
 J Duprat and M Fiallos On the simulation of pipelining of fully digit online 
oatingpoint
adder networks on massively parallel computers In Second Joint Conference on Vector and
Parallel Processing  Lecture Notes in Computer Science  pages   SpringerVerlag  Sep
tember 
 J Duprat  M Fiallos  J M Muller  and H J Yeh Delays of online 
oatingpoint operators in
borrow save notation In Algorithms and Parallel VLSI Architectures II  pages   Noth
Holland  
 Hesham ElRewini and T G Lewis Scheduling parallel programs tasks onto arbitrary target
architectures Journal of Parallel and Distributed Computing     
 MD Ercegovac Online arithmetic an overview In SPIE  editor  SPIE Real Time Signal
Processing VII  pages pp    
 MD Ercegovac and KS Trivedi Online algorithms for division and multiplication IEEE
Trans Comp  Cpp    
 R Fujimoto Parallel discrete event simulation Communications of the ACM    
October 
 A Guyot  Y Herreros  and J M Muller Janus  an online multiplierdivider for manipula
ting large numbers In IEEE th Symposium on Computer Arithmetic  pages   IEEE
Computer Society Press  
 Sylvanus Kla Calcul Parallele et En Ligne des Fonctions Arithmetiques PhD thesis  Ecole
Normale Superieure de Lyon  France  February 
 MasPar Computer Corporation MasPar Parallel Application LanguageMPL   User Guide 

 J Misra Distributed discreteevent simulation Computer Surveys     March 
 JM Muller Arithmetique des Ordinateurs Masson  
 J Nickolls The design of the maspar mp A cost e	ective massively parallel computer In
IEEE  editor  IEEE Compcon Spring   pages pp    
 T R Stiemerling Design and Simulation of an MIMD Shared memory multiprocessor with
interleaved instruction streams PhD thesis  Department of Computer Science  University of
Edinburgh  November 
