Delay-Insensitive Synchronization on a Message-Passing Architecture with an Open Collector Bus by Bekker, H. & Dijkstra, E.J.
  
Publication date:
1996
Citation for published version (APA):
Bekker, H., & Dijkstra, E. J. (1996). Delay-Insensitive Synchronization on a Message-Passing Architecture
with an Open Collector Bus. In EPRINTS-BOOK-TITLE University of Groningen, Johann Bernoulli Institute
for Mathematics and Computer Science.
Delay-Insensitive Synchronization on a Message-Passing Architecture 
with an Open Collector Bus 
H. Bekker and E.J. Dijkstra
Department of Computing Science, 
University of Groningen, 
9700 AV Groningen, The Netherlands 
Abstract 
The performance of some algorithms running on a message passing
computer is limited by the high latency of global communications.
To increase the performance, a simple open collector bus, operated
by delay insensitive programs running on each processor, can be
used. We illustrate this by an example: the constraint algorithm
SHAKE as used in constraint Molecular Dynamics (M.D.) simulation.
We present a parallelizable SHAKE algorithm and show how it can be
implemented on a ring architecture. On a large ring the use of
message passing to synchronize SHAKE iterations may take up to 40%
of the total time. We show how the communication time can be
reduced by adding a very simple open collector bus, operated by a
delay insensitive algorithm. In this way the time spent on the
synchronization of SHAKE iterations becomes negligible.
We want to emphasize that this kind of open collector bus can be
used with many delay insensitive algorithms. To show this we
mention other possible applications.
Key words: synchronization, delay insensitive algorithm, open
collector bus, constraint dynamics.
1 Introduction 
Message passing systems consisting of a large number of 
processors, connected by a sparse interconnection topology 
(e.g. a ring or a mesh) prove to be a cost effective solution 
for many practical applications. These systems offer local 
communication with a high bandwidth and a low latency, 
but their global communication falls short in two respects: 
bandwidth and latency. This may give problems for the 
following classes of algorithms: 
(i): Algorithms limited by the sustained bandwidth of the
architecture. These algorithms often require local or global
communication of large amounts of data followed by calculations
which take up less time. This paper is not about this class of
algorithms but about:
(ii): Algorithms limited by the latency of the communications.
These algorithms often require global communication of very small
amounts of data followed by some calculations. An important
instance of such an algorithm is the synchronization of a fine
grained iterative process.
This paper is about a simple hardware extension, solving the class
of problems described in (ii). To be more specific, we will show
that a very simple open collector bus (O.C.-bus), running along all
the processors, may be used to improve the performance of the
algorithms in (ii).
Figure 1: Open collector line, running along P processors. The
value read at every processor is the same. It is the logical and of
the values written at all processors, where True (False) corresponds
to an open (closed) gate.
The O.C.-bus consists of a few (4...8) lines, without any further
lines for clocking or control (see figure 1). Each of these lines is
memory mapped, which means that by clearing or setting a specific
bit in memory every processor can clamp (pull down) or unclamp the
line, respectively. By reading a specific bit in memory, every
processor can obtain the logical value of each line. This value is
the same on every processor, and is the logical and of the values
written on that line.
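The behaviour of a single line can be sketched in a few lines of Python. This is only a software model under stated assumptions, not the hardware interface: the class name `OpenCollectorLine` and its methods are hypothetical, and clamping the line is represented by writing False.

```python
class OpenCollectorLine:
    """Software model of one open collector line shared by P processors.

    Hypothetical illustration: every processor owns one local bit.
    Writing False clamps (pulls down) the line, writing True releases it.
    """

    def __init__(self, num_processors):
        # All gates start open, so the line initially reads True.
        self._local = [True] * num_processors

    def write(self, processor, value):
        """Set or clear the local bit of one processor."""
        self._local[processor] = bool(value)

    def read(self):
        """Value seen by every processor: the logical and of all local bits."""
        return all(self._local)


line = OpenCollectorLine(4)
line.write(2, False)      # processor 2 clamps the line
clamped = line.read()     # every processor now reads False
line.write(2, True)       # processor 2 releases the line
released = line.read()    # every processor reads True again
```

A single clamping processor is enough to pull the line down for all readers, which is the property the synchronization algorithm later in the paper relies on.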
We will show that delay insensitive algorithms are very suitable to
operate this bus. By using delay insensitive algorithms a low
quality implementation of the O.C.-bus is acceptable: no bus
terminators are required, no characteristic impedance, etc.
Moreover, the use of delay insensitive algorithms, also called self
timed algorithms, makes it possible to operate the O.C.-bus without
clock or control lines.
The rest of this paper consists of a worked out example of
molecular dynamics simulation on a message passing system. First a
general introduction of constraint molecular dynamics is given.
Then it is shown that running this algorithm on a ring architecture
leads to the type of problem described in (ii). Finally, it is
shown how the use of an O.C.-bus, operated by a delay insensitive
algorithm, solves this problem. Also two similar applications of
the O.C.-bus are mentioned.
1066-6192/96 $5.00 © 1996 IEEE
Proceedings of PDP '96
2 Constraint Molecular Dynamics simulation
Molecular Dynamics (M.D.) simulation is a method to simulate the
behaviour of a many particle (atom) system by numerically
integrating Newton’s equation of motion. M.D. simulation is
performed as follows: the initial system state $S_0$, that is, the
position $r_i$ and velocity $v_i$ of every particle $i$ in the
system at time $t_0$, is given. By integrating Newton’s law
$F_i = m_i a_i$ for every particle, subsequent states
$S_1, S_2, \ldots, S_n$ are calculated, where
$S_n \equiv S(t_0 + n\Delta t)$. To calculate $S_{n+1}$ from $S_n$,
first the total force $F_i(t_0 + n\Delta t)$ on every particle $i$
due to all other particles in the system is calculated. Then this
force $F_i(t_0 + n\Delta t)$ is used to calculate for every particle
the new velocity $v_i(t_0 + (n+1)\Delta t)$. Using this velocity,
the new position $r_i(t_0 + (n+1)\Delta t)$ of every particle is
calculated. Repeating this procedure gives the time development of
the system.
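As an illustration, one timestep of this procedure can be sketched as follows. The function name `md_step`, the one-dimensional particle representation and the callback `force` are assumptions made for brevity; the sketch only mirrors the order of operations described above (force, then velocity, then position) and is not the integrator of any particular M.D. package.

```python
def md_step(r, v, m, force, dt):
    """Advance positions r and velocities v by one timestep dt.

    r, v and m are lists with one entry per particle (one dimension for
    brevity); force(r) returns the list of total forces F_i.
    """
    f = force(r)                                   # total force on every particle
    # Newton's law: a_i = F_i / m_i gives the new velocity.
    v_new = [vi + fi / mi * dt for vi, fi, mi in zip(v, f, m)]
    # The new position is calculated with the new velocity.
    r_new = [ri + vi * dt for ri, vi in zip(r, v_new)]
    return r_new, v_new


def no_force(r):
    """Trivial force field used only for this example: no interactions."""
    return [0.0 for _ in r]


# Two free particles moving towards each other.
r1, v1 = md_step([0.0, 1.0], [1.0, -1.0], [1.0, 2.0], no_force, 0.5)
```

Repeated calls of `md_step`, each starting from the state returned by the previous call, give the sequence of states $S_1, S_2, \ldots$.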
Every timestep, during the force calculations, many types of
interaction forces are evaluated: Coulomb forces, Lennard-Jones
forces, covalent forces, etc. Some of these interactions are very
rigid. The most rigid interaction in an M.D. simulation is the
covalent interaction. This means that two particles having a
covalent interaction have an almost constant distance. Put in
another way: covalent interactions have a high eigenfrequency. The
maximal allowed timestep used in an M.D. simulation is dictated by
the allowed numerical drift of the integration algorithm, so it is
dictated by the highest frequency in the system, and should be
approximately 1/(40 × highest frequency). However, the behaviour of
covalent interactions is not part of the physics of interest of an
M.D. simulation. Leaving out frequencies above 1/4 to 1/2 of the
highest frequency does not influence the outcome of an M.D.
simulation. So, it is a waste of computer time to use a timestep
based on covalent eigenfrequencies. For that reason, nowadays in
most M.D. programs, the covalent interactions are handled using
constraint dynamics, which means that the distance between
particles with a covalent bond is kept constant. Then the timestep
may be as high as 1/(20 × highest frequency) to
1/(10 × highest frequency). In this way an M.D. simulation runs two
to four times faster.
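The factor of two to four follows directly from the rule of thumb for the timestep. The short calculation below spells this out; the function name `max_timestep` and the normalised frequency are illustrative assumptions, and only the ratios are meaningful.

```python
def max_timestep(highest_frequency):
    """Rule of thumb from the text: dt is about 1 / (40 x highest frequency)."""
    return 1.0 / (40.0 * highest_frequency)


f_covalent = 1.0    # highest (covalent) frequency, arbitrary units

dt_unconstrained = max_timestep(f_covalent)

# With constraints, the highest remaining frequency is 1/4 to 1/2 of f_covalent.
dt_low = max_timestep(f_covalent / 2.0)     # equals 1 / (20 x f_covalent)
dt_high = max_timestep(f_covalent / 4.0)    # equals 1 / (10 x f_covalent)

# Fewer timesteps are needed for the same simulated time.
speedup_low = dt_low / dt_unconstrained
speedup_high = dt_high / dt_unconstrained
```

The estimate ignores the extra cost of the constraint algorithm itself, which is discussed in the remainder of this section.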
Because an atom may have covalent interactions with a number of
atoms¹, substituting covalent interactions by length constraints
will in general result in a set of connected length constraints
with a, possibly cyclic, graph like structure. Covalent
interactions are bonded interactions, so no constraints are created
or broken during a simulation.
The introduction of length constraints has no consequences for the
force calculations, except of course that the forces of covalent
interactions are not calculated. However, the introduction of
length constraints has severe consequences for the algorithm in
which Newton’s law is integrated, resulting in a matrix equation.
As the rank of the matrix is the number of constraints in the
system, for systems with many constraints, solving this equation
directly on a parallel computer is complex. There exists however a
fast, iterative method called SHAKE [1] to solve the matrix
equation. The special thing about SHAKE is that its iterative way
of solving the matrix equation is directly reflected in iterative
adjustment of pairs of particle positions. This last interpretation
of the SHAKE method has become so familiar that it is almost
forgotten that it is a matrix solver in disguise. We will adhere to
this habit, and in what follows write about the SHAKE algorithm as
pairwise adjusting particle positions to constraint conditions in
an iterative way.
¹ In a typical M.D. system the number of constraints is of the same
order as the number of particles.
SHAKE is used as follows. Every timestep, the interaction forces,
the new velocities and new positions are calculated as if no
constraints exist, except that no covalent interaction forces are
evaluated. Clearly, particle positions obtained in this way do not
fulfill the distance constraints between particles. Then SHAKE is
invoked. In SHAKE, particle positions are corrected in an iterative
way, such that finally all length constraints are fulfilled within
a predefined tolerance. So, at the end of every timestep many SHAKE
iterations have to be done.
SHAKE is implemented as follows. The particle numbers of every pair
of particles between which a distance constraint exists are kept in
a constraint list (CL). So, in CL, every constraint is represented
by two particles. (In this article we assume that the constraint
distance is the same for every constrained pair of particles, so we
need not store this in CL.) In every iteration, SHAKE goes through
CL once; the order in which items of CL are processed does not
matter. Processing an item of CL means that the positions of the
particles of this pair are adjusted such that their relative
distance becomes the required constraint distance.² Because a
particle may have more than one constraint interaction,
repositioning a particle due to one constraint may disturb another,
previously adjusted constraint. Therefore, after every iteration of
adjusting positions of particle pairs, the constraint conditions
are checked. If these conditions are not fulfilled within a
predefined tolerance, another iteration is done, in which all
constraints in CL are processed once again. Typically, at the end
of every timestep SHAKE does 4...40 iterations, but for large
molecules some hundreds of iterations may be required. On a single
processor, SHAKE typically takes 5...20% of the total CPU time of
an M.D. simulation.
In the SHAKE algorithm as we presented it, two particle positions
are adjusted when processing an item from CL. When processing the
next item from CL, a possibly previously adjusted particle position
is adjusted further. So, items from CL cannot be processed
simultaneously. Fortunately, the SHAKE algorithm can be restated
without these dependencies. When processing an item from CL,
instead of immediately adjusting the two particle positions, the
resulting particle displacements are accumulated.³ After processing
the whole CL, particles are displaced over their accumulated
displacements. In pseudo code, the parallelizable SHAKE algorithm
looks like:
² Although not relevant for this paper, adjusting positions goes as
follows. The particles of the constrained pair a, b, with positions
$r_a$ and $r_b$, are reset in the direction of
$r_a(t - \Delta t) - r_b(t - \Delta t)$, such that the center of
mass of this pair does not change, and their distance becomes the
constraint distance.
³ This cannot be derived by transformations of the algorithm, but is
a matter of numerical mathematics.
procedure SHAKE;
type rvec= array[1..3] of real;
     partId=1..N; { number of particles is N }
var
  r, displ: array [partId] of rvec;
  CL: array [1..nr_constr] of record a,b: partId end;
begin
  repeat
    clear(displ);
    for i:=1 to nr_constr do begin
      displ[CL[i].a] += .... ; { See footnote 2 }
      displ[CL[i].b] += .... ; { See footnote 2 }
    end;
    add displ to r;
  until all constraints within tolerance
end;
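A runnable version of this scheme can be sketched in Python. Several simplifications are assumptions of the sketch and not taken from the paper: all particles have equal mass, the correction is applied along the current separation vector (footnote 2 uses the direction at the previous timestep), and the names `shake`, `tol` and `max_iter` are hypothetical.

```python
import math


def shake(r, constraints, d0, tol=1e-8, max_iter=1000):
    """Correct the positions r (a list of [x, y, z]) in place.

    constraints is a list of particle index pairs (a, b) and d0 is the
    common constraint distance. Returns the number of position updates.
    """
    for iteration in range(max_iter):
        displ = [[0.0, 0.0, 0.0] for _ in r]           # clear(displ)
        within_tolerance = True
        for a, b in constraints:
            d = [r[a][k] - r[b][k] for k in range(3)]
            dist = math.sqrt(sum(x * x for x in d))
            if abs(dist - d0) > tol:
                within_tolerance = False
            # Equal masses assumed: each particle takes half of the
            # correction, so the centre of mass of the pair stays put.
            scale = 0.5 * (dist - d0) / dist
            for k in range(3):
                displ[a][k] -= scale * d[k]            # accumulate only
                displ[b][k] += scale * d[k]
        if within_tolerance:
            return iteration
        for i in range(len(r)):                        # apply accumulated
            for k in range(3):                         # displacements
                r[i][k] += displ[i][k]
    raise RuntimeError("SHAKE did not converge")


# A chain of three particles with two constraints of length 1.
positions = [[0.0, 0.0, 0.0], [1.2, 0.1, 0.0], [2.3, -0.1, 0.0]]
updates = shake(positions, [(0, 1), (1, 2)], 1.0)
```

Because the displacements are only accumulated inside the loop over the constraints, the loop body has no dependencies between items, which is what makes the iteration parallelizable.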




3 SHAKE on a ring architecture
At the departments of physical chemistry and computer science in
Groningen, the M.D. simulation package GROMACS [2,3] has been
implemented on a custom-built ring architecture, consisting of 32
i860 processors. Each i860 board plugs into a collective PC bus,
and has two eight bits wide parallel interfaces (2 Mb/sec) to
connect the board in the ring. Also on the PC bus is an i486,
running UNIX, which serves as a host. This host uses the PC bus to
load code and initial data on the i860 processors, and for I/O
purposes.
In the GROMACS ring implementation, particles are statically
allocated on processors. An M.D. system of N particles, numbered
from 1 to N, is mapped on P (in our case P = 32) processors by
allocating the first N/P particles on processor 1, the second N/P
on processor 2, etc. The processor $H_i$ on which particle i is
allocated is called its home processor. The home processor of
particle i calculates the final position of i after a timestep
(constrained and unconstrained). As will be clear, the particle
numbering determines the home processor of every particle.
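The block allocation can be written as a one-line mapping. The function name `home_processor` is hypothetical, and the sketch assumes that N is a multiple of P, which the text does not state explicitly.

```python
def home_processor(i, n_particles, n_processors):
    """Home processor (1-based) of particle i (1-based) under block allocation.

    Assumes that n_particles is a multiple of n_processors.
    """
    block = n_particles // n_processors     # N/P particles per processor
    return (i - 1) // block + 1


# 64 particles on 32 processors: two particles per processor.
homes = [home_processor(i, 64, 32) for i in (1, 2, 3, 63, 64)]
```

Renumbering the particles therefore changes the mapping, which is what the bandwidth reduction method discussed below exploits.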
For the force calculations it does not matter how particles are
allocated on the processors. That is because every particle
potentially interacts with every other particle. Therefore, at the
beginning of every timestep, the position of every particle i is,
starting from its home processor $H_i$, distributed over half the
ring, in say the positive direction. Distribution over half the
ring is sufficient because in this way every position pair
$r_i$, $r_j$ is present on at least one processor. After this
distribution stage, interaction forces are calculated. Then the
interaction forces on every particle i are communicated in the
negative direction to $H_i$, where they are summed to the net force
on particle i.⁴ Finally, on the home processor of each particle its
new, unconstrained velocity and position is calculated. Now SHAKE
is invoked. Because between any two particles there may be a
constraint, in principle for every SHAKE iteration, as in the force
calculations, particle positions would have to be distributed over
half the ring. However, this would take far too much time. In [4]
we proposed a method to minimize communication during SHAKE
calculations. It is based on the bandwidth reduction algorithm of
Gibbs, Poole and Stockmeyer. The essence of the method is that
particles are numbered in such a way that particles between which a
length constraint exists get close numbers, and so are mapped on
close processors. In fact, with this method, even for rather
complex molecules, particles between which a constraint exists are
mapped on the same or on directly adjacent processors.⁵ So, during
SHAKE only communication between very near processors is required.
⁴ On the GROMACS ring architecture this communication, together
with the foregoing communication to distribute particle positions,
takes about 5% to 10% of the total time.
Figure 2: The lists NCCI, LCI and PCCI on processor p for the part
of some constraint graph mapped near processor p.
NCCI = (3,2); (4,2). LCI = (4,3); (5,6); (6,7).
PCCI = (6,10); (5,8). On processor p constraint interactions in
NCCI and LCI are evaluated.
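The statement that distribution over half the ring brings every pair of positions together on at least one processor can be checked with a small script. The sketch below numbers processors from 0 and assumes that "half the ring" means floor(P/2) steps in the positive direction; both choices and the function names are assumptions of the illustration.

```python
def processors_holding(home, n_processors):
    """Processors holding the positions of particles whose home is `home`
    after distribution over half the ring in the positive direction."""
    reach = n_processors // 2
    return {(home + step) % n_processors for step in range(reach + 1)}


def every_pair_meets(n_processors):
    """True if every pair of home processors shares at least one processor."""
    for p in range(n_processors):
        for q in range(p, n_processors):
            shared = (processors_holding(p, n_processors)
                      & processors_holding(q, n_processors))
            if not shared:
                return False
    return True


covered = all(every_pair_meets(P) for P in (2, 3, 8, 31, 32))
```

The same check also makes clear why this is expensive: every position travels past about P/2 processors, whereas the numbering method of [4] restricts SHAKE communication to neighbouring processors.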
Most parallel implementations of the M.D. algorithm do not include
constraint dynamics. In those cases where it is included [5,6] no
use is made of accumulated displacements to parallelize SHAKE, nor
of the bandwidth reduction method to minimize communication during
SHAKE iterations.
We are now almost ready to write down the SHAKE algorithm as it
runs on every processor of the ring, but first we will explain how
the global constraint list CL, as introduced in section 2, is
distributed over processors, and how on every processor this
partial list is partitioned into even smaller parts.
The contents of CL do not change during a simulation, so an almost
perfect load balance for parallel SHAKE calculations can be
accomplished by assigning the same number of items of CL to every
processor. On every processor the constraint interactions assigned
to that processor are stored in LCL (Local Constraint List). The
list LCL is subdivided still further into three sublists (see
figure 2): NCCI, LCI, and PCCI (Negative-Crossing, Local-, and
Positive-Crossing Constraint Interactions). On processor p, NCCI
(PCCI) contains those constraint interactions of which particle b
is home on processor p − 1 (p + 1), and particle a is home on p.
LCI contains interactions of which both particles are home on p.
So on p, the list PCCI contains the same number-pairs, but in
reverse order, as the list NCCI on p + 1. The data structures NCCI,
LCI and PCCI can be used to define which constraint interactions
are evaluated by which processor: processor p evaluates the
constraints in its NCCI and LCI list. Then the SHAKE algorithm as
it runs on every processor is as follows:
⁵ If this is not the case, the particles are still mapped on very
close processors. Such a case can be handled by a straightforward
extension of the method we propose, but we did not encounter
molecules with constraint structures of this complexity.
procedure SHAKE;
{ parallel SHAKE on a ring architecture, see also figure 2 }
type rvec= array[1..3] of real;
     partId=1..N; { N is the number of particles }
var
  r, displ: array [partId] of rvec;
  NCCI: array [1..nr_NCCI] of record a,b: partId end;
        { home(a)=p, home(b)=p-1 }
  PCCI: array [1..nr_PCCI] of record a,b: partId end;
        { home(a)=p, home(b)=p+1 }
  LCI: array [1..nr_LCI] of record a,b: partId end;
        { home(a)=p, home(b)=p }
begin
  send PCCI_b_positions to posdir;
  receive NCCI_b_positions from negdir;
  repeat
    clear(displ);
    calculate displacements;
      { due to constraints in NCCI and LCI }
    send NCCI_b_displacements to negdir;
    receive PCCI_b_displacements from posdir;
    sum displacements and add to r;
      { for particles home on this processor }
    send PCCI_b_positions to posdir;
    receive NCCI_b_positions from negdir;
    LCWT := ....;
  until ACWT(LCWT)
end;






With this, the parallel SHAKE algorithm is completely 
specified, except for the last few statements with LCWT 
and ACWT (Local- and All Constraints Within Tolerance), 
which concern the evaluation of the global stop criterion of 
SHAKE iterations during the current timestep. This will 
be discussed in the next section. 
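The partitioning of the constraint list into NCCI, LCI and PCCI, as described before the procedure, can be sketched as follows. The function `partition_constraints`, the example mapping `home` (five particles per processor) and the example constraints are all hypothetical; the sketch assumes that constrained particles are home on the same or on directly adjacent processors and ignores the wrap-around of the ring.

```python
def partition_constraints(constraints, home, p):
    """Build the lists NCCI, LCI and PCCI of processor p.

    constraints is a list of particle pairs and home maps a particle to
    its home processor. In every stored pair (a, b), a is home on p.
    """
    ncci, lci, pcci = [], [], []
    for x, y in constraints:
        hx, hy = home(x), home(y)
        if hx == p and hy == p:
            lci.append((x, y))                       # both particles local
        elif p in (hx, hy):
            a, b = (x, y) if hx == p else (y, x)     # local particle first
            if home(b) == p - 1:
                ncci.append((a, b))                  # crosses to p - 1
            elif home(b) == p + 1:
                pcci.append((a, b))                  # crosses to p + 1
    return ncci, lci, pcci


def home(i):
    """Illustrative block mapping: five particles per processor."""
    return (i - 1) // 5 + 1


constraints = [(4, 6), (6, 7), (7, 8), (10, 11), (9, 12)]
ncci2, lci2, pcci2 = partition_constraints(constraints, home, 2)
ncci3, lci3, pcci3 = partition_constraints(constraints, home, 3)
```

In this example the list PCCI on processor 2 holds the same pairs as NCCI on processor 3 with the two particles of each pair swapped, in line with the relation between PCCI on p and NCCI on p + 1 stated in the text.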
4 The function ACWT implemented with the 
open collector bus 
SHAKE iterations should be stopped when on every processor the
constraints in the lists NCCI and LCI are within tolerance, i.e.
when on every processor the boolean variable LCWT is true.
Representing LCWT on processor p by $LCWT_p$, the function ACWT can
be specified as
$ACWT = LCWT_1 \text{ and } LCWT_2 \text{ and } \ldots \text{ and } LCWT_P$.
On a message passing ring architecture such as GROMACS, the
function ACWT can be implemented in three ways.
(i) By sending a single message around the whole ring twice. First
the message accumulates the logical and of all LCWT, and in a
second round this result is passed to all processors as ACWT.
(ii) As P messages, all moving around the whole ring once. At every
processor one message is released which returns to that same
processor. While moving around the ring, the message evaluates the
logical and of all LCWT. When arriving at the processor from which
it was released, this processor inspects the contents of the
message to see if another iteration of SHAKE is required.
(iii) The third implementation uses the PC bus and the host
computer. Every processor sends its LCWT to the host. There, the
logical and is evaluated and transmitted back to the individual
processors.
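To make the cost of such a scheme concrete, method (i) can be simulated sequentially while counting the point-to-point communications. The function name `acwt_two_rounds` and the choice of processor 0 as the starting point are assumptions of this sketch; it models only the order of the hops, not the real message passing software.

```python
def acwt_two_rounds(lcwt):
    """Evaluate ACWT by sending one message around the ring twice.

    lcwt[p] is the local flag of processor p. Returns the value known on
    every processor and the number of point-to-point communications.
    """
    n = len(lcwt)
    communications = 0
    message = lcwt[0]                    # processor 0 starts the message
    for p in range(1, n):                # first round: accumulate the and
        communications += 1              # hop from p - 1 to p
        message = message and lcwt[p]
    communications += 1                  # hop from the last processor to 0
    result = [None] * n
    result[0] = message
    for p in range(1, n):                # second round: pass the result on
        communications += 1
        result[p] = message
    return result, communications


result, hops = acwt_two_rounds([True, True, False, True])
```

All hops are sequential, so the elapsed time grows linearly with the number of processors, which is the problem quantified in the next paragraph.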
As will be clear, each of these methods takes at least P
communications. On our GROMACS ring implementation, we measured
that sending a minimal message (1 byte) from a processor to an
adjacent processor, or from a processor to the host, takes
~150 μsec, mainly due to startup overhead. So, for P = 32,
evaluating ACWT takes at least
$32 \times 150 \times 10^{-6} = 4.8 \times 10^{-3}$ sec. We also
measured that the calculations of one SHAKE iteration take the same
amount of time. (20 SHAKE iterations, without communication, can
take 10% of the total time. One timestep takes 0.1...1 sec, let us
say 1 sec, then one SHAKE iteration takes about $5 \times 10^{-3}$
sec.) On the present architecture the one-to-one ratio of the time
spent in SHAKE calculations and its synchronization is however no
problem because SHAKE typically takes only 10% of the total time,
so spending an additional 10% in ACWT is no major problem. However,
when the same type of simulation is done on a ring consisting of
twice as many processors, the total time spent on calculations is
halved while the time spent in ACWT doubles. So, about 40% of the
total time will be spent in ACWT.
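The estimate can be reproduced with a few lines of arithmetic. The constants are the measured values quoted above; the function `acwt_share` and the normalisation of the calculation time at P = 32 to 1.0 are assumptions of this sketch. It reports the time in ACWT both relative to the calculation time and relative to calculation plus synchronization, since the text does not say which total is meant.

```python
LATENCY = 150e-6    # one minimal message, in seconds (measured value)


def acwt_time(n_processors):
    """Lower bound from the text: at least P sequential communications."""
    return n_processors * LATENCY


def acwt_share(doublings, calc=1.0, acwt=0.1):
    """Time in ACWT after doubling the ring size `doublings` times.

    Starts from P = 32 with the calculation time normalised to 1.0 and
    ACWT taking an additional 10%. Returns (ACWT / calculations,
    ACWT / (calculations + ACWT)).
    """
    for _ in range(doublings):
        calc /= 2.0     # calculations are spread over twice as many processors
        acwt *= 2.0     # the message passes twice as many processors
    return acwt / calc, acwt / (calc + acwt)


t32 = acwt_time(32)                          # about 4.8e-3 sec
share_32 = acwt_share(0)
share_64 = acwt_share(1)
```

Under these assumptions the 40% quoted in the text corresponds to the first number returned for the doubled ring, i.e. the time in ACWT relative to the calculation time.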
To solve this problem we will equip our next architecture with an
eight line O.C.-bus as described in the introduction. The function
ACWT can be implemented using four of these lines. We will call
these four lines “valid”, “accepted”, “next”, and “data”. We name
the values written on these lines: l_valid, l_accepted, l_next,
l_data; and the values read: g_valid, g_accepted, g_next, g_data.
The prefixes l and g stand for local and global. The local
variables are write-only and the global variables are read-only.
The signals “valid”, “accepted” and “next” will serve as global
control signals. In [7] it is explained why at least three control
signals are required. A delay insensitive implementation [...]

  [...]
  repeat until g_valid;
  ACWT:=g_data;
  l_next:=False; l_accepted:=True;
  repeat until g_accepted;
  l_valid:=False; l_next:=True;
  repeat until g_next;
  { ACWT:= LCWT_1 and ... and LCWT_P }
end;

After the first repeat, “data” is valid. After the second repeat,
“data” has been accepted by all processors. The third repeat is
necessary to return to the neutral state. Initially, l_valid
(g_valid) must be False. It can be seen that after g_valid becomes
true, i.e. after the slowest process has evaluated its LCWT and
assigned it to l_data, the evaluation of the function ACWT proceeds
without delay. On every processor, immediately after finishing the
function ACWT [...]
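The three-phase handshake can be tried out in software with one thread per processor. The sketch below is a simulation under stated assumptions and not the authors' implementation: the opening statements (writing l_data and l_valid) are inferred from the prose, the point at which l_accepted is reset and the initial values of the lines are assumptions, and the names `OpenCollectorLine`, `Bus`, `acwt` and `run_rounds` are hypothetical.

```python
import threading
import time


class OpenCollectorLine:
    """Wired-and line: reads True only if every processor has written True."""

    def __init__(self, num_processors, initial):
        self._local = [initial] * num_processors

    def write(self, processor, value):
        self._local[processor] = value

    def read(self):
        return all(self._local)


class Bus:
    """The four lines used by ACWT, with assumed neutral-state values."""

    def __init__(self, num_processors):
        self.valid = OpenCollectorLine(num_processors, False)
        self.accepted = OpenCollectorLine(num_processors, False)
        self.next = OpenCollectorLine(num_processors, True)
        self.data = OpenCollectorLine(num_processors, True)


def wait_until(line):
    """Busy-wait as in 'repeat until g_...', yielding to other threads."""
    while not line.read():
        time.sleep(0)


def acwt(bus, p, lcwt):
    """Return the logical and of the lcwt values of all processors."""
    bus.accepted.write(p, False)     # assumption: l_accepted is reset here
    bus.data.write(p, lcwt)          # l_data := LCWT (inferred from the prose)
    bus.valid.write(p, True)         # l_valid := True (inferred from the prose)
    wait_until(bus.valid)            # repeat until g_valid
    result = bus.data.read()         # ACWT := g_data
    bus.next.write(p, False)         # l_next := False
    bus.accepted.write(p, True)      # l_accepted := True
    wait_until(bus.accepted)         # repeat until g_accepted
    bus.valid.write(p, False)        # l_valid := False
    bus.next.write(p, True)          # l_next := True
    wait_until(bus.next)             # repeat until g_next
    return result


def run_rounds(lcwt_table):
    """Run one thread per processor; lcwt_table[k][p] is LCWT of p in round k."""
    num_processors = len(lcwt_table[0])
    bus = Bus(num_processors)
    results = [[] for _ in range(num_processors)]

    def worker(p):
        for row in lcwt_table:
            results[p].append(acwt(bus, p, row[p]))

    threads = [threading.Thread(target=worker, args=(p,))
               for p in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_rounds` with a table of local flags returns, for every processor, the sequence of ACWT values it observed. No processor can run ahead by more than one phase, because each wait completes only after every processor has released the corresponding line; no clock or timing assumption is involved.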




In our particular case, that is, constraint molecular dynamics on a
ring architecture, adding a small O.C.-bus is a sensible investment
because the price of this feature is low (a few hundred dollars)
compared to the price of the whole computer (approximately
$100,000), while the speed increase is much higher than this ratio.
Moreover, the hardware risk that goes with this feature, that is,
the risk of destabilizing an otherwise well functioning
architecture, is very small.
In our opinion, there are many other useful applications of an
O.C.-bus in a message passing computer. Especially the combination
of an O.C.-bus with delay insensitive algorithms looks promising.
We will briefly mention two other applications.
The first example we want to mention is process arbitration. On
every processor a number is generated. Process arbitration means
that on every processor it is decided whether the highest number is
on this processor. A delay insensitive arbitration algorithm, using
four wired-or lines, has been designed (unpublished) by C.E.
Molnar. This way of process arbitration will be much faster than
using the message passing mechanism.
The second example is the TRIMOSBUS [7]. The TRIMOSBUS is a general
purpose bus, operated with a delay insensitive algorithm. It
consists of at least four open collector lines, three of which are
used for sequencing, and the other ones as data lines. It may be
used for arbitrary point to point communications, and for
broadcasting. In this way, on a parallel computer, small amounts of
data may be exchanged between processors much faster than with the
usual message passing mechanism.
The research field of “delay insensitive” algorithms and hardware
is thriving nowadays. Many delay insensitive algorithms are
conceived, and experimental delay insensitive hardware is designed.
Of both, the correctness can be proved by delay insensitive algebra
[8,9]. How delay insensitive algorithms and hardware will develop
is not clear at this moment. We do feel however, that a simple
O.C.-bus connecting the processors of a parallel message passing
architecture, combined with delay insensitive algorithms, is a
simple and fast general purpose feature, which may be used to
increase the performance of algorithms which cannot be implemented
efficiently or elegantly with the message passing mechanism.
Obviously, an O.C.-bus cannot replace the usual communication and
routing hardware of message passing systems, but for a number of
applications it can increase the performance in a straightforward
way. Because the price/performance ratio of the O.C.-bus is low,
and because it is a simple and robust piece of hardware, it is
worth considering to add this hardware feature to sparsely
connected parallel computers.
A reviewer remarked that the synchronization mechanism described in
this article strongly resembles the barrier synchronization
mechanism of the CRAY T3D architecture.
6 Conclusions 
A small O.C.-bus, operated by delay insensitive algorithms, is a
fast and simple mechanism, which on message passing systems can be
used to increase the performance of many applications.
An example of such an application is constraint M.D. implemented
with the SHAKE algorithm. A SHAKE iteration can be parallelized by
accumulating the displacements of every particle and adding the
total displacement to the particle position at the end of the
iteration.
Synchronizing iterations and the termination of SHAKE on a message
passing ring architecture proves to be relatively time-consuming
due to the latency of message passing. SHAKE becomes less
time-consuming by extending the hardware with a small O.C.-bus.
Synchronization can then be done by a simple delay insensitive
algorithm.
Acknowledgments 
We want to thank M.K.R. Renardus for carefully reading and
commenting on this text, J.T. Udding for his expertise in the field
of delay insensitive algorithms, and the reviewer for making useful
remarks.
Literature 
[1] J.P. Ryckaert, G. Ciccotti, H.J.C. Berendsen. Numerical
integration of the Cartesian equations of motion of a system with
constraints: molecular dynamics of n-alkanes. Journal of Comp.
Phys. 23, 327-341, 1977.
[2] H. Bekker, H.J.C. Berendsen, E.J. Dijkstra, S. Achterop,
R. v. Drunen, D. v.d. Spoel, A. Sijbers, H. Keegstra, B. Reitsma
and M.K.R. Renardus. GROMACS: a parallel computer for molecular
dynamics simulation. Conf. Proc. Physics Computing ’92, pages
252-256, World Scientific Publishing Co., Singapore, New York,
London, 1993.
[3] H. Bekker, E.J. Dijkstra, H.J.C. Berendsen. Molecular Dynamics
simulation on an i860 based ring architecture. Supercomputer 54,
X-2, 4-10, 1993.
[4] H. Bekker, E.J. Dijkstra, H.J.C. Berendsen. Mapping molecular
dynamics simulation calculations on a ring architecture. In
Parallel Computing: From Theory to Sound Practice, ed. W. Joosen
and E. Milgrom, pages 268-279, IOS Press, Amsterdam, 1992.
[5] A.R.C. Raine. Systolic loop methods for molecular dynamics
simulation, generalized for macromolecules. Molecular Simulation,
Vol. 7, pages 59-69, 1991.
[6] S.E. DeBolt, P. Kollman. AMBERCUBE MD, Parallelization of
AMBER’s Molecular Dynamics Module for Distributed-Memory Hypercube
Computers. Journal of Comp. Chem., Vol. 14, No. 3, 312-329, 1993.
[7] I.E. Sutherland, C.E. Molnar, C.E. Sproull, J.C. Mudge. The
TRIMOSBUS. Proc. of the Caltech Conf. on VLSI, January 1979.
[8] L. Lavagno and A. Sangiovanni-Vincentelli. Algorithms for
Synthesis and Testing of Asynchronous Circuits, Kluwer Academic
Publishers, 1993.
[9] M.B. Josephs, J.T. Udding. An overview of Delay Insensitive
Algebra. In Proc. of the 26th Annual Hawaii Int. Conf. on System
Sciences, ed. T.N. Mudge, V. Milutinovic, L. Hunter, 329-338, IEEE
Computer Society Press, 1993.
