High-performance multiprocessor architecture for a 3-D lattice gas model by Morf, M. et al.
3rd NASA Symposium on VLSI Design 1991
N9Z: - lo oo
7.2.1
High-Performance Multiprocessor Architecture
for a 3-D Lattice Gas Model 1
F. Lee, M. Flynn and M. Morf
Computer Systems Laboratory
Stanford University, Stanford, CA 94305
Abstract- The lattice gas method has recently emerged as a promising discrete
particle simulation method in areas such as fluid dynamics. We present a very
high-performance scalable multiprocessor architecture, called ALGE, proposed
for the simulation of a realistic 3-D lattice gas model, H_non's 24-bit FCHC
isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for
this application. ALGE is scalable in the sense that it achieves linear speedup
for both fixed and increasing problem sizes with more processors.
The core computation of a lattice gas model consists of many repetitions
of two alternating phases: particle collision and propagation. Functional decom-
position by symmetry group and virtual move are the respective keys to efficient
implementation of collision and propagation.
1 Introduction
High performance computing has become a vital enabling force in the conduct of science
and engineering research and development. In particular, simulations based on computa-
tional fluid dynamics are less costly and much faster than complex wind tunnel tests. In
the past few years, the lattice gas method [3] has emerged as an attractive, robust and
promising discrete particle simulation method for fluid flow simulations with complicated
boundary conditions, that are difficult or impossible to solve with other methods. Var-
ious standard fluid dynamical equations, including the Navier-Stokes equations, can be
obtained from lattice gas models after proper limits are taken [4].
The core computation of a lattice gas model is inherently suitable for execution on
scalable parallel computing systems, without requiring floating point operations. Increas-
ing amounts of computing power is needed to solve large scale simulation problems. It
is believed that simple and practical application-specific computers (or co-processors) can
achieve performance orders of magnitude higher than existing "general-purpose" supercom-
puters, that invariably focus on floating point operations. This belief has been confirmed in
the case of two-dimensional simulation, but not in the case of three-dimensional simulation,
which is much more important and challenging.
All existing special-purpose lattice gas computers such as CAM-6 [12], RAP1, RAP2 [1],
and LGM-1 [6], deal with two-dimensional lattice gas models. Until today, only one
other design, CAM-8 [10], proposed by Margolus and Toffoli, attempts to deal with three-
dimensional models, but this proposal is limited to 16 or fewer state bits per lattice node.
1This work was supported by NASA Ames Research Center under contract NAGW 4!9.
https://ntrs.nasa.gov/search.jsp?R=19940013889 2020-06-16T18:08:18+00:00Z
7.2.2
Yet we need to simulate models with 24 bits or more per node in order to achieve real-
istic results in studying complex phenomena such as turbulent flow [2]. ALGE is to our
knowledge the first special-purpose machine proposed to tackle a realistic 3-D lattice gas
model.
2 Lattice Gas Models
In order to keep this paper self contained, we repeat some of the material from our previous
publication [7], on which this paper iS based.
In a lattice gas model, space and time are discretized. Time is divided into a se-
quence of equal time steps, at which particles reside only at the nodes of the lattice. The
evo!ution_consists_ot_ two _terna-ting-p_s: (iy propagMion: -du_mg one=::time -step, each
particle moves from one node to another along a link of the lattice according to _gs ve!pe-
ity; (ii) coliis_on: at t_e_end _ a_me step, part]cles _arr[_h-g at _a _-'v-ed Uod_ _c0Uld_e: _d
'{nstantaneously-ac_-_re-new veiodtFes: T--he propert_s of t}ie laitlee not only govern the
propagation phase, but also significantly constrain the collision phase, because the collision
........... : =
The state of a node can be ........... " ....................... '_denoted by the bit vector b = (bl,... ,b,,), where bi = 1 if a
particle with the corresponding velocity v i is present 2 , and bi = 0 otherwise. Let b(x, t),
and b'(x, t) be the states o_ the node at position x and time t before and alter the collision
respectively. The collision phase specifies that, for all x and t,
b'(x,t) -- C(b(x,t)) (1)
where C is a deterministic or non-deterministic n-input n-output boolean collision function.
The propagation phase specifies that, for all x and t,
b,(x+ v',t + 1) - bS(x,t) (2)
An obstacle such as a plate, a wedge or an airplane wing is decomposed into a series
of continuous links which approximate-its geometrical shape. At nodes which represent
an obstacle, particles are either bounced back or undergo specular reflection. This can
be handled by adding one or more obstacle bits to the state of a node and adjusting the
collision function appropriately.
Before simulation, the states of the nodes are initialized according to the initial distri-
bution of particle densities and velocities. Alter simulation, nodes w]thln a volume of tens
of nodes on each side are averaged to compute the macroscopic density and momentum.
There are two types of boundary conditions on the lattice edges we are Concerned with.
The first type is i;]_e periodic boundary-cond-_t_on: the particles exiting from one-edge are
reinjeeted into the other edge in thesame directio_n. The second type, related to a wind-
tunnel experiment, consists in providing a flux of fresh particles on one side of the lattice
and allowing an output flux on the other side. In this paper, we focus on the first type of
_In this paper, Roman and Greek indices refer respectively to labels and components.
!
i
!
=
E
m
3rd NASA Symposium on VLSI Design 1991 7.2.3
/
Figure 1: The pseudo-four-dimensional FCHC model. Only the neighbors of one node are
shown as connected.
boundary condition, because it is basic: it requires no special treatment for nodes on the
edges, as there are no edges in a wraparound lattice space. The second type can be dealt
with as a simple extension.
2.1 Three-Dimensional Lattice
The particular lattice we are most interested in is the FCHC lattice used in three di-
mensional simulations [4,11]. A FCHC (face-centered hypercubic) lattice consists of those
nodes, which are the points with signed integer coordinates (zl, z2, z3, z4) = x such that
the sum zl + z2 + zz + z4 is even. Each node x is linked to its 24 nearest neighbors x' such
that the vector x' - x corresponds to one of the following 24 values:
(±I,-I-I,0,0), (±I,0,-I-I,0), (-1-1,0,0,-I-1),
(0, _1, +1, 0), (0,±1,0,-F1), (O,O,-F1,-I-I). (3)
These 24 nearest neighbors form a regular polytope. With time steps normalized to 1, the
vectors in (3) are also the 24 possible velocities of the particles arriving at or leaving from
a node.
7.2.4
The pseudo four-dimensional FCHC model is derived by projecting the four-dimensional
FCtIC lattice to three dimension so that the fourth dimension has a periodicity of 1. Each
node of a regular cubic lattice is a node in the model. Figure 1 shows the neighborhood of a
node: along the gray links, connecting to 12 neighbors, at most one particle can propagate,
with component v4 = 0; along the thick black links, connecting to 6 neighbors, up to two
particles can propagate, with components v4 = :t:1.
2.2 Isometric Collision Rules ......
Associated with the FCHC lattice is the iaometry group G of order 1!52. Roughly speaking,
an isometry is a symmetry operation such as rotation and reflection about the origin.
The isometric collision rules [5] require that
1. Every collision is an isometry: the output velocities are images of the input velocities
in an isometry.
. The isometry depends on tile momentum only: the momentum of the input state is
computed, and then normalized by taking advantage of the symmetries, and finally
used for classification.
. The isometry is randomly chosen among all optimal isometries: this is why non-
determinism comes into play. (An optimal isometry is one which minimizes the
viscosity of the lattice gas, so that higher Reynolds numbers can be reached.)
3 System Overview
This is an updated version of the design of ALGE as presented in [7]. The machine
is organized as an array processor, which serves as a special purpose high performance
computing engine to a host computer. The host computer down]oadsthe problem (data.)
into the engine and offioads the englne-produced solution. The host provides the user
interface to the computing engine and performs the pre-processing and post:processing
phases of the simff[atio_:= = _ ....
Figure 2 shows a 4x4 configuration of ALGE. The processors are connected as the
nodes of a 2-D toroid. Each identical processor 3 (p) has its own local memory (M). In a
simulation the 3-D problem space is decomposed into non-overlapped equal-sized partitions
such that nodes with the same Z-coordinates map to the same memory space, and adjacent
partitions map to adjacent memory spaces.
4 Processor Architecture
Figure 3 shows the functional block diagram of the processor. The processor contains
the following units: several collision units, an address generator, a transposer, a switch, a
3It may contain several processing elements (PE) as referred in [7].
!
=-
i|
=
[]
lm
mE
3rd NASA Symposium on VLSI Design 1991 7.2.5
Figure 2: A 4x4 configuration of ALGE
memory address register (MAR), a memory data register (MDR) and a control.
The collision units can be viewed as the "arithmetic" units of a "superscalar" processor.
Each unit is capable of computing one collision function per cycle. The address generator
contains a register file and some modified adders. It is responsible for generating the
proper address sequences for reading and writing data from and to memory. Each local
memory can supply one word of k bits per cycle. The n bits of any given node is stored at
a different word address. Hence, it takes n cycles to read all n bits of each of the k nodes.
The transposer is a two-way shift register array. Actually, there are two transposing buffers
so _hat one can be emptied (written back to memory) and filled (read from memory), while
the other is accessed by the collision units. The switch exchanges data with neighboring
processors if necessary. At any time, the processor either reads or writes. Since the
procedure is deterministic, and the access sequence is data independent, all operations
(AG, RD, etc.) are deeply pipelined in order to achieve maximum throughput.
The parameter k is is tile number of partitions mapped to a processor. The optimal
choice of k depends on n, the number of bits per node, the delay through the switch, and
the number of I/O pins and area of the VLSI implementation. Some typical numbers we
are considering are: n = 25 (1 obstacle bit), k = 192 for a processor with 4 collision units.
4.1 Collision Unit
The properties of a lattice not only govern the propagation phase, but also significantly
constrain the collision phase, because the collision ruIe_ must have the same symmetry as
the lattice [4]. How the underlying symmetry group of a lattice gas model can be exploited
to derive compact and high performance processing elements to handle collision functions
of potentially exponential complexities (O(n2")) was posed as a major challenge in this area
of research (see the Preface of [3]). The FCHC. isoro.etric model proposed by Hfinon [5] was
7.2.6
Control
Collision
Units
Address
Generator
Address Data
Transposer
Switch
MDR
Data
Figure 3: Functional block diagram of the processor
the first reM-24:bit threeLdimensional model with a detailed specificatlon o£ an optj_pajzed
non-deterministic collision function. Therefore, it was chosen as our first case study. A
VLSI architecture for the FCHC isometric model has been designed and implemented as
an ASIC. We have shown that a 4000 gate chip can replace the equivalent of 4.5 billion
bits (rather than 384 million b{t-s due to non-determinlsm) of a lookup table used to solve
this problem. Because the architecture is derived by considering the symmetry properties
rather than by brute force logic synthesis, it can be generalized to other classes of lattice
gas models. We present the main ideas in this paper. (Please see [8,9]:for more details)-_
H4non's isometric algorithm [5] shows how the output state of a node is computed as
a non-detern/nistic function of the input state:
1. Compute the momentum of the input state
2. Normall-za-ti-on--Apply the appropriate isometries (symmetry;transformations) to the
input state and the momentum, so that the momentum is normalized.
3. Collision: Choose at random one of the optimal isometries of theclass to which the
normalized momentum belongs, and apply this isometry.
4. Denormalization: Apply the isometries applied in step 2 in reverse order to obtain
the output state.
The application of an isometry to a state is the most frequent and important operation.
An efficient implementation of this operation is thus most crucial. Cayley's theorem states
that every group is isomorphic to a permutation group, hence it is not too surprising that
conditional application of isometries can be implemented as conditional permutations,
which in turn map to simple multiplexers. In essence, the algorithm can be viewed-as a
description of how to generate the right control signals to permute the input state bits.
/ I r
k..._ "
I
=_
z
z
|
B
.=
mm
i
3rd NASA Symposium on VLSI Design 1991 7.2.7
.................... t .....................
I Momentum adder
t6
i t :
[ Momentum normalizer
I Co,li,ioo 'etab'eI , "
Randomizer : .
.......l.::.. ...............i
State normalizer ,J
;[ 2,
State collider I
State denormalizer ]
Output state
Control generator Pemmtadon network
Figure 4: A collision unit
The organization of a collision unit (Figure 4) follows classical lines: the control path
consisting of a momentum adder, a momentum normalizer, a small collision rule table,
and a randomizer; the data path is a conditional permutation network composed of a state
normallzer, a state collider and a denormalizer (inverse-normalizer). The overall feed-
forward character of this unit makes it easy to design a highly pipeIined version with a
proportional increase in throughput.
A CMOS field programmable gate array implementation of the unit with a non-
pipelined latency of 460 ns has been completed. A CMOS gate array implementation
is estimated to have a non-pipelined latency below 50 ns. A collision unit capable of 20
million node updates per second (MNUPS) or more is clearly feasible. This is comparable
to CRAY-2's performance of approximately 30 MNUPS [11].
4.2 Address Generator
As large simulation problems require the use of a huge amount of memory, memory chips
can easily become the dominant cost factor of the system. Our solution avoids the common
but expensive alternative of double buffering the complete memory space, while retaining
a high degree of flexibility in the choice of lengths of each dimension of the (simulation)
problem space. This is made possible by the virtual move addressing mechanism, which
exploits the true data dependency of the computation steps involved. Data movement
implied by propagation but not by communication requirements can thus be eliminated.
The address generator contains a number of registers and an arithmetic datapath
(adder) to generate the complex sequence required.
7.2.8
4.2.1 Virtual Move
Although the FCHC models are our major concerns, the mechanismsdescribed below
apply to a larger classof models with other possible velocities. The following section is
written with generalnotations so asto be valid for any D-dimensional space and arbitrary
velocity.
The propagation equation (2) seems to suggest that at every time step, state bits of
all the nodes have to be moved. However, a closer examination reveals that the equation
actually represents an invariant relationship. If we choose to observe in a Fame o] reference
moving at the velocity v i with respect to the re_t frame of the lattice, the particles with
velocity v i are obviously stationary! Hence, there is no need to actually move the bits in
memory, as long as we keep track of the Gali!ean tran_]ormation. The coordinate x i of the
moving frame is related to the rest coordinate x by the transformation:
x i = x - vit (4)
Suppose the space-time point (x,t) corresponds to (x_,t), then the point (x + v_,t + 1)
corresponds to ((x + v _) - v_(t + 1),t + 1) = (x - v_t,t + 1) = (x_,t + 1). Hence, (1) and
(2) can be written as
b'(xi, t) - C(b(xi,t)) (5)
b,(x',t + 1) - b',(x',t) (6)
If we interpret x i as the physical addre_ used to address memory module i, then x
can be treated as the virtual addreJ_. Equation (6) says that we do not have to move
the bits-at all in the-propagation-p-base. We refer to-this technique as:vlrtual move.
Tile-cost of implementing Vlr[uaI move is to have a slightly more Complicated address
generation scheme. For each virtual address x, we need to generate n physical addresses,
x i (i - 1,..., n) according to (4). However, only one address has to be generated per cycle,
if we access one bit plane at a time. Multiple bit planes can be stored in the same memory
space by interleaving.
4.2.2 Multi-dlmensional Modulo Adder
The transformation (4) requires D modulo subtractions for each v i. How can one proceed
to implement the address generators in hardware?
Suppose we have a wrap-around lattice space of dimension D, implied by the basic
type of periodic boundary condition (see section 2). Equation (4) can be written as
t (a 1,.. D)_i _ (x_-v_)modn_ - .
- (z_, + (-v_t mod n_)) rood n_
= (xo + dis(t)) mod n_ (7)
where n_ is the length of z_-dimension, and
d'_(t) = -v_t mod ,_ (8)
|
Z
=
m
E
i
w_
3rd NASA Symposium or, VLSI Design 1991 7.2.9
Note that di_ has to be recomputed only once per time step by addition:
dis(t + 1) = (dis(t) + (-vi_ mod ns)) rood ,_s
In order to use conventional RAM, we need to map x (x i) to a linear address.
choose the conventional one-to-one mapping
A: <0,1,...,nl} x <o,1,...,n2} × <o,1,...,n,_ - 1>_ {o,1,...,,_1,_,...n_, - 1}
(9)
We
such that D-1
A(x) --- A((za,z2,...,ZD)) = zt + z2n_ + ... + zo H ns (10)
s=l
ES=I ms = m.Assume that all ns's are powers of 2, such that ms = log2 ns and D The
mapping A can then be performed trivially by concatenating the binary representations of
x such that z_+l is on the left of xs. Similarly, we can obtain the linear address of di. Let
a = A(x), and b = A(d'), and define e as
0 if j = _=x m_ for some a e [0, O - 1] (11)ej = 1 otherwise
The purpose of e is to mark the boundary bits of dimensions so that carry-out from lower
dimensions would not be propagated to higher dimensions. The value of e does not change
during a simulation.
We can calculate all D components of x i according to (7) in one step by usiftg a multi-
dimensional modulo adder, which takes three m-bit inputs, a, b, and e, and computes the
sum as s = A(xi). The adder can be built according to the new definitions ofpl, propagate,
and sl, sum:
p, = (a, v _,) e, (12)
s_ = ai obl @ ciei (13)
and our usual definitions of gi, generate, and ci, carry:
gi = ai bi (14)
Co --- 0 (15)
ci+l = giVpici (16)
Hence, this modified adder can be implemented in various ways, such as ripple carry
adder, carry lookahead adder, or carry select adder, as deemed appropriate for the system
requirement and implementation technology.
4.2.3 Example
Let us illustrate the idea by a small example. Suppose we have the 2-D square lattice
with n = 4 bits per node, and v 1 = (1,0), v 2 = (-1,0), v 3 = (0,1), and v 4 = (0,-1).
7.2.10
b
4 5 6 7
t=O
0 1 2 3
t=l
....4 5 6 7 3 0 I 2
b3 bl b2 b4
Figure 5: State b{t propagation
tA(x)I
(o,o) o
(1,0) 1
(2,0) 2
(3,0) 3
(0,1) 4
_(1,1)
(2,1) 6
(3,1) 7
Figure 6: Address mapping A
A(x) I A(x_)
0 3
1 0
2 1
3 2
4 7
5 4
6 5
7 6
A(x 2)
3
0
5
A(x 3) A(x')
4
0
6 1 1
7 2 2
4 3 3
Figure 7: VirtuaI to real address translation for t = i
3rd NASA Symposium on VLSI Design 1991 7.2.11
bit
position
3210
_ e
_ a (x=(3,0))
(d2=(l, O) )
(x2=(O,O))
Figure 8: Operation of a multi-dimensional modulo adder.
 TransposeIBuffer
Collision
Unit
I
Figure 9: Operation of the transposer
The lattice (problem) space has 8 nodes with nl = 4 and n2 = 2. In figure 5, each box
represents a physical memory location, the linear address mapping of the coordinate of
which is given in Figure 6, as calculated by (10). At t = 0, the virtual address x and
physical addresses x; (i = 1,2,3,4) for any given node are the same. At t = 1, they axe
different, as governed by (4) and shown in Figure 7. The numeric label within each box
represents a binary value. Suppose we are interested in the bit plane of b2. The label "3"
of the b boxes represents the binary value of b2 at A(x) = 3 just after collision at t = 0. At
t = 1 just before collision, the label "3" of the b2 boxes is at a different location, because
the bit has been moved to the left by one position, where A(x) = 2. However, the move
can be avoided if the virtual address 2 is somehow translated into the physical address 3, so
that access to A(x) = 2 becomes access to A(x 2) = 3. Figure 8 shows how the translation
can be computed by a multi-dimensional modulo adder.
4.3 Transposer
It contains 2 transpose buffers, to be filled and emptied alternately. Figure 9 shows the
operation of the transposer. It affects memory addressing, data structure to store the array.
7.2.12
04 05 06 07 14 15 16 17
b2 ' t=O
O00l 02 03 I0 ii 12 13
05 06 07 14 I 15 I 16 I 17 I 04 ]
b2 _ t=l
o_ 02 o3 1o i_11_11_iooI
Partition 0 Partition i
Figure 10: Propagation of bits of b_ across partitions
During the i-th cycle, t_e i-th b{ts o_ ihe k words are read and shihed into a transpose
buffer. At the end of the n cycles, the n bits of one node are shifted out per cycle. A
collision unit takes the bits as input, and the output is written back to the buffer. It acts
as a circular shift register. W_th mulfipie (u) co_sion units, multiple updates (co_sions)
can be executed in parallel per cycle. This update continues for k/u cycles until all k
nodes are processed. They are then shifted out bit-by-bit to memory. While one buffer is
busy acting as an n-wide circular shift register to serve the collision units, the other can
be emptied and refilled just in time to take the turn, if k is chosen appropriately,
4.4 Switch
Updating a node at the border of a partition requires reading values from one or more
adjacent partitions. We need to know when to select data bits from which partitions.
According to (2), we know where the ncighboring nodes arc in the problem _pace:
b_(x,t + 1) = b_(x - v',t) (17)
Let us define three coordinate sy-stems, namely, the global, partition, and local coordinates
such that they satisfy the following relationship:
x G = Px p + X L (18)
Where P {s a diagonal matrix with p,, = n,, and the following conditions are satisfiedi
O< p O< c0<:x,,<n_, _z_ <p_, x_, <n,_p_ (19)
Alternatively, we can write for any a
L mod naa mod n.p° n_(z P modp_) + z_ (20)
Then we can show that for any a,
(z_ v'=)modn_,p, = no((z_+b0und(0, z_- ' n :t ': (21)-- ' v., _,)) mod p_,).{-: (xo - v*_) mod n_
i
z
=
Z
|
±
E
R
t
i
I
3rd NASA Symposium o,,VLSI Design 1991 7.2.13
where bound is defined as
bound(L,k,U) = {
-I ifk<L
0 ifL<k<U
1 ifk>U
(22)
• L _ no) is either non-negative or non-positive. Hence,For any given v', bound(0, x_, - v_,
it is only necessary to distinguish whether the value is zero or non-zero.
In equation (21), the partition coordinate determines the source partition, and the
local coordinate determines the bit within a partition. Since the machine is synchronous,
L have the same value for all partitions. This exactly matches theat any one time, all x_
requirement implied by (21).
Figure 10 shows an example. It shows the data distribution at t = 1 for b2 for the same
example shown in Figure 5. The first digit of a label represents the partition coordinate
mapping, while the second one represents the local coordinate mapping.
The function bound can be computed as a carry-out of the multi-dimensional modulo
adder. Let a = A(xL), d_ = -v i,,, and e as defined before, as the inputs to a multi-
L i n_)l is exactly the carry-outdimensional modulo adder, then for each c_, Ibound(0, x,, -vo,
bit which is to be blocked so that it will not flow across the dimension boundary. In
Figure 8 the carry bit c2 = 1 indicates that the local virtual address 3 of a partition maps
to the local physical address 0 of its adjacent partition, as shown in Figure 10.
5 Summary
We have outlined a number of unique architectural features of a very high performance
pipelined array processor dedicated to lattice gas simulation. The architecture is truly
scalable in the sense that it achieves linear speedup for both fixed and increasing problem
sizes with more processors. It is necessary and possible to take advantage of the special
properties of the application to design application-specific computers that are a thousand
times more powerful than existing supercomputers.
The driving limitation of ALGE is memory bandwidth. This situation becomes more
severe as the processing elements run faster and the clock cycle gets shorter. This may be
an ideal project for the use of high density mounting and packaging technology such as
multiple chip modules.
Current work is focusing on resolving finer issues of design and implementation with
the goal of building a prototype system.
The promise of powerful VLSI processors for digital wind tunnels opens up the potential
for desk-top and onboard applications.
References
[1] Andre Clouqueur and Dominique d'HumiSres. R.A.P., A Family of Cellular Automa-
ton Machines for Fluid Dynamics. Helvetica Physica Acta, 62:525-541, 1989.
7.2.14
[2] K. Diemer, K. Hunt, S. Chen, T. Shimomura, and G. Doolen. Density and velocity
dependence of reynolds numbers for several lattice gas models. Lattice Gas Methods
for Partial Differential Equations, pages 137-177, 1990.
[3] Gary D. Doolen, editor. Lattice Gas Methods for Partial Differential Equations, vol-
ume IV of Santa Fe Institute Studies in the Sciences of Complezity. Addison-Wesley,
i990. _ .............
[4] Uriel Frisch, Dominique d_Eumi_res, Brosl Hasslacher, Pierre Lallemand, Yves
Pomeau, and Jean-Pierre Rivet. Lattice Gas Hydrodynamics in Two and Three Di-
mensions. Oomplez Systems, 1(4):649-707, 1987.
[5] Michel H6non. Isometric Collision Rules for the Four-Dimeusion_ FCH_C Lattice Gas..
Complez Systems, 1(3):475-494, June 1987.
[6] Steven D. l_ugelmass. Architectures for Two-Dimensional Lattice Computations with
Linear _¢peedup. PhD thesis, Princeton University, June 1988.
[7] Fung_ F. Lee and Michael J. Flynn. Architectural Mechanisms to Support Three-
Dimensional Lattice Gas Simulations. Third Annual A CM Symposium on Parallel
Algorithms and Architectures, pages 115-122, July 1991.
[8] Fung F. Lee, Michael J. Flynn, and Martin Mort. A VLSI Architecture for the FCHC
Isometric Lattice Gas Model. Technical Report CSL-TR-90-426, Computer Systems
Laboratory, Stanford University, April 1990.
[9] Fung F. Lee, Michael J. Flynn, and Martin Mort. Design of Compact High Perfor-
mance Processing Elements for the FCHC Lattice Gas Models. Proceedings of the
Fifth SIAM Conference on Parallel Processing [or Scientific Computing, March 1991.
[10] Norman Margolus and Tommaso Tofl'oli. Cellular Automata Machines. Lattice Gas
Methods for Partial Differential Equations, pages 219-249, 1990.
[11] Jean-Pierre Rivet, Michel HSnon, Urlel Frisch, and Dominique d'Humi_res. Simulat-
ing Fully Three-Dimenslonal External Flow by Lattice Gas Methods. Proceedings of
the Workshop on Discrete Kinetic Theory, Lattice Gas Dynamics and Foundations o/
Hydrodynamics, pages 276-285,September !988.
[12] Tommaso Toffoli and Norman Margolus: Cellular Automata Machines - A New En-
vironment for Modeling. MIT Press, 1987.
|
'i
E
Z
Z
w
m
