Chapter 5 Streaming Supercomputer by The Pennsylvania State University CiteSeerX Archives
Chapter 5
Streaming Supercomputer
The goal of the streaming supercomputer project is to develop a high-performance computer sys-
tem, hardware and programming system, that achieves two orders of magnitude improvement in
performance/cost compared to today's supercomputers based on clusters of SMPs. The stream-
ing supercomputer achieves this cost performance advantage through the combination of a stream
processor, which increases the arithmetic intensity of a computation1, and a high-performance
interconnection network, that e±ciently provides global bandwidth.
Over the past year we have established the feasibility of the streaming supercomputer concept
and have laid the groundwork for the architecture and programming system of this machine. We
have completed a straw-man speci¯cation of the architecture, have constructed a cycle-accurate
simulation of this architecture, and have identi¯ed the key issues that must be resolved before we
can complete a detailed architecture. In parallel with this architecture work, we have developed
a new stream programming language, Brook, and have coded three model applications in this
language. We have constructed a prototype stream programming system that maps Brook to a
machine-independent stream virtual machine (SVM), and have constructed a rudimentary back-
end compiler that maps these programs from the SVM to the proposed streaming supercomputer
instruction set. Preliminary results of running our model applications on a single-node streaming
supercomputer simulator have established the feasibility of our approach and its ability to sustain
high-levels of performance.
Now that feasibility is established, we plan to resolve key design issues with the architecture and
programming system over the next ¯scal year leading to the goal of generating a ¯nal architecture
speci¯cation and a usable Brook compiler in FY04. We plan a series of point architecture studies
to resolve issues of register organization, ALU mix, inter-node operations, aspect ratio, remote
operations, and network design. We plan to develop a new compiler infrastructure for our Brook
programming system, port the programming system to this infrastructure, and implement deeper
analyses to enable a number of important program optimizations - such as kernel fusion and kernel
¯ssion. We also plan to carry our model applications to the next step with 3-D implementations
and more detailed models.
5.1 Stream Processing
With modern VLSI technology, arithmetic operations (FLOPS) are very inexpensive but global
bandwidth is costly. In a contemporary 0.13¹m CMOS process, a 64-bit °oating-point unit that
can operate at 1GHz takes less than 1mm2 of chip area. Hundreds of such arithmetic units can be
placed on a single chip. While 100GFLOPS of arithmetic performance can be realized on one chip,
1The arithmetic intensity of a computation is the ratio of computation to memory bandwidth.
125bandwidth o® of the chip is limited to 5-10 GWords/s and on-chip global bandwidth is limited to
about 50GWords/s. Conventional CPUs are limited by global bandwidth (on and o® chip) and are
thus unable to exploit the potential of VLSI technology for realizing 100+ GFLOPS chips.
Recent developments enable streaming architectures that e±ciently convert the capabilities of
emerging technology into realized performance on scienti¯c applications. Expressing programs in
the stream programming model, as streams of records passing through arithmetic kernels, exposes
parallelism and locality in the application. A stream architecture provides large numbers of arith-
metic units to exploit the parallelism exposed by the stream model. More importantly it provides a
register hierarchy that can exploit the kernel locality and producer-consumer locality exposed by the
stream model to greatly reduce the demand on global bandwidth. While the locality exposed by
the stream model can be applied to reducing bandwidth demands in a conventional architecture, a
stream processor with a large number of arithmetic units and a richer register hierarchy is required
to reap the full bene¯t.
We envision a streaming supercomputer that delivers orders of magnitude more performance
per dollar than clusters of servers ($20 per GFLOPS and $2 per million GUPS2) and is scalable to
a machine with one PetaFLOPS of peak performance and 1013 GUPS. We expect that applications
will achieve a large fraction of the peak FLOPS (at least 25%) on arithmetic-limited code sections
and a large fraction of peak GUPS on memory-limited code sections.
Streaming supercomputers with this level of performance and e±ciency are made possible by
the con°uence of three recent innovations: stream architecture, high-speed signaling, and e±cient
interconnection network architecture. As described above, stream architectures achieve a high
degree of arithmetic intensity, that is with a high ratio of arithmetic to memory bandwidth. Streams
also o®er an easy way to hide the inherent latency of global memory references. Stream architectures
have been proven on signal- and image-processing applications. Our initial investigations show that
they are equally applicable to a broad class of scienti¯c applications. High-speed signaling and
e±cient network architecture together enable economical memory systems with very high global
bandwidth.
The streaming architecture we envision leverages commodity technology to economically achieve
high performance. It does not, however, use commodity processors. Processors, in fact, are not
really a commodity - they are not interchangeably available from multiple vendors at prices close
to cost. Commodity technology is leveraged in three ways. First, the main memory of the system
is built entirely out of commodity high-bandwidth memory chips. Such memory chips truly are
a commodity - they are available in volume from a number of di®erent suppliers at competitive
prices. Second, the streaming processor chips are fabricated using a standard CMOS process. As
with memories, CMOS wafers are a commodity - being available in volume from multiple ven-
dors. Finally, the system interconnect is constructed using o®-the-shelf connectors and backplane
technology.
Realizing the performance of a streaming supercomputer, of course, requires converting appli-
cations to a stream programming model. This programming model makes communication explicit,
making it easy for the software tools to e±ciently map the application to a streaming architecture.
5.2 Streaming Supercomputer Architecture
The streaming supercomputer is a shared memory parallel computer with up to 16K single-chip
stream processing nodes. An overall block diagram of a streaming supercomputer system is shown
in Figure 5.1. Each node is a 128GFLOPS stream processor with 2GBytes of local DRAM memory
that is accessed with 16GBytes/s of local memory bandwidth. The system is built from building
2A GUPS (global updates per second) is the number of single word memory references to random locations across
its entire memory space that a machine can support per second.
126Bandwidth (GBytes/s)
Region Per Node Total
Node 16 16
Board (16 Nodes) 16 256
Backplane (512 Nodes) 4 2,048
System (16K Nodes) 2 32,768
Table 5.1: Per-node and total memory bandwidth by region for the streaming supercomputer.
blocks that are scalable resulting in useful machines ranging in size from a single node (128GFLOPS)
to a single-board computer (2TFLOPS) to a 32-backplane 2-PetaFLOPS machine. The performance
and performance/cost of the streaming supercomputer derives from the bandwidth hierarchy within
each node, which supports high arithmetic intensity, and the interconnection network that provides
nearly °at memory bandwidth across large streaming supercomputer systems.
The streaming supercomputer provides nearly °at memory bandwidth across large con¯gura-
tions of nodes as illustrated in Figure 5.1 and summarized in Table 5.2. The high bandwidth
interconnection network provides completely °at memory bandwidth between the 16-nodes on a
board - every node can simultaneously maintain 16GBytes/s of memory bandwidth accessing ran-
dom locations on the board. Across a 32-board backplane (512 nodes) per-node bandwidth is cut
by a factor of four with each node able to maintain 4GBytes/s of global memory bandwidth within
a backplane. Even in a large system with 32 backplanes (16K nodes), global memory bandwidth is
still 2GBytes/s per node, one eighth of the local bandwidth. This amounts to over 32TBytes/s of
total global memory bandwidth in a 16K-node system, all of which may cross the bisection of the
network.
While the memory system of the streaming supercomputer is designed to meet the bandwidth
demands of the stream processing nodes, it could also be used to provide a high-performance
memory system for conventional scalar or vector processors. Details on the interconnection network
that provides this bandwidth are described in Section 5.2.3.
Much of the performance/cost advantage of the streaming supercomputer is due to its ability
to sustain high arithmetic intensity - i.e., to sustain high arithmetic rates while using modest
amounts of bandwidth. The locality required to achieve this high arithmetic intensity is exposed
by the stream programming model. To exploit this locality most e®ectively requires a bandwidth
hierarchy that provides very high arithmetic bandwidth with reduced bandwidth at each of several
levels of register and memory hierarchy.
The bandwidth hierarchy of a streaming supercomputer node is illustrated in Figure 5.2 and
summarized in Table 5.2. At the top of the hierarchy, the 240 dual-port local register ¯les pro-
vide almost 4TBytes/s of bandwidth necessary to support the 128GFLOPS of 64-bit arithmetic.
These local registers directly feed the inputs and accept the outputs of the 64, 64-bit °oating-point
multiply-adders in each node. The bandwidth then drops by nearly an order of magnitude to
512GBytes/s at the stream register ¯le which holds streams of data between execution kernels.
Bandwidth drops to 64GBytes/s at the stream cache which captures temporal locality in irregular
applications. The ¯nal bandwidth is 16GBytes/s at the DRAM memory system. Experiments show
that the bandwidth demands of many streaming applications closely match this bandwidth hierar-
chy. The details of the processing node that provides this hierarchy are described in Section 5.2.1
The streaming supercomputer is balanced in terms of cost and in terms of incremental per-
formance per unit cost. By several conventional notions of balance 2GBytes of memory capacity
would seem too little for a 128GFLOPS node. However, this amount of memory per node domi-
nates machine cost. Increasing the memory per node would substantially increase the cost of the
machine without increasing performance. If a larger memory capacity is needed, it is more e±cient
127S
￿
t
￿
r
￿
e
￿
a
￿
m
￿
P
￿
r
￿
o
￿
c
￿
e
￿
s
￿
s
￿
o
￿
r
￿
1
￿
2
￿
8
￿
G
￿
F
￿
L
￿
O
￿
P
￿
S
￿
1
￿
6
￿
 
￿
x
￿
 
￿
D
￿
R
￿
D
￿
R
￿
A
￿
M
￿
2
￿
G
￿
B
￿
y
￿
t
￿
e
￿
s
￿
1
￿
6
￿
G
￿
B
￿
y
￿
t
￿
e
￿
s
￿
/
￿
s
￿
1
￿
6
￿
G
￿
B
￿
y
￿
t
￿
e
￿
s
￿
/
￿
s
￿
3
￿
2
￿
+
￿
3
￿
2
￿
 
￿
p
￿
a
￿
i
￿
r
￿
s
￿
N
￿
o
￿
d
￿
e
￿
O
￿
n
￿
-
￿
B
￿
o
￿
a
￿
r
￿
d
￿
 
￿
N
￿
e
￿
t
￿
w
￿
o
￿
r
￿
k
￿
N
￿
o
￿
d
￿
e
￿
2
￿
N
￿
o
￿
d
￿
e
￿
1
￿
6
￿
B
￿
o
￿
a
￿
r
￿
d
￿
 
￿
1
￿
1
￿
6
￿
 
￿
N
￿
o
￿
d
￿
e
￿
s
￿
2
￿
T
￿
F
￿
L
￿
O
￿
P
￿
S
￿
3
￿
2
￿
G
￿
B
￿
y
￿
t
￿
e
￿
s
￿
I
￿
n
￿
t
￿
r
￿
a
￿
-
￿
C
￿
a
￿
b
￿
i
￿
n
￿
e
￿
t
￿
 
￿
N
￿
e
￿
t
￿
w
￿
o
￿
r
￿
k
￿
3
￿
2
￿
 
￿
R
￿
o
￿
u
￿
t
￿
e
￿
r
￿
s
￿
 
￿
(
￿
8
￿
 
￿
c
￿
a
￿
r
￿
d
￿
s
￿
 
￿
o
￿
f
￿
 
￿
4
￿
 
￿
r
￿
o
￿
u
￿
t
￿
e
￿
r
￿
s
￿
 
￿
e
￿
a
￿
c
￿
h
￿
)
￿
B
￿
o
￿
a
￿
r
￿
d
￿
 
￿
3
￿
1
￿
6
￿
4
￿
G
￿
B
￿
y
￿
t
￿
e
￿
s
￿
/
￿
s
￿
1
￿
2
￿
8
￿
+
￿
1
￿
2
￿
8
￿
 
￿
p
￿
a
￿
i
￿
r
￿
s
￿
T
￿
e
￿
r
￿
a
￿
d
￿
y
￿
n
￿
e
￿
 
￿
G
￿
b
￿
X
￿
B
￿
o
￿
a
￿
r
￿
d
￿
B
￿
a
￿
c
￿
k
￿
p
￿
l
￿
a
￿
n
￿
e
￿
I
￿
n
￿
t
￿
e
￿
r
￿
-
￿
C
￿
a
￿
b
￿
i
￿
n
￿
e
￿
t
￿
 
￿
N
￿
e
￿
t
￿
w
￿
o
￿
r
￿
k
￿
5
￿
1
￿
2
￿
 
￿
R
￿
o
￿
u
￿
t
￿
e
￿
r
￿
s
￿
B
￿
a
￿
c
￿
k
￿
p
￿
l
￿
a
￿
n
￿
e
￿
 
￿
2
￿
3
￿
2
￿
 
￿
B
￿
o
￿
a
￿
r
￿
d
￿
s
￿
5
￿
1
￿
2
￿
 
￿
N
￿
o
￿
d
￿
e
￿
s
￿
6
￿
4
￿
T
￿
F
￿
L
￿
O
￿
P
￿
S
￿
1
￿
T
￿
B
￿
y
￿
t
￿
e
￿
s
￿
1
￿
T
￿
B
￿
y
￿
t
￿
e
￿
s
￿
/
￿
s
￿
2
￿
K
￿
+
￿
2
￿
K
￿
 
￿
l
￿
i
￿
n
￿
k
￿
s
￿
R
￿
i
￿
b
￿
b
￿
o
￿
n
￿
 
￿
F
￿
i
￿
b
￿
e
￿
r
￿
B
￿
i
￿
s
￿
e
￿
c
￿
t
￿
i
￿
o
￿
n
￿
 
￿
 
￿
3
￿
2
￿
T
￿
B
￿
y
￿
t
￿
e
￿
s
￿
/
￿
s
￿
A
￿
l
￿
l
￿
 
￿
l
￿
i
￿
n
￿
k
￿
s
￿
 
￿
5
￿
G
￿
b
￿
/
￿
s
￿
 
￿
p
￿
e
￿
r
￿
 
￿
p
￿
a
￿
i
￿
r
￿
 
￿
o
￿
r
￿
 
￿
f
￿
i
￿
b
￿
e
￿
r
￿
A
￿
l
￿
l
￿
 
￿
b
￿
a
￿
n
￿
d
￿
w
￿
i
￿
d
￿
t
￿
h
￿
s
￿
 
￿
a
￿
r
￿
e
￿
 
￿
f
￿
u
￿
l
￿
l
￿
 
￿
d
￿
u
￿
p
￿
l
￿
e
￿
x
￿
Figure 5.1: Block diagram of a streaming supercomputing system
128DRAM￿
DRAM￿
Stream￿
Cache￿
Bank 0￿
Stream￿
Cache￿
Bank 7￿
Stream￿
Register￿
File￿
Func.￿
Units￿
Func.￿
Units￿
Cluster 0￿
Cluster 15￿
Local Register File￿
16 GB/s￿ 64 GB/s￿ 512 GB/s￿ 3,840 GB/s￿
Figure 5.2: Bandwidth hierarchy of a streaming supercomputer node.
Level Bandwidth
(GBytes/s)
Local Register Files 3,840
Stream Register File 512
Stream Cache 64
DRAM Memory 16
Table 5.2: Bandwidth hierarchy of a streaming supercomputer node.
129to increase the number of nodes, than to increase the memory per node. The additional arithmetic
units increase performance substantially with very little increase in cost.
Similarly, by conventional measures of balance a memory bandwidth of 16GBytes/s (2GWords/s)
would seem too small for a node with 128GFLOPS. However, again this ratio is cost balanced. In-
creasing the memory bandwidth per node would require multiple chips to interface to the memory,
substantially increasing cost. It is more e±cient to populate each chip in the system with inexpen-
sive arithmetic units, i.e., make each chip a node, than to use costly chips just to expand memory
bandwidth. Put di®erently, if an application demands more memory bandwidth, it is more e±cient
to provide this bandwidth by increasing the number of nodes than to increase the bandwidth per
node.
5.2.1 Stream Processor Node Architecture
Figure 5.3 is a block diagram of the stream processor chip that, together with 16 DRAM chips, forms
a node of the streaming supercomputer. A stream processor consists of three independent modules:
the scalar processor, a standard RISC processor core; the stream processor, a set of 16 statically
scheduled SIMD clusters of °oating-point units controlled by a programmable microcontroller, and
the memory system, that performs stream load and store instructions. The scalar processor fetches
instructions for all three modules from a single instruction stream and dispatches instructions to
the stream processor and the memory system via a pair of instruction queues.
Stream Register File￿
Arith.￿
Cluster￿
0￿
Arith.￿
Cluster￿
15￿
Micro-￿
controller￿
Scalar￿
Processor￿
Scalar￿
Cache￿
Memory System￿ Network￿
DRAM￿ DRAM￿
Inter-cluster Crossbar￿
Stream￿
Controller￿
Figure 5.3: SSS Stream Processor
The scalar processor is a conventional 64-bit RISC microprocessor whose instruction set has been
extended to include non-blocking stream instructions and whose register set has been extended to
access scalar registers and stream descriptor registers in the stream processor. The scalar processor
has °oating point support and both L1 and L2 caches which are coherent across SSS nodes; the
cache coherence is implemented by restricting the caches to only storing addresses local to the node
and then snooping all memory requests received from remote nodes. The scalar processor's 64-bit
130address space can be con¯gured to cover some or all of the global shared memory across all the
SSS nodes.
Stream instructions, which include loading and storing streams and invoking kernels, operate
at the granularity of streams of data rather than individual words, and are dispatched by the scalar
processor accompanied by explicit dependency information. The execution of the stream instruc-
tions is handled by the stream controller, which keeps a scoreboard of pending instructions and
their dependencies as well as available resources; a stream instruction is issued to the stream hard-
ware when its dependencies are met and the resources it requires are free. Allowing a single stream
instruction to specify operations on an entire stream of data makes the stream execution units
memory-latency insensitive by amortizing the latency of memory access over the many (possibly
thousands) of individual words accessed by a single stream load or store.
The arithmetic intensity of the stream processor derives from the 16 SIMD clusters each of which
contains 4 °oating-point multiply-add units. The data-level parallelism exposed by the stream
programming model is exploited by having all 16 clusters operate on di®erent stream elements in
parallel, and instruction-level parallelism is exploited by the use of multiple functional units within
each cluster.
MUL￿
ADD￿
Comm.￿
Unit￿
Intra-cluster Interconnect￿
SRF￿
Inter-cluster￿
Crossbar￿
MUL￿
ADD￿
MUL￿
ADD￿
MUL￿
ADD￿
SQRT, INV￿
Lookup￿
Figure 5.4: SSS Cluster Architecutre
Figure 5.4 depicts the architecture of a single SSS cluster. Each cluster consists of 4 fully-
pipelined 64-bit °oating point multiply-add units that compute a £ b + c for input arguments a,
b, and c, a lookup table that gives a starting point for the iterative computation of 1=x,
p
x, and
1 p
x, and an inter-cluster communication unit. Each cluster also contains 15 2-ported 64-word local
register ¯les (LRFs), one on each input of each functional unit. A local interconnect within the
cluster enables data to be transferred from the output of any function unit to the input of any
LRF. With all 16 clusters operating with perfect occupancy, the peak arithmetic performance of
the stream processor is 128 FLOPS/cycle, or 128 GFLOPS at the target clock rate of 1 GHz. At
this frequency, the e®ective bandwidth provided by the LRFs to the functional units across the 16
clusters is 576 words/cycle or 4,608 GBytes/s. The 16 clusters are connected to each other by a
full 16x16 crossbar; the inter-cluster communication unit in each cluster can send and receive one
word of data over the crossbar each cycle.
131The arithmetic units and LRFs in each cluster are controlled by a VLIW microprogram that
is sequenced by the microcontroller in response to a kernel execution instruction from the stream
controller. Each cycle, the microcontroller broadcasts a VLIW instruction across the 16 clusters
in a SIMD manner. The VLIW programs corresponding to each kernel of a stream program are
explicitly staged into the microcode store by stream instructions dispatched by the scalar processor.
In addition to instructions which con¯gure the functional units in the clusters, the microcontroller
supports both conditional and unconditional looping constructs, although arbitrary branches are
not supported in the kernel program due to the SIMD nature of the clusters. The arithmetic
units in the clusters do support a select operation, enabling the use of if-then type constructs via
predicated execution, and the clusters support conditional stream operations.
The stream register ¯le (SRF) { a 512 KB single-ported SRAM { is used to stage stream data
for the clusters of function units. Each cluster can only access its own dedicated bank of the
SRF. Stream memory instructions specify the transfer of data between the SRF and memory, and
the kernel program executed by the microcontroller speci¯es transfers between the SRF and the
clusters of ALUs. A portion of each bank of the SRF also serves as a local scratch-pad memory for
its associated cluster.
A set of 15 128-word stream bu®ers (SBs) handle all °ow control and data transfers between
the SRF core and its clients. Stream bu®er access to the SRF is time-multiplexed but very wide,
e®ectively allowing the single 64-word-wide SRF port to function as several narrower ports; this
feature takes advantage of the fact that stream accesses to the SRF are sequential. Logically, the
stream bu®ers function as FIFOs between the SRF and its clients. Additionally, by allowing the
stream bu®ers to be independant across clusters, conditional streams can be e±ciently supported.
Each cluster has an e®ective bandwidth of 4 words/cycle to its own bank of the SRF, acheiving an
aggregate bandwidth of 512 GB/s at a clock rate of 1GHz.
5.2.2 Memory System Architecture
The memory system of the streaming supercomputer provides high-bandwidth access to a °at
address space to all of the streaming processors and scalar processors in the system. The memory
system provides cache bandwidth of 8 words per cycle (64GBytes/s) on each node and DRAM
memory bandwidth to random locations of 2 words per cycle (16GBytes/s) on each node. Single
word access to remote memory locations is made via the interconnection network (Section 5.2.3).
Figure 5.5 shows a block diagram of one node of the streaming supercomputer memory system.
The system serves memory requests from both the scalar processor and the stream processor.
Stream memory operations are initiated by the two address generators that convert a stream
request into a number of individual word requests using either strided or indexed addressing. Each
address generator generates four individual word requests per cycle. Together they match the cache
bandwidth of eight words per cycle. All requests are routed through a request switch that directs
the request to one of eight stream cache banks based on bits 5-7 of the address. If a reference hits
in a stream cache bank, the requested data (for a read) is returned to a memory reorder bu®er via
the reply switch. The reorder bu®ers assemble replies as they arrive, possibly out of order, and
transfers them to the SRF when a block of contiguous replies have all arrived.
If the request misses in the stream cache bank and the address is in the local DRAM, the request
is forwarded to the DRAM bank associated with this stream cache bank. After reading the data,
it is returned to the reorder bu®er. Data read from the local DRAM may be cached in the stream
cache depending on the options of the stream load instruction and the segment descriptor.
If the request is either uncacheable or misses in the cache and is not local to the current node,
it is forwarded to the network interface. The interconnection network delivers the request to the
addressed node where the request switch distributes it to the proper stream cache bank. The request
is then handled as above with the data being fetched either from the stream cache bank (a hit) or
132Address￿
Generator￿
Reorder￿
Buffer￿
Address￿
Generator￿
Reorder￿
Buffer￿
Network￿
Interface￿
SRF￿ SRF￿
Memory System￿
Request Switch￿ Return Switch￿
Memory￿
Bank 0￿
Memory￿
Bank 7￿
Off-Chip￿
DRAM￿
Memory￿
Network￿
Interface￿
Scalar￿
Cache￿
Figure 5.5: Block diagram of one node of the the streaming supercomputer memory system
from the DRAM on the addressed node. Note that uncacheable data can be held in the stream
cache bank on the same node where the data is present in DRAM since there are no problems with
coherency with such remote-side caching. After the remote memory request is satis¯ed - either
from the cache or local DRAM, the result is returned to the reply switch which routes it to the
network interface. The network returns the result to the requesting node where the result is stored
in the reorder bu®er via the reply switch.
The purpose of the stream cache is to increase available memory bandwidth for access streams
with temporal locality. Stream operations are inherently latency insensitive so there is no need
for a cache to reduce latency. The stream cache has a two-word block size to take advantage of
the higher DRAM bandwidth on sequential accesses without penalizing random single-word access
bandwidth.
Each SSS node has 2 GBytes of external memory distributed across 16 1Gbit DRAM chips.
These chips may be DDR SDRAM or Rambus DRDRAM. Depending on the ¯nal choice of DRAM
chip, each node will have 38-40GBytes/s of sequential memory bandwidth and at least 16GBytes/s
of random memory bandwidth. Each memory bank has a DRAM interface which schedules memory
accesses to optimize use of the DRAM row bu®ers and banks.
SSS memory is protected via segmentation to allow space shared multiple jobs to be run si-
multaneously without the overhead associated with TLBs and paged memory. The segmentation
system also interleaves addresses across nodes in a programmable manner. Segments can be inter-
leaved with a granularity ranging from 1 word to the entire segment in powers of two. All memory
accesses are translated from virtual memory addresses to physical memory addresses via a set of 32
segment registers as shown in Figure 5.6. Each segment register speci¯es the segment length, the
subset of nodes over which the segment is mapped, the interleave factor for the segment and the
cache state of the segment (e.g., cacheable, read-only, private, etc...). The segmentation translation
is performed in the address generators and by the scalar processor.
The stream caches uses segment cache states to allow the program to specify periods during
which a segment is read-only and periods when a segment is private to enable e±cient caching of
shared data without the overhead of a full cache coherence protocol. For example, during many
calculations the current state of a simulated system is read during most of a timestep and then
updated at the end of the timestep. By specifying this data to be read-only during the computation
portion of the timestep and then switching it to read/write for the update, this data can be freely
cached during the computation with no overhead. Similarly the program can specify periods when
a segment is private to a particular node. During these periods, the node can read and write the
private segment in its cache without the need for protocol messages. Typically each node will
map a di®erent portion of a larger data set as a private segment and use this mapping during the
133Node￿ PhysAddr￿ reserved￿ PA:￿
PhysOff￿
+￿
SegAddr￿ SegNum￿ reserved￿ VA:￿
PhysOffHi￿ NodeOff￿ PhysOffLo￿
+￿
n
￿
0
￿
C
￿
a
￿
c
￿
h
￿
e
￿
P
￿
h
￿
y
￿
s
￿
B
￿
a
￿
s
￿
e
￿
 
￿
(
￿
s
￿
h
￿
i
￿
f
￿
t
￿
e
￿
d
￿
 
￿
r
￿
i
￿
g
￿
h
￿
t
￿
 
￿
3
￿
)
￿
n
￿
1
￿
S
￿
R
￿
 
￿
[
￿
 
￿
S
￿
e
￿
g
￿
N
￿
u
￿
m
￿
 
￿
]
￿
:
￿
P
￿
h
￿
y
￿
s
￿
L
￿
e
￿
n
￿
N
￿
o
￿
d
￿
e
￿
B
￿
a
￿
s
￿
e
￿
0￿ (40+n1-n0)￿ 59￿ 63￿
n1￿ (n1-1)￿ n0￿ (n0-1)￿ 0￿ (40+n1-n0)￿
0￿ 40￿
0￿ 40￿ 50￿ 63￿
0￿
3￿
7￿
3￿
8￿
4￿
3￿
4￿
4￿
4￿
9￿
5￿
0￿
6￿
3￿
6￿
4￿
6￿
8￿
6￿
9￿
Figure 5.6: Segmentation
134update portion of a timestep. To support transitions between these cache states, the stream cache
supports write-back of private data, selective invalidation of read-only data, and cache cleaning to
write-back dirty data without invalidation.
The network and memory system incorporate a set of synchronization mechanisms to simplify
the task of coordinating operation between the nodes. Each memory word contains state bits for
safety and coherency. Each memory store and load operations changes the memory state and
illegal operations are detected. Atomic remote operations including fetch & add, add & store, and
compare & swap are also implemented in the memory banks to permit common synchronization
constructs to be implemented without traversing the network multiple times. More complex remote
operations can be implemented using a memory-mapped message send that is received by a message
handling thread on the remote scalar processor.
The streaming supercomputer memory system supports °oating-point and integer add & store
operations across multiple nodes at full cache bandwidth. Performance on add & store, which sums
its argument into a memory location, is important to applications like ¯nite-element codes in which
several elements - possibly on di®erent nodes - all sum contributions into the state of a vertex. To
support such operations, the stream cache supports an add & store mode in which an entry is
written on the ¯rst add & store, subsequent add & store operations sum their values into the entry,
and ¯nally, when the entry is evicted from the cache, it is summed back to memory, rather than
overwriting the memory location. This method allows multiple nodes to simultaneously sum into
a single memory location all operating at their full cache bandwidth.
5.2.3 Interconnection Network
The interconnection network of the streaming supercomputer provides nearly °at memory band-
width across up to 16K stream processing nodes. The inherent latency hiding of stream memory
operations provides su±cient simultaneous outstanding memory accesses to fully exploit the band-
width of the network.
The network is designed with conservative technology assumptions. All of the technology re-
quired to build the streaming supercomputer network: the signaling technology, the connectors,
the parallel optical transceivers, exists today (in 2002). We expect that by the time we build this
machine in 2005-6, signi¯cantly better technology will be available.
The streaming supercomputer network uses a folded Clos topology with concentration as illus-
trated in Figure 5.7. This topology enables the network to make the most e±cient use of inexpensive
high bandwidth electrical signaling for short, on-board and in-cabinet, connections reserving the
use of optical interconnect for the inter-cabinet routing. This topology also allows us to smoothly
vary the global bandwidth of the system by varying the number of routers on the board and the
degree of concentration at each router. The folded Clos topology has the further advantage that
it is inherently fault tolerant. The network gracefully routes around one or more failed routers or
channels.
Each line in the ¯gure denotes one or more channels each channel has a raw bandwidth of
2.5GBytes/s and consists of four 5Gbit/s signals. At the board and backplane level each signal
is carried as an electrical di®erential pair over a di®erential strip-guide. At the system level, each
signal is carried over a ¯ber-optic link. The 5Gb/s electrical and optical signaling required by the
network is available today. The connector, board, and backplane densities necessary to realize this
interconnect are modest.
All of the switching in the network is performed by a single component type, a 48-channel
router. This router accepts 48 2.5GByte/s input channels and switches packets any one of these
input channels to any one of 48 2.5GByte/s output channels.
The network consists of three levels of hierarchy - board, backplane, and system. The bottom
level of the hierarhcy, the board level, connects the 16 stream processors on a board together and
135Node￿
0￿
Router￿
0￿
Node￿
15￿
Router￿
3￿
Board 0￿
Node￿
0￿
Router￿
0￿
Node￿
15￿
Router￿
3￿
Board 32￿
Backplane 0￿ Backplane 32￿
Router￿
B0￿
Router￿
B7￿
Router￿
B31￿
2￿ 2￿ 2￿ 2￿ 2￿ 2￿ 2￿ 2￿
OE/EO￿
16￿
Router￿
S0￿
Router￿
S511￿
Router￿
S15￿
OE/EO￿
16￿
OE/EO￿ OE/EO￿ OE/EO￿
System Interconnect Cabinet￿
OE/EO￿
16￿
16￿ 16￿ 16￿
OE/EO￿
512￿
Figure 5.7: The streaming supercomputer network employs a hierarchical folded Clos topology.
Each 2.5GByte/s channel consists of 4 5Gb/s signals. All routing is performed by a single compo-
nent type, a 48-port °it-reservation router.
136to the backplane. At the board level, each of 16 stream processor chips is connected to each of
four router components by two independent full-duplex channels. Each stream processor has eight
memory channels for a total bandwidth of 20GBytes/s and each router terminates 40 channels, 2
from each of 16 nodes, and 8 channels that run o® the board to the backplane switch. This provides
a 4:1 concentration reducing the per-node bandwidth from 20GBytes/s on the board to 5GBytes/s
over the backplane. Each 16 processor board exports a total of 32 full-duplex channels for a total
of 80GBytes/s full-duplex bandwidth that is carried by 256 5Gbit/s di®erential pairs - 128 in each
direction.
The middle level of hierarchy, the backplane level, connects the 32 boards in a backplane
together and to the system level of interconnect. At the backplane level, each of 32 boards is
connected to each of 32 backplane routers, B0;:::;B31, by a single 2.5GByte/s channel. Router i
on each board is connected to routers B8i through B8i+7 on the backplane. Each backplane router
also connects to the system-level interconnect via 16 optical channels providing 2:1 concentration
between the backplane and the system-level interconnect. The per-node backplane bandwidth of
5GBytes/s is reduced to a per-node system bandwidth of 2.5GBytes/s (one channel per node) via
this concentration.
To keep the backplane completely passive for reliability and ease of servicing, the backplane
routers are mounted to the backplane via daughter cards, four routers on each card. The 16-
processor boards are connected to the backplane via high-speed backplane connectors, Teradyne
GbX or equivalent. To provide reliable 5Gb/s operation, the backplane is fabricated from a low
loss tangent material, such as Nelco 4000. The router daughter cards also include parallel opti-
cal/electrical (O/E) and electrical/optical (E/O) modules that convert sixteen signals per router to
optical form for transmission over ribbon ¯ber to the system interconnect cabinet. Each backplane
connects to the system interconnect cabinet via two 2K ¯ber bundles - one in each direction. Each
bundle includes a 64 ¯bers from each of the eight router cards.
The top level of the network hierarchy, the system interconnect, connects together up to 32
backplanes. The system interconnect cabinet terminates two 2K-¯ber bundles from each backplane
(one in each direction). All of these optical connections are straight across to simplify cabling.
Within the system interconnect cabinet the 512 4-¯ber channels to and from each backplane are
distributed across 512 routers S0;:::;S511. Fiber j from backplane i connects to the ith port
on switch Sj. The system-level routers are packaged four to a card using the same cards as the
backplane-level routers. Since the routers can forward all of the tra±c from on one half of the
machine (backplanes 0 through 15) to the other half (backplanes 16 through 31) and vice versa,
the total bisection bandwidth available is 32TBytes/s.
The network is tolerant to single point faults in either routers or channels. In fact, each stream
processor node is connected to four identical, parallel networks. Three of these four networks could
fail completely
The folded Clos con¯guration provides a great deal of °exiblity in varying the cost and band-
width of the network. For example, we can vary the number of routers on a stream processor board
from one to four to vary the intra-board bandwidth from 5GBytes/s to 20GBytes/s. We can vary
the degree of concentration at the board routers - from the current 4:1 to as much as 32:1 or as
little as 2:1. Similarly we can vary the number of routers at the backplane level (by changing the
concentration at the board level) and vary the concentration at the backplan level - from the current
2:1 to as much as 32:1. This degree of °exibility is very valuable in an experimental machine as
it enables us to adjust the network cost and performance without the need to design new routers.
The °exibility also enables the routing components developed for the streaming supercomputer to
be used as a general-purpose set of building blocks for supercomputer memory systems.
Our initial design for the streaming supercomputer network used a hierarchical network based
on a torus topology at the backplane and system levels. This design was replaced by the current
137Clos con¯guration after initial design studies showed the Clos was able to achieve the bandwidth
speci¯cation at signi¯cantly lower cost. At the board level, for example, the board-level routers are
able to implement 4:1 concentration using only 8 channels on the backplane side. To achieve the
same degree of concentration with a torus network required each of the four board-level routers to
provide 24 channels on the backplane side.
The routers in the streaming supercomputer network employ two-phase adaptive routing. Dur-
ing the up phase, a packet is adaptively routed to higher-levels of the hierarchy until it reaches a
common ancestor of the source and destination nodes. The packet is then routed deterministically
along the only path from this ancestor to the destination during the down phase. At each step of
this up phase the least busy upward channel is chosen for the next hop. Ties between equally busy
channels are broken randomly. A packet may travel zero, one, two, or three hops during the up
phase depending on whether the source and destination are on the same node, board, backplane,
or system. Once a packet has reached a common ancestor of the source and destination it begins
the down phase. During each hop of the down phase, the appropriate portion of the node address
¯eld selects the outgoing channel.
Each of the 48-port routers in the streaming supercomputer implements °it-reservation °ow
control to reduce memory latency. Packets are separated into control °its - which contain routing
and scheduling information - and data °its that contain only payload data. When a request packet
reaches the stream cache bank on the destination node and incurs a cache miss, a control °it for
a reply packet is sent immediately scheduling the ¯rst data °it for a point in the future when the
DRAM access will be completed. This control °it schedules the channels and bu®ers required by
the data °its, overlapping the time required for network control with the time required by DRAM
access and hence substantially reducing overall memory latency.
Overhead of encoding both the link-level protocol and the end-to-end memory request protocol
degrades the bandwidth available for payload data in the streaming supercomputer network. At
the link level, packets are segmented into 64-bit °its and 8-bits of overhead (VC identi¯er, °it type,
and check) are appended to each °it for an overhead of 12.5%. The memory request protocol has
an overhead of 100% for single-word accesses which falls o® to only 5% for 16-word accesses. For
typical mixes of accesses, we expect to yield between 50% and 80% of the network bandwidth as
actual data bandwidth.
5.2.4 Input/Output
The streaming supercomputer interfaces to industry standard I/O devices via a 10Gbit Ethernet
(XAUI) interface on each SSS router. There will be multiple small Ethernet domains connected to
SSS routers; each domain will contain some subset of hard disk storage, Internet2 connection, and
peripherals.
The XAUI interface on each SSS router acts as a bridge between an external network based
on the 10Gbit Ethernet standard and the internal streaming supercomputer network described in
Section 5.2.3. A scalar processor on any node of the streaming supercomputer can send an ethernet
packet by encapsulating the ethernet packet in a streaming supercomputer packet and sending it
to the ethernet interface. The interface strips the internal network overhead from the packet and
retransmits it over the external network. Similarly, an external message arriving at the interface is
encapsulated in an internal network message and delivered to a scalar processor.
The I/O bandwidth per node can be con¯gured based on the number of 10Gbit connections
employed. Each streaming supercomputer router includes one 10 Gigabit Ethernet XAUI interface,
but the number of these interfaces that are actually connected to an Ethernet switch is con¯gurable.
With 16 nodes and 4 routers per board, we can con¯gure I/O bandwidth from 2.5Gbits/s per
node (all router ports employed) to 625Mbits/s per node (one port per board connected). Multiple
physical disks per node will be required to work in parallel to provide such high bandwidth numbers.
138Each Ethernet domain will connect to network storage devices to provide distributed storage.
These multiple connection points allow parallel ¯le transfers to occur. One or more of the Ethernet
domains will also connect to the Internet2 for SSS users to transfer code and datasets.
Network storage solutions are being investigated: Network Attached Storage (NAS), and Storage
Area Networks (SAN) are both possible choices. Since we wish to support a parallel ¯le system,
we must distribute the connection points of the ¯le system through the streaming supercomputer.
This may be done by placing multiple NAS devices in the streaming supercomputer network. If
SANs exist with multiple parallel network connections, then a similar con¯guration is possible.
The streaming supercomputer requires a ¯le system that supports e±cient parallel ¯le transfers.
The ¯lesystem requirements include the ability to stripe ¯les across disks, allowing many nodes
to access a ¯le simultaneously, and robustness in the case of disk failure. We are investigating
existing ¯lesystems to determine if we can adapt an existing ¯lesystem for use in the streaming
supercomputer.
Streaming supercomputer nodes will be able to remotely access any storage data through
memory-mapped transfers over the SSS network. Ethernet tra±c will remain local to the router
and its attached storage devices.
5.3 Stream Programming Systems
5.3.1 Brook: A Stream Programming Language
Perhaps the biggest challenge in the streaming supercomputer project is to develop an appropriate
programming model. The programming environment needs to support high performance programs
that take full advantage of the hardware, as well as being portable to a variety of existing archi-
tectures. In addition, the programming environment should hide machine-speci¯c details so that
programs written in the environment will run e±ciently on future hardware.
As discussed previously, streaming architectures achieve high performance by exploiting data
parallelism and high arithmetic intensity. The programming environment therefore should expose
these fundamental machine constraints in order to encourage the programmer to write code which
will run e±ciently on such an architecture.
Our goal with the programming environment is not to automatically parallelize existing C or
Fortran programs. Many years of research in vectorizing compilers has had limited success in par-
allelizing existing codes. These languages encourage a sequential, imperative style of programming
that causes the programmer to introduce dependencies and synchronization points, often unwill-
ingly, that make parallelization very di±cult. Our goal is to create a programming environment
where parallelism and communication are explicit.
During this ¯rst year, we have designed and built a prototype of a new streaming programming
environment entitled \Brook." The goals for Brook are the following:
1. Brook should encourage the programmer to write code which exposes the data parallelism
and high arithmetic intensity inherent in the underlying algorithm. This programming model
should expose computation and communication within the application in a manner which is
e±cient on any streaming architecture, including the streaming supercomputer.
2. Brook should allow a programmer to implement a wide variety of algorithms in a natural and
intuitive way. The language should present a small set of core operators, which when used in
combination, provide a powerful toolbox.
3. Brook's programming model should be straightforward and natural for veteran programmers.
Adaptation of a new programming language is di±cult enough; any similarities to existing
languages would make migration less painful.
139struct Position {
float x, y, z ;
} ;
typedef struct Position PositionT ;
stream PositionT Pos;
stream PositionT Vel;
kernel void UpdatePosition (stream PositionT pos,
stream VelocityT vel,
float timestep,
out stream PositionT new_pos) {
new_pos.x = pos.x + timestep * vel.x ;
new_pos.y = pos.y + timestep * vel.y ;
new_pos.z = pos.z + timestep * vel.z ;
}
Figure 5.8: Declaring a stream and a kernel in Brook
A major goal of the project during the ¯rst year was to bootstrap the development of Brook.
Brook is at the nexus between architects and the application developers. Therefore, over the last
year, all the members of the SSS group have been actively involved in the design of Brook.
The current Brook programming environment is embedded in the \C" programming language.
A Brook program largely resembles any other C program which includes a main function, variables
and function de¯nitions. Other than the presence of these new keywords, Brook's similarity to C
allows any programmer with basic C experience to understand Brook code. Most new developers
learn how to write Brook code in a single sitting.
Streams and Kernels
The basic unit of computational in Brook is the stream. A stream represents a sequence of data
records of the same type. Typically, streams represent the collection of data being operated on, such
as a particle record containing positions and velocities of water molecules in a molecular dynamics
simulation, or ¯nite element cells containing state variables in a °uid simulation. The programmer
executes functions, or kernels, on each element of the stream. Kernels operate on each element of
an input stream and place the result of a computation on an output stream. A molecular dynamics
kernel for example, may update the position of each water molecule based on its velocity vector. A
Brook program consists of de¯ning streams of data and operating on that data with a sequence of
user-de¯ned kernel functions.
Brook extends the \C" syntax with a few new modi¯er keywords such as stream and kernel
for specifying streaming primitives. In addition, Brook introduces a runtime library of functions
which allow for creation and manipulation of streams.
Figure 5.8 shows a declaration of a stream and a kernel in brook. The variables Pos and Vel
represent an array of °oats. However, they are also streams and can be used as arguments to
kernels. Kernel functions resemble C functions with a few subtle di®erences. The programmer
declares a kernel function with the kernel keyword indicating execution on streaming hardware.
The arguments to kernel functions consist of input streams (pos, vel), scalar constants (timestep),
and output streams (new pos). Stream arguments are explicitly declared as either read only or
140write only with the out keyword. This makes the data°ow explicit for the compiler. The kernel
function will remove one element from each of the input streams, execute the body of the kernel
function, and store the result onto the output stream. This repeats until there are no more elements
left on the input stream.
A critical decision in the Brook design was establishing the constraints for kernel functions. By
placing restrictions on kernels, Brook semantics allow for e±cient execution of kernels on streaming
hardware by operating in parallel over the elements in a stream. Although these decisions are still
not ¯nalized, we decided to place some restrictions on kernels:
1. Global variables are not visible within kernels. Only the arguments to the kernel function are
accessible. This restricts the use of pointers to global variables, or references to static global
variables.
2. No dependencies can exists between stream elements. No static variables may be declared
within a kernel. A kernel computation cannot obtain a previous result originating from an
prior element's computation.
3. Stream arguments are read only or write only. By default, arguments are considered to be
read only, unless speci¯ed otherwise by keyword.
These main restrictions ensure data parallelism and high arithmetic intensity for Brook kernels. By
removing any possible dependencies between stream elements, the compiler is free to schedule the
computation in parallel. Furthermore, since kernel execution is entirely local, we limit the number
of global reads required for computation. These design decisions have allowed Brook programmers
to construct programs without constant concern over precisely how the parallel execution proceeds.
A key question in developing Brook was how to support data structures commonly found in
scienti¯c applications. For this reason we have been investigating how to support multidimensional
arrays and meshes.
The general philosophy is to treat streams as views of memory. Conceptually a stream is a
linear sequence of references to data stored in memory. Normally, references are in memory order;
that is, the data is accessed one element at a time in the order it is stored. This de¯nition of
a stream allows streams to be easily formed from 1D and nD arrays. More generally, we want
access functions to data in di®erent orders. For example, strided access to an array would allow
it to be accessed in column- rather than row-major order. Other access patterns could be create
block-access modes.
More generally we want to convert dynamically allocated collections of records such as lists,
trees and graphs to streams. Unstructured ¯nite element meshes are basically graphs of ¯xed size
records.
By de¯ning accessors for di®erent data structures, we separate the access of the data from the
operations on the data. Thus, it is easy to write a Brook kernel to operate on records independently
of the type of collection of records. This is because each Brook kernel operates only on one record.
In physical simulation, it is often necessary to access a local neighborhood of a record. In the
case of a 1D array, the neighborhood is the previous and next records. In the case of an nD array,
it is a small nD array centered around the record of interest. This neighborhood locality may be
exploited in the architecture to minimize data movement.
Thus, to e±ciently support multidimensional arrays in Brook, we may give a stream a shape.
This shape may be used to associate a multidimensional neighborhood or stencil around each
stream element. Although this feature is common in array processing languages, it is a new feature
to streaming languages.
Figure 5.9 shows the declaration of a 2D array and a kernel that access a neighborhod. Note
that the array addresses inside the kernel are relative to the current location.
141stream Cell Grid;
kernel void Flux ( float rate,
stream Cell grid[3][3], out stream Cell newgrad ) {
float k = 1.0f - (rate*4.0f);
newgrid =
grid[1][1] * k +
grid[0][1] * rate +
grid[2][1] * rate +
grid[1][0] * rate +
grid[1][2] * rate;
}
Figure 5.9: Neighborhoods in Brook
Stream Operators
Brook programs consist of a sequence of kernel applications, intermixed with data movement and
reorganization. Some key stream operators are:
1. StreamLoad and StreamStore. These operators create a stream from a base address. StreamStride
forms a stream by taking m records and then skipping n records.
2. A stream of memory references may be created using StreamLoadRef. StreamGather and
StreamScatter loads and stores a stream in the references contained in another stream.
StreamScatterOp implements an op= using memory reference.
3. StreamStencil, StreamGroup/StreamUngroup, StreamReplicate/StreamConcatenate. These
operators perform rerrangements on streams. StreamStencil forms a stream of neighbor-
hoods; the group and ungroup operators combine/uncombine consecutive records into a sin-
gle record; the replicate operators replicates records, and the concatenate operator joins two
streams together.
The goal eventually is to settle on a small complete set of operators in the base system, and
then a library of convenient operators built on top of those operators.
The ¯nal important feature of the language is support for reductions. One way to perform a
reduction is to use the codeStreamScatterOp. Another way is to declare a variable in a kernel to
be a reduction variable.
Inside the kernel the += acts as a global reduction. All the standard C op= statements may be
used to form reductions.
Status
The ¯rst stage in the development of the Brook language is largely complete. We have °ushed out
a basic design and have developed a working prototype. Establishing the programming language
in the early stages of the SSS project has not only provided a basis for the other portions of the
142stream float a[1024];
float sumval;
kernel void sum( stream float a, reduce float *sumval )
{
*sumval += a;
}
project, but has accelerated interest in the streaming supercomputer. In the subsequent sections
we will describe how molecular dynamics and °uid °ow are implemented in Brook.
Currently we have almost 15,000 lines of Brook written by 8 di®erent application designers,
on three di®erent OS platforms. We have released documentation of the Brook speci¯cation and
have provided a C implementation which runs on Win32, Linux, and Macintosh OS X platforms.
Using the metacompiler technology developed in tandem with this project, application designers
have been writing Brook code for three months, well in advance of any hardware simulator. Most
of our application writers come from the scienti¯c community which vary widely in preferences for
di®erent development platforms. Having a cross platform development environment has proven
vital in recruiting new developers.
5.3.2 A Streaming Virtual Machine
The Stream Virtual Machine (SVM) is a machine-independent model of a parallel streaming com-
puter that is used as an intermediate representation in the compilation process from a high level
stream language like Brook down to a speci¯c stream architecture. By compiling ¯rst to the SVM
and then from the SVM to a target architecture as illustrated in Figure 5.10 we can make our
software portable across many di®erent target architectures. Only the back-end of the compiler
needs to change to compile code for a new target architecture. All of the applications, and the
front-end of the compiler are completely portable.
Brook Program￿
Stream Virtual Machine￿
Stream Programming￿
Language￿
Intermediate￿
Representation￿
Machine Binary￿
Streaming￿
Super￿
Computer￿
Imagine￿
Smart￿
Memories￿
Conventional￿
Multiprocessor￿
High Level Compiler￿
Machine Independent￿
Parameterized￿
Virtual Machine￿
Machine￿
Specific￿
Low Level Compiler￿
Figure 5.10: Compilation Process
The SVM is intended to be representative of several di®erent target architectures including:
Streaming Super Computer Parallel shared-memory machine with stream processors at each
node.
143Imagine Existing stream processor for media applications.
Smart Memories Parallel machine with nodes that can be con¯gured to be either a stream
processor or a threaded processor.
Conventional multiprocessor To run brook programs on conventional machines, test the pro-
grams and get performance benchmarks.
The Stream Virtual Machine represents architectures with variable resources: nodes, computa-
tion, memory and bandwidth . As thus the SVM is parameterizable depending on the end target
con¯guration. The parameters will be presented in more details but ¯rst we will introduce the
concept of a stream node.
A stream node consists of a conventional, scalar processor that is extended with a stream
execution unit and a stream memory unit as illustrated in Figure 5.11. The scalar processor executes
a single instruction stream that contains scalar instructions, executed by the scalar processor, kernel
invocations, that are executed by the stream execution unit, and stream load/store instructions that
are executed by the stream memory unit. The stream node also consists of two local memories: the
kernel program memory and the stream register ¯le. The kernel program memory holds de¯nitions
of kernels to be executed on the stream execution unit. The Stream Register File stages active
portions of the active streams during execution of a stream program.
Stream Unit￿
Stream Exectuion Unit￿
Stream Register File￿
Kernel￿
Program￿
Memory￿
Stream Memory Unit￿
Scalar Processor￿
C
￿
o
￿
m
￿
m
￿
a
￿
n
￿
d
￿
s
￿
 
￿
I
￿
N
￿
C
￿
o
￿
m
￿
p
￿
l
￿
e
￿
t
￿
i
￿
o
￿
n
￿
s
￿
 
￿
O
￿
U
￿
T
￿
Memory￿
Figure 5.11: Stream Virtual Machine Node
144SVM Parameters
The front-end of the compiler makes decisions on how to compile a Brook program into SVM
code based on parameters of the SVM. For example, it chooses a schedule and a stream strip size
based on the available storage in the stream register ¯le. The parameters of an SVM are listed in
Table 5.3.2 and illustrated in Figure 5.12.
 Node￿
Stream Unit￿
Stream Execution￿
Unit￿
FLOPS￿
Stream Register File￿
S￿SRF￿
Local Memory￿
S￿ MEM￿
B￿ SRF￿
B￿ MEM￿
Scalar￿
Processor￿
B￿ NODE￿
N￿ NODE￿
Figure 5.12: Stream Virtual Machine Parameters
Field Description
NNODE Number of nodes
FLOPS Arithmetic Intensity
SSRF Size of Stream Register File
SMEM Size of main memory in words (per node)
BNODE Bandwidth between nodes
BMEM Bandwidth between memory and SRF
BSRF Bandwidth between SRF and cluster
Table 5.3: SVM Parameters
Scalar Processor
The interactions between the scalar processor and the stream units (stream execution unit and
stream memory unit) is one where the control thread launches operations on the stream units
145and the completions signals are returned to the scalar processor. The stream units works as co-
processors to the scalar processor.
Stream Instructions, the commands to the stream units have dependencies between each other
that are explicitly coded in the control thread SVM code. The stream instructions are C function
calls in the control thread. They are non-blocking to the control thread and return immediately.
Each stream instruction returns a DONE record which gives completion information and error
status. The control thread can thus voluntarily block on the completion of a stream instruction but
by nature none of them are blocking. The dependencies between stream instructions is explicitly
coded by passing as arguments the IDs of the stream instructions which it depends on. This
non-blocking execution model permits kernel invocations to be overlapped with stream memory
operations - hiding memory latency.
Stream Memory Unit
The stream memory unit executes stream load and store instructions received from the scalar
processor. These instructions, listed below, may be strided or indexed (gather/scatter), and include
fetch and add operations.
1. StreamLoad/StreamStore : This loads/stores a stream of data records placed in memory at
a regular stride into/from a compact representation in the Stream Register File (SRF).
2. StreamGather/StreamScatter : This loads/stores a stream of data records whose location in
memory is contained in a stream already in the SRF, into/from a compact stream in the SRF.
3. StreamScatterAdd : Adds a stream of data records in the SRF into a stream in memory
speci¯ed by a stream of addresses in the SRF.
4. StreamFetchAdd : Like StreamScatterAdd but also loads the stream of sums back into the
SRF.
5. KernelLoad : This loads a kernel program from main memory to the kernel program memory.
The scatter add operation is useful when many stream elements need to sum their values into
a single memory location. Fetch and op with integer add is used, for example, to update the head
and tail pointers of distributed lists.
Stream Execution Unit
The stream execution unit executes kernels consuming and producing streams in the SRF. The
control thread can start kernels, with a stream instruction by specifying where the input and
output streams of the kernel reside in the SRF. The structure of kernel programs is usually a
preamble, then a loop over the elements of a stream, ¯nally a postscript to ¯nish up.
The special functions of kernels relate mostly on how to access elements in the SRF. Regular
stream accesses consume and produce stream elements in order. Neighboring stream elements can
be accessed with special peek function which imply communication between parallel computation
elements.
Multiprocessor Implementation
The SVM has been implemented for a conventional multiprocessor system in order to simulate
SVM code and get rough performance numbers. The implementation is based on POSIX thread.
Each stream node consists of these POSIX threads:
1461. The Control thread runs the scalar processor code.
2. The Stream Instruction dispatcher keeps track of the launch and completion of stream in-
structions issued by the scalar processor.
3. The stream memory unit which executes the stream memory operations.
4. The stream execution unit executes kernels.
5.3.3 Stream Compilation
We are building a streaming compiler that takes a Brook program as input, and generates ex-
ecutable code for the streaming supercomputer. The compilation process involves three main
phases: parsing of the Brook program, global analyses and optimizations, and code generation to
the SSS architecture. The ¯rst two phases are machine independent and generate code for the
streaming virtual machine. The ¯nal phase converts streaming virtual machine code to streaming
supercomputer machine code.
The entire planned compilation process is depicted in Figure 5.13, and some of the passes will
be described below.
For our initial implementation we concentrated on only those passes which are absolutely nec-
essary to produce correct, somewhat optimized code targeted at a single node of the streaming
supercomputer. We also chose to leverage as much existing technology as possible, which includes
the metacompiler and a modi¯ed version of the Imagine processor programming tools.
The Frontend passes are performed using the metacompiler. The result is a parsed Abstract
Syntax Tree, which is used while still in the metacompilation process to perform critical global
passes and produces StreamC and KernelC code. StreamC is handled by Imagine's stream compiler,
IStream, and performs the stream scheduling aspect of the global passes. KernelC is used by the
Imagine kernel scheduler, ISched, to compile a VLIW schedule that can be directly run on the
streaming supercomputer clusters.
We will now continue by going into greater detail of the compilation process, and ¯nish with
the current status.
Front-end Parsing
The stream language Brook closely resembles general purpose programming languages like C or
C++. Brook is basically C with a few additional keywords that indicate which functions are
kernels, which variables are streams, and so on. As such, to compile the Brook language, we use an
existing C compiler (gcc) augmented with the Brook speci¯c syntax and semantics. This section
will describe this process in more detail.
The front-end of our compiler for Brook uses a process called metacompilation, a process of
giving a general purpose compiler additional information that can be used to make system speci¯c
decisions. In this case, we tell the compiler additional keywords that are allowed by the language
and how those keywords in°uence the translation process that the compiler performs. With no
augmentation, our metacompiler takes C source code as input and produces the same C source
code as output. However, the metacompiler can input extensions to the language as well as the
original source code to allow languages that aren't quite C to be compiled. For example, in Brook,
the following lines are included in the compiler extension:
annotations {
typedef: { "stream" };
function: { "kernel" };
147Frontend￿
Metacompiler￿
Global Compilation Phase￿
Expand Stream Operators￿
Reduction Optimization￿
Persistent Storage￿
Management￿
Conditional Conversion￿
Split/Merge Kernels￿
Split Records￿
Data Partitioning￿
Task Partitioning￿
Synchronization￿
Stream Scheduling￿
Strip-mining￿
Software￿
Pipelining￿
SRF Allocation￿
Operation￿
Ordering￿
S
￿
t
￿
r
￿
e
￿
a
￿
m
￿
 
￿
V
￿
i
￿
r
￿
t
￿
u
￿
a
￿
l
￿
 
￿
M
￿
a
￿
c
￿
h
￿
i
￿
n
￿
e
￿
Code Generation￿
Intercluster￿
Communication￿
Loop Optimization￿
Communication￿
Scheduling￿
Register Allocation￿
Brook Code￿
SSS Code￿
Figure 5.13: Compilation Process
148decl: { "out" }, { "in" };
}
This code indicates that typedefs may include a new keyword, stream (to indicate that a new
type is a stream type), that functions may be declared with the keyword kernel, and that declara-
tions may include the keywords in and out. This simple syntax does restrict the types of extensions
that are allowed. However, by restricting a language to be close to the original language, much of
the existing compiler may be used to compile the language.
To indicate what the semantics of the language do, compiler extensions may match on certain
constructs and indicate that a di®erent construct should be output instead. This concept is made
clear through a simple example. Imagine we compile the code:
typedef stream float floats;
kernel void add (floats a, floats b, out floats c) {
c = a + b;
}
int main () {
floats a;
floats b;
floats c;
! read data from file into a and b ;
add (a, b, c);
return 1;
}
With no compiler extensions, our C to C compiler would simply output the above functions
unchanged. However, if, in Brook, we know that calls to kernel functions should be wrapped
with code that creates the actual output stream and allocates space for it. We can include this
information in a compiler extension:
{ is_kernel_function_call (current_stmt) } ==>
{ mc_emit (``! create and allocate stream c according to the size of a and b ;'');
mc_emit_tree (current_stmt);
}
The left hand side of the arrow indicates what should be matched (any statement that is a call
to a kernel function) and the right side is arbitrary C code to tell the compiler what to do when the
construct is matched. In this case, the compiler will simply insert the appropriate code for creating
the output stream, resulting in the above code looking like this:
int main () {
floats a;
floats b;
floats c;
149! code for reading the lengths of a and b ;
! code for creating and allocating c ;
add (a, b, c);
return 1;
}
While this example is contrived, it demonstrates simply the types of changes extensions are
allowed to introduce to the language. The patterns that extensions may use to match source code
constructs are quite powerful. The metacompiler matches these patterns against each tree in the
abstract syntax tree (AST) of the currently parsed function. When the metacompiler ¯nds a match,
it executes the corresponding action code (the code found after the ==>). Extension writers are
allowed to match wildcards (e.g., any pointer, any constant, etc.) and use the matched wildcards
in future patterns.
The next sections describe the use of the meta-compiler and the parsing information for the
global compilation phase and code generation.
Global Compilation Phase
This phase deals with analyses and optimizations common to all stream architectures (Figure 5.13):
² Expand Stream Operators { since our low level stream targets do not directly support many
of Brook's rich set of stream operators, we must expand these into a set of tailored kernel
calls and memory operations.
² Reduction Analysis { an appropriate mechanism for dealing with reduction streams and vari-
ables must be chosen from the ones available in the hardware.
² Persistent Storage Management { Brook does not have persistent state to aid in parallel com-
pilation. Therefore managing the hardware's local persistent storage is a crucial compilation
stage.
² Conditional Conversion { stream hardware provides various mechanisms for dealing with
conditionals, such as predication and conditional streams. The conditional statements in
Brook must be analyzed and translated into the most ¯tting conditional support mechanism.
² Split/Merge Kernels and Records { this pass attempts to repartition the program from a
performance point of view, rather than programmer convenience.
² Data and Task Partitioning { these passes deal exclusively with multi-node support and are
responsible for most e±ciently using the parallel resources in the system.
² Synchronization { once the data is laid-out, the tasks selected, and the reduction and memory
operations marked, the compiler must insert appropriate synchronization points for correct
concurrent execution.
² Stream Scheduling { this pass generates the lower level stream code required for code gener-
ation. it includes strip-mining, software pipelining, double-bu®ering, and SRF allocation.
At this stage of the project we are concentrating on compilation to a single node, and can
therefore leverage the programming tool-chain of Imagine. Our strategy is to analyze the Brook
program using the parse-tree supplied by the metacompiler, and produce StreamC and KernelC
150code. These programs are in a format that is readable by our code-generation tool set, which is a
modi¯ed version of the Imagine tools.
The global phase corresponds to generating the StreamC code. StreamC is a stream program-
ming language, but more low level than Brook. Therefore, Our initial compilation consists mainly
of syntax changes, analysis and implementation of the Brook operators, and compliance and uti-
lization of the SSS SIMD cluster architecture. Most of the work concentrates on the latter two,
since StreamC is not as rich as Brook and places many restrictions on stream properties due to
SIMD. We will expand on this in the next few paragraphs.
Expand Brook Stream Operators Most of the Brook stream operators have a straightforward
representation in hardware. The most notable exceptions are StreamProduct and StreamSelfProd-
uct which produce an O(n2) number of stream pair elements. These operators are used to express
certain aspects of the algorithm in a simple way, but will incur a high performance penalty if run
directly on the stream processor. Another complex operator is streamStencil which requires us
to split the input streams into small strips such that we can use local storage in the clusters to
e±ciently handle the neighbor communication. We will explain this more in the next section.
We use the metacompiler to emit StreamC code that implements the operators through a
combination of compile-time and run-time structures for holding stream properties, and various
tailored kernels and memory operations for the particular streams encountered.
Reduction Identi¯cation and Optimizations Reductions must be identi¯ed and then the
appropriate kernels generated. Several optimizations can then be applied:
² Kernel Combining { if the order of generated values is known (or does not matter) the
reduction can be combined with the value calculation.
² Sorting vs. Memory { when the order does matter the compiler must choose whether to
enforce it by sorting the values (based on the reduction element location), or to use an
indexed store to memory. When using a store, it is necessary to ¯rst combine the reduction
with the value calculation and then split this kernel at every memory operation.
² Atomic Add&Store { since the hardware supports atomic fetch&op instructions per stream
element, they can be used to reduce the amount of synchronization and message passing.
Syntax Translation StreamC and Brook di®er in some important ways in their syntax and
semantics. For example, streams in Brook do not have a de¯ned length, and the length is determined
based on the inputs to their producing kernel. In StreamC, on the other hand, streams are declared
with an associated length.
We take care of such di®erences by introducing appropriate StreamC code and slight modi¯ca-
tions of Brook, using the metacompiler.
Stream Scheduling Finally, we pass the resulting StreamC to IStream, which performs stream
scheduling resulting in code executable by the streaming supercomputer hardware simulator.
The next section deals with generating kernel code, and also gives more speci¯c examples of
Persistent Storage Management and Conditional Conversion.
Code Generation
We already discussed the stream code generation in the previous section, and will now deal with
the generation of kernel code. The compilation of Brook kernels into streaming supercomputer
kernels can be broadly classi¯ed into four major phases:
151² Syntax Transformations { metacompiling the body of Brook kernels into streaming supercom-
puter kernels such that the kernel code that is generated conforms to the syntax requirements
of KernelC. These include sometimes complex transformations such as converting if-then-else
statements into predicated statements.
² Handling Reductions { these include scalar and stream variable reductions, and involves
managing persistent storage and performing inter-cluster communication.
² Handling Multi-dimensional Streams { some of this step is performed during the global phase,
but the kernels must also be modi¯ed.
² Optimization and VLIW scheduling { these are done using the kernel scheduler which is based
on Imagine's ISched tool.
Syntax Transformation Most syntax transformations are simple and involve mostly di®erent
names for certain C operators and math functions. Slightly more complex, is producing wrapper
code which explicitly reads and writes stream elements as opposed to Brook's more functional
syntax and semantics. Finally, more complex transformations involve allocating local storage and
converting conditionals. For now we only deal with predication leaving conditional streams for
future versions (conditional streams are an e±cient mechanism for handling conditionals on SIMD
architectures).
Reductions Reductions, in Brook kernels, can be of two types: scalar/stream reduction and
expanded stream reduction. In scalar/stream reduction, a reduction variable, or a stream of reduction
variables, is passed to the kernel function and written via the reduction operators. This is easily
handled by writing wrappers around the Brook kernels, which initialize the reduction variables, and
communicate the reduced values across clusters before returning from the kernel. As mentioned
before, expanded streams (using streamProduct operators) pose a greater challenge for e±cient
compilation. These are dealt by just-in-time expansion of strips of the stream utilizing the clusters
local storage.
Multi-Dimensional Streams Multi-dimensional streams provide the programmer with a simple
interface to manipulate streams that represent inherently multi-dimensional data. The streaming
supercomputer kernels, however, only support linear one-dimensional streams. We must therefore
e±ciently map Brook's multi-dimensional operators and accessors onto the stream processor's one-
dimensional view.
We perform this mapping by utilizing the inter-cluster communication network, to build the
entire stencil within the clusters' local storage. In this way we avoid extraneous memory reads and
utilize the available bandwidth most e±ciently. To do this we must make sure that the stream is
passed to the kernel in strips which do not over°ow local storage.
This is best illustrated with an example. Figure 5.14 shows how a 2-D (640 £ 100) stream
is divided into several single dimensional strips, and the order in which the data is read into the
clusters. Speci¯cally,the stencil size in this example is 3¤3. The ¯gure also depicts how the elements
of the single dimensional streams are striped across the 16 clusters of the stream processor. Within
each cluster, the elements are traversed in column major order:
Cluster 0: 1;101;201;301;2;102;202;etc
Cluster 1: 401;501;601;701;402;502;602;etc
and so on. The stencil values for 304, for example, are f203;303;403;204;304;404;205;305;405g.
At any given time, a sliding window holds all the elements that are present in the local storage of
clusters.
152Figure 5.14: Stream traversal order for a 640 ¤ 100 stream with stencil size of 3 ¤ 3. The numbers in the
¯gure represent the position of elements in the 2-D stream.
The traversal order ensures that elements belonging to the same stencil are read close to each
other in time. Consequently, extraneous re-reading of elements is limited only to the elements
along the top and bottom boundaries. In addition, this traversal order also ensures that there is
minimum communication between clusters.
VLIW Scheduling The VLIW scheduler is based on Imagine's ISched tool. It performs low-level
optimizations such as loop-unrolling and software pipelining, as well as communication sheduling
to produce e±cient VLIW code for the streaming supercomputer execution clusters.
Summary and Status
To summarize we will describe the current status of the compiler. The frontend is implemented
using the meta-compiler and is able to parse Brook code and provide us with access to the parsed
Abstract Syntax Tree. We then use this information to perform simple global passes that are
required to produce correct single-node streaming supercomputer code. This is done by creating
a corresponding StreamC program and using the stream scheduler based on Imagine's IStream to
generate code. We currently do not perform complex optimizations, but instead rely on e±cient
implementation of the Brook operators and low level scheduling optimizations. Finally, we also
153perform local persistent storage management, inter-cluster communication, and reduction handling
of Brook kernels to produce KernelC code. This kernel code is then passed to the VLIW scheduler
for low-level optimization and cluster code generation.
5.4 Stream Applications
This section presents the status of four di®erent applications that are coded in Brook. The work on
the applications is helping the hardware group and the language developers in making decisions on
the hardware speci¯cations and on the language features. We are trying to cover as many algorithms
as possible, to study the applicability of the streaming computing model to high performance
computing. As the language and the tools are becoming more robust, we intend to increase the
complexity of the present applications and to add more applications to the list in order to cover a
broader range of scienti¯c discipline.
5.4.1 StreamFEM
StreamFEM is a discontinuous Galerkin (DG) ¯nite element application code written in the Brook
streaming language. The DG method was chosen as it is well suited to solving systems of ¯rst-
order conservation laws on general unstructured meshes. Speci¯cally, StreamFEM has the current
capability of solving systems of 2D conservation laws corresponding to scalar transport, compress-
ible gasdynamics and magnetohydrodynamics (MHD) using element approximation spaces ranging
from piecewise constant to piecewise cubic polynomials.
StreamFEM provides useful input to the Streaming Supercomputer project:
² (Memory subsystem evaluation) The use of triangulations in StreamFEM necessitates ir-
regular data access from/to main memory. The Streaming Supercomputer hardware design
includes full support for hardware gather/scatter operations to main memory. In addition, a
stream cache and indexable SRF have been added to the hardware design. One use of the
StreamFEM application (already underway) is the preliminary evaluation of the SRF and
memory subsystem. More generally, StreamFEM and it's extension to 3D will be used as
part of the evaluation process for future hardware revisions and the ¯nal tuning of hardware
components.
² (Brook language functionality) During the development phase of StreamFEM, interaction
with the language developers has helped insure that the Brook language provides the nec-
essary language constructs used frequently in ¯nite element application codes, e.g. scatter-
accumulation, variable degree graphs, global reductions, etc.
An important attribute of StreamFEM is the ability to simulate a variety of physical systems of
increasing complexity using a numerical method with an adjustable order of accuracy. These various
possibilities produce a dramatic variation in the arithmetic intensity (Ops/memory access) in the
overall application. This re°ects the reality that in industrial codes a wide variation in arithmetic
complexity is also observed. The ability to simulate this variation to some extent is quite useful in
the evaluation of Streaming Supercomputer hardware designs. Changes in arithmetic intensity are
obtained by
² varying the number of PDEs in the system. Currently StreamFEM supports scalar advection
(1 equation), the Euler equations of gasdynamics (4 equations), and magnetohydrodynamics
(6 equations).
154² varying the order of the polynomial approximation space used in each element. Currently
StreamFEM supports polynomial approximation spaces in each element that are piecewise
constant (1 degree of freedom/equation), piecewise linear (3 degrees of freedom/equation),
piecewise quadratic (6 degrees of freedom/equation), and piecewise cubic (10 degrees of free-
dom/equation).
StreamFEM methodology
StreamFEM implements the discontinuous Galerkin ¯nite element method for solving a system
of m coupled ¯rst-order di®erential equations in 2 space coordinates and time which represents a
conservation law process. Let u(x;t) : IR2 £ IR+ 7! IRm denote the dependent solution variables
and f(u) : IRm 7! IRm£2 the °ux vector. The prototype Cauchy problem is then given by
½
u;t +
P2
i=1 fi
;xi = 0
u(x;0) = u0(x)
: (5.1)
The StreamFEM implementation of the DG method utilizes piecewise polynomial approximations
of degree k in space and of degree 0 in time with no continuity between space-time elements such
that for a single time slab In = [tn
+;tn+1
¡ ] with [0;T] = [nIn and spatial element K
Vh =
n
uh juh
jK£In 2
³
Pk(K)
´m
£ P0(In)
o
:
The DG method is then succintly stated in variational form.
Find uh 2 Vh such that for all wh 2 Vh
B(uh;wh)DG = 0 (5.2)
and
B(u;w)DG =
Z
In
Z
­
(¡u ¢ w;t ¡ fi(u) ¢ w;xi)dxdt
+
Z
­
¡
w(tn+1
¡ ) ¢ u(tn+1
¡ ) ¡ w(tn
¡) ¢ u(tn
¡)
¢
dx
+
Z
In
X
e2E
Z
e
(w(x¡) ¡ w(x+)) ¢ h(u(x¡);u(x+);n)dxdt
where h(u(x¡);u(x+)) denotes a numerical °ux function. In the StreamFEM code, a numerical
°ux due to Harten, Lax and van Leer [1] is used for all calculations. Boundary conditions (not
shown here) are included in the actual implementation. Implicit \mass" matrices are avoided by
the use of L2 orthogonal basis representations in each element so that the time stepping becomes
explicit.
Current Status
The StreamFEM code has been successfully implemented in the Brook streaming language. Two
algorithms are included in the Brook implementation. One algorithm is parsimonious with respect
to memory access but computes each °ux twice during each time step. The second algorithm is
parsimonious with respect to computation but requires approximately twice as much memory access
during each time step as the previous algorithm. The Brook programming environment includes
a C runtime library so that correctness of application codes can be veri¯ed. A snapshot from a
StreamFEM transient Euler °ow computation is shown in Fig. 5.15 with °ow conditions given in
the ¯gure caption.
155Figure 5.15: StreamFEM(k = 0) compressible °ow computation of a Mach 3.0 momentum discon-
tinuity ((½u)left=(½u)right = 5) impinging on a cylinder geometry in a narrow channel test section
at time T=2. Gouraud shading of the °uid density is depicted in the upper half of the ¯gure and
a triangulation of the geometry is plotted in the lower half of the ¯gure.
A variant of StreamFEM has been ported to StreamC/KernelC so that precise hardware per-
formance can be measured using a suite of simulation tools originally developed for the Stanford
Imagine hardware. One problem encountered in this conversion from Brook to StreamC/KernelC
is the rather large size of the Brook kernels in StreamFEM. One improvement currently underway
to the StreamFEM application includes the restructuring and splitting of some Brook kernels to
improve performance and portability to the Imagine simulator. Another important task is the
extension of StreamFEM to 3D. This extension is vital so that any hardware evaluations computed
using StreamFEM are based on the more demanding 3D computations.
5.4.2 StreamFLO
One of the applications that is under development in Brook is StreamFlo. StreamFlo is a ¯nite
volume 2D Euler solver with a non-linear multigrid. The original Fortran code (FLO82) was written
by Prof. Jameson and his approach is used in many industrial and research codes. The choice of
this code was motivated by the need of a complete application that was going to represent a typical
scienti¯c code (TFLO has a similar structure), without having too much complexity (single block
vs. multiblock, 2D vs. 3D, Euler vs. Navier-Stokes). One of the main outcome of this project has
been the interaction with the Brook developers to improve the expressability of the language and
facilitate the future application developers. The multigrid algorithm will also expose the "short-
stream" e®ects. In 3D there is almost an order of magnitude di®erence in numbers of elements per
level. Even starting from a ¯ne mesh with a billion cells, we can end up with only hundred cells
per level, if the number of levels is big enough.
156Code organization
The code is organized as follow:
² There is an external driver that controls the multigrid strategy and the multigrid data move-
ment (transfer from ¯ne to coarse grid and vice versa). Data is stored in a 1D stream which
holds all the multigrid levels. Each level is selected using the streamDomain command and
reshaped in a 2D stream using streamShape. When the 3D version of StreamFlo will be de-
veloped, the logical structure of this driver will not be changed, we will only need to reshape
each level in a 3D stream. The data movement in a multigrid algorithm consists of two oper-
ations, restriction and prolongation. The restriction operation transfers the data from a ¯ne
to a coarse mesh, while the prolongation operation does the opposite. To implement these
data movements, we added to Brook an operator that performs local grouping.
² On each multigrid level, the code works on a 2D grid, where it needs to compute the time
step, the eulerian and dissipative °uxes, in order to advance the solution in time. This
involves mostly operations performed on stencils. The width of the stencil depends on the
numerical scheme. Most of the operations (computation of admissible time step, eulerian
°uxes, dissipative °uxes used on coarse meshes) are performed on a 3 £ 3 stencil. The
computation of the dissipative °uxes on the ¯ne mesh use a more sophisticated algorithm
(symmetric limited positive H-CUSP scheme) that requires a 5 £ 5 stencil. The code uses O
meshes; this means that the grid is periodic in one dimension and has a wall and an external
boundary in the other direction. To handle this kind of topology, Brook was modi¯ed to
implement stencil operators with independent boundary conditions in each logical direction.
Code porting
To port a Fortran/C code to Brook, there are two essential steps:
² De¯ne the data layout and arrange the data in streams
² De¯ne the computation kernel to be performed on the streams.
The decoupling of the data access pattern from the computation, produces a code that is cleaner
and easy to understand. In order to show this and an example of Brook code, let's take a look at
the code performing the restriction operator in the multigrid. From the analysis of the following
original Fortran code :
C COLLECTS THE RESIDUALS AND TRANSFERS THE SOLUTION TO A COARSER MESH
SUBROUTINE COLLC (IIE,JJE,WW,WWR)
.........
DO N=1,4
JJ = 1
DO J=2,JL,2
JJ = JJ +1
II = 1
DO I=2,IL,2
II = II +1
WWR(II,JJ,N) = DW(I,J ,N) +DW(I+1,J ,N)
+DW(I,J+1,N) +DW(I+1,J+1,N)
END DO
END DO
END DO
157we can understand that all this routine does ( see ¯gure 5.4.2 ), it is to group 4 cells on the ¯ne
grid (A,B,C,D), sum them and store the result in the corresponding cell (AA) on the coarse level,
then move to the next 4 cells and keep moving until it covers all the interior of the domain. The
1 2 IL+1 IL
JL+1
JL
1
2 A B
C D
AA
Figure 5.16: Restriction operator
Brook code for the kernel function performing the restriction operator, will sum the 4 element on
the ¯ne grid stream and store the result in the corresponding element on the coarse grid stream.
©(AA) = §f©(A) + ©(B) + ©(C) + ©(D)g
kernel void TransferField(flow2d_s fine_flow, outfixed Flow coarse_flow)
{
coarse_flow.rho= fine_flow[0][0].rho +fine_flow[0][1].rho
+fine_flow[1][0].rho +fine_flow[1][1].rho;
coarse_flow.u = fine_flow[0][0].u +fine_flow[0][1].u
+fine_flow[1][0].u +fine_flow[1][1].u ;
coarse_flow.v = fine_flow[0][0].v +fine_flow[0][1].v
+fine_flow[1][0].v +fine_flow[1][1].v ;
coarse_flow.e = fine_flow[0][0].e +fine_flow[0][1].e
+fine_flow[1][0].e +fine_flow[1][1].e ;
The last operation we need to perform is to create the right stream from the ¯ne grid to feed
the kernel. We want to take the ¯ne and coarse grids, select the interior elements, arrange the ¯ne
grid in 2 £ 2 groups and apply the TransferField kernel.
// Fine mesh is a 2D stream of size (nx+2,ny+2)
streamSetLength(flow,(nx+2)*(ny+2));
streamShape(flow,2,(nx+2)*(ny+2));
// Coarse mesh is a 2D stream of size (nx/2+2,ny/2+2)
streamSetLength(coarse_flow,(nx/2+2)*(ny/2+2));
streamShape(coarse_flow,2,(nx/2+2),(ny/2+2));
// Select the interior of the two stream
streamDomain(flow,flow,2,2,nx+1,2,ny+1);
streamDomain(coarse_flow,coarse_flow,2,2,(nx/2)+1,2,(ny/2)+1);
// Group elements of the fine grid in a 2x2 stencil
158streamGroup(local_flow2d,flow,STREAM_GROUP_HALO,2,2,2);
// Apply restriction operator
TransferField(local_flow2d, coarse_flow)
The code is less compact of the original Fortran code but easier to understand and to modify.
Implementation of boundary conditions
The implementation of boundary conditions in a SIMD program is not a simple task. The numerical
treatment of the elements close to the boundary is generally di®erent from the one used for the
interior points, in contrast with the SIMD philosophy of applying the same instructions to multiple
data. Inside the kernels, depending on the global position of an element in the original stream,
we may need to have di®erent branches of code to handle this. In Brook, when a stream enters a
kernel, all the information on global position of an element are lost (at least from the programmer
point of view). The solution found to overcome this problem is to encode logical position in the
geometry. The original de¯nition for the mesh:
struct Grid_struct {
float x;
float y;};
typedef stream struct Grid_struct Grid;
was modi¯ed to include information on the logical position.
struct Grid_struct {
int i; /* logical position in x direction */
int j; /* logical position in y direction */
float x;
float y;};
typedef stream struct Grid_struct Grid;
In a conventional cache based machine, this modi¯cation results in a cache load of all the 4
elements of the Grid struct (they are contiguous in memory) for every reference to any element and
a waste of precious cache space. In the streaming architecture, the compiler will analyze the use
of the elements of the Grid struct and will load in the stream register ¯le only the elements used
by the kernel. If a kernel does not need information on the logical position of a cell but just the
physical coordinates, it will not waste stream register ¯le space with elements that is not going to
use. Another possible solution will be to use a mask array. For some situations, it is also convenient
to write a di®erent kernel that just deals with the boundaries, but this is not always possible.
Status
Stream°o is in the ¯nal stage of development, all the major subroutines and all the multigrid
data movement have been coded and debugged. Once the code is completed, we will analyze the
simulator performance to identify bottlenecks and improve performance.
5.4.3 StreamMD
The massive computational power that will be provided by streaming supercomputers will be used
to model the folding of human proteins. Learning more about this process will have a tremendous
impact on biology and medicine and will help medical researchers understand diseases and ¯nd
159cures. Proteins are involved in almost all cellular processes in the body. They are formed by a
chain of amino acids. They fold into complex three-dimensional shapes and even a small change in
its structure can turn a desirable protein into a disease. The numerical simulation of the folding of
a protein is a challenging task and is one of the most signi¯cant grand challenge. High-performance
computing technologies are a key element in advancing the solution of this problem.
StreamMD is a Molecular Dynamics (MD) solver written in Brook. It is based on solving the
Newton's equations of motion:
mi
d2 ri
dt2 = ¡rriU(r1;:::;rN)
def = F(ri)
(mi is the mass of particle i located at ri). The function U is an empirical potential energy which
de¯nes the energy of the system in a given con¯guration. The velocity Verlet method (or Leap-frog)
is used to integrate in time the equations of motion. Assume we have the position ri(t) and velocity
vi(t) at time t, then we obtain the new values at time t + 4t using:
ri(t + 4t) = ri(t) + 4t vi(t) +
4t2
2
F(ri(t))
mi
vi(t + 4t) = vi(t) +
4t
2
F(ri(t)) + F(ri(t + 4t))
mi
Using this method, it is possible to simulate the complex trajectories of atoms and molecules
for very long periods of time. However only the advent of more powerful supercomputers will allow
studying molecules over time scales which are biologically and experimentally relevant.
The most expensive part of the simulation is the computation of the forces F(ri). In order to
start with a simple test case, we simulated a box of water molecules. In this case, the potential
energy is de¯ned as the sum of two terms: electrostatic potential and the Van der Waals potential.
A cuto® is applied so that all particles which are at a distance greater than rcuto® do not interact.
The potential energy is then a function of the form:
U =
X
i;j;jrl(i)¡rl(j)j<rcuto®
qiqj
4¼²0jrl(i) ¡ rl(j)j
+
X
i;j;jrm(i)¡rm(j)j<rcuto®
4²ij
Ãµ
¾ij
jrm(i) ¡ rm(j)j
¶12
¡
µ
¾ij
jrm(i) ¡ rm(j)j
¶6!
where l(i) is a list of all charged particles and m(i) is a list of all oxygen atoms in the system.
To establish whether streaming supercomputing will be successful for MD simulations, we con-
sidered the following issues:
² Can the streaming computing model be applied in a natural way to MD simulations? The
following section illustrates how the code was implemented in Brook and what capabilities
were required. The most di±cult point is the implementation of the reduction operations.
These operations are naturally implemented using the gatherOP and scatterOP functionality.
² Is the application arithmetic intensive enough that bandwidth is not the limiting factor?
We performed a study of one key kernel schedule. This study shows that the application is
su±ciently arithmetic intensive. The bandwidth between the SRF and the clusters which is
required is very low and is not the limiting factor.
² Can the computation be split in such a way that we make an e±cient use of the ALUs? The
entire application schedule was considered for the simpler case of the Imagine hardware. This
160hardware doesn't have all the capabilities of the streaming supercomputer. The result shows
that memory operations hinder the calculation and the full utilization of the ALUs. Several
remedies are implemented in the streaming supercomputer so that this limitation is overcome.
In particular, the stream cache will allow a better utilization of the bandwidth.
Brook for StreamMD
Brook was developed with the intent of symplifying the task of implementing parallel codes and
hiding most of hardware issues such as communication from the user. In the case of streamMD,
this was done by implementing two functions, streamGatherOP and streamScatterOP. The rest
of the implementation are standard calls to kernel functions. The result is a simple and e±cient
implementation of the MD code for the streaming architecture.
The current implementation utilizes an underlying grid which allows, for each atom, to rapidly
search for all atoms within a distance rcuto®. The construction of the 3D array of grid cells con-
taining the list of all atoms within the cell is naturally implemented using streamGatherOp. This
function computes in parallel and very e±ciently the index of each atom within each grid cell. The
implementation is as follows. First we build an array GridCellNumber which contains the grid cell
number of each particle. Then we initialize to zero an array NumberOfAtom of size the total num-
ber of grid cells. StreamGatherOp is then used to fetch for each particle the current index number
in NumberOfAtom and add one. The result is stored in AtomNumber, which then contains for
each atom its unique index in the grid cell.
Once we have this information, for each grid cell, we build a list of all pairs of interacting water
molecules for the current cell and the surrounding cells. This produces a long stream of pairs which
is then passed to a kernel function which computes the interaction forces for all these pairs. A
reduction must then be done on all the computed forces to obtain the total force acting on each
molecule. This is implemented in parallel using streamScatterOp. This function takes the stream
with all the partial forces computed so far and scatters it to memory. StreamScatterOP does it in
such a way that each time a value is sent to memory, it is added to the value currently stored.
Another feature that was added in Brook to help implement e±ciently the MD code is the
reference stream construct. Reference streams allow for indirect addressing of memory data. They
are explicitely dereferenced by the streamScatter and streamGather functions.
Performance on the Streaming Supercomputer
To illustrate the performance of the application, we looked at two key schedules: the kernel schedule
for the force computation and the application schedule for kernels and memory operations. This
was done for the Imagine architecture which is simpler and has less capabilities than the Streaming
Supercomputer architecture. Speci¯cally, Imagine does not support GatherOp and ScatterOp which
required us to take a slightly di®erent approach to performing the reductions. This resulted in
increased memory bandwidth requirements on the Imagine.
The results we obtained guided some of the architectural decisions that will be validated by
compiling the full code with the Brook compiler.
Figure 5.17 depicts the VLIW schedule of the force calculation kernel on Imagine. A single force
calculation for the electrostatic potential, which produces partial forces for a pair of molecules,
requires roughly 500 Imagine operations. When optimizations including software pipelining and
loop unrolling are applied, the steady state performance is 13 cycles for a single interaction. This
corresponds to a total of 106 cycles, with 8 computations being performed at one time across the 8
clusters. It is important to note that the main loop is divide-limited. This can be seen by looking
at the sixth column on Figure 5.17. It is constantly ¯lled whereas idle times appear in the ¯rst
¯ve columns (3 adds + 2 multiplies). This is in part because the divide functional-unit is not fully
161pipelined. In order to overcome this shortcoming, the architecture for the Streaming Supercomputer
has been revised to include support for software iterative divides instead of a complete hardware
solution. This should allow us to better utilize the available computation resources and to better
pipeline the loop.
Importantly, the SRF bandwidth required by this kernel is low at about 1 word for every 30
instructions, justifying the large bandwidth gap between the local registers and the SRF.
After examining the kernel performance, we looked at the schedule of kernels and memory
operations. Figure 5.18 shows a representative part of the execution of the force calculation. When
these tests were made, the gridding technique had not been implemented. Therefore we used a
kernel, annotated as 1, to generate all possible pairs of molecules within the cuto® radius and ¯lter
out far-away molecule pairs. The result of this kernel is two streams which are processed by two
computation kernels, annotated by 2 and 3. The loop was unrolled twice to allow us to hide some of
the memory latency, hence each kernel appears twice. As can be seen, the execution is hindered by
the memory operations, and does not fully utilize the functional units. This can be seen by looking
at the number of cycles separating calls to kernels. This is due to the fact that the computation
cannot start until all the molecules have been gathered and ¯ltered.
This problem will be solved with SSS, whose bandwidth memory system is larger than Imagine
by a factor of about 2. We also introduced a stream cache on the SSS for an additional bandwidth
ampli¯cation. These two elements are likely to eliminate the stalls.
162ADD0 ADD1 ADD2 M UL0 M UL1 DIV0 INO0 INO1 INO2 INO3 INO4 INO5 INO6 INO7 SP_0 SP_0 COM 0 MC_0 JUK0 VAL0
80
90
100
110
120
130
140
150
160
170
180
190
200
210
220
230
240
250
260
270
280
290
300
CHK_ANY LOOP
FSUB SPREAD COND_IN_R FSUB SPREAD COND_IN_D FSUB ISUB32 SPCREAD_WT SPCWRITE FSUB FSUB SPCREAD_WT SPCWRITE FSUB SPCREAD_WT SPCWRITE FSUB FSUB SPREAD SPCREAD_WT SPCWRITE FMUL FMUL FSUB COND_IN_R FMUL FMUL FSUB COND_IN_R COND_IN_R FADD FMUL COND_IN_D SPCREAD_WT SPCWRITE FADD FMUL FSUB COND_IN_R FSUB FADD COND_IN_R SPREAD COND_IN_D SPREAD FSUB FSUB FADD COND_IN_R FSUB COND_IN_R FSUBFINVSQRT FSUB COND_IN_R COND_IN_R SPREAD FMUL FMUL FSUBFINVSQRT COND_IN_R COND_IN_R FMUL FMUL FSUB COND_IN_R COND_IN_R FADD FMUL COND_IN_D FADD FMUL SPCREAD_WT SPCWRITE COND_IN_R SPREAD COND_IN_R FADD SPREAD COND_IN_R FSUB COND_IN_R FADD FSUB COND_IN_R COND_IN_R FSUB FDIV SPREAD COND_IN_R FSUB COND_IN_R FINVSQRT FMUL FMUL FSUB COND_IN_D FMUL FMUL FSUB FADD FMUL FADD FMUL IADD32 FADD SPREAD FSUB FSUB FADD SPREAD_WT SPWRITE FSUB FINVSQRT SPREAD FSUB IADD32 FMUL FMUL FSUB FINVSQRT SPWRITE IADD32 FMUL SPREAD_WT SPWRITE FMUL FMUL FSUB SPCREAD_WT SPCWRITE FADD FMUL IADD32 SPREAD SPREAD FMUL FADD FMUL FSUB FADD FSUB FSUB FSUB SPCREAD_WT SPCWRITE FMUL FADD FLT FINVSQRT FSUB FLT SPCREAD_WT SPCWRITE FSUB FSUB FMUL FMUL SPCREAD_WT SPCWRITE FSUB FINVSQRT FLT IABD32 FLT FMUL FMUL IABD32 FSUB FADD FMUL FLT FSUB SPCREAD_WT SPCWRITE FLT FSUB IABD32 FADD SPREAD FMUL FADD SPREAD FSUB FSUB FSUB FSUB FSUB SPREAD FADD SPREAD SPREAD FINVSQRT FSUB FSUB FSUB FSUB FMUL FMUL SPREAD FINVSQRT FSUB FMUL FMUL FMUL FADD SPCREAD_WT SPCWRITE SPCREAD_WT SPCWRITE FMUL FADD FADD FSUB FSUB FSUB FSUB FADD FINVSQRT FSUB FSUB FSUB FSUBFMUL FMUL SPCREAD_WT SPCWRITE FINVSQRT FSUB FMUL FMUL FMUL FADDFMUL FADD SPREAD FADD SPREAD FSUB FSUB FADD FINVSQRT SPREAD FSUB FSUB FSUB SPREAD FMUL FMUL FINVSQRT FSUB SPWRITE FMUL FMUL FMUL FADD FMUL FADD FADD FMUL FSUB FSUB FADD FINVSQRT FSUBFMUL FSUB FSUB FMUL FMUL COND_IN_R FINVSQRT FMUL FSUBFMUL FMUL SPCREAD_WT SPCWRITE FMUL FADD FSUB COND_IN_R FMUL FMUL COND_IN_D FMUL FADDFMUL FSUB SPREAD FADD FSUB IMUL32 SPCREAD_WT SPCWRITE ILT32 FSUB FMUL IMUL32 SPCREAD_WT SPCWRITE FMUL FMUL FADD FMUL FMUL SPREAD FINVSQRT FMUL FMUL COND_IN_R FSUB FMUL FMUL SPWRITE COND_IN_R FMUL FMUL SPREAD_WT SPWRITE ISUB32FDIV FADD FMUL FMUL FMUL FMUL SPCREAD_WT SPCWRITE FSUBFMUL FMUL SPREAD FADD FMUL FMUL SPCREAD_WT SPCWRITE FMUL FMUL SPCREAD_WT SPCWRITE FMUL FMUL FMUL FADD FMUL FSUB COND_IN_R FADD FMUL FMUL COND_IN_R FMUL FMUL COND_IN_R FMUL FMUL COND_IN_R SPCWRITE FADD FMUL FMUL SPCWRITE COND_IN_R SPCREAD_WT FINVSQRT FMUL FMUL FLT SPCWRITE SPCREAD_WT COND_IN_R FADDFMUL FADD FMUL SPCWRITE SPCREAD COND_IN_D FMUL FMUL SPCWRITE SPCREAD FINVSQRT FMUL FMUL SPCREAD_WT SPCWRITE ILT32 FADD FMUL FMUL IABD32 FLT SPCREAD_WT SPCWRITE COND_IN_R FADD FADD FMUL FMUL SPCREAD_WT SPCWRITE COND_IN_R FMUL FADD FADD FMUL SPCREAD_WT SPCWRITE COND_IN_R FADD FADDFMUL FMUL SPCREAD_WT SPCWRITE COND_IN_R FADD FADD FMUL FMUL SPCREAD_WT SPCWRITE FADD FADD FADD FMUL FMUL SPCREAD_WT SPCWRITE FMUL FMUL FADD FADD SPCREAD_WT SPCWRITE COND_IN_R FMUL FADD FADD FADD FMUL SPCREAD_WT SPCWRITE COND_IN_R FMUL FADD SPCWRITE COND_IN_R FADD FADD SPCWRITE FMUL FMUL FADD FADD FADD SPCWRITE COND_IN_R FMUL FMUL FADD FADD SPCWRITE COND_IN_R FMUL FMUL FADD FADD SPCWRITE SPCREAD FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD SPCWRITE SPCREAD_WT FMUL FMUL FADD FADD FADD SPCWRITE SPCREAD_WT FADD FMUL FMUL SPREAD FADD SPCWRITE FADD FADD FMUL FMUL SPREAD FADD SPCWRITE FADD FMUL FMUL SPREAD FADD SPCWRITE FMUL FMUL SPREAD FADD FADD SPCWRITE FADD FMUL FMUL SPREAD FADD FSUB SPCWRITE FMUL FMUL SPREAD FSUB SPCWRITE FADD FMUL FMUL SPREAD FSUB SPCWRITE FMUL FMUL SPREAD FSUB FSUB FADD SPCWRITE FADD FADD FMUL FMUL SPREAD FSUB SPCWRITE FMUL FADD FMUL FSUB FSUB SPCWRITE SPCREAD_WT FMUL FADD FADD FMUL SPCWRITE FSUB SPCREAD_WT FMUL FADD FADD FADD SPWRITE FMUL SPREAD_WT FADD FADD FADDFMUL FMUL SPWRITE SPCREAD_WT FADD FADD FADD FMUL SPWRITE SPCREAD_WT FADD FADD FADD FMUL SPCWRITE SPCREAD FADD FADD FADD SPCWRITE SPCREAD FADD FADD FADD SPWRITE SPCREAD_WT FADD FADD FADD SPWRITE SPCREAD_WT FADD FADD FADD SPWRITE SPCREAD_WT FADD FADD FADD SPWRITE FMUL SPCREAD_WT FADD FADD SPWRITE FADD SPREAD_WT FADD FADD SPWRITE FADD SPREAD_WT FADD FADD FSUB SPCWRITE SPCREAD FADD SPCWRITE FSUB FADD SPREAD FADD FADD SPCWRITE FADD SPCREAD_WT FADD SPCWRITE FSUB FADD SPREAD FADD FADD SPWRITE FADD SPREAD FADD SPWRITE FADD FADD SPREAD_WT FADD FADD FADD SPCWRITE SPREAD FADD SPWRITE SPREAD FSUB FADD SPCWRITE FSUB SPREAD SPCREAD_WT SPCWRITE FSUB FSUB SPCREAD_WT SPCWRITE FADD FSUB SPCREAD_WT SPCWRITE FSUB FADD FLT SPCREAD_WT FSUB SPWRITE SPCREAD_WT FSUB SPWRITE SPCREAD_WT SPWRITE FLT COND_IN_R CHK_ANY SPCREAD_WT FADD SPWRITE IABD32 FLT COND_IN_D NLOOP SPCREAD_WT SPWRITE FLT SPCREAD_WT SPWRITE FSUB IABD32
NLOOP
SPREAD SPREAD SPREAD SPREAD SPREAD
Figure 5.17: Kernel Schedule for the force calculation. Cycles are marked vertically on the left, and
the di®erent columns correspond to the various Imagine resources used. The 3 left-most columns
are the adders, followed by 2 multipliers, the divide unit, and then communication, scratch-pad,
and SRF resources.
163MC_0 MEM_0
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139000
766 Save to MAR3 from SDR7 for 'tmp_in_forces3*' 769 Load from MAR2 to SDR6 w ith SDRIdx27 for 
'pos0_1*' 777 Load from MAR5 to SDR29 for 'tmp_in_forces*'
778 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 787 Load from MAR7 to SDR24 w ith SDRIdx27 for 
796 Load from MAR4 to SDR30 w ith SDRIdx22 for 
788 Save to MAR6 from SDR24 for 'pair_molcls2_ib*'
797 Save to MAR5 from SDR30 for 'tmp_in_forces3b*'
781 Save to MAR0 from SDR31 for 'tmp_out_forces*' 803 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 799 Save to MAR0 from SDR31 w ith SDRIdx11 for 
'forces*'
806 Save to MAR1 from SDR8 for 'tmp_out_forcesb*' 825 Run 'GenerateLow erMolclPairs' at MPC 0 w ith SDRs 810 Save to MAR4 from SDR8 w ith SDRIdx26 for 
854 Run 'GenerateLow erMolclPairs' at MPC 0 w ith SDRs
872 Load from MAR0 to SDR25 w ith SDRIdx19 for  873 Save to MAR1 from SDR25 for 'pair_molcls1_i*'
858 Save to MAR4 from SDR4 for 'close_infob' 877 Load from MAR7 to SDR13 w ith SDRIdx19 for 
'pos0_2*'
882 Load from MAR6 to SDR18 for 'forces*'
878 Save to MAR4 from SDR13 for 'pair_molcls2_i*' 899 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 887 Load from MAR2 to SDR29 w ith SDRIdx5 for 
'forces*' 891 Load from MAR0 to SDR10 w ith SDRIdx27 for 
892 Save to MAR1 from SDR10 for 'pair_molcls1_ib*' 888 Save to MAR5 from SDR29 for 'tmp_in_forces3*' 909 Load from MAR6 to SDR12 w ith SDRIdx27 for 
'pos0_2*' 914 Load from MAR2 to SDR6 for 'forcesb*' 902 Save to MAR7 from SDR11 for 'tmp_out_forces*'
919 Load from MAR0 to SDR24 w ith SDRIdx16 for 
'forcesb*'
910 Save to MAR3 from SDR12 for 'pair_molcls2_ib*'
928 Load from MAR7 to SDR15 for 'base_molcls2b' 920 Save to MAR1 from SDR24 for 'tmp_in_forces3b*' 929 Load from MAR5 to SDR8 for 'tmp_in_forcesb*'
938 Save to MAR3 from SDR11 w ith SDRIdx3 for 
'forces*' 942 Load from MAR2 to SDR25 w ith SDRIdx28 for  930 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 946 Load from MAR5 to SDR13 w ith SDRIdx28 for 
'pos0_2*'
943 Save to MAR1 from SDR25 for 'pair_molcls1_i*' 953 Load from MAR3 to SDR29 w ith SDRIdx1 for 
'forces*' 947 Save to MAR7 from SDR13 for 'pair_molcls2_i*'
933 Save to MAR4 from SDR26 for 'tmp_out_forcesb*' 939 Save to MAR0 from SDR26 w ith SDRIdx9 for 
'forcesb*' 950 Load from MAR4 to SDR18 for 'forces*' 954 Save to MAR0 from SDR29 for 'tmp_in_forces3*'
957 Load from MAR2 to SDR21 w ith SDRIdx14 for  962 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 972 Load from MAR4 to SDR7 w ith SDRIdx14 for 
'pos0_2*'
958 Save to MAR1 from SDR21 for 'pair_molcls1_ib*' 976 Load from MAR3 to SDR6 for 'forcesb*'
981 Load from MAR2 to SDR19 w ith SDRIdx17 for 
'forcesb*'
973 Save to MAR6 from SDR7 for 'pair_molcls2_ib*'
982 Save to MAR1 from SDR19 for 'tmp_in_forces3b*' 965 Save to MAR5 from SDR11 for 'tmp_out_forces*' 991 Run 'MedMolclInteractions' at MPC 574 w ith SDRs
984 Save to MAR5 from SDR11 w ith SDRIdx3 for 
994 Save to MAR7 from SDR27 for 'tmp_out_forcesb*' 1013 Run 'GenerateLow erMolclPairs' at MPC 0 w ith 
SDRs
998 Save to MAR2 from SDR27 w ith SDRIdx16 for 
1042 Run 'GenerateLow erMolclPairs' at MPC 0 w ith 
SDRs 1059 Load from MAR5 to SDR25 w ith SDRIdx28 for 
1060 Save to MAR7 from SDR25 for 'pair_molcls1_i*'
1049 Save to MAR0 from SDR29 for 'med_infob' 1063 Load from MAR4 to SDR13 w ith SDRIdx28 for 
'pos0_2*'
1068 Load from MAR6 to SDR30 for 'forces*'
1073 Load from MAR0 to SDR31 w ith SDRIdx10 for 
'forces*' 1064 Save to MAR2 from SDR13 for 'pair_molcls2_i*'
1076 Load from MAR5 to SDR21 w ith SDRIdx17 for 
'pos0_1*'
1074 Save to MAR1 from SDR31 for 'tmp_in_forces3*'
1084 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 1093 Load from MAR3 to SDR7 w ith SDRIdx17 for  1077 Save to MAR7 from SDR21 for 'pair_molcls1_ib*'
1103 Load from MAR7 to SDR15 w ith SDRIdx9 for 
'forcesb*'
1094 Save to MAR0 from SDR7 for 'pair_molcls2_ib*'
1087 Save to MAR2 from SDR11 for 'tmp_out_forces*' 1104 Save to MAR4 from SDR15 for 'tmp_in_forces3b*'
1109 Load from MAR2 to SDR23 for 'base_molcls2b' 1118 Save to MAR0 from SDR11 w ith SDRIdx3 for  1110 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 1123 Load from MAR1 to SDR4 w ith SDRIdx12 for 
'pos0_1*' 1128 Load from MAR4 to SDR8 w ith SDRIdx12 for 
'pos0_2*' 1124 Save to MAR5 from SDR4 for 'pair_molcls1_i*' 1137 Load from MAR0 to SDR18 w ith SDRIdx14 for 
2
2
3
3
1
1
1
1
MC_0 MEM_0
97000
98000
99000
100000
101000
102000
103000
104000
105000
106000
107000
108000
109000
110000
111000
112000
113000
114000
115000
116000
117000
118000
119000
120000
121000
122000
123000
124000
125000
126000
127000
128000
129000
130000
131000
132000
133000
134000
135000
136000
137000
138000
139000
766 Save to MAR3 from SDR7 for 'tmp_in_forces3*' 769 Load from MAR2 to SDR6 w ith SDRIdx27 for 
'pos0_1*' 777 Load from MAR5 to SDR29 for 'tmp_in_forces*'
778 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 787 Load from MAR7 to SDR24 w ith SDRIdx27 for 
796 Load from MAR4 to SDR30 w ith SDRIdx22 for 
788 Save to MAR6 from SDR24 for 'pair_molcls2_ib*'
797 Save to MAR5 from SDR30 for 'tmp_in_forces3b*'
781 Save to MAR0 from SDR31 for 'tmp_out_forces*' 803 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 799 Save to MAR0 from SDR31 w ith SDRIdx11 for 
'forces*'
806 Save to MAR1 from SDR8 for 'tmp_out_forcesb*' 825 Run 'GenerateLow erMolclPairs' at MPC 0 w ith SDRs 810 Save to MAR4 from SDR8 w ith SDRIdx26 for 
854 Run 'GenerateLow erMolclPairs' at MPC 0 w ith SDRs
872 Load from MAR0 to SDR25 w ith SDRIdx19 for  873 Save to MAR1 from SDR25 for 'pair_molcls1_i*'
858 Save to MAR4 from SDR4 for 'close_infob' 877 Load from MAR7 to SDR13 w ith SDRIdx19 for 
'pos0_2*'
882 Load from MAR6 to SDR18 for 'forces*'
878 Save to MAR4 from SDR13 for 'pair_molcls2_i*' 899 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 887 Load from MAR2 to SDR29 w ith SDRIdx5 for 
'forces*' 891 Load from MAR0 to SDR10 w ith SDRIdx27 for 
892 Save to MAR1 from SDR10 for 'pair_molcls1_ib*' 888 Save to MAR5 from SDR29 for 'tmp_in_forces3*' 909 Load from MAR6 to SDR12 w ith SDRIdx27 for 
'pos0_2*' 914 Load from MAR2 to SDR6 for 'forcesb*' 902 Save to MAR7 from SDR11 for 'tmp_out_forces*'
919 Load from MAR0 to SDR24 w ith SDRIdx16 for 
'forcesb*'
910 Save to MAR3 from SDR12 for 'pair_molcls2_ib*'
928 Load from MAR7 to SDR15 for 'base_molcls2b' 920 Save to MAR1 from SDR24 for 'tmp_in_forces3b*' 929 Load from MAR5 to SDR8 for 'tmp_in_forcesb*'
938 Save to MAR3 from SDR11 w ith SDRIdx3 for 
'forces*' 942 Load from MAR2 to SDR25 w ith SDRIdx28 for  930 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 946 Load from MAR5 to SDR13 w ith SDRIdx28 for 
'pos0_2*'
943 Save to MAR1 from SDR25 for 'pair_molcls1_i*' 953 Load from MAR3 to SDR29 w ith SDRIdx1 for 
'forces*' 947 Save to MAR7 from SDR13 for 'pair_molcls2_i*'
933 Save to MAR4 from SDR26 for 'tmp_out_forcesb*' 939 Save to MAR0 from SDR26 w ith SDRIdx9 for 
'forcesb*' 950 Load from MAR4 to SDR18 for 'forces*' 954 Save to MAR0 from SDR29 for 'tmp_in_forces3*'
957 Load from MAR2 to SDR21 w ith SDRIdx14 for  962 Run 'MedMolclInteractions' at MPC 574 w ith SDRs 972 Load from MAR4 to SDR7 w ith SDRIdx14 for 
'pos0_2*'
958 Save to MAR1 from SDR21 for 'pair_molcls1_ib*' 976 Load from MAR3 to SDR6 for 'forcesb*'
981 Load from MAR2 to SDR19 w ith SDRIdx17 for 
'forcesb*'
973 Save to MAR6 from SDR7 for 'pair_molcls2_ib*'
982 Save to MAR1 from SDR19 for 'tmp_in_forces3b*' 965 Save to MAR5 from SDR11 for 'tmp_out_forces*' 991 Run 'MedMolclInteractions' at MPC 574 w ith SDRs
984 Save to MAR5 from SDR11 w ith SDRIdx3 for 
994 Save to MAR7 from SDR27 for 'tmp_out_forcesb*' 1013 Run 'GenerateLow erMolclPairs' at MPC 0 w ith 
SDRs
998 Save to MAR2 from SDR27 w ith SDRIdx16 for 
1042 Run 'GenerateLow erMolclPairs' at MPC 0 w ith 
SDRs 1059 Load from MAR5 to SDR25 w ith SDRIdx28 for 
1060 Save to MAR7 from SDR25 for 'pair_molcls1_i*'
1049 Save to MAR0 from SDR29 for 'med_infob' 1063 Load from MAR4 to SDR13 w ith SDRIdx28 for 
'pos0_2*'
1068 Load from MAR6 to SDR30 for 'forces*'
1073 Load from MAR0 to SDR31 w ith SDRIdx10 for 
'forces*' 1064 Save to MAR2 from SDR13 for 'pair_molcls2_i*'
1076 Load from MAR5 to SDR21 w ith SDRIdx17 for 
'pos0_1*'
1074 Save to MAR1 from SDR31 for 'tmp_in_forces3*'
1084 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 1093 Load from MAR3 to SDR7 w ith SDRIdx17 for  1077 Save to MAR7 from SDR21 for 'pair_molcls1_ib*'
1103 Load from MAR7 to SDR15 w ith SDRIdx9 for 
'forcesb*'
1094 Save to MAR0 from SDR7 for 'pair_molcls2_ib*'
1087 Save to MAR2 from SDR11 for 'tmp_out_forces*' 1104 Save to MAR4 from SDR15 for 'tmp_in_forces3b*'
1109 Load from MAR2 to SDR23 for 'base_molcls2b' 1118 Save to MAR0 from SDR11 w ith SDRIdx3 for  1110 Run 'CloseMolclInteractions' at MPC 186 w ith SDRs 1123 Load from MAR1 to SDR4 w ith SDRIdx12 for 
'pos0_1*' 1128 Load from MAR4 to SDR8 w ith SDRIdx12 for 
'pos0_2*' 1124 Save to MAR5 from SDR4 for 'pair_molcls1_i*' 1137 Load from MAR0 to SDR18 w ith SDRIdx14 for 
2
2
3
3
1
1
1
1
Figure 5.18: Schedule of the kernels and memory operations for the force calculation. Cycles are
marked vertically on the left, and the di®erent columns correspond to the various Imagine resources
used. The left-most column represents kernels running on the clusters, and the right column is for
memory operations.
1645.4.4 StreamVIZ
After computing volumes of data on large multiprocessor computers, one is left with the daunting
task of analyzing and interpreting the resulting °ow ¯elds. For example, consider a variable density
incompressible °ow simulation of °ow past a sphere on a 10003 grid where one has approximately
109 values for velocity vectors, densities and pressures at every time step. While one might be
interested in the overall drag on a sphere, it makes little sense to report only this number (as
many researchers do), especially when 5 ¤ 109 numbers are available at each simulation time step.
One of our visualization goals is to provide high-level global visualizations of entire °ow ¯elds in
order to aid in data interpretation. For example, when carrying out an actual experiment of water
°owing past a sphere, an experimentalist might inject dye into the °ow in order to ascertain where
the most interesting parts of the °ow are marking them for further study. This can be mimicked
numerically by injecting a passive scalar just upwind of the sphere and rendering it as a smoke
trail entrained in various vortices. These techniques are not meant to replace standard ones; just
as the experimentalists dye does not alleviate the need to measure the drag on the sphere. But
these visualization techniques do provide a global basic understanding of the °ow ¯eld allowing
one to more easily mark speci¯c regions of the °ow for further study. Our goal is not to obtain
the exact physical picture of the simulation, but instead to create plausible visualizations that
provide useful information at a high level. For example, in clean combustion the entire reaction
region is invisible, and it does little good to draw a picture of nothing. Thus, we are working to
visualize these processes in creative ways that draw as much as possible on realistic techniques, but
add nonphysical plausible aspects when deemed necessary. On the other hand, it is important to
keep in mind that more realistic rendering should markedly increase the e±ciency of the high-level
interpretation of data.
The goal of the StreamVIZ portion of this project is twofold. First, obviously, it is advanta-
geous to ensure that the new streaming super computer architecture can support all the standard
visualization algorithms used in everyday engineering analysis as well as the new high level photore-
alistic tools that we are currently developing. Second, and maybe more importantly, since the other
streaming applications are fairly specialized, this part of the streaming supercomputer project fo-
cuses more broadly on the fundamental mathematical algorithms and numerical analysis to ensure
that the new computer's architecture can not only be used to solve the specialized problems we are
focusing on in our particular ASCI center, but also be used by researchers throughout the world
interested in a variety of problems. Thus, this past year while working on algorithms for visualizing
°uid dynamics, we spent considerable time focusing on a number of related numerical algorithms
from the standard mathematical numerical analysis literature. For example, we worked on iterative
methods for solving large sparse linear systems focusing on non-multigrid methods since multigrid
is successfully being pursued by the StreamFlo team. We worked on general ¯nite di®erence meth-
ods for elliptic, parabolic and hyperbolic partial di®erential equations including both explicit and
implicit ordinary di®erential equation solvers. This complements well the °uid dynamics pursuits of
the rest of the center. And we worked on level set methods for solving problems with interfaces, in
particular the particle level set technology for droplet breakup problems. Similarly, when we start
to consider visualization associated with applications currently being pursued by the StreamFEM
team, we will consider algorithms for nonlinear time dependent ¯nite element calculations as well
as algorithms for contact, collision and fracture modeling. In fact, we have already begun working
in this direction simulating soft deformable surfaces with millions of triangles in the simulation
mesh and tens of thousands of resolved collisions per time step with no self-interference allowed.
This work was published in [2], see ¯gure 5.19. And when considering visualization related to the
e®orts of the StreamMD team, we will also be considering algorithms for short and long range
interactions, as well as collisions between rigid and articulated bodies (i.e. representing molecules).
We have been working on photorealistic visualization of smoke, water and ¯re since these rep-
165Figure 5.19: Sample calculations of a thin surface with millions of triangles in the simulation mesh
and tens of thousands of resolved collisions per time step with no self-interference allowed.
resent a large portion of the visual phenomena present in °uids. In this work, we have used custom
algorithms both because they are quite fast and because we want to stress a broad algorithms
base for the streaming supercomputer. Our goal in these simulations is to get as much detail as
possible into the simulation so that the data sets are richer for the visualization phase. Moreover,
this allows us to get complex visualization data with very short simulation times, thus providing a
wealth of data for the visualization process. Of course, this rules out many approaches like using
averaged equations or fully resolving all scales on the computational grid. With this in mind, we
used the vorticity con¯nement method of Steinho® to generate visually complex °ow ¯elds in our
smoke simulations. Although not completely grounded in rigorous physics, this method has been
shown to produce °ow ¯elds which are useful \models" for complex rotating °ows. For example,
vorticity con¯nement is commonly used in the rotocraft industry to generate complex °ow ¯elds
that can be used to test controllers whereas other methods usually fail due to their inability to
either resolve the necessary scales or provide useful results in under-resolved calculations. Figure
5.20 shows some typical visualizations of smoke. This work was published in [5].
For the water simulations, we use the high quality particle level set method described in [3] to
carry out the °uid dynamics calculations. This method has been shown to be more accurate than the
volume of °uid method and more robust than the front tracking method especially in three spatial
dimensions. This high degree of accuracy allows us to obtain complex dynamic interfaces rich with
features for our isosurface visualization algorithms. Figure 5.21 shows some typical visualizations
of water.
For the ¯re simulations, we based our work on our recent paper [10] that improved the work of
[6] allowing for the merging of combustion fronts, the simulation of °ames with high curvature, and
a three dimensional implementation. [10] extended the ghost °uid method (invented in [4]) from
incompressible two phase °ow [8],[7] to include interfaces where the normal velocity has a jump
discontinuity due to the conversion of one phase into another (in this case, the chemical reaction).
For the purposes of visualization we made a few more modi¯cations to this algorithm [9] both to
improve its e±ciency and to increase the amount of detail present for visual analysis. Moreover, we
added programmable density and temperature variables that could be tuned to visualize di®erent
parts of the °ow ¯eld as if the density variable represented soot that emitted blackbody radiation
based on the temperature ¯eld. Some sample simulations are shown in ¯gure 5.22. Note that we
can render the zero isocontour of the level set function as a blue °ame core in order to visualize
the thin combustion zone, or we can visualize the °ow ¯eld using our soot emission model.
166Figure 5.20: Sample calculations of smoke using the vorticity con¯nement method of Steinho®. A
high degree of rolling motion can be obtained on coarse grids using this method.
Figure 5.21: Sample calculations of water using the new particle level set method. The zero level
set is used to render the water surface.
Figure 5.22: Sample calculations of ¯re using the level set method and the ghost °uid method. The
¯gure on the left shows the zero level set rendered as a blue °ame core. The ¯gure on the right
shows typical blackbody radiation.
167168Bibliography
[1] Harten A., Lax P. D., and van Leer B., 1983, On upstream di®erencing and Godunov-type
schemes for hyperbolic conservation laws. SIAM Rev., 25:35{61.
[2] Bridson, R., Fedkiw, R. and Anderson, J.,2002, Robust Treatment of Collisions, Contact and
Friction for Cloth Animation, SIGGRAPH 2002, ACM TOG 21, 594-603.
[3] Enright, D., Marschner, S. and Fedkiw, R., 2002, Animation and Rendering of Complex Water
Surfaces, SIGGRAPH 2002, ACM TOG 21, 736-744.
[4] Fedkiw, R., Aslam, T., Merriman, B. and Osher, S., 1999, A Non-Oscillatory Eulerian Ap-
proach to Interfaces in Multimaterial Flows (The Ghost Fluid Method), J. Comput. Phys. 152,
457-492.
[5] Fedkiw, R., Stam, J. and Jensen, H., 2001, Visual Simulation of Smoke, SIGGRAPH 2001,
23-30.
[6] Helenbrook, B., Martinelli, L., Law, C., 1999, A Numerical Method for Solving Incompressible
Flow Problems with a Surface of Discontinuity, J. Comp. Phys. 148, 366-396.
[7] Kang, M., Fedkiw, R. and Liu, X.-D., 2000, A Boundary Condition Capturing Method for
Multiphase Incompressible Flow, J. Sci. Comput. 15, 323-360.
[8] Liu, X.-D., Fedkiw, R. and Kang, M., 2000, A Boundary Condition Capturing Method for
Poisson's Equation on Irregular Domains, J. Comput. Phys. 160, 151-178.
[9] Nguyen, D., Fedkiw, R. and Jensen, H., 2002, Physically Based Modeling and Animation of
Fire, SIGGRAPH 2002, ACM TOG 21, 721-728.
[10] Nguyen, D., Fedkiw, R. and Kang, M., 2001, A Boundary Condition Capturing Method for
Incompressible Flame Discontinuities, J. Comput. Phys. 172, 71-98.
169