Prefetching on the Cray-T3E: a model and its evaluation by Mueller, Matthias M. et al.
Prefetching on the Cray-T3E: A Model and its evaluation
Technical Report No. 26/97
Matthias M. Müller, Thomas M. Warschko and Walter F. Tichy
University of Karlsruhe, Dept. of Informatics
Am Fasanengarten 5, 76128 Karlsruhe, Germany
e-mail: {muellerm,warschko,tichy}@ira.uka.de
10th December 1997
Abstract
In many parallel applications, network latency causes a dramatic loss in
processor utilization. This paper examines software controlled access
pipelining (SCAP) as a technique for hiding network latency. An analytic
model of SCAP briefly describes basic operation techniques and performance
improvements. Results are quantified with benchmarks on the Cray-T3E. The
benchmarks used are Jacobi-iteration, parts of the Livermore Loop kernels,
and others representing six different parallel algorithm classes. These were
parallelized and optimized by hand to show the performance tradeoff of
several pipelining techniques. Our results show that SCAP on the Cray-T3E
improves performance compared to a blocking execution by a factor of 2.1 to
38. It also achieves a speed-up over HPF ranging from 12% to a factor of
3.1, depending on the algorithm class.
1 Introduction
As microprocessors get faster and the gap between computation and
communication speeds widens, network latency becomes the dominant factor in
the execution time of fine-grained parallel programs. Given a 300 MHz clock,
a 1.5 µs latency corresponds to 450 clock cycles, as is the case for the
Cray-T3E. Thus, instead of a single communication operation one could
perform 450 arithmetic instructions. This situation becomes worse by a
factor of up to 10 once software overhead is factored in. If, however, the
parallel machine is capable of performing communication and computation
concurrently, the loss in efficiency can be reduced by overlapping several
communication requests.
Footnote: Supported by the Graduiertenkolleg Karlsruhe 'Beherrschbarkeit
komplexer Systeme'.
Little is known about the effects of latency hiding applied to communication
networks in massively parallel computers with distributed memory. This paper
reports on experiments on the Cray-T3E that quantify the effects of latency
hiding on real programs, namely parallel versions of the Livermore Loops,
Jacobi-iteration, and a few others.
The basic latency hiding technique is discussed in Section 2 combined with
an overview of the analytic network model. Section 3 introduces the
architecture of the T3E and compares it to the assumptions made in Section 2.
Section 4 presents parallel algorithm classes and their instantiation by the
benchmark set. Performance results are discussed in Section 5. Section 6
presents conclusions and future work.
2 SCAP
2.1 The Technique
In fine-grained parallel applications, as in most other parallel
applications, latency prevents fast access to non-local memory. This work
targets latency hiding by overlapping computation and communication: each
non-local memory access is split into a prefetch and an access.
2.1.1 Network requirements
We distinguish blocking and overlapping networks (see figure 1):
Blocking Networks: The processor stalls until the desired remote value
arrives. Hence, there is no way other than task switching to overlap
different communication requests. The processor delay equals the latency of
the underlying network.
($C_n$, $\Delta t_n$)-Overlapping Networks: The network is able to issue a
communication request every $\Delta t_n$ cycles. It can handle up to $C_n$
operations per processor in parallel. $C_n$ and $\Delta t_n$ are explained
in detail in section 2.2.
Furthermore, the ($C_n$, $\Delta t_n$)-overlapping network should serve the
communication requests of each processor in FIFO order to achieve maximum
latency hiding, because the value prefetched first is accessed first.
2.1.2 Basic Operation of SCAP
The basic operation of SCAP is illustrated using the
following simple forall-statement:
FORALL i = 0..N-1 DO
A[i] := B[q(i)];
END
The program fragment updates array A in parallel, indexing array B with
permutation q. A parallelizing compiler maps the problem size N onto P real
processors. This technique is called virtualization. Hence, each processor
emulates $V = \lceil N/P \rceil$ virtual processors within a virtualization
loop. Both A and B are distributed over the P processors. Since the value of
q(i) cannot be determined at compile time, the compiler has to insert remote
memory accesses. The virtualization of the program fragment is as follows:
/* Forall processors in parallel */
FORALL j = 0..P-1 DO
  /* Simulate V virtual processors */
  FOR k = j*V TO (j+1)*V-1 DO
    /* Calculate remote address */
    a1 := calculate_address(B[q(k)]);
    /* Read remote data element */
    A[k] := remote_read(a1);
  END
END
In the worst case, every processor issues V non-local memory accesses. These
stall the processor if the network cannot serve the desired values fast
enough. Hence, in this case the execution time of the loop is at least V
times the network latency. The following transformation of the loop shows
how communication and computation can be overlapped:
FORALL j = 0..P-1 DO
  FOR k = j*V TO (j+1)*V-1 DO
    /* Calculate remote address */
    a1 := calculate_address(B[q(k)]);
    /* Start read request */
    prefetch(a1);
  END
  FOR k = j*V TO (j+1)*V-1 DO
    /* Recalculate remote address for simplicity */
    a1 := calculate_address(B[q(k)]);
    /* Access data element */
    A[k] := access(a1);
  END
END
In this transformation, the main loop is split into two instances: a
prefetch loop and an access (or calculation) loop. Instead of stalling on
each remote memory access as in blocking networks, the processor only issues
remote memory requests in the prefetch loop. After the prefetch loop has
executed, the calculation loop accesses non-local memory without waiting
time (if the data is already present). In the best case, the time saved is
about $(V - 1)$ times the network latency, because there is at most one
waiting period (arrival of the first data item), which is bridged with
subsequent communication requests, compared to V waiting periods in a
blocking network. The nature of this speed-up is explained below. However,
the double address calculation and the additional loop cost time as well. To
reduce the overhead of address calculation we assume a global address space
where either network or processor is able to compute local addresses
efficiently.
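The effect of splitting the loop can be illustrated with a small timing simulation. This is only a sketch under simplifying assumptions that are not taken from the report: the network is idealized (unlimited capacity, issue time ignored), every request arrives exactly one latency after it was issued, and the function names `blocking_time` and `scap_time` are hypothetical.

```python
def blocking_time(v, latency, dt_c):
    """Total time for v remote reads when every read stalls the processor."""
    t = 0
    for _ in range(v):
        t += latency  # stall until the remote value arrives
        t += dt_c     # use the value
    return t


def scap_time(v, latency, dt_p, dt_c):
    """Total time for a prefetch loop followed by an access loop."""
    t = 0
    arrival = []
    for _ in range(v):               # prefetch loop: issue, do not wait
        arrival.append(t + latency)  # assumption: arrival = issue time + latency
        t += dt_p
    for i in range(v):               # access loop: wait only if still in flight
        t = max(t, arrival[i])
        t += dt_c
    return t
```

With v = 4, a latency of 100 and 10 time units per loop iteration, the blocking version takes 440 units and the split version 140, a difference of (V - 1) latencies, as stated above.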
2.2 The analytic network-model
We only give a short overview of the model. A com-
plete discussion is given in [10].
First of all, the execution time T of a parallel algorithm can be written as
the sum of computation time and communication time:
$T = T_{Computation} + T_{Communication}$. While $T_{Computation}$ stays
constant for a fixed algorithm, $T_{Communication}$ depends on the
underlying network. Communication time for blocking networks
$T_{Blocking}(k)$ is
$$T_{Blocking}(k) = k \cdot T_{Latency}, \qquad (1)$$
because each of the k non-local memory accesses lasts one network latency.
[Figure 1: Different network models. Panels: a) Blocking, b) Overlapping.
Each panel shows accesses 1 to 3 with their inquiries and replies.]
The case for ($C_n$, $\Delta t_n$)-overlapping networks is not that easy
because each communication request depends on the ones before:
$$T_{Overlapping}(k) = T_{Prefetch}(k) + \sum_{i=1}^{k} T_{Wait_i}, \qquad 0 \le T_{Wait_i} \le T_{Latency}. \qquad (2)$$
$T_{Prefetch}(k)$ denotes the time for the prefetch loop and $T_{Wait_i}$ is
the waiting time for communication request i. While $T_{Prefetch}(k)$
depends only on address calculation, the determination of $T_{Wait_i}$
involves the parameters in table 1.
a) Application parameters
Parameter       Description
$\Delta t_p$    Time spent in one iteration of the prefetch loop
$\Delta t_c$    Time spent in one iteration of the calculation loop
b) Hardware parameters
Parameter       Description
$\Delta t_n$    Network issue time
$C_p$           Size of processor prefetch buffer
$C_n$           Capacity of network
Table 1: Hardware and application specific parameters.
$\Delta t_p$ and $\Delta t_c$ vary for different applications and have to be
measured for each new program. $\Delta t_n$, $C_p$, and $C_n$ characterize
the network, hence they are fixed for a given architecture. Network capacity
$C_n$ indicates the number of communication requests the network can
overlap. $C_n$ and $\Delta t_n$ are connected via $T_{Latency}$ because
$$T_{Latency} = C_n \cdot \Delta t_n. \qquad (3)$$
The prefetch buffer decouples the processor from the network such that the
processor can issue more prefetch instructions than the network can overlap.
Consequently, $C_p$ is assumed to be much larger than $C_n$
($C_p \gg C_n$).
The relationship of $\Delta t_n$, $\Delta t_p$, and $\Delta t_c$ on one side
and of $C_p$, $C_n$, and k on the other side introduces six different
network model classes. They are summarized in table 2.
                            $0 < \Delta t_n \le \Delta t_p \le \Delta t_c$    $0 < \Delta t_p < \Delta t_n$
$0 \le k \le C_n \ll C_p$   Class 1                                           Class 4
$C_n < k \le C_p$           Class 2                                           Class 5
$C_p < k$                   Class 3                                           Class 6
Table 2: Different processor network models.
The rows indicate the different amounts of communication compared to
prefetch buffer size and network capacity. The columns distinguish different
relationships between network issue rate and processor prefetch time.
Parameter $\Delta t_c$ covers not only address calculation but also some
computation which uses the requested non-local memory content. Consequently,
$\Delta t_c > \Delta t_p$ holds, which does not affect the first column.
However, it introduces additional classes in the second column, which are of
no further interest here. The discussion of the entire second column and its
subclasses can be found in [10].
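The classification of table 2 can be written down as a small lookup. The following is a minimal sketch; the function name `model_class` and its argument names are hypothetical, and the column test only compares the issue time with the prefetch time because the text assumes that the calculation time exceeds the prefetch time.

```python
def model_class(k, c_n, c_p, dt_n, dt_p):
    """Return the class number (1 to 6) of table 2.

    k     number of communication requests
    c_n   network capacity, c_p prefetch buffer size (c_n <= c_p)
    dt_n  network issue time, dt_p time per prefetch-loop iteration
    """
    if k < 0 or dt_n <= 0 or dt_p <= 0 or c_n > c_p:
        raise ValueError("parameters outside the range covered by table 2")
    if k <= c_n:
        row = 0          # fewer requests than the network can overlap
    elif k <= c_p:
        row = 1          # more than the network, but the buffer suffices
    else:
        row = 2          # more requests than prefetch buffer entries
    column = 0 if dt_n <= dt_p else 1
    return 1 + row + 3 * column
```

For instance, `model_class(100, 56, 480, 13.3, 160.0)` yields class 2, since the 100 requests exceed the network capacity but still fit into the prefetch buffer.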
Now, we can calculate $T_{Wait_i}$ in (2) for the network model classes one
to three.
Class 1: ($0 < \Delta t_n \le \Delta t_p \le \Delta t_c$;
$0 \le k \le C_n \ll C_p$, see figure 2)
There are fewer communication requests than the network can overlap
($k \le C_n$). Hence, after executing the prefetch loop the processor stalls
waiting for the first non-local access. All subsequent accesses to non-local
values have no delay because $\Delta t_c \ge \Delta t_n$. Consequently:
$$T_{Wait_i} = \begin{cases} T_{Latency} - k \cdot \Delta t_p & \text{if } i = 1 \\ 0 & \text{if } 1 < i \le k \end{cases} \qquad (4)$$
[Figure 2: Behavior of SCAP, class 1. Timeline of network and program:
prefetches $P_1 \ldots P_k$ spaced $\Delta t_p$ apart, one waiting period
$T_{Wait}$ until the first value arrives after $T_{Latency}$, then accesses
$A_1, A_2, A_3, \ldots$ spaced $\Delta t_c$ apart.]
Class 2: ($0 < \Delta t_n \le \Delta t_p \le \Delta t_c$;
$C_n < k \le C_p$, see figure 3)
The number of communication requests is larger than the network capacity
$C_n$ and not larger than the prefetch buffer size $C_p$. Hence, there are
no waiting times for non-local memory accesses during the calculation loop
because $\Delta t_c \ge \Delta t_n$. Therefore:
$$T_{Wait_i} = 0, \qquad 1 \le i \le k \qquad (5)$$
[Figure 3: Behavior of SCAP, class 2. Timeline of network and program:
prefetches $P_1 \ldots P_k$ spaced $\Delta t_p$ apart, latency
$T_{Latency}$, then accesses $A_1, A_2, \ldots$ spaced $\Delta t_c$ apart.]
Class 3: ($0 < \Delta t_n \le \Delta t_p \le \Delta t_c$; $C_p < k$, see
figure 4)
The difference to class 2 lies in the number of communication requests,
which exceeds the size of the prefetch buffer ($k > C_p$). Thus, SCAP
changes from a two-step to a three-step loop execution. In the first loop,
$C_p$ values are prefetched. In each iteration of the second loop, one entry
of the prefetch buffer is accessed, the content is used, and another
communication request is issued with the empty entry. This loop is executed
$k - C_p$ times. The third loop accesses the remaining $C_p$ non-local
values.
[Figure 4: Behavior of SCAP, class 3. Timeline of network and program:
prefetches $P_1 \ldots P_{C_p}, P_{C_p+1}$ spaced $\Delta t_p$ apart,
latency $T_{Latency}$, accesses $A_1, A_2, \ldots$ spaced $\Delta t_c$
apart.]
$\Delta t_c + \Delta t_p \ge \Delta t_n$ because
$\Delta t_c \ge \Delta t_p \ge \Delta t_n$, and therefore there are no
waiting times:
$$T_{Wait_i} = 0, \qquad 1 \le i \le k \qquad (6)$$
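The three-step loop execution of class 3 can be sketched as follows. This is an illustration only: `prefetch` and `access` are hypothetical stand-ins for the real communication primitives, the prefetch buffer is modelled as a FIFO queue, and the function additionally reports the buffer occupancy reached after the first loop so that the bound on outstanding requests can be checked.

```python
from collections import deque


def prefetch(b, index):
    """Hypothetical stand-in for issuing a remote read; returns a handle."""
    return (b, index)


def access(handle):
    """Hypothetical stand-in for reading the prefetched value."""
    b, index = handle
    return b[index]


def scap_gather(b, q, buffer_size):
    """Compute [b[j] for j in q] with at most buffer_size requests in flight."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    k = len(q)
    first = min(buffer_size, k)
    in_flight = deque()  # FIFO order: prefetched first, accessed first
    result = []
    # Loop 1: fill the prefetch buffer.
    for i in range(first):
        in_flight.append(prefetch(b, q[i]))
    peak = len(in_flight)
    # Loop 2: runs k - buffer_size times when k > buffer_size.
    for i in range(first, k):
        result.append(access(in_flight.popleft()))
        in_flight.append(prefetch(b, q[i]))  # reuse the freed entry
    # Loop 3: access the remaining prefetched values.
    while in_flight:
        result.append(access(in_flight.popleft()))
    return result, peak
```

If the buffer is at least as large as the number of requests, the second loop is empty and the scheme degenerates to the two-loop version of classes 1 and 2.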
Now, the waiting times are inserted in (2) to calculate the communication
time of SCAP, assuming $T_{Prefetch}(k)$ equals $k \cdot \Delta t_p$.
Class 1: Time of SCAP covers the prefetch loop and the waiting time for the
first non-local memory access (4):
$$T_{Overlapping,1}(k) = T_{Latency} \qquad (7)$$
Classes 2 and 3: As there are no waiting times (see (5) and (6)), time for
communication equals the overhead of the prefetch loop:
$$T_{Overlapping,2,3}(k) = k \cdot \Delta t_p \qquad (8)$$
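Equations (2) and (4) to (8) can be checked numerically with a few lines of code. The sketch below follows equation (4) literally, so it assumes that the prefetch loop of class 1 finishes before the first value arrives; the function and argument names are hypothetical.

```python
def wait_times(k, t_latency, dt_p, c_n):
    """Waiting times of eqs. (4) to (6) for classes 1 to 3."""
    if k <= 0:
        return []
    if k <= c_n:
        # Class 1: only the first access waits for the rest of the latency.
        return [t_latency - k * dt_p] + [0] * (k - 1)
    # Classes 2 and 3: no waiting times.
    return [0] * k


def t_overlapping(k, t_latency, dt_p, c_n):
    """Eq. (2) with the prefetch time taken as k * dt_p."""
    return k * dt_p + sum(wait_times(k, t_latency, dt_p, c_n))
```

With a latency of 100, a prefetch time of 10 and a network capacity of 10, five requests give a communication time of 100 (the latency, eq. (7)), whereas twenty requests give 200 (the prefetch overhead, eq. (8)).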
The runtime gain of SCAP over blocking execution is summarized below.
Class 1: The difference between $T_{Blocking}$ (1) and $T_{Overlapping,1}$
(7) is as follows:
$$T_{diff} = (k - 1) \cdot T_{Latency} \qquad (9)$$
SCAP runtime is $(k - 1) \cdot T_{Latency}$ faster than blocking execution.
Classes 2 and 3:
$$T_{diff} = \left(1 - \frac{1}{c}\right) \cdot k \cdot T_{Latency}, \qquad 1 < c < C_n \qquad (10)$$
with $c = T_{Latency} / \Delta t_p$. In both cases the advantage of SCAP is
limited by the network capacity $C_n$. The case $c = 1$ is omitted because
this results in a blocking network. In networks with larger capacities,
$1 - \frac{1}{c}$ denotes the relative reduction of communication time. For
example, with $c = 2$, SCAP reduces communication time by 50% (90% with
$c = 10$).
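The savings of equations (9) and (10) are easy to evaluate directly. The following sketch uses hypothetical function names and takes c as the ratio of latency to prefetch time, as defined above.

```python
def t_diff_class1(k, t_latency):
    """Eq. (9): time saved against blocking execution in class 1."""
    return (k - 1) * t_latency


def relative_reduction(c):
    """Fraction of communication time saved in classes 2 and 3."""
    return 1 - 1 / c


def t_diff_class23(k, t_latency, dt_p):
    """Eq. (10): time saved against blocking execution in classes 2 and 3."""
    c = t_latency / dt_p
    return relative_reduction(c) * k * t_latency
```

As a cross-check, for k = 20, a latency of 100 and a prefetch time of 10, equation (10) gives 1800, which equals the blocking time of 2000 from (1) minus the overlapped time of 200 from (8).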
After presenting SCAP and its theoretical runtime improvements, the next
section deals with the architecture of the Cray-T3E and covers its
classification in the above mentioned classes.
3 The Cray-T3E
3.1 Architectural Overview
The T3E consists of up to 2048 DEC Alpha 21164 (EV5) processors running at
300 MHz. They are connected by a 3D-torus network. The network is decoupled
from the processors, runs at 75 MHz [8], and supports overlapped
communication. Each link has a bandwidth of approximately 500 MB/s,
resulting in a 3 GB/s transfer rate for a single node.
The network interface consists of 512 user and 128 system E-registers,
memory mapped in the address space of each processor. They are the only way
to perform data transfer between distinct nodes in the network. Reads and
writes between E-registers and global memory are called gets and puts. To
load a global memory content into the processor, a get and a subsequent read
of the E-register have to be executed. The latter operation stalls the
processor until the value arrives. This is achieved in hardware according to
the implicit state of the E-register. On a put, the local memory of a remote
node is modified and the cache is updated [6]. Hence, the T3E implements a
global address space with locally consistent memory.
3.2 Characteristic Parameters
The model parameters of the T3E from table 1 are given below.
$\Delta t_p$ = 160 ns
$\Delta t_c$ = 107 ns
$\Delta t_n$ = 13.3 ns
$C_p$ = 480 entries
$C_n$ = 56 entries
We derived the first two from measurements. The last three were taken from
literature [8]. With these parameters, we got $T_{Latency}$ = 1489.6 ns,
which differs only 0.5% from measurement.
For the classification of the T3E, $\Delta t_n < \Delta t_p$ holds for all
applications. That $\Delta t_c < \Delta t_p$, in contrast to the model, adds
only some waiting times but does not affect the classification.
Consequently, the Cray-T3E falls into the first three classes of table 2.
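To make the classification concrete, the parameters of this section can be fed into the model. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the reduction it computes is the model's estimate of communication time in classes 2 and 3 according to equation (10), not a measured result.

```python
# Parameters from section 3.2 (times in nanoseconds).
T3E = {
    "dt_p": 160.0, "dt_c": 107.0, "dt_n": 13.3,
    "c_p": 480, "c_n": 56, "t_latency": 1489.6,
}


def t3e_class(k, p=T3E):
    """Class 1, 2 or 3 of table 2 for k non-local accesses."""
    if not p["dt_n"] < p["dt_p"]:
        raise ValueError("first-column condition dt_n < dt_p does not hold")
    if k <= p["c_n"]:
        return 1
    if k <= p["c_p"]:
        return 2
    return 3


def latency_ratio(p=T3E):
    """c = T_Latency / dt_p as used in eq. (10)."""
    return p["t_latency"] / p["dt_p"]


def modelled_reduction(p=T3E):
    """Relative reduction of communication time, 1 - 1/c."""
    return 1.0 - 1.0 / latency_ratio(p)
```

With the values above, c is about 9.3, so the model suggests that overlapping removes roughly 89% of the communication time once the number of requests exceeds the network capacity.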
4 Benchmarks and their Implementations
The classification of the parallel algorithms used is based on the ratio of
communication to computation time and on the communication pattern. Table 3
gives an overview.
                          $T_{Communication} < T_{Computation}$    $T_{Communication} \geq T_{Computation}$
Reduction                 LL3 (32)                                 LL5 (32)
Indexed Arrays            LL1 (32)                                 Data-transfer (2), Rotate (32)
Indirect Indexed Arrays   -                                        LL13 (64)
2D-grid                   Jacobi (64)                              -
Table 3: Parallel algorithm classes.
The columns present the different amounts of communication while the rows
focus on the communication pattern. The numbers in brackets give the number
of PEs which took part in the calculation. We chose the algorithms from
table 3 because they are simulated and discussed in detail in [10]. These
algorithms are representatives of different algorithm classes and show the
principal SCAP behavior of their class.
LL1, LL3, LL5 and LL13 are parallelized versions of the corresponding
Livermore loops. Jacobi implements a 2D-grid nearest-neighbor calculation.
Rotate is a cyclic shift of an integer array and Data-transfer copies memory
blocks from one node to an adjacent node.
Where possible, we parallelized the algorithms in five different ways:
Blocking: There is only one communication request at a time. As one
E-register acts on one data element, there is only one E-register in use.
SCAP: Up to 128 communication requests are used in parallel. If there are
more than 128 non-local memory accesses, SCAP changes to a three-step loop
execution.
Vector (-SCAP): 8 E-registers can be combined into a vector. Whenever
message aggregation is possible, vector communication is preferred. There
are up to 64 vectors (=256 E-registers) used in parallel to obtain a maximum
sustained data transfer.
Shared memory library (Shmem): All communication is done with the Cray
standard shared memory library functions.
HPF: As a comparison to an existing data parallel compiler, the executables
of the Portland Group HPF compiler version 2.2 [9] are considered. This
seemed interesting to us as SCAP is a constructive transformation of
parallel algorithms and it is going to be integrated into a parallelizing
compiler.
For a detailed description of the Livermore Loops and their parallelization
see [1, 10]. The first four of the above versions are coded by hand in C and
compiled with the -O3 command line option. The options for the HPF
compilation are -Msmp and -O2. Most of the HPF versions are instrumented
with the HPF directive !HPF$ independent, on home(...) which results in a
parallelization of the corresponding do- or forall-loop. We iterated each
test one million times and measured runtime with the Unix clock function.
5 Results
The runtimes of the different versions of each benchmark are compared to
blocking execution. We also show the ratio of the approximated and measured
runtimes of SCAP and blocking execution. Approximation was done with respect
to our model. The discussion of each benchmark includes three plots. The
first one shows the runtimes, the second one presents the relative
performance compared to blocking execution, and the last one shows the ratio
of approximation and measurement. In the speed-up plot, a number less than
one indicates a slow-down.
5.1 Data-Transfer
The different versions of Data-transfer behave as expected (see figure 5).
SCAP performs five times better than Blocking. Vector and Shmem achieve
relative speed-ups of 37 and 62 compared to Blocking. SCAP improves
performance solely by overlapping communication requests. The advantage of
Shmem over Vector seems to be due to the heavily optimized shared memory
library of Cray (which we were not able to reproduce).
[Figure 5: Benchmark: Data-transfer. Two panels over packet size: time in µs
and speed-up, for Shmem, Vector, SCAP and Blocking. Packet sizes are in
units of integers (8 Byte).]
The model approximates the runtime of Blocking within 0.5% (see figure 6).
The approximation of SCAP lies within 10% of the actual runtimes. The kink
at packet size 128 shows the change from a two-loop to a three-loop
execution of SCAP. The good approximation of Blocking is due to the small
amount of computation and the simple communication structure in this
benchmark.
[Figure 6: Approximation of Data-transfer. Ratio of predicted to measured
runtime over packet size, for Blocking and SCAP.]
[Figure 7: Benchmark: Rotate. Two panels over virtualization: time in µs and
speed-up, for Shmem, Vector, SCAP, HPF and Blocking.]
5.2 Rotate
Figure 7 shows the different versions of Rotate. The maximum speed-ups
achieved by Shmem, Vector, SCAP and HPF relative to Blocking are 16.6, 12.9,
3.1 and 1.0, respectively. As in Data-transfer, communication increases with
virtualization because the arrays are distributed cyclically. Therefore, the
high speed-up numbers are expected. The reason for the kink in the speed-up
of Shmem and Vector at virtualization 4096 is not yet known. It seems to be
hardware related because two different implementations are affected.
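Why a cyclic distribution makes communication grow with virtualization can be seen by counting remote accesses. The sketch below rests on assumptions that are not spelled out in the text: the rotation shifts by exactly one element, element i of a cyclically distributed array lives on PE i mod P, and the block distribution is included purely as a contrast. The function name is hypothetical.

```python
def remote_accesses_rotate(n, owner):
    """Count elements whose source in a cyclic shift by one lives on another PE.

    n      total number of array elements
    owner  function mapping an element index to the PE that stores it
    """
    return sum(1 for i in range(n) if owner(i) != owner((i + 1) % n))


P = 32              # number of PEs used for Rotate in table 3
V = 4               # virtualization (elements per PE), chosen for illustration
N = P * V

cyclic = remote_accesses_rotate(N, lambda i: i % P)   # element i on PE i mod P
block = remote_accesses_rotate(N, lambda i: i // V)   # contiguous blocks of V
```

Under these assumptions every one of the N accesses is remote in the cyclic case, so each PE performs V remote accesses, whereas a block distribution would need only one remote access per PE.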
Both approximations (see figure 8) lie within 10% of the measured runtimes.
This is not surprising, as the communication structure of Rotate is simple
(communication with only one PE) and the proportion of computation is still
very small.
[Figure 8: Approximation of Rotate. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
5.3 Jacobi
[Figure 9: Benchmark: Jacobi. Two panels over virtualization: time in µs and
speed-up, for Shmem, SCAP, Vector, HPF and Blocking.]
Figure 9 presents runtime and program speed-up of Jacobi. For large problem
sizes all versions have nearly the same runtimes because computation
increases with the square of virtualization whereas communication increases
linearly. Shmem, Vector, SCAP and HPF perform better than Blocking (by
factors of 2.8, 2.3, 2.8, and 1.4, respectively) only for small
virtualizations. Later, computation becomes the dominant factor and the
advantage of better communication primitives decreases. In contrast to
Data-transfer and Rotate, Jacobi has a 2D-grid neighborhood communication,
which behaves rather differently from left-right communication. The
slow-down of Shmem and Vector is due to communication because they differ
from SCAP in the way communication is done.
Compared to the previous benchmarks, the approximation is not as close (see
figure 10). The error exceeds 20% for a virtualization of 1. For larger
virtualizations, the estimate for Blocking lies within 8% and that for SCAP
swings from 6% to -9% of the measured time.
[Figure 10: Approximation of Jacobi. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
5.4 LL1
[Figure 11: Benchmark: LL1. Two panels over virtualization: time in µs and
speed-up, for Shmem, Vector, SCAP, Blocking and HPF.]
The runtime of LL1 (see figure 11) reflects its at most 11 non-local memory
references, which are reached at a virtualization of 11. At this point,
Shmem, Vector and SCAP perform 4.3, 3.4, and 1.9 times better than Blocking.
Later, they show the same behavior as Blocking, whereas HPF degrades to a
slow-down by a factor of 5. The relative performances are as expected
because SCAP gets its best relative speed-up at virtualizations with the
highest proportion of communication.
Apart from small virtualizations (up to 4), the approximations of SCAP and
Blocking are close to the measurements (within 4% for both versions) due to
the constant amount of communication and its simple structure (see figure
12).
[Figure 12: Approximation of LL1. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
5.5 LL3
LL3 differs from LL1 in that its amount of communication is constant for all
virtualizations, because it is a reduction over the PEs which took part in
the computation.
[Figure 13: Benchmark: LL3. Two panels over virtualization: time in µs and
speed-up, for Vector, Shmem, SCAP, Blocking and HPF.]
Hence, the different runtimes and relative performances (see figure 13) for
small virtualizations are not surprising. Vector, Shmem, and SCAP perform
2.2, 1.5, and 1.2 times better than Blocking (HPF is 35% slower). Like LL1,
LL3 is later dominated by computation, which explains the identical
performance results. The lead of Vector over Shmem is due to the higher
fan-in of 8 (2 for Shmem) of the reduction. Therefore, local results are
fetched with one vector operation. SCAP acts with the same fan-in as Vector.
Approximation of Blocking is done with a deviation of at most 5.5% (see
figure 14). The deviation for SCAP is slightly larger (up to 9%). So far, we
cannot explain the stepping in the SCAP ratio plot. As the number of PEs is
not varied and communication stays constant, the estimation of SCAP works
well.
[Figure 14: Approximation of LL3. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
5.6 LL5
[Figure 15: Benchmark: LL5. Two panels over virtualization: time in µs and
speed-up, for Shmem, Vector, SCAP, HPF and Blocking.]
The behavior of the different versions of LL5 is shown in figure 15. Shmem,
Vector, SCAP, and HPF perform 13.0, 9.6, 2.6, and 2.2 times better than
Blocking. In contrast to LL3, LL5 involves each local element in the
reduction. Hence, communication depends on problem size and therefore SCAP
performs better than Blocking for larger virtualizations.
Runtime approximation of Blocking is not worse than 15% (see figure 16). The
results for SCAP are encouraging as they are within 10% of the measurements
despite the high amount of communication.
[Figure 16: Approximation of LL5. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
5.7 LL13
[Figure 17: Benchmark: LL13. Two panels over virtualization: time in µs and
speed-up, for HPF, SCAP and Blocking.]
LL13 is the only benchmark with an irregular communication pattern.
Therefore, Vector and Shmem were omitted. HPF and SCAP perform 5.7 and 3.2
times better than Blocking, respectively. The advantage of HPF for large
virtualizations is due to the inspector-executor scheme [3], which
recognizes local array elements and fetches them effectively. SCAP has no
runtime check and gets all elements with E-registers, which is time
consuming for large virtualizations. For the future, it is interesting to
enhance SCAP with a small runtime check for local array elements and to
compare this version to HPF. The disadvantage of inspector-executor is its
large overhead, which degrades runtime for small virtualizations (about 9
times slower than SCAP).
Blocking is estimated within a range of 10% except for virtualizations 1 and
2 (see figure 18). For virtualizations between 4 and 64, the SCAP
approximation is not worse than 10%. The error for larger ones is explained
by the maximum speed-up of SCAP at a virtualization of 256, which could not
be detected by the model and seems to be hardware related.
[Figure 18: Approximation of LL13. Ratio of predicted to measured runtime
over virtualization, for Blocking and SCAP.]
6 Conclusions and Future work
This paper introduced SCAP as a constructive transformation rule to decrease
communication costs in data parallel applications. This is done by
overlapping both communication and computation with parallel communication
requests. Our work differs from other prefetching papers in that our target
architecture implements a global address space and data distribution is the
responsibility of the programmer and not the system. As we know the data
distribution, we can prefetch effectively and data cannot be invalidated by
others. This decreases network utilization as there is no additional network
traffic caused by, e.g., the virtual address space.
Our model achieved runtime approximations on the T3E in a range of 10% (with
some explained exceptions). We presented a transformation rule which is easy
to implement and which performs better than HPF in six of our seven
benchmarks. As we target data parallel programs, SCAP is a technique to
improve the performance of HPF. It is true that shared memory programs are
very fast compared to platform independent implementations like HPF, but
SCAP and its simple communication mechanism are an example of both platform
independent modeling and good performance.
As a plan for the near future, SCAP is going to be implemented in an HPF or
HPF-subset compiler to show the possibilities of an automatic transformation
compared to hand-written code and with respect to existing compilers. This
work will address a runtime check for local elements to improve runtime for
irregular communication patterns, an extension for vector prefetching and an
advanced model for runtime approximation.
References
[1] J. T. Feo. An analysis of the computational and parallel complexity of
the Livermore loops. Parallel Computing, 7(2):163-185, June 1988.
[2] High Performance Fortran Forum. High Performance Fortran Language
Specification 1.1, November 1994.
[3] Charles Koelbel and Piyush Mehrotra. Supporting shared data structures
on distributed memory architectures. In Proc. of the 2nd ACM SIGPLAN Symp.
on Principles and Practice of Parallel Programming, PPOPP, pages 177-186,
March 1990.
[4] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele
Jr., and Mary E. Zosel. The High Performance Fortran Handbook. MIT Press,
1994.
[5] Michael Metcalf and John Ker Reid. Fortran 90 explained. Oxford science
publications. Oxford University Press, Walton Street, Oxford OX2 6DP, UK,
reprinted with corrections edition, 1994.
[6] Wilfried Oed. Massiv-paralleles Prozessorsystem CRAY T3E. Technische
Dokumentation, Cray Research GmbH, Riesstraße 25, 80992 München, 1996.
[7] Steven L. Scott. Synchronization and communication in the T3E
multiprocessor. ACM SIGPLAN Notices, 31(9):26-36, September 1996.
[8] Steven L. Scott and Gregory M. Thorson. The Cray T3E network: Adaptive
routing in a high performance 3D torus. HOT Interconnects IV, August 15-16
1996.
[9] The Portland Group, Inc. pghpf: User's Guide. 9150 SW Pioneer Court,
Suite H, Wilsonville, Oregon 97070, February 1997.
[10] Thomas M. Warschko. Effiziente Kommunikation in
Parallelrechnerarchitekturen. PhD thesis, Institut für Programmstrukturen
und Datenorganisation, Fakultät für Informatik, Universität Karlsruhe, Am
Fasanengarten 5, 76128 Karlsruhe, 1997. To appear.
[11] Thomas M. Warschko, Christian G. Herter, and Walter F. Tichy. Latency
hiding in parallel systems: A quantitative approach. Interner Bericht 10/94,
Universität Karlsruhe, Fakultät für Informatik, März 1994.
A E-register programming
The appendix covers some details needed for programming the E-registers. The
information about the hardware centrifuge and address translation is taken
from [7]. All other features are documented in the corresponding C headers.
A.1 The hardware centrifuge
An E-register command operates on five single E-registers. These are
partitioned into an aligned Mask-Offset (MO) block, which contains four
E-registers, and one further E-register (see figure 19).
[Figure 19: Five E-registers forming a command. The MO-block holds Mask,
Base, Stride and Addend; the fifth register holds the Index (PE & VA). Each
register spans bits 63 to 0.]
Each E-register is 64 bits long. The Index contains at bit 56 the number of
the additional MO-block. This number is read and extracted. The remaining
bits form a merged PE and virtual-address index. The Mask of the aligned
block indicates those bits in the Index that form the PE number. The masked
bits are extracted and form the virtual PE number. Then the Base is added to
the Index, resulting in the final virtual address. In the case of a
non-local memory access, the command is issued to the network according to
the virtual PE number. Typically, the Base and Mask of an MO-block are set
up once for each shared distributed array and then the Index is varied. The
Stride is necessary for vector gets and puts. After each E-register command,
it is added to the Index to get a new mixed PE number and virtual address.
The Addend is used for fetch-and-add operations.
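The bit extraction performed with the Mask can be illustrated in software. The sketch below is not a description of the actual hardware: it assumes that the bits selected by the Mask are packed together to form the PE number, that the remaining bits are packed together to form the offset to which the Base is added, and that only the low 56 bits of the Index take part. The function names are hypothetical.

```python
def centrifuge(index, mask, width=56):
    """Split index into (pe, offset).

    Bits of index at positions where mask is 1 are packed into pe,
    all other bits are packed into offset (assumed behavior).
    """
    pe = 0
    offset = 0
    pe_pos = 0
    off_pos = 0
    for bit in range(width):
        value = (index >> bit) & 1
        if (mask >> bit) & 1:
            pe |= value << pe_pos
            pe_pos += 1
        else:
            offset |= value << off_pos
            off_pos += 1
    return pe, offset


def global_address(index, mask, base):
    """Return (virtual PE number, virtual address on that PE)."""
    pe, offset = centrifuge(index, mask)
    return pe, base + offset
```

As an example, take 8-byte elements distributed cyclically over 4 PEs, which corresponds to a mask selecting bits 3 and 4 of the byte index. Element 5 has byte index 40 and is mapped to PE 1 with a local offset of 8 bytes, that is, the second element stored on that PE.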
A.2 Programming
An E-register command is performed as follows:
1. Mask and Base are written to an MO-block. The number of the Mask register
has to be a multiple of 4.
2. The command itself is separated into the actual command and the Index
containing the address. The command as a whole is issued as an assignment.
The actual command is the left-hand side and the address is the right-hand
side, e.g. in C:
*PUT(E-register-number) = Index
As a result, a write to a non-local memory location is a store to I/O space,
which is expensive but has less processor overhead than a traditional
message-passing system.
E-registers are able to perform the following operations:
Put: To perform a put to a global (local or non-local) memory location, the
processor must first store the value to a specified E-register. This
E-register is named in the corresponding E-register command.
Get: It is formed in the same way as a put. The distinction lies in reading
the desired value. On a load from the specified E-register, the processor is
stalled if the E-register has not yet been filled by the network. Stalling
is controlled by the above mentioned state bits and lasts until the value
arrives. Afterwards the processor continues.
Vector operations: E-registers can be combined into a vector. Thus, one
vector command serves eight E-registers. This decreases processor overhead
and increases network throughput. The stride for gets and puts is given by
Stride in the aligned block. For fast access to the vector values, they
should be written first to local memory and then loaded into the processor,
because cacheable memory loads can be performed at roughly twice the
bandwidth of E-register loads.
Atomic Memory Operations (AMOs): The most interesting of these are the
fetch-and-increment and fetch-and-add AMOs. They add a specified value to
the target memory content and return the original value. Addend supplies the
value for fetch-and-add.
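The semantics of the two AMOs named above can be stated in a few lines. This is a sequential sketch with hypothetical function names; it models only the returned value and the memory update, not the atomicity that the hardware provides.

```python
def fetch_and_add(memory, address, addend):
    """Add addend to memory[address] and return the original value."""
    original = memory[address]
    memory[address] = original + addend
    return original


def fetch_and_increment(memory, address):
    """Special case of fetch-and-add with an addend of one."""
    return fetch_and_add(memory, address, 1)
```

A typical use is a shared counter: each caller receives a distinct previous value, for example as the index of the next work item.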
