Compiling Applications with the KarHPFn Compiler
Matthias M. Müller
Institute for Program Structures and Data Organization
Universitat Karlsruhe Germany
email muellermiraukade
Technical Report No 			


April  			
Abstract
This paper compares the prefetching technique VSCAP (software controlled access pipelining with vector commands) with the Cray T3E's highly optimized shared-memory functions (SHMEM) and with Portland Group HPF [BMN] (PGHPF) on three application benchmarks, namely PDE, FIRE, and Veltran. Previous work showed the good performance of VSCAP for single communication kernels. This paper examines VSCAP and the practicability of KarHPFn, our prototype HPF compiler, in the context of whole applications. The results show that VSCAP generated by KarHPFn reduces the communication overhead of fine-grained data parallel applications to a minimum. This leads to a performance gain compared to PGHPF between a factor of  for FIRE and a factor of  for PDE. VSCAP programs are nearly as fast as SHMEM for regular communication patterns but  times faster than SHMEM in the case of dynamic communication patterns. All results were measured on  processors.
1 Introduction
As microprocessors get faster and the gap between computation and communication speeds widens, network latency becomes the dominant factor in the execution time of fine-grained parallel programs. Instead of a single communication operation, a processor could perform hundreds to thousands of arithmetic operations. This situation is even worse by an order of magnitude once the software overhead of communication libraries is taken into account.
Software controlled access pipelining with vector commands VSCAP overcomes these shortcom
mings by means of prefetching The potential of VSCAP and its predecessor SCAP has been
demonstrated by analytic models simulations and hand coded benchmarks War	
 MWT	 In
Mula we introduced KarHPFn an optimizing HPF compiler which transforms dataparallel
programs to programs using VSCAP for the communication The benchmark set consisted of small
communication kernels which emphasized the impact of fast communication The results showed
that VSCAP is as fast as TEs sharedmemory functions for regular communication patterns
But VSCAP programs are up to  times faster in the case of dynamic communication patterns
KarHPFn compiled programs are at least  times faster than programs compiled by the Portland
Group HPF compiler
Now this paper extends the above comparison to applications which are not characterized by a
single communication pattern This comparison shows also the practicability of KarHPFn The
benchmark set consists of applications from geophysics Veltran and uid dynamics FIRE A
solver for partial dierential equations PDE is also presented The tests were done on a Cray
TE using up to  processors
 
Karpfen without h is the german word for carp

Prefetching is not new. Previous research addresses it in the context of prefetching cache lines, non-blocking loads, scheduling techniques, and speculative execution on uniprocessors or small-scale cache-coherent multiprocessors [CB, RL, MLG, CKP, GGV]. Prefetching is also used by software distributed shared-memory systems to prefetch whole memory pages [LCD, BPA].
But little is known about the effects of latency hiding applied to communication networks in massively parallel computers with distributed memory. To our knowledge, KarHPFn is the first work targeting compiler-directed prefetching for these architectures.
The next section describes the basic principle and the transformation rules of VSCAP, which are incorporated in our KarHPFn compiler. Section 4 discusses the benchmark set. After that, the test environment is described. A presentation of our results concludes this paper.
2 VSCAP
In fine-grained parallel applications, as in most other parallel applications, latency prevents fast access to non-local memory. Software controlled access pipelining with vector commands (VSCAP) targets latency hiding through overlapping of computation and communication by splitting a non-local memory access into a low-overhead prefetch and an access. This section explains VSCAP by showing the basic concepts of SCAP, the predecessor of VSCAP without vector commands. The concepts of SCAP can easily be extended to VSCAP with additional vector commands for prefetch and access. A detailed explanation of SCAP and VSCAP can be found in [War] and [Mulb], respectively.
2.1 Basic Idea of SCAP and its extension to VSCAP
The aim of SCAP is a runtime improvement achieved by overlapping several communication requests, leading to a communication pipeline in fine-grained data parallel applications. For a better understanding of the basic principle of SCAP, we first explain how communication is usually done.
The processor issues a request to the network (downwards arrow in Figure 1) and waits until the network replies (upwards arrow). Only then does the processor continue its execution and issue a new request. This is repeated as long as the processor requires remote data elements to perform its local part of the computation. As the processor blocks after each data request, we call this kind of communication the blocking mode (see the upper half of Figure 1).
Now let us assume the processor could issue all its communication requests and the network were able to process them in an overlapped fashion. This would lead to a shorter waiting period for the processor accessing the first and all successive remote data elements. As a result, communication would be performed faster than in the blocking execution described above. We call this kind of communication overlapping communication (see the lower half of Figure 1).
To enable overlapping communication, the network interface has to provide a prefetch buffer that decouples the processor from the network execution (see Figure 2).
The second task of the prefetch buffer is to synchronize the processor with the network execution. The synchronization becomes necessary if the processor wants to access a data element which has not been delivered by the network yet. In this case the processor is stalled until the value arrives. The waiting time in the lower part of Figure 1 denotes a processor stall. Figure 2 shows an overlapped execution of the processor and the network without processor stalls.
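The difference between the two modes can be made concrete with a small timing model. The following Python sketch is only an illustration under simplifying assumptions that are not taken from the measurements in this paper: every request has the same latency, issuing a request costs a fixed issue_cost, the network serves requests at least as fast as they are issued, and the cost of reading the prefetch buffer is ignored. The function names and the time units are hypothetical.

```python
def blocking_time(requests, latency, issue_cost):
    """Blocking mode: each request is issued and fully waited for
    before the next request is started."""
    return requests * (issue_cost + latency)


def overlapped_time(requests, latency, issue_cost):
    """Overlapping mode: all requests are issued back to back, so only
    the latency of the last issued request remains visible at the end."""
    if requests == 0:
        return 0
    return requests * issue_cost + latency


if __name__ == "__main__":
    print(blocking_time(100, 100, 1))    # 10100
    print(overlapped_time(100, 100, 1))  # 200
```

With 100 requests, a latency of 100 time units, and an issue cost of 1, the model yields 10100 units in blocking mode and 200 units with overlapping communication.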
VSCAP extends SCAP by means of vector commands for prefetch and access. Instead of issuing a communication request for each non-local data element, the processor prefetches and accesses L data elements at once, where L is the vector length of the vector commands. VSCAP's vector commands reduce the prefetch and access overhead of SCAP and improve communication time further. But the use of vector commands presupposes regular communication patterns with equidistant offsets between successive data accesses. Vector strategies describe the possible combinations of vector and single-element access.

Figure 1: Basic idea of SCAP. Timelines of processor and network in blocking mode (upper half) and in overlapping mode (lower half), showing the wait and latency periods.
2.1.1 Vector strategies
Vector commands for prefetching are only useful if the displacements of the elements are equidistant and known at compile time. Otherwise, if element locality can be computed only at run time, as in dynamic communication patterns, or if the distance between elements varies on a per-element basis, single-element prefetch instructions are used as a full block strategy. For this reason we introduce the notion of a vector strategy, which declares the usage of vector operations.
(p, a) vector strategy: A (p, a) vector strategy declares the usage of vector operations for prefetch (p) and access (a) operations. Assuming vector lengths p, a ∈ {1, L}, there are four possible vector strategies:
Figure 2: Usage of the prefetch buffer. The prefetch buffer sits between network execution and processor execution.

Vector strategy   Explanation
(1,1)             Element-wise prefetch and access operations
(1,L)             Element operations for prefetch, but vector access
(L,1)             Vector prefetch, but element-wise access operation
(L,L)             Vector operations both for prefetch and access
The four vector strategies are characterized as follows:
(1,1) vector strategy: This is SCAP. It is used for communication patterns that do not allow any vector commands. They are characterized by varying element distances for prefetch and access, e.g. in masked assignments or in arbitrary block-cyclic distributions.
(1,L) vector strategy: Element-wise prefetching is done in dynamic communication patterns, e.g. in indirectly indexed array accesses like the one shown in Section 2.2.
(L,1) vector strategy: This strategy uses vector prefetch and element-wise access operations, as in scatter operations. As message-passing architectures already utilize remote write accesses, this vector strategy is not investigated further.
(L,L) vector strategy: Vector operations can be used both for prefetch and for access if the location of the data elements can be determined at compile time, e.g. in affine communication patterns B(a * I + b), where a and b are variables that do not change their values during the execution of the communication.
This paper shows performance results of the (L,L) and (1,1) vector strategies. Results of the (1,L) vector strategy can be found in [Mula].
The (L,1) vector strategy is not discussed further for the above reason. The remaining three strategies are selected automatically during compilation. Their usage depends on the communication pattern, the data distribution, and the iteration space of the data-parallel forall. Section 3.2 describes the selection of the vector strategy according to this 3-dimensional parameter space.
2.2 Transformation rules
This section describes the techniques used in the transformation from a data-parallel forall statement to VSCAP. Section 3 describes the implementation of these techniques within the KarHPFn compiler.
The transformations are illustrated using the following simple forall statement:
FORALL i = 1 TO N
  A(i) = B(q(i))
END
The program fragment updates array A in parallel, indexing array B with array q. A parallelizing compiler maps the problem size N onto P real processors, where N is usually larger than P. This technique is called virtualization. Assuming that P divides N, each processor emulates V = N/P virtual processors within a virtualization loop. Both A and B are distributed over the P processors using the owner-computes rule. Since the value of q(i) cannot be determined at compile time, the compiler has to insert remote memory accesses. Given the blocking execution mode, the virtualization of the program fragment is as follows:
FORALL j = 1 TO P                       ! For all processors in parallel
  FOR k = (j-1)*V + 1 TO j*V            ! Simulate V virtual processors
    adr = calculate_address(B(q(k)))    ! Calculate remote address
    A(k) = remote_read(adr)             ! Read remote data element
  END
END

In the worst case every processor issues V nonlocal memory accesses These stall the processor
if the network can not serve the desired values fast enough Hence execution time of this loop is
at least V times the network latency
The following transformation of the loop shows how communication and computation can be over
lapped
FORALL j = 1 TO P
  FOR k = (j-1)*V + 1 TO j*V            ! Prefetch loop
    adr = calculate_address(B(q(k)))    ! Calculate remote address
    prefetch(adr)                       ! Start read request
  END
  FOR k = (j-1)*V + 1 TO j*V            ! Access loop
    adr = calculate_address(B(q(k)))
    A(k) = access(adr)                  ! Access data element
  END
END
In this transformation the main loop is split into two instances a prefetch and an access or
calculation loop Instead of stalling on a remote memory access as in blocking mode the processor
issues remote memory prefetch requests After the prefetch loop is executed the calculation loop
accesses nonlocal memory without waiting time if the data is already present in the prefetch
buer This is the code for SCAP or the  vector strategy In the best case program speedup
is about V   times the network latency because there is at most one waiting period arrival of
rst data item compared to V waiting times in a blocking network
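The loop splitting can be mimicked in a few lines of Python. The sketch below is a functional illustration only, not generated KarHPFn code: a deque stands in for the prefetch buffer, ordinary list indexing stands in for remote reads, and values are assumed to arrive in request order. The names gather_blocking and gather_scap are hypothetical.

```python
from collections import deque


def gather_blocking(B, q):
    """Blocking mode: one (simulated) remote read per element, in order."""
    return [B[idx] for idx in q]


def gather_scap(B, q):
    """SCAP: a prefetch loop followed by an access loop.

    The deque plays the role of the prefetch buffer; it is assumed that
    values are delivered in the order in which they were requested."""
    buffer = deque()
    for idx in q:                      # prefetch loop: start all read requests
        buffer.append(B[idx])
    A = []
    for _ in q:                        # access loop: consume values in order
        A.append(buffer.popleft())
    return A


if __name__ == "__main__":
    B = [10, 20, 30, 40]
    q = [3, 0, 2, 2]
    print(gather_scap(B, q))           # [40, 10, 30, 30]
```

Both functions compute the same result; the point of the split is that all requests are outstanding before the first value is consumed.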
VSCAP improves the above code further with vector access commands. While vector prefetch operations cannot be used due to the dynamic prefetch pattern caused by the index array q, vector accesses are possible because of the regular array access A(k). This leads to a (1,L) vector strategy:
FORALL j = 1 TO P
  FOR k = (j-1)*V + 1 TO j*V            ! Prefetch loop
    adr = calculate_address(B(q(k)))    ! Calculate remote address
    prefetch(adr)                       ! Start read request
  END
  FOR k = (j-1)*V + 1 TO j*V STEP L     ! Vector access loop
    adr = calculate_address(B(q(k)))
    vector_access(adr, A(k))            ! Access L data elements
  END
END
For brevity we assume that L divides V  Otherwise additional elementwise access operations has
to be used to get the remaining V mod L data elements which do not ll a vector of length L
The access loop is blocked with blocksize L Within the loop vector accessadrAk copies L
entries from the prefetch buer starting at address adr to successive memory locations beginning
with Ak If we assume that a vector access lasts as long as an elementwise access operation the
duration of the vector access loop is decreased about the factor L of the vector length
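How the access loop is blocked, including the case where L does not divide V, can be sketched as follows. The Python function below is a hypothetical helper that uses 0-based indices; it only lists which accesses would be issued and does not model the prefetch buffer hardware.

```python
def access_schedule(V, L):
    """Return (start, length) pairs covering V elements: vector accesses
    of length L first, then element-wise accesses for the remaining
    V mod L elements."""
    if L < 1:
        raise ValueError("vector length L must be at least 1")
    full = V - V % L                   # elements covered by full vectors
    schedule = [(k, L) for k in range(0, full, L)]
    schedule += [(k, 1) for k in range(full, V)]
    return schedule


if __name__ == "__main__":
    print(access_schedule(10, 4))      # [(0, 4), (4, 4), (8, 1), (9, 1)]
```

For V = 10 and L = 4 the schedule needs four access operations instead of ten, which is where the reduction by roughly a factor of L comes from when V is large.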
If the number of nonlocal memory accesses is too large to t into the prefetch buer VSCAPs
transformation rule uses a three loop execution pattern where the middle loop alternates between
access and prefetch instructions This transformation is not shown for the sake of brevity
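Since the three-loop pattern is not shown, the following Python sketch gives one plausible reading of it. It is an assumption, not the transformation KarHPFn actually emits: the first loop fills the prefetch buffer, the middle loop alternates between one access and one prefetch, and the last loop drains the buffer. The function name, the deque used as buffer, and the in-order delivery of values are all hypothetical.

```python
from collections import deque


def gather_three_loop(B, q, buffer_size):
    """Gather B[q[k]] with at most buffer_size outstanding prefetches."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    V = len(q)
    fill = min(buffer_size, V)
    buffer = deque()
    A = []
    for k in range(fill):              # loop 1: fill the prefetch buffer
        buffer.append(B[q[k]])
    for k in range(fill, V):           # loop 2: access one, prefetch the next
        A.append(buffer.popleft())
        buffer.append(B[q[k]])
    while buffer:                      # loop 3: drain the remaining values
        A.append(buffer.popleft())
    return A


if __name__ == "__main__":
    B = list(range(100, 110))
    q = [9, 0, 5, 5, 2, 7, 1]
    print(gather_three_loop(B, q, 3))  # [109, 100, 105, 105, 102, 107, 101]
```

In this reading the number of outstanding requests never exceeds buffer_size, while the pipeline stays full during the middle loop.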
3 KarHPFn
This section describes the architecture and the compilation steps of the Karlsruher HPF compiler KarHPFn.

3.1 Overview
KarHPFn is a source-to-source compiler that transforms a data-parallel HPF program into an executable Fortran 90 node program that uses only the Cray T3E's E-register operations for communication (see Figure 3). KarHPFn's program transformations concentrate on the forall statement.
Figure 3: Overview of KarHPFn. KarHPFn translates a data-parallel HPF program into an F90 + E-register node program (SPMD), which is then compiled with F90.
KarHPFn is based on the ADAPTOR front end [Bra]. All subsequent analysis and transformation phases use the Cocktail Toolbox [GE] to operate on the abstract syntax tree built by that front end. The dependence and partitioning analysis phases use common techniques to perform their tasks.
3.2 Transformation within KarHPFn
KarHPFn transforms a given source program in four steps. First, the data distribution information is evaluated and the computation is spread among the processors using the owner-computes rule. The second step analyses the subscripts of the arrays involved to determine the communication pattern. The appropriate pattern is selected using the pattern matching technique from [Boz]. In the example of Section 2.2, KarHPFn determines the tuple (i, q(i)). KarHPFn assumes that all data accesses are remote because the value of q cannot be computed at compile time. As each reference to B(q(i)) is treated as a non-local array access, although some elements could be local, we call this speculative prefetch.
In the third step, KarHPFn determines the vector strategy. It uses the subscript information (i, q(i)) and the data distribution, e.g. a block-wise distribution, and looks up a suitable vector strategy in the corresponding table (see Table 1).
Table 1: Selection of vector strategies

Comm. pattern                      Data distribution   Vector strategy
(i, b)*, (i, i + c)†, (i, i + b)   BLOCK               (L,L)
                                   CYCLIC              (L,L)
                                   CYCLIC(k)           (L,L), (1,1)
(i, a * i + b)                     BLOCK               (L,L)
                                   CYCLIC              (1,L)
                                   CYCLIC(k)           (1,L), (1,1)
(i, q(i))‡                         BLOCK               (1,L)
                                   CYCLIC              (1,L)
                                   CYCLIC(k)           (1,L), (1,1)

* a and b are arbitrary variables
† c is a constant
‡ q is a function or an array
The lines for the CYCLICk distribution have an additional   entry for possible vector strate
gies This is caused by the fact that some combinations of iteration space and blocksize k force
a calculation of the local index set at runtime As these index sets do not have equidistants

osets between consecutive data elements each data element has to be prefetched and accessed
separately The switch to a  vector strategy is done automatically
Returning to our example KarHPFn selects with the tuple i qiBLOCK the  Lvector
strategy
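The table lookup of the third step can be pictured as a dictionary access. The following Python sketch is an illustration under stated assumptions, not KarHPFn's implementation: the pattern labels (shift, affine, indirect) are hypothetical names for the three pattern groups of the table, strategies are written as (prefetch, access) pairs with 1 for element-wise and "L" for vector operations, and the flag runtime_index_set models the CYCLIC(k) cases in which the local index set has to be computed at run time.

```python
# Hypothetical labels: "shift" for patterns such as (i, i + c),
# "affine" for (i, a*i + b), "indirect" for (i, q(i)).
STRATEGY_TABLE = {
    ("shift", "BLOCK"): ("L", "L"),
    ("shift", "CYCLIC"): ("L", "L"),
    ("shift", "CYCLIC(k)"): ("L", "L"),
    ("affine", "BLOCK"): ("L", "L"),
    ("affine", "CYCLIC"): (1, "L"),
    ("affine", "CYCLIC(k)"): (1, "L"),
    ("indirect", "BLOCK"): (1, "L"),
    ("indirect", "CYCLIC"): (1, "L"),
    ("indirect", "CYCLIC(k)"): (1, "L"),
}


def select_strategy(pattern, distribution, runtime_index_set=False):
    """Look up the (prefetch, access) vector strategy for a forall."""
    if distribution == "CYCLIC(k)" and runtime_index_set:
        # No equidistant offsets: fall back to element-wise operations.
        return (1, 1)
    return STRATEGY_TABLE[(pattern, distribution)]


if __name__ == "__main__":
    print(select_strategy("indirect", "BLOCK"))  # (1, 'L')
```

The running example, an indirect access on a BLOCK distribution, yields the pair (1, "L"), i.e. element-wise prefetch with vector access.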
The fourth and nal step generates the appropriate pipeline for communication and inserts it into
the nal program
As KarHPFn supports four dierent communication modes BLOCK SCAP VSCAP and
SHMEM

 and three dierent data distributions BLOCK CYCLIC and CYCLICk distribu
tions there are  possible pipelines for the  Lvector strategy Furthermore KarHPFn can
apply two dierent optimizations to this kind of vector strategy The rst one reduces address
calculation overhead as it uses capabilities of the hardware centrifuge of the Cray TE The sec
ond one introduces a software test to determine locality of data accesses This is useful if only
nonlocal dataelements have to be prefetched These two optimizations can be applied to three
of the above four communication modes for all possible datadistributions this leads to additional
	  	   pipelines
Therefore KarHPFn can generate      dierent communication pipelines for the  L
vector strategy which shows the need for a sophisticated software architecture for the pipeline
generation module
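The pipeline count follows from simple enumeration, which the following Python sketch reproduces. The optimization labels are hypothetical, and the text does not say which three of the four modes accept the optimizations, so the choice made below (all but SHMEM) is an assumption that does not affect the count.

```python
from itertools import product

MODES = ["BLOCK", "SCAP", "VSCAP", "SHMEM"]
DISTRIBUTIONS = ["BLOCK", "CYCLIC", "CYCLIC(k)"]
OPTIMIZATIONS = ["centrifuge_addressing", "locality_test"]  # hypothetical labels


def count_pipelines(optimizable_modes=("BLOCK", "SCAP", "VSCAP")):
    """Return (plain, optimized, total) pipeline counts."""
    plain = list(product(MODES, DISTRIBUTIONS))
    optimized = list(product(OPTIMIZATIONS, optimizable_modes, DISTRIBUTIONS))
    return len(plain), len(optimized), len(plain) + len(optimized)


if __name__ == "__main__":
    print(count_pipelines())  # (12, 18, 30)
```

Four modes times three distributions give the plain pipelines; two optimizations times three modes times three distributions give the optimized ones.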
4 Benchmarks
The purpose of our benchmark set is to show the effects of fast communication in the context of an entire application instead of focusing on a single, small part of a program. Execution on more than a hundred processors indicates the performance in a massively parallel environment.
We chose three applications from three different domains: PDE, a solver for partial differential equations; FIRE, a fluid dynamics application; and Veltran, from geophysics. The characteristics of these benchmarks are explained below.
4.1 PDE
PDE is a 3-dimensional Poisson solver using red-black relaxation. The 3-dimensional grid is divided into red and black points. In an iteration, first the red points are calculated using the values of the six adjacent black points. In the second step of the iteration, the black points are determined using the new red values.
PDE splits the entire 3-d grid into smaller cubes that are distributed among the processors. Before computation takes place, each processor has to read the border of its local cube from virtually neighbouring processors. This so-called nearest-neighbour communication pattern results in a linear communication time compared to a cubic computation time. Therefore we expect only a small advantage from fast communication for problem sizes with large local cubes, because calculation time dominates communication. But as the number of processors increases and the virtualization and computation time decrease, the use of fast communication results in a further speed-up.
Due to the regular communication pattern, KarHPFn uses an (L,L) vector strategy for VSCAP. Remote data is read into overlap areas of the local cubes.
4.2 FIRE
FIRE is a fluid dynamics package from AVL List using the method of conjugate gradients on unstructured meshes [BSCG]. The main communication loop in FIRE gathers cells from DIREC indexed through LCC:

sharedmemory library


FORALL J TO 
FORALL NC TO NNINTC
IF LCCMASKJNC THEN
DIRECVJNC  DIRECLCCJNC
END
END
END
FIRE distributes the problem domain in blocks. It has a high proportion of communication to computation, leading to a large communication overhead regardless of the local problem size.
Without the if statement in the above program fragment, KarHPFn would use a (1,L) vector strategy, but predication with LCCMASK forces a (1,1) vector strategy for FIRE, because prefetch and access operations depend on run-time information.
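Why the mask rules out vector commands can be seen in a small functional model. The Python sketch below is a hypothetical, flattened one-dimensional version of the FIRE loop, not the code KarHPFn generates: both the prefetch loop and the access loop have to evaluate the predicate per element, so neither can be blocked into vectors of fixed length. The deque again stands in for the prefetch buffer, and in-order delivery is assumed.

```python
from collections import deque


def masked_gather(direc, lcc, mask, direcv):
    """Element-wise prefetch and access under a run-time mask."""
    buffer = deque()
    for idx, wanted in zip(lcc, mask):   # prefetch loop: test, then prefetch
        if wanted:
            buffer.append(direc[idx])
    for pos, wanted in enumerate(mask):  # access loop: same test, then access
        if wanted:
            direcv[pos] = buffer.popleft()
    return direcv


if __name__ == "__main__":
    direc = [1.0, 2.0, 3.0]
    lcc = [2, 0, 1, 1]
    mask = [True, False, True, False]
    print(masked_gather(direc, lcc, mask, [0.0] * 4))  # [3.0, 0.0, 2.0, 0.0]
```

Only the masked-in positions are requested and written; all other entries of direcv keep their previous values.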
4.3 Veltran
Veltran is an application from geophysics that uses velocity analysis to calculate the consistency of earth layers [JPK]. It uses the method of conjugate gradients.
Communication in Veltran is a scatter operation of local parts of two distributed dimensions of the problem domain. If we use a concurrent-read memory model, the communication reduces to a read of the remote data elements. The read access reduces the communication overhead for every data element from O(log P), as in message-passing architectures, to O(1), where P denotes the number of processors. Nevertheless, an equal amount of communication and computation leads to a high communication overhead for all virtualizations.
Two dimensions of the dimensional global problem are distributed in blocks over processors. KarHPFn uses an (L,L) vector strategy for the scatter operation.
5 Test environment
This section presents a short overview of the Cray T3E and of the different benchmark versions investigated.
 Architecture of Cray TE
The TE 	 used for our measurements consists of up to  DEC Alpha  processors
running at MHz They are connected with a dimensional torus network The net is decoupled
from the processors at a speed of 
 MHz ST	 with overlapped communication Each link has
a bandwidth of approximately  MBs resulting in a  GBs transfer rate for a single node
The network interface consists of  user and  system Eregisters memory mapped into
the IOspace of each processor Eregisters provide the only means to transfer data between
processors Read and write operations between Eregisters and the global memory are called gets
and puts To load a global memory content into the processor a get and a subsequent read of
the Eregister has to be executed The latter operation stalls the processor until the value arrives
This synchronization is implemented in hardware using the readiness state of the Eregister On a
put the memory of a remote node is modied and the cache is updated Oed	 Hence the TE
implements a global address space with locally consistent memory
Eight Eregisters can be combined to a vector Distance between successive vector elements have
to be equidistant to ensure correct address translation Thus the TE enables VSCAP with a
vector length of L  
5.2 Benchmark versions
Benchmarks were compiled to four different versions:
BLOCK simulates blocking communication of the T3E as described in Section 2.1.
VSCAP does prefetch and access with vector operations.  E-register vectors of  E-registers each are used to allow a total number of  outstanding communication requests.  E-registers suffice to hide network latency and to get maximum throughput [Sco].
SHMEM uses Cray's shared-memory system functions for communication. SHMEM delivers maximum communication performance on the T3E for regular communication patterns. SHMEM behaves like BLOCK in dynamic patterns because of the lack of support by the system library. Hence each element has to be read by a separate function call to shmem_get, resulting in blocking performance.
PGHPF represents the executables of the Portland Group HPF compiler. PGHPF is the commercial HPF compiler available for the Cray T3E.
The first three versions were compiled by KarHPFn using different command line options, while PGHPF compiled the fourth one. Both compilers got the same HPF source. Standard optimizations were turned on for both PGHPF and the Fortran 90 compilation step of KarHPFn. Time was measured with the real-time clock (RTC). The problem size was kept constant for each benchmark, but the number of processors was increased from  to  in powers of .
6 Results
The runtimes of the four different versions are given for each benchmark. The discussion of each benchmark includes two plots. The first one shows the runtimes and the second one presents the relative performance compared to BLOCK. Since we focus on the reduction in communication time, the plots do not account for the speed-up compared to a one-processor execution.
6.1 PDE
A global cube with N =  data points formed the base of the measurements of PDE. Figure 4 presents the runtimes and the relative performance of the four different versions of PDE.
Figure 4: Runtimes and speed-up relative to BLOCK of PDE. Left: runtime in ms versus number of PEs for block, vscap, shmem, and pghpf. Right: runtime compared to BLOCK versus number of PEs.
The runtime plot on the left displays three different lines. The upper one denotes the runtime of PGHPF. The solid line in the middle is the execution time of BLOCK, our base model, and the lower line presents the runtime of VSCAP and SHMEM. The plot leads to three observations. First, fast communication improves performance only for smaller local problem sizes, e.g. at execution on more than  processors. This behaviour is caused by the fact that with an increasing number of processors less computation is needed, and therefore the relative advantage of better communication grows. Second, VSCAP is as fast as the highly optimized shared-memory functions, and third, PGHPF is slower than reading one remote data element after the other. The plot on the right-hand side quantifies the last two observations: VSCAP and SHMEM are more than  times faster than BLOCK on  processors, while PGHPF is about  times slower. The comparison of VSCAP with SHMEM and PGHPF shows that VSCAP is only  slower than SHMEM but  times faster than PGHPF on  processors.
6.2 FIRE
FIRE was calculated with  cells. Thus virtualization varied during the measurements from  for  processors to  cells for  processors. Figure 5 shows the execution time and the relative performance of the FIRE versions.
Figure 5: Runtimes and speed-up relative to BLOCK of FIRE. Left: runtime in ms versus number of PEs for block, vscap, shmem, and pghpf. Right: runtime compared to BLOCK versus number of PEs.
The runtime plot shows two dierent groups SHMEM BLOCK and PGHPF belong to the
upper slower group and VSCAP to the fast one The plot shows a constant dierence in runtime
caused by the constant ratio of communication to computation time throughout all virtualizations
The plot on the right displays relative performance compared to BLOCK PGHPF is  faster
SHMEM is  slower than BLOCK Only VSCAP shows with a substantial speed up compared
to BLOCK factor  on  processors VSCAP is also  times faster than SHMEM and 
times faster than PGHPF on  processors
The poor performance of SHMEM is due to the fact that the sharedmemory library does not
support any kind of dynamic communication pattern Though each nonlocal element has to
be read with a seperate call to shmem get resulting in a blocking execution The reason for
the PGHPF behaviour is the usage of inspectorexecutor model that increases communication
overhead by additional execution time for the communication schedule VSCAP is not aected by
this overhead because time for prefetch is limited to predicate evaluation and to issue the prefetch
6.3 Veltran
Measurements of Veltran were done using a global problem size of  points. Figure 6 shows the results.
The runtime plot on the left shows two different lines. The upper line contains BLOCK and PGHPF, while VSCAP and SHMEM belong to the lower one. The constant advantage in runtime reflects the constant ratio of communication to computation. The relative performances are shown in the plot on the right-hand side of Figure 6. PGHPF is almost as fast as BLOCK, while VSCAP and SHMEM are at least  times faster. The relative runtime gain of the latter two grows with the increasing number of processors used in parallel. It reaches its peak at  processors, where VSCAP is  and SHMEM  times faster than BLOCK and PGHPF. VSCAP is only  slower than SHMEM.

Figure 6: Runtimes and speed-up relative to BLOCK of Veltran. Left: runtime in s versus number of PEs for block, vscap, shmem, and pghpf. Right: runtime compared to BLOCK versus number of PEs.
The plot on the right-hand side shows waves in the speed-up of VSCAP and SHMEM. These waves are caused by the 2-dimensional virtual processor grid, which was changed between successive measurements to take the increasing number of processors into account. Contrary to expectation, however, this change did not lead to a constant improvement.
7 Conclusions
This paper examined the performance of VSCAP on three applications (PDE, FIRE, and Veltran) and showed the practicability of our HPF compiler KarHPFn. We compared VSCAP both with the highly optimized shared-memory library and with the Portland Group HPF compiler. The test programs were generated automatically by KarHPFn, operating on the same HPF sources as PGHPF.
On regular communication patterns, KarHPFn's VSCAP is nearly as fast as the system functions . On dynamic communication patterns (FIRE), however, VSCAP is  times faster due to the lack of support for these patterns in the system library.
A comparison of VSCAP with PGHPF shows KarHPFn's strength: a programmer gets a  times faster program (PDE) just by exchanging PGHPF for KarHPFn, without the need for additional knowledge of communication techniques. These results were not limited to small problems, as the large problem sizes and the measurements on  processors show.
Further work concentrates on VSCAP for workstation clusters, broadening the range of suitable architectures. The SCI standard seems to be a promising candidate to achieve this goal. Further questions in this context include the behavior of VSCAP on dynamic communication patterns and the mixture of VSCAP's remote read with remote write access.
References
[BMN] Z. Bozkus, L. Meadows, S. Nakamoto, V. Schuster, and M. Young. PGHPF - an optimizing High Performance Fortran compiler for distributed memory machines. Scientific Programming, Spring.
[Boz] Zeki Bozkus. Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers. PhD thesis, Syracuse University, June.
[BPA] Ricardo Bianchini, Raquel Pinto, and Claudio L. Amorium. Data Prefetching for Software DSMs. In International Conference on Supercomputing, Melbourne, July.
[Bra] Thomas Brandes. Adaptor: A compilation system for data parallel Fortran programs. Technical report, German National Center for Computer Science (GMD), St. Augustin, Germany. ftp://ftp.gmd.de/GMD/adaptor/docs/adaptor.ps
[BSCG] P. Brezany, V. Sipkova, B. Chapman, and R. Greimel. Automatic parallelization of the AVL FIRE benchmark for a distributed-memory system. Lecture Notes in Computer Science.
[CB] Tien-Fu Chen and Jean-Loup Baer. Reducing memory latency via non-blocking and prefetching caches. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, October. Also available as U. Washington CS TR.
[CKP] David Callahan, Ken Kennedy, and Allan Porterfield. Software prefetching. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, April.
[GE] Josef Grosch and Helmut Emmelmann. A tool box for compiler construction. In Dieter Hammer, editor, Compiler Compilers: Third International Workshop on Compiler Construction, Lecture Notes in Computer Science, Schwerin, Germany, October. Springer.
[GGV] Edward H. Gornish, Elana D. Granston, and Alexander V. Veidenbaum. Compiler-directed data prefetching in multiprocessors with memory hierarchies. In Proceedings International Conference on Supercomputing, Amsterdam, June.
[JPK] Matthias Jacob, Michael Philippsen, and Martin Karrenbach. Large-scale parallel geophysical algorithms in Java: a feasibility study. Concurrency: Practice and Experience, September. Special Issue: Java for High-performance Network Computing.
[LCD] Honghui Lu, Alan L. Cox, Sandhya Dwarkadas, Ramakrishnan Rajamony, and Willy Zwaenepoel. Compiler and Software Distributed Shared Memory Support for Irregular Applications. ACM SIGPLAN Notices, July.
[MLG] Todd C. Mowry, Monica S. Lam, and Anoop Gupta. Design and evaluation of a compiler algorithm for prefetching. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, October.
[Mula] Matthias M. Müller. KaHPF: Compiler generated Data Prefetching for HPF. In High Performance Computing in Science and Engineering. Springer.
[Mulb] Matthias M. Müller. Latenzzeitverbergung in datenparallelen Sprachen. PhD thesis, School of Computer Science, Universität Karlsruhe, February.
[MWT] Matthias M. Müller, Thomas M. Warschko, and Walter F. Tichy. Prefetching on the Cray-T3E. In International Conference on Supercomputing, Melbourne, July.
[Oed] Wilfried Oed. Massiv-paralleles Prozessorsystem CRAY T3E. Technical report, Cray Research GmbH, München, November.
[RL] Anne Rogers and Kai Li. Software support for speculative loads. In Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Boston, Massachusetts, October.
[Sco] Steven L. Scott. Synchronization and communication in the T3E multiprocessor. ACM SIGPLAN Notices, September.
[ST] Steven L. Scott and Gregory M. Thorson. The Cray T3E network: Adaptive routing in a high performance 3D torus. HOT Interconnects IV, August.
[War] Thomas M. Warschko. Effiziente Kommunikation in Parallelrechnerarchitekturen. PhD thesis, School of Computer Science, Universität Karlsruhe.