Project Triton: towards improved programmability of parallel computers by Philippsen, Michael et al.
 
PROJECT TRITON
TOWARDS IMPROVED
PROGRAMMABILITY OF
PARALLEL COMPUTERS
Michael Philippsen Thomas M Warschko
Walter F Tichy Christian G Herter
Ernst A Heinz and Paul Lukowicz
University of Karlsruhe
Department of Informatics
Germany
ABSTRACT
This paper appeared in David J. Lilja and Peter L. Bird, editors, The
Interaction of Compilation Technology and Computer Architecture,
Kluwer Academic Publishers, Boston, Dordrecht, London.
The main objective of Project Triton is adequate programmability of massively
parallel computers. This goal can be achieved by tightly coupling the design of
programming languages and parallel hardware.
The approach taken in Project Triton is to let high-level, machine-independent
parallel programming languages drive the design of parallel hardware. This
approach permits machine-independent parallel programs to be compiled into
efficient machine code. The main results are as follows:
Modula-2*
This language extends Modula-2 with constructs for expressing a wide range of
parallel algorithms in a portable, problem-oriented, and readable way.
Compilation Techniques
We present techniques for the efficient translation of Modula-2* and similar
imperative languages for several modern parallel machines and derive
recommendations for future parallel architectures.
Triton Parallel Architecture
Triton/1 is a scalable, mixed-mode SIMD/MIMD parallel computer with a highly
efficient communications network. It overcomes several deficiencies of current
parallel hardware and adequately supports high-level parallel languages.
 
 INTRODUCTION
Project Triton focuses on the goal of adequate programmability of parallel
machines. Adequate programmability means that programs can be formulated in
a problem-oriented rather than a machine-oriented fashion and can be compiled
to run efficiently on a wide range of parallel machines.
The development of today's commercially available parallel computers has been
mainly driven by hardware considerations. Although users of the first massively
parallel machines were willing to expend substantial programming effort to
reach acceptable performance, the increasing availability of parallel machines
calls for drastically lower program development costs. Given the rapid evolution
of parallel architectures, few users can afford the luxury of non-portability of
their programs. Besides portability, better programming languages are needed
for problem-oriented formulation of programs with multiple levels of parallelism
and multiple grain sizes of parallel operations.
One can observe that parallel computing is presently repeating some of the
development steps that sequential computing underwent in the past four decades.
Improvements in hardware architecture, operating systems, programming
languages, and compiler technology should eventually render the current practice
of machine-dependent parallel programming as obsolete as machine-dependent
sequential programming.
As in sequential computing, the goal of adequate programmability of parallel
machines can only be achieved by tightly coupling the design of programming
languages, compilers, and parallel hardware. Triton couples the following
subprojects:
Explicitly Parallel Language. We show simple language constructs that (a)
avoid shortcomings of known parallel programming languages, (b) express
parallel programs in a high-level, problem-oriented way, and (c) can be
translated into efficient code for various parallel architectures. We use
Modula-2* as an example language (see the section on Modula-2*); similar
extensions could easily be integrated into other imperative programming
languages.
Optimizing Compilers. We found effective optimization techniques that
improve runtime on parallel machines dramatically. In general, however,
compilation and optimization are severely hampered by current parallel
hardware. Our experience with writing compilers for parallel machines has
led us to formulate several recommendations for future parallel
architectures. Optimization techniques, hardware recommendations, and
performance results are given in the section on optimization techniques
and hardware recommendations.
Parallel Machine Architectures. We explore novel architectural paradigms
of parallel computers by building a massively parallel computer called
Triton/1, which is presented in the section on Triton/1. Triton/1 is a
mixed-mode SIMD/MIMD computer (called SAMD, for synchronous or asynchronous
instruction, multiple data) with a highly efficient communications network.
Triton/1 overcomes various deficiencies of current parallel machines,
supports high-level parallel languages such as Modula-2*, and implements
many of our recommendations.
MODULA-2*
The majority of programming languages for parallel machines, including *LISP,
C*, MPL, Fortran, Fortran D, HPF, Blaze, Dino, and Kali, suffer from some or
all of the following problems:
Manual Virtualization. The programmer must write explicit code for mapping
processes, whose number is problem dependent, onto the available
processors, whose number is fixed. This task is not only tedious and
repetitive but also one that makes programs non-portable.
Manual Data Allocation. Distribution of data over memory modules must
be programmed explicitly for achieving adequate performance. Because of
the tight coupling of data allocation with algorithms and the topology of
interconnects, the programs are difficult to comprehend and not portable.
Manual Communication. Inter-process communication must be implemented
by means of low-level message passing primitives. This results in
difficult code, especially if asynchrony and nondeterministic behavior are
possible.
MIMD/SIMD Exclusiveness. Parallel programming languages are either
synchronous or asynchronous, reflecting whether the target machine is a
SIMD or MIMD architecture. On SIMD machines, programs are restricted
to total synchrony even if that causes poor machine utilization. On MIMD
machines, tightly synchronous execution is expensive. Since the choice is
dictated by the available hardware rather than the actual problem, the
resulting programs are often distorted and not portable.
Modula  preserves the main advantages of data parallelismwhile avoiding the
above drawbacks The new language constructs allow for clear and portable
parallel programs without intolerable loss of eciency Because of the com
pactness and simplicity of the extensions
 they could easily be incorporated
into other imperative programming languages
 such as Fortran
 C
 or Ada
The following list describes the central advantages of our language approach
- The programming model of Modula-2* is a superset of data parallelism.
- The language provides a single address space. Note that shared memory
  is not required; a single address space merely permits all memory to be
  addressed uniformly, but not necessarily at uniform speed. On machines
  without a shared address space, the compiler must insert communication
  instructions for all non-local accesses.
- Synchronous and asynchronous parallel computations as well as arbitrary
  nestings thereof can be formulated in a totally machine-independent way.
- Procedures may be called in any context (sequential and parallel) and
  at any nesting depth. Furthermore, additional parallel processes can be
  created inside procedures (recursive parallelism).
Overview of Language Extensions
Modula extends Modula  with the following two language constructs
 The FORALL statement
 which has a synchronous and an asynchronous
version
 is the only way to introduce parallelism into a Modula  program
  The layout of array data may optionally be specied per dimension by
socalled allocators
 eg
 CYCLE
 SPREAD
 and LOCAL They do not have any
semantic meaning but are merely data layout hints for the compiler
The Modula-2* syntax of the FORALL statement is listed below:

    FORALL ident ":" SimpleType IN (PARALLEL | SYNC)
      [ VarDecl
    BEGIN ]
      StatementSequence
    END
Project Triton Programmability of Parallel Computers 
SimpleType is an enumeration or a possibly non-static subrange, i.e., the
boundary expressions may contain variables. The FORALL creates as many
conceptual processes as there are elements in SimpleType. The identifier
introduced by the FORALL statement is local to it and serves as a runtime
constant for every process created by the FORALL. The runtime constant of each
process is initialized to a unique value of SimpleType. The FORALL statement
provides an optional section for the declaration of variables local to each
process. These local variables lead to better source code structuring, thus
greatly increasing the readability and efficiency of parallel code.
Each process created by a FORALL executes the statements in
StatementSequence. The END of a FORALL statement imposes a synchronization
barrier on the participating processes; termination of the whole FORALL is
delayed until all created processes have finished their execution of
StatementSequence.
The version of the FORALL statement (asynchronous or synchronous)
determines whether the created processes execute StatementSequence concurrently
or in lock-step. Hence, for non-overlapping vectors X, Y, and Z the simple
asynchronous FORALL statement on the lower left suffices to implement the vector
addition X + Y = Z. In contrast, parallel modifications of overlapping data
structures require synchronization provided by synchronous FORALLs. Thus,
irregular data permutations can be implemented easily, as shown on the lower
right. The effect of the second FORALL statement is to permute the vector X
according to the permutation function p. Here, the synchronous semantics
ensure that all right-hand side elements X[p(i)] are read and temporarily stored
before any variable X[i] is written. This behavior stems from the implicit
synchronization barrier introduced between the left and right hand side of any
assignment in a synchronous context.

    FORALL i:[0..N-1] IN PARALLEL        FORALL i:[0..N-1] IN SYNC
      Z[i] := X[i] + Y[i]                  X[i] := X[p(i)]
    END                                  END

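The difference between the two FORALL versions for the permutation example can
be imitated sequentially. The following Python sketch is only an illustration
of the semantics described above, not the Modula-2* implementation; the
function names and the rotation used as permutation function p are assumptions
made for this example.

```python
def forall_sync_permute(x, p):
    """X[i] := X[p(i)] with synchronous semantics: read all, then write all."""
    # Phase 1: every process reads its right-hand side into a temporary.
    temp = [x[p(i)] for i in range(len(x))]
    # The implicit synchronization barrier separates the two phases.
    # Phase 2: every process writes its left-hand side.
    for i in range(len(x)):
        x[i] = temp[i]
    return x


def forall_unsynchronized_permute(x, p):
    """The same assignment executed process by process, without the barrier."""
    for i in range(len(x)):
        x[i] = x[p(i)]  # may read an element that was already overwritten
    return x


def rotate(i):
    """Hypothetical permutation function p: rotate a 4-element vector."""
    return (i + 1) % 4


sync_result = forall_sync_permute([10, 20, 30, 40], rotate)
unsync_result = forall_unsynchronized_permute([10, 20, 30, 40], rotate)
print(sync_result)    # [20, 30, 40, 10]
print(unsync_result)  # [20, 30, 40, 20]
```

Without the barrier, the last process reads X[0] after it has been overwritten,
so the result is no longer a permutation of the input.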
In synchronous branching statements such as IF C THEN SS1 ELSE SS2 END,
the set of participating processes divides into disjoint and independently
operating subsets, each of which executes one of the branches (SS1 and SS2 in
the example) in unison. In contrast to other data-parallel languages, no
assumption about the relative speeds or order of the branches may be made. The
execution of the entire statement terminates when all processes of all subsets
have completed their branches.
The semantics of other synchronous control statements are defined in a similar
fashion. See the language definition for more details.
The synchronous version of our FORALL operates much like the more recent
HPF FORALL, except that ours is fully orthogonal to the rest of the language.
Any statement, including conditionals, loops, other FORALLs, and subroutine
calls, may be placed in its body. Thus, the language explicitly supports nested
and recursive parallelism. An asynchronous FORALL is absent from HPF.
 OPTIMIZATION TECHNIQUES AND
HARDWARE RECOMMENDATIONS
The proposed language features can be translated to existing parallel machines.
We have shown this for SIMD, MIMD, and SISD machines. For SIMD machines,
we have implemented compilers targeting the Connection Machine CM
and the MasPar MP. For MIMD machines, we are currently targeting a
network of workstations and the KSR. Additionally, Modula-2* programs can
be translated for sequential UNIX workstations (SISD), which is appropriate
for program development and teaching.
In this section we present basic translation and optimization techniques. We
derive recommendations for hardware improvement by showing that the
compiler is hindered by certain hardware characteristics. Finally, the
Benchmark Results section provides performance results for the Modula-2*
compiler and for the optimization techniques presented.
 Implementing FORALL Statements
Obviously, the most challenging task in translating Modula-2* for parallel
machines is to generate efficient code for FORALL statements.
Basic Translation Scheme
A straightforward MIMD approach is to implement the processes created by
a FORALL with threads, distributed over the p available processors. In our
implementation, however, no more threads are spawned than there are processors
available. Each of these threads simulates a certain number of processes in
virtualization loops. This choice is necessary because thread switching is much
more expensive than loop control.
Synchronization and termination at the end of a FORALL require a barrier. If
all n threads of a FORALL terminate at about the same time, then the
termination on a MIMD machine without synchronization hardware requires time
proportional to O(log n). In our implementation, synchronization is done on
the processor level instead of the process level, thus reducing the overhead to
O(log p).
The synchronous nature of a SIMD machine, coupled with the broadcast bus
from the front-end control processor, makes process creation, termination,
and load balancing operate in constant time. However, if the number t of
necessary processes for nested FORALLs cannot be evaluated statically during
compilation, an O(log n) summation must be used. Once t is known, t stacks
are created by assigning to each of the p processors a segment of ceil(t/p)
stacks. This takes constant time and balances the load perfectly. Process
termination also takes constant time, since there is no synchronization
overhead. The problem here is to achieve a good load balance in case of
deactivation of processes by cascaded IF statements inside of FORALLs and in
case of nested parallelism.
Because of the frequency of FORALL statements in parallel programs, we need
hardware support for fast process creation, termination, and context switching.
Obviously, the problem of translating FORALL statements into efficient code is
more complex:

                                SIMD    MIMD
    synchronous parallelism              ?
    asynchronous parallelism     ?

The problem of implementing a synchronous FORALL on a MIMD machine,
or an asynchronous FORALL on a SIMD machine, demands novel optimization
techniques and requires hardware improvements. Both are presented in the
next two sections.
Elimination of Synchronization Barriers
We have developed transformation schemes that map synchronous into
equivalent asynchronous FORALL statements with temporary variables. These can
be implemented on MIMD machines directly. Our transformations are more
general than Hatcher's work because they also account for nested parallelism
and do not rely on having explicit communication instructions in the source
program. Consider for example the following synchronous FORALL statement
and its inefficient transformation into asynchronous FORALLs shown below:

    FORALL i:[...N...] IN SYNC
      Z[i] := Z[...];
      X[i] := X[...];
      Y[i] := Y[p(i)]
    END

    FORALL i:[...N...] IN PARALLEL H[i] := Z[...]   END
    FORALL i:[...N...] IN PARALLEL Z[i] := H[i]     END
    FORALL i:[...N...] IN PARALLEL H[i] := X[...]   END
    FORALL i:[...N...] IN PARALLEL X[i] := H[i]     END
    FORALL i:[...N...] IN PARALLEL H[i] := Y[p(i)]  END
    FORALL i:[...N...] IN PARALLEL Y[i] := H[i]     END

Although this transformation yields code that can be run on both SIMD and
MIMD machines, it is not efficient because of the short virtualization loops
that are a problem on SIMD machines, and the large number of synchronization
messages induced on current MIMD hardware.
Our optimization is based on the insight that most of the synchronization
barriers introduced by synchronous language constructs need not be implemented
in real synchronous FORALLs to ensure the prescribed semantics. To detect
redundant synchronization barriers we apply data dependence analysis originally
developed for parallelizing Fortran compilers.
In the above example there are only three data dependences, one per
assignment. After restructuring the program, a single synchronization barrier
which cuts all three dependences at once suffices to ensure correctness:

    FORALL i:[...N...] IN PARALLEL
      H1[i] := Z[...];
      H2[i] := X[...];
      H3[i] := Y[p(i)]
    END
    FORALL i:[...N...] IN PARALLEL
      Z[i] := H1[i];
      X[i] := H2[i];
      Y[i] := H3[i]
    END

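The effect of the restructuring can be checked with a small sequential model.
The Python sketch below is illustrative only: the cyclic index shifts (i+1 and
i-1), the reversal used as permutation function p, and all function names are
assumptions made for this example. Each call of run_forall models one
asynchronous FORALL and counts the barrier at its END.

```python
def run_forall(n, body):
    """Model one asynchronous FORALL; returns 1 for the barrier at its END."""
    for i in range(n):
        body(i)
    return 1


def six_foralls(z, x, y, p):
    """Naive scheme: one read FORALL and one write FORALL per assignment."""
    n = len(z)
    h = [None] * n

    def copy(dst, index, src):
        def body(i):
            dst[i] = src[index(i)]
        return body

    def same(i):
        return i

    steps = [
        (h, lambda i: (i + 1) % n, z), (z, same, h),
        (h, lambda i: (i - 1) % n, x), (x, same, h),
        (h, p, y), (y, same, h),
    ]
    return sum(run_forall(n, copy(dst, index, src)) for dst, index, src in steps)


def two_foralls(z, x, y, p):
    """Restructured scheme: all reads first, one barrier, then all writes."""
    n = len(z)
    h1, h2, h3 = [None] * n, [None] * n, [None] * n

    def read_phase(i):
        h1[i] = z[(i + 1) % n]
        h2[i] = x[(i - 1) % n]
        h3[i] = y[p(i)]

    def write_phase(i):
        z[i], x[i], y[i] = h1[i], h2[i], h3[i]

    return run_forall(n, read_phase) + run_forall(n, write_phase)


def reverse(i):
    """Hypothetical permutation function p for 4-element vectors."""
    return 3 - i


naive_data = ([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])
fused_data = ([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])
naive_barriers = six_foralls(*naive_data, reverse)
fused_barriers = two_foralls(*fused_data, reverse)
print(naive_barriers, fused_barriers)  # 6 2
print(naive_data == fused_data)        # True
```

Both versions compute the same vectors, but the restructured one needs three
distinct temporaries in exchange for four fewer barriers.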
The number of resulting asynchronous FORALLs and, correspondingly, the
number of time-consuming synchronization messages on MIMD machines is
reduced. Furthermore, virtualization loops grow with the number of
synchronization barriers that can be eliminated. Larger virtualization loops
allow for traditional sequential optimizations, such as common subexpression
elimination and strength reduction, which are an important source of
performance improvement on SIMD machines.
We are presently exploring optimizations that reduce the amount of temporary
storage. In the above example, the temporary array H could be changed into a
per-processor register by taking advantage of the direction of the
virtualization loop. However, such optimizations usually require additional
synchronization barriers, trading space for time.
Synchronization barrier elimination as described here has been implemented in
our compiler suite targeting SIMD, MIMD, and SISD machines. Quantitative
results are given in the Benchmark Results section.
Although this optimization technique is quite efficient, some synchronization
barriers will always have to be implemented. On current parallel machines
even these synchronization barriers can cause a significant loss of
performance. The reason is that communication networks today are optimized
for transporting data at high rates with the use of latency hiding
techniques. However, for synchronization only a few bits, i.e., small packets,
have to be transmitted and latency hiding is impossible. Hence, we recommend
hardware support for fast barrier synchronization on MIMD machines.
Branch Combining
Branch combining is a method that attempts to improve the efficiency of
asynchronous FORALLs on SIMD machines. The goal is to avoid the idling of a
large fraction of the processors. Although the semantics of Modula-2*
prescribe that branching statements create independent groups of threads, it
is permissible to combine these groups when they execute the same code.
Identical segments in different branches can be found statically by
determining the Longest Common Subsequence of these branches. Hanxleden
studies branch combining for the special case of parallel loops on SIMD
machines.
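To illustrate how identical segments might be located, the following Python
sketch computes a longest common subsequence of two branches with the textbook
dynamic-programming algorithm. Representing each branch as a list of statement
strings, and the example statements themselves, are assumptions made for this
illustration; the compiler's actual representation is not given here.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the front to recover the common statements.
    common, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


# Hypothetical THEN and ELSE branches of a branching statement.
then_branch = ["t := a[i]", "b[i] := t * 2", "c[i] := f(t)", "count := count + 1"]
else_branch = ["t := a[i]", "b[i] := 0", "c[i] := f(t)", "count := count + 1"]

shared = longest_common_subsequence(then_branch, else_branch)
print(shared)  # ['t := a[i]', 'c[i] := f(t)', 'count := count + 1']
```

In this example three of the four statements are shared, so both groups of
processes could execute them together and only the second statement would have
to be executed separately per branch.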
 Automatic Data and Process Distribution
Because of slow, high-latency communication networks on distributed-memory
machines, the distribution, i.e., alignment and layout, of data and processes
over the available processors is a central problem.
Alignment is the task of finding an appropriate trade-off between the two
conflicting goals of (1) data locality and (2) maximum degree of parallelism.
Our automatic alignment algorithm is described in [...] and briefly sketched
below by means of an example. Layout is the assignment of aligned data
structures and processes to the available processors. Desirable goals are
(1) the exploitation of special hardware-supported communication patterns and
(2) simple address calculations. We use an automatic mapping of arbitrary
multidimensional arrays to processors. Thus we exploit grid communication if
available and achieve efficient address calculations.
To understand the techniques and advantages of automatic data and process
distribution, consider the following example:
VAR A ARRAY 

	 SPREAD OF INTEGER
B ARRAY 

	 SPREAD OF INTEGER
FORALL i

	 IN PARALLEL
Ai	  Bi	
Bi	  
END
Without any distribution optimization, A[i] and B[i-1] would reside in the
same processor's local memory. In the FORALL statement, however, A[i] and
B[i] are used together although they are located in different processors, at
least at virtualization boundaries.
[Figure: storage pattern of array A (elements 1 to 90) and array B (elements
0 to 100). A[1] and B[0] are stored together, whereas A[i] and B[i] are used
together.]
Enlarging As bounds will result in the same storage pattern for both A and B
Enlarging As upper bound will ensure this e	ect even if virtualization due to
segmentation over a small number of processors is necessary The enlargement
decouples alignment and layout Since the resulting arrays have the same size

the layout algorithm maps corresponding elements of the array to the same
processor We allow for moderate storage waste because the primary goal is
execution speed This transformation yields the following code fragment
VAR AB ARRAY 

	 SPREAD OF INTEGER
FORALL i

	 IN PARALLEL
Ai	  Bi	
Bi	  
END
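The corresponding storage pattern is illustrated below.
[Figure: storage pattern after the transformation. Array A (elements 1 to 90)
and array B (elements 0 to 100) share one index range, so that elements that
are used together are also stored together.]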
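The effect of enlarging the bounds can be reproduced with a simple layout
model. The Python sketch below rests on several assumptions that are not part
of the text: SPREAD is modeled as a block layout that deals out contiguous
segments of ceil(size/p) elements starting at the lower bound, the bounds
1..90 and 0..100 are taken for illustration, and four processors are used.

```python
import math


def owner(index, lower, upper, processors):
    """Processor holding element `index` of an array [lower..upper] (block layout)."""
    segment = math.ceil((upper - lower + 1) / processors)
    return (index - lower) // segment


def misaligned(a_bounds, b_bounds, indices, processors):
    """Indices i for which A[i] and B[i] live on different processors."""
    return [
        i for i in indices
        if owner(i, *a_bounds, processors) != owner(i, *b_bounds, processors)
    ]


indices = range(1, 91)  # the elements of A that the FORALL touches
before = misaligned((1, 90), (0, 100), indices, processors=4)
after = misaligned((0, 100), (0, 100), indices, processors=4)

print(len(before), before[:2])  # 15 [24, 25]
print(after)                    # []
```

With the original bounds, the mismatches cluster at the segment boundaries,
which matches the remark about virtualization boundaries; with identical bounds
every A[i] shares a processor with B[i] at the cost of a few unused elements.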
Up to now we have only dealt with the alignment of data. Process alignment is
also achieved by means of source-to-source transformations. During the
transformation, FORALLs are attributed with an ALIGNED WITH clause that directs
the code generator to allocate each process exactly where the corresponding
data element resides.
VAR AB ARRAY 

	 SPREAD OF INTEGER
FORALL i

	 IN PARALLEL ALIGNED WITH Ai	
Ai	  Bi	
END
FORALL i

	 IN PARALLEL ALIGNED WITH Bi	
Bi	  
END
The resulting pattern of data storage and process scheduling is shown below.
[Figure: arrays A (elements 1 to 90) and B (elements 0 to 100) on a single
processor. The first FORALL is aligned with A, the second FORALL is aligned
with B.]
The original FORALL has been split into two. In the first FORALL, the process
with index i will be executed where data element A[i] resides. In the second
FORALL, the process with index i will be scheduled according to the distribution
of B[i]. This results in local accesses that could not be achieved with a
single FORALL. Cost estimation is necessary to trade off the cost of splitting
up FORALLs against the cost of access to non-local data. See [...] for more
details.
Automatic data and process distribution as described here have been
implemented in our compiler suite; quantitative results are given in the
Benchmark Results section.
Although the compiler can often find alignments and layouts that reduce the
amount of non-local communication significantly, there will still be
communication in the general case. Therefore, the basic performance of the
network must be improved and we conclude that

    MCPS measure must approach MIPS measure.

Present communication networks are far too slow compared to the speed of the
processors. We measure the speed of the communication network in Million
Connections Per Second (MCPS), i.e., the number of messages delivered per
second. This measure is more accurate than bandwidth numbers because it
does not hide latency and routing overhead. Unfortunately, typical MIMD
machines can execute ... to ... arithmetic instructions in the time it takes
to deliver a single small message. This disparity forces a distortion of
parallel algorithms: reducing the number of data packets by reorganizing the
algorithm and by combining packets becomes all-important.
With some SIMD machines, the ratio of arithmetic operation time to packet
delivery time is somewhat better (approximately ... for neighbor communication
and ... for random communication). But even that is not sufficient: the
programmer is still forced to find good mappings of the data structures
onto the topology of the network to exploit the faster neighbor communication.
A second performanceoriented recommendation is
support for latency hiding
since it allows the delivery of packets concurrently with computation That
would enable the compiler to interleave computation and communication and

thus
 to hide some of the communication latency Support for latency hid
ing could be implemented by an independently operating network with asyn
chronous message delivery or a decoupled access processor prefetching data
Furthermore
 we recommend a
shared address space
System wide addresses are especially important for the implementationof point
ers because otherwise they would have to be simulated quite ineciently in
software Therefore
 all processors should be able to generate addresses for the
entire memory of the system Even the memory of the control processor
 eg

the frontend of a SIMD machine
 should be part of that address space As
noted earlier
 a shared address space does not imply shared memory
The problems due to different types of addresses can be studied with C*. The
old version has about ten different variants for each pointer type. The variants
reflect whether the pointer itself is stored in a singular or a parallel
variable and whether it actually points to a singular or a parallel variable,
plus some additional variants. The result is that parallel pointers in old C*
are exceedingly complicated to program with. It appears that the same
complexities would arise in new C* and were omitted for this reason, resulting
in severe non-orthogonalities and restrictions. See [...] for a more detailed
critique of C*. These difficulties in compiler and language design would vanish
with a hardware-provided shared address space.
 Benchmark Results
Presently, our benchmark suite consists of thirteen problems collected from
the literature. For each problem we implemented the same algorithms
in Modula-2*, in sequential C, and in MPL. Then we measured the runtimes
of our implementations on a ...K MasPar MP (SIMD) and a SUN (SISD)
for widely ranging problem sizes. Measurements for LANs and KSR are not
yet available.
Modula Programs In Modula  we employ our libraries wherever possi
ble A technical deciency in our current Modula  compiler forced us to man
ually unroll twodimensional arrays into onedimensional equivalents This
will no longer be necessary in the future
MPL Programs. In MPL we have implemented the same algorithms as in
Modula-2* and carefully hand-tuned them for the MasPar MP architecture.
The MPL programs make extensive use of registers, neighbor communication,
standard library routines, and documented recommendations and programming
tricks. We are quite certain that we have not overlooked possible optimizations
because the MPL programs were reworked numerous times and the best
implementation chosen by careful measurement. However, to ensure the fairness
of the comparison, the MPL programs are as generally scalable as their
Modula-2* counterparts. Since scalability is not restricted to multiples of
the number of processors, boundary checks are required in every virtualization
loop.
Sequential C Programs. The sequential C programs implement the parallel
algorithms on a single processor. We use optimized libraries wherever possible.
In the following, we first compare the resource consumption of these three
program classes. Secondly, we discuss their overall performance. The individual
benchmark problems and their performance are presented in [...]. Finally, we
show the quantitative effects of the optimization techniques.
 
MPL is a data-parallel extension of C designed for the MasPar MP series. In
MPL, the number of available processors, the SIMD architecture of the machine,
its 2D mesh-connected processor network, and the distributed memory are
visible. The programmer writes a SIMD program and a sequential front-end
program with explicit interactions between the two. MPL provides special
commands for neighbor and general communication. Virtualization loops and
distributed address computations must be implemented by hand.
Resource Consumption
The comparison is based on the criteria program space, data space, development
time, and runtime performance.
Program Space. Our compiler translates Modula-2* programs to MPL or
C. The resulting programs consume slightly more space than the hand-coded
MPL or C programs. Regarding source code length, Modula-2* programs are
typically half the size of their corresponding MPL or C programs.
Data Space. The memory requirements of the Modula-2* programs are
typically similar to those of the MPL and C programs. Memory overhead is
limited to variable replication into temporaries, which occurs during
synchronous assignments. This replication, however, is most often also
required in hand-coded MPL. Furthermore, there is some additional overhead
caused by controlling synchronous, nested, and recursive parallelism
(... bytes per FORALL).
Development Time. Due to compiler errors detected while implementing
the benchmarks, we cannot give exact quantitative figures on implementation
and debugging time. However, we estimate that the implementation effort in
Modula-2* is not more than one fifth of the MPL effort.
Runtime Performance
MPL versus Modula-2*. The general relative performance of Modula-2*
is quite stable over all problem sizes and averages to ... of the MPL
performance. Problems that can be implemented in MPL with a high amount
of neighbor communication on arrays with multiple dimensions currently
perform quite badly in Modula-2*, since the necessary optimization is not
implemented yet.
Sequential C versus Modula-2*. The general relative performance of
Modula-2* is again quite stable over all problem sizes and averages to ...
of the sequential C performance.
For widely varying problem sizes we measured the runtime of each test program
on a ...K MasPar MP and a SUN. We used the high-resolution DPU timer
on the MasPar and the UNIX clock function on the SUN (sum of user and
system time). Below, t_{m2*} represents the Modula-2* runtime on either a
...K MasPar MP or a SUN, as appropriate; t_{mpl} gives the MPL runtime on a
...K MasPar MP; t_{c} stands for the sequential C runtime on a SUN.
We dene performance as problem size per time unit and focus on performances
size
t
m  

size
t
mpl
 t
mpl
t
m 
and
size
t
m  

size
t
c
 t
c
t
m 









[Figure: cumulative relative performance distribution. Horizontal axis:
relative performance; vertical axis: number of measurements. Series:
t_{c}/t_{m2*} on seq. SUN and t_{mpl}/t_{m2*} on MasPar.]
The overall distribution of relative performances proves to be encouraging. The
above histogram provides the number of relative performance values falling into
one of the classes (...). The numbers are the accumulated sums over all
problems and problem sizes (all data points).
Eect of the Optimizations
Alignment and Layout. Data locality should pay off since data access
involving communication is slower than access to local memory. The following
diagram compares the runtimes of two versions. The first version has no ALIGNED
WITH clause in the program text (t_{noalignopt}). The compiler produces code
that detects dynamically at runtime whether addresses are local or not. In the
second version (t_{alignopt}), alignment optimization in the compiler has
produced ALIGNED WITH information. Thus, the code generator statically knows
about locality.
[Figure: speedup due to alignment and layout. Horizontal axis: performance
improvement factor; vertical axis: number of measurements. Series:
t_{noalignopt}/t_{alignopt} on MasPar.]
On the MasPar MP, this optimization improves runtime performance by ...
on average. The advantage of statically determined locality grows with the
amount of data accessed. No differences could be measured on a sequential
workstation since all accesses are local.
Elimination of Synchronization Barriers. The elimination should pay off
for machines without synchronization hardware. Most MIMD machines, for
example, synchronize by message passing, which can be two or three orders of
magnitude slower than instruction execution. However, synchronization barrier
elimination is beneficial even on SIMD machines because it reduces
virtualization overhead and the number of temporary variables needed.
Furthermore, it may improve register usage.
The following diagrams show the performance ratio between runs without and
with elimination of synchronization barriers (t_{nosyncopt}/t_{syncopt}) for
the MasPar and the SUN.
Synchronization barrier elimination improves runtime by over ... on a MasPar
MP and by over a factor of ... on sequential workstations. Originally, the
benchmark programs had a total of ... synchronization barriers, which were
reduced to ... by applying the optimization.
[Figure: speedup due to elimination of synchronization barriers. Panels:
$t_{nosyncopt}/t_{syncopt}$ on MasPar; $t_{nosyncopt}/t_{syncopt}$ on
sequential SUN. Horizontal axis: performance improvement factor.]
On SISD and MIMD machines, the performance improvement stems from the fact that
fewer virtualization loops and fewer temporaries are needed. On a workstation,
loop control and computation are done by the same processor. Without the
elimination of synchronization barriers, more than  of the runtime is used for
loop control and memory access for additional temporaries. On the MasPar MP-1,
loop control is performed by the fast frontend processor, whereas the
computation is done by the much slower parallel processors. Since the
optimization technique only affects the frontend part, the relative performance
gain is smaller for the MasPar MP-1 than that achieved on sequential
workstations.
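
Why fewer barriers mean fewer virtualization loops and temporaries can be seen
in a minimal sketch. It assumes a synchronous forall with two statements in
which iteration i of the second statement only reads the value produced by
iteration i of the first; the function names and the statements themselves are
illustrative and not taken from the benchmark programs.

```python
def run_with_barrier(a, b):
    """Two statements separated by a barrier: two loops and a temporary array."""
    n = len(a)
    tmp = [0] * n
    for i in range(n):          # virtualization loop 1: tmp[i] := a[i] + b[i]
        tmp[i] = a[i] + b[i]
    # synchronization barrier: all iterations have finished statement 1
    out = [0] * n
    for i in range(n):          # virtualization loop 2: out[i] := 2 * tmp[i]
        out[i] = 2 * tmp[i]
    return out


def run_barrier_eliminated(a, b):
    """Same result after removing the barrier: one loop, one scalar temporary."""
    out = [0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]         # value stays in a scalar (possibly a register)
        out[i] = 2 * t
    return out


print(run_with_barrier([1, 2, 3], [10, 20, 30]))
print(run_barrier_eliminated([1, 2, 3], [10, 20, 30]))
```

The fused version saves one loop's control overhead and the memory traffic for
the temporary array, which matches the sources of improvement named above.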
TRITON

The poor programmability of today's parallel machines is a consequence of the
fact that the design of these machines has been driven mostly by hardware
considerations. Programmability seems to have been a secondary issue, resulting
in languages designed specifically for a particular machine model. Such
languages do not satisfy the needs of programmers who need to write
machine-independent applications.
General Architecture. Triton is a SAMD (synchronous/asynchronous instruction
streams, multiple data streams) machine: it runs in SIMD mode where strict
synchrony is necessary; it can switch to MIMD mode where concurrent execution
of different tasks is beneficial. It is even possible to run a subset of the
processors in SIMD mode and the others in MIMD mode. Thus, Triton is truly
SAMD, i.e., mixed-mode, not just switched-mode. Only a few research prototypes
of mixed-mode machines have been built: OPSILA, TRAC, AP, and PASM. Triton
provides support for switching rapidly between the two modes. With Modula-2*,
we have a high-level language to control both modes effectively.

Fast Barrier Synchronization. Fast barrier synchronization is supported by
special synchronization hardware, both in SIMD and MIMD mode. Synchronization
with hardware support overcomes the necessity of coarse-grained parallelism.

Network. We chose the De Bruijn network for Triton because it has several
desirable properties (logarithmic diameter, fixed degree, etc.), is
cost-effective to build, and can be made to operate extremely fast and
reliably. Performance figures are presented in the section on the
communications network.

Scalability and Balance. Parallel machines should scale in performance by
varying the number of processors and by adapting to progress in technology.
Furthermore, the performance of the individual components (processor, memory,
network, and I/O) should harmonize. Scalability in size is mainly a property of
the network. Popular networks do not scale well: hypercubes are too expensive
because they have variable degree, and grids cause high latency because of
their large diameter. Triton's De Bruijn net has none of these problems and,
hence, scales well. It is also well matched to the speed of the processors. The
section on scalability comments on the scalability of Triton in terms of
technology.

I/O Capabilities. I/O must also scale with the number of processors. Few
parallel machines provide for scalable I/O. Triton implements a massively
parallel I/O architecture (one disk per processor). For large sets of disks, we
have extended the traditional notion of a file to what we call a vector file.
Massively parallel I/O also provides the basis for research in parallel
operating systems, such as virtual memory, parallel paging strategies, and true
multiuser environments. Results in these areas are required in order to bring
parallel machines into widespread use.
Architecture of Triton

Triton is divided into a frontend and a backend portion. The frontend is a
UNIX workstation with an interface connected to the backend portion via the
instruction and the control bus. The backend portion consists of the processing
elements, the network, and the I/O system.
The Triton prototype uses an Intel  based PC running BSD UNIX as its
frontend The prototype will contain     PEs
   of which are supplied
with a disk   of the PEs are provided for computation and  PEs are for hot
standby These PEs can be congured under software control into the network
if other PEs fail The reconguration involves changing the PE numbers con
sistently and recomputing the routing tables in the network processors The  
disks are logically organized in  groups of  disks where each group contains
 data and one parity disk RAID level   is used for error handling The
logical organization of Triton is presented in the following diagram
Network
PE PE PE PE
          
Instruction Bus
Control Bus
FE
Ethernet
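
How a single parity disk per group allows a failed data disk to be rebuilt can
be sketched as follows. The sketch assumes bytewise XOR parity and a
hypothetical group of three data blocks; the RAID level and group size used in
the prototype are not legible in the text, so this is an illustration rather
than the Triton implementation.

```python
from functools import reduce


def parity_block(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the one missing data block from the survivors and the parity."""
    return parity_block(list(surviving_blocks) + [parity])


data = [b"abcd", b"efgh", b"ijkl"]        # hypothetical data blocks of one group
parity = parity_block(data)
rebuilt = reconstruct([data[0], data[2]], parity)
print(rebuilt == data[1])
```

XOR is its own inverse, so combining the parity with all surviving blocks
cancels them out and leaves exactly the lost block.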
In SIMD mode, the frontend produces the instruction stream and controls the
backend portion at instruction level. In MIMD mode, the frontend is responsible
for downloading the code and the initiation of the program. The instruction bus
is  bits wide. To decouple frontend and backend, and thus to reduce the time
the frontend spends waiting for the backend to become ready or vice versa, the
instruction stream is sent through a fifo. The handshake signals necessary to
control the instruction stream are part of the control bus.
The common address space of Triton is needed by the so-called analyze mode. In
this mode, the frontend has direct access to the local memory of all backend
processing elements. It can be used for frontend-controlled data transport
between frontend and backend (both directions) and for debugging. To support
the analyze mode, the control bus includes  address lines and several dedicated
control signals; the instruction bus is used for data transport. While in
analyze mode, all PEs release their local busses to enable direct memory access
from the frontend.
The processing elements are designed as universal computing elements, capable
of performing computation as well as service functions. Each PE consists of a
Motorola MC microprocessor, a memory management unit MC, a numeric coprocessor
MC or MC,  MBytes of main memory, a SCSI interface, a network processor, and
the frontend bus. No extra controllers for mass storage access or any other I/O
are necessary. The following figure shows the architecture of the processing
elements of Triton and emphasizes the novel aspects, i.e., the differences to a
traditional sequential architecture.

[Figure: architecture of a processing element. Components shown: RAM, SCSI,
NET, FPU, CPU, MMU, FE, address bus, and data bus.]
The network of Triton consists of the network processors included in the PEs,
the interconnection lines realized with flat cables, and fifo buffers for
intermediate buffering of data packets. The network can route data packets from
their source to their destination asynchronously, i.e., without interfering
with the PEs. Non-interference permits latency hiding techniques to be applied
in the compiler. The interface between a PE and its respective network
processor is also implemented with fifos for decoupling.

Parity checking of main memory, network links, and mass storage implements
error detection. Periodic signature tests locate malfunctioning elements.

The architecture of Triton matches the following recommendations, derived
earlier to support the translation of high-level languages:
Hardware support for fast creation and termination of processes and for
context switching. The processor and the memory management unit provide
hardware support for virtual memory and special instructions to build process
management routines.

Hardware support for fast synchronization. Several global-OR signals are
provided, both for SIMD and MIMD processing.

Hardware support for synchronous and asynchronous parallelism, i.e., fast
switching between SIMD and MIMD mode.
Details of Selected Hardware Aspects

Instruction Bus and Control Bus

The instruction and control busses are implemented by a hierarchy of bus
drivers for signals from the frontend to the backend. For the opposite
direction, global-OR lines are emulated by explicit OR-combination of the
signals from individual PEs. In SIMD mode, all PEs read and execute the same
instruction or idle. This is controlled with a three-way handshake protocol.

The general problem with handshake protocols on hierarchies of bus drivers is
that these introduce a non-negligible amount of delay if the signals must
traverse the complete hierarchy several times. In Triton, this delay is reduced
by employing instruction buffers at each driver level. Thus, the handshake
protocol is executed between every two hierarchy levels in pipeline fashion,
rather than between the frontend and the PEs. This technique results in the
same duration of instruction fetch in SIMD and MIMD mode.
Global Address Space

As mentioned above,  bit addresses are used to implement the global address
space. The least significant  bits are used to select the memory and the
memory-mapped I/O in the PEs (up to  MByte). The next  bits are used to
identify the PE to be accessed (up to  k). The remaining bit distinguishes
between frontend and backend addresses. The id of the PEs is twofold: each PE
has a hardware id, which is selected by a switch setting. Additionally, each PE
has a software identification, which is used while computing. Initially, the
software id is set to the same value as the hardware id, but it may change
during operation because of reconfiguration.

As long as the total size of memory over all processors stays below  GBytes,
the internal  bit address arithmetic of the MC is sufficient. For larger
configurations, the  address bit arithmetic must be simulated in software.
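
The three-field address layout described above can be sketched as follows. The
field widths (24 offset bits, 16 PE bits) and the meaning of the top bit are
assumptions made for illustration only, because the actual widths are not
legible in the text; all names are hypothetical.

```python
OFFSET_BITS = 24   # assumed width of the PE-local memory / memory-mapped I/O field
PE_BITS = 16       # assumed width of the PE number field


def encode(is_frontend, pe, offset):
    """Pack the frontend/backend bit, the PE number and the local offset."""
    assert 0 <= pe < (1 << PE_BITS) and 0 <= offset < (1 << OFFSET_BITS)
    top = int(is_frontend) << (PE_BITS + OFFSET_BITS)
    return top | (pe << OFFSET_BITS) | offset


def decode(addr):
    """Split a global address into (is_frontend, pe, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    pe = (addr >> OFFSET_BITS) & ((1 << PE_BITS) - 1)
    is_frontend = bool(addr >> (PE_BITS + OFFSET_BITS))
    return is_frontend, pe, offset


addr = encode(False, 5, 0x1234)
print(hex(addr), decode(addr))
```

Because the PE number is a field of the address, renumbering PEs after a
reconfiguration changes which hardware unit a given global address refers to,
which is why the software id is kept separate from the hardware id.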
SIMD/MIMD Mode Switching

In SIMD mode, the function codes of the processor are used to determine
whether the processor accesses data or instructions. Accordingly, the processor
bus is connected to the local memory or the instruction bus, respectively. The
values of the program counters are completely ignored in SIMD mode. In order to
switch to MIMD mode, a program must be downloaded to the memory of the PEs.
Downloading is done via the instruction stream in SIMD mode. Thus, the
distribution of code is, in contrast to many other MIMD machines, accomplished
in a time proportional to the length of the code, independent of the size of
the machine. The switch from SIMD to MIMD mode is performed by two
instructions. With the first instruction, the program counter is set according
to the location of the program to be executed in MIMD mode. With the second
instruction, the MIMD request in the command register local to the PE is
activated. The PE then switches to MIMD mode at the end of the current cycle
and commences execution of the local code without delay. To switch from MIMD to
SIMD mode, the MIMD request in the local command register is deactivated, which
causes the PE to switch to SIMD mode at the end of the current cycle. The next
instruction is then expected from the instruction stream.
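
The switching sequence can be mimicked with a toy model of a single PE. This
is a sketch under simplifying assumptions: the class, its attribute names, and
the string "instructions" are hypothetical and only mirror the steps described
above (set the program counter, raise the MIMD request, switch at the end of
the cycle).

```python
class ProcessingElement:
    """Toy model of how one PE selects its instruction source."""

    def __init__(self, local_program):
        self.local_program = local_program  # code downloaded earlier in SIMD mode
        self.pc = 0                         # ignored while in SIMD mode
        self.mimd_request = False           # bit in the PE-local command register
        self.mode = "SIMD"

    def end_of_cycle(self):
        # The mode follows the request bit at the end of the current cycle.
        self.mode = "MIMD" if self.mimd_request else "SIMD"

    def fetch(self, instruction_stream):
        """Return the next instruction from the source selected by the mode."""
        if self.mode == "SIMD":
            return next(instruction_stream)   # broadcast by the frontend
        instr = self.local_program[self.pc]   # fetched from local memory
        self.pc += 1
        return instr


pe = ProcessingElement(["local0", "local1"])
stream = iter(["simd0", "simd1"])
trace = [pe.fetch(stream)]
pe.pc = 0                  # first switching instruction: set the program counter
pe.mimd_request = True     # second switching instruction: raise the MIMD request
pe.end_of_cycle()
trace.append(pe.fetch(stream))
pe.mimd_request = False    # switch back to SIMD
pe.end_of_cycle()
trace.append(pe.fetch(stream))
print(trace)
```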
Data Transfer

There are several different data paths to consider in parallel computers. The
most important one is the PE network (see the section on the communications
network). Another important path is the data transport from the frontend to the
backend and vice versa. There are different possibilities for each direction.
To transport data from the frontend to the backend, the easiest way is to send
the data as immediate data via the instruction stream in SIMD mode. With that
possibility, any subset of the PEs can be the destination of the data.
Unfortunately, only unidirectional access is possible. The second possibility
is direct memory access in analyze mode. Here, data can be transferred in both
directions. The drawback of the analyze mode is that no computation can take
place and only one single PE can be accessed at any time. The third is to use
the network. There is one dedicated network node connected to the frontend.
This path is especially useful for transmitting more than a few bytes from
different PEs to the frontend, e.g., picture data. Another advantage of a
network node included in the frontend is that parallel computation can commence
while data is being transported.
Fast Barrier Synchronization in MIMD Mode

If all PEs are executing the same code, barrier synchronization is easily done
by a global-OR line. Each PE sets its ready bit to true as soon as it reaches
the synchronization barrier. Approximately one clock cycle after the last PE
sets its bit, the frontend recognizes it and notifies the PEs on the result
line.

In the general MIMD case, more than one group of processes exists, but the
single global-OR line cannot be partitioned according to the process
distribution. To implement barrier synchronization with several groups of
processes, the global-OR line is administrated by the frontend as a
synchronization resource. Each group of processes is identified by a unique
process group number. Initially, each PE is allowed to request the
synchronization line on behalf of a group. The request is performed by the
first PE reaching a barrier, which interrupts the frontend and sends the group
identification via the analyze circuits. If more than one PE reaches a barrier
at once, the analyze circuits will select one randomly. The frontend then knows
which group demands the synchronization line. Next, the frontend interrupts all
PEs and forces them into SIMD mode to perform a barrier setup. The PEs not
belonging to the requesting group are prohibited from requesting the sync line
themselves. They also turn on their ready bits. The PEs belonging to the
requesting group set their ready bit to true if they have already reached the
barrier, otherwise to false. After this setup phase, the PEs return to MIMD
mode and continue computation, independently of their group membership. As soon
as the last ready bit is turned on, the group owning the global-OR line
synchronizes and releases the sync line. The frontend then releases the request
prohibition in order to enable other groups to synchronize.
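
The setup phase and the release condition of this group barrier can be
sketched in a few lines. The sketch models the global-OR line as an OR over
the inverted ready bits, which is an assumption about the wiring; function and
parameter names are hypothetical.

```python
def barrier_setup(group_of, at_barrier, requesting_group):
    """Ready bits after the SIMD setup phase.

    group_of[pe]   -- process group number of each PE
    at_barrier[pe] -- True if the PE is already waiting at its barrier
    PEs outside the requesting group turn their ready bit on; members
    report whether they have reached the barrier.
    """
    return [True if group != requesting_group else reached
            for group, reached in zip(group_of, at_barrier)]


def barrier_released(ready_bits):
    """The barrier completes once no PE signals 'not ready' on the shared line."""
    not_ready_line = any(not bit for bit in ready_bits)  # global OR
    return not not_ready_line


group_of = [0, 0, 1, 1]
at_barrier = [True, False, False, True]
ready = barrier_setup(group_of, at_barrier, requesting_group=0)
print(ready, barrier_released(ready))
ready[1] = True            # the last member of group 0 reaches the barrier
print(ready, barrier_released(ready))
```

PEs of group 1 never block the line while group 0 owns it, which is why one
physical line suffices for several process groups, one group at a time.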
Communications Network

The Triton network is based on the generalized De Bruijn net, which can be
characterized by the following interconnection rule: assuming all nodes are
labeled $0$ through $N-1$, a node with label $X$ has direct connections to the
nodes with labels $(d \cdot X + \delta) \bmod N$, where
$\delta \in \{0, \dots, d-1\}$. The outdegree $d$ equals the indegree. The
number $N$ of nodes in the network is not limited to powers of two. The maximum
diameter is $\lceil \log_d N \rceil$. The average diameter is well below
$\log_d N$ and in practice quite close to the theoretical lower bound, the
average diameter of directed Moore graphs. Example: the average diameter of the
De Bruijn net with N =  and d =  is only  worse than the average diameter of
the optimal Moore graph with the same number of nodes. In general, Moore graphs
cannot be realized in practice.

[Figure: structure of a De Bruijn net.]

In our implementation we use degree $d = 2$, which makes our net a perfect
shuffle. The above graph illustrates the structure of a De Bruijn net with
 nodes.
In comparison with other frequently used networks, this design has the
benefits of a constant degree per node and a small average diameter. Data
transport is done via a table-based, self-routing packet switching method,
which allows virtual cut-through routing and load-dependent detouring of
packets. Every node has its own routing table and three input buffers: two for
intermediate storage of data packets coming from other nodes and one to
communicate with its associated PE. An output buffer is used to deliver data
packets to the associated PE. Buffering temporally decouples the network from
local processing. Packets contain the address of the target node, the length of
the message, and the data itself. The size of the packets can range from  to
 bytes.
To achieve low latency, we implemented the communications processor at the gate
level with a programmable gate array. The throughput of one communications
processor, measured on our prototype, is  packets per second ( bit user data),
which is equivalent to  MCPS. Thus, the communications processor is three to
five times faster than the PEs can read or write packets. Therefore, we have
adequate performance to build large networks and to operate them under heavy
loads without suffering from high latency.
The communications processor routes the packets without interfering with the
PE. Optimal routes are stored in a routing table per communications processor.
Hence, the network can transport data concurrently with the operation of the
processing elements. This feature can be used by the compiler to overlap
communication and computation.
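
A per-node routing table of the kind mentioned above could be filled by a
breadth-first search from each node. The sketch below is one plausible way to
do this and is not taken from the Triton implementation: it stores, for every
destination, the outgoing link that starts a shortest route, and ignores the
load-dependent detouring. All names are hypothetical.

```python
from collections import deque


def successors(x, n, d=2):
    """Neighbors of x in the generalized De Bruijn net, in link order."""
    return [(d * x + delta) % n for delta in range(d)]


def routing_table(node, n, d=2):
    """Map each destination to the outgoing link (0..d-1) of a shortest route."""
    first_link = {}
    visited = {node}
    queue = deque()
    for link, y in enumerate(successors(node, n, d)):
        if y not in visited:
            visited.add(y)
            first_link[y] = link
            queue.append(y)
    while queue:
        x = queue.popleft()
        for y in successors(x, n, d):
            if y not in visited:
                visited.add(y)
                first_link[y] = first_link[x]  # inherit the first hop
                queue.append(y)
    return first_link


def route(src, dst, n, d=2):
    """Follow the tables hop by hop, as a self-routing packet would."""
    path = [src]
    while path[-1] != dst:
        cur = path[-1]
        link = routing_table(cur, n, d)[dst]
        path.append(successors(cur, n, d)[link])
    return path


print(route(1, 4, 8))
print(route(0, 7, 8))
```

Each lookup only needs the destination address carried in the packet, so the
forwarding decision is local to the communications processor.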
In order to analyze the behavior of the network, we built a simulator based on
the measured performance of a single communications processor. We simulated the
overall performance of the network in various modes. The number of nodes
examined ranged from  to . The results of a series of experiments with a random
communication pattern are given in the following diagram.

[Figure: transfer time in network cycles, maximum diameter, and average
diameter, plotted against the number of nodes.]
Both the sender and the receiver were chosen randomly, with the restriction
that the number of data packets to be transported equals the number of nodes in
the network. The simulation shows that the network scales well: the delay
introduced by the network lies within $O(\log N)$, where $N$ denotes the number
of nodes and messages.
The robustness against overload is surprisingly good. Even if all processing
elements send a large number of packets simultaneously, the overall throughput
of the network does not decrease. Irregular permutations are performed
especially fast. All hard patterns known from literature, e.g., transposition
of a matrix, butterfly, and bit reversal, are delivered at average speed or
faster.
A severe disadvantage of De Bruijn networks is that they are not deadlock free.
If used as a self-routing packet switching network with a limited buffer size
and no possibility of rearranging the packets in the buffers, deadlock can
occur. Dally et al. show that a network is deadlock free if its dependency
graph is free of cycles. This condition is easy to prove for hypercubes; it
does not hold for De Bruijn nets. Although several other methods for deadlock
avoidance are known from literature, none of them apply to De Bruijn nets.
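
The cyclic dependency can be made visible on a small example. The sketch below
assumes plain shortest-path routing found by breadth-first search (not
Triton's actual routing with detouring), builds the graph of channel
dependencies for an 8-node net with degree 2, and searches it for a cycle. All
names are illustrative.

```python
from collections import deque


def successors(x, n, d=2):
    return [(d * x + delta) % n for delta in range(d)]


def shortest_path(src, dst, n, d=2):
    """One shortest route from src to dst, found by breadth-first search."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            break
        for y in successors(x, n, d):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    path = [dst]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path[::-1]


def channel_dependencies(n, d=2):
    """Edges c1 -> c2 where some route uses channel c1 immediately before c2."""
    deps = {}
    for s in range(n):
        for t in range(n):
            p = shortest_path(s, t, n, d)
            for a, b, c in zip(p, p[1:], p[2:]):
                deps.setdefault((a, b), set()).add((b, c))
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(c):
        color[c] = GREY
        for nxt in deps.get(c, ()):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[c] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(deps))


print(has_cycle(channel_dependencies(8)))
```

In this net the routes 1-2-4, 2-4-1 and 4-1-2 are each the only shortest route
between their end points, so the channels (1,2), (2,4) and (4,1) wait on one
another in a ring whichever tie-breaking the search uses.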
We have implemented a combination of three methods to cope with deadlocks,
contention, and starvation. The first one uses a static priority scheme to
reduce the likelihood of deadlocks. This gives us something similar to a
priority path through the network that has to be mapped to a Hamiltonian cycle
within the De Bruijn net. The second method is a static packet insertion rule,
which prevents overloading of the network. The third method uses detouring of
packets if it detects contention. In case of starvation, we use a routing
similar to the one used in the Denelcor HEP, although without explicit packet
priorities.
The design of our network matches all the network-related recommendations
derived earlier:

The MCPS measure approaches the MIPS measure ( MCPS vs.  MIPS).

The network operates independently with asynchronous message delivery.

Triton provides uniform communication instructions. There is no difference
between neighbor, global, and frontend communication.
Scalability

Scalability of parallel machines can be interpreted in two ways: scalability
within the size of the machine and scalability in terms of technology. The
Triton architecture scales well in size. With our current hardware, we are able
to connect up to  PEs. The design allows for up to  k PEs.

Scaling Triton in technology raises some difficulties. The De Bruijn net and
the I/O system should scale well. Upgrading to a state-of-the-art
microprocessor requires changing the  bit design to a  bit or even  bit design,
which is simple. Also, the upgrade of our network processor, an FPGA with about
 gates, is quite easy for an experienced VLSI designer. However, the frontend
instruction bus needed for SIMD processing is crucial, since the delay induced
by the hierarchical bus architecture cannot be reduced at the rate at which the
performance of standard processor chips is expected to improve. Provided that
we have hardware support for fast barrier synchronization, our experience in
compiler technology shows how to produce efficient code. Since the
synchronization hardware of Triton (a reduction/broadcast tree) scales well,
the Triton architecture, given minor modifications, must be considered to scale
well in terms of technology.
STATUS AND FUTURE

Compilers

A first non-optimizing Modula-2* compiler targeting the Connection Machine CM-2
was operational in . Since spring , optimizing Modula-2* compilers for the
MasPar MP-1 and sequential machines are available. Modula-2* compilers for the
KSR-1 and networks of workstations are under construction. (Contact
msc@ira.uka.de if interested.) Further research will focus on Modula-3*, nested
and recursive parallelism, as well as other optimizations for latency hiding
and parallel expression evaluation.

The migration to a new compiler becomes necessary because our current
implementation has already reached its limits of extensibility. This also
offers the opportunity to move to a new language. Modula-3* clearly
distinguishes between the orthogonal properties range and mode (synchronous or
asynchronous) of parallel computations and supports more modern concepts than
its predecessor, e.g., parallel object-oriented programming.
Triton

The prototype of the first node was completed in October . The individual
components (communication processor, PE, and control processor interface) are
tested and running according to their specifications. Manufacturing of the
printed circuit boards is in progress. The final assembly of Triton will be
completed in .

One serious bottleneck in the current Triton architecture is the missing
network packaging unit. This unit should be capable of taking an address in the
global address space of Triton, recognizing the remote processor number, and
automatically sending the corresponding packet through the network. Moreover,
in case of a data-fetch request, the packaging unit of the responding processor
should be able to process data requests and send the requested data back to the
originating processing element. Currently, this functionality is implemented
with low-level libraries. We expect that hardware support for these features
will improve performance of remote data access at least by an order of
magnitude. Further research will focus on novel architectures to overcome the
network latency problem.
CONCLUSION

Massively parallel machines are now beginning to be used routinely. With
increased use, the programmability of these machines has become an important
concern. We have proposed simple language extensions that allow for clear and
portable expression of parallel algorithms in most imperative programming
languages. We have demonstrated that these language constructs can be compiled
effectively for parallel machines. Besides further research in optimization
techniques, parallel architectures must be tuned to the needs of higher-level
languages and the capabilities of compilers. We have proposed and are still
evaluating promising hardware features that would support problem-oriented,
explicitly parallel programming languages.
Acknowledgements

We would like to thank our students, especially Boris Bialek, Udo Böhm, HaJo
Brunne, Thomas Gauweiler, Stefan Hänßgen, Oliver Hauck, Ralf Kretzschmar,
Michael Längle, Hendrik Mager, and Markus Mock for many valuable ideas and
their implementation work.
REFERENCES

[1] Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice
Hall, Englewood Cliffs, New Jersey.
[2] M. Auguin and F. Boeri. The OPSILA computer. In M. Cosnard, editor,
Parallel Languages and Architectures. Elsevier Science Publishers, Holland.
[3] N. G. De Bruijn. A combinatorial problem. In Proc. of the Sect. of Science,
Akademie van Wetenschappen, Amsterdam, June.
[4] Peter Christy. Virtual processors considered harmful. In Proc. of the
Distributed Memory Computing Conference, Portland, Oregon, April/May.
[5] William J. Dally and Charles L. Seitz. Deadlock-free message routing in
multiprocessor interconnection networks. IEEE Transactions on Computers.
  Chapter 
 John T Feo
 editor A Comparative Study of Parallel Programming Lan
guages The Salishan Problems Elsevier Science Publishers
 Holland
  
 Geo	rey Fox
 Seema Hiranandani
 Ken Kennedy
 Charles Koelbel
 Uli Kre
mer
 ChauWen Tseng
 and MinYou Wu Fortran D language specica
tion Technical Report CRPCTR
 Center for Research on Parallel
Computation
 Rice University
 December 
 Alan Gibbons and Wojciech Rytter Ecient Parallel Algorithms Cam
bridge University Press
 
 Philipp J Hatcher and Michael J Quinn DataParallel Programming
on MIMD Computers MIT Press Cambridge
 Massachusetts
 London

England
 
 Ernst A Heinz Modula An eciently compilable extension of
Modula for explicitly parallel problemoriented programming In Joint
Symposium on Parallel Processing
 pages   
 Waseda University

Tokyo
 May 
 
 Ernst A Heinz and Michael Philippsen Synchronization barrier elim
ination in synchronous forall statements Technical Report No 

University of Karlsruhe
 Department of Informatics
 April 
  Christian G Herter
 Thomas M Warschko
 Walter F Tichy
 and Michael
Philippsen Triton A massivelyparallelmixedmode computer designed
to support high level languages In 	th International Parallel Processing
Symposium
 Proc of nd Workshop on Heterogeneous Processing
 pages

 Newport Beach
 CA
 April 
 
 Takeshi Horie
 Hiroaki Ishihata
 Toshiyuki Shimizu
 Sadayuki Kato

Satoshi Inano
 and Morio Ikesaka AP architecture and performance
of LU decomposition In Proc of the  International Conference on
Parallel Processing
 volume I
 pages 
 August 
 High Performance Fortran HPF Language specication Technical re
port
 Center for Research on Parallel Computation
 Rice University
  
Makoto Imase and Masaki Itoh. Design to minimize diameter on
building-block network. IEEE Transactions on Computers, C, June.
Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley,
Reading, Mass.
MasPar Computer Corporation. MasPar Parallel Application Language (MPL)
Reference Manual, September.
Piyush Mehrotra and John Van Rosendale. The BLAZE language: A parallel
language for scientific programming. Parallel Computing, November.
 David A Patterson
 Garth Gibons
 and Randy H Katz A case for re
dundant arrays of inexpensive disks RIAD In Proc of the  ACM
SIGMOD Conference on Management of Data
 pages 
 Chicago

June 
 
  Michael Philippsen Automatic data distribution for nearest neighbor net
works In Frontiers The Fourth Symposium on the Frontiers of Mas
sively Parallel Computation
 pages 
 Mc Lean
 Virginia
 October
 
  
Michael Philippsen, Ernst A Heinz, and Paul Lukowicz. Compiling
machine-independent parallel programs. ACM SIGPLAN Notices, August.
Michael Philippsen and Markus U Mock. Data and process alignment in
Modula. In Christoph W Kessler, editor, Automatic Parallelization: New
Approaches to Code Generation, Data Distribution, and Performance
Prediction, pages, AP Saarbrücken, Germany, March. Verlag Vieweg,
Wiesbaden, Germany, Advanced Studies in Computer Science.
Michael Philippsen and Walter F Tichy. Modula and its compilation. In
First International Conference of the Austrian Center for Parallel
Computation, Salzburg, Austria, pages. Springer Verlag, Lecture Notes in
Computer Science.
Lutz Prechelt. Measurements of MasPar MP A communication operations.
Technical Report No, University of Karlsruhe, Department of Informatics,
January.
  M Rosing
 R Schnabel
 and R Weaver DINO Summary and example
In Proc of the Third Conference on Hypercube Concurrent Computers and
Applications
 pages  
 Pasadena
 CA
  ACM Press
 New York
David Sankoff and Joseph B Kruskal, eds. Time Warps, String Edits, and
Macromolecules: The Theory and Practice of Sequence Comparison.
Addison-Wesley, Reading, Mass.
  HJ Siegel
 T Schwederski
 JT Kuehn
 and NJ Davis An overview of
the PASM parallel processing system In DD Gajski
 VMMilutinovic

HJSiegel
 and BP Furht
 editors
 Computer Architecture
 pages 
IEEE Computer Society Press
 Washington
 DC
 
  Burton J Smith Architecture and applications of the HEP multiprocessor
computer system In Real Time Signal Processing IV
 Proceedings of SPIE

pages    International Society for Optical Engineering
 
Thinking Machines Corporation, Cambridge, Massachusetts. Lisp Reference
Manual, Version.
Thinking Machines Corporation, Cambridge, Massachusetts. C Language
Reference Manual, April.
 Walter F Tichy and Christian G Herter Modula  An extension of
Modula  for highly parallel
 portable programs Technical Report No

 University of Karlsruhe
 Department of Informatics
 January 
  Walter F Tichy
 Michael Philippsen
 and Phil Hatcher A critique of the
programming language C Communications of the ACM
   

June  
 Reinhard v Hanxleden and Ken Kennedy Relaxing SIMD control ow
constraints using loop transformations Technical Report CRPCTR  

Center for Research on Parallel Computation
 Rice University
 April  
