Triton/1: a massively-parallel mixed-mode computer designed to support high level languages by Herter, Christian G. et al.
Triton/1
A Massively-Parallel Mixed-Mode Computer
Designed to Support High Level Languages
Christian G. Herter, Thomas M. Warschko, Walter F. Tichy, and Michael Philippsen
University of Karlsruhe, Dept. of Informatics
Postfach  D	
 Karlsruhe  Germany
This paper appeared in: th International Parallel Processing Symposium, Proc. of nd Workshop on Heterogeneous Processing, pages  , Newport Beach, CA, April  
 
Abstract
We present the architecture of Triton/1, a scalable, mixed-mode (SIMD/MIMD) parallel computer. The novel features of Triton/1 are:
• Support for high-level, machine-independent programming languages
• Fast SIMD/MIMD mode switching
• Special hardware for barrier synchronization of multiple process groups
• A self-routing, deadlock-free, perfect shuffle interconnect with latency hiding
The architecture is the outcome of an integrated design in which a machine-independent programming language, an optimizing compiler, and a parallel computer were designed hand in hand.
1 Introduction
The goal of adequate programmability of parallel machines can best be achieved by tightly coupling the design of machine-independent programming languages, compilers, and parallel hardware. In the past, the development of parallel computers has been driven mainly by hardware considerations, without regard to the law of the weakest link:
  The overall performance of a parallel computer is determined by the performance of its slowest part with respect to the requirements of the software.
Ignoring software requirements has resulted in unsatisfactory performance of these machines on machine-independent parallel programs. To avoid these shortcomings, and to show that high-level parallel programming does not necessarily lead to poor performance, we specifically analyzed the requirements of programming languages and compilers before designing the hardware.
The following section outlines the parallel programming language Modula-2* and derives general requirements for parallel computers. In section 3 we describe the architecture of Triton/1. We emphasize those features which arose from software requirements.
2 Modula-2*
Modula-2* (pronounced Modula-2 star) is a small extension of Modula-2 for massively parallel programming. The programming model of Modula-2* incorporates both data and control parallelism and allows mixed synchronous and asynchronous execution. Modula-2* is problem-oriented in the sense that the programmer can choose the degree of parallelism and mix the control mode (SIMD-like or MIMD-like) as needed by the intended algorithm. Parallelism may be nested to arbitrary depth. Procedures may be called from sequential or parallel contexts and can themselves generate parallel activity without any restrictions. Most Modula-2* programs can be translated into efficient code for both SIMD and MIMD architectures.
Overview of language extensions
Modula-2* extends Modula-2 with the following two language constructs:
1. Parallelism can be created in Modula-2* programs only by means of the FORALL statement. There are two versions of this statement, a synchronous and an asynchronous one.
2. The distribution of array data may optionally be specified by so-called allocators in a machine-independent way. Allocators do not have any semantic meaning; they are hints for the compiler.
Because of the compactness and simplicity of the extensions, they could easily be incorporated into other imperative programming languages, such as Fortran, C, or Ada.
A synchronous FORALL statement creates a set of synchronously running processes. The number of processes is determined by the FORALL statement and is not limited to the number of PEs of the machine. As long as there is no branch in the control flow of the statements inside a FORALL statement, the semantics of the execution is equivalent to a SIMD machine executing those statements. If there are branches in the control flow that define alternatives, like IF THEN ELSE or CASE, the processes are partitioned into several groups. Each of those groups executes one branch of the control flow. The processes belonging to one group execute synchronously in SIMD style, but groups are allowed to execute concurrently with respect to each other. There is no assumption about the relative speeds of two processes in different groups.
In contrast to the synchronous FORALL statement, an asynchronous FORALL statement creates a set of independent processes running at unspecified relative speeds. Common to both variants is that the termination of a FORALL statement is determined by the termination of the last process created by the FORALL statement. The end of a FORALL statement always defines a synchronization barrier. For further details about the language see [7]; for a detailed discussion of compilation techniques and optimization see [5].
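
The grouping behaviour of a synchronous FORALL can be mimicked in a few lines of Python. This is only an illustrative toy model, not Modula-2* and not the translation scheme used by the compiler: the function `sync_forall` and its arguments are hypothetical names. Processes are partitioned by the branch they take, each group runs its branch statement by statement in lockstep, and returning from the function plays the role of the barrier at the end of the FORALL. The groups are run one after the other here, although the language allows them to run concurrently.

```python
from collections import defaultdict

def sync_forall(indices, condition, then_branch, else_branch):
    """Toy model of a synchronous FORALL containing one IF/THEN/ELSE.

    Each index stands for one process. Processes are split into groups by
    the branch they take; within a group, statements run in lockstep
    (one statement at a time over all members). Returning from this
    function models the barrier at the end of the FORALL.
    """
    groups = defaultdict(list)
    for i in indices:
        groups[bool(condition(i))].append(i)

    results = {}
    for taken, members in groups.items():
        branch = then_branch if taken else else_branch
        for statement in branch:      # one statement at a time ...
            for i in members:         # ... for every member of the group
                results[i] = statement(i, results.get(i))
    return results

# Hypothetical body: even processes double their index, odd ones negate it;
# afterwards every process adds one as the second statement of its branch.
then_branch = [lambda i, acc: 2 * i, lambda i, acc: acc + 1]
else_branch = [lambda i, acc: -i, lambda i, acc: acc + 1]
out = sync_forall(range(6), lambda i: i % 2 == 0, then_branch, else_branch)
```

Note that the number of processes (six here) is independent of any number of physical PEs, as in the language.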
Software requirements for parallel computers
On distributed memory machines, the distribution of array data over the available processors is a central problem. Two conflicting goals, (1) data locality and (2) maximum degree of parallelism, must be reconciled. Data locality means that data elements should be stored local to the processors that need them, in order to minimize communication costs. Perfect data locality could be achieved by employing a single processor. Parallelism, which reduces runtime, may unfortunately reduce locality and increase communication costs. Additional goals for data distribution are (3) exploiting special communication patterns supported by hardware and (4) generating simple address calculations to prevent addressing from becoming a dominant cost.
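
The tension between locality and simple address calculation can be made concrete with two textbook layouts. The sketch below is an illustration under stated assumptions: block and cyclic distribution are standard schemes chosen for the example, not allocators taken from Modula-2*, and all function names are hypothetical. Both layouts need only a division and a remainder per access, but they differ sharply in how many nearest-neighbour accesses leave the owning processor.

```python
def block_owner(i, n, p):
    """Block distribution of an n-element array over p processors."""
    block = -(-n // p)              # ceil(n / p) elements per processor
    return i // block, i % block    # (processor, local offset)

def cyclic_owner(i, p):
    """Cyclic distribution: elements are dealt out round-robin."""
    return i % p, i // p            # (processor, local offset)

def remote_accesses(n, owner):
    """Count accesses to a[i+1] by the owner of a[i] that cross processors."""
    return sum(owner(i)[0] != owner(i + 1)[0] for i in range(n - 1))

n, p = 16, 4
block_remote = remote_accesses(n, lambda i: block_owner(i, n, p))
cyclic_remote = remote_accesses(n, lambda i: cyclic_owner(i, p))
```

For this access pattern the block layout keeps most accesses local, while the cyclic layout makes every one of them remote; other access patterns can reverse the ranking, which is why the choice is left to compiler hints.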
Even with optimal layout of the data, there will still be communication in the general case. In fact, in most massively parallel applications that are not trivially parallel, communication is almost as frequent as computation. Therefore,
  the network MIPS measure must approach the CPU MIPS measure.
A second performance-oriented recommendation is an
  independently operating network with asynchronous message delivery,
since it allows the delivery of packets concurrently with computation. That would enable the compiler to interleave computation and communication and thus to hide some of the communication latency.
On the other hand, we recommend a
  shared address space.
A shared address space does not imply shared memory; it only means that every processor can generate addresses for the entire memory in the system. System-wide addresses are especially important for pointers, because otherwise they would have to be simulated quite inefficiently in software. Even the memory of the control processor (e.g., the frontend of a SIMD machine) should be part of the shared address space.
Furthermore, and similar to the above, we call for
  uniform memory access instructions.
Many parallel machines today provide one set of instructions for accessing local memory, a second one for accessing memory in neighbors, and a third set for accessing distant memory units. The differences in speed are significant and therefore require that the compiler detect the faster cases. However, it is often impossible to know statically for which case to optimize. For instance, we found that in many cases it was impossible to determine in the compiler whether a procedure would access local or non-local memory. The generated code thus has to check all three cases at runtime. Such simple, frequent, and dynamic analyses could be done more efficiently in hardware.
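
The cost of non-uniform access instructions can be pictured as a dispatch that compiled code has to run on every memory reference whose target is not known statically. The following sketch is hypothetical throughout: the ring topology, the function names, and the counters that stand in for "choosing an instruction set" are assumptions made for illustration only.

```python
from collections import Counter

def classify(my_pe, target_pe, neighbors):
    """Decide at run time which of the three access classes applies."""
    if target_pe == my_pe:
        return "local"
    if target_pe in neighbors[my_pe]:
        return "neighbor"
    return "remote"

def load(my_pe, target_pe, offset, memories, neighbors, counters):
    """Software dispatch executed on every access with an unknown target."""
    kind = classify(my_pe, target_pe, neighbors)
    counters[kind] += 1             # stands in for selecting an instruction set
    return memories[target_pe][offset]

# Hypothetical 4-PE ring: each PE has two neighbours.
neighbors = {pe: {(pe - 1) % 4, (pe + 1) % 4} for pe in range(4)}
memories = {pe: [10 * pe + k for k in range(3)] for pe in range(4)}
counters = Counter()
values = [load(0, target, 1, memories, neighbors, counters) for target in range(4)]
```

With uniform access instructions and the classification done in hardware, the `classify` step disappears from the generated code.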
Barrier synchronization is extremely frequent. Basically, every communication requires a barrier synchronization. The reasoning is as follows: Communication sends or receives data. Unless communication is redundant, there must be a write between two successive communication calls to the same cell. It follows that a synchronization operation must be placed somewhere between one of the communication calls and the write in order to avoid race conditions. If many processes communicate at once, as in massively parallel machines, this type of synchronization amounts to a barrier: no process may proceed until all processes have completed either their write or their communication operation. Because of its frequency, we need
  fast barrier synchronization.
In principle, the communication network could be used for barrier synchronization. However, communication networks usually have high latency, which makes them too slow for fast barrier synchronization. These nets are optimized for transporting data, while barrier synchronization requires the transport of only one or two bits but must also implement a reduction or scan operation on these bits.
An additional complication is that there are usually several groups of processes that need to communicate among themselves, necessitating multiple non-overlapping barriers. Consider, for instance, a pipelined architecture in which one set of processes passes data to another set. Each set may have to synchronize internally, independent of the other. Similarly, an IF statement within a FORALL divides a set of processes into two subsets, which may have to synchronize independently. Thus we need
  barrier synchronization for multiple independent sets of processes.
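
Both requirements, a barrier between write and communication and independent barriers per process group, can be illustrated in software with Python's `threading.Barrier`. This is a sketch of the synchronization pattern only, not of the hardware mechanism; the group names, sizes, and the helper `make_group` are hypothetical. Each process writes its own cell, waits at the barrier of its group, and only then reads a neighbour's cell, so the read can never see a stale value. The two groups synchronize without waiting for each other.

```python
import threading

def make_group(name, size, log, lock):
    """Create the threads of one process group with its own private barrier."""
    cells = [None] * size
    barrier = threading.Barrier(size)

    def worker(rank):
        cells[rank] = f"{name}{rank}"          # write own cell
        barrier.wait()                         # all writes of this group are done
        received = cells[(rank + 1) % size]    # "communication": read a neighbour
        with lock:
            log.append((name, rank, received))

    return [threading.Thread(target=worker, args=(rank,)) for rank in range(size)]

log, lock = [], threading.Lock()
threads = make_group("A", 3, log, lock) + make_group("B", 2, log, lock)
for t in threads:
    t.start()
for t in threads:
    t.join()
result = sorted(log)
```

Without the `barrier.wait()` call a process could read its neighbour's cell before it has been written, which is exactly the race condition described above.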
3 Triton/1
The poor programmability of today's parallel machines is a consequence of the fact that the design of these machines has been driven mostly by hardware considerations. Programmability seems to have been a secondary issue, resulting in languages designed specifically for a particular machine model. Such languages do not satisfy the needs of programmers who need to write machine-independent applications. Triton/1 matches most of the recommendations of the previous section. From a general point of view, Triton/1 is characterized by the following properties.
General Architecture: Triton/1 is a SAMD (synchronous/asynchronous instruction streams, multiple data streams) machine: it runs in SIMD mode where strict synchrony is necessary, and it can switch to MIMD mode where concurrent execution of different tasks is beneficial. It is even possible to run a subset of the processors in SIMD mode and the others in MIMD mode. Thus Triton/1 is truly SAMD, i.e., mixed-mode, not just switched-mode. Only a few research prototypes of mixed-mode machines have been built: OPSILA, TRAC, and PASM [1, 6]. Triton/1 provides support for switching rapidly between the two modes and a high-level language to control both modes effectively.
Fast barrier synchronization is supported by special hardware. The synchronization hardware can be used in both operating modes. Synchronization with hardware support overcomes the necessity of coarse-grained parallelism.
Network: We chose the De Bruijn network for Triton/1 because it has several desirable properties (logarithmic diameter, fixed degree, etc.), is cost-effective to build, and can be made to operate extremely fast and reliably. In section 3.3 we present performance figures.
Scalability and balance: Parallel machines should scale in performance by varying the number of processors; furthermore, the performance of the individual components (processor, memory, network, and I/O) should harmonize. Scalability is mainly a property of the network. The most popular networks today, hypercubes and grids, do not scale well: hypercubes are too expensive because they have variable degree, while grids cause high latency because of their large diameter. Triton/1's De Bruijn net has none of these problems and scales well. It is also well matched to the speed of the processors.
IO capabilities I
O must also scale with the
number of processors Few parallel machines today
provide for scalable I
O Triton
 implements a mas
sively parallel I
O architecture one disk per proces
sor For large sets of disks we have extended the
traditional notion of a le to what we call a vector
le Massively parallel I
O also provides the basis for
research in parallel operating systems such as virtu
al memory parallel paging strategies and true multi
user environments Results in these areas are required
to bring parallel machines into widespread and gen
eral purpose use
Architecture of Triton/1
Triton/1 is divided into a frontend and a backend portion. The frontend typically consists of a UNIX workstation with a memory-mapped interface that connects it, via the instruction bus and the control bus, to the backend portion. The backend portion consists of the processing elements, the network, and the I/O system.
The Triton/1 prototype will be built with an Intel  based PC running BSD UNIX as frontend. The prototype will contain    PEs, of which  are supplied with a disk.  of the PEs are provided for computation and  PEs are for hot standby. These PEs can be configured into the network under software control if other PEs fail. The reconfiguration involves changing the PE numbers consistently and recomputing the routing tables in the network processors. The  disks are logically organized in  groups of  disks, where each group contains  data disks and one parity disk. RAID level 3 [4] is used for error handling.
In SIMD mode, the frontend produces the instruction stream and controls the backend portion at instruction level. In MIMD mode, the frontend is responsible for downloading the code and initiating the program. The instruction bus is  bits wide. To decouple frontend and backend, and thereby reduce the time the frontend spends waiting for the backend to become ready or vice versa, the instruction stream is sent through a fifo. The handshake signals necessary to control the instruction stream are part of the control bus.
For debugging, it is a good idea to have direct access from the frontend to all parts of the machine, especially to the main memory distributed among the processing elements. As a direct consequence of the common address space of Triton/1, this is possible via the so-called analyze mode. To support the analyze mode, the control bus includes  address lines and several dedicated control signals; the instruction bus is used for data transport. While in analyze mode, all PEs release their local busses to enable frontend access via direct memory access.
The processing elements are designed as universal computing elements capable of performing computation as well as service functions. Each PE consists of a Motorola MC  microprocessor, a memory management unit MC , a numeric coprocessor MC  or MC ,  MBytes of main memory, a SCSI interface, and a network processor. Figure  gives an overview. No extra controllers for mass storage access or any other I/O are necessary.
The network of Triton/1 is built up of the network processors included in the PEs, the interconnection lines (realized with flat cables), and fifo buffers for intermediate buffering of data packets. The network is able to route data packets from their source to their destination without interfering with the PEs. Non-interference permits latency hiding techniques to be applied. Again for reasons of decoupling, the interface between a PE and its respective network processor is implemented with fifos.
Parity checking of main memory, network links, and mass storage implements error detection. Periodic sig
Detailed discussion of selected hardware aspects
To get a better idea of the architectural features of Triton/1, it is necessary to look into some implementation details of the hardware.
The implementation of the instruction bus and the control bus is quite naturally done by a hierarchy of bus drivers for signals from the frontend to the backend. For the opposite direction, global wired-or lines are emulated by explicitly OR-combining the signals from the individual PEs. In SIMD mode, all PEs execute the same instruction at the same time (or idle), including reading the instruction at the same time. The reading of instructions by the PEs is controlled via three control signals. Instruction strobe signals to the frontend that all PEs currently listening to the instruction stream are ready to read an instruction. The frontend then asserts the instruction and answers with instruction transfer acknowledge. If all PEs currently listening have accepted the instruction, instruction fetch done is asserted, which signals the end of an instruction transfer. The three-way handshake introduces a non-negligible amount of delay, because the signals traverse the complete bus hierarchy several times. To reduce that delay, we introduced an instruction buffer at each driver level in the bus hierarchy, reducing the delay for the instruction fetch by two clock cycles in the normal case. Thus the handshake described above is executed between every two hierarchy levels rather than between the frontend and the PEs.
As mentioned above, the global address space consists of a  bit address. The least significant  bits are used to select the memory and the memory-mapped I/O in the PEs. Another  bits are used to identify the PE to be accessed, and one bit is used to distinguish frontend and backend. The identification of the PEs is twofold. Each PE has a hardware identification, which is selected by a switch setting. The hardware identification is used to select the PE in the analyze mode for debugging. Additionally, each PE has a software identification, which is used while computing. The software id is initially set to the same value as the hardware id but can be altered for hardware error handling. However, implementing a concept with a  bit address space does not automatically imply computing with  bit addresses all the time. In the majority of cases, computing with  bit addresses suffices, reducing the time spent on address calculation.
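
The three-field layout of a global address (frontend/backend bit, PE number, offset within the PE) can be sketched with a few shifts and masks. The field widths below are placeholders chosen for the example, not the machine's actual values, and the encoding of the frontend/backend bit as well as all function names are assumptions.

```python
OFFSET_BITS = 24   # assumption: bits selecting memory / memory-mapped I/O in a PE
PE_BITS = 8        # assumption: bits identifying the PE

def pack(backend, pe, offset):
    """Build a global address from its three fields."""
    assert 0 <= pe < (1 << PE_BITS) and 0 <= offset < (1 << OFFSET_BITS)
    return (int(backend) << (PE_BITS + OFFSET_BITS)) | (pe << OFFSET_BITS) | offset

def unpack(addr):
    """Split a global address into (backend, pe, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    pe = (addr >> OFFSET_BITS) & ((1 << PE_BITS) - 1)
    backend = bool((addr >> (PE_BITS + OFFSET_BITS)) & 1)
    return backend, pe, offset

def is_local(addr, my_software_id):
    """An access is local if it targets the backend PE with this software id."""
    backend, pe, _ = unpack(addr)
    return backend and pe == my_software_id

addr = pack(True, 5, 0x1234)
fields = unpack(addr)
```

Code that is known to touch only local memory can work with the offset field alone, which is the short-address case mentioned above.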
Another point of interest is mode switching. Though the MIMD mode is more natural to the processor, the system is started in SIMD mode. This is done to save additional hardware for the startup code. In SIMD mode, the function codes of the processor are used to determine whether the processor intends a data or a program access. Accordingly, the processor bus is connected to the local memory or to the instruction bus, respectively. If a PE is selected not to execute an instruction, the local signal listen to instruction stream is turned off and the processor of that PE is not notified of instructions, except if the instruction is unconditional. The value of the processor's program counter is completely ignored in SIMD mode.
In order to switch to MIMD mode, the program to be executed has to be downloaded to the memory of the PEs. This is done via the instruction stream in SIMD mode. Thus the distribution of code is, in contrast to many other MIMD machines, done in a time proportional to the length of the code, independent of the size of the machine. The switch from SIMD to MIMD mode is performed by two instructions. With the first instruction, a JMP, the program counter is set to the location of the program to be executed in MIMD mode. With the second instruction, the SIMD request bit in the command register local to the PE is deactivated. The PE then switches to MIMD mode at the end of the current cycle and commences execution of the local code without delay. To switch from MIMD to SIMD mode, the SIMD request bit in the local command register is simply activated, which causes the PE to switch to SIMD mode at the end of the current cycle. The next instruction is then expected from the instruction stream.
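
The switching sequence can be traced with a small state machine. The model below is a sketch under stated assumptions: apart from JMP and the SIMD request bit, which the text names, the instruction names (`STORE_CODE`, `SET_SIMD_REQUEST`, `NOP`), the class `PE`, and the one-instruction-per-cycle timing are hypothetical. In SIMD mode the instruction comes from the stream and the program counter is ignored; in MIMD mode it is fetched from local memory; the mode follows the request bit at the end of each cycle.

```python
from collections import deque

class PE:
    def __init__(self):
        self.simd_request = True      # the system is started in SIMD mode
        self.mode = "SIMD"
        self.pc = 0
        self.local_code = {}          # local program memory: address -> instruction
        self.trace = []

    def execute(self, instr):
        op, arg = instr
        if op == "STORE_CODE":        # download one instruction to local memory
            address, code = arg
            self.local_code[address] = code
        elif op == "JMP":
            self.pc = arg
        elif op == "SET_SIMD_REQUEST":
            self.simd_request = arg
        self.trace.append((self.mode, op))

    def cycle(self, stream):
        if self.mode == "SIMD":
            instr = stream.popleft()              # PC is ignored in SIMD mode
        else:
            instr = self.local_code[self.pc]      # fetch from local memory
            self.pc += 1
        self.execute(instr)
        # The mode switch takes effect at the end of the current cycle.
        self.mode = "SIMD" if self.simd_request else "MIMD"

stream = deque([
    ("STORE_CODE", (100, ("NOP", None))),
    ("STORE_CODE", (101, ("SET_SIMD_REQUEST", True))),
    ("JMP", 100),                     # first switching instruction
    ("SET_SIMD_REQUEST", False),      # second switching instruction
    ("NOP", None),                    # consumed after the PE returns to SIMD mode
])
pe = PE()
for _ in range(7):
    pe.cycle(stream)
```

The trace shows four instructions executed in SIMD mode (two downloads and the two switching instructions), two executed from local memory in MIMD mode, and the return to the instruction stream.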
While some PEs are executing in MIMD mode, the rest of the PEs may execute in SIMD mode. This is achieved by activating the instruction transfer handshake in the case of MIMD operation. If there is no PE left to execute in SIMD mode but some instructions still remain in the instruction fifo, the handshake signals automatically empty the buffer.
Data transfer is an important point in every parallel computer. There are several different data paths to consider. The most important is the data transport between the PEs. That task is performed by the network, which is described later in detail. Another important point is the transport of data from the frontend to the backend and vice versa. There are different possibilities for each direction. To transport data from the frontend to the backend, the easiest way is to send the data as immediate data via the instruction stream in SIMD mode. With that possibility, any subset of the PEs can be the destination of the data. Unfortunately, only unidirectional access is possible. The second possibility of transferring data is direct memory access within the analyze mode. Here data can be transferred in both directions. The drawback of the analyze mode is that no computation can take place and no more than one PE can be accessed concurrently. The third possibility of data transport is via the network. There is one dedicated network node which is connected to the frontend. This is especially useful when more than a few bytes have to be transported from different PEs to the frontend, e.g., picture data. Another advantage of a network node included in the frontend is that computation can commence while data is transported.
Fast barrier synchronization in MIMD mode
An important problem is the realization of barrier synchronization in the case that several different sets of processes are distributed randomly over the PEs. A set or group of processes is defined as executing the same part of code (e.g., a procedure) and therefore sharing common variables. If there is only one set of processes requesting synchronization, the barrier synchronization is easily done by using a global wired-or line. Each PE sets its ready bit on the line to true as soon as it reaches the synchronization point. Approximately one clock cycle after the last PE sets its bit, the frontend is able to recognize the result, and the PEs are notified by the result line.
The problem with using a single global wired-or line for synchronization in MIMD mode is that a global wired-or line cannot be partitioned randomly. In the general case, more than one group of processes exists. Each of these groups shares common variables to which accesses have to be regulated. In most cases the groups of processes are distributed randomly over the set of PEs, so that they cannot be partitioned by partitioning the backend.
To enable the use of hardware-supported synchronization with several groups of processes running in MIMD mode, the global wired-or line is administered by the frontend as a synchronization resource in the following way. Each group of processes is identified by a unique process group number. Initially the synchronization line is not used, and each PE is allowed to request it on behalf of a group. The request is performed by the first PE reaching a barrier. That PE signals the frontend via the service request line and sends the group identification via the analyze circuits. If more than one PE reaches a barrier at once, the analyze circuits will select one randomly. The frontend then knows which group demands the synchronization line. Next, the frontend interrupts all PEs and forces them into SIMD mode to perform a barrier setup. The PEs not belonging to the requesting group are prohibited from requesting the sync line themselves. They also turn on their ready bits. The PEs belonging to the requesting group set their ready bit to true if they have already reached the sync point, otherwise to false.
After this setup phase, the PEs return to MIMD mode. All PEs continue computation independent of their group membership. As soon as the last ready bit is turned on, the group owning the global wired-or line synchronizes and then releases the sync line. The frontend then releases the request prohibition in order to enable other groups to synchronize.
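
The administration of the single synchronization line can be modelled in software as follows. This sketch covers only the bookkeeping, not the signalling or the temporary switch to SIMD mode; the class `SyncLine` and its methods are hypothetical names, and where the analyze circuits would pick a requesting PE at random the model deterministically takes the lowest-numbered waiting PE. The line "fires" when every ready bit is on: non-members switch their bit on during setup, members switch it on when they reach the barrier.

```python
class SyncLine:
    """Frontend-side model of one global synchronization line."""

    def __init__(self, group_of_pe):
        self.group_of_pe = dict(group_of_pe)   # PE number -> process group number
        self.owner = None                      # group currently owning the line
        self.ready = {}                        # ready bit of every PE
        self.at_barrier = set()                # PEs waiting at a barrier

    def _setup(self, group):
        """Barrier setup: non-members turn their bit on, members report state."""
        self.owner = group
        for pe, g in self.group_of_pe.items():
            self.ready[pe] = (g != group) or (pe in self.at_barrier)

    def reach_barrier(self, pe):
        """PE arrives at a barrier; returns the groups that synchronized."""
        self.at_barrier.add(pe)
        if self.owner == self.group_of_pe[pe]:
            self.ready[pe] = True
        released = []
        while True:
            if self.owner is None:
                if not self.at_barrier:
                    break
                # Stand-in for the random choice made by the analyze circuits.
                self._setup(self.group_of_pe[min(self.at_barrier)])
            if not all(self.ready.values()):
                break
            released.append(self.owner)        # last ready bit is on: group syncs
            members = {p for p, g in self.group_of_pe.items() if g == self.owner}
            self.at_barrier -= members
            self.owner = None                  # line released, requests allowed again
        return released

line = SyncLine({0: 1, 1: 1, 2: 2, 3: 2})
events = [line.reach_barrier(pe) for pe in (0, 2, 3, 1)]
```

In this run group 2 is complete before group 1 but has to wait, because group 1 requested the line first; when the last member of group 1 arrives, both groups synchronize in turn.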
This discussion glossed over the difficulties that arise if a PE virtualizes, i.e., executes several threads or processes which may belong to different groups. In this case the ready bits have to be virtualized as well. The details depend on the virtualization strat
The Triton/1 network is based on the generalized De Bruijn net [2, 3]. The number N of nodes in the network is not limited to powers of two. The maximum diameter is $\lceil \log_d N \rceil$. The average diameter is well below $\log_d N$ and in practice quite close to the theoretical lower bound, the average diameter of directed Moore graphs.
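
The diameter bound can be checked numerically. The sketch below assumes the usual definition of the generalized De Bruijn digraph, in which node i has links to the nodes (d·i + r) mod N for r = 0, ..., d-1; the function names are hypothetical. It measures the maximum and average distance by breadth-first search from every node.

```python
from collections import deque

def successors(i, n, d=2):
    """Out-neighbours of node i in the generalized De Bruijn digraph."""
    return [(d * i + r) % n for r in range(d)]

def distances_from(src, n, d=2):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in successors(u, n, d):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def ceil_log(n, d):
    """Smallest k with d**k >= n, computed without floating point."""
    k, power = 0, 1
    while power < n:
        power *= d
        k += 1
    return k

def diameter_and_average(n, d=2):
    """Maximum and average distance over all ordered pairs of distinct nodes."""
    worst, total = 0, 0
    for s in range(n):
        dist = distances_from(s, n, d)
        assert len(dist) == n          # every node is reachable
        worst = max(worst, max(dist.values()))
        total += sum(dist.values())
    return worst, total / (n * (n - 1))

worst8, avg8 = diameter_and_average(8)
```

Node counts that are not powers of two, such as N = 6 or N = 100, can be passed to `diameter_and_average` in the same way.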
In our implementation we use degree d = 2, which makes our net a perfect shuffle (see figure 3). In comparison with other frequently used networks, this design has the benefit of a constant degree per node and a small average diameter. Data transport is done via a table-based, self-routing packet switching method which allows wormhole routing and load-dependent detouring of packets. Every node is equipped with its own routing table and with four buffers: two for intermediate storage of data packets coming from other nodes and two to communicate with its associated processing element. The buffering temporally decouples
Figure 3: De Bruijn net with  nodes
The communications processor is able to route the packets without interfering with the local processing element. Optimal routes are stored in a routing table per communications processor. The network can thus transport data in parallel with the operation of the processing elements. This feature can be used by the compiler to overlap communication and processing time by rearranging code.
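
A per-node routing table of this kind can be derived by breadth-first search. The sketch below is an assumption-laden illustration, not the table format of the communications processor: it uses the generalized De Bruijn link rule i → (d·i + r) mod N, stores for every destination only the first hop of one shortest path, and does not model load-dependent detouring. All names are hypothetical.

```python
from collections import deque

def successors(i, n, d=2):
    """Out-neighbours of node i in the generalized De Bruijn digraph."""
    return [(d * i + r) % n for r in range(d)]

def routing_table(node, n, d=2):
    """Map every other destination to the first hop of a shortest path."""
    first_hop = {}
    seen = {node}
    queue = deque()
    for v in successors(node, n, d):
        if v not in seen:
            seen.add(v)
            first_hop[v] = v
            queue.append(v)
    while queue:
        u = queue.popleft()
        for v in successors(u, n, d):
            if v not in seen:
                seen.add(v)
                first_hop[v] = first_hop[u]   # inherit the first hop of the parent
                queue.append(v)
    return first_hop

def route(src, dst, tables):
    """Forward a packet hop by hop using only the local table of each node."""
    path = [src]
    while path[-1] != dst:
        path.append(tables[path[-1]][dst])
    return path

n = 8
tables = {node: routing_table(node, n) for node in range(n)}
path = route(0, 7, tables)
```

Because each node consults only its own table, packets are forwarded without involving the processing element, and the tables can be recomputed when spare PEs are configured into the network.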
In order to analyze the behavior of the network, we built a simulator based on the measured performance of a single communications processor. We simulated the overall performance of the network in various modes. The number of nodes ranged from  to  . Figure 4 presents the results of a series of experiments with a random communication pattern (random H permutations). Both the sender and the receiver were chosen randomly, with the restriction that the number of data packets to be transported is the same as the number of nodes in the network. The simulation shows that the network scales well: the delay introduced by the network
Figure 4: Performance on random communication (plotted: transfer time in network cycles, maximum diameter, average diameter)
The robustness against overload is surprisingly good. Even if all processing elements send a great number of packets simultaneously, the overall throughput of the network does not decrease. Irregular permutations are performed especially fast. All "hard" patterns known from the literature, e.g., transposition of a matrix, butterfly, and bit reversal, perform well too. The delivery time for those is equal to or lower than the delivery time for random permutations.
4 Conclusion
The integrated approach of designing language, compiler, and hardware together has led to a parallel architecture that supports higher-level languages adequately. Fast barrier synchronization for multiple process groups, SAMD mode, shared address space, and a fast, independently operating network should make parallel computers run efficiently even when programmed in a machine-independent fashion.
Status and schedule of Triton/1
The fully functional prototype of a PE board was completed in October  . The individual components (communication processor, processing element, and control processor interface) are tested and are running according to specifications. The manufacturing of the printed circuit boards is in progress. The final assembly of Triton/1 will be completed early in  .
References
[1] M. Auguin and F. Boeri. The OPSILA computer. In M. Consard, editor, Parallel Languages and Architectures, pages  . Elsevier Science Publishers, Holland,  .
[2] N. G. De Bruijn. A combinatorial problem. In Proc. of the Sect. of Science, Akademie van Wetenschappen, pages  , Amsterdam, June  .
[3] Makoto Imase and Masaki Itoh. Design to minimize diameter on building-block network. IEEE Transactions on Computers,  , June  .
[4] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proc. of the  ACM SIGMOD Conference on Management of Data, pages  , Chicago,  June  .
[5] Michael Philippsen, Walter F. Tichy, and Christian G. Herter. Modula-2* and its compilation. In First International Conference of the Austrian Center for Parallel Computation, pages  , Salzburg, Austria, September  . Springer Verlag, Lecture Notes in Computer Science  .
[6] H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J. Davis. An overview of the PASM parallel processing system. In D. D. Gajski, V. M. Milutinovic, H. J. Siegel, and B. P. Furht, editors, Computer Architecture, pages  . IEEE Computer Society Press, Washington, DC,  .
[7] Walter F. Tichy and Christian G. Herter. Modula-2*: An Extension of Modula-2 for Highly Parallel, Portable Programs. Technical Report No.  , Interner Bericht, University of Karlsruhe, Department of Informatics, January  .