HPP : a high performance PRAM by Formella, Arno et al.
HPP
A HIGH PERFORMANCE PRAM
Arno Formella, Jörg Keller, Thomas Walle

Fachbereich  Informatik
Universität des Saarlandes
Postfach 
 Saarbrücken
Germany, SFB  D
HPP A High Performance PRAM
Arno Formella Jorg Keller

Thomas Walle
Universitat des Saarlandes FB 	 Informatik
Postfach 

 	 Saarbrucken Germany
Abstract
We present a fast shared memory multiprocessor with uniform memory access time. A
first prototype, SBPRAM, is running with  processors; a  processor version is under
construction. A second implementation, HPP, using latest VLSI technology and optical
links, shall run at a speed of  MHz. To achieve this speed, we first investigate the redesign
of ASICs and network links. We then balance processor speed and memory bandwidth by
investigating the relation between local computation and global memory access in several
benchmark applications. On numerical codes such as Linpack,  and  GFlops shall be
possible with  and  processors, respectively, thus approaching the processor performance
of the Intel Paragon XPS. As non-numerical codes we consider circuit simulation and raytracing.
We achieve speedups over a one-processor SGI Challenge of  and  for  processors
and  and  for  processors.
Keywords: shared memory multiprocessor, parallel random access machine, multithreading,
latency hiding, irregular applications
 Introduction
Parallel systems are used for the solution of computationally intensive problems. While
numerical problems are mostly regular and thus suited for distributed memory machines
(DMMs), other problems are often irregular. Examples are simulations or raytracing. For
those problems, shared memory multiprocessors (SMMs) are more suitable, as they do not
require partitioning in order to minimize replication and avoid expensive message passing.
Current massively parallel SMMs such as KSR or Stanford DASH are NUMA (non-uniform
memory access) machines  . The large variation in memory access time requires
careful tuning of applications to obtain the expected performance. Tuning often requires
partitioning to exploit locality and to avoid false sharing of cached pages, and thus leads to
problems similar to programming DMMs (see  p.  ).
These problems are avoided in a UMA (uniform memory access) machine, in theoretical
computer science also known as a PRAM (parallel random access machine). Bus-based
UMA machines have been built for small numbers of processors, e.g. Sequent Symmetry.
There are several approaches to simulate a PRAM with many processors on more realistic
multiprocessors (see  for a survey). One of these simulations, Ranade's Fluent Machine
 , has been turned into a parallel architecture called SBPRAM  . Key concepts are
avoiding hot spots by universal hashing, implementing concurrent access by combining, and
hiding latency by synchronous multithreading with hardware support for multiple contexts.
Furthermore, parallel prefix computations without serialization are supported.

(Footnote: Supported by a DFG Habilitation Fellowship.)

We investigate how the SBPRAM can be made faster in order to be competitive to the
speed of DMMs	 To do this we 
rst consider hardware improvements i	e	 the redesign of
network chips and the use of optical links	 We then investigate the inuence of allowing mul
tiple local instructions per global instruction	 We present the necessary hardware require
ments and explore to which extent applications can exploit this feature without expensive
compiler optimizations	 We use both numerical and irregular nonnumerical benchmarks	
Our main result is that the improved PRAM, called HPP, runs at a speed of  MHz
and achieves a tenfold performance improvement over the SBPRAM for a  processor
machine and about  fold for a  processor machine. In inner loops, numerical codes
achieve a performance of  to  GFlops on a  processor HPP, and  to  GFlops on a
 processor HPP. Thus, for codes similar to Linpack with matrices of size  , we
obtain a performance of   MFlops per processor, approaching the performance of the
Intel Paragon XPS with   MFlops per processor. For circuit simulations, speedups of
 over a SUN Sparc  or a one-processor SGI Challenge are possible with a  processor
HPP if several input vectors are simulated simultaneously. For raytracing, speedups of
 over a one-processor SGI Challenge are possible with a  processor HPP. For 
processors, these speedups are  and  , respectively.
A project with similar goals is the Tera computer  , now marketed as Tera MTA.
Tera directly targeted leading edge technology, e.g. a GaAs processor with a  ns cycle
time should be used. To our knowledge, there is no prototype available yet. Tera also uses
multithreading and interleaving of global and local instructions, and also provides hardware
support for multiple contexts. However, the Tera processor simulates several instructions
of each thread before switching and thus loses synchronous behavior. Also, the Tera
machine does not support combining, and its fetch&add primitive leads to serialization.
Furthermore, Tera does not have local memory at the processors; thus the fraction of global
instructions will be much higher than in HPP.
The remainder of the article is organized as follows. In section  we briefly review the
SBPRAM architecture and the prototype's technology. In section  we investigate how
processors, network nodes and network links can be improved. In section  we investigate
our benchmark applications and show which performance gain is possible by careful
instruction scheduling. In section  we conclude and present further directions of research.
 SBPRAM
The SBPRAM  is a massively parallel multiprocessor architecture with p processors
providing users with a virtual shared memory	 It is based on Ranades Fluent Machine 	

(Footnote: In  p.  , an Intel Paragon XPS with   processors is reported to obtain a
performance of  GFlops on Linpack with a matrix size of  . This machine was the fastest
among all machines listed.)

[Figure  : SBPRAM Architecture. Shown: host; processors with local memory and hard disks;
butterfly network; global memory modules.]
 Architecture
As in all massively parallel shared memory machines, the global memory of the SBPRAM is
physically distributed among p memory modules. Memory requests are transmitted between
processors and memory modules via an interconnection network (see Fig.  ).
Machines such as KSR  or Stanford DASH  realize the shared memory by caching
remote data and keeping the caches coherent (cf. e.g. KSR's Allcache protocol). This leads
to large variations in memory access time, which makes performance prediction and tuning
difficult  .
In contrast the SBPRAM uses universal hashing to distribute addresses among the
memory modules in a random fashion every memory access is remote	 The hashing avoids
module congestion and leads to a large but uniform memory access time	 The latency
to access global memory is hidden by using multithreaded processors which simulate v
virtual processors in a pipeline	 Each virtual processor has its own register set thus context
switching does not cause any overhead	
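The address hashing described above can be illustrated with a small sketch. The text does not spell out the hash function at this point, so the code assumes a simple multiplicative hash on 32-bit addresses; the names hash_address and module_of, the multiplier and the number of modules are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed hash, not taken from the paper: multiply by an odd constant
   modulo 2^32. For an odd multiplier this map is a bijection, so two
   different addresses never hash to the same value. */
static uint32_t hash_address(uint32_t addr, uint32_t a)
{
    return (uint32_t)(addr * (a | 1u));   /* a | 1u forces an odd multiplier */
}

/* Pick the memory module from the top log2_modules bits of the hashed
   address (hypothetical layout with a power-of-two module count). */
static uint32_t module_of(uint32_t addr, uint32_t a, unsigned log2_modules)
{
    assert(log2_modules >= 1u && log2_modules <= 31u);
    return hash_address(addr, a) >> (32u - log2_modules);
}
```

With such a mapping, consecutive addresses are scattered over the modules, which is the property the text relies on to avoid module congestion.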
The interconnection network is a butterfly network. Network latency is c log p cycles;
hence v can be set to that value.
Concurrent access of multiple processors to the same memory cell is handled by combining.
The requests of each physical processor are sorted according to their hashed addresses. The
sorted order of requests is maintained in each network node by merging the incoming streams
of requests. Requests to one cell must therefore meet and can be combined. Answers are
duplicated on the way back.
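The merging step can be sketched in a few lines of C. This is a simplified software model under stated assumptions, not the network chip's logic: both inputs are complete arrays already sorted by hashed address, and the type Request, its count field and the function merge_combine are hypothetical names introduced for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;    /* hashed address of the request */
    unsigned count;   /* number of original requests this one stands for */
} Request;

/* Merge two streams sorted by addr into out (capacity na + nb).
   Requests with equal address meet during the merge and are combined.
   Returns the number of requests written to out. */
static size_t merge_combine(const Request *a, size_t na,
                            const Request *b, size_t nb, Request *out)
{
    size_t i = 0, j = 0, n = 0;
    assert(out != NULL);
    while (i < na || j < nb) {
        Request next;
        if (j >= nb || (i < na && a[i].addr <= b[j].addr))
            next = a[i++];                    /* take from the first stream */
        else
            next = b[j++];                    /* take from the second stream */
        if (n > 0 && out[n - 1].addr == next.addr)
            out[n - 1].count += next.count;   /* combine with previous request */
        else
            out[n++] = next;                  /* forward as a new request */
    }
    return n;
}
```

Because the output is again sorted, the same step can be applied in the next network stage, which is why requests to one cell cannot pass each other.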
Computation of parallel prefix sums is implemented by the same mechanism. The
network nodes can perform simple integer arithmetic.
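How combining can yield prefix sums without serialization may be sketched as follows. The code models the combining scheme for fetch-and-add style requests as it is commonly described in the literature; the text does not give the node's actual implementation, so the names CombineRecord, combine_forward, split_answer and fetch_add are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* State a network node keeps for one combined pair of requests. */
typedef struct {
    int left_inc;   /* increment of the request that is served first */
} CombineRecord;

/* Forward path: two increments for the same cell travel on as one sum. */
static int combine_forward(int left_inc, int right_inc, CombineRecord *rec)
{
    assert(rec != NULL);
    rec->left_inc = left_inc;
    return left_inc + right_inc;
}

/* Memory module: return the old value and add the combined increment. */
static int fetch_add(int *cell, int inc)
{
    int old = *cell;
    *cell += inc;
    return old;
}

/* Return path: the answer is duplicated, the second copy is offset by
   the remembered increment, so both get their own prefix value. */
static void split_answer(int answer, const CombineRecord *rec,
                         int *left_ans, int *right_ans)
{
    *left_ans = answer;
    *right_ans = answer + rec->left_inc;
}
```

For a cell holding 10 and increments 3 and 4, the memory sees a single request with increment 7, the first requester receives 10 and the second receives 13, exactly as if the two operations had been executed one after the other.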
 Prototype
The SBPRAM prototype  consists of p   physical processors and the same number
of memory modules

	 Each physical processor implements v   virtual processors which
are scheduled roundrobin for every instruction	 Load instructions to global memory are
delayed i	e	 the result is only available in the next but one instruction	 The physical
processor is realized as an ASIC	 The register sets of the virtual processors are held ochip

Currently a processor version is running the complete machine is still under construction

in a fast static RAM	 The processor runs at a speed of MHz which is determined by the
speed of the interconnection network as we will see	
The sorting device needed to inject requests into the network is realized as a linear
sorting array in a separate ASIC. It receives requests at processor speed and sends requests
at network speed. A request consists of a  bit address, a  bit data word and  mode
bits. An answer to a request is a  bit data word.
The network speed is  MHz. This frequency is the minimum of
(a) the critical path in the network chip, which allows  MHz, and (b) the speed of chip
I/O, which allows  MHz. A network chip implements a routing switch with two inputs
and two outputs. Due to pin restrictions, a request must be transmitted or received in two
cycles. Selection starts after having received the first part of a request and takes two cycles
as well.
Network links split requests as well. Each network link has four control signals in
each direction and thus consists of          bits in forward direction and
     bits in backward direction.
As the network needs two cycles to handle a request, a processor utilizing the network at
its peak bandwidth can have a speed of at most  MHz. We assume here that a processor is
able to access the global memory via the network in each instruction. However, a utilization
of  is not possible because conflicts can occur within the network. To keep the protocol
between processors, sorting devices and network nodes simple, we chose cycle times that
are multiples of each other. Hence  MHz was the maximum frequency for the processor,
utilizing half of the network's peak bandwidth.
The network consists of  stages, each with  network chips. We implement on a
printed circuit board either a  stage butterfly network or two  stage butterfly networks.
Thus we obtain three levels, each consisting of  boards. The wiring between boards is
done by flat cables. A link is realized by one cable consisting of  wires,  for signals
and  for ground.
 Technological Improvements
The speed of the SBPRAM processor MHz is quite slow	 This speed is determined by
the speed of the SBPRAM network	 The network speed is limited by three factors the
processing speed of the network chip the IO capacity of the network chip and the capacity
of transmissions between network boards	
To make the SBPRAM faster we explore how these limitations change by the use of
 technology and how processors and memory modules can be adapted to such a faster
network	 We investigate how fast we can clock network chips how fast we can transmit
and receive requests with network chips and how fast the network links can be	
 Network chips
Our current network chip is fabricated in Thesys  metal layer  µm HCMOS technology
 . It uses about  k gates and has  signal pins. The critical path is caused by the
 bit integer ALU. A worst case analysis determined the maximum clock frequency to be
 MHz. To estimate today's maximum frequency, we compare different technology levels
from some manufacturers. Referring to Table  , we estimate the factor by which the nearly
available  µm technology will run faster.

manufacturer    technology    µm    typ. delay (ps)
Motorola        HDC
Motorola        HC
Motorola        MC
Thesys          THA

Table  : Typical Delays of Nand Gates with Fanout  for various Technologies.
Motorolas MC technology is  times faster than our current	 From the three values
of Motorola technologies we extrapolate the m technology to be about  faster than
the MC technology	 Thus an overall speedup of about  will be possible i	e	 a todays
routing chip would run with an internal clock speed of     MHz	 The cost of
the chips would not increase dramatically while adhering to standard CMOS technology in
contrast to even faster technologies such as ECL GaAs or full custom design	
The second point that is critical for the chip speed are input and output times	 If we
make chip IO independent of the inner computation as shown in Fig	  we get the following
values for input and output times	
[Figure  : Input and output circuit. Labels: ASIC, PLL, CLK, input pad, routing logic,
output buffer, D.]
The access time to the external register is  ns, the typical delay of the input pad is  ns,
and the internal register setup time is  ns. If we use a PLL for the internal clock, only the
delay of the clock input pad has to be added. This totals to  ns. Allowing for some
smaller delays due to board wires, we can say that the inputs can be driven with a frequency
of  MHz.
For outputs, the internal register access time is  ns and the output driver delay is  ns if
we assume a capacitive load of  pF. This can be achieved if the external register is close
to the network chip. The external register has a maximum setup time of  ns. This totals
to  ns, also allowing for a frequency of  MHz. As these times should not increase for
better technologies, the timing of the chip I/O is not critical.
For our current ASIC, the number of signal pins was a limiting factor due to the cost of a
bigger package. Thus in forward direction we multiplexed inputs and outputs. Because the
external multiplexer circuit works close to its limits, we cannot apply this trick at higher
clock rates. But the implementation of a routing switch needs only  links with  bit each,
i.e.  signal pins in total. If we assume that we must add one power pin for every
three signal pins, about  pins will be needed, which is possible with today's ASICs.
 Links between boards
The links between two network boards, respectively between processor or memory boards
and network boards, reach a maximum length of about  m. Even if we assume higher
integration, the length will not be less than  m. Thus the transmission with flat cables will
be reasonable only up to a frequency of  MHz. To achieve the same bandwidth as the chips,
three flits of one packet have to be transmitted staggered. First, this leads to a complicated
external logic. Three different clocks have to be generated for the registers. Furthermore, a
multiplexer circuit has to be added. Both implementations, with multiplexers as well as with
drivers, have decisive disadvantages. The ICs as well as the additional connectors need an
enormous amount of area. Second, the additional logic increases the external propagation
delay (see Fig.  ). Third, one has to employ three cables with  wires each, which leads
to mechanical problems. Thus flat cables do not seem suitable.
As an alternative approach, we investigate the transmission via optical links. Actually, a
transceiver circuit with  GBit/s bandwidth is available   . Up to a frequency of  MHz,
a word of  bits can be transmitted in parallel. The area needed on each side is about
 mm² (see the footnote on package dimensions). Here, too, we have to apply two multiplexed
cables for one link. But in contrast to flat cables, we can choose very fast and very highly
integrated multiplexer/register ICs, which we use in our current machine, too. If we generate
an inverted clock, the external circuitry can be kept simple (cf. Fig.  ).
[Figure  : Two multiplexed optical channels with  wires each. Labels: network chip,
transceivers (transcv), lasers, diodes.]
If we take into consideration that optical cables with multiple optical wires are available,
it seems possible to achieve the desired bandwidth with optical links, with the same logical
behavior of the network as with flat cables.
 Processor
If we run the network at  MHz and need two cycles to handle a request, the processor
must be able to run at    MHz. This is not a limitation for internal speed.

(Footnote: package dimensions HDMP   mm x   mm, LSC  mm x  mm.)

The processors IO needs are as follows in every cycle it does an instruction fetch
i	e	 it outputs a  bit address and reads a  bit instruction	 In addition it might send
a request  bit and receive an answer to a request  bit	 This is not a limitation
as it requires  signal pins at MHz a requirement that is met by the network chip	
Commercial processors allow even higher rates e	g	 the DEC Alpha has a  bit data bus
and  bit address bus at a speed of MHz 	
The processor can run even faster by using the following observation. In an application,
the total number of store instructions is not larger than the total number of load instructions,
provided that the input is read from and the output is written to a disk. The network
provides enough bandwidth to send two store requests in two instructions, each request
consisting of two parts. On average, however, a processor will only send a load request and
a store request, consisting of only three parts together. Hence the bandwidth will suffice
on average if the processor runs faster by a factor of  , achieving a speed of  MHz.
We call this frequency the request rate. The time between two requests is called a request slot.
As load instructions are delayed, answers to loads are available after two request slots.
Our assumption only holds if there are no bursts of store instructions. The only
case where we observed those in our benchmarks (see next section) were sequences of push
instructions at the entry of functions. These pushes, however, are handled in the processors'
local memories and do not need network access. Even if short bursts occur, these would utilize
     of the network bandwidth. This can be tolerated if it does not last too
long.
 Commercial processors
If we switch to the technology of commercial processors, it should be possible to implement
the SBPRAM processor in a single chip. First, it should be possible to put the register sets
on-chip. As we have v =  register sets, each with  registers of  bits, this requires an
 kByte dual-ported RAM. On-chip memories of this size are possible, e.g. the DEC Alpha
has first level data and instruction caches on-chip, each with  kByte.
Second the sorting device can be implemented on the processor chip	 This would not
even increase the pin count	 The speed of the sorting algorithm is determined by a critical
path through an ALU	 Thus the speed can be similar to the speed of the network chip	
Third we can run the processor faster by a factor r where r is an integer	 We must
ensure that in a request slot which now contains r instructions only one instruction accesses
the global memory	 Thus the request rate is not altered	 Furthermore a delayed load by
r instructions must be tolerated because the answer to a load request is available after
two request slots	 Finally bursts of load instructions must not happen	 Note that this
improvement will only work if the fraction of global instructions does not exceed r	
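The condition of at most one global instruction per request slot can be checked mechanically. The following sketch is an illustration under assumptions: the instruction stream of one virtual processor is given as an array of flags, slots are aligned groups of r consecutive instructions, and the function name schedule_fits is hypothetical. It does not model the delayed availability of load results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* is_global[k] is true if instruction k accesses global memory.
   Returns true if no request slot of r consecutive instructions
   contains more than one global instruction. */
static bool schedule_fits(const bool *is_global, size_t n, size_t r)
{
    assert(r > 0);
    for (size_t start = 0; start < n; start += r) {
        size_t globals = 0;
        for (size_t k = start; k < n && k < start + r; k++) {
            if (is_global[k])
                globals++;
        }
        if (globals > 1)
            return false;   /* two global accesses would share one slot */
    }
    return true;
}
```

A stream that passes this check has a fraction of global instructions of at most 1/r; the converse does not hold, because a low overall fraction can still hide a local burst of global instructions.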
The benchmarks of the next section suggest that a value of r =  is possible on a
range of applications. This pushes processor speed to     MHz, the speed of
the network. The internal processor speed is still below the actual numbers for commercial
microprocessors. As we have  pipeline stages because of the virtual processors, even
complex floating point operations are as fast as integer operations. However, we have to
take care of chip I/O and instruction memory.
Chip IO can be brought down to a speed of   MHz by doubling all busses
and the use of alternate busses for virtual processors with odd and even numbers respec

tively	 For instructions which now must be fetched every  nanoseconds we will use an
onchip instruction cache and a second level cache ochip	 If the virtual processors run
synchronously the onchip cache will be large enough to deliver almost all instructions at
the requested speed	 If the virtual processors run asynchronously e	g	 if each operates on
a dierent application the size of the instruction cache is too small to serve  virtual
processors	 They can be served by the second level cache which is large enough but this
will slow down the machine by a factor of  to 	 This slowdown can be avoided by care
fully assigning virtual processors to applications	 If the virtual processors of one physical
processor belong to only four groups the onchip cache should still suce	
 Memory
The memory boards must handle requests at  MHz, provided that all memory modules
are utilized evenly. The universal hashing only supports almost even utilization.
Furthermore, there may be bursts of requests arriving every other network cycle, i.e. at
   MHz.
To handle this, each memory module consists of four banks of EDRAM  , which
is fast dynamic RAM with on-chip cache. The EDRAM is available in a  M SIMM
package with a cycle time of  ns. If we assume that at most each third request accesses a
certain bank, then the module can handle requests at a rate of  MHz. In case of bursts,
packets are queued at each bank. This avoids blocking of one bank because another bank
is crowded.
 Machine size
The design of the HPP so far assumes that it has  processors, as in the SBPRAM.
A machine with  processors is possible as well. As mentioned in Section  , a  stage
butterfly network can be implemented on one network board. Thus a  stage butterfly
network can be implemented without increasing the three stages of network boards. As the
additional network links are all on-board, the memory access latency increases only slightly.
Thus  virtual processors per physical processor are still sufficient to hide this latency.
The speed of the machine is not affected.
 Applications
Before we discuss some application programs, we analyze the instruction stream of a physical
processor. The machine instructions of the SBPRAM can be divided into global and local
instructions. Global instructions are those loading from or storing to global memory. Local
instructions are all other instructions. The behavior of the machine as a synchronous PRAM
is determined only by the correct sequence of global instructions for all instruction streams
of the virtual processors. If the compiler achieves that in each request slot just one global
instruction is scheduled, the run time can be reduced by a factor of r, where r is the number
of instructions executed in one request slot. In assembler programs of irregular applications,
typically only   of all instructions are global. Note that local variables and stacks are
held in local memory. Because a delay slot is implemented for global load instructions, one
has to take care of true dependencies in a schedule, both between two global instructions
and between one global and the following local instructions.

The instruction stream of a virtual processor is given by a trace through the basic block
graph during run time. The ratio of the number of instructions to the number of global
instructions gives an upper bound for the improvement that can be achieved by speeding up
the execution time of a local instruction. Because we want to make a worst case analysis,
we consider the worst case ratio which can be found in any possible trace in the basic block
graph. The calculation of this worst trace is straightforward through a two-pass analysis
of the machine program.
The network can be accessed with a request rate of f =  MHz (see section  ). As
we count only the number c of global memory accesses (load and store operations), the peak
floating point performance P in inner loops is given by
$$P = p \cdot \frac{n}{c} \cdot f$$
where p is the number of physical processors and n is the number of floating point operations.
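As a small worked illustration of the performance formula for P, the following sketch evaluates it in C. The function name peak_flops is hypothetical, and any values passed to it in an example are made up for the illustration and are not the machine parameters of the SBPRAM or the HPP.

```c
#include <assert.h>

/* Peak floating point performance in inner loops:
   p processors, n floating point operations and c global memory
   accesses per loop body, request rate f_hz in requests per second.
   Returns P = p * (n / c) * f_hz in floating point operations per second. */
static double peak_flops(unsigned p, unsigned n, unsigned c, double f_hz)
{
    assert(c > 0);
    return (double)p * (double)n * f_hz / (double)c;
}
```

For instance, with the made-up values p = 4, n = 2, c = 3 and a request rate of 6 MHz, the call peak_flops(4, 2, 3, 6.0e6) evaluates to 16 MFlops. The formula shows why reducing c for a loop body, for example by software pipelining, raises the achievable performance in direct proportion.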
We present four examples. Two of them are simple numerical loops, namely the dot
product as the base for matrix multiplication, and an indexed dot product as the base for many
complex address patterns in numerical applications. The other two examples are inner loops
of irregular applications, namely discrete event simulation and raytracing, which usually
exhibit poor performance on distributed memory machines.
 Dot Product
The following C routine computes the dot product of a row vector of a matrix with a column
vector of the same or another matrix. The length of the vectors has been set to a constant
value.
Real DotProductReal row Real col

Real sum	
int i	
foriMSIZE	i	i

 
sum  row  col	
col  MSIZE	

returnsum	

In assembly language for the SBPRAM, the body of the loop may look like the following
sequence. The syntax of the mnemonics is straightforward. The operands are given in the
order source, source, destination. We have chosen the same symbolic names as in the C
routine instead of introducing register names. An instruction popgn R I R loads the
value at address RI from global memory into register R and increments R with the
immediate value I. Clearly, the registers must be preloaded with the appropriate values
before entering the body of the loop.

popgn col MSIZE tmp
L popgn row  tmp
sub i  i
fmul tmp tmp tmp
popgn col MSIZE tmp
fadd sum tmp sum
bnz L
The inner loop of the dot product has six instructions. Only two count as floating point
operations. This results in a performance of approximately  MFlops on the SBPRAM
prototype as described in section  . Two of the instructions are global load instructions.
There exists a dependency (tmp ) between line  and line  of the loop. Hence, without
software pipelining, the block can be executed in three request slots, i.e. two load instructions
and one delay slot. So the peak performance on matrix multiply on the HPP will be close
to   GFlops for    matrices, or a set of smaller matrices allowing enough
parallelism. A  HPP would yield   GFlops for   matrices.
 Indexed Dot Product
As a more complex example of numerical code, we chose an indexed dot product. A row of
an index matrix is used to address a row of a data matrix, and a column of another index
matrix is used to address a column of the same or another data matrix. The following C
routine shows the product with pointer arithmetic.
Real IndexDotProductReal row Real col int i int j

Real sum	
int k	
forkMSIZE	k	k

 
sum  rowi  colj  MSIZE	
j  MSIZE	

returnsum	

The translation of the C routine to machine language for the body of the loop is
straightforward as well. There occur additional load instructions as well as more complex index
calculations in the basic block. The following assembly sequence uses a single assignment
strategy for temporary values in registers; clearly, this register need can be reduced.

popgn i  tmp
L popgn j MSIZE tmp
add row tmp tmp
ldgn tmp  tmp
mul tmp MSIZE tmp
add col tmp tmp
ldgn tmp  tmp
sub k  k
fmul tmp tmp tmp
popgn i  tmp
fadd sum tmp sum
bnz L
The inner loop of the indexed dot product has eleven instructions. Only two count as
floating point operations. This results in a performance of approximately  MFlops on
the SBPRAM. Four of the instructions are global load instructions. Without software
pipelining, the block can be executed in five request slots. So the peak performance on such
a routine, which covers a large set of often used inner loops in scientific programming, will
be close to   GFlops on the HPP, provided there is sufficient parallelism to exploit. A
 processor version will yield   GFlops.
With software pipelining, a more sophisticated compiler can reduce the number of request
slots to two for the dot product and to four for the indexed dot product; thus almost 
respectively  GFlops seem to be achievable on a  processor HPP. This increases to
 and  GFlops on a  processor HPP. In  p.  , the performance of  machines
on the Linpack benchmark is listed. This performance is comparable to the performance on
indexed dot product. The fastest machine is an Intel Paragon XPS with  processors. It
achieves a performance of   GFlops at a matrix size of  . Hence the performance
per processor is   MFlops. On indexed dot product, the HPP achieves a performance
of    GFlops    MFlops.
 Circuit Simulation
The third example is taken from the SPLASH benchmark suite  . We analyzed the inner
loop of the parallel discrete event simulator as implemented in  . In this loop, more than
 percent of the total run time is spent. The basic block graph contains  nodes. The
worst case ratio of total to global instructions on any trace is about  , i.e. we expect
(including the faster clock speed) an improvement of about  on the HPP compared to
the SBPRAM.
Note that the achievable speedup for discrete event simulation is strongly limited by
the critical path of the circuit being simulated as long as the conservative approach is
implemented	 Due to the possibility to use very ecient parallel data structures with
concurrent access to shared data more aggressive simulation methods become an interesting
and promising research area	 If the HPP is implemented with more processors than the
SBPRAM they can be used only eectively if there is enough parallelism to exploit	
Concurrent simulation of more than one test pattern seems to be the method of choice
where the representation of the simulated circuit is stored only once in memory	 An example

are production tests for ASICs that consist of  to  independent groups of patterns
depending on the size of the ASIC	
In  the SBPRAM implementation is compared to a sequential implementation
to obtain an absolute speedup. There, for the benchmark circuit Multiplier, both a SUN
Sparc  and an SGI Challenge need about  seconds to simulate the circuit on the input
vectors delivered with SPLASH. The SBPRAM with  processors needs about  seconds.
Then the HPP with  processors obtains an absolute speedup of      . On a
 processor HPP,    test patterns can be simulated simultaneously. On a 
processor HPP this increases to  . Then the speedups rise to   and  , respectively.
 Raytracing
The last example consists of the inner loop of a raytracer  . Here the basic block graph
has  nodes, where the subroutine calls to the function calculating intersection points
are not counted. The program spends most of the run time (more than  percent for
large scenes) in this loop. Any trace through the complex loop reveals a ratio of total to
global instructions in the range of   and  . In the assembly code, no consecutive global
instructions appear. So it seems that the proposed improvement of performing three times
as many local instructions as global instructions is easy to achieve.
Clearly exact performance data for the irregular application cannot be provided but
due to the fact that one of the fastest raytracing methods has been parallelized with almost
linear speedup even for a large amount of processors the performance of the HPP will
be considerably larger than the one of any other parallel machine	 In  it is shown that
the SBPRAM prototype is seven times faster than an SGI challenge with one MHz
MIPS R processor	 The analysis of the HPP promises a 	 times faster version of
the raytracer	 Moreover a  processor machine will be about  times faster than a one
processor SGI challenge on suciently large data bases	
 Conclusions
We presented how to reengineer the SBPRAM multiprocessor	 The request rate

was
increased from MHz to MHz i	e	 by a factor of 	 The main changes were the use
of fast ASIC technology for routing nodes and of optical network links	 A further speed
gain was obtained by separating local and global operations during one global operation
several local instructions can be executed	 Hence we get higher instruction throughput
while the request rate remains unaltered	
We verified this approach with several benchmarks, both numerical and irregular
non-numerical ones. Our first observation is that we find enough local instructions to fill in:
the fraction of global to total instructions is low because stack operations are local. Second,
there is enough independence that a compiler can statically schedule instructions without
fancy optimizations if we overlap one global with two local instructions. This increases
processor speed by a factor of three to MHz. To be more flexible and possibly obtain
further speed gains, we could switch to dynamic scheduling, i.e. by using a superscalar
out-of-order issuing processor.

Request rate: number of global memory accesses per processor per time unit.

The peak performance of the resulting machine, called HPP, thus is  GFlops.
As the same architecture can be used with  processors, this increases to GFlops.
We estimated the performance of the benchmarks by inspecting the compiler-generated
assembler code of their kernels. For linpack-type applications we obtain a performance
of MFlops per processor, thus approaching the MFlops per processor of an
Intel Paragon, which was rated best in . For circuit simulation we achieve absolute
speedups of  and  over a SUN Sparc  or a one processor SGI Challenge with  and 
processors respectively. For raytracing we achieve absolute speedups of  and  over a
one processor SGI Challenge with  and  processors respectively.
A further increase in speed is possible if we enhance the request rate of the machine.
For instance, a factor of two in network bandwidth is possible if we allow complete requests
to be transmitted at once. In the current solution, the number of transceiver circuits and
hence the total area needed per link would be doubled. This seems to be impossible to
implement on one board. The problem of the enormous number of external circuits can be
overcome if we connect the serial line directly to the chip. According to the S project at
Sun Microsystems , this is possible at a transmission frequency of GHz. The amount
of chip area needed for each channel is mm². Because the pin limitations are dropped
too, we can now choose a more suitable master that exploits the network chip better. For
every link  serial lines are needed. As a chip has four links, this totals  serial lines
or, equivalently, interface pins. With higher transmission rates of optical links this number
can be reduced. Then more network nodes can be integrated in one chip. This decreases
network latency. Furthermore, the machine becomes smaller.
References
 Ferri Abolhassan Reinhard Drefenstedt Jorg Keller Wolfgang J Paul and Dieter Scheerer
On the physical design of PRAMs Computer Journal 	
 December 
 Ferri Abolhassan Jorg Keller and Wolfgang J Paul On the costeectiveness of PRAMs
In Proceedings of the rd IEEE Symposium on Parallel and Distributed Processing pages 
IEEE December 
 George Almasi and Allan Gottlieb Highly Parallel Computing BenjaminCummings nd
edition 
 Robert Alverson David Callahan Daniel Cummings Brian Koblenz Allan Portereld and
Burton Smith The Tera computer system In Proceedings of the  International Conference
on Supercomputing pages  ACM 
 Digital Equipment Corporation Maynard Mass DECchip  and DECchip A Al
phaAXP Microprocessors	 Hardware Reference Manual 
 Arno Formella Ray Tracing Complex Scenes Parallel or Sequential In M

H Hamza edi
tor Proceedings of 
th
IASTED
ISMM International Conference on Parallel and Distributed
Computing and Systems pages 
 October 
 Arno Formella and Christian Gill Ray Tracing A Quantitative Analysis and a New Practical
Algorithm The Visual Computer 	 December 

 Hewlett Packard Geneva Switzerland HDMP
 Low Cost Gigabit Rate Trans
mit
Receive Chipset 
 Hewlett Packard Geneva Switzerland LSC mW  Pin DIL Cooled Laser Module 

 J Keller Th Rauber and B Rederlechner Conservative Circuit Simulation on Shared
Memory Multiprocessors  In Proc th Workshop on Parallel and Distributed Simulation
Philadelphia USA May 
 Alexander C Klaiber and Henry M Levy A comparison of message passing and sharedmemory
architectures for dataparallel programs In Proceedings of the st Annual International Sym
posium on Computer Architecture April 
 Daniel Lenoski James Laudon Kourosh Gharachorloo WolfDietrich Weber Anoop Gupta
John Hennessy Mark Horowitz and Monica S Lam The Stanford DASH multiprocessor IEEE
Computer 	 March 
 Andreas G Nowatazyk Michael C Browne Edmund J Kelly and Michael Parkin Sconnect
from networks of workstations to supercomputer performance In Proceedings of the nd
Annual International Symposium on Computer Architecture pages 
 
 Ramtron International Corporation, Colorado Springs, CO. Specialty Memory Products
Databook, October.
 Abhiram G. Ranade. How to emulate shared memory. Journal of Computer and System
Sciences.
 Bernd Rederlechner. Parallele Diskrete Ereignissimulation auf der SBPRAM. Diplomarbeit,
Universität des Saarlandes, FB Informatik.
 J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared
Memory. Computer Architecture News.
 Thesys GmbH, Erfurt, Germany. THA  Macro Cell Databook, Rev. , April.
 Leslie G. Valiant. General purpose parallel architectures. In Jan van Leeuwen, editor, Handbook
of Theoretical Computer Science, Vol. A, pages. Elsevier.