Scheduling communication on an SMP node parallel machine by Falsafi, Babak & Wood, David A.
Scheduling Communication on an SMP Node Parallel Machine 
Babak Falsafi and David A. Wood 
Computer Sciences Department 
University of Wisconsin-Madison 
1210 West Dayton Street 
Madison, WI 53706 
{babak,david}@cs.wisc.edu
Abstract
Distributed-memory parallel computers and networks of
workstations (NOWs) both rely on efficient communication
over increasingly high-speed networks. Software communication
protocols are often the performance bottleneck. Several
current and proposed parallel systems address this problem by
dedicating one general-purpose processor in a symmetric
multiprocessor (SMP) node specifically for protocol processing.
This scheduling convention reduces communication latency
and increases effective bandwidth, but also reduces the peak
performance since the dedicated processor no longer performs
computation.
In this paper, we study a parallel machine with SMP nodes
and compare two protocol processing policies: Fixed, which
uses a dedicated protocol processor, and Floating, where all
processors perform both computation and protocol processing.
The results from synthetic microbenchmarks and five
macrobenchmarks show that: i) a dedicated protocol processor
benefits light-weight protocols much more than heavy-weight
protocols; ii) Fixed improves performance over Floating when
communication becomes the bottleneck, which is more likely
when the application is very communication-intensive,
overheads are very high, or there are multiple (i.e., more than two)
processors per node; iii) a system with optimal cost-effectiveness
is likely to include a dedicated protocol processor, at least
for light-weight protocols.
This work is supported in part by Wright Laboratory Avionics Directorate,
Air Force Material Command, USAF, under grant #F33615-94-1-1525
and ARPA order no. B550, NSF PYI Award CCR-9157366, NSF
Grant MIP-9225097, an IBM graduate fellowship, and donations from
A.T.&T. Bell Laboratories, Digital Equipment Corporation, IBM Corporation,
Sun Microsystems, Thinking Machines Corporation, and Xerox
Corporation. The U.S. Government is authorized to reproduce and distribute
reprints for Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied,
of the Wright Laboratory Avionics Directorate or the U.S. Government.
1 Introduction
Parallel computers are emerging as the supercomputers of 
choice, exhibiting impressive performance on many classes of 
large and important applications. Commodity microprocessors 
form the core of computation in these machines, exploiting 
large sales volumes and rapid technology improvements to provide
superior cost-performance [1]. Low-level communication
in these machines is implemented in the form of messaging
over high-speed networks. Both application programs and the
system software employ a variety of protocols to schedule and
coordinate communication and computation. These protocols
range from low-level messaging functionality, such as checksumming,
reliable delivery, and flow control, to high-level parallel
programming abstractions, like coherent distributed
shared memory. 
Systems can implement these protocols in either hardware 
or software. Many researchers and vendors favor software 
implementations due to their flexibility [7], reduced manufacturing
cost [17], shorter design times [16], and increased portability
[10,25]. However, as the performance of network
interface hardware improves, software protocol overheads
begin to dominate end-to-end communication time [11].
To address this problem, many distributed-memory paral- 
lel machines employ an embedded processor to off-load the 
primary (computation) processor(s). For example, the Meiko 
CS-2, IBM SP-2, and proposed Stanford FLASH [13] and
Wisconsin Typhoon [23] all use embedded processors to accelerate
communications performance. By reducing the frequency
of system calls, interrupts, locking, and cache pollution, these 
processors reduce communication latency and increase effec- 
tive bandwidth. 
This paper studies an alternative approach which employs
one of several processors on a symmetric multiprocessor
(SMP) node for protocol processing. Small SMP systems,
such as the recently announced Intel Pentium Pro servers, are
becoming widely available, making them attractive building
blocks for parallel computers [16]. The Intel Paragon, and the
proposed MIT StarT-NG [4] and Wisconsin Typhoon-0 [20]
0-8186-7764-3/97 $10.00 © 1997 IEEE
Authorized licensed use limited to: EPFL LAUSANNE. Downloaded on April 7, 2009 at 12:33 from IEEE Xplore.  Restrictions apply.
systems all dedicate one processor of a multiprocessor node 
specifically for protocol processing. 
While a dedicated protocol processor can improve com- 
munications performance, it provides little benefit for compute- 
bound programs. These applications would rather use the dedi- 
cated processor for computation. In a recent experiment, 
Womble, et al., demonstrated that using the Paragon’s protocol 
processor for computation (via a low-level cross-call mecha- 
nism under SUNMOS) improved performance on Linpack by 
more than 50% [28]. Similarly, others have shown that a dedi- 
cated protocol processor provides little benefit for systems with 
large communication latencies and overheads, as in ATM [12]
or HIPPI [6] networks. 
In this paper, we ask the question: “when does it make
sense to dedicate one processor in each SMP node specifically
for protocol processing?” The central issue is when the
overheads eliminated by a dedicated protocol processor offset
its lost contribution to computation. We address this question
by examining the performance and cost-performance trade-offs
of two scheduling policies:
Fixed, where one processor in a multiprocessor node is
dedicated to protocol processing, and
Floating, where all processors perform computation and
alternate acting as protocol processor.
We limit our study to SMP nodes interconnected
using relatively simple network interfaces (similar
in complexity to the Thinking Machines CM-5 NI [8]), where
most protocol processing occurs on a regular processor. In contrast,
other research has examined complex, powerful network
interfaces that use embedded protocol processors to off-load
protocol processing [13,23]. While both simple and complex
network interfaces are interesting, we focus on the former,
lower-cost alternative.
We analyze the policies using two sets of experiments. In 
the first set, we use synthetic microbenchmarks to examine two 
simple request/reply protocols and show the following results:
1. A dedicated protocol processor benefits light-weight proto- 
cols (e.g., Split-C get/put [27]) much more than heavy- 
weight protocols (e.g., page-based DSM [2]). This is 
because the overheads saved by the Fixed policy represent a 
significant fraction of the light-weight protocol’s total 
round-trip time. 
2. Fixed reduces protocol processor occupancy [9] (i.e., the
time it takes to handle a protocol event) and thus performs
better than Floating when communication becomes the bottleneck.
This is more likely when the application is very
communication-intensive, protocol overheads are very
high, or there are multiple (i.e., more than two) processors
per node.
3. A system with optimal cost-effectiveness (in the number
of processors per node) is likely to include a dedicated
protocol processor when overheads are a significant component
of protocol processor occupancy. This follows
because the incremental cost of an additional processor is
typically less than the relative performance increase provided
by lower protocol processor occupancy.
In the second experiment, we examine the same policies
using five shared-memory applications. These applications run
on a fine-grain distributed shared-memory machine based on
the Tempest interface [23]. Besides corroborating our findings
from the first experiment, the results also show that communication
patterns in some applications decrease the benefit of the
Fixed policy. Under Floating, an idle processor acts much like a
dedicated protocol processor, which occurs more frequently
with bursty and synchronous communication.
The next section summarizes the system architecture and 
simulation methodology. Section 3 describes the two protocol 
processing alternatives in more detail. Section 4 briefly
describes the cost model and system characteristics that affect 
the performance of the system under the policies. Section 5 and 
Section 6 present results from our microbenchmark and mac- 
robenchmark experiments, respectively. Finally, Section 7 con- 
cludes the paper. 
2 System Architecture
Figure 1 illustrates the general class of parallel machines 
that we study in this paper. Each node of this system is modeled 
after a SPARCStation 20 consisting of one or more 200 MHz
HyperSparc processors, each with a 1 MB direct-mapped data
cache, connected by a 100 MHz split-transaction bus.¹ We
assume perfect instruction cache performance but model contention
at the memory bus accurately. A snooping cache-coherence
protocol keeps the caches within a node consistent. A
network interface device (similar to that on a Thinking
Machines CM-5 [8]) with a pair of uncached memory-mapped
send and receive queues resides on the memory bus
and connects the node to a low-latency, high-bandwidth network.
We assume a point-to-point network with a constant
message latency of 100 cycles but model contention at the net- 
work interface device. 
An operating system both provides local services and 
manages the nodes collectively as a single parallel machine 
[8,1]. Parallel applications follow the SPMD programming
model. In this paper, we assume space sharing, where the
nodes are logically allocated to separate parallel tasks. More
general time sharing is of course possible, but is beyond the
scope of this paper.
High performance communication is performed via an
active message abstraction [27]. Message arrivals either cause
interrupts or the processors poll the network interface to eliminate
the interrupt overhead. A memory-mapped interrupt arbiter
device located on the memory bus distributes interrupts
among processors in a round-robin fashion. Unless stated otherwise,
we assume an interrupt overhead of 200 cycles, characteristic
of carefully tuned parallel computers [22]. We also
assume Tempest active message semantics, which reduce the
need for synchronization by requiring sequential execution of
handlers within each node [21].
1. In this paper we use cycle to refer to processor cycles.
[Figure 1: diagram of the machine, with nodes labeled Node 0 through Node M]
FIGURE 1. A multiprocessor node parallel machine
A fine-grain software distributed shared-memory system
extends the coherent shared-memory abstraction beyond a single
node. This system is based on the Tempest interface and
allocates shared memory at the page granularity, but maintains
coherence via a fine-grain access control mechanism [25].
While the results of this paper are largely independent of
whether the mechanism is implemented in hardware or software,
we assume a hardware implementation via a Typhoon-1
(T1) board [20]. On each node, a T1 board performs a tag
lookup to enforce the Tempest access control semantics on
shared-memory loads and stores that miss in the cache. On a
remote miss, the hardware provides handler dispatch information
in a cacheable memory location. T1 also facilitates messaging
by providing a cacheable control register for detecting
message arrivals in order to eliminate poll traffic from the
memory bus.
3 Protocol Processing Policies
In this paper, the term protocol processing refers to executing
the user and system software needed to manage communication
between cooperating nodes. For the distributed shared-memory
system we focus on in this study, protocol processing
includes executing remote miss handlers (invoked on a
fine-grain access control exception) and active message handlers.
Because of Tempest atomicity requirements, each node is limited
to one processor executing protocol events at any one time.
Regardless of the policy, we say that this processor is acting as
protocol processor.
In this study, we examine two scheduling policies for protocol
processing: Fixed and Floating. In Section 4 we qualitatively
analyze the cost-performance trade-offs between the
different policies and identify application and system character- 
istics that affect these trade-offs. 
Fixed. The Fixed policy dedicates one processor of a multiprocessor
node to perform only protocol processing. The dedicated
protocol processor executes all the remote miss and
active message handlers. By always polling the network when
otherwise idle, the protocol processor eliminates the need for
message interrupts or polling by the compute processor(s).
Floating. The disadvantage of dedicating a protocol processor
is that it may waste cycles that could have productively contributed
to computation. The Floating policy addresses this
dilemma by using all processors to perform computation; however,
when one becomes idle (e.g., due to waiting for a remote
request or synchronization operation) it assumes the role of
protocol processor. Since all processors may be computing,
either interrupts or periodic polling are required to ensure
timely handling of active messages. On the other hand, once a
processor assumes the role of protocol processor, it acts much
like a dedicated protocol processor. We use the term Single to
refer to the special case of a single processor (per node) performing
all protocol processing as well as all computation.
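The difference between the two policies comes down to which processor handles an arriving message and whether an interrupt is needed to get its attention. The sketch below is a toy model of that decision and not the simulator used in this study. The function name `choose_handler`, the processor numbering, and the convention that processor 0 is the dedicated protocol processor under Fixed are assumptions made for illustration; periodic polling under Floating is not modeled.

```python
from typing import List, Tuple


def choose_handler(policy: str, idle: List[bool], next_rr: int) -> Tuple[int, bool, int]:
    """Pick the processor that handles an arriving message.

    Returns (processor index, interrupt needed?, updated round-robin pointer).
    Assumption: under Fixed, processor 0 is the dedicated protocol processor.
    """
    if policy == "fixed":
        # The dedicated processor polls the network, so no interrupt is needed.
        return 0, False, next_rr
    if policy == "floating":
        # An idle processor has already assumed the protocol processor role.
        for cpu, is_idle in enumerate(idle):
            if is_idle:
                return cpu, False, next_rr
        # All processors are computing: interrupt one, chosen round-robin.
        return next_rr, True, (next_rr + 1) % len(idle)
    raise ValueError(f"unknown policy: {policy}")
```

Under this model, Floating pays the interrupt overhead only when every processor on the node is busy computing, which is why idle time makes it behave much like Fixed.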
4 When does dedicated protocol processing 
make sense? 
In this paper, we pose the question: “when does dedicated
protocol processing make sense?” We address this question by
evaluating when one of our two protocol processing policies
performs better or is more cost-effective than the other. While
there are many factors (including system software complexity
and protection [15]), we believe that performance and cost-performance
are important.
To quantify cost-effectiveness, we use the simple cost
model from Wood and Hill [30]. A change, e.g., adding a second
processor, is cost-effective if and only if the increase in cost
(or costup) is less than the increase in performance (or
speedup). In this paper, we say a system is cost-effective if its
cost-performance ratio is less than a uniprocessor node's. A
system is most cost-effective if it achieves the lowest cost-performance
ratio. Our simple cost model assumes that a processor
represents 30% of the cost of a uniprocessor node.¹ Thus, a
two-processor node and a five-processor node have costups of
1.3 and 2.2, respectively.
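The cost model above is simple enough to write down directly. The following is a minimal sketch under the stated 30% assumption; the function names are ours, and the speedup values used in the usage note are hypothetical rather than measured results.

```python
PROCESSOR_COST_FRACTION = 0.3  # from the text: a processor is 30% of a uniprocessor node's cost


def costup(processors: int, fraction: float = PROCESSOR_COST_FRACTION) -> float:
    """Cost of an n-processor node relative to a uniprocessor node."""
    return 1.0 + fraction * (processors - 1)


def cost_performance(processors: int, speedup: float) -> float:
    """Cost-performance ratio relative to a uniprocessor node (lower is better)."""
    return costup(processors) / speedup


def is_cost_effective(processors: int, speedup: float) -> bool:
    """A node is cost-effective when its costup is less than its speedup."""
    return costup(processors) < speedup
```

For example, `costup(2)` and `costup(5)` reproduce the costups of 1.3 and 2.2 quoted above, and a two-processor node with a hypothetical speedup of 1.5 would be cost-effective while one with a speedup of 1.2 would not.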
To answer “when” one policy is better than another, we 
examine which factors significantly affect performance. In the 
remainder of this section, we identify and qualitatively analyze 
four factors that have first-order effects. 
Computation/Communication Ratio. Efficient protocol processing
helps speed up communication. Compute-intensive
applications (such as some dense matrix methods) require
little communication. These applications, characterized by hav- 
ing large computation-to-communication ratios, perform well 
even with very heavy-weight protocols [2]. Thus such applica- 
tions should not benefit from a dedicated protocol processor. 
Conversely, other applications have lower computation-to- 
communication ratios and may benefit from a dedicated proto- 
col processor. 
Protocol Processing Overhead. A dedicated protocol proces- 
sor improves performance by eliminating two types of protocol 
processing overhead: direct and indirect. Direct overhead consists
of the overheads incurred when a processor assumes or
relinquishes the role of protocol processor. As such, it includes
the overhead of disabling and enabling interrupts, accessing a
lock (which ensures there is only a single acting protocol processor
on a node), and checking when to relinquish being protocol
processor. Direct overhead also includes the overhead of
delivering and returning from an interrupt. Indirect overhead
consists of cache interference between computation and protocol
threads and migration of protocol processor lock [5], code
[18] and data [24] among processors. A dedicated protocol processor
always eliminates both direct and indirect overheads and
becomes beneficial when the overheads it saves are large compared
to overall communication time.
Protocol Weight. Protocol weight is a measure of the protocol's
total execution time. It is a function of the protocol complexity
and the speed of the network interface device. We
characterize the weight by end-to-end communication time:
heavy-weight protocols have larger end-to-end latencies than
light-weight protocols. Protocol weight affects the policy trade-off
because for heavy-weight protocols, the overheads saved by
Fixed become an insignificant fraction of the overall communication
time. Thus, a dedicated protocol processor should be
more beneficial for light-weight protocols (e.g., active messages)
than for heavy-weight protocols (e.g., page-based
DSM). This runs counter to the common intuition that dedicating
a protocol processor helps off-load heavy-weight protocols
from the computation processor.
1. The incremental cost of an additional processor varies greatly depending
on the processor, memory hierarchy, peripherals, and the overall system
cost per node. In many cases the incremental cost may be less than
30%, which will shift cost-performance in favor of Fixed.
Number of Processors per Node. The number of processors
per node has several effects on the policy trade-off. First, more
processors increase the likelihood that at least one processor is
idle (e.g., waiting for a protocol response). Under the Floating
policy, such a processor acts as protocol processor, significantly
reducing the direct overhead by eliminating interrupts. A dedicated
protocol processor, however, saves all of the direct and
indirect overhead, which may improve performance in the presence
of high bus contention. Second, more processors decrease
the opportunity cost (in lost computation) of the dedicated processor.
Third, by parallelizing the computation within a node,
multiple compute processors decrease the apparent
computation-to-communication ratio. This increases the demand for
protocol processing, which makes performance more sensitive
to protocol processor occupancy [9] (i.e., the time it takes to
handle protocol events). Finally, sharing a dedicated protocol
processor among multiple compute processors amortizes its
cost, decreasing the performance improvement needed to be
cost-effective.
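The amortization argument can be made concrete with the 30% cost assumption from the cost model. The sketch below computes the relative speedup that one additional processor must deliver to pay for itself; the helper name `break_even_speedup` is ours and the calculation is an illustration, not a result from the experiments.

```python
def costup(processors: int, fraction: float = 0.3) -> float:
    """Cost of an n-processor node relative to a uniprocessor node."""
    return 1.0 + fraction * (processors - 1)


def break_even_speedup(current_processors: int, fraction: float = 0.3) -> float:
    """Relative speedup one extra processor must deliver to be cost-effective."""
    return costup(current_processors + 1, fraction) / costup(current_processors, fraction)


for n in (1, 2, 4):
    print(f"{n} -> {n + 1} processors: speedup must exceed {break_even_speedup(n):.3f}")
```

Under these assumptions the threshold falls from 1.3 when a processor is added to a uniprocessor node to roughly 1.16 when one is added to a four-processor node, so a dedicated protocol processor needs to buy less improvement as the node grows.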
5 Microbenchmark Analysis 
In this section we evaluate the Fixed and Floating policies 
using two simple synthetic benchmarks. We base our benchmarks
on a simple request/reply protocol, similar to that
employed by many parallel computing paradigms [5,10,2,23].
Figure 2 (left) illustrates the execution of such a protocol under
the Fixed policy. The compute processor N1CP submits a
request to the protocol processor N1PP, which in turn sends a
message. At the destination node, protocol processor N2PP
immediately invokes the protocol handler and sends the appropriate
reply. Because of the dedicated protocol processor, compute
processor N2CP proceeds uninterrupted. Finally, the reply
arrives and the handler runs on N1PP, which then resumes the
computation thread.
Figure 2 (right) illustrates the same remote request/reply,
but for the Floating policy. The (compute) processor N1CP2
submits a request, becomes the protocol processor and sends a
message. When the message arrives at node 2, all processors
are busy computing. Thus, an interrupt is generated causing
processor N2CP1 to act as protocol processor. The requesting
processor incurs the overhead of two context switches (to and
from the protocol thread) and the resulting cache pollution. The
replying processor additionally incurs the overhead of delivering
(and returning from) the interrupt. An idle processor acting
as protocol processor (N1CP2) can immediately handle a
request by another processor on the node (N1CP1), thereby
eliminating the interrupt overhead.
[Figure 2: timelines for processors N1CP, N1PP, N2PP, N2CP (left) and N1CP1, N1CP2, N2CP1, N2CP2 (right) along a time axis; legend: Interrupt, Request, Relinquish PP]
FIGURE 2. Request/reply protocol in Fixed (left) and Floating (right)
Our benchmarks time the execution of a tight-loop run- 
ning on a two-node machine. Each iteration alternates between 
computing and issuing a remote request using a simple request/ 
reply protocol. To induce cache effects, computation is inter- 
leaved with uniformly random accesses to a (private) proces- 
sor-specific segment of the address-space. The size of the 
segment is equal to the size of the processor cache. We let com- 
pute processor caches warm up before the start of measure- 
ments. 
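The structure of the benchmark loop can be sketched as follows. This is only an outline of the loop described above, not the benchmark code itself: the names `benchmark_loop` and `remote_request`, the number of touches per iteration, and the use of a Python `bytearray` as the private segment are assumptions, and the 1M cache is read here as one megabyte.

```python
import random
from typing import Callable

CACHE_BYTES = 1 << 20  # assumption: the 1M data cache is 1 MB


def benchmark_loop(iterations: int, touches_per_iteration: int,
                   remote_request: Callable[[], None], seed: int = 0) -> int:
    """Alternate computation with remote requests; return the requests issued."""
    rng = random.Random(seed)
    segment = bytearray(CACHE_BYTES)  # private segment, same size as the cache
    issued = 0
    for _ in range(iterations):
        # Computation phase: uniformly random accesses to the private segment
        # stand in for real work and induce cache effects.
        for _ in range(touches_per_iteration):
            index = rng.randrange(CACHE_BYTES)
            segment[index] = (segment[index] + 1) % 256
        # Communication phase: one blocking request/reply round trip.
        remote_request()
        issued += 1
    return issued
```

Passing a stub such as `lambda: None` for `remote_request` exercises the loop without any network; a real harness would block there until the reply handler runs.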
We experiment with two request/reply protocols with different
protocol weights. Our null-handler protocol represents
the lightest-weight protocol achievable in our simulated system.
The protocol handlers do nothing but send the appropriate
active message, i.e., the reply handler simply sends a null message
back to the requester. Our fetch-block protocol is representative
of the medium-weight protocols needed to support
fine-grain distributed shared-memory systems [10,25]. We do not
consider a heavy-weight protocol, e.g., page-based DSM, since
prior work indicates that a dedicated protocol processor will be
of little use [12,6]. The processors randomly request a 128-byte
block of data from the private segment of a remote processor.
In addition, the protocol handlers manipulate the memory
block state in a protocol table. Both the data block transfer and
accesses to the protocol table contribute to cache pollution.
We define $L_{min}$ to be the minimum round-trip latency under
the Fixed policy. Under our system assumptions, the protocol
round-trip times are 3 µs for the null-handler protocol and 8 µs
for the fetch-block protocol. We vary the following parameters
in the experiment:
C = mean computation time between requests,
U = thread compute-utilization in the absence of protocol contention, $U = C/(C + L_{min})$,
$O_{int}$ = overhead of handling an interrupt.
We use an exponential random stream with mean C to
generate computation times, and adjust C to derive various values
for U. We vary $O_{int}$ by delaying a thread upon an interrupt
for a fixed number of cycles. The number of iterations in a loop
is inversely proportional to the number of compute processors
per node, e.g., Floating on a two-processor node and Fixed on a
four-processor node execute half and one-third as many iterations
as Single, respectively.
Figure 3 (left) compares the performance for the null-handler
protocol in one- and two-processor node machines. The figure
plots execution times of Single and Floating normalized to
Fixed as thread compute-utilization increases. Points above the
horizontal line indicate that Floating (Single) performs worse
than Fixed. The thick and thin lines depict high and low interrupt
overheads, respectively. The graphs for Single (solid
curves) illustrate the intuitive result that communication-intensive
programs (small U) benefit more from a dedicated protocol
processor than computation-intensive programs (large U).
When the program becomes communication-bound ($C \ll L_{min}$),
however, the compute processor in Single becomes idle
and acts like a protocol processor, reducing the number of
taken interrupts. The graphs also indicate that, when interrupt
overhead is high, even a small number of interrupts severely
impacts the execution time.
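The relationship among the experiment parameters (C, U, and the minimum round-trip latency) can be sketched in a few lines. This is a minimal illustration under the definitions given for this experiment; the function names and the use of Python's `random.expovariate` as the exponential stream are our assumptions, not the simulator's implementation.

```python
import random
from typing import List

# Round-trip times in microseconds for the two protocols, as stated in the text.
L_MIN_US = {"null-handler": 3.0, "fetch-block": 8.0}


def utilization(mean_compute_us: float, l_min_us: float) -> float:
    """U = C / (C + L_min): compute-utilization without protocol contention."""
    return mean_compute_us / (mean_compute_us + l_min_us)


def mean_compute_for(u: float, l_min_us: float) -> float:
    """Invert U = C / (C + L_min) to find the mean compute time C."""
    if not 0.0 <= u < 1.0:
        raise ValueError("U must be in [0, 1)")
    return u * l_min_us / (1.0 - u)


def sample_compute_times(mean_compute_us: float, count: int, seed: int = 0) -> List[float]:
    """Draw computation times from an exponential stream with mean C."""
    if mean_compute_us <= 0.0:
        return [0.0] * count  # U = 0: no computation between requests
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_compute_us) for _ in range(count)]


def iterations_per_processor(single_iterations: int, compute_processors: int) -> int:
    """Loop length is inversely proportional to the number of compute processors."""
    return single_iterations // compute_processors
```

For instance, a target utilization of 0.5 with the null-handler protocol corresponds to a mean compute time equal to the 3 µs round trip, and a loop of 1200 iterations under Single would become 600 iterations per processor with two compute processors and 400 with three.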
[Figure 3: two panels, Null-handler Protocol (left) and Fetch-block Protocol (right); x-axis: Compute Utilization (U); y-axis: execution time normalized to Fixed; curves: Single and Floating, each at high and low $O_{int}$]
FIGURE 3. Relative performance of Fixed and Floating with varying interrupt overhead
The figures compare execution times of Fixed and Floating versus thread compute-utilization (U). The figures plot
execution times of Single and two-processor Floating normalized to two-processor Fixed for two values of interrupt
overhead ($O_{int}$); low and high $O_{int}$ correspond to values of 200 cycles (1 µs) and 2000 cycles (10 µs) respectively.
Values over the horizontal lines indicate better performance under the Fixed policy.
The dashed curves plot the normalized execution time for
a two-processor node under the Floating policy. With high
interrupt overheads, the Floating policy behaves like the Fixed
policy; the two (compute) processors alternate acting as the
protocol processor, eliminating the interrupt overhead. Protocol
processing migration, however, incurs indirect overhead,
slightly reducing performance under Floating relative to Fixed.
With low interrupt overheads, there is little benefit from a dedicated
protocol processor, but potential gain from improving
computation time. Under Floating, both processors perform
computation, resulting in significantly better performance at
higher compute-utilizations. This is not surprising since our
microbenchmark is perfectly parallelizable.
Figure 3 (right) compares the performance of the policies
for our fetch-block protocol. The figure corroborates our intuition
that a dedicated protocol processor is more beneficial for
light-weight protocols than for heavy-weight protocols. The
result follows from the observation that Fixed does best when
the interrupt overhead is much greater than the round-trip
latency ($O_{int} \gg L_{min}$). This result suggests that dedicated
protocol processors may become more attractive as interrupt latencies
go up (due to faster processors) and protocol weights go
down (due to faster network interfaces).
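To see how the four experimental configurations compare on this criterion, the interrupt overhead can be expressed relative to the minimum round-trip latency. The sketch below uses only values stated in the text (a 200 MHz clock, overheads of 200 and 2000 cycles, round trips of 3 µs and 8 µs); the helper names are ours, and the ratio is a rough indicator rather than a performance model.

```python
CLOCK_MHZ = 200  # processor clock from the system description


def cycles_to_us(cycles: int, clock_mhz: int = CLOCK_MHZ) -> float:
    """Convert processor cycles to microseconds."""
    return cycles / clock_mhz


def overhead_ratio(o_int_cycles: int, l_min_us: float) -> float:
    """Interrupt overhead relative to the minimum round-trip latency."""
    return cycles_to_us(o_int_cycles) / l_min_us


for name, l_min in (("null-handler", 3.0), ("fetch-block", 8.0)):
    for o_int in (200, 2000):
        ratio = overhead_ratio(o_int, l_min)
        print(f"{name}: O_int = {o_int} cycles, O_int / L_min = {ratio:.3f}")
```

With the high overhead, the ratio is about 3.3 for the null-handler protocol but only 1.25 for the fetch-block protocol, which is consistent with Fixed helping the lighter protocol more.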
This graph illustrates the surprising result that for a com- 
munication-bound program and low interrupt overhead, Single 
outperforms Fixed. This occurs because our synthetic protocol 
always reads message data into the protocol processor's cache. 
Under Fixed, the compute processor always misses on message 
data, resulting in a cache-to-cache transfer. Conversely, under 
Single, there is only one cache, so the transfer is eliminated. 
Network interfaces equipped with caches (e.g., CNI [19]) allow
protocols to directly transfer data into the compute processor's
cache.
Unlike the null-handler protocol, the Floating policy maintains
its advantage over Fixed even at low compute-utilizations.
Overheads in the fetch-block protocol account for an insignificant
fraction of communication time. Moreover, at low compute-utilizations
the extra compute processor in Floating
parallelizes communication by doubling the number of outstanding
requests per node, improving performance over Fixed.
Multiple Compute Processors Per Node. More processors
per node help Floating by increasing the likelihood that an idle
processor is acting as protocol processor. The added benefit of
an extra compute processor, however, diminishes with a larger
number of processors. Multiple compute processors also
increase the contention for the single protocol processor. Under
low compute-utilization, Floating approximates Fixed, since an
acting protocol processor eliminates the (direct) interrupt overhead.
Fixed, however, provides better throughput by also eliminating
the (direct and indirect) overheads associated with
protocol thread migration (Section 4).
Figure 4 (left) plots the normalized execution time for the
null-handler protocol under the Floating policy, relative to
Fixed, while varying the number of processors per node. Floating
generally outperforms Fixed on two processors, but as the
number of processors increases, greater demand for protocol
processing makes communication the bottleneck. Because the
dedicated protocol processor minimizes protocol processor
occupancy, Fixed provides greater throughput and can support
a larger number of compute processors. Eventually, the protocol
processor saturates regardless of policy and the relative performance
levels off.
[Figure 4: two panels, Null-handler Protocol (left) and Fetch-block Protocol (right); x-axis: Compute Utilization (U); y-axis: execution time of Floating normalized to Fixed; one curve per number of processors per node]
FIGURE 4. Relative performance of Fixed and Floating with varying number of processors
The figures compare execution times of Fixed and Floating versus thread compute-utilization (U). The figures plot
execution time of Floating normalized to Fixed while varying the number of processors per node ($O_{int}$ = 200 cycles).
Values over the horizontal lines indicate better performance under the Fixed policy.
At low compute-utilizations, Fixed saturates more quickly 
than Floating with an increase in the number of processors; 
request rates in Floating remain lower than those in Fixed 
because an acting protocol processor must return to computa- 
tion before it can contribute to request traffic. As such, higher 
request rates in Fixed saturate the protocol processor with
fewer processors. At saturation, however, lower occupancy in
Fixed nearly doubles the performance over Floating. 
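The saturation behavior can be illustrated with a standard asymptotic throughput bound for a single shared resource. This is our own simplification and not the model or simulator used in this study: it assumes each thread cycles through computation plus one round trip without contention, and that every request consumes a fixed protocol processor occupancy. All numeric values in the usage note are hypothetical.

```python
def request_throughput(compute_processors: int, cycle_time_us: float,
                       occupancy_us: float) -> float:
    """Upper bound on requests per microsecond for one node.

    cycle_time_us is the uncontended time per iteration of one thread
    (computation plus round trip); occupancy_us is the protocol processor
    time consumed per request.
    """
    offered = compute_processors / cycle_time_us  # all threads running freely
    capacity = 1.0 / occupancy_us                 # protocol processor limit
    return min(offered, capacity)


def saturation_point(cycle_time_us: float, occupancy_us: float) -> float:
    """Number of compute processors at which the protocol processor saturates."""
    return cycle_time_us / occupancy_us
```

With a hypothetical 10 µs cycle time and 2 µs occupancy, the protocol processor saturates at five compute processors; beyond that point throughput depends only on occupancy, so halving the occupancy doubles the bound, mirroring the near doubling reported at saturation.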
Compute-intensive programs take advantage of the extra 
compute processor in Floating to improve computation time. 
An increase in the number of processors, however, gradually 
diminishes Floating’s advantage over Fixed because the added 
benefit of an extra compute processor becomes insignificant. 
Figure 4 (right) plots the same graphs for the fetch-block 
protocol. Much like the null-handler protocol, Fixed outper- 
forms Floating when protocol processor utilization is high, i.e., 
there are more than two processors per node and compute-utili- 
zation is low. At saturation, however, Fixed improves perfor- 
mance only by 20% because the overheads it saves are a small 
fraction of total (per-request) protocol processor occupancy. 
Cost/Performance. Cost-performance also varies with an 
increase in the number of processors. Cost-performance only 
improves if the performance improvement from an extra processor
is larger than the cost-increment. Adding processors to a
node helps reduce the computation time, but also increases 
contention for the protocol processor. When communication 
becomes the bottleneck, cost-performance degrades with each 
extra (compute) processor. Adding a dedicated protocol proces- 
sor, however, may improve cost-performance by decreasing 
protocol processor occupancy and increasing throughput. 
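The condition stated above can be written out explicitly. The following is a sketch under stated assumptions: it takes cost-performance to be the ratio of costup to speedup, both relative to a uniprocessor node, and assumes node cost grows linearly with each extra processor costing a fraction c of a uniprocessor node (c = 0.3 in the Figure 5 and Figure 7 captions). The symbols n and c are introduced here for illustration.

```latex
\[
  \text{cost-performance}(n) = \frac{\text{costup}(n)}{\text{speedup}(n)},
  \qquad
  \text{costup}(n) = 1 + c\,(n - 1)
\]
\[
  \text{cost-performance}(n+1) < \text{cost-performance}(n)
  \iff
  \frac{\text{speedup}(n+1)}{\text{speedup}(n)}
  > \frac{\text{costup}(n+1)}{\text{costup}(n)}
  = 1 + \frac{c}{1 + c\,(n - 1)}
\]
```

Under these assumptions with c = 0.3, the relative cost-increment of one more processor is 30% at n = 1, about 23% at n = 2, and about 19% at n = 3, so the performance gain needed to justify each additional processor shrinks as the node grows.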
Figure 5 (left) illustrates cost-performance for our null-handler
protocol. The figure plots the cost-performance ratio, where
1 represents a uniprocessor node. We examine both policies at
two compute-utilizations, against the number of processors per
node. Values under the horizontal line (at 1) correspond to systems
that are cost-effective, i.e., systems with better (lower)
cost-performance than a uniprocessor node. Adding a dedicated
protocol processor to a uniprocessor node is cost-effective
for both low (U=0.3) and high (U=0.7) compute-utilizations.
Thus, the overhead saved by a dedicated protocol
processor justifies the additional cost. Using the second proces- 
sor for computation takes advantage of the extra parallelism 
available in the benchmark and improves performance further, 
resulting in an even more cost-effective system. 
Surprisingly, the Fixed policy always provides the most 
cost-effective system for our null-handler protocol. Fixed can 
always accommodate a larger number of (compute) processors 
than Floating, because of its lower protocol processor occupancy.
A larger number of processors also reduces the relative
cost-increment from an additional processor. The combined
effect drives cost-performance lower under Fixed. 
Figure 5 (right) illustrates the cost-performance graphs for 
the fetch-block protocol. Unlike the null-handler protocol, add- 
ing a dedicated protocol processor to a uniprocessor node is not 
cost-effective, whereas using a second processor for computation
is. Similarly, Fixed is no longer most cost-effective. Communication
rapidly becomes a bottleneck (with more than two
processors per node), but the reduction in protocol processor
occupancy from Fixed is not high enough to overcome the cost.
Even at higher compute-utilizations (U=0.7), the relative cost-increment
is high enough to prevent Fixed from improving
cost-effectiveness over Floating.
[Figure 5 plots: null-handler protocol (left) and fetch-block protocol (right); x-axis: Number of Processors per Node (2 to 5); y-axis: cost-performance (ticks 0.20 to 1.40); curves: Floating, U=0.3; Fixed, U=0.3; Floating, U=0.7; Fixed, U=0.7.]
6 Macrobenchmark Analysis 
Although microbenchmark analysis helps develop intuition
about relative performance, it makes many simplifying
assumptions. For example, our experiments ignored synchroni- 
zation, burstiness of communication, cache effects due to large 
data sets, and bandwidth limitations of the memory bus. In this 
section, we re-examine our policies in the context of a network 
of eight multiprocessor workstations, each with five processors. 
FIGURE 5. Cost-performance of Fixed and Floating with varying number of processors 
The figures plot cost-performance of Fixed and Floating (Oint = 200 cycles) varying the number of processors per 
node for two values of compute-utilization (U=0.3 and U=0.7). Costups and speedups are calculated with respect to a 
uniprocessor node (Single). The graphs assume that the cost of a processor is 30% of the cost of a uniprocessor 
node. Values over the horizontal lines indicate design points that are not cost-effective. 
Name      Input Data Set
appbt     24 x 24 x 24 cubes, 12 iters
barnes    8192 bodies
em3d      38400 nodes, 15% remote, 40 iterations
gauss     960 x 960 matrix
tomcatv   960 x [illegible] matrices, 10 iterations
TABLE 1. Application input parameters 
Table 1 lists the applications and corresponding input data 
sets we use in this study. Appbt is a three-dimensional fluid 
dynamics application [7]. Barnes is an N-body simulation from 
the SPLASH-2 suite [29]. Em3d models the propagation of 
electromagnetic waves in three dimensions [5]. Gauss solves a 
linear system of equations using Gaussian elimination [3]. 
Tomcatv is a parallel version of the SPEC benchmark. 
Our transparent distributed shared-memory system uses a
128-byte Stache [23] protocol to keep data coherent between
nodes; intra-node communication occurs through the MOESI
coherence protocol on the bus. We measure a minimum running
time for the request handler to be 125 cycles, and reply
and response handlers to be 140 cycles for a total of 900 cycles
(4.5 μs) of round-trip latency.
6.1 Baseline System 
Figure 6 compares the performance of Fixed and Floating 
with varying number of processors per node. Except for em3d, 
adding a dedicated protocol processor to a uniprocessor node 
improves performance by at most 25%. Em3d is our most
communication-intensive application with a compute-utilization of
less than 50%. The application iterates over a bipartite graph,
computing new values for each graph node. Fetching remote
node values dominates the running time of an iteration. Eliminating
interrupt overhead allows Fixed to improve performance
by 63%.
Using the second processor for computation (under the
Floating policy) improves performance by 54%-98% in all
the applications. Appbt, barnes, gauss and tomcatv all have
moderate to high compute-utilizations and can take advantage
of the second compute processor. In em3d, the second processor
both contributes to computation and alternates with the
other processor to act as protocol processor.
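The 54%-98% range quoted above can be cross-checked against the normalized execution times in Table 2. The sketch below assumes that the lowest-overhead column of Table 2 corresponds to the baseline system, and divides the Single/Fixed ratio by the Floating/Fixed ratio to obtain the speedup of two-processor Floating over Single. The dictionary and function names are illustrative and not part of the original study.

```python
# Execution times normalized to two-processor Fixed, as read from the
# lowest-overhead column of Table 2 (assumed here to be the baseline system).
single_over_fixed = {"appbt": 1.25, "barnes": 1.19, "em3d": 1.63,
                     "gauss": 1.05, "tomcatv": 1.05}
floating_over_fixed = {"appbt": 0.76, "barnes": 0.68, "em3d": 1.06,
                       "gauss": 0.63, "tomcatv": 0.53}


def improvement_percent(app):
    """Performance gain of two-processor Floating over Single, in percent."""
    # The common Fixed normalization cancels when the two ratios are divided.
    speedup = single_over_fixed[app] / floating_over_fixed[app]
    return (speedup - 1.0) * 100.0


gains = {app: round(improvement_percent(app)) for app in single_over_fixed}
for app, gain in gains.items():
    print(f"{app}: {gain}%")
```

Under this reading, em3d gives the smallest gain and tomcatv the largest, which matches the two ends of the quoted range.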
As we increase the number of processors per node, we
increase both computational resources and demand for protocol
processing. Tomcatv is our most compute-bound application
and primarily benefits from addition of compute
processors. Compute-utilizations in appbt and barnes are at
moderate levels (approximately 70%). Protocol processing in these
applications begins to dominate running time with three or more
processors per node. The dedicated protocol processor in Fixed
reduces occupancy and improves performance over Floating at
four processors per node.
Gauss also exhibits a moderate level of compute-utilization.
Because communication in gauss is synchronous, an idle processor
remains the acting protocol processor during the communication
phase. As a result, Floating mimics the behavior of
Fixed and stays competitive at four processors per node.
Em3d, with its low compute-utilization, manages to saturate
the protocol processor with only two (compute) processors.
Although two or more processors virtually eliminate all
the interrupts, at saturation point the indirect overhead of a
floating protocol processor reduces the performance under Floating
to 70% of that under Fixed.
[Figure 6 plot: one panel per application (appbt, barnes, em3d, gauss, tomcatv); x-axis: Number of Processors per Node (1 to 4); y-axis ticks 0 to 500, axis label not recoverable.]
FIGURE 6. Performance of Fixed and Floating with varying number of processors per node 
6.2 Interrupt Overhead 
The performance of Floating (Single) is sensitive to how
quickly the system can interrupt a processor and dispatch a protocol
handler. Today's commercial operating systems do not
provide fast delivery of user-level interrupts. Exception handling
on these systems can take up to 200 μs [26], one to two
orders of magnitude longer than that on some carefully tuned
parallel computers [22]. In this experiment we study the sensitivity
of the policy trade-off to interrupt overheads.
Table 2 presents execution times of Single and two-processor
Floating, normalized to two-processor Fixed for three
values of interrupt overhead. As predicted by our microbenchmark
analysis, very high interrupt overheads severely impact
the performance of Single. Increasing interrupt overhead by
two orders of magnitude can increase the running time of Single
by over 800%. This result corroborates the observation that
with stock operating systems, networks of workstations
(NOWs) [1] may have to rely on program instrumentation
[27,14] to perform periodic polling.
              Single/Fixed          Floating/Fixed
Application   (three overheads)     (three overheads)
appbt         1.25  1.78  5.14      0.76  0.94  1.95
barnes        1.19  1.51  3.63      0.68  0.76  1.04
em3d          1.63  2.32  8.76      1.06  1.07  [illegible]
gauss         1.05  1.18  2.57      0.63  0.66  [missing]
tomcatv       1.05  1.18  1.88      0.53  0.54  [missing]
[The three interrupt-overhead column headings are not recoverable; columns are listed in order of increasing interrupt overhead.]
TABLE 2. Sensitivity to interrupt overhead 
The table presents execution times of Single and
two-processor Floating normalized to two-processor
Fixed for various interrupt overheads. Numbers
appearing in boldface indicate points where Fixed
outperforms Floating.
Another observation, consistent with our microbenchmark 
results, is that very high interrupt overheads have a much 
smaller impact on the performance of Floating than Single. In 
all the applications, a two orders of magnitude increase in inter- 
rupt overhead slows the program down by at most 160%. This 
is because an idle processor acting as protocol processor elimi- 
nates many of the interrupts. High interrupt overhead has the 
largest impact on appbt, because this application uses spin
locks to synchronize threads in a Gaussian elimination phase.
As such, an idle processor spinning on a lock takes an interrupt
upon arrival of every message. 
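The slowdown bound quoted above can be illustrated with the two Floating rows of Table 2 that are fully legible. This is a sketch under stated assumptions: it takes the first and last Floating/Fixed entries to be the lowest and highest interrupt overheads, and assumes the two-processor Fixed time used for normalization does not change with interrupt overhead, so the ratio of the two entries gives the slowdown of Floating itself. The names below are illustrative.

```python
# Two-processor Floating times normalized to Fixed at the lowest and highest
# interrupt overheads, for the two rows that are fully legible in Table 2.
floating_low_high = {"appbt": (0.76, 1.95), "barnes": (0.68, 1.04)}


def slowdown_percent(low, high):
    """Percentage increase in running time from the low to the high overhead."""
    # Assumes the Fixed baseline is the same in both columns.
    return (high / low - 1.0) * 100.0


slowdowns = {app: slowdown_percent(low, high)
             for app, (low, high) in floating_low_high.items()}
for app, pct in slowdowns.items():
    print(f"{app}: {pct:.0f}% slower")
```

Under these assumptions appbt shows the larger slowdown of the two and stays just under the 160% bound.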
[Figure 7 plots: Oint = 200 cycles (1 μs) (left) and Oint = 2000 cycles (10 μs) (right); x-axis: Number of Processors per Node (2 to 5); curves: Floating, barnes; Fixed, barnes; Floating, tomcatv; Fixed, tomcatv.]
FIGURE 7. Cost-performance of Fixed and Floating with varying number of processors 
The figures plot cost-performance of Fixed and Floating varying the number of processors per node for barnes and 
tomcatv. Costups and speedups are calculated with respect to a uniprocessor node (Single). The graphs assume that 
the cost of a processor is 30% of the cost of a node. Values over the horizontal lines indicate design points that are not 
cost-effective. 
6.3 Cost/Performance 
Figure 7 plots cost-performance for two applications with
moderate (barnes) to high (tomcatv) compute-utilizations versus
the number of processors. The graphs indicate that adding a
dedicated protocol processor to a uniprocessor node is never
cost-effective for the lower interrupt overhead (left). This is not
surprising since performance improves by at most 20%
whereas the system cost goes up by 30%. When overhead is
high (right), performance in barnes improves by 50%, justifying
the cost of the dedicated protocol processor. Computation
in tomcatv remains the dominant factor in the running time.
Even with higher interrupt overhead the program benefits little
from a dedicated protocol processor. A second compute processor,
however, improves performance in the two applications
by at least 70% and is therefore cost-effective.
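The cost-effectiveness judgments in this paragraph follow from simple arithmetic. The sketch below assumes cost-performance is costup divided by speedup relative to a uniprocessor node, with each added processor costing 30% of a node as stated in the Figure 7 caption. The function name and case labels are illustrative; the speedups are the percentages quoted in the text.

```python
PROCESSOR_COST = 0.30  # one extra processor, as a fraction of a uniprocessor node


def cost_performance(extra_processors, speedup):
    """Costup divided by speedup; values below 1 are cost-effective."""
    costup = 1.0 + PROCESSOR_COST * extra_processors
    return costup / speedup


# Speedups over a uniprocessor node, taken from the percentages in the text.
cases = {
    "dedicated protocol processor, low overhead (20% gain)": 1.20,
    "dedicated protocol processor, barnes, high overhead (50% gain)": 1.50,
    "second compute processor (70% gain)": 1.70,
}
for label, speedup in cases.items():
    ratio = cost_performance(1, speedup)
    verdict = "cost-effective" if ratio < 1.0 else "not cost-effective"
    print(f"{label}: {ratio:.2f} ({verdict})")
```

Only the first case yields a ratio above 1, which is consistent with the conclusion that a dedicated protocol processor does not pay for itself at the lower interrupt overhead.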
Much as our microbenchmarks predicted, when interrupt
overhead is low (as compared to protocol weight), the system
is most cost-effective under the Floating policy; for barnes,
cost-performance under Fixed reaches a minimum close to, but
not the same as, that under Floating. Tomcatv speeds up linearly
and therefore always reaches a lower cost-performance
under Floating. When the number of processors is large enough
(> 6), speedup dominates cost-performance in tomcatv, causing
it to eventually level off. At this point, Floating results in a marginal
improvement in cost-performance over Fixed.
High interrupt overhead, however, changes the balance.
Barnes achieves a minimum cost-performance under the Fixed
policy. The high overhead increases protocol processor occupancy,
resulting in a higher ratio of protocol processing to running
time. The Fixed policy reduces protocol processor occupancy,
allowing the protocol processor to accommodate a larger number
of processors before protocol processing saturates. At this
point, the performance improvement due to a dedicated protocol
processor is large enough to offset its incremental cost.
Floating remains most cost-effective for the more compute-intensive
application, tomcatv. High interrupt overhead, however,
slightly closes the gap in cost-performance between the
two policies for this application.
7 Summary and Conclusions 
In this paper, we examined how protocol processing
should be scheduled on an SMP node parallel machine. Previous
systems such as the Intel Paragon have dedicated a processor
specifically for protocol processing. Others have recently
argued that all processors should be used for both computation
and communication [12,6]. We examined when it does and
does not make sense to dedicate a protocol processor.
We presented results from synthetic benchmarks for two
general request/reply protocols to illustrate the trade-offs
between the policies. The results showed that: i) a dedicated
protocol processor benefits light-weight protocols much more
than heavy-weight protocols; ii) Fixed improves performance
over Floating when communication becomes the bottleneck,
which is more likely when the application is very
communication-intensive, protocol overheads are very high, or there are
multiple (i.e., more than two) processors per node; iii) a system
with optimal cost-effectiveness is likely to include a dedicated
protocol processor when overheads are a significant component
of protocol processor occupancy.
Finally, we evaluated these policies in the context of a
fine-grain user-level distributed shared-memory system. We
presented results from simulating a network of eight multiprocessor
workstations (each with up to five processors) running
five shared-memory applications using a software coherence
protocol. Besides corroborating our findings from the first
experiment, the results also show that bursty and synchronous
communication patterns in some applications reduce overhead
and therefore decrease the benefit of the Fixed policy.
We would like to thank Doug Burger, Jim Goodman, 
Mark Hill, Stefanos Kaxiras, and Anne Rogers for their helpful 
comments on earlier drafts of this paper. 
References 
[1] Tom Anderson, David Culler, and David Patterson. A Case for NOW
(Networks of Workstations). IEEE Micro, 15(1):54-64, February 1995.
[2] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation
and Performance of Munin. In Proceedings of the 13th ACM Symposium
on Operating System Principles (SOSP), pages 152-164, October 1991.
[3] Satish Chandra, James R. Larus, and Anne Rogers. Where is Time Spent
in Message-Passing and Shared-Memory Programs. In Proceedings of
the Sixth International Conference on Architectural Support for
Programming Languages and Operating Systems, pages 61-73, San Jose,
California, 1994.
[4] Derek Chiou, Boon S. Ang, Arvind, Michael J. Beckerle, Andy Boughton,
Robert Greiner, James E. Hicks, and James C. Hoe. StarT-NG: Delivering
Seamless Parallel Computing. In Proceedings of EURO-PAR '95, 1995.
[5] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta,
T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In
Proceedings of Supercomputing '93, pages 262-273, November 1993.
[6] Andrew Erlichson, Neal Nuckolls, Greg Chesson, and John Hennessy.
SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual
Shared Memory. In Proceedings of the Seventh International Conference
on Architectural Support for Programming Languages and Operating
Systems (ASPLOS VII), October 1996.
[7] Babak Falsafi, Alvin Lebeck, Steven Reinhardt, Ioannis Schoinas,
Mark D. Hill, James Larus, Anne Rogers, and David Wood. Application-
Specific Protocols for User-Level Shared Memory. In Proceedings of
Supercomputing '94, pages 380-389, November 1994.
[8] W. Daniel Hillis and Lewis W. Tucker. The CM-5 Connection Machine:
A Scalable Supercomputer. Communications of the ACM, 36(11):31-40,
November 1993.
[9] Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and
John Hennessy. The Effects of Latency, Occupancy, and Bandwidth in
Distributed Shared Memory Multiprocessors. Technical Report CSL-TR-
95-660, Computer Systems Laboratory, Stanford University, January
1995.
[10] Kirk L. Johnson, M. Frans Kaashoek, and Deborah A. Wallach. CRL:
High-Performance All-Software Distributed Shared Memory. In
Proceedings of the 15th ACM Symposium on Operating System Principles
(SOSP), pages 213-228, December 1995.
[11] Vijay Karamcheti and Andrew A. Chien. Software Overhead in Message
Layer: Where Does the Time Go? In Proceedings of the Sixth International
Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS VI), 1994.
[12] Magnus Karlsson and Per Stenström. Performance Evaluation of a
Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based
Multiprocessor Servers. In Proceedings of the Second IEEE Symposium
on High-Performance Computer Architecture, February 1996.
[13] Jeffrey Kuskin et al. The Stanford FLASH Multiprocessor. In
Proceedings of the 21st Annual International Symposium on Computer
Architecture, pages 302-313, April 1994.
[14] James R. Larus and Eric Schnarr. EEL: Machine-Independent Executable
Editing. In Proceedings of the SIGPLAN '95 Conference on Programming
Language Design and Implementation (PLDI), pages 291-300,
June 1995.
[15] Beng-Hong Lim, Philip Heidelberger, Pratap Pattnaik, and Marc Snir.
Message Proxies for Efficient, Protected Communication on SMP
Clusters. In Proceedings of the Third IEEE Symposium on High-Performance
Computer Architecture, February 1997.
[16] Tom Lovett and Russell Clapp. STiNG: A CC-NUMA Computer System
for the Commercial Marketplace. In Proceedings of the 23rd Annual
International Symposium on Computer Architecture, May 1996.
[17] Meiko World Inc. Computing Surface 2 Overview Documentation Set,
1993.
[18] David Mosberger, Larry L. Peterson, and Sean O'Malley. Protocol
Latency: MIPS and Reality. Technical Report TR 95-02, Department of
Computer Science, University of Arizona, 1995.
[19] Shubhendu S. Mukherjee, Babak Falsafi, Mark D. Hill, and David A.
Wood. Coherent Network Interfaces for Fine-Grain Communication. In
Proceedings of the 23rd Annual International Symposium on Computer
Architecture, pages 247-258, May 1996.
[20] Steven K. Reinhardt, Robert W. Pfile, and David A. Wood. Decoupled
Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd
Annual International Symposium on Computer Architecture, May 1996.
[21] Steven K. Reinhardt. Tempest Interface Specification (Revision 1.2.1).
Technical Report 1267, Computer Sciences Department, University of
Wisconsin-Madison, February 1995.
[22] Steven K. Reinhardt, Babak Falsafi, and David A. Wood. Kernel Support
for the Wisconsin Wind Tunnel. In Proceedings of the Usenix Symposium
on Microkernels and Other Kernel Architectures, September 1993.
[23] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and
Typhoon: User-Level Shared Memory. In Proceedings of the 21st Annual
International Symposium on Computer Architecture, pages 325-337,
April 1994.
[24] James D. Salehi, James F. Kurose, and Don Towsley. Scheduling for
cache affinity in parallelized communication protocols. Technical Report
UM-CS-1994-075, University of Massachusetts, October 1994.
[25] Ioannis Schoinas, Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt,
James R. Larus, and David A. Wood. Fine-grain Access Control for
Distributed Shared Memory. In Proceedings of the Sixth International
Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS VI), pages 297-307, October 1994.
[26] Chandramohan A. Thekkath and Henry M. Levy. Hardware and Software
Support for Efficient Exception Handling. In Proceedings of the
Sixth International Conference on Architectural Support for Programming
Languages and Operating Systems, pages 110-119, San Jose,
California, 1994.
[27] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and
Klaus Erik Schauser. Active Messages: a Mechanism for Integrating
Communication and Computation. In Proceedings of the 19th Annual
International Symposium on Computer Architecture, pages 256-266, May
1992.
[28] David Womble, David Greenberg, Stephen Wheat, and Rolf Riesen. LU
Factorization and the LINPACK benchmark on the Intel Paragon.
Technical Report ftp://ftp.cs.sandiagov/pu~pa~~dewombv
paragon-linpack-benchmarkqs, Sandia National Laboratories, March
1994.
[29] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal
Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization
and Methodological Considerations. In Proceedings of the 22nd Annual
International Symposium on Computer Architecture, pages 24-36, July
1995.
[30] David A. Wood and Mark D. Hill. Cost-Effective Parallel Computing.
IEEE Computer, 28(2):69-72, February 1995.
