trunk from each link of its route at the time that the call arrives, and simultaneously releases these trunks at the time that the call finishes. Typical parameters for a large network, such as the AT&T circuit-switched network, are N _ 100, with almost all of the _ 5000 links having non-zero capacity, and a total of _ 1 million trunks. (We will use the AT&T network as a guide for constructing realistic simulation scenarios.
However, the routing policy we consider is different, being far simpler, than the policy used in the AT&T network.)
Call routing involves alternate-routing and trunk-reservation mechanisms [7] . Alternate-routing allows for the sharing of excess capacity:
• A call between a node-pair {i, j} is accepted on link {i, j} if that link is not full to capacity.
* Otherwise, a third node v, termed the via, is selected, and the call is accepted on the two link path {i, v}, {v,j}, if both links are not reserved.
• Otherwise, the call is blocked.
Under randomized-routing, the choice of via v is made by independent sampling from a probability distribution over the N -2 possibilities, which may dopend on the parameters of node-pair {i,j}, but not on A key difficulty in the design of a massively parallel simulation of the network is coping with asymmetries.
In realistic networks, the call arrival rates may vary by three orders of magnitude over the node-pairs. 
C+E,
where C (0 _< C < 1) is a fixed constant and E is exponentially distributed with mean 1 -C. Thus, the average holding time is one, and the units of the arrivai rate are "erlangs" [7] . Typically, in switched network simulation studies one assumes purely exponential holding times (C = 0). However, it is more realistic to allow for an initial constant delay C. Furthermore, we will see later that including this delay actually improves the efficiency of the parallel simulation of the system. As shown in Section 6, system performance measures such as blocking, turn out to be rather insensitive to C. In this section, we give a high level description Of the sweep algorithm, leaving some of the details to Section 5.
Arrival events at a given node-pair {i, j} are of One of two types: In brief, the idea is to consider an absorbing Markov process describing the number of calls present that have completed the deterministic C delay in their holding times: when k are present, one of the k finishes (thereby stopping the process) at rate k(1 -C), whereas another such call arrives at rate A. Analysis of this process provides the expected number of events within [s + C, t) as
where B(., .) is the Beta-function. Summarizing, the expected number of events within the window is
For large networks, such as the AT&T network, the aggregate call arrival rate ,_ is on the order of 1 million erlangs, and the number of events in the window will be large, even for small C, as will be seen in Section 6. 5 Implementation
An implementation of the sweep algorithm must address:
• the mapping of possibly unbalanced node-pair computations into the parallel computer, and
• the arrival event feasibility and state computations associated with the node-pairs.
We discuss these two issues in turn. Adaptations for a MIMD implementation are described in Section 5.3.
The Mapping
Impose an order on the processors in the parallel computer, and an order on the N(N -1)/2 node-pairs.
We dedicate a fixed number Pi,j _> 1 of consecutive processors to node-pair {i,j}; the index of the first of these is obtained by summing the values P_,_ for node-pairs (u, v} earlier than {i, j} in the node-pair order.
To simplify the discussion, consider one node-pair {i, j}. Suppose the current simulation window is [s, t).
To map the events of node-pair {i, j} into the P = Pi,j processors, assign the U h in the processor order to store and manage all events that fall into the U h subwindow
..., P. In this way, each event, identified by a node-pair {i,j} and a time u within the window s < u < t, maps into a unique processor.
In ourimplementation, wefurtherrestrictthe mapping bysettingthe extent of eachwindowto C, the extent of the constant portion of the call holding time, which we assume is non-zero. Thus, the jth window is the interval [(j -1)C, jC Using the bounds, we obtain the following rule for the feasibility of the k th event: Step 1 y f b,b} 0 o {b,c} 0 0 {a,c} 0 0
Step 2 (a,b} 0 0 {b,c} 1 1 {a,c} 1 1
Step 3 lower bound on the number of free trunks available after the last event of the previous processor.
Solving (5) reduces to parallel prefix computation [9] . Given inputs Zl, ..., zn, and an associative operator o, the parallel prefix problem is to compute the n partial products:
To put (5) in this form, we recast it as a matrix recurrence in the semiring where max is the addition operator with identity -ee and + is the multiplication operator with identity 0. Under this interpretation, the distributive law is a + max{b, c} = max{a + b, a + c}, and (5) is expressible as
where and the usual rules of vector and matrix multiplication apply but with scalar addition and multiplication taken to be max and +, respectively. Telescoping (6) we obtain v_ = MtMt-1...
M_vl.
Hence to compute the -ft' it suffices to:
1. solve the parallel prefix problem of computing the partial matrix products
The first step dominates the computational cost. Kruskal et al. [8] show that on a shared memory model, it is possible to solve the parallel prefix problem in O(log n) time using O(n/log n) processors. Their algorithm is easily adapted to the situation at hand, where the n inputs are distributed across P processors.
Taking into account that the distribution of events is random, it can be shown that the computational cost is O(ai,j/P + log P) with high probability where aij (defined in Section 5.1) is the rate at which node-pair {i, j} receives events. If P = Pi,j is taken proportional to ai,j the time becomes O(log P).
Adaptations for a MIMD Implementation
The implementation just described is well-suited for SIMD architectures, and we refer to it as the SIMD implementation.
The sweep algorithm has also been implemented on a MIMD architecture, the Intel iPSC/860
(whichis identical to theiPSC/2[1], except it isbased onthei860processor). Thereareonlytwosignificant differences between theMIMD andSIMDimplementations: the mapping of node-pairs to processors, and thehandling of interprocessor communication. Eachof these is described in turn below.
TheMIMD version mapsmultiplenode-pairs to eachprocessor. Thus,a givennode-pair's eventsare always all on thesameprocessor. Weaccept a node-pair's callarrivalrateasa reasonable estimate of the node-pair's workload, andthenviewthemapping problem asidentical toamultiprocessor scheduling problem where weseek tominimize themakespan ofasetofindependent, non-preemptable tasks. Ourimplementation uses aminorvariationofthewell-known longest processing time first list scheduling algorithm, first analyzed in [6] . We order the node-pairs in decreasing order of arrival rate, then step through the list assigning the next node-pair to the most lightly loaded processor. The partitioning of processors among the node-pairs is done using the arrival rate and routing parameters as described inSection5.1. This choice is validated bythe histogram in figure(6.1), which depicts the observed number of events per window per processor. it appears that this is aboutten timesfasterthananoptimized serial sinmlation runningon a powerful workstation with memory largeenough to retainthesimulation in core.
MIMD Performance
Manyof theperformance metrics examined in theSIMDcase reflect event densities andsweep counts. Such metrics are(forthemostpart)not architecture dependent, andsoareveryclose in bothSIMDandMIMD versions. Theprinciple difference between theversions is theevent processing rate,whichweexamine below.
In theSIMDcaseweobserve that rawperformance increases onsymmetric networks asthe sizeof the problem andthenumber of processors changes. 
