which another processor affects it. Conservative protocols are known to require lookahead in order to avoid deadlock, and to achieve good performance. Lookahead is the ability of a processor to predict its future behavior, as regards when next (in simulation time) it may affect another processor's submodel. Optimistic approaches ([12]) permit a processor to simulate ahead under the anticipation that another processor will not affect its submodel in the "past", but then correct these temporal errors as they occur. Optimistic approaches require state-saving and rollback to function properly. The notions of conservatism and optimism are not mutually exclusive; as observed in [29], the space of synchronization protocols is better partitioned using finer distinctions. This leads to protocols that combine elements of optimism and conservatism.
The synchronization approach we develop in this paper is conservative in all respects. A prin- First, one can rarely analyze the quality of a graph-partitioning solution, except to assert some local optimum condition. Secondly, the objective function does not reflect parallel communication. Our approach has analytic assurances, and seeks to minimize an objective function that better models execution time.
The contributions of this paper are two-fold. One of our contributions is a new heuristic for static mapping, aspects of which can be analytically quantified. In some cases we can bound the deviation of the results from optimal; in other cases we can prove optimality itself. We also extend our earlier work in synchronization and in dynamic remapping decision-making to the TPN simulation problem, and synthesize these adaptations with the new mapping algorithm. We demonstrate a tool that accepts a graphically designed TPN, then automatically maps, synchronizes, and dynamically remaps the simulation executing on a large scale parallel architecture. We report the results of a number of large TPN models, including one of the Thinking Machines CM-1 global routing network, and a slotted-ring parallel architecture. We report good performance, obtained automatically, on large scale parallel architectures. In one case we observe a speedup in excess of 43 on 64 processors of the Intel Touchstone Delta.
The remainder of this paper is organized as follows. In Section §2 we discuss TPN semantics. Section §3 develops synchronization and simulation algorithms, and Section §4 discusses automated mapping algorithms. Finally, Section §5 presents our performance results, and Section §6 gives our conclusions.
• From time s to time $s + \delta_t$ the transition is considered to be firing;
• At time $s + \delta_t$ a token is added to each of t's output places.
We say that the transition firing is enabled at time s, and completes at time $s + \delta_t$. Note that tokens are committed to the transition firing at the time the transition is enabled, not at the point when the transition actually fires. The interpretation that commits tokens only upon firing is also common; we will later discuss how to handle this as well.
We may construct a discrete-event simulation of a TPN whose events are TokenArrival, In the section to follow we show how to implement this algorithm on a parallel computer.
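To make the firing semantics above concrete, the following is a minimal sequential sketch, not the simulator described in this paper. Only the TokenArrival event name comes from the text; the function name `simulate_tpn`, the dictionary layout, the conflict-resolution order (dictionary order), and the requirement that every transition has at least one input place are assumptions made for illustration.

```python
import heapq
from itertools import count

def simulate_tpn(transitions, marking, end_time):
    """transitions: name -> (input places, output places, firing time).
    Tokens are committed when a transition is enabled; output tokens
    arrive when the firing completes. Returns (marking, enable log)."""
    marking = dict(marking)
    pending = []                 # heap of (completion time, tie-breaker, name)
    tie = count()
    log = []                     # (enable time, transition name)

    def start_enabled(now):
        # Assumption: transitions are tried in dictionary order, and each
        # has a non-empty input list (otherwise this loop would not end).
        progress = True
        while progress:
            progress = False
            for name, (ins, outs, delay) in transitions.items():
                if ins and all(marking.get(p, 0) >= 1 for p in ins):
                    for p in ins:
                        marking[p] -= 1          # commit tokens at enable time
                    heapq.heappush(pending, (now + delay, next(tie), name))
                    log.append((now, name))
                    progress = True

    start_enabled(0.0)
    while pending and pending[0][0] <= end_time:
        now, _, name = heapq.heappop(pending)
        for p in transitions[name][1]:           # TokenArrival at each output place
            marking[p] = marking.get(p, 0) + 1
        start_enabled(now)
    return marking, log

net = {"t1": (["p1"], ["p2"], 2.0), "t2": (["p2"], ["p3"], 3.0)}
final_marking, enable_log = simulate_tpn(net, {"p1": 1}, end_time=10.0)
```

In this two-transition chain, t1 is enabled at time 0 and t2 at time 2; the token reaches p3 at time 5. Stopping at an earlier `end_time` leaves the token committed to t2 and visible in no place, which is the consequence of committing tokens at enable time.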
Synchronization
We believe that parallel simulation will be practical primarily when large simulation models are distributed over a moderate number of processors. The remainder of this section is divided into three parts. The first provides general information about synchronization. The second discusses the basic synchronization itself, while the third part extends the method to TPNs where transition firings may be preempted.
Preliminaries
We assume that the TPN model is partitioned by the modeler into logically cohesive subnets we Our solution requires that two rules be followed when aggregating (and pntool enforces these rules).
• All input places for a transition are assigned to the same LP as the transition.
• Every transition t with an output place that is assigned to a different LP than t must have a non-zero firing time.
We turn next to a discussion of the protocol and its integration into the simulation algorithm.
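The two aggregation rules can be checked mechanically. The sketch below is an illustration only and does not reproduce how pntool enforces them; the function name `check_aggregation` and the data layout (dictionaries mapping places and transitions to LP identifiers) are hypothetical.

```python
def check_aggregation(transitions, place_lp, transition_lp):
    """transitions: name -> {"inputs": [...], "outputs": [...], "firing_time": float}
    place_lp / transition_lp: name -> LP identifier (hypothetical layout).
    Returns a list of human-readable rule violations (empty if none)."""
    violations = []
    for name, spec in transitions.items():
        lp = transition_lp[name]
        # Rule 1: every input place lives in the same LP as the transition.
        for place in spec["inputs"]:
            if place_lp[place] != lp:
                violations.append(f"{name}: input place {place} is in another LP")
        # Rule 2: a transition with an off-LP output place needs a
        # non-zero firing time.
        crosses = any(place_lp[place] != lp for place in spec["outputs"])
        if crosses and spec["firing_time"] <= 0:
            violations.append(f"{name}: zero firing time with off-LP output")
    return violations

transitions = {
    "t1": {"inputs": ["a"], "outputs": ["b"], "firing_time": 1.5},
    "t2": {"inputs": ["b"], "outputs": ["a"], "firing_time": 0.0},
}
place_lp = {"a": 0, "b": 1}
transition_lp = {"t1": 0, "t2": 1}
problems = check_aggregation(transitions, place_lp, transition_lp)
```

Here t1 satisfies both rules, while t2 violates the second: its output place `a` belongs to LP 0 but its firing time is zero.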
Parallel Algorithm
The following is a brief overview of the protocol. Suppose that all simulation events in all processors up to (but not including) time $T_{syn}$ have been simulated. Our protocol will compute a simulation time w(
We have proven elsewhere [26] that every future off-processor message the simulation will send has a time-stamp of at least $w(T_{syn})$.
algorithm to find the optimal solution.
Linearization
We seek an ordering that concentrates highly communicating 
For all (A, B), (C, D) matched above, W(AB)(CD)
7. If n is odd then $n = \lfloor n/2 \rfloor + 1$, else $n = \lfloor n/2 \rfloor$. If n > 1 goto step 2.
$$S_j(\pi^{opt}) \le (2^j - 1)\, S_1(\pi^h).$$
The result is obtained by dividing through by $S_j(\pi^h)$. Note that since $S_j(\pi^h) \ge S_1(\pi^h)$ for all j, we get a loose "pre-computation" bound of $2^j - 1$. This bound will be sharpened given measured values of $S_1(\pi^h)$ and $S_j(\pi^h)$.
Another feature of a match/merge linearization is that the sum of edge weights between adjacent LPs is not less than half of optimal.
Lemma 2 Let $w^{opt}$ be the maximum possible sum of edge weights between adjacent LPs in any linear ordering, and let $w^h$ be the sum of such weights under a linearization produced by the match/merge algorithm. Then $w^{opt}/2 \le w^h$.
Such an algorithm is obtained from a modification to the stable marriage problem [31]
$O(nB \log(2^{i-1}B))$; the cost sum over all $\log n$ steps is
This in fact was an unexpected consequence of our CM-1 router example.
LP enumeration has a definite effect on the matches made, and a little care is required for our optimality results to hold. For the specific cases we consider we suppose the LPs to be enumerated in a "natural" way. We presume a ring is enumerated so that adjacent
Proof:
For any integer j, it is obvious that the partition into $2^{k-j}$ pieces maximizing the number of edges between vertices in a common partition element is obtained by grouping the first $2^j$ vertices together, then the next $2^j$, and so on. This is precisely the grouping defined by the pair/merge algorithm. ∎
Lemma 4 Let (G, E) be a hypercube of dimension k, suppose that G is enumerated naturally, and that all edges have unit weight. Then for all j = 1, 2, ..., k, any linear ordering $\pi$ produced by the pair/merge algorithm maximizes $S_j(\pi)$.
Proof:
We first induct on x to prove that the number of edges between members of any subset of x vertices is no greater than $(x \log x)/2$, with logarithms taken base 2. The base case of x = 1 is trivially satisfied. Suppose then that the claim is true for any subset of size x - 1 or smaller, and choose any subset A with x vertices. Split A evenly into two subsets $A_1$ and $A_2$. The number of edges between vertices in A is the sum of the edges on $A_1$ plus the edges on $A_2$ plus the edges between them. There are at most $\lfloor x/2 \rfloor$ edges between them, and by the induction hypothesis the sum of edges on $A_1$ and on $A_2$ is no more than $(x \log(x/2))/2$. Therefore the number of edges on A is no more than
$$\lfloor x/2 \rfloor + \frac{x \log(x/2)}{2} \le \frac{x}{2} + \frac{x(\log x - 1)}{2} = \frac{x \log x}{2},$$
which completes the induction. Now observe that when $x = 2^i$ the bound is met, and is met by sets that themselves form hypercubes (which have $i 2^{i-1}$ edges contained within them). Now at every step i the pair/merge algorithm merges hypercubes of dimension i - 1 into hypercubes of dimension i (a consequence of G being ordered naturally). Thus $S_j(\pi)$ is maximized for each j.
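The edge-count bound used in this proof can be checked by brute force on a small hypercube. The sketch below is only a sanity check, not part of the proof; it assumes the natural enumeration means that vertex v is adjacent to v with any single bit flipped, and the helper names are hypothetical.

```python
import math
from itertools import combinations

def internal_edges(vertices, k):
    """Count hypercube edges with both endpoints in `vertices`
    (vertex v is adjacent to v with one of its k bits flipped)."""
    members = set(vertices)
    edges = 0
    for v in members:
        for d in range(k):
            u = v ^ (1 << d)
            if u in members and v < u:   # count each edge once
                edges += 1
    return edges

def max_internal_edges(k, x):
    """Largest number of internal edges over all x-vertex subsets."""
    return max(internal_edges(s, k) for s in combinations(range(1 << k), x))

def check_bound(k):
    """True if every subset size obeys the (x log2 x)/2 bound and the
    first 2**i naturally numbered vertices meet it with i * 2**(i-1) edges."""
    for x in range(1, (1 << k) + 1):
        if max_internal_edges(k, x) > x * math.log2(x) / 2 + 1e-9:
            return False
    return all(internal_edges(range(1 << i), k) == i * (1 << i) // 2
               for i in range(k + 1))

ok = check_bound(3)
```

Exhaustive enumeration is only feasible for small k; the point is to illustrate that consecutive blocks of $2^i$ naturally numbered vertices form sub-hypercubes that attain the bound.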
• Rings and hypercubes are special cases of k-ary n-cube networks. We believe our results might be generalized to such networks where k is a power of two.
Chain Mappings
Suppose that some linearization of the LPs is given.
The most general formulation of the remaining mapping problem allows any two LPs to have non-zero communication costs. A dynamic programming formulation solves the problem in O(Pn _) time, P being the number of processors.
To see this, let C(j,p) be the optimal bottleneck cost achievable mapping LPs 1 through j onto p processors. Then the principle of optimality asserts that
• The second conclusion shows that in balanced hypercubes, for moderate values of $e_w$ it is optimal to assign equal sized hypercubes of smaller dimension to each processor, as does pair/merge/map.
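Returning to the dynamic programming formulation over C(j,p), the following sketch shows one way such a chain mapping could be computed. It is an illustration under a simplifying assumption that is ours, not a statement of the paper's cost model: the cost of a processor is taken to be the sum of its LPs' weights plus the weights of all edges leaving its block, so that a block's cost does not depend on how the remaining LPs are split. The names `chain_map`, `vertex_w` and `edge_w` are hypothetical.

```python
def chain_map(vertex_w, edge_w, P):
    """vertex_w[i]: weight of the i-th LP in the linearization (0-based).
    edge_w: {(a, b): weight} communication between LPs a and b.
    Returns (bottleneck, blocks) with blocks as half-open (start, end)."""
    n = len(vertex_w)

    def block_cost(i, j):
        # Assumed cost: work inside the block plus every edge leaving it.
        work = sum(vertex_w[i:j])
        comm = sum(w for (a, b), w in edge_w.items()
                   if (i <= a < j) != (i <= b < j))
        return work + comm

    cost = {(i, j): block_cost(i, j)
            for i in range(n) for j in range(i + 1, n + 1)}
    INF = float("inf")
    C = [[INF] * (P + 1) for _ in range(n + 1)]   # C[j][p]: first j LPs on p processors
    back = [[0] * (P + 1) for _ in range(n + 1)]
    C[0][0] = 0
    for p in range(1, P + 1):
        C[0][p] = 0                               # processors may be left empty
        for j in range(1, n + 1):
            for i in range(j):                    # last block is LPs i..j-1
                cand = max(C[i][p - 1], cost[(i, j)])
                if cand < C[j][p]:
                    C[j][p], back[j][p] = cand, i
    blocks, j, p = [], n, P
    while j > 0:                                  # walk back through the cut points
        i = back[j][p]
        blocks.append((i, j))
        j, p = i, p - 1
    return C[n][P], blocks[::-1]

best, blocks = chain_map([4, 1, 1, 4], {(0, 1): 1, (1, 2): 5, (2, 3): 1}, P=2)
```

In the small example the heavy edge between the two middle LPs keeps them on the same processor, because cutting it would charge its weight to both blocks.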
Lemma 6 Let (G, E) be a hypercube, enumerated naturally, with unit edge weights, and $|G| = 2^k$. Suppose every vertex has common weight $e_w > k + 2/\ln 2$. Then for any power-of-two number of processors $2^j \le 2^k$, the pair/merge/map algorithm minimizes the bottleneck over all possible partitions of the hypercube into up to $2^j$ pieces.
Proof:
Consider a processor assigned any x LPs. The proof of Lemma 4 shows that the sum of edges between LPs on that processor is no greater than $(x/2) \log x$; hence there are at least $kx - 2x \log x$ edges to LPs on other processors. The function
$$f(x) = e_w x + kx - 2x \log x$$
is thus a lower bound on the cost of assigning x LPs to a processor. Note that the bound is achieved if the set forms a hypercube. Considering x as continuous, we have
$$f'(x) = e_w + k - 2 \log x - \frac{2}{\ln 2}.$$
Note that $\log x$ is maximized when equal to k, hence f is increasing over
$f(x_2), \ldots, f(x_{2^j})\}$ is a lower bound on the bottleneck cost of the assignment. Since f is increasing, g is minimized when the maximum $x_i$ is as small as possible, that is, when the $x_i$'s are identically $2^{k-j}$. This situation is achieved when the LPs are partitioned into $2^j$ hypercubes; furthermore the value of g then is also exactly the bottleneck. If G is enumerated naturally, the pair/merge/map algorithm will produce this assignment.
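The conclusion of Lemma 6 can be illustrated by exhaustive search on very small hypercubes. The sketch below assumes, as our reading of the proof, that a processor's cost is the common vertex weight times its number of LPs plus the number of unit-weight edges to LPs on other processors; the function names are hypothetical, and the search is only feasible for tiny k.

```python
from itertools import product

def bottleneck(assign, k, e_w):
    """assign[v]: processor of vertex v; cost per processor is assumed to be
    e_w per LP plus one per edge leaving the processor."""
    load = {}
    for v, p in enumerate(assign):
        external = sum(1 for d in range(k) if assign[v ^ (1 << d)] != p)
        load[p] = load.get(p, 0) + e_w + external
    return max(load.values())

def brute_force_min(k, parts, e_w):
    """Minimum bottleneck over every assignment to at most `parts` processors."""
    return min(bottleneck(a, k, e_w)
               for a in product(range(parts), repeat=1 << k))

def natural_blocks(k, j):
    """Consecutive blocks of 2**(k-j) naturally numbered vertices."""
    size = 1 << (k - j)
    return [v // size for v in range(1 << k)]

# k = 3 requires e_w > 3 + 2/ln 2 (about 5.89); e_w = 6 is used here.
same = brute_force_min(3, 2, 6) == bottleneck(natural_blocks(3, 1), 3, 6)
```

For the 3-cube split over two processors, the block assignment places one 2-dimensional sub-hypercube on each processor, and no other assignment found by the search does better.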
• Other situations where the pair/merge/map approach finds optimal solutions occur as a result of the definition of the bottleneck cost. Any solution that minimizes the bottleneck can be embedded in a linearization. For example, given the optimal mapping we can renumber the LPs assigned to processor 1 starting at 1, then carry over the enumeration to LPs assigned to processor 2, and so on. However, a large number of linearizations are equivalent in the sense that the chain mapping algorithm will find the optimal bottleneck on them. For example, any permutation of the processor ordering does not affect the bottleneck cost, and does not confuse the chain mapping algorithm.
Likewise, within the LPs assigned to a processor there is an insensitivity to their ordering within the processor. The net effect is that given an optimal solution and an associated linearization $\pi^{opt}$, there are a number of permutations of $\pi^{opt}$ that will not affect the sets of LPs that are co-resident. Given any one of these linearizations the chain mapping algorithm will discover the optimal bottleneck.
Dynamic Remapping
Our approach to dynamic remapping has essentially been laid out before, in [28], with an emphasis on physical computations that exhibit distinct phases. The issue there is to determine with sufficient confidence that a phase change has occurred and that performance will benefit from remapping.
The general approach is to periodically consult an "oracle" that judges whether it is worthwhile to remap now. The oracle's decision is not immediately acted upon, though; instead it is used to update (via Bayes' Theorem) a gain probability that performance will improve by remapping now. The optimal decision policy was shown to be a threshold policy: if the gain probability is larger than some step-specific threshold, then one ought to remap. As computation of the optimal decision thresholds proved to be impractical, a heuristic was proposed that uses a constant, high threshold.
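The oracle-and-threshold idea can be sketched as a single Bayes update followed by a comparison. This is a simplified illustration and not the decision model of [28]: the oracle accuracy parameters `p_yes_given_gain` and `p_yes_given_no_gain`, the threshold value 0.9, and the function names are all assumptions introduced here.

```python
def update_gain_probability(prior, oracle_says_remap,
                            p_yes_given_gain, p_yes_given_no_gain):
    """Posterior probability that remapping now would improve performance,
    given one oracle verdict (Bayes' Theorem with assumed oracle accuracies)."""
    if oracle_says_remap:
        like_gain, like_no_gain = p_yes_given_gain, p_yes_given_no_gain
    else:
        like_gain, like_no_gain = 1 - p_yes_given_gain, 1 - p_yes_given_no_gain
    evidence = like_gain * prior + like_no_gain * (1 - prior)
    return like_gain * prior / evidence

def should_remap(gain_probability, threshold=0.9):
    """Constant, high threshold heuristic (the value 0.9 is an assumption)."""
    return gain_probability > threshold

p = 0.5                                   # assumed initial gain probability
decisions = []
for verdict in (True, True):              # two consecutive "remap" verdicts
    p = update_gain_probability(p, verdict, 0.8, 0.2)
    decisions.append(should_remap(p))
```

With these assumed accuracies, one positive verdict raises the gain probability from 0.5 to 0.8, which is still below the threshold; a second consecutive positive verdict raises it to about 0.94 and triggers the remap. This is the sense in which the oracle's decision is not acted upon immediately.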
We apply this work to the present context, as follows. The overhead of gathering workloads and projecting remapped performance can be seen as
distribute it to the others. The load time is large, and so many processors (which are shared with many users and which are charged for) are idle during the loading. A more sophisticated implementation could use the concurrent file system.
the gap between the two performance curves.
On this problem the difference is less than 10%, a difference that is increasingly amortized as the problem size grows. With increasing problem size performance gets better, but it is clear that if the growth trend continues, by dimension 9 the event rate is close to its maximal level. The fact that this occurs at a speedup less than 12 is due to the cost of communication on the iPSC/860, which is quite high relative to the speed of the CPU.
These same speedups on the more balanced Intel iPSC/2 are nearly 20% better (but the iPSC/2's CPU is a factor of 7 slower on this problem!).
We have also investigated our synchronization algorithm on various TPN models of mesh- We also see that the dynamic remapping mechanism comes close to achieving the optimal performance possible, given the unbalanced workload. Performance of the balanced workload is nearly perfect for 16 and 32 processors; it falls away at 64 processors owing to the low number of events performed on each processor between synchronizations (50).
Conclusions
This paper studies the problem of automatically parallelizing the discrete-event simulation of large timed Petri-nets executing on parallel architectures. The methods we described have been implemented in a tool where one designs a Petri-net using a graphical tool, and then all remaining steps for parallelization are performed automatically.
We 
