Abstract. This paper presents an analytical model to evaluate the performance of parallel simulation on distributed computing platforms. The proposed model is formalized by two important time components in parallel and distributed processing: computation time and communication time. A conservative parallel simulation of multistage interconnection networks is used as an example in our analytical model. Performance metrics such as elapsed time, speedup and simulation bandwidth associated with different schemes for partitioning/mapping parallel simulation onto distributed processors are evaluated. Our mathematical analysis identifies the major constituents of simulation overheads in these mapping strategies necessary for improving parallel simulation efficiency. We also show that a perfectly balanced workload distribution may not necessarily translate into better performance. On the contrary, we have shown that a balanced mapping of workload may increase communication overheads resulting in a longer simulation elapsed time. Our performance model has been validated against implementation results from a parallel simulation model. The analytical framework is also practical to evaluate the runtime efficiency of other simulation applications which are based on the conservative paradigm.
Introduction
By the use of multiprocessors, the parallel discrete-event simulation (PDES) technique offers potential both for speeding up the sequential discrete-event simulation, and most importantly for increasing the size and complexity of models simulatable within a reasonable amount of time. Two related paradigms, conservative approach and optimistic approach, have been widely discussed [4, 5, 7, 8] . A similarity of these two approaches lies in the process-oriented methodology in partitioning a system to be simulated into loosely coupled components and simulating each component by a process. Interaction among the processes is performed by passing timestamped messages, which usually represent events scheduled, or to be scheduled if feasible, by one process at another. To ensure causality correctness, the conservative approach executes safe events only, meaning that an event, say of occurrence time τ , is selected for execution only when all the events before τ are already executed. On the other hand, the optimistic approach, also called timewarp, executes the events greedily. When causality error occurs, a rollback mechanism annuls those events simulated ahead of time.
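The safe-event rule of the conservative approach can be illustrated with a minimal sketch. The names `LogicalProcess`, `receive` and `safe_events` are hypothetical and are not taken from the simulator described in this paper. The sketch assumes that every input link delivers messages in non-decreasing timestamp order, so the smallest link clock bounds all future arrivals.

```python
import heapq

class LogicalProcess:
    """Hypothetical logical process with one clock per input link."""

    def __init__(self, links):
        # Timestamp of the last message received on each input link.
        self.link_clock = {link: 0.0 for link in links}
        self.pending = []  # min-heap of (timestamp, event name)

    def receive(self, link, timestamp, event=None):
        # A message (a null message when event is None) advances the link clock.
        self.link_clock[link] = max(self.link_clock[link], timestamp)
        if event is not None:
            heapq.heappush(self.pending, (timestamp, event))

    def safe_events(self):
        # An event is treated as safe when its timestamp does not exceed
        # the smallest input-link clock (assumption of this sketch).
        horizon = min(self.link_clock.values())
        executed = []
        while self.pending and self.pending[0][0] <= horizon:
            executed.append(heapq.heappop(self.pending))
        return executed

lp = LogicalProcess(["in0", "in1"])
lp.receive("in0", 5.0, "arrival")
print(lp.safe_events())   # [] because link in1 is still at time 0
lp.receive("in1", 7.0)    # null message carrying only a timestamp
print(lp.safe_events())   # [(5.0, 'arrival')]
```

An optimistic (timewarp) process would instead execute the arrival immediately and roll back if an earlier message later appeared on `in1`.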
Many factors contribute to the performance of a PDES program. Active research areas within this arena include new protocols, mathematical performance analysis, time parallelism, hardware support, load balancing and dynamic memory management [3, 9]. Among them, performance analysis provides an insight into the simulation modelling and identifies the cause of unsatisfactory implementations. The work done in this area is useful to simulationists as it provides the guidance necessary to improve, or even redesign, their programs. This paper focuses on the performance modelling of conservative parallel simulation. Instead of using self-contrived queuing network models, our mathematical analysis is based on realistic multistage interconnection networks (MIN). The proposed simulation and performance models can be applied in practice to study other simulation applications which are based on the conservative paradigm.
† E-mail: teoym@iscs.nus.sg ‡ E-mail: taysengc@iscs.nus.sg
The remainder of this paper is organized as follows. Section 2 gives an overview of the conservative parallel simulation mechanism for modelling and simulating interconnection networks. We discuss the hand-shaking protocols used to ensure the correctness of finite-buffered simulation, and highlight some resolutions to tackle deadlock/livelock problems and a flushing mechanism to resolve combinatoric memory explosion. Section 3 abstracts the performance model. We derive the individual process elapsed time, and analyse the speedup and simulation bandwidth under light traffic and heavy traffic conditions. Section 4 analyses the performance of different partitioning schemes for mapping a simulation application onto a finite number of processors. Section 5 compares the derived metrics from the proposed analytical performance model with results collected from the simulation model. We comment in particular on the effect of load balancing on simulation performance. Section 6 analyses the effect of reducing communication overheads on runtime performance.
Lastly, section 7 contains our concluding remarks.
Conservative parallel simulation model overview
A detailed description of the MIN simulation is available in [10, 11]. In brief, the process-oriented parallel simulation model consists of three types of logical processes: packet generator (G), switching element (SW) and sink (S) (see figure 1). Based on the statistical distributions for inter-arrival time and packet destination, each G process generates access request packets and sends them to the SW process on its link. Each SW process serves as a relaying agent to forward the packets to their destinations. Two buffers of finite size are embedded in each SW process to model a blocking switch. Two types of simulation events are used: arrival and departure. A constant switch delay is used in each SW process. Each S process serves as a destination for access packets and its service time is assumed to be instantaneous. Each process has its own local clock to indicate its simulation progress. A message that is blocked because its receiving buffer has no free space has to wait until buffer space becomes available. The simulation of a finite-buffered interconnection network is more sophisticated than that of an infinite-buffered network. For example, the execution of a departure event in a G or SW process should not send the departing packet to its successor unless the receiving buffer contains at least one unoccupied slot. Such a constraint complicates the simulation because the buffer status of the receiver is not known to the sender. To ensure that packets are not lost at a SW process when the buffer is completely occupied, the simulation processes must adhere to hand-shaking protocols during transmission [10]. That is, before a packet transmission is initiated, the sender must find out whether the receiving buffer contains at least one unfilled slot to accommodate the packet. Six signals/messages are used in packet transmission.
The underlined terms shown below indicate that the signal/message needs to be time-stamped:
The use of hand-shaking protocols during packet transmissions can result in deadlock when the processes happen to be waiting in a loop. A detailed description of relaying protocols to resolve the problem is presented in [12]. In general, a process on receiving a null message or requesting a buffer status should also send null messages, with lookahead timestamps if available, to the other three links. In this way the null messages are circulated among the processes for MIN simulation. If a received null message indicator is greater than all other outstanding event times, a process is safe to proceed with the simulation, thereby resolving a deadlock.
Although the circulation of null messages is able to resolve the deadlock problem, it can cause livelock situations where null messages are generated progressively on a circuit. We break the livelock by giving a higher priority to the execution of a departure event whenever a departure time is equal to a null-message indicator. By using this priority scheme in the event scheduling, a process will proceed with the simulation instead of ceaselessly relaying null messages to other processes.
Table 1 (recoverable entries).
$T_{\mathrm{travel}}$: travelling time for the last transmission to reach its destination
$S_p(\mathrm{scheme})$: speedup metric when $p$ processors are used for a particular mapping scheme
$BW_p(\mathrm{scheme})$: bandwidth (number of events simulated per unit time) for a particular mapping scheme
A flushing mechanism is also proposed in [12] to handle the combinatoric explosion in memory utilization. That is, when a null message is received, the process will flush in all the null messages that have already arrived on that link and select the largest time indicator. The forwarding of null messages is only on the largest time indicator and the rest are discarded. Such a flushing mechanism is also able to maintain the liveness of simulation as at least one null message is circulating in the network throughout the simulation duration. Table 1 contains a list of performance parameters used in our analysis. Based on problem size, MIN's characteristics, data transmission protocols and simulation mechanism, the model formalizes the workload of each simulation component in terms of their elapsed time.
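The flushing step and the livelock priority rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a link queue is modelled as a plain list of null-message time indicators, and the behaviour of `next_action` for unequal times is an assumption of the sketch, since the text only specifies the tie case.

```python
def flush_null_messages(link_queue):
    """Drain the null messages queued on one link and return the largest
    time indicator, which is the only one that is forwarded."""
    if not link_queue:
        return None
    largest = max(link_queue)
    link_queue.clear()          # the remaining null messages are discarded
    return largest

def next_action(departure_time, null_indicator):
    # Livelock rule from the text: on a tie the departure event has priority.
    # Treating an earlier departure as executable is an assumption here.
    if departure_time <= null_indicator:
        return "departure"
    return "relay_null"

queue = [3.0, 9.5, 7.0]
print(flush_null_messages(queue))   # 9.5
print(queue)                        # []
print(next_action(4.0, 4.0))        # departure
```

Because only one indicator per flush is forwarded, the number of null messages in circulation does not grow with the number that arrive.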
Analytical performance model
We assume that the destination of each packet transmission is uniformly distributed.
Each data transmission among the simulation processes is modelled by two components: buffer access time and transit time. The reception of data is modelled by buffer access time only; transit time is excluded to prevent double counting. We divide the analysis of simulation performance into two parts: light traffic ($\lambda \le \mu$) and heavy traffic ($\lambda > \mu$) conditions. In the speedup analysis and simulation bandwidth analysis, we assume that $p = 2n + \tfrac{1}{2} n \log_2 n$, i.e., the number of processors is sufficient to achieve the maximum degree of parallelism for the MIN simulation. While modelling $T_p$, we assume that the elapsed time of each SW process outweighs those of the G and S processes.
Light traffic
In this case we let $\lambda_{\mathrm{eff}} = \lambda$. We assume that the hand-shaking protocols used in each packet transmission include only REQ and AVA signals. In other words, when a process sends a REQ signal to its receiver to find out the buffer status, the reply is always 'available' (see figure 2(a)). It follows that the time incurred in each set of hand-shaking protocols is $2 \times T_{\mathrm{buffer}} + T_{\mathrm{transit}}$, and the time incurred in each packet transmission is $T_{\mathrm{buffer}} + T_{\mathrm{transit}}$.
Packet generator
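The elapsed time of each G process is modelled by three components: packet generation ($T^{G}_{1}$), hand-shaking protocols ($T^{G}_{2}$) and packet transmission ($T^{G}_{3}$). With $g$ packets generated by each G process,
$$T^{G}_{1} = g \times T_{\mathrm{generate}}, \tag{1}$$
$$T^{G}_{2} = g \times (2T_{\mathrm{buffer}} + T_{\mathrm{transit}}), \tag{2}$$
$$T^{G}_{3} = g \times (T_{\mathrm{buffer}} + T_{\mathrm{transit}}). \tag{3}$$
By equations (1), (2) and (3), the total elapsed time of each G process becomes
$$T^{G} = g \times (T_{\mathrm{generate}} + 3T_{\mathrm{buffer}} + 2T_{\mathrm{transit}}). \tag{4}$$
Switching element
As there are two input links to each SW process, the number of received packets is $2 \times g$. Each packet, corresponding to one arrival event, will generate one departure event and therefore the total number of events to be executed in a SW process is $2 \times 2g$. It follows that
$$T^{SW}_{1} = 4g \times T_{\mathrm{event}}. \tag{5}$$
For each SW process, $2g$ packets are received and $2g$ packets are transmitted. Therefore the elapsed times due to hand-shaking protocols and to packet transmission and reception are
$$T^{SW}_{2} = 4g \times (2T_{\mathrm{buffer}} + T_{\mathrm{transit}}), \tag{6}$$
$$T^{SW}_{3} = 2g \times (2T_{\mathrm{buffer}} + T_{\mathrm{transit}}). \tag{7}$$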
For each transmission of an AVA signal on an in-coming link, and of a REQ signal and an ACC message on an out-going link, null messages will be sent on the other three links of the SW process to prevent deadlock. We assume that the null messages are then discarded by their receivers. Since there are $2g$ packets to be received and $2g$ packets to be transmitted in each SW process, we have
$$T^{SW}_{4} = 2g \times 9 \times (T_{\mathrm{buffer}} + T_{\mathrm{transit}}). \tag{8}$$
By equations (5), (6), (7) and (8), the total elapsed time of each SW process becomes
$$T^{SW} = 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 12T_{\mathrm{transit}}). \tag{9}$$
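Sink
The elapsed time of each S process is modelled by two components: packet accounting ($T^{S}_{1}$) and hand-shaking protocols ($T^{S}_{2}$). Since a S process is connected to each output link of the MIN, the number of received packets is $g$. It follows that
$$T^{S}_{1} = g \times T_{\mathrm{account}}, \tag{10}$$
$$T^{S}_{2} = g \times (2T_{\mathrm{buffer}} + T_{\mathrm{transit}}). \tag{11}$$
By equations (10) and (11), the total elapsed time of each S process becomes
$$T^{S} = g \times (T_{\mathrm{account}} + 2T_{\mathrm{buffer}} + T_{\mathrm{transit}}). \tag{12}$$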
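The per-process costs under light traffic can be tallied with a short sketch. The split into components below is a reconstruction from the rules stated in this section (a hand-shake costs 2 T_buffer + T_transit, a packet transmission costs T_buffer + T_transit, a reception costs T_buffer, and each of the three signal/message transmissions in a SW process triggers three null messages). It is an assumption of this sketch rather than a quotation of the original equations, although the SW total agrees with the coefficients 15 T_buffer and 12 T_transit used in appendices B and C. Function names are hypothetical.

```python
def generator_time(g, t_generate, t_buffer, t_transit):
    handshake = 2 * t_buffer + t_transit       # one REQ/AVA exchange
    transmission = t_buffer + t_transit        # sending the packet itself
    return g * (t_generate + handshake + transmission)

def switch_time(g, t_event, t_buffer, t_transit):
    packets = 2 * g                            # two input links, g packets each
    events = 2 * packets * t_event             # one arrival and one departure per packet
    handshakes = 2 * packets * (2 * t_buffer + t_transit)      # in-coming and out-going
    transfers = packets * (t_buffer + (t_buffer + t_transit))  # receive, then send
    nulls = packets * 3 * 3 * (t_buffer + t_transit)  # AVA, REQ, ACC: three links each
    return events + handshakes + transfers + nulls

def sink_time(g, t_account, t_buffer, t_transit):
    return g * (t_account + 2 * t_buffer + t_transit)

# The component sum collapses to 2g(2 T_event + 15 T_buffer + 12 T_transit).
g, e, b, t = 10, 3, 5, 7
print(switch_time(g, e, b, t) == 2 * g * (2 * e + 15 * b + 12 * t))   # True
```

Integer parameters are used in the check so that the comparison is exact.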
3.1.4. Speedup
Speedup for $p$ processors is defined as the execution time for the best serial algorithm on a single processor ($T_1$) divided by the execution time for the parallel algorithm using $p$ processors ($T_p$) [2]. In our analysis, $T_1$ is modelled by the elapsed time of the MIN simulation incurred by a sequential program. As causality correctness is easily ensured in sequential simulation, the transmission protocols are not necessary in the sequential program and are discarded in our derivation of $T_1$. This presents a fairer speedup measure than is sometimes reported in the literature, whereby the value of $T_1$ is measured by running the parallel algorithm on one processor. The latter practice gives a larger speedup, because $T_1$ then also includes the protocol overheads. Given that the number of simulated packets transmitted across the MIN is $n \times g$, the elapsed time of the sequential simulation program becomes
$$T_1 = ng \times (T_{\mathrm{generate}} + 2\log_2 n \times T_{\mathrm{event}} + T_{\mathrm{account}}). \tag{13}$$
Assuming that the elapsed time of each SW process outweighs those of the G and S processes, we have
$$T_p \approx 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 12T_{\mathrm{transit}}). \tag{14}$$
By equations (13) and (14), under the light traffic condition the speedup when $p$ processors (where $p = 2n + \tfrac{1}{2} n \log_2 n$) are used becomes
$$S_p = \frac{T_1}{T_p} \approx \frac{n \times (T_{\mathrm{generate}} + 2\log_2 n \times T_{\mathrm{event}} + T_{\mathrm{account}})}{2 \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 12T_{\mathrm{transit}})}. \tag{15}$$
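As a worked illustration of the speedup definition, the sketch below evaluates T_1 from equation (13) and approximates T_p by the elapsed time of a single SW process. The SW coefficients (15 T_buffer, 12 T_transit) are taken from the terms used in appendix B, and treating them as the light-traffic T_p is an assumption of this sketch. The parameter values in the example are hypothetical.

```python
from math import log2

def sequential_time(n, g, t_generate, t_event, t_account):
    # Equation (13): every packet is generated, crosses log2(n) stages
    # (two events per stage) and is accounted for at a sink.
    return n * g * (t_generate + 2 * log2(n) * t_event + t_account)

def speedup_light_traffic(n, g, t_generate, t_event, t_account,
                          t_buffer, t_transit):
    # T_p is approximated by the elapsed time of one SW process.
    t_p = 2 * g * (2 * t_event + 15 * t_buffer + 12 * t_transit)
    return sequential_time(n, g, t_generate, t_event, t_account) / t_p

# With all communication overheads set to zero the speedup is bounded by the
# event work alone; any non-zero T_buffer or T_transit lowers it.
print(speedup_light_traffic(16, 10, 0.0, 1.0, 0.0, 0.0, 0.0))   # 32.0
print(speedup_light_traffic(16, 10, 0.0, 1.0, 0.0, 1.0, 1.0))
```

The first call returns 32, the number of SW processes in a 16 × 16 MIN, which is the ceiling reached when only event execution is counted.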
3.1.5. Simulation bandwidth
Simulation bandwidth, a measure of simulation implementation efficiency, is defined as the number of events executed per second. The total number of events to be executed is
By equations (14) and (16), the simulation bandwidth under light traffic condition becomes
Heavy traffic
When the arrival rate is greater than the service rate of the SW processes connected to the G processes, the buffer spaces of these SW processes will be filled up progressively until all slots are occupied. Subsequently, the packet arrival rate is moderated by the SW processes to prevent packet loss. When the steady state is reached, $\lambda_{\mathrm{eff}} = \mu$. Our worst-case analysis assumes that the time interval before the steady state is negligible. The hand-shaking protocols used by generators for packet transmission include REQ, NAV and AVA signals and the TIM message to delay the packet arrival times (see figure 2(b)). The derivations of the performance metrics are available in the appendix, and we summarize the results in tables 2(a) and 2(b).
Analysis for different mapping schemes
Our conservative simulation model requires $n$ G processes, $\tfrac{1}{2} n \log_2 n$ SW processes and $n$ S processes to simulate an $n \times n$ MIN. Mapping such processes onto a particular processor configuration for efficient implementation is an NP-complete problem [1]. In this paper, we introduce and analyse the performance of three mapping schemes, namely, the horizontal partitioning scheme (HPS), the vertical partitioning scheme (VPS) and the modular partitioning scheme (MPS). Details of the formulation and mathematical transformation used can be found in [10, 12]. For partitioning the 16 × 16 MIN simulation shown in figures 3(a-c), each rectangular block containing four switching elements is mapped onto one processor. As the G and S processes do not have to simulate the switch operations, their workloads are smaller than that of a SW process. Therefore, in all mapping schemes each G process is placed together with the SW process on its out-going link in the same partition, and each S process is placed together with the SW process on its in-coming link. It is noteworthy that in MPS each module contains four SW processes, each corresponding to a 2 × 2 switch, which are mapped onto the same processor. In VPS, however, each partition containing $\log_2 n$ SW processes is executed on one processor. Therefore, in the following speedup and simulation bandwidth comparison we let $n = 2^{2^i}$ for some integer $i$. For a fairer comparison, we let $p = n/2$ so that the number of SW processes mapped onto a processor is the same in all three schemes. Intra-processor communication time incurred by processes is assumed to be negligible. The following elapsed time, speedup and simulation bandwidth analyses are based on the light traffic condition. The performance metrics for the heavy traffic condition can be derived by the same steps.
Horizontal partitioning scheme
In HPS, the G, SW and S simulated processes on one horizontal rectangular box (see figure 3(a) ) are mapped onto one processor. Parallel simulation is completed only when the sink processes in each processor have received the last packet transmission. Therefore, T p is modelled by the maximum workload among processors. Before proceeding further, we re-compute the elapsed times of the affected processes due to this scheme.
Packet generator
As the intra-processor transit time between the generator process and the SW process on the $(\log_2 n - 1)$th stage is negligible, the total elapsed time for each G process (see equation (4)) becomes
$$T^{G} = g \times (T_{\mathrm{generate}} + 3T_{\mathrm{buffer}}). \tag{18}$$
4.1.2. SW process on the $(\log_2 n - 1)$th stage
While modelling the maximum workload among processors, we do not consider the top and the bottom partitions, whose smaller number of inter-processor links reduces their total elapsed time. The elapsed times due to event execution (equation (5)) and packet transmission and reception (equation (7)) remain unchanged. The elapsed time due to hand-shaking protocols (equation (6)) for the SW processes on the $(\log_2 n - 1)$th stage is reduced to
For the same reason, the elapsed time due to null-message transmission (equation (8)) is reduced to
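By equations (5), (7), (19) and (20), the total elapsed time of each SW process on the $(\log_2 n - 1)$th stage becomes
$$T^{SW} = 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}). \tag{21}$$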
4.1.3. SW process on the 0th stage
The elapsed time due to event execution (equation (5)) remains unchanged. The elapsed time due to hand-shaking protocols (equation (6)) is reduced to
The elapsed time due to packet transmission and reception (equation (7)) becomes 
(a)
Scheme | Elapsed time
HPS | $2g \times (T_{\mathrm{generate}} + 2\log_2 n \times T_{\mathrm{event}} + T_{\mathrm{account}} + (5 + 15\log_2 n) \times T_{\mathrm{buffer}} + 2(6\log_2 n - 5) \times T_{\mathrm{transit}})$
VPS | $2g\log_2 n \times T_{\mathrm{generate}} + 2(\log_2 n \times (2g + 1) - 1) \times T_{\mathrm{event}} + T_{\mathrm{account}} + (3\log_2 n \times (12g + 5) - 13) \times T_{\mathrm{buffer}} + 2(2\log_2 n \times (4g + 3) - 9) \times T_{\mathrm{transit}}$
MPS | $g\log_2 n \times T_{\mathrm{generate}} + 2((2g + 1)\log_2 n - 2) \times T_{\mathrm{event}} + T_{\mathrm{account}} + (3(11g + 5)\log_2 n - 28) \times T_{\mathrm{buffer}} + ((8g + 7)\log_2 n - 22) \times T_{\mathrm{transit}}$
The elapsed time due to null-message transmission for the SW processes (equation (8)) is reduced to
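By equations (5), (22), (23) and (24), the total elapsed time of each SW process on the 0th stage becomes
$$T^{SW} = 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}). \tag{25}$$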
Sink process
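The elapsed time of each S process (see equation (12)) is reduced to
$$T^{S} = g \times (T_{\mathrm{account}} + 2T_{\mathrm{buffer}}). \tag{26}$$
By multiplying the respective numbers of processes to equations (9), (18), (21), (25) and (26), we have
$$T_p = 2g \times (T_{\mathrm{generate}} + 2\log_2 n \times T_{\mathrm{event}} + T_{\mathrm{account}} + (5 + 15\log_2 n) \times T_{\mathrm{buffer}} + 2(6\log_2 n - 5) \times T_{\mathrm{transit}}). \tag{27}$$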
4.1.5. Speedup
By equations (13) and (27), the speedup when $p$ processors (where $p = n/2$) are used becomes
$$S_p(HPS) \approx \frac{ng \times 2\log_2 n \times T_{\mathrm{event}}}{2g \times (2\log_2 n \times T_{\mathrm{event}} + 15\log_2 n \times T_{\mathrm{buffer}} + 12\log_2 n \times T_{\mathrm{transit}})}.$$
Simulation bandwidth
By equations (16) and (27), the simulation bandwidth of HPS becomes
$$BW_p(HPS) \approx \frac{2gn\log_2 n}{2g \times (2\log_2 n \times T_{\mathrm{event}} + 15\log_2 n \times T_{\mathrm{buffer}} + 12\log_2 n \times T_{\mathrm{transit}})}.$$
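The HPS elapsed time listed in the summary table can be cross-checked against a per-process tally. The sketch below assumes that each HPS processor hosts two G processes, one SW process per stage and two S processes, and uses the per-process terms that appear in appendices B and C. The process counts are inferred from the totals rather than stated explicitly in the text, and the function names are hypothetical.

```python
def hps_elapsed_closed_form(n, g, t_gen, t_event, t_acc, t_buf, t_tr):
    k = n.bit_length() - 1      # log2(n) for n a power of two
    return 2 * g * (t_gen + 2 * k * t_event + t_acc
                    + (5 + 15 * k) * t_buf + 2 * (6 * k - 5) * t_tr)

def hps_elapsed_by_process(n, g, t_gen, t_event, t_acc, t_buf, t_tr):
    k = n.bit_length() - 1
    generators = 2 * g * (t_gen + 3 * t_buf)                    # two G processes
    first_sw = 2 * g * (2 * t_event + 15 * t_buf + 8 * t_tr)    # stage next to G
    middle_sw = (k - 2) * 2 * g * (2 * t_event + 15 * t_buf + 12 * t_tr)
    last_sw = 2 * g * (2 * t_event + 15 * t_buf + 6 * t_tr)     # stage next to S
    sinks = 2 * g * (t_acc + 2 * t_buf)                         # two S processes
    return generators + first_sw + middle_sw + last_sw + sinks

args = dict(g=10, t_gen=3, t_event=5, t_acc=7, t_buf=11, t_tr=13)
print(all(hps_elapsed_closed_form(n, **args) == hps_elapsed_by_process(n, **args)
          for n in (4, 16, 256)))   # True
```

Integer parameters keep the comparison exact; the same check with measured timings would need a floating-point tolerance.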
Vertical and modular partitioning schemes
In VPS, each processor executes $\log_2 n$ SW processes belonging to the same MIN stage (see figure 3(b)). During simulation, ACC packets are created by the generators, relayed by SW processes in each stage, and absorbed by the sinks. As the mapping is performed on a stage basis, parallel simulation is completed only when the last (rightmost) processor terminates. $T_p$ is therefore modelled by two components: the elapsed time of the first (leftmost) processor, and the travelling time for the last transmission to reach the sink process. The MPS is based on partitions of four SW processes spread over two MIN stages as shown in figure 3(c). A proof of the transformation from the Omega MIN to its equivalent modular topology can be found in [11]. Each processor executes $\log_2 n$ SW processes arranged in $\tfrac{1}{4}\log_2 n$ blocks. Because sink processes are contained in the rightmost module only, we also model $T_p$ by the same two components as in VPS.
A summary of the analytical results for VPS and MPS is given in tables 3(a) and (b).
Model validation and performance analysis
We implemented the conservative parallel MIN simulation model in the C language on a network of workstations which mimics a distributed-memory parallel computer. PVM software [6] was used for spawning the simulation processes and for handling message passing. (See table 4.) Additional statistical measures used in the analysis are defined as follows. Let $s_i$ be a simulation process and $|s_i|$ denote its workload. In the following analysis, the workload of each simulation process is represented by the size of its object code, normalized to that of the S process. For simplicity, we let $|S| = 1$. Thus, the workloads $|G|$ and $|SW|$ are 1.93 and 6.24 respectively. The normalized coefficient of $T_{\mathrm{transit}}$ is denoted by $T$. The percentage deviation $\delta$ between the measured and the predicted values is computed as $\delta = \frac{|\mathrm{measured} - \mathrm{predicted}|}{\mathrm{predicted}} \times 100$. The average of $\delta$ over six different sets of simulation parameters is denoted by $\bar{\delta}$.
Let $M$ be the set containing all $s_i$ required in a MIN simulation. Let $P$ be the processor that incurs the largest workload among all processors. We define the workload distribution coefficient ($\omega$) as $\omega = \sum_{s_i \in M} |s_i| \,/ \sum_{s_j \in P} |s_j|$. The load balancing factor ($\eta$) is then defined as $\eta = p/\omega$. Thus, the larger the value of $\eta$, the more unbalanced is the workload distribution. The smallest (best) value of $\eta$ equals 1, corresponding to a perfectly balanced distribution of workload.
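The load balancing factor can be computed directly from these definitions. The sketch below uses the normalized workloads quoted above and an assumed placement of processes for a 16 × 16 MIN on p = 8 processors (four SW processes per processor, with each G or S process co-located with its neighbouring SW process). The placement is inferred from the description of the three schemes in section 4 and is an assumption of this sketch; the function name is hypothetical.

```python
WORKLOAD = {"G": 1.93, "SW": 6.24, "S": 1.0}   # normalized workloads from the text

def load_balancing_factor(processors):
    """processors: list of dicts mapping process type to its count on one processor."""
    loads = [sum(WORKLOAD[kind] * count for kind, count in proc.items())
             for proc in processors]
    omega = sum(loads) / max(loads)     # workload distribution coefficient
    return len(processors) / omega      # eta = p / omega

# Assumed placements for a 16 x 16 MIN on 8 processors.
hps = [{"G": 2, "SW": 4, "S": 2}] * 8
vps = [{"G": 8, "SW": 4}] * 2 + [{"SW": 4}] * 4 + [{"SW": 4, "S": 8}] * 2
mps = [{"G": 4, "SW": 4}] * 4 + [{"SW": 4, "S": 4}] * 4

for name, mapping in (("HPS", hps), ("VPS", vps), ("MPS", mps)):
    print(name, round(load_balancing_factor(mapping), 2))
# HPS 1.0
# VPS 1.31
# MPS 1.06
```

Under these assumptions the sketch reproduces the values η = 1, 1.31 and 1.06 quoted for HPS, VPS and MPS in the comparison of elapsed times.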
We first compare the elapsed times among the different schemes, and between the simulation model (measured) and the analytical model (predicted).
Between VPS and MPS, MPS has the shorter elapsed time, as shown in table 5. This is due to the poorer workload distribution and larger inter-processor communication overheads of VPS ($T = 2$) compared to those of MPS ($T = 1$). The comparison of VPS with HPS is more interesting. Although the inter-processor communication overhead of VPS is only half of that of HPS, the workload imbalance appears to be the dominant factor, making the elapsed time of VPS ($\eta = 1.31$) larger than that of HPS ($\eta = 1$). Our explanation is that the largest number of processes on any one processor is greater in VPS than in HPS. Therefore, the elapsed time of VPS includes a larger amount of context switching time incurred by the task scheduler of the UNIX operating system. The comparison of HPS ($\eta = 1$) and MPS ($\eta = 1.06$) is rather counterintuitive. We observe that HPS, a perfectly load-balanced scheme, incurs a significantly larger elapsed time than the less balanced MPS. Such an observation contradicts the common belief that the workload distribution among processors must be balanced in order to improve elapsed time. We note here that other factors, such as the inter-processor communication overheads incurred by a mapping scheme, should also be taken into account while balancing the workload distribution. As the inter-processor communication overhead incurred by HPS is larger than that of MPS, its elapsed time is longer despite the balanced workload distribution.
The averaged percentage deviations δ show that the accuracy of the analytical model for predicting elapsed time is good for all three mapping schemes. However, the predicted timings show slightly better performance than the measured simulation model timings because the context switching time incurred by the scheduler of the UNIX OS is not accounted for in our analytical model. Tables 6 and 7 show that our performance model closely predicts the parallel simulation speedups and simulation bandwidths.
The predicted values for all three mapping schemes are slightly better than the measured values for the same reason. The speedup efficiency is unacceptably low, i.e. around 10%. Our analysis reveals that this is due to the higher than expected cost of PVM communication. Better speedup can be obtained on a parallel machine where communication is much more efficient and less costly. In the following section, we present a breakdown of various communication costs, and analyse the effect of reducing these overheads on simulation performance.
In summary, the measured and predicted values for elapsed time, speedup and simulation bandwidth show the following important findings:
• A perfectly balanced workload distribution, such as HPS, may not necessarily translate into better performance.
• Inter-processor communication overheads may cause dominant aggravation to the program elapsed time.
• A good mapping scheme should take into account the load balancing as well as communication overheads. Ignoring either factor may result in poor implementation performance.
Table 8 depicts the asymptotic elapsed times for MIN simulation. The comparisons show that by transforming the MIN into a more modular equivalent interconnection, the message transit time can be significantly reduced. As also shown in our performance model, the communication overheads comprise two major components: buffer access time and message transmission time. In order to improve the implementation performance, we therefore have to reduce these two constituents.
Performance improvement
Using the MPS as an example, we analyse how the overheads affect parallel simulation performance. We compared the implications of setting all overheads to zero (no overheads), buffer access time to zero (zero $T_{\mathrm{buffer}}$), message transmission time to zero (zero $T_{\mathrm{transit}}$), reducing $T_{\mathrm{buffer}}$ to a half ($T_{\mathrm{buffer}}/2$) and to a quarter ($T_{\mathrm{buffer}}/4$), and reducing $T_{\mathrm{transit}}$ by half ($T_{\mathrm{transit}}/2$). The resulting elapsed time, speedup and simulation bandwidth are plotted against the number of workstations.
In particular, figure 5 shows that good parallel simulation speedup can be achieved even with the imposed transmission protocols, provided the communication overheads are negligible in comparison with the event grain size. We observed that the worst bottleneck is attributed to $T_{\mathrm{buffer}}$, followed by $T_{\mathrm{transit}}$. Thus, more effort should be devoted to reducing PVM-based memory allocation time, and to improving access protocols to reduce overall message buffering time, than to improving the network transmission speed. Halving $T_{\mathrm{buffer}}$ yields a significant reduction in elapsed time compared with setting $T_{\mathrm{transit}}$ to zero (figure 4).
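This kind of what-if analysis can be reproduced from the MPS elapsed-time expression in the summary table. The sketch below is illustrative only: the timing values in `base` are hypothetical placeholders in arbitrary units, not the measured parameters of this study, so the printed ratios indicate the method rather than the paper's results.

```python
def mps_elapsed(n, g, t_gen, t_event, t_acc, t_buf, t_tr):
    k = n.bit_length() - 1          # log2(n) for n a power of two
    return (g * k * t_gen + 2 * ((2 * g + 1) * k - 2) * t_event + t_acc
            + (3 * (11 * g + 5) * k - 28) * t_buf
            + ((8 * g + 7) * k - 22) * t_tr)

# Hypothetical timing parameters (arbitrary units), not measured values.
base = dict(n=16, g=100, t_gen=1.0, t_event=1.0, t_acc=1.0, t_buf=4.0, t_tr=2.0)

scenarios = {
    "baseline": {},
    "no overheads": {"t_buf": 0.0, "t_tr": 0.0},
    "zero T_buffer": {"t_buf": 0.0},
    "zero T_transit": {"t_tr": 0.0},
    "T_buffer / 2": {"t_buf": base["t_buf"] / 2},
    "T_buffer / 4": {"t_buf": base["t_buf"] / 4},
    "T_transit / 2": {"t_tr": base["t_tr"] / 2},
}

reference = mps_elapsed(**base)
for name, change in scenarios.items():
    elapsed = mps_elapsed(**{**base, **change})
    print(f"{name:15s} {elapsed / reference:.2f}")   # elapsed time relative to baseline
```

Because the coefficient of T_buffer in the expression is roughly four times that of T_transit for large g, the relative benefit of each reduction depends strongly on the ratio of the two timings on the target platform.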
Conclusions
We have developed an analytical model that characterizes the performance of a conservative parallel simulation model using five timing parameters.
Performance measures are derived for simulation elapsed time, speedup and bandwidth for both light and heavy traffic conditions. Validation experiments comparing the performance metrics from the analytical model and the simulation model show an acceptable 5% difference. We have analysed the performance for three process-to-processor mapping schemes to identify the main causes of poor performance. Using the analytical framework, we also evaluated the significance of reducing buffer access time and message transit time in order to improve the runtime performance. The performance model can be easily extended to analyse other applications using the same conservative parallel mechanism. Our work reveals that in the PVM-based implementation reducing buffer access time is more critical than network transmission time. In addition, we dispel the common belief that to reduce the elapsed time of a parallel program the workload distribution among the processors must be balanced. We observed that other factors such as inter-processor communication overheads may also cause dominant aggravation to the simulation performance despite the balanced workload distribution. Therefore, a balanced workload distribution may not necessarily translate into better performance. Both the analytical and implementation results confirm such an exceptional phenomenon.
Our work also shows that although PDES is intuitively promising, it requires a lot of careful implementation considerations in practice. The simulation performed on a multiprocessor platform, if not implemented properly, may produce a runtime slow down due to high simulation synchronization overheads, fine granularity of events, strong coupling of exploitable parallelism and the system to be simulated, inherent limited lookahead, etc. Instead of building new PDES synchronization and disregarding distributed system synchronization mechanisms already available, an important thread of investigation is to harmonize and integrate the synchronization requirements in organizing PDES and in distributed processing. In this aspect, much remains to be done to realize its full potential.
Appendix A. Performance metrics for heavy traffic condition
The elapsed times incurred in each set of hand-shaking protocols used by G processes for packet transmission and in the SW process on the $(\log_2 n - 1)$th stage for packet reception are $5T_{\mathrm{buffer}} + 2T_{\mathrm{transit}}$ and $5T_{\mathrm{buffer}} + 3T_{\mathrm{transit}}$ respectively (see figure 2(b)). The hand-shaking protocols used in the SW processes on all other stages remain as in the light traffic condition because the arrival rate to these processes has been moderated by their service rate.
A.1. Packet generator
Packet generation time (T G 1 ) and packet transmission time (T G 3 ) remain unchanged. The elapsed time due to handshaking protocols is as follows:
By equations (1), (3) and (A1), the total elapsed time of each G process becomes
A.2. Switching element on the $(\log_2 n - 1)$th stage
Event execution time ($T^{SW}_{1}$) and packet transmission and reception time ($T^{SW}_{3}$) remain unchanged. The elapsed time due to hand-shaking protocols is changed as follows:
For each transmission of AVA and NAV signals and REQ signal and ACC message on an in-coming link and outgoing link respectively, null messages will also be sent on the other three links to prevent deadlock. Again, we assume that the null messages are then discarded by their receivers. Therefore
By equations (5), (7), (A3) and (A4), the total elapsed time of each SW process on the (log 2 n − 1)th stage becomes
A.3. Speedup
With heavy traffic, $T_1$ can be modelled by equation (13) with $g = \lambda_{\mathrm{eff}} \times t = \mu \times t$. By equation (A5), we have
By equations (13) and (A6), the speedup for p processors (where p = 2n + 1 2 n log 2 n) under heavy traffic condition becomes
A.4. Simulation bandwidth
By equations (16) and (A6), the simulation bandwidth for heavy traffic condition becomes
Appendix B. Vertical partitioning scheme
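By multiplying the number of processes to equations (18) and (21), we have
$$T_{\mathrm{firstp}} = 2\log_2 n \times g \times (T_{\mathrm{generate}} + 3T_{\mathrm{buffer}}) + \log_2 n \times 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}) = 2g \times \log_2 n \times (T_{\mathrm{generate}} + 2T_{\mathrm{event}} + 18T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}). \tag{B1}$$
By multiplying the required numbers of processes to equations (9), (25) and (26), and assuming one transit packet, we have
$$T_{\mathrm{travel}} = (\log_2 n - 2) \times 1 \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 12T_{\mathrm{transit}}) + 1 \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}) + 1 \times (T_{\mathrm{account}} + 2T_{\mathrm{buffer}}) = 2(\log_2 n - 1)T_{\mathrm{event}} + T_{\mathrm{account}} + (15\log_2 n - 13)T_{\mathrm{buffer}} + 6(2\log_2 n - 3)T_{\mathrm{transit}}. \tag{B2}$$
By equations (B1) and (B2), we have
$$T_p = T_{\mathrm{firstp}} + T_{\mathrm{travel}} = 2g\log_2 n \times T_{\mathrm{generate}} + 2(\log_2 n \times (2g + 1) - 1) \times T_{\mathrm{event}} + T_{\mathrm{account}} + (3\log_2 n \times (12g + 5) - 13) \times T_{\mathrm{buffer}} + 2(2\log_2 n \times (4g + 3) - 9) \times T_{\mathrm{transit}}. \tag{B3}$$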
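The algebra leading to equation (B3) can be checked numerically. The sketch below encodes (B1), (B2) and (B3) as written, with k standing for log2 n, and confirms that the first two sum to the third for several integer parameter sets. The function names are hypothetical.

```python
def vps_first_processor(k, g, t_gen, t_event, t_buf, t_tr):
    # Equation (B1): 2k generators and k first-stage SW processes.
    return (2 * k * g * (t_gen + 3 * t_buf)
            + k * 2 * g * (2 * t_event + 15 * t_buf + 8 * t_tr))

def vps_travel(k, t_event, t_acc, t_buf, t_tr):
    # Equation (B2): one packet crossing k - 2 middle stages,
    # the last stage and a sink.
    return ((k - 2) * (2 * t_event + 15 * t_buf + 12 * t_tr)
            + (2 * t_event + 15 * t_buf + 6 * t_tr)
            + (t_acc + 2 * t_buf))

def vps_elapsed(k, g, t_gen, t_event, t_acc, t_buf, t_tr):
    # Closed form, equation (B3).
    return (2 * g * k * t_gen + 2 * (k * (2 * g + 1) - 1) * t_event + t_acc
            + (3 * k * (12 * g + 5) - 13) * t_buf
            + 2 * (2 * k * (4 * g + 3) - 9) * t_tr)

consistent = all(
    vps_first_processor(k, g, 3, 5, 11, 13) + vps_travel(k, 5, 7, 11, 13)
    == vps_elapsed(k, g, 3, 5, 7, 11, 13)
    for k in (2, 4, 8) for g in (1, 10, 1000)
)
print(consistent)   # True
```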
B.1. Speedup
From equations (13) 
Appendix C. Modular partitioning scheme
In the following derivation, we assume that $n \ge 16$. First, we compute the elapsed time of the leftmost block. By multiplying the number of processes to equations (4), (9) and (21), and assuming $T_{\mathrm{transit}} = 0$ in the first two equations, we have
$$4 \times g \times (T_{\mathrm{generate}} + 3T_{\mathrm{buffer}} + 0) + 2 \times 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 0) + 2 \times 2g \times (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}) = 4g \times (T_{\mathrm{generate}} + 4T_{\mathrm{event}} + 33T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}). \tag{C1}$$
By multiplying the number of modules in the leftmost processor to equation (C1), we obtain
$$T_{\mathrm{firstp}} = \tfrac{1}{4}\log_2 n \times 4g \times (T_{\mathrm{generate}} + 4T_{\mathrm{event}} + 33T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}) = g \times \log_2 n \times (T_{\mathrm{generate}} + 4T_{\mathrm{event}} + 33T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}). \tag{C2}$$
Next, we compute the elapsed time of each transmission across an intermediate block. By multiplying the respective numbers of processes in equations (21) and (25), and assuming one transit packet, we have
$$(2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}) + (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 8T_{\mathrm{transit}}) = 4T_{\mathrm{event}} + 30T_{\mathrm{buffer}} + 14T_{\mathrm{transit}}. \tag{C3}$$
The elapsed time across the rightmost block is computed by equation (25), setting $T_{\mathrm{transit}} = 0$ in equations (9) and (12), and assuming one transit packet, giving
$$(2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}) + (2T_{\mathrm{event}} + 15T_{\mathrm{buffer}} + 0) + (T_{\mathrm{account}} + 2T_{\mathrm{buffer}} + 0) = 4T_{\mathrm{event}} + T_{\mathrm{account}} + 32T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}. \tag{C4}$$
By multiplying the respective numbers of blocks to equations (C3) and (C4), we have
$$T_{\mathrm{travel}} = (\tfrac{1}{2}\log_2 n - 2) \times (4T_{\mathrm{event}} + 30T_{\mathrm{buffer}} + 14T_{\mathrm{transit}}) + 1 \times (4T_{\mathrm{event}} + T_{\mathrm{account}} + 32T_{\mathrm{buffer}} + 6T_{\mathrm{transit}}) = (2\log_2 n - 4)T_{\mathrm{event}} + T_{\mathrm{account}} + (15\log_2 n - 28)T_{\mathrm{buffer}} + (7\log_2 n - 22)T_{\mathrm{transit}}. \tag{C5}$$
By equations (C2) and (C5), we have
$$T_p = T_{\mathrm{firstp}} + T_{\mathrm{travel}} = g\log_2 n \times T_{\mathrm{generate}} + 2((2g + 1)\log_2 n - 2) \times T_{\mathrm{event}} + T_{\mathrm{account}} + (3(11g + 5)\log_2 n - 28) \times T_{\mathrm{buffer}} + ((8g + 7)\log_2 n - 22) \times T_{\mathrm{transit}}. \tag{C6}$$
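As with the vertical scheme, the step from equations (C2) and (C5) to (C6) can be verified numerically. In the sketch below k stands for log2 n and is taken to be even so that k/2 is an integer. The function names are hypothetical.

```python
def mps_first_processor(k, g, t_gen, t_event, t_buf, t_tr):
    # Equation (C2): k/4 leftmost blocks, each costing 4g(...), i.e. g*k*(...).
    return g * k * (t_gen + 4 * t_event + 33 * t_buf + 8 * t_tr)

def mps_travel(k, t_event, t_acc, t_buf, t_tr):
    # Equation (C5): k/2 - 2 intermediate blocks plus the rightmost block.
    intermediate = 4 * t_event + 30 * t_buf + 14 * t_tr           # (C3)
    rightmost = 4 * t_event + t_acc + 32 * t_buf + 6 * t_tr       # (C4)
    return (k // 2 - 2) * intermediate + rightmost

def mps_elapsed_closed_form(k, g, t_gen, t_event, t_acc, t_buf, t_tr):
    # Closed form, equation (C6).
    return (g * k * t_gen + 2 * ((2 * g + 1) * k - 2) * t_event + t_acc
            + (3 * (11 * g + 5) * k - 28) * t_buf
            + ((8 * g + 7) * k - 22) * t_tr)

consistent = all(
    mps_first_processor(k, g, 3, 5, 11, 13) + mps_travel(k, 5, 7, 11, 13)
    == mps_elapsed_closed_form(k, g, 3, 5, 7, 11, 13)
    for k in (4, 8, 16) for g in (1, 10, 1000)
)
print(consistent)   # True
```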
C.1. Speedup
By equations (13) 
