Accurate and fast performance estimation methods for modern and future multi-core systems are the focus of much research because of the complexity of such architectures. The communication architecture of these systems has a large impact on the performance and power of the whole system. Architects need to explore many design possibilities using performance estimation techniques at early design stages so that design decisions can be made earlier in the design cycle, while software developers need to develop and test applications for the target architecture and gather performance measurements as early in the design cycle as possible. Full system simulation techniques provide accurate performance values but are extremely time consuming. Static analysis techniques are fast but cannot capture the dynamic behavior associated with shared resource contention and arbitration. Moreover, synthetic traffic patterns have been used to analyze the communication architecture; however, such patterns are not realistic enough. We propose a statistical model to predict the dynamic cost of bus arbitration on the performance of a bus architecture. The proposed model uses workload traces of actual applications and benchmarks to capture real application traffic behavior. Statistics on the traffic patterns are collected and input to the analytical model, which calculates performance values for the communication architecture under consideration. By knowing the performance measures, designers can avoid over- and under-design of the communication architecture. This paper builds on a previously developed performance estimation model. The previous work modeled single and burst bus transfers; however, only one interfering bus master at a time was considered for each blocked bus request.
The proposed, improved-accuracy model considers multiple interfering masters for each blocked request, improving the estimation accuracy especially for traffic-intensive applications and architectures with many PEs. Experiments are performed for two architectures: 4 processing elements connected via a shared bus and 8 processing elements connected via a shared bus. Results show no significant difference in accuracy compared with the previously developed model for the low-traffic applications SPARSE and ROBOT, but a notable accuracy improvement for traffic-intensive applications. Maximum estimation error is reduced from 1.75% to 0.6% for FPPPP and from 13.91% to 8.8% for FFT on the 4PE architecture. On the 8PE architecture, maximum estimation error is reduced from 11.8% to 2.7% for the FPPPP benchmark. Moreover, the simulation speed-up of the proposed technique over the simulation method is reported.
Introduction
The rapid adoption of multiprocessor architectures by Systems-on-Chip (SoCs) means that architects face a larger design space, more design decisions, greater complexity and bigger tradeoffs. Along with processing performance, the performance of the communication architecture plays a very important role in the overall performance of the design. Communication architectures require detailed and careful consideration during the design phase in order to meet performance, power and latency constraints. To achieve this, the designer needs to understand not only the design but also the target applications and the traffic behavior of each application for the SoC to be designed. Currently, many multiprocessor Systems-on-Chip (MPSoCs) use bus-based communication architectures because of their simplicity and popularity, such as Core-Connect from IBM [1], AMBA from ARM [2], SiliconBackplane from Sonics [3] and STBus from STMicroelectronics [4]. MPSoCs with many more processing blocks, on the other hand, are increasingly employing Networks-on-Chip (NoCs) as the communication infrastructure [5], [6]. In this paper, we focus on bus-based communication architectures.
1 Department of Communications and Computer Engineering, Tokyo Institute of Technology, Meguro, Tokyo 152-8550, Japan
a) farhanshafiq@vlsi.ce.titech.ac.jp
b) isshiki@vlsi.ce.titech.ac.jp
c) dongju@vlsi.ce.titech.ac.jp
Most commonly, simulation is used to study the performance of a subject architecture running a number of target applications. There are several simulation methodologies with different tradeoffs between accuracy and speed. At one end of the spectrum, static analysis techniques predict the performance of an architecture very quickly but lack the level of accuracy needed for a complex MPSoC design. "Full system simulators", on the other hand, model each hardware component and run a full OS and applications, providing the highest accuracy but requiring very long simulation times that are not suitable for rigorous and iterative estimation.
© 2017 Information Processing Society of Japan
In this paper we address this issue by presenting a novel performance estimation technique for on-chip bus-based architectures. Our approach uses a statistical model to accurately predict the dynamic stalls caused by bus contention for a given application. Statistical models usually assume that bus requests are distributed evenly throughout the whole execution duration of an application; however, this assumption is a source of inaccuracy, since the bus access behavior of actual applications is time varying [7] and cannot be modeled this way. To overcome this issue, we assume that workload statistics for computation as well as bus traffic on each processing element are known. This assumption fits well with a workload simulation technique such as the one presented in Ref. [8]. The estimation technique works as follows: for an arbitrary window of $T$ cycles, histograms of all bus workloads and computational workloads on each PE for a given application are provided to the prediction model, which calculates the bus contention stall for each PE. The resulting stall cycle counts are added to the overall cycle counts on each PE. Unlike a simulation approach, which requires arbitration simulation on every bus request, our proposed technique runs once every $T$ cycles. Since the value of $T$ can be chosen to be large or small in accordance with the total number of cycles, the time required for performance estimation does not increase drastically with an increasing number of bus workloads. Moreover, unlike other research in this area, which focuses solely on bus-architecture design space exploration, we aim to use this model to enable application developers to perform "bus-performance aware" application optimization and partitioning, and to provide a better understanding of the bus performance of an application on a target architecture.
Once application optimization is satisfactory and a suitable set of bus architectures and mappings is finalized, simulation techniques can be used, resulting in a considerably shorter simulation time.
This research builds on a previous performance estimation model which took into account single and burst transfers on only one interfering bus master at a time for each blocked bus request [9]. The proposed, improved-accuracy model considers multiple interfering masters for each blocked request, improving the estimation accuracy especially for traffic-intensive applications and architectures with many PEs. The previously developed models are named the Single Blocking Model (SBM) and the Burst Blocking Model (BBM), while the model proposed in this paper is named the Multi Blocking Model (MBM). The difference between the three is detailed in Section 4.3.4.
The rest of the paper is organized as follows: Section 2 gives an account of related works and our contribution, Section 3 gives an overview of the proposed simulation flow, Section 4 explains the previously developed Single Blocking Model (SBM) and Burst Blocking Model (BBM), and Section 5 describes the proposed Multi Blocking Model (MBM). Section 6 gives a summary of experiments and results, Section 7 outlines future work and Section 8 concludes the paper.
Related Works
Related works cover the literature associated with performance estimation techniques for MPSoC bus architectures.
There have been a few approaches to address the need for performance estimation of an MPSoC shared bus. Some researchers have focused on static techniques for communication architecture performance estimation. Knudsen et al. presented a high-level estimation model for communication throughput for a given protocol assuming pipelined transfers [10]. Yen and Wolf proposed to estimate the communication delay using the worst-case response analysis of real-time scheduling [11]. Daveau et al. considered static information such as the maximum bandwidth of a channel and the bandwidth of processing elements to estimate the performance of the interconnect between processing elements [12]. Nandi and Marculescu used a continuous-time Markov process technique for performance measurement [13]. Drinic et al. used profiled statistics of inter-core traffic for core-to-bus assignment [14]. Thepayasuwan and Doboli proposed a simulated annealing approach [15]. Cho et al. proposed an analytical performance model for AMBA 2.0 AHB single and hierarchical shared bus architectures, assuming that bus slaves do not introduce any waits [16]. However, these techniques cannot model bus contention because of its dynamic nature, and hence they are not suitable for today's MPSoCs.
Simulation-based approaches have been more popular for performance estimation. Simulation is performed at various abstraction levels. Loghi [19]. Caldari et al. used a transaction-based bus cycle accurate (T-BCA) approach to model AMBA2 using function calls for reads/writes in SystemC 2.0. Capturing communication systems using TLMs has been tried because of its standardization [20], [21]. Ogawa et al. created another T-BCA model variant for the AMBA AHB bus architecture using C as the modeling language [22]. Ariyamparambath et al. annotated ATLM models with bus-protocol-specific timing details [23]. Viaud et al. proposed the TLM/T abstraction level [24]. Schirner et al. reported a quantitative analysis of the speed-accuracy tradeoff of TLM at different abstraction levels, using the advanced high-performance bus (AHB) as a test case [25]. Beltrame et al. proposed using multiple levels of abstraction for communication architecture exploration, with the ability to dynamically shift between BCA, untimed TLM and timed TLM abstractions to improve simulation speed [26]. A simulation-based method gives accurate estimation results but pays too heavy a computational cost as the number of bus requests and simulation iterations increases. FPGA-based simulation has been proposed [27], [28], [29] to speed up simulation; however, implementing the architecture on an FPGA in an early design phase is usually not possible.
To overcome this difficulty, a hybrid approach (between static estimation and simulation) has been developed by Lahiri et al. [30]. They used static analysis to group the traces and applied a trace-driven simulation to the trace groups; however, their approach converges to trace-driven simulation as the memory traces become larger. Kim et al. proposed another hybrid performance estimation approach based on queuing analysis [31]. However, static queuing techniques are inherently insufficient to handle complex bus protocol features. Moreover, because of the use of an FSM, the number of events/state transitions can explode with an increasing number of PEs. Kawahara et al. proposed a simulation method that takes memory access contention into account when evaluating the execution time of an application program [32]. However, the analysis is not based on the "actual trace" of a program; rather, a UML model or state-chart of the program is simulated, which results in a longer simulation time. Moreover, the experiments were performed for only two bus masters.
Multi Blocking Model Contribution
The Multi Blocking Model (MBM) provides improved accuracy over the previously developed SBM and BBM. The SBM assumes that, when blocked, a bus request will be granted as soon as the current bus master releases the bus. The BBM builds on the SBM by including burst transfers on the current bus master when the current bus master has a higher priority than the blocked bus request. This model can capture the bus performance with reasonable accuracy when application traffic is not intensive and the number of bus masters is limited. However, when the number of PEs or the application traffic increases, a single low priority PE can be blocked by multiple high priority bus masters in succession. A bus request on a low priority PE is blocked by a higher priority PE until the higher priority PE releases the bus. However, while the low priority PE is blocked, another high priority PE may issue a bus request. Once the current bus master releases the bus, the higher priority bus master wins bus arbitration and the bus request on the lower priority PE is blocked again. This kind of blocking can happen indefinitely. We call this kind of blocking by multiple interfering bus masters "Multi Blocking". The BBM does not model Multi Blocking. The proposed Multi Blocking Model accounts for multi blocking behavior and hence improves estimation accuracy compared with the BBM.
Overview of Proposed Simulation Flow
Let us explain the simulation flow of our proposed technique for performance estimation. Although the model can be used with any schedule-aware simulator where the computational and bus workloads on each PE are known, we use our bus model as an addition to the trace-driven workload simulator presented in Ref. [8]. Figure 1 shows an overview of the performance estimation flow in a trace-driven workload simulation. A partitioned application program's execution trace on a given set of input data is computed through source-level instrumentation and native code execution, and is encoded as a branch bit-stream. Moreover, a workload model for the application is generated. The branch bit-stream is then used to steer workload models inside the trace-driven workload simulator for a specific target MPSoC architecture model. Readers are directed to Ref. [33] for detailed reading.
However, whenever there is a bus access, the bus stall due to contention cannot be calculated from the workload models, and on every bus request arbitration must be simulated to resolve any bus contention. We propose to eliminate this simulation part and replace it with a statistical prediction model. A comparison of the simulation and prediction methods is given below. Either way, at the end of the workload simulation, an estimate of the cycle counts is produced. Depending on the performance numbers, application optimization or partitioning improvements can be performed and the above steps can be repeated to estimate bus performance. This results in a very fast performance estimation framework. We intend the cycle-count estimation produced by our technique to work well with the Tightly Coupled Thread model [34], which enables the programmer to specify system-level partitioning directly on reference C code.
Arbitration Simulation Technique in a Trace Simulator
In the simulation technique, arbitration must be simulated to resolve any bus contention on every bus access. A global scheduler inside the trace simulator kernel dispatches a processor from the processor queue to the workload simulator, where the processor queue is sorted by each processor's simulation clock or, in the case of a tie, by the processor's bus access priority. Multiple bus requests must be arbitrated by cycle-accurate bus simulation, which leads to a huge computational load. The working principle of the simulation-based bus model is illustrated in Fig. 2. In the trace simulator presented in Ref. [8], a Program Trace Graph (PTG) is used for efficient trace retrieval. The PTG consists of nodes and edges such that each PTG-node represents a function-start, function-end, branch or call. A PTG-edge connects two PTG-nodes and carries attributes about the program execution information, including the cycle count between the two PTG-nodes it connects. On the trace simulator, the workload corresponding to each PTG-edge $e_n$ at processor $PE_i$ ($i = 0, 1, \ldots$) contains at most a single leading "bus workload" $B_i[e_n]$, representing a memory access instruction that generates bus traffic, followed by a normal "computation workload" $C_i[e_n]$, representing normal instruction executions. Computation workload $C_i[e_n]$ denotes the number of execution cycles of the instruction stream contained in edge $e_n$. Bus workload $B_i[e_n]$ denotes the number of bus access cycles, including accurate bus setup cycles and data transfer cycles, but does not include bus stall cycles, which can only be obtained by actual bus simulation. When an edge $e_n$ contains a leading bus workload $B_i[e_n]$, the trace simulator kernel triggers the bus arbitration simulation, and if the bus request is not granted because the bus is occupied, the bus stall cycle count $D_i[e_n]$ obtained from the bus arbitration simulation is added to the processor's simulation clock.
Statistical Based Technique in a Trace Simulator
Figure 3 illustrates the bus statistics used in stall cycle prediction, which are collected during normal trace simulation. At processor $PE_i$, computation workload $C_i[e_n]$ and bus workload $B_i[e_n]$ on PTG-edge $e_n$ are simply accumulated on the processor's simulation clock, and bus arbitration simulation at each bus workload is not performed. Statistics of $C_i[e_n]$ and $B_i[e_n]$ are collected at $PE_i$ within a predefined bus prediction interval $T$ (cycles). All computation workloads between two consecutive bus workloads are merged into a single interval workload ($L_i$), and the histograms $h_{L_i}$ for $L_i$ and $h_{B_i}$ for all bus workloads $B_i$ are collected. Figure 4 illustrates the bus-stall prediction flow. For every prediction window of $T$ cycles, statistics $(N_i, h_{L_i}, h_{B_i})$ are collected at each processor $PE_i$ and used to compute the expected bus stall cycles per request $E[D_i]$. The total bus stall cycle count during the bus prediction interval is predicted as $E[D_i] \cdot N_i$, which is added to the processor's simulation clock, where $N_i$ is the total number of bus workloads within the prediction window $T$.
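The statistics collection for one prediction window can be sketched as follows. This is a minimal illustration, not the simulator of Ref. [8]: the function name `collect_window_stats`, the representation of each PTG-edge as a `(bus_cycles, computation_cycles)` pair, the use of `bus_cycles == 0` to mean "no leading bus workload", and the choice to drop computation that precedes the first bus workload of the window are all assumptions made for the example.

```python
from collections import Counter

def collect_window_stats(edges):
    """Collect (N_i, h_Li, h_Bi) for one PE over one prediction window.

    edges: list of (bus_cycles, computation_cycles) pairs, one per PTG-edge.
           bus_cycles == 0 means the edge has no leading bus workload
           (a convention assumed for this sketch).
    """
    h_L = Counter()   # histogram of interval workloads L_i
    h_B = Counter()   # histogram of bus workloads B_i
    n = 0             # number of bus workloads N_i
    interval = None   # running interval workload; None before the first bus workload

    for bus, comp in edges:
        if bus > 0:
            # A new bus workload closes the interval that followed the previous one.
            if interval is not None:
                h_L[interval] += 1
            h_B[bus] += 1
            n += 1
            interval = comp
        elif interval is not None:
            # Computation-only edge: merge it into the current interval workload.
            interval += comp

    # Close the interval that follows the last bus workload of the window.
    if interval is not None:
        h_L[interval] += 1
    return n, h_L, h_B

# Hypothetical window: the interval of length 0 corresponds to a burst (back-to-back) request.
edges = [(0, 5), (4, 10), (0, 6), (8, 0), (4, 3)]
n, h_L, h_B = collect_window_stats(edges)
```

With this convention every bus workload is followed by exactly one interval workload, so both histograms contain $N_i$ samples, in line with the definitions used by the model.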

Single Blocking Model and Burst Blocking Model
Overview
The Single Blocking Model (SBM) assumes that when a bus request on $PE_i$ is blocked by $b_j$, it is guaranteed to be granted after at most $B_j$ cycles. The SBM does not model the cases where (1) there is burst traffic on a higher priority PE or (2) there are two or more PEs with higher priority than $PE_i$. In addition to what the SBM covers, the Burst Blocking Model (BBM) models case (1), i.e. burst traffic on a higher priority PE. This paper presents the Multi Blocking Model (MBM), which in addition to what the SBM and BBM cover also models case (2), in which bus workloads on higher priority PEs can block a request on $PE_i$ indefinitely. The following section introduces some basic concepts of the SBM, BBM and MBM.
Key Concepts
Computation Workload
Computation workload $L_i$ denotes the number of execution cycles between two successive bus accesses on a processing element $PE_i$, where $i$ indicates the priority of the PE. Let $N_i$ be the total number of computation workloads on $PE_i$.
Bus Workload
Bus workload $B_i$ denotes the number of bus cycles between two successive computation workloads on $PE_i$. The total number of bus workloads on $PE_i$ is the same as $N_i$.
Request Probability and Average Interval Workload
Request probability is the probability $\lambda_i$ that a bus request $r_i$ occurs at each cycle on $PE_i$. A burst transfer request is generated on $PE_i$ with probability $\mu_i$, and the probability that interval $L_i$ equals $n$ (cycles) is given as:
and the burst request probability $\mu_i$ is obtained from the statistics collected during the bus prediction interval $(N_i, h_{L_i}, h_{B_i})$:
Request Inactivation Probability
On each occurrence of bus workload $B_j$, the probability that request $r_i$ does not occur within the duration of $B_j - 1$ cycles on $PE_i$ is called the request inactivation probability $y_{ij}$.
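As a rough illustration of this definition, the sketch below averages the probability of "no request in $B_j - 1$ cycles" over the collected bus workload histogram. It assumes that requests on $PE_i$ arrive independently in each cycle with probability $\lambda_i$ and that every bus workload lasts at least one cycle; the function name and the dictionary representation of $h_{B_j}$ are choices made for the example, not notation from the model.

```python
def request_inactivation_probability(lam_i, h_Bj):
    """Estimate y_ij: probability that PE_i issues no request during B_j - 1 cycles.

    lam_i: per-cycle request probability on PE_i (assumed independent across cycles).
    h_Bj:  histogram of bus workload lengths on PE_j, mapping length k (>= 1) to count.
    """
    total = sum(h_Bj.values())
    if total == 0:
        return 1.0  # no bus workloads on PE_j, so nothing can inactivate a request
    # Weight (1 - lam_i)^(k - 1) by the empirical probability of length k.
    return sum((count / total) * (1.0 - lam_i) ** (k - 1)
               for k, count in h_Bj.items())

# Hypothetical histogram: one bus workload of 1 cycle and one of 3 cycles.
h_Bj = {1: 1, 3: 1}
y_ij = request_inactivation_probability(0.1, h_Bj)
```

Longer bus workloads on $PE_j$ lower $y_{ij}$, since they leave more cycles in which a request on $PE_i$ can arrive and be blocked.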
Merged Bus Workload
When a bus request on $PE_i$ is blocked by a bus workload on $PE_j$ (such that the priority of $PE_j$ is higher than the priority of $PE_i$), the occurrence of a zero interval workload on $PE_j$ can block $PE_i$ continuously for multiple bus workloads. Effectively, from the perspective of $PE_i$, the bus workloads are merged into one continuous workload. This bus workload is termed the merged bus workload $B^*_j$ from here on.
Effective Request Inactivation Probability
Similar to the merged bus workload, the probability that request $r_i$ does not occur on $PE_i$ within the duration of the merged bus workload $B^*_j$ is called the merged request inactivation probability $y^*_{ij}$.
Mathematical Model SBM and BBM
From here on we use the following terminology throughout this document:
$r_i$: a request event by processor $PE_i$,
$b_j$: a bus event (with arbitrary length $B_j$) at processor $PE_j$,
$blk_{ij}$: a blocking event by bus event $b_j$ on request $r_i$,
$b_j(k)$: a bus event with length $B_j = k$ at processor $PE_j$,
$blk_{ij}(k)$: a blocking event by bus event $b_j(k)$ on request $r_i$,
$t_{ij}(k)$: time difference between the first cycle of $b_j(k)$ and request $r_i$,
$bb_{ij}$: bus event $b_j$ at processor $PE_j$ follows immediately after (with no interval) a bus event $b_i$ at processor $PE_i$,
$E[D_{ij}]$: expectation of bus stall per request at processor $PE_i$.
SBM
For the SBM, the overall "bus stall" expectation $E[D_{ij}]$ on all bus events $b_j$ is given as,
Such that,
with the value of $\mu_i$ assumed to be 0, as the SBM does not model zero interval bus requests.
$Q_{ij} = N_j / N_i$ is the bus event count ratio. Although $N_i$ and $N_j$ are the actual bus event counts observed during the bus prediction interval, we need to take into account the fact that these bus event counts resulted while we ignored the bus stall delays, and therefore using them directly will lead to inaccuracy. In order to include the bus stall delay effects in the bus event count ratio, we define the average bus access interval $G_i$ as
$$G_i = E[L_i] + E[B_i] + E[D_i]$$
where $E[L_i]$ is the interval workload expectation, $E[B_i]$ is the bus workload expectation, and $E[D_i]$ is the bus stall delay. Then the bus event count ratio $Q_{ij}$ is calculated as
$$Q_{ij} = \frac{G_i}{G_j}$$
Here, $E[D_i]$ and $E[D_j]$ are the predicted bus stall delays, which are calculated by an iterative method.
BBM
On the processor with higher priority, the occurrence of a zero interval workload in effect merges successive bus workloads. The expectation of the merged bus workloads is given as:
While the merged request inactivation probability $y^*_{ij}$ is given as,
Moreover, when $\alpha_{ij} = 0$, since $PE_i$ will only see $(1 - \mu_j) N_j$ effective bus workloads, the bus event count ratio $Q_{ij}$ in this case is rewritten as:
The bus stall expectation is calculated as,
Note that when $\mu_j = 0$ for all PEs, the BBM reduces to the SBM.
Calculation Flow
Detail of the bus stall delay calculation flow is described below. Here, let PEset be the set of processor indices. 
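Because the predicted stall $E[D_i]$ feeds back into the bus event count ratio $Q_{ij}$, the calculation is naturally a fixed-point iteration. The sketch below shows only that outer loop and should be read as an illustration under stated assumptions: it takes $G_i = E[L_i] + E[B_i] + E[D_i]$ and $Q_{ij} = G_i / G_j$, and it receives the per-pair stall expression as a caller-supplied function `stall_fn(i, j, q_ij)`, which is a hypothetical placeholder and not the stall formula of the SBM or BBM.

```python
def iterate_bus_stall(mean_L, mean_B, stall_fn, tol=1e-9, max_iter=1000):
    """Fixed-point iteration for the expected bus stall E[D_i] on every PE.

    mean_L, mean_B: lists with E[L_i] and E[B_i] for each PE index in PEset.
    stall_fn(i, j, q_ij): hypothetical placeholder returning the stall that
        bus events on PE_j contribute to a request on PE_i, given Q_ij.
    Returns (D, G) with the stall and average bus access interval per PE.
    """
    pe_set = range(len(mean_L))
    D = [0.0 for _ in pe_set]  # start from "no stall", as in the collected statistics
    G = [mean_L[i] + mean_B[i] for i in pe_set]
    for _ in range(max_iter):
        # Average bus access interval including the current stall estimate.
        G = [mean_L[i] + mean_B[i] + D[i] for i in pe_set]
        # Re-evaluate the stall on each PE with the updated event count ratios.
        new_D = [sum(stall_fn(i, j, G[i] / G[j]) for j in pe_set if j != i)
                 for i in pe_set]
        converged = max(abs(a - b) for a, b in zip(new_D, D)) < tol
        D = new_D
        if converged:
            break
    return D, G

# Toy example: only a lower index (assumed higher priority) PE causes stalls.
mean_L = [20.0, 20.0]
mean_B = [4.0, 4.0]
toy_stall = lambda i, j, q: 0.1 * q * mean_B[j] if j < i else 0.0
D, G = iterate_bus_stall(mean_L, mean_B, toy_stall)
```

In the toy example the highest priority PE is never stalled, while the stall on the second PE converges to the solution of $D_1 = 0.4\,(24 + D_1)/24$, i.e. $24/59$ cycles per request.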
Multi-blocking Behavior (Limitation of SBM and BBM)
We define multi-blocking behavior as the event where a single bus request on $PE_i$ is blocked consecutively by bus workloads on multiple higher priority PEs. Assuming that a request $r_i$ on $PE_i$ is blocked by $PE_j$ and that there is at least one more bus master $PE_l$ such that $\alpha_{ij} = 0$ and $\alpha_{il} = 0$, there is a possibility that immediately after $PE_j$ releases the shared bus, a bus workload on $PE_l$ wins the bus and $r_i$ is blocked for another bus workload on $PE_l$. Note that this consecutive blocking can happen indefinitely. Figure 11 shows the multi-blocking behavior on a shared bus. This behavior has not been modeled in the SBM or BBM. This limitation introduces inaccuracies as the number of PEs increases and/or the target application traffic becomes intensive. This research focuses on modeling the multi-blocking behavior.
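The effect can be reproduced with a toy cycle-level arbiter. The sketch below is only an illustration of fixed-priority arbitration, under the assumptions that a lower PE index means higher priority, that a granted transfer is never preempted, and that each request is described by a hypothetical `(pe, request_cycle, length)` tuple; it is not the bus simulator used in the experiments.

```python
def simulate_fixed_priority_bus(requests):
    """Return the stall (in cycles) of every request on a fixed-priority shared bus.

    requests: list of (pe, request_cycle, length); lower pe index = higher priority
              (an assumption of this sketch). Transfers are non-preemptive.
    Returns a dict mapping (pe, request_cycle) -> stall cycles.
    """
    pending = sorted(requests, key=lambda r: r[1])  # ordered by issue time
    waiting = []        # requests issued but not yet granted
    stalls = {}
    bus_free_at = 0     # first cycle at which the bus is idle again
    cycle = 0
    idx = 0
    while idx < len(pending) or waiting:
        # Admit every request issued up to the current cycle.
        while idx < len(pending) and pending[idx][1] <= cycle:
            waiting.append(pending[idx])
            idx += 1
        # When the bus is idle, grant it to the highest priority waiting request.
        if waiting and cycle >= bus_free_at:
            waiting.sort(key=lambda r: r[0])
            pe, issued, length = waiting.pop(0)
            stalls[(pe, issued)] = cycle - issued
            bus_free_at = cycle + length
        cycle += 1
    return stalls

# PE2 requests the bus while PE0 owns it; PE1 then requests before PE0 finishes.
stalls = simulate_fixed_priority_bus([(0, 0, 4), (2, 1, 2), (1, 3, 3)])
```

Here the request on PE2 is issued at cycle 1 while PE0 holds the bus until cycle 4. A single-blocking view would predict a stall of 3 cycles, but PE1 wins arbitration at cycle 4 and PE2 is only granted at cycle 7, for a stall of 6 cycles.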
Multi Blocking Model
The Multi Blocking Model (MBM) captures multi blocking behavior as defined in the previous section. A comparison of the modeled bus stall by SBM, BBM and MBM is shown in Fig. 5 . As shown, SBM only captures single blocking behavior, BBM also captures the potential extended stall by burst blocking behavior and MBM captures multi blocking behavior as well.
First we define some basic terms.
Key Terms
Effective Bus Workload
When a bus request on $PE_i$ is blocked by a bus workload on $PE_j$ (such that the priority of $PE_j$ is higher than the priority of $PE_i$), the occurrence of a zero interval workload on $PE_j$ or a bus workload on $PE_k$ (such that the priority of $PE_k$ is also higher than the priority of $PE_i$) can block $PE_i$ continuously for multiple bus workloads. This kind of blocking can happen in different combinations on all high priority PEs. Effectively, from the perspective of $PE_i$, the bus workloads are merged into one continuous workload. This merged bus workload is termed the effective bus workload from here on. Note that the effective bus workload on each $PE_j$ will be different for each $PE_i$, as opposed to the merged bus workload defined for the BBM, which is the same for any lower priority PE. Hence it is denoted as $B_{ij}$, i.e. the effective bus workload on $PE_j$ from the perspective of $PE_i$. Calculation of the effective bus workload is one of the key steps in the MBM.
Effective Request Inactivation Probability
Similar to the effective bus workload, the probability that request $r_i$ does not occur on $PE_i$ within the duration of the effective bus workload $B_{ij}$ is called the effective request inactivation probability $Y_{ij}$.
Mathematical Model
Derivation of the model follows in this section. For simple reading, first the derivations are done with the assumption that there are two higher priority PEs, both of which can generate burst traffic as well as show multi blocking behavior. At the end, the general form of the mathematical equations for n number of higher priority PEs is reported.
Probability Mass Functions
As noted above, the effective bus workload on each $PE_j$ will be different for each observer $PE_i$. Let us first derive the equations for $PE_i$ such that there are two higher priority PEs, $PE_0$ and $PE_1$. Some terms used in the derivation are defined below.
i j ]: Expectation of all merged bus workloads ($b_j$ or $b_k$) starting with $b_j$ and terminating with $b_k$, such that $PE_i$ is lower priority than $PE_j$ and $PE_k$.
$Y_{ij}$: Request inactivation probability of request $r_i$ with probability $\lambda_i$ on the effective bus workload $B_{ij}$.
$Y_{ijl}$: Partial request inactivation probability of request $r_i$ with probability $\lambda_i$ on the merged bus workload starting with $b_j$ and terminating with $b_l$.
Expression for $f_{jl}(m, k)$
Now, assuming two higher priority PEs $PE_0$ and $PE_1$, first expressions for $f_{jl}(m, k)$ are derived:
$$\begin{bmatrix} f_{00}(0,k) & f_{10}(0,k) \\ f_{01}(0,k) & f_{11}(0,k) \end{bmatrix} = \begin{bmatrix} f_{B_0}(k) & 0 \\ 0 & f_{B_1}(k) \end{bmatrix}$$
$$\begin{bmatrix} f_{00}(1,k) & f_{10}(1,k) \\ f_{01}(1,k) & f_{11}(1,k) \end{bmatrix} = \sum_{n} \begin{bmatrix} f_{B_0}(n) & 0 \\ 0 & f_{B_1}(n) \end{bmatrix} [C]_2 \begin{bmatrix} f_{00}(0,k-n) & f_{10}(0,k-n) \\ f_{01}(0,k-n) & f_{11}(0,k-n) \end{bmatrix}$$
$$\begin{bmatrix} f_{00}(m,k) & f_{10}(m,k) \\ f_{01}(m,k) & f_{11}(m,k) \end{bmatrix} = \sum_{n} \begin{bmatrix} f_{B_0}(n) & 0 \\ 0 & f_{B_1}(n) \end{bmatrix} [C]_2 \begin{bmatrix} f_{00}(m-1,k-n) & f_{10}(m-1,k-n) \\ f_{01}(m-1,k-n) & f_{11}(m-1,k-n) \end{bmatrix}$$
where
$$[C]_2 = \begin{bmatrix} C_{00} & C_{10} \\ C_{01} & C_{11} \end{bmatrix}$$
Probability Mass Function of All Merged Bus Workloads
Next the probability mass function of all merged bus workloads is derived as,
Scaling Factor $F_{jl}(m)$
The scaling factor $F_{jl}(m)$ is derived as,
where,
Effective Bus Workload Expectation
Recall that $B_{ij}$ is the effective bus workload on $PE_j$ as observed by a bus request on $PE_i$. Let $E[B_{ij}]$ be the expectation of the effective bus workload $B_{ij}$. Figure 6 A, B, C and D show the multiple merging fashions on two higher priority PEs with event probabilities for merged bus workloads, for $m = 0, 1, 2$ and $3$ respectively.
Expectation of m + 1 Merged Bus Events
Expectation of m + 1 merged bus events is calculated as,
Expectation of All Merged Bus Workloads
Expectation of all merged bus workloads is given as, 
Note that in the above calculation an infinite bus event count $m$ is assumed, although the actual number of bus events is not infinite. The actual number of bus events on each $PE_j$ observed during the bus prediction interval is given as $N_j$. However, we need to take into account the fact that these bus event counts resulted while the bus stall delays were ignored, and therefore using them directly will lead to inaccuracy. As in the SBM, the bus event count ratio $Q_{ij} = N_j / N_i$ is rewritten as $Q_{ij} = G_i / G_j$. The average bus access interval $G_i$ incorporates the effect of bus stalls and is calculated as
Moreover, as shown in Section 5.2.4, the value of $([C]_2)^m$ decays with increasing $m$; therefore the assumption above does not introduce significant inaccuracy. However, in cases where the probability that neither $b_j$ nor $b_l$ immediately follows $b_j$, calculated as $1 - (C_{jj} + C_{jl})$, is very small, the assumption of an infinite bus event count becomes a significant source of inaccuracy. This kind of traffic pattern results in starvation of low priority PEs and will be covered in detail in Section 6.2.
Calculation of $[1\ 1][H]_2 (I - [C]_2)^{-1}$
Let $a = 1 - C_{00}$, $b = -C_{10}$, $c = -C_{01}$, $d = 1 - C_{11}$, then
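With these substitutions, $I - [C]_2$ is the $2 \times 2$ matrix with rows $(a, b)$ and $(c, d)$, so its inverse follows from the standard closed form $\frac{1}{ad - bc}$ times the matrix with rows $(d, -b)$ and $(-c, a)$. The sketch below evaluates that closed form and checks it by multiplication. It assumes the layout of $[C]_2$ with rows $(C_{00}, C_{10})$ and $(C_{01}, C_{11})$; the function names and the numeric values are illustrative only.

```python
def inverse_i_minus_c(c00, c10, c01, c11):
    """Closed-form inverse of I - [C]_2 for [C]_2 = [[C00, C10], [C01, C11]]."""
    a, b, c, d = 1.0 - c00, -c10, -c01, 1.0 - c11
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("I - [C]_2 is singular")
    # Standard 2x2 inverse: swap the diagonal, negate the off-diagonal, divide by det.
    return [[d / det, -b / det],
            [-c / det, a / det]]

def mat_mul(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[r][k] * y[k][col] for k in range(2)) for col in range(2)]
            for r in range(2)]

# Illustrative consecutive bus event probabilities (not measured values).
c00, c10, c01, c11 = 0.2, 0.1, 0.3, 0.4
inv = inverse_i_minus_c(c00, c10, c01, c11)
i_minus_c = [[1.0 - c00, -c10], [-c01, 1.0 - c11]]
product = mat_mul(i_minus_c, inv)  # should be (numerically) the identity matrix
```

The determinant $ad - bc$ stays positive whenever $C_{00} + C_{01} < 1$ and $C_{10} + C_{11} < 1$, since then $ad > C_{01} C_{10} = bc$.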
Effective Request Inactivation Probability
Partial Request inactivation probability is calculated as,
The overall Request inactivation probability is simply calculated by summation of the partial request inactivation probabilities,
$\lim_{m \to \infty} ([C]_2)^m = 0$, $\lim_{m \to \infty} ([V]_2 [C]_2)^m = 0$
Recall that $C_{jl}$ is the probability of event $bb_{jl}$, i.e. $b_l$ immediately follows $b_j$ (observed at $PE_j$), and $1 - (C_{jj} + C_{jl})$ is the probability that neither $b_j$ nor $b_l$ immediately follows $b_j$.
First of all, we know that, practically speaking, all probabilities $C_{jl} < 1$; moreover, $(C_{jj} + C_{jl}) < 1$, assuming that the probability that neither $b_j$ nor $b_l$ immediately follows $b_j$ is greater than zero.
Let
Then, 
Note that for all $0 < C_{jl} < 1$, $0 < C_{jj} < 1$ and $1 - (C_{jj} + C_{jl}) > 0$, the Perron-Frobenius theorem [35] can also be used to deduce that $\lim_{m \to \infty} ([C]_2)^m = 0$.
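The decay of $([C]_2)^m$ can also be checked numerically. The following sketch raises an illustrative $[C]_2$ to successive powers and records the largest entry of each power; the matrix values and helper names are assumptions chosen so that $C_{jj} + C_{jl} = 0.5$ for both $j$, and they do not come from the benchmark traffic.

```python
def mat_mul(x, y):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(x[r][k] * y[k][col] for k in range(2)) for col in range(2)]
            for r in range(2)]

def power_max_entries(C, max_m):
    """Return the largest entry of C^m for m = 1 .. max_m."""
    largest = []
    P = [[1.0, 0.0], [0.0, 1.0]]  # C^0 = I
    for _ in range(max_m):
        P = mat_mul(P, C)
        largest.append(max(max(row) for row in P))
    return largest

# Layout [[C00, C10], [C01, C11]]; each column sums to C_jj + C_jl = 0.5 < 1.
C = [[0.2, 0.1],
     [0.3, 0.4]]
decay = power_max_entries(C, 20)
```

Because every column of this non-negative matrix sums to 0.5, each entry of $([C]_2)^m$ is bounded by $0.5^m$, so the recorded maxima shrink at least geometrically. The closer $C_{jj} + C_{jl}$ gets to 1, the slower this decay becomes, which matches the starvation case noted for the infinite bus event count assumption.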
Consecutive Bus Event
The consecutive bus event probability, i.e. $\Pr(bb_{ij})$, is calculated on two observers, $PE_i$ and $PE_j$. $C_{ij}$ is the probability of event $bb_{ij}$ observed at $PE_i$, and $S_{ij}$ is the probability of event $bb_{ij}$ observed at $PE_j$. $C_{ij}$ is calculated in different ways for $i = j$ and $i \neq j$.
For $i = j$, event $bb_{ij}$ happens when (1) a burst transfer request arrives on $PE_i$ with probability $\mu_i$ and a higher priority request does not arrive, i.e. event $bb_{il}$ does not occur for ($l < i$), therefore,
For $i \neq j$, event $bb_{ij}$ happens when (1) request $r_j$ is blocked by $b_i$, i.e. event $blk_{ji}$ happens, or $r_j$ immediately follows $b_i$, and (2) event $bb_{il}$ does not happen for all ($l < j$). (1) can be easily modeled as a $blk_{ji}$ where bus event $b_i(k)$ has 1 extended cycle.
Therefore, for $\alpha_{ji} = 1$, (1) can be calculated as,
And for $\alpha_{ji} = 0$, (1) becomes,
Therefore, the overall $C_{ij}$ can be calculated by summing over all $k$,
where $U^*_{ji} = (1 - \mu_j) S_{ji} - (1 - \lambda_j)(1 - S_{ji})$.
$S_{ij}$ is observed at $PE_j$, as opposed to $C_{ij}$, which is observed at $PE_i$. To reflect this difference the bus event count ratio $Q_{ji} = N_i / N_j$ is introduced.
n-PE Equations
Finally, we can write the equations for effective bus workload, and effective request inactivation probability for n number of higher priority PEs. For effective bus workload,
And effective request inactivation probability,
Bus Stall Expectation
The bus stall expectation is given as,
When $\alpha_{ij} = 0$, $PE_i$ will only see the effective bus workloads, i.e. the bus workloads that do not follow immediately after another higher priority bus workload. The bus event count ratio $Q_{ij}$ in this case is rewritten as:
Calculation Flow
Detail of the bus stall calculation flow is described below. Here, let PEset be the set of processor indices. 
Experiments and Results
Performance estimation of an MPSoC shared bus was performed using the Multi Blocking Model (MBM). We used recorded traffic patterns for benchmark applications as presented by Liu et al. [36]. The experiments were performed for two traffic intensive applications: (1) "SPEC95 Fpppp", a chemical program performing multi-electron integral derivatives, which consists of 334 tasks and 1145 communication links, and (2) "Fast Fourier Transform" with 1024 inputs of complex numbers, which consists of 16384 tasks and 25600 communication links. For comparison, prediction results are also shown for two low traffic applications: SPARSE Matrix Solver, a "Random sparse matrix solver for electronic circuit simulations", and ROBOT, a Newton-Euler dynamic control calculation for the 6-degrees-of-freedom Stanford manipulator. The benchmark applications are run on two architectures: (1) four processing elements connected through a shared bus and (2) eight processing elements connected through a shared bus.
Recorded traffic patterns of the benchmark applications for both architectures are fed into the trace workload simulator. Histograms are populated over the course of a predefined window of the application clock. At the end of every time window, the statistics are input to the mathematical model and the expected bus stall is predicted. This predicted stall is then added to the clock of each PE, as detailed in Section 3.2. The results are then compared with the simulation method. Figures 7 through 11 show the comparison of the "average bus access interval" calculated using the bus simulation method, using prediction by BBM and using prediction by MBM, while Table 1 shows a closer comparison of estimation error between the BBM and MBM models.
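The per-window flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the `StallModel` callable is a hypothetical stand-in for the analytical model, and the dictionary keys `window_cycles` and `n_workloads` are names invented for this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical interface: maps one window's traffic statistics to E[D_i],
# the expected stall cycles per bus request. The actual analytical model is
# defined by the paper's equations and is not reproduced here.
StallModel = Callable[[Dict[str, float]], float]


def stall_adjusted_clock(windows: List[Dict[str, float]], model: StallModel) -> float:
    """Accumulate one PE's clock over all prediction windows."""
    clock = 0.0
    for stats in windows:
        # Stall-free cycles covered by this window (assumed key name).
        clock += stats["window_cycles"]
        # Expected stall per request for this window, E[D_i].
        expected_stall = model(stats)
        # Total predicted stall for the window, E[D_i] * N_i.
        clock += expected_stall * stats["n_workloads"]
    return clock
```

With a placeholder model that always returns 2 stall cycles per request, two windows of 1000 cycles holding 10 and 5 bus workloads yield a clock of 2030 cycles.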
Estimation Error
Four PE Architecture
On a 4 PE architecture, BBM shows minimum 0.1% and maximum 1.75% estimation errors for FPPPP and minimum 2.2% and maximum 13.91% estimation errors for FFT. MBM, on the other hand, shows minimum 0.02% and maximum 0.6% estimation errors for FPPPP and minimum 0.6% and maximum 8.8% estimation errors for FFT.
Eight PE Architecture
On the 8 PE architecture, BBM shows minimum 0.005% and maximum 0.08% estimation errors for the ROBOT benchmark, and minimum 0.075% and maximum 1.5% estimation errors for the SPARSE benchmark. However, for FPPPP it shows a minimum of 0.28% and a maximum of 11.8% error. MBM, on the other hand, shows minimum 0.005% and maximum 0.8% estimation errors for ROBOT, minimum 0.003% and maximum 0.945% for SPARSE, and minimum 0.098% and maximum 2.7% for FPPPP. From these observations we conclude that BBM is suitable and accurate for applications that are not traffic intensive; for traffic intensive applications, however, BBM shows an increasing estimation error. Table 1 shows a comparison of estimation error on each individual PE for the FPPPP benchmark. It is evident that the error increases for lower priority PEs when using BBM, while MBM retains its accuracy.
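For readers who want to reproduce minimum and maximum figures of this kind from their own runs, a small helper is sketched below. It assumes that the estimation error is the absolute difference between the estimated and simulated values relative to the simulated value; this definition is an assumption of the sketch, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def estimation_error_percent(estimated: float, simulated: float) -> float:
    """Relative error of an estimate against the simulated reference, in percent."""
    if simulated == 0:
        raise ValueError("simulated value must be non-zero")
    return 100.0 * abs(estimated - simulated) / abs(simulated)


def error_range(estimated: Sequence[float], simulated: Sequence[float]) -> Tuple[float, float]:
    """Minimum and maximum per-PE estimation error, in percent."""
    errors = [estimation_error_percent(e, s) for e, s in zip(estimated, simulated)]
    return min(errors), max(errors)
```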
The Curious Case of FFT8
Performance estimation values for the FFT benchmark on the 8-PE architecture are unexpected and show a huge estimation error. Both the BBM and the MBM over-estimate the bus stall for lower priority PEs. In the case of BBM, the estimated cycle count is 50-60% greater than the simulated cycle count, while for MBM the over-estimation is above 1000%, as shown in Fig. 12. Intuitively, BBM should always under-estimate the cycle count for lower priority PEs since it does not account for multi-blocking behavior. This observation led us to inspect the traffic pattern of this benchmark closely. Observations were made using the bus simulation method to check for starvation. Table 2 reports a record of workload completion on each PE. During the simulation, whenever any PE completes its last bus workload, the number of completed bus workloads on each PE is logged at that point in the simulation. This point is termed a "Check Point" (CP). "F" indicates that a PE has already finished all of its bus workloads. Note that the three lowest priority PEs did not complete a single bus transfer until the three highest priority PEs had finished all their bus transfers. It is evident from the observations that only a negligible number of bus requests on the lowest priority PEs incurred any bus stall due to bus workloads on the highest priority PEs. Moreover, a significant portion of the bus workloads on all PEs used the bus when the effective number of bus masters competing for bus access was four, three, two or even only one. On the other hand, as shown in Fig. 3 and Fig. 4, the prediction model assumes a full crossbar bus during a prediction window, such that all PEs are assumed to perform bus transfers without incurring any stall.
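The check point record described above can be produced from an ordered list of workload completions. The sketch below is an assumption-laden illustration rather than the logging code used for Table 2: `checkpoint_log` is a hypothetical name, PEs are identified by integer index, and the PE that triggers a check point is shown with its full count while only PEs that finished at an earlier check point are marked "F".

```python
from typing import List, Sequence, Union

Cell = Union[int, str]


def checkpoint_log(completions: Sequence[int], totals: Sequence[int]) -> List[List[Cell]]:
    """Build one row per check point from a sequence of completed bus workloads.

    completions: PE index of every completed bus workload, in completion order.
    totals:      total number of bus workloads assigned to each PE.
    """
    done = [0] * len(totals)
    finished_earlier = set()
    rows: List[List[Cell]] = []
    for pe in completions:
        done[pe] += 1
        if done[pe] == totals[pe]:
            # A PE just completed its last bus workload: log a check point.
            row: List[Cell] = [
                "F" if p in finished_earlier else done[p] for p in range(len(totals))
            ]
            rows.append(row)
            finished_earlier.add(pe)
    return rows
```

A row in which a low priority PE still shows 0 while a high priority PE has just finished is the signature of starvation discussed in this section.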
For every prediction window of $T$ cycles, the expected bus stall cycles per request, $E[D_i]$, is calculated, and the total bus stall during the prediction interval, $E[D_i] \cdot N_i$, is added to the processor's simulation clock, where $N_i$ is the total number of bus workloads within the prediction window $T$. In the case of starvation, a lower priority PE does not get the bus until all the bus workloads on higher priority PEs have been completed. Therefore, at least $(N_i - 1)$ bus workloads on the lowest priority PEs are wrongly predicted to incur stall due to the highest priority PEs. Consider a hypothetical example. Figure 13 shows two PEs, $PE_H$ and $PE_L$, where $PE_H$ has the higher priority. During the prediction window, assuming a full crossbar, both PEs are assumed to have completed 10 bus transfers each, and workload statistics are accumulated. The prediction model uses these statistics to calculate the expected stall per bus request. However, when using a shared bus, the first bus request on $PE_L$ is blocked until $PE_H$ finishes all its bus workloads, as shown in Fig. 14, after which $PE_L$ completes all its bus workloads without incurring any stall at all. The prediction model assumes that 10 requests are issued on each PE in the $T$ cycles; however, simulation shows that only 1 bus request is issued on $PE_L$. The prediction model calculates the incurred bus stall assuming all 10 bus requests must be serviced before the next window starts. This adds a huge stall to all 10 bus requests; in reality, however, the remaining 9 requests will not be issued until the first bus request is serviced. In contrast, the workload completion record of the FPPPP8 benchmark, shown in Table 3, indicates the absence of starvation, hence the prediction model is able to estimate with higher accuracy. A close look at the CP0 row and the PE7 column clearly shows the difference between the two traffic patterns.
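The hypothetical two-PE example above can be replayed with a very small fixed-priority bus simulation. The sketch below makes several assumptions that are not taken from the paper: transfers are non-preemptive, a PE issues its next request only after the previous one has been serviced, and the 4-cycle transfer length and zero compute gap are illustrative values.

```python
from typing import List, Sequence


def simulate_fixed_priority(
    n_transfers: Sequence[int],
    transfer_cycles: Sequence[int],
    compute_cycles: Sequence[int],
) -> List[List[int]]:
    """Return the stall (in cycles) of every request, per PE.

    PEs are indexed by priority, index 0 being the highest.
    """
    n = len(n_transfers)
    ready_at = [0] * n              # cycle at which each PE issues its next request
    remaining = list(n_transfers)   # bus workloads still to be issued per PE
    stalls: List[List[int]] = [[] for _ in range(n)]
    now = 0
    while any(remaining):
        pending = [p for p in range(n) if remaining[p] and ready_at[p] <= now]
        if not pending:
            # Bus is idle: jump to the next request arrival.
            now = min(ready_at[p] for p in range(n) if remaining[p])
            continue
        pe = pending[0]                       # lowest index wins arbitration
        stalls[pe].append(now - ready_at[pe])  # cycles spent waiting for the bus
        now += transfer_cycles[pe]            # bus is held for the whole transfer
        remaining[pe] -= 1
        ready_at[pe] = now + compute_cycles[pe]
    return stalls


# PE_H (index 0) and PE_L (index 1), 10 transfers each, as in the example.
stalls = simulate_fixed_priority([10, 10], [4, 4], [0, 0])
print(stalls[1])
```

Under these assumptions only the first request of $PE_L$ is stalled (for the 40 cycles that $PE_H$ occupies the bus), and the remaining 9 requests see no stall. A model that distributes stall over all 10 requests of the window therefore over-estimates, which is the effect described in the text.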
Bus Starvation
In concurrent computing, starvation is measured by a bound on bypass: if $n$ processes are competing for access to a shared resource, a process is deemed starved unless it gains access after being bypassed at most $f(n)$ times by other processes, for some function $f$ [37]. However, the value of $n$ is subjective and can differ between applications. The term "bus starvation" is used here to describe a situation where one or more PEs are starved of bus access such that, on the starved PE, the number of serviced bus requests is less than 0.5% of the number of serviced requests within a time window. This threshold is based on our experiments on the FFT8 application and is purely subjective. Secondly, while calculating the expectation of all merged bus workloads, the proposed model assumes infinite bus workloads. This assumption becomes a significant source of inaccuracy for traffic patterns that result in starvation, because the rate at which $([C]_n)^m$ shrinks is very small. For $PE_7$ this probability is $1 - \sum_{x=0}^{6} C_{jx}$. The values of $\sum_{x=0}^{6} C_{jx}$ are reported in Table 4. For the FFT8 example, we noted that when $\sum_{x=0}^{6} C_{jx}$ becomes greater than 0.9, the traffic pattern starts to cause starvation, i.e. as seen by the lowest priority PE, $b_j$ is immediately followed by a bus event on a higher priority PE at least 90% of the time.
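The two indicators discussed above can be expressed as simple checks. The sketch below is illustrative only: the function names are hypothetical, the 0.5% share is interpreted as the starved PE's serviced requests relative to all serviced requests in the window (an assumption), and the last function uses a scalar geometric series as a simplified stand-in for the matrix power $([C]_n)^m$ to show why a sum close to 1 makes the merged sequence of bus events very long.

```python
from typing import Sequence


def is_starved(serviced_on_pe: int, serviced_total: int, share: float = 0.005) -> bool:
    """Serviced-share criterion: the PE received less than `share` of all grants."""
    return serviced_total > 0 and serviced_on_pe / serviced_total < share


def may_cause_starvation(c_row: Sequence[float], threshold: float = 0.9) -> bool:
    """Threshold noted for FFT8: the sum of C_jx over higher priority PEs exceeds 0.9."""
    return sum(c_row) > threshold


def expected_merged_events(c_row: Sequence[float]) -> float:
    """Scalar simplification: sum over m >= 0 of p**m = 1 / (1 - p), with p = sum(C_jx)."""
    p = sum(c_row)
    if not 0.0 <= p < 1.0:
        raise ValueError("sum of probabilities must lie in [0, 1)")
    return 1.0 / (1.0 - p)
```

In this simplified view, a sum of 0.9 corresponds to an expected run of about 10 consecutive higher priority bus events and a sum of 0.99 to about 100, so the infinite-workload assumption weighs heavily once the sum approaches 1.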
Identifying Bus Starvation
Usually it is not straightforward for application developers to determine whether an application exhibits bus starvation. Simulation is performed for specific bus architectures and specific task mappings to evaluate the bus performance and to identify potential bus starvation. However, in our experiments we relied on the results of the proposed prediction model to identify starvation. The proposed prediction model, although hugely over-estimating the actual stall, raised a red flag that led to close inspection and the identification of starvation. We feel that the proposed model could also serve as a tool to identify such cases of starvation, so that designers can examine the application and the architecture closely and tweak either one to avoid bus starvation on any PE.
FFT8manipulated
To demonstrate this feature, the FFT8 traffic was manipulated to find a starvation-free configuration. Assuming a 2x increase in the bus bandwidth and computation speed reduced by 1/3x, the resulting system showed much improved bus performance and faster application execution despite the reduced computation speed. The prediction results were accurate, with about 5% prediction error. Figure 15 shows a comparison of the estimated and simulated average bus access interval, while Table 5 shows a record of workload completion on each PE. This experiment was performed to demonstrate the usability of the prediction model in identifying starvation cases; further detailed work on more such benchmarks has not been performed and remains a potential area for future work.
Simulation Speedup
Next we compare the simulation time of the proposed estimation technique with that of the simulation technique. The simulation time for the "simulation based prediction method" consists of two components, $T_{sim}^{bus}$ and $T_{sim}^{que}$, where $T_{sim}^{bus}$ is the time it takes to simulate the bus access itself, while $T_{sim}^{que}$ is the time it takes to maintain the arbitration queue on the arrival or granting of every new bus request. $T_{sim}^{bus}$ can be calculated as,
where $t_{ba}$ is the time it takes to simulate one bus access and $N$ is the total number of bus workloads on a PE.
$T_{sim}^{que}$ has to be calculated on each individual arrival or grant of a bus request, as the time required to maintain the queue differs depending on the number of PEs in the queue. Overall,
Here $I$ is the number of times a benchmark/application is executed. This is discussed in detail at the end of this section. On the other hand, the simulation time for the proposed "prediction" technique can be calculated as,
And Speed-up is simply calculated as,
$$\text{Speedup} = \frac{T_{sim}}{T_{prd}}$$
The expression for $T_{sim}$ clearly shows that $T_{sim}$ increases as the length of the simulation increases. On the other hand, the expression for $T_{prd}$ shows that it increases with the number of prediction time windows $W$. The length of the prediction window can be adjusted to the size of the simulation so that the total number of windows $W$ does not increase drastically, which results in only a slight increase in $T_{prd}$ as the simulation data length increases. As a result, the speedup ratio increases as $I$ increases. Figure 16 and Fig. 18 report $T_{sim}$ for each individual PE with an increasing value of $I$ for the FPPPP8 and FFT8 benchmarks respectively. Experiments were performed for $I$ = 1, 10, 100, 500 and 1000. Moreover, Fig. 17 and Fig. 19 report a comparison between total $T_{sim}$, total $T_{prd}$ and the resulting speedup ratio for both benchmarks. As evident from the graphs, $T_{sim}$ is low for shorter simulations; however, the longer the simulation continues, the more drastically $T_{sim}$ increases, while $T_{prd}$ increases only slightly.
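The qualitative trend described above can be illustrated with a toy cost model. The sketch below does not reproduce the paper's expressions for $T_{sim}$ and $T_{prd}$: it assumes, purely for illustration, that simulation time grows linearly with the number of executions and bus workloads, that queue maintenance costs a constant time per workload (the text notes that it actually varies with queue occupancy), and that prediction time depends only on the number of windows. All parameter names and values are hypothetical.

```python
def t_sim(iterations: int, n_workloads: int, t_ba: float, t_que: float) -> float:
    """Toy simulation time: every workload pays bus access plus queue upkeep."""
    return iterations * n_workloads * (t_ba + t_que)


def t_prd(n_windows: int, t_window: float) -> float:
    """Toy prediction time: one model evaluation per prediction window."""
    return n_windows * t_window


def speedup(iterations: int, n_workloads: int, t_ba: float, t_que: float,
            n_windows: int, t_window: float) -> float:
    """Speedup = T_sim / T_prd under the toy cost model above."""
    return t_sim(iterations, n_workloads, t_ba, t_que) / t_prd(n_windows, t_window)


# With the number of windows held fixed, the ratio grows with the iteration count.
for i in (1, 10, 100, 500, 1000):
    print(i, speedup(i, 1000, 1e-6, 1e-6, 50, 1e-4))
```

Holding the window count fixed while the trace grows mirrors the adjustment of the prediction window length described in the text; if the window count were allowed to grow with the trace, the ratio in this toy model would flatten.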
For the experiments, an increasing value of $I$ was chosen in order to increase the simulation data length and show that the proposed method is robust enough for any length of input data. The data length generated by 1000 iterations is long enough to demonstrate that the proposed method is not adversely affected as the simulation length increases, as reflected in Fig. 17 and Fig. 19. This highlights the benefit of using the proposed model for thorough and iterative performance estimation that involves running an application multiple times and involves multiple cycles of performance estimation, application tuning and design space exploration.
Fig. 16 "Tsim" for each PE for FPPPP8, with increasing "I".
Fig. 17 Comparison between "Tsim" and "Tprd", and Speedup.
Fig. 18 "Tsim" for each PE for FFT8, with increasing "I".
Fig. 19 Comparison between "Tsim" and "Tprd", and Speedup.
Calculation Optimization and Model Scalability
In order to reap the speed-up benefits of the proposed technique, we optimize the calculation of the values used by the mathematical model. First, the values that are independent of the iteratively calculated value $Q_{ij}$ are calculated outside the iterative loop, and only the values dependent on $Q_{ij}$ are updated in every iteration. Secondly, the matrices $[B]_n$, $[V]_n$, $[Y]_n$ and $[H]_n$ are all diagonal, hence specially optimized multiplication functions are implemented to avoid useless calculations. Another point of discussion for the proposed model is scalability. Given the matrix multiplication and inversion involved in calculating the "effective bus workload expectation" $E[B_{ij}]$ and the "effective request inactivation probability" $Y_{ij}$, we expect the calculation to become more costly as the number of PEs, and as a result the matrix size, increases. The matrix inverse operation in particular could be a speed bottleneck. In future work, an approximate model that can limit the size of the matrices while maintaining accuracy will be developed in order to make the current estimation model scalable to any number of PEs.
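The diagonal-matrix optimization mentioned above can be sketched as follows. This is a generic illustration of the idea, not the authors' implementation: a diagonal matrix is stored as a list of its diagonal entries, and the function names are hypothetical. A dense reference multiplication is included only to check the shortcut.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def diag_mul(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Product of two diagonal matrices: O(n) instead of O(n^3)."""
    return [x * y for x, y in zip(a, b)]


def diag_times_dense(d: Sequence[float], m: Matrix) -> Matrix:
    """D @ M for diagonal D: row i of M is scaled by d[i], O(n^2)."""
    return [[d[i] * v for v in row] for i, row in enumerate(m)]


def dense_mul(a: Matrix, b: Matrix) -> Matrix:
    """Reference O(n^3) multiplication, used here only for verification."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


d = [2.0, 3.0]
m = [[1.0, 2.0], [3.0, 4.0]]
full_d = [[2.0, 0.0], [0.0, 3.0]]
assert diag_times_dense(d, m) == dense_mul(full_d, m)
```

The same storage choice also makes it cheap to keep products that do not depend on $Q_{ij}$ outside the iterative loop, since they can be precomputed once as plain lists.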
Future Work
Further development of our proposed technique has several directions.
As discussed before, an approximate model that can limit the size of the matrices will be developed in order to make the current estimation model scalable to any number of PEs. Secondly, multiple arbitration schemes such as TDM/Round Robin, Lottery based and Least Recently Granted will be modeled. Furthermore, we aim to augment the proposed model with a cache model that can predict the effects of cache performance on the overall performance of a bus architecture for any specific application program.
Conclusion
This paper presented an analytical model to predict arbitration stalls for a shared bus in a multi-processor system-on-chip architecture. A previous Burst Blocking Model (BBM) was extended to account for multiple interfering bus masters; we call the extended model the Multi Blocking Model (MBM). The developed model was tested mainly on an 8-PE architecture with a shared bus, and the corresponding results and comparisons were presented. We conclude that the previously developed Burst Blocking Model is accurate enough for low traffic applications even on 8-PE systems; for traffic intensive applications, however, its estimation results become inaccurate, and use of the proposed Multi Blocking Model becomes necessary.
© 2017 Information Processing Society of Japan
