Abstract-The massive integration of cores in multicore systems has enabled chip designers to build systems that meet the power-performance demands of applications. However, the full-system simulations traditionally used to evaluate the speedup of these systems are computationally expensive and time consuming. Analytical speedup models such as Amdahl's law, on the other hand, are powerful and fast ways to calculate the achievable speedup. However, Amdahl's law disregards the communication among the cores, which plays a vital role in determining the achievable speedup of multicore systems. To bridge this gap, we present PaSE, a parallel speedup estimation framework for multicore systems that considers the latency of the Network-on-Chip (NoC). To accurately capture the latency of the NoC, we also propose a queuing-theory-based analytical model. We conduct a case study for a matrix multiplication application and analyze the speedup obtained from our framework.
should consider the NoC latency while determining the execution time or speedup of parallel applications. In this work, we present a Parallel Speedup Estimation (PaSE) framework to analytically model the achievable speedup of parallel systems such as multicore processors while accounting for the NoC latency. This will enable designers to make quick coarse-grained design choices while reserving time-consuming full-system characterizations for more specific fine-tuning and performance analysis of the designs.
The PaSE framework contains two models: the latency model and the speedup model. The NoC latency model uses queuing theory to evaluate the packet latency of a NoC architecture. The NoC latency is then plugged into the speedup model to analytically compute the speedup of parallel applications on a NoC based multicore system. Hence, the contribution of this work is two-fold. First, we propose an analytical framework to accurately evaluate the speedup of NoC based multicore systems. Second, to calculate the latency of the packets in a NoC precisely, we propose a queuing theory based analytical latency model. Using the PaSE framework, we present a case study of parallel speedup for a data parallel application, matrix multiplication. From our study, we demonstrate how the speedup depends on the NoC topology, the system size, and the Computation-to-Communication (C-to-C) ratio of the application running on the multicore system.
II. RELATED WORKS
Several previous works have extended Amdahl's law for various purposes. In [7], Hill and Marty proposed corollaries of Amdahl's law for the speedup of symmetric, asymmetric, and dynamic multicore systems. In [8], the authors built on the work of Hill and Marty [7] to derive an accurate quantitative model of multicore performance. In [9], the authors studied the scalability of multicore processors and proposed a theoretical speedup model revealing that the multicore is suitable for large-scale manufacturing. In [10], the authors investigated Amdahl's law to obtain the optimum frequency, supply voltage, and energy. However, none of these models considers the communication latency between the cores in a multicore system. In [11], the authors proposed a speedup model that considers the communication latency in a multicore system. However, they used a simple latency model that neglects the contention delay in the NoC routers. Hence, to capture the effect of the NoC on multicore speedup, it is necessary to accurately estimate the NoC latency. Many NoC latency models have been proposed in recent years. In [12], the authors proposed a machine-learning-based NoC latency model called SVR-NoC. Although this work is unique and shows promising results, the large training set required to precisely calculate the latency for different NoC architectures and traffic patterns is time consuming to generate. Consequently, many NoC latency models are based on queuing theory. These works can be broadly classified into two categories: i) infinite buffer capacity queuing systems and ii) finite buffer capacity queuing systems. For example, in [13], [14], and [15], the authors proposed NoC latency models based on M/G/1, M/M/1, and G/G/1 queues with infinite buffer capacity. However, the number of virtual channels in a NoC router is limited (i.e. finite buffer capacity) due to power and area constraints.
On the other hand, the authors of [16] and [17] proposed M/G/1/K and G/G/1/K queue based latency models with finite buffer capacity. However, in NoC routers, the arrival and departure of packets follow a Poisson distribution [18]. Consequently, the service time in the NoC routers should follow an exponential distribution, as the time interval between Poisson events is exponentially distributed. Hence, assuming a general distribution for the service time limits the accuracy of the latency model. Therefore, to accurately capture the NoC latency, in this paper we propose an analytical model based on M/M/1/K queuing systems. Then, using this latency model, we propose a framework for evaluating the speedup of multicore systems.
III. THE PASE FRAMEWORK
In this section, we discuss in detail the basic assumptions, along with the latency model and the speedup model utilized by the PaSE framework.
A. Basic Assumptions and Notations
In this work, we consider a symmetric multicore system (i.e. all the cores are identical). Each core is connected to a NoC router through a network interface. The NoC routers implement a shortest-path-based deterministic routing algorithm [19] along with a virtual channel (VC) based wormhole flow control mechanism [20].
The communication among the cores occurs in the form of packets, which are divided into multiple flits. The header flit contains the routing information and the body flits simply follow the path set by the header flit. The flit size (in bits) is considered to be the same as the width of the links. We assume backpressure flow control, in which flits are not dropped when the VC buffers are full but are stalled upstream until buffer space becomes available. For the sake of simplicity, each link is assumed to be a single-cycle pipeline stage regardless of its physical length. Moreover, we consider a constant packet size, represented as M in Table I along with the other notations used in this paper.
B. Latency Model
Based on the NoC architecture and the traffic pattern, the latency model determines the average time elapsed for a packet to traverse from the source NoC router to the destination NoC router (i.e. the packet latency). As packets are divided into header and body flits, the packet latency contains the delay of both the header flit and the body flits. The packet latency, $T_{pkt}$, can therefore be calculated as

$$T_{pkt} = T_{h} + T_{b} \tag{1}$$

where $T_{h}$ is the latency of the header flit and $T_{b}$ is the latency of the body flits.
Here, the header latency is the total time for the header flit to traverse from the source NoC router to the destination NoC router, including the time spent at all intermediate routers in the path. This is alternatively known as the path-discovery latency. To determine the average time for the header flit, both the waiting time at the routers and the effective number of hops in the network are required. The effective number of hops can be determined from the NoC topology, as shown in the following subsections. The latency of the header flit can then be calculated by the following equation.
(2)
In (2), the router-level quantities are the expected waiting time in the input VCs (including the service time of the router) and the number of router pipeline stages. Due to the wormhole switching mechanism, the latency of the body flits is determined by the packet size, M.
In the next subsections, we discuss how the expected waiting time and the effective number of hops are determined to analytically calculate the packet latency.
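The composition of the packet latency described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact forms of the header and body terms are not recoverable from the text, so the sketch assumes that every hop costs the expected waiting time plus the router pipeline depth, and that the remaining M - 1 flits of a packet drain one per cycle behind the header. The function names are hypothetical.

```python
def header_latency(effective_hops, waiting_time, pipeline_stages):
    """Header-flit latency in cycles.

    ASSUMPTION: each hop costs the expected waiting time in the input VCs
    plus the number of router pipeline stages.
    """
    return effective_hops * (waiting_time + pipeline_stages)


def body_latency(packet_size_flits):
    """Body-flit latency in cycles.

    ASSUMPTION: under wormhole switching the M - 1 flits that follow the
    header arrive one per cycle once the path has been set up.
    """
    return max(packet_size_flits - 1, 0)


def packet_latency(effective_hops, waiting_time, pipeline_stages, packet_size_flits):
    """Packet latency = header latency + body latency."""
    return (header_latency(effective_hops, waiting_time, pipeline_stages)
            + body_latency(packet_size_flits))


# Example: 4 hops, 2 cycles of waiting per router, 3 pipeline stages, 8-flit packets.
example_latency = packet_latency(4, 2.0, 3, 8)  # 4 * (2 + 3) + 7 = 27.0 cycles
```

Keeping the header and body terms in separate functions makes it easy to swap in a different assumption for either term without touching the rest of the model.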
1) Determination of Expected Waiting Time:
The expected waiting time, $W$, of the header flit is the time that it must wait in the input VCs before it is routed to the output VC of a NoC router. To estimate this, we model each router as a queue with finite buffer capacity. The total buffer capacity is equal to the sum of all the input VCs in all ports. As shown in [18], both the arrival and departure of packets follow a Poisson distribution, and the time between two Poisson events is exponentially distributed. Hence, we adopt an M/M/1/K queuing model, which assumes Poisson-distributed flit arrivals, exponentially distributed router service times, and a finite buffer size, to accurately model the NoC router behavior.
Using the M/M/1/K queuing model, with a Poisson arrival rate of header flits, $\lambda$, and an exponential service rate, $\mu$, the expected waiting time in a queue of size K can be calculated as

$$W = \frac{L}{\lambda\,(1 - P_K)} \tag{4}$$

where $L$ is the expected steady-state number of header flits in the queue and $P_K$ is the steady-state probability of having K header flits in the queue. The steady-state number of header flits (or packets) in the queue can be calculated, for $\rho \neq 1$, as

$$L = \sum_{n=0}^{K} n P_n = \frac{P_0\,\rho\left[1 - (K+1)\rho^{K} + K\rho^{K+1}\right]}{(1-\rho)^{2}} \tag{5}$$

where K is the number of buffers in the queue, $\rho$ is the traffic intensity, and $P_0$ is the probability that the queue is empty. The probability of having n flits in the queue is

$$P_n = \rho^{n} P_0 = \frac{(1-\rho)\,\rho^{n}}{1-\rho^{K+1}} \tag{6}$$

where n is any whole number from 0 to K. The traffic intensity is

$$\rho = \frac{\lambda}{\mu} \tag{7}$$

It can be seen from (7) that the traffic intensity depends on the arrival rate and the service rate. The service rate, $\mu$, is a property of the NoC router and can be obtained from the router specification [13]. The arrival rate, in contrast, depends on the traffic injection pattern of the application. Given the traffic injection pattern of the application, T(i,j), the arrival rate at core j is

$$\lambda_j = \sum_{i=1}^{N} T(i,j) \tag{8}$$

where T(i,j) is the injection rate in packets/cycle from core i addressed to core j. Hence, $\lambda_j$ is the total rate at which packets addressed to core j are injected into the network from all cores, and therefore the rate at which packets can arrive at core j. For any traffic pattern, we calculate the arrival rate $\lambda_j$ for all routers. The overall arrival rate is then the average of all arrival rates,

$$\lambda = \frac{1}{N}\sum_{j=1}^{N} \lambda_j \tag{9}$$

where N is the number of NoC routers in the network. Hence, to determine the waiting time we need the router service time, which can be found in the router specification, and the arrival rate, which is a function of the traffic pattern. In the next subsection, we discuss the determination of the effective number of hops, which is used to calculate the latency of the header flit following (2).
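The waiting-time calculation can be illustrated with a minimal Python sketch based on the standard textbook M/M/1/K results. It is not the authors' implementation: the function names are hypothetical, K is assumed to be the maximum number of flits the router can hold (including the one in service), and the traffic pattern is assumed to be given as a square list of lists of injection rates in packets/cycle.

```python
def mm1k_state_probabilities(rho, K):
    """Steady-state probabilities P_0..P_K of an M/M/1/K queue."""
    if rho == 1.0:
        # Limiting case: all K + 1 states are equally likely.
        return [1.0 / (K + 1)] * (K + 1)
    p0 = (1.0 - rho) / (1.0 - rho ** (K + 1))
    return [p0 * rho ** n for n in range(K + 1)]


def mm1k_waiting_time(arrival_rate, service_rate, K):
    """Expected time in the queue, W = L / (lambda * (1 - P_K))."""
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("arrival and service rates must be positive")
    rho = arrival_rate / service_rate
    probs = mm1k_state_probabilities(rho, K)
    expected_number = sum(n * p for n, p in enumerate(probs))  # L
    effective_arrival = arrival_rate * (1.0 - probs[K])        # arrivals not blocked
    return expected_number / effective_arrival


def average_arrival_rate(traffic):
    """Average over destinations j of lambda_j = sum_i T(i, j)."""
    n = len(traffic)
    per_destination = [sum(traffic[i][j] for i in range(n)) for j in range(n)]
    return sum(per_destination) / n


# Example: two cores exchanging packets at different rates (illustrative values only).
example_traffic = [[0.0, 0.1],
                   [0.3, 0.0]]
lam = average_arrival_rate(example_traffic)            # (0.3 + 0.1) / 2
wait = mm1k_waiting_time(lam, service_rate=1.0, K=8)
```

Summing over the state probabilities instead of using the closed form for L keeps the sketch valid for the limiting case where the traffic intensity equals 1.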
2) Determination of Effective Number of Hops:
The effective number of hops in a NoC is the expected number of intermediate routers that a packet has to traverse on its way from the source to the destination. It can be calculated for any NoC architecture using (10), where N is the number of cores in the system and $h_{ij}$ is the number of hops in the shortest path between core i and core j. It can be seen from the equation that the effective number of hops differs between network topologies of the same system size because $h_{ij}$ differs. Also, for the same network topology, the effective number of hops increases with the number of cores, yielding a higher packet latency. The outcome of this latency model is used in the speedup model, as discussed next.
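As an illustration of how the effective number of hops grows with system size, the following Python sketch averages the shortest-path hop count over all ordered source-destination pairs of a 2D mesh. The uniform averaging over distinct pairs and the use of the Manhattan distance as the mesh hop count are assumptions made for this sketch; the function names are hypothetical.

```python
from itertools import product


def mesh_nodes(rows, cols):
    """Coordinates of all routers in a rows x cols 2D mesh."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def mesh_hops(src, dst):
    """Shortest-path hop count between two mesh routers (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def effective_hops(nodes, hop_fn):
    """Average hop count over all ordered pairs of distinct nodes.

    ASSUMPTION: every source-destination pair is weighted equally.
    """
    total = 0
    pairs = 0
    for src, dst in product(nodes, repeat=2):
        if src != dst:
            total += hop_fn(src, dst)
            pairs += 1
    return total / pairs


# The average hop count grows with the mesh size.
hops_4x4 = effective_hops(mesh_nodes(4, 4), mesh_hops)
hops_8x8 = effective_hops(mesh_nodes(8, 8), mesh_hops)
```

Passing the hop function as an argument allows the same averaging routine to be reused for other topologies, for example with hop counts obtained from a breadth-first search on an irregular network.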
C. Speedup Model
The speedup model is used to determine the speedup of a NoC based multicore system. The speedup is defined as the ratio of the serial execution time on a single core to the parallel execution time in a multicore environment. The speedup model requires the application profile and the architecture of the cores. The application profile contains the fraction of the application that requires serial execution (s) and the fraction of the application that requires parallel execution (p). It also contains the type and number of operations (i.e. OI) required by the application. The architecture of the cores, on the other hand, defines the number of cycles required to complete different instructions (i.e. CI). Using these parameters and the NoC latency from the latency model, the speedup model computes the parallel execution time, Tp, of an N-core system using (11). In (11), the communication overhead of the application is calculated using (1), the serial execution time is the time required to finish the total task on a single core, and a coefficient denotes the dependency of the application on the communication overhead. This coefficient can be a positive whole number that signifies the number of messages that must arrive at the core before execution of a task can proceed. It can also take fractional values if a portion of the packet latency can be masked by computation at the core due to its architecture. The time required to finish a subtask depends on the type and number of instructions in the subtask and the time required by the core to finish each instruction. Using the parallel execution time from (11), the speedup of the multicore system with N cores is given by (12) as the ratio of the single-core execution time to Tp. The complete flow of the PaSE framework, with the operation of both the latency and speedup models, is shown in Fig. 1.
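A minimal Python sketch of the speedup calculation is given below. The exact form of the parallel execution time is not recoverable from the text, so the sketch assumes an Amdahl-style expression extended with a communication term: the serial fraction runs on one core, the parallel fraction is divided evenly over N cores, and the packet latency is added after scaling by the communication-dependency coefficient (called `alpha` here, a hypothetical name). All function and parameter names are illustrative.

```python
def parallel_time(t_single, n_cores, comm_latency, alpha=1.0, serial_fraction=0.0):
    """Parallel execution time in cycles.

    ASSUMED FORM: s * T + p * T / N + alpha * latency, with p = 1 - s.
    """
    parallel_fraction = 1.0 - serial_fraction
    return (serial_fraction * t_single
            + parallel_fraction * t_single / n_cores
            + alpha * comm_latency)


def speedup(t_single, n_cores, comm_latency, alpha=1.0, serial_fraction=0.0):
    """Speedup = single-core execution time / parallel execution time."""
    return t_single / parallel_time(t_single, n_cores, comm_latency,
                                    alpha, serial_fraction)


# With zero communication latency and s = 0 the speedup equals the core count.
ideal = speedup(64512, 64, comm_latency=0.0)
# A latency equal to the per-core compute time (1008 cycles) halves the speedup.
halved = speedup(64512, 64, comm_latency=1008.0)
```

With the communication term set to zero, this expression reduces to Amdahl's law, which is a convenient sanity check for any alternative form of the parallel execution time.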
IV. RESULTS AND ANALYSIS
In this section, we present the speedup analysis with the proposed PaSE framework. First, using the proposed M/M/1/K queuing based latency model, we evaluate the latency of regular NoC topologies such as Mesh, 3D Mesh, and Folded Torus, as well as irregular topologies such as small-world networks. The small-world topology design methodology, adopted from [19], is characterized by many short links and a few long-distance links. We also consider different system sizes and injection rates in our latency evaluation. We adopt the three-stage pipelined NoC router architecture from [21] with wormhole switching [20]. Each port of the NoC router has 2 VCs (one input and one output) with a buffer depth of 8 flits, which is the same as the size of a packet. Although the routing logic in NoCs typically depends on the topology, here we assume shortest-path routing, as most deterministic routing strategies converge to shortest-path routing regardless of topology.
A. Validation of the Proposed Latency Model
In this subsection, we validate the proposed latency model against simulation-based latencies. We compare the latency of an 8x8 core system with a matrix multiplication traffic pattern. For the matrix multiplication application, we consider the multiplication of two [32 x 32] matrices, A and B. The resultant matrix is also a [32 x 32] matrix. Initially, the data is considered to reside in a distributed manner, with elements of both A and B residing in each core of the array. In this traffic pattern, the packets containing elements coming from each row of cores are mapped to the corresponding column of cores to compute the corresponding element of the result matrix. This traffic pattern represents the communication pattern of data parallel applications such as matrix multiplication. We also analyze the speedup for this application in the next subsection. We consider different injection rates for this traffic pattern. The latency of different NoC architectures with the matrix multiplication traffic pattern is shown in Fig. 2. It can be observed from the figure that, for all NoC topologies, the latency model closely matches the simulated latencies, with an error of less than 7% before entering the congestion region. One interesting observation from this plot is that after saturation, when the latency increases drastically, the latency obtained from the model is lower than that of the simulation. However, the throughput of all the NoCs given by our model is the same as that of the simulation results. The maximum injection rate that can be sustained by the NoC, or its maximum throughput, can be derived from the latency plots by observing the injection rate at which the latency increases sharply. So, the error in the throughput of our model with respect to simulations is negligible.
B. Case Study: Speedup Analysis with Matrix Multiplication
In this section, we present a case study of multicore speedup with the same matrix multiplication task as described in the previous section. The interaction between the cores depends on the mapping of the application between cores in the multicore system. For this evaluation, we consider the following mapping, computation, and communication among the cores.
• The data of the two matrices A and B are equally distributed among the cores and each core computes the same number of elements of the resultant C matrix (i.e. uniform workload distribution). We consider the cores to run only the matrix multiplication with no other overhead (e.g. no scheduling overhead), thereby making s=0 and p=1.
• To calculate one element of the resultant C matrix, the elements in a row of matrix A are multiplied with the corresponding elements in a column of matrix B. These partial products are then added to determine one element of the resultant matrix. Therefore, calculating one element of the resultant matrix C requires 32 multiplication and 31 addition instructions. Hence, for all the elements of the matrix, the total computation time, TC, will be 32*32*(32 M+31 A) cycles. Here, M and A refer to the time required by the core to finish a multiplication and an addition instruction, respectively. We assume a core microarchitecture in which each multiplication or addition operation does not begin until the previous one has finished.
• To perform this matrix multiplication, each core shares its values of matrices A and B with the other cores in the same row and column. Furthermore, we assume that a core can start its computation as soon as it receives the first packet containing elements of A and B. The rest of the packet latency is masked by the computation. Hence, we assume the communication-dependency coefficient of the speedup model to be 1 in our evaluation.
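The arithmetic in the list above can be checked with a short Python sketch. The function names are hypothetical; the per-instruction cycle counts are parameters, and only the ideal-core values (one cycle per instruction) are used in the example.

```python
def total_compute_cycles(n, mul_cycles, add_cycles):
    """Cycles to compute an n x n matrix product on one core, with strictly
    sequential instructions: n*n elements, each needing n multiplications
    and n - 1 additions."""
    return n * n * (n * mul_cycles + (n - 1) * add_cycles)


def elements_per_core(n, cores):
    """Number of result elements assigned to each core under a uniform split."""
    total_elements = n * n
    if total_elements % cores != 0:
        raise ValueError("result elements do not divide evenly across cores")
    return total_elements // cores


# Ideal-core case: every instruction takes one cycle.
ideal_cycles = total_compute_cycles(32, mul_cycles=1, add_cycles=1)  # 1024 * 63 = 64512
per_core_8x8 = elements_per_core(32, 64)       # 16 result elements per core
per_core_32x32 = elements_per_core(32, 1024)   # 1 result element per core
```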
We consider three different cases for this evaluation: ideal-core, integer (i.e. matrix elements are integer numbers), and denormal (i.e. matrix elements are denormal numbers, NaNs or infinity). In the ideal-core case, we assume a core architecture that can complete any instruction (whether addition or multiplication) within one clock cycle, making M=A=1. For the integer and denormal type matrix elements, the number of cycles corresponding to these operations in the Intel Knights Landing processor is used in our evaluations [22]. Each of these cases has a different value of Tc, with the ideal-core case being the lowest and the denormal case the highest. We also evaluate the speedup for different NoC architectures and system sizes. With varying system size, the problem size assigned to each core also varies. For example, for a system size of 8x8 (64) cores, the problem size is 1024/64=16 at each core, whereas for a system size of 32x32 (1024) cores, the problem size is 1024/1024=1 at each core. Furthermore, to capture the variation in packet injection due to different core architectures (e.g. prefetching mechanism, cache replacement policy), we consider both pre-saturation and post-saturation cases for the speedup evaluation using the framework. Fig. 3 and Fig. 4 show the speedup of the different NoC architectures for the matrix multiplication traffic pattern under the pre-saturation and post-saturation cases, respectively. We analyze these results based on three aspects, presented below:
1) Effect of System Size:
In pre-saturation, the speedup increases with increasing system size. This is shown in Fig. 3 (a-c). From the figures, we can observe that for all NoC architectures the speedup is lowest for the ideal-core case (Fig. 3 (a)) and highest for denormal numbers (Fig. 3 (c)). This is because, in the ideal-core case, the computation time is at its minimum and hence the communication latency plays a vital role in the speedup. This is also evident from Fig. 3 (a), as NoC architectures with higher latencies show lower speedup for a system size of 32x32 (1024) cores. On the other hand, for denormal numbers, the computation time is higher than the packet latency even with increasing system size. As the computation time dominates communication, the effect of the latency becomes insignificant for the speedup evaluation. Fig. 4 (a-c) shows the speedup achieved in the various cases with increasing system size while the NoC is in saturation. For all three cases considered here, the speedup of the system in saturation is lower than when the system is operating in the pre-saturation range. This is because the NoC latency in the post-saturation cases is higher, resulting in a higher parallel execution time. However, in the ideal-core case it is interesting to note that the speedup does not continue to increase monotonically with size. This is because the latency increases significantly with size and overshadows the effect of increased parallelization among the cores. In the case of integer numbers, the number of cycles required for computation increases in proportion to that of the communication latency and hence the computation part starts to dominate the speedup. However, for very large systems with more than 256 cores, the communication latency starts dominating the speedup. Therefore, different NoC architectures have different speedups for large systems. Lastly, for the denormal case, the number of cycles required for computation is very large, which eliminates the artefacts of the NoC topologies, and the speedup is dominated solely by the effect of increased parallelization.
2) Effect of NoC Architecture:
NoC architectures with a lower effective number of hops have lower latencies, as they provide a shorter distance to the destination. Hence, more efficient NoCs result in higher speedups. However, in our case studies the effect of the NoC on the speedup is visible only when the execution time is dominated by the NoC latency. This happens in the ideal-core and integer cases for both pre-saturation and post-saturation operation. As the NoC latency decreases from mesh through folded torus and 3D mesh to the small-world topology, the speedup increases correspondingly, with mesh the lowest and small world the highest.
3) Effect of the C-to-C Ratio:
The C-to-C ratio signifies the volume of computation of a parallel application with respect to its volume of communication. When an application is computation intensive, its speedup is dominated by the computation time. This can be observed from Fig. 3 (c) and Fig. 4 (c), which show the speedup for the denormal case in the pre-saturation and post-saturation scenarios. As the computation time is high, the effect of the packet latency is masked and all the NoC architectures have similar speedup behavior. On the other hand, when an application is communication intensive, the latency governs the speedup of the application. This can be observed from Fig. 3 (a), (b) and Fig. 4 (a), (b), where the pre-saturation and post-saturation speedups of the different NoC architectures are shown for the ideal-core and integer cases. It can be seen from these figures that as the system size increases, the C-to-C ratio decreases due to the increase in packet latency and the decrease in per-core computation time. In such cases, the NoC architecture with the highest latency (e.g. Mesh), which yields the lowest C-to-C ratio, has the lowest speedup among the NoC architectures. Thus, in cases with low C-to-C ratios, the effect of the NoC becomes dominant at large system sizes.
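The dependence of the speedup on the C-to-C ratio can be made explicit with a short derivation. It is only a sketch under an assumed form of the parallel execution time for the case study (s = 0, p = 1), in which the per-core computation time and the scaled packet latency are simply added. The symbols $\alpha$ (communication-dependency coefficient), $T_{comm}$ (packet latency), and $R$ (C-to-C ratio) are notation introduced here for illustration.

```latex
% Assumed form with s = 0, p = 1:  T_p = T_C / N + \alpha T_{comm}
\begin{align}
  R    &= \frac{T_C / N}{\alpha\, T_{comm}}, \\
  S(N) &= \frac{T_C}{T_p}
        = \frac{T_C}{T_C / N + \alpha\, T_{comm}}
        = \frac{N}{1 + 1/R}.
\end{align}
```

Under this assumption, a large $R$ gives a speedup close to $N$ regardless of the NoC, whereas for a small $R$ the speedup approaches $N R = T_C / (\alpha\, T_{comm})$ and is therefore governed by the packet latency.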
V. CONCLUSIONS
Multicore processors are an energy- and power-efficient solution to the increasing power consumption of high performance processing modules, as they do not rely on frequency scaling but exploit parallelism to achieve speedup. In this paper, we propose a framework to model the speedup of a NoC based multicore processor performing parallel tasks. The proposed PaSE framework uses a queuing theory based model to compute the latency of packet transfer in various NoC architectures. For a multicore system with a NoC architecture, this latency model is used to calculate the communication overhead of parallel tasks. Using this overhead, we calculate the achievable speedup of such a system. We find that the speedup depends on a number of factors, such as the system size, the nature of the task, and the computation-to-communication ratio. Interestingly, we find that under certain circumstances, when the system is dominated by communication latency, increasing the number of cores in a system may not necessarily result in higher speedup. Therefore, using our model it is possible to determine an optimal system size for certain applications and then use that as a design guideline for more precise simulation based performance estimates. This model can therefore reduce the design time and effort of such NoC based multicore processors.
