Abstract-Using analytical and simulation results, this paper presents comparative analyses between network on chip and shared-bus AMBA using real application traffic with MPEG-2 [...]

I. INTRODUCTION

Network-on-chip (NoC) is emerging as a viable on-chip communication infrastructure for multi-processor system on chip (MPSoC) [1]. Video streaming is seen as a key feature of future MPSoC and MPEG video decoding is a major component of these systems [2]. To date there has been good progress in developing flexible NoC architectures (such as [3]) and efficient routing algorithms in terms of contention avoidance and energy consumption (such as [4]). For the NoC methodology to gain further maturity, comparative studies between shared and segmented-bus topologies and NoC need to be performed with the aim to identify the benefits and shortcomings of each [...] cost analyses involving area, power, frequency, throughput, latency and energy of NoC and bus-based architectures are presented in [6], [7]. A review of guiding principles towards the evolution of NoC as an emerging SoC architecture is presented in [8]. A comparative evaluation between P2P and NoC with an MPEG-2 video encoder is carried out in [2] considering area, power, data parallelism, MPEG frame rate and scalability. Analytical comparisons involving shared-bus, P2P and NoC architectures were reported in [9], [10] considering power and energy consumption, and overall design effort.

Most of the comparisons reported between NoCs and shared-bus SoCs, as in [7], [6], use analytical and synthetic traffic patterns. The comparisons presented in [11] are based on a real application but do not consider application performance.

Using analytical and simulation results, this paper presents comparative analyses between NoC and shared-bus AMBA for a multi-processor application using real application traffic with an MPEG-2 video decoder in a cycle-accurate, realistic simulation environment. The aim of this comparison is not to demonstrate which bus-based or NoC architecture or mapping performs best; rather, this paper investigates the application-specific performance comparison between popular and generic shared-bus AMBA and NoC MPSoC architectures.

II. SYSTEM DESIGN

A. MPEG-2 Video Decoder Cores

In this work, we have developed an MPEG-2 video decoder that employs five cores. The application partitioning is done in line with [12] and no attempt has been made to optimise the partitioning. A block diagram of the multi-processor MPEG-2 video decoder used in this work is shown in Fig. 1. The input buffer controller (IBC) core stores and forwards the original video bitstream to the variable length decoder (VLD) core, which organises the bitstream into two sequences: header and video sequence. The quantisation matrices and macroblocks (MBs) are sent to the inverse scanner and quantiser (ISQ) core, while header and motion-specific information are sent to the motion compensator (MC) core. The ISQ core sends the DCT coefficients to the inverse discrete cosine transformer (IDCT) core, which transforms them into the actual time-domain format in a lossy manner. The picture-ready blocks from IDCT are sent to the MC core, which forms predictions, and organises and stores the decoded video.

Each IP core has a dedicated local memory of 32768 (1024 x 32) bits, which is large enough to contain the processed DTUs of the previous core until they are processed. The memory is directly interfaced with the input port by memory access controllers. The output port is connected to the processor. Optional control and credit signals are also connected for compatibility. A simplified block diagram of an IP core is shown in Fig. 2 [...]

[...] [14]. Our aim, in this work, is to compare the performance without restricting the architecture to the application itself. Hence, a general purpose architecture for NoCs is preferred. Due to their simplicity, performance and scalability [15], mesh-based topology with deterministic XY [...]

[...] and AMBA, two SystemC simulators were employed. The basic simulation setups for NoC and AMBA used to perform the comparisons are briefly described below:

1) The MPEG cores (Fig. 1) are configured for NoC using NJ and for AMBA using port configurations (Fig. 3) [...]

[...] performance, we define core concurrency and core efficiency, and to understand interconnect performance, we define channel latency and bandwidth. Later, the performance of the application in NoC and AMBA is compared.

1) Concurrency: Concurrency defines the number of cores that are able to execute computation at the same time and is dependent on the way IP cores communicate with each [...]

[...] higher average degree of concurrency at $D_{NoC} = 3.05$, compared to $D_{AMBA} = 0.88$. Due to overlap of core execution, on average $T_A$ for NoC is reduced to approximately 29% of $T_A$ for AMBA. Hence, it is evident that NoC suits MPSoC architectures, where concurrent processing is desirable.

2) Core Efficiency: Core efficiency defines how efficiently the cores can utilise the execution cycles within the application [...]
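The notion of core efficiency described above can be illustrated with a minimal sketch. It assumes that core efficiency is the share of a core's execution cycles that is not spent in non-processing (waiting) states; this formula is an assumption made for illustration and may differ from the exact definition used in the paper. The function name, argument names and sample cycle counts are hypothetical and are not taken from the reported measurements.

```python
def core_efficiency(execution_cycles: int, non_processing_cycles: int) -> float:
    """Return the assumed core efficiency in percent.

    Assumption (not confirmed by the paper): efficiency is the fraction of
    execution cycles that the core spends processing rather than waiting.
    """
    if execution_cycles <= 0:
        raise ValueError("execution_cycles must be positive")
    if not 0 <= non_processing_cycles <= execution_cycles:
        raise ValueError("non_processing_cycles must lie in [0, execution_cycles]")
    return 100.0 * (execution_cycles - non_processing_cycles) / execution_cycles


# Hypothetical cycle counts, chosen only to show the calculation.
never_waits = core_efficiency(execution_cycles=1000, non_processing_cycles=0)
waits_a_quarter = core_efficiency(execution_cycles=1000, non_processing_cycles=250)
print(never_waits, waits_a_quarter)
```

Under this assumed definition, a core that never waits for DTUs reaches 100%, and any cycles spent waiting for DTUs or for bus re-arbitration lower the figure.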
[...] for waiting in 1198074 clock cycles of execution time for bitstream test1.m2v. Both the IBC and VLD cores have the maximum 100% core efficiency in NoC. The cores ISQ, IDCT and MC in NoC have non-processing times due to waiting for DTUs to arrive for processing, as ISQ receives DTUs from VLD, IDCT receives DTUs from ISQ and MC receives DTUs from VLD and IDCT. The average core efficiencies of ISQ, IDCT and MC are found by using execution times and non-processing times obtained from Table III and using Equation 4 as 73.14%, 70.55% and 80.43%, respectively. On the other hand, due to shared interconnect access, re-arbitration times make up a major component of the non-processing times of all the cores in AMBA and hence, the application times increase [...]

[Table III: execution time ($T_E$) and non-processing time ($T_{NP}$) of the IBC, VLD, ISQ, IDCT and MC cores, listed per bitstream and architecture]

[...] where $T^{S}_{o-in}(n)$ is the time elapsed for data to travel from source output port to source interconnect port, $T^{S-D}(n)$ is the time elapsed for data to travel from source interconnect port to destination interconnect port and $T^{D}_{in}(n)$ is the time elapsed for data to travel from destination port to the destination memory, all for the $n$-th DTU out of a total of $N$ DTUs.

For AMBA, $T^{S}_{o-in}(n) = 1$ clock cycle after bus access is granted and locked. During $T^{S-D}(n) = 1$ clock cycle the arbiter does the necessary routing of the data and notifies the slave port. Due to the direct memory interface, $T^{D}_{in}(n) = 0$ clock cycles. Minimum channel latency (without waiting states) [...]

[...] involves communication over an array of switches for each [...] links [...] $T^{m}_{ir}(n)$ is 1 clock cycle. Also, the time required for the routing decision on the $m$-th switch for the $n$-th packet, $T^{m}_{r}(n)$, is 1 clock cycle. The $n$-th packet travels from the router to the output channel of the $m$-th switch immediately in our implementation and hence $T^{m}_{oc}(n) = 0$ clock cycles. Finally, the time required for the $n$-th packet to travel from the output channel of the $m$-th switch to the input channel of the $(m+1)$-th switch, $T^{m-(m+1)}(n)$, is [...]

[...] [6], the maximum available bandwidth for any node in any architecture is given by [...]

In practice, the actual switching frequency will also depend on capacitive loading. According to [7], due to capacitive loading and global wire lengths in AMBA, considering $f_{NoC} = 3 \times f_{AMBA}$, the bandwidth definitions in Equations 8 and 9 give NoC a bandwidth 2.428 times higher than that of AMBA.

[...] obtained from Table I, the average per-MB decoding times can be found and are shown in Table IV.

[...] (Sections III-B1 and III-B2), AMBA has large non-processing time and hence a higher application time $T_{AMBA}$ (on average 2.46 times higher than $T_{NoC}$).

2) Operating Clock Frequency: Clock frequency is an important parameter as processors with high clock rates dissipate power proportional to operating frequency [7]. The operating clock frequency required to give standard frame rate for the video bitstreams shown in Table I, [...]

[...] (Fig. 5). Despite higher channel latency, NoCs have higher core efficiency, concurrency and bandwidth advantage over AMBA and can operate at a lower frequency (approximately 29%) than AMBA for the same decoding bitstream (Section III-C). Our comparisons focus on performance aspects between NoC and AMBA using a real application, while supporting the comparisons involving power, area and scalability in [9]. It is hoped that the findings in this paper would contribute towards the current research efforts in identifying appropriate on-chip communication architecture for emerging [...]
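The channel latency components quoted in Section III can be added up in a small sketch. The AMBA values (1, 1 and 0 clock cycles) and the first three per-switch NoC values (1, 1 and 0 clock cycles) are the ones stated in the text; the switch-to-switch link time is not given in the text, so it is left as a required argument. All function and parameter names are hypothetical, and the first per-switch component is labelled only as an input stage because its full definition is not available.

```python
def amba_min_channel_latency(source_to_bus: int = 1,
                             bus_transfer: int = 1,
                             dest_to_memory: int = 0) -> int:
    """Sum the three AMBA latency components for one DTU, in clock cycles.

    Defaults follow the values quoted in the text and exclude wait states
    (arbitration and re-arbitration delays).
    """
    return source_to_bus + bus_transfer + dest_to_memory


def noc_per_switch_latency(link_cycles: int,
                           input_stage: int = 1,
                           routing_decision: int = 1,
                           router_to_output: int = 0) -> int:
    """Sum the per-switch NoC latency components for one packet, in clock cycles.

    link_cycles is the output-channel-to-next-input-channel time; its value
    is not stated in the text, so the caller must supply it.
    """
    return input_stage + routing_decision + router_to_output + link_cycles


print(amba_min_channel_latency())             # components 1 + 1 + 0
print(noc_per_switch_latency(link_cycles=1))  # link value of 1 is hypothetical
```

The sketch covers a single DTU on AMBA and a single switch traversal on NoC. A full NoC channel latency would also depend on the number of switches on the route and on the network interface delays, which are not modelled here.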
