In this paper, we present a reliability analysis and comparison of two on-chip communication architectures, the dominant shared-bus AMBA and the emerging network-on-chip (NoC), in the presence of single-event upsets (SEUs), using an MPEG-2 video decoder as a case study. Employing SystemC-based fault simulations, the reliability of the decoders is studied in terms of the SEUs experienced in the computation cores and communication interconnects. We show that, for a given soft error rate (SER), the NoC-based decoder experiences fewer SEUs than the AMBA-based decoder. Using peak signal-to-noise ratio (PSNR) and frame error ratio (FER) metrics to evaluate the impact of SEUs at application-level, we show that the NoC-based decoder gives up to 4 dB higher PSNR, while the AMBA-based decoder experiences up to 3% lower FER. Furthermore, we investigate the impact of routing, application task mapping (distribution of tasks among computation cores) and architecture allocation (choice of number of computation cores) on the reliability of the decoders in the presence of SEUs.
consumption [1, 2, 3]. Shared-bus architectures, such as the advanced microprocessor bus architecture (AMBA), are the dominant, industry-standard on-chip communication architecture [4]. To address the performance and scalability issues in the design of future MPSoCs, network-on-chip (NoC) has emerged as an on-chip communication architecture [5]. Over the years, researchers have proposed a number of flexible NoC architectures with efficient communication techniques. For example, the AETHEREAL NoC architecture with guaranteed communication services has been proposed in [6], and the NOSTRUM NoC architecture with a layered communication approach has been presented in [7].
Among other developments, a mesh-based Intel 80-core NoC architecture with a clock frequency higher than 4 GHz has recently been proposed in [8].
An emerging challenge in MPSoC design is reliability in the presence of different faults. These faults can generally be classified into two types: permanent and transient. Permanent faults are related to irreversible physical defects in the circuit, which are produced during the manufacturing process. Transient faults, also known as soft errors, take place when a single ionising radiation event produces a burst of hole-electron pairs in a transistor that is large enough to cause the circuit to change state. The single-event upset (SEU) is the most widely used transient fault model in the study of reliability [9], and susceptibility to SEUs is exacerbated by scaling and low-power design techniques [10, 11].
To mitigate the impact of soft errors, a number of studies have proposed fault-tolerant on-chip communication architectures and techniques for MPSoCs. For example, an investigation into the reliability of different NoC architectures has been reported in [10]. Based on the investigation, effective fault tolerance techniques have been proposed for different NoC configurations to operate in the presence of soft errors. Another reliability analysis of on-chip communication architectures from the performance, reliability and energy perspectives has been carried out in [12]. Using this analysis, an array of fault tolerance techniques has been introduced at the architectural and algorithmic levels to tackle the reliability issues of communication components. In [13], a fault-tolerant design of interconnects in on-chip communication architectures has been considered, explaining the conflicting design trade-offs between reliability and performance. The impact of power minimization on reliability has been examined in [14], showing effective power-aware fault tolerance design techniques for on-chip communication architectures. Several other techniques, such as stochastic communication [15] and routing [16], have also been proposed to incorporate fault tolerance in on-chip communication architectures. Although good progress has been made in the development of fault-tolerant architectures and techniques, there is currently a lack of analysis of how the on-chip communication architecture affects the reliability of MPSoCs in the presence of soft errors. For the NoC methodology to gain further maturity, such an analysis of reliability needs to be performed, highlighting the comparison between the dominant shared-bus AMBA and NoC, which is the main aim of this paper. To the best of our knowledge, no such study has yet been reported.
In this paper, using cycle-accurate SystemC-based simulations, we investigate the number of SEUs experienced in the computation cores and communication interconnects of shared-bus AMBA and NoC, employing real application traffic of an MPEG-2 video decoder. We evaluate the number of SEUs experienced for a given soft error rate (SER) and show the impact of the SEUs experienced at application-level. Furthermore, we investigate the impact of routing, application task mapping (distribution of application tasks among processing cores) and architecture allocation (choice of number of processing cores) on the reliability of the AMBA- and NoC-based decoders. The rest of the paper is organized as follows. Section 2 describes the application, architecture and fault injection models used in this work. Section 3 compares AMBA- and NoC-based decoders in terms of the SEUs experienced in the computation cores and communication interconnects, and evaluates the impact of SEUs at application-level. Section 4 demonstrates the impact of application task mapping and architecture allocation on the reliability of the decoders. Section 5 summarizes the comparisons. Finally, Section 6 concludes the paper.
System Model
In this section, MPEG-2 video decoder-based application model and MPSoC architectures employing the decoder cores (with AMBA and NoC on-chip communication) are described. Also, the fault injection model used to evaluate reliability of the MPSoC decoders in the presence of soft errors is explained.
Application Model: MPEG-2 Video Decoder
The MPEG-2 video decoder constitutes a major component of MPSoC applications and is chosen as an application case study. Figure 1 [...] core, while part of the header sequence is sent to the motion compensator (MC) core. The scanned and quantized video blocks are transformed into time-domain picture-ready video blocks by the inverse discrete cosine transformer (IDCT) core. Using these picture-ready video blocks, the MC core forms inter- and intra-frame predictions and stores or displays the decoded frames (Figure 1(a)). [...] and handshake signals (busy in, busy out, request in and request out) for enabling communication to/from the processing core (Figure 1(b)). The MPEG-2 video decoder is capable of decoding video bitstreams with different rates and sizes. Table 1 shows four video bitstreams with different resolutions and sizes, which are used for comparisons in Section 3.
Shared-bus AMBA
Shared-bus AMBA employs a central multiplexor scheme, called a bus, which controls the access and direction of on-chip communication. Using such a scheme, all masters (e.g. processing elements) in an MPSoC are required to be granted mutually exclusive access to the bus by an arbiter to [...] [17]. The advanced high-performance bus (AHB) is used as the shared-bus AMBA in this work due to its high performance [4]. A single-layer central multiplexor configuration is used, in which the video decoder cores (Figure 1(b)) are configured by using the 32-bit input port as the slave port (for the memory interface) and the 32-bit output port as the master port (for the PE interface), as shown in Figure 2.
As a result, each core can process data from internal memory, initiate a write operation through its master interface when access to the bus is available, and write data to the slave interface that is connected [...]
Network-on-Chip
Network-on-Chip (NoC) incorporates packet-based on-chip communication with links laid out in different directions, while packet routing and communication are controlled by a switch. NoC gives a large design space with different routing techniques, switch architectures and network topologies [10]. In this work, we use a mesh-based NoC topology with deterministic XY routing and single-flit-packet wormhole communication due to the simplicity of switch design, performance and scalability [18]. The impact of using different routing algorithms in the switch is investigated in Section [...] Table 2.
As can be seen, the packet header consists of the packet ID, source and destination IDs, routing and virtual channel information and credit signals. The payload contains the actual computation data.
With this packet structure, the size of each NoC packet is (32 + 46) = 78 bits.
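As an illustration of the deterministic XY routing used in this work, the following minimal Python sketch routes a packet on a 2D mesh by first correcting the X coordinate and then the Y coordinate. The function names `xy_route` and `hop_count` and the (x, y) switch coordinates are illustrative assumptions and are not taken from the NoC implementation used in the paper.

```python
def xy_route(src, dst):
    """Return the list of mesh switch coordinates visited from src to dst.

    Deterministic XY routing: move along X until the column matches,
    then move along Y. Coordinates are hypothetical (x, y) tuples.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of switch-to-switch link traversals on the XY path."""
    return len(xy_route(src, dst)) - 1
```

Because the path depends only on the source and destination coordinates, the same pair of cores always uses the same switches, which is what makes the scheme deterministic.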
Fault Injection Model
In this work, fault injection is carried out using an SEU-based fault model employing the technique proposed in [20]. The injection of SEUs using this simulator is initiated through replacement of SEUs based on the specified soft error rates and probability distribution to identify fault locations within the fault locations database. Figure 5 shows the fault injection setup employing the fault injection simulator used for the MPEG-2 decoder with four processing cores (Figure 1 [...]
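To make the idea of SER-driven SEU injection concrete, the sketch below selects fault locations by upsetting every register bit independently in every clock cycle with probability equal to the SER, and models an upset as a single bit flip. This is only a simplified illustration under that independence assumption; it is not the simulator of [20], and the names `inject_seus`, `flip_bit`, `num_bits` and `num_cycles` are hypothetical.

```python
import random


def inject_seus(num_bits, num_cycles, ser, seed=0):
    """Return a list of (cycle, bit_index) fault locations.

    Assumes each bit is upset independently in each cycle with
    probability `ser` (SEUs per bit per cycle).
    """
    rng = random.Random(seed)
    faults = []
    for cycle in range(num_cycles):
        for bit in range(num_bits):
            if rng.random() < ser:
                faults.append((cycle, bit))
    return faults


def flip_bit(word, bit_index):
    """Model a single-event upset as a flip of one bit of a register word."""
    return word ^ (1 << bit_index)
```

For realistic SER values the double loop is far too slow; a practical injector would instead sample the number of faults first and then draw their locations, but the loop keeps the assumed fault model explicit.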
Comparative Reliability Analysis
Reliability of an application against SEUs is related to the total number of SEUs experienced over a given time [21] . Our aim in this work is to analyze how the reliability of MPEG-2 video decoder is affected by the choice of on-chip communication architectures: AMBA and NoC. To this end, the following investigations are carried out:
• evaluate the number of SEUs experienced during computation, $F_{comp}$, to show how MPEG-2 decoder computation is affected,
• evaluate the number of SEUs experienced during communication, $F_{comm}$, in the MPEG-2 decoder to show how on-chip communication is affected, and
• evaluate the impact of the total SEUs experienced, $F = F_{comp} + F_{comm}$, at application-level to demonstrate how decoder reliability is affected.

In the following (Sections 3.1 and 3.2), $F_{comp}$ and $F_{comm}$ of AMBA- and NoC-based decoders are evaluated and compared. Later (in Section 3.3), the impact of $F$ is evaluated at application-level.
SEUs Experienced During Computation
SEUs affect the computation of a processing core through perturbation of the registers. [...] to affect computation process. Hence, for a given soft error rate (SER), the effective number of SEUs experienced during computation ($F_{comp}$) can be given as the number of SEUs experienced by the computation cycles (in instances 1, 2 and 4) during execution of a processing core. The $F_{comp}$ of an MPSoC decoder with $C$ processing cores can be given as
$$ F_{comp} = \lambda \sum_{i=1}^{C} \left( T_i - T_i^{I-I} \right) R_i \qquad (1) $$
where $\lambda$ is the SER (in SEUs per bit per cycle), $T_i$ is the execution time (in clock cycles), $T_i^{I-I}$ is the number of idle-to-idle transitions within $T_i$ (in clock cycles) and $R_i$ is the register usage (in bits per cycle), all for the $i$-th processing core. $R_i$ gives a measure of per-core register usage by the application, since SEUs in other registers have no impact [21]. $R_i$ is given by [20] as
$$ R_i = \frac{1}{T_i} \sum_{t=1}^{T_i} R_{i,t} \qquad (2) $$
where $R_{i,t}$ is the instantaneous number of registers (in bits) used by the MPEG computation process at the $t$-th clock cycle in the $i$-th processing core. Table 3 [...] Higher $T_i$ results in a higher number of SEUs experienced during computation ($F_{comp}$) in the AMBA-based decoder compared to the NoC-based decoder for decoding different video bitstreams ( [...] , and average register usages, $R_i$, of processing cores in AMBA- and NoC-based decoders as shown in Figure 7. The $F_{comp}$ values are found from simulations using an arbitrary SER of $10^{-9}$ SEUs/bit/cycle in the simulated fault injection environment (Section 2.4). The approximate $F_{comp}$ values can also be validated through (1) with the $T_i$, $T_i^{I-I}$ and $R_i$ values from [...] The AMBA-based decoder experiences approximately 83% higher $F_{comp}$ on average compared to NoC for decoding different video bitstreams. As a result of higher $F_{comp}$, MPEG-2 decoder computation is expected to be affected more in the AMBA-based decoder than in the NoC-based decoder. In Section 3.3 the impact of the SEUs experienced is examined at application-level.
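The dependence of $F_{comp}$ on execution time and register usage can be illustrated with a short calculation. The sketch assumes that (1) has the product form λ Σ_i (T_i − T_i^{I−I}) R_i, which is consistent with the symbol definitions in this section, and that R_i is the mean of the per-cycle register usage. The function names are hypothetical, and any inputs passed to them are placeholders rather than results from the paper.

```python
def average_register_usage(per_cycle_bits):
    """Mean register usage R_i (bits per cycle) from per-cycle samples R_{i,t}."""
    return sum(per_cycle_bits) / len(per_cycle_bits)


def f_comp(ser, cores):
    """Expected number of SEUs experienced during computation.

    `cores` is a list of (T_i, T_idle_i, R_i) tuples: execution cycles,
    idle-to-idle cycles and average register usage in bits per cycle.
    Assumed form: ser * sum((T_i - T_idle_i) * R_i).
    """
    return ser * sum((t - t_idle) * r for t, t_idle, r in cores)
```

Under this form, a core that spends more non-idle cycles executing with the same register usage accumulates proportionally more SEUs, which matches the trend described for the AMBA-based decoder.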
SEUs Experienced During Communication
An important aspect in the reliability of on-chip communication architectures is the number of SEUs experienced during inter-core data communication, as these SEUs perturb the registers in the interconnects and affect the data transfer [22].
where $M$ is the number of inter-core communication links in the decoder ($M = 4$, Figure 1 [...] For a given link, $L_{ch}$ can be expressed as
where $\tau^{S}_{c-in}(n)$ is the time elapsed for a DTU to travel from the source output port to the source interconnect port, $\tau^{S-D}_{in-in}(n)$ is the time elapsed for a DTU to travel from the source interconnect port to the destination interconnect port and $\tau^{D}_{in-c}(n)$ is the time elapsed for a DTU to travel from the destination interconnect port to the destination core memory, all for the $n$-th DTU out of a total of $N$ DTUs. For AMBA, $\tau^{S}_{c-in}(n) = 1$ clock cycle after bus access is granted and locked. During $\tau$ [...]
Equation (5) is a result of multi-hop NoC packet communication through $K$ intermediate switches and involves the following delays. The time required for the $n$-th packet to travel from the input channel to the router of the $k$-th switch, $\tau^{k}_{ic-r}(n)$, is 1 clock cycle for the NoC switch design (Figure 3(b)). Also, the time required for the routing decision on the $k$-th switch for the $n$-th packet, $\tau^{k}_{r}(n)$, is 1 clock cycle. The $n$-th packet travels from the router to the output channel of the $k$-th switch immediately in the NoC implementation and hence $\tau^{k}_{r-oc}(n) = 0$ clock cycles. Finally, the time required for the $n$-th packet to travel from the output channel of the $k$-th switch to the input channel of the $(k+1)$-th switch, $\tau^{k-(k+1)}_{oc-ic}(n)$, is 1 clock cycle. Using (4) and (5), NoC has a minimum channel latency ($L_{ch}$) of 9 clock cycles (with $K = 2$ for shortest path mapping and XY routing, [...]
The average register usage of communication components during transfer of a DTU, $R^{com}_j$ in (3), sets up another difference between AMBA- and NoC-based decoders. $R^{com}_j$ can be given by dividing the total register usage during inter-core transfer of DTUs by the number of DTUs, i.e.
where $R_{n,l}$ is the instantaneous register usage on the $j$-th link during inter-core communication of the $n$-th DTU at the $l$-th clock cycle ($l = 1, \ldots, L_{ch}^{j}$). For the NoC-based decoder, $R_{n,l}$ in (6) includes the registers used in packet overheads and buffers in NI interfaces, channels, VCs and routers as a packet is communicated between cores. For the AMBA-based decoder, $R_{n,l}$ includes the registers used in the address (HADDR), control signals (RD and WR), decoder and arbiter as a DTU is communicated between cores. Using (6), $R^{com}_j$ in the NoC-based decoder (Figure 3(a)), obtained from simulation logs, is approximately 212 bits per data transfer cycle (using XY packet routing) and that in the AMBA-based decoder is approximately 87 bits per transfer cycle. The higher $R^{com}_j$ of NoC is expected, as NoC incorporates packet-based multi-hop routing and buffering with a complex switch structure.
Note that $R^{com}_j$ of NoC depends on the packet routing algorithm, as the underlying routing algorithm determines the switch design complexity and the associated register usage [24]. For example, using a source-based routing algorithm gives an $R^{com}_j$ value of 187 bits per cycle, while using the odd-even routing algorithm results in an $R^{com}_j$ value of 273 bits per cycle, as opposed to 212 bits per cycle for XY routing.
Table 4 shows the number of DTUs, $N_i$ ($N_1$ for the VLD-MC link, $N_2$ for the VLD-ISQ link, $N_3$ for the ISQ-IDCT link, and $N_4$ for the IDCT-MC link, Figure 1(a)), recorded from simulation logs. Note that the $N$ values do not change between AMBA- and NoC-based decoders for a given video bitstream due to the similar architecture of the processing cores (Figure 1(a)). For decoding a given video bitstream, $N$ is lowest from core VLD to core ISQ. As the video decoding progresses through the other cores, $N$ between cores increases due to decompression of the original video bitstream. For example, only $N = 66 \times 10^3$ DTUs are transferred from core VLD to core ISQ, while $N = 202 \times 10^3$ DTUs are transferred from core IDCT to core MC for decoding test1.m2v (row 2, Table 4). For increased video sizes, $N$ also increases for a given link. For example, $108 \times 10^3$ DTUs are transferred from core ISQ to core IDCT for decoding test1.m2v compared to $364 \times 10^3$ DTUs on the same link for decoding test2.m2v (column 4, Table 4).
Figure 8 shows the comparative $F_{comm}$ of AMBA- and NoC-based decoders obtained from simulation logs for an arbitrary SER of $10^{-9}$, while decoding different video bitstreams (Table 1). Approximate [...]
To demonstrate the impact of the choice of NoC packet routing algorithm on $F_{comm}$, Figure 9 shows the $F_{comm}$ values for different packet routing algorithms: source-based, XY and odd-even routing, implemented on NIRGAM [19]. The $F_{comm}$ values are found with an SER of $10^{-9}$, while decoding the video bitstream test4.m2v (Table 1).
The approximate values of $F_{comm}$ can be found through (3) using the $L_{ch}$ and $R^{com}_j$ values of AMBA- and NoC-based decoders. As can be seen, using source-based packet routing in NoC switches gives the fewest SEUs experienced during communication ($F_{comm}$), while the odd-even routing algorithm gives the highest $F_{comm}$. This is because, due to source-initiated routing information inserted in the packets, source-based routing gives the lowest register usage of 187 bits per cycle and a simpler switch design. On the other hand, odd-even routing implements an adaptive packet routing strategy with a control mechanism to [...] per cycle) than odd-even due to the deterministic nature of its choice of routing directions [25]. As expected, as more switches are traversed by NoC packets using these routing algorithms, the $F_{comm}$ values also increase linearly.
Comparing the $F_{comp}$ values (Figure 7) and $F_{comm}$ values (Figure 8) of AMBA- and NoC-based decoders while decoding a given video bitstream, it can be seen that $F_{comm} \ll F_{comp}$. Nevertheless, $F_{comm}$ affects the reliability of on-chip communication, as it leads to faults resulting in misrouting or loss of DTUs [22]. The loss or misrouting of DTUs causes the decoding process to be terminated or to skip a number of video blocks or frames while decoding [26]. Next, the impact of the overall SEUs experienced ($F$) is evaluated at application-level.
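The per-switch delays listed for the NoC can be accumulated as sketched below. The sketch assumes that the source-to-destination interconnect delay is the sum of the four per-switch delays over K switches, and that the expected number of communication SEUs on one link scales as λ · N · L_ch · R^com. Both forms are assumptions consistent with the definitions in this section rather than reproductions of (3) to (5), and the function names are hypothetical.

```python
def noc_interconnect_delay(num_switches, t_ic_r=1, t_r=1, t_r_oc=0, t_oc_ic=1):
    """Cycles spent traversing `num_switches` switches.

    Per-switch delays default to the values stated in the text:
    input channel to router 1, routing decision 1, router to output
    channel 0, output channel to next input channel 1.
    """
    return num_switches * (t_ic_r + t_r + t_r_oc + t_oc_ic)


def f_comm_link(ser, num_dtus, channel_latency, reg_usage):
    """Assumed per-link estimate: ser * N * L_ch * R_com."""
    return ser * num_dtus * channel_latency * reg_usage
```

With K = 2 the first function returns 6 cycles; under the assumed sum, the remainder of the 9-cycle minimum channel latency would be attributable to the source and destination interface delays.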
Impact of SEUs at Application-Level
In Sections 3.1 and 3.2, the reliability of AMBA- and NoC-based decoders was investigated in terms of the SEUs experienced during computation ($F_{comp}$) and communication ($F_{comm}$). With the $F_{comp}$ and $F_{comm}$ values from (1) and (3), the total number of SEUs experienced, $F$, is given as
$$ F = F_{comp} + F_{comm} \qquad (7) $$
In this section, the impact of the injected SEUs, $F$, given by (7), is evaluated at application-level. Such an evaluation has also been used in [21], showing that faults at architectural-level do not always lead to faults at application-level, which enables low-cost fault tolerance mechanisms. We evaluate the impact of $F$ on decoder reliability using the peak signal-to-noise ratio (PSNR) metric (as also used by [21]). PSNR is defined as
where $P$ is the number of frames, each with $Q$ pixels, and $x_{p,q}$ and $y_{p,q}$ are the $q$-th pixels in the $p$-th reference and decoded frames. Note that in the presence of SEUs, PSNR (given by (8)) is degraded due to alterations in the computation registers containing the $y_{p,q}$ values. As a result, the SEUs experienced during computation ($F_{comp}$) have a direct impact on the PSNR. However, due to normalization with decoded frames and pixels, PSNR does not reflect temporal fidelity in the event of loss of frames [26].
To evaluate fidelity in the event of frame losses, we use the frame error ratio (FER) metric, defined as
$$ FER = \frac{x}{P} \qquad (9) $$
where $x$ is the number of lost frames out of $P$ frames. As expected, the NoC-based decoder outperforms the AMBA-based decoder with up to 4 dB higher PSNR (Figure 10(a)). This is because the NoC-based decoder experiences lower $F_{comp}$ than the AMBA-based decoder (Section 3.1). However, since PSNR does not reflect the fidelity of video blocks due to perturbation of registers by $F_{comm}$ (and also since the number of intermediate switches does not affect $F_{comp}$, given by (1)), the NoC-based decoder shows similar PSNRs for all configurations.
Comparing the FER values in Figure 10, it can be seen that the AMBA-based decoder gives 3% lower [...] The FER values of the NoC-based decoder in Figure 10(b) are obtained using the XY packet routing algorithm. Figure 11 demonstrates the impact of the choice of routing algorithm on the FER of the NoC-based decoder (Figure 3(a)), while decoding the video bitstream test4.m2v (Table 1). Three different packet routing algorithms are used: source-based, XY and odd-even. FER values are obtained through (9) from decoded video frames in the SystemC fault injection environment with an SER of $10^{-9}$. As expected, using the source-based packet routing algorithm gives the lowest FER among the routing algorithms due to the lowest $F_{comm}$ in the NoC-based decoder (Section 3.2).
Employing the XY or odd-even routing algorithm gives higher FER in the decoder due to the higher $F_{comm}$ (Section 3.2). It can be seen that with an increasing number of intermediate switches between communicating cores, the FER of the NoC-based decoder increases almost linearly due to increased $F_{comm}$, given by (3).
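Both application-level metrics can be computed from decoded frames as sketched below. The PSNR function uses the common mean-squared-error definition with an assumed 8-bit peak value of 255, which may differ in detail from (8); FER is taken as the fraction of lost frames. Frames are represented as flat lists of pixel values, and all names are illustrative.

```python
import math


def psnr(reference_frames, decoded_frames, peak=255.0):
    """PSNR in dB over P frames of Q pixels each (assumed MSE-based form)."""
    squared_error = 0.0
    num_pixels = 0
    for ref, dec in zip(reference_frames, decoded_frames):
        for x, y in zip(ref, dec):
            squared_error += (x - y) ** 2
            num_pixels += 1
    if squared_error == 0:
        return math.inf  # identical frames: no noise to measure
    mse = squared_error / num_pixels
    return 10.0 * math.log10(peak ** 2 / mse)


def frame_error_ratio(lost_frames, total_frames):
    """FER as the fraction of lost frames out of the total number of frames."""
    return lost_frames / total_frames
```

Because the squared error is averaged only over the frames that are actually compared, a dropped frame does not lower this PSNR, which is why a separate frame-level metric such as FER is needed.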
Impact of Application Task Mapping and Architecture Allocation
The impact of application task mapping and architecture allocation on system performance in the context of HW/SW co-design has been studied extensively [27]. In this section, the impact of application task mapping and architecture allocation on the reliability of on-chip communication architectures is investigated.

M1 (mapping employed in Figure 1(a))
    Core 1    t1, t2, t3, t4
    Core 2    t5, t6
    Core 3    t7, t8
    Core 4    t9, t10, t11
M2 (optimized for reduced register usage)
    Core 1    [...]
    Core 2    t4, t5
    Core 3    t6, t7, t8, t9, t10
    Core 4    t11
M3 (optimized for parallelism)
    Core 1    t1, t2, t3, t4, t9
    Core 2    t5, t6, t7
    Core 3    t8
    Core 4    t10, t11
M4 (optimized for reduced register usage & parallelism)
    Core 1    t1, t2, t3, t4, t5, t6
    Core 2    t7, t8
    Core 3    t9
    Core 4    t9, t10, t11
Table 5: Four application task mappings of MPSoC decoder using four processing cores (Figure 1)

Numerous mapping combinations are possible for decoder design using the task graph (Figure 12). Table 5 shows four different task mappings of the decoder with the mapped tasks on each processing core. Mapping M1 (row 2) is the mapping employed in Figure 1(a), mapping M2 (row 3) is optimized for reduced register usage, mapping M3 (row 4) is optimized for high parallelism and finally, mapping M4 (row 5) is jointly optimized for reduced register usage and high parallelism. The task mappings M2, M3 and M4 in Table 5 are found through simulated annealing using the group-migration based task movement proposed in [28]. As can be seen, mapping M2 localizes most of the tasks (for example, tasks $t_1$-$t_8$ are mapped in core 1) to achieve low overall register usage ($R = \sum_i R_i$), while mapping M3 distributes the tasks among processing cores to optimize for high parallelism. Mapping M4 achieves reduced register usage and high parallelism by carefully distributing the tasks among cores (for example, the related tasks $t_7$ and $t_8$, which share IDCT parameters and video blocks between them, are mapped in core 2).
Figure 13(a) and (b) show the register usages ($R$) and multiprocessor execution times ($T_M$) obtained from SystemC cycle-accurate simulations for the AMBA- and NoC-based decoder designs with the task mappings (Table 5). As expected, mapping M2 gives the lowest $R$ for AMBA- and NoC-based decoders due to optimization for reduced register usage (Figure 13(a)). However, the low $R$ in mapping M2 is obtained at the expense of the highest $T_M$, caused by localization of application tasks (Figure 13(b)). Mapping M3 gives the lowest $T_M$ due to high parallelism among the processing cores for both decoders. Since such low $T_M$ is achieved through distribution of tasks among cores to give higher parallelism, shared register resources among these tasks are duplicated in processing cores. As a result, mapping M3 gives the highest $R$. Mapping M4 offers a good trade-off between $R$ and $T_M$. It can be seen that the AMBA-based decoder has lower $R$ compared to the NoC-based decoder due to contention of registers over the idle period during bus arbitration (Section 3.1). As expected, $T_M$ is high for the AMBA-based decoder due to shared access to the bus and hence lower concurrency among processing cores [3] (Figure 13(b)).
To demonstrate the impact of application task mapping on reliability, [...] A similar trend is also observed while decoding other video bitstreams (rows 3-5, Table 6). The higher $F_{comp}$ in mapping M2 results from the localization of tasks on a processing core, which is used to reduce register usage ($R$). Such localization causes high multiprocessor execution time ($T_M$) and leads to high $F_{comp}$, given by (1). Mapping M3 also experiences higher $F_{comp}$ than mapping M4 due to increased register usage ($R$) through duplication of shared registers. Due to joint optimization for reduced register usage ($R$) and high parallelism, mapping M4 provides the lowest $F_{comp}$. Note that $F_{comm}$ does not vary for different task mappings, as the total number of DTUs communicated among processing cores, $N = \sum_i N_i$, does not vary significantly while decoding a given video bitstream.
Figure 14 shows the impact of $F$ (given by (7)) at application-level in terms of the PSNRs and FERs of AMBA- and NoC-based decoders, while decoding test4.m2v. The PSNR and FER values were obtained using (8) and (9) from the decoded videos using an SER of $10^{-9}$ in the SystemC fault injection environment (Section 2.4). As expected, mapping M2 gives the lowest PSNR (79 dB and 85 dB for AMBA- and NoC-based decoders) due to the highest $F_{comp}$ (Figure 14(a)). Due to lower $F_{comp}$, mapping M3 gives up to 7 dB higher PSNR compared to mapping M2. Mapping M4 gives the best PSNR (91 dB for the AMBA-based decoder and 95 dB for the NoC-based decoder) when compared to the other three mappings due to the lowest $F_{comp}$. From Figure 14, it can be seen that, despite similar $F_{comm}$ values for different mappings (Table 6), the FER is higher (7.5%) for mapping M2 due to incorrect computation of video parameters with high $F_{comp}$. With low $F_{comp}$, mapping M4 gives the lowest FER (6.2%) among all task mappings.
Architecture Allocation
Architecture allocation is a system-level design step for MPSoCs that deals with the allocation of processing elements and their interconnects into the architecture [29]. In this section, we refer to architecture allocation as the allocation of the number of computation cores in the MPSoC decoder (Figure 1). To investigate the impact of architecture allocation on reliability, different numbers of allocated cores were simulated using mapping M4 (Section 4.1). Table 7 [...] (Table 7). As expected, the NoC-based decoder experiences fewer SEUs during computation ($F_{comp}$) than the AMBA-based decoder, while the AMBA-based decoder experiences fewer SEUs during communication ($F_{comm}$) for all architecture allocations (Sections 3.1 and 3.2). It can be seen that both AMBA- and NoC-based decoders experience higher $F_{comp}$ as the number of allocated cores increases. This is because, with a higher number of allocated cores, the overall register usage ($R = \sum_i R_i$) increases due to duplication of shared resources, resulting in higher $F_{comp}$, given by (1). Also, with increased architecture allocation [...] To observe the impact of the total number of SEUs experienced at application-level, Figure 15 shows [...] (Figure 15(a)) due to the lower number of SEUs experienced during computation, $F_{comp}$ (Section 3.1).
Due to increased $F_{comp}$ for an increasing number of allocated cores, architectures with a higher number of cores give poorer PSNRs for AMBA- and NoC-based decoders. For example, PSNR decreases from 99 dB for the architecture with 2 cores to 84 dB for the architecture with 5 cores for the AMBA-based decoder (Figure 15(a)). As expected, a decoder architecture with a higher number of cores gives higher FER due to increased $F_{comm}$ (Table 7). For example, FER increases from 2% for the architecture with 2 cores to 4% in the case of NoC size 2×2 (4 cores) and 4.5% for the architecture with 5 cores for the AMBA-based decoder (Figure 15 [...]
Summary of Comparisons
From the comparative analysis (Sections 3 and 4) the following observations are made:
1. For a given architecture allocation and soft error rate (SER), the AMBA-based decoder experiences a higher number of SEUs during computation than the NoC-based decoder. This is because the AMBA-based decoder has a higher execution time than the NoC-based decoder due to shared bus access in AMBA (Section 3.1).
NoC-based decoder experiences higher

Conclusions
Using an MPEG-2 video decoder as a case study in a simulated fault injection environment, we have presented a comparative reliability analysis between shared-bus AMBA and NoC. We have shown that the AMBA-based decoder experiences more SEUs during computation than the NoC-based decoder due to its higher execution time (Section 3.1). We have also shown that the NoC-based decoder experiences more SEUs during inter-core communication than the AMBA-based decoder due to higher channel latency and register usage in the communication interconnects (Section 3.2). Considering the impact of SEUs at application-level, we have shown that the NoC-based decoder is more error resilient (in terms of peak signal-to-noise ratio) than the AMBA-based decoder, but it suffers from a higher frame error ratio due to the higher number of SEUs experienced during communication (Section 3.3). Furthermore, we have investigated the impact of routing, application task mapping and architecture allocation on the reliability of the decoders in the presence of SEUs (Section 4). It is hoped that the findings in this work will contribute towards current research efforts in identifying an appropriate on-chip communication architecture for emerging multimedia applications.
