Abstract: Mixed-critical real-time systems must meet strict integrity, resilience, and timing constraints, as specified by safety standards. Due to the increasing threat of random hardware faults, efficiently achieving high reliability and dependability calls for cross-layer fault-tolerance solutions. This paper introduces the Advanced Integrity Q-service (AIQ), a mechanism to ensure the integrity and predictability of on-chip communication under random hardware faults. Devised for cross-layer and hierarchical fault-tolerance solutions, AIQ realizes low-overhead error detection in hardware and delegates error handling to arbitrary strategies in software. Experimental evaluation featuring benchmark applications and an industrial avionics use case shows that AIQ operates with high reliability and availability and low hardware and performance overheads. In a many-core mixed-critical platform under expected real-time scenarios, AIQ performs with execution time overhead between 1.4% and 7.1%.
The first resilient NoC for real-time systems has been recently proposed in [4] and [9] with an approach that ensures the continuous operation of the network after error occurrences. The work builds on the results of a failure mode and effects analysis (FMEA)-based methodology [10] that is capable of meeting certification requirements of safety standards and uncovers all possible impacts of soft errors in the NoC. Upon error occurrence, the approach limits the error impact in time and in scope to provide predictability and integrity. Data delivery under real-time constraints is realized by Automatic Repeat reQuest (ARQ)-based protocols [9], [11]. Although the approach successfully increases the reliability of the NoC, providing error detection and recovery capabilities in hardware incurs overhead even in the absence of errors, which is the case most of the time. Errors in NoCs are rare, so the overhead should be minimized following a "good enough" strategy.
Fast, hardware-based recovery is not always necessary in cross-layer and hierarchical fault-tolerance approaches. Ensuring the system's integrity is paramount and seen as basic functionality, e.g., in [12]. In fact, lossless recovery in hardware requires additional circuitry that can incur substantial power consumption and delays, e.g., retransmission buffers in ARQ [11]. Recovery can be performed more efficiently in higher levels of abstraction, as seen in cross-layer approaches with replicated execution [13] [14] [15]. Such techniques exploit the abundant hardware available in multicore and many-core platforms to increase reliability and provide error recovery capabilities in software. Error detection is performed with hardware support since software-only error detection is ineffective and inefficient. The decision to recover and the error recovery itself are delegated to software.
Nevertheless, the hardware behavior must be predictable since it is part of a real-time system, and it must detect errors fast enough to allow the system to isolate them and prevent their propagation. This ensures that the recovery, if and as desired, can be carried out at the proper granularity and ensures the integrity of the rest of the system. It also reveals two requirements for hardware operating under soft errors: integrity and real-time behavior (predictability).
This paper introduces the Advanced Integrity Q-service (AIQ), an end-to-end mechanism to provide integrity and real-time guarantees of NoC transactions under errors. The mechanism is inspired by the idea of keeping track of transactions in distributed systems and hardware transactional memory (HTM). Upon error detection, error handling and recovery are delegated to software, which may react according to an arbitrary strategy in a cross-layer approach. AIQ is proposed and evaluated in the many-core research platform Integrated Dependable Architecture for Many Cores (IDAMC) [1], considering aspects such as performance and implementation costs. Although the idea of keeping track of transactions in distributed systems is not novel, to the best of our knowledge, its application in hardware in the context of predictable real-time systems has not been explored and evaluated.
The contribution of this paper is fivefold: 1) the AIQ approach to ensure integrity and predictability of real-time NoCs under random hardware faults; 2) a formal communication time analysis of AIQ in NoCs for both the error-free and error cases; 3) the evaluation of AIQ's performance in a mixed-critical real-time many-core platform, comprising benchmarks and an avionics use case; 4) the evaluation of AIQ's hardware implementation cost; and 5) the evaluation of the achieved reliability and availability.
The remainder of this paper is organized as follows. A review of relevant related work is given in Section II. The AIQ approach is introduced in Section III, followed by the formal analysis in Section IV. The experimental evaluation is presented in Section V. Section VI concludes this paper.
II. RELATED WORK
Fault tolerance in NoCs has been widely explored throughout the years. Research has explored fault tolerance both in the link layer [16], [17] and in the network layer [18] [19] [20] [21] [22] [23] [24]. Moreover, approaches focus on different types of random hardware faults: transient and intermittent faults [16] [17] [18] [19], permanent faults [24], or both [20] [21] [22] [23]. Comprehensive overviews are found in [25] and [26]. The key technique varies with the approach: from retransmission protocols and adaptive routing to stochastic broadcasts. Feng et al. [21] tackle transient faults in the link layer with hybrid ARQ/forward error correction (FEC) and permanent faults at the network layer with a reinforcement-learning-based fault-tolerant deflection routing algorithm. A similar approach was adopted in [22]. In contrast, Bogdan et al. [23] employ a probabilistic broadcast scheme to reliably transmit packets under transient and permanent faults. In summary, the vast majority of research has focused on general-purpose and high-performance computing systems and their requirements. Due to different goals and constraints, these approaches are usually not directly applicable to the mixed-critical real-time domain [4].
Mixed-critical real-time systems have strict requirements [6] [7] [8] that call for dedicated techniques that assure safety without jeopardizing the efficiency of resource usage, the partitioning-sharing tradeoff [5]. A comprehensive overview of mixed-criticality is found in [5]. In a mixed-critical real-time NoC, traffic belonging to functions of different criticalities coexists [10] and shares resources (routers, links, and network interfaces) [5]. Aside from the aforementioned real-time, integrity, and resilience requirements, three points are neglected by non-mixed-critical NoCs [4]: predictability and deterministic behavior to enable nonpessimistic, minimum performance guarantees; sufficient independence between different traffic streams to enable the use of the NoC as a shared resource by traffic of different criticalities with independently given performance guarantees; and an error model that accurately captures all possible impacts of errors, as obtained by FMEAs and usually required by safety standards [6]. The interested reader can refer to [4] and [10] for further discussion. Regarding fault tolerance, besides the aforementioned resilient NoC [4], [9], another work has recently addressed mixed-critical NoCs, albeit considering only permanent faults. Mixed-critical partitioning is employed in [24] to circumvent faulty routers, which are detected by a built-in self-test.
The concept of transactions has been widely applied in different fields of computer science and engineering, such as databases, memories, and distributed systems, with the main objective of ensuring the correct execution of transactions while increasing performance and parallelism. In database management systems [27], transactions are employed to provide the properties of atomicity, consistency, isolation, and durability. In that context, the concept is used to increase the level of concurrency, and thus performance, in processing numerous simultaneous transactions accessing a single, large database. Thus, transactions might be executed with speculative data accesses, and their changes to the database/system are only committed after a validation phase [28]. If an illegal interleaving of reads and writes to the database occurred due to a race condition, the respective transaction is rolled back and restarted.
Inspired by database management systems and existing work on HTM [29], the transactional memory coherency and consistency (TCC) was proposed as an alternative to traditional memory consistency and coherency models [30]. TCC aimed at simplifying parallel software programming and increasing the concurrent performance of shared memory multiprocessors. Unlike traditional consistency models, accesses to critical sections of the code, i.e., lock-based access to shared memory, need not be explicitly specified but are carried out by transactions, which are atomic from the point of view of consistency. Unlike traditional coherency approaches, data status synchronization can be performed only at the end of a transaction instead of at every memory access (the actual operation depends on the coherency scheme). A good overview of transactional memories is given in [31]. In the context of mixed-critical real-time systems, HTM has been explored as hierarchical HTM on distributed embedded systems spanning over on-chip and off-chip networks [32].
In the context of fault tolerance, HTM was explored as a hardware-assisted mechanism for error detection and recovery in replicated software execution. The hardware-assisted fault tolerance approach [33] employs instruction-level redundancy for error detection and HTM for recovery with compiler-based code instrumentation. Before committing a transaction to memory, error detection is performed by comparing the instruction-level redundant execution. Recovery is carried out by rolling back the transaction with HTM. Similarly, FaulTM [34] also employs instruction-level redundancy and HTM but with hardware extensions instead of a software-only approach. Those approaches are types of replicated execution [13], [14]. However, error detection requires all tasks/threads in the system to be protected and to execute redundantly. A single unprotected task/thread may cause system failure and violate integrity.
In this paper, the concepts of transactions and ARQ are employed in NoCs to monitor and to detect random hardware faults during runtime and to ensure system integrity while not jeopardizing system performance. The difference from related work is that the approach covers all communication in the NoC instead of only replicated tasks, does not require replicated execution, does not depend on resilient routers, and does not depend on the components of a tile, i.e., whether the tile has a processor with caches, only a hardware accelerator, or a memory controller. The hardware ensures that any error in a transaction is detected and, in a cross-layer solution, delegates to software the task of handling the error and choosing the best strategy to do so. Strategies may involve rolling back, restarting, or killing a task/thread, or failing when no other strategy can be safely applied (see [13] and [14]). Upon a task failure, the error effects are isolated to ensure the integrity of the rest of the system. Upon a system failure, the failure should be signaled before any erroneous output is performed.
III. AIQ APPROACH

A. Overview
The AIQ approach is an NoC service inspired by memory transactions and ARQ-based protocols. AIQ works as a service that keeps track of transactions across the NoC with respect to integrity and timing. Fig. 1 gives an overview of how the approach is integrated into the NoC. The AIQ service operates in the transport layer of the NoC and is located between the interfaces of the network interface (NI) to the tile internals and the lower layers of the NoC. AIQ detects soft and hard hardware errors and reports them to software. The error report signaling path is depicted with the dashed arrows. The AIQ approach delegates to software the task of deciding on and performing error recovery in favor of less hardware overhead. Since hardware-based error recovery is not required, power-hungry retransmission buffers and error recovery circuitry are avoided.
B. Transactions in the NoC
The mechanism keeps track of transactions in the NoC in principle by acknowledging successful transactions. However, given the stringent power and area constraints of small-scale on-chip networks, a closer look at how transactions take place over the NoC is required.
A regular, unprotected transaction is usually initiated by a master who sends a request to a slave, which will receive the request and respond after it has been completed. For example, tile t1 synchronizes with t2 by polling the value of a shared variable in the local memory of t2. Fig. 2(a) illustrates such a read transaction, where t1 sends a read request packet and waits for the response, and t2's NI receives the request packet and forwards the request to the tile. After some time, depending on the resource contention in the tile, the response is packed and sent, and the response packet is then received by t1, which resumes its execution. Note that the lower layers of the NoC protocol stack are abstracted away for the sake of clarity. Note also that the same pattern occurs in the case of a write operation or of message passing, with or without a response, depending on the memory or communication models.
The master might present overlapping memory transactions, e.g., on memory consistency models where support for nonblocking reads and writes exists. The master might also support direct memory access (DMA) transfers, which will overlap with regular memory transactions when appropriately used.
A slave will potentially receive multiple concurrent requests independently of the memory model since more than one master can send a request concurrently. The slave can then support single or overlapping requests. In the former case, one request is received and processed at a time, and subsequent requests are not accepted by the slave until the completion of the current one. In the latter case, more than one request can be processed at a time. Potentially, in both cases, arriving requests will queue up in the NoC causing backpressure and head-of-line blocking, impacting the arrival of subsequent requests and responses. Resource sharing management is a technique that allows the predictable management of shared resources, as might be required in the slave. However, that will not be further discussed here, and the interested reader is referred to [35] for more details. For the sake of simplicity and without loss of generality, the discussion in this paper assumes single transaction support in slaves, unless stated otherwise.
C. Error Model
The transactions are composed of packets, which can suffer a series of different impacts due to random hardware faults in the NoC. We derived a functional error model capturing all impacts of soft errors on an unprotected real-time NoC and their durations. The error model is based on the comprehensive description given in [10] , which is the result of an FMEA-based analysis methodology that uncovers all impacts of soft errors in the NoC, as required by safety standards [6] [7] [8] .
On an end-to-end communication stream, as seen by the NIs, the impact of random hardware faults can be summarized as follows (packet and data are used interchangeably). [10] , and unless they are ruled out with a resilient NoC, their effects last until the circuit is reset [4] .
Hard errors have the same impacts on an end-to-end basis as the ones listed above but with a different occurrence pattern. While soft errors are transient or intermittent in nature and affect traffic randomly according to a probabilistic distribution, hard errors are permanent and, upon occurrence, affect the traffic continuously. Thus, the difference between soft and hard errors is captured in the error model by the frequency in which a certain error effect occurs. The difference between hard errors and soft errors with static effects is that the latter disappear when the NoC is reset.
D. Protocol
AIQ relies on two well-known error detection mechanisms in computer networks: packet integrity check and packet delivery confirmation. The former is based on error-detecting codes (EDCs) and error-correcting codes (ECCs) and can be realized in hardware, or in some cases delegated to software. The latter is realized in hardware with watchdogs and acknowledgment messages, as seen in ARQ-based protocols.
AIQ aims at decoupling the data transport over the NoC from the processing in the tiles, as depicted in the example of Fig. 2(b) . The reason for it is that being predictable and providing timing bounds that include the processing in the tiles creates a complex circular dependence. The traffic in the network depends on the processing in the tiles and vice versa. The AIQ approach should be applicable without previous knowledge of the tile internals. Thus, AIQ tracks requests and responses independently in the NoC instead of keeping track of the entire transaction, which is vital for achieving low error detection latencies and effectively limiting the error impact. Moreover, it also enables the formal performance analysis and guarantees of the approach.
AIQ's objectives are: 1) detect all relevant soft and hard random hardware faults in the NoC; 2) operate independently of the tiles' contents and operation and of the NoC topology; 3) report detected errors with very low latencies; and 4) minimize the NoCs performance overhead.
AIQ keeps locally a tracking table common for both requests and responses with n entries, as illustrated in Table I.

TABLE I: AIQ REQUEST/RESPONSE TRACKING TABLE

TABLE II: AIQ ERROR NOTIFICATION SCENARIOS

Every request (respectively, response) transmitted by the master (respectively, slave) requires an entry in the table. The entry is kept until the sender confirms the successful transmission of the request (respectively, response) or until an error affecting the request (respectively, response) is detected. Each entry has a sequence number that identifies the request/response; a flag indicating the entry's state; the request/response type, e.g., read response; a timer; the source and destination of the request 1; and the address of the request/response. Some fields are used for tracking and others enable the diagnosis after error detection.
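The entry fields listed above can be pictured as a small data structure. The following is a minimal software sketch, not the hardware implementation: the names `TrackingEntry`, `TrackingTable`, `allocate`, and `release`, as well as the field types and the two-state flag, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EntryState(Enum):
    FREE = 0     # entry available for a new request/response
    PENDING = 1  # packet sent, confirmation still outstanding


@dataclass
class TrackingEntry:
    seq: int = 0                          # sequence number identifying the packet
    state: EntryState = EntryState.FREE   # state flag (two states assumed here)
    kind: str = ""                        # request/response type, e.g. "read_response"
    timer: int = 0                        # cycles elapsed since transmission
    src: int = 0                          # source tile (kept for diagnosis)
    dst: int = 0                          # destination tile (kept for diagnosis)
    addr: int = 0                         # address of the request/response


class TrackingTable:
    """Hypothetical model of a table with n entries shared by requests and responses."""

    def __init__(self, n: int) -> None:
        self.entries: List[TrackingEntry] = [TrackingEntry() for _ in range(n)]
        self._next_seq = 0

    def allocate(self, kind: str, src: int, dst: int, addr: int) -> Optional[TrackingEntry]:
        """Claim a free entry for an outgoing packet; None if the table is full."""
        for entry in self.entries:
            if entry.state is EntryState.FREE:
                entry.seq = self._next_seq
                entry.state = EntryState.PENDING
                entry.kind, entry.src, entry.dst, entry.addr = kind, src, dst, addr
                entry.timer = 0
                self._next_seq += 1
                return entry
        return None

    def release(self, seq: int) -> bool:
        """Free the entry with the given sequence number (ACK received or error detected)."""
        for entry in self.entries:
            if entry.state is EntryState.PENDING and entry.seq == seq:
                entry.state = EntryState.FREE
                return True
        return False
```

In this sketch a full table simply refuses the allocation; how the hardware stalls or signals that condition is not modeled.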
The AIQ protocol is divided into two parts: a master instance for sending requests and a slave instance for sending responses. A summary of the different scenarios is given in Table II . They are described individually in the sequel.
1) Requests: AIQ must account for two cases when tracking requests: loss and corruption.
Loss is monitored with a handshaking mechanism, as seen in ARQ protocols. When the request is sent, the timer associated with the request is triggered. When the request is correctly received, the slave's AIQ sends an acknowledgment (ACK) packet back to the sender. If the ACK is correctly received, the respective timer is stopped and the request is marked as not pending, releasing the respective entry. The scenario is illustrated in Fig. 2(b) . If the ACK is not received, a timeout will trigger the error detection activities. This occurs if the request itself was lost or if the ACK was lost. In the former case, the slave is unaware of the failed request and remains unaffected by the error. That is illustrated in Fig. 3(a) . In the latter case, the slave received the request and will process it correctly, whereas the master cannot be certain that the request was received. Thus, in both cases, only the master is notified with a hardware interrupt in order to take appropriate action. The master must be able to handle the "orphaned" responses.
Upon loss detection, four actions are carried out at the master.
1) The information necessary for error diagnosis is stored in dedicated registers in the AIQ.
2) The respective request table entry is released.
3) The interface that issued the affected request is notified to abort the transaction in order to allow the further operation of the system.
4) A hardware interrupt is triggered so that appropriate action can be taken in software.
For example, fault containment can be performed by the real-time operating system (RTOS) and error recovery can be performed by a replica manager [13], [14]. Unaffected tasks may continue executing normally.
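The request-loss handling described above can be sketched as a cycle-driven software model. This is an illustrative sketch under stated assumptions, not the hardware design: the class `MasterAiq`, its method names, the single fixed timeout value, and the dictionary standing in for the diagnosis registers are all hypothetical.

```python
class MasterAiq:
    """Hypothetical model of request tracking at the master with ACK and timeout."""

    def __init__(self, timeout_cycles: int) -> None:
        self.timeout = timeout_cycles
        self.pending = {}      # seq -> {"elapsed", "dst", "addr"}: tracked requests
        self.diag = None       # stands in for the dedicated diagnosis registers
        self.aborted = []      # sequence numbers whose transactions were aborted
        self.interrupts = 0    # number of hardware interrupts raised

    def send_request(self, seq: int, dst: int, addr: int) -> None:
        # Sending a request starts the timer associated with it.
        self.pending[seq] = {"elapsed": 0, "dst": dst, "addr": addr}

    def on_ack(self, seq: int) -> None:
        # A correctly received ACK stops the timer and releases the entry.
        self.pending.pop(seq, None)

    def tick(self) -> None:
        # Advance all timers by one cycle and handle expired ones.
        for seq in list(self.pending):
            self.pending[seq]["elapsed"] += 1
            if self.pending[seq]["elapsed"] >= self.timeout:
                self._handle_loss(seq)

    def _handle_loss(self, seq: int) -> None:
        entry = self.pending[seq]
        # 1) store the information needed for diagnosis
        self.diag = {"seq": seq, "dst": entry["dst"], "addr": entry["addr"], "cause": "timeout"}
        # 2) release the table entry
        del self.pending[seq]
        # 3) notify the issuing interface to abort the transaction
        self.aborted.append(seq)
        # 4) raise a hardware interrupt for software-level handling
        self.interrupts += 1
```

Note that the model cannot tell whether the request or its ACK was lost, which mirrors the observation that only the master is notified in both cases.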
Corruption is verified with an integrity check using EDC and possibly ECC, both at the slave and at the master. The integrity check is mandatory for control fields and optional for data. 2 Requests with corrupt control fields are immediately dropped to keep the integrity of the node, e.g., to prevent unintended access and modification (corruption) of memory contents. As illustrated in Fig. 3(b), once the integrity of the control fields has been verified ②, the request is forwarded to the slave tile. The integrity check of the data (when enabled) is performed on-the-fly as the data are forwarded to the tile. This improves performance and reduces hardware overhead by avoiding the use of large buffers. The result of the integrity check is available when the last data word of the request traverses the AIQ. Thus, part of the data might have already reached the tile (e.g., memory) by the time the corruption is detected and signaled. Nonetheless, the signaling ③ will occur before the last word of the request leaves the slave's NI. The error is also signaled back to the master ④ with a negative acknowledgment (NACK) packet.
Upon corruption detection, the following actions are carried out at the slave.
1) The information necessary for error diagnosis is stored in dedicated registers in the AIQ.
2) The interface receiving the affected request is notified to abort the transaction in order to allow the further operation of the system.
3) A hardware interrupt is triggered locally so that appropriate action can be taken.
4) A NACK is sent to the master in order to trigger the error detection actions.
Upon the receipt of a NACK, the following actions are carried out at the master.
3) The interface receiving the affected request is notified to abort the transaction in order to allow the further operation of the system.
4) A hardware interrupt is triggered locally so that appropriate action can be taken in software.
If the NACK is not successfully received, e.g., due to the failure of the NoC, the case will be handled as a request loss.
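The on-the-fly data check described above, where words are forwarded while the check value is accumulated and the verdict is only known at the last word, can be illustrated in software. This is a sketch under assumptions: CRC-32 from Python's `zlib` stands in for whatever EDC the hardware uses, and the function name `forward_with_check` and the byte-string word format are hypothetical.

```python
import zlib
from typing import Iterable, List, Tuple


def forward_with_check(words: Iterable[bytes], expected_crc: int) -> Tuple[List[bytes], bool]:
    """Forward data words immediately while accumulating a CRC over them.

    Returns the words that reached the tile and whether the integrity check
    passed. The verdict is only available after the last word.
    """
    forwarded: List[bytes] = []
    crc = 0
    for word in words:
        crc = zlib.crc32(word, crc)  # incremental update, no packet-sized buffer
        forwarded.append(word)       # word is handed to the tile right away
    return forwarded, crc == expected_crc
```

As in the description, a corrupted packet is still forwarded word by word; the caller learns of the corruption only from the returned flag and must then trigger the abort, interrupt, and NACK actions.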
2) Responses: Similar to requests, AIQ needs to account for two cases when tracking responses: loss and corruption. In both cases, error detection occurs with the same approach and mechanisms as for requests. The difference lies in the error reporting.
The loss of a response must be reported back to the master. It can be optionally 3 reported locally to the slave. Upon the loss detection, the following actions are carried out at the slave.
1) A NACK is sent to the master in order to trigger the error detection actions.
2) The NACK will be transmitted following Stop-and-Wait ARQ.
Optionally, the following actions can be carried out at the slave to trigger a reaction locally.
1) The information necessary for error diagnosis is stored in dedicated registers in the AIQ.
2) A hardware interrupt is triggered locally so that appropriate action can be taken.
Upon the receipt of a NACK, the following actions are carried out at the master.
1) The information necessary for error diagnosis is stored in dedicated registers in the AIQ.
2) The interface receiving the affected request is notified to abort the transaction in order to allow the further operation of the system.
3) A hardware interrupt is triggered locally so that appropriate action can be taken in software.
4) The NACK is acknowledged (Stop-and-Wait ARQ).
The corruption of a response is detected and reported locally to the master. Upon corruption detection, the following actions are carried out at the master.
1) The information necessary for error diagnosis is stored in dedicated registers in the AIQ.
2) The interface receiving the affected request is notified to abort the transaction in order to allow the further operation of the system.
3) A hardware interrupt is triggered locally so that appropriate action can be taken.
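The four request/response scenarios described in this section can be condensed into a small lookup. The sketch below restates only what the text above says; the function name `notification_scenario`, the string labels, and the `report_loss_to_slave` flag are hypothetical, and where the text lists no slave-side action the sketch assumes none.

```python
from typing import Dict


def notification_scenario(packet: str, error: str,
                          report_loss_to_slave: bool = False) -> Dict[str, str]:
    """Summarize where an error is detected and how each side is notified.

    "interrupt" means a local hardware interrupt; "nack" means the side is
    informed by a NACK packet, which in turn triggers its local actions.
    """
    if packet == "request" and error == "loss":
        # Timeout at the master; the slave stays unaware or unaffected.
        return {"detected_at": "master", "master": "interrupt", "slave": "none"}
    if packet == "request" and error == "corruption":
        # Integrity check fails at the slave, which also NACKs the master.
        return {"detected_at": "slave", "master": "nack", "slave": "interrupt"}
    if packet == "response" and error == "loss":
        # Reported back to the master; local reporting at the slave is optional.
        slave = "interrupt" if report_loss_to_slave else "none"
        return {"detected_at": "slave", "master": "nack", "slave": slave}
    if packet == "response" and error == "corruption":
        # Detected and reported locally at the master (no slave action assumed).
        return {"detected_at": "master", "master": "interrupt", "slave": "none"}
    raise ValueError(f"unknown scenario: {packet}/{error}")
```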
E. Discussion and Limitations
Due to the size and resource restrictions in an NoC in comparison with an off-chip network, two points that can impact the performance and correct operation of the approach were identified during design. They are described individually in the sequel along with the applicability limitations of AIQ.
The first identified architectural limitation is related to how received requests are handled by the interfaces of the NI. When handling received requests, the NI handles each request sequentially instead of handling them in parallel or buffering the requests, which would be too costly and unnecessary in most cases. Thus, subsequent requests and packets can potentially be blocked due to the backpressure. Fig. 4 shows a block diagram of the NI, including the AIQ mechanism (corruption detection with cyclic redundancy check (CRC) is shown separately). The backpressure is illustrated with the red arrow ①. A request is received by the Mst_if interface and blocks the following packets as indicated by the arrow. That interface will only accept a next request after the current one has been served. That can make the timing of the approach dependent on the workload and on the internal details of the tile. The key impact on AIQ is that the blocking introduces additional delay to requests/responses, which are expected to be acknowledged as soon as they arrive at the NI. Thus, the latency becomes dependent not only on the NoC topology and interfering traffic but also on the internal performance of the tile. The first point is addressed in this paper by bounding the maximum time that an interface in the NI can take to process a request. The bound must be realized by the NI and tile designs, e.g., the NI might abort the request if it is not completed by the tile within the specified bound; alternatively, the tile can be designed to accept multiple concurrent requests.
The second identified architectural limitation is related to the NI's input buffer (buffer phit2flit), which reassembles flow control units (flits) from physical units (phits) and buffers them until they can be forwarded to the upper layers of the protocol stack. The flits of different virtual channels (VCs) are stored in different queues. A control message of AIQ (ACK or NACK) can experience head-of-line blocking, depending on the arbitration policy and on the type of packets queued in front of it. This is illustrated by arrow ② in Fig. 4. The situation can escalate to a deadlock, which will be detected as an NoC error, when the Mst_if is ready to respond to a request but no entries are available in AIQ's tracking table. The ACK that releases a table entry is then blocked in the buffer phit2flit due to backpressure from Mst_if. The deadlock can be ruled out by using separate VCs for control packets and for requests/responses. Alternatively, it can be ruled out by ensuring that the number of concurrently received requests in a tile is limited and that they do not block the control packets, e.g., by using a resource manager [35]. The former solution is adopted.
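The deadlock scenario and its resolution with separate VCs can be illustrated with a deliberately simplified queue model. The sketch makes strong assumptions that are not taken from the design: each `"REQ"` at the head of a queue can only be consumed if a tracking table entry is free (for its response), each `"ACK"` frees one entry, and the function name `drains` and the packet labels are hypothetical.

```python
from collections import deque
from typing import List


def drains(queues: List[List[str]], free_entries: int) -> bool:
    """Return True if every queued packet is eventually consumed.

    Each inner list is one FIFO (one VC) of the input buffer. Only the head
    of a FIFO can be consumed, which models head-of-line blocking.
    """
    fifos = [deque(q) for q in queues]
    progress = True
    while progress:
        progress = False
        for fifo in fifos:
            if not fifo:
                continue
            head = fifo[0]
            if head == "ACK":
                fifo.popleft()
                free_entries += 1   # the ACK releases a table entry
                progress = True
            elif head == "REQ" and free_entries > 0:
                fifo.popleft()
                free_entries -= 1   # the response will occupy an entry
                progress = True
    return all(not fifo for fifo in fifos)
```

With a single shared queue `["REQ", "ACK"]` and no free entries, the request at the head blocks the ACK that would free an entry, so nothing drains. Placing the ACK in its own queue lets it bypass the blocked request, after which the request can proceed.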
As a mechanism integrated into the NoC, AIQ fails when it is no longer able to detect and report errors or when the underlying NoC fails. AIQ itself is dimensioned for a single-error scenario, i.e., an error affects either a request/response or the AIQ itself. For instance, either AIQ's table suffers a bit flip, which can be masked or cause the report of an error, or the request/response is affected, in which case the error is handled as expected. The occurrence of a second error in the same request/response is considered a failure.
Finally, note that AIQ does not support broadcasts and multicasts out-of-the-box since that functionality must be supported by the architecture of the underlying real-time NoC. For example, IDAMC does not support those in favor of high predictability and sufficient independence [1]. Moreover, AIQ is intended for detecting errors in the NoC. Thus, error detection in tiles and error handling are beyond AIQ's responsibilities. Error handling can be performed in the next levels of a hierarchical fault-tolerance architecture or cross-layer solution. The limitations of such a cross-layer solution depend on factors beyond the scope of the NoC and AIQ, such as the actual software, RTOS, scheduling policies, and system-level requirements. The dimensioning and validation of the cross-layer approach can be made with tools such as [13] and [36].
Next, the proposed AIQ mechanism is formally analyzed with respect to communication time in the NoC, including the aforementioned aspects.
IV. FORMAL ANALYSIS OF AIQ
Transport protocols, such as Go-Back-N ARQ and the proposed AIQ, introduce additional flow control to the communication. This comes from packets that are retained due to handshaking or due to retransmissions with timeouts (in ARQ). It creates a circular dependence where the performance of the transport layer depends on the network latency, which, in turn, depends on the traffic injected by the transport layer. Similar to [11] , we model such networks using compositional performance analysis (CPA) [37] , which facilitates the integration of network and transport layer analyses.
The formal timing analysis of AIQ is presented in three parts. First, the modeling in CPA is introduced. Then, the protocol behavior in the error-free scenario is analyzed. Finally, the error case is addressed with an analysis of the worst case latency in error reporting.
A. Modeling in CPA
CPA [37] relies on independent local analyses of the system resources, such as router ports and CPUs, and a global analysis loop that aggregates the local results to provide worst case response times (WCRTs) and jitter of tasks. The system model is based on resources providing services, tasks consuming these services, and event models specifying task activation patterns. Task activations are triggered by an external source or by events propagated from other tasks (predecessor tasks). The activations in an event model are given by event arrival curves $\eta^-(\Delta t)$ and $\eta^+(\Delta t)$, which return the minimum and maximum number of events that can arrive in a given time interval $\Delta t$, respectively. Their pseudoinverse counterparts $\delta^+(q)$ and $\delta^-(q)$ return the maximum and minimum time interval between the first and last events in any sequence of $q$ event arrivals, respectively. A conversion method is presented in [38]. The analysis is then carried out in a local step and global loop. In the local step, the local analysis derives each task's response time and output event model based on the busy window approach [39]. In a global loop, the analysis propagates tasks' output event models to their dependent tasks, which, in turn, become their input event models. The analysis stops when a fixed point is reached and all event models are stable or when predefined constraints are violated, such as a maximum WCRT.
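To make the relation between arrival curves and their pseudoinverses concrete, the sketch below uses a periodic-with-jitter event model, a common example in the CPA literature and not a model taken from this paper. The function names and the integer time base are assumptions; only the upper arrival curve $\eta^+$ and the minimum distance function $\delta^-$ are shown.

```python
from typing import Callable


def eta_plus(dt: int, period: int, jitter: int) -> int:
    """Maximum number of events in any time window of length dt."""
    if dt <= 0:
        return 0
    return -(-(dt + jitter) // period)  # integer ceiling of (dt + jitter) / period


def delta_minus(q: int, period: int, jitter: int) -> int:
    """Minimum time between the first and last of q consecutive events."""
    if q <= 1:
        return 0
    return max(0, (q - 1) * period - jitter)


def eta_plus_from_delta(dt: int, delta: Callable[[int], int]) -> int:
    """Pseudoinverse conversion: largest q whose minimum span fits strictly inside dt."""
    if dt <= 0:
        return 0
    q = 1
    while delta(q + 1) < dt:
        q += 1
    return q
```

For example, with a period of 10 and a jitter of 5 time units, up to two events can fall into any window of length 6, because two events can be as close as 5 time units apart. The conversion function recovers the same value from $\delta^-$ alone.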
The modeling in CPA is based on the transport layer analysis of [11] and illustrated in Fig. 5. An interface in the sender's NI produces packets in a traffic stream according to a given packet-based event model $\delta_{tx}$. The packets are handled at the transport layer by AIQ, which is modeled as a resource. The analysis assumes that packets transmitted from different interfaces within an NI are arbitrated according to strict priority nonpreemptive (SPNP) and also assumes that ACKs and NACKs have the highest priority. Each interface producing packets is modeled as a task $\tau$ mapped to that resource: the interface under analysis is depicted as $\tau_{aiq}$, whereas a lower priority one and a higher priority one are depicted as $\tau_{lp}$ and $\tau_{hp}$, respectively. Packets generated by AIQ are captured by a dedicated task, e.g., ACKs transmitted by the sender and NI are captured by $\tau_{ack}$. AIQ then injects traffic into the lower layers of the NoC according to the output event model $\delta_{aiq}$. Note that only one interface (or AIQ itself) can transmit a packet at a time, and therefore, only one output traffic stream is depicted for the sender even though several traffic streams can coexist in the NoC and possibly originate in the same sender.
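The SPNP arbitration assumed by the analysis can be illustrated with a small scheduler simulation. This is an illustrative sketch and not part of the formal analysis: the function name `spnp_schedule`, the tuple layout, and the convention that a lower number means higher priority are assumptions.

```python
from typing import List, Tuple

# (arrival time, priority, transmission length, name); lower priority value wins
Packet = Tuple[int, int, int, str]


def spnp_schedule(packets: List[Packet]) -> List[Tuple[str, int]]:
    """Return (name, start time) pairs under strict priority nonpreemptive arbitration."""
    pending = sorted(packets)
    now = 0
    order: List[Tuple[str, int]] = []
    while pending:
        ready = [p for p in pending if p[0] <= now]
        if not ready:
            now = min(p[0] for p in pending)  # idle until the next arrival
            continue
        # Highest priority first; ties broken by earlier arrival.
        nxt = min(ready, key=lambda p: (p[1], p[0]))
        pending.remove(nxt)
        order.append((nxt[3], now))
        now += nxt[2]  # nonpreemptive: the packet is transmitted to completion
    return order
```

In the example below, an ACK arriving at time 1 has the highest priority but must still wait for the lower priority packet that started at time 0, which is the blocking term a nonpreemptive analysis has to account for.

```python
packets = [(0, 2, 4, "lp"), (1, 0, 1, "ack"), (1, 1, 3, "aiq")]
print(spnp_schedule(packets))  # [('lp', 0), ('ack', 4), ('aiq', 5)]
```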
A protocol with handshaking, such as AIQ, is a bidirectional communication stream [11]. As illustrated in Fig. 5, the communication is mapped in the NoC as two unidirectional streams: one for data and one for acknowledgments in the feedback path. As in [11], the underlying NoC analysis is arbitrary. This analysis assumes, without loss of generality, [40] as the underlying NoC analysis, which models the NoC in CPA as follows: each output port of a router is mapped as a resource, and traffic streams are chains of tasks mapped to resources. Resource arbitration depends on the router arbitration. The output of the underlying NoC analysis used by the transport analysis is the worst case latency $L^+_i$ of a packet transmitted in a traffic stream $i$. The interested reader is referred to [11] and [40] for further details.
The analysis supports both packet-switched and wormhole-switched NoCs. Packet-switched NoCs are supported by default. For wormhole-switched ones, however, a conversion between event models is necessary, as seen in [11]. This is depicted by the light blue elements in Fig. 5. The conversion between packet-based and flit-based event models can be performed with the following equations [11]:
where $size_i$ is the size of any data packet in stream $i$ (in flits), and $d_{min}$ is the minimum distance between two consecutive flits [40].
B. Formal Analysis: The Error-Free Case
At first sight, the timing behavior of AIQ seems similar to Go-Back-N ARQ [11]. However, AIQ differs from ARQ in that one AIQ instance in an NI is shared among traffic streams, whereas in ARQ each traffic stream has its own protocol instance. The per-stream instances of ARQ simplify the analysis because all worst case processing delays and round-trip times (RTTs) are the same for the same traffic stream. With the shared instance of AIQ, the worst case processing delays and RTTs of interfering packets are potentially different, resulting in the more complex problem of multiserver queues [41], [42]. The analysis of multiserver queues is a hard problem, with a worst case that is difficult to bound tightly, and it is therefore usually handled with queueing theory [41] [42] [43]. To make the analysis problem feasible, the worst case analysis of AIQ assumes that there are enough table entries for packets to be transmitted without contention (i.e., an unlimited number of entries). The adopted strategy allows us to determine the number of entries required to achieve the bounded performance. Similar analysis approaches are seen in the literature [40], [44]. A violation of the assumption can be monitored at runtime and safely reported.
To obtain the worst case end-to-end latency of a packet protected by AIQ, it is necessary to derive the interference of other traffic in the NI and the contribution of the AIQ protocol to the latency. That is captured by the WCRT of AIQ $R^+_{aiq,i}$, which is the largest period of time in which a packet is retained by the protocol. Similar to [11], the analysis relies on the busy window approach [39]. The first step is to derive the worst case multiple packet queuing delay.
The worst case multiple packet queuing delay $Q^+_{aiq,i}(q)$ is given by (3)-(6), where $O^+_{aiq,j}$ is the maximum time that AIQ requires to forward a packet of stream $j$; $O^+_{aiq,ack}$ is the maximum time that AIQ requires to create and forward an ACK; $lp(i)$ and $hp(i)$ are the sets of all lower and higher priority streams mapped to the same AIQ as stream $i$, respectively; and $\eta^+_{tx,i}(\Delta t)$ is the maximum event arrival curve (cf. Section IV-A). Equation (3) results in a fixed-point problem. It can be solved iteratively, starting with a very small, positive initial value.
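The iterative solution of such a fixed-point problem can be sketched as follows. This is not a transcription of (3)-(6): the interference terms are assumptions taken from the sources of blocking named in the proof of Lemma 1 (one lower priority packet, higher priority arrivals, and ACKs). The periodic-with-jitter arrival curves, the integer cycle counts, and all names are hypothetical.

```python
def eta_plus(dt, period, jitter=0):
    """Max. arrivals of an (assumed) periodic-with-jitter stream in dt cycles."""
    return 0 if dt <= 0 else -(-(dt + jitter) // period)  # integer ceiling


def worst_case_queuing_delay(q, o_i, lp_costs, hp_streams, ack_stream,
                             max_iter=10_000):
    """Busy-window style fixed-point iteration under SPNP arbitration.

    q          -- number of queued packets of the stream under analysis
    o_i        -- max. forwarding time of one packet of that stream
    lp_costs   -- max. forwarding times of the lower priority streams
    hp_streams -- (period, jitter, cost) per higher priority stream
    ack_stream -- (period, jitter, cost) of the ACK traffic
    """
    blocking = max(lp_costs, default=0)  # one nonpreemptable lower priority packet
    window = 1                           # small positive starting value (one cycle)
    for _ in range(max_iter):
        interference = sum(eta_plus(window, p, j) * o for (p, j, o) in hp_streams)
        p_ack, j_ack, o_ack = ack_stream
        acks = eta_plus(window, p_ack, j_ack) * o_ack
        new_window = blocking + (q - 1) * o_i + interference + acks
        if new_window == window:         # fixed point reached
            return window
        window = new_window
    raise RuntimeError("no fixed point found; the resource may be overloaded")


if __name__ == "__main__":
    for q in (1, 2, 3):
        print(q, worst_case_queuing_delay(q, 4, [6], [(20, 0, 5)], (50, 0, 2)))
```

In this toy setup the iteration for q = 3 passes through 21 cycles before settling at 26, because a second higher priority packet arrives once the window exceeds 20 cycles. Some formulations evaluate the arrival curves on a slightly enlarged window to account for arrivals at the exact instant service starts; that refinement is omitted here.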
Lemma 1: Equation (3) gives an upper bound on the worst case multiple packet queuing delay $Q^+_{aiq,i}(q)$.
Proof: The proof is by induction. When $q = 1$, stream $i$'s packet can be blocked by one nonpreemptable, lower priority packet that just started transmitting, assumed to be the largest one causing the longest delay; while queued, the packet can also be blocked by arriving higher priority packets; additionally, the packet can be blocked by ACKs, which are generated due to receiving packets and sent with the highest priority.
The best-case multiple packet queuing delay $Q^-_{aiq,i}(q)$ is the shortest time interval from the arrival of the first packet until the $q$th packet receives service. It is given by
$$Q^-_{aiq,i}(q) = (q - 1) \cdot O^-_{aiq,i} \qquad (7)$$
where $O^-_{aiq,i}$ is the minimum time that AIQ requires to forward a packet of stream $i$.
Lemma 2: Equation (7) gives a lower bound on the best-case multiple packet queuing delay $Q^-_{aiq,i}(q)$.
Proof: The proof is by induction. When $q = 1$, stream $i$'s packet can be forwarded as soon as it arrives. In a subsequent $(q+1)$th activation, the packet must wait at least for the previous $q$ packets to be forwarded. That results in (7).
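The worst case multiple packet forwarding time $B^+_{aiq,i}(q)$ is given by
$$B^+_{aiq,i}(q) = Q^+_{aiq,i}(q) + O^+_{aiq,i} \qquad (8)$$
Lemma 3: The worst case multiple packet forwarding time $B^+_{aiq,i}(q)$ given by (8) is an upper bound.
Proof: The proof is by direct deduction. Under SPNP, the time to forward $q$ packets corresponds to the time until the $q$th packet is about to receive service plus the time it takes to forward the $q$th packet, which is nonpreemptable. This is captured by the first and second terms of (8).
Analogously, the best-case multiple packet forwarding time $B^-_{aiq,i}(q)$ is given by
$$B^-_{aiq,i}(q) = Q^-_{aiq,i}(q) + O^-_{aiq,i} \qquad (9)$$
Lemma 4: The best-case multiple packet forwarding time $B^-_{aiq,i}(q)$ given by (9) is a lower bound.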
Proof: The proof is omitted. It is similar to Lemma 3 but using lower bounds instead.
The busy period $w_{aiq,i}$ is the longest time interval in which packets of stream $i$ arrive at AIQ before the previous packet has been transmitted. That is, it is a half-open interval starting with the first activation and ending when activation $q$ completes before the arrival of activation $q + 1$. The busy period $w_{aiq,i}$ is given by
Lemma 5: The busy window is upper bounded by (10).
Proof: The proof is by contradiction. Suppose there is a busy window $\tilde{w}_{aiq,i}$ longer than $w_{aiq,i}$. In that case, $\tilde{w}_{aiq,i}$ must contain at least one activation more than $w_{aiq,i}$, i.e., $\tilde{q} \geq q + 1$. From (10), $Q^+_{aiq,i}(\tilde{q}) < \delta^-_{tx,i}(\tilde{q})$, i.e., $\tilde{q}$ is not delayed by the previous activation. Since that violates the definition of a busy window, the hypothesis must be rejected.
The WCRT $R^+_{aiq,i}$ is the longest time interval that any packet of a stream $i$ is delayed by AIQ before being forwarded to the network. It is bounded by
Theorem 1: $R^+_{aiq,i}$ given by (11) provides an upper bound on the response time of an arbitrary packet in the traffic stream $i$ transmitted under AIQ.
Proof: The WCRT of an arbitrary packet in the traffic stream $i$ is obtained with the busy window approach [39]. The response time of the $q$th packet is the time between its arrival ($\delta^-_{tx,i}(q)$, a lower bound) and its injection in the network ($B^+_{aiq,i}(q)$, an upper bound). The WCRT is then found as the maximum among the response times of activations occurring inside the busy window $w_{aiq,i}$ [39]. It remains to prove that the busy window is correctly captured by (10) and that the blocking captured in (8) is an upper bound. Those are proved in Lemmas 5 and 3, respectively.
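The search described in the proof can be sketched in a few lines. The sketch assumes that the response time of the qth packet is the difference between its worst case forwarding time and its earliest arrival, and that the busy window closes at the first activation that completes no later than the earliest arrival of the next one. The function name and the toy curves are hypothetical.

```python
def wcrt(b_plus, delta_minus, q_max=10_000):
    """Return (worst case response time, number of activations in the busy window).

    b_plus(q)      -- upper bound on the time to forward q packets
    delta_minus(q) -- minimum time between the first and the q-th arrival
    """
    worst = 0
    for q in range(1, q_max + 1):
        # Candidate response time of the q-th activation.
        worst = max(worst, b_plus(q) - delta_minus(q))
        # Busy window closes: activation q finishes before activation q + 1 arrives.
        if b_plus(q) <= delta_minus(q + 1):
            return worst, q
    raise RuntimeError("busy window did not close within q_max activations")


if __name__ == "__main__":
    # Toy curves: 13 cycles of interference plus 4 cycles per packet,
    # packets arriving at least 15 cycles apart.
    print(wcrt(lambda q: 13 + 4 * q, lambda q: 15 * (q - 1)))
```

In the toy example the first packet experiences the largest delay (17 cycles), while the second one, which arrives 15 cycles later, waits only 6 cycles.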
The event model capturing the traffic injection of stream $i$ in the network by AIQ can now be derived. The output event model $\delta^-_{aiq,tx,i}$ propagated by an NI with AIQ is obtained as follows:
Theorem 2: The minimum distance function $\delta^-_{aiq,tx,i}(q)$ given by (12) is a lower bound.
Proof: Packets can leave AIQ as soon as they arrive but not faster than AIQ is able to process them. This is captured by the max function. The proof is by cases, with two cases that must be lower bounds. The first case is when the packets leave AIQ as fast as they arrive. Packets can be delayed in AIQ, which results in a jitter ($R^+_{aiq} - R^-_{aiq}$) that is propagated with the output event model. The resulting distance is guaranteed to be a lower bound, as proven in [45]. The second case is that any $q$ packets cannot be closer to each other in time than the rate at which AIQ is able to process them. This is captured by $B^-_{aiq}(q - 1)$, which is proven to be a lower bound in Lemma 4. Since both cases are lower bounds, (12) is also a lower bound.
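The two cases of the proof can be combined in a small sketch. It is a reconstruction from the proof text only: the input distance reduced by the response time jitter, and the best-case time to process the preceding q - 1 packets, joined by a max. Function names and numbers are hypothetical.

```python
def output_delta_minus(q, delta_minus_in, r_plus, r_minus, b_minus):
    """Minimum distance between the first and q-th packet leaving AIQ.

    delta_minus_in(q) -- minimum distance function of the incoming stream
    r_plus, r_minus   -- worst case and best-case response times in AIQ
    b_minus(q)        -- best-case time to forward q packets
    """
    if q < 2:
        return 0
    jitter = r_plus - r_minus
    # Case 1: arrivals compressed by the jitter; case 2: processing rate of AIQ.
    return max(delta_minus_in(q) - jitter, b_minus(q - 1))


if __name__ == "__main__":
    arrivals = lambda q: 15 * (q - 1)   # packets at least 15 cycles apart
    forward = lambda q: 4 * q           # at least 4 cycles per packet
    for q in (2, 3, 4):
        print(q, output_delta_minus(q, arrivals, 17, 4, forward))
```

For two packets the processing rate dominates (4 cycles), whereas for longer sequences the jitter-compressed arrival distance dominates.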
The time it takes to transfer $q$ packets can now be bounded, where $q$ might range from a single small packet to a long DMA transfer. The overall latency $L^+_{aiq,i}(q)$ of transmitting $q$ data packets in a stream $i$ is given by
$$L^+_{aiq,i}(q) = \delta^-_{tx,i}(q) + L^+_i + R^+_{aiq,i} \qquad (13)$$
where $L^+_i$ is the worst case NoC latency of any packet in stream $i$, provided by the network analysis (cf. Section IV-A).
Theorem 3: Equation (13) gives an upper bound on the overall latency to transmit q data packets under AIQ.
Proof: The proof is by direct deduction. The latency consists of the time it takes for the sender to create $q$ packets ($\delta^-_{tx,i}(q)$), plus the time it takes for the last ($q$th) packet to be delivered by the network, plus the worst case delay for that packet introduced by AIQ ($R^+_{aiq,i}$) due to contention and handshaking. Due to causality (i.e., packets cannot bypass each other), all previous packets must have been received by the time the last packet is received. Thus, (13) is a valid upper bound.
Finally, the time it takes to perform an NoC transaction consisting of request and response under AIQ can be bounded. The transaction latency $L^+_{trans}(q_{req}, q_{resp})$ of a transfer comprising $q_{req}$ request packets and $q_{resp}$ response packets is given by
$$L^+_{trans}(q_{req}, q_{resp}) = L^+_{aiq,req}(q_{req}) + O^+_{proc} + L^+_{aiq,resp}(q_{resp}) \qquad (14)$$
where the request and the response consist of $q_{req}$ and $q_{resp}$ packets, respectively, and $O^+_{proc}$ is an upper bound on the time it takes for the transaction to be processed and answered (see Section III-E).
Theorem 4: Equation (14) gives an upper bound on the overall latency to complete a transaction.
Proof: The proof is by direct deduction. The latency of a transaction consists of the latency to transmit the request, the time it takes for the receiver to process the request and generate a response, and the latency to transmit the response. This is captured by the first, second, and third terms of (14) . From Theorem 3, the first and third terms are upper bounds. The second term (O + proc ) is an upper bound by definition. Thus, (14) is a valid upper bound.
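As a worked example, the terms named in the proofs of Theorems 3 and 4 can be added up directly. The sketch below follows those proofs: sender time, network latency, and AIQ response time per stream, plus the processing time between request and response. All numbers and names are made up for illustration.

```python
def stream_latency(q, delta_minus_tx, noc_latency, r_plus):
    """Upper bound on the latency to transmit q packets of one stream."""
    return delta_minus_tx(q) + noc_latency + r_plus


def transaction_latency(q_req, q_resp, req, resp, o_proc):
    """Request latency + processing time at the receiver + response latency.

    req, resp -- (delta_minus_tx, noc_latency, r_plus) of each direction
    o_proc    -- upper bound on the processing time at the receiver
    """
    return stream_latency(q_req, *req) + o_proc + stream_latency(q_resp, *resp)


if __name__ == "__main__":
    request = (lambda q: 15 * (q - 1), 40, 17)    # hypothetical values in cycles
    response = (lambda q: 10 * (q - 1), 35, 12)
    print(stream_latency(4, *request))
    print(transaction_latency(1, 4, request, response, o_proc=20))
```

A read transaction with a single request packet and a four-packet response is bounded here by 57 + 20 + 77 = 154 cycles.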
C. Formal Analysis: The Error Case
In this section, AIQ is analyzed with respect to its error detection latency guarantees. In contrast to ARQ-based protocols, which guarantee packet delivery, AIQ does not provide error recovery and thus does not itself introduce additional latency due to errors. Error recovery might be performed in software and will certainly incur additional processing time, whose worst case behavior under errors has been analyzed, e.g., by [13]. As summarized in Table II, two cases must be detected by AIQ: loss and corruption. Upon detection, the error must be notified to the local tile or to the remote tile depending on whether the affected packet was a request or a response.
The worst case impact of an error on the detection latency is when the error causes a request/response loss, where the detection of a packet loss occurs upon the timeout event of a timer. In contrast with the detection of corruption, which occurs at the arrival of a request/response, the detection of packet loss will always take longer due to the timeout. Such worst case error impact is similarly seen in ARQ-based protocols [11] . In the sequel, the worst case detection latency is analyzed with respect to transient faults. The impacts of permanent faults and permanent effects are discussed afterward.
In case of request loss, only the local tile must be notified (cf. Table II). The worst case error detection latency for local reporting $L^{+err}_{aiq,i}(q)$ is the longest time interval between the transmission of a request on stream $i$ and the notification that its transmission on the NoC failed. It is given by (15), where $t_{out,i}$ is the timeout value for the request packet of stream $i$ and $O^+_{aiq,int}$ is the maximum delay from timeout detection until a hardware interrupt is raised. Similar to ARQ protocols, the timeout must be chosen larger than the worst case RTT, usually including a safety margin, i.e., $t_{out,i} > RTT$.
In case of response loss, a remote notification with a NACK is required, which extends the notification latency. This is captured by the worst case error detection latency for remote reporting $L^{+err\,rem}_{aiq,i}(q)$ and is given by (16), where $t_{out,i}$ is the timeout value for the response packet of stream $i$, and $L^+_{nack}$ is the worst case NoC latency of the NACK packet (cf. Section IV-A).
It is possible that the NACK is delivered to the master only after retransmission attempts, which can occur in multiple-error scenarios under very high error rates. In that case, $k \cdot t_{out,NACK}$ can be added to (16) to account for the $k$ additional retransmissions with timeout $t_{out,NACK}$.
In case of permanent faults or transient faults with static effects causing the failure of the NoC, it is possible that the NACK is not delivered at all. AIQ is also able to detect NoC failures by monitoring the frequency of error occurrences and by detecting the failure of a remote notification. In case of network failure, a dedicated single-wire error signal, shared among all nodes, can be employed to notify the failure to the otherwise unreachable system controller. The controller can then reset the NoC, which will make the NoC unavailable for some time, called mean down time (MDT), whose length depends on the hardware/software implementation. Moreover, the reset of the NoC must be carried out in such a way that the remaining transactions are allowed to finish, so that only the tasks whose transactions failed due to an error will trigger a recovery in software. Otherwise, the reset could induce the failure of all pending transactions, which is undesirable.
V. EXPERIMENTAL EVALUATION
AIQ has been evaluated with respect to performance, implementation overhead, and achieved reliability and availability. The objective of the experiments is to evaluate the impact of AIQ on the regular performance of the MPSoC. AIQ's impact on performance under errors is upper bounded by its impact on regular operation (cf. Section IV-C). Note that a clear distinction must be made between the impact of AIQ and the impact of software execution upon the occurrence of errors, e.g., error handling routines. Only the former is evaluated here.
The performance was evaluated with the many-core platform IDAMC [1]. Benchmark applications, as well as an avionics use case, were executed on two versions of the platform: a baseline version and a version with AIQ. Moreover, two different mapping configurations are used to emulate one extreme scenario and one expected scenario. In the first scenario, the applications are executed remotely, exposing the application performance directly to the NoC latencies; i.e., the application code and data are mapped to memory in remote tiles. In the second scenario, the application nodes execute locally, emulating a logical execution time model with intercore communication for synchronization and DMA for data transfers; i.e., code and data are mapped to local memory, while shared memory communication and code download use memory in remote tiles. The two scenarios provide a valuable contrast between AIQ's impact on the NoC traffic and the impact of the NoC performance on the application's execution. Finally, the hardware implementation overhead of AIQ is evaluated, followed by a reliability assessment and discussion. The results presented in this section refer to a VHDL design of the IDAMC platform and AIQ, simulated at register-transfer level (RTL) with QuestaSim [46] and synthesized as an application-specific integrated circuit (ASIC) with Design Compiler [47].
A. Performance Evaluation: Benchmark Applications
Let us start by evaluating the performance impact of AIQ with CHSTONE benchmarks [48] . The benchmark applications were mapped to the IDAMC platform as depicted in Fig. 6 . The applications' code and data were mapped to a remote memory, according to the first of the aforementioned mapping configurations. In the setup, application tiles generate traffic due to cache misses and evictions and due to uncached data access. The applications were divided into two groups: one group (light gray) accesses the memory in tile DRAM1 and the other group (dark gray) accesses the memory in tile DRAM2. The NoC is configured with XY routing for requests, YX routing for responses, and a separate VC for each application. The interested reader can refer to [1] , [40] , and [4] for more details on the IDAMC and on the NoC, respectively.
The results of the RTL simulations can be seen in Fig. 7 , which plots the latencies of NoC transactions of the different applications as boxplots. Furthermore, the plot compares latencies on an unprotected NoC (base) with latencies on a NoC protected with AIQ. In a boxplot, the whiskers represent the maximum and minimum, the box represents the second and third quartiles, the horizontal line indicates the median value, and the marker indicates the mean value. First, the minimum latency increased nine cycles in all applications. This is due to the increase in the pipeline length in the NI as seen in Fig. 4 . By introducing AIQ and the CRC checker, the pipeline was extended by four stages for a request and for a response. On average, the latencies increased 16.1% across all applications. The standard deviation increased from 0.42 cycles (AES) up to 2.90 cycles (Blowfish). That is caused by the increased pipeline length as well as the additional feedback traffic consisting of ACKs. This experimental setup intentionally induces a high amount of traffic whose performance strongly impacts the execution time of the applications. Thus, execution time increase varied from 10.8% (ADPCM) up to 15.5% (Motion), depending on the application's memory footprint. On average, the execution times increased 12.6% across all applications.
This can be considered as an upper bound for the overhead caused by AIQ on cache-enabled executions. As seen next, the impact on the performance of executions with local memory is much lower.
B. Performance Evaluation: Avionics Use Case
Let us now evaluate AIQ with a parallelized avionics application. Due to the high secrecy involving the development of such systems, the experiments employ an artificial demonstration application (ADA) that mocks the dataflow and workload of a Helicopter Terrain Awareness and Warning System (HTAWS). The original application consists of multiple threads with design assurance level (DAL)-C [7] executing on an SMP with an RTOS. The main application dataflow, depicted in Fig. 8(a) , comprises four major pipeline stages. Two of the stages (Decomp. and Draw) can be parallelized to increase performance by exploiting the available data parallelism. A major frame must be processed from input to output in at most 60 ms. In the original single-core application, the four stages are executed sequentially, with a period of 60 ms. In the parallel version, the stages are executed in a pipeline to increase the overall throughput.
The application is mapped to a 2 × 4 instance of IDAMC, as depicted in Fig. 8(b). Stages 1, 3, and 4 are mapped to a single tile (L/I/D). Three instances of stage 2 are mapped to different tiles (Decomp. #1, #2, and #3). In addition, interference is introduced in the platform by node Stream.src, which generates DMA traffic to node Stream.dst with 4.8-KB DMA transfers. The system controller (System Ctrl.+RM) initializes the platform and manages the access to shared network resources by implementing a resource manager [35].
Fig. 9 reports the execution time of ADA under different setups in RTL simulations of the IDAMC platform. When increasing the workload, specified as a number of batches, ADA takes between 1 and 3.2 ms to execute on IDAMC with an unprotected NoC (base). With an NoC protected with AIQ, the execution takes from 1.4% to 7.1% longer [cf. Fig. 9(b)]. In contrast to the benchmark applications evaluated in Section V-A, these executions with AIQ present much lower overhead. This is due to the more realistic setup expected in real-time multicore and many-core platform environments, where a clearer separation between computation and communication is required to limit interference, to achieve predictability, and to avoid prohibitive analytical over-approximations. This trend can be seen in predictable execution models such as superblocks [49], [50].
Fig. 10 reports the latencies of NoC transactions observed in the scenario with four batches. The plot shows, as boxplots, the observed latencies of NoC transactions initiated by different ADA application nodes with and without AIQ. The average transaction latencies increase between 14.8% and 19.7% with AIQ. The minimum latencies increase by 15.5% due to the additional cycles required by the extended pipeline in the NIs. The observed maximum, however, decreased slightly, by between 3.9% and 7.8%, due to slightly different network traffic patterns resulting, e.g., from the interaction with the feedback traffic.
The standard deviation increased at most 1.91 cycles (L/I/D) and decreased at most 1.33 cycles (Decomp. 2).
Although nonnegligible, the observed performance overheads are low in comparison with other software and hardware-based fault-tolerance approaches for safety-critical real-time systems [12], [51]. Evaluating only timing can be a pitfall since the cost of the approach might be hidden. For instance, in case of dual modular redundancy (DMR) or triple modular redundancy (TMR) in processors with lockstep execution [12], the performance overhead is low or negligible, but the remaining overhead is hidden in the hardware implementation. That is evaluated in the sequel.
C. Implementation Overhead
Let us now evaluate the implementation overhead of AIQ.
[Figure: ASIC synthesis results (65-nm UMC) for an NI with AIQ (2, 4, and 8 entries) and without it (U).]
The lower layers of the NoC are captured by Others. The AIQ is subdivided into the core AIQ and the CRC integrity check (CRC gen and CRC check). The results also show Addr.Transl., which contains the NoC routes and VCs (source routing) as well as local and remote address mapping.
In a 65-nm UMC ASIC, an NI implementing AIQ with two table entries requires 9.6% additional silicon area. The implementation overhead of AIQ is introduced by the main AIQ component and the CRC integrity check. The main contributor to the area increase is the integrity check (CRC gen. and CRC check), which accounts for approximately 82.9% of the additional logic. The main component (AIQ) contributes only approximately 17.1% of the additional logic since it replaces the original multiplexer and demultiplexer (DE/MUX). When increasing the number of entries from 2 to 4, 2.2% additional logic is required (in total, 12.0% additional logic from U to 4). Further doubling the number of entries from 4 to 8 requires 4.2% additional logic (in total, 16.7% additional logic from U to 8). Nonetheless, the area requirements of AIQ are expected to decrease with more efficient designs and implementations of the NI and of AIQ itself.
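The cumulative figures can be checked with a few lines of arithmetic. The sketch assumes that each increment (2.2% and 4.2%) is relative to the preceding configuration rather than to the unprotected NI, which is the reading under which the reported totals are reproduced. The helper name is hypothetical.

```python
def compound(base_overhead_pct, *step_pcts):
    """Total overhead in percent when each step is relative to the previous size."""
    total = 1 + base_overhead_pct / 100
    for step in step_pcts:
        total *= 1 + step / 100
    return (total - 1) * 100


if __name__ == "__main__":
    print(round(compound(9.6, 2.2), 1))        # U -> 4 entries
    print(round(compound(9.6, 2.2, 4.2), 1))   # U -> 8 entries
```

The compounded values, 1.096 × 1.022 ≈ 1.120 and 1.120 × 1.042 ≈ 1.167, agree with the reported 12.0% and 16.7%.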
The energy consumption was evaluated with Synopsys PrimeTime [52] using the 65-nm netlist. Under full load (single-word memory accesses with random payload), the NI equipped with AIQ (two entries) consumes 8.03% more energy than baseline. For a larger AIQ with 4 and 8 entries, the energy overhead is 9.24% and 11.52%, respectively. When idle (no traffic), the overheads are 6.45%, 7.70%, and 9.91% for AIQ with 2, 4, and 8 table entries, respectively.
Let us now compare the hardware cost of the AIQ approach with the resilient NoC [4] and the DMR and TMR approaches [53]. DMR is able to detect errors of the NoC, but it can neither pinpoint them nor recover from them. Like AIQ, DMR can be used to detect errors and achieve integrity. TMR is able to detect errors and also to identify which NoC instance is faulty. It can tolerate one error and continue operating. Fig. 12 shows the cost in terms of silicon area of a 5 × 5 NoC where every router is connected to an NI. Two different technology nodes are used: UMC 65 nm and TSMC 28 nm. Before discussing the results, some considerations are required. First, the figure is intended to compare the overhead of different approaches in the same technology node. Second, the costs of DMR and TMR do not include the voter and recovery logic, and the NIs of the Resilient NoC use a large AIQ as a lower bound for a full-featured ARQ implementation. Third, the results do not account for the link wires, which are highly dependent on the placement and routing of the entire MPSoC.
The total cost of implementing AIQ and ensuring the predictability and integrity of a 5 × 5 NoC is 3.74% in 65 nm (4.78% in 28 nm) when equipping the NIs with AIQ instances with two table entries (AIQ 1). When considering that two NIs are potential bottlenecks and require larger AIQ instances with eight table entries (AIQ 2), the cost rises slightly to 4.08% in 65 nm (5.19% in 28 nm). Even when all NIs feature large AIQs, the cost corresponds to 6.52% in 65 nm (not plotted). In contrast with DMR and its >100% overhead, AIQ requires just a fraction of the resources to provide the same guarantees. In order to further achieve resilience and high reliability, the resilient NoC approach requires 12.06% additional silicon in 65 nm (14.76% in 28 nm). In comparison to AIQ, that is roughly three times the overhead. In comparison to TMR, the resilient NoC is not only much more efficient with respect to resource usage (area) and power but also more effective. In addition, DMR and TMR imply a significant increase in the number of interconnecting wires, which can lead to circuit routing complications, potential congestion, and lower frequencies during implementation.
In 28 nm, slightly larger overheads can be observed. This is due to the fact that different cells in a cell library (e.g., sequential, combinational, and their different variations) have different sizes. When scaling down the technology node from 65 to 28 nm, the corresponding cells in the library do not scale equally.
Regarding maximum achievable frequency, no impact of AIQ has been noticed in 65 or 28 nm; i.e., the critical path lies in one of the interfaces in all NI configurations.
D. Reliability
To put into perspective what can be achieved with such overheads, let us evaluate the reliability. The reliability is evaluated by means of the reliability metric $R(t)$, which is the probability that the NoC does not fail during a time interval $[0, t]$ [53]. We reuse the definition of [4], which defines failure in a high dependability mixed-critical real-time system as the violation of integrity, resilience, or real-time latency guarantees due to errors, including static effects leading to blocking. Packet loss is not considered a failure for AIQ and the Resilient NoC because it is handled in the transport layer. The evaluation considers bit error rates (BERs) from $10^{-12}$ up to $10^{-9}$ /h, which accounts for the BERs expected in practice and the design safety margin [59]. (The BERs are derived [55] for sequential and combinational logic with data from [56] for 65-nm CMOS SRAM. Masking effects [57] are not taken into account. BERs in smaller geometries, including FinFETs, have been shown to be decreasing [58] and are thus conservatively covered by the derived BERs.) In addition, a permanent fault rate of $10^{-11}$ /h per router is considered. (The fault rate per router is derived from processor failure rates in [60].) The occurrence of a permanent fault leads directly to failure.
Fig. 13 plots the analytical $R(t)$ for the unprotected, baseline NoC and the NoC protected with AIQ considering a 5 × 5 2-D-mesh topology, excluding the NIs. The plot also includes the Resilient NoC [4] and variants of the baseline NoC with DMR (Base + DMR) and TMR (Base + TMR). The DMR and TMR have ideal voters and the latter is nonreparable [53]. Although DMR can ensure the integrity of the system, the extra redundant hardware implies higher susceptibility to errors and results in a less reliable NoC than baseline. The same is seen for TMR, whose capability of withstanding one error does not pay off the extra redundant hardware.
On the other hand, AIQ captures all violations of integrity and real-time requirements and is able to increase the reliability by about one order of magnitude with respect to the baseline NoC. However, there still exists the possibility that a soft error affects the state of the NoC and causes static effects, leading to the blocking of the NoC [4], [10], which still limits the reliability in time. This scenario can either be prevented with a Resilient NoC [4] approach or be handled by resetting the NoC back to a valid state with a version of AIQ with NoC resetting capabilities (AIQ+R). AIQ+R and Resilient achieve equally high reliability since all soft errors in the NoC can be detected and handled by both techniques, at which point hard errors become the limiting factor.
Even in small topologies, nonresilient NoCs present very low mean times to failure (MTTFs) and, consequently, very high failure rates. On average, an 8 × 8 NoC is expected to be struck by a soft error from every 360 days in a regular environment (BER $10^{-12}$) up to every 8.65 h in an aggressive environment (BER $10^{-9}$). Most of those errors present only transient effects and will be handled in software. However, some of them present static effects and their recovery requires resetting the NoC's state. Those are rare and are expected to occur, on average, every 80.4 h under BER $10^{-9}$. The recovery requires cycles of downtime and thus impacts the NoC availability, which is evaluated next.
Fig. 14 reports the unavailability of the NoC, as the complement of its availability ($U = 1 - A$), for different sizes and BERs when varying the MDT. Even for very high BERs, with the MDT in the range of microseconds, the NoC still presents very high availability. As the MDT approaches a tenth of a second, the system experiences longer interruptions due to the NoC recovery, which is likely to compromise its timeliness and violate, already at design time, the real-time guarantees. Thus, fast software routines should be used with hardware support to ensure low MDTs and the applicability of the approach. Alternatively, if the availability is too low due to a combination of long MDTs, high BERs, and large NoC sizes, the Resilient NoC approach [4], which ensures the availability of the NoC in hardware, can be employed.
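To give a feeling for the orders of magnitude involved, the unavailability can be estimated from the mean time between NoC resets and the MDT. The sketch assumes the common steady-state model A = MTBF / (MTBF + MDT); the exact model behind Fig. 14 is not stated here, so the values are indicative only. The 80.4 h interval is the figure quoted above for static-effect errors under BER $10^{-9}$; the function name is hypothetical.

```python
def unavailability(mean_time_between_resets_h, mdt_s):
    """U = 1 - A under the assumed steady-state model A = MTBF / (MTBF + MDT)."""
    mdt_h = mdt_s / 3600.0  # convert the mean down time from seconds to hours
    return mdt_h / (mean_time_between_resets_h + mdt_h)


if __name__ == "__main__":
    for mdt_s in (1e-6, 1e-3, 1e-1):
        print(f"MDT = {mdt_s:g} s -> U = {unavailability(80.4, mdt_s):.3e}")
```

Under this assumption, an MDT of one microsecond yields an unavailability in the order of 1e-12, whereas an MDT of a tenth of a second yields roughly 3.5e-7; the latter is still small on average, but each individual interruption is long relative to typical real-time deadlines.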
VI. CONCLUSION
In this paper, we presented AIQ, an end-to-end mechanism to provide integrity and real-time guarantees for NoC transactions under random hardware faults. When integrated, the mechanism results in a low-overhead fault-tolerant NoC capable of detecting errors and ensuring that their effects are contained in time in order to maintain the system's predictability and integrity. AIQ explores the idea of keeping track of transactions of distributed systems in the context of NoCs for predictable real-time systems. Upon error detection, error handling and recovery are delegated to software, which may react according to an arbitrary strategy, as seen in cross-layer fault-tolerance approaches. The mechanism was evaluated in the many-core research platform IDAMC considering aspects such as performance, implementation costs, reliability, and availability. AIQ operates with low hardware overhead and low impact on performance.
