On-chip wires are becoming unreliable as the effect of various noise sources increases with technology scaling. This leads to unpredictable timing delay variations on the interconnect wires. There is a significant need to mitigate the effect of parasitics on the interconnects, while keeping performance and area overheads at a minimum. In this work, we present a timing error tolerant design methodology, T-error, that provides dynamic recovery from timing delay variations on the interconnects. We validate the functionality of the T-error methodology using cycle-accurate RTL models of a Network-on-Chip (NoC) design that are integrated onto a multiprocessor virtual platform. Our comparisons with state-of-the-art error recovery mechanisms show that the T-error system provides error recovery with higher performance than the existing schemes. We also present the synthesis results for the T-error scheme, which show that the scheme has negligible overhead.
Introduction
With technology scaling, the communication complexity of Systems-on-Chip (SoCs) is rapidly increasing. To tackle the resulting complexity, Networks-on-Chip (NoCs) have emerged as the paradigm for designing scalable communication architectures for SoCs [1]-[5].
Another effect of Deep Sub-Micron (DSM) technologies is the appearance of significant delay variations on the wires. Wires are becoming thicker and taller, but their widths are not increasing proportionally, thereby increasing the effect of coupling capacitance on the delay of wires. As an example, the delay of a wire can vary between T and (1 + 4A)T (where T is the delay of the wire without any capacitive coupling and A is the ratio of the coupling capacitance to the bulk capacitance) [9]. The wire delay for data transfers on a communication bus depends on the data patterns that have to be transferred. As presented in [7], the data-dependent variations in wire delay can be as large as 50% for different switching patterns. With technology scaling, the device characteristics fluctuate to a large extent due to process variations and can cause significant variations in wire delay [8]. Wire delay is also affected by other forms of interference, such as supply bounce, transmission line effects, etc. [8].
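The delay range quoted above can be evaluated directly. The following is a minimal sketch of that arithmetic only; the function name and the numeric values (a 100 ps uncoupled delay and a coupling ratio of 0.5) are hypothetical and are not taken from [9].

```python
def wire_delay_bounds(base_delay, coupling_ratio):
    """Return the (min, max) wire delay under the T .. (1 + 4A)T model.

    base_delay     -- T, the delay without any capacitive coupling
    coupling_ratio -- A, coupling capacitance divided by bulk capacitance
    """
    if base_delay < 0 or coupling_ratio < 0:
        raise ValueError("delay and coupling ratio must be non-negative")
    worst_case = (1.0 + 4.0 * coupling_ratio) * base_delay
    return base_delay, worst_case


# Hypothetical values: T = 100 ps, A = 0.5.
best, worst = wire_delay_bounds(100.0, 0.5)
print(f"delay range: {best} ps to {worst} ps")
```

With these illustrative numbers the worst-case delay is three times the uncoupled delay, which shows how quickly the spread grows with the coupling ratio.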
The major effect of these noise sources is that the delay incurred by the data traversing the interconnect becomes unpredictable, causing timing delay violations and timing errors on the interconnects. There is a significant need to mitigate the effect of parasitics on link performance. In most state-of-the-art NoCs, when such errors are detected, the packets that incurred errors are retransmitted [18]-[21]. However, retransmission of data incurs significant performance penalties. Moreover, timing delay variations due to the noise sources can potentially affect multiple data bits in a packet, requiring complex multi-bit error detecting/correcting codes that are impractical to use [21].
In this work, we present a Timing-error tolerant design methodology, T-error, that makes the NoC design tolerant to timing errors caused by the unpredictability in the environment and wire characteristics. In the T-error scheme, a double data sampling technique is used to recover from timing errors in the NoC. The double data sampling technique has been widely used by several researchers for general purpose processor designs [10]-[15]. In this work, we integrate the basic double sampling technique with the network buffers and the link level flow control protocol used in the NoC. The resulting system can detect and correct timing errors dynamically, with minimum performance and area penalty.
In the proposed design methodology, the normal FIFOs used in the network components (links, switches and network interfaces) of the NoC are replaced by the T-error FIFOs, which can be designed and used as library elements. Thus, adding support for timing error recovery in existing NoC systems using the proposed approach requires only a marginal amount of design effort. To the best of our knowledge, the T-error methodology is the first work that addresses the issue of dynamic timing error resiliency in NoCs using the double data sampling technique. The T-error scheme is explained in detail in Section 2.
The basic approach of the T-error design methodology as applied to network links has been presented by us in [22] and a comparison with other flow control schemes has been presented in [24]. In this work, we validate the performance and area overhead of the methodology by developing cycle-accurate SystemC models of the T-error network components, and incorporate them within the xpipes NoC architecture [16]. To obtain results on typical SoC benchmarks, the T-error based NoC is integrated into MPARM [23], a general multiprocessor virtual platform. We perform functional cycle-accurate SystemC simulations on an image processing benchmark application to validate the performance of the scheme. We also run experiments on state-of-the-art NoC error recovery schemes, based on retransmission of data in case of errors, which show that the T-error scheme is faster. Finally, we present synthesis results for the T-error NoC, which show that the area overhead incurred by the scheme is negligible.

1-4244-0622-6/06/$20.00 ©2006 IEEE.
Note that the T-error methodology targets only the recovery from timing errors, which nevertheless constitute a major portion of the faults that may be encountered in NoCs. To tolerate other kinds of errors (such as soft errors), mechanisms presented in several other research works (such as [17]-[21]) should be used in conjunction with the T-error methodology.
T-error Scheme
The basic operation of the T-error scheme is explained in this section. In order to obtain high throughput, we realistically assume that the long links in the baseline error-prone NoC are pipelined and have 2-entry FIFOs at each pipeline stage (see Figure 1) [6]. In the T-error scheme, the 2-entry FIFOs are modified to support timing error tolerant operation; the FIFOs in the switches and network interfaces of the NoC are also modified in a similar manner.
The modified 2-entry FIFO structure is shown in the accompanying figure. Whenever a timing error is detected (i.e., the err signal is raised), a stall signal is sent to the previous stage, so that the previous stage suspends data transmission for one cycle while the error is handled. In addition, a valid signal is sent to the following stage, informing it that the data sent in the previous cycle was faulty and that a new, correct copy is now incoming.
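The stall and valid handshake described above can be illustrated with a small behavioral model. This is only a sketch under simplifying assumptions, not the RTL or SystemC model used in this work: the names `run_stage` and `delivered` are hypothetical, a late word is assumed to make the main flip-flop capture the stale value still on the wire, the delayed flip-flop is assumed to always capture the correct value, and every detected error is charged exactly one stall cycle.

```python
def run_stage(words, late):
    """Behavioral sketch of one T-error pipeline stage.

    words -- data words offered by the previous stage, in order
    late  -- set of indices of words that miss the ck edge (assumption:
             they still meet the ckd edge)
    Returns (events, stalls); each event is (cycle, word, is_correction).
    """
    events, stalls, cycle = [], 0, 0
    previous = None  # stale value seen by the main flip-flop on a late word
    for index, word in enumerate(words):
        main = previous if index in late else word  # sampled on ck
        delayed = word                              # sampled on ckd
        events.append((cycle, main, False))         # forwarded optimistically
        cycle += 1
        if main != delayed:                         # err is raised
            stalls += 1                             # previous stage waits one cycle
            events.append((cycle, delayed, True))   # corrected copy, flagged
            cycle += 1
        previous = word
    return events, stalls


def delivered(events):
    """Words accepted downstream: a correction replaces the prior word."""
    accepted = []
    for _, word, is_correction in events:
        if is_correction:
            accepted[-1] = word
        else:
            accepted.append(word)
    return accepted


events, stalls = run_stage([3, 7, 7, 9], late={1, 3})
print(delivered(events), "stalls:", stalls)
```

In this sketch a late word that happens to equal the previous word raises no error, since the wire value does not change. The model deliberately ignores any behavior of the real FIFO beyond the one-cycle stall and the flagged corrected copy.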
In the network with congestion, the data from the preced-

The amount of timing delay that is tolerated by the T-error design depends on the phase shift between the clocks of the main and the delayed flip-flops. This shift should be as large as possible, so that the delayed flip-flop is guaranteed to sample the right data and to provide correct system operation. However, the maximum shift is constrained by internal repeater delays (the error detection logic must operate between a ckd edge and the following ck edge). Detailed timing analysis and SPICE simulations (for a link size of 32 bits) showed that clock ckd can be delayed by 53.3% of the clock period with respect to ck. In this work, we assume that a maximum delay of 50% of the clock period is tolerable with a T-error enabled system. Thus, the delayed clock ckd is just the inverted value of the main clock, and delay chains are not needed to generate it. At the same time, the maximum delay that is tolerated on a wire is 150% of the clock period, providing ample margin for timing error correction. We refer the interested reader to [22] for transistor-level implementation details, timing analysis and SPICE simulation results of the T-error scheme.
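The timing windows implied by these figures can be summarized in a few lines. The sketch below is an idealized illustration: the function name and the category labels are hypothetical, and setup and hold times, clock skew and the repeater delays mentioned above are all ignored.

```python
def classify_arrival(delay, period, shift_fraction=0.5):
    """Classify a wire delay against the ck and ckd sampling edges.

    delay          -- wire delay, in the same unit as period
    period         -- clock period
    shift_fraction -- phase shift of ckd relative to ck, as a fraction of
                      the period (0.5 when ckd is the inverted clock)
    """
    if delay <= period:
        return "on_time"        # the main flip-flop samples correct data
    if delay <= period * (1.0 + shift_fraction):
        return "recovered"      # only the delayed flip-flop samples correctly
    return "unrecoverable"      # beyond the 150% bound for shift_fraction=0.5


# Hypothetical delays, expressed in units of one clock period.
for delay in (0.9, 1.3, 1.6):
    print(delay, classify_arrival(delay, period=1.0))
```

With the default shift of 50%, any delay up to 150% of the clock period falls into the first two categories, which matches the tolerance stated above.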
Experimental Results

Comparisons with Retransmission Scheme
We now compare the performance of systems that use traditional retransmission mechanisms (we assume switch-to-switch retransmission) for handling timing errors against the T-error scheme. To evaluate the designs, we define a new metric: Potential Error-Rate (PER). The PER represents the percentage chance that a data word reaching a FIFO incurs one or more timing errors, if the data is sampled directly on the ck rising edge. Note that in T-error, in most scenarios, the data is sampled first by the delayed flip-flop and only later by the main flip-flop; for example, this occurs under congestion, or whenever a string of incoming data blocks follows a faulty transmission. This automatically avoids any potential errors. Therefore, even with a PER of 100%, the actual errors happening at the T-error

We first present an experiment where a serial interconnect (of 1-bit data width) is used for the links. In this experiment, the data bits are sent between two switches and links with two pipeline stages each. This models the nearest neighbor traffic pattern, which is typical for SoCs [21]. In Figure 6, the latency for data transmission for two different PER values (1% and 5%) is analytically plotted for various data sizes for the T-error based design and a traditional design where errors are corrected by retransmission. These PER values are reasonable estimates for the error rates in future interconnects, and are obtained from [21].
The T-error scheme incurs negligible latency penalties, and the plots for the 1% and 5% PER values for the scheme overlap. This is due to the above-mentioned fact that most of the faults after the first error are transparently corrected by the design. For a chosen PER value, as the size of the data to be transferred increases, there are significant latency savings in the T-error system when compared to the traditional scheme of retransmission. Moreover, as the error rate increases, the latency savings of the T-error based system become much larger. For a data size of 1000 bits and a potential error rate of 5%, there is a 35% reduction in latency in the T-error based system when compared to the retransmission scheme.
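The qualitative trend described above can be illustrated with a simple expected-latency model. This is not the analytical model behind Figure 6, which is not reproduced here, and it makes no attempt to match the 35% figure. It is a sketch under loudly stated assumptions: words are sent back to back over a link of `depth` pipeline stages, every faulty word in the retransmission scheme costs a fixed, hypothetical `penalty` in cycles and may fail again with the same probability, and the T-error scheme pays at most one extra cycle per stream because faults after the first error are corrected transparently.

```python
def retransmission_latency(n_words, per, penalty, depth=2):
    """Expected cycles when every faulty word is resent (assumed model)."""
    if not 0.0 <= per < 1.0:
        raise ValueError("per must be in [0, 1)")
    # A word needs 1 / (1 - per) transmissions on average, so the expected
    # number of extra transmissions per word is per / (1 - per).
    expected_resends = n_words * per / (1.0 - per)
    return depth + n_words + expected_resends * penalty


def t_error_latency(n_words, per, depth=2):
    """Expected cycles for T-error: one stall if any error occurs (assumed)."""
    if not 0.0 <= per <= 1.0:
        raise ValueError("per must be in [0, 1]")
    p_any_error = 1.0 - (1.0 - per) ** n_words
    return depth + n_words + p_any_error


# Hypothetical penalty of 4 cycles per retransmission.
for per in (0.01, 0.05):
    for n in (10, 100, 1000):
        retx = retransmission_latency(n, per, penalty=4)
        terr = t_error_latency(n, per)
        print(f"PER={per:.2f} n={n:5d} retx={retx:8.1f} t-error={terr:8.1f}")
```

Under these assumptions the T-error latency is almost independent of the PER, while the retransmission latency grows with both the PER and the data size, which is the same trend reported above. The absolute numbers depend entirely on the assumed penalty.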
In the second experiment, we use a full NoC platform. The simulation platform consists of cycle-accurate SystemC models of the T-error designs for the switches, links and network interfaces, incorporated within the xpipes architecture. We use the MPARM simulation environment [23], which allows several interconnect structures (such as AMBA, STBus, xpipes) to be used to connect processor/memory cores, and has support for a variety of benchmark applications. A functional SystemC simulation is carried out on an image processing benchmark (referred to as MAT2). For illustrative purposes, we assume that the retransmission mechanism is capable of detecting all timing errors; in reality, all the data bits can have timing errors and such a scheme may not even be applicable. The percentage increase in application runtime for different PER values for the retransmission scheme when compared to the T-error scheme is presented in Figure 7. Please note that the application execution time includes the computational time as well. For a PER value of 5%, the retransmission scheme incurs an 8% increase in total application runtime when compared to the T-error scheme.

Table 1 shows the area overhead for the T-error scheme, for a 32-bit 5 x 5 mesh NoC. The base NoC area is the sum of the areas of the switches, links and network interfaces without the T-error design changes. As seen from the table, the T-error scheme incurs a negligible increase in the area of the NoC.
Conclusions
Robust design methods are needed to cope with the increasing timing uncertainty on the interconnects. It is important to achieve a robust design with minimum performance and area overhead. In this work, we present a timing error tolerant design methodology, T-error, that provides dynamic recovery from timing errors. We implement the T-error scheme using cycle-accurate RTL models and present the simulation and synthesis results for the design. Experimental comparisons with state-of-the-art error recovery mechanisms show that T-error provides large performance improvements with minimum area overhead. As future work, we plan to apply the timing error tolerant scheme to the computation architecture.
