Abstract-Achieving reliable operation under the influence of deep-submicrometer noise sources, including crosstalk noise at low-voltage operation, is a major challenge for network-on-chip links. In this paper, we propose a coding scheme that simultaneously addresses crosstalk effects on signal delay and detects up to seven random errors through wire duplication and simple parity checks calculated over the rows and columns of the two-dimensional data. This high error detection capability enables a reduction of the operating voltage on the wires, leading to energy savings. The results show that the proposed scheme reduces the energy consumption by up to 53% compared to other schemes at iso-reliability performance, despite the increase in the number of overhead wires. In addition, it has a small penalty on the network performance, represented by the average latency, and a codec area overhead comparable to other schemes.
bilities may result in intermittent faults manifesting as bursts of errors that repeatedly occur in the same locations [3]. Transient faults can be caused by several noise sources such as crosstalk noise, power supply noise, alpha particles, electromagnetic interference (EMI) and transistor variability [4], [5]. Faults affecting the links result in incorrect interpretation of the data and/or control signals and are usually addressed using error detection/correction coding such as simple parity or Hamming codes [6], [7].
Crosstalk noise has become one of the most challenging problems in timing closure and power consumption of modern VLSI circuits [5]. Short wire spacing and the high aspect ratio of the interconnects in deep-submicrometer processes increase the coupling capacitance, which in turn affects the integrity and timing of signals and contributes to the increase of interconnect power consumption [3], [5], [8]. For parallel wires or a bus, the worst-case value of the crosstalk induced bus delay (CIBD) without crosstalk reduction is $(1+4\lambda)\tau_0$, where $\lambda$ is the ratio of wire coupling capacitance to bulk capacitance and $\tau_0$ is the crosstalk-free wire delay [8], [9]. Several techniques were proposed to address crosstalk effects, such as wire spacing, shielding, duplication and crosstalk avoidance codes (CAC) [8]. These techniques either reduce the coupling capacitance, as in the case of wire spacing, or prevent opposite-direction switching in adjacent wires to reduce the effective coupling capacitance, as in shielding, duplication and CAC. Shielding or duplication techniques can reduce CIBD to $(1+2\lambda)\tau_0$ and some CACs can reduce CIBD to $(1+2\lambda)\tau_0$ or $(1+3\lambda)\tau_0$ [8], [9].
Joint codes to address both fault tolerance and power consumption of the bus have been proposed in [10]-[16]. The works in [10], [11] proposed to combine a low power code (LPC) with an error detection/error correction (ED/EC) code. To reduce the power, the LPC reduces the bus transition activity while the ED/EC code allows the bus to operate at a lower voltage. Such joint codes, however, ignore the crosstalk effects on signal delay. To address crosstalk and other transient fault noise sources simultaneously, researchers have proposed joint error correction/detection codes with crosstalk avoidance codes [11]-[17]. It is worth noting that CAC reduces the bus power consumption through the reduction of adjacent wire switching activity.
The work on the joint code in [17] pointed out that there is a diminishing return in power reduction when the error correction capability exceeds four errors. This observation could be valid for systems with low noise deviations. For systems with higher noise deviations, it is possible to achieve larger power savings by adopting schemes with higher error detection/correction capabilities.
In this paper, we propose a new joint coding scheme that can detect up to seven errors and simultaneously reduce the crosstalk effect through duplication, as opposed to previous works that detect at most four errors. The encoding scheme is based on parity codes generated by arranging the data into a two-dimensional array, calculating the parity of each row and column independently, and finally duplicating all the bits. To achieve high detection capability, the decoder compares the two copies of each bit in addition to checking the parities.
The rest of the paper is organized as follows. Section II presents related works that combine crosstalk avoidance codes with error detection/correction codes. Section III describes the proposed coding scheme. Sections IV and V derive the reliability assessment and the energy consumption of the different schemes respectively. Section VI discusses the experimental results addressing the reliability, performance, energy and power consumption of the scheme as compared to similar works. Finally, Section VII concludes the paper.
II. RELATED WORKS
Several error detection/correction schemes for NoC environments were proposed in [6], [7], [18]-[20]. Cyclic Redundancy Check (CRC), simple parity and Hamming codes were analyzed for the NoC environment in [6], [7]. The use of orthogonal Latin square codes was proposed in [18] to provide up to four error correction capability. In [20], the authors proposed the usage of multiple groups of Hamming codes with configurable interleaving to provide higher error detection and correction. The drawback of this scheme is that it is not able to detect more than two random errors in each group. Product codes based on Hamming codes were proposed in [19] to achieve multi-bit error correction. The authors proposed the use of a type-II Hybrid Automatic Repeat Request (HARQ) scheme where only the row codes are sent in the first transmission and the rest of the check bits are sent if uncorrectable errors are detected. Despite its capability of detecting/correcting multiple errors, it cannot detect/correct more than two errors in any row. Moreover, the scheme requires a three-stage pipelined decoding process, which is less attractive to latency-sensitive NoC applications. It is worth mentioning that these schemes address transient fault noise sources but exclude the effect of crosstalk noise on signal delay.
The works in [8], [21]-[32] addressed the timing effects of crosstalk; however, they did not provide fault tolerance against other sources of transient fault noise. Some techniques include shielding [26], [27], repeater insertion [30], [31] and skewed transitions [25], [28], [29]. Coding-based techniques were proposed as in [8], where three representative CACs were proposed, namely forbidden overlap condition, forbidden transition condition and forbidden pattern condition codes. The authors in [21] proposed overlapping codes for the forbidden pattern and forbidden transition codes. A combination of shield patterns to reduce worst-case signal delay and a bus invert code for further power reduction was proposed in [24]. A two-dimensional CAC that considers both spatial and temporal domains of the code was proposed in [32]. Some works base their code on numeral systems, as in [22] and [23].
To simultaneously achieve crosstalk-based timing reduction and single error correction capability, the Duplicate-Add-Parity (DAP) code was proposed, which duplicates the data bits and adds one parity bit [11]. DAPX, a modification of DAP, also duplicates the parity bit to reduce the crosstalk delay effect on this particular bit [11]. Other schemes which have similar correction capability and address crosstalk through the duplication approach are the Dual Rail (DR) [14], Boundary Shift Code (BSC) [15] and Modified Dual Rail code (MDR) [13]. These schemes have simple codecs but are limited to the correction of single errors. Triplication coding with green bus encoding alongside voltage scaling was proposed in [33]. However, because it corrects only a single error in a group of three bits, only a slight decrease in the voltage swing can be achieved compared to DAP. Note that in ultra deep-submicrometer (UDSM) technology, multiple errors are expected to occur and thus single error correction capability will not be sufficient [12], [20].
Crosstalk aware multi-bit error detection/correction codes were proposed in [12] , [17] , [34] . The Hamming product code in [19] was extended to include the crosstalk delay effect reduction through skewed adjacent wires transitions [34] . In [12] , double error correction was achieved through joint crosstalk avoidance and double error correction code (CADEC) scheme. The scheme proposed to encode the data using Hamming single error correction code and then encode the resulting check bits and the data bits using DAP approach. This idea was extended to triple error correction in joint crosstalk avoidance and triple error correction (JTEC) code and extended to quadruple error detection in JTEC with simultaneous quadruple error detection (JTEC-SQED) code [17] . The increased error detection/correction allows the links to operate at lower voltage swing to reduce the power consumption while achieving the required reliability.
In this paper, a new coding scheme is proposed that combines two dimensional parities with duplication to jointly provide high error detection and crosstalk avoidance capability. This joint scheme achieves up to seven random errors detection in which the high detection capability is not limited to address burst errors as in [33] , [34] . This allows for further reduction in the voltage swing with respect to previous works, leading to higher energy savings as the experimental results section shows. A low complexity high-speed un-pipelined decoder has been implemented with the ARQ error control policy, attractive for low latency NoC applications. The scheme has small performance penalty represented by the average latency and comparable codec area overhead to other schemes.
III. PROPOSED CODING SCHEME
In general, it is possible to achieve higher error detection than correction with the same amount of redundancy, since a block code with minimum Hamming distance $d_{min}$ can detect up to $d_{min}-1$ errors while it can correct only $\lfloor (d_{min}-1)/2 \rfloor$ errors [35]. However, the drawbacks of error detection schemes are the communication latency and energy consumption imposed by the retransmissions. Despite these disadvantages, we demonstrate that with precise selection of the voltage swings of the links, the performance and energy consumption of the NoC will not be highly impacted by the retransmissions.
A. Duplicated Two-Dimensional Parities (DTDP) Scheme
The proposed coding scheme is designed based on two principles: wire duplication to reduce the crosstalk effect on signal delay and two-dimensional parities to provide error detection. By arranging the data in a two-dimensional matrix and calculating the parity for each row and column, the minimum Hamming distance $d_{min}$ equals 4. This configuration makes it possible to correct one error and detect up to two errors, or to detect 3 errors without any error correction capability. Through wire duplication, the scheme achieves a twofold objective. First, the crosstalk effect on signal delay manifested in CIBD can be reduced through the reduction of effective coupling capacitance. This reduction is achieved because the switching of neighboring aggressors in the opposite direction to a victim in the middle is inhibited. Second, the duplication doubles $d_{min}$ to 8, as shown in Appendix A, which leads to a maximum of seven random errors detection (7ED) capability.
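The distance arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `capabilities` is ours and does not come from the paper.

```python
def capabilities(d_min: int) -> dict:
    """Error-handling options of a block code with minimum Hamming distance d_min."""
    return {
        "detect_only": d_min - 1,          # pure detection, no correction
        "correct_only": (d_min - 1) // 2,  # pure correction
    }


two_d_parity = capabilities(4)   # two-dimensional parities alone
duplicated = capabilities(2 * 4)  # after duplicating every bit

print("2-D parity:", two_d_parity)
print("duplicated 2-D parity:", duplicated)
```

With $d_{min}=4$ the detect-only figure is 3, and doubling the distance to 8 gives the 7ED capability quoted in the text.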
Despite the increase in the number of wires, this paper shows that the energy saving through the wire supply voltage reduction exceeds the energy consumed by the additional wires under iso-reliability performance.
B. DTDP-7ED Encoder
Fig. 1(a) shows the encoding process, where the data bits are arranged in $k_1$ rows and $k_2$ columns and then parities are calculated for each row and column respectively. With $d_{i,j}$ denoting the data bit in row $i$ and column $j$, row parity can be defined as $r_i = d_{i,0} \oplus d_{i,1} \oplus \cdots \oplus d_{i,k_2-1}$ for $i = 0$ to $k_1-1$, while column parity can be defined as $c_j = d_{0,j} \oplus d_{1,j} \oplus \cdots \oplus d_{k_1-1,j}$ for $j = 0$ to $k_2-1$. In addition, the "check on checks" parity, $p_{cc}$, is defined as $p_{cc} = r_0 \oplus r_1 \oplus \cdots \oplus r_{k_1-1}$. The check on checks parity extends the detection to include the errors affecting the row and column check bits. The resultant codeword is duplicated before being sent over the link. The duplication produces $2(k_2+1)$ and $2(k_1+1)$ codeword bits per row and column respectively, as Fig. 1(a) shows. Fig. 1(b) shows the encoder implementation. It can be noticed that the critical path delay is the $p_{cc}$ generation, as it takes the row parities as inputs to its XOR chain. Since the proposed coding scheme is based on the ARQ error control policy, it requires a retransmission buffer and a retransmission request signal coming from the decoder. For simplicity, they are not shown in this figure. Note that upon receiving a retransmission request, the encoder encodes the data in the retransmission buffer instead of the new data.
The selection of the number of rows, $k_1$, and columns, $k_2$, of the data arrangement matrix affects the codeword size before duplication, $(k_1+1)(k_2+1)$, where the $k_1 \times k_2$ matrix holds the $k$ data bits. Table I shows this codeword size for different pairs of $k_1$ and $k_2$ values for data bit widths ranging from 16 to 128 bits. It can be seen that the codeword size is minimum when $k_1$ and $k_2$ are as close to each other as possible. To put the final codeword size of the proposed DTDP-7ED scheme in perspective against the other similar coding schemes in [11], [17], Fig. 2 shows the normalized final codeword size with respect to the data bits or flit size. It should be noted that DTDP-7ED and JTEC-SQED duplicate all the bits, DAP duplicates all the bits except the parity, while JTEC excludes one parity bit from the duplication. For $k = 32$ bits, the final codeword size for DTDP-7ED is 90 bits while JTEC-SQED requires 78 bits, resulting in normalized final codeword sizes of 2.81 and 2.44 respectively. The DTDP-7ED scheme has the highest codeword size while the DAP scheme has the lowest. It should be noted that the final codeword size for all the schemes is more than twice the number of data bits due to wire duplication. It can also be noticed that as the flit size increases, the differences in normalized final codeword size between the schemes decrease, because the number of check bits grows more slowly than the flit size. This demonstrates that the overhead energy consumed by the additional wires in the DTDP-7ED scheme will be comparable to the other schemes at larger flit sizes.
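The encoding steps and the codeword-size arithmetic described in this subsection can be sketched in Python as follows. This is a behavioral illustration only, not the authors' hardware encoder: the function names, the row-major data layout and the representation of each duplicated bit as a two-element list are our assumptions, and the wire ordering (interleaving) on the link is not modeled.

```python
from typing import List


def dtdp_encode(data: List[int], k1: int, k2: int) -> List[List[List[int]]]:
    """Return a (k1+1) x (k2+1) matrix whose entries are [bit, copy] pairs."""
    if len(data) != k1 * k2:
        raise ValueError("data length must equal k1 * k2")
    rows = [data[i * k2:(i + 1) * k2] for i in range(k1)]
    # Row parities: XOR of the data bits in each row.
    row_par = [sum(r) % 2 for r in rows]
    # Column parities: XOR of the data bits in each column.
    col_par = [sum(rows[i][j] for i in range(k1)) % 2 for j in range(k2)]
    # Check on checks: XOR of the row parities.
    cc = sum(row_par) % 2
    matrix = [rows[i] + [row_par[i]] for i in range(k1)] + [col_par + [cc]]
    # Duplicate every bit of the two-dimensional codeword.
    return [[[b, b] for b in row] for row in matrix]


def final_codeword_size(k1: int, k2: int) -> int:
    """Number of wires after duplication: 2 * (k1 + 1) * (k2 + 1)."""
    return 2 * (k1 + 1) * (k2 + 1)


if __name__ == "__main__":
    k = 32
    for k1, k2 in [(2, 16), (4, 8), (8, 4), (16, 2)]:
        size = final_codeword_size(k1, k2)
        print(f"{k1:2d} x {k2:2d}: {size} wires, normalized {size / k:.2f}")
```

For 32 data bits the 4 x 8 arrangement gives 90 wires, which matches the figure quoted above; the more elongated 2 x 16 arrangement needs more wires.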
C. DTDP-7ED Decoder
As Fig. 3(a) shows, the received codeword bits are arranged in a two-dimensional matrix similar to the two-dimensional matrix after duplication in the encoding process. The decoding is applied to each row and column by calculating the parities to check for errors and generating a retransmission request signal for each individual row and column except the last row. The final retransmission request is set when at least one row or one column retransmission request signal is set. The final retransmission request signal is also used to indicate to the next stage in the router pipeline the validity of the decoder output data.
The decoder implementation in Fig. 3(b) shows that the row retransmission request signals and the output data bits are generated by the row decoder blocks, while the column decoder blocks generate the column retransmission request signals. Note that both row and column decoder blocks have similar implementations; therefore, to illustrate the decoding mechanism, row decoder block 0 is chosen. As Fig. 3(c) shows, the codeword bits are arranged in vertical groups, one for each pair of duplicated bits, and two horizontal groups. Each vertical group contains two bits: the even index bit and the odd index bit. One horizontal group contains all even index bits while the other contains all odd index bits. The check bit for a vertical group can be computed as the XOR of its two bits, while the check bit for a horizontal group is computed as the XOR of all the bits in that group. The row decoder considers the codeword bits examined as error free if and only if the horizontal check bit and all vertical check bits are 0, and thus sets its retransmission request signal to 0. Note that any odd number of errors in a horizontal or vertical group can be detected, while an even number of errors will lead to a check bit equal to 0. The row decoding is considered to have failed when at least one error exists but all check bits are 0. The DTDP-7ED decoder considers the codeword as error free, i.e., the final retransmission request is 0, if and only if the retransmission request signals of all row decoder blocks and all column decoder blocks are 0. The DTDP-7ED decoder fails when at least one error exists and the final retransmission request is 0. In Appendix B, we show that each row and column decoder block is capable of detecting all single, double, and triple errors but fails to detect some quadruple error patterns. As a result, the DTDP-7ED scheme fails when the combination of the row and column decoder blocks fails to detect errors for some patterns of eight errors. To illustrate this, Fig. 3(d) shows two samples of 8 errors for which the erroneous codeword will be assumed error free and the decoder fails. We also provide a proof in Appendix C showing that the DTDP-7ED decoder is able to detect up to 7 random errors. It is also important to highlight that in the basic row or column decoder block, considering one horizontal group is adequate for the error detection computation. Considering both horizontal groups will not increase the error detection ability and adds hardware overhead to the design.
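The decoding rule described above can be exercised with a small behavioral model. The sketch below is ours and rests on stated assumptions: each row or column decoder block XORs the two copies of every bit (vertical groups) and checks the parity of one set of copies (one horizontal group); row blocks are instantiated for every row except the last, as the text states; and the codeword is held as a matrix of `[bit, copy]` pairs rather than in its on-link wire order. Helper names such as `group_check` and `flip` are hypothetical.

```python
import random


def encode(data, k1, k2):
    """Two-dimensional parity codeword with every bit duplicated."""
    rows = [data[i * k2:(i + 1) * k2] for i in range(k1)]
    rows = [r + [sum(r) % 2] for r in rows]             # append row parities
    rows.append([sum(col) % 2 for col in zip(*rows)])   # column parities + check on checks
    return [[[b, b] for b in row] for row in rows]


def group_check(pairs):
    """One row or column decoder block; True means 'request retransmission'."""
    vertical = [a ^ b for a, b in pairs]  # compare the two copies of each bit
    horizontal = 0
    for a, _ in pairs:                    # parity over one set of copies only
        horizontal ^= a
    return horizontal == 1 or any(vertical)


def decode(cw):
    """Final retransmission request: OR of all row and column requests."""
    row_req = [group_check(row) for row in cw[:-1]]   # every row except the last
    col_req = [group_check(col) for col in zip(*cw)]  # every column, all rows
    return any(row_req) or any(col_req)


def flip(cw, positions):
    """Return a copy of cw with the bits at (row, column, copy) inverted."""
    out = [[pair[:] for pair in row] for row in cw]
    for r, c, s in positions:
        out[r][c][s] ^= 1
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    cw = encode([rng.randint(0, 1) for _ in range(32)], 4, 8)
    cells = [(r, c, s) for r in range(5) for c in range(9) for s in range(2)]
    for t in range(1, 8):
        ok = all(decode(flip(cw, rng.sample(cells, t))) for _ in range(1000))
        print(f"{t} random errors always detected in this run: {ok}")
    # Both copies of four bits on the corners of a rectangle: 8 errors, undetected.
    rect = [(r, c, s) for r in (0, 1) for c in (0, 1) for s in (0, 1)]
    print("rectangle pattern detected:", decode(flip(cw, rect)))
```

The rectangle pattern at the end is one example of an undetected 8-error pattern: every duplicated pair still matches, and every row and column sees an even number of flipped bits.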
Spatial correlation between errors due to crosstalk effects may cause multiple adjacent wires to be affected. This leads to a higher error rate, which affects performance and energy consumption due to the retransmissions. Multi-bit adjacent errors are inherently resolved in the proposed scheme as the bits are arranged into rows and columns and then interleaved. DTDP-7ED has an interleaving distance of as shown in Fig. 1 (b). Since each column decoder fails when four errors occur, then the scheme fails in one burst of or two bursts of 4 adjacent errors. In comparison, the DAP, JTEC and JTEC-SQED schemes lack this property and as a result, DAP cannot correct multi-bit adjacent errors, JTEC can correct up to 3 and JTEC-SQED can detect up to 4 adjacent errors.
D. Incorporating CODEC in NoC Router
The incorporation of any encoder and decoder into the NoC pipelined routers affects the NoC performance due to the codec latency. One possible implementation is to place the encoder and decoder in the last and first stage of the router, namely the switch traversal (ST) and route computation (RC) respectively. This will preserve the number of pipeline stages at the cost of reduced clock frequency. The alternative implementation shown in Fig. 4 , is to place the encoder and decoder as separate stages after ST and before RC respectively. This increases the number of pipeline stages but preserves the clock frequency. The second implementation will be adopted for all the schemes considered in this work. Based on this pipeline architecture, a retransmission request is received by the sender after the round trip time, which is the time required for the flit to be transferred from sender to receiver and its acknowledgement to be received back by the sender. This is represented by the following processes: 
IV. RELIABILITY

A. Bit Error Rate
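The bit error rate (BER), $\varepsilon$, can be represented using the Gaussian noise model with zero mean, where $\sigma_N^2$ is the variance of the noise source and $V_{sw}$ is the voltage swing on the wires [5], [36], [37]. Then the BER can be given by:
$$\varepsilon = Q\left(\frac{V_{sw}}{2\sigma_N}\right) \tag{1}$$
where
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-y^2/2}\, dy \tag{2}$$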
The reduction of the voltage swing $V_{sw}$ reduces the power consumed in the links. However, at any noise level, the BER $\varepsilon$ increases with the reduction of $V_{sw}$, and the model in (1) accounts for the decrease in noise margin due to the reduced swing voltage. This increases the probability of flit errors, resulting in lower reliability. One possibility to compensate for the low reliability is through the use of error detection or correction codes. These codes require an encoder and a decoder in addition to extra wires for the check bits, which form the overhead power consumption that should be minimized. It was shown in [6], [12] that by reducing $V_{sw}$, a quadratic power saving on the links can be achieved that surpasses the overhead power consumption, resulting in overall power saving. This is particularly true in DSM where the ratio of power consumption for gates to wires is decreasing [1], [38].
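To make the trade-off between voltage swing and BER concrete, the sketch below evaluates the Gaussian model numerically. It assumes the commonly used form $\varepsilon = Q(V_{sw}/(2\sigma_N))$ with $Q$ the Gaussian tail function; the function names are ours, and the swing values in the loop are arbitrary illustration points, not operating points from the paper.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(v_swing: float, sigma_n: float) -> float:
    """BER for swing v_swing (V) under zero-mean Gaussian noise of deviation sigma_n (V)."""
    return q_function(v_swing / (2.0 * sigma_n))


if __name__ == "__main__":
    sigma_n = 0.12  # noise deviation in volts
    for v_swing in (1.0, 0.9, 0.8, 0.7):
        print(f"V_sw = {v_swing:.1f} V -> BER = {bit_error_rate(v_swing, sigma_n):.3e}")
```

Lowering the swing at a fixed noise deviation raises the BER steeply, which is why stronger error detection is needed to hold the same reliability at a lower voltage.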
B. Undetected Error Probability
The probability to receive an $n$-bit word error free is $(1-\varepsilon)^n$, where $\varepsilon$ is the BER. The probability to have at least one erroneous bit in this $n$-bit word, defined as the word error probability, can be given by:
$$P_w = 1 - (1-\varepsilon)^n \tag{3}$$
The probability to have $i$ errors in an $n$-bit word is given by [5], [35]:
$$P(i) = \binom{n}{i}\, \varepsilon^i (1-\varepsilon)^{n-i} \tag{4}$$
where
$$\binom{n}{i} = \frac{n!}{i!\,(n-i)!} \tag{5}$$
The undetected error probability, $P_{ud}$, is the probability that a flit has errors that cannot be detected by the error detection or correction scheme. Note that each scheme has a different detection capability and thus a different undetected error probability model. For the case of uncoded flits, $P_{ud}$ is the same as the word error probability in (3). Table II shows the undetected error probability analytical model of the different schemes considered alongside the proposed scheme. These schemes are selected because they address both crosstalk effects on signal delay and transient fault noise sources. The undetected error probability models approximated at small $\varepsilon$ for the DAP, JTEC and JTEC-SQED schemes are reproduced here from [11], [17]; they are expressed in terms of the data bits size and the codeword size after Hamming encoding. For the proposed scheme, the model was derived using a similar approach as in [17], i.e., based on the maximum number of errors that the scheme can detect. Given that DTDP-7ED is able to detect up to seven errors, the upper bound on $P_{ud}$ for a codeword of $n$ bits after duplication is:
$$P_{ud} \le \sum_{i=8}^{n} P(i) \tag{6}$$
Using (4) and approximating at small $\varepsilon$, the first term in (6) dominates and Table II shows the final model. It is clear that a larger flit size results in a higher $P_{ud}$ under the same $\varepsilon$, as more bits are susceptible to errors.
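The claim that the first term of the bound dominates at small BER can be checked numerically. The sketch below assumes the binomial error model for independent bit errors and takes the bound as the probability of eight or more errors in the codeword; the function names and the BER value used in the example are ours.

```python
import math


def p_errors(i: int, n: int, eps: float) -> float:
    """Probability of exactly i bit errors in an n-bit word (binomial model)."""
    return math.comb(n, i) * eps**i * (1.0 - eps) ** (n - i)


def word_error_probability(n: int, eps: float) -> float:
    """Probability of at least one bit error in an n-bit word."""
    return 1.0 - (1.0 - eps) ** n


def p_undetected_bound(n: int, eps: float, max_detect: int = 7) -> float:
    """Upper bound: every pattern with more than max_detect errors is counted as undetected."""
    return sum(p_errors(i, n, eps) for i in range(max_detect + 1, n + 1))


def p_undetected_first_term(n: int, eps: float, max_detect: int = 7) -> float:
    """Small-eps approximation that keeps only the leading term of the bound."""
    i = max_detect + 1
    return math.comb(n, i) * eps**i


if __name__ == "__main__":
    n, eps = 90, 1e-4  # 90 wires (32-bit DTDP-7ED flit); eps chosen for illustration
    bound = p_undetected_bound(n, eps)
    approx = p_undetected_first_term(n, eps)
    print(f"bound = {bound:.3e}, first term = {approx:.3e}, ratio = {bound / approx:.4f}")
```

For this example the two values agree to within about one percent, so the leading term is a reasonable stand-in for the full sum.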
C. Retransmission Probability
Coding schemes with error detection and/or correction capabilities can be further categorized into one of three error control policies: automatic repeat request (ARQ), forward error correction (FEC), or hybrid ARQ (HARQ). The number of retransmissions for each coding scheme differs according to the adopted error control policy, causing a different effect on the communication latency and energy. A retransmission is requested when the decoder detects an error that cannot be corrected; therefore the flit retransmission probability, $P_{rt}$, of a coding scheme can be expressed by [35]:
$$P_{rt} = P_{uc} - P_{ud} \tag{7}$$
where $P_{uc}$ is the probability that a flit has error(s) that cannot be corrected by the scheme's decoder and $P_{ud}$ is the probability that a flit has undetected errors. FEC-based coding schemes, like DAP and JTEC, have no retransmission capability, therefore $P_{rt} = 0$. The JTEC-SQED scheme can correct up to 3 errors, retransmits when 4 errors are detected and is unable to detect 5 or more errors. Therefore, with $n$ denoting the number of codeword bits of the scheme and $\varepsilon$ the BER, $P_{rt}$ is given by:
$$P_{rt} = \binom{n}{4}\, \varepsilon^4 (1-\varepsilon)^{n-4} \tag{8}$$
Under the small $\varepsilon$ assumption, (8) can be approximated to:
$$P_{rt} \approx \binom{n}{4}\, \varepsilon^4 \tag{9}$$
DTDP-7ED can detect up to seven errors and all these errors are uncorrectable since it is an ARQ scheme; therefore $P_{rt}$ is given by:
$$P_{rt} = \sum_{i=1}^{7} \binom{n}{i}\, \varepsilon^i (1-\varepsilon)^{n-i} \tag{10}$$
Under the small $\varepsilon$ assumption, (10) can be approximated to:
$$P_{rt} \approx n\,\varepsilon \tag{11}$$
The analytical models for the retransmission probability of all the schemes are summarized in Table III. It can be seen that the proposed DTDP-7ED, which is based on ARQ, has a higher retransmission probability as it is proportional to the BER $\varepsilon$, while that of JTEC-SQED, which is based on HARQ, is proportional to $\varepsilon^4$. To validate the derived retransmission probability estimations in (9) and (11), we simulated 32-bit flits encoded using the DTDP-7ED and JTEC-SQED schemes respectively. Random errors were then injected into the encoded flits at a fixed BER and these flits were then decoded using the corresponding decoder. The number of retransmissions was captured and the retransmission probability was calculated for each case. Table III shows the simulated retransmission probability and the value estimated from the derived analytical model. The difference between the simulated and estimated values is small, with errors of 4.40% and 4.38% for the two cases respectively. It can be seen that at this BER, the retransmission probability of the proposed scheme is relatively high compared to JTEC-SQED, slightly impacting the network throughput.
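A validation experiment of the kind described above can be imitated with a short Monte Carlo run. The sketch below is not the authors' simulator: it assumes independent bit errors, treats every flit with one to seven errors as a retransmission (which the 7ED property guarantees), conservatively counts flits with eight or more errors as not retransmitted, and uses a BER and flit count that we picked for illustration.

```python
import math
import random


def p_rt_dtdp(n_wires: int, eps: float) -> float:
    """Probability of 1 to 7 bit errors in an n_wires-bit codeword (binomial model)."""
    return sum(
        math.comb(n_wires, i) * eps**i * (1.0 - eps) ** (n_wires - i)
        for i in range(1, 8)
    )


def p_rt_dtdp_approx(n_wires: int, eps: float) -> float:
    """Small-eps approximation: retransmission probability proportional to eps."""
    return n_wires * eps


def simulate_p_rt(n_wires: int, eps: float, flits: int, seed: int = 1) -> float:
    """Inject independent bit errors and count flits that trigger a retransmission."""
    rng = random.Random(seed)
    retransmissions = 0
    for _ in range(flits):
        errors = sum(rng.random() < eps for _ in range(n_wires))
        if 1 <= errors <= 7:
            retransmissions += 1
    return retransmissions / flits


if __name__ == "__main__":
    n_wires, eps = 90, 1e-3  # illustrative values, not taken from the paper
    exact = p_rt_dtdp(n_wires, eps)
    approx = p_rt_dtdp_approx(n_wires, eps)
    simulated = simulate_p_rt(n_wires, eps, flits=20000)
    print(f"exact = {exact:.4f}, approx = {approx:.4f}, simulated = {simulated:.4f}")
```

The linear approximation slightly overestimates the exact value because it ignores the $(1-\varepsilon)^{n-i}$ factors, and the gap widens as the BER grows.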
Most NoC designs use the Go-Back-$N$ retransmission policy [39], [40], where $N$ represents the window size, i.e., the number of flits the sender continues sending without receiving the acknowledgement from the receiver. In the NoC environment, $N$ is the same as the round trip time [19]. In this policy, when an error is detected in a flit, a retransmission is requested for that flit. The sender will continue sending until it receives the retransmission request; during this time it would have sent $N$ flits including the erroneous one. When a retransmission request is received, the sender retransmits the requested flit and the subsequent flits which have been transmitted before. The average number of transmissions, $T_{avg}$, including the retransmissions required to successfully transmit a flit, is given by [35]:
$$T_{avg} = (1-P_{rt}) + (N+1)\,P_{rt}(1-P_{rt}) + (2N+1)\,P_{rt}^2(1-P_{rt}) + \cdots \tag{12}$$
where $P_{rt}$ is the flit retransmission probability. The first term in the equation, $(1-P_{rt})$, denotes the probability of a flit being successfully accepted in the first transmission attempt without requiring retransmission. $P_{rt}(1-P_{rt})$ is the probability that a flit is accepted in the first retransmission. In this case, $N+1$ flits will be transmitted, including the erroneous flit in the first attempt, the $N-1$ subsequent flits, and the second attempt of the erroneous flit, which is successful. $P_{rt}^2(1-P_{rt})$ is the probability of a flit being accepted after two retransmissions, so the total number of transmissions required is $2N+1$. Using geometric series reduction, (12) can be simplified to:
$$T_{avg} = 1 + \frac{N\,P_{rt}}{1-P_{rt}} \tag{13}$$
The average throughput, Th, expressed in flits/cycle, is the reciprocal of $T_{avg}$ in (13) [20]. The average throughput decreases with increased retransmission probability and with increased round trip time. Fig. 5 shows the average throughput as a function of the noise deviation, $\sigma_N$, for the DTDP-7ED and JTEC-SQED schemes, representing the ARQ and HARQ error control policies respectively. It can be seen that Th for both schemes is not affected for noise deviations less than 0.11 V. Th starts to degrade for the DTDP-7ED scheme from that point onwards, with the 64-bit case showing higher degradation than the 32-bit case. This is expected since, for the same noise level and voltage swing, the wider bus is more susceptible to errors. Note that Th for JTEC-SQED, for both 32-bit and 64-bit flits, is not affected within this noise deviation range due to its ability to correct three errors.
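The geometric series reduction for the Go-Back-N policy can be verified numerically. The sketch below assumes the textbook Go-Back-N result, in which a flit accepted after $i$ retransmissions costs $iN+1$ transmissions, so that the average is $1 + N P_{rt}/(1-P_{rt})$; the function names and the example probabilities are ours.

```python
def average_transmissions(p_rt: float, window: int) -> float:
    """Closed form of the Go-Back-N average number of transmissions per accepted flit."""
    return 1.0 + window * p_rt / (1.0 - p_rt)


def average_transmissions_series(p_rt: float, window: int, terms: int = 200) -> float:
    """Truncated series: (i*N + 1) transmissions when i retransmissions are needed."""
    return sum((i * window + 1) * p_rt**i * (1.0 - p_rt) for i in range(terms))


def throughput(p_rt: float, window: int) -> float:
    """Average throughput in flits/cycle, the reciprocal of the average transmissions."""
    return 1.0 / average_transmissions(p_rt, window)


if __name__ == "__main__":
    window = 4  # round trip time in cycles
    for p_rt in (0.0, 0.001, 0.01, 0.05, 0.1):
        closed = average_transmissions(p_rt, window)
        series = average_transmissions_series(p_rt, window)
        print(f"P_rt = {p_rt:<5}: T_avg = {closed:.4f} (series {series:.4f}), "
              f"Th = {throughput(p_rt, window):.4f}")
```

The closed form and the truncated series agree, and the throughput falls as either the retransmission probability or the window size grows.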
D. Residual Flit Error Probability
The residual flit error probability, $P_{res}$, can be used as a metric to measure NoC reliability or the mean time to failure (MTTF) [7]. It can be defined as the probability of accepting a flit that has error(s) which could not be detected. This can happen in the first transmission or after the flit has been retransmitted one or more times, and it can be given by [20]:
$$P_{res} = P_{ud} + P_{rt}\,P_{ud} + P_{rt}^2\,P_{ud} + \cdots \tag{14}$$
where $P_{ud}$ is the undetected error probability and $P_{rt}$ is the retransmission probability. Using geometric series reduction, (14) can be simplified to:
$$P_{res} = \frac{P_{ud}}{1-P_{rt}} \tag{15}$$
For small values of $P_{rt}$, $P_{res}$ can be approximated as $P_{ud}$. For instance, given $P_{rt} = 0.01$, $P_{res}$ differs by 1.01% as compared to $P_{ud}$. Note that this retransmission probability value is considerably high and is applicable to ARQ-based schemes under high noise levels.
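The 1.01% figure quoted above follows directly from the geometric series. The short sketch below reproduces it, assuming the residual flit error probability is the undetected error probability divided by one minus the retransmission probability; the function names and the sample undetected error probability are ours.

```python
def residual_flit_error(p_ud: float, p_rt: float) -> float:
    """Closed form: undetected errors accepted in any transmission attempt."""
    return p_ud / (1.0 - p_rt)


def residual_flit_error_series(p_ud: float, p_rt: float, terms: int = 100) -> float:
    """Truncated series: undetected error after 0, 1, 2, ... retransmissions."""
    return sum(p_ud * p_rt**i for i in range(terms))


if __name__ == "__main__":
    p_ud, p_rt = 1e-20, 0.01  # p_ud is an arbitrary illustration value
    p_res = residual_flit_error(p_ud, p_rt)
    difference = (p_res / p_ud - 1.0) * 100.0
    print(f"P_res exceeds P_ud by {difference:.2f}%")
```

Because the correction factor depends only on the retransmission probability, the same percentage applies whatever the undetected error probability is.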
V. POWER AND ENERGY CONSUMPTION
With the assumption that the same router architecture is implemented for all the coding schemes, the difference in power consumption comes from the encoder, decoder and links. Thus the average power, $P_{avg}$, can be given by:
$$P_{avg} = P_{enc} + P_{dec} + P_{link} \tag{16}$$
Average link power is given by [41]:
$$P_{link} = \left(\sum_{i=1}^{N_w} \alpha_{s,i}\,C_{s,i} + \sum_{i=1}^{N_w-1} \alpha_{c,i}\,C_{c,i}\right) V_{dd}^2\, f \tag{17}$$
where $N_w$ is the number of wires in the link, $C_{s,i}$ and $C_{c,i}$ are the self capacitance of wire $i$ and the coupling capacitance between wires $i$ and $i+1$ respectively, $\alpha_{s,i}$ and $\alpha_{c,i}$ are the wire self-transition and coupling transition activity factors respectively, $V_{dd}$ is the supply voltage and $f$ the operating frequency. The first term in the equation represents the link self-switching power consumption, while the second term represents the link power consumption due to the coupling capacitance. Taking the average switching activity, the two terms lead to the self-switching power in (18) and the coupling power in (19) [42].
For a link with the duplication technique applied, the coupling effect between the duplicated wires is eliminated, thus the link effective coupling capacitance is reduced from $(N_w-1)C_c$ to $(N_w/2-1)C_c$, where $C_c$ is the coupling capacitance between two adjacent wires. This link effective coupling capacitance can be generalized to $(\lceil N_w/2 \rceil - 1)C_c$ to include the DAP and JTEC schemes, where the parity bit is not duplicated. The link power consumption due to coupling capacitance is then given by (20). Note that (18) and (20) are general for any coding scheme based on the duplication technique with up to one unduplicated bit, under the assumption that $N_w$ is the number of wires after duplication.
The average energy consumed by the encoder, decoder and link to transfer a flit, $E_{flit}$, is given by multiplying the average power, $P_{avg}$, by the clock period, $T_{clk}$:
$$E_{flit} = P_{avg}\, T_{clk} \tag{21}$$
However, this flit may have some errors that cannot be corrected, in which case a retransmission is required. A better measure of the energy is therefore the energy required to successfully transfer a flit to the receiver. The average energy to successfully transfer a flit, $E_{succ}$, can be given by the multiplication of $E_{flit}$ with the average number of transmissions, $T_{avg}$, in (13):
$$E_{succ} = E_{flit}\, T_{avg} \tag{22}$$
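Since some flits may arrive with undetected errors, it is more useful to calculate the energy for a successful error-free flit (or useful flit), $E_{useful}$, which is used in the evaluation section. $E_{useful}$ can be computed by incorporating the residual flit error probability, $P_{res}$, as in [20]:
$$E_{useful} = \frac{E_{flit}\, T_{avg}}{1 - P_{res}} \tag{23}$$
where $E_{flit}$ is the per-flit energy in (21) and $T_{avg}$ is the average number of transmissions in (13). For small values of the retransmission probability and $P_{res}$, (23) can be approximated by (21).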
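The energy bookkeeping of this section can be illustrated with a small calculation. The sketch covers only the two steps the text states in words, namely energy per flit as average power times clock period, and energy per successfully transferred flit as that value times the average number of transmissions. The Go-Back-N expression for the average number of transmissions is the standard form assumed here, and the power, frequency and probability values are made-up illustration numbers, not results from the paper.

```python
def energy_per_flit(p_avg_watts: float, t_clk_seconds: float) -> float:
    """Average energy to transfer one flit: average power times clock period."""
    return p_avg_watts * t_clk_seconds


def average_transmissions(p_rt: float, window: int) -> float:
    """Assumed Go-Back-N average number of transmissions per accepted flit."""
    return 1.0 + window * p_rt / (1.0 - p_rt)


def energy_per_successful_flit(p_avg_watts: float, t_clk_seconds: float,
                               p_rt: float, window: int) -> float:
    """Energy per flit scaled by the average number of transmissions."""
    return energy_per_flit(p_avg_watts, t_clk_seconds) * average_transmissions(p_rt, window)


if __name__ == "__main__":
    p_avg = 10e-3        # hypothetical codec + link power: 10 mW
    t_clk = 1.0 / 800e6  # clock period at an 800 MHz operating frequency
    for p_rt in (0.0, 0.01, 0.05):
        energy = energy_per_successful_flit(p_avg, t_clk, p_rt, window=4)
        print(f"P_rt = {p_rt:<4}: {energy * 1e12:.3f} pJ per successfully transferred flit")
```

The retransmission overhead stays small as long as the retransmission probability is low, which is the condition under which the per-flit energy alone is a good approximation.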
VI. EVALUATION
For evaluation purposes, we consider a flit size of 32 bits (unless otherwise indicated). The data is arranged into a $4 \times 8$ matrix, which results in the minimum codeword size and number of wires and leads to the lowest interconnect power consumption. In addition, compared to a $6 \times 6$ matrix, the encoder for the $4 \times 8$ matrix has fewer 2-input XOR gates to compute the check on checks parity, resulting in a smaller critical path delay.
The link delay is assumed to be one cycle and the round trip time, $N$, is 4 cycles (encoding, link traversal, decoding and retransmission signal link traversal), as can be seen from Fig. 6. In all the schemes, it is assumed that a single additional supply voltage exists to provide the link voltage swing. Despite the complexities introduced, multi-$V_{dd}$ designs are attracting high interest and becoming more common in current chips [43], [44].
A. Reliability and Voltage Swing
The residual flit error probability, $P_{res}$, was evaluated using (15) for the different schemes, taking their respective undetected error probabilities and retransmission probabilities from Tables II and III respectively. To show how reliability can be highly enhanced with DTDP-7ED, we consider the case where the four coding schemes work at the same voltage swing under the same noise deviation, which gives the same BER. Under these conditions, DTDP-7ED achieves the lowest $P_{res}$, as Fig. 7 shows. For as indicated by the vertical line in Fig. 7 the residual flit error probabilities are and for DAP, JTEC, JTEC-SQED and DTDP-7ED schemes respectively. Assuming the link works at 1 GHz frequency and one flit per cycle transmission, the MTTF will be 63 sec, 7.4 seconds, 1.3 hours and years respectively. This shows that the higher error detection capability of the DTDP-7ED scheme can enhance the reliability drastically over other schemes under the same working conditions.
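The link between residual flit error probability and MTTF used above can be made explicit. The sketch below assumes, as the text does, one flit per cycle at 1 GHz, and takes the MTTF as the expected time until the first flit with an undetected error, i.e., the reciprocal of the residual flit error probability times the flit rate. The function names are ours, and the probability in the first example is a placeholder rather than a value from Fig. 7.

```python
def mttf_seconds(p_res: float, flit_rate_hz: float = 1e9) -> float:
    """Expected time to the first residual flit error at the given flit rate."""
    return 1.0 / (p_res * flit_rate_hz)


def p_res_from_mttf(mttf_s: float, flit_rate_hz: float = 1e9) -> float:
    """Inverse relation: residual flit error probability implied by an MTTF."""
    return 1.0 / (mttf_s * flit_rate_hz)


if __name__ == "__main__":
    print(f"P_res = 1e-12 -> MTTF = {mttf_seconds(1e-12):.0f} s")
    print(f"MTTF = 1.3 h  -> P_res = {p_res_from_mttf(1.3 * 3600):.2e}")
```

Each order of magnitude gained in residual flit error probability multiplies the MTTF by ten, which is why the four schemes span such a wide range of failure times at the same BER.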
From another perspective, the intersection points with the horizontal line in Fig. 7 show the bit error rates at which each scheme achieves the target residual flit error probability. The results indicate that for the same target reliability, DTDP-7ED can work at a higher BER than the other schemes. As DTDP-7ED can sustain a higher BER, the voltage swing of the link can be lowered while working in the same noise environment (i.e., noise deviation) as the other schemes. This relationship is shown in Fig. 8, where the link voltage swing is shown as a function of the target residual flit error probability at a noise deviation of 0.12 V. It can be noticed that of DTDP-7ED for a high target reliability is lower than for all the other schemes even at lower reliability of . The same applies to JTEC when compared to DAP. Another important observation is that the difference between at and at is small ( % of ) for all the schemes. This observation indicates that a large increase in reliability can be achieved with a slight increase in voltage swing. On the other hand, this also brings a new design issue as it requires precise voltage control; otherwise the required reliability will not be achieved and would drop considerably with a slight drop in voltage.
To select a voltage swing $V_{sw}$ for a target reliability level, the noise deviation of the system must be estimated at design time. This design time estimation requires statistical results from previously implemented designs and/or circuit simulation, or the use of complex models to account for the different noise sources. Fig. 9 shows the effect of different noise deviations on the required voltage swing for a fixed target residual flit error probability. It can be seen that the change in noise deviation has a strong effect on the required voltage swing for all the schemes. When the noise deviation changes from 0.06 V to 0.12 V, $V_{sw}$ doubles for all the schemes. It is worth noting that $V_{sw}$ for DTDP-7ED is 48% lower than that of DAP, compared to the 35% and 28% reductions achieved by JTEC-SQED and JTEC respectively. The results for DAP show that it can only be operated at a sub-1 V swing for noise deviations below 0.072 V. Based on this, it can be inferred that low swing signaling with single error correction or detection schemes is only possible in the presence of low noise levels. This confirms the need for multi-bit error correction or detection schemes, as higher noise levels are expected in DSM technology, as indicated in [12], [20].
For the rest of the experiments, we fix the noise deviation and the target residual flit error probability. For this parameter setup, we identified the corresponding voltage swings shown in Table IV. DTDP-7ED has the lowest voltage swing, while all the other schemes need to work at a voltage above 1 V. For the same target, a small increase in voltage swing is observed for the 64-bit flit width with respect to the 32-bit flit width, since a wider bus is more susceptible to errors. In Section IV, DTDP-7ED was identified as having the highest retransmission probability, as it has no correction capability. However, as reported in Table IV, its average throughput decreases by only about 5% for both the 32- and 64-bit flit widths compared with the other schemes. In comparison to the results in Fig. 5, the slight increase in voltage swing in the 64-bit case reduces the bit error rate and thus results in an average throughput similar to that of the 32-bit case. Note that a further reduction in voltage swing may lower the throughput below an acceptable level. For higher throughput, a higher voltage swing is required, which also enhances the reliability. Using (13), (11) and (1), it has been found that voltage swings of 0.966 V and 1.0 V are required to achieve a throughput of 0.99 for the 32- and 64-bit cases, respectively.
B. Power and Energy
Based on the voltage swing of each scheme in Table IV, the power consumption was compared at a 3-mm link length. The interconnect parameters shown in Table V are based on ITRS [45]. The self and coupling capacitances were obtained for 45-nm technology based on the predictive technology model (PTM) [46] for the topmost metal layer. The link power was evaluated using (18) and (20). The power overhead of the level translation circuits is small, as reported in [17], [19], and thus contributes a small, proportional increase in link power for all the schemes. The encoder and decoder for each coding scheme were synthesized using Synopsys Design Compiler with the 45-nm Nangate library [47], targeting an 800 MHz frequency. Fig. 10 shows the comparison of the link power for the different schemes, normalized to DAP, for flit sizes of 32 and 64 bits. The link power of DTDP-7ED is the lowest at both the 32- and 64-bit flit sizes due to its smallest voltage swing. The 64-bit case has slightly lower link power than the 32-bit case, since the normalized codeword size decreases with larger flit size, as depicted in Fig. 2. The power consumption of DTDP-7ED at high throughput (0.99) is higher than in the normal case but still lower than that of the other schemes, while achieving higher reliability. As shown in Table VI, DTDP-7ED achieves the lowest link power in both the equal-width-spacing and equal-bus-width configurations, due to its lowest voltage swing, despite its larger number of wires. The corresponding number of wires for each scheme can be found in the table. Note that the equal-bus-width configuration assumes a constant link area for all the schemes, allowing further optimization of the interconnect structure that results in lower coupling capacitance [48]. Typically, wire and dielectric thickness cannot be specified by circuit/layout designers, but they are free to tune the wire width and spacing.
TABLE V INTERCONNECT PARAMETERS AND THEIR VALUES USED IN THE SIMULATION
TABLE VI NUMBER OF WIRES, BUS WIDTH, CRITICAL
By constraining the bus width of all the schemes to the DTDP-7ED bus width, DAP has the lowest coupling capacitance, followed by JTEC and JTEC-SQED, while DTDP-7ED has the highest. These coupling capacitance values are then used in the respective equations, reflecting the performance of each coding scheme. As indicated in Table VI, the encoder hardware for JTEC-SQED and DTDP-7ED consumes more power than that of the other two schemes. This is due to the retransmission buffers required in these two schemes, which are based on the HARQ and ARQ error control policies, respectively. Comparing the total power, DTDP-7ED achieves 54% power savings with respect to DAP, while JTEC and JTEC-SQED achieve 34% and 41%, respectively, for the equal-width-spacing configuration. For the equal-bus-width configuration, power savings of 42%, 26% and 34% with respect to DAP are achieved for DTDP-7ED, JTEC and JTEC-SQED, respectively. For the rest of the results, equal wire width and spacing are assumed.
Fig. 11 shows the total power of each scheme, normalized to the total power of the DAP scheme, as a function of link length ranging from 1 mm to 5 mm. The power savings of each scheme increase with link length. This is because the link power consumption grows with link length much faster than the encoder and decoder power consumption, which makes low voltage swing signaling highly attractive in this respect. At 1 mm, the DTDP-7ED encoder and decoder power consumption represents 45% of the total power, in comparison to only 14% at 5-mm length. As transistors continue to scale faster than wires [5], the power savings are expected to keep improving despite the down-scaling of the interconnect structure and shorter links.
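The dependence of the codec share on link length can be checked with a small sketch. It assumes, as a simplification not stated in the text, that the link power is proportional to link length and that the encoder and decoder power is independent of it; the function names and the arbitrary power unit are hypothetical.

```python
def calibrate_codec_power(share_at_1mm, link_power_per_mm=1.0):
    """Codec power (arbitrary units) that gives the stated share at 1 mm."""
    return share_at_1mm * link_power_per_mm / (1.0 - share_at_1mm)


def codec_share(link_length_mm, codec_power, link_power_per_mm=1.0):
    """Fraction of total power spent in the codec for a given link length."""
    # Assumption: link power scales linearly with length, codec power is fixed.
    link_power = link_power_per_mm * link_length_mm
    return codec_power / (codec_power + link_power)


if __name__ == "__main__":
    # Calibrate on the 45% share reported at 1 mm.
    codec_power = calibrate_codec_power(0.45)
    for length_mm in (1, 2, 3, 4, 5):
        share = codec_share(length_mm, codec_power)
        print(f"{length_mm} mm: codec share = {share:.1%}")
```

Calibrated on the 45% share at 1 mm, this simple model gives a share of about 14% at 5 mm, in line with the reported figure.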
The reduction in power consumption of the DTDP-7ED scheme is reflected in the average energy for a successful flit transfer, as shown in Fig. 12. At low noise deviation, the retransmission probability is negligible, but as the noise deviation exceeds the value selected at design time, the retransmission probability increases to a noticeable value that affects the average energy. The other schemes show no such effect, as they have error correction capabilities. At noise deviations below the design point, the energy of the proposed scheme is 53%, 30%, and 21% lower than that of DAP, JTEC, and JTEC-SQED, respectively. At the design point, the energy savings decrease to 51%, 26%, and 17%. Thus, it is clear that the proposed scheme brings energy savings when working within the design noise levels.
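The effect of retransmissions on the average energy can be sketched as follows. The model is an assumption made for illustration, not the paper's own expression: every transmission attempt costs the same energy and attempts fail independently with the retransmission probability, so the number of attempts is geometrically distributed. The function name and the numbers are hypothetical.

```python
def average_energy_per_flit(energy_per_attempt, retransmission_prob):
    """Expected energy for one successful flit transfer under simple ARQ."""
    if not 0.0 <= retransmission_prob < 1.0:
        raise ValueError("retransmission probability must be in [0, 1)")
    # Expected number of attempts of a geometric process: 1 / (1 - p).
    expected_attempts = 1.0 / (1.0 - retransmission_prob)
    return energy_per_attempt * expected_attempts


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05, 0.2):  # hypothetical retransmission rates
        energy = average_energy_per_flit(1.0, p)
        print(f"P_retx = {p:.2f}: energy = {energy:.3f} x single attempt")
```

The overhead stays close to one attempt while the retransmission probability is small and grows quickly once it is not, which matches the qualitative behavior described for noise levels above the design point.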
C. Codec Delay and Area
As shown in Table VI, the DTDP-7ED encoder has the highest critical path delay, while its decoder has the smallest delay. The larger of the encoder and decoder delays governs the maximum clock frequency when pipelined operation is considered. Thus, DTDP-7ED has the highest possible clock frequency of 1.61 GHz, while DAP, JTEC and JTEC-SQED reach 1.56, 1.0, and 0.93 GHz, respectively.
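As a minimal sketch of this rule, the maximum clock frequency of a pipelined codec is the reciprocal of its slower stage. The delay values below are hypothetical and are not taken from Table VI.

```python
def max_clock_ghz(encoder_delay_ns, decoder_delay_ns):
    """Maximum clock frequency (GHz) when encoder and decoder are pipelined."""
    # The slower of the two stages sets the minimum clock period.
    critical_delay_ns = max(encoder_delay_ns, decoder_delay_ns)
    return 1.0 / critical_delay_ns


if __name__ == "__main__":
    # Hypothetical delays: a slow encoder paired with a fast decoder.
    print(f"{max_clock_ghz(0.62, 0.40):.2f} GHz")
```

Read in reverse, a maximum frequency of 1.61 GHz corresponds to a critical path of roughly 0.62 ns in the slower of the two blocks.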
From Table VI, it can be seen that JTEC-SQED and DTDP-7ED have a larger encoder area than the other two schemes due to the retransmission buffers required in the HARQ and ARQ schemes. On the other hand, the decoder area of the proposed scheme is only slightly larger than that of DAP, whereas the JTEC and JTEC-SQED decoders incur a larger area overhead over DAP. The proposed decoder requires additional XOR gates to generate the row and column check bits, while JTEC and JTEC-SQED have a higher overhead due to the error correction circuitry. The total area of the proposed scheme is higher than that of both DAP and JTEC, but lower than that of JTEC-SQED.
D. Average Latency
To evaluate the NoC performance in the presence of the different coding schemes, an 8 × 8 mesh-based system is considered. In this setup, a four-stage pipelined router with two additional stages for encoding and decoding and one cycle for link traversal is assumed. Each port in the router has four virtual channels of 8 flit entries each. The packets are routed using the XY routing algorithm. Each packet is composed of four flits, and a uniform traffic pattern with Bernoulli injection is used in the simulations. In the evaluation, three schemes are considered: JTEC, JTEC-SQED, and DTDP-7ED. Both DAP and JTEC are FEC schemes, which means they have no retransmission. As a result, they have the same network performance, so considering JTEC is enough to show the behavior of the FEC schemes. Two target residual flit error probabilities are considered. The first represents a low-reliability selection (i.e., MTTF less than one day) for an 8 × 8 mesh NoC operating at 800 MHz with fully utilized links, while the second gives high reliability (i.e., MTTF more than 17 years). For the target shown in Fig. 13(a), the three schemes have the same average latency at all injection rates except near the saturation point, where DTDP-7ED saturates slightly before the other two schemes. This indicates that the retransmission probability at this target has a negligible effect on performance. For the target shown in Fig. 13(b), the average latency of DTDP-7ED at injection rates below 0.16 is almost the same as that of JTEC and JTEC-SQED. When the injection rate exceeds 0.16, however, the average latency of DTDP-7ED starts to increase faster than that of the others due to the effect of retransmissions. The retransmissions contribute to network contention, which finally saturates the network at an injection rate of 0.26. On the other hand, JTEC and JTEC-SQED saturate at an injection rate of 0.3, since they are not affected by retransmissions.
VII. CONCLUSION
In this paper, a crosstalk-aware seven-error-detection coding scheme was proposed. The undetected error probability and retransmission probability of the proposed scheme were derived. The residual flit error probability, which represents the reliability of the schemes, was compared as a function of BER. The results show that the proposed scheme, DTDP-7ED, can achieve the same reliability under a higher BER, allowing the link to operate at a lower voltage swing. The reduced voltage swing enabled DTDP-7ED to reduce the average power and energy consumption compared with the other schemes, despite the increase in energy with increasing noise due to retransmissions. It was shown that providing higher error detection capability brings energy savings when working in noisy environments. Furthermore, using the ARQ error control policy with higher error detection capability achieves energy savings with a relatively small impact on performance, which makes it highly suitable for medium to high reliability systems.
APPENDIX A HAMMING DISTANCE OF DTDP SCHEME
Based on [35], for any product code of three codes $C_1$, $C_2$, and $C_3$ having minimum Hamming distances $d_1$, $d_2$, and $d_3$ respectively, the minimum Hamming distance is given as $d = d_1 d_2 d_3$. The DTDP code can be considered a product of three codes. The first code is the simple parity code applied to each row, the second code is the simple parity code applied to each column, and the third code is the duplication code of each bit. Simple parity has minimum Hamming distance $d_1 = d_2 = 2$, whereas duplication is a repetition code of length 2 with minimum Hamming distance $d_3 = 2$. Therefore, the minimum Hamming distance of the DTDP code is given as $d = 2 \times 2 \times 2 = 8$.
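The product-code distance can be checked by brute force on small blocks. The sketch below is an illustration under an assumed bit layout, not the paper's encoder: data bits are arranged row-major, a parity bit is appended to each row, a parity row is appended over all columns (including the parity-of-parities bit, so that the row and column parts form a full product code), and every bit is then duplicated. The function names are hypothetical.

```python
from itertools import product


def dtdp_encode(bits, rows, cols):
    """Encode rows*cols data bits with row parity, column parity, duplication."""
    grid = [list(bits[r * cols:(r + 1) * cols]) for r in range(rows)]
    # Append one even-parity bit to every data row.
    for row in grid:
        row.append(sum(row) % 2)
    # Append a parity row over all columns, including the row-parity column.
    grid.append([sum(grid[r][c] for r in range(rows)) % 2
                 for c in range(cols + 1)])
    # Duplicate every bit (repetition code of length 2).
    codeword = []
    for row in grid:
        for bit in row:
            codeword.extend((bit, bit))
    return codeword


def min_distance(rows, cols):
    """Minimum Hamming distance, found by exhaustive search."""
    # The code is linear, so the minimum distance equals the minimum
    # weight over all nonzero codewords.
    return min(sum(dtdp_encode(bits, rows, cols))
               for bits in product((0, 1), repeat=rows * cols)
               if any(bits))


if __name__ == "__main__":
    for rows, cols in ((2, 2), (2, 3), (3, 3)):
        print(f"{rows}x{cols} data block: d_min = {min_distance(rows, cols)}")
```

For each of these small block sizes the search returns a minimum distance of 8, so any pattern of up to seven bit errors changes a codeword into a non-codeword and is therefore detectable.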
APPENDIX B ERROR DETECTION ABILITY OF ROW DECODER BLOCK
