Abstract: High reliability against noise, high performance, and low energy consumption are key objectives in the design of on-chip networks. Recently, some researchers have considered the impact of various error-control schemes on these objectives and on the tradeoff between them. In all these works, performance and reliability are measured separately. However, we argue in this paper that the use of error-control schemes in on-chip networks results in degradable systems; hence, performance and reliability must be measured jointly using a unified measure, i.e., performability. Based on the traditional concept of performability, we provide a definition for the "Interconnect Performability". Analytical models are developed for interconnect performability and expected energy consumption. A detailed comparative analysis of the error-control schemes using the performability analytical models and SPICE simulations is provided, taking into consideration voltage swing variations (used to reduce interconnect energy consumption) and variations in wire length. Furthermore, the impact of noise power and time constraint on the effectiveness of error-control schemes is analyzed.
I. INTRODUCTION
ON-CHIP networks have been proposed to cope with the ever-increasing complexity and communication requirements of SoCs [1], [2]. The implementation of an on-chip network affects the system reliability, performance (system speed), and energy consumption to a large extent [3]. Energy consumption is one of the most prominent issues in on-chip networks [2], particularly in the context of battery-operated devices. It has been shown that on-chip interconnects account for a significant fraction (up to 50%) of the total on-chip energy consumption [4]. On the other hand, the required reliability of on-chip interconnects is becoming harder to achieve due to shrinking feature sizes and supply voltage scaling, which make on-chip interconnects more sensitive to noise [6].
To address the energy consumption issue, a physical layer [1] technique called reduced voltage swing [7]-[9] (reducing the voltage at which information is transmitted over a channel) is often used. However, reduced voltage swing leads to a decreased noise margin, making interconnects less immune to noise. Variations in voltage swing also necessitate changes in the interconnect operational frequency, which lead to variations in performance [7]. To address the reliability issue, error-control schemes such as Automatic Repeat Request (ARQ) and Forward Error Correction (FEC), which are widely applied in packetized communication in large-scale networks, can be used at the data-link layer [1] to increase the reliability of on-chip networks [2], [5], [6]. However, these mechanisms increase the energy consumption and can degrade the performance of on-chip networks. For instance, in the ARQ scheme, the receiver requests the sender to retransmit the data unit that was faulty [6]. Clearly, retransmissions take time (i.e., degraded performance) and consume energy (i.e., increased energy consumption). Based on the above, high performance, high reliability, and low energy consumption are conflicting objectives that must be considered jointly when designing an on-chip network.
In the context of on-chip communication, the energy efficiency of the FEC and ARQ error-control schemes has been studied in [6]. This research has reported that, for the same constraint on system reliability, the ARQ scheme consumes less energy than FEC. However, this research has not considered performance. Indeed, it has been assumed that timing penalties can be tolerated [15]. Furthermore, this research has not considered the hybrid ARQ/FEC (HARQ) scheme. A dynamic voltage swing approach has been proposed in [7] to optimize the energy consumption of the ARQ scheme without degrading the performance and the reliability. However, this research has not considered the FEC and HARQ schemes. Reference [16] has compared the ARQ and HARQ schemes. This work provides useful information for selecting an appropriate error-control scheme for a given application. However, it addresses the energy/reliability and performance/reliability tradeoffs separately and does not consider the impact of voltage swing on the simultaneous tradeoff between reliability, performance, and energy consumption. References [25]-[27] have addressed the reliability, performance, and energy consumption of NoCs. However, these works are mainly focused on router architecture and do not investigate the issues related to channel wires, such as voltage swing variations and variations in wire length (wire capacitance). These works also do not provide any comparison between the ARQ, FEC, and HARQ error-control schemes. In [35], a technique has been presented to tolerate permanent faults by gracefully degrading the NoC performance, focusing on the router architecture. However, that work has not addressed on-chip interconnect wires (issues such as voltage swing, wire capacitance, and noise), and has not considered the ARQ, FEC, and HARQ error-control schemes.
In [28] , fault tolerance techniques for NoC interconnects have been presented, considering the power consumption of the circuits used for the fault tolerance techniques. However, it has not addressed the power consumption of on-chip network interconnects or any low power design technique (for example, reduced voltage swing). While most of the reported on-chip interconnect techniques have focused on voltage-mode links, there have been some reported studies of current-mode on-chip interconnects [33] , [34] .
Although some of the above works have addressed the performance and reliability of NoCs, none of them has addressed the performability metric [10], [21], which is a composite measure of performance and reliability. It has been shown that for degradable fault tolerant systems (i.e., fault tolerant systems that tolerate faults by reducing their performance), reliability and performance cannot be measured separately and should be measured jointly using the performability metric [10], [21]. We will argue in this paper (Section II-B) that the use of error-control schemes in on-chip networks results in degradable fault tolerant systems; hence, performability should be used to measure performance and reliability jointly. Based on the traditional concept of the performability metric [10], [11], [21], in this paper we provide a definition of "interconnect performability" to measure the reliability and performance of an on-chip network interconnect in a composite way. Two other important issues which have not been addressed in the previous works [6], [7], [16] are the impacts of (i) time constraints and (ii) noise power on the effectiveness of the error-control schemes.
In this paper, we aim: (i) to analyze the impact of voltage swing and different error-control schemes on the tradeoff between performability and energy, and (ii) to answer the following question: "If a message transmission has to be finished in a given time interval (time constraint) and in the presence of noise with a given power, which error-control scheme and what voltage swing must be used to perform the transmission with the minimum energy and highest performability?". It should be noted that the aim of this paper is to identify an appropriate error-control scheme (among the existing ones) and to select a proper voltage swing which will meet the required performability and energy objectives under given time constraint and noise power, rather than proposing any new error-control scheme. Also note that we concentrate on the physical (e.g., voltage swing) and data-link (e.g., error control) layers.
To analyze the performability/energy tradeoff, analytical models of performability and expected energy consumption are developed for three error-control schemes (ARQ, FEC, and HARQ) and the simple non-fault-tolerant communication [...]

[...] message flow-control units or flits. In the presence of channel width constraints, multiple physical channel cycles may be used to transfer a single flit. However, in many implementations, each flit is transferred in a single cycle [6]. Error-control schemes can be implemented at different levels of granularity [6], [16]. In packet-level error-control schemes, check bits (e.g., checksum bits) are associated with an entire packet and can be transmitted as the last flit of the packet. In this scenario, the cost for error control would be paid in the time domain (i.e., the transmission of an additional flit). The alternative solution is flit-level error control, where each flit contains its own check bits. In this case, the cost for error control would be paid in the space domain (i.e., additional wiring resources for check bits) [6]. Although packet-level error control has lower check-bit overhead than flit-level error control [6], flit-level error control is used in on-chip networks (such as XPIPE [17]) since it has relatively lower packet latency and requires less buffer memory in the error-control circuitry, which makes it suitable for distributed error-control schemes [6], [16]. In fact, most of the related works [5], [7], [13], [15] use flit-level error control where wiring resources are used for check bits, although some of these works do not directly refer to the term "flit-level error control". Similarly, in this paper we consider flit-level error-control schemes where redundant wires are used for check bits. Fig. 1 shows a possible architecture for an on-chip interconnect which uses flit-level error control.
The encoder (denoted by "ENC") adds check bits to each flit and the decoder (denoted by "DEC") uses the check bits to detect and/or correct faulty flits. The 1-bit connection line denoted by "Retransmission Request" is, unlike all the other connections in Fig. 1 , backward from the decoder to the encoder. The "Retransmission Request" line is only required for the error-control schemes with retransmission capability and is not required for the other schemes (Section II-A). The level shifter units are used to change the voltage swing.
In the rest of this section, we first introduce three error-control schemes (ARQ, FEC, and HARQ), and then we develop the analytical models of performability and energy for the schemes.
A. Error-Control Schemes
The three error-control schemes for on-chip networks, considered in this work, are the following.
1) ARQ:
In this scheme [6] , the sender includes an encoder which encodes flits using an error detection code (e.g., CRC-8 code [7] ). The receiver includes a decoder which can detect errors (faulty flits). When the receiver detects no fault in a flit, it sends back an ACK (e.g., a "0" on the 1-bit "Retransmission Request" line) to the sender to acknowledge the correctness of the flit. However, when the receiver detects that a flit is faulty, it sends back a NACK (e.g., a "1" on the 1-bit "Retransmission Request" line) to request the sender to resend the flit. This process is repeated until the receiver detects no fault in the flit. When the receiver detects no fault in a flit, the flit is supposed to be correct; however there are rare occasions when a flit is faulty and the receiver cannot detect the fault. In this case, since the fault is undetected, the receiver does not request the sender to resend the flit. Therefore, the flit remains faulty and the transmission fails (Section II-B1).
Most of the related works (e.g., [5], [7], [17]) consider ARQ schemes that are based on a policy called Go-Back-N [30]. In this policy, flits are transmitted continuously and the sender does not wait for an ACK after sending a flit. Such an ACK is received after a round-trip delay. The sender requires buffering resources to store a copy of those flits that are transmitted during the round-trip delay and whose ACKs have not yet been received. Using these buffers, when a NACK is received, the sender backs up to the flit that is negatively acknowledged and resends it in addition to the $N-1$ succeeding flits ($N$ is sometimes called the window size [30]) that were transmitted during the round-trip delay. A flit is removed from the sender buffer only when an ACK is received for it. At the receiver, the received flits following a detected faulty flit are discarded regardless of whether they were correct or not. It should be noted that in the Go-Back-N policy, the channel and the "Retransmission Request" line operate in parallel. That is, while the sender is transmitting a flit over the channel, the receiver transmits the ACK/NACK for an earlier flit over the "Retransmission Request" line. The Go-Back-N policy is preferred to its alternative, a selective repeat policy [30], which requires more buffers at the receiver [17]. Therefore, in this paper, as in [5], [7], [17], we consider ARQ schemes that are based on the Go-Back-N policy (for more information on the Go-Back-N policy, refer to [30]).
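The cost of a detected fault under this retransmission policy can be illustrated with a small slot-counting sketch. It is only a simplified model, not the implementation used in the cited works: it assumes the round-trip delay equals exactly `window` flit cycles (the window size $N$), that every fault is detected, and that a NACKed flit is resent together with the `window - 1` flits sent behind it. The function name and the `is_faulty` callback are hypothetical.

```python
def go_back_n_slots(num_flits, window, is_faulty):
    """Count the channel slots needed to deliver num_flits under a
    simplified Go-Back-N model (assumption: round trip == window slots).

    is_faulty(slot) -> bool says whether the flit sent in that slot is
    detected as faulty by the receiver.
    """
    slot = 0        # index of the next channel slot to be used
    delivered = 0   # flits accepted by the receiver so far
    while delivered < num_flits:
        if is_faulty(slot):
            # The faulty flit and the window-1 flits sent behind it are
            # wasted: the receiver discards them and they are resent.
            slot += window
        else:
            slot += 1
            delivered += 1
    return slot


# Ten flits, window of four slots, faults detected in slots 2 and 6.
print(go_back_n_slots(10, 4, lambda s: s in {2, 6}))
```

In this model every detected fault adds `window` slots to the transfer, which is why a larger round-trip delay makes each retransmission more expensive.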
As can be seen from Fig. 1, the "Retransmission Request" line is not driven with a reduced voltage swing. This is because this line usually carries ACKs and only rarely carries a NACK, namely when a fault is detected. Hence, the switching activity of this line is very low and it consumes negligible power, so a reduced voltage swing is not required.
2) FEC:
In this scheme [6], the sender includes an encoder that encodes flits using an error correction code which can be used for single-bit error correction (e.g., overlapping parity bits [12]). The receiver includes a decoder which can correct single-bit errors. When the receiver detects a single-bit error in a flit, it corrects the error without any retransmission request. However, when there is a multiple-bit error in a flit, it cannot be corrected and the transmission fails. In this scheme, the "Retransmission Request" line shown in Fig. 1 is not needed and does not exist.
3) Hybrid FEC/ARQ (HARQ):
In this scheme, the sender includes an encoder that encodes flits using an error correction code (e.g., overlapping parity bits [6]). The receiver includes a decoder which can correct single-bit errors and detect multiple-bit errors. When the receiver detects a single-bit error in a flit, it corrects the error without any retransmission request. However, when the receiver detects a multiple-bit error in a single flit, it cannot correct the error and hence requests the sender, through the "Retransmission Request" line (Fig. 1), to resend the flit. This process is repeated until the receiver detects no fault in the flit or detects only a single-bit error that is correctable without requiring any retransmission. As in the ARQ scheme, when the receiver detects no fault in a flit, the flit is assumed to be correct; however, there are rare occasions when a flit is faulty and the receiver cannot detect the fault. In this case, since the fault is undetected, the receiver neither corrects the flit nor requests the sender to resend the flit. Therefore, the flit remains faulty and the transmission fails (Section II-B1). Also, in this paper, the retransmission policy of the HARQ scheme, like that of the ARQ scheme, is considered to be the Go-Back-N policy.
B. Performability of an On-Chip Network Interconnect
An important class of fault tolerant systems is that of degradable systems, which in the presence of faults descend into a lower level of performance but still operate correctly. In fact, degradable systems have the capability of trading performance for reliability. They differ from non-degradable fault tolerant systems, which in the presence of a fault either tolerate the fault and continue to operate correctly at the normal performance level (without any degradation in performance) or do not tolerate the fault and fail. As discussed in the literature (e.g., [10], [11], [21]), traditional views of computer "performance" and computer "reliability" are no longer applicable to degradable systems (they are only applicable to non-degradable systems), and performance and reliability must be measured jointly using a metric called performability. We believe that the use of error-control schemes for on-chip network interconnects may result in degradable systems, thereby requiring performability analysis. We clarify this by means of the following example:
Suppose a 32-bit on-chip interconnect operates at a frequency of 500 MHz (i.e., each flit takes 2 ns to be transferred and the bit rate is 32 bits/2 ns = 16 Gbit/s) and we want to transfer 10 flits on this interconnect. Also suppose that the ARQ scheme is used for this interconnect. If no fault occurs during the transfer of the 10 flits, the transfer will take 20 ns and hence the useful bit rate will be (32 × 10 bits)/20 ns = 16 Gbit/s. However, if, for example, 4 of the 10 flits become faulty during the transfer and require retransmission, a total of 14 flits must be transferred (4 of them are transferred twice), which will take 28 ns, and hence the useful bit rate will be (32 × 10 bits)/28 ns ≈ 11.4 Gbit/s. It can be seen that when faults have occurred during the transmission of the 10 flits, the faults have been tolerated using the ARQ scheme, but the interconnect performance has dropped from 16 Gbit/s (i.e., the performance in the fault-free case) to 11.4 Gbit/s. This example shows that the use of the ARQ scheme for the interconnect results in a degradable system. Therefore, a performability analysis should be used for such an interconnect rather than analyzing the performance and reliability separately. In fact, when we use error-control schemes for on-chip network interconnects, the traditional views of communication performance and communication reliability have the following drawbacks.
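The arithmetic of this example can be reproduced with a few lines of code. The sketch below is purely illustrative; the function name and parameters are hypothetical, and it assumes, as the example does, that each faulty flit costs exactly one extra flit cycle.

```python
def useful_bit_rate_gbps(flit_bits, useful_flits, transmitted_flits, cycle_ns):
    """Useful bit rate in Gbit/s: useful bits over the total transfer time.

    transmitted_flits counts original flits plus retransmissions.
    """
    total_time_ns = transmitted_flits * cycle_ns
    return flit_bits * useful_flits / total_time_ns  # bits/ns equals Gbit/s


fault_free = useful_bit_rate_gbps(32, 10, 10, 2.0)
four_retransmissions = useful_bit_rate_gbps(32, 10, 14, 2.0)
print(f"{fault_free:.1f} Gbit/s vs. {four_retransmissions:.1f} Gbit/s")
```

The apparent bit rate stays at 16 Gbit/s in both cases; only the useful rate reflects the retransmissions.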
1) Metrics such as bit rate, baud rate, latency, bandwidth, and operational frequency are some of the most commonly used measures of communication performance [7] , [16] . However, when error-control schemes are used in on-chip networks, these metrics cannot provide a realistic view of performance. In fact, from a performance point of view, it is the useful bit rate which is important, not the apparent rate at which all the bits (including faulty and fault-free flits) are transferred. On the other hand, the use of error-control schemes causes the useful bit rate to become dependent on how faults occur and how they are tolerated. Therefore, it may be impossible to measure the real performance without considering the reliability issues. For instance, in the above example, when there is no faulty flit, the useful bit rate is 16 Gbit/s, but when 4 flits become faulty, the useful bit rate is reduced to 11.4 Gbit/s (although the faults are tolerated). Note that while the useful bit rate varies with the number of faults, the apparent bit rate is constant and equal to 16 Gbit/s.
2) Another important drawback of the above mentioned metrics of communication performance (i.e., bit rate, baud rate, etc.) is that they cannot model the probabilistic nature of the performance of those on-chip interconnects which use error-control schemes. From the above example, it is clear that the real performance of the example interconnect (i.e., the useful bit rate) depends on the number of faulty flits. However, since faults occur randomly the real performance is also a random variable and is not deterministic. In such cases, metrics such as bit rate, baud rate, etc. can only be used to describe the average (or the maximum) value of the interconnect performance but cannot model its probabilistic nature. It should be emphasized that the probabilistic nature of communication performance is caused by the influence of reliability-related issues.
3) Metrics such as Bit Error Rate (BER), Flit Error Rate, and Residual Error Probability are some of the most commonly used measures of communication reliability [6], [7], [16]. However, when error-control schemes are used in on-chip networks, these metrics cannot provide a realistic view of how reliable an on-chip interconnect is. For example, suppose that in the above example the residual error probability is 0. From a reliability point of view, this is the highest imaginable reliability, which means that all possible faults are definitely detected and tolerated by retransmission. However, if the number of faulty flits increases, although all of them will be detected and tolerated, the interconnect performance may be drastically reduced because of the time that retransmissions take. In this case, the reliability of the interconnect appears perfect since all the faults are tolerated, but the resulting performance reduction may make the interconnect completely useless if the performance falls below what is required by the application. This discussion shows that when error-control schemes are used for on-chip interconnects, residual error probability (which is a commonly used measure of communication reliability) cannot provide a realistic view of reliability. When the residual error probability of an on-chip interconnect is 0, it seems that the best reliability is achieved, while the interconnect may be completely disabled because of an excessively low performance. Hence, for those on-chip network interconnects that use error-control schemes, performance has to be taken into account in measuring reliability. It should be noted that here, when addressing the performance degradation problem, we do not refer to performance in the sense of an average value; rather, the concern is related to the specific performance values that arise on the particular occasions when fault(s) occur.
In fact, the impact of fault(s) on the average performance of on-chip network interconnects may be negligible, because faults are usually rare events and hence retransmissions are rarely required. However, on the particular occasions when fault(s) occur, the performance degradation caused by retransmission(s) is not negligible. From a reliability point of view, these particular performance values are very important. This is because, on the particular occasions when fault(s) occur, while the faults are tolerated by retransmissions, the degraded performance (due to retransmissions) that arises in these particular cases may be intolerable and lead to problems. We clarify this by means of an example.
Example: Assume that a 32-bit on-chip interconnect operates at a frequency of 500 MHz (i.e., each flit takes 2 ns to be transferred) and 3 flits need to be transferred on this interconnect. Also assume that the BER (i.e., the probability that a transmitted bit will be received in error) is $10^{-10}$ (this BER is within the ranges considered in [6], [16]) and the residual error probability is 0 (i.e., all faulty flits will be detected and retransmitted). Let $p$ be the probability of a flit being faulty. When BER $= 10^{-10}$, the probability of a flit being correct (all the 32 bits of the flit should be correct) is $(1 - 10^{-10})^{32}$ and hence the probability of a flit being faulty is $p = 1 - (1 - 10^{-10})^{32} \approx 3.2 \times 10^{-9}$. A flit becomes faulty with the probability $p$, so the first retransmission is required with the probability $p$. The second retransmission is only required when both the original and the first retransmitted flits are faulty; hence, the second retransmission is required with the probability $p^2$. Similarly, the $i$th retransmission will be required with the probability $p^i$. Thus, for each flit, the expected number of transmissions (including the original one and retransmissions) is

$$1 + p + p^2 + \cdots = \sum_{i=0}^{\infty} p^i = \frac{1}{1-p} \qquad (1)$$

In the case where no fault occurs, the time that the 3 flits take to be transferred is $3 \times 2 = 6$ ns. However, considering the possible presence of faults, the average time that the 3 flits take to be transferred (both original and retransmitted flits) is

$$3 \times 2\ \text{ns} \times \frac{1}{1-p} \approx 6.000000019\ \text{ns}$$

Although this average transfer time can be rounded to 6 ns, we have intentionally kept the digits to show the difference between this average transfer time and the transfer time in the fault-free case. It can be seen that the transfer time has increased from 6 ns in the fault-free case to 6.000000019 ns when faults might occur. Therefore, from an average performance point of view, the performance penalty imposed by faulty flits is negligible.
Considering this negligible performance penalty, suppose we require the 3 flits to be transferred within a time interval of 7 ns (which is greater than 6.000000019 ns). Interestingly, with this time constraint, the interconnect is no more reliable than a simple interconnect which has no retransmission mechanism. This is because even the occurrence of a single fault requires an additional delay of 2 ns to retransmit the faulty flit (8 ns in total), which violates the required time constraint. Although the average performance penalty imposed by retransmissions is negligible, reliability requirements may necessitate a significantly larger time interval for transferring the flits, that is, a time interval which is considerably greater than what is required in the fault-free case. For instance, in this example, with the time constraint of 7 ns no fault is tolerable and with the time constraint of 9 ns only one fault is tolerable.
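The relation between the time constraint and the number of tolerable faults in this example can be written down directly. The following sketch is illustrative only; the function name is hypothetical, and it assumes, as the example does, that each retransmission costs exactly one additional flit cycle.

```python
def tolerable_retransmissions(num_flits, cycle_ns, deadline_ns):
    """Number of single-flit retransmissions that still fit in the deadline.

    Returns -1 when even the fault-free transfer misses the deadline.
    Times are integers in nanoseconds to keep the arithmetic exact.
    """
    fault_free_ns = num_flits * cycle_ns
    if deadline_ns < fault_free_ns:
        return -1
    return (deadline_ns - fault_free_ns) // cycle_ns


for deadline in (7, 8, 9):
    print(deadline, "ns ->", tolerable_retransmissions(3, 2, deadline))
```

With 3 flits of 2 ns each, a 7 ns deadline leaves room for no retransmission, while 8 ns and 9 ns each leave room for exactly one.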
In DSM technologies, on-chip interconnects are becoming increasingly susceptible to noise [6], [16]; hence, we expect relatively higher BER values. Furthermore, BER is very sensitive to voltage swing variations, so that as the voltage swing decreases, BER may increase by several orders of magnitude (Section II-B2a). For large BER values, the average performance penalty caused by retransmissions may become considerable. For example, if, due to smaller transistor geometries and reduced voltage swing, BER reaches $10^{-2}$ ([31] has addressed such large BER values for logic gates), the probability of a flit being faulty will be $1 - (1 - 10^{-2})^{32} \approx 0.28$ and the average time for the 3 flits to be transferred (both original and retransmitted flits) will be

$$3 \times 2\ \text{ns} \times \frac{1}{1 - 0.28} \approx 8.33\ \text{ns}$$

It can be seen that as BER increases to $10^{-2}$, the average transfer time increases from 6 ns (fault-free case) to 8.33 ns and, therefore, the average performance is considerably degraded (the average transfer time is increased by about 39%).
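The two average transfer times quoted in this example can be checked numerically. The sketch below is not part of the original analysis: the function names are hypothetical, and the BER values $10^{-10}$ and $10^{-2}$ are assumptions inferred from the reported transfer times of 6.000000019 ns and 8.33 ns. The second figure is reproduced with the flit error probability rounded to 0.28.

```python
import math


def flit_error_probability(ber, flit_bits=32):
    """Probability that at least one of flit_bits independent bits is wrong.

    Equals 1 - (1 - ber)**flit_bits, computed stably for very small ber.
    """
    return -math.expm1(flit_bits * math.log1p(-ber))


def expected_transfer_time_ns(num_flits, cycle_ns, p_flit):
    """Mean transfer time when each flit needs 1/(1 - p_flit) transmissions
    on average (geometric series of retransmissions)."""
    return num_flits * cycle_ns / (1.0 - p_flit)


p_low = flit_error_probability(1e-10)    # assumed BER for the first case
p_high = flit_error_probability(1e-2)    # assumed BER for the second case
print(f"{expected_transfer_time_ns(3, 2.0, p_low):.9f} ns")
print(f"p_high = {p_high:.2f}")
print(f"{expected_transfer_time_ns(3, 2.0, 0.28):.2f} ns")
```

Using the unrounded value of `p_high` gives a slightly smaller mean transfer time than 8.33 ns, so the quoted figure should be read as an approximation.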
The above discussion indicates that, as with all other degradable systems, when error-control schemes are used for on-chip interconnects, it may be impossible to measure performance and reliability separately, and they should preferably be measured jointly using the performability metric. Formal definitions of performability have been provided in [11], [12], [21], [22]. However, the performability of a degradable system can be simply defined as [11]: "the probability of completing a given amount of useful work within a specified time interval". Since in an on-chip network interconnect the useful work is to transmit useful bits (by useful bits we mean original data bits, excluding check bits and redundantly transmitted data bits), in this paper we define the performability of an on-chip network interconnect as the probability to transmit $n$ useful bits during the time interval $T$ in the presence of noise. To see how this definition can be used to combine the reliability and performance analysis, again consider the ARQ scheme. The presence of faulty flits (i.e., low reliability problem) in the ARQ scheme necessitates more frequent retransmission of flits, which requires more time and reduces the probability to finish the transmission of a fixed number of useful bits during a fixed time interval (i.e., performability). Also, reducing the bit rate (i.e., low performance problem) increases the time required for sending the flits. This time increase reduces the probability to finish the transmission of a fixed number of useful bits during a fixed time interval (i.e., performability). While the performability of an on-chip interconnect provides a better insight into the performance and reliability of the interconnect, it is not intended to replace the basic metrics of performance and reliability (e.g., BER and operational frequency) with the performability metric.
In fact, as it will be seen in Section II-B1, the performability metric itself should be calculated and obtained from the basic metrics of performance and reliability.
Note that for different applications different levels of performability might be required. For example, in safety-critical applications [12] a system is required to operate correctly with a probability greater than - [12] . Hence, the performability of an interconnect which is used for a safety-critical application must be greater than -. (Note that for all -
.)
The analytical performability models for the communication schemes are presented next.
1) Analytical Performability Models: It has been observed that reduced voltage swing is an effective method to reduce the energy consumption of on-chip interconnects [7]-[9], [14]. Variations in the voltage swing of a channel also lead to variations in the channel delay [7]. When a channel is used at the voltage swing $V_{sw}$, the channel delay $\tau_{ch}$ is given by the model of [7], numbered (2) here, which expresses $\tau_{ch}$ in terms of $V_{sw}$, the driver transistor transconductance (which depends on the driver transistor dimensions and some process parameters), the wire capacitance (the capacitance of each wire in a multi-wire channel), and the threshold voltage of the transistors. Let $\tau_{ec}$ be the additional delay imposed by the error-control circuit (e.g., the encoder and decoder). Then, the interconnect operational frequency is

$$f = \frac{1}{\tau}, \qquad \tau = \tau_{ch} + \tau_{ec} \qquad (3)$$

where $\tau$ is the total delay of the interconnect caused by both the channel and the error-control circuit.
As mentioned in Section II-B, the performability of an on-chip network interconnect is the probability to transmit $n$ useful bits during the time interval $T$ in the presence of noise (faults). Suppose the $n$ bits are put into $k$ flits of length $b$ bits. Since each flit is transmitted in one cycle, the time required for transmitting a flit is $1/f$, where $f$ is the interconnect operational frequency; hence, the maximum number of flits which can be transmitted during the time interval $T$ is

$$M = \lfloor T f \rfloor \qquad (4)$$

When a flit is transmitted over an on-chip network interconnect, the following three cases are possible.
Case 1: Correct flit.
i) The flit is correct (or correctable), and ii) no retransmission is required. In this case, the flit is either fault-free or with a fault that can be corrected (e.g., using error correcting codes) in the receiver without requiring any retransmission.
Case 2: Retransmission requiring flit.
i) The flit is faulty, and ii) a retransmission is initiated. In this case, a fault occurs in a transmitted flit which can be detected. The error-control scheme detects the fault and initiates a retransmission of the flit.
Case 3: Residual faulty flit.
i) The flit is faulty, but ii) no retransmission can be initiated, and iii) the receiver also cannot correct the flit by itself (e.g., using error correcting codes). In this case, a fault occurs in a transmitted flit which cannot be tolerated by the error-control scheme. The probability of this happening is sometimes referred to as the Residual Error Probability [6], [7]. This happens when either 1) the error-control scheme detects a fault but cannot tolerate it, because, for example, the scheme does not support retransmissions (e.g., FEC), or 2) a fault occurs but the error-control scheme cannot detect it, hence no action is taken to tolerate the fault. Let $P_c$, $P_r$, and $P_f$ be the probabilities of Case 1, Case 2, and Case 3, respectively. Since all the possibilities have been considered above, we can write: $P_c + P_r + P_f = 1$. Also, in the schemes which do not have the retransmission capability (FEC and SNFT), since no retransmission is possible, we have $P_r = 0$. As shown in the following, the probabilities $P_c$, $P_r$, and $P_f$ are used to develop performability models for the error-control schemes.
Consider the schemes with retransmission capability (i.e., ARQ and HARQ). Suppose that the transmission of $n$ useful bits (put into $k$ flits) within the time interval $T$ is finished successfully and exactly $j$ faulty flit(s) occur during this transmission. None of these $j$ faulty flits can be a "Residual faulty flit" (Case 3) and they all should be "retransmission requiring flits" (Case 2), because it is supposed that the transmission is finished successfully. Since the retransmission policy is considered to be the Go-Back-N policy with window size $N$, the occurrence of these $j$ faulty flits results in $jN$ more flit transmissions. Therefore, in this case $k + jN$ flit transmissions are required. As mentioned in Section II-A, when a faulty flit occurs, the receiver discards the $N-1$ received flits following the detected faulty flit regardless of whether they were correct or not. In fact, it is not important at all whether these flits are correct (Case 1), retransmission requiring (Case 2), or residual faulty (Case 3), since they will be discarded anyway and the receiver will never use them. Therefore, in this paper these flits are called discarded flits. Because of the occurrence of exactly $j$ faulty flits, a total of $j(N-1)$ flits are discarded. From the remaining $k + jN - j(N-1) = k + j$ non-discarded flits:
a) None of them can be a "Residual faulty flit" (Case 3), because if even one "Residual faulty flit" occurs, the transmission will fail.
b) The last non-discarded flit which is the th nondiscarded flit should be a correct flit (Case 1). Otherwise, the th non-discarded flit is a retransmission requiring flit (Case 2), which means that more flit transmissions are required and hence the th non-discarded flit is not the last nondiscarded flit. Note that the probability of the th nondiscarded flit being correct is . c) From the remaining non-discarded flits, flits should be correct flits (Case 1) because in total (including the th transmitted flit which is discussed above) we require that flits be transmitted successfully. Also the remaining flits should be retransmission requiring flits (Case 2), because it is supposed that exactly faulty flit(s) occur during the transmission. Assuming that all transmitted flits are independent and equally probable to be a correct flit, a retransmission requiring flit, or a residual faulty flit, the probability that flits out of flits are correct flits and the remaining flits are retransmission requiring flits is (5) Therefore, the probability that the transmission (of useful bits which are put into flits) is finished successfully while exactly faulty flit(s) occur during the transmission is (6) As mentioned earlier in this section, the maximum number of flits which can be transmitted during the time interval is , hence . Therefore, the maximum number of faulty flits that may occur during this transmission is (7) Based on the definition of interconnect performability provided in Section II-B, the performability of the error-control schemes which have the retransmission capability (HARQ and ARQ) can be expressed as the probability that the transmission of useful bits (put into flits) within the time interval is finished successfully despite the occurrence of faulty flit(s), where can change from 0 to . 
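The counting argument above can be sketched numerically. The following is a minimal illustration, not the authors' implementation; the function and parameter names (`performability_retx`, `n`, `max_flits`, `window`, `p_c`, `p_r`) are hypothetical, and the time constraint is assumed to enter only through the maximum number of flits that fit in the interval.

```python
from math import comb


def performability_retx(n, max_flits, window, p_c, p_r):
    """Probability that n flits are delivered within the time budget
    under Go-Back-N (ARQ/HARQ), following the counting argument in the text.

    n         -- number of flits carrying the useful bits
    max_flits -- assumed maximum number of flit transmissions that fit
                 in the time interval
    window    -- Go-Back-N window size N
    p_c, p_r  -- per-flit probabilities of Case 1 (correct) and
                 Case 2 (retransmission requiring)
    """
    if max_flits < n:
        return 0.0  # not even one pass over the n flits fits in the interval
    # Each faulty flit costs `window` extra transmissions: n + k*N <= max_flits
    k_max = (max_flits - n) // window
    # Sum over k of C(n+k-1, k) * p_c^n * p_r^k
    return sum(comb(n + k - 1, k) * p_c**n * p_r**k for k in range(k_max + 1))
```

For a single flit with `window=1`, a budget of three transmissions, `p_c=0.9`, and `p_r=0.1`, the sum is 0.9 + 0.09 + 0.009 = 0.999, i.e., the flit succeeds on the first, second, or third attempt.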
Based on (6) and (7), this performability can be written as
$$\sum_{k=0}^{k_{\max}} \binom{n+k-1}{k} P_c^{\,n} P_r^{\,k}, \qquad (8)$$
where $n$ is the number of flits, $k$ is the number of faulty flits, $k_{\max}$ is the maximum number of faulty flits given by (7), and $P_c$ and $P_r$ are the probabilities of Case 1 and Case 2, respectively. In the schemes which do not have the retransmission capability (FEC and SNFT), let $M$ denote the maximum number of flits that can be transmitted during the time interval $T$. When $M < n$, there is not enough time to transmit $n$ flits during the time interval $T$, and therefore the performability is 0. On the other hand, when $M \ge n$, there is enough time to transmit $n$ flits; however, each flit can only be transmitted once and there is no retransmission. (Note that in the schemes which do not have the retransmission capability, we have $P_r = 0$.) Therefore, the transmission of the $n$ flits will be successful if and only if the single transmission of each flit is correct (Case 1), whose probability is $P_c$. Therefore, the performability of the FEC and SNFT schemes is
$$\begin{cases} 0, & M < n \\ P_c^{\,n}, & M \ge n. \end{cases} \qquad (9)$$
2) Probability Models Required for Performability Evaluation: As observed in Section II-B1, to evaluate the performability of an interconnect we need to know the probabilities of Case 1, Case 2, and Case 3, i.e., $P_c$, $P_r$, and $P_f$. The analytical models for these probabilities are presented in this section.
a) Bit Error Rate: For on-chip communications the BER (i.e., the probability that a transmitted bit will be received in error) is affected by the voltage at which the data is transmitted over the channel. This is due to the fact that noise margins decrease as the voltage swing decreases [6], [7]. In the context of on-chip network interconnects, the relevant literature mostly uses the Gaussian noise model [4]-[7], [13], [14]. In this model, it is assumed that all the noise sources collectively induce a noise voltage on the channel which follows a Gaussian distribution with zero mean and variance $\sigma_N^2$. Therefore, the BER is given by
$$\mathrm{BER} = Q\!\left(\frac{V_{sw}}{2\sigma_N}\right), \qquad (10)$$
where $V_{sw}$ is the voltage swing and $Q(x)$ is the Gaussian tail function
$$Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-y^2/2}\, dy. \qquad (11)$$
It is worth mentioning that BER is very sensitive to voltage swing variations, so that as the voltage swing decreases, BER may increase by several orders of magnitude [31].
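The sensitivity of BER to the voltage swing can be illustrated with a short sketch. It assumes the Gaussian-noise expression commonly used in this literature, BER = Q(V_sw / (2 sigma)), with the tail function written in terms of the complementary error function; the names `q_function` and `bit_error_rate` and the example voltages are hypothetical and are not values from this paper.

```python
from math import erfc, sqrt


def q_function(x):
    """Gaussian tail function Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * erfc(x / sqrt(2.0))


def bit_error_rate(v_swing, sigma):
    """BER under the assumed Gaussian noise model: Q(V_sw / (2 * sigma)).

    v_swing -- voltage swing in volts
    sigma   -- standard deviation of the noise voltage in volts
    """
    return q_function(v_swing / (2.0 * sigma))


if __name__ == "__main__":
    # Illustrative values only: halving the swing at fixed noise level
    for v in (1.0, 0.5):
        print(v, bit_error_rate(v, 0.1))
```

With these illustrative numbers, halving the swing moves the argument of Q from 5 to 2.5, which raises the BER from roughly 3e-7 to roughly 6e-3, a change of about four orders of magnitude.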
b) Probabilities of Retransmission Requiring and Residual Faulty Flits:
For each scheme (SNFT, ARQ, FEC, and HARQ) we have analyzed the probabilities $P_c$, $P_r$, and $P_f$ (of a correct, a retransmission requiring, and a residual faulty flit, respectively) as follows:
SNFT Scheme: In the SNFT scheme, a flit will be a correct flit if and only if all of its bits are correct and intact, therefore the probability of a flit being a correct flit is
$$P_c = (1 - \mathrm{BER})^{L_{\mathrm{SNFT}}}, \qquad (12)$$
where $L_{\mathrm{SNFT}}$ is the flit size (in bits). Since the SNFT scheme does not have the retransmission capability, we have
$$P_r = 0 \qquad (13)$$
and hence
$$P_f = 1 - P_c. \qquad (14)$$
ARQ Scheme: As mentioned in Section II-A, in the ARQ scheme, an error detecting code is used to detect faulty flits (retransmission requiring flits) so that they can be retransmitted. Cyclic redundancy check (CRC) codes are error detecting codes that are widely used in communications links [23] and in particular are used for implementing the ARQ scheme for on-chip network interconnects [6], [7], [16]. Similarly, in this paper we consider the ARQ schemes which are based on CRC codes. It should be emphasized that CRC codes can only be used for fault detection and they cannot be used for fault correction by themselves (i.e., without retransmission). In the ARQ scheme, like in the SNFT scheme, a flit will be a correct flit if and only if all of its bits are correct and intact, therefore the probability of a flit being a correct flit is
$$P_c = (1 - \mathrm{BER})^{L_{\mathrm{ARQ}}}, \qquad (15)$$
where $L_{\mathrm{ARQ}}$ is the flit size (in bits) in the ARQ scheme. It has been shown that the residual error probability of a CRC code can be expressed as [23]
$$P_f = \sum_{w=d_{\min}}^{L_{\mathrm{ARQ}}} A_w\, \mathrm{BER}^{w} (1 - \mathrm{BER})^{L_{\mathrm{ARQ}} - w}, \qquad (16)$$
where $d_{\min}$ denotes the minimum Hamming distance of the CRC code, and $A_w$ is the number of code words with Hamming weight $w$. There are various CRC codes that differ in their generator polynomial, and for a specific CRC code, the $d_{\min}$ and $A_w$ parameters depend on the generator polynomial [23] as well as the flit size. In this paper, in all experiments and case studies, it is assumed that each flit contains 32 data bits, excluding the check bits. Also, in all experiments and case studies (Section III), we consider a standard CRC code with the generator polynomial $x^8 + x^5 + x^4 + x^3 + 1$ (called DARC-8 [24]).
Therefore, we developed software to evaluate the $d_{\min}$ and $A_w$ parameters for this CRC code, and we obtained: . Based on (15) and (16), we have
$$P_r = 1 - P_c - P_f. \qquad (17)$$
FEC Scheme: For the FEC scheme, a flit is considered faulty when it has more than one erroneous bit. Those flits which have only one erroneous bit are not considered as faulty flits, since they are recoverable by the receiver. Therefore, the probability of a flit being a correct flit is
$$P_c = (1 - \mathrm{BER})^{L_{\mathrm{FEC}}} + L_{\mathrm{FEC}}\, \mathrm{BER}\, (1 - \mathrm{BER})^{L_{\mathrm{FEC}} - 1}, \qquad (18)$$
where $L_{\mathrm{FEC}}$ is the flit size (in bits) in the FEC scheme. Since the FEC scheme does not have the retransmission capability, we have
$$P_r = 0 \qquad (19)$$
and hence
$$P_f = 1 - P_c. \qquad (20)$$
HARQ Scheme: For the HARQ scheme, like the FEC scheme, a flit is considered faulty when it has more than one erroneous bit. Those flits which have only one erroneous bit are not considered as faulty flits, since they are recoverable by the receiver without requiring any retransmission. Hence, the probability of a flit being a correct flit is
$$P_c = (1 - \mathrm{BER})^{L_{\mathrm{HARQ}}} + L_{\mathrm{HARQ}}\, \mathrm{BER}\, (1 - \mathrm{BER})^{L_{\mathrm{HARQ}} - 1}, \qquad (21)$$
where $L_{\mathrm{HARQ}}$ is the flit size (in bits) in the HARQ scheme. Assuming that the error correction code can also be used for double-bit error detection (e.g., overlapping parity bits [6]), the residual error probability can be expressed as [15]
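The per-flit probabilities described in this subsection can be sketched as follows. This is a hedged illustration under the stated independence of bit errors; all function names are hypothetical, and the weight distribution passed to `crc_residual_probability` is an input the caller must supply (the values for DARC-8 are not reproduced here). The HARQ residual error expression is not covered.

```python
def p_correct_all_bits(ber, flit_bits):
    """SNFT and ARQ: a flit is correct only if every bit is received intact."""
    return (1.0 - ber) ** flit_bits


def p_correct_single_error_correcting(ber, flit_bits):
    """FEC and HARQ: a flit with zero or exactly one erroneous bit counts
    as correct, since a single-bit error is repaired at the receiver."""
    no_error = (1.0 - ber) ** flit_bits
    one_error = flit_bits * ber * (1.0 - ber) ** (flit_bits - 1)
    return no_error + one_error


def crc_residual_probability(ber, flit_bits, weight_counts):
    """Undetected-error probability of a linear detecting code.

    weight_counts maps each nonzero Hamming weight w to the number of
    code words A_w of that weight (supplied by the caller)."""
    return sum(
        a_w * ber**w * (1.0 - ber) ** (flit_bits - w)
        for w, a_w in weight_counts.items()
    )


def arq_flit_probabilities(ber, flit_bits, weight_counts):
    """Return (p_c, p_r, p_f) for a detection-plus-retransmission scheme."""
    p_c = p_correct_all_bits(ber, flit_bits)
    p_f = crc_residual_probability(ber, flit_bits, weight_counts)
    return p_c, 1.0 - p_c - p_f, p_f
```

As a toy check, a 3-bit even-parity code has three nonzero code words, all of weight 2. With a BER of 0.1 this gives a correct-flit probability of 0.729, a residual error probability of 0.027, and a retransmission probability of 0.244, which matches the probability of an odd number of bit errors.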
C. Energy Consumption Model
The dynamic energy consumption of an on-chip wire per bit is [4] , [8] , [14] ( 24) where is the switching activity (the probability that the logic value of a channel wire changes), is the wire capacitance (the capacitance of each wire in a multi-wire channel), and is the supply voltage.
While the driver inverter (denoted by Inv1 in Fig. 2 ) dissipates the dynamic energy of (24) to charge and discharge the wire capacitance , it dissipates only a small amount of static energy because when there is no input transition, one of its transistors is always cutoff. This is, however, not true for the receiver inverter (denoted by Inv2 in Fig. 2) , whose transistors may never be cutoff because of a low input voltage swing [8] . When the wire shown in Fig. 2 is used at voltage swing , the voltage values and on the wire represent logic-0 and logic-1, respectively. Assuming that the receiver inverter is symmetric ( and ), for both of the wire voltages the current which flows through the receiver inverter is the same. Hence, we consider only the case where the wire voltage is , i.e., a logic-0 is on the wire. When is less than the threshold voltage , the N-transistor of the receiver inverter is cutoff and hence only the subthreshold leakage current flows through the inverter. However, when is greater than , the N-transistor and P-transistor of the receiver inverter are in the saturated and linear regions, respectively; hence a considerable current flows through the inverter. This current can be calculated as (25) where is the transistor beta parameter. The energy consumption per bit, dissipated by this current is (26) Another important source of energy consumption in on-chip interconnects is the error-control circuit. The energy consumption of the error-control circuit has two components: static and dynamic. Let be the static power of the error-control circuit. Since each flit is transmitted in one cycle, the static energy consumption per flit is , where is the interconnect operational frequency given by (3) . Hence, the static energy per bit is (27) where is the flit size (in bits). Let be the dynamic energy consumption per bit. 
The total energy per bit which is consumed by the error-control circuit can be written as (28) Note that the dynamic energy consumption per bit is frequency independent, because to process a bit of data a certain number of signal transitions are required regardless of the rate at which the circuit processes data.
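The error-control circuit energy described above (static power spread over one flit per cycle, plus a frequency-independent dynamic term) can be sketched as below. The function and parameter names are hypothetical and the numbers in the usage note are illustrative, not values from Table I.

```python
def error_control_energy_per_bit(static_power_w, freq_hz, flit_bits,
                                 dynamic_energy_per_bit_j):
    """Total error-control circuit energy per bit, in joules.

    One flit is transmitted per cycle, so the static energy per flit is
    static_power / frequency; dividing by the flit size gives the static
    energy per bit. The dynamic energy per bit does not depend on frequency.
    """
    static_energy_per_flit = static_power_w / freq_hz
    static_energy_per_bit = static_energy_per_flit / flit_bits
    return static_energy_per_bit + dynamic_energy_per_bit_j
```

For example, 1 mW of static power at 1 GHz with 40-bit flits contributes 25 fJ of static energy per bit; adding a dynamic energy of 100 fJ per bit gives 125 fJ per bit. Lowering the operational frequency increases only the static term.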
Considering all the sources of energy consumption [(24), (26), and (28)], the total energy consumption per bit which is consumed by both the channel wires and error-control circuit is (29) Suppose that the transmission of the useful bits (put into $n$ flits) within the time interval $T$ is finished successfully. When the Go-Back-$N$ policy is used for the schemes with retransmission capability (ARQ and HARQ), if $k$ faulty flit(s) occur during the transmission, $n + kN$ flit transmissions will be required (Section II-B1). Since the probability that $k$ faulty flit(s) occur during the transmission is given by (6), the expected number of total flit transmissions (including the original flit transmissions as well as the retransmissions) is (30) where $k_{\max}$ is given by (7). Therefore, for the retransmission-based schemes (ARQ and HARQ), the expected energy consumption required for the successful transmission of $n$ flits during the time interval $T$ is (31) where $L$ is the flit size (in bits), equal to either $L_{\mathrm{ARQ}}$ or $L_{\mathrm{HARQ}}$, depending on which retransmission-based error-control scheme (ARQ or HARQ) is considered.
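A rough sketch of the expected-transmission calculation follows. Two points are assumptions rather than statements from the text: the expectation is conditioned on successful completion (the weights are normalized by the total success probability), and the energy is taken as transmissions times flit size times energy per bit. All names are hypothetical.

```python
from math import comb


def expected_flit_transmissions(n, max_flits, window, p_c, p_r):
    """Expected number of flit transmissions (originals plus retransmissions)
    under Go-Back-N, given that the transfer completes within the budget.

    ASSUMPTION: the average is conditioned on success, i.e., normalized
    by the sum of the per-k success probabilities.
    """
    if max_flits < n:
        raise ValueError("time budget too small to send n flits even once")
    k_max = (max_flits - n) // window
    # Probability of finishing successfully with exactly k faulty flits
    probs = [comb(n + k - 1, k) * p_c**n * p_r**k for k in range(k_max + 1)]
    success = sum(probs)
    if success == 0.0:
        raise ValueError("successful completion has zero probability")
    # k faulty flits cost n + k*N transmissions in total
    weighted = sum((n + k * window) * p for k, p in enumerate(probs))
    return weighted / success


def expected_energy(n, max_flits, window, p_c, p_r, flit_bits, energy_per_bit_j):
    """Expected energy in joules: transmissions x bits per flit x energy per bit."""
    transmissions = expected_flit_transmissions(n, max_flits, window, p_c, p_r)
    return transmissions * flit_bits * energy_per_bit_j
```

With no retransmission requiring flits (`p_r = 0`), the expected count reduces to `n`, which matches the retransmission-free case where each flit is sent exactly once.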
In the retransmission-free schemes (FEC and SNFT), each flit is transmitted only once. Therefore, in these schemes, the energy consumption required for the successful transmission of flits during the time interval is (32) where is the flit size (in bits), and is equal to either or , depending on which retransmission-free communication scheme (FEC or SNFT) is considered.
III. EVALUATION OF THE ERROR-CONTROL SCHEMES
In this section we will evaluate the different error-control communication schemes as well as the non-fault-tolerant one for energy consumption and performability. To ensure a fair comparison between the different schemes, we first estimate the 
A. Energy Overhead of Error-Control Circuitry
To analyze the energy overhead of the error-control circuits, we synthesized the error-control circuits into 45 nm SPICE models. The simulations were carried out using 45 nm PTM technology [18], [19] ( V). Note that 45 nm technology is used by way of example; the models developed in this work are generic and can be used for other technologies. A cyclic redundancy code (DARC-8) with the generator polynomial $x^8 + x^5 + x^4 + x^3 + 1$ [24] was used for the ARQ scheme, while overlapping parity methods [12] were used for the FEC and HARQ schemes. CRC circuitry can be easily implemented using a Linear Feedback Shift Register (LFSR). However, the LFSR-based implementation is serial and therefore unsuitable for parallel communication interconnects. Therefore, a Parallel Bit Code Generator (PBCG) method [13] was employed for carrying out CRC checking. The aim of the SPICE experiments was to obtain the energy and power values needed by the analytical models of Section II-C, i.e., (27) and (28). For (27), we needed to evaluate the static power of the error-control circuit, and for (28), we needed to evaluate its dynamic energy per bit. For the evaluation of dynamic energy per bit, random data bits were encoded and decoded. Each flit contained 32 useful bits as well as redundant check bits. It was assumed that all data combinations are equally probable to be transmitted. (This is a simplifying assumption, but the same methodology can be applied to any data pattern.) In order to determine the interconnect operational frequency we also needed to evaluate the delay of the error-control circuits (see Section III-B). The values of energy consumption and circuit delays were obtained using SPICE transient analysis. The simulation results are shown in Table I.
At first glance, an error correction circuit might be expected to be more complex than an error detection circuit, because an error correction circuit not only detects the faults but also corrects them. However, an error detection circuit with high error detection capability may be even more complex than an error correction circuit with relatively lower error detection capability. For example, consider the error detection and error correction circuits that are considered in this paper, i.e., the DARC-8 and overlapping parity circuits, respectively. The DARC-8 circuit is only able to detect errors and cannot correct them; however, thanks to its complex hardware, it provides a higher error detection capability than the overlapping parity circuit. In fact, DARC-8 is more effective in detecting multiple-bit errors than the overlapping parity method, so that the residual error probability of the overlapping parity method is worse than that of DARC-8 (for example, assuming that V, and each flit contains 32 useful bits, the residual error probabilities of the overlapping parity method and DARC-8 are and , respectively). This is why, in Table I, the energy consumption of the DARC-8 circuit is comparable to that of the overlapping parity circuit. It should be noted that there are various CRC circuits with different generator polynomials that differ in complexity and detection capability. As compared to CRC circuits with fairly simple generator polynomials (e.g., those considered in [6]), DARC-8 (with the generator polynomial $x^8 + x^5 + x^4 + x^3 + 1$) has more complex hardware and consumes relatively more power but provides a better error detection capability.
Another noticeable point in Table I is that although both the HARQ and FEC schemes use the overlapping parity method, the energy consumption of the HARQ error-control circuit is higher than that of the FEC error-control circuit. This is because the HARQ scheme requires more hardware resources to provide the retransmission capability. For example, as mentioned in Section II-A, the HARQ scheme requires buffering resources to store a copy of those flits that have been transmitted but whose ACKs have not yet been received. Note that this paper does not intend to provide a study of the hardware complexity (area overhead) of the error-control schemes; rather, the main aim of this work is to identify and select an appropriate error-control scheme capable of meeting the required performability and energy objectives under a given time constraint and noise power (Section I). Some information on the hardware complexity (area overhead) of the error-control schemes can be found in [6] and [16]. It is worth mentioning that some reasonable estimates of the hardware complexity (area overhead) can be obtained intuitively. For example, the hardware and area overhead of the HARQ error-control circuit is greater than that of the ARQ scheme. This intuitive conclusion is in line with the observations reported in [16].
B. Analysis of Performability/Energy Tradeoff
In this analysis, we make the following assumptions: the wire capacitance is pF (a few millimeters long wire in 45 nm technology [20] ). Threshold and supply voltage of the circuit are V and V, respectively; Gaussian noise variance is V. Furthermore, we consider a switching activity of (all transmitted bits are independent and equally probable to be 0 or 1). The amount of data that has to be transmitted consists of useful bits, which have been split into flits, each containing 32 useful bits. It is also assumed that these data bits need to be transferred during the time interval nS. In Sections III-B1, III-B2, and III-B3, we will respectively examine the impact of the noise level , the wire capacitance (wire length), and the parameter (time constraint) on the performability/energy tradeoff in the communication schemes.
Since DARC-8 has been used for the ARQ scheme, the flit size in the ARQ scheme is bits. Also since overlapping parity methods have been used for the HARQ and FEC schemes, the flit size in the HARQ and FEC schemes is bits. Assuming that, in the ARQ and HARQ schemes, the channel and the "Retransmission Request" line shown in Fig. 1 operate in parallel and neither of them is pipelined (i.e., at any time instant, just one flit is transmitted over the channel and just one ACK/NACK is transmitted over the "Retransmission Request" line), the window size for the Go-Back-$N$ policy is (for more information on window size, refer to [30]).
Using the analytical models developed in Section II (i.e., (8) , and (31) for the ARQ and HARQ schemes and (9), and (32) for the FEC and SNFT schemes), Table II shows how the energy consumption (consumed by both the channel wires and the error-control circuit) and the performability of the communication schemes change as the voltage swing changes. These energy and performability values are also shown in Fig. 3 as performability/energy tradeoff curves. Three main observations are made from Fig. 3 .
The maximum achievable performability (at the maximum voltage swing V) from the SNFT scheme is less than -, while error-control schemes can provide much better performabilities, i.e., significantly greater than -. Therefore, the usage of error-control schemes is essential in noisy environments to achieve a highly reliable communication. This observation is in line with previous works [6] , [7] , [16] .
For a given performability constraint, the HARQ scheme consumes the least energy when compared with the other error-control schemes. For example, if we require a performability more than -, we can use the ARQ scheme with the voltage swing V. However, if we use the HARQ scheme with the voltage swing V, we will achieve the required performability but with 10.6% energy saving. Note that none of the previous works [6], [7], [16] has reached the same conclusion.
Fig. 3. Performability/energy tradeoff.
While the maximum achievable performability (at the maximum voltage swing V) from the FEC and ARQ schemes are aboutand -, respectively, the maximum achievable performability from the HARQ scheme is much higher-about -. Again note that none of the previous works [6] , [7] , [16] has reached the same conclusion.
1) Influence of Noise Power on Performability/Energy Tradeoff:
There are some techniques and CAD tools to evaluate the noise immunity of a given digital circuit [36] , [37] , however these techniques and tools do not provide information on noise power and BER values. Indeed, it has been observed that noise power varies for different applications and environments [5] , [7] , [31] , so that the related literature often considers different ranges of possible noise power values rather than a specific noise power. For example, in [31] two different noise power values, V and V, are considered for logic gates with V. Using (10), when V, as the noise power increases from 0.3 V to 0.5 V, BER increases from to . As another example, in [7] it is considered that for an on-chip interconnect in a 90-nm technology (with V), the noise power varies from 0.04 V to 0.1 V. Again, using (10), when V (the maximum voltage swing), as the noise power increases from 0.04 V to 0.1 V, BER increases from to . In DSM technologies, on-chip interconnects are becoming increasingly susceptible to noise [6] , [16] , hence we expect relatively higher BER values. However, in this paper, the intention is not to consider any specific noise power value; rather we aim to analyze how the effectiveness of the error-control schemes change as the noise power changes. Therefore, in this section we consider a wide range of noise power values between two extreme cases. Fig. 4 shows the performability/energy tradeoff of the communication schemes when the noise power varies between the following excessively low and excessively high noise power values.
V [ Fig. 4(a) ]: In this case the noise is so weak that no error control is required. This is because as it can be seen from Fig. 4(a) , the SNFT scheme can provide a performability of -, which is very close to 1. Considering the definition of performability (Section II-B), a performability of -means that the transmission of the given amount of data within the given time interval will be finished successfully with the probability of -. Since this probability is very close to 1, it is not necessary to improve the performability and hence the use of error-control schemes (ARQ, FEC, and HARQ) is unnecessary. Note that even in safety-critical applications, a system is required to operate correctly with a probability greater than - [12] . V [ Fig. 4(f) ]: In this case the noise is so strong that the interconnect fails despite the use of error-control schemes. For example, it can be seen from Fig. 4(f) that when V, the maximum achievable performability is about -(HARQ, V). A performability of 0.00574 means that the transmission of the given amount of data within the given time interval will be finished successfully with the probability of 0.00574. This probability is very low and indicates that the interconnect most likely (with a probability of 0.99426) fails to transmit the given amount of data within the given time interval.
It should be noted that Fig. 4 covers a wide range of BER values from (when V and V) to (when V and V). This range includes the BER values that have been considered in the literature with respect to on-chip network interconnects [6] , [7] .
The following two interesting observations can be made from Fig. 4 .
It can be seen from Fig. 4 that when the noise power is low [ Fig. 4(a) and (b) ], the ARQ scheme is more effective than the FEC scheme (the ARQ curve is below the FEC curve). However, as the channel becomes more noisy [ Fig. 4(c), (d) , (e), and (f)], the ARQ scheme becomes less advantageous than the FEC scheme. We clarify this by means of the following example. -When V [ Fig. 4(b) ], if we use the FEC scheme with the voltage swing V, we will achieve a performability of -. However, if we use the ARQ scheme with the voltage swing V, we will achieve the same performability but with 4.3% energy saving. -When V [ Fig. 4(c) ], if we use the FEC scheme with the voltage swing V, we will achieve a performability of about -. If we use the ARQ scheme with the voltage swing V, we will achieve the same performability but with 1.6% more energy consumption.
-When V [ Fig. 4(d) ], if we use the FEC scheme with the voltage swing V, we will achieve a performability of about -. If we use the ARQ scheme with the voltage swing V, we will achieve the same performability but with 9.4% more energy consumption.
In short, as the noise power increases, the energy saving of the FEC scheme over the ARQ scheme improves. This is because strong noise can repeatedly affect the retransmitted flits. Therefore, a simple retransmission scheme (i.e., ARQ) is not suitable for a very noisy channel.
While the maximum achievable performabilities (at V) decrease with the increase in noise power, the maximum achievable performability from the HARQ scheme is always significantly higher than what is achievable from the other schemes. For example, when V [ Fig. 4(c) ], the maximum achievable performabilities from the SNFT, FEC and ARQ schemes are about --and -, respectively, but the maximum achievable performability from the HARQ scheme is about -. This shows the importance of the HARQ scheme.
2) Influence of Wire Length on Performability/Energy Tradeoff: The capacitance of an on-chip interconnect is directly proportional to its length [6] , [20] . Since the length of interconnects varies for different on-chip networks [32] , a wide range of interconnect capacitances is considered in the related literature. For example, in [6] two different interconnect capacitance values are considered for a 180-nm technology: pF (a few millimeter long wires in a 180-nm technology) and pF (a wire of about 1 cm in a 180-nm technology). In [7] , a capacitance of 2.73 pF is considered for an on-chip interconnect in a 90-nm technology (a wire of about 1 cm in a 90-nm technology). In this paper, we do not consider any specific capacitance value; rather we analyze how the effectiveness of the error-control schemes change as the interconnect capacitance (length) changes. For this purpose, in this section we assume that the interconnect capacitance varies from 0.01 pF to 1 pF. Based on the information provided in [20] , in a 45-nm technology, a capacitance of 0.01 pF corresponds to an interconnect length of about 0.05 mm and a capacitance of 1 pF corresponds to an interconnect length of about 5 mm. This assumption about the interconnect length (i.e., from 0.05 mm to 5 mm) is realistic as the interconnect length in on-chip networks usually lies within this range [32] . Fig. 5 shows the performability/energy tradeoff of the communication schemes when the interconnect capacitance (wire length) varies from 0.01 pF to 1 pF. Two main observations are made from Fig. 5 .
It can be seen from Fig. 5 that when pF [ Fig. 5(a) ], the HARQ scheme consumes less energy than the ARQ and FEC schemes (the HARQ curve is below the ARQ and FEC curves). However, as the wire capacitance (wire length) decreases [ Fig. 5(b) and (c) ], the energy saving of the HARQ scheme over the ARQ and FEC schemes decreases. We clarify this by means of the following example: Suppose we require a performability of -. To achieve this level of performability:
-When pF [ Fig. 5(a) ], we can use the ARQ scheme with the voltage swing V and the HARQ scheme with the voltage swing V. However, at these voltage settings, the HARQ scheme offers 10.6% energy saving as compared to the ARQ scheme.
-When pF [ Fig. 5(b) ], we can use the ARQ scheme with the voltage swing V and the HARQ scheme with the voltage swing V. However, at these voltage settings, the HARQ scheme offers 2.4% energy saving as compared to the ARQ scheme. In fact, it can be seen from Fig. 5(b) that when pF, the FEC, ARQ and HARQ curves become very close to each other (for example, the FEC and HARQ curves cross each other) which means that there is no considerable difference between the energy consumption of the three schemes.
-When pF [ Fig. 5(c) ], we can use the ARQ scheme with the voltage swing V and the HARQ scheme with the voltage swing V. In this case, the HARQ scheme consumes 11.4% more energy than the ARQ scheme.
In short, with the performability constraint of -, as the wire capacitance decreases from 1 pF to 0.01 pF, the energy saving of the HARQ scheme over the ARQ scheme falls from 10.6% to -11.4%, i.e., the HARQ scheme ends up consuming 11.4% more energy. This is mainly because, as can be seen from Table I, the energy consumption of the HARQ error-control circuit is higher than that of the ARQ error-control circuit. In interconnects made up of long wires, the main portion of the energy is consumed by the wires and not by the error-control circuit; hence, the difference between the energy consumption of the ARQ and HARQ error-control circuits is negligible. However, as the wire length decreases, the energy consumption of the error-control circuits becomes a significant portion of the total energy; hence the energy saving of the HARQ scheme over the ARQ scheme decreases because of the higher energy consumption of the HARQ error-control circuit. It should be noted that when short wires are used, although the energy consumption of the HARQ scheme is higher than that of the ARQ and FEC schemes, the HARQ scheme may still be preferable to the ARQ and FEC schemes because the maximum achievable performability from the ARQ and FEC schemes is much lower than is achievable with the HARQ scheme. For example, when the wire capacitance is 0.01 pF [ Fig. 5(c) ], the maximum achievable performabilities from the FEC and ARQ schemes are aboutand -, respectively, while the maximum achievable performability from the HARQ scheme is about -.
It can be seen from Fig. 5 that as the wire capacitance (wire length) decreases, the slope of the curves decreases so that in Fig. 5(c), the curves are close to being horizontal. This means that as the wire capacitance decreases, the effectiveness of reducing the voltage swing decreases. For example, in Fig. 5(c), when the voltage swing of the HARQ scheme decreases from 0.5 V to 0.36 V, the energy consumption only decreases from 11.73 pJ to 11.29 pJ, while the performability decreases considerably fromto -.
This is because, when an interconnect is made up of short wires, the energy consumed by the wires is only a small portion of the total interconnect energy and the main portion of the energy is consumed by the error control circuit. In this case, reducing the voltage swing can only achieve a negligible energy saving, while it still has a considerable negative impact on the interconnect performability. Therefore, reducing the voltage swing is not suitable when short wires are used.
3) Influence of Time Constraints on Performability/Energy Tradeoff: So far, we have analyzed the performability nS . Assuming that is constant, for the applications which do not have tight time constraints, we can analyze the performability for relatively large values. However, for the applications with tight time constraints, smaller values have to be considered. In order to study the impact of the time constraints on the efficiency of the error-control schemes, Fig. 6 shows the performability/energy tradeoff of the communication schemes when nS, i.e., in Fig. 6 , we consider the performability nS . Two key observations are made from Fig. 6 .
When we compare Fig. 3 ( nS) with Fig. 6 ( nS), it can be seen that when nS (relaxed time constraint), the ARQ scheme is more effective than the FEC scheme. However, when nS (tight time constraint), the ARQ scheme becomes less advantageous than the FEC scheme. For example, when nS (tight time constraint), the maximum achievable performability from the ARQ scheme is about -. However, if we use the FEC scheme with the voltage swing V, we will achieve not only a performability more thanbut also 7% energy saving. This is because the ARQ scheme only relies on retransmissions to tolerate faults. Therefore, when tight time constraints are imposed, the ARQ scheme has relatively less time to retransmit faulty flits and hence its performability decreases. However, imposing tight time constraints does not have a similar negative impact on the FEC scheme, as it does not use retransmissions. Reference [6] has studied energy/reliability tradeoff and reported that for the same constraint on system reliability, the ARQ scheme consumes less energy than FEC. This is true and our observation is in agreement with it ( Fig. 3) but only when we do not require high performance (relaxed time constraints). It can be seen from Fig. 6 that when we require high performance (tight time constraints), the ARQ scheme is less effective than the FEC scheme.
When we compare Fig. 3 ( nS) with Fig. 6  (  nS) , it can be seen that when nS (relaxed time constraint), the HARQ scheme is more effective than FEC. However, when nS (tight time constraint), the HARQ scheme becomes less effective than the FEC scheme (the FEC curve is below the HARQ curve in Fig. 6 ). In fact, when nS (tight time constraint), the HARQ scheme does not have enough time to retransmit faulty flits and hence, just like the FEC scheme, it can only correct single-bit errors at the receiver without any retransmissions. Therefore, as it can be seen from Fig. 6 , when the voltage swings of the FEC and HARQ schemes are the same, they provide almost the same performabilities. Since the energy consumption of the HARQ error-control circuit is more than that of the FEC error-control circuit (Table I) , when the voltage swings of both the schemes are the same, although they provide almost the same performabilities, the HARQ scheme consumes more energy than the FEC scheme.
IV. CONCLUDING REMARKS AND FUTURE WORKS
In this paper, we have argued that the use of error-control schemes in on-chip networks results in degradable systems; hence, performance and reliability must be measured jointly using the "Performability" metric. We have analyzed the impact of three error-control schemes on the tradeoff between performability and energy in on-chip networks when voltage swing, noise power, wire length (wire capacitance), and time constraint vary. This distinguishes our work from previous works [6], [7], [16], none of which has addressed the degradable nature of on-chip interconnects or the performability metric.
Since noise power and time constraint vary for different applications and environments, and wire length varies for different on-chip interconnects, the impacts of these three factors (noise power, time constraint, and wire length) on the effectiveness of the error-control schemes have been analyzed in this paper. This analysis shows the following.
The maximum achievable performability (at the maximum voltage swing) from the HARQ scheme is always higher than (or almost equal to) what is achievable from the other schemes (Figs. 3-6).
For a given performability constraint, the HARQ scheme consumes the least energy when compared with the other error-control schemes, except when short wires are used [Fig. 5(c)] or when tight time constraints are imposed (Fig. 6).
When short wires are used [Fig. 5(c)], the HARQ scheme provides the best performability and consumes the most energy, whereas the FEC scheme provides the least performability and consumes the least energy among the error-control schemes. It is worth mentioning that when short wires are used, reducing the voltage swing is not suitable.
When tight time constraints are imposed (Fig. 6), the HARQ and FEC schemes provide almost the same performabilities, and both can provide better performabilities than the ARQ scheme. However, since the FEC scheme consumes less energy than the HARQ scheme, the FEC scheme is preferable to the HARQ scheme.
Although we have analyzed a number of factors that have significant impacts on the performability/energy tradeoff in the communication schemes (i.e., voltage swing, noise power, wire length, and time constraint), other factors may also affect this tradeoff. Future work mainly involves analyzing these other factors. For instance, it is becoming common in deep submicron designs to use repeaters for on-chip interconnects [29]. These repeaters influence the delay and energy consumption of on-chip interconnects [29]. Therefore, an interesting topic for future work is to investigate the impact of repeaters on the performability/energy tradeoff. Another interesting topic is to consider the use of error-control schemes (ARQ, FEC, and HARQ) for current-mode interconnects [33], [34] and to analyze their performability/energy tradeoffs.
