Abstract -Power consumption and delay are two of the most important constraints in current-day on-chip bus design. The two major sources of dynamic power dissipation on
INTRODUCTION
capacitive crosstalk [2]-[7]. Since both the power consumed and the delay incurred by a system bus increase with the coupling and self capacitances, minimizing these capacitances is a major challenge in modern DSM designs.
The problem of reducing capacitive crosstalk effects on buses depends on the transitions (toggle in the value) on the bus lines. The bus lines are classified as aggressor lines and victim lines depending on the transition activity of the signal they carry. The effect of an aggressor on a victim depends on a number of factors, and not every aggressor will inject an appreciable amount of noise into a victim. In addition, the crosstalk effect may cause a switching (toggling) wire to inject a noise bump on an adjacent silent (non-toggling) wire leading to functional defects. The Miller effect suggests that the crosstalk capacitance varies with the switching behavior of a victim wire and its neighbors [8] .
For example, consider three wires that run in parallel for a significant distance at minimum spacing. If the middle wire switches from low to high while its neighbors simultaneously switch from high to low, the effective coupling capacitance seen by the middle wire is doubled compared to the case when the neighbors are quiet. On the other hand, if all three wires switch simultaneously in the same direction, the effective coupling capacitance of the middle wire becomes zero.
It has been shown that the delay and power of a long bus are strong functions of the coupling capacitance between the wires [5]. Figure 2 shows the effect of coupling for a bus of width three and varying lengths, implemented in 65nm technology. The bus model was simulated using the Eldo tool of Mentor Graphics. The 65nm technology parameters for this simulation were derived from [9] and are listed in Table X. In this paper, a new temporal encoding scheme is proposed, which uses self-shielding memory-less codes to completely eliminate Class 5 and Class 6 crosstalk effects [5] and hence significantly reduces the power consumption and delay of the bus.
ANALYTICAL MODELS FOR DELAY AND ENERGY CONSUMPTION
Analytical models for estimating crosstalk-related delay and energy consumption in DSM buses have been proposed in [5], [10], [11]. These models are extensions of the standard Elmore delay model that account for an arbitrary number of lines driven by independent sources and a distributed coupling component. The effect of transition patterns is also taken into account, enabling the estimation of the delay on a sample-by-sample basis instead of considering a single worst-case scenario. In this paper, these analytical models are used to perform a qualitative analysis of the proposed approach and to compare it with normal bus transmission. Figure 3 shows a model for the drivers and buses in DSM technologies. Consider a bus with $n$ lines. Let $d^t = (d^t_1, d^t_2, \ldots, d^t_n)$ denote the $t$-th $n$-bit data transmitted on the bus. The delay for transmitting the $(t+1)$-th data on the bus is given in [5] in terms of the per-line delays $T_k(d^t, d^{t+1})$, $1 \le k \le n$, where $R_T$ is the total resistance, $C_L$ is the ground capacitance, $C_I$ is the inter-wire capacitance, and $\lambda = C_I / C_L$.
Similarly, the total energy, $E(d^t, d^{t+1})$, consumed during the transmission of $d^{t+1}$ is given in [11], [12]. From these formulas, it is seen that the delay and energy consumed by a wire $k$ during a transmission depend not only on the state of wire $k$ but also on the states of the lines adjacent to it. Table I classifies the crosstalk delay effect on a wire $k$ into six classes, depending upon the state of wire $k$ (the middle wire) and its adjacent wires [5], [12]. From this table, it is clear that the worst-case crosstalk delay for transmitting data items is $C_L R_T (1 + 4\lambda)$ (Class 6 in Table I), followed by Class 5. An exactly similar classification can be made for the energy consumption on a wire $k$. Here too, Class 6 crosstalk on a wire results in the maximum energy consumption, followed by Class 5. Table II shows the number of coupling transitions for different transition patterns. Here again, the Class 5 and Class 6 transition patterns yield the maximum number of coupling transitions. This suggests that eliminating Class 5 and Class 6 crosstalk patterns reduces both the delay and the energy consumption of the bus. This is precisely what is attempted by the technique proposed in this paper.
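The dependence of the middle-wire delay on its neighbours can be illustrated with a small sketch. It assumes the commonly used form of the model in [5], in which a switching wire sees a delay of $C_L R_T (1 + c\lambda)$ and each neighbour contributes 0, 1 or 2 to $c$ depending on whether it switches in the same direction, stays quiet, or switches in the opposite direction. The function names are hypothetical, and the mapping of $c = 3$ and $c = 4$ to Class 5 and Class 6 is inferred from the worst-case value quoted above, not taken from Table I.

```python
def coupling_count(left, mid, right):
    """Return c such that the middle-wire delay is C_L*R_T*(1 + c*lam).

    Each argument is the transition on that wire: +1 (rising),
    -1 (falling) or 0 (quiet). Returns None when the middle wire is quiet.
    """
    if mid == 0:
        return None  # no transition on the middle wire, so no switching delay
    # A neighbour adds 0 (same direction), 1 (quiet) or 2 (opposite direction).
    return sum(1 - mid * nb for nb in (left, right))


def delay_factor(left, mid, right, lam):
    """Delay of the middle wire in units of C_L*R_T (assumed model)."""
    c = coupling_count(left, mid, right)
    return 0.0 if c is None else 1.0 + c * lam


# Both neighbours switch against the middle wire: the assumed worst case.
worst = delay_factor(-1, +1, -1, lam=0.5)
# All three wires switch together: no coupling contribution.
best = delay_factor(+1, +1, +1, lam=0.5)
```

Under this assumption, removing every pattern with $c \ge 3$ leaves $1 + 2\lambda$ as the largest delay factor.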
PREVIOUS WORK
Many encoding methods have been presented to reduce the power dissipation on buses. The Bus invert method proposed in [13] achieves a 50% reduction in the maximum number (peak value) of self transitions and coupling transitions when compared to a normal transmission over an unencoded bus. However, this method fails to control the average coupling transition activity, leading to a smaller reduction in average power. The failure arises because the method ignores the coupling capacitance between interconnects, which results in a significant power penalty on DSM global buses. A bus encoding technique to simultaneously minimize power consumption and eliminate crosstalk delay is proposed in [14]. This technique requires a large number of extra interconnects. For example, it encodes a 32-bit data to a 55-bit data (23 extra bits). This leads to a heavy overhead in terms of routing congestion, Encoder-Decoder (CODEC) complexity and wire area penalty. In [15], a bus encoding technique is proposed to minimize coupling transitions by considering minimization of power consumption as the main objective. This technique does not eliminate Class 4, 5 and 6 crosstalk patterns. Hence, it offers little advantage from the delay perspective.
Another bus encoding technique to minimize both energy and delay is proposed in [16], which can eliminate only crosstalk Classes 4 and 6. Here, the worst-case delay is still due to the Class 5 transitions, which is high. Moreover, like [14], this technique requires a large number of extra interconnects: it also encodes a 32-bit data to a 55-bit one. The Odd/Even bus invert technique proposed in [17] is designed to minimize coupling energy and hence is not very effective from the delay perspective.
The simplest and most effective technique for preventing crosstalk delay is shielding. This involves providing a power supply wire, either ground or VDD, between every two adjacent wires on the bus, as shown in Figure 4. These constant-voltage wires act as electromagnetic shields and prevent activity on one signal wire from significantly coupling over to another signal wire. Though this naive scheme is effective in removing the crosstalk delay, the routing congestion and wire area penalty are significant because the number of wires is almost doubled.
Victor et al. proposed the self-shielding bus encoding technique [6] to prevent crosstalk delay. Two versions of the coding technique were proposed, namely, the memory-based self-shielding code and the memory-less self-shielding code. In the memory-based self-shielding technique, the encoder and the decoder require the previously transmitted data on the bus for encoding/decoding the current data. Thus, the previously transmitted data have to be remembered by storing them in a memory. The encoder and decoder of a memory-less self-shielding code depend only on the current data for encoding/decoding. The technique in [6] was shown theoretically to be far better than just placing a shielding wire between every two adjacent wires. A thorough theoretical study of self-shielding codes is presented in [6]. Theoretical bounds on performance in terms of required channel width versus data bits are derived. For example, it is shown in [6] that, for a 32-bit data D, any memory-based self-shielding code requires at least 40 bits (8 extra) to encode D such that the resulting transmission avoids Class 5 and Class 6 crosstalk patterns. It is also shown that a memory-less self-shielding code requires at least 46 bits (14 extra) to achieve the same.
Tiehan et al. [18] proposed a dictionary-based encoding scheme for data buses. This approach uses an adaptive dictionary encoding scheme for minimization of power. The power reduction using this technique depends crucially upon the patterns in the transmitted data. In [19], an address bus encoding scheme is proposed. This scheme depends crucially upon the high degree of temporal correlation in the data transmitted over the address bus to reduce the power dissipation. In the absence of such correlation between consecutively transmitted data, this scheme partitions the data to be transmitted into two parts and transmits them in two consecutive cycles, hence resulting in temporal redundancy.
In [20], a bus encoding technique using a variant of the binary Fibonacci representation of integers is proposed, and a recursive procedure to generate crosstalk-delay-free binary Fibonacci codewords is given. The generated Fibonacci codewords are similar to the memory-less self-shielding codes given in [6]. It has been shown in [6], [20] that $m$-bit crosstalk-delay-free binary Fibonacci codewords can be used to encode $\lfloor \log_2(F_{m+2}) \rfloor$ bits, where $F_{m+2}$ is the $(m+2)$-th Fibonacci number. So, a 32-bit bus can be encoded using 46-bit crosstalk-delay-free binary Fibonacci codewords.
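The 46-bit figure can be checked numerically. The sketch below assumes the indexing $F_1 = F_2 = 1$ for the Fibonacci numbers, which is an assumption made here (the text does not state the convention), and searches for the smallest $m$ with $F_{m+2} \ge 2^{32}$. The function names are illustrative.

```python
def fib(k):
    """k-th Fibonacci number, assuming F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 2):
        a, b = b, a + b
    return b if k >= 2 else a


def min_codeword_bits(data_bits):
    """Smallest m such that F_{m+2} >= 2**data_bits."""
    m = 1
    while fib(m + 2) < 2 ** data_bits:
        m += 1
    return m


width_for_32 = min_codeword_bits(32)
```

With this indexing, `width_for_32` evaluates to 46, in agreement with the bound quoted from [6], [20].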
In [21], a bus encoding is proposed to obtain a 10% energy reduction along with a delay reduction of nearly 50%. Bus encoding techniques to reduce the worst-case crosstalk delay by nearly 50% are proposed in [2], [14]. However, these techniques require a large number of extra wires. For example, the techniques proposed in [2], [14] and [21] encode a 32-bit data into 52, 55 and 48 bits, respectively.
All the bus encoding schemes mentioned above employ spatial redundancy (that is, they use extra bus wires for encoding) and can be modeled as shown in Figure 5 . In other words, m > n in Figure 5 for spatial redundancy based techniques.
Recently, temporal redundancy (redundancy in time) based bus encoding techniques to minimize propagation delay and/or energy consumption have been reported in the literature. In a temporal redundancy based bus encoding technique, a data item is encoded in such a way that the encoded data is transmitted in two or more successive cycles.
In [22] , a bus encoding technique based on both spatial and temporal redundancy is proposed for on-chip delay and energy minimization. In this technique, data to be transmitted on the bus is classified into different crosstalk classes [5] , [11] . The temporal encoding is applied to only those data that belong to certain crosstalk classes.
A bus encoding technique, based on temporal redundancy alone is proposed in [4] , wherein each 32-bit data to be transmitted is encoded as two 24-bit data and transmitted in two successive cycles. The coding technique does not directly eliminate all the class 5 and class 6 crosstalk patterns. The encoder circuit detects a class 5 or class 6 pattern in advance and transmits a dummy data on the bus to avoid the same. In contrast to the above-mentioned approaches, this paper proposes a temporal redundancy based coding technique which fully eliminates both class 5 and class 6 crosstalk transition patterns. The technique is a memoryless self-shielding one.
BUS ENCODING USING TEMPORAL REDUNDANCY
The main idea behind the proposed temporal redundancy based encoding scheme is to encode the original $n$-bit data packet to be transmitted into two $m$-bit data packets, where $m < n$. The two encoded data packets are transmitted over two consecutive cycles on the bus.
Though it may seem that this technique incurs a large delay overhead, as the number of transmissions is doubled, it has been proved that this technique can reduce the bus delay significantly when compared to normal transmissions on the bus. The central idea is explained below.
For the correct operation of buses, the clock period $T_C$ should be sufficiently large so that all the transitions in the bus have enough time to complete [5]. During a normal (uncoded) transmission on a bus, any of the six classes of transition patterns shown in Table I can appear on the bus. Thus, the clock period should be at least the maximum delay over all such classes. From Table I, we can see that the Class 6 patterns induce the maximum delay, which is equal to $C_L R_T (1 + 4\lambda)$. Thus, for a normal uncoded transmission over a bus, $T_C \ge C_L R_T (1 + 4\lambda)$.
As mentioned earlier, for a temporal redundancy based encoding scheme, the size of the bus reduces from $n$ to $m$, that is, $m < n$. By assuming that the new $m$-bit wide bus occupies the same interconnect area as the original $n$-bit wide bus, the spacing between the wires in the new bus can be increased. Equation (2) gives the increased spacing between the wires obtained by converting an $n$-bit bus to an $m$-bit bus, where $W$ is the width of a wire, $S$ is the original spacing in the $n$-bit bus, and $S'$ is the new spacing between adjacent wires in the $m$-bit bus. Increasing the space between adjacent wires decreases $C_I$ and increases $C_L$ [5], which in turn decreases the value of $\lambda$.
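As a rough illustration of the spacing argument, the sketch below assumes that an $n$-wire bus occupies a footprint of $nW + (n-1)S$, that is, $n$ wires and the $n-1$ gaps between them, and that the $m$-wire bus fills exactly the same footprint. This area model and the function name are assumptions made here for illustration and may differ in detail from Equation (2).

```python
def new_spacing(n, m, width, spacing):
    """Spacing S' of an m-wire bus that fills the footprint of an n-wire bus.

    Assumed footprint model: n*W + (n - 1)*S, with no margin outside
    the two edge wires.
    """
    if not 1 < m <= n:
        raise ValueError("need 1 < m <= n")
    footprint = n * width + (n - 1) * spacing
    return (footprint - m * width) / (m - 1)


# Illustrative (not measured) values: equal wire width and spacing of 1 unit.
s_prime = new_spacing(n=32, m=26, width=1.0, spacing=1.0)
```

Under this model, going from 32 to 26 wires with $W = S$ widens each gap by roughly half, which is the mechanism by which $\lambda$ is reduced.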
The encoding technique proposed in this paper ensures that Class 5 and Class 6 crosstalk patterns are eliminated during the transmission over the bus. Therefore, as seen from Table I, the worst case delay pattern that can occur on the bus is due to the Class 4 transition pattern.
In addition, as two encoded data are transmitted (temporal redundancy) for each of the original data, the proposed scheme results in better performance than normal transmission if and only if the inequality of Equation (3) holds, where the capacitance factor $\lambda' = C_I' / C_L'$, $C_L'$ and $C_I'$ are the values of the ground and inter-wire capacitance, respectively, for the new $m$-bit bus, and $\delta$ is the delay associated with the encoder and the decoder. The self capacitance $C_L$ and coupling capacitance $C_I$ are calculated using the equations presented in [23] (reproduced here as Equations (4) and (5)), where $W$ is the wire width, $H$ is the dielectric thickness, $S$ is the inter-wire spacing, $T$ is the wire thickness and $\varepsilon$ is the permittivity. In the subsequent sections of this paper it is shown that the above inequality (Equation (3)) indeed holds for DSM buses.
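The break-even condition can be sketched numerically. The code below assumes one plausible form of the comparison: two cycles on the encoded bus, each bounded by a Class 4 delay taken as $C_L' R_T (1 + 2\lambda')$, plus the codec delay $\delta$, against one cycle on the uncoded bus bounded by $C_L R_T (1 + 4\lambda)$. The Class 4 delay value, the unchanged $R_T$, and all numeric inputs are assumptions for illustration only, not values from the paper.

```python
def compare_delays(r_t, c_l, c_i, c_l_new, c_i_new, codec_delay):
    """Return (encoded_is_faster, uncoded_delay, encoded_delay).

    Assumed model: uncoded worst case is Class 6, encoded worst case is
    Class 4, and the wire resistance r_t is the same for both buses.
    """
    lam = c_i / c_l
    lam_new = c_i_new / c_l_new
    uncoded = c_l * r_t * (1 + 4 * lam)                      # one cycle
    encoded = 2 * c_l_new * r_t * (1 + 2 * lam_new) + codec_delay  # two cycles
    return encoded < uncoded, uncoded, encoded


# Strong coupling (lam = 4): the two-cycle encoded transfer wins.
strong = compare_delays(1.0, 1.0, 4.0, 1.1, 2.2, 0.5)
# Weak coupling (lam = 0.1): doubling the cycle count does not pay off.
weak = compare_delays(1.0, 1.0, 0.1, 1.0, 0.1, 0.0)
```

The two cases show why the benefit depends on $\lambda$ being large, as it is for long DSM buses.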
CROSSTALK PREVENTING CODE (CPC) BASED TEMPORAL ENCODING
The encoding scheme presented in this paper is denoted as an $(n, p, q, r)$-encoding and is described as follows. Consider an $n$-bit data $D = (x_1, x_2, \ldots, x_n)$ which is to be transmitted on the bus. This $n$-bit data is partitioned in order into two data sets, namely, $D_1 = (x_1, x_2, \ldots, x_{n/2})$ and $D_2 = (x_{n/2+1}, \ldots, x_n)$. The $n/2$ bits in $D_1$ are encoded as follows.
Algorithm 1:
1) Partition the $n/2$ bits in order into data blocks of $p$ bits each. Note that $p$ need not be a factor of $n/2$. So, the remaining $n/2 - \lfloor n/(2p) \rfloor \cdot p$ bits constitute the $\lceil n/(2p) \rceil$-th data block;
2) Encode the $p$ bits in each block $d_i$, $1 \le i \le \lfloor n/(2p) \rfloor$, into a $q$-bit code;
3) Encode the $g$ $(= n/2 - \lfloor n/(2p) \rfloor \cdot p)$ bits in the block $d_{\lceil n/(2p) \rceil}$ into an $r$-bit code;
4) All these $\lfloor n/(2p) \rfloor$ $q$-bit encoded data blocks and the one $r$-bit encoded block are transmitted over the bus with a shielding wire placed between every two adjacent data blocks, as shown in Figure 6. The shielding wire carries a logical zero or a logical one signal permanently and therefore never toggles. There are at most $\lfloor n/(2p) \rfloor$ shielding wires, as shown in Figure 6. Hence, the total number of wires in the new bus is $m \le \lfloor n/(2p) \rfloor \cdot q + r + \lfloor n/(2p) \rfloor$.
end of Algorithm 1
Similar encoding is done for the $n/2$ bits in $D_2$, and these encoded bits are transmitted in the cycle following the transmission of the encoded $D_1$ bits. The decoding procedure is straightforward, as seen in Figure 6: each of the $q$-bit data blocks is decoded into a $p$-bit data block, and the last $r$-bit data block is decoded into a $g$-bit data block, where $g = n/2 - \lfloor n/(2p) \rfloor \cdot p$.
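The partitioning step can be sketched as follows. The code only shows how a word is split into its two halves and how each half is cut into $p$-bit blocks with a shorter final block of $g$ bits; it does not perform the $p$-to-$q$ encoding. Representing data as strings of '0' and '1' and the function names are choices made here for illustration.

```python
def split_word(word):
    """Split an n-bit word (n even) into the halves D1 and D2."""
    if len(word) % 2:
        raise ValueError("word length must be even")
    half = len(word) // 2
    return word[:half], word[half:]


def partition_half(bits, p):
    """Cut a bit string into consecutive p-bit blocks.

    If p does not divide len(bits), the last block holds the remaining
    g = len(bits) - (len(bits) // p) * p bits.
    """
    if p < 1:
        raise ValueError("p must be positive")
    return [bits[i:i + p] for i in range(0, len(bits), p)]


d1, d2 = split_word("1011001110001111" "0000111100001111")
blocks = partition_half(d1, 3)  # five 3-bit blocks and one 1-bit block
```

For $n = 32$ and $p = 3$ this yields five 3-bit blocks and a final block with $g = 1$ bit for each half.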
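From Table I, Class 5 and Class 6 patterns occur only when adjacent lines make opposite transitions, so the encoded blocks must never produce such transitions. The CPC encoding scheme presented below achieves the above-mentioned objective.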
Consider the set $F_k$, $k \ge 1$, of $k$-bit binary sequences generated by Algorithm 2. The first four sets $F_1$ to $F_4$ are shown in Table III. Consider the set $F_3$ in Table III. It is interesting to note that for any two arbitrary 3-bit strings $s, t \in F_3$, transmitting $s$ on a 3-bit bus followed by $t$ on the same bus does not lead to any Class 5 or Class 6 transition patterns (i.e. two opposite transitions on adjacent lines). The same holds for the sequences in the sets $F_2$ and $F_4$. The following theorems prove some properties of the sets $F_k$, $k \ge 1$.
end of Algorithm 2
Theorem 1:
Let $|F_k|$ denote the number of elements in the set $F_k$. Then, $|F_k| = |F_{k-1}| + |F_{k-2}|$.
Proof: Without loss of generality, let $k$ be even. Then, Algorithm 2 implies that $F_k$ is constructed by
1) taking all elements of $F_{k-1}$ and prefixing each with a 0; and,
2) taking all elements of $F_{k-1}$ starting with a 1 and prefixing each with a 1.
Since $k$ is even, $k-1$ is odd. Again from Algorithm 2, we see that all the strings that start with a 1 in $F_{k-1}$ are those obtained by prefixing a 1 to all the strings in $F_{k-2}$.
The above two observations imply that $|F_k| = |F_{k-1}| + |F_{k-2}|$ for all even values of $k$. A similar proof holds for odd values of $k$.
Theorem 2:
For any two arbitrary $m$-bit strings $s, t \in F_m$, transmitting $s$ on an $m$-bit bus followed by $t$ on the same bus does not cause any opposite transitions on adjacent lines.
Proof:
The proof is by induction on $m$. The theorem is true for $F_2$. Let the theorem be true for all $i \le k$. Consider $F_{k+1}$. Since $F_{k+1}$ is constructed by prefixing 0's and 1's to the strings in $F_k$, by the induction hypothesis it is clear that, if opposite transitions happen at all on adjacent wires during two consecutive transmissions of arbitrary strings in $F_{k+1}$ on a $(k+1)$-bit bus, then they must happen on the two leftmost (most significant) lines. This implies that there should exist at least two strings in $F_{k+1}$ with the two most significant bits as 01 and 10, respectively, so that if they are transmitted one after another, they cause opposite transitions on adjacent lines. But this is never the case, as from Algorithm 2 it is clear that $F_{k+1}$ has strings starting with either 10 or 01, but not both. Hence, the theorem.
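Both theorems can be checked by brute force for small $k$. The sketch below rebuilds the sets from the construction described in the proof of Theorem 1: for even $k$, prefix 0 to every string of $F_{k-1}$ and prefix 1 to those starting with 1; for odd $k$, the mirrored rule is assumed. The base set $F_1 = \{0, 1\}$ and the odd-$k$ rule are inferences made here, so the generated sets are not guaranteed to match Table III entry for entry.

```python
from itertools import product


def build_sets(k_max):
    """Return {k: list of k-bit strings} for k = 1..k_max (assumed construction)."""
    sets = {1: ["0", "1"]}
    for k in range(2, k_max + 1):
        # 'free' is prefixed to every string; 'tied' only to strings that
        # already start with the same bit.
        free, tied = ("0", "1") if k % 2 == 0 else ("1", "0")
        prev = sets[k - 1]
        sets[k] = [free + s for s in prev] + [tied + s for s in prev if s[0] == tied]
    return sets


def has_opposite_adjacent(s, t):
    """True if sending s then t makes two adjacent lines switch in opposite directions."""
    delta = [int(b) - int(a) for a, b in zip(s, t)]
    return any(x * y == -1 for x, y in zip(delta, delta[1:]))


def theorem2_holds(codes):
    """Check every ordered pair of codewords for opposite adjacent transitions."""
    return not any(has_opposite_adjacent(s, t) for s, t in product(codes, repeat=2))


sets = build_sets(8)
sizes = [len(sets[k]) for k in range(1, 9)]
all_ok = all(theorem2_holds(codes) for codes in sets.values())
```

Here `sizes` follows the recurrence of Theorem 1 and `all_ok` confirms the property of Theorem 2 for every generated set up to $k = 8$.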
Design of the Encoder and the Decoder

Theorem 2 implies that, if all the $k$-bit sequences transmitted over a $k$-bit bus belong to $F_k$, then there are no Class 5 or Class 6 transition patterns on the bus. Theorem 2 thus suggests a solution to the $p$-to-$q$ bit and $g$-to-$r$ bit encoding required by steps 2 and 3 of Algorithm 1 that also meets objective 1 stated earlier.
The idea behind the design of the $p$-to-$q$ encoding is to map each of the $2^p$ different $p$-bit inputs onto unique $q$-bit sequences in the set $F_q$. Since the mapping has to be one-to-one (injective), it is straightforward to see that $q$ is the smallest integer satisfying $|F_q| \ge 2^p$.
A similar technique is employed for the $g$-to-$r$ encoding. As seen above, $r$ is the smallest integer satisfying $|F_r| \ge 2^g$. This and Theorem 2 imply that Class 5 and Class 6 patterns are avoided in all transitions over the $m$-bit bus of Figure 6. Table IV gives the cardinality of $F_q$ for different values of $q$, calculated using Theorem 1.
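The choice of $q$ and $r$ can be computed directly from the recurrence of Theorem 1. The sketch below assumes the base cases $|F_1| = 2$ and $|F_2| = 3$, which are consistent with the 3-to-4 and 1-to-1 encodings used later but are not quoted from Table IV. The function names are illustrative.

```python
def cardinality(k):
    """|F_k| from Theorem 1, assuming |F_1| = 2 and |F_2| = 3."""
    a, b = 2, 3
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def code_width(data_bits):
    """Smallest q with |F_q| >= 2**data_bits (0 wires for an empty block)."""
    if data_bits == 0:
        return 0
    q = 1
    while cardinality(q) < 2 ** data_bits:
        q += 1
    return q


widths = {p: code_width(p) for p in range(1, 9)}
```

For example, `widths[3]` is 4 because $|F_4| = 8 \ge 2^3$ while $|F_3| = 5 < 8$.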
Given the value of $n$, the choice of $p$ is crucial. From the discussion above and Algorithm 1, it is clear that the values of $q$, $g$ and $m$ depend on $p$. The value of $r$ depends on $g$ and hence, in turn, on $p$. The following two factors have to be considered when choosing $p$.
1) $p$ should be chosen such that there is a significant reduction in $m$, the number of wires in the new bus. The reason is, from Equation (3) it is seen that, as the bus width decreases, the spacing between the lines increases and the value of $\lambda$ decreases, which results in better performance of the bus.
2) $p$ should be chosen such that the encoder and decoder are not too complex to implement.
The construction of the encoder and decoder circuits can be understood through the following example. Consider the transmission of a 32-bit data ($n = 32$). The encoding scheme partitions the input into two 16-bit data, encodes them, and transmits the encoded versions one after another in two consecutive cycles. Given $n = 32$ and hence $n/2 = 16$, Table V gives, for each choice of $p$, the values of $q$, $r$ and $m$ (refer to Figure 6).
From Table V we see that the minimum bus width is 25, which is obtained by selecting $p$ as 5, 7 or 8. In all these cases, the $p$-to-$q$ en/decoder becomes complex in terms of delay and power consumption. Tables VI and VII support this observation, which justifies the use of a 3-to-4 encoder and the corresponding decoder for transmission.
Hence, $p$ is chosen as 3, which implies that $q = 4$, $g = 1$, $r = 1$ and $m = 26$. Table VIII shows the $p$-to-$q$ (3-to-4) encoding using the sequences in $F_4$. The $g$-to-$r$ (1-to-1) encoding is straightforward: the single bit is transmitted as it is. The Boolean equations for the 3-to-4 encoder and decoder are given below. Figure 7 shows the gate-level representation of the above encoder and decoder functions.
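The trade-off in the choice of $p$ can be reproduced with a short calculation. The sketch below assumes $|F_1| = 2$ and $|F_2| = 3$ with the recurrence of Theorem 1, exactly one shielding wire between every two adjacent blocks, and a range of $p$ from 1 to 8. These are assumptions made here, so the numbers should be read as a cross-check of the values quoted in the text, not as a copy of Table V.

```python
def code_width(data_bits):
    """Smallest q with |F_q| >= 2**data_bits, assuming |F_1| = 2, |F_2| = 3."""
    if data_bits == 0:
        return 0
    q, size, next_size = 1, 2, 3
    while size < 2 ** data_bits:
        q, size, next_size = q + 1, next_size, size + next_size
    return q


def bus_width(half_bits, p):
    """Return (q, g, r, m) for a half-word of half_bits bits and block size p."""
    full_blocks, g = divmod(half_bits, p)
    q, r = code_width(p), code_width(g)
    blocks = full_blocks + (1 if g else 0)
    shields = blocks - 1  # one shield between every two adjacent blocks
    m = full_blocks * q + r + shields
    return q, g, r, m


table = {p: bus_width(16, p) for p in range(1, 9)}
narrowest = min(m for (_, _, _, m) in table.values())
best_p = [p for p, row in table.items() if row[3] == narrowest]
```

Under these assumptions `table[3]` is `(4, 1, 1, 26)` and the narrowest bus, 25 wires, is reached for $p$ equal to 5, 7 or 8, matching the values stated above.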
EXPERIMENTAL VALIDATION AND RESULTS
This section presents the delay and peak-power analysis of a bus during both normal transmission and the proposed encoded transmission and compares the same. For experimental study, the normal bus is assumed to be 32-bit wide, and the corresponding encoded bus is 26-bit wide, as described in the previous section.
Synthesis of the Encoder-Decoder
A Verilog description of the encoder and decoder circuit shown in Figure 7 was taken through the RTL2GDSII cycle of the Magma Blast Chip Version 4.1.57 design flow using the TSMC CL013LV (130nm technology) cell library. Table IX shows the delay, energy and area overheads of the encoder and decoder, calculated using the above design process.
Peak-Power Analysis
The effect of the proposed encoding scheme on the peak-power consumption of on-chip buses was studied by simulating the SPEC2000 CINT [24] benchmark suite on the Simplescalar 3.0 [25] architectural simulator. The performance of different on-chip buses between the processor datapath and L1 I-cache/D-cache were studied. For each benchmark, the first 100 million instructions were fast-forwarded and simulation study was done on the next 100 million instructions. The following equation was used to determine the power consumption (P) during one data transmission.
where $N_s$ is the total number of switching transitions, $N_c$ is the total number of coupling transitions, $f$ is the frequency, $V$ is the value of the high voltage level, and $P_{enc}$ and $P_{dec}$ are the power consumed by the encoder and decoder, respectively. The coupling transitions for every bus-line value were estimated as given in Table II, and the technology parameters are taken from Table X. The values of $P_{enc}$ and $P_{dec}$ are taken from Table IX. Thus, the power for each transmission on the bus is computed, and the maximum of this value over all transmissions is the peak power.
Figure 8 shows the percentage peak-power reduction due to the proposed encoding technique in comparison to normal transmission, for both address and data buses in 45nm, 65nm and 90nm technologies. The results show that, on average, the proposed technique reduces the peak-power consumption by 51% (28%), 51% (29%) and 52% (30%) in the data (address) bus for 90nm, 65nm and 45nm technologies, respectively. The peak-power reduction is higher for the data bus than for the address bus because the data transmitted on the latter are more temporally correlated than the data transmitted on the former. In other words, the Hamming distance between any two consecutive transmissions on an address bus is smaller than that between any two consecutive transmissions on a data bus.
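The peak-power computation can be sketched as a maximum over per-transmission power values. The per-transmission formula used below, $P = (N_s C_L + N_c C_I) V^2 f + P_{enc} + P_{dec}$, is an assumed form built from the symbols defined above; the exact equation used in the experiments may differ. All numeric inputs are illustrative, not taken from Tables IX or X.

```python
def transmission_power(n_s, n_c, c_l, c_i, v, f, p_enc=0.0, p_dec=0.0):
    """Power of one transmission under the assumed formula
    P = (N_s*C_L + N_c*C_I) * V^2 * f + P_enc + P_dec."""
    return (n_s * c_l + n_c * c_i) * v * v * f + p_enc + p_dec


def peak_power(counts, **params):
    """Maximum per-transmission power over (N_s, N_c) pairs."""
    return max(transmission_power(n_s, n_c, **params) for n_s, n_c in counts)


# Three hypothetical transmissions given as (switching, coupling) counts.
trace = [(1, 0), (2, 3), (0, 1)]
peak = peak_power(trace, c_l=1.0, c_i=2.0, v=1.0, f=1.0)
```

Setting `p_enc` and `p_dec` to zero models the uncoded bus, while non-zero values add the codec overhead to every transmission of the encoded bus.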
Delay Analysis
The delay analysis was based on the technology parameters taken from the ITRS 2001 [9] for 90nm, 65nm and 45nm technologies. The parameters are shown in Table X. In addition, the Predictive Technology Model [26] was used to calculate the resistance ($R_T$), self-capacitance ($C_L$), coupling-capacitance ($C_I$) and wire spacing ($S$) values for different technologies. The value of $\lambda$ was calculated using the capacitance values. These computed values for a 32-bit uncoded bus and the corresponding 26-bit encoded bus are shown in Tables XI and XII, respectively. The wire spacing parameter ($S'$) for the encoded bus shown in Table XII was obtained from Equation (8), which was derived from Equation (2) by substituting $n = 32$ and $m = 26$.
To measure the performance of the proposed encoding, a SPICE model of a 3-bit wide bus as shown in Figure 3 was developed. The model is a distributed R-C one, and the lines were assumed to be capacitively coupled. The model was simulated using the Eldo tool of Mentor Graphics for both the normal and the encoded bus models.
The technology parameters were taken from Tables X, XI and XII. The delays due to the encoder and the decoder were taken from Table IX to compute the total transmission delay of the encoded bus. The total percentage of delay reduction due to the proposed encoded transmission in comparison to the normal transmission was computed as given in Equation (9).
The results obtained for 90nm, 65nm and 45nm technologies are shown in Figure 9. The graph indicates that, for a bus length of 10 mm, the proposed technique achieves 17%, 31% and 37% reduction in the bus delay for 90nm, 65nm and 45nm technologies, respectively, when compared to the delay incurred by data transmission on a normal bus. Table XIII compares the proposed approach with the other existing techniques in terms of interconnect spacing, redundancy type, number of bus lines, shielding protection, power and delay requirements.
The comparison results show that the proposed approach is better than the existing ones reported in literature, in terms of all the parameters mentioned above. To sum up, the salient features of the proposed technique are as follows:
1) Basic en/decoder is a simple combinational circuit realized using 6 logic gates with a circuit depth of 3;
2) The technique widens the distance between bus lines without area penalty, resulting in less parasitic capacitance between bus lines;
3) The technique completely eliminates Class 5 and Class 6 patterns without using extra wires for shielding;
4) Total number of bus lines required by the technique is less than the original bus width; and,
5) The technique is independent of the executing application that causes the bus transmission.
CONCLUSIONS
In this paper, a new temporal encoding scheme that uses self-shielding memory-less codes to minimize crosstalk was proposed. The paper presented a detailed comparison of the proposed scheme with existing encoding schemes. The proposed encoding scheme was tested with the SPEC2000 CINT benchmarks to study the peak power consumption. On average, the proposed technique reduced the peak power consumption by 51% (28%), 51% (29%) and 52% (30%) in the data (address) bus for 90nm, 65nm and 45nm technologies, respectively.
The experimental results obtained from a SPICE simulation of the DSM bus model show that the proposed technique achieves 17%, 31% and 37% reduction in the bus (10mm) delay for 90nm, 65nm and 45nm technologies, respectively, when compared to data transmission without any encoding. To the best of our knowledge, this is the only temporal encoding scheme reported in literature that eliminates Class 5 and Class 6 crosstalk patterns completely.
ACKNOWLEDGMENTS
This work was partially supported under the Fast-track scheme for young scientists supported by the Department of Science and Technology (DST), Government of India. We thank the anonymous referees for their valuable comments that have greatly enhanced the quality of this paper. 
