Technology trends and especially portable applications are adding a third dimension (power) to the previously two-dimensional (speed, area) VLSI design space [30]. A large portion of power dissipation in high-performance CMOS VLSI is due to the inherent difficulties in global communication at high rates, and we propose several approaches to address the problem. These techniques can be generalized at different levels in the design process. Global communication typically involves driving large capacitive loads which inherently require significant power. However, by carefully choosing the data representation, or encoding, of these signals, the average and peak power dissipation can be minimized. Redundancy can be added in space (number of bus lines), time (number of cycles) and voltage (number of distinct amplitude levels). The proposed codes can be used on a class of terminated off-chip board-level buses with level signaling, or on tri-state on-chip buses with level or transition signaling.
I. Introduction
The low-power community has generally been concerned with average power consumption and ways to minimize it. Average power is directly related to battery life in the case of portable applications, as well as to costly package and heatsink requirements for high-end devices [24]. More recently the interest has also shifted to minimizing the instantaneous, or peak, power dissipation. There are cases when the instantaneous power can be much higher than the average power, and this leads to an undesired increase in simultaneous switching noise [14], metal electromigration problems [6], and local physical deformations due to nonuniform temperatures on the die. This paper focuses on low-power techniques for global communication in CMOS VLSI using data encoding methods. Such encodings can decrease the power consumed for transmitting information over heavily loaded communication paths (buses) by reducing the switching activity without affecting the I/O information entropy [17]. Some of the techniques presented here are particularly effective in reducing the peak power consumption.
Global communication is typically achieved with buses [8], [7]. Such buses can be on-chip (between different functional blocks), off-chip (between different ICs on a PCB) or at the system level (as a back-plane bus connecting several boards together). The main motivation behind this work was the fact that buses have a relatively small number of electrical nodes (the number of bus lines), but they can still consume a disproportionately large amount of power because of the large capacitive loads.
Example: The dynamic power dissipation is proportional to the capacitive load [40]. A bus line is likely to have a load 2-3 orders of magnitude larger than an internal node [1], hence one transition on a bus line will dissipate as much as 100-1000 internal transitions.
A. Random Data Bus Model
Buses can behave differently depending on the type of information carried. The bus model considered here is a data bus on which the activity can be characterized as a random uniformly distributed sequence of values. Of course this is just an approximation, since in general the data is not random and the correlation can be exploited by lossless compression techniques [43]. Data compression can be an efficient method to decrease power dissipation and, by removing the extra redundancy, the data can be more accurately approximated as a random sequence.
Example: Figure 1 shows the effects of compression and encoding on the number of transitions generated when transferring several typical Unix files over an 8-bit bus. The effect of low-power encoding alone (Bus-Invert is the encoding used here, see section II-A) can be seen in the columns labeled inv, the effect of compression alone in the columns labeled gz, and the combined effect of compression and encoding in the columns labeled gzinv.
Depending on the data type, compression can have a very high impact on the switching activity required at the I/O. Of course compression/decompression is desirable only for buses where the extra latency can be either tolerated or masked (e.g. main-memory or system buses) and if the overhead power is less than the savings on the bus, which can be the case if all the operations can be done on-chip without external memory accesses. The assumption of independent uniformly distributed inputs is also conveniently made by many statistical power estimation methods [28]. With this assumption, for any given cycle, the data on a $K$-bit wide bus can be any of $2^K$ possible values with equal probability. The average number of transitions per cycle will be $K/2$, hence there will be an average of $1/2$ transitions per bus line.
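The $K/2$ figure can be checked by brute force for small bus widths. The sketch below is illustrative only: it assumes that the number of transitions between two consecutive bus values is their Hamming distance, and the function names are hypothetical.

```python
from itertools import product


def hamming_distance(a: int, b: int) -> int:
    """Number of bus lines that toggle when the bus goes from a to b."""
    return bin(a ^ b).count("1")


def average_transitions(k: int) -> float:
    """Average transitions per cycle over all equally likely pairs of K-bit values."""
    values = range(1 << k)
    total = sum(hamming_distance(a, b) for a, b in product(values, repeat=2))
    return total / (1 << (2 * k))


for k in (1, 2, 4, 8):
    # Each line prints the bus width and the average, which equals k / 2.
    print(k, average_transitions(k))
```

Because every pair of values is enumerated, the result is exact for the uniform random model and does not depend on sampling.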
B. Level Signaling and Transition Signaling
Many system buses use active-low level signaling for the physical layer. For an active-low bus with level signaling, a logical 1 is represented by a LO voltage level and a logical 0 by a HI level, where the actual HI and LO voltage levels are determined by the technology used (e.g. TTL or ECL) and by whether the bus uses full-swing or reduced-swing signaling. In the case of an active-high bus, a logical 1 is represented by a HI voltage and a logical 0 by a LO voltage. Transitions (from HI to LO or from LO to HI) determine the dynamic power consumption [4] on a tri-state bus with level signaling. Level signaling is applicable to both transaction-oriented and packet-oriented buses.
As we will see in section II-A, low-power encodings with level signaling need a complex one-to-many context-dependent correspondence between codewords and information symbols. An alternative is to use transition signaling for packet-oriented buses. With transition signaling a logical 1 is represented by a transition (from HI to LO or LO to HI) while a 0 is represented by the lack of such a transition. Transition signaling will not reduce switching activity by itself but, as we show in section II-B, in this case low-power encoding can be done in a simpler one-to-one context-independent manner. If we use transition signaling and at the same time we reduce the number of 1's in the codewords, we can directly reduce the switching activity on the bus.
Transition signaling has a simple algorithmic description which corresponds to a straightforward implementation. Denoting by $v(t)$ the symbol to be transmitted, the modulated symbol with transition signaling, $b(t)$, is obtained by the simple expression ($\oplus$ is the bit-wise XOR of the symbol bits):
$$b(t) = v(t) \oplus b(t-1)$$
Demodulation is similarly simple:
$$v(t) = b(t) \oplus b(t-1)$$
In a very different context, transition signaling was also found convenient by the asynchronous design community when they adopted Signal Transition Graphs (STG) over state diagrams for describing asynchronous behavior [13]. Transition signaling with capacitive coupling was also proposed for solving the "known-good die" problem for multi-chip modules (MCM) [27].
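A minimal sketch of transition signaling follows, assuming that modulation XORs each symbol with the previous bus state (a 1 toggles the line, a 0 leaves it unchanged) and that the bus starts from a known state. The function names and the `initial` parameter are illustrative assumptions.

```python
def modulate(symbols, width, initial=0):
    """Transition signaling: each 1 in a symbol toggles the corresponding bus line."""
    mask = (1 << width) - 1
    bus = initial & mask
    out = []
    for v in symbols:
        bus ^= v & mask          # new bus state = symbol XOR previous bus state
        out.append(bus)
    return out


def demodulate(bus_values, initial=0):
    """Recover the symbols as the XOR of consecutive bus states."""
    prev = initial
    out = []
    for b in bus_values:
        out.append(b ^ prev)
        prev = b
    return out


def count_transitions(bus_values, initial=0):
    """Total number of line toggles seen on the bus."""
    prev = initial
    total = 0
    for b in bus_values:
        total += bin(b ^ prev).count("1")
        prev = b
    return total


symbols = [0b0011, 0b0000, 0b1010, 0b1111]
bus = modulate(symbols, width=4)
ones = sum(bin(v).count("1") for v in symbols)
print(bus, demodulate(bus) == symbols, count_transitions(bus) == ones)
```

The last check shows why reducing the number of 1's in the codewords directly reduces switching activity: with transition signaling the two counts are identical.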
C. I/O vs. Internal Power
The amount of I/O and internal power depends on the application and type of data transmitted over the bus, as can be seen in figure 2. The power dissipated for the I/O can be as low as 10% [9] and as high as 80% [20] of the total power.
D. Previous Work on Low-Power Encodings
There are many sources of noise in a digital circuit, including the switching of large currents. Consequently, noise and dynamic power are both directly related to switching activity at the I/O, and the work by Park and Maeder [23] and Tabor [38] for minimizing transient noise due to I/O activity has many similarities with our work. Park and Maeder propose encoding tri-state I/O buses with transition signaling and limit the number of transitions in order to reduce switching noise. Tabor considers both tri-state and ECL terminated buses and proposes "starvation codes" and "ration codes" for minimizing switching transients. Even closer to the Bus-Invert method presented in [33] is the patent on a similar technique issued to Fletcher [10].
[...] It turns out that for purely sequential behavior even the Gray code is too complex. This can be seen by looking at the information entropy of an address bus, which is zero for purely sequential accesses. In other words there is no need to actually send a new sequential address over the bus, since this address can be locally generated. This is the principle behind burst transfers on a synchronous bus, where only the first address is actually transferred, while the next addresses in the burst are locally generated.
Finite-state machines (FSM) have been the "workhorse" of the logic synthesis community for a long time. Mustang, Nova, and Diet are just a few CAD tools that attempt to optimize FSMs for area, performance and testability. It is natural to extend the research done for optimally encoding FSMs to yet another objective function: power. Hachtel et al. [12] have looked at re-encoding a previously encoded FSM for low power without changing the number of bits in the encoding, while Olson and Kang [22], and Tsui et al. [39] have looked at the more general problem of optimally encoding a state machine for low power.
Due to the range and resolution of data represented as 2's complement numbers in the datapath of a DSP processor, the most significant bits on such a numerical bus are typically correlated, as opposed to the least significant bits which are essentially random, or uncorrelated. Mehra et al. [19], [18] have studied algorithmic and architectural level methodologies for modeling such correlated behavior at a high level and for accurately estimating power dissipation for DSP applications. Chandrakasan et al. [3] have proposed the use of sign-magnitude representations which can take advantage of the correlated behavior of the most significant bits in order to reduce switching activity.
The one-hot residue logic design proposed by Chren [5] for arithmetic circuits also tries to minimize power by the way data is represented. Of great interest is the work on information theoretic measures for low power by Marculescu et al. [16], which relates power consumption to information theory concepts like spatial and temporal correlations and redundancy.
II. One-Dimensional Codes for Low-Power
In this section we consider encoding the data in space by adding redundancy in the form of extra bus lines. In order to be effective, the method requires that the power dissipated in the encoder and decoder be small.
Definition: A code is a mapping $C : U \to V$ where the elements of $V$ are represented using $N$ bits, and $U$ has dimension $2^K$, hence the elements of $U$ are represented using $K$ bits, with $N \ge K$.
The encoding process, as shown in figure 3, is done in parallel in the spirit of the codes for computer systems discussed by Fujiwara and Pradhan [11] and Rao and Fujiwara [26]. With low-power encodings it is possible to add redundancy in a controlled manner such that the correlation in time between successive data values reduces the switching activity.
In order to give an intuitive explanation for the effect of low-power encodings on switching activity, let us consider a 2-bit active-high bus with 4 possible symbols: U = {u0 = 00, u1 = 01, u2 = 11, u3 = 10}
The 4 codewords can be arranged as the vertices of a square-graph as in figure 4a. A transmitted sequence over the bus will be equivalent to traversing a path in the square-graph and the number of transitions between 2 consecutive bus transfers will be: 0 with probability $1/4$, 1 with probability $1/2$, and 2 with probability $1/4$.
Example
The sequence u0, u2, u1, u0, u2, u1, u0 (transmitted as 00, 11, 01, 00, 11, 01, 00) will generate 8 transitions by traversing the diagonal twice (see figure 4a). The number of transitions can be reduced with a one-to-many (1-to-2) mapping ($N = 3$ code) as in table I, by using a 3-bit wide bus. The eight codewords can now be arranged as the vertices of a cube-graph as in figure 4b. By properly choosing at each step one of the two representations, the same sequence u0, u2, u1, u0, u2, u1, u0 will generate only 6 transitions by using the following encoding: 000, 100, 110, 111, 011, 001, 000.
Bus-Invert is a context-dependent one-to-many encoding which uses level signaling. Two codewords exist for each information symbol (1 bit of redundancy), from which the codeword leading to the lowest activity factor is selected during encoding. This idea could in principle be extended to more than 1 bit of redundancy (and a broader context, i.e. more than two codewords for each information symbol) but the coding complexity would quickly become unmanageable. A solution is to use Limited-Weight codes (LWC) [32], which are one-to-one (context-independent) encodings and hence are simpler to implement. Because the source entropy [2] must remain unchanged (we want to transmit the same amount of information), it follows that in order to have a sufficient number of codewords, the following inequality must be satisfied [32]:
$$\sum_{i=0}^{M} \binom{N}{i} \ge 2^K \qquad (1)$$
Here the left-hand side represents the total number of possible codewords with weight at most $M$, and [...]. Perfect and semi-perfect Limited-Weight codes are optimal in the sense that any other code with the same length $N$ cannot have better statistical properties for low power.
The inequality (1) shows that there is a clear trade-off between the limiting weight $M$ and the length of the code $N$: the smaller $M$ gets (and thus power dissipation is decreased), the larger $N$ (and thus the extra required redundancy) must be. In general the number of necessary extra code bits grows exponentially and the method becomes impractical when very small switching activities are needed, as can be seen in table II.
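The trade-off between $M$ and $N$ can be explored numerically. The sketch below assumes that inequality (1) has the form $\sum_{i=0}^{M} \binom{N}{i} \ge 2^K$ (the form that the 9-bit example later in this section satisfies with equality) and searches for the smallest $N$ that meets it. The function names are hypothetical, and the printed values come from this sketch rather than from table II.

```python
from math import comb


def codewords_up_to_weight(n: int, m: int) -> int:
    """Number of n-bit words containing at most m ones."""
    return sum(comb(n, i) for i in range(m + 1))


def min_code_length(k: int, m: int) -> int:
    """Smallest N such that an M-limited-weight code can hold 2^K symbols."""
    if m < 1:
        raise ValueError("M must be at least 1 to encode more than one symbol")
    n = k
    while codewords_up_to_weight(n, m) < 2 ** k:
        n += 1
    return n


for m in (4, 3, 2, 1):
    n = min_code_length(8, m)
    print(f"K=8, M={m}: N={n}, extra lines={n - 8}")
```

For $K = 8$ the required length grows from 9 lines at $M = 4$ to 255 lines at $M = 1$ (one-hot with an all-zero word), which illustrates how quickly the redundancy increases as the weight limit is tightened.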
Two different points of view are possible in order to design good LW codes for a given $K$:
- with a given level of redundancy $N - K$, build a code that has a minimum $M$,
- with a desired $M$, build a code with the smallest extra redundancy.
Until now we have only considered tri-state buses on which the power dissipated is mainly dynamic (charging and discharging of bus line capacitances) and the codes must minimize the switching activity (number of transitions).
A type of modern parallel terminated bus with only pull-up resistors dissipates power only when transmitting logical 1's. In such a case Limited-Weight codes can be directly used for [...]
Example: For a standard bus like the Rambus (see figure 6) the main constraint is to code the data at the logical level without affecting the physical specification. The Rambus has 8 + 1 data bits, the use of the 9th bit being left to the system designer. We can take advantage of this 9th bit and use it as the invert line. The derived code will have codewords of length 9. With 9 bits there are $2^9 = 512$ possible patterns out of which only $2^8$ are needed. It can be observed that:
$$\binom{9}{0} + \binom{9}{1} + \binom{9}{2} + \binom{9}{3} + \binom{9}{4} = 2^8 = 256$$
It follows that a perfect 4-Limited-Weight code that uses all 9-bit patterns with at most four 1's is optimal. The data can be either decoded at the receiver and stored unencoded as 8-bit values, or can be stored in encoded form in a 9-bit wide Rambus DRAM (RDRAM). Storing the encoded data has the advantage of using only off-the-shelf RDRAMs, with modifications needed only on the Rambus ASIC (RASIC) side.
Because the resulting codewords have at most four 1's, the worst-case power dissipation on the data lines is decreased by 50% (from 168 mW to 84 mW). The decrease in average power dissipation depends on the statistics of the bus transfers and is generally smaller. If the data is random and uniformly distributed, the average I/O power dissipation is reduced by approximately 18% (from 84 mW to 68 mW).
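The 50% and 18% figures can be reproduced by enumerating all 256 byte values. The sketch below makes several assumptions that are not stated in the text: power is taken to be proportional to the number of 1's on the nine lines, the unencoded bus is assumed to keep the 9th line at 0, and the invert flag is placed in bit 8. The function names are illustrative.

```python
def encode(byte: int) -> int:
    """Invert the byte when it has more than four 1's; bit 8 is the invert line."""
    if bin(byte).count("1") > 4:
        return ((~byte) & 0xFF) | 0x100
    return byte


def decode(word: int) -> int:
    """Undo the inversion when the invert line is set."""
    data = word & 0xFF
    return (~data) & 0xFF if word & 0x100 else data


codewords = [encode(b) for b in range(256)]
weights = [bin(c).count("1") for c in codewords]

max_weight = max(weights)                                  # worst-case number of 1's
avg_encoded = sum(weights) / 256                           # average 1's per encoded transfer
avg_plain = sum(bin(b).count("1") for b in range(256)) / 256
reduction = 1 - avg_encoded / avg_plain

print(max_weight, avg_plain, avg_encoded, round(100 * reduction, 1))
```

Under these assumptions the worst case drops from eight 1's to four, and the average from 4 to 837/256 (about 3.27), a reduction of roughly 18%. The 256 codewords are exactly the 9-bit patterns of weight at most four.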
D. The LWC-ECC Duality
There is an unexpected formal relationship between Limited-Weight codes and general Error-Correcting codes. It is unexpected because LWCs do not exhibit any of the nice algebraic properties of ECCs and because they were formulated in a completely different setting. The fundamental observation that links LWCs to ECCs is the following: the codewords of a Limited-Weight code, when viewed as patterns of 1's and 0's, are the same as the error patterns corrected by a standard block ECC. Generally, an ECC is designed to correct up to $m$ errors, which means that it will correct all patterns with at most $m$ 1's. This is "dual" to the definition of an $M$-LWC, which contains codewords with at most $M$ 1's.
Example: A (15,11) Hamming ECC has 11 information bits and 4 parity bits [2]. Any two of the $2^{11}$ codewords are at a Hamming distance of at least 3 and thus the (15,11) Hamming ECC can correct any single bit in error. An error pattern is represented by a 15-bit word which has a 1 in the position of the erroneous bit. If $v(t)$ is the transmitted codeword, $v'(t)$ the received codeword and $e(t)$ the error pattern, the effect of the error on the codeword can be written as:
$$v'(t) = v(t) \oplus e(t)$$
It is clear that there are exactly 15 1-bit error patterns, and together with the all-zero word they are the same patterns as the words of a 1-Limited-Weight code (1-hot encoding with "no-hot" encoding for 0). Decoding an ECC requires finding the error pattern and, since the error pattern is similar to an LW codeword, this is the same as encoding an LWC. It must be noted though that the efficient decoding of EC codes is a hard problem except for particular cases, and this duality between ECC decoding and LWC encoding shows that LWC generation is also intrinsically hard in general.
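The duality can be made concrete for the (15,11) Hamming code. The sketch assumes the common parity-check matrix whose column $j$ (for positions 1 to 15) is the binary representation of $j$; with that choice the syndrome of a single-bit error is simply the position of the error. Encoding a 4-bit symbol as a 1-Limited-Weight word is then the same mapping as ECC error location, and computing the syndrome is the LWC decoder. The function names and bit ordering are illustrative assumptions.

```python
N = 15  # code length of the (15,11) Hamming code


def lwc_encode(symbol: int) -> int:
    """Map a 4-bit symbol to a 15-bit word of weight at most 1 (ECC error location)."""
    if not 0 <= symbol <= 15:
        raise ValueError("symbol must fit in 4 bits")
    return 0 if symbol == 0 else 1 << (symbol - 1)


def syndrome(word: int) -> int:
    """Syndrome for the assumed parity-check matrix: XOR of the positions of all 1's."""
    s = 0
    for position in range(1, N + 1):
        if (word >> (position - 1)) & 1:
            s ^= position
    return s


# Decoding the LWC is computing the syndrome of the limited-weight word.
print(all(syndrome(lwc_encode(s)) == s for s in range(16)))

# Positions 1, 2 and 3 XOR to zero, so this word has syndrome 0 under the assumed matrix.
codeword = 0b111
# Adding the "error" lwc_encode(s) to a valid codeword yields syndrome s (linearity).
print(all(syndrome(codeword ^ lwc_encode(s)) == s for s in range(16)))
```

For this perfect code the mapping is trivial; for codes correcting more errors the syndrome-to-pattern step is the hard part, which mirrors the difficulty of LWC generation noted above.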
The Hamming bound [2] for an ECC gives a relation between the length $n$, the number of information bits $k$ and the number of correctable bits $m$:
$$\sum_{i=0}^{m} \binom{n}{i} \le 2^{n-k} \qquad (2)$$
The bound simply expresses the fact that the total number of syndromes ($2^{n-k}$) must be at least the number of correctable error patterns, because the syndromes must be unique for all errors (coset leaders). The inequality (1) for LWCs is very similar to (2), with one difference: the sign of the inequality. For an LWC and the dual ECC, inequality (1) has one more term than (2), which is true when both inequalities are strict. In the case of perfect codes, when (1) and (2) are both equations, the number of terms is the same.
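The relation between the two bounds can be checked numerically. The sketch assumes that the Hamming bound reads $\sum_{i=0}^{m} \binom{n}{i} \le 2^{n-k}$ and the LWC inequality $\sum_{i=0}^{M} \binom{N}{i} \ge 2^K$, and it identifies the number of syndrome bits $n-k$ of the ECC with the number of information bits $K$ of the dual LWC, as suggested by the (15,11) example. The function names are hypothetical.

```python
from math import comb


def ball(n: int, r: int) -> int:
    """Number of n-bit patterns with at most r ones."""
    return sum(comb(n, i) for i in range(r + 1))


def max_correctable(n: int, syndrome_bits: int) -> int:
    """Largest m allowed by the Hamming bound for length n."""
    m = 0
    while m < n and ball(n, m + 1) <= 2 ** syndrome_bits:
        m += 1
    return m


def min_limit_weight(n: int, info_bits: int) -> int:
    """Smallest M allowed by the LWC inequality for length n."""
    if info_bits > n:
        raise ValueError("cannot encode more than n bits in n lines")
    m = 0
    while ball(n, m) < 2 ** info_bits:
        m += 1
    return m


for n, bits in ((15, 4), (23, 11), (8, 4)):
    print(n, bits, max_correctable(n, bits), min_limit_weight(n, bits))
```

For lengths 15 and 23 both bounds hold with equality and the two limits coincide (the perfect case). For length 8 with 4 bits both inequalities are strict, and the LWC needs one more term: $M = m + 1$.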
III. Two-Dimensional Encodings for Low Power
The codes proposed in section II use redundancy in space (number of bus lines) for reducing the bus transition activity. One problem with such encodings is that the required number of extra bus lines increases exponentially as the transition activity is reduced, and this can make the techniques impractical. An alternative is to keep the number of bus lines constant and inject redundancy in time by using extra transfer cycles. The same assumptions as in section II are made here about the randomness of the data bus and the use of transition signaling.
Time encoding requires that the data be transmitted in packets, which is typical for global buses where there is an advantage in transmitting bursts of data for improved throughput [25]. Figure 7a shows a data packet with redundancy in space, while figure 7b shows the same packet with redundancy added in time. With the transmitted data arranged into $K_t$-word packets, where each word is initially $K_s$ bits wide, the same coding techniques that in section II used redundancy in space can now be applied in time. Limited-Weight codes with redundancy in time and transition signaling can use extra transfer cycles for encoding the $K_t$ bits that are successively transmitted over each bus line in order to minimize the number of 1's and hence the number of transitions. For $K_t = 4$ a low-power time encoding will first count the number of 1's that are to be transmitted over each bus line. If the number of 1's for a bus line is greater than $K_t/2 = 2$, then the $K_t = 4$ bits on that line will be inverted and this inversion will be signaled by a 1 in an extra 5th transfer cycle. Otherwise, the $K_t = 4$ bits will be transmitted as they are and the extra 5th bit will be 0. The computation of the redundant bit needs to be done for each of the $K_s$ bus lines, in series or in parallel.
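The time-encoding rule for $K_t = 4$ can be sketched as follows. Words are modeled as integers, with bit `line` of each word travelling on bus line `line`, and the extra code bit is placed in the last (5th) cycle as in the description above. The function names and the data layout are illustrative assumptions.

```python
from itertools import product


def encode_line(bits):
    """Invert the K_t bits of one line if more than half are 1; append the invert flag."""
    k_t = len(bits)
    invert = 1 if sum(bits) > k_t / 2 else 0
    return [b ^ invert for b in bits] + [invert]


def decode_line(coded):
    """Undo the inversion using the flag received in the extra cycle."""
    *data, invert = coded
    return [b ^ invert for b in data]


def encode_packet(words, k_s):
    """Encode a packet of K_t words (k_s bits each) into K_t + 1 transfer cycles."""
    k_t = len(words)
    lines = [encode_line([(w >> line) & 1 for w in words]) for line in range(k_s)]
    return [sum(lines[line][t] << line for line in range(k_s)) for t in range(k_t + 1)]


# Every 4-bit line pattern ends up with at most two 1's in its five cycles.
worst = max(sum(encode_line(list(p))) for p in product((0, 1), repeat=4))
print(worst)
print(encode_packet([0b11, 0b01, 0b11, 0b01], k_s=2))
```

With transition signaling, each 1 that remains corresponds to one transition on that line, so no line makes more than two transitions per five-cycle packet.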
By simple probabilistic reasoning it can be shown that with the same amount of redundancy, time encodings will have the same average power savings as the equivalent space encodings. If power needs to be further reduced, redundancy in both space and time can be used. Encoding in two dimensions is a two-step process, and there is a choice whether to apply redundancy first column-wise (in space) and then row-wise (in time), or vice versa, with the same average power reduction being obtained in both cases.
Example: For the previous example the number of 1's can be reduced to 6 with two-dimensional coding, as can be seen in table VII. Column-wise encoding (in space) is done first in the table on the left, row-wise encoding (in time) is done first in the table on the right. The waveforms corresponding to the case where encoding is done column-wise first can be seen in figure 8d. Transition signaling is also used in this case.
Although the average power reduction is the same whether the two-dimensional encoding is done first in space or in time, the same is not true about the peak power and simultaneous switching noise. In the previous section we saw that encoding in time (as opposed to encoding in space) can still lead to transitions on all bus lines; hence, in order to reduce the peak power and simultaneous switching noise, the encoding should be done first in time and then in space (in this way we can make sure that the maximum number of transitions will be determined by the encoding in space). Table VIII shows all the codewords of the smallest possible two-dimensional low-power code, with column-wise encoding followed by row-wise encoding. A code with the same parameters, but with row-wise encoding followed by column-wise encoding, is shown in table IX. There are 16 such codewords, one for each of the $2^{2 \times 2}$ possible patterns of 1's and 0's. Two bits of redundancy are used in space and two in time.
The average switching activity is reduced by 31%, which is better than the 25% for one-dimensional Bus-Invert. There is an extra 9th bit which encodes in time the space codebits. A similar extension of the code in table IX is shown in table XI; this time the extra 9th bit encodes in space the time codebits. The average power dissipation for these 9-bit two-dimensional codes is reduced by 34%, slightly better than for the smallest two-dimensional codes.
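The 31% and 34% figures can be reproduced with a small enumeration. The construction below is an assumption, since the codeword tables are not reproduced here: a 2x2 block of data bits is encoded first in space (one invert bit per 2-bit word), then in time (one invert bit per data line), and the optional 9th bit applies the same rule in time to the two space codebits. For two bits the "more than half are 1" test reduces to an AND. The bit ordering and the function names are hypothetical.

```python
from itertools import product


def invert_pair(x, y):
    """Invert a pair of bits when both are 1; return the new pair and the flag."""
    flag = x & y
    return x ^ flag, y ^ flag, flag


def encode_2d(a1, b1, a2, b2, ninth_bit=False):
    """Encode words (a1, b1) and (a2, b2) sent on lines a and b in two cycles."""
    a1, b1, s1 = invert_pair(a1, b1)      # space: first word
    a2, b2, s2 = invert_pair(a2, b2)      # space: second word
    a1, a2, ta = invert_pair(a1, a2)      # time: line a
    b1, b2, tb = invert_pair(b1, b2)      # time: line b
    if not ninth_bit:
        return [a1, b1, s1, a2, b2, s2, ta, tb]
    s1, s2, ts = invert_pair(s1, s2)      # time encoding of the space codebits
    return [a1, b1, s1, a2, b2, s2, ta, tb, ts]


def decode_2d(code):
    """Undo the encoding steps in reverse order."""
    a1, b1, s1, a2, b2, s2, ta, tb = code[:8]
    if len(code) == 9:
        s1, s2 = s1 ^ code[8], s2 ^ code[8]
    a1, a2, b1, b2 = a1 ^ ta, a2 ^ ta, b1 ^ tb, b2 ^ tb
    return a1 ^ s1, b1 ^ s1, a2 ^ s2, b2 ^ s2


def stats(ninth_bit):
    """Maximum weight and average reduction in 1's versus the unencoded 4 bits."""
    weights = [sum(encode_2d(*d, ninth_bit=ninth_bit)) for d in product((0, 1), repeat=4)]
    return max(weights), 1 - (sum(weights) / 16) / 2.0


print(stats(False))   # 8-bit code
print(stats(True))    # 9-bit code
```

Under these assumptions every codeword has at most two 1's, and the average number of 1's drops by 31.25% for the 8-bit code and 34.375% for the 9-bit code, in line with the figures quoted above.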
A useful application of such two-dimensional encodings is the generation of new one-dimensional codes by projecting (unrolling) the two-dimensional code in one dimension (projecting a two-dimensional array [...]). For example, by unrolling the two-dimensional codes in tables VIII or IX, we can obtain one-dimensional semi-perfect 2-Limited-Weight codes of length 8. Similarly, by unrolling the codes in tables X or XI, the codes obtained are semi-perfect 2-LW codes of length 9. This is an important result for several reasons:
- The algorithmic generation of codes for low power is intrinsically hard in the general case.
- Such unrolled two-dimensional codes provide a compromise between the two extremes of Bus-Invert (minimum redundancy, one extra line) and one-hot encoding (minimum transition activity, one transition per cycle), and offer another practical design alternative.
B. Implementation of coding in space and time
The techniques used for computing the code bits in space and time are similar, which means the circuits can also be similar. A key element is the efficient implementation of a majority voter, and this can be done in a digital or analog fashion [33]. For time redundancy there are also issues related to accessing the entire data packet while encoding and decoding. For encoding, the entire packet must be stored and accessed, but decoding can be done on the fly if the extra code bits are transmitted before the data bits.
Example: The block diagram of an encoder for the two-dimensional code in table XI (time followed by space encoding) is shown in figure 9. Since there are only two information bits, the majority voter in this case is only an AND gate. For this example it was chosen to transmit the redundant bit before the information bits. Encoding in time (row-wise) is followed by encoding in space (column-wise) and then by transition signaling. Shift registers with parallel D and T inputs are used for time encoding and T registers are also used for transition signaling. There will be at most 2 transitions for each 4 bits of information (9 codebits transmitted). It can be seen that, although conceptually similar, time encoding is more expensive in this case than space encoding because it needs to access the entire data packet at once.
C. Modulation in time
Until now, when we addressed redundancy in time we implicitly assumed that the time domain has exactly the same "integer" restrictions as the space domain, but this need not be the case.

Data    Code
10      0100
11      1000
000     000100
010     100100
011     001000
0010    00100100
0011    00001000

[...] over the unencoded case, but for a given $T_{min}$, the transfer time is reduced by 50% (hence the energy-delay product, or action, is also reduced by 50%). Extra power savings can be obtained by realizing that for low power we may use a somewhat larger value than $p_{opt}$. In equation (4), $p_{opt}$ was computed for optimal data rate (or minimum transfer time for a given packet), but for low power we are more interested in minimizing the number of transitions than in optimizing the transfer rate. Another argument is given by figure 12, which shows the code rate as a function of $d$ and $p$. As can be seen, the code rate has a very shallow peak at $p_{opt}$, hence we can safely choose a larger value for $p$ without much penalty in code rate. We can quantify this choice by using as a measure of low-power efficiency the energy-delay product of the number of transitions in a packet, $1/\log_2 p$, and the transfer time for the packet, $1/\text{rate}$. The optimal $p_{low}$ for low power will minimize (compare with (3)):
$$\frac{1}{\log_2 p} \cdot \frac{1}{\text{rate}} \qquad (5)$$
In this way the optimal $p_{low}$ values are larger and there can be better savings in transition activity (up to 58% savings at $d = 16$ for $p_{low} = 26$), as can be seen in table XV. The equation which determines $p_{low}$ for minimum energy-delay is:
$$p_{low} (\ln 2 \cdot \log_2 p_{low} - 2) = 2d \qquad (6)$$
The trade-off is that the implementation of real circuits for large values of $p$ will be more difficult.
As in the case of space and time, we can also define two-dimensional codes that use modulation in amplitude and time. By modulating the signal amplitude with 2 $p_{volt}$ levels distanced V apart, in each cycle we can transmit $\log_2 p_{volt}$ bits in addition to the $\log_2 p_{time}$ bits transmitted in time with each transition.
Example: Figure 13 shows such a two-dimensional encoding with $p_{time} = 5$ and $p_{volt} = 5$ that can transmit 5 bits in amplitude and 5 bits in time for a total of 10 bits per transition and savings of 80% in switching activity.
There are many possible implementations for a phase modulation scheme and the one in [21] is very convenient. For encoding and decoding it uses a PLL with a $(p + d)$-stage ring oscillator which can generate the $p$ necessary phases and guarantees the minimum $d$ zeros between two transitions. Other schemes which further improve coding efficiency and reduce implementation complexity have also been proposed [42]. We believe that phase modulation is practical, improves coding efficiency and is suitable for low-power applications.
Unfortunately, amplitude modulation does not seem to be a power-efficient option because very low-power A/D and D/A converters are not feasible yet, hence two-dimensional codes in amplitude and time will probably remain just a theoretical possibility for the near future.
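Equation (6) has no closed-form solution for $p_{low}$, but it is easy to solve numerically. The sketch below rewrites it as $p(\ln p - 2) = 2d$ (since $\ln 2 \cdot \log_2 p = \ln p$) and uses bisection; the function name, the tolerance and the bracketing strategy are illustrative choices, not part of the original method.

```python
import math


def solve_p_low(d: float, tol: float = 1e-9) -> float:
    """Solve p * (ln p - 2) = 2 d for p by bisection (d > 0)."""
    def f(p: float) -> float:
        return p * (math.log(p) - 2.0) - 2.0 * d

    lo = math.e ** 2          # f(lo) = -2 d < 0, and f is increasing beyond this point
    hi = 2.0 * lo
    while f(hi) < 0:          # grow the bracket until the root is enclosed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p = solve_p_low(16)
print(p, round(p))
```

For $d = 16$ the real-valued root lies between 25 and 26, which rounds to the $p_{low} = 26$ quoted above.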
IV. Conclusion
In this paper we have attempted to present a unified framework for low-power communication on global buses. Such buses can be at the module, subsystem, die, package, or multichip module (MCM) level, the only common characteristic being that the bus capacitances are much larger than the capacitances of the internal circuit.
Low-power techniques at the architectural level reduce the I/O transition activity by coding the data on the bus. Besides being applicable in practice, the topics presented here shed light on a number of interesting theoretical issues:
- For low-power operation redundancy should generally be avoided by compressing the data, but controlled redundancy through low-power encodings can improve the signal temporal correlation for reduced transition activity.
- Time redundancy can be as effective as redundancy in space for reducing the average switching activity.
- Low-power operation can be achieved by reducing the activity on highly loaded nodes, even if both the total number of transitions and the total capacitance are increased.
- Phase modulation can be explained in terms of low-power encodings.
- Limited-Weight codes are duals of Error-Correcting codes with Hamming distance in the same sense as quantization is the dual of coding with Euclidean distance.
There are several directions of research that can extend the results of this paper:
- Finding new practical Limited-Weight codes based on the ECC duality.
- Extending low-power coding methods to other bus models (e.g. with cross-coupled capacitances).
- Extending low-power coding methods to different data models (e.g. Gaussian distribution).
- Implementing low-power codes in real applications.
- Including low-power coding methods in the specifications of standard buses.
The most important aspect in the future will be to apply the low-power methods described here to more practical circuits, hopefully with the cooperation of interested chip designers and manufacturers.
