We investigate coding techniques to reduce the energy consumed by on-chip buses in a microprocessor. We explore several simple coding schemes and simulate them using a modified SimpleScalar simulator and SPEC benchmarks. We show an average of 35% savings in transitions on internal buses. To quantify actual power savings, we design a dictionary-based encoder/decoder circuit in a 0.13 µm process, extract it as a netlist, and simulate its behavior under SPICE. Utilizing a realistic wire model with repeaters, we show that we can break even at median wire length scales of less than 11.5 mm at 0.13 µm and project a break-even point of 2.7 mm for a larger design at 0.07 µm.
Introduction
Scaling trends have continually increased the importance of wires relative to logic. Among other things, the ever rising ambitions of computer architects have caused wire lengths to remain constant or increase, even as transistor sizes have shrunk. This observation suggests that energy-conscious designers should focus some of their attention on the energy dissipated by on-chip wires. Clearly, this process can involve a variety of techniques at the level of technology, circuits, and architectures.
In this paper, we exploit abundant transistors to transform information into a form that is less expensive to communicate; our techniques are complementary to other options such as reducing voltage swing. Energy is consumed when wires change state. Thus, our goal will be to eliminate or reduce the total number of wire transitions while also reducing cross-coupling energy. Compression techniques have long been used to reduce the volume of off-chip communication. Given the large capacitances of cross-chip interconnects, compression circuits can easily save more power than they utilize. In contrast, we address a more difficult question: Is it possible to reduce the traffic over on-chip buses and save energy while doing so? Since on-chip wires have capacitances that are orders of magnitude smaller than cross-chip interconnects, the answer to this question requires careful accounting of the energy consumed by the encoding and decoding process. Figure 1 shows the basic idea, which we call "bus transcoding" (hereafter called "transcoding"). Circuits at either end of a long bus reduce the number of bus transitions. The encoder takes in an input word and recodes it into a code word whose width may differ from that of the input.
The Bus Transcoder
These bits are then transmitted along the wire (with repeaters as necessary). At the destination, the decoder takes in the code word and restores the original bits. In general, the input and code-word widths need not be equal. For this paper, we will assume that the encoder and decoder are operating synchronously. In this case, these elements can have arbitrarily complicated internal state; it is assumed that the encoder FSM utilizes the input stream to make its state transitions, while the decoder FSM uses the output stream to make its transitions. The goal of this transcoding process is to reduce the total energy expended on the long bus. We can envision many different transcoders. However, in keeping with a philosophy that the encoder and decoder are drop-in cells, we prefer techniques that do not change the timing of the bus. This simple goal rules out naive uses of compression that generate long, multi-cycle code words.
One simple enhancement to the scheme described above is to insert transition coding modules at either end of the bus. As shown by the figure, this enhancement causes the encoding of data from encoders and decoders to represent wire changes rather than absolute values: a one (1) on an output from the encoder represents a wire that will change its value (expend energy), while a zero (0) represents no energy expenditure. This change in representation greatly simplifies the energy optimization problem, even when accounting for cross-coupling between adjacent wires. For instance, we can easily perform inversion coding in which we send a value or its inverse to reduce transitions. If a coding technique reduces transitions on a bus, then we can conclude that it will save energy for some bus length. This is not a particularly sophisticated argument: energy consumption increases with volume of communication and scales linearly with bus length. For a given bus, type of traffic, and technology, there is a break-even length of the bus at which a transcoder can save energy. As technology shrinks, this bus length will shrink as well, since the power consumed by the encoder and decoder will decrease.
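The transition coding step can be illustrated with a minimal Python sketch. The function names and the convention that the bus starts at all zeros are assumptions made for illustration; they are not taken from the circuit described in this paper.

```python
def transition_encode(vectors, width, initial_state=0):
    """Drive the bus from transition vectors (bit = 1 means 'toggle this wire')."""
    mask = (1 << width) - 1
    state = initial_state & mask
    bus_values = []
    for vector in vectors:
        state ^= vector & mask      # toggle exactly the wires marked with 1
        bus_values.append(state)
    return bus_values


def transition_decode(bus_values, width, initial_state=0):
    """Recover the transition vectors by comparing successive bus values."""
    mask = (1 << width) - 1
    previous = initial_state & mask
    vectors = []
    for value in bus_values:
        vectors.append((value ^ previous) & mask)
        previous = value
    return vectors


def count_transitions(bus_values, initial_state=0):
    """Total number of wire toggles seen on the bus."""
    previous = initial_state
    total = 0
    for value in bus_values:
        total += bin(value ^ previous).count("1")
        previous = value
    return total
```

With this representation, the number of wire toggles equals the total Hamming weight of the transmitted vectors, which is why low-weight code words correspond to low energy.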
Using Data Prediction to Reduce Power
One interesting viewpoint for transcoder design is to utilize value prediction [6, 16], often viewed as providing limited performance enhancement, to save energy. In effect, we run the same predictor on either end of the bus and compare its result with the actual value to be transmitted. Since both predictors are running synchronously and on identical values, they will provide identical predictions. When these predictions match with a value to be transmitted, we send nothing, which tells the receiver that the prediction was correct. When they do not match, we send the original information directly (and transition a special control bit). Should the predictors attain a 100% prediction rate, then no energy would be consumed sending information. Many value prediction techniques have reasonably high prediction rates [16]. If we apply these techniques to bus traffic, we might expect to achieve a respectable reduction in energy. Figure 2 shows how to use predictor confidence [7] to improve this technique. Assume that the set of possible values are sorted by confidence and assigned code words from the space of $W$-bit words. These values may come from a single predictor or a combination of multiple predictors. The highest confidence value will be matched with a code word that has the lowest energy cost. Given transition coding, this would be the all-zero vector (no transitions). The next $W$ values could be matched with the unique vectors of Hamming weight one (i.e., with a single bit set). After this, we use vectors with higher Hamming weight, sorted to minimize cross-coupling. When the predictor is presented with a new input word, it checks its list of encoded values. If any of them match, it sends the corresponding code. If none of them match, it sends either the data or its inverse.
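As a rough illustration of the confidence-ordered code assignment, the following Python sketch hands out code words in order of increasing Hamming weight. It is a simplification under stated assumptions: the helper names are hypothetical, candidate values are assumed unique, code words of equal weight are not re-sorted to minimize cross-coupling, and a miss sends the raw value rather than choosing between the data and its inverse.

```python
from itertools import combinations


def codewords_by_weight(width):
    """Yield all width-bit words in order of increasing Hamming weight."""
    for weight in range(width + 1):
        for bit_positions in combinations(range(width), weight):
            word = 0
            for bit in bit_positions:
                word |= 1 << bit
            yield word


def assign_codes(candidates, width):
    """Map candidates (highest confidence first) to the cheapest code words."""
    return dict(zip(candidates, codewords_by_weight(width)))


def encode(value, candidates, width):
    """Return (control_bit, payload); control_bit = 1 marks an un-encoded value."""
    codes = assign_codes(candidates, width)
    if value in codes:
        return (0, codes[value])
    return (1, value)
```

The highest-confidence candidate receives the all-zero word, so a correct top prediction causes no transitions at all under transition coding.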
Our Contribution
Previous studies of on-chip bus compression suffer from two deficiencies. First, they utilize random traffic, a poor approximation to real traffic. This has a tendency to underestimate the potential for compression technology. In contrast, we examine traces of internal traffic from several buses in a simulation of a running superscalar processor.
Second, previous studies have focused on the reduction in bus traffic while completely ignoring the complexity and energy consumed by the encoding and decoding circuits. Two very important questions can only be answered by attempting to design a complete transcoder circuit:
1. Will the transcoder power be so high that no reasonably-sized chip will meet the break-even point?
2. Is the area consumed by transcoder logic too large for practical use?
We design a transcoder in a 0.13 µm silicon technology, carefully producing a compact, energy-efficient layout. We discuss some of its internal circuits, consider its size, and provide an accurate analysis of energy consumption.
Our results show that despite potentially large reductions in bus energy achieved by sophisticated prediction methods, a very simple, energy-efficient transcoder is the only method that can save energy for on-chip buses of reasonable length. With the transcoder of Section 5, we show that we can break even at median wire lengths of less than 11.5 mm at 0.13 µm and a projected length of 2.7 mm for a more complex design in 0.07 µm.
The rest of this paper is as follows: Section 2 discusses related work. Section 3 presents a detailed wire model. Section 4 explores various prediction technologies and presents potential savings. Section 5 follows with the architecture and layout of a practical transcoder, while Section 6 evaluates the resulting energy savings. We conclude in Section 7.
Related Work
The techniques of this paper are complementary to circuit techniques, such as presented by Zhang et al. [21] , to reduce the voltage swing within the interconnect.
There have been a number of papers on the subject of coding for low power. Bus-Invert coding [19] and partial bus-invert coding [17] implement the same idea of inverting the bus value to be sent if more than half the wires are changing. [13] provides the circuits necessary to implement the scheme including a novel analog majority voter to count the number of ones on the input. Additional schemes include workzone encoding for address buses [12] , which was extended in [1] to partition the memory space into a number of sectors that represent different segments of the address space. [10] proposes a similar design of sending the code-word xor'ed with the current input that has the lowest Hamming weight.
In addition to these experimental approaches, several papers propose more complicated algorithmic solutions. [18] proposes a static code-book but the encoding used is the one that minimizes a more complex fitness function including inter-wire capacitance. [8] proposes a complicated scheme for address buses that re-maps transitioning and non-transitioning wires to shield cross-coupled wires.
Basu et al. [3] propose placing a value cache at both ends of a communication channel. On a "hit", the system sends the index of the cache entry, instead of the whole word, to reduce transitions. Their scheme focuses on off-chip buses for DSP and embedded applications. Parcerisa and González [15] applied value prediction to inter-cluster communication on clustered microarchitectures. Their goal was to reduce long wire delay instead of energy.
Our technique differs from those listed above by incorporating value prediction to reduce on-chip communication energy. In our paper, we also evaluate our energy reduction scheme against real bus traffic generated by SPEC benchmarks rather than randomly generated bus traffic.
Interconnect Models
In this section, we explore the characteristics of buses in modern integrated circuits. Energy consumption involves two distinct elements: capacitances between wires and the substrate, and repeaters to control latency.
Interwire Capacitance
Every wire transition expends energy. Furthermore, wires that are adjacent to one another expend energy through cross-coupling. A simple model that accounts for these two effects is shown in Figure 3 [18]. This figure illustrates two types of capacitive couplings¹: wire-substrate capacitance ($C_S$) and inter-wire capacitance ($C_I$). These values depend on technology characteristics such as the width and height of wires, oxide thickness, distance between wires, etc. Total capacitance grows linearly in the length of the bus.
The relationship between expended energy and bus transition activity is governed by the equation for energy stored in a capacitor: $\frac{1}{2} C V^2$, where $C$ is the size of the capacitor and $V$ is the voltage stored on the capacitor. During the process of charging this capacitor, the power supply expends $C V^2$ total energy. We model the energy dissipation in two chunks: $\frac{1}{2} C V^2$ during the initial charge, and $\frac{1}{2} C V^2$ during the discharge. Consequently, the energy expended is proportional to the number of transitions (charge/discharge operations). Using Figure 3, we can derive a model for the energy expended by wire $n$ ($W_n$), denoted by $E_n$. Up to a constant factor set by $C_S$ and $V^2$, it has the form
$$E_n \propto L_{bus} \, (\alpha_n + \Lambda \, \beta_n) \qquad (1)$$
Here, $L_{bus}$ is the length of the bus, $\Lambda$ is the ratio between inter-wire and wire-substrate capacitances (Figure 3), $\alpha_n$ is the total number of bus transitions on $W_n$, and $\beta_n$ is the total number of pair-wise inter-bus transitions between $W_n$ and $W_{n+1}$. Energy scales linearly with wire length because capacitance scales linearly with length.
1 We will ignore other parasitic effects (such as coupling between nonadjacent wires), since these are small in comparison [8].
To explore the effectiveness of power reduction techniques, we compute values for $\alpha_n$ and $\beta_n$ over the course of some simulation. We can attack either or both terms as a way to reduce the energy of communication.
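The bookkeeping for the two activity terms can be sketched in Python as follows. This is an illustration under assumptions, not the accounting used in our simulator: energy is taken as proportional to length times (transitions plus the capacitance ratio times coupling events), and a coupling event between adjacent wires is weighted by the change in their voltage difference (0 if both wires move together or stay put, 1 if only one toggles, 2 if they toggle in opposite directions). The names `wire_activity`, `relative_energy`, and `lam` are hypothetical.

```python
def wire_activity(trace, width):
    """Return (alpha, beta): per-wire toggles and per-adjacent-pair coupling events."""
    alpha = [0] * width
    beta = [0] * (width - 1)
    for previous, current in zip(trace, trace[1:]):
        # delta[n] is +1 for a rising edge, -1 for a falling edge, 0 otherwise
        delta = [((current >> n) & 1) - ((previous >> n) & 1) for n in range(width)]
        for n in range(width):
            alpha[n] += abs(delta[n])
        for n in range(width - 1):
            # assumed weighting: change in voltage difference between neighbors
            beta[n] += abs(delta[n] - delta[n + 1])
    return alpha, beta


def relative_energy(trace, width, lam=1.0, length=1.0):
    """Energy in arbitrary units: length * (sum(alpha) + lam * sum(beta))."""
    alpha, beta = wire_activity(trace, width)
    return length * (sum(alpha) + lam * sum(beta))
```

Because the result is linear in `length`, the ratio of encoded to un-encoded energy for a given trace does not depend on the wire length.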
Signal Repeaters
For longer on-chip wires, repeaters are necessary to reduce delay. Therefore, we introduce a standard buffered wire model [2, 14] that will be included in the energy savings analysis of the various coding methods. Our buffered model is illustrated in Figure 4 . This standard model involves uniformly placed inverters of equal size throughout the length of the wire. The optimal size and number of repeaters are technology-dependent; consequently, given technology characteristics, we derive wire parameters as discussed in [9] . An exponentially-increasing cascade drives the sending end of the wire to offset repeater input capacitance. 
Realistic Technology Parameters
The energy and delay of our wire model for various technologies are shown in Figure 5 and Figure 6. These graphs are generated with HSPICE simulations using real process parameters (for 0.13 µm) and BPTM parameters (for 0.10 µm and 0.07 µm)². The wires are placed at minimum pitch apart for this study. The number of repeaters necessary varies with length of wire and is calculated based on [9]. Energy and delay for unbuffered wires are also included for comparison. Unbuffered wires exhibit quadratic delay with length, whereas wires with repeaters have linear delay. The buffered wire consumes more energy due to repeaters. Table 1 shows effective $\Lambda$ for buffered and unbuffered wires. Although inter-wire capacitance, $C_I$, is significant in long wires, its effect is less pronounced in buffered wires because repeater capacitance is included in $C_S$.
Potential For Energy Savings
In this section, we explore transcoder schemes for possible implementation in Section 5. We start by extracting realistic bus traffic from an event-driven simulator. After highlighting characteristics of the resulting traffic, we subject it to different prediction schemes to evaluate their efficacy.
2 Wire parameters derived from the Berkeley Predictive Technology Model (BPTM) [4] , using wire geometries and dielectric values from the International Technology Roadmap for Semiconductors (ITRS).
Extracting Realistic Bus Traffic
In this paper, we explore two buses: the memory bus and the integer register bus. Memory buses tend to have high capacitance because they extend off-chip or, in the case of Systems on a Chip (SOC), travel a long distance. Integer register buses tend to have high fan-outs, and thus high capacitance. These two buses are not the only energy-intensive communication channels in a microprocessor. Rather, they represent channels that could benefit from our techniques.
To acquire realistic traces, we instrumented the most detailed SimpleScalar 3.0 [5] simulator, sim-outorder, to capture data values on internal buses. SimpleScalar is a well-known functional simulator for out-of-order execution. In functional simulation, results are computed immediately after instruction dispatch, with accounting to track timing and input/output dependencies through registers and memory. As a result, there are no "buses" with realistic timing. To address this problem, we enhanced SimpleScalar with bus timing generators that extract values from the ongoing simulation and re-time them to resemble actual bus timing.
Output Bus to Caches/Memory: To simulate the external data and address buses, we maintain a queue of time-ordered entries for the value on the bus from the current cycle into the future. Memory events are inserted in the scheduler queue when load or store instructions are ready to execute. If the data must be fetched from main memory, the access latency calculated by sim-outorder generates an event corresponding to a future cycle in which the value will appear on the data bus. We refer to this bus as the "memory bus" throughout the rest of the paper.
Register file output to functional units: The synthesis of bus behavior is easy for the register bus, since every instruction goes through an explicit pipeline stage in SimpleScalar. Hence, we can easily determine what value would be on the register bus each cycle. We refer to this bus as the "register bus" throughout the rest of the paper. 
Trace characteristics
To begin our investigation, we explore some statistical properties of bus traces acquired as in the previous section. Figure 7 shows the cumulative distribution function of all unique values, sorted in order of most frequent to least frequent. This graph attempts to show how many unique values make up the majority of trace values for several benchmarks from a 10 million value trace. For the 4 benchmarks we chose to look at in this analysis, none of them have a unique value set size with significant coverage until we get into the 100-1000 value range. This suggests that a strictly frequency-based compression approach will not be very effective unless we can afford a very large dictionary size to hold 100s-1000s of unique bus values. Figure 8 shows the average fraction of values in a trace that are unique in a window, given a particular window size. This statistic suggests that a compressor based on tracking a moving window of unique values might be reasonably successful with even a small number of entries (10s of entries). We exploit this property later.
Coding schemes
Our next task is to introduce coding mechanisms that exploit the above statistical characteristics.
Spatial encoder:
Each input value could be coded as a single bit on a $2^W$-bit bus. We call this the "Spatial Encoder" since it converts each input value to a transition at a particular spatial location on the long bus. This transcoder provides extremely low communication energy at the expense of an impractical, exponential cost in area.
Inversion encoder:
A more realistic stateless encoder is shown in Figure 9 . This inversion coder produces a "transition vector" by xor'ing the current and new bus value, and then xors this transition vector with one of a number of available constant bit patterns to generate the next bus value. The bit pattern chosen is the one that results in the minimum total and coupled transitions after xor'ing. The encoded value is sent along with the identity of the chosen bit pattern. A simplification to this is to consider inverting all or none of the bits [19] .
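A minimal Python sketch of the inversion coder follows. It makes several simplifying assumptions that are not taken from Figure 9: the coupling cost counts adjacent wire pairs in which exactly one wire toggles, the cost of sending the pattern identity is ignored, and the names `cost`, `inversion_encode`, `inversion_decode`, and `lam` are hypothetical.

```python
def popcount(word):
    return bin(word).count("1")


def cost(transition_vector, width, lam):
    """Toggles plus lam times adjacent pairs where exactly one wire toggles."""
    pair_mask = (1 << (width - 1)) - 1
    coupled = popcount((transition_vector ^ (transition_vector >> 1)) & pair_mask)
    return popcount(transition_vector) + lam * coupled


def inversion_encode(previous_value, new_value, width, patterns, lam=1.0):
    """Pick the constant pattern whose xor gives the cheapest transition vector."""
    mask = (1 << width) - 1
    transition_vector = (previous_value ^ new_value) & mask
    best = min(range(len(patterns)),
               key=lambda i: cost(transition_vector ^ patterns[i], width, lam))
    return best, transition_vector ^ patterns[best]


def inversion_decode(previous_value, pattern_index, sent_vector, patterns):
    """Undo the pattern xor and apply the transition vector to the old value."""
    return previous_value ^ sent_vector ^ patterns[pattern_index]
```

Passing `patterns = [0x00, 0xFF]` on an 8-bit bus gives the all-or-none simplification, and setting `lam` to 0 makes the choice depend on the transition count alone.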
LAST-value predictor:
A LAST-value predictor [11] captures strings of repeated values. Although we do not use this predictor by itself, we include it in combination with all of the remaining predictors. We assign code "0" to repeated values to avoid a penalty relative to the un-encoded case (which expends no energy when the bus is unchanged).
Stride predictor:
Our first viable predictor contains multiple stride predictors [16] and makes use of predictor confidence. A shift register containing previous bus values is used to calculate the stride of every data-word, every other data-word, every 3rd data-word, etc. The lower order strides are encoded with lower weight codes because they are assumed to be more frequently matched. We preferentially match lower order strides to minimize outgoing weight.
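The multiple-stride idea can be sketched in Python as below. The class name, the use of a `deque` as the shift register, and the wrap-around arithmetic are illustrative assumptions; the order-k predictor here extrapolates from the values seen k and 2k words ago.

```python
from collections import deque


class StridePredictors:
    """Order-1..num_strides stride predictors sharing one history shift register."""

    def __init__(self, num_strides, width):
        self.num_strides = num_strides
        self.mask = (1 << width) - 1
        # history[0] is the most recent bus value
        self.history = deque(maxlen=2 * num_strides)

    def predictions(self):
        """Return one prediction per stride order (None until enough history)."""
        result = []
        for k in range(1, self.num_strides + 1):
            if len(self.history) >= 2 * k:
                newer = self.history[k - 1]       # value k words ago
                older = self.history[2 * k - 1]   # value 2k words ago
                result.append((newer + (newer - older)) & self.mask)
            else:
                result.append(None)
        return result

    def update(self, value):
        self.history.appendleft(value & self.mask)
```

For two interleaved streams, for example 100, 7, 104, 9, the order-2 predictor yields 108 for the next word while the order-1 predictor does not, which is the case the higher-order strides are meant to catch.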
Window-based predictor:
Our second predictor captures the locality implied by Figure 8. It keeps the last few unique values in a shift-register and encodes them as low-weight codes. A shift occurs when a value appears that is not already in the register, entering a new value and discarding the oldest value.
Context-based predictor:
The Context-based transcoder in Figure 10 assigns low Hamming-weight codes to frequent values [16]. It operates by maintaining a table of values sorted by frequency. When the number of entries is less than the bus width, the code-word associated with an entry can be a single bit, since each wire in the bus can uniquely identify an entry. Larger tables require more sophistication in code assignment to minimize cross-coupling. When an input value is not in the table, it is sent un-encoded. Each table entry has a frequency counter that is incremented on a match to the input. Naively, new bus values could be inserted into the table immediately, replacing the lowest frequency value. However, this causes thrashing on the lowest value. Instead, we utilize a Window-based predictor structure (shift-register) with frequency counts at the input. Entries in the shift register accumulate counts and are later entered into the frequency table if their frequency count is above threshold. To accommodate phases in computation, we periodically divide counter values by two; the period is called the "counter division period."
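The Window-based predictor with LAST-value handling can be sketched in Python as follows. This is an illustrative model, not the circuit of Section 5: the symbols `("last",)`, `("hit", index)`, and `("raw", value)` stand in for the actual code words (for example, code "0" for a repeat and a single set bit per window entry), and the frequency table of the Context-based design is not modeled.

```python
class WindowCoder:
    """Shift register of the most recent unique values; same class on both bus ends."""

    def __init__(self, entries):
        self.entries = entries
        self.window = []      # unique values, newest first
        self.last = None      # most recent input, for LAST-value prediction

    def _update(self, value):
        if value not in self.window:
            self.window.insert(0, value)      # shift in the new value
            del self.window[self.entries:]    # discard the oldest value
        self.last = value

    def encode(self, value):
        if value == self.last:
            symbol = ("last",)
        elif value in self.window:
            symbol = ("hit", self.window.index(value))
        else:
            symbol = ("raw", value)
        self._update(value)
        return symbol

    def decode(self, symbol):
        if symbol[0] == "last":
            value = self.last
        elif symbol[0] == "hit":
            value = self.window[symbol[1]]
        else:
            value = symbol[1]
        self._update(value)
        return value
```

The encoder and decoder stay synchronized because each updates its window only from values that both sides can reconstruct.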
Coding effectiveness
We next evaluate the efficacy of the various transcoder schemes. We plot the fraction of energy removed by each scheme; since Equation 1 is linear, this metric is independent of wire length. Unless otherwise noted, we assume the transition to coupling energy ratio is 1 ($\Lambda = 1$). In addition, we do not account for energy used in encoding and decoding; this analysis is saved for Sections 5 and 6.
Inversion performance: Figure 11 shows how the simplified inversion coder performs with trace data from four SPEC benchmarks and uniformly distributed random data. Three different minimizing functions are used: 0, 1, and N. In 0, the coder chooses a bit pattern assuming the technology-specific $\Lambda$ in Equation 1 is 0, which is equivalent to the encoder in [19]. In 1, the function assumes $\Lambda = 1$, and in N, the function knows the correct value of $\Lambda$. We simulated the inversion coder with varying $\Lambda$. Note that except for technologies with high values of actual $\Lambda$, transcoders that assume $\Lambda = 1$ (function 1) are accurate approximations. The figure also shows that using random data to determine the energy consumption of an inversion scheme will generally yield better results than would occur in reality. Realistic traces are thus very important for proper conclusions.
Strided performance: Figures 12 and 13 plot the energy reduced by the stride predictors, normalized to the unencoded case, as a function of the number of stride predictors used. On the memory data bus in Figure 12, there is an observable jump in energy reduction between three and four stride predictors, but it is not large. For the register bus, Figure 13 gives no conclusion as to how many stride predictors would be useful across the benchmarks. In both cases, adding more stride predictors reduces the number of bus transitions and thus the energy, which is what we would expect. However, the lack of breakpoints in these curves complicates the process of choosing a reasonable predictor size. Further, if we compare Figure 11 to Figures 12 and 13, we see that for the same bus and wire model, some of the stateless inversion coders remove more energy than the biggest stride predictors. This indicates that the stride predictors are not the best stateful coding mechanism.
Window-based performance: Figures 14 and 15 plot the energy removed by the Window-based transcoder as a function of shift-register size. The knees of the curves center around 8 entries. At this configuration, the transcoder removes about 19-25% of the energy, a respectable performance.
Context-based performance:
The Context-based transcoder depends on several parameters, such as the size of the frequency table, the shift register length, and the counter division period. The table size determines the number of values that can be tracked; the shift register length dictates how long a new value can accumulate counts before being shifted out; the counter division period determines how quickly the transcoder responds to changes. Figure 16 and Figure 17 show that a frequency table size somewhere between 20 and 32 is optimal for a shift register size of 8. We reach the point of diminishing returns for frequency table sizes greater than 16. Figure 18 shows that a shift register size of 8 entries is a good trade-off between normalized energy removed and design complexity. As shown in Figure 19, the energy removed levels off at a division period of 4096 cycles for many of the benchmarks.
From these results, the context-based encoder removes about 25% to 35% of normalized energy on average for reasonable shift register sizes (4 to 8) and table sizes (24 to 32). If we compare these results to the inversion coders and the stride predictors, the stride predictors remove only 10% to 15% of the energy and the inversion coders remove 15% to 20% (except for the random input).
The stride predictors require a comparison for each desired stride. The Context-based encoder requires only frequency comparisons and counters. The Window-based encoder is even simpler. Given their superior potential for removing energy, we continue to explore Window and Context-based encoders in the following sections.
Toward Building a Real Transcoder
A transcoder trades the energy of encoding and decoding for a reduction in communication energy. In order for a proposal such as this to be believable, we must be very careful in our accounting to ensure that the transcoder does not use more energy than it saves. In this section, we design a complete hardware transcoder and analyze its cost in Section 6. As decided in Section 4, we will consider the Window and Context-based transcoders.
Energy budget
If we consider the energy that we eliminate using a particular coding technique, we obtain what we will call our energy budget. This metric is independent of the particular circuit implementation we choose to use for our transcoder and depends only on our particular wire model, wire length, benchmark, and transcoder. As long as a transcoder implementation does not exceed the energy budget, then it will break even (save energy). Figure 20 shows the energy budget at 0.13 µm as a function of transcoder size in both Window-based and Context-based designs. Different lines correspond to particular wire lengths and transcoder configurations. Since transcoder energy is independent of wire length and each transition saved is worth more energy for longer lengths, we see that the energy budget increases with increasing wire length. Because the Context-based architecture saves only a small fraction more transitions than the Window-based architecture for shorter wire lengths, they have approximately the same energy budget. For longer wires (1 mm), the energy budget gap between them widens.
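The energy budget and the resulting break-even length can be expressed as a short Python sketch. It assumes that un-encoded wire energy grows linearly with length and that the fraction of energy removed is independent of length; the function names are hypothetical and the numbers in the usage note are placeholders, not measurements from this paper.

```python
def energy_budget(wire_energy_per_mm, fraction_removed, length_mm):
    """Energy a transcoder may spend per cycle before it stops saving energy."""
    return wire_energy_per_mm * fraction_removed * length_mm


def break_even_length(transcoder_energy, wire_energy_per_mm, fraction_removed):
    """Wire length at which the energy saved equals the transcoder's own energy."""
    if fraction_removed <= 0:
        return float("inf")   # a coder that removes nothing never breaks even
    return transcoder_energy / (wire_energy_per_mm * fraction_removed)
```

For example, with placeholder values of 2.0 energy units per cycle for the transcoder, 1.0 unit per mm for the wire, and 25% of wire energy removed, `break_even_length(2.0, 1.0, 0.25)` gives 8.0 mm; any longer wire leaves a positive margin.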
Realistic Transcoder Design
In this section, we discuss an efficient implementation for two transcoder designs, including operations for sorting and coding. We also present some customized circuits.
Efficient Sorting Algorithm:
The frequency table must be sorted by counter values to make it easier to discard least-recently used values. A benefit is that the position in the table can be used as the code word. We devised a low-overhead sorting algorithm for the Context-based design. It involves closest neighbor swapping, equality comparison, and an extra bit to keep the frequency table sorted [20].
Required Operations:
To perform the coding and sorting functionalities, the Window and Context-based transcoder designs need a number of elementary operations that contribute to their dynamic power consumption. The operations are labeled in Figure 21 and are the following:
match:
The bus input value is compared to all the entries in the shift register and lookup table to determine whether we can send a dictionary index instead of the full value.
counter comparison:
We must compare counters to sort frequency table entries. An arbitrary sort is costly; instead, we start with all entries sorted by frequency. We maintain this invariant by catching situations in which adjacent counter values match and the lower value is incremented; we then swap these entries. Values from the shift register enter the table when they are more frequent than the least-frequent table entry.
swap:
When two table entries accumulate counts that place them out of order, we swap the two entries.
shift:
A new bus value can be inserted into the shift register every cycle. The value on the end of the shift register shifts out and is either inserted into the frequency table or discarded based on its frequency count.
last value tracking:
We must catch repeated strings of values to achieve LAST-value prediction (coded as "0").
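The counter comparison and swap operations can be modeled in software as below. This Python sketch is only an approximation of the hardware scheme: it repeats the neighbor swap until order is restored, whereas the design described here relies on an equality comparison and an extra bit whose details are given in [20]. The function name and the `[value, count]` table layout are assumptions.

```python
def increment_and_resort(table, index):
    """Increment one counter and restore descending order by neighbor swaps.

    table is a list of [value, count] entries sorted by count, highest first.
    Returns the new position of the incremented entry.
    """
    table[index][1] += 1
    # The entry can only have overtaken entries above it, so bubble upward.
    while index > 0 and table[index - 1][1] < table[index][1]:
        table[index - 1], table[index] = table[index], table[index - 1]
        index -= 1
    return index
```

Since only one counter changes by one per update, an entry moves upward only past entries whose counts equaled its old count, which is why an equality comparison is enough to detect the need for a swap.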
Customized Circuits: For a Window-based design, we must implement "shift," "match," and "last value" operations described above. We utilized careful circuit design to ensure lower power consumption for these operations:
Single clock shift cell:
As shown in Figure 22, we use a single PMOS pass transistor to enable/disable the feedback loop between the cross-coupled inverters. The advantage of this design over a transmission gate design is that no complementary shift signal is needed and there is one less clock to route.
Pointer-based shift entries:
We use a pointer-based shift register design, as shown in Figure 23. Thus for every shift, only the value at the head entry changes. This reduces the number of bit transitions (and therefore energy) when compared to a standard shift register.
Pointer-based last value:
Since the last input value is in the shift register, we maintain a vector of bits (one bit per entry) to point at this value. This approach reuses the matching circuits for LAST-value functionality.
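A behavioral Python sketch of the pointer-based shift register and last-value pointer follows. It models the bookkeeping only, not the circuits of Figures 22 and 23; the class and method names are hypothetical, and the last-value pointer is stored as an index rather than as a one-bit-per-entry vector.

```python
class PointerShiftRegister:
    """Circular buffer: a shift rewrites only the slot under the head pointer."""

    def __init__(self, entries):
        self.slots = [None] * entries
        self.head = 0        # slot holding the oldest value, overwritten next
        self.last = None     # index of the slot holding the last input value

    def match(self, value):
        """Return the slot index that holds value, or None if absent."""
        for index, stored in enumerate(self.slots):
            if stored is not None and stored == value:
                return index
        return None

    def insert(self, value):
        """Record an input value; shift only if it is not already stored."""
        index = self.match(value)
        if index is None:
            index = self.head
            self.slots[index] = value                 # only one entry changes
            self.head = (self.head + 1) % len(self.slots)
        self.last = index                             # reuse match result for LAST
        return index
```

In a conventional shift register every entry would be rewritten on each shift; here a shift touches one slot and two pointers, which is the source of the transition savings.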
Discussion
The addition of the frequency table adds considerable complexity to the Context-based design. If we compare the two designs while keeping the total number of entries constant, we see that the Window-based design only requires shifting of bus value entries and value matching circuitry as shown in Figure 21 . When we add frequency tracking, all the value entries must be augmented with a Johnson counter, counter matching circuitry and the necessary control logic to determine swapping, counting and sorting. In a preliminary transistor level schematic of the full Context-based design, we noted that the counter and counter match circuitry adds approximately 50% more area to the Window-based design.
The energy budget (Figure 20) showed that the two designs have comparable performance for short wires. From transistor level designs, we estimate that approximately twice as many operations of similar energy usage would happen per cycle in the Context-based architecture. Both swapping and counting turn out to be extremely expensive operations. We therefore opt to use the Window-based architecture for the rest of our investigation.
Physical Implementation
We implemented the Window-based encoder in layout to accurately evaluate energy consumption. The design utilizes the "single clock shift cell," "pointer-based shift entry," and "pointer-based last value" circuits. It does not need to implement the sorting algorithm because there are no counters to sort. The decoder, which was implemented at the schematic level, shares many identical components. From HSPICE simulation, we find that it consumes 20% less energy compared to the schematic equivalent of the encoder.
Experiment and Results
From the physical layout described in the previous section, we calculate the energies consumed by the wires we intend to encode, as well as the transcoding circuitry itself.
Methodology
The ultimate goal of transcoding is to reduce the energy consumed by buses. Therefore, it is critical to determine how much energy is used to perform the encoding and decoding process for an actual layout. The most accurate method for determining the power consumption of the coding circuitry would be to simulate the benchmarks on the layout for bus power usage. This would have been too slow.
Instead, as shown by Figure 25 , we gathered statistical averages of various operations in the high-level transcoder module which was integrated into SimpleScalar [5] . This module closely simulates the hardware transcoder architecture and keeps a running total of the energy consuming operations performed by this architecture. Operations are tallied separately for each SPEC benchmark.
These operation counts are later combined with energy dissipation numbers derived from HSPICE simulations of an actual extracted layout netlist. Energy is calculated for each operation separately. From the per-operation energies obtained with HSPICE and the operation counts obtained with SimpleScalar, we derive the total energy expenditure. We validate the derived total against the energy obtained by running the layout netlist with a short 100-cycle trace. The derived energy comes within 5% of the actual energy expenditure. This method, although less accurate than simulation of the complete layout, achieved a tremendous increase in simulation speed.
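The accounting described above amounts to a weighted sum of operation counts and per-operation energies. The following is a minimal sketch of that bookkeeping, not the authors' tooling: the operation names and all numeric values are hypothetical placeholders, and the function names are ours.

```python
def derived_energy(op_counts, energy_per_op):
    """Sum count * per-operation energy over all tallied operations.

    op_counts:     operation name -> number of occurrences (e.g. from the
                   architectural simulator).
    energy_per_op: operation name -> energy of one occurrence (e.g. from
                   circuit simulation), in any consistent unit.
    """
    missing = set(op_counts) - set(energy_per_op)
    if missing:
        raise KeyError(f"no energy figure for operations: {sorted(missing)}")
    return sum(count * energy_per_op[op] for op, count in op_counts.items())


def relative_error(derived, reference):
    """Relative deviation of the derived energy from a reference value."""
    return abs(derived - reference) / reference


# Hypothetical operation names and values, for illustration only.
op_counts = {"match": 100, "shift": 40, "last_value_hit": 25}
energy_per_op_pj = {"match": 1.5, "shift": 2.0, "last_value_hit": 0.5}

total_pj = derived_energy(op_counts, energy_per_op_pj)
print(total_pj)
```

The validation step corresponds to calling `relative_error` with the derived total and the energy measured from a short trace run on the netlist, then checking that the result is below the accepted tolerance.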
To obtain estimates of energy usage from our transcoder circuit layout with future technologies, we used a number of tricks to scale the SPICE netlist extracted from the 0.13 µm STMicro layout. We used the BPTM model [4] for the 0.10 µm and 0.07 µm processes. We used the less accurate BPTM technology models because STMicro models were not available for other process technologies.
We performed technology scaling using the following procedure. Starting from the netlist generated by Cadence, we (1) scale transistor gate lengths and widths and source/drain peripheral lengths linearly by the quotient of the desired and current minimum feature sizes, and scale source and drain areas quadratically; (2) scale wire capacitances based on the BPTM estimates for interconnect dimensions and parasitic capacitance; and (3) simulate the modified netlist using HSPICE to find the scaled energy consumption.
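Step (1) of the procedure can be illustrated with a short sketch. The `Mosfet` record and `scale_mosfet` function are hypothetical names introduced here for illustration, and the example device dimensions are placeholders. Steps (2) and (3) depend on BPTM data and HSPICE and are not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mosfet:
    """Geometry of one transistor instance (hypothetical record type)."""
    length: float  # gate length
    width: float   # gate width
    ad: float      # drain area
    as_: float     # source area
    pd: float      # drain peripheral length
    ps: float      # source peripheral length


def scale_mosfet(m, current_feature, target_feature):
    """Scale lengths linearly and areas quadratically by target/current."""
    s = target_feature / current_feature
    return replace(
        m,
        length=m.length * s,
        width=m.width * s,
        ad=m.ad * s * s,
        as_=m.as_ * s * s,
        pd=m.pd * s,
        ps=m.ps * s,
    )


# Placeholder device at 0.13 um, scaled to 0.07 um (units: um and um^2).
example = Mosfet(length=0.13, width=0.52, ad=0.2, as_=0.2, pd=1.8, ps=1.8)
scaled = scale_mosfet(example, current_feature=0.13, target_feature=0.07)
print(scaled)
```

Applying the same function to every transistor in the netlist gives the scaled device geometry, after which the wire capacitances would be replaced separately.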
Energy and Latency Results
We present the area, energy, and delay of our transcoder layout in Table 2. Even though leakage energy increases as technology shrinks, the table shows that it is still orders of magnitude smaller than the dynamic energy of the encoder. Additionally, the design consists of fewer than 5k transistors, giving it the small area necessary to fit on both ends of a bus.
The latency of the encoder is high due to the serial NAND match design. This delay could be reduced by optimizing the matching circuit for speed at the cost of a small amount of additional power. The matching circuit is currently made of two NAND trees of 16 bits each, but one could imagine breaking this tree into more groups of fewer bits or changing it into a flatter OR-based matching circuit. Additionally, due to limited manpower, we were unable to fully optimize transistor sizings in the swapping circuits. A better-designed version would have lower latency.
We obtained energy graphs for the SPEC95 benchmarks on the register and memory buses at various wire lengths with the SR-only design. Figures 26 and 27 show the ratio of transcoder plus wire energy to pure wire energy for the 8-entry shift register design. As shown in Figure 26, the 8-entry shift register performs fairly well. The transcoder saves energy, or at least meets its energy budget, on almost all benchmarks at wire lengths greater than 15mm, with a median break-even length of 11.5mm. For SWIM, the transcoder begins to save energy at wire lengths as short as 3mm.
The result is less encouraging for the memory bus. Although the fraction of transitions removed was high, the absolute number of transitions removed was low, so the wire energy saved was not enough to offset the energy of the transcoder circuitry. Perhaps a different coding scheme with a simpler encoder is needed to save wire transition energy on the memory bus.
The resulting crossover lengths are given in Table 3. As technology shrinks, the crossover point becomes shorter, which is what we would expect since wire power grows more dominant in smaller technologies. The crossover points for the projected 16-entry transcoder are also shorter because a 16-entry transcoder removes more energy from wire transitions. These trends signify that as technology shrinks, more complex transcoders are well-positioned to take advantage of the growing disparity between wire and device energy.
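The notion of a crossover length can be made concrete with a simplified linear model. This is a sketch under stated assumptions, not the repeated-wire model used for the reported results: we assume the uncoded wire energy per cycle grows linearly with length, that coding removes a fixed fraction of it, and that the transcoder spends a fixed energy per cycle. All parameter names and example values are hypothetical.

```python
def energy_ratio(length_mm, transcoder_energy, wire_energy_per_mm,
                 fraction_removed):
    """Ratio of (transcoder + coded wire energy) to uncoded wire energy.

    A ratio below 1.0 means the transcoder saves energy at this length.
    """
    uncoded = wire_energy_per_mm * length_mm
    coded = (1.0 - fraction_removed) * uncoded
    return (transcoder_energy + coded) / uncoded


def crossover_length(transcoder_energy, wire_energy_per_mm, fraction_removed):
    """Length at which the ratio equals 1.0 under the linear model.

    Solving E_t + (1 - r) * e_w * L = e_w * L gives L = E_t / (r * e_w).
    """
    if fraction_removed <= 0.0:
        raise ValueError("no wire energy is removed, so there is no crossover")
    return transcoder_energy / (fraction_removed * wire_energy_per_mm)


# Placeholder values (energy per cycle in arbitrary consistent units).
length = crossover_length(transcoder_energy=2.0,
                          wire_energy_per_mm=0.25,
                          fraction_removed=0.5)
print(length)
```

Under this model, a smaller transcoder energy or a larger removed fraction both shorten the crossover length, which matches the direction of the trends described above for smaller technologies and for the 16-entry design.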
Conclusion
We explored the space of bus transcoders for reducing the energy consumption of on-chip buses. We evaluated several transcoder schemes at a high level and showed that these techniques are indeed effective in reducing energy-consuming wire events. For instance, we achieved an average 35% normalized energy reduction for SPEC95 benchmarks on the register bus. Although we did not perform an exhaustive exploration of the transcoder design space, we did give a detailed look at a number of possibilities, selecting one (the Window-based transcoder) for later implementation. From this high-level evaluation, we pushed an implementation of the Window-based transcoder all the way down to a 0.13 µm layout. Using this physical model, we performed a complete evaluation of the transcoder's power consumption with realistic bus traffic. We found that an 8-entry Window-based transcoder on the register bus saves energy for almost all SPEC95 benchmarks at wire lengths greater than 15mm, with a median break-even length of 11.5mm. A projected 16-entry design at 0.07 µm breaks even at wire lengths of only 2.7mm. We believe that trading logic complexity to save on-chip communication energy will be increasingly attractive as Moore's law marches forward.
