In this paper, we introduce a new encoding scheme that explicitly targets the minimization of the bus energy due to the crosstalk capacitances between adjacent bus lines. The key transformation operated by the code consists of a permutation of the bus lines, implemented directly during physical design; as a desirable consequence, no additional encoding/decoding logic is required at the bus boundaries, which implies that no latency penalty is introduced on the processor-memory path. An additional feature of the permutation-based code is that the encoding function can be determined without any knowledge of the binary stream being transmitted. Therefore, the code can be effectively exploited in general-purpose computing systems. The proposed code works best on address buses; savings obtained for different address traces generated by two different processors are on the order of 26% with respect to the unencoded streams.
INTRODUCTION
Several bus encoding schemes for low power have been proposed in the literature in the last few years. Most of these approaches target the minimization of the transition activity on the bus, with the objective of reducing the switching of the capacitances of the bus lines. Some codes, e.g., the Bus-Invert [1], its variants [2, 3] and the Adaptive [4], have general applicability, i.e., they do not require any knowledge about the streams being transmitted. Others achieve more aggressive transition activity minimization by taking advantage of some characteristics of the pattern sequence traveling on the bus. In particular, a whole class of encoding methods targets address bus power minimization and exploits the strong sequentiality of address streams; examples are Gray [5], T0 [6] and their modifications [7, 8]. Unfortunately, sizable transition activity reductions are accompanied by high-overhead codecs. In particular, while for Gray the penalty is in performance, T0 encoders and decoders are more energy demanding. Thus, the T0 solution is only suitable for off-chip buses, where high wiring capacitances allow the codec overhead to be amortized. Further optimizations can be obtained when the encoding function is derived from the analysis of a (set of) specific stream(s). In this case, the statistical properties of the input sequence are assumed to be known and can be used to automatically generate the encoding/decoding functions and their corresponding interface circuitry.
Codes of this kind are well suited for embedded systems, where cores and microcontrollers tend to repeatedly execute a given application. Beach [9] and Working Zone [10] are examples of application-specific codes designed explicitly for address buses, while the information-theoretic code of [4], a practical generalization of the work of [11, 12], also finds application in data bus encoding. All the encoding solutions mentioned above have one point in common: They try to minimize the number of transitions on the bus lines, because this translates to a reduction of the capacitance that is switched during the communication. With technology scaling, however, the importance of inter-wire, or crosstalk, capacitances is becoming predominant.
It has been estimated that, for technologies below 0.25 µm, the simultaneous transition to opposite values (i.e., 0 → 1 and 1 → 0) of two adjacent bus lines dissipates about four times more energy than estimated without considering coupling effects [13]. As a consequence, accounting for crosstalk capacitances between pairs of adjacent wires is becoming key for the applicability of low-power bus encoding in modern designs. Recent papers [13, 14] provide some pioneering research on this subject. The solutions proposed in those works, however, share a major limitation with every other bus encoding technique available in the literature: They require some interface circuitry to implement the transformation defined by the encoding. In several cases, e.g., performance-constrained designs, the addition of the extra logic cannot be afforded because it lies on the critical path of memory-to-processor transfers, even if some encoding schemes reduce this hardware overhead to the minimum. In this work, stemming from the observation that the minimization of the coupling capacitances can be achieved with a permutation of the bus lines, we propose a new address bus encoding scheme that does not require explicit encoding/decoding circuitry. The permutation of the bus lines is accomplished at the layout level; this is a distinctive feature of our approach, because it allows the introduction of the encoding during physical design, where bus capacitance information can be accurately estimated.
Conversely, all existing techniques need to be applied at the architectural level, since the presence of the encoding/decoding logic must be taken into account during RTL-to-layout synthesis in order to possibly enable a recovery of the extra latency imposed by the codec. We formulate the problem of finding an effective permutation of the bus lines as a graph problem (a variant of the well-known traveling salesman problem), for which very efficient heuristics exist. We provide experimental evidence that permutation-based encoding is general enough not to require a pre-processing phase of the address stream of binary patterns being transmitted over the bus; therefore, the method can be categorized among the general-purpose address bus encoding schemes, such as the Gray code.
The physical capacitances of a bus line can be modeled as shown in Figure 1. Besides the usual capacitance between the line and ground, $C_L$ (the self capacitance), the coupling capacitance between the line and its adjacent lines, $C_I$, must also be modeled.
The ratio $\lambda = C_I / C_L$ represents the relative weight of the two capacitive effects. The case $\lambda = 0$ represents the conventional bus energy model based on the switching of the bus self capacitances. In deep-submicron technologies, $\lambda$ tends to become larger than 1, expressing the increasingly dominant effect of coupling capacitances with respect to self capacitances. The energy model per cycle for a bus line will thus include the two capacitive effects:

$$E = (\alpha_L C_L + \alpha_I C_I)\, V_{dd}^2$$

where $\alpha_L$ and $\alpha_I$ denote the rates at which each capacitance is switched, and $V_{dd}$ is the supply voltage. While $\alpha_L$ represents the conventional switching activity of the lines, $\alpha_I$ is related to the simultaneous switching of two adjacent lines. There are four types of transition pairs between two adjacent lines a and b:
1. a and b switch to different final values;
2. One of the two lines switches, while the other does not;
3. a and b switch to the same final value;
Of these four types, only the first two cause $C_I$ to switch, yet by a different amount. The first type of transitions will cause $C_I$ to switch twice in a cycle, while transitions of type 2 imply a single switch per clock cycle. In the case of type 2 transitions, however, $C_I$ will switch only if the final values on the two bus lines are different.
We assume that the transitions on a and b are perfectly aligned in time. This assumption is reasonable because the drivers typically latch the data sent on the bus, thus eliminating possible misalignments between two simultaneous transitions. If this assumption does not apply, we should also consider possible switchings for transitions of type 3. Table 1 shows the normalized energy consumption for a two-line bus, when all capacitive effects are considered [13]. In the table, only transitions from 0 to 1 are counted as power dissipating transitions on $C_L$. This is because the 0 → 1 transition is the one that actually charges the capacitance, drawing energy from the power supply. In practice, this distinction replaces the factor 1/2 in the conventional power dissipation model.
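The transition classification above can be sketched in a few lines of Python. This is a minimal illustration that follows the prose description only; it is not a transcription of Table 1, and the function names and the parameter `lam` (the coupling-to-self capacitance ratio) are hypothetical names chosen for this sketch.

```python
def coupling_switches(a_prev, a_next, b_prev, b_next):
    """Number of times the coupling capacitance between lines a and b switches."""
    a_sw = a_prev != a_next
    b_sw = b_prev != b_next
    if a_sw and b_sw:
        # Type 1 (different final values) switches C_I twice;
        # type 3 (same final value) does not switch it at all.
        return 2 if a_next != b_next else 0
    if a_sw or b_sw:
        # Type 2: counted only when the final values differ.
        return 1 if a_next != b_next else 0
    return 0  # Type 4: no line switches.


def self_switches(prev, nxt):
    """Count only the charging (0 -> 1) transition on a line's self capacitance."""
    return 1 if (prev, nxt) == (0, 1) else 0


def pair_energy(a_prev, a_next, b_prev, b_next, lam):
    """Normalized energy of a two-line bus for one cycle (assumed model)."""
    self_part = self_switches(a_prev, a_next) + self_switches(b_prev, b_next)
    return self_part + lam * coupling_switches(a_prev, a_next, b_prev, b_next)
```

Under these assumptions, an opposite-direction transition pair with `lam = 2.0` costs one self-capacitance charge plus two coupling switches, so the coupling term dominates as soon as `lam` exceeds 1.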
WIRE PERMUTATION ALGORITHM
In the existing literature [13, 14], the problem of minimizing the energy consumption due to coupling capacitances has been tackled by properly encoding the information transmitted on the bus so that the crosstalk energy is minimized. In other terms, redundancy, i.e., additional logic, is used to minimize the cost function.
The solution we propose in this paper is based on a different idea. A careful analysis of the problem shows that the minimization of the energy due to the coupling capacitances can be achieved by properly permuting (some of) the bus lines. Such a solution has the advantage of not requiring an explicit codec, because the transformation of the data is equivalent to a crossbar switch.
Problem Formulation
The problem of finding the best permutation of the bus lines can be formulated as a graph problem. We build a completely connected undirected graph $G(V, E, W)$, where $V = \{v_i\}$ is the set of vertices, $E = \{e_{i,j}\}$ is the set of edges, and $W = \{w_{i,j}\}$ is the set of edge weights. Vertices correspond to bus lines, and weights denote the conditional switching probabilities between pairs of bus lines. More precisely, given the asymmetric nature of the weights (not all transition pairs have the same importance, as shown in Table 1), the actual weights are the product of the conditional switching probabilities and the coefficients that multiply the parameter $\lambda$ (the ratio of coupling to self capacitance) in Table 1. From the statistical point of view, the computation of the weights requires knowledge of both the transition and the signal probabilities of each bus line.
The graph G is completely connected because it represents all the potential permutations of the nodes. The cost of a permutation, that is, of a path in the graph, is simply the sum of the costs of the edges that belong to the path.
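As an illustration of the formulation, the sketch below estimates the edge weights empirically from a trace and evaluates the cost of a candidate line ordering. It is a simplified stand-in: the weights are computed as the average number of coupling switches per cycle for each pair of lines, following the prose rules for transition types 1 and 2, rather than from the Table 1 coefficients. The names `coupling_cost`, `weight_matrix`, and `path_cost` are hypothetical.

```python
from itertools import combinations


def bit(word, i):
    """Value of bus line i in the given pattern."""
    return (word >> i) & 1


def coupling_cost(a0, a1, b0, b1):
    """0, 1, or 2 coupling switches for one transition pair (assumed model)."""
    switched = (a0 != a1) + (b0 != b1)
    if switched == 0 or a1 == b1:
        return 0
    return switched  # 1 if one line switched, 2 if both switched to opposite values


def weight_matrix(trace, n_lines):
    """Average coupling cost per cycle for every pair of bus lines."""
    totals = [[0] * n_lines for _ in range(n_lines)]
    for prev, nxt in zip(trace, trace[1:]):
        for i, j in combinations(range(n_lines), 2):
            c = coupling_cost(bit(prev, i), bit(nxt, i), bit(prev, j), bit(nxt, j))
            totals[i][j] += c
            totals[j][i] += c
    cycles = max(len(trace) - 1, 1)
    return [[t / cycles for t in row] for row in totals]


def path_cost(order, w):
    """Cost of a physical line ordering: sum of weights over adjacent pairs."""
    return sum(w[a][b] for a, b in zip(order, order[1:]))
```

Here `order` lists the bus lines in the sequence in which they would be laid out, so `path_cost` is the cost of the corresponding Hamiltonian path in the graph.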
The problem of finding the permutation with the minimum overall cost is therefore equivalent to finding the Hamiltonian path (i.e., a path that visits each vertex exactly once) with minimum cost, starting from a given vertex $v_i$ and ending with another vertex $v_f$. The minimum-cost Hamiltonian path in the graph represents the permutation of the bus lines that minimizes the given cost function. This problem is a variant of the well-known traveling salesman problem (TSP), which addresses the more specific problem of finding the minimum-cost Hamiltonian cycle, that is, a path with the same initial and final vertices. A further difference in our case is that the initial and final vertices are not specified in advance, because all possible permutations must be explored. Nevertheless, minimal modifications to an existing TSP solver are sufficient to make it applicable to our case.

Although the size of the TSP that must be solved is relatively small (the number of nodes in the graph is equal to the number of bus lines), it is large enough to prevent the use of an exact TSP algorithm. We must then resort to a heuristic solution to keep the execution time reasonable. In general, we can sacrifice some runtime for a better quality of the solution, since the TSP is solved once and for all for a given system. Among the vast choice of heuristic TSP solvers in the literature, we have implemented a local search algorithm inspired by the work of Lin [16]. Local searches are based on the exploration of the neighborhood of an initial solution to the problem. To guarantee a reasonably large exploration of the search space, several initial solutions should be tried as starting points of the search. Customizing the general paradigm of local search to a specific problem requires the definition of (i) the neighborhood and its size; (ii) the set of starting points and its cardinality.

24800 neighbor points of the initial solution.
Furthermore, in our implementation the search using the 3-neighborhood scheme has been repeated for R = 100 different random starting points.
This value has been shown to yield the optimum length of the TSP with a probability of 0.99 [16]. The average runtime of the algorithm with R = 100, k = 3 is a few seconds on an Alpha Personal Workstation 433 with 128 MB of main memory.
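The general shape of a local search with random restarts can be sketched as follows. Note that this is not the 3-neighborhood scheme of Lin [16] used in our implementation: for brevity, the sketch uses a simpler neighborhood (reversal of a contiguous segment of the path, in the style of 2-opt), applied to an open path rather than a cycle. The function names and the `restarts` and `seed` parameters are hypothetical.

```python
import random


def path_cost(order, w):
    """Sum of edge weights between consecutive lines in the ordering."""
    return sum(w[a][b] for a, b in zip(order, order[1:]))


def local_search(order, w):
    """Improve an open path by segment reversals until no reversal helps."""
    best = list(order)
    best_cost = path_cost(best, w)
    improved = True
    while improved:
        improved = False
        n = len(best)
        for i in range(n - 1):
            for j in range(i + 1, n):
                # Neighbor: same path with the segment best[i..j] reversed.
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cost = path_cost(cand, w)
                if cost < best_cost - 1e-12:
                    best, best_cost, improved = cand, cost, True
    return best, best_cost


def best_permutation(w, restarts=100, seed=0):
    """Run the local search from several random starting orderings."""
    rng = random.Random(seed)
    n = len(w)
    best, best_cost = None, float("inf")
    for _ in range(restarts):
        start = list(range(n))
        rng.shuffle(start)
        order, cost = local_search(start, w)
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost
```

Because the start and end vertices are free, no dummy closing edge is needed: the path cost simply omits the edge between the last and the first line.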
PHYSICAL DESIGN ISSUES
In principle, the encoding/decoding functions can be realized without any area, performance and power overhead, because they simply consist of a permutation of some of the bus lines. In practice, however, their implementation may introduce some overhead at the physical level. As a general observation, we note that the routing overhead (occupation of other levels of metal and vias) is confined to relatively small areas close to the components, and therefore has a small impact on the global routing. In this section, we analyze and quantify the impact of the encoding/decoding logic on conventional design metrics.
Design of the Permutation Network
In order to evaluate the impact on area, performance and power of the codec on the final design, we must perform an accurate analysis at the physical level. The first issue is how to realize an effective layout of the permutation network. Since our objective is that of employing a unique permutation for all applications, we can think of inserting a sort of "hardwired" crossbar switch into the layout of the design. However, a plain implementation of a sorting network requires as many vias as the number of bus lines, in the worst case.
For these reasons, we decided to implement a different strategy that guarantees the use of at most two vias per wire. The idea is to route a signal directly to its final position (wire $i$ goes to $\pi_i$).

Care must be taken, though, to prevent shorts between distinct wires. We implemented a technique that is suitable for any permutation: We first consider exchanges such that $\pi_i \geq i$ and route them toward the final position $\pi_i$ using a second metal layer. This allows the routing of two different signals on the same "bus slot" (see, for example, the two overlapping layers at the bottom of the layout of Figure 2). In this solution, the overlap area has to be strictly controlled to avoid high mutual capacitances. It is possible to route all $\pi_i \geq i$ signals without causing any short by using a slight modification of the left-edge channel routing algorithm. After all $\pi_i \geq i$ wires are routed (wires 4, 3 and 2 in the example of Figure 2), the remaining wires are treated similarly, without introducing vias, in the same metal layer as the rest of the bus. It might be necessary, in some cases, to build a sort of "bridge" to cross over a previously routed signal. This shows how it is possible to reduce the maximum number of vias needed to 2, and to compact the network.

For an acceptable estimation of the performance and power overhead introduced by the codec, it is necessary to accurately evaluate all parasitics of the interconnection network. To this purpose, we generated the actual layout, using the algorithm discussed above, for 32 different permutations encompassing all possible values of D (from 0 to 32, excluding 1), and we extracted resistance and capacitance values of a distributed net describing the electrical behavior of the encoder-bus-decoder cascade. The target technology is a 0.25 µm process with 6 levels of metal. We decided to route the bus on metal3, and route the codec nets on metal3/metal4. Minimum pitch has been used throughout the designs.
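For readers unfamiliar with left-edge routing, the sketch below shows the classic form of the algorithm applied to the wires that move toward a higher slot. It is only an illustration of the underlying idea: the modification used in our flow is not reproduced here, fixed points (wires that stay in place) are skipped as an assumption of the sketch, and the names `left_edge_tracks` and `pi` are hypothetical.

```python
def left_edge_tracks(pi):
    """Assign each wire i with pi[i] > i to a track.

    pi[i] is the final slot of wire i. Each moving wire occupies the span
    [i, pi[i]] across the bus; spans placed on the same track must not
    overlap, and spans that merely touch are treated as overlapping.
    Returns a dict mapping wire index -> track index.
    """
    spans = sorted((i, pi[i]) for i in range(len(pi)) if pi[i] > i)
    track_ends = []   # track_ends[t] = right end of the last span on track t
    assignment = {}
    for left, right in spans:
        for t, end in enumerate(track_ends):
            if end < left:
                # The span fits after the last one on this track.
                track_ends[t] = right
                assignment[left] = t
                break
        else:
            # No existing track has room: open a new one.
            track_ends.append(right)
            assignment[left] = len(track_ends) - 1
    return assignment
```

Processing the spans in order of their left edge and reusing the first free track is what gives the algorithm its name; the number of tracks used gives a rough indication of how much space the jogs require along the bus.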
The detailed Standard Parasitics Format obtained from an extraction with Cadence Affirma Hyperextract (a state-of-the-art extractor based on 2.5D extraction that proved to give well-matched results with 3D extractors) has been included in a SPICE netlist and then simulated. The overall distance from the transmitter end to the receiver end of the bus (including the presence of the encoder-decoder pair) was chosen to be 1 mm. The bus loads at the receiver end were chosen according to the input characteristics of actual memory blocks, and the driver capability according to its loads. The netlists were simulated with HSPICE with inputs switching at the maximal activity. The reason was that we were interested in the comparison between the unencoded bus and the coded version, rather than in absolute values. For each net, all maximum timings were considered in performance reports. 26,000 µm²). We observe that, since the permutation network does not use active area (i.e., cells), we are using the same chip area that would be occupied by the bus itself. The real area overhead is given only by the second level of metal, which is necessary to exchange the wires. The delay overhead never exceeds 6.3 ps, a value which is well below the delay of a single library cell. This amounts to less than 5% of the delay of the entire bus (158 ps). We observe that the delay is almost insensitive to D, while it depends on the resistive effects of the vias (that are never more than 2). Power figures are also insensitive to variations of D, and the overhead is always below 1%, which is within the noise margin of the SPICE simulation.
Effect of Boundary Capacitances
The previous analysis considered the bus as if it were isolated or, equivalently, far enough from other sources of crosstalk (e.g., ground lines). Obviously, this is an ideal assumption and should be removed when the effectiveness of the encoding must be assessed with accuracy. The presence of neighboring wires adversely affects the potential energy optimization achievable by our method because the designer typically has no control over the information traveling on these wires. Therefore, the switching of the coupling capacitances between the two boundary bus lines (i.e., Lines 1 and 32) and their respective neighboring lines will cause additional energy consumption. Taking these effects into account in the algorithm of Section 3 requires the analysis of the information that is carried by neighboring wires. In particular, we should evaluate the possible correlation between these wires and the boundary bus lines. The typical interaction between the bus and the neighbor wires is depicted in Figure 4. In general, there will be several wire segments of different lengths (typically much shorter than the length of the bus), at different distances from the bus (and thus with different coupling capacitances) and running in parallel with it.
Figure 4: Effect of Boundary Capacitances.
In practice, however, the neighbor wires will not be driven by the same drivers as the bus. From the point of view of energy balance, lines belonging to different drivers can be considered as independent. Therefore, the energy due to the boundary coupling capacitances depends only on the statistics of the bus boundary lines, and not on their relation with local boundary wires. This implies that transition probabilities (of the bus boundary lines) can be used instead of conditional transition probabilities (as for internal bus lines).
The energy model per cycle can thus be modified as follows:

$$E = \left[\alpha_L C_L + \alpha_I C_I + (\alpha_B' + \alpha_B'')\, C_B\right] V_{dd}^2$$

where $\alpha_B'$ and $\alpha_B''$ denote the transition probabilities of the two boundary bus lines, and $C_B = C_I / \rho$ is the boundary capacitance seen by the two wires, parameterized with respect to the relative distance $\rho$ of the neighbor wires. The minimum value of $\rho$ is 1, indicating that neighbor wires are at the same distance as the bus lines (typically, the minimum pitch).
EXPERIMENTAL RESULTS

PB Encoding
We have applied the permutation-based (PB) encoding to several address traces obtained from two different code profilers (namely, pixie for a MIPS processor, and Armulator for an ARM processor). Two artificial traces that cover a corner case typical of address traces (i.e., highly sequential streams) have also been considered: Counter represents a perfect counter that starts from a random address; CountSkip is a counter sequence intermixed with random jumps to new locations.
To determine the encoding and decoding functions, we have first built the cost matrices (corresponding to the weights of the graph edges) for all the chosen benchmarks. Then, all the matrices have been averaged to a single correlation matrix. Finally, this matrix has been used to compute the best permutation by applying the TSP solver on the graph corresponding to the single correlation matrix. Energy data are shown in Table 2.

Table 2: Energy Results for PB and Gray.
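The averaging step used to obtain the single correlation matrix can be illustrated as below. The sketch assumes a plain arithmetic mean with equal weight for every benchmark; the function name `average_matrices` is hypothetical.

```python
def average_matrices(matrices):
    """Element-wise arithmetic mean of equally sized square cost matrices."""
    n = len(matrices[0])
    count = len(matrices)
    return [
        [sum(m[i][j] for m in matrices) / count for j in range(n)]
        for i in range(n)
    ]
```

The resulting matrix plays the role of the edge weights of a single graph, on which the permutation search is then run once for all benchmarks.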
PB encoding achieves an average energy decrease of 26% with respect to the unencoded stream, while Gray reduces energy, on average, by 32.5%. Although PB savings are slightly worse, we emphasize that they are only due to minimization of crosstalk energy, while Gray savings come mainly from a reduction of the number of bus transitions (which, as a by-product, also reduces crosstalk energy). This indicates that the usage of PB does not exclude a preliminary stage of Gray encoding; in the next subsection, experimental data will confirm that PB can be combined with other codecs to synergistically minimize the total bus energy. We observe also that Gray adds a significant amount of logic on the critical path. The total delay introduced by the Gray codec is 6.2 ns, confirming the fact that the usage of codes that require complex interface circuitry is constrained to low-performance systems, where some latency penalty on bus transfers can be tolerated.
Combined Gray+PB Encoding
As mentioned in the previous subsection, the PB code attempts to minimize a cost function (i.e., the number of simultaneous pairwise transitions) which is different from that targeted by codes such as Gray (i.e., the total number of transitions). It may thus be wise, whenever the design constraints (e.g., available timing slack or latency on the bus) allow it, to apply PB after a preliminary step of conventional (i.e., Gray) encoding.
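The cascade of the two codes can be sketched functionally as follows. The Gray conversion is the standard binary-reflected one; the permutation is modeled in software as a bit shuffle, although in the proposed scheme it is realized by wiring rather than by logic. The names `encode`, `decode`, `permute_bits`, and the example permutation `pi` are hypothetical.

```python
def to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def from_gray(code):
    """Inverse of to_gray."""
    value = 0
    while code:
        value ^= code
        code >>= 1
    return value


def permute_bits(word, pi):
    """Move bit i of word to position pi[i]."""
    out = 0
    for i, target in enumerate(pi):
        out |= ((word >> i) & 1) << target
    return out


def inverse(pi):
    """Inverse permutation: inv[pi[i]] = i."""
    inv = [0] * len(pi)
    for i, target in enumerate(pi):
        inv[target] = i
    return inv


def encode(address, pi):
    """Gray encoding followed by the line permutation."""
    return permute_bits(to_gray(address), pi)


def decode(word, pi):
    """Undo the permutation, then undo the Gray encoding."""
    return from_gray(permute_bits(word, inverse(pi)))
```

For instance, with the illustrative 4-line permutation `pi = [2, 0, 3, 1]`, `decode(encode(a, pi), pi)` returns `a` for every 4-bit address, which reflects the fact that the permutation stage is lossless and adds no redundancy.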
While this obviously reduces the absolute effectiveness of the PB code (streams encoded with Gray exhibit fewer transitions than the original traces), it helps in achieving a more sizable bus energy optimization. The results of Table 3 confirm our claim. In fact, cascading PB to Gray yields, on average, a total energy savings with respect to the unencoded stream of 46% (14% more than using Gray alone and 20% more than using PB alone).

Table 3: Energy Results for Combined Gray+PB.
General-Purpose vs. Custom PB Encoding
It was mentioned in Section 5.1 that the permutation used by PB for the experiments referred to the correlation matrix calculated as the average of all the matrices corresponding to the various benchmark programs. If, for each program, a specific permutation is computed starting from the corresponding correlation matrix, we obtain encoding and decoding functions which are optimal with respect to the considered program, and thus produce a more relevant minimization of the crosstalk energy. The drawback of this approach is that code profiling of each application is required, and a custom codec must be synthesized, thus making PB an application-specific code.
We have run an experiment to evaluate how much potential energy savings is lost when the general-purpose permutation network (computed as explained in Section 5.1) is used instead of a trace-specific encoder for each individual stream; Table 4 shows the results. We note that address streams have a significant degree of uniformity in their statistics: The energy savings using the encoder for the "average" case (column General PB) show only a marginal degradation (3.4% on average) with respect to the custom case (column Custom PB), where each stream would have a specific encoder. These results confirm the claim that PB can be considered a general-purpose code, whose applicability does not require knowledge of the address streams being transferred.
Boundary Capacitance Effect
In this section, we quantify the impact of the neighbor wires, as discussed in Section 4. Figure 5 reports the crosstalk energy savings (in %) averaged over all streams. On the x-axis, the value of $\rho$ (the relative distance of the neighbor wires) is reported. The horizontal line corresponds to the case of $\rho = \infty$ (i.e., the results of Table 4). From the curve, we note that for realistic values of $\rho$ (i.e., between 5 and 10) the reduction in achievable savings is relatively limited, and it is still above 20% for the case of $\rho = 3$.
Data Streams
We have applied the permutation-based encoding to a set of data traces representing various types of information, with the objective of analyzing the applicability of the code to data buses. The traces we used include ASCII files, binary files and multimedia data (music, images, etc.), and the results are summarized in Table 5 for the cases of both general and custom PB. From the results in column General PB we observe that the general-purpose version of PB is effective only for streams that exhibit explicit non-randomness (e.g., text files, binaries, and bitmaps), while it provides marginal savings for streams containing information in compressed format (e.g., images, sound files). The intuitive explanation for this is that compression tends to flatten the statistical distribution of the bits towards the random case. Due to these non-uniform statistical characteristics of the data streams, the custom version of the PB code is much more effective than the general-purpose one, as shown in column Custom PB of the table.
CONCLUSIONS
We have presented a permutation-based (PB) encoding scheme that explicitly targets the minimization of energy dissipated due to the switching of crosstalk capacitances. Code calculation does not require knowledge of the binary stream being transmitted on the bus, and its implementation does not require explicit encoding/decoding circuitry. The method is particularly effective for address buses: Savings obtained for different address traces generated by two different processors are on the order of 26% with respect to the unencoded streams. Our scheme is orthogonal with respect to conventional encodings that target the energy dissipated due to the switching of self capacitances. Therefore, it can be combined with such encodings (e.g., Gray), yielding average energy savings around 46%. Applicability to data buses is more dependent on the type of data that must be transferred; in this case, using an application-specific variant of the code is more appropriate.
