Abstract: A new approach based on genetic algorithms to reduce power consumption by communication buses in an embedded system is presented. This approach makes it possible to obtain the truth table of an encoder that minimises switching activity on a bus. This method is static, in the sense that the encoders are generated ad hoc for specific traffic. This is not, however, a limiting hypothesis if the application scenario considered is that of embedded systems. An embedded system, in fact, executes the same application throughout its lifetime and so it is possible to have detailed knowledge of the trace of the patterns transmitted on a bus following execution of a specific application. The approach is compared with the most effective ones already presented in literature, on both multiplexed and separate buses. The results obtained demonstrate the validity of the approach, which on average saves up to 50% of the transitions normally required, in addition to their practical applicability, even in an on-chip environment.
Introduction
Until a few years ago VLSI design was orientated mainly towards optimisation in terms of area and performance. The levels of integration and working frequencies were not so high as to cause problems of power consumption. Increasingly rapid progress in silicon process technologies now offers the possibility of integrating tens of millions of transistors operating at clock frequencies in the GHz range on a single silicon die. This has led to a huge increase in the amount of power dissipated per unit of area, thus making power optimisation as important an objective as area and performance [1].
The increase in the level of integration has given birth to the concept of a system-on-a-chip (SoC) and to a proliferation of portable battery-driven applications such as mobile phones, PDAs, digital cameras and laptops. The competitiveness of these products in the marketplace depends to a great extent on the functionality-to-weight ratio. Most of these portable applications run on batteries, which are also their heaviest components. Without power minimisation, these systems would need heavy batteries to ensure reasonable operating time between battery recharges. Unfortunately, the gap between improvements in battery energy density (watt-hour/pound) and in levels of integration is destined to grow even further: 30% improvements in energy density in 5 years as opposed to 2× improvements in semiconductors in 1.5 years [2].
Total power consumption continues to increase, despite the use of a lower supply voltage [2] (Fig. 1). Increased power consumption is caused by higher operating frequencies and the higher overall capacitance and resistance of larger chips with more on-chip functions. These are some of the reasons which have directed research towards the study and definition of low-power techniques and methodologies.
One of the greatest power consumers in a system is switching activity on the highly capacitive lines of an interconnection system. It is estimated that power dissipated on the I/O pads of an IC ranges from 10% to 80% of the total power dissipation with a typical value of 50% for circuits optimised for low power. Up to a few years ago these problems only affected switching on off-chip buses, but today they are becoming increasingly important for on-chip buses as well. The spread of IP-based design has caused the interconnection system to become the most important part of a SoC.
The interconnection structure supporting the architecture will be closer to a sophisticated network than to current bus-based solutions [3] . Networks-on-chips (NoCs) are the backbone through which the cores of a SoC communicate with each other. The most important physical trend, however, is the fact that on-chip wires are becoming much slower than logic gates as the on-chip devices shrink. Wire delays become dominant, forcing hardware to be more distributed [4] . In addition, the wire-to-gate capacitance ratio has gone from 3 in old technologies to 100 in new ones and is still on the increase [2], shifting the problem of power consumption from computing to communications. Various techniques have been proposed in the literature to reduce switching activity on the lines of a bus and thus the power dissipated by the bus. One of the most widely used consists of encoding data prior to transmission and decoding it at the destination. Unfortunately many techniques introduce extra lines along the bus or are so complex that the great power overhead for the encoder/decoder only allows off-chip use.
The approach proposed in this paper refers to embedded applications, i.e. ones in which it is possible to know in advance the trace of the patterns transmitted on a communication bus following execution of a specific application. The trace is used to generate a truth table for an encoder that will minimise switching activity on the bus, which can thus be synthesised using any automatic logical synthesis tool. As the problem of searching for the encoder that will minimise switching activity can be viewed as an optimisation problem, the exploration strategy adopted uses a heuristic method based on genetic algorithms (GAs).
Furthermore, a simple compression algorithm is proposed to reduce the considerable size of the trace files (used as input in static approaches). Using the compressed memory reference trace file, processing times are reduced drastically with the same efficiency.
The results obtained on a set of benchmarks confirm the validity of the approach: the saving in terms of transitions is greater than that obtained by the most efficient techniques so far proposed in the literature.
The rest of the paper is organised as follows. In Section 2 we discuss the various bus encoding schemes proposed in the literature. In Section 3 we present GEG, an algorithm for generating the truth table for the encoder/decoder so as to minimise switching activity on a bus. In Section 4 we present the results obtained and compare them with other techniques proposed in the literature. Finally, in Section 5 we draw our conclusions and discuss possible future developments.
Previous work
Switching activity in a communication bus can be reduced by suitable encoding of the binary patterns before they are transmitted. Bus encoding strategies can be grouped into two categories: static and dynamic. Static techniques are based on a priori knowledge of the stream of patterns that will travel on the bus to generate an ad hoc encoder which will minimise the switching activity for that stream. Obviously, these techniques are highly application-dependent, in the sense that their applicability is limited to cases in which it is possible to have detailed information about the traffic on the buses. They are therefore suitable for application in embedded systems, as the latter execute a specific application throughout their lifetime and so it is possible to characterise the traffic on the buses with a high degree of accuracy.
The Beach solution proposed by Benini et al. [5] , analyses the correlation between blocks of bits in consecutive patterns to group the bus lines in clusters according to their correlations. For each cluster, the Beach approach automatically generates encoding schemes which minimise the average bus switching activity. In [6] the same authors present an algorithm which uses statistical information at the word level to generate an encoding table that minimises transition activity.
The approach proposed by Henkel et al. [7] is based on the observation that, owing to the effects of coupling capacity (which is of great relevance in more advanced technologies), the most internal lines of a bus have a greater capacity. For this reason, the encoding strategy analyses the reference trace to find a static permutation of bits that will reduce the activity on the lines with the greatest capacity. That is, the bits featuring greater switching activity are allocated on the outer lines of the bus (i.e. those with less capacity), while the bits with less switching activity are allocated to the innermost (greater capacity) lines.
Dynamic techniques are not based on advance knowledge of the stream of patterns that will travel on the bus, but encode the current pattern on the basis of the pattern sent at the previous instant. For this reason, advance knowledge of the stream of patterns is not required: encoding decisions are made on the sole basis of past history.
The bus-invert technique [8] proposed by Stan and Burleson for data buses and its derivatives [9, 10] invert a word to be transferred if the Hamming distance between it and the word previously sent is greater than the size of the bus divided by two; otherwise, no encoding is applied to the word. In this way the maximum number of lines that switch will be limited to half the size of the bus. Of course, an additional signalling line is needed to signal the type of encoding applied (inversion or no encoding). For buses with uniformly distributed data, the expected value analysis in [11] shows benefits of around 10% for 32-bit buses. To make the bus-invert method more effective, the bus can be partitioned into a handful of bit-level groups and a bus invert can be separately applied to each of these groups. However, this scheme will increase the number of surplus bits required for the encoding, which is absolutely undesirable.
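As an illustration of the bus-invert rule just described, the following is a minimal Python sketch, not the implementation of [8]. The function names and the assumption that the data lines start at zero are ours. The current word is compared with the value last driven on the data lines, and the invert flag plays the role of the additional signalling line.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_flag) pairs for a stream of words."""
    mask = (1 << width) - 1
    prev_bus = 0  # assumption: the data lines start at all zeros
    encoded = []
    for word in words:
        # Invert when more than half of the data lines would switch.
        if 2 * hamming(word, prev_bus) > width:
            bus, invert = ~word & mask, 1
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]
```

With this rule no more than half of the data lines switch between consecutive transfers; transitions on the invert line itself have to be counted separately.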
Most of the bus encoding strategies proposed in the literature were envisaged for address buses. They all exploit the strong time correlation between one pattern and the next that is typically observed in an address stream. The streams on an address bus are, in fact, typically bursts of in-sequence addresses interrupted by words that start another sequence or repeat one that has just concluded (for example owing to taken branches or jumps). The high frequency of consecutive addresses, for instance, can be exploited by using Gray rather than natural binary encoding [12] . In this way the number of transitions for consecutive patterns will be limited to 1.
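The single-transition property of Gray encoding on in-sequence addresses can be checked with a short Python sketch. It uses the standard reflected binary Gray code; the function names are illustrative and are not taken from [12].

```python
def binary_to_gray(n: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def transitions(a: int, b: int) -> int:
    """Number of bus lines that switch when b follows a."""
    return bin(a ^ b).count("1")
```

For example, the natural binary codes of 7 and 8 differ in four bits, whereas their Gray codes differ in one.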
If additional signalling lines are introduced, even better results can be obtained by using T0 encoding [13] or one of the several variations on it [14] . The basic idea is to freeze the bus if the address to be sent is consecutive to the one sent previously (the receiver will generate the address locally). These approaches are less effective when applied to a multiplexed instruction and data address bus (because in this case the percentage of in-sequence addresses decreases). The requirement for an additional redundant control line is eliminated in [15] by observing that any new address transmission is sufficient in recognising that the address incrementation mode is no longer in effect. The slight problem of signalling the address that originally started the incrementation mode is resolved by transmitting the actual value of the incrementation instead. Working-zone encoding [16] , as proposed by Mussoll et al., solves this problem by observing that programs generally favour a few working zones of their address space at each instant. In this case the method identifies these zones and uses the bus to transfer only the offset of references with respect to the previous reference to the same zone, together with a reference identifying the zone.
In [17] , Mamidipaka et al. proposed an encoding technique based on the notion of self-organising lists. They use self-organising lists in order to achieve an optimal encoding for the most frequently accessed addresses. The list is reorganised in every clock cycle to map the most frequently used addresses to codes with fewer ones. The size of the list in this method has a significant impact on the performance. To achieve satisfactory results, it is necessary to use a long list. However, the large hardware overhead associated with maintaining long lists makes this technique quite expensive. Furthermore, the encoder and the decoder hardware are practically complex and their power consumption appears to be quite large.
Our proposal
Let us consider a binary alphabet used to compose words with a fixed length of $w$ bits. Let $U(w)$ be the universe of discourse for words of $w$ bits (i.e. the set of words it is possible to form with $w$ bits). The cardinality of $U(w)$ is therefore $2^w$. An encoder associates each word in $U(w)$ with one and only one word in $U(w)$ in such a way that there is only one output coding for each input, thus making the decoder able to decode the word unambiguously. As no redundancy is being considered, an encoder is a permutation of $U(w)$, and it is easy to calculate that $(2^w)!$ different encoders are possible. Once the reference stream has been fixed, the ensuing number of transitions on the bus depends on the encoder used. The aim is therefore to find the optimal encoder that will minimise the number of transitions on a bus for a specific reference stream. Of course, as the space of possible encoders grows with the factorial of $2^w$, exploration based on an exhaustive technique would be unfeasible. Designing an encoder that will minimise switching activity on a bus can be seen as a problem of optimisation and dealt with using design space exploration techniques.
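Since a non-redundant encoder is a one-to-one mapping of the $2^w$ words onto themselves, counting encoders amounts to counting permutations of $2^w$ elements. The small Python sketch below (the helper name is ours) shows how quickly this number grows and why exhaustive exploration is ruled out even for narrow buses.

```python
import math


def encoder_count(w: int) -> int:
    """Number of one-to-one encoders for w-bit words: (2**w)!."""
    return math.factorial(2 ** w)


for w in (2, 3, 4):
    print(w, encoder_count(w))
```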
In general, when the space of configurations is too large to be explored exhaustively, one solution is to use evolutionary techniques. Genetic algorithms have been used in several VLSI design fields [18]: in problems relating to layout such as partitioning [19], placement [20] and routing [21]; in design problems including power estimation [22], technology mapping [23] and netlist partitioning [24]; and in reliable chip testing through efficient test vector generation [25]. All these problems are intractable in the sense that no known polynomial time algorithm can guarantee an optimal solution; they belong to the NP-complete and NP-hard categories.
In this Section we will present our approach, which we will indicate as genetic encoder generator (GEG) for generating an encoder that will minimise switching activity on a communication bus. It is static, in the sense that the encoder is generated ad hoc on an address stream taken as input. Figure 2 shows the design flow called genetic encoder generator (GEG).
Genetic encoder generator (GEG)
The starting point is the specific application being executed (e.g. simulated), to obtain a memory reference trace file which will be the address stream used to generate the encoder. In order to facilitate generation of the encoder, the stream is compressed, as will be explained in the following subsections. Initially a population of encoders is initialised with random encoders and evaluated on the compressed stream. Each encoder in the population has an associated fitness value which represents a measure of its capacity to reduce the number of transitions on the bus. Encoders with higher fitness values are therefore those which determine a lower number of transitions on the bus when stimulated with the compressed stream. The classical genetic operators, suitably redefined for this specific context, are applied to the population and the cycle is repeated until a stop criterion is met. At the end of the process, the individual with the highest fitness value is extracted from the population. This individual will be the optimal encoder being sought. As will be seen later on, the encoder is expressed in the form of a truth table. The last step in the flow is therefore logical synthesis of the optimal encoder, which can be done using any automatic logical synthesis tool. To obtain the decoder it will, of course, be sufficient to exchange the encoder input and output columns and perform the synthesis.
Compression of the reference stream
The memory reference trace file produced following execution of an application typically comprises hundreds of millions of references. It is therefore advisable to compress the stream so as to obtain a stream with an upper bound on the number of patterns. If $S$ is the initial stream and $S^*$ is the compressed one, the optimal encoder obtained using $S^*$ has to be the same as the one that would have been obtained if we had used $S$ as the input to the encoder design flow. The compression is therefore lossless for encoder generation purposes.
Rather than compression, the technique used is based on a different representation of the reference stream. Let us consider a bus with a width of $w$. A reference stream is a sequence of patterns. Each pattern is an address of $w$ bits. A compressed stream is also a sequence of patterns. A generic pattern is a 3-tuple $\langle r_i, r_j, n_{ij} \rangle$ with $i, j = 0, 1, 2, \ldots, 2^w - 1$ and $i > j$ that specifies the number of occurrences $n_{ij}$ in which the references $r_i$ and $r_j$ are consecutive in $S$. The meaning of the condition $i > j$ can be explained by observing that, for our purposes, it is only necessary to know what the consecutive addresses are and not their order. If inverted, in fact, the number of transitions does not change. Using this transformation, the maximum number of patterns in $S^*$ will be $2^w(2^w - 1)/2$, i.e. the number of unordered pairs of distinct references.
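The pair-count representation can be sketched in a few lines of Python. This is an illustration under our own naming: references are plain integers, and pairs of identical consecutive references are dropped because they cause no transitions.

```python
from collections import Counter


def compress_stream(stream):
    """Map each unordered pair of distinct consecutive references to its count."""
    pairs = Counter()
    for prev, curr in zip(stream, stream[1:]):
        if prev != curr:  # identical neighbours produce no transitions
            pairs[(max(prev, curr), min(prev, curr))] += 1
    return pairs


def total_transitions(pairs):
    """Transitions on an unencoded bus, computed from the compressed stream."""
    return sum(n * bin(a ^ b).count("1") for (a, b), n in pairs.items())
```

The total number of transitions computed from the compressed stream equals the one computed on the raw stream, which is why the representation loses nothing for the purposes of encoder generation.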
GA-based bus encoding
The approach we propose uses genetic algorithms as the optimisation tool. Application of GAs to an optimisation problem requires definition of the following three attributes: the chromosome, the fitness function, and the genetic operators.
The chromosome is a representation of the format of the solution to the problem being investigated. In our case it is a representation of an encoder. The representation we chose consists of encoding the truth table of an encoder. In this way the chromosome will be made up of as many genes as there are rows in the truth table of an encoder. The gene in position $i$ represents encoding of the word $i$. That is, for an encoder of $w$ bits, we will have $2^w$ genes. The $i$th gene will represent encoding of the binary word that encodes $i$ with $w$ bits.
For reasons that will be explained when we deal with the definition of the genetic operators, the chromosome was enriched with further information. The chromosome can be represented as a table with $2^w$ rows and 2 columns. Each row corresponds to a gene. Once the generic row $i$ is fixed, the first column represents the encoding of $i$, while the second gives the position of the gene whose encoding is $i$ (Fig. 3).
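A minimal Python sketch of this two-column chromosome follows. The two columns are held as two lists, `enc` and `pos`; these names and the random initialisation are our assumptions, not details taken from the paper.

```python
import random


def random_chromosome(w, rng=random):
    """Return (enc, pos) for a random w-bit encoder.

    enc[i] is the code assigned to word i (first column);
    pos[c] is the word whose code is c (second column).
    """
    size = 1 << w
    enc = list(range(size))
    rng.shuffle(enc)  # a random permutation is always a valid encoder
    pos = [0] * size
    for word, code in enumerate(enc):
        pos[code] = word
    return enc, pos
```

The second column is the inverse permutation of the first, so it doubles as the truth table of the decoder.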
The fitness function measures the fitness of an individual member of the population. In our case the individual is represented by an encoder, so the fitness function assigns each encoder a numerical value that measures its capacity to reduce switching activity on a bus. Naturally, the fitness function will depend not only on the encoder but also on the reference stream the encoder is stimulated by.
If $E$ indicates the chromosome that maps the encoder (as in Fig. 3), the fitness function $f(E, S^*)$ defined by (1) returns the number of transitions saved when the encoder $E$ is used for the compressed stream $S^*$. It is applied to each individual in the population (i.e. to each encoder). The aim of the GA is thus to make the population evolve so as to obtain individuals with increasingly higher fitness values.
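One formulation consistent with this description is sketched below in Python. It is our reading, not the paper's equation (1): for each pattern of the compressed stream, the Hamming distance between the two encoded references is subtracted from that between the original ones, and the difference is weighted by the number of occurrences. The compressed stream is assumed to be a dictionary mapping a pair of references to its count.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def fitness(enc, compressed):
    """Transitions saved by encoder enc on a compressed stream.

    enc[i] is the code of word i; compressed maps (r_i, r_j) -> n_ij.
    """
    saved = 0
    for (ri, rj), n in compressed.items():
        saved += n * (hamming(ri, rj) - hamming(enc[ri], enc[rj]))
    return saved
```

Under this reading the identity encoder has fitness 0, and an encoder that increases switching activity has negative fitness.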
The encoders are ordered by decreasing fitness values. The individual with the highest fitness value will always be inserted into the new population (elitism). The encoders making up the population are selected with a probability directly proportional to their fitness value. With a user-defined probability the genetic operators are applied to them and they are inserted into the new population. This selection process is repeated until the new population reaches the size established by the user.
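The selection step can be sketched in Python as follows. Several details are assumptions of ours: the genetic operators are passed in as a single hypothetical callable `vary`, and fitness values are shifted so that all selection weights are positive, since the text does not say how non-positive fitness values are treated.

```python
import random


def next_generation(population, scores, size, op_prob, vary, rng=random):
    """Build a new population of `size` individuals.

    vary(individual) stands for the genetic operators (hypothetical
    callable supplied by the caller); it must return a valid encoder.
    """
    ranked = sorted(zip(scores, population), key=lambda sp: sp[0], reverse=True)
    new_population = [ranked[0][1]]  # elitism: the best individual always survives
    # Assumption: shift the scores so that every selection weight is positive.
    lowest = min(scores)
    weights = [s - lowest + 1 for s in scores]
    while len(new_population) < size:
        parent = rng.choices(population, weights=weights, k=1)[0]
        child = vary(parent) if rng.random() < op_prob else parent
        new_population.append(child)
    return new_population
```

If individuals are mutable (for example lists), `vary` should work on a copy so that the elite individual is not modified in place.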
The genetic operators were appropriately redefined so as to guarantee that application to an encoder (in the case of mutation and permutation) or a pair of encoders (in the case of cross-over) always gives rise to an encoder.
Before discussing the genetic operators in greater detail, it is necessary to define a support function that we will call Update(...), which updates the coding of a word while maintaining consistency at the end of the decoding phase. Mutation is a unary operator that is applied with a certain probability (which we will call mutation probability) to an encoder. Application of the mutation operator to an encoder consists of varying the coding of a word with a probability equal to the mutation probability.
Mutate(EncDec E, double probability)
begin
  for (i = 0 to E.size() - 1) do
    if (Event(probability)) then
      new_enc = randomInt(0, E.size() - 1);
      Update(E, i, new_enc);
    end if
  end for
end
where the function Event(p) returns true with a probability of p, and the function randomInt(m, M) returns a random integer ranging between m and M.
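The following Python sketch shows one way the support function and the mutation operator could work together. The swap-based `update` is our assumption about how consistency is maintained (the word that previously held the new code receives the old code of the mutated word); the two lists `enc` and `pos` stand for the two columns of the chromosome.

```python
import random


def update(enc, pos, word, new_code):
    """Assign new_code to word, swapping codes so enc stays a permutation.

    Assumption: consistency is kept by giving the old code of `word`
    to the word that previously held new_code.
    """
    old_code = enc[word]
    other = pos[new_code]  # word currently encoded as new_code
    enc[word], enc[other] = new_code, old_code
    pos[new_code], pos[old_code] = word, other


def mutate(enc, pos, probability, rng=random):
    """Re-code each word with the given probability, in place."""
    size = len(enc)
    for word in range(size):
        if rng.random() < probability:
            update(enc, pos, word, rng.randint(0, size - 1))
```

Because every change goes through `update`, the result of a mutation is always a one-to-one encoder whatever random codes are drawn.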
Cross-over:
Cross-over is a binary operator that is applied to two elements of the population with a certain probability that we will call cross-over probability. Given two encoders $E_1$ and $E_2$, and having chosen two random indexes $i$ and $j$ where $i < j$, the coding of the words $i, i+1, \ldots, j-1, j$ is exchanged between $E_1$ and $E_2$.
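A possible realisation of this exchange is sketched below in Python. Each encoder is held as a single list of codes, and every assignment is routed through a swap-based helper so that both results remain one-to-one. The helper `set_code` and the snapshot of the two segments are our assumptions, not the paper's definition.

```python
def set_code(enc, word, new_code):
    """Assign new_code to word, swapping with the word that held it."""
    other = enc.index(new_code)
    enc[word], enc[other] = new_code, enc[word]


def crossover(e1, e2, i, j):
    """Exchange the codes of words i..j (inclusive) between e1 and e2, in place."""
    seg1, seg2 = e1[i:j + 1], e2[i:j + 1]  # snapshot before modifying anything
    for offset in range(j - i + 1):
        set_code(e1, i + offset, seg2[offset])
        set_code(e2, i + offset, seg1[offset])
```

After the call, words i to j of each encoder carry the codes that the other encoder had for them, while the codes displaced by the swaps are redistributed among the remaining words.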
Experiments
In this Section we will present the results obtained by applying our approach, comparing them with the most effective approaches proposed in the literature. The application scenario referred to is encoding of the addresses transmitted on a 32-bit bus and generated by a processor during execution of a specific application. Two cases are considered: (i) the bus is multiplexed, (ii) it is a dedicated bus. In the former case the addresses travelling on the bus refer to both fetching instructions and accesses generated by load/store instructions. In the second case, we considered both the address bus connecting the processor with the instructions memory and the address bus connecting the processor with the data memory.
The 32-bit bus is partitioned into clusters containing the same number of bits and the approach was applied to each cluster separately. It is, in fact, computationally unfeasible to apply the approach to the whole bus, given that the data structure used would require tables of $2^{32}$ rows to be handled. The cases studied referred to clusters of 4 and 8 bits. The bus lines were grouped sequentially in a cluster. With $c$ clusters, for example, each will include $w = 32/c$ lines. The $i$th cluster will contain the lines $i \times c$,
It is of course possible to cluster the bus lines in different ways, for example by allocating lines with a higher correlation to the same cluster. Table 1 gives an example of the saving on transitions according to the ways in which the lines are clustered. The data refer to the mm and dct applications with 10 different randomly generated clusterings. As can be seen, different clusterings can cause the goodness of the encoders to vary by 6 -8%, but the results obtained with sequential clustering are satisfactory. Sequential clustering of the lines of an address bus, in fact, corresponds to clustering according to the degree of correlation of the lines.
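Sequential clustering can be illustrated with a short Python sketch that splits each address into fields of adjacent lines and produces one narrower reference stream per cluster. The function names and the convention that cluster 0 holds the least significant lines are our assumptions.

```python
def split_into_clusters(address, bus_width=32, cluster_width=8):
    """Return the cluster values of an address, least significant cluster first."""
    mask = (1 << cluster_width) - 1
    return [(address >> shift) & mask
            for shift in range(0, bus_width, cluster_width)]


def cluster_streams(stream, bus_width=32, cluster_width=8):
    """Turn one address stream into one narrower stream per cluster."""
    per_address = [split_into_clusters(a, bus_width, cluster_width) for a in stream]
    return [list(column) for column in zip(*per_address)]
```

Each per-cluster stream can then be compressed and handed to the encoder generator independently of the others.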
We considered the same reference traces as are used in [5] , generated following the execution of specific applications in the field of image processing, automotive control, DSP etc. More specifically, dashb implements a car dashboard controller, dct is a discrete cosine transform, fft is a fast Fourier transform and mat_mul a matrix multiplication. The other applications come from the Motorola Powerstone benchmark suite [26] , which contains a collection of embedded and portable applications, including paging, automotive control, signal processing, imaging and fax applications.
In all the experiments that will be discussed in the following subsections, we considered a population of 10 individuals, a mutation probability of 50%, and a cross-over probability of 25%. These values are set following an extended tuning phase. The convergence times and accuracy of the results were evaluated with various cross-over and mutation probabilities, and it was observed that the performance of the algorithm with the various benchmarks was very similar. This makes it reasonable to assume that the GA parameter tuning phase only needs to be performed once (possibly on a significant set of applications).
The stop criterion used consists of blocking iterations when no appreciable improvements in the fitness value of the best encoder in the population have been observed for a certain number of generations. By way of example, Fig. 4 shows the percent fitness values for varying numbers of generations. The Figure refers to the application fft on 8-bit clusters. The approach was only applied to the first three clusters containing the least significant lines, as the lines in the last cluster (lines 24 -31) do not feature switching activity. As can be observed, convergence is correlated to the actual size of the stream. For cluster 0 (with a stream of 837 patterns) convergence is reached after about 15,000 generations. For cluster 1 (with a stream of 123 patterns) it is achieved after only 6,000 generations. Finally, for cluster 3, whose compressed stream contains only three patterns, convergence is immediate. Before discussing the experiments it is necessary to introduce a further compression phase to make application of the algorithm computationally feasible.
Further compression of the reference stream
As stated previously, the technique used to compress the trace file is loss-less because it allows the size of the trace to be limited to a certain number of lines without affecting the results obtained by the encoder/decoder generation algorithm. There are, however, situations in which this compression is not sufficient and the analysis of a trace may not be limited to a single run of the application. In some cases, that is, it is indispensable to run the application several times with different input data to observe its behaviour when the initial conditions vary. Different initial conditions may, in fact, correspond to huge variations in terms of memory reference statistics. This may affect the performance of the algorithm, which may be optimal for some conditions and poor for others.
As an example, let us analyse the results of stream compression using the compress benchmark. The idea is to execute compress with different input datasets, obtaining a trace for each of them which will be chained in a single trace. Then GEG is applied to this trace. Table 2 shows the size of the traces, measured as the number of references, organised according to the structure described in Section 3.2, for the first three 8-bit clusters. As can be seen, the size of these traces may be so great as to make application of the algorithm computationally onerous. This makes it necessary to use a further compression technique to reduce the trace size even more.
This further compression technique entails loss, in the sense that it reduces the size of the trace (thus improving on the time required to generate the encoder/decoder) but introduces an error in the evaluation of the number of transitions saved. Table 3 gives the variation in the size of the trace and the error made following this further compression on the compress application, using various threshold values. Choice of the threshold must therefore take into account the trade-off between the saving achieved on the size of the trace file and the maximum error that can be made in evaluating the number of transitions saved thanks to encoding. For example, with reference to Table 3, choosing a threshold value of 5 guarantees a reduction of about 48% in the size of the trace, with a maximum tolerable error of 2.8%.
Address bus (fetch + load/store)
In this subsection we will comment on the results obtained when encoding is applied to a multiplexed address bus (i.e. one on which addresses generated by both fetch and load/store instructions are travelling). Table 4 summarises the results obtained. The first column (bench) identifies the benchmark. The second (trans) gives the total number of transitions on the bus when no encoding scheme is applied. The remaining columns (in groups of two) give the number of transitions for each approach (trans) and the percent saving in transitions as compared with the case in which no encoding scheme is applied (saving). GEG8 and GEG4 represent the same implementation of the approach GEG applied to buses partitioned into clusters of 8 and 4 bits respectively. Beach is the approach proposed in [5]. Others indicates the best result obtained by the encoding schemes Gray [12], T0 [13], Bus-invert [8], T0 + Bus-invert, DualT0 and DualT0 + Bus-invert [14]. As can be seen, GEG4 is on average equivalent to Beach. Increasing the size of the clusters to 8 bits increases the saving by about 13% as it is possible to exploit the temporal correlation between the references more fully.
Address bus (load and store addresses)
Let us now analyse the performance of the various approaches when they are used to encode the addresses generated by load and store instructions. From Table 5 it can be seen that in this case the gap between the performance of the approach proposed here and those mainly based on exploration of locality (e.g. T0 and Gray) becomes greater, with a difference of almost 30%. Here again GEG4 and Beach exhibit similar performance, whereas GEG8 is more than 15% better.
Address bus (fetch only)
Finally, let us analyse the behaviour of the various approaches when encoding is only applied to the stream of addresses generated by fetch instructions. Here, unlike the cases analysed previously, the percentage of addresses in sequence increases considerably. As can be seen from the in-seq column of Table 6, the percentage of addresses in sequence may in some cases be more than 95% (e.g. with the des benchmark), making the approaches based on exploitation of this feature perform better than the others. In this case T0 achieves much better savings than the other approaches by exploiting the high percentage of addresses in sequence which do not determine any transitions on the bus. GEG8 maintains its efficiency with average savings of over 45%, comparable to those obtained by Gray.
The efficiency of T0 can be further enhanced by using a hybrid approach which uses T0 encoding when it observes a high percentage of addresses in sequence; otherwise it uses encoding based on GEG. Figure 5 shows a scheme of how this can be achieved. The pattern to be transmitted is encoded with both T0 and GEG. If it is in sequence with the previous one, the T0 encoding is transmitted; otherwise the GEG encoding is transmitted. Even though GEG + T0 is extremely efficient at reducing the amount of power dissipated on the bus (more than 88% in average saving as shown in Table 6), in calculating the saving, account has to be taken of the overhead owing to power consumption by the encoding/decoding logic. In GEG + T0, in fact, this contribution is certainly greater than that of both T0 and GEG, as it contains them both, and both are active at the same time. Another point against GEG + T0 is that it inherits from T0 the use of a signalling line that is not present in GEG.
Table 7 gives the area, delay and power characteristics of the encoders and decoders generated by GEG for 8-bit clusters and the benchmarks described previously. The results were obtained using Synopsys Design Compiler for the synthesis, and Synopsys Design Power for the power estimation. The circuits were mapped onto a 0.18 µm, 1.8 V gate library from ST Microelectronics. The clock was set to a conservative frequency of 100 MHz (i.e. a period of 10 ns). The average delay introduced by the encoder/decoder is, in fact, shorter than 4 ns for GEG and so less than 40% of the clock cycle is dedicated to encoding and decoding information.
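The GEG + T0 selection rule described in this subsection (T0 when the address is in sequence with the previous one, GEG otherwise) can be sketched in Python as follows. This is a toy model under our own assumptions: a single cluster whose encoder table `enc` is indexed directly by the address, a fixed `stride` between in-sequence addresses, a bus that starts at zero, and a signalling flag `inc` standing for the extra T0 line.

```python
def hybrid_encode(addresses, enc, stride=1):
    """Return (bus_value, inc) pairs; inc = 1 means 'previous address + stride'."""
    prev_addr = None
    bus = 0  # assumption: initial state of the bus lines
    out = []
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            inc = 1  # T0 case: the bus lines stay frozen
        else:
            inc, bus = 0, enc[addr]  # GEG case: send the table code
        out.append((bus, inc))
        prev_addr = addr
    return out


def hybrid_decode(encoded, dec, stride=1):
    """Rebuild the addresses; dec is the inverse table of enc."""
    addresses = []
    for bus, inc in encoded:
        # The first transfer always has inc == 0, so addresses[-1] exists here.
        addresses.append(addresses[-1] + stride if inc else dec[bus])
    return addresses
```

In this model, in-sequence addresses cause no transitions on the data lines, and the only extra activity they can generate is on the signalling line.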
Overall power analysis
An encoding scheme is advantageous when the power saved on the bus (owing to less activity) is greater than the power consumed by the encoding and decoding blocks. The power consumed by the bus can generally be expressed as

$$P_B = \frac{1}{2} V_{dd}^2 \, \alpha \, f \, C_l$$

where $V_{dd}$ is the supply voltage, $\alpha$ is the switching activity (i.e. the ratio between the total number of transitions on the bus and the number of patterns transmitted), $f$ is the clock frequency and $C_l$ is the capacity of a bus line (assuming that all the lines have the same capacity). The overall percentage of power saved when an encoding scheme is used, as compared with when no encoding is used, can be calculated as follows

$$P_{sav} = 100 \times \frac{P_{woe} - P_{we}}{P_{woe}}$$

where $P_{woe}$ is the power consumed when no encoding strategy is used (which therefore corresponds to $P_B$) and $P_{we}$ is the power consumed when an encoding strategy is used (i.e. the sum of the power consumed by the encoder $P_E$, the decoder $P_D$ and the bus $P_B$). Solving the inequality $P_{sav} > 0$ as a function of $C_l$, we find the minimum bus line capacity with which there is a positive net saving in power

$$C_l > \frac{2 (P_E + P_D)}{V_{dd}^2 \, f \, (\alpha_{woe} - \alpha_{we})} \qquad (2)$$

Table 8 summarises the minimum capacity a bus line has to have for the approach to be effective for each benchmark. It is not a large value even for an on-chip bus line. In fact, the capacity of a wire of about 1 cm in a 0.25 µm technology is 5 pF. We can therefore conclude that the techniques proposed can effectively be used even with on-chip buses.
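The break-even condition can be turned into a small Python calculation. The sketch assumes the usual dynamic power model with a factor of one half, $P_B = \frac{1}{2} V_{dd}^2 \alpha f C_l$, which gives $C_l > 2(P_E + P_D) / (V_{dd}^2 f (\alpha_{woe} - \alpha_{we}))$. The function name and the numerical values in the usage note are hypothetical and are not taken from Table 7 or Table 8.

```python
def min_line_capacitance(p_enc, p_dec, vdd, freq, alpha_woe, alpha_we):
    """Smallest line capacitance (farads) for a net power saving.

    Assumes P_B = 0.5 * vdd**2 * alpha * freq * C_l (factor 1/2 is an
    assumption of this sketch). Powers in watts, vdd in volts, freq in hertz.
    """
    if alpha_we >= alpha_woe:
        raise ValueError("encoding must reduce switching activity")
    return 2.0 * (p_enc + p_dec) / (vdd ** 2 * freq * (alpha_woe - alpha_we))


# Hypothetical figures: 50 uW each for encoder and decoder, 1.8 V, 100 MHz,
# switching activity reduced from 4 to 2 transitions per pattern.
c_min = min_line_capacitance(50e-6, 50e-6, 1.8, 100e6, 4.0, 2.0)
print(f"{c_min * 1e12:.2f} pF")
```

With these made-up figures the threshold is about 0.31 pF; the larger the reduction in switching activity, the smaller the line capacitance at which encoding starts to pay off.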
Conclusions
In this paper we have presented a new GA-based strategy for designing an encoder that will minimise switching activity on a bus. This method, called genetic encoder generator (GEG), draws up a truth table for the encoder that will minimise switching activity on the communication buses in an embedded system. The results obtained on a set of specific applications for embedded systems have demonstrated the superiority of our approach, with savings of around 50% on multiplexed address buses (instructions/data), more than 55% on data address buses and close to 45% on instruction address buses. In the latter case the T0 scheme [13] performs better than the approaches proposed here, with average savings of 78%. A mixed technique GEG + T0 (in which GEG and T0 work concurrently) further enhances the efficiency of T0, achieving average savings of 88%. Finally, the low level of complexity of the encoder and decoder obtained makes it possible to use them even in an on-chip environment.
References

