Abstract Data, addresses, and 
Introduction
There are many microprocessor applications, typically battery-powered embedded applications, where energy consumption is the most critical design constraint. In these applications, where performance is less of a concern, relatively simple RISC-like pipelines are often used [8] [10]. A variety of circuit and microarchitecture techniques are employed to conserve energy when the processor is operating, and power-down "sleep" modes are invoked when the processor is not in use. In current CMOS technology, most energy consumption occurs when transistor switching or memory access activity takes place [3]. Therefore, in this paper we focus on reducing dynamic energy consumption. Dynamic energy consumption is proportional to the switching activity, as well as the load capacitance and the square of the supply voltage. Thus, an important energy conservation technique is to reduce switching activity by "gating off" portions of logic and memory that are not being used.
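The proportionality stated above can be written compactly. The symbols are notation assumed here for illustration only: \alpha for the switching activity, C_L for the load capacitance, and V_{dd} for the supply voltage.

```latex
E_{\mathrm{dyn}} \propto \alpha \, C_L \, V_{dd}^{2}
```

With C_L and V_{dd} held fixed, gating off unused logic and memory lowers \alpha, and the dynamic energy falls in direct proportion.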
Recently [1] it was proposed that rather than basing logic gating [...] and instructions residing in the caches, registers, and functional units. In Fig. 1, the extension bits are shown along the bottom of a basic pipeline. These bits correspond to portions of the datapath, and they flow through the pipeline to gate off unneeded energy-consuming activity at each stage, including pipeline latching activity. New extension bit values are generated only when a cache line is filled from main memory (although they could also be maintained in memory) and when new data values are produced by the ALU. The points where extension bits are generated are indicated in Fig. 1 by circled "G"s.
For the instruction caches, extension bits allow a simple form of compression targeted at reducing instruction fetch activity, rather than reducing the number of bits in the program's footprint. For other datapath elements, they enable a form of compression where memory structures actively load and store only useful (significant) operand bytes. For arithmetic and logical operations, the extension bits enable operand gating techniques similar to those proposed in [1].
Given that only significant bytes require datapath operations and storage, pipeline hardware can be simplified by using byte-serial implementations, where the datapath width may be as narrow as one byte, and a pipeline stage is used repeatedly for the required number of significant bytes. Although there are many alternative implementations with different degrees of parallelism, they all have some serialization in the pipeline. In particular, low-order byte(s) and extension bits are first accessed and/or operated on; then additional bytes may be accessed and/or operated on if necessary. We describe and evaluate several pipeline implementations of this type.
The paper is organized as follows. Section 2 presents several techniques to reduce the activity at each stage of the pipeline. The experimental framework is described in Section 3. Sections 4, 5, and 6 present implementations with differing levels of complexity and performance. Finally, Section 7 contains a summary and conclusions.
2. Techniques for Reducing Activity Levels
In this section, we develop methods for reducing memory and [...] we first undertake a trace-driven study to determine the required activity for each of the major pipeline operations. Then, in later sections, we propose and study pipelined implementations that come close to achieving the minimum "required" activity levels. This work is based on a simple 5-stage pipeline with in-order issue, as is often used for low power embedded applications. We consider the 32-bit MIPS instruction set architecture (ISA) and focus on integer instructions and benchmarks, which are commonly used in the low power domain.
2.1. Data Representation
The basic technique for representing data is to tightly compress data bits that do not hold significant data. For example, a small two's complement integer has only a few numerically significant low-order bits and a number of numerically insignificant higher order bits (all zeros or all ones). [...] set at address 10 00 00 00, thus a variable may be located at address 10 00 00 09. To handle these cases, we propose a scheme with three extension bits (approx. 9% overhead). In this scheme, the extension bits apply on a per-byte basis. Each extension bit corresponds to one of the upper three data bytes (as before, the least significant byte is always fully represented). If an extension bit is set to one, it indicates that the corresponding byte is a sign extension of the previous byte; if the extension bit is zero, it indicates that the corresponding byte is significant. Consequently, the earlier example 10 00 00 09 is represented as 10 - - 09 : 011.
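The per-byte scheme can be sketched in a few lines of Python. This is a minimal illustration rather than the hardware design: the function names `compress` and `expand`, the dictionary used for the stored bytes, and the string form of the extension bits (written for bytes 3, 2, 1 from left to right, matching the notation above) are assumptions made for the example.

```python
def compress(word):
    """Return (significant_bytes, ext_bits) for a 32-bit word.

    significant_bytes maps a byte position (0 = least significant) to its value.
    ext_bits is a 3-character string for bytes 3, 2, 1 (left to right);
    '1' marks a byte that is a sign extension of the byte below it.
    """
    data = [(word >> (8 * i)) & 0xFF for i in range(4)]
    significant = {0: data[0]}  # the low-order byte is always stored
    bits = []
    for i in (1, 2, 3):
        fill = 0xFF if data[i - 1] & 0x80 else 0x00
        if data[i] == fill:
            bits.append("1")
        else:
            bits.append("0")
            significant[i] = data[i]
    return significant, "".join(reversed(bits))


def expand(significant, ext_bits):
    """Rebuild the 32-bit word from the stored bytes and extension bits."""
    data = [significant[0]]
    for i in (1, 2, 3):
        if ext_bits[3 - i] == "1":
            data.append(0xFF if data[i - 1] & 0x80 else 0x00)
        else:
            data.append(significant[i])
    return sum(value << (8 * i) for i, value in enumerate(data))
```

For the word 10 00 00 09, `compress(0x10000009)` keeps bytes 0 and 3 and yields the extension bits "011"; `expand` reverses the mapping.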
As a more complex example, FF E7 00 04 is represented as - E7 - 04 : 101. The three-bit extension scheme allows for eight different patterns of significant/insignificant bytes (assuming the low order byte is always significant). We performed a study with the Mediabench benchmarks [6] to determine the relative frequency [...]
2.2. PC Increment
Incrementing the PC occurs at the very beginning of the pipeline. When incrementing the program counter, we do not literally append extension bits to the operands. One of the operands is always +1 (the PC has word resolution), so it is known to have only one significant bit. The PC, on the other hand, is held to full 30-bit precision. The PC increment is performed byte-serially to reduce activity. In particular, we first increment only the low order byte. If a carry out is produced, the next byte is incremented on the next cycle, and so on. If a carry out is not produced at any stage, no additional byte additions need to be done.
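The byte-serial increment described above can be sketched as follows. This is an illustrative model, not the circuit: the name `increment_pc` and the convention of returning the number of byte additions are assumptions, and the top "byte" is taken to hold the remaining 6 bits of the 30-bit PC.

```python
PC_BITS = 30  # word-resolution PC, as stated in the text


def increment_pc(pc):
    """Return (new_pc, byte_additions) for a byte-serial increment by one."""
    new_pc = pc
    additions = 0
    shift = 0
    carry = 1  # the +1 operand enters at the low-order byte
    while carry and shift < PC_BITS:
        width = min(8, PC_BITS - shift)  # the top chunk holds only 6 bits
        mask = (1 << width) - 1
        chunk = ((pc >> shift) & mask) + carry
        carry = chunk >> width  # carry out of this byte
        new_pc = (new_pc & ~(mask << shift)) | ((chunk & mask) << shift)
        additions += 1
        shift += 8
    return new_pc, additions
```

For purely sequential fetch, a carry out of the low-order byte occurs once every 256 increments, so the average cost is only slightly more than one byte addition per increment.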
This method very often saves adder and PC latching activity for higher order bytes, but it can lead to some performance loss in the uncommon cases when there is a carry beyond the low order byte, and instruction fetch is temporarily stalled while additional byte additions are performed. A brief analysis sheds some light on this trade-off. In general, [...]
2.3. Instruction Cache
To save instruction cache activity, instruction words are stored in a permuted form. The goal is to reduce the number of instruction bytes that have to be read, written, and latched. This objective is somewhat related to the more common instruction compression techniques [4, 5, 12, 13, 18] that attempt to store more instructions in a given amount of memory. In our case, each instruction is still allocated a full word in the instruction cache. However, not all bits have to be [...] Table 3. Thus, the most common eight function codes are recoded to 6-bit encodings, where the last three [...]
For the I-format, we simply note that often eight or fewer immediate bits are actually significant, and in these cases three instruction bytes are again adequate. Fig. 2c shows the permutation for the I-format instructions. For I-format instructions we also traced the benchmarks and determined the sizes of the immediate values. It was found that 59.1% of all instructions use immediate values and 80% of these immediates require only eight bits.
Although there are a few cases where it can be done, we do not attempt to reduce the number of fetched instruction bytes to fewer than three. Consequently, we add a single "extension" bit to the instruction word portion of the instruction cache. This bit indicates whether three or four bytes should be fetched and latched. Note that only one bit is used and it serves multiple purposes depending on the actual 6-bit opcode. There is also additional overhead during instruction cache fill for permuting/modifying the instruction bits, but this is a relatively small amount of additional activity, assuming a reasonable instruction cache miss rate. Finally, note that the order of the rearranged instruction bytes is chosen so that the bytes needed earlier in the pipeline are toward the most significant end. This enables better performance for implementations (to be given later) that read instruction bytes serially. For example, after an implementation fetches the first two bytes, there is enough information to perform the initial opcode decode and register read operations. The other bytes give the immediate bits, a result register field, and/or ALU function bits that are not needed until later in the pipeline.
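For I-format instructions, the choice between a three-byte and a four-byte fetch could be made at cache fill time along the following lines. This is a sketch under stated assumptions: the helper name `i_format_fetch_bytes` is hypothetical, the immediate is assumed to count as "eight bits" when it is a sign-extended 8-bit value, and the byte permutation of Fig. 2c is not reproduced.

```python
def i_format_fetch_bytes(instruction):
    """Return 3 if the 16-bit immediate fits in 8 signed bits, else 4.

    Assumption: an immediate is treated as short when its upper byte is a
    sign extension of its lower byte.
    """
    immediate = instruction & 0xFFFF
    if immediate & 0x8000:
        immediate -= 0x10000  # interpret as a two's complement value
    return 3 if -128 <= immediate <= 127 else 4
```

For example, the MIPS word 0x25080004 (addiu with immediate 4) would need three bytes under this assumption, while 0x25081234 would need all four.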
2.4. Register File Access
For the register file, extension bits as described in Section 2.1 are used. When the register file is accessed, first the low order data byte and the extension bits are read. Depending on the values of the extension bits, additional register bytes may be read during subsequent clock cycle(s). In a study of the Mediabench suite described below, we determined that the extension bits result in large register file activity savings. On average, the number of bits that are read is reduced by 47%.
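A rough way to estimate read activity for a set of operand values is sketched below. The function names, the sample values, and the decision to charge three extension bits per read are assumptions made for illustration; the figures it produces are not the Mediabench results reported in this paper.

```python
def bits_read(word, ext_overhead=3):
    """Bits read from the register file for one 32-bit operand.

    ext_overhead is the assumed cost of reading the extension bits;
    pass 0 to count data bits only.
    """
    data = [(word >> (8 * i)) & 0xFF for i in range(4)]
    significant = 1  # the low-order byte is always read
    for i in (1, 2, 3):
        fill = 0xFF if data[i - 1] & 0x80 else 0x00
        if data[i] != fill:
            significant += 1
    return 8 * significant + ext_overhead


def read_saving(words):
    """Fraction of read activity saved relative to full 32-bit reads."""
    baseline = 32 * len(words)
    return 1 - sum(bits_read(w) for w in words) / baseline
```

With the hypothetical operands 0x00000004, 0xFFFFFFFF, 0x10000009 and 0x12345678, the four reads touch 1, 1, 2 and 4 data bytes respectively.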
Different layouts can be used to implement the single-bank, 32-bit register file of the baseline configuration and each of the 8-bit register banks required by the pipelines proposed in this work. In particular, the physical arrangement of the data array of each bank has a significant impact on the performance of the register file. Splitting the data array into multiple arrays, either horizontally or vertically, or widening the number of bits per word line has a significant impact on the access time, as shown by Wada, Rajan and Przybylski [17], as well as on power consumption. The layout that minimizes access time may not be optimal with respect to power consumption.
Computing the optimal layout in terms of power consumption, or finding the best trade-off between access time and consumption, is interesting work but it is beyond the scope of this paper. In the following discussions, we assume that each bank is implemented through a single array (i.e., 32 word lines of 32 bits each for the 32-bit baseline [...]
2.5. ALU Operations
[...] Only one of the operands has a significant byte. If the non-significant byte is zeros (ones) and the carry-in from the preceding byte is zero (one), the result byte will be equal to the significant byte. If the non-significant byte is zeros (ones) and the carry-in is one (zero), the result byte is the significant byte plus one (minus one). In all these cases one could simplify logic, for example by bypassing the addition. However, we do not include these potential optimizations in activity statistics.
[...] exceptions. The general rule is that the result byte C_i is not significant, and the result is computed simply by setting the extension bits of the result because C_i will also be a sign extension of C_{i-1}.
In the exceptional cases, the ALU must generate a full byte value. Table 4 lists all exceptions to the general rule. Another common ALU operation that can be optimized is the comparison, which accounts for 6.9% of the instructions in our benchmarks. In this case, the normal byte processing order can be reversed, with the computation starting at the most significant byte and finishing with the least significant one. As soon as the two compared bytes differ, no further bytes need to be processed, even for comparisons of the greater-than or less-than type. This reversal of access order can be implemented with different levels of complexity depending on the particular processor design. For instance, for the byte-serial implementation described in Section 4, the reversal is easily implemented since this design has a single byte-wide register file and a byte-wide ALU. In other cases, such as the byte-parallel skewed implementation described in Section 6, the order reversal is more complex and may require an additional register port to avoid structural hazards.
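The early termination of a reversed-order comparison can be illustrated with a short sketch. It models a signed less-than on full 32-bit words and does not model the extension bits; the function name and the returned byte count are assumptions for the example.

```python
def less_than_msb_first(a, b):
    """Signed a < b on 32-bit words, examining bytes from the top down.

    Returns (result, bytes_examined).
    """
    for examined, i in enumerate((3, 2, 1, 0), start=1):
        byte_a = (a >> (8 * i)) & 0xFF
        byte_b = (b >> (8 * i)) & 0xFF
        if i == 3:
            # The top byte carries the sign: flipping its high bit lets an
            # unsigned byte comparison give the signed ordering.
            byte_a ^= 0x80
            byte_b ^= 0x80
        if byte_a != byte_b:
            return byte_a < byte_b, examined  # stop at the first difference
    return False, 4  # all bytes equal
```

Operands that differ in their most significant byte are resolved after a single byte comparison, whereas operands that differ only in the low-order byte require all four.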
Finally, bit-wise logical operations, which represent 4.2% of the instructions in our benchmarks, can also be byte-pipelined. In this case, whenever two bytes are sign extensions, the result will also be a sign extension. Note that other optimizations are feasible when just one of the operands is a sign extension, but we have not considered them. For instance, A AND 0 = 0, A AND -1 = A, etc. As shown below in Section 2.9, extension bits result in an average reduction of the ALU activity of 33% for Mediabench.
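The rule for bit-wise logical operations can be checked with a small model. The helper names `split` and `logical` are hypothetical, and the sketch covers only two-input AND, OR and XOR; it skips the byte operation whenever both operand bytes are sign extensions and counts the operations actually performed.

```python
import operator


def split(word):
    """Bytes of a 32-bit word (index 0 = low byte) and sign-extension flags."""
    data = [(word >> (8 * i)) & 0xFF for i in range(4)]
    ext = [False] + [data[i] == (0xFF if data[i - 1] & 0x80 else 0x00)
                     for i in (1, 2, 3)]
    return data, ext


def logical(op, a, b):
    """Apply a bit-wise op byte by byte; return (result, byte_operations)."""
    a_bytes, a_ext = split(a)
    b_bytes, b_ext = split(b)
    result, operations = [], 0
    for i in range(4):
        if a_ext[i] and b_ext[i]:
            # Both inputs are sign extensions, so the result byte is one too.
            result.append(0xFF if result[i - 1] & 0x80 else 0x00)
        else:
            result.append(op(a_bytes[i], b_bytes[i]))
            operations += 1
    return sum(value << (8 * i) for i, value in enumerate(result)), operations
```

For example, `logical(operator.and_, 0x00000F0F, 0xFFFFFFF0)` performs two byte operations and fills in the upper two result bytes as sign extensions.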
2.6. Data Cache Operation
The data cache holds data in a manner similar to the register file. That is, extension bits are appended to each data word and only the bytes containing significant data are read and written.
The address bytes may also be formed sequentially, beginning with the low order byte. This means that the cache index will be computed before the tag bits, and that the tag bits may be formed as part of multiple byte additions. [...] do not have to be formed and compared, resulting in reduced activity. However, because the miss rate is often relatively low, the activity saving is likely to be insignificant. There is similar activity for cache writes; the extension bits for the store data are read from the register file and written alongside the data. For cache fills, the extension bits must be generated as data is brought from memory (although the extension bit concept could also be maintained in main memory). We show below in Section 2.9 that the above techniques reduce the activity on the data cache by 31% for the data array and 1% for the tag array.
2.7. Register Write Back
During the register write-back stage, only bytes holding significant values have to be written into the register file. The extension bits also have to be stored. For ALU data, the bits are generated as described above in Section 2.5. For memory data, the extension bits read from the data cache are used. We show below in Section 2.9 that extension bits result in an average reduction of 42% in register file write activity.
2.8. Pipeline latches
Significant energy is consumed in pipeline latches [16], not just the major datapath elements. The extension bits are used for gating the pipeline latches in the normal way [9, 14]. Only the PC bytes that change require latch activity. Based on extension bits, only significant register, ALU and cache bits need to be latched. Hence, activity savings in the datapath elements are reflected directly in activity savings in the pipeline latches immediately following the datapath elements. Furthermore, clock signals can be gated at the byte level, thereby reducing clock activity.
Latch activity depends on the particular implementation. The lowest latch activity is achieved by the implementations with fewer pipe stages. This is the case, for instance, of the byte-serial [...]
2.9. Activity performance
To determine the activity savings for the techniques described above, we performed a trace-driven simulation of the Mediabench [6] suite. Only byte activity indicated by the extension bits was performed. Table 5 provides the overall results for byte granularity, and for comparison, Table 6 contains average results for halfword granularity significance compression. The tables show percent activity savings. The byte-serial PC increment operation saves 73% activity because, the great majority of the time, only the least significant byte is changed, as predicted by the analysis in Section 2.2. I-cache activity saving is 18%, and is quite uniform across all benchmarks. On average 47% of the register read activity is saved, with individual benchmarks saving from 34% to 72%. ALU activity saving averages 33% (ranging from 15% to 68%) and data cache activity saving averages 30% (ranging from 1% to 57%). The data cache activity is measured for data fills, reads and writes. The average saving on the data bank is 31% (ranging from 1% to 57%), whereas the saving for the tag bank is negligible. Register write-back saving is on average 42% (ranging from 30% to 69%). Finally, for implementations where the number of stages is not increased beyond the basic 5-stage pipeline, the latch activity is reduced by 42% on average and between 30% and 67% for individual benchmarks. The 16-bit serial savings remain substantial (Table 6), but are somewhat less than the byte-serial activity savings, as expected. The primary advantage of the 16-bit granularity is in implementation simplicity and in performance, as will be shown in the next section. Holding and maintaining the extension bits adds an overhead of 9% when three bits are used, and the PC increment and fetch stages have much less overhead. The bottom line is that the net overall activity savings (and [...]
3. Experimental Framework
We developed a simulator for several proposed pipeline implementations using some components of the SimpleScalar [...]
4. Byte-Serial Implementation
Having established potential activity reductions that can be achieved (and therefore energy reductions), we now consider implementations that attempt to achieve these levels while [...] In addition, the extension bits must flow through the pipeline and a three-bit latch is provided between some stages for this purpose. The ALU stage includes a special unit that operates on extension bits as described in Section 2.5. The byte-serial implementation achieves significant activity reduction, but at the cost of substantial performance losses with respect to the baseline 32-bit pipeline. For some applications, energy savings may be much more important than performance, and this may represent a good design point. There may be other applications, however, where performance is more important, and performance losses should be reduced. We now consider methods that retain low activity levels, but use additional hardware to improve performance.
The principle is to improve performance by adding additional byte-wide datapath elements at the various pipeline stages. We first undertook a bottleneck study of the byte-serial implementation to see where the major stalls occur. We observed that in the byte-serial architecture the ALU is the most important bottleneck: 72% of the stalls were caused by structural [...] byte must be written, this stage is used for multiple cycles. Fig. 6 shows the CPI of this microarchitecture along with that of the 32-bit baseline processor and the byte-serial implementation. On average, the CPI is 24% higher than that of the 32-bit baseline processor. We observe that the performance is much closer to the 32-bit implementation than to the byte-serial implementation, while all the activity savings are retained except for a few additional latches.
6. Fully Parallel Implementations
The above still loses some performance; bottlenecks cannot be [...] which is referred to as byte-parallel skewed, is depicted in Fig. 7. We can get the best of both (performance wise) by putting forwarding paths into the byte-parallel skewed pipeline. In this way, when a short operand is encountered, it can skip the stages where no operation is performed. This reduces the latch activity to the same level as that of the byte-serial implementation, and at the same time the effective pipeline length is shortened, which reduces the branch penalty. However, the number of backward bypasses is the same as that of the byte-parallel skewed implementation. The performance of this architecture is also shown in Fig. 10. Now performance is very close to the baseline 32-bit processor (the CPI is only 2% higher on average) while the activity is reduced by around 30-40% for most of the stages. A disadvantage, however, is that this design has rather complicated control and many data paths (for forwarding); a more detailed analysis is required and will be a subject of future study.
7. Summary and Conclusions
The significant bytes of instructions, addresses, and data values essentially determine a minimal activity level that is required for executing a program. For a simple pipeline design, we showed that this level is typically 30-40% lower than for a conventional 32-bit wide pipeline. Every stage of the pipeline shows significant activity savings (and therefore energy savings). We proposed a number of pipeline implementations that attempt to achieve these low activity levels while providing a reasonable level of performance. The byte-serial pipeline is very simple hardware-wise, but increases CPI by 79%. For some very low power applications, this may be an acceptable performance level, in which case the byte-serial implementation would be a very good design choice. We should also point out that the narrower data path may result in a faster clock, which will reduce performance loss, but this was not considered in this paper. For higher performance, the pipeline stages can be widened.
A rough analysis indicates that three bytes of instruction fetch, two bytes of register access and ALU, and one byte of data cache might provide a good balance of bandwidths. For this configuration, the CPI is 24% higher than that of the full width baseline design. Activities are still at their reduced levels, and this design may provide a very good design point for many very low power applications. Finally, we considered designs with a four byte wide datapath at each stage. Operand gating is retained for reducing activity, but under ideal conditions throughput is no longer restricted. These designs can come very close in performance to the baseline 32-bit design while again retaining reduced activity levels. The disadvantage of these schemes is increased latch activity, additional forwarding paths, or more complex control. We believe that these may nevertheless be a very important class of implementations because of their high performance levels, and they deserve additional study. Note also that different designs may imply a variation in the load capacitance, which also affects dynamic energy consumption. In particular, a narrower data path may shorten some wires and thus reduce their capacitance. This paper focused on pointing out the potential of these architectures to reduce pipeline activity. The final quantification of energy requires a further detailed circuit-level analysis of the implementations.
