Data, addresses, and instructions are compressed by maintaining only significant bytes with two or three extension bits appended to indicate the significant byte positions. This significance compression method is integrated into a 5-stage pipeline, with the extension bits flowing down the pipeline to enable pipeline operations only for the significant bytes. Consequently, register, logic, and cache activity (and dynamic power) are substantially reduced.
Introduction
There are many microprocessor applications, typically battery-powered embedded applications, where energy consumption is the most critical design constraint. In these applications, where performance is less of a concern, relatively simple RISC-like pipelines are often used [S] [10]. A variety of circuit and microarchitecture techniques are employed to conserve energy when the processor is operating, and power-down "sleep" modes are invoked when the processor is not in use. In current CMOS technology, most energy consumption occurs when transistor switching or memory access activity takes place [3]. Therefore, in this paper we focus on reducing dynamic energy consumption. Dynamic energy consumption is proportional to the switching activity, as well as the load capacitance and the square of the supply voltage. Thus, an important energy conservation technique is to reduce switching activity by "gating off" portions of logic and memory that are not being used.
Recently [1] it was proposed that rather than basing logic gating decisions entirely on operation types, certain operand values could also be used to gate off portions of execution units. In particular, arithmetic involving short-precision operands only needs to be performed on the (relatively few) numerically significant bits. Operands containing insignificant bits (typically leading zeros or ones) can yield simpler computations or can be used to avoid computations altogether. Note that this operand-based gating targets a different source of energy consumption than operation-based gating, and both operation- and operand-based gating techniques can be used concurrently.
$Department of Electrical and Computing Eng.
University of Wisconsin-Madison jes@ece.wisc.edu
We generalize the notion of operand gating to all stages of the pipeline as a way of reducing switching activity and hence, dynamic energy consumption. The key principle is the use of a small number of extension bits appended to all data and instructions residing in the caches, registers, and functional units. In Fig. 1, the extension bits are shown along the bottom of a basic pipeline. These bits correspond to portions of the datapath, and they flow through the pipeline to gate off unneeded energy-consuming activity at each stage, including pipeline latching activity. New extension bit values are generated only when there is a cache line filled from main memory (although they could also be maintained in memory) and when new data values are produced via the ALU. The points where extension bits are generated are indicated in Fig. 1 by circled "G"s.
For the instruction caches, extension bits allow a simple form of compression targeted at reducing instruction fetch activity, rather than reducing the number of bits in the program's footprint. For other datapath elements, they enable a form of compression where memory structures actively load and store only useful (significant) operand bytes. For arithmetic and logical operations, the extension bits enable operand gating techniques similar to those proposed in [1].
Given that only significant bytes require datapath operations and storage, pipeline hardware can be simplified by using byte-serial implementations, where the datapath width may be as narrow as one byte, and a pipeline stage is used repeatedly for the required number of significant bytes. Although there are many alternative implementations with different degrees of parallelism, they all have some serialization in the pipeline. In particular, low-order byte(s) and extension bits are first accessed and/or operated on; then additional bytes may be accessed and/or operated on if necessary. We describe and evaluate several pipeline implementations of this type.
When compared with a conventional 32-bit pipeline, significance compression can reduce activity by 30-40% for each pipeline stage. The simplest implementation (byte-serial) suffers a CPI (cycles per instruction) increase of 79%, but wider pipelines incur a performance loss as little as 2-6%.
The paper is organized as follows. Section 2 presents several techniques to reduce the activity at each stage of the pipeline. The experimental framework is described in section 3. Sections 4, 5, and 6 present implementations with differing levels of complexity and performance. Finally, section 7 contains a summary and conclusions.
Techniques for Reducing Activity Levels
In this section, we develop methods for reducing memory and logic activity for each pipeline stage. Because activity in the simple pipeline depends primarily on data values and instructions, we first undertake a trace-driven study to determine the required activity for each of the major pipeline operations. Then, in later sections, we propose and study pipelined implementations that come close to achieving the minimum "required" activity levels.
This work is based on a simple 5-stage pipeline with in-order issue, as is often used for low power embedded applications. We consider the 32-bit MIPS instruction set architecture (ISA) and focus on integer instructions and benchmarks commonly used in the low power domain.
Data Representation
The basic technique for representing data is to tightly compress data bits that do not hold significant data. For example, a small two's complement integer has only a few numerically significant low-order bits and a number of numerically insignificant higher order bits (all zeros or all ones).
In principle, one could consider significance at bit-level granularity, i.e. store and operate on exactly the numerically significant bits and no more. However, implementations are likely to be simpler and more efficient overall if a coarser granularity is used. Consequently, we primarily consider byte granularities and focus on the significant bytes rather than bits. Byte granularity is rather arbitrarily chosen, but it seems to be a good compromise of implementation complexity and activity savings. For comparison we also provide some results for halfword (16-bit) granularities. In general, one could consider non-power-of-two bit sequences and dividing words into sequences of different lengths, but this remains for future study. Because the lowest order data byte is very often significant, we will always represent and operate on the low order byte. Then we will use a very small number of bits (2 or 3) to indicate the significance of the other 3 bytes (of a word).
A simple encoding is to add two extra extension bits to encode the total number of bytes that are merely sign extensions. For example, the 32-bit number 00 00 00 04 (in hexadecimal) can be encoded as - - - 04 : 11. This is a mixed hexadecimal/binary notation that uses hexadecimal for significant (represented) bytes, a dash for the insignificant (non-represented) bytes, and a binary pattern after the colon for the values of the extension bits. In the above example, the only significant byte is 04 with three sign extension bytes, so the extension bits encode a binary three. This simple method also works for two's complement negative numbers if it is assumed that the high order bit of the most significant represented byte is extended. For example, the number FF FF F5 04 can be represented as - - F5 04 : 10. That is, it has two significant bytes, and the most significant bit of these two bytes is extended to fill out the full 32-bit number. This encoding works well and has an overhead of two bits per 32-bit word (about 6 percent).
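The two-bit encoding can be illustrated with a short Python sketch. This is only a behavioral model of the scheme described above, not the hardware; the function names `sign_fill`, `encode_2bit`, and `decode_2bit` are hypothetical and introduced here for illustration.

```python
def sign_fill(byte):
    """Return the byte that sign-extends `byte`: 0xFF if its top bit is set, else 0x00."""
    return 0xFF if byte & 0x80 else 0x00


def encode_2bit(word):
    """Compress a 32-bit word into (significant_bytes, ext).

    significant_bytes is ordered from high order to low order; ext (0..3) counts
    the high-order bytes that are pure sign extensions and are not stored.
    """
    b = [(word >> (8 * i)) & 0xFF for i in range(4)]  # b[0] is the low-order byte
    n = 4
    # Drop the top byte while it merely repeats the sign of the byte below it.
    while n > 1 and b[n - 1] == sign_fill(b[n - 2]):
        n -= 1
    return bytes(reversed(b[:n])), 4 - n


def decode_2bit(sig, ext):
    """Rebuild the 32-bit word by sign-extending the highest stored byte."""
    fill = sign_fill(sig[0])
    return int.from_bytes(bytes([fill] * ext) + sig, "big")
```

Under this model, `encode_2bit(0x00000004)` keeps one byte with extension value 3 (binary 11), and `encode_2bit(0xFFFFF504)` keeps F5 04 with extension value 2 (binary 10), matching the two examples in the text. An address such as 10 00 00 09 is not compressed at all by this encoding, because its top byte is not a sign extension.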
After inspecting commonly occurring data/address patterns, it is apparent that there are other, easily compressible values. In these cases there are some "internal" bytes that are all zeros or all ones, and these bytes are in a sense insignificant (slightly abusing the meaning of "significance"). An important case occurs for memory addresses in upper memory. These addresses often have nonzero upper bits, nonzero lower bits, but zero bits in between. For example, the data segment base of our experimental framework (see section 3) is set at address 10 00 00 00, thus a variable may be located at address 10 00 00 09.
To handle these cases, we propose a scheme with three extension bits (approx. 9% overhead). In this scheme, the extension bits apply on a per-byte basis. Each extension bit corresponds to one of the upper three data bytes (as before, the least significant byte is always fully represented). If an extension bit is set to one, it indicates that the corresponding byte is the sign extension of the previous (next lower order) byte; if the extension bit is zero, it indicates the corresponding byte is significant. Consequently, the earlier example 10 00 00 09 is represented as 10 - - 09 : 011. As a more complex example, FF E7 00 04 is represented as - E7 - 04 : 101.
The three-bit extension scheme allows for eight different patterns of significant/insignificant bytes (assuming the low order byte is always significant). We performed a study with the Mediabench benchmarks [6] to determine the relative frequency of occurrence of each (see section 3 for more details of the experimental framework). Table 1 lists the results. In the table, the notation "sess" indicates that the first, third, and fourth bytes are all significant and that the second byte is the sign extension of the third. The data show that the four most common cases include about 94% of operand values, and these four cases are the same as those that can be encoded with the two extension bit format described earlier. This suggests a trade-off between the two- and three-bit schemes. The former reduces the overhead from 9% to 6% whereas the latter may potentially reduce activity for about 6% more operands. We chose to study the 3-bit scheme, although one could reasonably argue that the 2-bit scheme is better due to simplicity and overhead advantages; in any case, the performance results are likely to be very similar for both schemes.
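As a minimal sketch of the three-bit, per-byte scheme, the following Python model marks each of the upper three bytes as either stored or derived by sign extension from the byte below it. The names `encode_3bit` and `decode_3bit`, and the use of a dictionary for the stored bytes, are assumptions made for illustration only.

```python
def sign_fill(byte):
    """Return the byte that sign-extends `byte`: 0xFF if its top bit is set, else 0x00."""
    return 0xFF if byte & 0x80 else 0x00


def encode_3bit(word):
    """Return (stored, ext) for a 32-bit word.

    stored maps byte position (0 = low order) to the byte value that must be kept.
    ext is a 3-bit value: bit i-1 is one when byte i is the sign extension of
    byte i-1, so printing ext in binary lists bytes 3, 2, 1 from left to right.
    """
    b = [(word >> (8 * i)) & 0xFF for i in range(4)]
    stored, ext = {0: b[0]}, 0          # the low-order byte is always stored
    for i in (1, 2, 3):
        if b[i] == sign_fill(b[i - 1]):
            ext |= 1 << (i - 1)         # byte i is not stored
        else:
            stored[i] = b[i]
    return stored, ext


def decode_3bit(stored, ext):
    """Rebuild the word, regenerating each non-stored byte from the byte below it."""
    b = [stored[0]]
    for i in (1, 2, 3):
        b.append(sign_fill(b[i - 1]) if ext & (1 << (i - 1)) else stored[i])
    return sum(byte << (8 * i) for i, byte in enumerate(b))
```

With this model, 10 00 00 09 keeps bytes 10 and 09 with extension bits 011, and FF E7 00 04 keeps E7 and 04 with extension bits 101, as in the examples above.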
Instruction Cache
To save instruction cache activity, instruction words are stored in a permuted form. The goal is to reduce the number of instruction bytes that have to be read, written, and latched. This objective is somewhat related to the more common instruction compression techniques [4,5,12,13,18] that attempt to store more instructions in a given amount of memory. In our case, each instruction is still allocated a full word in the instruction cache. However, not all bits have to be read/written/latched each time an instruction is placed in the cache or is fetched. Simple permutation-based compression schemes are important because the energy consumption of the decompression task should not offset the benefits of reducing the number of bits to be processed. Permutation methods of this type are likely to be specific to the ISA, and we consider methods that work well for the MIPS ISA. While the exact methods may not extend entirely to other ISAs, similar methods are likely to be applicable, at least for RISC ISAs. Although we considered a number of methods, two basic schemes seem to work well for the MIPS ISA and probably provide a significant majority of the benefit that can be achieved.
First of all, we observe that the MIPS ISA very often uses one of two formats [11]:
R-format: A 6-bit opcode, three 5-bit register fields, a shift amount field, and a 6-bit function code.
I-format: A 6-bit opcode, two 5-bit register fields, and a 16-bit immediate value.
In the R-format, the number of significant instruction bits can frequently be reduced to three bytes by recoding the six-bit function field so that the most common eight cases use three bits of the field with zeros in the other three bits. For these eight common cases, only three instruction bytes must be fetched and latched. In the other less common cases, all four instruction bytes must be fetched.
Shifts that use the shift amount field do not use the first register field (rs), so the fields can be permuted by moving the shift amount (shamt) into what is normally the rs field.
The permutation for R-format consists of shuffling bits in a minor way and re-encoding the function bits. Figs. 2a and 2b show the permutations for the R-format instructions. The function field is split into two 3-bit fields, f1 and f2, as noted above. To determine which function re-encoding to use, we first traced the Mediabench benchmarks and counted the dynamic frequency of each of the function codes. The results are in Table 3. Thus, the most common eight function codes are recoded to 6-bit encodings, where the last three block size of 8 bits for the PC increment.
There is a third format (J-format), but it only accounts for 2.2% of the executed instructions in the Mediabench suite.
For the I-format, we simply note that often eight or fewer immediate bits are actually significant, and in these cases three instruction bytes are again adequate. Fig. 2c shows the permutation for the I-format instructions. For I-format instructions we also traced the benchmarks and determined the sizes of the immediate values. It was found that 59.1% of all instructions use immediate values and 80% of these immediates require only eight bits.
Although there are a few cases where it can be done, we do not attempt to reduce the number of fetched instruction bytes to fewer than three. Consequently, we add a single "extension" bit to the instruction word portion of the instruction cache. This bit indicates whether three or four bytes should be fetched and latched. Note that only one bit is used and it serves multiple purposes depending on the actual 6-bit opcode. For typical R-format opcodes it indicates that the low order three function bits (field f1) are zeros. For the shift amount R-format opcodes, it also moves the shamt field, and for I-format opcodes it indicates an 8-bit immediate.
Overall, in the Mediabench suite a total of 36.9% of instructions are R-format that use the function field; 4.1% are R-format but the function field is not used; 56.9% are I-format, and 2.2% are J-format. Combining this with the immediate and function code frequency statistics, the average number of bytes fetched and latched per instruction is 3.17 bytes (3.29 if we include the extension bit). This represents a savings of about 20% (at an overhead of 3% for the extra bit per word). There is also additional overhead during instruction cache fill for permuting/modifying the instruction bits, but this is a relatively small amount of additional activity, assuming a reasonable instruction cache miss rate. Finally, note that the order of the rearranged instruction bytes is chosen so that the bytes needed earlier in the pipeline are toward the most significant end. This enables better performance for implementations (to be given later) that read instruction bytes serially. For example, after an implementation fetches the first two bytes, there is enough information to perform the initial opcode decode and register 
Register File Access
For the register file, extension bits as described in Section 2.1 are used. When the register file is accessed, first the low order data byte and the extension bits are read. Depending on the values of the extension bits, additional register bytes may be read during subsequent clock cycle(s). In a study of the Mediabench suite described below, we determined that the extension bits result in large register file activity savings. On average, the number of bits that are read is reduced by 47%.
To implement the single-bank, 32-bit register file of the baseline configuration and each of the 8-bit register banks required by the pipelines proposed in this work, different layouts can be used. In particular, the physical arrangement of the data array of each bank has a significant impact on the performance of the register file. Splitting the data array into multiple arrays, either horizontally or vertically, or widening the number of bits per word line has a significant impact on the access time, as shown by Wada, Rajan and Przybylski [17], as well as on power consumption. The layout that minimizes access time may not be optimal with respect to power consumption. Computing the optimal layout in terms of power consumption, or finding the best trade-off between access time and power consumption, is interesting work but is beyond the scope of this paper. In the following discussions, we assume that each bank is implemented through a single array (i.e., 32 word lines of 32 bits each for the 32-bit baseline configuration, and 32 word lines of 8 bits each for the proposed pipelines).
Under these assumptions, note that even in the worst case when all 32 bits are required, the multiple accesses do not necessarily increase energy consumption. The word line consumption of each single access is reduced by a factor of about four, since every bank is about one fourth the width and thus, word lines are about one fourth as long. Bit line consumption is reduced by a factor of about four, since the number of bit lines in each bank is reduced by a factor of four. Sense amplifier consumption is also reduced by a factor of four for each access, since the number of sense amplifiers matches the number of bit lines. Thus, four accesses result in approximately the same word line, bit line and sense amplifier energy consumption as one access to the 32-bit bank.
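The argument above amounts to a first-order model in which each byte-wide bank access costs about one fourth of a 32-bit access. A minimal Python sketch of that model follows; the function name `relative_read_energy` is hypothetical, and the model deliberately ignores the cost of reading the extension bits, decoders, and any other overheads not discussed in the text.

```python
def relative_read_energy(significant_bytes, banks=4):
    """Word line, bit line and sense amplifier energy of reading one operand from
    byte-wide banks, relative to a single access of the 32-bit baseline bank.

    First-order model only: each byte-wide access is assumed to cost 1/banks of
    a full-width access, and one access is made per significant byte.
    """
    if not 1 <= significant_bytes <= banks:
        raise ValueError("significant_bytes must be between 1 and banks")
    return significant_bytes / banks
```

In this model an operand with one significant byte costs about a quarter of a baseline read, and the worst case of four significant bytes costs about the same as the baseline.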
ALU Operations
ALU operations are performed using only the numerically significant register bytes and the extension bits as input operands. The ALU produces significant result bytes as well as the extension bits that go with them.
ALU operations are performed in a byte-serial fashion. Because additions/subtractions, memory instructions, and branches all require an addition, and they collectively account for 70.7% of the executed instructions in the Mediabench suite, this operation is the most critical one to be implemented efficiently. For each byte position, there are three major cases, depending on which of the operands have significant byte(s) in the position being added.
Case 1: Both bytes are significant. In this case, the byte addition must be performed.
Case 2: Only one of the operands has a significant byte. If the non-significant byte is zeros (ones) and the carry-in from the preceding byte is zero (one), the result byte will be equal to the significant byte. If the non-significant byte is zeros (ones) and the carry-in is one (zero), the result byte is the significant byte plus one (minus one). In all these cases one could simplify logic, for example by bypassing the addition. However, we do not include these potential optimizations in activity statistics.
Case 3: Neither of the operands has a significant byte in the position being added. Consider the addition of two bytes, $C_i = A_i + B_i$, where $A_i$ and $B_i$ are both sign extensions of their preceding bytes, $A_{i-1}$ and $B_{i-1}$. There is a general rule with some exceptions. The general rule is that the result byte $C_i$ is not significant, and the result is computed simply by setting the extension bits of the result because $C_i$ will also be a sign extension of $C_{i-1}$. In the exceptional cases, the ALU must generate a full byte value. Table 4 lists all exceptions to the general rule.
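The three cases can be explored with a small behavioral model in Python. This sketch is not the paper's ALU: it performs every byte addition so that the result is always correct, labels each byte position with the case it falls under, and then re-derives the result's extension bits by inspecting the result bytes rather than by applying the exception list of Table 4 (which is not reproduced here). The names `ext_bits` and `byte_serial_add` are hypothetical.

```python
def sign_fill(byte):
    """Return the byte that sign-extends `byte`: 0xFF if its top bit is set, else 0x00."""
    return 0xFF if byte & 0x80 else 0x00


def ext_bits(word):
    """Three extension bits: bit i-1 is one when byte i sign-extends byte i-1."""
    b = [(word >> (8 * i)) & 0xFF for i in range(4)]
    return sum(1 << (i - 1) for i in (1, 2, 3) if b[i] == sign_fill(b[i - 1]))


def byte_serial_add(a, b):
    """Add two 32-bit words one byte at a time.

    Returns (sum, result_ext, cases), where cases[i] is 1, 2 or 3 according to
    how many of the two operand bytes at position i are significant.
    """
    ea, eb = ext_bits(a), ext_bits(b)
    carry, total, cases = 0, 0, []
    for i in range(4):
        a_sig = i == 0 or not ea & (1 << (i - 1))   # low-order byte is always significant
        b_sig = i == 0 or not eb & (1 << (i - 1))
        cases.append(1 if a_sig and b_sig else 2 if a_sig or b_sig else 3)
        s = ((a >> 8 * i) & 0xFF) + ((b >> 8 * i) & 0xFF) + carry
        total |= (s & 0xFF) << (8 * i)
        carry = s >> 8                               # carry out of bit 31 is dropped
    return total, ext_bits(total), cases
```

For example, `byte_serial_add(0x00000001, 0x0000007F)` classifies byte 1 as Case 3, yet the result 00 00 00 80 has extension bits 110: byte 1 of the result must be generated as a full byte because 00 is not the sign extension of 80. This is one instance of an exception to the general rule of Case 3.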
To understand the exceptions to the general rule of case 3, consider the example where $A_{i-1} = 00000001$ and $B_{i-1} = 01111111$; $A_i$ and $B_i$ are both sign extensions (i.e., they are equal to zero). Then the addition $A_i + B_i$ will obviously be zero, but because byte $C_{i-1}$ has a one in its most significant bit, $C_i$ is not the sign extension of $C_{i-1}$. In this case, the processor has to generate the full byte value, although the addition is not actually necessary. Finally, note that in some cases a result byte may not be significant although the two source operand bytes are significant (e.g. 3 + -3 = 0). To handle these cases, there is simple logic that examines each result byte and generates extension (although the extension bit concept could also be maintained in main memory). We show below in Section 2.9 that the above techniques reduce the activity on the data cache by 31% for the data array and 1% for the tag array.
Register Write Back
During the register write-back stage, only bytes holding significant values have to be written into the register file. The extension bits also have to be stored. For ALU data, the bits are generated as described above in Section 2.5. For memory data, the extension bits read from the data cache are used. We show below in Section 2.9 that extension bits result in an average reduction of 42% in register file write activity.
Pipeline Latches
Significant energy is consumed in pipeline latches [16], not just the major datapath elements. The extension bits are used for gating the pipeline latches in the normal way [9, 14]. Only the PC bytes that change require latch activity. Based on extension bits, only significant register, ALU and cache bits need to be latched. Hence, activity savings in the datapath elements are reflected directly in activity savings in the pipeline latches immediately following the datapath elements. Furthermore, clock signals can be gated at the byte level, thereby reducing clock activity. Latch activity depends on the particular implementation. The lowest latch activity is achieved by the implementations with fewer pipe stages. This is the case, for instance, for the byte-serial implementation described in section 4. In this case, we show in the next section that the latch activity can be reduced on average by 42%.
Activity Performance
To determine the activity savings for the techniques described above, we performed a trace-driven simulation of the Mediabench benchmarks. Table 5 provides the overall results for byte granularity, and for comparison, Table 6 contains average results for halfword granularity significance compression. The tables show percent activity savings.
The byte-serial PC increment operation saves 73% activity, because the great majority of the time, only the least significant byte is changed, as predicted by the analysis in Section 2.2. I-cache activity saving is 18%, and is quite uniform across all benchmarks. On average 47% of the register read activity is saved, with individual benchmarks saving from 34% to 72%. ALU activity saving averages 33% (ranging from 15% to 68%) and data cache activity saving averages 30% (ranging from 1% to 57%). The data cache activity is measured for data fills, reads and writes. The average saving on the data bank is 31% (ranging from 1% to 57%) whereas the saving for the tag bank is negligible. Register write-back saving is on average 42% (ranging from 30% to 69%). Finally, for implementations where the number of stages is not increased beyond the basic 5-stage pipeline, the latch activity is reduced by 42% on average and between 30% and 67% for individual benchmarks.
The 16-bit serial savings remain substantial (Table 6), but are somewhat less than the byte-serial activity savings, as expected. The primary advantage of the 16-bit granularity is in implementation simplicity and in performance, as will be shown in the next section.
Holding and maintaining the extension bits adds an overhead of 9% when three bits are used, and the PC increment and fetch stages have much less overhead.
The bottom line is that the net overall activity savings (and therefore the overall energy savings) can be substantial. 
Experimental Framework
We developed a simulator for several proposed pipeline implementations using some components of the SimpleScalar toolset, primarily the instruction interpreter and the TLB and cache simulators. In all cases we assumed an in-order issue processor, with the following microarchitecture parameters:
First level split instruction and data cache: 8 KB, direct-mapped, 32-byte line, 1-cycle hit time.
Second level unified cache: 64 KB, 4-way, 32-byte line, 6-cycle hit, 30-cycle miss.
I-TLB: 16 entries, 4-way, 1-cycle hit, 30-cycle miss.
D-TLB: 32 entries, 4-way, 1-cycle hit, 30-cycle miss.
The processor does not perform any type of branch prediction, thus every branch stalls the fetch stage until the branch is resolved in the ALU stage. This is in keeping with some very low power embedded processors, although the trend is toward implementing branch prediction. The implications of branch prediction will be the subject of future study.
We used the Mediabench benchmark suite [6], whose programs were compiled with the gcc compiler with "-O3 -finline-functions -funroll-loops" optimization flags into a MIPS-like ISA. As a baseline for comparison we use a conventional 32-bit wide processor, with 5 pipeline stages: Instruction Fetch, Decode and Register Read, Execute, Memory, and Write Back.
Byte-Serial Implementation
Having established potential activity reductions that can be achieved (and therefore energy reductions), we now consider implementations that attempt to achieve these levels while providing good performance. Implementations will differ in total hardware resources although they may not necessarily differ in circuit activity.
First, we consider a simple byte-serial implementation that has a one byte wide data path. If more than one data/address byte is needed at a given stage, then that pipeline stage will be used sequentially for multiple cycles. While later sequential data bytes are being processed, however, earlier bytes can proceed up the pipeline. For example, if it is necessary to read 3 bytes from the register file, first the low order byte is read and passed on to the EX stage, then while the next byte is being accessed, the EX unit can operate on the first data byte and pass it to the data cache stage.
Fig. 3 shows the byte-serial implementation. In this microarchitecture there is a single register file bank (R), a single ALU, and a single data cache bank, all one byte wide. Inter-stage latches are provided to store values on a byte basis and only the significant bytes are required to be latched. In addition, the extension bits must flow through the pipeline and a three-bit latch is provided between some stages for this purpose. The ALU stage includes a special unit that operates on extension bits as described in Section 2.5. There is one byte-wide PC increment unit that operates serially and three instruction cache banks that are accessed in the first stage along with the extension bit. Then, if the extension bit indicates that it is needed, the instruction remains in this stage for one more cycle while one of the banks is accessed again. Using a three byte wide instruction cache stage is a departure from the strictly byte-serial implementation. This decision was made to avoid excessive stalls while reading instructions; otherwise, every instruction would incur at least two stall cycles because the minimum number of bytes per compressed instruction is three.
Fig. 4 shows the performance of the byte-serial implementation, expressed as cycles per instruction (CPI). For comparison, the CPI of a baseline 32-bit wide implementation is also shown.
For most programs, the performance of the byte-serial implementation is significantly lower than that of the 32-bit processor. CPI is increased by 79% on average, although activity (and energy) is reduced by 30-40% for most of the pipeline functions (Table 5).
If the pipeline is widened to 16 bits, the average CPI becomes 1.96, which is just 29% higher than that of the 32-bit baseline implementation, but the activity savings are lower (around 20-30% for most of the pipeline functions). Note that the relative performance of the pipelined schemes is quite uniform across all the benchmarks.
Semi-Parallel Implementations
The byte-serial implementation achieves significant activity reduction, but at the cost of substantial performance losses with respect to the baseline 32-bit pipeline. For some applications, energy savings may be much more important than performance, and this may represent a good design point. There may be other applications, however, where performance is more important, and performance losses should be reduced. We now consider methods that retain low activity levels, but use additional hardware to improve performance. The principle is to improve performance by adding additional byte-wide datapath elements at the various pipeline stages. For example, the register file can be constructed of two byte-wide files (rather than one) and produce a full data word in 2 cycles instead of 4. Similarly, multiple byte-wide ALUs can be used to increase throughput in the execute stage.
Adding these units does not necessarily increase circuit and memory access activity, however, because not all the units have to be enabled every cycle. For example, if a data item has only one significant byte, then a register access can be performed for one byte of a two byte wide register file, while the other byte is disabled. Similarly, if the source operands of an addition only have two significant bytes, these bytes will be operated in two of the ALUs while the others will be disabled.
Finally, the numbers of byte-wide units in each of the pipeline stages do not have to be the same. That is, the number of byte ALUs or memories can be established to permit balanced processing bandwidths among the pipe stages. To determine how many parallel units and memories should be used, we first undertook a bottleneck study of the byte-serial implementation to see where the major stalls occur. We observed that in the byte-serial architecture the ALU is the most important bottleneck: 72% of the stalls were caused by structural hazards in the EX stage. Thus, increasing the bandwidth of the ALU stage is the most effective approach to increase performance. To quantify how much bandwidth is required in each stage, we did the following simple analysis.
Consider each of the major pipeline stages. First, the study in Section 2.3 shows that an instruction requires about 3.2 bytes to be fetched on average. The ALU operates on an average of 2.7 bytes per instruction, but since the CPI is at least 1.5 (the CPI of the 32-bit baseline processor), the activity of the ALU will not be higher than 2.7/1.5 = 1.8 bytes/cycle on average. Next, around one third of instructions access memory, and each access is 2.8 bytes wide on average. Thus, less than one byte per cycle is accessed on average. Based on this study, we determined that a good balance is achieved with an instruction cache three bytes wide, a register file and ALU two bytes wide, and a data cache one byte wide.
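The bandwidth estimate above can be reproduced with a few lines of Python. The figures are the ones quoted in the text; rounding each demand up to a whole number of bytes is an assumption made here for illustration, and the variable names are hypothetical. The text arrives at the same widths, but does not state that it used this rounding rule.

```python
import math

BASELINE_CPI = 1.5            # CPI of the 32-bit baseline, as quoted in the text

# Average bytes needed per instruction at each stage (figures from the text).
bytes_per_instruction = {
    "fetch": 3.2,             # instruction bytes fetched
    "alu": 2.7,               # operand bytes operated on
    "dcache": (1 / 3) * 2.8,  # one third of instructions access 2.8 bytes
}

# Upper bound on bytes per cycle: no pipeline here runs faster than the baseline.
demand = {stage: n / BASELINE_CPI for stage, n in bytes_per_instruction.items()}

# Assumed sizing rule: smallest whole number of byte-wide units covering the demand.
widths = {stage: math.ceil(d) for stage, d in demand.items()}

for stage in demand:
    print(f"{stage}: {demand[stage]:.2f} bytes/cycle -> {widths[stage]} byte(s) wide")
```

Under these assumptions the fetch stage needs about 2.1 bytes/cycle, the ALU 1.8 bytes/cycle, and the data cache about 0.6 bytes/cycle, which is consistent with the three, two, and one byte wide units chosen above.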
An implementation for this configuration is shown in Fig. 5 and is referred to as byte semi-parallel. The instruction cache essentially contains three byte-wide banks and works as in the byte-serial implementation. The register access stage is skewed with the low order byte being accessed first together with the extension bits. In the next stage the low order byte is operated on, and at the same time another register byte is read if needed according to the extension bits. If there is more than one additional byte the instruction uses this stage for multiple cycles. The next stage performs the ALU operation on the additional bytes and is used for as many cycles as the previous stage. The following stage performs the data cache access (if needed). It first reads/writes the low order byte, the tags, and the extension bits and, according to the latter, the instruction uses this stage sequentially for multiple cycles until all data are read/written. Finally, the last stage writes the result into the register file. It first writes the low order byte, the extension bits and one additional byte if needed. If more than one additional byte must be written, this stage is used for multiple cycles.
Fig. 6 shows the CPI of this microarchitecture along with that of the 32-bit baseline processor and the byte-serial implementation. On average, the CPI is 24% higher than the 32-bit baseline processor. We observe that the performance is much closer to the 32-bit implementation than the byte-serial implementation while all the activity savings are retained except for a few additional latches.
Fully Parallel Implementations
The above design still loses some performance, because bottlenecks cannot be perfectly balanced all the time given the bursty behavior that most programs exhibit. So, we consider pipelines with maximum (4-byte) parallelism at each stage, and use operand gating to enable only those datapath bytes that are needed. This requires a skewing of stages in a similar way to the semi-parallel implementation described in the previous section. A block diagram of a portion of the microarchitecture, which is referred to as byte-parallel skewed, is depicted in Fig. 7.
This pipeline is optimized for the long data case, i.e., where the pipeline keeps flowing even if each operand is a full 4 bytes. No stage is used more than once (except for the PC computation in very few cases). Although the activity of the functional units is the same as that of the byte-pipelined and semi-parallel implementations, the longer pipeline of the byte-parallel skewed implementation implies more latch activity and more backward bypasses. The performance of this microarchitecture is shown in Fig. 8. We can observe that the CPI is very close to that of the 32-bit baseline processor for all programs.
Another alternative is a "compressed" parallel pipeline implementation (see Fig. 9). In this case, the pipeline consists of the original 5 stages. Each instruction spends one cycle in the Ifetch stage to read 3 bytes and an additional one if a fourth byte is needed. Then it moves on to the second stage where it reads the low order byte and the extension bits. If more bytes are needed, the instruction spends one more cycle in the same stage to read all of them in parallel. Then the instruction moves on to the ALU stage where it executes in a single cycle, using only the functional units that operate on significant bytes. Then it moves on to the memory stage where it first reads the low order byte and the extension bits and, if needed, spends an additional cycle to read all the remaining bytes. If it is a store, all the significant bytes along with the extension bits are written in a single cycle. Finally, all significant bytes and the extension bits are written into the register file in a single cycle.
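The stage-by-stage behavior of the compressed pipeline can be summarized as a per-stage cycle count. The following is a minimal sketch under stated assumptions: the function name, its parameters, and the stage labels are hypothetical, and we assume that an instruction with no memory access passes through the memory stage in one cycle. It models stage occupancy for a single instruction only, not the CPI of an overlapped instruction stream.

```python
def compressed_stage_cycles(instr_bytes, operand_bytes, mem_op=None, mem_bytes=0):
    """Cycles one instruction occupies each stage of the compressed pipeline.

    instr_bytes:   significant bytes of the instruction (1..4)
    operand_bytes: significant bytes of the widest register operand (1..4)
    mem_op:        None, "load", or "store"
    mem_bytes:     significant bytes of the memory datum (1..4) if mem_op is set
    """
    if not 1 <= instr_bytes <= 4:
        raise ValueError("instr_bytes must be in 1..4")
    if not 1 <= operand_bytes <= 4:
        raise ValueError("operand_bytes must be in 1..4")
    if mem_op not in (None, "load", "store"):
        raise ValueError("mem_op must be None, 'load', or 'store'")
    if mem_op is not None and not 1 <= mem_bytes <= 4:
        raise ValueError("mem_bytes must be in 1..4 for a memory operation")

    return {
        # Three bytes are fetched in the first cycle; a fourth costs one more.
        "ifetch": 1 if instr_bytes <= 3 else 2,
        # Low order byte and extension bits first; remaining bytes in one extra cycle.
        "regread": 1 if operand_bytes == 1 else 2,
        # The ALU always takes a single cycle; only significant bytes are enabled.
        "alu": 1,
        # Loads read the low byte first, then the rest; stores take one cycle.
        # Assumption: non-memory instructions pass through in one cycle.
        "mem": 2 if (mem_op == "load" and mem_bytes > 1) else 1,
        # All significant bytes and extension bits are written together.
        "writeback": 1,
    }


short_case = compressed_stage_cycles(instr_bytes=3, operand_bytes=1)
long_case = compressed_stage_cycles(instr_bytes=4, operand_bytes=4,
                                    mem_op="load", mem_bytes=4)
print(sum(short_case.values()), sum(long_case.values()))
```

In this model a short instruction with one-byte operands occupies each of the five stages for a single cycle, while a four-byte instruction that loads a four-byte value takes an extra cycle in the fetch, register read, and memory stages.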
This design works well for short data because the pipeline length is kept minimal and this reduces the branch penalty and the number of backward bypasses. Furthermore, functional unit and latch activity is kept minimal (equal to the byte-serial implementation). However, full-width (32-bit) data operations suffer stalls in some stages, which results in performance losses when compared with the fully parallel implementation. Performance is shown in Fig. 10. The CPI increase compared with the 32-bit baseline processor is 6% on average, which is quite close to the performance of the byte-parallel skewed configuration.
We can get the best of both (performance-wise) by putting forwarding paths into the byte-parallel skewed pipeline. In this way, when a short operand is encountered, it can skip the stages where no operation is performed. This reduces the latch activity to the same level as that of the byte-serial implementation, and at the same time the effective pipeline length is shortened, which reduces the branch penalty. However, the number of backward bypasses is the same as that of the byte-parallel skewed implementation.
The performance of this architecture is also shown in Fig. 10. Now performance is very close to the baseline 32-bit processor (the CPI is only 2% higher on average) while the activity is reduced by around 30-40% for most of the stages. A disadvantage, however, is that this design has rather complicated control and many data paths (for forwarding); a more detailed analysis is required and will be a subject of future study.
Summary and Conclusions
The significant bytes of instructions, addresses, and data values essentially determine a minimal activity level that is required for executing a program. For a simple pipeline design, we showed that this level is typically 30-40% lower than for a conventional 32-bit wide pipeline. Every stage of the pipeline shows significant activity savings (and therefore energy savings).
We proposed a number of pipeline implementations that attempt to achieve these low activity levels while providing a reasonable level of performance. The byte-serial pipeline is very simple hardware-wise, but increases CPI by 79%. For some very low power applications, this may be an acceptable performance level, in which case the byte-serial implementation would be a very good design choice. We should also point out that the narrower data path may result in a faster clock, which will reduce performance loss, but this was not considered in this paper.
For higher performance, the pipeline stages can be widened. A rough analysis indicates that three bytes of instruction fetch, two bytes of register access and ALU, and one byte of data cache might provide a good balance of bandwidths. For this configuration, the CPI is 24% higher than that of the full width baseline design. Activities are still at their reduced levels, and this design may provide a very good design point for many very low power applications.
Finally, we considered designs with a four-byte-wide datapath at each stage. Operand gating is retained for reducing activity, but under ideal conditions throughput is no longer restricted. These designs can come very close in performance to the baseline 32-bit design while again retaining reduced activity levels. The disadvantage of these schemes is increased latch activity, additional forwarding paths, or more complex control. We believe, however, that these may be a very important class of implementations because of their high performance levels, and they deserve additional study.
Note also that different designs may imply a variation in the load capacitance, which also affects dynamic energy consumption. In particular, a narrower datapath may shorten some wires and thus reduce their capacitance. This paper focused on pointing out the potential of these architectures to reduce pipeline activity. The final quantification of energy requires a further detailed circuit-level analysis of the implementations.
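As a first-order illustration of how activity and capacitance combine, the sketch below applies the proportionality stated in the introduction (dynamic energy scales with switching activity, load capacitance, and the square of the supply voltage). The function name and the capacitance ratio of 0.9 are hypothetical; the latter is chosen only for illustration, is not a measured value, and this is not a substitute for the circuit-level analysis mentioned above.

```python
def relative_dynamic_energy(activity_ratio, capacitance_ratio=1.0, voltage_ratio=1.0):
    """Dynamic energy relative to a baseline, to first order.

    Each ratio is (new design) / (baseline); 1.0 means unchanged.
    """
    return activity_ratio * capacitance_ratio * voltage_ratio ** 2


# Activity reduced by 30-40% (from the text), same capacitance and voltage.
same_wires = [relative_dynamic_energy(a) for a in (0.7, 0.6)]

# Hypothetical: shorter wires also cut load capacitance by 10%.
shorter_wires = [relative_dynamic_energy(a, capacitance_ratio=0.9) for a in (0.7, 0.6)]

print(same_wires, shorter_wires)
```

With capacitance and voltage unchanged, the relative energy simply equals the activity ratio; any reduction in capacitance multiplies on top of the activity savings.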
