Abstract
Introduction
Modern superscalar processors use out-of-order execution to exploit instruction-level parallelism. The dynamic scheduling engine employed in such processors often uses associative logic embedded into the issue queue entries to wake up instructions that are awaiting a result. This is accomplished by storing the addresses of the source registers within the issue queue entries and using comparators that match the stored source register addresses against the address of the result that is broadcast on the tag bus lines. A significant amount of energy is dissipated as the destination register address is broadcast on the tag busses. Energy dissipation occurs when the tag bus lines are driven because of the charging and discharging of the wire capacitance of the tag line itself and the gate capacitance of the devices that implement the tag comparators.
As wire capacitances dominate, a significant fraction of the energy spent in waking up instructions is attributed to the power used for broadcasting the tags. This is particularly true if comparators that dissipate energy only on a match are used within the issue queue [27]. Other researchers have reported the issue queue power to account for 20%-25% of total chip power [33, 34]. Few results have been published which break this down into components for the wakeup and select logic individually, but it has been reported that the wakeup power is dominated by the power spent in broadcasting the tags [32].
In this paper, we propose two schemes, tag memoization and tagline folding, to reduce the energy consumed by the wakeup tag broadcasts. Tag memoization avoids driving the upper portion of the tags if those bits did not change their values from what was driven on the same tag bus during the most recent broadcast. Tagline folding avoids driving the upper order bits on a tag bus if those bits match the upper order bits driven in the same cycle on another bus. We validate the power savings achieved by our techniques through cycle-accurate simulations of SPEC 2000 benchmarks and circuit simulations of full-custom issue queue layouts.
The main contributions of this paper are as follows:
• We perform detailed layout-level simulations of the wakeup logic and establish that the power dissipated in the course of tag broadcasts amounts to almost 92% of the total wakeup power.
• We propose two complementary techniques, tag memoization and tagline folding, to reduce the power consumption of the wakeup logic.
• We perform detailed microarchitectural and circuit-level simulations of the proposed mechanisms. Our results show more than 22% reduction in the wakeup power with no IPC degradation on any of the benchmarks and only a slight increase in the wakeup delay.
The rest of the paper is organized as follows. Section 2 analyzes the power of the wakeup logic. Section 3 describes tag memoization and Section 4 describes tagline folding. Our simulation methodology is presented in Section 5 and our simulation results are presented in Section 6. We describe the related work in Section 7 and offer concluding remarks in Section 8.
Power Analysis of the Wakeup Logic
In this section, we analyze the sources of power consumption within instruction wakeup logic and present their percentage breakdown. For this analysis, we performed the circuit simulations using the actual hand-crafted and highly-optimized VLSI layouts of the issue queue and the associated logic. The complete details of our simulation framework can be found in Section 5.
There are two sources of power dissipation in the process of instruction wakeup:
- Power dissipated in driving the destination tags of the selected instructions across the issue queue entries (hereinafter called tag broadcast power), and
- Power dissipated in the course of performing associative matching of the broadcast destination tags against the locally stored source tags for each not-yet-ready operand of every issued instruction, and in setting the corresponding source valid bits. The power of setting the instruction's ready bit (which is used to drive the request signals to the selection logic) is also accounted for here. For simplicity, we call this component the comparator power in the rest of the paper.
Our circuit simulations show that the tag broadcast power accounts for about 92% of the wakeup power. In these estimations, we assumed that traditional comparators that dissipate energy on a mismatch are used within the issue queue. Since the majority of comparisons result in a mismatch, more energy-efficient dissipate-on-match comparators could be used, resulting in a further decrease of the comparison power [27]. We also assumed that if a source operand is ready, if the source operand is not used, or if an issue queue entry is not allocated, the corresponding tag comparators are not activated in that cycle. Finally, we assumed that if no tag is driven on a tag bus within a given cycle, then this bus does not dissipate any power.
In the following sections, we describe two techniques to reduce the wakeup tag broadcast power by reducing the number of tag bits that are driven.
Tag Memoization
Tag memoization exploits the fact that the higher-order bits of tags that are broadcast within a short duration of each other are likely to be the same. The key idea is to conserve the power expended in broadcasting the tags by not driving the higher-order tag bits if they match the higher-order tag bits that were driven on the same bus during the previous broadcast. The tag comparator used to match the tag on the bus is broken into two separate comparators, say U and L, to match the higher-order bits and the remaining lower-order bits, respectively. A 1-bit latch is inserted in between to remember if there was a match in the higher-order bits with the previous broadcast. The match signal for an entry is derived by NAND-ing the output of the comparator for the lower-order bits with either the latch output or the output of the comparator for the upper-order bits.
The modifications to the tag circuitry associated with a single tag bus to support tag memoization are shown in Figure 1. We now describe this in some detail. The comparator used for tag comparison is split into two parts: an upper part U that compares the higher-order bits on the tag bus and a lower part L that compares the remaining bits against the respective parts of the locally stored source tags. When both the upper part and lower part of the tag bus are to be driven, the line ~drive_upper (complement of drive_upper) is driven low during the cycle. These signal lines run across the length of the issue queue. Driving ~drive_upper low turns the transmission gate switch on and allows the clock signal clk to latch the current output of the comparator U into a D-type latch, D. Simultaneously, the multiplexer MUX selects the output of the comparator U and feeds it to the NAND gate. In this case the NAND gate effectively combines the outputs of the two comparators U and L to produce the low-active match signal.
Storing the result of U in the D-latch in this manner makes it possible for the logic shown to remember the result of a match of the upper bits of the locally-stored tag value and the upper order bits driven on the tag bus.
When the upper order bits to be driven on the tag bus match what was driven earlier on the same lines, the ~drive_upper line is maintained in its (default) high state by a weak pullup device, and the upper order tag bus lines are not driven at all. Leaving ~drive_upper high disables the clock to the latch D (so that its contents remain unaffected) and allows the multiplexer MUX to select the contents of D. In this case, the match signal is obtained by NAND-ing the output of the comparator L and the contents of D. A tag match signal is produced in this case only if the lower order tag bits match and if the latch D recorded a match between the upper order bits of the locally stored tag and the upper order bits driven on the corresponding lines of the tag bus in the immediate past.
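The behavior described above can be sketched at the logic level. The following Python model is only an illustration under stated assumptions: the 2-bit/5-bit split of a 7-bit tag, the class names, and the method names are hypothetical, and the model assumes every entry is already allocated when the first broadcast on the bus occurs. It captures what is latched and what the multiplexer selects, not timing or energy.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

TAG_BITS = 7    # tag width of the baseline processor
UPPER_BITS = 2  # assumed split: comparator U sees 2 bits, L sees 5 bits


def split(tag: int) -> Tuple[int, int]:
    """Return the (upper, lower) fields of a tag."""
    lower_bits = TAG_BITS - UPPER_BITS
    return tag >> lower_bits, tag & ((1 << lower_bits) - 1)


@dataclass
class EntryComparator:
    """One source-tag comparator of an issue queue entry on one tag bus."""
    source_tag: int
    latch_d: bool = False  # remembered output of comparator U

    def observe(self, lower: int, upper: Optional[int]) -> bool:
        """Return True on a tag match; upper=None models ~drive_upper left high."""
        src_upper, src_lower = split(self.source_tag)
        if upper is not None:
            # Upper lines are driven: latch the fresh U output, MUX selects U.
            self.latch_d = (src_upper == upper)
        # Otherwise the latch keeps its contents and the MUX selects D.
        return self.latch_d and (src_lower == lower)


class MemoizingBus:
    """Driver side: skip the upper bits when they repeat on this bus."""

    def __init__(self) -> None:
        self.last_upper: Optional[int] = None
        self.upper_drives = 0  # how many broadcasts drove the upper lines

    def broadcast(self, tag: int, entries: List[EntryComparator]) -> List[bool]:
        upper, lower = split(tag)
        drive_upper = upper != self.last_upper
        if drive_upper:
            self.last_upper = upper
            self.upper_drives += 1
        return [e.observe(lower, upper if drive_upper else None) for e in entries]
```

Broadcasting tags 51 and then 52 on the same bus drives the upper lines only once in this model, because both tags share the upper bits 01; the entry waiting on tag 52 still matches through the value held in its latch.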
The additional delay introduced in the path that generates the match signal is the propagation delay of the NAND gate and the propagation delay of a turned-on CMOS switch within the multiplexer. Our circuit simulations (0.18 micron CMOS) showed that these components add a delay of 73 ps (55 ps for the NAND gate and 18 ps for the turned-on transmission gate within the multiplexer) to the critical path of the wakeup logic. Since smaller comparators, operating in parallel, are now used for matching the upper and lower order bits, the total comparator delay is reduced by 50 ps. Therefore, the overall critical path delay of the wakeup logic increases by 23 ps, which represents a 4% increase compared to the 569 ps delay of our baseline case (as shown in Table 1).
Table 1. Delay breakdown of the baseline wakeup logic.
Component | Delay (ps)
Tag-Bus Drive | 224
Comparator Output | 219
Final Match Signal | 126
Total Delay | 569
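The delay figures quoted for tag memoization can be checked with a few lines of arithmetic. The snippet below is only a bookkeeping sketch that uses the numbers stated in the text and in Table 1; the variable names are illustrative.

```python
# Baseline critical-path components from Table 1 (picoseconds).
baseline_ps = {
    "Tag-Bus Drive": 224,
    "Comparator Output": 219,
    "Final Match Signal": 126,
}

nand_ps = 55               # added NAND gate
mux_gate_ps = 18           # turned-on transmission gate in the multiplexer
comparator_saving_ps = 50  # smaller U and L comparators operate in parallel

baseline_total = sum(baseline_ps.values())
added = nand_ps + mux_gate_ps
net_increase = added - comparator_saving_ps
relative = net_increase / baseline_total

print(baseline_total, added, net_increase, f"{relative:.1%}")
# prints: 569 73 23 4.0%
```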
Determining if the upper order bits that are being broadcast match the upper order bits that were last driven on the same bus requires us to store the values of the last-driven upper order bits in a latch. The comparison itself is done by a fast combinational comparator (70 ps delay for a 3-bit combinational comparator), which has a much smaller delay than the pulldown comparators (110 ps for 3 bits). The delay of this logic can be absorbed by integrating it within the selection logic, which not only selects instructions for issue but also performs tag bus assignments. While the grant signal propagates down the selection logic, which is usually a multi-level, tree-like structure, the comparison of the upper order bits can be completed. If the selection tree is three levels deep and already turned-on transmission gates are used to route the grant signals down the tree to the requesting IQ entry, our layouts indicate that the delay of bringing down the grant signal is about 65 ps.
Thus the overhead (70 - 65 = 5 ps) of detecting whether to drive the upper order tag bits is a negligible part of the total selection process. If wakeup and selection are both performed in the same cycle, this overhead is even smaller.
Notice that with the logic just described, the line ~drive_upper is driven low only when the upper order tag bits are to be driven. Thus, when the upper order tag bits are not driven, no additional energy is expended in maintaining the line in the high state, which is held by a small pullup device (we ignore the energy spent in replenishing the charge lost on this line due to leakage). Neither do we need to maintain or drive the complement of this signal; the complementary signal is derived locally within the multiplexers. For a 7-bit tag, there are 14 bus lines in the tag bus, as both the value of a tag bit and its complement have to be driven to avoid the need to generate the complement values locally within each comparator (typical pulldown comparators require both the true input and the complement input bits). Our memoization scheme thus requires one additional bus wire to be driven when the upper order tag bits are driven. The added energy overhead of the ~drive_upper line is thus small.
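To get a feel for the trade-off between the suppressed upper lines and the extra ~drive_upper wire, one can count driven wires per broadcast. The sketch below is a crude proxy and rests on loud assumptions: it treats every driven wire as equally costly, it ignores comparator and latch energy, and the function names are hypothetical. It should not be expected to reproduce the layout-level power numbers reported in this paper.

```python
def lines_driven(tag_bits: int, upper_bits: int, upper_suppressed: bool) -> int:
    """Wires driven for one broadcast (true and complement line per tag bit)."""
    if upper_suppressed:
        # Only the lower-order bits are driven; ~drive_upper stays high.
        return 2 * (tag_bits - upper_bits)
    # Full tag plus the ~drive_upper wire.
    return 2 * tag_bits + 1


def expected_lines(tag_bits: int, upper_bits: int, repeat_rate: float) -> float:
    """Average wires driven when the upper bits repeat with probability repeat_rate."""
    return (repeat_rate * lines_driven(tag_bits, upper_bits, True)
            + (1.0 - repeat_rate) * lines_driven(tag_bits, upper_bits, False))


baseline = 2 * 7                     # 14 wires for a 7-bit tag without memoization
est = expected_lines(7, 2, 0.486)    # 48.6% repeat rate reported for two MSBs
print(f"baseline wires: {baseline}, expected with memoization: {est:.2f}")
```

Under this wire-count proxy with a 7-bit tag and two memoized bits, the scheme breaks even when the upper bits repeat in 20% of broadcasts (15 - 5p = 14 gives p = 0.2); any higher repeat rate drives fewer wires on average than the baseline.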
One can force additional savings from the memoization scheme by assigning tag broadcasts to a bus whose U bits match the upper order bits of the tag value to be driven. We call this "intelligent" tag bus assignment. There is, of course, additional energy and delay overhead in assigning tag broadcasts to specific buses in this case. Another alternative is to assign the tag values sequentially to instructions. This is possible in datapaths that use the ROB slots as physical registers or have rename buffers that are assigned from a circular FIFO. Tag memoization on these datapaths as well as datapaths with a unified register file is examined in the results section.
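The paper does not spell out how "intelligent" tag bus assignment is performed, so the following is only one possible greedy policy, written under assumptions: tags issued in a cycle are distinct, there are at least as many buses as tags, and the function and parameter names are hypothetical. Each tag is first offered a free bus whose last-driven upper bits match its own; leftover tags take the lowest-numbered free bus.

```python
from typing import Callable, Dict, List, Optional


def assign_buses(tags: List[int],
                 last_upper: List[Optional[int]],
                 upper_of: Callable[[int], int]) -> Dict[int, int]:
    """Map each tag to a bus index, preferring buses with matching upper bits."""
    free = set(range(len(last_upper)))
    assignment: Dict[int, int] = {}
    pending: List[int] = []
    for tag in tags:
        match = next((b for b in sorted(free)
                      if last_upper[b] == upper_of(tag)), None)
        if match is None:
            pending.append(tag)          # decide after all matches are placed
        else:
            assignment[tag] = match
            free.remove(match)
    for tag in pending:
        bus = min(free)                  # no match available: any free bus
        assignment[tag] = bus
        free.remove(bus)
    return assignment


# Example with 7-bit tags and a 2-bit upper field (assumed split).
print(assign_buses([83, 51, 10], [1, 2, 0, 3], lambda t: t >> 5))
```

A hardware implementation would have to fit such a policy into the selection logic, which is where the additional energy and delay overhead mentioned above would arise.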
The approach just described can be generalized to accommodate the segmentation of the tag comparator into more than two parts requiring an intervening latch in between consecutive segments. For example, a 7 bit tag comparator can be segmented into three parts: U1 (upper order two bits), U2 (next two bits), and L (remaining 3 bits). This arrangement requires two latches: one to remember the result of U1 and another for U2. These latches may be set independently, allowing for the gating off of either set of bits, or both. The match signal is derived by NAND-ing the contents of the intervening latches and the output of the comparator segment covering the lower order bits.
Tagline Folding
As observed with tag memoization, the high-order bits of the tags that are broadcast within a short duration of each other are likely to be the same. Tag memoization targets the case where a match in the upper order bits of the tag occurs across two successive broadcasts on the same tag bus. Tagline folding, on the other hand, targets the case where a match occurs across two different tag busses driven in the same cycle. The goal of tagline folding is to conserve power by broadcasting the upper order bits of the tags on only one of the busses.
Figure 2 presents the logic necessary for implementing tagline folding. Each comparator is broken into two parts: one for the upper-order bits of the tag and one for the lower-order bits. The number of bits in each of the two parts is determined by the folding width. For a folding width of w, the upper-order comparator has w bits and the lower-order comparator has n-w bits, where n is the number of bits in each tag. Tagline folding can be implemented on any subset of tag busses. As an example, let us assume that, in some cycle, the upper order bits of the tags driven on busses 1 and 2 match. The full tag will be broadcast on bus 1, and the comparator on bus 1 will NAND the output from its upper-order comparator and its lower-order comparator for detecting a match. Bus 2, however, will only broadcast the lower-order bits along with one additional select_U_other bit. Rather than NAND-ing both the upper-order and lower-order comparator outputs for the source tag on bus 2, the comparators will NAND the output of the upper-order comparator on bus 1 and the output of the lower-order comparator on bus 2 for this source. The additional delay introduced in the path that generates the match signal is the propagation delay of the NAND gate and the propagation delay of a turned-on CMOS switch within the multiplexer.
Similar to the tag memoization case, the overall critical path delay of the wakeup logic increases by about 28 ps, which represents a 5% increase compared to the 569 ps delay of our baseline case (as presented in Table 1). This is because the propagation delay of a transmission gate-based multiplexer is independent of the number of inputs; the added delay comes from the wire length needed to bring in the signal from the neighboring bus.
Figure 3 shows how the tag memoization logic (discussed in Section 3) can be augmented very simply to support tagline folding. The multiplexer shown in the figure now has the option of choosing one of three inputs: the output of the local U comparator (select_upper driven high), the output of the local latch D (select_D driven high), or the output of the U comparator associated with an adjacent bus (select_U_other driven high). If the upper order bits driven on two adjacent buses happen to be the same, the select_upper line is asserted on one bus (say Bus A) and the select_U_other line is asserted on the other bus (say Bus B), to allow the comparator of Bus B to use the output of the U comparator of Bus A for the match. Only the upper order tag bus bits of Bus A are driven in this case.
The delay added in the path that produces the match signal for Bus B now has an additional component: the wire delay of the connection that brings in the output of the U comparator of the adjacent bus to the input of the local multiplexer. This delay can be minimized by laying out the comparator logic for the two buses as close to each other as possible, by mirroring them symmetrically along an imaginary dividing line that runs across the length of the issue queue. The design of Figure 3 can be generalized to accept not only the output of the U comparator of the adjacent bus but also the output of the D latch for the adjacent bus, or signals from comparators and latches of other buses if need be. Notice that the use of un-encoded lines to select the inputs of the multiplexer permits only one of the input selection lines to be activated at any time. The delay of the combined memoization and folding logic does not increase compared to either of the two designs individually.
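A small behavioral model may help make the folding decision and the borrowed U comparator output concrete. It is a sketch under assumptions, not the circuit of Figures 2 and 3: buses are numbered from 0 (so bus 0 here is bus 1 in the text), the wiring table FOLD_ONTO and all function names are hypothetical, and only folding (no memoization latch) is modeled.

```python
from typing import Dict, List, Optional, Tuple

TAG_BITS = 7
FOLD_WIDTH = 2  # w: number of upper-order bits that can be folded

# FOLD_ONTO[b] is the bus whose U comparator output bus b may borrow.
# Assumed wiring: the second and third bus can each fold onto the first.
FOLD_ONTO: Dict[int, int] = {1: 0, 2: 0}


def split(tag: int) -> Tuple[int, int]:
    """Return the (upper, lower) fields of a tag."""
    lower_bits = TAG_BITS - FOLD_WIDTH
    return tag >> lower_bits, tag & ((1 << lower_bits) - 1)


def plan_cycle(tags: List[Optional[int]]) -> List[str]:
    """Per bus: 'idle', 'full' (all bits driven) or 'folded' (lower bits only)."""
    plan = []
    for bus, tag in enumerate(tags):
        if tag is None:
            plan.append("idle")
            continue
        leader = FOLD_ONTO.get(bus)
        if (leader is not None and tags[leader] is not None
                and split(tags[leader])[0] == split(tag)[0]):
            plan.append("folded")   # select_U_other asserted on this bus
        else:
            plan.append("full")     # select_upper asserted on this bus
    return plan


def entry_matches(source_tag: int, tags: List[Optional[int]], plan: List[str]) -> bool:
    """True if any bus wakes up a source operand waiting on source_tag."""
    src_upper, src_lower = split(source_tag)
    for bus, mode in enumerate(plan):
        if mode == "idle":
            continue
        upper_bus = FOLD_ONTO[bus] if mode == "folded" else bus
        u_out = src_upper == split(tags[upper_bus])[0]  # U comparator on upper_bus
        l_out = src_lower == split(tags[bus])[1]        # local L comparator
        if u_out and l_out:
            return True
    return False


tags = [51, 52, 83, None]  # upper fields: 1, 1, 2
print(plan_cycle(tags))    # prints: ['full', 'folded', 'full', 'idle']
```

Because the leader bus is never itself folded in this wiring, its upper lines are always driven whenever another bus borrows its U comparator output.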
Simulation Methodology
Our simulation environment includes a detailed cycle-accurate simulator of the microarchitecture and cache hierarchy. We used a modified version of the SimpleScalar simulator [3] that implements separate structures for the issue queue, re-order buffer, load-store queue, register files, and the rename tables in order to more accurately model the operation of modern processors. All benchmarks were compiled with gcc 2.6.3 (compiler options: -O2) and linked with glibc 1.09, compiled with the same options, to generate the code in the portable ISA (PISA) format. All simulations were run on a subset of the SPEC 2000 benchmarks consisting of 8 integer and 7 floating-point benchmarks using their reference inputs (we had difficulties compiling other benchmarks in our framework, mostly those written in Fortran). In all cases, predictors and caches were warmed up for 1 billion committed instructions and statistics were gathered for the next 500 million instructions.
Table 2 presents the configuration of the baseline simulated processor. Two separate datapaths have been simulated that use different physical register allocation mechanisms. Datapath A uses the PowerPC-style register file with rename buffers where registers are organized as a circular list [31]. Datapath B contains a unified architectural/physical register file similar to the Pentium 4 datapath. Both datapaths have the machine configuration specified in Table 2.
For estimating the delay, energy and area requirements, we designed the actual VLSI layouts of the issue queue and simulated them using SPICE. The layouts were designed in a 0.18 micron 6 metal layer CMOS process (TSMC) using Cadence design tools. A Vdd of 1.8 volts was assumed for all the measurements.
Results
Tag memoization does not impact IPC because it does not interrupt, hinder, or change the order of tag broadcasts. The power savings of tag memoization come from its ability to match the most significant bits of the tags on each bus from one broadcast to the next. Thus, it is important to consider how often these tag bits match. Figure 4 presents the number of most significant bits (MSBs), for each tag broadcast, that match those of the previous tag broadcast on the same bus for Datapath B (Datapaths A and B are defined in Section 5). The two most significant bits match 43% of the time even in this case, where the physical registers are not necessarily allocated from consecutive entries. If Datapath A is used, then physical registers are allocated from a FIFO-managed list, and thus the neighboring registers (which generally have the same most significant bits) are allocated to consecutive instructions. As a result, the percentage of matches in the two most significant bit positions during successive tag broadcasts on the same bus is even higher in this situation: 49% on average across the benchmarks.
For Datapath A, on a configuration with two separate intermediate latches, the two MSBs match an average of 48.6% of the time, while the next two MSBs match an average of 23.5% of the time, as presented in Figure 5. Accounting for the extra line that must be driven every time one of these latches must be reset, the total power savings from such a variation of tag memoization is 16.4% of the total tag broadcast power. The savings are a little lower for Datapath B, as expected: they amount to 11.1% of the tag broadcast power.
Similarly to tag memoization, tagline folding does not affect the IPC. As described previously, tagline folding can be implemented for any folding width that is less than or equal to the tag width of the processor. A folding width of 1, however, provides little benefit, as the power saved is largely offset by the additional power dissipated in driving the select_U_other bit and in the additional logic needed to support folding, as shown in Figure 2. At the other extreme, a folding width equal to the tag width of the processor makes little sense, since matches would only be detected when the exact same tag is broadcast on two busses at the same time. This is not possible because each instruction broadcasts its tag on only one of the available busses. Thus, only a folding width between 2 and n-1 for a processor with a tag width of n bits is in the practical realm.
As the tagline folding width increases, the power savings realized by each match go up but the number of such matches decreases. Figure 6 presents the wakeup tag broadcast power savings for tagline folding widths of 2 through 6 on our simulated processor with 7-bit wakeup tags. The figure indicates that a folding width of 2 or 3 bits yields the largest power savings for the simulated processor.
Tag folding can be implemented to match the tags on any subset of the tag busses. Figure 7 presents the percentage of tag broadcasts in which the 2 upper order bits of the tags match on various sets of busses. The frequency of matches is impacted by the bus arbitration logic. We assume that the tag buses are assigned for the selected instructions in order starting with bus 1 and ending with bus 4 (for a 4-way machine with 4 wakeup tag buses). Thus, if only two tags are broadcast in a given cycle, then busses 1 and 2 are used. Bus 4 is only active when 4 tags are broadcast in a given cycle, which does not happen very often because the average broadcast rate is only slightly more than 1.5 instructions per cycle. As a result, the highest match rate is observed between busses 1 and 2, where a match occurs for 52.6% of all tag broadcasts, since these two busses are used most of the time. Busses 3 and 4 produce fewer overall matches, since they are used less frequently. We consider the additional hardware to support folding between busses 3 and 4 unnecessary since matches on these busses occur in less than 6% of the cases. Note that the matches on busses 1 & 2 and those on busses 2 & 3 are not disjoint; there is an overlap when the upper order tag bits match on all three busses. It is possible to extend the tagline folding logic to allow the folding of three busses. This, however, adds complexity to the logic since the match on the upper-order comparator can now come from three places. An alternative is to separate out the instances where there is a match on all three busses and those with a match on only busses 2 and 3. The match of all three, then, can be handled as a match on bus 1 & 2, and a mismatch on bus 2 & 3 (requiring the full tags to be driven on both busses 1 and 3). 
Discounting the situation where all three busses match, the percentage of matches on busses 2 and 3 drops from 22.92% to 8%, indicating that it is not very beneficial to support folding on busses 2 and 3 in this situation. An alternative is to fold busses 1 and 3, which yields a 29% match rate. Thus, for the remainder of this section, we consider the implementation of tag folding on busses 1 & 2, which yields a 52% match rate, and busses 1 & 3, which yields a 29% match rate. The combination of folding on these two sets of busses results in matches on 81% of all tag broadcasts and produces an average reduction of 6.85% in the total tag broadcast power, with the highest savings being 12.81% (vortex) and the lowest being 5.75% (swim). Per-benchmark power savings are given in Table 3 for Datapath A.
Tag memoization and tagline folding are complementary techniques. Both require the separation of comparators into upper and lower parts. Thus, they can easily be combined to extract the benefits of each while overlapping the costs through comparator modifications shared by both schemes (as described in Section 4). Combining tag memoization and 2-wide tagline folding results in 22.25% savings in the tag broadcast power with no change in IPC for Datapath A. Notice that while the techniques are complementary, the total power savings is not simply the sum of the savings achieved by the individual schemes, as there is some overlap between them.
The baseline processor considered in this paper contains 128 physical registers, requiring 7-bit wakeup tags. We also performed experiments with a 256-entry register file. In this case, tag memoization and tagline folding result in combined savings of 25.6% in the wakeup tag broadcast power. This implies that both mechanisms scale with the number of physical registers, as one would expect. Similar trends were observed when further increasing the number of physical registers.
Related Work
Researchers have proposed several ways to reduce the power consumption of the issue logic. Dynamic adaptation techniques [22, 23, 24, 25] partition the queue into multiple segments and deactivate some segments periodically, when the applications do not require the full issue queue to sustain the commit IPCs. Energy-efficient comparators, which dissipate energy predominantly on a tag match, were proposed in [26, 27]. In [29], the associative broadcast is replaced with indexing to enable only a single instruction to wake up. This exploits the observation that many instructions have only one consumer. In [30], the wakeup width is decreased to be smaller than the machine width, noting that the full machine width is rarely used for instruction wakeup.
Energy savings on busses due to correlations among the past and immediate values of bits have also been explored in [35, 36, 37, 38]. In [35], bus-invert coding is proposed, which uses redundancy to reduce bus transitions. The scheme adds one line to the bus to indicate whether the actual data or its complement is transmitted, depending on the Hamming distance between the current value and the previous one. A technique to reduce switching activity on the address busses through the use of Gray codes was proposed in [36]. The Gray code has the advantage that there is only a single transition on the address bus when consecutive addresses are accessed. A technique to compress data addresses and instructions by maintaining only significant bytes, with two or three extension bits appended to indicate significant byte positions, was proposed in [37]. In [38], a scheme to encode the bytes containing all zeros and save energy by only reading and writing one bit for each zero-valued byte was proposed for caches. The zero byte encoding technique was extended to issue queues in [26]. The T0 code was proposed in [39] to reduce the switching activity on the address bus by freezing the value on the bus if addresses are sequential and driving an additional INC line. Several combinations of bus-invert and T0 encodings were proposed in [40]. A number of irredundant encoding techniques that do not require any extra lines for encoding and decoding are presented in [21]. None of these techniques specifically addresses the reduction in switching activity on the wakeup tag busses. In addition, we use the actual statistics about the bit patterns driven on these busses as obtained from cycle-accurate microarchitectural simulations.
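For readers unfamiliar with the two bus encodings mentioned above, the following sketch shows their textbook forms. It is not taken from [35] or [36]: the function names are illustrative, and the bus-invert variant is simplified in that it ignores the transition cost of the invert line itself.

```python
from typing import Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def gray_encode(n: int) -> int:
    """Binary-reflected Gray code: consecutive integers differ in one bit."""
    return n ^ (n >> 1)


def bus_invert(prev_lines: int, value: int, width: int) -> Tuple[int, int]:
    """Return (lines_to_drive, invert_bit) for a bus of the given width.

    If more than half of the lines would toggle, the complement is sent
    and the extra invert line is set so the receiver can undo it.
    """
    mask = (1 << width) - 1
    if hamming(prev_lines, value) > width // 2:
        return (~value) & mask, 1
    return value, 0


# Sequential 7-bit values toggle exactly one line each when Gray coded.
print(all(hamming(gray_encode(i), gray_encode(i + 1)) == 1 for i in range(127)))
# Six of seven lines would toggle, so the complement is driven instead.
print(bus_invert(0b0000000, 0b1111110, 7))
```

Both encodings reduce transitions on lines that are still driven, whereas the schemes proposed in this paper avoid driving the upper order tag lines altogether.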
In [4] , some comparators are removed from the issue queue entries to save power and last-tag speculation mechanisms are introduced for use in instruction wakeup. In [28] , the tag buses were categorized into fast buses and slow buses, such that the tag broadcast on the slow bus takes one additional cycle.
Several works attempted to reduce scheduling complexity through pipelining the scheduling logic or reducing the issue queue size [2, 7, 8] . Other proposals have introduced new scheduling techniques with the goal of designing scalable dynamic schedulers to support a very large number of in-flight instructions [5, 6, 9, 14, 20] . Scheduling techniques based on predicting the issue cycle of an instruction [10, 11, 12, 13, 15, 16, 17, 18] remove the wakeup delay from the critical path and remove the CAM logic from the issue queue, but need to keep track of the cycle when each physical register will become ready.
Concluding Remarks
We proposed two schemes to reduce the power consumption of the wakeup tag broadcast. Tag memoization avoids driving the upper portion of the tags if those bits did not change their values from what was driven on the same tag bus during the most recent broadcast. Tagline folding avoids driving the upper order bits on a tag bus if those bits match the upper order bits driven in the same cycle on another bus.
The use of these two schemes results in 22.3% reduction in the wakeup tag broadcast power with only about 5% increase in the wakeup delay. If the wakeup and selection operations are combined within a single cycle to implement atomic dynamic scheduling in order to support back-to-back execution of dependent instructions, then the overall increase in the scheduling latency is only on the order of 2-3% (as the selection logic has about the same delay as the wakeup logic according to [1] ). Additionally, there is no IPC degradation as neither mechanism interrupts, changes or hinders the order of tag broadcasts. 
References

