In this paper, we present an improved adaptive issue queue for use in embedded microprocessors. The issue queue was designed, simulated, and implemented in 0.25 µm CMOS technology. Using an 8-entry issue queue as an example, we found that shutting down the parts of the queue that are not in use greatly decreases the power dissipation of the issue queue, with only small tradeoffs in speed and area. The additional logic associated with the shutdown circuitry introduces an insignificant overhead. Furthermore, low-power optimization of the CAM and SRAM arrays resulted in additional power savings. A 4.8%-56% reduction in power can be achieved at 2.5V, and at 1.5V the power consumption is reduced by 8.5%-60%. In addition, the overall power consumption at 1.5V drops by a factor of 10 compared to 2.5V operation.
INTRODUCTION
In modern electronics, increased power dissipation has become a major design constraint. An issue queue takes instructions and the destinations for the instructions and holds them until they are selected for processing. The issue queue contributes greatly to the overall power consumption of the microprocessor and also complicates verification [1]. In a "floating point high power test," a test designed by Intel that attempts to measure maximum power, the issue queue dissipates approximately 33% of the total processor power. Therefore, reducing power consumption in the queue should significantly reduce the power consumption of the processor itself. A new type of issue queue has been proposed by Buyuktosunoglu et al. [1, 2]. However, the authors focused on large issue queues of 32 to 128 entries, whereas the most common queue sizes are between 8 and 27 entries. Using a simple model, an 8-entry issue queue, together with several techniques for low-power operation, we demonstrate that significant additional power savings can be achieved at the circuit level when combined with aggressive voltage scaling.
QUEUE STRUCTURE
The overall structure of the adaptive power issue queue is shown in Figure 1. Buyuktosunoglu et al. proposed separating the queue into "chunks" of equal size. Each chunk can be enabled or disabled based on its usage, which is measured by the activity within the chunk. The bias logic generates an "active" or "not active" signal for the chunk it is assigned to. The statistics process and storage logic keeps track of how often chunks are active or not active, and the decision logic sends the enable/disable signal to the chunks. As seen in Figure 2, the chunks are separated by a transmission gate on the bitlines, allowing the bitlines to be shortened if one or more chunks have been disabled. This decreases the capacitance on the bitline and increases the speed of searching and retrieving data from the queue.
CAM Design
The CAM section of the queue holds the source registers of each instruction word. The CAM cell used is from Zhang and Asanovic [3], shown in Figure 3. The bitlines and search lines have been separated to reduce the capacitance during the search operation. Each row is split into two, with match detection logic on each half match line. This halves the worst-case delay of the match line capacitance discharge. Energy is reduced on the match lines through the use of n-type transistors that precharge them to Vdd - Vtn. The match detection circuitry used is from Bellaouar and Elmasry [4], shown in Figure 4.
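The split match line can be illustrated with a purely behavioral Python sketch (it does not model the circuit, precharge, or timing). The function names and the choice to split each row at its midpoint are assumptions made for illustration; the text only states that each row is split into two halves with their own match detection.

```python
def half_match(stored_bits, search_bits):
    """Return True when every bit on this half match line agrees."""
    return all(s == q for s, q in zip(stored_bits, search_bits))


def row_match(stored_row, search_row):
    """Evaluate a CAM row as two independent half match lines.

    Splitting at the midpoint is an assumption made for illustration.
    """
    if len(stored_row) != len(search_row):
        raise ValueError("stored and search words must have equal width")
    mid = len(stored_row) // 2
    left = half_match(stored_row[:mid], search_row[:mid])
    right = half_match(stored_row[mid:], search_row[mid:])
    # The row matches only if both half match lines report a match.
    return left and right
```

Because each half is evaluated on its own, a mismatch in either half is enough to reject the row, which is the behavioral counterpart of discharging only half of the match line capacitance in the worst case.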
RAM Design
The RAM section of the queue holds the instruction, destination register, and any other information needed to complete the instruction's operation. We have utilized a low-power SRAM architecture originally proposed by Wang et al. [5, 6, 7, 8]. The SRAM cell is shown in Figure 5. It is a 7-T cell; however, the sizing allows it to be 4.3% smaller than a standard 6-T cell. The additional transistor clears the cell before a write operation. The sense amplifier used is a current-mode sense amplifier for a 1.5V power supply, shown in Figure 6. The power consumption is reduced by 61% to 94% versus a conventional design. To perform a current-mode write operation, an n-type current conveyor (shown in Figure 7) is inserted between the data input cell and the memory cell.
Shutdown Circuitry
The bare bones shutdown circuitry consists of a bias logic block for each chunk, and a detection logic block. The shutdown circuitry diagram for our sample 8-entry queue is shown below. 
[Figure: shutdown circuitry diagram; output labeled "Shutdown/Wakeup Signal"]
The bias logic block consists of a simple circuit that averages the number of active entries per chunk and generates an "active" or "not active" signal accordingly. The detection logic can take a maximum of four bias logic blocks as inputs. Its function is to generate the number of active bias logic blocks. For our sample 8-bit shutdown circuitry with two bias logic blocks, only one detection logic block is needed, which can generate a binary zero, one, or two. When there are multiple detection logic blocks, a binary counter is added to the circuit to sum the outputs of all detection logic blocks. For example, if 32 bits are passed to the shutdown circuitry and each bias logic block monitors four bits, eight bias logic blocks are needed, which are then connected to two detection logic blocks. If the first detection logic block generates a binary three and the second generates a binary two, the counter adds both numbers and generates a binary five.
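The chain of bias logic, detection logic, and counter described above can be sketched behaviorally in Python. This is an illustrative model only: the function names are hypothetical, and the 0.5 activity threshold in the bias block is an assumption, since the text does not state how the average is turned into an "active" decision.

```python
BITS_PER_BIAS_BLOCK = 4      # from the 32-bit example in the text
MAX_BIAS_PER_DETECTION = 4   # a detection block accepts at most four bias inputs


def bias_block(bits, threshold=0.5):
    """Return 1 ("active") when the mean activity reaches the threshold.

    The 0.5 threshold is an assumption; the text does not state one.
    """
    return 1 if sum(bits) / len(bits) >= threshold else 0


def detection_block(bias_outputs):
    """Count the active bias blocks among at most four inputs."""
    if len(bias_outputs) > MAX_BIAS_PER_DETECTION:
        raise ValueError("a detection block takes at most four bias inputs")
    return sum(bias_outputs)


def active_block_count(activity_bits):
    """Chain bias blocks, detection blocks, and the final counter."""
    bias = [
        bias_block(activity_bits[i:i + BITS_PER_BIAS_BLOCK])
        for i in range(0, len(activity_bits), BITS_PER_BIAS_BLOCK)
    ]
    detections = [
        detection_block(bias[i:i + MAX_BIAS_PER_DETECTION])
        for i in range(0, len(bias), MAX_BIAS_PER_DETECTION)
    ]
    # The counter sums the outputs of all detection blocks.
    return detections, sum(detections)
```

With a 32-bit input chosen so that three of the first four bias blocks and two of the last four are active, `active_block_count` returns `([3, 2], 5)`, mirroring the three-plus-two example in the text. An 8-bit input yields a single detection output of zero, one, or two.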
The decision logic block shown in Figure 1 is composed of a comparator and a subtractor. It is the most complicated part of the shutdown circuitry and the final stage before the generation of the shutdown/wakeup signal. The number of active queue entries from the previous state (generated by the detection logic or the counter) is stored in a register and compared with the current number of active queue entries. If the previous state contained more active queue entries, a signal is sent to shut down a number of entries equal to the difference between the two counts. If, however, the current state contains a greater number of active entries, a signal is sent to "wake up" a number of entries equal to that difference. It is the shutting down of unnecessary queue entries that compensates for the extra power that the shutdown circuitry uses.
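The comparator-and-subtractor behavior can be summarized in a short Python sketch. The class name, method name, and action labels are hypothetical, and the model captures only the logical decision, not the timing or circuit structure.

```python
class DecisionLogic:
    """Behavioral model: a state register, a comparator, and a subtractor."""

    def __init__(self):
        self.previous = 0  # register holding the previous active count

    def step(self, current):
        """Return (action, amount) for the current active count."""
        if current > self.previous:
            # Comparator: current is larger, so wake up the difference.
            action, amount = "wake_up", current - self.previous
        elif current < self.previous:
            # Comparator: previous is larger, so shut down the difference.
            action, amount = "shut_down", self.previous - current
        else:
            # Equal counts leave the queue unmodified.
            action, amount = "hold", 0
        self.previous = current  # update the register for the next comparison
        return action, amount
```

Starting from an empty state, a first call `step(2)` gives `("wake_up", 2)`, a second call `step(2)` gives `("hold", 0)`, and a later `step(1)` gives `("shut_down", 1)`.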
TEST SETUP
The characteristics of the simulated queue are as follows:
• 8-entry queue
• 2 chunks, 4 entries each
• 3-bit register addresses
• 16-bit instructions
• 1.5V and 2.5V voltage sources

Standard queue operation was simulated with and without the shutdown circuitry, and tested for power consumption and signal transmission with various combinations of active/inactive queue entries.
A set of instructions was created to simulate the same usage of the queue as the SPECint95 benchmark suite. The percentages are shown below in Figure 9 . As the graph shows, more than 60% of the time only 4 entries of the queue are used, meaning that more than half of the time the queue is operating, a chunk can be turned off and power can be saved. 
RESULTS
Two source voltages were used for the simulation: 2.5V (the voltage for which both the models and our designs were optimized) and 1.5V. The early results show that up to 56% of the power can be saved at 2.5V (third column). Furthermore, Figure 10 shows that the additional shutdown circuitry does not add significantly to the power consumed (compare the first and second columns). The fourth column represents the average power consumed by the issue queue during the test.
The results for 1.5V are shown in Figure 11. As seen from the figure, the power is further reduced by a factor of 10 compared to 2.5V operation. This significant reduction is due to reduced energy waste and short-circuit power at low voltages. Many signal spikes that are present at 2.5V become negligible at 1.5V. In addition, Vdd=1.5V narrows the region between VIL and VIH, consequently reducing the time during which a direct path exists between the power line and ground. Furthermore, the improved issue queue consumes 8.5%-60% less power than the standard issue queue.
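As a rough sanity check on the voltage-scaling result, the sketch below applies the standard first-order switching-power model, in which dynamic power is proportional to C·Vdd²·f. It assumes unchanged capacitance, frequency, and switching activity at both supplies, which the text does not confirm; the function name is hypothetical.

```python
def dynamic_power_ratio(v_high, v_low):
    """Ratio of switching power at two supplies for fixed C, f, and activity."""
    return (v_high / v_low) ** 2


# Supply voltages used in the simulations.
ratio = dynamic_power_ratio(2.5, 1.5)

# Portion of the reported factor of 10 that this simple model leaves unexplained.
residual = 10 / ratio

print(f"V^2 scaling alone: {ratio:.2f}x, remaining factor: {residual:.2f}x")
```

Under these assumptions, quadratic voltage scaling accounts for a factor of roughly 2.8, so the remainder of the reported reduction would have to come from effects outside this model, such as the reduced short-circuit power and suppressed signal spikes described above.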
With the 1.5V power supply, there is an 8.5% power savings. A timing diagram for one case of the shutdown circuitry is shown in Figure 12. The first signal (XI4) shown is a logic "1," or 2.5 V, applied to both bias logic blocks to activate them. The second signal (BiasIsBigger) comes from the output of the comparator of the decision logic, indicating that there are more bias logic blocks active than in the previous stored state, which in this case is zero. The input signal reaches 2.5 V at 1 ns, and the comparator registers this at 2.1 ns. Thus, there is a 1.1 ns delay between the input of the bias logic and the input of the subtractor.
The third signal (Equal) also comes from the comparator, and is high when the number of active bias logic blocks is equal to the previous number of active blocks. When this "equal" signal is high, the number of queue entries is unmodified. This signal starts at logic "1," since both states are empty, and falls to logic "0" at 2.5 ns, shortly after the bias logic blocks first become active. It then returns to logic "1" at 3.7 ns. At this point, there are two active bias logic blocks, and the previous state register also contains a binary two.
The fourth (Ss2) and fifth (Ss1) signals are the output shutdown/wakeup signals. The fourth signal represents a binary two, and the fifth represents a binary one. At 2.4 ns, the subtractor generates a binary two, the result of subtracting the number of previous active entries from the number of active bias blocks. Since the comparator has flagged the bias logic as having the greater number of active entries, the signal is sent to "wake up" two queue entries. Were the previous number of entries greater than the number of current bias entries, the signal would instead be sent to shut down queue entries.
It takes 1.4 ns from the time that the input signal reaches 2.5 V for the shutdown/wakeup signal to be generated. Most of this delay (1.1-1.2 ns) occurs between the input signal and the subtractor of the decision logic block.
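The delays quoted above follow directly from the event times in the timing description. The short Python sketch below reproduces that arithmetic; the dictionary keys and the function name are hypothetical labels, and times are stored as integer picoseconds to avoid floating-point rounding in the subtraction.

```python
# Event times in picoseconds, taken from the timing description in the text.
EVENTS_PS = {
    "input_high": 1000,       # input signal reaches 2.5 V
    "comparator_flag": 2100,  # comparator registers the change
    "subtractor_out": 2400,   # subtractor generates binary two
}


def delay_ns(start, end, events=EVENTS_PS):
    """Return the delay between two named events in nanoseconds."""
    return (events[end] - events[start]) / 1000


print(delay_ns("input_high", "comparator_flag"))   # input to comparator
print(delay_ns("input_high", "subtractor_out"))    # input to shutdown/wakeup signal
```

The first call gives the 1.1 ns delay up to the comparator, and the second gives the 1.4 ns total delay to the shutdown/wakeup signal.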
CONCLUSION
In this paper, an improved adaptive issue queue for embedded microprocessors has been presented. The optimization has focused on memory architectures for CAMs and RAMs that provide the highest energy efficiency (least energy for most performance). Early results show that power can be reduced by 4.8%-56% at 2.5V and by 8.5%-60% when the operating voltage is reduced to 1.5V. In addition, the issue queue consumes more than 10 times less power at 1.5V than at 2.5V. The improved issue queue has been designed in 0.25 µm CMOS technology.
