Abstract
INTRODUCTION
The primary goal of processor design is to improve throughput within the power constraint. This goal is conventionally achieved in two separate design stages: architects optimize IPC (Instructions Per Cycle) with microarchitecture innovations, and then VLSI circuit designers perform logic synthesis and layout design to retain IPC and maximize clock frequency. In most cases, interconnects are optimized at the second stage but are not considered at the microarchitecture level. As VLSI technology advances, the system delay has become dominated by the interconnect delay. A growing number of repeaters and flip-flops (FFs) are used to reduce the interconnect delay [1]. Because interconnects with inserted repeaters and FFs may greatly affect IPC and power, a microarchitecture can hardly be optimized without considering interconnect and layout optimization. However, most existing microarchitecture-level simulation tools such as [2]-[5] do not explicitly characterize the impact of interconnects.

At the layout and physical design level, there have been extensive studies on interconnect performance and power modeling considering repeater and FF insertion. Focusing on performance modeling in terms of interconnect delay and critical path estimation, [6], [7] studied repeater insertion for optimal delay. Such studies were extended to consider the impact of process variation in the ultra-deep-submicron design era [8]. All these studies [6]-[8] considered only repeater insertion, assuming the clock period is longer than the delay of the critical path. As technology keeps scaling, wire delay becomes dominant and easily exceeds the clock cycle time [9], making the insertion of FFs necessary. Targeting routing tree topology, [10] and [11] proposed concurrent FF and repeater insertion methodologies. However, no microarchitecture-level characteristics such as the structural interconnects in Section 3 were considered in either [10] or [11].
Concerning the power consumed by a large number of repeaters, [12] estimated the power for interconnect repeater insertion based on the stochastic wire length distribution [13] and studied the delay-power trade-off for minimizing repeater power. [14] studied the trend of repeater power consumption for unit wire lengths over five technology generations from 180nm to 50nm. In both [12] and [14], an over-simplified repeater model (i.e., the single model to be defined in Section 2) is used and no FF insertion is considered. In addition, neither of them considered structural interconnects, layer assignment, or cycle-accurate interconnect simulation. Furthermore, targeting buffer trees, power-efficient repeater insertion considering dual-$V_{dd}$ and dual-$V_t$ technologies is studied in [15], [16]. Such methods are orthogonal to our study. With the accurate power estimation proposed in this paper, the methods in [15], [16] can be conveniently extended to full-chip repeater power reduction.
At the microarchitecture level, [17] presents coupled system design and VLSI design for throughput optimization. However, [17] considers only buffer insertion but not FF insertion for interconnects. The initial version of this study [18] examined the power and performance impact of concurrent repeater and FF insertion at the microarchitecture level. Preliminary results in [18] showed that FF insertion lowers IPC but can improve the system throughput. [19]-[21] further developed efficient algorithms to consider the performance impact of FF insertion during floorplanning optimization. However, only IPC, but not the system throughput, was optimized in [19]-[21].
Considering interconnect layout optimization including floorplanning, layer assignment, and concurrent repeater and FF insertion, we develop in this paper a cycle-accurate microarchitecture-level power and throughput simulation and obtain accurate modeling of interconnects at the early design stage. We also apply this simulation to optimize microprocessor throughput considering interconnect pipelining and floorplanning adjustment.
The rest of this paper is organized as follows. In Section 2, we study repeater and FF insertion for individual wires. In Section 3, we study microarchitecture level interconnect power estimation and cycle-accurate power simulation with consideration of concurrent repeater and FF insertion. In Section 4, we optimize throughput considering interconnect pipelining and fioorplanning optimization. We conclude in Section 5. An extended abstract about the preliminary results of this study was published in [18] .
REPEATER AND FLIP-FLOP INSERTION

Interconnect and Device Models
In this paper, we model interconnects by the Π-type distributed RC circuit and consider multiple interconnect layers. Top layers are used for wide and long global interconnects, and bottom layers are used for short local interconnects. Between them are the layers for intermediate interconnects. For simplicity of presentation, we assume all wires are global wires in this section, and define the distinction between global and non-global wires in Section 3. We assume that a unit-length interconnect has resistance $R_w$ and capacitance $C_w$, and model an inverter by its gate capacitance, drain capacitance, and effective resistance. We represent the gate capacitance, drain capacitance, and effective output resistance of a minimum-size inverter as $C_0$, $C_p$, and $R_0$, respectively. A repeater can be a single inverter or a cascaded inverter chain.
We use the Elmore delay to calculate interconnect delay, i.e.,

$$T_d = \sum_i R_i \, C_{down,i} \qquad (1)$$

where $T_d$ is the total delay, $R_i$ is the resistance of a wire segment, and $C_{down,i}$ is the sum of the downstream capacitances of $R_i$. We consider interconnect power including dynamic power and leakage power, given by Equations (2) and (3), respectively:
where $f_{clk}$ is the clock frequency, $l$ is the wire length, $\alpha$ is the switching factor, $I_{off}$ is the unit leakage current, and $S$ is the total inverter size. Furthermore, $N_F$ is the total number of FFs, $C_F$ is the total capacitance of one FF, and $S_F$ is the total gate size of one FF. We assume 100nm technology in this paper, with parameters in Table I, where the wire widths and heights are obtained from the ITRS roadmap, $C_w$ and $R_w$ are calculated by the Berkeley Predictive Technology Model [22], $I_{off}$ is from [14], and $\alpha$ is 0.15 [23] and is fixed for logic and interconnects, except for the structural interconnects with cycle-accurate power simulation in Section 3.3. The other values are obtained from SPICE simulations.
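As a rough illustration of the delay and power models described above, the following Python sketch evaluates the Elmore delay of one driver, Π-model wire, and load stage, and a dynamic and leakage power estimate. The power expressions are an assumed form built only from the symbols defined in this section ($\alpha$, $f_{clk}$, $l$, $S$, $N_F$, $C_F$, $S_F$, $I_{off}$) plus a supply voltage `v_dd` that this section does not name; they are not claimed to be the exact Equations (2) and (3). The `Tech` container and both function names are hypothetical, and any numbers passed in are placeholders rather than Table I values.

```python
from dataclasses import dataclass


@dataclass
class Tech:
    """Technology parameters; every value supplied by the caller is a placeholder."""
    r_w: float    # wire resistance per unit length
    c_w: float    # wire capacitance per unit length
    r_0: float    # effective output resistance of a minimum-size inverter
    c_0: float    # gate capacitance of a minimum-size inverter
    c_p: float    # drain capacitance of a minimum-size inverter
    v_dd: float   # supply voltage (assumed symbol, not defined in this section)
    i_off: float  # leakage current per unit inverter size


def elmore_stage_delay(t: Tech, length: float, s_drv: float, s_load: float) -> float:
    """Elmore delay of one stage: driver inverter -> Pi-model wire -> load inverter."""
    r_drv = t.r_0 / s_drv
    r_wire = t.r_w * length
    c_wire = t.c_w * length
    c_load = t.c_0 * s_load
    # The driver resistance sees its own drain capacitance, the wire, and the load.
    d_drv = r_drv * (t.c_p * s_drv + c_wire + c_load)
    # In the Pi model, the wire resistance sees half the wire capacitance and the load.
    d_wire = r_wire * (c_wire / 2 + c_load)
    return d_drv + d_wire


def interconnect_power(t, length, total_size, n_ff, c_ff, s_ff, f_clk, alpha=0.15):
    """Return (dynamic, leakage) power under the ASSUMED forms stated in the lead-in."""
    # Switched capacitance: wire + repeaters (gate and drain) + flip-flops.
    c_switched = t.c_w * length + total_size * (t.c_0 + t.c_p) + n_ff * c_ff
    p_dynamic = alpha * f_clk * t.v_dd ** 2 * c_switched
    # Leakage scales with the total device size of repeaters and flip-flops.
    p_leakage = t.v_dd * t.i_off * (total_size + n_ff * s_ff)
    return p_dynamic, p_leakage
```

With such helpers, a repeated wire can be evaluated by summing `elmore_stage_delay` over its stages, which is the per-resistance accumulation that the Elmore formula prescribes.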
In this paper, we assume all interconnects are two-pin nets. This assumption has been used widely in the literature for high-level estimation [12], [13]. Specifically, as shown in Figure 1, we assume every interconnect has one driver and one load. Both the driver and the load are inverters of 4X the minimum inverter size. We study repeater and FF insertion for two objective functions: one is to meet the delay target with the minimum number of FFs, or min-FF; the other is to meet the delay target with the minimum total interconnect power consumption, or min-power.
Min-FF Solution
It has been assumed in [12], [14] that for repeater insertion, the input capacitance $C_{in}$ and effective resistance of each repeater are equal to $S \cdot C_0$ and $R_0 / S$, respectively, for a repeater of size $S$. Under this assumption, each repeater is a single inverter; we call this the single model. To drive a large load, a repeater may contain a chain of cascaded inverters, where $C_{in}$ of the repeater is equal to $C_0$ times the size of the first inverter in the chain. The formulas to determine $S$ and the location of each inverter along the interconnect are presented in [12] and [14]. We call this type of repeater a cascaded repeater. An inverter in a cascaded repeater is a stage, and the size ratio between two consecutive inverters is the stage ratio. In addition, we consider a hybrid model where the first repeater is a chain of cascaded inverters, but the rest are single inverters. In the hybrid model, the cascaded repeater is placed at the beginning of the interconnect, and the locations of the other single inverters can be calculated based on the formulas used in the single model. The hybrid model may lead to a good solution when the inverter in the last stage of the first repeater is large enough to drive the rest of the single repeaters. We illustrate the three repeater insertion models in Figure 2.
We study the power optimization problem under a given delay target for interconnects. The existing analytical repeater insertion methods [12], [14] Table II shows our experimental results from all three models discussed above. We use wire lengths of 4mm, 8mm, and 10mm, and clock frequencies of 1GHz, 2GHz, and 3GHz. We assume that the delay target is 80% of the clock period. No FF insertion is needed for wires up to 10mm and 4mm for 1GHz and 2GHz clock frequencies, respectively (see highlights in the table). Among these cases, the hybrid model achieves up to 15.09% power reduction compared with the single model. The hybrid model also has the smallest number of FFs for the same wire and delay target. This is further illustrated in Table III. For a given target delay, the longest wire without FF insertion in the hybrid model can be 1.5X that in the single model.
Min-Power Solution
Although the hybrid model provides lower power consumption for the same wire length, FF number, and clock frequency, we also observe from Table II that the single model with more FFs actually has lower power consumption than the hybrid model with fewer FFs. The reason is that for all repeater insertion models, the resulting power consumption is super-linear with respect to the wire length, as shown in Figure 3: when the wire length increases by 4X from 1mm to 4mm, the power consumption increases by more than 7X. Therefore, instead of inserting FFs merely to meet the delay target, we can reduce power by aggressively inserting more FFs. The min-power solution finds the concurrent repeater and FF insertion with minimum power and a delay no greater than the delay target. Again, we use enumeration to find the min-power solution. We enumerate a range of reasonable FF numbers. For each number, we find the repeater insertion solution as discussed before. Finally, we choose the solution with the minimum total power. We present the results under min-power FF insertion and the hybrid repeater model in Table II. The min-power method can reduce the interconnect power by up to 40.39% compared with the min-FF method.
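The enumeration over FF numbers described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses uniformly sized single-inverter repeaters rather than the hybrid model, takes the driver and load of each stage to be the same size as the repeaters, and relies on an assumed power form. All constants are hypothetical placeholders (not Table I values), and every function name is invented for this example.

```python
# Hypothetical placeholder parameters (NOT the paper's Table I values).
R_W, C_W = 100.0, 0.2e-12            # wire resistance (ohm/mm), capacitance (F/mm)
R_0, C_0, C_P = 10e3, 1e-15, 1e-15   # minimum-size inverter R, gate C, drain C
V_DD, I_OFF = 1.0, 50e-9             # supply voltage (V), unit leakage current (A)
C_FF, S_FF = 20e-15, 20.0            # capacitance and gate size of one FF
ALPHA = 0.15                         # switching factor


def segment_delay(length, n_rep, size):
    """Elmore delay of a wire segment split into n_rep + 1 equal stages."""
    stages = n_rep + 1
    seg = length / stages
    c_load = C_0 * size
    drv = (R_0 / size) * (C_P * size + C_W * seg + c_load)
    wire = R_W * seg * (C_W * seg / 2 + c_load)
    return stages * (drv + wire)


def segment_power(length, n_rep, size, f_clk):
    """Assumed dynamic + leakage power of the wire and its repeaters."""
    total_size = (n_rep + 1) * size
    c_switched = C_W * length + total_size * (C_0 + C_P)
    return ALPHA * f_clk * V_DD ** 2 * c_switched + V_DD * I_OFF * total_size


def best_repeaters(length, target, f_clk, max_rep=20, sizes=range(10, 201, 10)):
    """Lowest-power (power, n_rep, size) meeting the delay target, or None."""
    best = None
    for n_rep in range(max_rep + 1):
        for size in sizes:
            if segment_delay(length, n_rep, size) <= target:
                p = segment_power(length, n_rep, size, f_clk)
                if best is None or p < best[0]:
                    best = (p, n_rep, size)
    return best


def enumerate_ff_counts(length, f_clk, max_ff=8):
    """Return [(n_ff, total_power)] for every feasible FF count."""
    target = 0.8 / f_clk  # delay target: 80% of the clock period
    results = []
    for n_ff in range(max_ff + 1):
        sol = best_repeaters(length / (n_ff + 1), target, f_clk)
        if sol is None:
            continue  # this FF count cannot meet the delay target
        p_ff = n_ff * (ALPHA * f_clk * V_DD ** 2 * C_FF + V_DD * I_OFF * S_FF)
        results.append((n_ff, (n_ff + 1) * sol[0] + p_ff))
    return results


def min_ff_solution(results):
    """Feasible solution with the fewest FFs."""
    return min(results, key=lambda r: r[0])


def min_power_solution(results):
    """Feasible solution with the lowest total power."""
    return min(results, key=lambda r: r[1])
```

Calling `enumerate_ff_counts(10.0, 3e9)` and comparing `min_ff_solution` with `min_power_solution` shows how the two objectives can select different FF counts for the same wire; the actual gap depends entirely on the placeholder parameters.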
However, the effectiveness of the min-power method should not be over-emphasized, because it depends on the specific interconnect length distribution of an individual design. For some designs, the power reduction of the min-power method over the min-FF method may be small, as in our example in Section 3.2.
Runtime Reduction
In our implementations, we use a lookup 
MICROARCHITECTURE LEVEL INTERCONNECT POWER ESTIMATION AND CYCLE-ACCURATE SIMULATION
In this section, we refine interconnect power modeling as follows. We first assume purely stochastic interconnects and a fixed switching factor, perform layer assignment, and develop type I interconnect power estimation. Then we introduce the concepts of random interconnects and structural interconnects, and develop type II interconnect power estimation. Finally, we consider accurate activity rates for interconnects based on cycle-accurate simulation, and develop type III interconnect power estimation, which is also called power simulation. As we have already seen in Section 2, the hybrid model achieves the lowest power and the fewest FFs compared with the other two models. In this section and the rest of this paper, we use only the hybrid model for interconnect power estimation unless specified otherwise.
Power Estimation with Stochastic Interconnects
Interconnects are routed in different metal layers for routability and performance optimization, and layer assignment has a significant impact on power estimation. In our layer assignment, we assume the top two layers are used for global interconnects. We further assume that on these two layers, 50% of the area is used by power/ground and clock routing. Therefore, the total area occupied by all global interconnects is $50\% \times 2 \times \text{Chip\_size}$, and the minimum length of global interconnects $l_{gmin}$ satisfies Equation (4):

$$50\% \times 2 \times \text{Chip\_size} = \text{Global\_pitch\_width} \times \int_{l_{gmin}}^{l_{max}} l \cdot i(l) \, dl \qquad (4)$$

where $l_{max}$ is the maximum length of interconnects, equal to $2\sqrt{N}$ with $N$ being the total number of gates on the chip, and $i(l)$ is the length density function. $l_{gmin}$ can be used as the length boundary between global and intermediate interconnects.
Similarly, we find the length boundary between intermediate and local interconnects, $l_{mmin}$, by Equation (5):

$$\text{Layer\_number} \times \text{Chip\_size} = \text{Intermediate\_pitch\_width} \times \int_{l_{mmin}}^{l_{gmin}} l \cdot i(l) \, dl \qquad (5)$$

where Layer_number is the number of intermediate layers, and the area utilization rate is 100% for the intermediate layers. We assume Layer_number is an even number, and keep increasing Layer_number until the interconnects with the length of $l_{mmin}$ can meet the delay target without repeater insertion. Interconnects with length less than $l_{mmin}$ are local interconnects and are assigned to local layers.
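The length boundaries defined by the two area-balance conditions above have no closed form for a general density function, so they would typically be found numerically. The sketch below is one way to do this with trapezoidal integration and bisection. The function names and the `density` callable are hypothetical, the caller must use consistent units for lengths, pitch, and area, and the stochastic distribution of [13] is not reproduced here.

```python
def wire_area(density, pitch, lo, hi, steps=2000):
    """pitch * integral from lo to hi of l * i(l) dl, by the trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.5 * (lo * density(lo) + hi * density(hi))
    for k in range(1, steps):
        l = lo + k * h
        total += l * density(l)
    return pitch * total * h


def length_boundary(density, pitch, area_budget, l_min, l_hi, tol=1e-6):
    """Smallest l_b in [l_min, l_hi] whose wires of length l_b..l_hi fit area_budget."""
    if wire_area(density, pitch, l_min, l_hi) <= area_budget:
        return l_min  # every wire in the range fits on these layers
    lo, hi = l_min, l_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if wire_area(density, pitch, mid, l_hi) > area_budget:
            lo = mid  # too much wire area: raise the boundary
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, `l_gmin = length_boundary(i, global_pitch, 0.5 * 2 * chip_size, 1.0, l_max)` followed by `l_mmin = length_boundary(i, intermediate_pitch, layer_number * chip_size, 1.0, l_gmin)` mirrors the two-step assignment, with `i`, the pitches, and `chip_size` supplied by the user.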
We obtain the chip size from ITRS and obtain the chip area for random logic by subtracting the cache area from the total chip area. For type I interconnect power estimation, we use the length density function $i(l)$ from the stochastic length distribution methodology [13] Table IV. For the min-FF and min-power methods, the system clock frequency is 3GHz and we assume the interconnect delay target is about 80% of the clock period. There is no delay target for the min-delay method, as the minimum interconnect delay depends on the interconnect length. Figure 5 shows the type I interconnect power calculated by the three different repeater and FF insertion solutions. In the first solution, repeaters are inserted for minimum delay, or min-delay, i.e., we insert repeaters as long as doing so reduces delay, and we do not insert any FFs. The power reduction from the min-power method mainly comes from the reduced repeater area. We define one equivalent repeater as one minimum-size inverter. A repeater with total size $S$ can be mapped to $S$ equivalent repeaters. For any repeater and FF insertion solution, the total power is determined by the total wire capacitance and the numbers of equivalent repeaters and FFs. Table V shows the total number of equivalent repeaters and FFs for all three solutions. Note that in the min-delay method we do not insert any FFs, and there is no guarantee that the delay target for min-FF and min-power can be satisfied in the min-delay method. From Table V we can see that the min-FF and min-power solutions reduce the number of equivalent repeaters by 3.40X and 8.23X, respectively, compared with the min-delay solution. Although the number of FFs in the min-power solution is almost 8X that in the min-FF solution, the min-power solution still saves 14.03% power, as it reduces the number of equivalent repeaters by 58.76%.
Power Estimation with Structural Interconnects
Stochastic interconnect distribution is assumed in [11] , [12] , [14] and in our type I interconnect power estimation. However, major components in a system-on-a-chip are often connected by varieties of busses that can be modeled accurately. To capture this, we introduce the concepts of random interconnects and structural interconnects. The random interconnects are interconnects inside each module and can be calculated by the same stochastic model as in type I interconnect power estimation. The structural interconnects are address and data busses between related modules, and their lengths are decided by the floorplan of the layout.
We consider high-performance SuperScalar processors, and summarize the configuration of the processors under study in Table VI. Based on the die photo of the MIPS R10000 microprocessor [24], we first design the floorplan without an L2 cache, and then incorporate an L2 cache into the floorplan according to the appropriate area ratio between the L2 cache and the other modules, as shown in Figure 6. We measure the lengths of busses according to the Manhattan distances between the centers of the modules connected by the busses. Table VIII shows the bit-widths and lengths for all busses.
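The bus length measurement described above, the Manhattan distance between module centers, is simple to compute from a floorplan. The sketch below shows it for rectangular modules; the `Module` class, the module names, and the coordinates in the usage note are hypothetical and are not taken from Figure 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Rectangular floorplan block: lower-left corner (x, y), width w, height h."""
    name: str
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def bus_length(a: Module, b: Module) -> float:
    """Manhattan distance between the centers of two modules."""
    (ax, ay), (bx, by) = a.center, b.center
    return abs(ax - bx) + abs(ay - by)
```

For instance, `bus_length(Module("LSQ", 0, 0, 2, 2), Module("L1D", 4, 1, 2, 4))` measures the distance between centers (1, 1) and (5, 3) in whatever length unit the floorplan uses.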
The number of long interconnects is reduced with the introduction of structural interconnects. Therefore, in type II interconnect power estimation, we need to re-calculate the overall wire length distribution and layer assignment. The interconnect density function $i(l)$ for a system is now the sum of the interconnect density functions of all modules and busses, given by Equation (6):

$$i(l) = \sum_k i_k(l) \qquad (6)$$

where subscript $k$ iterates over all modules and busses. Using the same number of layers as in Table IV, the new length boundaries with consideration of structural interconnects are shown in Table IX. Considering the new layer assignment, we apply the power estimation method based on the stochastic length distribution to each module independently and obtain the interconnect power for each module (see Table VII). We also apply concurrent repeater and FF insertion to obtain the interconnect power for busses (see Table VIII). In Table X, adding power for 
Cycle-accurate Power Simulation
To obtain accurate activity for interconnects, we further incorporate our interconnect power models with concurrent repeater and FF insertion into the sim-outorder simulator of the SimpleScalar toolset [2]. We perform the following cycle-by-cycle simulation: if a module is accessed, we count its active (dynamic + leakage) interconnect power; otherwise, we count only its leakage power. For each bus, on the other hand, we count the number of bit-line transitions every cycle. The dynamic power in that cycle equals the number of transitions times the dynamic switching power per bus bit-line. Note that the dynamic switching power is the full switching power.

We run simulations for a total of seven SPEC 2000 benchmarks: bzip2, gcc, gzip, mcf, parser, mesa, and equake. Among them, mesa and equake are floating-point benchmarks, while the rest are integer benchmarks. During each simulation, the benchmark is first fast-forwarded by 10 million instructions to avoid the startup effect, and is then simulated for 10 million instructions.
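The per-cycle accounting described above can be sketched in a few lines. This is an illustrative stand-alone version, not the code integrated into sim-outorder: the function names are hypothetical, bus values are represented as Python integers, and the per-line switching power is passed in by the caller.

```python
def bit_transitions(prev: int, curr: int, width: int) -> int:
    """Number of bus bit-lines that toggle between two consecutive cycles."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def bus_dynamic_power_trace(values, width, p_switch_per_line):
    """Per-cycle dynamic power: transitions times switching power per bit-line."""
    powers = []
    prev = values[0]
    for curr in values[1:]:
        powers.append(bit_transitions(prev, curr, width) * p_switch_per_line)
        prev = curr
    return powers


def module_power(accessed: bool, p_dynamic: float, p_leakage: float) -> float:
    """Active (dynamic + leakage) power if accessed this cycle, else leakage only."""
    return p_dynamic + p_leakage if accessed else p_leakage
```

Summing `module_power` over all modules and `bus_dynamic_power_trace` over all busses, cycle by cycle, gives the activity-dependent interconnect power, in contrast to scaling every wire by a fixed switching factor.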
THROUGHPUT MAXIMIZATION CONSIDERING INTERCONNECT PIPELINING AND FLOORPLANNING OPTIMIZATION
In this section, we optimize throughput using BIPS (billion instructions per second) as the metric. We refer to interconnects with inserted FFs as pipelined interconnects, and compare pipelined interconnects and logic gates under voltage scaling. Then, we introduce throughput maximization by optimizing the clock frequency and the floorplan, respectively.
Throughput Metric and Voltage Scaling
Our metric for throughput optimization is BIPS (Billion Instructions Per Second), defined as:

$$BIPS = IPC \times f_{clk}$$

with $f_{clk}$ expressed in GHz. However, the delay of pipelined interconnects behaves differently with respect to $V_{dd}$ scaling. Figure 7 plots the normalized delays for logic gates and pipelined interconnects. When $V_{dd}$ increases, the delay of gates decreases much faster than that of pipelined interconnects, and the gap between them grows. Since we scale $V_{dd}$ according to the gate delay, the pipelined interconnects cannot sustain the same clock frequency increase as the logic gates and modules. Therefore, we have to re-design repeater and FF insertion for interconnects in order to obtain the increase of clock frequency determined by the logic modules.
Throughput Maximization by Clock Frequency Scaling
With FF insertion, IPC and clock frequency are no longer independent of each other. For a given microarchitecture and floorplan, a higher clock frequency requires that more FFs be inserted, which in turn degrades IPC. Therefore, there may exist an optimal clock frequency that maximizes throughput, at which the clock frequency and IPC are well balanced.
We study throughput maximization for the same microarchitecture and floorplan as in Section 3. We evaluate BIPS with respect to clock frequencies between 2GHz and 4.5GHz. For each clock frequency, we first obtain the min-FF solution for concurrent repeater and FF insertion, then modify SimpleScalar according to the resulting FF insertion, and report the simulated IPC and BIPS in Figure 8. For all benchmarks, when the clock frequency increases from 2GHz to 3GHz, BIPS keeps increasing due to the increased clock frequency, although IPC slightly decreases. When the clock frequency exceeds 3GHz, IPC decreases severely due to FFs inserted on critical paths, such as the data busses between LSQ and the L1 d-cache. As a result, BIPS does not improve even when the clock frequency is increased up to 4.5GHz. Figure 8 clearly shows that there exists an optimal clock frequency for BIPS maximization for a given microarchitecture and floorplan, and this clock frequency is 3GHz in our example.
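The frequency sweep described above amounts to evaluating the product of simulated IPC and clock frequency at each candidate frequency and keeping the maximum. A minimal sketch follows; the function names are hypothetical, and the IPC values in the usage note are made-up placeholders chosen only to mimic the qualitative trend, not data from Figure 8.

```python
def bips(ipc: float, f_ghz: float) -> float:
    """Throughput in billion instructions per second for a clock given in GHz."""
    return ipc * f_ghz


def best_frequency(ipc_by_freq):
    """Return (f_ghz, ipc) maximizing BIPS over a {f_ghz: simulated_ipc} mapping."""
    return max(ipc_by_freq.items(), key=lambda kv: bips(kv[1], kv[0]))


# Placeholder IPC values (NOT measured results): IPC drops sharply past 3GHz.
example_ipc = {2.0: 1.50, 2.5: 1.45, 3.0: 1.40, 3.5: 1.00, 4.0: 0.95, 4.5: 0.90}
```

With these placeholder values, `best_frequency(example_ipc)` selects the 3.0GHz entry, because the IPC loss beyond that point outweighs the gain in clock frequency.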
Throughput Maximization by Floorplanning Optimization
Floorplanning directly affects the lengths of structural interconnects and, in turn, the interconnect pipelining solution. By adjusting the floorplan, we may reduce interconnect pipeline stages for better IPC and BIPS. Because Figure 8 shows a severe IPC degradation when the clock frequency increases from 3GHz to 3.5GHz, we target the 3.5GHz clock frequency for adjusting the microprocessor floorplan. Figure 9 presents two floorplans, A and B, for the SuperScalar processors we study. Floorplan A is the same as that in Figure 6, and floorplan B is the new floorplan optimized for IPC-critical interconnects.
The differences between them are highlighted in the figure and include: (i) we move LSQ closer to the L1 d-cache and eliminate one FF between them; and (ii) we distribute the four integer function units, removing one FF between RUU and IALU1/IALU2 but introducing one extra FF between RUU and IMULT. Because multiplication and division take much longer than addition, IMULT has a much larger latency than IALU. Intuitively, the IPC gain of IALU1/IALU2 outweighs the IPC loss of IMULT. For a similar reason, we exchange the locations of FALU and FPMULT such that FALU is closer to RUU but FPMULT is farther away.
As shown in Figure 10, floorplan B, optimized for IPC-critical interconnects, increases IPC (as well as BIPS) by 23.49%. Although the IPC improvement is significant, floorplan B reduces the total structural interconnect length, the objective minimized in conventional floorplanning, by only 5%. Therefore, in the presence of interconnect pipelining, floorplanning should consider both the conventional objective of minimizing total interconnect length and the new objective of maximizing IPC.
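To make the contrast between the two objectives concrete, the sketch below compares two floorplans by total bus length and by the number of FFs per bus. It assumes a simple model in which a wire can travel at most `reach_mm` within one clock cycle, so a bus needs one FF for each additional reach it spans; this reach-based rule, the function names, and all bus lengths are hypothetical and are not derived from Figure 9 or Table VIII.

```python
import math


def ffs_on_bus(length_mm: float, reach_mm: float) -> int:
    """FFs needed so that no pipeline segment exceeds the per-cycle reach (assumed model)."""
    return max(0, math.ceil(length_mm / reach_mm) - 1)


def summarize(floorplan, reach_mm):
    """Return (total bus length, {bus: FF count}) for a {bus: length_mm} mapping."""
    total = sum(floorplan.values())
    ffs = {bus: ffs_on_bus(length, reach_mm) for bus, length in floorplan.items()}
    return total, ffs


# Placeholder bus lengths in mm (NOT the paper's data).
plan_a = {"LSQ-L1D": 5.0, "RUU-IALU": 4.5, "RUU-IMULT": 3.0}
plan_b = {"LSQ-L1D": 3.5, "RUU-IALU": 3.8, "RUU-IMULT": 4.6}
```

Calling `summarize(plan_a, 4.0)` and `summarize(plan_b, 4.0)` shows the total length changing only slightly while FFs move off the two IPC-critical busses and onto the less critical one, which is the kind of trade a length-only objective cannot see.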
CONCLUSIONS AND DISCUSSIONS
Considering structural interconnects, layer assignment, and concurrent repeater and FF (flip-flop) insertion, we have developed cycle-accurate microarchitecture-level power and throughput simulations and obtained accurate modeling of interconnects at the early design stage. Experiments have shown that the simulation reduces over-estimation by up to 2.24X compared to the conventional power estimation based on purely stochastic interconnects and a fixed switching factor. Given such a difference, cycle-accurate simulation becomes a necessity for validating microarchitecture innovations for power optimization.
With the presence of pipelined interconnects, we have shown that throughput is not always higher for an increased clock frequency, and there exists an optimal clock frequency to maximize throughput for a given microarchitecture and floorplan. We have illustrated that floorplanning optimized for IPC (instructions per cycle)-critical interconnects has little effect on the total interconnect length but it improves throughput by 23.49%. Therefore, future floorplanning optimization should consider both the conventional objective of minimizing total interconnect length and the new objective of maximizing IPC.
As FF insertion becomes necessary to achieve the clock rate specified by ITRS, we conclude that the traditional design flow of optimizing IPC and clock rate separately is no longer valid, and coupled microarchitecture and layout optimization may improve both power efficiency and throughput. Such co-optimization has been further studied in recent work on automatic floorplanning optimization with interconnect pipelining [25]-[27].
In this paper, we assume two-pin interconnects. A similar assumption has been used extensively for early-stage estimation [12], [14]. In the future, we will extend our study to consider multi-pin interconnects in a fashion similar to [11].
