Abstract-The growing speed gap between transistors and wire interconnects is forcing the development of distributed, or clustered, architectures. These designs partition the chip into small regions with fast intracluster communication. Longer latency is required to communicate between clusters. The hardware and/or software are responsible for scheduling instructions to clusters such that critical path communication occurs within a cluster. This paper presents GENEric SYstems Simulator (GENESYS), a technology modeling tool that captures a broad range of materials, device, circuit, and interconnect parameters across current and future semiconductor technology. This tool is used to explore the relationship between key technology parameters (intercluster wire delay and transistor switching delay) and key architecture parameters (superscalar versus multithreaded instruction dispatch, and value prediction support). GENESYS is used to predict intercluster latencies as VLSI technology advances. The study provides quantitative data showing how conventional superscalar performance is degraded with increasing wire latency. Threaded designs are more tolerant to wire delay. Optimal thread size changes with advancing VLSI technology, suggesting a highly adaptive architecture. Value prediction is shown to be useful in all cases, but provides more benefit to the multithreaded design.
optimization routines based on physical laws and empirical data to predict key technology metrics. Assuming clock frequency continues to advance at historic rates, the tool is used to estimate the number of clock cycles needed to move data between clusters. By 2014, over six clock cycles will be needed for intercluster communication.
A simulator is used to explore performance limits imposed by increasing wire delay while varying two architectural parameters: the way in which instructions are distributed to functional units, and the use of value prediction. Two different instruction distribution algorithms are considered: 1) a conventional (superscalar) approach and 2) a distributed (multithreaded) approach. The study shows that the conventional approach offers substantially higher performance without wire delay. As wire delay increases, performance of the conventional processor degrades quickly. The multithreaded design is more tolerant to communication delay. Results show that this approach is preferred when wire delay between clusters exceeds one clock cycle, which is projected to occur by 2003.
Value prediction [18] is shown to be useful in all cases, but is more important for multithreaded designs in that it can break critical path interthread data dependencies. This study indicates that the optimal architecture varies with implementation. It suggests an architecture that allows for easily adapting key architecture parameters to the implementation technology.
A. Related Work
There is considerable published research proposing architectures for future technologies [3], [23], [25], [26], [31], [32]. This work often focuses on the performance of a specific architecture, without fully exploring the interaction between varying architectural configurations and the capabilities and limitations of a technology. It provides an in-depth understanding of an architectural approach as a data point, but offers little insight into the architecture's interactions with the implementation technology.
Recent work has begun to illuminate the broader relationship between architectures and future VLSI technologies. A relationship between processor complexity and cycle time is defined in [15] and [24]; [13] considers the balance between cache memory and processor resources for a traditional microprocessor architecture; [14] examines optimal architectures in 0.35-μm technology; and [6] studies single-chip multiprocessor configurations in 0.1-μm technology. This research builds on this work by examining the performance limits due to intercluster wire delay across a range of VLSI technology. This paper is divided into six sections. Section II describes the GENESYS tool and modeling. Section III discusses the impact of advancing VLSI technology. Section IV details the architectural variations studied. Section V presents simulation results, and Section VI provides conclusions.
II. GENESYS

A. Gigascale Integration Hierarchy
GENESYS traces its inception to the thesis that future opportunities for high-performance/low-power computing for gigascale integration (GSI) are governed by a hierarchy of limits, both practical and theoretical [20]. This hierarchy is properly defined in increasing order of complexity as: 1) fundamental; 2) material; 3) device; 4) circuit; and 5) system. The fundamental limits are derived from first principles and form the foundation of the hierarchy. These limits are independent of any other level of the hierarchy and cannot be overcome. A clear example is the time-of-flight limit imposed by the speed of light in free space. The material limits build upon the fundamental limits, but are independent of the higher levels. The relative dielectric constant of the insulator material is a key material property, as it not only modifies the time-of-flight but also has implications at every other level of the hierarchy. The device, circuit, and system levels of the hierarchy build similarly. These five primary modeling regimes are incorporated into GENESYS, allowing the simulator to project physical system performance through a key set of output parameters.
B. Introduction to GENESYS
GENESYS is designed to project future trends and performance for gigascale systems (one billion+ transistors) by applying a core set of analytical and physical models derived from first principles and established empirical knowledge to a set of input parameters spanning all levels of the GSI hierarchy. Through this set of models and parameters GENESYS assimilates the structure of the hierarchy of limits. Fig. 1 illustrates the conceptual organization of the simulator.
A generic system model incorporated into the simulator allows the systematic exploration of tradeoffs made at any level of the hierarchy, such as different material systems or interconnect architectures. GENESYS produces a richly detailed report on the characteristics of a system for each level of the hierarchy, but of primary concern are the key system-level metrics: die size, average power dissipation, and clock speed. These three primary output parameters represent the most globally observable metrics of chip performance. A detailed description of GENESYS is provided in the following subsections. Additional details are found in [11].
C. Fundamental and Material Modeling
Before GENESYS can begin making projections regarding the system-level metrics, the fundamental and material level parameters must be addressed. The fundamental parameters are primarily defined in a set of physical constants (speed of light, permittivity of free space, Boltzmann's constant, etc.), which cannot be redefined. Therefore, the set of fundamental inputs adjustable by the user is limited to the ambient temperature and the thermal noise factor [21]. The material parameters of interest with regard to the interconnect structure are the resistivity of the interconnect metal and the relative dielectric constants of the inter/intra level interconnect insulator material. The semiconductor, gate, and gate dielectric materials are currently restricted to silicon, n+/p+ polysilicon, and SiO$_2$, respectively. Therefore, the electron/hole mobility, gate-channel work function, and gate dielectric constant are already factored in. The interconnect material parameters are of special consequence as they have a direct impact on the resulting circuit/system level output parameters. The distributed resistance-capacitance product for an interconnect is directly proportional to the product of the resistivity and the relative permittivity. This directly impacts the resulting propagation delay for an interconnect of a given length, which in turn impacts the maximum possible clock frequency. Additionally, the capacitance (directly proportional to the relative permittivity) impacts the switching energy and average power dissipation.
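The proportionality between the distributed RC product and the resistivity-permittivity product can be illustrated with a short sketch. It is not the GENESYS model: it assumes a parallel-plate capacitance to a ground plane (no fringing or coupling) and the commonly quoted 0.38 RC approximation for the 50% delay of a distributed RC line. The function name, the wire dimensions, and the low-k permittivity of 2.0 are hypothetical.

```python
EPS0 = 8.854e-12  # permittivity of free space, F/m


def rc_delay(resistivity, eps_r, length, metal_height, dielectric_thickness):
    """Approximate 50% delay (s) of a distributed RC wire (simplified sketch).

    Assumed parallel-plate model, per unit length:
        r = resistivity / (W * H),  c = eps_r * EPS0 * W / T
    so the wire width W cancels in the product r * c.
    """
    rc_per_length_sq = resistivity * eps_r * EPS0 / (
        metal_height * dielectric_thickness
    )
    return 0.38 * rc_per_length_sq * length ** 2


# Illustrative inputs: bulk resistivity (ohm-m), relative permittivity,
# a 1-cm wire, 1-um metal height, 1-um dielectric thickness.
al_sio2 = rc_delay(2.65e-8, 3.9, 1e-2, 1e-6, 1e-6)  # aluminum with SiO2
cu_lowk = rc_delay(1.68e-8, 2.0, 1e-2, 1e-6, 1e-6)  # copper, assumed low-k
print(f"delay ratio (Cu/low-k vs Al/SiO2): {cu_lowk / al_sio2:.2f}")
```

Under these assumptions the delay ratio between two material systems depends only on the ratio of their resistivity-permittivity products, and the delay grows with the square of the wire length.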
D. Device and Circuit Modeling
At the device level, GENESYS assumes that the structure under consideration is that of a bulk-CMOS device. The key description of the CMOS device consists of: minimum feature size, gate oxide thickness, desired threshold voltage, supply voltage, and choice of device model. The minimum feature size, the gate length, is the primary technology descriptor (1999 ITRS technology generations are commonly referred to by feature size). The feature size has a profound effect on the circuit and system level output parameters as it directly impacts the die size. Additionally, because the minimum feature size is generally limited by lithography, the concept extends to a limit on the minimum interconnect pitch (width + spacing). Thinner gate oxides result in higher gate capacitance and higher drive current, yielding higher power dissipation. Even though the dielectric material is limited to the native oxide, high dielectric gate materials may be simulated by using the equivalent oxide thickness for the new material. The channel doping and effective mobility (doping dependent) are calculated at the device level. There are two models for CMOS devices available in GENESYS: the alpha-power law model [27] and the transregional MOSFET model [1]. The alpha-power law model is a well-established empirical model used for calculating drain current for use in delay modeling. The transregional MOSFET model is derived from first principles and uses no empirical parameters. Both device models calculate the saturation drain current for use in estimating the gate delay.
The circuit level inputs consist primarily of empirical gate parameters related to the physical layout of the logic gates: the ratio of pFET width to nFET width (normally set to give equal rise/fall times), the layout area of a minimum size CMOS inverter in terms of square feature sizes, the average fan-in/fan-out of a random logic gate, and the average width to length ($W/L$) ratio of the transistors. These empirical parameters are utilized to determine the layout area of a canonical random logic gate (assumed to be a 3-input NAND gate). The propagation delay of a random logic gate is calculated as the pull-down delay through the three series-connected nFETs. The gate delay model includes the switching delay of the driving MOSFETs, the RC delay of the interconnect, time-of-flight, and load delay. This model is illustrated in Fig. 2. Because the gate delay model includes the interconnect parasitics, the gate delay cannot be calculated until the interconnect length and dimensions are known. Therefore, the random logic gate delay is determined after performing interconnect related calculations/optimizations. The gate power dissipation is calculated as the sum of static (leakage) and dynamic (switching) power. Dynamic power is modeled via the standard $CV^2f$ metric, with static power determined from the off current (subthreshold) multiplied by the supply voltage. These circuit level calculations provide the first cut estimation of the three primary outputs (die size, power, speed) on a per gate basis.
E. System Level Modeling
In order to make estimates of total system performance based on the modeling at the fundamental, material, device, and circuit levels of the hierarchy, GENESYS makes use of a generic system model of an ASIC/microprocessor in which there are two possible critical paths: random logic and long-distance interconnect.
The assumption made regarding the random logic is that the computational logic on the chip is represented as a single random logic network (GENESYS is concerned only with the electrical characterization at this point). The speed and power dissipation within the random logic network is characterized via a chain of gates model built from the circuit level gate modeling. The random logic critical path model is shown in Fig. 3 .
The key parameters for characterizing the random logic critical path are the logic depth (number of gate delays between latched clock elements) and the average length of an interconnect in the logic network. The logic depth is an input parameter supplied by the user, and the average interconnect length is calculated at the system level (this calculation is discussed later). Once the average interconnect length is known, the circuit level gate delay is calculated and the random logic critical path is computed via the following:
$$T_{logic} = (n_{cp} + n_{latch})\, t_{gate}$$
where $n_{cp}$ is the random logic depth, $n_{latch}$ is the number of gate delays through a latch, and $t_{gate}$ is the circuit level gate delay. Here, the random logic delay is modeled as the delay through a latch and a chain of random logic gates to the input of the next latch. This represents the minimum possible clock delay, as data must be captured at the next storage element before the next clock pulse.
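The random logic critical path described above (a latch followed by a chain of gates) can be sketched in a few lines. This is an illustrative reading of the text, not GENESYS code; the function name, parameter names, and the example values (a logic depth of 20, two gate delays per latch, a 20-ps gate delay) are assumptions.

```python
def random_logic_delay(logic_depth, latch_gate_delays, gate_delay_s):
    """Delay (s) through one latch plus a chain of random logic gates.

    logic_depth        -- gate delays between latched clock elements
    latch_gate_delays  -- gate delays contributed by the latch itself
    gate_delay_s       -- circuit level gate delay, including wire parasitics
    """
    return (logic_depth + latch_gate_delays) * gate_delay_s


# Hypothetical inputs, chosen only to show the arithmetic.
delay = random_logic_delay(logic_depth=20, latch_gate_delays=2, gate_delay_s=20e-12)
print(f"random logic critical path: {delay * 1e9:.2f} ns")
```

Because the gate delay already includes the average interconnect parasitics, this value can only be evaluated after the interconnect calculations are complete.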
The other possible critical path is the delay through the longest distance interconnect (LDI) in the random logic network. If single cycle across chip communication is required, the delay of long distance wires cannot be ignored. The longest interconnect length is assumed to be proportional to twice the chip edge length (the Manhattan distance between the opposite corners of the chip). The key parameters governing the delay of the LDI are the interconnect dimensions, which determine the associated parasitics (resistance and capacitance), and the choice of delay model. GENESYS supports two delay models for LDIs: the single driver model and the repeater driven model. The single driver model assumes an unbroken distributed RC network driven by a gate sized according to the size of the interconnect capacitance relative to the logic gate capacitance. The repeater delay model assumes optimal repeater placement, with models for calculating the number and size ($W/L$ ratio) of repeaters inserted into a wire. The delay is then calculated for a section of line between repeaters and summed for the total LDI delay.
A key feature of GENESYS impacting the cycle time calculations for both the random logic and interconnect is the modeling of a multilevel/multitier interconnect architecture with a stochastic interconnect length distribution. In order to maintain short delays across the wiring distribution from short to long wires, the sizing of the wires must be adjusted to reduce parasitics. This is typically accomplished by increasing the cross-sectional area of the interconnect to reduce the resistance.
GENESYS assumes that the wiring levels are organized hierarchically with the longest global wires occupying the upper levels and the shortest wires on the lower levels. Therefore, a typical wiring architecture would have "fat" wires on the upper levels, intermediate sized wires on the midlevels, and minimum pitch wires on the lowest levels. Interconnect levels are organized into tiers according to their dimensions. Interconnect levels with wires of identical dimensions reside in the same wiring tier. Therefore, a wiring tier may be comprised of multiple levels. An interconnect tier is fully described by several key parameters: the number of wiring levels and the critical interconnect dimensions (width, spacing, height, dielectric thickness between levels, and the dielectric constant(s) of the insulator material(s)).
The other critical component of the interconnect modeling in GENESYS is a stochastic interconnect distribution for evaluating the routing requirements of the random logic network with regard to the available supply (area). The routing requirements are calculated for each interconnect level via the Davis distribution [9]. This distribution is used to calculate the total length of interconnect necessary to route the chip. The parameters of interest are the number of gates, the internal Rent parameters [17], and the gate pitch (side length of the square layout area available per gate as determined at the circuit level). The Davis distribution predicts the number of interconnects over a range of interconnect lengths. By integrating the product of interconnect length and distribution, the demand is found for any range of interconnect lengths. Additionally, the average interconnect length used to calculate the random logic gate delay and power dissipation is found from the Davis distribution via the following formula:
$$L_{avg} = \frac{\int l\, i(l)\, dl}{\int i(l)\, dl}$$
where $l$ is length and $i(l)$ is the stochastic wire-length distribution, with both integrals taken over the full range of interconnect lengths. The interconnect demand is returned in terms of the gate pitch. The final value of the gate pitch is determined either by the routing requirements or the minimum gate layout area. A comparison of the stochastic distribution against real data from [8] is provided in Fig. 4 from [9]. The stochastic distribution agrees with the actual data much better than a simple distribution derived directly from Rent's rule parameters.
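The average interconnect length is the first moment of the wire-length distribution divided by its total count. The sketch below evaluates that ratio numerically with the trapezoidal rule. It does not implement the Davis distribution; the power-law density used in the example is a hypothetical stand-in, and the function name is an assumption.

```python
def average_length(lengths, density):
    """Trapezoidal estimate of (integral of l*i(l) dl) / (integral of i(l) dl).

    lengths -- increasing sample points, in gate pitches
    density -- wire-length distribution i(l) sampled at those points
    """
    weighted = 0.0  # running integral of l * i(l)
    total = 0.0     # running integral of i(l)
    for k in range(len(lengths) - 1):
        dl = lengths[k + 1] - lengths[k]
        weighted += 0.5 * dl * (
            lengths[k] * density[k] + lengths[k + 1] * density[k + 1]
        )
        total += 0.5 * dl * (density[k] + density[k + 1])
    return weighted / total


# Hypothetical stand-in density: many short wires, few long ones.
points = [1 + 0.01 * k for k in range(9901)]  # 1 to 100 gate pitches
toy_density = [l ** -3 for l in points]
print(f"average length: {average_length(points, toy_density):.2f} gate pitches")
```

Restricting the sample points to a sub-range of lengths gives the corresponding quantity for a single wiring tier, which is how demand over "any range of interconnect lengths" can be evaluated.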
The specification of the interconnect architecture (levels/tiers) and the stochastic distribution provides GENESYS with the information needed to determine partition lengths for each level according to the supply/demand driven routing approach and determine optimal interconnect architectures for both maximum clock speed and minimum die size. The partition length is simply the longest interconnect length on a given level. For a given wiring pitch (width + spacing) GENESYS evaluates the available supply and solves the distribution for the range of interconnect lengths for which the condition is met. This is done for each interconnect tier until all the partition points are found. If the wiring architecture is defined by the user with no optimization and the wiring demand does not exceed the supply, GENESYS calculates the gate pitch and maximum wiring delays for each tier. For system level interconnect optimizations GENESYS manages the interconnect dimensions/gate pitch so the relationships between wiring pitch, routing supply, tier partitioning, wiring demand, and die size are satisfied for a set of given constraints. The first constraint applicable to both optimizations (max speed, min size) is the number of interconnect levels/tiers. If an optimization for max clock speed is chosen, GENESYS will increase the interconnect pitch on a per tier basis and repartition, comparing the interconnect delay on each tier with the random logic delay until either the LDI delay is less than or equal to the logic delay or the maximum die size (user supplied constraint) is reached. If a minimum die size optimization is performed, GENESYS will decrease the interconnect pitch until the LDI delay exceeds the target clock speed (user constraint).
When the analysis of the interconnect architecture is complete, GENESYS has the data necessary to calculate die size, power, and speed.
GENESYS determines the die size as a function of the gate pitch. The final gate pitch (measured in μm) is limited either by the required gate layout area or the wire routing requirements. The gate pitch area is multiplied by the number of gates to yield the area required by the random logic network. If on-chip cache is specified, GENESYS incorporates empirically determined layout models (similar to the gate area models) based on SRAM cells to calculate the total cache area. The logic and cache areas are summed to yield the total chip area.
The clock speed is determined by either the random logic delay or the LDI delay. GENESYS assumes that the system cycle time is set by the longest delay. Therefore, if the delay through the random logic is 0.45 ns and the LDI delay is 0.65 ns, the system delay is set by the 0.65-ns LDI delay, with additional margin to budget for clock skew. Optionally, the random logic delay may be interpreted as local and the LDI delay as global in a system with two clocks present (as is projected by the ITRS).
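The cycle-time rule can be written out directly, using the 0.45-ns and 0.65-ns delays from the text. How GENESYS budgets for clock skew is not specified here, so the multiplicative `skew_fraction` parameter (defaulting to zero) is an assumption, as are the function and variable names.

```python
def cycle_time(logic_delay_s, ldi_delay_s, skew_fraction=0.0):
    """Return (clock period in s, name of the limiting path).

    skew_fraction is a hypothetical margin added on top of the longest delay.
    """
    limiter = "LDI" if ldi_delay_s > logic_delay_s else "random logic"
    period = max(logic_delay_s, ldi_delay_s) * (1.0 + skew_fraction)
    return period, limiter


period, limiter = cycle_time(0.45e-9, 0.65e-9)
clock_hz = 1.0 / period
print(f"limited by {limiter}: {period * 1e9:.2f} ns, {clock_hz / 1e9:.2f} GHz")
```

In the two-clock interpretation mentioned above, the same two delays would instead set separate local and global clock periods rather than being combined with `max`.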
Once the clock speed is known, the power dissipation in the random logic may be extrapolated from the circuit level capacitance model via the following formula:
$$P_{logic} = a\,(C_{out} + C_{wire} + C_{in})\, V_{dd}^{2}\, N_{gates}\, f_{c}$$
where $a$ is the average activity factor (probability that a specific gate will switch on any given clock cycle), $C_{out}$ is the gate output capacitance, $C_{wire}$ is the wiring capacitance of an average length interconnect, $C_{in}$ is the input capacitance of the loading gate(s), $V_{dd}$ is the supply voltage, $N_{gates}$ is the number of gates in the system, and $f_{c}$ is the final clock frequency. Power dissipation in the clock drivers/distribution and on-chip cache is also considered.
In addition to speed, power, and area, GENESYS produces detailed information at each level of the hierarchy from device/material to system. Information on the circuit area, delay, and energy is provided, along with the average/maximum interconnect lengths and partition points. Detailed statistics on the interconnect dimensions and delay per tier are reported along with driver/repeater delay statistics for each tier. A detailed breakdown of both power and area is presented for random logic, clock, cache, and I/O.
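A minimal sketch of the random logic power estimate follows, combining the listed quantities in the standard switched-capacitance form. Whether the original formula carries an additional constant factor (such as 1/2) cannot be determined from the text, so the form below is an assumption. The function name and all numeric inputs are hypothetical.

```python
def random_logic_power(activity, c_out, c_wire, c_in, vdd, n_gates, f_clk):
    """Dynamic power (W) of the random logic network (assumed form).

    The switched capacitance per gate is the sum of the gate output, average
    wire, and load input capacitances (F); activity is the probability that
    a gate switches in a given clock cycle.
    """
    c_switched = c_out + c_wire + c_in
    return activity * c_switched * vdd ** 2 * n_gates * f_clk


# Hypothetical inputs: 1 fF per capacitance term, 1 V supply,
# one million gates, 1 GHz clock, 10% activity.
power = random_logic_power(0.1, 1e-15, 1e-15, 1e-15, 1.0, 1_000_000, 1e9)
print(f"random logic power: {power:.2f} W")
```

The quadratic dependence on supply voltage and the linear dependence on clock frequency are the two levers this form exposes; clock distribution and cache power would be added separately.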
F. GENESYS Validation
For this study, the primary outputs of GENESYS are module area, gate delay, and interconnect delay predictions for different technologies, based on different architectural configurations. To assess the accuracy of these predictions, GENESYS has been used to predict similar qualities of commercial microprocessors for which actual implementation details are known. Table I compares GENESYS predictions with actual data.
The average prediction error for these examples is 7% (area), 6% (clock frequency), and 11% (power). The area and frequency predictions are especially significant for this study since they incorporate interconnect performance models that are used in cluster performance prediction.
III. TECHNOLOGY
In this section, GENESYS is used to illustrate the importance of interconnect delay. The growing speed gap between transistors and interconnects will force computer architects to consider 
A. Interconnect Technology
Fig. 5 plots the predicted number of clock cycles needed to travel the longest global interconnect (with repeaters) for microprocessors across technology generations. The interconnect physical dimensions, longest wire length, resistance and capacitance values, and repeater placement are predicted using GENESYS. The points plotted are the GENESYS predicted wire delay relative to the 1998 SIA target cycle time [22]. The graph shows that with clock rates climbing at historic rates, global interconnect delay becomes a critical issue.
B. Clustered Micro-Architectures
The importance of wire interconnects requires substantially different architectures than are popular today. The superscalar and VLIW designs of today were originally conceived during a time when communication delay was ignored. As a result, they rely on centralized structures that can be easily accessed in a single clock cycle. The organization in Fig. 6(a) shows a conventional design with centralized instruction issue and a centralized register file. To be effective, an instruction should be capable of being issued to any ALU regardless of physical location, and also access any register regardless of physical location. It has been shown that there are substantial performance penalties associated with introducing wire delay in this type of architecture [2], [12]. A clustered design, such as in Fig. 6(b), favors partitioning the resources and then having explicit communication between resources. A primary goal of the hardware and/or software is to keep instructions and data local to a partition, and to minimize critical path cross-cluster communication.
C. Technology Projections
The goal of this work is to study the relationship between wire-limited technology and clustered micro-architectures. The key technology parameters are the wire delay between adjacent functional units and the clock frequency. These two parameters can be used to calculate the number of clock cycles required to move data from one FU to an adjacent FU (called the hop latency).
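The hop latency calculation can be sketched as the wire delay expressed in clock periods. Rounding up to whole cycles is an assumption made here (a result cannot be consumed mid-cycle in a synchronous design); the function name and the example delays and frequencies are hypothetical and are not values from Table II.

```python
import math


def hop_latency_cycles(wire_delay_s, clock_hz):
    """Whole clock cycles needed to move data from one FU to an adjacent FU."""
    cycles = wire_delay_s * clock_hz
    # A small tolerance keeps floating-point noise from adding a spurious cycle.
    return math.ceil(cycles - 1e-9)


# Hypothetical (wire delay, clock frequency) pairs.
for delay_s, freq_hz in [(0.3e-9, 1e9), (1.2e-9, 2e9)]:
    print(delay_s, freq_hz, hop_latency_cycles(delay_s, freq_hz))
```

Because the wire delay shrinks more slowly than the clock period as technology advances, this ratio grows from one generation to the next.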
To calculate the hop latency in different technologies, some assumptions must be made about the micro-architecture implementation. An assumption is made about which wiring level will be used to route intercluster interconnect. It is reasonable to assume that the uppermost levels will be used for clock, power, and ground. Current designs, such as the Alpha 21264, devote entire wiring levels to power grids and ground planes. This trend is expected to continue. At the same time, the lowest-wiring levels are typically used for short local interconnect between gates within functional blocks.
There is tremendous wiring demand for the data forwarding logic. Each FU needs to be connected to every other FU with minimally 64 wires (for 64-b data). A best case estimate is 1024 wires (64 b × 16 FUs) into each FU (of course, it is likely to be much higher). It is also desirable for the data forwarding wires to have the same pitch as the FU logic to which they connect. For these three reasons (upper levels used by other resources, large wiring demand, and desire for small pitch), it is assumed that all routing of forwarding wires will be done on the "semiglobal" wiring levels. For this work, it is assumed that all data forwarding wires are routed on the level under the top three (level NUM_LEVELS - 3).
GENESYS can now be used to calculate the hop latency for different technologies. Table II summarizes the results of the calculations.
IV. ARCHITECTURE
This section describes the examined superscalar and multithreading execution architectures. First, the data forwarding model employed in both architectures is defined. Then, the instruction dispatch method for each architecture is described and differentiated. Finally, the cluster communication network is explained.
A. Data Forwarding
There are two critical issues associated with clustered microarchitectures. The first is distributing instructions to the clusters and the second is providing data to the clusters. Instruction distribution can be handled with replicated instruction caches. Each cluster is given a copy of the instructions through private L1 instruction caches. Because instruction cache hit rates are typically 95%+ for most applications, instruction delivery is not a critical problem. Data communication between clusters, however, can be a significant issue.
To quantitatively study the importance of intercluster communication, it is necessary to assume a particular instruction execution model. This work assumes runtime (dynamic) instruction scheduling (versus compiler scheduling) because of its widespread use and the importance of supporting binary code compatibility.
Fundamental to all dynamic ILP processors is the concept of restricted data-flow execution. In this execution model, producing instructions send results directly to consuming instructions. An instruction is ready to execute as soon as its inputs and a FU are available. Many instructions may execute in a single clock cycle, in any order, driven only by the data flow requirements of the program. The term "restricted" is used because only a fixed window of dynamic instructions is considered for data flow execution. The window size is determined by the physical implementation. Tomasulo [29] introduced an algorithm whereby reservation stations hold multiple instructions waiting for a functional unit. An instruction can enter the FU once its inputs are available. Results are broadcast to all reservation stations over a common data bus. This algorithm is seen in Fig. 7 .
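Restricted data-flow execution can be illustrated with a toy scheduler. This is a heavily simplified sketch, not Tomasulo's algorithm as implemented in hardware: it assumes unlimited functional units, single-cycle operations, zero forwarding delay, and that each destination name is written only once (so no register renaming is needed). The function name and the three-instruction program are hypothetical.

```python
import operator


def dataflow_schedule(program, registers, window=16):
    """Run (dest, func, sources) instructions in restricted data-flow order.

    Only the first `window` pending instructions (in program order) are
    considered each cycle. Returns (final values, cycles used).
    """
    values = dict(registers)
    pending = list(program)
    cycles = 0
    while pending:
        candidates = pending[:window]
        # An instruction is ready once all of its source operands exist.
        ready = [ins for ins in candidates if all(s in values for s in ins[2])]
        if not ready:
            raise ValueError("no instruction in the window can execute")
        # Results are computed from values available at the start of the cycle,
        # then broadcast to all waiting instructions (the common data bus).
        results = {d: f(*(values[s] for s in srcs)) for d, f, srcs in ready}
        values.update(results)
        pending = [ins for ins in pending if ins not in ready]
        cycles += 1
    return values, cycles


program = [
    ("a", operator.add, ("x", "y")),
    ("b", operator.mul, ("a", "a")),  # depends on a
    ("c", operator.sub, ("x", "y")),  # independent of a and b
]
values, cycles = dataflow_schedule(program, {"x": 3, "y": 1})
print(values["b"], values["c"], cycles)
```

With a window of three, the independent instructions `a` and `c` execute together and `b` follows one cycle later; shrinking the window to one forces strictly sequential execution, which shows why the window size bounds the exploitable parallelism.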
For dependent instructions to execute back-to-back, every FU must be able to reach every reservation station in a single clock cycle. In future technologies, with many tens of functional units, this will not be possible. A recent implementation of restricted data flow execution (the Alpha 21264) partitioned the FUs into two clusters. Each cluster can be reached in a single clock cycle, but intercluster communication requires an additional cycle. In the limit of this approach, every FU has a separate path to every other FU. Delay between FUs is then determined solely by distance. Fig. 8 shows an example of this approach, illustrating the paths from FU1 to all other functional units. Similar interconnects would be needed for FUs 2, 3, and 4 to communicate their results, but are not shown for clarity.
Given an execution core with generalized data forwarding, experiments can be run to determine the influence of FU-to-FU wire delay on performance. Before instructions can be executed, however, they must be delivered to the reservation stations. How they are delivered determines the dynamic communication pattern between FUs, which influences performance because of the varying latency of different paths. The next section discusses two different approaches to instruction delivery.
B. Superscalar versus Multithreaded Instruction Delivery
In a superscalar processor, instructions are fetched sequentially (in program order) and delivered to all FUs at once. Fig. 9 illustrates this process, showing groups of instructions being delivered to four FUs. On the first fetch cycle, instructions 1 through 4 are given to FUs 1-4, respectively. On the next fetch cycle, instructions 5-8 are delivered, followed by 9-12, and so on. This approach is known to be capable of exposing abundant instruction level parallelism.
An alternative instruction delivery approach, referred to here as multithreaded, is being explored by a number of researchers [5], [23], [26], [28], [30]. In this approach, separate fetch engines feed each FU from different portions of the program. Fig. 10 illustrates this approach. The first fetch engine is responsible for supplying instructions 1-4 to the first FU. The second fetch engine supplies instructions 5-8 to the second FU, and so on. Because instructions are typically dependent on instructions close by, this approach has the advantage of localizing communication within a single FU. The disadvantage is that less instruction level parallelism is exposed.
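The two delivery policies differ only in how a dynamic instruction number maps to an FU. The sketch below uses 1-based instruction and FU numbers to match the examples in the text. The function names are hypothetical, and the wrap-around of threads to FU 1 after the last FU is an assumption about what "and so on" implies.

```python
def superscalar_fu(instr, num_fus):
    """Consecutive instructions are spread across all FUs in each fetch cycle."""
    return (instr - 1) % num_fus + 1


def multithreaded_fu(instr, num_fus, thread_size):
    """Each block of thread_size consecutive instructions goes to one FU."""
    return ((instr - 1) // thread_size) % num_fus + 1


instructions = range(1, 9)
print([superscalar_fu(i, 4) for i in instructions])       # instructions 1-8
print([multithreaded_fu(i, 4, 4) for i in instructions])  # thread size 4
```

Setting `thread_size` to 1 makes the multithreaded mapping identical to the superscalar one, which is consistent with treating superscalar delivery as the thread-size-1 case in the results.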
C. Network Topology
Instructions in a program generally communicate with nearby instructions. For this reason, it is desirable to have instructions that are nearby in program order physically close together while executing. This is true for both the superscalar and multithreaded approaches. This is accomplished by delivering instructions according to the pattern shown in Fig. 11. In this pattern, maximum wire length is limited to adjacent nodes. For the superscalar model, all FUs get one instruction in the same cycle. Instructions adjacent in program order are always physically nearest neighbors. For the multithreaded architecture, groups of instructions are assigned to fetch engines in the same pattern. In this way, each group of instructions typically communicates with another group of instructions physically nearby.
V. RESULTS
Many simulations were run while varying the key parameters: hop latency, instruction delivery method, and use of value prediction.
A. Simulation Methodology
An execution-driven simulator has been constructed on top of the SimpleScalar simulator, which implements a MIPS-like ISA [4]. The intent of the simulations is to isolate performance limits due to data forwarding, and to study the impact of advancing technology. For this reason, ideal behavior is assumed for performance issues unrelated to data forwarding, such as memory bandwidth and latency, FU latency, instruction cache misses, etc. The parameters used, which set an upper bound on ILP, are the issue width (16), reservation station size (16 per FU), and branch predictor (64k gshare). The data forwarding related architectural parameters are the instruction-to-FU delivery method and use of value prediction. The technology parameter that influences data forwarding is the hop latency. It sets the time needed to move data from one FU to another. If the hop latency is one and the producing instruction and consuming instruction are two FUs apart, then two cycles are needed before the result arrives and the consuming instruction can execute. With a hop latency of 2, 4 cycles would be needed. Table III summarizes important simulator configurations. The atlas multi-adaptive (AMA) value predictor is described in [7]. Branch prediction (gshare) is described in [19].
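The forwarding delay rule stated above (hops multiplied by hop latency) can be sketched as follows. The helper that averages over a traffic distribution is an added illustration, and the traffic fractions passed to it are hypothetical, not values read from the figures.

```python
def forwarding_delay(hops, hop_latency):
    """Cycles before a result produced `hops` FUs away can be consumed."""
    return hops * hop_latency


def average_forwarding_delay(traffic_by_hops, hop_latency):
    """Traffic-weighted mean delay; traffic_by_hops maps hops -> fraction."""
    return sum(
        fraction * forwarding_delay(hops, hop_latency)
        for hops, fraction in traffic_by_hops.items()
    )


print(forwarding_delay(2, 1))  # two FUs apart, hop latency 1
print(forwarding_delay(2, 2))  # two FUs apart, hop latency 2

# Hypothetical traffic mix: half local, the rest one or two hops away.
print(average_forwarding_delay({0: 0.5, 1: 0.25, 2: 0.25}, hop_latency=2))
```

Traffic that stays local to an FU (zero hops) pays no forwarding penalty regardless of hop latency, which is why the fraction of local traffic matters more as wires slow down.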
The SPEC benchmarks are compiled with gcc 2.6.1 using full optimizations. Each benchmark is simulated using the train input set for the first 200 million instructions, or until the program completes, whichever comes first. Fig. 12 shows the distribution of the distance that data traffic moves between functional units when using the multithreaded-16 instruction delivery. Over 60% of the data traffic is local to a functional unit when using this policy, and nearly 90% of the traffic is contained within 2 FU hops. Fig. 13 shows a similar graph for the superscalar instruction delivery. Here, the traffic is almost never local and is instead fairly evenly distributed at distances of 1, 2, 3, and 4 hops. Fig. 14 shows cumulative traffic distributions for threads of size 1 (superscalar), 4, 8, and 16 (multithreaded). The traffic can be contained locally by increasing the thread size. This comes at a performance penalty, however, which is analyzed in Section V-C.
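A traffic distribution of this kind could be gathered as sketched below. Everything here is an assumption made for illustration: the round-robin mapping of instruction groups to FUs, the ring topology used to count hops (the actual pattern is the one in Fig. 11), the helper names, and the toy dependence list, which is not simulator data.

```python
from collections import Counter


def fu_of(instr_index: int, thread_size: int, num_fus: int) -> int:
    """Map a dynamic instruction to an FU: groups of `thread_size`
    consecutive instructions are dealt round-robin (hypothetical mapping)."""
    return (instr_index // thread_size) % num_fus


def hop_distance(fu_a: int, fu_b: int, num_fus: int) -> int:
    """Hops between two FUs, assuming a ring so that the last FU
    neighbors the first (an assumption, not the layout of Fig. 11)."""
    d = abs(fu_a - fu_b)
    return min(d, num_fus - d)


def traffic_histogram(dependences, thread_size: int, num_fus: int = 16) -> Counter:
    """Count producer-to-consumer transfers by hop distance."""
    hist = Counter()
    for producer, consumer in dependences:
        a = fu_of(producer, thread_size, num_fus)
        b = fu_of(consumer, thread_size, num_fus)
        hist[hop_distance(a, b, num_fus)] += 1
    return hist


def fraction_within(hist: Counter, max_hops: int) -> float:
    """Cumulative fraction of traffic travelling at most `max_hops`."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(n for d, n in hist.items() if d <= max_hops) / total


# Toy (producer, consumer) pairs of dynamic instruction indices.
deps = [(0, 1), (1, 2), (0, 4), (15, 16)]
for size in (1, 16):
    hist = traffic_histogram(deps, thread_size=size)
    print(size, dict(hist), fraction_within(hist, 0))
```

With a thread size of 1, none of the toy dependences stay on one FU; with a thread size of 16, most of them do, which is the qualitative effect the figures describe.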
The fundamental reason multithreading is able to contain traffic locally is that it exploits the natural temporal data locality present in most programs. Produced values tend to be consumed sooner rather than later. This is the same basic principle that makes data caches effective. By keeping related instructions close together rather than spread throughout the machine, traffic can be contained locally. Fig. 15 shows the performance impact of increasing hop latency on the different instruction delivery methods. The 1-instruction thread (superscalar) method achieves very high performance (over 7 instructions per cycle) with zero-delay wires. As the hop latency is increased, performance falls quickly. In a technology with a 6 clock cycle penalty between FUs, the IPC drops to under 1. The multithreaded designs are not able to exploit as much parallelism as the superscalar design when wire delay is free. The design with 16-instruction threads achieves only 3.5 instructions per cycle under these conditions. However, these designs are not as strongly impacted by wire delay. As the hop latency increases, performance drops at a slower pace. With zero-delay wires, the superscalar design delivers twice the performance of the multithreaded-16 design. With a 6 cycle hop penalty, the situation is reversed: the multithreaded-16 design delivers twice the performance of the superscalar design. The choice of architecture should therefore be made according to the implementation technology. Note that there are several cross-over points where different architectures are favored depending on the technology.
C. Performance
The next graph, Fig. 16, shows the performance over the hop latencies predicted by GENESYS for different technology generations. Notice that there are many cross-over points where different thread granularities offer the best performance. This is important, as it suggests that the optimal thread partitioning changes with evolving technology. The implication is that it is beneficial to have the hardware, rather than the software, control the size of the thread. The microarchitecture can resize threads for a given implementation. The software is more rigid in that it must choose one size and compile for that size. Without recompilation, future implementations must obey a suboptimal partitioning.
Because the software has greater visibility into the program than the hardware, a good approach would be to have the software partition large sections of code (100s or 1000s of dynamic instructions) and the hardware partition small sections (10s of instructions). This way, the processor could leverage the strengths of both software (visibility into the whole program) and hardware (dynamic adaptability). An evaluation of dual software/hardware partitioning is left for future work.
D. Value Prediction
Next, value prediction is considered as an architectural option. Value prediction has been recognized as a way to break true data dependencies by predicting the value that an instruction will consume and executing it in parallel with the producing instruction [18]. In the simulations, all instruction sources are value predicted using the AMA value predictor [7]. AMA value prediction combines last-value stride prediction with a control correlated context predictor. Whenever the predictions are correct, the instruction is allowed to execute. Table V shows the value prediction achieved by AMA.
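To make the idea of stride-based value prediction concrete, the sketch below implements a plain last-value/stride predictor indexed by instruction address. It is not the AMA predictor of [7], which also includes a control correlated context component. The class name, the table layout, and the choice to update the table immediately after each outcome are all assumptions made for illustration.

```python
class StridePredictor:
    """Simplified last-value/stride value predictor (not AMA).

    Each table entry holds the last value seen for an instruction and the
    difference between its last two values. The prediction is
    last + stride, which reduces to last-value prediction when stride is 0.
    """

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None:
            return None  # no history yet, so no prediction is made
        last, stride = entry
        return last + stride

    def update(self, pc, actual):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (actual, 0)
        else:
            last, _ = entry
            self.table[pc] = (actual, actual - last)


def accuracy(values, pc=0x40):
    """Fraction of values predicted correctly for a single instruction,
    with the table updated immediately after each outcome."""
    predictor = StridePredictor()
    hits = 0
    for v in values:
        if predictor.predict(pc) == v:
            hits += 1
        predictor.update(pc, v)
    return hits / len(values) if values else 0.0


# A loop-induction-like sequence: the stride is learned after two values.
print(accuracy([10, 14, 18, 22]))
```

In this sketch the first value is never predicted and the second is mispredicted, since the stride is learned only after two outcomes have been observed.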
In this work, the value predictor is given immediate updates. While overly optimistic, this does represent a limit on value prediction performance. Fig. 17 shows the data collected for both real value prediction (i.e., using realistic AMA value prediction performance) and no value prediction using superscalar and multithreaded-16 delivery policies. Performance improves across the board when using value prediction, but the increase is more substantial for the multithreaded design. The superscalar design receives an average of 40% improvement with value prediction, while the multithreaded design receives an average of 120% improvement. Removing intercluster dependencies is more valuable in the multithreaded case because it allows a larger amount of work to begin sooner. Fig. 18 shows the performance over the GENESYS predicted hop latencies. It is interesting to note that the cross-over point between superscalar and multithreaded is the same with and without value prediction.
VI. CONCLUSION
With the advent of gigascale integration, many diverse architectural approaches are possible for future microprocessors. It is important to carefully study the interactions between advancing technology and architecture. In this paper, the limits of ILP imposed by data forwarding between clusters were considered as technology advances. The most important technology parameter is the number of clock cycles needed to communicate data between clusters. GENESYS predicts that the hop latency will increase significantly as technology advances. New architectural techniques will be needed to overcome emerging technology limitations. In addition to interconnect delay, other system parameters such as power dissipation and area efficiency (also estimated in GENESYS) are factoring significantly into new processor designs. Technology estimation tools like GENESYS will become increasingly important in early architectural design in order to achieve optimal system performance.
