Introduction
Microprocessor architectures strive for the highest possible performance on important applications while meeting the constraints of area, power consumption, and design time.
Transistor size and delay are the traditional drivers of microprocessor architecture. As designs push towards billions of fast transistors on a chip (Gigascale Integration, or GSI), interconnect issues will become paramount [14]. Conventional uniprocessor architectures, developed in an era when interconnect delay was less significant, are incompatible with this technology.
This work studies the impact of gigascale technology on microprocessor architecture. It is an exploration into the design of architectures for GSI using architecture simulation tools, technology modeling, and historical data. 100 nm technology, projected to be available in 2006 [15], was chosen as the basis of the study. A set of reasonable candidate architectures that span a spectrum of uniprocessor and multiprocessor designs is modeled to determine the expected clock frequency, instructions per cycle (IPC), and area.
Compilers, applications, and operating systems all have a significant impact on performance. The impact of innovation in implementation architecture is difficult to predict. For this reason, the amount of explicit coarse-grained parallelism available is treated as a variable in this study.
The clock frequency of large uniprocessor systems suffers due to wire delay. In addition, these systems cannot exploit high-level parallelism. When even a small amount of coarse-grained parallelism is available, other architectures offer higher performance. Massively parallel collections of small processors suffer when only small amounts of explicit parallelism are available, and they do not have the flexibility to exploit fine-grained instruction-level parallelism.
The next section summarizes related work on technology-based architectural prediction. Section 3 describes the modeling method used in this study. Section 4 describes the architectural candidates being considered and how their performance, area, and operation frequency are estimated. Section 5 presents the results and Section 6 offers conclusions.
Related Work
There is considerable published research evaluating architectures in future technologies [16] [23]. This work often focuses on the performance of a specific architecture, without fully exploring the interaction between varying architectural configurations and the capabilities and limitations of a technology. It provides an in-depth understanding of an architectural approach as a data point, but offers little insight into the trends that connect points.
Recent work has begun to illuminate the broader relationship between architectures and future VLSI technologies. [9] [18] define the relationship between processor complexity and cycle time. [7] considers the balance between cache memory and processor resources for a traditional microprocessor architecture. [8] examines optimal architectures in 0.35 µm technology. The present study builds on this work by examining advanced VLSI technology (100 nm) and ranges of explicit parallelism in the application workload.
Methodology
This paper considers systems of one or more processors (nodes) in a MIMD organization.
The design space to be explored is shown in Figure 1.
Figure 1 Design Space (regions labeled: Multiple Processors; Highly Parallel Simple Processors; Wide Superscalar Uniprocessor; Aggressive Nodes)
Performance of the architectures considered can be approximately expressed as the product of per-node instruction throughput (IPC), node clock frequency (f), and the parallel speedup offered by multiple nodes (Sp). It is assumed that Sp results from coarse-grained parallelism, and that Sp is independent of node IPC. This assumption holds for multi-tasking and coarse-grained multi-threading environments. It does not hold for fine-grained parallel systems where task-level and instruction-level parallelism are shared. Under this assumption, the system performance is given as:
Performance = Clock frequency (f) * IPC * Sp
Cycle time (inverse of frequency) is the delay through a critical path, including both gate and wire delay. In a superscalar architecture, the critical path includes the ALU and bypass logic delay plus wire delay across the waiting functional units, renaming buffers, and register file [18]. This delay can be estimated for different node sizes in future VLSI technologies. The Instructions Per Cycle (IPC) for a node is determined using an efficiency metric defined in Section 4.1. The speedup offered by multiprocessor configurations depends on many factors, including the application, algorithm, compiler technology, and operating system. In this study, the fraction of parallelism available in the workload is treated as a variable to allow examination of the speedup for a range of values.
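The performance expression above can be sketched in a few lines of Python. This is only an illustration of the product model: the class name `Candidate`, its field names, and all numeric inputs are hypothetical and are not taken from the tables of this study.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate system; all field values below are hypothetical."""
    name: str
    freq_ghz: float   # node clock frequency f, in GHz
    ipc: float        # per-node instructions per cycle
    speedup: float    # parallel speedup Sp from multiple nodes

    def performance(self) -> float:
        # Performance = f * IPC * Sp, in billions of instructions per second.
        # Assumes Sp is independent of node IPC, as stated in the text.
        return self.freq_ghz * self.ipc * self.speedup


# Hypothetical inputs chosen only to exercise the formula.
wide_uni = Candidate("wide uniprocessor", freq_ghz=1.0, ipc=4.0, speedup=1.0)
four_node = Candidate("four smaller nodes", freq_ghz=2.0, ipc=2.0, speedup=1.6)

for system in (wide_uni, four_node):
    print(f"{system.name}: {system.performance():.2f} Gops/sec")
```

Because the three factors multiply, a configuration with lower per-node IPC can still come out ahead if its shorter wires allow a higher clock frequency and the workload supplies some coarse-grained speedup.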
Candidate Architectures
Numerous combinations of uniprocessor and multiprocessors are possible in 100 nm VLSI technology. The candidate architectures in the study span the spectrum from a very simple processor replicated across the die to a single large uniprocessor. The following sections determine the possible node sizes and expected node performance.
Estimating IPC
Estimating IPC for the 4-way superscalar and smaller nodes is straightforward as these architectures are well studied. Determining the performance of more complex nodes (> 4-way superscalar) is more difficult, as there is little data available.
One approach to estimating performance for larger nodes is to extrapolate the historical performance of uniprocessor architectures. Figure 2 shows IPC growth as a function of dimensionless area. The data points, taken from [5], represent the last three generations of current RISC and CISC architectures. Extrapolating this curve to 100 nm suggests that a single node would achieve an IPC of approximately 12.
Figure 2 Historical Data on IPC as a function of Area. IPC is calculated as (2.6 * SPECint95 / Clock frequency). Area is calculated as (die size in mm^2 / (feature size in µm)^2).
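The two quantities plotted in Figure 2 can be computed as in the sketch below. The function names are hypothetical, and the die sizes in the loop are illustrative inputs rather than data from [5]. The caption does not state the unit of clock frequency used with the constant 2.6, so the IPC helper simply takes the frequency in whatever unit that constant was calibrated for.

```python
def dimensionless_area(die_area_mm2: float, feature_size_um: float) -> float:
    """Die area normalized by the square of the feature size.

    1 mm^2 / 1 um^2 is a fixed ratio, so the result is a pure number that
    allows dies from different technology generations to be compared.
    """
    return die_area_mm2 / feature_size_um ** 2


def estimated_ipc(specint95: float, clock_frequency: float) -> float:
    """IPC estimate from the Figure 2 caption: 2.6 * SPECint95 / clock frequency.

    The frequency unit is not given in the caption (assumption: the caller
    supplies it in the unit the constant 2.6 expects).
    """
    return 2.6 * specint95 / clock_frequency


# Hypothetical die: the same 300 mm^2 of silicon at two feature sizes.
for feature_um in (0.35, 0.25):
    area = dimensionless_area(300.0, feature_um)
    print(f"{feature_um} um: dimensionless area = {area:.0f}")
```

Normalizing by feature size squared expresses area roughly in units of minimum-size features, which is why a fixed die size yields a larger dimensionless area in a finer technology.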
This simple extrapolation of historical performance is not realistic. There are many bottlenecks to achieving this performance, including limited inherent parallelism in a fixed instruction window, penalties from control mispredictions, and memory access availability. To illustrate the significance of these bottlenecks, a set of simulations has been run using the SimpleScalar simulator [3]. Table 1 shows the results of an experiment on available parallelism.
The only restrictions on performance are the pipeline width (set to 16 instructions) and the size of the instruction window (set to 1024 instructions). There are many ideas currently being explored to circumvent these basic limitations. For example, predicated execution [12] attempts to reduce the limitations of real branch prediction, while value prediction [11] can be used to increase parallelism. The impact of these techniques, however, is still unknown. Because of this uncertainty, an empirical instruction throughput expression is used.
In this study, the efficiency of an out-of-order superscalar node is defined to be (achieved IPC/pipeline width). This metric is shown in Table 2.
Table 2 Efficiency of Superscalar Nodes
Estimating Area
The previous section presented a method to calculate IPC based on pipeline width, assuming efficiency remains constant. The composition of the complete system requires an area estimate for the implementation of various superscalar configurations. Unfortunately, no public domain area model exists, probably due to the multitude of different physical implementations.
Olukotun [16] argues that the area of superscalar processors grows quadratically with issue width, so doubling the issue width should roughly quadruple the area. Empirical data supports this assertion. Moving from the PA-RISC 7100 (a two-way machine) to the PA-RISC 8000 (a four-way machine) required 4.3 times more area. Similarly, the MIPS architecture grew by 3.5 times from the R5000 to the R10000. This quadratic relationship will be used to estimate the area of larger (> 4-way superscalar) processor nodes. Table 4 shows the characteristics of the basic processors for 100 nm technology. Data for processors of 4-way superscalar width and smaller are averages of data collected on current microprocessors, obtained from [5]. Data for processors larger than 4-way are generated using the area relationship and IPC formula described previously.
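The quadratic area relationship and the constant-efficiency IPC estimate can be sketched as follows. The function names, the normalized base area of 1.0, and the efficiency value of 0.5 are assumptions made for illustration; the study's actual efficiency figures are those of Table 2.

```python
def scaled_area(base_area: float, base_width: int, width: int) -> float:
    """Area of a `width`-way node, scaled quadratically from a known node."""
    return base_area * (width / base_width) ** 2


def ipc_from_efficiency(efficiency: float, width: int) -> float:
    """IPC = efficiency * pipeline width, with efficiency held constant."""
    return efficiency * width


# Sanity check against the empirical data quoted in the text: a two-way to
# four-way step predicts 4.0x area (observed: 4.3x PA-RISC, 3.5x MIPS).
print(f"2-way -> 4-way predicted growth: {scaled_area(1.0, 2, 4):.1f}x")

# Extrapolation beyond 4-way; the 4-way area is normalized to 1.0 and the
# efficiency of 0.5 is a hypothetical placeholder.
for width in (6, 8, 12, 16):
    area = scaled_area(1.0, 4, width)
    ipc = ipc_from_efficiency(0.5, width)
    print(f"{width:2d}-way: area = {area:5.2f}x the 4-way node, IPC = {ipc:.1f}")
```

Under these assumptions IPC grows linearly with width while area grows quadratically, so IPC per unit area falls as nodes get wider. This is the trade-off that the multiprocessor candidates exploit.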
Node Architectural Characteristics

Estimating Frequency
The longest delay path within the processor determines the clock frequency. This delay path consists of both a wire delay and a gate delay component.
frequency = 1 / (critical path delay)
critical path delay = wire delay + gate delay
In future superscalar designs, this critical path will likely consist of the ALU and execution bypass [18]. Bypass is provided so that execution results can be used on the next cycle by instructions that need them. In a fully bypassed design, dependent instructions are allowed to execute back-to-back. This is an important capability, as nearly half of all instructions get their operands from bypass [1]. Pipelining the result broadcast severely degrades IPC. A typical configuration is shown in Figure 3.
Figure 3 Bypass Wire Length (register file and functional units, FU). Bypass wire length = f(#FUs, #physical registers, #read/write ports)
The register file is in the center of the ALU so that the distances between the register file and the functional units are minimized. For wire delay estimation, this study assumes the ALU and register file are co-located in a square layout. The maximum wire length is then estimated by the side of this square. Estimates of the size and number of functional units and the size of the physical register file are given in Table 5.
Table 5 Estimating Longest Wire in Critical Path
The delay of a wire is given by the following expression [2], where R and C are the wire resistance and capacitance:
wire delay = 0.7 x RC
R and C are calculated using the following formulae, where R_int and C_int are respectively the resistance and capacitance of the wire per unit length:
R = R_int x length, C = C_int x length
The C_int value was obtained from GENESYS [6], while R_int was calculated using the following equation, where ρ is the wire resistivity and A is the cross-sectional area of the wire:
R_int = ρ / A
Aluminum is assumed for the wire material. Wires are assumed to have a square cross-section with each side being 3 feature sizes in length:
A = (3 x feature size)^2
Gate delays for each technology were obtained from GENESYS. 
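The frequency estimate described in this section can be sketched as follows. The resistance per unit length is taken as resistivity divided by cross-sectional area, which is the standard relation for a uniform conductor and is an assumption about the exact equation used here. The aluminum resistivity is an approximate bulk value, and the capacitance per unit length, gate delay, and wire lengths are hypothetical placeholders, not GENESYS outputs or Table 5 entries.

```python
ALUMINUM_RESISTIVITY = 2.65e-8  # ohm*m, approximate bulk value (assumption)


def resistance_per_length(resistivity: float, feature_size: float) -> float:
    """R_int for a square wire whose side is 3 feature sizes (SI units)."""
    side = 3.0 * feature_size
    return resistivity / (side * side)


def wire_delay(r_int: float, c_int: float, length: float) -> float:
    """wire delay = 0.7 * R * C, with R = R_int * length, C = C_int * length."""
    return 0.7 * (r_int * length) * (c_int * length)


def clock_frequency(wire_delay_s: float, gate_delay_s: float) -> float:
    """frequency = 1 / (wire delay + gate delay)."""
    return 1.0 / (wire_delay_s + gate_delay_s)


r_int = resistance_per_length(ALUMINUM_RESISTIVITY, 100e-9)  # 100 nm node
c_int = 2.0e-10        # F/m, hypothetical stand-in for the GENESYS value
gate_delay = 200e-12   # s, hypothetical stand-in for the GENESYS value

for length_mm in (1.0, 2.0, 4.0):   # hypothetical bypass wire lengths
    delay = wire_delay(r_int, c_int, length_mm * 1e-3)
    freq = clock_frequency(delay, gate_delay)
    print(f"{length_mm} mm: wire delay = {delay * 1e12:.0f} ps, "
          f"f = {freq / 1e9:.2f} GHz")
```

Because both R and C scale with length, the wire delay grows with the square of the wire length. This is why the clock frequency of wide nodes with long bypass wires degrades much faster than their linear dimensions grow.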
Estimating Parallel Speedup
The speedup obtained by multiple processors is estimated using Amdahl's law, which says that speedup is a function of the fraction of the workload that can be executed in parallel.
Speedup = 1 / {(Parallel fraction / #nodes) + (1 - Parallel fraction)}
This parallel fraction is a variable for this analysis.
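A minimal sketch of this speedup calculation, swept over a few parallel fractions, is given below. The function name and the particular node counts and fractions in the loop are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Speedup = 1 / (parallel_fraction / nodes + (1 - parallel_fraction))."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 / (parallel_fraction / nodes + (1.0 - parallel_fraction))


# Sweep the parallel fraction for a few illustrative node counts.
for fraction in (0.0, 0.5, 0.9, 0.99):
    row = ", ".join(
        f"{nodes} nodes: {amdahl_speedup(fraction, nodes):6.2f}"
        for nodes in (4, 16, 64)
    )
    print(f"parallel fraction {fraction:4.2f} -> {row}")
```

The sweep shows why the highly parallel candidates need very high parallel fractions to pay off: the serial portion of the workload bounds the speedup at 1 / (1 - parallel fraction) regardless of the number of nodes.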
Results
The complete information for all candidate systems is summarized in Table 7 . 
Table 7 Summary of Candidate Architectures
The parallel fraction is then swept from 0 to 1.0. The next four graphs summarize the results in different regions of parallel operation. Figure 4 shows the results for parallel fractions from 0 to 0.5. In this region, the configuration with two 12-way superscalar processors is the best choice. The single uniprocessor (1x16-way) configuration performs best when little explicit parallelism is available (less than 10% of the workload). The more parallel configurations (32x2-way and 64x1-way) do not appear on this graph as their performance is significantly less than the other systems.
An interesting point on this graph is at 0% parallel fraction. Without any explicit parallelism, the system with four 8-way superscalar processors is only 27% worse than the large uniprocessor. This is primarily due to the clock cycle penalty imposed on the large uniprocessor design.
The next graph, Figure 5, shows results for parallel fractions between 0.5 and 0.9. This is an interesting region as many general-purpose programs have parallelism in this region. A multiprogramming environment, for example, should be able to provide this level of parallelism.
The eight 6-way processor system leads this region. Note that at these ranges of parallel fraction, the uniprocessor is anywhere from 1.3 to 4.0 times worse than the other systems. The more parallel sixteen 4-way processor system only becomes competitive at greater than 85% parallel fraction.
is best, after which the thirty-two 2-way processor system leads. In this region, the uniprocessor is from 4 to 13 times worse than the parallel systems. This is the first graph to reach the performance projected by Moore's law. In 100 nm technology, Moore's law projects 12 IPC * 3 GHz or 36 Gops/sec. This level of performance is only predicted with greater than 97% parallel fraction on the 64 2-way processor system, and greater than 98% parallel fraction with the 256 scalar processor system.
Finally, Figure 11 shows the geometric mean performance for workloads with low parallelism. In this case, the two 12-way processors are the best choice.
Conclusions
Wire delay limits the clock frequency of large uniprocessors. In addition, these architectures perform poorly in environments with even small amounts of explicit parallelism.
The other extreme, highly parallel systems composed of many small processors, is effective only when large amounts of explicit parallelism are available. The most parallel system considered, 256 simple processors, is the best choice only when over 99% of the workload can be executed in parallel.
Moore's law is sustained by only two configurations, both highly parallel systems, and only with greater than 97% parallel fraction in the workload. This is not the best choice for general-purpose computing, as many programs do not support this level of parallelism.
Finding the best architecture for 100 nm technology is a difficult task. Perhaps the most interesting systems are combinations of configurations in this paper. For example, a heterogeneous processor consisting of several moderately complex superscalar processors together with a highly parallel processor array could efficiently execute parallel workloads when available and also efficiently handle serial tasks.
