Abstract: The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules intertile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles in which software orchestrates communication over static interconnections. One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile and the balance is the proportion of area in each tile devoted to memory, processing, communication, and off-chip global I/O. If the total chip area is fixed, higher processing power per tile requires larger tiles and hence reduces the total number of tiles on the chip. This paper presents SimpleFit, a novel analytical framework that designers can use to reason about the design space of Raw microprocessors. Our model is also generalizable to multiprocessors on a chip. Based on an architectural model, an application model, and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively. Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar.
Accordingly, for the applications studied, assuming a one billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per tile, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
INTRODUCTION
ADVANCES in semiconductor technology have made possible the integration of multiple functional units, large cache memories, reconfigurable logic arrays, and peripheral functions into single-chip microprocessors. Unfortunately, increases in the performance of contemporary microprocessors have come at the cost of increasing inefficiencies in silicon area usage. The inefficiencies arise from the complexity of designs that use hardware support to exploit more instruction-level parallelism.
Maintaining a rapid increase in microprocessor performance will require a cost-efficient utilization of silicon area. The MIT Raw microprocessor is a proposed architecture that exposes its internal hardware structure to the compiler so that the compiler can determine and orchestrate the best mapping of an application to the hardware. A Raw microprocessor [1] is reminiscent of a coarse-grained FPGA and comprises a replicated set of tiles coupled together by a set of compiler-orchestrated, pipelined switches (Fig. 1). Each tile contains a simple RISC-like processing core and SRAM memories for instructions and data. Instruction memory allows the multiplexing of the compute logic on a cycle-by-cycle basis. SRAM memory distributed across the tiles eliminates the memory bandwidth bottleneck, provides low latency to each memory module, and prevents off-chip I/O latency from limiting effective computational throughput.
The tiles are interconnected by a high-speed 2D mesh network, allowing intertile communications that are statically scheduled to occur with register-like latencies. The switches themselves contain some amount of SRAM so that the compiler can load into the switch a program that multiplexes the interconnect in a cycle-by-cycle fashion, just as in a virtual wires-based multi-FPGA system [4].
A typical Raw system includes a Raw microprocessor coupled with off-chip RDRAM (RamBus DRAM) through multiple high bandwidth paths. The two level memory hierarchy, namely a local SRAM memory attached to each tile inside the Raw chip and a large external RDRAM memory, is necessary to be able to solve large problems that exceed the size of the on-chip memory.
Raw architectures achieve the performance of FPGA-based custom computing engines by exploiting fine-grained parallelism and fast static communication, and by exposing the low-level hardware details to facilitate compiler orchestration. Unlike FPGA systems, however, Raw machines support instruction sequencing and are more flexible because the execution of a new operation can be accomplished merely by pointing to a new instruction. Compilation in Raw is faster than in FPGA systems because it binds commonly used compute mechanisms, such as ALUs and memory paths, into hardware, thereby eliminating repeated low-level compilations of these macro units. Binding of common mechanisms into hardware also yields better execution speed, lower area, and better power efficiency than FPGA systems.
The designer of an FPGA device or a Raw microprocessor is faced with the challenge of determining the best division of VLSI resources among computing, memory, and communication. This challenge is termed the balance problem. Furthermore, the designers of both an FPGA and a Raw device must address the grain size issue, that is, whether to implement a few powerful tiles or many small tiles, each with lower processing power. This paper presents SimpleFit, an analytical framework that designers can use to reason about the division of resources in a VLSI chip. Although our analysis in this paper is focused on the Raw microprocessor, the analysis generalizes to other chip multiprocessor architectures. Our objective in this paper is to gain more insight into cost-performance optimal designs given a fixed amount of resources.
The framework presented in this paper focuses on the performance requirements of applications, introduces an architecture model, a cost model, and a performance model for applications, and defines an optimization process to search for performance optimal designs given a cost constraint.
The architecture model defines an architecture based on parameters that include the number of tiles P, the processing power of each tile p, the amount of memory in each tile m, the communication bandwidth out of each tile c, and a few other parameters, as shown in Section 2. The cost model estimates the cost in terms of chip area of realizing the given architecture with the specified set of parameters.
The performance model estimates the runtime of each application as a function of the problem size. Performance estimation is based on both 1) a characterization of the application and its algorithms in terms of its requirements, including processing steps, memory, and communication volumes and 2) the architecture model.
Together with a cost constraint defined in terms of the cost model, our performance model allows us to perform a constrained optimization on the independent architectural variables. We can, for example, compute the points or contours in the architectural space that correspond to the best performance for a given cost, lowest cost for a given level of performance, or best efficiency defined by performance/cost.
The algorithms used in this study have been adapted to the Raw system architecture illustrated in Fig. 1 by first partitioning them into subproblems that can fit within the Raw chip. Each subproblem is loaded from the external global RDRAM memory into the set of local memories in the tiles. Computation occurs on the subproblem and the results are stored back into external RDRAM. All the subproblems are visited (possibly multiple times) in sequence. The algorithmic slowdown due to blocking the problem in this manner is accurately modeled. Each subproblem is solved in parallel with a blocking algorithm. Applications studied in this paper include Jacobi Relaxation, Dense Matrix Multiply, Nbody, FFT, and Largest Common Subsequence.
The specific contributions of this paper include:
• a general framework for reasoning about the design space of VLSI-based parallel architectures, including models for cost and performance,
• insights on optimal grain size and balance in Raw microprocessors.
The remainder of this paper is organized as follows: Section 2 describes the three models developed in this paper: the performance model, the cost model, and the application model, and gives a qualitative analysis of cost and performance. Section 2.7 formulates the optimization process based on previous model assumptions. Section 3 gives our experimental results and Section 4 discusses related work. Section 5 concludes the paper.
FRAMEWORK
This section presents the analytical framework used in analyzing candidate designs in terms of their grain size and balance. We first start with a motivation for a study of grain size issues.
Motivation
Two key questions in the design of a Raw microprocessor involve the grain size of its tiles and their balance. The grain size reflects the sizes of various components inside the tiles such as memory, processing, and communication. A very coarse grain design would involve multiple issue superscalars for processing and large local memories. Very fine grain designs would be similar to contemporary FPGAs and include a few bits worth of logic and memory within each tile and a few wires connecting the individual tiles. Designs with a moderate grain size would involve very simple single-issue processors in each node.
Fig. 1. Each Raw tile contains a simple RISC-like processor, an SRAM memory for instructions and data, and a switch. The tiles are interconnected in a 2D mesh network that is orchestrated by the compiler. The switches themselves contain some amount of SRAM so that the compiler can load into the switch a program that multiplexes the interconnect in a cycle-by-cycle fashion, just as in a virtual wires-based multi-FPGA system.
Grain size and balance play a large part in determining the efficiency or performance per unit cost of a machine assuming a fixed total budget. If an engineer builds a small number of very large (coarse grain) nodes, a point of diminishing returns is reached where node performance increases very slowly (if at all) as node size is increased. On the other hand, building a large number of very small (fine grain) nodes will also result in diminishing returns as the communication costs dominate. The highest efficiency occurs at an optimal point between the two extremes. Similarly, as observed by Kung and Yeung et al. [18], [12], there is an optimal balance of resources between the processor, memory, and the communication components within a node.
While there has been much debate on this topic, few concrete results have been reported. Machine balance and grain size continue to be determined more by convenience and market forces than by engineering analysis. Our primary motivation in undertaking this study is to provide an analytical framework to enable engineers to obtain insights into the trade-offs in choosing various machine parameters.
Let us first provide an overview of the framework. Throughout the paper, execution times are measured in machine cycles, information in units of machine words, and cost in SRAM bit equivalents (Sbe). As discussed in Section 2.4, an Sbe is the area occupied by one bit of SRAM memory.
Overview of the Framework
Let us overview our analytical framework, illustrated in Fig. 2, by considering a simple machine model. In its simplest form, a parallel machine can be characterized by the number of tiles or nodes, P, the processing power of each node, p (operations per cycle), communication bandwidth of each node, c (words per cycle), and the amount of local memory per node, m (words).
For a given problem size and partitioning strategy, an application can be described by its processing, communication, and memory requirements per node, or $R_p$ (operations to be performed), $R_c$ (words to be communicated), and $R_m$ (words). The model used in the paper is not complex and is discussed in Section 2.3.
The performance of the application in terms of its runtime is derived from the application requirements and the architectural model. If the processing time ...
We use cost models $K_p(p)$, $K_c(c)$, $K_m(m)$ to map the machine parameters $P, p, c, m$ into costs. In other words, the processor cost model $K_p(p)$ provides the area cost of implementing a processor that can perform p operations per cycle. The total machine cost for a P processor machine is then $K = P\,(K_p + K_c + K_m)$.
Fig. 2. Analytical framework. The key components of the framework are the models and the optimization process. Given an application with an associated problem size and a fixed silicon area budget, the constraint equations are derived for the optimization. The nonlinear optimization process searches the machine configuration space that gives the minimal runtime for the application.
Given an application with a fixed problem size N and an area budget B, a constrained optimization problem is defined with the objective of finding the optimal machine configuration that gives the smallest runtime for that budget. In other words, the framework finds the set of architectural parameters $P, p, c, m$ that yield a minimum value for T given that the cost K cannot exceed the available budget B. Or, more formally:
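As a sketch consistent with the sentence above, the problem can be written as follows. Writing T and K as functions of the four architectural parameters is an assumption made here for illustration; it is not necessarily the authors' exact notation.

```latex
\min_{P,\,p,\,c,\,m} \; T(P, p, c, m)
\quad \text{subject to} \quad
K(P, p, c, m) \le B
```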
As discussed in more detail later, the optimization process is sped up by a set of balance constraints. The balance constraints state that, for the optimal solution, the computation time and communication time must be equal and that the physical memory should fit the problem. The balance constraints greatly reduce the size of the search space and thus the complexity of the optimization procedure.
The following sections discuss each of the components of the framework and the optimization process in more detail.
Architecture Model
This section discusses parameters necessary for architecture characterization. Although several approaches to modeling the performance of a parallel computer have been proposed in the literature [2], [3], none are completely suited to modeling fine-grain parallel systems built on a chip. Fig. 3 shows our characterization of a Raw system using the parameters described below. Our machine characterization differs from previous ones in the sense that it captures both local and global communication performance and includes software overheads.
We choose as independent parameters the number of nodes, P; the processing power per node in operations per cycle, p; the memory per node in words, m; the local communication bandwidth per node in words per cycle, c; the software overhead for communication in cycles, o; the single hop latency of the network, l; the global off-chip communication bandwidth per chip in words per cycle, g; and the RDRAM latency expressed in cycles, $l_g$.
As an example, sending a local intertile message of length L words first involves spending o cycles in launching the message. The message header word travels, on average, a distance of $k_d$ hops in the network using l cycles per hop. Because the bandwidth out of a node is c words per cycle, each subsequent message word takes 1/c cycles to enter the network. The receiving tile also spends o cycles receiving the message. Thus, the communication time per message is:
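To make the accounting concrete, the terms listed above can be summed in a short Python sketch. The function name and the choice to charge every one of the L words at 1/c cycles are assumptions made for illustration; this is not presented as the paper's exact equation.

```python
def local_message_time(L, o, k_d, l, c):
    """Estimated cycles to deliver one local message of L words.

    o   : software overhead in cycles, paid once by sender and once by receiver
    k_d : average distance travelled, in hops
    l   : latency per hop, in cycles
    c   : bandwidth out of a node, in words per cycle
    Assumption: all L words are charged 1/c cycles each.
    """
    send_overhead = o
    header_latency = k_d * l
    injection_time = L / c
    receive_overhead = o
    return send_overhead + header_latency + injection_time + receive_overhead


# Illustrative values only: 8 words, 3-cycle overhead, 4 hops at 1 cycle/hop, 2 words/cycle
print(local_message_time(L=8, o=3, k_d=4, l=1, c=2))  # 3 + 4 + 4 + 3 = 14.0
```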
Writing a block of data to the off-chip RDRAM memory first involves an overhead o associated with starting up global communication. The latency of accessing the DRAM will be the sum of the latency of traversing the interconnection network in one dimension ($k_d\,l/2$) plus $l_g$, the DRAM latency. (We divide by two to indicate that RDRAM memory messages do not have to traverse both the X and Y network dimensions.) The transfer rate of subsequent words will be the minimum of the local communication bandwidth and the global communication bandwidth per tile (since multiple tiles might be writing external memory). Thus, the time for writing a block of size L to memory is:
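The same accounting can be sketched in Python for a global write. The parameter `g_tile`, the global bandwidth available to one tile, is a hypothetical name; how the chip-wide bandwidth g is shared among tiles is left open in this sketch, and the sum below is an illustration of the terms described above rather than the paper's exact equation.

```python
def global_write_time(L, o, k_d, l, l_g, c, g_tile):
    """Estimated cycles to write a block of L words to off-chip RDRAM.

    g_tile : global bandwidth available to this tile, in words per cycle
             (hypothetical parameter introduced for this sketch)
    """
    startup = o
    # Half the average network distance (one dimension) plus the DRAM latency
    access_latency = k_d * l / 2 + l_g
    # Words stream at the slower of the local and per-tile global bandwidths
    transfer_rate = min(c, g_tile)
    return startup + access_latency + L / transfer_rate


# Illustrative values only
print(global_write_time(L=64, o=3, k_d=8, l=1, l_g=100, c=4, g_tile=0.5))  # 235.0
```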
Communication locality can be captured at the application level by accounting for it in the average distance that messages travel ($k_d$). We also ignore contention effects (e.g., resource and network contention) because we assume that the compiler can statically orchestrate communication events much as in a virtual wires system. We also use a conservative approach in defining applications' communication requirements.
Cost Model
We use silicon area as a measure of cost. Silicon area reflects the fundamental cost of building a component and is a good basis for comparing alternatives, as opposed to market price, which includes many artificial factors. The cost model is based on CMOS microprocessors, SRAM and DRAM memories, and a mesh interconnection technology. For simplicity, we consider the off-chip RDRAM memory to be free. Although our assumptions may change specific numerical results, the methodology for determining balance and grain size remains the same.
We normalize cost to units of SRAM bits, viz. one bit of SRAM takes one unit of area and, therefore, one unit of cost. We express the cost of all other components in terms of SRAM bit equivalents (Sbe).
We use the notion of relative density to enable the normalization of logic, memory, and communication areas into units of SRAM bit equivalents. Relative density captures the area impact of wires and more irregular structures, such as logic areas versus the more regular memory arrays. Although an SRAM bit comprises typically four to six transistors, we observe that the area it occupies is similar to the area of a logic transistor in a CPU die because of its regular structure and, therefore, its higher relative density. Thus, the chip size expressed in Sbe units is equivalent to the total number of transistors for logic areas (Table 1).
A DRAM bit is realized with one transistor and the area it occupies is 10-16 times smaller than an SRAM bit area. We arrived at this conclusion as the typical SRAM cell requires a wire grid of dimension $3 \times 4$ compared to a DRAM cell implemented on the intersection of two wires. Factors such as the number of metal layers may change the relative density relations as more layers increase the density of logic areas. The logic area density is also reduced because of the greater amount of area devoted to wiring.
The following cost functions are based on empirical observations and statistics gathered on current implementations of superscalars and router chips:
Processor Cost $K_p$. The processor cost model computes the area cost as a function of p. We find it convenient to relate p to cost $K_p$ using an intermediate parameter i, which is the number of issue units in the processor. Thus, $i = 4$ implies a 4-way superscalar with a maximum of four operations per cycle.
We model the relationship between processing cost and instruction issue structure as a quadratic curve, which captures the cost increase due to multiple issue superscalars:
In the above, a cost of $B_p$ is required to achieve a single-issue processor with $i = 1$.
We relate processing power p and the number of issue units i using:
This model captures the relationship between performance and cost due to more aggressive clock rates of lower issue processors. Typically, single issue designs obtain 1.6 to 2 times faster clock rates than corresponding high-issue rate processors. It also captures the fact that it is easier to obtain performance close to the theoretical maximum cycles per instruction in lower-issue processors as they require a smaller amount of instruction-level parallelism in applications.
Studying the layout of some simple RISC processors [13], [21], [20], [15] leads to a base cost of $B_p = 2.5 \times 10^5$ transistors. That is, a minimal single-issue 64-bit processor can be built in the area of 250K SRAM bits or with 250K logic transistors. A cost constant of $K_{ps} = 4 \times 10^5$ Sbe was arrived at from the study of some high-end processors [29], [27], [28], [15].
For validation, Fig. 4 compares the number of transistors dedicated to logic in several superscalar microprocessors with our cost model for $K_p(i)$. We observe that, for higher-issue superscalars, the variation in the number of transistors dedicated to logic areas is large. This variation is caused by important differences in implementation of components like issue structure, scheduling, and memory interfaces. A more detailed cost model for superscalars may also deal with the cost impact of dynamic or static issue structures, scheduling, and memory interfacing.
Memory Cost. We approximate memory cost as a linear function of capacity m:
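A linear form consistent with the symbol definitions in the paragraph that follows is sketched here. It is an illustration built from those definitions (a fixed overhead plus a per-word cost) and is not claimed to be the authors' exact equation.

```latex
K_m(m) = B_m + K_{ms}\, m
```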
Here, m is the memory size in words, $K_{ms}$ is the cost per word of memory, and $B_m$ is the fixed overhead cost of the memory. This overhead includes logic for translation, address decode, data multiplexing, and memory peripheral circuitry. For our calculations, we assume that wordsize = 64 and the overhead, $B_m$, is $5 \times 10^4$.
Communication Cost. The main components of a typical router comprise a routing module, a crossbar arbiter, and input/output modules, often including large FIFOs. We observe that most of the area in current router chips is taken up by FIFOs and pad frames (circa 20 percent). Crossbar logic usually occupies a small part of the total area.
The amount of FIFOs depends on such factors as the number of virtual channels. The area of queues reflects the size of message flits and a length which is typically 16-20 flits. A flit is the number of bits transferred in one cycle and, therefore, it also equals c expressed in bits. One word per cycle communication bandwidth thus requires a flit size of one word. Although not necessary, we also assume the flit size is equal to the physical channel width. We denote the dimension of the network as n. The total number of bidirectional channels is then 2n. Our results focus on two-dimensional networks, so n = 2 for most of this paper. We have found that the area of routers is proportional to the number of queue sets used in implementing virtual channels, the flit size, the dimension of the network, and the length of the FIFOs. The cost function for the routers is described in the following equation:
In the equation above, $p_l$ is the length of the FIFOs and Q is the number of queue sets due to virtual channels. Our results use Q = 1. The communication cost factor, $K_{cs}$, is derived by fitting the cost function equation with the areas of routing chips shown in Table 2.
For our calculations, we use $K_{cs} = 25$. For example, a router with a 64-bit flit size and with one set of queues, each with a length of 16 flits, takes approximately the area of 125,000 logic transistors in our model.
The base area for a router, $B_c$, is estimated at $2.5 \times 10^4$. We arrive at this from a study of simple routers [17], [16], [13], [22]. Examples of routers with the number of transistors used in current implementations are shown in Table 2. The estimates using our communication cost model are also shown. The comparison indicates that our cost model reflects relatively accurately the area occupied by these routers, except for the RDT [14] router chip, which has more than half of its area devoted to a multicast mechanism module and a bit-map generator.
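The 64-bit router example can be checked with a short Python sketch. The product form used here (base area plus cost factor times queue sets, flit size, 2n channels, and FIFO length) is an assumption chosen because it matches the stated proportionalities and lands close to the quoted 125,000-transistor figure; the function and constant names are hypothetical.

```python
B_C = 2.5e4   # base router area in Sbe, as estimated in the text
K_CS = 25     # communication cost factor, as given in the text


def router_cost(flit_bits, n=2, queue_sets=1, fifo_len=16):
    """Router area in Sbe under the assumed product form.

    flit_bits  : flit size in bits (c expressed in bits)
    n          : network dimension, giving 2n bidirectional channels
    queue_sets : number of queue sets Q (virtual channels)
    fifo_len   : FIFO length in flits
    """
    channels = 2 * n
    return B_C + K_CS * queue_sets * flit_bits * channels * fifo_len


# One word (64 bits) per cycle, one queue set, 16-flit FIFOs, 2D mesh
print(router_cost(flit_bits=64))  # 127400.0, close to the quoted 125,000
```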
Global Communication Cost. We approximate global communication cost as a linear function of global off-chip communication capacity. The base area for global I/O, $B_g = 10^4$, is estimated to be somewhat smaller than a simple router area as no routing functions are necessary. The global communication bandwidth is limited by the maximum number of pins a packaging technology will allow. As current microprocessor packaging technologies use from 100 to several hundred pins, we assume that, in 10-12 years, packaging will allow no more than roughly 2,000 pins. The maximum possible global bandwidth is then $g_{max} = 2{,}000/\text{wordsize}$. The global communication cost factor, $K_{gs} = 10^5$, multiplied by the wordsize is approximately the cost in SRAM bit equivalents of one word per cycle of global I/O bandwidth:
Global Latency Cost. For simplicity, we assume this cost as a constant reflecting the more or less constant speed of DRAM access over time. $B_{lg}$ is estimated at $10^5$:
$$K_{lg}(l_g) = B_{lg}. \tag{8}$$
Total Cost of the System. The total cost of the system is equal to the sum of its components:
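A minimal Python sketch of this sum follows. It assumes, as the phrase "P processor machine" in the overview suggests, that the per-tile processor, router, and memory costs are replicated across all P tiles while the global I/O and global latency costs are paid once per chip. The function name and the example numbers are illustrative, not taken from the paper.

```python
def total_cost(P, k_p, k_c, k_m, k_g, k_lg):
    """Total chip cost in Sbe.

    k_p, k_c, k_m : per-tile processor, router, and memory costs
    k_g, k_lg     : chip-wide global I/O and global latency costs
    """
    return P * (k_p + k_c + k_m) + k_g + k_lg


# Illustrative values only: 1,000 tiles checked against a one-billion-Sbe budget
budget = 1e9
cost = total_cost(P=1000, k_p=2.5e5, k_c=3e5, k_m=2e5, k_g=1e8, k_lg=1e5)
print(cost, cost <= budget)  # 850100000.0 True
```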
Application Model
The application model contains functions and parameters that are necessary for application performance characterization. To predict the performance of an application with a particular machine configuration, we assume that the resource demands are uniform over time and that processing, local, and global communication can be completely overlapped. Some algorithms, such as those used in dynamic programming, also require the estimation of the algorithmic imbalance or the idle time due to synchronization overhead. Applications with several phases can be handled by dividing the application into its phases and characterizing each phase separately. Our assumption that processing, local, and global communication are overlapped imposes constraints on how the problem is partitioned and on the total amount of memory required. As we will show later, besides the memory needed to hold the problem, local and global communication buffers are required in order to be able to overlap communication times.
Our application model does not distinguish between different forms of parallelism and types of functional units, i.e., we assume that the parallelism available in the application can be utilized equally well in a multiple issue or in a multitile design.
We will exemplify the concepts of this section by analyzing the Jacobi relaxation problem. The requirements of the other applications considered in this paper are presented in Table 6. The Jacobi Relaxation problem is an iterative algorithm which, given a set of boundary conditions, finds discretized solutions to differential equations of the form $\nabla^2 A + B = 0$. Each step of the algorithm replaces the value at each node of a grid with the average of the values of its nearest neighbors.
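The relaxation step described above can be sketched in plain Python. This is a minimal sequential illustration, assuming a rectangular grid stored as a list of lists with fixed boundary values and no source term; it is not the blocked parallel version analyzed in the paper.

```python
def jacobi_step(grid):
    """Return a new grid in which each interior point is replaced by the
    average of its four nearest neighbors. Boundary values stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Three additions and one multiplication per point
            new_grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                     + grid[i][j - 1] + grid[i][j + 1])
    return new_grid


# A 3x3 grid with boundary values of 1.0 and an interior value of 0.0
grid = [[1.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0]]
print(jacobi_step(grid)[1][1])  # 1.0
```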
The original Jacobi problem defined by a grid of size N is partitioned into subgrids of size $N'$, as illustrated in Fig. 5. Each subgrid or subproblem is solved by storing the subproblem of size $N'$ in the internal memory of a Raw microprocessor and running a blocking relaxation algorithm. After a given number of phases, the subgrid is stored in external RDRAM and the next subgrid is loaded. Clearly, a given subgrid has to be loaded and operated upon multiple times to reflect the effect of synchronization with the values computed in neighboring subgrids.
TABLE 2. Important Cost Factors for Router Chips
In the Type column, we give the number of virtual channels where necessary, e.g., 2v means two virtual channels. The second and third columns compare the actual number and the estimated number of transistors. With Flits, we show the flit size or the number of bits transferred in one cycle. $p_l$ shows the length of FIFOs in flits and Q shows the set of queues in the design, often reflecting the number of virtual channels.
Because values from neighboring subgrids do not impact the relaxations on a given subgrid stored in the microprocessor, the number of iterations needed for convergence increases. We choose $i_s = \sqrt{N'}/2$ as the number of iterations after which resynchronizations must occur between subproblems. Starting with some boundary conditions, this means propagating border values to all points in a subproblem. We chose the total number of iterations as being $i_t = N^2$, giving an error reduction factor of 10. Let us analyze the requirements of this application.
Required Processing per Node $R_p$. This requirement reflects the total amount of computation required per Jacobi node given the algorithmic assumptions described above. The total number of operations for each point is three additions and one multiplication:
Required Amount of Memory Words per Node $R_m$. The required memory comprises the memory required to solve the subblock of size $N'$ and the memory buffers needed to overlap local and global communication:
Required Number of Words of Local Communication per Node $R_c$. The required local communication is the total amount of data sent or received during the whole execution time. For any iteration, each processor requires the bordering points from its neighbor processors:
Required Local Communication Events $R_o$. These events incur a software penalty for initiating a communication step. This requirement reflects the total number of times a local send or receive is issued:
Required Latency of Events $R_l$. Reflects the total number of times a local send is issued:
As an example, if the number of operations that must be performed is $R_p$ and the processing power is p operations per cycle, then the processing time is simply $R_p/p$. Similarly, if the number of events incurring the message overhead (o cycles) is $R_o$, then the time wasted in message overhead activity is $R_o \cdot o$.
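The two examples in this paragraph translate directly into Python. The names `r_p` and `r_o` are chosen here for the required operation count and the required number of overhead events; the numbers are illustrative only.

```python
def processing_time(r_p, p):
    """Cycles spent computing: required operations / operations per cycle."""
    return r_p / p


def overhead_time(r_o, o):
    """Cycles lost to message overhead: number of events * o cycles each."""
    return r_o * o


# Illustrative values only
print(processing_time(r_p=1_200_000, p=2))  # 600000.0
print(overhead_time(r_o=4_000, o=3))        # 12000
```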
The Optimization Problem
In this section, we describe the optimization procedure in more detail. The problem solved is the following constraint-based nonlinear optimization problem:
Given: A fixed chip area or budget B and a problem size N. 
Constraints.
1. Budget B must be greater than or equal to the total cost. The total cost of the system is computed as the sum of its components:
$$B \geq P\,(K_p + K_c + K_m) + K_g + K_{lg}. \tag{19}$$
Fig. 5. Jacobi Relaxation. The problem of size N is first partitioned into subproblems of size $N'$. Each subproblem is solved with blocking on P processors. Each processor receives bordering data from its four neighbors and sends its data along borders to its neighbors. Subproblems are resynchronized after a number of iterations.
It is expedient to use an additional set of balance constraints, as given below, when the communication and computation are overlapped. The balance constraints focus the search for the optimal solution to balanced machine configurations. In other words, the second and third equations state that communication and computation times should be equal. If they are not equal, we can take resources from the faster component and give them to the slower component to improve runtime. The last balance constraint states that the memory should fit the problem. If the memory is larger than this amount, it can be reduced without impacting performance. When local and global communication times are equal and memory fits the problem, the machine configuration is balanced for the application. In a balanced machine, each resource is utilized to its fullest. The balance constraints greatly reduce the search space and, thus, the complexity of the optimization procedure. 
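The balance argument can be illustrated with a toy Python sketch. It assumes only two resources, processing p and communication c, with hypothetical linear unit costs `a` and `b`, and a runtime equal to the larger of the two overlapped times. The cost functions in this paper are not linear, so this is an illustration of the reasoning only, not the optimization procedure itself.

```python
def runtime(r_p, r_c, p, c):
    """Runtime when processing and communication fully overlap."""
    return max(r_p / p, r_c / c)


def best_split(r_p, r_c, a, b, budget, steps=10000):
    """Grid search: spend a*p on processing and the remainder on communication.

    a, b : hypothetical cost per unit of p and per unit of c
    Returns (runtime, p, c) for the best split found.
    """
    best = None
    for k in range(1, steps):
        p = (budget / a) * k / steps
        c = (budget - a * p) / b
        t = runtime(r_p, r_c, p, c)
        if best is None or t < best[0]:
            best = (t, p, c)
    return best


t, p, c = best_split(r_p=1000, r_c=200, a=4, b=1, budget=100)
# At the best split the two times nearly coincide, as the balance constraints state
print(round(t, 2), round(1000 / p, 2), round(200 / c, 2))
```

Under these linear costs the balanced runtime can also be computed directly as (a·r_p + b·r_c)/budget, which gives 42 cycles for the values above; the grid search lands within a small step of that value.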
ANALYSIS
In this section, we study a set of applications in the context of the framework presented. The applications are: Jacobi Relaxation, Dense Matrix Multiply, Nbody, FFT, and Largest Common Subsequence. We chose these applications because they are diverse and require conflicting machine performances to run efficiently. The optimization procedure has been implemented in Mathematica. We use a three cycle software overhead, a 100 cycle DRAM access latency, and assume a MIPS R2000 ISA for instruction latencies. We also counted an 8 Kbyte SRAM-based instruction and data cache per node. In all the experiments, we used a budget of one billion SRAM bit equivalents or the area required for one billion logic transistors. This budget is achievable in 10-12 years as projected by the Semiconductor Industry Association (SIA) given a 10-20 percent growth rate per year of die areas and a growth rate in transistor counts of between 60 and 80 percent per year due to increasing densities.
Application Specific Results
Fig. 6 shows the optimal division of chip resources for the various applications as a function of problem size. The optimal amount of each resource is shown in greater detail in Fig. 7, Fig. 8, Fig. 9, Fig. 10, and Fig. 11.
Perhaps the most important result from Fig. 6 is that the amount of area devoted to processing and local communication is more or less constant at about 75 percent for all the applications and problem sizes. There is a variance, however, across the programs in terms of how much of this area should be devoted to computation versus communication. These two components could be traded, for example, at runtime in true FPGA systems. Although the 25 percent area dedicated to memory is less than what we have in today's microprocessors, it is still a significant portion of the chip. Future applications, such as media and streaming applications will likely require even less memory because fast local and global communication can eliminate the need for buffering an intermediate state.
The global communication bandwidth of 30 words per cycle is the maximum achievable given a packaging technology allowing 2,000 pins. The only application that is I/O limited and requires this bandwidth is FFT. All the other applications have a negligible area allocated to global communication. The total chip area for global communication is relatively small, even for FFT. Therefore, providing the maximum possible global bandwidth is not a bad idea in a final configuration.
As we can see, the relative communication area required is small in applications such as Jacobi and LCS as they also show good spatial locality. These applications can use most of the resources for processing. FFT and Nbody require the largest communication area with an optimal communication bandwidth between four and five words per cycle. The division between processing and memory areas is uniform.
The matrix multiplication based on Connor's memory efficient blocking algorithm gives the most uniformly divided configuration. For this application, memory, local communication, and processing areas are approximately equal.
The amount of memory per node obtained is relatively small compared to modern day multiprocessors in all applications. The reason is twofold. First, the total amount of memory in the entire Raw chip is still quite large since it is the product of P and m. Second, fast local communication obviates the need for huge amounts of local memory. The matrix multiplication required the largest amount of memory giving a total of 24 Kbytes per node. The smallest memory is required for Nbody.
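As a quick check of the first point, the aggregate memory is simply the product of P and m. The sketch below evaluates it for the recommended configuration of about 1,000 tiles with 20 Kbytes each; the function name is hypothetical and 1 Kbyte is assumed to be 1,024 bytes.

```python
KBYTE = 1024  # assumption: 1 Kbyte = 1,024 bytes


def total_memory_bytes(num_tiles, kbytes_per_tile):
    """Aggregate on-chip memory: the product of P and m, in bytes."""
    return num_tiles * kbytes_per_tile * KBYTE


total = total_memory_bytes(num_tiles=1000, kbytes_per_tile=20)
print(f"{total / (KBYTE * KBYTE):.1f} Mbytes on chip")
```

Under these assumptions the chip as a whole holds roughly 19.5 Mbytes, even though each tile holds only 20 Kbytes.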
For all the applications, the optimal processing power obtained is equivalent to a single-issue processor. The total number of processors P varied between 1,100 and 2,310 for large problem sizes.
Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar. Accordingly, for the applications studied, assuming a one billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1,000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per node, three to four words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for the applications studied.
Sensitivity of Grain Size
The framework helps answer many other questions about machine configurations. Let us study the sensitivity of performance to the machine configuration near the optimum machine configuration point. This study is useful to determine a machine configuration that is robust across many applications. As an example, let us determine the machine configuration with the smallest number of nodes whose performance is within 25 percent of the optimal configuration.
Results are shown in Table 3. For each application, the first row gives the optimal configuration. The second row gives the configuration with the smallest number of nodes whose performance is within 25 percent of the optimal. As we can see, balanced machine configurations with fewer nodes usually take advantage of the parallelism available in superscalar processors. However, for all the applications studied, the configuration that gave the best performance used nodes based on at most 2-way superscalars.
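The selection described above can be sketched as a filter followed by a minimum. The code below assumes that "within 25 percent of the optimal" means a runtime of at most 1.25 times the best runtime; the function name and the candidate values are hypothetical and are not taken from Table 3.

```python
def smallest_robust_config(configs, tolerance=0.25):
    """Return the configuration with the fewest nodes whose runtime is at most
    (1 + tolerance) times the best runtime found in `configs`.

    `configs` is a list of (num_nodes, runtime) pairs.
    """
    best_runtime = min(runtime for _, runtime in configs)
    limit = (1.0 + tolerance) * best_runtime
    eligible = [c for c in configs if c[1] <= limit]
    return min(eligible, key=lambda c: c[0])


# Hypothetical (nodes, runtime in cycles) pairs, for illustration only.
candidates = [(2048, 100.0), (1024, 110.0), (512, 124.0), (256, 160.0)]
print(smallest_robust_config(candidates))
```

Here the 256-node candidate is rejected because its runtime exceeds the 125-cycle limit, so the 512-node candidate is returned.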
Sensitivity to Different Processor Cost Model Assumptions
In the analysis presented in this section, we used a quadratic cost model for the processors.
In the following, we analyze the sensitivity of optimal machine configurations to a slightly different processor cost model. Table 4 compares machine configurations obtained with two processor cost models. For each application and each parameter, two experimental values are shown. The first value corresponds to the quadratic processor model; the values in parentheses correspond to a processor cost function that grows with the issue rate $i$ as $(i-1)^{3/2}$ rather than quadratically, assuming a less dramatic impact on chip area with multiple issue designs. As Table 4 shows, the variations in the optimal balanced machine configurations are very small. The framework suggests simple processors even in the case of the less expensive processor model. The explanation is that the applications used in this study are highly parallel, the cost of local communication is very low in applications with good locality, and adding more functional units is still expensive (the processor cost function is nonlinear). Additionally, the impact of communication latency is further reduced if it is overlapped with computation.
TABLE 3
The first row of each application shows the global optimum and the second row shows the solution with the minimum number of processors and performance within 25 percent of the optimal. The numbers in parentheses show the performance degradation compared to the global optimum for the configurations with the minimum processors. Columns P to g represent the optimal machine configuration and columns $u_p$ to $u_{lg}$ are the chip sizes in percent of the total cost.
TABLE 4 Sensitivity of Optimal Machine Configuration to Processor Cost Model Assumptions
Breakdown of optimal machine configurations for three problem sizes and two processor cost models. The first model corresponds to a quadratic curve between processor cost and issue rate, in which the cost grows with the issue rate $i$ as $(i-1)^{2}$. The numbers in parentheses are for a processor cost function that grows as $(i-1)^{3/2}$, assuming a less dramatic impact on chip area with multiple issue designs. Columns P to g represent the optimal machine configuration.
Sensitivity to Communication Overlapping Assumptions
In the analysis presented in this section, we assumed that communication time can be completely overlapped with computation. We used this assumption because it significantly reduces the complexity of the optimization problem and it is a reasonable assumption for many regular applications (especially if they are statically scheduled). However, it is possible to extend the model with an extra parameter, the overlapping factor, to account for the situation when complete overlapping is not possible. We define the overlapping factor as the ratio between the overlapped and total communication times. As an example, to study the sensitivity of optimal resource partitioning to the overlapping factor, we determined the machine configuration assuming that only half of the communication time can be overlapped with computation. Results for two machine configurations are shown in Table 5. For each application and each parameter, two experimental values are given. The first value corresponds to the completely overlapped case; the values in parentheses correspond to an overlapping factor equal to 0.5. As Table 5 shows, the variations in the optimal balanced machine configurations are very small. The explanation is that none of the applications are memory bound, meaning that the extra memory required for local and global communication buffering in the overlapped case is not very significant. Similarly, the added communication cost when only half of the communication time is overlapped does not significantly affect the execution time and thus the optimal partitioning of resources.
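One way to picture the overlapping factor is the sketch below. The text defines the factor only as the ratio of overlapped to total communication time; the rule used here for combining it with computation time (the hidden part runs concurrently with computation, the exposed part is serialized) is an assumption made for illustration, and the function name and cycle counts are hypothetical.

```python
def execution_time(comp_time, comm_time, overlap=1.0):
    """Runtime when a fraction `overlap` of the communication time can be
    hidden behind computation and the remainder is serialized."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    hidden = overlap * comm_time
    exposed = comm_time - hidden
    return max(comp_time, hidden) + exposed


# Hypothetical times in cycles for one phase of an application.
for factor in (1.0, 0.5, 0.0):
    print(factor, execution_time(comp_time=100.0, comm_time=40.0, overlap=factor))
```

With a factor of 1.0 the expression reduces to the maximum of the two times, and with a factor of 0.0 it reduces to their sum; an overlapping factor of 0.5 falls between the two.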
Design Comparisons
The framework also allows us to compare competing designs for the same budget. As an example, let us compare two designs: 1) using on-chip SRAM and routers with 16-flit FIFOs and 2) using only a small SRAM cache and the rest of the memory in on-chip DRAM as well as small 2-flit FIFOs. We derive the performance/cost optimal configurations and look at application performance for different problem sizes.
Since DRAM densities are much higher than SRAM densities, we can have more memory per node in alternative (2). One problem in using DRAMs is that the access latency is higher than that of corresponding SRAMs. To reduce the impact of the latency, we include a small SRAM cache in each node and assume that the SRAM cache results in a near perfect hit rate (this assumption can be made because the total amount of SRAM per Raw chip is very large even for a small SRAM per Raw tile). Case 2 also has small FIFOs; with good static scheduling of the communication channels, the need for deep FIFOs is reduced.
TABLE 5
Breakdown of optimal machine configurations for three problem sizes and two overlapping models. The first overlapping model corresponds to the case when all communication latencies can be overlapped with computations. The numbers in parentheses correspond to the case when only half of the communication time can be overlapped with computation. Columns P to g represent the optimal machine configuration.
TABLE 6 Overview Application Requirements
The question is: How much do these changes impact the performance of applications given performance/cost optimal partitioning of resources in both cases? Fig. 12 shows the performance ratio between the second and the first designs. It is easy to see that the larger amount of on-chip memory in Case 2 results in significantly higher performance. Assuming a fairly large 10 percent local miss rate (between the local SRAM and local DRAM), an on-chip DRAM latency of 10 cycles, a one cycle SRAM latency, and 25 percent memory instructions, the performance improvement in Case 2 is reduced by 25 percent to a speedup of 1.12 to 2.62 for the applications studied.
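A back-of-the-envelope sketch shows how the memory parameters quoted above interact. The text does not give the formula it used, so the code assumes that every SRAM miss stalls a single-issue tile for the full DRAM latency and that the base rate is one cycle per instruction. The function name is hypothetical.

```python
def extra_cycles_per_instruction(mem_fraction, miss_rate, dram_latency):
    """Average stall cycles added to each instruction by local SRAM misses,
    assuming every miss costs the full DRAM latency."""
    return mem_fraction * miss_rate * dram_latency


penalty = extra_cycles_per_instruction(
    mem_fraction=0.25,  # 25 percent memory instructions
    miss_rate=0.10,     # 10 percent local miss rate
    dram_latency=10,    # on-chip DRAM latency in cycles
)
print(penalty)
```

Under these assumptions the misses add about 0.25 cycles to each instruction, which is of the same order as the 25 percent reduction quoted above; the calculation behind the published figure may differ.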
RELATED WORK
There are several research efforts that have defined a set of analytical models that allow estimation of system performance metrics. In this section, we briefly mention a few of these research works.
One of the first works incorporating technology, architecture, and packaging models in a framework is SUSPENS [7]. SUSPENS is a generic systems-level approach, covering the circuit and systems level of abstraction. The GENESYS framework [5] assimilates the entire hierarchical description of a microprocessor chip through a concise set of input parameters and projects its key performance metrics by engaging a set of interrelated models, incorporating both physical and empirical knowledge. SimpleFit is a much higher level framework than GENESYS and it has no physical models incorporated. Instead, SimpleFit incorporates an analytical model for application runtime behavior, enabling optimization of architectures for different applications.
The LogP model [2] is a simple parallel machine model intended to serve as a basis for developing portable parallel algorithms. Alexandrov et al. defined the LogGP [11] model as an extension of LogP to capture the large bandwidth requirements of applications using long message primitives. LoGPC [9] leverages the performance parameters of LogP and LogGP and extends the analysis with a more detailed model of the DMA pipeline and a network contention component. The LoPC model [8] extended the LogP model with a resource contention model. The performance model in SimpleFit uses a similar set of parameters for modeling the most important performance aspects of a tiled single chip system.
There has also been a lot of interest in analytical models for caching. Some of the earliest work in cache area modeling was done by Mulder et al. [6]. The area for a simple single-ported SRAM cell has been empirically found to be 0.6 register bit equivalents (RBE). This is comparable to the empirical estimation done in SimpleFit, where an SRAM bit area is assumed to be equivalent to a CPU logic bit.
Another direction where analytical modeling is used is for estimating and optimizing microprocessor power dissipation at the architectural level. Wattch [10], for example, is a recent architectural simulator that estimates CPU power consumption by using parameterizable power models.
SimpleFit leverages some aspects of the early work on grain size for multiprocessor systems by Yeung et al. [12]. It goes, however, to a greater level of detail and focuses on single-chip tiled architectures as opposed to multiprocessor systems.
CONCLUSIONS
This paper describes SimpleFit, a novel framework for reasoning about single chip tiled microprocessors, such as Raw, with replicated, fine-grain processing elements. The framework uses a machine characterization that considers processing, memory, local and global communication, and latency as separate machine resources. This is a unique characterization of machine space since it captures the effects of locality by treating local and global communication separately. The framework incorporates a cost model based on empirical observations and statistics gathered on current implementations of superscalars and router chips.
The framework recognizes the importance of balance in good design and integrates this idea with a cost and performance model to provide a useful design tool. Having provided this framework, this paper chooses a diverse application suite in order to exercise the framework and to address some general questions in parallel computer design. More specifically, it addresses the questions of on-chip resource division in the MIT Raw microprocessor.
Although the optimal machine configurations vary for different applications, problem sizes, and budgets, the general trends are consistent. The framework further suggested that, for the applications studied and assuming a one billion logic transistor equivalent area, designers should build a system with about 1,000 nodes, 30 words/cycle of global I/O, 20 Kbytes of local memory per node, three to four words/cycle local communication bandwidth, and single-issue processors for optimal performance.
