This paper explores area/parallelism tradeoffs in the design of distributed shared-memory (DSM) multiprocessors built out of large single-chip computing nodes. In this context, area-efficiency arguments motivate a heterogeneous organization consisting of a few nodes with large caches designed for single-thread parallelism, and a larger number of nodes with smaller caches designed for multi-thread parallelism. The quantitative performance of such an organization is reported for a set of homogeneous multiprocessor programs from the SPLASH-2 benchmark suite. These programs are mapped onto the heterogeneous processors without source code modifications via static thread assignment policies. Simulation-based analysis is used to compare the performance of heterogeneous and homogeneous DSMs that occupy the same silicon area. The analysis shows that a 4-node heterogeneous DSM with 21 processors outperforms its homogeneous counterpart with 4 processors by an average of 36% for the studied multiprocessor workload, while having the same performance for sequential codes. A sensitivity analysis based on a factorial design experiment is used to study the implications of processor, memory, and network heterogeneity on the overall cost and performance of a heterogeneous DSM. The studied benchmarks are affected, on average, primarily by heterogeneity in processor performance (59.3%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%).
Introduction
The predicted advent of billion-transistor chips [25] will enable the implementation of multiprocessors in a single chip. Large multiprocessors will then be possible by using single-chip multiprocessors as the building blocks. For a given silicon area (i.e., budget), the question arises as to how to design and organize such future multiprocessors. In this context, this paper shows quantitatively that heterogeneous distributed shared-memory multiprocessor designs outperform their homogeneous counterparts in the execution of unmodified multiprocessor workloads.
At a fundamental level, proposed billion-transistor chip designs differ in how resources are allocated to exploit parallelism. From a software perspective, it is appealing to devote circuit area to complex structures capable of enhancing the performance of a single thread of code [22, 17, 26, 5] because unmodified sequential binaries can then be run with high performance. However, uniprocessor architectures that aggressively exploit instruction-level parallelism (ILP) require increasingly area-expensive structures.
Designing area-efficient structures that are replicated to expose parallelism to multiple threads of code [13, 16, 11] is appealing in terms of hardware design simplicity and efficiency. A chip-multiprocessor (CMP) exploits the area of a very large die by replicating smaller processing units and caches. Replication, in conjunction with design simplicity, allows for high clock rates of individual processing units and larger aggregate issue width due to more efficient area utilization [13].
Combining both ILP and CMP designs can potentially lead to area-efficient multiprocessors capable of fast execution of both sequential and parallel codes. In this paper, a heterogeneous design that meets the goals of high area-efficiency and good performance of low-parallelism tasks is analyzed in the context of an implementation that uses multiple multiprocessor chips. The design consists of a hierarchy of processors and memories that includes a large number of simple processors for parallel computation as well as a few complex processors for fast execution of sequential and/or moderately parallel code (Figure 1B). Such an organization, also called HPAM (Hierarchy of Processor-And-Memory), was proposed and studied in [2, 3]. Previous studies showed that, for message-passing programs with one or more degrees of parallelism, HPAM machines have higher cost-efficiency than conventional designs. However, the shared-memory paradigm has become increasingly important for parallel processing due to the availability of low-cost, bus-based multiprocessor nodes [4, 9] and of distributed shared-memory (DSM) protocols [29, 23]. In particular, current trends in microprocessor design [1] suggest that microprocessors of the future will have enhanced support for multiprocessing and directory-based coherence protocols. The question arises as to whether a heterogeneous DSM (HDSM) organization would be able run-time determined degree of parallelism (DoP).
In summary, this paper differs from past work in several important ways: a shared-memory heterogeneous implementation is assumed and simulated; unmodified, thread-based shared-memory multiprocessor programs are used for quantitative evaluation; and these programs are written in a style that hides run-time variations of parallelism. This paper makes two main contributions. The first one is a simulation-based performance analysis of HDSMs executing unmodified, shared-memory multiprocessor programs. To this end, three different static assignment schemes of homogeneous programs to heterogeneous nodes are quantitatively studied. The main conclusion from this analysis is that the HDSM configuration outperforms an equal-area homogeneous counterpart by an average of 36% for multiprocessor workloads. Another conclusion is that the static thread assignment that maximizes performance depends on application characteristics, particularly communication and synchronization. The second contribution is a quantitative assessment of how heterogeneity in the design of the processor, memory, and network subsystems impacts the performance of HDSM machines. One of the main conclusions is that the performance of HDSMs has low sensitivity to the speed of memories in the highly parallel levels.
The quantitative results reported in this paper were obtained through simulation of shared-memory parallel scientific benchmarks from the SPLASH-2 suite [30]. Benchmarks are simulated individually to study single-program parallel speedup. The environment used in the performance studies is based on a modified version of the Wisconsin Wind Tunnel-II multiprocessor simulator [10] that supports heterogeneity. The performance criterion is application execution time, measured as the number of simulated target machine cycles. Experimental tools and data used in the work described in this paper can be publicly accessed and/or reused through the Netcare infrastructure (http://www.ecn.purdue.edu/NETCARE).
The rest of this paper is organized as follows. Section 2 discusses the machine model and the experimental methodology used in the performance analysis. Section 3 presents a static approach to the mapping of homogeneous applications to HDSMs and describes the benchmarks that are used in the performance analysis and the simulation environment. Section 4 presents performance results and analysis. Section 5 discusses related work, and Section 6 presents conclusions.
Experimental methodology
Machine model
The machine model assumed in this paper is a DSM whose nodes have single-chip multiprocessors with integrated on-chip directory-based coherence support and memory controller. Both homogeneous and heterogeneous DSM configurations are considered. The heterogeneous machine consists of three levels connected via a point-to-point network (Figure 2). Each level, in turn, consists of one or more nodes. Each node consists of one or more processors, a remote-access device (RAD), and off-chip main memory, all connected by a bus. The RAD is responsible for providing a shared address space across the DSM nodes and maintaining coherence across remotely cached data.
The caches of CMPs are configured as in conventional symmetric multiprocessor shared-memory designs, where each processor has a private data cache [19]. All simulated caches are direct-mapped. Cache coherence is maintained via a bus-snooping protocol inside the node, and via the Stache [23] replication policy together with a conventional invalidation-based directory protocol across nodes. Stache employs part of each node's DRAM memory as a large, fully-associative cache [24]. This paper assumes that this protocol is handled by a dedicated hardware controller. Figure 2 depicts the heterogeneous DSM machine model assumed in this paper. The configuration shown in this figure is based on the processor-and-memory hierarchical design approach [2]: the number of processing elements increases from top to bottom levels, while cache sizes and the performance of processors and memories decrease from top to bottom levels.
Heterogeneity across machine levels is modeled by 
Performance analysis roadmap
The performance of HDSMs is studied from two different perspectives. In the first analysis (constant-area, Figure 1 dashed arrow), an HDSM is compared to homogeneous configurations under the assumption of constant total die area. The systems under comparison differ only with respect to the organization of the processing elements in each node. Thus, memory access times and network latencies are the same for these systems. The actual response times of memory and network transactions, however, may differ across heterogeneous nodes due to contention on both the memory bus and network interface (fully modeled in the simulations). This first analysis is divided into two parts. Subsection 4.1 compares an HDSM to a fast uniprocessor, and Subsection 4.2 presents a speedup analysis of the constant-area HDSM with respect to a homogeneous multiprocessor of the same area.
In the second analysis (constant-resources, Figure 1 solid arrow), the relative effect of heterogeneity of processor, memory, and network on the performance of HDSMs is determined. In this analysis, only heterogeneous configurations are considered. The configurations under comparison have the same number of resources (nodes, processors, caches, memories, and network), while the performance and capacity of each resource may vary across heterogeneous levels.
The use of slower parts in the parallel levels of an HDSM is motivated by potential savings in total system cost. The primary goal of the constant-resources analysis is to study the performance impact of using less expensive parts in the parallel levels of an HDSM. The design space of heterogeneous systems is studied by considering processor, cache, memory, and communication hardware as dimensions along which a machine can be made heterogeneous. To analyze the impact of heterogeneity along each dimension on application performance, a factorial design experiment is performed. The methodology and results of this analysis are presented in Subsection 4.3.
Heterogeneous node configurations
In the constant-area analysis, the machines under comparison differ only in the internal on-chip organization of processors and caches: the homogeneous nodes have a single, fast processor and large caches, while in the heterogeneous machine there are also nodes with more processors and smaller individual caches. The organizations of the computing nodes are such that die area is bounded by the area of the high-performance node of the homogeneous machine. Point-to-point inter-processor network connections and standard interfaces to main memory are assumed to be the same across the heterogeneous nodes. Since computing nodes have the same die area and the same external interfaces to memory and other processors, their packaging is assumed to be the same for all configurations.
Hence, in this analysis, heterogeneity is present only in terms of nodes_i, procs_i, clock_i, and $size_i. Table 1 shows the values assumed for these parameters across the heterogeneous nodes, as well as the values of the parameters corresponding to main memory and interconnection network latencies (accesst_i, latency_i,j), which are common to all nodes. The cache of the level-1 processor is dimensioned to hold the secondary working set of most SPLASH-2 [30] programs (Subsection 4.2 presents a small-cache analysis where cache sizes are smaller than the secondary working set). A base clock rate of 1 GHz is assumed for the level-1 processor. The main memory latency is assumed to be 56 ns, and the interconnection latency between two computing nodes is assumed to be 50 ns. The resulting average simulated ratio of remote to local memory latency is 4.8.
               level i=1     level i=2     level i=3
$size_i        1MB           128KB         64KB
accesst_i      56 clock_1    56 clock_1    56 clock_1
latency_i,j    50 clock_1    50 clock_1    50 clock_1
Table 1

The choices of number and performance of processors and of cache sizes shown in Table 1 are based on the assumption of bounded-area chips. The heterogeneous design has a single high-performance uniprocessor node and three chip-multiprocessor (CMP) nodes (Figure 1B). The configuration of the high-performance uniprocessor node is based on a next-generation, 100-million transistor microprocessor, the Alpha 21364 [1]. The specifications of this microprocessor include the processor core of an Alpha 21264 [8], 1.5MB of level-2 cache, memory controller, and directory protocol support, all integrated into a single die.
The configurations of processors and caches in the CMP nodes are motivated by the tradeoff between processor design complexity and performance discussed in Section 1. The actual parameters used to define the organization of these nodes are based on a case study of the area/performance tradeoff between two microprocessors from the Alpha family (Table 2). This case is presented in the remainder of this section.

Table 2 (columns: CPU, Clk, Transistors, Perf_int, Perf_fp)

The order-of-magnitude performance improvement (in terms of Spec95 results) observed for the 21264 is achieved via a combination of architectural improvements (higher ILP) and a higher clock rate. The better clock rate of the faster chip is highly dependent on advances in semiconductor process technology: the 21264 and 21064 under comparison are fabricated in 0.35 µm and 0.75 µm technology, respectively. It is therefore reasonable to assume that the simpler design can achieve a clock speed comparable to the ILP-enhanced design if fabricated under the same technology. The normalized indices (with respect to the 21064 processor, Table 2) account for clock speed differences to yield an estimate of the relative performance between the two microprocessors under the assumption of the same fabrication technology. By factoring out clock speeds, the normalized performance numbers provide an approximation to the speedup due to architectural enhancements.
The data in Table 2 show that a nine-fold increase in transistor count results in three-fold normalized speedups due to architectural enhancements. In other words, the same transistor budget of the high-performance processor can be used to design nine simpler engines with a third of the performance, under the assumption of equal clock speed. Based on these findings, the model of the CMP used in the third level of the heterogeneous machine conservatively assumes 8 processors, each with a quarter of the performance of the level-1 uniprocessor. The level-2 CMP is modeled assuming the same quadratic area/performance relationship, yielding 4 processors, each with half of the performance of the level-1 uniprocessor. In this paper, the heterogeneity of processor performance is modeled via scaling of the clock rate. Table 1 shows the simulation parameters used to represent the heterogeneous processors under this model (a detailed discussion of the scaling model is presented in Subsection 3.1).

The level-2 and level-3 CMPs are assumed to have private L2 caches for each processor. The sizes of the private caches are obtained by scaling down the size of the level-1 uniprocessor cache (1MB). The scaling model assumes that the CMPs with 4 and 8 processors have private caches of sizes 1MB/8 and 1MB/16, respectively, effectively reducing the aggregate on-chip cache size by a factor of two. This assumption is conservative in accounting for potential increases in interconnection requirements for the multiprocessor design, since the large on-chip 1MB cache accounts for the majority of the die area of the high-performance uniprocessor [1]. Figure 1B depicts

These homogeneous applications are developed under a single-program, multiple-data (SPMD) model. Parallelism is expressed via PARMACS directives. In this model, the application distributes the workload evenly across N threads, which are forked after an initialization phase, executed in parallel, and joined at the end of execution.
Hence, these applications have been designed to exhibit a single DoP during their parallel execution, given by the total number of threads spawned. When a given application is executed in a homogeneous machine, each processor is assigned the same number of threads (typically one, for processors that do not have hardware multi-threading support). If the same unmodified application is executed across heterogeneous processors, the workload remains equally distributed across N threads; thus, threads assigned to slower processors may take longer to complete than those in faster processors, and fast processors may become idle while waiting for data and/or synchronization from threads in slower processors.
Two solutions may be applied to increase the utilization of faster processors in this scenario: one is to redistribute the work across threads by modifying the application code, and the other is to redistribute the work by assigning more threads to more powerful processors. While the first solution may involve extensive code analysis, the second solution (which is used in this paper) can be implemented with little programming effort and/or operating system support. However, if the homogeneous thread assignment is one thread per processor, the second solution implies the creation of a larger number of threads. This implies that the amount of communication and synchronization may increase for the same workload.
In order to map the homogeneous applications onto a heterogeneous architecture without requiring code analysis, and to investigate possible performance advantages of heterogeneous thread assignment schemes, three static thread assignment policies were considered:
1. single-thread assignment: assigns a single thread to each processor in the machine.
2. virtual-processor assignment: assigns VP_i threads to a physical processor i, where VP_i is the performance ratio between processor i and the slowest processor in the system.
3. single-level assignment: assigns a single thread to each processor of a machine level, while the remaining levels do not participate in computation.
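The three policies can be sketched as functions that return the number of threads placed on each processor. This is a minimal illustration under stated assumptions, not the simulator's implementation: the `Processor` record, the function names, and the rounding of VP_i to the nearest integer are all introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Processor:
    level: int       # machine level of the processor (1 = fastest level)
    rel_perf: float  # performance relative to the fastest processor


def single_thread(procs):
    """One thread on every processor in the machine."""
    return [1 for _ in procs]


def virtual_processor(procs):
    """VP_i threads on processor i, where VP_i is the performance ratio
    between processor i and the slowest processor (rounded, an assumption)."""
    slowest = min(p.rel_perf for p in procs)
    return [round(p.rel_perf / slowest) for p in procs]


def single_level(procs, level):
    """One thread on each processor of the chosen level, none elsewhere."""
    return [1 if p.level == level else 0 for p in procs]


# Hypothetical 4-processor machine used only to exercise the functions.
machine = [
    Processor(level=1, rel_perf=1.0),
    Processor(level=2, rel_perf=0.5),
    Processor(level=2, rel_perf=0.5),
    Processor(level=3, rel_perf=0.25),
]
threads_per_processor = virtual_processor(machine)
```

Returning a per-processor thread count keeps the policies independent of the application: the unmodified program only needs to be started with the sum of these counts as its number of threads.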
Simulation environment
The simulation environment is based on a modified version of the Wisconsin Wind Tunnel-II [10]. The modifications allow an HDSM machine with up to three levels to be defined. The experiments conducted to perform the validation of this simulation environment are presented in detail in [12].
In the simulation experiments, a processor with an average speedup due to ILP enhancements of n with respect to a scalar processor is approximately modeled as a scalar pipeline with clock rate scaled by n. Although this clock-scaling model has been found to introduce errors in execution time estimates [20], two reasons contribute to reduce the magnitude of potential errors. First, the number of aggressive ILP processors in the simulated multiprocessors is smaller than in previously studied systems (4 in the homogeneous case and 1+4 in the heterogeneous case). Results obtained using the RSIM [20] simulator for a 4-processor system show reductions from 74% to 24% in the average error observed with respect to an 8-processor system [12]. Second, the same model applies to both heterogeneous and homogeneous systems. For a given benchmark, the potential errors arise in both configurations and should have little effect in relative comparisons.
The virtual-processor assignment requires support for the execution of multiple threads on processing nodes. The simulator models a coarse-grain multithreading scheme based on voluntary context switches initiated by threads at synchronization points. The average context-switch overhead for the heterogeneous configuration is 870 processor cycles. The homogeneous configuration is not affected by this overhead since there is a single thread per processor.
Performance analysis
In this section, simulation results for both conventional and heterogeneous designs are presented and analyzed, according to the analysis roadmap described in Subsection 2.2.
Performance with respect to uniprocessor
In this subsection, the HDSM machine specified in Table 1 is compared to the high-performance level-1 uniprocessor. This comparison determines which static assignment policy under study maximizes the parallel speedup achieved for each benchmark. Figure 3 presents the HDSM speedups with respect to a level-1 uniprocessor for the single-level, virtual-processor and single-thread assignment policies (note that the linear speedup achievable by the 21-processor HDSM is 7.0, with respect to a level-1 processor, given the heterogeneity in processor performance). The single-level scenario only considers assignment to level-3 processors. The virtual-processor scenario assigns 4, 2, and 1 threads to processors in levels 1, 2, and 3, respectively, except for benchmarks requiring a power-of-two number of threads, where 3 threads are assigned to level-2 processors.
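The 7.0 bound and the total thread counts implied by the virtual-processor assignment follow from the machine parameters used in this paper (1, 4, and 16 processors in levels 1, 2, and 3, with relative performance 1, 1/2, and 1/4). The short check below is only an arithmetic sketch; the dictionary layout and function names are hypothetical.

```python
from fractions import Fraction

# level -> (number of processors, performance relative to a level-1 processor)
LEVELS = {
    1: (1, Fraction(1)),
    2: (4, Fraction(1, 2)),
    3: (16, Fraction(1, 4)),
}


def peak_linear_speedup(levels):
    """Aggregate performance of all processors, in level-1 processor units."""
    return sum(count * perf for count, perf in levels.values())


def total_threads(levels, threads_per_proc):
    """Total number of threads for a given threads-per-processor choice."""
    return sum(levels[lvl][0] * threads_per_proc[lvl] for lvl in levels)


# 1*1 + 4*(1/2) + 16*(1/4) = 7
bound = peak_linear_speedup(LEVELS)

# 4, 2, 1 threads per processor: 1*4 + 4*2 + 16*1 = 28 threads
default_threads = total_threads(LEVELS, {1: 4, 2: 2, 3: 1})

# 3 threads per level-2 processor: 1*4 + 4*3 + 16*1 = 32 threads
power_of_two_threads = total_threads(LEVELS, {1: 4, 2: 3, 3: 1})
```

Under these numbers, raising the level-2 assignment from 2 to 3 threads per processor takes the total from 28 to 32, which is a power of two.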
The virtual-processor assignment achieves better performance than the other two schemes for Barnes, Cholesky, LU, Ocean, and Water-nsquared due to better utilization of the level-1 and level-2 processors. The single-level assignment prevents these processors from participating in computation, while in the single-thread scheme fast processors remain idle while waiting for slower processors on synchronization points. The virtual-processor assignment successfully improves load-balancing for these benchmarks by assigning heterogeneous workloads to processors, based on their relative performance.
For the remaining five benchmarks, the performance of the virtual-processor scheme ranges from comparable to inferior compared with the other two schemes. For the kernel FFT, the single-level assignment yields the best performance. The reason for this performance advantage lies in the communication characteristics of this application.
The single-level scheme assigns threads to two wide 8-processor nodes, while in the virtual-processor case, threads are assigned to all four nodes. For the highly communication-intensive transpose phase of FFT, the inter-processor communication overhead is higher in the 4-node configuration than in the 2-node case. For this benchmark, the benefit of faster communication outweighs the higher parallelism exposed by the virtual-processor assignment, and the single-level assignment becomes the policy that yields the best performance. The same behavior is observed, to a lesser extent, in Radix.
The single-thread policy achieves the best performance for FMM and Raytrace. Single-thread performs better than the virtual-processor assignment in these cases due to the synchronization characteristics of the two applications and the coarse-grain multi-threading model assumed in the simulations, as discussed in the following paragraphs. Previous work has shown that FMM spends a significant fraction of its execution time in synchronization [30] via locks and barriers. Since the simplistic multi-threading model assumed in the simulations only supports voluntary context switches, a thread ready to acquire a lock can be delayed by other threads in the same processor for a long period of time. The delayed thread, in turn, can potentially prevent dependent threads in other processors in the system from proceeding. The larger number of threads required by the virtual-processor assignment, combined with the underlying context-switching model, can thus induce unnecessary serialization and have a negative impact on performance. In FMM, the context-switching overhead of the virtual-processor assignment (four threads on the level-1 processor) accounts for 15.9% of the total execution time.
The benchmark Raytrace synchronizes through locks; as with FMM, the virtual-processor assignment fails to deliver good performance. However, Raytrace uses a dynamic task-queue algorithm that achieves good load balancing under the single-thread policy, rendering the virtual-processor assignment unnecessary. In this algorithm, threads continually fetch work from a shared pool of tasks. A thread executing in a fast processor is thus likely to execute more work than one executing in a slower processor. This behavior is confirmed by measurements of the number of tasks executed by each processor. The simulated workload has a total of 1024 tasks in the shared work-pool; the average number of tasks executed by threads of levels 1, 2 and 3 are 156, 94 and 31, respectively. The fastest processor thus performs, on average, 408% more work than each quarter-speed level-3 processor without requiring a virtual-processor assignment. Due to this dynamic load-balancing scheme, Raytrace achieves the best speedup across single-thread assigned benchmarks.
The memory-intensive benchmark Ocean stands out with the best obtained speedup (5.7) for the HDSM model. The reason for this behavior lies in the ability of the architecture to expose high memory bandwidth. A more detailed look at the behavior of the memory subsystem of the HDSM reveals how it exploits the heterogeneous caches to provide high memory bandwidth.
Figures 4 and 5 show the average load miss rates obtained for the simulated benchmarks, and the average absolute number of misses per processor, respectively, under the virtual-processor assignment policy for levels larger caches are used more frequently and with lower miss rates than smaller caches, but the large number of independent small caches in the lowest level collectively provides high memory bandwidth.
Summarizing the results and findings of this subsection, the arithmetic mean of the best-assignment speedups for the four-node HDSM is 4.12. The performance analysis has shown that load-balancing mechanisms and the large aggregate memory bandwidth provided by many independent caches are key to achieving good performance in most benchmarks.
Constant-area performance analysis
In this subsection, an HDSM multiprocessor is compared to a conventional, homogeneous multiprocessor under a constant-area assumption. This comparison provides a quantitative analysis of the potential performance benefits of designing DSM machines with heterogeneous nodes.
Two different scenarios are considered in the analysis. The first scenario (large cache) compares a four-node HDSM as specified in Table 1 to a four-node homogeneous multiprocessor whose nodes are all fast level-1 uniprocessors. The second scenario (small cache) differs from the first scenario only with respect to cache size: all cache sizes are eight times smaller in both homogeneous and heterogeneous configurations. The purpose of this scenario is to study the performance of the HDSM model when caches may not be large enough to hold the secondary (and possibly the primary) working sets of SPLASH-2 benchmarks [30]. The best static thread assignment (Figure 3) is used in speedup calculations.
A homogeneous scenario consisting of four 8-processor chip-multiprocessors could also be conceived. For the SPLASH-2 parallel workloads, it is likely to outperform both studied homogeneous and heterogeneous configurations. However, such a 32-processor machine would have very poor sequential performance. Since all processors would be of the slowest type, this design can be as much as four times slower than either of the DSMs studied. Although compiler [6] and runtime [14] parallelization techniques can improve the performance of sequential code on this design for some types of applications, the analysis presented in this subsection does not include such a scenario.

Figure 6 shows the speedups, calculated as ratios of simulated homogeneous and heterogeneous DSM execution times, for the SPLASH-2 benchmarks. For all benchmarks but Water-spatial, the HDSM model outperforms the homogeneous counterpart, by as much as 95% for Ocean. Speedups of 25% or more are also observed for Barnes, FFT, FMM, LU, and Raytrace.
Good relative speedups for Ocean and Raytrace are expected from the analysis presented in the previous subsection; the HDSM model achieves uniprocessor speedups in excess of 5.0 for these benchmarks in a 4-node system. Significant speedups are also observed for FFT (63% to 84%). In the homogeneous configuration, the communication-intensive transpose phase of the FFT involves communication among four nodes, while in the HDSM with single-level assignment, the transpose involves only two wide nodes. Simulation results show that the HDSM executes the FFT transpose 143% faster than the homogeneous DSM, while executing the remaining compute-intensive phases 33% faster.

               Homogeneous set                          Heterogeneous set
               level i=1    level i=2    level i=3      level i=1    level i=2     level i=3
clock_i        clock_1      clock_1      clock_1        clock_1      2 clock_1     4 clock_1
$size_i        1MB          1MB          1MB            1MB          128KB         64KB
accesst_i      56 clock_1   56 clock_1   56 clock_1     56 clock_1   84 clock_1    112 clock_1
latency_i,j    50 clock_1   50 clock_1   50 clock_1     50 clock_1   100 clock_1   100 clock_1
Table 3: Homogeneous and heterogeneous sets of values for the factors clock_i, $size_i, accesst_i and latency_i,j (j = 2, 3, 1 for i = 1, 2, 3). Values are normalized with respect to the clock period of the fastest processor.
The HDSM relative performance for the small-cache scenario differs from the large-cache scenario by less than 8.0% for six of the studied benchmarks. For the remaining benchmarks, the relative small-cache HDSM speedup is smaller than the large-cache speedup for Radix and FFT, but larger for Barnes and Ocean. On average across all benchmarks, the two scenarios yield similar HDSM relative speedups: 37% (large cache) and 35% (small cache).
The 11% slowdown observed for Water-spatial is primarily due to contention on the memory bus of the level-3 CMP nodes. For Water-spatial, simulation results show that the ratio between the number of bus transactions that must stall due to contention in level-3 and level-1 nodes is 47.6. For the remaining benchmarks, the average ratio is an order of magnitude smaller (5.4). The virtual-processor and single-thread assignments are not able to significantly reduce total execution time in this case, since performance is limited by contention in the level-3 CMP memory bus.
In summary, the HDSM configuration significantly outperforms a homogeneous counterpart for the multiprocessor workload considered in this paper, under both large and small cache scenarios. In the next subsection, a factorial design analysis determines the impact of heterogeneity on HDSM performance, and identifies which application characteristics lead to good performance.
Constant-resources performance analysis
In this subsection, the impact of heterogeneity of the processor, memory, and network subsystems on the performance of HDSMs is analyzed (Figure 1, solid arrow) via a factorial design methodology. Several design points are used in the simulation of each application. The values of nodes_i and procs_i are common to all simulated configurations: levels 1, 2, and 3 consist of 1, 1, and 2 nodes with a total of 1, 4, and 16 processors, respectively. Each design point is characterized by four triples, each specifying the value of clock_i, $size_i, accesst_i and latency_i,j for each of three levels i=1, 2, and 3. Each triple can take one of two values, a homogeneous one (i.e., the elements of each triple are all identical) and a heterogeneous one (i.e., the elements of each triple have different values).
The values assumed for each possible triple are shown in Table 3. The factorial design experiment considers 16 possible design points which correspond to all possible combinations of homogeneous and heterogeneous triples.
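The 16 design points can be enumerated mechanically from the two sets of triples in Table 3. The sketch below is illustrative only: the dictionary keys and function name are hypothetical, cache sizes are written in KB, and clock, memory-access, and network values are expressed in multiples of the level-1 clock period.

```python
from itertools import product

FACTORS = ("clock", "cache_size", "mem_access", "net_latency")

# Per-level triples (levels 1, 2, 3) taken from Table 3.
TRIPLES = {
    "clock":       {"hom": (1, 1, 1),          "het": (1, 2, 4)},
    "cache_size":  {"hom": (1024, 1024, 1024), "het": (1024, 128, 64)},
    "mem_access":  {"hom": (56, 56, 56),       "het": (56, 84, 112)},
    "net_latency": {"hom": (50, 50, 50),       "het": (50, 100, 100)},
}


def design_points():
    """Return every combination of homogeneous/heterogeneous triples."""
    points = []
    for choice in product(("hom", "het"), repeat=len(FACTORS)):
        config = {f: TRIPLES[f][c] for f, c in zip(FACTORS, choice)}
        points.append((choice, config))
    return points


points = design_points()  # 2**4 = 16 configurations to simulate
```

Each entry pairs the homogeneous/heterogeneous choice per factor with the concrete per-level values, so one simulation run per entry yields the response vector used in the analysis.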
The heterogeneous triples shown in Table 3 di er from the con guration of Table 1 studied in Subsections 4.1 and 4.2 only with respect to the memory and network latencies. In this subsection, the assumption of same memory and network technology is relaxed in order to study the sensitivity of HDSM performance to heterogeneity in memory and network latencies. They di er by at most a factor of two across heterogeneous levels.
To individually assess the performance impact of heterogeneity of each of the factors listed in Table 3, a 2^k factorial design has been performed. Using the terminology of [15], such an experimental design is used to determine the effect of each of k factors on variations of a response variable, where each factor has two alternatives, or levels. In this paper, the k = 4 factors under study are clock_i, $size_i, accesst_i, and latency_{i,j}. The two levels that each factor can assume are the homogeneous and heterogeneous triples shown in Table 3. The response variable used is simulated execution time.
The 2^4 factorial design experiment considers sixteen different combinations of levels; each combination corresponds to a distinctly heterogeneous configuration. As an example, one possible configuration may have homogeneous network latencies and cache sizes, while having heterogeneous processor and memory speeds. Each benchmark is simulated once for each distinct configuration; the obtained simulated execution times yield a 16-entry vector. This vector is then mathematically analyzed to determine the effect of each of the k = 4 factors on the variations of execution time. The analysis consists of calculating, for each factor, the inner product between the execution-time vector and a sign vector with entries from the set {-1, +1} associated with the factor, and dividing the inner product by the total variation of the response variable y, given by
$$SST = \sum_{i=1}^{2^k} (y_i - \bar{y})^2,$$
where y_i is the simulated execution time of the i-th configuration and \bar{y} is the mean of the 2^k = 16 execution times.
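As a minimal sketch of this kind of analysis, the following Python code computes the fraction of variation attributed to each main factor in a 2^k design. It assumes the standard textbook allocation of variation, in which the inner product with the sign vector is divided by 2^k to obtain the effect, and the squared effect scaled by 2^k is then divided by the total variation; this normalization is an assumption and may differ in detail from the computation used in the paper. Interaction terms are omitted, and the function names, factor identifiers, and execution times in the usage note are hypothetical.

```python
from itertools import product

# Hypothetical identifiers for the k = 4 factors.
FACTORS = ("clock", "cache_size", "access_time", "latency")


def sign_table(k):
    """Return the 2**k rows of -1/+1 levels, one column per factor."""
    return list(product((-1, 1), repeat=k))


def allocation_of_variation(y, factors=FACTORS):
    """Fraction of the total variation of y explained by each main factor.

    y must list the response (e.g. execution time) in the same order as
    the rows of sign_table(len(factors)).
    """
    rows = sign_table(len(factors))
    n = len(rows)
    if len(y) != n:
        raise ValueError("expected one response per design point")

    mean = sum(y) / n
    sst = sum((value - mean) ** 2 for value in y)  # total variation
    if sst == 0:
        raise ValueError("response does not vary across design points")

    fractions = {}
    for column, name in enumerate(factors):
        # Inner product of the response vector with the factor's sign vector.
        inner = sum(row[column] * value for row, value in zip(rows, y))
        effect = inner / n
        fractions[name] = n * effect ** 2 / sst
    return fractions
```

For a synthetic response `y = [100 + 30 * r[0] + 10 * r[1] for r in sign_table(4)]`, the sketch attributes 0.9 of the variation to `clock` and 0.1 to `cache_size`, with the remaining two factors contributing nothing.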
The results obtained from the factorial design experiment are summarized in Figure 7. Variations in execution time are due to differences in speed and capacity of processors, memories, and network: execution time is larger if a given factor is assigned a heterogeneous triple rather than a homogeneous triple while all other factors remain unchanged.
The benchmarks are affected in different ways by variations in the processor, memory, and network architectural factors. Figure 7 shows that, for the benchmarks FFT, Radix, and Raytrace, heterogeneity in the network and memory subsystems combined has the most significant impact on execution time; for Ocean and Barnes, heterogeneous cache sizes are responsible for 79.6% and 48.6% of the increase in execution time, respectively. The remaining five benchmarks are mainly affected by heterogeneity in processor speed.
On average across all benchmarks, the variation in execution time is mostly due to heterogeneity in processor speed (59.3%), followed by heterogeneity in cache sizes (18.2%), memory access times (14.6%), and network latency (5.6%).
From a cost-performance standpoint, the results from the factorial analysis indicate that heterogeneity in memory and network may be desirable in large configurations. To investigate this scenario, an experiment comparing a fully heterogeneous DSM (labelled D1) against the configuration of Table 1 (labelled D2) was performed. An average slowdown of 19.2% due to slower memories and networks in the lower hierarchy levels was observed across the 10 SPLASH-2 benchmarks (see footnote 2). If the cost reduction associated with the use of slower [...] Figures 6 and 7 show that the benchmarks with a significant memory component in the factorial experiment have good speedups with respect to the homogeneous configuration. In particular, Ocean is the benchmark with the largest cache+memory term from the factorial analysis and achieves the best relative speedup among the SPLASH-2 programs.
Footnote 2: This result has no direct relationship to the sum of the factorial design components for memory access time and network latency, since only two configurations are compared in this experiment instead of the 16 design points of the factorial analysis.
Related work
The multiprocessor designs of this paper assume the future availability of very large chips capable of multiprocessing and/or very high performance for sequential codes. Proposals of the so-called billion-transistor architectures focus on the design of single-chip microprocessors that make use of very dense logic to exploit parallelism at different levels. Instruction-level parallel (ILP) processors proposed in [27, 22, 17, 26] exploit parallelism in a single thread of execution. Chip multiprocessors [13] exploit parallelism across multiple threads. Simultaneous multi-threaded (SMT) processors [5] target both single- and multi-thread parallelism in a single chip. Such designs will serve as building blocks for large multiprocessor configurations that use multiple multiprocessor chips. This paper considers system-level implications of the use of heterogeneous building blocks on the performance of future-generation DSMs.
Several studies have shown that heterogeneous multiprocessor systems may be more cost-effective than homogeneous multiprocessors [18, 2]. These studies have used analytical cost/performance models and/or simulation of message-passing workloads explicitly parallelized for a heterogeneous configuration. In contrast, this paper quantitatively analyzes the performance of unmodified shared-memory parallel programs in homogeneous and heterogeneous DSM multiprocessors of equal chip area.
Conclusions and future work
This paper shows that HDSM multiprocessors organized as processor-and-memory hierarchies constitute a high-performance approach to computer design. In addition, this paper identifies tradeoffs between performance and heterogeneity in the design of processors, caches, memories, and network of such a class of machines.
Simulation experiments show that a 4-node HDSM achieves average speedups of 4.12 with respect to a uniprocessor and 1.36 with respect to a 4-node homogeneous multiprocessor for ten SPLASH-2 benchmarks. The heterogeneous organization is particularly effective for memory-intensive programs. While levels with small processor counts provide fast response for latency-sensitive tasks, levels with large numbers of processors provide large aggregate bandwidth for memory-intensive parallel tasks.
Three static thread assignment mechanisms that map homogeneous programs to heterogeneous organizations have been evaluated. The policy based on virtual processors provides good performance for memory- and CPU-intensive applications with low synchronization requirements. The single-thread assignment policy provides better performance for applications with high lock-based synchronization requirements. The single-level assignment policy results in the best performance for communication-intensive applications. These conclusions motivate ongoing work in dynamic thread assignment mechanisms that use run-time information and speculation to decide on the mapping of tasks to heterogeneous resources. Given the sensitivity of this class of applications to shared-memory protocol processing and latency overhead, future research on HDSMs will also focus on distributed shared-memory protocols and latency-tolerance mechanisms that account for node heterogeneity.
The impact of heterogeneity of the processor, memory, and network subsystems on the performance of HDSMs is application-dependent. The studied applications are affected, on average, primarily by heterogeneity in processor speed (59.3%), followed by cache sizes (18.2%), memory latency (14.6%), and network latency (5.6%). The performance of HDSMs thus has low sensitivity to the use of slow memory technology in the highly parallel machine levels.
