Synthesizing architectural requirements from an application viewpoint can help in making important architectural design decisions towards building large scale parallel machines. In this paper, we quantify the link bandwidth requirement on a binary hypercube topology for a set of five parallel applications. We use an execution-driven simulator called SPASM to collect data points for system sizes that are feasible to be simulated. These data points are then used in a regression analysis for projecting the link bandwidth requirements for larger systems. The requirements are projected as a function of the following system parameters: number of processors, CPU clock speed, and problem size. These results are also used to project the link bandwidths for other network topologies. Our study quantifies the link bandwidth that has to be made available to limit the network overhead in an application to a specified tolerance level. The results show that typical link bandwidths (200-300 MBytes/sec) found in current commercial parallel architectures (such as Intel Paragon and Cray T3D) would have fairly low network overhead for the applications considered in this study. For two of the applications, this overhead is negligible.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. SIGMETRICS '95, Ottawa, Ontario, Canada © 1995 ACM 0-89791-695-6
show using analytical models that network locality has an important role to play in the performance of the mesh. Since network requirements are sensitive to the workload, it is necessary to study them in the context of real applications. driven simulator to study k-ary n-cube networks in the context of applications drawn from image understanding, and show the impact of application characteristics on the choice of the network topology. We take a similar approach to deriving the network requirements in this study.
Using an execution-driven simulation platform called SPASM [27, 25], we simulate the execution of five parallel applications on an architectural platform with a binary hypercube network topology. We vary the link bandwidth on the hypercube and quantify its impact on application performance.
From these results, we derive link bandwidths that are needed to limit network overheads to an acceptable level. We also study the impact of the number of processors, the CPU clock speed and the application problem size on bandwidth requirements.
Using regression analysis and application knowledge, we extrapolate requirements for larger systems of 1024 processors and other network topologies. The results suggest that existing link bandwidth of 200-300 MBytes/sec available on machines like Intel Paragon [14] and Cray T3D [17] can easily sustain the requirements of two applications (EP and FFT) even on high-speed processors of the future. For the other three, one may be able to maintain network overheads at an acceptable level if the problem size is increased commensurate with the processing speed.
The technique outlined in this paper may be used to derive the bandwidth requirements of an application as a function of the system parameters (such as processor clock speed, number of processors, and expected problem size of applications).
This information can then be used by an architect to deduce the network requirements for a specific set of system parameters. When cost and technological factors prohibit supporting the required bandwidth, the information may be useful to find out the efficiency that would result from a lower bandwidth or the factor by which the problem size needs to be scaled up to maintain reasonable efficiency.
Section 2 gives an overview of our approach and details of the simulation platform; section 3 briefly describes the hardware platform and applications used in this study; section 4 presents results from our experiments along with an analysis of bandwidth requirements as a function of system parameters; section 5 summarizes the implication of these results; and section 6 presents concluding remarks.
Approach
As observed in [22], communication in an application may be characterized by four attributes. Volume refers to the number and size of messages.
The communication pattern in the application determines the source-destination pairs for the messages, and reflects on the application's ability to exploit network locality. Frequency pertains to the temporal aspects of communication, i.e., the interval between successive network accesses by each processor as well as the interleaving in time of accesses from different processors. This temporal aspect of communication would play an important role in determining network contention.
Tolerance is the ability of an application to hide network overheads by overlapping computation with communication.
Modeling all these attributes of communication in a parallel application is extremely difficult by simple static analysis. Further, the dynamic access patterns exhibited by many applications make modeling more complex. Several researchers [22, 25, 12] have observed the importance of simulation for capturing the communication behavior of applications.
In this study, we use an execution-driven simulator called SPASM (Simulator for Parallel Architectural Scalability Measurements) [27, 25] that enables us to accurately model the behavior of applications on a number of simulated hardware platforms. SPASM has been written using CSIM [16], a process oriented sequential simulation package, and currently runs on SPARCstations. The inputs to the simulator are parallel applications written in C. These programs are preprocessed (to label shared memory accesses), the compiled assembly code is augmented with cycle counting instructions, and the assembled binary is linked with the simulator code. As with other recent simulators [5, 9, 6, 19], the bulk of the instructions is executed at the speed of the native processor (the SPARC in this case) and only instructions (such as LOADs and STOREs on a shared memory platform or SENDs and RECEIVEs on a message-passing platform) that may potentially involve a network access are simulated. The input parameters that may be specified to SPASM are the number of processors, the CPU clock speed, the network topology, the link bandwidth and switching delays. We can thus vary a range of system parameters and study their impact on application performance.
The main problem with the execution-driven simulation approach is the tremendous resource (both space and time) requirement in simulating large parallel systems. Related studies [28, 24] address this problem and show how it may be alleviated by augmenting simulation with other evaluation techniques.
SPASM gives a wide range of statistical information about the execution of the program. The novel feature of SPASM is its ability to provide a complete isolation and quantification of different overheads in a parallel system that limit its scalability. The algorithmic overhead (arising from factors such as the serial part and work-imbalance in the algorithm) and the network overheads (latency and contention) are the important overheads that are of relevance to this study. SPASM gives the total time (simulated time) which is the maximum of the running times of the individual parallel processors. This is the time that would be taken by an execution of the parallel program on the target parallel machine. SPASM also gives the ideal time, which is the time taken by the parallel program to execute on an ideal machine such as the PRAM [31]. This metric includes the algorithmic overheads but does not include any overheads arising from architectural limitations.
Of the network overheads, the time that a message would have taken in a contention free environment is charged to the latency overhead, while the rest of the time is accounted for in the contention overhead.
To synthesize the communication requirements of parallel applications, the separation of the overheads provided by SPASM is crucial. For instance, an application may have an algorithmic deficiency due to either a large serial part or due to work-imbalance, in which case 100% efficiency¹ is impossible regardless of other architectural parameters. The separation of overheads, provided by SPASM, enables us to quantify the bandwidth requirements as a function of acceptable network overheads (latency and contention). We thus quantify the bandwidth needed to limit the network overheads to 10%, 30% and 50% of the overall execution time. This also amounts to quantifying the bandwidth needed to attain an efficiency which is 90%, 70% and 50% of the ideal efficiency (on an ideal machine with zero network overhead).
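The correspondence between a tolerated network overhead and the fraction of ideal efficiency attained can be illustrated with a small sketch. The function names below are illustrative assumptions and are not part of SPASM. The sketch also assumes that the total time equals the ideal time plus the network overhead (latency and contention), with no other architectural overheads.

```python
def efficiency(serial_time, parallel_time, p):
    """Efficiency = speedup(p) / p, where speedup(p) = T(1) / T(p)."""
    return (serial_time / parallel_time) / p


def fraction_of_ideal_efficiency(total_time, network_overhead):
    """Ratio of achieved efficiency to ideal efficiency.

    Assumption: ideal_time = total_time - network_overhead, so the ratio
    reduces to 1 - (network_overhead / total_time).
    """
    ideal_time = total_time - network_overhead
    return ideal_time / total_time


# A 10% network overhead corresponds to 90% of the ideal efficiency.
ratio = fraction_of_ideal_efficiency(total_time=100.0, network_overhead=10.0)
```

Under these assumptions, network overheads of 10%, 30% and 50% map to 90%, 70% and 50% of the ideal efficiency, as stated above.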
Experimental Setup
We have chosen a CC-NUMA (Cache Coherent Non-Uniform Memory Access) shared memory multiprocessor as the architectural platform for this study. Since uniprocessor architecture is getting standardized with the advent of RISC technology, we fix most of the processor characteristics by using a SPARC chip as the baseline for each processor in a parallel system. But to study the impact of processor speed on network requirements we allow the clock speed of a processor to be varied. Each node in the system has a piece of the globally shared memory and a 2-way set-associative private cache (64 KBytes with 32 byte blocks). The cache is maintained sequentially consistent using an invalidation-based fully-mapped directory-based cache coherence scheme. Rothberg et al. [20] show that a cache of moderate size (64 KBytes) suffices to capture the working set in many applications, and Wood et al. [30] show that for several applications, the execution time is not significantly different across cache coherence protocols. In an earlier study [28], we show that for an invalidation-based protocol with a full-map directory, the coherence maintenance overhead due to the protocol is not significant for a range of applications. Thus in our approach to synthesizing network requirements, we fix the cache parameters and vary only the clock speed of the processor and the network parameters. The synchronization primitive supported in hardware is a test-and-set operation and applications use a test-test-and-set to implement higher level synchronization.
¹ Efficiency is defined as speedup(p)/p, where p is the number of processors. Speedup(p) is the ratio of the time taken to execute the parallel application on 1 processor to the time taken to execute the same on p processors.
The study is conducted for a binary hypercube interconnect. The hypercube is assumed to have serial (1-bit wide) unidirectional links and uses the e-cube routing algorithm [29]. Messages are circuit-switched using a wormhole routing strategy and the switching delay is assumed to be zero. We simulate the network in its entirety in the context of the given applications. Ideally, we would like to simulate other networks as well in order to study the change in link bandwidth requirements with network connectivity.
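As a rough illustration of e-cube routing on a binary hypercube, the sketch below corrects the differing address bits between source and destination one dimension at a time, in a fixed order. The function name and the choice of starting from the lowest dimension are assumptions made for illustration; the simulator's actual implementation is not described here.

```python
def ecube_route(src, dst, dimensions):
    """Return the nodes visited when routing from src to dst.

    Dimensions are resolved in a fixed order (lowest bit first, an
    assumption), which is what makes the route deterministic.
    """
    path = [src]
    node = src
    for d in range(dimensions):
        bit = 1 << d
        if (node ^ dst) & bit:   # addresses differ in dimension d
            node ^= bit          # traverse the link along dimension d
            path.append(node)
    return path


# Route in a 3-dimensional cube from node 000 to node 101.
route = ecube_route(0b000, 0b101, 3)
```

The number of links traversed equals the Hamming distance between the two node addresses.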
Since these simulations take an inordinate amount of time, we have restricted ourselves to simulating the hypercube network in this study. We use the results from the hypercube study in hypothesizing the requirements for other networks using analytical techniques coupled with application knowledge.
We have chosen five parallel applications in this study that exhibit different characteristics and are representative of many scientific computations.
Three of the applications (EP, IS and CG) are from the NAS parallel benchmark suite [4]; CHOLESKY is from the SPLASH benchmark suite [23]; and FFT is the well-known Fast Fourier Transform algorithm. EP and FFT are well-structured applications with regular communication patterns determinable at compile-time, with the difference that EP has a higher computation to communication ratio. IS also has a regular communication pattern, but in addition it uses locks for mutual exclusion during the execution. CG and CHOLESKY are different from the other applications in that their communication patterns are not regular (both use sparse matrices) and cannot be determined at compile time. While a certain number of rows of the matrix in CG is assigned to a processor at compile time (static scheduling), CHOLESKY uses a dynamically maintained queue of runnable tasks. Further details on the applications can be found in [26].
Results and Analysis
In this section, we present results from our simulation experiments. Using these results, we quantify the link bandwidth necessary to support the efficient performance of these applications and project the bandwidth requirements for building large-scale parallel systems with a binary hypercube topology.
The experiments have been conducted over a range of processors (p = 4, 8, 16, 32, 64), CPU clock speeds (s = 33, 100, 300 MHz) and link bandwidths (b = 1, 20, 100, 200, 600 and 1000 MBytes/sec). The problem size n of the applications has been varied as 16K, 32K, 64K, 128K and 256K for EP, IS and FFT, 1400 x 1400 and 5600 x 5600 for CG, and 1806 x 1806 for CHOLESKY.
In studying the effect of each parameter (processors, clock speed, problem size), we keep the other two constant.
4.1 Impact of System Parameters on Bandwidth Requirements
As the link bandwidth is increased, the efficiency of the system is also expected to increase as shown in Figure 1. But we soon reach a point of diminishing returns beyond which increasing the bandwidth does not have a significant impact on application performance (the curves flatten) since the network overheads (both latency and contention) are sufficiently low at this point. In all our results, we observe such a distinct knee. One would expect the efficiency beyond this knee to be close to 100%. But owing to algorithmic overheads such as serial part or work-imbalance, the knee may occur at a much lower efficiency (e_0 in Figure 1 may be much lower than 1.0). These algorithmic overheads may also cause the curves for each configuration of system parameters to flatten out at entirely different levels in the efficiency spectrum. The bandwidth corresponding to the knee (b_0) still represents an ideal point at which we would like to operate since the network overheads beyond this knee are minimal and the network is no longer the bottleneck for any loss of efficiency. We track the horizontal movement of this knee to study the impact of system parameters (processors, CPU clock speed, problem size) on link bandwidth requirements.
respectively, across different number of processors (p). The knees for EP and FFT, which display a high computation to communication ratio, occur at low bandwidths and are hardly noticeable in these figures. The algorithmic overheads in these applications are negligible, yielding efficiencies that are close to 100%. For the other applications, the knee occurs at a higher bandwidth.
Further, the curves tend to flatten at different efficiencies suggesting the presence of algorithmic overheads. For all the applications, the knee shifts to the right as the number of processors is increased indicating the need for higher bandwidth. As the number of processors is increased, the network accesses incurred by a processor in the system may increase or decrease depending on the application, but each such access would incur a larger overhead from contending for network resources (due to the larger number of messages in the network as a whole for the chosen applications).
Further, the computation performed by a processor is expected to decrease, lowering the computation to 
Quantifying Link Bandwidth Requirements
We analyze bandwidth requirements using the above simulation results in projecting requirements for large-scale parallel systems. We track the change in the knee with system parameters by quantifying the link bandwidth needed to limit the network overheads to a certain fraction of the overall execution time. This fraction would determine the closeness of the operating point to the knee. For instance, if the network overhead is less than 10% of the overall execution time, then it amounts to saying that we are achieving an efficiency that is within 90% of the ideal efficiency (on an ideal machine with zero network overhead). Ideally, one would like to operate as close to the knee as possible. But owing to either cost or technological constraints, one may be forced to operate at a lower bandwidth and it would be interesting to investigate if one may still obtain reasonable efficiencies under these constraints. Figure 16 shows the trade-off between the tolerable network overhead and the resulting bandwidth that needs to be sustained to maintain the overhead within the specified level. With the ability to tolerate a larger network overhead, the bandwidth requirements are expected to decrease as shown by the curve labeled "Actual" in Figure 16. To calculate the bandwidth requirement needed to limit the network overhead (both the latency and contention component) to a certain value, we simulate the applications and the network in their entirety over a range of link bandwidths.
We plot the bandwidths and the resulting network overheads as shown by the curve labeled "Simulated" in Figure 16. We perform a linear interpolation between these data points to calculate the bandwidth (b_x) required to limit the network overhead to x% of the total execution time. This bandwidth would represent a good upper bound on the corresponding "Actual" bandwidth (b_a) required. Instead of presenting all the interpolated graphs, we simply tabulate the requirements for x = 10%, 30% and 50% in the following discussions. These requirements are expected to change with the number of processors, the CPU clock speed and the application problem size. The rate of change in the knee is used to study the impact of these system parameters.
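The interpolation step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, the data points are made up for the example and are not simulation results from this study, and the overhead is assumed to fall monotonically as bandwidth rises.

```python
def bandwidth_for_overhead(points, target):
    """Linearly interpolate the bandwidth at which overhead equals target.

    points: (bandwidth, overhead_percent) pairs from simulation runs.
    Returns None when the target is not bracketed by the data points.
    """
    pts = sorted(points)
    for (b_lo, o_lo), (b_hi, o_hi) in zip(pts, pts[1:]):
        if o_lo >= target >= o_hi:
            if o_lo == o_hi:
                return b_lo
            frac = (o_lo - target) / (o_lo - o_hi)
            return b_lo + frac * (b_hi - b_lo)
    return None


# Hypothetical data points (MBytes/sec, overhead %), for illustration only.
simulated = [(1, 60.0), (20, 25.0), (100, 5.0)]
b_10 = bandwidth_for_overhead(simulated, 10.0)
```

Because the underlying overhead curve is expected to bend toward the origin between sampled points, the straight-line estimate lies on or above it, which is why the interpolated value serves as an upper bound on the bandwidth actually required.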
In cases where the analysis is simple, we use our knowledge of the application and architectural characteristics in extrapolating the performance for larger systems. In cases where such a static analysis is not possible (due to the dynamic nature of the execution), we perform a non-linear regression analysis of the simulation results using a multivariate secant method with a 95% confidence interval in the
Using this methodology, we now discuss for each application its intrinsic characteristics that impact the communication and computation requirements; present the link bandwidth requirements as a function of increasing number of processors, CPU clock speed, and problem size; and project the requirements for a 1024-node system with a problem size appropriate for such a system.
EP
EP has a high computation to communication ratio with the communication being restricted to a few logarithmic global sum operations. The bulk of the time is spent in local computation and as a result, even a bandwidth of 1 MByte/sec is adequate to limit network overheads to less than 10% of the execution time (see Table 1). As the number of processors is increased, the communication incurred by a processor in the global sum operation grows logarithmically and a bandwidth of 10 MBytes/sec can probably sustain even a system with 1024 processors. As the clock speed is increased, the time spent by a processor in the local computation is expected to decrease linearly and the bandwidth requirement for the global sum operation needs to increase at the same rate in order to maintain the same efficiency. Table 2 reflects this behavior. As the problem size (n) is increased for this application, the local computation incurred by a processor is expected to grow as O(n) while the communication (both the number of global sum operations as well as the number of messages for a single operation) remains the same. As a result, the bandwidth requirements are expected to decrease linearly with problem size.
Given that real world problem sizes for this application are of the order of n = 2^28 [4], a very low link bandwidth (less than 1 MByte/sec) would suffice to yield an efficiency close to 100%.
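The two scaling trends described for EP (requirement growing linearly with clock speed and shrinking linearly with problem size) can be combined in a small sketch. The function name and the baseline values are hypothetical and chosen only for illustration; they are not entries from Tables 1 or 2.

```python
def scale_ep_bandwidth(base_bw, base_clock_mhz, base_n, clock_mhz, n):
    """Scale a baseline EP bandwidth requirement to a new clock speed
    and problem size, assuming exactly linear behavior in both."""
    clock_factor = clock_mhz / base_clock_mhz   # faster CPU -> more bandwidth
    size_factor = base_n / n                    # larger problem -> less bandwidth
    return base_bw * clock_factor * size_factor


# Hypothetical baseline: 1 MByte/sec at 33 MHz with n = 64K.
same = scale_ep_bandwidth(1.0, 33, 65536, 66, 131072)
```

Doubling the clock speed while doubling the problem size leaves the requirement unchanged under this model, which is the intuition behind scaling the problem size commensurate with the processing speed.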
IS
IS is more communication intensive than EP and its bandwidth requirements are expected to be considerably higher. There are two dominant phases in the execution that account for the bulk of the communication [26]. In the first, a processor accesses a part of the local buckets of every other processor in making updates to the global buckets allotted to it. In the second phase, a processor locks each of the global buckets (that is partitioned by consecutive chunks across processors) in ranking the list of elements allotted to it. With an increase in the number of processors, a processor needs to access data from more processors in the former phase. In the latter, the amount of global buckets that is allocated to a processor decreases linearly with increase in processors due to the partitioning scheme. Hence, in both these phases, the communication is expected to grow as O(p) with increase in processors. Further, the computation performed by a processor decreases with an increase in processors, but the rate is less than linear owing to algorithmic deficiencies in the problem [25]. These factors combine to yield a considerable bandwidth requirement for larger systems (see Table 3), if we are willing to tolerate less than 10% network overheads. The bandwidth function has been obtained by performing a non-linear regression analysis of the simulation data points for the given system parameter, and the resulting function has been used to calculate the requirements for the 1024 node system. As the CPU clock speed is increased, the computation to communication ratio decreases, making the requirements more stringent as shown in Table 4. As the problem size (n) is increased, the communication in the above mentioned phases increases linearly. The local computation also increases, but the former effect is more prominent as is shown in Table 5, where the bandwidth requirements grow moderately with problem size.
Using these results, the bandwidth requirements for IS are projected in Table 6 for a 1024 node system and a problem size of 2^23 that is representative of a real world problem [4]. This table shows that bandwidth requirements of IS are considerably high. We may at best be able to operate currently at around 50% network overhead range with 33 MHz processors given that link bandwidth of state-of-the-art networks is around 200-300 MBytes/sec. With faster processors like the DEC Alpha, the network becomes an even bigger bottleneck for this application.
In projecting the above bandwidth requirements for this application with 1024 processors, both the number of buckets as well as the number of list elements have been increased for the larger problem. But bucket sort is frequently used in cases where the number of buckets is relatively independent of the number of elements in the list to be sorted. A scaling strategy where the size of the list is increased and the number of buckets is maintained constant would cause no change in communication in the above mentioned phases of IS, while the computation is expected to grow as O(n). Hence, if we employ such a scaling strategy and increase the problem size linearly with the CPU clock speed, we may be able to limit the network overheads to within 30-50% for this application with existing technology.
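The idea of fitting a bandwidth function to simulated points and evaluating it at 1024 processors can be sketched as below. Note the substitution: this sketch fits a simple power law by least squares in log-log space, which is not the multivariate secant method used in the study, and the power-law form itself is an assumption. The data are synthetic and are not taken from Table 3.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**c in log-log space; returns (a, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    c = sxy / sxx
    a = math.exp(my - c * mx)
    return a, c


def project(a, c, x):
    """Evaluate the fitted function at a larger system size."""
    return a * x ** c


# Synthetic points following b = 2 * p**1.5, for illustration only.
procs = [4, 8, 16, 32, 64]
bws = [2 * p ** 1.5 for p in procs]
a, c = fit_power_law(procs, bws)
b_1024 = project(a, c, 1024)
```

Extrapolating a fit from 64 to 1024 processors is sensitive to the assumed functional form, so any such projection is best read together with its confidence interval.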
agree with theoretical results presented in [10] where the authors show that FFT is scalable on the cube topology and the achievable efficiency is only limited by the ratio of the CPU clock speed and the link bandwidth. Table 4.2. These requirements can be satisfied even for faster processors (see Table 8). As we mentioned earlier, the computation to communication ratio is proportional to O(log n), and the network requirements are expected to become even less stringent as the problem size is increased. Table 9 confirms this observation. Hence, in projecting the requirements for a 1024-node system, link bandwidths of around 100-150 MBytes/sec would suffice to limit the network overheads to less than 10% of the execution time (see Table 10).
Table 11. We observe that the effect of lower local computation and lesser data reuse has a more significant impact in increasing the communication requirements for larger systems. Increasing the clock speed has an almost linear impact on increasing bandwidth requirements as given in Table 12.
As the problem size is increased, the local computation increases, and the probability of data reuse also increases. The rate at which these factors impact the requirements depends on the sparsity factor of the matrix. Table 13 shows the requirements for two different problem sizes. For the 1400 x 1400 problem, the sparsity factor is 0.04, while the sparsity factor for the 5600 x 5600 problem is 0.02. The corresponding factor for the 14000 x 14000 problem suggested in [4] is 0.1 and we scale down the bandwidth requirements accordingly in Table 14 for a 1024 node system. The results suggest that we may be able to limit the overheads to within 50% of the execution time with existing technology. As the processors get faster than 100 MHz, it would need a considerable amount of bandwidth to limit overheads to within 30%. But with faster processors, and larger system configurations, one may expect to solve larger problems as well. If we increase the problem size (number of rows of the matrix) linearly with the clock speed of the processor, one may expect the bandwidth requirements to remain constant, and we may be able to limit network overheads to within 30% of execution time even with existing technology.
CHOLESKY
This application performs a Cholesky factorization of a sparse positive definite matrix [26]. Each processor while working on a column will need to access the non-zero elements in the same row position of other columns. Once a non-local element is fetched, the processor can reuse it for the next column that it has to process. The communication pattern in processing a column is similar to that of CG. The difference is that the allocation of columns to processors in CHOLESKY is done dynamically. As with CG, an increase in the number of processors is expected to decrease the computation to communication ratio as well as the probability of data reuse. Further, the network overheads for implementing dynamic scheduling are also expected to increase for larger systems. Table 15 reflects this trend, showing that bandwidth requirements for CHOLESKY grow modestly with increase in processors. Still, the requirements may be easily satisfied with existing technology for 1024 processors. Even with a 300 MHz clock, one may be able to limit network overheads to around 30% as shown in
In the previous section, we quantified the link bandwidth requirements of five applications for the binary hypercube topology as a function of the number of processors, CPU clock speed and problem size. Based on these results we also projected the requirements of large systems built with 1024 processors and CPU clock speeds up to 300 MHz. We observed that EP has negligible bandwidth requirements and FFT has moderate requirements that can be easily sustained. The network overheads for CG and CHOLESKY may be maintained at an acceptable level for current day processors, and as the processor speed increases, one may still be able to tolerate these overheads provided the problem size is increased commensurately. The network overheads of IS are tolerable for slow processors, but the requirements become unmanageable as the clock speed increases.
As we observed, the deficiency in this problem is in the way the problem is scaled (the number of buckets is scaled linearly with the size of the input list to be sorted). On the other hand, if the number of buckets is maintained constant, it may be possible to sustain bandwidth requirements by increasing the problem size linearly with the processing speed.
In [18], the authors show that the applications EP, IS, and CG scale well on a 32-node KSR-1. Although our results suggest that these applications may incur overheads affecting their scalability, this does not contradict the results presented in [18] since the implications of our study are for larger systems built with much faster processors.
All of the above link bandwidth results have been presented for the binary hypercube network topology.
The cube represents a highly scalable network where the bisection bandwidth grows linearly with the number of processors. Even though cubes of 1024 nodes have been built [11], cost and technology factors often play an important role in its physical realization. Agarwal [2] and Dally [8] show that wire delays (due to increased wire lengths associated with planar layouts) of higher dimensional networks make low dimensional networks more viable. The 2-dimensional [15] and 3-dimensional [17, 3] toroids are common topologies used in current day networks, and it would be interesting to project link bandwidth requirements for these topologies.
A metric that is often used to compare different networks is the bisection bandwidth available per processor. On a k-ary n-cube, the bisection bandwidth available per processor is inversely proportional to the radix k of the network. To reduce the degree of pessimism in these projections, one may thus introduce a correction factor of 0.5 that can be multiplied with the above-mentioned factors of 16 and 5 in projecting the bandwidths for 2-D and 3-D networks respectively.
EP would still need negligible bandwidth and we can still limit network overheads of FFT to around so~o on these networks with existing technology.
But the problem sizes for IS, CG and CHOLESKY would have to grow by a factor of 8 compared to their hypercube counterparts if we are to sustain the corresponding efficiency on a 2-D network with current technology. Despite the correction factor, these projections are still expected to be pessimistic since the method ignores the temporal aspect of communication.
The projection assumes that every message in the system traverses the bisection at the same time. If the message pattern is temporally skewed, then a lower link bandwidth may suffice for a given network overhead. It is almost impossible to determine these skews statically, especially for applications like CG and CHOLESKY where the communication pattern is dynamic. It would be interesting to conduct a detailed simulation for these network topologies to confirm these projections.
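The topology projection can be sketched numerically. Assuming, as stated above, that the bisection bandwidth available per processor is inversely proportional to the radix k, and treating the binary hypercube as the k = 2 case, the link bandwidth on a lower dimensional network would need to scale by roughly k/2. The function names and the rounding of the radix for a 1024-node 3-D network are assumptions of this sketch.

```python
def radix(nodes, n):
    """Approximate radix k of a k-ary n-cube with about `nodes` nodes."""
    return round(nodes ** (1.0 / n))


def bandwidth_scale_vs_hypercube(nodes, n, correction=1.0):
    """Factor by which link bandwidth must grow relative to a binary
    hypercube (k = 2), optionally damped by a correction factor."""
    return correction * radix(nodes, n) / 2


factor_2d = bandwidth_scale_vs_hypercube(1024, 2)        # k = 32
factor_3d = bandwidth_scale_vs_hypercube(1024, 3)        # k is about 10
corrected_2d = bandwidth_scale_vs_hypercube(1024, 2, correction=0.5)
```

Under these assumptions the sketch yields 16 for the 2-D network and 5 for the 3-D network, and the 0.5 correction brings the 2-D value to 8, which appears consistent with the factors quoted in the text.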
Concluding Remarks
In this study, we undertook the task of synthesizing the bandwidth requirements of five parallel applications. Such a study can help in making cost-performance trade-offs in designing and implementing networks for large scale multiprocessors.
One way of conducting such a study would be to examine the applications statically, and develop simple analytical models to capture their communication requirements.
But as we mentioned in Section 2, it is difficult to faithfully model all the attributes of communication by a simple static analysis for all applications. On the other hand, simulation can faithfully capture all the attributes of communication.
We used an execution-driven simulator called SPASM for simulating the applications on an architectural platform with a binary hypercube topology. The link bandwidth of the simulated platform was varied and its impact on application performance was quantified. From these results, the link bandwidth requirements for limiting the network overheads to a specified tolerance level were identified. We also studied the impact of system parameters (number of processors, processing speed, problem size) on link bandwidth requirements. Using regression analysis and analytical techniques, we projected requirements for large scale parallel systems with 1024 processors and other network topologies.
The results show that existing link bandwidth of 200-300 MBytes/sec available on machines like Intel Paragon [14] and Cray T3D [17] can sustain high speed applications with fairly low network overhead. For applications like EP and FFT, this overhead is negligible.
For the other applications, this overhead can be limited to about SOYO of the execution time provided the problem sizes are increased commensurate with the processor clock speed.
Using the technique outlined in this paper, it would be possible for an architect to synthesize the bandwidth requirements of an application as a function of system parameters. For instance, given a set of applications, the system size (number of processors) and the CPU speed, an architect may use this technique to calculate the bandwidth that he needs to support in hardware.
In cases where cost/technological problems prohibit supporting this bandwidth, the architect may use the results to find out the efficiency that would result from a lower hardware bandwidth or the factor by which the problem size needs to be scaled to maintain good efficiency. The results may also be used to quantify the rate at which the network (which is often custom-built) capabilities have to be enhanced in order to accommodate the rapidly improving off-the-shelf components used in realizing the processing nodes.
