of-the-art memory systems to remain competitive.
We believe that the common hardware organization described in Figure 1 will dominate commercial MPPs at least for the rest of this decade, for reasons discussed in Section 2 of this paper.
In Section 3 we develop the LogP model, which captures the important characteristics of this organization. 
Figure 3: Optimal broadcast tree for P = 8, L = 6, g = 4, o = 2 (left) and the activity of each processor over time (right). The number shown for each node is the time at which it has received the datum and can begin sending it on. The last value is received at time 24.
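The receive times in Figure 3 can be reproduced with a short greedy simulation. The sketch below is illustrative and not taken from the paper's text. It assumes that every processor holding the datum forwards it as early as possible, that consecutive sends from one processor are spaced max(g, o) apart, and that a message whose send begins at time t becomes usable at its destination at time t + L + 2o. The function name `broadcast_times` is hypothetical.

```python
import heapq


def broadcast_times(P, L, o, g):
    """Return the sorted times at which each of P processors holds the datum.

    Assumption: a message whose send starts at time t is usable by the
    receiver at t + L + 2*o, and one sender's messages are max(g, o) apart.
    """
    gap = max(g, o)
    one_way = L + 2 * o
    times = [0]            # the root holds the datum at time 0
    arrivals = [one_way]   # earliest pending arrival from each sender chain
    while len(times) < P:
        t = heapq.heappop(arrivals)
        times.append(t)
        # The same sender can deliver its next message one gap later.
        heapq.heappush(arrivals, t + gap)
        # The newly informed processor becomes a sender itself.
        heapq.heappush(arrivals, t + one_way)
    return times


if __name__ == "__main__":
    print(broadcast_times(P=8, L=6, o=2, g=4))
```

With the parameters of Figure 3 the last entry is 24, which agrees with the caption.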
realistically through the parameters o and g. Moreover, the capacity constraint allows multithreading to be employed only up to a limit of L/g virtual processors. Under LogP, multithreading represents a convenient technique which simplifies analysis, as long as these constraints are met, rather than a fundamental requirement [26, 30].
The tree has the same shape as an optimal broadcast tree [20]. Each processor has the task of summing a set of the elements and then (except for the root processor) transmitting the result to its parent.
The elements to be summed by a processor consist of original inputs The computation schedule for our summation algorithm can also be represented as a tree with a node for each computation step. 
Fast Fourier Transform
Our first example, the fast Fourier transform, illustrates these ideas in a concrete setting. We discuss the key aspects of the algorithm and then an implementation that achieves near peak performance on the Thinking Machines CM-5. We focus on the "butterfly" algorithm [8] for the discrete FFT problem, most easily described in terms of its computation graph. The n-input (n a power of 2) butterfly is a directed acyclic graph with n(log n + 1) nodes viewed as n rows of (log n + 1) columns each. For 0 ≤ r < n and 0 ≤ c < log n, the node (r, c) has directed edges to nodes (r, c
There is a vast body of work on this structure as an interconnection topology, as well as on efficient embeddings of the butterfly on hypercubes, shuffle-exchange networks, etc. This has led many researchers to feel that algorithms must be designed to match the interconnection topology of the target machine. In real machines, however, the n data inputs and the n log n computation nodes must be laid out across P processors, and typically P << n. The nature of this layout, and the fact that each processor holds many data elements, has a profound effect on the communication structure, as shown below.
A natural layout is to assign the first row of the butterfly to the first processor, the second row to the second processor and so on.
We refer to this as the cyclic layout.
Under this layout, the first log(n/P) columns of computation require only local data, whereas the last log P columns require a remote reference for each node. An alternative layout is to place the first n/P rows on the first processor, the next n/P rows on the second processor, and so on. With this blocked layout, each of the nodes in the first log P columns requires a remote datum for its computation, while the last log(n/P) columns require only local data. Under either layout, each processor spends (n/P) log n time computing and (g n/P + L) log P time communicating, assuming g ≥ 2o.
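The locality claims for the two layouts can be checked mechanically. The sketch below is illustrative only. It assumes the common butterfly convention in which column c pairs row r with the row obtained by complementing the (c+1)-th most significant bit of r; that convention, and the names `remote_columns`, `cyclic_owner` and `blocked_owner`, are assumptions rather than something stated in the text above.

```python
def cyclic_owner(P):
    """Cyclic layout: row r lives on processor r mod P."""
    return lambda r: r % P


def blocked_owner(n, P):
    """Blocked layout: rows are split into P contiguous runs of n/P rows."""
    return lambda r: r // (n // P)


def remote_columns(n, owner):
    """List the columns whose butterfly edges cross processor boundaries.

    Assumption: in column c, row r is paired with r XOR (n >> (c + 1)).
    """
    log_n = n.bit_length() - 1  # n is a power of 2
    remote = []
    for c in range(log_n):
        mask = n >> (c + 1)
        if any(owner(r) != owner(r ^ mask) for r in range(n)):
            remote.append(c)
    return remote


if __name__ == "__main__":
    n, P = 16, 4
    print("cyclic :", remote_columns(n, cyclic_owner(P)))
    print("blocked:", remote_columns(n, blocked_owner(n, P)))
```

Under this convention the cyclic layout is remote only in the last log P columns and the blocked layout only in the first log P columns, matching the description above.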
Since the initial computation of the cyclic layout and the final computation of the blocked layout are completely local, one is led to consider hybrid layouts that are cyclic on the first log P columns and blocked on the last log P. Indeed, switching from cyclic to blocked layout at any column between the log P-th and the log(n/P)-th (assuming that n > P^2) leads to an algorithm which has a single "all-to-all" communication step between two entirely local computation phases. Figure 5 highlights the node assignment for processor 0 for an 8-input FFT with P = 2 under the hybrid layout; remapping occurs between columns 2 and 3.
The bad remap curve shows the time spent remapping the data from a cyclic layout to a blocked layout if a naive communication schedule is used. The good remap curve shows the time for the same remapping, but with a contention-free communication schedule, which is an order of magnitude faster. The X axis scale refers to the entire FFT size.
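The difference between a naive and a contention-free remap schedule can be illustrated with a small model. The text here does not spell out either schedule, so both are assumptions made for illustration: the "naive" schedule has every processor target destination 0 first, then destination 1, and so on, while the "staggered" schedule has processor i target (i + k) mod P in step k. All function names are hypothetical.

```python
from collections import Counter


def naive_schedule(P):
    """Step k: every processor except k sends to destination k."""
    return [[(i, k) for i in range(P) if i != k] for k in range(P)]


def staggered_schedule(P):
    """Step k (k = 1..P-1): processor i sends to (i + k) mod P."""
    return [[(i, (i + k) % P) for i in range(P)] for k in range(1, P)]


def max_fan_in(schedule):
    """Largest number of messages aimed at one destination in any step."""
    worst = 0
    for step in schedule:
        per_destination = Counter(dst for _, dst in step)
        worst = max(worst, max(per_destination.values()))
    return worst


if __name__ == "__main__":
    P = 8
    print("naive fan-in    :", max_fan_in(naive_schedule(P)))
    print("staggered fan-in:", max_fan_in(staggered_schedule(P)))
```

Both schedules deliver the same set of source/destination pairs. In the staggered one each destination receives from a single source per step, so no receiver becomes a bottleneck.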
takes only~th as long.
The two computation phases involve purely local operations and are standard FFTs. From the computational performance in Figure 7 we can calibrate the "cycle time" for the FFT as the time for the set of complex 3.5 Figure 8 shows that this eliminates the performance drop.
We can test the effect of reducing g by improving the implementation to use both fat-tree networks present in the machine, thereby doubling the available network bandwidth. The result shown in Figure 8 is that the performance increases by only 15% because the network interface overhead (o) and the loop processing dominate.
4 The bisection bandwidth is the minimum bandwidth through any cut of the network that separates the set of processors into halves.
5 For simplicity, the implementation uses the hardware barrier available on the CM-5. The same effect could have been achieved using explicit acknowledgement messages.
The detailed quantitative analysis of the implementation shows that the hybrid-layout FFT algorithm is nearly optimal on the CM-5.
where H is the maximum distance of a route and M is the fixed message size being used, and g to be M divided by the per processor bisection bandwidth.
The send and receive overheads in Table 1 This is more properly viewed as part of the computational work of an algorithm using that style of communication.
For the CM-5, the bulk of the cost is due to the protocol associated with the synchronous send/receive, which involves a pair of messages before transmitting the first data element. This protocol is easily modeled in terms of and Monsoon [25] for a dataflow model. Although a significant improvement over the commercial machines, the overhead is still a significant fraction of the communication time.
Table 1: Network timing parameters for a one-way message without contention on several current commercial and research multiprocessors. The final two rows refer to the active message layer, which uses the commercial hardware, but reduces the interface overhead.
Saturation
In a real machine the latency experienced by a message tends to increase as a function of the load, i.e., the rate of message initiation, because more messages are in the network competing for resources.
Studies such as [11] show that there is typically a saturation point at which the latency increases sharply; below the saturation point the latency is fairly insensitive to the load. This characteristic is captured by the capacity constraint in LogP.
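A small numerical sketch can make the capacity constraint concrete. It assumes the usual LogP statement of the constraint, namely that at most ceil(L/g) messages may be in transit from or to any one processor, and it treats a message as occupying the network for L cycles after it is injected. The helper names `capacity` and `max_in_flight` are hypothetical.

```python
import math


def capacity(L, g):
    """Assumed LogP capacity: at most ceil(L/g) messages in transit."""
    return math.ceil(L / g)


def max_in_flight(L, interval, n_msgs):
    """Peak number of messages in transit when one processor injects
    n_msgs messages, one every `interval` cycles, each in flight L cycles."""
    sends = [k * interval for k in range(n_msgs)]
    # The peak can only occur at an injection instant, so check those.
    return max(sum(1 for s in sends if s <= t < s + L) for t in sends)


if __name__ == "__main__":
    L, g = 6, 4
    print("capacity            :", capacity(L, g))
    print("inject every g      :", max_in_flight(L, g, 10))
    print("inject every g // 2 :", max_in_flight(L, g // 2, 10))
```

Injecting at the gap g stays within the capacity. Injecting twice as fast exceeds it, which is the regime in which the model has the sender stall rather than the latency grow.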
5.4 Long messages
The LogP model does not give special treatment to long messages, yet some machines have special hardware (e.g., a DMA device connected to the network interface) to support sending long messages. The processor overhead for setting up that device is paid once and a part of sending and receiving long messages can be overlapped with computation. This is tantamount to providing two processors on each node, one to handle messages and one to do the computation. Our basic model assumes that each node consists only of one processor that is also responsible for sending and receiving messages. Therefore the overhead o is paid for each word (or small number of words). Providing a separate network processor to deliver or receive long messages can at best double the performance of each node. This can simply be modeled as two processors at each node.
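Under the basic model, the cost of a long message can be estimated by treating it as a stream of small messages. The sketch below is an illustration under stated assumptions, not a formula quoted from the paper: one word per small message, successive injections max(g, o) apart, and each small message taking o + L + o from the start of its send to the end of its receive. The function name `long_message_time` is hypothetical.

```python
def long_message_time(k, L, o, g):
    """Time until the last of k single-word messages has been received.

    Assumptions: the sender pays o per word, injections are max(g, o)
    apart, and each word needs o + L + o from send start to receive end.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * max(g, o) + o + L + o


if __name__ == "__main__":
    for k in (1, 4, 64):
        print(k, long_message_time(k, L=6, o=2, g=4))
```

For large k the total is dominated by the per-word term, which is why per-word overhead matters more than latency for long transfers in this model.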
5.5 Specialized hardware support
Some machines provide special hardware to perform a broadcast, scan, or global synchronization. In LogP, processors must explicitly send messages to perform these operations.7 However, the hardware versions of these operations are typically limited in functionality; for example, they may only work with integers, not floating-point numbers. They may not work for only a subset of the machine.
PRAM models
The PRAM [13] is the most popular model for representing and
It has been suggested that the PRAM can serve as a good model for expressing the logical structure of parallel algorithms, and that implementation of these algorithms can be achieved by general-purpose simulations of the PRAM on distributed-memory machines [26]. However, these simulations require powerful interconnection networks and, even then, may be unacceptably slow, especially when network bandwidth and processor overhead for sending and receiving messages are properly accounted for.
6.2 Extensions of the PRAM model
There are many variations on the basic PRAM model which address one or more of the problems discussed above, namely memory contention, asynchrony, latency and bandwidth.
Memory Contention: The Module Parallel Computer [19, 23] differs from the PRAM by assuming that the memory is divided into modules, each of which can process one access request at a time. This model is suitable for handling memory contention at the module level, but does not address issues of bandwidth and network capacity.
Asynchrony: Gibbons [14] proposed the Phase PRAM, an extension of the PRAM in which computation is divided into "phases." All processors run asynchronously within a phase, and synchro-
We conclude that the design of portable algorithms can best be carried out in a model such as LogP, in which detailed assumptions about the interconnection network are avoided.
high-end computer industry toward massively parallel machines constructed from nodes containing powerful processors and sub- the number of processors (P). We believe the model is sufficiently detailed to reflect the major practical issues in parallel algorithm design, yet simple enough to support detailed algorithmic analysis.
At the same time, the model avoids specifying the programming style or the communication protocol, being equally applicable to shared-memory, message passing, and data parallel paradigms.
As with any new proposal, there will naturally be concerns regarding its utility as a basis for further study. The model exhibits interesting theoretical structure; our optimal algorithms for broadcast and summation result in a formulation that is distinct from that 
