Abstract. This paper surveys and places into perspective a number of results concerning the D-BSP (Decomposable Bulk Synchronous Parallel) model of computation, a variant of the popular BSP model proposed by Valiant in the early nineties. D-BSP captures part of the proximity structure of the computing platform, modeling it by suitable decompositions into clusters, each characterized by its own bandwidth and latency parameters. Quantitative evidence is provided that, when modeling realistic parallel architectures, D-BSP achieves higher effectiveness and portability than BSP, without significantly affecting the ease of use. It is also shown that D-BSP avoids some of the shortcomings of BSP which motivated the definition of other variants of the model. Finally, the paper discusses how the aspects of network proximity incorporated in the model allow for a better management of network congestion and bank contention, when supporting a shared-memory abstraction in a distributed-memory environment.
Introduction
The use of parallel computers would be greatly enhanced by the availability of a model of computation that combines the following properties: usability, regarded as ease of algorithm design and analysis, effectiveness, so that efficiency of algorithms in the model translates into efficiency of execution on some given platform, and portability, which denotes the ability of achieving effectiveness with respect to a wide class of target platforms. These properties appear, to some extent, incompatible. For instance, effectiveness requires modeling a number of platform-specific aspects that affect performance (e.g., interconnection topology) at the expense of portability and usability. The formulation of a bridging model that balances among these conflicting requirements has proved a difficult task, as demonstrated by the proliferation of models in the literature over the years.
In the last decade, a number of bridging models have been proposed, which abstract a parallel platform as a set of processors and a set of either local or shared memory banks (or both) communicating through some interconnection. In order to ensure usability and portability over a large class of platforms, these models do not provide detailed characteristics of the interconnection but, rather, summarize its communication capabilities by a few parameters that broadly capture bandwidth and latency properties.
Perhaps the most popular example in this arena is Valiant's BSP (Bulk Synchronous Parallel) model [Val90]. A BSP machine is a set of $n$ processors with local memory, communicating through a router, whose computations are sequences of supersteps. In a superstep, each processor (i) reads the messages received in the previous superstep; (ii) performs computation on locally available data; (iii) sends messages to other processors; and (iv) takes part in a global barrier synchronization. A superstep is charged a cost of $w + gh + \ell$, where $w$ (resp., $h$) is the maximum number of operations performed (resp., messages sent/received) by any processor in the superstep, and $g$ and $\ell$ are parameters, with $g$ inversely related to the router's bandwidth and $\ell$ capturing latency and synchronization delays. [...] establishes a substantial equivalence between LogP and BSP as computational models for algorithm design guided by asymptotic analysis.
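As a minimal sketch of the cost accounting just described, the following Python fragment charges each superstep $w + gh + \ell$ and sums the charges over a program. The function names, the machine parameters, and the superstep profile are hypothetical and serve only as an illustration.

```python
def bsp_superstep_cost(w, h, g, ell):
    """Cost of one BSP superstep: local work w, plus g times the
    maximum number h of messages sent/received by any processor,
    plus the latency/synchronization term ell."""
    return w + g * h + ell


def bsp_program_cost(supersteps, g, ell):
    """Total cost of a sequence of supersteps given as (w, h) pairs."""
    return sum(bsp_superstep_cost(w, h, g, ell) for w, h in supersteps)


if __name__ == "__main__":
    # Hypothetical profile: three supersteps on a machine with g=4, ell=30.
    profile = [(100, 10), (50, 0), (20, 5)]
    print(bsp_program_cost(profile, g=4, ell=30))
```

Note that a superstep with no communication ($h = 0$) is still charged $\ell$, which reflects the global barrier synchronization.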
In recent years, a number of BSP variants have been formulated in the literature, whose definitions incorporate additional provisions aimed at improving the model's effectiveness relative to actual platforms without affecting its usability and portability significantly (see e.g., [BGMZ95, BDM95, JW96b, DK96]). Among these variants, the E-BSP (Extended BSP) by [JW96b] and the D-BSP (Decomposable BSP) by [DK96] are particularly relevant for this paper. E-BSP aims at predicting more accurately the cost of supersteps with unbalanced communication patterns, where the average number $h_{\mathrm{ave}}$ of messages sent/received by a processor is lower than the corresponding maximum number, $h$. Indeed, on many interconnections, routing time increases with $h_{\mathrm{ave}}$, for fixed $h$, a phenomenon modeled in E-BSP by adding a term depending upon $h_{\mathrm{ave}}$ to the cost of a superstep. However, the functional shape of this term varies with the topology of the intended target platform, making the model somewhat awkward.
D-BSP extends BSP by incorporating some aspects of network proximity into the model. Specifically, the set of $n$ processor/memory pairs is viewed as partitionable into a collection of clusters, where each cluster is able to perform its own sequence of supersteps independently of the other ones and is characterized by its own $g$ and $\ell$ parameters, typically increasing with the size of the cluster. The partition into clusters can change dynamically within a pre-specified set of legal partitions. The key advantage is that communication patterns where messages are confined within small clusters have small cost, as in realistic platforms and unlike in standard BSP. In fact, it can be shown quantitatively that this advantage translates into higher effectiveness and portability of D-BSP over BSP. Clustering also enables efficient routing of unbalanced communication patterns in D-BSP, making it unnecessary to further extend the cost model in the direction followed by E-BSP. Thus, D-BSP is an attractive candidate among BSP variants and, in general, among bandwidth-latency models, to strike a fair balance among the conflicting features sought in a bridging model of parallel computation.
In Section 2, we define a restricted version of D-BSP where clusters are defined according to a regular recursive structure, which greatly simplifies the use of the model without diminishing its power significantly. In Section 3, we employ the methodology based on cross-simulations proposed in [BPP99] to quantitatively assess the higher effectiveness of D-BSP with respect to BSP, relative to the wide class of processor networks. Then, in Subsection 3.1 we show that, for certain relevant computations and prominent topologies, D-BSP exhibits a considerably higher effectiveness than the one guaranteed by the general result. In such cases, the effectiveness of D-BSP becomes close to optimal.
Furthermore, we present a general strategy to exploit communication locality: one of the corollaries is a proof that D-BSP can be as effective as E-BSP in dealing with unbalanced communication patterns. Finally, in Section 4 we show how D-BSP can efficiently support a shared memory abstraction, a valuable provision for algorithm development in a distributed-memory environment. The results presented in the section clearly indicate that the network proximity modeled by D-BSP can be exploited to reduce network congestion and bank contention when implementing a shared address space both by randomized and by deterministic strategies.
The D-BSP Model
The D-BSP (Decomposable BSP) model was introduced in [DK96] as an extension of Valiant's BSP [Val90] aimed at capturing, in part, the proximity structure of the network. In its most general definition, the D-BSP is regarded as a set of $n$ processor/memory pairs communicating through a router, which can be aggregated according to a predefined collection of submachines, each able to operate independently. For concreteness, we focus our attention on a restricted version of the model (referred to as recursive D-BSP in [DK96]) where the collection of submachines has the following regular structure. Let $n$ be a power of two. For $0 \le i \le \log n$, the $n$ processors are partitioned [...]
[...] introduces the notion of proximity in BSP through clustering, and groups $h$-relations into specialized classes associated with different costs. This ensures full compatibility between the two models, which allows programs written according to one model to run on any machine supporting the other, the only difference being their estimated performance.
In this paper, we will often exemplify our considerations by focusing on a class of parameter values for D-BSP of particular significance. Namely, let $\alpha$ and $\beta$ be two arbitrary [...]
[...] $M'$ in a given class, this quantity provides an upper measure of the portability of $M$ with respect to the class. We use this approach to evaluate the effectiveness of D-BSP with respect to the class of processor networks. Let $G$ be a connected $n$-processor network, where in one step each processor executes a constant number of local operations and may send/receive one point-to-point message to/from each neighboring processor (multi-port regimen). As is the case for all relevant network topologies, we assume that $G$ has a decomposition tree [...]
where [...]
We can apply Equations 1 and 2 to quantitatively estimate the effectiveness of D-BSP with respect to specific network topologies. Consider, for instance, the case of an $n$-node $d$-dimensional array. Fix [...]
It is important to remark that the D-BSP clustered structure provides a crucial contribution to the model's effectiveness. Indeed, it can be shown that, if $M'$ is a BSP$(n, g, \ell)$ and $G$ is a $d$-dimensional array, then the corresponding effectiveness measure is $\Omega(n^{1/d})$, independently of $g$, $\ell$ and the size of the memory at each processor [BPP99]. This implies that, under this metric, D-BSP is asymptotically more effective than BSP with respect to multidimensional arrays.
Effectiveness of D-BSP with respect to specific computations
We note that non-constant slowdown for simulating an arbitrary computation of a processor network on a D-BSP is to be expected since the D-BSP disregards the fine structure of the network topology, and, consequently, it is unable to fully exploit topological locality. However, for several prominent topologies and several relevant computational problems arising in practical applications, the impact of such a loss of locality is much less than what the above simulation results may suggest, and, in many cases, it is negligible.
Consider, for example, the class of processor networks whose topology has a recursive structure with bisection bandwidth $O(n^{1-\alpha})$ [...]
[...] which is not optimal, for instance, when $\alpha$ [...] $\beta$.
We call $k$-sorting a sorting problem in which $k$ keys are initially assigned to each one of $n$ processors and are to be redistributed so that the $k$ smallest keys will be held by processor $P_0$, the next $k$ smallest ones by processor $P_1$, and so on. It is easy to see that $k$-sorting requires time $\Omega(\ldots)$ [...] [FPP01]. Indeed, standard lower bound arguments show that such a routing time is optimal for [...] [SK94]. As a corollary of the above routing result, we can show that, unlike the standard BSP model, D-BSP is also able to handle unbalanced communication patterns efficiently, which was the main objective that motivated the introduction of a BSP variant, called E-BSP, by [JW96a]. Let an $(h, m)$-relation be a routing instance where each processor sends/receives at most $h$ messages, and a total of $m$ messages are exchanged. [...]
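To make the notion of an $(h, m)$-relation concrete, here is a minimal Python sketch that computes both parameters from a list of point-to-point messages. The function name and the representation of a routing instance as (source, destination) pairs are assumptions made for illustration only.

```python
from collections import Counter


def relation_parameters(messages, n):
    """Return (h, m) for a routing instance on processors 0..n-1.

    messages is a list of (source, destination) pairs. h is the largest
    number of messages sent or received by any single processor, and m
    is the total number of messages exchanged.
    """
    sent = Counter(src for src, _ in messages)
    received = Counter(dst for _, dst in messages)
    h = max((max(sent[p], received[p]) for p in range(n)), default=0)
    return h, len(messages)


if __name__ == "__main__":
    n = 8
    # Unbalanced pattern: processor 0 sends one message to each of 1..4.
    unbalanced = [(0, d) for d in range(1, 5)]
    # Balanced pattern: a cyclic shift, every processor sends one message.
    balanced = [(p, (p + 1) % n) for p in range(n)]
    print(relation_parameters(unbalanced, n))
    print(relation_parameters(balanced, n))
```

In the unbalanced instance the average number of messages per processor, $m/n$, is well below the maximum $h$. This is the kind of pattern whose cost a model charging only for $h$ tends to overestimate.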
Providing Shared Memory on D-BSP
A very desirable feature of a distributed-memory model is the ability to support a shared memory abstraction efficiently. Among the other benefits, this feature allows porting the vast body of PRAM algorithms [JáJ92] to the model at the cost of a small time penalty. In this section we present a number of results that demonstrate that D-BSP can be endowed with an efficient shared memory abstraction.
Implementing shared memory calls for the development of a scheme to represent $m$ shared cells (variables) among the $n$ processor/memory pairs of a distributed-memory machine in such a way that any $n$-tuple of variables can be read/written efficiently by the processors. The time required by a parallel access to an arbitrary $n$-tuple of variables is often referred to as the slowdown of the scheme.
Numerous randomized and deterministic schemes have been developed in the literature for a number of specific processor networks. Randomized schemes (see e.g., [CMS95, Ran91]) usually distribute the variables randomly among the memory modules local to the processors. As a consequence of such a scattering, a simple routing strategy is sufficient to access any $n$-tuple of variables efficiently, with high probability. Following this line, we can give a simple, randomized scheme for shared memory access on D-BSP. Assume, for simplicity, that the variables are spread among the local memory modules by means of a totally random function. In fact, a polynomial hash function drawn from a $\log n$-universal class [CW79] suffices to achieve the same results [MV84], but it takes only $\mathrm{poly}(\log n)$ rather than $O(n \log n)$ random bits to be generated and stored at the nodes. We have: [...]
[...] we send the messages containing the access requests to their destination $i$-clusters, so that each node in the cluster receives roughly the same number of messages. A standard occupancy argument [MR95] suffices to show that, with high probability, there will be no more than $cn/2^i$ messages destined to the same $i$-cluster, for a given small constant $c > 1$, hence each step requires a simple prefix and the routing of an $O(1)$-relation in $i$-clusters. In the last step, we simply send the messages to their final destinations, where the memory access is performed. Again, the same probabilistic argument implies that the degree of the relation in this case is $O(\log n / \log\log n)$, with high probability.
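The occupancy bound invoked for the last step can be explored empirically. The following Python sketch is an illustration only: it models just the final delivery, not the cluster-by-cluster routing, and uses a seeded pseudo-random generator as a stand-in for the totally random function. It places $n$ requested variables on $n$ memory modules uniformly at random and reports the largest number of requests landing on one module, next to $\log n / \log\log n$ for reference. The function name and the chosen values of $n$ are hypothetical.

```python
import math
import random
from collections import Counter


def max_module_load(n, seed=0):
    """Map n distinct requested variables to n modules uniformly at
    random and return (maximum load, per-module load counter)."""
    rng = random.Random(seed)  # stand-in for a totally random function
    loads = Counter(rng.randrange(n) for _ in range(n))
    return max(loads.values()), loads


if __name__ == "__main__":
    for n in (2**8, 2**12, 2**16):
        worst, _ = max_module_load(n)
        reference = math.log(n) / math.log(math.log(n))
        print(n, worst, round(reference, 2))
```

The printed reference value carries no constant factor, so the observed maximum is expected to track it only up to a constant.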
For read accesses, the return journey of the messages containing the accessed values can be performed by reversing the algorithm for writes, thus remaining within the same time bound.
Let us now switch to deterministic schemes. In this case, achieving efficiency is much harder, since, in order to avoid the trivial worst case where a few memory modules contain all of the requested data, we are forced to replicate each variable and manage the replicated copies so as to enforce consistency. A typical deterministic scheme replicates every variable into a number of copies, which are then distributed among the memory modules through a map exhibiting suitable expansion properties. Expansion is needed to guarantee that the copies relative to any $n$-tuple of variables are never confined within a few nodes. The number of copies per variable is referred to as the redundancy of the scheme. In order to achieve efficiency, the main idea, originally introduced in [UW87] and adopted in all subsequent works, is that any access (read or write) to a variable is satisfied by reaching only a subset of its copies, suitably chosen to maximize communication bandwidth while ensuring consistency (i.e., a read access must always return the most updated value of the variable).
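The text does not spell out how the subset of copies is chosen. One classical way to obtain consistency from partial accesses, assumed here purely for illustration, is a majority rule with timestamps: a variable is kept in $2c-1$ copies, every access reaches at least $c$ of them, and a read returns the value carrying the largest timestamp among the copies it reached. Since any two sets of $c$ copies intersect, a read always sees the latest write. The class and parameter names below are hypothetical.

```python
class ReplicatedVariable:
    """One shared variable stored as 2c-1 timestamped copies."""

    def __init__(self, c, initial=None):
        self.c = c
        # Each copy is a (timestamp, value) pair.
        self.copies = [(0, initial)] * (2 * c - 1)

    def _check(self, reached):
        # An access is valid only if it reaches at least c distinct copies.
        if len(set(reached)) < self.c:
            raise ValueError("an access must reach at least c distinct copies")

    def write(self, value, timestamp, reached):
        """Update only the copies whose indices are listed in reached."""
        self._check(reached)
        for i in reached:
            self.copies[i] = (timestamp, value)

    def read(self, reached):
        """Return the most recent value among the reached copies."""
        self._check(reached)
        return max((self.copies[i] for i in reached), key=lambda tv: tv[0])[1]


if __name__ == "__main__":
    x = ReplicatedVariable(c=2)           # three copies, two per access
    x.write("a", timestamp=1, reached=[0, 1])
    x.write("b", timestamp=2, reached=[1, 2])
    print(x.read([0, 1]), x.read([0, 2]))
```

The freedom to pick which $c$ copies to reach is what lets a scheme steer accesses away from congested modules.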
A general deterministic scheme to implement a shared memory abstraction on a D-BSP is presented in [FPP01]. The scheme builds upon the one in [PPS00] for a number of processor networks, whose design exploits the recursive decomposition of the underlying topology to provide a hierarchical, redundant representation of the shared memory based on $k + 1$ levels of logical modules. Such an organization fits well with the structure of a D-BSP, which is hierarchical in nature. More specifically, each variable is replicated into $r = O(1)$ copies, and the copies are assigned to $r$ logical modules of level 0. In general, the logical modules at the $i$-th level, $0 \le i < k$, are replicated into three copies, which are assigned to three modules of level $i + 1$. This process eventually creates $r3^k = \Theta(3^k)$ copies of each variable, and $3^{k-i}$ replicas of each module at level $i$. The number (resp., size) of the logical modules decreases (resp., increases) with the level number, and their replicas are mapped to the D-BSP by assigning each distinct block to a distinct cluster of appropriate size, so that each of the sub-blocks contained within the block is recursively assigned to a sub-cluster.
The key ingredients of the above memory organization are represented by the bipartite graph that governs the distribution of the copies of the variables among the modules of the first level, and those that govern the distribution of the replicas of the modules at the subsequent levels. The former graph is required to exhibit some weak expansion property, and its existence can always be proved through combinatorial arguments although, for certain memory sizes, explicit constructions can be given. In contrast, all the other graphs employed in the scheme require expansion properties that can be obtained by suitable modifications of the BIBD graph [Hal86], and can always be explicitly constructed.
For an $n$-tuple of variables to be read/written, the selection of the copies to be accessed and the subsequent execution of the accesses of the selected copies are performed on the D-BSP through a protocol similar to the one in [PPS00], which can be implemented [...]
An interesting consequence of the above theorem is that optimal worst-case slowdowns for shared memory access are achievable with constant redundancy for machines where latency overheads dominate over those due to bandwidth limitations, as is often the case in network-based parallel machines. When this is not the case, it is shown in [FPP01] that the proposed scheme is not too far from optimal. Perhaps the most important feature of the above scheme is that, unlike the other classical deterministic schemes in the literature, it relies solely on expander graphs of mild expansion, hence it can be made fully constructive for a significant range of the parameters involved. Such mild expanders, however, are only able to guarantee that the copies of an arbitrary $n$-tuple of variables are spread among $O(n^{1-\epsilon})$ memory modules, for some constant $\epsilon < 1$. Hence the congestion at a single memory module can be as high as $O(n^{\epsilon})$, and the clustered structure of D-BSP is essential in order to achieve good slowdown. In fact, any deterministic strategy employing these graphs on a BSP$(n, g, \ell)$ could not achieve better than $\Theta([\ldots]\,n)$ slowdown.
