Abstract: Predicting how well applications will run on modern systems is becoming increasingly challenging. It is no longer sufficient to count floating-point operations and communication costs; one must also model the underlying systems and how their topology, heterogeneity, system load, and similar factors may affect performance. This work develops a practical model for heterogeneous computing. It starts from the older BSP model, which models communication costs on homogeneous systems, and examines how BSP library implementations can be extended with a run-time system suited to heterogeneous systems. Our extensions of BSPlib with MPI and GASNet mechanisms at the communication layer should provide useful tools for evaluating how applications may run on heterogeneous systems.
I. PROBLEM DESCRIPTION
The scalability of a parallel program is inherently connected with the performance parameters of the platform of execution. Matching a problem to a suitable architecture often requires extensive analysis of both the individual application and candidate hardware.
This work presents methods for extracting a shared set of characteristics from both programs and architectures, and for supporting quantitative analysis of expected performance. The quantities of interest include the sustainable computation rate of individual processors, pairwise communication bandwidth between processes, pairwise communication latency between processes, synchronization cost/requirement, and potential for overlap (simultaneous computation and communication).
All these quantities are known to be central to performance analysis, and they combine in the "fundamental equation of modeling" [1]:

$$T_{\text{total}} = T_{\text{computation}} + T_{\text{communication}} - T_{\text{overlap}} \tag{1}$$
However, system heterogeneity creates several challenges. As parallel systems grow in scale, deriving the time required to satisfy an application's communication requirement becomes dependent on the interplay between process locality and interconnect topology. Also, developments in computer architecture suggest that the computation rates of individual processors integrated in a parallel platform are unlikely to remain uniform as systems grow. Finally, the magnitude of the overlap term depends both on the dependencies between intermediate results in a given algorithm and on the architectural capacity for simultaneous communication and computation. The resulting parameter space suggests that performance modeling must combine all of these factors in order to produce valid predictions.
II. PROPOSED APPROACH
Our approach explicitly expresses the parameters in terms of shared metrics across multiple abstraction levels: algorithm description, programming model, and execution environment. Because a general model must encompass large variability in application and architectural design spaces, a framework for deriving specific models is more feasible than a single general model. Figure 1 shows the building blocks from which we intend to construct such a framework. The columns show the research we rely on to provide each term of Equation 1: communication, overlap and computation.
A. Cost Model for Heterogeneous Interconnects
Our context requires that a useful model for communication on heterogeneous platforms accurately account for both the variable link capacity between pairs of processes and the cost of global synchronization. Related work has already proposed a strategy for adapting uniform network models to heterogeneous clusters, in the extension of the LogP model to HLogGP, which is discussed in Section V-B. With respect to global synchronization cost, we have completed some studies which show that synchronization is closely connected to latency on modern architectures, and accordingly, that synchronization algorithms must be made locality-aware to be efficient. These results are summarized in Section VI-A, and some promising results on locality-aware barrier algorithms are discussed in Section VI-C.
B. Asynchronous BSPlib implementation
To examine how the identification of the overlap term may be simplified, we turn to the BSP (Bulk-Synchronous Parallelism) model [2]. Through its attempt at capturing a common parameter set for algorithms and architectures, this model contains untapped potential for supporting quantitative analysis of the computation/communication balance. An attractive aspect of the model is that it is accompanied by a programming interface specification [4], which relates an algorithmic description to a set of semantics specific enough to permit implementation of a corresponding runtime system. Communication semantics in BSP permit the registration of a point-to-point data transfer at any time during a computational superstep, while the effect is only expected to be valid after a subsequent barrier primitive.
This implies that the overlap term can be deduced from an algorithm's potential for postponing computation to the end of the superstep, combined with the architecture's capacity for concurrent communication and computation. Experiments with models and implementations show that there is great benefit in the exploitation of potential overlap, but that improvements are required in BSP runtime libraries in order to utilize it, as discussed in Sections VI-B and VII-A.
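As a rough illustration of how the overlap term could enter a superstep cost estimate, the sketch below contrasts the classic BSP cost w + h·g + l with a variant in which part of the computation proceeds while registered transfers are in flight. The function names, the `overlap_fraction` parameter, and the max-based combination are assumptions made for this example; they are not part of BSPlib or of the model developed in this work.

```python
def classic_superstep_cost(w, h, g, l):
    """Classic BSP cost: computation w, then an h-relation at gap g, then barrier l."""
    return w + h * g + l


def overlapped_superstep_cost(w, h, g, l, overlap_fraction):
    """Superstep cost when some computation hides behind communication.

    overlap_fraction (hypothetical parameter, in [0, 1]) is the share of w
    that the algorithm can postpone until after its transfers are registered.
    The architecture is assumed to overlap communication and computation fully.
    """
    before_transfers = (1.0 - overlap_fraction) * w
    postponed = overlap_fraction * w
    # Postponed computation and communication run concurrently,
    # so the longer of the two determines the elapsed time.
    return before_transfers + max(postponed, h * g) + l


def overlap_term(w, h, g, l, overlap_fraction):
    """Time saved relative to the classic cost, i.e. the overlap term."""
    return (classic_superstep_cost(w, h, g, l)
            - overlapped_superstep_cost(w, h, g, l, overlap_fraction))
```

Under these assumptions the saving equals the smaller of the postponed computation and the communication time, so it vanishes when nothing can be postponed and is capped by the communication cost.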
C. Cost Model for Sustained Computation Rate
Developing a cost model for computation time requirements has been the topic of some preliminary work, which shows that in order to attain a predictive model, it is necessary to benchmark operations which are large enough to warrant some memory traffic, and reach a measurable steady state. This suggests that computation cost is most appropriately measured in terms of higher level operations than the basic arithmetic operations of the processor. On this note, we intend to rely on the success of the BLAS linear algebra programming interface, in identifying common computational kernels which apply to a large spectrum of applications. Some effort has been put towards this end, as discussed in Section VII-B.
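A minimal sketch of such a kernel-level rate measurement follows. It times a dense matrix-vector product, written in pure Python as a stand-in for a BLAS routine, discards warm-up runs, and reports a rate from the median of the remaining samples. The function names, the warm-up policy, and the use of the median are assumptions of this example rather than the benchmarking procedure used in our experiments.

```python
import statistics
import time


def matvec(a, x):
    """Dense matrix-vector product; a pure-Python stand-in for a BLAS kernel."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def sustained_rate(n, repeats=5, warmup=2):
    """Estimate floating-point operations per second for an n-by-n matvec."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    x = [1.0] * n
    flops = 2 * n * n  # one multiply and one add per matrix element
    samples = []
    for run in range(warmup + repeats):
        start = time.perf_counter()
        matvec(a, x)
        elapsed = time.perf_counter() - start
        if run >= warmup:  # keep only runs taken after the warm-up phase
            samples.append(elapsed)
    # The median damps outliers such as context switches during a run.
    return flops / statistics.median(samples)
```

Calling `sustained_rate(200)` returns a rate for a 200-by-200 problem; the size would need to be chosen large enough to generate memory traffic on the platform under test.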
III. RESEARCH METHODOLOGY
This research focuses on building a realistic model for heterogeneous systems using both analytical and empirical methods. Algorithm analysis has a strong tradition of establishing bounds on resource requirements from architecture-independent observations of algorithm properties, denoting practical operation cost with constants, which are frequently hidden by asymptotic notation in order to generalize the result. Performance analysis, on the other hand, seeks to find system bottlenecks by profiling, simulation, and explorative experiments, often relying on statistics to reduce complexity and attain predictability. Bridging the differences between these two methods requires an iterative refinement process, wherein empirical data feeds back into our performance model. Because the initial model necessarily abstracts certain system features, it cannot be expected to immediately produce verifiable results, but any discrepancies between observational data and predictions can guide model refinements while searching for significant points in the space of relevant performance parameters.
IV. SIGNIFICANCE OF THE RESEARCH
The significance of this research lies in systematically merging performance modeling with system development and deployment. Presently, assessing the impact of each term in Equation 1 requires meticulous scrutiny of programmatic and architectural properties before a model can be extracted. A structured approach to finding these terms will simplify the endeavour. It also identifies stages of the work that are candidates for automation.
The outlined research work is expected to culminate in an implementation of the BSPlib interface of sufficient maturity to demonstrate its predictive power in modeling small application programs across varying scales and deployment platforms. This work shows how performance parameters can be integrated with programming models. Our implementation does not cover all performance parameters one could include in a model, but it demonstrates the approach.
V. RELATED WORK
Because this work aims to extend the BSP model, Valiant's original proposal [2] is obviously of great significance. The LogP model [3] presents a more detailed view of communication cost, and has seen several successful applications in performance modeling.
A. The Classic BSP Model
The classic BSP model presents the expression of computation in terms of supersteps, and suggests its use as a bridge to identify a common parameter space where performance targets can be shared for algorithms and architectures. As has already been suggested, the main shortcoming of this work lies in its simplistic cost model for communication. Several suggestions have been made to refine it, from the work of Tiskin [11], which places great emphasis on scheduling and optimal simulation, to practical approaches such as those of Bisseling [12] and Hou et al. [13]. Several others are surveyed in Valiant's second model [10], which also elaborates on the original model to account for the structure of modern architectures.
Most of these works are complementary to our research, in that performance modeling activities often are considered to require a number of architectural details detrimental to the generality of the models. Bisseling's book [12] is perhaps the work that comes closest to connecting theory with applications. The main issue with its description is that the program code it suggests for benchmarking architectural parameters approximates execution speed by an operation count estimate on a vector dot product kernel. Our preliminary results on run time estimates suggest that this approach is as likely to measure operating system interference as it is to capture a scalable performance metric. The book also suggests that an all-to-all collective operation can emulate the superstep synchronization. This is consistent with BSP semantics, but fails to account for overlap. The result is a discrepancy between modelable features of the algorithm and its implementation, similar to the one we have observed using a BSPlib implementation which adopts this strategy.
B. The LogP and HLogGP Models
The LogP model is refined into the HLogGP model [14] for heterogeneous clusters by transforming scalar parameters into matrices, to describe the pairwise relationships between all nodes. Such a model is relevant for this work, as it delivers accurate analysis of the aggregate performance of a loosely coupled heterogeneous network. The shortcoming of the LogP approach is that it models communication cost in terms of breaking down the cost of transmitting a single message. While this makes it applicable to arbitrary message passing programming models, its decoupling from program semantics is also its weakness, as the translation of programs and architectural features into model features is entirely left to the practices of the application engineer.
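The following sketch illustrates the idea of replacing scalar parameters with pairwise ones. It uses the familiar LogGP estimate for a single message, o_send + (k - 1)·G + L + o_recv, with latency and per-byte gap indexed by the (source, destination) pair and overheads indexed by process. The exact parameter shapes of HLogGP in [14] may differ; the function name, argument layout, and numeric values are assumptions chosen for illustration.

```python
def message_time(src, dst, nbytes, latency, gap_per_byte,
                 send_overhead, recv_overhead):
    """Estimated time for one nbytes-long message from src to dst.

    latency and gap_per_byte are matrices indexed [src][dst];
    send_overhead and recv_overhead are per-process vectors.
    """
    return (send_overhead[src]
            + (nbytes - 1) * gap_per_byte[src][dst]
            + latency[src][dst]
            + recv_overhead[dst])


# Illustrative values only: processes 0 and 1 share a node, process 2 is remote.
LATENCY = [[0.0, 1.0, 10.0],
           [1.0, 0.0, 10.0],
           [10.0, 10.0, 0.0]]
GAP_PER_BYTE = [[0.0, 0.5, 2.0],
                [0.5, 0.0, 2.0],
                [2.0, 2.0, 0.0]]
SEND_OVERHEAD = [1.0, 1.0, 3.0]
RECV_OVERHEAD = [1.0, 1.0, 3.0]

local_cost = message_time(0, 1, 101, LATENCY, GAP_PER_BYTE,
                          SEND_OVERHEAD, RECV_OVERHEAD)
remote_cost = message_time(0, 2, 101, LATENCY, GAP_PER_BYTE,
                           SEND_OVERHEAD, RECV_OVERHEAD)
```

The same message is cheaper between co-located processes than across nodes, which is the pairwise distinction a scalar LogP parameter set cannot express.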
VI. RESULTS
Our results include the impact of latency of modern systems as well as an evaluation of an existing BSPlib implementation. These results are summarized below.
A. Results on Latency Impact
One observation of significant impact is that analytical models must account for nonuniform interconnects in order to admit empirical validation. The seminal work of Mellor-Crummey and Scott [5] showed that spin-lock performance at the time was easily constrained by saturating interconnect bandwidth, and that careful manipulation of lock memory structure could avoid this effect. Our initial study to quantify the significance of this effect on contemporary architectures suggested that message latency has displaced bandwidth limits as the dominant performance parameter for synchronization [6]. This work has been further developed in a journal article [7], which verifies what the initial study indicated, and shows that latency effects measurably affect spin-lock performance not only on large interconnects, but also between cores on a chip. This consideration proved significant also in our performance study on parallel bit-reversal [9]. In this study, an application-relevant algorithm of limited numerical intensity proved to be parallelizable using a workpool scheme, but the resulting potential for overlap attained by this technique limits scalability due to nonuniform memory access. This further strengthens the argument that effective performance measures must account for process locality.
B. Practical BSPlib Results
A study of models and attained performance of a distributed memory stencil code [8] was performed, contrasting BSP, MPI and hybrid MPI/Pthreads implementations of the same problem on the 2-level hierarchical interconnect of a cluster of 8-way SMP nodes. Obtained results again demonstrated the importance of local communications, as the hybrid implementation obtained vastly superior utilization through its explicit acknowledgment of the interconnect hierarchy. In itself, this result echoes similar observations made by many researchers, but it also pinpoints how the application and architecture both contain a potential overlap in model terms, which the tested runtime fails to reflect.
C. Locality-Aware Barrier Implementation
Our work on locality-aware barrier models has produced a model which is parametric in terms of algorithm and interconnect topology. Preliminary tests provide stable predictions of interconnect impact on 4 different algorithms on a cluster of 8-way nodes, mostly accurate to within 25% of empirical results, as shown in Fig. 2. Algorithmic and architectural analysis are both fully automated, capturing differences in algorithm performance which span three orders of magnitude in absolute terms. Our method reflects particular behavior of two of the algorithms which a manual analysis would justify in terms of the underlying topology. Further testing on larger platforms is ongoing.
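To make the notion of a model that is parametric in both algorithm and topology concrete, the sketch below describes a barrier algorithm as a list of communication rounds and a topology as a pairwise latency matrix. It is a simplified stand-in, not the model evaluated in Fig. 2: the function names, the two example algorithms, and the assumption that a process handles its sends and its receives serially within a round are all choices made for this illustration.

```python
from collections import defaultdict


def dissemination_rounds(p):
    """Rounds of a dissemination barrier: in each round i signals i + step."""
    rounds, step = [], 1
    while step < p:
        rounds.append([(i, (i + step) % p) for i in range(p)])
        step *= 2
    return rounds


def linear_rounds(p):
    """Central barrier: everyone signals process 0, then 0 releases everyone."""
    gather = [(i, 0) for i in range(1, p)]
    release = [(0, i) for i in range(1, p)]
    return [gather, release]


def two_level_latency(p, node_size, local, remote):
    """Latency matrix for p processes packed onto nodes of node_size each."""
    return [[0.0 if i == j
             else (local if i // node_size == j // node_size else remote)
             for j in range(p)] for i in range(p)]


def barrier_cost(rounds, latency):
    """Sum over rounds of the busiest process's serialized message time."""
    total = 0.0
    for pairs in rounds:
        if not pairs:
            continue
        sent, received = defaultdict(float), defaultdict(float)
        for src, dst in pairs:
            sent[src] += latency[src][dst]
            received[dst] += latency[src][dst]
        total += max(max(sent.values()), max(received.values()))
    return total
```

For example, `barrier_cost(dissemination_rounds(4), two_level_latency(4, 2, 1.0, 10.0))` evaluates one algorithm on one topology; swapping either argument changes the prediction without touching the cost function.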
VII. REMAINING OBJECTIVES AND CHALLENGES
This section summarizes some of our remaining issues related to modeling heterogeneity.
A. Asynchronous BSPlib Implementation
An important step in further work is to produce an implementation of the BSPlib standard which programmatically realizes the potential for overlap implied by the semantics of the interface specification. Two such implementations have been written, which employ the asynchronous communication mechanisms of MPI and GASNet as communication layers, respectively. The MPI implementation is intended for portability, to enable application testing and model validation on a wide range of platforms, while the GASNet implementation is intended for testing on interconnects with explicit support for remote memory writes. Both can run experiments, but further experimental evaluation awaits integration of the barrier model from Section VI-C.
B. Computational Rate Measures
Some effort has been put into testing the accuracy of measurable rates of computation at the application level. Early experiments suggest that it is possible to obtain a rate measurement which can yield predictions of completion time for input sizes orders of magnitude larger than the benchmarked problem, to an accuracy which is good enough to display decreasing relative error with growing problem size. Obtaining these benchmarks required the isolation of a number of effects induced by the operating system, such as the impacts of demand paging, involuntary context switches, accuracy of the system clock, and processor power stepping. As the resulting predictions are still parametric with respect to the specific numerical kernel measured, using this technique outside of a laboratory setting remains a challenge.
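The sketch below shows how a benchmarked rate could be extrapolated and how the relative error of the prediction could be tracked across problem sizes. The "measured" times are synthetic: they come from a hypothetical machine with a fixed start-up overhead, chosen only to show one mechanism by which relative error can shrink as the problem grows. None of the numbers are experimental results, and all names are assumptions of this example.

```python
def predict_time(operations, rate):
    """Predicted completion time from an operation count and a sustained rate."""
    return operations / rate


def relative_error(predicted, measured):
    """Relative deviation of a prediction from a measured time."""
    return abs(predicted - measured) / measured


def synthetic_measured_time(operations, rate=1.0e6, fixed_overhead=0.01):
    """Synthetic stand-in for a measurement: steady-state time plus overhead."""
    return operations / rate + fixed_overhead


BENCHMARKED_RATE = 1.0e6  # hypothetical rate, in operations per second
PROBLEM_SIZES = [1.0e6, 1.0e7, 1.0e8]  # operation counts, illustrative only

errors = [
    relative_error(predict_time(ops, BENCHMARKED_RATE),
                   synthetic_measured_time(ops))
    for ops in PROBLEM_SIZES
]
```

In this synthetic setting the fixed overhead is amortized over more work at each step, so each entry of `errors` is smaller than the one before it.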
