Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000
Conference paper with proceedings, not peer-reviewed. The ccNUMA architecture of the SGI Origin2000 has been shown to perform and scale well for a wide range of scientific and engineering applications. This paper focuses on a well-known hierarchical computer graphics algorithm, wavelet radiosity, whose parallelization is made challenging by its irregular, dynamic, and unpredictable characteristics. Our previous experiments, based on a naive parallelization, showed that the Origin2000's hierarchical memory structure is well suited to the natural data locality exhibited by this hierarchical algorithm. However, our crude load-balancing strategy was clearly insufficient to exploit the full power of the Origin2000. We present here a fine-grained load-balancing analysis and then propose several enhancements, namely "lazy copy" and "lure", that greatly reduce idle time at locks and synchronization barriers. The new parallel algorithm is evaluated on a 64-processor Origin2000. Although a communication overhead is introduced in theory, we show that data locality is still preserved. The final performance evaluation shows quasi-optimal behavior up to the 32-processor scale; beyond that, a trouble spot remains to be identified to explain the performance degradation observed at 64 processors.
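The "lazy copy" idea described above can be illustrated with a minimal sketch: each worker keeps a private copy of read-mostly shared data that is populated on first access, so the global lock is taken once per item per worker rather than on every read. This is a hypothetical illustration of the general technique; the paper's actual lazy-copy scheme for wavelet radiosity is not detailed in the abstract, and all names below (`LazyCopyCache`, `get`) are invented for the example.

```python
import threading

class LazyCopyCache:
    """Per-worker lazy copy of read-mostly shared data.

    The shared store is locked only on the first access to each key;
    afterwards the worker reads its own private copy, so subsequent
    refinement work proceeds without touching the global lock.
    (Hypothetical sketch, not the paper's actual implementation.)
    """

    def __init__(self, shared, lock):
        self.shared = shared   # the global data structure
        self.lock = lock       # the lock protecting it
        self.local = {}        # this worker's private copies

    def get(self, key):
        if key not in self.local:
            # First touch: take the global lock and copy the record.
            with self.lock:
                self.local[key] = dict(self.shared[key])
        # Every later touch is lock-free on the private copy.
        return self.local[key]
```

Because each worker mutates only its private copy, lock hold times shrink to the initial copy, at the cost of the duplicated data the abstract refers to as a theoretical communication overhead.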
A Scalable Distributed Shared Memory
Parallel computers of the future will require a memory model that offers a global address space to the programmer while performing equally well under various system configurations. We present a logically shared, physically distributed memory that meets both requirements. This paper introduces the memory system used in the ADAM coarse-grain dataflow machine, which preserves scalability by tolerating latency and offers programmability through its object-based structure. We show how to support data objects of arbitrary size with differing access bandwidth and latency characteristics, and present a possible implementation of this model. The proposed system is evaluated by analyzing the bandwidth and latency characteristics of the three object classes and by examining the impact of different network topologies. Finally, we present a number of simulation results that confirm the analysis.
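An object-based global address space of the kind described above can be sketched as a translation from a (object id, offset) pair to a (home node, local address) pair. This is a hedged illustration of the general idea, assuming simple interleaving of objects over nodes; ADAM's actual mapping and object classes are not specified in the abstract, and `locate` and `obj_table` are invented names.

```python
def locate(global_addr, num_nodes, obj_table):
    """Translate a global (object id, offset) address into a
    (home node, local address) pair.

    Hypothetical sketch of object-based DSM addressing: objects are
    interleaved over nodes by id, and each node keeps a table mapping
    object ids to local base addresses.
    """
    obj_id, offset = global_addr
    node = obj_id % num_nodes     # home node by simple interleaving
    base = obj_table[obj_id]      # object's base address on its node
    return node, base + offset
```

Keeping the object id, rather than a flat address, as the unit of placement is what lets objects of arbitrary size live anywhere while the programmer still sees one shared space.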
MaDCoWS: A scalable distributed shared memory environment for massively parallel multiprocessors
In this paper we present MaDCoWS, a software implementation of a Distributed Shared Memory (DSM) runtime system designed specifically for massively parallel 2-D grid multiprocessors. The system exploits the network topology to minimise the paths of the message sequences that realise the shared operations. As a result, its performance increases and the system remains scalable even to very large numbers of processors. We present the basic ideas behind the 2-D optimisations, the implementation structure, and results from synthetic and application benchmarks executed on a 1024-processor Parsytec GCel. © Springer-Verlag Berlin Heidelberg 1999
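On a 2-D grid, minimising message paths typically means routing along one dimension and then the other, so a message between (x1, y1) and (x2, y2) travels exactly the Manhattan distance. The sketch below shows this standard dimension-order (XY) routing as an illustration of the kind of topology-aware path minimisation the abstract describes; it is a generic technique, not MaDCoWS's actual routing code.

```python
def xy_route(src, dst):
    """Dimension-order (XY) route on a 2-D grid: travel along the
    x dimension first, then along y. The hop count equals the
    Manhattan distance, the minimum possible on a grid."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    step = 1 if dx > x else -1
    while x != dx:                 # move along x
        x += step
        path.append((x, y))
    step = 1 if dy > y else -1
    while y != dy:                 # then move along y
        y += step
        path.append((x, y))
    return path
```

A DSM runtime can use such routes to place directory traffic so that the message sequences implementing a shared operation traverse as few links as possible.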
Performance of large-scale scientific applications on the IBM ASCI Blue-Pacific system
The IBM ASCI Blue-Pacific system is a scalable, distributed/shared-memory architecture designed to reach multi-teraflop performance. The IBM SP pieces together a large number of nodes, each with a modest number of processors. The system is designed to accommodate a mixed programming model as well as a pure message-passing paradigm. We examine a number of applications on this architecture and evaluate their performance and scalability.