8 research outputs found

    Load Balancing Analysis of a Parallel Hierarchical Algorithm on the Origin2000

    Conference with proceedings, without a review committee. The ccNUMA architecture of the SGI Origin2000 has been shown to perform and scale well for a wide range of scientific and engineering applications. This paper focuses on a well-known hierarchical computer graphics algorithm, wavelet radiosity, whose parallelization is made challenging by its irregular, dynamic, and unpredictable characteristics. Our previous experiments, based on a naive parallelization, showed that the Origin2000's hierarchical memory structure was well suited to the natural data locality exhibited by this hierarchical algorithm. However, our crude load balancing strategy was clearly insufficient to exploit the full power of the Origin2000. We present here a detailed load balancing analysis and then propose several enhancements, namely "lazy copy" and "lure", that greatly reduce the idle time spent on locks and synchronization barriers. The new parallel algorithm is evaluated on a 64-processor Origin2000. Even though a communication overhead has, in theory, been introduced, we show that data locality is still preserved. The final performance evaluation shows quasi-optimal behavior, at least up to the 32-processor scale. Beyond that, a trouble spot remains to be identified to explain the performance degradation observed at the 64-processor scale.

    A Scalable Distributed Shared Memory

    Parallel computers of the future will require a memory model that offers a global address space to the programmer while performing equally well under various system configurations. We present a logically shared and physically distributed memory to match both requirements. This paper introduces the memory system used in the ADAM coarse-grain dataflow machine, which preserves scalability by tolerating latency and offers programmability through its object-based structure. We show how to support data objects of arbitrary size and different access bandwidth and latency characteristics, and present a possible implementation of this model. The proposed system is evaluated by analysis of the bandwidth and latency characteristics of the three different object classes and by examination of the impact of different network topologies. Finally, we present a number of simulation results which confirm the previous analysis.

    MaDCoWS: A scalable distributed shared memory environment for massively parallel multiprocessors

    In this paper we present MaDCoWS, a software implementation of a Distributed Shared Memory (DSM) runtime system, specifically designed for massively parallel 2-D grid multiprocessors. The system takes advantage of the network topology in order to minimise the paths of the message sequences realising the shared operations. As a result its performance is increased and the system becomes scalable even to very large processor numbers. We present the basic ideas for 2-D optimisations, the implementation structure and results from synthetic and application benchmarks executed on a 1024 processor Parsytec GCel. © Springer-Verlag Berlin Heidelberg 1999