24 research outputs found

    Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

    Get PDF
    We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect

    Understanding and Mitigating Multicore Performance Issues on theAMD Opteron Architecture

    Full text link

    Performance Measurement and Analysis of Large-Scale Parallel Applications on Leadership Computing Systems

    Get PDF

    HPCC Update and Analysis

    Get PDF
    Abstract: The last year has seen significant updates in the programming environment and operating systems on the Cray X1E and Cray XT3 as well as the much anticipated release of version 1.0 of HPCC Benchmark. This paper will provide an update and analysis of the HPCC Benchmark Results for Cray XT3 and X1E as well as a comparison against historical results

    Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

    Get PDF
    Pipelined wavefront computations are an ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication, synchronisation patterns, and as a result there exist a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application, as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing, configuration; (3) exploring hardware platform design alternatives and system procurement and, (4) considering possible code and algorithmic optimisations

    Summary of multi-core hardware and programming model investigations

    Get PDF
    This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications on system software for capability supercomputers. The results of this study are being used as input into the design of a new open-source light-weight kernel operating system being targeted at future capability supercomputers made up of multi-core processors. A goal of this effort is to create an agile system that is able to adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance by the community

    Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

    Get PDF
    Pipelined wavefront computations are an ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication, synchronisation patterns, and as a result there exist a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application, as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing, configuration; (3) exploring hardware platform design alternatives and system procurement and, (4) considering possible code and algorithmic optimisations.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

    Full text link
    The demand for processing power is increasing steadily. In the past, single processor architectures clearly dominated the markets. As instruction level parallelism is limited in most applications, significant performance can only be achieved in the future by exploiting parallelism at the higher levels of thread or process parallelism. As a consequence, modern “processors” incorporate multiple processor cores that form a single shared memory multiprocessor. In such systems, high performance devices like network interface controllers are connected to processors and memory like every other input/output device over a hierarchy of peripheral interconnects. Thus, one target must be to couple coprocessors physically closer to main memory and to the processors of a computing node. This removes the overhead of today’s peripheral interconnect structures. Such a step is the direct connection of HyperTransport (HT) devices to Opteron processors, which is presented in this thesis. Also, this work analyzes how communication from a device to processors can be optimized on the protocol level. As today’s computing nodes are shared memory systems, the cache coherence protocol is the central protocol for data exchange between processors and devices. Consequently, the analysis extends to classes of devices that are cache coherence protocol aware. Also, the concept of a transfer cache is proposed in this thesis, which reduces latency significantly even for non-coherent devices. The trend to the exploitation of process and thread level parallelism leads to a steady increase of system sizes. Networks that are used in such large systems are very susceptible to both hard and transient faults. Most transient fault rates are constant per bit that is stored or transmitted. With increasing system sizes and higher clock frequencies, the number of faults in time increases drastically. In the end, the error rate may rise at a level where high level error recovery becomes too costly if lower layers do not perform error correction that is transparent to the layers above. The second part of this thesis describes a direct interconnection network that provides a reliable transport service even without the use of end-to-end protocols. Also, a novel hardware based solution for intermediate routing is developed in this thesis, which allows an efficient, deadlock free routing around faulty links
    corecore