1,970 research outputs found

    The CBE Hardware Accelerator for Numerical Relativity: A Simple Approach

    Full text link
    Hardware accelerators (such as the Cell Broadband Engine) have recently received a significant amount of attention from the computational science community because they can provide significant gains in the overall performance of many numerical simulations at a low cost. However, such accelerators usually employ a rather unfamiliar and specialized programming model that often requires advanced knowledge of their hardware design. In this article, we demonstrate an alternate and simpler approach towards managing the main complexities in the programming of the Cell processor, called software caching. We apply this technique to a numerical relativity application: a time-domain, finite-difference Kerr black hole perturbation evolver, and present the performance results. We obtain gains in the overall performance of generic simulations that are close to the theoretical maximum that can be obtained through our parallelization approach.Comment: 5 pages, 2 figures; Accepted for publication in the International Journal of Modeling, Simulation, and Scientific Computing (IJMSSC

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)

    Full text link
    We present a detailed approach for making use of two new computer hardware architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis application (Einstein@Home). Our results suggest that both the architectures suit the application quite well and the achievable performance in the same software developmental time-frame, is nearly identical.Comment: Accepted for publication in International Conference on Parallel Processing and Applied Mathematics (PPAM 2009

    Parallelising wavefront applications on general-purpose GPU devices

    Get PDF
    Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups

    High-Precision Numerical Simulations of Rotating Black Holes Accelerated by CUDA

    Full text link
    Hardware accelerators (such as Nvidia's CUDA GPUs) have tremendous promise for computational science, because they can deliver large gains in performance at relatively low cost. In this work, we focus on the use of Nvidia's Tesla GPU for high-precision (double, quadruple and octal precision) numerical simulations in the area of black hole physics -- more specifically, solving a partial-differential-equation using finite-differencing. We describe our approach in detail and present the final performance results as compared with a single-core desktop processor and also the Cell BE. We obtain mixed results -- order-of-magnitude gains in overall performance in some cases and negligible gains in others.Comment: 6 pages, 1 figure, 1 table, Accepted for publication in the International Conference on High Performance Computing Systems (HPCS 2010

    Elliptic Curve Cryptography on Modern Processor Architectures

    Get PDF
    Abstract Elliptic Curve Cryptography (ECC) has been adopted by the US National Security Agency (NSA) in Suite "B" as part of its "Cryptographic Modernisation Program ". Additionally, it has been favoured by an entire host of mobile devices due to its superior performance characteristics. ECC is also the building block on which the exciting field of pairing/identity based cryptography is based. This widespread use means that there is potentially a lot to be gained by researching efficient implementations on modern processors such as IBM's Cell Broadband Engine and Philip's next generation smart card cores. ECC operations can be thought of as a pyramid of building blocks, from instructions on a core, modular operations on a finite field, point addition & doubling, elliptic curve scalar multiplication to application level protocols. In this thesis we examine an implementation of these components for ECC focusing on a range of optimising techniques for the Cell's SPU and the MIPS smart card. We show significant performance improvements that can be achieved through of adoption of EC

    LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors

    Get PDF
    To meet the needs of high performance computing, the Cell Broadband Engine owns many features that differ from traditional processors, such as the large number of synergistic processor elements, large register files, the ability to hide main-storage latency with concurrent computation and DMA transfers. The exploitation of those features requires the programmer to carefully tailor programs and simutaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, are dependent on each other; an optimization that enhances one factor may degrade another. This paper presents our experience on optimizing LU decomposition, one of the commonly used algebra kernels in scientific computing, on Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetch, and software cache, and explore the tradeoff among different performance factors, stressing the interactions between different optimizations. This work offers some insights in the optimizations on heterogenous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimizations

    Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks.</p> <p>Results</p> <p>We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from <url>http://cbe.ipk-gatersleben.de</url>.</p> <p>Conclusions</p> <p>The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.</p
    corecore