1,970 research outputs found
The CBE Hardware Accelerator for Numerical Relativity: A Simple Approach
Hardware accelerators (such as the Cell Broadband Engine) have recently
received a significant amount of attention from the computational science
community because they can provide significant gains in the overall performance
of many numerical simulations at a low cost. However, such accelerators usually
employ a rather unfamiliar and specialized programming model that often
requires advanced knowledge of their hardware design. In this article, we
demonstrate an alternate and simpler approach towards managing the main
complexities in the programming of the Cell processor, called software caching.
We apply this technique to a numerical relativity application: a time-domain,
finite-difference Kerr black hole perturbation evolver, and present the
performance results. We obtain gains in the overall performance of generic
simulations that are close to the theoretical maximum that can be obtained
through our parallelization approach. Comment: 5 pages, 2 figures; Accepted for publication in the International
Journal of Modeling, Simulation, and Scientific Computing (IJMSSC)
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and Khronos consortium's open compute language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid applications
using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the application-level
programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas: partial differential equations, graph cluster
metric calculations, and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)
We present a detailed approach for making use of two new computer hardware
architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis
application (Einstein@Home). Our results suggest that both architectures
suit the application quite well and that the achievable performance, within the same
software development time-frame, is nearly identical. Comment: Accepted for publication in International Conference on Parallel
Processing and Applied Mathematics (PPAM 2009)
Parallelising wavefront applications on general-purpose GPU devices
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
High-Precision Numerical Simulations of Rotating Black Holes Accelerated by CUDA
Hardware accelerators (such as Nvidia's CUDA GPUs) have tremendous promise
for computational science, because they can deliver large gains in performance
at relatively low cost. In this work, we focus on the use of Nvidia's Tesla GPU
for high-precision (double, quadruple and octal precision) numerical
simulations in the area of black hole physics -- more specifically, solving a
partial differential equation using finite differencing. We describe our
approach in detail and present the final performance results as compared with a
single-core desktop processor and also the Cell BE. We obtain mixed results --
order-of-magnitude gains in overall performance in some cases and negligible
gains in others. Comment: 6 pages, 1 figure, 1 table, Accepted for publication in the
International Conference on High Performance Computing Systems (HPCS 2010)
Elliptic Curve Cryptography on Modern Processor Architectures
Elliptic Curve Cryptography (ECC) has been adopted by the US National Security Agency (NSA) in Suite "B" as part of its "Cryptographic Modernisation Program". Additionally, it has been favoured by an entire host of mobile devices due to its superior performance characteristics. ECC is also the building block on which the exciting field of pairing/identity-based cryptography is based. This widespread use means that there is potentially a lot to be gained by researching efficient implementations on modern processors such as IBM's Cell Broadband Engine and Philips' next-generation smart card cores. ECC operations can be thought of as a pyramid of building blocks, from instructions on a core, modular operations on a finite field, point addition & doubling, and elliptic curve scalar multiplication to application-level protocols. In this thesis we examine an implementation of these components for ECC, focusing on a range of optimising techniques for the Cell's SPU and the MIPS smart card. We show significant performance improvements that can be achieved through the adoption of EC
LU Decomposition on Cell Broadband Engine: An Empirical Study to Exploit Heterogeneous Chip Multiprocessors
To meet the needs of high performance computing, the Cell Broadband Engine has many features that differ from traditional processors, such as a large number of synergistic processor elements, large register files, and the ability to hide main-storage latency with concurrent computation and DMA transfers. Exploiting those features requires the programmer to carefully tailor programs and simultaneously deal with various performance factors, including locality, load balance, communication overhead, and multi-level parallelism. These factors, unfortunately, depend on each other; an optimization that enhances one factor may degrade another. This paper presents our experience in optimizing LU decomposition, one of the commonly used algebra kernels in scientific computing, on the Cell Broadband Engine. The optimizations exploit task-level, data-level, and communication-level parallelism. We study the effects of different task distribution strategies, prefetch, and software cache, and explore the tradeoffs among different performance factors, stressing the interactions between different optimizations. This work offers some insights into optimization on heterogeneous multi-core processors, including the selection of programming models, considerations in task distribution, and the holistic perspective required in optimizations.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets
Background: Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results: We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reduction and loop unrolling techniques, with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions: The results demonstrate that the CBE processor in a PlayStation 3 accelerates computationally intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient alternative to conventional hardware for solving computational problems in image processing and bioinformatics.