Using the PlayStation3 for speeding up metaheuristic optimization
Traditional computer software is written for serial computation. To solve an optimization problem, an algorithm or metaheuristic is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit (CPU) on one computer.
Parallel computing uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above.
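The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the paper: the toy objective `sphere`, the helpers `split`, `evaluate_chunk` and `evaluate_parallel`, and the worker count are all hypothetical. A thread pool is used to keep the sketch short; in CPython, a process pool would normally be needed for a real speed-up on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def sphere(candidate):
    """Toy objective (hypothetical): sum of squared coordinates."""
    return sum(x * x for x in candidate)


def split(items, parts):
    """Split a list into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def evaluate_chunk(chunk):
    """Work done by one processing element: score its own candidates."""
    return [sphere(c) for c in chunk]


def evaluate_parallel(population, workers=4):
    """Evaluate all candidates; the chunks do not depend on each other."""
    chunks = split(population, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(evaluate_chunk, chunks)  # keeps chunk order
    return [score for chunk_scores in results for score in chunk_scores]


population = [[float(i), float(i + 1)] for i in range(10)]
scores = evaluate_parallel(population)
```

Because each chunk is scored without reading any other chunk, the result is the same as a serial evaluation; only the order of execution changes.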
Today most commodity CPU designs include single instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Modern video game consoles and consumer computer-graphics hardware rely heavily on vector processing in their architecture. In 2000, IBM, Toshiba and Sony collaborated to create the Cell Broadband Engine (Cell BE), consisting of one traditional microprocessor (called the Power Processing Element or PPE) and eight SIMD co-processing units, the so-called Synergistic Processor Elements (SPEs), which found use in the Sony PlayStation3 among other applications.
The computational power of the Cell BE or PlayStation3 can also be used for scientific computing. Examples and applications have been reported in, e.g., Kurzak et al. (2008), Bader et al. (2008), Olivier et al. (2007) and Petrini et al. (2007).
In this work, the potential of using the PlayStation3 for speeding up metaheuristic optimization is investigated.
More specifically, we propose an adaptation of an evolutionary algorithm with embedded simulation for inspection optimization, developed in Van Volsem et al. (2007), Van Volsem (2009a) and Van Volsem (2009b).
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications
using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
Performance of the Cell processor for biomolecular simulations
The new Cell processor represents a turning point for computing-intensive
applications. Here, I show that for molecular dynamics it is possible to reach
an impressive sustained performance in excess of 30 Gflops with a peak of 45
Gflops for the non-bonded force calculations, over one order of magnitude
faster than a standard single-core processor.
Parallelising wavefront applications on general-purpose GPU devices
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
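As a rough illustration of what makes a code a wavefront code, the sketch below fills a grid in which each cell depends on its upper and left neighbours. The recurrence and the function name `wavefront_fill` are assumptions chosen for illustration, not taken from the paper. Cells on the same anti-diagonal do not depend on each other, so the inner loop is the part that a GPU could in principle execute in parallel.

```python
def wavefront_fill(rows, cols):
    """Fill a grid where cell (i, j) depends on its upper and left neighbours.

    Hypothetical recurrence: grid[i][j] = grid[i-1][j] + grid[i][j-1],
    with the first row and first column fixed to 1.
    """
    grid = [[1] * cols for _ in range(rows)]
    # Sweep anti-diagonals d = i + j over the interior cells (i >= 1, j >= 1).
    for d in range(2, rows + cols - 1):
        i_min = max(1, d - (cols - 1))
        i_max = min(rows - 1, d - 1)
        # All cells on one anti-diagonal are independent of each other,
        # so this inner loop is the candidate for parallel execution.
        for i in range(i_min, i_max + 1):
            j = d - i
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid


grid = wavefront_fill(3, 3)
```

The available parallelism grows and then shrinks as the sweep crosses the grid, which is one reason the speedup of such codes depends on problem size.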
Ianus: an Adaptive FPGA Computer
Dedicated machines designed for specific computational algorithms can
outperform conventional computers by several orders of magnitude. In this note
we describe Ianus, a new generation FPGA based machine and its basic
features: hardware integration and wide reprogrammability. Our goal is to build
a machine that can fully exploit the performance potential of new generation
FPGA devices. We also plan a software platform which simplifies its
programming, in order to extend its intended range of application to a wide
class of interesting and computationally demanding problems. The decision to
develop a dedicated processor is a complex one, involving careful assessment of
its performance lead, during its expected lifetime, over traditional computers,
taking into account their performance increase, as predicted by Moore's law. We
discuss this point in detail.
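The trade-off mentioned above can be made concrete with a back-of-the-envelope calculation. This is a simplified sketch, not the authors' analysis: it assumes the dedicated machine keeps a fixed performance while conventional computers double in performance every `doubling_years` years, so an initial speedup factor S is used up after doubling_years × log2(S) years. The function name and the default doubling period are assumptions.

```python
import math


def lead_time_years(speedup, doubling_years=1.5):
    """Years until conventional machines catch up with a dedicated one.

    Assumes (hypothetically) a fixed-performance dedicated machine and
    conventional performance that doubles every `doubling_years` years.
    """
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    return doubling_years * math.log2(speedup)


# Example: a 1000x initial lead with an assumed 1.5-year doubling period.
lead = lead_time_years(1000, doubling_years=1.5)
```

Under these assumptions, even a lead of three orders of magnitude lasts on the order of fifteen years, and a lead of one order of magnitude only about five.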
Count three for wearable computers
This paper is a postprint of a paper submitted to and accepted for publication in the Proceedings of the IEE Eurowearable 2003 Conference, and is subject to Institution of Engineering and Technology Copyright. The copy of record is available at the IET Digital Library.
A revised version of this paper was also published in Electronics Systems and Software, also subject to Institution of Engineering and Technology Copyright. The copy of record is also available at the IET Digital Library.
A description of 'ubiquitous computer' is presented. Ubiquitous computers imply portable computers embedded into everyday objects, which would replace personal computers. Ubiquitous computers can be mapped into a three-tier scheme, differentiated by processor performance and flexibility of function. The power consumption of mobile devices is one of the most important design considerations. The size of a wearable system is often a design limitation.
The CBE Hardware Accelerator for Numerical Relativity: A Simple Approach
Hardware accelerators (such as the Cell Broadband Engine) have recently
received a significant amount of attention from the computational science
community because they can provide significant gains in the overall performance
of many numerical simulations at a low cost. However, such accelerators usually
employ a rather unfamiliar and specialized programming model that often
requires advanced knowledge of their hardware design. In this article, we
demonstrate an alternate and simpler approach towards managing the main
complexities in the programming of the Cell processor, called software caching.
We apply this technique to a numerical relativity application: a time-domain,
finite-difference Kerr black hole perturbation evolver, and present the
performance results. We obtain gains in the overall performance of generic
simulations that are close to the theoretical maximum that can be obtained
through our parallelization approach.
Comment: 5 pages, 2 figures; accepted for publication in the International
Journal of Modeling, Simulation, and Scientific Computing (IJMSSC)
An investigation of the performance portability of OpenCL
This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.