
    Using the PlayStation3 for speeding up metaheuristic optimization

    Traditional computer software is written for serial computation. To solve an optimization problem, an algorithm or metaheuristic is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit (CPU) of one computer. Parallel computing uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above. Today most commodity CPU designs include single instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Modern video game consoles and consumer computer-graphics hardware rely heavily on vector processing in their architecture. In 2000, IBM, Toshiba and Sony collaborated to create the Cell Broadband Engine (Cell BE), consisting of one traditional microprocessor (called the Power Processing Element, or PPE) and eight SIMD co-processing units, the so-called Synergistic Processor Elements (SPEs), which found use in the Sony PlayStation3 among other applications. The computational power of the Cell BE or PlayStation3 can also be used for scientific computing. Examples and applications have been reported in e.g. Kurzak et al. (2008), Bader et al. (2008), Olivier et al. (2007) and Petrini et al. (2007). In this work, the potential of using the PlayStation3 for speeding up metaheuristic optimization is investigated. More specifically, we propose an adaptation of an evolutionary algorithm with embedded simulation for inspection optimization, developed in Van Volsem et al. (2007), Van Volsem (2009a) and Van Volsem (2009b).
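
    A minimal C++ sketch of the decomposition this abstract describes: the population of an evolutionary algorithm is split into independent slices, each evaluated by its own processing element (eight workers here, mirroring the Cell BE's eight SPEs). The objective function is a hypothetical placeholder, not the inspection-optimization simulation of Van Volsem et al., and portable C++ threads stand in for SPE offload.

```cpp
// Sketch only: parallel fitness evaluation of an EA population, one
// independent slice per processing element. On the Cell BE each slice
// would be shipped to an SPE; portable threads illustrate the same idea.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for the embedded simulation that scores a
// candidate inspection policy (the real objective is in Van Volsem 2009a/b).
double evaluate(double candidate) {
    return std::sin(candidate) + candidate * candidate; // placeholder cost
}

int main() {
    std::vector<double> population(64);
    for (std::size_t i = 0; i < population.size(); ++i)
        population[i] = 0.1 * static_cast<double>(i);

    std::vector<double> fitness(population.size());
    const unsigned workers = 8; // mirrors the Cell BE's eight SPEs

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            // Each worker evaluates a disjoint strided slice.
            for (std::size_t i = w; i < population.size(); i += workers)
                fitness[i] = evaluate(population[i]);
        });
    }
    for (auto& t : pool) t.join();

    std::printf("fitness[0] = %f\n", fitness[0]);
}
```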

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data, discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a Cell BE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
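
    The central pattern this abstract reports, coarse-grained CPU threads each controlling one GPU device, can be sketched as follows. The per-device work is stubbed out in comments, since the actual kernels (and the CUDA calls that would bind each thread to its device) depend on the application; the device count is assumed for illustration.

```cpp
// Sketch only: one CPU controller thread per accelerator, each thread
// owning its device for the whole run. In a CUDA build, each worker would
// begin with cudaSetDevice(dev) before launching kernels.
#include <cstdio>
#include <thread>
#include <vector>

void run_on_device(int dev, int steps) {
    for (int s = 0; s < steps; ++s) {
        // cudaSetDevice(dev);              // bind this thread to its GPU
        // kernel<<<grid, block>>>(...);    // data-parallel work, e.g. a
        // cudaDeviceSynchronize();         // PDE stencil update
    }
    std::printf("device %d finished %d steps\n", dev, steps);
}

int main() {
    const int num_devices = 2; // assumed device count for illustration
    std::vector<std::thread> controllers;
    for (int d = 0; d < num_devices; ++d)
        controllers.emplace_back(run_on_device, d, 100);
    for (auto& t : controllers) t.join();
}
```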

    Performance of the Cell processor for biomolecular simulations

    The new Cell processor represents a turning point for computing-intensive applications. Here, I show that for molecular dynamics it is possible to reach an impressive sustained performance in excess of 30 Gflops, with a peak of 45 Gflops for the non-bonded force calculations, over one order of magnitude faster than a single-core standard processor.
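
    For readers unfamiliar with the workload, the sketch below shows the generic form of a non-bonded force calculation: a pairwise Lennard-Jones loop over all particle pairs. This is the textbook O(N^2) version with illustrative constants, not the paper's SPE-vectorised Cell implementation.

```cpp
// Sketch only: pairwise Lennard-Jones forces for U(r) = 4*eps*((s/r)^12 - (s/r)^6).
// The force on i from j is 24*eps*(2*(s/r)^12 - (s/r)^6)/r^2 * (r_i - r_j).
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

int main() {
    const double eps = 1.0, sigma = 1.0;  // illustrative constants
    std::vector<Vec3> pos = {{0, 0, 0}, {1.2, 0, 0}, {0, 1.5, 0}};
    std::vector<Vec3> force(pos.size(), Vec3{0, 0, 0});

    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = i + 1; j < pos.size(); ++j) {
            double dx = pos[i].x - pos[j].x;
            double dy = pos[i].y - pos[j].y;
            double dz = pos[i].z - pos[j].z;
            double r2  = dx * dx + dy * dy + dz * dz;
            double sr2 = sigma * sigma / r2;
            double sr6 = sr2 * sr2 * sr2;
            double f   = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2;
            force[i].x += f * dx;  force[j].x -= f * dx;  // Newton's third law
            force[i].y += f * dy;  force[j].y -= f * dy;
            force[i].z += f * dz;  force[j].z -= f * dz;
        }
    }
    std::printf("F0 = (%f, %f, %f)\n", force[0].x, force[0].y, force[0].z);
}
```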

    Parallelising wavefront applications on general-purpose GPU devices

    Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres. This paper investigates the viability of graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). We identify the optimisations suitable for this new architecture and quantify the characteristics of those wavefront codes that are likely to experience speedups.
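
    What makes a wavefront code both hard and possible to parallelise is its dependency structure: each cell (i,j) depends on (i-1,j) and (i,j-1), so all cells on an anti-diagonal are independent once the previous diagonal is complete. A minimal serial C++ sketch of that structure follows, with the loop a CUDA port would parallelise marked; the update rule is a placeholder.

```cpp
// Sketch only: the wavefront sweep pattern. Cells within one anti-diagonal
// have no mutual dependencies, so a GPU version assigns one thread per cell
// of the current diagonal.
#include <cstdio>
#include <vector>

int main() {
    const int N = 6;
    std::vector<std::vector<double>> g(N, std::vector<double>(N, 1.0));

    // Sweep anti-diagonals d = i + j from 2 to 2(N-1).
    for (int d = 2; d <= 2 * (N - 1); ++d) {
        for (int i = 1; i < N; ++i) {      // this loop is the parallel region
            int j = d - i;
            if (j < 1 || j >= N) continue;
            g[i][j] = 0.5 * (g[i - 1][j] + g[i][j - 1]); // placeholder update
        }
    }
    std::printf("g[%d][%d] = %f\n", N - 1, N - 1, g[N - 1][N - 1]);
}
```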

    Ianus: an Adaptive FPGA Computer

    Dedicated machines designed for specific computational algorithms can outperform conventional computers by several orders of magnitude. In this note we describe Ianus, a new-generation FPGA-based machine, and its basic features: hardware integration and wide reprogrammability. Our goal is to build a machine that can fully exploit the performance potential of new-generation FPGA devices. We also plan a software platform which simplifies its programming, in order to extend its intended range of application to a wide class of interesting and computationally demanding problems. The decision to develop a dedicated processor is a complex one, involving careful assessment of its performance lead over traditional computers during its expected lifetime, taking into account their performance increase as predicted by Moore's law. We discuss this point in detail.

    Count three for wearable computers

    A description of the 'ubiquitous computer' is presented. Ubiquitous computers imply portable computers embedded into everyday objects, which would replace personal computers. Ubiquitous computers can be mapped into a three-tier scheme, differentiated by processor performance and flexibility of function. The power consumption of mobile devices is one of the most important design considerations. The size of a wearable system is often a design limitation.

    The CBE Hardware Accelerator for Numerical Relativity: A Simple Approach

    Hardware accelerators (such as the Cell Broadband Engine) have recently received a significant amount of attention from the computational science community because they can provide significant gains in the overall performance of many numerical simulations at a low cost. However, such accelerators usually employ a rather unfamiliar and specialized programming model that often requires advanced knowledge of their hardware design. In this article, we demonstrate an alternate and simpler approach towards managing the main complexities in the programming of the Cell processor, called software caching. We apply this technique to a numerical relativity application: a time-domain, finite-difference Kerr black hole perturbation evolver, and present the performance results. We obtain gains in the overall performance of generic simulations that are close to the theoretical maximum that can be obtained through our parallelization approach.
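
    A rough sketch of the software-caching idea, under assumed names and sizes rather than the article's actual code: reads from main memory go through a small direct-mapped cache that stands in for the SPE's explicitly managed local store, with a plain memcpy standing in for the DMA transfer.

```cpp
// Sketch only: a direct-mapped software cache. On the Cell, `cache` would
// live in an SPE's local store and the memcpy would be an async DMA get.
#include <cstdio>
#include <cstring>
#include <vector>

constexpr int LINE  = 16;  // doubles per cache line (illustrative)
constexpr int LINES = 8;   // lines resident in "local store" (illustrative)

std::vector<double> main_mem(4096, 3.14);  // stands in for main memory
double cache[LINES][LINE];                 // stands in for SPE local store
long   tags[LINES] = {-1, -1, -1, -1, -1, -1, -1, -1};

// Return element `idx` of main memory, fetching its line on a miss.
double cached_read(long idx) {
    long line = idx / LINE;
    int  slot = static_cast<int>(line % LINES);  // direct-mapped placement
    if (tags[slot] != line) {                    // miss: "DMA" the line in
        std::memcpy(cache[slot], &main_mem[line * LINE], LINE * sizeof(double));
        tags[slot] = line;
    }
    return cache[slot][idx % LINE];
}

int main() {
    double sum = 0.0;
    for (long i = 0; i < 256; ++i) sum += cached_read(i);
    std::printf("sum = %f\n", sum);  // expect 256 * 3.14
}
```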

    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node, and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
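
    Device fission, the OpenCL 1.2 feature explored at the end of this abstract, partitions one device into sub-devices that can each be given their own context and command queue. A minimal host-side sketch (error handling trimmed, compute-unit count illustrative) might look like this:

```cpp
// Sketch only: partition an OpenCL CPU device into equal sub-devices.
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);

    // Ask for sub-devices of 4 compute units each (OpenCL 1.2).
    cl_device_partition_property props[] = {CL_DEVICE_PARTITION_EQUALLY, 4, 0};
    cl_device_id sub[8];
    cl_uint      n = 0;
    if (clCreateSubDevices(device, props, 8, sub, &n) != CL_SUCCESS) {
        std::printf("device fission not supported here\n");
        return 1;
    }
    std::printf("created %u sub-devices, each eligible for its own queue\n", n);
}
```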