7 research outputs found

    Mainstream parallel array programming on cell

    We present the E♯ compiler and runtime library for the ‘F’ subset of the Fortran 95 programming language. ‘F’ provides first-class support for arrays, allowing E♯ to implicitly evaluate array expressions in parallel using the SPU coprocessors of the Cell Broadband Engine. We present performance results from four benchmarks that all demonstrate absolute speedups over equivalent ‘C’ or Fortran versions running on the PPU host processor. A significant benefit of this straightforward approach is that a serial implementation of any code is always available, providing code longevity and a familiar development paradigm.

    Array languages and the N-body problem

    This paper describes the contributions to the SICSA multicore challenge on many-body planetary simulation made by a compiler group at the University of Glasgow. Our group is part of the Computer Vision and Graphics research group, and we have for some years been developing array compilers because we think these are a good tool both for expressing graphics algorithms and for exploiting the parallelism that computer vision applications require. We shall describe experiments using two languages on two different platforms, and we shall compare their performance with reference C implementations running on the same platforms. Finally, we shall draw conclusions both about the viability of the array language approach as compared to other approaches used in the challenge and about the strengths and weaknesses of the two very different processor architectures we used.

    Automatic analysis of DMA races using model checking and k-induction

    Modern multicore processors, such as the Cell Broadband Engine, achieve high performance by equipping accelerator cores with small "scratch-pad" memories. The price for increased performance is higher programming complexity: the programmer must manually orchestrate data movement using direct memory access (DMA) operations. Programming using asynchronous DMA operations is error-prone, and DMA races can lead to nondeterministic bugs which are hard to reproduce and fix. We present a method for DMA race analysis in C programs. Our method works by automatically instrumenting a program with assertions modeling the semantics of a memory flow controller. The instrumented program can then be analyzed using state-of-the-art software model checkers. We show that bounded model checking is effective for detecting DMA races in buggy programs. To enable automatic verification of the correctness of instrumented programs, we present a new formulation of k-induction geared towards software, as a proof rule operating on loops. Our techniques are implemented as a tool, Scratch, which we apply to a large set of programs supplied with the IBM Cell SDK, in which we discover a previously unknown bug. Our experimental results indicate that our k-induction method performs extremely well on this problem class. To our knowledge, this marks both the first application of k-induction to software verification, and the first example of software model checking in the context of heterogeneous multicore processors. © Springer Science+Business Media, LLC 2011

    Monte Carlo Simulations of Spin Glasses on Cell Broadband Engine

    Several large-scale computational scientific problems require high-end computing systems to be solved. In recent years, multi-core architectures have delivered tens or hundreds of Gflops of peak computing performance on a single chip, with high power-dissipation efficiency, making available computational power previously found only on high-end multi-processor systems. The aim of this Ph.D. thesis is to study the capability of multi-core processors for scientific programming, analyzing sustained performance and issues related to multi-core programming, data distribution, and synchronization, in order to define a set of guidelines for optimizing scientific applications for this class of architectures. As an example of a multi-core processor, we consider the Cell Broadband Engine (CBE), developed by Sony, IBM and Toshiba. The CBE is one of the most powerful multi-core CPUs currently available, integrating eight cores and delivering a peak performance of 200 Gflops in single precision and 100 Gflops in double precision. As a case study, we analyze the performance of the CBE for Monte Carlo simulations of the Edwards-Anderson spin glass model, a paradigm in theoretical and condensed matter physics, used to describe complex systems characterized by phase transitions (such as the para-ferro transition in magnets) or to model “frustrated” dynamics. We describe several strategies for distributing the data set among on-chip and off-chip memories and propose analytic models to find the balance between computation time and memory access time as a function of both algorithmic and architectural parameters. We use the analytic models to set the parameters of the algorithm, such as the size of data structures and the scheduling of operations, to optimize the execution of Monte Carlo spin glass simulations on the CBE architecture.

    A new parallelisation technique for heterogeneous CPUs

    Parallelization has moved in recent years into mainstream compilers, and the demand for parallelizing tools that can do a better job of automatic parallelization is higher than ever. During the last decade, considerable attention has been focused on developing programming tools that support both explicit and implicit parallelism to keep up with the power of the new multiple core technology. Yet success in developing automatic parallelising compilers has been limited, mainly due to the complexity of the analytic process required to exploit available parallelism and manage other parallelisation measures such as data partitioning, alignment and synchronization. This dissertation investigates developing a programming tool that automatically parallelises large data structures on a heterogeneous architecture, and whether a high-level programming language compiler can use this tool to exploit implicit parallelism and make use of the performance potential of modern multicore technology. The work involved the development of a fully automatic parallelisation tool, called VSM, that completely hides the underlying details of general purpose heterogeneous architectures. The VSM implementation provides direct and simple access for users to parallelise array operations on the Cell’s accelerators without the need for any annotations or process directives. This work also involved the extension of the Glasgow Vector Pascal compiler to work with the VSM implementation as a single compiler system. The developed compiler system, which is called VP-Cell, takes a single source code and parallelises array expressions automatically. Several experiments were conducted using Vector Pascal benchmarks to show the validity of the VSM approach. The VP-Cell system achieved significant runtime performance on one accelerator as compared to the master processor’s performance and near-linear speedups over code runs on the Cell’s accelerators. Though VSM was mainly designed for developing parallelising compilers, it also showed considerable performance when running C code on the Cell’s accelerators.

    Design and implementation of an array language for computational science on a heterogeneous multicore architecture

    The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds and huge latency to main memory. Such chips tackle this memory wall by the provision of addressable caches, increased bandwidth to main memory, and fast thread context switching. An associated cost is often reduced functionality of the individual accelerator cores and increased complexity in their programming. This dissertation investigates the application of a programming language supporting the first-class use of arrays, and capable of automatically parallelising array expressions, to the heterogeneous multicore domain of the CBE, as found in the Sony PlayStation 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E♯, is developed to support this aim, and written in the Haskell programming language. The output of the compiler is in an extended C++ dialect known as Offload C++, which targets the PS3. A significant feature of this language is its use of multiple, statically typed address spaces. By focusing on generic, polymorphic interfaces for both the generated and hand-constructed code, a number of interesting design patterns relating to memory locality are introduced. A suite of medium-sized (100-700 lines), real-world benchmark programs is used to evaluate the performance, correctness, and scalability of the compiler technology. Absolute speedup values, well in excess of one, are observed for all of the programs. The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high performance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable and competitively performing serial programs.

    Methoden und Werkzeuge zum Einsatz von rekonfigurierbaren Akzeleratoren in Mehrkernsystemen (Methods and tools for the use of reconfigurable accelerators in multicore systems)

    Computing systems with multicore processors are frequently extended with a reconfigurable accelerator such as an FPGA. Moving parts of an application into hardware is usually carried out by specialists. To enable users to program reconfigurable hardware themselves, my contribution is component-based programming and use with automatic consideration of data locality. In this way, even data-intensive applications can benefit from the accelerators.