Mainstream parallel array programming on Cell
We present the E♯ compiler and runtime library for the 'F' subset of
the Fortran 95 programming language. 'F' provides first-class support for arrays,
allowing E♯ to implicitly evaluate array expressions in parallel using the SPU coprocessors
of the Cell Broadband Engine. We present performance results from
four benchmarks, all of which demonstrate absolute speedups over equivalent C or
Fortran versions running on the PPU host processor. A significant benefit of this
straightforward approach is that a serial implementation of any code is always
available, providing code longevity and a familiar development paradigm.
The PEPPHER Approach to Programmability and Performance Portability for Heterogeneous many-core Architectures
The European FP7 project PEPPHER addresses programmability and performance portability for current and emerging heterogeneous many-core architectures. As its main idea, the project proposes a multi-level parallel execution model comprising potentially parallelized components that exist in variants suitable for different types of cores, memory configurations, input characteristics, and optimization criteria, and couples this with dynamic and static resource- and architecture-aware scheduling mechanisms. Crucial to PEPPHER is that components can be made performance-aware, allowing for more efficient dynamic and static scheduling on the concrete, available resources. The flexibility provided in the software model, combined with a customizable, heterogeneous, memory- and topology-aware run-time system, is key to efficiently exploiting the resources of each concrete hardware configuration. The project takes a holistic approach, relying on existing paradigms, interfaces, and languages for the parallelization of components, and develops a prototype framework, a methodology for extending the framework, and guidelines for constructing performance-portable software and systems, including migration paths for existing software to heterogeneous many-core processors. This paper gives a high-level project overview and presents a specific example showing how the PEPPHER component variant model and resource-aware run-time system enable performance portability of a numerical kernel.
Programmability and Performance Portability Aspects of Heterogeneous Multi-/Manycore Systems
We discuss three complementary approaches that can provide both portability and an increased level of abstraction for the programming of heterogeneous multicore systems. Together, these approaches also support performance portability, as currently investigated in the EU FP7 project PEPPHER. In particular, we consider (1) a library-based approach, here represented by the integration of the SkePU C++ skeleton programming library with the StarPU runtime system for dynamic scheduling and dynamic selection of suitable execution units for parallel tasks; (2) a language-based approach, here represented by the Offload-C++ high-level language extensions and the Offload compiler, which generates platform-specific code; and (3) a component-based approach, specifically the PEPPHER component system for annotating user-level application components with performance metadata, thereby preparing them for performance-aware composition. We discuss the strengths and weaknesses of these approaches and show how they could complement each other in an integrated programming framework for heterogeneous multicore systems.
Selection of Task Implementations in the Nanos++ Runtime
New heterogeneous systems and hardware accelerators can give high performance computers higher levels of computational power. However, this does not come for free: the more heterogeneous the system, the more complex the programming task becomes in terms of resource utilization.
OmpSs is a task-based programming model and framework focused on the automatic parallelization of sequential applications. We present a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of code targeted and optimized for a particular architecture) and how the framework chooses between these versions at runtime to obtain the best performance achievable for the given application. Our results, obtained on a multi-GPU system, show that our proposal adds flexibility to the application's source code and can potentially increase application performance.
Self-adaptive OmpSs tasks in heterogeneous environments
As new heterogeneous systems and hardware accelerators appear, high performance computers can reach higher levels of computational power. Nevertheless, this does not come for free: the more heterogeneous the system, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e. pieces of code targeted and optimized for a particular architecture) and how the system can choose between these versions at runtime to obtain the best performance achievable for the given application. From the results obtained on a multi-GPU system, we show that our proposal adds flexibility to the application's source code and can potentially increase application performance.
This work has been supported by the European Commission through the ENCORE project (FP7-248647), the TERAFLUX project (FP7-249013), the TEXT project (FP7-261580), the HiPEAC-3 Network of Excellence (FP7-ICT-287759), the Intel-BSC Exascale Lab collaboration project, the Spanish Ministry of Education (CSD2007-00050 and the FPU program), the Computación de Altas Prestaciones V and VI projects (TIN2007-60625, TIN2012-34557), and the Generalitat de Catalunya (2009-SGR-980).
Automatic analysis of DMA races using model checking and k-induction
Modern multicore processors, such as the Cell Broadband Engine, achieve high performance by equipping accelerator cores with small "scratchpad" memories. The price for increased performance is higher programming complexity: the programmer must manually orchestrate data movement using direct memory access (DMA) operations. Programming with asynchronous DMA operations is error-prone, and DMA races can lead to nondeterministic bugs which are hard to reproduce and fix. We present a method for DMA race analysis in C programs. Our method works by automatically instrumenting a program with assertions modeling the semantics of a memory flow controller. The instrumented program can then be analyzed using state-of-the-art software model checkers. We show that bounded model checking is effective for detecting DMA races in buggy programs. To enable automatic verification of the correctness of instrumented programs, we present a new formulation of k-induction geared towards software, as a proof rule operating on loops. Our techniques are implemented in a tool, Scratch, which we apply to a large set of programs supplied with the IBM Cell SDK, in which we discover a previously unknown bug. Our experimental results indicate that our k-induction method performs extremely well on this problem class. To our knowledge, this marks both the first application of k-induction to software verification and the first example of software model checking in the context of heterogeneous multicore processors. © Springer Science+Business Media, LLC 2011
SHAPES: Easy and high-level memory layouts
CPU speeds have vastly exceeded those of RAM. As such, developers who aim to achieve high
performance on modern architectures will most likely need to consider how to use CPU caches
effectively, hence they will need to consider how to place data in memory so as to exploit spatial
locality and achieve high memory bandwidth.
Performing such manual memory optimisations usually sacrifices readability, maintainability,
memory safety, and object abstraction. This is further exacerbated in managed languages, such
as Java and C#, where the runtime abstracts away the memory from the developer and such
optimisations are, therefore, almost impossible.
To that end, we present in this thesis a language extension called SHAPES. SHAPES aims
to offer developers more fine-grained control over the placement of data, without sacrificing
memory safety or object abstraction, hence retaining the expressiveness and familiarity of OOP.
SHAPES introduces the concepts of pools and layouts; programmers group related objects into
pools, and specify how objects are laid out in these pools. Classes and types are annotated
with pool parameters, which allow placement aspects to be changed orthogonally to how the
business logic operates on the objects in the pool. These design decisions disentangle business
logic and memory concerns.
We provide a formal model of SHAPES, present its type and memory safety model, and its
translation into a low-level language. We present our reasoning as to why we can expect
SHAPES to be compiled in an efficient manner in terms of the runtime representation of objects
and the access to their fields.
Moreover, we present SHAPES-z, an implementation of SHAPES as an embeddable language,
and shapeszc, the compiler for SHAPES-z. We provide our design and implementation
considerations for SHAPES-z and shapeszc. Finally, we evaluate the performance of SHAPES
and SHAPES-z through case studies.