1,354 research outputs found

    Extending OmpSs-2 with flexible task-based array reductions

    Get PDF
    Reductions are a well-known computational pattern found in scientific applications that needs efficient parallelisation mechanisms. In this thesis we present a flexible scheme for computing reductions of arrays in the context of OmpSs-2, a task-based programming model similar to OpenMP

    C Language Extensions for Hybrid CPU/GPU Programming with StarPU

    Get PDF
    Modern platforms used for high-performance computing (HPC) include machines with both general-purpose CPUs, and "accelerators", often in the form of graphical processing units (GPUs). StarPU is a C library to exploit such platforms. It provides users with ways to define "tasks" to be executed on CPUs or GPUs, along with the dependencies among them, and by automatically scheduling them over all the available processing units. In doing so, it also relieves programmers from the need to know the underlying architecture details: it adapts to the available CPUs and GPUs, and automatically transfers data between main memory and GPUs as needed. While StarPU's approach is successful at addressing run-time scheduling issues, being a C library makes for a poor and error-prone programming interface. This paper presents an effort started in 2011 to promote some of the concepts exported by the library as C language constructs, by means of an extension of the GCC compiler suite. Our main contribution is the design and implementation of language extensions that map to StarPU's task programming paradigm. We argue that the proposed extensions make it easier to get started with StarPU,eliminate errors that can occur when using the C library, and help diagnose possible mistakes. We conclude on future work

    Characterizing traits of coordination

    Full text link
    How can one recognize coordination languages and technologies? As this report shows, the common approach that contrasts coordination with computation is intellectually unsound: depending on the selected understanding of the word "computation", it either captures too many or too few programming languages. Instead, we argue for objective criteria that can be used to evaluate how well programming technologies offer coordination services. Of the various criteria commonly used in this community, we are able to isolate three that are strongly characterizing: black-box componentization, which we had identified previously, but also interface extensibility and customizability of run-time optimization goals. These criteria are well matched by Intel's Concurrent Collections and AstraKahn, and also by OpenCL, POSIX and VMWare ESX.Comment: 11 pages, 3 table

    Actors: The Ideal Abstraction for Programming Kernel-Based Concurrency

    Get PDF
    GPU and multicore hardware architectures are commonly used in many different application areas to accelerate problem solutions relative to single CPU architectures. The typical approach to accessing these hardware architectures requires embedding logic into the programming language used to construct the application; the two primary forms of embedding are: calls to API routines to access the concurrent functionality, or pragmas providing concurrency hints to a language compiler such that particular blocks of code are targeted to the concurrent functionality. The former approach is verbose and semantically bankrupt, while the success of the latter approach is restricted to simple, static uses of the functionality. Actor-based applications are constructed from independent, encapsulated actors that interact through strongly-typed channels. This paper presents a first attempt at using actors to program kernels targeted at such concurrent hardware. Besides the glove-like fit of a kernel to the actor abstraction, quantitative code analysis shows that actor-based kernels are always significantly simpler than API-based coding, and generally simpler than pragma-based coding. Additionally, performance measurements show that the overheads of actor-based kernels are commensurate to API-based kernels, and range from equivalent to vastly improved for pragma-based annotations, both for sample and real-world applications

    Extending OmpSs for OpenCL kernel co-execution in heterogeneous systems

    Get PDF
    © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Heterogeneous systems have a very high potential performance but present difficulties in their programming. OmpSs is a well known framework for task based parallel applications, which is an interesting tool to simplify the programming of these systems. However, it does not support the co-execution of a single OpenCL kernel instance on several compute devices. To overcome this limitation, this paper presents an extension of the OmpSs framework that solves two main objectives: the automatic division of datasets among several devices and the management of their memory address spaces. To adapt to different kinds of applications, the data division can be performed by the novel HGuided load balancing algorithm or by the well known Static and Dynamic. All this is accomplished with negligible impact on the programming. Experimental results reveal that there is always one load balancing algorithm that improves the performance and energy consumption of the system.This work has been supported by the University of Cantabria with grant CVE-2014-18166, the Generalitat de Catalunya under grant 2014-SGR-1051, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2016- 76635-C2-2-R (AEI/FEDER, UE) and TIN2015-65316-P. The Spanish Government through the Programa Severo Ochoa (SEV-2015-0493). The European Research Council under grant agreement No 321253 European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc Projects, grant agreement n 288777, 610402 and 671697 and the European HiPEAC Network.Peer ReviewedPostprint (published version

    A Domain-Specific Language and Editor for Parallel Particle Methods

    Full text link
    Domain-specific languages (DSLs) are of increasing importance in scientific high-performance computing to reduce development costs, raise the level of abstraction and, thus, ease scientific programming. However, designing and implementing DSLs is not an easy task, as it requires knowledge of the application domain and experience in language engineering and compilers. Consequently, many DSLs follow a weak approach using macros or text generators, which lack many of the features that make a DSL a comfortable for programmers. Some of these features---e.g., syntax highlighting, type inference, error reporting, and code completion---are easily provided by language workbenches, which combine language engineering techniques and tools in a common ecosystem. In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL and development environment for numerical simulations based on particle methods and hybrid particle-mesh methods. PPME uses the meta programming system (MPS), a projectional language workbench. PPME is the successor of the Parallel Particle-Mesh Language (PPML), a Fortran-based DSL that used conventional implementation strategies. We analyze and compare both languages and demonstrate how the programmer's experience can be improved using static analyses and projectional editing. Furthermore, we present an explicit domain model for particle abstractions and the first formal type system for particle methods.Comment: Submitted to ACM Transactions on Mathematical Software on Dec. 25, 201

    Using the High Productivity Language Chapel to Target GPGPU Architectures

    Get PDF
    It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.NSF CCF 0702260Cray Inc. Cray-SRA-2010-016962010-2011 Nvidia Research Fellowshipunpublishednot peer reviewe

    Dynamic Model Averaging for Practitioners in Economics and Finance: The eDMA Package

    Get PDF
    Raftery, Karny, and Ettler (2010) introduce an estimation technique, which they refer to as Dynamic Model Averaging (DMA). In their application, DMA is used to predict the output strip thickness for a cold rolling mill, where the output is measured with a time delay. Recently, DMA has also shown to be useful in macroeconomic and financial applications. In this paper, we present the eDMA package for DMA estimation implemented in R. The eDMA package is especially suited for practitioners in economics and finance, where typically a large number of predictors are available. Our implementation is up to 133 times faster then a standard implementation using a single-core CPU. Thus, with the help of this package, practitioners are able to perform DMA on a standard PC without resorting to large clusters, which are not easily available to all researchers. We demonstrate the usefulness of this package through simulation experiments and an empirical application using quarterly U.S. inflation data.Comment: 21 pages, 5 figures, 2 table
    • …
    corecore