
    High level algorithmic auto-tuning for scientific applications

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 102-107). By Cy P. Chan.
    In this thesis, we describe a new classification of auto-tuning methodologies spanning from low-level optimizations to high-level algorithmic tuning. This classification spectrum of auto-tuning methods encompasses the space of tuning parameters from low-level optimizations (such as block sizes, iteration ordering, vectorization, etc.) to high-level algorithmic choices (such as whether to use an iterative solver or a direct solver). We present and analyze four novel auto-tuning systems that incorporate several techniques that fall along a spectrum from the low-level to the high-level: i) a multiplatform, auto-tuning parallel code generation framework for generalized stencil loops, ii) an auto-tunable algorithm for solving dense triangular systems, iii) an auto-tunable multigrid solver for sparse linear systems, and iv) tuned statistical regression techniques for fine-tuning wind forecasts and resource estimations to assist in the integration of wind resources into the electrical grid. We also include a project assessment report for a wind turbine installation for the City of Cambridge to highlight an area of application (wind prediction and resource assessment) where these computational auto-tuning techniques could prove useful in the future.

    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.

    Simplifying the Development, Use and Sustainability of HPC Software

    Developing software to undertake complex, compute-intensive scientific processes requires a challenging combination of both specialist domain knowledge and software development skills to convert this knowledge into efficient code. As computational platforms become increasingly heterogeneous and newer types of platform such as Infrastructure-as-a-Service (IaaS) cloud computing become more widely accepted for HPC computations, scientists require more support from computer scientists and resource providers to develop efficient code and make optimal use of the resources available to them. As part of the libhpc stage 1 and 2 projects we are developing a framework to provide a richer means of job specification and efficient execution of complex scientific software on heterogeneous infrastructure. The use of such frameworks has implications for the sustainability of scientific software. In this paper we set out our developing understanding of these challenges based on work carried out in the libhpc project.
    Comment: 4 page position paper, submission to WSSSPE13 workshop

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.
    Comment: Heterogeneity in Computing Workshop (2014)