Phase-based Tuning for Better Utilized Multicores

Author: Rajan Hridesh
Sondag Tyler
Publication venue: Iowa State University Digital Repository
Publication date: 23/01/2009

The latest trend towards performance asymmetry among cores on a single chip of a multicore processor is posing new software engineering challenges for developers. A key challenge is that for effective utilization of these performance-asymmetric multicore processors, code sections of a program must be assigned to cores such that the resource needs of a section closely match resource availability at the assigned core. Determining this assignment manually is tedious and error-prone, and it significantly complicates software development. We contribute a transparent and fully automatic program analysis, which we call phase-based tuning, to solve this problem. Phase-based tuning adapts an application to effectively utilize performance-asymmetric cores of a processor. Our technique does not require any changes in the compiler or operating system, so it is easy to deploy in existing tool chains. It does not require any input from the programmer except the application. Furthermore, it is independent of the characteristics (performance asymmetry) of the target multicore processor, which has two benefits. First, it avoids the need to create multiple customizations of the binary for each target architecture, and second, it relieves the programmer of the burden of anticipating the target architecture. Last but not least, our technique significantly improves performance. Compared to the stock Linux scheduler, our best technique shows 36% average process speedup, while maintaining fairness and with negligible overheads.

Digital Repository @ Iowa State University (ISU)

Building scalable software systems in the multicore era

Author: Rajan Hridesh
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2010

Software systems must face two challenges today: growing complexity and increasing parallelism in the underlying computational models. The problem of increased complexity is often solved by dividing systems into modules in a way that permits analysis of these modules in isolation. The problem of lack of concurrency is often tackled by dividing system execution into tasks in a way that permits execution of these tasks in isolation. The key challenge in software design is to manage the explicit and implicit dependence between modules that decreases modularity. The key challenge for concurrency is to manage the explicit and implicit dependence between tasks that decreases parallelism. Even though these challenges appear to be strikingly similar, current software design practices and languages do not take advantage of this similarity. The net effect is that the modularity and concurrency goals are often tackled mutually exclusively. Making progress towards one goal does not naturally contribute towards the other. My position is that for programmers who are not formally and rigorously trained in the concurrency discipline, the safest and most productive way to get scalability in their software is by improving the modularity of their software using programming language features and design practices that reconcile modularity and concurrency goals. I briefly discuss preliminary efforts of my group, but we have only touched the tip of the iceberg.

Digital Repository @ Iowa State University (ISU)

Crossref

Autotuning and Self-Adaptability in Concurrency Libraries

Author: Guckes Christopher
Karcher Thomas
Tichy Walter F.
Publication date: 12/05/2014

Autotuning is an established technique for optimizing the performance of parallel applications. However, programmers must prepare applications for autotuning, which is tedious and error-prone coding work. We demonstrate how applications become ready for autotuning with few or no modifications by extending Threading Building Blocks (TBB), a library for parallel programming, with autotuning. The extended TBB library optimizes all application-independent tuning parameters fully automatically. We compare manual effort, autotuning overhead and performance gains on 17 examples. While some examples benefit only slightly, others speed up by 28% over standard TBB. Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281)

arXiv.org e-Print Archive

KITopen

Phase-based tuning for better utilized performance-asymmetric multicores

Author: Sondag Tyler
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2009

The latest trend towards performance asymmetry among cores on a single chip of a multicore processor is posing new software engineering challenges for developers. A key challenge is that for effective utilization of these performance-asymmetric multicore processors, application threads must be assigned to cores such that the resource needs of a thread closely match resource availability at the assigned core. Determining this assignment manually is tedious and error-prone, and it significantly complicates software development. We contribute a transparent and fully automatic program analysis, which we call phase-guided tuning, to solve this problem. Phase-guided tuning adapts an application to effectively utilize performance-asymmetric cores of a processor. Our technique does not require any changes in the compiler or operating system, so it is easy to deploy in existing tool chains. It does not require any input from the programmer except the application. Furthermore, it is independent of the characteristics (performance asymmetry) of the target multicore processor, which has two benefits. First, it avoids the need to create multiple customizations of the binary for each target architecture, and second, it relieves the programmer of the burden of anticipating the target architecture. Last but not least, our technique significantly improves performance. Compared to the stock Linux scheduler, our best technique shows 215% improvement in throughput and 36% average process speedup, while maintaining fairness and with negligible overheads.

Digital Repository @ Iowa State University (ISU)

GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs

Author: Han G
Wang CL
Zhang C
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2013

Recently GPUs have risen as one important parallel platform for general purpose applications, both in HPC and cloud environments. Due to the special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming efforts, some research has proposed automatically generating parallel GPU codes by complex compile-time techniques. However, this approach can only parallelize loops 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work, we propose GPU-TLS, a runtime system to speculatively parallelize possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory while iterations encountering mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The larger mis-speculation rate caused by larger number of threads is reduced by three approaches: the loop chopping parallelization approach, the deferred memory update scheme and intra-warp value forwarding method. (2) The larger overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order but still guarantees sequential semantics. 
Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops. © 2013 IEEE.
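
The speculate, check, commit and re-execute cycle described in this abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the GPU-TLS runtime: it runs on the CPU in plain Python, the loop body `a[dst[i]] = a[src[i]] + 1` and the names `speculative_loop`, `dst`, `src` and `chunk` are hypothetical, and the intra-warp value forwarding and parallel commit schemes are left out.

```python
def speculative_loop(a, dst, src, chunk=4):
    """Simulate speculative execution of: for i: a[dst[i]] = a[src[i]] + 1."""
    a = list(a)            # master memory (copied so the input is untouched)
    n = len(dst)
    reexecuted = 0
    for start in range(0, n, chunk):          # chop the loop into sub-loops
        its = range(start, min(start + chunk, n))
        # Speculative phase (conceptually parallel): every iteration reads the
        # master memory as it was at sub-loop start and buffers its write.
        buffered = {i: a[src[i]] + 1 for i in its}
        written = set()    # locations written by earlier iterations of this sub-loop
        for i in its:      # dependency check and commit, in iteration order
            if src[i] in written:
                # Read-after-write dependency: the speculation was wrong,
                # so re-execute against the up-to-date master memory.
                a[dst[i]] = a[src[i]] + 1
                reexecuted += 1
            else:
                a[dst[i]] = buffered[i]       # speculation held: commit buffered write
            written.add(dst[i])
    return a, reexecuted
```

Committing in iteration order keeps sequential semantics in this sketch; a larger `chunk` exposes more parallelism but also more iterations that may need re-execution.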

HKU Scholars Hub

LMRE: a multiprocessor environment for teaching concurrency concepts in a CS1 course

Author: De Giusti Armando Eduardo
De Giusti Laura Cristina
Frati Fernando Emmanuel
Leibovich Fabiana Yael
Madoz María Cristina
Sanchez Mariano
Publication date: 01/10/2011

We present an interactive visual environment for teaching concurrency and parallelism concepts in an introductory algorithms course. The LMRE environment (Lidi MultiRobot Environment) is an evolution of Visual Da Vinci, which has been used extensively for introductory programming at several universities. The article analyzes the problem of technological change brought about by the introduction of multicore processors and its impact on programming, and describes a definition of the environment as well as the primitives to be used when programming concurrent applications. Finally, it details implementation aspects of the prototype currently under test, as well as its evolution for use in more advanced concurrency courses. Presented at the IX Workshop Tecnología Informática aplicada en Educación (WTIAE). Red de Universidades con Carreras en Informática (RedUNCI)

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Directory of Open Access Journals

Servicio de Difusión de la Creación Intelectual

High-level programming of stencil computations on multi-GPU systems using the SkelCL library

Author: Breuer Stefan
Gorlatch Sergei
Haidl Michael
Steuwer Michel
Publication venue: World Scientific Pub Co Pte Lt
Publication date: 01/09/2014

The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually tuned coding using low-level approaches like OpenCL and CUDA. This makes development of stencil applications a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach, which combines high-level programming abstractions with competitive performance on multi-GPU systems. SkelCL extends the OpenCL standard by three high-level features: 1) pre-implemented parallel patterns (a.k.a. skeletons); 2) container data types for vectors and matrices; 3) an automatic data (re)distribution mechanism. We introduce two new SkelCL skeletons which specifically target stencil computations – MapOverlap and Stencil – and we describe their use for particular application examples, discuss their efficient parallel implementation, and report experimental results on systems with multiple GPUs. Our evaluation of three real-world applications shows that stencil code written with SkelCL is considerably shorter and offers performance competitive with hand-tuned OpenCL code.
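
To make the idea of a stencil skeleton concrete, the following is a minimal one-dimensional sketch in plain Python. It is not the SkelCL API: the function name `map_overlap`, its parameters, and the choice of padding out-of-range neighbours with a constant are assumptions made for illustration, and nothing here runs on a GPU.

```python
def map_overlap(data, radius, fn, pad=0.0):
    """Apply fn to each element's neighbourhood of width 2*radius + 1."""
    n = len(data)

    def at(i):
        # Out-of-range accesses return the padding value (an assumed policy).
        return data[i] if 0 <= i < n else pad

    return [fn([at(i + d) for d in range(-radius, radius + 1)])
            for i in range(n)]


# Usage: a three-point sum over a small vector.
smoothed = map_overlap([1, 2, 3, 4], 1, sum)
```

The user supplies only the per-neighbourhood function; the traversal and the boundary handling live in the skeleton, which is the separation of concerns the abstract describes.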

Crossref

Enlighten

Self-Configuration and Self-Optimization Autonomic Skeletons using Events

Author: Henrio Ludovic
Pabon Gustavo
Publication venue: ACM
Publication date: 15/02/2014

This paper presents a novel way to introduce self-configuration and self-optimization autonomic characteristics to algorithmic skeletons using event-driven programming techniques. Based on an algorithmic skeleton language, we show that the use of events greatly improves the estimation of the remaining computation time for skeleton execution. Events allow us to precisely monitor the status of the execution of algorithmic skeletons. Using such events, we provide a framework for the execution of skeletons with a very high level of adaptability. We focus mainly on guaranteeing a given execution time for a skeleton by autonomically optimizing the number of threads allocated. The proposed solution is independent of the platform chosen for executing the skeleton; for example, we illustrate our approach in a multicore setting, but it could also be adapted to a distributed execution environment.
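
The core calculation behind choosing a thread count for a target execution time can be sketched as follows. This is a simplified illustration, not the paper's framework: the function `threads_needed` and its parameters are hypothetical, task durations are assumed to come from completion events, and the estimate assumes independent tasks with perfectly linear speedup.

```python
import math


def threads_needed(observed_durations, remaining_tasks, time_left, max_threads):
    """Estimate how many threads are needed to finish within time_left seconds."""
    if remaining_tasks == 0:
        return 1
    if time_left <= 0 or not observed_durations:
        # Already late, or nothing observed yet: use everything available.
        return max_threads
    mean = sum(observed_durations) / len(observed_durations)
    remaining_work = mean * remaining_tasks   # estimated sequential time still to do
    wanted = math.ceil(remaining_work / time_left)
    return max(1, min(max_threads, wanted))
```

Re-evaluating this estimate each time a task-completion event arrives lets the thread count follow the observed progress rather than an initial guess.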

HAL-UNICE

INRIA a CCSD electronic archive server

An events based algorithm for distributing concurrent tasks on multi-core architectures

Author: David W. Holmes
John R. Williams
Peter Tilke
Publication venue: Elsevier BV
Publication date: 01/08/2009

In this paper, a programming model is presented which enables scalable parallel performance on multi-core shared memory architectures. The model has been developed for application to a wide range of numerical simulation problems. Such problems involve time stepping or iteration algorithms where synchronization of multiple threads of execution is required. It is shown that traditional approaches to parallelism, including message passing and scatter-gather, can be improved upon in terms of speed-up and memory management. Using spatial decomposition to create orthogonal computational tasks, a new task management algorithm called H-Dispatch is developed. This algorithm makes efficient use of memory resources by limiting the need for garbage collection and takes optimal advantage of multiple cores by employing a “hungry” pull strategy. The technique is demonstrated on a simple finite difference solver and results are compared to traditional MPI and scatter-gather approaches. The H-Dispatch approach achieves near-linear speed-up, with results for efficiency of 85% on a 24-core machine. It is noted that the H-Dispatch algorithm is quite general and can be applied to a wide class of computational tasks on heterogeneous architectures involving multi-core and GPGPU hardware. Schlumberger-Doll Research Center, Saudi Aramc
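
The pull-based scheduling pattern mentioned in this abstract can be sketched on a one-dimensional finite difference step. This is an illustration of the general idea only, not the H-Dispatch algorithm: the function `diffusion_step`, its parameters and the chunking scheme are assumptions, and Python threads are used to show the scheduling structure rather than to obtain a real speed-up.

```python
import queue
import threading


def diffusion_step(u, alpha, n_workers=4, chunk=8):
    """One explicit diffusion step; idle workers pull chunks from a shared queue."""
    n = len(u)
    new_u = list(u)                      # boundary values are left unchanged
    tasks = queue.Queue()
    for start in range(1, n - 1, chunk): # spatial decomposition into independent chunks
        tasks.put((start, min(start + chunk, n - 1)))

    def worker():
        while True:
            try:
                lo, hi = tasks.get_nowait()   # pull more work as soon as idle
            except queue.Empty:
                return                        # nothing left: this worker is done
            for i in range(lo, hi):
                new_u[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # synchronize before the next time step
    return new_u
```

Because each worker asks for the next chunk only when it has finished its current one, faster workers naturally take on more chunks and no central scheduler has to predict the load.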

ResearchOnline at James Cook University

Queensland University of Technology ePrints Archive

Concurrency and Parallelism in the First Algorithms Course?

Author: De Giusti Armando Eduardo
Frati Fernando Emmanuel
Publication date: 03/09/2012

We present a topic of intense current curricular debate: will technological change in processors force a change in the approach to teaching programming, replacing the sequential paradigm with the parallel one? The early inclusion of concurrency and parallelism topics in the Computer Science curriculum is under discussion, and this work presents an analysis of the possibilities as well as a proposal under development at UNLP. Red de Universidades con Carreras en Informática (RedUNCI)

Servicio de Difusión de la Creación Intelectual