Search CORE

7 research outputs found

Task migration of DSP application specified with a DFG and implemented with the BSP computing model on a CPU-GPU cluster

Author: Fristot Vincent
Houzet Dominique
Huet Sylvain
Mansouri Farouk
Publication venue: HAL CCSD
Publication date: 08/10/2013
Field of study

International audienceNowadays computer applications are becoming heavier and require, at the same time, real-time results. The Heterogeneous clusters with their computing power represent a good solution to this request. However, it is possible that during the execution, a computing element of the cluster becomes defaulting, needs maintenance, or that the load needs to be re-balanced. . . In this paper, we propose a migration strategy for relocating the execution of a task to another computing element. In particular, we are interested in remap nodes of Data Flow Graph (DFG), representing Digital Signal Processing (DSP) application, onto heterogeneous (CPU-GPU) clusters while keeping up the flow of data and minimizing the temporal perturbation. For our approach, we give a lower bound for the flow of data after the migration and, validate it by the real-time construction of visual saliency map from video input

Hal - Université Grenoble Alpes

Intra-node Memory Safe GPU Co-Scheduling

Author: Nikolopoulos Dimitrios S.
Reaño González Carlos
Silla Jiménez Federico
Varghese Blesson
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/12/2017
Field of study

[EN] GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the utilisation of GPUs by proposing a framework, we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be safely shared among multiple applications by taking memory constraints into account. Two approaches, namely a client-server and a shared memory approach are explored. However, the shared memory approach is more suitable due to lower overheads when compared to the former approach. Four policies are proposed in schedGPU to handle applications that are waiting to access the GPU, two of which account for priorities. The feasibility of schedGPU is validated on three real-world applications. The key observation is that a performance gain is achieved. For single applications, a gain of over 10 times, as measured by GPU utilisation and GPU memory utilisation, is obtained. For workloads comprising multiple applications, a speed-up of up to 5x in the total execution time is noted. Moreover, the average GPU utilisation and average GPU memory utilisation is increased by 5 and 12 times, respectively.This work was funded by Generalitat Valenciana under grant PROMETEO/2017/77.Reaño González, C.; Silla Jiménez, F.; Nikolopoulos, DS.; Varghese, B. (2018). Intra-node Memory Safe GPU Co-Scheduling. IEEE Transactions on Parallel and Distributed Systems. 29(5):1089-1102. https://doi.org/10.1109/TPDS.2017.2784428S1089110229

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

Crossref

RiuNet

Comparação do desempenho do FDTD com implementação em CPU e em GPU

Author: Mestre Nuno Roberto Pereira
Publication venue: Universidade de Aveiro
Publication date: 21/12/2012
Field of study

Mestrado em Engenharia de Computadores e TelemáticaO Finite-Difference Time-Domain é um método utilizado em electromagnetismo computacional para simular a propagação de ondas electromagnéticas em meios cujas características podem não ser uniformes. É um método com inúmeras aplicações, e como tal é vantajoso que o seu desempenho possa ser aumentado, de preferência recorrendo a sistemas computacionais de baixo custo. O propósito desta dissertação é aproveitar duas tecnologias emergentes e de relativo baixo custo para aumentar o desempenho do FDTD em uma e duas dimensões. Essas tecnologias são sistemas com processadores Multi-Core e placas gráficas que permitem utilizar as suas características de processamento massivamente paralelo para a execução de código de propósito geral. Para explorar as capacidades de um sistema com processador Multi-Core, o algoritmo originalmente sequencial foi alterado de modo a ser executado em múltiplas threads. Por sua vez, para tirar partido da tecnologia CUDA, o algoritmo foi convertido de forma a ser executado num GPU. Os acréscimos de desempenho obtidos indicam que é vantajoso o uso destas tecnologias comparativamente com implementações puramente sequenciais.The Finite-Difference Time-Domain is a method used in computational electromagnetics to simulate the propagation of electromagnetic waves in fields that might not have uniform characteristics. It is a method with countless applications and so it is advantageous to increase its performance, preferably using low cost computer systems. The purpose of this thesis is to make use of two relatively low-cost emerging technologies to increase the FDTD performance in one and two dimensions. These technologies are Multi-Core systems and graphics cards that allow the use of its massive parallel processing characteristics to run general purpose code. To make use of a Multi-Core system, the original sequential code was changed to be executed by multiple threads. In order to use the CUDA technology, the algorithm was converted so that it could be executed on the GPU. The performance increase shows that the use of these technologies is advantageous in comparison to pure sequential implementations

Repositório Institucional da Universidade de Aveiro

Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

Author: Amatya Vinay Chandra
Publication venue: LSU Digital Commons
Publication date: 01/01/2014
Field of study

Advancement in cutting edge technologies have enabled better energy efficiency as well as scaling computational power for the latest High Performance Computing(HPC) systems. However, complexity, due to hybrid architectures as well as emerging classes of applications, have shown poor computational scalability using conventional execution models. Thus alternative means of computation, that addresses the bottlenecks in computation, is warranted. More precisely, dynamic adaptive resource management feature, both from systems as well as application\u27s perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions, targeted at one or more synchronous domains, meta data for computation and resource management as well as infrastructure for dynamic policy deployment. In addition to this, the research presents additional guidelines for a framework for resource management in HPX runtime system. Further, this research also lists design principles for scalability of Active Global Address Space (AGAS), a necessary feature for Parallel Processes. Also, to verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task scheduling policies is carried out using two different applications. The applications used are: Unbalanced Tree Search, a reference dynamic graph application, implemented by this research in HPX and MiniGhost, a reference stencil based application using bulk synchronous parallel model. The results show that different scheduling policies provide better performance for different classes of applications; and for the same application class, in certain instances, one policy fared better than the others, while vice versa in other instances, hence supporting the hypothesis of the need of dynamic adaptive resource management infrastructure, for deploying different policies and task granularities, for scalable distributed computing

Louisiana State University

Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

Author
Publication venue
Publication date: 01/01/2018
Field of study

abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

ASU Digital Repository

ANALYSIS OF BIOPATHWAY MODELS USING PARALLEL ARCHITECTURES

Author: RAMANATHAN RAJENDIRAN
Publication venue
Publication date: 31/03/2017
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS

ANALYSIS OF BIOPATHWAY MODELS USING PARALLEL ARCHITECTURES

Author: RAMANATHAN RAJENDIRAN
Publication venue
Publication date: 01/01/2015
Field of study

Ph.DDOCTOR OF PHILOSOPH

MPG.PuRe

ScholarBank@NUS