Extending and Implementing the Self-adaptive Virtual Processor for Distributed Memory Architectures
Many-core architectures of the future are likely to have distributed memory
organizations and need fine grained concurrency management to be used
effectively. The Self-adaptive Virtual Processor (SVP) is an abstract
concurrent programming model which can provide this, but the model and its
current implementations assume a single address space shared memory. We
investigate and extend SVP to handle distributed environments, and discuss a
prototype SVP implementation which transparently supports execution on
heterogeneous distributed memory clusters over TCP/IP connections, while
retaining the original SVP programming model.
Scaling CUDA for Distributed Heterogeneous Processors
The mainstream acceptance of heterogeneous computing and cloud computing is prompting a future of distributed heterogeneous systems. With current software development tools, programming such complex systems is difficult and requires extensive knowledge of network and processor architectures. By providing an abstraction of the underlying network, the message-passing interface (MPI) has been the standard tool for developing distributed applications in the high-performance community. The drawback of MPI lies in its message-passing model, which is less expressive than the shared-memory model. Development of heterogeneous programming tools, such as OpenCL, has only begun recently. This thesis presents Phalanx, a framework that extends the virtual architecture of CUDA for distributed heterogeneous systems. Using MPI, Phalanx transparently handles intercommunication among distributed nodes. By using the shared-memory model, Phalanx simplifies the development of distributed applications without sacrificing the advantages of MPI. In one of the case studies, Phalanx achieves a 28x speedup compared with serial execution on a Core-i7 processor.
A Cloud Powered Relaxed Heterogeneous Distributed Shared Memory System
Distributed systems allow the existence of impressive pieces of software, but usually impose strict restrictions on the implementation language and model. We propose a distribution system model that enables the incorporation of any hardware device connected to the internet as its nodes, and places no restriction on the execution engine, allowing the transparent incorporation of any existing codebase into a Distributed Shared Memory. XIX Workshop Procesamiento Distribuido y Paralelo (WPDP), Red de Universidades con Carreras en Informática (RedUNCI).
Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access
Scientific applications often contain large computationally-intensive
parallel loops. Loop scheduling techniques aim to achieve load balanced
executions of such applications. For distributed-memory systems, existing
dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a
master-worker execution model to assign variably-sized chunks of loop
iterations. The master-worker execution model may adversely impact performance
due to contention at the master. This work proposes a distributed
chunk-calculation approach that does not require the master-worker execution
scheme. Moreover, it considers the novel features in the latest MPI standards,
such as passive-target remote memory access, shared-memory window creation, and
atomic read-modify-write operations. To evaluate the proposed approach, five
well-known DLS techniques, two applications, and two heterogeneous hardware
setups have been considered. The DLS techniques implemented using the proposed
approach outperformed their counterparts implemented using the traditional
master-worker execution model.
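The distributed chunk-calculation idea can be sketched in miniature: each worker atomically claims its next chunk from a shared loop index instead of requesting it from a master. The sketch below is an illustrative single-process analogue, not the library described in the abstract; a lock-guarded counter stands in for MPI's passive-target atomic read-modify-write (MPI_Fetch_and_op on a shared-memory window), and the guided chunk-size rule is just one of the DLS techniques mentioned.

```python
# Single-process sketch of decentralized chunk self-scheduling.
# A threading.Lock-guarded counter stands in for the atomic
# read-modify-write that MPI passive-target RMA would provide.
import threading

class ChunkScheduler:
    def __init__(self, total_iters, num_workers, min_chunk=1):
        self.remaining = total_iters
        self.start = 0
        self.num_workers = num_workers
        self.min_chunk = min_chunk
        self.lock = threading.Lock()  # stands in for MPI_Win_lock + MPI_Fetch_and_op

    def next_chunk(self):
        """Guided self-scheduling rule: chunk = remaining / P, claimed atomically."""
        with self.lock:
            if self.remaining == 0:
                return None
            size = max(self.min_chunk, self.remaining // self.num_workers)
            size = min(size, self.remaining)
            chunk = (self.start, self.start + size)
            self.start += size
            self.remaining -= size
            return chunk

sched = ChunkScheduler(total_iters=100, num_workers=4)
chunks = []
while (c := sched.next_chunk()) is not None:
    chunks.append(c)
# Chunks shrink as work runs out and exactly cover [0, 100).
```

Because every worker computes its own chunk from the shared counter, no master process sits in the critical path, which is the contention the abstract identifies.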
Distributed shared memory on heterogeneous CPUs+GPUs platforms
Master's dissertation in Informatics Engineering. Developing for heterogeneous platforms can significantly complicate the coding process, since different processing devices mean different architectures, programming and memory models, disjoint address spaces and so on. This document proposes that the development process can be eased by virtualizing a traditional shared memory environment on top of the heterogeneous distributed system and exposing a unified memory model to the developer. The memory system frees the developer from having to manually manage data movements and allows the use of dynamic memory, accessible by all the devices.
The proposed memory system was implemented and validated on the GAMA framework using three algorithms to benchmark the system: SAXPY, all-pairs N-Body simulation and Barnes-Hut N-Body simulation. These algorithms were used to evaluate the framework's performance and scalability when paired with the proposed memory system.
The results show that, overall, the memory system improved performance on all algorithms. The memory system proved most useful on algorithms with a high ratio of computation over memory accesses by improving execution times, and especially useful on irregular algorithms by also improving scalability. The parallel memory allocator showed great results when used only on the CPU, but had allocation-speed issues when GPUs were added to the system.
Towards Virtual Shared Memory for Non-Cache-Coherent Multicore Systems
Emerging heterogeneous architectures do not necessarily provide cache-coherent shared memory across all components of the system. Although there are good reasons for this architectural decision, it does present programmers with a challenge. Several programming models and approaches are currently available, including explicit data movement for offloading computation to coprocessors, and treating coprocessors as distributed memory machines by using message passing. This paper examines the potential of distributed shared memory (DSM) for addressing this programming challenge. We discuss how our recently proposed DSM system and its memory consistency model map to the heterogeneous node context, and present experimental results that highlight the advantages and challenges of this approach.
Parallel Branch-and-Bound in Multi-core Multi-CPU Multi-GPU Heterogeneous Environments
We investigate the design of parallel B&B in large-scale heterogeneous compute environments where processing units can be composed of a mixture of multiple shared memory cores, multiple distributed CPUs and multiple GPU devices. We describe two approaches addressing the critical issue of how to map B&B workload onto the different levels of parallelism exposed by the target compute platform. We also contribute a thorough large-scale experimental study which allows us to derive a comprehensive and fair analysis of the proposed approaches under different system configurations using up to 16 GPUs and up to 512 CPU cores. Our results shed more light on the main challenges one has to face when tackling B&B algorithms while describing efficient techniques to address them. In particular, we are able to obtain linear speed-ups at moderate scales, where adaptive load balancing among the heterogeneous compute resources is shown to have a significant impact on performance. At the largest scales, intra-node parallelism and hybrid decentralized load balancing are shown to be of crucial importance in order to alleviate locking issues among shared memory threads and to scale the distributed resources while optimizing communication costs and minimizing idle time.
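As a minimal illustration of the workload being mapped, the sketch below shows a sequential branch-and-bound for 0/1 knapsack with a single incumbent bound. In the parallel setting the abstract describes, keeping such a bound consistent across shared-memory threads, distributed CPUs and GPUs is precisely where the locking and load-balancing challenges arise. All names here are illustrative; this is not the paper's implementation.

```python
# Sketch: sequential 0/1 knapsack branch-and-bound with an incumbent bound.
# In a parallel B&B, `best` becomes shared state that all workers read for
# pruning and update under synchronization.
def knapsack_bb(values, weights, capacity):
    n = len(values)
    # Explore items in decreasing value density for tighter bounds.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)
    best = 0  # incumbent: the globally shared bound in a parallel B&B

    def bound(i, value, room):
        # Fractional relaxation: optimistic estimate used for pruning.
        for j in order[i:]:
            if weights[j] <= room:
                room -= weights[j]
                value += values[j]
            else:
                return value + values[j] * room / weights[j]
        return value

    def branch(i, value, room):
        nonlocal best
        if i == n:
            best = max(best, value)
            return
        if bound(i, value, room) <= best:
            return  # prune: this subtree cannot beat the incumbent
        j = order[i]
        if weights[j] <= room:                      # branch: take item j
            branch(i + 1, value + values[j], room - weights[j])
        branch(i + 1, value, room)                  # branch: skip item j

    branch(0, 0, capacity)
    return best

print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))  # → 220
```

Each `branch` call is an independent subproblem, which is what makes B&B amenable to the multi-level decomposition and work stealing the study evaluates.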