Extending and Implementing the Self-adaptive Virtual Processor for Distributed Memory Architectures
Many-core architectures of the future are likely to have distributed memory
organizations and need fine grained concurrency management to be used
effectively. The Self-adaptive Virtual Processor (SVP) is an abstract
concurrent programming model which can provide this, but the model and its
current implementations assume a single address space shared memory. We
investigate and extend SVP to handle distributed environments, and discuss a
prototype SVP implementation which transparently supports execution on
heterogeneous distributed memory clusters over TCP/IP connections, while
retaining the original SVP programming model.
Scaling CUDA for Distributed Heterogeneous Processors
The mainstream acceptance of heterogeneous computing and cloud computing is prompting a future of distributed heterogeneous systems. With current software development tools, programming such complex systems is difficult and requires extensive knowledge of network and processor architectures. By providing an abstraction of the underlying network, the message-passing interface (MPI) has been the standard tool for developing distributed applications in the high-performance community. The drawback of MPI lies in its message-passing model, which is less expressive than the shared-memory model. Development of heterogeneous programming tools, such as OpenCL, has only begun recently. This thesis presents Phalanx, a framework that extends the virtual architecture of CUDA for distributed heterogeneous systems. Using MPI, Phalanx transparently handles intercommunication among distributed nodes. By using the shared-memory model, Phalanx simplifies the development of distributed applications without sacrificing the advantages of MPI. In one of the case studies, Phalanx achieves a 28x speedup compared with serial execution on a Core-i7 processor.
A Cloud Powered Relaxed Heterogeneous Distributed Shared Memory System
Distributed systems allow the existence of impressive pieces of software, but usually impose strict restrictions on the implementation language and model. We propose a distribution system model that enables the incorporation of any hardware device connected to the internet as its nodes, and places no restriction on the execution engine, allowing the transparent incorporation of any existing codebase into a Distributed Shared Memory. XIX Workshop Procesamiento Distribuido y Paralelo (WPDP), Red de Universidades con Carreras en Informática (RedUNCI).
Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access
Scientific applications often contain large computationally-intensive
parallel loops. Loop scheduling techniques aim to achieve load balanced
executions of such applications. For distributed-memory systems, existing
dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a
master-worker execution model to assign variably-sized chunks of loop
iterations. The master-worker execution model may adversely impact performance
due to contention at the master. This work proposes a distributed
chunk-calculation approach that does not require the master-worker execution
scheme. Moreover, it considers the novel features in the latest MPI standards,
such as passive-target remote memory access, shared-memory window creation, and
atomic read-modify-write operations. To evaluate the proposed approach, five
well-known DLS techniques, two applications, and two heterogeneous hardware
setups have been considered. The DLS techniques implemented using the proposed
approach outperformed their counterparts implemented using the traditional
master-worker execution model.
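The distributed chunk-calculation idea can be sketched in miniature: each worker atomically claims its next chunk from a shared loop index instead of requesting it from a master. The sketch below is an illustrative single-process analogue, not the library described in the abstract; a lock-guarded counter stands in for MPI's passive-target atomic read-modify-write (MPI_Fetch_and_op on a shared-memory window), and the guided chunk-size rule is just one of the DLS techniques mentioned.

```python
# Single-process sketch of decentralized chunk self-scheduling.
# A threading.Lock-guarded counter stands in for the atomic
# read-modify-write that MPI passive-target RMA would provide.
import threading

class ChunkScheduler:
    def __init__(self, total_iters, num_workers, min_chunk=1):
        self.remaining = total_iters
        self.start = 0
        self.num_workers = num_workers
        self.min_chunk = min_chunk
        self.lock = threading.Lock()  # stands in for MPI_Win_lock + MPI_Fetch_and_op

    def next_chunk(self):
        """Guided self-scheduling rule: chunk = remaining / P, claimed atomically."""
        with self.lock:
            if self.remaining == 0:
                return None
            size = max(self.min_chunk, self.remaining // self.num_workers)
            size = min(size, self.remaining)
            chunk = (self.start, self.start + size)
            self.start += size
            self.remaining -= size
            return chunk

sched = ChunkScheduler(total_iters=100, num_workers=4)
chunks = []
while (c := sched.next_chunk()) is not None:
    chunks.append(c)
# Chunks shrink as work runs out and exactly cover [0, 100).
```

Because every worker computes its own chunk from the shared counter, no master process sits in the critical path, which is the contention the abstract identifies.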
Distributed shared memory on heterogeneous CPUs+GPUs platforms
Master's dissertation in Informatics Engineering. Developing for heterogeneous platforms can significantly complicate the coding process, since different processing devices mean different architectures, programming and memory models, disjoint address spaces and so on. This document proposes that the development process can be eased by virtualizing a traditional shared memory environment on top of the heterogeneous distributed system and exposing a unified memory model to the developer. The memory system frees the developer from having to manually manage data movements and allows the use of dynamic memory, accessible by all the devices.
The proposed memory system was implemented and validated on the GAMA framework using three algorithms to benchmark the system: SAXPY, all-pairs N-Body simulation and Barnes-Hut N-Body simulation. These algorithms were used to evaluate the framework's performance and scalability when paired with the proposed memory system.
The results show that, overall, the memory system improved performance on all algorithms. The memory system proved most useful on algorithms with a high ratio of computation over memory accesses by improving execution times, and especially useful on irregular algorithms by also improving scalability. The parallel memory allocator showed great results when used only on the CPU, but had allocation-speed issues when GPUs were added to the system.
Towards Virtual Shared Memory for Non-Cache-Coherent Multicore Systems
Emerging heterogeneous architectures do not necessarily provide cache-coherent shared memory across all components of the system. Although there are good reasons for this architectural decision, it does present programmers with a challenge. Several programming models and approaches are currently available, including explicit data movement for offloading computation to coprocessors, and treating coprocessors as distributed memory machines by using message passing. This paper examines the potential of distributed shared memory (DSM) for addressing this programming challenge. We discuss how our recently proposed DSM system and its memory consistency model map to the heterogeneous node context, and present experimental results that highlight the advantages and challenges of this approach.
Parallel Branch-and-Bound in Multi-core Multi-CPU Multi-GPU Heterogeneous Environments
We investigate the design of parallel B&B in large-scale heterogeneous compute environments where processing units can be composed of a mixture of multiple shared memory cores, multiple distributed CPUs and multiple GPU devices. We describe two approaches addressing the critical issue of how to map B&B workload onto the different levels of parallelism exposed by the target compute platform. We also contribute a thorough large-scale experimental study which allows us to derive a comprehensive and fair analysis of the proposed approaches under different system configurations using up to 16 GPUs and up to 512 CPU cores. Our results shed more light on the main challenges one has to face when tackling B&B algorithms while describing efficient techniques to address them. In particular, we are able to obtain linear speed-ups at moderate scales, where adaptive load balancing among the heterogeneous compute resources is shown to have a significant impact on performance. At the largest scales, intra-node parallelism and hybrid decentralized load balancing are shown to be of crucial importance in order to alleviate locking issues among shared memory threads and to scale the distributed resources while optimizing communication costs and minimizing idle time.
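As a minimal illustration of the workload being mapped, the sketch below shows a sequential branch-and-bound for 0/1 knapsack with a single incumbent bound. In the parallel setting the abstract describes, keeping such a bound consistent across shared-memory threads, distributed CPUs and GPUs is precisely where the locking and load-balancing challenges arise. All names here are illustrative; this is not the paper's implementation.

```python
# Sketch: sequential 0/1 knapsack branch-and-bound with an incumbent bound.
# In a parallel B&B, `best` becomes shared state that all workers read for
# pruning and update under synchronization.
def knapsack_bb(values, weights, capacity):
    n = len(values)
    # Explore items in decreasing value density for tighter bounds.
    order = sorted(range(n), key=lambda i: values[i] / weights[i], reverse=True)
    best = 0  # incumbent: the globally shared bound in a parallel B&B

    def bound(i, value, room):
        # Fractional relaxation: optimistic estimate used for pruning.
        for j in order[i:]:
            if weights[j] <= room:
                room -= weights[j]
                value += values[j]
            else:
                return value + values[j] * room / weights[j]
        return value

    def branch(i, value, room):
        nonlocal best
        if i == n:
            best = max(best, value)
            return
        if bound(i, value, room) <= best:
            return  # prune: this subtree cannot beat the incumbent
        j = order[i]
        if weights[j] <= room:                      # branch: take item j
            branch(i + 1, value + values[j], room - weights[j])
        branch(i + 1, value, room)                  # branch: skip item j

    branch(0, 0, capacity)
    return best

print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))  # → 220
```

Each `branch` call is an independent subproblem, which is what makes B&B amenable to the multi-level decomposition and work stealing the study evaluates.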