22 research outputs found

    Scalable RDMA performance in PGAS languages

    Partitioned global address space (PGAS) languages provide a programming model that can span shared-memory multiprocessor (SMP) architectures, distributed-memory machines, and clusters of SMPs. Users can program large-scale machines with easy-to-use, shared-memory paradigms. To exploit large-scale machines efficiently, PGAS language implementations and their runtime systems must be designed for scalability and performance. The IBM XLUPC compiler and runtime system achieve a scalable design through the shared variable directory (SVD), which stores the meta-information needed to access shared data. In the worst case, the SVD is dereferenced for every shared-memory access, exposing a potential performance problem. In this paper we present a cache of remote addresses as an optimization that reduces the SVD access overhead and allows the exploitation of native remote direct memory accesses (RDMA). The cache yields a significant performance improvement while preserving the runtime's portability and scalability.
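
    The abstract does not show how such a cache of remote addresses might be organized. The following C sketch illustrates the general idea under assumed interfaces: svd_lookup_remote, rdma_get, the SVD-handle/thread key, and the direct-mapped cache layout are all hypothetical, not the XLUPC runtime's actual API.

        #include <stddef.h>
        #include <stdint.h>

        #define ADDR_CACHE_SLOTS 256    /* direct-mapped cache of remote base addresses */

        typedef struct {
            uint32_t svd_handle;        /* identifies the shared variable in the SVD   */
            uint32_t thread;            /* remote thread that owns the accessed block  */
            uint64_t remote_base;       /* base address valid on that thread's node    */
            int      valid;
        } addr_cache_entry_t;

        static addr_cache_entry_t addr_cache[ADDR_CACHE_SLOTS];

        /* Assumed runtime hooks (hypothetical): a full SVD dereference, possibly
         * involving communication, and a one-sided RDMA read once the remote
         * virtual address is known. */
        extern uint64_t svd_lookup_remote(uint32_t svd_handle, uint32_t thread);
        extern void rdma_get(void *dst, uint32_t thread, uint64_t remote_addr, size_t n);

        /* Resolve a remote base address, consulting the cache first so the SVD
         * is only dereferenced on a miss. */
        static uint64_t resolve_remote_base(uint32_t svd_handle, uint32_t thread)
        {
            addr_cache_entry_t *e = &addr_cache[(svd_handle ^ thread) % ADDR_CACHE_SLOTS];

            if (e->valid && e->svd_handle == svd_handle && e->thread == thread)
                return e->remote_base;                              /* hit: no SVD access */

            e->remote_base = svd_lookup_remote(svd_handle, thread); /* miss: dereference  */
            e->svd_handle  = svd_handle;
            e->thread      = thread;
            e->valid       = 1;
            return e->remote_base;
        }

        /* A remote read then becomes a direct RDMA using the cached address. */
        void remote_read(void *dst, uint32_t svd_handle, uint32_t thread,
                         size_t offset, size_t n)
        {
            rdma_get(dst, thread, resolve_remote_base(svd_handle, thread) + offset, n);
        }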

    Optimizing NANOS OpenMP for the IBM Cyclops multithreaded architecture

    In this paper, we present two approaches to improving the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent, and both aim at better performance through better management of cache locality. The first is a software modification to the OpenMP runtime library that balances stack accesses across all data caches. The second is a small hardware modification to the data cache mapping, with the same goal. Both solutions help parallel applications scale and perform better on this kind of architecture, and they could also be applied to future multi-core processors. We evaluated these proposals by executing (in simulation) several of the NAS benchmarks. The results show that, with small changes in both the software and the hardware, parallel applications achieve very good scalability. They also show that standard execution environments oriented to multiprocessor architectures can easily be adapted to exploit multithreaded processors.
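
    As an illustration of the software-side idea of balancing stack accesses across data caches, the C sketch below skews each thread's stack base by a per-thread offset so that the hot stack tops do not all map to the same cache bank. The constants and pool layout are assumptions made for the sketch, not the NANOS runtime's actual implementation.

        #include <stdint.h>

        #define STACK_SIZE   (64 * 1024)        /* per-thread stack size (assumed)     */
        #define NUM_BANKS    64                 /* assumed number of data cache banks  */
        #define BANK_STRIDE  256                /* assumed bank interleaving, in bytes */
        #define SLOT_SIZE    (STACK_SIZE + NUM_BANKS * BANK_STRIDE)  /* room for skew  */

        /* If every stack started at a multiple of STACK_SIZE, the top of every
         * stack would map to the same cache sets and bank.  Skewing each stack
         * base by a per-thread multiple of the bank stride spreads stack accesses
         * across all banks; the pool is over-allocated by NUM_BANKS * BANK_STRIDE
         * per slot so the skewed stacks never overlap. */
        static uintptr_t stack_base_for(uintptr_t pool_base, int tid)
        {
            uintptr_t slot = pool_base + (uintptr_t)tid * SLOT_SIZE;
            uintptr_t skew = (uintptr_t)(tid % NUM_BANKS) * BANK_STRIDE;
            return slot + skew;
        }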

    A Unified Parallel C compiler that implements automatic communication aggregation

    Partitioned Global Address Space (PGAS) programming languages, such as Unified Parallel C (UPC), offer an attractive, high-productivity programming model for large-scale parallel machines. PGAS languages partition the application's address space into private, shared-local, and shared-remote memory. When running in a distributed-memory environment, accessing shared-remote memory leads to implicit communication. For fine-grained accesses, which are frequently found in UPC programs, this communication overhead can significantly impact program performance. One solution for reducing the number of fine-grained accesses is to coalesce several accesses into a single access. This paper presents an analysis that identifies opportunities for coalescing and an algorithm that allows the compiler to automatically coalesce accesses to shared-remote memory in UPC. It also describes how opportunities for coalescing can be created by the compiler through loop unrolling. Results obtained from coalescing accesses in manually unrolled parallel loops demonstrate the benefit of combining parallel loop unrolling and communication coalescing.
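
    A hand-written UPC illustration of the kind of transformation described is shown below: the fine-grained version issues one shared-remote read per element, while the coalesced version fetches each remote block with a single bulk upc_memget. The array name, block size, and reduction are made up for the example; the paper's compiler performs this restructuring automatically.

        #include <upc.h>

        #define B 64                       /* elements per block (one block per thread) */
        shared [B] double a[B * THREADS];  /* block t has affinity to thread t          */

        /* Fine-grained version: every access to a remote block is a separate
         * shared-remote read, i.e. one small message per element. */
        double sum_fine_grained(void)
        {
            double s = 0.0;
            for (int i = 0; i < B * THREADS; i++)
                s += a[i];
            return s;
        }

        /* Coalesced version: the loop is restructured per block, and each remote
         * block is fetched with a single bulk transfer instead of B reads. */
        double sum_coalesced(void)
        {
            double s = 0.0;
            double buf[B];
            for (int t = 0; t < THREADS; t++) {
                upc_memget(buf, &a[t * B], B * sizeof(double));
                for (int j = 0; j < B; j++)
                    s += buf[j];
            }
            return s;
        }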

    MaJIC

    Multidimensional Blocking in UPC

    Partitioned Global Address Space (PGAS) languages offer an attractive, high-productivity programming model for programming large-scale parallel machines. PGAS languages, such as Unified Parallel C (UPC), combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm by allowing users control over the data layout. PGAS languages distinguish between private, shared-local, and shared-remote memory, with shared-remote accesses typically much more expensive than shared-local and private accesses, especially on distributed-memory machines where a shared-remote access implies communication over a network. In this paper we present a simple extension to the UPC language that allows the programmer to block shared arrays in multiple dimensions. We claim that this extension allows for better control of locality, and therefore performance, in the language. We describe an analysis that allows the compiler to distinguish between local shared array accesses and remote shared array accesses. Local shared array accesses are then transformed into direct memory accesses by the compiler, saving the overhead of a locality check at runtime. We present results showing that the locality analysis significantly reduces the number of shared accesses.
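
    The multi-dimensional layout qualifier itself is the paper's proposed extension and is not standard UPC, so the sketch below uses a standard one-dimensional blocked shared array simply to illustrate the runtime locality check that the paper's analysis removes: an access proven local can be performed through an ordinary private pointer instead of going through the runtime. Names and sizes are illustrative, and static THREADS compilation is assumed for the array declaration.

        #include <upc.h>

        #define N 512
        #define B 64                       /* standard UPC one-dimensional block size */

        shared [B] double A[N][N];         /* blocks of B elements cycle over threads */

        /* Every shared access normally pays a runtime locality check plus address
         * translation.  When an access can be proven local, the shared pointer may
         * be cast to a private pointer and dereferenced directly -- the
         * transformation the paper's locality analysis performs automatically. */
        double scale_local_elements(double f)
        {
            double s = 0.0;
            for (int i = 0; i < N; i++) {
                for (int j = 0; j < N; j++) {
                    if (upc_threadof(&A[i][j]) == (size_t)MYTHREAD) {
                        double *p = (double *)&A[i][j];   /* legal: element is local */
                        *p *= f;
                        s += *p;
                    }
                }
            }
            return s;
        }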