The Partitioned Global Address Space (PGAS) programming model strikes a balance between the explicit, locality-aware message-passing model and the locality-agnostic but easy-to-use shared memory model (e.g. OpenMP). However, the PGAS memory model comes at a cost that limits both scalability and performance. Compiler optimizations are often not sufficient, and manual optimizations are needed, which considerably limits the productivity advantage. This paper proposes hardware architectural support for PGAS, which allows the processor to efficiently handle shared addresses through new instructions. A prototype compiler was realized that lets unmodified code use this support, preserving the PGAS productivity advantage. Speedups of up to 5.5x are demonstrated on the unmodified NAS Parallel Benchmarks using the Gem5 full-system simulator.
INTRODUCTION
The PGAS programming model has shown great potential for scalability and performance, as it furnishes a partitioned memory view that allows the programmer to exploit data locality. At the same time, it maintains the shared view of memory, and thus provides an important productivity advantage. However, this partitioned global view entails a more complex addressing mode to map the programmer's view of memory to its actual physical layout. Translating this representation to a regular memory address creates a significant overhead within the runtime which, in turn, forces users to manually optimize their codes (using complex pointers and MPI-like messaging), clearly reducing the productivity advantage. Moreover, implementing and maintaining manually optimized versions is not realistic for all codes, as it takes significant effort. To solve this issue, we propose a hardware support mechanism that handles complex PGAS address mapping tasks via newly introduced instructions. A PGAS compiler can use the new instructions to efficiently address the shared address space, eliminating the need for the user to manually optimize the code. The Unified Parallel C (UPC) programming language is an implementation of the PGAS model [5] based on the C language. UPC realizes the PGAS memory model by providing a shared memory view across the system that can be accessed by any thread; each thread has affinity to the part of the shared memory that resides locally, which it can access with the best performance.
The distribution of shared data across the different threads is controlled by a block size that the user specifies for a given array: elements are distributed in groups of block-size elements in a round-robin fashion. To address such arrays, UPC uses shared pointers. Shared pointers are similar to C pointers but are able to traverse shared arrays in their logical ordering. They effectively provide a mapping from the logical array ordering to the actual physical location of the data across the whole shared space. To perform the mapping, a shared pointer is composed of three elements: thread, virtual address and phase.
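The block-cyclic distribution described above can be illustrated with a minimal C sketch that maps a logical array index to the owning thread, the phase within the block, and the element's position in that thread's local portion. The type and function names (`shared_index_t`, `locate`) are hypothetical, and the element offset stands in for the virtual address component, which in a real runtime would be a byte address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: the text only states that a shared pointer
 * has a thread, a virtual address and a phase component. */
typedef struct {
    size_t thread;  /* thread with affinity to the element */
    size_t phase;   /* position of the element inside its block */
    size_t offset;  /* element index within the thread's local portion
                       (stand-in for the virtual address component) */
} shared_index_t;

/* Map logical index i of an array distributed round-robin in blocks
 * of block_size elements over `threads` threads. */
static shared_index_t locate(size_t i, size_t block_size, size_t threads)
{
    assert(block_size > 0 && threads > 0);

    size_t block = i / block_size;        /* logical block holding i */

    shared_index_t s;
    s.thread = block % threads;           /* blocks are dealt round-robin */
    s.phase  = i % block_size;            /* position inside the block */
    s.offset = (block / threads) * block_size + s.phase;
    return s;
}
```

With a block size of 2 and 4 threads, for instance, `locate(5, 2, 4)` places logical element 5 on thread 2 at phase 1, as the second element of that thread's local portion.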
RELATED WORK
Many systems have implemented some hardware support for shared memory across a system. However, almost none address the issue of shared address mapping to provide a consistent view to the programmer, the Cray T3E being a notable exception [4]. The T3E network interface is augmented with 'centrifuge' hardware that performs some mapping for arrays using four registers (index, mask, base address, stride and addend). Although this provides support for the data layout of PGAS languages, it does not mitigate the performance issue in the common case of shared local accesses (accesses made through a shared pointer that target the local memory of the node), because of the latency involved. Hence, we propose to implement the hardware support directly at the core level with new instructions, providing the lowest latency possible.
PGAS HARDWARE SUPPORT
Two essential operations were implemented to provide efficient support. (1) Shared address incrementation allows a pointer to traverse an array in its logical order by updating the three fields of the pointer so that it points to a different element of the shared array. This is a particularly complex operation involving additions, subtractions and multiplications. (2) When an element is accessed, the shared pointer needs to be converted to a virtual address and then to a physical address so that the processor can perform the access. This is done by finding the base address of the pointed-to thread and adding it to the virtual address component of the pointer. New load and store instructions, pgas ld and pgas st, were added.
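The two operations can be sketched in software as follows. This is an illustrative C model under stated assumptions, not the instruction semantics of the proposed hardware: the names `shared_ptr_t`, `sptr_inc`, `sptr_to_va` and the per-thread `segment_base` table are hypothetical, the increment is assumed non-negative, and the virtual address component is modeled as a byte offset into the owning thread's shared segment. The sketch also uses integer division and remainder to update the phase and thread fields, and it stops at the virtual address; the final virtual-to-physical translation is left to the usual memory management hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a three-field shared pointer. */
typedef struct {
    size_t    thread;  /* thread with affinity to the element */
    size_t    phase;   /* position of the element inside its block */
    uintptr_t va;      /* byte offset in the owning thread's segment */
} shared_ptr_t;

/* (1) Advance p by inc elements in logical array order. */
static shared_ptr_t sptr_inc(shared_ptr_t p, size_t inc, size_t block_size,
                             size_t elem_size, size_t threads)
{
    assert(block_size > 0 && elem_size > 0 && threads > 0);

    size_t phinc     = p.phase + inc;
    size_t new_phase = phinc % block_size;
    size_t thinc     = p.thread + phinc / block_size; /* blocks crossed */
    size_t wraps     = thinc / threads;  /* full passes over all threads */

    /* Element distance within the local portion; the phase term may be
     * negative when the pointer lands earlier in a block. */
    ptrdiff_t delta = ((ptrdiff_t)new_phase - (ptrdiff_t)p.phase)
                    + (ptrdiff_t)(wraps * block_size);

    shared_ptr_t q;
    q.thread = thinc % threads;
    q.phase  = new_phase;
    q.va     = (uintptr_t)((ptrdiff_t)p.va + delta * (ptrdiff_t)elem_size);
    return q;
}

/* (2) Convert a shared pointer to a virtual address by adding the base
 * address of the pointed-to thread's segment (hypothetical table). */
static uintptr_t sptr_to_va(shared_ptr_t p, const uintptr_t *segment_base)
{
    return segment_base[p.thread] + p.va;
}
```

For a block size of 2, 4-byte elements and 4 threads, advancing a pointer to element 0 by five elements yields thread 2, phase 1 and a byte offset of 4, which `sptr_to_va` then adds to the base address of thread 2's segment.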
EXPERIMENTAL SETUP
The Gem5 full-system simulator [2] was used to simulate a 64-core Alpha BigTsunami architecture running GNU/Linux. In this work, we used the atomic model, which issues a single instruction per clock cycle. Codes were compiled with the Berkeley UPC 2.14.2 compiler and GCC version 4.3.2. To maintain the productivity advantage of UPC, a prototype compiler based on the Berkeley UPC compiler was realized; it generates code that uses the new instructions from unmodified UPC code.
RESULTS
To evaluate the PGAS hardware support, four kernels from the NAS Parallel Benchmarks [1], implemented with UPC [3, 6], were used: IS (Integer Sort), CG (Conjugate Gradient), MG (3D Multi-Grid Poisson solver) and FT (Fast Fourier Transform, 3D). These kernels exist with different levels of manual optimization; we used both the non-optimized version and the manually privatized version, in which shared addresses are optimized away. Three different results are presented on the graphs. No Manual Opts uses the non-hand-optimized NPB kernels with the unmodified Berkeley compiler and all compiler optimizations enabled. Manual Privatization uses the manually optimized NPB kernels, in which the shared pointers have been replaced by normal C pointers; it is again compiled with the original, unmodified compiler and all optimizations enabled. Finally, with HW support uses the hardware support with our prototype compiler on the non-hand-optimized NPB kernels.
In Figure 1, it can be seen that the hardware support brings a speedup of 2.6x to 5.5x on unmodified code. The proposed hardware support delivers performance close to that of the manually optimized version, surpassing it by up to 17% on the FT kernel.
CONCLUSIONS
This work focuses on what we believe is presently the biggest impediment of PGAS languages: the manipulation of shared addresses, which creates an important performance penalty even for local accesses. We proposed the addition of novel hardware support for PGAS in the form of new instructions that are easily exploited by compilers. Testing and benchmarking were conducted using the Gem5 full-system simulator and the well-accepted NAS Parallel Benchmark suite written in UPC. The results were consistently comparable to those obtained from hand-tuned code, which demonstrates both the performance and the productivity of this approach. For example, unmodified code with the proposed hardware support achieved up to a 5.5x speedup compared to the same code running without the hardware support but with full compiler optimizations.
