Hardware Support for Address Mapping in PGAS Languages; a UPC Case Study by Serres, Olivier et al.
Olivier Serres ∗1, Abdullah Kayi †2, Ahmad Anbar ‡1, and Tarek El-Ghazawi §1
1 ECE Department, The George Washington University, Washington DC, USA,
2 Intel, Hillsboro, OR, USA,
∗serres@gwu.edu
†abdullah.kayi@intel.com
‡anbar@gwu.edu
§tarek@gwu.edu
Abstract
The Partitioned Global Address Space (PGAS) programming model strikes a balance between the locality-aware but explicit message-passing model (e.g. MPI) and the easy-to-use but locality-agnostic shared memory model (e.g. OpenMP). However, the rich PGAS memory model comes at a performance cost that can hinder its potential for scalability and performance. To contain this overhead and achieve full performance, compiler optimizations may not be sufficient, and manual optimizations are typically added. This, however, can severely limit the productivity advantage. Such optimizations usually target the address translation overheads of shared data structures. This paper proposes hardware architectural support for PGAS that allows the processor to handle shared addresses efficiently. This eliminates the need for such hand-tuning while maintaining the performance and productivity of PGAS languages. We propose to expose this hardware support to compilers by introducing new instructions to efficiently access and traverse the PGAS memory space. A prototype compiler is realized by extending the Berkeley Unified Parallel C (UPC) compiler. It allows unmodified code to use the new instructions without user intervention, thereby creating a truly productive programming environment. Two implementations of the system are realized: the first uses the full-system simulator Gem5, which allows the performance gain to be evaluated; the second uses the Leon3 softcore processor on an FPGA to verify implementability and to parameterize the cost of the new hardware and its instructions. The new instructions show promising results for the NAS Parallel Benchmarks implemented in UPC. A speedup of up to 5.5x is demonstrated for unmodified and unoptimized codes. Unoptimized code using this hardware was also shown to surpass the performance of manually optimized code by up to 10%.
arXiv:1309.2328v1 [cs.DC] 9 Sep 2013
1 Introduction
In parallel programming models, there has always been a trade-off between programmability and performance. For instance, the two widely accepted parallel programming models, shared memory and message passing, each have their own advantages and disadvantages. The shared memory model gives the programmer an easy-to-program shared view of the memory that allows one-sided communication. At the same time, the shared memory model has no notion of data locality, which can lead to severe performance degradation. This degradation is attributed to tasks generally working on remote data, as there is no way to tell, from the programmer's perspective, which data are local and which are remote to each task. On the other hand, the message passing model allows programmers to fully exploit data locality to achieve high performance, at the cost of productivity. As this model allows only explicit two-sided communication, it is the programmer's responsibility to explicitly express the data movements at both the sender and receiver tasks.
The Partitioned Global Address Space (PGAS) programming model captures the best of these earlier models. It has shown great potential for scalability and performance, as it furnishes a partitioned memory view that allows the programmer to exploit data locality. At the same time, it maintains the shared view of the memory and thus the productivity advantage. However, this partitioned global view of the memory entails a more complex addressing mode to map the programmer's view of the memory to its actual physical layout. This mismatch necessitates two major elements: (1) a new representation of shared addresses in the global view, and (2) a translation between the global representation and the virtual address representation. For example, in the Unified Parallel C (UPC) language, three different fields are necessary to represent a shared address. Translating this representation to a regular memory address creates a significant overhead within the runtime. This overhead pushes users to manually optimize their codes, which increases programming complexity and clearly reduces the productivity advantage. Manual optimizations to improve performance usually include complex pointers and MPI-like messaging, which severely degrades PGAS programming productivity.
To solve this issue, we propose a hardware support mechanism that handles complex PGAS address mapping tasks through newly introduced instructions. This eliminates the need for manual tuning of the code to work around this specific problem. A PGAS compiler can use such instructions to efficiently translate from the shared address representation to the actual memory address representation on the target machine. The proposed hardware support aims to close the performance gap between shared pointer addressing and private pointer addressing with modest hardware changes. To verify the performance improvements and feasibility of the proposed hardware mechanism, we conducted full-system simulation as well as FPGA prototyping.
Figure 1: UPC memory model
Figure 2: UPC memory layout for arrayA; some shared pointers are also represented
The rest of this paper is organized as follows. Section 2 presents the UPC language and particularly its memory model. Section 3 reviews the related work. Section 4 discusses the proposed PGAS hardware support. Section 5 describes our prototype implementations: one uses the Gem5 full-system simulator to obtain performance results with up to 64 cores; the other implements the hardware support in an FPGA along with a softcore processor, allowing us to evaluate the feasibility and the chip area needed for such a design. Section 6 presents and discusses the results. Finally, Section 7 concludes the paper and considers future work.
2 Unified Parallel C
Unified Parallel C (UPC) is a parallel extension of the ISO C99 programming language that implements the PGAS model [20]. It follows an SPMD execution model in which a specified number of threads execute in parallel. UPC realizes the PGAS memory model by providing a shared memory view across the system that can be accessed by any thread, each thread having affinity to the part of the shared memory residing locally. In addition, each thread has access to a private space that is accessible only by the thread itself. The private space has low overheads and allows for the best performance. Figure 1 provides an overview of the UPC memory model. The language also provides the usual facilities for parallel programming: locks, memory barriers, collective operations, etc. Accesses to either the shared or the private memory space are syntactically identical and are done through simple variable accesses or assignments.
The distribution of shared data across the threads is controlled by a block size that the user specifies for a given array: elements are distributed in blocks of that size in round-robin fashion. The blocking factor thus gives programmers a mechanism to control the data distribution in the shared space. For example, Figure 2 presents how the elements of the following array are distributed across 4 UPC threads. Each thread has its own contiguous address space starting at a base address.
shared [4] int arrayA[32];
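The blocked round-robin distribution described above can be sketched in plain C. This is only an illustration: the names elem_location and locate_element are hypothetical, and the sketch assumes that each thread stores its blocks back to back in its local portion of the array.

```c
#include <stddef.h>

/* Where element i of a blocked, round-robin distributed array lives. */
typedef struct {
    size_t thread;      /* thread with affinity to the element */
    size_t phase;       /* position of the element inside its block */
    size_t local_index; /* element index within that thread's local portion */
} elem_location;

elem_location locate_element(size_t i, size_t blocksize, size_t numthreads)
{
    size_t block = i / blocksize;    /* global block number */
    elem_location loc;
    loc.thread = block % numthreads; /* blocks are dealt out round robin */
    loc.phase = i % blocksize;
    /* Assumption: a thread's blocks are contiguous in its local space. */
    loc.local_index = (block / numthreads) * blocksize + loc.phase;
    return loc;
}
```

For arrayA with a block size of 4 on 4 threads, locate_element(5, 4, 4) reports thread 1, phase 1 and local index 1.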
In order to address such arrays, a UPC shared pointer can be used. Shared pointers are similar to C pointers but are able to traverse shared arrays in their normal ordering. They effectively provide a mapping from the logical array order to the actual physical location of the data in the system across the whole shared space.
The shared pointer fields make it possible to perform the mapping between the shared space and the physical distribution. A shared pointer is usually composed of three elements: thread, the thread affinity of the pointed-to data; virtual address, the address of the current element in the local space; and phase, the position inside the current block. This allows the programmer to traverse the array in a logical way. Current implementations of UPC usually use 64 bits to represent a shared pointer. Even if a compiler uses its own internal representation, the UPC specification [20] provides the following functions to access the fields indirectly: upc_threadof, upc_phaseof, upc_addrfield and upc_resetphase. Figure 2 also presents a few examples of shared pointers (ptrA, ptrB and ptrC).
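As an illustration of how three fields can share a 64-bit word, the following C sketch packs and unpacks a shared pointer. The field widths (10 bits of thread, 16 bits of phase, 38 bits of address) and all names are assumptions chosen for the example; the text does not specify the layout used by any particular compiler.

```c
#include <stdint.h>

/* Hypothetical field widths; they only need to sum to 64. */
enum { THREAD_BITS = 10, PHASE_BITS = 16, VA_BITS = 38 };

typedef uint64_t packed_shptr;

#define FIELD_MASK(bits) ((UINT64_C(1) << (bits)) - 1)

/* Pack the three fields, thread in the top bits and address in the low bits. */
packed_shptr shptr_pack(uint64_t thread, uint64_t phase, uint64_t va)
{
    return ((thread & FIELD_MASK(THREAD_BITS)) << (PHASE_BITS + VA_BITS))
         | ((phase & FIELD_MASK(PHASE_BITS)) << VA_BITS)
         | (va & FIELD_MASK(VA_BITS));
}

uint64_t shptr_thread(packed_shptr p)
{
    return p >> (PHASE_BITS + VA_BITS);
}

uint64_t shptr_phase(packed_shptr p)
{
    return (p >> VA_BITS) & FIELD_MASK(PHASE_BITS);
}

uint64_t shptr_va(packed_shptr p)
{
    return p & FIELD_MASK(VA_BITS);
}
```

Keeping the whole pointer in one 64-bit word is what lets it live in a single general-purpose register on a 64-bit machine.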
3 Related Work
Previous studies have analyzed both the productivity
advantage and the performance of PGAS languages
under different levels of manual code optimizations
and on a variety of systems. Many of those studies
have shown that the shared pointer arithmetic and
the address translation are the main performance im-
pediments in UPC codes.
In [6], Cantonnet et al. demonstrated a clear productivity advantage when using UPC compared to MPI. Their experiments have shown that UPC consistently improves over MPI in terms of number of lines of code, number of characters, and conceptual effort to write the same program. Ebcioglu et al. [10] performed a 4.5-day study on 27 subjects to compare the productivity of parallel programming languages. In this study, they compared the time to reach correct output when using several parallel languages, including C+MPI and UPC. The study showed that the use of a PGAS language can improve productivity.
Along with the productivity studies, many efforts have evaluated the performance achievable using UPC. In [22], the authors demonstrated that hand-tuned UPC code can achieve performance comparable to, and sometimes even better than, MPI code. [23] evaluates the performance of different UPC compilers on 3 different machines: a Linux x86 cluster, an AlphaServer SC and a Cray T3E.
In [12], El-Ghazawi et al. clearly demonstrated the overhead of the PGAS shared memory model. They proposed a framework to assess the capabilities of compilers and runtime systems to optimize such overheads. To address those issues, different compiler optimizations have been researched, including techniques such as lookup tables [5, 18]; the reduced overhead is still significant, and these methods can use a substantial amount of memory. Alternative representations for shared pointers have also been implemented; for example, phaseless pointers are used for shared addresses with a block size of 1 or infinity [8]. This is only applicable to a few cases and still presents a significant overhead.
Multiple systems have implemented hardware support for shared memory across a system. For example, the T3D supercomputer used a 'Support Circuitry' chip located between the processor and the local memory [1]; this chip, on top of providing functionality like message passing and synchronization, allowed the processor to access any memory location across the machine. A network engine especially designed for PGAS languages is proposed in [14]: it allows network communication between nodes by mapping other nodes' memory spaces across the network and provides a relaxed memory consistency model best suited for PGAS. Results were only presented in terms of read/write throughput and transaction rates, as no PGAS applications or benchmarks were tested. This approach is complementary to our work, as it focuses on the network interface for PGAS languages while this paper focuses on shared space addressing. Combining efficient addressing with an efficient network interface would provide very efficient support for PGAS; this is noted as future work.
More interestingly, the T3E [17, 16, 7] improved on that by providing E-registers and 'centrifuge' hardware that performs some of the mapping for arrays using 4 registers (index, mask, base address, stride and addend). This provides good support for the data layout of PGAS languages at the level of the network interface. However, this approach has multiple drawbacks that need to be addressed: it uses a great number of registers; the registers are memory-mapped and hence relatively slow to access; and the hardware is outside of the processor chip, close to the network interface, making it useless for improving the performance of the very frequent local accesses.
input : blocksize, elemsize, increment, numthreads, shptr
output: nshptr
phinc = shptr.phase + increment
thinc = phinc / blocksize
nshptr.phase = phinc % blocksize
blockinc = (shptr.thread + thinc) / numthreads
nshptr.thread = (shptr.thread + thinc) % numthreads
eaddrinc = (nshptr.phase - shptr.phase) + blockinc * blocksize
nshptr.va = shptr.va + eaddrinc * elemsize
Algorithm 1: Shared pointer incrementation
4 PGAS hardware support
In this section, we discuss the overheads of the
current approach, the general principles behind the
hardware support and how the hardware is made
available to compilers by extending the instruction
set.
4.1 PGAS Memory Model Overheads
Shared pointer manipulations are currently performed in software. Incrementing a shared pointer consists of updating the three fields of the pointer to point to a different element in the shared array. This is done using an algorithm similar to the one presented in Algorithm 1, which increments shptr to the new pointer nshptr. This is a particularly complex operation involving additions, subtractions, multiplications and divisions. It requires temporary registers, which can increase register spilling. Note that incrementing a pointer also requires extra information: the array block size, the element size (e.g. 4 bytes for int) and the number of running threads.
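A direct C transcription of Algorithm 1 makes the cost of the software path concrete. The struct layout and the name shptr_increment are illustrative assumptions, the va field is taken to be a byte offset in the thread's local space, and the sketch covers non-negative increments only.

```c
#include <stdint.h>

/* Unpacked shared pointer with the three fields named in Algorithm 1. */
typedef struct {
    uint32_t thread; /* thread affinity */
    uint32_t phase;  /* position inside the current block */
    uint64_t va;     /* address of the element in the local space */
} shptr_t;

shptr_t shptr_increment(shptr_t shptr, uint64_t increment,
                        uint32_t blocksize, uint32_t elemsize,
                        uint32_t numthreads)
{
    shptr_t nshptr;
    uint64_t phinc = shptr.phase + increment;
    uint64_t thinc = phinc / blocksize;  /* number of blocks crossed */
    nshptr.phase = (uint32_t)(phinc % blocksize);
    uint64_t blockinc = (shptr.thread + thinc) / numthreads;
    nshptr.thread = (uint32_t)((shptr.thread + thinc) % numthreads);
    /* The element delta may be negative when the phase wraps around. */
    int64_t eaddrinc = ((int64_t)nshptr.phase - (int64_t)shptr.phase)
                     + (int64_t)(blockinc * blocksize);
    /* Unsigned addition wraps, so a negative delta moves va backwards. */
    nshptr.va = shptr.va + (uint64_t)(eaddrinc * (int64_t)elemsize);
    return nshptr;
}
```

Two divisions, two modulo operations and two multiplications are executed for every increment, which is the overhead the rest of this section sets out to remove.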
The shared pointer manipulations may be optimized by compilers in some cases, using methods such as a simpler representation for some kinds of pointers or optimizing the shared pointer manipulation away when accessing the local space. However, this is not always feasible, due to the complexity of the compiled code or to dependencies on other results or functions from a different compilation unit.
When an element is accessed, the shared pointer needs to be converted to a virtual address that the processor can manipulate. This is not as compute-intensive, but it still requires looking up the base address of the thread pointed to and performing an addition. This can greatly increase the time required to access an element.
4.2 PGAS Memory Model Hardware
Support
The hardware support proposed here adds specific support for shared pointers. This makes it possible to close the performance gap between shared space addressing and private space addressing. At least two types of operations need to be optimized to address the shared space efficiently: incrementing pointers, which allows arrays to be traversed, and translating shared addresses to the final physical address, which allows reading and writing. Other operations, like testing the locality of a shared pointer (checking whether it points to local data, which can be used to quickly call a communication subroutine if the data is off-node), can also benefit from the same hardware support.
Algorithm 1 can be pipelined fairly well for a hardware implementation. This is especially true in the most common case, where numthreads, blocksize and elemsize are powers of 2. This assumption is used in our implementations, allowing the divisions and multiplications to be replaced with simple shifting and masking. At the same time, it does not restrict the user, as the compiler can fall back on the software implementation for the cases not supported by the hardware.
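The effect of the power-of-2 assumption can be sketched in C by rewriting Algorithm 1 with shifts and masks only. This is a software model of the idea, not the hardware design itself; the struct, the function name and the convention of passing base-2 logarithms of the parameters are assumptions made for the example.

```c
#include <stdint.h>

typedef struct {
    uint32_t thread; /* thread affinity */
    uint32_t phase;  /* position inside the current block */
    uint64_t va;     /* address of the element in the local space */
} shptr_t;

/* Algorithm 1 for power-of-2 parameters, given as their base-2 logarithms. */
shptr_t shptr_increment_pow2(shptr_t shptr, uint64_t increment,
                             unsigned log2_blocksize,
                             unsigned log2_elemsize,
                             unsigned log2_numthreads)
{
    uint64_t block_mask = (UINT64_C(1) << log2_blocksize) - 1;
    uint64_t thread_mask = (UINT64_C(1) << log2_numthreads) - 1;
    shptr_t nshptr;

    uint64_t phinc = shptr.phase + increment;
    uint64_t thinc = phinc >> log2_blocksize;     /* was: / blocksize */
    nshptr.phase = (uint32_t)(phinc & block_mask); /* was: % blocksize */
    uint64_t tsum = shptr.thread + thinc;
    uint64_t blockinc = tsum >> log2_numthreads;   /* was: / numthreads */
    nshptr.thread = (uint32_t)(tsum & thread_mask); /* was: % numthreads */
    /* Unsigned arithmetic wraps, so a negative phase delta still works. */
    uint64_t eaddrinc = (uint64_t)nshptr.phase - (uint64_t)shptr.phase
                      + (blockinc << log2_blocksize);
    nshptr.va = shptr.va + (eaddrinc << log2_elemsize);
    return nshptr;
}
```

Every remaining operation is an addition, a subtraction, a shift or a mask, which is why the computation maps well onto a short hardware pipeline.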
When using a shared pointer to access data, the
physical location should be computed. This is done
by adding the base address of the thread specified
by the pointer to the virtual address contained in
the pointer. The system virtual address will then
be transformed to the final physical address using
the conventional translation lookaside buffer (TLB)
hardware.
At least two implementations are possible for the address translation: the thread address spaces can start at regular intervals, allowing the base address to be computed directly from the thread number (similarly to [14]), or a lookup table can be used to retrieve the base address. The first method is more restrictive in the addresses that can be used but more scalable, as it does not require storing a table of base addresses. We used the second one in our prototype implementation for simplicity.
For example, for ptrC of Figure 2, the system
virtual address of the element would be comput-
ed by retrieving the base address of thread 1 and
adding the virtual address from the shared pointer:
0xff0b00000000 + 0x3f00 = 0xff0b00003f00.
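Both translation schemes reduce to a single addition once the base address is known, as the following C sketch shows. The function names are hypothetical; the only values taken from the text are the base address of thread 1 and the offset 0x3f00 of the ptrC example.

```c
#include <stdint.h>

/* Scheme 2 (used in the prototype): base addresses come from a table. */
uint64_t translate_lookup(const uint64_t *base_table,
                          uint32_t thread, uint64_t va)
{
    return base_table[thread] + va;
}

/* Scheme 1: thread spaces start at regular intervals, so the base
 * address is computed from the thread number and no table is stored. */
uint64_t translate_regular(uint64_t first_base, uint64_t interval,
                           uint32_t thread, uint64_t va)
{
    return first_base + (uint64_t)thread * interval + va;
}
```

With a table whose entry for thread 1 holds 0xff0b00000000, translate_lookup(table, 1, 0x3f00) yields 0xff0b00003f00, matching the ptrC example. The result is still a system virtual address, which the TLB then maps to a physical address.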
4.3 Instruction Set Extension
The new hardware can be exposed to compilers through new instructions: instructions to manipulate pointers and instructions to load/store using a shared pointer as an address. Shared addresses can be stored in the normal processor registers; the other needed information, like the block size or the element size, can be directly encoded in the instructions. The increment value can be an immediate value or come from a register. In the following implementations, we also used a special register to store the number of UPC threads of the currently running program.
5 Experimental Setup
Our experimental setup is composed of two parts: one uses full-system simulation to evaluate the performance characteristics of the hardware support; the other is a hardware implementation using FPGAs, allowing us to study the implementation details and evaluate the chip area needed for such an extension.
5.1 Full System Simulation
In order to simulate the extended instruction set, the Gem5 simulator [3] was used. Gem5 has multiple advantages for this work: it supports a great variety of architectures, cache configurations and CPU models, allowing trade-offs between simulation speed and accuracy. Gem5 has also been shown to be an accurate simulator [4].
Table 1: Instructions Added to the Alpha ISA
Shared Address Loads
  pgas_ldbu         Load Byte Unsigned (8 bits)
  pgas_ldwu         Load Word Unsigned (16 bits)
  pgas_ldl          Load Long Unsigned (32 bits)
  pgas_ldq          Load Quad Unsigned (64 bits)
  pgas_lds          Load S float (32 bits, float)
  pgas_ldt          Load T float (64 bits, double)
Shared Address Stores
  pgas_stb          Store Byte Unsigned (8 bits)
  pgas_stw          Store Word Unsigned (16 bits)
  pgas_stl          Store Long Unsigned (32 bits)
  pgas_stq          Store Quad Unsigned (64 bits)
  pgas_sts          Store S float (32 bits, float)
  pgas_stt          Store T float (64 bits, double)
Shared Address Incrementations
  pgas_inc_imm      Address increment, immediate
  pgas_inc_reg      Address increment, register
Initialization
  set_threads       Initialize the 'threads' register
  set_base_address  Set the base address look-up table
Figure 3: New Alpha Instructions Format
We selected the Alpha architecture as it supports full-system simulation of up to 64 cores with GNU/Linux. Gem5 simulates 64-bit Alpha 21264 processors with the BWX, CIX, FIX and MVI extensions. It provides 32 integer registers (R0-R31) and 32 floating point registers (F0-F31). We used the custom BigTsunami architecture in order to support up to 64 cores. The Linux kernel version 2.6.27.62 patched for BigTsunami is used; programs are compiled with the Berkeley UPC 2.14.2 compiler and cross-compiled for Alpha using GCC version 4.3.2.
The Classic Gem5 memory model is used; each core is configured with a 32 kB L1 code and data cache, and a shared L2 cache of 4 MB. The frequency is set at 2 GHz.
The Alpha instruction set is extended with the in-
structions shown in Table 1. The instruction format
is presented in Figure 3. The integer registers (R0-
R31) are also used to store the shared addresses.
For loads and stores, RA and RB represent the source
and destination registers. Opcode is a free opcode
from the Alpha instruction set. Func defines which
type of load or store is going to be performed. Short
disp is a displacement added to the resulting virtual
address after the shared pointer has been translated
to the system virtual address. Short disp is partic-
ularly useful in order to access different members of
a data structure.
For the shared address incrementation instructions, RA and RC represent the source and destination registers. RB is used in the register version of the instruction and specifies the increment register. Any increment value can be used when using a register. Esize, Bsize and Increm are 5-bit encoded immediate values for the element size, block size and the increment; they can represent any 32-bit value in which only one bit is set (1, 2, 4, 8, ...).
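One natural way to fit any single-bit 32-bit value into 5 bits is to store the position of the set bit. The C sketch below assumes exactly that encoding (the field holds the base-2 logarithm of the value); the text does not spell out the bit-level encoding, and the function names are hypothetical.

```c
#include <stdint.h>

/* Decode a 5-bit field into the 32-bit value with that single bit set. */
uint32_t imm5_decode(unsigned field)
{
    return UINT32_C(1) << (field & 31u);
}

/* Encode a value as a 5-bit field, or return -1 when the value is not
 * a power of two and therefore cannot be expressed as an immediate. */
int imm5_encode(uint32_t value)
{
    if (value == 0 || (value & (value - 1)) != 0)
        return -1;
    int field = 0;
    while ((value >>= 1) != 0)
        field++;
    return field;
}
```

Under this assumption an element size of 4 bytes is encoded as the field value 2, and a failed encoding is the cue for the compiler to use the register form or the software path instead.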
Figure 4: 4 core Leon3 SMP with PGAS support
Table 2: Leon3 configuration
  Cores      4x SPARC cores (SMP)
  Features   2-cycle multiplier, branch prediction
  Cache      Cache coherent
  L1 I       2 sets, 8 kB/set, 32 bytes/line, LRU
  L1 D       4 sets, 4 kB/set, 16 bytes/line, LRU
  FPU        Not implemented
  Bus        AMBA AHB with fast snooping
  Memory     Xilinx MIG-3.7 DDR3-800
  Frequency  75 MHz
  OS         GNU/Linux, Linux version 2.6.36
In order to maintain the productivity advantage of UPC, the new instructions need to be usable without any user intervention. A prototype compiler was realized based on the Berkeley UPC source-to-source compiler. The first step was to disable the phase-less pointer optimization, as this optimization generates an incompatible shared pointer format and is not required when the hardware support is present. The second step was to replace the shared pointer operations amenable to hardware with asm() statements making use of the new instructions. This is not always possible, for example with block sizes that are not powers of two; in such cases, the normal software address incrementation is used. Some simple optimizations are also performed. For example, a shared address incrementation with only two bits set in the increment is performed via two immediates: to increment a pointer by 3, an incrementation by 1 is done, followed by an incrementation by 2. The C code generated by the source-to-source compiler is then compiled with the GCC compiler. The assembler was also modified in order to recognize the new instructions.
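The two-immediate optimization amounts to splitting a constant increment into its set bits. A minimal C sketch of that decision follows; the function name and the return convention are hypothetical, and using -1 to mean "use the register form instead" is an assumption about how a compiler might handle increments with more than two bits set.

```c
#include <stdint.h>

/* Split a constant increment into at most two single-bit steps.
 * Returns the number of steps written to steps[], or -1 when more than
 * two bits are set and immediates alone are not enough. */
int split_increment(uint32_t increment, uint32_t steps[2])
{
    int count = 0;
    while (increment != 0) {
        uint32_t low = increment & (~increment + 1u); /* lowest set bit */
        if (count == 2)
            return -1;
        steps[count++] = low;
        increment &= increment - 1; /* clear the lowest set bit */
    }
    return count;
}
```

An increment of 3 splits into the steps 1 and 2, reproducing the example in the text, while an increment of 7 returns -1.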
Figure 5: The 7-stage Leon3 pipeline extended with PGAS support
Table 3: PGAS Hardware Support SPARC V8 ISA extension
Coprocessor Load/Store
  LDC             Load to Coproc. reg. (32 bits)
  STC             Store from Coproc. reg. (32 bits)
Shared Address Load/Store
  LDCM            Load Long (32 bits)
  STCM            Store Long (32 bits)
Branch
  CB123           Branch on locality
Shared Address Incrementation
  SH_ADD_INC      Immediate
  SH_ADD_INC_REG  Register
5.2 Hardware Based Implementation
In order to evaluate the implementability of the proposed solution on real hardware, we realized a prototype on an FPGA. For that, a softcore processor (Leon3) was extended with hardware support for PGAS.
The Leon3 softcore processor implements the 32-bit SPARC V8 architecture with a 7-stage pipeline. It has the advantage of supporting cache-coherent SMP systems, allowing it to run the full GNU/Linux operating system. The VHDL source code of the base Leon3 is available under the GNU General Public License (GPL), allowing for modification. The Leon softcore processor has already been used to study various specific hardware support possibilities [15, 9].
We extended the processor with hardware support for PGAS shared addresses by using the SPARC V8 instructions reserved for a coprocessor. More information about extending the Leon3 softcore processor via the coprocessor interface can be found in [19].
The coprocessor instructions are fully integrated with the main processor pipeline: the instructions are fetched by the main pipeline, and the coprocessor instruction execution is synchronized with the main pipeline. A register file is introduced for storing the shared address pointers, as those are 64 bits wide; it is similar in all respects to the register file used for floating point support in Leon3. It enables reading two 64-bit values and writing one 64-bit value per clock cycle. The extra register file would not be needed on 64-bit architectures, as is the case for our Gem5 Alpha implementation.
Figure 5 presents the Leon3 pipeline extended with hardware support for shared pointers. The address incrementation is fully pipelined over two stages, allowing one address translation to be performed per clock cycle. It also generates a coprocessor condition code based on the locality of the incremented address. Four condition codes are possible: 0: local (the pointed-to data is owned by the current thread), 1: located on the same memory controller, 2: accessible by the load/store from shared instructions, 3: located on another node. The Coprocessor Branch (CB) instruction allows branching based on any combination of the condition codes. Loads and stores from shared addresses (LDCM, STCM) are performed as fast as the normal SPARC load and store instructions.
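Branching on "any combination" of the four condition codes can be modelled as a 4-bit selection mask with one bit per code. The C sketch below is an illustration of that idea only: the enum names, the mask representation and cb_taken are hypothetical and do not reproduce the actual SPARC V8 instruction encoding.

```c
/* The four locality condition codes listed in the text. */
typedef enum {
    LOC_LOCAL = 0,           /* data owned by the current thread */
    LOC_SAME_CONTROLLER = 1, /* located on the same memory controller */
    LOC_SHARED_LDST = 2,     /* reachable by the shared load/store */
    LOC_OTHER_NODE = 3       /* located on another node */
} locality_cc;

/* One selection bit per condition code. */
#define CB_SELECT(cc) (1u << (cc))

/* Mask corresponding to the CB123 mnemonic: codes 1, 2 or 3. */
#define CB123_MASK (CB_SELECT(1) | CB_SELECT(2) | CB_SELECT(3))

/* Returns 1 when a branch with this mask is taken for condition code cc. */
int cb_taken(unsigned mask, locality_cc cc)
{
    return (int)((mask >> (unsigned)cc) & 1u);
}
```

With this model, a CB123 branch is taken for every non-local pointer, so the fall-through path can handle the frequent local case without further checks.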
The design was implemented on a Virtex-6 FPGA ML605 Evaluation board. The ML605 is based on a Virtex-6 XC6VLX240T-1FFG1156 FPGA. The logic synthesis, place and route for the design were performed using Xilinx ISE Release 13.4. The final design runs at a frequency of 75 MHz.
6 Results
In this section, the results from the simulation and
the hardware implementation are presented.
6.1 Simulation Results
Gem5 provides different CPU models, which cover different architecture implementations and provide a trade-off between speed and accuracy of the simulation. In this work, we used 3 different CPU models: atomic, which is a single Instruction per Clock (IPC) model; timing, which adds the simulation of the cache hierarchy; and detailed (also called O3), which simulates a 7-stage, out-of-order CPU pipeline.
In order to evaluate the PGAS hardware support, five kernels from the NAS Parallel Benchmarks [2], implemented with UPC [11, 21], were used:
EP - Embarrassingly Parallel: generates pairs of
Gaussian random variates.
IS - Integer Sort: Bucket sort of small integers.
CG - Conjugate Gradient: approximately computes
the small eigenvalues of a symmetric positive ma-
trix. Exhibits long-range communication.
MG - Multi-Grid: 3D Poisson equation solving us-
ing a V-cycle multigrid method.
FT - Fast Fourier Transform: Solve a partial differ-
ential equation with a discrete 3D Fast Fourier
Transform (FFT).
These benchmarks were implemented with different
levels of handmade optimizations; we used both the
non-optimized version and the manually privatized
Figure 6: Gem5 atomic model: NAS Parallel Benchmark - EP class W. Panels: (a) performance normalized to the code without manual optimization; (b) execution time in seconds. X-axis: number of threads (1 to 64). Series: without manual optimizations but with HW support; without manual optimizations.
Figure 7: Gem5 atomic model: NAS Parallel Benchmark - CG class W. Panels: (a) performance normalized to the code without manual optimization; (b) execution time in seconds. X-axis: number of threads (1 to 64). Series: without manual optimizations but with HW support; without manual optimizations; manual privatization.
Figure 8: Gem5 atomic model: NAS Parallel Benchmark - FT class W. Panels: (a) performance normalized to the code without manual optimization; (b) execution time in seconds. X-axis: number of threads (1 to 16). Series: without manual optimizations but with HW support; without manual optimizations; manual privatization.
Figure 9: Gem5 atomic model: NAS Parallel Benchmark - IS class W. Panels: (a) performance normalized to the code without manual optimization; (b) execution time in seconds. X-axis: number of threads (1 to 64). Series: without manual optimizations but with HW support; without manual optimizations; manual privatization.
Figure 10: Gem5 atomic model: NAS Parallel Benchmark - MG class W. Panels: (a) performance normalized to the code without manual optimization; (b) execution time in seconds. X-axis: number of threads (1 to 64). Series: without manual optimizations but with HW support; without manual optimizations; manual privatization.
Figure 11: Gem5: NAS Parallel Benchmark - CG class W. Panels: (a) Timing - Improvement; (b) Timing - Execution Time; (c) Detailed - Improvement; (d) Detailed - Execution Time. Improvement panels show performance normalized to the code without manual optimizations; execution time is in seconds. X-axis: number of threads (1 to 16 for timing, 1 to 4 for detailed). Series: with HW support; no manual opts; manual privatization.
Figure 12: Gem5: NAS Parallel Benchmark - FT class W. Panels: (a) Timing - Improvement; (b) Timing - Execution Time; (c) Detailed - Improvement; (d) Execution Time. Improvement panels show performance normalized to the code without manual optimizations; execution time is in seconds. X-axis: number of threads (1 to 16 for timing, 1 to 4 for detailed). Series: with HW support; no manual opts; manual privatization.
[Figure 13 plots omitted. Panels: (a) Timing - Improvement and (c) Detailed - Improvement show performance normalized to the code without manual optimizations; (b) Timing - Execution Time and (d) Detailed - Execution Time show time in seconds. x-axis: number of threads (1 to 16 for timing, 1 to 4 for detailed). Series: With HW support; No Manual Opts; Manual Privatization.]
Figure 13: Gem5: NAS Parallel Benchmark - IS class W
[Figure 14 plots omitted. Panels: (a) Timing - Improvement and (c) Detailed - Improvement show performance normalized to the code without manual optimizations; (b) Timing - Execution Time and (d) Detailed - Execution Time show time in seconds. x-axis: number of threads (1 to 16 for timing, 1 to 8 for detailed). Series: With HW support; No Manual Opts; Manual Privatization.]
Figure 14: Gem5: NAS Parallel Benchmark - MG class W
version of the benchmarks. The privatized version was manually optimized to replace UPC shared pointers with private pointers [13]. Because such multi-core simulations take a very long time, only the relatively small W (workstation) class was used.
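To make the manual privatization concrete, the following C sketch simulates a cyclically distributed shared array and contrasts translating every access with translating once and then walking a private pointer. It is only an illustration under stated assumptions: the names (`sim_shared_ptr`, `sim_at`, `sim_translate`), the cyclic layout and the sizes are hypothetical and are not taken from the NPB sources or from the Berkeley UPC runtime. Real UPC code would use shared declarations and a cast to a private pointer instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulation parameters, not taken from the benchmarks. */
enum { SIM_THREADS = 4, SIM_LOCAL_ELEMS = 8 };

/* Simulated partitioned space: one local segment per thread. */
static double segments[SIM_THREADS][SIM_LOCAL_ELEMS];

/* Simulated pointer-to-shared for a cyclic layout (block size 1). */
typedef struct {
    unsigned thread; /* owning thread */
    size_t index;    /* element index inside the owner's segment */
} sim_shared_ptr;

/* Map a global element index to its (thread, local index) pair. */
static sim_shared_ptr sim_at(size_t global_index)
{
    sim_shared_ptr p;
    p.thread = (unsigned)(global_index % SIM_THREADS);
    p.index = global_index / SIM_THREADS;
    return p;
}

/* Address translation that the unoptimized code pays on every access. */
static double *sim_translate(sim_shared_ptr p)
{
    assert(p.thread < SIM_THREADS && p.index < SIM_LOCAL_ELEMS);
    return &segments[p.thread][p.index];
}

/* Unoptimized: each access to a local element is translated again. */
double sum_through_shared(unsigned me)
{
    assert(me < SIM_THREADS);
    double sum = 0.0;
    for (size_t g = me; g < (size_t)SIM_THREADS * SIM_LOCAL_ELEMS; g += SIM_THREADS)
        sum += *sim_translate(sim_at(g));
    return sum;
}

/* Privatized: translate once, then walk the segment with a C pointer. */
double sum_privatized(unsigned me)
{
    assert(me < SIM_THREADS);
    const double *local = sim_translate(sim_at(me));
    double sum = 0.0;
    for (size_t i = 0; i < SIM_LOCAL_ELEMS; i++)
        sum += local[i];
    return sum;
}
```

Both functions return the same sum for a given thread. The privatized one removes the per-element translation from the loop, which is the kind of change the hand-optimized kernels apply and that the hardware support aims to make unnecessary.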
Three different results are presented on the graphs. Without Manual Optimizations uses the non-hand-optimized NPB kernels compiled with the unmodified Berkeley compiler with all compiler optimizations enabled. Manual Privatization uses the manually optimized NPB kernels, in which the shared pointers have been replaced by normal C pointers; it is compiled with the original, unmodified compiler with all optimizations enabled. Finally, Without Manual Optimizations, but with HW support uses the hardware support with our prototype compiler on the non-hand-optimized NPB kernels.
Figures 6-10 present the runs using the atomic model. Because the atomic model of Gem5 is fast, we were able to run the benchmarks with up to 64 cores (the limit of the BigTsunami architecture).
Figure 6 shows that the hardware support does not provide any performance improvement for the EP kernel; this was expected, as EP does not use any shared pointers in its main loops.
For the CG kernel, not all the shared address incrementations were compiled to hardware instructions: the generated code contained 309 shared address incrementations, but 20 of those used a non-power-of-2 element size (the arrays w and w_tmp, with an element size of 56016); those incrementations were implemented in software. All the other shared pointer manipulations, including 236 loads and stores, were implemented using the hardware instructions. The CG kernel (Figure 7) runs 2.6 times faster with the hardware support than the implementation without manual optimizations. With the hardware support, it is also 17% faster than the manually optimized version.
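The role of the power-of-2 restriction can be sketched in C as follows. The `shared_addr` layout (thread, phase, byte offset) and both functions are hypothetical illustrations of a blocked-cyclic increment; they are not the paper's hardware design nor the Berkeley UPC runtime code. The shift/mask variant additionally assumes that the block size and the thread count are powers of two, whereas the text above only states the restriction for the element size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software view of a shared address. */
typedef struct {
    uint64_t thread; /* owning thread */
    uint64_t phase;  /* position inside the current block */
    uint64_t va;     /* byte offset inside the owner's local segment */
} shared_addr;

/* General increment: works for any block size, element size and thread
   count, at the price of divisions, modulos and a multiplication. */
shared_addr inc_general(shared_addr p, uint64_t inc, uint64_t blocksize,
                        uint64_t elemsize, uint64_t threads)
{
    assert(blocksize > 0 && elemsize > 0 && threads > 0 && p.phase < blocksize);
    uint64_t phinc = p.phase + inc;
    uint64_t thinc = p.thread + phinc / blocksize;
    uint64_t new_phase = phinc % blocksize;
    uint64_t blockinc = thinc / threads;
    /* Element displacement inside the local segment (may be negative). */
    int64_t elems = (int64_t)new_phase - (int64_t)p.phase
                  + (int64_t)(blockinc * blocksize);
    p.thread = thinc % threads;
    p.phase = new_phase;
    p.va = (uint64_t)((int64_t)p.va + elems * (int64_t)elemsize);
    return p;
}

/* Returns 1 when x is a power of two, 0 otherwise. */
int is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Shift/mask variant: only valid when all three sizes are powers of two
   (an assumption of this sketch), given here by their base-2 logarithms. */
shared_addr inc_pow2(shared_addr p, uint64_t inc, unsigned log2_blocksize,
                     unsigned log2_elemsize, unsigned log2_threads)
{
    uint64_t phinc = p.phase + inc;
    uint64_t thinc = p.thread + (phinc >> log2_blocksize);
    uint64_t new_phase = phinc & ((1ull << log2_blocksize) - 1);
    uint64_t blockinc = thinc >> log2_threads;
    int64_t elems = (int64_t)new_phase - (int64_t)p.phase
                  + (int64_t)(blockinc << log2_blocksize);
    p.thread = thinc & ((1ull << log2_threads) - 1);
    p.phase = new_phase;
    p.va = (uint64_t)((int64_t)p.va + elems * ((int64_t)1 << log2_elemsize));
    return p;
}
```

An element size of 56016 bytes fails the `is_pow2` test, so a shift-based path like `inc_pow2` cannot be used for it and a general routine in the spirit of `inc_general` is needed instead.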
For the FT kernel, all the shared pointer manipulations were compiled to hardware instructions (79 incrementations, and 47 loads and stores). The FT kernel runs were limited to 16 cores due to the data distribution of the W class. The FT kernel performance (Figure 8) is improved by a factor of 2.3 without the need for manual optimizations. The performance also surpasses that of the manually optimized version by 17%. The hardware support can surpass the performance of manually optimized code because not all the shared pointers were optimized away in that code. Manual optimizations often focus on the inner loops, and it is not always possible to remove all shared pointers (due to complex or random access patterns, for example).
The performance of the non-optimized MG kernel code is improved by 5.5x (see Figure 10), but the code is 10% slower than the manually optimized version. Similarly for IS, the base code performance is improved by 3x, but with HW support the code is still 13% slower than the manually optimized one. This may be due to some missed optimizations during the C code compilation. The asm statements for the PGAS store instructions are marked as volatile and as modifying memory; this prevents the GCC compiler from moving the stores around and also forces it to reload data held in registers, as the memory may have changed, which prevents some optimizations.
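The effect of these asm qualifiers can be sketched with GCC extended asm. In the sketch below, `pgas_store_sketch` is a hypothetical stand-in: its asm template is left empty and an ordinary C assignment performs the store, so that the example runs on any target supported by GCC or Clang. Only the `volatile` qualifier and the `"memory"` clobber correspond to what the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a PGAS store. The empty asm template is an assumption made
   to keep the sketch portable; real code would emit the new store
   instruction there instead of relying on the C assignment. */
static inline void pgas_store_sketch(uint64_t *addr, uint64_t value)
{
    *addr = value;
    /* volatile: the statement may not be removed or reordered with other
       volatile asm; "memory": the compiler must assume memory changed. */
    __asm__ volatile("" : : "r"(addr), "r"(value) : "memory");
}

/* Copies src to dst through the store stand-in and returns the sum. */
uint64_t copy_and_sum(uint64_t *dst, const uint64_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        pgas_store_sketch(&dst[i], src[i]);
        /* Because of the "memory" clobber, the compiler cannot assume that
           src[i] still holds the value it loaded for the store, so it has
           to read it from memory again here. */
        sum += src[i];
    }
    return sum;
}
```

The result is the same as without the clobber. The cost is only in the generated code, which keeps fewer values in registers across each store.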
Figures 11-14 present the results obtained using the timing and detailed models. EP (Embarrassingly Parallel) is not shown, as the results are similar to the atomic case since no shared pointers are used. The timing model adds caches and memory timing simulation. The improvements are less substantial, in proportion, as more time is spent accessing the memory; the single L2 also starts to be a bottleneck with 16 cores. The detailed model introduces an out-of-order processor core. The number of cores presented for the detailed runs is limited, as the simulator running time becomes very long; multiple days are needed for a detailed run.
The detailed model brings more opportunities to reorder the instructions and thereby reduce the software overhead of shared address manipulations. However, our proposed hardware support for PGAS address mapping still provides results that are comparable to or better than the manually optimized codes. In addition, our proposed hardware mechanism does not require any complex manual tuning of the code, keeping all the productivity advantages of the PGAS model intact.
6.2 Hardware Implementation Results
Two micro-benchmarks (vector addition and matrix multiplication) were implemented to verify the functionality and the performance of the hardware design. They were compiled using Berkeley UPC 2.12.1 and GCC 4.4.2 for SPARC with all the optimizations enabled (-O, -opt for BUPC, -O3 for GCC). As the support for the extra register file is not implemented in GCC, the hardware accelerated code was partly written in assembly.
[Figure 15 plots omitted. Panels: (a) Improvement: performance normalized to 'No hand-opts (dyn.)'; (b) Number of cycles. x-axis: number of threads (1, 2, 4). Series: No hand-opts (dyn.); No hand-opts (static); Privatization; With HW support.]
Figure 15: Leon 3 - Vector Addition
[Figure 16 plots omitted. Panels: (a) Improvement: performance normalized to 'No Manual Opts'; (b) Number of cycles. x-axis: number of threads (1, 2, 4). Series: With HW support; No Manual Opts; Privatization 1; Privatization 2.]
Figure 16: Leon 3 - Matrix Multiplication
Table 4: Area cost evaluation for the hardware support
Configuration                           Slice registers  Slice LUTs  BRAM 18kB  BRAM 36kB  DSP48Es
Leon3, 4 cores                          46,718           59,235      106        34         16
Leon3, 4 cores + PGAS hardware support  49,325           62,572      126        34         24
Virtex 6 - XC6VLX240T                   301,440          150,720     832        416        768
Increase                                2,607            3,337       20         0          8
Area increase, % of base                +5.6%            +5.6%       +18.9%                +50.0%
Area % of Virtex 6                      +0.9%            +2.2%       +2.4%                 +1.0%
For the vector addition benchmark, the version of the code without hand-made optimizations was compiled both in dynamic mode (without specifying the number of threads) and in static mode (the number of threads being specified at compile time). In dynamic mode, the compiler is prevented from optimizing the division by threads during the incrementation of a shared address, which reduces the performance. Results are shown in Figure 15. The code compiled in static mode runs 5 times faster than the code compiled in dynamic mode. The code optimized to use private pointers and the code using the hardware support both run 16 times faster (3.5 times faster when compared to the code compiled in static mode). The hardware version does not need to be compiled in static mode, as the special register threads can be set up at runtime, allowing the user to run the same executable with different numbers of threads. The performance improvement gets smaller as the number of threads increases, as vector addition is quickly able to saturate the shared AMBA bus.
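The difference between the static and dynamic compilation modes for the vector addition code can be sketched in C as follows. `STATIC_THREADS` and both functions are hypothetical; they only illustrate why a thread count known at compile time lets the compiler simplify the division, and do not reproduce the code generated by Berkeley UPC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed compile-time thread count, standing in for static mode. */
#define STATIC_THREADS 4u

/* Static mode: the divisor is a constant power of two, so an optimizing
   compiler can replace the division and modulo by a shift and a mask. */
void owner_static(uint64_t block, uint64_t *thread, uint64_t *local_block)
{
    assert(thread != 0 && local_block != 0);
    *thread = block % STATIC_THREADS;
    *local_block = block / STATIC_THREADS;
}

/* Dynamic mode: the divisor is only known at run time, so the compiler
   typically has to emit a real division for every increment. */
void owner_dynamic(uint64_t block, uint64_t threads,
                   uint64_t *thread, uint64_t *local_block)
{
    assert(threads > 0 && thread != 0 && local_block != 0);
    *thread = block % threads;
    *local_block = block / threads;
}
```

Both functions compute the same owner thread and local block index. With the hardware support, the thread count instead sits in a special register set at runtime, so one executable serves every thread count without a software division.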
The matrix multiplication benchmark was compiled in static mode, i.e. the number of threads was set at compile time. Two different levels of manual optimization were performed: the first one uses private pointers to access one of the matrices (privatization 1), and the second one uses a non-standard UPC extension to access all the matrices with private pointers (privatization 2). As seen in Figure 16, the code with hardware support matches the performance of the fully optimized version.
More importantly, the FPGA implementation allowed us to evaluate the chip area needed for the PGAS hardware support. Table 4 presents the FPGA resources used for a 4-core Leon3 SMP system with and without the hardware support. Results are presented both in terms of increase compared to the base Leon3 implementation and as a percentage of the FPGA chip used. The area evaluation is very conservative, as we are comparing against a very simple processor core without floating point support. Also, adding a new register file (built with BRAM elements) is often not necessary, as the normal register file can be used to hold shared addresses on 64-bit architectures, as seen in our Alpha-based simulation with Gem5. The proposed hardware support mechanism for 4 cores utilizes at most about 2.4% of any resource type on the overall FPGA chip.
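The percentages in Table 4 follow directly from the resource counts. The small C sketch below shows the two computations; the function names are hypothetical and the inputs are the values listed in the table.

```c
#include <assert.h>

/* Relative increase of `extended` over `base`, in percent. */
double percent_increase(double base, double extended)
{
    assert(base > 0.0);
    return 100.0 * (extended - base) / base;
}

/* Share of a device resource taken by `amount`, in percent. */
double percent_of_device(double amount, double device_total)
{
    assert(device_total > 0.0);
    return 100.0 * amount / device_total;
}
```

For the slice registers, `percent_increase(46718, 49325)` gives roughly 5.6, and `percent_of_device(2607, 301440)` gives roughly 0.9, matching the corresponding table rows.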
7 Conclusions
The PGAS programming model is known for its productivity; however, its performance can be hindered by the overhead associated with accessing and traversing its memory model. Automatic compiler optimizations may help, but they are not sufficient for competitive performance. Hand tuning of PGAS code can achieve the needed performance levels, but diminishes the productivity advantage. In this work, we proposed the addition of hardware address mapping support for PGAS. It was shown through FPGA prototyping that a processor requires only a minimal increase in chip area to incorporate this hardware. In addition, it was shown that this hardware can be made available to compilers, and used easily by them, through simple extensions to the instruction set. Substantial testing and benchmarking were conducted using Gem5 full system simulation as well as FPGA prototyping. Benchmarking results were based on representative kernels of the well-accepted NAS Parallel Benchmark written in UPC. Due to the very long time needed for such simulations, only the W class was used. In spite of the smaller version of the benchmark, substantial speedup was achieved. Larger benchmarks are expected to provide even better scaling results. The results were consistently comparable to those obtained from hand-tuned code, which demonstrates the power and the productivity of this approach. Un-optimized code using our proposed hardware support achieved a speedup of up to 5.5x compared to the same un-optimized code running without our hardware support.
This work focuses on what we think is presently the biggest impediment of PGAS languages: the manipulation of shared addresses, which creates an important performance penalty even for local accesses. For future work, we will consider hardware solutions that also further improve accesses to remote data across a full system of interconnected nodes. This requires extending the PGAS hardware support to the network interface. We believe that the global solution will be hierarchical to limit the cost of additional hardware, and that the network interface will be able to rely on shared addresses to quickly locate and communicate with other nodes.
Acknowledgment
The authors also wish to acknowledge the Xilinx University Program (XUP) and the Sun Microsystems OpenSPARC University Program for their hardware and software donations, which have been essential to complete this work.
References
[1] Remzi H. Arpaci, David E. Culler, Arvind Krishnamurthy, Steve G. Steinberg, and Katherine Yelick. Empirical evaluation of the CRAY-T3D: A compiler perspective. In ACM SIGARCH Computer Architecture News, volume 23, pages 320–331. ACM, 1995.
[2] David H. Bailey, T. Harris, W. Saphir, R. van der Wijngaart, A. Woo, and M. Yarrow. The NAS parallel benchmarks 2.0. NAS Technical Report NAS-95-020, Moffett Field, CA, USA, 1995.
[3] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
[4] Anastasiia Butko, Rafael Garibotti, Luciano Ost, and Gilles Sassatelli. Accuracy evaluation of gem5 simulator system. In 7th International Workshop on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), pages 1–7. IEEE, 2012.
[5] François Cantonnet, Tarek El-Ghazawi, P. Lorenz, and Jaafar Gaber. Fast address translation techniques for distributed shared memory compilers. In Proceedings of the 19th International Parallel and Distributed Processing Symposium, 2005.
[6] François Cantonnet, Yiyi Yao, M. Zahran, and Tarek El-Ghazawi. Productivity analysis of the UPC language. In Proceeding of the 18th International Parallel and Distributed Processing Symposium, pages 254–, April 2004.
[7] William W. Carlson, Jesse M. Draper, David E. Culler, Kathy Yelick, Eugene Brooks, and Karen Warren. Introduction to UPC and language specification. Technical report, Center for Computing Sciences, Institute for Defense Analyses, 1999.
[8] Wei-Yu Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, and Katherine Yelick. A Performance Analysis of the Berkeley UPC Compiler. In Proceedings of the 17th annual international conference on Supercomputing: June 23-26, volume 4, pages 63–73. Association for Computing Machinery, ACM, 2003.
[9] Martin Danek, Leos Kafka, Lukas Kohout, and Jaroslav Sykora. Instruction set extensions for multi-threading in LEON3. In 2010 IEEE 13th International Symposium on Design and Diagnostics of Electronic Circuits and Systems (DDECS), pages 237–242. IEEE, 2010.
[10] Kemal Ebcioglu, Vivek Sarkar, Tarek El-Ghazawi, and John Urbanic. An experiment in measuring the productivity of three parallel programming languages. In Workshop on Productivity and Performance in High-End Computing (P-PHEC). IEEE, 2006.
[11] Tarek El-Ghazawi and François Cantonnet. UPC performance and potential: A NPB experimental study. In Proceedings of the ACM/IEEE conference on Supercomputing, pages 1–26. IEEE Computer Society Press Los Alamitos, CA, USA, 2002.
[12] Tarek El-Ghazawi, François Cantonnet, Yiyi Yao, Smita Annareddy, and Ahmed S. Mohamed. Benchmarking parallel compilers: a UPC case study. Future Gener. Comput. Syst., 22(7):764–775, 2006.
[13] Tarek El-Ghazawi and Sébastien Chauvin. UPC benchmarking issues. In International Conference on Parallel Processing (ICPP), pages 365–372. IEEE, 2001.
[14] Holger Fröning and Heiner Litz. Efficient hardware support for the Partitioned Global Address Space. In IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–6, April 2010.
[15] Pierre Guironnet De Massas and Paul Amblard. Experiments around SPARC Leon-2 for MPEG encoding. In Proceedings of the International Conference on Mixed Design of Integrated Circuits and System (MIXDES), pages 285–289, June 2006.
[16] Matthias M. Mueller. Efficient address translation. Interner Bericht. Universität Karlsruhe, Fakultät für Informatik; 2000, 12, 2000.
[17] Steven L Scott. Synchronization and communication in the T3E multiprocessor. In ACM SIGPLAN Notices, volume 31, pages 26–36. ACM, 1996.
[18] Olivier Serres, Ahmad Anbar, Saumil G. Merchant, Abdullah Kayi, and Tarek El-Ghazawi. Address translation optimization for Unified Parallel C multi-dimensional arrays. In Proceedings of the 16th International Workshop on High-Level Parallel Programming Models and Supportive Environments, HIPS. IEEE, 2011.
[19] Olivier Serres, Vikram K. Narayana, and Tarek El-Ghazawi. An architecture for reconfigurable multi-core explorations. In Proceedings of the International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2011.
[20] UPC Consortium. UPC language specifications v1.2, May 2005.
[21] UPC NAS Parallel Benchmarks. threads.seas.gwu.edu/sites/npb-upc.
[22] Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, et al. Productivity and performance using partitioned global address space languages. In Proceedings of the international workshop on Parallel symbolic computation, page 32. ACM, 2007.
[23] Zhang Zhang and Steven R. Seidel. Benchmark measurements of current UPC platforms. In Proc. of IPDPS (PMEOPDS Workshop), 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
