Efficient address translation by Mueller, Matthias M.
Ecient Address Translation
Matthias M Muller
Institute for Program Structures and Data Organization
Universitt Karlsruhe Germany
muellermiraukade
Technical Report No 
Abstract The address calculation for distributed data access plays a
major role for the performance of negrained dataparallel applications
This paper reports about the hardware centrifuge of the Cray T	E which
enables the shift of the address calculation from software into hardware
This shift minimizes address calculation overhead reducing communica
tion cost of dynamic communication patterns The centrifuge is com
pared with complex integer division and modulo and with integer mask
and shift operations The measurements show for a onedimensional dy
namic communication pattern for several distributions a runtime advan
tage of T	E
s hardware centrifuge of at least a factor  over integer
division arithmetic But the centrifuge is barely faster compared with
integer mask and shift operations
1 Introduction
The address calculation for distributed data accesses plays a major role in
fine-grained data-parallel applications. Many data distributions have been
proposed for different purposes, and all of them come with a more or less
complex calculation scheme. First of all, the choice of a particular data
distribution depends on the intended work distribution among the processors.
But if the location of a data element cannot be determined efficiently, all
the intended benefits of a data distribution are lost. Thus, the calculation
of data distribution information has to be fast to be effective. This paper
studies block-cyclic distributions that can be computed with bitwise mask and
shift operations. As these distributions can be processed by the Cray T3E's
hardware centrifuge, the main focus of this paper is the benefit of using
this hardware translation mechanism.
Scott says in [6] about the centrifuge:
The T3E supports the data distribution features of many implicit
programming languages [he cites HPF, CRAFT, Fortran D, and Vienna
Fortran] via an integrated hardware centrifuge.
To my knowledge, nobody has reported its benefits or even referred to its
usage. I compare the address translation mechanism of the hardware centrifuge
with complex division and modulo arithmetic and with bitwise mask and shift
operations. The first comparison shows the impact of time-consuming arithmetic
on the computation time, while the second one shows the advantage of the
hardware over fast integer manipulation.
This paper is part of the work done in the context of the HPF KarHPFn compiler
 	  and latency hiding techniques  
The organization of this paper is as follows. The next section explains the
address translation for special block-cyclic distributions using mask and
shift operations. Thereafter, basic E-register addressing and programming of
the hardware centrifuge are explained. Afterwards, the basic communication
technique used throughout the measurements is described. The results section
shows the performance of the different address translation mechanisms for
BLOCK, CYCLIC, and block-cyclic distributions.
2 Data Distributions
This section explains the structure of block-cyclic distributions with block
size k. The section focuses, as the whole paper does, on distributions where
the number of processors P, the block size k, and the local problem size¹ V
are powers of two. Only then is it possible to calculate the processor number
and the local address with bitwise mask and shift operations.
2.1 Structure of an Address for Block-Cyclic Distributions
A global address for a block-cyclic distributed array with block size k
consists of three fields, see Figure 1.
[Figure: the three fields of a global address, from left to right: Address, PE, Block]
Fig. 1. Composition of a global address for a block-cyclic distribution
The rightmost field indicates the offset within a block of size k. The next
field to the left holds the processor number. The leftmost field contains the
remaining bits of the local address. The sizes of the fields are calculated
using the base-two logarithm: log2(k) bits for the offset and log2(P) bits for
the processor number.
For example the following global address of a blockcyclic distribution with
block size k   and P   processors
  
references the local address    on processor three
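To make the field layout concrete, the following Python sketch splits a global address into its three fields. The block size k = 4, the processor count P = 4, and the 6-bit address 101101 are assumptions chosen purely for illustration and are not the values of the example in the text; split_fields is a hypothetical helper.

```python
# Assumed example values (not from the paper): k = 4 and P = 4.
K_BITS = 2  # log2(k), width of the block offset field
P_BITS = 2  # log2(P), width of the processor number field


def split_fields(g, width):
    """Return the (address, PE, block offset) fields of g as bit strings."""
    bits = format(g, "0{}b".format(width))
    addr_end = width - (P_BITS + K_BITS)   # address field: leftmost bits
    pe_end = width - K_BITS                # PE field: middle bits
    return bits[:addr_end], bits[addr_end:pe_end], bits[pe_end:]


address, pe, offset = split_fields(0b101101, 6)
print(address, pe, offset)
```

Under these assumptions the address 101101 splits into the fields 10, 11, and 01, that is, offset 1 within a block that resides on processor three.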
BLOCK and CYCLIC distributions are special block-cyclic distributions. A
BLOCK distribution has k = V, while a CYCLIC distribution sets k = 1.

¹ The local problem size is also called virtualization.
2.2 Calculation with Mask and Shift Operations
The address translation needs a mask to select those bits which form the
processor number. In our example, the bit field
M    
would be such a mask. The calculation of the processor number is done in two
steps. First, the bits are selected from the global address using the mask.
The second step shifts the selected bits log2(k) bits to the right. Hence, the
processor number PE is calculated from a global address G with the following
Fortran command (a negative shift count makes ISHFT shift to the right):
PE = ISHFT(AND(G, M), -log2(k))
The local address A is formed from the two remaining fields. The first field
consists of the offset within a block and the second one of the address. The
offset is obtained with one mask operation, while the remaining address needs
two shift operations:
A = OR(ISHFT(ISHFT(G, -(log2(P) + log2(k))), log2(k)), AND(G, k - 1))
The binary operations consume only a few processor cycles, and they are
expected to improve communication time compared with integer DIV and MOD
operations, which are done in the floating-point unit of the T3E's Alpha
processors.
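The two software translation schemes can be compared in a short Python sketch. It assumes, as the paper does, that k and P are powers of two; the function names are hypothetical, and Python's shift and bitwise operators stand in for ISHFT, AND, and OR.

```python
def log2_exact(n):
    """Return log2(n) for a power of two n; reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two, got %r" % n)
    return n.bit_length() - 1


def translate_maskshift(g, k, p):
    """Global index g -> (PE, local address) using only mask and shift."""
    k_bits, p_bits = log2_exact(k), log2_exact(p)
    mask = (p - 1) << k_bits                    # selects the PE field
    pe = (g & mask) >> k_bits
    local = ((g >> (p_bits + k_bits)) << k_bits) | (g & (k - 1))
    return pe, local


def translate_divmod(g, k, p):
    """The same translation with integer division and modulo."""
    block = g // k                              # global block number
    return block % p, (block // p) * k + g % k
```

For instance, with the assumed values k = 4 and P = 4, both translate_maskshift(45, 4, 4) and translate_divmod(45, 4, 4) yield processor 3 and local address 9. The two functions agree for every non-negative index as long as k and P are powers of two.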
3 Hardware Centrifuge
This section presents the T3E's hardware centrifuge. E-registers are central
to remote data accesses within the T3E. Thus, the discussion of the centrifuge
starts with an overview of their functionality. Afterwards, programming of the
E-registers is shown, explaining address translation within the centrifuge.
The remaining paragraphs explain its initialization.
 Eregisters
The network interface consists of  user and  system Eregisters memory
mapped into the address space of each processor Eregisters provide the only
means to transfer data between processors Reads and writes between Eregisters
and global memory are called gets and puts To load a global memory content
into the processor a get and a subsequent read of the Eregister has to be
executed The latter operation stalls the processor until the value arrives This
is achieved in hardware using the readiness state of the Eregister On a put
the memory of a remote node is modied and the cache is updated  Hence
the T	E implements a global address space with locally consistent memory E
registers address this global address space which includes memory local to the
issuing processor
Eight Eregisters can be combined to a vector Distance between successive vec
tor elements have to be equidistant to ensure correct address translation
  Programming the Eregisters
Eregisters are memorymapped into the IOspace of the processor Therefore
every Eregister command is a store into IOspace
Eregistercommand  Eregisternumber   Index
The left hand side of the assignment accounts for the operation and the selected
Eregister Every pair of operation and register number points to a separate
memory location The right hand side provides the source of the operation which
can be an arbitrary address of the global address space The hardware centrifuge
performs calculation of the node number and the local node address in two steps
For that purpose it needs an additional block of four Eregisters A pointer in
the upper half of the Index addresses this block see Figure 
[Figure: the Index points to a block of four E-registers: Addend, Stride, Base, and Mask]
Fig. 2. E-registers for non-local data access
Every address translation needs the Mask and the Base. The Stride is used
to calculate consecutive addresses in vector commands. Atomic operations like
fetch-and-add use the Addend as an additional parameter. The pointer is
stripped from the Index after referencing the additional E-register block. The
first step of the address calculation uses the Mask to select the virtual node
number from the Index, see Figure 3.
The result of this step is the Offset, which is the Index with the selected
bits removed, and the virtual processor number PE. The second step adds Offset
and Base, forming the local virtual node address. Further transformations to
physical addresses do not matter here and are left out for brevity.
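The two steps can be modeled in a few lines of Python. This is a simplified software model and not a description of the actual hardware: it assumes that the pointer has already been stripped from the Index, that the bits selected by the Mask are compacted into PE, and that the remaining bits are compacted into Offset. The function name centrifuge and the width parameter are hypothetical.

```python
def centrifuge(index, mask, base, width=64):
    """Model of the two-step translation: returns (PE, local address)."""
    pe = offset = 0
    pe_pos = off_pos = 0
    for bit in range(width):                 # step 1: separate the bits
        value = (index >> bit) & 1
        if (mask >> bit) & 1:                # bit selected by the Mask -> PE
            pe |= value << pe_pos
            pe_pos += 1
        else:                                # all other bits -> Offset
            offset |= value << off_pos
            off_pos += 1
    return pe, base + offset                 # step 2: add Offset and Base
```

With an assumed Mask of 001100, the Index 101101, and an assumed Base of 1000, the model returns PE 3 and the address 1009, since the remaining bits 10 and 01 close up to the Offset 1001 (decimal 9).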
To move address calculation from software to hardware, Base points to the
local start of a distributed array. Consequently, the Index provides only the
global index of the array, needed for the calculation of the processor number
PE and the local Offset. The Mask is initialized according to Section 3.3 to
point to the bits significant for the processor number. Now a data-parallel
program that wants to perform address calculation in hardware sets up a
separate block of four E-registers for each distributed array, or it provides
separate Mask and Base if the number of E-registers does not suffice.
[Figure: the Mask is applied (&) to the Index, yielding PE and Offset; Offset and Base are added (+) to form the Address]
Fig. 3. Address calculation within the Hardware Centrifuge
3.3 Initializing the Hardware Centrifuge
The initialization is similar to Section 2.2. The Mask selects those bits from
the Index that form the virtual node number. For this purpose, the Mask is
divided into four segments, see Figure 4. Their meaning is described assuming
an arbitrary block-cyclic distribution with block size k. V denotes the local
problem size in data elements, and P is the number of processors.
[Figure: the Mask divided into four bit segments of widths s', k', p', and v' - k']
Fig. 4. Calculation of the Mask from Figure 2
The rightmost (least significant) segment contains those bits responsible for
a correct alignment of the appropriate data type. Its size in bits is
s' = log2(sizeof(datatype)). The following k' = log2(k) bits describe the
block size. These first k' + s' bits are set to zero. The next p' = log2(P)
bits are set because they select the bits forming the virtual node number. The
remaining log2(V) - k' bits are reset again.
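The segment sizes translate directly into a mask value. The Python sketch below assumes that the alignment bits are the low-order bits of a byte offset and that all sizes are powers of two; centrifuge_mask and its parameter names are hypothetical and only illustrate the bit layout described above.

```python
def log2_exact(n):
    """Return log2(n) for a power of two n; reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two, got %r" % n)
    return n.bit_length() - 1


def centrifuge_mask(elem_size, k, p, v):
    """Return (mask, width in bits) for block size k, P = p, and V = v."""
    s_bits = log2_exact(elem_size)     # s': alignment of the data type
    k_bits = log2_exact(k)             # k': offset within a block
    p_bits = log2_exact(p)             # p': virtual node number
    v_bits = log2_exact(v)
    if k_bits > v_bits:
        raise ValueError("block size k must not exceed V")
    mask = ((1 << p_bits) - 1) << (s_bits + k_bits)   # only the p' bits are set
    width = s_bits + k_bits + p_bits + (v_bits - k_bits)
    return mask, width


mask, width = centrifuge_mask(elem_size=8, k=4, p=4, v=16)
print(format(mask, "0{}b".format(width)))   # prints 001100000
```

In this assumed configuration (8-byte elements, k = 4, P = 4, V = 16), three alignment bits and two block bits are zero, two processor bits are set, and the two remaining high-order bits are zero again.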
The results section shows how far the hardware centrifuge improves runtime
compared with software mask and shift operations.
4 Used Communication Technique: VSCAP
The measurements are done in the context of overlapping communication. The
technique used is called VSCAP (software-controlled access pipelining with
vector commands), an extension of SCAP, which was developed by Warschko [7].
The major aim of VSCAP is network latency hiding through overlapping of
computation and communication by splitting non-local memory access into
low-overhead prefetch and access. The time needed to issue the prefetch
instructions dominates communication time. Therefore, fast prefetch
instructions, enabled by fast address calculation, result in fast
communication, leading to a lower execution time of a data-parallel
application. Hence, the goal of this section is to give a short overview of
VSCAP in order to explain the communication technique used for the
measurements.
4.1 Basic Idea of SCAP and its Extension to VSCAP
The aim of SCAP is a runtime improvement achieved by overlapping several
communication requests, leading to a communication pipeline in fine-grained
data-parallel applications. For a better understanding of the basic principle
of SCAP, we first explain how communication is usually done.
The processor issues a request to the network (downwards arrow in Figure 5)
and waits until the network replies (upwards arrow). Only then does the
processor continue its execution and issue a new request. This is repeated as
long as the processor requires remote data elements to perform its local part
of the computation. As the processor blocks after each data request, we call
this kind of communication the blocking mode, see the upper half of Figure 5.
Now let us assume the processor could issue all its communication requests
and the network were able to process them in an overlapped fashion. This
would lead to a shorter waiting period for the processor accessing the first
and all successive remote data elements. As a result, communication could be
performed faster than with the blocking execution described above. We call
this kind of communication overlapping communication, see the lower half of
Figure 5. To enable overlapping communication, the network interface has to
provide a prefetch buffer that decouples the processor from the network
execution. The second task of the prefetch buffer is to synchronize the
processor with the network execution. The synchronization becomes necessary if
the processor wants to access a data element which has not been delivered by
the network yet. In this case, the processor is stalled until the value
arrives.
VSCAP extends SCAP by means of vector commands for prefetch and access.
Instead of issuing a communication request for each non-local data element,
the processor prefetches and accesses L data elements at once, where L is the
vector length of the vector commands. VSCAP's vector commands reduce the
prefetch and access overhead of SCAP and improve communication time further.
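The difference between blocking and overlapping communication can be illustrated with a deliberately simple cost model. The model is an assumption made for this sketch and is not derived from the measurements: issuing one request costs issue time units, the network replies after latency time units, and the network pipelines requests perfectly. The function names are hypothetical.

```python
def blocking_time(n, issue, latency):
    """Every request is issued and then waited for before the next one."""
    return n * (issue + latency)


def overlapped_time(n, issue, latency):
    """All requests are issued back to back; only the last reply is waited for."""
    return n * issue + latency


# Assumed values: 3 requests, 1 time unit to issue, 10 units of latency.
print(blocking_time(3, 1, 10), overlapped_time(3, 1, 10))   # prints 33 13
```

In the overlapped case the latency is paid only once, so for many requests the total time is governed by the issue cost, which includes the address calculation. This matches the observation above that the time to issue the prefetch instructions dominates communication time.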
[Figure: timelines over Time for Blocking and Overlapping communication, each showing Access + Reply 1, 2, and 3 together with the Wait periods of the processor]
Fig. 5. Basic idea of SCAP
4.2 Transformation Rules
This section describes the techniques used in the transformation from a
data-parallel forall statement to VSCAP. The communication loop derived in
this section is used for the measurements.
The transformations are illustrated using the following simple forall
statement:
FORALL i = 1 TO N
  A(i) = B(q(i))
END
The program fragment updates array A in parallel, indexing array B with array
q. A parallelizing compiler maps the problem size N onto P real processors.
This technique is called virtualization. Assuming that P divides N, each
processor emulates V = N/P virtual processors within a virtualization loop.
Both A and B are distributed over the P processors, and work is assigned using
the owner-computes rule. Since the value of q(i) cannot be determined at
compile time, the compiler has to insert remote memory accesses. Each remote
memory access causes an address translation from global to local addresses.
The following transformation of the loop shows the virtualization and how
communication and computation can be overlapped:
FORALL j = 1 TO P
  FOR k = (j-1)*V + 1 TO j*V             ! Prefetch loop
    adr = calculate_address(B(q(k)))     ! Calculate remote address
    prefetch(adr)                        ! Start read request
  END
  FOR k = (j-1)*V + 1 TO j*V STEP L      ! Vector access loop
    adr = calculate_address(B(q(k)))
    vector_access(adr, A(k))             ! Access L data elements
  END
END
In this transformation, the main loop is split into two loops: a prefetch loop
and an access (or calculation) loop. Instead of stalling on a remote memory
access as in blocking mode, the processor issues remote memory prefetch
requests. After the prefetch loop has been executed, the calculation loop
accesses non-local memory without waiting time if the data is already present
in the prefetch buffer. Due to the dynamic communication pattern, vector
commands can only be used for access. Thus, the second loop is blocked with
block size L. For simplicity, we assume that L divides V. Otherwise,
additional element-wise access operations have to be used to get the remaining
V mod L data elements, which do not fill a vector of length L. Within the
loop, the vector access vector_access(adr, A(k)) copies L entries from the
prefetch buffer, starting at address adr, to successive memory locations
beginning with A(k). If we assume that a vector access lasts as long as an
element-wise access operation, the duration of the vector access loop is
decreased by a factor of L, the vector length.
If the number of nonlocal memory accesses is too large to t into the prefetch
buer VSCAP
s transformation rule uses a three loop execution pattern where
the middle loop alternates between access and prefetch instructions This trans
formation is not shown for the sake of brevity
The measurements replace the call to calculate address in the prefetch loop
with integer division integer mask and shift or it is completely omitted in the
case of the hardware centrifuge On the Cray T	E the call to the function
calculate address is not needed in the second loop Therefore manipulations
take place only in the prefetch loop
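The two-loop structure can be emulated sequentially in Python to check that it computes the same result as the original forall statement. This sketch makes several assumptions: a plain list stands in for the prefetch buffer, indices are zero-based, the address translation uses division and modulo, and all function names are hypothetical. It models only the data flow, not the timing.

```python
def translate(g, k, p):
    """Global index -> (processor, local index) for block size k, p processors."""
    block = g // k
    return block % p, (block // p) * k + g % k


def distribute(values, k, p):
    """Scatter a global list over p local lists according to translate()."""
    v = len(values) // p
    local = [[None] * v for _ in range(p)]
    for g, value in enumerate(values):
        pe, adr = translate(g, k, p)
        local[pe][adr] = value
    return local


def vscap_indirect(b_local, q, k, p, v, vec_len):
    """Emulate A(i) = B(q(i)) with a prefetch loop and a vector access loop."""
    if v % vec_len:
        raise ValueError("this sketch assumes that L divides V")
    a = [None] * (p * v)
    for j in range(p):                         # one iteration per real processor
        first = j * v
        buffer = []                            # stands in for the prefetch buffer
        for i in range(first, first + v):      # prefetch loop
            pe, adr = translate(q[i], k, p)    # address calculation
            buffer.append(b_local[pe][adr])    # requested value lands in the buffer
        for i in range(0, v, vec_len):         # vector access loop, L per step
            a[first + i:first + i + vec_len] = buffer[i:i + vec_len]
    return a
```

A small check with assumed sizes N = 16, P = 4, V = 4, k = 2, and L = 2: after distributing B with distribute, vscap_indirect returns a list whose element i equals B(q(i)) for any index array q, which is exactly what the forall statement specifies.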
5 Results
The runtimes of Indirect, the example code shown in Section 4.2, are given for
BLOCK, CYCLIC, and block-cyclic distributions. The discussion of each
distribution includes two plots. The first one shows the runtimes, and the
second one presents the performance relative to an execution with integer
division and modulo commands (DIVMOD). Each plot contains three different
versions of Indirect: DIVMOD is a VSCAP execution with integer division and
modulo commands for address translation, MASKSHIFT uses integer mask and shift
operations, and HWC is an execution with the hardware centrifuge. DIVMOD and
HWC are compiled by the KarHPFn compiler. MASKSHIFT was hand-coded with the
DIVMOD version as starting point. Tests were measured on
	 processors varying the local problem size V from  to 	 vector elements
Figure 6 shows the results for a BLOCK distribution.
[Figure: two plots over the number of virtualizations (1 to 10000, logarithmic). Left: runtime in ms for divmod, hwc, and maskshift. Right: performance relative to DIVMOD.]
Fig. 6. Runtimes and speedup for BLOCK distribution
The runtime plot on the left-hand side shows two lines. The upper line belongs
to DIVMOD. The lower line contains the runtimes for HWC and MASKSHIFT.
The plot indicates two facts: first, integer division and modulo arithmetic is
very slow compared with mask and shift operations, and second, the hardware
centrifuge (HWC) has only a minor advantage over MASKSHIFT. The plot on the
right-hand side of Figure 6 quantifies the advantage of HWC and MASKSHIFT over
DIVMOD: HWC and MASKSHIFT are about  times and  times faster, respectively.
The small difference between HWC and MASKSHIFT is astonishing. MASKSHIFT
is at most  slower, although compared with HWC it executes  additional
integer operations for each non-local memory access. This behavior is due to
the multiple integer units of the Alpha processors, which overlap several
operations. Another reason for this behavior could be a limited issue
bandwidth for the E-register commands which is already reached by the
MASKSHIFT version. Then faster prefetch operations, such as the ones issued by
the HWC version, would have no effect. But this is speculation beyond the
scope of this paper.
Figure 7 shows a similar result for a CYCLIC distribution.
The plot on the left-hand side shows the runtimes of the three different
versions. Again, there are only two lines. The upper one shows the runtime of
DIVMOD, while the lower one shows the runtime of MASKSHIFT and HWC. The
CYCLIC distribution shows the same runtime behavior as the BLOCK
distribution: MASKSHIFT and HWC are faster than DIVMOD, and the former two
versions are equally fast. The plot on the right-hand side emphasizes these
observations. MASKSHIFT and HWC are more than  times faster than DIVMOD,
and MASKSHIFT is as fast as HWC.
The results obtained so far are confirmed by the block-cyclic distribution,
see Figure 8.
[Figure: two plots over the number of virtualizations (1 to 10000, logarithmic). Left: runtime in ms for divmod, hwc, and maskshift. Right: performance relative to DIVMOD.]
Fig. 7. Runtimes and speedup of INDIRECT for CYCLIC distribution
[Figure: two plots over the number of virtualizations (1 to 10000, logarithmic). Left: runtime in ms for divmod, hwc, and maskshift. Right: performance relative to DIVMOD.]
Fig. 8. Runtimes and speedup of INDIRECT for the block-cyclic distribution with
block size k =  
The runtimes of HWC and MASKSHIFT are equal, and both show a substantial
advantage over DIVMOD. The former two versions are at least  times
faster than DIVMOD.
The measurements show two results. The first one confirms the expectation that
integer mask and shift operations for address calculations are faster than
ordinary integer division and modulo arithmetic. The second, more astonishing
result is that the fast integer operations perform almost as well as the
hardware centrifuge.
6 Conclusions
This paper investigated the benefits of using the Cray T3E's hardware
centrifuge. The address translation mechanism was compared with complex
integer division and modulo arithmetic and with integer mask and shift
operations.
In a dynamic communication kernel, the hardware centrifuge is about  times
faster than integer division arithmetic. But, and this result is quite
surprising, it is only a few percent  faster than integer mask and shift
operations. This is caused by the multiple integer units provided by the Alpha
processors, which can overlap several integer operations.
The results show the performance for one-dimensional arrays. The advantage of
the hardware centrifuge would be a little larger if the measurements had
focused on multi-dimensional arrays. Then the software address calculation
overhead would be larger, leading to a more significant advantage of the
centrifuge.
While this work emphasizes support for fast address calculation, it also shows
the weakness of the T3E's hardware centrifuge in doing this job for
one-dimensional arrays compared with software mask and shift operations.
References
1. Matthias M. Müller. Compiling Applications with the KarHPFn Compiler.
   Technical Report  , School of Computer Science, Universität Karlsruhe,
   April  .
2. Matthias M. Müller. KaHPF: Compiler generated Data Prefetching for HPF. In
   High Performance Computing in Science and Engineering  , pages  .
   Springer,  .
3. Matthias M. Müller. Latenzzeitverbergung in datenparallelen Sprachen. PhD
   thesis, School of Computer Science, Universität Karlsruhe, February  .
4. Matthias M. Müller, Thomas M. Warschko, and Walter F. Tichy. Prefetching on
   the Cray-T3E. In  th International Conference on Supercomputing, pages  ,
   Melbourne, July  .
5. Wilfried Oed. Massivparalleles Prozessorsystem CRAY T3E. Technical report,
   Cray Research GmbH, München, November  .
6. Steven L. Scott. Synchronization and communication in the T3E
   multiprocessor. ACM SIGPLAN Notices,  , September  .
7. Thomas M. Warschko. Effiziente Kommunikation in Parallelrechnerarchitekturen.
   PhD thesis, School of Computer Science, Universität Karlsruhe,  .
