FAST PARALLEL PERMUTATION ALGORITHMS

TORBEN HAGERUP*
Max-Planck-Institut für Informatik, Saarbrücken, Germany

JÖRG KELLER†
Fachbereich Informatik, Universität des Saarlandes, Saarbrücken, Germany
Abstract

We investigate the problem of permuting n data items on an EREW PRAM with p processors using little additional storage. We present a simple algorithm with run time $O((n/p) \log n)$ and an improved algorithm with run time $O(n/p + \log n \log\log(n/p))$. Both algorithms require n additional global bits and $O(1)$ local storage per processor. If prefix summation is supported at the instruction level, the run time of the improved algorithm is $O(n/p)$. The algorithms can be used to rehash the address space of a PRAM emulation.

Keywords: Parallel Algorithms, Permutations, Shared Memory, Rehashing
1 Introduction

Consider the task of permuting n data items on an EREW PRAM with $p \le n$ processors according to a permutation $\pi$ given in the form of a constant-time "black-box" program. The task is trivial if n additional (global or local) memory cells are available: the items are first moved to the additional storage, with each processor handling $O(n/p)$ items, and then written back in permuted order. We restrict attention to the case in which only $O(1)$ additional memory cells per processor are available, but the positions holding the items can be marked as "visited".


* Supported by the ESPRIT Basic Research Actions Program of the EU (project ALCOM II).
† Supported by the Dutch Science Foundation (NWO) through NFI project ALADDIN, and by the German Science Foundation (DFG) through an SFB grant. Part of the work was done when this author was at CWI, Amsterdam, The Netherlands.

An application of this problem is rehashing a hashed address space in a PRAM emulation. If both the old and the new hash function are bijective maps of addresses to cells, then rehashing can be described as a permutation of the PRAM address space. Examples are hash functions of the form $x \mapsto ax \bmod m$, where m is the size of the shared address space and a is chosen relatively prime to m. While the complete address space gets rehashed, there is no additional global space available. Moreover, processors usually only have small local memories to store additional information. These considerations motivate our decision to allow only $O(1)$ additional memory cells per processor.

The problem of permuting arrays has been investigated before, both in the setting of sequential computers and in the setting of PRAMs.

Knuth [2] describes a simple sequential algorithm that runs in time $O(n^2)$ and needs only one buffer and a few counters. He also analyzes the average run time and shows it to be $O(n \log n)$. Melville [3] presents a time-space tradeoff: if t additional bits are available, his algorithm runs in time $O(n^2/t)$. Fich, Munro and Poblete [4] give an algorithm with run time $O(n \log n)$ that needs only $O(\log^2 n)$ additional bits.

Aggarwal Chandra and Snir describe an algorithm for the Block PRAM 
 This is a
PRAM where access to a block of b consecutive cells in the shared memory takes time
l  b i
e
 there is a startup delay of l followed by a unit delay for each cell read

Their algorithm runs in time Onp if n  lp

 for some xed   
 However
they assume the permutation to be known in advance


 Chin  improves their result for
rational permutations i
e
 permutations that can be expressed as permutations on the
bit positions if numbers are given in binary representation
 Keller  gives an algorithm
for linear permutations i
e
 permutations of the form x  ax mod 
u
 where a is odd

This algorithm runs in time Onp  log p and requires Ologn local memory cells per
processor
 All these parallel algorithms take advantage of some a priori knowledge of the
permutation

We consider the more general case in which the permutation is not fixed and we have no knowledge of its structure. Our work can be summarized as follows. We follow an idea from the simple $O(n)$-time sequential algorithm and mark as visited the original positions of items that have been moved. This idea leads to a simple algorithm that runs in time $O((n/p) \log n)$ and needs only constant space per processor. By breaking the algorithm into $O(\log\log(n/p))$ phases and redistributing work to processors after each phase, we obtain an improved algorithm with run time $O(n/p + \log n \log\log(n/p))$. The overhead comes from executing a prefix summation after each phase. If prefix instructions can be executed in constant time, the run time improves to $O(n/p)$. By using a CRCW PRAM and faster load balancing strategies that do not rely on prefix summation, we obtain run times of $O(n/p + \log^* n \log\log(n/p))$ (randomized) and $O(n/p + (\log\log n)^3 \log\log(n/p))$ (deterministic). These algorithms, however, will be less practical.

¹ They do not state this, but otherwise they would need a preprocessing phase that includes the computation of switch positions in a Clos network, not to mention the space required to store this information.

The paper is organized as follows. We describe the simple algorithm in Section 2 and analyze its complexity in Section 3. In Section 4 we show how to improve the simple algorithm. Possible further improvements are discussed in Section 5.

2 The Basic Algorithm

The standard sequential algorithm to permute n data items according to a permutation $\pi$ of the n positions holding the items works as follows: Search for a position that has not yet been visited. Permute along the cycle starting in this position until you reach it again. Mark all positions that you visit. Continue until all positions have been visited. This algorithm requires time $O(n)$ to move all items and time $O(n)$ to search for unvisited positions. The time bound for searching is obtained by maintaining a pointer that keeps track of how far the array holding the items has been searched so far. As marked positions are never again unmarked, this finds all unvisited positions in time $O(n)$. The space requirements are a buffer, a pointer and n bits to mark the positions.
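The sequential procedure just described can be sketched in Python as follows. This is a minimal illustration, assuming the permutation is supplied as a function `pi` and that item `x` has to end up at position `pi(x)`; the name `permute_in_place` is chosen here and does not come from the paper.

```python
def permute_in_place(item, pi):
    """Move item[x] to position pi(x) for every x, using one buffer and n mark bits."""
    n = len(item)
    visited = [False] * n          # the n marking bits
    for start in range(n):         # search pointer; it never moves backwards
        if visited[start]:
            continue
        visited[start] = True
        buffer = item[start]       # pick up the first item of the cycle
        x = pi(start)
        while not visited[x]:      # walk along the cycle until it closes
            visited[x] = True
            buffer, item[x] = item[x], buffer
            x = pi(x)
        item[x] = buffer           # x == start: drop the last item carried
    return item


table = [2, 0, 1, 4, 3]            # example permutation: x -> table[x]
data = permute_in_place(list("abcde"), lambda x: table[x])
assert data == ["b", "c", "a", "e", "d"]
```

Every position is marked exactly once and the search pointer `start` only advances, which mirrors the $O(n)$ bounds for moving and for searching.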

We adapt the idea of marking visited positions and obtain a simple parallel algorithm for an EREW PRAM with p processors. Without loss of generality, assume that p divides n. We partition the n positions into p blocks of size $B = n/p$. Each processor P takes care of one block of B positions. P starts with an unvisited position x in its block and follows the cycle of $\pi$ that starts in x, moving the items encountered as it goes along, until it meets a position $x'$ that is already marked as visited. P now searches for another unvisited position in its block and continues. It terminates when all items in its block have been visited.

A processor can be in one of three states: either it is searching for an unvisited position in its block, or it is working on a cycle, or it is terminated.

If a processor is searching, it examines the positions in its block to test whether they have been visited. It continues until it finds an unvisited position x or until it reaches the end of its block. In the first case it marks the position as visited, picks up the item stored there, changes its own state to "on cycle" and moves to $\pi(x)$. In the latter case it changes its state to "terminated". Each processor maintains a pointer into its block to keep track of how far it has searched so far. Hence, if it changes its state from "on cycle" to "searching" again, it does not have to start from the beginning of the block.

If a processor is "on cycle" and has reached position x, then its action depends on the state of x. If the position has not yet been visited, then the processor will pick up the item stored in x, mark x as visited, store in x the item it picked up in the previous iteration (or in the same iteration, if the processor just switched from "searching" to "on cycle"), and move to $\pi(x)$. If the position has already been visited, then the processor will store the previous item in x and change its state to "searching".

A processor may meet a visited position either because it reaches the end of the cycle (the position where it started in its own block) or because another processor started to work on the same cycle in this position. A position x therefore is inspected at most twice: once by the processor assigned to its block and once by a processor following the cycle containing x. In order to avoid an access conflict between these two cases, we split each iteration of the algorithm into two parts such that "searching" processors and processors "on cycle" proceed alternately.

The program for the basic algorithm is shown in Figure 1. There, T denotes an upper bound on the maximum number of iterations; we will compute such an upper bound in Section 3. Each processor has local variables state, index, iptr and buffer. The variable state defines the current state of the processor, index counts how far it has searched its block, iptr points to the currently visited position, and buffer is used to store data items temporarily. Global arrays are visited and item. The array visited contains the flags of all positions, and item stores the actual items.

An improvement in practical terms, omitted in the interest of clarity, would be to let even processors "on cycle" use the first part of each iteration to continue the search in their blocks for unvisited positions.
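The basic algorithm described above can be simulated round by round on a sequential machine. The following Python sketch is such a simulation under stated assumptions: the p processors are modelled by plain lists, the permutation is a function `pi`, and the loop runs until every processor has terminated instead of for a precomputed number T of iterations. The function name `basic_parallel_permute` is introduced here for illustration.

```python
SEARCHING, ON_CYCLE, TERMINATED = range(3)


def basic_parallel_permute(item, pi, p):
    """Simulate the basic algorithm round by round; return the number of iterations."""
    n = len(item)
    assert n % p == 0, "p must divide n"
    B = n // p
    visited = [False] * n
    state = [SEARCHING] * p
    index = [0] * p       # how far each processor has searched its block
    iptr = [0] * p        # position currently visited
    buffer = [None] * p   # item carried along the cycle
    iterations = 0
    while any(s != TERMINATED for s in state):
        iterations += 1
        for i in range(p):                      # first part: searching processors
            if state[i] != SEARCHING:
                continue
            if index[i] == B:
                state[i] = TERMINATED
                continue
            iptr[i] = i * B + index[i]
            index[i] += 1
            if not visited[iptr[i]]:
                visited[iptr[i]] = True
                state[i] = ON_CYCLE
                buffer[i] = item[iptr[i]]
                iptr[i] = pi(iptr[i])
        for i in range(p):                      # second part: processors on cycle
            if state[i] != ON_CYCLE:
                continue
            x = iptr[i]
            buffer[i], item[x] = item[x], buffer[i]   # exchange contents
            if not visited[x]:
                visited[x] = True
                iptr[i] = pi(x)
            else:
                state[i] = SEARCHING
    return iterations


table = [1, 2, 3, 0]                            # a single cycle of length 4
data = list("abcd")
basic_parallel_permute(data, lambda x: table[x], p=2)
assert all(data[table[x]] == "abcd"[x] for x in range(4))
```

Within each part the simulated processors touch pairwise distinct cells, so the order in which the inner loops visit them does not affect the result.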

3 Analysis

We will now analyze the run time and the memory requirements of the basic algorithm. The results are described in Theorem 1.

Theorem 1 The basic algorithm runs in time $O((n/p) \log n)$ and requires n global bits and $O(1)$ local memory cells per processor.

Proof. In order to analyze the run time, we define a potential $\Phi$ as the sum of the lengths of the block parts that have not yet been searched plus the number of items that have not yet been moved to their final positions. It is easy to see that $\Phi$ never increases and that the algorithm may finish when $\Phi = 0$. Also, $\Phi = 2n$ initially, since the sum of the block lengths is n and there are n items. For $i = 0, 1, 2, \ldots$, denote by $\Phi_i$ the value of $\Phi$ at the end of iteration i (for $i = 0$: before the first iteration), and by $p_i$ the number of processors that have not terminated at that time. Clearly $p_0 = p$.


    for i := 0 to p - 1 pardo                    (* initialization *)
        P_i.state := SEARCHING; P_i.index := 0;
        for j := 0 to B - 1 do visited[iB + j] := 0 od
    od;
    for t := 1 to T do                           (* iteration t *)
        for i := 0 to p - 1 pardo
            if P_i.state = SEARCHING then        (* first part *)
                if P_i.index = B then P_i.state := TERMINATED
                else
                    P_i.iptr := iB + P_i.index; P_i.index := P_i.index + 1;
                    if visited[P_i.iptr] = 0 then
                        visited[P_i.iptr] := 1;
                        P_i.state := ON_CYCLE;
                        P_i.buffer := item[P_i.iptr];
                        P_i.iptr := π(P_i.iptr)
                    fi
                fi
            fi;
            if P_i.state = ON_CYCLE then         (* second part *)
                P_i.buffer :=: item[P_i.iptr];   (* exchange contents *)
                if visited[P_i.iptr] = 0 then
                    visited[P_i.iptr] := 1;
                    P_i.iptr := π(P_i.iptr)
                else
                    P_i.state := SEARCHING
                fi
            fi
        od
    od;

    Figure 1: The basic algorithm

For $i = 1, 2, \ldots$, each of the $p_i$ processors that have not terminated after iteration i decreases $\Phi$ by at least one in iteration i; hence $\Phi_i \le \Phi_{i-1} - p_i$. To see this, note that a searching processor decreases $\Phi$ by increasing its pointer index (the assignment to P_i.index in the first part of Figure 1), while a processor "on cycle" moves one item to its final position (the exchange of buffer and item in the second part). A processor may decrease $\Phi$ by two in a particular iteration, namely if it switches from "searching" to "on cycle" in that iteration.

Also, $\Phi_i \le 2Bp_i$ for $i = 0, 1, 2, \ldots$, since there are only $Bp_i$ positions in the blocks of the active processors that could be unsearched, and also at most $Bp_i$ items that are not yet in their final positions. Then

$$\Phi_{i-1} \ge \Phi_i + p_i = \Phi_i \left(1 + \frac{p_i}{\Phi_i}\right) \ge \Phi_i \left(1 + \frac{p_i}{2Bp_i}\right) = \Phi_i \left(1 + \frac{1}{2B}\right).$$

It follows that $\Phi_i \le \Phi_0 (1 + 1/(2B))^{-i}$ for $i = 0, 1, 2, \ldots$, so that the number of iterations can be bounded by the smallest i with $\Phi_0 (1 + 1/(2B))^{-i} < 1$. This relation can be transformed into $i > \log(2n) / \log(1 + 1/(2B))$. Since each iteration takes constant time, $\log(1 + 1/(2B)) \ge 1/(4B)$, and $B = n/p$, we obtain a run time of $O((n/p) \log n)$.

The fact that $\log(1 + 1/(2B)) \ge 1/(4B)$ follows from the mean value theorem: if $f(x) = \log(1 + x)$, then for $0 \le x \le 1$, $f'(x) = 1/(\ln 2 \cdot (1 + x)) \ge 1/(2 \ln 2)$, and hence $f(x) = f(x) - f(0) \ge x/(2 \ln 2) \ge x/2$.

From the description of the algorithm it is clear that it needs n global bits and that $O(1)$ local memory cells per processor are sufficient.
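The iteration bound obtained in the proof can be evaluated numerically. The following Python sketch assumes base-2 logarithms and the initial potential $2n$; the function name `iteration_bound` is introduced here for illustration and is not part of the paper.

```python
from math import log2


def iteration_bound(n: int, p: int) -> int:
    """Smallest integer i with 2n * (1 + 1/(2B))**(-i) < 1, where B = n/p."""
    B = n / p
    # i must exceed log(2n) / log(1 + 1/(2B)); take the next integer above it.
    return int(log2(2 * n) / log2(1 + 1 / (2 * B))) + 1


for n, p in [(1024, 32), (4096, 64)]:
    T = iteration_bound(n, p)
    B = n / p
    # Since log2(1 + x) >= x/2 on [0, 1], the bound is at most 4B * log2(2n) + 1.
    assert T <= 4 * B * log2(2 * n) + 1
```

A value computed this way could serve as the iteration limit T of the basic algorithm.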
 
4 An Improved Algorithm

The basic algorithm does not run in optimal time, mainly because many processors could terminate early, causing the work load to be severely unbalanced. We improve the basic algorithm by breaking it into several phases and reallocating processors to unvisited positions after each phase. The array of items is dynamically partitioned into active and passive blocks. In a passive block all positions have already been visited. Active blocks are split into smaller ones as the algorithm proceeds. In the beginning the whole array forms one active block.

In phase i, for $i = 1, 2, \ldots$, we form p active blocks out of the remaining active blocks from the last phase.
 Then we execute q
i
 d  
i
npe iterations of the original algorithm

We proceed until fewer than p unvisited positions remain. It is easy to see that at this point the remaining items can be collected and moved in time $O(n/p + \log n)$.

The improvements of the new algorithm are summarized in the following Theorem 2.

Theorem 2 The improved algorithm works in $O(\log\log(n/p))$ phases and runs in time $O(n/p + \log n \log\log(n/p))$. Its storage requirements are n global bits and $O(1)$ local memory cells per processor.

Corollary 1 The algorithm is optimal for $p = O(n / (\log n \log\log\log n))$.

We first show how to partition r remaining blocks into p blocks of roughly equal sizes if p is not a multiple of r. Then we prove Theorem 2.

4.1 Partitioning blocks

At the beginning of each phase we want to partition the r blocks that were still active by the end of the previous phase into p new blocks. Suppose that each of the r blocks is of size at most s. We assume that $r \le p/2$ and that $rs \ge p$. If we ignore any rounding problems, we obtain $rs/p$ as the new block size. However, when we implement the permutation algorithm, we have to cope with the fact that p may not be a multiple of the number r of remaining blocks and that s may not be a multiple of the number of new blocks to be formed out of an old block. Then the new block size will be larger than $rs/p$. Lemma 1 guarantees that the new block size will not be too large.

Lemma 1 The partitioning described above can be done in such a way that the maximum size of the new blocks is at most $\lceil s / \lfloor p/r \rfloor \rceil$, which is less than $3rs/p$.

We prove Lemma 1 using the following simple fact.

Lemma 2 For any two integers u and v with $1 \le v \le u$, u can be written as a sum of v integers, each of which is $\lfloor u/v \rfloor$ or $\lceil u/v \rceil$.

Proof of Lemma 1. We apply Lemma 2 with $u = p$ and $v = r$ and see that we can split each remaining block into either $\lfloor p/r \rfloor$ or $\lceil p/r \rceil$ new blocks. To find the maximum size of the new blocks, we consider a block that is split into $\lfloor p/r \rfloor$ new blocks. We apply Lemma 2 with $u = s$ and $v = \lfloor p/r \rfloor$ and see that the maximum size of a new block is at most $s' = \lceil u/v \rceil = \lceil s / \lfloor p/r \rfloor \rceil$. Using that $p/r - 1 < \lfloor p/r \rfloor$ and $\lceil s/v \rceil < s/v + 1$, we get $s' < rs/(p - r) + 1$. By the assumptions $r \le p/2$ and $rs \ge p$, we have $s' < 2rs/p + 1 \le 2rs/p + rs/p = 3rs/p$.
 

The computation to partition the remaining r active blocks into p blocks can be done using parallel prefix summation. This summation requires $O(p)$ global memory cells. However, these cells can be made local by copying $O(p)$ global cells to local memories in $O(1)$ time and restoring them after the prefix summation.
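To illustrate how a prefix summation supports the repartitioning, the following Python sketch simulates a standard doubling scheme that uses a logarithmic number of synchronous rounds. It is a simplified model, not the paper's implementation: each round reads from a copy of the previous array, memory-access conflicts are not modelled, and the names `prefix_sums` and `first_processor_of_each_block` are assumptions made here.

```python
def prefix_sums(values):
    """Inclusive prefix sums by doubling: about log2(len(values)) synchronous rounds."""
    a = list(values)
    d = 1
    while d < len(a):
        # One round: every position is updated "simultaneously" from the old array.
        a = [a[i] + (a[i - d] if i >= d else 0) for i in range(len(a))]
        d *= 2
    return a


def first_processor_of_each_block(counts):
    """counts[j] = number of new blocks (processors) assigned to old block j."""
    inclusive = prefix_sums(counts)
    return [total - c for total, c in zip(inclusive, counts)]  # exclusive sums


# Three old blocks that receive 3, 3 and 2 processors, respectively.
assert prefix_sums([3, 3, 2]) == [3, 6, 8]
assert first_processor_of_each_block([3, 3, 2]) == [0, 3, 6]
```

The exclusive sums tell each old block which range of processor indices will work on its pieces in the next phase.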

4.2 Analysis of the improved algorithm

In analogy to Section 3, we denote by $\Phi$ the sum of the lengths of the block parts that have not yet been searched plus the number of items that have not yet been moved to their final positions. Let T be the number of phases necessary to reduce $\Phi$ to at most p, and denote by $\Phi_i$ the value of $\Phi$ at the end of phase i, for $i = 0, \ldots, T$; by definition, $\Phi_i > p$ for all $i < T$. Let $B_i$ be the maximum block size in phase i, for $i = 1, \ldots, T$. Arguing as in Section 3, one can see that for $i = 1, \ldots, T$ we have $\Phi_{i-1} \le 2B_i p + p$; the extra term of p accounts for the fact that a processor may hold an item picked up in the previous phase.

In order to prove Theorem 2, we show that the block size shrinks very fast as the algorithm proceeds. This is formalized in Lemma 3.

Lemma 3 For $i = 1, \ldots, T$, the maximum block size $B_i$ in phase i is less than
e
B
i




n


i
i
p

Proof by induction on i
i   In phase  we can choose a block size of np which is less than
e
B



i i  Since this case is relevant only for i  T  we can assume that 
i
 p
 Moreover
by the induction hypothesis 
i
 B
i
p  p  n

i
i
 p
 Denote by
p
i
the number of processors active at the end of phase i
 Since 
i
 
i
 p
i
 q
i
recall that q
i
 d  
i
npe is the number of iterations executed in phase i we
obtain p
i
 
i
 
i
q
i
 n

i
i
  
i
np  p

i



An argument used above shows that 
i
 p
i
B
i
 p
i
 i
e
 p
i
B
i
 p
 Since also
p
i
 p we can apply Lemma  which shows that the maximum block size B
i
in phase i   is less than p
i
B
i
p
 By the induction hypothesis and the upper
bound on p
i
established above p
i
B
i
p  n

i
i
p 
e
B
i



Proof of Theorem 2. By Lemma 3, $\Phi_{i-1} \le 2pB_i + p < 2p\tilde{B}_i + p$ for $i = 1, \ldots, T$, where $\tilde{B}_i$ is the bound of Lemma 3. It follows that $O(\log\log(n/p))$ phases suffice to reduce $\Phi$, and hence the number of unvisited positions, below p. The remaining items can be moved in time $O(n/p + \log n)$. The phases take time $\sum_{i \le T} q_i = O(n/p)$, where $q_i$ is the number of iterations executed in phase i. The prefix summation takes time $O(\log p) = O(\log n)$ per phase. Hence the total run time is $O(n/p + \log n \log\log(n/p))$.

The n global bits are required by the original algorithm. Local cells are needed to back up one item during the permutation and to back up a constant number of global memory cells during the parallel prefix summation. Hence $O(1)$ local cells are sufficient.
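The summation in this proof can be spelled out under a simplifying assumption, namely that the iteration counts shrink geometrically from phase to phase. The constant c below is a placeholder introduced for this sketch only; it is not claimed to be the constant used in the definition of $q_i$.

```latex
% Assumption (for illustration): q_i <= c 2^{-i} n/p + 1 for a constant c > 0,
% and T = O(log log (n/p)) phases, each followed by an O(log n) prefix summation.
\begin{align*}
\sum_{i=1}^{T} q_i
  &\le \sum_{i=1}^{T} \Bigl( c\,2^{-i}\,\frac{n}{p} + 1 \Bigr)
   < c\,\frac{n}{p} + T
   = O\!\Bigl(\frac{n}{p} + \log\log\frac{n}{p}\Bigr)
   = O\!\Bigl(\frac{n}{p}\Bigr), \\
\text{total time}
  &= O\!\Bigl(\frac{n}{p}\Bigr) + T \cdot O(\log n)
   = O\!\Bigl(\frac{n}{p} + \log n \,\log\log\frac{n}{p}\Bigr).
\end{align*}
```

The rounding in each phase contributes only one extra iteration, which is why the number of phases appears as an additive term before it is absorbed into $O(n/p)$.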
 
5 Discussion

Some further improvements are possible. First, from the proof of Theorem 2 we can immediately derive the following Corollary 2.

Corollary 2 If prefix summation can be realized in constant time, then the improved algorithm runs in time $O(n/p)$ and hence is optimal for $p \le n$.

This is important for architectures that emulate the PRAM model and support prefix computation at the instruction level. Examples are the Fluent Machine [8], the NYU Ultracomputer [9], and the SB-PRAM [10].

Improvements are also possible if a CRCW PRAM is used and the prefix summation is replaced by faster load balancing subroutines. Using the techniques from [11] for a randomized PRAM and from [12] for a deterministic PRAM, the run time of the algorithm can be reduced to $O(n/p + \log^* n \log\log(n/p))$ and $O(n/p + (\log\log n)^3 \log\log(n/p))$, respectively. However, these improvements seem to be less practical because of larger constant factors in the advanced load balancing algorithms.

In our analysis we have distinguished between bits and memory cells
 Bits are considered
dierent because implementing them often will not increase the storage used
 In the
representation of the items there will often be an unused bit that can be used to encode
the visited ags
 Also many memory subsystems today provide each cell with additional
bits that are used for parity access control etc
 One of these probably could be used for
implementing the ags

The behaviour of our simple algorithm depends on the permutation $\pi$. For many permutations, the behaviour should be much better than indicated by our worst-case bound of $O((n/p) \log n)$. We support this belief by simulation results. For several values of n that are powers of two and $p = \lfloor n / \log n \rfloor$, we simulated the algorithm on randomly chosen permutations. The average and standard deviation of the number of iterations needed are shown in Figure 2. The standard deviation is small, and the number of iterations always stays below a constant multiple of $\log n$. This hints at the average behaviour of the simple algorithm being much better than its worst-case behaviour. However, the average run time still needs to be analyzed.

Figure 2: Average run time and standard deviation of the simple algorithm (number of iterations T against n).

Acknowledgements

We want to thank Dany Breslauer and John Tromp for their patience in listening and for their helpful suggestions.

References

[1] J. Keller, Fast rehashing in PRAM emulations, in: Proc. Symp. on Parallel and Distributed Processing, Dallas, TX, Dec.
[2] D. E. Knuth, Mathematical analysis of algorithms, in: Proc. of IFIP Congress, Information Processing, North-Holland, Amsterdam.
[3] R. Melville, A time-space tradeoff for in-place array permutation, J. Algorithms.
[4] F. E. Fich, J. I. Munro and P. V. Poblete, Permuting, in: Proc. Symp. on Foundations of Computer Science, St. Louis, Miss., Oct.
[5] A. Aggarwal, A. K. Chandra and M. Snir, On communication latency in PRAM computations, in: Proc. Symp. on Parallel Algorithms and Architectures, Santa Fe, NM, June.
[6] M. Snir, Personal communication.
[7] A. Chin, Permutations on the block PRAM, Inform. Process. Lett.
[8] A. G. Ranade, S. N. Bhatt and S. L. Johnson, The Fluent Abstract Machine, in: Proc. MIT Conference on Advanced Research in VLSI, Boston, Mass., Mar.
[9] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph and M. Snir, The NYU Ultracomputer - designing an MIMD shared memory parallel computer, IEEE Trans. Comput.
[10] F. Abolhassan, R. Drefenstedt, J. Keller, W. J. Paul and D. Scheerer, On the physical design of PRAMs, Comput. J.
[11] H. Bast and T. Hagerup, Fast parallel space allocation, estimation and integer sorting (revised), Technical Report MPI-I, Max-Planck-Institut für Informatik, Saarbrücken, June.
[12] T. Hagerup, Fast deterministic processor allocation, in: Proc. Symp. on Discrete Algorithms, Austin, TX, Jan.


