A Parametrized Sorting System for a Large Set of k-bit Elements
Alexander Gamkrelidze and Thomas Burch
Technical Report A [...]
Department of Computer Science, University of Saarland
Saarbrücken, Germany
e-mail: sandro@cs.uni-sb.de, burch@cs.uni-sb.de
A Parametrized Sorting System for a Large Set of k bit Elements

Alexander Gamkrelidze and Thomas Burch
Department of Computer Science University of Saarland
 Saarbr	ucken Germany
e
 mail sandrocsuni
sbde burchcsuni
sbde
Abstract
In this paper we describe a parametrized sorting system for a large set of k-bit elements. The structure of the system is independent of the problem size (the number of elements to be sorted) and of the type of the sorting set (for example, a set of k-bit numbers, an alphabetical list of k-bit words, etc.), as well as of the ordering relation defined on the set of elements (such as ascending or descending order of k-bit numbers, or a specific order of alphabetical words).
The general structure of the underlying parallel network is based on the n-dimensional hypercube. The construction of the node circuit defines the type of the sorting elements, thus defining the semantics of the system. The structure of the circuit implements the Columnsort algorithm introduced by Leighton in [Lei]. By changing only one subcircuit of size O(k) in the node, we can define different ordering relations on the sorted elements. The system is based on specific VLSI chips that were developed in [Gam] with the CAD system Cadic [Bur], which was developed in the project B[...] "VLSI design systems and parallelity" under the guidance of Prof. G. Hotz.
The result is a fast system that sorts sets of up to [...] [...]-bit numbers. The maximal sorting time is less than [...] seconds, which is better than some of the fastest software realizations implemented on the [...]-processor Paragon [Hard], the Cray Y-MP [ZagBlel] and the MasPar MP-[...] [BrockWan].
Key words: sorting, n-dimensional hypercube, bi-categorial calculus, logic-topological net, graphical user interface
I Introduction
The classical calculus for dealing with logical circuits the Boolean Algebra was sucient as
long as the cost of wires were negligible compared with the cost of gates Since this is no
longer the case in integrated circuit design new calculus has been developed the bi
categorial
calculus Mol  where the design is represented by its logical function as well as by some
information about the geometrical arrangement of its components The rst step to extend
the Boolean Algebra to the bi
categorial calculus was given by the introduction of x
category
by G Hotz in  Hot  and is described in HotRe 
Consider circuits laid out in a rectangle R. In order to suppress the geometrical and physical details of manufacturing processes, and thus to become independent of technology, we forget the width and the layer of wires. In doing so, wires become simple lines which may branch and cross one another. Furthermore, we suppose that the circuit is constructed from cells which compute digital values. Assuming that these cells are physically correctly designed, we suppress their internal structure and size and maintain only the order of the external connectors on their boundaries. If we also consider crossings and branchings of wires as cells, which perform crossings and branchings of signals, this abstraction results in an arrangement of cells in the plane whose interconnections consist of crossing-free, non-overlapping lines (Fig. [...]).
Now we dene for each cell a northern southern eastern and western side on which connectors
are placed note that no connector should belong to a corner We denote for a cell A the
number of connectors onto the northern southern eastern and western side by NA SA

Supported by the DFG SFB  VLSI design methods and parallelism
Fig 
EA WA To suppress precise geometrical relations of this abstract layout we consider
two such layouts to be equivalent i they can be transformed into each other by a sequence
of deformations that maintain the planar topological structure of the layout We call a set
of nets that can be transformed into each other by a sequence of such deformations a logic
topological net the elements of which are called topographical representatives of logic
topological net
The advantage of a logic-topological net is that it gives a precise characterization of an integrated circuit which is sufficiently abstract to suppress geometrical and physical details and sufficiently concrete to control the arrangement of cells and the global routing of wires. A detailed and precise theoretical background of this calculus is given in [Mol]. If we relate to each cell its behavior by a Boolean function (or a more general model), we also get a precise mathematical characterization of the behavior of logical nets [Kol], [Mol].
Having the components logic topological nets we can now dene the operations between
them namely the compositions of nets which are dened by abutment of topographical repre

sentatives There are two kinds of compositions namely the horizontal composition
e
left
ftom and the vertical composition
e
above The composition N

e
N

is dened for two
nets N

 N

i there are two topographical representatives of N

 N

so that the southern side
of N

matches the northern side of N

 his operation can be carryed out i SN

NN


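Consider the construction of a full adder as shown in Fig. [...].
Fig. [...]
In the bi-categorial calculus it can be described as follows:
[Two defining expressions: the half adder HA, composed from an AND cell and an EXOR cell, and the full adder FA, composed from two HA cells, an OR cell and pass-through wires. The composition operators and wiring cells are not recoverable.]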
In the following example we consider the construction of the circuit shown in Fig  a
Fig 
In CADIC the VLSI design system that is based on the bi
categorial calculus it could be
realized as shown in Fig 
Fig 
Using the bi
categorial calculus these circuits can be described as follows
A  
e

e
AND An  An 
e
An 
The above example shows how compact hierarchical representations of parametrized circuits can be compared to traditional methods. To build a circuit of [...] elements with CADIC, one has to generate the circuit A[...].
Now consider the circuit shown in Fig. [...] b. This is the same circuit as in Fig. [...] a with the ANDs replaced by ORs. In CADIC it can be built by substituting only one element, AND with OR, in A[...]. That means that to change the semantics of a parametrized circuit in hierarchical representation, one has to modify only one element of the design. In a non-hierarchical design, up to [...] modifications would be necessary.
This advantage pays off in complex systems such as the sorting system for a large set of elements described in this work. Given the basic structure of an underlying graph, such as an n-dimensional hypercube, n × m-dimensional meshes, a butterfly network, etc., one can implement different parallel divide-and-conquer algorithms by constructing the specific node circuits.
As an example of the application of our system, we have constructed a sorting system for a large set of k-bit elements. By changing only one subcircuit of size O(k) in the design, and by automatically generating the whole system with Cadic, we can specify different sorting systems suited to different purposes, such as sorting a large set of k-bit numbers in ascending or descending order, or sorting alphabetical lists of k-bit entries in a specific order.
As the basic structure of our system we have chosen the n-dimensional hypercube, which is emerging as a popular network for parallel machines. For example, it is used in the Intel iPSC, the NCube of NCube Corporation and the Connection Machine. One of the key features of a hypercube is its rich interconnection structure, which permits many other important network topologies. Furthermore, the hypercube can be divided into subcubes, which supports the implementation of recursive divide-and-conquer algorithms.
However, this structure also has disadvantages in terms of poor scalability and exponential growth of the circuit. This raises the problem of circuit partitioning: the development of partial circuits on several chips and their assembly into a whole system on a specific board.
The system can be applied to sort up to [...] [...]-bit numbers. The time needed to sort [...] [...]-bit numbers is less than [...] sec, which is faster than the software realizations implemented on the [...]-processor Paragon [Hard], Cray Y-MP [ZagBlel] and MasPar MP-[...] [BrockWan] supercomputers.
This paper is organized as follows. Section II describes the general topological structure of an n-dimensional hypercube, independent of any further implementation of the network semantics. In section III the Columnsort algorithm is introduced. The problems of implementing Columnsort efficiently on the n-dimensional hypercube are discussed in section IV. Section V gives upper bounds for the growth of the time and area complexity of the whole system. The results of section V show that, for a large number of k-bit elements, the whole system cannot be implemented as one chip. That raises the problem of circuit partitioning, discussed in section VI, where we give a method of building a sorting system for a large number of k-bit keys. In section VII we give the runtimes of our system on different problem sizes and compare them with software realizations of sorting algorithms on different parallel supercomputers. Some of the most important parts of the circuit are introduced in section VIII, followed by the layouts of the [...]-dimensional hypercube system and its node.
II Description of the Network
In this section we give a parametrized description of the hypercube irrespective of the con

struction of the vertexes we describe the topology of the network based on the n
 dimensional
hypercube the semantic of which could be specied by the processors inserted into the ver

texes
The n-dimensional hypercube is represented as a graph $G_n = (V_n, E_n)$, where $V_n$ contains $2^n$ elements. A unique address is assigned to every node $v \in V_n$:
\[ \mathrm{addr}: V_n \to \{0, 1\}^n \]
Additionally, every vertex has degree n, and every edge between two vertices corresponds to a dimension d, $0 \le d < n$.
Two vertices $v, v' \in V_n$ with $\mathrm{addr}(v) = (a_{n-1}, \dots, a_0)$ and $\mathrm{addr}(v') = (a'_{n-1}, \dots, a'_0)$ are connected via an edge iff
\[ \exists!\, i \in \{0, \dots, n-1\}: \; a_i \neq a'_i \]
We assume that for the moment all the connections in the hypercube are of undened type
The type will be dened later depending on specic vertex circuit For example the edge of
type t could be a k
 bus a single
 or bidirectional connector etc For simplicity we assume
that all the edges are of the same type t
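The adjacency rule above can be illustrated with a short Python sketch. It assumes that node addresses are stored as n-bit integers, so that the neighbour along dimension d is obtained by flipping bit d. The function names `neighbours` and `edge_dimension` are illustrative and do not come from the paper.

```python
def neighbours(addr: int, n: int) -> list[int]:
    """Addresses adjacent to `addr` in the n-dimensional hypercube.

    Entry d of the result is the neighbour reached via the edge of dimension d.
    """
    return [addr ^ (1 << d) for d in range(n)]


def edge_dimension(u: int, v: int):
    """Dimension of the edge between u and v, or None if they are not adjacent.

    Two nodes are adjacent iff their addresses differ in exactly one bit.
    """
    diff = u ^ v
    if diff == 0 or diff & (diff - 1):   # zero or more than one differing bit
        return None
    return diff.bit_length() - 1
```

Every node has exactly n neighbours, one per dimension, which matches the statement that every vertex has degree n.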
In order to get a compact topological description of the hypercube with short connections between the vertices, we construct any n-dimensional hypercube by placing two (n−1)-dimensional subcubes beside or above one another and connecting the corresponding vertices via (n−1)-dimensional edges: if n is odd, the subcubes are placed beside one another, otherwise above one another. An example of the construction of a [...]-dimensional hypercube is shown in Fig. [...].
Fig. [...]: Construction of a [...]-dimensional hypercube
Applying this method, we can recursively describe any n-dimensional hypercube. The recursion terminates at the cube of dimension 0, the vertex of the hypercube (Fig. [...]).
Fig. [...]: The vertex of the hypercube
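The recursive placement can be sketched in Python as follows. This is only an illustration of the alternating "beside / above" rule, not the CADIC description: the grid coordinates, the choice that the second subcube receives the address bit of the new dimension, and the name `layout` are assumptions made for the example.

```python
def layout(n: int) -> dict[int, tuple[int, int]]:
    """Map each node address of the n-cube to a (column, row) grid position.

    For odd n the two (n-1)-dimensional subcubes are placed beside one
    another, for even n one above the other (assumed here to mean a larger
    row index). The recursion ends at the 0-dimensional cube, a single vertex.
    """
    if n == 0:
        return {0: (0, 0)}
    sub = layout(n - 1)
    width = 1 + max(x for x, _ in sub.values())
    height = 1 + max(y for _, y in sub.values())
    dx, dy = (width, 0) if n % 2 == 1 else (0, height)
    high = 1 << (n - 1)                  # address bit of the new dimension n-1
    placed = dict(sub)
    for addr, (x, y) in sub.items():
        placed[addr | high] = (x + dx, y + dy)
    return placed
```

With this placement, two nodes joined by an edge of even dimension lie in the same row and two nodes joined by an edge of odd dimension lie in the same column, so corresponding vertices of the two subcubes always face each other.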
In order to work for any n, the interface of the vertex must provide all the connections for the higher dimensions. To preserve the symmetry of the construction, the dimensions are attached alternately to the N/S and E/W sides. Moreover, the same dimensions are passed through the N/S and E/W sides of the circuit.
Consider another example: the construction of a 3-dimensional hypercube.
Example [...]: Construction of a 3-dimensional hypercube
1. The 1-dimensional subcube is constructed by placing two 0-dimensional subcubes (vertices) one beside the other and connecting the [...]-dimensional edges, which are cut (projected) on the east side and the west side of the nodes, respectively.
Fig. [...]: The vertex (a) and the 1-dimensional subcube (b)
In Fig. [...] b the edges of dim [...] pass the vertices and are connected to the E/W sides of the subcube of dimension [...], providing the connections for later steps of the construction.
2. To construct a 2-dimensional hypercube, two 1-dimensional subcubes are placed one above the other and connected via dimension [...] (Fig. [...]).
Fig. [...]: Construction of a 2-dimensional hypercube
The edges of dimension [...] on the north side of the upper subcube and on the south side of the lower subcube are cut (note that the edges of dimension [...] were cut earlier and do not appear at this level).
3. In the last step of the iteration (Fig. [...]) there is no direct connection between the vertices via dimension [...]. It is still possible to connect the 2-dimensional subcubes directly by connecting their west and east sides, because the edges of dimension [...] were allocated correctly in the previous steps.
Fig. [...]: Construction of a 3-dimensional hypercube
As the previous example shows, in every step of the construction some edges must bypass the nodes and be connected to the N/S or E/W sides of the subcube. Until now this was done manually; that is, in every step each edge of this kind was passed through a specific node.
Fig. [...]: Preparation of the edges (a) and the [...]-dimensional subcube (b)
If we dene the vertex as in Fig a the process of the construction will be simplied in
terms of connecting all the edges at the EW resp NS sides of the subcubes the edges with
higher dimensions will be bypassed automatically Fig b Fig  shows a 
 dimensional
hypercube constructed with the vertexes dened as in Fig a
As shown in the examples some edges must be cuted in the respective steps of the iteration
This is provided by the circuits CutE CutW  CutN and CutS dened in Fig 
Fig  The circuit Cut
nmp

Fig  the four dimensional hypercube
CutS CutE and CutW are projecting the respective edges on the southern eastern and
western sides of the subcube and are constructed similarly to CutN  CutW is the 
o
rotated
circuit CutN  CutS  vertically mirrored CutN  CutE  horizontally mirrored CutW 
The preparation of the edges of the vertex could be described as in Fig 
Fig  The edge preparation circuits
The circuit V N respectively V S  is the circuit V E respectively VW  rotated 
o
clockwise
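Using the bi-categorial notation from [KMO], where $\leftrightarrow$ means "set beside" and $\updownarrow$ means "set above", we can write
\[
\mathrm{Cube}_{n|k} =
\begin{cases}
\mathrm{CutW}_{n,k,\lfloor k/2 \rfloor} \leftrightarrow \mathrm{Cube}_{n|k-1} \leftrightarrow \mathrm{Cube}_{n|k-1} \leftrightarrow \mathrm{CutE}_{n,k,\lfloor k/2 \rfloor} & \text{for odd } k, \\
\mathrm{CutN}_{n,k,\lfloor k/2 \rfloor} \updownarrow \mathrm{Cube}_{n|k-1} \updownarrow \mathrm{Cube}_{n|k-1} \updownarrow \mathrm{CutS}_{n,k,\lfloor k/2 \rfloor} & \text{for even } k
\end{cases}
\]
(here and below, in the notation $\mathrm{Cube}_{n|k}$, [...] $\lceil k/2 \rceil$ [...] $\lfloor k/2 \rfloor$).
Fig. [...] shows the construction of the subcube in an odd step k.
Fig. [...]: Construction of the subcube in the odd step k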
III Columnsort
In this section we describe a sorting algorithm Columnsort introduced by Leighton in Lei

For simplicity we describe the algorithm as a series of elementary matrix operations
Let Q be an r  s matrix of N numbers to be sorted where r  s  N  sjr and r 	 s 


Initially each entry of the matrix is one of the numbers to be sorted After completion of the
algorithm the i j entry   i  r   j  s of Q will contain the p
 th sorted number
  p  N where p  i" j  r
As an example we give a    matrix before and after sorting

B
B
B
B
B
B
B
B
B
B
B
B
B
B
B

  
  
  
  
  
  
  
  
  

C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A
Sort


B
B
B
B
B
B
B
B
B
B
B
B
B
B
B

  
  
  
  
  
  
  
  
  

C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A
Columnsort works in eight steps. In steps 1, 3, 5 and 7 it sorts the numbers within each column of the matrix. In steps 2, 4, 6 and 8 the entries of the matrix are permuted.
Steps 2, 4: The permutation of the matrix in step 2 corresponds to a "transpose" of the matrix. The entries are picked up column by column and then deposited row by row, always going from top to bottom in a column and from left to right in a row. The permutation in step 4 is the inverse of that in step 2.
\[
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,s} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,s} \\
\vdots & \vdots & & \vdots \\
a_{r,1} & a_{r,2} & \cdots & a_{r,s}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
a_{1,1} & a_{2,1} & \cdots & a_{s,1} \\
a_{s+1,1} & a_{s+2,1} & \cdots & a_{2s,1} \\
\vdots & \vdots & & \vdots \\
a_{r-s+1,s} & a_{r-s+2,s} & \cdots & a_{r,s}
\end{pmatrix}
\]
Note that these steps do not correspond to a true transposition unless they are applied to a square matrix, but for simplicity we still call them "transpose" and "retranspose".
Steps 6, 8: The permutation in step 6 corresponds to a $\lfloor r/2 \rfloor$-shift of the entries. In this step the r × s matrix is transformed into an r × (s+1) matrix:
\[
\begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,s} \\
\vdots & \vdots & & \vdots \\
a_{\lfloor r/2 \rfloor,1} & a_{\lfloor r/2 \rfloor,2} & \cdots & a_{\lfloor r/2 \rfloor,s} \\
a_{\lfloor r/2 \rfloor+1,1} & a_{\lfloor r/2 \rfloor+1,2} & \cdots & a_{\lfloor r/2 \rfloor+1,s} \\
\vdots & \vdots & & \vdots \\
a_{r,1} & a_{r,2} & \cdots & a_{r,s}
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
-\infty & a_{\lfloor r/2 \rfloor+1,1} & \cdots & a_{\lfloor r/2 \rfloor+1,s-1} & a_{\lfloor r/2 \rfloor+1,s} \\
\vdots & \vdots & & \vdots & \vdots \\
-\infty & a_{r,1} & \cdots & a_{r,s-1} & a_{r,s} \\
a_{1,1} & a_{1,2} & \cdots & a_{1,s} & +\infty \\
\vdots & \vdots & & \vdots & \vdots \\
a_{\lfloor r/2 \rfloor,1} & a_{\lfloor r/2 \rfloor,2} & \cdots & a_{\lfloor r/2 \rfloor,s} & +\infty
\end{pmatrix}
\]
The permutation in step 8 is the inverse of that in step 6. We call these steps "shift" and "reshift". For short, we call steps 2, 4, 6 and 8 the "permutation" steps.
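The eight steps can be summarized in a compact Python sketch. It is a sequential illustration of Leighton's algorithm as described above, not the hardware implementation of this paper. The matrix is assumed to be given as a flat list in column-major order, the shift uses $\lfloor r/2 \rfloor$ padding entries of $-\infty$ in front and the remaining $+\infty$ entries at the end, and the name `columnsort` is chosen for the example.

```python
import math


def columnsort(values, r, s):
    """Sort r*s values with the eight steps of Columnsort; returns a sorted list."""
    assert len(values) == r * s and r % s == 0 and r >= 2 * (s - 1) ** 2
    # The matrix is kept column-wise: cols[j][i] is the entry in row i, column j.
    cols = [sorted(values[j * r:(j + 1) * r]) for j in range(s)]       # step 1

    # Step 2 ("transpose"): pick up column by column, deposit row by row.
    flat = [x for col in cols for x in col]
    cols = [[flat[i * s + j] for i in range(r)] for j in range(s)]
    cols = [sorted(col) for col in cols]                               # step 3

    # Step 4 ("retranspose"): pick up row by row, deposit column by column.
    flat = [cols[j][i] for i in range(r) for j in range(s)]
    cols = [flat[j * r:(j + 1) * r] for j in range(s)]
    cols = [sorted(col) for col in cols]                               # step 5

    # Step 6 ("shift"): shift by floor(r/2) into an r x (s+1) matrix.
    h = r // 2
    flat = [-math.inf] * h + [x for col in cols for x in col] + [math.inf] * (r - h)
    cols = [sorted(flat[j * r:(j + 1) * r]) for j in range(s + 1)]     # step 7

    # Step 8 ("reshift"): undo the shift by dropping the padding entries.
    return [x for col in cols for x in col][h:h + r * s]
```

The returned list is the sorted matrix read column by column, which corresponds to the position rule for the p-th sorted number given at the beginning of this section.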
The following example shows a step-by-step application of Columnsort to a 9 × 3 matrix:
[Example: nine matrices connected by the operations Sort, transpose, Sort, retranspose, Sort, shift, Sort and reshift, in this order; the matrix entries are not recoverable.]

IV Implementation of Columnsort
To implement Columnsort on the n
 dimensional hypercube we proceed as follows
 Every vertex of the n
 dimensional hypercube corresponds to a column of the matrix Q
That means the number of columns in Q must be s  
n
 The elements of the columns
are stored in the memory of size r of corresponding vertexes
 Hence rjs and s 	 r  

must hold we assume r  
n

That means we have a 
n
 
n
matrix where 
n
elements will be sorted
In this section we represent the method of the implementation of each !sorting$ and !permu

tation$ steps of Columnsort
The !sorting$ steps are implemented with OddEvenMergeSort circuit Knu  contained in
each vertex Since every vertex sorts its data independent from his neighbours no problems
of synchronization arise
Problems arise when implementing the "permutation" steps. As an example, consider transposing a [...] matrix (Fig. [...]).
Fig. [...]: Connection problems while transforming a matrix
The columns [...] and [...] must exchange data with one another, but there is no direct connection between them. The solution is to send the data along the shortest path via the neighbours.
We apply the following transformation scheme:
1. Transform the matrix in n steps.
2. In each step, exchange data with the neighbour along one dimension, via the edge of that dimension.
In step $p \in \{1, \dots, n\}$ the data are exchanged between the vertices whose addresses differ only in the bit of dimension $n-p$, via the edge of dimension $n-p$, as follows:
The data in the memory of each vertex are divided into $2^{p-1}$ blocks. The vertices exchange the halves of each block with one another: the vertex with the lower address exchanges the lower half of each block with the upper half of the corresponding block of the vertex with the higher address.
As an example we transform a    matrix with this method Fig 
Step  p   divide each column in 

  blocks
Exchange data via dimension  node  with node  node  with node 
Vertexes with lower addresses  and  change lower halfs with upper halfs of
the vertexes with higher addresses  and 
Fig  Step  of transposition
Step  p   divide each column in 

  blocks
exchange data as described Fig 
Fig  Step  of transposition
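The exchange scheme can be checked with a small Python simulation. It is a sketch under stated assumptions: each node holds one column of a $2^n \times 2^n$ matrix, "upper half" is taken to mean the smaller row indices of a block, step p uses dimension $n-p$, and the name `hypercube_transpose` is illustrative.

```python
def hypercube_transpose(cols):
    """Transpose a 2^n x 2^n matrix stored one column per hypercube node.

    cols[c][i] is the entry in row i of the column held by node c.
    In step p the columns are split into 2^(p-1) blocks; across the edge of
    dimension n-p, the node with the lower address swaps the lower half of
    each block with the upper half of the corresponding block of its partner.
    """
    size = len(cols)
    n = size.bit_length() - 1
    assert size == 1 << n and all(len(col) == size for col in cols)
    cols = [list(col) for col in cols]
    for p in range(1, n + 1):
        half = 1 << (n - p)                          # half of a block
        for low in range(size):
            if low & half:
                continue                             # handled from the lower partner
            high = low | half                        # neighbour via dimension n-p
            for start in range(0, size, 2 * half):   # the 2^(p-1) blocks
                for t in range(half):
                    a = start + half + t             # lower half, node `low`
                    b = start + t                    # upper half, node `high`
                    cols[low][a], cols[high][b] = cols[high][b], cols[low][a]
    return cols
```

Each step swaps one address bit of the row index with the same bit of the column index, so after n steps every entry (i, c) has moved to (c, i). Since these bit swaps are independent, the order in which the dimensions are processed does not affect the result.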
The above method transposes an n × n matrix. To apply it to the r × s matrix, we apply it step by step to the s × s partial matrices.
As an example consider a    matrix to be transposed After applying the above method
on the upper and lower partial matrixes we get

B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B

a

b

c

d

a

b

c

d

a

b

c

d

a

b

c

d

a

b

c

d

a

b

c

d

a
	
b
	
c
	
d
	
a

b

c

d


C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A


B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B
B

a

a

a

a

b

b

b

b

c

c

c

c

d

d

d

d

a

a

a
	
a

b

b

b
	
b

c

c

c
	
c

d

d

d
	
d


C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
C
A

This matrix does not correspond the transposed matrix described in section  but as one can
see each column of it is a permutation of the corresponded column of Columnsort matrix in
step  Hence after applying the sorting algorithm in step  we get exactly the same matrix
as in Columnsort
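The claim that the two results agree column by column up to a permutation can be verified with a short Python sketch. Both helper functions are written for this illustration only: `columnsort_transpose` follows the "pick up column by column, deposit row by row" rule of section III, and `blockwise_transpose` transposes each s × s partial matrix separately.

```python
def columnsort_transpose(cols):
    """Step 2 of Columnsort; cols[j][i] is the entry in row i, column j."""
    r, s = len(cols[0]), len(cols)
    flat = [x for col in cols for x in col]          # picked up column by column
    return [[flat[i * s + j] for i in range(r)] for j in range(s)]


def blockwise_transpose(cols):
    """Transpose every s x s partial matrix of an r x s matrix (s divides r)."""
    r, s = len(cols[0]), len(cols)
    out = [[None] * r for _ in range(s)]
    for base in range(0, r, s):                      # one partial matrix per block
        for i in range(s):
            for j in range(s):
                out[j][base + i] = cols[i][base + j]
    return out


# The 8 x 4 example from the text, with entries written as strings.
example = [[f"{name}{i}" for i in range(1, 9)] for name in "abcd"]
same_columns = all(
    sorted(x) == sorted(y)
    for x, y in zip(blockwise_transpose(example), columnsort_transpose(example))
)
```

In both variants, column j collects exactly the entries whose original row index is congruent to j modulo s, which is why the subsequent sorting step yields identical matrices.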
To implement the "retransposition", we apply the above method of n × n matrix transposition to the whole matrix (Fig. [...]).
Fig. [...]: Retransposition
Note that, as with the "transposition", the matrix does not correspond to that of Columnsort, but after applying the sorting in step 5 we get the same result.
For the "shifting", the method of Fig. [...] is applied.
Fig. [...]: Shift
The nodes exchange data in ascending order of dimension; that is, in step k they exchange data with the (k−1)-th neighbour via the edge k−1. Note that in step k only the nodes with addresses of the form $(a_{n-1}, \dots, a_k, [\ldots])$ exchange data with their neighbours; the rest of them are blocked, with the exception of step [...], where all the nodes exchange data with their corresponding neighbours.
The principle of the "reshift" step (Fig. [...]) is similar to that of "shift", with some exceptions:
1. The nodes exchange the upper halves of their data with their neighbours, with the exception of node [...], which exchanges the lower half of its data with the upper half of its neighbour's data.
2. Only the nodes with addresses of the form $(a_{n-1}, \dots, a_k, [\ldots])$ exchange data; the rest are blocked.
Fig. [...]
One important point should be noted: normally, the elements in each node must be sorted. In node [...] of the "reshifted" matrix, however, the upper and lower parts are exchanged. This is taken into account in the data output algorithm described later in this work, so that the user gets the sorted elements in ascending order.
Input and Output
The input/output algorithm is realized as follows.
For a 1-dimensional cube:
1. Write (read) data to (from) the node 0.
2. Exchange data between the nodes 0 and 1.
3. Execute step 1.
This process can be generalized to an n-dimensional hypercube:
1. Write (read) data to (from) the lower (n−1)-dimensional subcube.
2. Exchange data between the (n−1)-dimensional subcubes.
3. Execute step 1.
Note that in step 3 no changes must be made in the higher (n−1)-dimensional subcube.
V Area and Time Complexity
In this section we give the upper bounds of the growth functions for time the depth and
area number of cells of the sorting circuit discussed in previous sections
Because of the specic construction the vertexes of the hypercube contain the memory cells

n
each It has the negative eect in terms of the area expansion but the alternative
construction would require at least 
n
 k pins the connection of each vertex to memory or
the system should be realized sequentially we assume that each number to be sorted consists
of k bits
A	 Time Complexity of the Circuit
First we give the time function of the Odd
Even
Merge
Sort shown in Fig 
Fig 
The circuit Mergen Fig  sorts two pre
 sorted sequences of elements Its components
are explained in Fig 
Fig 
The circuit Shufflen is the reverse of OEn CMP sorts two elements of its input
As one can easily see the depth i e time of Sortn could be calculated with the following
formula
T Sort
n
 
n  n" 


For Sort
n
 it is

Fig 
T Sort
n
 
n " n " 

 n" n "   n

" n " 
The upper bound of it is On

" n "   On


For N elements to be sorted the upper bound would be Olog

N
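The depth formula can be checked on small instances with the following Python sketch. It generates the comparators of Batcher's odd-even merge sort in the usual recursive way and computes the depth of the resulting network. It assumes that Sort(n) denotes the network for $2^n$ inputs and that every CMP cell costs one time unit; the function names are chosen for the example.

```python
def oddeven_merge(lo, hi, r):
    """Comparators merging the two sorted halves of wires lo..hi (inclusive)."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)        # even subsequence
        yield from oddeven_merge(lo + r, hi, step)    # odd subsequence
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def oddeven_merge_sort(lo, hi):
    """Comparators sorting wires lo..hi (inclusive); the length is a power of two."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort(lo, mid)
        yield from oddeven_merge_sort(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)


def depth(comparators, width):
    """Length of the longest chain of dependent comparators."""
    level = [0] * width
    for i, j in comparators:
        level[i] = level[j] = max(level[i], level[j]) + 1
    return max(level)


def apply_network(comparators, values):
    """Run the comparator network on a list of values."""
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v
```

For example, `depth(list(oddeven_merge_sort(0, 7)), 8)` evaluates the network for $2^3$ inputs, for which the formula gives $3 \cdot 4 / 2 = 6$.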
Let $\mathrm{Cube}_n$ be our sorting circuit realized as the n-dimensional hypercube. Then the time function is
\[ T(\mathrm{Chip}_n) = [\ldots] \cdot n \cdot 2^{2n+1} + [\ldots] \cdot T(\mathrm{Sort}_{2n+1}) \]
Its upper bound is $O([\ldots] \cdot n \cdot 2^{2n+1} + 2n^2 + 3n + 1) = O(n \cdot 2^{2n})$.
That means the circuit sorts $N = 2^{3n+1}$ elements in time $O(\sqrt[3]{N^2} \cdot \log N)$.
B. The Area Complexity
As described in [Gam], the circuit $\mathrm{Chip}_n$ consists of two parts: the n-dimensional hypercube and the logic unit. Hence the area function $C(\mathrm{Chip}_n)$ can be represented as
\[ C(\mathrm{Chip}_n) = C(\mathrm{Cube}_{n|n}) + C(\mathrm{LU}_n), \]
where $\mathrm{LU}_n$ is the logic unit of the system.
It is also shown in [Gam] that
\[ C(\mathrm{ST}_n) = [\ldots] \cdot n + [\ldots] \quad \text{and} \quad C(\mathrm{Cube}_{n|n}) = 2^n \cdot \left( 2^{[\ldots]} \cdot ([\ldots] n^2 + [\ldots] n + [\ldots]) + [\ldots] n + [\ldots] \right) \]
It follows that
\[ O(C(\mathrm{Chip}_n)) = O(n^2 \cdot 2^{3n}) \]
That means the upper bound of the area of the network that sorts $N = 2^{3n+1}$ elements is $O(N \cdot \log^2 N)$.
VI Circuit Partitioning
Because of the exponential growth of the network the circuit could be unrealizable for su

ciently large n
Denition 	
 Let Cube
njn
be a hypercube to be constructed A subcube Cube
njm
is called
a maximal subcube of n
 dimensional hypercube if following holds
Cube
njm
is realizable as one chip
Cube
njm
is not realizable as a chip
One method of constructing large hypercubes would be to construct their maximal subcubes and assemble them into a hypercube. Fig. [...] shows the construction of a [...]-dimensional hypercube from four [...]-dimensional subcubes.
Fig. [...]
A closer look at the system shows that this method is not feasible. As an example, we construct a [...]-dimensional hypercube from four [...]-dimensional subcubes. First we count the number of connections on the E/W and N/S sides of these subcubes.
As shown in [Gam], the number of pins of $\mathrm{Cube}_{n|m}$ can be calculated as follows:
[Formula for $P(\mathrm{Cube}_{n|m})$ in terms of $\lfloor n/2 \rfloor$, $\lceil n/2 \rceil$, $\lfloor m/2 \rfloor$, $\lceil m/2 \rceil$, $2^m$ and $n - m$; the operators and constants are not recoverable.]
In our case it is
[Value of $P(\mathrm{Cube}_{[\ldots]})$; not recoverable.]
In a [...]-bit model, the number of pins corresponding to the number of connections of the [...]-dimensional subcube chip is at least [...]. That means $\mathrm{Cube}_{[\ldots]}$ is not realizable because of the large number of its pins.
Since the only problem is the large number of pins, we can solve it by reducing the number of connections in $\mathrm{Cube}_n$.
To avoid the above problems, we proceed as follows.
We construct a maximal hypercube; for a [...]-bit sorting system it is a [...]-dimensional hypercube.
A [...]-dimensional hypercube sorts [...] elements; that means it can be used as a node of a [...]-dimensional hypercube. Figure [...] shows a possible scheme for the construction of a [...]-dimensional hypercube system. Each chip is regarded as a node of a [...]-dimensional hypercube, sorting its data independently of the others and exchanging it with its neighbours according to Columnsort. The data paths connecting the nodes of the cube with one another or with the outside world are set by additional circuits, which we denote by C. C is not realized as one chip because of its large number of pins, but for simplicity we still represent it as a unit.
In other words, we develop the same system as in the previous sections, except that it does not fit on one chip and is placed on a specific board.
Fig. [...]
The system can be enlarged further if we use a [...]-dimensional hypercube as a vertex of a hypercube with a dimension of up to [...]; up to [...] [...]-bit elements can then be sorted. We describe the method of constructing an n-dimensional hypercube (see Fig. [...]).
Fig. [...]
For the sake of parallelization we use at least [...] RAM blocks, matching the number of sorting chips ([...] [...]-bit words in total). These RAMs form the columns of the Columnsort matrix, as shown in Fig. [...].
The general algorithm is implemented as follows:
 for (i = 1; i <= 4; i++)
 {
   for (j = 1; j <= 2^n; j++)
   {
     Read the j-th column into the system
     Sort the read-in data
     Write the sorted data
   }
   /* all 2^n columns of the matrix are sorted */
   Exchange data between the columns according to Columnsort
 }
In other words, we read the data of each RAM (column), sort it and store it back into the RAM. Then we exchange data between the nodes according to Columnsort (note that some steps in this scheme, such as data exchange and read/write, could be parallelized).

VII Performance Results
In this section we give the sorting times of the whole system and compare it with other
software realizations on several parallel supercomputers
Our calculations are based on the implementation algorithm from previous section
The system could be even accelerated by the parallelization of the data readwrite and ex

change steps
 write the sorted data in two columns that must exchange data with one another
 while exchanging data between these two columns write data to another pare of columns
The parallelization requires exact analysis based on the timing diagrams of the selected RAM
modules so we do not discuss it in depth in this paper and present the sorting times of the
unparallelized system
The total sorting time of a non-parallelized n-dimensional system can be expressed by the following formula:
\[ T_n = [\ldots] \cdot 2^n \cdot (T_R + T_{\mathrm{sort}} + T_W) + [\ldots] \cdot 2^n \cdot T_{\mathrm{ex}}, \]
where $T_R$ and $T_W$ ($T_R = T_W = [\ldots]$) are the times needed to read (write) data from (to) the sorting chips, $T_{\mathrm{sort}}$ is the time needed to sort the data in a [...]-dimensional system, and $T_{\mathrm{ex}}$ is the time needed to exchange data between the nodes of the n-dimensional hypercube (the system to be developed).
It is easy to show that
\[ T_{\mathrm{sort}} = [\ldots] \cdot T_s + [\ldots], \]
where $T_s = [\ldots]$ is the sorting time of one sorting chip.
For $T_{\mathrm{ex}}$ we have $T_{\mathrm{ex}} = n \cdot [\ldots]$ ([...] elements exchanged in n steps).
Applying these formulae we get
\[ T_n = [\ldots] \cdot 2^n \cdot ([\ldots]) + [\ldots] \cdot 2^n \cdot n \cdot [\ldots] = 2^{n + [\ldots]} \cdot ([\ldots] \cdot n + [\ldots]) \]
So we have estimated the sorting time of the whole system:
\[ T_n = 2^{n + [\ldots]} \cdot ([\ldots] \cdot n + [\ldots]) \]
The following table shows the number of elements and the sorting times of various sorting systems (for the time calculations we assume a clock delay of [...] ns).
n       elements    time (sec)
[...]   2^20        0.118
[...]   2^21        0.249
[...]   2^22        0.525
[...]   2^23        1.101
[...]   2^24        2.307
[...]   2^25        4.824
[...]   2^26        10.067
[...]   2^27        20.972
[...]   2^28        43.621
In the following diagram we compare the sorting time of our system with that of Bitonicsort
and Samplesort algorithm realizations on the  processor Partysec GCel Dea we have
chosen the fastest implementation from the software realizations on the 
 processor Paragon
Hard Cray Y  MP ZagBlel MasPar MP   BrockWan and  processor Par

tysec GCel Dea
Because of the assumption of Columnsort section III to sort 
n
 l elements   l  
n

we build a sorting system of the size 
n
 that explains the discrete time graph in the diagram
[Figure: data points labeled 2^20 to 2^28 elements with value labels 0.118, 0.249, 0.525, 1.101, 2.307, 4.824, 10.067, 20.972 and 43.621; legend: our system, Bitonic Sort on Parsytec GCel, Samplesort on Parsytec GCel]
Fig. [?]

[Figure, panel (a): axes "time (sec.)" and "Processors" (ticks 1 to 1024); series: CRAY C 90, CRAY T3E, CRAY T3D, CRAY T916, Fujitsu VPP 700, IBM RS/6000 SP Wide-Node 1, our system; regions marked "Faster" and "Slower"]
[Figure, panel (b): axes "time (sec.)" and "Processors" (ticks 1 to 1024); series: CRAY C 90, CRAY T3E, CRAY T3D, CRAY T916, Fujitsu VPP 500, Fujitsu VPP 300, NEC SX 4/32, IBM RS/6000 SP Wide-Node 1, our system; regions marked "Faster" and "Slower"]
Fig. [?]
According to the benchmark results from the NASA Ames Research Center [BBDS], we
can compare our system with the realizations of the Integer Sort algorithm on different
supercomputers for the problem sizes [?] (Fig. [?]a) and [?] (Fig. [?]b).

VIII Important Circuits
In this section we describe some of the most important circuits of the sorting chip as developed
in CADIC Bur 
In several cases the diagrams do not match the circuits analyzed earlier because of the com

plicated wiring system
As basic elements we use the CMOS cells described in the following table
Cells Description
anda Boolean AND
idfra Flipop
iinva Boolean NOT
imuxb Multiplexer
Multiplexer with
imuxa
two inputs
iora Boolean OR
vddcont The !Lo! Sygnal
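For readers who prefer software notation, the logic functions of the combinational cells can be modeled on single bits as below. This is only a behavioral sketch: the Python function names mirror the cell names, but the argument order and the select polarity of the multiplexer are assumptions, and the flip-flop and vddcont cells are left out.

```python
def ianda(a, b):
    """Boolean AND on bits 0/1."""
    return a & b


def iora(a, b):
    """Boolean OR on bits 0/1."""
    return a | b


def iinva(a):
    """Boolean NOT on a bit 0/1."""
    return a ^ 1


def imuxa(select, d0, d1):
    """Two-input multiplexer built from AND, OR and NOT.

    Assumed polarity: select = 0 passes d0, select = 1 passes d1.
    """
    return iora(ianda(iinva(select), d0), ianda(select, d1))


if __name__ == "__main__":
    for select in (0, 1):
        print(select, imuxa(select, d0=0, d1=1))
```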

[Circuit diagram: blocks CCube and STE(n); signals Sort, WR, AA, EA, A, trans, Shift, End, In, Out, Res, CK, Ready, OK, Input, Output, L, H; bus widths n and "busbreite" (bus width)]
The circuit CHIP(n): the system at its highest hierarchy level, with the control unit STE(n).

Cubenm	 Cubenm	
CutOnmlowerm	
CutWnmlowerm	
r
vConnm	
r
vConnm	
r
n
r
hCm	
r
n
r
vConnm	
r
vConnm	
r
hCm	
r
inm	
r
busbreite
r
n
r

r
nm
r
nm
r
nm
r
inm	
r
n
r
inm	
r
busbreite
r

r

r

r
n
r
n
Construction of the odd subcube
	
VNlowern	
VOuppern	
VWuppern	
VSlowern	
Noden	
r
n
r
vcC
r
kukan	
r
dE
r
vC
r
vcC
r
dO
r
dE
r
dO
r
hcC
r
hC
r
r
dE
r
vC
r
hC
Preparation of the edges

Blockn	
SORTn	
RAMn	
DATn	 CADRn	
EA
CK
transADR
AA
Shift
SetRes
Res
In
ExOut
Sort 
 WR
WR
Sort
Inp
 WR
A
ExIn
zu H
r
n
r
n
r

r
n
r

r
n
r
busbreite
r

r
n
r
n
r

r
n
r
n
r
n
r
n
r
busbreite
r
busbreite
r
busbreite
r
n
r

r

r
n
r
n
r
busbreite
r
n
r
n
r
n
r

r

r
n
r
n
r
n
r
n
r

r
The node of the hypercube at the highest hierarchy level
The circuit CADRn Count Address contains the binary address of the node It determines
whether the data of the node must be shifted
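In an n-dimensional hypercube, two nodes are neighbors along dimension d exactly when their binary addresses differ in bit d. The sketch below illustrates this addressing rule and lists the node pairs that would exchange data along one dimension. The function names are hypothetical, and the code does not model the CADR circuit itself.

```python
def neighbor(address, dimension):
    """Address of the node reached by flipping bit `dimension`."""
    return address ^ (1 << dimension)


def exchange_pairs(n, dimension):
    """All unordered neighbor pairs of an n-cube along one dimension."""
    return [
        (a, neighbor(a, dimension))
        for a in range(2 ** n)
        if a < neighbor(a, dimension)   # list each pair only once
    ]


if __name__ == "__main__":
    for d in range(3):
        print(d, exchange_pairs(3, d))
```

Iterating d over 0 to n-1 visits every edge of the hypercube once, which is the sense in which an exchange over the whole cube takes n steps.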


imuxb
Blockn	
iinva
r
n
r
n
r
n
imuxb
iinva
Upper diagram Hierarchical representation of the circuit BLOCKn that determines the
dimension along which the data must be exchanged
Lower diagram Termination of the hierarchy of BLOCKn  BLOCK

RAMn	 RAMn	
r

r

r

r


n
r
n
r


n
r
n
imuxb idfraimuxa iora ianda
idfra iora iandaimuxaimuxb
CK 
SOut
SOut
WR
ExOUT
ExIn
WR
Sort 
 WR
SIn
SIn
CK 
Sort 
 WR
ExOut
Sort
ExIn
Sort
Upper diagram Hierarchical construction of the memory cells
Lower diagram Termination of the hierarchy two memory cells

DATn	
imuxa
r
n
r
n
r
n
imuxa
Upper diagram The circuit DAT n used to calculate the actual addresses of the data to be
shared using EA AA and a
j

Lower diagram Termination of the Hierarchy of the circuit DAT n  DAT 

[Circuit diagram: blocks CSTATUS and DRO(n); cells iinva, ianda, idfra, idrsa, iora, vddcont; signals CK, WR, Inputsig, OK, NInp, STATUS, Sort, NSo, Ntr, tr, Sh, StartOut, trans, EA, A, AA, Res, EndIn, Inputstart, Out, In, End; bus width n]
The control unit at the highest hierarchy level.
The circuit CSTATUS (Count Status) determines the state variables of the system.
DRO(n) determines the variables EA, A and AA used to calculate the addresses of the data to be
exchanged.
References
BBDS D H Bailey EBarszcz L Dagum and H D Simon
NAS Parallel Benchmark Results 

RNR Technical Report RNR


NASA Ames Research Center March 
BrockWan K Brockmann R Wanka
Ecient Oblivious Parallel Sorting on the MasPar MP  
In Proc th Hawaii International Conference on System Sciense HICSS IEEE Jan 
Bur Th Burch
Eine graphische Arbeitsumgebung fur den parametrisierten Entwurf integrierter Schaltungen
PhD thesis department of Computer Science University of Saarland 
Dea Ralf Dickmann et al
Sortieren gro&er Datenmengen auf einem massiv parallelen System
Gam A Gamkrelidze
Entwurf eines booleschen Sortiernetzes mit der Struktur eines n dimensionalen W	urfels
Masters thesis University of Saarland 
Hard J C Hardwick
An Ecient Implementation of Nested Data Parallelism for Irregular Divide
 and
 Conquer
Algorithms
In Proceedings of First International Workshop on High
 Level Programming Models and
Supportive Environments pp    Apr 
Hot  GHotz
Eine Algebraisierung des Syntheseproblems f	ur Schaltkreise
EIK   pp      
HotRe  G Hotz and A Reichert
Hierarchischer Entwurf komplexer Systeme
In I Wegener editor Highlights aus der Informatik Springer Verlag 
Kol  R Kolla
Spezikation und Expansion Logisch Topologischer Netze
PhD thesis Saarbr	ucken 
KMO R Kolla P Molitor H G Ostho
Einf	uhrung in den VLSI Entwurf Leitf	aden und Monographien der Informatik B G
Teubner Verlag Stuttgart 
Knu D E Knuth
Sorting and Searching The Art of Computer Programming Addison  Wesley 
Lei T Leighton
Tight Bounds on the Complexity of Parallel Sorting IEEE Transactions on Computers Vol
C
 April 
Mol  P Molitor
	
Uber die Bikategorie der Logisch
 Topologischen Netze und ihre Semantik
PhD thesis Saarbr	ucken 
ZagBlel M Zagha G Blelloch
Radix Sort for Vector Multiprocessors
Supercomputing!  November 
