Area complexity of merging  by Palko, Vladimir et al.
Theoretical Computer Science 82 (1991) 157-163
Elsevier 
Vladimir Palko, Ondrej Sykora and Imrich Vrťo
Institute of Technical Cybernetics, Slovak Academy of Sciences, Dúbravská cesta 9,
842 37 Bratislava, Czechoslovakia
Communicated by G. Mirkowska 
Received May 1989 
Revised December 1989 
1. Introduction 
During the ten years of existence of the theory of VLSI complexity, the greatest
attention has been devoted to the design of optimal algorithms for ordering problems,
including sorting, l-selection and merging. These problems played a key
role in the development of lower bound techniques with regard to the complexity
measures $A$ (area) and $AT^2$ (area-time squared tradeoff). At present the problem
of designing optimal VLSI sorting algorithms has been completely solved with respect
to both measures [3, 4, 6, 7, 9]. For instance, the area complexity of sorting $n$
elements, each represented by $k$ bits, is
$$A = \begin{cases} \Theta(n \log n) & \text{for } k \ge 2 \log n, \\ \Theta(\min\{2^k, n\}(|k - \log n| + 1)) & \text{for } k \le 2 \log n. \end{cases}$$
Similarly, there exist area-optimal VLSI algorithms for l-selection [8]. Some results
for this problem with respect to the $AT^2$ measure are in [10, 12].
Baudet and Chen [2] have investigated the problem of merging two n-element
sorted arrays of (c log n)-bit elements (c > 1). They have shown that
$AT^2 = \Omega(n^2 \log^2 n)$. They have also proved that $A = \Omega(\ldots)$. At the same time, they
posed as a challenge the more general problem of merging m-element and n-element
sorted arrays of k-bit elements.
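In this paper we derive optimal lower bounds for this problem with the
area complexity
$$A = \begin{cases} \Theta(m(\log n - \log m + 1)) & \text{for } k \ge \log n, \\ \Theta(\min\{2^k, m\}(|k - \log m| + 1)) & \text{for } k \le \log n, \end{cases}$$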
provided that $m \le n$. It follows from our result that merging is, in general, easier
than sorting (m + n)-element arrays. On the other hand, if $m = n$ and $k \le \log n$,
then the two problems have the same area complexity. Finally, our paper completes
the investigation of the area complexity of the problems of ordering.
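The comparison between merging and sorting can be tried out numerically. The following is only a sketch: it evaluates the growth terms inside the asymptotic expressions of the introduction with all constant factors dropped, and the function names `merging_area` and `sorting_area` are hypothetical helpers, not part of the paper.

```python
import math

def merging_area(m: int, n: int, k: int) -> float:
    """Growth term for merging m and n sorted k-bit elements (m <= n).

    Constant factors hidden by the asymptotic notation are dropped.
    """
    if k >= math.log2(n):
        return m * (math.log2(n) - math.log2(m) + 1)
    return min(2 ** k, m) * (abs(k - math.log2(m)) + 1)

def sorting_area(n: int, k: int) -> float:
    """Growth term for sorting n k-bit elements, constants dropped."""
    if k >= 2 * math.log2(n):
        return n * math.log2(n)
    return min(2 ** k, n) * (abs(k - math.log2(n)) + 1)
```

For m = n = 1024 and k = 5 both functions give the same value, while for m = 16, n = 1024, k = 20 the merging term is far smaller than the sorting term for 1040 elements.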
The paper is organized as follows. In the next section we define the problem, the 
model of computation and prove lower bounds for merging. In Section 3 we describe 
optimal upper bounds. 
Let $X = (x_1, x_2, \ldots, x_m)$, $Y = (y_1, y_2, \ldots, y_n)$ be two sorted arrays of k-bit
numbers in ascending order. The problem is to merge them into an array
$Z = (z_1, z_2, \ldots, z_{m+n})$.
Let $x_i = x_{i,k-1} \ldots x_{i,0}$, $i = 1, 2, \ldots, m$, denote the binary representation of $x_i$, where
$x_i = \sum_{j=0}^{k-1} 2^j x_{i,j}$. The integers $y_i$, $i = 1, 2, \ldots, n$, and $z_i$, $i = 1, 2, \ldots, m + n$, are
represented in a similar way. Without loss of generality, suppose that m and n are powers
of 2 and $m \le n$.
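As a plain software reference for the problem statement (not a VLSI circuit), the bit representation and the merging task can be sketched as follows. The helper names `to_bits`, `from_bits` and `merge` are illustrative assumptions.

```python
def to_bits(value: int, k: int) -> list[int]:
    """Binary representation [x_{k-1}, ..., x_0] of a k-bit number."""
    return [(value >> j) & 1 for j in range(k - 1, -1, -1)]

def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits: the sum of 2^j * x_j over all bit positions j."""
    k = len(bits)
    return sum(bit << (k - 1 - pos) for pos, bit in enumerate(bits))

def merge(xs: list[int], ys: list[int]) -> list[int]:
    """Two-pointer merge of two ascending arrays into one ascending array."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            out.append(xs[i])
            i += 1
        else:
            out.append(ys[j])
            j += 1
    out.extend(xs[i:])  # at most one of the two tails is non-empty
    out.extend(ys[j:])
    return out
```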
We assume the standard model of VLSI computation [11], from which we underline
only two basic properties necessary for deriving lower bounds: semelectivity (each
input variable is read in exactly once) and time determinate input and output (the times
at which the inputs are supplied and the outputs are delivered are fixed and independent
of the input values).
Theorem. Any semelective, time determinate VLSI circuit for merging m-element and
n-element sorted arrays of k-bit integers has area
$$A = \begin{cases} \Omega(m(\log n - \log m + 1)) & \text{for } k \ge \log n, \\ \Omega(\min\{2^k, m\}(|k - \log m| + 1)) & \text{otherwise.} \end{cases}$$
Proof. We apply the standard lower bound techniques proposed in [1, 5]. First
assume that $2 \le k \le \log n$ and $m \ge 4$. We claim that each output variable $z_{i,0}$,
$\frac{1}{2}m + 1 \le i \le n$, functionally depends on each input variable $x_{j,l}$, $1 \le j \le \frac{1}{2}m$,
$1 \le l \le k - 1$, i.e. there exist two assignments of values to the input variables differing
only in the bit $x_{j,l}$ such that the variable $z_{i,0}$ takes different values. Set
$$x_{p,l} = 1 \text{ for } p \ge j + 1, \qquad x_{p,0} = 1 \text{ for } 1 \le p \le m,$$
$$y_{p,l} = 1 \text{ for } p \ge i - j + 1.$$
The remaining variables, except $x_{j,l}$, are set to zero. Now
if $x_{j,l} = 0$ then $z_{i,0} = 1$; if $x_{j,l} = 1$ then $z_{i,0} = 0$.
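The functional dependency can be checked by brute force on small parameters. The sketch below assumes the assignment is read as: bit l of $x_p$ is 1 for p ≥ j + 1, bit 0 of every $x_p$ is 1, bit l of $y_p$ is 1 for p ≥ i − j + 1, and all other bits except $x_{j,l}$ are 0. The names `z_bit0` and `check_dependency` are hypothetical.

```python
def z_bit0(m: int, n: int, i: int, j: int, l: int, free_bit: int) -> int:
    """Least significant bit of z_i for the assignment above (1-based i, j)."""
    xs = []
    for p in range(1, m + 1):
        value = 1                      # bit 0 of every x_p is set
        if p >= j + 1 or (p == j and free_bit):
            value |= 1 << l            # bit l: fixed for p > j, free for p = j
        xs.append(value)
    ys = [(1 << l) if p >= i - j + 1 else 0 for p in range(1, n + 1)]
    assert xs == sorted(xs) and ys == sorted(ys)   # inputs must be sorted
    merged = sorted(xs + ys)
    return merged[i - 1] & 1

def check_dependency(m: int, n: int, k: int) -> bool:
    """True if flipping x_{j,l} flips z_{i,0} for all i, j, l in the claimed ranges."""
    for i in range(m // 2 + 1, n + 1):
        for j in range(1, m // 2 + 1):
            for l in range(1, k):
                if z_bit0(m, n, i, j, l, 0) != 1 or z_bit0(m, n, i, j, l, 1) != 0:
                    return False
    return True
```

For example, `check_dependency(4, 8, 3)` exercises every admissible triple (i, j, l) for m = 4, n = 8, k = 3.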
Consider the time t at which the last input variable from the set
$\{x_{j,l} : 1 \le j \le \frac{1}{2}m,\ 1 \le l \le k - 1\}$ was read in. According to the above functional
dependency and the time determinate assumption, each output variable $z_{i,0}$,
$\frac{1}{2}m + 1 \le i \le n$, must be delivered after the time t.
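Consider all problem instances with the following fixed assignment of values to
some variables: set
$$x_{i,0} = 1 \text{ for } 1 \le i \le m;$$
$$x_{i,l} = 1 \text{ for } \tfrac{1}{2}m + 1 \le i \le m,\ 1 \le l \le k - 1;$$
$$y_{i,l} = 0 \text{ for } 1 \le i \le \tfrac{1}{2}m,\ 0 \le l \le k - 1;$$
$$y_{i + \frac{1}{2}m} = 2(i - 1) \text{ for } 1 \le i \le 2^{k-1};$$
$$y_{i,l} = 1 \text{ for } 1 + \tfrac{1}{2}m + 2^{k-1} \le i \le n,\ 0 \le l \le k - 1$$
(see Fig. 1). Now if we assign to the shaded variables, i.e. $x_{i,l}$ for $1 \le i \le \frac{1}{2}m$, $1 \le l \le k - 1$,
all permissible values, then the output vector $z_{\frac{1}{2}m+2,0}, \ldots, z_{m+2^{k-1},0}$ ranges over all
permutations of the multiset consisting of $\frac{1}{2}m$ ones and $2^{k-1} - 1$ zeros. The number
of such permutations is
$$N = \binom{\frac{1}{2}m + 2^{k-1} - 1}{\frac{1}{2}m}.$$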
Since the variables $z_{i,0}$, $\frac{1}{2}m + 2 \le i \le m + 2^{k-1}$, must be output after the time t, the
circuit must be able to distinguish between N distinct states at the time t. Therefore
the circuit contains at least log N memory bits:
$$A = \Omega(\log N).$$
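The count N can be confirmed by enumeration for small parameters. This sketch rests on one reading of the fixed assignment, which is an assumption: the first m/2 elements of X are arbitrary odd numbers in nondecreasing order, the remaining elements of X equal 2^k − 1, and Y consists of m/2 zeros, then the even numbers 0, 2, ..., 2^k − 2, then copies of 2^k − 1. The function name `count_output_patterns` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def count_output_patterns(m: int, n: int, k: int) -> int:
    """Number of distinct vectors (z_{m/2+2,0}, ..., z_{m+2^(k-1),0})."""
    half = m // 2
    top = 2 ** k - 1
    ys = [0] * half + [2 * (i - 1) for i in range(1, 2 ** (k - 1) + 1)]
    assert len(ys) <= n, "needs n >= m/2 + 2^(k-1)"
    ys += [top] * (n - len(ys))
    odd_values = range(1, 2 ** k, 2)
    patterns = set()
    # Each nondecreasing choice of m/2 odd values is one setting of the shaded bits.
    for shaded in combinations_with_replacement(odd_values, half):
        xs = list(shaded) + [top] * (m - half)
        merged = sorted(xs + ys)
        window = merged[half + 1 : m + 2 ** (k - 1)]   # 1-based positions m/2+2 .. m+2^(k-1)
        patterns.add(tuple(value & 1 for value in window))
    return len(patterns)
```

Under these assumptions the result agrees with `comb(m // 2 + 2 ** (k - 1) - 1, m // 2)`, for instance 10 for m = 4, n = 8, k = 3.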
(a) If $k \ge \log n$ then the area necessary for merging k-bit elements is bounded
from below by the area necessary for merging (log n)-bit elements. Therefore
$$A = \Omega\left(\log \binom{\frac{1}{2}m + \frac{1}{2}n - 1}{\frac{1}{2}m}\right) = \Omega(m(\log n - \log m + 1)).$$
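A quick numerical look at case (a): the sketch below compares the logarithm of the binomial coefficient for k = log n with the term m(log n − log m + 1). It only illustrates that the two quantities stay within a constant factor on a few sample values and is not a proof. The names `log_states` and `claimed_order` are hypothetical.

```python
from math import comb, log2

def log_states(m: int, n: int) -> float:
    """log2 of the binomial coefficient C(m/2 + n/2 - 1, m/2)."""
    return log2(comb(m // 2 + n // 2 - 1, m // 2))

def claimed_order(m: int, n: int) -> float:
    """The term m(log n - log m + 1) from the lower bound."""
    return m * (log2(n) - log2(m) + 1)

def ratio(m: int, n: int) -> float:
    return log_states(m, n) / claimed_order(m, n)
```

On the samples (m, n) = (8, 8), (4, 8) and (16, 1024) the ratio lies between 0.25 and 1.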







(c) If $k \le \log m$ then
$$A = \Omega\left(\log \binom{\frac{1}{2}m + 2^{k-1} - 1}{2^{k-1} - 1}\right) = \Omega(2^k(\log m - k + 1)).$$
In the remaining special cases (m = 1, 2, 3 with k, n arbitrary, and k = 1 with m, n arbitrary),
one can easily prove $A = \Omega(\min\{k, \log n + 1\})$ and $A = \Omega(\log m + 1)$, respectively, using the above
method. □
In this section we shall describe two merging circuits depending on the relative 
size of k and log n. 
(I) In the case $k \le \log n$, the circuit is based on the idea of the classical "insertsort".
Consider a one-dimensional array M of size m. Initially, the array is filled with the
integers of the sequence X in nondecreasing order. The merging algorithm consists
of n phases. In the ith phase, the integer $y_{n-i+1}$ is inserted into the array, preserving
the order, and the greatest of the (m + 1) integers under consideration is released.
After each phase the array M holds the m smallest of all integers read in
so far. A straightforward implementation of the array M requires O(mk) bits.
Instead we employ a more efficient way [7, 8] of storing m k-bit integers using only
$$O(\min\{2^k, m\}(|k - \log m| + 1))$$
bits. We keep the m k-bit integers in the form of a string. Every item of the string
is composed of two variable-length numbers, DELTA and COUNT. DELTA
represents the difference between the number and its predecessor; COUNT indicates
the multiplicity of its occurrences.
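The DELTA/COUNT string can be illustrated in software. The sketch below assumes that the predecessor of the first item is 0 and ignores the delimiters a real circuit would need between variable-length fields. The names `encode`, `decode` and `payload_bits` are hypothetical.

```python
def encode(values: list[int]) -> list[tuple[int, int]]:
    """Turn a nondecreasing list into (DELTA, COUNT) items."""
    items: list[tuple[int, int]] = []
    previous = 0                      # assumed predecessor of the first number
    for value in values:
        if items and value == previous:
            delta, count = items[-1]
            items[-1] = (delta, count + 1)   # one more occurrence of the same number
        else:
            items.append((value - previous, 1))
            previous = value
    return items

def decode(items: list[tuple[int, int]]) -> list[int]:
    """Inverse of encode."""
    values, current = [], 0
    for delta, count in items:
        current += delta
        values.extend([current] * count)
    return values

def payload_bits(items: list[tuple[int, int]]) -> int:
    """Bits used by the variable-length fields alone (delimiters not counted)."""
    return sum(max(1, d.bit_length()) + max(1, c.bit_length()) for d, c in items)
```

For the list [2, 2, 5, 5, 5, 9] the items are (2, 2), (3, 3), (4, 1), which take 12 payload bits, compared with 24 bits for six 4-bit numbers stored directly.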
The circuit which realizes the above algorithm consists of a control unit containing
a program, an ALU and a shift register storing the string of size
$$O(\min\{2^k, m\}(|k - \log m| + 1)).$$
Clearly, the areas of the control unit and the ALU do not exceed the area of the shift
register which, in turn, is $O(\min\{2^k, m\}(|k - \log m| + 1))$.
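The phase structure of the insertion-based merging in case (I) can be simulated in software. This is a sketch at the level of whole integers, not of the shift-register circuit. Draining the array M in decreasing order after the last phase is an assumption, since the text only describes the n insertion phases. The name `insertion_merge` is hypothetical.

```python
import bisect

def insertion_merge(xs: list[int], ys: list[int]) -> list[int]:
    """Merge two ascending lists by inserting the elements of ys, largest first."""
    store = list(xs)                  # the array M, initially the sequence X
    released = []                     # outputs in the order they leave the array
    for y in reversed(ys):            # phase i inserts y_{n-i+1}
        bisect.insort(store, y)       # insertion that preserves the order
        released.append(store.pop())  # release the greatest of the m + 1 integers
    released.extend(reversed(store))  # assumed final step: empty M, largest first
    return released[::-1]             # outputs were produced in decreasing order
```

After each phase `store` holds the m smallest integers seen so far, so the released values form a nonincreasing sequence.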
(II) Let $k \ge \log n$. In this case we apply the idea of "radix sort", i.e. the
merged integers are processed bit level by bit level. Consider m blocks of memory
$S_1, S_2, \ldots, S_m$. The block $S_i$, $1 \le i \le m$, stores a number $a_i$, which indicates the
current number of elements of Y less than $x_i$ and greater than $x_{i-1}$; a number $b_i$,
which indicates the current number of elements of Y equal to $x_i$; the current bit of
$x_i$ (i.e. $x_{i,j}$ if the jth most significant bits are processed); a bit $t_i$, which is set to 1 if
$x_i \ne x_{i+1}$ for $i \le m - 1$ with regard to the bit levels read in so far, and otherwise $t_i = 0$
(for i = m, $t_m = 1$ constantly); and a bit $u_i$, which is set to 1 if the current bit of $x_i$ was
read into $S_i$ and reset to 0 if $x_i$ was released.
The algorithm consists of k phases. In the jth phase the (k - j)th bits of the elements of
X and Y are processed, the (k - j)th bits of the elements of Z are produced
and $a_i$, $b_i$, $t_i$, $u_i$ are updated.
The algorithm is described in a high-level language as follows: 
procedure RADIX MERGE
begin {Initialization}
  for each i (1 ≤ i ≤ m) do a_i := 0, b_i := 0, t_i := 0; b_m := log n, t_m := 1;
end
for j := k-1 step -1 to 0 do
begin {the jth phase}
  for each i (1 ≤ i ≤ m) do STORE(x_{i,j} into S_i), u_i := 1;
  for each i (1 ≤ i ≤ m-1) do if x_{i,j} ≠ x_{i+1,j} then t_i := 1;
  for l := n step -1 to 1 do
  begin
    INPUT(y_{l,j});
    {Henceforth we will use the symbol D_r instead of \sum_{i=1}^{r} (a_i + b_i)}
    if l > D_m then OUTPUT(y_{l,j});
    else
    begin
      FIND(the minimum index r such that l ≤ D_r);
      if y_{l,j} < x_{r,j} and D_r - b_r < l < D_r then a_r := l - D_{r-1}, b_r := D_r - l;
      if l = D_r then
      begin
        if b_r ≠ 0 and y_{l,j} < x_{r,j} then a_r := a_r + b_r, b_r := 0;
        if b_r ≠ 0 and y_{l,j} > x_{r,j} and t_r = 1 then
        begin
          if r = m then b_r := b_r - 1;
          else
          begin
            b_r := b_r - 1, a_{r+1} := a_{r+1} + 1;
            if there exists s ≥ 1 such that u_{r+s} = 1 and
               u_{r+s+1} = 0 then
            begin
              ... that q > r and t_q = 1);
              b_r := b_r - 1;
              if there exists s ≥ 1 such that u_{...} = 1 and
                 u_{...} = 0 then
              begin
                OUTPUT(x_{q+s,j}, ...
              ... 1 such that u_{r+s} = 1 and
              OUTPUT(x_{r+s,j}, ..., x_{r+1,j});
              for each i (1 ≤ i ≤ s) do u_{r+i} := 0;
            end
end
The circuit realizing this algorithm consists of a control unit containing the
program, an ALU and a shift register storing the blocks $S_1, \ldots, S_m$. Again, as above,
the areas of the control unit and the ALU do not exceed the area of the shift register.
The area of the shift register depends linearly on its length. If the numbers $a_i$, $b_i$
are stored as variable-length variables, then their representation requires
$\lceil \log(a_i + 1) \rceil$ and $\lceil \log(b_i + 1) \rceil$ bits, respectively. Each block $S_i$ can be represented
in the shift register by
$$c\,(\lceil \log(a_i + 1) \rceil + \lceil \log(b_i + 1) \rceil + 2)$$
bits. The constant c expresses the number of bits necessary for coding delimiters
between the numbers $a_i$, $b_i$, $t_i$, $u_i$. The entire length of the shift register is at most
$$c \sum_{i=1}^{m} (\lceil \log(a_i + 1) \rceil + \lceil \log(b_i + 1) \rceil + 2) \le c \sum_{i=1}^{m} (\log((a_i + 1)(b_i + 1)) + 4).$$
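The length estimate can be checked numerically for concrete counters. The sketch below takes the per-block cost as ⌈log(a_i + 1)⌉ + ⌈log(b_i + 1)⌉ + 2 and sets the delimiter constant c to 1, which is an assumption made only for illustration. The function names are hypothetical.

```python
import math

def block_bits(a: int, b: int) -> int:
    """ceil(log2(a + 1)) + ceil(log2(b + 1)) + 2, computed exactly.

    For a nonnegative integer a, ceil(log2(a + 1)) equals a.bit_length().
    """
    return a.bit_length() + b.bit_length() + 2

def register_length(a_list: list[int], b_list: list[int]) -> int:
    """Sum of the block sizes over all blocks (delimiter constant c = 1)."""
    return sum(block_bits(a, b) for a, b in zip(a_list, b_list))

def relaxed_bound(a_list: list[int], b_list: list[int]) -> float:
    """Sum of log2((a_i + 1)(b_i + 1)) + 4 over all blocks."""
    return sum(math.log2((a + 1) * (b + 1)) + 4 for a, b in zip(a_list, b_list))
```

For a = (0, 3, 4) and b = (5, 0, 1) the register length is 15 bits, which stays below the relaxed bound, as each ceiling exceeds its argument by less than 1.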
Acknowledgment
The authors are grateful to the Institute of Informatics of Warsaw University 
where this research was partially done. 
References
[1] G.M. Baudet, On the area required by VLSI circuits, in: VLSI Systems and Computations (Computer
Science Press, Rockville, 1981) 100-107.
[2] G.M. Baudet and Wen Chin Chen, Area-time tradeoffs for merging, in: Proc. VLSI: Algorithms
and Architectures (North-Holland, Amsterdam, 1985) 61-68.
[3] G. Bilardi and F.P. Preparata, The influence of key length on the area-time complexity of sorting,
in: Proc. 12th ICALP (1985).
[4] G. Bilardi and F.P. Preparata, Area-time lower bound technique with application to sorting,
Algorithmica 1 (1) (1986) 65-91.
[5] R.P. Brent and H.T. Kung, The chip complexity of binary arithmetic, J. ACM 28 (3) (1981) 521-534.
[6] R. Cole and A.R. Siegel, On information flow and sorting: new upper and lower bounds for VLSI
circuits, in: Proc. 26th FOCS, Portland, OR (1985) 208-221.
[7] P. Ďuriš, O. Sykora, C.D. Thompson and I. Vrťo, Tight chip area bounds for sorting, Comput.
and Artificial Intelligence 4 (6) (1985) 535-544.
[8] P. Ďuriš, O. Sykora, C.D. Thompson and I. Vrťo, A minimum area for l-selection, Algorithmica 2
(2) (1987) 251-265.
[9] A.R. Siegel, Minimum storage sorting networks, IEEE Trans. Comput. 34 (4) (1985) 355-361.
[10] C.D. Thompson and H. Yasuura, On the area-time optimal design of l-selectors, in: Proc. Asilomar
Conf. on Circuits, Systems and Computers (1985).
[11] J.D. Ullman, Computational Aspects of VLSI (Computer Science Press, Rockville, 1984).
[12] I. Vrťo, Area-time tradeoffs for selection, in: Proc. Parallel Algorithms and Architectures, Lecture
Notes in Computer Science 269 (Springer, Berlin, 1987) 163-168.
