Polymorphic arrays: A novel VLSI layout for systolic computers  by Fiat, Amos & Shamir, Adi
JOURNAL OF COMPUTER AND SYSTEM SCIENCES 33, 47-65 (1986) 
Polymorphic Arrays: 
A Novel VLSI Layout for Systolic Computers 
AMOS FIAT AND ADI SHAMIR 
Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel 
Received March 27, 1985; revised January 6, 1986 
This paper proposes a novel architecture for massively parallel systolic computers, which is 
based on results from lattice theory. In the proposed architecture, each processor is connected 
to four other processors via constant-length wires in a regular borderless pattern. The mapping
of processes to processors is continuous, and the architecture guarantees exceptional load
uniformity for rectangular process arrays of arbitrary sizes. In addition, no time-sharing is
ever required when the ratio of processes to processors is smaller than $1/\sqrt{5}$. © 1986 Academic
Press, Inc.
1. INTRODUCTION 
The declining cost of VLSI components and the recent advances in wafer-scale
integration make it technically possible and economically feasible to build computers
with hundreds of thousands of processors. Since each processor can be
physically connected only to a limited number of other processors, the choice of 
interconnection pattern is a crucial architectural decision. In this paper we propose 
a novel pattern based on results from lattice theory, which is efficient, easy to use, 
and optimally versatile (in a sense which will be made precise later). 
The interconnection pattern is designed to support systolic algorithms, which are 
implemented on a rectilinear grid of synchronized processors. A large number of 
such algorithms have been proposed in the literature, and they represent a very 
attractive and cost-effective solution to the problem of massive parallelism. Most 
algorithms are based on linear vectors, fixed-width strips, rectangular arrays, or 
hexagonal arrays (which are just tilted versions of rectangular arrays), and our 
interconnection pattern is optimized specifically for such shapes. 
The main problem in the design of a general purpose systolic computer is that 
the hard-wired array must have fixed dimensions which may differ from the size and 
shape required to solve a particular problem. Consider, for example, the case of a 
single-wafer computer with 10,000 processors arranged in a 100 x 100 square grid. 
To evaluate a polynomial on this computer, we need a long vector of processes, 
which has to be bent into a space-filling curve such as a serpentine or a spiral; the 
frequent change of direction complicates the programming for different problem 
sizes. To maintain a priority queue, we need a long strip of width 2; the bends 
introduce discontinuities (where logically adjacent processes are mapped into 
physically non-adjacent processors) which disrupt the perfect synchronization. To 
multiply large matrices on this computer, we may need a 120 x 120 array; when this 
process array is folded over the 100 x 100 processor array, some processors have to 
work four times harder than other processors by time-sharing the execution of four 
processes. In addition, some data structure algorithms require dynamically chang- 
ing arrays in which existing processes die or spawn new processes at runtime; it is 
desirable to accommodate the extra processes without relocating the existing ones. 
In this paper we propose a borderless interconnection pattern for M processors 
into which rectangular arrays of arbitrary aspect ratios and sizes can be efficiently 
mapped. The special structure of this pattern guarantees that for any rectangle with 
up to $M/\sqrt{5}$ processes, at most one process will be mapped to each processor, and 
for arbitrarily large rectangles the number of processes mapped to any two 
processors can differ by at most log(M) (note that for each computer this is an 
additive constant that does not depend on the number of processes). 
In all the other known systolic architectures (such as Hewitt’s Apiary Network 
[H], Martin’s Torus [M], and Sequin’s Doubly Twisted Torus [S]), a process 
overlap can occur even for arrays with fewer than $\sqrt{M}$ processes. In these architectures
the load imbalance for arbitrary process rectangles is either completely
unbounded (the Apiary Network) or can be as high as $\sqrt{M}$.
Aleliunas and Rosenberg give an architecture [AR] where every rectangular 
process array of area M/e can be embedded in a square processor array with M 
processors. In [AR] every processor is connected to eight neighbors (called the 
king’s move grid). Any rectangle is embedded in the grid so that two adjacent 
processes are no further than d processors away, but the embedding is highly non-uniform.
Typical values of e and d are (e = 1.2, d = 15), (e = 1.45, d = 9), and
(e = 1.8, d = 3). In our solution each processor has only four neighbors, adjacent
processes are always placed in adjacent processors, and the embedding is uniform. 
The problem of dynamically changing systolic arrays can be easily handled in
polymorphic arrays: As long as the intermediate sizes of the arrays remain at most
$M/\sqrt{5}$, the array can change its shape (e.g., from a horizontal vector to a square to
a vertical vector) and spawn new processes in any direction without creating
overlaps and without relocating the existing processes. Similarly, the array can
grow explosively, change direction, etc., and yet the difference between the number
of processes executing on the most heavily loaded processor and the most lightly
loaded processor will be smaller than log(M).
The emphasis in this paper is on the choice of the interconnection pattern rather 
than on the (extensively investigated) problem of optimally embedding one given 
structure (such as a tree) in another given structure (such as a grid). In fact, the 
embedding problem is trivial in our architecture since each processor has exactly 
four links which are permanently labelled by Up, Down, Left, and Right. The bot- 
tom left process in any array is mapped to an arbitrary processor, and all the other 
processes are mapped continuously from there by following the Up and Right links 
in the natural order. 
We call the proposed structure a polymorphic array since it can assume many 
different shapes at the same time. A typical example of a polymorphic array is 
obtained by superimposing the edges and the reverse of the edges in Figs. 1 and 2. 
These four sets of edges represent the Up, Right, Down, and Left links of each 
processor, respectively. Each set of edges forms a Hamiltonian cycle, and the local 
topology is equivalent to a grid (i.e., any two directions commute and the 
(Up, Down) and (Right, Left) pairs of directions cancel each other out). The global 
topology resembles a torus on which the four Hamiltonian cycles are diagonally 
wound as interlaced helixes. 
By experimenting with this 55-processor computer, it is easy to verify that any 
rectangular systolic array with at most 35 processes (e.g., 35 x 1, 17 x 2, 11 x 3, 
8 x4, and 7 x 5) is mapped into the computer without overlaps. The natural 
embedding of a 6 x 6 array, on the other hand, maps processes at opposite corners 
into the same processor. Note that 35 is a uniform upper bound, and in fact larger 
arrays (such as a 55 x 1 vector or a 5 x 8 rectangle) can be embedded without 
overlaps. 
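The experiment described above can be reproduced with a short script. This is a minimal sketch, not the authors' program: it assumes the link rule that Section 2 spells out (an Up link adds (1, 1) and a Right link adds (-1, 1) to a processor location, modulo W and L), and the names `loads` and `has_overlap` are illustrative.

```python
from collections import Counter

W, L = 5, 11  # the 55-processor array from the text (W x L grid)


def loads(width, height):
    """Processes per processor location for a width x height process array.

    Assumption: the process at grid position (i, j) is reached from the
    bottom left process by i Right links and j Up links, which places it at
    location ((j - i) mod W, (j + i) mod L) in the modified coordinates.
    """
    return Counter(((j - i) % W, (j + i) % L)
                   for i in range(width) for j in range(height))


def has_overlap(width, height):
    """True if some processor receives more than one process."""
    return max(loads(width, height).values()) > 1


# Every rectangle with at most 35 processes, in either orientation.
small = [(w, h) for w in range(1, 36) for h in range(1, 36) if w * h <= 35]
no_small_overlap = not any(has_overlap(w, h) for w, h in small)
print(no_small_overlap)                       # True
print(has_overlap(6, 6))                      # True
print(has_overlap(55, 1), has_overlap(5, 8))  # False False
```

In the 6 x 6 case the two processes that collide are the ones at grid positions (0, 5) and (5, 0), i.e., opposite corners of the array.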
When a million-process problem is mapped into this 55-processor computer, the
average load is 18,182 processes per processor. However, it is the maximal load
which determines the effective clock rate and the average processor utilization. In
this example, the maximal load is at most 18,184 and the loads on any two
processors can differ by at most 3, regardless of the shape of the problem. This simplifies
the synchronization and guarantees essentially optimal performance in spite
of the extensive time-sharing.
[FIGURE 1: R links. FIGURE 2: U links. Axis labels: 5, 6, 4, 7, 3, 8, 2, 9, 1, 10, 0.]
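The load figures quoted for the million-process example can be checked for one particular shape. The sketch below is only an illustration under stated assumptions: it uses a 1000 x 1000 process array (the text does not fix a shape), the placement rule ((j - i) mod W, (j + i) mod L) that follows from the link definitions of Section 2, and a hypothetical helper `load_range`.

```python
from collections import Counter

W, L = 5, 11  # the 55-processor array from the text


def load_range(width, height):
    """Return (lightest, heaviest) processor load for a width x height array."""
    counts = Counter(((j - i) % W, (j + i) % L)
                     for i in range(width) for j in range(height))
    # A Counter returns 0 for a processor that received no process at all.
    per_processor = [counts[(a, b)] for a in range(W) for b in range(L)]
    return min(per_processor), max(per_processor)


lightest, heaviest = load_range(1000, 1000)
print(lightest, heaviest, heaviest - lightest)
```

Other shapes with about a million processes can be passed to `load_range` in the same way; the claim in the text is that the spread never exceeds 3.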
An important advantage of the proposed structure is that the length of all the 
links is constant (at most 3). The short wires, the regularity of the design, and the 
bounded wire density make it possible to implement large polymorphic arrays 
efficiently in silicon. This layout can be constructed using nine basic types of blocks 
to tile the plane. In addition, the symmetry of the interconnection pattern makes it 
fault tolerant to any single failure since it is possible to embed any rectangular
array with at most $M/\sqrt{5}$ processes without using or bypassing any particular
processor and without creating overlaps by shifting the origin of the embedding.
The polymorphic arrays proposed in Section 2 of this paper can only be con- 
structed for certain machine sizes, i.e., the number of processors must be a 
Fibonacci number whose index is even and indivisible by three. We can increase the 
variety of possible machine sizes to Fibonacci numbers with arbitrary indices by 
giving up the regularity of the layout. All the other properties of polymorphic 
arrays are not affected by this change. A fuller description of these non-regular 
polymorphic arrays appears in Appendix A. 
2. DEFINITIONS AND RESULTS 
The interconnection pattern for regular polymorphic arrays is based on a W by L 
rectangular grid (with M= W x L processors). The grid is labelled by a non-stan- 
dard coordinate system: Along each axis, the even locations are numbered in a 
forward direction and then the odd locations are numbered in a backward direction 
(see Fig. 1). This guarantees that along each axis cyclically successive numbers 
(such as 0 and 1, 1 and 2, 2 and 3,... and 10 and 0) are at most two locations apart. 
Each processor has a processor number $0 \le k < M$ and a processor location which
is defined as (k mod W, k mod L) in the modified coordinate system. When W and 
L are relatively prime, the Chinese remainder theorem guarantees that the mapping 
from numbers to locations is one-to-one and onto. We can thus use either numbers 
or locations to refer to particular processors. 
The up, down, left, and right links of the processor located at (i, j) lead to the 
following processors: 
$$\begin{aligned}
\text{Up} &\to (i+1 \bmod W,\; j+1 \bmod L)\\
\text{Down} &\to (i-1 \bmod W,\; j-1 \bmod L)\\
\text{Left} &\to (i+1 \bmod W,\; j-1 \bmod L)\\
\text{Right} &\to (i-1 \bmod W,\; j+1 \bmod L).
\end{aligned}$$
Note that the up link connects processor number k to processor number k + 1 
(mod M) and the down link connects processor number k to processor number 
k - 1 (mod M). The other two links are harder to describe in terms of processor 
numbers since they treat the two coordinates differently. 
When a path from (i, j) contains u up links, d down links, l left links, and r right
links, its final destination is
$$\bigl(i+u-d+l-r \bmod W,\; j+u-d-l+r \bmod L\bigr).$$
This destination does not depend on the order in which the links are traversed, 
since modular addition is commutative and associative. 
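The definitions above translate directly into code. The following is a minimal sketch for the 55-processor example (W = 5, L = 11); the names `location`, `follow`, and `walk` are illustrative and do not come from the paper.

```python
W, L = 5, 11          # order-10 array from the text, M = 55 processors
M = W * L

# Coordinate increments of the four links, as defined above.
STEP = {"up": (1, 1), "down": (-1, -1), "left": (1, -1), "right": (-1, 1)}


def location(k):
    """Location of processor number k in the modified coordinate system."""
    return (k % W, k % L)


def follow(loc, link):
    """Location reached from loc by traversing one link."""
    di, dj = STEP[link]
    return ((loc[0] + di) % W, (loc[1] + dj) % L)


def walk(loc, path):
    """Location reached from loc by traversing the links in path in order."""
    for link in path:
        loc = follow(loc, link)
    return loc


# Chinese remainder theorem: numbers and locations are in bijection.
assert len({location(k) for k in range(M)}) == M
# The up link leads from processor k to processor k + 1 (mod M).
assert all(follow(location(k), "up") == location((k + 1) % M) for k in range(M))

path = ["up", "right", "right", "left", "down", "up", "right"]
print(walk((2, 7), path))           # (1, 10)
print(walk((2, 7), sorted(path)))   # (1, 10), the order does not matter
```

Here u = 2, d = 1, l = 1, r = 3, so the closed form gives (2 + 2 - 1 + 1 - 3, 7 + 2 - 1 - 1 + 3) = (1, 10), in agreement with the step-by-step walk.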
We call this interconnection scheme a diagonally connected torus. The length of
each link is at most $\sqrt{8}$ since the horizontal and vertical separation between the
link’s endpoints is at most 2. This locality minimizes the communication delays and
simplifies the synchronization between the various processors. The number of
crossovers per unit area is a constant which does not depend on the number of
processors, and thus the graph is easy to embed in silicon despite its non-planarity.
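The bound on the link length can be checked mechanically. The sketch below assumes one concrete reading of the non-standard coordinate system for an axis with an odd number of slots: labels 0, 1, 2, ... occupy the even physical slots in forward order and the remaining labels occupy the odd slots in backward order. The helper names `physical` and `link_spans` are hypothetical.

```python
W, L = 5, 11  # order-10 array; both sides are odd

STEP = {"up": (1, 1), "down": (-1, -1), "left": (1, -1), "right": (-1, 1)}


def physical(c, size):
    """Physical slot of coordinate label c on an axis with `size` slots."""
    return 2 * c if 2 * c < size else 2 * (size - c) - 1


def link_spans():
    """Set of (horizontal, vertical) physical separations over all links."""
    spans = set()
    for i in range(W):
        for j in range(L):
            for di, dj in STEP.values():
                dx = abs(physical((i + di) % W, W) - physical(i, W))
                dy = abs(physical((j + dj) % L, L) - physical(j, L))
                spans.add((dx, dy))
    return spans


longest_squared = max(dx * dx + dy * dy for dx, dy in link_spans())
print(longest_squared)  # 8, i.e. no wire is longer than sqrt(8)
```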
DEFINITION. A Polymorphic Array of order n = 2k (for k not divisible by 3) is a
diagonally connected torus with $W = F_k$ and $L = L_k$, where $\{F_i\}$ is the Fibonacci
sequence
$$F_1 = 1,\quad F_2 = 1,\quad F_i = F_{i-1} + F_{i-2} \text{ for } i > 2,$$
and $\{L_i\}$ is the Lucas sequence
$$L_1 = 1,\quad L_2 = 3,\quad L_i = L_{i-1} + L_{i-2} \text{ for } i > 2.$$
The number of processors in a polymorphic array of order n is $M = F_n$. These
polymorphic arrays are also called regular polymorphic arrays to distinguish them 
from irregular polymorphic arrays whose layout is irregular and whose index is 
either odd or divisible by 3. 
Remarks. (1) $F_i$ and $L_i$ can be explicitly defined as $F_i = (\phi^i - \hat\phi^i)/\sqrt{5}$,
$L_i = \phi^i + \hat\phi^i$, where $\phi$ is the golden ratio $(1 + \sqrt{5})/2$ and $\hat\phi$ is its conjugate
$(1 - \sqrt{5})/2$.
(2) The natural extension of $F_i$ and $L_i$ to negative indices with the same
recurrence is $F_{-i} = (-1)^{i+1} F_i$ and $L_{-i} = (-1)^i L_i$ (except $L_0 = 2$).
(3) The aspect ratio $L_k/F_k$ of polymorphic arrays converges to $\sqrt{5} = 2.236\ldots$.
(4) Values of k which are divisible by 3 must be excluded, since for such k, $F_k$
and $L_k$ are both even and thus the mapping from processor numbers to locations is
not one-to-one.
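The parameters of a regular polymorphic array follow mechanically from the definition. A minimal sketch, with illustrative function names (`fib`, `lucas`, `regular_array`) that are not taken from the paper:

```python
from math import gcd


def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def lucas(n):
    """Lucas number L_n with L_0 = 2, L_1 = 1, L_2 = 3."""
    a, b = 2, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def regular_array(n):
    """Return (W, L, M) for a regular polymorphic array of order n = 2k."""
    if n % 2 or (n // 2) % 3 == 0:
        raise ValueError("order must be 2k with k not divisible by 3")
    k = n // 2
    return fib(k), lucas(k), fib(n)


for n in (8, 10, 14, 16):
    W, L, M = regular_array(n)
    # Columns: order, width, length, processors, gcd(W, L), aspect ratio.
    print(n, W, L, M, gcd(W, L), round(L / W, 3))
```

For k = 6, which Remark (4) excludes, `gcd(fib(6), lucas(6))` is 2 rather than 1, so the map from processor numbers to locations would not be one-to-one.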
When a rectangular array of processes is mapped into a polymorphic array, each 
processor has a load defined as the number of processes assigned to it. If the load is 
larger than 1 (a situation called process overlap), the processor must time-share the 
execution of all these processes. It is advantageous to avoid the time-sharing 
whenever possible, and to distribute the loads as evenly as possible in all other 
cases. This makes the following properties of polymorphic arrays particularly 
interesting: 
(a) The overlap result: When a rectangular array of arbitrary aspect ratio
with at most $M/\sqrt{5}$ processes is mapped into a polymorphic array with M
processors, the loads on all the processors are at most 1.
(b) The uniformity result: When a rectangular array with arbitrarily many 
processes is mapped into a polymorphic array with M processors, the loads on any 
two processors differ by at most L$ log,MJ. 
Remark. An exhaustive computer search proved that for all n < 14, the maximal 
load imbalance in a polymorphic array of order n is exactly L$ log&J. This may 
be the case for all n but we were unable to prove this tighter bound. 
Property (a) is related to a result proved by Chor, Leiserson, and Rivest [CLR] 
in connection with the organization of raster graphics memories. Our polymorphic 
arrays and their memory organizations are based on similar mathematical concepts 
from lattice theory, even though they differ in their details due to the extra con- 
straints imposed in our application by the short wires and regular layout. The other 
property of polymorphic arrays was not discussed at all in the [CLR] paper, even 
though it applies equally well to their memory organization and can make it even 
more attractive from an engineering point of view. 
The optimality of our polymorphic arrays with respect to property (a) follows
from one of the theorems proved in [CLRS], which states that the constant $\sqrt{5}$
cannot be replaced by a smaller constant in any two-dimensional grid-like architecture.
The optimality of our polymorphic arrays with respect to property (b) follows 
from a deep theorem on irregularities of distribution due to K. F. Roth [SC, p. 10, 
Theorem 2A; R]. 
A complete list of the first 13 non-trivial regular polymorphic arrays and their 
characteristics is given in Table I. The constants represent the best known results 
for these small arrays which are somewhat better than the general bounds proved in 
this paper. For each order of magnitude there is at least one possible array, and 
thus one can choose a convenient size for any budget and technology. The table 
also illustrates the difference between the polymorphic array sizes (M= 21, 55, 377, 
987,...) and the number of chips in the [CLR] raster graphics memory organization 
(M = 5, 13, 34, 89, 233, ...). However, by using the non-regular layouts described in
Appendix A, it is possible to obtain polymorphic arrays with any Fibonacci number
of processors (M = 5, 8, 13, 21, 34, 55, 89, 144, 233, ...) (see Table II).
TABLE I
Regular Polymorphic Arrays

Order  Width   Length  No. of        No-overlap  O/M    Max proved  Conjec. max
n      W       L       processors M  size O             imbalance   imbalance
8      3       7       21            15          0.714  2           2
10     5       11      55            35          0.636  3           3
14     13      29      377           195         0.517  4           4
16     21      47      987           483         0.489  5           5
20     55      123     6765          3135        0.463  7           6
22     89      199     17711         8099        0.457  8           7
26     233     521     121393        54755       0.451  10          8
28     377     843     317811        142883      0.449  11          9
32     987     2207    2178309       976143      0.448  13          10
34     1597    3571    5702887       2553603     0.447  14          11
38     4181    9349    39088169      17489123    0.447  16          12
40     6765    15127   102334155     45778755    0.447  17          13
44     17711   39603   701408733     313714943   0.447  19          14
TABLE II
Irregular Polymorphic Arrays

Order  No. of        No-overlap  O/M    Max proved  Conjec. max
n      processors M  size O             imbalance   imbalance
9      34            23          0.676  3           3
11     89            53          0.595  3           3
12     144           80          0.555  4           4
13     233           125         0.536  4           4
15     610           307         0.503  5           5
17     1597          769         0.481  6           5
18     2584          1224        0.473  6           6
19     4181          1959        0.468  7           6
21     10946         5039        0.460  8           7
23     28657         13049       0.455  9           7
24     46368         21024       0.453  9           8
25     75025         33929       0.452  10          8
27     196418        88451       0.450  11          9
29     514229        230957      0.449  12          9
30     832040        373320      0.448  12          10
31     1346269       603667      0.448  13          10
33     3524578       1578823     0.447  14          11
35     9227465       4130829     0.447  15          11
36     14930352      6682224     0.447  15          12
3. FIBONACCI LATTICES AND THEIR RELATIONSHIP TO POLYMORPHIC ARRAYS 
The purpose of this section is to establish the connection between regular 
polymorphic arrays and Fibonacci lattices. Once this is done we can exploit the rich 
structure of these lattices to prove our claims for the regular polymorphic arrays. 
DEFINITION. A Fibonacci lattice of order n, $T_n$, is the two-dimensional lattice
spanned by the basis vectors $(0, F_n)$ and $(1, F_{n-1})$.
Some of the useful properties of Fibonacci and Lucas numbers are: 
LEMMA 1. (1) For any n which is not divisible by 3, $F_n$ and $L_n$ are relatively
prime and odd.
(2) $F_n \cdot L_n = F_{2n}$.
(3) $F_n \cdot L_{n-1} = F_{2n-1} - (-1)^n$.
(4) $F_{n-1} \cdot L_n = F_{2n-1} + (-1)^n$.
(5) $F_{i+1} F_j + F_i F_{j-1} = F_{i+j}$.
(6) $(F_{\lfloor n/2 \rfloor} + 1)(F_{\lceil n/2 \rceil} + 1) > F_n/\sqrt{5}$.
Proof. All these identities have straightforward inductive proofs (see
[HW, Theorem 179; K, Sect. 1.2.8]). Q.E.D.
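The six properties of Lemma 1 can be spot-checked numerically. This is only a sanity check over a finite range, not a proof; the function `lemma1_holds` and the tested ranges are illustrative choices, and property (6) is compared in exact integer arithmetic by squaring both sides.

```python
from math import gcd


def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def lucas(n):
    """Lucas number L_n with L_0 = 2, L_1 = 1, L_2 = 3."""
    a, b = 2, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def lemma1_holds(n):
    """Check properties (1)-(6) of Lemma 1 for one value of n >= 1."""
    sign = (-1) ** n
    lo, hi = n // 2, n - n // 2          # floor(n/2) and ceil(n/2)
    checks = [
        n % 3 == 0 or (gcd(fib(n), lucas(n)) == 1
                       and fib(n) % 2 == 1 and lucas(n) % 2 == 1),
        fib(n) * lucas(n) == fib(2 * n),
        fib(n) * lucas(n - 1) == fib(2 * n - 1) - sign,
        fib(n - 1) * lucas(n) == fib(2 * n - 1) + sign,
        # Property (5) with j = n and a few values of i.
        all(fib(i + 1) * fib(n) + fib(i) * fib(n - 1) == fib(i + n)
            for i in range(1, 10)),
        # Property (6): x > F_n / sqrt(5) is equivalent to 5 x^2 > F_n^2.
        5 * ((fib(lo) + 1) * (fib(hi) + 1)) ** 2 > fib(n) ** 2,
    ]
    return all(checks)


print(all(lemma1_holds(n) for n in range(1, 40)))  # True
```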
DEFINITION. The processor number associated with grid location (i, j) is the 
processor reached by traversing i right links and j up links in the regular 
polymorphic array. 
It is important not to confuse processor locations (which correspond to the 
physical layout of the processor array) and grid locations (which correspond to the 
logical links between processors). In the grid representation, each processor is 
linked to four rectilinearly adjacent neighbors. Note that in the grid representation i 
and j can be arbitrarily large positive or negative integers, and that processor num- 
bers are mapped to the infinite grid in a repetitive pattern due to the cycle structure 
of the regular polymorphic array. An example of a grid which is labelled by 13 
processor numbers appears in Fig. 4 (where the indices represent column numbers), 
but this example does not represent any regular polymorphic array as 13 = F7 and 7 
is not even. 
To determine the mapping of processes to processors, we simply place the 
process array on the grid representation so that the array’s lower left corner is at 
the grid’s origin. The load on each processor is the number of occurrences of its 
number within the rectangle, and the absence of overlaps is indicated by the dis- 
jointness of all the processor numbers within the rectangle. A typical example is a 
5 x 3 process array placed on the 13-processor grid in Fig. 4. Since labels 2 and 7 
occur twice within the rectangle, these processors must time-share the execution of 
two processes, while all other processors are assigned exactly one process. 
To analyze the structure of the labels on the grid, we first consider processor 0. 
This processor is associated with the origin (0, 0) and with any other grid location 
(i, j) with the property that i right links followed by j up links form a (not 
necessarily simple) cycle in the regular polymorphic array. These locations are 
related to Fibonacci lattices in the following way: 
THEOREM 2. The grid locations (i, j) with which processor 0 is associated in a
regular polymorphic array of order n = 2k are the lattice points of $T_n$ for even k and
its mirror image $T_n^{-1}$ (with respect to the x axis) for odd k.
Proof: By the definition of the right and up links, the processor associated with 
grid location (i, j) is the processor which is located at (j- i (mod W), j + i 
(mod L)) in the regular polymorphic array. Processor 0 is thus associated with any 
grid location (i, j) such that j - i = 0 (mod W) and j + i = 0 (mod L). Since these 
equations are linear and homogeneous, any integral linear combination of solutions 
is also a solution and thus the set of solutions forms a lattice in the two-dimen- 
sional grid of integers. 
To prove that $(0, (-1)^k F_{2k})$ and $(1, (-1)^k F_{2k-1})$ is a basis for this lattice, we use
the following facts:
(1) $(0, (-1)^k F_{2k})$ belongs to the lattice since by Lemma 1, $F_{2k} = F_k \cdot L_k$ and
thus $F_{2k} \equiv 0 \pmod{F_k}$ and $F_{2k} \equiv 0 \pmod{L_k}$.
(2) $(1, (-1)^k F_{2k-1})$ belongs to the lattice since by Lemma 1, $F_{2k-1} \equiv (-1)^k \pmod{F_k}$
and $F_{2k-1} \equiv -(-1)^k \pmod{L_k}$.
(3) $(0, (-1)^k F_{2k})$ is the shortest lattice vector of the form (0, a) since the up
links form a Hamiltonian cycle and thus it is impossible to return to processor 0 by
following fewer than $M = F_{2k}$ up links.
(4) The basic parallelogram defined by these two vectors is contained in the
unit-width strip $0 \le x \le 1$ and thus cannot contain any other integral lattice points
in its interior. Since $(0, (-1)^k F_{2k})$ is the shortest vertical lattice vector, the boundary
of the parallelogram cannot contain lattice points other than its corners. The
absence of lattice points in this parallelogram proves that these two vectors span
the whole lattice. Q.E.D.
THEOREM 3. For any $0 \le m < F_{2k}$, the grid locations with which processor m is
associated in a regular polymorphic array of order 2k are $T_{2k} + (0, m)$ for even k and
$T_{2k}^{-1} + (0, m)$ for odd k (i.e., upshifts by m units of the corresponding lattices).
Proof. Processor m is associated with all the grid locations (i, j) for which
$j - i \equiv m \pmod{W}$ and $j + i \equiv m \pmod{L}$. The general solution of these
inhomogeneous linear equations is the general solution of the homogeneous
equations (characterized by Theorem 2) shifted by the particular solution (0, m) of
the inhomogeneous equations. Q.E.D.
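Theorems 2 and 3 can be checked numerically for small orders. The sketch below compares two ways of finding the processor at grid location (i, j): walking the links (solved here by a brute-force Chinese remainder search) and reading the label off the shifted lattice. The function names are illustrative, and the closed form in `lattice_processor` is our own restatement of "(i, j) lies in $T_{2k} + (0, m)$", with the sign flipped for the mirror image.

```python
def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def lucas(n):
    """Lucas number L_n with L_0 = 2, L_1 = 1, L_2 = 3."""
    a, b = 2, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def walked_processor(i, j, k):
    """Processor reached from processor 0 by i Right links and j Up links."""
    W, L = fib(k), lucas(k)
    target = ((j - i) % W, (j + i) % L)
    return next(m for m in range(W * L) if (m % W, m % L) == target)


def lattice_processor(i, j, k):
    """The m for which (i, j) lies in the lattice of Theorem 3 shifted by (0, m)."""
    n = 2 * k
    sign = 1 if k % 2 == 0 else -1       # mirror image for odd k
    return (j - sign * i * fib(n - 1)) % fib(n)


agree = all(walked_processor(i, j, k) == lattice_processor(i, j, k)
            for k in (4, 5) for i in range(-8, 9) for j in range(-8, 9))
print(agree)  # True
```

Negative values of i and j correspond to following Left and Down links instead of Right and Up links.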
For any n, the union of all the upshifted Fibonacci lattices $T_n$ (or their mirror
images $T_n^{-1}$) for $0 \le m < F_n$ is a labelling of the infinite two-dimensional grid which
we call a Fibonacci labelling. Our goal in the rest of this paper is to analyze the distribution
of labels in rectangles placed on Fibonacci-labelled grids.
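A Fibonacci labelling is easy to generate. The sketch below assumes the unmirrored case, where label m occupies $T_n + (0, m)$, so that the label at grid location (i, j) is $(j - i F_{n-1}) \bmod F_n$; the helper names are illustrative. It reproduces the 5 x 3 example from earlier in this section (order 7, 13 labels).

```python
from collections import Counter


def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def label(i, j, n):
    """Label at grid location (i, j) in the unmirrored labelling of order n."""
    return (j - i * fib(n - 1)) % fib(n)


def rectangle_loads(width, height, n):
    """Number of occurrences of each label in a width x height rectangle."""
    return Counter(label(i, j, n) for i in range(width) for j in range(height))


loads = rectangle_loads(5, 3, 7)
print(sorted(m for m, count in loads.items() if count > 1))  # [2, 7]
```

Labels 2 and 7 occur twice and every other label occurs exactly once, matching the time-sharing described for the 5 x 3 array.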
Remark. Theorems 2 and 3 have been stated and proved for regular 
polymorphic arrays. The same theorems hold for non-regular polymorphic arrays 
and the proof in this case is immediate as non-regular polymorphic arrays are 
simply “folded” Fibonacci lattices. A more detailed description is given in Appen- 
dix A. 
4. THE OVERLAP RESULT 
In this section we prove that in a Fibonacci-labelled grid with $F_n$ labels, the
smallest rectangle that contains a repeated occurrence of some label has at least
$F_n/\sqrt{5}$ integral points. Since all the labels share the same lattice structure and the
property is invariant under mirroring, it suffices to prove this result for unmirrored
Fibonacci lattices and the particular label 0.
Since the smallest rectangle with two occurrences of label 0 has these labels at 
opposite corners, we can assume that the rectangle is placed in the first quadrant 
with one corner at the origin, and handle the other three cases analogously. 
For any integral vector u = (i, j) in the first quadrant, let R(u) be the rectangle
whose opposite corners are at the origin and at u, and let Z(u) be the number of
integral points in it (i.e., Z(u) = (i + 1)(j + 1)). For any subset S of the grid, we
define P(S) to be the minimum size of a rectangle whose intersection with S contains
at least two points:
$$P(S) = \min_u \{ Z(u) : |R(u) \cap S| \ge 2 \}.$$
By definition, any rectangle with fewer than P(S) integral points can contain at
most one representative from S. Since the occurrences of label 0 form a Fibonacci
lattice in our grid, our goal is to bound $P(T_n)$ from below.
THEOREM 4 [CLR]. $P(T_n) > F_n/\sqrt{5}$.
Proof: This is the main theorem from [CLR] where it was proved for odd n. 
The proof given here applies to all n, and it is considerably simpler than their proof. 
Let $\{u_k\}$ be a sequence of vectors defined as
$$u_0 = (0, F_n),\quad u_1 = (1, F_{n-1}),\quad u_k = u_{k-2} - u_{k-1} \text{ for } k \ge 2.$$
An explicit expression for these vectors is
$$u_k = (F_{-k}, F_{n-k}) = ((-1)^{k+1} F_k, F_{n-k}).$$
These vectors alternate between the first and second quadrant (see Fig. 3). All these
vectors belong to $T_n$ and any pair of vectors of the form $\{u_k, u_{k+1}\}$ or $\{u_k, u_{k+2}\}$
is a basis for this lattice (since the original basis vectors $u_0$, $u_1$ can be recovered
from any such pair by inverting the recurrence).
Let us consider now the sequence of vectors $u_0, u_1, u_3, u_5, \ldots$. These vectors are
sorted around the origin in the first quadrant (see Fig. 3), and any successive pair
of vectors in the sequence forms a basis for $T_n$.
Consider now the polygonal area A bounded between the positive x and y axis
and the polygonal line that connects the $u_k$ points. It is the union of the triangles
formed between any two successive vectors in the sequence. The basic property of
bases in lattices is that parallelograms whose edges are the basis vectors do not contain
lattice points in their interiors. Thus, these triangles cannot contain other lattice
points and their union A cannot contain other lattice points from $T_n$.
To complete the proof that $P(T_n) > F_n/\sqrt{5}$, it suffices to show that the area H
bounded between the positive x and y axis and the hyperbola $(i + 1)(j + 1) = F_n/\sqrt{5}$
is strictly contained within A, and thus H cannot contain any lattice points
from $T_n$ (except the origin). Consequently, any rectangle R(u) which contains
another occurrence of label 0 must satisfy $Z(u) = (i + 1)(j + 1) > F_n/\sqrt{5}$.
By using convexity arguments, we can show that H is contained in A by proving
that all the corners of A (except the origin) are outside H. By the definition of these
corners and H, it suffices to show that for all $0 \le k \le n$,
$$(F_k + 1)(F_{n-k} + 1) > F_n/\sqrt{5}.$$
[FIGURE 3: the lattice points $(F_i, F_{n-i})$ for odd i in the first quadrant, between $(F_0, F_n) = (0, 21)$ and $(F_n, 0)$.]
This result follows from the monotonicity property
$$(F_k + 1)(F_l + 1) \le (F_{k-1} + 1)(F_{l+1} + 1)$$
for $2 \le k \le l$ and $k + l \ge 7$ (left as an exercise to the reader) and Lemma 1, since
$$\min_{0 \le k \le n} (F_k + 1)(F_{n-k} + 1) = (F_{\lfloor n/2 \rfloor} + 1)(F_{\lceil n/2 \rceil} + 1) > F_n/\sqrt{5}.$$
Q.E.D.
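The bound of Theorem 4 can be compared with an exhaustive search over the lattice. The sketch below is an illustration under stated assumptions: it takes the minimum over lattice points on both sides of the y axis (the "other cases" that the proof handles analogously), reports the largest rectangle size that is guaranteed free of overlaps (one less than the minimum), and uses illustrative function names.

```python
def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def no_overlap_size_formula(n):
    """min over k of (F_k + 1)(F_{n-k} + 1), minus one."""
    return min((fib(k) + 1) * (fib(n - k) + 1) for k in range(n + 1)) - 1


def no_overlap_size_search(n):
    """Same quantity, found by scanning the lattice points of T_n directly."""
    Fn, Fn1 = fib(n), fib(n - 1)
    best = Fn + 1                      # the lattice point (0, F_n)
    for i in range(1, Fn + 1):
        j = (i * Fn1) % Fn             # lattice points in column i are (i, j + t*F_n)
        j = min(j, Fn - j)             # the one closest to the x axis, above or below
        best = min(best, (i + 1) * (j + 1))
    return best - 1


for n in (8, 10, 14, 16):
    print(n, no_overlap_size_search(n), no_overlap_size_formula(n))
```

For these orders both functions return 15, 35, 195, and 483, the no-overlap sizes listed in Table I, and in each case the size plus one exceeds $F_n/\sqrt{5}$.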
5. THE UNIFORMITY RESULT 
Let $Q_n(u, k)$ be the number of times label k ($0 \le k < F_n$) occurs within the rectangle
R(u) (including its boundary) in a Fibonacci-labelled grid of order n. We
further define $D_n(u)$ to be the difference between the heaviest and the lightest loads
among the $F_n$ labels:
$$D_n(u) = \max_k Q_n(u, k) - \min_k Q_n(u, k).$$
Our goal in this section is to prove: 
THEOREM 5. $\max_u D_n(u) \le \lceil n/2 \rceil$.
We will prove that
$$\max_u D_n(u) \le \max_u D_{n-2}(u) + 1.$$
The desired result follows from this inequality by induction. 
LEMMA 6. There is a u = (i, j) with $0 \le i, j < F_n$ which achieves the maximal load
difference for any R(u).
Proof. Let R(u) be an arbitrary rectangle. Since all the labels occur exactly once
along any horizontal or vertical segment of length $F_n$, we can eliminate arbitrarily
many such segments without changing the load difference between processors in the
remaining shape. We can thus reduce i and j modulo $F_n$ and obtain $0 \le i, j < F_n$.
Q.E.D.
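Lemma 6 makes an exhaustive search for the worst rectangle finite. The following sketch enumerates every u = (i, j) with $0 \le i, j < F_n$ in the unmirrored Fibonacci labelling, assuming the label at (i, j) is $(j - i F_{n-1}) \bmod F_n$, and compares the result with the bound $\lceil n/2 \rceil$ of Theorem 5. It is a brute-force illustration for small n only, and `max_imbalance` is a hypothetical name.

```python
def fib(n):
    """Fibonacci number F_n with F_0 = 0, F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def max_imbalance(n):
    """Largest D_n(u) over all u = (i, j) with 0 <= i, j < F_n."""
    Fn, Fn1 = fib(n), fib(n - 1)
    worst = 0
    for i in range(Fn):                    # rectangle covers columns 0..i
        counts = [0] * Fn
        for j in range(Fn):                # grow the rectangle by row j
            for x in range(i + 1):
                counts[(j - x * Fn1) % Fn] += 1
            worst = max(worst, max(counts) - min(counts))
    return worst


for n in range(5, 11):
    print(n, max_imbalance(n), (n + 1) // 2)   # order, search result, ceil(n/2)
```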
The last lemma can be extended to state that the maximal load difference between
any two processors is also attained for some rectangle R((i, j)), $0 \le i, j < F_n$; this
is a slightly stronger statement but follows from the same proof used in the lemma.
We are going to color the Fibonacci-labelled grid with $F_n$ colors, one for each
label (or processor). These colors split into two distinct groups: the red tints are
given to labels 0 through $F_{n-1} - 1$ and the blue tints are given to labels $F_{n-1}$
through $F_n - 1$ (see Figs. 4, 5, and 7). The reader is assumed not to be color-blind
as we will continue to split the colors into finer shades of grey (or pink).
[FIG. 4. The first quadrant labeled by $T_7$ ($F_7 = 13$).]
[FIG. 5. The “red” labels in the grid.]
[FIG. 6. The “red” labels in the grid after the “squeeze.”]
[FIG. 7. Red, blue, yellow, purple, light, and dark labels.]
Our next lemma states that the internal load difference amongst the “red” labels
$\{0, \ldots, F_{n-1} - 1\}$ is bounded by the maximal load difference attained for a
Fibonacci-labelled grid of order n - 1. Likewise, the load difference amongst the
“blue” labels $\{F_{n-1}, \ldots, F_n - 1\}$ is bounded by the maximal load difference
attained in a Fibonacci-labelled grid of order n - 2. We extend the definition of $D_n$
and define $D_n(u, S)$ to be the difference between the heaviest and the lightest loads
among the labels $S \subseteq \{0, \ldots, F_n - 1\}$.
LEMMA 7. In a Fibonacci-labelled grid of order n:
(1) Among the “red” labels:
$$\max_u D_n(u, \{0, \ldots, F_{n-1} - 1\}) \le \max_u D_{n-1}(u).$$
(2) Among the “blue” labels:
$$\max_u D_n(u, \{F_{n-1}, \ldots, F_n - 1\}) \le \max_u D_{n-2}(u).$$
Proof. The proof is based on a cut-and-paste argument. Consider the red labels
in the square $U = R((F_n - 1, F_n - 1))$ whose opposite corners are the origin and the
point $(F_n - 1, F_n - 1)$. Erase all the blue labels and squeeze the red labels to the left
as far as they can go (see Figs. 4-6). What we get is a rectangle cut out of a
mirrored Fibonacci-labelled grid of order n - 1, $T_{n-1}^{-1}$. Since mirroring makes no
difference with respect to the maximal load difference, we ignore its effect from now
on to simplify the case analysis.
If we consider a rectangle whose opposite corners are the origin and a point
$(k, F_n - 1)$ for some $0 \le k \le F_n - 1$ and perform the erase-and-squeeze process we
get a similar structure. The only difference is that rather than a rectangle we get a
rectangle whose last column may be incomplete, missing a segment at the top of
the column.
The maximal load difference between any two processors is attained for a rectangle
R(u), u = (i, j) ∈ U, as a consequence of the last lemma. Consider the intersection
of R(u) and the red labels, called I. We distinguish between two cases:
(1) Squeezing the red labels in I to the left gives us a rectangle cut out of a
Fibonacci-labelled grid of order n - 1. This implies that the maximal load difference
between the different red labels is $D_{n-1}(u)$ for some u, which is obviously
bounded by $\max_u D_{n-1}(u)$ and the proof of item (1) is complete.
(2) Otherwise, consider the rectangle $C = R((i, F_n - 1)) - R(u)$ (u = (i, j)). We claim that
the maximal load difference amongst the red labels in R(u) is equal to the maximal
load difference amongst the red labels in C. The reason is that in $C \cup R(u)$ the load
difference amongst all processors is zero as every rectangle column has exactly $F_n$
labels and contains all labels. Thus C’s processor load distribution is the $F_n$ complement
of R(u)’s processor load distribution. Now, performing the erase-and-squeeze
process on the red labels in C gives us a complete rectangle cut out of a
Fibonacci-labelled grid of order n - 1. Using the same argument as in case (1) this
implies that the load difference amongst the red labels is at most $\max_u D_{n-1}(u)$.
The proof of this lemma’s second claim is handled analogously. Squeezing the 
blue labels to the left gives us a rectangle missing the bottom portion of the last 
column cut out of a Fibonacci-labelled grid of order n - 2. Once again we take
either the original rectangle R(u) or its complement depending on the height of the
missing portion of the column and u’s height (the y coordinates). Q.E.D.
COROLLARY. In a grid labelled by the mirror image of a Fibonacci lattice, $T_n^{-1}$, we 
have 
$$\max_v D_n(v, \{0, \ldots, F_{n-2} - 1\}) \le \max_v D_{n-2}(v)$$ 
and 
$$\max_v D_n(v, \{F_{n-2}, \ldots, F_n - 1\}) \le \max_v D_{n-1}(v).$$ 
Proof. Just follow what happens to every label when the grid is “flipped over” 
around the x axis. Q.E.D. 
The last lemma tells us that the load difference amongst the different labels 
(processors) in a Fibonacci-labelled grid of order $n$ can be greater than the load 
difference in a Fibonacci-labelled grid of order $n - 1$ only if the processor pair 
attaining the difference has one element from the blue set and one element from the red 
set. 
What we do next is split the red tints into two parts, yellow and purple. The 
yellow set is $\{0, \ldots, F_{n-2} - 1\}$ and the purple set is $\{F_{n-2}, \ldots, F_{n-1} - 1\}$. We also 
distinguish between light and dark: the light set is $\{0, \ldots, F_{n-3} - 1\}$ and the dark set is 
$\{F_{n-3}, \ldots, F_{n-1} - 1\}$. Note that the union of the yellow and purple sets gives us the 
red set, and likewise the union of the light and dark sets gives us the red set. So we 
have six sets: blue, red, yellow, purple, light, and dark, where red = yellow $\cup$ purple, 
red = light $\cup$ dark, and blue $\cup$ red = $\{0, \ldots, F_n - 1\}$. Figure 7 gives us this rainbow 
effect for $T_7$. 
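The six sets can be written out explicitly for a small order. The following Python sketch is illustrative only: it assumes the indexing convention $F_1 = F_2 = 1$ (so that $F_7 = 13$), which this section does not state, and it takes the blue set to be the complement of the red set in $\{0, \ldots, F_n - 1\}$, as implied by the identities above. The names `fib` and `colour_sets` are hypothetical.

```python
def fib(k):
    """Return F_k under the assumed convention F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def colour_sets(n):
    """Build the six label sets for order n (n >= 4) as Python sets."""
    f_n, f_n1, f_n2, f_n3 = fib(n), fib(n - 1), fib(n - 2), fib(n - 3)
    return {
        "red": set(range(0, f_n1)),
        "blue": set(range(f_n1, f_n)),      # assumed: complement of red
        "yellow": set(range(0, f_n2)),
        "purple": set(range(f_n2, f_n1)),
        "light": set(range(0, f_n3)),
        "dark": set(range(f_n3, f_n1)),
    }


sets7 = colour_sets(7)
for name in ("blue", "red", "yellow", "purple", "light", "dark"):
    print(name, sorted(sets7[name]))
```

Under this convention the blue and yellow sets both have $F_{n-2}$ elements and the purple set has $F_{n-3}$ elements, which matches the column heights used in the argument that follows.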
LEMMA 8. One of the following claims must hold: 
(a) The maximal load difference in the yellow set is bounded by $\max_v D_{n-2}(v)$ 
and the maximal load difference in the purple set is $\le \max_v D_{n-3}(v)$, or 
(b) the maximal load difference in the light set is bounded by $\max_v D_{n-3}(v)$ 
and the maximal load difference in the dark set is $\le \max_v D_{n-2}(v)$. 
The proof of Lemma 8 will be deferred. We first show that if Lemma 8 holds then 
Theorem 4 (the Uniformity Result) is true. 
LEMMA 9. Any two labels which are physically adjacent horizontally have a load 
difference of at most 1 in any rectangle. 
Proof. We know that we can limit ourselves to a discussion of rectangles within 
the square $U$. The two labels are both inside the rectangle or both outside the 
rectangle, except when they are along the edge columns. Each such column can 
contain at most one occurrence of any label, and thus the difference between the loads 
of these two labels can be only 0, +1, or -1. The difference cannot be $\pm 2$ because 
the edge columns contain at most one occurrence of every label. Q.E.D. 
Assume claim (a) holds. Now, every element in the yellow set is adjacent to some 
element in the blue set (to its immediate left) in an $F_n$ Fibonacci-labelled grid. Any 
pair of labels, one from the yellow set and one from the blue set, has a load 
difference of at most $\max_v D_{n-2}(v) + 1$. Consider the $2 \times F_{n-2}$ rectangle formed by the 
blue and yellow labels. We know that within each column the load difference is 
bounded by $\max_v D_{n-2}(v)$ and that within a $2 \times 1$ rectangle consisting of one blue 
label and one yellow label the load difference is bounded by 1. Both these 
constraints imply that the maximal load difference between any two labels, one blue 
and one yellow, is bounded by $\max_v D_{n-2}(v) + 1$. 
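The way the two constraints combine is a triangle inequality. As a worked step, write $\ell(x)$ for the load of label $x$ in the rectangle under consideration (a symbol introduced here only for illustration), let $b$ be any blue label, $y$ any yellow label, and $b'$ the blue label adjacent to $y$. The first term is bounded by the column constraint and the second by Lemma 9:

```latex
|\ell(b) - \ell(y)|
  \le |\ell(b) - \ell(b')| + |\ell(b') - \ell(y)|
  \le \max_v D_{n-2}(v) + 1 .
```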
Every element in the purple set is adjacent to some label in the blue set (to its 
immediate right). Consider the set of blue and purple labels: this set has two 
columns (one of height $F_{n-2}$ and one of height $F_{n-3}$). The load difference within 
the labels of the blue column is bounded by $\max_v D_{n-2}(v)$ and the load difference 
within the labels of the purple column is bounded by $\max_v D_{n-3}(v)$. The load 
difference within any $2 \times 1$ rectangle consisting of one purple label and one blue label 
is at most 1. These constraints imply that the load difference between any two 
labels, one from the blue set and one from the purple set, is at most 
$\max_v D_{n-2}(v) + 1$. 
Assuming claim (a) holds, and the inductive claim that the maximal load 
difference in a Fibonacci-labelled grid of order $k \le n - 1$ is at most $\max_v D_{k-2}(v) + 1$, 
we can conclude the proof of Theorem 6 simply by considering all possible pairs of 
labels, i.e., both red, both blue, one purple and one blue, and one yellow and one 
blue. 
If we assume claim (b) the proof is very similar to the one given above and will 
be omitted. 
We still have to prove Lemma 8. Consider (again) the intersection of the red 
labels and the rectangle $R(v)$ for which $D_n(v)$, $v = (i, j)$, attained its maximal value. If 
the erase-and-squeeze process gives us a rectangle cut out of a grid labelled by $T_{n-1}^{-1}$, 
then using the corollary to Lemma 7 and identifying the light set as the blue set and 
the dark set as the red set, we prove claim (b) of Lemma 8. If the erase-and-squeeze 
process does not give us a rectangle then we consider the set 
$C = R(i, F_n - 1) - R(v)$. Taking the intersection of the red labels and $C$ and then 
performing the erase-and-squeeze process gives us a rectangle cut out of the $T_{n-1}^{-1}$ labelled 
grid. In this case we prove claim (a) of Lemma 8: flip the grid over along the line 
$x = F_n - 1$, and renumber all the labels taking the point $(0, F_n - 1)$ as the new 
origin. This is now a $T_{n-1}$ labelled grid, and using Lemma 7 we identify the yellow 
and purple sets as the red and blue sets. Following the label transformations in 
reverse order gives us claim (a) of Lemma 8. Q.E.D. 
APPENDIX A 
In this appendix we show how to associate a processor interconnection layout 
with every two-dimensional integer lattice. We use this to construct processor inter- 
connection networks from those Fibonacci lattices whose index is either odd or 
divisible by 3. These lattices do not have a matching regular polymorphic array 
associated with them, but since it is desirable to increase the variety of possible 
machine sizes, these non-regular layouts may be quite useful. 
Given any two linearly independent vectors in the plane with integral coefficients 
$\{b_1, b_2\} \subset Z^2$, they span a lattice $\Lambda = \{z_1 b_1 + z_2 b_2 \mid z_1, z_2 \in Z\}$. The area within the 
fundamental parallelogram defined by the two basis vectors is equal to the absolute 
value of the determinant $d(\Lambda)$ of the matrix of basis vectors, which is an invariant 
of the lattice. 
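As a small numerical illustration of the invariance claim, the Python sketch below computes the determinant for an arbitrarily chosen basis (not taken from the paper) and checks that replacing $b_1$ by $b_1 + k b_2$, which spans the same lattice, leaves its absolute value unchanged. The helper names are hypothetical.

```python
def lattice_det(b1, b2):
    """Determinant of the 2x2 integer matrix whose rows are b1 and b2."""
    return b1[0] * b2[1] - b1[1] * b2[0]


def add_multiple(b1, b2, k):
    """Return b1 + k*b2; together with b2 this spans the same lattice."""
    return (b1[0] + k * b2[0], b1[1] + k * b2[1])


b1, b2 = (5, 3), (-3, 5)          # illustrative basis, an assumption
area = abs(lattice_det(b1, b2))   # area of the fundamental parallelogram

# The absolute determinant should not depend on which basis is used.
same = all(
    abs(lattice_det(add_multiple(b1, b2, k), b2)) == area
    for k in range(-3, 4)
)
print(area, same)
```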
THEOREM 10. For every two-dimensional lattice $\Lambda$ there is a consistent VLSI 
processor interconnection layout with the following properties: 
(1) The number of processors is $|d(\Lambda)|$. 
(2) Each processor is connected to four other processors via fixed links labelled 
by up, down, left, and right. 
(3) The layout is a homomorphic picture of the lattice. 
(4) The layout is rectangular and the processor interconnection wire length is a 
constant which does not depend on the number of processors $|d(\Lambda)|$. 
Proof. Partition $Z^2$ into $|d(\Lambda)|$ subsets each of which is a shifted version of $\Lambda$ 
(without rotation), and associate a unique processor number with each subset. This 
labelling is consistent in the sense that the four rectilinear neighbors of each 
processor are the same for all its occurrences in $Z^2$. Our goal is to fold this 
representation into a compact rectangle in which each processor label occurs exactly 
once and in which the distance between labels which are neighbors in $Z^2$ is 
bounded from below and from above by a constant. 
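The partition and its consistency property can be checked by brute force on a small example. The Python sketch below is one possible realization under stated assumptions: each point is reduced into the half-open fundamental parallelogram of an illustrative basis, and the reduced points are numbered in sorted order. The paper does not prescribe this numbering, and all function names are hypothetical.

```python
def det(u, v):
    return u[0] * v[1] - u[1] * v[0]


def representative(p, b1, b2):
    """Reduce p modulo the lattice into the half-open parallelogram of b1, b2."""
    d = det(b1, b2)
    s = det(p, b2) // d   # floor of the b1-coordinate of p
    t = det(b1, p) // d   # floor of the b2-coordinate of p
    return (p[0] - s * b1[0] - t * b2[0], p[1] - s * b1[1] - t * b2[1])


def build_labels(b1, b2, span=10):
    """Number the distinct representatives; span must cover the parallelogram."""
    reps = sorted({representative((x, y), b1, b2)
                   for x in range(-span, span) for y in range(-span, span)})
    return {r: i for i, r in enumerate(reps)}


def label(p, b1, b2, table):
    return table[representative(p, b1, b2)]


def neighbours_consistent(b1, b2, table, span=10):
    """Check that each label sees the same four neighbour labels everywhere."""
    seen = {}
    for x in range(-span, span):
        for y in range(-span, span):
            me = label((x, y), b1, b2, table)
            nbrs = tuple(label((x + dx, y + dy), b1, b2, table)
                         for dx, dy in ((0, 1), (0, -1), (-1, 0), (1, 0)))
            if seen.setdefault(me, nbrs) != nbrs:
                return False
    return True


B1, B2 = (5, 3), (-3, 5)   # illustrative basis, an assumption
TABLE = build_labels(B1, B2)
print(len(TABLE), neighbours_consistent(B1, B2, TABLE))
```

The number of labels produced equals the absolute determinant of the basis, as property (1) of the theorem requires.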
By using the Kannan [Ka] algorithm for the shortest lattice vector (which is 
particularly simple in two dimensions), it is possible to replace the original basis 
$\{b_1, b_2\}$ of $\Lambda$ by a new basis $\{v_1, v_2\}$ for the same lattice with the property that the 
angle between the two vectors is at least 60 degrees and at most 120 degrees. We 
note that odd indexed Fibonacci lattices have an orthogonal basis and even indexed 
Fibonacci lattices have a basis which is nearly orthogonal, but the subsequent 
analysis deals with arbitrary lattices rather than polymorphic arrays. 
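In two dimensions a basis with this angle property can be obtained by the classical Lagrange-Gauss reduction. The Python sketch below uses that procedure as a stand-in; it is not claimed to be the algorithm of [Ka], and the function names and sample bases are assumptions made for illustration. On termination $|v_1| \le |v_2|$ and $|v_1 \cdot v_2| \le |v_1|^2 / 2$, so the cosine of the angle has absolute value at most 1/2.

```python
import math


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def round_div(a, b):
    """Nearest integer to a/b for b > 0, using exact integer arithmetic."""
    return (2 * a + b) // (2 * b)


def gauss_reduce(b1, b2):
    """Lagrange-Gauss reduction of a linearly independent integer basis."""
    v1, v2 = b1, b2
    if dot(v1, v1) > dot(v2, v2):
        v1, v2 = v2, v1
    while True:
        # Subtract the nearest-integer multiple of v1 from v2.
        m = round_div(dot(v1, v2), dot(v1, v1))
        v2 = (v2[0] - m * v1[0], v2[1] - m * v1[1])
        if dot(v2, v2) >= dot(v1, v1):
            return v1, v2
        v1, v2 = v2, v1   # v2 became shorter: swap and repeat


def angle_deg(u, v):
    """Angle between u and v in degrees."""
    c = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.degrees(math.acos(c))


v1, v2 = gauss_reduce((5, 3), (2, 8))
print(v1, v2, angle_deg(v1, v2))
```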
By rotating $Z^2$ it is possible to change $v_1$ to a vector $u_1$ along the positive x axis, 
and by applying a slight shear operation in the x direction to the two-dimensional 
plane it is possible to change $v_2$ to a vector $u_2$ along the y axis. Since $v_1$ and $v_2$ were 
almost orthogonal, these transformations change the distances between points in $Z^2$ 
by at most a constant (m) which does not depend on $\Lambda$. 
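The rotation followed by the shear can be sketched in a few lines of Python. This is a floating-point illustration under assumed conventions (rotation about the origin, shear of the form $(x, y) \mapsto (x - ky, y)$); the name `rotate_then_shear` is hypothetical, and the sketch makes no claim about the value of the distortion constant.

```python
import math


def rotate_then_shear(v1, v2):
    """Return a map sending v1 onto the positive x axis and v2 onto the y axis."""
    theta = math.atan2(v1[1], v1[0])
    c, s = math.cos(-theta), math.sin(-theta)

    def rot(p):
        # Rotate p by -theta so that v1 lands on the positive x axis.
        return (c * p[0] - s * p[1], s * p[0] + c * p[1])

    r2 = rot(v2)
    k = r2[0] / r2[1]   # shear amount; r2[1] != 0 for independent v1, v2

    def transform(p):
        x, y = rot(p)
        return (x - k * y, y)   # shear in the x direction keeps the x axis fixed

    return transform


t = rotate_then_shear((2, 0), (1, 2))
print(t((2, 0)), t((1, 2)))
```

If $v_2$ lies clockwise from $v_1$, the image of $v_2$ falls on the negative half of the y axis; swapping the sign of one basis vector beforehand avoids this.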
Consider now the rectangle $R$ formed by $u_1$, $u_2$, which is closed along the sides 
that contain the origin and open along the other two sides. The rectangle $R$ 
contains exactly one occurrence of each of the $|d(\Lambda)|$ labels, since it is the basic cell of 
the lattice which is replicated throughout the two-dimensional plane. The locations 
of these labels form a rotated and slightly sheared version of $Z^2$, and they no longer 
occupy points with integral coordinates. 
By stretching the rectangle R along the x and y directions by a constant factor 
c$E, we can guarantee that any 2 x 2 rectilinear square in R will contain at most 
one label. We now fold R twice along its middle x and y coordinates, and obtain 
four quarters of R stacked on top of each other. We merge the left/right and 
up/down edges of R to form a folded torus. In this folded torus, each 2 x 2 rec- 
tilinear square contains at most four labels, and thus we can move them to the 
lower left corners of the four 1 x 1 squares in the 2 x 2 square. This restores the 
integrality of the coordinates of each label, guarantees that each label occupies a 
distinct location, and increases the interconnection wire length by at most a con- 
stant factor (2 &). Q.E.D. 
ACKNOWLEDGMENTS 
We would like to thank Ehud Shapiro for motivating this research and Benny Chor, Charles 
Leiserson, and Ron Rivest for introducing us to the fascinating world of Fibonacci lattices. 
Note added in proof: Amos Fiat recently improved Theorem 5 to $\max_v D_n(v) = \lfloor n/3 \rfloor$ (which is 
optimal). The detailed case analysis which proves this result can be found in his Ph.D. thesis. 
REFERENCES 
[AR] R. ALELIUNAS AND A. L. ROSENBERG, On embedding rectangular grids in square grids, IEEE 
Trans. Comput. C-31, No. 9 (1982). 
[CLR] B. CHOR, C. E. LEISERSON, AND R. L. RIVEST, An application of number theory to the 
organization of raster-graphics memory, FOCS 82. 
[CLRS] B. CHOR, C. E. LEISERSON, R. L. RIVEST, AND J. B. SHEARER, “An Application of Number 
Theory to the Organization of Raster-Graphics Memory,” revised Feb. 1984, Laboratory for 
Computer Science, Massachusetts Institute of Technology, Cambridge, Mass. 
[FS] A. FIAT AND A. SHAMIR, Polymorphic arrays: A novel VLSI layout for systolic computers, 
25th FOCS 1984. 
[H] C. E. HEWITT, The apiary network architecture for knowledgeable systems, in “Proceedings, 
Lisp Conference,” Stanford, August 1980, pp. 108-118. 
[HW] G. H. HARDY AND E. M. WRIGHT, “An Introduction to the Theory of Numbers,” 3rd ed., 
Oxford Univ. Press, London/New York, 1956. 
[K] D. E. KNUTH, “Fundamental Algorithms,” Addison-Wesley, Reading, Mass., 1977. 
[Ka] R. KANNAN, Improved algorithms for integer programming and related lattice problems, in 
“Proceedings, STOC 83.” 
[L] C. E. LEISERSON, “Area Efficient VLSI Computation,” MIT Press, Cambridge, Mass., 1983. 
[LLL] A. K. LENSTRA, H. W. LENSTRA, JR., AND L. LOVÁSZ, Factoring polynomials with rational 
coefficients, Math. Ann. 261 (1982), 515-534. 
[M] A. J. MARTIN, The torus: An exercise in constructing a processing surface, Caltech Conference 
on VLSI, 1981. 
[R] K. F. ROTH, On irregularities of distribution, Mathematika 1 (1954), 73-79. 
PI C. H. SEQUIN, Doubly twisted torus networks for VLSI processor arrays, in “Eighth Annual 
Symposium on Computer Architecture,” Minneapolis, May 12-14, 1981. 
I31 W. M. SCHMIDT, “Lectures on Irregularities of Distribution,” Tata Institute of Fundamental 
Research, Bombay, 1977. 
