Miel, G., Trends in systolic and cellular computation, Journal of Computational and Applied Mathematics 38 (1991) 297-321.
Introduction
Very Large Scale Integration (VLSI) mesh array processors have received considerable attention for applications requiring high throughput.
In this type of parallel computers, the cells of the array operate in Single Instruction Multiple Data (SIMD) mode and algorithms are executed in systolic or cellular fashion. Two areas of applications for VLSI mesh arrays are in signal processing and in scientific computing.
In the former case, the mesh array is used as a special-purpose processor embedded in a sensor system. In the second case, it serves as an accelerator, performing matrix-based and other "locally recursive" computations involved in large scale simulations, under the control of a conventional general-purpose computer. For certain applications, the resulting throughput is potentially that of a supercomputer at a fraction of the cost.
G. Mid / Systolic and cellular computation
An array processor has two modes of operation: cellular and systolic. In the cellular mode, the input data are first loaded into the array, then processed in unison, the results are then unloaded from the array. and a similar cycle starts anew. In the systolic mode of operation, the loading, computing and unloading occur concurrently.
In this mode every processor regularly pumps data in and out, each time performing a short computation, with the data rhythmically flowing through the network.
Due to the importance of linear-algebraic methods in scientific computing. matrix problems have become preferred benchmarks for high-performance computers [18, 19] . A survey of parallel algorithms for dense linear-algebraic computation was recently presented in [25] . A comprehensive review of iterative methods has been presented in [67] . Both surveys concentrate on commercially available computers, consisting of a modest number of processors, with sharedmemory multivector units like the CRAY-2 or distributed memory like the hypercube.
Surprizingly. the surveys hardly consider massively parallel-array processors, which in fact are ideally suited for matrix computation.
Linear-algebraic operations have properties of locality, recursiveness and regularity that match well the fine-grain parallelism of array processors. When the dimension of the problem corresponds to the dimension of the array processor, a matrix operation can have very low overhead in communication and synchronization. Speiser and Whitehouse [Sl] have shown that computational requirements for many real-time signal processing tasks (such as adaptive filtering, beamforming, cross-ambiguity calculations, data compression, etc.) may be reduced to a common set of linear-algebraic operations.
It is no surprise, therefore, that there is considerable research activity on matrix-based cellular and systolic computation.
Our goal is to portray a profile of such activity by describing recent algorithms, both systolic and cellular. The outline of the paper is as follows. Section 2 illustrates a systolization process by matching an arbitrary gaxpy operation onto a fixed-size square array processor. The next section describes two systolic methods, one iterative and the other direct, for solving systems of linear equations. Section 4 derives a cellular algorithm for a fast Fourier transform, based on a new implementation on rectangular-array processors of the perfect-shuffle permutation. Finally, Section 5 gives snapshots of active research areas, using annotated lists of recent references on systolic linear solvers, the singular value decomposition, artificial neural networks and the simulated annealing algorithm. Our notation is standard with !RH" denoting Euclidean n-space and !N mxn representing the space of real m X n matrices. The subscript t denotes transposition of a matrix.
For the benefit of the reader not familiar with array processors, we briefly describe an illustrative architecture, namely, a prototype systolic/cellular processor [65, 71] built at Hughes Research Laboratories.
As shown in Fig. 1 , the system consists of a host processor, a back-end array processor coupled with a corresponding array memory, and a controlIer with related program memory. The host coordinates such tasks as loading data into the array processor. initiating the execution of programs, and unloading results from the array processor. The array processor consists of 256 processing cells configured as a 16 X 16 square mesh. The cells operate in SIMD mode. A masking capability allows turning off a given subset of cells during execution. Each interior cell is connected to its four nearest neighbors via two horizontal and two vertical connections.
There is also a wraparound connection between the two ends of each row. Each processing cell possesses a local memory and multiple functional units for computation. Boundary cells on one side can optionally have a hardwired square root in order to accelerate certain operations, for example, Givens rotations in the solution of linear systems [62] . The program memory is separate from the data memory. The latter is partitioned into 16 columns, each affiliated with a corresponding column of the array processor. This memory has two independent ports, one of which is used to load data into the array processor, the other to store results back into memory.
The prototype demonstrated the viability of the hardware design. Evolutionary models, currently under study, enhance miscellaneous capabilities, and increase the size of the array processor while preserving its square geometry and connectivity.
A variety of algorithms have been mapped onto such architectures by Hughes researchers [55] [56] [57] [61] [62] [63] [64] 70, 72, 73, 77] . Table 1 shows comparative peak performance parameters for a CRAY supercomputer with 4 processors and two systolic array processors of different sizes.
The processors respectively have a clock rate of 105, 8, 16 and a memory cycle of 26, 16.7, 26 all measured in MHz. In both array processors, the word length is 32 bits and each cell has two multipliers, two adders, a divider, and a selector, four of which can operate concurrently.
The 16 X 16 model uses fixed-point arithmetic.
The memory Bandwidth, MIPS and MFLOPS parameters are compared to those of the CRAY-IS normalized to 1. The benchmark is the time in milliseconds to solve a 100 X 100 system of linear equations. The CRAY uses FORTRAN routines DGEFA and DGESL from LINPACK, with Rolled BLAS calls, compiled with the CFT [18] . The two array processors use an assembly-coded version of the systolized partitioned Faddeev algorithm [62] described in the sequel.
Systolic gaxpy computation
Given an algorithm satisfying prerequisite properties, the process of deriving a model of a systolic array for executing the algorithm is called a mapping. Diverse methods exist for mapping algorithms onto systolic arrays. Cape110 an Steiglitz use geometric [9] and linear space-time transformations [lo] . Kung [37] proposes a technique, which we use in part in the sequel, based on data dependence and signal flow graphs. Leiserson [47] presents a scheme for minimizing the number of delay elements. Moldovan [58] uses a formalism based on transformation of index sets and data dependence vectors. Navarro et al. [66] study partitioning of matrix operations onto fixed-size array processors. Quinton [74] produces a method for algorithms that can be expressed as a set of uniform recurrence relations. Other contributions have been made. Although the solutions to mapping problems promote conceptual understanding of time-space issues at hand, the process by itself is only of partial use in practical situations that require the extraction of systolic algorithms for a given specific architecture.
The inverse of the mapping problem, in which one starts with an array processor and wishes to get systolic algorithms for it, is called the constrained mapping or matching problem. The former term denotes the process of fitting algorithms under constraints of array topology, number of cells, global and local memory sizes, etc. A special case of the matching problem occurs when the size of the dependence graph of the algorithm is bigger than the size of the array processor. This is the so-called partitioning problem.
In this section, in order to illustrate how systolic algorithms are derived, we shall match a simple linear-algebraic operation onto a fixed-size square array processor of the type described in Section 1. Specifically, our aim is to match an affine operation 
onto a square N X N systolic array, of fixed size N, with square mesh connectivity, whose primitive operation in each cell is a scalar multiply-add ffP+v.
In effect, we shall specify the equivalent of a BLAS2 primitive [25, Section 31 for the targeted architecture. The affine operation (1) is also called a gaxpy operation, a name that has its origin in the LINPACK software package. One can think of the term as a mnemonic for "general Ax plus y ". This operation, which may be viewed as an extension of the scalar operation (2) , is in fact computed as a systematic combination of its scalar counterpart. Our matching procedure proceeds in three steps. With a view of systolizing the procedure, the algorithm is put in single assignment form (each indexed variable is assigned a single value) and transmittent variables x(i, j) and z( i, j) are used to localize all data dependencies. These transmittent variables will allow us to distribute the xi and zj values without broadcasting.
In order to systolize the algorithm, we apply the so-called canonical mapping methodology [37, Section 3.31 . In this approach, three structures are successively developed:
(1) a dependence graph (DG) which shows the data dependencies in the algorithm; (2) a signal flow graph (SFG), which serves as an abstract model of the processor arrray, obtained by operating on the DG; (3) and lastly, a systolic array design obtained by incorporating temporal considerations in the SFG.
We briefly describe each of these structures. The DG is a directed graph whose nodes are elements of the index space {k=(i, j): l<i<m,l<j<N} and whose edges denote data dependencies. An index point has incoming edges from all points it is directly dependent on. The DG for the affine operation with m = 4, N = 3 is shown in Fig. 2 . The DG satisfies two prerequisites for the use of the canonical mapping methodology.
(1) The lengths of edges are independent of the problem size m, N. This property of the DG in fact formally defines a locally recursive algorithm.
---(2) The DG is shift imaria+ in the sense that for index vectors k,, k,, k if a variable at q depends on a variable at q -k, then a variable at k, depends on a variable at & -k. Inputs and outputs at the border nodes are exempted from this requirement.
A graph with these properties represents a locally repetitive and homogeneous algorithm, well-suited for systematic systolization.
An SFG is also a directed graph. Nodes represent primitive operations and each edge denotes either a dependence relation or a time delay operator, in which case that edge is labelled with a capital letter D. In the SFG, primitive operations are assumed to require no execution time. Consequently, the derivation of time-space relations associated with pipelining are postponed to the third stage of the methodology. The SFG is obtained by operating on the DG. The nodes of the index space are assigned to nodes of the SFG and the order in which these nodes are to be executed is scheduled. These two steps proceed like this:
(1) A projection map 7~ defined on the index space is used to assign nodes of the DG onto nodes of the SFG. For the affine operation under consideration, we take r(i, j) =j, The map u, which can be written in the form a(k) = ix' -1 where fi = (1, 1) is orthogonal to the hyperplanes, is sometimes called a linear schedule. The third stage of the canonical mapping methodology is to transform the SFG into a systolic array. Whereas the SFG is always spatially localized, it may not be temporally localized because some of its edges contain no delay. Technically, a systolic array differs from an SFG only in that it has a positive number of delays on every edge. Put in equivalent terms, a systolic array may be considered as an SFG combined with retiming and pipelining. Formal techniques can be used to perform this transformation.
However, as illustrated in Fig. 4 , the passage from the SFG to the systolic array is straightforward for an affine operation. The result is a one-dimensional array with N nodes each performing a multiply-add as a primitive operation. A cycle is the time span for the SIMD execution of a primitive operation.
Partitioning
Two schemes are commonly used to map a DG onto an SFG, namely, the locally sequential globally parallel (LSGP) method and the local& parallel globally sequential (LPGS) method [37, Section 6.3.21. In both of these methods, the DG is partitioned into blocks. In the LPGS method, block size is chosen to match array size, the nodes within one block are processed by the array in parallel, and the blocks are thus processed one block at a time in sequential fashion. We use this method to partition large-scale affine operations onto a fixed-size linear array.
Consider again the operation is partitioned as in Section 2.2 on a row of the square array. The N rows operate in parallel. Each row has its own vertical pipeline and there is no data transfer between rows. The process requires T = pp + N -1 cycles to produce the complete output. Ignoring costs of data movement, idealized speedup is ppN2/T compared to serial computation. In the special case m = p = pN, a method for transforming a linear systolic array into a rectangular p x N systolic array has recently been given [49] . However, this is not a matching procedure, since the dimensions of the resulting array are problem dependent.
Systolic linear solvers
In this section, we describe two systolic algorithms for solving systems of linear equations. The first algorithm consists of an iterative method based on the systolized gaxpy operation described earlier. The second algorithm uses a modification of the Faddeev procedure [24, 62] .
Iterative method
Given a matrix L = J -K E 8 nxn, where J is invertible, and a system Lu = v. If M = J-'K has a spectral radius less than one, then x,,, the Neumann iterates
When the matrix M is dense and large, the iteration (3) is rarely used to its linear convergence.
because it is costly relative A little-known modification due to [50] , which yields quadratically convergent iterates, is advantageously implementable on square systolic arrays. The Hotelling-Lonseth iterates y, are
L is invertible and, for any When r = R, the second gaxpy and the matrix multiplication need not be done. Hence, the procedure takes (2 R -1)To + (R -l)T, cycles for arbitrary y. and RT, + (R -l)T, cycles if y, = w, where TG, TM are respectively the cycle counts for a gaxpy operation and a matrix-matrix multiplication.
Assume that the algorithm is executed on an N X N square mesh array processor and that n = vN, where v is a positive integer. From Section 2.3, we find that TG = v* + N -1. Each column of the matrix product MM can be computed as a gaxpy operation partitioned as described in Section 2.2 on a single row of the array processor, with the N rows working in unison. We thus get that TM = ( v3 + 2v) N -2v, referring to [57] for details.
The Neumann iteration (3) for computing y, on the same systolic array takes (2R -1)To cycles. Thus the general Hotelling-Lonseth iteration has smaller arithmetic complexity on the systolic array only if the number of iterates R satisfies For example, on a 16 x 16 systolic array, this is the case for a 128 x 128 linear system if R 2 10, whereas on a 128 X 128 systolic array it is so for any R >, 1.
We emphasize that the cycle counts refer only to numbers of scalar multiply-add operations executed in SIMD mode. For simplicity, we have ignored the cost of data management.
In practice, however, data formatting and partitioning become increasingly costly when the dimension of the problem gets larger than the dimension of the systolic array.
Modified Faddeev method
Consider four matrices A, tableau: In contrast to the LU or QR factorization method, the Faddeev algorithm avoids the backsubstitution step used in solving an upper triangular system. This feature greatly facilitates mapping of the algorithm onto an array processor. On the other hand, partial or full pivoting, needed to avoid zero pivots and to control roundoff, is costly to implement on an array processor with nearest-neighbor connectivity. Furthermore, the direct application of orthogonal transformations to annul -C results in a modification of the Schur complement. In order to circumvent these problems, Nash and Hansen [62] proposed a modification of the Faddeev algorithm based on both orthogonal transformation and Gaussian elimination. Consider again the tableau, this time with m > n, and assume that A has rank n. The modified algorithm proceeds in two stages. From the point of view of algorithmic mapping, the main advantage of the modified Faddeev algorithm is that both its stages can be mapped onto the same systolic array. As illustrated in Fig. 8(a) , this array has trapezoidal shape and mesh connectivity. The circles and squares denote respectively boundary and internal processing cells. Cell specifications for the two stages are indicated in Figs. 8(b) and 8(c) , respectively. In Stage 1, the boundary cells compute sines and cosines of Givens rotations while internal cells perform the rotations. In Stage 2, boundary cells width equal to that of the array. Each strip is further divided into blocks matching the size of the array. The processing proceeds in several steps, with each step operating on all corresponding blocks of the strips. In any one step, the boundary cells generate their elements using data in the left-most strip, and during the remainder of the step, the internal cells use these elements for rotations or eliminations on the remaining data. We refer to [62] for details.
Cellular fast Fourier transform
The algorithms that we have described so far are of systolic type. During their execution, the cells in the array processor regularly pump data in and out, each time performing simultaneously a short computation, with the data systematically pipelined through the network. This mode of 310 G. Miel / Systolic and cellular computation operation requires that the algorithms be homogeneous and locally repetitive, as exemplified by the linear-algebraic operations that we considered. In the cellular mode, the input data are first loaded into the array, then processed in unison using as many steps as needed, including possible intercellular data transfers, and the results are then unloaded from the array. There is no orderly pipelining as in the systolic mode, and consequently, computation and data movement usually cannot be balanced for optimal concurrency. Hence, the cellular mode is generally not as efficient as the systolic mode. However, procedures that cannot be systolized must then be mapped in cellular mode. Fast Fourier transforms (FFTs), which have an inherently global data dependency that hinders local communication, are among such procedures.
We illustrate the cellular mode by describing a recent FFT algorithm for rectangular array processors [56] . The FFT is said to have constant geometry because the communication pattern shown in its dependence graph is periodic, namely, the addressing of operands for the arithmetic operations is the same from stage to stage. Such FFTs were originally derived by Pease [69] , using matrix factorizations to modify and parallelize the usual Cooley-Tukey procedure. For the N-point radix 2 case, the algorithm consists of log,N stages each preceded by a perfect shuffle of the data. The most natural mapping of this algorithm is onto a linear-array architecture with +N cells and a shuffle-exchange interconnection network [82, 84] . Our strategy is to further decompose the Pease factorization in order to map a 2MNs-point radix 2 FFT onto an M X N rectangular array processor. The result depends fundamentally on a factorization of the perfect-shuffle permutation.
The corresponding data movement is realized in parallel as relatively small perfect shuffles inside each local memory and along each row and column of the array processor, without requiring that the complete array itself have the shuffle-exchange interconnection network. While matrix operations were heretofore the objects of our algorithms, in this section we additionally use linear algebra to validate the actual mapping of the algorithm to the architecture.
Matrix factorizations
A perfect shuffle is a permutation that transforms the 2m-vector z= (0, l,..., m-l, m, m+l,..., 
Components that were m apart become adjacent as a result of .the perfect shuffle. For convenience, we simply call (5) the shuffle of z.
A standard result in algebra provides a constructive method for factoring any permutation as a product of 2-cycles, e.g., [30, p.781 . Moreover, any 2-cycle is itself a product of 2-cycles involving only adjacent components.
Hence, letting (k) denote the exchange of the k th and (k + 1)th components, one can show that S,, = T,_, . . . T,T,,
where q is a product of i adjacent exchanges:
q=(m-i)(m-i+2)**+(m+i-2).

G. Miel / Systolic and cellular computation 311
The case for m = 4 is shown below:
T, = ( Since the i exchanges in q. are independent of one another, they can be done in parallel. A consequence is that the factorization (6) can be implemented on an array processor with nearest-neighbor connectivity in m -1 steps, with the jth step consisting of the parallel execution of the i exchanges in 7].
The following factorization of the shuffle permutation has a basic role in our presentation.
Proof. We show that the effect of the factorization on the vector z = (0, 1,. . . ,2KL -1)' is the vector 
The concatenation for i = 0, 1,. . . , K -1 of these 2 L-tuples yields the desired vector (7) . III
The factorization in the lemma is analogous to cutting a deck of 2KL cards into a row of 2K subdecks of L cards each, rearranging these subdecks according to the shuffle permutation, shuffling together adjacent pairs of subdecks, and then collecting the resulting K subdecks into a single deck.
The discrete Fourier transform of a complex P-vector z = ( zO, zr, . . . , z~_~)~ is Fz, where F is a complex P x P matrix whose (A, p)th element is &', 0 < X, I_L < P -1, where w = eCiZn/' and i2 = -1. The Pease factorization [69] of the matrix F can be expressed as
UF= (A'P-?Sp) ... (A"'S,&4'"'Sp), (8) (9)
The k th stage of this FFT is represented by nlkjSp. where k = 0, 1, _. . . p -1 and p = log,P. The matrix C' is the permutation matrix that represents the bit-reversed order of the output vector. Each 2 x 2 matrix B(m). applied on a pair of adjacent data, is called a hutterf(rl operation. The element u"' is called a rGd& Jacror. For our purposes here, we need not define the exponents c('). The usual parallel implementation of (8) is on a linear array processor with +P cells, with &ch cell containing a pair of adjacent data. Each stage of the FFT begins with a perfect shuffle, involving data transfers among the cells, followed by a SIMD butterfly operation.
The following theorem, which further decomposes the Pease factorization, engenders a parallel FFT algorithm for an M x IV rectangular array processor.
The desired result follows. 0 04)
Basic algorithm
Let P = 2M~s where M. N. s are powers of 2 with nonnegative integer exponents. Our goal is to describe a parallel FFT algorithm for a 2-dimensional M x hi array processor. We first specify needed features of the targeted architecture.
Storage. Each of the MN cells has a local memory with 2s locations, each able to contain a complex datum, denoted by R,: R,, . . . 1 R ?$_ 1. A complex P-vector z is stored so that pairs of represents the data in the local memory of the (m, n)th cell.
Communication.
In addition to (15) , consider the vectors X'"' = (R$;,", R$;;';, R;~~", R$'$;, . . . , R$y+'), R(211;7-') )'T y(", = (RIO."), RIO;;', @s"), R;>;), . . . , R;"-lTJ), R;y;ld)t. I
We assume that the array processor can execute the following three operations: rshuffle: replace R'"+) by S,,yR'"s"' simultaneously for m = 0, 1,. . . , A4 -1 and n = 0, I..., N-1; xshuffle( i): for fixed i, replace X/"' by S,, X!"' simultaneously for m = 0, 1,. . . , M -1; yshuffle( i): for fixed i, replace KC') by SZMy(") simultaneously for n = 0, 1,. . . , N -1. The rshuffle operation consists of MN parallel shuffles of the 2s memory locations in each cell. This operation does not involve data transfers among cells, unlike the other two operations. The xshuffle(i) operation consists of M parallel shuffles, one in each row of the array, involving the pair of locations Rzi, R2,+, in each cell. The yshuffle(i) operation consists of N parallel shuffles, one in each column of the array, involving the pair of locations R;, R,+s in each cell. The connectivity in these operations is illustrated in Fig. 10 for the case M = N = s = 4.
Arithmetic.
For fixed i and k, let arith( k, i) denote the parallel butterfly operation that replaces 2, = ( R2;, Rzj+l)' in each cell by B( X)R,, where B(X) is the 2 x 2 matrix defined in (9) and X is an appropriate exponent. The totality of ;P butterflies in the k th stage of the FFT is obtained by executing arith( k, 0), arith( k, l) , . . . ,arith( k, s -1).
The desired algorithm is a transliteration of the factorization (10). The correspondence between the contents of the local memories and a P-vector z is given by the bijection @:z-+ { Rjr","' } SE 2, 
The effect of C, is to perform a shuffle on every 2N-tuple in z and the application of @ after this permutation results in a natural ordering of the data in the local memories. Since both the input and output vectors in (10) are premultiplied by the matrix C,, the mapping \k:z+.GJ?, *z = @C,z,
defines the correspondence between the P-vector and the local memories, respectively during the loading at the start and the unloading at the end of the FFT. During the FFT itself the correspondence is given by @. The natural ordering induced by 4 facilitates loading and unloading and it is useful for bit-reversed sorting of the output.
It can be shown formally [56] that the operation rshuffle carries out the permutation C,, that the collective effect of the s operations The For-k loop represents the p = log,P stages of the FFT. The unload operation usually includes a bit-reversed sort to get Fz in natural order. For the special cases M = s = 1 or N = s = 1, the above algorithm becomes the well-known Pease procedure for a linear array processor [69] . Thus, in its generality, the algorithm may be viewed as a partitioning of the Pease procedure for rectangular array processors.
Remarks
The communication complexity of the cellular algorithm for a P-point constant geometry FFT on a rectangular A4 x N array processor is dependent on the array's ability to perform the rshuffle, xshuffle, yshuffle operations.
In the ideal case, shuffle-exchange links in each local memory, row and column minimize the cost of data movement. In the case of an N X N array processor with square mesh connectivity, the shuffles become costly for large N. Thus, for an FFT with P = 2N2s points, using the factorization (6) of the perfect shuffle as a product of adjacent exchanges, if we assume that North-South and East-West intercellular exchanges and internal register exchanges each have unit cost, we find that the communication complexity is (2N.s -s -1)(2 log,N + log,s + 1).
On the other hand, it is 6 log,N + 3 log,s + 3 in the idealized case for an N X N array processor having maximally redundant shuffle-exchange links each with unit cost. In addition to constant geometry FFTs, the perfect shuffle has wide applicability in parallel processing, including bitonic sorting, polynomial evaluation, matrix transposition, and linear transformations [5, 21, 82, 84] . Hence, implementations of the perfect shuffle engendered by matrix factorizations (14) for rectangular array processors, or extensions for higher-dimensional arrays, can be put to good usage for such applications. Moreover, assuming bidirectional links in the shuffle connections, we have access to the inverse perfect shuffle for the class of algorithms based on recursive doubling [36, 83] . In this case, a factorization corresponding to that in the lemma, serves as the basis for deriving representations of the inverse perfect shuffles on array processors.
The lemma and the theorem extend to radix r constant geometry FFTs for d-dimensional array processors [56] . The matrix factorization corresponding to an FFT is then obtained by d applications of a basic factorization of the radix r shuffle permutation extending that of the lemma for the radix 2 case. The resulting data movement involves parallel radix r shuffles in each local memory and along each of the d-dimensional axes of the array processor. The data is loaded in natural order and each cell performs a butterfly by operating on r consecutive registers Ri> Ri+r,*. .T Ri+,-le
Active research areas
Due to space constraints, we are unable to describe numerous aspects of systolic and cellular computation. Our goal in this section is to provide snapshots, using annotated lists of recent references, of current research activities on the topic.
Systems of linear equations
The Gauss-Jordan method for matrix inversion, when compared to triangular matrix factorization techniques, performs poorly on serial computers. However, with no row or column interchange, it parallelizes remarkably well on array processors. Moreover, the method then embodies abstract recurrence relations solving the so-called algebraic-path problem, which encompasses as special cases, in addition to matrix inversion, the transitive closure and shortest-path problems in graph theory and regularization of languages in automata theory, see [28] . Descriptions of systolic arrays for the solution of the algebraic-path problem can be found in [48] . Such systolization illustrates nicely the symbiosis between mathematical structure and, at the conceptual level, array architecture.
G. Miel / Systolic and celhlar computation
Systolic arrays for LU and QR factorizations of matrices have been known since the early 1980s see [26, 46] . Partial or full pivoting, needed to avoid zero pivots and to control roundoff, is not readily amenable to parallel computation on array processors with nearest-neighbor connectivity. Gentleman and Kung [26] used so-called neighbor piuoting, in which the pivot is selected as the largest element among neighbors, to devise a systolic array for matrix triangularization. Roychowdhurry and Kailath [76] have presented a systolic array for LU factorization with partial pivoting, with the property that the rows of the resulting upper triangular factor are not naturally ordered, thus necessitating considerable overhead in the determination of the solution vector. Cholesky factorization LL', via the use of hyperbolic transformations, has been mapped onto a systolic array in [15] .
The systolized version of the modified Faddeev method, described in Section 3.2, dates back to the mid 1980s see [61, 62] . The Faddeev algorithm was also systolized in [12] for both fixed-size and variable-size problems, and in [14] for partitioned implementation on two-dimensional arrays of transputers.
Moreno and Lang [59, 60] used their mapping technique, the so-called multimesh graph method, to partition the Nash-Hansen procedure onto both l-dimensional and 2-dimensional systolic arrays. With the hypothesis that there be no data duplication, it is widely believed, though apparently no one has proved it, that optimal complexity for inversion of a dense n X n matrix on a systolic array is 5n inner product steps. The Nash-Hansen version of the Faddeev method and other systolic matrix inverters reach this bound. Megson [54] has recently proposed a Faddeev array that achieves matrix inversion in just 4n steps with O(n2) basic cells using careful duplication of some data.
Deprettere and Jainandunsing [17,31- 
Singular-value decomposition
The singular-ualue decomposition (SVD) of a matrix provides the most efficient computational method for the determination of the rank of a matrix, it yields the best approximation in a least-squares sense of a high-dimensional matrix by a lower-dimensional one, and it provides the most numerically stable solution to least-squares estimation problems, see [27] .
Because of such properties, there has been considerable interest on the use of SVD techniques in digital image and signal processing. For example, it has been known for a long time that the SVD is useful in image enhancement, moving target tracking, harmonic retrieval problems and image reconstruction.
Refer respectively to [3, 4, 38, 78] . Continuing research aims to find high-resolution algorithms for extracting signal and system parameters from measurements, using the SVD of appropriate data matrices as a means for robust separation of signal and noise, see [6,161.
For real-time or high-throughput applications, such as in avionics and space systems, systolic and cellular array processors provide a modern and attractive approach to computing the SVD of given matrices. Pertinent algorithms, for the SVD and related decompositions, have been presented in [7, 8, 22, 23, [51] [52] [53] 851 . Comon and Golub [13] have recently analyzed various numerical methods for finding extreme singular values and corresponding left singular vectors when the matrix is slowly varying in time. Such algorithms are of crucial importance in signal-processing applications based on low-rank approximation of a time-dependent covariance matrix, see the references cited in [13] .
Artificial neural networks
There is extensive ongoing research on neural networks and their representations on array processors. An introductory description of the subject can be found in [37, Section 4.61. Recent texts on the topic are [20, 79, 80] . Applications in neural computing include combinatorial optimization, image and signal processing, artificial intelligence, etc. An early systolic architecture for an artificial neural network was designed in [39] . The same authors recently proposed [40] a unified architecture, consisting of a programmable ring systolic array, that provides for both the retrieving and the learning phases of a wide variety of artificial neural networks. A concise taxonomy of artificial neural networks is given in the introduction of that article. Przytula et al. [70] have described graph-theoretic techniques for implementation of a wide class of neural networks of arbitrary size on mesh-connected SIMD arrays of fixed size. Shams and Przytula [77] used a multilayer perception network, implementable on 2-dimensional array processors, for underwater target detection. Other applications of artificial neural network in signal processing are described in references cited in [70] . A quick overview of other current activities on neural computing, e.g., on theory and algorithms, speech and image processing, and on implementation methods, can be gleaned from the proceedings [2] of a recent EURASIP workshop.
Simulated annealing algorithm
Simulated annealing, so called because of its conceptual analogy with metallurgical annealing, uses a stochastic approach to combinatorial optimization. The algorithm has diverse applications in job shop scheduling, circuit design, artificial intelligence, image restoration, etc. See, e.g., [29, 44, 45] . A general-purpose optimization routine, based on the annealing algorithm and written in Pascal, was published in [68] . A design for a simulated annealing array processor was presented in [37, Chapter 81, [41] .
We make special mention of simulated annealing because the algorithm is closely related to Boltzmann machines, see [l] , and because the latter topic provides a rigorous mathematical base from which to penetrate the much less rigorous, but quickly emerging, area of neural computing. A Boltzmann machine is a network of elements, each with state either 0 or 1, bidirectionally interconnected with strengths that can take arbitrary values. The aim is to maximize a consensus function C, defined as the sum of products of corresponding states and strengths. A transition unit is chosen for proposed changes in the machine. Such a change is accepted if either C increases or if l/(1 + e-Ac's) is less than a uniformly distributed random number in [0, 11. A decreasing sequence of s values, called a cooling schedule, is used to bring the Boltzman machine into near optimal configuration.
Since the appearance of papers by Kirkpatrick et al. [35] in 1982 and Cerny [ll] in 1985, who initiated the research on the subject, the analytical framework behind simulated annealing has considerably evolved. There is an extensive body of knowledge on global convergence properties of the simulated annealing algorithm. Since Boltzmann machines are analogous to simulated annealing, there are predictably corresponding results for Boltzmann machines. However, if in the definition of the Boltzmann machine one allows the choice of transition units to be implementable in parallel, asymptotic convergence for the resulting network appears to be still an open problem.
