We first present a formal deriuation and proof of correctness of a systolic army for the knapsack problem, an NP-complete problem whose dependency gmph that is not completely known statically. With p PE's, each with a fixed size memory, the armystretch PYM in e(?), which gives optimal speedup of the ALoo-rithm. However, it has an intricate tag-based control mechanism which is dificult to implement, and we improve the arehi-TECH-tun so that the control can be implemented with two simple counters and a few flip-flops. Coeficient loading is done with a multi-mte clock which avoids the need for shadow registers. We then ezplore the tradeoff between the number of PE's and memory size, a, using the ezpected running time of the algorithm as a cost measure and a register level model of VLSI. We show analytically how Q may be chosen to optimize the total computation time, yielding an a n a time optimal cir-curr.
Introduction
The general knapsack problem can be stated as follows. Given a knapsack of capacity c, and m items, each with weight wi and value pi, where the coefficients are non-negative integers. This is a classical NP-complete, combinatorial optimization problem with a wide range of applications [Hu82, MT90] . It can be solved sequentially in O(mc) time. This time bound is not polynomial since c is exponential w.r.t. the input size, and such a time bound is called pseudo-polynomial [GJ79] .
Moreover, it has been shown that time consuming instances of these problems are easily generated [Chv80] , and this makes it an ideal candidate for implementation on application specific VLSI processors. One of the standard approaches to solving this problem is dynamic programming, [Bel57, GN72, Hu82] , which may be formulated as follows: for all with boundary conditions f(j,O) = f ( 0 , k ) = 0 and f(j,k) = -00 when j < 0. This equation describes an algebraic schema of problems which may all be solved by the same algorithm: general knapsack problem: (here, the function is max, and p = 0); 0/1 knapsack problem: (e is max and p = 1); subset-sum problem: (a 0/1 knapsack problem with pi = w i ) and change making problem: (e is min and p = 0). In this paper, we consider only the general problem, but our results can be adapted to any of the Variants. Now, notice that (2) contains the term wk, a problem input, in one of the data dependencies. In general, such recurrences belong to the following claas:
Vz E D + f(z) = . . . f(Az + a + g ( z ) ) . . .
The Az + a is a classical affine dependency [RF90] , but the presence of a data variable g ( z ) means that the dependency itself is not known statically (see Fig 1) . Such recurrences are said to have "dynamic" (or "run-time") dependencies [Meg93]), and the usual techniques for systolic array synthesis [QaSS] are not applicable. A general synthesis method for such recurrences is not available, and thus there are two motivations for studying the systolic implementation of (2): the problem is important in its own right, and a careful study may lead to a synthesis method for recurrences with run-time dependencies.
In this paper we present the following results.
We give a formal derivation (and proof of correctness) of a systolic knapsack array.
The array is known in the literature, [AQ92] and is based on a fixed size ring topology and haa optimal speedup. It uses an intricate partitioning scheme, with dynamically computed tags, and our derivation here provides a rigorous proof of correctness.
The complicated control of the array makes it unrealistic to implement in VLSI. We show how the control can be reduced to two simple counters and a few flip-flops. We also show how the coefficients can be loaded efficiently by means of a multi-rate clocking scheme, which obviates the need of shadow registers.
The time complexity of the array is 8 (7 [-I), where q is the number of PE's,
W -
is the largest weight among all problem instances, and a is the size of the memory in each PE, a design parameter. Note that a and q are not independent:
in a VLSI design, the number of PE's that one can put on a chip depends on how much memory they have. We explore this tradeoff by formulating the choice of Q as an optimization problem, whose solution is obtained analytically
where, a1 is the cost of the CPU + control + IO, and a2 is the memory cost (per word). For our design, this improves the performance by 26% without any incrwrse in a m , and yields a final design which is area time optimal.
In this paper, we focus on improving the forward computation, i.e, (2). It is known [AQ92] that if a few registers are maintained during the forward phase, the values of the optimal solution can be easily obtained in c + m additional steps at the end.
Systolic Scheduling of the Knapsack Recurrence
The following formal derivation of the Andonov-Quinton array, serves as a caae study for systolic array synthesis for recurrences with run-time dependencies. There are two crucial observations about the dependencies of (2) that allow us to synthesize the array, even though we do not completely know the DG statically-the dynamic component of the dependencies is vertical and downward (see Fig 1) . The first observation implies that if we choose the allocation function to be a(j,k) = k, all the "irregularity" is projected into the processor space. The PE must manage a certain amount of memory (not just a few registers), but the interconnections are purely systolic. The second observation implies that t ( j , k) = j + k is a valid linear schedule.
This yields an array with m PE's, with w-words of memory. The PE's are initialized with wk, and at each cyde they compute @ (i.e., max) using one value that arrives from the left neighbor and one value from the memory; the result is sent to the right neighbor and also stored in the memory (only the first W k memory locations are used). The program executed by the k-th PE is as follows:
The problem with this array, is that the memory size of each PE is w-, which is a problem parameter, and this is not truly systolic. To solve this, we use a space time map given by the following allocation and timing functions (a, the memory size of each PE, is a constant-a design parameter).
We will also need an auxiliary function for our proof of correcness:
This space-time map is not a simple linear transformation, as used in the literature [QR89] . Indeed, with such a mapping, there is even no guarantee that the transformed algorithm is systolic, and this needs to be proved formally.
Lemma 1 The space-time map given by (4-5) is free of conflict.
Proof: We show that for any two points, b, k] and v, k'l, in the index space, if
then j = j' and k = k'. The details are in the Appendiu. I
However, merely proving that the mapping is conflict-free, is not enough. We can readily see from the above definitions, that
Thus, the dependency b, k] + b, k + 1 1 is mapped to processors that are separated by A(j,k). Since this is not a constant vector, but a function of the index point and (more importantly), of wk which is not known statically, our array does not even have local interconnections! Fortunately, we also we that A(j,k) = t ( j , k + 1) -t ( j , k ) . Hence, a result produced by any processor, 2, at time, t , is required by processor z + A at time t + A. As a result, we may design a systolic array that operates as follows. Each value of f computed in the array is tagged with a A value. Every PE merely transmits it, while decrementing the tag. By the time the value arrives at the destination, the tag has reduced to 1, and the PE Uknows" that the d u e is for it. The code executed by each PE is shown in Fig 2. In order to prove that this strategy works correctly we must show that when a data value is being propagated in this manner, the PE's that it passes through are not busy with their own computation (otherwise there will be a conflict for the interconnection wires, and also for the ALU to decrement the tag). It can be implemented with a couple of comparators and an adder dong with a few multiplxers. However, the fag based control is significantly more expensive, since it requires two divisions modulo a, one modulo wk and one modulo Wk+l. These considerations make it unrealistic for VLSI implementation, and we now seek to improve this. We will show how this can be achieved with two simple counters and a few flip-flops.
As we have seen, the function A(j, k) is crucial for the tagging scheme. Consider the PE, a ( j , k) which has to compute f(j, k), and let p = Thus we see that each PE starts its activity at to = p + z + (z -1)" + p w k + v, and after that it stays active for the next a steps, and this pattern repeats with a period of W k . Based on this, we have the following theorem.
Theorem 2 The control of each PE may be implemented with two counters, one that counts modulo a and one that counts modulo W k . when the data in the memory is assumed to be -m. This part is the same as in the Andonov-Quinton array (except for some reorganization), and can be laid out in a fairly standard manner. The control part uses three flip-flops (ACT, MEMO, and FIRST) and the two counters, C, and Ca. The former is a programmable counter and its upper limit is controlled by W k , but the latter counts modulo aOPt, a fixed constant.
International Conference on ApplicationSpacific Array Pmceasore
Coefficient loading
For correct operation, each PE must know w,, Pk, obj, and FIRST,. We now discuss how these values can be efficiently loaded into the array. Since the host knows the wi's for the particular problem instance, there is a simple preprocessing step to determine these values (how many successive PE's have the same coefficients, etc.) and the appropriate sequence of coefficients can be generated off-line. A standard loading method is to simply pipeline these values into the array from the left (the coefficients of the last PE are entered first), and to issue a broadcast at the q-th step (there are q PE's in the array). However, for good throughput one would like to do this while another problem is being solved in the array. So, until the broadcast is issued, the PE must save the new values, but continue to use the older values until it is time to commence the next problem instance. This necessitates a set of "shadow registers" and a control signal to indicate the start of a new pass, and this is too inefficient.
Consider another alternative. We load the coefficients in the reverse order (the coefficients of the first PE are the first in the sequence). We have a control signal that is propagated with twice the delay as the Coefficients (one extra latch in the control path), which is aligned with the first data value. Hence, this control bit meets the i-th data value at the i-th PE, and can be interpreted as a "LOAD" command. The only problem is that this happens at t = 2i. The solution is simple. We run this loading circuitry at twice the clock rate as the main circuitry. This is feasible in terms of speed, because the adder and the comparator in the main data path imply a critical delay. Furthermore, there is no need for shadow registers, and the LOAD signal can also be used to start the computation of the next pass.
A VLSI Model and Optimal Memory Size
Throughout our development so far, we have not specified the value of the PE memory, a, other than saying that it is constant, independent of the problem parameters. Hence we can treat it as a design parameter. We now show how it can be chosen optimally. The time complexity [AQ92] of the array is: c m whose dominant term, ; C[w;/al will be denoted by F(a). We note that q depends on Y i=l a, since the number of PE's that we can implement on a chip depends crucially on how much memory they have. Hence, the problem that we must solve, is formulated as follows. which minimizes E ( F ( a ) ) is either [a*] or [a'l, where Otherwise, aopt = w-.
I

Corollary 1 The optimal time wmplezity is
Proofi By substituting a* in (9) and simplifying.
I
Performance Estimation
We now apply these results to the VLSI design of Sec 3 to estimate the improvement that we can obtain. We assume l p CMOS technology, and that 20 x 21° static registers (16-bit) can be laid out on a lcm2 VLSI chip. We also assume that w-is 1023 (typical of values used in standard texts [MT90) ).
In our design, the main data path for the computation off is fairly standard, and takes up an area comparable to 16 registers. The control part takes up about 14 registers, and thus, the entire PE can be laid out in 30 registers. We choose to use a DRAM because we know that every memory location (of interest) will be accessed periodically, with a period of exactly wk, and hence refresh circuitry is not needed (in fact, the minimum speed required to avoid the refresh is only 0.25 MHz). For the 1p CMOS technology, we assume that each memory cell takes up about half the area of a static register. Hence a2 = 4, and al = 30. Using this data in our equations, we have, Result 1 The optimal memory size of our design is 247, and the chip has 133 PE's. We compare this with an array where each PE has w, , memory cells (note that this array does not need to use any tagging scheme, so its CPU + control + IO is only 26 registers).
The expected running time of our optimized design is 73.5% of the other.
Related Work and Conclusions
Since the knapsack problem is an important combinatorial optimization problem, many authors have developed parallel algorithms for it. In addition to branch and bound, and approximate algorithms, dynamic programming [LSSSS, CCJ90, LS91, Ten901 is a popular technique. The performace depends on three parameters: c, m and w-, the largest object weight, unlike the sequential case when only m and c are relevant [Tengo] . The number of processors required by most of the parallel algorithms is exponential in the size of the input and these algorithms have a very low processor efficiency. For example Teng's algorithms req+res M(c) processors and takes O(log2(mc)) time. The function M ( n ) above denotes the number of processors needed for multiplying two n by n matrices in O(1ogn) parallel time (which is known to be between n2 and n3). Hence, at least O(c2) processors are needed, and $ is a factor in the the efficiency, which approaches zero as c increases.
Recently, a dynamic programming algorithm for the 0/1 knapsack problem, was presented by Lin and Storer [LS91] . It improves on Lee et al. [LSSSS] , and its running time is O(mc/q) on an EREW PRAM of q processors and this algorithm has optimal speedup and processor efficiency. The authors report that on a Connection Machine, the time complexity increases to @(-) because of communication costs. Chen et al.
[CCJSO] present a linear array of q general purpose processors (each with a sufficiently large RAM). It has a time complexity O(mc/q), which is asymptoticaly optimal.
Our model of computation is weaker-a systolic ring, where each processor has only a constant memory, and we achieve optimal performance. Furthermore, a software version of our algorithm has very good performance, as reported in a companion paper [ARQ93] (see references therein for a detailed survey of software implementations).
Independent of the knapsack problem, many researchers have proposed systolic arrays for dynamic programming. This includes arrays for specific instances, such as optimal string parenthesization [GKT79] , etc., some general arrays [RV84, My0911 as well as arrays that can solve any instance of dynamic programming, [LW85, LL861. Usually, the arrays for specific instantaces have better performance than the general purpose arrays because they can exploit properties of the particular problem at hand. This is the case for the knapsack problem too-if we use Li and Wah's array [LW85] for our problem, we will need c processors and take me time steps, which is clearly unacceptable (since we can achieve the same time on a uniprocessor). Even the delta transformation proposed by Lipton and Lopresti [LL86] yields only logarithmic improvement in space and time. Because on these considerations, our array was designed by studying the specific structure of our problem formulation (2).
In conclusion, we have presented a complete derivation of an optimal VLSI design for the general knapsack problem. Since it belongs to the class of NP-complete problems its fast parallel implementation is interesting in its own right. Because the algorithm for the dynamic programming solution of this problem has run-time dependencies, it is also interesting from the view of design methodology.
We have followed the design trajectory all the way from the level of algorithm through architecture and finally to circuit. At each time we have used appropriate theoretical tools for analysis of the performance. At the final circuit level, there is still scope for improvement. Different memory technologies may be tried out, different clocking schemes may be explored, and so on. Our results show how it is important to combine many diverse areasoptimal parallel algorithms, VLSI design, operations research and dependence analysis to achieve high performance ALGO-TECH-CUIT engineering for ASAP. The authors would like to thank Dominique Lavenier and Doran Wdde for helpful discussions. This means that there will never be any need to propagate values to the left. From this fact, and (7), it follows that the allocation function is monotonically increasing with E. From (4-5) we have, t ( j , k) = j + a ( j , k), so t ( j , k) + z = j + a ( j , k) + z.
Let, if possible, for some z : 0 < z < A(j, k)
Since z < A, we substitute from (4) and ( 6 ) , and simplify to obtain Substituting this in (10) and simplifying, we have
(since e is arbitrarily small) Setting its first derivative to zero we get a quadratic equation, whose solution, a*,
. Now observe from (13) that E ( F ( a ) ) has the form C1a + % -I-C,, i.e., it is the sum of linear and hyperbolic functions. Hence it is a strongly convex function, and has a unique real minimum at a* which is positive (and real, in general). In the interval [O,a*], E ( F ( a ) ) is monotonically decreasing, and in the interval [a*,m] it is monotonically increasing. Now, the integer value aOpt which minimizes it in the range a E { 1 . . .tu-}, is determined as follows.
If a ' 5 w-, then aOpt is one of the integers closest to a*, i.e., la*J or [a*1. Hence we simply compare E(F( [a*J)) and E(F( [a*l)) and choose the one that is smaller.
Otherwise, if a* > w, , then E ( F ( a ) ) is monotonically decreasing in the range (1.. .tom,}, and hence the minimum occurs at w-.
Simplifying a* 5 w-, yields I a l ( w " + 1)
is a* = Lthe condition mentioned in the theorem, 2 5 a
