We present a parallel solution to the unbounded knapsack problem on a linear systolic array. It achieves optimal speedup for this well-known, NP-hard problem on a model of computation that is weaker than the PRAM. Our array is correct by construction, as it is formally derived by transforming a recurrence equation specifying the algorithm.
Introduction
The unbounded knapsack problem is a classic NP-hard, combinatorial optimization problem with a wide range of applications Hu69, MT90]. It can be solved sequentially in (mc) time, where m is the number of objects and c is the capacity. This is called pseudo polynomial complexity GJ79] since it is polynomial with respect to a parameter, c, which itself, can be an exponentially growing function of input size. Many time consuming instances of these problems exist CHR88], and it is therefore important to investigate e cient parallel implementations. One of the standard approaches to solving this problem is dynamic programming Bel57, GN72, Hu69] . In this paper, we address the problem of deriving a systolic VLSI array for the dynamic programming algorithm for the knapsack problem.
Systolic array synthesis has been an active research area for some time, and there is now a large body of work on deriving systolic arrays from systems of recurrence equations. Typically, one starts with a recurrence of the following form.
for all index points, z; X z] = g(: : : Y d(z)] : : :) ( 
1)
It speci es that in order to calculate the variable X at any point z, it is necessary to apply the atomic function g to its arguments. All arguments are of the form, Y (d(z)); i.e., the variable Y at a point d(z). Y may be de ned similarly, by its own recurrence equation, or it may be an input. The function d is called a dependency function. The problem of synthesizing systolic arrays from recurrences is now well developed. The simplest case is when all the dependencies are uniform, i.e., d(z) = z + a for some constant a. The synthesis techniques for this case are now more than a decade old and are based on a seminal paper by Karp et al. KMW67 ] which even predates systolic arrays. The synthesis problem can be reduced to determining appropriate linear scheduling and allocation functions. The problem is a little more complicated when the dependency function has the form d(z) = Az + a for some constant matrix, A, and a constant vector, a (this is called an a ne dependency). In addition to determining linear scheduling and allocation functions, it is also necessary to localize the recurrence (i.e., convert it into an equivalent uniform recurrence). This problem, too, has received a satisfactory solution (see RF90] , QR89] Ch 13, and references therein), and many researchers have developed software systems for systolic synthesis from systems of a ne recurrences.
The dynamic programming problem for the knapsack problem has a very simple speci cation as a recurrence equation. However, we shall see that the dependency function here is neither uniform nor a ne. Indeed, the function d(z) is di erent for di erent z and may change from one problem instance to another. Such a class of recurrences are said to have dynamic (or runtime) dependencies Meg93], and the well-known synthesis techniques are no longer applicable.
A complete synthesis method for such recurrences is not available at present.
Thus, there are two motivations for studying the systolic implementation of the dynamic programming algorithm for the knapsack problem: rst, the problem is important in its own right and would bene t from parallel VLSI implementation on dedicated processor arrays; second, a careful study of this particular recurrence may lead to a synthesis method for recurrences with dynamic dependencies. In this paper we present the following results.
We describe a linear systolic array for the unbounded knapsack problem. The array is modular (all PEs are identical, independent of all problem parameters), and each PE has only a xed memory ( words). The array can be adapted to run on only q PEs (connected in a ring). Then, its average time complexity is mc q l wmax+w min m , where w max and w min are, respectively, the expected largest and smallest object weights among all problem instances. When the PE memory is large enough (this is the case for general purpose machines, and is implicitly assumed by all researchers), this yields optimal speedup. For VLSI implementation we use a more precise notion of optimality as discussed below.
Our array is correct by construction: it is formally derived by applying transformations to an initial system of recurrence equations that specify the dynamic programming algorithm.
The derivation serves as a case study for systolic synthesis in the presence of dynamic dependencies and highlights the fact that it is not enough to merely determine a con ict free space-time transformation, as in existing synthesis methods. We tackle the additional problems that need to be resolved and give su cient conditions that a transformation must satisfy, in order that the resulting recurrence can be interpreted as a systolic array.
We address some pragmatic considerations: in addition to implementing the array on only a xed number of PEs, we simplify the control to just two counters and a few latches and give a method for loading the coe cients so that successive problems can be pipelined without any loss of throughput.
In terms of practical implementation, we seek an optimal tradeo (between memory and processors) given that we have xed silicon resources. Using a register level model of VLSI, we formulate a non-linear optimization that seeks to minimize the expected running time of the array (for a uniform distribution of problem instances). We solve it analytically, enabling us to choose the memory size and yielding an optimal circuit. It is our strong conviction that in order to obtain good implementations, one must investigate the problem at many di erent levels: from algorithms, through architecture, and all the way to VLSI circuit level optimizations. Our results highlight the cross disciplinary nature of designing application-speci c array processors, or \algo-tech-cuit engineering".
The remainder of this paper is organized as follows. In the next section, we provide a review of background and notation, de ne the problem, and derive a linear array for the knapsack problem which, although not systolic, serves as a starting point of our synthesis. In Sec. 3 we give the systematic derivation of the systolic array and illustrate how the synthesis methods must be modi ed to adapt to dynamic dependencies. In Sec. 4 we deal with the pragmatic aspects, and in Sec. 5 we show how the PE memory size which minimizes the running time can be determined by solving a non-linear optimization problem. In Sec. 6, we develop a general synthesis method for deriving systolic arrays with non-linear transformations. Finally, in Sec. 7, we conclude with a summary of related work, and an outline of future work.
2 Background, Problem and Naive Solution
We brie y review the techniques of systolic synthesis, formally de ne the algorithm that we want to map to silicon, and present a naive solution that serves as our starting point.
Review of systolic array synthesis
We start with a speci cation in the form of a system of uniform recurrence equations, or UREs KMW67]. Associated with each such system is its data ow graph, which models every computation as a distinct node, and whose edges indicate the precedences for the computation. Note that the data ow graph may be arbitrarily large (even in nite for some problems), but due of uniformity, it can be represented compactly by means of a nite number of constant vectors, called dependence vectors or dependencies. To derive a systolic array from such recurrences, we assume that the body of the equation can be computed in one step (by means of special hardware units), and seek two functions, namely, an allocation function and a timing function (also called a schedule), that map each index point to a processor location and a time instant, respectively. Together, they constitute a space-time mapping or transformation of the original recurrence and its data ow graph. The timing function must satisfy causality, meaning that no computation can be scheduled until its arguments have been computed. Furthermore, the allocation and the timing functions must not con ict, i.e., two distinct points must never be mapped to the same PE at the same time. Note that in general, a space-time transformation may yield an array where all PEs are not active at all time instants (we say that a space-time point is active if it is the image of some point in the original index domain). For a number of reasons, it is very useful to restrict these functions to be linear (or a ne) functions of the indices: 4
Many techniques have been developed to derive linear schedules systematically using integer linear programming. The number of constraints is the ( nite, and usually small) number of distinct dependence vectors, independent of the size of the data ow graph. A linear schedule is speci ed by a single vector, . Such schedules are known to be optimal for UREs.
A linear allocation function is characterized by a single projection vector, u.
Ensuring that the mapping is con ict free involves a simple test that T u 6 = 0.
Techniques exist for systematically enumerating linear allocation functions. If linear space-time transformations are used, the resulting recurrence (and its data ow graph) remain uniform. Indeed, the new URE can be automatically derived by means of simple rewriting rules. We say that UREs are closed under linear transformations. As we shall see, this is crucial for the synthsis, because the new dependencies can be directly interpreted as the interconnection links of the processor array and may be used to generate masks automatically for VLSI layouts.
To illustrate this, consider the problem of designing a systolic array for the URE of Eqn. (2), below. It speci es the computation of X over an N N domain, initialized to 0 at the boundaries.
It has two dependencies, 0; 1] and 1; 0], and its data ow graph is shown in Fig. 1 -a.
Because the URE has two dependencies, the (linear) schedules must satisfy just two constraints, namely that t(0; 1) > 0, and t(1; 0) > 0, and thus, t(j; k) = j + k is a valid (indeed, the optimal) schedule. Also, a(j; k) = k is a valid allocation function. The corresponding space-time transformation is shown in Fig.1-b , and when it is applied to Eqn. (2), we obtain the URE below, whose data ow graph is shown in Fig.1 
Its dependencies are 1; 1] and 1; 0], and it can be interpreted as an array where each PE sends one value to its right neighbor (with one clock cycle delay), and updates an internal register every cycle (see Fig. 1-d) . 
where z i is the number of i-th type objects included in the knapsack. The problem, as speci ed above is often called the unbounded knapsack problem, since the only constraint on the solution (other than the capacity constraint) is that it is non-negative. There are many variations of this problem MT90], such as the bounded knapsack problem (here, additional constraints of the form z i b i must be satis ed), the 0/1 knapsack problem (a particular case of the bounded problem, where b i = 1: there is exactly one copy of each type of object), the subset sum problem (a 0/1 problem with w i = p i ), the change making problem, etc. They arise in di erent application domains, and are all NP-hard. The most common (exact) algorithmic techniques for solving knapsack problems are dynamic programming LSS88, CCJ90, LS91, Ten90] and branch-and-bound LS84, JF88] . In this paper, we concentrate on dynamic programming, which is based on the principle of optimality Bel57] and consists of two phases. In the rst (forward) phase, the optimal value of the pro t function is computed; this is used in the second (backtracking) phase to nd the actual solution. The forward phase is usually formulated in a recursive manner by decomposing the problem into sub-problems. In particular, we de ne a function f(j; k) which denotes the value of an optimal solution of the sub-problem, where only the rst k objects are considered and only a capacity of j is available. It is well-known that the computation of f(j; k) is speci ed recursively as follows. Traditionally, f(j; k) is viewed as a m c table which is \ lled up" during the forward phase. The backtracking phase consists of using this table of f(j; k) to determine the actual solution. For the unbounded knapsack problem, Hu has given an elegant, memory e cient implementation of this phase Hu69]. If certain book-keeping information is maintained during the forward phase, the solution can be obtained in (c + m) time and c space (only the last column needs to be saved, not the entire table). The book-keeping corresponds to simply keeping track of which of the two arguments f(j; k ? 1) or p k + f(j ? w k ; k) was chosen in calculating f(j; k). At present, the question of whether a similar technique exists for the other variants of the knapsack problem is still open. This is one of the reasons why we consider only the unbounded knapsack problem in this paper. All our results can be trivially extended to the forward phase of the other variants. 
Observe that we have rewritten Eqn. (5) to account for the fact that is max and = 0, and have split the last line into two conditions. By doing so, we see that the two recurrences have identical conditional branches. Also note that the body of Eqn. (7) involves no real computation.
Hence, any PE that computes f(j; k) can be easily adapted to also compute u(j; k) by adding two registers and a multiplexor. Henceforth, we will not deal explicitly with the calculation of u(j; k), except to account for this additional hardware cost. Now, we see that in order to compute f(j; k) at any given point (other than those near the boundaries), we need two arguments: one that comes from the point j; k ? 1], and the other from j?w k ; k]. The dependencies are thus 0; 1], which is uniform, and w k ; 0], which is di erent from one column to another, and will doubtless be di erent for di erent problem instances (see Fig. 2 ). It is this that complicates the problem. Indeed, this recurrence is neither uniform nor even a ne, since the dependency is not even known at compile time. Such recurrences are said to have dynamic or run-time dependencies Meg93]. In spite of this, let us continue to use the notion of timing and allocation functions, and try to obtain a systolic array for it. There are two crucial observations that allow us to synthesize a locally connected (albeit, non-systolic) array, even though we do not completely know the dependency graph statically: The dynamic component of the dependencies is vertical. This implies that if all the points in the k-th column are allocated to the k-th PE then, for any computation performed by this PE, its two arguments must come from the k ? 1 th PE, and the k-th PE itself respectively . In other words, if we choose the allocation function to be a(j; k) = k, all the \irregularity is projected within each processor" and the interconnections will be localized.
The dynamic component of the dependencies is also downward. As a result, a valid linear schedule is given by t(j; k) = j + k.
Notice that our timing and allocation functions are linear, and can be used to obtain the transformed recurrence shown below. Please try to visualize the transformed data ow graph (the transformation is the same as in Fig. 1 ) and verify the boundary conditions and dependencies in the following recurrence: Let us look closely at this recurrence and interpret x as the PE label, and t as the time. Note that when x = 0, the value produced is always 0, so PE 0 is trivial. All other PEs have the same structure (no need to implement the test for x > 0 explicitly). The PEs start at t = x, and this can be implemented as a control signal propagating from one PE to the next. The value produced by any PE is either f(t ? 1; x ? 1), i.e., the value produced by the preceding PE in the previous cycle, or p x + f(t ? w x ; x). Thus, we have a linear array, where each PE has the simple architecture shown in Fig. 3 (we have omitted the additional control for implementing the second line of Eqn. (8), for simplicity). However, notice that at any time instant, the PE needs a value that was calculated at t ? w x . Hence any result must be stored for exactly w x time steps, and this implies a memory of size w x in each processor, accessed cyclically, modulo w x (alternatively, it may be viewed as a shift register of length w x ). Since the w's are problem parameters, and could be arbitrarily large, this architecture does not represent a systolic array! 3 The Systolic Knapsack Array and its Synthesis
We rst present an intuitive explanation of our array and its derivation, then give a formal proof of correctness, and nally address the necessary control.
Informal Explanation
In order to obtain a systolic array, we pose the question from the opposite perspective: how can we implement Eqn. (6) on an array whose PEs have only constant memory (say each PE has only words of memory)? An obvious answer is to continue with a space-time transformation similar to the one above, but to allocate each column to a block of contiguous PEs rather than a single PE. In general, the k-th column will be allocated to d w k e PEs, so that, between them, they have su cient memory. In other words, Eqn. (8) corresponds to a virtual processor array, which is emulated in a blocked fashion by a linear array with only memory per PE. Although this idea seems straightforward, there are a number of complications that arise.
In order to ensure that the dynamic dependency is properly handled, we would still like to allocate the point, j; k] to the same PE as the point j ? w k ; k]. One way to do this is to divide the column into blocks of size w k , allocate the rst points of each block to the rst PE, the next points to the second one, and so on. Subsequent blocks are allocated similarly. This is illustrated in Fig. 4 . Please do the exercise given there, and it will be clear that in general, the 0; 1] dependency must cross a number of PEs, and that di erent instances (of the The picture on the left shows how two adjacent columns are mapped to PEs with = 4, for a particular example (w 1 = 8 and w 2 = 12). The rst two PEs, (marked with open and solid circles), are responsible for the correspondingly marked nodes of the rst column, and the other three (marked with squares) compute the second column. Three instances of the 0; 1] dependency and their images in the processor space are also shown. They highlight the fact that values produced by PE 1 may be required by PEs 3 and 5, and that data required by PE 3 may come from di erent PEs (1 and 2). This is actually the simple case (the w k 's are multiples of ). The picture on the right, is an exercise. Determine all the inter-PE communications possible and draw the corresponding dependencies. cycles). Note that the start of the rst PE in each block is delayed (with respect to the rst PE in the preceding block) in order to allow its inputs to arrive. It is easy to verify (for the example) that the timing function does not con ict with the allocation function. Now consider the arrows. They are the images in the space-time domain, of the individual instances of the 0; 1] dependency, and we shall call them propagation paths. We see right away that there are a number of problems. The arrows are not parallel, and this means that the propagated values will not ow at constant rates. Some paths must cross 2 PEs in 2 time steps, some must cross 4 PEs in 2 steps, and some must go to the next PE in 2 steps (does the gure show all possibilities?) This also implies that we will not have nearest-neighbor interconnections. Another problem is that there is at least one instance where two propagation 13 paths are superimposed: the value produced by PE 1 at t = 4 will pass through PE 2 at t = 5.
But at this time, PE 2 is active and will be computing its value, and will try to write a di erent value on its output line, and this leads to a con ict.
Please study Fig It is this clear that the timing function must be chosen carefully. Before reading further, please try to devise a valid space-time mapping. It may be helpful to draw gures similar to the ones presented here to ensure that all the problems raised above are addressed.
Resolving the propagation con ict Let us rst try to address the problem of con ict, without worrying about whether the propagation paths are parallel or not. One of the reasons for the con ict can be seen as follows: consider the last value computed by a PE in consecutive steps (for example the value computed by PE 1 at t = 4). This value needs to pass through the next PE at the next time instant, but at this instant, this PE has just started its computation (PE 2 starts its block at t = 5, the same time that it is supposed to be transmitting the value computed by PE 1 at t = 4).
To resolve the con ict, why not delay the start of PE 2 by one cycle, to allow the propagated value to pass through? In general, we delay the start of each PE by one cycle (relative to its predecessor). This is illustrated in Fig. 6 -a (once again for the simple case). First ignore the arrows and compare with Fig. 5 -a. Note how columns 2, 4 and 5 are shifted down. Surprisingly, this simple change also makes the new propagation paths parallel to each other. As before, we leave it as an exercise to work out the details of the complicated case (this is the simplest exercise).
There still remains one nal problem. Note that although the propagation paths are parallel, their lengths are di erent. So we still need a proper control mechanism that will inform each PE when to compute a new result, and when to propagate. Before we address this problem, let us formally prove our results so far. Doing so will also lead to a solution for the control problem. 
Proof of Correctness
We rst prove that the space-time mapping is con ict free in the conventional sense. Formally, our (non-linear) allocation function is as follows.
Since dw i = e PEs are needed to execute the i-th column, the last term represents all the PEs in the rst k ?1 blocks (the summation ends at k ?1); the rst term gives the precise PE in the k-th block. Note that the presence of the (j mod w k ) implies that any point j; k] is mapped to the same PE as j ? w k ; k], and the dynamic dependency is properly localized. The schedule is speci ed as follows. 
Note that, by de nition, (j; k) is a \distance" function, since it is the spatial distance that the value f(j; k), computed in PE a(j; k) must travel to reach the PE a(j; k + 1), where it will be consumed. In other words, the dependency j; k] ! j; k + 1] is mapped to PEs that are separated in space by (j; k). Observe another important property of this function, that we can verify by simple substitution.
This has a very fortunate practical implication: after space-time mapping, the j; k] ! j; k + 1] dependency has the same spatial and temporal components. Thus, a value produced by any PE x at time t needs to be sent to PE, x + (j; k) at time t + (j; k). This is why all the propagation vectors in Fig. 6 had a 45 degree slope. Our propagation strategy thus corresponds to sending the data by repeatedly (exactly (j; k) times) passing it from one PE to the next over a link of unit delay.
In order to formally prove that this strategy works correctly we must show that when a data value is being propagated in this manner, the PEs that it passes through are neither performing some other computation, nor propagating any other data value (otherwise there will be a con ict).
Theorem 1 The tagging strategy is con ict free.
Proof: We need to show that at any instant a PE is propagating a value, it is not expected to be (a) performing some other computation; nor (b) propagating any other value. To do this, we show that that for two distinct points, j; k] and j 0 ; k 0 ], and for any x : 0 < x < (j; k) and y : 0 y < (j 0 ; k 0 ) " a(j; k) + x t(j; k) + x 
leads to a contradiction. The details are in the Appendix.
Control
Let us now return to the problem of control. Since (j; k) is the distance that f(j; k) needs to travel, a naive control mechanism is as follows. For each f(j; k) value that is computed, we attach a tag which denotes the distance that it needs to go. Each PE rst tests to see whether the tag on the incoming data is 1. If so, the PE \knows" that the value is intended for itself, so it performs a new computation (of f(j; k) and its new tag) and sends these values on its output lines. Otherwise, the PE simply transmits the input values unchanged, but decrements the tag by 1. The functions (j; k) thus serves as a dynamically computed routing tag. Note that since (j; k) is a function of the index point and w k and w k+1 , we can view Eqn. (11) as yet another equation, just like f(j; k), and de ned over the same domain, D. It too, is mapped using the same space-time transformation, and gives us an array where each PE also has a special unit to calculate (j; k), in addition to f(j; k) and u(j; k). The result is used to produce the tag and is a major component of the PE controller shown in Fig. 7 . Each PE receives a tag.in and produces a tag.out. It has a special comparator that tests whether tag.in = 1, and its result determines whether tag.in is propagated (decremented by 1) or whether the output of the :local unit is sent as tag.out. The same signal also controls the multiplexors of Fig. 3 , and also (write enable) whether the output register is saved into memory or not. 
Pragmatic Considerations
We shall now address a few implementation related problems. In particular, we discuss how standard techniques can be used to obtain a xed size array and derive the performance of this implementation. We will also address two other problems: an e cient control mechanism and coe cient loading and setup. The number of PEs needed to achieve this is
These are the two standard metrics of any systolic (indeed, any parallel) implementation, and as expected, they depend on some of the problem parameters. Note that here, they both depend on the w k 's and hence will be di erent from one problem instance to another. In practice, it is important to have a xed number of PEs, since we envisage the array as a specialized coprocessor board. For this, we use the well-known locally parallel, globally sequential (LPGS) partitioning scheme MF86]. The approach is as follows: assume that only q PEs are available, rather than P m . Then we \time multiplex" the array so that it solves any instance of the knapsack problem in multiple passes. Of course, the results that are produced by the last PE must be bu ered and fed back into the rst PE, and this necessitates a ring topology, and some intervention by the host for proper synchronization. In other words, the domain D is partitioned into subdomains, which are individually executed in a pipelined fashion as explained in the previous section (locally parallel). Viewing this as a single \step", the steps are executed serially on the array (globally sequential).
To determine the running time of the algorithm using this strategy we make the following observations which can be readily veri ed. First, note from Eqns (9) and (10), that t(j; k) = a(j; k)+j. Second, the very rst \computation" performed by any PE x is either the calculation of f(0; k) (if x is the rst PE in its block), or the propagation of this value. This occurs at time t = x. Finally, the very last \computation" (which may either be a real computation or a propagation) performed by any PE occurs c steps later. We say that the PE is busy from t = x to t = x + c. When the array is partitioned using the LPGS strategy on q PEs, these observations are still valid for each pass: if the r-th pass starts at t r , PE x is busy for this pass from t r + x to t r + x+c. The rst result of this pass is available from the last PE at t r + q. Recall that c grows exponentially as compared to m (the knapsack problem is NP-hard), and it is thus reasonable to assume that c > q. Thus, the start of any two successive passes di ers by exactly c clock cycles. Note that this requires bu ering the results produced by the last PE (for exactly c ? q cycles), and we assume that the host (or possibly a specialized I/O co-processor, or a smart DMA controller) is responsible for this. Since the entire algorithm needs dP m =qe passes, the last pass can only start at t l = c(dP m =qe ? 1), its rst result is produced at t l + q, and the nal result of the array is produced at t l + q + c. Hence 
Note that, in addition to c, m, q and , the performance of this implementation depends on w k 's, and will thus be di erent for di erent instances of the problem (even if they may have the same \size" parameters, m and c). In spite of this, it is important to note that the array will correctly implement any instance of the unbounded knapsack problem, even when the di erent PEs that constitute a block (for one column of D) are partitioned across the ends of the ring, or if any individual w k is so large as to require multiple passes itself, etc. Indeed, there is no need to even prove this formally, since the original array with P m PEs has been proven correct by construction, and LPGS partitioning is a proven technique, applicable to any systolic array.
E cient Control
We have seen that our PE has a \black box" within its control unit (see Fig. 7 ) which is responsible for computing (t; x) at all times. If this were implemented naively as speci ed by Eqn. (11) it would require two divisions by , two incrementers, one modulo w k and one modulo w k+1 (even if we assume that dw k = e can be computed o line and latched into a local register when each new problem instance is loaded). This is clearly unrealistic: this heavy machinery is to be used to control a very simple data path, essentially an adder-comparator. Note that our development here (and also in the following subsection) will be more informal, since it involves circuit level optimizations. At present, we are not aware of a formal transformation scheme to obtain the nal hardware systematically.
In order to obtain a more practical circuit, let us recall the main purpose of the controller, namely to indicate to the PE when to propagate and when to compute a new result. The formal development and the distance function was necessary in order to prove that the propagation strategy would not have any collisions. For e cient implementation, we would like to achieve the same objective without calculating (t; x). This can be done if we can study the pattern of activity of the PEs and use it to deduce this information statically. Once again, please return to Fig. 6 and try to determine these patterns before reading any further.
First of all, the activity of each PE in the k-th block is periodic with a period of exactly w k . Let us rst consider the simple case when w k is a multiple of . Then each PE is active for exactly consecutive steps in each period of w k . As a result, the control can be easily implemented with a stoppable counter that counts modulo . When it cycles over, the counter stops until it is restarted by another signal. In other words, one can think of a \token" (indicating that the PE is active) being passed among the PEs. Each PE keeps the token for exactly cycles, and then passes it on to its neighbor. The rst PE in a block is special. It ignores any tokens coming from its left neighbors, but simply generates a new token every w k cycles, using a modulo w k counter. A subtle point to note: recall that in our schedule, we need the nal result produced by any PE to be propagated through its neighbor before the neighbor itself can start. This is easily achieved in our token passing control mechanism: after being active for cycles, each PE simply delays the passing of the token by one cycle! Because of this, it is possible that more than one PE is simultaneously active (the word token is possibly a misnomer in our explanation, since it does not imply mutual exclusion). Also note that since any PE could be required to be the rst PE in its block for some problem instance, each PE must have a counter that counts modulo w k . In fact, this counter modulo w k is essential for a couple of other functions, as we shall see. Now consider that case when w k is not a multiple of . The above arguments are still valid for all PEs except the last one in the block, which is active for fewer than cycles. However, since we know that the PE activity is periodic with a period of w k , we simply use a counter modulo w k to \turn o " the last PE in the block. Hence the operating rule of the PEs now becomes as follows: start computing when you are given a token (or generate one, if you are the rst PE in the block); stop computing when either the counter modulo or the counter modulo w k cycles over. Indeed, the PE doesn't even have to \know" whether it is the last one in its block. For this scheme to work, the modulo w k counters must cycle in a skewed fashion (i.e., the one in PE, x must be reset at t = x). Finally, recall from the second line of Eqn. (8), that when t ? x < w k , i.e., for the w k cycles after a PE becomes active for the very rst time, each PE must simply propagate its input, and this is also easily controlled by our modulo-w k counter.
In the above discussion, we have talked about the fact that a PE must know whether it is the rst PE in a block or not. This is perfectly reasonable, since for any given problem instance, this information can be determined o line, and loaded with the problem coe cients. Thus, we conclude that the control of the PE can be reduced to two counters: one xed (that counts modulo-) and the other programmable (counting modulo-w k ) and a few latches.
Coe cient loading
For correct operation, each PE must know w k , p k , k.local and FIRST k (a bit that indicates whether it is the rst PE in its block). We now discuss how these values can be e ciently loaded into the array. Since the host knows the w i 's for the particular problem instance, there is a simple preprocessing step to determine these values (how many successive PEs have the same coe cients, etc.) and the appropriate sequence of coe cients can be generated o -line.
A standard systolic loading method is to simply pipeline these values into the array from the left (the coe cients of the last PE are entered rst), and to issue a broadcast at the q-th step (there are q PEs in the array). However, for good throughput one would like to do this while another problem (or a previous pass) is being solved in the array. So, until the broadcast is issued (i.e., until it is time to commence the next problem/pass), the PE must save the new values, but continue to use the older values. This necessitates a set of \shadow registers" and a Consider another alternative. We load the coe cients in the reverse order (the coe cients of the rst PE are the rst in the sequence). We have a control signal that is propagated with twice the delay as the coe cients (one extra latch in the control path), which is aligned with the rst data value. Hence, this control bit meets the i-th data value at the i-th PE, and can be interpreted as a \LOAD" command. The only problem is that this happens at t = 2i. The solution is simple. We run this loading circuitry at twice the clock rate as the main circuitry. This is feasible in terms of speed, because the adder and the comparator in the main data path determine the critical delay. Furthermore, there is no need for shadow registers, and the LOAD signal can also be used to start the computation of the next pass. Fig 8 shows the control for the nal design.
5 A VLSI Model and Optimal Memory Size
In our development so far, we have not speci ed the value of the PE memory, , other than saying that it is constant, independent of the problem parameters. Of the di erent parameters that govern the running time of the array, we note that q and are design parameters, while the others depend on the problem instance. We now address the problem of choosing these two optimally. In particular, we seek to determine \good" values of and q, i.e., those that minimize the running time of the array. However, the two parameters are related. Suppose that we have only some xed real estate on silicon. This may be used either to implement a large number of PEs with small memory, or just a few PEs with larger memory. This leads to a classic tradeo problem. Fortunately for us, our performance metric is easily quanti able, and we are able to set up a (non-linear) discrete optimization problem, and solve it analytically.
Note that the dominant term in Eqn. (15) It is bounded from below by mc q , the theoretical optimum for executing the knapsack problem on q processors. Even with this assumption, we see that there are too many parameters (the w k 's), and they could vary arbitrarily, even for some xed c and m. We therefore consider a range of problem instances, whose w k 's are uniformly distributed in an interval, w min ; w max ], and we seek to minimize the expected value of F(q; ) for all problem instances in this range. It is thus, the average case analysis of the performance of our array.
Lemma 2 If the coe cients w i are uniformly distributed in the interval w min ; w max ], the expected value of F(q; ) is given by E F(q; )] = cm 2q w max + w min ? 1 + 1
Proof: See Appendix
The above result allows us to simplify the cost function for our optimization problem, which can now be formally stated as follows.
For a given resource, R (silicon area), which may be used either as memory or as PEs, what values of and q minimize the objective function, E F(q; )]?
As mentioned above, q and are related, since the number of PEs that we can implement with a given real estate depends on how much memory they have. Let a 1 be the area of the CPU, control and I/O (i.e., the xed cost of a PE), and let a 2 be the area of the memory per word. The cost of our PE is thus a 1 + a 2 . Then the resource restriction is expressed as the non-linear constraint q(a 1 + a 2 ) R which is equivalent to q R ? qa; where R = R=a 2 ; and a = a 1 =a 2
If we denote w = w max + w min ?1 our problem is reduced to nding the values of and q which minimize E F(q; )] and satisfy the resource constraint, i.e. In general, solving such a non-linear integer optimization problem is NP-hard. However, as we will see, this particular problem has some nice properties which allows us to show that its solution is close to the solution of the relaxed problem (i.e., the problem when q and do not have to be integers, also called the continuous version). This latter solution is easily obtained analytically, and hence all we need to do is to look at the nearest integral points to it.
It is clear that the solution of the continuous version of Pb. 1 always satis es the equality q = R ? qa. Hence, consider the following problem. Proof: By substituting and q in Eqn (17) and simplifying.
We can see from the above result that does not depend on R, and this has an important practical implication. The design remains optimal, even as more real estate is added. In other words, once an optimal PE is designed, it may be mass produced, and the possibility of adding additional PEs does not sacri ce optimality. Also remember that what we have minimized is the expected value of the running time for a range of problems (and this depends on w min and w max ).
Nevertheless, the array is capable of solving arbitrary instances of the knapsack problem, and if we occasionally feed it problems whose weights are outside these bounds, all that will happen is a slightly degraded performance for that particular problem instance. Also note that the xed cost of the PE gures prominently in Eqn. (21) and hence, all the pragmatic improvements that we have developed in Sec. 4 do not simply reduce the area of the PE. They have the added bene t of also reducing the expected running time of the array!
Performance estimation
We now apply these results to our VLSI design to estimate the improvement that we can obtain. We will compare the expected running time of our optimized architecture to an array that does not choose the memory size optimally. To be fair, we allow the nonoptimized array to use all the hardware improvements presented so far, namely the improved control mechanism and coe cient loading method. Thus we measure only the improvement due to our optimization result. We assume 1 CMOS technology, and that 2 2 10 static registers (16-bit) can be laid out on a 1cm 2 VLSI chip (the actual value is not important since it cancels out later). We also assume that w max is 1000 (a typical value used in standard texts MT90]) and w min = 1.
In our design, the main data path for the computation of f is fairly standard, and takes up an area comparable to 13 registers. The control part takes up about 12 registers. Hence, the entire PE can be laid out in 25 registers. We choose to use a DRAM because we know that every memory location (of interest) will be accessed periodically, with a period of exactly w k , and hence refresh circuitry is not needed (in fact, the minimum speed required to avoid the refresh is only 0:25 MHz best performance is to be squeezed out), and discovered that the true optimum was at q = 16, for which = 206. Substituting these into Eqn (17) the expected running time of the optimal array is 0:18275mc. On the other hand, the unoptimized array requires 1000 words of memory which take up an area comparable to 500 static registers. This array does not need to use any tagging scheme, so its CPU + control + IO is only 22 registers, enabling 4 PEs to be squeezed on a chip. However, the naive design always allocates one PE per column, and the dominant term in the running time is mc q = 0:25mc (obtained from Eqn (15) by letting = w max ); we then have we have the following.
Result 1 The memory optimization reduces expected running time by 28% without any area penalty.
Observe that the choice of w max = 1000 is an arbitrary one, favorable to the naive design. Indeed if w max > 4000, the naive design cannot even t on a single chip. On the other hand, our modular design can easily adapt to larger values of weights, at the expense of increased running time. This illustrates the importance of designing modular and extensible (i.e., purely systolic) architectures.
6 Non-Linear Mappings to Systolic Arrays
The technique that we used to derive the knapsack array can be easily generalized. In this section, we present a method for systematically deriving systolic arrays from arbitrary recurrences (with or without dynamic dependencies) using non-linear space-time transformations. Recall that the success of the traditional systolic synthesis methods relies very intricately on the coherent interaction of uniform dependencies, and linear transformations. It is because of this that we can simply interpret the image of the original dependency vectors as the description of the interconnections of the systolic array. Unfortunately, this is no longer true when we use non-linear space-time transformations, as we have already seen.
Consider the general recurrence equation (1), de ned over a domain D Z n with an arbitrary dependency, d(z), which may even be dynamic (repeated below for convenience). Consider condition (2) for the case when a = a 0 = 0. This simply states that the space-time transformation is con ict free: no two distinct points are mapped to the same processor at the same time.
Now consider condition (2) when a 0 = 0, and a > 0. The point, T(z) + a represents a space-time index point through which the value X(z), which is computed at T(z), will pass in its a-th step if propagated along . The condition states that this space-time point
is not the image of any other point z 0 2 D in the domain of X.
Finally, by a similar reasoning, when a; a 0 > 0, the condition ensures that if all the values produced in the space-time domain are propagated along the direction , no distinct values will be propagated through the same space-time point.
We call (1) the feasibility condition for systolic propagation and (2) the controllability condition. When both conditions are satis ed we say that the transformation T(z) is a realizable space-time mapping. The function, H(z) is called the control function.
The systematic generation of a systolic array following this propagation strategy is also straightforward. As we have done for the knapsack problem, we introduce, in our system of recurrences, a new equation that speci es the computation of H(z) for all z 2 D. It is transformed using the same space-time mapping, T(z), and its value is used as a dynamic routing tag. The data values are propagated using the space-time vector . Whenever a processor sees that its input tag is 1, it computes a new value (for the data as well as the tag) and propagates it, otherwise it simply propagates the data value unchanged with the tag decremented by 1.
Note that the nature of the control function determines the complexity of the nal implementation, which may well be easily amenable for improvement. A systematic method for doing this is at present open. It requires extending the control signal generation methods for systolic arrays Raj89] to the nonlinear cases. The main idea is as follows. Observe that we need to compute a function H(z) at all points in an index domain which may not necessarily be speci ed as a recurrence relation. It may be easier to implement this function if we are able to decompose it into a recurrence. This is similar to strength reduction techniques used in parallelizing compilers. Furthermore, note that a simple variation is also possible. We may get a systolic array even though the conditions of Theorem 3 are not strictly satis ed. If the cardinality of the number of con icts is bounded by a constant, we can derive an array with nitely many propagation channels.
Related Work and Conclusions
Since the knapsack problem is an important combinatorial optimization problem, a number of researchers have developed parallel algorithms for it. We note that a software version of our algorithm has very good performance, as reported elsewhere ARQ93] (see references therein for a detailed survey of software implementations).
A parallel dynamic programming implementation for the 0/1 knapsack problem, was presented by Lin a su ciently large RAM). It has a time complexity (mc=q), which is asymptotically optimal, and is similar to the naive array of Sec. 2.3 (the di erence is that because they address the 0/1 problem, they need to retain the entire table of f(j; k) values, and hence their processors require c words of memory). As noted by Teng Ten90], the performance of parallel algorithms depends on c, m and also w max (unlike the sequential case when only m and c are relevant). The number of processors required by many parallel algorithms is exponential in the size of the input and these algorithms have a very low processor e ciency. For example the best time complexity algorithm for Eqn. (4) proposed by Teng Ten90] requires M(c) processors to solve the problem in O(log 2 (mc)) time. The function M(n) above denotes the number of processors needed for multiplying two n by n matrices in O(log n) parallel time (which is known to be between n 2 and n 3 ). Hence, at least O(c 2 ) processors are needed, and 1 c is a factor in the e ciency, which approaches zero as c increases.
Note that the systolic model is weaker than other models such as the PRAM in two respects: xed size PE and restricted connectivity. If we relax the rst constraint (i.e., let be as large as needed), the running time becomes (mc=q), which is optimal: the sequential running time is reduced by a factor of (q), which is the best one can expect from any parallel implementation, as illustrated in Sec. 2.3, and by the results of Chen et al. CCJ90] . This model, although not systolic, is still weaker than a PRAM. We conjecture that the added slowdown factor of our systolic array, (w max ), is a lower bound because of the constraint that each PE has only nite memory. This would be an interesting open problem to resolve.
Independent of the knapsack problem, many researchers have proposed systolic arrays for dynamic programming. This includes arrays for speci c instances, such as optimal string parenthesization GKT79], path planning BK88], etc., some general arrays RV84, Myo91] as well as arrays that can solve any instance of dynamic programming, LW85, LL86]. Usually, the arrays for speci c instances have better performance than the general purpose arrays because We use the word processors when we discuss software implementations, and PEs when referring to hardware implementations.
they can exploit properties of the particular problem at hand. This is the case for the knapsack problem too|for example, if we use Li and Wah's array LW85] for our problem, we will need c processors and take mc time steps, which is clearly unacceptable (since we can achieve the same time on a uniprocessor). Even the delta transformation proposed by Lipton and Lopresti LL86] yields only logarithmic improvement in time and the number of PEs.
Andonov has investigated the problem of implementing the knapsack recurrence on systolic arrays for a number of years. The rst solution was proposed by Andonov and Benaini AB90].
The array is extensible with respect to m, c and w max , but has a running time of (mc 2 =q ) on a ring of q PEs, which is unacceptable (the sequential implementation takes mc steps). The running time was brought down to (mcw max =q) by Andonov and Quinton AQ92]. It was designed in an ad hoc manner without a formal proof of correctness, and has an expensive control strategy.
In this paper, we have presented a number of results: rst, we formally derived the AndonovQuinton array for the knapsack problem. The derivation illustrated the steps to follow in deriving systolic arrays in the presence of dynamic dependencies. We used this derivation as a case study and presented su cient conditions for the general case, thus advancing the state of the art of systolic synthesis methods. Obviously, many open problems remain. We have only given su cient conditions. Our results are not constructive, and it needs a lot of designer ingenuity to come up with realizable space-time transformations. Automating this is the subject of our ongoing investigation.
Another result of this paper was that we formulated the problem of choosing the memory size as a non-linear optimization problem, which we solved analytically. We reiterate that the design will have optimal expected running time for a range of problem instances whose weights are distributed randomly in the interval, w min ; w max ], but will nevertheless correctly solve any instance of the problem. It is also important to note that the optimal memory size does not change with the available resource constraint (silicon real estate), and hence the PE design remains optimal when more PEs are cascaded to make a bigger array. The technique of formulating such optimization problems and solving them analytically can be used for many related problems: determining the optimal size for iteration space tiling, a code optimization used by parallelizing compilers, selecting the digit size when deriving digit-based arithmetic circuits from bit-serial dependency graphs, and other related problems.
Finally, we have developed circuit level optimizations to make the nal implementation practical for VLSI. These pragmatic improvements allowed us to reduce substantially the control to a couple of counters, and also had the additional bene t of reducing the time complexity of the array. Our development of these practical improvements was not formal, and it would be interesting to derive this formally by rst localizing the function (j; k), and then determining an appropriate space-time mapping. We note here that there is still a (minor) problem with the array presented here. The presence of the modulo w max counter means that the array is not purely systolic: one cannot arbitrarily increase all the problem parameters (speci cally, w max ) and not have to change the PE design. In practice this is not a serious limitation, since it is just the size of a single register, which can be made large enough (just as one makes arithmetic units large enough to ignore over ow, etc.) Nevertheless, we have recently improved on this to develop a purely systolic array, using the idea of of programmable shift registers AQRW95]. To boot, this array has almost no control, and thus we obtain further improvement in the expected running time.
We note that there is still scope for improvement. Di erent memory organizations may be tried, di erent clocking schemes may be explored, etc. Finally, we could come back full circle, and try to improve the algorithm itself. Indeed, we have recently done precisely that in AR94], where we use a sparse algorithm, which has better average case running time than the classical dynamic programming, and give an asynchronous VLSI array processor for it. This yields an algorithm that has better average case behavior. However, there is much that needs to be done if that solution is to be implemented in VLSI. It may turn out that the pragmatic complexities may reduce the e ectiveness of the improvements. Furthermore, the synthesis methodology used there is completely di erent and requires much more insight from the designer. A detailed discussion is beyond the scope of this paper.
In conclusion, we have presented a complete derivation of an optimal VLSI design for the unbounded knapsack problem. Since it belongs to the class of NP-hard problems its fast parallel implementation is interesting in its own right. Because the algorithm for the dynamic programming solution of this problem has run-time dependencies, it is also interesting from the view of design methodology. We have followed the design trajectory all the way from the level of algorithm through architecture and nally to circuit. At each time we have used appropriate theoretical tools for analysis of the performance. Our results show how it is important to combine many diverse areas|optimal parallel algorithms, VLSI design, discrete optimization and dependence analysis to achieve high performance application speci c array processors. In essence, we are really designing \algo-tech-cuits", which span the continuum from algorithms through architectures and all the way to circuits.
