The problem of outputting all parse trees of a string accepted by a context-free grammar is considered. A systolic algorithm is presented that operates in $O(m \cdot n)$ time, where $m$ is the number of distinct parse trees and $n$ is the length of the input. The systolic array uses $n^2$ processors, each of which requires at most $O(\log n)$ bits of storage. This is much more space-efficient than a previously reported systolic algorithm for the same problem, which required $O(n \log n)$ space per processor. The algorithm also extends previous algorithms that only output a single parse tree of the input.
INTRODUCTION
General context-free language (CFL) recognition is an important problem with a wide range of applications: formal language theory, pattern recognition, natural language processing, and compiler design, to name a few. To date, the Cocke-Kasami-Younger (CKY) algorithm (1) and Earley's algorithm (2) remain the best known practical methods for solving this problem, both having a worst-case time complexity of $O(n^3)$ for inputs of length $n$.
(Valiant (3) presented an asymptotically faster algorithm; however, the constant of proportionality is too large for practical applications.) Kosaraju (4) first considered the problem of parallel CFL recognition and presented a parallel implementation of the CKY algorithm on a two-dimensional iterative array of $n^2$ processors. The array operates in linear time and uses only finite-state processors (i.e., each processor stores information whose size is independent of the length of the input). Another algorithm, using a systolic array, is also implied by the work of Guibas, Kung and Thompson, (5) who gave a parallel implementation of the dynamic programming algorithm (similar to the CKY algorithm) for computing the cost of an optimum binary search tree. Both algorithms are optimal; the speed-up is linear in the number of processors used. A parallel algorithm which has a faster running time (in fact, $O(\log^2 n)$) has been presented by Rytter; (6) however, the algorithm is implemented on a parallel random-access machine (PRAM), a hypothetical model that ignores communication costs, and uses more processors ($n^6$). Chiang and Fu (7) considered the more general problem of CFL parsing, which, unlike recognition, also requires a parse tree as output. They gave a parallel implementation of Earley's algorithm on a systolic array of $n^2$ processors. Besides recognizing the input, the array also outputs a parse tree in linear time. However, the processors are no longer finite-state since each is required to store $O(\log n)$ bits of information. Later, Chang, Ibarra and Palis (8) developed a more space-efficient systolic algorithm for the same problem: their algorithm also uses $n^2$ processors and runs in linear time, but requires only constant space per processor.
An interesting extension to the CFL parsing problem is that of outputting all parse trees of the input string. In some applications such as natural language parsing, the underlying grammar is usually ambiguous. Thus, one would be interested in generating all parse trees of the string, which can later be disambiguated by applying some semantic rules. Langlois (9) considered the all-parses problem and gave a systolic algorithm based on the systolic architecture of Ref. 5. The systolic array uses $O(n^2)$ processors. However, each processor is required to store $O(n \log n)$ bits of information, resulting in a total space complexity of $O(n^3 \log n)$. This paper presents a systolic algorithm for the all-parses problem which is more space efficient than that of Langlois. (9) Specifically, for an input string of length $n$ with $m$ distinct parse trees, the systolic algorithm outputs all $m$ parses in time $O(m \cdot n)$ using $n^2$ processors, each of which requires only $O(\log n)$ bits of storage. Thus, the total space complexity is $O(n^2 \log n)$. Our systolic algorithm is based on a modified CKY algorithm which consists of two phases: (1) a recognition phase that determines whether the input string is valid, and (2) a parsing phase that outputs a parse tree of the string. Multiple parse trees are generated by repeated "re-runs" of the parsing phase.
A historical note: This paper was first submitted to IJPP in May 1989 for review and was also presented at the First Workshop on Algorithms and Data Structures, Ottawa, Ontario, in August 1989. (10) Subsequent to our submission to IJPP, we were informed that Langlois improved his own all-parses algorithm (described in Ref. 9). His new algorithm (11) has a total space complexity of $O(n^2)$, and this is more space efficient than our algorithm. However, our algorithm is more time efficient: in order to generate the next parse tree, Langlois' algorithm has to re-run the entire algorithm, viz., both the recognition and parsing phases. In contrast, our algorithm executes the recognition phase only once. The reader is encouraged to look at both papers and study how the two different approaches can be combined to produce a new systolic algorithm which is both more time and space efficient than either algorithm.
The paper is organized as follows. In Section 2, we first describe a sequential parsing algorithm on which the systolic algorithm is based. In Section 3, we introduce the systolic array model and give an informal description of the all-parses systolic algorithm. The following two sections describe the algorithm in finer detail: Section 4 describes the recognition phase and Section 5 describes the parsing phase. Finally, Section 6 gives an analysis of the time and space complexity of the entire systolic algorithm.
A SEQUENTIAL PARSING ALGORITHM
We first describe the sequential parsing algorithm on which the systolic algorithm is based. We assume familiarity with context-free grammars (CFG's); see, e.g., Ref. 12. Let $G = (V_N, V_T, P, S)$ be a CFG where $V_N$ and $V_T$ are finite sets of nonterminal and terminal symbols, respectively, $S \in V_N$ is the start symbol, and $P$ is a finite set of productions in Chomsky normal form. That is, every production in $P$ is either of the form $A \to BC$ or $A \to a$, where $A, B, C \in V_N$ and $a \in V_T$. The language generated by $G$ is $L(G) = \{ w \in V_T^* \mid S \Rightarrow^* w \}$. Given [...] An example of a CFG $G$ and the corresponding matrix of $R(i, j)$'s for the string $w = abaa$ is illustrated in Fig. 1. Henceforth, the matrix $R = \{ R(i, j) \mid 1 \le i \le j \le n \}$ shall be referred to as the recognition matrix. For the given example, we see that $abaa \in L(G)$ since $R(1, 4)$ contains a production whose LHS is $S$.
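As a concrete illustration of the recognition matrix, the following Python sketch builds $R$ for a string under a small CNF grammar. The grammar ($S \to SS \mid a$), the tuple encoding of productions, and the names `convolve`, `recognition_matrix` and `accepts` are assumptions made for this example; this is not the grammar of Fig. 1. Each entry is assumed to hold the productions whose left-hand side derives $a_i \cdots a_j$, in line with the acceptance test on $R(1, n)$ above, and the input is assumed to be non-empty.

```python
# Hypothetical CNF grammar (not the grammar of Fig. 1): S -> SS | a
BINARY = [("S", "S", "S")]   # (A, B, C) encodes A -> BC
TERMINAL = [("S", "a")]      # (A, a) encodes A -> a


def lhs_set(entry):
    """Left-hand sides of all productions stored in a matrix entry."""
    return {production[0] for production in entry}


def convolve(left, right):
    """Productions A -> BC with B an LHS in `left` and C an LHS in `right`."""
    bs, cs = lhs_set(left), lhs_set(right)
    return {(a, b, c) for (a, b, c) in BINARY if b in bs and c in cs}


def recognition_matrix(w):
    """Return a dict mapping (i, j), 1 <= i <= j <= n, to the entry R(i, j)."""
    n = len(w)
    R = {}
    for i in range(1, n + 1):
        R[i, i] = {(a, t) for (a, t) in TERMINAL if t == w[i - 1]}
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            entry = set()
            for k in range(i, j):  # every split a_i..a_k | a_{k+1}..a_j
                entry |= convolve(R[i, k], R[k + 1, j])
            R[i, j] = entry
    return R


def accepts(w, start="S"):
    """w is accepted iff R(1, n) holds a production whose LHS is the start symbol."""
    return start in lhs_set(recognition_matrix(w)[1, len(w)])
```

The triple loop gives the familiar $O(n^3)$ sequential bound; the systolic array distributes the innermost union over the processors of a row.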
If $w \in L(G)$ then $w$ has one or more parse trees, where a parse tree is a binary tree of productions used in the derivation $S \Rightarrow^* w$. For the example in Fig. 1, the string $abaa$ has five distinct parse trees, as shown in Fig. 2. For each production, the pair of numbers $(i, j)$ denotes the matrix entry $R(i, j)$ to which the production belongs.

Fig. 1. A CFG G and the recognition matrix R for w = abaa.

We now describe a procedure PARSE for generating all parse trees of the input string. PARSE is a recursive procedure that takes four arguments (A, i, j, tag), where $A \in V_N$, $1 \le i \le j \le n$ and tag $\in$ {FIRST, CURRENT, NEXT}. Informally, PARSE(A, i, j, tag) returns a parse tree for the derivation $A \Rightarrow^* a_i \cdots a_j$. The parse tree is represented as follows: if a production $\pi$ in the parse tree belongs to $R(i, j)$, then the occurrence of $\pi$ in $R(i, j)$ is "marked" by some special symbol, say *. (There is no ambiguity here since all productions in a parse tree belong to distinct $R(i, j)$'s.) For example, the first parse tree in Fig. 2 would be represented as shown in Fig. 3. Note that the actual tree can be retrieved since for every marked production, its left (right) child in the actual tree is simply the next marked production above it along the same column (diagonal).
The argument tag states which parse tree is returned. If tag = FIRST, then PARSE(A, i, j, tag) returns an initial parse tree for $A \Rightarrow^* a_i \cdots a_j$. If tag = NEXT, then it returns the next (distinct) parse tree following the one last generated. Finally, if tag = CURRENT, then it returns the current parse tree.
To keep track of the order of parse tree generation, the procedure makes use of a number of auxiliary variables. For each $(i, j)$, $1 \le i \le j \le n$, there are Boolean variables done(i, j) and last_id(i, j), and an integer variable id(i, j). The variables are utilized as follows: Let $t$ be the tree that results after a call to PARSE(A, i, j, tag). Then, (1) done(i, j) = true iff $t$ is the last parse tree for $A \Rightarrow^* a_i \cdots a_j$. (2) id(i, j) = k, $i \le k < j$, iff the root of $t$ has a left subtree whose root is a production in $R(i, k)$ and a right subtree whose root is a production in $R(k + 1, j)$. (id stands for "index of decomposition.")
Procedure PARSE is given next. In the procedure, each R(i, j) is treated as an ordered subset of productions, so that we can refer to the first, second, etc., production in the set.
[...] /* there is a marked production in R(i, j) */ let [A → BC] be the marked production in R(i, j); [...]

One can verify that running the main program using the recognition matrix of Fig. 1 outputs the parse trees of $w = abaa$ in the order shown in [...] MATCH is performed to determine its children, and this takes $O(n)$ time. Moreover, all calls to UNMARK within PARSE take at most $O(n^2)$ steps. Thus, the total running time is $O(n^3 + mn^2)$, where $m$ is the number of distinct parse trees of the string. Note that the second term dominates when $m = \Omega(n)$.
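To make the all-parses task concrete, the sketch below enumerates every parse tree of a string by plain recursion over the split point $k$, which plays the role of the index of decomposition id(i, j). This is not procedure PARSE: it does not mark productions or use the tag, done and last_id bookkeeping, and it produces all trees in one recursive enumeration rather than one tree per call. The grammar ($S \to SS \mid a$) and the names `derivable` and `parses` are assumptions made for illustration.

```python
# Hypothetical CNF grammar (not the grammar of Fig. 1): S -> SS | a
BINARY = [("S", "S", "S")]   # (A, B, C) encodes A -> BC
TERMINAL = [("S", "a")]      # (A, a) encodes A -> a


def derivable(w):
    """N[i, j] = set of nonterminals deriving a_i..a_j (1-indexed, inclusive)."""
    n = len(w)
    N = {(i, i): {a for (a, t) in TERMINAL if t == w[i - 1]}
         for i in range(1, n + 1)}
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            N[i, j] = {a for k in range(i, j) for (a, b, c) in BINARY
                       if b in N[i, k] and c in N[k + 1, j]}
    return N


def parses(A, i, j, w, N):
    """Yield every parse tree for A =>* a_i..a_j as a nested tuple."""
    if i == j:
        if A in N[i, i]:
            yield (A, w[i - 1])          # leaf: production A -> a_i
        return
    for k in range(i, j):                # k is the index of decomposition
        for (a, b, c) in BINARY:
            if a == A and b in N[i, k] and c in N[k + 1, j]:
                for left in parses(b, i, k, w, N):
                    for right in parses(c, k + 1, j, w, N):
                        yield (A, left, right)
```

For example, `list(parses("S", 1, 4, "aaaa", derivable("aaaa")))` yields five trees under this toy grammar, one for each way of fully bracketing four leaves.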
OVERVIEW OF THE SYSTOLIC PARSING ALGORITHM
As in the sequential algorithm, the systolic algorithm consists of two phases: a recognition phase which computes the recognition matrix, and a parsing phase which outputs the parse trees. In this section, we first give an overview of each phase of the algorithm. The complete descriptions are given in Sections 4 and 5.
The systolic parsing algorithm is implemented on the systolic array depicted in Fig. 4. It consists of two triangular arrays: the P-array (the square nodes) and the Q-array (the circular nodes). Both triangular arrays have $n$ processors along each dimension, where $n$ is the length of the input string to be parsed. The processors are assumed to be indexed as shown.
For the P-array, P(i, j) denotes the processor in the $i$th leftmost column of the $j$th row. For the Q-array, Q(i, j) denotes the processor in the $i$th rightmost column of the $(j - i + 1)$-st row. For convenience, we call a processor of the P-array (Q-array) a P-processor (Q-processor). The processors are interconnected as shown in the figure. All communication links are assumed bi-directional (i.e., data can travel in either direction).
Overview of the Recognition Phase
The recognition phase computes the recognition matrix $R$ and determines whether the input string is in the language generated by the grammar. During this phase, only the processors of the P-array take part in the computation; the Q-array is not used. The input string $a_1 a_2 \cdots a_n \$$ is fed serially to processor P(1, 1) beginning at clock cycle 1 (\$ is a special symbol denoting end-of-input).
During the recognition phase, the movement of data in the P-array is only from lower-indexed to higher-indexed processors (i.e., from left to right and from top to bottom). We take advantage of the uniformity of the data flow by introducing the notion of a forward sweep. For a processor $p$ of the P-array, let $d_p$ be the rectilinear distance (i.e., counting only horizontal and vertical links) of $p$ from processor P(1, 1). Then, $p$ is said to be at forward sweep $s$ if it is at clock cycle $d_p + s$. For example, forward sweep 1 is clock cycle 1 for P(1, 1), clock cycle 2 for P(1, 2), clock cycle 3 for P(2, 2) and P(1, 3), etc. The important thing to note is that for a given forward sweep, processor P(i, j) is one clock cycle earlier than P(i + 1, j) and P(i, j + 1). Thus, a computation that takes place in the former processor can affect the latter processors also at the same forward sweep.

For a primary processor P(j, j), the processors $\{ P(i, j) \mid 1 \le i < j \}$ are called its secondary processors. The secondary processors play a different role. Suppose that primary processor P(j, j) is assigned to compute entry $R(a, b)$ at forward sweep $s$. Then, also at forward sweep $s$, the pairs of entries $\{ [R(a, c), R(c + 1, b)] \mid a \le c < b \}$ will be stored in the secondary processors of P(j, j). Thus, in one forward sweep, the value of $R(a, b)$ can be obtained by computing the convolutions of the pairs in each secondary processor, and taking the union of the results.
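The timing of a forward sweep and the set of pairs needed for one entry can be written down directly. The short Python sketch below does so; the function names are hypothetical, and the only facts used are the definition of $d_p$ and the pair set $\{[R(a, c), R(c + 1, b)]\}$ given above.

```python
def distance_from_origin(i, j):
    """Rectilinear distance d_p of P(i, j) (column i, row j) from P(1, 1)."""
    return (i - 1) + (j - 1)


def clock_cycle(i, j, sweep):
    """Clock cycle at which P(i, j) is at the given forward sweep."""
    return distance_from_origin(i, j) + sweep


def required_pairs(a, b):
    """Index pairs ((a, c), (c + 1, b)), a <= c < b, needed to compute R(a, b)."""
    return [((a, c), (c + 1, b)) for c in range(a, b)]
```

With these definitions, `clock_cycle(2, 2, 1)` and `clock_cycle(1, 3, 1)` both evaluate to 3, matching the example in the text, and `required_pairs(a, b)` always has `b - a` elements.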
The mapping from pairs of entries to secondary processors is best explained by means of an example. Consider the case when processor P(5, 5) is to compute $R(2, 6)$ at forward sweep 6. Then the required pairs $\{ [R(2, c), R(c + 1, 6)] \mid 2 \le c < 6 \}$ will be stored in processors P(1, 5), ..., P(4, 5) as shown in Fig. 5b. Intuitively, the mapping is obtained [...]. (As we shall see later, this "folded" mapping guarantees that the data can be routed among processors using only nearest-neighbor connections.) Figure 6 depicts the configurations of the P-array for forward sweeps 1 to $n$ for the case $n = 4$. In Section 4, we formalize the mapping just described, and explain how data should be routed in the P-array in order to achieve this mapping.
Forward sweep $n + 1$ (at which processor P(1, 1) reads the end-of-input marker \$) is used to terminate the recognition phase for all processors. When \$ is read, processor P(1, 1) issues a "halt" signal which travels downwards and to the right with unit-delay. When received by a processor other than P(n, n), the processor terminates its computation. Processor P(n, n) instead checks if $R(1, n)$ contains a production whose LHS is the start symbol $S$. If there is no such production, it sends a "reject" signal back to processor P(1, 1) and the systolic array halts. Otherwise, P(n, n) initiates the parsing phase, described next.
Overview of the Parsing Phase
The parsing phase is a systolic implementation of procedure PARSE described in Section 2. During this phase, both P-array and Q-array take part in the computation. Conceptually, the phase is divided into m stages, where m is the number of distinct parse trees of the input string. During each stage, the P-array identifies and "marks" the productions of the next parse tree from the recognition matrix entries stored in its processors. The marked productions are then routed to the Q-array. At the end of the stage, the Q-array holds the new parse tree: if the parse tree contains a production from R(i, j), then this production is stored in processor Q(i, j).
In addition to the marked production (if one exists), each processor Q(i, j) stores the following auxiliary information about $R(i, j)$ (see procedure PARSE): done(i, j), id(i, j), and last_id(i, j). If id(i, j) = k, then Q(i, j) also holds the values of done(i, k) and done(k + 1, j). As in procedure PARSE, the auxiliary information is used to determine the parse tree to be generated in the next stage.
In describing the steps involved, we use the notion of a reverse sweep, which is analogous to a forward sweep. At the start of each stage, processor P(n, n) issues a "begin-parse" signal that reaches all other processors by moving upwards and to the left with unit-delay. Thus, a processor a (rectilinear) distance $d$ away from P(n, n) receives the signal $d$ clock cycles later. For a processor, let reverse sweep 1 (of the current stage) be the clock cycle at which it receives the "begin-parse" signal. Then, reverse sweep 2 is the next clock cycle, reverse sweep 3 the clock cycle after reverse sweep 2, etc. That is, a reverse sweep is just like a forward sweep except that processor P(i, j) is one clock cycle earlier than P(i - 1, j) and P(i, j - 1).
The following invariant holds for the first $n$ reverse sweeps of every stage: at reverse sweep $s$, $1 \le s \le n$, each P-processor contains the recognition matrix entries it had at forward sweep $n - s + 1$ of the recognition phase. For example, for the case $n = 4$, the entries stored in the P-processors at reverse sweeps 1, 2, 3, and 4 are those depicted in Fig. 6, but in reverse order of forward sweeps. During the parsing phase, however, each entry stored in a P-processor may have two additional pieces of information "attached" to it: a nonterminal symbol sym and a tag that can take the value FIRST, CURRENT, or NEXT. Intuitively, an entry with an attached sym and tag signifies a call to procedure PARSE that will be executed by the P-array.
The primary P-processors also receive information from the Q-array via the wrap-around links in Fig. 4. At the same time that primary processor P(j, j) holds the entry $R(a, b)$, it also receives, from the wrap-around link, the contents of processor Q(a, b) at the end of the previous stage (i.e., the marked production, if any, in $R(a, b)$, done(a, b), id(a, b), etc.). Section 5 discusses the data routing scheme that realizes the mapping just described.
We now explain how procedure PARSE is implemented on the systolic array. In the following, it is instructive to refer to the steps of this procedure. Informally, PARSE is executed whenever a primary P-processor receives an entry, say R(a, b), with an attached sym and tag. Depending on the value of tag (i.e., FIRST, CURRENT, or NEXT), the processor performs the corresponding steps in procedure PARSE. That is, it marks a production in R(a, b) then initiates recursive calls to PARSE. Note that the auxiliary information about R(a, b) is available to the processor from the wrap-around link.
In PARSE, the arguments of the recursive calls are determined by subroutine MATCH. Once identified by MATCH, the matching pair of entries are each given their sym and tag values to initiate the recursive calls to PARSE. As described in Section 5, these entries are then routed to other processors until they eventually land in primary P-processors. When this happens, the P-processors execute PARSE in the manner just described.
For each stage of the parsing phase, the initial call to PARSE is made by processor P(n, n) at reverse sweep 1. For the first stage, this is accomplished by simply attaching sym = S and tag = FIRST to the entry $R(1, n)$ stored in P(n, n). For any other stage, P(n, n) first checks the value of done(1, n) which it receives from the wrap-around link. If done = true, the parsing phase terminates. Otherwise, P(n, n) attaches sym = S and tag = NEXT to $R(1, n)$ and proceeds with the rest of the computation. Note that these steps are analogous to those of the main program that calls procedure PARSE.
Lastly, we describe the action of the Q-array. When MATCH is executed by a primary P-processor and its secondary P-processors, three pieces of information leave the leftmost secondary P-processor and enter the rightmost Q-processor in the same row. These are: (1) the production marked by the primary P-processor; (2) id, the location (processor index) of the matching pair of entries found by MATCH; and (3) last_id, which is true if the matching pair found is the last such pair. To accommodate this incoming data, the Q-processors shift their own data to their left neighbors. Consequently, if $\pi \in R(i, j)$ is a marked production, then at the end of $n$ reverse sweeps, Q(i, j) will hold $\pi$, id(i, j) and last_id(i, j).
After the marked productions of the new parse tree have been stored, the Q-array performs one final update step. Specifically, if id(i, j) = k then Q(i, j) gets the values of done(i, k) and done(k + 1, j) from Q(i, k) and Q(k + 1, j), respectively, then computes done(i, j) from these values. For Q(i, i), $1 \le i \le n$, done(i, i) = true. The details are given in Section 5. After this update step, the systolic array is ready to commence the next stage of the parsing phase.
THE SYSTOLIC RECOGNITION PHASE
This section and the next give full descriptions of the recognition and parsing phases, respectively, of the systolic algorithm. Before proceeding, we first describe the local memory organization of the systolic array processors. This is shown in Fig. 7. A P-processor has six registers $r_{pq}$ and $t_p$ ($p, q \in \{0, 1\}$), each capable of holding an ordered subset of productions of the underlying grammar. In addition, it has four cells, $C_{pq}$ ($p, q \in \{0, 1\}$), where a cell is a collection of three registers: tag, sym, and pset. Register [...] $(l, b)$ where $l$ is an integer in the range $0 \le l \le n$ and $b \in \{0, 1\}$. We shall explain the use of these registers in subsequent sections.

We now describe the recognition phase in detail. The function of this phase is to compute the recognition matrix $R$ and to determine whether the input string $a_1 a_2 \cdots a_n$ is in the language generated by the grammar. During this phase, only the P-array is used. The input string, followed by an end-of-input marker \$, is fed serially to processor P(1, 1) beginning at forward sweep 1. The entries of the recognition matrix are computed by the primary processors over $n$ sweeps: P(j, j) computes $R(s - j + 1, s)$ at forward sweep $s$, $j \le s \le n$. The secondary processors $\{ P(i, j) \mid 1 \le i < j \}$ assist in the computation [...]
The Formal Mapping
The formal mapping is given by Invariants 4.1 and 4.2. The processors use the four $r_{pq}$ registers to store the entries. The notation $r_{pq}(i, j, s)$ means the contents of register $r_{pq}$ of processor P(i, j) at forward sweep $s$. [...] $-j + i + 1, s)$ otherwise. Invariants 4.1 and 4.2 specify the register values for secondary and primary P-processors, respectively. All registers are assumed to be initialized to the empty set $\emptyset$. Observe from Invariant 4.1 that some secondary processors may have some registers permanently set to $\emptyset$; this indicates that no matrix entry is mapped onto the register. Moreover, for primary processors (Invariant 4.2), $r_{00}$ and $r_{11}$ are always $\emptyset$, and $r_{01}$ and $r_{10}$ hold the computed entry. Although one register should be sufficient, this mapping simplifies the routing of data (to be explained later). Finally, the invariants define the register values of P(i, j) only for forward sweeps $s \ge j$. If $s < j$, the registers of P(i, j) retain their initial values $\emptyset$.
The Routing Scheme
It is easy to see how Invariant 4.2 can be realized for every primary processor given that Invariant 4.1 holds for secondary processors. For a given forward sweep, Invariant 4.1 states that all the pairs of entries required to compute the entry at the primary processor are available in the secondary processors to its left. Thus, the desired value is simply the union, over all secondary processors, of $(r_{00} * r_{01}) \cup (r_{10} * r_{11})$. This value can be computed as follows: Each processor has a left input terminal IN and a right output terminal OUT (for a processor in the leftmost column other than P(1, 1), IN is assumed to be permanently set to $\emptyset$). At the start of each forward sweep, the processor receives a value from IN, computes $\mathrm{IN} \cup (r_{00} * r_{01}) \cup (r_{10} * r_{11})$, then sends the result to OUT. The output from OUT then travels with unit-delay to the IN terminal of the next processor. It is clear that the value that arrives at the primary processor is the desired matrix entry. The primary processor then stores this value in its $r_{01}$ and $r_{10}$ registers. Processor P(1, 1) [...] newly computed entry in its registers, the processor routes the register contents to the associated output terminals the same way as described.
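The accumulation along a row can be sketched sequentially in Python: a running value enters each secondary processor, is extended by that processor's two convolutions, and is passed on to the right. The grammar ($S \to SS \mid a$), the tuple encoding, and the names `convolve` and `sweep_row` are assumptions for this illustration; in particular, the operator $*$ is taken to return the productions $A \to BC$ whose right-hand symbols are left-hand sides in the two operand entries. The unit-delay pipelining of the real array is not modelled.

```python
# Hypothetical CNF grammar: S -> SS | a; (A, B, C) encodes A -> BC.
BINARY = [("S", "S", "S")]


def convolve(left, right):
    """Assumed meaning of '*': productions A -> BC with B in left, C in right (by LHS)."""
    lefts = {p[0] for p in left}
    rights = {p[0] for p in right}
    return {(a, b, c) for (a, b, c) in BINARY if b in lefts and c in rights}


def sweep_row(secondary_registers):
    """Value reaching the primary processor after passing every secondary processor.

    `secondary_registers` lists (r00, r01, r10, r11) for the secondary
    processors of one row, from left to right.
    """
    value = set()  # the leftmost processor's input is the empty set
    for r00, r01, r10, r11 in secondary_registers:
        value = value | convolve(r00, r01) | convolve(r10, r11)
    return value
```

Because union is associative and commutative, the left-to-right order of accumulation does not affect the result; it only matters for nearest-neighbor routing.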
For processor P(i, j), the previous data routing step (and the associated computational step which computes the convolutions) is performed at every forward sweep $s \ge j$. For forward sweeps $s < j$, the processor is "inactive." The processors can be activated at the right forward sweeps as follows: At clock cycle 1 (when the first input symbol is read), processor P(1, 1) generates a "start" control signal which travels downwards with 2-delay (i.e., hops from processor to processor every 2 clock cycles) and to the right with unit-delay. One can easily verify that the "start" signal reaches processor P(i, j) at forward sweep $s = j$.

Fig. 9. Updating the $r_{pq}$ registers of processor P(i, j) for the case (a) $2i \ne j$ and (b) $2i = j$.
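The claim that the "start" signal reaches P(i, j) at forward sweep $j$ follows from a short calculation: the signal needs $2(j - 1)$ cycles to descend $j - 1$ rows and $i - 1$ cycles to move right $i - 1$ columns, so it arrives at clock cycle $1 + 2(j - 1) + (i - 1)$, and subtracting $d_p = (i - 1) + (j - 1)$ leaves $j$. The Python sketch below checks this for a whole triangular array; the function names are hypothetical.

```python
def start_arrival_clock(i, j):
    """Clock cycle at which the "start" signal from P(1, 1) reaches P(i, j).

    The signal leaves P(1, 1) at clock cycle 1, takes 2 cycles per downward
    hop (j - 1 hops) and 1 cycle per rightward hop (i - 1 hops).
    """
    return 1 + 2 * (j - 1) + (i - 1)


def forward_sweep_at(i, j, clock):
    """Forward sweep of P(i, j) at a given clock cycle (clock = d_p + sweep)."""
    return clock - ((i - 1) + (j - 1))


def start_sweeps(n):
    """Forward sweep at which each P(i, j), 1 <= i <= j <= n, is activated."""
    return {(i, j): forward_sweep_at(i, j, start_arrival_clock(i, j))
            for j in range(1, n + 1) for i in range(1, j + 1)}
```

For any `n`, every value in `start_sweeps(n)` equals the row index `j` of its key.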
At this point, we explain the use of registers $t_0$ and $t_1$ in each processor (see Fig. 7). At the clock cycle when a processor receives the "start" signal, it also copies into its $t_0$ and $t_1$ registers the updated contents of its $r_{01}$ and $r_{11}$ registers, respectively. In subsequent clock cycles, the contents of $t_0$ and $t_1$ are left unchanged. The information stored in these registers will be used later in the parsing phase.
The computational and data routing steps just described guarantee that Invariants 4.1 and 4.2 hold for all processors of the P-array. In particular, at the end of forward sweep $n$, processor P(n, n) would have computed the value of $R(1, n)$. The proof is a straightforward induction (on the sweep number and processor index) and is left to the reader.
Forward sweep $n + 1$ (at which processor P(1, 1) reads the end-of-input marker \$) is used to terminate the recognition phase for all processors. When \$ is read, processor P(1, 1) issues a "halt" signal which travels downwards and to the right with unit-delay. When received by a processor other than P(n, n), the processor terminates its computation. Processor P(n, n) instead checks if $R(1, n)$ (which is stored in its $r_{01}$ and $r_{10}$ registers) contains a production whose LHS is the start symbol $S$. If there is no such production, it sends a "reject" signal back to processor P(1, 1) and the systolic array halts. Otherwise, P(n, n) initiates the parsing phase described in the next section.
Remark 4.1. We have some final remarks about the recognition phase. If one wishes only to determine whether the input string is in the language generated by the grammar, then the systolic array need not execute the next phase. In this case, one gets the answer from processor P(n, n) at the end of forward sweep n + 1, which corresponds to clock cycle 3n-1. Furthermore, observe that every processor stores in its registers values which are dependent only on the size of the grammar and not on the length of the input (i.e., the processor is finite-state). It is also a simple exercise to modify the systolic algorithm just described so that each processor does not need to know its index (e.g., as is required to distinguish processors P(i, j) such that 2i=j).
THE SYSTOLIC PARSING PHASE
The systolic parsing phase uses both the P-array and the Q-array. Conceptually, the phase is divided into $m$ stages, where $m$ is the number of distinct parse trees of the input string. During each stage, the P-array marks the productions comprising the next parse tree and routes this information to the Q-array. At the end of the stage, the new parse tree will be stored "on-the-fly" in the Q-array: if the parse tree contains a production from $R(i, j)$, then this production is stored in processor Q(i, j).
The Routing Scheme
As stated in the overview section, the following invariant holds for the first $n$ reverse sweeps of every stage: at reverse sweep $s$, $1 \le s \le n$, each P-processor contains the recognition matrix entries it had at forward sweep $n - s + 1$ of the recognition phase. To see how this can be achieved, observe that during the recognition phase, every newly computed entry starts from an $r_{pq}$ register of a primary P-processor then follows a unique directed path through the P-array. Moreover, the path always ends either at a $t_p$ register at some forward sweep $s \le n$ (after which the $t_p$ register is no longer changed) or at an $r_{pq}$ register at forward sweep $n$. Thus, the $r_{pq}$ and $t_p$ registers at the end of the recognition phase contain all the entries computed in all $n$ forward sweeps; in $n$ reverse sweeps these entries can be sent back to their previous locations by routing them along the paths opposite to those they took during the recognition phase.
In order not to lose the information stored in the $r_{pq}$ and $t_p$ registers at the end of the recognition phase (they will be required at the start of each new stage), we instead use the cells of the P-array for storing and routing the data (see Fig. 7). In particular, we let register pset of cell $C_{pq}$ (or pset($C_{pq}$) for short) take the place of register $r_{pq}$. The "routing scheme" for cells is essentially the reverse of that shown in Fig. 9: simply replace "$r_{pq}$" by "$C_{pq}$" and reverse the directions of all the arrows. The delays associated with the links (see Fig. 8) remain the same. (To "route a cell" we mean to route the contents of the three registers tag, sym and pset that make up the cell.) Processor P(i, j) performs the routing step for its cells at every reverse sweep. There are two exceptions: The first is reverse sweep 1, when processor P(i, j) updates pset($C_{01}$) and pset($C_{10}$) to $r_{01}$ and $r_{10}$, respectively, instead of getting the data as inputs (which turn out to be nonexistent at reverse sweep 1). The second exception is reverse sweep $n - j + 1$, when processor P(i, j) instead updates pset($C_{01}$) and pset($C_{11}$) to $t_0$ and $t_1$, respectively; this has the opposite effect of copying $r_{01}$ and $r_{11}$ into $t_0$ and $t_1$, respectively, at forward sweep $j$. (We shall explain later how processor P(i, j) would know when it is at reverse sweep $n - j + 1$.)
Finding a Matching Pair
We now describe the systolic implementation of subroutine MATCH of procedure PARSE. Given a production $\pi$ in a primary P-processor, MATCH identifies a pair of entries $[R(a, c), R(c + 1, b)]$ in one of its secondary processors that contains $\pi$'s children in the parse tree. This is completed in one reverse sweep. Formally, the P-processor issues an [...] with unit-delay).

Figure. The cells of secondary processors to the left of P(j, j) depicted as a "chain" of cell-pairs.

Now, let $[C_1, C_2]$ be the first cell-pair satisfying the property that (*) there is a production in pset($C_1$) whose LHS = B and there is a production in pset($C_2$) whose LHS = C. [...] received, the Q-processor simply clears the three registers.)

For processors of both the P-array and Q-array, the routing steps and the computational steps associated with the MATCH instruction are executed at every reverse sweep starting at reverse sweep 1 (which is when they receive the "begin-parse" signal). For all processors on the $j$th row (from the top), reverse sweep $n - j + 1$ is the last reverse sweep when these steps are performed. A processor on the $j$th row can know when it is at reverse sweep $n - j + 1$ as follows: At reverse sweep 1, processor P(n, n) issues an "end-parse" control signal which travels upwards with 2-delay and to the left with unit-delay. A processor receives this signal at reverse sweep $n - j + 1$.
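The arrival time of the "end-parse" signal can be checked in the same way as the "start" signal of the recognition phase. In the Python sketch below, `t0` is a hypothetical name for the clock cycle at which P(n, n) is at reverse sweep 1; the other function names are likewise assumptions. The sketch covers P-processors only and also confirms that reverse sweep $n - j + 1$ corresponds, under the invariant of the routing scheme, to forward sweep $j$, the sweep at which row $j$ first became active.

```python
def begin_parse_clock(i, j, n, t0=0):
    """Clock at which P(i, j) is at reverse sweep 1 (unit-delay up and left)."""
    return t0 + (n - i) + (n - j)


def end_parse_clock(i, j, n, t0=0):
    """Clock at which "end-parse" reaches P(i, j): 2-delay up, unit-delay left."""
    return t0 + 2 * (n - j) + (n - i)


def reverse_sweep_at(i, j, n, clock, t0=0):
    """Reverse sweep of P(i, j) at the given clock cycle."""
    return clock - begin_parse_clock(i, j, n, t0) + 1


def matching_forward_sweep(s, n):
    """Forward sweep whose entries a P-processor holds at reverse sweep s."""
    return n - s + 1
```

For every P(i, j) with $1 \le i \le j \le n$, `reverse_sweep_at(i, j, n, end_parse_clock(i, j, n))` evaluates to `n - j + 1`.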
Systolic Execution of PARSE
The MATCH instruction is issued by a primary P-processor whenever it receives a "marked" cell $C_{pq}$, i.e., a cell with values stored in its sym and tag registers. A marked cell signifies a "call" to procedure PARSE that is to be executed by the primary P-processor. (At most one marked cell will arrive at a primary P-processor in any reverse sweep.) Recall that at each reverse sweep, a primary P-processor also receives the auxiliary information about the entry stored in its cell from the wrap-around link. This information is needed by the primary P-processor to execute PARSE. For [...]

Note that a primary P-processor may receive a marked production (from the previous stage) from the wrap-around data, yet not have a marked cell. As stated above, the P-processor does nothing. This produces the same effect as subroutine UNMARK in procedure PARSE.
For each stage of the parsing phase, the initial call to PARSE is made by processor P(n, n) at reverse sweep 1. For the first stage, this is accomplished by simply setting sym($C_{01}$) = S and tag($C_{01}$) = FIRST. For any other stage, P(n, n) first checks the value of done from the wrap-around input. If done = true, the parsing phase terminates. Otherwise, P(n, n) sets sym($C_{01}$) = S and tag($C_{01}$) = NEXT. These steps are analogous to those of the main program that calls procedure PARSE.
The Update Step for the Q-array
The clock cycle at which the "end-parse" signal is received represents the end of the stage for each processor of the P-array. On the other hand, the Q-array performs another step which involves the update of the ldone, done and rdone registers of its processors. This is accomplished as follows: At reverse sweep $n$ (which is also when it receives the "end-parse" signal), processor Q(n, n) sends out an "update" control signal to all other processors of the Q-array, this signal traveling diagonally downwards with 2-delay and to the right with unit-delay. For processors on the top row of the Q-array (i.e., processors Q(j, j), $1 \le j \le n$), the following is performed when they receive the "update" signal: set ldone = rdone = done = true and send the contents of done diagonally downwards with 2-delay and vertically downwards with unit-delay. For a processor in a lower row, one diagonal input and one vertical input would arrive at the time it receives the "update" signal. The processor then does the following:
(1) If its p register does not contain a production, then clear its done, ldone and rdone registers and route the vertical and diagonal [...]
(2) If its p register contains a production, then set ldone to the value of the vertical input and rdone to the value of the diagonal input. Set done to true iff (i) ldone = rdone = true, (ii) last_id = true, and (iii) the p register contains a distinguished production $\pi$. Otherwise, set done to false. Route the contents of done vertically and diagonally downwards.
After the update step for processor Q(1, n), it sends the contents of all of its local registers to processor P(n, n) of the P-array to begin the next stage. In addition, processor Q(1, n) sends a signal to all processors of the Q-array, this signal traveling upwards and to the left with unit-delay. When received by a Q-processor, it sends the contents of all its local registers to the processor to its left and receives the update values from the processor to its right. The effect is that the entire parse tree is shifted out of the Q-array and pipelined into the primary processors of the P-array using the wrap-around links in Fig. 4. Figure 11 illustrates the configurations of the systolic array during the first $n$ reverse sweeps of stage 1 and Fig. 12 depicts the Q-array after the update step. Figures 13 and 14 show the configurations during the second stage.
