This paper presents an FPGA-based implementation of a co-processing unit able to parse context-free grammars of real-life sizes. Applications of such a parser range from the syntactic analysis of programming languages to very demanding natural language applications in which parsing speed is an important issue.
Introduction
Formal languages, in particular those described by context-free (CF) grammars [1], are used in many applications [...] goal; e.g., in the case where the output of the speech recognizer is further processed with a syntactic parser to filter out those hypotheses (i.e. sentences) that are not syntactically correct. The sequential coupling presented in [5] is such an example. Due to the real-time constraints of such an application, fast parsing is again required.
Various low-complexity (i.e. polynomial) parsing solutions have been proposed for context-free languages, in particular several parallel implementations of the standard $O(n^3)$ time complexity Cocke-Younger-Kasami (CYK) algorithm [1]. This algorithm is known to have $O(n^2)$ time complexity when executed on a 1D-array of processors and $O(n)$ time complexity when executed on a 2D-array of processors [9].
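The complexity figures quoted above can be checked with a small counting sketch. It counts only the (span, start, split) combinations visited by the CYK loops and ignores grammar lookups and memory contention. The function names are illustrative, and the 1D-array model (all cells of a given span computed simultaneously, one processor per column) is a simplifying assumption rather than a description of any particular design.

```python
def sequential_steps(n: int) -> int:
    """Split points visited by a sequential CYK run on n words."""
    # For each span length j there are n - j + 1 cells, each with j - 1 splits.
    return sum(j - 1 for j in range(2, n + 1) for _ in range(1, n - j + 2))


def linear_array_steps(n: int) -> int:
    """Time steps when all cells of span j are processed in parallel."""
    # Each processor handles one cell per iteration, so iteration j costs j - 1.
    return sum(j - 1 for j in range(2, n + 1))


if __name__ == "__main__":
    for n in (4, 11, 20):
        print(n, sequential_steps(n), linear_array_steps(n))
```

The first count equals n(n-1)(n+1)/6, which grows as the cube of n, while the second equals n(n-1)/2, which grows as the square of n.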
For such arrays, various VLSI designs have been proposed: a syntactic recognizer based on the CYK algorithm on a 2D-array of processors [4], and a robust (error-correcting) recognizer and analyzer (with parse tree extraction) based on the Earley algorithm on a 2D-array of processors [3]. Although these designs meet the usual VLSI requirements (constant-time processor operations, regular communication geometry, uniform data movement), the hardware resources they require do not allow them to accommodate real-life CF grammars used in large-scale NLP applications.
In this context, we propose an FPGA-based implementation of the CYK algorithm on a 1D-array of processors. It is able to accommodate real-life CF grammars and to parse input sentences whose maximal length is a customizable parameter. The design was described in VHDL, simulated for validation, synthesized and tested on an existing FPGA board, and finally compared for performance against two software implementations. Its main features are:
• word-lattice parsing: the output of a speech recognizer is a number of possible sentences, often represented in the compact form of a word-lattice. Unlike the usual parsing algorithms, which process sentences, our system is able to parse whole word-lattices and is therefore better adapted for integration in the framework of a speech recognition system. The ability to parse word-lattices is an important functionality, required by a large number of applications relying on speech recognition interfaces;
• scalability: the system can be exactly tailored to the characteristics of any given Chomsky normal form (CNF) CF grammar so that no hardware resources are wasted;
• extensibility: the number of processors is not limited by the resources available in the FPGA. It can be increased on demand by cascading several FPGA circuits.
The design was described in VHDL in order to have a technology-independent implementation that can later be used to target an ASIC implementation. In section 2 we briefly present the CYK algorithm and the changes required for its adaptation to word-lattice parsing. Section 3 describes the design, its main components, and its scalability and extensibility properties. Section 4 analyzes the performance of the FPGA design in comparison with two software implementations of the CYK algorithm. Conclusions and future extensions are presented in section 5.

$P$ is a set of grammar rules, i.e. a subset of $N \times (N \cup \Sigma)^*$, written in the form $X \to \alpha$, where $X \in N$ and $\alpha \in (N \cup \Sigma)^*$. For instance, a grammar rule can be $S \to NP\ V$, representing that a sentence can consist of a noun-phrase followed by a verb.

Lines 1-6 of the algorithm correspond to the initialization step, in which the sets $N_{i,1}$ are initialized using only the grammar rules of the form $X \to w_i$. In the CYK table this corresponds to the initialization of all entries on the bottom row. Lines 7-13 correspond to the subsequent filling-up of the CYK table once the initial sets $N_{i,j}$ have been constructed. Finally, the parse trees can be extracted from the CYK table if necessary. If $S$ is the topmost symbol (root) of such a tree, the sentence is syntactically correct for the grammar $G$.
CYK algorithm for

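 1: for $i = 1$ to $n$ do
 2:     $N_{i,1} = \{X : (X \to w_i) \in P\}$
 3:     for $j = 2$ to $n - i + 1$ do
 4:         $N_{i,j} = \emptyset$
 5:     end for
 6: end for
 7: for $j = 2$ to $n$ do
 8:     for $i = 1$ to $n - j + 1$ do
 9:         for $k = 1$ to $j - 1$ do
10:             $N_{i,j} = N_{i,j} \cup \{X : (X \to YZ) \in P \text{ with } Y \in N_{i,k} \text{ and } Z \in N_{i+k,j-k}\}$
11:         end for
12:     end for
13: end for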
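As a software reference point, the algorithm can be sketched in a few lines of Python. This is a minimal recognizer, not the hardware design: the dictionary-based grammar encoding and the toy grammar (which generates strings of the form a...ab...b) are assumptions made for illustration only.

```python
def cyk_table(words, unary, binary):
    """Fill the CYK table; table[(i, j)] holds the set N_{i,j}."""
    n = len(words)
    table = {}
    # Initialization step: bottom row from rules X -> w_i, other cells empty.
    for i in range(1, n + 1):
        table[(i, 1)] = set(unary.get(words[i - 1], set()))
        for j in range(2, n - i + 2):
            table[(i, j)] = set()
    # Filling-up step: combine shorter spans using rules X -> Y Z.
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                for y in table[(i, k)]:
                    for z in table[(i + k, j - k)]:
                        table[(i, j)] |= binary.get((y, z), set())
    return table


def accepts(words, unary, binary, start="S"):
    """True if the start symbol covers the whole sentence."""
    return bool(words) and start in cyk_table(words, unary, binary)[(1, len(words))]


# Hypothetical CNF grammar: S -> A B | A T, T -> S B, A -> a, B -> b.
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

if __name__ == "__main__":
    print(accepts("a a b b".split(), UNARY, BINARY))
    print(accepts("a b b".split(), UNARY, BINARY))
```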
Example
Assume that the CF grammar G is given by: 
Word-lattice adaptation
An example of word-lattice representation is given in figure 2. Each path starting in the leftmost node and ending in the rightmost node of the word-lattice corresponds to a possible recognized sentence. These sentences are then filtered by the syntactic parser.
When dealing with word-lattices, the difference from the previously presented CYK algorithm is that not only the sets $N_{i,1}$ may be initialized during the initialization step, but any of the sets $N_{i,j}$, since with a word-lattice representation words may occur anywhere in the CYK table (see figure 3(b)). Thus, in order to adapt the CYK algorithm to word-lattice parsing, the initialization step needs to be extended [5].
To illustrate this point, let us consider the word-lattice given in figure 2, containing the six sentences "a a b b", [...] (figure 3(b)).
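The extended initialization can be sketched as follows. The lattice encoding is an assumption made for this example: nodes are numbered 0 to n, and an edge (start, end, word) is taken to initialize the cell $N_{start+1,\,end-start}$. The grammar, including the rule that lets the hypothetical two-slot word "x" derive S directly, is likewise illustrative.

```python
def lattice_cyk(n_slots, edges, unary, binary):
    """CYK table for a word-lattice whose nodes are numbered 0..n_slots."""
    table = {(i, j): set()
             for i in range(1, n_slots + 1)
             for j in range(1, n_slots - i + 2)}
    # Extended initialization: a word may seed any cell, not only row 1.
    for start, end, word in edges:
        table[(start + 1, end - start)] |= unary.get(word, set())
    # Filling-up step: cells are extended, never cleared.
    for j in range(2, n_slots + 1):
        for i in range(1, n_slots - j + 2):
            for k in range(1, j):
                for y in table[(i, k)]:
                    for z in table[(i + k, j - k)]:
                        table[(i, j)] |= binary.get((y, z), set())
    return table


# Hypothetical grammar: S -> A B | A T | x, T -> S B, A -> a, B -> b.
LATTICE_UNARY = {"a": {"A"}, "b": {"B"}, "x": {"S"}}
LATTICE_BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}
# Hypothetical lattice with a single path: "a", then "x" over two slots, then "b".
EDGES = [(0, 1, "a"), (1, 3, "x"), (3, 4, "b")]

if __name__ == "__main__":
    result = lattice_cyk(4, EDGES, LATTICE_UNARY, LATTICE_BINARY)
    print("S" in result[(1, 4)])
```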
The hardware design we are going to present implements the CYK algorithm adapted for word-lattice parsing.
The FPGA Design
The general idea of our current hardware design is to use $n - 1$ processors to parse sentences of at most $n$ words. For example, the block diagram in figure 4 represents a 10-processor system that can parse any sentence of at most 11 words and that we have implemented on an RC1000-PP FPGA board. In the figure, the elements inside the dashed line are implemented within the on-board FPGA chip. The other elements (CYK and grammar memories) are implemented in SRAM chips also present on the board.

Before any parsing can start, the grammar memories have to be configured with the binary image of the data-structure representing the CNF CF grammar (see section 3.4). Similarly, the CYK memory needs to be initialized, where needed, with the structures used to represent the sets $N_{i,j}$ (see section 3.2), and the GLOBAL controller (G-CTRL) has to be initialized with the length of the sentence to be parsed. All initializations are done off-line in the current implementation. The startPARSE signal starts the parsing and the overPARSE signal indicates the end of the parsing. The parse result is available at an output outPARSE (not represented in figure 4) of each processor and can be collected to build the parse tree.

When the parsing starts for a sentence of length $l \le n + 1$, the processors $P_1$ to $P_{l-1}$ are activated and the processors $P_l$ to $P_n$ are deactivated (i.e. not used for that parsing). It is the task of the G-CTRL to activate or deactivate the processors, based on the length $l$ of the sentence. The G-CTRL also synchronizes the processors at the end of iteration $j$ (line 7 of the CYK algorithm), before starting iteration $j + 1$. This is necessary because of the data dependencies among the sets $N_{i,j}$. The CYK memory stores the sets $N_{i,j}$ and is shared for read and write by all working processors in the system. A token passing priority arbiter handles concurrent accesses to this memory.
The grammar memories store identical copies of the binary representation of the CNF CF grammar. During parsing, the processors access the grammar memories intensively, but due to physical constraints (i.e. the number of I/O pins of an FPGA) it is impossible to have a grammar memory for each processor in the system. Instead, processors are grouped in clusters that share the same grammar memory. The number of processors in a cluster may vary from one cluster to another, and clusters are built in such a way that, in the general case, the number of concurrent accesses is as low as possible.
As for the CYK memory, a token passing priority arbiter handles concurrent accesses in every cluster.
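The text does not detail the internals of the arbiters. As a rough behavioural illustration only, a generic token-passing scheme can be modelled as below: the port holding the token has the highest priority, and after each grant the token moves just past the granted port. The class name, the method name and the rotation policy are assumptions of this sketch, not a description of the actual circuit.

```python
class TokenArbiter:
    """Behavioural model of a generic token-passing (rotating priority) arbiter."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.token = 0  # index of the port that currently has top priority

    def grant(self, requests):
        """Return the index of the granted port, or None if nobody requests.

        `requests` holds one boolean per port for the current cycle.
        """
        for offset in range(self.n_ports):
            port = (self.token + offset) % self.n_ports
            if requests[port]:
                # Pass the token on so that the granted port gets lowest priority.
                self.token = (port + 1) % self.n_ports
                return port
        return None


if __name__ == "__main__":
    arbiter = TokenArbiter(3)
    for _ in range(4):
        print(arbiter.grant([True, True, False]))
```

With two ports requesting continuously, the grants alternate between them, so no processor in a cluster is starved.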
Processor Datapath Structure
The processors have the task of filling up the entries of the CYK table. More precisely, if $l$ is the sentence length, the task of the processor $P_i$, $i \le l - 1$, is to fill the CYK table entries of column $i$ during $l - i$ iterations, in bottom-up order; in other words, to compute the sets $N_{i,j}$ based on previously constructed sets (line 10 of the CYK algorithm). [...] $(RHS1, RHS2) \in N_{i,k} \times N_{i+k,j-k}$, $1 \le k \le j - 1$;
• the synchronization unit: used by a processor to achieve synchronization with the other processors after each iteration of the CYK algorithm.
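The column-wise work assignment can be made concrete with a small scheduling sketch. It lists, for each iteration $j$, which of the active processors $P_1$ to $P_{l-1}$ have a cell to compute, and how many of them are idle. The function names are illustrative, and the model assumes one cell per processor per iteration, with a barrier between iterations.

```python
def column_schedule(length, n_processors):
    """Map each iteration j to the processors that compute a cell (i, j)."""
    if not 2 <= length <= n_processors + 1:
        raise ValueError("sentence length not supported by this array")
    plan = {}
    for j in range(2, length + 1):
        # Processor P_i owns column i and has work only while i <= length - j + 1.
        plan[j] = [i for i in range(1, length) if i <= length - j + 1]
    return plan


def idle_per_iteration(length, n_processors):
    """Number of activated processors without work in each iteration."""
    plan = column_schedule(length, n_processors)
    return {j: (length - 1) - len(busy) for j, busy in plan.items()}


if __name__ == "__main__":
    print(column_schedule(4, 10))
    print(idle_per_iteration(4, 10))
```

In this model processor $P_i$ appears in exactly $l - i$ iterations, and the number of idle processors grows by one at each iteration.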
CYK table representation in memory
The purpose of the CYK memory is to store the sets $N_{i,j}$. The data-structure used to represent these sets is critical and has to correspond to a good compromise between memory size and access time to the non-terminals in a set. $N_{i,j}$ can be any subset of $N$, hence $|N_{i,j}|$ can be equal to $|N|$ in the worst case. However, allocating for each set $N_{i,j}$ an amount of memory proportional to $|N|$ would waste a large amount of memory, as in practice $|N_{i,j}| \ll |N|$.
Therefore, in order to reduce the size of the CYK memory, we impose an upper limit $C$ on $|N_{i,j}|$. During run-time, if $N_{i,j}$ receives more than $C$ non-terminals, the hardware generates a fault signal and the parsing stops. This is, however, a very unlikely event for a well-chosen value of $C$, and we can thus use tables of size proportional to $C$ to store the non-terminals of the sets $N_{i,j}$. Note that, if the number of processors in the system is $n$, the CYK memory has to store $n(n - 1)/2$ such tables.
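The capacity limit can be illustrated with a small software model. This is a sketch under stated assumptions: the class and exception names are hypothetical, the fault signal is modelled as a Python exception, and duplicate detection uses a plain list search rather than any hardware mechanism.

```python
class SetOverflowFault(Exception):
    """Stands in for the hardware fault signal raised when a set exceeds C."""


class BoundedNonTerminalTable:
    """Fixed-capacity table holding the non-terminals of one set N_{i,j}."""

    def __init__(self, capacity):
        self.capacity = capacity  # the upper limit C
        self.items = []

    def insert(self, non_terminal):
        if non_terminal in self.items:
            return  # a set stores each non-terminal only once
        if len(self.items) >= self.capacity:
            raise SetOverflowFault("set would exceed C non-terminals")
        self.items.append(non_terminal)


def table_count(n_processors):
    """Number of tables stored in the CYK memory, as stated in the text."""
    return n_processors * (n_processors - 1) // 2


if __name__ == "__main__":
    table = BoundedNonTerminalTable(capacity=2)
    table.insert("NP")
    table.insert("VP")
    print(table.items, table_count(10))
```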
Given $i$, $j$ and $k$, during parsing, a processor has to implement the following three functionalities: [...] We give in figure 6 the CYK table organization in memory. We discuss each of these functionalities in turn.
For the implementation of F1, the Phead pointer is used as a base address that points to the first non-terminal in the non-terminal table. It is stored either in the RHS1base ($N_{i,k}$) or in the RHS2base ($N_{i+k,j-k}$) register (see figure 5). A displacement, i.e. an index, for addressing any non-terminal stored in the non-terminal table is kept in RHS1Index and RHS2Index, respectively. Adding the index to the base address gives the physical address of the non-terminal in memory. For implementing F1, the processor needs to know whether the table is empty or not and, in the latter case, which non-terminal is the last in the table. This is implemented by means of two special bits attached to each non-terminal in the table. One bit is set when the table is empty, the other when the non-terminal is the last in the table. The current implementation uses 2 bytes for representing a non-terminal; thus, 2C bytes are used to represent an entry, i.e. a set, in the CYK [...] If during the parsing Dtail > C, a fault signal is raised to indicate that the set $N_{i,j}$ has too many non-terminals. In non-word-lattice parsing (i.e. sentence parsing), Dtail is always 0 and is not used, since every set $N_{i,j}$ is empty at the beginning. In word-lattice parsing, however, the sets $N_{i,j}$ are not necessarily empty, and new non-terminals have to be inserted at the end of the non-terminal table, where Phead + Dtail points.
The initialization of the CYK table corresponds to: the initialization of the non-terminals table, the associated Dtail indexes and the guard-vectors. The Phead and Pguard pointers are initialized only once and they do not change.
In order to retrieve the pointers Phead and Pguard and the displacement Dtail, the processor builds, from $i$, $j$ and $k$, an address for addressing the indexing table (see figure 6). The address is constructed in IJshadow as $8(32i + j)$ for $N_{i,j}$, in RHS1shadow as $8(32i + k)$ for $N_{i,k}$, and in RHS2shadow as $8(32(i + k) + j - k)$ for $N_{i+k,j-k}$.
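The three address computations follow one pattern, which the sketch below makes explicit. Reading the constant 8 as the size of an indexing-table record and 32 as the row stride is an interpretation made for this example, as are the function and constant names.

```python
RECORD_BYTES = 8   # assumed size of one indexing-table record
ROW_STRIDE = 32    # assumed number of records reserved per value of i


def index_address(i, j):
    """Address of the indexing-table record describing the set N_{i,j}."""
    return RECORD_BYTES * (ROW_STRIDE * i + j)


def operand_addresses(i, j, k):
    """Addresses placed in the three shadow registers for a given (i, j, k)."""
    return {
        "IJshadow": index_address(i, j),             # target set N_{i,j}
        "RHS1shadow": index_address(i, k),           # left operand N_{i,k}
        "RHS2shadow": index_address(i + k, j - k),   # right operand N_{i+k,j-k}
    }


if __name__ == "__main__":
    print(operand_addresses(1, 3, 1))
```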
Update unit
The tasks of the update unit are:
• set the flags (i.e. the 2 special bits) of each non-terminal before writing it in memory in the non-terminal table. This is done in the LHS update module;
• update the guard-vectors. Every time a new non-terminal is inserted in the non-terminal table, its corresponding bit in the guard-vector has to be set. This is done in the guard update module.
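The two tasks can be modelled in software as follows. Several details are assumptions of this sketch and are not taken from the design: the positions of the two flag bits inside the 16-bit word, the use of the guard-vector as a membership test that prevents duplicate insertions, and all names.

```python
EMPTY_FLAG = 1 << 15  # assumed position of the "table is empty" bit
LAST_FLAG = 1 << 14   # assumed position of the "last non-terminal" bit


def new_table():
    """An empty non-terminal table: a single word with only the empty flag set."""
    return [EMPTY_FLAG]


def insert_non_terminal(table, guard, code):
    """Insert a non-terminal code (< 2**14) and return the updated guard-vector.

    `table` is a list of 16-bit words; `guard` is an integer used as a bit-vector.
    """
    if (guard >> code) & 1:
        return guard  # bit already set: the non-terminal is already stored
    if table[0] & EMPTY_FLAG:
        table[0] = code | LAST_FLAG       # first entry is also the last one
    else:
        table[-1] &= ~LAST_FLAG           # previous entry is no longer last
        table.append(code | LAST_FLAG)
    return guard | (1 << code)            # guard update: set the matching bit


if __name__ == "__main__":
    words, guard_bits = new_table(), 0
    for nt in (5, 9, 5):
        guard_bits = insert_non_terminal(words, guard_bits, nt)
    print([hex(w) for w in words], bin(guard_bits))
```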
Grammar representation in memory
The CYK algorithm uses CF grammars in CNF. Thus, the first pre-processing step is to transform a given general CF grammar into an equivalent CNF CF grammar. This is done off-line, only once for each CF grammar used.
The CNF grammar is then represented by a data-structure that has to allow, for any grammar rule right-hand side (RHS) of the form $YZ$, the retrieval of (1) all non-terminals $X_i$ such that there is a rule $X_i \to YZ$ in the grammar, and (2) a code that uniquely identifies the given RHS. As the data-structure used to represent the grammar is critical for the design, it has to correspond to a good compromise between the memory space it occupies and the access time to the stored information.
Concretely, the data-structure representing the grammar is converted into a binary memory image ready to be loaded onto the FPGA board to configure the grammar memories.
As shown in figure 7, the data-structure used in our implementation is organized on 3 levels. Level 1 is a table with an entry for each distinct non-terminal present in the grammar. Such an entry contains either a NULL pointer, if there is no RHS starting with the corresponding non-terminal $Y$, or a non-NULL pointer to the root of a tree at level 2. Level 2 is a collection of binary sorted trees, each containing all distinct second-position non-terminals present in the RHSs that start with a given $Y$. Finally, in the binary trees, each node contains, in addition to the
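A software analogue of such a lookup structure is sketched below. It is not the memory layout of the design: a dictionary stands in for the level-1 table, a sorted list searched with `bisect` stands in for each level-2 binary sorted tree, and the content attached to each RHS (a code plus the list of left-hand sides) is inferred from the two retrieval requirements stated above. All names are illustrative.

```python
import bisect


def build_grammar_index(rules):
    """Index binary rules, given as (X, Y, Z) triples for X -> Y Z."""
    level1 = {}  # first RHS symbol Y -> sorted list of second symbols Z
    leaves = {}  # (Y, Z) -> (unique RHS code, list of left-hand sides X)
    for x, y, z in sorted(rules):
        if (y, z) not in leaves:
            leaves[(y, z)] = (len(leaves), [])
            bisect.insort(level1.setdefault(y, []), z)
        leaves[(y, z)][1].append(x)
    return level1, leaves


def lookup(index, y, z):
    """Return (rhs_code, [X, ...]) for the RHS 'Y Z', or None if absent."""
    level1, leaves = index
    seconds = level1.get(y)
    if seconds is None:
        return None  # plays the role of the NULL pointer at level 1
    pos = bisect.bisect_left(seconds, z)  # binary search, as in a sorted tree
    if pos == len(seconds) or seconds[pos] != z:
        return None
    return leaves[(y, z)]


# Hypothetical rules: S -> A B | A T, T -> S B, U -> A B.
RULES = [("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B"), ("U", "A", "B")]

if __name__ == "__main__":
    grammar_index = build_grammar_index(RULES)
    print(lookup(grammar_index, "A", "B"))
    print(lookup(grammar_index, "B", "B"))
```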
Scalability and extensibility properties
Scalability: when the binary memory images for the grammar and CYK memories are created, several parameters characterizing the grammar and CYK data-structures are also computed. These parameters are the numbers of bits needed to represent a non-terminal, a grammar memory address, a CYK memory address and an RHS code. They are then used to configure the VHDL code. Such a parametric approach makes it possible to scale the design, i.e. to assign to the hardware resources (registers, counters, multiplexers, etc.) bit sizes that match the grammar characteristics. Due to this scalability property, the FPGA resources can be used optimally and, for each grammar, an optimal number of processors can be fitted in the FPGA.
Extensibility: in the case where the maximum length of a sentence to be parsed requires a number of processors that does not fit on a single FPGA, the system can be extended by cascading several FPGA circuits. Therefore, the number of processors can be increased as needed and the system can be adapted to parse sentences of any length.
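The kind of parameter computation involved in scaling can be sketched as follows. The helper name is hypothetical, and the assumption that the grammar memory is byte-addressed is ours. The inputs are the SUSANNE figures reported in the performance section (10,129 non-terminals, 558,576 bytes of grammar memory).

```python
def bits_for(count):
    """Minimum number of bits giving each of `count` items a distinct code."""
    return max(1, (count - 1).bit_length())


if __name__ == "__main__":
    non_terminal_bits = bits_for(10_129)       # distinct non-terminal codes
    grammar_address_bits = bits_for(558_576)   # assuming byte addressing
    print(non_terminal_bits, grammar_address_bits)
```

Under these assumptions a non-terminal needs 14 bits, which together with the two special flag bits is consistent with the 2 bytes per non-terminal used by the current implementation.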
Design Performance
All tests and performance measurements presented in this section were made with a grammar extracted from the SUSANNE corpus, henceforth referred to as the SUSANNE grammar. In CNF, the SUSANNE grammar contains 10,129 non-terminals and 74,350 rules. The grammar memory size required to store the data-structure representing the CNF SUSANNE grammar is 558,576 bytes. The CYK memory size depends on the number of processors in the system (e.g. 446,496 bytes for a 10-processor system).
In order to determine the real maximal clock frequency at which the system is able to work, the 10-processor system shown in figure 4 was synthesized and placed and routed in a Xilinx FPGA, Virtex V1000bg560-4. The synthesized 10-processor system uses less than 35% of the FPGA resources. The system was then tested, and checked for correctness, on an RC1000-PP board with a clock frequency of 48 MHz.
Because the RC1000-PP board can accommodate only 3 grammar clusters, the hardware run-times we present were obtained by simulating the VHDL model of a system with 14 processors and 7 grammar memory clusters.
The software used for comparison is an implementation of an enhanced CYK algorithm developed in our laboratory [2]. The hardware performance (i.e. the run-time, hereafter denoted by hard) was compared against two software run-times. The former (soft1) uses the SUSANNE grammar [...] Figure 8 shows the average run-times soft1, soft2 and hard as functions of the sentence length (vertical axis). The average speedup factor $E(S)$ has been computed for both soft1 and soft2: for soft1, $E_{soft1}(S) = 69.646$, and for soft2, $E_{soft2}(S) = 10.772$. Figure 9 shows the hardware speedup in comparison with soft1 and soft2 as a function of the sentence length.
Conclusions
In this paper we presented an FPGA-based co-processor implementation of the CYK algorithm, adapted for word-lattice parsing, that can accommodate real-life CF grammars. The design interface was designed to allow an easy integration of the parser within a larger system (e.g. a speech-recognition application on a desktop computer) in which the parsing hardware would work as a co-processing unit. In addition, the experience acquired during the implementation of the system suggests the following possible extensions for our research work:
• further improvements of the design, such as a better processor control corresponding to a higher exploitation of the parallelism available in the CYK algorithm and therefore to a more efficient processor utilization. In particular, the average number of processors idle during parsing should be substantially reduced;
• further functional extensions of the design, such as stochastic, i.e. probabilistic, parsing and the ability for the design to cope with general CF grammars without requiring a preliminary transformation into CNF.
