An FPGA-based syntactic parser for large size real-life context-free grammars by Ciressan, Cristian Raul
THESIS NO. 2522 (2001)
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
PRESENTED TO THE DEPARTMENT OF COMPUTER SCIENCE
FOR THE DEGREE OF DOCTEUR ÈS SCIENCES
BY
Cristian Raul CIRESSAN
computer engineer, Polytechnic University of Timisoara, Romania
of Romanian nationality
accepted on the proposal of the jury:
Dr M. Rajman, thesis director
Dr D. Lavenier, examiner
Prof. E. Sanchez, examiner
Prof. M. Vilares Ferro, examiner
Lausanne, EPFL
2002
AN FPGA-BASED SYNTACTIC PARSER FOR LARGE SIZE
REAL-LIFE CONTEXT-FREE GRAMMARS
Cristian Raul CIRESSAN
Abridged version
The subject of this thesis lies midway between Natural Language Processing (NLP) and digital
circuit design. The aim of this thesis is the design of a coprocessor that reduces the time
needed for natural language parsing. The coprocessor must parse real-life natural language and
is designed to be useful in several NLP applications that are time-constrained or that process
large amounts of data.
More precisely, the three goals of this thesis are: (1) to propose an efficient FPGA-based
coprocessor for natural language syntactic analysis that takes inputs in the form of word
lattices, (2) to implement the coprocessor in a hardware tool ready to be integrated into a
computer and (3) to offer an interface (i.e. a library of components) between the hardware
tool and potential natural language processing software running on the computer.
FPGA (Field Programmable Gate Array) technology was chosen as the basis for the coprocessor
implementation because of its ability to efficiently exploit all levels of parallelism present
in the implemented algorithms while keeping the cost reasonable. The final reason lies in the
expectation that future general-purpose processors will contain reconfigurable resources. In
such a context, a module (IP core) implementing a context-free parser, ready to be configured
within the reconfigurable resources of the general-purpose processor, would support any
application based on context-free parsing that runs on that processor.
The context-free grammar parsing algorithms that have been implemented are the standard CYK
algorithm and one of its enhanced versions. This enhanced version of the CYK algorithm was
developed at the EPFL Artificial Intelligence Laboratory. These algorithms were selected
(1) for their intrinsic properties of regular data flow and data processing, which make them
good candidates for a hardware implementation, (2) for their capacity to produce partial parse
trees, which makes them suited to subsequent shallow parsing and (3) for their ability to
parse word lattices.
Abstract
This thesis lies at the crossroads of Natural Language Processing (NLP) and digital circuit
design. It aims to deliver a custom hardware coprocessor for accelerating natural language
parsing. The coprocessor has to parse real-life natural language and is intended to be useful in
several NLP applications that are time-constrained or need to process large amounts of data.
More precisely, the three goals of this thesis are: (1) to propose an efficient FPGA-based
coprocessor for natural language syntactic analysis that can deal with inputs in the form of
word lattices, (2) to implement the coprocessor in a hardware tool ready for integration within
an ordinary desktop computer and (3) to offer an interface (i.e. software library) between the
hardware tool and a potential natural language software application, running on the desktop
computer.
Field Programmable Gate Array (FPGA) technology was chosen as the core of
the coprocessor implementation because it can efficiently exploit all levels of parallelism
available in the implemented algorithms while remaining cost-effective. In addition, FPGA
technology makes it possible to efficiently design and test such a hardware coprocessor. A
final reason is that future general-purpose processors are expected to contain reconfigurable
resources. In such a context, an IP core implementing an efficient context-free parser, ready to
be configured within the reconfigurable resources of the general-purpose processor, would
support any application that relies on context-free parsing and runs on that general-purpose
processor.
The context-free grammar parsing algorithms that have been implemented are the standard
CYK algorithm and an enhanced version of the CYK algorithm developed at the EPFL Artificial
Intelligence Laboratory. These algorithms were selected (1) for their intrinsic properties
of regular data flow and data processing, which make them well suited to a hardware
implementation, (2) for their ability to produce partial parse trees, which makes them suitable
for subsequent shallow parsing and (3) for their ability to parse word lattices.
Acknowledgements
First of all, I would like to thank my thesis director, Dr. Martin Rajman, who knew how to
guide and encourage me whenever long-term decisions had to be made. His ability to guide
my work was priceless and the freedom he granted me during this thesis was something I truly
appreciated. I would also like to thank the members of my jury, Prof. Roger D. Hersch, Prof.
M. Vilares Ferro, Prof. Dominique Lavenier and Prof. Eduardo Sanchez for the interesting
discussions I had with them during and after the examination.
Many thanks to my colleagues for the environment they have created in the laboratory. I
enjoyed those broodwar evenings (or mornings) in the middle of the week when we struggled
for victory. Entaro Adun! I still do not understand how Cedric was able to do it on his laptop –
I always needed a large screen – nor how Romaric was always able to gather those hordes
of zergs. And a warning for those staying in the lab: do not dare to forget to invite me when
WarCraft III comes out! I only have one concern: Who’s the next PC administrator in the lab?
Who’s left? Oh, yes! Many thanks to Ionut and Steve who provided me with the two black
beasts that ran my simulations during the last 3 weeks before I had to hand in my thesis. I am
sorry I had to install Windows 2000 on them! I know Linux is now in place . . .
I would also like to mention my friends here in Lausanne who were good company during
lunch time or during those long coffee breaks, especially Mady for always being ready to help
and give advice.
Finally, I would like to mention my parents, who believed in me and gave me full freedom to
shape my studies with passion as I chose. Especially my father and my grandmother – I am still
amazed at how she knew how to sow the seeds of ambition and motivate me – who are no longer
here.
Claudia! Many thanks for your patience during the hard time I had writing this thesis. I
know I was uneasy. Many thanks for correcting the figures in this thesis so that they
look the way they do now.
Contents
1 Introduction
1.1 Natural Language Processing and parsing
1.1.1 Why Context-Free Grammars?
1.1.2 Why fast parsing?
1.1.3 Why the CYK algorithm?
1.2 The FPGA technology
1.3 Related work
1.3.1 VLSI implementations
1.3.2 Hypercube architecture mapping
1.3.3 Distributed memory systems and shared memory systems
1.4 Thesis goals
1.5 Outline
2 The (enhanced)-CYK Algorithm Adapted for Word Lattice Parsing
2.1 Context-free grammars and the Chomsky Normal Form
2.2 The CYK algorithm adapted for word lattice parsing
2.3 The enhanced-CYK algorithm adapted for word lattice parsing
2.4 Conclusions
3 The Linear Array of Processors Hardware Design of the CYK Algorithm
3.1 Why not a 2D-array of processors architecture?
3.2 General system description
3.3 Functional Description
3.4 The CYK algorithm data-structures
3.4.1 The chart data-structure memory representation
3.4.2 The CNF grammar data-structure memory representation
3.5 Design units
3.5.1 The processor
3.5.1.1 The processor’s interface to the chart memory
3.5.1.2 The processor’s interface to the grammar memory
3.5.1.3 The update unit
3.5.1.4 The synchronization/validation unit
3.5.2 The I/O controller
3.5.3 The arbiters
3.6 Performance measurements
3.7 Design testing on the RC1000-PP FPGA board
3.8 Conclusions
4 Linear Array of Processors Design Analysis
4.1 Processors utilization
4.2 Expected performance depreciation
4.3 Conclusions
5 The Dynamic Array of Processors Hardware Design
5.1 General system description
5.2 Functional description
5.3 Design units
5.3.1 The sequence generator (SEQ_GEN) unit
5.3.2 The data-dependency checking (CHECKER) unit
5.3.3 The triplets buffer (POOL) unit
5.3.4 The task dispatching (DISPATCHER) unit
5.3.5 The WRITER unit
5.3.6 The destination cells table (D-TABLE) unit
5.3.7 The processor
5.4 Performance measurements
5.5 Design testing on the RC1000-PP board
5.6 Dynamic Array of Processors Design Analysis
5.6.1 Average processor utilization
5.6.2 Expected performance depreciation
5.6.3 POOL influence on DAP design performance
5.7 Conclusions
6 The Hardware Design of the enhanced-CYK Algorithm
6.1 General system description
6.2 Functional description
6.3 The enhanced-CYK algorithm data-structures
6.3.1 The chart data-structure
6.3.2 The nplCFG grammar data-structure
6.4 Design units
6.4.1 The sequence generator (SEQ_GEN) unit
6.4.2 The data-dependency checking (CHECKER) unit
6.4.3 The destination cells table (D-TABLE) unit
6.4.4 The triplets buffer (POOL) unit
6.4.5 The task dispatching (DISPATCHER) unit
6.4.5.1 The prefetching (READER) stage
6.4.5.2 The cell-combinations tiling (TILER) stage
6.4.6 The processor
6.4.7 The WRITER and LOOKUP units
6.4.8 The compact parse trees extractor (EXTRACTOR) unit
6.4.9 The enhanced-CYK system monitor (MONITOR) unit
6.5 Performance measurements
6.6 Design analysis
6.6.1 Average processor utilization
6.6.2 Expected performance depreciation
6.6.3 Required bandwidth for transferring the compact parse forest
6.6.4 Tile size influence on the design performance
6.6.5 POOL influence on the design performance
6.7 Conclusions
7 An accelerator FPGA-board for running the enhanced-CYK algorithm
7.1 General description
7.2 Functional description
7.3 The FPGA-board
7.3.1 The PCI interface board
7.3.1.1 IOP 480 I/O processor
7.3.1.2 The serial EEPROM
7.3.1.3 The flash memory
7.3.1.4 The SDRAM memory
7.3.1.5 The Serial port
7.3.1.6 The PLX Option Module Connector
7.3.2 The FPGA expansion board
7.3.2.1 The programmable clock
7.3.2.2 The FPGA
7.3.2.3 FIFO memory
7.3.2.4 Chart memory
7.3.2.5 Grammar memories
7.3.2.6 "Glue"-logic
7.3.2.7 Display
7.3.2.8 Power supply
7.4 Conclusions
8 Conclusions
8.1 Analysis of the results
8.2 Future work
A Transformation steps towards CNF grammars
A.1 SlpToolKit organisation of SUSANNE grammar
A.2 Transforming the general SUSANNE CFG to its CNF
A.3 Transformation steps detailed
A.4 Creating and validating the memory image
B Facts about the (enhanced)-CYK algorithm
B.1 The number of cell combinations for an […] word sentence
B.2 The dynamic processor allocation method
B.2.1 Case-study on the CYK algorithm
B.2.2 Case-study on the enhanced-CYK algorithm
B.3 The chart cell size for the CYK algorithm
B.4 The chart cell size for the enhanced-CYK algorithm
C FPGA expansion board schematics
List of Figures
1.1 FPGA spatial (a) vs. general-purpose processor temporal mapping (b) for the function […]. R1 and R2 are internal registers of the general-purpose processor and the mnemonics "mul" and "add" correspond to the operations of multiplication and addition, respectively.
1.2 An example of integration of the hardware context-free parser accelerator (FPGA-board) in a speech-recognition application framework, namely a Vocal Information Server. The input to the FPGA-board is a word lattice and the output is a compact parse forest. GM stands for grammar memory.
2.1 CYK (a) chart initialization (j=1) and filling (j=2,3,4,5), (b) two possible parsing trees corresponding to the input sentence “b a d b c”
2.2 An example of initialization for a compound word ("credit card")
2.3 A toy word lattice containing 6 sentences.
2.4 (a) Word lattice of fig 2.3 represented as a speech word lattice (nodes are naturally ordered since they represent different time-instants). (b) The representation of the word lattice in the form of an initialized chart.
2.5 Enhanced-CYK (a) chart initialisation (j=1) and filling (j=2,3,4,5), (b) two possible parsing trees corresponding to the input sentence “b a d c b”
2.6 Example of word lattice initialisation for the enhanced-CYK algorithm
3.1 A […]-processor LAP design (implemented and tested on the RC1000-PP FPGA board) for parsing any sentence of length up to […] words.
3.2 CYK chart data-structure organization
3.3 The organization of an entry in the indexing table of the CYK chart data-structure.
3.4 The CNF grammar data-structure. Levels of representation.
3.5 The CNF grammar data-structure representing the toy CNF grammar in example 3.3.
3.6 The datapath of the processor used in the linear array of processors (LAP) architecture.
3.7 Grammar memory access (MAG) unit interface protocol.
3.8 The hardware used to achieve processor synchronization and for generating the stop condition.
3.9 Average run-times for […] and […], […] LAP systems as a function of sentence length. For each sentence length more than […] sentences were parsed.
3.10 Hardware speedup for the […] and […] LAP systems against (a) […] and (b) […] software as a function of sentence length. For each sentence length more than […] sentences were parsed.
3.11 RC1000-PP FPGA-board block diagram illustrating the mapping of the […]-processor LAP system.
4.1 LAP design processor activity in BLACK+GRAY when parsing a sentence of length (a) […] words, (b) […] words and (c) […] words. GRAY: represents the time spent for chart read/write accesses. BLACK: represents the time spent for data processing.
4.2 LAP design real vs. expected parsing time when parsing the sentences "a a", "a a a", "a a a a", . . . up to a similar sentence of length […]. Real parsing time is greater than the expected parsing time which illustrates the expected performance depreciation phenomenon.
4.3 LAP design processor activity for a sentence of length (a) […] and respectively (b) […] words. The larger gray area is the reason for the expected performance depreciation phenomenon.
5.1 The block diagram of an […]-processor DAP design.
5.2 A (possible) sequence of triplets […] generated when a sentence of length […] is parsed.
5.3 Another (possible) sequence of triplets, but with chronologically ordered […] source cells, generated for the same sentence length […].
5.4 The POOL unit datapath
5.5 Typical POOL operations on the register pile: read (read t2), write (write t5) and concurrent read/write (read t1, write t6)
5.6 The interface between the DISPATCHER and the processors. Triplet dispatching solution implemented in the DAP design.
5.7 The interface between the WRITER and the processors. Parsing results flushing solution implemented in the DAP design.
5.8 The WRITER unit datapath.
5.9 The D-TABLE unit circuitry used for performing a data-dependency check.
5.10 The D-TABLE unit, circuitry used for inserting a destination cell.
5.11 The D-TABLE unit, circuitry used for updating a D-TABLE entry.
5.12 The datapath of the processor used in the dynamic array of processors (DAP) architecture.
5.13 Average run-times for […] and […] DAP system as a function of sentence length. For each sentence length more than […] sentences were parsed.
5.14 Hardware speedup for the […], […], […] and […] DAP systems against (a) […] and (b) […] software as a function of sentence length. For each sentence length more than […] sentences were parsed.
5.15 DAP design processor activity in BLACK+GRAY when parsing a sentence of length (a) […] words, (b) […] words and (c) […] words. GRAY: represents the time spent for chart read accesses and for flushing processing results. BLACK: represents the time spent for processing data.
5.16 DAP design real vs. expected parsing time when parsing the sentences "a a", "a a a", "a a a a", . . . up to a similar sentence of length […]. Real parsing time is greater than the expected parsing time illustrating the expected performance depreciation phenomenon.
5.17 DAP design processor activity for a sentence of length (a) […] "a"s and respectively (b) […] "a"s. The larger gray area is the reason for the expected performance depreciation phenomenon.
5.18 The speedup for the […]-processors DAP system for several POOL sizes when compared to (a) […] and (b) […] software as a function of sentence length. For each sentence length more than […] sentences were parsed.
6.1 The block diagram of an […]-processor enhanced-CYK design.
6.2 The enhanced-CYK chart data-structure memory organization. Indexing table entry is on […] bytes and cells table entry is on […] bytes (each set […] and […] requires […] bytes).
6.3 The organization of an entry […] in the indexing table of the enhanced-CYK chart data-structure.
6.4 Example of cell-combination for the enhanced-CYK algorithm.
6.5 The data-structure used to represent a nplCFG without unitary rules.
6.6 The data-structure representing the nplCFG without unitary rules used in example 6.1.
6.7 A dual-port memory buffer used for overlapping the triplet prefetching (READER stage) with their tiling and dispatching (TILER stage)
6.8 Internal […]-bank/[…]-bank organization in the particular case when the set size is […] ([…]).
6.9 Two possible tilings for a cell-combination given by […] and […] when a tile size […] is used (left) and when a tile size […] is used (right).
6.10 The enhanced-CYK processor datapath.
6.11 The monitoring system implemented within the enhanced-CYK design.
6.12 Hardware speedup for the […], […], […] and […] enhanced-CYK systems against the software running on (a) a SUN machine and (b) a PC machine, as a function of sentence length. For each sentence length more than […] sentences were parsed.
6.13 Enhanced-CYK processor activity in BLACK+GRAY when parsing a sentence of length (a) […] words, (b) […] words and (c) […] words. GRAY: represents the time for flushing processing results. BLACK: represents the time spent for processing data.
6.14 Enhanced-CYK design real vs. expected parsing time when parsing the sentences "a a", "a a a", "a a a a", . . . up to a similar sentence of length […] with the enhanced-CYK design using a […] tile size.
6.15 Enhanced-CYK processor activity for a sentence of length (a) […] "a"s and respectively (b) […] "a"s. Note the small gray area in both cases.
6.16 Parse result output rate (left column) and output size (right column) for three possible ways of formatting the parsing results.
6.17 The speedup for a […]-processors enhanced-CYK system for several tile sizes as a function of sentence length. For each sentence length more than […] sentences were parsed.
6.18 The speedup for the […]-processors enhanced-CYK system for several POOL sizes as a function of sentence length. For each sentence length more than […] sentences were parsed.
7.1 An FPGA-board implementing a […]-processors enhanced-CYK design.
7.2 The organization of the chart memories on the FPGA expansion board. Two chart memories are used in order to overlap the parsing and the chart initialization.
A.1 Transformation steps for the SUSANNE grammar for obtaining the equivalent Chomsky Normal Form.
B.1 The theoretical (worst-case) and real (average and maximal) number of processors required to parse sentences of length […], extracted from the SUSANNE corpus, when using the CYK algorithm in a case-study on the SUSANNE grammar. A number of at least […] sentences were parsed for each value of […].
B.2 The theoretical (worst-case) and real (average and maximal) number of processors required to parse sentences of length […], extracted from the SUSANNE corpus, when using the enhanced-CYK algorithm in a case-study on the SUSANNE grammar. A number of at least […] sentences were parsed for each value of […].
B.3 Cell size distribution for the CYK algorithm in the particular case of the CNF SUSANNE grammar. Computed for a number of […] sentences with lengths between […] and […] words, extracted from the SUSANNE corpus.
B.4 (a) […] set size distribution and (b) […] set size distribution for the enhanced-CYK algorithm in the particular case of the SUSANNE grammar without unitary rules. Computed for a number of […] sentences with lengths between […] and […] words, extracted from the SUSANNE corpus.
List of Tables
3.1 The maximal allowed size for the parameters that define the CYK chart data-structure. Parameter sizes in the particular case of the CNF SUSANNE grammar.
3.2 last/empty bits meaning
3.3 The maximal size of the parameters that define the CNF grammar data-structure. Parameter sizes in the particular case of the SUSANNE grammar.
3.4 MAG I/O signals used to interface with the processor, arbiter and cluster’s grammar memory.
3.5 Parameter values used to configure the synthesized […]-processor LAP design.
3.6 Virtex XCV1000 FPGA resource utilization per LAP design unit in terms of Flip-Flops/Latches, function generators (FGs) and configurable logic blocks (CLBs).
4.1 Three sentences parsed with the LAP design for which the processor activity is represented in figure 4.1.
4.2 The LAP design average processor utilization for a set of sentences extracted from the SUSANNE corpus with lengths between […] and […] words.
5.1 The parameter values used to configure (i.e. instantiate) the synthesized […]-processors DAP system.
5.2 Virtex XCV1000 FPGA resource utilization per DAP design unit in terms of Flip-Flops/Latches (DFFs/Latches), function generators (FGs) and configurable logic blocks (CLBs).
5.3 Three sentences parsed with the DAP design for which the processor activity is given in figure 5.15. Speedup factor is given against the […] software implementation.
5.4 The average processor utilization for the DAP (respectively LAP) design, for a set of sentences extracted from the SUSANNE corpus with lengths between […] and […] words.
6.1 The maximal size of the parameters that define the enhanced-CYK chart data-structure. Parameter values in the particular case of the SUSANNE grammar.
6.2 The maximal size of the parameters that define the nplCFG (without unitary rules) grammar data-structure. Parameter sizes in the particular case of the SUSANNE grammar.
6.3 The value of parameter […] to be configured for different maximal […]/[…] set sizes and the number of sets that can be accommodated in a bank.
6.4 The parameter values used to configure (i.e. instantiate) a […]-processors enhanced-CYK system.
6.5 Virtex XCV2000 FPGA resource utilization per enhanced-CYK design unit
in terms of Flip-Flops/Latches (DFFs/Latches), function generators (FGs) and
configurable logic blocks (CLBs) for the enhanced-CYK design. . . . . . . . . 121
6.6 Three sentences parsed with the enhanced-CYK design for which the processor
activity is given in figure 6.13. Speedup factor is given against the software
implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.7 The average processor utilization for the enhanced-CYK design when using a
 tile size and respectively a 

 tile size, for a set of sentences
extracted from the SUSANNE corpus with lengths between  and  words. . . 125
6.8 Output for the enhanced-CYK design when parsing the sentence "It was a box". 130
7.1 Port addresses used to configure the dual programmable clock generator. For
details see the ICD 2051 circuit data-sheet. . . . . . . . . . . . . . . . . . . . . 143
7.2 Port addresses used to control the signals involved in the configuration of the
Virtex-E XCV2000 FPGA and for accessing the  
-bit FPGA internal general-
purpose registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.3 Routing of the IOP 480 Local bus signals and respectively the FPGA bus signals
to the L and H banks of the chart memories under the control of the 
signal. The organization of the chart memory 1 and the chart memory 2 is
illustrated in figure 7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
A.1 Grammar data-structure parameters sizes for the SUSANNE CNF grammar (GC. 157
B.1 The worst-case number of processors required for filling the chart within  
time-steps, given the sentence length , when using the dynamic allocation
method. The number in parentheses ( ) is the number of processors
required by a 2D-array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.2 The values for the theoretical (worst-case) and real (average and maximal) num-
ber of processors as illustrated in figure B.1. . . . . . . . . . . . . . . . . . . . 163
B.3 The values for the theoretical (worst-case) and real (average and maximal) num-
ber of processors as illustrated in figure B.2. . . . . . . . . . . . . . . . . . . . 164
C.1 FPGA-board Memory Map (see [23] for more details) . . . . . . . . . . . . . . 169
Chapter 1
Introduction
This thesis aims at delivering a custom hardware coprocessor for accelerating natural
language parsing. The coprocessor has to parse real-life natural languages and is useful
in Natural Language Processing (NLP) applications that are time constrained or that need
to process large amounts of data. The coprocessor is built around the Field Programmable
Gate Array (FPGA) technology, which makes it feasible to design and test such a hardware
coprocessor and to integrate it within a desktop application framework.
This introductory chapter first introduces the notions of NLP and parsing and then
discusses the types of grammars used for modelling and parsing natural languages. We give
the reasons why we chose to implement a parsing algorithm for context-free grammars and why
fast parsing is needed, the latter by means of some typical application examples. The parsing
algorithm implemented in the hardware coprocessor and its main features are then presented,
followed by a short introduction to the FPGA technology. Finally, the related work is
reviewed, the thesis goals are established and the thesis outline is provided.
1.1 Natural Language Processing and parsing
NLP is concerned with the study, design and implementation of computational machinery that
can communicate with humans by means of natural language. In general, the processing of
natural languages can be decomposed into a number of analysis stages that come in this order:
lexical and morphological, syntactic, semantic and pragmatic. This is only an abstraction, as
there are interactions between any two of these stages. For the sake of completeness we
should also mention that some languages, such as Chinese, Japanese and Thai, require a
supplementary stage, called segmentation, before the lexical and morphological stage [10].
Parsing takes place at the syntactic stage and is usually an intermediate step towards further
processing, such as the assignment of meaning to a sentence.
In this context, parsing is the process of giving syntactic structure to a sentence
according to an underlying formal grammar1. Some basic notions of parsing are given below:
• recognizer: a procedure that decides whether a sentence is syntactically correct or not
according to a given grammar;
• parser: a recognizer that can also produce the associated structural analyses (i.e. parse
trees) according to the given grammar;
1A formal grammar is a human-constructed formalism that is meant to describe languages.
• robust parser: a procedure that attempts to produce partial analyses even when the input
sentence is not (entirely) syntactically correct according to the given grammar.
The input of the parsing stage is assumed to be a sequence of words coming from the lexical
and morphological stage, and its output a data-structure (e.g. parse trees) suitable for
semantic interpretation.
1.1.1 Why Context-Free Grammars ?
In the late 1950s the linguist Noam Chomsky introduced the formal syntactic description of
languages. He defined several classes of formal grammars, and their associated languages,
structured in a hierarchy known today as the Chomsky hierarchy [14]. The context-free
grammars (CFGs) are part of this hierarchy2. In computer science the grammars introduced by
Noam Chomsky are human-constructed formalisms used to describe classes of formal languages
that range from regular through context-free and context-sensitive to recursively enumerable.
For linguists, the CFGs and the context-sensitive grammars are of particular interest as
they allow for describing the complex syntax of natural, human spoken languages.
The expressive power of context-free languages is sometimes considered too weak to deal
with certain subtle linguistic features of natural languages, and context-sensitive grammars
may be needed for dealing with such features. On the other hand, in NLP, efficiency (i.e.
parsing time) is often a requirement for many practical reasons. The worst-case parsing time
of well-known algorithms for CFGs is much lower than the parsing time for context-sensitive
grammars. Linguists are therefore often required to find a tradeoff between the expressive
power and the efficiency of the grammars used to model natural languages. Because only a few
features of natural languages require grammars as "complex" as context-sensitive grammars,
CFGs are widely used in the NLP community. Among other reasons for using CFGs for NLP we
mention:
• any grammar, however complex, can be rewritten in the form of a CFG. Thus, we only have
to rewrite a given grammar into an equivalent CFG in order to benefit from the efficient CFG
parsing algorithms;
• the time and space complexities of the state-of-the-art parsing algorithms for CFGs are
polynomial and thus better adapted to NLP applications that are real-time and/or memory
constrained.
1.1.2 Why fast parsing ?
NLP applications may be classified into three main fields: data processing, data production
and natural human-machine interfaces. For each of these application fields we give some
typical application examples that require fast parsing.
The first application field, data processing, includes automatic translation, information
retrieval and text mining, and each of these applications requires parsing for different
reasons. Due to the large amounts of data that need to be processed (i.e. parsed), fast
parsing tools are required.
2Interestingly, at about the same time, a committee defining the ALGOL programming language introduced a
programming language description formalism called Backus-Naur form (BNF) which turned out to be equivalent to
the class of CFGs of the Chomsky hierarchy.
The second application field, data production, includes optical character recognition (OCR)
systems and spell checkers. For a high-quality and reliable OCR system it is not always
enough to produce the recognized words, since syntactic errors may be introduced during this
process. For this reason high-quality OCR tools should integrate syntactic analysis in order
to rule out the wrong variants of the recognised sentences in the text. On the other hand,
the recognition time is relatively small for some commercially available products (e.g.
Recognita, Omnipage), so the time spent in syntactic analysis for ruling out the wrong
variants may be a large percentage of the overall recognition time. Thus, the parsing has to
be fast in order not to become an overhead in the overall OCR process. An automatic spell
checker has to look up a dictionary and find the best replacement for a misspelled word.
Again, syntactic analysis can be used to make sure that the replacement word is also
syntactically correct in the sentence. For the same reasons as in the case of the OCR
systems, a fast parsing tool is needed.
Finally, in the case of human-machine interfaces, state-of-the-art vocal interfaces use
standard Hidden Markov Models (HMM) that only integrate very limited syntactic knowledge,
and a better integration of syntactic processing within speech-recognition systems is an
important goal. For instance, in the case of a sequential coupling [22], the output of the
speech recognizer (often represented in a compact form called word graph or word lattice)
may be further processed with a syntactic parser to filter out the hypotheses that are not
syntactically correct. Again, due to the real-time constraints of such an application, fast
parsing is required.
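The filtering role of the parser in a sequential coupling can be pictured with a small
sketch. It assumes a toy lattice stored as a dictionary from node to outgoing (word, next
node) edges; the names `hypotheses`, `filter_hypotheses` and the `is_grammatical` predicate
are hypothetical and only stand in for a real syntactic parser. Enumerating every path, as
done here, grows exponentially with the lattice size, which is why a practical lattice
parser works on the lattice directly instead.

```python
from typing import Callable, Dict, List, Tuple

# Assumed toy representation: node -> list of (word, next_node) edges.
Lattice = Dict[int, List[Tuple[str, int]]]


def hypotheses(lattice: Lattice, start: int, end: int) -> List[List[str]]:
    """Enumerate every word sequence found on a path from start to end."""
    if start == end:
        return [[]]
    paths = []
    for word, nxt in lattice.get(start, []):
        for tail in hypotheses(lattice, nxt, end):
            paths.append([word] + tail)
    return paths


def filter_hypotheses(lattice: Lattice, start: int, end: int,
                      is_grammatical: Callable[[List[str]], bool]) -> List[List[str]]:
    """Keep only the hypotheses accepted by the (stand-in) syntactic check."""
    return [h for h in hypotheses(lattice, start, end) if is_grammatical(h)]


# Illustrative lattice with two acoustically confusable word pairs.
toy_lattice: Lattice = {
    0: [("it", 1)],
    1: [("was", 2), ("wash", 2)],
    2: [("a", 3)],
    3: [("box", 4), ("fox", 4)],
}
```

With this toy lattice there are four hypotheses; a predicate that rejects "wash" leaves two
of them.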
1.1.3 Why the CYK algorithm ?
The first context-free parsers used backtracking to search exhaustively for the syntactic
structures matching the input string. The worst-case running time of these algorithms is
exponential in the length of the input, which makes them impractical. There are two
well-known, practical parsing algorithms for CFGs: Earley's algorithm [1] and the
Cocke-Younger-Kasami (CYK) algorithm [1, 32]. Both algorithms have the same polynomial3
worst-case time complexity (i.e. $O(n^3)$, where $n$ is the length of the input string) and
are instances of dynamic programming, in which the syntactic structure is built
incrementally, with each step relying on previously performed computations.
A disadvantage of the CYK algorithm is that it requires a CFG written in a restricted
normal form (i.e. the Chomsky Normal Form, which will be introduced later). However, this
inconvenience was eliminated by the development in our laboratory of an extension of the CYK
algorithm [5, 6], referred to henceforth as the enhanced-CYK algorithm, that can deal with
almost unrestricted CFGs.
On the other hand, from the hardware point of view, both the CYK and the enhanced-CYK
algorithms are regular in terms of data movement and data processing throughout the parsing
process. These are important requirements for an algorithm to be implemented in hardware
[12, 18]. A required feature of the hardware coprocessor is its ability to parse word
lattices in order to be integrated within a speech recognition framework, and both the CYK
and the enhanced-CYK algorithms can be adapted for parsing word lattices. An intrinsic
feature of the CYK and enhanced-CYK algorithms is that they perform robust parsing. In other
words, they produce all the syntactic analyses of all the sub-sequences of words of the
input sentence. The
3Valiant [27] showed that the computation performed by the CYK algorithm is related to Boolean matrix mul-
tiplication and that it can be performed in subcubic time. The method of Valiant runs in , and there are also
more recent results for multiplying  matrices in time proportional to  [2]. However, the overhead of these
methods renders them impractical for values of  in the range of practical interest.
above feature is a requirement for performing shallow parsing. Finally, both the CYK and
the enhanced-CYK algorithms are highly parallel and the FPGA technology can easily exploit
the different levels of parallelism, i.e. fine-grained and coarse-grained, available in
these algorithms. For the reasons mentioned above we have chosen the (enhanced-)CYK
algorithm for the hardware coprocessor implementation.
1.2 The FPGA technology
The concept of programmable logic was introduced around 1960 but, due to technological
limitations, it took until 1975-1978 for the first programmable logic device (PLD) to come
to market in the form of a Programmable Array Logic (PAL) circuit. Since then, the
technology has advanced very fast through various types of devices such as PAL, PLA, GAL,
CPLD and Field Programmable Gate Arrays (FPGAs) [4, 19]. The two most successful
technologies today are the complex programmable logic devices (CPLDs) and the FPGAs. The
difference between these technologies resides in the kind of hardware primitives they offer
for system implementations, although today the distinction between the two becomes more and
more blurred. Today state-of-the-art FPGAs offer  million gates (e.g. Xilinx’s Virtex-E family
XCV3200 FPGA) compared to hundreds in the early days of this technology. FPGAs with such a
high gate density make the implementation of very complex designs feasible.
The main advantage of the FPGA/CPLD circuits is that they can be configured after man-
ufacturing. Basically, they contain three main types of configurable elements: logic cells, in-
put/output cells and interconnection resources [26]. The logic cells are used to implement
combinatorial and sequential logic functions, the input/output cells provide an interface with
the external world and the routing resources are used to route signals throughout the circuit.
Each of the configurable resources can be assigned a configuration from a set of predefined
possible configurations. For instance a logic cell may be assigned a NAND logic function from
the set {AND, NAND, OR, NOR} of possible configurations, while an input/output cell may
be configured as an output from the set {input, output, input-output}. Complex circuits can be
built by interconnecting these basic elements.
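The idea of assigning a configuration to a logic cell can be sketched as follows, assuming
the cell is modelled as a small look-up table whose stored bits are the truth table of the
chosen function. The helper names `make_lut` and `eval_lut` are hypothetical and do not
correspond to any vendor tool or device primitive.

```python
from itertools import product


def make_lut(function, n_inputs):
    """Return the configuration bits (a truth table) for an n-input logic cell."""
    return [int(bool(function(*bits))) for bits in product((0, 1), repeat=n_inputs)]


def eval_lut(config_bits, inputs):
    """Evaluate the cell: the inputs form the address into the configuration bits."""
    address = 0
    for bit in inputs:  # first input is the most significant address bit
        address = (address << 1) | bit
    return config_bits[address]


# "Configuring" the same kind of cell with two different functions.
nand2 = make_lut(lambda a, b: not (a and b), 2)
nor2 = make_lut(lambda a, b: not (a or b), 2)
```

Reconfiguring the cell amounts to writing a different list of bits; the evaluation
mechanism itself does not change.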
The FPGAs may be classified in two broad categories, namely coarse-grained and
fine-grained. The logic cells of the fine-grained FPGAs consist of simple functionally complete
logical gates (e.g. NAND) or a low-complexity universal function (e.g.  or  variables con-
trolled multiplexer). The logic cells of the coarse-grained FPGAs usually implement a logic
function of several variables (e.g. a  variables logic function, often implemented by means
of a look-up table) along with several flip-flops (e.g.   ). However, state-of-the-art FPGAs
often contain some more sophisticated primitives (previously features of CPLDs), such as
dual-port memories, full-adders or single-cycle multipliers, in order to support a broad
range of applications. There are several technologies for configuring FPGAs, and we assume
throughout this thesis that the static RAM-based technology is used. With the static
RAM-based technology, each programmable resource in the FPGA is controlled by means of a
set of bits stored in the static RAM. This technology benefits from the fact that the static
RAM memory can be reprogrammed an indefinite number of times.
In digital design, the FPGAs are an alternative to general-purpose processors and Applica-
tion Specific Integrated Circuits (ASICs) and are often referred to as "the third computational
paradigm" [28]. When compared to general-purpose processors and Digital Signal Processors
(DSPs) the FPGAs have the important advantage of being able to exploit different levels of
parallelism available in the implemented algorithm. When implementing a function, the
FPGAs are doing a spatial mapping while the processors are doing a temporal mapping. This is
illustrated in figure 1.1 for the mathematical function
$f(A,B,C) = (A \cdot A + B \cdot B) \cdot C$. While the circuit can compute the function in
one time step (the latency involved is not relevant here), the software requires seven time
steps. However, the FPGAs and the general-purpose processors are two extreme cases of a rich
architectural space and each has advantages and disadvantages when compared to the other.
Some applications are better suited for being implemented on a general-purpose processor,
while others are better suited for an implementation on an FPGA. The distinction is not
always clear, and it is often the case that some parts of a particular application are
better suited for an FPGA implementation while others are better suited for a
general-purpose processor. Concretely, when running an application, the general-purpose
processor and the FPGA should each execute those parts of the application that they are best
suited for. A mixed circuit consisting of both a general-purpose processor and an FPGA would
therefore be an elegant and effective solution [28]. On the other hand, splitting an
application into parts that will run on the general-purpose processor and parts that will
run on the FPGA is not easy and, at the time of writing, still requires human intervention.
A "smart" compiler able to do all this automatically is needed. Eventually, hardware
synthesis technology and conventional compiler technology will converge and the compiler
will manage both the reconfigurable and the fixed resources of the general-purpose
processor. Processors that contain reconfigurable resources can already be found on the
market. In the context of this trend we believe that a design of the (enhanced-)CYK
algorithm ready to be used (i.e. configured) within the reconfigurable resources of a
general-purpose processor running an NLP application that intensively uses parsing can be a
powerful tool.

[Figure 1.1 diagram: (a) a circuit with three multipliers and one adder computing f(A,B,C)
from the inputs A, B and C; (b) the instruction sequence load R1,A; mul R1,A; load R2,B;
mul R2,B; add R1,R2; mul R1,C; store R1,(addr).]
Figure 1.1: FPGA spatial (a) vs. general-purpose processor temporal mapping (b) for the
function $f(A,B,C) = (A \cdot A + B \cdot B) \cdot C$. R1 and R2 are internal registers of
the general-purpose processor and the mnemonics "mul" and "add" correspond to the operations
of multiplication and addition, respectively.
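The temporal mapping of figure 1.1 can be mimicked with a tiny interpreter that executes the
seven instructions one per time step. This is only a sketch: the function is assumed to be
(A*A + B*B)*C, as suggested by the instruction sequence in the figure, and the names
`run_temporal` and `run_spatial` are hypothetical.

```python
def run_temporal(a, b, c):
    """Execute the seven-instruction sequence, one instruction per time step."""
    mem = {"A": a, "B": b, "C": c}
    regs = {"R1": 0, "R2": 0}
    program = [
        ("load", "R1", "A"),
        ("mul", "R1", "A"),
        ("load", "R2", "B"),
        ("mul", "R2", "B"),
        ("add", "R1", "R2"),
        ("mul", "R1", "C"),
        ("store", "R1", "addr"),
    ]
    steps = 0
    for op, dst, src in program:
        # The source operand is either a register or a memory location.
        operand = regs[src] if src in regs else mem.get(src)
        if op == "load":
            regs[dst] = operand
        elif op == "mul":
            regs[dst] *= operand
        elif op == "add":
            regs[dst] += operand
        elif op == "store":
            mem[src] = regs[dst]
        steps += 1
    return mem["addr"], steps


def run_spatial(a, b, c):
    """The whole expression evaluated at once, as a combinational circuit would."""
    return (a * a + b * b) * c
```

Both functions return the same value; the difference is that the temporal version reuses one
multiplier and one adder over seven steps, whereas the spatial version corresponds to three
multipliers and one adder laid out side by side.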
On the other hand, when compared to application-specific computers, usually implemented
around one or several ASICs, FPGAs are a cost-effective alternative that offers almost the
same performance. In short, the FPGAs offer both the flexibility of programmable
general-purpose processors and the performance of ASICs in a cost-effective solution. While
not all applications benefit from an FPGA implementation, the (enhanced-)CYK algorithm, due
to its intrinsic characteristics, can benefit dramatically from one.
1.3 Related work
The various granularities and the high degree of parallelism available in the
(enhanced-)CYK algorithm allow its implementation on several processor-array architectures.
It is known that the (enhanced-)CYK algorithm has $O(n^3)$ time complexity when executed
on a sequential processor, $O(n^2)$ time complexity when executed on a 1D-array of
processors, and $O(n)$ time complexity when executed on a 2D-array of processors [17], where
$n$ is the length of the input sentence.
Several hardware implementations (e.g. Very Large Scale Integration, VLSI) and mappings of
the algorithm (e.g. on different computing platforms) have been tried in order to
investigate the potential speedup available. A survey is given in this section. Its
conclusion is that (1) the proposed VLSI implementations require too many hardware resources
when dealing with large-size real-life CFGs and (2) the proposed algorithm mappings always
require an expensive parallel machine and are limited by factors such as intensive
interprocessor communication and expensive processor allocation methods.
1.3.1 VLSI implementations
Various VLSI designs have been proposed: a syntactic recognizer based on the CYK algorithm
on a 2D-array of processors [8] and a robust (error correcting) recognizer and analyzer (with
parse tree extraction) based on the Earley algorithm on a 2D-array of processors [7]. Although
these designs meet the usual VLSI requirements (constant-time processor operations, regular
communication geometry, uniform data movement), the hardware resources they require do
not allow them to accommodate large-size real-life context-free grammars used in large-scale
NLP applications4. An analysis of such a VLSI architecture when mapped on a state-of-the-art
FPGA is given in section 3.1.
1.3.2 Hypercube architecture mapping
An implementation of a parallel CYK algorithm for recognition and parsing on an NCUBE/7
machine is given in [16]. This implementation is essentially concerned with the mapping of
a parallel CYK algorithm on a hypercube architecture and the problems related to such an
implementation (i.e. the cost of interprocessor communication, processor allocation methods).
Their benchmarks report a speed-up factor of about  in the best case. The disadvantage of
such an implementation resides in the fact that it requires an expensive hypercube machine.
We want a solution that allows us to integrate our hardware accelerator within a common
desktop computer in a cost-effective way.
4For instance the context-free grammar extracted from the SUSANNE corpus [25] contains more than  
non-terminals and   grammar rules when written in Chomsky normal form
1.3.3 Distributed memory systems and shared memory systems
Implementations of parallel CYK algorithms on distributed memory systems were tried in [3,
21]. The implementation in [21] uses a massively parallel AP1000+ computer (with 256
SuperSPARC processors at 50MHz) and the benchmark tests report a speedup factor of about 45
in the best case. Again, the disadvantage resides in the requirement of an expensive
parallel computer. The implementation presented in [3] uses as test environment a network of
computers in which each computer stores a fragment of the working data and the communication
between computers is implemented by passing messages. The conclusion of this work is that
inter-computer communication is very expensive and is the cause of poor performance.
Basically, no significant speed-up is reported.
Finally, an implementation on a shared memory system is given in [3]; it uses a
multiprocessor computer in which the processors share the same memory for storing the
working data. This implementation is again concerned with the mapping of the CYK algorithm
on a multiprocessor machine. The problems related to such an implementation are again, as in
the case of the NCUBE/7 machine, the overhead introduced by interprocessor communication
and processor allocation methods.
1.4 Thesis goals
The three goals of this thesis are: (1) to propose an efficient coprocessor that can be
integrated within a speech-recognition system (i.e. can parse word lattices5) and to
investigate various methods for accelerating natural language syntactic analysis, (2) to
integrate the coprocessor in a hardware tool ready for integration within an ordinary
desktop computer and (3) to offer an interface between the hardware tool and the natural
language applications that require parsing and run on the desktop computer.
We have chosen the FPGA technology as the core of such a coprocessor implementation due to
its ability to efficiently exploit all levels of parallelism available in the implemented
algorithms in a cost-effective solution. Another reason is that future general-purpose
processors are expected to contain reconfigurable resources. In such a context an IP core
implementing an efficient context-free parser, ready to be configured within the
reconfigurable resources of the general-purpose processor, would be a support for any
application relying on context-free parsing and running on that general-purpose processor.
The context-free grammar parsing algorithms implemented are the CYK algorithm and an
enhanced-CYK algorithm developed in the Artificial Intelligence Laboratory of EPFL. These
algorithms were selected (1) for their intrinsic properties of regular data movement and
data processing, which render them well suited for a hardware implementation, (2) for their
property of producing partial parse trees, which makes them suitable for further shallow
parsing and (3) for being able to parse word lattices.
The hardware design (i.e. the coprocessor) of a context-free parser can be integrated in
several ways within an application framework. One such integration was already presented
and relies on a processor that contains reconfigurable resources. Another typical
integration within an application framework would be as an accelerator FPGA board that has a
"generic" interface (e.g. a PCI or SCSI interface) to the outside world. An example of such
an integration is given in figure 1.2 for the particular case of a speech-recognition
system, namely a Vocal Information Server, in which the coupling between the speech
recogniser and the context-free
5A compact representation of the output produced by the acoustic modules of speech recognizers.
[Figure 1.2 diagram: speech → speech recogniser → word lattice → context-free parser (FPGA
board), containing a processor, a lexicon, a memory holding the data structure (CYK chart)
and the FPGA (coprocessor) with its grammar memories (GM) → compact parse forest → dialogue
manager.]
Figure 1.2: An example of integration of the hardware context-free parser accelerator
(FPGA-board) in a speech-recognition application framework, namely a Vocal Information
Server. The input to the FPGA-board is a word lattice and the output is a compact parse
forest. GM stands for grammar memory.
parser is sequential. The output of the speech recogniser (a word lattice) is the input of the
FPGA board and the output of the FPGA board (a compact parse forest) is the input of the
semantic module that feeds a dialogue manager.
More precisely, the processor on the accelerator FPGA board is the "glue logic" used to
convert the word lattice into a format used to initialise the coprocessor (i.e. context-free
parser) data structures by making use of a lexicon. The parsing takes place in the FPGA
coprocessor
and relies on the intensive use of the grammar memories that store the grammar rules of the
CFG in a compact format. The resulting parse forest is extracted by the processor from the
coprocessor’s data structures and forwarded to an external processing unit (e.g. a desktop PC)
for further processing.
1.5 Outline
This thesis is at the crossroads between NLP and digital circuit design and for this reason
a brief introduction to both NLP and the FPGA technology has already been given. An
introduction to the CYK and enhanced-CYK algorithms and their adaptation for parsing word
lattices is given in chapter 2. Chapters 3-6 are the milestones on the path we followed in
our design methodology.
When the project started we had little or no knowledge about the hardware implementation of
the (enhanced-)CYK algorithm. Therefore, the first step (chapter 3) of our design
methodology was to build a working hardware implementation of the CYK algorithm in order to
validate and prove its feasibility. The proposed hardware architecture is a linear array of
processors that can deal with large-size real-life Chomsky Normal Form CFGs. This hardware
architecture
was physically implemented, tested and validated on a commercial FPGA board.
The second step (chapter 4) was a thorough analysis of the linear array of processors
design, meant to bring the necessary knowledge about several aspects of the CYK algorithm:
(1) the required FPGA resources and the size of the employed data-structures, (2) the
real-life behaviour of the CYK algorithm and (3) a first evaluation of the speed-up factor
against a software implementation.
The third step (chapter 5) proposes a new hardware architecture called dynamic array of
processors that addresses several drawbacks of the linear array of processors design and
introduces, among other things, a better processor allocation method. The performance
measurements and analysis of the dynamic array of processors add more knowledge about the
real-life behaviour of the CYK algorithm, such as (1) the unbalanced processor load and (2)
the required number of processors in the design.
The fourth and last step (chapter 6) of our design methodology proposes a hardware
architecture that implements the enhanced-CYK algorithm and a method called tiling that
balances the processor load. The proposed hardware architecture reuses the useful features
of the previous designs and also addresses their drawbacks.
Chapter 7 presents an FPGA board built around a Xilinx Virtex-E V2000efg1156-6 FPGA
that runs the hardware design of the enhanced-CYK algorithm. The FPGA board has a PCI
interface and is meant to work as an accelerator board within a host computer (e.g. a PC).
In the conclusions (chapter 8), we describe some possible improvements to the current
implementation and propose future directions for the project.
Chapter 2
The (enhanced)-CYK Algorithm
Adapted for Word Lattice Parsing
This chapter first presents the CYK algorithm, which is restricted to the use of CFGs
written in Chomsky Normal Form, and its adaptation for parsing word lattices. Next, an
extension of the CYK algorithm, referred to henceforth as the enhanced-CYK algorithm, is
introduced. The enhanced-CYK algorithm is able to deal with a class of CFGs larger than the
Chomsky Normal Form grammars, called nplCFG ("non partially lexicalized context-free
grammars"). The adaptation of the enhanced-CYK algorithm to word lattice parsing is then
presented.
2.1 Context-free grammars and the Chomsky Normal Form
We are familiar with finite state automata, which are equivalent to regular languages [20].
The context-free languages are a superset of the regular languages and are described by
CFGs. A simple example illustrating the difference between regular languages and
context-free languages is the language $L = \{a^n b^n \mid n \geq 1\}$, which cannot be
generated by a finite state automaton but can be generated with a CFG (see example 2.1). The
language $L$ is commonly used to represent simple syntactic structures in programming
languages such as begin-end blocks. The reason why a finite state automaton cannot generate
$L$ is that the phenomenon of recursion illustrated by $L$ requires unbounded memory. On the
other hand, all computing machines have only a limited amount of memory and can thus be
modelled by a regular language. In this context, CFGs are used more for practical reasons
like compactness and simplicity of representation.
Formally, a CFG is a 4-tuple $G = (N, \Sigma, S, R)$ where:
• $N$ is a set of non-terminals;
• $\Sigma$ is a set of terminals;
• $S \in N$ is the top level non-terminal (corresponding to a sentence);
• $R$ is a set of grammar rules, i.e. a subset of $N \times (N \cup \Sigma)^*$, written in
the form $X \rightarrow \alpha$, where $X \in N$ and $\alpha \in (N \cup \Sigma)^*$. For
instance, a grammar rule such as $S \rightarrow NP\ V$ expresses the fact that a sentence
($S$) consists of a noun-phrase ($NP$) followed by a verb ($V$).
In the text that follows, we use capital letters $X$, $Y$, $Z$, ... to denote non-terminal
symbols and lower case letters $a$, $b$, $c$, ... to denote terminal symbols. An example of
a CFG is given below.
Example 2.1 A CFG that can generate the language $L = \{a^n b^n \mid n \geq 1\}$.
$N = \{S\}$,
$\Sigma = \{a, b\}$,
$R = \{S \to a\,S\,b,\ S \to a\,b\}$

There are several normal forms of CFGs that have some practical and theoretical interest,
and the Chomsky Normal Form (CNF) is one of them. In the formal definition of a grammar rule for
CFGs there is no restriction on the right-hand side of the grammar rule, and any succession of
non-terminals and terminals is allowed. When written in CNF, every grammar rule right-hand
side is either a succession of two non-terminals or a single terminal symbol. Formally, every
grammar rule is either of the form $X \to Y Z$ or $X \to a$.
Any general CFG can be rewritten as an equivalent CNF grammar. The rewriting, however,
increases the number of non-terminals and production rules, and for this reason the CNF
raises some practical concerns.
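The CNF restriction stated above can be checked mechanically. The following is a minimal sketch, assuming a rule is stored as a pair (left-hand side, tuple of right-hand side symbols); the function name `is_cnf` and the two toy rule sets are illustrative assumptions and do not come from the grammars used in this thesis.

```python
def is_cnf(rules, nonterminals, terminals):
    """Return True if every rule has the form X -> Y Z or X -> a."""
    for lhs, rhs in rules:
        if lhs not in nonterminals:
            return False
        # X -> Y Z: exactly two non-terminals on the right-hand side
        binary = len(rhs) == 2 and all(s in nonterminals for s in rhs)
        # X -> a: exactly one terminal on the right-hand side
        lexical = len(rhs) == 1 and rhs[0] in terminals
        if not (binary or lexical):
            return False
    return True


# Hypothetical toy grammars, for illustration only.
N = {"S", "X", "Y"}
T = {"a", "b"}
cnf_rules = [("S", ("X", "Y")), ("X", ("a",)), ("Y", ("b",))]
non_cnf_rules = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

print(is_cnf(cnf_rules, N, T))
print(is_cnf(non_cnf_rules, N, T))
```

The second rule set mixes terminals and non-terminals on a right-hand side, which is allowed in a general CFG but not in CNF.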
2.2 The CYK algorithm adapted for word lattice parsing
A description of the CYK algorithm can be found in [1, 20]. The CYK algorithm is an instance
of dynamic programming and uses a chart (i.e. a lower triangular matrix) of size $n \times n$
to store the data produced during the parsing process, where $n$ is the size of the input sequence
to be parsed. The algorithm starts with an empty chart in which only the bottom cells are
initialised according to the input sequence and continues by filling the entries of the chart in
a bottom-up fashion. Being a dynamic programming method, it fills the entries of the chart
incrementally, based on previously computed entries. The CYK algorithm is constrained to use a
grammar written in CNF.
The following terms are used in the NLP community: "sentence" for the input sequence to be
parsed and "word" for the terminals in the input sequence. Also, we will use the term "lexical"
rule to denote the rules of the form $X \to a$. In practice the lexical rules only appear in the
lexicon and not in the grammars. This separation of the rules into lexical rules (i.e. $X \to a$)
and grammar rules (i.e. $X \to Y Z$) is convenient for describing the grammars we are usually
dealing with in practice.
Assume that we have the CNF context-free grammar $G = (N, \Sigma, S, R)$ and an input sentence
$w = w_1 w_2 \ldots w_n$, $n \geq 1$, where $w_i \in \Sigma$ is the $i$-th word in the sentence. Let
$w_{i,j} = w_i w_{i+1} \ldots w_{i+j-1}$ be the part of the input sentence that starts at $w_i$ and contains
the next $j - 1$ words, and $N_{i,j}$ the subsets of $N$ defined by
$N_{i,j} = \{X \in N \mid X \Rightarrow^* w_{i,j}\}$.
The notation $X \Rightarrow^* w_{i,j}$ means that $w_{i,j}$ can be derived from $X$ by applying a succession of
grammar rules. Every set $N_{i,j}$ can be associated with the entry on column $i$ and row $j$ of the
lower triangular matrix, henceforth referred to as the chart.
The CYK algorithm is defined as follows:
Algorithm 1 The CYK algorithm
1: for $i = 1$ to $n$ do
2:   $N_{i,1} = \{X \in N \mid (X \to w_i) \in R\}$
3:   for $j = 2$ to $n - i + 1$ do
4:     $N_{i,j} = \emptyset$
5:   end for
6: end for
7: for $j = 2$ to $n$ do
8:   for $i = 1$ to $n - j + 1$ do
9:     for $k = 1$ to $j - 1$ do
10:      $N_{i,j} = N_{i,j} \cup \{X \in N \mid (X \to Y Z) \in R$ with $Y \in N_{i,k}$ and $Z \in N_{i+k,j-k}\}$
11:    end for
12:  end for
13: end for
In the algorithm given above, lines 1-6 correspond to the initialisation step, when the sets
$N_{i,1}$ are initialised by using the "lexical" rules of the form $X \to w_i$. The advantage of the
lexical rules is now apparent: they restrict the processing of the "lexicalized" rules to the
initialization step. The initialization step consists of filling each set $N_{i,1}$ in the bottom row
($j = 1$) of the chart with all the non-terminals $X$ for which there exists a lexical rule
$X \to w_i$ associated with that word. The time complexity of the initialization step is thus $O(n)$.
Lines 7-13 correspond to the subsequent filling-up of the chart once the initial sets $N_{i,1}$
have been initialized. Finally, the parsing trees can be extracted from the chart if necessary. If
$S$ is among the topmost symbols (root) of such a tree, the sentence is syntactically correct for
the grammar $G$.
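The chart-filling scheme of the CYK algorithm can be sketched in software as follows. This is only an illustrative recogniser, assuming that the chart cell on column $i$ and row $j$ holds the non-terminals deriving the $j$ words starting at word $i$ (1-based indices); the function names and the toy CNF grammar (which generates strings of the form a...ab...b) are hypothetical and are not the grammar of the examples in this chapter.

```python
from collections import defaultdict


def cyk_chart(words, lexical, binary):
    """Fill the CYK chart; chart[(i, j)] is the set of non-terminals that
    derive the j words starting at word i (1-based)."""
    n = len(words)
    chart = defaultdict(set)
    # Initialisation step: bottom row (j = 1) from the lexical rules X -> w.
    for i in range(1, n + 1):
        chart[(i, 1)] = {x for (x, w) in lexical if w == words[i - 1]}
    # Chart filling: rows j = 2..n, bottom-up.
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):  # one cell-combination per value of k
                left = chart[(i, k)]
                right = chart[(i + k, j - k)]
                for (x, y, z) in binary:  # grammar rule X -> Y Z
                    if y in left and z in right:
                        chart[(i, j)].add(x)
    return chart


def recognises(words, lexical, binary, start="S"):
    """True if the start symbol is in the topmost cell (column 1, row n)."""
    n = len(words)
    if n == 0:
        return False
    return start in cyk_chart(words, lexical, binary)[(1, n)]


# Hypothetical CNF grammar: S -> A B | A C, C -> S B, A -> a, B -> b.
LEXICAL = [("A", "a"), ("B", "b")]
BINARY = [("S", "A", "B"), ("S", "A", "C"), ("C", "S", "B")]

print(recognises("a a b b".split(), LEXICAL, BINARY))
print(recognises("a b b".split(), LEXICAL, BINARY))
```

The three nested loops make the cubic cost of the chart-filling step visible, and the innermost loop over the rules is the part that a hardware processor would perform for each cell-combination.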
The following example illustrates how the CYK algorithm works.
Example 2.2 Let us assume that the CNF grammar is given by the non-terminals
$N = \{S, X, Y, Z\}$, the terminals $\Sigma = \{a, b, c, d\}$ and a set $R$ of nine rules,
where $S$ is the top level non-terminal and $r_1, \ldots, r_9$ are used to denote the grammar rules.
With this grammar, we want to parse the sentence "b a d b c". During the initialisation
step, only the rules $r_5$ to $r_9$ are used (see figure 2.1(a)) to fill in the entries on the bottom row
of the chart. During chart filling, the rows $j = 2, 3, 4, 5$ are filled successively in this order,
making use of the grammar rules $r_1$ to $r_4$. Finally, two possible parsing trees are extracted
from the chart (see figure 2.1(b)). In this figure, the branches of the parse trees are labelled
with the grammar rules that were actually used for building them.
Figure 2.1: CYK (a) chart initialization (j=1) and filling (j=2,3,4,5), (b) two possible parsing
trees corresponding to the input sentence "b a d b c"

One important aspect of the CYK algorithm is that its initialization step can be easily
generalised to accommodate (1) compound words and (2) word lattices, useful within a speech
recognition framework for parsing the multiple sentences produced for a given acoustic input.
For doing so, the initialization step has to take into account the possibility of initializing any of
the sets $N_{i,j}$ and not only the sets $N_{i,1}$ of the bottom row. How this is dealt with in the particular
case of compound words and word lattices is explained below.
Compound words : An example of a compound word is "credit card". If the input sentence
contains this compound word at position $i$, the algorithm should initialize the
set $N_{i,1}$ for the word "credit" and the set $N_{i+1,1}$ for the word "card", as well as the
entry at position $(i, 2)$ for the compound word "credit card" (see figure 2.2). Of course, in this
example we suppose that the lexical rule $N \to$ credit card is present in the lexicon.
Figure 2.2: An example of initialization for a compound word ("credit card"). In the bottom row,
the cells of columns $i$ and $i+1$ hold {Adj, V} for "credit" and {N} for "card"; the cell above
them, which corresponds to "credit card", holds {N}.
Word lattices : An example of word lattice representation is given in figure 2.3. Each path
starting in the leftmost node and ending in the rightmost node of the word lattice corresponds to
a possible recognised sentence. The syntactic parser filters out those sentences that it
recognizes as incorrect.
When dealing with word lattices, the difference from the previously presented CYK algorithm
is that not only the sets $N_{i,1}$ may be initialised during the initialisation step, but any of
the sets $N_{i,j}$, since with a word lattice representation words may occur anywhere in the chart (see
figure 2.4(b)). Therefore, in order to adapt the CYK algorithm to word lattice parsing, the
initialisation step needs to be extended.
To illustrate this point, let us consider the same CNF grammar used in example 2.2 and
the word lattice given in figure 2.3, containing the six sentences "a a b b", "a b d b",
"a b b a", "b b a a", "b c b a" and "b c d b". First the lattice nodes are ordered
by increasing depth (see footnote 1 and figure 2.4 (a)). Such an ordering is natural in the case of
lattices produced by a speech recogniser, in which nodes correspond to (chronologically ordered)
time instants $t_1$, $t_2$, ...
This new representation of the word lattice can be mapped over the chart as follows:
1. the intervals between successive nodes are associated to the column indices of the parsing
table;
2. if the lattice arc $t_i \to t_j$ ($i < j$) is labelled $w$, then the set $N_{i,j-i}$ corresponding to
the chart entry $(i, j-i)$ is initialised with $\{X \in N \mid (X \to w) \in R\}$ (see figure 2.4 (b)).
1 A node depth is the minimal number of arcs from the initial node, with random choice in case of equalities.
Figure 2.3: A toy word lattice containing 6 sentences.
Figure 2.4: (a) Word lattice of figure 2.3 represented as a speech word lattice (nodes are naturally
ordered since they represent different time-instants). (b) The representation of the word lattice
in the form of an initialized chart.
The time complexity required for the initialization of the chart with a word lattice is $O(n^2)$,
since any entry in the chart is subject to initialization.
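The extended initialisation step can be sketched as follows. This is a hedged illustration, assuming that lattice nodes are numbered 1, 2, ... by increasing depth and that an arc from node $i$ to node $j$ labelled with a word initialises the chart cell on column $i$ and row $j - i$; the function name, the lexicon and the small lattice below are hypothetical. A compound word such as "credit card" is handled the same way, as an arc spanning two intervals.

```python
def init_chart_from_lattice(arcs, lexical):
    """Initialise a chart from a word lattice.

    arcs    -- iterable of (i, j, word) with node indices 1 <= i < j
    lexical -- iterable of (non_terminal, word) lexical rules
    Returns a dict mapping (column, row) to a set of non-terminals.
    """
    chart = {}
    for i, j, word in arcs:
        categories = {x for (x, w) in lexical if w == word}
        # An arc spanning j - i intervals lands on row j - i of column i.
        chart.setdefault((i, j - i), set()).update(categories)
    return chart


# Hypothetical lexicon and three-node lattice, for illustration only.
LEXICON = [("X", "a"), ("Z", "a"), ("Y", "b"), ("N", "credit card")]
ARCS = [(1, 2, "a"), (2, 3, "b"), (1, 3, "credit card")]

chart = init_chart_from_lattice(ARCS, LEXICON)
for cell in sorted(chart):
    print(cell, sorted(chart[cell]))
```

Since every cell of the triangular chart may receive an arc, the cost of this step grows with the number of cells rather than with the number of columns.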
2.3 The enhanced-CYK algorithm adapted for word lattice parsing
The enhanced-CYK algorithm can be seen as derived either from a bottom-up Earley [11, 29]
(with generalized items and without prediction), or from a CYK [1] (but performing dynamic
binarization of the grammar). It could also be seen as an extension of the algorithm presented
by Graham et al. [13]. A CYK-like point of view is used subsequently for the description of
the enhanced-CYK algorithm.
The enhanced-CYK algorithm works with a subclass of CFGs, consisting of "non partially
lexicalized rules", hereafter denoted as nplCFG. The restriction of the CYK algorithm to CNF
grammars is thus relaxed to the use of nplCFG, a larger subclass of CFGs. An nplCFG is a
CFG in which terminal symbols, i.e. words, only occur in rules of the form
$X \to w_1 w_2 \ldots w_n$ (see footnote 2), called lexical rules. In practice, such lexical rules do
not even appear in the representation of the grammar but are represented in the lexicon.
The restriction to the nplCFGs subclass of CFGs is not however "critical" for the algorithm.
It was introduced because it corresponds to the kind of grammars we are actually dealing with
in practice, and also because it restricts the processing of "lexicalized rules" to the initialization
step. The algorithm could however be easily extended to deal with any CFG.
The enhanced-CYK algorithm, like the CYK algorithm, uses a triangular table (hereafter
called chart) with $n(n+1)/2$ cells, where $n$ is the length of the input sentence $w_1 \ldots w_n$.
However, the elements stored in the chart are different from those used for the CYK algorithm and
are a generalization of the Earley-like items (the so called "dotted rules"). For the presentation
of the enhanced-CYK algorithm we are using the same notations introduced for the CYK algorithm.
The cell at row $j$ and column $i$ in the chart will contain two kinds of sets of items:
• items of the first set (the $N1_{i,j}$ set) represent the non-terminals $X$ that can derive $w_{i,j}$ (i.e.
that can produce $w_{i,j}$ after the application of a finite number of grammar rules). Formally
this set is a subset of $N$ defined by:
$N1_{i,j} = \{X \in N \mid X \Rightarrow^* w_{i,j}\}$
• items of the second set (the $N2_{i,j}$ set) represent the partial parsings $\alpha\bullet$ of the string $w_{i,j}$
(i.e. sequences $\alpha$ of non-terminals such that $\alpha \Rightarrow^* w_{i,j}$). Formally this set is defined by:
$N2_{i,j} = \{\alpha\bullet \mid \exists\,(X \to \alpha\beta) \in R,\ \beta \neq \varepsilon, \text{ s.t. } \alpha \Rightarrow^* w_{i,j}\}$
The items in the $N2_{i,j}$ set represent a generalization of the dotted rules used in Earley-like
parsers, since only the beginning (i.e. parsed) part of the rule is represented, independently both
of the left-hand side and of the end of the rule. This provides a much more compact representation
of dotted rules (allowed by the bottom-up nature of the parsing, during which the left-hand side
non-terminals can be ignored as long as they are not actually rewritten). With the notations
given above, the enhanced-CYK algorithm is defined as follows:
Algorithm 2 The enhanced-CYK algorithm
1: for $j = 1$ to $n$ do
2:   for $i = 1$ to $n - j + 1$ do
3:     $N1_{i,j} = \{X \in N \mid (X \to w_{i,j}) \in R\}$
4:     $N2_{i,j} = \{X\bullet \mid X \in N1_{i,j},\ \exists\,(Y \to X\beta) \in R,\ \beta \neq \varepsilon\}$
5:   end for
6: end for
7: for $j = 2$ to $n$ do
8:   for $i = 1$ to $n - j + 1$ do
9:     for $k = 1$ to $j - 1$ do
10:      $N1_{i,j} = N1_{i,j} \cup \{Y \in N \mid \alpha\bullet \in N2_{i,k},\ X \in N1_{i+k,j-k} \text{ s.t. } (Y \to \alpha X) \in R\}$
11:      $N2_{i,j} = N2_{i,j} \cup \{\alpha X\bullet \mid \alpha\bullet \in N2_{i,k},\ X \in N1_{i+k,j-k} \text{ s.t. } \exists\,(Y \to \alpha X\beta) \in R,\ \beta \neq \varepsilon\} \cup \{Y\bullet \mid Y \in N1_{i,j} \text{ s.t. } \exists\,(Z \to Y\beta) \in R,\ \beta \neq \varepsilon\}$
12:    end for
13:  end for
14: end for
Lines 1-6 of the algorithm correspond to the initialization step. As defined above, the
initialization step can deal with compound words and/or word lattices. During the initialization
step the N1 and N2 sets of each chart cell may be initialized (see lines 3-4 of the enhanced-CYK
algorithm). Precisely, for each subsequence of words $w_i w_{i+1} \ldots w_{i+j-1}$ in the input sentence,
if there is a lexical rule $X \to w_i w_{i+1} \ldots w_{i+j-1}$ in the grammar, then:
• the non-terminal(s) $X$ are added to the $N1_{i,j}$ set;
• if the grammar contains a rule $Y \to X\beta$, and $\beta$ is not empty, then the partial right-hand
side $X\bullet$ is added to the $N2_{i,j}$ set.
The time complexity of the initialization step is thus $O(n^2)$.
Lines 7-14 of the algorithm correspond to the chart filling step. As defined above, the
enhanced-CYK algorithm can only deal with CFGs without unitary rules. This restriction is
required in order to render feasible the hardware implementation of the enhanced-CYK algorithm.
The explanation is given in the technical note at the end of this chapter. For the enhanced-CYK
algorithm defined above, the following operations are applied when two source cells $(i, k)$ and
$(i+k, j-k)$ are combined into the destination cell $(i, j)$. For each pair of items
$(\alpha\bullet, X)$, where $\alpha\bullet \in N2_{i,k}$ and $X \in N1_{i+k,j-k}$, if there is a rule $Y \to \alpha X\beta$ in the
grammar, then:
• if $\beta$ is not empty, the partial right-hand side $\alpha X\bullet$ is added to the $N2_{i,j}$ set;
• if $\beta$ is empty, the non-terminal(s) $Y$ are added to the $N1_{i,j}$ set and, if there is a grammar
rule whose right-hand side starts with $Y$, then the partial right-hand side $Y\bullet$ is also
inserted in the $N2_{i,j}$ set.
As there may be a grammar rule with an empty $\beta$ and another grammar rule with a non-empty
$\beta$ simultaneously in the grammar, both operations mentioned above may take place.
2 Usually $n = 1$; $n > 1$ for compound words.
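The two kinds of item sets and the way they are combined can be sketched in software as follows. This is an illustrative sketch only, assuming that grammar rules contain no unitary rules, that a partial right-hand side is stored as a tuple of non-terminals, and that lexical rules map a non-terminal to a tuple of one or more words; the function name and the toy grammar with a ternary rule are hypothetical.

```python
from collections import defaultdict


def enhanced_cyk(words, lexical, rules):
    """Return (n1, n2): n1[(i, j)] holds non-terminals deriving the j words
    starting at word i; n2[(i, j)] holds proper rule prefixes deriving them."""
    n = len(words)
    n1 = defaultdict(set)
    n2 = defaultdict(set)
    # Proper, non-empty prefixes of right-hand sides are the valid N2 items.
    prefixes = {rhs[:m] for _, rhs in rules for m in range(1, len(rhs))}

    def add_nonterminal(cell, x):
        n1[cell].add(x)
        if (x,) in prefixes:  # some rule Y -> x beta with beta non-empty
            n2[cell].add((x,))

    # Initialisation: any cell may be initialised (compound words).
    for j in range(1, n + 1):
        for i in range(1, n - j + 2):
            span = tuple(words[i - 1:i - 1 + j])
            for x, ws in lexical:
                if tuple(ws) == span:
                    add_nonterminal((i, j), x)
    # Chart filling: combine cell (i, k) with cell (i + k, j - k).
    for j in range(2, n + 1):
        for i in range(1, n - j + 2):
            for k in range(1, j):
                for alpha in list(n2[(i, k)]):
                    for x in list(n1[(i + k, j - k)]):
                        extended = alpha + (x,)
                        if extended in prefixes:  # beta is not empty
                            n2[(i, j)].add(extended)
                        for lhs, rhs in rules:  # beta is empty
                            if rhs == extended:
                                add_nonterminal((i, j), lhs)
    return n1, n2


# Hypothetical nplCFG: S -> X Y Z, with lexical rules X -> a, Y -> b, Z -> c.
RULES = [("S", ("X", "Y", "Z"))]
LEXICAL = [("X", ("a",)), ("Y", ("b",)), ("Z", ("c",))]

n1, n2 = enhanced_cyk(["a", "b", "c"], LEXICAL, RULES)
print(sorted(n1[(1, 3)]), sorted(n2[(1, 2)]))
```

The ternary rule is used directly, without rewriting the grammar in CNF: the prefix X Y stored in the N2 set of cell (1, 2) plays the role of the intermediate non-terminal that a static binarization would introduce.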
Example 2.3 Enhanced CYK parsing. Let us consider a simple nplCFG given by the non-terminals
$N = \{S, V, W, X, Y, Z\}$, the terminals $\Sigma = \{a, b, c, d\}$ and a set $R$ of eight rules,
where $S$ is the top level non-terminal and $r_1, \ldots, r_8$ are used to denote the grammar rules. With
this grammar, we want to parse the sentence "b a d c b". During the initialization step, only the
rules $r_5$ to $r_8$ are used (see figure 2.5(a)) to fill in the entries on the bottom row ($j = 1$) of the
chart. During chart filling, the rows $j = 2, 3, 4, 5$ are filled making use of the grammar rules $r_1$
to $r_4$. Finally, two possible parse trees are extracted from the chart (see figure 2.5(b)).
Figure 2.6 illustrates the initialized chart for the enhanced-CYK algorithm for the toy word
lattice in figure 2.3.
Figure 2.5: Enhanced-CYK (a) chart initialisation (j=1) and filling (j=2,3,4,5), (b) two possible
parsing trees corresponding to the input sentence "b a d c b"
Figure 2.6: Example of word lattice initialisation for the enhanced-CYK algorithm
Technical Note:
For the enhanced-CYK algorithm as described in [6] a "self-filling" procedure is executed on
each chart cell once it is filled. It is required in order to deal with the unitary rules in the
grammar. For each chart cell $(i, j)$, it consists in taking each non-terminal $X$ in the $N1_{i,j}$ set
and, if there exists a rule of the form $Y \to X\beta$ in the grammar, two operations may take place:
• if $\beta$ is empty (i.e. the unitary rule $Y \to X$ is in the grammar), the non-terminal(s) $Y$ are
added to the $N1_{i,j}$ set;
• if $\beta$ is non-empty, the partial right-hand side $X\bullet$ is added to the $N2_{i,j}$ set.
As there may be a grammar rule with an empty $\beta$ and another grammar rule with a non-empty
$\beta$ simultaneously in the grammar, both operations mentioned above may take place. The
"self-filling" procedure is iterated until no non-terminal $X$ in the $N1_{i,j}$ set produces new
elements according to the procedure described above. The "self-filling" procedure is recursive in
nature and, if implemented in hardware, requires stacks. The "self-filling" procedure is required
only because of the unitary rules. In order to eliminate the "self-filling" procedure in a
hardware implementation, the unitary rules are eliminated in a preprocessing step.
In the particular case of the SUSANNE grammar, the grammar used is the one given in appendix A.3.
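The "self-filling" procedure described in the technical note can be sketched as a worklist closure over one chart cell. This is a minimal illustration, assuming rules are pairs (left-hand side, non-empty tuple of right-hand side non-terminals); the function name and the rule set are hypothetical. The explicit agenda plays the role of the stack that a hardware implementation would need.

```python
def self_fill(n1_cell, n2_cell, rules):
    """Close one chart cell under unitary rules.

    n1_cell -- set of non-terminals already in the cell
    n2_cell -- set of partial right-hand sides (tuples) already in the cell
    rules   -- iterable of (lhs, rhs) with rhs a non-empty tuple
    Returns the closed (n1, n2) pair as new sets.
    """
    n1, n2 = set(n1_cell), set(n2_cell)
    agenda = list(n1)  # non-terminals still to be examined
    while agenda:
        x = agenda.pop()
        for lhs, rhs in rules:
            if rhs[0] != x:
                continue
            if len(rhs) == 1:
                # Unitary rule lhs -> x: lhs derives the same words as x.
                if lhs not in n1:
                    n1.add(lhs)
                    agenda.append(lhs)
            else:
                # Rule lhs -> x beta with beta non-empty: record the prefix.
                n2.add((x,))
    return n1, n2


# Hypothetical rules with a chain of unitary rules A -> B -> C.
RULES = [("A", ("B",)), ("B", ("C",)), ("S", ("A", "D"))]

closed_n1, closed_n2 = self_fill({"C"}, set(), RULES)
print(sorted(closed_n1), sorted(closed_n2))
```

The number of iterations depends on the length of the unitary chains in the grammar, which is the data-dependent behaviour that the preprocessing step removes.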
2.4 Conclusions
This chapter presents the formalism used to describe CFGs and introduces the Chomsky Normal
Form representation of CFGs.
Next, it presents the CYK algorithm, which is restricted to the use of CFGs written in Chomsky
Normal Form, and its adaptation for parsing word lattices and compound words. The enhanced-CYK
algorithm is then introduced; it is an extension of the CYK algorithm that is able to deal with
a subclass of CFGs, larger than the Chomsky Normal Form grammars, called nplCFG ("non partially
lexicalized context-free grammars"). The adaptation of the enhanced-CYK algorithm
to word lattice parsing is then presented.
The presented algorithms (and their adaptation to word lattice parsing) will subsequently be
implemented in hardware. In a first step we will study the hardware implementation of the CYK
algorithm, due to its (relative) simplicity; then, based on the knowledge accumulated
in the first step, we will target a hardware implementation of the more complex enhanced-CYK
algorithm.
Chapter 3
The Linear Array of Processors Hardware Design of the CYK Algorithm
This chapter starts with an argument for why a 2D-array of processors architecture as presented
in [8] is not feasible when using state-of-the-art FPGAs and real-life CFGs. A linear array of
processors architecture for the CYK algorithm adapted for word lattice parsing is then proposed.
The linear array of processors architecture fills the chart in a row-by-row manner and was
the starting point for exploring the behaviour and characteristics of a hardware implementation
for the CYK algorithm when dealing with large-size real-life grammars. The design was useful
for several reasons:
1. for acquiring knowledge about the required FPGA resources, the size of the grammar
memories (storing the grammar) and the chart memory (storing the chart);
2. for getting acquainted with the real-life behaviour of the CYK algorithm;
3. for having a first evaluation of the speed-up factor over a software implementation;
The first point was the most important at the time the project started, as it was a key requirement
for such a design to be feasible. It was useful for defining the data-structures for representing
the CNF grammars and the chart in an efficient way both for reducing the size of these data-
structures and for reducing the required access time to the stored information. The second point
was required to identify the potential bottlenecks in the system in order to pinpoint the critical
regions of the design where the efforts for further improvements should be concentrated. The
third point was necessary in order to evaluate the advantage of such a hardware implementation
over a software implementation. Finally, the design was synthesised and tested for validation
on a commercial FPGA-board (RC1000-PP) .
The three points mentioned above and the implementation of the design on the commercial
FPGA board (RC1000-PP) were the first step of our design methodology, namely to prove the
feasibility, correct functionality and speedup over an equivalent software implementation of the
enhanced-CYK algorithm.
The main features of the linear array of processors design are: (1) a speed-up factor of
about  over our best software implementation of the enhanced-CYK algorithm and (2) its
ability to parse sentences with up to  words (see section 3.6) or time-stamps when dealing
with word lattices. These features make the design an interesting solution for integration within
real-life NLP applications frameworks that have strong data-size and/or real-time constraints.
3.1 Why not a 2D-array of processors architecture?
A 2D-array of processors hardware architecture best exploits the parallelism available in the
CYK algorithm. Such an architecture is proposed in [8]. The major drawback of this imple-
mentation, that makes it uninteresting in practice, resides in the large amount of resources it
requires. The amount of memory required for storing the grammars grows with the number
of non-terminals in the used CNF grammar, becoming impractical for implementation within a
state-of-the-art FPGA even for a small number of non-terminals (e.g.  non-terminals). The
following example illustrates in a concrete case-study the length of the sentence that can be
parsed as a function of the number of non-terminals in the used CNF grammar.
Example 3.1 This example assumes that a Xilinx Virtex-E XCV2000 FPGA is used for imple-
menting the 2D-array of processors architecture proposed in [8].
For a CNF grammar with  non-terminals, the FPGA has enough internal memory re-
sources (see Xilinx’s Virtex-E family manual [30]) for implementing a 2D-array design that
can parse sentences with up to  words. For a CNF grammar with  non-terminals, the
available memory resources are enough to parse sentences with up to  words and for  non-
terminals sentences with up to  words! Such a sentence length and number of non-terminals
is of no interest in real-life cases.

The memory space required for real-life grammars such as the CNF SUSANNE grammar is
large (more than  KBytes, see section 3.4.2) and cannot even be accommodated within
state-of-the-art FPGA resources. The solution in this case is to use external memories for
storing the grammars. The memory space required for the chart is also too large (see section 3.4.1)
to fit within FPGA resources, and an external memory is again required. On the other hand, if
external memories are used for storing the chart and the grammars, the number of user pins
available on a state-of-the-art FPGA becomes the limiting factor, since it limits the number of
grammar memories that can be connected to the FPGA. The following example illustrates, for the
2D-array of processors architecture proposed in [8], the length of the sentence that can be
parsed if external grammar and chart memories are used.
Example 3.2 This example assumes that a Xilinx Virtex-E XCV2000 FPGA is used for im-
plementing the 2D-array of processors architecture proposed in [8]. The CNF SUSANNE
grammar is used.
The memory space required to store the CNF SUSANNE grammar is more than  KBytes
and therefore requires  address lines. Assuming that the data transfers take place on an 
-bit
databus and that  command lines are used for interfacing with the memory – SRAM memory
is used – a total number of  pins are required for the interface between the FPGA and a
grammar memory. The XCV2000 FPGA has 
 user pins and therefore about  grammar
memories can be attached (the remaining pins are used for interfacing the chart memory and
for other purposes) which allows the 2D-array design to parse sentences of up to 	 words. If
the data transfers take place on a -bit databus, then a total number of  pins are required for
the interface between the FPGA and a grammar memory. A number of 
 grammar memories
can be attached, which allows the 2D-array to parse sentences with up to  words.

The above examples show that the 2D-array of processors presented in [8] cannot accommodate
real-life CFGs if implemented in a state-of-the-art Virtex-E XCV2000 FPGA due to the limited
available hardware resources.
3.2 General system description
Given the facts presented in the previous section we decided to design and implement a linear
array of processors – henceforth referred as LAP – that requires    processors to parse
sentences with maximum  words or word lattices with maximum    time-stamps. We
understand by a processor, a processing element whose basic task is to perform (i.e. process) a
cell-combination.
Note: A cell-combination is the operation described by line 10 of the CYK algorithm (see
page 13) for a particular value of $k$. Each cell in row $j$ of the chart is filled by performing
$j - 1$ cell-combinations.
The input to the LAP design is an initialized chart, together with the grammar lookup tables, and
the output is a compact parse forest.
During the parsing, each processor requires access to the CNF grammar for performing the
cell-combinations it is responsible for. Since real-life CNF grammars require a large amount of
storage memory, the CNF grammar data-structure is currently stored in SRAMs (Static Random
Access Memories), referred to henceforth as grammar memories.
Note: SRAMs have several advantages: they are easy to control and require a minimum of
interfacing signals with the FPGA, and state-of-the-art SRAM chips have relatively large sizes and
fast access times. A disadvantage is their high power consumption.
Ideally, each processor should have its own grammar memory, such that all processors can
process the assigned cell-combinations in parallel. However, as the number of pins available
on an FPGA is limited, only a limited number of grammar memories can be connected to the
FPGA and made available to the processors. Therefore, several processors have to share the same
grammar memory, and the processors sharing the same grammar memory form a cluster. The
way the processors are clustered around a grammar memory is arbitrary; a priori, a solution
that balances the number of run-time interprocessor collisions on the grammar memories
over all the clusters is preferable. All the processors in the system implement
a procedure for accessing the CNF grammar data-structure stored in a grammar memory.
The chart data-structure is also stored in an SRAM, referred to henceforth as the chart memory,
which is shared by all the processors in the system. Again, all the processors in the system
implement a procedure for accessing the chart data-structure stored in the chart memory in order
to fetch the sets required for performing the cell-combinations they are assigned to compute.
Each processor of the LAP design is (statically) assigned to a column of the chart and
therefore only performs cell-combinations for filling the cells on this column. The LAP design
executes line 8 of the CYK algorithm (see page 13) in parallel. Concretely, assuming that the
sentence length is $n$, for each row $j$ in the chart the columns $1, \ldots, n - j + 1$ are filled in
parallel and the cell $(i, j)$ on column $i$ is filled by the processor $P_i$. When the topmost cell of
a column is filled, the processor associated with that column becomes idle for the rest of the
parsing process. Before starting to fill the cells in a new row, all the processors should finish
processing (i.e. filling) the cell in the current row in order to guarantee that all the data required
to process the new row is available. For this reason the processors are synchronized after each
filled row (this corresponds to line 7 of the CYK algorithm, see page 13).
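The row-by-row schedule can be made concrete with a small software model. This is a hedged sketch, assuming that row $j$ activates the processors of columns $1$ to $n - j + 1$ and that each active processor performs $j - 1$ cell-combinations before the row synchronisation; memory contention and wait cycles are ignored, and the function names are hypothetical.

```python
def lap_schedule(n):
    """Yield (row j, active columns, cell-combinations per active processor)
    for a sentence of n words, rows 2..n."""
    for j in range(2, n + 1):
        active = list(range(1, n - j + 2))  # columns 1 .. n-j+1
        yield j, active, j - 1


def total_cell_combinations(n):
    """Total work summed over all processors and rows."""
    return sum(len(cols) * per_cell for _, cols, per_cell in lap_schedule(n))


def sequential_row_steps(n):
    """Cell-combination steps on the critical path when every row is filled
    in parallel and rows are synchronised one after another."""
    return sum(per_cell for _, _, per_cell in lap_schedule(n))


for row, cols, per_cell in lap_schedule(5):
    print(row, cols, per_cell)
print(total_cell_combinations(5), sequential_row_steps(5))
```

Under these assumptions the total work grows cubically with the sentence length, while the critical path grows only quadratically, which is the source of the parallel speed-up; the model also shows that higher rows keep fewer processors busy.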
The block diagram in figure 3.1 depicts a system consisting of  processors that can parse
any sentence of length less or equal to  words. The -processor system was synthesised and
physically tested (i.e. the results were validated for correctness) on a commercial RC1000-PP
FPGA board containing a Xilinx Virtex XCV1000bg560-4 FPGA.
Note: from now on we will use sentence to mean both sentence and word lattice. Depending
on the context, when speaking about sentence parsing we also mean word lattice parsing and
when speaking about sentence length we also mean the number of nodes (i.e. time-stamps) in a
word lattice.
In the figure, the elements inside the dashed line are implemented within the on-board FPGA
chip. The other elements (chart and grammar memories) are implemented in SRAM chips
available on the board. As the RC1000-PP FPGA board contains 4 SRAM chips, 1 is used
for storing the chart while the remaining 3 are used for storing identical copies of the data
structure representing the CNF grammar. The 10 processors are assigned to 3 clusters: three
processors to cluster 1, three processors to cluster 2 and the remaining four processors to
cluster 3.
3.3 Functional Description
A LAP system with  processors can only parse sentences with up to    words but not
longer. Therefore, the maximum sentence length to be parsed has to be known a priori, such
that the appropriate number of processors are integrated within the LAP design. Based on the
length of the sentence to be parsed, the module (e.g. on-board processor) that initializes the
chart memory decides whether the parsing will be done in the hardware – if the sentence length
has no more than   words – or by the software.
If a LAP design with  processors is available and the sentence we want to parse has no
more than   words, the following initializations are necessary before the parsing can start:
• grammar memories : before any sentence can be parsed, each of the grammar memories
has to be configured with the binary image of the data structure representing the CNF
grammar (for details see section 3.4.2). The initialization of the grammar memories is
done by software running on the on-board processor or on the host system. As
the grammar data structure does not change during the parsing, the grammar memories
only have to be configured once for multiple parsings with the same grammar. Only if a
different grammar has to be used do the grammar memories have to be reconfigured;
• the chart memory : the initialization of the chart memory is done by software
running on the on-board processor or on the host system. It consists of initializing certain
cells $(i, j)$ of the chart with sets of non-terminals $N_{i,j}$ along with their corresponding
guard-vectors (see section 3.4.1);
• sentence length : the global controller (IO-CTRL) is initialised before every parsing with
the length of the sentence to be parsed. Based on the sentence length, the IO-CTRL
generates the signals that activate, respectively deactivate, the processors during the parsing
process. A register is used for configuring the sentence length;
• wait cycles : because the chart memory is accessed by a relatively large number
of processors, the logic required to access this memory may have a large propagation
time (i.e. delay) and in consequence the chart memory access cycles may be long. In
order to keep the frequency of the rest of the system high, wait cycles are introduced in the
chart memory access cycles. As the number of necessary wait cycles cannot be foreseen
from the beginning (it actually depends on the way the signals are routed, the number
of processors and other factors), we use a register for configuring the number of wait
cycles. A first estimate for the number of wait cycles required to access the chart memory
is known after the design is synthesized. Within the current design a number of 1 to 8
wait cycles can be pre-configured (default value) in the VHDL code by means of a generic.
Among these initializations, the initialization of the chart memory is the most time consuming
and may represent an important share of the overall parsing time.
Figure 3.1: A 10-processor LAP design (implemented and tested on the RC1000-PP FPGA
board) for parsing any sentence of length up to  words.
Once the system has been initialized according to the procedure described above, the parsing
starts as soon as the start signal (startPARSE in figure 3.1) is activated. At this point the
processors start to fill the cells in row 2 of the chart. As soon as all the processors have
finished processing the cells in row 2 (i.e. they synchronize), they start to fill the cells
in row 3, and so on. The processors are synchronized under the control of the global
controller (IO-CTRL). Note that each time a row is filled, the processor assigned to the
rightmost column becomes idle for the rest of the parsing process. As soon as the topmost cell
of the chart is filled, the end signal (overPARSE in figure 3.1) is activated to mark the end
of the parsing process.
The parsing results are available at an output (not represented in figure 3.1) and can be
collected during runtime for building the compact parsing forest.
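The row-by-row filling described above can be sketched in software. The following Python
fragment is only an illustration under stated assumptions, not the hardware design: the
function name `cyk_by_rows`, the dictionaries `lexicon` and `rules`, and the `(row, column)`
chart keys are hypothetical names chosen for this sketch. Each pass of the outer loop plays
the role of one synchronized row, and the loop over columns stands for the processors that
work in parallel in the LAP design.

```python
def cyk_by_rows(words, lexicon, rules):
    """Row-synchronous CYK sketch.

    chart[(row, col)] is the set of non-terminals deriving the `row` words
    that start at position `col` (both 1-based, an assumption of this sketch).
    rules maps a right-hand side (Y, Z) to the set of left-hand sides.
    """
    n = len(words)
    # Row 1 corresponds to the chart initialization done before parsing starts.
    chart = {(1, col): set(lexicon.get(word, ()))
             for col, word in enumerate(words, start=1)}
    active = []  # number of busy "processors" per row, for illustration only
    for row in range(2, n + 1):
        columns = range(1, n - row + 2)   # one fewer column after each row
        active.append(len(columns))
        for col in columns:               # these run in parallel in hardware
            cell = set()
            for k in range(1, row):       # every split into two shorter spans
                for y in chart[(k, col)]:
                    for z in chart[(row - k, col + k)]:
                        cell |= rules.get((y, z), set())
            chart[(row, col)] = cell
        # The next row starts only once every cell of this row is complete.
    return chart, active
```

For a three-word input, two columns are active in row 2 and one in row 3, which mirrors the
rightmost processor becoming idle each time a row is filled.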
3.4 The CYK algorithm data-structures
3.4.1 The chart data-structure memory representation
This section discusses the chart data-structure used for the hardware implementation of the
CYK algorithm. The factors that influence the chart data-structure organization are also dis-
cussed.
Several data-structures can be used for the chart memory representation. When choosing among
them, a good compromise has to be found between: (1) the required memory space, (2) the data
access time and (3) the complexity of the data access circuit. It turns out that these
parameters are highly interdependent. For instance, the hardware for data access depends on
the data-structure chosen for the chart memory representation. The more complex the
data-structure, the more complex the hardware required to access it, which also results in
increased access times. On the other hand, a complex (i.e. compact) data-structure can
significantly reduce the memory space used for the chart representation. Before defining the
chart data-structure representation, let us see which functionalities it has to support. To
discuss these functionalities, consider that two source cells have to be combined and the
result stored in a destination cell (see the CYK algorithm at page 13). The required
functionalities are the following:
- traversal: go through all elements of a set. This is required in order to pair each
non-terminal of the set corresponding to the first source cell with each non-terminal of the
set corresponding to the second source cell.
- membership test: is an element in a set? This functionality is required for testing the
presence of a non-terminal in the set of the destination cell, so that each non-terminal is
stored only once in that set.
- insertion: insert an element in a set. This is required in order to store a non-terminal in
the set of the destination cell.
Let us take a look at a case-study of chart data-structure memory representation.
Case-study: We assume that the CNF SUSANNE grammar is used. The chart data-structure
considered in this case-study was used in early versions of the current design but proved
inefficient (footnote 1). However, it is a good starting point for discussing the issues
related to the chart data-structure memory representation.
We know that every cell in the chart contains a set of non-terminals. This set is a subset of
the set $N$ of non-terminals of the considered grammar (see the CFG definition in section 2.1).
Therefore, we can represent it as a bit-vector of length $|N|$, where $|N|$ stands for the
cardinality of $N$. Each bit of this bit-vector corresponds to a distinct non-terminal and has
value '1' if the non-terminal is present in the set, or '0' otherwise. Therefore, we need
$\lceil |N|/8 \rceil$ bytes to store a cell in the chart memory, with each byte storing 8
non-terminals.
If we consider the CNF SUSANNE grammar that has      non-terminals (see
appendix A.3), we require a number of 	 bytes to represent as a bit-vector the non-
terminals in a cell. If we allocate a power of  memory size for a cell with the above
data-structure, the cell address can be easily computed and requires simple access cir-
cuitry. More precisely, a memory size of 
 
 (e.g.  
 bytes when using
CNF SUSANNE grammar) allows to retrieve the location of a cell   in memory by
using inexpensive shift operations. With such a data-structure a chart memory size of
 Mbyte can parse any sentence with up to  words and a memory size of  Mbyte
any sentence with up to 
 words. However, with such a data-structure the traversal of a set (going through all its
elements) is not well supported. In order to combine two cells, a processor needs to fetch the
complete bit-vectors of both cells from memory in order to form all pairs of non-terminals
for grammar look-up. This takes significant time and causes many collisions between the
processors accessing the chart, due to the large number of accesses.
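The trade-off of the bit-vector representation can be made concrete with a small sketch. The
Python fragment below is an illustration only: the function names and the linear `cell_index`
are assumptions of this sketch, and the value 1000 used in the usage note is a made-up number
of non-terminals, not that of the CNF SUSANNE grammar. It shows that a power-of-two cell size
turns address computation into a shift, while enumerating the members of a cell requires
inspecting every bit.

```python
def cell_bytes(num_nonterminals):
    """Bytes needed for a bit-vector with one bit per non-terminal."""
    return (num_nonterminals + 7) // 8


def padded_cell_bytes(num_nonterminals):
    """Cell size rounded up to the next power of two."""
    size = 1
    while size < cell_bytes(num_nonterminals):
        size *= 2
    return size


def cell_address(cell_index, num_nonterminals):
    """Byte address of a cell, obtained with a single shift."""
    shift = padded_cell_bytes(num_nonterminals).bit_length() - 1
    return cell_index << shift


def members(bitvector):
    """Traversal: every bit of the vector has to be inspected."""
    return [8 * i + k
            for i, byte in enumerate(bitvector)
            for k in range(8)
            if (byte >> k) & 1]
```

With a hypothetical grammar of 1000 non-terminals, a cell needs 125 bytes and is padded to
128 bytes, so the address of the cell with index 3 is `3 << 7`. Testing or setting one bit
touches a single byte, whereas `members` always walks the whole vector, however few
non-terminals the cell contains.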
The previous case-study shows that the representation of the set of non-terminals in a cell as
a bit-vector is fast for single-bit operations, requires a reasonable amount of memory and
simple access hardware, but does not support well the traversal of a set. A list
representation of the non-terminal set in a cell is an alternative solution. However, with
lists, the membership test (the search for a particular non-terminal, needed in order to store
only distinct non-terminals in the destination cell) is not well supported. A solution is to
use both a list and a bit-vector for representing a cell in the chart. With such a
data-structure, the lists are used to sequentially access the elements in the source cells
(traversal), while the bit-vectors are used to check for the presence of non-terminals in the
list (membership test). Such a data-structure has a large redundancy, as both the bit-vector
and the list in a cell represent the same information, namely the non-terminals in that cell.
Footnote 1: It only works for CNF grammars with no more than several hundred non-terminals.
Like the bit-vectors, the lists represent subsets of the set $N$. Each list requires an amount
of memory proportional to $|N|$, more precisely $2|N|$ bytes if we assume that each
non-terminal is represented on 16 bits. In the particular case of the CNF SUSANNE grammar the memory
required is       
 byte.
On the other hand, we know that allocating for each set an amount of memory proportional to
$|N|$ would represent an important memory waste (see appendix B.3), as in practice the sets
are much smaller than $N$. Therefore, in order to reduce the size of the chart memory, we
assume that any set contains a maximum of $K$ non-terminals. During run-time, if a cell
receives more than $K$ distinct non-terminals, the hardware generates a fault signal and the
parsing stops. This would be, however, a very unlikely event for a well chosen value of $K$,
and we can thus use tables of size proportional to $K$ to store the non-terminals of the sets.
The value of $K$ depends on the considered grammar and can be found by investigating the
characteristics of the grammar.
For the CNF SUSANNE grammar (see section B.3) we have a value of 
   (much smaller
when compared to     ), which requires a chart memory size of  Mbyte for parsing
any sentence with up to  words and a memory size of  Mbyte for a 
 words sentence.
To understand how the traversal, membership-test and insertion functionalities are
implemented, the chart data-structure organization is given in figure 3.2. The chart memory
storing the chart data-structure is word-addressed and, in order to efficiently access the
data stored in the chart memory, the content of the data-structure is aligned to word
boundaries. The chart data-structure is organized in memory as three distinct tables and is
defined by some parameters whose values depend on the used CNF grammar. These parameters,
their maximal allowed size and their actual size in the particular case of the CNF SUSANNE
grammar are tabulated in table 3.1. A CNF grammar for which any of these parameters requires a
size larger than its corresponding maximal allowed size cannot be accommodated within the
chart data-structure. These parameters are needed to access the chart data-structure
correctly: (1) by the software used to initialise the chart data-structure and (2) for
configuring, through generics, the VHDL code used to synthesize the design.

Figure 3.2: CYK chart data-structure organization.
[Figure: three tables. For each cell (1,1), (2,1), ..., (1,n), an 8-byte entry of the
indexing table holds Phead, Pguard and Dtail. Phead points to the entry of the cell in the
non-terminals table (2*K bytes), where each stored non-terminal carries the bits last and
empty and the unused slots are free. Pguard points to the entry of the cell in the
guard-vectors table (|N| bits).]

parameter    | maximal allowed size | actual size | meaning
|N|          |                      |             | total # of non-terminals
NT_SIZE      |  [bits]              |  [bits]     | size of a non-terminal
CYK_ADR_SIZE |  -DTAIL_SIZE [bits]  |  [bits]     | size of a chart memory pointer
DTAIL_SIZE   |  [bits]              |  [bits]     | bits used to represent the constant K
Table 3.1: The maximal allowed size for the parameters that define the CYK chart
data-structure. Parameter sizes in the particular case of the CNF SUSANNE grammar.
Figure 3.3: The organization of an entry in the indexing table of the CYK chart data-structure.
[Figure: the entry of cell (i,j) consists of two 32-bit words (bits 31 to 0) holding Phead,
the pointer to the non-terminals list (CYK_ADR_SIZE bits), Pguard, the pointer to the
guard-vector (CYK_ADR_SIZE bits), and Dtail, the number of non-terminals in the list
(DTAIL_SIZE bits).]
The first table in figure 3.2, called the indexing table, contains for each chart cell $(i,j)$
an entry that allows to retrieve: (1) a pointer Phead to the memory location where the list
representation of the set of the cell is stored, (2) a pointer Pguard to the memory location
where the guard-vector of the set is stored and (3) a value Dtail, taking values from 0 to
$K$, representing the number of elements in the set. Each entry in the indexing table is
stored on 8 bytes and is organized as illustrated in figure 3.3. The physical address of the
entry in the indexing table where the information about the cell $(i,j)$ resides is built by
concatenating the binary representations of $i$ and $j$ and performing a left-shift by 3
positions (8-byte aligned). The
indexing table is 
 KBytes for parsing sentences with up to  words or 
 KBytes for parsing
sentences with up to 
 words.
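The address computation for an indexing-table entry can be sketched as follows. This is a
minimal illustration assuming 8-byte entries as shown in figure 3.2; the function name and
the parameter `coord_bits` (the number of bits used for each of the two cell coordinates) are
hypothetical and not taken from the design.

```python
def index_entry_address(i, j, coord_bits):
    """Byte address of the indexing-table entry of cell (i, j).

    The binary representations of i and j are concatenated, then shifted
    left by 3 positions because each entry occupies 8 bytes.
    """
    limit = 1 << coord_bits
    if not (0 <= i < limit and 0 <= j < limit):
        raise ValueError("cell coordinate does not fit in coord_bits bits")
    return ((i << coord_bits) | j) << 3
```

Because the address is obtained by concatenation and a shift, no multiplier or adder is
needed; the price is that the table is sized for the full range of both coordinates, whether
or not every combination corresponds to a chart cell.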
The second table, called the non-terminals table, contains for each chart cell an entry
storing the list representation of the set of that cell. As a non-terminal is stored on 2
bytes, each entry in the non-terminals table requires $2K$ bytes. In the particular case of
the CNF SUSANNE grammar the size of an entry in the non-terminals table is  bytes and the size
of the non-terminals table is  KBytes for parsing sentences with up to  words or  MBytes
for parsing sentences with up to  words.
The third table, called the guard-vectors table, contains for each chart cell an entry
storing the guard-vector associated with the set of that cell. An entry in the guard-vectors
table holds one bit per non-terminal, i.e. $|N|$ bits (see figure 3.2). In the particular case
of the CNF SUSANNE grammar the size of an entry in the guard-vectors table is  bytes and the
size of the guard-vectors table is  KByte for parsing sentences with up to  words and
 MBytes for parsing sentences with up to  words.

last  empty  meaning
0     0      the list is not empty and the non-terminal is not the last
0     1      the list is empty
1     0      the list is not empty and the non-terminal is the last
1     1      not used
Table 3.2: last/empty bits meaning
Let us now look at how the traversal, membership-test and insertion functionalities are
implemented. For the implementation of the traversal, the Phead pointer is used as a base
address that points to the first non-terminal in the non-terminals table. A displacement,
i.e. an index, is used for going through all the non-terminals in the non-terminals table.
The addition of the base memory address and the displacement gives the physical memory
location of the addressed item. The unit implementing the traversal needs to know whether the
table is empty or not and, in the latter case, which non-terminal is the last in the table.
This is implemented by means of two special bits attached to each non-terminal in the table.
One bit is set when the table is empty, the other when the non-terminal is the last in the
table (see figure 3.2 and table 3.2), and each time a non-terminal is fetched the last and
empty bits are checked. As 2 bits out of the 16 representing the non-terminal are used for
special purposes, only 14 are used to code the non-terminal, and for this reason only
grammars having less than $2^{14} = 16384$ non-terminals can be considered.
The membership test would be highly time consuming if we searched for the non-terminal over
the whole set every time. To have an efficient implementation of this function, a
guard-vector (i.e. bit-vector) is used. If the bit associated with a non-terminal is set in
the guard-vector, then the non-terminal is already in the set. This gives a constant
answering time for the membership test, as only a few chart memory accesses are needed to
check the bit associated with the non-terminal in the cell. For the implementation of the
membership test, the Pguard pointer is used as a base address that points to the beginning of
the guard-vector (organized in words). The displacement for addressing the bit associated
with a non-terminal is given by the binary representation of the non-terminal. The physical
memory address where this bit is located is obtained by adding the base address and the
displacement.
Finally, for the implementation of the insertion, Phead is used to point to the beginning of
the non-terminals table and Dtail to index its last non-terminal. The physical memory address
where the next non-terminal is to be stored is obtained by adding the base address Phead to
the displacement Dtail. Each time a non-terminal is stored in memory, the displacement Dtail
is incremented to reflect the new set size. If, during the parsing, a set would receive more
than $K$ non-terminals, a fault signal is raised to signal that the set has too many
non-terminals. In this case the parsing stops, and a signal should tell the host computer (or
local microprocessor) that the current parse could not finish and that the software has to
redo it.
When parsing sentences, Dtail is always 0 at initialization and is not used, since the set of
every destination cell is empty when the parsing starts. However, for word lattice parsing
the sets of the destination cells are not necessarily empty when the parsing starts, and
Dtail is necessary, as new non-terminals have to be inserted at the end of the non-terminals
table, at the location Phead + Dtail.
3.4.2 The CNF grammar data-structure memory representation
This section discusses the CNF grammar data-structure used in the hardware implementation
of the CYK algorithm. The factors that influence the selection of the grammar data-structure
are also discussed.
The grammar memories are used for storing the CNF grammar data-structure and are accessed by
the processors during the parsing for grammar rule look-up. Given that the data-structures of
real-life CNF grammars require a large amount of storage, SRAM chips are used for this
purpose. SRAMs have several advantages: (1) they are easy to control, (2) state-of-the-art
chips have relatively large memory sizes, (3) they have fast access times and (4) they
require a minimum of interfacing signals with the FPGAs. The disadvantage is the high power
(i.e. current) consumption. Before being used, the grammar memories have to be loaded with
the CNF grammar data-structure. The content of the grammar memories can be changed whenever a
new CNF grammar is to be used. If there are more processors in the design than available
grammar memories, it is possible to cluster several processors around each grammar memory.
When clusters are used, the accesses of the processors to the cluster's grammar memory have
to be arbitrated.
Precisely, the purpose of a grammar memory is to allow a processor to retrieve, for a given
rule right-hand side $YZ$:
1. all non-terminals $X_i$ for which there is a rule $X_i \to YZ$ in the CNF grammar;
2. a code RHScode that uniquely identifies the given rule right-hand side.
As the processors heavily rely on CNF grammar look-up during parsing, an important at-
tribute of the grammar lookup unit is to allow a fast access to the CNF grammar data-structure.
However, a simple (i.e. fast) hardware for accessing the grammar memory may require a large
memory space. On the other hand, a compact data-structure, requiring a small memory space,
also requires complex (i.e. slow) hardware for accessing this data-structure. Therefore, a com-
promise has to be found between (1) the memory space required for the CNF grammar repre-
sentation, (2) access circuit complexity and (3) data access time.
Let’s start looking for such a compromise in a case-study on the CNF SUSANNE grammar.
We consider a data-structure that allows a very fast access to the stored information.
Case-study: In CNF every grammar rule (footnote 2) is of the form $X_i \to YZ$. The rule
right-hand side $YZ$ can be any pair of non-terminals, and for each such rule right-hand side
we can have any set of non-terminals $X_i$ in the left-hand side.
A data-structure that allows fast retrieval of all the left-hand side non-terminals $X_i$
corresponding to a given rule right-hand side $YZ$ would be a table indexed by the rule
right-hand side and storing in each table entry (of size proportional to $|N|$) a set of
non-terminals (i.e. the left-hand sides). The advantage of such a data-structure is that the
physical memory address for retrieving the set of left-hand side non-terminals can be
constructed from the binary representations of the two non-terminals (i.e. $Y$ and $Z$) in
the rule right-hand side.
Concretely, for the SUSANNE CNF grammar we have      non-terminals
and each non-terminal is stored on  bytes. For having an easy way of computing the
physical memory address to the set of left-hand side non-terminals the amount of memory
required for a set is 
 . The physical memory address to a set is then computed
by concatenating the binary representations of $Y$ and $Z$. There are $|N|^2$ such sets stored
in the table, and a simple calculation shows that the amount of memory required for storing
such a data-structure (in the particular case of the SUSANNE CNF grammar) is  bytes, which
is too large to be practical.
Footnote 2: The lexical rules are stored in a lexicon and are only used for initializing the
chart memory.
Figure 3.4: The CNF grammar data-structure. Levels of representation.
[Figure: LEVEL 1 is a table of 4-byte entries, indexed from 0 to |N|-1 by the non-terminal Y,
each pointing to a sorted binary tree at LEVEL 2. A LEVEL 2 node holds a non-terminal Z
(NT_SIZE bits) and the pointers ptr_table, ptr_left and ptr_right (PTR_SIZE bits each), and
takes 4, 8 or 12 bytes depending on the sum NT_SIZE + 3*PTR_SIZE. LEVEL 3 holds, for each
right-hand side, a table with an RHScode (RULE_SIZE bits) followed by the left-hand sides
X1, X2, ..., Xn (NT_SIZE bits each).]
An improvement to the above data-structure would be to have in the table instead of
sets of non-terminals, lists of non-terminals and a mechanism for computing the physical
memory address pointing to these lists. Such a mechanism can be a table indexed by the
binary representations of  and  (like above) for retrieving a pointer to the associated
list. If such a pointer is represented on  bytes the amount of memory required to store
the data-structure is about  bytes. The improvement comes from the fact that the sets
of left-hand sides represented as lists of variable length required – in the particular case
of the SUSANNE CNF grammar – only the insignificant amount of memory, namely
 	 bytes. Unfortunately, even the  bytes amount of memory is impractical.
The first data-structure in the above case-study is characterised by a very fast access time
to the stored information. However, the memory space required is too large to be practical.
The second data-structure, which is a refinement of the first, is a better solution, but it
is still unacceptable due to the required memory space.
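The order of magnitude of the two directly indexed tables can be estimated with simple
arithmetic. The Python sketch below is only an illustration: the function names are
hypothetical, the default sizes of 2 bytes per non-terminal and 4 bytes per pointer are
assumptions of this sketch, and any rounding of the entry size is ignored.

```python
def direct_table_bytes(num_nonterminals, nt_bytes=2):
    """Table indexed by (Y, Z) whose entries can each hold up to |N| left-hand sides."""
    entries = num_nonterminals ** 2
    return entries * num_nonterminals * nt_bytes


def pointer_table_bytes(num_nonterminals, ptr_bytes=4):
    """Table indexed by (Y, Z) whose entries hold only a pointer to a list."""
    entries = num_nonterminals ** 2
    return entries * ptr_bytes
```

For a made-up grammar of 1000 non-terminals (not the SUSANNE figure), the first table would
need 2 000 000 000 bytes and the second 4 000 000 bytes plus the lists themselves. The first
grows with the cube of the number of non-terminals and the second with its square, which is
why both become impractical for grammars with many non-terminals.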
The CNF grammar data-structure we propose is depicted in figure 3.4 and is organized on 3
levels. The grammar memories storing the CNF grammar data-structure are word-addressed, and
therefore, for efficient access to the stored data, the proposed data-structure is aligned to
word boundaries. The CNF grammar data-structure is defined by some parameters whose values
depend on the used CNF grammar. These parameters, their maximal allowed size and their actual
size in the particular case of the CNF SUSANNE grammar are tabulated in table 3.3 (for more
details regarding these values see appendix A.4). A CNF grammar for which any of these
parameters requires a size larger than its maximal allowed size cannot be accommodated within
the proposed CNF grammar data-structure.

parameter | maximal size [bit] | SUSANNE size | meaning
NT_SIZE   |                    |              | bits for a non-terminal
PTR_SIZE  |                    |              | bits for a pointer (byte address)
RULE_SIZE |                    |              | bits for a rule right-hand side
Table 3.3: The maximal size of the parameters that define the CNF grammar data-structure.
Parameter sizes in the particular case of the SUSANNE grammar.

Level 1 is a table with an entry for each distinct non-terminal present in the grammar. Such
an entry contains either a NULL pointer, if there is no right-hand side starting with the
corresponding non-terminal $Y$, or a non-NULL pointer pointing to the root of a sorted
binary-tree at level 2. The number of bits required to represent a pointer is given by the
parameter PTR_SIZE (see figure 3.4 and table 3.3). Each entry in this table has a fixed size
of 4 bytes, even if fewer would be enough to represent a pointer given the final size of the
required grammar memory. This allows the index into the table to be constructed from the
binary representation of $Y$. Precisely, the binary representation of the non-terminal $Y$ is
shifted left by 2 positions to obtain the index in the table. In the
particular case of the CNF SUSANNE grammar, there are   non-terminals and therefore
the level 1 table has a size of   bytes.
Level 2 is a collection of sorted binary-trees. Each sorted binary-tree corresponds to a
certain $Y$ and contains in its nodes all the non-terminals $Z$ for which there is a rule
having a right-hand side of the form $YZ$. The fields of a node in a sorted binary tree are:
- the node non-terminal: a non-terminal $Z$;
- ptr_table: pointer to a table that contains a list of all left-hand sides of the rules
having $YZ$ as right-hand side;
- ptr_left: pointer to the left subtree;
- ptr_right: pointer to the right subtree.
The number of bits required to represent a non-terminal $Z$ is given by the parameter NT_SIZE,
and the number of bits required to represent the pointers ptr_table, ptr_left and ptr_right by
the parameter PTR_SIZE. Each node in a sorted binary tree is represented on 4, 8 or 12 bytes,
depending on the characteristics of the used CNF grammar. For instance, in the particular case
of the CNF SUSANNE grammar the size of a node is NT_SIZE + 3*PTR_SIZE =  [bits] and therefore
 bytes are used. The amount of memory required for level 2 is in this case  
 bytes.
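The choice between 4, 8 and 12 bytes for a node follows from the sum NT_SIZE + 3*PTR_SIZE
(see figure 3.4). A minimal sketch of this rounding is given below; the function name is
hypothetical, and rounding up to whole 4-byte units is an assumption made here to reproduce
the three sizes quoted above. The parameter values in the usage note are invented for
illustration and are not those of the CNF SUSANNE grammar.

```python
def node_bytes(nt_size, ptr_size):
    """Storage for one level 2 node, given the field widths in bits."""
    bits = nt_size + 3 * ptr_size      # one non-terminal and three pointers
    words = (bits + 31) // 32          # round up to whole 4-byte units
    if words > 3:
        raise ValueError("node does not fit in 12 bytes")
    return 4 * words
```

With these assumptions, fields of 14 and 6 bits give a 4-byte node, 16 and 16 bits give
8 bytes, and 16 and 24 bits give 12 bytes.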
Searching for a certain $Z$ in the sorted binary-tree is done in the traditional way. We take
the current node (starting with the root) and check whether the non-terminal stored in the
node is equal to $Z$; if this is the case, we have found the node. If the non-terminal in the
node is greater than $Z$, we continue searching in the left subtree pointed to by ptr_left,
otherwise in the right subtree pointed to by ptr_right. If we have to search the left subtree
and ptr_left is NULL, or we have to search the right subtree and ptr_right is NULL, we stop
because the right-hand side we are looking up does not exist in the grammar.

Figure 3.5: The CNF grammar data-structure representing the toy CNF grammar in example 3.3.
[Figure: the level 1 table (entries 0 to 5, two of them NULL), the level 2 sorted binary
trees, and the level 3 tables for the right-hand sides AD, AE, BC, BD, BE, DE, EA, EB and ED.]

If we have found a node whose non-terminal equals $Z$, we go to level 3 of the data-structure,
to a table storing the required information. This table is pointed to by ptr_table and starts
with a header (always stored on 4 bytes) containing a code RHScode that uniquely identifies
the rule right-hand side, followed by the non-terminals $X_1$, $X_2$, ... corresponding to
this rule right-hand side, in no special order. The last non-terminal in the list is marked by
a flag. Recall that a non-terminal is represented in the chart on 14 bits (stored on 2 bytes),
and therefore one of the remaining two bits can be used for the flag that marks the last
non-terminal in the list. Together with a non-terminal, the RHScode uniquely identifies a
grammar rule. The size in bits of RHScode is given by the parameter RULE_SIZE and that of the
non-terminals $X_1$, $X_2$, ... by the parameter NT_SIZE (see figure 3.4 and table 3.3). In
the particular case of the CNF SUSANNE grammar, the level
3 table has a size of   bytes.
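The three-level look-up can be sketched with ordinary objects instead of a byte image. The
Python fragment below is an illustration under stated assumptions: the names `Node`,
`build_grammar` and `lookup` are hypothetical, non-terminals are assumed to be small integers,
`None` plays the role of the NULL pointer, and the trees are built by plain insertion without
any balancing.

```python
class Node:
    """One level 2 node: a non-terminal Z, its level 3 table and two subtrees."""

    def __init__(self, nt, table):
        self.nt = nt
        self.table = table   # level 3: (rhs_code, [X1, X2, ...])
        self.left = None
        self.right = None


def build_grammar(rules, num_nonterminals):
    """rules maps a right-hand side (Y, Z) to the list of its left-hand sides."""
    level1 = [None] * num_nonterminals          # level 1: indexed directly by Y
    for rhs_code, ((y, z), lhs) in enumerate(sorted(rules.items())):
        new = Node(z, (rhs_code, list(lhs)))
        node = level1[y]
        if node is None:
            level1[y] = new
            continue
        while True:                             # descend to a free position
            branch = "left" if z < node.nt else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, new)
                break
            node = child
    return level1


def lookup(level1, y, z):
    """Return (rhs_code, left-hand sides) for the right-hand side Y Z, or None."""
    node = level1[y]
    while node is not None:
        if node.nt == z:
            return node.table
        node = node.left if node.nt > z else node.right
    return None
```

A look-up costs one indexed access at level 1 followed by a tree descent whose length depends
on how many right-hand sides start with the same non-terminal and on the shape of the tree.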
To better understand the CNF grammar data-structure, an example is given below for a toy CNF
grammar.
Example 3.3 The considered CNF grammar is given by:
  '( 	,
     )     	,
      '    (   
  '    '    ( 
  (    (    ' 
'  (    ('    '( 	
The data-structure representing the above CNF grammar is illustrated in figure 3.5. Each node
in level 2 is represented on  bytes and the amount of memory required to store this data-
structure is  bytes.

Some FPGAs of the Xilinx Virtex family (e.g. XCV1000, XCV2000) have enough memory
resources for storing the data-structure representing the toy CNF grammar considered in the
example above. In this particular case SRAMs are not required and can be replaced by such
internal memory resources.
For the CNF SUSANNE grammar the memory size required for this representation is of

 	 bytes. For details regarding the creation of the binary memory image of the CNF
grammar refer to appendix A.4.
3.5 Design units
3.5.1 The processor
The datapath of a processor used in the linear array of processors architecture is depicted in
figure 3.6. Within this architecture, the processors work synchronously on the same clock
signal and are supervised by the global controller IO-CTRL (see figure 3.1). Their main task
is to compute (i.e. fill) new cells in the chart based on previously computed cells. The new
cells are obtained by performing a series of cell-combinations. To describe how a
cell-combination is performed, assume that two source cells are to be combined into a
destination cell. A cell-combination is performed as follows: the processor pairs each
non-terminal of the first source cell with each non-terminal of the second source cell (the
order is important) and checks whether a grammar rule having this pair as right-hand side
exists in the CNF grammar. If such a right-hand side exists, all the left-hand side
non-terminals of the rules having this right-hand side are added to the destination cell in
the chart, such that every distinct non-terminal in the destination set occurs only once.
This is particularly important because a processor performs several cell-combinations in
order to fill a destination cell: a given non-terminal may be produced several times, but it
is stored only once in the destination cell. The procedure continues until all the
non-terminals in the two source sets have been paired.
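A single cell-combination, as described above, can be sketched as follows. This Python
function is an illustration only: the name `combine_cells` and the representation of the
grammar as a dictionary from right-hand side pairs to left-hand sides are assumptions of this
sketch, and a Python set stands in for the guard-vector.

```python
def combine_cells(left, right, rules, dest):
    """Pair every non-terminal of `left` with every non-terminal of `right`.

    rules maps an ordered pair (Y, Z) to the left-hand sides of the rules
    with that right-hand side. New left-hand sides are appended to `dest`,
    each distinct non-terminal being stored only once.
    """
    seen = set(dest)                      # plays the role of the guard-vector
    for y in left:
        for z in right:                   # the order (y, z) matters
            for x in rules.get((y, z), ()):
                if x not in seen:
                    seen.add(x)
                    dest.append(x)
    return dest
```

Calling the function several times with the same `dest` mirrors the way a processor fills one
destination cell through a series of cell-combinations.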
Note: The fact that each distinct non-terminal is stored once in a cell is strictly related to
the set representation of a chart cell. Obviously, for building the compact parse forest all
the occurrences of a non-terminal in a cell are relevant. However, as shown in [15], it is
enough to store the unique occurrences of the non-terminals in the chart cells while shipping
on-line (i.e. during the parsing) the information necessary to rebuild the compact forest.
Precisely, the procedure described above corresponds to line  in the CYK algorithm (see
page 13) where the cell  corresponds to set 

, the cell  to set 

and the cell '
to set 

. Also, during iteration  (    +), where + (  +    ) is the length of the
input sentence, all processors 

(    +    ) work in parallel, and the processor 

is constructing the set 

. At the end of iteration  the processors wait for a synchronisation
signal raised by the IO-CTRL before beginning the next iteration.
In order to achieve synchronization, the processors use the synchronization&validation unit
(see figure 3.1). How this unit is used to achieve processor synchronization is described in
detail in section 3.5.1.4.
To retrieve the non-terminals from the source cells, and to write back the resulting
non-terminals to the destination cell, the processors need to access the chart data-structure
stored in the chart memory. The access to the cells in the chart is implemented with the chart
memory addressing unit (see figure 3.1), which is described in section 3.5.1.1. Moreover, the
information stored in the current destination cell is updated with the update unit, which is
described in section 3.5.1.3. For checking the existence of a grammar rule, the processors
have to access the data-structure representing the CNF grammar, stored in the cluster's local
grammar memory. This functionality is implemented within the grammar look-up unit and is
described in detail in section 3.5.1.2.
[Figure 3.6: block diagram. Labels recovered from the drawing: chart memory addressing unit, grammar look-up unit, update unit, synchronisation & validation unit, grammar memory access module (MAG), LHS update module, guard update module; registers LREG1, LREG2, RREG1, RREG2, ReadData, WriteData, RHS1, RHS2, LHS, RHS1base, RHS2base, Dbase, Guardbase, RHS1Index, RHS2Index, DIndex, RHS1shadow, RHS2shadow, Dshadow, tmpBase, tmpIndex, tmp shadow; signals sin_i, sin_{i+1}, freeze_i, freeze_{i+1}; buses addressCHART, dataCHART, addressGRAMMAR, dataGRAMMAR.]
Figure 3.6: The datapath of the processor used in the linear array of processors (LAP) architecture.
3.5.1.1 The processor’s interface to the chart memory
The chart memory stores the chart and is shared for read and write by all processors in the
system. Concurrent accesses to this memory are resolved by the chart memory arbiter. The
task of this interface is to allow the processors to access, for read or write, the elements in a
cell. The chart memory is an SRAM and, as depicted in figure 3.6, the interface uses an address
bus (addressCHART), a data bus (dataCHART) and three command lines (not illustrated).
Before each parsing the chart memory is initialized. The initialization of the chart memory
includes: the initialization of the indexing table, non-terminals table and guard-vectors table
(see section 3.4.1). It is done by some software running on the on-board processor or on the
host system.
In the case of the indexing table, the Phead and Pguard pointers are initialized only once
before any parsing starts and do not change afterwards, as the non-terminals and
guard-vectors tables do not move in memory. When parsing sentences, all Dtail values in the
chart are set to 0. When parsing word lattices, the Dtail indexes are initialized to
reflect how many non-terminals are already present in each chart cell.
The most time-consuming initialization is that of the guard-vectors. When
parsing sentences, the guard-vectors are initialized to all '0'. When parsing word lattices, certain
bits must also be set to reflect the presence of some non-terminals in the initial sets.
Finally, the non-terminals table also has to be initialized and the last/empty flags set
accordingly.
The chart memory addressing unit (see figure 3.6) is used to access the chart data-structure.
Below we explain how the three distinct tables of the chart data-structure are accessed and the
hardware resources involved:
index table access: the index table is accessed only for read and is used to retrieve the
pointers Phead and Pguard and the displacement Dtail of a cell with known coordinates
in the chart. For a given chart cell, the physical address of its associated entry in the index
table is built from the cell coordinates.
Precisely, if the chart data-structure can accommodate parsings up to a maximal sentence
length, the physical memory address of the index table entry of a given cell is built from that
maximal length and the two cell coordinates, relative to the start address of the index table in
the chart memory.
In hardware, the physical memory addresses are constructed in the shadow registers as
below:
- in RHS1shadow for source cell 1
- in RHS2shadow for source cell 2
- in Dshadow for the destination cell
If the maximal sentence length is a power of 2 (which is actually a requirement), the memory
addresses for the source and destination cells can be computed simply by performing binary
shift operations on the cell coordinates.
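The shift-based address construction can be illustrated with a short Python sketch. It assumes a row-major layout in which one coordinate is shifted left by SLEN_SIZE bits and combined with the other, one memory word per index table entry, and an index table starting at address 0. The function name, the parameter names and the choice of which coordinate is shifted are illustrative assumptions, not details taken from the design.

```python
def index_entry_address(row: int, col: int, slen_bits: int, table_base: int = 0) -> int:
    """Address of the index-table entry of chart cell (row, col).

    Assumption: the maximal sentence length is 2**slen_bits, so multiplying
    by it reduces to a left shift and the addition to a bitwise OR.
    """
    max_len = 1 << slen_bits
    if not (0 <= row < max_len and 0 <= col < max_len):
        raise ValueError("cell coordinates exceed the maximal sentence length")
    # (row * max_len + col) computed with a shift and an OR only
    return table_base + ((row << slen_bits) | col)


# With SLEN_SIZE = 5 (see table 3.5) the shift replaces a multiplication by 32.
print(index_entry_address(2, 3, slen_bits=5))
```

Because the maximal length is a power of 2, no multiplier is needed: the address is obtained by wiring the two coordinates side by side.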
non-terminals table access: the table is accessed both for read and write. Read accesses are
made to source cell 1 and source cell 2, and write accesses to the destination cell.
In hardware, the physical memory address for accessing the non-terminals in the
non-terminals table is constructed as below:
- (RHS1base + RHS1Index) for source cell 1
- (RHS2base + RHS2Index) for source cell 2
- (Dbase + DIndex) for the destination cell
As the non-terminals in a source cell are read sequentially, after each read the index stored in
RHS1Index or RHS2Index is incremented to address the next non-terminal in the list. The
registers RHS1base and RHS2base are initialized from the Phead field of the corresponding
cell and the counter registers RHS1Index and RHS2Index are always initialized to 0, even
when parsing word lattices.
The result of the cell-combination between the source cells is written in the destination cell.
The processors write the non-terminals in the destination cell sequentially at the physical
address (Dbase + DIndex) and after each write the index stored in DIndex is incremented
to point at the next free position. The register Dbase is initialized from the Phead field of the
corresponding cell and the counter register DIndex from the Dtail field. The displacement
Dtail is always 0 in the case of sentence parsing. However, in the case of word lattice parsing it
may be different from 0.
guard-vectors table access: the table is accessed both for read and write. In hardware the
physical memory address for accessing a particular guard-vector in the guard-vectors table
is constructed from the registers Guardbase and LHS as (Guardbase + LHS), where LHS
is actually the binary representation of the non-terminal we want to check (read) or set (write).
The Guardbase register is initialized from the Pguard field associated to the destination
cell and LHS is the output of the grammar lookup unit.
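The base-plus-index addressing of the non-terminals table can be sketched in Python as follows. The class `CellCursor` and the flat list standing in for the chart memory are hypothetical; the sketch only mirrors the register behaviour described above (a base register loaded from Phead, an index register that is incremented after every access and starts at 0 or at Dtail).

```python
class CellCursor:
    """Base register plus auto-incrementing index register for one chart cell."""

    def __init__(self, memory, base, index=0):
        self.memory = memory   # stand-in for the chart memory
        self.base = base       # loaded from the Phead field of the cell
        self.index = index     # 0 for source cells, Dtail for the destination

    def read(self):
        value = self.memory[self.base + self.index]
        self.index += 1        # address the next non-terminal in the list
        return value

    def write(self, non_terminal):
        self.memory[self.base + self.index] = non_terminal
        self.index += 1        # point at the next free position


memory = [0] * 16
memory[4:7] = [11, 12, 13]                   # non-terminals of a source cell
source = CellCursor(memory, base=4)
destination = CellCursor(memory, base=10)    # sentence parsing: Dtail = 0
destination.write(source.read())
destination.write(source.read())
```

For word lattice parsing the destination cursor would simply be created with `index` set to the Dtail value of the cell, so that the existing non-terminals are not overwritten.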
3.5.1.2 The processor’s interface to the grammar memory.
Each grammar memory contains a copy of the CNF grammar data-structure (see section 3.4.2)
and is used for grammar rule look-up during the parsing process. A grammar memory is only
accessed for read and can be shared by several processors, clustered around the grammar
memory, in which case an arbiter is used to arbitrate concurrent accesses.
The unit that interfaces the processor to the grammar memory is the MAG unit³ (see figure 3.6).
The MAG unit uses an address bus (addressGRAMMAR), a data bus (dataGRAMMAR) and
three command lines (not illustrated) to connect to the grammar memories. The tasks of the
MAG unit are, given the non-terminals RHS1 and RHS2, to allow the processor to retrieve (1)
all the non-terminals LHS for which there exists a grammar rule LHS -> RHS1 RHS2 and (2) a
code that uniquely identifies the right-hand side RHS1 RHS2.
Before being used, the grammar memories have to be configured. The CNF grammar
data-structure is loaded once before any parsing can start and remains unchanged
for successive parsings. Only when a different CNF grammar needs to be used do the grammar
memories have to be reloaded with the new CNF grammar data-structure.
The initialization of the grammar memories is done by software running on the on-board
processor or on the host system.
The MAG unit allows the processor to retrieve the right hand-side code and all the non-terminals
LHS that are left hand-sides for a given right hand-side RHS1 RHS2. For doing so, the MAG unit
implements the mechanism described in section 3.4.2 for accessing the three levels of the CNF
grammar data-structure.
³ MAG is an anagram of the initials of Grammar Memory Access unit.
Signal                Direction  Size (bits)  Details
RHS1                  I          NT_SIZE      the first non-terminal in the right hand-side of the searched grammar rule
RHS2                  I          NT_SIZE      the second non-terminal in the right hand-side of the searched grammar rule
start_search          I          1            initiates the search (activated when RHS1 and RHS2 are stable)
ready_search          O          1            marks the end of the search
emptyLHSset           O          1            if active when ready_search is active, no (further) left hand-side exists for the given right hand-side; if inactive, a valid non-terminal is available at the output oneLHS
oneLHS                O          NT_SIZE      the retrieved non-terminal(s)
right hand-side code  O          RULE_SIZE    a code that uniquely identifies a right hand-side. Used together with oneLHS it uniquely identifies a grammar rule
nextLHS               I          1            requests the following left hand-side in the list
memory request        O          1            requests the cluster's grammar memory (to the cluster's arbiter)
memory release        O          1            releases the cluster's grammar memory (to the cluster's arbiter)
memory grant          I          1            if active, the grammar memory belongs to this MAG unit
dataGRAMMAR           I          32           connected to the grammar memory data bus
addressGRAMMAR        O          PTR_SIZE-1   connected to the grammar memory address bus
Table 3.4: MAG I/O signals used to interface with the processor, arbiter and cluster's grammar
memory.
The signals used for interfacing the MAG unit to the processor, along with their functional
description, are given in table 3.4. All these signals are active high and operate according to
the switching waveforms (i.e. protocol) given in figure 3.7. We can distinguish two possible
situations when the MAG unit accesses the CNF grammar data-structure: (1) there is no rule
having the given right hand-side and (2) at least one such rule exists. The switching waveform
for the first case is given in figure 3.7(a). At the moment t0, RHS1 and RHS2 are stable and
the processor activates the signal start_search, asking the MAG unit to start the search. After
an unpredictable duration, at time t1, the MAG unit answers by activating ready_search
to mark the end of the search and also emptyLHSset to signal that there is no rule in the
CNF grammar with the given right hand-side RHS1 RHS2.
The switching waveform for the second case is given in figure 3.7(b). The search is
initiated as described above. However, in this case, at time t1 when the MAG unit activates
ready_search, the signal emptyLHSset stays inactive (i.e. low). This means that the MAG
unit has a valid non-terminal on the output oneLHS that is available to the processor. Whenever
the processor wants the next left hand-side it activates nextLHS. The sequence described above
is repeated at t2, t3, ... until all the left hand-sides for the given right hand-side have been
read. This moment tn is marked by the activation of the signal emptyLHSset.
At this moment RHS1 and RHS2 can be changed and a new search can begin.
[Figure 3.7 timing diagrams: (a) signals RHS1, RHS2, start_search, ready_search and emptyLHSset between t0 and t1; (b) signals RHS1, RHS2, start_search, ready_search, emptyLHSset, nextLHS and oneLHS (carrying LHS 1, LHS 2, ..., LHS N) at t0, t1, t2, ..., tn.]
Figure 3.7: Grammar memory access (MAG) unit interface protocol.
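The two cases of the protocol can be mimicked in software with a small Python sketch. It is a behavioural illustration only: the dictionary standing in for the CNF grammar data-structure, the function names and the use of a generator (where advancing the generator plays the role of a nextLHS pulse) are assumptions made for this example and do not reflect the hardware implementation.

```python
def mag_search(grammar, rhs1, rhs2):
    """Yield (emptyLHSset, oneLHS) each time ready_search would be active.

    Starting the generator corresponds to start_search; every further
    step corresponds to the processor activating nextLHS.
    """
    for lhs in grammar.get((rhs1, rhs2), []):
        yield (0, lhs)     # a valid left hand-side is available on oneLHS
    yield (1, None)        # emptyLHSset active: nothing (more) to read


def collect_lhs(grammar, rhs1, rhs2):
    """Processor side: read left hand-sides until emptyLHSset is active."""
    found = []
    for empty, lhs in mag_search(grammar, rhs1, rhs2):
        if empty:
            break
        found.append(lhs)
    return found


# Toy grammar: right hand-side pair -> list of left hand-sides (hypothetical).
toy_grammar = {("Det", "N"): ["NP", "NX"], ("NP", "VP"): ["S"]}
print(collect_lhs(toy_grammar, "Det", "N"))
print(collect_lhs(toy_grammar, "VP", "NP"))
```

The first call corresponds to case (b), with two left hand-sides read before emptyLHSset is raised, and the second to case (a), where emptyLHSset is raised together with the first ready_search.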
3.5.1.3 The update unit
The tasks of the update unit are:
- set the value of the flags (i.e. the last/empty special bits) of each non-terminal before writing
it in the chart memory, in the non-terminals table. This is done in the LHS update
module inside the update module (see figure 3.6);
- update the guard-vectors. Every time a new non-terminal is inserted in the non-terminals
table, its corresponding bit in the guard-vector has to be set. This is done in the guard
update module inside the update module (see figure 3.6). In order to update the
guard-vector, a read operation is performed on the chart memory to fetch the 4-byte word
containing the bit. Once the corresponding bit is set, the 4-byte word is written back to
the chart memory.
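The read-modify-write cycle on the guard-vector can be sketched in Python as below. The bit packing used here (32 non-terminals per word, bit position given by the non-terminal code modulo 32) and the function name are assumptions made for illustration; the text only states that the 4-byte word containing the bit is read, updated and written back.

```python
WORD_BITS = 32  # the guard-vector is accessed in 4-byte words


def set_guard_bit(chart_memory, guard_base, non_terminal):
    """Set the guard bit of `non_terminal`; return True if it was already set."""
    word_address = guard_base + non_terminal // WORD_BITS   # assumed packing
    mask = 1 << (non_terminal % WORD_BITS)
    word = chart_memory[word_address]           # read the 4-byte word
    already_present = bool(word & mask)
    chart_memory[word_address] = word | mask    # write the updated word back
    return already_present


chart_memory = [0] * 4
first_time = set_guard_bit(chart_memory, guard_base=1, non_terminal=33)
second_time = set_guard_bit(chart_memory, guard_base=1, non_terminal=33)
```

Returning whether the bit was already set is one way to decide that a non-terminal is already stored in the cell and need not be written a second time.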
3.5.1.4 The synchronization/validation unit
The synchronization and validation unit has two tasks: (1) to allow the processors to achieve
synchronization after each filled row of the chart and (2) to generate the stop condition for the
processors. The processors are synchronized at the end of each iteration of the outer loop of the
CYK algorithm (see page 13), before the next iteration starts. This is necessary to ensure that
the data dependencies among the sets are satisfied. A processor also checks the stop condition at
the beginning of each iteration. If the stop condition is false, the processor is validated (i.e.
working) during the next iteration. On the contrary, if the stop condition is true, the processor
is idle during the next iterations until the current parsing ends. Let's first see how the validation
is implemented.
validation: The state (i.e. stopped or working) of processor P_i depends on the length of the
parsed sentence and on the state of processor P_{i+1}, its neighbour at the right. To illustrate how
the length of the parsed sentence conditions the initial state of the processors, let's consider the
following example:
Example 3.4 For this example we consider the 10-processor system in figure 3.1. For a short
sentence, the initial configuration is: the first few processors P_1, P_2, ... working and all the
remaining processors up to P_10 stopped. For a longer sentence, more processors at the left of
the array are initially working and fewer processors at the right are stopped.
Also during run-time (i.e. parsing), the state of processor P_i depends on the processor P_{i+1}
at its right. Precisely, if the processor P_{i+1} at its right is stopped, the processor P_i will be
stopped during the next iteration.
Example 3.5 For the same 10-processor system in figure 3.1 we consider a sentence for which
the initial configuration has the leftmost processors working and the remaining ones
stopped. During run-time, after the first row of the chart is filled, the rightmost working
processor will stop. After the second row of the chart is filled, its left neighbour will stop, and
so on until the fifth row is filled, the last working processor stops and the parsing is finished.
[Figure 3.8: schematic. Labels recovered from the drawing: processors P1, P2, ..., Pi, ..., Pn chained with a TERM block; per-processor signals valid_i, sin_i, freeze_i, STOP_i and localFINISH; global signals GLOBAL_INIT and setNextRow; two flip-flops (FF) with their data input tied to '1'.]
Figure 3.8: The hardware used to achieve processor synchronization and for generating the stop
condition.
In order to explain how the validation mechanism is implemented in hardware, let's consider
figure 3.8. This figure shows the signals involved in the generation of the stop condition as
well as those used for achieving processor synchronization. For a processor P_i, the input signals
freeze_{i+1} and sin_{i+1} are cascaded from the processor P_{i+1}. The signal valid_i is generated
from the length of the parsed sentence and the processor identifier i: it is 1 when the sentence
length requires processor P_i and 0 otherwise. The signal localFINISH is activated by the processor
when it has finished its job. The signals GLOBAL_INIT and setNextRow are activated by the
global controller IO-CTRL. The signal GLOBAL_INIT is used for initializing the processors
each time a new parsing starts. The flip-flop FF used to propagate the stop condition is reset
by this signal. The setNextRow signal is activated each time the processors have achieved
synchronization (after a filled row) in order to update the processors' state.
The output signal STOP_i is the stop condition for the local processor and the signals
freeze_i and sin_i are cascaded towards the processor P_{i-1}. In boolean terms, STOP_i is
derived from freeze_{i+1} and valid_i, freeze_i carries STOP_i to the left neighbour, and
sin_i = sin_{i+1} AND (STOP_i OR localFINISH).
As we can see from figure 3.8, all the output signals are updated when the setNextRow
signal is activated.
synchronization: Let's now look at how the synchronization is implemented. We can
interpret the boolean expression for the signal sin_i given above as follows: the output signal
sin_i propagates the input signal sin_{i+1} if either the local processor has finished working (i.e.
localFINISH = 1) or the state of the local processor is stopped (i.e. STOP_i = 1). As we
can see in figure 3.8, the sin input at the right end of the chain is connected directly to a
logic '1'. This '1' propagates through the chain of processors, from right to left up to the
leftmost sin output, if all the processors are either in a stopped
state or have finished working on the current row.
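The behaviour of the stop and synchronization chains can be simulated with a few lines of Python. This is a behavioural sketch under stated assumptions: processors are indexed from 0 at the left, the terminator at the right end of the chain is assumed to behave like a stopped processor, and the function names are hypothetical. It does not reproduce the flip-flop level design of figure 3.8.

```python
def next_stop_states(stopped):
    """State after setNextRow: a processor stops if its right neighbour is stopped.

    stopped[i] is True when processor i (0 = leftmost) is stopped. The
    terminator right of the last processor is assumed to count as stopped.
    """
    n = len(stopped)
    return [stopped[i] or (stopped[i + 1] if i + 1 < n else True) for i in range(n)]


def sync_reached(stopped, finished):
    """The '1' fed in at the right end reaches the left end of the sin chain
    only if every processor is either stopped or has raised localFINISH."""
    return all(s or f for s, f in zip(stopped, finished))


def rows_until_finished(stopped):
    """Count the setNextRow updates needed until every processor is stopped."""
    rows = 0
    while not all(stopped):
        stopped = next_stop_states(stopped)
        rows += 1
    return rows


initial = [False, False, False, True, True]   # three working, two stopped
print(next_stop_states(initial))
print(rows_until_finished(initial))
```

In this model the boundary between working and stopped processors moves one position to the left at every row, which is the behaviour described in example 3.5.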
3.5.2 The I/O controller
The tasks of the I/O controller IO-CTRL are:
- system interface: all the configuration registers (see section 3.3) are accessed through the
IO-CTRL. The host system is interfaced with the design through the IO-CTRL, which
monitors the start signal and generates the signal that marks the end of the parsing. The
end of the parsing is detected from the activation of the freeze signal at the output of the
processor chain (see section 3.5.1.4);
- system initialization: at power-up the hardware is reset. However, the state of the processors
in the system requires initialization between successive parsings. This initialization
is performed by the IO-CTRL with the GLOBAL_INIT signal;
- processor synchronization: although the hardware for achieving synchronization is located
in the synchronization/validation unit of each processor, the IO-CTRL monitors
the sin signal at the output of the processor chain. If this signal is activated, it concludes
that the processors have achieved synchronization and activates the setNextRow signal for
setting up the processors for the processing of the next row.
3.5.3 The arbiters
All arbiters are token-passing units, adapted to the current design. Being synchronous with
the system clock, their response does not depend on the size of the system (i.e. the number
of arbitrated processors), as is the case with asynchronous arbiters. The interface between
a processor P_i requesting access to a shared resource and the arbiter consists of three signals:
a request signal, a release signal and a grant signal. The request signal is used to signal the
processor's intention to obtain the shared resource and the release signal is used to release
the shared resource. The grant signal is the arbiter's answer and is activated when the shared
resource is granted to processor P_i.
The arbiter works as follows: when the arbiter sees the token passed to a processor P_i that has
previously activated its request signal, it answers the processor by activating the grant
signal. The processor sees the grant signal activated and knows that the requested resource
has been granted to it. The arbiter also blocks the token until the processor P_i releases
the resource by activating its release signal.
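The request, grant and release handshake can be illustrated with a minimal Python model of a token-passing arbiter. The class and method names are hypothetical, and the sketch simplifies the timing: the token is assumed to advance by one processor per clock cycle, which is not claimed to match the latency of the actual design.

```python
class TokenArbiter:
    """Behavioural model: one token circulates among n requesters."""

    def __init__(self, n):
        self.n = n
        self.token = 0                 # processor currently holding the token
        self.owner = None              # processor the resource is granted to
        self.requests = [False] * n

    def request(self, i):
        self.requests[i] = True        # processor i raises its request signal

    def release(self, i):
        if self.owner == i:            # processor i raises its release signal
            self.owner = None
            self.requests[i] = False
            self.token = (self.token + 1) % self.n   # token moves on again

    def clock(self):
        """Advance one cycle; return the granted processor or None."""
        if self.owner is not None:
            return self.owner          # token is blocked while the resource is held
        if self.requests[self.token]:
            self.owner = self.token    # grant to the requesting token holder
        else:
            self.token = (self.token + 1) % self.n
        return self.owner


arbiter = TokenArbiter(3)
arbiter.request(2)
grants = [arbiter.clock() for _ in range(3)]
```

While a processor owns the resource the token does not move, so a second request is only served after the release, as described above.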
The shared resources in the system are the chart memory (shared by all the processors in
the system) and the cluster grammar memory (shared by the processors in the same cluster).
3.6 Performance measurements
All tests and performance measurements presented in this section were performed on the CNF
SUSANNE grammar, which contains 10,129 non-terminals (parameter NT in table 3.5). The
data-structure representing the CNF SUSANNE grammar is stored in grammar memories of
558,576 bytes each (parameter GMEMSIZE in table 3.5). The chart memory size depends on the
number of processors in the system, or the maximum length of the sentence we want to parse.
Group           Parameter        Size [units]     Details
system          CLOCK            50 [MHz]         the system clock
system          SLEN_SIZE        5 [bits]         can parse sentences with up to 2^5 = 32 words
system          PROCESSORS       10               the number of processors in the system
chart memory    CYKLATENCY       17 [ns]          the access-time of the SRAM memory used to store the chart memory
chart memory    DTAIL_SIZE       9 [bits]         see section 3.4.1
chart memory    CYK_ADR_SIZE     19 [bits]        chart data-structure pointer size, see section 3.4.1
chart memory    CYK_WAIT_CYCLES  4                delay until chart memory data lines are stable
grammar memory  GMEMSIZE         558,576 [bytes]  the size of each grammar memory
grammar memory  GLATENCY         17 [ns]          the access-time of the SRAM memory used to store the grammar memory
grammar memory  WAIT_CYCLES      2                delay until grammar memory data lines are stable
grammar memory  NT               10,129           the number of non-terminals in the grammar
grammar memory  NT_SIZE          14 [bits]        bits used to represent a non-terminal, see section 3.4.2
grammar memory  RULE_SIZE        15 [bits]        bits used to represent a distinct right-hand side, see section 3.4.2
grammar memory  PTR_SIZE         19 [bits]        grammar data-structure pointer size, see section 3.4.2
Table 3.5: Parameter values used to configure the synthesized 10-processor LAP design.
In order to determine the real maximal clock frequency at which the system is able to
work, the 10-processor system shown in figure 3.1 was synthesized⁴ and placed&routed⁵ in a
Xilinx FPGA, Virtex XCV1000bg560-4. The synthesis was performed on a design instance
characterised by the parameters given in table 3.5. A summary of the FPGA resources used
by the design units is given in table 3.6. The used FPGA resources are expressed in terms of
D Flip-Flops and/or Latches (DFFs/Latches), function generators (FGs) and configurable logic
blocks (CLBs). For detailed explanations of these terms, refer to the Virtex manual [31].
Here, we only mention that for the Virtex XCV1000 FPGA a number of maximum  	
DFF/Latches,  	 FGs and  

 CLBs are available.
As we can see in the table, the synthesized 10-processor system uses less than 37% of
the FPGA resources, and therefore a system with more processors may eventually fit in the same
FPGA, which would allow us to parse sentences of real-life length of up to 26 words.
⁴ With LeonardoSpectrum v1999.1f
⁵ With Design Manager (Xilinx Alliance Series 2.1i)
unit                  DFFs/Latches  FGs    CLBs   XCV1000bg560-4 area utilization (%)  XCV2000bg1156-6 area utilization (%)
chart memory arbiter  36            91     46     0.37                                 0.24
3-GM cluster arbiter  8             18     9      0.07                                 0.05
4-GM cluster arbiter  12            27     14     0.11                                 0.07
IO-CTRL               28            27     14     0.11                                 0.07
MAG unit              168           296    148    1.20                                 0.76
processor             658           887    444    3.61                                 2.22
10-processor system   6,507         8,934  4,467  36.35                                21.81
Table 3.6: Virtex XCV1000 FPGA resource utilization per LAP design unit in terms of
Flip-Flops/Latches, function generators (FGs) and configurable logic blocks (CLBs).
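The head-room left in the FPGA can be estimated from the figures of table 3.6 with a short calculation, sketched here in Python. It is a rough estimate only: it assumes that the area utilization percentage is the ratio of used CLBs to available CLBs and that every additional processor costs the 444 CLBs reported for one processor, ignoring arbiters, routing and timing constraints. The function names are hypothetical.

```python
def implied_capacity(used_clbs, utilization_percent):
    """CLB capacity implied by a (used CLBs, utilization %) pair of table 3.6."""
    return used_clbs / (utilization_percent / 100.0)


def extra_processors(capacity, used_clbs, clbs_per_processor):
    """Upper bound on the number of additional processors that would fit by area."""
    return int((capacity - used_clbs) // clbs_per_processor)


# 10-processor system: 4,467 CLBs for 36.35 % of the XCV1000
capacity = implied_capacity(4467, 36.35)
extra = extra_processors(capacity, 4467, 444)
print(round(capacity), extra)
```

Such an area-only bound is optimistic, since place and route rarely succeed close to full utilization.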
The system was then tested, and checked for correctness, on a RC1000-PP board clocked by its
programmable on-board clock. However, due to the fact that the RC1000-PP board can accommodate
only 3 grammar clusters, the hardware run-times we present were obtained by simulating⁶
the VHDL model of a system with 14 processors and 14 grammar clusters (one processor per
grammar cluster) and respectively a system with 14 processors and 7 grammar clusters. In the
last case the processors were clustered such that a priori the number of concurrent accesses
to the grammar memories is as reduced as possible: each of the 7 clusters groups one of the
first seven processors with one of the last seven.
The software used for comparison is an implementation of the enhanced CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The hardware performance
(i.e. run-time) of the 14 processors and 14 grammar clusters system (hereafter denoted as
hard_14P_14C) and the 14 processors and 7 grammar clusters system (hereafter denoted as
hard_14P_7C) was compared against two software run-times. The first (soft_CNFG) uses the
SUSANNE grammar in CNF, as is also the case for the hardware. The second (soft_CFG)
uses the SUSANNE grammar in its original context-free form. The software was run on a SUN
Ultra-Sparc workstation. The initialization of the chart was not taken into account for
the computation of the run-times. For accuracy, the timing was done with the times() C library
function and not by profiling the code.
For the purpose of the comparison, a set of sentences was parsed and validated⁷. The
sentences have a length ranging from 3 to 15 words (the range covered by figure 3.9) and were
all taken from the SUSANNE corpus.
Figure 3.9 shows the average run-times soft_CFG, hard_14P_14C and hard_14P_7C as
functions of the sentence length (vertical axis). Figure 3.10(a) shows the hardware speedup, for
both hard_14P_14C and hard_14P_7C, in comparison with soft_CNFG, and figure 3.10(b)
shows the hardware speedup, for both hard_14P_14C and hard_14P_7C, in comparison with
soft_CFG, as a function of the sentence length. From these figures we can see that there is
no significant difference in the speedup factors when comparing the hard_14P_14C and
hard_14P_7C systems. For sentences with up to 8 words there is no difference between the two
systems, since only the processors P_1, P_2, ..., P_7 are working, which means that there is only
one active processor in each grammar memory cluster and therefore there are no collisions
between the processors when accessing the grammar memories. For longer sentences, however,
the explanation for the fact that there is no significant difference between the two systems is that
the processors seldom work in parallel, which results in only sporadic grammar memory collisions.
We also note that the system performance decreases with the sentence length, which is a bad
behaviour when dealing with real-life sentences.
⁶ With ModelSim EE/Plus 5.2e
⁷ The hardware output was compared to the software output for detecting mismatches. However, as the validation
process consists of many technical details, it is not described here. A detailed description can be found in [9].
[Figure 3.9: horizontal bar chart; vertical axis: sentence length [words], horizontal axis: time [ms]. Values printed on the bars, in legend order:]
length  soft_CFG  hard_14P_14C  hard_14P_7C
3       3.85      0.21          0.21
4       7.32      0.41          0.41
5       10.39     0.62          0.62
6       14.67     1.01          1.01
7       17.23     1.14          1.14
8       24.12     1.8           1.8
9       27.65     2.06          2.07
10      38.7      3.13          3.16
11      39.46     3.15          3.19
12      48.69     4.02          4.08
13      57.97     4.92          5.01
14      68.98     5.93          6.05
15      76.12     6.78          6.91
Figure 3.9: Average run-times for soft_CFG and the hard_14P_14C, hard_14P_7C LAP systems
as a function of sentence length. Each sentence length is represented by several parsed sentences.
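The speedup against soft_CFG can be recomputed from the run-times printed on the bars of figure 3.9, as in the Python sketch below. The assignment of the three values per sentence length to soft_CFG, hard_14P_14C and hard_14P_7C follows the legend order and is an assumption; the resulting ratios of averages are only indicative and need not coincide with the averaged per-sentence speedups plotted in figure 3.10(b).

```python
lengths = list(range(3, 16))   # sentence lengths 3..15 words
soft_cfg = [3.85, 7.32, 10.39, 14.67, 17.23, 24.12, 27.65,
            38.7, 39.46, 48.69, 57.97, 68.98, 76.12]        # time [ms]
hard_14p_14c = [0.21, 0.41, 0.62, 1.01, 1.14, 1.8, 2.06,
                3.13, 3.15, 4.02, 4.92, 5.93, 6.78]         # time [ms]
hard_14p_7c = [0.21, 0.41, 0.62, 1.01, 1.14, 1.8, 2.07,
               3.16, 3.19, 4.08, 5.01, 6.05, 6.91]          # time [ms]


def speedups(software_ms, hardware_ms):
    """Element-wise ratio software run-time / hardware run-time."""
    return [s / h for s, h in zip(software_ms, hardware_ms)]


speedup_14c = speedups(soft_cfg, hard_14p_14c)
speedup_7c = speedups(soft_cfg, hard_14p_7c)
for n, a, b in zip(lengths, speedup_14c, speedup_7c):
    print(f"{n:2d} words: {a:5.2f} (14 clusters)  {b:5.2f} (7 clusters)")
```

The two columns are identical up to 8 words and differ only marginally beyond, and both decrease as the sentences get longer, in line with the observations made above.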
The average speedup factor has been computed for the hardware implementation against both
the soft_CNFG and the soft_CFG software; the corresponding speedups as a function of the
sentence length are shown in figure 3.10(a) and figure 3.10(b) respectively.
3.7 Design testing on the RC1000-PP FPGA board.
For obtaining the real clock frequency and validating the presented design, the 10-processor
LAP system (see figure 3.1) was physically tested on a commercial FPGA-board.
We used Celoxica's RC1000-PP FPGA-board whose block diagram is given in figure 3.11.
This board contains a Xilinx Virtex XCV1000 FPGA and 4 SRAM memories of 2 MBytes
(organized as 512K x 32) each. The RC1000-PP board is shipped with host-side development
software that supports the initialization of and communication with the hardware. The software
provides key features such as handling the FPGA configuration files, initialization of the
on-board programmable clock and data transfer support (i.e. handshake and DMA⁸) for
transferring data between the host system and the RC1000-PP board. The software comes in the
form of a static C library that can be linked to host-side programs. As the tested system contains 3
grammar clusters (3 + 3 + 4 = 10 processors) and the chart memory, it fits on the 4 SRAM
memory banks available on the RC1000-PP board. Also, the SRAM banks on the RC1000-PP
board have a 32-bit databus that matches the databus width used in the 10-processor system.
Given these remarks, the synthesized 10-processor system can be placed&routed in the Virtex
XCV1000 FPGA available on the RC1000-PP board without any further modification. The
place&route tool only requires a pin constraint file *.ucf that forces signal routing inside
the FPGA to the correct pins, which are hardwired from fabrication to the external resources (i.e.
memory banks, data communication units). The output of the place&route tool is a *.bit
file that can be downloaded (using the development software library) to the FPGA without any
further modification. Prior to the FPGA design configuration, the 3 grammar memories are
initialized with the binary image of the CNF grammar (i.e. SUSANNE CNF grammar) used
for parsing and the programmable clock is set to the required frequency. The initialization
of the grammar memories is performed through DMA transfers between the host system and
the RC1000-PP board. Precisely, the CNF grammar image file is read by the host software
and then transferred to each of the 3 grammar memories on the RC1000-PP board through
3 successive DMA transfers. The chart memory is initialized before each parsing by first
creating an image of the chart memory in the host system. The image is then transferred through
a DMA transfer to the SRAM memory associated with the chart memory. The start signal and
the end-of-parsing signal are implemented with the single-bit user signals available
for host-FPGA communication. Finally, the sentence length and wait cycles (see section 3.3)
are configured using one of the two ports available for host-FPGA communication: some of its
bits are used to set the sentence length and the remaining bits represent the wait cycles.
The host system polls the end-of-parsing signal for detecting the end of the parsing. Once
the parsing is finished, the chart memory can be read (through a DMA transfer for instance) for
validating the results.
⁸ Direct Memory Access.
[Figure 3.10: two line plots of speedup versus sentence length [words] (horizontal axis 2 to 16). Panel (a): hard_14P_14C vs. soft_CNFG and hard_14P_7C vs. soft_CNFG, speedup axis 80 to 150. Panel (b): hard_14P_14C vs. soft_CFG and hard_14P_7C vs. soft_CFG, speedup axis 12 to 26.]
Figure 3.10: Hardware speedup for the hard_14P_14C and hard_14P_7C LAP systems against
(a) soft_CNFG and (b) soft_CFG software as a function of sentence length. Each sentence
length is represented by several parsed sentences.
[Figure 3.11: block diagram. Labels recovered from the drawing: PCI bus, PCI-PCI bridge, secondary PCI, PLX PCI9080 bridge, programmable clock, isolation blocks, Xilinx V1000 (the 10 processors system), four SRAM banks of 512K x 32 (three holding a copy of GM, one storing the chart).]
Figure 3.11: RC1000-PP FPGA-board block diagram illustrating the mapping of the
10-processor LAP system.
3.8 Conclusions
The chapter shows by means of some examples why a 2D-array of processors architecture as
presented in [8] is not feasible when using state-of-the-art FPGAs and real-life CFGs. Next it
proposes a linear array of processors (LAP) architecture as an alternative solution. The LAP
design was the first step of our design methodology, during which we proved the feasibility
and correct functionality of an FPGA-based hardware implementation of the CYK algorithm.
The feasibility was proved in terms of required hardware resources and size of the data-structures
for storing the chart and the CNF grammars. For proving the correct functionality, the design
was synthesized and tested on a commercial FPGA board (RC1000-PP). We also showed that
a significant speedup factor is obtained when compared against a software implementation of a
similar algorithm.
The main features of the LAP design are: (1) its speedup over our best
software implementation of the enhanced-CYK algorithm (see section 3.6) and (2) its ability to
parse real-life sentences, or word lattices with a corresponding number of time-stamps. These
features make the design an interesting solution for integration within real-life NLP application
frameworks that have strong real-time constraints.
Chapter 4
Linear Array of Processors Design Analysis
In this chapter we discuss some important features of the LAP design. First, to get
a general idea of the design, we look at the processor activity and measure the processor
utilization for some arbitrary sentences extracted from the SUSANNE corpus. More precisely,
as the processor activity mainly consists of two tasks, (1) data processing and (2) accessing
the chart memory data-structure, we investigate the fraction of time spent by the processors
in performing each of the two tasks. Next, we investigate how the increased number of
collisions when accessing the chart memory, which comes with longer sentences, affects the
overall design performance. We conclude with some remarks on the LAP architecture that
pinpoint several critical changes required for building an improved design.
4.1 Processors utilization
In order to get a general idea about the processor activity during the parsing process we
illustrate in figures 4.1(a), (b) and (c) the processor activity for three sentences (given in
table 4.1) of length 4, 10 and 15 words respectively. Figures 4.1(a)-(c) show, for each
processor in the design (vertical axis), its activity time during the parsing. More precisely,
black depicts the time spent by the processor processing data and gray the time spent on
chart memory read/write accesses. In these figures we can see the moments when the
processors synchronise and start the filling of a new row. An important thing to note about the
processor activity is that there is a large discrepancy between the loads of different processors
when processing a row. In fact, when processing a row, the most loaded processor, i.e. the one
that finishes last, triggers the filling of the next row, and all the other processors stay idle
until it does. These observations suggest that a better processor allocation mechanism is
needed in order to better exploit the available processing power.

length  sentence
4       “One wing stood open”
10      “In fact our whole defensive unit did a good job”
15      “In societies like ours , however , its place is less clear and more complex”
Table 4.1: Three sentences parsed with the LAP design for which the processor activity is
represented in figure 4.1.
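The effect of the row-by-row synchronisation on the processor load can be illustrated with a
small Python sketch. The function name and the load values are hypothetical and only serve
the illustration; they are not measurements of the LAP design. The sketch assumes that a row
is finished only when its slowest processor finishes.

```python
def row_utilization(loads):
    """Fraction of the processor time spent on useful work while one row is filled.

    loads[i] is the busy time of processor i for this row (hypothetical values).
    The row ends when the most loaded processor finishes; the others wait idle.
    """
    row_time = max(loads)
    return sum(loads) / (row_time * len(loads))


# One heavily loaded processor keeps the two others waiting.
print(row_utilization([10, 2, 4]))
# Perfectly balanced loads leave no idle time.
print(row_utilization([5, 5, 5]))
```

The more unbalanced the loads of a row are, the lower this fraction becomes, which is the
behaviour visible in figure 4.1.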
Another feature of the LAP design that can be observed in figures 4.1(a)-(c) is that, as the
length of the sentence increases, the time spent by a processor on chart read/write accesses
(i.e. the gray area) also increases, becoming a significant fraction of the overall processor
activity time. This behaviour is explained by an increased number of collisions, which have
to be arbitrated, when accessing the chart memory.
The design efficiency can be described by means of the average processor utilization, which
measures how efficiently the processors are used during the parsing process. To investigate
the average processor utilization in the current design we parse one sentence for each length
between 3 and 15 words, chosen from the SUSANNE corpus, and compute the average
processor utilization for each of these sentences. The average processor utilisation is
computed as follows.
We define the utilisation $U_i$ of processor $P_i$ as the ratio between the time it spends in
data-processing (i.e. black area without the gray area) during the entire parsing process and
the time elapsed until the parsing finishes. Then, for a system with $n$ processors the average
processor utilization $U$ is computed as follows:

$$U = \frac{1}{n} \sum_{i=1}^{n} U_i \qquad (4.1)$$
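Equation (4.1) can be illustrated with a minimal Python sketch. The function name and the
timing values are hypothetical and chosen only for illustration; they are not measurements
of the LAP design.

```python
def average_utilization(processing_times, total_time):
    """Equation (4.1): mean over all processors of processing time / total parsing time.

    processing_times[i] is the time processor i spends in data processing
    (chart accesses excluded); total_time is the time until the parsing finishes.
    """
    per_processor = [t / total_time for t in processing_times]
    return sum(per_processor) / len(per_processor)


# Hypothetical example: 3 processors, parsing finished after 100 time units.
print(average_utilization([10, 20, 30], 100))
```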
Table 4.2 lists, for each of the sentences chosen from the SUSANNE corpus, the average
processor utilization computed with the formula given above. Note that the average processor
utilization neither decreases nor increases as the sentence length increases; it oscillates
around 10%.

length  sentence                                                       average processor utilisation U [%]
3       “One pass only”                                                9.26
4       “There was no moon”                                            8.27
5       “She too began to weep”                                        9.35
6       “She must not think about time”                                10.99
7       “The form and the chaos remain separate”                       12.72
8       “The games were over , this was life”                          10.58
9       “Like Napoleon , he was the worst of losers”                   8.90
10      “In fact our whole defensive unit did a good job”              15.01
11      “I told him who I was and he was quite cold”                   8.36
12      “Nerves tight as a bowstring , he paused to gather his wits”   12.86
13      “I told him no , that I had had a very happy childhood”        10.00
14      “As he had longed to be , he became the echo of a saga”        10.48
15      “It is all around us and our only chance now is to let it in”  13.42
Table 4.2: The LAP design average processor utilization for a set of sentences extracted from
the SUSANNE corpus with lengths between 3 and 15 words.
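As a quick check of the "around 10%" observation, the values of table 4.2 can be summarised
with a few lines of Python. The variable names are ours; the numbers are copied from the
table.

```python
from statistics import mean

# Average processor utilisation U [%] per sentence length, from table 4.2.
utilisation = {
    3: 9.26, 4: 8.27, 5: 9.35, 6: 10.99, 7: 12.72, 8: 10.58, 9: 8.90,
    10: 15.01, 11: 8.36, 12: 12.86, 13: 10.00, 14: 10.48, 15: 13.42,
}

overall = mean(utilisation.values())
print(f"mean {overall:.2f}%, min {min(utilisation.values())}%, max {max(utilisation.values())}%")
# prints: mean 10.78%, min 8.27%, max 15.01%
```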
Figure 4.1: LAP design processor activity (BLACK+GRAY) when parsing a sentence of length
(a) 4 words, (b) 10 words and (c) 15 words. GRAY represents the time spent on chart read/write
accesses. BLACK represents the time spent on data processing.
4.2 Expected performance depreciation
As the length of the parsed sentence increases and more processors contribute to the parsing
process, the system’s performance (i.e. parsing time) does not scale at the expected rate. This
depreciation relative to the expected performance is the consequence of an increased number
of collisions (i.e. concurrent accesses) between the processors when accessing the chart
memory as the sentence length increases. The increased number of collisions can be observed
in figures 4.1(a)-(c), which depict the processor activity for three sentences randomly chosen
from the SUSANNE corpus. In these figures, the gray area, corresponding to the time spent by
the processors accessing the chart memory, increases as the sentence length increases,
becoming a significant part of the overall processor activity. However, to better illustrate
this phenomenon we will build our own grammar, which has the particularity that any
cell-combination in the chart requires the same amount of processing. In other words, such
a grammar guarantees that the processing is uniformly distributed among the processors during
the parsing.
The following example proposes such a grammar and uses it in an experiment in order to
illustrate this expected performance depreciation phenomenon.
Example 4.1 Let us consider the CNF grammar given by:
[Grammar definition: the symbol sets and production rules are not recoverable from the source.]
The symbol  does not have any meaning for this grammar and is not defined. Although of no
practical use, this grammar helps us illustrate the expected performance depreciation
phenomenon, and for this purpose we will use it to parse the sentences: "a a", "a a a",
"a a a a", . . . up to a similar sentence of length .
After initialization each cell in the bottom row of the chart contains the same set of
non-terminals. During parsing the grammar generates the same list of non-terminals for each
cell-combination performed. This means that at the end of the parsing each cell in the chart
contains the same set of non-terminals, and that during the parsing the processors work on
the same data each time two cells are combined. In conclusion, if we know the time spent by
a processor for performing a cell-combination (i.e. the parsing time for the sentence "a a"),
we can compute from it the expected parsing time for a sentence of any given length. Note
that when parsing the sentence "a a" only one processor is working and therefore no collisions
occur when accessing the chart.
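One plausible way to obtain such an expected parsing time is sketched below in Python. It is
our own reading, not the formula of the original text: we assume that the chart is filled row
by row, that every cell of a row has its own processor, that a cell in row k needs k - 1
cell-combinations, and that each cell-combination costs the same time t. The names
expected_parsing_time, n and t are hypothetical.

```python
def expected_parsing_time(n, t):
    """Collision-free parsing time for a sentence of n words (assumed model).

    Rows 2..n are filled one after the other; all cells of a row are processed
    in parallel, and a cell in row k needs k - 1 cell-combinations of cost t.
    """
    return sum((k - 1) * t for k in range(2, n + 1))


# With t taken as the parsing time of "a a", the model reproduces that time for n = 2.
print(expected_parsing_time(2, 100))   # 100
print(expected_parsing_time(4, 100))   # 600
```

Under these assumptions the expected time grows as t * n * (n - 1) / 2; any additional time
measured on the real design is then attributable to chart memory collisions.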
In reality the parsing time is greater than the computed expected parsing time. This is due,
as already noted, to the fact that as the sentence length increases, the number of collisions
when accessing the chart memory also increases.
Figure 4.2 illustrates the computed expected parsing time vs. the real parsing time when
parsing the sentences "a a", "a a a", "a a a a", . . . up to a similar sentence of length . The
figure 4.3 depicts the processor activity when parsing a sentence of length  and respectively
	 "a"s, with the grammar proposed above. Note that the time required for filling the first row
of the chart has increased due to the gray area (chart read/write accesses) while the black area
(processing time) stays unchanged. This can be observed also for the second and the third row.

[Plot: parsing time [ns] (axis scaled by 10^6) versus sentence length [words]; curves: real, expected]
Figure 4.2: LAP design real vs. expected parsing time when parsing the sentences "a a", "a a a",
"a a a a", . . . up to a similar sentence of length . Real parsing time is greater than the expected
parsing time which illustrates the expected performance depreciation phenomenon.

4.3 Conclusions
This chapter proposes a linear array of processors (LAP) FPGA-based hardware implementa-
tion of the CYK algorithm adapted for word lattice parsing that can deal with large-size real-life
Chomsky Normal Form CFGs.
The LAP design fills the chart in a row-by-row fashion – all the cells in a row in parallel –
and can parse any sentence of length   words or word lattice with   time stamps if 
processors are available. The parsing results are available at the design’s output as a compact
parse forest that can be used for further processing (e.g. a semantic module). The design can
be configured in FPGA chips such as the Xilinx Virtex/Virtex-E FPGA family available on the
market at the time this thesis was written. Concretely, in the particular case when the SUSANNE
grammar is used, the resources available in a Xilinx Virtex XCV1000-bg560 FPGA can fit a
LAP design with up to    processors that allows us to parse any sentence with up to 
words.
For the implemented LAP design, the performance measurements show an average speedup
of  for the hardware when compared with our software implementation of the enhanced-
CYK algorithm using a general context-free grammar and a speedup of 
 when compared
with the same software using a CNF version of the same grammar. These preliminary
experiments are encouraging for the application of the reconfigurable computing paradigm to
NLP applications requiring efficient parsing with large-size real-life context-free grammars.
A -processors LAP design was synthesized and tested for validation on the commercial
RC1000-PP FPGA board.
The initialisation of the LAP design and its interface were kept as simple as possible
in order to allow an easy integration within a larger system. The FPGA, configured with
the LAP design, can be further integrated within a larger system, such as an FPGA board, for
building a hardware tool that can work as an accelerator in an application framework (e.g. NLP)
that requires efficient parsing of context-free languages. An example of such an application
framework, namely a Vocal Information Server, is described in section 1.4 and depicted in
figure 1.2.

Figure 4.3: LAP design processor activity for a sentence of length (a)  and respectively (b)
 words. The larger gray area is the reason for the expected performance depreciation
phenomenon.
The LAP design was the starting point in exploring the behaviour and characteristics of a
hardware implementation of the CYK algorithm when dealing with large-size real-life
grammars. The LAP design was the first step of our design methodology and was useful for:
• proving the feasibility of an FPGA-based hardware implementation of the CYK algorithm
both in terms of hardware resources and data structures used for storing the chart
and the grammar memories;
• getting acquainted with the real-life behaviour of the CYK algorithm;
• having a first evaluation of the speed-up factor over a software implementation.
During design implementation and testing (1) we acquired knowledge about the required FPGA
resources, the size of the grammar memories (storing the grammar) and the chart memory
(storing the chart), (2) we identified the critical regions and features of the design that are
subject to further improvements and (3) we showed that a hardware implementation of the
CYK algorithm offers a substantial advantage over a software implementation.
From the performed experiments and performance measurements we note a number of
drawbacks of the presented design:
• for a given sentence length, increasing the number of processors in the system does not
increase the performance when parsing the same sentence. A larger number of processors
only allows the system to parse longer sentences;
• as the sentence length grows, the number of working processors grows, and there is an
increased number of interprocessor collisions when accessing the chart memory. This
results in a performance depreciation as the sentence length grows. The accesses to the
guard-vectors are the main reason for this phenomenon, and in future implementations
an alternative to the guard-vectors should be found;
• the analysis performed on some sentences extracted from the SUSANNE corpus shows
that the average processor utilization is around 10%. A better processor control is
required in order to better exploit the parallelism available in the CYK algorithm and
therefore to increase the average processor utilization;
• the initialization procedure is relatively complex and time consuming. Again, the source
of this problem is the guard-vectors, which require a ’0’-filling before each new parsing;
• the design is only able to deal with CNF grammars. A design that could cope with general
context-free grammars is sought;
• the design does not yet have the ability to recover from fatal errors such as the one
encountered when exceeding the maximum number of non-terminals in a cell (see
section 3.4.1), nor does it integrate the unit that extracts the compact parse forest on-line.
Although the above mentioned features are not key for a design aiming at proving the
feasibility of the FPGA technology for implementing the CYK algorithm, they are
requirements for a future design.
Chapter 5
The Dynamic Array of Processors Hardware Design
This chapter presents an improved design architecture implementing the CYK algorithm adapted
for word lattice parsing. The proposed architecture is the second step of our design
methodology, during which:
• we investigate a better processor allocation method while also increasing the maximum
sentence length that can be parsed;
• we refine and extend our background knowledge about the real-life behaviour and critical
features of the CYK algorithm before attempting to design an architecture for
the enhanced-CYK algorithm;
• we improve the speed-up factor.
In order to exploit the parallelism available in the CYK algorithm better than the
row-by-row method used in the LAP design does, a better processor allocation method is
implemented within the current design. The implemented processor allocation method tries to
process a cell-combination as soon as the cells it relies upon become available. In other words,
the implemented processor allocation method tries to "follow" the dataflow in the chart.
Because the processors are assigned dynamically to process cell-combinations we will
call the current design a Dynamic Array of Processors, henceforth referred to as DAP, and
the processor allocation mechanism the dynamic processor allocation method. The dynamic
processor allocation method is studied in appendix B.2 in the particular case of the SUSANNE
grammar.
Another improvement is that the maximal sentence length that can be parsed with the DAP
design is independent of the number of processors in the system. The maximal sentence length
only depends on the size of the available chart memory and does not depend anymore on the
resources available in the used FPGA. In other words, if the syntactic analysis of a sentence fits
in the chart memory, then any number of processors can parse the sentence and the number of
processors we can fit in the FPGA will only influence the parsing performance.
While essentially the same architecture will be used for implementing the enhanced-CYK
algorithm the analysis and performance measurements of the proposed design architecture will
add more background knowledge and also help to identify the critical spots of such an archi-
tecture. The last point is an argument for the dynamic processor allocation method.
This chapter starts with a general and functional system description. It continues by present-
ing the system units in detail. The design performance measurements and an analysis discussing
some key features of the proposed design are given before concluding this chapter.
5.1 General system description
The design proposed in this chapter can parse sentences of any length, regardless of the number
of processors in the system, given that the chart memory is large enough to store the chart
data-structure. As for the LAP design, the input to the DAP design is an initialized chart
and grammar lookup tables and the output is a compact parse forest. The DAP design uses
SRAMs for storing the data-structures representing the chart (stored in the chart memory) and
the CNF grammar (stored in grammar memories). These data-structures are the same as those
employed by the LAP design (see section 3.4.1 for the chart data-structure and section 3.4.2 for
the grammar memory data-structure).
As in the current design the maximal length of the sentence that can be parsed does not
depend on the number of processors, there is no need to cluster the processors around grammar
memories. This was only necessary with the LAP design when the length of the parsed sentence
– and therefore the number of processors – was exceeding the number of available grammar
memories. In the current design there are as many grammar memories as processors in the
system. However, as the number of pins available for an FPGA is limited, only a limited
number of grammar memories can be connected to the FPGA which will also limit the number
of processors. Due to the fact that most of a processor’s processing time is spent in accessing
the grammar memory, having only one processor per grammar memory will result in a more
efficient grammar memory utilization (no time is lost for grammar memory cluster arbitration).
The chart memory is shared by all the processors in the system as it was the case for the
LAP design. All the processors in the system implement a procedure for accessing the chart
data-structure stored in the chart memory in order to fetch the sets required for performing the
cell-combinations they are assigned to compute.
Each processor in the DAP design is dynamically assigned to perform cell-combinations
on the chart. The difficulty with such a method comes from the fact that a cell-combination
cannot be issued before the cells it relies upon are available (i.e. processed). In other
words, the design has to deal with dynamic data-dependencies at runtime. In the LAP design
the data-dependency issue was simply solved by (re)synchronizing the processors after each
filled row. While the (re)synchronization mechanism has the advantage of being easy to
implement, it does not yield good results: the average processor utilization measured for the
LAP design is around 10% (see section 4.1). For this reason the DAP design implements a more
sophisticated mechanism that starts to process the cell-combinations as soon as the cells
(i.e. sets) they rely upon become available.
The general architecture of an n-processor system implementing the dynamic allocation
method is depicted in the block diagram in figure 5.1. The elements inside the dashed line
are implemented within the FPGA chip. The other elements (chart and grammar memories
GMi) are implemented in SRAM chips present on the system board. The system’s initialization
procedure as well as its interface with the external world did not change and are identical to
those described for the LAP design. A system with 3 processors was synthesized and
physically tested (i.e. the results were validated for correctness) on the commercial RC1000-PP
FPGA board containing a Xilinx Virtex XCV1000bg560-4 FPGA. As the RC1000-PP FPGA
board contains 4 SRAM chips, one is used for storing the chart and 3 for storing identical copies
of the CNF grammar data-structure, one for each processor.
[Block diagram; labels: SEQ-GEN, FIFO, CHECKER, POOL, D-TABLE, CTX MEMORY, DISPATCHER,
processors P1, P2, P3, ..., Pn, grammar memories GM1, GM2, GM3, ..., GMn, WRITER,
CHART ARBITER, CHART MEMORY, IO-CTRL, SLEN, DBUS, WBUS, startPARSE, overPARSE]
Figure 5.1: The block diagram of an n-processor DAP design.
5.2 Functional description
The following initialisations are necessary before the parsing can start:
• grammar memories: before any sentence can be parsed, each of the grammar memories
has to be configured with the binary image of the data-structure representing the CNF
grammar (for details see section 3.4.2). The initialization of the grammar memories is
done by software running on the on-board processor or on the host system. As
the grammar data-structure does not change during the parsing, the grammar memories
only have to be configured once for multiple parsings with the same grammar. They have
to be reconfigured only if a different grammar is to be used;
• the chart memory: the initialization of the chart memory is done by software
running on the on-board processor or on the host system. It consists of initializing certain
cells of the chart with sets of non-terminals along with their corresponding
guard-vectors (see section 3.4.1);
• sentence length: the global controller (IO-CTRL) is initialised before every parsing with
the length of the sentence to be parsed. Based on the sentence length the SEQ_GEN
will generate the sequence of cell-combinations required for parsing the sentence. The
sentence length is configured in a register;
• wait cycles: because the chart memory is accessed by a relatively large number
of processors, the logic required to access this memory may have a large propagation
time (i.e. delay) and in consequence the chart memory access cycles may be long. In
order to keep the frequency of the rest of the system high, wait cycles are introduced in the
chart memory access cycles. As the number of necessary wait cycles cannot be foreseen
from the beginning (it depends on the way the signals are routed, the number
of processors and other factors), we use a register for configuring the number of wait
cycles. A first estimate of the number of wait cycles required to access the chart memory
is known after the design is synthesized. Within the current design 1 to 8
wait cycles can be pre-configured (default value) in the VHDL code with a generic.
Among these initializations, the initialization of the chart memory is the most time consuming
and may represent an important part of the overall processing time.
Once the system has been initialized according to the procedure described above, the parsing
can start when the signal startPARSE is activated. At this moment the sequence generator
unit SEQ_GEN (see section 5.3.1 for details) starts to generate triples made of two
chart source cells that need to be combined (see lines 7-9 of the CYK algorithm on
page 13) along with their corresponding destination chart cell, where the combination result
will be stored.
Note: A source cell or the destination cell in a triplet is in fact a pair of coordinates
representing the row and the column of that cell. For instance, the first source cell is written
as a pair giving its row and its column. The number of bits used to represent
a row/column is configured in the VHDL code with a generic. The value allocated to this
parameter limits the length of the sentence that can be parsed. For instance, with a value
of 5, only sentences with up to 32 words can be parsed.
These triples depend only on the length of the parsed sentence: the same series of triples is
generated for any two sentences of the same length. The triples are then passed to the CHECKER
unit (see section 5.3.2 for details), whose task is to check whether the source chart cells are
ready, i.e. are not destinations of unfinished previous cell-combinations. In other words, the
CHECKER verifies that the data-dependency among the sets in the chart is satisfied. For
doing this, the CHECKER uses a table, the D-TABLE (see section 5.3.6 for details), that stores
all destination cells currently under processing. The CHECKER inserts a new destination cell
in the D-TABLE when it is encountered for the first time in a triplet and deletes it when all
the triples containing that destination cell have been treated. Each time the CHECKER inserts
a new destination cell in the D-TABLE it does two things: (1) it sets some "context
information" for the destination cell and stores it in the CTX-memory and (2) it allocates a
unique identifier ID for the destination cell. The CHECKER tags all subsequent triplets
containing the destination cell with the identifier ID before sending them to the DISPATCHER.
A triplet that passes the CHECKER’s test for data-dependency is forwarded to the task
dispatching unit DISPATCHER, while a triplet that does not pass the CHECKER’s test for data-
dependency is stored in the POOL (see section 5.3.3 for details) where it waits until the data-
dependency is satisfied. In the POOL all triplets are continuously checked for data-dependency
against the D-TABLE and as soon as a triplet passes the data-dependency test it is forwarded to
the CHECKER (that will update the D-TABLE) and immediately further to the DISPATCHER.
The CHECKER has therefore two inputs, one from the POOL and the other from the SEQ_GEN
– both passing through FIFO memories (see figure 5.1). The triplets coming from the POOL
have higher priority and are handled first. The reason for giving higher priority to the triplets
coming from the POOL is that, for a SEQ_GEN generating triplets in a natural chronological
order, the POOL stores older triplets, which need to be treated first.
The DISPATCHER unit (see section 5.3.4 for details) has the task of distributing the tasks
(i.e. triplets and their associated unique ID) to the available processors, which then perform
the cell-combinations. The result of the cell-combinations (i.e. the processors’ output) is
fetched by the WRITER and finally stored in the chart memory if necessary. The WRITER uses the
ID provided by the flushed processor to retrieve the "context information" of the associated
destination cell in which the processing results will be stored. After fetching a processor’s
output the WRITER also updates the data-dependency information in the D-TABLE.
The overPARSE signal indicates the end of the parsing. It is activated as soon as the
SEQ_GEN unit has generated all the triplets according to the length of the parsed sentence
and there are no more triplets under processing. The parse result is available at an output
of each processor (not represented in figure 5.1) and can be collected for building
the compact parse forest.
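The flow of triplets described above can be made concrete with a toy Python model. This is
only a sketch under our own assumptions, not the VHDL design: every cell-combination takes
one time step, a cell (i, j) is the cell in row i and column j with row 1 being the initialized
bottom row, the FIFOs, the CTX-memory and the IDs are left out, and all function and variable
names are hypothetical.

```python
from collections import deque


def gen_triplets(n):
    """Row-by-row triplets (source1, source2, destination) for n words."""
    for i in range(2, n + 1):
        for j in range(1, n - i + 2):
            for k in range(1, i):
                yield (k, j), (i - k, j + k), (i, j)


def simulate_dap(n, num_procs):
    """Return the (step, triplet) dispatches of the toy model."""
    seq = deque(gen_triplets(n))   # SEQ_GEN output
    pool = []                      # POOL: triplets waiting for a source cell
    d_table = {}                   # D-TABLE: destination -> unfinished triplets
    running, log, step = [], [], 0

    def ready(t):                  # CHECKER test: no source is still a destination
        return t[0] not in d_table and t[1] not in d_table

    while seq or pool or running:
        for _, _, dest in running:             # WRITER updates the D-TABLE
            d_table[dest] -= 1
            if d_table[dest] == 0:
                del d_table[dest]
        running = []
        for t in [t for t in pool if ready(t)]:    # pooled triplets go first
            if len(running) < num_procs:
                pool.remove(t)
                running.append(t)
        while seq and len(running) < num_procs:    # then fresh triplets
            t = seq.popleft()
            d_table.setdefault(t[2], t[2][0] - 1)  # a row-i cell needs i - 1 triplets
            if ready(t):
                running.append(t)
            else:
                pool.append(t)
        log.extend((step, t) for t in running)     # DISPATCHER hands out the tasks
        step += 1
    return log


for step, triplet in simulate_dap(5, 3):
    print(step, triplet)
```

In this model no triplet is dispatched before both of its source cells are complete, and
processors are never forced to wait for the end of a row as they are in the LAP design.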
5.3 Design units
5.3.1 The sequence generator (SEQ_GEN) unit
The sequence generator is presented in the block diagram (see figure 5.1) as the SEQ_GEN
unit. Its task is to generate all triples made of two chart source cells that
need to be combined, along with their corresponding destination chart cell, where the
cell-combination result will be stored. A source/destination cell in a triplet is in fact a pair
of coordinates representing the row and column of that cell. For the CYK algorithm (see
section 2.2) and the enhanced CYK algorithm (see section 2.3) respectively, the triplets are
generated in the corresponding lines of the algorithms, which give the coordinates of the
destination cell and of the two source cells.
The following example illustrates the sequence of triples generated by the SEQ_GEN unit each
time a sentence of length 5 is parsed.
Example 5.1 In figure 5.2 we have a possible sequence of all the generated triplets. The
triplets are ordered according to the time instant at which they are generated. Each entry of
the figure gives a time instant together with the triplet generated at that time, i.e. the two
source cells to be combined into the destination cell.

[Table: for each chart row from the 2-nd to the 5-th, the triplets generated for that row, each
with its generation time instant]
Figure 5.2: A (possible) sequence of triples generated when a sentence of length 5 is parsed.

The sequence of triplets generated by the SEQ_GEN unit is always the same for a given
sentence length. In the above example and in general we assume that the destination cells are
visited in a row-by-row manner. This order of visiting (i.e. processing) the destination cells is
natural as every cell in a row depends on cells in the rows below it.
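A row-by-row sequence of this kind can be sketched in Python as follows. The coordinate
convention is an assumption made for this sketch: cell (i, j) is the cell in row i and column j,
covering i consecutive words starting at word j, with row 1 being the initialized bottom row.
The order of the triplets inside a row is just one possible choice and may differ from the
one shown in figure 5.2; the name seq_gen is ours.

```python
def seq_gen(n):
    """Return the triplets (source1, source2, destination) for a sentence of n words."""
    triplets = []
    for i in range(2, n + 1):              # destination row
        for j in range(1, n - i + 2):      # destination column
            for k in range(1, i):          # split: k words followed by i - k words
                triplets.append(((k, j), (i - k, j + k), (i, j)))
    return triplets


for number, triplet in enumerate(seq_gen(5), start=1):
    print(number, triplet)
```

For a sentence of 5 words the sketch produces 20 triplets: 4, 6, 6 and 4 for rows 2 to 5.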
However, there are several possible orderings for the generated sequence of triplets,
corresponding to different orders of processing the cells in a given row. The best order in
which the triplets can be generated is a sequence with chronologically ordered source cells.
Let us illustrate this with the following example.
Example 5.2 In the previous example, consider the sequence of triplets generated for a given
destination cell. In this sequence, one triplet comes after another although a source cell of
the earlier triplet has less chance of being available than either of the source cells of the
later triplet. A better ordering would therefore generate the later triplet first.

In other words, a triplet will be generated before another triplet if both of its source cells
were produced (as destinations) before the later produced of the other triplet’s two source
cells. Such a triplet ordering guarantees a priori the processing of the cell-combinations in
chronological order and gives the triplets the best chances to pass the data-dependency test
performed by the CHECKER.
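The ordering criterion can be sketched in Python by sorting the triplets of each destination
row on the later produced of their two source cells. This is only an illustration under our
own assumptions, not the counter-based implementation of the SEQ_GEN unit: cell (i, j) is the
cell in row i and column j with row 1 being the bottom row, and cells are assumed to be
produced in row-major order. The resulting sequence may differ from the one in figure 5.3.

```python
def seq_gen_chronological(n):
    """Triplets for n words, each destination row sorted by source availability."""
    ordered = []
    for i in range(2, n + 1):
        row = [((k, j), (i - k, j + k), (i, j))
               for j in range(1, n - i + 2)
               for k in range(1, i)]
        # Key: the later produced of the two source cells. Tuples (row, column)
        # compare in row-major order, which is the assumed production order.
        row.sort(key=lambda t: max(t[0], t[1]))
        ordered.extend(row)
    return ordered


for triplet in seq_gen_chronological(5):
    print(triplet)
```

With this ordering, successive triplets of a row no longer necessarily belong to the same
destination cell.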
In figure 5.3 we have the example of a sequence of triplets generated for the same sentence
length but with chronologically ordered source cells. Note that successive triplets do not
necessarily belong to the same destination cell. The sequence generation method illustrated in
this example is the one actually implemented in the SEQ_GEN unit; it consists of a set of
counters and adder/subtracter units and is easy to implement.

[Table: for each chart row from the 2-nd to the 5-th, the triplets generated for that row, each
with its generation time instant]
Figure 5.3: Another (possible) sequence of triplets, but with chronologically ordered
source cells, generated for the same sentence length 5.
The output of the SEQ_GEN unit is forwarded to the CHECKER through a FIFO memory
that decouples the two units. The size of this FIFO memory is configured in the VHDL code
with a generic. Only a small FIFO memory is required (i.e. 2-4), as the
SEQ_GEN unit is fast and will always keep the FIFO memory full.
5.3.2 The data-dependency checking (CHECKER) unit
The task of the CHECKER unit is to verify that the data-dependency among the triplets coming
from the sequence generator is satisfied. To do so, the CHECKER makes use of the destination table D-TABLE, which stores all destination cells currently under processing. Whenever a
triplet (S1, S2, D) coming from the sequence generator contains a source cell, either S1 or S2,
that is in the D-TABLE (i.e. that source cell is a destination under processing), the CHECKER
concludes that the triplet fails the data-dependency test and that the task associated with
it cannot yet be dispatched to the processors. At this point, the CHECKER has two possibilities: (1) either it waits for the triplet to satisfy the data-dependency constraint or (2) it stores
the problematic triplet in a buffer and continues to process the following triplets coming from
the sequence generator. The second solution does not stall the triplets that follow and was implemented within
the current design. The triplets that do not pass the data-dependency test are stored in
a buffer, represented in the block diagram (see figure 5.1) as the POOL unit, where they are
continuously checked for data-dependency against the D-TABLE. As soon as a triplet stored in
the POOL passes the data-dependency test, it is forwarded back to the CHECKER and treated
by the latter with higher priority than the triplets coming from the sequence generator. This is
reasonable in order to process the cell-combinations (i.e. triplets) in the chronological order in
which they were generated. On the other hand, if the buffer in the POOL unit is full, the only
possibility is to wait until either a triplet in the POOL or the triplet to be stored in the POOL
passes the data-dependency test.
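The decision logic described above can be sketched in a few lines of Python. This is a simplified software model under stated assumptions, not the hardware implementation: the D-TABLE is reduced to a set of cells, the POOL to a list, the insertion of destination cells into the D-TABLE is left out, and the names `passes_dependency_test` and `checker_step` are hypothetical.

```python
from collections import deque


def passes_dependency_test(triplet, d_table):
    """A triplet passes if neither source cell is a destination in progress."""
    s1, s2, _dest = triplet
    return s1 not in d_table and s2 not in d_table


def checker_step(incoming, pool, d_table, pool_size):
    """Pick the next triplet to dispatch, or None if the CHECKER must wait.

    `incoming` is a deque fed by the sequence generator, `pool` a list of
    parked triplets, `d_table` a set of destination cells under processing.
    """
    # Parked triplets that now pass are served first (higher priority).
    for i, t in enumerate(pool):
        if passes_dependency_test(t, d_table):
            return pool.pop(i)
    # Otherwise look at the triplets coming from the sequence generator.
    while incoming:
        t = incoming[0]
        if passes_dependency_test(t, d_table):
            return incoming.popleft()
        if len(pool) >= pool_size:
            return None  # POOL full: nothing to do but wait
        pool.append(incoming.popleft())
    return None


d_table = {(2, 1)}  # cell (2, 1) is still being filled
incoming = deque([
    ((2, 1), (1, 3), (3, 1)),  # depends on (2, 1): must be parked
    ((1, 2), (1, 3), (2, 2)),  # independent: can be dispatched
])
pool = []
first = checker_step(incoming, pool, d_table, pool_size=4)
d_table.discard((2, 1))  # (2, 1) has now been completed
second = checker_step(incoming, pool, d_table, pool_size=4)
print(first, second)
```

In this run the dependent triplet is parked, the independent one is dispatched first, and the parked triplet is served as soon as its source cell leaves the D-TABLE.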
A destination cell that is encountered for the first time in a triplet that passed the data-dependency test will trigger the insertion of a new entry in the D-TABLE. The index in the D-TABLE where the destination cell is inserted will be used as an identifier, henceforth referred
to as ID, for the destination cell under discussion. Every destination cell under processing therefore has
a unique identifier that is sent along with all the triplets containing the same destination
cell. Also, when a destination cell is inserted in the D-TABLE, a certain amount of "context information" is set up for it. This "context information" is stored in an internal memory, referred
to as the context memory and represented in the block diagram as the CTX-memory. In the
targeted Virtex family FPGAs the CTX-memory is implemented with on-chip memory primitives. The CTX-memory stores, for each unique destination cell, information about the physical
memory address of the non-terminals list in the chart table, the physical memory address of the
guard-vectors in the chart table, the number of non-terminals and so on. A processor receives
the ID with the triplet to process and, when it finishes the processing, forwards the ID along
with the processing results to the WRITER. The latter uses this ID for retrieving the "context
information" from the CTX-memory in order to store the processing results in the associated
destination cell.
The first empty entry in the D-TABLE is allocated to the CHECKER whenever a new
destination cell is inserted. If there are no free entries in the D-TABLE, the CHECKER will
wait until an entry is freed. This will happen as soon as a destination cell has been processed.
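As an illustration of how the entry index can serve as the ID and as the key into the context memory, consider the following Python sketch. It is a toy model under stated assumptions: the class `DestinationTable`, the context field names (`nt_list_addr`, `guard_addr`, `nt_count`) and the address values are all made up for the example and do not come from the design.

```python
class DestinationTable:
    """Toy model: the entry index doubles as the destination cell's ID."""

    def __init__(self, size):
        self.entries = [None] * size  # None marks a free entry
        self.ctx = [None] * size      # stand-in for the CTX-memory

    def lookup(self, cell):
        """Return the ID of `cell` if it is under processing, else None."""
        for idx, stored in enumerate(self.entries):
            if stored == cell:
                return idx
        return None

    def insert(self, cell, context):
        """Allocate the first free entry; return its ID, or None if full."""
        for idx, stored in enumerate(self.entries):
            if stored is None:
                self.entries[idx] = cell
                self.ctx[idx] = dict(context)
                return idx
        return None  # table full: the CHECKER has to wait

    def release(self, ident):
        """Free an entry once its destination cell has been processed."""
        self.entries[ident] = None
        self.ctx[ident] = None


table = DestinationTable(size=2)
a = table.insert((2, 1), {"nt_list_addr": 0x100, "guard_addr": 0x800, "nt_count": 0})
b = table.insert((2, 2), {"nt_list_addr": 0x140, "guard_addr": 0x840, "nt_count": 0})
c = table.insert((2, 3), {"nt_list_addr": 0x180, "guard_addr": 0x880, "nt_count": 0})
table.release(a)
d = table.insert((2, 3), {"nt_list_addr": 0x180, "guard_addr": 0x880, "nt_count": 0})
print(a, b, c, d)
```

The third insertion fails because both entries are in use; after the first entry is released, the same cell is inserted successfully and reuses ID 0.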
The output of the CHECKER unit is forwarded to the DISPATCHER unit through a FIFO
memory that decouples the two units. The size of this FIFO memory is configured in the VHDL
code with a generic. Ideally this FIFO memory should be neither empty nor
full, so that the DISPATCHER always has tasks to distribute to the idle processors.
5.3.3 The triplets buffer (POOL) unit
A triplet coming from the sequence generator and containing a source cell that is in the D-TABLE (i.e. the source cell is a destination under processing) will not pass the data-dependency test performed by the CHECKER. In this case the task associated with the triplet
cannot be dispatched to the processors. As already mentioned, the CHECKER
will then store the problematic triplet in a buffer, represented in figure 5.1 as the POOL unit.
This unit is detailed in figure 5.4 and has two tasks: (1) to store the triplets that did not pass
the data-dependency test and (2) to periodically check all the stored triplets for data-dependency against the D-TABLE. In the latter case, as soon as a triplet in the POOL satisfies
the data-dependency test, it is forwarded back to the CHECKER and treated by the latter
with higher priority than the triplets coming from the sequence generator. The POOL unit is
implemented as a pile of registers (represented in figure 5.4 as R0, R1, ..., Rn-1) whose size is
configured in the VHDL code with a generic. Each such register can store a
triplet. During run-time it is possible that the register pile is full, in which case no more triplets
can be stored and the CHECKER will eventually have to wait until an entry in the register pile
is freed and can be reused.
The POOL unit implements the following functionalities in parallel (the signals related to each functionality are grouped at the bottom of figure 5.4):
• write a triplet: requested by the CHECKER for storing a triplet. A write always takes
place at the top of the register pile, pointed to by the write pointer writePTR, which is
incremented after each write operation. Initially (i.e. after reset and before each parsing),
the top of the register pile is the register R0, after the first write the top is the register
R1, and so on. The flag fullPOOL is used to indicate that there is no more place left in
the register pile. A write should not take place if the fullPOOL flag is active, and it is the
CHECKER’s task to verify the fullPOOL flag before writing a triplet in the POOL.
The triplet to be stored in the POOL is placed on the input tripletIN and is written in
the register pointed to by writePTR when the signal write is activated.
• cyclic triplets check: for a fast (i.e. efficient) retrieval of the triplets in the register pile
during the cyclic check against the D-TABLE, the triplets are kept contiguous in the
registers R0, R1, ..., up to the current top register of the register pile, pointed to by
writePTR. With this assumption, which requires a careful implementation of the read operation (see the extract operation below), it is possible to access the triplets in the register
pile by using a counter cyclicPTR that counts cyclically from 0 (when addressing register R0) to the current value of writePTR, corresponding to the top of the register
pile.
A triplet is checked against the D-TABLE under the control of the local control unit.
The signals used for interfacing with the D-TABLE are the two source cells of a triplet,
available on the output signal tripletS1S2, and the check signal that initiates the
search in the D-TABLE.
• extract a triplet: a read on the register pile is performed each time a triplet passes the
data-dependency test and is returned to the CHECKER. A read can take place anywhere
in the register pile, as there is no rule on the order in which the triplets pass the data-dependency test. Therefore, read operations may cause gaps in the register pile; these are
eliminated by performing shifts in order to keep the stored triplets contiguous (required
for an efficient hardware implementation of the cyclic check).
A read is immediately performed on a register that passed the data-dependency check
against the D-TABLE and which is therefore pointed to by cyclicPTR. The cyclicPTR
pointer is used to activate the shift signals shift_i for all the registers from the one pointed to by
cyclicPTR up to the top of the register pile. If the signal shift_i is activated, the content of register
Ri+1 is moved to register Ri. The read signal is generated by the local control unit
and means that the register pointed to by cyclicPTR is available and stable at the output
tripletOUT.
Figure 5.4: The POOL unit datapath
Note that, during run-time, it is possible to have concurrent write and read operations on the
register pile. The current hardware implementation can deal with this situation without having
to arbitrate the concurrent accesses. In order to illustrate the working of the POOL, let us consider
the following example:
Example 5.3 For a register pile of size 8, with the initial content illustrated in figure 5.5(a),
we illustrate the content of the registers after: the read of the triplet t2 in figure 5.5(b), the
write of the triplet t5 in figure 5.5(c), and a concurrent read and write of the triplets t1 and
respectively t6 in figure 5.5(d).
In the current design a FIFO memory is used for decoupling the POOL and CHECKER units.
The size of this FIFO memory is configured in the VHDL code with a generic.
A small size (e.g. 2-4) should be sufficient.
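The register pile operations of Example 5.3 can be replayed with a small behavioral model. The Python sketch below is not cycle-accurate and is not derived from the VHDL code; the class `RegisterPile` and its method names are hypothetical, and a concurrent read and write is modeled as a read followed by a write within the same step.

```python
class RegisterPile:
    """Behavioral model of the POOL register pile (not cycle-accurate)."""

    def __init__(self, size):
        self.regs = [None] * size
        self.write_ptr = 0  # index of the first unused register

    @property
    def full(self):
        return self.write_ptr == len(self.regs)

    def step(self, write=None, read_index=None):
        """Apply an optional read and an optional write in the same step."""
        extracted = None
        if read_index is not None:
            extracted = self.regs[read_index]
            # Shift the registers above the gap down by one position.
            for i in range(read_index, self.write_ptr - 1):
                self.regs[i] = self.regs[i + 1]
            self.write_ptr -= 1
            self.regs[self.write_ptr] = None
        if write is not None:
            if self.full:
                raise RuntimeError("POOL full: the CHECKER must wait")
            self.regs[self.write_ptr] = write
            self.write_ptr += 1
        return extracted

    def contents(self):
        return self.regs[:self.write_ptr]


pile = RegisterPile(8)
for t in ["t0", "t1", "t2", "t3", "t4"]:
    pile.step(write=t)               # (a) initial content
pile.step(read_index=2)              # (b) read t2
pile.step(write="t5")                # (c) write t5
pile.step(write="t6", read_index=1)  # (d) read t1 and write t6
print(pile.contents())               # ['t0', 't3', 't4', 't5', 't6']
```

After every step the stored triplets remain contiguous from the first register upwards, which is the property the cyclic check relies on.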
Figure 5.5: Typical POOL operations on the register pile: read (read t2), write (write t5) and
concurrent read/write (read t1, write t6)
5.3.4 The task dispatching (DISPATCHER) unit
A triplet that passed the CHECKER’s data-dependency test will be allocated to an idle processor
for processing. The allocation of triplets (i.e. cell-combinations) to idle processors is the task
of the DISPATCHER unit (see figure 5.1). The state of a processor in the system is either idle
or busy, and the precise task of the DISPATCHER unit is to find an idle processor to which
the triplet can be dispatched. It may be that all the processors in the system are busy, in which
case the DISPATCHER unit will have to wait until a processor becomes idle. The mechanism
implemented in the current design for finding an idle processor is a continuous polling over all
the processors in the system. Even though the polling implementation has in general a higher
latency than an equivalent asynchronous implementation, it does not degrade
the global system clock when the number of processors in the system grows. It is precisely for
this reason that polling was adopted.
The triplet dispatching solution implemented within the current design is illustrated in figure 5.6. The state of processor Pi is available on its output signal statePi (’1’ if busy, ’0’
when idle) and further to the DISPATCHER on the stateDISPATCHER input. Only one processor state is available to the DISPATCHER at a given moment on the stateDISPATCHER
input; it corresponds to the processor whose output tri-state gate is driven open (only one
tri-state gate is open at a time). The tri-state gate on the output of processor Pi is driven by the
flip-flop SRDi of the rotating shift-register SRD. After system reset, only the flip-flop SRD0
of the rotating shift-register is set to ’1’, all the others to ’0’. The content of the shift-register
is rotated one position each time the signal nextSTATE is activated by the DISPATCHER, and
the bit set to ’1’ drives open, one after the other, the tri-state gates of the processors, making
their state information available to the DISPATCHER. The DISPATCHER keeps the
signal nextSTATE activated until it finds an idle processor.
When the DISPATCHER finds an idle processor Pi, it sends the triplet along with its associated ID to the processor using the bus DBUS and activates the signal GO that will start the
processing for the processor. A processor Pi knows that the information on the DBUS refers to
it from the signal SRDi (available on its input activePi) that is used to identify the processor
that communicates with the DISPATCHER.
Figure 5.6: The interface between the DISPATCHER and the processors. Triplet dispatching
solution implemented in the DAP design.
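The polling scheme can be mimicked in software with a one-hot selector that is rotated until it points at an idle processor. The following Python sketch is a simplified model, assuming that one rotation corresponds to one activation of the rotate signal and that the selector advances towards higher processor indices; the function name `find_idle` and its parameters are hypothetical.

```python
def find_idle(states, hot, max_steps=None):
    """Rotate a one-hot selector until it points at an idle processor.

    `states[i]` is 1 if processor i is busy and 0 if it is idle; `hot` is the
    index of the flip-flop currently set to '1'. Returns (index, rotations),
    or (None, rotations) if no idle processor was seen within `max_steps`.
    """
    n = len(states)
    if max_steps is None:
        max_steps = n
    rotations = 0
    while rotations < max_steps:
        if states[hot] == 0:   # the only state visible on the shared line
            return hot, rotations
        hot = (hot + 1) % n    # one activation of the rotate signal
        rotations += 1
    return None, rotations


states = [1, 1, 0, 1]          # only processor 2 is idle
idx, steps = find_idle(states, hot=0)
print(idx, steps)              # 2 2
```

The model makes the trade-off visible: the worst-case number of rotations grows with the number of processors, while the work done per rotation stays constant.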
When dispatching a triplet, the DISPATCHER will also update the information in the D-TABLE associated with the destination cell corresponding to the ID that tags the triplet (see
section 5.3.6 for details).
5.3.5 The WRITER unit
Within the current design the processors do not access the chart memory directly in order to
store the processing results of the cell-combinations they perform. Instead, they use the
WRITER as an interface for storing the processing results in the chart memory. Such a mechanism is intended to reduce the number of collisions when accessing the chart memory. For
fetching the processing results from the processors, the WRITER uses a polling mechanism
similar to the one implemented for the DISPATCHER and continuously checks the processors
for two conditions: (1) whether a processor’s processing results need to be flushed and (2)
whether the processor has finished the processing.
In the first case the WRITER checks whether a processor needs to flush its output buffer
(i.e. a FIFO memory, see section 5.3.7) containing the (partial) cell-combination result. A
processor requests flushing each time the output buffer is filled, or almost filled, with processing
results. Note that a processor requesting flushing has not necessarily finished the cell-combination
it is working on. A processor with a filled output buffer stays idle until it is flushed, and therefore
triggering the flushing before the processor’s output buffer becomes full is preferable. If a
processor’s flush request is triggered when the output buffer is only partly full, for instance, the
processor can hopefully keep working while it is being flushed.
For flushing a processor’s output buffer, the WRITER needs the "context information" of
the destination cell the flushed processor is working on. Recall that the context information
for each distinct destination cell under processing is stored in the CTX-memory and that this
information can be retrieved by using the unique identifier ID that the DISPATCHER sends
along with a triplet to a processor. Each processor stores the ID of the destination cell it is working
on in a local register, and the WRITER retrieves this ID each time it communicates with a processor (i.e. flushes
its results). Using this ID the WRITER reads the CTX-memory and sets
up the environment (i.e. the physical addresses used for accessing the chart memory, ...) for the
destination cell in which the flushed data will be stored. The environment setup will henceforth be referred
to as a context switch. Note that when flushing a new processor it may be the case
that the context does not need to be switched. This is the case when the new processor to be
flushed has the same ID, and therefore works on the same destination cell, as the previously
flushed processor. Once the processor’s output data is fetched and stored in the chart memory,
the WRITER updates the destination cell context information and writes it back in the CTX-memory. At this point the WRITER will continue to poll the processors.
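The context switch logic can be sketched as follows. This Python model is an illustration under stated assumptions rather than the WRITER datapath: the context is reduced to a base address and an index, the chart address is assumed to be their sum, guard-vector handling is omitted, and the class `Writer`, its fields and the address values are hypothetical.

```python
class Writer:
    """Toy WRITER: keeps one active context and switches only when needed."""

    def __init__(self, ctx_memory, chart):
        self.ctx_memory = ctx_memory  # ID -> {"base": ..., "index": ...}
        self.chart = chart            # address -> stored non-terminal
        self.current_id = None
        self.context = None
        self.switches = 0

    def flush(self, ident, results):
        """Store `results` from a processor working on destination `ident`."""
        if ident != self.current_id:  # context switch needed
            self.context = dict(self.ctx_memory[ident])
            self.current_id = ident
            self.switches += 1
        for lhs in results:
            # Assumed addressing: base address plus running index.
            self.chart[self.context["base"] + self.context["index"]] = lhs
            self.context["index"] += 1
        # Write the updated context back so later flushes continue from here.
        self.ctx_memory[ident] = dict(self.context)


ctx = {0: {"base": 100, "index": 0}, 1: {"base": 200, "index": 0}}
chart = {}
w = Writer(ctx, chart)
w.flush(0, ["NP", "VP"])
w.flush(0, ["S"])   # same ID as before: no context switch
w.flush(1, ["PP"])
print(w.switches, sorted(chart.items()))
```

Only two context switches occur for the three flushes, because the second flush targets the same destination cell as the first.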
The second check the WRITER performs on a processor is to see whether the processor has finished processing the assigned cell-combination or not. The WRITER needs to
know whether a processor has finished the processing in order to update the D-TABLE and keep the
data-dependency information up-to-date. It may be the case that a processor has finished the
processing and also has some processing results to be flushed, in which case a flush step will
take place as explained above. If the processor has finished the processing and there are no
processing results to be flushed (or the processing results have already been flushed), the
WRITER checks whether the processed cell-combination was the last for the destination cell
identified by the ID. If this is the case the WRITER will also release the D-TABLE entry and
the CTX-memory information associated with the ID.
The system described above is illustrated in figure 5.7. A processor’s state is available on
its output signal statePi and further to the WRITER on the input stateWRITER. Only one
processor state is available to the WRITER and corresponds to the processor whose output tri-state gate is driven open at that moment. The tri-state gate on the state output of processor Pi
is driven by the flip-flop SRWi of the rotating shift-register SRW. After system reset, only
the flip-flop SRW0 of the rotating shift-register is set to ’1’, all the others to ’0’. The content
of the shift-register is rotated one position each time the signal nextSTATE is activated by the
WRITER, and the bit set to ’1’ drives open, one after the other, the tri-state gates of the
processors, making their state information available to the WRITER. The WRITER keeps the
signal nextSTATE activated until it finds a processor that requests flushing and/or has finished
the processing.
When the WRITER finds a processor requiring to be flushed it will use the signal getDATA
and the bus WBUS for fetching the processor’s output data. The processor Pi knows that it communicates with the WRITER (that it is flushed in particular) from the signal SRWi, available
on its input activePi.
Figure 5.7: The interface between the WRITER and the processors. Parsing results flushing
solution implemented in the DAP design.
The WRITER unit implements the same functionalities that were presented in
section 3.4.1. The datapath diagram of the WRITER given in figure 5.8 illustrates in greater
detail the internals of the WRITER unit. Each word fetched from a processor consists of an
identifier ID, a left-hand side (LHS) and two right-hand sides (RHS1 and RHS2) producing the
left-hand side¹. The fetched ID is used by the context setup module for switching the context (if
necessary). The context switch corresponds to the initialization of the registers Dindex, Dbase
and Guardbase with new values (for details about
these values see section 3.4.1). All these values are fetched from the CTX-memory, where
they were stored by the CHECKER.
The WRITER writes sequentially the left-hand side non-terminals LHS fetched from the
processor’s output buffer in the destination cell at the physical address formed from Dbase and Dindex,
and after each write the index Dindex is incremented to point at the next non-terminal. The
initial displacement stored in Dindex has a fixed value in the case of sentence parsing. However, in
the case of word lattice parsing it may be different.
¹ Within the current design the right-hand sides are ignored. However, they are available for extracting
the parsing forest on-line; this is not implemented within this design version.
The WRITER accesses the guard-vectors table both for read and write. In hardware, the
physical memory address for accessing a particular guard-bit in the guard-vectors table is constructed from the registers Guardbase and LHS, where LHS is
the binary representation of the non-terminal we want to check (read) or set (write).
Each time the WRITER finishes with a processor it stores the context information (i.e.
the updated content of the Dindex, Dbase and Guardbase registers) back in the CTX-memory. If
the flushed processor has finished the processing, the WRITER will also update the information
in the D-TABLE associated with the destination cell corresponding to the processor’s ID (see
section 5.3.6 for details).
Figure 5.8: The WRITER unit datapath.
5.3.6 The destination cells table (D-TABLE) unit
The LAP design uses a row-by-row method for filling the chart, which ensures by default
that the cell data-dependency is satisfied at run-time. The row-by-row method, however, yields a low
processor utilization, and for this reason and others the dynamic allocation method presented in
appendix B.2 is implemented within the current design. As with the dynamic allocation method
the processors run freely for filling the chart without being synchronized during run-time, a
mechanism that can keep track of the chart’s cells data-dependency is required. Essentially,
the chart’s cells data-dependency is guaranteed to be satisfied if only triplets containing source
cells that are not destination cells under processing are sent (i.e. dispatched) to processors
for processing. Whenever a triplet containing a source cell that is a destination cell under
processing is encountered, it will be temporarily put aside (i.e. stored in a buffer) until the
source cells it relies upon become available. The triplets stored in the buffer (i.e. the POOL
unit, see section 5.3.3) will be sent to processors for processing as soon as they satisfy the
data-dependency test they failed to pass previously. Such a mechanism is implemented within the
current design with the aid of the D-TABLE unit (see figure 5.1).
The name D-TABLE comes from destination (cells) table; the unit stores a list of all destination cells
under processing at a given moment. A destination cell is inserted in this table when encountered for the first time in a triplet that satisfies the data-dependency check, and deleted when all
the triplets issued for its processing have been processed and their processing results have been
flushed and written in the chart. As there are two units that perform data-dependency checks on
the D-TABLE (the POOL and the CHECKER units), the D-TABLE should handle concurrent
checks. Given these considerations the D-TABLE should implement and support the following
functionalities:
• data-dependency check: the D-TABLE is organized as an associative memory and therefore a look-up for a source cell takes place in parallel over all the entries of the D-TABLE.
As we saw in sections 5.3.2 and 5.3.3, both the CHECKER and the POOL units may perform data-dependency checks on the D-TABLE at any moment in time and therefore, in
order to achieve maximum efficiency, the D-TABLE should handle concurrent accesses
without having to arbitrate them.
The circuitry used by the CHECKER and POOL units to perform data-dependency checks
on the D-TABLE is illustrated in figure 5.9, for a D-TABLE that can store n destination
cells at once. In the figure there are two groups of signals: one used by the CHECKER
(on the left) and the other one by the POOL (on the right) for communicating with the
D-TABLE during a data-dependency check. In the figure certain signals have the same
name, as they have the same functionality both for the CHECKER and the POOL
unit; this should not cause confusion in the text that follows.
On the right are the signals used by the POOL to perform a data-dependency check for a
triplet (S1, S2, D). In a POOL check query only the source cells (S1, S2) are important,
the destination cell being irrelevant, as the POOL only has to verify that the source
cells are not destination cells under processing. The POOL starts the data-dependency
check by activating the startPOOL signal. The result of the performed data-dependency
check is available on the outputs resultS1 and resultS2, which are valid when the signal
resultREADY is activated. When the startPOOL signal is activated the POOL control
unit (bottom right) will first activate the samplePOOL1 signal to latch the state of each
D-TABLE entry, which can be either used or free. Next, the POOL control unit will activate the samplePOOL2 signal for latching, for each D-TABLE entry, the results of the
lookup performed for S1 and S2 respectively. The result of a lookup for a D-TABLE entry is validated if that entry is not free. Finally two wide-OR gates will output the lookup
result over the entire D-TABLE for S1 and S2 respectively. The signal resultREADY is
activated to mark the availability of the result, three clock ticks after the startPOOL
signal was activated.
On the left are the signals used by the CHECKER to perform a data-dependency check
for a triplet (S1, S2, D). In a CHECKER’s query both the source cells and the destination
cell are looked up in the D-TABLE. The source cells are looked up for performing the
data-dependency check, while the destination cell is looked up to find out whether it is
encountered for the first time, in which case it has to be inserted in the D-TABLE. On the
other hand, if the destination cell is not encountered for the first time the CHECKER has
to retrieve the ID (i.e. the index where it is stored in the D-TABLE) associated with it.
This ID will be sent along with the triplet to the processors. The CHECKER starts the data-dependency check by activating the startCHECKER signal. The result of the performed
data-dependency check is available on the outputs resultS1, resultS2 and resultD respectively,
which are valid when the signal resultREADY is activated. A supplementary
output is available and carries the ID when the lookup result is valid. When
the startCHECKER signal is activated the CHECKER control unit (bottom left) will first
activate the sampleCHECKER1 signal to latch the state of each D-TABLE entry, which can be
either used or free. Next, the CHECKER control unit will activate the sampleCHECKER2
signal for latching, for each D-TABLE entry, the results of the lookup performed for S1,
S2 and D respectively. The result of a lookup on a D-TABLE entry is validated if that
entry is not free. Finally the wide-OR gates will output the lookup result over the entire
D-TABLE for S1, S2 and D respectively, and the decoder will output the ID.
The signal resultREADY is activated to mark the availability of these results, three clock
ticks after the startCHECKER signal was activated.
Figure 5.9: The D-TABLE unit circuitry used for performing a data-dependency check.
• insertion: only the CHECKER inserts destination cells in the D-TABLE and always in a
free entry of the D-TABLE. An entry in the D-TABLE is free if both the CNT_PROCESSING
and CNT_WAITING counters are 0. The index of the D-TABLE entry where a destination
cell is inserted will be the unique identifier associated with that destination cell until the
destination cell is deleted from the D-TABLE. The circuitry used by the CHECKER to
insert a destination cell in the D-TABLE is illustrated in figure 5.10, for a D-TABLE
that can store n destination cells at once.
In this figure the signal used to store a new destination cell is write, and the index of the
entry where the destination cell was written is returned on the output ID. A destination cell
can be inserted in any free entry of the D-TABLE. In other words, any priority scheme
may be used to allocate an entry when inserting a destination cell. For instance, within the
current implementation the first entry has the highest priority and the last entry the lowest. If
during run-time the D-TABLE is filled, the signal tableFULL will be activated and in this
case the CHECKER will have to wait for an entry to be released. It is precisely the task
of the CHECKER to verify that the signal tableFULL is not activated before attempting
to insert a new destination cell in the D-TABLE.
Figure 5.10: The D-TABLE unit, circuitry used for inserting a destination cell.
When a destination cell is inserted in the D-TABLE, two counters are initialized for that entry. These two counters are necessary for the WRITER to detect when the system has finished the processing of the associated destination cell. The counter CNT_WAITING represents the number of cell-combinations left for processing and the counter CNT_PROCESSING
represents how many cell-combinations are currently under processing. When a destination cell is inserted in the D-TABLE the CNT_WAITING counter is initialized with the row
of that destination cell, as there are exactly as many triplets to process for filling that
destination cell. The counter CNT_PROCESSING is initialized to 0. A destination cell is
processed (i.e. filled) as soon as both counters are zero.
• deletion: the deletion of an entry in the D-TABLE is done by the WRITER, but the delete
operation does not take place explicitly on the D-TABLE. Each time the WRITER finds
a processor that has finished the processing it decrements the counter CNT_PROCESSING.
If both CNT_PROCESSING and CNT_WAITING are 0 for an entry, that entry will be
automatically released, which corresponds to a delete.
• update: an update is performed on a D-TABLE entry each time the DISPATCHER sends a
triplet to a processor or when the WRITER finds a processor that has finished the processing. Both the DISPATCHER and the WRITER use the identifier ID sent along with the
triplets for updating an entry in the D-TABLE. Precisely, each time the DISPATCHER
sends a triplet for processing, the counter CNT_WAITING is decremented and the counter
CNT_PROCESSING is incremented. Once the processor that processed the same triplet
has finished, the WRITER will decrement the counter CNT_PROCESSING. As already
said, when both counters become 0 (which is indicated by a dedicated signal)
the associated entry in the D-TABLE is released. The circuitry used by the
DISPATCHER and the WRITER to update a D-TABLE entry is illustrated in figure 5.11,
for a D-TABLE that can store n destination cells at once. The WRITER and the DISPATCHER
each use their own ID input to select the D-TABLE entry they want to update. Both the WRITER and the DISPATCHER
may concurrently update the same entry (or different entries) of the D-TABLE.
One signal is used by the DISPATCHER and another by the WRITER to update the counters as described above. A further signal
is used by the WRITER to detect a destination cell that was processed
(i.e. filled) and whose associated entry in the D-TABLE should be released.
Figure 5.11: The D-TABLE unit, circuitry used for updating a D-TABLE entry.
(Labels in the figure: two DECODIFICATOR blocks; entries 0, 1, 2, ..., n-1, each with a
TBL_D_REG register and the CNT_PROCESSING and CNT_WAITING counters compared with 0;
signals ID_WRITER, ID_DISPATCHER, incdecDISPATCHER, decWRITER and finishedCELL.)
The size of the D-TABLE can be configured in the VHDL code with the generic ENTRIES.
Note that in the current system the number of distinct destination cells under processing at the
same time cannot exceed the number of processors in the system. Therefore, there is no point
in having a D-TABLE with more entries than the number of processors in the system, as the
extra entries would only waste hardware resources.
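The bookkeeping performed on the D-TABLE (insertion by the CHECKER, update by the
DISPATCHER and the WRITER, implicit deletion when both counters reach zero) can be
summarised by a small software model. The Python sketch below is only an illustration under
stated assumptions and is not the VHDL implementation: the class and method names (DTable,
insert, dispatch, finish) are hypothetical, and the lowest-index-first allocation is merely one
possible priority scheme, since the text allows any.

```python
from typing import List, Optional


class DTable:
    """Software model of the D-TABLE counters (illustrative, not the VHDL)."""

    def __init__(self, entries: int) -> None:
        # One pair of counters per entry; an entry is free when both are 0.
        self.cnt_waiting: List[int] = [0] * entries
        self.cnt_processing: List[int] = [0] * entries

    def is_free(self, idx: int) -> bool:
        return self.cnt_waiting[idx] == 0 and self.cnt_processing[idx] == 0

    def is_full(self) -> bool:
        return not any(self.is_free(i) for i in range(len(self.cnt_waiting)))

    def insert(self, combinations: int) -> Optional[int]:
        """CHECKER side: allocate a free entry for a destination cell.

        `combinations` is the number of cell-combinations (triplets) needed
        to fill the cell. Returns the entry index, which serves as the cell
        identifier, or None when the table is full and the CHECKER must wait.
        """
        if combinations <= 0:
            raise ValueError("a destination cell needs at least one combination")
        for idx in range(len(self.cnt_waiting)):  # assumed priority: lowest index first
            if self.is_free(idx):
                self.cnt_waiting[idx] = combinations
                self.cnt_processing[idx] = 0
                return idx
        return None

    def dispatch(self, idx: int) -> None:
        """DISPATCHER side: one triplet of cell `idx` was sent to a processor."""
        assert self.cnt_waiting[idx] > 0
        self.cnt_waiting[idx] -= 1
        self.cnt_processing[idx] += 1

    def finish(self, idx: int) -> bool:
        """WRITER side: a processor finished one triplet of cell `idx`.

        Returns True when both counters are 0, i.e. the cell is filled and
        the entry is implicitly released (the finishedCELL condition).
        """
        assert self.cnt_processing[idx] > 0
        self.cnt_processing[idx] -= 1
        return self.is_free(idx)
```

In this model an entry is never deleted explicitly: it becomes reusable by insert() as soon as
finish() brings the second counter to zero, which mirrors the implicit release described above.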
5.3.7 The processor
Figure 5.12 illustrates the processor datapath. A processor receives its processing tasks
(i.e. triplets) from the DISPATCHER, which also activates the signal GO (see section 5.3.4)
that initiates the processing. Based on the triplet received from the DISPATCHER, the
processor accesses the chart memory (see section 3.5.1.1) in order to fetch the
non-terminals in the source cells. The interface with the grammar memory is the same as for
the processors used in the LAP design (see section 3.5.1.2). Each time two non-terminals,
one from each source cell, produce a left-hand side, the resulting item is
stored in the output FIFO buffer. When the output FIFO is full (or almost full) the flush request
module asserts its request signal to ask for the output FIFO to be flushed. The WRITER flushes
the processor's output FIFO as soon as it polls the processor and detects the
request. The same request signal is also used to signal that a processor has finished the
processing of the cell-combination.
A processor whose output buffer is full cannot store its processing results and stays idle
until the output buffer is flushed. Therefore, requesting the flush of the output FIFO when it
is almost full helps to eliminate the time lost by a processor waiting for its results to
be flushed. A processor that has finished the processing and has been flushed becomes
immediately available for processing other triplets.

Figure 5.12: The datapath of the processor used in the dynamic array of processors (DAP)
architecture. (Blocks shown: DISPATCHER interface, WRITER interface, chart memory
addressing unit, grammar look-up unit, flush request module and output FIFO.)
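The processing of one cell-combination, including the early flush request, can be sketched
in software as follows. This is a minimal Python illustration and not the processor datapath:
the grammar is assumed to be a dictionary from a pair of right-hand side non-terminals to the
set of left-hand sides, the item layout (left-hand side, first and second right-hand side) is
an assumption, and the flush callback and the almost_full threshold are hypothetical stand-ins
for the WRITER and the flush request module.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Item = Tuple[str, str, str]                  # (left-hand side, RHS 1, RHS 2), assumed layout
Grammar = Dict[Tuple[str, str], Set[str]]    # (RHS 1, RHS 2) -> left-hand sides


def combine_cells(
    left: Iterable[str],
    right: Iterable[str],
    grammar: Grammar,
    flush: Callable[[List[Item]], None],
    almost_full: int = 14,
) -> None:
    """Process one cell-combination and hand the results to `flush` in batches."""
    fifo: List[Item] = []
    right_nts = list(right)
    for a in left:
        for b in right_nts:
            # Grammar look-up: every rule lhs -> a b yields one item.
            for lhs in sorted(grammar.get((a, b), ())):
                fifo.append((lhs, a, b))
                if len(fifo) >= almost_full:
                    flush(fifo)      # request a flush before the FIFO is full
                    fifo = []
    # The final flush also signals that the cell-combination is finished.
    flush(fifo)
```

Unlike the hardware, this sketch flushes synchronously; in the design the processor keeps
working while it waits for the WRITER, which is exactly why the request is raised when the
FIFO is almost full rather than full.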
5.4 Performance measurements
The tests and performance measurements presented in this section were performed on the same
CNF SUSANNE grammar (containing 10,129 non-terminals and [...] rules) used to
benchmark the previous design (see section 3). The data-structures representing the CNF
SUSANNE grammar and the chart in the current design are the same as those used for the LAP
design. The size of the memory required to store the CNF SUSANNE grammar data-structure
is 558,576 bytes, and the size of the chart memory depends on the length of the sentence we
want to parse (e.g. [...] bytes for parsing sentences with up to [...] words or [...] bytes
for parsing sentences with up to [...] words).
In contrast to the previous design, the maximal length of the sentences that can be parsed
is independent of the number of processors in the system. In other words, any number of
processors can be used for parsing sentences of any length, provided that the amount of
memory available for storing the chart is sufficient (e.g. parsing [...] words requires about
[...] MBytes and parsing [...] words requires [...] MBytes).
In order to determine the size and the clock frequency at which the system is able to work,
a 3-processors system configuration was synthesized (with LeonardoSpectrum v2000.1a2) and
placed and routed (with Design Manager, Xilinx Alliance Series 2.1i) in a Xilinx Virtex
XCV1000 FPGA. The synthesis of the 3-processors system was made on a design instance
characterised by the parameters given in table 5.1, and the design was physically tested and
checked for correctness on a RC1000-PP FPGA board with a clock frequency of 50 MHz. The
tested system was restricted to 3 processors as there are only 4 memory banks on the
RC1000-PP FPGA board (1 bank allocated for the chart memory and the remaining 3 banks
allocated for the grammar memories).
For details regarding the grammar memory parameters see section 3.4.2 and appendix A.4,
for the chart memory parameters see section 3.4.1, and for the other parameters in the table
see the DAP design block diagram (figure 5.1). Table 5.2 gives, for a Xilinx Virtex XCV1000
FPGA, a summary of the resources used by each unit in the synthesized 3-processors system.
The overall amounts of resources required by two complete system configurations are also
given.
The [...]-processors system synthesized with the "extract RAM" feature enabled uses less
than [...] of the available FPGA resources and allows us to parse sentences of any length,
provided there is enough room in the chart memory.
Due to the drastic restriction on the number of processors imposed by the RC1000-PP board,
the hardware run-times we present were obtained by simulating (with ModelSim EE/Plus 5.2e)
the VHDL model of a system with 4, 7, 10 and respectively 14 processors. The POOL size used
in these simulations gave the best time results on the [...] sentences of length 3 to 15 from
the SUSANNE corpus, on a [...]-processors system (see section 5.6.3).
The software used for comparison is an implementation of the enhanced CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The hardware performance
(i.e. run-times) of a system containing 4 processors (denoted as hard_P4), 7 processors
(denoted as hard_P7), 10 processors (denoted as hard_P10) and respectively 14 processors
(denoted as hard_P14) was compared against two software run-times. The first (soft_CNFG)
uses the SUSANNE grammar in CNF, as is also the case for the hardware. The second (soft_CFG)
uses the SUSANNE grammar in its original context-free form. The software was run on a SUN
(Ultra-Sparc [...]) with [...] MBytes of memory, [...] MBytes of swap memory, and [...]
processor at a clock frequency of [...] MHz. The initialization of the chart was not taken
into account in the computation of the run-times. For accuracy, the timing was done with the
times() C library function and not by profiling the code.
For the purpose of the comparison, [...] sentences were parsed and validated (the hardware
output was compared to the software output to detect mismatches; as the validation process
involves many technical details it is not described here, and a detailed description can be
found in [9]). The sentences have a length ranging from 3 to 15 words and were all taken from
the SUSANNE corpus. Figure 5.13 shows the average run-times of soft_CFG and hard_P10 as
functions of the sentence length (vertical axis). The run-time of soft_CNFG is not presented
as it is slower than the run-time of soft_CFG and therefore not relevant. The average speedup
factor has been computed for the [...]-processors system against both soft_CNFG and soft_CFG:
it is [...] for soft_CNFG and [...] for soft_CFG. Figure 5.14(a) shows the hardware speedup
for the hard_P4, hard_P7, hard_P10 and hard_P14 systems in comparison with soft_CNFG, and
figure 5.14(b) in comparison with soft_CFG, as a function of the sentence length. This figure
shows that the best performance when parsing sentences above [...] words is obtained with the
hard_P10 system. The figure also illustrates that the overhead required for managing more
processors than needed results in a performance deterioration. This is the case for short
sentences (e.g. [...] words), for which the hard_P4 system gives the best performance. For
sentence lengths between [...] and [...] words, the hard_P7 system gives the best
performance: the hard_P4 system does not have enough processors, while the hard_P10 and
hard_P14 systems have too many processors that are not all used. For sentences with more
than [...] words it is the hard_P10 system that performs best, for the same reasons. The
figure does not show what happens for sentence lengths above 15 words, but it is very likely
that beyond a certain sentence length the hard_P14 system will give the best performance.

4 With ModelSim EE/Plus 5.2e

Parameter          Size [units]      Details
system
  CLOCK            50 [MHz]          the system clock
  COORD_BITS       5 [bits]          number of bits representing a row/column in the chart, see section 5.2
  PROCESSORS       3                 the number of processors in the system
chart memory
  CYKLATENCY       17 [ns]           the access-time of the SRAM memory used to store the chart
  DTAIL_SIZE       9 [bits]          see section 3.4.1
  CYK_ADR_SIZE     19 [bits]         chart data-structure pointer size, see section 3.4.1
  CYK_WAIT_CYCLES  2                 delay until chart memory data lines are stable
grammar memory
  GMEMSIZE         558,576 [bytes]   the size of each grammar memory
  GLATENCY         17 [ns]           the access-time of the SRAM memory used to store the grammar memory
  WAIT_CYCLES      2                 delay until grammar memory data lines are stable
  NT               10,129            number of non-terminals in the grammar, see section 3.4.2
  NT_SIZE          14 [bits]         bits used to represent a non-terminal, see section 3.4.2
  RULE_SIZE        15 [bits]         bits used to represent a distinct right-hand side, see section 3.4.2
  PTR_SIZE         19 [bits]         grammar data-structure pointer size, see section 3.4.2
other
  ENTRIES          31                size of D-TABLE, see section 5.3.6
  OUT_GEN_DEPTH    16                size of FIFO at output of SEQ_GEN, see section 5.3.1
  OUT_CHK_DEPTH    16                size of FIFO at output of CHECKER, see section 5.3.2
  OUT_POOL_DEPTH   16                size of FIFO from POOL to CHECKER, see section 5.3.3
  POOL_ENTRIES     16                size of the POOL, see section 5.3.3
Table 5.1: The parameter values used to configure (i.e. instantiate) the synthesized
3-processors DAP system.

component                           DFFs/Latches   FGs    CLBs   XCV1000 area utilization (%)
IOctrl                              23             19     12     0.10
SEQ_GEN                             57             80     40     0.33
SEQ_GEN out FIFO                    422            284    211    1.72
SEQ_GEN out FIFO ("extract RAM")    15             26     13     0.11 (see note below)
CHECKER                             242            210    121    0.98
CHECKER out FIFO                    1122           686    561    4.57
CHECKER out FIFO ("extract RAM")    15             33     17     0.14
DISPATCHER                          2              4      2      0.02
POOL                                416            880    440    3.58
POOL out FIFO                       422            284    211    1.72
POOL out FIFO ("extract RAM")       15             26     13     0.11
D-TABLE                             1129           2205   1103   8.98
CTX                                 (uses 6 Block SelectRAMs, see Xilinx's XCV1000 Manual [31])
CHART ARBITER                       24             48     24     0.26
WRITER                              257            303    152    1.24
MAG                                 168            296    148    1.20
processor                           463            511    256    2.08
[...]-processors system             3511           5313   2657   21.62
[...]-processors system             5674           7360   3680   29.95
Table 5.2: Virtex XCV1000 FPGA resource utilization per DAP design unit in terms of
Flip-Flops/Latches (DFFs/Latches), function generators (FGs) and configurable logic blocks
(CLBs).
Note: The Virtex XCV1000 FPGA has several types of RAM structures available (see
XCV1000 manual [31]). "Extract RAM" is a synthesizer feature that infers RAM
structures whenever possible.
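As a worked illustration of how a speedup curve is obtained from run-times, the short Python
sketch below divides the soft_CFG run-times by the hard_P10 run-times for each sentence
length. The values are the bar labels of figure 5.13; pairing them with the sentence lengths
3 to 15 in increasing order is an assumption, and the function names are hypothetical. The
mean of these per-length ratios is not necessarily the average speedup factor reported in the
text, which was computed over the individual sentences.

```python
# Average run-times [ms] read from figure 5.13
# (assumed order: sentence lengths 3, 4, ..., 15).
LENGTHS = list(range(3, 16))
SOFT_CFG_MS = [3.85, 7.32, 10.39, 14.67, 17.23, 24.12, 27.65,
               38.7, 39.46, 48.69, 57.97, 68.98, 76.12]
HARD_P10_MS = [0.17, 0.28, 0.36, 0.52, 0.57, 0.8, 0.89,
               1.25, 1.24, 1.54, 1.8, 2.08, 2.36]


def speedups(soft_ms, hard_ms):
    """Per-length speedup: software run-time divided by hardware run-time."""
    return [s / h for s, h in zip(soft_ms, hard_ms)]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)
```

Calling speedups(SOFT_CFG_MS, HARD_P10_MS) yields one ratio per sentence length, which can be
compared with the range of the hard_P10 curve in figure 5.14(b).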
5.5 Design testing on the RC1000-PP board.
In order to obtain the real clock frequency and to validate the presented design, a
3-processor system was physically tested on a commercial FPGA board. For this purpose we
used Celoxica's RC1000-PP FPGA board, whose block diagram is given in section 3.7 in
figure 3.11. As the tested system contains 3 processors (and therefore 3 grammar memories)
and the chart memory, it fits in the 4 SRAM memory banks available on the RC1000-PP FPGA
board. Each SRAM bank on the RC1000-PP FPGA board has a [...]-bit databus that matches the
current design's databus widths for the grammar and respectively the chart memories. Under
these conditions, the synthesized 3-processor system can be placed and routed in the Xilinx
XCV1000bg560-4 FPGA available on the RC1000-PP board without any further modification.
The initialization, the running of the design and the result readback are done in exactly
the same way as for the previous design (see section 3.7). The interfacing signals are the
same.

sentence length [words]   soft_CFG time [ms]   10 processors time [ms]
3                         3.85                 0.17
4                         7.32                 0.28
5                         10.39                0.36
6                         14.67                0.52
7                         17.23                0.57
8                         24.12                0.8
9                         27.65                0.89
10                        38.7                 1.25
11                        39.46                1.24
12                        48.69                1.54
13                        57.97                1.8
14                        68.98                2.08
15                        76.12                2.36
Figure 5.13: Average run-times for soft_CFG and the hard_P10 DAP system as a function of
sentence length. For each sentence length more than [...] sentences were parsed.
5.6 Dynamic Array of Processors Design Analysis
In this section some important aspects of the DAP design are analysed and discussed. First,
in order to get a general idea about the design, we look at the processor activity and measure
the processor utilization for some arbitrary sentences extracted from the SUSANNE corpus.
In order to compare the DAP design with the LAP design we use the same sentences
that we used for studying the LAP design. More precisely, as the processor activity mainly
consists of two tasks, (1) data processing and (2) accessing the chart memory data-structure,
we investigate, as in the case of the LAP design, the fraction of time spent by the processors
in performing each of the two tasks. Second, we investigate how the increased number of
collisions when accessing the chart memory, as the sentence length increases, affects the
overall design performance. Third, the influence of the POOL unit size on the design
performance is investigated.
5.6.1 Average processor utilization
In order to get a general idea about the processor activity during the parsing process we
illustrate in figure 5.15(a), (b) and (c) the processor activity for three sentences (given in
table 5.3) of length 4, 8 and respectively 15 words. In figures 5.15(a)-(c), black depicts the
amount of time spent by the processor processing data and gray the time spent on chart memory
read accesses or waiting for data to be flushed. In order to compare the processor activity
in the DAP and respectively LAP designs, we use the same three sentences illustrated for the
LAP design in section 4.1. The speedup factor for both the LAP and DAP designs when parsing
these sentences (against the soft_CFG software implementation) is also given for comparison
in table 5.3. It results from these figures that the DAP design uses the processors more
efficiently than the LAP design, in particular for longer sentences (see the figures depicting
the LAP design processor activity for the same sentences in figure 4.1). The reason is the
ability of the DAP design to use all the processors during the parsing if necessary.

Figure 5.14: Hardware speedup for the hard_P4, hard_P7, hard_P10 and hard_P14 DAP
systems against (a) soft_CNFG and (b) soft_CFG software as a function of sentence length.
For each sentence length more than [...] sentences were parsed. (Axes: sentence length
[words] vs. speedup.)

len   sentence                                                                        speedup LAP   speedup DAP
4     "One wing stood open"                                                           11.98         22.41
8     "In fact our whole defensive unit did a good job"                               15.38         34.18
15    "In societies like ours , however , its place is less clear and more complex"   13.56         36.31
Table 5.3: Three sentences parsed with the DAP design for which the processor activity is given
in figure 5.15. Speedup factor is given against the soft_CFG software implementation.

As the parsing time decreases with the DAP design (2 to 3 times in comparison with the
LAP design) and the processor activity becomes more "intense", the time spent by the
processors reading chart memory data and flushing the processing results (i.e. the gray area)
becomes a significant fraction of the overall processor activity time. Moreover, as there are
more processors working in parallel in the DAP design, the number of collisions when accessing
the chart memory also increases. This is particularly important for longer sentences and can
also be observed in figure 5.15(c). In this case the [...]-bit databus of the chart memory
used in the DAP design becomes a bottleneck.
To compare the average processor utilization of the DAP and respectively LAP designs, we use
some sentences that were arbitrarily extracted from the SUSANNE corpus. The average
processor utilization is computed in the same way as for the LAP design (see section 4.1).
These sentences are listed in table 5.4. The average processor utilization of the DAP design
increases significantly in comparison to that of the LAP design, especially for some long
sentences. An important thing to note about the processor activity is that during the parsing
the amount of processing is not evenly distributed among the processors. Some processors carry
out a large amount of processing while the others wait, due to data-dependency restrictions,
for the hard-working processors to finish. This suggests a design with the ability to
have several processors working in parallel on the same cell-combination. Such a system would
distribute the processing load better among the processors.
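The claim that the utilization gain is largest for some long sentences can be checked
directly on the values of table 5.4. The Python sketch below simply computes the ratio
between the DAP and LAP average processor utilization for each sentence; the variable and
function names are hypothetical, and the numbers are those listed in the table.

```python
# (sentence length, U_DAP [%], U_LAP [%]) as listed in table 5.4.
UTILIZATION = [
    (3, 10.8, 9.26), (4, 8.27, 8.27), (5, 16.78, 9.35), (6, 15.78, 10.99),
    (7, 23.65, 12.72), (8, 21.34, 10.58), (9, 18.41, 8.90), (10, 33.68, 15.01),
    (11, 20.01, 8.36), (12, 24.86, 12.86), (13, 28.36, 10.00),
    (14, 47.40, 10.48), (15, 48.07, 13.42),
]


def utilization_gain(rows):
    """Map each sentence length to the ratio U_DAP / U_LAP."""
    return {length: dap / lap for length, dap, lap in rows}
```

For these sentences the ratio is never below 1, and the largest ratios are obtained for
sentences of more than 10 words.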
5.6.2 Expected performance depreciation
As already discussed in section 4.2, the expected performance depreciation phenomenon means
that the system performance (i.e. parsing time) does not scale at the expected rate when the
number of working processors grows. This depreciation in the expected system performance
is the consequence of an increased number of collisions (i.e. concurrent accesses) between the
processors when accessing the chart memory as the sentence length increases. The increased
number of collisions between processors can be observed in figures 5.15(a)-(c), which depict
the processor activity for three sentences randomly chosen from the SUSANNE corpus. In
these figures, the gray area, corresponding to the time spent by the processors accessing the
chart memory, increases as the sentence length increases, becoming a significant fraction of
the overall processor activity.

Figure 5.15: DAP design processor activity in BLACK+GRAY when parsing a sentence of
length (a) 4 words, (b) 8 words and (c) 15 words. GRAY: represents the time spent for chart
read accesses and for flushing processing results. BLACK: represents the time spent for
processing data. (Panels plot the processor number, 1 to 14, against time.)

len   sentence                                                          U[%] DAP   U[%] LAP
3     "One pass only"                                                   10.8       9.26
4     "There was no moon"                                               8.27       8.27
5     "She too began to weep"                                           16.78      9.35
6     "She must not think about time"                                   15.78      10.99
7     "The form and the chaos remain separate"                          23.65      12.72
8     "The games were over , this was life"                             21.34      10.58
9     "Like Napoleon , he was the worst of losers"                      18.41      8.90
10    "In fact our whole defensive unit did a good job"                 33.68      15.01
11    "I told him who I was and he was quite cold"                      20.01      8.36
12    "Nerves tight as a bowstring , he paused to gather his wits"      24.86      12.86
13    "I told him no , that I had had a very happy childhood"           28.36      10.00
14    "As he had longed to be , he became the echo of a saga"           47.40      10.48
15    "It is all around us and our only chance now is to let it in"     48.07      13.42
Table 5.4: The average processor utilization for the DAP (respectively LAP) design, for a set
of sentences extracted from the SUSANNE corpus with lengths between 3 and 15 words.

However, in order to better illustrate this phenomenon we use a purpose-built grammar,
the same one that we used in example 4.1, which has the particularity that any
cell-combination in the chart requires the same amount of processing. The following example
uses this grammar in an experiment that illustrates the expected performance depreciation
phenomenon.
Example 5.4 We use the same grammar as in example 4.1 to parse the sentences: "a a",
"a a a", "a a a a", . . . , up to a similar sentence of length [...].
After initialization each cell in the bottom row of the chart contains a fixed set of
non-terminals. During parsing the grammar generates this same set of non-terminals for each
cell-combination performed. This means that at the end of the parsing each cell in the chart
contains this set of non-terminals, and that during the parsing the processors work on the
same data each time two cells are combined. In conclusion, if we know the time spent by a
processor for performing a cell-combination (i.e. the parsing time for the sentence "a a"),
we can compute the expected parsing time for a sentence of any given length.
The method used to compute the expected parsing time is not analytic and we used a program
for this purpose. The program computes the minimum number of steps required for filling
the chart when parsing a sentence of a given length, using a given number of processors
(e.g. 14 in our case). The expected parsing time is then the product of this number of steps
and the time of a single cell-combination.
Note that when parsing the sentence "a a" only one processor is working and therefore no
collisions occur when accessing the chart.
For each of these sentences, figure 5.16 illustrates the (computed) expected parsing time
vs. the real parsing time. Within the DAP design the processors are used more intensively
(i.e. efficiently) and the number of collisions between the processors when accessing the
chart memory increases. This results in a significant difference between the expected and the
real DAP design performance (i.e. parsing time), as illustrated in figure 5.16.

Figure 5.16: DAP design real vs. expected parsing time when parsing the sentences "a a",
"a a a", "a a a a", . . . up to a similar sentence of length [...]. The real parsing time is
greater than the expected parsing time, illustrating the expected performance depreciation
phenomenon. (Axes: sentence length [words] vs. time [ns]; curves: real, expected.)

Figure 5.17 illustrates the processor activity when parsing a sentence of [...] and
respectively 7 "a"s. Note that the gray area for each cell-combination is larger for the
longer sentence than for the shorter one, while the sizes of the black areas do not
change.
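The expected parsing time of example 5.4 is the product of a number of steps and the time of
one cell-combination. The program used for the thesis is not reproduced here; the Python
sketch below is a stand-in under stated assumptions: every cell-combination takes one step, a
combination can only be issued once both of its source cells are completely filled, and
combinations are issued greedily row by row. The greedy schedule is an assumption and is not
guaranteed to give the minimum number of steps in general; the function names are
hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]                 # (span length, start position)
Combination = Tuple[Cell, Cell, Cell]  # (destination, source 1, source 2)


def greedy_steps(n: int, processors: int) -> int:
    """Steps needed to fill a CYK chart for n words with a greedy schedule."""
    if n < 1 or processors < 1:
        raise ValueError("n and processors must be positive")
    # Cells of span length 1 (bottom row) are filled by the initialization.
    filled: Set[Cell] = {(1, start) for start in range(n)}
    # A cell of span length L needs L - 1 cell-combinations.
    remaining: Dict[Cell, int] = {
        (length, start): length - 1
        for length in range(2, n + 1)
        for start in range(n - length + 1)
    }
    pending: List[Combination] = [
        ((length, start), (k, start), (length - k, start + k))
        for (length, start) in remaining
        for k in range(1, length)
    ]
    steps = 0
    while pending:
        # Combinations whose two source cells are already filled.
        ready = [c for c in pending if c[1] in filled and c[2] in filled]
        issued = set(ready[:processors])
        pending = [c for c in pending if c not in issued]
        # All issued combinations complete at the end of the step.
        for dest, _, _ in issued:
            remaining[dest] -= 1
            if remaining[dest] == 0:
                filled.add(dest)
        steps += 1
    return steps


def expected_time(n: int, processors: int, t_combination: float) -> float:
    """Expected parsing time: number of steps times the time of one combination."""
    return greedy_steps(n, processors) * t_combination
```

With enough processors the number of steps equals the number of chart rows above the bottom
one, and with a single processor it equals the total number of cell-combinations; the real
parsing time exceeds such estimates because chart memory collisions are not modelled.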

5.6.3 POOL influence on DAP design performance
The POOL unit is key to the design's ability to dynamically allocate the processors to
cell-combinations. Without the POOL unit the design would stall on the first cell-combination
that does not satisfy a data-dependency constraint and keep all the other cell-combinations
unissued, even if they satisfy their data-dependency constraints. With a POOL unit, on the
other hand, the design can pass over (i.e. store and temporarily ignore) the cell-combinations
that do not satisfy the data-dependency constraints while still trying to issue other
cell-combinations that satisfy theirs. This results in a better average processor utilization
due to a better exploitation of the parallelism available in the parsed sentence. One can
interpret the POOL unit functionality as a window sliding over the unprocessed cells of the
chart that allows the design to issue the cell-combinations that satisfy the data-dependency
constraints while keeping under observation those that do not. In this context, a larger POOL
size (i.e. a larger sliding window) is expected to be better, as it allows the design to
search over a larger number of cell-combinations that can potentially be issued.
In order to investigate the influence of the POOL size on the DAP design performance, a
number of benchmarks were performed on [...] sentences of length 3 to 15 from the
SUSANNE corpus. The benchmarks were performed by simulating (with ModelSim EE/Plus 5.2e)
the VHDL model of a DAP design (configured for the CNF SUSANNE grammar) with [...]
processors for several POOL sizes. POOL sizes of 1, 2, 4, 8 and respectively 16 were used
for these simulations.

Figure 5.17: DAP design processor activity for a sentence of (a) [...] "a"s and respectively
(b) 7 "a"s. The larger gray area is the reason for the expected performance depreciation
phenomenon. (Each panel plots the processor number, 1 to 14, against time [ns].)

The software used for comparison is an implementation of the enhanced-CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The hardware performance
(i.e. run-time) of the [...]-processor DAP design with POOL size 1 (denoted as hard_pool1),
POOL size 2 (denoted as hard_pool2), POOL size 4 (denoted as hard_pool4), POOL size 8
(denoted as hard_pool8) and POOL size 16 (denoted as hard_pool16) was compared against
two software run-times. The first (soft_CNFG) uses the SUSANNE grammar in CNF, as is
also the case for the hardware. The second (soft_CFG) uses the SUSANNE grammar in its
original context-free form. The software was run on a SUN (Ultra-Sparc [...]) with [...]
MBytes of memory, [...] MBytes of swap memory, and [...] processor at a clock frequency of
[...] MHz. The initialization of the chart was not taken into account in the computation of
the run-times. For accuracy, the timing was done with the times() C library function and not
by profiling the code.
Figure 5.18(a) shows the speedup of the [...]-processors DAP design for different POOL
sizes when compared to soft_CNFG, and figure 5.18(b) shows the speedup for the same
design, for different POOL sizes, when compared to soft_CFG. In these figures the speedup
factor is represented as a function of sentence length. These figures show that the
design performance improves as the POOL size increases up to 8. However, when the POOL
size changes from 8 to 16 no further increase in the design performance is observed. This
leads us to conclude that, in the particular case of a [...]-processors DAP system, a POOL
of size 8 is optimal, as it does not waste hardware resources. It is likely that if more
processors are used a larger POOL size is required to achieve the best performance. The
explanation is that a larger window may be necessary in order to search deeper in the chart
for cell-combinations to process, as there are more processors available for processing
cell-combinations.
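The role of the POOL as a sliding window can be illustrated with a toy simulation. The Python
sketch below is a deliberately simplified model and not the DAP design: cell-combinations are
assumed to be generated row by row, each one takes exactly one step, blocked combinations are
parked in a POOL of limited size, and the generator stalls when the head of the sequence is
blocked and the POOL is full. All names are hypothetical, and memory collisions and varying
processing times are ignored.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

Cell = Tuple[int, int]                 # (span length, start position)
Combination = Tuple[Cell, Cell, Cell]  # (destination, source 1, source 2)


def combinations_in_order(n: int) -> List[Combination]:
    """All cell-combinations for n words, generated row by row."""
    return [
        ((length, start), (k, start), (length - k, start + k))
        for length in range(2, n + 1)
        for start in range(n - length + 1)
        for k in range(1, length)
    ]


def steps_with_pool(n: int, processors: int, pool_size: int) -> int:
    """Count the unit-time steps needed to issue and finish all combinations."""
    sequence: Deque[Combination] = deque(combinations_in_order(n))
    pool: List[Combination] = []
    filled: Set[Cell] = {(1, start) for start in range(n)}
    remaining: Dict[Cell, int] = {}
    for dest, _, _ in sequence:
        remaining[dest] = remaining.get(dest, 0) + 1

    def ready(comb: Combination) -> bool:
        return comb[1] in filled and comb[2] in filled

    steps = 0
    while sequence or pool:
        issued: List[Combination] = []
        # 1. Retry the combinations parked in the POOL, oldest first.
        for comb in list(pool):
            if len(issued) < processors and ready(comb):
                pool.remove(comb)
                issued.append(comb)
        # 2. Take new combinations from the sequence generator.
        while sequence and len(issued) < processors:
            if ready(sequence[0]):
                issued.append(sequence.popleft())
            elif len(pool) < pool_size:
                pool.append(sequence.popleft())  # park it and look further
            else:
                break                            # stall on a blocked head
        # 3. All issued combinations finish at the end of the step.
        for dest, _, _ in issued:
            remaining[dest] -= 1
            if remaining[dest] == 0:
                filled.add(dest)
        steps += 1
    return steps
```

In this toy model, a 4-word sentence on 2 processors needs 6 steps without a POOL and 5 steps
with a POOL of size 1, because the parked combination no longer blocks a later one that is
ready to be issued.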
5.7 Conclusions
This chapter proposes an improved FPGA-based hardware implementation of the CYK algorithm,
adapted for word lattice parsing, that can deal with large-size real-life grammars in Chomsky
Normal Form.
The proposed design, called DAP, implements the dynamic allocation method, and the
maximal sentence length it can parse is independent of the number of processors in the system.
The maximal sentence length depends only on the size of the chart memory and no longer
on the resources available in the FPGA used. In other words, if the syntactic analysis
of a sentence fits in the chart memory, then any number of processors can parse the sentence,
and the number of processors we can fit in the FPGA only influences the parsing performance.
This is in contrast to the LAP design, for which a Xilinx Virtex XCV1000 FPGA can only fit
a design with a maximum of [...] processors (in the particular case of the SUSANNE
grammar), which is particularly restrictive when parsing word lattices.
The parsing results are available on-line, during the parsing, on some FPGA pins as a
compact parsing forest and can be used for further processing (e.g. by a semantic module).
For the implemented DAP design the performance measurements show an average speedup of [...]
when compared to a software implementation of the enhanced-CYK algorithm (run on a SUN
(Ultra-Sparc [...]) with [...] processor at [...] MHz) using a CNF of the SUSANNE grammar,
and an average speedup of [...] when using a general CFG representation of the SUSANNE
grammar.
None of the improvements introduced by the DAP design required changing the data-structures
used to represent the chart and the grammar memories, which are the same as those
used for the LAP design. This made possible an accurate comparison between the performance
of the LAP and DAP designs. Also, the DAP interface and initialization requirements are
identical to those of the LAP design. This allows an easy integration of a commercial
FPGA such as the Xilinx Virtex XCV2000efg1156-6, containing the DAP design, within a
larger system. The larger system can be, for instance, an FPGA board working as an accelerator
in an application framework (e.g. NLP) requiring efficient parsing of context-free languages.
An example of such an application, a Vocal Interface Server, is described in section 1.4.

Figure 5.18: The speedup for the [...]-processors DAP system for several POOL sizes when
compared to (a) soft_CNFG and (b) soft_CFG software as a function of sentence length. For
each sentence length more than [...] sentences were parsed. (Curves: hard_pool1, hard_pool2,
hard_pool4, hard_pool8 and hard_pool16; axes: sentence length [words] vs. speedup.)
The DAP design was the second step of our design methodology, during which:
• we studied a better processor task allocation method;
• we refined and extended our background knowledge about the real-life behaviour of the
CYK algorithm, on which the enhanced-CYK implementation will be built;
• we improved the speed-up factor.
The first point addresses a major drawback of the previous LAP design by investigating the
dynamic allocation method – presented in appendix B.2 – which is also a key feature for a
design able to efficiently exploit the parallelism available in the CYK algorithm. The second
point was useful for identifying critical regions and features of the DAP design that have to be
improved when implementing the enhanced-CYK algorithm. The third point was an argument
for the advantage offered by the dynamic allocation method.
From the performed experiments and performance measurements of the DAP design we
remark a number of drawbacks:
• for every parsed sentence there is an optimal number of processors in the system for
which the speedup is maximal. A smaller or larger number of processors may result in a
performance degradation. The explanation is that when there are more processors in the
system than needed for parsing a particular sentence, the time lost by the system for
managing the unneeded processors becomes an overhead;
• as the number of working processors in the system grows, the collisions among processors
– which require arbitration – become significant, resulting in a performance degradation.
This phenomenon was illustrated in section 5.6.2. Since a significant amount of chart
memory accesses are performed for checking the guard-vectors, a solution to this problem
would be to use a data-structure for the chart that does not rely on the use of
guard-vectors;
• although the processor utilization was improved for the DAP design it is still low –
between   % – for sentences of real-life length. The reason is that the amount of
processing is not evenly distributed among the processors. Concretely, as illustrated in
the processor activity diagrams (see figure 5.15), there are processors carrying out
cell-combinations that are very demanding in terms of processing, while other processors
work on less demanding cell-combinations and finish sooner. Moreover, it is often the
case that the processors that finish must also wait – due to data-dependency constraints –
for the hard-working processor(s) to finish. A solution to this problem that can further
increase the processor utilization would be to cluster several processors and assign
them to process the same cell-combination. Such a solution would better distribute the
overall amount of processing among the processors, resulting in a better average
processor utilization and shorter parsing times;
• the initialization procedure – the same as required for the LAP design – is relatively
complex and time consuming. This comes from the fact that the guard-vectors are initialized
in two steps: (1) all the guard-vectors in the chart require a cleaning step (i.e. "0"-filling)
and (2) an initialization with the new content. The second step is required only when
parsing word lattices;
• the designs can only deal with CNF grammars. The rewriting of a general CFG into a CNF
grammar, although always possible, obfuscates the syntactic structure of the described
language. Concretely, the parsing trees produced with the rewritten CNF grammar have a
different structure when compared to those produced by the original general CFG. A design
able to cope with general CFGs is required;
• the designs do not yet have the ability to recover from fatal errors (e.g. exceeded number
of non-terminals in a cell, a unit crash). A mechanism that monitors the normal system
operation and that can reset the design to a stable state from which other parsings can
start is also required;
• the designs do not integrate a unit for on-line extraction of the parsing forest, as this was
not a key feature for a non-final version of the design;
• only 30% of the resources available in a Xilinx Virtex XCV1000 were used in the DAP
design. A lot of resources are therefore still available for further improvements.
Chapter 6
The Hardware Design of the
enhanced-CYK Algorithm
This chapter presents a design architecture implementing the enhanced-CYK algorithm adapted
for word lattice parsing (see section 2.3, page 17). The proposed architecture is the third and
final step of our design methodology, during which we:
• propose a design that can deal with almost unrestricted general CFGs;
• simplify the chart initialization procedure;
• propose a method called tiling that improves the average processor utilization;
• integrate an on-line parse extraction module;
• integrate a module that monitors the normal system operation during runtime.
The first point mentioned above addresses a major drawback of the previous designs by
allowing the enhanced-CYK design to cope with general (almost unrestricted) CFGs, not only with
Chomsky Normal Form CFGs. More precisely, the design handles a large subclass of CFGs,
called "non partially lexicalized" CFGs (see section 2.3), that in addition contain no unitary
rules. The restriction to "non partially lexicalized" CFGs simplifies the design’s initialization
step and the exclusion of unitary rules significantly reduces the hardware complexity.
The second improvement is a simplified initialization procedure which was made possible
by eliminating the guard-vectors used in the former chart data-structure and by replacing their
functionality with specialized hardware. This change substantially reduces the memory space
requirements per chart cell, resulting in the design’s ability to parse, for the same size of the
chart memory, longer sentences or word-lattices with more time-stamps. For instance, with the
current implementation,  KBytes memory is enough to parse sentences with up to  words
and  MBytes memory will be enough to parse sentences with up to  words.
The third improvement is a method called tiling that improves the average processor
utilization. With the tiling method the processors can be assigned to process chunks of a
cell-combination – henceforth referred to as tiles – unlike the processors in the previous designs
that were processing entire cell-combinations.
The last two improvements are required for an implementation of the enhanced-CYK
algorithm that will be integrated in a larger application framework. Both the ability to extract the
compact parse forest and the ability to monitor the system state during runtime are essential. The
latter functionality is required in order to create a reliable environment.
The main features of this design are: (1) an average speed-up factor of about   when
compared to a software implementation of the enhanced-CYK algorithm¹ and (2) the ability to
parse real-life sentences of up to  words (or time-stamps).
This chapter starts with a general and functional system description. The data-structures
used to represent the chart and the grammar are next presented. It continues by presenting
the system units in detail. The design performance measurements and the design analysis –
discussing some key features of the proposed design – are given before concluding this chapter.
6.1 General system description
The enhanced-CYK design proposed in this chapter can parse sentences of any length, regard-
less of the number of processors in the system, given that the chart memory is large enough
to store the chart data-structure. As for the DAP design, the input to the enhanced-CYK de-
sign is an initialized chart and grammar lookup tables and the output is a compact parse forest.
The enhanced-CYK design uses SRAMs for storing the data-structures representing the chart
(stored in the chart memory) and the "non partially lexicalized" grammar (stored in grammar
memories).
As for the DAP design, the maximal length of the sentence that can be parsed by the current
design does not depend on the number of processors in the system, and therefore the
processors are not clustered around the grammar memories. Instead, each processor has its
own (local) grammar memory for best performance. As the number of pins available on an
FPGA is limited, only a limited number of grammar memories can be connected to the FPGA,
which in turn limits the number of processors.
The chart memory is shared by all the processors in the system, as was the case for the
previous designs. However, within the enhanced-CYK design the processors do not directly
access the chart memory. An interface is used for this purpose in order to reduce the number of
collisions when accessing the chart memory and consequently to increase the overall
system performance.
The design’s interface with the external world is identical to that of the previous designs
(startPARSE, overPARSE and outPARSE signals). The initialization procedure is almost the same
but was simplified in some aspects, and both the chart and the grammar data-structures changed.
The grammar data-structure changed in order to represent the nplCFGs, which cannot employ
the same data-structure as the one used for representing Chomsky Normal Form CFGs. On
the other hand, the chart data-structure changed, rendering its initialization significantly less
time-consuming when compared to the previous designs. Concretely, the guard-vectors were
eliminated and their functionality (i.e. non-terminals lookup) was taken over by an associative
memory.
The processor architecture and the grammar data-structure were jointly designed in order
to allow a tighter coupling and to improve the access time to data. A -bit databus is used
to link each processor to its grammar memory – in comparison to the -bit databus used in the
DAP design – which allows  processors to be placed inside the XCV2000efg1156-6 FPGA.
A more sophisticated processor architecture compensates for the smaller bandwidth between
the processor and its grammar memory. Like the DAP design, the enhanced-CYK design
implements the dynamic processor allocation method. However, unlike the processors in the
previous designs that were processing entire cell-combinations, the processors in the current
design can be assigned to process smaller chunks of a cell-combination. Such an approach is
useful when a cell-combination is demanding in terms of processing, in which case
several processors can team up and process the same cell-combination. This leads to a better
average processor utilization. The tile size can be configured in the VHDL code describing
the design, which offers a very powerful method for investigating the design’s behaviour under
different tile size configurations.
¹ Implemented in the SlpToolKit (see appendix A.1).
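To illustrate why letting several processors share one demanding cell-combination can raise the processor utilization, the following Python sketch compares a greedy assignment of whole cell-combinations with a greedy assignment of fixed-size tiles. The workload figures, the helper names `makespan` and `split_into_tiles` and the greedy policy are assumptions made for this illustration only; data-dependencies, arbitration and memory accesses are ignored, so this is not a model of the actual design.

```python
import heapq


def makespan(task_costs, n_processors):
    """Greedy list scheduling: each task goes to the processor that frees up first."""
    finish_times = [0] * n_processors
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


def split_into_tiles(cost, tile_size):
    """Split one cell-combination of the given cost into tiles of at most tile_size."""
    full, rest = divmod(cost, tile_size)
    return [tile_size] * full + ([rest] if rest else [])


# Hypothetical workload: one demanding cell-combination and three light ones.
cell_combinations = [100, 10, 10, 10]

# Whole cell-combinations: three processors idle while one works on the big task.
whole = makespan(cell_combinations, n_processors=4)

# Tiles: the demanding cell-combination is shared among the processors.
tiles = [t for c in cell_combinations for t in split_into_tiles(c, tile_size=10)]
tiled = makespan(tiles, n_processors=4)

print(whole, tiled)
```

In this toy workload the total amount of processing is the same in both cases; only its distribution over the four processors changes.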
Two new functionalities were integrated in the design architecture implementing the
enhanced-CYK algorithm. The first functionality gives the system the ability to monitor, trace
(and recover from) fatal errors such as an exceeded number of non-terminals in a chart cell.
Whenever a fatal error occurs during runtime the system is reset to a stable state from which
normal operation can restart. The second functionality gives the system the ability to extract
the parsing forest on-line.
The general system architecture for an n-processor system implementing the enhanced-CYK
algorithm using the dynamic processor allocation method is depicted in the block diagram
in figure 6.1. As for the previously presented designs, the elements inside the dashed line
are implemented with hardware resources available within the FPGA. The other elements (chart
and grammar memories GMi) are implemented in SRAM chips present on the system board.
An FPGA-board containing the hardware resources required for implementing a -processors
enhanced-CYK system proposed in this chapter is presented in chapter 7.
6.2 Functional description
Before any parsing can start the system must be initialized. The system initialization
procedure consists of initializing the:
• grammar memories: the binary image of the nplCFG data-structure is loaded in each
grammar memory in the system. The grammar memories are configured once before
any parsing can start and their content stays unchanged during successive parsings. The
grammar memories require reconfiguration only if a different grammar (i.e. nplCFG
without unitary rules) has to be used;
• chart memory: the initialization of the chart memory is done by software running
on the on-board processor or on the host system. It consists in initializing, for certain
cells $(i,j)$ of the chart, the set of non-terminals $N1_{i,j}$ and respectively the set of partial
right-hand sides $N2_{i,j}$ by making use of the lexical rules;
• sentence length: a register of the I/O controller IO-CTRL is initialised before every
parsing with the length of the sentence to be parsed.
Once the system has been initialized the parsing can start by activating the signal startPARSE. At
this moment the sequence generator unit SEQ_GEN (see section 6.4.1 for details) will start
to generate triplets (S1, S2, D) of two source chart cells S1 and S2 that need to be combined
(see lines   of the enhanced-CYK algorithm on page 17) along with their corresponding
destination cell D in which the cell-combination result will be stored.
Note: A source/destination cell in a triplet is in fact a pair of coordinates representing the row
and the column of that cell. For instance, S1 is written as the pair formed by the row and the
column of the first source cell.
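The behaviour of a sequence generator can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the SEQ_GEN unit itself: it assumes 1-based coordinates in which the row of a cell is the number of words it spans and the column is the position of its first word, and it assumes that triplets are produced row by row. The function name `generate_triplets` is hypothetical.

```python
def generate_triplets(sentence_length):
    """Yield (S1, S2, D) triplets, each cell being a (row, column) pair.

    Assumed convention: row = number of words spanned, column = first word.
    """
    n = sentence_length
    for i in range(2, n + 1):              # row of the destination cell D
        for j in range(1, n - i + 2):      # column of the destination cell D
            for k in range(1, i):          # split point between S1 and S2
                s1 = (k, j)                # left part: k words starting at j
                s2 = (i - k, j + k)        # right part: the remaining i - k words
                yield s1, s2, (i, j)


triplets = list(generate_triplets(3))
for s1, s2, d in triplets:
    print(s1, s2, d)
```

Under these assumptions the generated sequence is a function of the sentence length alone, so two sentences of the same length yield the same triplets.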
[Figure 6.1: block diagram. Labelled elements: IO-CTRL, SLEN, startPARSE, overPARSE, outPARSE
(to a FIFO -> PCI interface), SEQ-GEN, CHECKER, D-TABLE, POOL & CHECKER, CTX, CTXRAW,
DISPATCHER, READER, memory buffer, TILER, processors P1, P2, P3, ..., Pn, grammar memories
GM1, GM2, GM3, ..., GMN, WRITER, LOOKUP, EXTRACTOR, MONITOR, ARBITER, CHART MEMORY.]
Figure 6.1: The block diagram of an n-processor enhanced-CYK design.
The generated triplets depend (only) on the length of the parsed sentence and the same sequence
of triplets is generated for any two sentences of the same length. The triplets are then passed to
the CHECKER unit (see section 6.4.2 for details) whose task is to check whether the source cells
are available, or more precisely, that the source cells are not destinations of unfinished previous
cell-combinations. In other words, the CHECKER verifies that the data-dependency in the
chart is satisfied. For doing this, the CHECKER makes use of the destination table D-TABLE
(see section 6.4.3 for details) that stores all destination cells D currently under processing.
The CHECKER inserts a new destination cell D in the D-TABLE when it is encountered for
the first time in a triplet and deletes a destination cell D from the D-TABLE when all the triplets
containing the destination cell D have been treated. Each time the CHECKER inserts a new
destination cell D in the D-TABLE it does three things: (1) sets some "context information" for
the destination cell D and stores it in the CTX-memory, (2) reads the content of the destination
cell D (both the N1 and N2 sets) and stores it in the CTXRAW-memory and (3) allocates
a unique identifier ID for the destination cell D. The CHECKER will tag all the subsequent
triplets containing the destination cell D with the identifier ID before sending them further to
the DISPATCHER.
A triplet that passes the CHECKER’s test for data-dependency is forwarded to the task
dispatching unit DISPATCHER, while a triplet that does not pass the CHECKER’s test for
data-dependency is stored in the POOL (see section 6.4.4 for details) where it waits for the
data-dependency to be satisfied. In the POOL all triplets are continuously checked for
data-dependency against the D-TABLE and as soon as a triplet passes the data-dependency test it
is returned to the CHECKER. The CHECKER has therefore two inputs, one from the POOL
and the other from the SEQ_GEN – both passing through FIFO memories (represented with
small black boxes in figure 6.1). The triplets coming from the POOL have higher priority and
are handled first. The reason for giving higher priority to the triplets coming from the POOL is
that, for a SEQ_GEN generating triplets in a natural chronological order, the POOL stores
older triplets that need to be treated first. Before forwarding a triplet to the DISPATCHER, the
CHECKER will tag it with the ID associated with the destination cell D it contains.
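The interplay between the CHECKER, the D-TABLE and the POOL described above can be sketched as a small Python model. The class name `Checker`, the method names and the per-destination triplet counts are assumptions introduced for the illustration. FIFO memories, ID tagging, the CTX and CTXRAW memories and processing delays are left out, so this only shows the data-dependency test, not the hardware units.

```python
class Checker:
    """Toy model of the CHECKER, D-TABLE and POOL interplay (not the hardware design)."""

    def __init__(self, triplets_per_destination):
        self.pending = dict(triplets_per_destination)  # D -> triplets still to be treated
        self.d_table = set()     # destination cells currently under processing
        self.pool = []           # triplets waiting for their source cells
        self.dispatched = []     # triplets forwarded to the DISPATCHER, in order

    def _route(self, triplet):
        s1, s2, _ = triplet
        if s1 in self.d_table or s2 in self.d_table:
            self.pool.append(triplet)          # data-dependency not satisfied yet
        else:
            self.dispatched.append(triplet)

    def submit(self, triplet):
        self.d_table.add(triplet[2])           # D enters the D-TABLE on first encounter
        self._route(triplet)

    def written_back(self, dest):
        """Signal that one triplet targeting dest has been fully treated."""
        self.pending[dest] -= 1
        if self.pending[dest] == 0:
            self.d_table.discard(dest)         # dest may now be used as a source
            waiting, self.pool = self.pool, []
            for triplet in waiting:            # re-check every pooled triplet
                self._route(triplet)


# Hypothetical triplets for a three-word sentence, cells given as (row, column).
t1 = ((1, 1), (1, 2), (2, 1))
t2 = ((1, 2), (1, 3), (2, 2))
t3 = ((1, 1), (2, 2), (3, 1))
t4 = ((2, 1), (1, 3), (3, 1))

checker = Checker({(2, 1): 1, (2, 2): 1, (3, 1): 2})
for t in (t1, t2, t3, t4):
    checker.submit(t)
pooled_at_first = list(checker.pool)   # t3 and t4 wait for cells (2,2) and (2,1)
checker.written_back((2, 2))           # releases t3
checker.written_back((2, 1))           # releases t4
```

In this model a pooled triplet is re-checked only when a destination cell leaves the D-TABLE, whereas the text describes a continuous check; the outcome is the same for the purposes of the illustration.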
The task of the DISPATCHER unit (see section 6.4.5 for details) is to assign (i.e. dispatch
for processing) the incoming triplets to the available processors in the system. The
DISPATCHER unit is one of the major changes of the new design. The new DISPATCHER unit
has two stages: the first is the READER and the second is the TILER. The READER stage
prefetches for each incoming triplet the set N2 for the source cell S1 and the set N1 for the
source cell S2 and stores them in an FPGA internal buffer memory. The buffer memory – built
with dual-port memory resources (see Xilinx’s Virtex-E FPGA manual [30] for details) – is
written by the READER stage and read by the TILER stage. The TILER stage reads the
content of this buffer in a first-in-first-out fashion and splits each of the cross-products N2 × N1
(i.e. cell-combinations) into smaller chunks of predefined size (that can be configured in the
VHDL code) – referred to henceforth as tiles. The tiles are further distributed to the processors as
soon as the processors become available during runtime. As for the DAP design, the result of a
processed tile (i.e. the processor’s output) is fetched by the WRITER (see section 6.4.7 for
details), reassembled², and finally stored³ in the chart memory. The WRITER uses the identifier
ID that comes along with a tile’s processing result in order to identify and reassemble the
destination cell D to which these results belong.
² All the tiles corresponding to a destination cell D are reassembled by the WRITER and the
partially reassembled destination cell D is kept in the CTXRAW-memory for all destination cells D
currently under processing.
³ The content of the destination cell D stored in the CTXRAW-memory is dumped in the chart
memory when the last tile for the destination cell D has been reassembled.
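The splitting performed by the TILER stage can be sketched as follows. This is a minimal illustration, assuming that a tile is simply a run of at most `tile_size` consecutive pairs of the cross-product and that each tile carries the identifier of its destination cell; how the hardware actually delimits a tile is not stated here, and the function name `make_tiles` and the sample values are hypothetical.

```python
from itertools import islice, product


def make_tiles(partial_rhs_set, non_terminal_set, tile_size, dest_id):
    """Split the cross-product of the two sets into tiles tagged with dest_id."""
    pairs = product(partial_rhs_set, non_terminal_set)
    while True:
        chunk = list(islice(pairs, tile_size))   # next run of at most tile_size pairs
        if not chunk:
            return
        yield dest_id, chunk


# Hypothetical content: one partial right-hand side and three non-terminals.
tiles = list(make_tiles(["BC."], ["A", "E", "F"], tile_size=2, dest_id=7))

# A WRITER-like step can regroup the pairs of one destination through the tag.
reassembled = [pair for tag, chunk in tiles if tag == 7 for pair in chunk]
```

Because every tile carries the destination identifier, tiles of different cell-combinations can be processed in any order and still be regrouped afterwards.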
After fetching the processor’s output the WRITER updates the data-dependency
information in the D-TABLE accordingly and forwards the data required for extracting the parsing
forest to the EXTRACTOR (see section 6.4.8 for details). The EXTRACTOR extracts the compact
parsing forest information and packs it in a compact format that can be efficiently (i.e. rapidly)
transferred to the host machine. The MONITOR unit (see section 6.4.9) supervises correct
system operation and in case of errors resets the system to an initial state from which new parsings
can restart.
6.3 The enhanced-CYK algorithm data-structures
6.3.1 The chart data-structure
This section discusses the chart data-structure used for the hardware implementation of the
enhanced-CYK algorithm. The factors that influence the chart data-structure organization are
also discussed. Like for the CYK algorithm, the data-structure representing the chart used by
the enhanced-CYK algorithm has to find a compromise between: (1) required functionality, (2)
required memory space, (3) data access time and (4) data access circuit complexity.
The content of a cell for the enhanced-CYK algorithm is not the same as the content
of a cell for the CYK algorithm and therefore the data-structure used to represent the chart
changes. A cell $(i,j)$ for the CYK algorithm contained a single set of non-terminals. For the
enhanced-CYK algorithm, a cell $(i,j)$ contains two sets, a set $N1_{i,j}$ of non-terminals and a set
$N2_{i,j}$ of partial rule right-hand sides. Nevertheless, the functionalities this new data-structure
has to support are essentially the same as for the CYK algorithm (see section 3.4.1) with some
slight complications. Suppose we have the triplet (S1, S2, D), in which S1, S2 and D each
correspond to a cell of the chart. Then the following functionalities require support:
• traversal: go through all elements of a set. Required in order to pair each partial rule
right-hand side of the set N2 in the source cell S1 with each non-terminal of the set
N1 in the source cell S2 (for more details see the enhanced-CYK algorithm on page 17);
• membership: is an element in a set? Required in order to check the presence – in the
destination cell D – of a non-terminal in the set N1, or to check the presence of a partial rule
right-hand side in the set N2;
• insertion: insert an element in a set. Required in order to store – in the destination cell D
– a non-terminal in the set N1 or to store a partial rule right-hand side in the set N2.
These are the enhanced counterparts of the functionalities required for the CYK algorithm. The
traversal and insertion functionalities are supported by the same means as for the CYK algorithm,
namely a list representation of the sets.
The membership functionality was previously supported by means of guard-vectors, which are
no longer used due to their bad influence on the overall system performance. The membership
functionality is currently supported by means of an associative memory. The associative memory
is loaded either with the set N1 or with the set N2 of a destination cell D. When the associative
memory is loaded with the N1 set of a destination cell D, any non-terminal can be checked for
occurrence against the N1 set. Identically, when the associative memory is loaded with the
N2 set of a destination cell D, any partial right-hand side can be checked for occurrence
against the N2 set. In order to be efficient (i.e. fast) it is imperative that the content of the
associative memory can be changed (i.e. loaded) very fast with a new set.
Allocating for each set N1 an amount of memory proportional to the total number of
non-terminals would represent an important memory waste, as in practice a set N1 holds far fewer
elements than that. Identically, the size of each set N2 is much smaller in practice than its
theoretical maximal size, which depends on the grammar characteristics (i.e. number of rules,
number of non-terminals). A maximal size K is therefore assumed for the set N1 and respectively
the set N2. If however, during runtime, a cell receives in either of the sets more than K distinct
elements, the hardware generates a fault signal and the parsing stops. This would be a very
unlikely event for a well chosen value of K. The value of K depends on the considered grammar
and can be found by investigating the characteristics of the grammar. In the particular case of the
nplCFG without unitary-rules SUSANNE grammar (see
appendix B.4) we have a value of 
  
 which requires a chart memory size of  KByte
for parsing any sentence with up to  words and a memory size of  MByte for parsing any
sentence with up to  words.
For understanding how the traversal, membership and insertion functionalities are implemented,
the chart data-structure organization is given in figure 6.2. The chart memory storing the chart
data-structure is addressed on 8 byte words. Therefore, in order to efficiently access the data
stored in the chart memory, the content of the data-structure is aligned to an 8 byte boundary. The
chart data-structure is organized in memory as two distinct tables and is defined by some
parameters that depend on the used grammar. The maximal size allowed for these parameters as
well as the actual size of these parameters in the particular case of the SUSANNE grammar are
tabulated in table 6.1. A grammar for which any of these parameters requires a size larger than its
maximal allowed size cannot be accommodated within the proposed data-structure. These
parameters are used by (1) the software used to initialise the chart data-structure and (2) the VHDL
code used to synthesize the design in order to correctly access the chart data-structure. The left
table in figure 6.2, called the indexing table, contains for each chart cell $(i,j)$ an entry that allows
retrieving (1) a pointer Phead to the memory location where the associated sets $N1_{i,j}$ and
$N2_{i,j}$ are stored and (2) two values sizeN1 and sizeN2 representing the (current) number of
elements in each of the two sets. Each entry in the indexing table is stored on 8 bytes and is
organized as illustrated in figure 6.3. The physical address of the entry in the indexing table where
the information about cell $(i,j)$ resides is built by concatenating the binary representations of i
and j and performing a left-shift by 3 positions (8 bytes aligned). The indexing table size is
 KBytes for parsing sentences with up to  words and  KBytes for parsing sentences with
up to  words.
[Figure 6.2: the indexing table, with one entry of 8 [bytes] holding Phead, sizeN1 and sizeN2 for
each of the cells (1,1), (2,1), ..., (1,n), and the cells table, with one entry of 2*2*K [bytes]
holding the sets N1 and N2 of each cell.]
Figure 6.2: The enhanced-CYK chart data-structure memory organization. Indexing table entry
is on 8 bytes and cells table entry is on 2*2*K bytes (each set N1 and N2 requires 2*K bytes).
parameter     | maximal allowed size | SUSANNE size | meaning
              |                      |              | total # of non-terminals
              | [bits]               | [bits]       | bits for a non-terminal
              | [bits]               | [bits]       | bits for a partial right-hand side
CELLSIZE_BITS | [bits]               | [bits]       | bits for the constant K
CYK_ADR_SIZE  | [bits]               | [bits]       | bits for a chart memory pointer
Table 6.1: The maximal size of the parameters that define the enhanced-CYK chart
data-structure. Parameter values in the particular case of the SUSANNE grammar.
[Figure 6.3: an indexing table entry (i,j) drawn as two 32-bit words: one word (bits 31 to 0)
holds the pointer Phead (CYK_ADR_SIZE); the other holds sizeN1 (bits 31 to 16) and sizeN2
(bits 15 to 0), each on CELLSIZE_BITS.]
Figure 6.3: The organization of an entry $(i,j)$ in the indexing table of the enhanced-CYK chart
data-structure.
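The address computation and the entry layout described above can be sketched in Python. Several points are assumptions made for the illustration: the number of bits per coordinate (`coord_bits`) is a hypothetical parameter, the row is placed in the high-order bits of the concatenation, and the entry is packed big-endian as one 32-bit pointer followed by two 16-bit size fields, following the labels of figures 6.2 and 6.3. The function names are hypothetical as well.

```python
import struct

ENTRY_BYTES = 8  # size of one indexing table entry, as labelled in figure 6.2


def index_entry_address(i, j, coord_bits):
    """Byte address of the indexing table entry of cell (i, j).

    The binary representations of i and j are concatenated and the result is
    shifted left by 3 positions so that the address is 8-byte aligned.
    """
    limit = 1 << coord_bits
    if not (0 <= i < limit and 0 <= j < limit):
        raise ValueError("coordinate does not fit in coord_bits bits")
    return ((i << coord_bits) | j) << 3


def pack_entry(phead, size_n1, size_n2):
    """Pack Phead (32 bits), sizeN1 and sizeN2 (16 bits each) into 8 bytes."""
    return struct.pack(">IHH", phead, size_n1, size_n2)


def unpack_entry(raw):
    """Inverse of pack_entry: returns (phead, size_n1, size_n2)."""
    return struct.unpack(">IHH", raw)


address = index_entry_address(2, 1, coord_bits=4)
entry = pack_entry(0x100, 3, 5)
```

With `coord_bits=4` the entry of cell (2, 1) is found at byte address 264, a multiple of 8 as the alignment requires.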
The second table, called the cell table, contains for each chart cell $(i,j)$ an entry storing the
sets $N1_{i,j}$ and $N2_{i,j}$. Since both the elements of the N1 set (i.e. non-terminals) and the
elements of the N2 set (i.e. partial rule right-hand sides), whose widths in bits are given in
table 6.1, are stored on 2 bytes, each entry in the cell table requires 2*2*K bytes. In
the particular case of the SUSANNE grammar the size of an entry in the cell table is  bytes.
In the same particular case the size of the cell table is  KBytes for parsing sentences with
up to  words and  MBytes for parsing sentences with up to  words.
Let us now look at how the traversal, membership and insertion functionalities are implemented.
For the implementation of traversal, the pointer Phead is the base memory address where the set
N1 is stored, and Phead displaced by the memory reserved for the set N1 is the base memory
address where the set N2 is stored. Two displacements, i.e. indexes, are used for going through
the non-terminals in the set N1 (bounded by sizeN1) and respectively through the partial
right-hand sides in the set N2 (bounded by sizeN2). The addition of the base memory address and
the displacement gives the physical memory location of the addressed item.
The membership functionality is supported by means of an associative memory that is loaded
either with the content of the set N1 or with the content of the set N2 of a destination cell D. A
non-terminal can be looked up in the associative memory when the associative memory is loaded
with the set N1 of the destination cell D. Identically, a partial right-hand side can be looked up
in the associative memory when the associative memory is loaded with the set N2 of the
destination cell D.
Finally, for implementing the insertion functionality, Phead + sizeN1 points to the physical
memory location where the next non-terminal has to be stored, and the base memory address of
the set N2 plus sizeN2 points to the physical memory location where the next partial right-hand
side has to be stored. Each time a non-terminal is stored in memory the value of sizeN1 is
incremented to reflect the new size of the N1 set. Identically, each time a partial right-hand side is
stored in memory the value of sizeN2 is incremented to reflect the new size of the N2 set. If during
the parsing sizeN1 > K or sizeN2 > K, a fault signal is raised to signal that the N1 or
respectively the N2 set has too many elements. The parsing will stop in this case and a signal
should tell the host computer (or local microprocessor) that the current parse could not finish
and that the software has to redo it. When parsing sentences, the initial value of the sizeN1
and sizeN2 fields is always . However, for word lattices this is not always the case.
6.3.2 The nplCFG grammar data-structure
This section discusses the data-structure representation of the nplCFG (without unitary-rules)
– referred henceforth simply as the grammar – used in the hardware implementation of the
enhanced-CYK algorithm.
A copy of the grammar data-structure has to be loaded in each grammar memory in the
system before any parsing can start. The content of the grammar memory only has to be
changed when a new grammar is to be used. Given that for real-life grammars the data-structure
requires a large amount of storage memory, SRAM chips are used for this purpose. SRAMs
have several advantages: (1) they are easy to control, (2) state-of-the-art SRAM chips have
relatively large sizes, (3) they have fast access times and (4) they require a minimum of
interfacing signals with the FPGAs. The disadvantage is the high power (i.e. current) consumption.
Within the current design the processors do not share the available grammar memories.
Each processor has its own grammar memory in order to achieve best performance. Since such
a configuration is efficient only if the processors use the grammar memories intensively,
the design of the grammar data-structure and of the processor aimed at a tighter coupling of the
two.
When working on the cell-combination (S1, S2, D), a processor pairs each partial right-hand
side in the set N2 of the source cell S1 with each non-terminal in the set N1 of the source cell
S2 and uses the grammar data-structure to check whether the partial right-hand side extended
with the non-terminal is a new partial rule right-hand side, and/or a complete rule right-hand
side, and in each of the two cases to retrieve some information. Precisely, given a partial
right-hand side and a non-terminal, the grammar data-structure should allow a processor to
retrieve the following information:
1. the partial right-hand side obtained by extending the given partial right-hand side with the
non-terminal, if there is one;
2. if this extension is a rule right-hand side: all the non-terminals that are left-hand sides of the
grammar rules with this right-hand side and, in this case, all the partial right-hand sides made
of such a non-terminal that exist in the grammar;
3. a code that uniquely identifies the right-hand side for the grammar rules at point 2.
The example below illustrates the functionality required from the grammar data-structure when
a cell-combination (S1, S2, D) is performed by a processor.
Example 6.1 Let’s consider the nplCFG without unitary rules given by:
  '( %* / 0
 	,
     )     	,
       (     (     (  
(  * '  %  * '  %    (  
*    (  /    (  0    % 

  (  
        
 0 	
and perform the cell-combination (S1, S2, D) given in figure 6.4. The cross-product of the set
N2 = {BC} (partial right-hand sides in one source cell) and the set N1 = {A, E, F} (non-terminals
in the other source cell) is {(BC, A), (BC, E), (BC, F)}. The grammar data-structure should allow
a processor to retrieve the following information:
• for (BC, A): (1) the partial rule right-hand side BCA (e.g. from a grammar rule with right-hand side BCAE);
• for (BC, E): (1) the partial rule right-hand side BCE (e.g. from a grammar rule with right-hand side BCEG)
and (2) the non-terminals A (the rule A → BCE) and H (H → BCE), respectively the partial right-hand
sides A (e.g. from a grammar rule with right-hand side AE) and H (e.g. E → HD);
• for (BC, F): (2) the non-terminal J (the rule J → BCF).
The partial rule right-hand sides found above are inserted in the set N2, and the non-terminals
are inserted in the set N1, of the destination cell D. The result is illustrated in figure 6.4.

[Figure 6.4 shows the source cells, with N2 = {BC} and N1 = {A, E, F}, and the destination cell D, with N1 = {A, H, J} and N2 = {BCE, A, H, BCA}.]
Figure 6.4: Example of cell-combination for the enhanced-CYK algorithm.
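The retrieval requirements listed before example 6.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables `PARTIAL`, `RULES` and `STARTERS` are hypothetical and hand-written so that the outcome mirrors the destination cell of figure 6.4; they are not the grammar of the example, and the right-hand side code of point 3 is left out.

```python
# Hypothetical toy tables, chosen only to mirror the outcome of example 6.1.
# PARTIAL[(alpha, Y)] = beta when beta = alpha+Y is a partial right-hand side.
PARTIAL = {("BC", "A"): "BCA", ("BC", "E"): "BCE"}
# RULES[rhs] = left-hand sides X_i of the grammar rules X_i -> rhs.
RULES = {"BCE": ["A", "H"], "BCF": ["J"]}
# Non-terminals that are themselves partial right-hand sides (assumed).
STARTERS = {"A", "H"}


def combine(n2_source, n1_source):
    """Return the (N1, N2) items produced for the destination cell."""
    n1_dest, n2_dest = set(), set()
    for alpha in n2_source:
        for y in n1_source:
            beta = PARTIAL.get((alpha, y))
            if beta is not None:  # point 1: a new partial right-hand side
                n2_dest.add(beta)
            for lhs in RULES.get(alpha + y, []):  # point 2: rule left-hand sides
                n1_dest.add(lhs)
                if lhs in STARTERS:
                    n2_dest.add(lhs)
    return n1_dest, n2_dest


n1, n2 = combine({"BC"}, {"A", "E", "F"})
print(sorted(n1), sorted(n2))
```

With these assumed tables the destination cell receives N1 = {A, H, J} and N2 = {A, BCA, BCE, H}.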

The proposed grammar data-structure that allows a processor to perform the operation described
in the example above is illustrated in figure 6.5. The grammar data-structure is aligned
so that the grammar memories can be accessed with an 8-bit, 16-bit or 32-bit wide databus
as needed. If the databus width is changed, only the processor's interface
with the grammar memories requires small changes. The grammar data-structure is defined by
some parameters whose values depend on the grammar used. These parameters, their maximal
allowed size and their actual size in the particular case of the SUSANNE grammar (without
unitary rules) are tabulated in table 6.2. A grammar for which any of these parameters requires
a size larger than its maximal allowed size cannot be accommodated within the proposed
grammar data-structure.
[Figure 6.5 shows the three levels, laid out on 32-bit words (bits 0 to 31). LEVEL 1 entry: ptr_list (PTR_SIZE). LEVEL 2 entry: flags P, F and L (1 bit each), Y (N1_SIZE_BITS), the new partial right-hand side (N2_SIZE_BITS) and ptr_table (PTR_SIZE). LEVEL 3 table: a header with size N1 and size N2 (2*CNT_LHS_SIZE) and RHScode (RULE_SIZE), followed by the non-terminals X1 ... Xn (N1_SIZE_BITS each) and the partial right-hand sides X1 ... Xm (N2_SIZE_BITS each).]
Figure 6.5: The data-structure used to represent a nplCFG without unitary rules.

parameter      maximal size [bit]   SUSANNE size   meaning
PTR_SIZE       up to                               bits for a pointer
N1_SIZE_BITS   up to                               bits for a non-terminal
N2_SIZE_BITS   up to                               bits for a partial right-hand side
CNT_LHS_SIZE   N/A                                 bits for the size of level 3 lists
RULE_SIZE      N/A                                 bits for a rule right-hand side
Table 6.2: The maximal size of the parameters that define the nplCFG (without unitary rules)
grammar data-structure. Parameter sizes in the particular case of the SUSANNE grammar.

Let's see how the grammar data-structure is organized. Level 1 is a table with an entry
for each distinct partial rule right-hand side α present in the grammar. An entry in this table
contains a pointer ptr_list to a list stored at level 2, containing all the non-terminals that are
possible continuations for the corresponding partial rule right-hand side α. Note that none of the
table entries can contain a NULL pointer, as a partial rule right-hand side always (i.e. by definition)
has at least one non-terminal as possible continuation. Each table entry has a fixed size in
bytes (a power of two), which allows the physical address of the entry associated with the partial
rule right-hand side α to be constructed from its binary representation. Precisely, the physical memory address
of the associated table entry is obtained by left-shifting the binary representation of α
by the corresponding number of positions. The number of bits required to represent a
pointer is given by the parameter PTR_SIZE (see figure 6.5 and table 6.2). In the particular case
of the SUSANNE grammar there are  
 partial rule right-hand sides and the size of level
1 table is   bytes.
Level 2 is a collection of lists, one for each partial rule right-hand side α in the grammar.
The list associated with the partial rule right-hand side α has an entry for each non-terminal
Y for which there is a partial rule right-hand side β = αY and/or a rule right-hand side
αY in the grammar. The fields of an entry in a level 2 list (see figure 6.5) corresponding
to the partial rule right-hand side α are:
• P: flag. '1' if αY is a partial rule right-hand side, '0' otherwise;
• F: flag. '1' if αY is a rule right-hand side, '0' otherwise;
• L: flag. '1' if the entry is the last in the list;
• Y: a non-terminal that is a possible continuation of the partial rule right-hand side α;
• β: the new partial rule right-hand side αY;
• ptr_table: if F = '1', a pointer to a table in level 3 containing (1) a code RHScode that
uniquely identifies the rule right-hand side αY and (2) all the non-terminals X_i for which
there is a grammar rule X_i → αY, and all the X_i that are partial right-hand sides in the grammar in this case.
Each entry in a level 2 list is represented on 8 bytes. Each flag requires 1 bit. The size in bits for
the non-terminal Y is given by the parameter N1_SIZE_BITS, for the new partial right-hand side
by the parameter N2_SIZE_BITS and for the pointer to level 3 by the parameter PTR_SIZE
(see figure 6.5 and table 6.2). Since an entry in level 2 is stored on 8 bytes, the following
restriction should be satisfied: 3 + N1_SIZE_BITS + N2_SIZE_BITS + PTR_SIZE ≤ 64. In the
particular case of the SUSANNE grammar the level 2 size is 	 
 bytes.
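To make the 8-byte level 2 entry concrete, the following Python sketch packs and unpacks the six fields into a 64-bit word. The field widths and the bit positions are assumptions made for illustration only (the thesis fixes just the total of 8 bytes and the 1-bit flags); `pack_entry` and `unpack_entry` are hypothetical helper names.

```python
# Assumed field widths; only the 64-bit total and the 1-bit flags come from the text.
N1_SIZE_BITS = 16
N2_SIZE_BITS = 16
PTR_SIZE = 24
assert 3 + N1_SIZE_BITS + N2_SIZE_BITS + PTR_SIZE <= 64


def pack_entry(p, f, l, y, beta, ptr_table):
    """Pack one level 2 entry into 8 bytes (assumed layout: flags in the low bits)."""
    fields = ((p, 1), (f, 1), (l, 1), (y, N1_SIZE_BITS),
              (beta, N2_SIZE_BITS), (ptr_table, PTR_SIZE))
    word, shift = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit in its width")
        word |= value << shift
        shift += width
    return word.to_bytes(8, "little")


def unpack_entry(raw):
    """Inverse of pack_entry: return (P, F, L, Y, beta, ptr_table)."""
    word = int.from_bytes(raw, "little")
    out = []
    for width in (1, 1, 1, N1_SIZE_BITS, N2_SIZE_BITS, PTR_SIZE):
        out.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(out)


raw = pack_entry(1, 1, 0, 5, 42, 1000)
print(len(raw), unpack_entry(raw))
```

A grammar whose parameters violate the restriction would fail the assertion at the top, which corresponds to a grammar that cannot be accommodated in the data-structure.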
Finally, level 3 is a collection of tables, one for each distinct rule right-hand side αY present
in the grammar. Each level 3 table contains a header, a list of non-terminals and a list of partial
rule right-hand sides. The fields of a level 3 table (see figure 6.5) corresponding to the rule
right-hand side αY are:
• size N1: in header. Number of non-terminals in the list;
• size N2: in header. Number of partial rule right-hand sides in the list;
• RHScode: in header. A code that uniquely identifies the rule right-hand side αY to which
the table corresponds;
• list of non-terminals: all the non-terminals X_i for which there is a grammar rule X_i → αY;
• list of partial rule right-hand sides: all the partial rule right-hand sides X_i in the grammar,
where the X_i are elements of the non-terminals list.
The header of a level 3 table is represented on 4 bytes, and both the non-terminals and the partial
rule right-hand sides are represented on 2 bytes. The size in bits for size N1 and size N2
is given by the parameter CNT_LHS_SIZE, for RHScode by the parameter RULE_SIZE, for
a non-terminal in the list of non-terminals by the parameter N1_SIZE_BITS and for a partial
rule right-hand side by the parameter N2_SIZE_BITS (see figure 6.5 and table 6.2). Since
the header in level 3 is stored on 4 bytes, the following restriction should be satisfied:
2 × CNT_LHS_SIZE + RULE_SIZE ≤ 32. In the particular case of the SUSANNE grammar the
level 3 size is   bytes.
For a better understanding of the proposed data-structure, the example below illustrates the
data-structure organization for the nplCFG that was used in example 6.1 and explains how the
search for a particular pair (α, Y) works.
[Figure 6.6 shows the level 1 table, the level 2 lists and the level 3 tables built for the grammar of example 6.1.]
Figure 6.6: The data-structure representing the nplCFG without unitary rules used in exam-
ple 6.1.
Example 6.2 We use the same grammar defined in example 6.1, whose data-structure is illustrated
in figure 6.6, to show how a processor uses the grammar data-structure to look up
the pair (BC, E). The partial rule right-hand side BC is used to index the level 1 table and
retrieve a pointer to a level 2 list. The level 2 list pointed to contains three items, and the second
corresponds to the continuation of BC with the non-terminal E. According to this list entry
(P = 1, F = 1) there is both a partial right-hand side BCE and a final BCE in the grammar.
The pointer ptr_table in this list entry allows the retrieval, from the level 3 table, of the set of
non-terminals {A, H} and the set of partial right-hand sides {A, H}.
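The lookup walk of example 6.2 can be sketched as follows. The three Python containers stand in for the three levels; their contents are assumptions written by hand to match the example (three level 2 entries, the second one for E with P = 1 and F = 1), and the right-hand side codes are arbitrary. In the hardware the level 1 table is indexed by the binary code of the partial right-hand side, whereas a dictionary is used here for readability.

```python
# Hand-written stand-ins for the three levels (contents are assumptions).
LEVEL1 = {"BC": 0}  # partial right-hand side -> ptr_list (index into LEVEL2)
LEVEL2 = [
    {"P": 1, "F": 0, "L": 0, "Y": "A", "beta": "BCA", "ptr_table": None},
    {"P": 1, "F": 1, "L": 0, "Y": "E", "beta": "BCE", "ptr_table": 0},
    {"P": 0, "F": 1, "L": 1, "Y": "F", "beta": None, "ptr_table": 1},
]
LEVEL3 = [
    {"rhs_code": 0, "n1": ["A", "H"], "n2": ["A", "H"]},
    {"rhs_code": 1, "n1": ["J"], "n2": []},
]


def lookup(alpha, y):
    """Return (beta, level3_table) for the pair (alpha, y); None where absent."""
    index = LEVEL1[alpha]  # level 1: never NULL by construction
    while True:
        entry = LEVEL2[index]
        if entry["Y"] == y:
            beta = entry["beta"] if entry["P"] else None
            table = LEVEL3[entry["ptr_table"]] if entry["F"] else None
            return beta, table
        if entry["L"]:  # last entry of the list reached without a match
            return None, None
        index += 1


print(lookup("BC", "E"))
```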

6.4 Design units
6.4.1 The sequence generator (SEQ_GEN) unit
The sequence generator is presented in the enhanced-CYK block diagram (see figure 6.1) as the
SEQ_GEN unit. The task of this unit is to generate all triples (S1, S2, D) of two chart source
cells S1 and S2 that need to be combined, along with their corresponding destination chart
cell D, where the cell-combination result will be stored. The unit's functionality is identical
to that described for the SEQ_GEN unit of the DAP design in section 5.3.1. The SEQ_GEN
implemented within the enhanced-CYK design generates triplets with chronologically ordered
source cells.
The output of the SEQ_GEN unit is forwarded to the CHECKER through a FIFO memory
for decoupling the two units. A small FIFO is required (i.e.  to  words) as the SEQ_GEN
unit is fast and will always keep the FIFO memory full. The size of this FIFO is configured in
the VHDL code with the generic 2(__"&.
6.4.2 The data-dependency checking (CHECKER) unit
The checker is presented in the enhanced-CYK block diagram (see figure 6.1) as the CHECKER
unit. The unit’s functionality is identical with that described for the DAP design in section 5.3.2
with some differences as follows.
As for the DAP design, when the CHECKER inserts a destination cell D in the D-TABLE
(1) the CTX-memory is initialized with some "context information" related to the destination
cell D and (2) a unique identifier is allocated to the destination cell D. In addition to these
operations, the CHECKER implemented within the enhanced-CYK design also initializes the
CTXRAW-memory with the content of the destination cell D (both the N1 and the N2 sets).
Note: When parsing sentences, the sets N1 and N2 in all destination cells are empty and
therefore the CTXRAW-memory is only initialized when parsing word lattices.
The information stored in the CTX-memory, CTXRAW-memory and D-TABLE associated with a
destination cell D can be retrieved by using the unique identifier ID assigned to the destination
cell D. The WRITER uses this information for reassembling the tile processing results.
As for the DAP design the output of the CHECKER is forwarded to the DISPATCHER unit
through a FIFO memory for decoupling the two units. The size of this FIFO is configured in
the VHDL code with the generic 2(_&_"&.
6.4.3 The destination cells table (D-TABLE) unit
The destination cells table is represented in the enhanced-CYK block diagram (see figure 6.1)
as the D-TABLE unit. The unit’s functionality is identical with that described for the DAP
design in section 5.3.6 with some differences as follows.
An entry in the D-TABLE has three fields: (1) a register _"_

storing the row and
column of a destination cell, (2) the counter _2 and (3) the counter _.
The difference between the DAP and enhanced-CYK designs is related to the usage of the
counter in field (2). Precisely, while for the DAP design this counter
counts the cell-combinations issued for that entry's destination cell, in the case of the
enhanced-CYK design it counts the issued tiles. Note that when counting tiles the value
of this counter cannot exceed the number of processors in the system.
The insertion and the deletion of D-TABLE entries work similarly for the DAP and enhanced-CYK
designs. Each time a destination cell is inserted in the D-TABLE, the counter in field (2)
is initialized to 0 and the counter in field (3) is initialized with the row of the inserted
destination cell D. As for the DAP design, an entry in the D-TABLE is deleted (i.e.
released) when both counters become 0.
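The life cycle of a D-TABLE entry can be sketched as below. The names `issued` and `pending` are hypothetical stand-ins for the two counters, and the decrement of `issued` when a tile's results have been written back is an assumption of this sketch (the section only states when the counters are initialized, when they change at dispatch time, and that the entry is released once both reach 0).

```python
class DTable:
    """Toy model of the destination cells table; counter names are hypothetical."""

    def __init__(self):
        self.entries = {}

    def insert(self, row, col):
        # issued tiles start at 0; the second counter starts at the cell's row
        self.entries[(row, col)] = {"issued": 0, "pending": row}

    def tile_dispatched(self, cell, last_tile):
        entry = self.entries[cell]
        entry["issued"] += 1
        if last_tile:  # last tile of one cell-combination
            entry["pending"] -= 1

    def tile_written(self, cell):
        # Assumption: the issued-tiles counter is decremented on write-back.
        entry = self.entries[cell]
        entry["issued"] -= 1
        if entry["issued"] == 0 and entry["pending"] == 0:
            del self.entries[cell]  # entry released


table = DTable()
table.insert(2, 0)
table.tile_dispatched((2, 0), last_tile=False)
table.tile_dispatched((2, 0), last_tile=True)
print(table.entries[(2, 0)])
```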
6.4.4 The triplets buffer (POOL) unit
The POOL is presented in the enhanced-CYK block diagram (see figure 6.1) as the POOL unit.
The unit’s functionality is similar to the POOL unit used in the DAP design and presented in
section 5.3.3.
In both the DAP and the current design, the POOL unit is key to the system's ability to
dynamically allocate the processors to process cell-combinations. Without the POOL unit the
system would hang on the first cell-combination that does not satisfy a data-dependency constraint
and keep all the remaining cell-combinations unissued, even if they satisfy their data-dependency
constraints. On the other hand, by using the POOL unit the system can search over
a larger number of cell-combinations that can potentially be issued and eventually keep all the
processors busy.
However, for the enhanced-CYK design it seems a priori that the POOL unit is less important
than in the DAP design, due to the design's ability to balance the
processor load by using the tiling mechanism. Concretely, a cell-combination is less likely
to violate a data-dependency constraint (which was usually the case when large
cell-combinations were processed) and therefore less likely to require storage in the POOL. This will
however be investigated in section 6.6.5.
6.4.5 The task dispatching (DISPATCHER) unit
The task dispatcher is represented in the enhanced-CYK block diagram (see figure 6.1) as the
DISPATCHER unit. Its purpose is the same as for the DAP design, namely to dispatch tasks
to the processors in the system. However, in order to implement the tiling of (large)
cell-combinations, the DISPATCHER has been redesigned from scratch, which is one of the major
changes in the new design. The new DISPATCHER unit is built of two stages: the first stage
is called READER and the second TILER. The separation of the DISPATCHER into two stages
aims to pipeline the prefetching of the source cells with their tiling and distribution to the processors.
The time spent accessing the source cells is thus hidden, partly by the pipeline
and partly by the high-bandwidth communication with the chart memory, due to a 64-bit
databus. We will further discuss the two stages of the DISPATCHER.
6.4.5.1 The prefetching (READER) stage
The READER stage reads from the chart memory, for each incoming triplet (S1, S2, D), the
set N2 of the source cell S1 and the set N1 of the source cell S2, and stores them in a
buffer memory. The enhanced-CYK design uses 16 bits to represent a non-terminal (N1-item),
respectively a partial right-hand side (N2-item), and therefore the 64-bit databus with the chart
memory allows the READER to fetch 4 N1-items or N2-items per read cycle. In the particular case
of the SUSANNE grammar, the maximal size for the sets N1 and N2 was established at 128 (see
appendix B.4), and therefore 128/4 = 32 chart memory read cycles are sufficient
to fetch a set N1 or N2 of any size, and 64 read cycles to fetch both the N1 and N2 sets for a
triplet. However, for real-life grammars fewer read cycles are usually sufficient for fetching these
sets, as they do not contain a large number of elements.
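The read-cycle arithmetic above can be checked with a short Python sketch. The constants follow the figures of this section (a 64-bit chart memory databus and 16-bit items); `read_cycles` is a hypothetical helper name.

```python
BUS_BITS = 64    # width of the chart memory databus
ITEM_BITS = 16   # width of an N1-item or N2-item
ITEMS_PER_READ = BUS_BITS // ITEM_BITS  # 4 items per read cycle


def read_cycles(set_size):
    """Number of chart memory read cycles needed to fetch a set of set_size items."""
    return -(-set_size // ITEMS_PER_READ)  # ceiling division


worst_case = read_cycles(128)            # largest SUSANNE set
print(worst_case, 2 * worst_case)        # one set, both sets of a triplet
print(read_cycles(10))                   # a small set needs only a few cycles
```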
The READER stage stores the set N2 from the source cell S1 and the set N1 from the
source cell S2 in a dual-port buffer memory which is organized in two banks, the N1-bank and
the N2-bank, as depicted in figure 6.7. For a given triplet (S1, S2, D), the N1-bank stores the
N1 set from the source cell S2, and the N2-bank stores the N2 set from the source cell S1.
[Figure 6.7 shows the buffer memory between the READER stage and the TILER stage. The N1 bank and the N2 bank are each built from four RAMB4_S16_S16 dual-port primitives. The READER stage writes each 64-bit word (data_register[63:0]) as four 16-bit items, one per primitive, through one port (writeBANK_N1, writeBANK_N2 and their addresses); the TILER stage reads 64-bit words (dataL1[63:0], dataL2[63:0]) through the other port (readBANK_N1, readBANK_N2 and their addresses).]
Figure 6.7: A dual-port memory buffer used for overlapping the triplet prefetching (READER
stage) with their tiling and dispatching (TILER stage)
Both the N1-bank and the N2-bank are built with 4 RAMB4_S16_S16 primitives (see footnote 4). The size of
a RAMB4_S16_S16 primitive is 4096 bit, organized as 256 words of 16 bit, and therefore
the N1-bank and the N2-bank each have a size of 2 KByte.
The number of sets that can be stored in the N1-bank and N2-bank depends on the maximal
size of the N1 and N2 sets, which in turn depends on the grammar used. For simplicity we consider
that both the N1 and N2 sets have the same maximal size (i.e. the size of the largest). For
instance, in the particular case of the SUSANNE grammar a maximal set size of 128 is used.
The maximal size of these sets should be a power of 2 and is configured in the VHDL code by
means of a generic holding the base-2 logarithm of the set size.
Table 6.3 tabulates, for different maximal set sizes, the value of this generic
and the number of such sets that can be stored in the N1-bank, respectively the
N2-bank. In the particular case of the SUSANNE grammar a maximal set size of 128 is
used (generic value 7) and both the N1-bank and the N2-bank can store (4 × 256)/128 = 8 sets.

N1/N2 set size   generic value              # of set entries/bank
32               5                          32
64               6                          16
128              7 (configured for SUSANNE) 8
256              8                          4
512              9                          2
Table 6.3: The value of the generic to be configured for different maximal
N1/N2 set sizes and the number of sets that can be accommodated in a bank.
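The entries of table 6.3 follow directly from the bank geometry, as the Python sketch below shows. It assumes the geometry described in this section (four primitives per bank, 256 words of 16 bit per primitive); `sets_per_bank` is a hypothetical helper name.

```python
PRIMITIVES_PER_BANK = 4
WORDS_PER_PRIMITIVE = 256   # a 4096-bit primitive organized as 256 x 16 bit
ITEM_BITS = 16

ITEMS_PER_BANK = PRIMITIVES_PER_BANK * WORDS_PER_PRIMITIVE   # one item per word
BANK_BYTES = ITEMS_PER_BANK * ITEM_BITS // 8


def sets_per_bank(generic_value):
    """Number of set entries in a bank when the set size is 2**generic_value."""
    return ITEMS_PER_BANK // (1 << generic_value)


print(BANK_BYTES)
for generic_value in range(5, 10):
    print(1 << generic_value, generic_value, sets_per_bank(generic_value))
```

The loop prints one row per set size, in the same order as table 6.3.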
Note: If it is necessary to store larger (or more) sets in the buffer memory, more RAMB4_S16_S16
primitives can be used for building the buffer memory. This requires, however, slight modifications
of the READER and TILER in order to cope with such changes.
The internal organization of the N1-bank and N2-bank is similar, on the assumption that (1)
both the N1-items and the N2-items are represented on 16 bit and (2) the sizes of the N1 and
N2 sets are the same. Figure 6.8 illustrates the internal organization of the N1-bank (respectively
N2-bank) in the particular case when the set-size generic has the value 7. With this
internal organization of the banks, each 64 bit word read by the READER from the chart memory,
containing 4 N1-items (when the N1 set is read), is stored "at once" in the N1-bank, one
N1-item in each RAMB4_S16_S16 memory primitive. Similarly, each 64 bit word read by the
READER from the chart memory, containing 4 N2-items (when the N2 set is read), is stored
"at once" in the N2-bank, one N2-item in each RAMB4_S16_S16 memory primitive.
Assuming that the set entry k is allocated to an incoming triplet (S1, S2, D), the READER
reads the set N1 from S2 and stores it in the set entry k of the N1-bank. Also, it reads the set
N2 from S1 and stores it in the same set entry k of the N2-bank.
[Figure 6.8 shows four RAMB4_S16_S16 primitives side by side. Each 4*16-bit word coming from the READER (from the chart memory), containing either 4 non-terminals (N1 items) or 4 partial right-hand sides (N2 items), is split into four 16-bit items, one per primitive. Each set entry (numbered 0 to 7) occupies words 0 to 31 of its slice in every primitive; the TILER reads the banks on the other side.]
Figure 6.8: Internal N1-bank/N2-bank organization in the particular case when the set size is
128 (generic value 7).

Footnote 4: We assume that a Xilinx Virtex-E XCV2000 FPGA is used. See details in [30].

The buffer memory is supervised by a manager that keeps track of the allocated set
entries. If there are free set entries in the buffer memory, the READER reads the next triplet
from the CHECKER's FIFO and asks the buffer memory manager to allocate one of the free
set entries for this triplet. On the other hand, if the buffer memory is full the READER will
wait until the manager releases a set entry (see footnote 5). Each time the buffer memory manager allocates a
set entry, an identifier corresponding to the allocated set entry (i.e. the index) is returned. Once
the sets in the source cells of the triplet have been read and stored in the buffer memory, the triplet
and the identifier are forwarded through an internal FIFO to the TILER stage. The TILER uses
the identifier for accessing the sets stored in the buffer memory, and the internal FIFO ensures
that the triplets are tiled and dispatched to the processors in the same order in which they come
from the CHECKER, regardless of the place where they were stored in the buffer memory. The
size of the internal FIFO memory is equal to the number of set entries in the buffer memory.
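The interplay between the buffer memory manager and the internal FIFO can be sketched as follows. `BufferManager` and its method names are hypothetical; the sketch only models the bookkeeping (allocation of a free set entry, in-order forwarding to the TILER, release after tiling), not the data transfer itself.

```python
from collections import deque


class BufferManager:
    """Toy model of the set-entry bookkeeping between READER and TILER."""

    def __init__(self, set_entries):
        self.free = deque(range(set_entries))  # indices of free set entries
        self.fifo = deque()                    # internal FIFO towards the TILER

    def allocate(self):
        """Return the index of a free set entry, or None if the READER must wait."""
        return self.free.popleft() if self.free else None

    def forward(self, triplet, index):
        """Called once both sets of the triplet are stored in entry `index`."""
        self.fifo.append((triplet, index))

    def tile_next(self):
        """TILER side: take the oldest triplet, then release its set entry."""
        triplet, index = self.fifo.popleft()
        self.free.append(index)
        return triplet, index


manager = BufferManager(set_entries=8)
slot = manager.allocate()
manager.forward(("S1", "S2", "D"), slot)
print(manager.tile_next())
```

Because triplets leave through the FIFO, they reach the TILER in arrival order whatever set entry they were stored in.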
6.4.5.2 The cell-combinations tiling (TILER) stage
As seen in the design analysis of the DAP design (see section 5.6.1), the size of the
cell-combinations may vary a lot, and when they are forwarded as such to the processors this typically
results in an unbalanced processor load. Tiling is implemented within the current design
architecture as a means of balancing the processor load. The basic idea is to split the (large)
cell-combinations into smaller chunks, called tiles, and to send these tiles for processing to the
processors. As the size of these tiles is more uniform, the processor load has more chances to be
balanced. One can see the tiling procedure as a dynamic clustering of the available processors
around demanding (i.e. large) cell-combinations, that is, a way of concentrating the processing
resources when and where needed.

Footnote 5: A set entry is released on the request of the TILER stage as soon as a triplet has been tiled and dispatched.

Figure 6.9: Two possible tilings for a cell-combination given by |N1| = 11 and |N2| = 23,
producing 12 tiles (left) and, with a smaller tile size, 24 tiles (right).

In general, for a tile size of t2 × t1 items, a cell-combination given by the sets N2 and N1 is
divided into ⌈|N2|/t2⌉ × ⌈|N1|/t1⌉ tiles. Figure 6.9 illustrates two possible tilings of a
cell-combination given by |N1| = 11 and |N2| = 23, one producing 12 tiles (left) and the second,
with a smaller tile size, producing 24 tiles (right). Note that the boundaries of the tiled cell-combination are not
necessarily matched and tiles smaller than the used tile size may result. The size of the tile can
be configured in the VHDL code by means of two parameters, MODULO_N2 for the set N2 and
MODULO_N1 for the set N1. The only restriction on these two parameters is that they should be
a multiple of . As a rule of thumb, in practice the table size should be neither too small nor too
large. The influence of the tile size on the system performance is investigated in section 6.6.4
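The tiling of a cell-combination can be sketched in Python as below. The function name `make_tiles` and the tile dimensions used in the example are illustrative assumptions; they are not claimed to be the values used for figure 6.9 or in the VHDL code.

```python
def make_tiles(n2_set, n1_set, modulo_n2, modulo_n1):
    """Split a cell-combination into tiles of at most modulo_n2 x modulo_n1 items."""
    tiles = []
    for i in range(0, len(n2_set), modulo_n2):
        for j in range(0, len(n1_set), modulo_n1):
            tiles.append((n2_set[i:i + modulo_n2], n1_set[j:j + modulo_n1]))
    return tiles


n2_set = [f"rhs{k}" for k in range(23)]   # |N2| = 23
n1_set = [f"nt{k}" for k in range(11)]    # |N1| = 11

tiles = make_tiles(n2_set, n1_set, modulo_n2=4, modulo_n1=8)
print(len(tiles))                          # ceil(23/4) * ceil(11/8) tiles
print([(len(a), len(b)) for a, b in tiles][-2:])   # border tiles are smaller
```

Every pair of the cross-product falls in exactly one tile, so the total work is unchanged; only its granularity is.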
The second task of the TILER is to dispatch the tiles to the processors. The tile dispatching
mechanism is similar to the one implemented in the DISPATCHER unit of the DAP design
(see section 5.3.4 for details). Basically, to find an idle processor the TILER performs a
continuous polling over all the processors in the system. When an idle processor is found, the
TILER reads the content of the two banks from the buffer memory to build a tile. Due to
the physical separation of the two banks in the buffer memory, the TILER reads these banks in
parallel. The tile is sent to the idle processor along with the ID associated with the cell-combination
to which the tile belongs. The D-TABLE entry associated with the destination cell D of the
cell-combination to which the tile belongs is updated after each dispatched tile by incrementing
its counter of issued tiles. Also, when the last tile of a cell-combination is dispatched, the
other counter in the D-TABLE entry associated with the destination cell D is decremented.
6.4.6 The processor
In the current design a processor works on tiles as a means of balancing the processor load. As
illustrated in figure 6.10, a processor is interfaced to the
DISPATCHER, which distributes the processing tasks (i.e. tiles), to the local grammar memory,
used for grammar rule lookups during the parsing, and to the WRITER, which is the interface
used for storing the processing results into the chart memory.

[Figure 6.10 shows the processor datapath: the DISPATCHER interface (DBUS, ID, GO, getDATA), the register banks N2_REG_BANK and N1_REG_BANK with their comparators, the central unit, the grammar lookup unit with the grammar memory interface (address bus and 8-, 16- or 32-bit data bus), the Z1_FIFO and Z2_FIFO, the output FIFO (OFIFO_REQ, OFIFO_ACK) and the WRITER interface (WBUS, flush request module).]
Figure 6.10: The enhanced-CYK processor datapath.

The interface with the DISPATCHER: The processor receives tiles for processing from the
DISPATCHER. The latter obtains the tiles for a certain cell-combination (S1, S2, D)
by rewriting N2 as a union of subsets N2^1 ∪ ... ∪ N2^p and N1 as a union of subsets
N1^1 ∪ ... ∪ N1^q, and taking all possible pairings (N2^i, N1^j) for 1 ≤ i ≤ p
and 1 ≤ j ≤ q. Remember that the tile size is configured in the VHDL code by means of two
generics, MODULO_N2 for the set N2 and MODULO_N1 for the set N1. The tile (N2^i, N1^j)
received by a processor is stored in two banks of registers (see figure 6.10). The first bank,
N2_REG_BANK, of size MODULO_N2, stores the set N2^i and the second bank, N1_REG_BANK, of
size MODULO_N1, stores the set N1^j. The tiles are transferred from the DISPATCHER to a
processor by means of the DBUS bus (see section 5.3.4). Once the tile has been transferred and loaded into
the two banks of registers, the signal GO issued by the DISPATCHER starts the processing
of the tile.
The interface with the grammar memory: The processing of a tile (N2^i, N1^j) consists of
grammar lookups and internal data movement and works as follows. The grammar memory
lookup unit takes a partial right-hand side α stored in the N2_REG_BANK, and retrieves from the
grammar memory all the entries (see footnote 6) (P, F, L, Y, β, ptr_table) in the level 2 list (see section 6.3.2)
corresponding to this partial right-hand side. Depending on the databus width of 8, 16 or 32
bits with the grammar memory, it takes 8, 4 and respectively 2 cycles to fetch a level 2 entry.
Each level 2 entry is written as such (without the flag L) into the Z1-FIFO memory. The
processor's central unit reads an item (P, F, Y, β, ptr_table) from the Z1-FIFO and compares the
non-terminal Y "at once" against all the non-terminals stored in the N1_REG_BANK by activating
a signal (see figure 6.10) that latches in a register the comparison results of Y against
each entry of the N1_REG_BANK. The result of this search is available on the search result signal,
and if the non-terminal Y was found in the N1_REG_BANK one or even both of the cases listed
below occur:
• if P = '1': the partial right-hand side β = αY is in the grammar. In this case
the (α, Y, β) item is stored in the output FIFO. Note that the ID and the information
contained in such an item allow the extraction of the compact parse forest;
• if F = '1': αY is a rule right-hand side. In this case the item (α, Y,
ptr_table) is returned through the Z2-FIFO to the grammar memory lookup unit.
If the non-terminal Y is not found in the N1_REG_BANK, nothing happens.
The field ptr_table in the items (α, Y, ptr_table) returned to the grammar memory
lookup unit is used to retrieve from the grammar memory level 3 tables all the non-terminals
X_i for which there exists a grammar rule X_i → αY, and all the partial right-hand sides X_i in this
case. A code RHScode that uniquely identifies the rule right-hand side αY is also retrieved in
this case. All the items retrieved from the grammar memory level 3 tables are stored in the
output FIFO. Since both the grammar memory lookup unit and the central unit may concurrently
write data into the output FIFO, the access to the output FIFO is assigned permanently to
the central unit and is requested when needed by the grammar memory lookup unit. The signals
OFIFO_REQ and OFIFO_ACK are used for this purpose.

Footnote 6: The last entry is detected when the flag L is set to '1'.
The interface with the WRITER: Within the current design, the processors do not access
the chart memory directly. Instead, they use the WRITER as an interface for storing the processing
results into the chart memory. The reason is to reduce the number of collisions between
processors and therefore to reduce the performance degradation phenomenon we have
seen in the previous implementations of the CYK algorithm.
When the output FIFO is full (or almost full) the flush request module asserts the flush request
signal (on the WRITER interface) to request the output FIFO flush. The WRITER will flush
the processor's output FIFO, using the WBUS, as soon as it polls the processor and detects the
request. The same signal is also used to indicate that a processor has finished the processing
of the tile. During the processing of a tile it is possible that the output FIFO becomes full even if
the tile processing is not finished, in which case the WRITER only flushes the content of the
output FIFO. A flush is also requested when the output FIFO of a processor that has finished
processing a tile is not empty. A processor that has finished processing a tile and has been flushed becomes
immediately available for processing other tiles.
6.4.7 The WRITER and LOOKUP units
The WRITER and LOOKUP units (see the enhanced-CYK block diagram in figure 6.1) are used
as an interface for storing the parsing results into the chart memory. The LOOKUP unit is
under the control of the WRITER unit, and for this reason the two units are discussed together.
The task of the WRITER is to flush the parsing results from a processor’s output FIFO and to
store these results in the chart memory, in the destination cell identified by the ID
associated with the flushed processor. Recall that this ID was allocated by the CHECKER and
forwarded by the DISPATCHER to the processor along with the dispatched tile.
A flushed item can be either a partial right-hand side or a non-terminal, and in either of
the two cases it should only be stored once in the destination cell set N2 or N1 respectively.
Checking for the occurrence of a partial right-hand side in the N2 set, and of a non-terminal
in the N1 set, of a destination cell is implemented in the LOOKUP unit by means of an
associative memory. The associative memory can be loaded either with the N1 set or with the
N2 set of a destination cell, but not with both at the same time, given the large amount of
hardware resources required. The sets N1 and N2 of a destination cell are retrieved from the
CTXRAW-memory with the identifier ID associated with that destination cell. When the LOOKUP
unit is loaded with the N1 set of a destination cell, the WRITER can check "at once" for the
unique occurrence of non-terminals, and when it is loaded with the N2 set the WRITER can check
for the unique occurrence of partial right-hand sides. The size of the associative memory is equal to
the constant 
 representing the maximal size of a set / in a chart cell. In the particular
case of the SUSANNE grammar this constant is 
.
Because at each moment during runtime several destination cells are under processing, and
because there is no special order on the sequence of destination cells into which the
processors’ results are flushed, the WRITER must be able to deal with (i.e. write to) the
destination cells in any order, namely the order in which the processors finish processing the
tiles. In order to support this ability, when passing from a destination cell (ID1) to another
(ID2) the WRITER performs the following three operations:
1. stores its current state, that of the destination cell (ID1), in the CTX-memory;
2. restores the state of the destination cell (ID2) from the CTX-memory;
3. loads the LOOKUP with the N1 or N2 sets (in turn) of the new destination cell ID2.
These sets are retrieved from the CTXRAW-memory.
The sequence of three operations (1)-(2)-(3) presented above is referred to henceforth as a
context switch. Note that a context switch is only required when two successively flushed
processors have different IDs. The physical address in the CTX-memory and CTXRAW-memory where
the information related to a destination cell is stored is built from the associated ID.
The WRITER employs two memory buffers in order to overlap the fetching of the parsing
results (from processors) with their writing into the chart memory. The first buffer (FB, from
Fetch Buffer) is used to store the parsing results fetched from a processor’s output FIFO, while
the second buffer (LB, from Lookup Buffer) is used by the WRITER with the LOOKUP unit.
Each buffer is labelled with the ID of the processor whose parsing results it stores. The buffers
are swapped as soon as the WRITER has finished operating on the LB buffer and the FB
contains new parsing results. A context switch takes place if, after swapping the buffers, the
new ID of the LB buffer is different from the previous one.
Once the WRITER’s context has been switched (if necessary), the associative memory in the
LOOKUP unit is already initialized with the N1 set of the destination cell (ID), where ID
tags the content of the current LB buffer. The WRITER starts to process the non-terminals
(i.e. the N1-items) stored in the LB buffer. If a non-terminal does not occur in the
associative memory, it is stored both in the associative memory and in the CTXRAW-memory (this
takes place in parallel). When the WRITER has finished processing the N1-items, it loads the
associative memory with the N2 set of the destination cell (ID) and starts to process the
partial right-hand sides (i.e. the N2-items) stored in the LB buffer. If a partial right-hand
side does not occur in the associative memory, it is stored both in the associative memory and
in the CTXRAW-memory (this takes place in parallel). When the WRITER has finished processing
the N2-items, two situations may occur:
1. the tile from LB was not the last tile (to be processed) for the destination cell ID;
2. the tile from LB was the last tile (to be processed) for the destination cell ID.
In the first case the WRITER decrements the associated counter in the D-TABLE. In the second
case the content of the CTXRAW-memory (the N1 and the N2 sets) is dumped into the chart memory
destination cell associated with the identifier ID. The ID and the entries in the D-TABLE,
CTX-memory and CTXRAW-memory corresponding to the destination cell (ID) are released. The
WRITER detects that the flushed processor processed the last tile if both counters kept for
the destination cell in the D-TABLE are 0, that is, no cell-combinations are left for
processing and no other tiles are under processing (for details on the D-TABLE see
section 5.3.6).
At this point, if the FB contains new processing results, the FB is swapped with the
LB buffer and the procedure described above repeats.
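The duplicate-free storing performed by the WRITER and the LOOKUP unit can be sketched in software as follows. This is only a behavioural illustration, not the VHDL design: the class and method names are hypothetical, a Python set stands in for the associative memory, a dictionary stands in for the CTXRAW-memory, and the D-TABLE counters and the FB/LB swapping are left out.

```python
from dataclasses import dataclass, field


@dataclass
class CellContext:
    """Stand-in for the CTXRAW-memory entry of one destination cell."""
    n1: list = field(default_factory=list)  # non-terminals stored so far
    n2: list = field(default_factory=list)  # partial right-hand sides stored so far


class Writer:
    """Behavioural model of the WRITER (hypothetical names, not the VHDL unit)."""

    def __init__(self):
        self.ctxraw = {}            # ID -> CellContext
        self.current_id = None      # ID of the destination cell currently loaded
        self.context_switches = 0   # the first selection is counted as a switch

    def _select(self, cell_id):
        # A context switch is only needed when the ID of the buffer changes.
        if cell_id != self.current_id:
            self.context_switches += 1
            self.current_id = cell_id
        return self.ctxraw.setdefault(cell_id, CellContext())

    @staticmethod
    def _store_unique(stored, items):
        # The set plays the role of the associative memory loaded with one set.
        lookup = set(stored)
        for item in items:
            if item not in lookup:
                lookup.add(item)     # update the associative memory
                stored.append(item)  # and the CTXRAW-memory entry

    def process_lb(self, cell_id, n1_items, n2_items):
        """Process one LB buffer: first the N1-items, then the N2-items."""
        ctx = self._select(cell_id)
        self._store_unique(ctx.n1, n1_items)
        self._store_unique(ctx.n2, n2_items)


w = Writer()
w.process_lb(1, [3, 5, 3], [10])
w.process_lb(1, [5, 7], [10, 11])   # same ID: no context switch
w.process_lb(2, [3], [])            # new ID: context switch
```

In this sketch an item that was already stored for a destination cell is silently dropped, which is the only property the example is meant to show.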
6.4.8 The compact parse trees extractor (EXTRACTOR) unit
The EXTRACTOR unit (see the enhanced-CYK block diagram in figure 6.1) is the interface
between the enhanced-CYK design and the host system. The information it sends to the host
machine allows the latter to rebuild the parsing forest.
When the WRITER flushes the output FIFO (i.e. the parsing results) of a processor, it forwards
a copy of each item to the EXTRACTOR. The EXTRACTOR then packs the parsing results
according to the width of the databus used to interface the enhanced-CYK design and the host
machine, in order to best exploit the available bandwidth. For instance, if a 32-bit PCI
interface is used the parsing results should be packed into 32-bit words, and if a 16-bit SCSI
interface is used the parsing results should be packed into 16-bit words.
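The packing step can be illustrated with a short sketch. It assumes fixed-width items that are placed back to back, least significant bits first; the actual item layout used by the EXTRACTOR is not described here, so the function name, the bit order and the item widths in the example are assumptions.

```python
def pack_items(items, item_bits, word_bits):
    """Pack fixed-width items back to back into bus words (LSB first)."""
    item_mask = (1 << item_bits) - 1
    word_mask = (1 << word_bits) - 1
    acc = 0       # bits waiting to be emitted
    filled = 0    # number of valid bits in acc
    words = []
    for item in items:
        acc |= (item & item_mask) << filled
        filled += item_bits
        while filled >= word_bits:
            words.append(acc & word_mask)
            acc >>= word_bits
            filled -= word_bits
    if filled:
        words.append(acc)  # last, partially filled word
    return words


# Three 11-bit items need 33 bits, i.e. two 32-bit words.
words_32 = pack_items([0x7FF, 0x001, 0x2AA], item_bits=11, word_bits=32)
# Two 8-bit items fill exactly one 16-bit word.
words_16 = pack_items([0x01, 0x01], item_bits=8, word_bits=16)
```

Packing across word boundaries, as done here, wastes no bus cycles; aligning every item on a word boundary would be simpler in hardware but would use the bandwidth less efficiently.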
6.4.9 The enhanced-CYK system monitor (MONITOR) unit
During the physical testing of the previous designs on the RC1000-PP board it was noted that
it is useful to have a mechanism that can monitor the state (i.e. normal or faulty) of the
system, recover from faults and, where possible, track their source. In the absence of such a
mechanism it was very difficult to detect a faulty system, and the only way of recovering
from such a state was to perform a manual system reset.
Such a monitoring mechanism is integrated within the enhanced-CYK design and is implemented
in the MONITOR unit (see figure 6.1). The internal structure of the MONITOR unit is
illustrated in figure 6.11. It monitors the system’s state and, in the case of a faulty
system, helps to track the unit which is the cause of the malfunction. The MONITOR unit
is very useful during design testing and allows an easier and faster integration of extension
units in the design. In order to be integrated within the monitoring system, each design unit
uses an output error signal that is activated only if an abnormal internal condition is
encountered. A unit may generate as many error signals as needed, one for each particular
error condition. For instance, the WRITER activates an error signal if the allocated maximal
cell size is exceeded for a certain destination cell, and a processor activates an error
signal if an entry in the Z1-FIFO has both of its flags set to ’0’.
[Figure 6.11: schematic; component and signal labels: POWER_UP_RESET_LOW, RESET_SYSTEM, FFI,
FFO, start PARSE, ERROR_REGISTER, ERROR_OCCURED (out to the pins of FPGA), ERROR_SOURCE(n) ...
ERROR_SOURCE(n+k), ERROR_PROCESSOR(0) ... ERROR_PROCESSOR(n-1), ERROR_WRITER, ERROR_CHECKER]
Figure 6.11: The monitoring system implemented within the enhanced-CYK design.
If a unit integrated within the monitoring system activates an error signal, both the internal
flip-flop FFI and the external flip-flop FFO are set. The FFI flip-flop is used to synchronize
the error signal to the clock, and its output is routed to the global system reset signal
RESET_SYSTEM, used to bring the system into an initial state from which new parsings can
(re)start. The RESET_SYSTEM signal is activated either by the output of the FFI flip-flop or
by the FPGA’s global initialization signal POWER_UP_RESET_LOW. Once the system is reset by the
internal RESET_SYSTEM signal, the error signal is deactivated (due to the system reset) and
the FFI flip-flop is reset. The FFO flip-flop routes the error signal to an FPGA pin that is
used to signal the faulty state to the host system. The FFO flip-flop is not reset by the
internal RESET_SYSTEM signal, which gives the host machine the time it needs to react to the
encountered error. The FFO flip-flop is reset when a new parsing starts, that is, when the
signal that starts the parsing (start PARSE in figure 6.11) is activated.
In order to track the source of errors, the error signals are latched in a register each time
the FFI and FFO flip-flops are set. The register can be read by the host system in order to
check which error signal was activated and therefore to track its source.
The proposed mechanism is useful for monitoring the state of the system and for creating a
reliable environment.
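The behaviour of the two flip-flops and of the error register can be summarised by the following cycle-level sketch. It is a simplified software model written for illustration only: the class name, the method signature and the treatment of all signals as active-high booleans are assumptions, and the exact timing of the VHDL design is not reproduced.

```python
class Monitor:
    """Cycle-level model of the MONITOR unit (illustrative, not the VHDL)."""

    def __init__(self, n_sources):
        self.ffi = False   # internal flip-flop, drives the system reset
        self.ffo = False   # external flip-flop, drives the FPGA error pin
        self.error_register = [False] * n_sources

    def clock(self, error_signals, start_parse=False, power_up_reset=False):
        """Advance one clock cycle; return True if the system reset is active."""
        if start_parse:
            self.ffo = False           # the host starts a new parsing
        reset_system = self.ffi or power_up_reset
        if reset_system:
            self.ffi = False           # FFI is cleared by the system reset
        elif any(error_signals):
            self.ffi = True            # will trigger the reset on the next cycle
            self.ffo = True            # stays set until a new parsing starts
            self.error_register = list(error_signals)  # latch the error source
        return reset_system


m = Monitor(3)
first = m.clock([False, True, False])   # unit 1 reports an error
second = m.clock([False, False, False])  # the reset is now active
```

After the second cycle the system has been reset, but FFO and the error register still hold the fault information for the host.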
6.5 Performance measurements
The tests and performance measurements presented in this section are performed with the SU-
SANNE grammar without unitary rules – containing   non-terminals and   rules.
The data-structures used to represent the chart and the SUSANNE grammar without unitary
rules in the current design are presented in section 6.3.1 and section 6.3.2 respectively. The
size of the memory required to store the SUSANNE grammar data-structure is   bytes
and the size of the chart memory depends on the length of the sentence we want to parse (e.g.
 KBytes for parsing sentences with up to  words or  MBytes for parsing sentences with
up to  words). The maximal length of the sentences that can be parsed with the enhanced-
CYK design is independent of the number of processors in the system. In other words, any
number of processors can be used for parsing sentences of any length, provided that the
amount of memory available for storing the chart is sufficient.
In order to determine the size and the clock frequency at which the system is able to work,
a 14-processors system configuration was synthesized (footnote 7) and placed and routed
(footnote 8) in a Xilinx Virtex XCV2000efg-1156 FPGA. The synthesis of the 14-processors
system was made on a design instance characterised by the parameters given in table 6.4.
Table 6.5 gives, for a Xilinx Virtex XCV2000efg-1156 FPGA, a summary of the resources used by
each unit in the synthesized 14-processors system. The overall amount of resources required by
a 14-processors system is also given. The hardware run-times presented were obtained by
simulating (footnote 9) the VHDL model of an enhanced-CYK system with 4, 7, 10 and 14
processors respectively. Each of the simulated systems runs at a 50 MHz clock frequency,
although the clock frequency at which the systems can work (i.e. the clock reported by the
place and route tool) is 60 MHz. A 50 MHz clock is used in order to compare the enhanced-CYK
design performance with the performance
of the previous CYK designs. A  tile size and a POOL with  entries are used.
The software used for comparison is an implementation of the enhanced-CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The software uses the
SUSANNE grammar in its original context-free form, while the hardware uses the same gram-
mar but without unitary rules. In order to compare the performance of the enhanced-CYK
hardware design with the performance of the previous CYK hardware designs we use two plat-
forms for running the software: the first is a SUN (Ultra-Sparc ) with  MBytes memory,
		 MBytes of swap memory, and  processor at a clock frequency of  MHz – that was
also used to benchmark the CYK designs, and the second is a PC (DELL Dimension 4300)
running the Linux OS (Mandrake 8.1) with 
 MByte memory, 	 MByte virtual memory
and a PENTIUM Intel IV processor at a clock frequency of  GHz. For these simulation, the
initialization of the chart was not taken into account for the computation of the run-times. For
accuracy, the timing was done with the times() C library function and not by profiling the code.
For the purpose of the comparison,   sentences were parsed and validated10. Among
these sentences  did not parse because at some point during the parsing the size allocated for
a chart cell was exceeded (see section 6.3.1). The sentences have a length ranging from  to 
and were all taken from the SUSANNE corpus. The hardware performance (i.e. run-times) of
enhanced-CYK systems containing 4 processors (denoted as hard_P4), 7 processors (denoted as
hard_P7), 10 processors (denoted as hard_P10) and 14 processors (denoted as hard_P14) was
compared against the software run-times.
7 With LeonardoSpectrum v2000.1a2
8 With Design Manager (Xilinx Alliance Series 2.1i)
9 With ModelSim EE/Plus 5.5c
10 The hardware output was compared to the software output for detecting mismatches
Group           Parameter          Size [units]       Details
system          CLOCK              50 [MHz]           the system clock
system          COORD_BITS         5 [bits]           bits used to represent a row/column in the chart
system          PROCESSORS         14                 processors in the system
system          MODULO_N1          12                 tile size, N1 set, see section 6.4.5.2
system          MODULO_N2          6                  tile size, N2 set, see section 6.4.5.2
chart memory    CYKMEMSIZE         278,528 [bytes]    the required size for the chart memory for sentences of maximum 32 words
chart memory    CYKLATENCY         15 [ns]            the access-time of the SRAM memory used to store the chart
chart memory    CELLSIZE_BITS      7 [bits]           maximal size of a chart N1/N2 set, see section 6.3.1
chart memory    CYK_ADR_SIZE       16 [bits]          chart data-structure pointer size, see section 6.3.1
chart memory    CYK_WAIT_CYCLES    2                  delay until chart memory data lines are stable
grammar memory  GMEMSIZE           [...] [bytes]      the size of each grammar memory
grammar memory  GLATENCY           15 [ns]            the access-time of the SRAM memory used to store the grammar memory
grammar memory  N1_SIZE_BITS       11 [bits]          number of bits representing an N1-item, see section 6.3.2
grammar memory  N2_SIZE_BITS       14 [bits]          number of bits representing an N2-item, see section 6.3.2
grammar memory  NT                 1,912              number of non-terminals (N1-items)
grammar memory  RULE_SIZE          15 [bits]          bits used to represent a distinct right-hand side, see section 6.3.2
grammar memory  CNT_LHS_SIZE       5 [bits]           bits used to count the number of N1/N2-items that may occur in a level 3 grammar table, see section 6.3.2
processor       FIFO_Z1_DEPTH      8                  processor internal FIFO, see section 6.4.6
processor       FIFO_Z2_DEPTH      8                  processor internal FIFO, see section 6.4.6
processor       OFIFO_DEPTH        16                 processor output FIFO, see section 6.4.6
other           ENTRIES            16                 entries in D-TABLE
other           OUT_GEN_DEPTH      16                 size of FIFO at output of SEQ_GEN
other           OUT_CHK_DEPTH      16                 size of FIFO at output of CHECKER
other           OUT_POOL_DEPTH     16                 size of FIFO from POOL to CHECKER
other           OUT_WRITER_DEPTH   16                 size of FIFO from WRITER to EXTRACTOR
other           POOL_ENTRIES       8                  size of the POOL
Table 6.4: The parameter values used to configure (i.e. instantiate) a 14-processors
enhanced-CYK system.
component              DFFs/Latches   FGs      CLBs     XCV2000 area utilization (%)
IOctrl                 23             18       12       0.06
SEQ_GEN                57             82       41       0.21
SEQ_GEN out FIFO       15             23       12       0.06
CHECKER                1139           1498     749      3.90
CHECKER out FIFO       23             24       12       0.06
DISPATCHER             258            303      152      0.79
POOL                   213            393      197      1.03
POOL out FIFO          15             23       12       0.06
D-TABLE                1169           2677     1339     6.97
CTX                    (uses 6 Block SelectRAMs, see Xilinx’s XCV2000 Manual [30])
CTXRAW                 (uses 32 Block SelectRAMs, see Xilinx’s XCV2000 Manual [30])
WRITER                 227            510      255      1.33
LOOKUP                 2694           9766     4883     25.43
EXTRACTOR              8              18       9        0.05
MONITOR                4              12       12       0.03
processor              576            735      368      1.92
14-processors system   13976          25025    12513    65.17
Table 6.5: Virtex XCV2000 FPGA resource utilization per enhanced-CYK design unit in terms
of Flip-Flops/Latches (DFFs/Latches), function generators (FGs) and configurable logic blocks
(CLBs) for the enhanced-CYK design.
When using the SUN machine that was also used to benchmark the previous CYK designs,
the average speedup factor of the [...] system against the software is [...].
Figure 6.12(a) shows the hardware speedup for the hard_P4, hard_P7, hard_P10 and hard_P14
systems in comparison with the software run on the SUN, as a function of the sentence length.
When using the PC machine, the average speedup factor of the [...] system against the
software is in this case [...]. Figure 6.12(b) shows the hardware speedup for the hard_P4,
hard_P7, hard_P10 and hard_P14 systems in comparison with the same software run on the PC,
as a function of the sentence length.
Note that the speedup factor increases with the sentence length (see figure 6.12(b)), up
to [...] for sentences [...] words long. This behaviour is very important when dealing with
real-life size sentences. Several other factors can increase the system performance, among
which we mention: (1) the clock frequency can be increased to 60 MHz instead of 50 MHz,
resulting in a 20% speedup improvement; (2) the number of processors can be increased
to [...], which may lead to a further increase in performance; and (3) a faster FPGA device
can be used, such as the XCV2000efg1156-8 instead of an XCV2000efg1156-6, which will allow
the system to run at an even higher clock frequency (about 20% higher). The design itself
can be further improved as well.
6.6 Design analysis
In this section some important aspects of the enhanced-CYK design are analysed and discussed.
First, in order to get a general idea about the design, we look at the processor activity and
measure the average processor utilization for some arbitrary sentences extracted from the
SUSANNE corpus. As the processor activity mainly consists of two tasks, (1) data processing
and (2) waiting for the processing results to be flushed, we investigate, as in the case of
the DAP design, the fraction of time spent by the processors in performing each of the two
tasks. Next, we investigate the effect on the overall design performance of an increased
number of processors requesting the flushing of parsing results as the sentence length
increases. The influence of the tile size and of the POOL size on the design performance is
also investigated. Finally, the system throughput for sending the compact parse forest to the
host machine is investigated in order to select the bus that will interface the FPGA-board
and the host machine.
[Figure 6.12: two plots, (a) and (b); x-axis: sentence length [words], from 2 to 16; y-axis:
speedup; curves: hard_P14, hard_P10, hard_P7, hard_P4]
Figure 6.12: Hardware speedup for the hard_P4, hard_P7, hard_P10 and hard_P14
enhanced-CYK systems against the software running on (a) a SUN machine and (b) a PC
machine, as a function of sentence length. For each sentence length more than [...] sentences
were parsed.
6.6.1 Average processor utilization
In order to get a general idea about the processor activity during the parsing process, we
illustrate the processor activity for three sentences (given in table 6.6) of length 4, 10 and
15 words in figure 6.13(a), (b) and (c) respectively. In these figures the processor activity
is depicted for a 14-processors system with a [...]-entry POOL, in two instances: (1) for a
6x12 tile size in the left column and (2) for a 128x128 tile size in the right column. The
128x128 tile size makes the enhanced-CYK design work like the DAP design, in which a
cell-combination is computed by a single processor. The speedup factors of the DAP design and
of the enhanced-CYK design, using a 6x12 and a 128x128 tile size respectively, when parsing
these sentences are also given in table 6.6.
len   sentence                                                   DAP speedup   enhanced-CYK speedup
                                                                               6x12      128x128
4     “One wing stood open”                                      22.41         282.75    215.51
10    “In fact our whole defensive unit did a good job”          34.18         328.25    280.43
15    “In societies like ours , however , its place is less      36.31         247.67    222.03
      clear and more complex”
Table 6.6: Three sentences parsed with the enhanced-CYK design for which the processor
activity is given in figure 6.13. Speedup factor is given against the software implementation.
The first important thing to note from the results in this table is the large difference
between the speedup factors of the DAP design and of the enhanced-CYK design using a 128x128
tile size. The difference comes mostly from an improved hardware design (i.e. processor,
DISPATCHER, WRITER and LOOKUP units). The effect of tiling, although significant, is not the
main factor for the observed speedup improvement. The second important thing to note is the
difference between the gray area of the DAP design (see figure 5.15 on page 86), when parsing
the same sentences, and that of the enhanced-CYK design using a 128x128 tile size (see
figure 6.13, right column). A much smaller gray area, which means a faster handling of the
processing results, is observed for the enhanced-CYK design (128x128 tile size), even though
the parsing time diminished drastically with the enhanced-CYK design.
[Figure 6.13: six processor activity charts, in three rows (a), (b), (c) and two columns;
x-axis: time [ns] (x 10^5), from 0 to 2.5; y-axis: processor #, from 1 to 14]
Figure 6.13: Enhanced-CYK processor activity in BLACK+GRAY when parsing a sentence
of length (a) 4 words, (b) 10 words and (c) 15 words. GRAY: represents the time for flushing
processing results. BLACK: represents the time spent processing data.
For comparing the average processor utilization of the enhanced-CYK design using a 6x12 tile
size with that of the enhanced-CYK design using a 128x128 tile size, we use some sentences
that were arbitrarily extracted from the SUSANNE corpus. These sentences are tabulated in
table 6.7. The average processor utilisation is computed in the same way as it was computed
for the LAP design (see section 4.1, page 52).
len   sentence                                                           enhanced-CYK U[%]
                                                                         6x12     128x128
3     “One pass only”                                                    15.34    8.88
4     “There was no moon”                                                9.94     7.18
5     “She too began to weep”                                            23.21    14.68
6     “She must not think about time”                                    16.36    13.02
7     “The form and the chaos remain separate”                           26.09    19.02
8     “The games were over , this was life”                              17.95    13.68
9     “Like Napoleon , he was the worst of losers”                       19.53    14.64
10    “In fact our whole defensive unit did a good job”                  62.53    42.90
11    “I told him who I was and he was quite cold”                       14.31    12.72
12    “Nerves tight as a bowstring , he paused to gather his wits”       50.55    36.60
13    “I told him no , that I had had a very happy childhood”            14.76    13.40
14    “As he had longed to be , he became the echo of a saga”            40.06    30.03
15    “It is all around us and our only chance now is to let it in”      27.43    21.45
Table 6.7: The average processor utilization for the enhanced-CYK design when using a 6x12
tile size and a 128x128 tile size respectively, for a set of sentences extracted from the
SUSANNE corpus with lengths between 3 and 15 words.
The results in this table show that the average processor utilization increases when
using a 6x12 tile size in comparison with the case when a 128x128 tile size is used. However,
the increase in the average processor utilization is not significant in some cases.
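The size of the increase can be read directly from table 6.7. The short script below, given only as an illustration, computes for each sentence length the difference in percentage points between the utilization with a 6x12 tile size and with a 128x128 tile size; the variable names are ours and the values are copied from the table.

```python
# Average processor utilization U[%] from table 6.7:
# sentence length -> (6x12 tile size, 128x128 tile size)
utilization = {
    3: (15.34, 8.88),
    4: (9.94, 7.18),
    5: (23.21, 14.68),
    6: (16.36, 13.02),
    7: (26.09, 19.02),
    8: (17.95, 13.68),
    9: (19.53, 14.64),
    10: (62.53, 42.90),
    11: (14.31, 12.72),
    12: (50.55, 36.60),
    13: (14.76, 13.40),
    14: (40.06, 30.03),
    15: (27.43, 21.45),
}

# Gain in percentage points obtained by tiling, per sentence length.
gain = {length: tiled - untiled for length, (tiled, untiled) in utilization.items()}

smallest = min(gain, key=gain.get)   # sentence length with the smallest gain
largest = max(gain, key=gain.get)    # sentence length with the largest gain

for length in sorted(gain):
    print(f"{length:2d} words: +{gain[length]:.2f} points")
```

The gain is positive for every sentence in the table, but it ranges from less than two points (lengths 11 and 13) to almost twenty points (length 10).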
6.6.2 Expected performance depreciation
This section investigates the influence of an increasing number of working processors on the
expected system performance. For the previous designs the expected performance depreciation
phenomenon was mainly caused by interprocessor collisions when accessing the chart memory.
A number of improvements implemented within the current design (e.g. elimination of
guard-vectors, a wider chart memory databus, a processor that does not access the chart
memory, and others) aim to limit the interprocessor collisions. In fact, within the
enhanced-CYK design interprocessor collisions only occur when flushing the processing results.
Therefore we expect the performance depreciation phenomenon to be less important within the
enhanced-CYK design.
The fact that there are fewer interprocessor collisions in the enhanced-CYK design than in the
previous DAP design can be observed by comparing the gray area in figures 6.13(a)-(c) (right
column, 128x128 tile size) with the gray area in figure 5.15, page 86. However, in
order to better illustrate this phenomenon we will use a purpose-built grammar, the same that
we used in example 4.1, which has the particularity that any cell-combination in the chart
requires the same amount of processing. The following example uses this grammar in an
experiment that illustrates the expected performance depreciation phenomenon.
Example 6.3 We use the grammar defined in example 4.1 on page 54 to parse the sentences:
"a a", "a a a", "a a a a", . . . , up to a similar sentence of length [...].
This grammar has the property of generating the same set N1 of non-terminals and the same
set N2 of partial right-hand sides for all the cell-combinations performed. This means that at
the end of the parsing each cell in the chart will contain these two sets, and that during the
parsing the processors work on the same data each time two cells are combined. In conclusion,
if the time t spent by a processor for performing a cell-combination (i.e. the parsing time
for the sentence "a a") is known, we can compute the expected parsing time for a sentence of
length n. We will also assume that a 128x128 tile size is used, in order to simplify the
computation of the expected parsing time for a sentence of arbitrary length. The method used
to compute the expected parsing time is not analytic and we used a program for this purpose.
The program computes the minimum number of steps s required for filling the chart when parsing
a sentence of length n, using a given number of processors (14 in our case). With s and t we
compute the expected parsing time as s x t.
For each of these sentences, figure 6.14 illustrates the (computed) expected parsing time
vs. the real parsing time.
[Figure 6.14: plot; x-axis: sentence length [words], from 2 to 16; y-axis: time [ns]
(x 10^6), from 0 to 2.5; curves: real, expected]
Figure 6.14: Enhanced-CYK design real vs. expected parsing time when parsing the sentences
"a a", "a a a", "a a a a", . . . up to a similar sentence of length [...] with the
enhanced-CYK design using a 128x128 tile size.
As we can see from this figure, the difference between the expected and the real
enhanced-CYK design performance (i.e. parsing time) is not significant and does
not increase with the sentence length. Therefore we can conclude that for the enhanced-CYK
design an increasing number of processors does not depreciate the system performance, as
was the case for the LAP and DAP designs.
Figure 6.15 illustrates the enhanced-CYK processor activity when parsing a sentence of
length [...] and [...] "a"s respectively. Note that the gray area does not change between the
two sentence lengths.
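The program used to compute the number of steps is not reproduced in this section. As an illustration only, the sketch below estimates it with a simple level-by-level schedule: all the cell-combinations of the cells spanning the same number of words are distributed over the processors before the next level starts. This schedule is an assumption of ours; it gives an upper bound on the minimum number of steps, not necessarily the minimum itself. The function names and the value used for the cell-combination time in the example are hypothetical.

```python
from math import ceil


def cell_combinations(n, span):
    """Cell-combinations needed for all the cells covering `span` words.

    A sentence of n words has (n - span + 1) such cells, and each of them
    is obtained from (span - 1) pairs of source cells.
    """
    return (n - span + 1) * (span - 1)


def estimated_steps(n, processors):
    """Steps needed to fill the chart when proceeding level by level."""
    return sum(
        ceil(cell_combinations(n, span) / processors)
        for span in range(2, n + 1)
    )


def expected_parsing_time(n, processors, t_comb):
    """Expected parsing time: number of steps times the cell-combination time."""
    return estimated_steps(n, processors) * t_comb


# Example with 14 processors and a hypothetical cell-combination time of 1000 ns.
for n in (2, 4, 8, 16):
    print(n, estimated_steps(n, 14), expected_parsing_time(n, 14, 1000))
```

With a single processor the estimate reduces to the total number of cell-combinations in the chart, which is a convenient sanity check.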

6.6.3 Required bandwidth for transferring the compact parse forest
In this section we (1) study whether a PCI or SCSI interface is fast enough to transfer the
compact parse forest to the host computer and (2) look at the average amount of data to be
transferred during the parsing. The first point is important for choosing the type of interface
that will be used to build the FPGA-board, and the second for determining the size of the FIFO
required for buffering the data between the enhanced-CYK design output and the host machine.
All these measurements will be made in the particular case of the SUSANNE grammar without
unitary rules.
[Figure 6.15: two processor activity charts, (a) and (b); x-axis: time [ns] (x 10^5), from 0
to 4; y-axis: processor #, from 1 to 14]
Figure 6.15: Enhanced-CYK processor activity for a sentence of length (a) [...] "a"s and
(b) [...] "a"s respectively. Note the small gray area in both cases.
The WRITER sends the parsing results fetched from the processors, without any change, to the
EXTRACTOR. The latter packs the parsing results according to the databus width of
the interface used between the FPGA-board and the host machine, in order to best exploit the
available bandwidth. For instance, if a 32-bit PCI interface is used the parsing results are
packed into 32-bit words, and if a 16-bit SCSI interface is used the parsing results are
packed into 16-bit words. The rate (i.e. throughput) at which the FPGA-board ships the parsing
results is given by the amount of data to be shipped divided by the parsing time. The bus
connecting the FPGA-board to the host machine is required to have a bandwidth larger than the
FPGA-board throughput in order not to be a system bottleneck. Since the FPGA-board throughput
depends on the parsed sentence and on the grammar used for parsing, we investigate the
throughput of the enhanced-CYK design in the particular real-life case of the SUSANNE grammar
without unitary rules. For this purpose we will use (i.e. parse) the same [...] sentences from
the SUSANNE corpus used to benchmark the enhanced-CYK design. Two buses are investigated,
a 32-bit 33 MHz PCI bus and a 16-bit Ultra 2 Wide SCSI bus.
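The comparison between the required throughput and the bus bandwidth can be sketched as follows. The two bandwidth constants are the nominal peak rates of the buses (4 bytes at 33 MHz for the PCI bus, 80 MB/s for Ultra 2 Wide SCSI); sustained rates are lower in practice. The function names and the parse result size and parsing time used in the example are hypothetical and are not measurements from this work.

```python
# Nominal peak bandwidths in bytes per second (sustained rates are lower).
PCI_32BIT_33MHZ = 4 * 33e6      # 32-bit words at 33 MHz, about 132 MB/s
ULTRA2_WIDE_SCSI = 80e6         # 16-bit Ultra 2 Wide SCSI, 80 MB/s


def required_throughput(result_bytes, parse_time_ns):
    """Throughput (bytes/s) needed to ship the parse result during the parsing."""
    return result_bytes / (parse_time_ns * 1e-9)


def fast_enough(result_bytes, parse_time_ns):
    """Tell, for each bus, whether its peak bandwidth covers the throughput."""
    needed = required_throughput(result_bytes, parse_time_ns)
    return {
        "PCI": needed <= PCI_32BIT_33MHZ,
        "SCSI": needed <= ULTRA2_WIDE_SCSI,
    }


# Hypothetical sentence: a 20 KByte parse result produced in 250 microseconds.
needed = required_throughput(20 * 1024, 250_000)
verdict = fast_enough(20 * 1024, 250_000)
print(f"{needed / 1e6:.2f} MB/s", verdict)
```

The example shows why a parse result of only a few KBytes can still require a high throughput: the shorter the parsing time, the higher the rate at which the result must be shipped.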
In order to get a general idea about the information transferred from the FPGA-board to the
host computer, we use the following example illustrating the output produced by the
enhanced-CYK design when the sentence "It was a box" is parsed with the SUSANNE grammar
without unitary rules.
Example 6.4 The output produced when parsing the sentence "It was a box" with the SUSANNE
grammar without unitary rules is given in table 6.8. Each line (i.e. parser result item)
in this table has one of two forms. Both forms identify a first source cell, a second source
cell and a destination cell, each by its row and its column. Parentheses are used to mark
[...]-items. The parsing result represented in this format allows the reconstruction
of the parsing trees on the host machine. Concretely, the chart can be reconstructed on-line
during the parsing and used later to extract the parsing trees.

The FPGA-board throughput illustrated in figure 6.16(a) (left) corresponds to the case when all
the parse result items (as in the example given in table 6.8) are sent to the host. This figure
shows the required throughput for each sentence in the benchmark set. The sentences are
ordered by increasing length. Figure 6.16(a) (right) shows, for the same sentences, the size (in
KBytes) of the parse result. Note that even if the parse result size is relatively low (i.e. several
KBytes), the required throughput is high, because the parsing times are short. Figure 6.16(a)
(left) shows that neither the SCSI interface nor the PCI interface is fast enough to transfer the
parsing results efficiently. Therefore, it is not possible to send all the parse result items to the
host. A solution is to eliminate the redundant parts from the parse result items. The squared
items illustrated in table 6.8 correspond to such a solution. The FPGA-board throughput
illustrated in figure 6.16(b) (left) corresponds to this case, and figure 6.16(b) (right) shows the
corresponding size (in KBytes) of the parse result. In this case the PCI bus is fast enough in
most cases.
An even more aggressive solution would be to send only the minimum information needed
to retrieve the parsing forest. The oval-squared items illustrated in table 6.8 correspond
to such a solution. In this case the parse result only contains the pairs of items (one in the
first source cell and one in the second source cell) that produce output results in their
associated destination cell, but does not include these output results. The output results can
nevertheless be reconstructed on the host side by performing supplementary grammar lookups,
although these lookups may introduce an overhead. The FPGA-board throughput illustrated in
figure 6.16(c) (left) corresponds to this case, and figure 6.16(c) (right) shows the corresponding
size (in KBytes) of the parse result. In this case the SCSI bus is fast enough in most cases and
the PCI bus is fast enough in all cases.
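The relation between parse result size, parsing time and required throughput used in this discussion can be sketched as follows. This is a minimal Python illustration: the function names are hypothetical, one MByte is assumed to be 10^6 bytes, and the sample size and parsing time are made-up values, not measurements from the benchmarks.

```python
# Peak bandwidths of the two interfaces considered in this section, in MB/s.
PCI_32_33_MBPS = 132.0         # 32-bit 33 MHz PCI bus
SCSI_ULTRA2_WIDE_MBPS = 80.0   # 16-bit Ultra 2 Wide SCSI bus


def required_throughput_mbps(result_size_bytes, parse_time_s):
    """Throughput needed to ship the whole parse result within the parse time."""
    if parse_time_s <= 0:
        raise ValueError("parse time must be positive")
    return result_size_bytes / parse_time_s / 1e6


def buses_fast_enough(result_size_bytes, parse_time_s):
    """Tell, for each bus, whether its peak bandwidth covers the requirement."""
    need = required_throughput_mbps(result_size_bytes, parse_time_s)
    return {"pci": need <= PCI_32_33_MBPS, "scsi": need <= SCSI_ULTRA2_WIDE_MBPS}


# Hypothetical sentence: a 10 KB parse result produced in 50 microseconds.
example = buses_fast_enough(10_000, 50e-6)
```

The sketch makes the point of the paragraph explicit: a small result divided by a very short parsing time still yields a large required throughput.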
[Figure 6.16: three rows of plots, (a), (b) and (c), each over sentence # (0 to 2000). Left
column: required throughput [MB/s], with reference lines for the 16-bit Ultra 2 Wide SCSI bus
and the 32-bit 33MHz PCI bus (axis up to 600 MB/s in (a), up to 200 MB/s in (b) and (c)).
Right column: parse result size [KB] (axis up to 45 KB in (a), up to 20 KB in (b) and (c)).]
Figure 6.16: Parse result output rate (left column) and output size (right column) for three
possible ways of formatting the parsing results.
item # parser result item
1- 1 1 (:1226) 1 2 :1509 2 1 (:1226 :1509)
2- 1 1 (:1223) 1 2 :1509 2 1 (:1223 :1509)
3- 1 1 (:1222) 1 2 :1509 2 1 (:1222 :1509)
4- 1 3 (:9) 1 4 :167 2 3 :1238
5- 1 3 (:9) 1 4 :167 2 3 :1341
6- 1 3 (:9) 1 4 :167 2 3 :1342
7- 1 3 (:9) 1 4 :167 2 3 :1344
8- 1 3 (:9) 1 4 :167 2 3 :1347
9- 1 3 (:9) 1 4 :167 2 3 :1351
10- 1 3 (:9) 1 4 :167 2 3 :1355
11- 1 3 (:9) 1 4 :167 2 3 :1363
12- 1 3 (:9) 1 4 :167 2 3 :1373
13- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1170
14- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1238
15- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1363
16- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1368
17- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1369
18- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1371
19- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1372
20- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1373
21- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1375
22- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1450
23- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1451
24- 2 1 (:1226 :1509) 2 3 :1347 4 1 :1517
Table 6.8: Output for the enhanced-CYK design when parsing the sentence "It was a box".
There are faster interfaces than the 32-bit 33 MHz PCI bus (132 MB/s) or the 16-bit Ultra
2 Wide SCSI (80 MB/s) considered in this section. For instance, the 64-bit 66 MHz PCI bus
(528 MB/s), the Ultra 160 SCSI (160 MB/s) and the Ultra 320 SCSI (320 MB/s) offer larger
bandwidths, but at the time of writing these technologies are not in common use and we
did not consider them as options for the FPGA-board implementation.
In conclusion, we select the 32-bit 33 MHz PCI bus for interfacing the FPGA-board to the
host system and use the second solution proposed above for sending the parsing results to the
host system.
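The peak bandwidths quoted above follow from the bus width multiplied by the transfer rate. The short Python sketch below reproduces this arithmetic; the function name is hypothetical, and the figure of 40 million transfers per second for Ultra 2 Wide SCSI is an assumption that is not stated in the text.

```python
def peak_bandwidth_mbytes(width_bits, mega_transfers_per_s):
    """Peak bandwidth in MB/s: bytes per transfer times millions of transfers per second."""
    if width_bits % 8 != 0:
        raise ValueError("bus width must be a whole number of bytes")
    return (width_bits // 8) * mega_transfers_per_s


pci_32_33 = peak_bandwidth_mbytes(32, 33)    # 4 bytes x 33 MHz
pci_64_66 = peak_bandwidth_mbytes(64, 66)    # 8 bytes x 66 MHz
# Assumption: Ultra 2 Wide SCSI moves 2 bytes at 40 million transfers per second.
scsi_ultra2_wide = peak_bandwidth_mbytes(16, 40)
```

These are peak values; the sustained bandwidth seen by a real transfer is lower because of arbitration and protocol overhead.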
6.6.4 Tile size influence on the design performance
In this section we study the influence of the tile size on the overall system performance. The
tiling method implemented within the enhanced-CYK design aims to balance the processor load
in order to achieve a better average processor utilization and therefore better performance. In
order to study the influence of the tile size on the design performance a number of benchmarks
have been performed on   sentences of length 3 to 15 from the SUSANNE corpus – the
same sentences used for all the benchmarks in this chapter. The benchmarks were performed
by simulating the VHDL model of the enhanced-CYK design (configured for the SUSANNE
grammar without unitary rules) with -processors and a POOL with  entries for several tile
sizes. Tile sizes of 2x12, 6x12, 12x12 and 128x128 were used for these simulations. The
latter tile size corresponds to the case when no tiling is actually performed.
The software used for comparison is an implementation of the enhanced-CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The hardware performance
(i.e. run-times) of the -processors enhanced-CYK design was measured for each of the four
tile sizes (2x12, 6x12, 12x12 and 128x128) and compared against the software run-times. The
software was run on a SUN (Ultra-Sparc ) with 
MBytes memory, 		 MBytes of swap memory, and  processor at a clock frequency of 
MHz. The initialization of the chart was not taken into account for the computation of the
run-times. For accuracy, the timing was done with the times() C library function and not by
profiling the code.
The results are illustrated in figure 6.17. It can be seen from this figure that for short
sentences (up to 8 words) the system using a  tile size gives the best performance. However,
as the sentence length increases the performance of the system using the  tile size decays
and becomes the worst for sentences above  words.
For short sentences, the explanation is that without tiling not all the processors are used
during the parsing, while with tiling the design uses more processors. Thus for short sentences
(e.g. 3 or 4 words), the smaller the tile size, the more processors are used and the better the
performance.
The explanation becomes more complex as the sentence length increases. Basically, we
can distinguish two different regions in the chart in which the tiling has a different effect. The
first region is the bottom of the chart, where all the processors are busy and the tiling only
introduces an unnecessary overhead by splitting the cell-combinations. The second region is
towards the top of the chart, where the number of interpretations (i.e. the size of the item sets
stored in the cells) becomes smaller and smaller. The direct consequence is that the processors
spend less time processing the cells towards the top of the chart. In this context the
overhead introduced for sending the tiles prevails over the gain of processing the tiles in parallel
on several processors. Concretely, there are several tiles for a cell-combination, but only a few
or none of the processors processing these tiles produce results. In this case it is better to send
the entire cell-combination to one processor instead of losing time tiling it.
There are (at least) two solutions to this problem. The first is to size the tiles adaptively
during the parsing, according to the sentence length and the size of the cells that are
combined. The second is to improve the hardware used for dispatching the tiles in order to
hide the dispatching overhead.
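To make the dispatching overhead discussed above more concrete, the following Python sketch splits one cell-combination into tiles. It rests on an assumption that is not spelled out in this section: a cell-combination is taken to pair every item of the first source cell with every item of the second, and a tile of size a x b is taken to cover at most a items of the first cell and b items of the second. The function name and the item counts in the example are hypothetical.

```python
def split_into_tiles(n_first, n_second, tile_rows, tile_cols):
    """Cover an n_first x n_second cell-combination with tiles.

    Each tile is a pair (first_items, second_items) of index ranges.
    """
    if tile_rows < 1 or tile_cols < 1:
        raise ValueError("tile dimensions must be positive")
    tiles = []
    for r in range(0, n_first, tile_rows):
        for c in range(0, n_second, tile_cols):
            tiles.append((range(r, min(r + tile_rows, n_first)),
                          range(c, min(c + tile_cols, n_second))))
    return tiles


# Hypothetical cell-combination: 5 items in one cell, 30 in the other.
small_tiles = split_into_tiles(5, 30, 2, 12)      # tile size 2x12
no_tiling = split_into_tiles(5, 30, 128, 128)     # tile size 128x128
```

With the 2x12 tile size the combination is cut into nine pieces that must each be dispatched, whereas with 128x128 it is sent as a single piece. When few of the pieces produce results, the nine dispatches are mostly overhead.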
[Figure 6.17: plot of speedup (axis from 100 to 280) versus sentence length [words] (axis
from 2 to 16), with one curve per tile size: 2x12, 6x12, 12x12 and 128x128.]
Figure 6.17: The speedup for a -processors enhanced-CYK system for several tile sizes as a
function of sentence length. For each sentence length more than  sentences were parsed.
6.6.5 POOL influence on the design performance
In both the DAP and the enhanced-CYK design, the POOL unit is key to the system's ability to
dynamically allocate the cell-combinations to the processors. Without the POOL unit the
system would hang on the first cell-combination that does not satisfy a data-dependency
constraint and keep all the remaining cell-combinations unissued, even if they satisfy their
data-dependency constraints. With a POOL unit, in contrast, the system can search over a
larger number of cell-combinations that can potentially be issued. One can interpret the POOL
unit functionality as a window sliding over the unprocessed cells of the chart, which allows the
design to issue the cell-combinations that satisfy the data-dependency constraints while keeping
under observation those that do not. In this context, a larger POOL size (i.e. a larger sliding
window) is expected to be better, as it allows the design to search over a larger number of
cell-combinations that can potentially be issued.
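The sliding-window view of the POOL can be sketched in a few lines of Python. This is a deliberately simplified static model, not the hardware behaviour: the set of completed cells is fixed, issuing takes no time, and all names are hypothetical. It only shows how the window size decides which cell-combinations can be reached past a blocked one.

```python
from collections import deque


def issue_with_pool(combinations, ready_cells, pool_size):
    """Return the names of the combinations that can be issued, in issue order.

    combinations: list of (name, needed_cells) in program order.
    ready_cells:  set of chart cells whose content is already complete.
    pool_size:    number of pending combinations the POOL can observe.
    """
    if pool_size < 1:
        raise ValueError("the POOL needs at least one entry")
    queue = deque(combinations)
    pool, issued = [], []
    while True:
        # Refill the window from the head of the queue.
        while queue and len(pool) < pool_size:
            pool.append(queue.popleft())
        ready = [c for c in pool if set(c[1]) <= ready_cells]
        if not ready:
            return issued  # pool empty, or every observed entry is blocked
        for c in ready:
            pool.remove(c)
            issued.append(c[0])


# "A" waits for cell (2, 1), which is not complete yet; "B" and "C" are ready.
combos = [("A", {(2, 1)}), ("B", {(1, 1)}), ("C", {(1, 2)})]
done = {(1, 1), (1, 2)}
```

With a single entry the blocked combination "A" occupies the whole window and nothing is issued; with two entries "A" stays under observation while "B" and "C" pass through.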
In order to investigate the influence of the POOL size on the enhanced-CYK design
performance a number of benchmarks have been performed on   sentences of length 3 to 15
from the SUSANNE corpus – the same sentences used for all the benchmarks in this chapter.
The benchmarks were performed by simulating the VHDL model of the enhanced-CYK design
(configured for the SUSANNE grammar without unitary rules) with -processors and a 6x12
tile size for several POOL sizes. POOL sizes of 1, 2, 4 and 8 were used for these simulations.
The software used for comparison is an implementation of the enhanced-CYK algorithm
and is part of the SlpToolKit that was developed in our laboratory. The hardware performance
(i.e. run-times) of the -processor enhanced-CYK design with POOL size 1 (denoted as
hard_pool1), POOL size 2 (denoted as hard_pool2), POOL size 4 (denoted as hard_pool4) and
POOL size 8 (denoted as hard_pool8) was compared against the software run-times. The
software was run on a SUN (Ultra-Sparc ) with  MBytes memory, 		 MBytes of swap
memory, and  processor at a clock frequency of  MHz. The initialization of the chart was
not taken into account for the computation of the run-times. For accuracy, the timing was done
with the times() C library function and not by profiling the code. The results are illustrated in
figure 6.18.
[Figure 6.18: plot of speedup (axis from 160 to 230) versus sentence length [words] (axis
from 2 to 16), with one curve per POOL size: hard_pool1, hard_pool2, hard_pool4 and
hard_pool8.]
Figure 6.18: The speedup for the -processors enhanced-CYK system for several POOL sizes
as a function of sentence length. For each sentence length more than  sentences were parsed.
Contrary to expectations, the size of the POOL has a significant influence on the system
performance. It was expected that the system performance would not increase when passing –
for instance – from a POOL size  to . The role of the POOL unit was expected to be less
important because of the design's ability to balance the processor load by performing tiling.
A balanced processor load, in turn, means a smaller chance that a cell-combination fails to
satisfy a data-dependency constraint (which was usually the case when large cell-combinations
were processed) and therefore needs to be stored in the POOL.
The explanation is that the tiling does not balance the processor load as much as expected.
6.7 Conclusions
This chapter proposes a design architecture implementing the enhanced-CYK algorithm adapted
for word lattice parsing. The presented enhanced-CYK design architecture can deal with
real-life, large-size, almost unrestricted CFGs. It can parse sentences of any length if enough
memory is available for storing the chart. For instance, in the particular case of the SUSANNE
grammar (without unitary rules) a chart memory size of  KBytes allows sentences with up to
 words to be parsed and a chart memory size of  MBytes allows sentences with up to
 words to be parsed. The speedup obtained when compared against a software implementation
of the enhanced-CYK algorithm run on an Intel PENTIUM IV processor at  GHz is  for
sentences of real-life size (15 words).
For all the tests, performance measurements and design analysis performed in the current
chapter the databus with the chart memory is -bit and the databus with each grammar
memory is -bit. However, the databus with the grammar memories or the chart memory can be
easily changed as needed – according to the targeted system (e.g. FPGA-board). For instance
the databus with the grammar memories can be changed by using a different VHDL module
description for the (plug-in) grammar memory interface module, while the rest of the proces-
sor stays unchanged. With such an approach we can use a 
-bit, -bit or -bit databus with
the grammar memories or even a -bit, -bit or -bit databus. The same remark applies to the
databus with the chart memory. Even more, a mixed databus environment is also allowed. For
instance some grammar memories may use a -bit databus and the rest a 
-bit databus. Such
an approach makes feasible the mapping of the proposed enhanced-CYK design on any sys-
tem containing the FPGA resources available in a Xilinx Virtex XCV2000 FPGA and enough
external memory resources.
Precisely, the proposed design architecture is the third and final step of our design
methodology, during which we:
• proposed a design that can deal with almost unrestricted general CFGs;
• simplified the (chart) initialization procedure;
• integrated a method called tiling that improves the average processor utilization and the
performance;
• integrated an on-line parse extraction module;
• integrated a module that monitors the normal system operation during runtime.
The enhanced-CYK design can deal with almost unrestricted general CFGs. Precisely, it can
deal with "non partially lexicalized" CFGs (nplCFG) that do not contain unitary rules. The
restriction to the nplCFG subclass of the CFGs was introduced because it corresponds to the
kind of grammars we are actually dealing with in practice, and also because it restricts the
processing of "lexicalized rules" to the initialization step. In addition, in order to improve the
hardware performance and to reduce the hardware complexity we also restricted the grammars
not to contain unitary rules.
The simplification of the initialization step aims to ensure that this step represents an
insignificant fraction of the overall parsing time.
The tiling method implemented within the enhanced-CYK design aims at improving the
average processor utilization and therefore the design performance. With the tiling method
the processors can be assigned to process chunks of a cell-combination (i.e. the tiles) unlike
the processors in the previous designs that were processing entire cell-combinations. One can
see the tiling method as the design’s ability to dynamically cluster the available processors
around demanding (i.e. large) cell-combinations. That is, a way of concentrating the processing
resources when and where needed.
The integration of an on-line parse extraction module finally makes the extraction of the
compact parse forest feasible. The information sent by the parse extraction module allows a
host machine to rebuild (on-line) the chart table, which can then be used for extracting parsing
trees.
The monitoring module creates a reliable environment that allows the design to continue,
i.e. restart, with a new parsing whenever a fatal error occurs. It also allows the host machine
to check which unit in the system is the source of the error, an ability which is especially useful
when debugging or when integrating new functionalities in the enhanced-CYK design.
From the performed experiments and performance measurements of the enhanced-CYK
design we make the following remarks:
• increasing the number of processors in the system can only increase the system
performance. The overhead introduced for managing more processors than actually needed
(especially when parsing short sentences) does not result in a performance degradation,
as it did in the case of the DAP design;
• when the number of processors in the system increases, no performance degradation is
observed, unlike in the case of the DAP design. This is mainly the result of eliminating the
guard-vectors and using the DISPATCHER and the WRITER as interfaces for accessing
the chart memory;
• the average processor utilization is (still) relatively low. For instance, in the particular case
of the SUSANNE grammar the observed average processor utilization for some arbitrary
sentences extracted from the SUSANNE corpus is between 7% and 45% when tiling is not
used. In the same case, a better average processor utilization, between 10% and 62%, is
observed when tiling is used;
• the tiling mechanism does not balance the processor load as much as expected. The
explanation is that for longer sentences the number of interpretations (i.e. the size of the
item sets stored in the cells) for the cells towards the top of the chart becomes smaller and
smaller. The direct consequence is that the processors spend less time processing the
cells towards the top of the chart. In this context the overhead introduced for sending the
tiles prevails over the gain of processing the tiles in parallel on several processors, and
it is better to send the entire cell-combination to one processor instead of losing
time tiling it. This also explains the relatively low average processor utilization
mentioned above.
Two solutions to this problem are: (1) to use an adaptive tile size during the parsing,
starting with a small tile size and increasing it as the row being filled gets closer to the top
of the chart, and (2) to improve the hardware used for dispatching the tiles in order to hide
the dispatching overhead;
• the throughput of the enhanced-CYK design is relatively high, and if the parsing time
shrinks even more it will soon become too large for a standard 32-bit 33 MHz PCI bus. A
solution in this case would be to use a faster bus;
• less than 65% of the resources available in a Xilinx Virtex XCV2000 FPGA were used
for a -processors enhanced-CYK design. Resources are still available for further
improvements.
Chapter 7
An accelerator FPGA-board for
running the enhanced-CYK algorithm
This chapter presents an accelerator FPGA-board that implements the enhanced-CYK design
presented in chapter 6. The proposed FPGA-board contains the necessary hardware resources
for implementing a -processors enhanced-CYK design and features a PCI interface that al-
lows it to be integrated and work as an accelerator within a host machine (e.g. desktop PC).
The chapter starts with a general and functional description of the proposed FPGA-board.
Next it presents in greater detail the purpose of the components on the FPGA-board.
7.1 General description
As the analysis of the enhanced-CYK design (see section 6.6.3) shows, a common 32-bit
33 MHz PCI bus has a bandwidth (132 MBytes/s) which is large enough in most cases¹
for transferring the compact parse forest to a host machine.
However, building a PCI interface is not an easy task in itself, and for this purpose we used
an off-the-shelf generic PCI interface board, namely the development board provided by
PLX Technology Inc. in the Reference Design Kit (RDK). This PCI interface board is built
around PLX's IOP 480 I/O processor, which integrates both a PCI-bus controller and a
32-bit 66 MHz PowerPC RISC core. The PCI interface board provides an expansion connector
(PLX Option Module, POM) that can be used to plug in custom-built modules. We used the
POM expansion connector to plug in our FPGA-board expansion module implementing the
enhanced-CYK design.
Figure 7.1 illustrates the FPGA-board, which is built from the PCI interface board and the
FPGA-board expansion module. Hereafter, we will use the terms:
• PCI interface board: to denote the PLX PCI development board;
• FPGA expansion board: to denote the FPGA-board implementing the hardware design
of the enhanced-CYK algorithm;
• FPGA-board: to denote the ensemble of the two above.
¹ This analysis was made in the particular case of the SUSANNE grammar.
[Figure 7.1: block diagram of the PCI INTERFACE BOARD and the FPGA EXPANSION BOARD.
Block labels: RS232 interface, programmable clock, IOP 480, reset control, serial EPROM,
driver, FIFO, FPGA (XCV2000efg1156-6), DISPLAY, GM 1, GM 2, GM 3, ..., GM 15, GM 16,
clock, CPLD ("glue"-logic), FLASH (512K), SDRAM (32M), chart memory 1, chart memory 2,
two bus switches, POM connector (Local bus).]
Figure 7.1: An FPGA-board implementing a -processors enhanced-CYK design.
The resources available on the PCI interface board are: 512 KBytes flash memory, 32 MBytes
SDRAM memory and an RS-232 serial interface. The flash memory stores the power-on
initialization routines for the PCI interface board and several other initialization routines for
the FPGA expansion board. The SDRAM stores the lexicon, which is used to initialize the
chart memory, and some other (large) data-structures.
The FPGA expansion board uses a Xilinx Virtex-E XCV2000efg1156-6 FPGA for
implementing a -processors enhanced-CYK design (see chapter 6) and also contains the
memory resources required for this purpose (i.e. chart and grammar memories). For the
implementation of the -processors enhanced-CYK design a -bit databus with the chart
memory and a -bit databus with the grammar memories are used.
The CPLD implements the "glue-logic" used to interface the FPGA expansion board and
the PCI interface board, and the FIFO memory is used for buffering the compact parse forest
in DMA transfers over the PCI-bus. Note that the FPGA expansion board uses two chart
memories so that the chart memory initialization and the parsing can overlap, which makes
the initialization transparent.
7.2 Functional description
This section describes the steps required for initializing the FPGA-board and how the FPGA-
board can be used for parsing sentences or word lattices.
The FPGA-board is a slave device under the full control of the host system. In other words,
any parsing request or operation initiates on the host system. In order to execute the commands
issued by the host system the FPGA-board relies on a software component running on the
IOP 480 processor. This software – stored in the flash memory on the PCI interface board
– is the interface between the host machine and the FPGA-board and contains routines for
executing commands such as: initializing the FPGA on the FPGA expansion board, initializing
the grammar memories, initializing the chart memories, initializing the lexicon, starting the
parsing process and others.
Three initialization steps are required for initializing the FPGA-board before any parsing
can start. These steps are required for configuring the FPGA on the FPGA expansion board, for
initializing the lexicon and respectively the grammar memories.
The FPGA is configured by means of a configuration file generated by a place&route tool
(e.g. Design Manager, Xilinx Alliance Series 2.1i) that is provided by the host system to the
FPGA-board. The FPGA on the FPGA expansion board is configured under the control of the
IOP 480 processor that generates the required signals during the configuration process.
The initialization of the lexicon consists of storing the lexicon in the SDRAM memory
available on the PCI interface board, in a data-structure that allows fast word lookup (i.e. prefix
trees). The lexicon is provided by the host system in a format compatible with the SlpToolKit
software (see appendix A.1) and the lexicon data-structure is built locally by the IOP 480 pro-
cessor. The time required for building the lexicon data-structure is not critical as it is done only
once. The lexicon is specific to a grammar and does not change during successive parsings. A
reinitialization of the lexicon data-structure is only required if the grammar changes.
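The prefix-tree lookup mentioned above can be illustrated with a minimal Python sketch. The class name, the method names and the category identifiers in the example are all hypothetical; they are not taken from the SlpToolKit lexicon format or from the SUSANNE grammar, and the software running on the IOP 480 processor may organize the structure differently.

```python
class LexiconTrie:
    """Prefix tree mapping a word to the lexical categories it can take."""

    def __init__(self):
        self.children = {}     # next character -> child node
        self.categories = []   # categories of the word ending at this node

    def insert(self, word, category):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, LexiconTrie())
        node.categories.append(category)

    def lookup(self, word):
        """Return the categories of word, or an empty list if it is unknown."""
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.categories)


# Made-up category identifiers, for illustration only.
lexicon = LexiconTrie()
lexicon.insert("box", 101)
lexicon.insert("box", 102)   # an ambiguous word carries several categories
lexicon.insert("a", 7)
```

The cost of a lookup grows with the length of the word and not with the size of the lexicon, which is what makes this structure suitable for initializing the chart quickly.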
The initialization of the grammar memories consists of storing a copy of the binary image of
the grammar data-structure in each grammar memory available on the FPGA expansion board.
The binary image of the grammar data-structure is provided by the host system and the IOP
480 processor only has to copy the grammar data-structure in the grammar memories. Note
that the only access path for the IOP 480 processor to the grammar memories passes through
the FPGA circuit and therefore the FPGA circuit has to be configured before the grammar
memories can be initialized. Moreover the circuit configured in the FPGA should support the
IOP 480 processor during the initialization of the grammar memories. For this purpose, the
circuit inside the FPGA has two working states: (1) used for initializing the grammar memories
and (2) implementing the enhanced-CYK algorithm. The working state is switched under the
control of the IOP 480 processor whenever required.
Once the above three initializations have been performed the FPGA-board is ready to parse
sentences and/or word lattices. The procedure given below describes how a sentence is
initialized and parsed with the FPGA-board. First, the sentence is sent in text form by the host
system to the FPGA-board, where the IOP 480 processor splits the sentence into words and
looks up the words in the lexicon data-structure in order to initialize the chart memory. Once
the chart memory is initialized, an FPGA internal register is configured with the length of the
parsed sentence
and the signal _2 is activated. The parsing result (i.e. the compact parse forest)
is written by the FPGA into the FIFO memory from where the IOP 480 processor performs
a DMA transfer to the host system (1) as soon as the FIFO is (almost) full or (2) the parsing
is finished and the FIFO is not empty. When the parsing ends the FPGA activates the signal
_&"
3 which is used to signal the host that the parsing results are available and
that a new parsing can start.
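The two conditions under which the IOP 480 processor starts a DMA transfer from the FIFO can be written as a small predicate. This Python sketch is only an illustration: the function name and the almost-full threshold parameter are assumptions, since the text does not give the FIFO depth or the threshold.

```python
def dma_transfer_needed(fifo_level, almost_full_level, parsing_finished):
    """Decide whether the FIFO content should be sent to the host now.

    fifo_level:        number of entries currently stored in the FIFO.
    almost_full_level: hypothetical threshold at which the FIFO counts as (almost) full.
    parsing_finished:  True once the FPGA has signalled the end of the parsing.
    """
    if fifo_level >= almost_full_level:
        return True                       # condition (1): FIFO (almost) full
    return parsing_finished and fifo_level > 0   # condition (2): flush the remainder
```

Condition (2) guarantees that the last, partially filled batch of results is not left behind in the FIFO when the parsing ends.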
Note that two chart memories are used, which allows the chart memory initialization and
the parsing to overlap. While one chart memory is initialized with the next sentence to be
parsed, the other is used to parse the current sentence. When the current parsing ends the two
chart memories are swapped⁴ and a new parsing can immediately start. Therefore, the
initialization and parsing of successive sentences are overlapped and the time spent on chart
memory initialization is hidden.
7.3 The FPGA-board
7.3.1 The PCI interface board
7.3.1.1 IOP 480 I/O processor
The IOP 480 is an I/O processor with a 32-bit 66 MHz PowerPC RISC core, an integrated
32-bit PCI bus interface, a 3-channel DMA controller, memory controllers and a UART. The
RISC core is compatible with the PowerPC Book D architecture from an instruction and register-level
perspective. The IOP 480 processor interfaces a Local bus running at 66 MHz (available on the
POM connector) to the PCI bus running at 33 MHz. The Local bus is the standard J-Mode
multiplexed bus. The integrated memory controller can support SDRAM, FLASH, ROM and
other types of memory and peripherals.
In order to access the chart memory, the FIFO memory and the other devices on the FPGA
expansion board, the IOP 480 processor uses 3 of the 4 available Local bus Chip Selects
(:  ) generated by the integrated memory controller. The Local bus Chip Selects can
be configured by means of some IOP 480 internal registers that are used to control the bus
width and the timing for the memory regions assigned to each Local bus Chip Select (see [24]
for details). The Local bus Chip Selects are used as follows:
2the equivalent of the 	 signal used in the block diagram of the enhanced-CYK design, see page 98
3the equivalent of the 
	 signal used in the block diagram of the enhanced-CYK design, see page 98
4The hardware supports the swapping of the chart memories by means of bidirectional bus-switches.
 : for accessing general purpose FPGA user registers, for FPGA configuration, for
swapping the chart memories and for programming the programmable clock;
 + for accessing the chart memories (i.e. initialization and possibly readback during
debugging);
  for reading the FIFO memory during DMA transfers;
The fourth Local bus Chip Select is already assigned to the 512 KBytes flash memory on the
PCI interface board (see appendix C for a detailed IOP 480 memory map). The memory
controller has bursting
capabilities for supporting high data transfer rates. The FPGA-board uses burst transfers for
transferring the parse results from the FIFO memory to the host machine. Such burst transfers
take place under the control of the on-chip DMA controller.
The Local bus Chip Select (:  ) signals as well as all IOP 480 signals (i.e. address,
data, control) on the Local bus are available on the POM connector and further to the FPGA
expansion board.
7.3.1.2 The serial EEPROM
The serial EEPROM on the PCI interface board is a MN93CS66LEN 4 Kbit serial EEPROM,
used as a boot device for initializing the IOP 480 processor internal registers – in particular
those required to configure the PCI bus controller – after reset.
7.3.1.3 The flash memory
The flash memory is a 512 KByte memory addressed on 8-bit words, with  [ns] access
time. This memory can be read and written and is selected by the Local bus Chip Select signal
. It stores the power on initialization routines and the software that allows the host to
communicate with the FPGA-board. This software includes routines for configuring the Virtex
XCV2000 FPGA on the FPGA expansion board, for initializing the chart and grammar memo-
ries and others.
Note: Because the flash memory is relatively slow, the software it contains should be copied
into the SDRAM memory and executed from there.
7.3.1.4 The SDRAM memory
The SDRAM memory consists of four 8Mx8 bit SDRAMs, providing a total of 32 MBytes
on-board SDRAM with a 	 [ns] access time. The SDRAM controller is part of the integrated
memory controller of the IOP 480 and is software programmable. The SDRAM memory is
used for mass storage and in particular for storing the lexicon data-structure, the binary image
of the grammar memories and the FPGA configuration byte-stream. It is also used to run the
software that allows the host computer to communicate with the FPGA-board.
7.3.1.5 The Serial port
The serial port on the PCI interface board is a standard RS-232 DB-9 connector. The IOP 480
processor built-in UART is connected to this connector through a voltage level converter. This
interface is very useful during the testing of the FPGA-board for debugging purposes.
7.3.1.6 The PLX Option Module Connector
The PLX Option Module (POM) connector is directly connected on the -bit multiplexed
J-Mode Local bus.
Some signals used by the FPGA expansion board are not available on the POM connector
but can be retrieved on two 2x10 headers (JP5 and JP8, see [23]) available on the PCI interface
board. These signals are: , 2,  and the latched address lines 6  .
7.3.2 The FPGA expansion board
The FPGA expansion board contains the hardware resources required for implementing a -
processors enhanced-CYK design as presented in chapter 6. The FPGA expansion board is
built as an expansion module for the PCI interface board and uses the POM connector for this
purpose. The schematics for the FPGA expansion board are available in appendix C. For all
the signals mentioned in this chapter please refer to these schematics.
7.3.2.1 The programmable clock
The circuit used is a dual programmable clock generator ICD 2051 with two independent clock
outputs ranging from 320 KHz to 100 MHz. The circuit also requires a clock reference that is
derived from a 10 MHz clock oscillator. The two clocks are fully user-programmable phase-
locked loops. The circuit is required for driving the FPGA clock input(s) with a clock frequency
that matches the working clock frequency of the design configured in the FPGA circuit. The
two clock outputs are used to drive two (out of four) clock pins (i.e. the + and :)
of the Xilinx Virtex XCV2000 FPGA used on the board. The second clock may eventually be used to build
a more sophisticated SRAM memory access module or for other purposes, but is currently
unused.
The ICD 2051 circuit is powered from a 5V power supply while the rest of the FPGA ex-
pansion board is using 3.3V signalling. For this reason each of the two programmable clock
outputs is passed through a zero delay clock buffer CY 2304 operating at 3.3V and having the
clock input tolerant to 5V signalling. The operating range of this circuit is from  MHz to
 MHz, thus the working frequency range of the FPGA expansion board will be restricted to
the range from  MHz to  MHz.
Note: The observation above is important for the host-side software library function that is
invoked when programming the programmable clock. This function has to check that the pro-
grammed frequency is within the above mentioned range.
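The range check mentioned in the note above could be sketched as follows. This is only an illustration: the function name and its parameters are hypothetical, and the frequency bounds are deliberately passed in as arguments and must be taken from the clock buffer data-sheet.

```python
def check_clock_frequency(freq_mhz, min_mhz, max_mhz):
    """Return freq_mhz if it lies in [min_mhz, max_mhz], else raise ValueError.

    min_mhz and max_mhz are hypothetical parameters standing for the
    operating range of the clock buffer; no default values are assumed here.
    """
    if min_mhz > max_mhz:
        raise ValueError("invalid range: min_mhz is larger than max_mhz")
    if not (min_mhz <= freq_mhz <= max_mhz):
        raise ValueError(
            f"{freq_mhz} MHz is outside the allowed range "
            f"[{min_mhz}, {max_mhz}] MHz"
        )
    return freq_mhz
```

A host-side library would call such a check before writing anything to the programmable clock ports, so that an out-of-range request is rejected in software.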
The clock output driving the FPGA clock +_ is also used to drive the FIFO memories
(required for synchronous writes) available on the FPGA expansion board.
The programmable clock is controlled by the IOP 480 processor. The programmable clock
is decoded in the memory space assigned to the Local bus Chip Select :. The port addresses
used for this purpose are given in table 7.1 (see the ICD 2051 data-sheet for more details). A
detailed memory map is given in appendix C.
port R/W meaning
+  W pulse  to load _" input
+  W pulse  to load _" input
+ 9 W set/reset (; for switching the clock output A
+  W set/reset (; for switching the clock output B
Table 7.1: Port addresses used to configure the dual programmable clock generator. For details
see the ICD 2051 circuit data-sheet.
7.3.2.2 The FPGA
The FPGA used on the FPGA expansion board is a Xilinx Virtex-E XCV2000efg1156-6. This
FPGA was chosen due to the high number of user available pins (i.e. 804 pins), the internal
memory resources (160 Block SelectRAM) and its advanced technology, which allows relatively
high clock frequencies. Among these features, the internal memory resources were particularly
important for the implementation of the enhanced-CYK design presented in chapter 6.
The FPGA uses a core voltage of 1.8 V and the user pins are 3.3V LVTTL compatible, as are the
rest of the circuits used on the FPGA expansion board.
The FPGA circuit is configured in the SelectMAP mode which is the fastest configuration
option (see [30] for details). The FPGA configuration byte-stream is provided by the host
machine and configured in the FPGA under the control of the IOP 480 processor. The signals
required for configuring the FPGA are decoded in the memory space assigned to the Local bus
Chip Select :. The port addresses used for controlling these signals are given in table 7.2
(see [30] for details). A detailed memory map is given in appendix C. The user also has access
to  -bit FPGA internal general-purpose registers that can be used for data-communication,
state readback and during the debugging phase of an implemented design. These registers are also decoded in
the memory space assigned to the Local bus Chip Select :. The signals used to read/write
the internal FPGA general-purpose registers are: :   (used to select one of these registers)
, 2,  and (_. Table 7.2 gives the port addresses used to access the FPGA
internal general-purpose registers. In particular, one register is used for configuring the length
of the sentence to parse and three registers (MSB, middle and LSB parts) are used to store the
error code in case a fault occurred during the parsing. The code stored in these registers allows
the user to track the unit that generated the error signal (see section 6.4.9 for details). Two
further registers are used to support the grammar memories initialization.
port R/W meaning
+  R reads the _ and _"2 status signals
+ 7 W write the _2 control signal
+ 9 W activates the _ FPGA chip select
+  R/W FPGA general-purpose register  (sentence length)
+  R/W FPGA general-purpose register  (fault code, MSB)
+ 9 R/W FPGA general-purpose register  (fault code)
+  R/W FPGA general-purpose register  (fault code, LSB)
+  R/W FPGA general-purpose register  (grammar initialization support)
+  R/W FPGA general-purpose register  (grammar initialization support)
+ 9 R/W FPGA general-purpose register  (not assigned)
+  R/W FPGA general-purpose register  (not assigned)
+ + R/W FPGA general-purpose register  (not assigned)
+ + R/W FPGA general-purpose register  (not assigned)
+ +9 R/W FPGA general-purpose register  (not assigned)
+ + R/W FPGA general-purpose register  (not assigned)
+ : R/W FPGA general-purpose register  (not assigned)
+ : R/W FPGA general-purpose register  (not assigned)
+ :9 R/W FPGA general-purpose register  (not assigned)
+ : R/W FPGA general-purpose register  (not assigned)
Table 7.2: Port addresses used to control the signals involved in the configuration of the Virtex-E
XCV2000 FPGA and for accessing the  -bit FPGA internal general-purpose registers.
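On the host side, the three fault-code registers listed in table 7.2 (MSB, middle, LSB) would typically be combined into a single integer before being interpreted. The sketch below shows one way to do this. It assumes, as a labelled hypothesis, that each register is 8 bits wide; the function name and the `register_bits` parameter are illustrative and not part of the design.

```python
def assemble_fault_code(msb, mid, lsb, register_bits=8):
    """Combine three register values into one fault code.

    register_bits=8 is an assumption made for this sketch only.
    """
    mask = (1 << register_bits) - 1
    for value in (msb, mid, lsb):
        if not 0 <= value <= mask:
            raise ValueError(f"register value {value} does not fit in {register_bits} bits")
    # Most significant register goes into the highest bit positions.
    return (msb << (2 * register_bits)) | (mid << register_bits) | lsb
```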
7.3.2.3 FIFO memory
The FIFO memory used on the FPGA expansion board is implemented with 4 CY7C4291-10JI
and works at a frequency up to  MHz. The CY7C4291-10JI is a 128K x 9bit FIFO memory
and the 4 chips are used to build a 128K x 32bit FIFO memory. The FIFO is fully asynchronous
and allows simultaneous read and write operations. Data is written in the FIFO from the FPGA
in words of 32-bit and the clocks +_2   (synchronous with +_, gener-
ated by CY2304) are used for this purpose. Data is read from the FIFO during DMA transfers
by the IOP 480 DMA controller at the 66 MHz clock frequency of the IOP 480 Local bus, avail-
able on the 2_ on the POM connector. The "glue"-logic required for accessing the FIFO
memory during DMA transfers is implemented in the ALTERA 7128AE CPLD. The DMA
transfer is initiated when the flag  (FIFO almost empty) goes high, meaning that a significant
amount of data is already in the FIFO, and is stopped when the empty flags    are
activated. Using DMA transfers on -bit ensures that the full bandwidth available on the PCI bus
is used. The FIFO memory is mapped in the IOP 480 memory space    
and is decoded by the Local bus Chip Select signal . See appendix C for a detailed memory
map.
7.3.2.4 Chart memory
The chart memory used on the FPGA expansion board is implemented with 8 TC55V16100FT-12
circuits. Each TC55V16100FT-12 is a 1M x 16bit static RAM. As figure 7.2 illustrates, two chart
memories are used, which allows the chart memory initialization and the parsing to overlap.
While chart memory 1 is being initialized with the next sentence to be parsed, chart memory
2 is used for parsing the current sentence. When the parsing ends, a new one can immediately
begin on chart memory 1, which has already been initialized, and so on.
Both the chart memory 1 and the chart memory 2 are built from 4 TC55V16100FT-12 circuits,
which are organized in two blocks, low L and high H (see figure 7.2). The L and H blocks are
each built from 2 TC55V16100FT-12 and organized as 1M x 32bit memories, which can be directly
accessed by the IOP 480 processor on the Local bus, which is a -bit databus. On the other
hand, the FPGA accesses the L and H blocks of a chart memory in parallel on a 64-bit databus.
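The sizes implied by this organization can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text (1M x 16bit chips, 2 chips per block, 2 blocks per chart memory, 2 chart memories); the constant names are illustrative.

```python
CHIP_WORDS = 1 << 20        # 1M words per TC55V16100FT-12
CHIP_WIDTH_BITS = 16        # each word is 16 bits wide
CHIPS_PER_BLOCK = 2         # an L or H block is 1M x 32bit
BLOCKS_PER_CHART = 2        # L and H
CHART_MEMORIES = 2          # chart memory 1 and chart memory 2

block_width_bits = CHIP_WIDTH_BITS * CHIPS_PER_BLOCK       # width of one block
fpga_bus_width_bits = block_width_bits * BLOCKS_PER_CHART  # L and H in parallel
block_bytes = CHIP_WORDS * block_width_bits // 8
chart_bytes = block_bytes * BLOCKS_PER_CHART
total_chips = CHIPS_PER_BLOCK * BLOCKS_PER_CHART * CHART_MEMORIES

MBYTE = 1 << 20
print(block_width_bits, fpga_bus_width_bits)         # block and FPGA bus widths
print(block_bytes // MBYTE, chart_bytes // MBYTE)    # MByte per block, per chart memory
print(total_chips)                                   # number of SRAM chips
```

The results agree with the text: each block is 32 bits wide, the FPGA sees 64 bits, each chart memory holds 8 MByte and 8 SRAM chips are needed in total.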
The IOP 480 processor uses the Local bus Chip Select +, mapped in the memory space
6   6  to access the chart memories. More precisely, the selection signals
used by the IOP 480 to access the chart memories and the mapping of the chart memories in
the IOP 480 memory space are given below:
 chart memory 1, L: selected by _, mapped in 6   6: ;
[Figure 7.2: block diagram. Chart memory 1 and chart memory 2 (8 MByte each) each consist of
an L and an H block (1M x 32bit SRAM each). The DATA, ADDR, OE, WE and CE lines of every block
pass through bus switches that, under the control of a select signal, connect the block either
to the IOP 480 or to the FPGA (data, address, control).]
Figure 7.2: The organization of the chart memories on the FPGA expansion board. Two chart
memories are used in order to overlap the parsing and the chart initialization.
 chart memory 1, H: selected by _&, mapped in 6   6 ;
 chart memory 2, L: selected by +_, mapped in 69   6 ;
 chart memory 2, H: selected by +_&, mapped in 6   6 ;
On the other hand, the FPGA sees the chart memories starting at the physical address 0.
Passing the control of a chart memory to the IOP 480 or the FPGA means that full control is
given over the databus ", address bus "" and control signals (, 2 and ) of that chart
memory. This is implemented by means of 12 bidirectional "near-zero delay" bus-switches
IDTQS34XVH245. The  signal – controlled by the IOP 480 (see footnote 5) – is used to control
the bus-switches in order to assign the chart memories ownership to the IOP 480 processor
or respectively the FPGA. Table 7.3 gives the routing of the IOP 480 Local bus signals and
respectively the FPGA bus signals to the L and H banks of the chart memories under the control
of the  signal. Note that when    the chart memory 1 is under the control
of the IOP 480 while the chart memory 2 is under the control of the FPGA and vice-versa.
5 The  signal is decoded in the memory space assigned to . The port address   is used to
control this signal.
chart memory 1
|" "" 2  
L ":  _
&_"+ <
& 6 
2 
H ":  _&
L _":  _
_""<  _2 _
H _"6: :+ _
chart memory 2
|" "" 2  
L _":   _
_""<   _2 _
H _"6:  :+ _
L ":  +_
&_"+ <
& 6 
2 
H ":  +_&
Table 7.3: Routing of the IOP 480 Local bus signals and respectively the FPGA bus signals
to the L and H banks of the chart memories under the control of the  signal. The
organization of the chart memory 1 and the chart memory 2 is illustrated in figure 7.2.
7.3.2.5 Grammar memories
Each grammar memory is implemented with a 1M x 16bit TC55V16100FT-12 static RAM.
This amount of memory is considered enough for storing large-size real-life CFGs. For instance,
the SUSANNE grammar without unitary rules only uses 1/4 of this memory space.
Before the parsing starts, each grammar memory is initialized by the IOP 480 from a binary
image of the grammar data-structure provided by the host machine. In order to reduce
the number of components on the FPGA expansion board, the grammar memories are only connected
to the FPGA and not to the IOP 480 Local bus. The only access for the IOP 480 to
the grammar memories is through the FPGA. Thus, the initialization of the grammar memories
can only be done with the support of the design (i.e. enhanced-CYK) configured in the FPGA.
The enhanced-CYK design should be extended to include a mechanism that supports the IOP
480 during the initialization of the grammar memories. For this purpose, the two FPGA internal
general-purpose registers marked "grammar initialization support" in table 7.2 are used (see
section 7.3.2.2). One of them is used for data transfers and the other is used for signalling
(e.g. marking the end of the initialization, handshake).
Thus, the initialization of the grammar memories is done in 3 steps: (1) the FPGA is configured
with the enhanced-CYK design extended to support the initialization of the grammar
memories, (2) the enhanced-CYK design uses the grammar memory initialization mechanism
for initializing the grammar memories and (3) the enhanced-CYK design is switched to normal
operation and waits for sentences to parse. The extension of the enhanced-CYK design is
relatively easy to implement and is not discussed here.
7.3.2.6 "Glue"-logic
The "glue"-logic is implemented with an ALTERA EPM7128AE CPLD that can work up to  MHz. The circuit
implements all the "glue"-logic required for interfacing: the FIFO to the IOP 480 DMA con-
troller, the IOP 480 to the FPGA as well as for generating the port and chart memories selection
signals. It can be programmed in-system via the industry standard 4-pin IEEE 1149.1 (JTAG)
interface.
7.3.2.7 Display
The display is implemented with a DLG1414 circuit. It is a general-purpose display that can be
used, for instance, during debugging to display FPGA internal status information.
7.3.2.8 Power supply
The FPGA expansion board uses three power supplies: 5V for the dual-programmable clock
generator and the display, 3.3V for most of the components and 1.8V for the FPGA core voltage.
The power consumption on the 5V supply and 1.8V supply is not critical and does not require
special precautions. However, on the 3.3V supply the peak current consumption is around 7A under
the assumption that all the SRAMs on the FPGA expansion board work at maximum frequency.
For this reason the FPGA expansion board is powered directly from
a 5V source available on the host machine and not through the PCI interface board. The 3.3V
and 1.8V supplies are derived locally with an LT1584CT and an LT1764 circuit, respectively.
7.4 Conclusions
This chapter presented an FPGA-board that implements the enhanced-CYK design presented
in chapter 6. The proposed FPGA-board contains the necessary hardware resources for imple-
menting a -processors enhanced-CYK design and features a PCI interface that allows it to be
integrated and work as an accelerator within a host machine (e.g. desktop PC).
Chapter 8
Conclusions
In this thesis we have presented an FPGA-based hardware implementation of a syntactic parsing
algorithm – an enhanced version of the CYK algorithm – that can deal with large-size real-life
context-free grammars and is adapted for word lattice parsing. In this final chapter, we first
summarize the results obtained during our research. Then, we propose some improvements to
the current implementation and outline some future research directions.
8.1 Analysis of the results
Our work started with the investigation of a 2D-array of processors implementing the CYK
algorithm presented in [8]. This approach proved to be much too demanding in terms of hardware
resources to be implemented within state-of-the-art FPGAs.
We further investigated whether a 1D-array of processors implementing the CYK algorithm
is feasible – in terms of required hardware resources and data-structures – to be
implemented within state-of-the-art FPGAs. The results of this study were presented in chapter 3,
in which a linear array of processors (LAP) architecture implementing the CYK algorithm
was proposed and tested for validation on a commercial FPGA-board. The proposed LAP ar-
chitecture uses    processors for parsing sentences with up to  words and word lattices
with up to    time stamps. The implementation of the LAP design was the first step of
our design methodology during which we proved the feasibility of an FPGA-based hardware
implementation of the CYK algorithm, both in terms of required hardware resources and size
of data-structures. The LAP design also exhibited a significant  speedup factor when com-
pared against a software implementation of the enhanced-CYK algorithm.
However, the LAP architecture efficiently exploits neither the parallelism available in the
CYK algorithm nor the available processors. We therefore investigated a more aggressive and
efficient processor allocation mechanism for further increasing the speedup factor. This new
approach for processor allocation mechanism was investigated in chapter 5 in which a dynamic
array of processors (DAP) architecture implementing the CYK algorithm was proposed. We
called the design a dynamic array of processors, due to the fact that the processors are dynam-
ically allocated to process cell-combinations in the chart. The DAP architecture has a better
average processor utilization in comparison to the LAP design and a better  to  speedup
factor while using a slightly modified processor and the same data-structures for the chart and
the grammar.
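The benefit of dynamic over static processor allocation can be illustrated with a toy scheduling model. The sketch below is not the LAP or DAP design: it ignores the data dependencies between chart cells, and the task costs, function names and round-robin static policy are all assumptions made for the illustration. It only shows why handing each task to the first free processor tends to finish earlier than a fixed assignment when task costs are uneven.

```python
import heapq

def static_makespan(costs, n_proc):
    """Finish time when task i is always given to processor i mod n_proc."""
    loads = [0] * n_proc
    for i, cost in enumerate(costs):
        loads[i % n_proc] += cost
    return max(loads)

def dynamic_makespan(costs, n_proc):
    """Finish time when each task goes to the processor that frees up first."""
    free_at = [0] * n_proc          # min-heap of times at which processors become free
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

costs = [8, 1, 1, 1, 8, 1, 1, 1]    # hypothetical cell-combination costs
print(static_makespan(costs, 2), dynamic_makespan(costs, 2))
```

With these hypothetical costs the static assignment needs 18 time units while the dynamic one needs 11, because the two expensive tasks no longer end up on the same processor.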
Next, we investigated (in chapter 6) a dynamic array of processors architecture implementing
the enhanced-CYK algorithm, called enhanced-CYK. The enhanced-CYK architecture proposes
a method called tiling that can further increase the average processor utilization and the
speedup factor. The idea of tiling was suggested by the observation that there are large
cell-combinations that require a long time to process, and that all cell-combinations that are
data-dependent on the results of these large cell-combinations cannot be processed until
these results are available. The tiling mechanism then consists of splitting expensive (i.e.
large) cell-combinations in smaller chunks that can be more efficiently processed by the pro-
cessors. The tiling mechanism can therefore be seen as a dynamic clustering of the available
processors around expensive cell-combinations. In other words, it is a way of concentrating the
processing resources when and where they are needed. The enhanced-CYK architecture yields
a  to 
 speedup factor and can parse sentences with up to  words.
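The splitting step of tiling can be sketched as follows. This is a software illustration only, assuming that a cell-combination can be described by a number of work items and that tiles have a fixed maximum size; the function name and both parameters are hypothetical and do not reflect the hardware tile dispatcher.

```python
def split_into_tiles(n_items, tile_size):
    """Split the index range 0..n_items-1 into consecutive tiles.

    Each tile is a range of at most tile_size indices; the last tile may be shorter.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    return [
        range(start, min(start + tile_size, n_items))
        for start in range(0, n_items, tile_size)
    ]

# A hypothetical large cell-combination with 10 work items, tiles of 4 items:
tiles = split_into_tiles(10, 4)
print([list(t) for t in tiles])
```

Each tile can then be handed to a different free processor, which is the sense in which tiling clusters processors around an expensive cell-combination.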
In brief, the main results achieved in this thesis are:
 a processor allocation method called dynamic processor allocation that better exploits
the parallelism available in the (enhanced-)CYK algorithm(s). The dynamic processor
allocation method allows the design to dynamically assign the processors to compute
cells in the chart, which was not the case in the LAP design that statically assigns the
processors to cells in the chart;
 a mechanism called tiling that allows several processors to work on the same (large) cell-
combination. This mechanism is useful for small sentence lengths when in general not
all the processors are busy during parsing or for large sentence lengths when filling the
cells towards the top of the chart;
 an FPGA-based hardware implementation of the enhanced-CYK algorithm that inte-
grates the dynamic processor allocation method and the tiling mechanism. The enhanced-
CYK design has the following features: it parses word lattices, can deal with large-size
context-free grammars and can parse long sentences (up to  words or time-stamps).
These features, along with the achieved speedup factor, render the hardware implementa-
tion well suited for being integrated in NLP applications that have strong data-size and/or
real-time constraints or within a speech recognition application framework;
 an accelerator FPGA-board that implements the enhanced-CYK design is proposed in
chapter 7. The proposed FPGA-board contains all the necessary hardware resources for
implementing a -processors enhanced-CYK design and features a PCI interface that
allows it to be integrated and work as an accelerator within an application framework as
illustrated in figure 1.2 on page 8.
An important feature of the proposed enhanced-CYK design is its ability to be configured (as
an IP core) – by means of generics in the VHDL code – according to the CFG considered.
The synthesis will therefore produce a design with optimally sized resources (e.g. registers,
counters) for the particular CFG considered. The modularity of the VHDL code also allows the
enhanced-CYK design to be targeted on "any" FPGA-board that contains a Virtex-E family
FPGA and enough memory resources for storing the chart and the grammar data-structures.
This modularity allows us to use a 
-bit, -bit or -bit databus with the grammar memories
or even a -bit, -bit or -bit databus. The same remark applies to the databus with the chart
memory. Moreover, a mixed databus environment is also allowed. For instance, some grammar
memories may use a -bit databus while the others use an 
-bit databus.
8.2 Future work
Several improvements can be added to the current enhanced-CYK design. Among these, we
mention (1) a faster tile dispatching mechanism, (2) an adaptive tile sizing that will take into
account the size of each individual cell-combination and the number of processors available
in the system and (3) a time-efficient mechanism for compressing/decompressing the compact
parse forest before shipping it to the host machine. The integration of the FPGA-board within
an application framework will also be required for testing the utility of such a tool.
Another research direction – which is also interesting as a standalone research topic – is to
investigate the possibility of using FPGA internal resources for implementing (i.e. representing)
the CFGs. Since certain CFGs may require fewer resources (e.g. combinatorial logic) to be
implemented than others, an interesting investigation would be to seek a particularly
synthesis-friendly CFG form.
Finally, to fully exploit the FPGA-board that will result from this research, a software library
will need to be written to implement the communication interface between the host machine
and the FPGA-board.
The FPGA-board will then make it possible to perform tests and performance measurements very
efficiently compared to the time required to carry out the VHDL simulations (usually
about 1 week). The fast feedback will make it possible to explore several design variations and
occasionally to perform exhaustive analyses on the tested designs.
Appendix A
Transformation steps towards CNF
grammars
A.1 SlpToolKit organisation of SUSANNE grammar
The SUSANNE grammar we used is extracted from the SUSANNE corpus. Due to the fact
that this grammar is used by SlpToolKit it has some particularities that we briefly describe in
the following lines. When transforming the SUSANNE grammar in its equivalent Chomsky
Normal Form (CNF), special attention should be given to these particularities.
We saw that a CFG is defined as     	. In the context of NLP the terminals ()
are the words used in the language generated by grammar G. From the definition of the CFG
we can see that the grammar can contain words in the rules. However, grammars of this kind
– called lexicalized grammars – are not often used when dealing with natural languages, for
practical reasons. This is also the case for the SUSANNE grammar, which is not a lexicalized
grammar.
SlpToolKit uses two files for storing a grammar, in particular the SUSANNE grammar. The
first file, called the lexicon file, stores all (and only) the rules such as    (called lexical
rules) where  is a terminal (word). The second file stores all the other remaining grammar
rules and is called, for convenience, the grammar file. Both the lexicon and the grammar files
are text files.
Another particularity of the grammars used by SlpToolKit is a category of non-terminals,
called pre-terminals, used to represent sets of words in the lexicon. For example, in the grammar
rule   , where  is a word, the left-hand side  is a pre-terminal. The pre-terminals are
associated, in general, with morpho-syntactic categories of the lexicon (e.g. noun, verb). The
pre-terminals use a special notation :nnn (e.g. :1415, :19, ...). For more details regarding these
files see the documentation of SlpToolKit1.
Every line in the grammar file is a grammar rule. The SUSANNE grammar file 2, used by
SlpToolKit has the following format:
.
.
.
    
   
.
.
.
1 available on request from {chaps, webmaster}@lia.di.epfl.ch
2 the grammars considered in this document are assumed to be non-probabilistic
Every line in the lexicon file stores a word and an associated pre-terminal (without ’:’)
separated by ’|=|’. A given word may be associated with several pre-terminals (e.g. the word ’can’ is
a noun and a verb). In this case it occurs in several lines of the lexicon file. The SUSANNE
lexicon file used by SlpToolKit has the following format:
.
.
.
!12  
!12  	
  

  


  	
.
.
.
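Reading a lexicon file in the format described above (one `word|=|pre-terminal` pair per line, pre-terminal stored without the leading ’:’) can be sketched in Python as follows. The function name is hypothetical, and the sample entries in the usage lines are made up for illustration; they are not taken from the SUSANNE lexicon.

```python
from collections import defaultdict

SEPARATOR = "|=|"

def parse_lexicon(lines):
    """Map each word to the set of its pre-terminals, written as ':nnn'."""
    lexicon = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        # Split on the last separator, so the word part is kept whole.
        word, sep, code = line.rpartition(SEPARATOR)
        if not sep or not word or not code:
            raise ValueError(f"malformed lexicon line: {line!r}")
        lexicon[word].add(":" + code)     # restore the ':' prefix used in the grammar
    return dict(lexicon)

# Made-up entries: the same word listed with two pre-terminals.
sample = ["can|=|1415", "can|=|19", "tree|=|1415"]
print(parse_lexicon(sample))
```

A word that appears on several lines simply accumulates several pre-terminals in its set, which mirrors the one-line-per-association convention of the file.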
A.2 Transforming the general SUSANNE CFG to its CNF.
During the extraction of the SUSANNE grammar from the SUSANNE corpus, no ε-rules (i.e.
rules whose right-hand side is the empty string ε) are introduced. We also assume that no useless
rules are introduced.
With these assumptions, in order to transform a CFG to an equivalent CNF one should
apply the following transformation steps:
 step 1: eliminate the unitary rules (i.e.    )
 step 2: tranform all rules with more than 2 non-terminals in the right-hand-side, in rules
of the form    
The grammar and lexicon files that result after each transformation step are illustrated in
figure A.1. These transformation steps are detailed below.
[Figure A.1: flow diagram. G + lexicon → (step 1.a) → G1 + lexicon → (step 1.b) → G2 + lexicon1
→ (step 2) → GC + lexicon1. Annotations in the figure: "used in the enhanced-CYK design" and
"used in the LAP and DAP designs".]
Figure A.1: Transformation steps for the SUSANNE grammar for obtaining the equivalent
Chomsky Normal Form.
step 1.a: The unitary rules are eliminated using for instance the algorithm given in [20].
After this transformation the original grammar file G becomes the grammar file G1. The lexicon
does not change.
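For illustration, unitary rules can be eliminated with the standard textbook closure procedure sketched below. This is not necessarily the algorithm of [20]; the rule representation (a left-hand side paired with a tuple of right-hand-side symbols) and the function name are assumptions of this sketch.

```python
def eliminate_unit_rules(rules, nonterminals):
    """Return an equivalent rule set without unitary rules.

    rules: iterable of (lhs, rhs) with rhs a tuple of symbols.
    A unitary rule has a right-hand side made of a single non-terminal.
    """
    rules = list(rules)

    def is_unit(rhs):
        return len(rhs) == 1 and rhs[0] in nonterminals

    # reach[a] = non-terminals reachable from a through unitary rules only
    reach = {a: {a} for a in nonterminals}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if is_unit(rhs):
                for a in nonterminals:
                    if lhs in reach[a] and rhs[0] not in reach[a]:
                        reach[a].add(rhs[0])
                        changed = True

    # Copy every non-unitary rule of a reachable non-terminal up to a.
    result = set()
    for a in nonterminals:
        for lhs, rhs in rules:
            if lhs in reach[a] and not is_unit(rhs):
                result.add((a, rhs))
    return result
```

Applied to a grammar with rules S → A, A → B C, A → D and D → E F, the procedure yields S → B C, S → E F, A → B C, A → E F and D → E F, which shows how removing a few unitary rules adds new non-unitary ones.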
step 1.b: A particularity of the SUSANNE grammar is that it contains rules such as  
  that cannot be eliminated with the same algorithm by treating them as unitary rules
because they will make the number of grammar rules grow too much (i.e. explode). In a test we
made on an algorithm that treats the rules of type     as unitary rules we got as result
for the transformed grammar a number of more than    new rules3! This number of
3the number of rules in the original grammar is  
rules is not reasonable, so we should find another way to eliminate these rules. A
solution to eliminate rules like    is to modify both the lexicon and
the grammar files as described below:
1. code  with   (i.e.  becomes a new ’pre-terminal’) and replace all occurences of
 in the grammar with its code   4
2. for every word in the lexicon like:
word|=|





or word|=|





or . . . we introduce only once a new line:
word|=|
After these transformations we can remove the rules    . The
obtained grammar file G2 plus lexicon file lexicon1 (see figure A.1) is weakly equivalent
to the previous grammar file G1 plus lexicon file lexicon. The drawback of this
procedure is the growth in size of the lexicon file from   lines to   lines in the
lexicon1 file, which is however reasonable.
optional step: At this point we may consider eliminating the useless rules in the grammar
G2. The useless rules are those rules that never take part in a derivation. This step is however
not necessary and may be skipped. An algorithm can be found in [20].
step 2 : In order to transform the grammar G2 into an equivalent CNF grammar file GC, the
algorithm below is applied:
 1: cnt=0;
 2: remove from G2 and put in the output grammar GC the rules that are CNF
 3: if G2 is empty (no rules left) then STOP
 4: build a table with all the distinct pairs of non-terminals and their frequency that occur
in the right hand-side of the remaining rules in G2
 5 : take the most frequent of these pairs and replace it everywhere in G2 with a new
non-terminal. This new non-terminal has to be unique and we build the symbol as A_cnt
(e.g. A_1, A_2, . . . ).
Increment cnt
Go to 2
By replacing the most frequent pair in step 5 of the algorithm above we obtain a greedy algorithm
that tries to reduce the size of the grammar as much as possible. This algorithm is not optimal,
in the sense that it does not necessarily produce the smallest possible grammar GC. The lexicon
does not change during this transformation step.
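The greedy algorithm above can be sketched in Python as follows. This is a reconstruction under stated assumptions, not the original program: rules are represented as a left-hand side paired with a tuple of non-terminals, every input rule is assumed to have at least two symbols on its right-hand side (unitary rules were removed in step 1), the new symbols are numbered from A_1, and the rule defining each new non-terminal is added to GC explicitly.

```python
from collections import Counter

def replace_pair(rhs, pair, new_symbol):
    """Replace non-overlapping occurrences of pair in rhs, scanning left to right."""
    out, i = [], 0
    while i < len(rhs):
        if i + 1 < len(rhs) and (rhs[i], rhs[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)

def to_cnf(rules):
    """Greedy binarization: repeatedly factor out the most frequent adjacent pair."""
    remaining = [(lhs, tuple(rhs)) for lhs, rhs in rules]
    if any(len(rhs) < 2 for _, rhs in remaining):
        raise ValueError("every rule must have at least 2 right-hand-side symbols")
    gc, cnt = [], 0
    while True:
        # Move the rules that are already in CNF to the output grammar.
        gc.extend(rule for rule in remaining if len(rule[1]) == 2)
        remaining = [rule for rule in remaining if len(rule[1]) != 2]
        if not remaining:
            return gc
        # Count the adjacent pairs in the remaining rules.
        pairs = Counter()
        for _, rhs in remaining:
            pairs.update(zip(rhs, rhs[1:]))
        pair, _ = pairs.most_common(1)[0]
        cnt += 1
        new_symbol = f"A_{cnt}"          # assumed not to clash with existing symbols
        gc.append((new_symbol, pair))    # the rule defining the new non-terminal
        remaining = [(lhs, replace_pair(rhs, pair, new_symbol))
                     for lhs, rhs in remaining]
```

Every pass shortens at least one remaining rule, so the loop terminates. Since each pass introduces exactly one new non-terminal and one new rule, the number of added rules equals the number of added non-terminals.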
A.3 Transformation steps detailed.
All the steps described below are discussed in the context of the particular grammar format
used for representing a CFG for the SlpToolKit. The processing steps are:
4 this may result in the occurrence of ’pre-terminal’ symbols on the left-hand side of the new
grammar rules, and therefore the term ’pre-terminal’ loses its meaning.
 step 1.a (elimination of unitary rules): The program that makes the first transformation
takes as input the SUSANNE grammar G. Some characteristics of G are given below:
# of rules = 17,669
# of non-terminals = 1,920
# of unitary rules = 206 (to be eliminated in this step)
# of non unitary rules = 17,463
During the transformation a number of   new rules are added. The output of this
program is the transformed grammar G1 that does not contain unitary rules. Some char-
acteristics of the grammar G1 are:
# of ,	(  	   	      
# of non-terminals = 1,920
 step 1.b (elimination of rules like   





  





   ): The program that makes
this transformation takes as input the grammar G1 resulting from the previous step and the
lexicon. The output is a text file G2 containing the transformed grammar that does
not contain rules like    and a new lexicon file lexicon1.
Some characteristics of the grammar G2 are:
# of rules = 66,123
# of non-terminals = 1,912
# of Chomsky Normal Form rules = 8,587
# of non Chomsky Normal Form rules = 66,123 - 8,587 = 57,536
The lexicon1 contains 
 	 lines in comparison to only 
 	 in the lexicon.
 step 2 (rules transformation) : The program that makes this transformation takes as input
the grammar file G2. The output is a grammar file GC that contains only CNF rules. The
lexicon1 does not require changes.
Some characteristics of the grammar GC are:
# of ,	(  	      
 	
# of non-terminals = 10,129 = 1,912 + 8,217
A number of 8,217 new non-terminals (of the form A_cnt, see section A.2) were introduced in this
transformation step and of course the same number of rules.
All the algorithms implemented for transforming the SUSANNE grammar into a CNF grammar
were written in the Python language.
A.4 Creating and validating the memory image
Three kinds of elements can be distinguished in the data-structure used for representing the
CNF grammar in section 3.4.2, namely (1) the non-terminal, (2) the pointer, and (3) the right
hand-side code. Due to constraints imposed by this data-structure, which are:
parameter size (bits)
_ 
_ 
(_ 
Table A.1: Grammar data-structure parameter sizes for the SUSANNE CNF grammar (GC).
 an entry in level 1 is  bytes ( bits);
 an entry in level 2 is , 
 or  bytes (maximum  bits);
 an entry in level 3 is  bytes ( bits) for the header and  bytes ( bits) for a pair of
non-terminals;
a maximum number of  bits can be used for representing a non-terminal (the restriction
actually comes from the chart memory representation, see section 3.4.1),  bits for a pointer,
and  bits for a right hand-side code. The CNF grammar data-structure is aligned to a -byte
boundary.
These maximal parameter sizes can accommodate (most) real-life CNF grammars. A CNF
grammar that requires a larger size for any of these parameters cannot be accommodated within
the proposed data-structure. The actual sizes for the non-terminal, pointer and right hand-side
code depend on the considered CNF grammar. We will denote henceforth by _ the size
of the non-terminal, by _ the size of the pointer and by (_ the size of the right
hand-side code.
The value of the _ parameter is not known prior to memory image creation and the
maximum value (i.e.  bits) for this parameter is assumed. The _ and (_ pa-
rameters can be easily evaluated. We used for this purpose the Python programming language,
that can easily handle grammars stored in text files (see section A.1). Once a first estimate for
the memory image size is obtained, the real value for the _ parameter is obtained and
the pointers in the data-structure can be correctly (re)aligned according to the new value. The
created CNF grammar memory image is stored in a dump file in a big-endian format.
In particular for the SUSANNE CNF grammar the size of the CNF grammar memory im-
age is 
 	 bytes and the values for the _, (_ and _ are given in
table A.1. These values are used to configure the VHDL code (i.e. using generics) to generate
the right register sizes, counter sizes, comparator sizes and buses sizes that match the character-
istics of the used CNF grammar. In order to validate the created CNF grammar memory image,
a program that can extract the CNF grammar from the dump file was also written. The extracted
CNF grammar was compared to the original CNF grammar for validation.
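The sizing and validation ideas above can be sketched in Python. This is only an illustration under stated assumptions: the real memory image layout is the one defined in section 3.4.2 and is not reproduced here, and the helper names `bits_needed`, `dump_big_endian` and `load_big_endian` are hypothetical. The sketch assumes that items are coded 0 .. count-1 and that every value is written with a fixed byte width.

```python
def bits_needed(count):
    """Bits required to give `count` distinct items the codes 0 .. count-1."""
    return max(1, (count - 1).bit_length())


def dump_big_endian(codes, width_bytes):
    """Serialize integer codes as fixed-width big-endian words."""
    return b"".join(code.to_bytes(width_bytes, "big") for code in codes)


def load_big_endian(blob, width_bytes):
    """Inverse of dump_big_endian: recover the integer codes from the dump."""
    return [int.from_bytes(blob[i:i + width_bytes], "big")
            for i in range(0, len(blob), width_bytes)]


if __name__ == "__main__":
    # 10,129 is the number of non-terminals reported for the grammar GC.
    print(bits_needed(10129))
    codes = [1, 258, 10128]
    blob = dump_big_endian(codes, 2)
    # Validation by round trip: extract from the dump and compare to the original.
    print(load_big_endian(blob, 2) == codes)
```

The round trip mirrors the validation described above: whatever is extracted from the dump must equal what was written into it.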
Appendix B
Facts about the (enhanced)-CYK
algorithm
B.1 The number of cell combinations for an n word sentence
A cell of the chart is filled by combining several pairs of already filled cells (precisely   
cell pairs for a cell in row ). The operation of combining a pair of cells (i.e. cell-combination)
is the basic operation performed by a processor and we assume that a processor performs one
such basic operation in a time-step. For a sentence with n words, the recognition/parsing result
is available after a certain number N of cell-combinations has been performed. The number N
can be computed as follows:
\[ N = (n-1) \cdot 1 + (n-2) \cdot 2 + \dots + 1 \cdot (n-1) = \sum_{i=1}^{n-1} (n-i)\, i \]
The first term in the sum represents the number of cell-combinations performed for filling the
first row of the CYK chart (n - 1 times 1 cell-combination), the second term is the number of
cell-combinations performed for filling the second row (n - 2 times 2 cell-combinations) and
so on.
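As a quick check of this count, the following Python sketch sums the per-row contributions described above (n - 1 times 1, n - 2 times 2, and so on) and compares the result with the closed form n(n - 1)(n + 1)/6. The closed form follows from the standard sums of i and i^2; it is added here for illustration and is not taken from the text. The function names are hypothetical.

```python
def cell_combinations(n):
    """Sum of (n - i) * i for i = 1 .. n-1: cell-combinations for an n word sentence."""
    return sum((n - i) * i for i in range(1, n))


def cell_combinations_closed_form(n):
    """Closed form of the same sum, n(n - 1)(n + 1) / 6 (always an integer)."""
    return n * (n - 1) * (n + 1) // 6


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10):
        print(n, cell_combinations(n), cell_combinations_closed_form(n))
```

The number of cell-combinations therefore grows cubically with the sentence length.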
B.2 The dynamic processor allocation method
The linear array of processors (LAP) architecture presented in chapter 3 has a low processor
utilization due to the method used for allocating cell-combinations to processors. The
allocation of cell-combinations to processors within the LAP design is intrinsic to the design,
which processes the chart cells in a row-by-row manner and has the advantage of taking into
account, without requiring additional hardware, the data-dependencies among the cells.
In order to increase the processor utilization we introduce a method called dynamic processor
allocation. A hardware architecture using the dynamic processor allocation method would
be able to accommodate any number of processors and can be seen as a general case for the 2D-
array of processors presented in [8] if enough processors are available. The method works as
follows: at each time step allocate as many processors as possible (given the data-dependency
restrictions) for performing cell-combinations in the chart. If a number p of processors is
available, ideally at each time step all p processors will be allocated to cell-combinations.
However, this will not always be possible due to data-dependency constraints in the chart. On
the other hand, in practice, a design based on the dynamic processor allocation method may
suffer from the delays introduced by cell-combination dispatching. These delays are a direct
consequence of the fact that the cell-combination dispatcher has to take into account dynamic
data-dependency constraints, which is not an easy task and may therefore introduce some
significant overhead.
By assuming that the overhead introduced for dispatching cell-combinations is insignificant,
the dynamic processor allocation method can be used for computing the minimum number
of processors required for parsing a sentence of length n within n - 1 time-steps, as if a
2D-array of processors were used. Given the sentence length n, the procedure implementing the
dynamic processor allocation method used to compute the minimum number of processors p is
given below:

Algorithm 3 Minimum number of processors p required for filling the chart in linear time,
for a given sentence length n, when using the dynamic processor allocation method
  done <- false
  p <- initial (tight) estimate of the number of processors
  while (not done) do
    empty the chart; t <- 0
    while ((chart not filled) and (t < n - 1)) do
      allocate at most p processors to cell-combinations, if possible
      t <- t + 1
    end while
    if (chart not filled) then
      p <- p + 1
    else
      done <- true
    end if
  end while
  return p

The above algorithm starts with a tight assumption on the number of processors p required
to parse (i.e. fill the chart) a sentence of length n within n - 1 time-steps.
Then, it tries to fill the chart within n - 1 time-steps using p processors, while taking into
account the data-dependencies among cells during chart filling. If more than n - 1 steps are
required for filling the chart, the number of processors p is increased and the procedure retries
to fill the chart. The chart is emptied each time a new filling is retried. If the chart is filled
within n - 1 time-steps, the returned value of p represents the minimum number of processors
required for parsing a sentence of length n within n - 1 time-steps. For the sentence lengths n
listed in table B.1, the number of processors computed with the procedure above
is compared against the number of processors required in a 2D-array
of processors, given by the formula n(n - 1)/2. The number of processors computed with the
given procedure corresponds to the worst-case assumption when all the cells in the chart are
non-empty. However, this is not the case for real-life CFGs, where quite a significant number
of cells in the chart may be empty. A direct consequence is that for real-life CFGs even fewer
processors are actually required for parsing a sentence of length n within n - 1 time-steps than
computed for the worst-case assumption.
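The procedure can be simulated with a short Python sketch. It is a reconstruction under explicit assumptions, not the program used in this work: the deadline is taken to be n - 1 time-steps, the search starts from one processor, several processors may work on the same destination cell in one time-step, and ready cell-combinations are served lowest row first. The names `fills_in_time` and `min_processors` are hypothetical.

```python
def fills_in_time(n, p, max_steps):
    """Greedy worst-case chart filling; True if every cell-combination fits in max_steps."""
    # A cell is identified by (start, length); length-1 cells hold the words.
    left_to_do = {(s, l): l - 1
                  for l in range(1, n + 1) for s in range(n - l + 1)}
    # One entry per cell-combination: (length, start, split point), lowest rows first.
    pending = [(l, s, k) for l in range(2, n + 1)
               for s in range(n - l + 1) for k in range(1, l)]
    for _ in range(max_steps):
        if not pending:
            break
        # Cells completely filled in earlier time-steps.
        filled = {cell for cell, todo in left_to_do.items() if todo == 0}
        # A combination is ready when both of its source cells are filled.
        ready = [(l, s, k) for (l, s, k) in pending
                 if (s, k) in filled and (s + k, l - k) in filled]
        chosen = set(ready[:p])          # allocate at most p processors
        for l, s, k in chosen:
            left_to_do[(s, l)] -= 1
        pending = [c for c in pending if c not in chosen]
    return not pending


def min_processors(n):
    """Smallest p for which the chart of an n word sentence fills in n - 1 steps."""
    p = 1                                # assumed starting point of the search
    while not fills_in_time(n, p, n - 1):
        p += 1
    return p


if __name__ == "__main__":
    for n in range(2, 9):
        print(n, min_processors(n), n * (n - 1) // 2)
```

Under these assumptions the sketch prints, for each small n, the simulated minimum next to the 2D-array size n(n - 1)/2. The plain search is slow for long sentences, so the example is limited to small n.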
The number of processors actually required when parsing real-life sentences (extracted
from the SUSANNE corpus) with the CYK and the enhanced-CYK algorithms, respectively, is
measured in the following sections.
n # processors n # processors n # processors n # processors
– – 17 62 (136) 33 228 (528) 49 498 (1176)
2 1 (1) 18 69 (153) 34 242 (561) 50 518 (1225)
3 2 (3) 19 77 (171) 35 256 (595) 51 539 (1275)
4 4 (6) 20 85 (190) 36 270 (630) 52 560 (1326)
5 6 (10) 21 94 (210) 37 286 (666) 53 581 (1378)
6 9 (15) 22 103 (231) 38 301 (703) 54 603 (1431)
7 12 (21) 23 112 (253) 39 317 (741) 55 626 (1485)
8 15 (28) 24 122 (276) 40 333 (780) 56 649 (1540)
9 18 (36) 25 132 (300) 41 350 (820) 57 672 (1596)
10 22 (45) 26 142 (325) 42 367 (861) 58 695 (1653)
11 27 (55) 27 153 (351) 43 384 (903) 59 719 (1711)
12 32 (66) 28 165 (378) 44 402 (946) 60 744 (1770)
13 37 (78) 29 177 (406) 45 420 (990) 61 768 (1830)
14 43 (91) 30 189 (435) 46 439 (1035) 62 794 (1891)
15 49 (105) 31 202 (465) 47 458 (1081) 63 819 (1953)
16 55 (120) 32 215 (496) 48 478 (1128) 64 845 (2016)
Table B.1: The worst-case number of processors required for filling the chart within n - 1
time-steps, given the sentence length n, when using the dynamic allocation method. The number
in parentheses is the number of processors required by a 2D-array.
In the following two sections we will compare the theoretical (i.e. worst-case) number of
processors required for filling the chart within n - 1 time-steps, where n is the length of the
sentence (as given in table B.1), to the number of processors that are actually required when
parsing with a concrete real-life CFG. The real-life CFG used in our case-studies is the
SUSANNE grammar. For the purpose of this comparison a number of  sentences, consisting
of all the sentences with lengths between  and  from the SUSANNE corpus, were parsed
with the CYK algorithm and with the enhanced-CYK algorithm, respectively. For these
algorithms, two values are computed for the sentences with the same length n: (1) the maximal
number of processors ever needed during a time-step for parsing the sentences and (2) the
average number of processors during all time-steps for parsing the sentences.
B.2.1 Case-study on the CYK algorithm
As the CYK algorithm requires a grammar written in CNF, the SUSANNE grammar was first
transformed into an equivalent CNF grammar. The CNF equivalent of the SUSANNE grammar
consists of 	  grammar rules and   distinct non-terminals. When using the CYK
algorithm every cell in the chart contains a set of non-terminals that can derive the subsentence
associated with that cell.
As we know, a processor combines the sets in two source cells and stores the result in the
set of a destination cell. For a given sentence, the set sizes as well as their distribution within
the chart during the parsing depends on the used grammar. Whenever one of the sets in the
source cells is empty the processor does not have any work to do being therefore unnecessary
during that time-step. Figure B.1 depicts for each sentence length  (varying between  and )
the worst-case required number of processors along with the real number of processors actually
required (both the maximal and the average number). The same values illustrated in figure B.1
[Figure B.1 plot: # processors versus sentence length [words]; curves: theoretical worst-case, maximum real, average real]
Figure B.1: The theoretical (worst-case) and real (average and maximal) number of processors
required to parse sentences of length     , extracted from the SUSANNE corpus, when
using the CYK algorithm in a case-study on the SUSANNE grammar. A number of at least 
sentences were parsed for each value of .
are tabulated in table B.2. The average number of processors required for parsing any sentence
with length  ranging from  to  within    time-steps is 
 and is significantly less than
the number of processors required in the worst-case assumption.
B.2.2 Case-study on the enhanced-CYK algorithm
The grammar used in this experiment is a version of the SUSANNE grammar from which we
have eliminated all unitary rules (i.e. rules of the form    ). The new grammar contains a
number of   grammar rules and   distinct non-terminals.
When using the enhanced-CYK algorithm every cell in the chart contains two sets of items.
One set contains non-terminals (the N1 sets) and the second set (the N2 sets) contains partial
derivations that can derive the subsentence associated with that cell. A processor combines the
set  of the first source cell with the set  of the second source cell and stores the result of
this combination in the set  and/or  of a destination cell. Whenever one of the sets in the
source cells (either  for the first or  for the second) are empty the processor does not have
any work to do, being therefore unnecessary during that time-step. Figure B.2 depicts for each
sentence length  (varying between  and ) the worst-case required number of processors
along with the real number of processors actually required (both the maximal and the average
number). The same values illustrated in figure B.2 are tabulated in table B.3. The average
number of processors (i.e. 
) required for parsing any sentence with length  ranging from 
to  within   time-steps is significantly less (actually even smaller when compared to the
CYK algorithm) than the number of processors required in the worst-case assumption.
n worst-case real (average/maximal) n worst-case real (average/maximal)
- - - 17 62 15 / 57
2 1 1 / 1 18 69 17 / 59
3 2 2 / 2 19 77 17 / 65
4 4 3 / 4 20 85 18 / 73
5 6 4 / 6 21 94 20 / 77
6 9 5 / 9 22 103 19 / 78
7 12 5 / 12 23 112 22 / 80
8 15 6 / 15 24 122 21 / 97
9 18 7 / 18 25 132 20 / 115
10 22 9 / 22 26 142 21 / 87
11 27 9 / 26 27 153 25 / 107
12 32 11 / 32 28 165 26 / 105
13 37 11 / 34 29 177 27 / 114
14 43 13 / 41 30 189 24 / 116
15 49 13 / 44 31 202 25 / 114
16 55 14 / 45 32 215 28 / 124
Table B.2: The values for the theoretical (worst-case) and real (average and maximal) number
of processors as illustrated in figure B.1.
[Figure B.2 plot: # processors versus sentence length [words]; curves: theoretical worst-case, maximum real, average real]
Figure B.2: The theoretical (worst-case) and real (average and maximal) number of processors
required to parse sentences of length     , extracted from the SUSANNE corpus, when
using the enhanced-CYK algorithm in a case-study on the SUSANNE grammar. A number of
at least  sentences were parsed for each value of .
n worst-case real (average/maximal) n worst-case real (average/maximal)
- - - 17 62 11 / 55
2 1 1 / 1 18 69 12 / 54
3 2 2 / 2 19 77 12 / 56
4 4 3 / 4 20 85 12 / 60
5 6 3 / 6 21 94 14 / 61
6 9 4 / 9 22 103 13 / 79
7 12 5 / 12 23 112 16 / 78
8 15 5 / 15 24 122 14 / 71
9 18 6 / 18 25 132 13 / 58
10 22 7 / 22 26 142 14 / 64
11 27 7 / 22 27 153 16 / 102
12 32 8 / 29 28 165 17 / 87
13 37 8 / 33 29 177 17 / 82
14 43 9 / 37 30 189 15 / 88
15 49 10 / 41 31 202 16 / 83
16 55 10 / 41 32 215 18 / 93
Table B.3: The values for the theoretical (worst-case) and real (average and maximal) number
of processors as illustrated in figure B.2.
B.3 The chart cell size for the CYK algorithm
In this section the chart cell size (i.e. the number of non-terminals) for the CYK algorithm is
profiled in the real-life case of the CNF SUSANNE grammar. More precisely, the chart cell
size distribution is investigated. Such profiling is useful for choosing the right chart
memory size: it allows allocating to each chart cell less memory than would be needed to hold
all the distinct non-terminals of the grammar, an allocation that would waste memory space.
For the purpose of these measurements, a number of   sentences, with lengths between
 to  words, extracted from the SUSANNE corpus were parsed. For these sentences the
number of occurrences of each cell size is counted and used to compute the overall cell size
distribution. Figure B.3 depicts the frequency with which each cell size occurs within the
charts.
The figure shows that most of the cells have a size between  and  non-terminals and
that only few cells have a size larger than 
 (exactly 	%). This means that, if a cell
size of 
 is used, about 	% sentences will fail to parse. The maximum cell size is 
and therefore, a cell size of  is sufficient for parsing any sentence from the SUSANNE
corpus. This number is significantly smaller than the number   of distinct non-terminals
in the CNF SUSANNE grammar. A non-terminal is represented on  bytes which means that
 bytes are used for storing a chart cell. It also turns out that on average  % of the overall
(over all the parsed sentences) number of cells are empty.
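The kind of profiling described in this section can be sketched as follows. This is an illustrative Python fragment, not the measurement code used in this work: the function name `cell_size_profile` is hypothetical and the cell sizes in the example are made-up numbers, not SUSANNE measurements. Given the sizes of all the cells observed while parsing, it computes the size histogram, the fraction of empty cells, the fraction of cells that would not fit in a chosen cell capacity, and the maximum cell size.

```python
from collections import Counter


def cell_size_profile(cell_sizes, capacity):
    """Profile observed chart cell sizes against a candidate cell capacity."""
    histogram = Counter(cell_sizes)              # size -> number of occurrences
    total = len(cell_sizes)
    empty_fraction = histogram[0] / total
    overflow_fraction = sum(count for size, count in histogram.items()
                            if size > capacity) / total
    return histogram, empty_fraction, overflow_fraction, max(cell_sizes)


if __name__ == "__main__":
    # Made-up cell sizes, for illustration only.
    sizes = [0, 0, 3, 5, 12, 0, 7, 20]
    hist, empty, overflow, largest = cell_size_profile(sizes, capacity=10)
    print(sorted(hist.items()), empty, overflow, largest)
```

Choosing the capacity equal to the maximum observed size makes the overflow fraction zero, which corresponds to the cell size that is sufficient for the whole corpus.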
[Figure B.3 plot: occurrence frequency (x 10^4) versus cell size [# of non-terminals]]
Figure B.3: Cell size distribution for the CYK algorithm in the particular case of the CNF
SUSANNE grammar. Computed for a number of   sentences with lengths between  to
 words, extracted from the SUSANNE corpus.
B.4 The chart cell size for the enhanced-CYK algorithm
In this section the chart cell size (i.e. the number of non-terminals and the number of partial
rule right-hand sides) for the enhanced-CYK algorithm is profiled in the real-life case of the
SUSANNE grammar without unitary rules. More precisely, the N1 set size and the N2 set size
distributions are investigated. Such profiling is useful for choosing the right chart memory
size so as not to waste memory space.
For the purpose of these measurements, a number of   sentences, with lengths between
 to  words, extracted from the SUSANNE corpus were parsed. For these sentences the
number of occurrences of each N1 set size (and of each N2 set size, respectively) is counted
and used to compute the overall N1 set size (and N2 set size, respectively) distribution.
Figure B.4(a) depicts the frequency with which each N1 set size occurs within the charts and
figure B.4(b) the frequency with which each N2 set size occurs within the charts.
Figure B.4(a) shows that most of the N1 sets have a size between  and  non-terminals
and that only a few N1 sets have a size larger than  (exactly  %). This means
that, if an N1 set size of  is used, about  % of the sentences will fail to parse. The maximum
size of the N1 sets is  and therefore, a set size of  is sufficient for parsing any sentence
from the SUSANNE corpus. This number is significantly smaller than the number   of
distinct non-terminals in the SUSANNE grammar without unitary rules. A non-terminal is
represented on  bytes which means that   bytes are used for storing an N1 set in a chart
cell. It also turns out that on average  % of the overall (over all the parsed sentences)
number of N1 sets are empty.
On the other hand, figure B.4(b) shows that most of the N2 sets have a maximum size
of  partial rule right-hand sides. Therefore, if an N2 set size of  is used any sentence
from the SUSANNE corpus can be parsed.
This number is significantly smaller than the number   of distinct partial rule right-hand
sides in the SUSANNE grammar without unitary rules. A partial rule right-hand side is
represented on  bytes which means that  bytes are used for storing an N2 set in a chart cell.
It also turns out that on average  % of the overall (over all the parsed sentences) number of
N2 sets are empty.
In conclusion, if a set size of  is used, about  % of the sentences will fail to parse. If a set
size of  is used any sentence from the SUSANNE corpus can be parsed.
Note: using the same set size for the N1 sets and for the N2 sets is important in order to (1)
simplify the retrieval of the sets in the chart memory and (2) to simplify the hardware required
for processing the sets.
[Figure B.4 plots: (a) occurrence frequency (x 10^4) versus N1 set size [# of non-terminals]; (b) occurrence frequency (x 10^4) versus N2 set size [# of partial rule right-hand sides]]
Figure B.4: (a) N1 set size distribution and (b) N2 set size distribution for the enhanced-CYK
algorithm in the particular case of the SUSANNE grammar without unitary rules. Computed
for a number of   sentences with lengths between  to  words, extracted from the
SUSANNE corpus.
Appendix C
FPGA expansion board schematics
Address Range Device Chip Select
 
9 
PCI interface board, flash memory  KBytes 
 
 
unused -
 
 
FIFO memory (DMA transfers) 
 
9 
unused -
 
 
unused -
6 
6 
chart memories 1 & 2 +
7 
7 
Local bus-to-PCI bus, Memory Mapped -
 
 
Local bus-to-PCI bus, I/O Mapped -
: 
: 
IOP 480 Internal Config. Registers -
+ 
+ 
FPGA configuration, FPGA general-purpose user
registers, swapping the chart memories and for
programming the programmable clock
:
 +
 
IOP 480 Internal UART -
 
+ 
unused -
 
 
SDRAM 32MBytes -
Table C.1: FPGA-board Memory Map (see [23] for more details)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
1    CCLK         C31   FPGACFG_CCLK      26   IO0_9        F17   FPGA_DATA9
2    DONE         AM31  FPGACFG_DONE      27   IO0_10       J12   FPGA_DATA10
3    DXN          AJ5   (not used)        28   IO0_11       J13   FPGA_DATA11
4    DXP          AL5   (not used)        29   IO0_12       J14   FPGA_DATA12
5    GCLK0        AH18  (not used)        30   IO0_13       K11   FPGA_DATA13
6    GCLK1        AL19  (not used)        31   IO0_14       F7    FPGA_DATA14
7    GCLK2        D17   GCLK2_FPGA        32   IO0_15       H9    FPGA_DATA15
8    GCLK3        E17   GCLK3_FPGA        33   IO0_16       C5    FPGA_DATA16
9    M0           AK4   pull-down         34   IO0_17       J10   FPGA_DATA17
10   M1           AG7   pull-up           35   IO0_18       E6    FPGA_DATA18
11   M2           AL3   pull-up           36   IO0_19       D6    FPGA_DATA19
12   PROGRAM      AG28  /FPGACFG_PROGRAM  37   IO0_20       A4    FPGA_DATA20
13   TCK          D5    FPGA_TCK          38   IO0_21       G8    FPGA_DATA21
14   TDI          C30   FPGA_TDI          39   IO0_22       C6    FPGA_DATA22
15   TDO          K26   FPGA_TDO          40   IO0_23       J11   FPGA_DATA23
16   TMS          C4    FPGA_TMS          41   IO0_24       G9    FPGA_DATA24
17   IO0_0        B4    FPGA_DATA0        42   IO0_25       F8    FPGA_DATA25
18   IO0_1        B9    FPGA_DATA1        43   IO0_26       A5    FPGA_DATA26
19   IO0_2        B10   FPGA_DATA2        44   IO0_27       H10   FPGA_DATA27
20   IO0_3        D9    FPGA_DATA3        45   IO0_28       D7    FPGA_DATA28
21   IO0_4        D16   FPGA_DATA4        46   IO0_29       B5    FPGA_DATA29
22   IO0_5        E7    FPGA_DATA5        47   IO0_30       K12   FPGA_DATA30
23   IO0_6        E11   FPGA_DATA6        48   IO0_31       E8    FPGA_DATA31
24   IO0_7        E13   FPGA_DATA7        49   IO0_32       B6    FPGA_DATA32
25   IO0_8        E16   FPGA_DATA8        50   IO0_33       F9    FPGA_DATA33
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 1 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
51   IO0_34       G10   FPGA_DATA34       76   IO0_59       D11   FPGA_DATA59
52   IO0_35       C7    FPGA_DATA35       77   IO0_60       G13   FPGA_DATA60
53   IO0_36       D8    FPGA_DATA36       78   IO0_61       C12   FPGA_DATA61
54   IO0_37       B7    FPGA_DATA37       79   IO0_62       K15   FPGA_DATA62
55   IO0_38       H11   FPGA_DATA38       80   IO0_63       A12   FPGA_DATA63
56   IO0_39       C8    FPGA_DATA39       81   IO0_64       B12   /FPGA_CE
57   IO0_40       E9    FPGA_DATA40       82   IO0_65       H14   /FPGA_OE
58   IO0_41       B8    FPGA_DATA41       83   IO0_66       D12   /FPGA_WE
59   IO0_42       K13   FPGA_DATA42       84   IO0_67       F13   FPGA_ADDR0
60   IO0_43       G11   FPGA_DATA43       85   IO0_68       A13   FPGA_ADDR1
61   IO0_44       A8    FPGA_DATA44       86   IO0_69       B13   FPGA_ADDR2
62   IO0_45       F10   FPGA_DATA45       87   IO0_70       J15   FPGA_ADDR3
63   IO0_46       C9    FPGA_DATA46       88   IO0_71       G14   FPGA_ADDR4
64   IO0_47       H12   FPGA_DATA47       89   IO0_72       C13   FPGA_ADDR5
65   IO0_48       D10   FPGA_DATA48       90   IO0_73       F14   FPGA_ADDR6
66   IO0_49       A9    FPGA_DATA49       91   IO0_74       H15   FPGA_ADDR7
67   IO0_50       F11   FPGA_DATA50       92   IO0_75       D13   FPGA_ADDR8
68   IO0_51       A10   FPGA_DATA51       93   IO0_76       A14   FPGA_ADDR9
69   IO0_52       K14   FPGA_DATA52       94   IO0_77       K16   FPGA_ADDR10
70   IO0_53       C10   FPGA_DATA53       95   IO0_78       E14   FPGA_ADDR11
71   IO0_54       H13   FPGA_DATA54       96   IO0_79       B14   FPGA_ADDR12
72   IO0_55       G12   FPGA_DATA55       97   IO0_80       G15   FPGA_ADDR13
73   IO0_56       A11   FPGA_DATA56       98   IO0_81       D14   FPGA_ADDR14
74   IO0_57       B11   FPGA_DATA57       99   IO0_82       J16   FPGA_ADDR15
75   IO0_58       E12   FPGA_DATA58       100  IO0_83       D15   FPGA_ADDR16
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 2 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
101  IO0_84       F15   FPGA_ADDR17       126  IO1_10       D28   G1_ADDR3
102  IO0_85       B15   FPGA_ADDR18       127  IO1_11       D29   G1_ADDR4
103  IO0_86       A15   FPGA_ADDR19       128  IO1_12       G23   G1_ADDR5
104  IO0_87       E15   G1_DATA0          129  IO1_13       J23   G1_ADDR6
105  IO0_88       G16   G1_DATA1          130  IO1_14       J18   G1_ADDR7
106  IO0_89       A16   G1_DATA2          131  IO1_15       G18   G1_ADDR8
107  IO0_90       F16   G1_DATA3          132  IO1_16       C18   G1_ADDR9
108  IO0_91       J17   G1_DATA4          133  IO1_17       H18   G1_ADDR10
109  IO0_92       C16   G1_DATA5          134  IO1_18       F18   G1_ADDR11
110  IO0_93       B16   G1_DATA6          135  IO1_19       B19   G1_ADDR12
111  IO0_94       H17   G1_DATA7          136  IO1_20       A19   G1_ADDR13
112  IO0_95       A17   G1_DATA8          137  IO1_21       K19   G1_ADDR14
113  IO0_96       G17   G1_DATA9          138  IO1_22       C19   G1_ADDR15
114  IO0_97       B17   G1_DATA10         139  IO1_23       F19   G1_ADDR16
115  IO0_98       C17   G1_DATA11         140  IO1_24       E19   G1_ADDR17
116  IO1_0        A18   G1_DATA12         141  IO1_25       G19   G1_ADDR18
117  IO1_1        B18   G1_DATA13         142  IO1_26       J19   G1_ADDR19
118  IO1_2        B24   G1_DATA14         143  IO1_27       A20   G2_DATA0
119  IO1_3        B25   G1_DATA15         144  IO1_28       G20   G2_DATA1
120  IO1_4        E22   /G1_CE            145  IO1_29       B20   G2_DATA2
121  IO1_5        E23   /G1_WE            146  IO1_30       F20   G2_DATA3
122  IO1_6        D18   /G1_OE            147  IO1_31       D20   G2_DATA4
123  IO1_7        D19   G1_ADDR0          148  IO1_32       E20   G2_DATA5
124  IO1_8        D25   G1_ADDR1          149  IO1_33       H20   G2_DATA6
125  IO1_9        D26   G1_ADDR2          150  IO1_34       A21   G2_DATA7
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 3 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
151  IO1_35       E21   G2_DATA8          176  IO1_60       D24   G2_ADDR14
152  IO1_36       J20   G2_DATA9          177  IO1_61       A25   G2_ADDR15
153  IO1_37       D21   G2_DATA10         178  IO1_62       E24   G2_ADDR16
154  IO1_38       K20   G2_DATA11         179  IO1_63       A26   G2_ADDR17
155  IO1_39       B21   G2_DATA12         180  IO1_64       C25   G2_ADDR18
156  IO1_40       H21   G2_DATA13         181  IO1_65       F24   G2_ADDR19
157  IO1_41       G21   G2_DATA14         182  IO1_66       B26   G3_DATA0
158  IO1_42       F21   G2_DATA15         183  IO1_67       K23   G3_DATA1
159  IO1_43       A22   /G2_CE            184  IO1_68       F25   G3_DATA2
160  IO1_44       B22   /G2_WE            185  IO1_69       C26   G3_DATA3
161  IO1_45       J21   /G2_OE            186  IO1_70       H24   G3_DATA4
162  IO1_46       C22   G2_ADDR0          187  IO1_71       G24   G3_DATA5
163  IO1_47       D22   G2_ADDR1          188  IO1_72       A27   G3_DATA6
164  IO1_48       G22   G2_ADDR2          189  IO1_73       B27   G3_DATA7
165  IO1_49       K21   G2_ADDR3          190  IO1_74       G25   G3_DATA8
166  IO1_50       A23   G2_ADDR4          191  IO1_75       E26   G3_DATA9
167  IO1_51       F22   G2_ADDR5          192  IO1_76       C27   G3_DATA10
168  IO1_52       B23   G2_ADDR6          193  IO1_77       J24   G3_DATA11
169  IO1_53       C23   G2_ADDR7          194  IO1_78       B28   G3_DATA12
170  IO1_54       H22   G2_ADDR8          195  IO1_79       K24   G3_DATA13
171  IO1_55       D23   G2_ADDR9          196  IO1_80       H25   G3_DATA14
172  IO1_56       K22   G2_ADDR10         197  IO1_81       D27   G3_DATA15
173  IO1_57       A24   G2_ADDR11         198  IO1_82       F26   /G3_CE
174  IO1_58       J22   G2_ADDR12         199  IO1_83       G26   /G3_WE
175  IO1_59       H23   G2_ADDR13         200  IO1_84       C28   /G3_OE
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 4 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
201  IO1_85       E27   G3_ADDR0          226  IO2_9        P25   G4_DATA3
202  IO1_86       J25   G3_ADDR1          227  IO2_10       U26   G4_DATA4
203  IO1_87       A30   G3_ADDR2          228  IO2_11       U30   G4_DATA5
204  IO1_88       H26   G3_ADDR3          229  IO2_12       U32   G4_DATA6
205  IO1_89       G27   G3_ADDR4          230  IO2_13       U34   G4_DATA7
206  IO1_90       B29   G3_ADDR5          231  IO2_14       M30   LAD5
207  IO1_91       F27   G3_ADDR6          232  IO2_15       D32   FPGACFG_BUSY
208  IO1_92       C29   G3_ADDR7          233  IO2_16       J27   LAD7
209  IO1_93       E28   G3_ADDR8          234  IO2_17       E31   G4_DATA8
210  IO1_94       F28   G3_ADDR9          235  IO2_18       F30   G4_DATA9
211  IO1_95       L25   G3_ADDR10         236  IO2_19       G29   G4_DATA10
212  IO1_96       B30   G3_ADDR11         237  IO2_20       F32   G4_DATA11
213  IO1_97       B31   G3_ADDR12         238  IO2_21       E32   G4_DATA12
214  IO1_98       E29   G3_ADDR13         239  IO2_22       G30   G4_DATA13
215  IO1_99       A31   /FPGACFG_WRITE    240  IO2_23       M25   G4_DATA14
216  IO1_100      D30   /FPGACFG_CS       241  IO2_24       G31   G4_DATA15
217  IO2_0        F31   G3_ADDR14         242  IO2_25       L26   /G4_CE
218  IO2_1        J32   G3_ADDR15         243  IO2_26       D33   /G4_WE
219  IO2_2        K27   G3_ADDR16         244  IO2_27       D34   /G4_OE
220  IO2_3        K31   G3_ADDR17         245  IO2_28       H29   G4_ADDR0
221  IO2_4        L28   G3_ADDR18         246  IO2_29       J28   G4_ADDR1
222  IO2_5        L30   G3_ADDR19         247  IO2_30       E33   G4_ADDR2
223  IO2_6        M32   G4_DATA0          248  IO2_31       H28   G4_ADDR3
224  IO2_7        N26   G4_DATA1          249  IO2_32       H30   G4_ADDR4
225  IO2_8        N28   G4_DATA2          250  IO2_33       H32   G4_ADDR5
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 5 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
251  IO2_34       K28   G4_ADDR6          276  IO2_59       R25   G5_DATA10
252  IO2_35       L27   G4_ADDR7          277  IO2_60       M34   G5_DATA11
253  IO2_36       F33   G4_ADDR8          278  IO2_61       L31   G5_DATA12
254  IO2_37       M26   G4_ADDR9          279  IO2_62       L33   G5_DATA13
255  IO2_38       E34   G4_ADDR10         280  IO2_63       P27   G5_DATA14
256  IO2_39       H31   G4_ADDR11         281  IO2_64       M33   G5_DATA15
257  IO2_40       G32   G4_ADDR12         282  IO2_65       M31   /G5_CE
258  IO2_41       N25   G4_ADDR13         283  IO2_66       R26   /G5_WE
259  IO2_42       J31   G4_ADDR14         284  IO2_67       N30   /G5_OE
260  IO2_43       J30   G4_ADDR15         285  IO2_68       P28   G5_ADDR0
261  IO2_44       G33   G4_ADDR16         286  IO2_69       N29   G5_ADDR1
262  IO2_45       H34   G4_ADDR17         287  IO2_70       N33   G5_ADDR2
263  IO2_46       J29   G4_ADDR18         288  IO2_71       T25   G5_ADDR3
264  IO2_47       M27   G4_ADDR19         289  IO2_72       N34   G5_ADDR4
265  IO2_48       H33   G5_DATA0          290  IO2_73       P34   G5_ADDR5
266  IO2_49       K29   G5_DATA1          291  IO2_74       R27   G5_ADDR6
267  IO2_50       J34   G5_DATA2          292  IO2_75       P29   G5_ADDR7
268  IO2_51       L29   G5_DATA3          293  IO2_76       P31   G5_ADDR8
269  IO2_52       J33   G5_DATA4          294  IO2_77       P33   G5_ADDR9
270  IO2_53       M28   G5_DATA5          295  IO2_78       T26   G5_ADDR10
271  IO2_54       K34   G5_DATA6          296  IO2_79       R34   G5_ADDR11
272  IO2_55       N27   G5_DATA7          297  IO2_80       R28   G5_ADDR12
273  IO2_56       L34   G5_DATA8          298  IO2_81       N31   G5_ADDR13
274  IO2_57       K33   G5_DATA9          299  IO2_82       N32   LAD4
275  IO2_58       P26   LAD6              300  IO2_83       P30   G5_ADDR14
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 6 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
301  IO2_84       R33   G5_ADDR15         326  IO3_8        AF34  G6_ADDR1
302  IO2_85       R29   G5_ADDR16         327  IO3_9        AG31  G6_ADDR2
303  IO2_86       T34   G5_ADDR17         328  IO3_10       AG33  G6_ADDR3
304  IO2_87       R30   G5_ADDR18         329  IO3_11       AG34  G6_ADDR4
305  IO2_88       T30   G5_ADDR19         330  IO3_12       AH29  G6_ADDR5
306  IO2_89       T28   G6_DATA0          331  IO3_13       AJ30  G6_ADDR6
307  IO2_90       R31   G6_DATA1          332  IO3_14       V26   G6_ADDR7
308  IO2_91       T29   G6_DATA2          333  IO3_15       V30   G6_ADDR8
309  IO2_92       U27   G6_DATA3          334  IO3_16       W34   G6_ADDR9
310  IO2_93       T31   G6_DATA4          335  IO3_17       V28   G6_ADDR10
311  IO2_94       T33   G6_DATA5          336  IO3_18       W32   G6_ADDR11
312  IO2_95       U28   G6_DATA6          337  IO3_19       W30   G6_ADDR12
313  IO2_96       T32   G6_DATA7          338  IO3_20       V29   G6_ADDR13
314  IO2_97       U29   G6_DATA8          339  IO3_21       Y34   G6_ADDR14
315  IO2_98       U33   G6_DATA9          340  IO3_22       W29   G6_ADDR15
316  IO2_99       V33   G6_DATA10         341  IO3_23       Y33   G6_ADDR16
317  IO2_100      U31   G6_DATA11         342  IO3_24       W26   G6_ADDR17
318  IO3_0        V27   G6_DATA12         343  IO3_25       W28   G6_ADDR18
319  IO3_1        V31   G6_DATA13         344  IO3_26       Y31   G6_ADDR19
320  IO3_2        V32   G6_DATA14         345  IO3_27       Y30   G7_DATA0
321  IO3_3        W33   G6_DATA15         346  IO3_28       AA34  G7_DATA1
322  IO3_4        AB25  /G6_CE            347  IO3_29       W31   G7_DATA2
323  IO3_5        AB26  /G6_WE            348  IO3_30       AA33  LAD3
324  IO3_6        AB31  /G6_OE            349  IO3_31       Y29   G7_DATA3
325  IO3_7        AC31  G6_ADDR0          350  IO3_32       W25   G7_DATA4
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 7 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
351  IO3_33       AB34  G7_DATA5          376  IO3_58       AA25  G7_ADDR9
352  IO3_34       Y28   G7_DATA6          377  IO3_59       AE32  G7_ADDR10
353  IO3_35       AB33  G7_DATA7          378  IO3_60       AE31  G7_ADDR11
354  IO3_36       AA30  G7_DATA8          379  IO3_61       AD29  G7_ADDR12
355  IO3_37       Y26   G7_DATA9          380  IO3_62       AD31  G7_ADDR13
356  IO3_38       Y27   G7_DATA10         381  IO3_63       AF33  G7_ADDR14
357  IO3_39       AA31  G7_DATA11         382  IO3_64       AC28  G7_ADDR15
358  IO3_40       AA27  G7_DATA12         383  IO3_65       AF31  G7_ADDR16
359  IO3_41       AA29  G7_DATA13         384  IO3_66       AC27  G7_ADDR17
360  IO3_42       AB32  G7_DATA14         385  IO3_67       AF32  G7_ADDR18
361  IO3_43       AB29  G7_DATA15         386  IO3_68       AE29  G7_ADDR19
362  IO3_44       AA28  /G7_CE            387  IO3_69       AD28  G8_DATA0
363  IO3_45       AC34  /G7_WE            388  IO3_70       AD30  G8_DATA1
364  IO3_46       Y25   /G7_OE            389  IO3_71       AG32  G8_DATA2
365  IO3_47       AD34  G7_ADDR0          390  IO3_72       AC26  G8_DATA3
366  IO3_48       AB30  G7_ADDR1          391  IO3_73       AH33  G8_DATA4
367  IO3_49       AC33  G7_ADDR2          392  IO3_74       AD26  G8_DATA5
368  IO3_50       AA26  G7_ADDR3          393  IO3_75       AF30  G8_DATA6
369  IO3_51       AC32  G7_ADDR4          394  IO3_76       AC25  G8_DATA7
370  IO3_52       AD33  G7_ADDR5          395  IO3_77       AH32  G8_DATA8
371  IO3_53       AB28  G7_ADDR6          396  IO3_78       AE28  G8_DATA9
372  IO3_54       AE34  G7_ADDR7          397  IO3_79       AL34  G8_DATA10
373  IO3_55       AB27  LAD2              398  IO3_80       AG30  G8_DATA11
374  IO3_56       AE33  LAD1              399  IO3_81       AD27  G8_DATA12
375  IO3_57       AC30  G7_ADDR8          400  IO3_82       AF29  G8_DATA13
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 8 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
401  IO3_83       AK34  G8_DATA14         426  IO4_7        AK19  G8_ADDR18
402  IO3_84       AD25  G8_DATA15         427  IO4_8        AL25  G8_ADDR19
403  IO3_85       AE27  /G8_CE            428  IO4_9        AL27  G9_DATA0
404  IO3_86       AJ33  /G8_WE            429  IO4_10       AL30  G9_DATA1
405  IO3_87       AH31  /G8_OE            430  IO4_11       AN18  G9_DATA2
406  IO3_88       AE26  G8_ADDR0          431  IO4_12       AN22  G9_DATA3
407  IO3_89       AL33  G8_ADDR1          432  IO4_13       AN24  G9_DATA4
408  IO3_90       AF28  G8_ADDR2          433  IO4_14       AP31  G9_DATA5
409  IO3_91       AL32  G8_ADDR3          434  IO4_15       AK29  G9_DATA6
410  IO3_92       AJ31  G8_ADDR4          435  IO4_16       AP30  G9_DATA7
411  IO3_93       AF27  G8_ADDR5          436  IO4_17       AN31  G9_DATA8
412  IO3_94       AG29  G8_ADDR6          437  IO4_18       AH27  G9_DATA9
413  IO3_95       AJ32  G8_ADDR7          438  IO4_19       AN30  G9_DATA10
414  IO3_96       AK33  G8_ADDR8          439  IO4_20       AM30  G9_DATA11
415  IO3_97       AH30  G8_ADDR9          440  IO4_21       AK28  G9_DATA12
416  IO3_98       AK32  LAD0              441  IO4_22       AG26  G9_DATA13
417  IO3_99       AK31  /FPGACFG_INIT     442  IO4_23       AN29  G9_DATA14
418  IO3_100      V34   G8_ADDR10         443  IO4_24       AF25  G9_DATA15
419  IO4_0        AE21  G8_ADDR11         444  IO4_25       AM29  /G9_CE
420  IO4_1        AG18  G8_ADDR12         445  IO4_26       AL29  /G9_WE
421  IO4_2        AG23  G8_ADDR13         446  IO4_27       AL28  /G9_OE
422  IO4_3        AH24  G8_ADDR14         447  IO4_28       AE24  G9_ADDR0
423  IO4_4        AH25  G8_ADDR15         448  IO4_29       AN28  G9_ADDR1
424  IO4_5        AJ28  G8_ADDR16         449  IO4_30       AJ27  G9_ADDR2
425  IO4_6        AK18  G8_ADDR17         450  IO4_31       AH26  G9_ADDR3
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 9 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
451  IO4_32       AG25  G9_ADDR4          476  IO4_57       AP24  G10_DATA9
452  IO4_33       AK27  G9_ADDR5          477  IO4_58       AL24  G10_DATA10
453  IO4_34       AM28  G9_ADDR6          478  IO4_59       AK23  G10_DATA11
454  IO4_35       AF24  G9_ADDR7          479  IO4_60       AG22  G10_DATA12
455  IO4_36       AJ26  G9_ADDR8          480  IO4_61       AN23  G10_DATA13
456  IO4_37       AP27  G9_ADDR9          481  IO4_62       AP23  G10_DATA14
457  IO4_38       AK26  G9_ADDR10         482  IO4_63       AM23  G10_DATA15
458  IO4_39       AN27  G9_ADDR11         483  IO4_64       AH22  /G10_CE
459  IO4_40       AE23  G9_ADDR12         484  IO4_65       AP22  /G10_WE
460  IO4_41       AM27  G9_ADDR13         485  IO4_66       AL23  /G10_OE
461  IO4_42       AL26  G9_ADDR14         486  IO4_67       AF21  G10_ADDR0
462  IO4_43       AP26  G9_ADDR15         487  IO4_68       AL22  G10_ADDR1
463  IO4_44       AN26  G9_ADDR16         488  IO4_69       AJ22  G10_ADDR2
464  IO4_45       AJ25  G9_ADDR17         489  IO4_70       AK22  G10_ADDR3
465  IO4_46       AG24  G9_ADDR18         490  IO4_71       AM22  G10_ADDR4
466  IO4_47       AP25  G9_ADDR19         491  IO4_72       AG21  G10_ADDR5
467  IO4_48       AF23  G10_DATA0         492  IO4_73       AJ21  G10_ADDR6
468  IO4_49       AM26  G10_DATA1         493  IO4_74       AP21  G10_ADDR7
469  IO4_50       AJ24  G10_DATA2         494  IO4_75       AE20  G10_ADDR8
470  IO4_51       AN25  G10_DATA3         495  IO4_76       AH21  G10_ADDR9
471  IO4_52       AE22  G10_DATA4         496  IO4_77       AL21  G10_ADDR10
472  IO4_53       AM25  G10_DATA5         497  IO4_78       AN21  G10_ADDR11
473  IO4_54       AK24  G10_DATA6         498  IO4_79       AF20  G10_ADDR12
474  IO4_55       AH23  G10_DATA7         499  IO4_80       AK21  G10_ADDR13
475  IO4_56       AF22  G10_DATA8         500  IO4_81       AP20  G10_ADDR14
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 10 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
501  IO4_82       AE19  G10_ADDR15        526  IO5_6        AK13  G11_ADDR1
502  IO4_83       AN20  G10_ADDR16        527  IO5_7        AL13  G11_ADDR2
503  IO4_84       AG20  G10_ADDR17        528  IO5_8        AM4   G11_ADDR3
504  IO4_85       AL20  G10_ADDR18        529  IO5_9        AN9   G11_ADDR4
505  IO4_86       AH20  G10_ADDR19        530  IO5_10       AN10  G11_ADDR5
506  IO4_87       AK20  G11_DATA0         531  IO5_11       AN16  G11_ADDR6
507  IO4_88       AN19  G11_DATA1         532  IO5_12       AN17  G11_ADDR7
508  IO4_89       AJ20  G11_DATA2         533  IO5_13       AL17  G11_ADDR8
509  IO4_90       AF19  G11_DATA3         534  IO5_14       AH17  G11_ADDR9
510  IO4_91       AP19  G11_DATA4         535  IO5_15       AM17  G11_ADDR10
511  IO4_92       AM19  G11_DATA5         536  IO5_16       AJ17  G11_ADDR11
512  IO4_93       AH19  G11_DATA6         537  IO5_17       AG17  G11_ADDR12
513  IO4_94       AJ19  G11_DATA7         538  IO5_18       AP16  G11_ADDR13
514  IO4_95       AP18  G11_DATA8         539  IO5_19       AL16  G11_ADDR14
515  IO4_96       AF18  G11_DATA9         540  IO5_20       AJ16  G11_ADDR15
516  IO4_97       AP17  G11_DATA10        541  IO5_21       AM16  G11_ADDR16
517  IO4_98       AJ18  G11_DATA11        542  IO5_22       AK16  G11_ADDR17
518  IO4_99       AL18  G11_DATA12        543  IO5_23       AP15  G11_ADDR18
519  IO4_100      AM18  G11_DATA13        544  IO5_24       AL15  G11_ADDR19
520  IO5_0        AF17  G11_DATA14        545  IO5_25       AH16  G12_DATA0
521  IO5_1        AG12  G11_DATA15        546  IO5_26       AN15  G12_DATA1
522  IO5_2        AH12  /G11_CE           547  IO5_27       AF16  G12_DATA2
523  IO5_3        AJ10  /G11_WE           548  IO5_28       AP14  G12_DATA3
524  IO5_4        AJ11  /G11_OE           549  IO5_29       AE16  G12_DATA4
525  IO5_5        AK7   G11_ADDR0         550  IO5_30       AK15  G12_DATA5
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 11 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
551  IO5_31       AJ15  G12_DATA6         576  IO5_56       AJ13  G12_ADDR12
552  IO5_32       AH15  G12_DATA7         577  IO5_57       AP10  G12_ADDR13
553  IO5_33       AN14  G12_DATA8         578  IO5_58       AK12  G12_ADDR14
554  IO5_34       AK14  G12_DATA9         579  IO5_59       AM10  G12_ADDR15
555  IO5_35       AG15  G12_DATA10        580  IO5_60       AP9   G12_ADDR16
556  IO5_36       AM13  G12_DATA11        581  IO5_61       AK11  G12_ADDR17
557  IO5_37       AF15  G12_DATA12        582  IO5_62       AL11  G12_ADDR18
558  IO5_38       AG14  G12_DATA13        583  IO5_63       AL10  G12_ADDR19
559  IO5_39       AP13  G12_DATA14        584  IO5_64       AE13  G13_DATA0
560  IO5_40       AE14  G12_DATA15        585  IO5_65       AM9   G13_DATA1
561  IO5_41       AE15  /G12_CE           586  IO5_66       AF12  G13_DATA2
562  IO5_42       AN13  /G12_WE           587  IO5_67       AP8   G13_DATA3
563  IO5_43       AG13  /G12_OE           588  IO5_68       AL9   G13_DATA4
564  IO5_44       AH14  G12_ADDR0         589  IO5_69       AH11  G13_DATA5
565  IO5_45       AP12  G12_ADDR1         590  IO5_70       AF11  G13_DATA6
566  IO5_46       AJ14  G12_ADDR2         591  IO5_71       AN8   G13_DATA7
567  IO5_47       AL14  G12_ADDR3         592  IO5_72       AM8   G13_DATA8
568  IO5_48       AF13  G12_ADDR4         593  IO5_73       AG11  G13_DATA9
569  IO5_49       AN12  G12_ADDR5         594  IO5_74       AL8   G13_DATA10
570  IO5_50       AF14  G12_ADDR6         595  IO5_75       AK9   G13_DATA11
571  IO5_51       AP11  G12_ADDR7         596  IO5_76       AH10  G13_DATA12
572  IO5_52       AN11  G12_ADDR8         597  IO5_77       AN7   G13_DATA13
573  IO5_53       AH13  G12_ADDR9         598  IO5_78       AE12  G13_DATA14
574  IO5_54       AM12  G12_ADDR10        599  IO5_79       AJ9   G13_DATA15
575  IO5_55       AL12  G12_ADDR11        600  IO5_80       AM7   /G13_CE
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 12 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
601  IO5_81       AL7   /G13_WE           626  IO6_6        AB5   G14_DATA3
602  IO5_82       AG10  /G13_OE           627  IO6_7        AB7   G14_DATA4
603  IO5_83       AN6   G13_ADDR0         628  IO6_8        AB9   G14_DATA5
604  IO5_84       AK8   G13_ADDR1         629  IO6_9        AD7   G14_DATA6
605  IO5_85       AH9   G13_ADDR2         630  IO6_10       AD8   G14_DATA7
606  IO5_86       AP5   G13_ADDR3         631  IO6_11       AE2   G14_DATA8
607  IO5_87       AJ8   G13_ADDR4         632  IO6_12       AE4   G14_DATA9
608  IO5_88       AE11  G13_ADDR5         633  IO6_13       AJ4   G14_DATA10
609  IO5_89       AN5   G13_ADDR6         634  IO6_14       AH5   G14_DATA11
610  IO5_90       AF10  G13_ADDR7         635  IO6_15       AH6   G14_DATA12
611  IO5_91       AM6   G13_ADDR8         636  IO6_16       AF8   G14_DATA13
612  IO5_92       AL6   G13_ADDR9         637  IO6_17       AE9   G14_DATA14
613  IO5_93       AG9   G13_ADDR10        638  IO6_18       AK3   G14_DATA15
614  IO5_94       AH8   G13_ADDR11        639  IO6_19       AD10  /G14_CE
615  IO5_95       AP4   G13_ADDR12        640  IO6_20       AL2   /G14_WE
616  IO5_96       AN4   G13_ADDR13        641  IO6_21       AL1   /G14_OE
617  IO5_97       AJ7   G13_ADDR14        642  IO6_22       AH4   G14_ADDR0
618  IO5_98       AM5   G13_ADDR15        643  IO6_23       AG6   G14_ADDR1
619  IO5_99       AK6   G13_ADDR16        644  IO6_24       AK1   G14_ADDR2
620  IO6_0        T1    G13_ADDR17        645  IO6_25       AF7   G14_ADDR3
621  IO6_1        V2    G13_ADDR18        646  IO6_26       AK2   G14_ADDR4
622  IO6_2        V3    G13_ADDR19        647  IO6_27       AJ3   G14_ADDR5
623  IO6_3        V5    G14_DATA0         648  IO6_28       AG5   G14_ADDR6
624  IO6_4        V8    G14_DATA1         649  IO6_29       AD9   G14_ADDR7
625  IO6_5        AA10  G14_DATA2         650  IO6_30       AJ2   G14_ADDR8
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 13 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
651  IO6_31       AC10  G14_ADDR9         676  IO6_56       AD2   G15_DATA14
652  IO6_32       AH2   G14_ADDR10        677  IO6_57       AB8   G15_DATA15
653  IO6_33       AH3   G14_ADDR11        678  IO6_58       AC1   /G15_CE
654  IO6_34       AF5   G14_ADDR12        679  IO6_59       AC5   /G15_WE
655  IO6_35       AE8   G14_ADDR13        680  IO6_60       AC2   /G15_OE
656  IO6_36       AG3   G14_ADDR14        681  IO6_61       AA9   G15_ADDR0
657  IO6_37       AE7   G14_ADDR15        682  IO6_62       AC3   G15_ADDR1
658  IO6_38       AG2   G14_ADDR16        683  IO6_63       AC4   G15_ADDR2
659  IO6_39       AF6   G14_ADDR17        684  IO6_64       AD4   G15_ADDR3
660  IO6_40       AG1   G14_ADDR18        685  IO6_65       AA8   G15_ADDR4
661  IO6_41       AC9   G14_ADDR19        686  IO6_66       AB6   G15_ADDR5
662  IO6_42       AG4   G15_DATA0         687  IO6_67       AB1   G15_ADDR6
663  IO6_43       AE6   G15_DATA1         688  IO6_68       Y10   G15_ADDR7
664  IO6_44       AF3   G15_DATA2         689  IO6_69       AB2   G15_ADDR8
665  IO6_45       AF1   G15_DATA3         690  IO6_70       AA7   G15_ADDR9
666  IO6_46       AF4   G15_DATA4         691  IO6_71       AA4   G15_ADDR10
667  IO6_47       AB10  G15_DATA5         692  IO6_72       AA1   G15_ADDR11
668  IO6_48       AF2   G15_DATA6         693  IO6_73       Y9    G15_ADDR12
669  IO6_49       AC8   G15_DATA7         694  IO6_74       AB4   G15_ADDR13
670  IO6_50       AE1   G15_DATA8         695  IO6_75       AA2   G15_ADDR14
671  IO6_51       AD5   G15_DATA9         696  IO6_76       Y8    G15_ADDR15
672  IO6_52       AE3   G15_DATA10        697  IO6_77       AA6   G15_ADDR16
673  IO6_53       AC7   G15_DATA11        698  IO6_78       AA5   G15_ADDR17
674  IO6_54       AD1   G15_DATA12        699  IO6_79       AB3   G15_ADDR18
675  IO6_55       AD6   G15_DATA13        700  IO6_80       Y7    G15_ADDR19
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 14 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
701  IO6_81       Y1    G16_DATA0         726  IO7_5        K4    G16_ADDR6
702  IO6_82       W10   G16_DATA1         727  IO7_6        L6    G16_ADDR7
703  IO6_83       Y5    G16_DATA2         728  IO7_7        M5    G16_ADDR8
704  IO6_84       Y2    G16_DATA3         729  IO7_8        M10   G16_ADDR9
705  IO6_85       W9    G16_DATA4         730  IO7_9        N5    G16_ADDR10
706  IO6_86       W2    G16_DATA5         731  IO7_10       N10   G16_ADDR11
707  IO6_87       W7    G16_DATA6         732  IO7_11       R7    G16_ADDR12
708  IO6_88       Y4    G16_DATA7         733  IO7_12       T2    G16_ADDR13
709  IO6_89       W1    G16_DATA8         734  IO7_13       T7    G16_ADDR14
710  IO6_90       Y6    G16_DATA9         735  IO7_14       U8    G16_ADDR15
711  IO6_91       W6    G16_DATA10        736  IO7_15       V4    G16_ADDR16
712  IO6_92       W3    G16_DATA11        737  IO7_16       U9    G16_ADDR17
713  IO6_93       V9    G16_DATA12        738  IO7_17       U4    G16_ADDR18
714  IO6_94       W4    G16_DATA13        739  IO7_18       U7    G16_ADDR19
715  IO6_95       W5    G16_DATA14        740  IO7_19       U5    FPGA_FIFODATA0
716  IO6_96       V1    G16_DATA15        741  IO7_20       U3    FPGA_FIFODATA1
717  IO6_97       V7    /G16_CE           742  IO7_21       U6    FPGA_FIFODATA2
718  IO6_98       U2    /G16_WE           743  IO7_22       T3    FPGA_FIFODATA3
719  IO6_99       V6    /G16_OE           744  IO7_23       T6    FPGA_FIFODATA4
720  IO6_100      U1    G16_ADDR0         745  IO7_24       T9    FPGA_FIFODATA5
721  IO7_0        F5    G16_ADDR1         746  IO7_25       T4    FPGA_FIFODATA6
722  IO7_1        G6    G16_ADDR2         747  IO7_26       T5    FPGA_FIFODATA7
723  IO7_2        H1    G16_ADDR3         748  IO7_27       R1    FPGA_FIFODATA8
724  IO7_3        H7    G16_ADDR4         749  IO7_28       R6    FPGA_FIFODATA9
725  IO7_4        K2    G16_ADDR5         750  IO7_29       T10   FPGA_FIFODATA10
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 15 of 17)
#    FPGA signal  pin   design signal     #    FPGA signal  pin   design signal
751  IO7_30       R2    FPGA_FIFODATA11   776  IO7_55       N8    FPGA_FF3
752  IO7_31       R5    FPGA_FIFODATA12   777  IO7_56       L2    FPGA_FF4
753  IO7_32       P1    FPGA_FIFODATA13   778  IO7_57       N9    /FPGAUSER_CS
754  IO7_33       P5    FPGA_FIFODATA14   779  IO7_58       M7    FPGA_MARKEND
755  IO7_34       R8    FPGA_FIFODATA15   780  IO7_59       K1    MA3
756  IO7_35       P2    FPGA_FIFODATA16   781  IO7_60       M8    MA2
757  IO7_36       R9    FPGA_FIFODATA17   782  IO7_61       L4    MA1
758  IO7_37       N1    FPGA_FIFODATA18   783  IO7_62       J1    MA0
759  IO7_38       P4    FPGA_FIFODATA19   784  IO7_63       L5    /MOE
760  IO7_39       R10   FPGA_FIFODATA20   785  IO7_64       J2    /MWE
761  IO7_40       P8    FPGA_FIFODATA21   786  IO7_65       K3    PARSE_FINISHED
762  IO7_41       N2    FPGA_FIFODATA22   787  IO7_66       L7    PARSE_ERROR
763  IO7_42       P6    FPGA_FIFODATA23   788  IO7_67       J3    PARSE_START
764  IO7_43       P7    FPGA_FIFODATA24   789  IO7_68       M9    (not used)
765  IO7_44       M1    FPGA_FIFODATA25   790  IO7_69       H2    (not used)
766  IO7_45       N4    FPGA_FIFODATA26   791  IO7_70       J4    (not used)
767  IO7_46       N6    FPGA_FIFODATA27   792  IO7_71       K6    (not used)
768  IO7_47       N3    FPGA_FIFODATA28   793  IO7_72       L8    (not used)
769  IO7_48       P9    FPGA_FIFODATA29   794  IO7_73       G2    (not used)
770  IO7_49       M2    FPGA_FIFODATA30   795  IO7_74       H3    (not used)
771  IO7_50       N7    FPGA_FIFODATA31   796  IO7_75       K7    DISPLAY_D0
772  IO7_51       M3    /FPGA_WEN1        797  IO7_76       G3    DISPLAY_D1
773  IO7_52       P10   /FPGA_WEN2/LD     798  IO7_77       J5    DISPLAY_D2
774  IO7_53       M4    FPGA_FF1          799  IO7_78       L9    DISPLAY_D3
775  IO7_54       L1    FPGA_FF2          800  IO7_79       H5    DISPLAY_D4
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 16 of 17)
#    FPGA signal  pin   design signal
801  IO7_80       J6    DISPLAY_D5
802  IO7_81       H4    DISPLAY_D6
803  IO7_82       G4    DISPLAY_A0
804  IO7_83       K8    DISPLAY_A1
805  IO7_84       J7    /DISPLAY_WR
806  IO7_85       F2    (not used)
807  IO7_86       F3    (not used)
808  IO7_87       L10   (not used)
809  IO7_88       E1    DEBUG0
810  IO7_89       H6    DEBUG1
811  IO7_90       G5    DEBUG2
812  IO7_91       E2    DEBUG3
813  IO7_92       K9    DEBUG4
814  IO7_93       D1    DEBUG5
815  IO7_94       E3    DEBUG6
816  IO7_95       J8    DEBUG7
817  IO7_96       E4    DEBUG8
818  IO7_97       D2    DEBUG9
819  IO7_98       F4    DEBUG10
820  IO7_99       D3    DEBUG11
Virtex-E XCV2000efg1156-6 FPGA pin assignment (page 17 of 17)
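As an illustration only, each row of the pin-assignment table (#, FPGA signal, pin, design signal) could be turned into a placement constraint for the implementation tools. The sketch below is written in Python and rests on assumptions that the table itself does not state: the output uses the Xilinx UCF form NET "<name>" LOC = "<pin>";, a leading "/" on a design signal is taken to mark an active-low line, and the "_n" suffix and the helper names net_name and to_ucf are hypothetical choices.

```python
from typing import List, Optional, Tuple

# Three rows copied from the pin-assignment table:
# (#, FPGA signal, pin, design signal)
ROWS: List[Tuple[int, str, str, str]] = [
    (801, "IO7_80", "J6", "DISPLAY_D5"),
    (805, "IO7_84", "J7", "/DISPLAY_WR"),
    (806, "IO7_85", "F2", "(not used)"),
]


def net_name(design_signal: str) -> Optional[str]:
    """Map a table entry to a net name; return None for unused pins."""
    if design_signal == "(not used)":
        return None
    if design_signal.startswith("/"):
        # Assumption: a leading '/' marks an active-low signal.
        # The '_n' suffix is a hypothetical naming choice; any other
        # '/' is replaced because it is not safe inside a net name.
        return design_signal[1:].replace("/", "_") + "_n"
    return design_signal


def to_ucf(rows: List[Tuple[int, str, str, str]]) -> List[str]:
    """Return one UCF-style LOC constraint per used pin."""
    lines = []
    for _number, _fpga_signal, pin, design_signal in rows:
        name = net_name(design_signal)
        if name is not None:
            lines.append(f'NET "{name}" LOC = "{pin}";')
    return lines


if __name__ == "__main__":
    print("\n".join(to_ucf(ROWS)))
```

Unused pins are skipped rather than constrained, so the generated file only mentions nets that a design would actually contain.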
Bibliography
[1] A. V. Aho and J. D. Ullman. "The Theory of Parsing, Translation and Compiling", vol-
ume 1. Prentice-Hall, 1972.
[2] V. L. Arlazarov, E. A. Dinic, M. A. Kronod, and I. A. Faradzev. "On economical con-
struction of the transitive closure of a directed graph". In Soviet Mathematics Doklady,
volume 11, pages 1209–1210, 1970.
[3] F. M. Barcal, O. Sacristan, and J. Grana. "Stochastic Parsing and Parallelism". In Com-
putational Linguistics and Intelligent Text Processing, pages 401–410. Springer-Verlag
Berlin, 2001.
[4] Von L. Burton. "The Programmable Logic Device Handbook". TAB Professional an
Reference Books, Blue Ridge Summit, 1990.
[5] J. C. Chappelier and M. Rajman. "A generalized CYK algorithm for parsing stochastic
CFG". In 1st Workshop on Tabulation in Parsing and Deduction (TAPD98), pages 133–
137, April 1998.
[6] J. C. Chappelier and M. Rajman. "A Practical Bottom-Up Algorithm for On-Line Parsing
with Stochastic Context-Free Grammars". Technical Report 284, EPFL (DI-LIA), July
1998.
[7] Y. T. Chiang and K. S. Fu. "Parallel Parsing Algorithms and VLSI Implementations for
Syntactic Pattern Recognition". In IEEE Transactions on Pattern Analysis and Machine
Intelligence, volume 6, May 1984.
[8] King-Hang Chu and King-Sun Fu. "VLSI architectures for high speed recognition of
context-free languages and finite-state languages". In Proceedings 9th Annual Interna-
tional Symposium on Computing Architecture, April 1982.
[9] C. Ciressan. "Using FPGAs for designing an NLP coprocessor.". Technical Report TR-
339, EPFL, October 2000.
[10] R. Dale, H. Moisl, and H. Somers, editors. "Handbook of Natural Language Processing".
Marcel Dekker, 2000.
[11] G. Erbach. "Bottom-up Earley deduction". In Proceedings of the th International
Conference on Computational Linguistics (COLING’94). Kyoto, Japan, 1994.
[12] M. J. Foster and H. T. Kung. "The design of special-purpose VLSI chips". IEEE Com-
puter, 13, January 1980.
202 BIBLIOGRAPHY
[13] S. L. Graham, M. A. Harrison, and R. L. Mercer. "An Improved Context-Free Recog-
nizer". ACM Transactions on Programming Languages and Systems, 2(3):415–462, July
1980.
[14] J. Hopcroft and J. Ullman. "Introduction to Automata Theory, Languages, and Computa-
tion". Addison-Wesley, 1979.
[15] O. H. Ibarra, T. Jiang, and H. Wang. "Parallel Parsing on a One-Way Linear Array of
Finite-State Machines". In Foundations of Software Technology and Theoretical Com-
puter Science: Proc. of the Ninth Conference, pages 291–300. Springer, 1989.
[16] O. H. Ibarra, T. C. Pong, and S. M. Sohn. "Parallel Recognition and Parsing on the
Hypercube". IEEE Transactions on Computers, 40(6):764–770, June 1991.
[17] S. R. Kosaraju. "Speed of recognition of context-free languages by array automata". SIAM
Journal on Computing, 4, September 1975.
[18] H. T. Kung. "Let’s design algorithms for VLSI systems". In Caltech Conference on VLSI,
January 1979.
[19] Parag K. Lala. "Digital System Design Using Programmable Logic Devices". Prentice
Halls, Englewood Cliffs, 1990.
[20] P. Linz. "An Introduction to Formal Languages and Automata". Jones and Bartlett Pub-
lishers, 1997. Pages 155-168.
[21] T. Ninomiya, K. Torisawa, K. Taura, and J. Tsujii. "A Parallel CKY Parsing Algorithm
on Large-Scale Distributed-Memory Parallel Machines". In Proceedings of the Pacific
Association for Computational Linguistics, pages 232–243, September 1997.
[22] R. A. Peleato, M. Rajman, and J.-C. Chappelier. "Integration of syntactic constraints
within a speech recognition system. Coupling a speech recognizer and a stochastic
context-free parser". Technical report, EPFL (DI-LIA), February 1999.
[23] PLX Technology Inc. "Hardware Reference Manual", v 1.0 edition, September 1999.
[24] PLX Technology Inc. "IOP 480 Data Book", October 1999.
[25] G. Sampson. "The Susanne Corpus Release 3". School of Cognitive & Computing Sci-
ences, University of Sussex, Falmer, Brighton, England, 1994.
[26] E. Sanchez and M. Tomassini. "Towards Evolvable Hardware". Springer-Verlag Berlin,
1996.
[27] L. G. Valiant. "General Context-Free Recognition in Less than Cubic Time". Journal of
Computer and System Sciences, 10:308–315, 1975.
[28] John Villasenor and William H. Mangione-Smith. "Configurable Computing". Scientific
American, 1997.
[29] F. Voisin and J.-C. Raoult. "A new, Bottom-up, General Parsing Algorithm". In Journees
AFCET-GROPLAN, les Avancees en programmation. Nice, 1990.
BIBLIOGRAPHY 203
[30] Xilin Inc.   4( 1.8V Field Programmable Gate Arrays, v1.7 edition, September
2000.
[31] Xilin Inc.   4 2.5V Field Programmable Gate Arrays, v2.5 edition, April 2001.
[32] D. H. Younger. "Recognition of context-free languages in time ". In Information and
Control, pages 189–208, February 1967.
Index
Application Specific Integrated Circuits, 4
ASIC, see Application Specific Integrated Circuits
CFG, see Context-Free Grammars
Chomsky hierarchy, 2
Chomsky Normal Form, 11
CNF, see Chomsky Normal Form
Cocke Younger Kasami algorithm, 3, 12
Complex Programmable Logic Devices, 4
compound words, 14
Context-Free Grammars, 11
coprocessor, 1
CPLD, see Complex Programmable Logic Devices
CYK algorithm, see Cocke Younger Kasami algorithm
CYK implementation
    distributed memory, 7
    FPGA-based, 21–49, 59–93
    hypercube mapping, 6
    shared memory, 7
    VLSI, 6
DAP design, see Dynamic Array of Processors design
Digital Signal Processor, 4
DSP, see Digital Signal Processor
Dynamic Array of Processors design
    block diagram, 61
    chart data-structure, see Linear Array of Processors design, chart data-structure
    design analysis, 83–90
    grammar data-structure, see Linear Array of Processors design, grammar data-structure
    initialization, 62
    performance measurements, 79–82
dynamic processor allocation, 59
enhanced-CYK algorithm, 15
enhanced-CYK design
    block diagram, 98
    chart data-structure, 100–103
    design analysis, 121–133
    grammar data-structure, 103–106
    initialization, 97
    performance measurements, 119–121
    throughput, 126–130
enhanced-CYK implementation
    FPGA-based, 95–135
Field Programmable Gate Arrays, 1, 4
FPGA, see Field Programmable Gate Arrays
FPGA-board, 137–147
general-purpose processor, 4
LAP design, see Linear Array of Processors design
Linear Array of Processors design
    block diagram, 25
    chart data-structure, 26–30
    design analysis, 51–57
    grammar data-structure, 31–34
    initialization, 24
    performance measurements, 43–46
Natural Language Processing, 1
NLP, see Natural Language Processing
parser, 1
RC1000-PP FPGA-board, 21, 46, 82
recognizer, 1
robust parser, 2
shallow parser, 4
tiling mechanism, method, 95, 112
Very Large Scale Integration, 6
VLSI, see Very Large Scale Integration
vocal interfaces, 3
word lattices, 14
Curriculum Vitae
Name: Cristian CIRESSAN
Date/Place of birth: 1973, August 11, born in Timisoara, Romania
Languages: Romanian, English, French
Education:
1991-1996 Diploma in Computer Science
(Dipl. Hardware Engineer, "Politehnica" University of Timisoara)
1994-1996 Research & Development,
BEE-SPEED S.R.L., Timisoara, Romania.
1996 Diploma Project "Data transmission using the NICAM standard",
Swiss Federal Institute of Technology in Zürich (ETHZ)
1996-1997 Doctoral School in Communication Systems,
Swiss Federal Institute of Technology in Lausanne (EPFL)
1997 Diploma Project "Real-time 3D body tracking for Virtual Reality Environments"
Swiss Federal Institute of Technology in Lausanne (EPFL)
1997-current PhD student, Artificial Intelligence Laboratory, Dept. of Computer Science
Swiss Federal Institute of Technology in Lausanne (EPFL)
PUBLICATIONS
1) C. Ciressan, M. Rajman, E. Sanchez and J.-C. Chappelier, "Towards
NLP-coprocessing: An FPGA implementation of a context-free parser.", 7ème
conférence sur le Traitement Automatique du Langage Naturel (TALN 2000),
Lausanne, Suisse, Octobre, 2000, pp. 91-100.
2) C. Ciressan, E. Sanchez, M. Rajman and J.-C. Chappelier, "An FPGA-based
syntactic parser for real-life almost unrestricted context-free grammars.", 11th
International Conference on Field Programmable Logic and Applications (FPL 2001),
Lecture Notes in Computer Science, Belfast, Northern Ireland, August, 2001.
3) C. Ciressan, E. Sanchez, M. Rajman and J.-C. Chappelier, "An FPGA-based
coprocessor for the parsing of context-free grammars.", 2000 IEEE Symposium on
Field-Programmable Custom Computing Machines, Napa Valley, California, April,
2000.
4) C. Ciressan, M. Rajman, E. Sanchez and J.-C. Chappelier, "Using FPGAs for
designing an NLP coprocessor", Technical Report No. 00/339, EPFL, Dept.
Computer Science, DI-LIA, October 2000.
5) C. Ciressan, E. Sanchez, M. Rajman and J.-C. Chappelier, "An FPGA-based
syntactic parser for real-life unrestricted context-free grammars", Technical Report
No. 01/373, EPFL, Dept. Computer Science, DI-LIA, October 2001.
