thesis

Design and evaluation of the linguistic basis of an automatic F-structure annotation algorithm for the Penn-II treebank

Abstract

In this thesis, we describe the design and evaluation of the linguistic basis of an automatic f-structure annotation algorithm for the Wall Street Journal (WSJ) section of the Penn-II Treebank, which consists of more than 1,000,000 words, tagged for part-of-speech information, in about 50,000 sentences and trees. We discuss the background and some of the main principles of Lexical-Functional Grammar (LFG), the theory of language we use in our application to represent the predicate-argument-modifier structure of a sentence. We then present the guidelines for the tagging of the Penn-II Treebank, followed by a description of how the linguistics of the Penn-II Treebank relate to LFG. The automatic annotation of such Treebank grammars is difficult because annotation rules explicitly encode head, complement and modifier relations, and so often need to identify sub-sequences in the right-hand sides of (often) flat Treebank rules. The algorithm we have developed is designed to handle these flat grammar rules. We describe the methodology used to encode the linguistic generalisations needed to annotate Treebank resources with LFG f-structure information, which, unlike previous approaches to this problem, scales up to the size of the WSJ section of the Penn-II Treebank. Finally, we present and assess a number of automatic evaluation methodologies for measuring the effectiveness of the techniques we have developed. First, we employ a quantitative evaluation, which measures the coverage of our annotation algorithm with respect to rule types and tokens, and calculates the degree of fragmentation of the automatically generated f-structures. Second, we present a qualitative evaluation, which measures the quality of the f-structures produced against a manually constructed ‘gold standard’ set of f-structures. We conclude by summarising our work to date and outlining possibilities for further work.
