We investigate Arabic Context Free Grammar parsing with dependency annotation comparing lexicalised and unlexicalised parsers. We study how morphosyntactic as well as function tag information percolation in the form of grammar transforms (Johnson, 1998, Kulick et al., 2006) affects the performance of a parser and helps dependency assignment. We focus on the three most frequent functional

tags in the Arabic Penn Treebank: subjects, direct objects and predicates . We merge these functional tags with their phrasal categories and (where appropriate) percolate case information to the non-terminal (POS) category to train the parsers. We then automatically enrich the output of these parsers with full dependency information in order to annotate trees with Lexical Functional Grammar (LFG)

f-structure equations with produce f-structures, i.e. attribute-value matrices approximating to basic predicate-argument-adjunct structure representations. We present a series of experiments evaluating how well lexicalized, history-based, generative (Bikel) as well as latent

variable PCFG (Berkeley) parsers cope with the enriched Arabic data. We measure quality and coverage of both the output trees and the generated LFG f-structures. We show that joint functional and morphological information percolation improves both the recovery of trees as well as dependency results in the form of LFG f-structures

Tounsi, Lamia

van Genabith, Josef

English

DCU Online Research Access Service

Arabic Parsing Using Grammar TransformsLamia Tounsi and Josef van GenabithNCLT, School of Computing, Dublin City University, Ireland{ltounsi, josef}@computing.dcu.ieAbstractWe investigate Arabic Context Free Grammar parsing with dependency annotation comparing lexicalised and unlexicalised parsers. Westudy how morphosyntactic as well as function tag information percolation in the form of grammar transforms (Johnson, 1998, Kulicket al., 2006) affects the performance of a parser and helps dependency assignment. We focus on the three most frequent functionaltags in the Arabic Penn Treebank: subjects, direct objects and predicates . We merge these functional tags with their phrasal categoriesand (where appropriate) percolate case information to the non-terminal (POS) category to train the parsers. We then automaticallyenrich the output of these parsers with full dependency information in order to annotate trees with Lexical Functional Grammar (LFG)f-structure equations with produce f-structures, i.e. attribute-value matrices approximating to basic predicate-argument-adjunct structurerepresentations. We present a series of experiments evaluating how well lexicalized, history-based, generative (Bikel) as well as latentvariable PCFG (Berkeley) parsers cope with the enriched Arabic data. We measure quality and coverage of both the output trees andthe generated LFG f-structures. We show that joint functional and morphological information percolation improves both the recovery oftrees as well as dependency results in the form of LFG f-structures.1. IntroductionArabic is a Semitic language well-known for its morpho-logical richness and syntactic complexity. Parsing Arabicsentences is a difficult task for several reasons includingthe relatively free word order of Arabic, the length of sen-tences, the omission of diacritics (vowels) in written Ara-bic and the frequency of pro-drop phenomena. The mainobjective of our research is to automatically enrich Ara-bic Penn Treebank (ATB) trees and ATB-trained parseroutput with dependency information such as ATB func-tional tags. Then, based on these and categorical and con-figurational information, we automatically annotate treeswith LFG f-structure information representing predicate-argument structure relations to produce ATB-based LFG re-sources for parsing and generation similar to previous workon English by (Cahill and van Genabith, 2006; Cahill etal., 2008). We investigate Arabic parsing comparing lexi-calised and unlexicalised parsers. We focus on ATB gram-matical function tag assignment (Habash et al., 2009) us-ing grammar transforms for different configurations, mak-ing use of morpho-syntactic information to detect subjects,direct objects and predicates. The paper is structured asfollows: Section 2 describes the general background. Sec-tion 3 presents the parsers and grammar transforms fo-cusing on the three most frequent Functional Tags in theATB: subject, direct object and predicate. Section 4 dis-cusses the results from a series of experiments conductedon parsers trained on the tranformed ATB, measuring qual-ity and coverage of the output trees and the generated LFGf-structures.2. General BackgroundDeep probabilistic constraint-based grammars such asLFGs can be acquired from treebanks. In fact, many tree-banks (such as the Penn Treebank) contain deep linguisticinformation such as: traces, coindexation and functionaltags to support the computation of meaning representa-tions in the form of predicate-argument-adjunct structuresor deep dependency representations.2.1. Lexical Functional Grammar (LFG)Lexical functional Grammar belongs to the family ofconstraint-based grammars (Kaplan and Bresnan, 1982;Bresnan, 2001) and features two levels of representation:c(onstituent)-structure and f(unctional)-structure.C-structure describes surface grammatical configurationsusing phrase structure trees while f-structure encodes moreabstract functional information as a matrix of attribute-value pairs to reduce the impact of surface configurationin the grammar.2.2. Penn Arabic Treebank (ATB)The Penn Arabic Treebank (Maamouri and Bies 2004) isa corpus of 23,611 parse-annotated sentences from Arabicnewswire text in Modern Standard Arabic (MSA). The ATBis a fine-grained corpus, its annotation includes 22 phrasaltags, 20 individual functional tags and 24 basic POS-tags1(with a total of 497 different POS tags with morpholog-ical information). In addition, the ATB involves emptynodes to capture pro-drop as well as non-local dependen-cies (NLDs). The full POS tagset with morphological in-formation indicates case, mood, gender, definiteness, etc.2.3. LFG Arabic Annotation Algorithm (A3)The Arabic LFG f-structure annotation algorithm ex-ploits syntactic, categorical, configurational and ATB func-tional information to automatically annotate the ATBwith abstract LFG functional information (basic predicate-argument structures) (Tounsi et al., 2009). The method-ology has originally been developed for English (Cahill etal., 2004) and extended to other languages including Ger-man, Chinese, Spanish and French. For ATB gold trees theA3 achieves an f-score of 95% against a gold standard f-structure bank (Al-Raheb et al,. 2006).1ATB annotation uses Tim Buckwalter’s morphological anal-yser to generate POS values for each word/token of a sentence.19863. Arabic Parsing Using GrammarTransformsA number of parsers for Arabic use handcrafted gram-mars (Ditters 2001, Zabokrtsky and Smrz 2003, Othmanet al. 2003, Attia 2008). However none of these resultin a large-scale parsing application for Arabic. Conse-quently, we decided to apply the Bikel (implementationof Collins Model 3 (Bikel, 2004)) and Berkeley2 (auto-matic latent variable PCFG grammar induction (Petrov andKlein, 2007)) probabilistic parsing engines to Arabic CFGparsing, using the automatic LFG annotations methodolo-gies and gold POS parsing. We conduct our experiments onthe ATB. We split the data into three disjoint sets, 80% fortraining, 10% for development and 10% for testing. 33% ofthe test set sentences are sentences of length  40 tokens.3.1. Grammar TransformsTreebank-based parsing consists on ”learning” a modelgiven a treebank and using the model for ”computing” thebest parse on unseen text. (Johnson, 1998) improved theperformance of PCFG parsing by enriching treebank non-terminal categories with information about the context inwhich they occur and (Klein and Manning, 2003) exploredmanual and automatic extension of the traditional linguis-tic categories into a richer category set to enhance PCFGparsing results. Traditionally, treebank-based (P)CFG pars-ing abstracts away from much of the information encodedin the treebank (traces, coindexation, functional tags, etc)concentrating exclusively on the CFG skeleton of the tree-bank trees. However, in particular for morphologically richlanguages, morphological and functional information is im-portant to optimally recover syntactic structure. What ismore, the LFG A3 crucially relies on morphological infor-mation and functional tags in ATB trees. In order to capturethe 20 functional tags present in the ATB in parser output,our methodology applies ”masking” and ”percolation” ofinformation through the grammar and training data used bythe parsers. In addition we use grammar transformation-based case percolation.3.2. Function Tag Masking and UnmaskingFunction tag masking is a data transformation. We collapsefunctional and phrasal tags by fusing them into a singlesymbol: e.g. NP-SBJ becomes NPSBJ, effectively inflat-ing the CFG category set. In the ATB, nonterminals canreceive up to three functional tags. Consequently, the sizeof the phrasal tag set used by the parser increases from 20tags to 150 tags. The parser is then updated and trained onthis new data. After parsing, we unmask the function tagsto make them available to the LFG A3.2Originally, the ATB sentences follows Buckwalter translitera-tion scheme, which is a 1-to-1 transliteration of MSA orthographicsymbols using ASCII characters including special characters suchas: *,, ≈, upper and lower case (Buckwalter, 2004). However,the default package used by the Berkeley parser is case-sensitive.Consequently, we provide a new encoding based on lower casecharacters only. In addition, the current implementation of theparser requires CONLL format for the test set.3.3. Case PercolationAs Arabic is morphologically rich, a lot of information ispresent at the leaves of the trees in the ATB. We perco-late morphological information bottom-up in the trees tohelp grammatical function assignment. We focus on thethree most frequent functional tags in the ATB: -SBJ, -OBJ,-PRD.3 Case percolation aims to improve the determina-tion of subject, object and predicate constituent(s) amongthe syntactic structures identified in the parse tree. Arabichas three grammatical cases: nominative, genitive and ac-cusative. Except when they are governed by an overt copulaor a subordinating conjunction, -SBJ and -PRD are nomi-native and -OBJ is accusative (Habash et al. 2005). Addingcase information to POS tag increases the size of the POStag set to 40 tags. e.g. the POS NN is expended to NN,NN nom and NN acc).Figure 1 shows the (unmaked) output tree provided by theparser, trained on a version of the ATB which has under-gone ATB function tag masking and case percolation, forsentence (1). Each node in the tree is assigned an f-structureequation using A3. The subject NP receives ‘↑ SUBJ = ↓’and the predicate which subcategorises for a copula com-plement receives ‘↑ PRED = ‘null be’, ↑ XCOMP = ↓ ,↑ SUBJ= ↓ SUBJ’. The resolution of the equations pro-duces the f-structure shown in Figure 2. Note that the sub-ject of the matrix clause is coindexed with the subject of theembedded clause .(1) é¯QåÓ Ò Ë@Al$amosu mu$riqaputhe-sun bright.The sun is bright.SNP-SBJNN nomAl$amosuNP-PRDNN nommu$riqapuFigure 1: Parser output tree for sentence (1).PRED null be〈SUBJ , XCOMP〉SUBJPRED ‘Al$amosu’CASE nomDEF + 1XCOMPPRED ‘mu$riqapu’CASE nomDEF -SUBJ[ ]1Figure 2: F-structure for sentence (1).3-SBJ, -OBJ and -PRD represent 68% of ATB functional tagstokens.19874. Experimental ResultsTable 1 shows the results of a series of experiments con-ducted on parsers trained on the ATB for different transfor-mations as follows:- Column 1: parsers trained on the original ATB.- Column 2: functional and phrasal tags merged for nonter-minals and parsers trained on converted ATB.- Column 3: functional and phrasal tags merged for nonter-minals and nominative case percolated in the trees for thefunctional tag -SBJ.- Column 4: functional and phrasal tags merged for non-terminals and nominative case percolated in the trees forthe functional tags -SBJ, -PRD and accusative case is per-colated for the functional tag -OBJ. Grammar 1-6 refers tothe number of split-merge cycles involved in the Berkeleyparser training procedure.We measure quality and coverage of the output trees us-ing the standard EVALB (Sekine and Collins 1997). Notethat we ignore punctuation and we evaluate unmasked trees(functional information is not considered for the evalua-tion). We achieve best labeled bracketing f-scores of 73.58for Bikel and 71.56 for Berkeley CFG trees and an LFG f-structure dependency f-score of 88.95 for Bikel and 88.5 forBerkely. As expected, grammar transforms have a positiveeffect on both parsers. The effect on both is quite simi-lar, even if the absolute f-score improvements are not verylarge. In fact, the methodology presented in this paper doesnot confuse the parsers and improves the recovery of func-tional information (for the A3 to produce f-structures).5. Error AnalysisFor both parsers, the noise comes from the data set itself. Infact, the ATB suffers from accusative-genitive case under-specification/ambiguity for feminine and masculine plurals.Both cases are given to nouns where the inflectional mor-phology is the same for both accusative and genitive cases.This issue leads the parsers to overuse the functional tag-OBJ. More precisely, the tag -OBJ is often duplicated inflat trees for both accusative and genitive nouns. We havealso found a confusion between -SBJ and -PRD in nominalsentences.4 In fact, the parsers never assign the tag -PRDbefore the tag -SBJ. In addition, when a sentence starts withan overt copula or a subordinating conjunction, the parsersanalyse the sentence as a verbal sentence and mix-up thefunctional tags -PRD and -OBJ. On the other hand, Bikeloutput trees are better than Berkeley output trees, perhapsbecause Bikel’s probability model is enriched with lexicalinformation.6. Conclusion and Future DirectionsIn this paper, we have shown that morpho-syntactic infor-mation helps to detect grammatical functions in probabilis-tic approaches to parsing Arabic. We used (i) function tagmasking to include functional information in the phrasaltagset and (ii) case percolation to better identify subject, ob-ject and predicate constituent(s) among the syntactic struc-tures identified in the parse tree. The work presented in4A nominal sentence in Arabic has no verb and is composed ofa subject and predicate phrase. Usually, the subject precedes thepredicate but a swap is also possible.this paper is a first step on function labeling methodologiesusing grammar transforms. In the future, we aim to usemachine-learning-based ATB function labelers to provideinput to the annotation algorithm.7. ReferencesY.Al-Raheb, A.Akrout, J.van Genabith, J.Dichy. 2006. DCU 250Arabic Dependency Bank: An LFG Gold Standard Resourcefor the Arabic Penn Treebank The Challenge of Arabic forNLP/MT at the British Computer Society.M.Attia. 2008. Handling Arabic Morphological and SyntacticAmbiguity within the LFG Framework with a View to MachineTranslation. Ph.D. Thesis. University of Manchester, UK.D.Bikel. 2004. Intricacies of Collins’ parsing model Computa-tional Linguistics, 30(4), 2004.J.Bresnan. 2001. Lexical-Functional Syntax Blackwell, Oxford.A.Cahill, M.Burke, R.ODonovan, J.van Genabith, A.Way. 2004.Long-Distance Dependency Resolution in Automatically Ac-quired Wide-Coverage PCFG-Based LFG Approximations. InProceedings of ACL-04, pages 320-327, Barcelona, Spain.A.Cahill, J.van Genabith. 2006. Robust PCFG-Based Generationusing Automatically Acquired LFG Approximations. In Pro-ceedings of ACL-06, pages 1033-1040, Sydney, Australia.A.Cahill, M.Burke, R.O’Donovan, S.Riezler, J.van Genabith,A.Way. 2008. Wide-Coverage Deep Statistical Parsing usingAutomatic Dependency Structure Annotation. ComputationalLinguistics, Vol. 34, No. 1.E.Ditters. 2001. A Formal Grammar for the Description of Sen-tence Structure in Modern Standard Arabic. Workshop on Ara-bic Processing, ACL/EACL’01.N.Habash, G.Ryan, R.Owen, K.Seth, M.Marcus. 2007. Deter-mining Case in Arabic: Learning Complex Linguistic Be-haviour Requires Complex Linguistic Features. In Proceedingsof EMNLP’07, Prague, Czech Republic.N.Habash, F.Reem and R.Ryan. 2009. Syntactic Annotation inthe Columbia Arabic Treebank. In Proceedings of MEDAR’09.M.Johnson 1998. PCFG models of linguistic tree representa-tions. Computational Linguistics, 24(4), pages 613-632.R.Kaplan , J.Bresnan. 1982. Lexical Functional Grammar, a For-mal System for Grammatical Representation. The Mental Rep-resentation of Grammatical Relations.D.Klein, C. Manning. 2003. Accurate Unlexicalized Parsing. InProceedings of ACL03, Sapporo, Japan.S.Kulick S, G.Ryan, M.Mitchell. 2006. Parsing the Arabic Tree-bank: Analysis and improvements. In Proceedings of the Tree-banks and Linguistic Theories Conference, Prague.M.Maamouri, A.Bies. 2004. Developing an Arabic Treebank:Methods, Guidelines, Procedures, and Tools Proceedings ofCOLING 2004, Geneva, Switzerland.E.Othman, K.Shaalan, A.Rafea. 2003. A Chart Parser for Ana-lyzing Modern Standard Arabic Sentence In proceedings of theMT Summit IX Workshop on Machine Translation for SemiticLanguages: Issues and Approaches, Louisiana, USA.S.Petrov and D.Klein. 2007. Improved inference for unlexical-ized parsing In Proceedings of HLT-NAACL’07,pages 404-411, New York, USA.S.Sekine and M.J. Collins. 1997. EVALB bracket scoring pro-gram web page, http://nlp.cs.nyu.edu/evalb/.L.Tounsi, M.Attia, J.van Genabith. 2009. Automatic Treebank-Based Acquisition of Arabic LFG Dependency Structures InProceedings of EACL’2009, Athenes, Greece.Z.Zabokrtsky, O.Smrz. 2003. Arabic syntactic trees: from con-stituency to dependency In Proceedings of EACL’03, pages183-186, Budapest, Hungary.1988F-score basic masked masked+ percol masked+ percol-SBJ -SBJ -OBJ -PRDBerkeleyGrammar 1 C-structure 64.10 64.10 64.48 64.84Grammar 2 C-structure 67.75 67.75 68.43 68.83Grammar 3 C-structure 69.63 69.63 70.23 70.65Grammar 4 C-structure 70.86 70.86 71.19 71.48Grammar 5 C-structure 71.31 71.31 71.78 71.82Grammar 5 F-structure n/a 87.98 88.50 87.60Grammar 6 C-structure 70.99 70.99 71.34 71.56BikelC-structure 72.40 72.70 73.51 73.58F-structure n/a 87.09 88.53 88.95Table 1: C-structure and F-structure evaluation for all sentence lengths.F-score basic masked masked+ percol masked+ percol-SBJ -SBJ -OBJ -PRDBerkeleyGrammar 1 C-structure 67.97 67.97 68.10 69.01Grammar 2 C-structure 71.18 71.18 72.05 72.72Grammar 3 C-structure 73.02 73.02 73.94 74.46Grammar 4 C-structure 74.37 74.37 74.69 75.26Grammar 5 C-structure 75.07 75.07 75.37 75.43Grammar 5 F-structure n/a 87.50 88.68 88.12Grammar 6 C-structure 74.61 74.61 74.68 74.98BikelC-structure 75.61 75.85 76.53 76.53F-structure n/a 87.12 89.20 88.63Table 2: C-structure and F-structure evaluation for sentences ≤ 40 tokens in length.1989

A Chart Parser for Analyzing Modern Standard Arabic Sentence

A Formal Grammar for the Description of Sentence Structure

Accurate Unlexicalized Parsing.

Arabic syntactic trees: from constituency to dependency In

Automatic TreebankBased Acquisition of Arabic LFG Dependency Structures In

DCU 250 Arabic Dependency Bank: An LFG Gold Standard Resource for the Arabic Penn Treebank The Challenge of Arabic for NLP/MT at the British Computer Society.

Determining Case in Arabic: Learning Complex Linguistic Behaviour Requires Complex Linguistic Features.

Developing an Arabic Treebank:

EVALB bracket scoring program web page,

Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation.

Improved inference for unlexicalized parsing

Lexical Functional Grammar, a Formal System for Grammatical Representation. The Mental Representation of Grammatical Relations.

Lexical-Functional Syntax Blackwell,

Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations.

Parsing the Arabic Treebank: Analysis and improvements.

PCFG models of linguistic tree representations.

Robust PCFG-Based Generation using Automatically Acquired LFG Approximations.

Wide-Coverage Deep Statistical Parsing using Automatic Dependency Structure Annotation.

Arabic parsing using grammar transforms

We investigate Arabic Context Free Grammar parsing with dependency annotation comparing lexicalised and unlexicalised parsers. We study how morphosyntactic as well as function tag information percolation in the form of grammar transforms (Johnson, 1998, Kulick et al., 2006) affects the performance of a parser and helps dependency assignment. We focus on the three most frequent functional\ud
tags in the Arabic Penn Treebank: subjects, direct objects and predicates . We merge these functional tags with their phrasal categories and (where appropriate) percolate case information to the non-terminal (POS) category to train the parsers. We then automatically enrich the output of these parsers with full dependency information in order to annotate trees with Lexical Functional Grammar (LFG)\ud
f-structure equations with produce f-structures, i.e. attribute-value matrices approximating to basic predicate-argument-adjunct structure representations. We present a series of experiments evaluating how well lexicalized, history-based, generative (Bikel) as well as latent\ud
variable PCFG (Berkeley) parsers cope with the enriched Arabic data. We measure quality and coverage of both the output trees and the generated LFG f-structures. We show that joint functional and morphological information percolation improves both the recovery of trees as well as dependency results in the form of LFG f-structures

Name not available

http://doras.dcu.ie/15991/1/Arabic_Parsing_Using_Grammar_Transforms.pdf

Arabic parsing using grammar transforms

Abstract

Similar works

Full text

Available Versions

DCU Online Research Access Service

Name not available