This paper details the process of creating training and test corpora for a
dependency parser generator (MaltParser). The starting point is the Cast3LB
corpus, which contains constituency analyses of Spanish texts. These
constituency analyses are automatically transformed into dependency analyses.
The paper also describes how a set of syntactic function labels for the
training corpus was obtained empirically and semiautomatically. The process
yielded a dependency parser for Spanish with 91% precision in determining
dependencies.

Partially supported by the Spanish Ministry of Education and Science
(TIN2006-14433-C02-01 project).

Gervás Gómez-Navarro, Pablo

Herrera de la Cruz, Jesús

Moriano Mohedano, Pedro Jesús

Muñoz Moreno, Alfonso

Romero Tejera, Luis

Repositorio Institucional de la Universidad de Alicante

Building Corpora for the Development of a Dependency Parser for Spanish Using Maltparser∗

Jesús Herrera
Departamento de Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
C/ Juan del Rosal, 16, E-28040 Madrid
jesus.herrera@lsi.uned.es

Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz, Luis Romero
Departamento de Ingeniería del Software e Inteligencia Artificial
Universidad Complutense de Madrid
C/ Profesor José García Santesmases, s/n, E-28040 Madrid
pgervas@sip.ucm.es, {pedrojmoriano, alfonsomm, luis.romero.tejera}@gmail.com

Abstract: The present paper details the process followed for creating training and test corpora for a dependency parser generator (MaltParser). The starting point is the Cast3LB corpus, which contains constituency analyses of Spanish texts. These constituency analyses are automatically transformed into dependency analyses. In addition, the empirical and semiautomatic derivation of a set of syntactic function labels for the training corpus is described.
As a result of the process followed, a dependency parser for Spanish was obtained, showing 91% precision when determining dependencies.

Keywords: Dependency parsing, training corpus, syntactic function label, MaltParser, JBeaver

Procesamiento del Lenguaje Natural, no. 39 (2007), pp. 181-186. Received 18-05-2007; accepted 22-06-2007. ISSN: 1135-5948. © 2007 Sociedad Española para el Procesamiento del Lenguaje Natural.

∗ Partially supported by the Spanish Ministry of Education and Science (TIN2006-14433-C02-01 project).

1. Introduction

The development of JBeaver, a dependency parser for Spanish (Herrera et al., 2007), is based on MaltParser (Nivre et al., 2006), a machine learning tool for generating dependency parsers for virtually any language. Such development inherently entails generating corpora for training and subsequent evaluation.

The amount of work needed to develop from scratch a corpus annotated with dependency analyses, and of a suitable size for training MaltParser, exceeded the possibilities of the JBeaver project. It was therefore necessary to find an alternative way of generating such a corpus. A possible approach was to reuse available resources to build a corpus annotated with dependency analyses in a semiautomatic way. For this, the Cast3LB treebank (Navarro et al., 2003) was used. It comprises approximately 72 MB of annotated Spanish texts and contains the constituency analysis of every sentence in it. Leaving aside certain subtleties (Gelbukh and Torres, 2006), constituency analyses and dependency analyses can be converted into one another in a systematic way. After studying the format and labels used in Cast3LB (Navarro et al., 2003; Civit, 2002), a system capable of transforming the constituency analyses contained in Cast3LB into dependency analyses was developed by modifying an algorithm proposed by Gelbukh et al. (Gelbukh and Torres, 2006; Gelbukh et al., 2005). The existence of Cast3LB and the possibility of transforming its analyses into dependency analyses were important reasons for using MaltParser in the JBeaver project.

On the other hand, the decision to make the JBeaver parser generally available to the public led us to consider additional requirements. For instance, we decided to make it as easy as possible for tools already adapted to Minipar (Lin, 1998) to use JBeaver, because Minipar has become a de facto standard in recent years after being used by a large number of applications. Thus, the notation used for JBeaver is, as far as possible, the same as the one used for Minipar.

2. The source corpus

A dependency analysis corpus is needed for training MaltParser. Constructing such a corpus by hand implied a workload well beyond the constraints of the JBeaver project, so it was decided to take advantage of existing resources. Except for some specific cases (such as non-projective constructions), the dependency analysis of a text can be automatically derived from its constituency analysis (Gelbukh and Torres, 2006). Since Cast3LB, which contains constituency analyses of Spanish texts, was available, it became the best option as the source corpus for the project. The training corpus was then obtained from Cast3LB in a semiautomatic way.

Cast3LB contains 100,000 words in approximately 3,700 sentences of Spanish text. 75,000 words come from the ClicTALP corpus, a set of texts from several domains (literary, journalistic, scientific, etc.), and the other 25,000 words come from the EFE news agency's corpus from the year 2000 (Navarro et al., 2003). Figure 1 shows an excerpt from Cast3LB as an example.

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE FILE SYSTEM "3lb.dtd">
    <FILE id="agset" language="es" wn="1.5" ewn="dic2002"
          parsing_state="process" semantic_state="process"
          last_modified="13-01-2006" project="3LB" about="3LB project annotation file">
    <LOG auto_file="a1-0-auto3.log" anno_file="a1-0-anno4.log"
         nosense_file="a1-0-nosense4.log" />
    <SENTENCE id="agset_1">
    <Anchor id="agset_1_ac1" offset="0"/>
    <Anchor id="agset_1_ac2" offset="15"/>
    <Anchor id="agset_1_ac3" offset="21"/>
    <Anchor id="agset_1_ac4" offset="23"/>
    <Anchor id="agset_1_ac5" offset="26"/>
    <Anchor id="agset_1_ac6" offset="34"/>
    <Anchor id="agset_1_ac7" offset="40"/>
    <Anchor id="agset_1_ac8" offset="42"/>
    <Anchor id="agset_1_ac9" offset="52"/>
    <Anchor id="agset_1_ac10" offset="54"/>
    <Annotation id="agset_1_an3" start="agset_1_ac1" end="agset_1_ac2" type="syn">
      <Feature name="roles">SUJ</Feature>
      <Feature name="label">sn</Feature>
      <Feature name="parent">agset_1_an2</Feature>
    </Annotation>
    <Annotation id="agset_1_an4" start="agset_1_ac1" end="agset_1_ac2" type="syn">
      <Feature name="label">grup.nom.ms</Feature>
      <Feature name="parent">agset_1_an3</Feature>
    </Annotation>
    <Annotation id="agset_1_an5" start="agset_1_ac1" end="agset_1_ac2" type="wrd">
      <Feature name="label">Medardo_Fraile</Feature>
      <Feature name="sense">C2S</Feature>
      <Feature name="parent">agset_1_an6</Feature>
    </Annotation>
    <Annotation id="agset_1_an6" start="agset_1_ac1" end="agset_1_ac2" type="pos">
      <Feature name="lema">Medardo_Fraile</Feature>
      <Feature name="label">np00000</Feature>
      <Feature name="parent">agset_1_an4</Feature>
    </Annotation>
    <Annotation id="agset_1_an1" start="agset_1_ac1" end="agset_1_ac10" type="dummy_root">
      <Feature name="label"/>
      <Feature name="parent"/>
    </Annotation>

Figure 1: Excerpt from Cast3LB

3. Building a training corpus

For its training, MaltParser requires a corpus in which the following data are given for every word of the analyzed text: a unique identifier, its part of speech label, the identifier of the head of that word, and a label indicating the syntactic function of the dependency relationship. MaltParser accepts both an XML format and a tab format as input. Figure 2 shows two mutually equivalent examples (the first in XML format and the second in tab format).

    <sentence id="2" user="malt" date="">
      <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
      <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
      <word id="3" form="infors" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
      <word id="4" form="individuell" postag="jj.pos.utr.sin.ind.nom" head="5" deprel="ATT"/>
      <word id="5" form="beskattning" postag="nn.utr.sin.ind.nom" head="3" deprel="SUB"/>
      <word id="6" form="(" postag="pad" head="5" deprel="IP"/>
      <word id="7" form="sarbeskattning" postag="nn.utr.sin.ind.nom" head="5" deprel="APP"/>
      <word id="8" form=")" postag="pad" head="5" deprel="IP"/>
      <word id="9" form="av" postag="pp" head="5" deprel="ATT"/>
      <word id="10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" head="9" deprel="PR"/>
      <word id="11" form="." postag="mad" head="3" deprel="IP"/>
    </sentence>

    Genom            pp                      3  ADV
    skattereformen   nn.utr.sin.def.nom      1  PR
    infors           vb.prs.sfo              0  ROOT
    individuell      jj.pos.utr.sin.ind.nom  5  ATT
    beskattning      nn.utr.sin.ind.nom      3  SUB
    (                pad                     5  IP
    sarbeskattning   nn.utr.sin.ind.nom      5  APP
    )                pad                     5  IP
    av               pp                      5  ATT
    arbetsinkomster  nn.utr.plu.ind.nom      9  PR
    .                mad                     3  IP

Figure 2: Mutually equivalent training files for MaltParser (XML and tab)

By convention, the numeric identifier 0 and the syntactic function label ROOT are used to designate the root of the dependency tree [1].

[1] http://w3.msi.vxu.se/~nivre/research/MaltXML.html

All the information needed to create the training corpus was contained in the Cast3LB corpus, but it had to be extracted and modified to suit the conventions followed by MaltParser. This involved two tasks: obtaining the dependency relationships and obtaining the syntactic function labels.

3.1. Obtaining dependency relationships

In order to extract the dependency relationships between the words contained in the Cast3LB corpus, an automatic process was developed. It was designed from an algorithm proposed by Gelbukh et al. (Gelbukh and Torres, 2006; Gelbukh et al., 2005), modified as needed.

3.2. Obtaining syntactic function labels

The great popularity that Minipar has reached in recent years led to the decision to use, in the JBeaver project, a set of syntactic function labels that follows Minipar's nomenclature as far as possible. In this way, it would be easier to adapt systems currently using Minipar to JBeaver. Since the Cast3LB corpus contains its own syntactic function labels, they must be translated into the ones used by Minipar in order to train MaltParser with the appropriate set of labels. The first step was therefore to obtain the set of syntactic function labels from Minipar. Since an exhaustive list of these labels is not publicly available, it was necessary to approximate it as closely as possible from a large number of analyses made with Minipar. To this end, an empirical study was carried out, based on the idea that, given a large number of analyses made with Minipar, the set of distinct labels found would be very close to the real set of labels. The process employed was the following:

1. A set of English texts obtained from the web was parsed with Minipar.
It consisted of about 1 MB of texts extracted from Project Gutenberg [2], covering the following domains: sport (197.1 KB containing 1,854 phrases), economy (207.1 KB containing 1,173 phrases), education (160.5 KB containing 869 phrases), history (162.2 KB containing 1,210 phrases), justice (98.2 KB containing 453 phrases) and health (265.2 KB containing 2,409 phrases).

2. The output files produced by Minipar were processed in order to extract the set of all distinct syntactic function labels.

3. A set of analyses in which all the labels found were present was selected, and the following algorithm was applied to it:

    for each syntactic function label identified do
        if this function may occur in Spanish then
            Set one or more rules for suitably transforming the
            syntactic function label from Cast3LB into the identified label;
        else
            Discard the identified label;
        end if
    end for

[2] http://www.gutenberg.org/

The rules mentioned above were implemented in the program that transforms constituency analyses into dependency analyses. A special label was used to identify not yet discovered syntactic functions that might be found in the future.

After the set of syntactic rules was established, a significant set of constituency analyses was transformed into dependency analyses. Once the dependency treebank had been obtained, all the analyses containing one or more special labels for not yet discovered syntactic functions were manually analyzed. Every case was then studied in order to determine whether a new syntactic function label should be incorporated into the set or whether the syntactic function in question could be assimilated to one of the known labels.
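
The rule-based relabelling and the fallback to a special label can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the JBeaver implementation: the paper does not list its rules or name the special label, so the rule table, the placeholder name `UNKNOWN`, and all function names here are hypothetical. The only labels taken from the paper are the Cast3LB role `SUJ` (Figure 1) and the Minipar label `subj`; pairing them is itself an assumption.

```python
from collections import Counter

# Hypothetical rule table: Cast3LB role -> Minipar-style label.
# "SUJ" and "subj" both appear in the paper, but the paper does not
# state this pairing; it is assumed here for illustration only.
LABEL_RULES = {"SUJ": "subj"}

# Placeholder for the special label marking functions without a rule.
# The paper does not give the actual name of this label.
UNKNOWN_LABEL = "UNKNOWN"


def map_label(cast3lb_role, rules=LABEL_RULES):
    """Return the target label for a source role, or the special label."""
    return rules.get(cast3lb_role, UNKNOWN_LABEL)


def label_sentence(roles, rules=LABEL_RULES):
    """Relabel every role of one sentence."""
    return [map_label(role, rules) for role in roles]


def needs_manual_review(labels):
    """A sentence is flagged when it contains at least one special label."""
    return UNKNOWN_LABEL in labels


def unmapped_roles(sentences, rules=LABEL_RULES):
    """Count the source roles that fell through to the special label."""
    counts = Counter()
    for roles in sentences:
        for role in roles:
            if role not in rules:
                counts[role] += 1
    return counts
```

In such a setup, the counts returned by `unmapped_roles` would give the manual review a worklist: each unmapped role either receives a new ad hoc label or a rule assimilating it to a known one, and the table grows until no sentence is flagged.
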
Figure 3 shows the complete list of syntactic function labels, i.e., those from Minipar and those that were defined ad hoc.

    Identified Minipar syntactic function labels:
    sc, neg, pcomp-n, pnmod, nn, gen, poss, lex-dep, appo, whn, mod, subj,
    aux, amod, guest, num, vrel, else, punc, det, neg, amount-value

    New ad hoc syntactic function labels:
    ROOT, adj, fecha, descr, c-descr, compdet

Figure 3: Syntactic function labels used in the training corpus

The set of syntactic function labels finally obtained was not necessarily complete, but it was reasonably valid for its purpose. Thus, it was used by the algorithm that transformed constituency analyses into dependency analyses to label the syntactic functions according to Minipar's nomenclature.

3.3. Part of speech tagging

One of JBeaver's features is that it can parse texts without any previous annotation. Since the model learned by MaltParser requires, for the parsing step, that every word be labeled with its part of speech, the tagging subtask is implemented in JBeaver by the part of speech tagger TreeTagger (Schmid, 1994). TreeTagger was chosen because its set of part of speech labels was the one used for MaltParser's training.

3.4. The definitive corpus

Following the process described in this section, 280 XML files (72.9 MB) containing constituency analyses from the Cast3LB corpus, comprising 97,002 words, were transformed into dependency analyses suitable for processing by MaltParser (a tab training file of 1.6 MB), labeled according to the requirements of the JBeaver project.

4. The test corpus and results obtained

To evaluate the trained model, the fraction of dependencies correctly found and labeled was computed. The gold standard was a fraction of the corpus described in section 3. This corpus was divided into three equal parts; two of them were used as the training corpus and the other one was used both as test corpus and as gold standard.
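
The measure named at the start of this section, the fraction of dependencies correctly found and labeled, can be sketched in a few lines of Python. This is a minimal sketch under assumptions: the paper does not publish its scoring code, the function name `attachment_score` is hypothetical, and each token is assumed to be represented as a `(head, deprel)` pair aligned one-to-one between gold standard and parser output. The example labels are borrowed from the tab format shown in Figure 2.

```python
def attachment_score(gold, predicted):
    """Fraction of tokens whose (head, deprel) pair matches the gold standard.

    Both arguments are lists of (head, deprel) tuples, one per token,
    in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty corpus")
    # A token counts as correct only if head and label both agree.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# First five tokens of the Figure 2 sentence as gold annotations.
gold = [(3, "ADV"), (1, "PR"), (0, "ROOT"), (5, "ATT"), (3, "SUB")]

# Hypothetical parser output with one wrong head (fourth token).
predicted = [(3, "ADV"), (1, "PR"), (0, "ROOT"), (3, "ATT"), (3, "SUB")]

score = attachment_score(gold, predicted)
```

With four of the five pairs matching, `score` is 0.8. Comparing whole `(head, deprel)` tuples means a correct head with a wrong label still counts as an error, which matches the "found and labeled" wording.
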
Forusing it as test corpus, the annotations con-cerning dependency relationships and syntac-tic function were eliminated, i.e., it was con-formed only by the words and their part ofspeech tags, which is the format required byMaltParser for using it as parser. Thus, theoutput given by the trained model was com-pared with the gold standard, and 91% ofthe dependencies found by the trained modelwere according to the gold standard (Herreraet al., 2007). This result is comparable to theone obtained by Nivre et al. when trainingMaltParser for Spanish (Nivre et al., 2006).5. Conclusions and future workThe process of building corpora for train-ing and testing a specific tool for generat-ing dependency parser (Maltparser) has beenshown. This process has proper features be-cause of the requirements of the project inwhich it has been developed (JBeaver). It wasmandatory to use existing resources, and aconstituency analyses corpus has been sat-isfactorily transformed into a equivalent de-pendency analyses corpus. For this purpose,an algorithm previously proposed by Gel-bukh et al. was modified and applied. In ad-dition and in order to fulfill the necessities ofthe project, the set of syntactic function la-bels of Minipar was empirically determined.The future work includes the search formore syntactic function labels, from Miniparand new ones not considered yet. Also, someresearch could be done in order to improvethe algorithm that transforms constituencyBuilding Corpora for the Development of a Dependency Parser for Spanish Using Maltparser185analyses into dependency analyses. Bymeansof these future improvements, it should bepossible to learn better models for dependen-cy parsing in Spanish.In addition, similar development efforts tothe one described here could be carried outfor other languages.Bibliograf´ıaM. Civit. 2002. Etiquetacio´n de los Cuan-tificadores: Varias Propuestas. TALP Re-search Center–Universidad Polite´cnica deCatalun˜a. Technical Report.A. Gelbukh and S. 
Torres. 2006. Tratamien-to de Ciertos Pronombres y Conjuncionesen la Transformacio´n de un Corpus deConstituyentes a un Corpus de Dependen-cias. Avances en la Ciencia de la Com-putacio´n. VII Encuentro Internacional deComputacio´n ENC’06.A. Gelbukh, S. Torres and H. Calvo. 2005.Transforming a Constituency Treebank in-to a Dependency Treebank. Procesamientodel Lenguaje Natural, No 35, September2005. Sociedad Espan˜ola para el Proce-samiento de Lenguaje Natural (SEPLN).J. Herrera, P. Gerva´s, P.J. Moriano, A.Mun˜oz, L. Romero. 2007. JBeaver: UnAnalizador de Dependencias para el Es-pan˜ol Basado en Aprendizaje. Under eval-uation process for CAEPIA 2007.D. Lin. 1998. Dependency–based Evaluationof MINIPAR. Proceedings of the Work-shop on the Evaluation of Parsing Sys-tems, Granada, Spain.B. Navarro, M. Civit, M.A. Mart´ı, R. Marcos,B. Ferna´ndez. 2003. Syntactic, Semanticand Pragmatic Annotation in Cast3LB.Proceedings of the Shallow Processing onLarge Corpora (SproLaC), a Workshop onCorpus Linguistics, Lancaster, UK.J. Nivre, J. Hall, J. Nilsson, G. Eryig˘itand S. Marinov. 2006. Labeled Pseudo–Projective Dependency Parsing with Sup-port Vector Machines. Proceedings of theCoNLL-X Shared Task on MultilingualDependency Parsing, New York, USA.H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decission Trees.Proceedings of the International Confer-ence on New Methods in Language Pro-cessing, pages 44–49, Manchester, UK.Jesús Herrera de la Cruz, Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz y Luis Romero186
