14 research outputs found
Spanish Resource Grammar version 2023
We present the latest version of the Spanish Resource Grammar (SRG). The new
SRG uses the recent version of Freeling morphological analyzer and tagger and
is accompanied by a manually verified treebank and a list of documented issues.
We also present the grammar's coverage and overgeneration on a small portion of
a learner corpus, an entirely new research line with respect to the SRG. The
grammar can be used for linguistic research, such as for empirically driven
development of syntactic theory, and in natural language processing
applications such as computer-assisted language learning. Finally, as the
treebanks grow, they can be used for training high-quality semantic parsers and
other systems which may benefit from precise and detailed semantics.
Comment: 10 pages, 4 figures
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències
Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis aims to improve parsing performance by integrating knowledge that helps deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. Disambiguating these two highly ambiguous phenomena by integrating knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition.
The starting point of this research is the development of rule-based grammars for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out two experiments on the integration of automatically acquired knowledge. To build two robust grammars that understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. In addition, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (McDonald et al., 2013).
Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency trees generated by the grammars are compared to real linguistic data. The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to support a real analysis of the grammars’ performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria of the two resources differ, a harmonization process has been applied: a set of rules automatically adapts the criteria of the corpus to those of the grammars. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively.
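As a rough illustration of the quantitative side of this evaluation, the sketch below relabels a gold analysis with a small mapping table and then scores a predicted analysis against it. The label names, the `HARMONIZE` table and the use of unlabeled and labeled attachment scores (UAS, LAS) are assumptions made for the example; the abstract does not state which rules or metrics the thesis uses, and the real harmonization also adapts structure, which this sketch does not attempt.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    head: int   # index of the governing token (0 = artificial root)
    label: str  # dependency relation


# Hypothetical harmonization rules: treebank label -> grammar label.
HARMONIZE = {"SUBJ": "subj", "DO": "dobj", "SPEC": "det"}


def harmonize(gold: list[Arc]) -> list[Arc]:
    """Rewrite treebank labels into the grammar's label set (labels only)."""
    return [Arc(a.head, HARMONIZE.get(a.label, a.label.lower())) for a in gold]


def attachment_scores(gold: list[Arc], pred: list[Arc]) -> tuple[float, float]:
    """Return (UAS, LAS): share of tokens with correct head / head and label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and predicted analyses must align token by token")
    uas = sum(g.head == p.head for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las


# "El gato come pescado": one arc per token, in sentence order.
gold = harmonize([Arc(2, "SPEC"), Arc(3, "SUBJ"), Arc(0, "ROOT"), Arc(3, "DO")])
pred = [Arc(2, "det"), Arc(3, "subj"), Arc(0, "root"), Arc(3, "subj")]
uas, las = attachment_scores(gold, pred)
```

Here every head is correct but the last label is wrong, so the two scores diverge, which is the kind of argument-recognition error the thesis targets.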
Thanks to these two tools, two experiments have been carried out in order to prove that knowledge integration improves parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, a problem the unsupervised method resolves since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars.
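One possible reading of the unsupervised experiment is sketched below: the prepositional phrase is attached to whichever candidate head (verb or noun) has the embedding closest to the noun inside the phrase. The three-dimensional vectors are invented for the example, and the decision rule is an assumption rather than the thesis's actual model. The sketch deliberately ignores the preposition, which is the lexical-only limitation the abstract points out.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "comer": [0.9, 0.1, 0.2],
    "pizza": [0.2, 0.9, 0.1],
    "tenedor": [0.8, 0.2, 0.3],
    "queso": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def attach(verb, noun, pp_noun, vectors=VECTORS):
    """Return 'verb' or 'noun', whichever candidate head is closer to pp_noun."""
    v_sim = cosine(vectors[verb], vectors[pp_noun])
    n_sim = cosine(vectors[noun], vectors[pp_noun])
    return "verb" if v_sim >= n_sim else "noun"


# "comer pizza con tenedor" vs. "comer pizza con queso"
with_fork = attach("comer", "pizza", "tenedor")
with_cheese = attach("comer", "pizza", "queso")
```

Because any pair of known words yields a similarity, this rule always returns an answer, which matches the abstract's remark that the unsupervised method provides a solution for any case.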
On the other hand, another experiment is carried out in order to improve argument recognition in the grammars through the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically by extracting verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verb predicate and its arguments annotated syntactically. From the information extracted, subcategorization frames have been grouped into subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information increases the accuracy of argument recognition.
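The extraction step can be pictured as follows. This is a minimal sketch under assumed data: the clause list, the function labels and the grouping criterion (verbs that share exactly the same set of observed frames fall into one class) are hypothetical and are not taken from the SenSem annotation scheme or from the thesis.

```python
from collections import defaultdict

# Hypothetical annotated clauses: (verb lemma, syntactic functions of its arguments).
CLAUSES = [
    ("dar", ("subj", "dobj", "iobj")),
    ("dar", ("subj", "dobj")),
    ("entregar", ("subj", "dobj", "iobj")),
    ("entregar", ("subj", "dobj")),
    ("dormir", ("subj",)),
]


def extract_frames(clauses):
    """Collect the set of frames observed for each verb."""
    frames = defaultdict(set)
    for verb, args in clauses:
        frames[verb].add(tuple(sorted(args)))
    return frames


def classify(frames):
    """Group verbs that share exactly the same set of observed frames."""
    classes = defaultdict(list)
    for verb, observed in frames.items():
        classes[frozenset(observed)].append(verb)
    return sorted(sorted(verbs) for verbs in classes.values())


classes = classify(extract_frames(CLAUSES))
```

A grammar rule could then consult the class of a verb before deciding whether a noun phrase is one of its arguments; how the thesis wires the classes into its rules is not described in the abstract.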
The results of this thesis show that grammar rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars’ accuracy, but syntactic and semantic information and new patterns of PP-attachment need to be included in the language models in order to help disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a positive influence on the grammars’ accuracy.
This thesis addresses the limitations that automatic syntactic parsers currently face. Despite the progress made in Natural Language Processing in recent years, language technologies, and automatic parsers in particular, have not been able to get past certain structural ambiguities such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic parsing by integrating linguistic and statistical knowledge to disambiguate ambiguous syntactic constructions.
The starting point of the research was the development of two rule-based grammars, one for Spanish and one for Catalan, that follow the postulates of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To create two robust grammars that analyse and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria was drawn up to determine how phrases and clauses are formed in the grammar, by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammar rules. Besides this list, correspondences with the Universal Dependencies annotation standard (McDonald et al., 2013) were established.
At the same time, an empirical evaluation system was designed that takes both quantitative and qualitative analysis into account in order to assess the results of the experiments fully. The task is empirical in that the analyses generated by the grammars are compared with real language data. For the quantitative evaluation, the Tibidabo corpus (Marimon et al., 2014) was used; it is available only in Spanish, is large enough to support a real analysis of the grammars, and has been annotated with the same formalism as the grammars. Since the criteria of the grammars and of the corpus do not coincide, a harmonization process was carried out by means of manually created rules that automatically adapt the structure and dependency relations of the corpus to the criteria of the grammars. As for the qualitative evaluation, since no resources are available for Spanish and Catalan, we designed a test suite of syntactic phenomena concerning structure and sentence order. To make the suite representative of the languages studied, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) were used to supply the repertoire of syntactic structures, and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) to capture sentence order automatically.
Thanks to these two tools, two experiments were carried out to prove that integrating knowledge into automatic parsing improves its quality. On the one hand, the learning of language models by means of statistical models was explored in order to propose solutions to prepositional phrase attachment. More specifically, a language model was developed with a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is severely limited in giving an answer for data it has not seen before, which the unsupervised method overcomes because it can classify any case. Nevertheless, the unsupervised method studied is limited if it learns from lexical data alone. For this reason, the data used to train the model need to contain the value of the preposition as well as syntactic and semantic features. Moreover, the number of patterns learned must be increased in order to extend the coverage of the models and have an impact on the results of the grammars.
On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge was proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), in which the verbal predicate and its arguments are syntactically annotated. From the information extracted, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition.
The results of the research carried out in this doctoral thesis show that grammar rules are not expressive enough on their own to resolve complex language ambiguities. However, integrating knowledge about these ambiguities can be decisive when proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this, syntactic and semantic information must be included in the machine learning models and more patterns must be captured to contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a decisive influence on the quality of the grammars for automatic parsing.
The Spanish DELPH-IN grammar
In this article we present a Spanish grammar implemented in the Linguistic Knowledge Builder system and grounded in the theoretical framework of Head-driven Phrase Structure Grammar. The grammar is being developed in an international multilingual context, the DELPH-IN Initiative, contributing to an open-source repository of software and linguistic resources for various Natural Language Processing applications. We will show how we have refined and extended a core grammar, derived from the LinGO Grammar Matrix, to achieve a broad-coverage grammar. The Spanish DELPH-IN grammar is the most comprehensive grammar for Spanish deep processing, and it is being deployed in the construction of a treebank for Spanish of 60,000 sentences based on a technical corpus in the framework of the European project METANET4U (Enhancing the European Linguistic Infrastructure, GA 270893GA; http://www.meta-net.eu/projects/METANET4U/) and a smaller treebank of about 15,000 sentences based on a corpus from the pres
A rule-based translation from written Spanish to Spanish Sign Language glosses
This is the author’s version of a work that was accepted for publication in Computer Speech and Language. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Computer Speech and Language, 28, 3 (2015) DOI: 10.1016/j.csl.2013.10.003
One of the aims of Assistive Technologies is to help people with disabilities to communicate with others and to provide means of access to information. As an aid to Deaf people, we present in this work a production-quality rule-based machine system for translating from Spanish to Spanish Sign Language (LSE) glosses, which is a necessary precursor to building a full machine translation system that eventually produces animation output. The system implements a transfer-based architecture from the syntactic functions of dependency analyses. A sketch of LSE is also presented. Several topics regarding translation to sign languages are addressed: the lexical gap, the bootstrapping of a bilingual lexicon, the generation of word order for topic-oriented languages, and the treatment of classifier predicates and classifier names. The system has been evaluated with an open-domain testbed, reporting a 0.30 BLEU (BiLingual Evaluation Understudy) and 42% TER (Translation Error Rate). These results show consistent improvements over a statistical machine translation baseline, and some improvements over the same system preserving the word order in the source sentence. Finally, the linguistic analysis of errors has identified some differences due to a certain degree of structural variation in LSE.
Statistical Deep Parsing for Spanish
This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in the same representation, and is capable of elegantly modeling many linguistic phenomena. Our research consists of the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance.
We created a simple yet powerful HPSG grammar for Spanish that models morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structures in its lexical entries. The grammar uses thirteen very broad rules for attaching specifiers, complements, modifiers, clitics, relative clauses and punctuation symbols, and for modeling coordinations. In a simplification from standard HPSG, the only type of long-range dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as our semantic representation.
We transformed the Spanish AnCora corpus using a semi-automatic process and analyzed it using our grammar implementation, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora).
We implemented several statistical parsing algorithms and trained them over this corpus. The implemented strategies are: a bottom-up baseline using bi-lexical comparisons or a multilayer perceptron; a CKY approach that uses the results of a supertagger; and a top-down approach that encodes word sequences using an LSTM network.
We evaluated the performance of the implemented parsers and compared them with each other and against other existing Spanish parsers.
Our LSTM top-down approach seems to be the best performing parser over our test data, obtaining the highest scores (compared to our strategies and also to external parsers) according to constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled), but we must take into consideration that the comparison against the external parsers might be noisy due to the post-processing we needed to do in order to adapt them to our format. We also defined a set of metrics to evaluate the identification of some particular language phenomena, and the LSTM top-down parser outperformed the baselines in almost all of these metrics as well.
This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in its representations and can elegantly model a good number of linguistic phenomena. Our research comprises the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance. We designed an HPSG grammar for Spanish that is simple yet powerful, modeling in its lexical entries the morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structure. The grammar uses thirteen generic rules for attaching specifiers, complements, clitics, relative clauses and punctuation symbols, and for modeling coordinations. As a simplification of standard HPSG theory, the only type of long-range dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as the semantic representation.
We transformed the Spanish AnCora corpus with a semi-automatic process and analyzed it with our implementation of the grammar, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora). We implemented several statistical parsing algorithms trained on this corpus. In particular, we aimed to test approaches based on neural networks. The implemented strategies are: a bottom-up baseline that uses bi-lexical comparisons or a multilayer perceptron; a CKY-style approach that uses the results of a supertagger; and a top-down approach that encodes word sequences with LSTM networks. We evaluated the performance of the implemented parsers and compared them with each other and with a set of existing parsers for Spanish. Our LSTM top-down approach appears to perform best on our test set, obtaining the best scores (compared with our strategies and with external parsers) on constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled), although the comparison with external parsers may be noisy because of the post-processing carried out to adapt them to our format. We also defined a set of metrics to evaluate the identification of some particular language phenomena, and the LSTM top-down parser obtained better results than the baselines on almost all of these metrics.
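For readers unfamiliar with the constituency figures quoted above, the following sketch shows one common way to compute unlabeled and labeled bracket F1 from sets of spans. It is an assumption that the reported scores follow this general scheme; the abstract does not describe the exact scorer, and the example trees and the `(label, start, end)` span encoding are invented here. Using sets means repeated identical spans (for example unary chains) are counted once, which is a simplification.

```python
def f1(gold: set, pred: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def bracket_f1(gold, pred):
    """Return (unlabeled F1, labeled F1) for spans given as (label, start, end)."""
    unlabeled = f1({(s, e) for _, s, e in gold}, {(s, e) for _, s, e in pred})
    labeled = f1(set(gold), set(pred))
    return unlabeled, labeled


# Hypothetical four-token sentence; the prediction mislabels the last span.
gold = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 3, 4)]
pred = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("PP", 3, 4)]
unlabeled, labeled = bracket_f1(gold, pred)
```

The gap between the two numbers in this toy case mirrors the gap between the unlabeled and labeled scores in the abstract: the structure can be right while a category label is wrong.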
Mètodes empírics en lingüística cognitiva (Empirical methods in cognitive linguistics)
Peer-reviewed
One of the fundamental consequences of the boom in language technologies is that they have revealed the complexity of linguistic phenomena and the need to develop thorough, interdisciplinary knowledge of all aspects of language. To move forward, joint empirical work by linguists, psychologists, neurologists and computer scientists is needed. In particular, one of the main problems that language technologies face is that they have to resolve the intrinsic ambiguity of languages, as is the case with the ambiguity of polysemous verbs. We therefore need formal criteria to identify and distinguish verb senses. In this article, we present lexical aspect, or Aktionsart, as one of the key criteria for establishing verb senses.
With the final aim of proposing a formal model for representing aspectual categories that can be implemented computationally and is, at the same time, cognitively motivated, a psycholinguistic experiment was conducted to evaluate empirically the cognitive status of the aspectual macrodistinction between states and events and, relatedly, to validate the relationship between aspect and verbal polysemy. In this article, we present in detail the methodology used to build the experimental sample, the design and application of the experiment, and a preview of the first results, which point towards verification of the initial hypotheses.
Towards a rule-based Spanish to Spanish sign language translation: from written forms to phonological representations
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defense: November 2014.
This thesis addresses several aspects of automatic translation from Castilian
Spanish to Spanish Sign Language (LSE), two typologically distant languages that lack
the linguistic resources needed for statistical approaches to translation. For this reason,
a rule-based approach grounded in contrastive grammatical studies of both languages is
used.
An architecture following the analysis, transfer and generation model has been chosen.
Transfer is performed at the level of grammatical functions, which are delivered by a Spanish
dependency parser without incurring the complexities of a deeper analysis.
The bilingual base lexicon is obtained from the Diccionario normativo de la lengua de
signos española (DILSE-III), which contains the correspondences between Spanish lemmas
and their SEA (Sistema de escritura alfabética) representation of signs. The lexicon is
extended in two different ways: taking advantage of the difference in flexibility between
the part-of-speech systems of Spanish and LSE and exploiting several lexical semantic
relations, such as synonymy, hyponymy and meronymy.
During the structural transfer phase, some nodes of the dependency analysis are transformed,
others are removed and new nodes are inserted. Some classifier predicates are
generated in this phase. Surface order generation of signs is obtained by means of the
topological ordering of the graph of precedence relations between signs. Pairs of signs
having head-dependent relations or sharing the same head are examined in order to determine
whether their relative ordering is marked or not. The system is evaluated at this point and
results are compared to those obtained with statistical models. The best results are obtained
with the rule-based approach, with a 0.30 BLEU (Bilingual Evaluation Understudy) and
a 42% TER (Translation Error Rate). A linguistically oriented analysis of errors is provided.
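The surface-ordering step described above, a topological ordering of the graph of precedence relations between signs, can be illustrated with a short sketch. It uses Kahn's algorithm, which is one standard way to compute a topological order; the abstract does not say which algorithm the thesis uses. The function name, the glosses and the precedence pairs are hypothetical, and sign names are assumed to be unique within a sentence.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def order_signs(signs: Iterable[str], precedes: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one linear order of signs compatible with all (before, after) pairs."""
    nodes = list(signs)
    rank = {s: i for i, s in enumerate(nodes)}  # ties are broken by input position
    successors: Dict[str, List[str]] = defaultdict(list)
    indegree = {s: 0 for s in nodes}
    for before, after in precedes:
        successors[before].append(after)
        indegree[after] += 1
    # Signs with no unmet precedence constraint are ready to be emitted.
    ready = [(rank[s], s) for s in nodes if indegree[s] == 0]
    heapq.heapify(ready)
    ordered: List[str] = []
    while ready:
        _, sign = heapq.heappop(ready)
        ordered.append(sign)
        for nxt in successors[sign]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (rank[nxt], nxt))
    if len(ordered) != len(nodes):
        # Some signs were never released, so the constraints are contradictory.
        raise ValueError("precedence relations contain a cycle")
    return ordered


# Hypothetical glosses and constraints, invented for illustration only.
glosses = ["COMER", "NIÑO", "MANZANA"]
constraints = [("NIÑO", "MANZANA"), ("MANZANA", "COMER")]
order = order_signs(glosses, constraints)  # ['NIÑO', 'MANZANA', 'COMER']
```

A cycle in the precedence relations means no consistent order exists, which is why the sketch raises an error rather than returning a partial order.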
Finally, in the morphological generation phase, glosses with morphological annotations
are replaced by the HamNoSys (Hamburg Sign Language Notation System) phonological
representations produced by a computational morphology. These representations are used
for animation synthesis with avatars. The computational morphology that has been implemented
uses inflection, introflection and suppletion to model a significant fragment
of the LSE morphology. The phenomena implemented include
deixis, nominal plural, aspect marking, verbal agreement, adjectival modification and
degree.
Head-Driven Phrase Structure Grammar
Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism)
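As a rough illustration of what describing linguistic objects with feature-value pairs involves, the sketch below unifies two untyped feature structures represented as nested dictionaries. It is a deliberate simplification and not HPSG as presented in the volume: real HPSG uses typed feature structures, and this sketch does not model structure sharing or relational constraints. The function name, feature names and values are assumptions chosen for the example.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature structures given as nested dicts; return None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for feature, value in b.items():
            if feature in result:
                merged = unify(result[feature], value)
                if merged is None:
                    return None  # incompatible values somewhere below this feature
                result[feature] = merged
            else:
                result[feature] = value  # information present only in b is added
        return result
    # Atomic values unify only if they are identical.
    return a if a == b else None


# Hypothetical feature structures: a requirement and a candidate that satisfies it.
requirement = {"HEAD": "noun", "AGR": {"PER": 3}}
candidate = {"HEAD": "noun", "AGR": {"PER": 3, "NUM": "sg"}}
combined = unify(requirement, candidate)
clash = unify({"AGR": {"NUM": "pl"}}, candidate)  # NUM values differ, so this is None
```

Unification succeeds when the two descriptions are compatible and yields a structure carrying the information of both, which is the sense in which such a grammar is constraint-based.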