14 research outputs found
A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging
In this paper, we propose a new approach to construct a system of
transformation rules for the Part-of-Speech (POS) tagging task. Our approach is
based on an incremental knowledge acquisition method where rules are stored in
an exception structure and new rules are only added to correct the errors of
existing rules; thus allowing systematic control of the interaction between the
rules. Experimental results on 13 languages show that our approach is fast in
terms of training time and tagging speed. Furthermore, our approach obtains
very competitive accuracy in comparison to state-of-the-art POS and
morphological taggers.
Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications, the European Journal on Artificial Intelligence. Version 3: Resubmitted after major revisions. Version 4: Resubmitted after minor revisions. Version 5: to appear in AI Communications (accepted for publication on 3/12/2015).
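The exception structure described in this abstract can be pictured as a small tree of rules: a rule fires only where its condition holds, and its child rules correct its errors on narrower contexts. The following is a minimal illustrative sketch of that idea; the class name, conditions and tags are hypothetical and not taken from the authors' implementation.

```python
# Toy ripple-down-rule node: each rule carries "exception" children
# that fire only on cases where this rule applied but was wrong.
class RDRNode:
    def __init__(self, condition, tag):
        self.condition = condition   # predicate over a token context
        self.tag = tag               # tag assigned when the condition holds
        self.exceptions = []         # child rules correcting this rule's errors

    def classify(self, context, default=None):
        if not self.condition(context):
            return default
        # This rule fires; a matching exception (more specific rule) wins.
        for child in self.exceptions:
            refined = child.classify(context, default=None)
            if refined is not None:
                return refined
        return self.tag

# Default rule: tag everything "N"; exception: words ending in "ly" are "ADV".
root = RDRNode(lambda ctx: True, "N")
root.exceptions.append(RDRNode(lambda ctx: ctx["word"].endswith("ly"), "ADV"))

print(root.classify({"word": "quickly"}))  # ADV
print(root.classify({"word": "dog"}))      # N
```

Adding a new rule never touches existing ones: a tagging error is fixed by appending an exception under the rule that produced it, which is what gives systematic control over rule interaction.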
A Legal Perspective on Training Models for Natural Language Processing
A significant concern in processing natural language data is the often unclear legal status of the input and output data/resources. In this paper, we investigate this problem by discussing a typical activity in Natural Language Processing: the training of a machine learning
model from an annotated corpus. We examine which legal rules apply at relevant steps and how they affect the legal status of the results, especially in terms of copyright and copyright-related rights.
A rule-based translation from written Spanish to Spanish Sign Language glosses
This is the author’s version of a work that was accepted for publication in Computer Speech and Language. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Computer Speech and Language, 28, 3 (2015), DOI: 10.1016/j.csl.2013.10.003.
One of the aims of Assistive Technologies is to help people with disabilities communicate with others and to provide means of access to information. As an aid to Deaf people, we present in this work a production-quality rule-based machine translation system from Spanish to Spanish Sign Language (LSE) glosses, which is a necessary precursor to building a full machine translation system that eventually produces animation output. The system implements a transfer-based architecture operating on the syntactic functions of dependency analyses. A sketch of LSE is also presented. Several topics regarding translation to sign languages are addressed: the lexical gap, the bootstrapping of a bilingual lexicon, the generation of word order for topic-oriented languages, and the treatment of classifier predicates and classifier names. The system has been evaluated on an open-domain testbed, reporting 0.30 BLEU (BiLingual Evaluation Understudy) and 42% TER (Translation Error Rate). These results show consistent improvements over a statistical machine translation baseline, and some improvements over the same system preserving the word order of the source sentence. Finally, a linguistic analysis of errors has identified some differences due to a certain degree of structural variation in LSE.
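The BLEU score reported above is, in essence, a brevity-penalized geometric mean of clipped n-gram precisions. The following is a minimal single-reference sketch of that formula; the add-one smoothing is an assumption for illustration, not something taken from the paper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (single reference)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((clipped + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec / max_n)

print(bleu("the cat sat".split(), "the cat sat".split()))
```

A score of 0.30, as reported here, is respectable for a typologically distant language pair with scarce parallel data.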
Statistical Deep Parsing for Spanish
This document presents the development of a statistical HPSG parser for Spanish. HPSG is a deep linguistic formalism that combines syntactic and semantic information in the same representation, and is capable of elegantly modeling many linguistic phenomena. Our research consists of the following steps: design of the HPSG grammar, construction of the corpus, implementation of the parsing algorithms, and evaluation of the parsers' performance. We created a simple yet powerful HPSG grammar for Spanish that models morphosyntactic information of words, syntactic combinatorial valence, and semantic argument structures in its lexical entries. The grammar uses thirteen very broad rules for attaching specifiers, complements, modifiers, clitics, relative clauses and punctuation symbols, and for modeling coordinations. In a simplification from standard HPSG, the only type of long-range dependency we model is the relative clause that modifies a noun phrase, and we use semantic role labeling as our semantic representation. We transformed the Spanish AnCora corpus using a semi-automatic process and analyzed it using our grammar implementation, creating a Spanish HPSG corpus of 517,237 words in 17,328 sentences (all of AnCora). We implemented several statistical parsing algorithms and trained them over this corpus. The implemented strategies are: a bottom-up baseline using bi-lexical comparisons or a multilayer perceptron; a CKY approach that uses the results of a supertagger; and a top-down approach that encodes word sequences using an LSTM network. We evaluated the performance of the implemented parsers and compared them with each other and against other existing Spanish parsers.
Our LSTM top-down approach seems to be the best-performing parser on our test data, obtaining the highest scores (compared to our strategies and also to external parsers) according to constituency metrics (87.57 unlabeled F1, 82.06 labeled F1), dependency metrics (91.32 UAS, 88.96 LAS), and SRL (87.68 unlabeled, 80.66 labeled), but we must take into consideration that the comparison against the external parsers might be noisy due to the post-processing we needed to do in order to adapt them to our format. We also defined a set of metrics to evaluate the identification of some particular language phenomena, and the LSTM top-down parser outperformed the baselines in almost all of these metrics as well.
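The CKY-with-supertagger strategy mentioned in this abstract can be sketched as a chart whose diagonal is seeded with the supertagger's scored categories, with binary rules combining adjacent spans. The toy categories, scores and rules below are hypothetical, not the thesis grammar.

```python
from collections import defaultdict

def cky(words, supertags, rules):
    """Minimal probabilistic CKY sketch. `supertags` maps each word to
    scored categories (the supertagger output); `rules` maps a pair of
    child categories to possible parent categories."""
    n = len(words)
    chart = defaultdict(dict)  # (i, j) -> {category: best score}
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict(supertags[w])  # seed diagonal
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # try every split point
                for lcat, ls in chart[(i, k)].items():
                    for rcat, rs in chart[(k, j)].items():
                        for parent in rules.get((lcat, rcat), []):
                            score = ls * rs
                            if score > chart[(i, j)].get(parent, 0.0):
                                chart[(i, j)][parent] = score
    return chart[(0, n)]

supertags = {"Mary": [("NP", 1.0)], "sings": [("VP", 0.9)]}
rules = {("NP", "VP"): ["S"]}
print(cky(["Mary", "sings"], supertags, rules))  # {'S': 0.9}
```

Seeding the chart from a supertagger prunes the search drastically, since only categories the tagger proposes ever enter a cell.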
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències
Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to improve parsing performance by integrating knowledge that helps deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase (PP) attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena through the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition.
The starting point of this research is the development of a rule-based grammar for Spanish and another for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out two experiments on the integration of automatically acquired knowledge. To build two robust grammars that analyse and understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. In addition, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (McDonald et al., 2013).
Furthermore, an empirical evaluation method has been designed to carry out both a quantitative and a qualitative analysis. In particular, the dependency trees generated by the grammars are compared to real linguistic data. The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars' performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria of the two resources differ, a harmonization process has been applied by developing a set of rules that automatically adapt the criteria of the corpus to those of the grammars. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena concerning structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively.
Thanks to these two tools, two experiments have been carried out to show that knowledge integration improves parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, which the unsupervised method resolves since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, the training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. In addition, the number of patterns used to learn the language models has to be extended in order to have an impact on the grammars.
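The unsupervised PP-attachment idea sketched in this paragraph, attaching the prepositional phrase to whichever candidate head it is distributionally closest to, can be illustrated as follows. The two-dimensional vectors are hypothetical stand-ins for trained word embeddings, not the thesis's actual model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attach(verb_vec, noun_vec, pp_vec):
    """Toy unsupervised PP-attachment decision: attach the PP to the
    candidate head (verb or noun) whose embedding is closer to it."""
    if cosine(verb_vec, pp_vec) >= cosine(noun_vec, pp_vec):
        return "verb"
    return "noun"

# e.g. "ate pizza with a fork": the instrument PP patterns with the verb.
print(attach([0.9, 0.1], [0.1, 0.9], [0.8, 0.2]))  # verb
```

Unlike a supervised classifier, this scheme always returns a decision for unseen triples, which is exactly the coverage advantage the paragraph describes; enriching the vectors with syntactic and semantic features addresses its weakness on purely lexical data.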
On the other hand, another experiment has been carried out to improve argument recognition in the grammars through the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically by extracting verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains verb predicates and their arguments annotated syntactically. From the extracted information, subcategorization frames have been classified into subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information increases the accuracy of argument recognition.
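The extraction-and-classification step described above, grouping verbs by the argument patterns observed with them, can be sketched as follows. The Spanish verbs and argument labels are invented examples, not SenSem data.

```python
from collections import defaultdict

def subcat_classes(annotated):
    """Group verbs into subcategorization classes: verbs that occur with
    the same set of argument patterns fall into the same class."""
    frames = defaultdict(set)          # verb -> set of observed frames
    for verb, args in annotated:
        frames[verb].add(tuple(args))
    classes = defaultdict(list)        # frame set -> verbs sharing it
    for verb, frame_set in frames.items():
        classes[frozenset(frame_set)].append(verb)
    return dict(classes)

# Toy annotated corpus: (verb, syntactic argument pattern) observations.
data = [("dar", ["subj", "dobj", "iobj"]),
        ("entregar", ["subj", "dobj", "iobj"]),
        ("dormir", ["subj"])]
result = subcat_classes(data)
```

Here "dar" and "entregar" end up in one ditransitive class and "dormir" in an intransitive one; a grammar rule can then license an argument only when the verb's class allows it.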
The results of this thesis show that grammar rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities into the grammars may be decisive for disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars' accuracy, but syntactic and semantic information, as well as new patterns of PP-attachment, need to be included in the language models in order to contribute to disambiguating this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources positively influences the grammars' accuracy.
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anà lisi Sintà ctica Automà tica i Avaluació. Millora de qualitat per a Gramà tiques de Dependències
[eng] Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to contribute to the improvement of parsing performance when it has knowledge integrated in order to deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena by the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition. The starting point of this research proposal is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments about the integration of automatically- acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (PadrĂł et al., 2010) has been used as a framework. On the other hand, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and SolĂ , 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (Mcdonald et al., 2013). Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency parsed trees generated by the grammars are compared to real linguistic data. 
The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria between both resources are differ- ent, a process of harmonization has been applied developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars quali- tatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; SolĂ et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively. Thanks to these two tools, two experiments have been carried out in order to prove that knowl- edge integration improves the parsing accuracy. On the one hand, the automatic learning of lan- guage models has been explored by means of statistical methods in order to disambiguate PP- attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, which is resolved by the unsupervised method since provides a solution for any case. However, the unsupervised method is limited if it Parsing and Evaluation Improving Dependency Grammars Accuracy only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. 
In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars. On the other hand, another experiment is carried out in order to improve the argument recog- nition in the grammars by the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically from the extraction of verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) which contains the verb predicate and its arguments annotated syntactically. As a result of the information extracted, subcategorization frames have been classified into subcategorization classes regarding the patterns observed in the corpus. The results of the subcategorization classes integration in the grammars prove that this information increases the accuracy of the argument recognition in the grammars. The results of the research of this thesis show that grammars’ rules on their own are not ex- pressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, sta- tistical knowledge about PP-attachment can improve the grammars accuracy, but syntactic and semantic information, and new patterns of PP-attachment need to be included in the language models in order to contribute to disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources show a positive influence positively on grammars’ accuracy.[cat] Aquesta tesi vol tractar les limitacions amb què es troben els analitzadors sintĂ ctics automĂ tics actualment. 
Tot i els progressos que s’han fet en l’à rea del Processament del Llenguatge Nat- ural en els darrers anys, les tecnologies del llenguatge i, en particular, els analitzadors sintĂ c- tics automĂ tics no han pogut traspassar el llindar de certes ambiguĂŻtats estructurals com ara l’agrupaciĂł del sintagma preposicional i el reconeixement d’arguments. És per aquest motiu que la recerca duta a terme en aquesta tesi tĂ© com a objectiu aportar millores signiflcatives de quali- tat a l’anĂ lisi sintĂ ctica automĂ tica per mitjĂ de la integraciĂł de coneixement lingĂĽĂstic i estadĂstic per desambiguar construccions sintĂ ctiques ambigĂĽes. El punt de partida de la recerca ha estat el desenvolupament de d’una gramĂ tica en espanyol i una altra en catalĂ basades en regles que segueixen els postulats de la GramĂ tica de Dependèn- dencies (Tesnière, 1959; Mel’čuk, 1988) per tal de dur a terme els experiments sobre l’adquisiciĂł de coneixement automĂ tic. Per tal de crear dues gramĂ tiques robustes que analitzin i entenguin l’oraciĂł en profunditat, ens hem basat en l’arquitectura de FreeLing (PadrĂł et al., 2010), una lli- breria de Processament de Llenguatge Natural que proveeix una anĂ lisi lingĂĽĂstica automĂ tica de l’oraciĂł. Per una altra banda, s’ha elaborat una proposta eclèctica de criteris lingĂĽĂstics per determinar la formaciĂł dels sintagmes i les clĂ usules a la gramĂ tica per mitjĂ de la revisiĂł de les propostes teòriques de la GramĂ tica Generativa (Chomsky, 1981; Bonet and SolĂ , 1986; Haege- man, 1991) i de la GramĂ tica de Dependències (Tesnière, 1959; Mel’čuk, 1988). Aquesta proposta s’acompanya d’un llistat de les etiquetes de relaciĂł de dependència que fan servir les regles de les gramĂ tques. A mĂ©s a mĂ©s de l’elaboraciĂł d’aquest llistat, s’han establert les correspondències amb l’estĂ ndard d’anotaciĂł de les Dependències Universals (Mcdonald et al., 2013). 
Alhora, s’ha dissenyat un sistema d’avaluaciĂł empĂric que tĂ© en compte l’anĂ lisi quantitativa i qualitativa per tal de fer una valoraciĂł completa dels resultats dels experiments. Precisament, es tracta una tasca empĂrica pel fet que es comparen les anĂ lisis generades per les gramĂ tiques amb dades reals de la llengua. Per tal de dur a terme l’avaluaciĂł des d’una perspectiva quan- titativa, s’ha fet servir el corpus Tibidabo en espanyol (Marimon et al., 2014) disponible nomĂ©s en espanyol que Ă©s prou extens per construir una anĂ lisi real de les gramĂ tiques i que ha estat anotat amb el mateix formalisme que les gramĂ tiques. En concret, per tal com els criteris de les gramĂ tiques i del corpus no sĂłn coincidents, s’ha dut a terme un procĂ©s d’harmonitzaciĂł de cri- teris per mitjĂ d’unes regles creades manualment que adapten automĂ ticament l’estructura i la relaciĂł de dependència del corpus al criteri de les gramĂ tiques. Pel que fa a l’avaluaciĂł qualitativa, pel fet que no hi ha recursos disponibles en espanyol i catalĂ , hem dissenyat un reprertori de test de fenòmens sintĂ ctics estructurals i relacionats amb l’ordre de l’oraciĂł. Amb l’objectiu de crear un repertori representatiu de les llengĂĽes estudiades, s’han fet servir gramĂ tiques descriptives per fornir el repertori d’estructures sintĂ ctiques (Bosque and Demonte, 1999; SolĂ et al., 2002) i el Corpus SenSem (Vázquez and Fernández-Montraveta, 2015) per capturar automĂ ticament l’ordre oracional. GrĂ cies a aquestes dues eines, s’han pogut dur a terme dos experiments per provar que la integraciĂł de coneixement en l’anĂ lisi sintĂ ctica automĂ tica en millora la qualitat. D’una banda, Parsing and Evaluation Improving Dependency Grammars Accuracy s’ha explorat l’aprenentatge de models de llenguatge per mitjĂ de models estadĂstics per tal de proposar solucions a l’agrupaciĂł del sintagma preposicional. 
MĂ©s concretament, s’ha desen- volupat un model de llenguatge per mitjĂ d’un classiflcador d’aprenentatge supervisat de Weka (Witten and Frank, 2005). A mĂ©s a mĂ©s, s’ha après un model de llenguatge per mitjĂ d’un mètode no supervisat basat en l’aproximaciĂł distribucional anomenat word embeddings (Mikolov et al., 2013a,b). Els resultats de l’experiment posen de manifest que el mètode supervisat tĂ© greus lim- itacions per fer donar una resposta en dades que no ha vist prèviament, cosa que Ă©s superada pel mètode no supervisat pel fet que Ă©s capaç de classiflcar qualsevol cas. De tota manera, el mètode no supervisat que s’ha estudiat Ă©s limitat si aprèn a partir de dades lèxiques. Per aquesta raĂł, Ă©s necessari que les dades utilitzades per entrenar el model continguin el valor de la preposi- ciĂł, trets sintĂ ctics i semĂ ntics. A mĂ©s a mĂ©s, cal ampliar el nĂşmero de patrons apresos per tal d’ampliar la cobertura dels models i tenir un impacte en els resultats de les gramĂ tiques. D’una altra banda, s’ha proposat una manera de millorar el reconeixement d’arguments a les gramĂ tiques per mitjĂ de l’adquisiciĂł de coneixement lingĂĽĂstic. En aquest experiment, s’ha op- tat per extreure automĂ ticament el coneixement en forma de classes de subcategoritzaciĂł verbal d’el Corpus SenSem (Vázquez and Fernández-Montraveta, 2015), que contĂ© anotats sintĂ ctica- ment el predicat verbal i els seus arguments. A partir de la informaciĂł extreta, s’ha classiflcat les diverses diĂ tesis verbals en classes de subcategoritzaciĂł verbal en funciĂł dels patrons observats en el corpus. Els resultats de la integraciĂł de les classes de subcategoritzaciĂł a les gramĂ tiques mostren que aquesta informaciĂł determina positivament el reconeixement dels arguments. Els resultats de la recerca duta a terme en aquesta tesi doctoral posen de manifest que les regles de les gramĂ tiques no sĂłn prou expressives per elles mateixes per resoldre ambigĂĽitats complexes del llenguatge. 
Nevertheless, integrating knowledge about these ambiguities can be decisive in proposing a solution. On the one hand, statistical knowledge about prepositional-phrase attachment can improve the quality of the grammars, although confirming this requires including syntactic and semantic information in the machine-learning models and capturing more patterns that contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verbal subcategorization acquired from annotated linguistic resources decisively improves the quality of grammars for automatic syntactic parsing
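The unsupervised attachment decision described above can be sketched as choosing the head whose embedding is closest to a crude vector for the prepositional phrase. The words, vectors, and averaging scheme below are toy stand-ins for illustration, not the thesis's actual model:

```python
import math

# Toy embedding table standing in for word2vec-style vectors; real
# vectors would be learned from a large corpus (hypothetical values).
EMB = {
    "comer":   [0.9, 0.1, 0.0],  # verb "to eat"
    "pasta":   [0.1, 0.9, 0.2],  # noun "pasta"
    "con":     [0.2, 0.5, 0.8],  # preposition "with"
    "tenedor": [0.8, 0.1, 0.3],  # noun "fork"
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def attach(verb, noun, prep, prep_obj):
    """Decide whether the PP attaches to the verb or to the noun by
    comparing cosine similarity against an averaged PP vector."""
    pp = [(x + y) / 2 for x, y in zip(EMB[prep], EMB[prep_obj])]
    return "verb" if cosine(EMB[verb], pp) >= cosine(EMB[noun], pp) else "noun"

# "comer pasta con tenedor": with these toy vectors the PP "con
# tenedor" is closer to the verb, so the model attaches it there.
```

Unlike a supervised classifier, this scheme yields a decision for any word that has a vector, which mirrors the coverage advantage of the unsupervised method reported above.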
Towards a rule-based Spanish to Spanish sign language translation: from written forms to phonological representations
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones, in November 2014. This thesis addresses several aspects of the automatic translation from Castilian
Spanish to Spanish Sign Language (LSE), two typologically distant languages without
enough linguistic resources to enable statistical approaches to translation. For this reason,
a rule-based approach grounded in contrastive grammatical studies of both languages is
used.
An architecture following the analysis, transfer and generation model has been chosen.
Transfer is performed at the grammatical function level, which is delivered by a Spanish
dependency parser, without incurring the complexities of a deeper analysis.
The bilingual base lexicon is obtained from the Diccionario normativo de la lengua de
signos española (DILSE-III), which contains the correspondences between Spanish lemmas
and their SEA (Sistema de escritura alfabética) representation of signs. The lexicon is
extended in two different ways: taking advantage of the difference in flexibility between
the part-of-speech systems of Spanish and LSE and exploiting several lexical semantic
relations, such as synonymy, hyponymy and meronymy.
During the structural transfer phase, some nodes of the dependency analysis are transformed,
others are removed and new nodes are inserted. Some classifier predicates are
generated in this phase. Surface order generation of signs is obtained by means of the
topological ordering of the graph of precedence relations between signs. Pairs of signs
having head-dependent relations or sharing the same head are examined to determine
whether their relative ordering is marked or not. The system is evaluated at this point and
the results are compared to those obtained with statistical models. The best results are obtained
with the rule-based approach, with a 0.30 BLEU (Bilingual Evaluation Understudy) and
a 42% TER (Translation Error Rate). A linguistically oriented analysis of errors is provided.
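The surface-order generation described above, a topological ordering of the graph of precedence relations between signs, can be sketched with Kahn's algorithm. The gloss names and precedence pairs below are invented for illustration:

```python
from collections import defaultdict, deque

def sign_order(signs, precedences):
    """Topologically sort signs; each pair (a, b) in 'precedences'
    states that sign a must be produced before sign b."""
    succ = defaultdict(list)
    indeg = {s: 0 for s in signs}
    for a, b in precedences:
        succ[a].append(b)
        indeg[b] += 1
    queue = deque(s for s in signs if indeg[s] == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    if len(order) != len(signs):
        raise ValueError("precedence relations contain a cycle")
    return order

# Invented example: a topic-fronted ordering of three glosses.
# sign_order(["CASA", "IX-3", "IR"],
#            [("CASA", "IR"), ("IX-3", "IR"), ("CASA", "IX-3")])
```

Unmarked pairs simply contribute no precedence edge, so any topological order the sort produces is acceptable for them.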
Finally, in the morphological generation phase, glosses with morphological annotations
are replaced by the HamNoSys (Hamburg Sign Language Notation System) phonological
representations produced by a computational morphology. These representations are used
for animation synthesis with avatars. The computational morphology that has been implemented
uses inflection, introflection and suppletion to model a significant fragment
of the LSE morphology. Among the phenomena considered are
deictics, nominal plural, aspect marking, verbal agreement, adjectival modification and
degree.
Robust part-of-speech tagging of social media text
Part-of-speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which led to a variety of taggers for tackling this task.
Recent work in this field showed that tagging accuracy on informal text domains is poor in comparison to formal text domains.
In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate.
These arising challenges originate in a lack of robustness of taggers towards domain transfers.
This increased error rate has an impact on NLP applications that depend on PoS information.
The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long tail robustness.
Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging.
Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on the text of several text domains.
We find that the tagging of informal text is poorly supported by available taggers.
A review and analysis of currently used methods to adapt taggers to informal text showed that these methods improve tagging accuracy but offer no satisfactory solution.
We propose an alternative tagging approach that achieves increased multi-domain tagging robustness.
This approach is based on tagging in two steps.
The first step tags on a coarse-grained level and the second step refines the coarse tags into fine-grained tags.
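A rough sketch of the two-step idea follows; the tag inventories, lexicon, and word-shape cues are invented placeholders, not the trained models of the thesis:

```python
# Step 1: coarse, domain-robust tags; step 2: refinement to a
# fine-grained tagset. Lexicon and rules are illustrative placeholders.
COARSE_LEXICON = {"the": "DET", "cat": "N", "cats": "N", "sleeps": "V"}
FINE_RULES = {            # (coarse tag, shape cue) -> fine tag
    ("N", "s"):  "NNS",   # plural noun
    ("N", ""):   "NN",
    ("V", "s"):  "VBZ",   # 3rd-person singular verb
    ("DET", ""): "DT",
}

def tag_coarse(tokens):
    # Unknown words default to the open noun class.
    return [COARSE_LEXICON.get(t, "N") for t in tokens]

def refine(tokens, coarse):
    """Refine each coarse tag into a fine-grained tag using a cheap
    word-shape cue; fall back to the coarse tag if no rule applies."""
    fine = []
    for tok, c in zip(tokens, coarse):
        cue = "s" if tok.endswith("s") and c in ("N", "V") else ""
        fine.append(FINE_RULES.get((c, cue), c))
    return fine

# tag_coarse(["the", "cats", "sleeps"]) -> ["DET", "N", "V"]
# refine(...)                           -> ["DT", "NNS", "VBZ"]
```

The design intuition is that the coarse layer transfers across domains more reliably, while the refinement layer carries the domain-specific detail.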
Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or if the construction of a competitive language independent tagger is feasible.
We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms to learn models of 21 languages.
We find that language robustness is a less severe issue and that the impact of the tagger choice depends more on the granularity of the tagset to be learned than on the language.
Regarding (iii), we investigate methods to improve tagging of infrequent phenomena of which no sufficient amount of annotated training data is available, which is a common challenge in the social media domain.
We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data.
In a field study, we show that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena.
Furthermore, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis.
These tools provide the necessary flexibility for conducting all the experiments in this thesis and ensure their reproducibility
Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020
On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges
Construction of linguistic resources for an HPSG grammar for Spanish
This work presents the construction of linguistic resources for working with an HPSG grammar for Spanish. HPSG is a rich grammatical formalism, since the result of syntactic analysis under this formalism is a representation of the sentence that includes both syntactic and semantic information. For English there are statistical HPSG parsers with high performance and language coverage, but for Spanish the existing tools have not yet reached the same level. An HPSG grammar for Spanish is described, indicating its main feature structures and its rules for combining expressions. A corpus of HPSG trees for Spanish was built using the defined grammar: starting from the AnCora corpus, the sentences were transformed by an automatic process, yielding a new corpus annotated according to the HPSG formalism. The transformation heuristics achieve 95.3% precision in head detection and 92.5% precision in argument classification. From the corpus, lexical entries were defined, and the entries of the lexical categories with the greatest combinatorial complexity (verbs, nouns and adjectives) were grouped according to their syntactic-semantic behavior. These groupings of lexical entries are called lexical frames. On this basis, a supertagger was built to identify the most probable lexical frames given the words of a sentence. The supertagger achieves an accuracy of 83.58% for verbs, 85.78% for nouns and 81.40% for adjectives (considering the three most probable tags)
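The supertagger's output step, returning the most probable lexical frames for a word, can be sketched as below; the frame labels and probabilities are invented placeholders, not values from the actual model:

```python
# Hypothetical per-word frame distributions; in the real system these
# would come from a classifier trained on the transformed AnCora corpus.
FRAME_PROBS = {
    "dar": {"v_np_pp": 0.55, "v_np": 0.30, "v_pp": 0.10, "v_intr": 0.05},
}

def top_frames(word, k=3):
    """Return the k most probable lexical frames for a word, mirroring
    the 'three most probable tags' evaluation setting."""
    dist = FRAME_PROBS.get(word, {})
    return [f for f, _ in sorted(dist.items(), key=lambda kv: -kv[1])[:k]]

# top_frames("dar") -> ["v_np_pp", "v_np", "v_pp"]
```

Evaluating against the top three frames rather than only the single best one is what the accuracy figures above measure.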