26 research outputs found
Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences
Selectional preferences have been used by word sense disambiguation (WSD) systems as one source of disambiguating information. We evaluate WSD using selectional preferences acquired for English adjective-noun, subject, and direct object grammatical relationships with respect to a standard test corpus. The selectional preferences are specific to verb or adjective classes, rather than individual word forms, so they can be used to disambiguate the co-occurring adjectives and verbs, rather than just the nominal argument heads. We also investigate use of the one-sense-per-discourse heuristic to propagate a sense tag for a word to other occurrences of the same word within the current document in order to increase coverage. Although the preferences perform well in comparison with other unsupervised WSD systems on the same corpus, the results show that for many applications, further knowledge sources would be required to achieve an adequate level of accuracy and coverage. In addition to quantifying performance, we analyze the results to investigate the situations in which the selectional preferences achieve the best precision and in which the one-sense-per-discourse heuristic increases performance.
Syntaxe computationnelle du hongrois : de l'analyse en chunks à la sous-catégorisation verbale
We present the creation of two resources for Hungarian NLP applications: a rule-based shallow parser and a database of verbal subcategorization frames. Hungarian, as a non-configurational language with a rich morphology, presents specific challenges for NLP at the level of morphological and syntactic processing. While efficient and precise morphological analyzers are already available, Hungarian is under-resourced with respect to syntactic analysis. Our work aimed at overcoming this problem by providing resources for syntactic processing. The Hungarian language is characterized by a rich morphology and a non-configurational encoding of grammatical functions. These features imply that the syntactic processing of Hungarian has to rely on morphological features rather than on constituent order. The broader interest of our undertaking is to propose representations and methods that are adapted to these specific characteristics, and at the same time are in line with state-of-the-art research methodologies. More concretely, we attempt to adapt current results in argument realization and lexical semantics to the task of labeling sentence constituents according to their syntactic function and semantic role in Hungarian. Syntax and semantics are not completely independent modules in linguistic analysis and language processing: it has been known for decades that semantic properties of words affect their syntactic distribution. Within the syntax-semantics interface, the field of argument realization deals with the (partial or complete) prediction of verbal subcategorization from semantic properties. Research on verbal lexical semantics and semantically motivated mapping has been concentrating on predicting the syntactic realization of arguments, taking for granted (either explicitly or implicitly) that the distinction between arguments and adjuncts is known, and that adjuncts' syntactic realization is governed by productive syntactic rules, not lexical properties.
However, besides the correlation between verbal aspect or Aktionsart and time adverbs (e.g. Vendler, 1967 or Kiefer, 1992 for Hungarian), the distribution of adjuncts among verbs or verb classes did not receive significant attention, especially within the lexical semantics framework. We claim that contrary to the widely shared presumption, adjuncts are often not fully productive. We therefore propose a gradual notion of productivity, defined in relation to Levin-type lexical semantic verb classes (Levin, 1993; Levin and Rappaport-Hovav, 2005). The definition we propose for the argument-adjunct dichotomy is based on evidence from Hungarian and exploits the idea that lexical semantics not only influences complement structure but is the key to the argument-adjunct distinction and the realization of adjuncts.
Computational linguistics is a research field concerned with the methods and perspectives of formal (statistical or symbolic) modeling of natural language. Like theoretical linguistics, computational linguistics is a highly modular discipline: the levels of linguistic analysis include segmentation, morphological analysis, disambiguation, and syntactic and semantic analysis. While a number of tools already exist for low-level processing (morphological analysis, part-of-speech tagging), Hungarian can be considered an under-resourced language for syntactic and semantic analysis. The work described in this thesis aims to fill this gap by creating resources for the syntactic processing of Hungarian: namely, a chunk parser and a lexical database of verbal subcategorization frames. The first part of the research presented here focuses on the creation of a shallow parser (or chunker) for Hungarian. The output of the shallow parser is designed to serve as input for further processing aimed at annotating the dependency relations between the predicate and its arguments and adjuncts. The deep parser is implemented in NooJ (Silberztein, 2004) as a cascade of grammars. The second research objective was to propose a lexical representation for argument structure in Hungarian. This representation must be able to handle the wide range of phenomena that escape the traditional dichotomy between argument and adjunct (e.g. partially productive structures, mismatches between syntactic and semantic predictability). We drew on results from recent research on argument realization and chose a framework that meets our criteria and can be adapted to a non-configurational language. We used Levin's (1993) semantic classification as a model. We adapted the notions underlying this classification, namely the semantic component and the syntactic alternation, as well as the methodology of exploring and describing the behavior of predicates by means of this representation, to the task of building a lexical representation of verbs in a non-configurational language. The first step was to define the coding rules and to build a large lexical database of verbs and their complements. We then undertook two experiments to enrich this lexicon with lexical semantic information in order to formalize relevant syntactic and semantic generalizations about the underlying predicate classes. The first approach we tested consisted of manually developing a classification of verbs according to their complement structure and assigning semantic roles to these complements. We sought answers to the following questions: which semantic components are relevant for defining a semantic classification of Hungarian predicates? What are the syntactic implications specific to these classes? And, more generally, what is the nature of class-specific alternations in Hungarian? In the final phase of the research, we studied the potential of automatic acquisition for extracting verb classes from corpora. We performed an unsupervised classification, based on distributional data, to obtain a relevant semantic classification of Hungarian verbs. We also tested the unsupervised classification method on French data.
Inquiries into the lexicon-syntax relations in Basque
Index:- Foreword. B. Oyharçabal.- Morphosyntactic disambiguation and shallow parsing in computational processing in Basque. I. Aduriz, A. Díaz de Ilarraza.- The transitivity of borrowed verbs in Basque: an outline. X. Alberdi.- Patrixa: a unification-based parser for Basque and its application to the automatic analysis of verbs. I. Aldezabal, M. J. Aranzabe, A. Atutxa, K. Gojenola, K. Sarasola.- Learning argument/adjunct distinction for Basque. I. Aldezabal, M. J. Aranzabe, K. Gojenola, K. Sarasola, A. Atutxa.- Analyzing verbal subcategorization aimed at its computational application. I. Aldezabal, P. Goenaga.- Automatic extraction of verb patterns from “hauta-lanerako euskal hiztegia”. J. M. Arriola, X. Artola, A. Soroa.- The case of an enlightening, provoking and admirable Basque derivational suffix with implications for the theory of argument structure. X. Artiagoitia.- Verb-deriving processes in Basque. J. C. Odriozola.- Lexical causatives and causative alternation in Basque. B. Oyharçabal.- Causation and semantic control; diagnosis of incorrect use in minorized languages. I. Zabala.- Subject index.- Contributions
A distributional investigation of German verbs
This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of an event referenced by the verb). I then produce statistical descriptions of verbs according to these two distinct facets of meaning: in particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències
Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to contribute to the improvement of parsing performance when it has knowledge integrated in order to deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena by the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition.
The starting point of this research proposal is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments about the integration of automatically acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. On the other hand, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (McDonald et al., 2013).
Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency parsed trees generated by the grammars are compared to real linguistic data. The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars' performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria of the two resources are different, a process of harmonization has been applied, developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively.
Thanks to these two tools, two experiments have been carried out in order to prove that knowledge integration improves the parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, which is resolved by the unsupervised method since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars.
On the other hand, another experiment is carried out in order to improve the argument recognition in the grammars by the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically from the extraction of verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verb predicate and its arguments annotated syntactically. As a result of the information extracted, subcategorization frames have been classified into subcategorization classes regarding the patterns observed in the corpus. The results of the subcategorization classes integration in the grammars prove that this information increases the accuracy of the argument recognition in the grammars.
The results of the research of this thesis show that grammars’ rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars' accuracy, but syntactic and semantic information, and new patterns of PP-attachment, need to be included in the language models in order to contribute to disambiguating this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a positive influence on the grammars’ accuracy.
This thesis addresses the limitations that automatic syntactic parsers currently face. Despite the progress made in Natural Language Processing in recent years, language technologies and, in particular, automatic parsers have not been able to overcome certain structural ambiguities such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic syntactic analysis through the integration of linguistic and statistical knowledge to disambiguate ambiguous syntactic constructions.
The starting point of the research was the development of a rule-based grammar for Spanish and another for Catalan, following the postulates of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To create two robust grammars that analyse and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria for determining the formation of phrases and clauses in the grammar was drawn up by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammar rules. In addition to this list, correspondences with the Universal Dependencies annotation standard (McDonald et al., 2013) were established.
At the same time, an empirical evaluation system was designed that takes both quantitative and qualitative analysis into account in order to assess the results of the experiments fully. It is an empirical task because the analyses generated by the grammars are compared with real language data. For the quantitative evaluation, the Tibidabo corpus (Marimon et al., 2014), available only in Spanish, was used; it is large enough to build a real analysis of the grammars and has been annotated with the same formalism as the grammars. Since the criteria of the grammars and of the corpus do not coincide, a harmonization process was carried out by means of manually created rules that automatically adapt the structure and dependency relations of the corpus to the criteria of the grammars. As for the qualitative evaluation, because no resources are available for Spanish and Catalan, we designed a test suite of syntactic phenomena concerning structure and word order. To create a representative repertoire of the languages studied, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) were used to supply the repertoire of syntactic structures, and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) was used to capture word order automatically.
Thanks to these two tools, two experiments were carried out to prove that integrating knowledge into automatic syntactic analysis improves its quality. On the one hand, the learning of language models by means of statistical models was explored in order to propose solutions to prepositional phrase attachment. More specifically, a language model was developed with a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method has serious limitations in giving an answer for data it has not seen before, which the unsupervised method overcomes because it can classify any case. However, the unsupervised method studied is limited if it learns from lexical data. For this reason, the data used to train the model need to contain the value of the preposition and syntactic and semantic features. In addition, the number of patterns learned must be increased in order to extend the coverage of the models and have an impact on the grammars' results.
On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge was proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), in which the verbal predicate and its arguments are syntactically annotated. Based on the extracted information, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition.
The results of the research carried out in this doctoral thesis show that grammar rules are not expressive enough on their own to resolve complex language ambiguities. Nevertheless, integrating knowledge about these ambiguities can be decisive in proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this, syntactic and semantic information must be included in the machine learning models and more patterns must be captured to contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a decisive influence on the quality of the grammars for automatic syntactic analysis.
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.
ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909]
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anà lisi Sintà ctica Automà tica i Avaluació. Millora de qualitat per a Gramà tiques de Dependències
[eng] Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to contribute to the improvement of parsing performance when it has knowledge integrated in order to deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. The disambiguation of these two highly ambiguous linguistic phenomena by the integration of knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition. The starting point of this research proposal is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments about the integration of automatically- acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (PadrĂł et al., 2010) has been used as a framework. On the other hand, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and SolĂ , 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (Mcdonald et al., 2013). Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency parsed trees generated by the grammars are compared to real linguistic data. 
The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars performance and which has been annotated with the same formalism as the grammars, syntactic dependencies. Since the criteria between both resources are differ- ent, a process of harmonization has been applied developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars quali- tatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; SolĂ et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively. Thanks to these two tools, two experiments have been carried out in order to prove that knowl- edge integration improves the parsing accuracy. On the one hand, the automatic learning of lan- guage models has been explored by means of statistical methods in order to disambiguate PP- attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, which is resolved by the unsupervised method since provides a solution for any case. However, the unsupervised method is limited if it Parsing and Evaluation Improving Dependency Grammars Accuracy only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. 
In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars. On the other hand, another experiment is carried out in order to improve the argument recog- nition in the grammars by the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically from the extraction of verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) which contains the verb predicate and its arguments annotated syntactically. As a result of the information extracted, subcategorization frames have been classified into subcategorization classes regarding the patterns observed in the corpus. The results of the subcategorization classes integration in the grammars prove that this information increases the accuracy of the argument recognition in the grammars. The results of the research of this thesis show that grammars’ rules on their own are not ex- pressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, sta- tistical knowledge about PP-attachment can improve the grammars accuracy, but syntactic and semantic information, and new patterns of PP-attachment need to be included in the language models in order to contribute to disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources show a positive influence positively on grammars’ accuracy.[cat] Aquesta tesi vol tractar les limitacions amb què es troben els analitzadors sintĂ ctics automĂ tics actualment. 
Despite the progress made in the area of Natural Language Processing in recent years, language technologies and, in particular, automatic syntactic parsers have not been able to overcome certain structural ambiguities such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic syntactic analysis by integrating linguistic and statistical knowledge to disambiguate ambiguous syntactic constructions. The starting point of the research was the development of one rule-based grammar for Spanish and another for Catalan, following the postulates of Dependency Grammar (Tesnière, 1959; Mel'čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To create two robust grammars that analyze and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria was drawn up to determine the formation of phrases and clauses in the grammar, by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel'čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammar rules. Besides compiling this list, correspondences were established with the Universal Dependencies annotation standard (McDonald et al., 2013). 
At the same time, an empirical evaluation system was designed that takes both quantitative and qualitative analysis into account in order to give a complete assessment of the experimental results. It is an empirical task precisely because the analyses generated by the grammars are compared with real language data. To carry out the evaluation from a quantitative perspective, the Tibidabo corpus (Marimon et al., 2014) was used; it is available only in Spanish, is large enough to build a realistic analysis of the grammars, and has been annotated with the same formalism as the grammars. In particular, because the criteria of the grammars and of the corpus do not coincide, a criteria harmonization process was carried out by means of manually created rules that automatically adapt the structure and dependency relations of the corpus to the criteria of the grammars. As for the qualitative evaluation, because no resources are available for Spanish and Catalan, we designed a test suite of structural syntactic phenomena and phenomena related to sentence order. With the aim of creating a representative repertoire of the languages studied, descriptive grammars were used to supply the repertoire of syntactic structures (Bosque and Demonte, 1999; Solà et al., 2002), and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) was used to capture sentence order automatically. Thanks to these two tools, two experiments could be carried out to prove that integrating knowledge into automatic syntactic analysis improves its quality. On the one hand, the learning of language models by means of statistical models was explored in order to propose solutions to prepositional phrase attachment. 
More specifically, a language model was developed using a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method has serious limitations in giving an answer for data it has not previously seen, which the unsupervised method overcomes because it is able to classify any case. Nevertheless, the unsupervised method studied is limited if it learns from lexical data. For this reason, the data used to train the model need to contain the value of the preposition as well as syntactic and semantic features. Moreover, the number of patterns learned needs to be increased in order to widen the coverage of the models and have an impact on the results of the grammars. On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge was proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), in which the verbal predicate and its arguments are syntactically annotated. Based on the extracted information, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition. The results of the research carried out in this doctoral thesis show that grammar rules are not expressive enough on their own to resolve complex ambiguities of language. 
Nevertheless, integrating knowledge about these ambiguities can be decisive when proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this, syntactic and semantic information must be included in the machine learning models and more patterns must be captured in order to contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources decisively influences the quality of the grammars for automatic syntactic analysis
A computational approach to Latin verbs: new resources and methods
This thesis presents the application of computational methods to the study of Latin verbs. In particular, we show the creation of a subcategorization lexicon automatically extracted from annotated corpora; we also present a probabilistic model for the acquisition of selectional preferences from annotated corpora and an ontology (Latin WordNet). Finally, we describe the results of a diachronic and quantitative study of Latin spatial preverbs
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation) using both symbolic and statistical approaches