From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in the Journal of Artificial Intelligence Research.
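The meaning conflation deficiency described in this abstract can be illustrated with a toy sketch. The two-dimensional vectors, the sense labels `bank#finance` and `bank#river`, and the averaging step are all hypothetical choices made for illustration; they are not taken from the survey.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sense vectors: axis 0 ~ "finance", axis 1 ~ "river".
sense_vectors = {
    "bank#finance": [1.0, 0.0],
    "bank#river": [0.0, 1.0],
}
money = [1.0, 0.0]  # toy vector for a finance-related word
shore = [0.0, 1.0]  # toy vector for a river-related word

# A single word vector that conflates both senses (here: their average).
bank_word = [
    (a + b) / 2
    for a, b in zip(sense_vectors["bank#finance"], sense_vectors["bank#river"])
]

sim_word_money = cosine(bank_word, money)
sim_word_shore = cosine(bank_word, shore)
sim_sense_money = cosine(sense_vectors["bank#finance"], money)
sim_sense_shore = cosine(sense_vectors["bank#finance"], shore)

print(sim_word_money, sim_word_shore)    # conflated vector: cannot separate the contexts
print(sim_sense_money, sim_sense_shore)  # sense vector: separates them
```

In this sketch the conflated word vector is equally similar to both context words, whereas the sense-level vector separates them.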
A Bigger Fish to Fry: Scaling up the Automatic Understanding of Idiomatic Expressions
In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase whose meaning is not a direct combination of the meanings of their parts, e.g. 'at a crossroads' and 'move the goalposts'.
In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on building a large idiom corpus, which consists of developing a system for the automatic extraction of potentially idiomatic expressions (PIEs) and building a large corpus of idioms using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.
In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems.
Diagnosing Reading strategies: Paraphrase Recognition
Paraphrase recognition is a form of natural language processing used in tutoring, question answering, and information retrieval systems. The context of the present work is an automated reading strategy trainer called iSTART (Interactive Strategy Trainer for Active Reading and Thinking). The ability to recognize the use of paraphrase (complete, partial, or inaccurate; with or without extra information) in the student's input is essential if the trainer is to give appropriate feedback. I analyzed the most common patterns of paraphrase and developed a means of representing the semantic structure of sentences. Paraphrases are recognized by transforming sentences into this representation and comparing them. To construct a precise semantic representation, it is important to understand the meaning of prepositions. Adding preposition disambiguation to the original system improved its accuracy by 20%. The preposition sense disambiguation module itself achieves about 80% accuracy for the top 10 most frequently used prepositions.
The main contributions of this work to the research community are the preposition classification and generalized preposition disambiguation processes, which are integrated into the paraphrase recognition system and are shown to be quite effective. The recognition model also forms a significant part of this contribution. The present effort includes the modeling of the paraphrase recognition process, featuring the Syntactic-Semantic Graph as a sentence representation; the implementation of a significant portion of this design, demonstrating its effectiveness; the modeling of an effective preposition classification based on prepositional usage; the design of the generalized preposition disambiguation module; and the integration of the preposition disambiguation module into the paraphrase recognition system so as to gain significant improvement.
Parsing and Evaluation. Improving Dependency Grammars Accuracy. Anàlisi Sintàctica Automàtica i Avaluació. Millora de qualitat per a Gramàtiques de Dependències
Because parsers are still limited in analysing specific ambiguous constructions, the research presented in this thesis mainly aims to improve parsing performance by integrating knowledge that helps parsers deal with ambiguous linguistic phenomena. More precisely, this thesis intends to provide empirical solutions to the disambiguation of prepositional phrase attachment and argument recognition in order to assist parsers in generating a more accurate syntactic analysis. Disambiguating these two highly ambiguous linguistic phenomena by integrating knowledge about the language necessarily relies on linguistic and statistical strategies for knowledge acquisition.
The starting point of this research proposal is the development of a rule-based grammar for Spanish and for Catalan following the theoretical basis of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988) in order to carry out two experiments on the integration of automatically-acquired knowledge. In order to build two robust grammars that understand a sentence, the FreeLing pipeline (Padró et al., 2010) has been used as a framework. In addition, an eclectic repertoire of criteria about the nature of syntactic heads is proposed by reviewing the postulates of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). Furthermore, a set of dependency relations is provided and mapped to Universal Dependencies (McDonald et al., 2013).
Furthermore, an empirical evaluation method has been designed in order to carry out both a quantitative and a qualitative analysis. In particular, the dependency trees generated by the grammars are compared to real linguistic data. The quantitative evaluation is based on the Spanish Tibidabo Treebank (Marimon et al., 2014), which is large enough to carry out a real analysis of the grammars' performance and which has been annotated with the same formalism as the grammars, namely syntactic dependencies. Since the criteria of the two resources differ, a harmonization process has been applied, developing a set of rules that automatically adapt the criteria of the corpus to the grammar criteria. With regard to qualitative evaluation, there are no available resources to evaluate Spanish and Catalan dependency grammars qualitatively. For this reason, a test suite of syntactic phenomena about structure and word order has been built. In order to create a representative repertoire of the languages observed, descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) have been used for capturing relevant structures and word order patterns, respectively.
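The harmonization and quantitative comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the label names in `LABEL_MAP` are hypothetical, the sketch only relabels relations (the thesis rules also adapt structure), and unlabeled/labeled attachment scores are a common dependency-parsing metric assumed here rather than one named in the abstract.

```python
# Hypothetical mapping from treebank relation labels to grammar labels.
LABEL_MAP = {"SUBJ": "subj", "DOBJ": "dobj", "SPEC": "det"}

def harmonize(tree, label_map=LABEL_MAP):
    """Relabel a tree given as a list of (head_index, relation) pairs."""
    return [(head, label_map.get(label, label)) for head, label in tree]

def attachment_scores(gold, predicted):
    """Return (UAS, LAS): share of tokens with correct head / head and label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("trees must be non-empty and of equal length")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las

# Toy four-token sentence; head index 0 marks the root.
gold = harmonize([(2, "SPEC"), (3, "SUBJ"), (0, "root"), (3, "DOBJ")])
predicted = [(2, "det"), (3, "subj"), (0, "root"), (3, "iobj")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)
```

Here every head is correct but one relation label differs, so the labeled score is lower than the unlabeled one.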
Thanks to these two tools, two experiments have been carried out in order to prove that knowledge integration improves parsing accuracy. On the one hand, the automatic learning of language models has been explored by means of statistical methods in order to disambiguate PP-attachment. More precisely, a model has been learned with a supervised classifier using Weka (Witten and Frank, 2005). Furthermore, an unsupervised model based on word embeddings has been applied (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method is limited in predicting solutions for unseen data, a limitation the unsupervised method resolves, since it provides a solution for any case. However, the unsupervised method is limited if it only learns from lexical data. For this reason, training data needs to be enriched with the lexical value of the preposition, as well as semantic and syntactic features. In addition, the number of patterns used to learn language models has to be extended in order to have an impact on the grammars.
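One way an embedding-based PP-attachment decision might look is sketched below. This is not the thesis method, which the abstract does not specify; the decision rule (attach to whichever candidate head is more similar to the noun inside the PP), the function `attach`, and the three-dimensional toy vectors in `EMB` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real ones would come from a trained model.
EMB = {
    "eat": [0.9, 0.1, 0.3],
    "pizza": [0.1, 0.9, 0.2],
    "fork": [0.8, 0.2, 0.4],
    "anchovies": [0.2, 0.8, 0.1],
}

def attach(verb, noun, pp_noun, emb=EMB):
    """Pick the attachment site for 'verb noun PREP pp_noun'.

    Returns "verb" or "noun", whichever head is closer to pp_noun.
    Only lexical similarity is used; the preposition itself is ignored.
    """
    sim_verb = cosine(emb[verb], emb[pp_noun])
    sim_noun = cosine(emb[noun], emb[pp_noun])
    return "verb" if sim_verb >= sim_noun else "noun"

print(attach("eat", "pizza", "fork"))       # eat pizza with a fork
print(attach("eat", "pizza", "anchovies"))  # eat pizza with anchovies
```

Because the rule ignores the preposition and uses lexical similarity alone, it also illustrates the limitation the abstract points out for models trained only on lexical data.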
On the other hand, another experiment is carried out in order to improve argument recognition in the grammars through the acquisition of linguistic knowledge. In this experiment, knowledge is acquired automatically by extracting verb subcategorization frames from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), which contains the verb predicate and its arguments annotated syntactically. Based on the information extracted, subcategorization frames have been classified into subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars prove that this information increases the accuracy of argument recognition in the grammars.
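The frame extraction and classification step can be sketched roughly as below. The input format, the function names, and the rule of labelling a verb by its most frequent frame are hypothetical simplifications; the actual SenSem annotation scheme and the thesis classification criteria are not given in the abstract.

```python
from collections import Counter, defaultdict

# Hypothetical annotated clauses: (verb lemma, syntactic functions of its arguments).
ANNOTATED = [
    ("dar", ("subj", "dobj", "iobj")),
    ("dar", ("subj", "dobj", "iobj")),
    ("dar", ("subj", "dobj")),
    ("dormir", ("subj",)),
    ("dormir", ("subj",)),
]

def extract_frames(clauses):
    """Count how often each verb occurs with each argument frame."""
    frames = defaultdict(Counter)
    for verb, args in clauses:
        frames[verb][args] += 1
    return frames

def subcat_class(frame_counts):
    """Label a verb by its most frequent frame (an assumed, simplistic rule)."""
    frame, _count = frame_counts.most_common(1)[0]
    return "+".join(frame)

frames = extract_frames(ANNOTATED)
classes = {verb: subcat_class(counts) for verb, counts in frames.items()}
print(classes)
```

A grammar rule could then consult `classes` to decide, for instance, whether a verb is expected to take an indirect object.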
The results of the research of this thesis show that the grammars' rules on their own are not expressive enough to resolve complex ambiguities. However, the integration of knowledge about these ambiguities in the grammars may be decisive in the disambiguation. On the one hand, statistical knowledge about PP-attachment can improve the grammars' accuracy, but syntactic and semantic information, and new patterns of PP-attachment, need to be included in the language models in order to help disambiguate this phenomenon. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a positive influence on the grammars' accuracy.
This thesis addresses the limitations that automatic syntactic parsers currently face. Despite the progress made in Natural Language Processing in recent years, language technologies and, in particular, automatic parsers have not been able to get past certain structural ambiguities, such as prepositional phrase attachment and argument recognition. For this reason, the research carried out in this thesis aims to bring significant quality improvements to automatic parsing by integrating linguistic and statistical knowledge in order to disambiguate ambiguous syntactic constructions.
The starting point of the research was the development of one rule-based grammar for Spanish and another for Catalan, following the postulates of Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988), in order to carry out the experiments on automatic knowledge acquisition. To build two robust grammars that analyse and understand the sentence in depth, we relied on the architecture of FreeLing (Padró et al., 2010), a Natural Language Processing library that provides automatic linguistic analysis of the sentence. In addition, an eclectic proposal of linguistic criteria for determining how phrases and clauses are formed in the grammar was drawn up by reviewing the theoretical proposals of Generative Grammar (Chomsky, 1981; Bonet and Solà, 1986; Haegeman, 1991) and Dependency Grammar (Tesnière, 1959; Mel’čuk, 1988). This proposal is accompanied by a list of the dependency relation labels used by the grammar rules. Besides drawing up this list, correspondences with the Universal Dependencies annotation standard (McDonald et al., 2013) were established.
At the same time, an empirical evaluation system was designed that takes both quantitative and qualitative analysis into account, so as to assess the results of the experiments fully. It is an empirical task because the analyses generated by the grammars are compared with real language data. For the quantitative evaluation, the Tibidabo corpus (Marimon et al., 2014), available only in Spanish, was used; it is large enough to build a realistic analysis of the grammars and has been annotated with the same formalism as the grammars. Specifically, since the criteria of the grammars and of the corpus do not coincide, a harmonization process was carried out by means of manually created rules that automatically adapt the structure and dependency relations of the corpus to the criteria of the grammars. As for the qualitative evaluation, because no resources are available for Spanish and Catalan, we designed a test suite of syntactic phenomena concerning structure and sentence word order. To create a repertoire representative of the languages studied, descriptive grammars were used to supply the syntactic structures (Bosque and Demonte, 1999; Solà et al., 2002), and the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015) was used to capture word order automatically.
Thanks to these two tools, two experiments were carried out to show that integrating knowledge into automatic parsing improves its quality. On the one hand, the learning of language models by means of statistical models was explored in order to propose solutions to prepositional phrase attachment. More specifically, a language model was developed with a supervised learning classifier from Weka (Witten and Frank, 2005). In addition, a language model was learned with an unsupervised method based on the distributional approach known as word embeddings (Mikolov et al., 2013a,b). The results of the experiment show that the supervised method has serious limitations in giving an answer for data it has not previously seen, which the unsupervised method overcomes because it can classify any case. Even so, the unsupervised method studied is limited if it learns from lexical data alone. For this reason, the data used to train the model must contain the value of the preposition as well as syntactic and semantic features. Furthermore, the number of learned patterns needs to be increased in order to widen the coverage of the models and have an impact on the results of the grammars.
On the other hand, a way of improving argument recognition in the grammars through the acquisition of linguistic knowledge was proposed. In this experiment, knowledge was extracted automatically, in the form of verb subcategorization classes, from the SenSem Corpus (Vázquez and Fernández-Montraveta, 2015), in which the verbal predicate and its arguments are syntactically annotated. Based on the information extracted, the various verbal diatheses were classified into verb subcategorization classes according to the patterns observed in the corpus. The results of integrating the subcategorization classes into the grammars show that this information has a positive effect on argument recognition.
The results of the research carried out in this doctoral thesis show that the grammar rules are not expressive enough on their own to resolve complex ambiguities of language. However, integrating knowledge about these ambiguities can be decisive when proposing a solution. On the one hand, statistical knowledge about prepositional phrase attachment can improve the quality of the grammars, but to confirm this, syntactic and semantic information needs to be included in the machine learning models and more patterns captured in order to contribute to the disambiguation of complex phenomena. On the other hand, linguistic knowledge about verb subcategorization acquired from annotated linguistic resources has a decisive influence on the quality of the grammars for automatic parsing.
Natural Language Processing Resources for Finnish. Corpus Development in the General and Clinical Domains
Transferred from Doria
Examining inter-sentential influences on predicted verb subcategorization
This study investigated the influences of prior discourse context and cumulative syntactic priming on readers' predictions for verb subcategorizations. An additional aim was to determine whether cumulative syntactic priming has the same degree of influence following coherent discourse contexts as following a series of unrelated sentences. Participants (N = 40) read sentences using a self-paced, sentence-by-sentence procedure. Half of these sentences comprised a coherent discourse context intended to increase the expectation for a sentential complement (S) completion. The other half consisted of scrambled sentences. The trials in both conditions varied according to the proportion of verbs that resolved to an S (either 6S or 2S). Following each condition, participants read temporarily ambiguous sentences that resolved to an S. Reading times across the disambiguating and postdisambiguating regions were measured. No significant main effects or interactions were found for either region. However, the lack of significant findings for these analyses may have been due to low power. In a follow-up analysis, data from each gender were analyzed separately. For the data contributed by males, there were no significant findings. For the data contributed by females, the effect of coherence was significant (by participants but not by items) across the postdisambiguating region, and there was a marginally significant interaction (p = .05) between coherence and frequency across this region, suggesting that discourse-level information may differentially influence the local sentence processing of female and male participants.