
    Graph-based approaches to word sense induction

    This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers, and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, and these methods are then incorporated into graph-based WSI systems. These systems are shown, in a variety of evaluation scenarios, to return results comparable to those of the current best-performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems; and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
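
    The abstract names the thesis's weighted generalisations of the clustering coefficient without giving their form. Purely as a point of reference (not the thesis's own formulation), the Python sketch below contrasts the standard unweighted clustering coefficient with one well-known weighted generalisation, the geometric-mean variant of Onnela et al. (2005) that networkx implements, on a toy word co-occurrence graph.

        # The unweighted clustering coefficient of a vertex is the fraction of
        # its neighbour pairs that are themselves connected; weighted
        # generalisations replace that count with a weight-sensitive quantity.
        # networkx's weighted variant uses the geometric mean of the three
        # (max-normalised) edge weights of each triangle.
        import networkx as nx

        G = nx.Graph()
        # Toy word co-occurrence graph: edge weights = co-occurrence counts.
        G.add_weighted_edges_from([
            ("bank", "money", 8), ("bank", "loan", 5), ("money", "loan", 6),
            ("bank", "river", 3), ("bank", "shore", 2), ("river", "shore", 7),
        ])

        unweighted = nx.clustering(G)                 # fraction of closed pairs
        weighted = nx.clustering(G, weight="weight")  # geometric-mean variant

        for v in G:
            print(f"{v:6s} unweighted={unweighted[v]:.3f} weighted={weighted[v]:.3f}")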

    Waves and Words: Oscillatory activity and language processing

    Successful language comprehension depends not only on the involvement of different domain-specific linguistic processes, but also on their respective time-courses. Both aspects of the comprehension process can be examined by means of event-related brain potentials (ERPs), which not only provide a direct reflection of human brain activity within the millisecond range, but also allow for a qualitative dissociation between different language-related processing domains. However, recent ERP findings indicate that the desired one-to-one mapping between ERP components and linguistic processes cannot be upheld, thus leading to interpretative uncertainty. This thesis presents a fundamentally new analysis technique for language-based ERP components, which aims to address the ambiguity associated with traditional language-related ERP effects. It is argued that this new method, which supplements ERP measures with corresponding frequency-based analyses, not only allows for a differentiation of ERP components on the basis of activity in distinct frequency bands and underlying dynamic behaviour (in terms of power changes and/or phase locking), but also provides further insights into the functional organisation of the language comprehension system and its inherent complexity. On the basis of five EEG experiments, I show (1) that it is possible to dissociate two superficially indistinguishable language-related ERP components on the basis of their respective underlying frequency characteristics (Experiment 1), thereby resolving the vagueness of interpretation inherent to the ERP components themselves; (2) that the processing nature of the ‘classical’ semantic N400 effect can be unambiguously specified in terms of its underlying frequency characteristics, i.e. in terms of (evoked and whole) power and phase-locking differences in specific frequency bands, thereby allowing for a first interpretative categorisation of the N400 effect with respect to its underlying neuronal processing dynamics; and (3) that frequency-based analyses may be employed to distinguish the semantic N400 effect from N400-like effects that appear in contexts which cannot readily be characterised as semantic-interpretative. Experiments 2–5 investigated the processing of antonym relations under different task conditions. Whereas in Experiment 2 the processing of antonym pairs (black – white) was compared to that of related (black – yellow) and non-related (black – nice) word pairs in a sentence context, Experiments 3 to 5 presented isolated word pairs. The frequency-based analysis showed that the observed N400 effects were not uniform in nature, but rather resulted from the superposition of functionally different frequency components. Task-relevant targets elicited a specific frequency modulation, which showed up as a P300-like positivity in terms of ERP measures. In addition, lexical-semantic processing elicited a pronounced power increase in a different frequency range that was independent of the experimental context. For antonyms (Experiments 2 and 3), the task-related positive component appeared almost simultaneously with the N400 deflection for non-related words, thereby giving rise to a substantial N400 effect. In contrast, for pseudowords (Experiment 5), this positivity appeared in temporal succession to the N400. In sum, the present results provide converging evidence that N400 effects should not be regarded as functionally uniform. Depending on the respective task and stimulus manipulations, the N400 effect appears as a result of the superposition of functionally different activities, which can be clearly distinguished in terms of their underlying frequency characteristics. In this way, the proposed frequency-based methods directly bear upon the interpretation of language-related ERP effects and thus have straightforward consequences for psycholinguistic theory. Given that language-related processes have, in a number of cases, been attributed to the lexical-semantic processing domain solely on the basis of an observed N400, these results call for a reinterpretation not only of previous findings but also of their theoretical consequences.
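
    The measures the thesis builds on can be made concrete with a short sketch. The Python code below computes, on synthetic single-trial data, the three quantities the abstract contrasts: whole (total) power, evoked (phase-locked) power, and inter-trial phase locking. The bandpass-filter-plus-Hilbert route to the per-band analytic signal is one common choice, assumed here for illustration; it is not necessarily the exact pipeline used in the thesis.

        # "Whole" power averages single-trial power, so it retains activity
        # that is time- but not phase-locked; "evoked" power is the power of
        # the across-trial average, keeping only phase-locked activity; the
        # phase-locking value measures phase consistency across trials.
        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        rng = np.random.default_rng(0)
        fs, n_trials, n_samples = 250, 60, 500       # 2 s epochs at 250 Hz
        t = np.arange(n_samples) / fs

        # Synthetic trials: a partly phase-locked 10 Hz component plus noise.
        jitter = rng.uniform(0, 0.25 * np.pi, n_trials)
        trials = (np.sin(2 * np.pi * 10 * t + jitter[:, None])
                  + rng.normal(0, 1.0, (n_trials, n_samples)))

        # Band-limit to 8-12 Hz, then take the analytic signal per trial.
        b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="band")
        analytic = hilbert(filtfilt(b, a, trials, axis=1), axis=1)

        total_power = np.mean(np.abs(analytic) ** 2, axis=0)            # whole
        evoked_power = np.abs(np.mean(analytic, axis=0)) ** 2           # evoked
        plv = np.abs(np.mean(np.exp(1j * np.angle(analytic)), axis=0))  # locking

        print(f"mean total power  : {total_power.mean():.3f}")
        print(f"mean evoked power : {evoked_power.mean():.3f}")
        print(f"mean phase locking: {plv.mean():.3f}")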

    The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study

    Carminati MN, Knoeferle P. The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study. Presented at the Architectures and Mechanisms for Language Processing (AMLaP) conference, Riva del Garda, Italy.

    A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

    Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as knowledge base. This provides coverage of hundreds of languages and millions of concepts, both general and specific to humans. As the starting point of our research we employ knowledge graph-based features - along with other traditional ones and meta-learning - for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
    Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
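
    A toy sketch may help make the core idea concrete: words from documents in different languages map into a shared concept space via a multilingual semantic network, each concept set is expanded one hop through the network into a small knowledge graph, and documents are compared by concept overlap. The miniature inventory, relations, and plain Jaccard similarity below are invented stand-ins for the wide-coverage network and the thesis's actual similarity model.

        # word (any language) -> concept ids; invented illustrative inventory
        SENSES = {
            "bank": {"C:finance"}, "banco": {"C:finance"},
            "loan": {"C:loan"}, "préstamo": {"C:loan"},
            "interest": {"C:interest"}, "interés": {"C:interest"},
        }
        # concept -> directly related concepts in the semantic network
        RELATED = {
            "C:finance": {"C:loan", "C:interest"},
            "C:loan": {"C:finance", "C:interest"},
            "C:interest": {"C:finance"},
        }

        def knowledge_graph(words):
            """Concepts for the words, expanded one hop through the network."""
            concepts = set().union(*(SENSES.get(w, set()) for w in words))
            expanded = set(concepts)
            for c in concepts:
                expanded |= RELATED.get(c, set())
            return expanded

        def similarity(words_a, words_b):
            ga, gb = knowledge_graph(words_a), knowledge_graph(words_b)
            return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

        # English vs. Spanish fragments about the same topic land in the same
        # concept space, so similarity is high despite zero word overlap.
        print(similarity(["bank", "loan"], ["banco", "préstamo"]))  # -> 1.0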

    Harnessing sense-level information for semantically augmented knowledge extraction

    Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge. This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, however, obtaining explicit sense-level information is a very demanding task in its own right, and one that can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text. In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e. Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performance on standard benchmarks against both manual and semi-automatic competitors. In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how to best exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically, and unified with respect to a common sense inventory. Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.
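
    To make the sense-level unification step concrete, the sketch below maps surface OIE triples onto a shared sense inventory so that extractions differing only in surface form collapse onto one sense-level triple. The hand-made lookup table stands in for the real Word Sense Disambiguation and Entity Linking the thesis employs; all names and sense ids are illustrative only.

        # Surface phrase -> sense id; a toy stand-in for a disambiguation step.
        from collections import defaultdict

        SENSE_OF = {
            "acquired": "buy.v.01", "bought": "buy.v.01", "purchased": "buy.v.01",
            "Google": "Google.e", "the search giant": "Google.e",
            "YouTube": "YouTube.e",
        }

        def to_sense_triple(subj, rel, obj):
            """Map a surface OIE triple onto the shared sense inventory."""
            return tuple(SENSE_OF.get(x, x) for x in (subj, rel, obj))

        extractions = [
            ("Google", "acquired", "YouTube"),
            ("the search giant", "bought", "YouTube"),
            ("Google", "purchased", "YouTube"),
        ]

        # Heterogeneous surface triples unify under one sense-level key.
        unified = defaultdict(list)
        for triple in extractions:
            unified[to_sense_triple(*triple)].append(triple)

        for sense_triple, surface in unified.items():
            print(sense_triple, "<-", surface)
        # All three collapse onto ('Google.e', 'buy.v.01', 'YouTube.e').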

    Semantic radical consistency and character transparency effects in Chinese: an ERP study

    BACKGROUND: This event-related potential (ERP) study aims to investigate the representation and temporal dynamics of Chinese orthography-to-semantics mappings by simultaneously manipulating character transparency and semantic radical consistency. Character components, referred to as radicals, make up the building blocks used dur…


    Examining the learning burden and decay of second language vocabulary knowledge

    Research in second language (L2) vocabulary learning has shown that not all words are equally easy to learn, and that several factors affect the difficulty with which words are acquired, i.e., their learning burden. However, research to date has explored only a few of the many factors affecting learning burden, and existing findings are inconclusive. Another important finding in the L2 vocabulary learning literature is that L2 lexical knowledge is forgotten after learning but, to date, there has been minimal investigation of the variables that influence lexical decay. It has also been assumed that the lexical items most difficult to acquire are those easiest to forget, pointing towards a positive relationship between learning burden and decay (Webb & Nation, 2017). However, there is currently limited empirical evidence to support this assumption. This thesis reports research undertaken to explore the effect of different variables on learning burden and lexical decay, and the relationship between burden and decay. It consists of three empirical studies that investigated the effect of intralexical (i.e., part of speech (PoS), word length), contextual (i.e., meaning presentation code, form presentation mode), and individual (i.e., perceived target item usefulness, language learning aptitude) factors on the learning burden and decay of vocabulary knowledge that was intentionally learned with flashcard software. Each study also considered the effect of learning burden on lexical decay. Additionally, a cross-study analysis was conducted to explore the effect of retention interval length on decay. The empirical studies showed that word length, aspects of language learning aptitude, and form presentation mode affected learning burden but not decay, with shorter words, higher associative memory capacity, and bimodal form presentation associated with less burden. Perceived target item usefulness was found to have no effect on burden or decay. Meaning presentation code and PoS were found to affect both burden and decay: lexical items presented with an L2 definition, and verbs, were more burdensome and more likely to decay than items presented with an L1 equivalent, and nouns. The findings also indicated that greater learning burden was associated with a higher likelihood of decay. The cross-study analysis showed that decay was not directly proportional to retention interval length and that form recall knowledge was more susceptible to decay than form recognition. Additionally, this thesis explores implications for vocabulary research and L2 pedagogy.