
    Graph-based approaches to word sense induction

    This thesis is a study of Word Sense Induction (WSI), the Natural Language Processing (NLP) task of automatically discovering word meanings from text. WSI is an open problem in NLP whose solution would be of considerable benefit to many other NLP tasks. It has, however, been studied by relatively few NLP researchers, and often in set ways. Scope therefore exists to apply novel methods to the problem, methods that may improve upon those previously applied. This thesis applies a graph-theoretic approach to WSI. In this approach, word senses are identified by finding particular types of subgraphs in word co-occurrence graphs. A number of original methods for constructing, analysing, and partitioning graphs are introduced, and these methods are then incorporated into graph-based WSI systems. These systems are shown, in a variety of evaluation scenarios, to return results comparable to those of the current best-performing WSI systems. The main contributions of the thesis are a novel parameter-free soft clustering algorithm that runs in time linear in the number of edges in the input graph, and novel generalisations of the clustering coefficient (a measure of vertex cohesion in graphs) to the weighted case. Further contributions of the thesis include: a review of graph-based WSI systems that have been proposed in the literature; analysis of the methodologies applied in these systems; analysis of the metrics used to evaluate WSI systems; and empirical evidence to verify the usefulness of each novel method introduced in the thesis for inducing word senses.
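
    The abstract names the thesis's weighted generalisations of the clustering coefficient without giving their form. Purely as a point of reference (not the thesis's own formulation), the Python sketch below contrasts the standard unweighted clustering coefficient with one well-known weighted generalisation, the geometric-mean variant of Onnela et al. (2005) that networkx implements, on a toy word co-occurrence graph.

        # The unweighted clustering coefficient of a vertex is the fraction of
        # its neighbour pairs that are themselves connected; weighted
        # generalisations replace that count with a weight-sensitive quantity.
        # networkx's weighted variant uses the geometric mean of the three
        # (max-normalised) edge weights of each triangle.
        import networkx as nx

        G = nx.Graph()
        # Toy word co-occurrence graph: edge weights = co-occurrence counts.
        G.add_weighted_edges_from([
            ("bank", "money", 8), ("bank", "loan", 5), ("money", "loan", 6),
            ("bank", "river", 3), ("bank", "shore", 2), ("river", "shore", 7),
        ])

        unweighted = nx.clustering(G)                 # fraction of closed pairs
        weighted = nx.clustering(G, weight="weight")  # geometric-mean variant

        for v in G:
            print(f"{v:6s} unweighted={unweighted[v]:.3f} weighted={weighted[v]:.3f}")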

    Waves and Words: Oscillatory activity and language processing

    Successful language comprehension depends not only on the involvement of different domain-specific linguistic processes, but also on their respective time-courses. Both aspects of the comprehension process can be examined by means of event-related brain potentials (ERPs), which not only provide a direct reflection of human brain activity within the millisecond range, but also allow for a qualitative dissociation between different language-related processing domains. However, recent ERP findings indicate that the desired one-to-one mapping between ERP components and linguistic processes cannot be upheld, thus leading to interpretative uncertainty. This thesis presents a fundamentally new analysis technique for language-based ERP components, which aims to address the ambiguity associated with traditional language-related ERP effects. It is argued that this new method, which supplements ERP measures with corresponding frequency-based analyses, not only allows for a differentiation of ERP components on the basis of activity in distinct frequency bands and underlying dynamic behaviour (in terms of power changes and/or phase locking), but also provides further insights into the functional organisation of the language comprehension system and its inherent complexity. On the basis of five EEG experiments, I show (1) that it is possible to dissociate two superficially indistinguishable language-related ERP components on the basis of their respective underlying frequency characteristics (Experiment 1), thereby resolving the vagueness of interpretation inherent to the ERP components themselves; (2) that the processing nature of the ‘classical’ semantic N400 effect can be unambiguously specified in terms of its underlying frequency characteristics, i.e. in terms of (evoked and whole) power and phase-locking differences in specific frequency bands, thereby allowing for a first interpretative categorisation of the N400 effect with respect to its underlying neuronal processing dynamics; and (3) that frequency-based analyses may be employed to distinguish the semantic N400 effect from N400-like effects that appear in contexts which cannot readily be characterised as semantic-interpretative. Experiments 2–5 investigated the processing of antonym relations under different task conditions. Whereas in Experiment 2 the processing of antonym pairs (black – white) was compared to that of related (black – yellow) and non-related (black – nice) word pairs in a sentence context, Experiments 3 to 5 presented isolated word pairs. The frequency-based analysis showed that the observed N400 effects were not uniform in nature, but rather resulted from the superposition of functionally different frequency components. Task-relevant targets elicited a specific frequency modulation, which showed up as a P300-like positivity in terms of ERP measures. In addition, lexical-semantic processing elicited a pronounced power increase in a different frequency range that was independent of the experimental context. For antonyms (Experiments 2 and 3), the task-related positive component appeared almost simultaneously with the N400 deflection for non-related words, thereby giving rise to a substantial N400 effect. In contrast, for pseudowords (Experiment 5), this positivity appeared in temporal succession to the N400. In sum, the present results provide converging evidence that N400 effects should not be regarded as functionally uniform. Depending on the respective task and stimulus manipulations, the N400 effect appears as a result of the superposition of functionally different activities, which can be clearly distinguished in terms of their underlying frequency characteristics. In this way, the proposed frequency-based methods directly bear upon the interpretation of language-related ERP effects and thus have straightforward consequences for psycholinguistic theory. Given that language-related processes have, in a number of cases, been attributed to the lexical-semantic processing domain solely on the basis of an observed N400, these results call for a reinterpretation not only of previous findings but also of their theoretical consequences.
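
    The measures the thesis builds on can be made concrete with a short sketch. The Python code below computes, on synthetic single-trial data, the three quantities the abstract contrasts: whole (total) power, evoked (phase-locked) power, and inter-trial phase locking. The bandpass-filter-plus-Hilbert route to the per-band analytic signal is one common choice, assumed here for illustration; it is not necessarily the exact pipeline used in the thesis.

        # "Whole" power averages single-trial power, so it retains activity
        # that is time- but not phase-locked; "evoked" power is the power of
        # the across-trial average, keeping only phase-locked activity; the
        # phase-locking value measures phase consistency across trials.
        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        rng = np.random.default_rng(0)
        fs, n_trials, n_samples = 250, 60, 500       # 2 s epochs at 250 Hz
        t = np.arange(n_samples) / fs

        # Synthetic trials: a partly phase-locked 10 Hz component plus noise.
        jitter = rng.uniform(0, 0.25 * np.pi, n_trials)
        trials = (np.sin(2 * np.pi * 10 * t + jitter[:, None])
                  + rng.normal(0, 1.0, (n_trials, n_samples)))

        # Band-limit to 8-12 Hz, then take the analytic signal per trial.
        b, a = butter(4, [8 / (fs / 2), 12 / (fs / 2)], btype="band")
        analytic = hilbert(filtfilt(b, a, trials, axis=1), axis=1)

        total_power = np.mean(np.abs(analytic) ** 2, axis=0)            # whole
        evoked_power = np.abs(np.mean(analytic, axis=0)) ** 2           # evoked
        plv = np.abs(np.mean(np.exp(1j * np.angle(analytic)), axis=0))  # locking

        print(f"mean total power  : {total_power.mean():.3f}")
        print(f"mean evoked power : {evoked_power.mean():.3f}")
        print(f"mean phase locking: {plv.mean():.3f}")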

    The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study

    Carminati MN, Knoeferle P. The Processing of Emotional Sentences by Young and Older Adults: A Visual World Eye-movement Study. Presented at the Architectures and Mechanisms for Language Processing (AMLaP) conference, Riva del Garda, Italy.

    A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

    Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human languages. One of its most challenging aspects involves enabling computers to derive meaning from human natural language. To do so, several meaning or context representations have been proposed with competitive performance. However, these representations still have room for improvement when working in a cross-domain or cross-language scenario. In this thesis we study the use of knowledge graphs as a cross-domain and cross-language representation of text and its meaning. A knowledge graph is a graph that expands and relates the original concepts belonging to a set of words. We obtain its characteristics using a wide-coverage multilingual semantic network as knowledge base. This provides coverage of hundreds of languages and millions of concepts, both general and specific to humans. As the starting point of our research we employ knowledge graph-based features - along with other traditional ones and meta-learning - for the NLP task of single- and cross-domain polarity classification. The analysis and conclusions of that work provide evidence that knowledge graphs capture meaning in a domain-independent way. The next part of our research takes advantage of the multilingual semantic network and focuses on cross-language Information Retrieval (IR) tasks. First, we propose a fully knowledge graph-based model of similarity analysis for cross-language plagiarism detection. Next, we improve that model to cover out-of-vocabulary words and verbal tenses and apply it to cross-language document retrieval, categorisation, and plagiarism detection. Finally, we study the use of knowledge graphs for the NLP tasks of community question answering, native language identification, and language variety identification. The contributions of this thesis demonstrate the potential of knowledge graphs as a cross-domain and cross-language representation of text and its meaning for NLP and IR tasks. These contributions have been published in several international conferences and journals.
    Franco Salvador, M. (2017). A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/84285
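
    A toy sketch may help make the core idea concrete: words from documents in different languages map into a shared concept space via a multilingual semantic network, each concept set is expanded one hop through the network into a small knowledge graph, and documents are compared by concept overlap. The miniature inventory, relations, and plain Jaccard similarity below are invented stand-ins for the wide-coverage network and the thesis's actual similarity model.

        # word (any language) -> concept ids; invented illustrative inventory
        SENSES = {
            "bank": {"C:finance"}, "banco": {"C:finance"},
            "loan": {"C:loan"}, "préstamo": {"C:loan"},
            "interest": {"C:interest"}, "interés": {"C:interest"},
        }
        # concept -> directly related concepts in the semantic network
        RELATED = {
            "C:finance": {"C:loan", "C:interest"},
            "C:loan": {"C:finance", "C:interest"},
            "C:interest": {"C:finance"},
        }

        def knowledge_graph(words):
            """Concepts for the words, expanded one hop through the network."""
            concepts = set().union(*(SENSES.get(w, set()) for w in words))
            expanded = set(concepts)
            for c in concepts:
                expanded |= RELATED.get(c, set())
            return expanded

        def similarity(words_a, words_b):
            ga, gb = knowledge_graph(words_a), knowledge_graph(words_b)
            return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

        # English vs. Spanish fragments about the same topic land in the same
        # concept space, so similarity is high despite zero word overlap.
        print(similarity(["bank", "loan"], ["banco", "préstamo"]))  # -> 1.0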

    Harnessing sense-level information for semantically augmented knowledge extraction

    Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge. This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, however, obtaining explicit sense-level information is a very demanding task in its own right, and one that can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text. In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e. Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performance on standard benchmarks against both manual and semi-automatic competitors. In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how to best exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically, and unified with respect to a common sense inventory. Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.
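
    To make the sense-level unification step concrete, the sketch below maps surface OIE triples onto a shared sense inventory so that extractions differing only in surface form collapse onto one sense-level triple. The hand-made lookup table stands in for the real Word Sense Disambiguation and Entity Linking the thesis employs; all names and sense ids are illustrative only.

        # Surface phrase -> sense id; a toy stand-in for a disambiguation step.
        from collections import defaultdict

        SENSE_OF = {
            "acquired": "buy.v.01", "bought": "buy.v.01", "purchased": "buy.v.01",
            "Google": "Google.e", "the search giant": "Google.e",
            "YouTube": "YouTube.e",
        }

        def to_sense_triple(subj, rel, obj):
            """Map a surface OIE triple onto the shared sense inventory."""
            return tuple(SENSE_OF.get(x, x) for x in (subj, rel, obj))

        extractions = [
            ("Google", "acquired", "YouTube"),
            ("the search giant", "bought", "YouTube"),
            ("Google", "purchased", "YouTube"),
        ]

        # Heterogeneous surface triples unify under one sense-level key.
        unified = defaultdict(list)
        for triple in extractions:
            unified[to_sense_triple(*triple)].append(triple)

        for sense_triple, surface in unified.items():
            print(sense_triple, "<-", surface)
        # All three collapse onto ('Google.e', 'buy.v.01', 'YouTube.e').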

    Semantic radical consistency and character transparency effects in Chinese: an ERP study

    BACKGROUND: This event-related potential (ERP) study aims to investigate the representation and temporal dynamics of Chinese orthography-to-semantics mappings by simultaneously manipulating character transparency and semantic radical consistency. Character components, referred to as radicals, make up the building blocks used dur…


    Examining the learning burden and decay of second language vocabulary knowledge

    Research in second language (L2) vocabulary learning has shown that not all words are equally easy to learn, and that several factors affect the difficulty with which words are acquired, i.e., their learning burden. However, research to date has explored only a few of the many factors affecting learning burden, and existing findings are inconclusive. Another important finding in the L2 vocabulary learning literature is that L2 lexical knowledge is forgotten after learning but, to date, there has been minimal investigation of the variables that influence lexical decay. It has also been assumed that the lexical items most difficult to acquire are those easiest to forget, pointing towards a positive relationship between learning burden and decay (Webb & Nation, 2017). However, there is currently limited empirical evidence to support this assumption. This thesis reports research undertaken to explore the effect of different variables on learning burden and lexical decay, and the relationship between burden and decay. It consists of three empirical studies that investigated the effect of intralexical (i.e., part of speech (PoS), word length), contextual (i.e., meaning presentation code, form presentation mode), and individual (i.e., perceived target item usefulness, language learning aptitude) factors on the learning burden and decay of vocabulary knowledge that was intentionally learned with flashcard software. Each study also considered the effect of learning burden on lexical decay. Additionally, a cross-study analysis was conducted to explore the effect of retention interval length on decay. The empirical studies showed that word length, aspects of language learning aptitude, and form presentation mode affected learning burden but not decay, with shorter words, higher associative memory capacity, and bimodal form presentation associated with less burden. Perceived target item usefulness was found to have no effect on burden or decay. Meaning presentation code and PoS were found to affect both burden and decay: lexical items presented with an L2 definition, and verbs, were more burdensome and more likely to decay than items presented with an L1 equivalent, and nouns. The findings also indicated that greater learning burden was associated with a higher likelihood of decay. The cross-study analysis showed that decay was not directly proportional to retention interval length and that form recall knowledge was more susceptible to decay than form recognition. Additionally, this thesis explores implications for vocabulary research and L2 pedagogy.