
    Representation Learning for Natural Language Processing

    This open access book provides an overview of the recent advances in representation learning theory, algorithms and applications for natural language processing (NLP). It is divided into three parts. Part I presents the representation learning techniques for multiple language entries, including words, phrases, sentences and documents. Part II then introduces the representation techniques for those objects that are closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques, and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented can also benefit other related domains such as machine learning, social network analysis, semantic Web, information retrieval, data mining and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing.

    Delving into the uncharted territories of Word Sense Disambiguation

    The automatic disambiguation of word senses, i.e. Word Sense Disambiguation, is a long-standing task in the field of Natural Language Processing; an AI-complete problem that took its first steps more than half a century ago, and which, to date, has apparently attained human-like performance on standard evaluation benchmarks. Unfortunately, the steady evolution that the task experienced over time in terms of sheer performance has not been accompanied by adequate theoretical support, nor by careful error analysis. Furthermore, we believe that the lack of an exhaustive bird’s-eye view which accounts for the sort of high-end and unrealistic computational architectures that systems will soon need in order to further refine their performance could lead the field to a dead end in a few years. In essence, taking advantage of the current moment of great accomplishments and renewed interest in the task, we argue that Word Sense Disambiguation is mature enough for researchers to really observe the extent of the results hitherto obtained, evaluate what is actually missing, and answer the much sought-after question: “are current state-of-the-art systems really able to effectively solve lexical ambiguity?” Driven by the desire to become both architects and participants in this period of pondering, we have identified a few macro-areas representative of the challenges of automatic disambiguation. From this point of view, in this thesis, we propose experimental solutions and empirical tools so as to bring to the attention of the Word Sense Disambiguation community unusual and unexplored points of view. We hope these will represent a new perspective through which to best observe the current state of disambiguation, as well as to foresee future paths for the task to evolve on.
Specifically, 1q) prompted by the growing concern about the rise in performance being closely linked to the demand for more and more unrealistic computational architectures in all areas of application of Deep Learning-related techniques, we 1a) provide evidence for the undisclosed potential of approaches based on knowledge-bases, via the exploitation of syntagmatic information. Moreover, 2q) driven by the dissatisfaction with the use of cognitively-inaccurate, finite inventories of word senses in Word Sense Disambiguation, we 2a) introduce an approach based on Definition Modeling paradigms to generate contextual definitions for target words and phrases, hence going beyond the limits set by specific lexical-semantic inventories. Finally, 3q) moved by the desire to analyze the real implications behind the idea of “machines performing disambiguation on par with their human counterparts”, we 3a) put forward a detailed analysis of the shared errors affecting current state-of-the-art systems based on diverse approaches for Word Sense Disambiguation, and highlight, by means of a novel evaluation dataset tailored to represent common and critical issues shared by all systems, performance far lower than that usually reported in the current literature.

    Semantic similarity framework for Thai conversational agents

    Conversational Agents integrate computational linguistics techniques and natural language to support human-like communication with complex computer systems. There are a number of applications in business, education and entertainment, including unmanned call centres and personal shopping or navigation assistants. Initial research has been performed on Conversational Agents in languages other than English. There has been no significant publication on Thai Conversational Agents. Moreover, no research has been conducted on supporting algorithms for Thai word similarity measures and Thai sentence similarity measures. Consequently, this thesis details the development of a novel Thai sentence semantic similarity measure that can be used to create a Thai Conversational Agent. This measure, the Thai Sentence Semantic Similarity measure (TSTS), is inspired by the seminal English measure, Sentence Similarity based on Semantic Nets and Corpus Statistics (STASIS). A Thai sentence benchmark dataset, called the 65 Thai Sentence pairs benchmark dataset (TSS-65), is also presented in this thesis for the evaluation of TSTS. The research starts with the development of a simple Thai word similarity measure called TWSS. Additionally, a novel word measure called a Semantic Similarity Measure, based on a Lexical Chain Created from a Search Engine (LCSS), is also proposed, using a search engine to create the knowledge base instead of WordNet. LCSS overcomes the problem that a prototype version of the Thai Word Semantic Similarity measure (TWSS) has with word pairs that are related to Thai culture. Thai word benchmark datasets, the 30 Thai Word Pair benchmark dataset (TWS-30) and the 65 Thai Word Pair benchmark dataset (TWS-65), are also presented for the evaluation of TWSS and LCSS, respectively. The result of TSTS is considered a starting point for a Thai sentence measure which can be used to create semantic-based Conversational Agents in the future.
This is illustrated using a small sample of real English Conversational Agent human dialogue utterances translated into Thai.

    Language representations for computational argumentation

    Argumentation is an essential feature and, arguably, one of the most exciting phenomena of natural language use. Accordingly, it has long fascinated scholars and researchers in various fields, such as linguistics and philosophy. Its computational analysis, falling under the notion of computational argumentation, is useful in a variety of domains of text for a range of applications. For instance, it can help to understand users’ stances in online discussion forums towards certain controversies, to provide targeted feedback to users for argumentative writing support, and to automatically summarize scientific publications. As in all natural language processing pipelines, the text we would like to analyze has to be introduced to computational argumentation models in the form of numeric features. Choosing such suitable semantic representations is considered a core challenge in natural language processing. In this context, research employing static and contextualized pretrained text embedding models has recently been shown to reach state-of-the-art performance for a range of natural language processing tasks. However, previous work has noted the specific difficulty of computational argumentation scenarios with language representations as one of the main bottlenecks and called for targeted research on the intersection of the two fields. Still, the efforts focusing on the interplay between computational argumentation and representation learning have been few and far between. This is despite (a) the fast-growing body of work in both computational argumentation and representation learning in general and (b) the fact that some of the open challenges are well known in the natural language processing community. In this thesis, we address this research gap and acknowledge the specific importance of research on the intersection of representation learning and computational argumentation.
To this end, we (1) identify a series of challenges driven by inherent characteristics of argumentation in natural language and (2) present new analyses, corpora, and methods to address and mitigate each of the identified issues. Concretely, we focus on five main challenges pertaining to the current state-of-the-art in computational argumentation: (C1) External knowledge: static and contextualized language representations encode distributional knowledge only. We propose two approaches to complement this knowledge with knowledge from external resources. First, we inject lexico-semantic knowledge through an additional prediction objective in the pretraining stage. In a second study, we demonstrate how to inject conceptual knowledge post hoc employing the adapter framework. We show the effectiveness of these approaches on general natural language understanding and argumentative reasoning tasks. (C2) Domain knowledge: pretrained language representations are typically trained on big and general-domain corpora. We study the trade-off between employing such large and general-domain corpora versus smaller and domain-specific corpora for training static word embeddings which we evaluate in the analysis of scientific arguments. (C3) Complementarity of knowledge across tasks: many computational argumentation tasks are interrelated but are typically studied in isolation. In two case studies, we show the effectiveness of sharing knowledge across tasks. First, based on a corpus of scientific texts, which we extend with a new annotation layer reflecting fine-grained argumentative structures, we show that coupling the argumentative analysis with other rhetorical analysis tasks leads to performance improvements for the higher-level tasks. In the second case study, we focus on assessing the argumentative quality of texts. To this end, we present a new multi-domain corpus annotated with ratings reflecting different dimensions of argument quality. 
We then demonstrate the effectiveness of sharing knowledge across the different quality dimensions in multi-task learning setups. (C4) Multilinguality: argumentation arguably exists in all cultures and languages around the globe. To foster inclusive computational argumentation technologies, we dissect the current state-of-the-art in zero-shot cross-lingual transfer. We show big drops in performance when it comes to resource-lean and typologically distant target languages. Based on this finding, we analyze the reasons for these losses and propose to move to inexpensive few-shot target-language transfer, leading to consistent performance improvements in higher-level semantic tasks, e.g., argumentative reasoning. (C5) Ethical considerations: envisioned computational argumentation applications, e.g., systems for self-determined opinion formation, are highly sensitive. We first discuss which ethical aspects should be considered when representing natural language for computational argumentation tasks. Focusing on the issue of unfair stereotypical bias, we then conduct a multi-dimensional analysis of the amount of bias in monolingual and cross-lingual embedding spaces. In the next step, we devise a general framework for implicit and explicit bias evaluation and debiasing. Employing intrinsic bias measures and benchmarks reflecting the semantic quality of the embeddings, we demonstrate the effectiveness of the new debiasing methods we propose. Finally, we complement this analysis by testing the original as well as the debiased language representations for stereotypically unfair bias in argumentative inferences. We hope that our contributions in language representations for computational argumentation fuel more research on the intersection of the two fields and contribute to fair, efficient, and effective natural language processing technologies.