Compositional Approaches for Representing Relations Between Words: A Comparative Study
Identifying the relations that exist between words (or entities) is important for various natural language processing tasks such as relational search, noun-modifier classification, and analogy detection. A popular approach to representing the relation between a pair of words is to extract from a corpus the patterns in which the two words co-occur, and to assign each word pair a vector of pattern frequencies. Despite its simplicity, this approach suffers from data sparseness, information scalability, and linguistic creativity, because the model cannot handle word pairs unseen in the corpus. In contrast, a compositional approach to representing relations between words overcomes these issues by using the attributes of each individual word to indirectly compose a representation of the common relations that hold between the two words. This study compares different operations for creating relation representations from word-level representations. We investigate the performance of the compositional methods by measuring relational similarity on several benchmark word-analogy datasets. Moreover, we evaluate the different relation representations on a knowledge base completion task.
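A minimal sketch of the idea, using a few compositional operators that are common choices in this line of work (pair difference, elementwise product, concatenation); the word vectors and pair names are illustrative stand-ins, not the paper's data:

```python
import numpy as np

def compose(a: np.ndarray, b: np.ndarray, op: str) -> np.ndarray:
    """Compose a representation of the relation between words a and b."""
    if op == "diff":     # PairDiff: the offset used in analogy arithmetic
        return b - a
    if op == "mult":     # elementwise product of the two word vectors
        return a * b
    if op == "concat":   # concatenation of the two word vectors
        return np.concatenate([a, b])
    raise ValueError(f"unknown operator: {op}")

def relational_similarity(pair1, pair2, op: str = "diff") -> float:
    """Cosine similarity between two composed word-pair representations."""
    r1 = compose(*pair1, op)
    r2 = compose(*pair2, op)
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

# e.g. with pretrained vectors: relational_similarity((v_man, v_king),
#                                                     (v_woman, v_queen))
```

A high relational similarity under a given operator indicates that the two word pairs stand in a similar relation, which is what the analogy benchmarks measure.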
Learning Linear Transformations between Counting-based and Prediction-based Word Embeddings
Despite the growing interest in prediction-based word embedding learning methods, it remains unclear how the vector spaces learnt by prediction-based methods differ from those of counting-based methods, or whether one can be transformed into the other. To study the relationship between counting-based and prediction-based embeddings, we propose a method for learning a linear transformation between two given sets of word embeddings. Our proposal contributes to word embedding learning research in three ways: (a) we propose an efficient method to learn a linear transformation between two sets of word embeddings; (b) using the transformation learnt in (a), we empirically show that it is possible to predict distributed embeddings for novel, unseen words; and (c) we show empirically that counting-based embeddings can be linearly transformed into prediction-based embeddings, for frequent words, different POS categories, and varying degrees of ambiguity.
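A minimal sketch of learning such a map, assuming both embedding sets are stacked row-wise over a shared vocabulary; ordinary least squares is used here for illustration, and the paper's actual training objective may differ:

```python
import numpy as np

def learn_linear_map(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Find M minimising ||X M - Y||_F, where row i of X is the
    counting-based embedding of word i and row i of Y is its
    prediction-based embedding. Shapes: X (n, d1), Y (n, d2)."""
    M, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return M  # shape (d1, d2)

# Given M, an embedding for a word unseen in the target space can be
# predicted from its counting-based vector x_new:  y_hat = x_new @ M
```

The unseen-word prediction in contribution (b) follows directly: any word with a counting-based vector can be projected into the prediction-based space through M.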
Evaluating Co-reference Chains based Conversation History in Conversational Question Answering
This paper examines the effect of using co-reference-chain-based conversation history, as opposed to the entire conversation history, in the conversational question answering (CoQA) task. The QANet model is modified to include conversation history, and NeuralCoref is used to obtain co-reference-chain-based conversation history. The results indicate that, despite the availability of a large proportion of co-reference links in CoQA, the abstract nature of its questions makes it difficult to map co-reference-related conversation history correctly, resulting in lower performance than systems that use the entire conversation history. Examining the effect of co-reference resolution across domains and conversation lengths shows that co-reference resolution across questions is helpful for certain domains and for medium-length conversations.
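A hedged sketch of the selection idea using NeuralCoref (a spaCy 2.x extension): keep only those past turns that share a co-reference chain with the current question. The turn-boundary bookkeeping below is approximate (it assumes whitespace joining preserves tokenization), and the paper's exact selection rule may differ:

```python
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # adds doc._.coref_clusters

def coref_history(turns: list, question: str) -> list:
    """Return only past turns co-referent with the current question."""
    doc = nlp(" ".join(turns + [question]))
    # approximate token span of each turn in the joint document
    bounds, total = [], 0
    for t in turns + [question]:
        n = len(nlp.tokenizer(t))
        bounds.append((total, total + n))
        total += n
    q_lo, q_hi = bounds[-1]
    keep = set()
    for cluster in doc._.coref_clusters or []:
        # does any mention of this chain fall inside the question?
        if any(q_lo <= m.start < q_hi for m in cluster.mentions):
            for m in cluster.mentions:
                for i, (lo, hi) in enumerate(bounds[:-1]):
                    if lo <= m.start < hi:  # mention lies in history turn i
                        keep.add(i)
    return [turns[i] for i in sorted(keep)]
```

The reduced history replaces the full conversation history in the model input; the paper's finding is that this mapping is often wrong when questions are abstract.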
Transductive Learning with String Kernels for Cross-Domain Text Classification
For many text classification tasks, there is a major problem posed by the lack of labeled data in a target domain. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification or automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.
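A minimal sketch of a character p-spectrum string kernel, one of the kernel family typically used in this line of work; the paper's transductive step additionally adapts the kernel using the unlabeled test documents:

```python
from collections import Counter

def spectrum_features(text: str, p: int = 3) -> Counter:
    """Counts of all contiguous character p-grams in the text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def spectrum_kernel(s: str, t: str, p: int = 3) -> int:
    """K(s, t) = sum over shared p-grams g of count_s(g) * count_t(g)."""
    fs, ft = spectrum_features(s, p), spectrum_features(t, p)
    return sum(c * ft[g] for g, c in fs.items() if g in ft)

# A precomputed kernel matrix over train + test documents can be fed to
# a kernel classifier, e.g. sklearn.svm.SVC(kernel="precomputed").
```

Because the kernel operates on raw character n-grams rather than domain-specific vocabulary, it transfers more gracefully across domains, which is the robustness property the transductive procedure exploits.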
Is Something Better than Nothing? Automatically Predicting Stance-based Arguments Using Deep Learning and Small Labelled Dataset
Online reviews have become a popular resource for customers making decisions about purchasing products. A number of review corpora have been widely investigated in NLP in general and, in particular, in argument mining, the subfield of NLP that deals with extracting arguments, and the relations among them, from user-generated content. A major problem faced by argument mining research is the lack of human-annotated data. In this paper, we investigate the use of weakly supervised and semi-supervised methods for automatically annotating data, and thus providing large annotated datasets. We do this by building on previous work that explores the classification of opinions present in reviews based on whether the stance is expressed explicitly or implicitly. In the work described here, we automatically annotate stance as implicit or explicit, and our results show that the datasets we generate, although noisy, can be used to learn better models for implicit/explicit opinion classification.
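A hedged sketch of what a weak labelling step for implicit vs. explicit stance could look like; the cue list below is a toy heuristic invented for illustration, not the paper's annotation procedure:

```python
# Overt recommendation phrases taken as a (hypothetical) signal of
# explicitly expressed stance.
EXPLICIT_CUES = ("i recommend", "would not recommend", "don't buy",
                 "must buy", "avoid this")

def weak_label(review: str) -> str:
    """Label stance 'explicit' if an overt recommendation cue occurs,
    else 'implicit'. Labels are noisy but cheap to produce at scale."""
    text = review.lower()
    return "explicit" if any(cue in text for cue in EXPLICIT_CUES) else "implicit"
```

Labels produced this way can then supervise a standard text classifier; the paper's finding is that such noisy large-scale annotation still improves implicit/explicit classification over using the small labelled set alone.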
Tick parasitism classification from noisy medical records
Much of the health information in the medical domain comes in the form of clinical narratives. The rich semantic information contained in these notes can be modeled to make inferences that assist medical practitioners' decision making, which is particularly important under time and resource constraints. However, building such assistive tools is made difficult by the ubiquity of misspellings, unsegmented words, and morphologically complex or rare medical terms, which reduce the vocabulary coverage of the commonly used pretrained distributed word representations that are passed as input to the parametric models making such predictions. This paper presents an ensemble architecture that combines in-domain and general word embeddings to overcome these challenges, showing the best performance on a binary classification task when compared to various other baselines. We demonstrate our approach in the veterinary domain, on the task of identifying tick parasitism in small animals. The best model achieves 84.29% test accuracy, an improvement over models that use only pretrained embeddings not specifically trained for the medical sub-domain of interest.
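A minimal sketch of one simple way to combine general and in-domain embeddings, by concatenation with a zero-vector fallback for out-of-vocabulary tokens; the paper's ensemble may combine the representations differently:

```python
import numpy as np

def combined_vector(token: str, general: dict, clinical: dict,
                    d_gen: int, d_dom: int) -> np.ndarray:
    """general/clinical map token -> np.ndarray of dims d_gen/d_dom.
    OOV tokens fall back to zeros in the space that lacks them."""
    g = general.get(token, np.zeros(d_gen))
    c = clinical.get(token, np.zeros(d_dom))
    return np.concatenate([g, c])  # shape (d_gen + d_dom,)
```

The motivation is that a misspelled or rare clinical term missing from the general embeddings may still be covered by the in-domain embeddings, and vice versa, so the combined input degrades more gracefully than either alone.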
Correcting crowdsourced annotations to improve detection of outcome types in evidence based medicine
The validity and authenticity of annotations in datasets strongly influence the performance of Natural Language Processing (NLP) systems: poorly annotated datasets are likely to produce flawed results in most NLP problems, misinforming the consumers of these models, systems, or applications. This is a bottleneck in most domains, especially in healthcare, where crowdsourcing is a popular strategy for obtaining annotations. In this paper, we present a framework that automatically corrects incorrectly captured annotations of outcomes, thereby improving the quality of the crowdsourced annotations. We investigate a publicly available dataset called EBM-NLP, built to power NLP tasks in support of Evidence Based Medicine (EBM), primarily focusing on health outcomes.
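A hedged illustration of one generic correction heuristic of this kind: flip a crowd label only when a cross-validated model contradicts it with high confidence. This is a stand-in for the idea of automatic label correction, not a reproduction of the paper's framework:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def correct_labels(X: np.ndarray, y: np.ndarray,
                   threshold: float = 0.95) -> np.ndarray:
    """X: feature matrix, y: integer crowd labels (0..k-1). Relabel an
    example only when the out-of-fold predicted probability of another
    class exceeds the threshold."""
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    pred = proba.argmax(axis=1)          # index == label for 0..k-1 labels
    confident = proba.max(axis=1) >= threshold
    return np.where(confident & (pred != y), pred, y)
```

Out-of-fold predictions are used so that no example is judged by a model that saw its own (possibly wrong) label during training.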