Revisiting Unsupervised Relation Extraction
Unsupervised relation extraction (URE) extracts relations between named
entities from raw text without manually labelled data or existing knowledge
bases (KBs). URE methods can be categorised into generative and discriminative
approaches, which rely on either hand-crafted features or surface forms.
However, we demonstrate that by using only named entities to induce relation
types, we can outperform existing methods on two popular datasets. We compare
and evaluate our findings against other URE techniques to ascertain which
features are important in URE. We conclude that entity types provide
a strong inductive bias for URE.

Comment: 8 pages, 1 figure, 2 tables. Accepted in ACL 202
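To make the entity-type inductive bias concrete, here is a minimal sketch of inducing relation types from argument entity types alone; the corpus, type labels, and the trivial pair-based clustering are invented for illustration, not the authors' actual model.

```python
# Hypothetical sketch: induce relation "types" from entity-type pairs alone.
# Each instance carries (head_entity_type, tail_entity_type); the induced
# relation cluster is simply the ordered type pair, with no lexical features.
from collections import defaultdict

instances = [
    ("PERSON", "ORG", "Tim Cook leads Apple."),
    ("PERSON", "ORG", "Satya Nadella joined Microsoft."),
    ("ORG",    "LOC", "Apple is headquartered in Cupertino."),
    ("PERSON", "LOC", "Marie Curie was born in Warsaw."),
]

clusters = defaultdict(list)
for head_type, tail_type, sentence in instances:
    # The induced relation type is just the ordered entity-type pair.
    clusters[(head_type, tail_type)].append(sentence)

for relation_type, sentences in clusters.items():
    print(relation_type, "->", sentences)
```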
Modelling Instance-Level Annotator Reliability for Natural Language Labelling Tasks
When constructing models that learn from noisy labels produced by multiple
annotators, it is important to accurately estimate the reliability of
annotators. Annotators may provide labels of inconsistent quality due to their
varying expertise and reliability in a domain. Previous studies have mostly
focused on estimating each annotator's overall reliability on the entire
annotation task. However, in practice, the reliability of an annotator may
depend on each specific instance. Only a limited number of studies have
investigated modelling per-instance reliability and these only considered
binary labels. In this paper, we propose an unsupervised model which can handle
both binary and multi-class labels. It can automatically estimate the
per-instance reliability of each annotator and the correct label for each
instance. We specify our model as a probabilistic model which incorporates
neural networks to model the dependency between latent variables and instances.
For evaluation, the proposed method is applied to both synthetic and real data,
including two labelling tasks: text classification and textual entailment.
Experimental results demonstrate that our method can not only accurately
estimate the reliability of annotators across different instances, but also
achieve superior performance in predicting the correct labels and detecting the
least reliable annotators, compared to state-of-the-art baselines.

Comment: 9 pages, 1 figure, 10 tables, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
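For intuition about aggregating noisy multi-annotator labels, here is a compact Dawid-Skene-style EM baseline; note this is the classic per-annotator model, not the paper's per-instance neural model, and the label matrix below is made up.

```python
# Dawid-Skene-style EM: jointly infer true labels and per-annotator
# confusion matrices from noisy multi-class annotations (toy data).
import numpy as np

# labels[i][a] = label from annotator a on instance i (fully observed here)
labels = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])
n_items, n_annot = labels.shape
n_classes = labels.max() + 1

# Initialise per-item class posteriors from majority vote.
post = np.zeros((n_items, n_classes))
for i in range(n_items):
    for a in range(n_annot):
        post[i, labels[i, a]] += 1
post /= post.sum(axis=1, keepdims=True)

for _ in range(20):
    # M-step: class priors and per-annotator confusion matrices.
    prior = post.mean(axis=0)
    conf = np.full((n_annot, n_classes, n_classes), 1e-6)
    for a in range(n_annot):
        for i in range(n_items):
            conf[a, :, labels[i, a]] += post[i]
    conf /= conf.sum(axis=2, keepdims=True)

    # E-step: recompute posteriors given current parameters.
    for i in range(n_items):
        logp = np.log(prior)
        for a in range(n_annot):
            logp += np.log(conf[a, :, labels[i, a]])
        p = np.exp(logp - logp.max())
        post[i] = p / p.sum()

print("inferred labels:", post.argmax(axis=1))
print("annotator reliability (confusion diagonals):",
      [np.diag(conf[a]).round(2) for a in range(n_annot)])
```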
Disambiguating the species of biomedical named entities using natural language parsers
Motivation: Text mining technologies have been shown to reduce the laborious work involved in organizing the vast amount of information hidden in the literature. One challenge in text mining is linking ambiguous word forms to unambiguous biological concepts. This article reports on a comprehensive study on resolving the ambiguity in mentions of biomedical named entities with respect to model organisms and presents an array of approaches, with a focus on methods utilizing natural language parsers.
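As a small illustration of how a parser can feed species disambiguation, the sketch below collects syntactic context around an entity mention; spaCy stands in for the parsers studied in the article, and the feature set and example sentence are invented.

```python
# Hypothetical sketch: use a dependency parser to extract syntactic context
# around a biomedical entity mention as features for species disambiguation.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def parse_features(sentence: str, mention: str) -> dict:
    doc = nlp(sentence)
    for token in doc:
        if token.text == mention:
            return {
                "dep": token.dep_,          # grammatical role of the mention
                "head": token.head.lemma_,  # governing word, often a verb
                "children": [c.lemma_ for c in token.children],
            }
    return {}

# "p53" could denote the human or the mouse gene; surrounding syntax such as
# "in murine cells" is the kind of cue a downstream classifier can exploit.
print(parse_features("We measured p53 expression in murine cells.", "p53"))
```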
ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization
The performance of abstractive text summarization has recently been greatly
boosted by pre-trained language models. The main concern with existing
abstractive summarization methods is the factual inconsistency of their
generated summaries. To alleviate this problem, many efforts have focused on
developing effective factuality evaluation metrics based on natural language
inference, question answering, and related techniques. However, these metrics
suffer from high computational complexity and a reliance on annotated
data. Most recently, large language models
such as ChatGPT have shown strong ability in not only natural language
understanding but also natural language inference. In this paper, we study the
factual inconsistency evaluation ability of ChatGPT under the zero-shot setting
by evaluating it on the coarse-grained and fine-grained factuality evaluation
tasks including binary natural language inference (NLI), summary ranking, and
consistency rating. Experimental results show that ChatGPT outperforms previous
SOTA evaluation metrics on 6/9 datasets across three tasks, demonstrating its
great potential for assessing factual inconsistency in the zero-shot setting.
The results also highlight the importance of prompt design and the need for
future efforts to address ChatGPT's limitations on evaluation bias, wrong
reasoning, and hallucination.

Comment: ongoing work, 12 pages, 4 figures
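A minimal sketch of the binary-NLI style consistency check the abstract describes, assuming the openai Python client; the prompt wording, model name, and answer parsing are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: zero-shot factual-consistency check framed as binary NLI.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def is_consistent(document: str, summary: str) -> bool:
    prompt = (
        "Decide if the hypothesis is entailed by the premise. "
        "Answer with exactly 'yes' or 'no'.\n\n"
        f"Premise: {document}\n\nHypothesis: {summary}\n\nAnswer:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for evaluation
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

print(is_consistent(
    "The company reported a 10% rise in quarterly revenue.",
    "Revenue fell by 10% last quarter.",
))  # expected: False
```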
A method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data
BACKGROUND: We consider the user task of designing clinical trial protocols and propose a method that discovers and outputs the most appropriate eligibility criteria from a potentially huge set of candidates. Each document d in our collection D is a clinical trial protocol which itself contains a set of eligibility criteria. Given a small set of sample documents D' ⊂ D that a user has initially identified as relevant, e.g., via a user query interface, our scoring method automatically suggests eligibility criteria from D by ranking them according to how appropriate they are to the clinical trial protocol currently being designed. Appropriateness is measured by the degree to which a criterion is consistent with the user-supplied sample documents D'.

METHOD: We propose a novel three-step method called LDALR which views documents as a mixture of latent topics. First, we infer the latent topics in the sample documents using Latent Dirichlet Allocation (LDA). Next, we use logistic regression models to compute the probability that a given candidate criterion belongs to a particular topic. Lastly, we score each criterion by computing its expected value: the probability-weighted sum of the topic proportions inferred from the set of sample documents. Intuitively, the greater the probability that a candidate criterion belongs to the topics that are dominant in the samples, the higher its expected value or score.

RESULTS: Our experiments have shown that LDALR is 8 and 9 times better (respectively, for inclusion and exclusion criteria) than randomly choosing from a set of candidates obtained from relevant documents. In user simulation experiments using LDALR, we were able to automatically construct eligibility criteria that are on average 75% and 70% similar (respectively, for inclusion and exclusion criteria) to the correct eligibility criteria.

CONCLUSIONS: We have proposed LDALR, a practical method for discovering and inferring appropriate eligibility criteria in clinical trial protocols without labeled data. Results from our experiments suggest that LDALR models can be used to effectively find appropriate eligibility criteria from a large repository of clinical trial protocols.
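A condensed sketch of the three LDALR steps on toy data, assuming scikit-learn; the corpora, hyperparameters, and the soft-label trick used to train the logistic regression from LDA's own doc-topic estimates are illustrative assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch of LDALR's three steps: LDA topics, logistic regression
# for P(topic | criterion), and expected-value scoring. Toy data throughout.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

protocols = [
    "patients with type 2 diabetes and elevated hba1c",
    "adults with hypertension on stable medication",
    "children with asthma requiring inhaled steroids",
    "patients with diabetes excluded if pregnant",
]
sample_docs = [protocols[0], protocols[3]]          # user-identified D'
candidates = ["hba1c above 7 percent", "history of asthma exacerbation"]

vec = CountVectorizer()
X = vec.fit_transform(protocols)

# Step 1: infer latent topics and the topic proportions theta of D'.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
theta = lda.transform(vec.transform(sample_docs)).mean(axis=0)

# Step 2: logistic regression for P(topic | criterion). Soft-label trick:
# repeat each protocol once per topic, weighted by its LDA topic probability.
doc_topic = lda.transform(X)
n_docs, n_topics = doc_topic.shape
X_rep = X[np.repeat(np.arange(n_docs), n_topics)]
y_rep = np.tile(np.arange(n_topics), n_docs)
lr = LogisticRegression(max_iter=1000).fit(
    X_rep, y_rep, sample_weight=doc_topic.ravel())
p_topic = lr.predict_proba(vec.transform(candidates))

# Step 3: score = expected value, the probability-weighted topic proportions.
scores = p_topic @ theta
for criterion, score in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print(f"{score:.3f}  {criterion}")
```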