Document-Level Machine Translation Quality Estimation
Assessing Machine Translation (MT) quality at document level is a challenge, as metrics need to account for many linguistic phenomena at different levels. Large units of text encompass different linguistic phenomena and, as a consequence, a machine-translated document can have different problems. It is hard for humans to evaluate documents regarding document-wide phenomena (e.g. coherence) as they are easily distracted by problems at other levels (e.g. grammar). Although standard automatic evaluation metrics (e.g. BLEU) are often used for this purpose, they focus on n-gram matches and often disregard document-wide information. Therefore, although such metrics are useful to compare different MT systems, they may not reflect nuances of quality in individual documents.
Machine-translated documents can also be evaluated according to the task they will be used for. Methods based on measuring the distance between machine translations and post-edited machine translations are widely used for task-based purposes. Another task-based method is to use reading comprehension questions about the machine-translated document as a proxy for document quality. Quality Estimation (QE) is an evaluation approach that attempts to predict the quality of MT outputs using trained Machine Learning (ML) models. This method is robust because it can consider any type of quality assessment for building the QE models. Thus far, for document-level QE, BLEU-style metrics have been used as quality labels, leading to unreliable predictions, as document information is neglected. Challenges of document-level QE encompass the choice of adequate labels for the task, the use of appropriate features for the task and the study of appropriate ML models.
In this thesis we focus on feature engineering, the design of quality labels and the use of ML methods for document-level QE. Our new features can be classified as document-wide (using shallow document information), discourse-aware (using information about discourse structures) and consensus-based (using other machine translations as pseudo-references). New labels are proposed in order to overcome the lack of reliable labels for document-level QE. Two different approaches are proposed: one aimed at MT for assimilation, with a low quality requirement, and another aimed at MT for dissemination, with a high quality requirement. The assimilation labels use reading comprehension questions as a proxy for document quality. The dissemination approach uses a two-stage post-editing method to derive the quality labels. Different ML techniques are also explored for the document-level QE task, including the appropriate use of regression or classification and the study of kernel combination to deal with features of a different nature (e.g. handcrafted features versus consensus features). We show that, in general, QE models predicting our new labels and using our discourse-aware features are more successful than models predicting automatic evaluation metrics. Regarding ML techniques, no conclusions could be drawn, given that different models performed similarly throughout the different experiments.
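The distance between a machine translation and its post-edited version, mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes whitespace tokenisation and a plain word-level Levenshtein distance normalised by the length of the post-edited text; this is a simplification (metrics such as HTER also allow block shifts), and the function names are hypothetical rather than taken from the thesis.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from hyp
                            curr[j - 1] + 1,      # insert a word from ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def post_edit_distance(mt_text, post_edited_text):
    """Edits needed to turn the MT output into its post-edited version,
    divided by the post-edited length (assumption: whitespace tokens)."""
    mt = mt_text.split()
    pe = post_edited_text.split()
    if not pe:
        return 0.0 if not mt else 1.0
    return word_edit_distance(mt, pe) / len(pe)


# Toy example (invented sentences): one missing word out of six.
score = post_edit_distance("the cat sat on mat", "the cat sat on the mat")
```

A lower score means less post-editing effort was needed, which is why such distances can serve as task-based quality labels.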
Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature
Lay summarisation aims to jointly summarise and simplify a given text, thus
making its content more comprehensible to non-experts. Automatic approaches for
lay summarisation can provide significant value in broadening access to
scientific literature, enabling a greater degree of both interdisciplinary
knowledge sharing and public understanding when it comes to research findings.
However, current corpora for this task are limited in their size and scope,
hindering the development of broadly applicable data-driven approaches. Aiming
to rectify these issues, we present two novel lay summarisation datasets, PLOS
(large-scale) and eLife (medium-scale), each of which contains biomedical
journal articles alongside expert-written lay summaries. We provide a thorough
characterisation of our lay summaries, highlighting differing levels of
readability and abstractiveness between datasets that can be leveraged to
support the needs of different applications. Finally, we benchmark our datasets
using mainstream summarisation approaches and perform a manual evaluation with
domain experts, demonstrating their utility and casting light on the key
challenges of this task.
Comment: 16 pages, 9 figures. Accepted to EMNLP 202
A Quantitative Analysis of Discourse Phenomena in Machine Translation
State-of-the-art Machine Translation (MT) systems translate documents by considering isolated sentences, disregarding information beyond sentence level. As a result, machine-translated documents often contain problems related to discourse coherence and cohesion. Recently, some initiatives in the evaluation and quality estimation of MT outputs have attempted to detect discourse problems in order to assess the quality of these machine translations. However, a quantitative analysis of discourse phenomena in MT outputs is still needed in order to better understand the phenomena and identify possible solutions or ways to improve evaluation. This paper aims to answer the following questions: What is the impact of discourse phenomena on MT quality? Can we quantitatively capture and measure any issues related to discourse in MT outputs? In order to answer these questions, we present a quantitative analysis of several discourse phenomena and correlate the resulting figures with scores from automatic translation quality evaluation metrics. We show that figures related to discourse phenomena present a higher correlation with quality scores than the baseline counts widely used for quality estimation of MT.
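Correlating per-document figures with metric scores, as described in the abstract above, can be sketched as follows. The abstract does not say which correlation coefficient is used; Pearson's r is assumed here, and the feature counts and metric scores are invented toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-document values: a count of some discourse feature
# and an automatic metric score for the same documents.
discourse_counts = [3, 7, 2, 9, 5]
metric_scores = [0.31, 0.22, 0.35, 0.18, 0.27]
r = pearson(discourse_counts, metric_scores)
```

Running the same computation for a baseline feature and comparing the absolute values of r is one way to make the kind of comparison the abstract reports.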
Classifying COVID-19 vaccine narratives
Vaccine hesitancy is widespread, despite the government's information
campaigns and the efforts of the World Health Organisation (WHO). Categorising
the topics within vaccine-related narratives is crucial to understand the
concerns expressed in discussions and identify the specific issues that
contribute to vaccine hesitancy. This paper addresses the need for monitoring
and analysing vaccine narratives online by introducing a novel vaccine
narrative classification task, which categorises COVID-19 vaccine claims into
one of seven categories. Following a data augmentation approach, we first
construct a novel dataset for this new classification task, focusing on the
minority classes. We also make use of fact-checker annotated data. The paper
also presents a neural vaccine narrative classifier that achieves an accuracy
of 84% under cross-validation. The classifier is publicly available for
researchers and journalists.
Comment: In Proceedings of the 14th International Conference on Recent
Advances in Natural Language Processing, 202
Detecting Misinformation with LLM-Predicted Credibility Signals and Weak Supervision
Credibility signals represent a wide range of heuristics that are typically
used by journalists and fact-checkers to assess the veracity of online content.
Automating the task of credibility signal extraction, however, is very
challenging as it requires high-accuracy signal-specific extractors to be
trained, while there are currently no sufficiently large datasets annotated
with all credibility signals. This paper investigates whether large language
models (LLMs) can be prompted effectively with a set of 18 credibility signals
to produce weak labels for each signal. We then aggregate these potentially
noisy labels using weak supervision in order to predict content veracity. We
demonstrate that our approach, which combines zero-shot LLM credibility signal
labeling and weak supervision, outperforms state-of-the-art classifiers on two
misinformation datasets without using any ground-truth labels for training. We
also analyse the contribution of the individual credibility signals towards
predicting content veracity, which provides valuable new insights into their
role in misinformation detection.
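The aggregation step described above can be illustrated with a minimal sketch. The abstract does not specify the weak supervision method, so a simple majority vote with abstentions is swapped in here as a stand-in; the label encoding (1, 0, -1) and the function name are assumptions for illustration only.

```python
from collections import Counter

ABSTAIN = -1  # assumed encoding: the signal gave no usable answer


def majority_vote(weak_labels, default=ABSTAIN):
    """Aggregate noisy per-signal labels for one article.

    weak_labels: one label per credibility signal, where 1 suggests
    misinformation, 0 suggests credible content, and -1 abstains.
    Ties and all-abstain cases return `default`.
    """
    votes = Counter(label for label in weak_labels if label != ABSTAIN)
    if not votes:
        return default
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default  # tie between the two classes
    return ranked[0][0]


# Toy example: five signals vote on one article.
prediction = majority_vote([1, 1, 0, ABSTAIN, 1])
```

A learned label model would additionally weight each signal by its estimated accuracy, which a plain majority vote does not do.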
Improving Tokenisation by Alternative Treatment of Spaces
Tokenisation is the first step in almost all NLP tasks, and state-of-the-art
transformer-based language models all use subword tokenisation algorithms to
process input text. Existing algorithms have problems, often producing
tokenisations of limited linguistic validity, and representing equivalent
strings differently depending on their position within a word. We hypothesise
that these problems hinder the ability of transformer-based models to handle
complex words, and suggest that these problems are a result of allowing tokens
to include spaces. We thus experiment with an alternative tokenisation approach
where spaces are always treated as individual tokens. Specifically, we apply
this modification to the BPE and Unigram algorithms. We find that our modified
algorithms lead to improved performance on downstream NLP tasks that involve
handling complex words, whilst having no detrimental effect on performance in
general natural language understanding tasks. Intrinsically, we find our
modified algorithms give more morphologically correct tokenisations, in
particular when handling prefixes. Given the results of our experiments, we
advocate for always treating spaces as individual tokens as an improved
tokenisation method.
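The idea of always treating spaces as individual tokens can be sketched as a pre-tokenisation step. This is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, only the ASCII space character is handled, and the subword algorithm (BPE or Unigram) that would then run on the non-space pieces is omitted.

```python
import re


def split_spaces(text):
    """Split text so that every space is its own token.

    Runs of non-space characters stay together as single pieces; a
    subword algorithm would then segment those pieces further, so no
    resulting token could ever contain a space.
    Assumption: only ' ' is treated as a space (tabs and newlines are not).
    """
    return re.findall(r" |[^ ]+", text)


pieces = split_spaces("un happy  dog")
# The split is lossless: joining the pieces restores the input.
restored = "".join(pieces)
```

Because a word is passed to the subword step without any attached space, it is segmented the same way regardless of where it appears, which relates to the position-dependence problem the abstract describes.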
Enhancing Biomedical Lay Summarisation with External Knowledge Graphs
Previous approaches for automatic lay summarisation rely exclusively on the
source article which, given that it is written for a technical audience (e.g.,
researchers), is unlikely to explicitly define all technical concepts or state
all of the background information that is relevant for a lay audience. We
address this issue by augmenting eLife, an existing biomedical lay
summarisation dataset, with article-specific knowledge graphs, each containing
detailed information on relevant biomedical concepts. Using both automatic and
human evaluations, we systematically investigate the effectiveness of three
different approaches for incorporating knowledge graphs within lay
summarisation models, with each method targeting a distinct area of the
encoder-decoder model architecture. Our results confirm that integrating
graph-based domain knowledge can significantly benefit lay summarisation by
substantially increasing the readability of generated text and improving the
explanation of technical concepts.
Comment: Accepted to the EMNLP 2023 main conference
Bilexical embeddings for quality estimation
© 2017 The Authors. Published by Association for Computational Linguistics. This is an open access article available under a Creative Commons licence.
The published version can be accessed at the following link on the publisher’s website: http://dx.doi.org/10.18653/v1/W17-4760
This work was supported by the QT21 project (H2020 No. 645452)
VaxxHesitancy: A Dataset for Studying Hesitancy Towards COVID-19 Vaccination on Twitter
Vaccine hesitancy has been a common concern, probably since vaccines were
created and, with the popularisation of social media, people started to express
their concerns about vaccines online alongside those posting pro- and
anti-vaccine content. Predictably, since the first mentions of a COVID-19
vaccine, social media users posted about their fears and concerns or about
their support and belief in the effectiveness of these rapidly developing
vaccines. Identifying and understanding the reasons behind public hesitancy
towards COVID-19 vaccines is important for policy makers that need to develop
actions to better inform the population with the aim of increasing vaccine
take-up. In the case of COVID-19, where the fast development of the vaccines
was mirrored closely by growth in anti-vaxx disinformation, automatic means of
detecting citizen attitudes towards vaccination became necessary. This is an
important computational social sciences task that requires data analysis in
order to gain in-depth understanding of the phenomena at hand. Annotated data
is also necessary for training data-driven models for more nuanced analysis of
attitudes towards vaccination. To this end, we created a new collection of over
3,101 tweets annotated with users' attitudes towards COVID-19 vaccination
(stance). We also develop a domain-specific language model (VaxxBERT)
that achieves the best predictive performance (73.0 accuracy and 69.3 F1-score)
as compared to a robust set of baselines. To the best of our knowledge, these
are the first dataset and model that model vaccine hesitancy as a category
distinct from pro- and anti-vaccine stance.
Comment: Accepted at ICWSM 202