Discharge Summary Hospital Course Summarisation of In Patient Electronic Health Record Text with Clinical Concept Guided Deep Pre-Trained Transformer Models
Brief Hospital Course (BHC) summaries are succinct summaries of an entire
hospital encounter, embedded within discharge summaries, written by senior
clinicians responsible for the overall care of a patient. Methods to
automatically produce summaries from inpatient documentation would be
invaluable in reducing clinician manual burden of summarising documents under
high time-pressure to admit and discharge patients. Automatically producing
these summaries from the inpatient course is a complex, multi-document
summarisation task, as source notes are written from various perspectives (e.g.
nursing, doctor, radiology) during the course of the hospitalisation. We
present a range of methods for BHC summarisation, demonstrating the
performance of deep learning summarisation models across extractive and
abstractive summarisation scenarios. We also test a novel ensemble extractive
and abstractive summarisation model that incorporates a medical concept
ontology (SNOMED) as a clinical guidance signal and shows superior performance
on two real-world clinical data sets.
POLIS: a probabilistic summarisation logic for structured documents
PhD
As the availability of structured documents, formatted in markup languages such as SGML, RDF,
or XML, increases, retrieval systems increasingly focus on the retrieval of document-elements,
rather than entire documents. Additionally, abstraction layers in the form of formalised retrieval
logics have allowed developers to include search facilities into numerous applications, without
the need of having detailed knowledge of retrieval models.
Although automatic document summarisation has been recognised as a useful tool for reducing
the workload of information system users, very few such abstraction layers have been developed
for the task of automatic document summarisation. This thesis describes the development
of an abstraction logic for summarisation, called POLIS, which provides users (such as developers
or knowledge engineers) with a high-level access to summarisation facilities. Furthermore,
POLIS allows users to exploit the hierarchical information provided by structured documents.
The development of POLIS is carried out in a step-by-step way. We start by defining a series
of probabilistic summarisation models, which provide weights to document-elements at a user
selected level. These summarisation models are those accessible through POLIS. The formal
definition of POLIS is performed in three steps. We start by providing a syntax for POLIS,
through which users/knowledge engineers interact with the logic. This is followed by a definition
of the logic's semantics. Finally, we provide details of an implementation of POLIS.
The final chapters of this dissertation are concerned with the evaluation of POLIS, which is
conducted in two stages. Firstly, we evaluate the performance of the summarisation models by
applying POLIS to two test collections, the DUC AQUAINT corpus, and the INEX IEEE corpus.
This is followed by application scenarios for POLIS, in which we discuss how POLIS can be used in specific IR tasks.
Towards Personalized and Human-in-the-Loop Document Summarization
The ubiquitous availability of computing devices and the widespread use of
the internet continuously generate large amounts of data. Therefore, the
amount of available information on any given topic is far beyond humans'
processing capacity, causing what is known as information
overload. To efficiently cope with large amounts of information and generate
content with significant value to users, we need to identify, merge and
summarise information. Data summaries can help gather related information and
collect it into a shorter format that enables answering complicated questions,
gaining new insight and discovering conceptual boundaries.
This thesis focuses on three main challenges to alleviate information
overload using novel summarisation techniques. It further intends to facilitate
the analysis of documents to support personalised information extraction. This
thesis separates the research issues into four areas, covering (i) feature
engineering in document summarisation, (ii) traditional static and inflexible
summaries, (iii) traditional generic summarisation approaches, and (iv) the
need for reference summaries. We propose novel approaches to tackle these
challenges by: (i) enabling automatic intelligent feature engineering, (ii)
enabling flexible and interactive summarisation, (iii) utilising intelligent and
personalised summarisation approaches. The experimental results prove the
efficiency of the proposed approaches compared to other state-of-the-art
models. We further propose solutions to the information overload problem in
different domains through summarisation, covering network traffic data, health
data and business process data.
Comment: PhD thesis
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
The evaluation of Large Language Models (LLMs) is a patchy and inconsistent
landscape, and it is becoming clear that the quality of automatic evaluation
metrics is not keeping up with the pace of development of generative models. We
aim to improve the understanding of current models' performance by providing a
preliminary and hybrid evaluation on a range of open and closed-source
generative LLMs on three NLP benchmarks: text summarisation, text
simplification and grammatical error correction (GEC), using both automatic and
human evaluation. We also explore the potential of the recently released GPT-4
to act as an evaluator. We find that ChatGPT consistently outperforms many
other popular models according to human reviewers on the majority of metrics,
while scoring much more poorly when using classic automatic evaluation metrics.
We also find that human reviewers rate the gold reference as much worse than
the best models' outputs, indicating the poor quality of many popular
benchmarks. Finally, we find that GPT-4 is capable of ranking models' outputs
in a way which aligns reasonably closely to human judgement despite
task-specific variations, with a lower alignment in the GEC task.
Comment: Accepted at EMNLP 202
Detecting and Mitigating Hallucinations in Multilingual Summarisation
Hallucinations pose a significant challenge to the reliability of neural
models for abstractive summarisation. While automatically generated summaries
may be fluent, they often lack faithfulness to the original document. This
issue becomes even more pronounced in low-resource settings, such as
cross-lingual transfer. With the existing faithfulness metrics focusing on English,
even measuring the extent of this phenomenon in cross-lingual settings is hard.
To address this, we first develop a novel metric, mFACT, evaluating the
faithfulness of non-English summaries, leveraging translation-based transfer
from multiple English faithfulness metrics. We then propose a simple but
effective method to reduce hallucinations in cross-lingual transfer, which
weights the loss of each training example by its faithfulness score. Through
extensive experiments in multiple languages, we demonstrate that mFACT is the
metric that is most suited to detect hallucinations. Moreover, we find that our
proposed loss weighting method drastically increases both performance and
faithfulness according to both automatic and human evaluation when compared to
strong baselines for cross-lingual transfer such as MAD-X. Our code and dataset
are available at https://github.com/yfqiu-nlp/mfact-summ
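
The loss weighting idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_nll`, the per-example length normalisation, and the averaging over the batch are all assumptions made for illustration, and the faithfulness scores are assumed to be precomputed values in [0, 1].

```python
import math


def weighted_nll(token_log_probs, faithfulness_scores):
    """Batch loss where each example's negative log-likelihood is scaled
    by its faithfulness score (hypothetical helper, for illustration only).

    token_log_probs: list of lists; each inner list holds the model's
        log-probabilities for the reference summary tokens of one example.
    faithfulness_scores: list of floats, one per example, assumed in [0, 1].
    """
    if len(token_log_probs) != len(faithfulness_scores):
        raise ValueError("need exactly one faithfulness score per example")

    total = 0.0
    for log_probs, weight in zip(token_log_probs, faithfulness_scores):
        # Assumption: length-normalised negative log-likelihood per example.
        example_loss = -sum(log_probs) / len(log_probs)
        # Examples with low faithfulness contribute less to the training signal.
        total += weight * example_loss

    # Assumption: average over the batch size, not over the sum of weights.
    return total / len(token_log_probs)
```

With a weight of 0 an example is effectively ignored, and with a weight of 1 it contributes its full loss, so unfaithful training pairs are down-weighted rather than filtered out.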
A Study of Automatic Metrics for the Evaluation of Natural Language Explanations
As transparency becomes key for robotics and AI, it will be necessary to
evaluate the methods through which transparency is provided, including
automatically generated natural language (NL) explanations. Here, we explore
parallels between the generation of such explanations and the much-studied
field of evaluation of Natural Language Generation (NLG). Specifically, we
investigate which of the NLG evaluation measures map well to explanations. We
present the ExBAN corpus: a crowd-sourced corpus of NL explanations for
Bayesian Networks. We run correlations comparing human subjective ratings with
NLG automatic measures. We find that embedding-based automatic NLG evaluation
methods, such as BERTScore and BLEURT, have a higher correlation with human
ratings, compared to word-overlap metrics, such as BLEU and ROUGE. This work
has implications for Explainable AI and transparent robotic and autonomous
systems.
Comment: Accepted at EACL 202
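
Correlating human ratings with automatic metric scores, as this abstract describes, can be sketched as follows. The abstract does not say which correlation coefficient was used; Spearman rank correlation is assumed here, and the function names are hypothetical. The sketch uses only the standard library.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(human_ratings, metric_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(human_ratings), average_ranks(metric_scores))
```

Calling `spearman` once per metric on the same set of human ratings gives one coefficient per metric, which is the kind of comparison the abstract reports between embedding-based and word-overlap measures.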