Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue
Evaluation is a critical element in the development of many natural-language-based systems. In this thesis, we present critical analyses of the standard evaluation methodologies applied in three Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks like MRC are usually evaluated by comparing system outputs against hand-crafted references with automatic evaluation metrics; these metrics are mainly borrowed from well-developed NLP tasks such as machine translation and text summarization. The evaluation of QG and dialogue, meanwhile, is a known open problem, as such tasks lack references against which similarity could be computed, and human evaluation is indispensable when assessing the performance of systems for these tasks. However, human evaluation is not always valid because: i) it may cost too much and be hard to deploy when experts are involved; and ii) human assessors can lack reliability in a crowd-sourcing environment. To overcome the challenges of both automatic metrics and human evaluation, we first design crowd-sourced human evaluation methods tailored to each of the three target tasks. We then show that these human evaluation methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, with the data collected from our experiments, we measure the accuracy of existing automatic metrics and analyse the potential limitations and disadvantages of applying these metrics directly. Furthermore, in view of the specific features of the different tasks, we provide detailed statistical analyses of the collected data to discover their underlying trends, and offer suggestions on directions for improving systems in different respects.
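Since the reliability of crowd-sourced assessors is central to the argument above, one standard way to quantify inter-rater agreement is Cohen's kappa. The sketch below is an illustrative implementation of that statistic only, with made-up ratings; it is not the thesis's actual analysis pipeline:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if the two raters labelled independently.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical judgements from two crowd workers on six system outputs.
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 3))  # → 0.333
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, which is the failure mode crowd-sourced evaluation must guard against.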
QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation
Question Generation (QG) aims to automate the task of composing questions for
a passage with a set of chosen answers found within the passage. In recent
years, the introduction of neural generation models has resulted in substantial
improvements of automatically generated questions in terms of quality,
especially compared to traditional approaches that employ manually crafted
heuristics. However, the metrics commonly applied in QG evaluations have been
criticized for their low agreement with human judgement. We therefore propose a
new reference-free evaluation metric that has the potential to provide a better
mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a
language model to maximize its correlation with human judgements, QAScore
evaluates a question by computing the cross entropy according to the
probability that the language model can correctly generate the masked words in
the answer to that question. Furthermore, we conduct a new crowd-sourcing human
evaluation experiment for the QG evaluation to investigate how QAScore and
other metrics can correlate with human judgements. Experiments show that
QAScore obtains a stronger correlation with the results of our proposed human
evaluation method compared to existing traditional word-overlap-based metrics
such as BLEU and ROUGE, as well as the existing pretrained-model-based metric
BERTScore.
Comment: 19 pages, 5 figures, 7 tables
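As described above, QAScore aggregates a pretrained language model's ability to recover masked answer words as a cross entropy. The following is a minimal sketch of that aggregation only: `toy_lm_prob` is a stand-in for the real masked language model, and all names are illustrative assumptions, not the paper's implementation:

```python
import math

def toy_lm_prob(masked_context, target_word):
    """Stand-in for a masked LM: returns P(target | context).
    A real QAScore-style metric would query a pretrained model here."""
    # Crude proxy: a word already present in the context is easier to predict.
    return 0.5 if target_word in masked_context else 0.05

def pseudo_cross_entropy(passage, question, answer_words):
    """Mask each answer word in turn and average -log P(word | rest).
    Lower is better: the passage and question make the answer predictable."""
    total = 0.0
    for i, word in enumerate(answer_words):
        context = passage + " " + question + " " + " ".join(
            w if j != i else "[MASK]" for j, w in enumerate(answer_words))
        total += -math.log(toy_lm_prob(context, word))
    return total / len(answer_words)

score = pseudo_cross_entropy(
    "Paris is the capital of France.",
    "What is the capital of France?",
    ["Paris"])
print(score < 1.0)  # the answer is recoverable from the context, so it scores well
```

Because the score needs no reference question, it can rank generated questions directly, which is the reference-free property the abstract emphasizes.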
Document-Level Machine Translation with Large Language Models
Large language models (LLMs) such as ChatGPT can produce coherent, cohesive,
relevant, and fluent answers for various natural language processing (NLP)
tasks. Taking document-level machine translation (MT) as a testbed, this paper
provides an in-depth evaluation of LLMs' ability at discourse modeling. The
study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where
we investigate the impact of different prompts on document-level translation
quality and discourse phenomena; 2) Comparison of Translation Models, where we
compare the translation performance of ChatGPT with commercial MT systems and
advanced document-level MT methods; 3) Analysis of Discourse Modelling
Abilities, where we further probe the discourse knowledge encoded in LLMs and
examine the impact of training techniques on discourse modeling. By evaluating
on a number of benchmarks, we surprisingly find that 1) leveraging its powerful
long-text modeling capabilities, ChatGPT outperforms commercial MT systems in
terms of human evaluation; 2) GPT-4 demonstrates a strong ability to explain
discourse knowledge, even though it may select incorrect translation
candidates in contrastive testing; and 3) ChatGPT and GPT-4 demonstrate
superior performance and show potential to become a new and promising paradigm
for document-level translation. This work highlights the challenges and
opportunities of discourse modeling for LLMs, which we hope can inspire the
future design and evaluation of LLMs.
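One plausible way to realize a discourse-aware prompt of the kind studied above is to carry earlier source and target sentences along as context, so the model can keep pronouns and terminology consistent across sentences. The sketch below is a hypothetical prompt builder under that assumption; the wording, function name, and language pair are illustrative, not the paper's exact prompts:

```python
def build_doc_prompt(src_sentences, translated_so_far, idx,
                     src_lang="Chinese", tgt_lang="English"):
    """Assemble a document-level MT prompt that includes the preceding
    source sentences and their translations as discourse context."""
    context_src = " ".join(src_sentences[:idx])
    context_tgt = " ".join(translated_so_far)
    return (
        f"Translate the next {src_lang} sentence into {tgt_lang}, "
        f"keeping pronouns and terminology consistent with the context.\n"
        f"Context ({src_lang}): {context_src}\n"
        f"Context ({tgt_lang}): {context_tgt}\n"
        f"Sentence: {src_sentences[idx]}\n"
        f"Translation:")

# Translating the second sentence of a two-sentence document.
p = build_doc_prompt(["句子一。", "句子二。"], ["Sentence one."], 1)
print("Sentence one." in p and "句子二。" in p)  # → True
```

Varying what goes into the two context fields (nothing, source only, source plus target) is one concrete way such a study can isolate the effect of discourse-aware prompting.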
Semantic-aware dynamic retrospective-prospective reasoning for event-level video question answering
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed to provide optimal answers. However, despite significant progress in model performance, few studies have focused on using the explicit semantic connections between the question and visual information, especially at the event level. There is a need to use such semantic connections to facilitate complex reasoning across video frames. Therefore, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. Specifically, we explicitly use the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, where we decide to move to the next frame based on which part of the SRL structure (agent, verb, patient, etc.) of the question is currently in focus. We conduct experiments on a benchmark EVQA dataset, TrafficQA. Results show that our proposed approach achieves superior performance compared to previous state-of-the-art models. Our code is publicly available at https://github.com/lyuchenyang/Semantic-aware-VideoQA
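The frame-stepping idea above can be illustrated as a toy loop: the question's SRL roles are grounded one at a time, and the reader advances through frames only while the currently focused role is not found. The per-frame annotations here are hypothetical entity sets, not the model's actual visual features:

```python
def dynamic_reasoning(frames, srl_roles):
    """Walk video frames, focusing one SRL role of the question at a time;
    advance to the next frame only when the focused role cannot be
    grounded in the current frame's (hypothetical) detected entities."""
    focus, f, grounded = 0, 0, []
    while focus < len(srl_roles) and f < len(frames):
        role, filler = srl_roles[focus]
        if filler in frames[f]:   # role grounded in this frame
            grounded.append((role, f))
            focus += 1            # shift focus to the next SRL role
        else:
            f += 1                # step forward through the video
    return grounded

# Toy traffic clip: entities detected per frame, and the question's SRL roles.
frames = [{"car"}, {"car", "truck"}, {"truck", "crash"}]
roles = [("agent", "car"), ("verb", "crash"), ("patient", "truck")]
print(dynamic_reasoning(frames, roles))  # → [('agent', 0), ('verb', 2), ('patient', 2)]
```

The point of the sketch is only the control flow: which frame to attend to next is driven by the question's semantic structure rather than by a fixed scan over all frames.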
Is a video worth n × n Images? A highly efficient approach to transformer-based video question answering
Conventional Transformer-based Video Question Answering (VideoQA) approaches
generally encode frames independently through one or more image encoders,
followed by interaction between frames and the question. However, such a
scheme incurs significant memory use and inevitably slows down training and
inference. In this work, we present a highly efficient approach for VideoQA
based on existing vision-language pre-trained models, in which we concatenate
video frames into an n × n matrix and then convert it to one image. By doing
so, we reduce the use of the image encoder from n² passes to 1 while
maintaining the temporal structure of the original video. Experimental results
on MSRVTT and TrafficQA show that our proposed approach achieves
state-of-the-art performance with nearly 4× faster speed and only 30% of the
memory use. We show that by integrating our approach into VideoQA systems we
can achieve comparable, even superior, performance with a significant speed-up
for training and inference. We believe the proposed approach can facilitate
VideoQA-related research by reducing the computational requirements for those
who have limited access to budgets and resources. Our code is publicly
available at https://github.com/lyuchenyang/Efficient-VideoQA for research use.
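The frame-concatenation step above can be sketched with a toy grayscale version: n² equally sized frames are tiled into one n × n grid, so a single encoder pass sees all of them while temporal order is preserved left-to-right, top-to-bottom. This sketch works on nested lists of pixel values rather than real image tensors:

```python
def tile_frames(frames, n):
    """Concatenate n*n equally sized grayscale frames (2-D lists) into a
    single n x n grid image, preserving temporal order row by row."""
    assert len(frames) == n * n
    h = len(frames[0])  # rows per frame
    tiled = []
    for grid_row in range(n):
        for pixel_row in range(h):
            row = []
            for grid_col in range(n):
                # Append this frame's pixel row alongside its neighbours.
                row.extend(frames[grid_row * n + grid_col][pixel_row])
            tiled.append(row)
    return tiled

# Four 2x2 constant frames (values 0..3) tiled into one 4x4 image.
frames = [[[k] * 2] * 2 for k in range(4)]
image = tile_frames(frames, 2)
print(image)  # → [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
```

In a real pipeline the same tiling would be done on image tensors (or by pasting frames into one canvas) before the single image-encoder call that replaces n² separate calls.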
Azimuthal asymmetries in lepton-pair production at a fixed-target experiment using the LHC beams (AFTER)
A multi-purpose fixed-target experiment using the proton and lead-ion beams
of the LHC was recently proposed by Brodsky, Fleuret, Hadjidakis and Lansberg,
and here we concentrate our study on some issues related to the spin physics
part of this project (referred to as AFTER). We study the nucleon spin
structure through dilepton production processes in a fixed-target experiment
using the LHC proton beams, in the kinematical region reached with the 7 TeV
proton beams at the corresponding nucleon-nucleon center-of-mass energy. We
calculate and estimate the azimuthal asymmetries of unpolarized dilepton
production processes in the Drell--Yan continuum region and at the Z-pole. We
also calculate the azimuthal asymmetries of dilepton production processes with
the target proton or deuteron longitudinally or transversely polarized, in the
Drell--Yan continuum region and around the resonance region. We conclude that
it is feasible to measure these azimuthal asymmetries, and consequently the
three-dimensional or transverse-momentum-dependent parton distribution
functions (3dPDFs or TMDs), at this new AFTER facility.
Comment: 15 pages, 40 figures. Version accepted for publication in EPJ
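The nucleon-nucleon center-of-mass energy available to a fixed-target experiment with the 7 TeV LHC proton beams follows from standard kinematics, for a beam of energy E_beam striking a nucleon of mass m_N at rest:

```latex
\sqrt{s_{NN}} \approx \sqrt{2\, E_{\mathrm{beam}}\, m_N}
  = \sqrt{2 \times 7000~\mathrm{GeV} \times 0.938~\mathrm{GeV}}
  \approx 115~\mathrm{GeV}
```

This is the energy scale at which the asymmetry measurements discussed above would be performed.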
B_c meson rare decays in the light-cone quark model
We investigate two rare decays of the B_c meson in the framework of the
light-cone quark model (LCQM). The transition form factors are calculated in
the space-like region and then analytically continued to the time-like region
via an exponential parametrization. The branching ratios and longitudinal
lepton polarization asymmetries (LPAs) for the two decays are given and
compared with each other. The results are helpful for investigating the
structure of the B_c meson and for testing the unitarity of the CKM quark
mixing matrix. All these results can be tested in future experiments at the
LHC.
Comment: 9 pages, 11 figures, version accepted for publication in EPJ