How to Ask Better Questions? A Large-Scale Multi-Domain Dataset for Rewriting Ill-Formed Questions
We present a large-scale dataset for the task of rewriting an ill-formed
natural language question to a well-formed one. Our multi-domain question
rewriting (MQR) dataset is constructed from human-contributed Stack Exchange
question edit histories. The dataset contains 427,719 question pairs which come
from 303 domains. We provide human annotations for a subset of the dataset as a
quality estimate. When moving from ill-formed to well-formed questions, the
question quality improves by an average of 45 points across three aspects. We
train sequence-to-sequence neural models on the constructed dataset and obtain
an improvement of 13.2% in BLEU-4 over baseline methods built from other data
resources. We release the MQR dataset to encourage research on the problem of
question rewriting.
Comment: AAAI 202
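The BLEU-4 score reported above is the standard geometric mean of 1- through 4-gram modified precisions with a brevity penalty. A minimal self-contained sketch of sentence-level BLEU-4 (an illustration of the metric only, not the paper's evaluation code; real evaluations typically use smoothed, corpus-level implementations):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with brevity penalty (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # geometric mean collapses if any precision is zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    # Brevity penalty discourages overly short candidates.
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return bp * geo_mean
```

An exact rewrite scores 1.0; partial n-gram overlap with the well-formed reference scores between 0 and 1.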
Answering Unanswered Questions through Semantic Reformulations in Spoken QA
Spoken Question Answering (QA) is a key feature of voice assistants, usually
backed by multiple QA systems. Users ask questions via spontaneous speech which
can contain disfluencies, errors, and informal syntax or phrasing. This is a
major challenge in QA, causing unanswered questions or irrelevant answers, and
leading to bad user experiences. We analyze failed QA requests to identify core
challenges: lexical gaps, proposition types, complex syntactic structure, and
high specificity. We propose a Semantic Question Reformulation (SURF) model
offering three linguistically-grounded operations (repair, syntactic reshaping,
generalization) to rewrite questions to facilitate answering. Offline
evaluation on 1M unanswered questions from a leading voice assistant shows that
SURF significantly improves answer rates: up to 24% of previously unanswered
questions obtain relevant answers (75%). Live deployment shows positive impact
for millions of customers with unanswered questions; explicit relevance
feedback shows high user satisfaction.
Comment: Accepted by ACL 2023 Industry Track
Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue
Evaluation is a critical element in the development process of many natural-language-based systems. In this thesis, we present critical analyses of standard evaluation methodologies applied in the following Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks such as MRC are usually evaluated by comparing the similarity between hand-crafted references and system-generated outputs using automatic evaluation metrics; these metrics are mainly borrowed from well-developed NLP tasks such as machine translation and text summarization. The evaluation of QG and open-domain dialogue, by contrast, is a known open problem, as these tasks lack corresponding references for computing similarity, and human evaluation is indispensable when assessing system performance. However, human evaluation is not always valid: i) it may cost too much and be hard to deploy when experts are involved; ii) human assessors can lack reliability in crowd-sourcing environments. To overcome the challenges of both automatic metrics and human evaluation, we first design crowdsourced human evaluation methods tailored to each of the three target tasks. We then show that these human evaluation methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, with the data collected from our experiments, we measure the accuracy of existing automatic metrics and analyse the limitations and disadvantages of applying these metrics directly. Finally, in view of the specific features of each task, we provide detailed statistical analyses of the collected data to uncover underlying trends, and suggest directions for improving systems along different aspects.
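Measuring the accuracy of an automatic metric, as the abstract above describes, typically reduces to correlating metric scores with human judgments over the same set of system outputs. A minimal sketch of the Pearson correlation commonly used for such meta-evaluation (an illustration only, not the thesis's own analysis code):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired score lists,
    e.g. automatic metric scores vs. human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A correlation near 1 indicates the metric ranks outputs much as humans do; values near 0 suggest the metric is unreliable for that task.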