Entertaining and Opinionated but Too Controlling: A Large-Scale User Study of an Open Domain Alexa Prize System
Conversational systems typically focus on functional tasks such as scheduling appointments or creating to-do lists. Instead, we design and evaluate SlugBot (SB), one of 8 semifinalists in the 2018 Alexa Prize, whose goal is to support casual open-domain social interaction. This novel application requires both broad topic coverage and engaging interactive skills. We developed a new technical approach to meet this demanding situation by crowdsourcing novel content and introducing playful conversational strategies based on storytelling and games. We collected over 10,000 conversations during August 2018 as part of the Alexa Prize competition. We also conducted an in-lab follow-up qualitative evaluation. Overall, users found SB moderately engaging; conversations averaged 3.6 minutes and involved 26 user turns. However, users reacted very differently to different conversation subtypes. Storytelling and games were evaluated positively; these were seen as entertaining, with a predictable interactive structure. They also led users to impute personality and intelligence to SB. In contrast, search and general chit-chat induced coverage problems; here users found it hard to infer what topics SB could understand, and these conversations were seen as too system-driven. Theoretical and design implications suggest a move away from conversational systems that simply provide factual information. Future systems should be designed to have their own opinions and personal stories to share, and SB provides an example of how we might achieve this.
Comment: To appear in the 1st International Conference on Conversational User Interfaces (CUI 2019).
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become suboptimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
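The paper's pitch is that GEMv2 collapses this evaluation workflow into a single call. As a rough illustration of what such a workflow covers, the Python sketch below loads one GEM dataset from the Hugging Face Hub and scores model outputs with a single standard metric. This is not the official GEMv2 API, and the dataset config ("web_nlg_en") and field names ("input", "target") are assumptions made for illustration.

```python
# A minimal sketch of the dataset-to-metric workflow GEMv2 streamlines;
# NOT the official GEMv2 API. Config and field names are assumptions.
from datasets import load_dataset
import sacrebleu

# Load one GEM dataset from the Hugging Face Hub.
data = load_dataset("gem", "web_nlg_en", split="validation")

def toy_model(example):
    # Stand-in for a real NLG model: naively linearize the input triples.
    return " ".join(example["input"])

predictions = [toy_model(ex) for ex in data]
references = [[ex["target"] for ex in data]]  # one reference stream

# Score with a single surface metric; GEMv2 bundles many more.
bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
```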
Diversifying Language Generated by Deep Learning Models in Dialogue Systems
Conversational AI has seen tremendous progress in recent years, achieving near-human, or even superhuman, performance in certain well-defined tasks, including speech recognition and question answering. Yet it tends to struggle with tasks that are less constrained, in particular those that involve producing human language. Current approaches to natural language generation (NLG) in dialogue systems still rely heavily on techniques that lack scalability and transferability to different domains, despite the NLG community's general embrace of more robust methods, in particular deep learning (neural) models. These methods rely on large amounts of annotated data, yet they tend to produce generic, robotic, and boring responses that lack most of the human language nuances that make conversation creative and varied. While the naturalness of the generated language is an important factor affecting the perceived quality of a dialogue system, semantic accuracy is also extremely important: if a system is not semantically accurate, it may provide the user with incorrect information or contradict its earlier responses. In this thesis, we focus on the task of generating an utterance from a structured meaning representation (MR). To support our work, we create and release a new parallel corpus with more varied dialogue acts and more conversational utterances than previous MR-to-text corpora. We explore different ways of promoting output diversity in neural data-to-text generation while ensuring high semantic accuracy, developing new methods that help deep learning NLG models produce diverse utterances that are faithful to their MRs. This is an important step toward making conversational AI more reliable and pleasant to interact with.

We first observe in our initial experiments that NLG models can produce more diverse and natural-sounding texts when explicitly prompted to; however, this diversity comes at the expense of semantic accuracy. This leads us to develop a set of methods for automatically assessing and enforcing semantic accuracy in the generated utterances. We introduce a general tool that finds a semantic alignment between an utterance and the corresponding input, which can be used to automatically evaluate the accuracy of generated utterances and to rank a pool of candidate utterances a model produces. We also propose a novel semantically attention-guided decoding method for neural encoder-decoder models, which uses the models' own knowledge acquired during training to track semantic accuracy at inference time and rerank generated utterance candidates accordingly. We show on multiple datasets that both of these methods can dramatically reduce semantic errors in model outputs while maintaining their overall quality and fluency.

We then systematically explore Monte Carlo Tree Search (MCTS) as a way to simultaneously optimize both semantic accuracy and stylistic diversity during inference. To guide the MCTS, we propose a new referenceless automatic metric for utterance evaluation. Our results show that, using this novel method, we can successfully increase diversity while maintaining, or even improving, semantic accuracy.
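As a concrete illustration of the candidate-reranking idea described above, here is a minimal, hypothetical Python sketch that scores each generated utterance by how many slot values of the input MR it realizes, then picks the best candidate. The naive substring matching, the `semantic_score` and `rerank` helpers, and the example MR are all assumptions made for illustration; the thesis uses a learned semantic alignment rather than string matching.

```python
# Hypothetical sketch of semantic-accuracy reranking, assuming an MR is a
# flat slot->value dict. Illustrative only; not the thesis implementation.

def semantic_score(mr: dict, utterance: str) -> float:
    """Fraction of MR slot values realized (as substrings) in the utterance."""
    text = utterance.lower()
    hits = sum(1 for value in mr.values() if str(value).lower() in text)
    return hits / max(len(mr), 1)

def rerank(mr: dict, candidates: list) -> list:
    """Order candidate utterances by descending semantic accuracy."""
    return sorted(candidates, key=lambda c: semantic_score(mr, c), reverse=True)

# Example MR and candidate pool (hypothetical, E2E-style).
mr = {"name": "Aromi", "food": "Italian", "area": "riverside"}
candidates = [
    "Aromi is a nice place.",
    "Aromi serves Italian food in the riverside area.",
]
print(rerank(mr, candidates)[0])  # the candidate covering all three slots
```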