Entertaining and Opinionated but Too Controlling: A Large-Scale User Study of an Open Domain Alexa Prize System
Conversational systems typically focus on functional tasks such as scheduling appointments or creating to-do lists. Instead, we design and evaluate SlugBot (SB), one of 8 semifinalists in the 2018 Alexa Prize, whose goal is to support casual open-domain social interaction. This novel application requires both broad topic coverage and engaging interactive skills. We developed a new technical approach to meet this demanding situation by crowdsourcing novel content and introducing playful conversational strategies based on storytelling and games. We collected over 10,000 conversations during August 2018 as part of the Alexa Prize competition. We also conducted an in-lab follow-up qualitative evaluation. Overall, users found SB moderately engaging; conversations averaged 3.6 minutes and involved 26 user turns. However, users reacted very differently to different conversation subtypes. Storytelling and games were evaluated positively; these were seen as entertaining, with a predictable interactive structure. They also led users to impute personality and intelligence to SB. In contrast, search and general chit-chat induced coverage problems; here users found it hard to infer what topics SB could understand, and these conversations were seen as too system-driven. Theoretical and design implications suggest a move away from conversational systems that simply provide factual information. Future systems should be designed to have their own opinions and personal stories to share, and SB provides an example of how we might achieve this.
Comment: To appear in the 1st International Conference on Conversational User Interfaces (CUI 2019).
GEMv2: Multilingual NLG benchmarking in a single line of code
Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become suboptimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
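The paper's pitch is that GEMv2 collapses this evaluation workflow into a single call. As a rough illustration of what such a workflow covers, the Python sketch below loads one GEM dataset from the Hugging Face Hub and scores model outputs with a single standard metric. This is not the official GEMv2 API, and the dataset config ("web_nlg_en") and field names ("input", "target") are assumptions made for illustration.

```python
# A minimal sketch of the dataset-to-metric workflow GEMv2 streamlines;
# NOT the official GEMv2 API. Config and field names are assumptions.
from datasets import load_dataset
import sacrebleu

# Load one GEM dataset from the Hugging Face Hub.
data = load_dataset("gem", "web_nlg_en", split="validation")

def toy_model(example):
    # Stand-in for a real NLG model: naively linearize the input triples.
    return " ".join(example["input"])

predictions = [toy_model(ex) for ex in data]
references = [[ex["target"] for ex in data]]  # one reference stream

# Score with a single surface metric; GEMv2 bundles many more.
bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU: {bleu.score:.2f}")
```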
Diversifying Language Generated by Deep Learning Models in Dialogue Systems
Conversational AI has seen tremendous progress in recent years, achieving near-human, or even superhuman, performance in certain well-defined tasks, including speech recognition and question answering. Yet it tends to struggle with tasks that are less constrained, in particular those that involve producing human language. Current approaches to natural language generation (NLG) in dialogue systems still rely heavily on techniques that lack scalability and transferability to different domains, despite the NLG community's general embrace of more robust methods, in particular deep learning (neural) models. These methods rely on large amounts of annotated data, yet they tend to produce generic, robotic, and boring responses that lack most of the human language nuances that make conversation creative and varied. While the naturalness of the generated language is an important factor affecting the perceived quality of a dialogue system, semantic accuracy is also extremely important: if a system is not semantically accurate, it may provide the user with incorrect information or contradict its earlier responses. In this thesis, we focus on the task of generating an utterance from a structured meaning representation (MR). To support our work, we create and release a new parallel corpus with more varied dialogue acts and more conversational utterances than previous MR-to-text corpora. We explore different ways of promoting output diversity in neural data-to-text generation while ensuring high semantic accuracy, developing new methods that help deep learning NLG models produce diverse utterances that are faithful to their MRs. This is an important step toward making conversational AI more reliable and pleasant to interact with.

We first observe in our initial experiments that NLG models can produce more diverse and natural-sounding texts when explicitly prompted to; however, this diversity comes at the expense of semantic accuracy. This leads us to develop a set of methods for automatically assessing and enforcing semantic accuracy in the generated utterances. We introduce a general tool that finds a semantic alignment between an utterance and the corresponding input, which can be used to automatically evaluate the accuracy of generated utterances and to rank a pool of candidate utterances a model produces. We also propose a novel semantically attention-guided decoding method for neural encoder-decoder models, which uses the models' own knowledge acquired during training to track semantic accuracy at inference time and rerank generated utterance candidates accordingly. We show on multiple datasets that both of these methods can dramatically reduce semantic errors in model outputs while maintaining their overall quality and fluency.

We then systematically explore Monte Carlo Tree Search (MCTS) as a way to simultaneously optimize both semantic accuracy and stylistic diversity during inference. To guide the MCTS, we propose a new referenceless automatic metric for utterance evaluation. Our results show that, using this novel method, we can successfully increase diversity while maintaining, or even improving, semantic accuracy.
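As a concrete illustration of the candidate-reranking idea described above, here is a minimal, hypothetical Python sketch that scores each generated utterance by how many slot values of the input MR it realizes, then picks the best candidate. The naive substring matching, the `semantic_score` and `rerank` helpers, and the example MR are all assumptions made for illustration; the thesis uses a learned semantic alignment rather than string matching.

```python
# Hypothetical sketch of semantic-accuracy reranking, assuming an MR is a
# flat slot->value dict. Illustrative only; not the thesis implementation.

def semantic_score(mr: dict, utterance: str) -> float:
    """Fraction of MR slot values realized (as substrings) in the utterance."""
    text = utterance.lower()
    hits = sum(1 for value in mr.values() if str(value).lower() in text)
    return hits / max(len(mr), 1)

def rerank(mr: dict, candidates: list) -> list:
    """Order candidate utterances by descending semantic accuracy."""
    return sorted(candidates, key=lambda c: semantic_score(mr, c), reverse=True)

# Example MR and candidate pool (hypothetical, E2E-style).
mr = {"name": "Aromi", "food": "Italian", "area": "riverside"}
candidates = [
    "Aromi is a nice place.",
    "Aromi serves Italian food in the riverside area.",
]
print(rerank(mr, candidates)[0])  # the candidate covering all three slots
```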