25 research outputs found
Towards Best Experiment Design for Evaluating Dialogue System Output
To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for
evaluating dialogue systems, researchers typically use human judgments to
provide convergent evidence. While it has been demonstrated that human
judgments can suffer from the inconsistency of ratings, extant research has
also found that the design of the evaluation task affects the consistency and
quality of human judgments. We conduct a between-subjects study to understand
the impact of four experiment conditions on human ratings of dialogue system
output. In addition to discrete and continuous scale ratings, we also
experiment with a novel application of Best-Worst scaling to dialogue
evaluation. Through our systematic study with 40 crowdsourced workers in each
task, we find that using continuous scales achieves more consistent ratings
than a Likert scale or ranking-based experiment design. Additionally, we find
that factors such as the time taken to complete the task and a lack of prior
experience with similar studies rating dialogue system output positively
impact consistency and agreement amongst raters.
Comment: Accepted at INLG 201
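The Best-Worst scaling method mentioned in the abstract can be sketched as follows: crowdworkers see small tuples of system outputs and pick only the best and worst item in each tuple, and each item's score is the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of tuples it appeared in. This is a minimal illustrative sketch, not the authors' implementation; the response labels are hypothetical.

```python
from collections import Counter

def best_worst_scores(judgments):
    """Score items from Best-Worst scaling judgments.

    Each judgment is (tuple_of_items, best_item, worst_item).
    An item's score is (#best - #worst) / #appearances, in [-1, 1].
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgments:
        seen.update(items)   # count every appearance of each item
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

# Hypothetical 4-tuples of dialogue responses rated by crowdworkers
judgments = [
    (("r1", "r2", "r3", "r4"), "r1", "r4"),
    (("r1", "r2", "r3", "r4"), "r1", "r3"),
    (("r1", "r2", "r3", "r4"), "r2", "r4"),
]
scores = best_worst_scores(judgments)
```

The resulting scores induce a continuous ranking over items even though each individual judgment is a simple discrete choice, which is why Best-Worst scaling is attractive as an alternative to Likert-style ratings.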
A TOGAF Based Chatbot Evaluation Metrics: Insights from Literature Review
Chatbots have been used for basic conversational functionalities and task performance in today's world. With the surge in the use of chatbots, several design features have emerged to cater to their rising demands and increasing complexity. Researchers have grappled with the issues of modeling and evaluating these tools because of the vast number of metrics associated with their measure of success. This paper conducted a literature survey to identify the various conversational metrics used to evaluate chatbots. The selected evaluation metrics were mapped to the various layers of The Open Group Architecture Framework (TOGAF). The TOGAF architecture helped us divide the metrics based on the various facets critical to developing successful chatbot applications. Our results show that the metrics related to the business layer have been well studied. However, metrics associated with the data, information, and system layers warrant more research. As chatbots become more complex, success metrics across the intermediate layers may assume greater significance.
Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval
Smart assistants and recommender systems must deal with large amounts of information coming from different sources and in different formats. This is most frequent in text data, which presents increased variability and complexity, and is rather common for conversational assistants or chatbots. Moreover, this issue is very evident in the food and nutrition lexicon, where the semantics present increased variability, namely due to hypernyms and hyponyms. This work describes the creation of a set of word embeddings based on the incorporation of information from a food thesaurus - LanguaL - through retrofitting. The ingredients were classified according to three different facet label groups. Retrofitted embeddings seem to properly encode food-specific knowledge, as shown by an increase in accuracy compared to generic embeddings (+23%, +10% and +31% per group). Moreover, a weighting mechanism based on TF-IDF was applied to embedding creation before retrofitting, also bringing an increase in accuracy (+5%, +9% and +5% per group). Finally, the approach has been tested with human users in an ingredient retrieval exercise, showing a very positive evaluation (77.3% of the volunteer testers preferred this method over a string-based matching algorithm).
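The retrofitting step described in this abstract follows the general scheme of Faruqui et al. (2015): each word vector is iteratively pulled toward the average of its lexicon neighbours while staying anchored to its original embedding. Below is a minimal sketch under that assumption; the toy vocabulary, the `alpha`/`beta` weights, and the neighbour lists are illustrative, not the paper's actual LanguaL graph.

```python
import numpy as np

def retrofit(embeddings, lexicon, iterations=10, alpha=1.0, beta=1.0):
    """Retrofit word vectors to a semantic lexicon.

    embeddings: dict word -> np.ndarray (original vectors, left unchanged)
    lexicon:    dict word -> list of related words (thesaurus neighbours)
    Each update replaces a word's vector with a weighted average of its
    original vector (weight alpha) and its neighbours' current vectors
    (weight beta each).
    """
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue
            total = alpha * embeddings[word] + beta * sum(new[n] for n in nbrs)
            new[word] = total / (alpha + beta * len(nbrs))
    return new

# Toy example: pull "courgette" toward its thesaurus neighbour "zucchini"
emb = {"zucchini": np.array([1.0, 0.0]), "courgette": np.array([0.0, 1.0])}
lex = {"courgette": ["zucchini"]}
retro = retrofit(emb, lex)
```

With these weights the retrofitted "courgette" vector settles at the midpoint between its original position and "zucchini", showing how thesaurus links draw related food terms together in the embedding space.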
Conversational Agents - Exploring Generative Mechanisms and Second-hand Effects of Actualized Technology Affordances
Many organisations have jumped on the bandwagon and implemented conversational agents (CAs) as a new communication channel. Customers benefit from shorter resolution times, ubiquitous availability, and consistent and compliant responses. However, despite the hype around CAs and their various benefits for customers, we know little about the effects of external-facing CAs on the human workforce. This is crucial for better managing possible changes in the organisation of work. Adopting a critical realist stance and using the lens of technology affordances, we explore a) why users increasingly actualise CA affordances and b) the first- and second-hand effects of affordance actualisation on customers and human employees. We conducted semi-structured interviews with 18 experts in the field and introduce the term affordance-effects pairs to describe the relationships between the first- and second-hand effects. We further explain which generative mechanisms lead to an increasing actualisation of affordances and the associated effects.
Conversational Agent Experience: How to Create Good Alexa Skill
Conversational design guidelines offer recommendations on how to lead the user-agent conversation, how to help customers achieve their goals, and how to handle mistakes made by either side. However, an effective methodology for evaluating the experience of a user-agent conversation remains unclear. Here we show a data pipeline that evaluates the user-agent experience across a variety of scenarios. We found that the coherence of Alexa's responses has a positive impact on the user's experience, which depends on the categories of skills, the number of slots in utterances, and the goals that users are trying to achieve. Furthermore, our study shows a gap between theoretical conversational design guidelines and the need for practical testing of CAs. Our data pipeline demonstrates the importance of testing the experience with measurements that capture positive or negative affect in the conversational experience. We anticipate our study to be a starting point for a more robust user experience evaluation system for CAs and related applications.
Master of Science in Information, School of Information. http://deepblue.lib.umich.edu/bitstream/2027.42/168558/1/20210104_Zhou,Xunan[Andy]_Final_MTOP_Thesis.pd
A Maturity Assessment Framework for Conversational AI Development Platforms
Conversational Artificial Intelligence (AI) systems have recently
sky-rocketed in popularity and are now used in many applications, from car
assistants to customer support. The development of conversational AI systems is
supported by a large variety of software platforms, all with similar goals, but
different focus points and functionalities. A systematic foundation for
classifying conversational AI platforms is currently lacking. We propose a
framework for assessing the maturity level of conversational AI development
platforms. Our framework is based on a systematic literature review, in which
we extracted common and distinguishing features of various open-source and
commercial (or in-house) platforms. Inspired by language reference frameworks,
we identify different maturity levels that a conversational AI development
platform may exhibit in understanding and responding to user inputs. Our
framework can guide organizations in selecting a conversational AI development
platform according to their needs, as well as helping researchers and platform
developers improve the maturity of their platforms.
Comment: 10 pages, 10 figures. Accepted for publication at SAC 2021: ACM/SIGAPP Symposium On Applied Computing