
    Towards Best Experiment Design for Evaluating Dialogue System Output

    To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from inconsistent ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as the time taken to complete the task and having no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters. Comment: Accepted at INLG 201
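    As an illustration of the Best-Worst scaling procedure mentioned in this abstract, the sketch below shows the standard counting scheme: each response is scored by how often it was picked as best minus how often it was picked as worst, normalized by how often it was shown. The judgment format and response names are assumptions for illustration, not taken from the paper.

        # Minimal Best-Worst scaling sketch (illustrative; not the paper's code).
        from collections import defaultdict

        def best_worst_scores(judgments):
            """judgments: iterable of (shown_items, best_item, worst_item)."""
            best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
            for items, b, w in judgments:
                for item in items:
                    shown[item] += 1
                best[b] += 1
                worst[w] += 1
            # score = (#best - #worst) / #shown, in [-1, 1]
            return {item: (best[item] - worst[item]) / shown[item] for item in shown}

        judgments = [
            (("resp_A", "resp_B", "resp_C", "resp_D"), "resp_A", "resp_D"),
            (("resp_A", "resp_B", "resp_C", "resp_D"), "resp_B", "resp_D"),
        ]
        print(best_worst_scores(judgments))  # e.g. resp_A: 0.5, resp_C: 0.0, resp_D: -1.0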

    A TOGAF Based Chatbot Evaluation Metrics: Insights from Literature Review

    Chatbots have been used for basic conversational functionalities and task performance in today's world. With the surge in the use of chatbots, several design features have emerged to cater to their rising demands and increasing complexity. Researchers have grappled with the issues of modeling and evaluating these tools because of the vast number of metrics associated with their measure of success. This paper conducted a literature survey to identify the various conversational metrics used to evaluate chatbots. The selected evaluation metrics were mapped to the various layers of The Open Group Architecture Framework (TOGAF) architecture. The TOGAF architecture helped us divide the metrics based on the various facets critical to developing successful chatbot applications. Our results show that the metrics related to the business layer have been well studied. However, metrics associated with the data, information, and system layers warrant more research. As chatbots become more complex, success metrics across the intermediate layers may assume greater significance.
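    A small sketch of the kind of layer-to-metric mapping this survey describes, using the layer names from the abstract. The example metrics and the coverage threshold are placeholders, not the paper's actual mapping.

        # Illustrative mapping of TOGAF-style layers to example chatbot metrics.
        metrics_by_layer = {
            "business": ["task completion rate", "user satisfaction"],  # well studied
            "data": ["training-data coverage"],
            "information": ["intent classification accuracy"],
            "system": ["response latency"],
        }

        def layers_needing_research(mapping, threshold=2):
            """Flag layers with fewer than `threshold` studied metrics."""
            return [layer for layer, metrics in mapping.items() if len(metrics) < threshold]

        print(layers_needing_research(metrics_by_layer))  # ['data', 'information', 'system']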

    Enriching Word Embeddings with Food Knowledge for Ingredient Retrieval

    Smart assistants and recommender systems must deal with large amounts of information coming from different sources and in different formats. This is especially frequent in text data, which presents increased variability and complexity, and is rather common for conversational assistants or chatbots. Moreover, this issue is very evident in the food and nutrition lexicon, where the semantics present increased variability, namely due to hypernyms and hyponyms. This work describes the creation of a set of word embeddings based on the incorporation of information from a food thesaurus - LanguaL - through retrofitting. The ingredients were classified according to three different facet label groups. Retrofitted embeddings seem to properly encode food-specific knowledge, as shown by an increase in accuracy compared to generic embeddings (+23%, +10% and +31% per group). Moreover, a weighting mechanism based on TF-IDF was applied to embedding creation before retrofitting, also bringing an increase in accuracy (+5%, +9% and +5% per group). Finally, the approach was tested with human users in an ingredient retrieval exercise, showing a very positive evaluation (77.3% of the volunteer testers preferred this method over a string-based matching algorithm).
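    Retrofitting, as referenced here, iteratively nudges each pre-trained embedding towards the embeddings of its neighbours in a semantic graph. The sketch below follows the commonly used retrofitting formulation (Faruqui et al.) with toy vectors and a toy neighbour list standing in for the LanguaL-derived relations; it is not the paper's actual data or code.

        # Minimal retrofitting sketch; vectors and neighbour links are toy placeholders.
        import numpy as np

        def retrofit(vectors, neighbours, iterations=10, alpha=1.0, beta=1.0):
            """Pull each embedding towards its thesaurus neighbours."""
            new = {w: v.copy() for w, v in vectors.items()}
            for _ in range(iterations):
                for word, nbrs in neighbours.items():
                    nbrs = [n for n in nbrs if n in new]
                    if word not in vectors or not nbrs:
                        continue
                    # weighted average of the original vector and current neighbour vectors
                    total = alpha * vectors[word] + beta * sum(new[n] for n in nbrs)
                    new[word] = total / (alpha + beta * len(nbrs))
            return new

        vectors = {"apple": np.array([1.0, 0.0]), "fruit": np.array([0.0, 1.0])}
        neighbours = {"apple": ["fruit"]}  # e.g. a hypernym link from a food thesaurus
        print(retrofit(vectors, neighbours)["apple"])  # moves towards "fruit": [0.5 0.5]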

    Conversational Agents - Exploring Generative Mechanisms and Second-hand Effects of Actualized Technology Affordances

    Many organisations have jumped on the bandwagon and implemented conversational agents (CAs) as a new communication channel. Customers benefit from shorter resolution times, ubiquitous availability, and consistent and compliant responses. However, despite the hype around CAs and their various benefits for customers, we know little about the effects of external-facing CAs on the human workforce. This is crucial for better managing possible changes in the organisation of work. Adopting a critical realist stance and using the lens of technology affordances, we explore a) why users increasingly actualise CA affordances and b) the first- and second-hand effects of affordance actualisation on customers and human employees. We conducted semi-structured interviews with 18 experts in the field and introduce the term affordance effects pairs to describe the relationships between first- and second-hand effects. We further explain which generative mechanisms lead to an increasing actualisation of affordances and the associated effects.

    Conversational Agent Experience: How to Create Good Alexa Skill

    Conversational design guidelines offer recommendations on how to lead the user-agent conversation, how to help customers achieve their goals, and how to handle the mistakes caused by each side. However, an effective methodology for evaluating the experience of user-agent conversation is unclear. Here we present a data pipeline that evaluates the user-agent experience across a variety of scenarios. We found that the coherence of Alexa's responses has a positive impact on users' experience, which depends on the categories of skills, the number of slots in utterances, and the goals that users are trying to achieve. Furthermore, our study shows a gap between theoretical conversational design guidelines and the needs of practical testing for CAs. Our data pipeline demonstrates the importance of testing experience with measurements that capture positive or negative affect in the conversational experience. We anticipate our study to be a starting point for a more robust user experience evaluation system for CAs and related applications. Master of Science in Information, School of Information. http://deepblue.lib.umich.edu/bitstream/2027.42/168558/1/20210104_Zhou,Xunan[Andy]_Final_MTOP_Thesis.pd
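    A hypothetical sketch of the kind of aggregation such an evaluation pipeline might perform: group per-interaction experience ratings by response coherence, skill category, or slot count, then compare group means. The field names and data are illustrative assumptions, not the thesis's pipeline.

        # Illustrative aggregation of user-experience ratings (assumed schema).
        from collections import defaultdict
        from statistics import mean

        interactions = [
            {"skill_category": "smart_home", "num_slots": 1, "coherent": True,  "experience": 4},
            {"skill_category": "smart_home", "num_slots": 1, "coherent": False, "experience": 2},
            {"skill_category": "trivia",     "num_slots": 2, "coherent": True,  "experience": 5},
        ]

        def mean_experience_by(records, key):
            groups = defaultdict(list)
            for record in records:
                groups[record[key]].append(record["experience"])
            return {value: mean(scores) for value, scores in groups.items()}

        print(mean_experience_by(interactions, "coherent"))        # coherent vs. incoherent responses
        print(mean_experience_by(interactions, "skill_category"))  # per skill category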

    A Maturity Assessment Framework for Conversational AI Development Platforms

    Conversational Artificial Intelligence (AI) systems have recently skyrocketed in popularity and are now used in many applications, from car assistants to customer support. The development of conversational AI systems is supported by a large variety of software platforms, all with similar goals but different focus points and functionalities. A systematic foundation for classifying conversational AI platforms is currently lacking. We propose a framework for assessing the maturity level of conversational AI development platforms. Our framework is based on a systematic literature review, in which we extracted common and distinguishing features of various open-source and commercial (or in-house) platforms. Inspired by language reference frameworks, we identify different maturity levels that a conversational AI development platform may exhibit in understanding and responding to user inputs. Our framework can guide organizations in selecting a conversational AI development platform according to their needs, as well as help researchers and platform developers improve the maturity of their platforms. Comment: 10 pages, 10 figures. Accepted for publication at SAC 2021: ACM/SIGAPP Symposium On Applied Computing
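    One way such a maturity assessment could be operationalized is to assign a platform the highest level all of whose required capabilities it supports. The level names and capability sets below are placeholders for illustration only, not the levels defined in the paper.

        # Illustrative maturity assessment; levels and capabilities are assumed, not the paper's.
        LEVELS = [
            ("L1", {"keyword matching"}),
            ("L2", {"keyword matching", "intent classification", "slot filling"}),
            ("L3", {"keyword matching", "intent classification", "slot filling", "context tracking"}),
        ]

        def maturity_level(platform_capabilities):
            """Return the highest level whose required capabilities are all supported."""
            level = None
            for name, required in LEVELS:
                if required <= platform_capabilities:
                    level = name
            return level

        print(maturity_level({"keyword matching", "intent classification", "slot filling"}))  # 'L2'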