Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services
Dialog Systems (or simply bots) have recently become a popular human-computer interface for performing users' tasks by invoking the appropriate back-end APIs (Application Programming Interfaces) based on the user's request in natural language. Building task-oriented bots, which aim at performing real-world tasks (e.g., booking flights), has become feasible with the continuous advances in Natural Language Processing (NLP), Artificial Intelligence (AI), and the countless number of devices which allow third-party software systems to invoke their back-end APIs.
Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human languages. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks -- called intents.
To build a bot, developers need to manually translate APIs to utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets, by paraphrasing the initial utterances generated by the bot developers for each task. However, there are several unsolved issues. First, generating canonical utterances requires manual effort, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy and incorrect (not conveying the same intent as the given task).
This thesis first surveys the state-of-the-art approaches for collecting large sets of training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues of crowdsourced utterances (e.g., grammatical errors, semantic completeness). Moreover, we propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. In particular, we propose a novel technique to promote the diversity of crowdsourced paraphrases by dynamically generating word suggestions while crowd workers are paraphrasing a particular utterance. In addition, we propose a novel technique to automatically translate APIs to canonical utterances. Finally, we present our platform to automatically generate bots out of API specifications. We also conduct thorough experiments to validate the proposed techniques and models.
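To make the idea of translating an API into a canonical utterance concrete, the following is a minimal rule-based sketch. It is not the technique proposed in the thesis, which the abstract does not detail; the verb mapping, the helper names `canonical_utterance` and `_singular`, and the example paths are all assumptions made for illustration.

```python
# Hypothetical mapping from HTTP methods to verbs (an assumption, not from the thesis).
VERBS = {"GET": "get", "POST": "create", "PUT": "update", "DELETE": "delete"}


def _singular(noun: str) -> str:
    """Naive singularization: drop a trailing 's'."""
    return noun[:-1] if noun.endswith("s") else noun


def canonical_utterance(method: str, path: str) -> str:
    """Build a template sentence from an operation such as ('GET', '/flights/{flight_id}')."""
    verb = VERBS[method.upper()]
    segments = [s for s in path.strip("/").split("/") if s]
    # Path parameters look like {name}; everything else is treated as a resource name.
    params = [s[1:-1] for s in segments if s.startswith("{") and s.endswith("}")]
    resources = [s for s in segments if not s.startswith("{")]
    noun = resources[-1].replace("_", " ") if resources else "resource"

    if segments and segments[-1].startswith("{"):
        phrase = "the " + _singular(noun)   # operation targets a single item
    elif method.upper() == "POST":
        phrase = "a " + _singular(noun)     # creating one new item
    else:
        phrase = "all " + noun              # operation targets the collection

    slots = " and ".join(f"{p.replace('_', ' ')} being <{p}>" for p in params)
    return f"{verb} {phrase}" + (f" with {slots}" if slots else "")


print(canonical_utterance("GET", "/flights/{flight_id}"))
print(canonical_utterance("POST", "/flights"))
```

Utterances produced this way are deliberately stilted; in the workflow the abstract describes, they would serve only as seeds that crowd workers then paraphrase into more natural and diverse forms.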
Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data and Methodology
Conversational interfaces are increasingly popular as a way of connecting
people to information. Corpus-based conversational interfaces are able to
generate more diverse and natural responses than template-based or
retrieval-based agents. With the increased generative capacity of corpus-based
conversational agents comes the need to classify and filter out malevolent
responses that are inappropriate in terms of content and dialogue acts.
Previous studies on the topic of recognizing and classifying inappropriate
content are mostly focused on a certain category of malevolence or on single
sentences instead of an entire dialogue. In this paper, we define the task of
Malevolent Dialogue Response Detection and Classification (MDRDC). We make
three contributions to advance research on this task. First, we present a
Hierarchical Malevolent Dialogue Taxonomy (HMDT). Second, we create a labelled
multi-turn dialogue dataset and formulate the MDRDC task as a hierarchical
classification task over this taxonomy. Third, we apply state-of-the-art text
classification methods to the MDRDC task and report on extensive experiments
aimed at assessing the performance of these approaches.
Comment: under review at JASIS
Recommendation in Dialogue Systems
Dialogue systems have been an active research field for decades and have developed rapidly in recent years, owing to breakthroughs in deep learning techniques. How to make recommendations in dialogue systems is attracting increasing attention because such systems could meet various user information needs and have much commercial potential.
Current dialogue system research typically focuses on building systems for social conversation, question answering, and performing specific tasks. However, making recommendations to users, an important information need, has not been intensively researched. Meanwhile, traditional recommender systems are usually developed for non-conversational scenarios. In this dissertation, we explore how to integrate these two systems into one framework that specifically aims at making recommendations in dialogues. Such a system helps users find items by chatting with them to understand their preferences and recommending accordingly.
First, we build conversational recommendation datasets, because existing dialogue datasets do not have user-item preference information or dialogue utterances discussing facets of items, and current recommendation datasets do not have dialogue scripts associated with each user-item pair. We build the datasets by asking crowdsourcing workers to compose dialogue utterances based on schemas and then use a delexicalization approach to simulate dialogues with the collected utterances. The datasets are used to train the natural language understanding component and provide recommendation information for our system.
Based on the collected datasets, we propose a reinforcement learning based conversational recommendation framework. The framework has three components: a belief tracker, a dialogue manager, and a recommender. The dialogue agent learns to first chat with a user to understand her preferences and, when it is confident enough, to recommend a list of items to the user. We conduct both offline and online experiments to demonstrate the effectiveness of the framework.
We further extend this framework with a personalized probabilistic recommender module. This recommender learns to predict the probability that a user likes an item given the dialogue utterance information and the personalized user preference information. By leveraging this hybrid information, the recommendation and dialogue performances are further improved. We evaluate the dialogue agent's strength in various simulated environments as well as in online user studies and demonstrate the advantages of this approach.
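The interplay of belief tracker, dialogue manager, and recommender can be sketched roughly as follows. This is an illustration under stated assumptions, not the dissertation's system: the dialogue manager here is a fixed confidence threshold standing in for the learned reinforcement learning policy, the recommender is a simple facet-match ranking, and every class name, function name, facet, and item is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BeliefTracker:
    """Keeps the facet values the user has stated so far."""
    facets: tuple
    belief: dict = field(default_factory=dict)

    def update(self, slots: dict) -> None:
        # Keep only slots that correspond to known item facets.
        self.belief.update({k: v for k, v in slots.items() if k in self.facets})

    def confidence(self) -> float:
        # Fraction of facets filled: a crude stand-in for a learned belief state.
        return len(self.belief) / len(self.facets)


def dialogue_manager(tracker: BeliefTracker, threshold: float = 0.66):
    """Decide the next action with a fixed rule (assumed; the source learns this policy)."""
    if tracker.confidence() >= threshold:
        return ("recommend", None)
    missing = next(f for f in tracker.facets if f not in tracker.belief)
    return ("request", missing)


def recommend(items: list, belief: dict, k: int = 3) -> list:
    """Rank items by how many stated facet values they match."""
    ranked = sorted(items, key=lambda it: -sum(it.get(f) == v for f, v in belief.items()))
    return [it["name"] for it in ranked[:k]]


# Toy catalogue, invented for illustration only.
ITEMS = [
    {"name": "A", "cuisine": "thai", "city": "Boston", "price": "low"},
    {"name": "B", "cuisine": "thai", "city": "Austin", "price": "low"},
    {"name": "C", "cuisine": "pizza", "city": "Boston", "price": "high"},
]

tracker = BeliefTracker(facets=("cuisine", "city", "price"))
for user_slots in ({"cuisine": "thai"}, {"city": "Boston"}):
    tracker.update(user_slots)
    action, facet = dialogue_manager(tracker)
    if action == "recommend":
        print("recommend:", recommend(ITEMS, tracker.belief, k=2))
    else:
        print("ask about:", facet)
```

Replacing the threshold rule with a policy trained on dialogue rewards, and the match count with a probabilistic model of user preference, would move this sketch toward the kind of framework the abstract describes.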
Evaluating Machine Intelligence with Question Answering
Humans ask questions to learn about the world and to test knowledge understanding. The ability to ask questions combines aspects of intelligence unique to humans: language understanding, knowledge representation, and reasoning. Thus, building systems capable of intelligent question answering (QA) is a grand goal of natural language processing (NLP). To measure progress in NLP, we create "exams" for computer systems and compare their effectiveness against a reference point---often based on humans. How precisely we measure progress depends on whether we are building computer systems that optimize human satisfaction in information-seeking tasks or that measure progress towards intelligent QA. In the first part of this dissertation, we explore each goal in turn, how they differ, and describe their relationship to QA formats. As an example of an information-seeking evaluation, we introduce a new dialog QA task paired with a new evaluation method. Afterward, we turn our attention to using QA to evaluate machine intelligence.
A good evaluation should be able to discriminate between lesser and more capable QA models. This dissertation explores three ways to improve the discriminative power of QA evaluations: (1) dynamic weighting of test questions, (2) a format that by construction tests multiple levels of knowledge, and (3) evaluation data that is created through human-computer collaboration.
By dynamically weighting test questions, we challenge a foundational assumption of the de facto standard in QA evaluation---the leaderboard. Namely, we contend that, contrary to nearly all QA and NLP evaluations, which implicitly assign equal weights to examples by averaging scores, examples are not equally useful for estimating machine (or human) QA ability. As any student may tell you, not all questions on an exam are equally difficult, and in the worst case some questions are unsolvable. Drawing on decades of research in educational testing, we propose adopting an alternative evaluation methodology---Item Response Theory---that is widely used to score human exams (e.g., the SAT). We show that dynamically weighting questions improves the reliability of leaderboards in discriminating between models of differing QA ability while also being helpful in the construction of new evaluation datasets.
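As a rough illustration of why questions need not count equally, the sketch below uses the standard two-parameter logistic (2PL) model from Item Response Theory. Whether the dissertation uses this exact variant is an assumption; the function names and the example parameter values are chosen here for illustration.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """2PL item response function: probability that a subject answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def item_information(ability: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Fisher information of a 2PL item at a given ability: a^2 * p * (1 - p)."""
    p = p_correct(ability, difficulty, discrimination)
    return discrimination ** 2 * p * (1.0 - p)


# An item matched to the subject's ability is far more informative
# than one that is much too hard for that subject.
print(item_information(ability=0.0, difficulty=0.0))
print(item_information(ability=0.0, difficulty=6.0))
```

Under this model, an item whose difficulty lies far from the abilities being compared contributes almost nothing to telling subjects apart, whereas a plain average gives it the same weight as every other item.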
Having improved the scoring of models, we next turn to improving the format and data in QA evaluations. Our idea is simple. In most QA tasks (e.g., Jeopardy!), each question tests a single level of knowledge; in our task (the trivia game Quizbowl), we test multiple levels of knowledge with each question. Since each question tests multiple levels of knowledge, this decreases the likelihood that we learn nothing about the difference between two models (i.e., they are both correct or both wrong), which substantially increases discriminative power.
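The argument about ties can be illustrated with a small sketch. The outcome encoding (number of clues a model needed, with 0 meaning it never answered correctly) and the toy outcome lists are assumptions invented for illustration; they are not results from the dissertation.

```python
def uninformative_fraction(outcomes_a: list, outcomes_b: list) -> float:
    """Share of questions on which two models obtain identical outcomes."""
    if len(outcomes_a) != len(outcomes_b):
        raise ValueError("both models must be scored on the same questions")
    ties = sum(a == b for a, b in zip(outcomes_a, outcomes_b))
    return ties / len(outcomes_a)


# Made-up graded outcomes: clues needed before a correct answer (0 = never correct).
graded_a = [2, 3, 0, 1]
graded_b = [4, 3, 0, 0]

# The same outcomes collapsed to a single level: correct (1) or wrong (0).
binary_a = [int(x > 0) for x in graded_a]
binary_b = [int(x > 0) for x in graded_b]

print(uninformative_fraction(binary_a, binary_b))  # 0.75
print(uninformative_fraction(graded_a, graded_b))  # 0.5
```

In this toy case the first question separates the two models only under the graded format, since both eventually answer it correctly but one needs fewer clues.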
Despite the improved format, we next show that while our QA models defeat accomplished trivia players, they are overly reliant on brittle pattern matching, which indicates a failure to intelligently answer questions. To mitigate this problem, we introduce a new framework for building evaluation data where humans and machines cooperatively craft trivia questions that are difficult to answer through clever pattern matching tricks alone---while being no harder for humans.
We conclude by sketching a broader vision for QA evaluation that combines the three components of evaluation we improve---scoring, format, and data---to create living evaluations and re-imagine the role of leaderboards.
A Survey of Natural Language Generation
This paper offers a comprehensive review of the research on Natural Language
Generation (NLG) over the past two decades, especially in relation to
data-to-text generation and text-to-text generation deep learning methods, as
well as new applications of NLG technology. This survey aims to (a) give the
latest synthesis of deep learning research on the NLG core tasks, as well as
the architectures adopted in the field; (b) detail meticulously and
comprehensively various NLG tasks and datasets, and draw attention to the
challenges in NLG evaluation, focusing on different evaluation methods and
their relationships; (c) highlight some future emphasis and relatively recent
research issues that arise due to the increasing synergy between NLG and other
artificial intelligence areas, such as computer vision, text and computational
creativity.
Comment: Accepted by ACM Computing Survey (CSUR) 202