106 research outputs found
Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection
Linguistically diverse datasets are critical for training and evaluating
robust machine learning systems, but data collection is a costly process that
often requires experts. Crowdsourcing the process of paraphrase generation is
an effective means of expanding natural language datasets, but there has been
limited analysis of the trade-offs that arise when designing tasks. In this
paper, we present the first systematic study of the key factors in
crowdsourcing paraphrase collection. We consider variations in instructions,
incentives, data domains, and workflows. We manually analyzed paraphrases for
correctness, grammaticality, and linguistic diversity. Our observations provide
new insight into the trade-offs between accuracy and diversity in crowd
responses that arise as a result of task design, providing guidance for future
paraphrase generation procedures.Comment: Published at ACL 201
Worker Retention, Response Quality, and Diversity in Microtask Crowdsourcing: An Experimental Investigation of the Potential for Priming Effects to Promote Project Goals
Online microtask crowdsourcing platforms act as efficient resources for delegating small units of work, gathering data, generating ideas, and more. Members of research and business communities have incorporated crowdsourcing into problem-solving processes. When human workers contribute to a crowdsourcing task, they are subject to various stimuli as a result of task design. Inter-task priming effects - through which work is nonconsciously, yet significantly, influenced by exposure to certain stimuli - have been shown to affect microtask crowdsourcing responses in a variety of ways. Instead of simply being wary of the potential for priming effects to skew results, task administrators can utilize proven priming procedures in order to promote project goals. In a series of three experiments conducted on Amazon’s Mechanical Turk, we investigated the effects of proposed priming treatments on worker retention, response quality, and response diversity. In our first two experiments, we studied the effect of initial response freedom on sustained worker participation and response quality. We expected that workers who were granted greater levels of freedom in an initial response would be stimulated to complete more work and deliver higher quality work than workers originally constrained in their initial response possibilities. We found no significant relationship between the initial response freedom granted to workers and the amount of optional work they completed. The degree of initial response freedom also did not have a significant impact on subsequent response quality. However, the influence of inter-task effects were evident based on response tendencies for different question types. We found evidence that consistency in task structure may play a stronger role in promoting response quality than proposed priming procedures. In our final experiment, we studied the influence of a group-level priming treatment on response diversity. Instead of varying task structure for different workers, we varied the degree of overlap in question content distributed to different workers in a group. We expected groups of workers that were exposed to more diverse preliminary question sets to offer greater diversity in response to a subsequent question. Although differences in response diversity were revealed, no consistent trend between question content overlap and response diversity prevailed. Nevertheless, combining consistent task structure with crowd-level priming procedures - to encourage diversity in inter-task effects across the crowd - offers an exciting path for future study
Crowdsourcing for Reminiscence Chatbot Design
In this work-in-progress paper we discuss the challenges in identifying
effective and scalable crowd-based strategies for designing content,
conversation logic, and meaningful metrics for a reminiscence chatbot targeted
at older adults. We formalize the problem and outline the main research
questions that drive the research agenda in chatbot design for reminiscence and
for relational agents for older adults in general
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
We describe PARANMT-50M, a dataset of more than 50 million English-English
sentential paraphrase pairs. We generated the pairs automatically by using
neural machine translation to translate the non-English side of a large
parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M
can be a valuable resource for paraphrase generation and can provide a rich
source of semantic knowledge to improve downstream natural language
understanding tasks. To show its utility, we use ParaNMT-50M to train
paraphrastic sentence embeddings that outperform all supervised systems on
every SemEval semantic textual similarity competition, in addition to showing
how it can be used for paraphrase generation
Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services
Dialog Systems (or simply bots) have recently become a popular human-computer interface for performing user's tasks, by invoking the appropriate back-end APIs (Application Programming Interfaces) based on the user's request in natural language. Building task-oriented bots, which aim at performing real-world tasks (e.g., booking flights), has become feasible with the continuous advances in Natural Language Processing (NLP), Artificial Intelligence (AI), and the countless number of devices which allow third-party software systems to invoke their back-end APIs.
Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human languages. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks -- called intents.
To build a bot, developers need to manually translate APIs to utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets,
by paraphrasing the initial utterances generated by the bot developers for each task. However, there are several unsolved issues. First, generating canonical utterances requires manual efforts, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy and incorrect (not conveying the same intent as the given task).
This thesis first surveys the state-of-the-art approaches for collecting large training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues of crowdsourced utterances (e.g., grammatical errors, semantic completeness). Moreover, we propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. Particularly, we propose a novel technique to promote the diversity of crowdsourced paraphrases by dynamically generating word suggestions while crowd workers are paraphrasing a particular utterance. Moreover, we propose a novel technique to automatically translate APIs to canonical utterances. Finally, we present our platform to automatically generate bots out of API specifications. We also conduct thorough experiments to validate the proposed techniques and models
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
Commonsense AI has long been seen as a near impossible goal -- until
recently. Now, research interest has sharply increased with an influx of new
benchmarks and models.
We propose two new ways to evaluate commonsense models, emphasizing their
generality on new tasks and building on diverse, recently introduced
benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote
research on commonsense models that generalize well over multiple tasks and
datasets. Second, we propose a novel evaluation, the cost equivalent curve,
that sheds new insight on how the choice of source datasets, pretrained
language models, and transfer learning methods impacts performance and data
efficiency.
We perform extensive experiments -- over 200 experiments encompassing 4800
models -- and report multiple valuable and sometimes surprising findings, e.g.,
that transfer almost always leads to better or equivalent performance if
following a particular recipe, that QA-based commonsense datasets transfer well
with each other, while commonsense knowledge graphs do not, and that perhaps
counter-intuitively, larger models benefit more from transfer than smaller
ones.
Last but not least, we introduce a new universal commonsense reasoning model,
UNICORN, that establishes new state-of-the-art performance across 8 popular
commonsense benchmarks, aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA
(90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%) and CommonsenseQA
(79.3%).Comment: 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For
associated code and data see https://github.com/allenai/rainbo
Impressions: Understanding Visual Semiotics and Aesthetic Impact
Is aesthetic impact different from beauty? Is visual salience a reflection of
its capacity for effective communication? We present Impressions, a novel
dataset through which to investigate the semiotics of images, and how specific
visual features and design choices can elicit specific emotions, thoughts and
beliefs. We posit that the impactfulness of an image extends beyond formal
definitions of aesthetics, to its success as a communicative act, where style
contributes as much to meaning formation as the subject matter. However, prior
image captioning datasets are not designed to empower state-of-the-art
architectures to model potential human impressions or interpretations of
images. To fill this gap, we design an annotation task heavily inspired by
image analysis techniques in the Visual Arts to collect 1,440 image-caption
pairs and 4,320 unique annotations exploring impact, pragmatic image
description, impressions, and aesthetic design choices. We show that existing
multimodal image captioning and conditional generation models struggle to
simulate plausible human responses to images. However, this dataset
significantly improves their ability to model impressions and aesthetic
evaluations of images through fine-tuning and few-shot adaptation.Comment: To be published in EMNLP 202
- …