MTurk 101: An Introduction to Amazon Mechanical Turk for Extension Professionals
Amazon Mechanical Turk (MTurk) is an online marketplace for labor recruitment that has become a popular platform for data collection. In particular, MTurk can be a valuable tool for Extension professionals. As an example, MTurk workers can provide feedback, write reviews, or give input on a website design. In this article, we discuss the many uses of MTurk for Extension professionals and provide best practices for its use.
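A minimal sketch of what such a data-collection task looks like in code, using boto3 (the AWS SDK for Python) against the MTurk sandbox; the survey URL, reward, and worker counts below are illustrative assumptions, not values from the article:

```python
# Sketch: posting a website-feedback task to MTurk via boto3.
# HIT parameters and the survey URL are illustrative placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint for testing; remove for the production marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# An ExternalQuestion points workers at a survey hosted elsewhere
# (e.g., a form collecting website-design feedback).
question_xml = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.org/website-feedback-survey</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Give feedback on a website design",
    Description="View a website mockup and answer a short questionnaire.",
    Keywords="survey, feedback, website",
    Reward="0.50",                       # USD per assignment
    MaxAssignments=30,                   # number of distinct workers
    LifetimeInSeconds=3 * 24 * 3600,     # HIT visible for three days
    AssignmentDurationInSeconds=15 * 60,
    Question=question_xml,
)
print("HIT ID:", hit["HIT"]["HITId"])
```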
The (Statistical) Power of Mechanical Turk
In this paper, I argue for the use of Amazon Mechanical Turk (AMT) in language research. AMT is an online marketplace of paid workers who may be used as subjects, which can greatly increase the statistical power of studies quickly and with minimal funding. I will show that, despite some obvious limitations of using remote subjects, properly designed experiments completed on AMT are trustworthy, cheap, and much faster than traditional face-to-face data collection. Moreover, AMT workers can also help with data analysis, which can greatly increase the scope of research that a single researcher may carry out. This paper first presents several reasons for using online subjects, then briefly outlines how to build a survey-type experiment on AMT, and finally reviews several best practices for ensuring reliable data.
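To make the statistical-power argument concrete: the sample sizes that a two-sample design needs are easy to compute, and AMT makes them affordable. A sketch using statsmodels, with conventional (assumed) parameter values rather than figures from the paper:

```python
# Sketch of the power calculation behind the paper's argument:
# how many subjects per group does a two-sample t-test need?
# Effect size, alpha, and power are conventional defaults,
# not values taken from the paper.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.3,   # assumed small-to-medium Cohen's d
    alpha=0.05,        # significance level
    power=0.80,        # desired power
    alternative="two-sided",
)
print(f"Subjects needed per group: {n_per_group:.0f}")  # ~176
```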
Introduction to the special issue on annotated corpora
Annotated corpora are increasingly important for linguistic scholarship, science, and technology. This special issue briefly surveys the development of the field and points to challenges within the current framework of annotation using analytical categories, as well as challenges to the framework itself. It presents three articles: one concerning the evaluation of annotation quality, and two concerning French treebanks, the first dealing with the oldest treebank project for French, the French Treebank, and the second concerning the conversion of French corpora into the cross-lingual framework of Universal Dependencies, thus offering an illustration of the history of treebank development worldwide.
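Treebanks in the Universal Dependencies scheme discussed here are distributed in CoNLL-U format; as a minimal illustration, the `conllu` Python package reads them directly (the file name below is a placeholder):

```python
# Sketch: reading a Universal Dependencies treebank in CoNLL-U format
# with the `conllu` package. The file name is a placeholder.
from conllu import parse

with open("fr_gsd-ud-train.conllu", encoding="utf-8") as f:
    sentences = parse(f.read())

# Each token carries the analytical categories the issue discusses:
# word form, universal POS tag, head index, and dependency relation.
for token in sentences[0]:
    print(token["form"], token["upos"], token["head"], token["deprel"])
```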
Crowdsourcing Emotions in Music Domain
An important source of intelligence for music emotion recognition today comes from user-provided community tags about songs or artists. Recent crowdsourcing approaches, such as harvesting social tags, designing collaborative games and web services, or using Mechanical Turk, are becoming popular in the literature. They provide a cheap, quick, and efficient alternative to professional labeling of songs, which is expensive and does not scale to large datasets. In this paper we discuss the viability of various crowdsourcing instruments, providing examples from published research. We also share our own experience, illustrating the steps we followed using tags collected from Last.fm to create two music mood datasets, which we have made public. While processing the affect tags from Last.fm, we observed that they tend to be biased towards positive emotions; the resulting datasets thus contain more positive songs than negative ones.
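As a rough illustration of the tag-harvesting step the paper describes, Last.fm exposes per-track top tags through its public API; the API key and the mood keyword set below are placeholders, not the authors' actual vocabulary:

```python
# Sketch: harvesting affect tags for one track from the Last.fm API.
# LASTFM_API_KEY is a placeholder; the mood keywords are illustrative,
# not the vocabulary used for the paper's datasets.
import requests

LASTFM_API_KEY = "YOUR_API_KEY"
MOOD_KEYWORDS = {"happy", "sad", "angry", "relaxed"}  # assumed example set

resp = requests.get(
    "https://ws.audioscrobbler.com/2.0/",
    params={
        "method": "track.getTopTags",
        "artist": "Radiohead",
        "track": "Creep",
        "api_key": LASTFM_API_KEY,
        "format": "json",
    },
    timeout=10,
)
tags = resp.json()["toptags"]["tag"]

# Keep only tags that look like mood labels; the positive-emotion bias
# the paper reports would surface in counts over sets like this.
mood_tags = [(t["name"], int(t["count"])) for t in tags
             if t["name"].lower() in MOOD_KEYWORDS]
print(mood_tags)
```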
Descartes: Generating Short Descriptions of Wikipedia Articles
Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia's guidelines state that all articles should be annotated with a so-called short description indicating the article's topic (e.g., the short description of beer is "Alcoholic drink made from fermented cereal grains"). Nonetheless, a large fraction of articles (ranging from 10.2% in Dutch to 99.7% in Kazakh) have no short description yet, with detrimental effects for millions of Wikipedia users. Motivated by this problem, we introduce the novel task of automatically generating short descriptions for Wikipedia articles and propose Descartes, a multilingual model for tackling it. Descartes integrates three sources of information to generate an article description in a target language: the text of the article in all its language versions, the already-existing descriptions (if any) of the article in other languages, and semantic type information obtained from a knowledge graph. We evaluate a Descartes model trained for handling 25 languages simultaneously, showing that it beats baselines (including a strong translation-based baseline) and performs on par with monolingual models tailored for specific languages. A human evaluation on three languages further shows that the quality of Descartes's descriptions is largely indistinguishable from that of human-written descriptions; e.g., 91.3% of our English descriptions (vs. 92.1% of human-written descriptions) pass the bar for inclusion in Wikipedia, suggesting that Descartes is ready for production, with the potential to support human editors in filling a major gap in today's Wikipedia across languages.
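For context, the short description the paper targets is the same field Wikipedia's public REST API exposes, which makes it easy to check whether a given article already has one; a minimal sketch, independent of the Descartes model itself:

```python
# Sketch: fetching an article's existing short description via
# Wikipedia's public REST API; this is the field Descartes would
# fill in when it is missing.
import requests

def short_description(title: str, lang: str = "en") -> str | None:
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("description")  # None if no short description

print(short_description("Beer"))
# -> "Alcoholic drink made from fermented cereal grains"
```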
Five sources of bias in natural language processing
Recently, there has been increased interest in demographically grounded bias in natural language processing (NLP) applications. Much of the recent work has focused on describing bias and providing an overview of bias in a larger context. Here, we provide a simple, actionable summary of this recent work. We outline five sources where bias can occur in NLP systems: (1) the data, (2) the annotation process, (3) the input representations, (4) the models, and finally (5) the research design (or how we conceptualize our research). We explore each of these bias sources in detail in this article, including examples and links to related work, as well as potential countermeasures.
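As one concrete illustration, a standard first diagnostic for bias entering through the annotation process (source 2) is inter-annotator agreement; a minimal sketch with scikit-learn, using toy labels rather than data from the article:

```python
# Sketch: a first diagnostic for annotation-process bias via
# inter-annotator agreement (Cohen's kappa).
# The label arrays are toy illustrations, not data from the article.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["toxic", "ok", "ok", "toxic", "ok", "toxic"]
annotator_b = ["toxic", "ok", "toxic", "toxic", "ok", "ok"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # low agreement can signal unclear
                                      # guidelines or annotator bias
```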
Characterizing the Global Crowd Workforce: A Cross-Country Comparison of Crowdworker Demographics
Micro-task crowdsourcing is an international phenomenon that has emerged during the past decade. This paper sets out to explore the characteristics of the international crowd workforce and provides a cross-national comparison of the crowd workforce in ten countries. We provide an analysis and comparison of demographic characteristics and shed light on the significance of micro-task income for workers in different countries. This study is the first large-scale country-level analysis of the characteristics of workers on the platform Figure Eight (formerly CrowdFlower), one of the two platforms dominating the micro-task market. We find large differences between the characteristics of the crowd workforces of different countries, both regarding demography and regarding the importance of micro-task income for workers. Furthermore, we find that the composition of the workforce in the ten countries was largely stable across samples taken at different points in time.
Developing and validating a methodology for crowdsourcing L2 speech ratings in Amazon Mechanical Turk
Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in English. Although AMT and similar platforms are well positioned to enhance the state of the art in L2 research, it is unclear whether crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish speech samples. Fifty-four AMT workers who were native Spanish speakers from 11 countries participated in the ratings. Intraclass correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency ratings, but indices were slightly lower for accentedness, leading to recommendations to improve the task for future data collection.
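The group-level reliability step described here corresponds to a standard intraclass correlation computation; a minimal sketch with the pingouin package, using assumed column names in long format rather than the study's actual variables:

```python
# Sketch: group-level interrater reliability via intraclass correlation
# over crowdsourced ratings in long format. The file and column names
# ("sample", "worker", "comprehensibility") are placeholders, not the
# study's actual variable names.
import pandas as pd
import pingouin as pg

ratings = pd.read_csv("amt_ratings.csv")  # placeholder: one row per
                                          # (speech sample, worker) pair

icc = pg.intraclass_corr(
    data=ratings,
    targets="sample",            # the L2 speech samples being rated
    raters="worker",             # AMT workers
    ratings="comprehensibility", # one of the three rated dimensions
)
print(icc[["Type", "ICC", "CI95%"]])
```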