Crowdsourcing for Speech: Economic, Legal and Ethical analysis
With respect to spoken language resource production, crowdsourcing, the process of distributing tasks to an open, unspecified population via the internet, offers a wide range of opportunities: populations with specific skills are potentially instantly accessible somewhere on the globe for any spoken language. As with most newly introduced high-tech services, crowdsourcing raises both hopes and doubts, certainties and questions. A general analysis of crowdsourcing for speech processing can be found in (Eskenazi et al., 2013). This article focuses on the ethical, legal and economic issues of crowdsourcing in general (Zittrain, 2008a) and of crowdsourcing services such as Amazon Mechanical Turk (Fort et al., 2011; Adda et al., 2011), a major platform for multilingual language resource (LR) production.
LIG-AIKUMA: a Mobile App to Collect Parallel Speech for Under-Resourced Language Studies
This paper reports on our ongoing efforts to collect speech data in under-resourced or endangered languages of Africa. Data collection is carried out using an improved version of the Android application AIKUMA, developed by Steven Bird and colleagues [1]. Features were added to the app in order to facilitate the collection of parallel speech data, in line with the requirements of the French-German ANR/DFG BULB (Breaking the Unwritten Language Barrier) project. The resulting app, called LIG-AIKUMA, runs on various mobile phones and tablets and offers a range of speech collection modes (recording, respeaking, translation and elicitation). It was used for field data collection in Congo-Brazzaville, resulting in a total of over 80 hours of speech.
Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use
This article is a position paper about Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, this type of online working platform makes it possible to develop all sorts of quality language resources quickly, at a very low price, by people doing the work as a hobby. We demonstrate here that the situation is far from ideal. Our goal is manifold: 1) to inform researchers, so that they can make their own choices; 2) to develop alternatives with the help of funding agencies and scientific associations; 3) to propose practical and organizational solutions in order to improve language resource development, while limiting the risks of ethical and legal issues without giving up on price or quality; 4) to introduce an Ethics and Big Data Charter for the documentation of language resources.
Un turc mécanique pour les ressources linguistiques : critique de la myriadisation du travail parcellisé
This article is a position paper concerning Amazon Mechanical Turk-like systems, the use of which has been steadily growing in natural language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, these online working platforms make it possible to develop all sorts of quality language resources very quickly, for a very low price, by people doing the work as a hobby. We demonstrate here that the situation is far from ideal, be it from the point of view of quality, price, workers' status or ethics. We then recall already existing or proposed alternatives. Our goal is twofold: to inform researchers, so that they can make their own choices with all the elements of the reflection in mind, and to propose practical and organizational solutions in order to improve the development of new language resources, while limiting the risks of ethical and legal issues without giving up on price or quality.
The CAMOMILE collaborative annotation platform for multi-modal, multi-lingual and multi-media documents
In this paper, we describe the organization and implementation of the CAMOMILE collaborative annotation framework for multimodal, multimedia, multilingual (3M) data. Given the versatile nature of the analyses that can be performed on 3M data, the structure of the server was intentionally kept simple in order to preserve its genericity, relying on standard Web technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and can be managed through standard interfaces with authentication. Interfaces tailored to the task at hand can then be developed in an agile way, relying on simple but reliable services for the management of the centralized annotations. We then present our implementation of an active learning scenario for person annotation in video, relying on the CAMOMILE server; during a dry-run experiment, the manual annotation of 716 speech segments was propagated to 3,504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.
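The core data model described above, an annotation as data attached to a media fragment, grouped into layers, can be sketched as follows. This is a minimal illustration only; the class and field names are hypothetical and do not reflect the actual CAMOMILE API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class MediaFragment:
    """A time interval within one medium of the corpus."""
    medium_id: str
    start: float  # seconds
    end: float    # seconds

@dataclass
class Annotation:
    """A piece of data associated with a media fragment."""
    fragment: MediaFragment
    data: Dict[str, Any]  # e.g. {"person": "speaker_A"}

@dataclass
class Layer:
    """A layer groups homogeneous annotations over a corpus."""
    name: str
    annotations: List[Annotation] = field(default_factory=list)

    def in_medium(self, medium_id: str) -> List[Annotation]:
        """Retrieve all annotations attached to a given medium."""
        return [a for a in self.annotations
                if a.fragment.medium_id == medium_id]

# Build a tiny "person annotation in video" layer.
layer = Layer("persons")
layer.annotations.append(
    Annotation(MediaFragment("video_01", 12.0, 15.5), {"person": "speaker_A"}))
layer.annotations.append(
    Annotation(MediaFragment("video_02", 3.0, 7.2), {"person": "speaker_B"}))

print(len(layer.in_medium("video_01")))  # one annotation in video_01
```

In the actual framework such layers live server-side behind authenticated Web services; task-specific interfaces then only need generic create/read/update operations on annotations.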
Open-domain question answering (relevant document selection based on the question context)
This thesis aims at defining a unified adaptation of the document selection and answer extraction strategies, based on the document and question types, in a Question-Answering (QA) context. The solution is integrated into RITEL (a LIMSI QA system) to assess its contribution. We develop and investigate a method based on an Information Retrieval approach for the selection of relevant documents in QA. The method relies on a language model and a binary model of textual classification into relevant and irrelevant categories. It is used to filter out documents unusable for answer extraction, by automatically matching lists of a priori relevant documents to the question type. First, we present the method along with its underlying models and evaluate it on the QA task with RITEL in French. The evaluation is done on a corpus of 500,000 unsegmented web pages with factoid questions provided by the Quaero program (i.e. evaluation at the document level, or D-level). Then, we evaluate the method on segmented web pages (i.e. evaluation at the segment level, or S-level). The underlying hypothesis is that the information content of segments is more consistent, which facilitates answer extraction. D-level filtering brings a small improvement over the baseline (no filtering). S-level filtering outperforms both the baseline and D-level filtering, supporting the hypothesis, though not significantly. Finally, we study at the S-level the links between RITEL's performance and the key parameters of the method. In order to apply the method on segments, we created a web page segmentation system, which we present and evaluate on the QA task with the same corpora. This evaluation follows the former hypothesis and measures the impact of natural web page variability (in terms of size and content) on RITEL in its task. In general, the experimental results we obtained suggest that our IR-based method helps a QA system in its task; however, further evaluations should be conducted, especially with larger corpora of questions, to make the results significant.
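The filtering idea above, score each candidate document for a question and keep only documents deemed relevant, can be sketched with a simple smoothed query-likelihood unigram language model. This is a toy sketch of the general technique, not the thesis's actual models or the RITEL implementation; the threshold and data are illustrative.

```python
import math
from collections import Counter
from typing import List

def unigram_score(question: List[str], document: List[str],
                  alpha: float = 0.5) -> float:
    """Log-likelihood of the question terms under an additively
    smoothed unigram language model of the document."""
    counts = Counter(document)
    vocab = len(counts) + 1          # +1 for unseen terms
    total = len(document)
    score = 0.0
    for term in question:
        p = (counts[term] + alpha) / (total + alpha * vocab)
        score += math.log(p)
    return score

def filter_documents(question: List[str], documents: List[List[str]],
                     threshold: float) -> List[List[str]]:
    """Binary relevant/irrelevant decision: keep only documents
    whose query-likelihood exceeds the threshold."""
    return [d for d in documents
            if unigram_score(question, d) >= threshold]

question = "capital of france".split()
docs = [
    "paris is the capital of france".split(),
    "recipe for chocolate cake with cream".split(),
]
kept = filter_documents(question, docs, threshold=-7.0)
print(len(kept))  # only the first document survives filtering
```

Documents filtered out this way never reach the answer extraction stage, which is the point of the method: reducing noise rather than re-ranking.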
Preliminary Experiments on Unsupervised Word Discovery in Mboshi
The necessity to document thousands of endangered languages encourages collaboration between linguists and computer scientists in order to provide the documentary linguistics community with the support of automatic processing tools. The French-German ANR-DFG project Breaking the Unwritten Language Barrier (BULB) aims at developing such tools for three mostly unwritten African languages of the Bantu family. For one of them, Mboshi, a language originating from the "Cuvette" region of the Republic of Congo, we investigate unsupervised word discovery techniques from an unsegmented stream of phonemes. We compare different models and algorithms, both monolingual and bilingual, on a new corpus in Mboshi and French, and discuss various ways to represent the data with suitable granularity. An additional French-English corpus allows us to contrast the results obtained on Mboshi and to experiment with more data.
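One simple member of the family of techniques evaluated above, discovering word-like units in an unsegmented phoneme stream, is iterative pair merging in the spirit of byte-pair encoding: repeatedly fuse the most frequent adjacent pair of units so that recurring phoneme sequences condense into candidate words. The sketch below uses toy data, not actual Mboshi, and stands in for the more elaborate monolingual and bilingual models the paper compares.

```python
from collections import Counter
from typing import List

def merge_most_frequent_pair(utterances: List[List[str]]) -> List[List[str]]:
    """Merge the single most frequent adjacent unit pair across the corpus."""
    pairs = Counter()
    for u in utterances:
        pairs.update(zip(u, u[1:]))
    if not pairs:
        return utterances
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for u in utterances:
        out, i = [], 0
        while i < len(u):
            if i + 1 < len(u) and u[i] == a and u[i + 1] == b:
                out.append(u[i] + u[i + 1])  # fuse the pair into one unit
                i += 2
            else:
                out.append(u[i])
                i += 1
        merged.append(out)
    return merged

# Unsegmented phoneme sequences (toy data).
corpus = [list("mwanamwana"), list("mwana"), list("bamwana")]
for _ in range(4):  # a few merge iterations
    corpus = merge_most_frequent_pair(corpus)
print(corpus)  # the recurring sequence condenses into a single unit
```

After a few merges the frequent substring emerges as one unit across utterances; real word discovery models additionally exploit lexical priors and, in the bilingual setting, alignments to the French translations.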
"Where the data are coming from?" Ethics, crowdsourcing and traceability for Big Data in Human Language Technology
Based on the experience gained in observing corpus development in HLT, the authors want to warn the Big Data community about some recent uses of human computation. For instance, the growing use in the HLT community of crowdsourcing methods, and especially of paid microworking crowdsourcing platforms, leads to many ethical, economic and legal concerns. The authors also want to foster certain behaviours, especially concerning traceability, implemented in the form of a charter, the Ethics and Big Data Charter.
Images and imagination : automated analysis of priming effects related to autism spectrum disorder and developmental language disorder
Different aspects of language processing have been shown to be sensitive to priming, but the findings of studies examining priming effects in adolescents with Autism Spectrum Disorder (ASD) and Developmental Language Disorder (DLD) have been inconclusive. We present a study analysing visual and implicit semantic priming in adolescents with ASD and DLD. Based on a dataset of fictional and script-like narratives, we evaluate how often and how extensively content from two different priming sources is used by the participants. The first priming source was visual, consisting of images shown to the participants to assist their storytelling. The second priming source originated from commonsense knowledge, using crowdsourced data containing prototypical script elements. Our results show that individuals with ASD are less sensitive to both types of priming, but show typical usage of primed cues when they use them at all. In contrast, children with DLD show mostly average priming sensitivity, but exhibit an over-proportional use of the priming cues.
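The central measurement above, how much of the primed content actually surfaces in a participant's narrative, can be sketched as cue coverage: the fraction of cues from each priming source that appear in the produced text. This is a crude illustrative proxy; the cue lists, the tokenization and the metric are assumptions, not the study's actual method.

```python
from typing import List, Set

def priming_usage(narrative: List[str], cues: Set[str]) -> float:
    """Fraction of priming cues that appear in the narrative,
    a rough proxy for 'how often primed content is used'."""
    used = {c for c in cues if c in narrative}
    return len(used) / len(cues) if cues else 0.0

story = "the boy took a photo of the dog near the tree".split()
visual_cues = {"dog", "tree", "ball"}       # hypothetical image content
script_cues = {"photo", "camera", "smile"}  # hypothetical script elements

print(priming_usage(story, visual_cues))  # 2/3 of visual cues used
print(priming_usage(story, script_cues))  # 1/3 of script cues used
```

Comparing such per-source usage rates across groups (ASD, DLD, controls) is what allows statements like "less sensitive to both types of priming" or "over-proportional use of the priming cues".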