Search CORE

32 research outputs found

Competence-based Curriculum Learning for Neural Machine Translation

Author: Platanios Emmanouil Antonios
Stretcu Otilia
Neubig Graham
Poczos Barnabas
Mitchell Tom M.
Publication venue
Publication date: 11/05/1906
Field of study

Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU

arXiv.org e-Print Archive

Trinity College

Crowdsourcing and translation/localization : threat or opportunity?

Author: Capdevila Fernández Cristian
Publication venue: 'Universitat Autonoma de Barcelona'
Publication date: 01/01/2012
Field of study

El crowdsourcing ha canviat el món de la traducció als darrers anys. En alguns projectes ja no hi participen professionals, sinó usuaris i aficionats, i en molts casos sense compensació econòmica. La qualitat, però, no té perquè estar compromesa, ni tampoc la professió del traductor/localitzador en un futur pròxim.El crowdsourcing ha cambiado el mundo de la traducción en los últimos años. En algunos proyectos ya no participan profesionales, sino usuarios y aficionados, y en muchos casos sin compensación económica. La calidad, sin embargo, no tiene por qué estar comprometida, ni tampoco la profesión del traductor / localizador en un futuro próximo.Crowdsourcing has changed the translation world in recent years. Professional translators are not involved in some projects anymore, as some users and amateurs participate without getting economic compensation. However, neither the overall quality nor the profession of the translator/localizer in the near future are necessarily in danger

Revistes Catalanes amb Accés Obert

Crowdsourcing i traducció/localització : una amenaça o una oportunitat?

Author: Capdevila Fernández Cristian
Publication venue: 'Universitat Autonoma de Barcelona'
Publication date: 01/01/2012
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Diposit Digital de Documents de la UAB

Improving Machine Translation of Educational Content via Crowdsourcing

Author: Behnke Maximiliana
Castilho Sheila
Egg Markus
Gaspari Federico
Georgakopoulou Panayota
Kermanidis Katia Lida
Kordoni Valia
Miceli Barone Antonio Valerio
Naskos Thanasis
Sennrich Rico
Sosoni Vilelmini
Stasimioti Maria
Takoulidou Eirini
van Zaanan Menno
Publication venue
Publication date: 01/01/2018
Field of study

The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collections is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data having in mind that the translations can be of a lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English to eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results from our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over general-domain baseline systems, and systems fine-tuned with pre-existing in-domain corpora

Archivio della ricerca - Università degli studi di Napoli Federico II

Irish Universities

Edinburgh Research Explorer

DCU Online Research Access Service

Tilburg University Repository

An introduction to crowdsourcing for language and multimedia technology research

Author: A. Doan
C. Callison-Burch
C. Rashtchian
G. Paolacci
G. Pickard
J. Ross
L. Ahn von
L. Ahn von
M. Larson
O. Alonso
R. Snow
S. Novotney
T. Yan
V.C. Rayker
V.S. Sheng
W. Mason
W. Willett
W.S. Lasecki
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

Language and multimedia technology research often relies on large manually constructed datasets for training or evaluation of algorithms and systems. Constructing these datasets is often expensive with significant challenges in terms of recruitment of personnel to carry out the work. Crowdsourcing methods using scalable pools of workers available on-demand offers a flexible means of rapid low-cost construction of many of these datasets to support existing research requirements and potentially promote new research initiatives that would otherwise not be possible

Crossref

Irish Universities

DCU Online Research Access Service

Active Learning for NLP with Large Language Models

Author: Wang Xuesong
Publication venue
Publication date: 14/01/2024
Field of study

Human annotation of training samples is expensive, laborious, and sometimes challenging, especially for Natural Language Processing (NLP) tasks. To reduce the labeling cost and enhance the sample efficiency, Active Learning (AL) technique can be used to label as few samples as possible to reach a reasonable or similar results. To reduce even more costs and with the significant advances of Large Language Models (LLMs), LLMs can be a good candidate to annotate samples. This work investigates the accuracy and cost of using LLMs (GPT-3.5 and GPT-4) to label samples on 3 different datasets. A consistency-based strategy is proposed to select samples that are potentially incorrectly labeled so that human annotations can be used for those samples in AL settings, and we call it mixed annotation strategy. Then we test performance of AL under two different settings: (1) using human annotations only; (2) using the proposed mixed annotation strategy. The accuracy of AL models under 3 AL query strategies are reported on 3 text classification datasets, i.e., AG's News, TREC-6, and Rotten Tomatoes. On AG's News and Rotten Tomatoes, the models trained with the mixed annotation strategy achieves similar or better results compared to that with human annotations. The method reveals great potentials of LLMs as annotators in terms of accuracy and cost efficiency in active learning settings.Comment: init repor

arXiv.org e-Print Archive

An Improved Crowdsourcing Based Evaluation Technique for Word Embedding Methods

Author: Grzes Marek
Liza Farhana Ferdousi
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 31/08/2016
Field of study

In this proposal track paper, we have presented a crowdsourcing-based word embedding evaluation technique that will be more reliable and linguistically justified. The method is designed for intrinsic evaluation and extends the approach proposed in (Schnabel et al., 2015). Our improved evaluation technique captures word relatedness based on the word context

Kent Academic Repository