Competence-based Curriculum Learning for Neural Machine Translation
Current state-of-the-art NMT systems use large neural networks that are not
only slow to train, but also often require many heuristics and optimization
tricks, such as specialized learning rate schedules and large batch sizes. This
is undesirable as it requires extensive hyperparameter tuning. In this paper,
we propose a curriculum learning framework for NMT that reduces training time,
reduces the need for specialized heuristics or large batch sizes, and results
in overall better performance. Our framework consists of a principled way of
deciding which training samples are shown to the model at different times
during training, based on the estimated difficulty of a sample and the current
competence of the model. Filtering training samples in this manner prevents the
model from getting stuck in bad local optima, making it converge faster and
reach a better solution than the common approach of uniformly sampling training
examples. Furthermore, the proposed method can be easily applied to existing
NMT models by simply modifying their input data pipelines. We show that our
framework can help improve the training time and the performance of both
recurrent neural network models and Transformers, achieving up to a 70%
decrease in training time, while at the same time obtaining accuracy
improvements of up to 2.2 BLEU.
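The core idea of the framework above can be sketched in a few lines: score each sample's difficulty, map it to a percentile, and at each step sample only from the fraction of the data the model is currently "competent" for. This is a minimal illustration assuming sentence length as the difficulty proxy and the square-root competence schedule; the function names and the simplified setup are ours, not the paper's code.

```python
import random

def competence(t, T, c0=0.1):
    """Square-root competence schedule:
    c(t) = min(1, sqrt(t * (1 - c0^2) / T + c0^2)),
    where c0 is the initial competence and T the warm-up duration."""
    return min(1.0, (t * (1 - c0 ** 2) / T + c0 ** 2) ** 0.5)

def difficulty_cdf(corpus, score):
    """Map each sample to its difficulty percentile in (0, 1]
    (here `score` could be sentence length, word rarity, etc.)."""
    ranked = sorted(corpus, key=score)
    return {s: (i + 1) / len(ranked) for i, s in enumerate(ranked)}

def sample_batch(corpus, cdf, t, T, batch_size, rng=random):
    """Sample uniformly, but only from samples whose difficulty
    percentile does not exceed the model's current competence."""
    c = competence(t, T)
    eligible = [s for s in corpus if cdf[s] <= c]
    return rng.sample(eligible, min(batch_size, len(eligible)))
```

Early in training only the easiest samples are eligible; once `t` reaches `T`, competence is 1 and training reduces to ordinary uniform sampling, which is why the method drops into an existing input pipeline so easily.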
Crowdsourcing and translation/localization: threat or opportunity?
Crowdsourcing has changed the world of translation in recent years. In some projects professionals no longer take part; instead, users and amateurs contribute, in many cases without economic compensation. Quality, however, is not necessarily compromised, and neither is the profession of the translator/localizer in the near future.
Improving Machine Translation of Educational Content via Crowdsourcing
The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation
models. Professional outsourcing of bilingual data collection is costly and often not feasible. In this paper we analyze the use of
crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations may be of
lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain
by collecting translations of MOOC texts from English into eleven languages, which we then use to fine-tune neural machine
translation models previously trained on general-domain data. Our results indicate that crowdsourced data collected
with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with
pre-existing in-domain corpora.
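The "carefully designed quality controls" mentioned above are not detailed in the abstract, but one common building block of crowd quality assurance is an automatic sanity filter applied before any translation enters the fine-tuning corpus. The sketch below is hypothetical (the function names and the length-ratio heuristic are ours, not the paper's): it rejects empty translations and those whose length is wildly out of proportion to the source.

```python
def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    # Reject translations whose word count is far out of proportion
    # to the source segment -- a simple crowd-QA heuristic.
    ratio = max(len(tgt.split()), 1) / max(len(src.split()), 1)
    return low <= ratio <= high

def filter_crowd_translations(pairs):
    # Keep only (source, translation) pairs that are non-empty and pass
    # the length check, yielding a cleaner parallel corpus for fine-tuning.
    return [(s, t) for s, t in pairs if t.strip() and length_ratio_ok(s, t)]
```

Real pipelines typically combine such cheap automatic checks with redundancy (multiple workers per segment) and spot-checking by trusted annotators.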
An introduction to crowdsourcing for language and multimedia technology research
Language and multimedia technology research often relies on
large manually constructed datasets for the training or evaluation of algorithms and systems. Constructing these datasets is often expensive, with significant challenges in recruiting personnel to carry out the work. Crowdsourcing methods, using scalable pools of workers available on demand, offer a flexible means of rapid, low-cost construction of many of these datasets, supporting existing research requirements and potentially enabling new research initiatives that would otherwise not be possible.
Active Learning for NLP with Large Language Models
Human annotation of training samples is expensive, laborious, and sometimes
challenging, especially for Natural Language Processing (NLP) tasks. To reduce
the labeling cost and enhance sample efficiency, Active Learning (AL) techniques can be used to label as
few samples as possible while still reaching reasonable or comparable results. To reduce costs even further, and given the significant
advances in Large Language Models (LLMs), LLMs can be good candidates for annotating
samples. This work investigates the accuracy and cost of using LLMs (GPT-3.5
and GPT-4) to label samples on three different datasets. A consistency-based
strategy is proposed to select samples that are potentially labeled incorrectly,
so that human annotations can be used for those samples in AL settings; we
call this the mixed annotation strategy. We then test the performance of AL under two
settings: (1) using human annotations only; (2) using the proposed
mixed annotation strategy. The accuracy of AL models under three AL query
strategies is reported on three text classification datasets, i.e., AG's News,
TREC-6, and Rotten Tomatoes. On AG's News and Rotten Tomatoes, models
trained with the mixed annotation strategy achieve similar or better results
compared to those trained with human annotations alone. The method reveals the great potential of
LLMs as annotators, in terms of accuracy and cost efficiency, in active learning
settings.
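The consistency-based routing described above can be sketched compactly: query the LLM several times per sample, keep the majority label when the answers agree, and fall back to a human otherwise. This is an illustrative sketch under our own assumptions (`llm_fn` and `human_fn` are hypothetical callables standing in for API calls and a human annotation queue; the paper's exact thresholds may differ).

```python
from collections import Counter

def consistency_label(llm_labels):
    """Return the majority label and the agreement rate over
    repeated LLM annotations of the same sample."""
    counts = Counter(llm_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(llm_labels)

def mixed_annotate(samples, llm_fn, human_fn, k=3, threshold=1.0):
    """Mixed annotation strategy (sketch): query the LLM k times per
    sample; if agreement falls below `threshold`, the sample is deemed
    potentially mislabeled and routed to a human annotator."""
    labels = {}
    for s in samples:
        label, agreement = consistency_label([llm_fn(s) for _ in range(k)])
        labels[s] = label if agreement >= threshold else human_fn(s)
    return labels
```

With `threshold=1.0` only samples the LLM labels identically all k times avoid human review; lowering the threshold trades annotation cost against label quality.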
An Improved Crowdsourcing Based Evaluation Technique for Word Embedding Methods
In this proposal track paper, we present a crowdsourcing-based word embedding evaluation technique that is more reliable and linguistically justified. The method is designed for intrinsic evaluation and extends the approach proposed in (Schnabel et al., 2015). Our improved evaluation technique captures word relatedness based on word context.