
    Competence-based Curriculum Learning for Neural Machine Translation

    Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable because it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can easily be applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time while obtaining accuracy improvements of up to 2.2 BLEU.
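    The core mechanism described above, admitting a training sample only when its difficulty falls below the model's current competence, can be sketched as follows. The square-root competence schedule and the CDF-based difficulty scores are assumptions drawn from common formulations of this framework, not details stated in the abstract:

```python
import math
import random

def competence(t, T, c0=0.01):
    """Square-root competence schedule: the fraction of the
    (difficulty-sorted) training data the model may see at step t,
    warming up over T steps from an initial competence c0."""
    return min(1.0, math.sqrt(t * (1.0 - c0 ** 2) / T + c0 ** 2))

def sample_batch(examples, difficulties, t, T, batch_size=4):
    """Sample uniformly from the examples whose difficulty (expressed as
    a CDF value in [0, 1]) is at most the current competence."""
    c = competence(t, T)
    eligible = [ex for ex, d in zip(examples, difficulties) if d <= c]
    return random.sample(eligible, min(batch_size, len(eligible)))

# Toy corpus: difficulty given as a CDF value (e.g. rank by sentence length).
corpus = [f"sent_{i}" for i in range(100)]
difficulty = [(i + 1) / 100 for i in range(100)]

early = sample_batch(corpus, difficulty, t=1, T=1000)    # only the easiest sentences
late = sample_batch(corpus, difficulty, t=1000, T=1000)  # full corpus eligible
```

    Because the filter lives entirely in the sampling step, it can wrap an existing input pipeline without touching the model itself, which is the property the abstract highlights.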

    Crowdsourcing and translation/localization : threat or opportunity?

    Crowdsourcing has changed the translation world in recent years. Some projects no longer involve professionals but rather users and amateurs, in many cases without economic compensation. However, neither overall quality nor the profession of the translator/localizer in the near future is necessarily in danger.

    Crowdsourcing i traducció/localització : una amenaça o una oportunitat?

    Crowdsourcing has changed the translation world in recent years. Some projects no longer involve professionals but rather users and amateurs, in many cases without economic compensation. However, neither overall quality nor the profession of the translator/localizer in the near future is necessarily in danger.

    Improving Machine Translation of Educational Content via Crowdsourcing

    The limited availability of in-domain training data is a major issue in the training of application-specific neural machine translation models. Professional outsourcing of bilingual data collection is costly and often not feasible. In this paper we analyze the influence of using crowdsourcing as a scalable way to obtain translations of target in-domain data, bearing in mind that the translations may be of lower quality. We apply crowdsourcing with carefully designed quality controls to create parallel corpora for the educational domain by collecting translations of texts from MOOCs from English into eleven languages, which we then use to fine-tune neural machine translation models previously trained on general-domain data. The results of our research indicate that crowdsourced data collected with proper quality controls consistently yields performance gains over both general-domain baseline systems and systems fine-tuned with pre-existing in-domain corpora.
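    The "carefully designed quality controls" are not specified in the abstract; one common pattern is to embed control items with trusted reference translations into each worker's batch and accept the batch only if the worker's answers stay close to the references. The sketch below is a hypothetical illustration of that pattern, using a crude token-overlap F1 as the closeness measure (a real pipeline would use a proper MT metric):

```python
def token_f1(hypothesis, reference):
    """Crude token-overlap F1 between a worker translation and a
    trusted reference translation of the same control sentence."""
    hyp, ref = set(hypothesis.lower().split()), set(reference.lower().split())
    overlap = len(hyp & ref)
    if not hyp or not ref or overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)

def passes_quality_control(worker_answers, gold_answers, threshold=0.5):
    """Accept a worker's batch only if their translations of the embedded
    control items score, on average, above the threshold."""
    scores = [token_f1(h, g) for h, g in zip(worker_answers, gold_answers)]
    return sum(scores) / len(scores) >= threshold
```

    Batches that fail the check would be discarded or re-collected before the surviving translations are used for fine-tuning.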

    An introduction to crowdsourcing for language and multimedia technology research

    Language and multimedia technology research often relies on large manually constructed datasets for the training or evaluation of algorithms and systems. Constructing these datasets is often expensive, with significant challenges in recruiting personnel to carry out the work. Crowdsourcing methods, using scalable pools of workers available on demand, offer a flexible means of rapid, low-cost construction of many of these datasets to support existing research requirements and potentially enable new research initiatives that would otherwise not be possible.

    Active Learning for NLP with Large Language Models

    Human annotation of training samples is expensive, laborious, and sometimes challenging, especially for Natural Language Processing (NLP) tasks. To reduce labeling cost and improve sample efficiency, Active Learning (AL) techniques can be used to label as few samples as possible while still reaching reasonable or comparable results. To cut costs even further, and given the significant advances in Large Language Models (LLMs), LLMs can be good candidates for annotating samples. This work investigates the accuracy and cost of using LLMs (GPT-3.5 and GPT-4) to label samples on three different datasets. A consistency-based strategy is proposed to select samples that are potentially labeled incorrectly, so that human annotations can be used for those samples in AL settings; we call this the mixed annotation strategy. We then test the performance of AL under two settings: (1) using human annotations only; (2) using the proposed mixed annotation strategy. The accuracy of AL models under three AL query strategies is reported on three text classification datasets, i.e., AG's News, TREC-6, and Rotten Tomatoes. On AG's News and Rotten Tomatoes, models trained with the mixed annotation strategy achieve similar or better results compared to those trained with human annotations only. The method reveals the great potential of LLMs as annotators, in terms of both accuracy and cost efficiency, in active learning settings.
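    The consistency-based routing described above can be sketched as follows. The abstract does not give the exact rule, so the repeated-query majority-vote criterion and the `llm_label_fn`/`human_label_fn` callbacks are assumptions for illustration:

```python
from collections import Counter

def mixed_annotation(sample, llm_label_fn, human_label_fn, k=5, min_agreement=0.8):
    """Query the LLM k times; if the majority label's agreement rate falls
    below min_agreement, treat the sample as potentially mislabeled and
    fall back to a human annotator. Returns (label, annotator)."""
    labels = [llm_label_fn(sample) for _ in range(k)]
    label, count = Counter(labels).most_common(1)[0]
    if count / k >= min_agreement:
        return label, "llm"
    return human_label_fn(sample), "human"
```

    Under this scheme the human is only paid for the low-consistency samples, which is where the cost savings over pure human annotation come from.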

    An Improved Crowdsourcing Based Evaluation Technique for Word Embedding Methods

    In this proposal-track paper, we present a crowdsourcing-based word embedding evaluation technique that is more reliable and linguistically justified. The method is designed for intrinsic evaluation and extends the approach proposed in (Schnabel et al., 2015). Our improved evaluation technique captures word relatedness based on word context.
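    A typical intrinsic evaluation of this kind scores word pairs with the embedding model and correlates those scores with crowdsourced relatedness judgments. The sketch below illustrates the general recipe; the toy embeddings and the human scores are hypothetical, and a real study would use an established rank-correlation implementation with tie handling:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling; enough for a sketch)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy embeddings and hypothetical crowdsourced relatedness judgments.
emb = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
human = [0.9, 0.2, 0.25]

model = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model, human)  # higher rho = embeddings better match human judgments
```

    The contribution of the paper lies in how the human judgments are collected (context-aware crowdsourcing), while the correlation step above stays the same.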