Search CORE

11 research outputs found

Results of the WMT17 Neural MT Training Task

Author: Bojar Ondřej
Helcl Jindřich
Kocmi Tom
Libovický Jindřich
Musil Tomáš
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2017

This paper presents the results of the WMT17 Neural MT Training Task. The objective of this task is to explore methods of training a fixed neural architecture, aiming primarily at the best translation quality and, as a secondary goal, at shorter training time. Task participants were provided with a complete neural machine translation system, fixed training data, and the configuration of the network. Translation was performed in the English-to-Czech direction, and the task was divided into two subtasks with different configurations: one scaled to fit on a 4GB GPU card and another on an 8GB GPU card. We received 3 submissions for the 4GB variant and 1 submission for the 8GB variant; we also provided our own run for each of the sizes and two baselines. We translated the test set with the trained models and evaluated the outputs using several automatic metrics. We also report results of the human evaluation of the submitted systems.

Crossref

Edinburgh Research Explorer

Biblio at Institute of Formal and Applied Linguistics

Competence-based Curriculum Learning for Neural Machine Translation

Author: Platanios Emmanouil Antonios
Stretcu Otilia
Neubig Graham
Poczos Barnabas
Mitchell Tom M.
Publication date: 2019

Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU.
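
The sampling idea described in this abstract (show the model only those examples whose estimated difficulty does not exceed its current competence) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: using sentence length as the difficulty score, the linear competence schedule, the `initial` value of 0.1, and all function names are assumptions made for this example.

```python
import random


def difficulty_ranks(scores):
    """Map raw difficulty scores to empirical CDF values in (0, 1].

    The easiest example gets 1/n and the hardest gets 1.0.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = (position + 1) / len(scores)
    return ranks


def competence(step, total_steps, initial=0.1):
    """Assumed linear schedule: competence grows from `initial` to 1.0."""
    return min(1.0, initial + (1.0 - initial) * step / total_steps)


def sample_batch(examples, ranks, step, total_steps, batch_size, rng):
    """Sample uniformly from examples no harder than the current competence."""
    # Always keep at least the easiest example eligible.
    threshold = max(competence(step, total_steps), min(ranks))
    eligible = [ex for ex, rank in zip(examples, ranks) if rank <= threshold]
    return [rng.choice(eligible) for _ in range(batch_size)]


if __name__ == "__main__":
    sentences = ["a", "a b", "a b c", "a b c d"]
    # Assumption: sentence length in tokens stands in for difficulty.
    ranks = difficulty_ranks([len(s.split()) for s in sentences])
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, sample_batch(sentences, ranks, step, 100, 4, rng))
```

Because the filter lives entirely in the batch sampler, a sketch like this only touches the input data pipeline, which matches the abstract's claim that the method can be applied without changing the model.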

arXiv.org e-Print Archive

Trinity College

DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing

Author: He Yuxiong
Holmes Connor
Li Cheng
Li Conglong
Wu Xiaoxia
Yao Zhewei
Zhang Minjia
Publication date: 14/01/2024

Recent advances in deep learning models come at the price of formidable training cost. The increasing model size is one of the root causes, but another less-emphasized fact is that data scale is actually increasing at a similar speed as model scale, and the training cost is proportional to both of them. Compared to the rapidly evolving model architecture, how to efficiently use the training data (especially for the expensive foundation model pretraining) is both less explored and difficult to realize due to the lack of a convenient framework that focuses on data efficiency capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that makes better use of data, increases training efficiency, and improves model quality. Specifically, we propose and combine two data efficiency techniques: efficient data sampling via a general curriculum learning library, and efficient data routing via a novel random layerwise token dropping technique. For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure), while still maintaining 95% of model quality compared to the baseline with full data and cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and tune, enabling us to easily apply it and verify its benefit on additional tasks including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
Comment: Published in AAAI 2024 Main Technical Track. Equal contribution by the first 3 authors. Code has been released as a part of https://github.com/microsoft/DeepSpeed. Part of this paper is from our previous arxiv report (arXiv:2211.11586).
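
As a quick sanity check on the cost figures quoted in this abstract, the two dollar amounts can be compared directly. The snippet below is only illustrative arithmetic on the numbers given in the abstract; the variable and function names are made up for this example.

```python
# Cost figures as quoted in the abstract (US dollars, Azure rental).
BASELINE_COST_USD = 46_300
REDUCED_COST_USD = 3_700


def cost_ratio(baseline, reduced):
    """How many times cheaper the reduced run is than the baseline."""
    return baseline / reduced


def fraction_saved(baseline, reduced):
    """Share of the baseline cost that is avoided."""
    return 1 - reduced / baseline


if __name__ == "__main__":
    ratio = cost_ratio(BASELINE_COST_USD, REDUCED_COST_USD)
    saved = fraction_saved(BASELINE_COST_USD, REDUCED_COST_USD)
    print(f"{ratio:.1f}x cheaper, {saved:.1%} of baseline cost saved")
```

The ratio rounds to the 12.5x stated in the abstract, so the two dollar amounts and the headline factor are mutually consistent.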

arXiv.org e-Print Archive

English-to-Czech MT: Large Data and Beyond

Author: Bojar Ondřej
Publication date: 06/12/2018

CU Digital Repository

Findings of the 2017 Conference on Machine Translation (WMT17)

Author: Barry Haddow
Christian Federmann
Christof Monz
Lucia Specia
Marco Turchi
Matt Post
Matteo Negri
Matthias Huck
Ondřej Bojar
Philipp Koehn
Qun Liu
Rajen Chatterjee
Raphael Rubino
Shujian Huang
Varvara Logacheva
Yvette Graham
Publication venue: The Association for Computational Linguistics

This paper presents the results of the WMT17 shared tasks, which included three machine translation (MT) tasks (news, biomedical, and multimodal), two evaluation tasks (metrics and run-time estimation of MT quality), an automatic post-editing task, a neural MT training task, and a bandit learning task.

Archivio della ricerca - Fondazione Bruno Kessler