Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era, which
reduces the hassle of feature engineering. Recently, some researchers have
attempted to treat it as character-level translation, which further simplifies
model design and building, but there is still a performance gap between the
translation-based approach and other methods. In this work, we apply the best
practices from low-resource neural machine translation to Chinese word
segmentation. We build encoder-decoder models with attention, and examine a
series of techniques including regularization, data augmentation, objective
weighting, transfer learning and ensembling. Our method is generic for word
segmentation, requiring neither feature engineering nor custom model implementation.
In the closed test with constrained data, our method ties with the state of the
art on the MSR dataset and is comparable to other methods on the PKU dataset.
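As a rough illustration of the character-level translation framing (a minimal sketch, not taken from the paper; the `</w>` boundary tag and the helper names `to_translation_pair` / `from_translation_output` are assumptions for illustration), the data can be prepared so that the source side is the raw character sequence and the target side marks word boundaries, letting a generic encoder-decoder toolkit be trained on the pairs:

```python
# Sketch: casting Chinese word segmentation as character-level translation.
# Assumption (not from the paper): the target side repeats each character and
# appends a </w> tag after the last character of every word, so an ordinary
# seq2seq model can be trained on (source, target) pairs.

def to_translation_pair(segmented_sentence: str):
    """Convert a gold-segmented sentence (words separated by spaces)
    into a (source, target) pair of space-separated character tokens."""
    words = segmented_sentence.split()
    source = " ".join(ch for word in words for ch in word)
    target = " ".join(" ".join(list(word)) + " </w>" for word in words)
    return source, target


def from_translation_output(target: str) -> str:
    """Recover a segmented sentence from the model's character output."""
    words, current = [], []
    for token in target.split():
        if token == "</w>":
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return " ".join(words)


if __name__ == "__main__":
    src, tgt = to_translation_pair("我们 喜欢 机器 翻译")
    print(src)   # 我 们 喜 欢 机 器 翻 译
    print(tgt)   # 我 们 </w> 喜 欢 </w> 机 器 </w> 翻 译 </w>
    print(from_translation_output(tgt))  # 我们 喜欢 机器 翻译
```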
Sparse Communication for Distributed Gradient Descent
We make distributed stochastic gradient descent faster by exchanging sparse
updates instead of dense updates. Gradient updates are positively skewed as
most updates are near zero, so we map the 99% smallest updates (by absolute
value) to zero and then exchange sparse matrices. This method can be combined with
quantization to further improve the compression. We explore different
configurations and apply them to neural machine translation and MNIST image
classification tasks. Most configurations work on MNIST, whereas different
configurations reduce convergence rate on the more complex translation task.
Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on
NMT without damaging the final accuracy or BLEU.
Comment: EMNLP 2017
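A minimal sketch of the sparsification step described above, assuming a NumPy/SciPy setup (this is an illustration, not the paper's reference implementation; the quantization it can be combined with is omitted, and the function name is hypothetical):

```python
# Sketch: drop the 99% smallest-magnitude entries of a gradient tensor before
# communication, so only ~1% of values need to be exchanged as a sparse matrix.
import numpy as np
from scipy.sparse import csr_matrix


def sparsify_gradient(grad: np.ndarray, drop_ratio: float = 0.99):
    """Zero out the `drop_ratio` fraction of entries with the smallest
    absolute value and return the result as a sparse matrix."""
    flat = np.abs(grad).ravel()
    k = int(len(flat) * drop_ratio)
    if k == 0:
        return csr_matrix(grad)
    # Threshold is the k-th smallest absolute value; keep only larger entries.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(grad) > threshold
    return csr_matrix(np.where(mask, grad, 0.0))


if __name__ == "__main__":
    g = np.random.randn(1000, 1000).astype(np.float32)
    sparse_g = sparsify_gradient(g)
    print(f"kept {sparse_g.nnz / g.size:.2%} of entries")  # roughly 1%
```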
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed
perspective because workers do not wait for synchronization. However, the
Transformer model converges poorly with asynchronous SGD, resulting in
substantially lower quality compared to synchronous SGD. To investigate why
this is the case, we isolate differences between asynchronous and synchronous
methods, examining batch size and staleness effects. We find that summing
several asynchronous updates, rather than applying them immediately, restores
convergence behavior. With this hybrid method, Transformer training for neural
machine translation reaches a near-convergence level 1.36x faster in
single-node multi-GPU training with no impact on model quality.
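A minimal sketch of the summing idea (hypothetical class and parameter names, not the paper's code): the server buffers a fixed number of asynchronously arriving gradients and applies their sum as a single update, so workers never block while the applied updates behave more like those of a larger synchronous batch:

```python
# Sketch: accumulate several asynchronous worker updates and apply their sum
# as one step, instead of applying each incoming gradient immediately.
import numpy as np


class AccumulatingServer:
    def __init__(self, params: np.ndarray, lr: float, accumulate: int = 4):
        self.params = params          # shared model parameters
        self.lr = lr
        self.accumulate = accumulate  # how many async updates to sum
        self._buffer = np.zeros_like(params)
        self._count = 0

    def push_gradient(self, grad: np.ndarray) -> None:
        """Called whenever any worker finishes a mini-batch."""
        self._buffer += grad
        self._count += 1
        if self._count == self.accumulate:
            # Apply the summed update once, then reset the buffer.
            self.params -= self.lr * self._buffer
            self._buffer[:] = 0.0
            self._count = 0


if __name__ == "__main__":
    server = AccumulatingServer(np.zeros(10), lr=0.1, accumulate=4)
    for _ in range(8):                    # e.g. two rounds of 4 worker updates
        server.push_gradient(np.ones(10))  # dummy gradients
    print(server.params)                  # two summed updates have been applied
```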
- …