Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era, which
reduces the hassle of feature engineering. Recently, some researchers have
attempted to treat it as character-level translation, which further simplifies
model design and building, but there is still a performance gap between the
translation-based approach and other methods. In this work, we apply the best
practices from low-resource neural machine translation to Chinese word
segmentation. We build encoder-decoder models with attention, and examine a
series of techniques including regularization, data augmentation, objective
weighting, transfer learning and ensembling. Our method is generic for word
segmentation, requiring neither feature engineering nor custom model implementation.
In the closed test with constrained data, our method ties with the state of the
art on the MSR dataset and is comparable to other methods on the PKU dataset.
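As a rough illustration of the character-level translation framing (a minimal sketch, not taken from the paper; the `</w>` boundary tag and the helper names `to_translation_pair` / `from_translation_output` are assumptions for illustration), the data can be prepared so that the source side is the raw character sequence and the target side marks word boundaries, letting a generic encoder-decoder toolkit be trained on the pairs:

```python
# Sketch: casting Chinese word segmentation as character-level translation.
# Assumption (not from the paper): the target side repeats each character and
# appends a </w> tag after the last character of every word, so an ordinary
# seq2seq model can be trained on (source, target) pairs.

def to_translation_pair(segmented_sentence: str):
    """Convert a gold-segmented sentence (words separated by spaces)
    into a (source, target) pair of space-separated character tokens."""
    words = segmented_sentence.split()
    source = " ".join(ch for word in words for ch in word)
    target = " ".join(" ".join(list(word)) + " </w>" for word in words)
    return source, target


def from_translation_output(target: str) -> str:
    """Recover a segmented sentence from the model's character output."""
    words, current = [], []
    for token in target.split():
        if token == "</w>":
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(token)
    if current:
        words.append("".join(current))
    return " ".join(words)


if __name__ == "__main__":
    src, tgt = to_translation_pair("我们 喜欢 机器 翻译")
    print(src)   # 我 们 喜 欢 机 器 翻 译
    print(tgt)   # 我 们 </w> 喜 欢 </w> 机 器 </w> 翻 译 </w>
    print(from_translation_output(tgt))  # 我们 喜欢 机器 翻译
```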
Sparse Communication for Distributed Gradient Descent
We make distributed stochastic gradient descent faster by exchanging sparse
updates instead of dense updates. Gradient updates are positively skewed as
most updates are near zero, so we map the 99% smallest updates (by absolute
value) to zero and then exchange sparse matrices. This method can be combined with
quantization to further improve the compression. We explore different
configurations and apply them to neural machine translation and MNIST image
classification tasks. Most configurations work on MNIST, whereas different
configurations reduce convergence rate on the more complex translation task.
Our experiments show that we can achieve up to 49% speed up on MNIST and 22% on
NMT without damaging the final accuracy or BLEU.
Comment: EMNLP 2017
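A minimal sketch of the sparsification step described above, assuming a NumPy/SciPy setup (this is an illustration, not the paper's reference implementation; the quantization it can be combined with is omitted, and the function name is hypothetical):

```python
# Sketch: drop the 99% smallest-magnitude entries of a gradient tensor before
# communication, so only ~1% of values need to be exchanged as a sparse matrix.
import numpy as np
from scipy.sparse import csr_matrix


def sparsify_gradient(grad: np.ndarray, drop_ratio: float = 0.99):
    """Zero out the `drop_ratio` fraction of entries with the smallest
    absolute value and return the result as a sparse matrix."""
    flat = np.abs(grad).ravel()
    k = int(len(flat) * drop_ratio)
    if k == 0:
        return csr_matrix(grad)
    # Threshold is the k-th smallest absolute value; keep only larger entries.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(grad) > threshold
    return csr_matrix(np.where(mask, grad, 0.0))


if __name__ == "__main__":
    g = np.random.randn(1000, 1000).astype(np.float32)
    sparse_g = sparsify_gradient(g)
    print(f"kept {sparse_g.nnz / g.size:.2%} of entries")  # roughly 1%
```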
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed
perspective because workers do not wait for synchronization. However, the
Transformer model converges poorly with asynchronous SGD, resulting in
substantially lower quality compared to synchronous SGD. To investigate why
this is the case, we isolate differences between asynchronous and synchronous
methods, examining batch size and staleness effects. We find that summing
several asynchronous updates, rather than applying them immediately, restores
convergence behavior. With this hybrid method, Transformer training for neural
machine translation reaches a near-convergence level 1.36x faster in
single-node multi-GPU training with no impact on model quality.
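A minimal sketch of the summing idea (hypothetical class and parameter names, not the paper's code): the server buffers a fixed number of asynchronously arriving gradients and applies their sum as a single update, so workers never block while the applied updates behave more like those of a larger synchronous batch:

```python
# Sketch: accumulate several asynchronous worker updates and apply their sum
# as one step, instead of applying each incoming gradient immediately.
import numpy as np


class AccumulatingServer:
    def __init__(self, params: np.ndarray, lr: float, accumulate: int = 4):
        self.params = params          # shared model parameters
        self.lr = lr
        self.accumulate = accumulate  # how many async updates to sum
        self._buffer = np.zeros_like(params)
        self._count = 0

    def push_gradient(self, grad: np.ndarray) -> None:
        """Called whenever any worker finishes a mini-batch."""
        self._buffer += grad
        self._count += 1
        if self._count == self.accumulate:
            # Apply the summed update once, then reset the buffer.
            self.params -= self.lr * self._buffer
            self._buffer[:] = 0.0
            self._count = 0


if __name__ == "__main__":
    server = AccumulatingServer(np.zeros(10), lr=0.1, accumulate=4)
    for _ in range(8):                    # e.g. two rounds of 4 worker updates
        server.push_gradient(np.ones(10))  # dummy gradients
    print(server.params)                  # two summed updates have been applied
```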
- …