Transformer-based NMT: modeling, training and implementation
International trade and industrial collaboration enable countries and regions to concentrate their development on specific industries while making the most of other countries' specializations, which significantly accelerates global development. However, globalization also increases the demand for cross-region communication. Language barriers between the world's many languages make deep collaboration between groups speaking different languages difficult, increasing the need for translation. Language technology, specifically Machine Translation (MT), holds the promise of enabling efficient, real-time communication between languages at minimal cost. Modern hardware can compute in parallel very quickly, giving machine translation users translations with very low latency, and the evolution from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) built on advanced deep learning algorithms has significantly boosted translation quality. Nevertheless, current machine translation systems are still far from translating all input accurately, so how to further improve state-of-the-art NMT remains a valuable open research question that has received wide attention.

In the research presented in this thesis, we first investigate the long-distance relation modeling ability of the state-of-the-art NMT model, the Transformer. We propose to learn source phrase representations and incorporate them into the Transformer translation model, aiming to enhance its ability to capture long-distance dependencies. Second, although previous work (Bapna et al., 2018) suggests that deep Transformers have difficulty converging, we empirically find that the convergence of deep Transformers depends on the interaction between the layer normalization and the residual connections employed to stabilize training. We conduct a theoretical study of how to ensure the convergence of Transformers, especially deep ones, and propose to ensure the convergence of deep Transformers by placing a Lipschitz constraint on their parameter initialization. Finally, we investigate how to dynamically determine proper and efficient batch sizes during Transformer training. We find that the gradient direction stabilizes as the batch size grows during gradient accumulation. We therefore propose to adjust batch sizes dynamically by monitoring the change in gradient direction within gradient accumulation, and to obtain a proper and efficient batch size by stopping the accumulation when the gradient direction starts to fluctuate.

For the research in this thesis, we also implement our own NMT toolkit, Neutron, an implementation of the Transformer and its variants. Besides the fundamental features that form the basis of our implementations of the approaches presented in this thesis, it supports many advanced features from recent cutting-edge research, and implementations of all our approaches are included and open-sourced in the toolkit. To compare with previous approaches, we mainly conducted our experiments on data from the WMT 14 English-to-German (En-De) and English-to-French (En-Fr) news translation tasks, except when studying the convergence of deep Transformers, where we replaced the WMT 14 En-Fr task with the WMT 15 Czech-to-English (Cs-En) news translation task to compare with Bapna et al. (2018). The sizes of these datasets range from medium (WMT 14 En-De, ~4.5M sentence pairs) to very large (WMT 14 En-Fr, ~36M sentence pairs), so we expect our approaches to improve translation quality for popular, widely used language pairs with sufficient data.

China Scholarship Council
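The dynamic batching idea above can be made concrete with a short sketch. The following PyTorch code follows one plausible reading of the abstract: micro-batch gradients are accumulated while the direction of the running gradient sum is monitored, and accumulation stops once one more micro-batch leaves that direction essentially unchanged. The cosine threshold, helper names, and stopping rule details are illustrative assumptions, not the thesis's actual Neutron implementation.

```python
import torch

def grad_vector(model):
    """Flatten all current gradients into a single vector."""
    return torch.cat([p.grad.detach().reshape(-1)
                      for p in model.parameters() if p.grad is not None])

def accumulate_until_stable(model, loss_fn, micro_batches,
                            cos_threshold=0.999, max_steps=16):
    """Accumulate micro-batch gradients until the accumulated gradient
    direction stops changing; returns the number of micro-batches used,
    i.e. the dynamically chosen batch size (hypothetical helper)."""
    model.zero_grad()
    prev_dir, used = None, 0
    for batch in micro_batches:
        loss_fn(model, batch).backward()   # gradients sum into .grad
        used += 1
        g = grad_vector(model)
        cur_dir = g / (g.norm() + 1e-12)
        # Stop once one more micro-batch barely rotates the accumulated
        # gradient direction (cosine similarity near 1).
        if prev_dir is not None and torch.dot(prev_dir, cur_dir) > cos_threshold:
            break
        prev_dir = cur_dir
        if used >= max_steps:
            break
    return used
```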
Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant
This article provides a comprehensive understanding of optimization in deep
learning, with a primary focus on the challenges of gradient vanishing and
gradient exploding, which normally lead to diminished model representational
ability and training instability, respectively. We analyze these two challenges
through several strategic measures, including improving gradient flow and
constraining a network's Lipschitz constant. To help
understand the current optimization methodologies, we categorize them into two
classes: explicit optimization and implicit optimization. Explicit optimization
methods involve direct manipulation of optimizer parameters, including weight,
gradient, learning rate, and weight decay. Implicit optimization methods, by
contrast, focus on improving the overall landscape of a network by enhancing
its modules, such as residual shortcuts, normalization methods, attention
mechanisms, and activations. In this article, we provide an in-depth analysis
of these two optimization classes and undertake a thorough examination of the
Jacobian matrices and the Lipschitz constants of many widely used deep learning
modules, highlighting existing issues as well as potential improvements.
Moreover, we also conduct a series of analytical experiments to substantiate
our theoretical discussions. This article does not aim to propose a new
optimizer or network. Rather, our intention is to present a comprehensive
understanding of optimization in deep learning. We hope that this article will
assist readers in gaining deeper insight into this field and encourage the
development of more robust, efficient, and high-performing models.

Comment: International Digital Economy Academy (IDEA)
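As a small illustration of the kind of Lipschitz analysis the article surveys (our example, not the article's code): under the L2 norm, the Lipschitz constant of a linear layer is its largest singular value, which power iteration estimates cheaply, and per-layer constants multiply into an often loose upper bound for a composed network.

```python
import torch

def spectral_norm_power_iter(W, n_iters=50):
    """Estimate the largest singular value of W, i.e. the Lipschitz
    constant of x -> W @ x under the L2 norm, via power iteration."""
    v = torch.randn(W.shape[1])
    v = v / v.norm()
    for _ in range(n_iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.t() @ u
        v = v / (v.norm() + 1e-12)
    return torch.dot(u, W @ v).item()

# Per-layer constants multiply into an upper bound for the composition;
# 1-Lipschitz activations such as ReLU contribute a factor of 1.
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
lip_bound = spectral_norm_power_iter(W1) * spectral_norm_power_iter(W2)
```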
LipsFormer: Introducing Lipschitz Continuity to Vision Transformers
We present a Lipschitz continuous Transformer, called LipsFormer, to pursue
training stability both theoretically and empirically for Transformer-based
models. In contrast to previous practical tricks that address training
instability by learning rate warmup, layer normalization, attention
formulation, and weight initialization, we show that Lipschitz continuity is a
more essential property to ensure training stability. In LipsFormer, we replace
unstable Transformer component modules with Lipschitz continuous counterparts:
CenterNorm instead of LayerNorm, spectral initialization instead of Xavier
initialization, scaled cosine similarity attention instead of dot-product
attention, and weighted residual shortcut. We prove that these introduced
modules are Lipschitz continuous and derive an upper bound on the Lipschitz
constant of LipsFormer. Our experiments show that LipsFormer allows stable
training of deep Transformer architectures without the need for careful
learning rate tuning such as warmup, yielding faster convergence and better
generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny,
based on the Swin Transformer and trained for 300 epochs, obtains 82.7% top-1
accuracy without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny,
based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5%
with 4.7G FLOPs and 24M parameters. The code will be released at
https://github.com/IDEA-Research/LipsFormer.

Comment: To appear in ICLR 2023; our code will be public at https://github.com/IDEA-Research/LipsFormer
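Of the replaced modules, CenterNorm is the simplest to illustrate. The sketch below follows our reading of the abstract: keep LayerNorm's centering and affine transform but drop the division by the standard deviation, whose Jacobian is unbounded as the variance approaches zero. The d/(d-1) rescaling for the degree of freedom removed by centering is our understanding of the paper and may differ from it in detail.

```python
import torch
import torch.nn as nn

class CenterNorm(nn.Module):
    """Sketch of a Lipschitz-continuous LayerNorm replacement:
    center the features, skip the variance division, and rescale by
    d/(d-1) to compensate for the centering projection (assumption
    based on our reading of the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        centered = x - x.mean(dim=-1, keepdim=True)
        return self.weight * (self.dim / (self.dim - 1)) * centered + self.bias
```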
Optimizing Deep Transformers for Chinese-Thai Low-Resource Translation
In this paper, we study the use of a deep Transformer translation model for
the CCMT 2022 Chinese-Thai low-resource machine translation task. We first
explore the experiment settings (including the number of BPE merge operations,
dropout probability, embedding size, etc.) for the low-resource scenario with
a 6-layer Transformer. Considering that increasing the number of layers also
increases the regularization on new model parameters (more dropout modules are
introduced along with the additional layers), we adopt the highest-performing
setting but increase the depth of the Transformer to 24 layers to obtain
improved translation quality. Our work achieves state-of-the-art performance
on Chinese-to-Thai translation in the constrained evaluation.
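For readers unfamiliar with such setups, the hypothetical settings dictionary below shows the kind of hyper-parameters being tuned. All values are placeholders except the depths 6 and 24, which come from the abstract; none of the other numbers are the paper's.

```python
# Hypothetical low-resource Transformer settings (placeholder values);
# only the layer depths 6 -> 24 are taken from the abstract.
config = {
    "bpe_merge_ops": 8000,    # smaller subword vocabularies often suit low-resource MT
    "dropout": 0.3,           # stronger regularization for small datasets
    "embedding_size": 256,
    "ffn_size": 1024,
    "attention_heads": 8,
    "num_layers": 24,         # deepened from the tuned 6-layer baseline
}
```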
What can a Single Attention Layer Learn? A Study Through the Random Features Lens
Attention layers -- which map a sequence of inputs to a sequence of outputs
-- are core building blocks of the Transformer architecture which has achieved
significant breakthroughs in modern artificial intelligence. This paper
presents a rigorous theoretical study on the learning and generalization of a
single multi-head attention layer, with a sequence of key vectors and a
separate query vector as input. We consider the random feature setting where
the attention layer has a large number of heads, with randomly sampled frozen
query and key matrices, and trainable value matrices. We show that such a
random-feature attention layer can express a broad class of target functions
that are permutation invariant to the key vectors. We further provide
quantitative excess risk bounds for learning these target functions from finite
samples, using random feature attention with finitely many heads.
Our results feature several implications unique to the attention structure
compared with existing random features theory for neural networks: (1)
advantages in sample complexity over standard two-layer random-feature
networks; (2) concrete and natural classes of functions that can be learned
efficiently by a random-feature attention layer; and (3) the effect of the
sampling distribution of the query-key weight matrix (the product of the query
and key matrices), where Gaussian random weights with a non-zero mean yield
better sample complexities than the zero-mean counterpart for learning certain
natural target functions. Experiments on simulated data corroborate our
theoretical findings and further illustrate the interplay between the sample
size and the complexity of the target function.

Comment: 41 pages, 5 figures
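As a concrete picture of the studied model, here is a sketch of random-feature attention as we read the setup, with simplified shapes: many heads with frozen, randomly sampled query-key matrices and trainable value parameters, applied to a sequence of keys and a separate query. The parameterization details are our paraphrase, not the paper's exact construction.

```python
import torch

def random_feature_attention(X, q, W_qk, W_v):
    """Random-feature attention sketch. X: (N, d) key vectors,
    q: (d,) query, W_qk: (M, d, d) frozen random query-key matrices,
    W_v: (M, d) trainable value vectors (the only learned part).
    Returns a scalar, permutation invariant to the rows of X."""
    M = W_qk.shape[0]
    out = torch.tensor(0.0)
    for m in range(M):
        # Softmax attention of the query against the keys under a
        # frozen random bilinear form.
        scores = torch.softmax(X @ (W_qk[m] @ q), dim=0)  # (N,)
        out = out + scores @ (X @ W_v[m])                 # head m's contribution
    return out / M
```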
DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method
This paper proposes a new easy-to-implement parameter-free gradient-based
optimizer: DoWG (Distance over Weighted Gradients). We prove that DoWG is
efficient -- matching the convergence rate of optimally tuned gradient descent
in convex optimization up to a logarithmic factor without tuning any
parameters, and universal -- automatically adapting to both smooth and
nonsmooth problems. While popular algorithms following the AdaGrad framework
compute a running average of the squared gradients to use for normalization,
DoWG maintains a new distance-based weighted version of the running average,
which is crucial to achieve the desired properties. To complement our theory,
we also show empirically that DoWG trains at the edge of stability, and
validate its effectiveness on practical machine learning tasks.

Comment: 22 pages, 1 table, 4 figures
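The update rule is compact enough to sketch. The code below follows our reading of the paper's description: the step size is the squared running distance estimate divided by the square root of a distance-weighted sum of squared gradient norms. The epsilon initialization and constants are illustrative.

```python
import torch

def dowg(x0, grad_fn, steps=100, r_eps=1e-4):
    """Sketch of the DoWG update as we read it from the paper.
    grad_fn(x) returns the (sub)gradient at x."""
    x = x0.clone()
    r_bar = torch.tensor(r_eps)   # running distance-from-start estimate
    v = torch.tensor(0.0)         # distance-weighted squared-gradient sum
    for _ in range(steps):
        g = grad_fn(x)
        r_bar = torch.maximum(r_bar, (x - x0).norm())
        v = v + r_bar**2 * g.norm()**2
        x = x - (r_bar**2 / (v.sqrt() + 1e-12)) * g
    return x
```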
Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees
We consider transfer learning approaches that fine-tune a pretrained deep
neural network on a target task. We study generalization properties of
fine-tuning to understand the problem of overfitting, which commonly occurs in
practice. Previous works have shown that constraining the distance from the
initialization of fine-tuning improves generalization. Using a PAC-Bayesian
analysis, we observe that, besides the distance from initialization, Hessians
affect generalization through the stability of deep neural networks against
noise injections. Motivated by this observation, we develop Hessian
distance-based generalization bounds for a wide range of fine-tuning methods.
Additionally, we study the robustness of fine-tuning in the presence of noisy
labels. Motivated by our theory, we design an algorithm that incorporates
consistent losses and distance-based regularization for fine-tuning, along with
a generalization error guarantee under class conditional independent noise in
the training set labels. We perform a detailed empirical study of our algorithm
on various noisy environments and architectures. On six image classification
tasks whose training labels are generated with programmatic labeling, we find a
3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian
distance measure of the fine-tuned model decreases sixfold compared with
existing approaches.

Comment: 36 pages, 5 figures, 8 tables; ICML 2022
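The distance-to-initialization idea the paper builds on can be sketched in a few lines. This is an illustrative regularizer only, not the paper's full algorithm, which additionally uses Hessian-based measures and consistent losses for noisy labels; `lam` is an arbitrary placeholder coefficient.

```python
import torch

def distance_regularized_loss(model, init_params, task_loss, lam=0.01):
    """Penalize how far fine-tuned weights drift from the pretrained
    initialization. init_params: detached copies of the pretrained
    weights, e.g. [p.detach().clone() for p in model.parameters()]
    taken before fine-tuning starts."""
    dist = sum(((p - p0) ** 2).sum()
               for p, p0 in zip(model.parameters(), init_params))
    return task_loss + lam * dist
```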
On Separate Normalization in Self-supervised Transformers
Self-supervised training methods for transformers have demonstrated
remarkable performance across various domains. Previous transformer-based
models, such as masked autoencoders (MAE), typically utilize a single
normalization layer for both the [CLS] symbol and the tokens. We propose in
this paper a simple modification that employs separate normalization layers for
the tokens and the [CLS] symbol to better capture their distinct
characteristics and enhance downstream task performance. Our method aims to
alleviate the potential negative effects of using the same normalization
statistics for both token types, which may not be optimally aligned with their
individual roles. We empirically show that by utilizing a separate
normalization layer, the [CLS] embeddings can better encode global contextual
information and are distributed more uniformly in their otherwise anisotropic
embedding space. When replacing the conventional normalization layer with the
two separate layers, we observe an average 2.7% performance improvement across
the image, natural language, and graph domains.

Comment: NIPS 2023
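The proposed modification is small enough to sketch. Assuming the [CLS] embedding sits at position 0 of the sequence, a module in the spirit of the paper might look like the following; the class name and layout are ours, not the authors'.

```python
import torch
import torch.nn as nn

class SeparateNorm(nn.Module):
    """Normalize the [CLS] token and the remaining tokens with two
    independent LayerNorms so they no longer share statistics."""
    def __init__(self, dim):
        super().__init__()
        self.cls_norm = nn.LayerNorm(dim)
        self.token_norm = nn.LayerNorm(dim)

    def forward(self, x):              # x: (batch, seq_len, dim)
        cls = self.cls_norm(x[:, :1])  # [CLS] at position 0
        tokens = self.token_norm(x[:, 1:])
        return torch.cat([cls, tokens], dim=1)
```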