61 research outputs found
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
Despite the success of diffusion models, the training and inference of
diffusion models are notoriously expensive due to the long chain of the reverse
process. In parallel, the Lottery Ticket Hypothesis (LTH) claims that there
exist winning tickets (i.e., a properly pruned sub-network together with
original weight initialization) that can achieve performance competitive to the
original dense neural network when trained in isolation. In this work, we for
the first time apply LTH to diffusion models. We empirically find subnetworks
at sparsity 90%-99% without compromising performance for denoising diffusion
probabilistic models on benchmarks (CIFAR-10, CIFAR-100, MNIST). Moreover,
existing LTH works identify the subnetworks with a unified sparsity along
different layers. We observe that the similarity between two winning tickets of
a model varies from block to block. Specifically, the upstream layers from two
winning tickets for a model tend to be more similar than the downstream layers.
Therefore, we propose to find the winning ticket with varying sparsity along
different layers in the model. Experimental results demonstrate that our method
can find sparser sub-models that require less memory for storage and reduce the
necessary number of FLOPs. Code is available at
https://github.com/osier0524/Lottery-Ticket-to-DDPM
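The idea of a winning ticket with layer-dependent sparsity can be sketched as follows. This is a minimal one-shot illustration in plain Python, not the authors' implementation: the function names, the layer names, the toy weights, and the per-layer sparsity values are all assumptions chosen for the example, and real LTH pipelines usually prune iteratively over several train-prune-rewind rounds.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def winning_ticket(init_weights, trained_weights, layer_sparsity):
    """Mask each layer by trained magnitude, then rewind survivors to their initial values."""
    ticket = {}
    for name, trained in trained_weights.items():
        mask = magnitude_mask(trained, layer_sparsity[name])
        ticket[name] = [w0 * m for w0, m in zip(init_weights[name], mask)]
    return ticket


# Toy example (hypothetical layers and values): each layer gets its own sparsity.
init = {"layer_a": [0.1, -0.2, 0.3, -0.4], "layer_b": [0.5, -0.6, 0.7, -0.8]}
trained = {"layer_a": [0.9, -0.05, 0.4, -0.01], "layer_b": [0.02, -1.2, 0.03, -0.04]}
sparsity = {"layer_a": 0.5, "layer_b": 0.75}

ticket = winning_ticket(init, trained, sparsity)
print(ticket)
```

The pruning decision uses the trained weights, while the surviving entries are reset to their original initialization, which is the "sub-network together with original weight initialization" described in the abstract.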
Doge Tickets: Uncovering Domain-general Language Models by Playing Lottery Tickets
Over-parameterized models, typically pretrained language models (LMs), have
shown an appealing expressive power due to their small learning bias. However,
the huge learning capacity of LMs can also lead to large learning variance. In
a pilot study, we find that, when faced with multiple domains, a critical
portion of parameters behave unexpectedly in a domain-specific manner while
others behave in a domain-general one. Motivated by this phenomenon, we for the
first time posit that domain-general parameters can underpin a domain-general
LM that can be derived from the original LM. To uncover the domain-general LM,
we propose to identify domain-general parameters by playing lottery tickets
(dubbed doge tickets). To intervene in the lottery, we propose a
domain-general score, which depicts how domain-invariant a parameter is by
associating it with the variance. Comprehensive experiments are conducted on
the Amazon, MNLI and OntoNotes datasets. The results show that the doge tickets
obtain improved out-of-domain generalization in comparison with a range of
competitive baselines. Analysis results further hint at the existence of
domain-general parameters and the performance consistency of doge tickets.
Comment: Accepted to NLPCC 2022. Code is available at
https://github.com/Ylily1015/DogeTicket
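One way to picture a variance-aware domain-general score is sketched below. The abstract does not give the formula, so everything here is an assumption made for illustration: the per-domain importance values, the weighting factor `lam`, and the rule "mean importance across domains minus `lam` times its variance" are hypothetical stand-ins rather than the paper's definition.

```python
from statistics import mean, pvariance


def domain_general_scores(importance_by_domain, lam=1.0):
    """Score each parameter by mean importance across domains, penalised by its variance.

    ASSUMPTION: this scoring rule is illustrative, not taken from the paper.
    """
    n_params = len(next(iter(importance_by_domain.values())))
    scores = []
    for j in range(n_params):
        per_domain = [imp[j] for imp in importance_by_domain.values()]
        scores.append(mean(per_domain) - lam * pvariance(per_domain))
    return scores


# Hypothetical importance of three parameters measured on two domains.
importance = {
    "domain_1": [0.9, 0.8, 0.1],
    "domain_2": [0.9, 0.0, 0.1],
}

scores = domain_general_scores(importance)
# Keep the two highest-scoring parameters as the (toy) ticket.
keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:2]
print(scores, keep)
```

In this toy setup the second parameter matters a lot in one domain and not at all in the other, so its score is pulled down relative to the first parameter, which is equally important in both.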
Two-Way Neural Machine Translation: A Proof of Concept for Bidirectional Translation Modeling using a Two-Dimensional Grid
Neural translation models have proven to be effective in capturing sufficient
information from a source sentence and generating a high-quality target
sentence. However, it is not easy to get the best effect for bidirectional
translation, i.e., both source-to-target and target-to-source translation using
a single model. If we exclude some pioneering attempts, such as multilingual
systems, all other bidirectional translation approaches are required to train
two individual models. This paper proposes to build a single end-to-end
bidirectional translation model using a two-dimensional grid, where the
left-to-right decoding generates source-to-target, and the bottom-up
decoding creates target-to-source output. Instead of training two models
independently, our approach encourages a single network to jointly learn to
translate in both directions. Experiments on the WMT 2018
German↔English and Turkish↔English translation
tasks show that the proposed model is capable of generating translations of good
quality and has sufficient potential to direct the research.
Comment: 6 pages, accepted at SLT202
Structural pruning for speed in neural machine translation
Neural machine translation (NMT) strongly outperforms previous statistical techniques. With
the emergence of a transformer architecture, we consistently train and deploy deeper and
larger models, often with billions of parameters, as an ongoing effort to achieve even better
quality. On the other hand, there is also a constant pursuit for optimisation opportunities to
reduce inference runtime.
Parameter pruning is one of the staple optimisation techniques. Even though coefficient-wise
sparsity is the most popular for compression purposes, it is not easy to make a model run
faster. Sparse matrix multiplication routines require custom approaches, usually depending on
low-level hardware implementations for the most efficiency. In my thesis, I focus on structural
pruning in the field of NMT, which results in smaller but still dense architectures that do not
need any further modifications to work efficiently.
My research focuses on two main directions. The first one explores the Lottery Ticket Hypothesis
(LTH), a well-known pruning algorithm, but this time in a structural setup with a custom pruning
criterion. It involves partial training and pruning steps performed in a loop. Experiments with
LTH produced substantial speed-up when applied to prune heads in the attention mechanism
of a transformer. While this method has proven successful, it carries the burden of prolonged
training cost that makes an already expensive training routine even more so.
From that point, I exclusively concentrate on research incorporating pruning into training via
regularisation. I experiment with a standard group lasso, which zeroes out parameters together
in a structural pre-defined way. By targeting feedforward and attention layers in a transformer,
group lasso significantly improves inference speed with already optimised state-of-the-art fast
models. Improving upon that work, I designed a novel approach called aided regularisation,
where every layer penalty is scaled based on statistics gathered as training progresses. Both
gradient- and parameter-based approaches aim to decrease the depth of a model, further
optimising speed while maintaining the translation quality of an unpruned baseline.
The goal of this dissertation is to advance the state of the art in efficient NMT with simple but
tangible structural sparsity methods. The majority of all experiments in the thesis involve
highly-optimised models as baselines to show that this work pushes the Pareto frontier of
quality vs speed trade-off forward. For example, it is possible to prune a model to be 50% faster
with no change in translation quality.
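The group lasso mentioned above can be illustrated with a short sketch. It uses the textbook form of the penalty (a sum of group-wise L2 norms, each scaled by the square root of the group size); how the thesis defines its groups and scaling is not stated in the abstract, so the grouping, the strength `lam`, and the pruning threshold below are assumptions.

```python
import math


def group_norm(group):
    """L2 norm of one parameter group (e.g. one attention head or one feedforward unit)."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso_penalty(groups, lam=0.01):
    """Textbook group lasso: lam * sum_g sqrt(|g|) * ||w_g||_2."""
    return lam * sum(math.sqrt(len(g)) * group_norm(g) for g in groups)


def prunable_groups(groups, threshold=1e-3):
    """Indices of groups whose norm has been driven (near) to zero and can be removed whole."""
    return [i for i, g in enumerate(groups) if group_norm(g) < threshold]


# Hypothetical groups: the second one has already collapsed to zero.
groups = [[3.0, 4.0], [0.0, 0.0], [0.5, -0.5]]

penalty = group_lasso_penalty(groups, lam=1.0)
removable = prunable_groups(groups)
print(penalty, removable)
```

Because the penalty acts on whole groups rather than single coefficients, the parameters that reach zero do so together, and the remaining model stays dense and needs no sparse matrix routines.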
3D GANs and Latent Space: A comprehensive survey
Generative Adversarial Networks (GANs) have emerged as a significant player
in generative modeling by mapping lower-dimensional random noise to
higher-dimensional spaces. These networks have been used to generate
high-resolution images and 3D objects. The efficient modeling of 3D objects and
human faces is crucial in the development process of 3D graphical environments
such as games or simulations. 3D GANs are a new type of generative model used
for 3D reconstruction, point cloud reconstruction, and 3D semantic scene
completion. The choice of distribution for noise is critical as it represents
the latent space. Understanding a GAN's latent space is essential for
fine-tuning the generated samples, as demonstrated by the morphing of
semantically meaningful parts of images. In this work, we explore the latent
space and 3D GANs, examine several GAN variants and training methods to gain
insights into improving 3D GAN training, and suggest potential future
directions for further research.
Transformer-based NMT : modeling, training and implementation
International trade and industrial collaborations enable countries and regions to concentrate their developments on specific industries while making the most of other countries' specializations, which significantly accelerates global development. However, globalization also increases the demand for cross-region communication. Language barriers between many languages worldwide create a challenge for achieving deep collaboration between groups speaking different languages, increasing the need for translation. Language technology, specifically Machine Translation (MT), holds the promise to enable communication between languages efficiently in real time with minimal costs. Even though nowadays computers can perform computation in parallel very fast, which provides machine translation users with translations with very low latency, and although the evolution from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) with the utilization of advanced deep learning algorithms has significantly boosted translation quality, current machine translation algorithms are still far from accurately translating all input. Thus, how to further improve the performance of state-of-the-art NMT algorithms remains a valuable open research question which has received a wide range of attention. In the research presented in this thesis, we first investigate the long-distance relation modeling ability of the state-of-the-art NMT model, the Transformer. We propose to learn source phrase representations and incorporate them into the Transformer translation model, aiming to enhance its ability to capture long-distance dependencies. Second, though previous work (Bapna et al., 2018) suggests that deep Transformers have difficulty in converging, we empirically find that the convergence of deep Transformers depends on the interaction between the layer normalization and residual connections employed to stabilize their training.
We conduct a theoretical study about how to ensure the convergence of Transformers, especially for deep Transformers, and propose to ensure the convergence of deep Transformers by putting the Lipschitz constraint on its parameter initialization. Finally, we investigate how to dynamically determine proper and efficient batch sizes during the training of the Transformer model. We find that the gradient direction gets stabilized with increasing batch size during gradient accumulation. Thus we propose to dynamically adjust batch sizes during training by monitoring the gradient direction change within gradient accumulation, and to achieve a proper and efficient batch size by stopping the gradient accumulation when the gradient direction starts to fluctuate. For our research in this thesis, we also implement our own NMT toolkit, the Neutron implementation of the Transformer and its variants. In addition to providing fundamental features as the basis of our implementations for the approaches presented in this thesis, we support many advanced features from recent cutting-edge research work. Implementations of all our approaches in this thesis are also included and open-sourced in the toolkit. To compare with previous approaches, we mainly conducted our experiments on the data from the WMT 14 English to German (En-De) and English to French (En-Fr) news translation tasks, except when studying the convergence of deep Transformers, where we replaced the WMT 14 En-Fr task with the WMT 15 Czech to English (Cs-En) news translation task to compare with Bapna et al. (2018). The sizes of these datasets vary from medium (the WMT 14 En-De, ~ 4.5M sentence pairs) to very large (the WMT 14 En-Fr, ~ 36M sentence pairs), thus we suggest our approaches help improve the translation quality between popular language pairs which are widely used and have sufficient data.
China Scholarship Counci
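The dynamic batch size idea described in the abstract (accumulate gradients, watch how much each new micro-batch turns the accumulated direction, stop once that change starts growing again) can be sketched roughly as follows. This is an illustrative reading, not the thesis's algorithm: the cosine-based change measure, the `min_steps` guard, the choice to discard the micro-batch that triggers the stop, and the toy gradients are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def accumulate_until_unstable(micro_batch_grads, min_steps=2):
    """Accumulate micro-batch gradients until the direction change starts to increase.

    Returns the accumulated gradient and the number of micro-batches used.
    ASSUMPTION: the micro-batch that triggers the stop is not included.
    """
    acc, steps, prev_change = None, 0, None
    for g in micro_batch_grads:
        if acc is None:
            acc, steps = list(g), 1
            continue
        new_acc = [a + b for a, b in zip(acc, g)]
        # How far this micro-batch rotates the accumulated gradient direction.
        change = 1.0 - cosine(acc, new_acc)
        if steps >= min_steps and prev_change is not None and change > prev_change:
            break  # direction started to fluctuate: stop and update with `acc`
        acc, steps, prev_change = new_acc, steps + 1, change
    return acc, steps


# Hypothetical micro-batch gradients: the fourth one points in a very different direction.
grads = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.1], [0.0, 3.0], [1.0, 0.0]]

acc, steps = accumulate_until_unstable(grads)
print(acc, steps)
```

With these toy gradients the first three micro-batches agree closely, so accumulation stops at an effective batch of three micro-batches when the fourth would swing the direction sharply.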
- …