
Losing Heads in the Lottery: Pruning Transformer

Author: Behnke Maximiliana
Heafield Kenneth
Publication date: 16/11/2020

Successfully Applying Lottery Ticket Hypothesis to Diffusion Model

Author: Hui Bo
Jiang Chao
Liu Bohan
Yan Da
Publication date: 28/10/2023

Despite the success of diffusion models, their training and inference are notoriously expensive due to the long chain of the reverse process. In parallel, the Lottery Ticket Hypothesis (LTH) claims that there exist winning tickets (i.e., a properly pruned sub-network together with the original weight initialization) that can achieve performance competitive with the original dense neural network when trained in isolation. In this work, we apply LTH to diffusion models for the first time. We empirically find subnetworks at sparsity 90%-99% without compromising performance for denoising diffusion probabilistic models on benchmarks (CIFAR-10, CIFAR-100, MNIST). Moreover, existing LTH works identify the subnetworks with a unified sparsity along different layers. We observe that the similarity between two winning tickets of a model varies from block to block. Specifically, the upstream layers from two winning tickets for a model tend to be more similar than the downstream layers. Therefore, we propose to find the winning ticket with varying sparsity along different layers in the model. Experimental results demonstrate that our method can find sparser sub-models that require less memory for storage and reduce the necessary number of FLOPs. Code is available at https://github.com/osier0524/Lottery-Ticket-to-DDPM

arXiv.org e-Print Archive
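
The idea of a winning ticket with sparsity that varies by layer, described in the abstract above, can be illustrated with a minimal sketch. It assumes plain magnitude pruning on flat weight lists; the function names (`magnitude_mask`, `layerwise_masks`), the layer names, and the sparsity values are hypothetical and chosen only for illustration. This is not the authors' method, which is available at the repository linked in the abstract.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of
    smallest-magnitude weights (hypothetical helper)."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are pruned.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def layerwise_masks(layers, sparsities):
    """Build one mask per layer, each with its own target sparsity."""
    return {name: magnitude_mask(w, sparsities[name]) for name, w in layers.items()}


# Toy weights for two layers; the values are arbitrary.
layers = {
    "upstream": [0.9, -0.1, 0.4, 0.05],
    "downstream": [0.3, -0.7, 0.02, 0.6],
}
# Assumption: a different sparsity per layer instead of one global value.
sparsities = {"upstream": 0.5, "downstream": 0.75}
masks = layerwise_masks(layers, sparsities)

# A "ticket" pairs the mask with the original initialization: here the
# toy weights stand in for the initial weights the mask is applied to.
ticket = {name: [w * m for w, m in zip(layers[name], masks[name])] for name in layers}
```

In a full lottery-ticket loop, the masks would be recomputed after each round of training and the surviving weights reset to their initial values before retraining.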

Doge Tickets: Uncovering Domain-general Language Models by Playing Lottery Tickets

Author: Song Dawei
Wang Benyou
Yang Yi
Zhang Chen
Publication date: 19/09/2022

Over-parameterized models, typically pretrained language models (LMs), have shown an appealing expressive power due to their small learning bias. However, the huge learning capacity of LMs can also lead to large learning variance. In a pilot study, we find that, when faced with multiple domains, a critical portion of parameters behave unexpectedly in a domain-specific manner while others behave in a domain-general one. Motivated by this phenomenon, we posit for the first time that domain-general parameters can underpin a domain-general LM that can be derived from the original LM. To uncover the domain-general LM, we propose to identify domain-general parameters by playing lottery tickets (dubbed doge tickets). In order to intervene in the lottery, we propose a domain-general score, which depicts how domain-invariant a parameter is by associating it with the variance. Comprehensive experiments are conducted on the Amazon, MNLI and OntoNotes datasets. The results show that the doge tickets obtain an improved out-of-domain generalization in comparison with a range of competitive baselines. Analysis results further hint at the existence of domain-general parameters and the performance consistency of doge tickets. Comment: Accepted to NLPCC 2022. Code is available at https://github.com/Ylily1015/DogeTicket

arXiv.org e-Print Archive
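
The abstract above says only that the domain-general score associates a parameter with its variance across domains. One plausible reading, shown here purely as an assumption and not as the paper's formula, is to reward a parameter's mean importance across domains and penalize the variance of that importance. The function name `domain_general_score`, the weight `lam`, and the toy importance values are all hypothetical.

```python
from statistics import mean, pvariance


def domain_general_score(per_domain_importance, lam=1.0):
    """Assumed score: mean importance across domains minus lam times
    its variance, so domain-invariant parameters rank higher."""
    return mean(per_domain_importance) - lam * pvariance(per_domain_importance)


# Two toy parameters with the same mean importance over three domains.
invariant_param = [0.5, 0.5, 0.5]   # behaves the same in every domain
specific_param = [1.0, 0.0, 0.5]    # important in one domain only

score_invariant = domain_general_score(invariant_param)
score_specific = domain_general_score(specific_param)
```

Under this assumed score, the parameter whose importance is stable across domains outranks the one whose importance swings between domains, even though both have the same mean.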

Two-Way Neural Machine Translation: A Proof of Concept for Bidirectional Translation Modeling using a Two-Dimensional Grid

Author: Bahar Parnia
Brix Christopher
Ney Hermann
Publication date: 01/01/2020

Neural translation models have proven to be effective in capturing sufficient information from a source sentence and generating a high-quality target sentence. However, it is not easy to get the best effect for bidirectional translation, i.e., both source-to-target and target-to-source translation using a single model. If we exclude some pioneering attempts, such as multilingual systems, all other bidirectional translation approaches are required to train two individual models. This paper proposes to build a single end-to-end bidirectional translation model using a two-dimensional grid, where the left-to-right decoding generates source-to-target, and the bottom-to-up decoding creates target-to-source output. Instead of training two models independently, our approach encourages a single network to jointly learn to translate in both directions. Experiments on the WMT 2018 German↔English and Turkish↔English translation tasks show that the proposed model is capable of generating a good translation quality and has sufficient potential to direct the research. Comment: 6 pages, accepted at SLT202

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

Structural pruning for speed in neural machine translation

Author: Behnke Maximiliana
Publication venue: The University of Edinburgh
Publication date: 05/12/2022

Neural machine translation (NMT) strongly outperforms previous statistical techniques. With the emergence of the transformer architecture, we consistently train and deploy deeper and larger models, often with billions of parameters, as an ongoing effort to achieve even better quality. On the other hand, there is also a constant pursuit of optimisation opportunities to reduce inference runtime. Parameter pruning is one of the staple optimisation techniques. Even though coefficient-wise sparsity is the most popular for compression purposes, it does not easily make a model run faster. Sparse matrix multiplication routines require custom approaches, usually depending on low-level hardware implementations for the most efficiency. In my thesis, I focus on structural pruning in the field of NMT, which results in smaller but still dense architectures that do not need any further modifications to work efficiently. My research focuses on two main directions. The first one explores the Lottery Ticket Hypothesis (LTH), a well-known pruning algorithm, but this time in a structural setup with a custom pruning criterion. It involves partial training and pruning steps performed in a loop. Experiments with LTH produced substantial speed-up when applied to prune heads in the attention mechanism of a transformer. While this method has proven successful, it carries the burden of prolonged training cost that makes an already expensive training routine even more so. From that point, I exclusively concentrate on research incorporating pruning into training via regularisation. I experiment with a standard group lasso, which zeroes out parameters together in a structural pre-defined way. By targeting feedforward and attention layers in a transformer, group lasso significantly improves inference speed with already optimised state-of-the-art fast models.
Improving upon that work, I designed a novel approach called aided regularisation, where every layer penalty is scaled based on statistics gathered as training progresses. Both gradient- and parameter-based approaches aim to decrease the depth of a model, further optimising speed while maintaining the translation quality of an unpruned baseline. The goal of this dissertation is to advance the state-of-the-art efficient NMT with simple but tangible structural sparsity methods. The majority of all experiments in the thesis involve highly-optimised models as baselines to show that this work pushes the Pareto frontier of quality vs speed trade-off forward. For example, it is possible to prune a model to be 50% faster with no change in translation quality

Edinburgh Research Archive
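
The group lasso mentioned in the abstract above penalizes whole groups of parameters by their L2 norm, which pushes entire groups (for example, one attention head) towards zero together. The following is a minimal sketch under stated assumptions: groups are flat lists of weights, the penalty omits the group-size scaling that some formulations use, and the names `group_lasso_penalty`, `prunable_groups`, `strength`, and `tol` are hypothetical. It is not the thesis implementation.

```python
import math


def group_norm(group):
    """L2 norm of one parameter group."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso_penalty(groups, strength=1.0):
    """Sum of per-group L2 norms, scaled by a regularisation strength."""
    return strength * sum(group_norm(g) for g in groups)


def prunable_groups(groups, tol=1e-6):
    """Indices of groups whose norm has been driven to (near) zero,
    so the whole structure can be removed from the model."""
    return [i for i, g in enumerate(groups) if group_norm(g) <= tol]


# Three toy groups, e.g. the weights of three attention heads.
groups = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
penalty = group_lasso_penalty(groups, strength=0.1)
to_prune = prunable_groups(groups)
```

Because the penalty acts on a group's norm rather than on individual weights, a group is either kept dense or removed as a whole, which is what leaves a smaller but still dense architecture.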

3D GANs and Latent Space: A comprehensive survey

Author: Mishra Subhankar
Tata Satya Pratheek
Publication date: 08/04/2023

Generative Adversarial Networks (GANs) have emerged as a significant player in generative modeling by mapping lower-dimensional random noise to higher-dimensional spaces. These networks have been used to generate high-resolution images and 3D objects. The efficient modeling of 3D objects and human faces is crucial in the development process of 3D graphical environments such as games or simulations. 3D GANs are a new type of generative model used for 3D reconstruction, point cloud reconstruction, and 3D semantic scene completion. The choice of distribution for noise is critical as it represents the latent space. Understanding a GAN's latent space is essential for fine-tuning the generated samples, as demonstrated by the morphing of semantically meaningful parts of images. In this work, we explore the latent space and 3D GANs, examine several GAN variants and training methods to gain insights into improving 3D GAN training, and suggest potential future directions for further research

arXiv.org e-Print Archive

Transformer-based NMT : modeling, training and implementation

Author: Xu Hongfei
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2021

International trade and industrial collaborations enable countries and regions to concentrate their developments on specific industries while making the most of other countries' specializations, which significantly accelerates global development. However, globalization also increases the demand for cross-region communication. Language barriers between many languages worldwide create a challenge for achieving deep collaboration between groups speaking different languages, increasing the need for translation. Language technology, specifically Machine Translation (MT), holds the promise to enable communication between languages efficiently in real-time with minimal costs. Even though nowadays computers can perform computation in parallel very fast, which provides machine translation users with translations with very low latency, and although the evolution from Statistical Machine Translation (SMT) to Neural Machine Translation (NMT) with the utilization of advanced deep learning algorithms has significantly boosted translation quality, current machine translation algorithms are still far from accurately translating all input. Thus, how to further improve the performance of state-of-the-art NMT algorithms remains a valuable open research question which has received a wide range of attention. In the research presented in this thesis, we first investigate the long-distance relation modeling ability of the state-of-the-art NMT model, the Transformer. We propose to learn source phrase representations and incorporate them into the Transformer translation model, aiming to enhance its ability to capture long-distance dependencies well. Second, though previous work (Bapna et al., 2018) suggests that deep Transformers have difficulty in converging, we empirically find that the convergence of deep Transformers depends on the interaction between the layer normalization and residual connections employed to stabilize its training.
We conduct a theoretical study about how to ensure the convergence of Transformers, especially for deep Transformers, and propose to ensure the convergence of deep Transformers by putting the Lipschitz constraint on its parameter initialization. Finally, we investigate how to dynamically determine proper and efficient batch sizes during the training of the Transformer model. We find that the gradient direction gets stabilized with increasing batch size during gradient accumulation. Thus we propose to dynamically adjust batch sizes during training by monitoring the gradient direction change within gradient accumulation, and to achieve a proper and efficient batch size by stopping the gradient accumulation when the gradient direction starts to fluctuate. For our research in this thesis, we also implement our own NMT toolkit, the Neutron implementation of the Transformer and its variants. In addition to providing fundamental features as the basis of our implementations for the approaches presented in this thesis, we support many advanced features from recent cutting-edge research work. Implementations of all our approaches in this thesis are also included and open-sourced in the toolkit. To compare with previous approaches, we mainly conducted our experiments on the data from the WMT 14 English to German (En-De) and English to French (En-Fr) news translation tasks, except when studying the convergence of deep Transformers, where we alternated the WMT 14 En-Fr task with the WMT 15 Czech to English (Cs-En) news translation task to compare with Bapna et al. (2018). The sizes of these datasets vary from medium (the WMT 14 En-De, ~ 4.5M sentence pairs) to very large (the WMT 14 En-Fr, ~ 36M sentence pairs), thus we suggest our approaches help improve the translation quality between popular language pairs which are widely used and have sufficient data. China Scholarship Counci

Universaar
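
The dynamic batch size idea in the abstract above (accumulate gradients, watch how the accumulated direction changes, and stop once it starts to fluctuate) can be sketched as follows. The stopping rule used here is an assumption made for illustration: accumulation stops when the cosine similarity between the accumulated gradient before and after adding a mini-batch drops below the previous step's similarity, and the mini-batch that triggered the stop is left out. The function names and toy gradients are hypothetical; this is not the thesis or Neutron implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def accumulate_until_fluctuation(minibatch_grads):
    """Return (accumulated gradient, number of mini-batches used).

    Assumed rule: keep accumulating while each new mini-batch changes
    the accumulated direction no more than the previous one did.
    """
    acc = list(minibatch_grads[0])
    prev_sim = None
    used = 1
    for grad in minibatch_grads[1:]:
        new_acc = [a + g for a, g in zip(acc, grad)]
        sim = cosine(acc, new_acc)
        if prev_sim is not None and sim < prev_sim:
            break  # direction change grew again: stop accumulating
        acc, prev_sim = new_acc, sim
        used += 1
    return acc, used


# Toy per-mini-batch gradients; the fourth one points elsewhere.
grads = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05], [0.0, 1.0], [1.0, 0.0]]
acc, used = accumulate_until_fluctuation(grads)
```

The number of mini-batches consumed before the stop acts as the effective batch size for that optimizer step, so it can differ from one step to the next.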
