
    Attentional Parsing Networks

    Convolutional neural networks (CNNs) have dominated the computer vision field since the early 2010s, when deep learning largely replaced previous approaches like hand-crafted feature engineering and hierarchical image parsing. Meanwhile, transformer architectures have attained preeminence in natural language processing and have even begun to supplant CNNs as the state of the art for some computer vision tasks. This study proposes a novel transformer-based architecture, the attentional parsing network, that reconciles the deep learning and hierarchical image parsing approaches to computer vision. We recast unsupervised image representation as a sequence-to-sequence translation problem in which image patches are mapped to successive layers of latent variables, and we enforce symmetry and sparsity constraints to encourage these mappings to take the form of a parse tree. We measure the quality of the learned representations by passing them to a classifier and find high accuracy (> 90%) even for small models. We also demonstrate controllable image generation: first by “back translating” from latent variables to pixels, and then by selecting subsets of those variables with attention masks. Finally, we discuss our design choices and compare them with alternatives, suggesting best practices and possible areas of improvement.
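
    The abstract leaves the layer design at a high level, so the following PyTorch sketch is only a hypothetical illustration of one patch-to-latent "parsing" step, not the authors' implementation: a set of latent tokens cross-attends to the patch tokens, and an entropy penalty on the attention weights pushes each latent to attend to only a few patches, approximating the sparse, tree-like mappings described above. The module name, head count, and penalty form are assumptions.

    import torch
    import torch.nn as nn

    class ParsingLayer(nn.Module):
        """Hypothetical sketch: map N patch tokens to M < N latent tokens."""
        def __init__(self, dim, n_latents, n_heads=4):
            super().__init__()
            self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)

        def forward(self, patches):                            # patches: (B, N, dim)
            q = self.latents.unsqueeze(0).expand(patches.size(0), -1, -1)
            latents, weights = self.attn(q, patches, patches)  # weights: (B, M, N)
            w = weights.clamp_min(1e-9)
            sparsity_loss = -(w * w.log()).sum(-1).mean()      # entropy penalty -> sparse attention
            return latents, sparsity_loss

    # Example: 196 ViT-style patch embeddings of width 256 parsed into 49 latents.
    layer = ParsingLayer(dim=256, n_latents=49)
    latents, loss = layer(torch.randn(2, 196, 256))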

    EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate

    Mixture-of-experts (MoE) is becoming popular due to its success in improving model quality, especially in Transformers. By routing tokens with a sparse gate to a few experts (i.e., small pieces of the full model), MoE can easily increase the model parameters to a very large scale while keeping the computation cost at a constant level. Most existing works simply initialize some random experts, set a fixed gating strategy (e.g., Top-k), and train the model from scratch in an ad-hoc way. We identify that these MoE models suffer from immature experts and an unstable sparse gate, both of which are harmful to convergence. In this paper, we propose an efficient end-to-end MoE training framework called EvoMoE. EvoMoE starts from training one single expert and gradually evolves into a large and sparse MoE structure. EvoMoE mainly contains two phases: the expert-diversify phase, which trains the base expert for a while and spawns multiple diverse experts from it, and the gate-sparsify phase, which learns an adaptive sparse gate and activates a dynamic number of experts. EvoMoE naturally decouples the joint learning of the experts and the sparse gate and focuses on learning basic knowledge with a single expert at the early training stage. It then diversifies the experts and continues to train the MoE with a novel Dense-to-Sparse gate (DTS-Gate). Specifically, instead of using a permanent sparse gate, DTS-Gate begins as a dense gate that routes tokens to all experts, then gradually and adaptively becomes sparser while routing tokens to fewer experts. Evaluations are conducted on three popular models and tasks: RoBERTa for masked language modeling, GPT for language modeling, and Transformer for machine translation. The results show that EvoMoE outperforms existing baselines, including Switch, BASE Layer, Hash Layer, and StableMoE.
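
    The Dense-to-Sparse gate is described above only at a high level, so the PyTorch sketch below is one plausible reading rather than EvoMoE's actual code: a softmax gate whose temperature anneals downward and whose pruning threshold grows with training, so that routing starts dense over all experts and gradually concentrates on a few. The schedule constants and the thresholding rule are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DenseToSparseGate(nn.Module):
        """Hypothetical sketch: dense routing early in training, sparser routing later."""
        def __init__(self, d_model, n_experts, t_start=2.0, t_end=0.1, anneal_steps=10_000):
            super().__init__()
            self.w_gate = nn.Linear(d_model, n_experts, bias=False)
            self.t_start, self.t_end, self.anneal_steps = t_start, t_end, anneal_steps
            self.register_buffer("step", torch.zeros((), dtype=torch.long))

        def forward(self, x):                                      # x: (n_tokens, d_model)
            frac = min(self.step.item() / self.anneal_steps, 1.0)
            temp = self.t_start + frac * (self.t_end - self.t_start)  # temperature anneals down
            probs = F.softmax(self.w_gate(x) / temp, dim=-1)
            # Drop experts whose weight falls below a threshold that starts at 0
            # (fully dense routing) and grows with training, making the gate sparser.
            probs = probs * (probs >= 0.1 * frac)
            probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-9)
            if self.training:
                self.step += 1
            return probs                                           # (n_tokens, n_experts) routing weights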

    Bird-Eye Transformers for Text Generation Models

    Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias through fully connected token graphs. However, we found that self-attention has a severe limitation: when predicting the (i+1)-th token, self-attention only takes the i-th token as an information collector, and it tends to give high attention weight to tokens similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper we propose a new architecture, called the bird-eye transformer (BET), which goes one step further to improve the performance of transformers by reweighting self-attention to encourage it to focus more on important historical information. We have conducted experiments on multiple text generation tasks, including machine translation (2 datasets) and language modeling (3 datasets). These experimental results show that our proposed model achieves better performance than the baseline transformer architectures on all datasets. The code is released at: https://sites.google.com/view/bet-transformer/home
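
    The abstract does not spell out BET's reweighting scheme, so the PyTorch sketch below only illustrates the general idea with one hypothetical choice: causal self-attention in which a learned, non-negative per-head penalty is subtracted from the diagonal of the score matrix, so each token attends less to itself and more to earlier (historical) tokens. The penalty form is an assumption, not the paper's method.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReweightedCausalAttention(nn.Module):
        """Hypothetical sketch: causal attention reweighted away from the current token."""
        def __init__(self, dim, n_heads):
            super().__init__()
            self.n_heads, self.d_head = n_heads, dim // n_heads
            self.qkv = nn.Linear(dim, 3 * dim)
            self.proj = nn.Linear(dim, dim)
            self.self_penalty = nn.Parameter(torch.zeros(n_heads))  # per-head penalty on self-attention

        def forward(self, x):                                       # x: (B, T, dim)
            B, T, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)        # (B, H, T, T)
            causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
            scores = scores.masked_fill(causal, float("-inf"))
            # Subtract a learned non-negative penalty from the diagonal so attention
            # mass shifts from the current token onto the historical context.
            eye = torch.eye(T, dtype=scores.dtype, device=x.device)
            scores = scores - F.softplus(self.self_penalty).view(1, -1, 1, 1) * eye
            attn = F.softmax(scores, dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, T, self.n_heads * self.d_head)
            return self.proj(out)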

    Breaking BERT: Evaluating and Optimizing Sparsified Attention

    Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that, on three common finetuning tasks, even attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighbouring connections that gives fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods. Comment: Shorter version accepted to the SNN2021 workshop.
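
    As an illustration of the "connections to neighbouring tokens" pattern that the paper finds most significant, the small PyTorch helper below builds a banded attention mask; the window size and the choice to keep the first token globally connected (e.g. a [CLS] token) are assumptions, not the paper's exact configuration.

    import torch

    def neighbour_mask(seq_len: int, window: int, keep_first: bool = True) -> torch.Tensor:
        """Boolean (seq_len, seq_len) mask; True = attention allowed."""
        idx = torch.arange(seq_len)
        mask = (idx[None, :] - idx[:, None]).abs() <= window
        if keep_first:               # keep global connections to/from the first token
            mask[0, :] = True
            mask[:, 0] = True
        return mask

    # Example: only tokens within +/-2 positions (plus token 0) may attend to each other.
    allowed = neighbour_mask(seq_len=128, window=2)
    # nn.MultiheadAttention treats True in attn_mask as "blocked", so pass attn_mask=~allowed.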