Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism which is invariant to sequence
ordering. However, in the autoregressive setup, as is the case for language
modeling, the amount of information increases along the position dimension,
which is itself a positional signal. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models.
Comment: To appear in the proceedings of INTERSPEECH 2019.
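The following is a minimal sketch (not the authors' code; model sizes, names, and shapes are assumptions) of a causal self-attention layer fed by token embeddings only, with no positional encoding. The autoregressive mask lets position i attend only to positions up to i, so the amount of attended information grows with position, which is the positional signal the abstract describes.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """One self-attention sublayer with an autoregressive (causal) mask."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # True entries are positions a query may NOT attend to (the future).
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

# Token embeddings only -- no positional encoding is added before the layer.
emb = nn.Embedding(10000, 512)
layer = CausalSelfAttention()
tokens = torch.randint(0, 10000, (2, 16))
h = layer(emb(tokens))                           # (2, 16, 512)
```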
A Neural ODE Interpretation of Transformer Layers
Transformer layers, which use an alternating pattern of multi-head attention
and multi-layer perceptron (MLP) layers, provide an effective tool for a
variety of machine learning problems. As the transformer layers use residual
connections to avoid the problem of vanishing gradients, they can be viewed as
the numerical integration of a differential equation. In this extended
abstract, we build upon this connection and propose a modification of the
internal architecture of a transformer layer. The proposed model places the
multi-head attention sublayer and the MLP sublayer parallel to each other. Our
experiments show that this simple modification improves the performance of
transformer networks in multiple tasks. Moreover, for the image classification
task, we show that using neural ODE solvers with a sophisticated integration
scheme further improves performance.
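Below is a minimal sketch (an assumed architecture, not the authors' implementation) of the parallel arrangement described above: the multi-head attention sublayer and the MLP sublayer read the same normalized input, and their outputs are summed into the residual stream, so one block resembles a single explicit integration step x + f(x).

```python
import torch
import torch.nn as nn

class ParallelTransformerBlock(nn.Module):
    """Attention and MLP sublayers applied in parallel on the same input."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        # Residual update is the sum of both branches (one Euler-like step).
        return x + attn_out + self.mlp(h)

x = torch.randn(2, 10, 256)
print(ParallelTransformerBlock()(x).shape)       # torch.Size([2, 10, 256])
```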
GNN-LM: Language Modeling based on Global Contexts via GNN
Inspired by the notion that "to copy is easier than to memorize", in
this work, we introduce GNN-LM, which extends the vanilla neural language model
(LM) by allowing it to reference similar contexts in the entire training corpus.
We build a directed heterogeneous graph between an input context and its
semantically related neighbors selected from the training corpus, where nodes
are tokens in the input context and retrieved neighbor contexts, and edges
represent connections between nodes. Graph neural networks (GNNs) are
constructed upon the graph to aggregate information from similar contexts to
decode the token. This learning paradigm provides direct access to the
reference contexts and helps improve a model's generalization ability. We
conduct comprehensive experiments to validate the effectiveness of the GNN-LM:
GNN-LM achieves a new state-of-the-art perplexity of 14.8 on WikiText-103 (a
3.9-point improvement over its vanilla LM counterpart), and shows
substantial improvement on One Billion Word and Enwiki8 datasets against strong
baselines. In-depth ablation studies are performed to understand the mechanics
of GNN-LM. The code can be found at https://github.com/ShannonAI/GNN-LM.
Comment: To appear at ICLR 2022.
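The following is a toy sketch (assumptions throughout; it is not the released GNN-LM code) of the aggregation idea: the hidden state of the token being decoded gathers messages from token representations retrieved from similar training contexts, and the aggregated state feeds the output softmax. Retrieval of the neighbor states (e.g. a kNN lookup over cached hidden states) is stubbed out with random tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborAggregator(nn.Module):
    """One attention-style message-passing step from retrieved neighbor tokens."""
    def __init__(self, d_model=256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, h_token, h_neighbors):
        # h_token: (batch, d_model); h_neighbors: (batch, n_neighbors, d_model)
        q = self.q(h_token).unsqueeze(1)                   # (B, 1, D)
        scores = (q * self.k(h_neighbors)).sum(-1)         # (B, N)
        alpha = F.softmax(scores, dim=-1).unsqueeze(-1)    # (B, N, 1)
        message = (alpha * self.v(h_neighbors)).sum(1)     # (B, D)
        return h_token + message                           # residual update

d, vocab = 256, 10000
agg, out_proj = NeighborAggregator(d), nn.Linear(d, vocab)
h_token = torch.randn(4, d)          # vanilla LM hidden state for the next token
h_neighbors = torch.randn(4, 8, d)   # stub: 8 retrieved neighbor-token states
logits = out_proj(agg(h_token, h_neighbors))               # (4, vocab)
```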