Character-Level Language Modeling with Deeper Self-Attention
LSTMs and other RNN variants have shown strong performance on character-level
language modeling. These models are typically trained using truncated
backpropagation through time, and it is common to assume that their success
stems from their ability to remember long-term contexts. In this paper, we show
that a deep (64-layer) transformer model with fixed context outperforms RNN
variants by a large margin, achieving state of the art on two popular
benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good
results at this depth, we show that it is important to add auxiliary losses,
both at intermediate network layers and intermediate sequence positions.
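The auxiliary-loss idea can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: a shared next-character prediction head is attached to every transformer layer and every sequence position, and the per-layer losses are averaged (the paper additionally schedules some of these losses over training). All names and hyperparameters here (CharTransformer, d_model=64, the 4-layer depth, the toy data) are illustrative assumptions; it assumes a recent PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharTransformer(nn.Module):
    """Causal character-level transformer with per-layer auxiliary losses."""
    def __init__(self, vocab=256, d_model=64, n_heads=4, n_layers=4, ctx=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Parameter(torch.zeros(ctx, d_model))
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab)   # shared prediction head

    def forward(self, x, targets):
        T = x.size(1)
        # Causal mask so each position only attends to earlier characters.
        mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        h = self.embed(x) + self.pos[:T]
        loss = 0.0
        for layer in self.layers:
            h = layer(h, src_mask=mask)
            # Auxiliary loss: predict the next character from *every*
            # layer and *every* sequence position, not only the top layer.
            logits = self.head(h)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return loss / len(self.layers)

model = CharTransformer()
x = torch.randint(0, 256, (2, 32))   # toy batch of character ids
y = torch.randint(0, 256, (2, 32))   # toy next-character targets
print(model(x, y))                   # averaged multi-layer loss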
Alternating Synthetic and Real Gradients for Neural Language Modeling
Training recurrent neural networks (RNNs) with backpropagation through time
(BPTT) has known drawbacks, such as difficulty capturing long-term
dependencies in sequences. Successful alternatives to BPTT have not yet been
discovered. Recently, backpropagation with synthetic gradients, produced by a
decoupled neural interface module, has been proposed as a replacement for BPTT
when training RNNs. On the other hand, it has been shown that the
representations learned with synthetic and real gradients are different, even
though they are functionally identical. In this project, we explore ways of
combining synthetic and real gradients, with application to neural language
modeling tasks. Empirically, we demonstrate the effectiveness of alternating
training with synthetic and real gradients after periodic warm restarts on
language modeling tasks.
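To make the alternating scheme concrete, here is a rough sketch under stated assumptions, not the project's code: a GRU language model is trained on two consecutive segments, where "real" phases use ordinary truncated BPTT across the segment boundary and "synthetic" phases detach the boundary and inject a gradient predicted by a small decoupled-neural-interface module (here a single linear layer, trained to match the real boundary gradient). The alternation period is tied to the warm-restart schedule; the segment lengths, dimensions, single-linear DNI, and random toy data are all placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

V, E, H, B, T = 100, 16, 32, 4, 10        # vocab, embed, hidden, batch, seg len
emb  = nn.Embedding(V, E)
rnn  = nn.GRU(E, H, batch_first=True)
head = nn.Linear(H, V)
dni  = nn.Linear(H, H)                    # predicts dLoss_future/dh (DNI module)
opt = torch.optim.Adam(
    [*emb.parameters(), *rnn.parameters(), *head.parameters(), *dni.parameters()],
    lr=1e-3)
# Warm restarts reset the learning rate every T_0 steps; we flip between
# real and synthetic gradients with the same period.
sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=50)

def segment_loss(x, y, h):
    """Next-token loss on one segment, plus the boundary hidden state."""
    out, h_next = rnn(emb(x), h)
    return F.cross_entropy(head(out).reshape(-1, V), y.reshape(-1)), h_next

h0 = torch.zeros(1, B, H)
for step in range(200):
    synthetic = (step // 50) % 2 == 1     # alternate after each restart
    xa, ya = torch.randint(0, V, (B, T)), torch.randint(0, V, (B, T))
    xb, yb = torch.randint(0, V, (B, T)), torch.randint(0, V, (B, T))
    opt.zero_grad()
    loss_a, h_mid = segment_loss(xa, ya, h0)
    if synthetic:
        # Segment A: local loss plus a *predicted* future gradient
        # injected at the boundary state instead of true BPTT.
        loss_a.backward(retain_graph=True)
        h_mid.backward(dni(h_mid.detach()).detach())
        # Segment B starts from a detached boundary; its real gradient
        # w.r.t. that boundary supervises the DNI module.
        h_in = h_mid.detach().requires_grad_(True)
        loss_b, _ = segment_loss(xb, yb, h_in)
        loss_b.backward()
        dni_loss = F.mse_loss(dni(h_mid.detach()), h_in.grad.detach())
        dni_loss.backward()
    else:
        # Real gradients: ordinary truncated BPTT across both segments.
        loss_b, _ = segment_loss(xb, yb, h_mid)
        (loss_a + loss_b).backward()
    opt.step()
    sched.step()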