Initial Explorations on Regularizing the SCRN Model
Recurrent neural networks are powerful sequence models that are also widely used for language modeling. With proper regularization such as naive dropout, these models can achieve substantial improvements in performance. We regularize the Structurally Constrained Recurrent Network (SCRN) model and show that, despite its simplicity, it can achieve performance comparable to the ubiquitous LSTM model on the language modeling task while being smaller in size and up to 2x faster to train. Further analysis shows that regularizing both the context and hidden states of the SCRN is crucial.
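For concreteness, the sketch below shows one step of the SCRN recurrence from Mikolov et al. (2015) with naive dropout applied to both the context state s and the hidden state h, the combination this abstract reports as crucial. The layer sizes, decay weight alpha, and dropout rate are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    class SCRNCell(nn.Module):
        # One step of the SCRN recurrence with naive dropout on both the
        # slowly changing context state s and the hidden state h. The decay
        # weight alpha and the dropout rate are assumed values.
        def __init__(self, input_size, hidden_size, context_size,
                     alpha=0.95, dropout=0.5):
            super().__init__()
            self.alpha = alpha
            self.B = nn.Linear(input_size, context_size, bias=False)
            self.A = nn.Linear(input_size, hidden_size, bias=False)
            self.P = nn.Linear(context_size, hidden_size, bias=False)
            self.R = nn.Linear(hidden_size, hidden_size, bias=False)
            self.drop = nn.Dropout(dropout)

        def forward(self, x, h, s):
            # Context state: a slow exponential moving average of the input.
            s = (1 - self.alpha) * self.B(x) + self.alpha * s
            # Hidden state: sigmoid recurrence over input, context, and h.
            h = torch.sigmoid(self.A(x) + self.P(s) + self.R(h))
            # Regularize both states, which the abstract reports as crucial.
            return self.drop(h), self.drop(s)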
Simple Recurrent Units for Highly Parallelizable Recurrence
Common recurrent neural architectures scale poorly due to the intrinsic difficulty in parallelizing their state computations. In this work, we propose the Simple Recurrent Unit (SRU), a light recurrent unit that balances model capacity and scalability. SRU is designed to provide expressive recurrence, enable highly parallelized implementation, and come with careful initialization to facilitate the training of deep models. We demonstrate the effectiveness of SRU on multiple NLP tasks. SRU achieves 5-9x speed-up over cuDNN-optimized LSTM on classification and question answering datasets, and delivers stronger results than LSTM and convolutional models. We also obtain an average of 0.7 BLEU improvement over the Transformer model on translation by incorporating SRU into the architecture.
Comment: EMNLP
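The parallelism claim is easiest to see in code: every matrix multiplication in an SRU depends only on the inputs, so it can be batched across all time steps, leaving only a cheap elementwise recurrence to run sequentially. Below is a minimal PyTorch sketch of this structure; the fused weight layout and zero-initialized gate parameters are simplifying assumptions rather than the paper's actual initialization scheme.

    import torch
    import torch.nn as nn

    class SRULayer(nn.Module):
        # Minimal SRU layer: the three input projections are fused into one
        # matrix multiplication over the whole sequence, so only an
        # elementwise recurrence remains sequential.
        def __init__(self, d):
            super().__init__()
            self.W = nn.Linear(d, 3 * d, bias=False)  # fused W, W_f, W_r
            self.v_f = nn.Parameter(torch.zeros(d))
            self.v_r = nn.Parameter(torch.zeros(d))
            self.b_f = nn.Parameter(torch.zeros(d))
            self.b_r = nn.Parameter(torch.zeros(d))

        def forward(self, x):  # x: (seq_len, batch, d)
            # One big matmul with no time dependence -> parallel over time.
            u, wf, wr = self.W(x).chunk(3, dim=-1)
            c = torch.zeros_like(x[0])
            outs = []
            for t in range(x.size(0)):  # cheap elementwise-only recurrence
                f = torch.sigmoid(wf[t] + self.v_f * c + self.b_f)  # forget gate
                r = torch.sigmoid(wr[t] + self.v_r * c + self.b_r)  # reset gate
                c = f * c + (1 - f) * u[t]           # internal cell state
                outs.append(r * c + (1 - r) * x[t])  # highway output
            return torch.stack(outs), c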
Pyramidal Recurrent Unit for Language Modeling
LSTMs are powerful tools for modeling contextual information, as evidenced by their success at the task of language modeling. However, modeling contexts in a very high dimensional space can lead to poor generalizability. We introduce the Pyramidal Recurrent Unit (PRU), which enables learning representations in high dimensional space with more generalization power and fewer parameters. PRUs replace the linear transformation in LSTMs with more sophisticated interactions, including pyramidal and grouped linear transformations. This architecture gives strong results on word-level language modeling while significantly reducing the number of parameters. In particular, PRU improves the perplexity of a recent state-of-the-art language model (Merity et al., 2018) by up to 1.3 points while learning 15-20% fewer parameters. For a similar number of model parameters, PRU outperforms all previous RNN models that exploit different gating mechanisms and transformations. We provide a detailed examination of the PRU and its behavior on language modeling tasks. Our code is open-source and available at https://sacmehta.github.io/PRU/
Comment: Accepted as a long paper in EMNLP 2018
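Much of the PRU's parameter saving comes from the grouped linear transformation, which splits a feature vector into groups and transforms each with its own small linear map instead of one dense matrix. The sketch below illustrates this idea in PyTorch under the assumption of evenly divisible dimensions; the group count is a free choice, and the companion pyramidal transformation (small linear maps over successively pooled copies of the input) is omitted for brevity.

    import torch
    import torch.nn as nn

    class GroupedLinear(nn.Module):
        # Grouped linear transformation: g independent (d_in/g -> d_out/g)
        # maps, giving roughly g times fewer parameters than a dense
        # d_in -> d_out layer. Illustrative sketch, not the paper's code.
        def __init__(self, d_in, d_out, groups=4):
            super().__init__()
            assert d_in % groups == 0 and d_out % groups == 0
            self.groups = groups
            self.maps = nn.ModuleList([
                nn.Linear(d_in // groups, d_out // groups, bias=False)
                for _ in range(groups)])

        def forward(self, x):
            # Split the features into groups, transform each independently,
            # and concatenate the results back together.
            chunks = x.chunk(self.groups, dim=-1)
            return torch.cat([m(c) for m, c in zip(self.maps, chunks)],
                             dim=-1)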