A Lightweight Recurrent Network for Sequence Modeling
Recurrent networks have achieved great success on various sequential tasks,
aided by complex recurrent units, but they suffer from severe computational
inefficiency due to weak parallelization. One direction to
alleviate this issue is to shift heavy computations outside the recurrence. In
this paper, we propose a lightweight recurrent network, or LRN. LRN uses input
and forget gates to handle long-range dependencies as well as gradient
vanishing and explosion, with all parameter-related calculations factored
outside the recurrence. The recurrence in LRN only manipulates the weight
assigned to each token, tightly connecting LRN with self-attention networks. We
apply LRN as a drop-in replacement for existing recurrent units in several
neural sequential models. Extensive experiments on six NLP tasks show that LRN
yields the best running efficiency with little or no loss in model performance.
Comment: ACL 2019, long paper
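The recurrence described above can be made concrete with a short sketch. Below is a minimal NumPy illustration, assuming a gated update of the form h_t = f_t * h_{t-1} + i_t * v_t with gates derived from per-token projections; the names (lrn_sketch, W_q, W_k, W_v) and the exact gate wiring are illustrative assumptions based on the abstract, not the paper's equations.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrn_sketch(X, W_q, W_k, W_v):
    # All parameter-related work (the three projections) is factored
    # outside the recurrence, so it parallelizes across time steps.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    f, i = sigmoid(Q), sigmoid(K)      # forget and input gates, per token
    h = np.zeros(V.shape[1])
    states = []
    for t in range(V.shape[0]):        # the loop itself is parameter-free
        h = f[t] * h + i[t] * V[t]     # it only re-weights token values
        states.append(h)
    return np.stack(states)

Unrolling h_t = f_t * h_{t-1} + i_t * v_t writes h_t as a weighted sum of v_1..v_t, which is the sense in which the recurrence "only manipulates the weight assigned to each token" and connects LRN to self-attention networks.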
Root Mean Square Layer Normalization
Layer normalization (LayerNorm) has been successfully applied to various deep
neural networks to help stabilize training and boost model convergence because
of its ability to handle re-centering and re-scaling of both inputs and
weight matrices. However, the computational overhead introduced by LayerNorm
makes these improvements expensive and significantly slows the underlying
network, RNNs in particular. In this paper, we hypothesize that
re-centering invariance in LayerNorm is dispensable and propose root mean
square layer normalization, or RMSNorm. RMSNorm regularizes the summed inputs
to a neuron in one layer according to root mean square (RMS), giving the model
a re-scaling invariance property and an implicit learning-rate adaptation ability.
RMSNorm is computationally simpler and thus more efficient than LayerNorm. We
also present partial RMSNorm, or pRMSNorm, where the RMS is estimated from p% of
the summed inputs without breaking the above properties. Extensive experiments
on several tasks using diverse network architectures show that RMSNorm achieves
performance comparable to LayerNorm while reducing the running time by 7%-64%
across different models. Source code is available at
https://github.com/bzhangGo/rmsnorm.
Comment: NeurIPS 2019
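For reference, the computation described above admits a very small sketch: the summed inputs a to a layer are rescaled by their root mean square and multiplied by a learned gain g, with no mean subtraction and no bias, since re-centering is exactly the step the paper argues is dispensable. This is a minimal NumPy illustration; the eps term and the rms_norm/p argument names are assumptions (for numerical stability and the pRMSNorm variant), so see the linked repository for the authors' implementation.

import numpy as np

def rms_norm(a, g, eps=1e-8, p=None):
    # a: summed inputs to a layer, shape (..., n); g: learned gain, shape (n,)
    if p is None:
        # Full RMSNorm: RMS computed over all n components.
        rms = np.sqrt(np.mean(a ** 2, axis=-1, keepdims=True) + eps)
    else:
        # pRMSNorm: estimate the RMS from the first p fraction of components.
        k = max(1, int(a.shape[-1] * p))
        rms = np.sqrt(np.mean(a[..., :k] ** 2, axis=-1, keepdims=True) + eps)
    return a / rms * g

Because a / RMS(a) is unchanged when a is rescaled, this retains the re-scaling invariance (and the implicit learning-rate adaptation the abstract mentions) while avoiding LayerNorm's mean statistics.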