Recurrent Highway Networks
Many sequential processing tasks require complex nonlinear transition
functions from one step to the next. However, recurrent neural networks with
'deep' transition functions remain difficult to train, even when using Long
Short-Term Memory (LSTM) networks. We introduce a novel theoretical analysis of
recurrent networks based on Geršgorin's circle theorem that illuminates several
modeling and optimization issues and improves our understanding of the LSTM
cell. Based on this analysis we propose Recurrent Highway Networks, which
extend the LSTM architecture to allow step-to-step transition depths larger
than one. Several language modeling experiments demonstrate that the proposed
architecture results in powerful and efficient models. On the Penn Treebank
corpus, solely increasing the transition depth from 1 to 10 improves word-level
perplexity from 90.6 to 65.4 using the same number of parameters. On the larger
Wikipedia datasets for character prediction (text8 and enwik8), RHNs outperform
all previous results and achieve an entropy of 1.27 bits per character.
Comment: 12 pages, 6 figures, 3 tables
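
To make the "transition depth" idea concrete, below is a minimal sketch of one Recurrent Highway Network step in NumPy, assuming the coupled-gate variant in which the carry gate is tied to 1 - t. The parameter names, shapes, and the usage example are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_step(x, s_prev, params, depth):
    """One time step of a Recurrent Highway Network (illustrative sketch).

    Each of the `depth` highway layers mixes a candidate h with the carried
    state via a transform gate t; the carry gate is assumed coupled (1 - t).
    Weight names are hypothetical, not taken from the paper's code.
    """
    s = s_prev
    for l in range(depth):
        # The external input x feeds only the first highway layer of the step.
        in_h = params["W_h"] @ x if l == 0 else 0.0
        in_t = params["W_t"] @ x if l == 0 else 0.0
        h = np.tanh(in_h + params["R_h"][l] @ s + params["b_h"][l])
        t = sigmoid(in_t + params["R_t"][l] @ s + params["b_t"][l])
        s = h * t + s * (1.0 - t)   # highway combination of candidate and carry
    return s

# Tiny usage example with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_hid, depth = 4, 8, 3
params = {
    "W_h": rng.normal(size=(n_hid, n_in)),
    "W_t": rng.normal(size=(n_hid, n_in)),
    "R_h": [0.1 * rng.normal(size=(n_hid, n_hid)) for _ in range(depth)],
    "R_t": [0.1 * rng.normal(size=(n_hid, n_hid)) for _ in range(depth)],
    "b_h": [np.zeros(n_hid) for _ in range(depth)],
    "b_t": [np.zeros(n_hid) for _ in range(depth)],
}
s_next = rhn_step(rng.normal(size=n_in), np.zeros(n_hid), params, depth)
```

Setting depth=1 reduces the loop to a single gated update per time step, which is the baseline the abstract compares against when it reports the perplexity gain from increasing the transition depth to 10.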
Nonlinear Advantage: Trained Networks Might Not Be As Complex as You Think
We perform an empirical study of the behaviour of deep networks when fully
linearizing some of their feature channels through a sparsity prior on the
overall number of nonlinear units in the network. In experiments on image
classification and machine translation tasks, we investigate how much we can
simplify the network function towards linearity before performance collapses.
First, we observe a significant performance gap when reducing nonlinearity in
the network function early on as opposed to late in training, in line with
recent observations on the time-evolution of the data-dependent NTK. Second, we
find that after training, we are able to linearize a significant number of
nonlinear units while maintaining high performance, indicating that much of a
network's expressivity remains unused but helps gradient descent in early
stages of training. To characterize the depth of the resulting partially
linearized network, we introduce a measure called average path length,
representing the average number of active nonlinearities encountered along a
path in the network graph. Under sparsity pressure, we find that the remaining
nonlinear units organize into distinct structures, forming core networks of
near-constant effective depth and width, which in turn depend on task
difficulty.
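
As a rough illustration of the "average path length" measure, here is a small sketch under a simplifying assumption: in a fully connected MLP, every input-to-output path visits one unit per hidden layer, so the measure can be estimated by sampling random paths and counting how many visited units remain nonlinear. The function name, the boolean masks, and the uniform-path assumption are hypothetical, not the paper's exact definition.

```python
import numpy as np

def average_path_length(nonlinear_masks, n_samples=10_000, seed=0):
    """Monte Carlo estimate of an 'average path length'-style measure.

    Illustrative interpretation only: sample random input-to-output paths
    (one unit per hidden layer of a fully connected MLP) and count how many
    of the visited units are still nonlinear according to `nonlinear_masks`,
    a list with one boolean mask per hidden layer.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_samples)
    for mask in nonlinear_masks:
        visited = rng.integers(0, len(mask), size=n_samples)
        counts += np.asarray(mask, dtype=float)[visited]
    return counts.mean()

# Hypothetical example: a 3-hidden-layer net where sparsity pressure has
# linearized most units in the middle layer.
masks = [
    [1, 1, 1, 0],   # layer 1: 3 of 4 units still nonlinear
    [0, 0, 1, 0],   # layer 2: mostly linearized
    [1, 1, 0, 0],   # layer 3: half nonlinear
]
print(average_path_length(masks))   # close to 0.75 + 0.25 + 0.50 = 1.5
```

Under this uniform-path assumption the measure decomposes into the per-layer fractions of surviving nonlinear units, which is one way a partially linearized network can be summarized by an effective depth.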