Recurrent Neural Networks (RNNs) have been widely used in natural language
processing and computer vision. Among them, the Hierarchical Multi-scale RNN
(HM-RNN), a kind of multi-scale hierarchical RNN proposed recently, can learn
the hierarchical temporal structure from data automatically. In this paper, we
extend the work to solve the computer vision task of action recognition.
However, in sequence-to-sequence models like RNN, it is normally very hard to
discover the relationships between inputs and outputs given static inputs. As a
solution, attention mechanism could be applied to extract the relevant
information from input thus facilitating the modeling of input-output
relationships. Based on these considerations, we propose a novel attention
network, namely Hierarchical Multi-scale Attention Network (HM-AN), by
combining the HM-RNN and the attention mechanism and apply it to action
recognition. A newly proposed gradient estimation method for stochastic
neurons, namely Gumbel-softmax, is exploited to implement the temporal boundary
detectors and the stochastic hard attention mechanism. To amealiate the
negative effect of sensitive temperature of the Gumbel-softmax, an adaptive
temperature training method is applied to better the system performance. The
experimental results demonstrate the improved effect of HM-AN over LSTM with
attention on the vision task. Through visualization of what have been learnt by
the networks, it can be observed that both the attention regions of images and
the hierarchical temporal structure can be captured by HM-AN