Hierarchical Multi-scale Attention Networks for action recognition
Recurrent Neural Networks (RNNs) have been widely used in natural language
processing and computer vision. Among them, the Hierarchical Multi-scale RNN
(HM-RNN), a kind of multi-scale hierarchical RNN proposed recently, can learn
the hierarchical temporal structure from data automatically. In this paper, we
extend the work to solve the computer vision task of action recognition.
However, in sequence-to-sequence models such as RNNs, it is normally very hard to
discover the relationships between inputs and outputs given static inputs. As a
solution, an attention mechanism can be applied to extract the relevant
information from the input, thus facilitating the modeling of input-output
relationships. Based on these considerations, we propose a novel attention
network, namely Hierarchical Multi-scale Attention Network (HM-AN), by
combining the HM-RNN and the attention mechanism and apply it to action
recognition. A newly proposed gradient estimation method for stochastic
neurons, namely Gumbel-softmax, is exploited to implement the temporal boundary
detectors and the stochastic hard attention mechanism. To ameliorate the
negative effect of the Gumbel-softmax's sensitivity to temperature, an adaptive
temperature training method is applied to improve system performance. The
experimental results demonstrate the improved effect of HM-AN over LSTM with
attention on the vision task. Through visualization of what has been learnt by
the networks, it can be observed that both the attention regions of images and
the hierarchical temporal structure can be captured by HM-AN.
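To make the Gumbel-softmax step mentioned in this abstract concrete, here is a minimal sketch of the standard relaxation in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature values, and the example logits are all hypothetical, and the adaptive temperature training described in the abstract is not reproduced.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample from unnormalized log-probabilities.

    Hypothetical helper for illustration; `temperature` is a fixed scalar here.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    perturbed = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))
        perturbed.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(perturbed)
    exps = [math.exp(p - m) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def hard_sample(soft):
    """Discretize a soft sample to a one-hot vector at its argmax.

    This is only the forward pass of a straight-through style estimator;
    gradient handling needs an autodiff framework and is omitted.
    """
    k = max(range(len(soft)), key=soft.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(soft))]


rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]  # assumed example values
soft = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
hard = hard_sample(soft)
print(soft, hard)
```

Low temperatures push the soft sample toward a one-hot vector, while high temperatures flatten it toward uniform, which is one way to see why the choice of temperature matters for training.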
Facial Action Unit Detection Using Attention and Relation Learning
The attention mechanism has recently attracted increasing attention in the field
of facial action unit (AU) detection. By finding the region of interest of each
AU with the attention mechanism, AU-related local features can be captured.
Most of the existing attention based AU detection works use prior knowledge to
predefine fixed attentions or refine the predefined attentions within a small
range, which limits their capacity to model various AUs. In this paper, we
propose an end-to-end deep learning based attention and relation learning
framework for AU detection with only AU labels, which has not been explored
before. In particular, multi-scale features shared by each AU are learned
first, and then both channel-wise and spatial attentions are adaptively
learned to select and extract AU-related local features. Moreover, pixel-level
relations for AUs are further captured to refine spatial attentions so as to
extract more relevant local features. Without changing the network
architecture, our framework can be easily extended for AU intensity estimation.
Extensive experiments show that our framework (i) soundly outperforms the
state-of-the-art methods for both AU detection and AU intensity estimation on
the challenging BP4D, DISFA, FERA 2015 and BP4D+ benchmarks, (ii) can
adaptively capture the correlated regions of each AU, and (iii) also works well
under severe occlusions and large poses.
Comment: This paper is accepted by IEEE Transactions on Affective Computing
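As a rough illustration of what channel-wise and spatial attention do to a feature map, the sketch below gates a small C x H x W tensor (nested lists) first per channel and then per location. This is a generic, simplified construction and not the framework described in the abstract: the gating rule (a sigmoid of a pooled mean), the `channel_scale` weights, and all function names are assumptions made for the example, and no learning is involved.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(features, channel_scale):
    """Scale each channel by a gate computed from its global average.

    features: nested list shaped [C][H][W].
    channel_scale: one hypothetical weight per channel, standing in for
    the learned layers a real model would use.
    """
    out = []
    for fmap, scale in zip(features, channel_scale):
        count = sum(len(row) for row in fmap)
        mean = sum(sum(row) for row in fmap) / count
        gate = sigmoid(scale * mean)  # one scalar gate per channel
        out.append([[gate * v for v in row] for row in fmap])
    return out


def spatial_attention(features):
    """Weight every location by a sigmoid of its cross-channel mean.

    Returns the reweighted features and the H x W attention map.
    """
    c, h, w = len(features), len(features[0]), len(features[0][0])
    attn = [
        [sigmoid(sum(features[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]
    out = [
        [[attn[i][j] * features[k][i][j] for j in range(w)] for i in range(h)]
        for k in range(c)
    ]
    return out, attn


# Toy input: channel 0 is silent, channel 1 is uniformly active.
features = [
    [[0, 0], [0, 0]],
    [[2, 2], [2, 2]],
]
gated = channel_attention(features, channel_scale=[1.0, 1.0])
refined, attn = spatial_attention(gated)
print(attn)
```

In a trained network the gates would come from learned layers rather than a fixed pooled mean; the sketch only shows the order of operations and how the shapes are preserved.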
Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment
Facial action unit (AU) detection and face alignment are two highly
correlated tasks since facial landmarks can provide precise AU locations to
facilitate the extraction of meaningful local features for AU detection. Most
existing AU detection works treat face alignment as a preprocessing step and
handle the two tasks independently. In this paper, we propose a novel
end-to-end deep learning framework for joint AU detection and face alignment,
which has not been explored before. In particular, multi-scale shared features
are learned first, and high-level features of face alignment are fed into AU
detection. Moreover, to extract precise local features, we propose an adaptive
attention learning module to refine the attention map of each AU adaptively.
Finally, the assembled local features are integrated with face alignment
features and global features for AU detection. Experiments on BP4D and DISFA
benchmarks demonstrate that our framework significantly outperforms the
state-of-the-art methods for AU detection.
Comment: This paper has been accepted by ECCV 201