42,991 research outputs found
Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
In this paper, we explore the encoding/pooling layer and loss function in the
end-to-end speaker and language recognition system. First, a unified and
interpretable end-to-end system for both speaker and language recognition is
developed. It accepts variable-length input and produces an utterance level
result. In the end-to-end system, the encoding layer plays a role in
aggregating the variable-length input sequence into an utterance level
representation. Besides the basic temporal average pooling, we introduce a
self-attentive pooling layer and a learnable dictionary encoding layer to get
the utterance level representation. In terms of loss function for open-set
speaker verification, to get more discriminative speaker embedding, center loss
and angular softmax loss is introduced in the end-to-end system. Experimental
results on Voxceleb and NIST LRE 07 datasets show that the performance of
end-to-end learning system could be significantly improved by the proposed
encoding layer and loss function.Comment: Accepted for Speaker Odyssey 201
Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition
Spatio-temporal feature encoding is essential for encoding the dynamics in
video sequences. Recurrent neural networks, particularly long short-term memory
(LSTM) units, have been popular as an efficient tool for encoding
spatio-temporal features in sequences. In this work, we investigate the effect
of mode variations on the encoded spatio-temporal features using LSTMs. We show
that the LSTM retains information related to the mode variation in the
sequence, which is irrelevant to the task at hand (e.g. classification facial
expressions). Actually, the LSTM forget mechanism is not robust enough to mode
variations and preserves information that could negatively affect the encoded
spatio-temporal features. We propose the mode variational LSTM to encode
spatio-temporal features robust to unseen modes of variation. The mode
variational LSTM modifies the original LSTM structure by adding an additional
cell state that focuses on encoding the mode variation in the input sequence.
To efficiently regulate what features should be stored in the additional cell
state, additional gating functionality is also introduced. The effectiveness of
the proposed mode variational LSTM is verified using the facial expression
recognition task. Comparative experiments on publicly available datasets
verified that the proposed mode variational LSTM outperforms existing methods.
Moreover, a new dynamic facial expression dataset with different modes of
variation, including various modes like pose and illumination variations, was
collected to comprehensively evaluate the proposed mode variational LSTM.
Experimental results verified that the proposed mode variational LSTM encodes
spatio-temporal features robust to unseen modes of variation.Comment: Accepted in AAAI-1
- …