116 research outputs found
Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition
Spatio-temporal feature encoding is essential for capturing the dynamics in
video sequences. Recurrent neural networks, particularly long short-term memory
(LSTM) units, have been popular as an efficient tool for encoding
spatio-temporal features in sequences. In this work, we investigate the effect
of mode variations on the encoded spatio-temporal features using LSTMs. We show
that the LSTM retains information related to the mode variation in the
sequence, which is irrelevant to the task at hand (e.g., classifying facial
expressions). In particular, the LSTM forget mechanism is not robust enough to mode
variations and preserves information that can negatively affect the encoded
spatio-temporal features. We propose the mode variational LSTM to encode
spatio-temporal features robust to unseen modes of variation. The mode
variational LSTM modifies the original LSTM structure by adding an additional
cell state that focuses on encoding the mode variation in the input sequence.
To efficiently regulate what features should be stored in the additional cell
state, additional gating functionality is also introduced. The effectiveness of
the proposed mode variational LSTM is verified using the facial expression
recognition task. Comparative experiments on publicly available datasets
verified that the proposed mode variational LSTM outperforms existing methods.
Moreover, a new dynamic facial expression dataset with different modes of
variation, such as pose and illumination variations, was collected to
comprehensively evaluate the proposed mode variational LSTM.
Experimental results verified that the proposed mode variational LSTM encodes
spatio-temporal features robust to unseen modes of variation.
Comment: Accepted in AAAI-1
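The abstract does not give the cell equations, but the idea of an extra cell state with its own gate can be illustrated with a small sketch. The following is a minimal, assumption-laden PyTorch sketch: the class name, the way the extra gate splits content between the task and mode cell states, and the output path are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of an LSTM cell with an additional "mode" cell state and an
# extra gate, assuming (not reproducing) the design described in the abstract.
import torch
import torch.nn as nn

class ModeVariationalLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # 4 standard LSTM gates + 1 extra gate for the mode cell state
        self.linear = nn.Linear(input_size + hidden_size, 5 * hidden_size)

    def forward(self, x, state):
        h, c_task, c_mode = state
        z = self.linear(torch.cat([x, h], dim=-1))
        i, f, g, o, m = z.chunk(5, dim=-1)
        i, f, o, m = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.sigmoid(m)
        g = torch.tanh(g)
        # Hypothetical split: the extra gate m routes mode-related content into
        # c_mode, keeping c_task focused on task-relevant dynamics.
        c_mode = f * c_mode + m * g
        c_task = f * c_task + i * (g - m * g)
        h = o * torch.tanh(c_task)
        return h, (h, c_task, c_mode)

# Usage: h, state = cell(x_t, (h0, c0_task, c0_mode)) for each frame feature x_t.
```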
Robust Proxy: Improving Adversarial Robustness by Robust Proxy Learning
It is now widely known that deep neural networks are highly vulnerable to
adversarial attacks. To mitigate this vulnerability, many defense algorithms
have been proposed. Recently, to improve adversarial robustness, many works
have tried to enhance feature representations by imposing more direct
supervision on discriminative features. However, existing approaches lack an
understanding of how adversarially robust feature representations are learned.
In this paper, we propose a novel
training framework called Robust Proxy Learning. In the proposed method, the
model explicitly learns robust feature representations with robust proxies. To
this end, we first demonstrate that we can generate class-representative
robust features by adding class-wise robust perturbations. Then, we use the
class representative features as robust proxies. With the class-wise robust
features, the model explicitly learns adversarially robust features through the
proposed robust proxy learning framework. Through extensive experiments, we
verify that we can manually generate robust features and that the proposed
learning framework can increase the robustness of DNNs.
Comment: Accepted at IEEE Transactions on Information Forensics and Security
(TIFS)
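As a rough illustration of the proxy-based supervision described above, here is a minimal sketch of a loss that pulls each instance's feature toward its class's robust proxy. The cosine-similarity/temperature form, and the assumption that robust proxies have already been obtained from class-wise robust perturbations, are illustrative choices, not the paper's exact loss.

```python
# Sketch of a proxy-style loss: attract features to their class's robust proxy,
# repel them from the other proxies. Details are assumptions for illustration.
import torch
import torch.nn.functional as F

def robust_proxy_loss(features, labels, proxies, temperature=0.1):
    """features: (N, D) instance features; proxies: (C, D) one robust proxy per class."""
    features = F.normalize(features, dim=1)
    proxies = F.normalize(proxies, dim=1)
    logits = features @ proxies.t() / temperature   # (N, C) similarity to every proxy
    return F.cross_entropy(logits, labels)          # pull toward own proxy, push from others

# Example (hypothetical names): loss = robust_proxy_loss(feat_extractor(x_adv), y, robust_proxies)
```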
Incorporating Language-Driven Appearance Knowledge Units with Visual Cues in Pedestrian Detection
Large language models (LLMs) have shown their capability in understanding
contextual and semantic information regarding appearance knowledge of
instances. In this paper, we introduce a novel approach that utilizes the
strength of an LLM in understanding contextual appearance variations and
transfers its knowledge to a vision model (here, pedestrian detection). While
pedestrian detection is considered one of the crucial tasks directly related to
safety (e.g., intelligent driving systems), it is challenging because of varying
appearances and poses in diverse scenes. Therefore, we propose to formulate
language-driven appearance knowledge units and incorporate them with visual
cues in pedestrian detection. To this end, we establish a description corpus
that includes numerous narratives describing various appearances of pedestrians
and others. By feeding these descriptions through an LLM, we extract appearance
knowledge sets that contain the representations of appearance variations. After
that, we perform a task-prompting process to obtain appearance knowledge units,
i.e., representative appearance knowledge guided to be relevant to the
downstream pedestrian detection task. Finally, we provide rich appearance
information by integrating the language-driven knowledge units with visual
cues. Through comprehensive experiments with various pedestrian detectors, we
verify the effectiveness of our method, showing noticeable performance gains and
achieving state-of-the-art detection performance.
Comment: 11 pages, 4 figures, 9 tables
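The abstract describes integrating language-driven knowledge units with visual cues but does not specify the fusion operator. Below is a plausible minimal sketch assuming multi-head cross-attention in which visual features query the knowledge units; the module name, residual design, and shapes are hypothetical, and producing the knowledge units from the description corpus via an LLM and task-prompting is not shown.

```python
# Sketch of fusing language-driven appearance knowledge units with visual
# features via cross-attention (an assumed fusion choice, not the paper's).
import torch
import torch.nn as nn

class KnowledgeFusion(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats, knowledge_units):
        # visual_feats: (B, HW, D) flattened feature map; knowledge_units: (B, K, D)
        fused, _ = self.attn(query=visual_feats, key=knowledge_units, value=knowledge_units)
        return self.norm(visual_feats + fused)   # residual fusion fed to the detection head
```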
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
The goal of this work is to reconstruct speech from a silent talking face
video. Recent studies have shown impressive performance on synthesizing speech
from silent talking face videos. However, they have not explicitly considered
the varying identity characteristics of different speakers, which pose a
challenge for video-to-speech synthesis and become even more critical in
unseen-speaker settings. Our approach is to separate the speech content and the
visage style from a given silent talking face video. By guiding the model to
model the two representations independently, we can obtain highly intelligible
speech even when the input video is of an unseen subject. To this end, we
introduce speech-visage feature selection that
separates the speech content and the speaker identity from the visual features
of the input video. The disentangled representations are jointly incorporated
to synthesize speech through a visage-style based synthesizer, which generates
speech by coating it with the visage style while maintaining the speech content.
Thus, the proposed framework brings the advantage of synthesizing speech with
the correct content even from the silent talking face video of an unseen
subject. We validate the effectiveness of the proposed framework on the
GRID, TCD-TIMIT volunteer, and LRW datasets.
Comment: Accepted by ECCV 202
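To make the content/style separation concrete, the following is a minimal sketch of splitting visual features into a speech-content stream and a visage-style vector and recombining them in a style-conditioned decoder. The two projection heads, the feature-wise modulation ("coating"), and the GRU/mel-spectrogram output are assumptions for illustration, not the paper's exact modules.

```python
# Sketch: speech-visage split plus a style-conditioned synthesizer (assumed design).
import torch
import torch.nn as nn

class SpeechVisageSelection(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.content_head = nn.Linear(dim, dim)   # speaker-independent speech content
        self.style_head = nn.Linear(dim, dim)     # speaker-dependent visage style

    def forward(self, visual_feats):              # (B, T, D) per-frame visual features
        return self.content_head(visual_feats), self.style_head(visual_feats).mean(dim=1)

class StyleConditionedSynthesizer(nn.Module):
    def __init__(self, dim, mel_bins=80):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, mel_bins)

    def forward(self, content, style):
        # "Coat" the content with the visage style via feature-wise modulation.
        x = content * (1 + self.to_scale(style)).unsqueeze(1) + self.to_shift(style).unsqueeze(1)
        x, _ = self.decoder(x)
        return self.out(x)                        # (B, T, mel_bins) mel-spectrogram frames
```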