Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
We tackle the image question answering (ImageQA) problem by learning a
convolutional neural network (CNN) with a dynamic parameter layer whose weights
are determined adaptively based on questions. For the adaptive parameter
prediction, we employ a separate parameter prediction network, which consists
of a gated recurrent unit (GRU) taking a question as its input and a
fully-connected layer generating a set of candidate weights as its output.
However, it is challenging to construct a parameter prediction network for a
large number of parameters in the fully-connected dynamic parameter layer of
the CNN. We reduce the complexity of this problem by incorporating a hashing
technique, where the candidate weights given by the parameter prediction
network are selected using a predefined hash function to determine individual
weights in the dynamic parameter layer. The proposed network (a joint network
combining the CNN for ImageQA and the parameter prediction network) is trained
end-to-end through back-propagation, where its weights are initialized using a
pre-trained CNN and GRU. The proposed algorithm achieves
state-of-the-art performance on all available public ImageQA benchmarks.
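
The hashing step described in this abstract can be illustrated with a minimal sketch. It assumes a small list of candidate weights (standing in for the output of the parameter prediction network) and fills a larger weight matrix by hashing each (row, column) position to one candidate. The hash function, the helper names, and the sizes are hypothetical choices for illustration; the abstract does not specify them.

```python
def hash_index(row, col, num_candidates):
    # Hypothetical predefined hash: any fixed, deterministic mapping from a
    # weight position to a candidate index would serve the same purpose.
    return ((row * 73856093) ^ (col * 19349663)) % num_candidates


def build_dynamic_weights(candidates, out_dim, in_dim):
    # Fill an out_dim x in_dim matrix by looking up one shared candidate
    # weight per position, so many positions reuse the same candidate.
    k = len(candidates)
    return [
        [candidates[hash_index(r, c, k)] for c in range(in_dim)]
        for r in range(out_dim)
    ]


def dynamic_layer(features, candidates, out_dim):
    # Fully-connected layer whose weights come from the hashed candidates.
    weights = build_dynamic_weights(candidates, out_dim, len(features))
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# Four candidate weights (assumed to depend on the question) expand
# into a 3 x 5 weight matrix applied to a 5-dimensional image feature.
question_candidates = [0.1, -0.2, 0.3, 0.5]
image_feature = [1.0, 0.5, -1.0, 2.0, 0.0]
answer_scores = dynamic_layer(image_feature, question_candidates, out_dim=3)
```

In this sketch the number of values the prediction network must produce is `len(candidates)` rather than `out_dim * in_dim`, which is the kind of complexity reduction the abstract attributes to hashing.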
Randomized Adversarial Style Perturbations for Domain Generalization
We propose a novel domain generalization technique, referred to as Randomized
Adversarial Style Perturbation (RASP), which is motivated by the observation
that the characteristics of each domain are captured by the feature statistics
corresponding to style. The proposed algorithm perturbs the style of a feature
in an adversarial direction towards a randomly selected class, and makes the
model learn against being misled by the unexpected styles observed in unseen
target domains. While RASP is effective at handling domain shifts, its naive
integration into the training procedure might degrade the capability of
learning knowledge from source domains because it has no restriction on the
perturbations of representations. This challenge is alleviated by Normalized
Feature Mixup (NFM), which facilitates the learning of the original features
while achieving robustness to perturbed representations via their mixup during
training. We evaluate the proposed algorithm via extensive experiments on
various benchmarks and show that our approach improves domain generalization
performance, especially on large-scale benchmarks.
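
As a rough illustration of perturbing style toward a randomly selected class, the toy sketch below treats the per-channel mean and standard deviation of a feature as its style, picks a random wrong class, and takes one gradient step on those statistics to raise that class's probability. The linear classifier over style statistics, the step size, and all function names are assumptions made for this example; the abstract does not give the actual formulation, and Normalized Feature Mixup is not covered here.

```python
import math
import random


def channel_stats(feature):
    # Per-channel mean and standard deviation, used here as the "style".
    means = [sum(ch) / len(ch) for ch in feature]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in ch) / len(ch) + 1e-6)
        for ch, m in zip(feature, means)
    ]
    return means, stds


def class_probs(style, classifier):
    # Toy linear classifier on the style vector (assumption), then softmax.
    logits = [sum(w * s for w, s in zip(row, style)) for row in classifier]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perturb_style(feature, classifier, true_label, step=0.1, rng=random):
    means, stds = channel_stats(feature)
    style = means + stds
    num_classes = len(classifier)
    # Randomly select a class other than the true one as the target.
    target = rng.choice([k for k in range(num_classes) if k != true_label])
    probs = class_probs(style, classifier)
    # Gradient of the cross-entropy toward `target` w.r.t. the style vector.
    grad = [
        sum((probs[k] - (1.0 if k == target else 0.0)) * classifier[k][j]
            for k in range(num_classes))
        for j in range(len(style))
    ]
    new_style = [s - step * g for s, g in zip(style, grad)]
    c = len(means)
    new_means = new_style[:c]
    new_stds = [max(s, 1e-3) for s in new_style[c:]]
    # Keep the normalized content and apply the perturbed statistics.
    perturbed = [
        [(v - m) / s * ns + nm for v in ch]
        for ch, m, s, nm, ns in zip(feature, means, stds, new_means, new_stds)
    ]
    return perturbed, target
```

A model trained on such perturbed features alongside the true label would, in this toy setting, see styles that push it toward the wrong class, which is the effect the abstract describes as learning against being misled by unexpected styles.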
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm on
untrimmed videos using convolutional neural networks. Our algorithm learns from
video-level class labels and predicts temporal intervals of human actions with
no requirement of temporal localization annotations. We design our network to
identify a sparse subset of key segments associated with target actions in a
video using an attention module and fuse the key segments through adaptive
temporal pooling. Our loss function consists of two terms that minimize the
video-level action classification error and enforce the sparsity of the segment
selection. At inference time, we extract and score temporal proposals using
temporal class activations and class-agnostic attentions to estimate the time
intervals that correspond to target actions. The proposed algorithm attains
state-of-the-art results on the THUMOS14 dataset and outstanding performance on
ActivityNet1.3 even with its weak supervision.
Comment: Accepted to CVPR 201
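
The attention-weighted pooling and the two-term loss described in this abstract can be sketched as follows. The sketch assumes sigmoid attention weights per segment, a weighted sum as the pooling operation, a multi-label binary cross-entropy for the video-level classification term, and an L1 penalty on the attention weights for the sparsity term. These specifics, the function names, and the `sparsity_weight` value are illustrative assumptions, not details stated in the abstract.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def attention_pool(segment_features, attention_logits):
    # One attention weight in (0, 1) per temporal segment.
    attention = [sigmoid(a) for a in attention_logits]
    dim = len(segment_features[0])
    # Video-level feature: attention-weighted sum over segments.
    pooled = [
        sum(w * seg[d] for w, seg in zip(attention, segment_features))
        for d in range(dim)
    ]
    return pooled, attention


def total_loss(class_scores, video_labels, attention, sparsity_weight=0.1):
    # Classification term: binary cross-entropy on video-level labels.
    classification = -sum(
        y * math.log(sigmoid(s)) + (1 - y) * math.log(1.0 - sigmoid(s))
        for s, y in zip(class_scores, video_labels)
    ) / len(video_labels)
    # Sparsity term: L1 norm of the attention weights (assumed form).
    sparsity = sum(abs(a) for a in attention)
    return classification + sparsity_weight * sparsity


# Three segments with 2-dimensional features; the middle segment
# receives a strongly negative attention logit and is nearly ignored.
segments = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
pooled, attention = attention_pool(segments, [10.0, -10.0, 10.0])
loss = total_loss([2.0, -2.0], [1, 0], attention)
```

With the same classification scores, selecting fewer segments gives a lower total loss in this sketch, which is how the sparsity term encourages a small set of key segments.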