Training Group Orthogonal Neural Networks with Privileged Information
Learning rich and diverse representations is critical for the performance of
deep convolutional neural networks (CNNs). In this paper, we consider how to
use privileged information to promote inherent diversity of a single CNN model
such that the model can learn better representations and offer stronger
generalization ability. To this end, we propose a novel group orthogonal
convolutional neural network (GoCNN) that learns untangled representations
within each layer by exploiting provided privileged information and enhances
representation diversity effectively. We take image classification as an
example where image segmentation annotations are used as privileged information
during the training process. Experiments on two benchmark datasets -- ImageNet
and PASCAL VOC -- clearly demonstrate the strong generalization ability of our
proposed GoCNN model. On the ImageNet dataset, GoCNN improves the performance
of the state-of-the-art ResNet-152 model by an absolute 1.2% while using
privileged information for only 10% of the training images, confirming the
effectiveness of GoCNN in utilizing available privileged knowledge to train better CNNs.
Comment: Proceedings of the IJCAI-1
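Below is a minimal PyTorch sketch, not the authors' exact formulation, of how privileged segmentation masks could push two channel groups of a convolutional layer toward complementary foreground and background features. The module name, the absolute-value suppression penalty, and the `fg_channels` split are illustrative assumptions.

```python
# Hedged sketch: suppress foreground-group activations on background pixels and
# background-group activations on foreground pixels, using privileged masks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupSuppressionLoss(nn.Module):
    def __init__(self, fg_channels: int):
        super().__init__()
        self.fg_channels = fg_channels  # first `fg_channels` channels form the foreground group

    def forward(self, feat: torch.Tensor, seg_mask: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, H, W) conv features; seg_mask: (N, 1, h, w) binary foreground mask
        mask = F.interpolate(seg_mask.float(), size=feat.shape[-2:], mode="nearest")
        fg_feat = feat[:, :self.fg_channels]
        bg_feat = feat[:, self.fg_channels:]
        # foreground group should be silent on background pixels, and vice versa
        loss_fg = (fg_feat.abs() * (1.0 - mask)).mean()
        loss_bg = (bg_feat.abs() * mask).mean()
        return loss_fg + loss_bg

# usage (illustrative): total_loss = ce_loss + lambda_priv * GroupSuppressionLoss(256)(features, masks)
```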
Deep Self-Taught Learning for Weakly Supervised Object Localization
Most existing weakly supervised localization (WSL) approaches learn detectors
by finding positive bounding boxes based on features learned with image-level
supervision. However, those features do not contain spatial location related
information and usually provide poor-quality positive samples for training a
detector. To overcome this issue, we propose a deep self-taught learning
approach that makes the detector learn reliable object-level features for
acquiring tight positive samples and afterwards re-train itself on them.
Consequently, the detector progressively improves its detection ability and
localizes more informative positive samples. To implement such self-taught
learning, we propose a seed sample acquisition method via image-to-object
transferring and dense subgraph discovery to find reliable positive samples for
initializing the detector. An online supportive sample harvesting scheme is
further proposed to dynamically select the most confident tight positive
samples and train the detector in a mutual boosting way. To prevent the
detector from being trapped in poor optima due to overfitting, we propose a new
relative improvement of predicted CNN scores for guiding the self-taught
learning process. Extensive experiments on PASCAL 2007 and 2012 show that our
approach outperforms state-of-the-art methods, strongly validating its
effectiveness.
Comment: Accepted as spotlight paper by CVPR 201
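The abstract's idea of guiding self-taught learning by the relative improvement of predicted CNN scores could look roughly like the sketch below, which keeps a candidate box for retraining only if its detector score grew sufficiently between rounds; the threshold and the exact ratio are assumptions, not the paper's formula.

```python
# Hedged sketch of a relative-improvement selection rule for self-taught retraining.
import numpy as np

def select_by_relative_improvement(prev_scores: np.ndarray,
                                   curr_scores: np.ndarray,
                                   min_rel_gain: float = 0.1) -> np.ndarray:
    """Return indices of candidate boxes whose detector score improved by at
    least `min_rel_gain` relative to the previous self-taught round."""
    eps = 1e-6  # avoid division by zero for near-zero previous scores
    rel_gain = (curr_scores - prev_scores) / (prev_scores + eps)
    return np.where(rel_gain >= min_rel_gain)[0]

# usage (illustrative): keep_idx = select_by_relative_improvement(scores_round_t, scores_round_t_plus_1)
```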
Deep Learning with S-shaped Rectified Linear Activation Units
Rectified linear activation units are important components for
state-of-the-art deep convolutional networks. In this paper, we propose a novel
S-shaped rectified linear activation unit (SReLU) to learn both convex and
non-convex functions, imitating the multiple function forms given by the two
fundamental laws, namely the Weber-Fechner law and Stevens' law, in
psychophysics and neuroscience. Specifically, SReLU consists of three
piecewise linear functions, which are formulated by four learnable parameters.
The SReLU is learned jointly with the training of the whole deep network
through back propagation. During the training phase, to initialize SReLU in
different layers, we propose a "freezing" method to degenerate SReLU into a
predefined leaky rectified linear unit in the initial several training epochs
and then adaptively learn the good initial values. SReLU can be universally
used in the existing deep networks with negligible additional parameters and
computation cost. Experiments with two popular CNN architectures, Network in
Network and GoogLeNet, on benchmarks of various scales, including CIFAR10,
CIFAR100, MNIST, and ImageNet, demonstrate that SReLU achieves remarkable
improvements over other activation functions.
Comment: Accepted by AAAI-1
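A compact PyTorch sketch of an SReLU layer following the piecewise description above (three linear pieces controlled by four learnable per-channel parameters t_l, a_l, t_r, a_r). The initial values are illustrative, and the paper's freezing-based initialization is not reproduced here.

```python
# Hedged sketch of SReLU: identity in the middle, learnable linear pieces on both sides.
import torch
import torch.nn as nn

class SReLU(nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        # one (t_l, a_l, t_r, a_r) quadruple per channel
        self.t_l = nn.Parameter(torch.zeros(num_channels))
        self.a_l = nn.Parameter(torch.full((num_channels,), 0.01))  # small left slope, loosely mirroring a leaky-ReLU start
        self.t_r = nn.Parameter(torch.ones(num_channels))
        self.a_r = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); broadcast per-channel parameters over spatial dims
        t_l = self.t_l.view(1, -1, 1, 1)
        a_l = self.a_l.view(1, -1, 1, 1)
        t_r = self.t_r.view(1, -1, 1, 1)
        a_r = self.a_r.view(1, -1, 1, 1)
        y = torch.where(x > t_r, t_r + a_r * (x - t_r), x)   # right piece above t_r
        y = torch.where(x < t_l, t_l + a_l * (x - t_l), y)   # left piece below t_l
        return y
```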
Exploring Domain Incremental Video Highlights Detection with the LiveFood Benchmark
Video highlights detection (VHD) is an active research field in computer
vision, aiming to locate the most user-appealing clips given raw video inputs.
However, most VHD methods are based on the closed world assumption, i.e., a
fixed number of highlight categories is defined in advance and all training
data are available beforehand. Consequently, existing methods have poor
scalability with respect to increasing highlight domains and training data. To
address the above issues, we propose a novel video highlights detection method
named Global Prototype Encoding (GPE) to learn incrementally for adapting to
new domains via parameterized prototypes. To facilitate this new research
direction, we collect a finely annotated dataset termed LiveFood, including
over 5,100 live gourmet videos that consist of four domains: ingredients,
cooking, presentation, and eating. To the best of our knowledge, this is the
first work to explore video highlights detection in the incremental learning
setting, opening up new ground for applying VHD to practical scenarios where both
the concerned highlight domains and training data increase over time. We
demonstrate the effectiveness of GPE through extensive experiments. Notably,
GPE surpasses popular domain incremental learning methods on LiveFood,
achieving significant mAP improvements on all domains. On the classic
datasets, GPE also yields performance comparable to previous methods. The code is
available at https://github.com/ForeverPs/IncrementalVHD_GPE.
Comment: AAAI 202
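A hedged sketch, not the released GPE implementation, of the general idea of parameterized prototypes for domain-incremental classification: one learnable prototype per highlight domain, with earlier prototypes frozen as new domains arrive. The cosine-similarity scoring and tensor shapes are assumptions.

```python
# Hedged sketch of a prototype bank for domain-incremental highlight classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBank(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.embed_dim = embed_dim
        self.prototypes = nn.ParameterList()  # one learnable prototype per seen domain

    def add_domain(self):
        # freeze prototypes of previously seen domains, then add a fresh one
        for p in self.prototypes:
            p.requires_grad_(False)
        self.prototypes.append(nn.Parameter(torch.randn(self.embed_dim)))

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # clip_embed: (N, D); returns (N, num_domains) cosine-similarity scores
        protos = torch.stack(list(self.prototypes))  # (K, D)
        return F.normalize(clip_embed, dim=-1) @ F.normalize(protos, dim=-1).t()
```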
Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
Beyond the existing single-person and multiple-person human parsing tasks in
static images, this paper makes the first attempt to investigate a more
realistic video instance-level human parsing that simultaneously segments out
each person instance and parses each instance into more fine-grained parts
(e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding
Network (ATEN) that alternately performs temporal encoding among key frames
and flow-guided feature propagation for the consecutive frames between two
key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the
instance-level parsing result for each key frame, which integrates both the
global human parsing and instance-level human segmentation into a unified
model. To balance accuracy and efficiency, flow-guided feature
propagation is used to directly parse consecutive frames according to their
identified temporal consistency with key frames. On the other hand, ATEN
leverages convolutional gated recurrent units (convGRU) to exploit temporal
changes over a series of key frames, which are further used to facilitate the
frame-level instance-level parsing. By alternately performing direct feature
propagation between consistent frames and temporal encoding among key
frames, our ATEN achieves a good balance between frame-level accuracy and time
efficiency, a crucial and common problem in video object segmentation
research. To demonstrate the superiority of our ATEN, extensive experiments are
conducted on the most popular video segmentation benchmark (DAVIS) and a newly
collected Video Instance-level Parsing (VIP) dataset, which is the first video
instance-level human parsing dataset, comprising 404 sequences and over 20k
frames with instance-level and pixel-wise annotations.
Comment: To appear in ACM MM 2018. Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
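The flow-guided feature propagation that ATEN relies on between key frames and nearby frames can be sketched as warping key-frame features with an optical-flow field via bilinear sampling. The flow estimator itself is assumed to exist elsewhere, and the pixel-displacement convention below is an illustrative assumption rather than the paper's exact setup.

```python
# Hedged sketch of flow-guided feature propagation via bilinear warping.
import torch
import torch.nn.functional as F

def warp_features(key_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """key_feat: (N, C, H, W) features of the key frame.
    flow: (N, 2, H, W) per-pixel displacement (in pixels) from the current frame back to the key frame.
    Returns key-frame features propagated to the current frame."""
    n, _, h, w = key_feat.shape
    # base sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h, device=key_feat.device),
                            torch.arange(w, device=key_feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # (N, 2, H, W)
    # normalize to [-1, 1] as expected by grid_sample (x first, then y)
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(key_feat, norm_grid, align_corners=True)
```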