12,193 research outputs found
Pairwise Confusion for Fine-Grained Visual Classification
Fine-Grained Visual Classification (FGVC) datasets contain small sample
sizes, along with significant intra-class variation and inter-class similarity.
While prior work has addressed intra-class variation using localization and
segmentation techniques, inter-class similarity may also affect feature
learning and reduce classification performance. In this work, we address this
problem using a novel optimization procedure for the end-to-end neural network
training on FGVC tasks. Our procedure, called Pairwise Confusion (PC) reduces
overfitting by intentionally {introducing confusion} in the activations. With
PC regularization, we obtain state-of-the-art performance on six of the most
widely-used FGVC datasets and demonstrate improved localization ability. {PC}
is easy to implement, does not need excessive hyperparameter tuning during
training, and does not add significant overhead during test time.Comment: Camera-Ready version for ECCV 201
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify
temporal segments corresponding to the actions, and learn models that
generalize to unconstrained web videos. We find that web images queried by
action names serve as well-localized highlights for many actions, but are
noisily labeled. To solve this problem, we propose a simple yet effective
method that takes weak video labels and noisy image labels as input, and
generates localized action frames as output. This is achieved by cross-domain
transfer between video frames and web images, using pre-trained deep
convolutional neural networks. We then use the localized action frames to train
action recognition models with long short-term memory networks. We collect a
fine-grained sports action data set FGA-240 of more than 130,000 YouTube
videos. It has 240 fine-grained actions under 85 sports activities. Convincing
results are shown on the FGA-240 data set, as well as the THUMOS 2014
localization data set with untrimmed training videos.Comment: Camera ready version for ACM Multimedia 201
Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition
A key challenge in fine-grained recognition is how to find and represent
discriminative local regions. Recent attention models are capable of learning
discriminative region localizers only from category labels with reinforcement
learning. However, not utilizing any explicit part information, they are not
able to accurately find multiple distinctive regions. In this work, we
introduce an attribute-guided attention localization scheme where the local
region localizers are learned under the guidance of part attribute
descriptions. By designing a novel reward strategy, we are able to learn to
locate regions that are spatially and semantically distinctive with
reinforcement learning algorithm. The attribute labeling requirement of the
scheme is more amenable than the accurate part location annotation required by
traditional part-based fine-grained recognition methods. Experimental results
on the CUB-200-2011 dataset demonstrate the superiority of the proposed scheme
on both fine-grained recognition and attribute recognition
Part Detector Discovery in Deep Convolutional Neural Networks
Current fine-grained classification approaches often rely on a robust
localization of object parts to extract localized feature representations
suitable for discrimination. However, part localization is a challenging task
due to the large variation of appearance and pose. In this paper, we show how
pre-trained convolutional neural networks can be used for robust and efficient
object part discovery and localization without the necessity to actually train
the network on the current dataset. Our approach called "part detector
discovery" (PDD) is based on analyzing the gradient maps of the network outputs
and finding activation centers spatially related to annotated semantic parts or
bounding boxes.
This allows us not just to obtain excellent performance on the CUB200-2011
dataset, but in contrast to previous approaches also to perform detection and
bird classification jointly without requiring a given bounding box annotation
during testing and ground-truth parts during training. The code is available at
http://www.inf-cv.uni-jena.de/part_discovery and
https://github.com/cvjena/PartDetectorDisovery.Comment: Accepted for publication on Asian Conference on Computer Vision
(ACCV) 201
- …