Incorporating Intra-Class Variance to Fine-Grained Visual Recognition
Fine-grained visual recognition aims to capture discriminative
characteristics among visually similar categories. State-of-the-art work has
significantly improved fine-grained recognition performance through deep
metric learning with triplet networks. However, the impact of intra-category
variance on recognition performance and robust feature representation has
not been well studied. In this paper, we propose to leverage intra-class
variance in the metric learning of a triplet network to improve fine-grained
recognition. By partitioning the training images within each category into a
few groups, we form triplet samples across different categories as well as
across different groups, a strategy we call Group Sensitive TRiplet Sampling
(GS-TRS). The triplet loss function is then strengthened by incorporating
intra-class variance via GS-TRS, which may contribute to the optimization
objective of the triplet network. Extensive experiments on the benchmark
datasets CompCars and VehicleID show that the proposed GS-TRS significantly
outperforms state-of-the-art approaches in both classification and retrieval
tasks.

Comment: 6 pages, 5 figures
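As a rough illustration of the group-sensitive idea, the PyTorch sketch
below adds an intra-class (group) term to a standard triplet loss. It is a
minimal sketch under assumed conventions (group assignments precomputed, e.g.
by k-means over features within each class); the function name and margin
values are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def gs_trs_loss(anchor, pos_same_group, pos_other_group, negative,
                margin_inter=1.0, margin_intra=0.5):
    """Triplet-style loss with an extra intra-class (group) term.

    All inputs are (B, D) batches of embeddings (hypothetical layout):
      anchor          -- reference images
      pos_same_group  -- same class, same intra-class group as the anchor
      pos_other_group -- same class, different group
      negative        -- different class
    """
    d_pos_group = F.pairwise_distance(anchor, pos_same_group)
    d_pos_class = F.pairwise_distance(anchor, pos_other_group)
    d_neg = F.pairwise_distance(anchor, negative)

    # Inter-class term: any same-class positive must beat the negative.
    inter = F.relu(d_pos_class - d_neg + margin_inter)
    # Intra-class term: same-group positives sit closer than other groups,
    # so each class keeps its internal group structure instead of
    # collapsing to a single point.
    intra = F.relu(d_pos_group - d_pos_class + margin_intra)
    return (inter + intra).mean()
```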
Fine-grained Categorization and Dataset Bootstrapping using Deep Metric Learning with Humans in the Loop
Existing fine-grained visual categorization methods often suffer from three
challenges: lack of training data, a large number of fine-grained categories,
and high intra-class vs. low inter-class variance. In this work, we propose a generic
iterative framework for fine-grained categorization and dataset bootstrapping
that handles these three challenges. Using deep metric learning with humans in
the loop, we learn a low-dimensional feature embedding with anchor points on
manifolds for each category. These anchor points capture intra-class variances
and remain discriminative between classes. In each round, images with high
confidence scores from our model are sent to humans for labeling. By comparing
with exemplar images, labelers mark each candidate image as either a "true
positive" or a "false positive". True positives are added into our current
dataset and false positives are regarded as "hard negatives" for our metric
learning model. Then the model is retrained with an expanded dataset and hard
negatives for the next round. To demonstrate the effectiveness of the proposed
framework, we bootstrap a fine-grained flower dataset with 620 categories from
Instagram images. The proposed deep metric learning scheme is evaluated on both
our dataset and the CUB-200-2011 Birds dataset. Experimental evaluations show
significant performance gain using dataset bootstrapping and demonstrate
state-of-the-art results achieved by the proposed deep metric learning methods.

Comment: 10 pages, 9 figures, CVPR 2016
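The per-round loop described above reads naturally as code. The Python below
is a hypothetical outline with stand-in helpers (`ask_labelers`,
`train_metric_model`, `model.confidence`) that are not the paper's actual
API:

```python
def bootstrap_round(dataset, hard_negatives, candidates, model, threshold=0.9):
    """One bootstrapping round: mine, verify with humans, retrain.

    All helpers here are hypothetical stand-ins for the paper's components.
    """
    # Score unlabeled candidate images with the current metric-learning model.
    confident = [img for img in candidates if model.confidence(img) > threshold]

    # Labelers compare each candidate against exemplar images of the category
    # and mark it "true positive" or "false positive".
    verdicts = ask_labelers(confident)

    for img, verdict in verdicts.items():
        if verdict == "true positive":
            dataset.append(img)         # expand the fine-grained dataset
        else:
            hard_negatives.append(img)  # visually similar, but wrong class

    # Retrain on the expanded dataset plus mined hard negatives.
    return train_metric_model(dataset, hard_negatives)
```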
Context-aware Captions from Context-agnostic Supervision
We introduce an inference technique to produce discriminative context-aware
image captions (captions that describe differences between images or visual
concepts) using only generic context-agnostic training data (captions that
describe a concept or an image in isolation). For example, given images and
captions of "siamese cat" and "tiger cat", we generate language that describes
the "siamese cat" in a way that distinguishes it from "tiger cat". Our key
novelty is that we show how to do joint inference over a language model that is
context-agnostic and a listener which distinguishes closely-related concepts.
We first apply our technique to a justification task, namely to describe why an
image contains a particular fine-grained category as opposed to another
closely-related category of the CUB-200-2011 dataset. We then study
discriminative image captioning to generate language that uniquely refers to
one of two semantically-similar images in the COCO dataset. Evaluations with
discriminative ground truth for justification and human studies for
discriminative image captioning reveal that our approach outperforms baseline
generative and speaker-listener approaches for discrimination.

Comment: Accepted to CVPR 2017 (Spotlight)
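A minimal sketch of that joint inference, assuming a hypothetical
`log_p_caption(caption, image)` scorer from a context-agnostic captioning
model; the listener term and the trade-off weight `lambda_` are illustrative
choices, not the paper's exact formulation:

```python
def discriminative_score(caption, target_img, distractor_img,
                         log_p_caption, lambda_=0.5):
    """Higher when the caption fits the target image but not the distractor.

    log_p_caption(c, img): log-probability of caption c under a generic,
    context-agnostic captioning model conditioned on img (assumed helper).
    """
    speaker = log_p_caption(caption, target_img)                 # fluency/fit
    listener = speaker - log_p_caption(caption, distractor_img)  # contrast
    return (1 - lambda_) * speaker + lambda_ * listener

def pick_caption(candidates, target_img, distractor_img, log_p_caption):
    # Re-rank candidate captions (e.g., from beam search) discriminatively.
    return max(candidates,
               key=lambda c: discriminative_score(
                   c, target_img, distractor_img, log_p_caption))
```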
Multi-Cue Structure Preserving MRF for Unconstrained Video Segmentation
Video segmentation is a stepping stone to understanding video context: it
represents a video by decomposing it into coherent regions that comprise
whole objects or their parts. The task is challenging because most video
segmentation algorithms must rely on unsupervised learning, owing to the
expensive cost of pixel-wise video annotation and the intra-class variability
within similar unconstrained video classes. We propose a Markov Random Field
(MRF) model for unconstrained video segmentation that relies on a tight
integration of multiple cues: vertices are defined from contour-based
superpixels, unary potentials from temporally smooth label likelihoods, and
pairwise potentials from the global structure of a video. This multi-cue
structure is a breakthrough in extracting coherent object regions for
unconstrained videos in the absence of supervision. Our experiments on the
VSB100 dataset show that the proposed model significantly outperforms
competing state-of-the-art algorithms. Qualitative analysis illustrates that
the video segmentation results of the proposed model are consistent with
human perception of objects.
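As an illustration, the energy such a model minimizes has the usual
unary-plus-pairwise form. The sketch below uses a Potts-style pairwise term
as a stand-in for the paper's structure-based potentials; all inputs are
assumed to be precomputed:

```python
import numpy as np

def mrf_energy(labels, unary, edges, affinity):
    """E(x) = sum_i U_i(x_i) + sum_{(i,j)} w_ij * [x_i != x_j]  (Potts model).

    labels:   (N,) integer label per superpixel vertex
    unary:    (N, L) cost of each label at each vertex (e.g., from
              temporally smoothed label likelihoods)
    edges:    iterable of (i, j) vertex pairs from superpixel adjacency
    affinity: dict (i, j) -> w_ij, pairwise weight from video structure
    """
    energy = unary[np.arange(len(labels)), labels].sum()
    for i, j in edges:
        if labels[i] != labels[j]:
            energy += affinity[(i, j)]  # penalize cutting strong affinities
    return energy
```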
Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated into video-level representations by computing statistics on
these features. Typically, zeroth-order (max) or first-order (average)
statistics are used. In this paper, we explore the benefits of using
second-order statistics. Specifically, we propose a novel end-to-end
learnable feature aggregation scheme, dubbed temporal correlation pooling,
that generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than first-order descriptors. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes, which, when combined with hand-crafted features (as is standard
practice), achieve state-of-the-art accuracy.

Comment: Accepted to the International Journal of Computer Vision (IJCV)
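As a rough illustration of second-order pooling, the PyTorch sketch below
builds a correlation matrix between the temporal evolution of CNN feature
channels; the centering and normalization steps are illustrative
assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def temporal_correlation_pool(clip_feats):
    """Second-order pooling of clip-level features.

    clip_feats: (T, D) tensor, one D-dim CNN feature per clip of a video.
    Returns a fixed-length descriptor whose entries measure how similarly
    pairs of feature channels evolve over time (their co-activations).
    """
    x = clip_feats - clip_feats.mean(dim=0, keepdim=True)  # center over time
    x = F.normalize(x, dim=0)             # unit temporal norm per channel
    corr = x.t() @ x                      # (D, D) channel-pair correlations
    # The matrix is symmetric, so keep only the upper triangle.
    iu = torch.triu_indices(corr.size(0), corr.size(1))
    return corr[iu[0], iu[1]]
```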