Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition
Attention-based learning for fine-grained image recognition remains a challenging task, as most existing methods treat each object part in isolation and neglect the correlations among them. In addition, the multi-stage or multi-scale mechanisms they involve make these methods inefficient and hard to train end-to-end. In this paper, we propose a novel attention-based convolutional neural network (CNN) that regulates multiple object parts among different input images. Our method first learns multiple attention region features of each input image through the one-squeeze multi-excitation (OSME) module, and then applies the multi-attention multi-class constraint (MAMC) in a metric learning framework. For each anchor feature, MAMC pulls same-attention same-class features closer while pushing different-attention or different-class features away. Our method is easily trained end-to-end and is highly efficient, requiring only a single training stage. Moreover, we introduce Dogs-in-the-Wild, a comprehensive dog species dataset that surpasses similar existing datasets in category coverage, data volume, and annotation quality. The dataset will be released upon acceptance to facilitate research on fine-grained image recognition. Extensive experiments show the substantial improvements of our method on four benchmark datasets.
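
To make the pull/push behavior concrete, here is a minimal PyTorch sketch of an MAMC-style constraint over attention-region features. The tensor shapes, margin value, and the mean-over-pairs hinge formulation are illustrative assumptions; the paper defines the exact sampling scheme and loss form.

```python
# Minimal sketch of an MAMC-style pull/push constraint (not the paper's exact
# loss). feats holds one feature per (image, attention region) pair.
import torch
import torch.nn.functional as F

def mamc_style_loss(feats, labels, margin=0.5):
    """feats: (B, P, D) attention-region features; labels: (B,) class ids."""
    B, P, D = feats.shape
    f = F.normalize(feats.reshape(B * P, D), dim=1)
    sim = f @ f.t()                                    # cosine similarities
    part = torch.arange(P, device=feats.device).repeat(B)   # part id per row
    cls = labels.repeat_interleave(P)                        # class id per row
    same_part = part.unsqueeze(0) == part.unsqueeze(1)
    same_cls = cls.unsqueeze(0) == cls.unsqueeze(1)
    eye = torch.eye(B * P, dtype=torch.bool, device=feats.device)
    pos = same_part & same_cls & ~eye                  # same attention, same class
    neg = ~(same_part & same_cls) & ~eye               # different attention or class
    pos_sim = (sim * pos).sum(1) / pos.sum(1).clamp(min=1)
    neg_sim = (sim * neg).sum(1) / neg.sum(1).clamp(min=1)
    # hinge: anchors should be closer to positives than to negatives by `margin`
    return F.relu(margin - pos_sim + neg_sim).mean()
```

In the full method these features would come from the OSME module's excitation branches rather than being given directly.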
HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis
Pedestrian analysis plays a vital role in intelligent video surveillance and is a key component of security-centric computer vision systems. Although convolutional neural networks are remarkable at learning discriminative features from images, learning comprehensive features of pedestrians for fine-grained tasks remains an open problem. In this study, we propose a new attention-based deep neural network, named HydraPlus-Net (HP-net), that multi-directionally feeds multi-level attention maps to different feature layers. The attentive deep features learned by the proposed HP-net bring unique advantages: (1) the model captures multiple attentions from the low level to the semantic level, and (2) it explores the multi-scale selectiveness of attentive features to enrich the final feature representation of a pedestrian image. We demonstrate the effectiveness and generality of the proposed HP-net for pedestrian analysis on two tasks, i.e., pedestrian attribute recognition and person re-identification. Extensive experimental results show that HP-net outperforms state-of-the-art methods on various datasets.
Comment: Accepted by ICCV 2017
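
As a rough illustration of feeding attention maps to multiple feature levels, the sketch below gates several feature layers with per-level attention maps and concatenates the pooled results. The channel sizes and the sigmoid-gate-plus-concat fusion rule are assumptions for illustration, not HP-net's exact design.

```python
# Hypothetical sketch of one attention map modulating several feature levels,
# in the spirit of multi-directional attentive feature fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAttention(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # one attention head per level, each producing a single-channel map
        self.att = nn.ModuleList(nn.Conv2d(c, 1, kernel_size=1) for c in channels)

    def forward(self, feats):
        """feats: list of feature maps, one per level (coarser maps are smaller)."""
        fused = []
        for i, head in enumerate(self.att):
            m = torch.sigmoid(head(feats[i]))        # attention map at level i
            for f in feats:
                # resize the map to every level and gate that level's features
                mj = F.interpolate(m, size=f.shape[-2:], mode='bilinear',
                                   align_corners=False)
                fused.append(F.adaptive_avg_pool2d(f * mj, 1).flatten(1))
        return torch.cat(fused, dim=1)               # concatenated attentive features
```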
Fast Fine-grained Image Classification via Weakly Supervised Discriminative Localization
Fine-grained image classification aims to recognize hundreds of subcategories in each basic-level category. Existing methods employ discriminative localization to find the key distinctions among subcategories. However, they generally have two limitations: (1) discriminative localization relies on region proposal methods to hypothesize the locations of discriminative regions, which is time-consuming; (2) training the discriminative localization depends on object or part annotations, which are heavily labor-intensive. Addressing both limitations simultaneously is highly challenging, and existing methods focus on only one of them. We therefore propose a weakly supervised discriminative localization approach (WSDL) for fast fine-grained image classification that addresses the two limitations at the same time. Its main advantages are: (1) An n-pathway end-to-end discriminative localization network is designed to improve classification speed; it simultaneously localizes multiple different discriminative regions of an image to boost classification accuracy, and shares the full-image convolutional features generated by the region proposal network to accelerate region-proposal generation and reduce the cost of the convolutional computation. (2) Multi-level attention-guided localization learning is proposed to localize discriminative regions with different focuses automatically, without using object or part annotations, avoiding their labor-intensive collection. Attentions at different levels focus on different, complementary characteristics of the image and boost classification accuracy. Both are jointly employed to simultaneously improve classification speed and eliminate dependence on object and part annotations. Compared with state-of-the-art methods on two widely-used fine-grained image classification datasets, our WSDL approach achieves the best performance.
Comment: 13 pages, submitted to IEEE Transactions on Circuits and Systems for Video Technology. arXiv admin note: text overlap with arXiv:1709.0829
Fully Convolutional Attention Networks for Fine-Grained Recognition
Fine-grained recognition is challenging due to subtle local inter-class differences combined with large intra-class variations such as pose. A key to addressing this problem is to localize discriminative parts and extract pose-invariant features from them. However, ground-truth part annotations can be expensive to acquire; moreover, parts are hard to define for many fine-grained classes. This work introduces Fully Convolutional Attention Networks (FCANs), a reinforcement learning framework that learns to glimpse local discriminative regions adaptively for different fine-grained domains. Compared to previous methods, our approach enjoys three advantages: 1) the weakly supervised reinforcement learning procedure requires no expensive part annotations; 2) the fully convolutional architecture speeds up both training and testing; 3) the greedy reward strategy accelerates convergence during learning. We demonstrate the effectiveness of our method with extensive experiments on four challenging fine-grained benchmark datasets: CUB-200-2011, Stanford Dogs, Stanford Cars, and Food-101.
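
A compact sketch of the underlying idea, under stated assumptions: a fully convolutional score map defines a policy over glimpse locations, and REINFORCE rewards glimpses that lead to correct classification. The single-pixel "glimpse", the passed-in classifier, and the 0/1 reward are illustrative simplifications of the paper's multi-step, multi-scale procedure.

```python
# Toy REINFORCE glimpse policy: a 1x1 conv scores every location of a feature
# map, a location is sampled, and its local feature is classified. Everything
# here is a simplified assumption, not FCANs' exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpsePolicy(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.score = nn.Conv2d(in_ch, 1, kernel_size=1)   # fully convolutional

    def forward(self, feat):
        logits = self.score(feat).flatten(1)              # (B, H*W) location logits
        return torch.distributions.Categorical(logits=logits)

def reinforce_step(policy, classifier, feat, labels):
    B, C, H, W = feat.shape
    dist = policy(feat)
    loc = dist.sample()                                   # sampled glimpse index
    y = torch.div(loc, W, rounding_mode='floor')
    x = loc % W
    local = feat[torch.arange(B, device=feat.device), :, y, x]  # (B, C) local feature
    logits = classifier(local)                            # e.g. nn.Linear(C, classes)
    reward = (logits.argmax(1) == labels).float()         # greedy 0/1 reward
    policy_loss = -(dist.log_prob(loc) * (reward - reward.mean())).mean()
    return policy_loss + F.cross_entropy(logits, labels)
```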
Where to Focus: Deep Attention-based Spatially Recurrent Bilinear Networks for Fine-Grained Visual Recognition
Fine-grained visual recognition typically depends on modeling subtle differences among object parts. However, these parts often exhibit dramatic visual variations such as occlusions, viewpoint changes, and spatial transformations, making them hard to detect. In this paper, we present a novel attention-based model that automatically, selectively, and accurately focuses on critical object regions of high importance, robustly to appearance variations. Given an image, two different Convolutional Neural Networks (CNNs) are constructed, and their outputs are correlated through bilinear pooling to simultaneously focus on discriminative regions and extract relevant features. To capture spatial distributions among the attended local regions, soft-attention-based spatial Long Short-Term Memory units (LSTMs) are incorporated to realize spatially recurrent, visually selective processing of local input patterns. These intuitions yield the following novel model: two-stream CNN layers, a bilinear pooling layer, and a spatial recurrent layer with location attention are jointly trained in an end-to-end fashion to serve as the part detector and feature extractor, whereby relevant features are localized and extracted attentively. We show the significance of our network on two well-known visual recognition tasks: fine-grained image classification and person re-identification.
Comment: 8 pages
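
For reference, a minimal sketch of bilinear pooling of two CNN streams, assuming both streams produce feature maps of the same spatial size; the signed-square-root and L2 normalization steps follow common bilinear-CNN practice rather than this paper specifically.

```python
# Bilinear pooling of two feature maps into a single pooled descriptor.
import torch
import torch.nn.functional as F

def bilinear_pool(fa, fb):
    """fa: (B, Ca, H, W), fb: (B, Cb, H, W) -> (B, Ca*Cb) descriptor."""
    B, Ca, H, W = fa.shape
    Cb = fb.shape[1]
    fa = fa.reshape(B, Ca, H * W)
    fb = fb.reshape(B, Cb, H * W)
    x = torch.bmm(fa, fb.transpose(1, 2)) / (H * W)   # correlate the two streams
    x = x.reshape(B, Ca * Cb)
    x = torch.sign(x) * torch.sqrt(x.abs() + 1e-8)    # signed square root
    return F.normalize(x, dim=1)                      # L2 normalization
```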
Diversified Visual Attention Networks for Fine-Grained Object Classification
Fine-grained object classification is a challenging task due to subtle inter-class differences and large intra-class variation. Recently, visual attention models have been applied to automatically localize the discriminative regions of an image for better capturing critical differences, and have demonstrated promising performance. However, without considering the diversity of the attention process, most existing attention models perform poorly at classifying fine-grained objects. In this paper, we propose a diversified visual attention network (DVAN) for fine-grained object classification, which substantially relieves the dependency on strongly supervised information for learning to localize discriminative regions compared with attentionless models. More importantly, DVAN explicitly pursues diversity of attention and is able to gather discriminative information to the maximal extent. Multiple attention canvases are generated to extract convolutional features for attention, and an LSTM recurrent unit is employed to learn the attentiveness and discriminativeness of the attention canvases. The proposed DVAN can attend to the object from coarse to fine granularity, and a dynamic internal representation for classification is built up by incrementally combining information from different locations and scales of the image. Extensive experiments conducted on the CUB-2011, Stanford Dogs, and Stanford Cars datasets demonstrate that the proposed diversified visual attention network achieves competitive performance compared to state-of-the-art approaches, without using any prior knowledge, user interaction, or external resources in training or testing.
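
A toy sketch of the coarse-to-fine aggregation step: features from several attention canvases are fed as a sequence to an LSTM and the per-step predictions are averaged. The center-crop canvas generation and the averaging rule are placeholder assumptions; DVAN predicts its canvases rather than using fixed crops.

```python
# Toy aggregation of multi-scale attention "canvases" with an LSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanvasAggregator(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, hidden=512):
        super().__init__()
        self.backbone = backbone                 # any CNN returning (B, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)

    def forward(self, img, scales=(1.0, 0.75, 0.5)):
        feats = []
        H, W = img.shape[-2:]
        for s in scales:
            h, w = int(H * s), int(W * s)
            top, left = (H - h) // 2, (W - w) // 2
            canvas = img[..., top:top + h, left:left + w]   # coarse-to-fine crop
            canvas = F.interpolate(canvas, size=(H, W), mode='bilinear',
                                   align_corners=False)
            feats.append(self.backbone(canvas))
        seq = torch.stack(feats, dim=1)          # (B, T, feat_dim) canvas sequence
        out, _ = self.lstm(seq)
        return self.cls(out).mean(dim=1)         # average predictions over steps
```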
Object-Part Attention Model for Fine-grained Image Classification
Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, such as 200 subcategories of birds, and it is highly challenging due to large variance within the same subcategory and small variance among different subcategories. Existing methods generally first locate the objects or parts and then discriminate which subcategory the image belongs to. However, they mainly have two limitations: (1) they rely on object or part annotations, which are heavily labor-intensive; (2) they ignore the spatial relationships between the object and its parts, as well as among the parts themselves, both of which are significantly helpful for finding discriminative parts. Therefore, this paper proposes the object-part attention model (OPAM) for weakly supervised fine-grained image classification. Its main novelties are: (1) The object-part attention model integrates two levels of attention: object-level attention localizes objects in images, and part-level attention selects discriminative parts of the object. Both are jointly employed to learn multi-view and multi-scale features that enhance each other. (2) The object-part spatial constraint model combines two spatial constraints: the object spatial constraint ensures that selected parts are highly representative, and the part spatial constraint eliminates redundancy and enhances the discrimination of selected parts. Both are jointly employed to exploit the subtle and local differences that distinguish the subcategories. Importantly, neither object nor part annotations are used in our approach, which avoids the heavy labor cost of labeling. Compared with more than 10 state-of-the-art methods on four widely-used datasets, our OPAM approach achieves the best performance.
Comment: 14 pages, submitted to IEEE Transactions on Image Processing
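
To illustrate how such spatial constraints can be combined, here is a small self-contained sketch that scores candidate part boxes by how well they fall inside the object box and greedily keeps parts with low mutual overlap. The IoU-style formulation and thresholds are assumptions for illustration, not OPAM's exact constraint model.

```python
# Toy combination of an object spatial constraint (parts should lie inside the
# object box) and a part spatial constraint (selected parts should not overlap
# much). Boxes are (x1, y1, x2, y2); thresholds are illustrative.

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return area((x1, y1, x2, y2))

def iou(a, b):
    return inter(a, b) / max(area(a) + area(b) - inter(a, b), 1e-8)

def select_parts(candidates, scores, obj_box, k=2, max_iou=0.3):
    """Greedily pick k part boxes: high score, inside object, low mutual overlap."""
    # object constraint: weight each candidate by its fraction inside the object
    weighted = [(s * inter(c, obj_box) / max(area(c), 1e-8), c)
                for s, c in zip(scores, candidates)]
    picked = []
    for _, c in sorted(weighted, key=lambda t: -t[0]):
        # part constraint: reject candidates overlapping an already-picked part
        if all(iou(c, p) < max_iou for p in picked):
            picked.append(c)
        if len(picked) == k:
            break
    return picked
```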
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth channels is a fundamental problem in many machine vision applications, such as robot grasping and autonomous driving. In this paper, we address this problem by developing a Cross-Modal Attentional Context (CMAC) learning framework, which enables full exploitation of the context information in both RGB and depth data. Compared to existing RGB-D object detection frameworks, our approach has several appealing properties. First, it consists of an attention-based global context model that exploits adaptive contextual information and incorporates it into a region-based CNN framework (e.g., Fast R-CNN) to achieve improved object detection performance. Second, our CMAC framework further contains a fine-grained object part attention module that harnesses multiple discriminative object parts inside each candidate object region for superior local feature representation. Besides greatly improving the accuracy of RGB-D object detection, the effective cross-modal information fusion and attentional context modeling in our model also provide an interpretable visualization scheme. Experimental results demonstrate that the proposed method significantly improves upon the state of the art on all public benchmarks.
Comment: Accepted as a regular paper to IEEE Transactions on Image Processing
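
As a rough sketch of cross-modal context fusion, the snippet below softly weights global RGB and depth context vectors and appends the fused context to every region feature. The dimensions, the two-way softmax gate, and the concatenation are illustrative assumptions rather than the exact CMAC design.

```python
# Hypothetical cross-modal context gating for region-based detection.
import torch
import torch.nn as nn

class CrossModalContext(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)     # one attention weight per modality

    def forward(self, rgb_ctx, depth_ctx, region_feats):
        """rgb_ctx/depth_ctx: (B, dim) global features; region_feats: (B, R, dim)."""
        w = torch.softmax(self.gate(torch.cat([rgb_ctx, depth_ctx], dim=1)), dim=1)
        ctx = w[:, :1] * rgb_ctx + w[:, 1:] * depth_ctx      # fused global context
        ctx = ctx.unsqueeze(1).expand_as(region_feats)
        return torch.cat([region_feats, ctx], dim=2)         # context-augmented regions
```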
Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
Text in natural images carries rich semantics that are often highly relevant to the objects or scene. In this paper, we focus on fully exploiting scene text for visual understanding. The main idea is to combine word representations and deep visual features in a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Then, we combine the word embeddings of the recognized words with the deep visual features into a single representation, which is optimized by a convolutional neural network for fine-grained image classification. In our framework, an attention mechanism is adopted to reveal the relevance between each recognized word and the given image, which further enhances recognition performance. We have performed experiments on two datasets: the Con-Text dataset and the Drink Bottle dataset, proposed for fine-grained classification of business places and drink bottles, respectively. The experimental results consistently demonstrate that the proposed method, combining textual and visual cues, significantly outperforms classification with visual representations alone. Moreover, we show that the learned representation improves retrieval performance on the drink bottle images by a large margin, making it potentially useful in product search.
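
A minimal sketch of the word-image attention step under stated assumptions: each recognized word's embedding is scored against the global visual feature, and the attention-weighted word vector is concatenated with the visual feature. The embedding sizes and bilinear scoring are illustrative, and the mask assumes at least one recognized word per image.

```python
# Illustrative attention over recognized scene-text words given an image.
import torch
import torch.nn as nn

class WordImageAttention(nn.Module):
    def __init__(self, word_dim=300, vis_dim=2048):
        super().__init__()
        self.proj = nn.Linear(word_dim, vis_dim)   # project words into visual space

    def forward(self, word_emb, vis_feat, word_mask):
        """word_emb: (B, N, word_dim); vis_feat: (B, vis_dim); word_mask: (B, N)."""
        keys = self.proj(word_emb)                                   # (B, N, vis_dim)
        scores = torch.bmm(keys, vis_feat.unsqueeze(2)).squeeze(2)   # word relevance
        scores = scores.masked_fill(word_mask == 0, float('-inf'))
        attn = torch.softmax(scores, dim=1)                          # (B, N)
        text = torch.bmm(attn.unsqueeze(1), word_emb).squeeze(1)     # weighted words
        return torch.cat([vis_feat, text], dim=1)                    # joint representation
```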
Fine-grained Visual-textual Representation Learning
Fine-grained visual categorization aims to recognize hundreds of subcategories belonging to the same basic-level category, a highly challenging task due to the quite subtle and local visual distinctions among similar subcategories. Most existing methods learn part detectors to discover discriminative regions for better categorization performance. However, not all parts are beneficial and indispensable for visual categorization, and the choice of the number of part detectors relies heavily on prior knowledge and experimental validation. When we describe the object in an image with text, we mainly focus on its pivotal characteristics and rarely mention common characteristics or the background. This is an involuntary transfer from human visual attention to textual attention, which implies that textual attention tells us how many and which parts are discriminative and significant for categorization. Textual attention can therefore help discover visual attention in an image. Inspired by this, we propose a fine-grained visual-textual representation learning (VTRL) approach, whose main contributions are: (1) Fine-grained visual-textual pattern mining discovers discriminative visual-textual pairwise information to boost categorization performance by jointly modeling vision and text with generative adversarial networks (GANs), which automatically and adaptively discovers discriminative parts. (2) Visual-textual representation learning jointly combines visual and textual information, preserving intra-modality and inter-modality information to generate complementary fine-grained representations and further improving categorization performance.
Comment: 12 pages, accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)